Analysis of AlphaMissense data in different protein groups and structural context

Single amino acid substitutions can profoundly affect protein folding, dynamics, and function. The ability to discern between benign and pathogenic substitutions is pivotal for therapeutic interventions and research directions. Given the limitations in experimental examination of these variants, AlphaMissense has emerged as a promising predictor of the pathogenicity of missense variants. Since heterogeneous performance on different types of proteins can be expected, we assessed the efficacy of AlphaMissense across several protein groups (e.g. soluble, transmembrane, and mitochondrial proteins) and regions (e.g. intramembrane, membrane interacting, and high confidence AlphaFold segments) using ClinVar data for validation. Our comprehensive evaluation showed that AlphaMissense delivers outstanding performance, with MCC scores predominantly between 0.6 and 0.74. We observed low performance on disordered datasets and ClinVar data related to the CFTR ABC protein. However, superior performance was shown when benchmarked against the high quality CFTR2 database. Our results with CFTR emphasize AlphaMissense’s potential in pinpointing functional hot spots, with its performance likely surpassing benchmarks calculated from ClinVar and ProteinGym datasets.


Introduction
In both the medical field and the broader realm of biology, understanding the pathogenicity of mutations holds high significance 1,2. Pathogenic mutations disrupt the normal function of genes, leading to multiple diseases and medical conditions. From the early onset of genetic disorders in infants to the development of complex diseases in adults, the transformative power of a single nucleotide change can be profound. Discerning between benign and pathogenic mutations can influence diagnostic accuracy, guide therapeutic interventions, and inform prognosis 3. Therefore, reliable tools and methodologies to predict and understand mutation impact are essential.
Prior to the advent of more advanced genetic analytical tools, several algorithms emerged as standard bearers in predicting the potential impact of mutations, such as PROVEAN, PolyPhen-2, and SIFT. PROVEAN (Protein Variation Effect Analyzer) offers predictions based on the alignment of homologous protein sequences. Meanwhile, PolyPhen-2 (Polymorphism Phenotyping v2) employs a combination of sequence and structural information to classify variants as benign or probably damaging 4. SIFT (Sorting Intolerant From Tolerant) operates by considering the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences to predict whether an amino acid substitution affects protein function 5. While these tools have undeniably advanced our understanding of mutation pathogenicity, they also underscore the complexity of the task and highlight the need for continuous refinement in the face of rapidly accumulating genomic data. Newer tools for evaluating the pathogenicity of missense mutations have since been created. MVP (Missense Variant Pathogenicity prediction) has gained attention for its sophisticated integration of multiple features related to genetic variation 6. MetaSVM is an ensemble method that merges the outputs of various tools using support vector machines to consolidate pathogenicity prediction 7. M-CAP (Mendelian Clinically Applicable Pathogenicity) stands out for its high specificity in distinguishing disease-associated variants from neutral ones 8. VESPA, the Variant Effect Scoring Prediction Algorithm, is based on embeddings of a protein language model, which captures nuanced relationships between amino acid residues, allowing for a more refined and context-aware prediction of variant impacts 9.
AlphaMissense, a machine learning model developed recently by DeepMind, predicts the pathogenicity of missense variants and stands at the frontier of this field 10. Importantly, it leverages the structural prediction capabilities of AlphaFold 11 to analyze these variants. By merging sophisticated machine learning with structural biology, AlphaMissense may enhance the precision of missense variant pathogenicity insights. Moreover, AlphaMissense aims to tackle the challenge of interpreting the vast number of missense variants in the human genome, many of which have unclear clinical significance. It holds the promise of revolutionizing the understanding and diagnosis of genetic diseases by classifying missense variants as likely benign or likely pathogenic 10.
While the conception of AlphaMissense represents a commendable stride, defined by its intricate design and advanced methodologies, there remain gaps in our understanding of its performance on selected groups of proteins or individual proteins. In particular, a pivotal concern arises from the specificities of its missense mutation predictions and the limited accessibility of its dataset. Although there are initiatives to make the data accessible through R and Python tools 12-16, these require a certain level of computational skill, which significantly restricts the user base. To address these gaps, we assessed AlphaMissense performance on different datasets using ClinVar data.

Results
Performance of AlphaMissense across diverse protein groups in relation to ClinVar data. The performance of AlphaMissense may vary across different protein types, necessitating careful scrutiny when analyzing target proteins. We evaluated AlphaMissense's efficiency across a range of protein groups, choosing single nucleotide variants from ClinVar as our benchmark. While ClinVar is a valuable resource, it has its shortcomings. For instance, it may disproportionately represent genes under intensive study while underrepresenting highly pathogenic mutations, because individuals harboring them might not survive to birth. Additionally, heterozygotes make it challenging to draw conclusions about the effects of mutations. For our analysis, we compared all benign and pathogenic missense mutations rated with at least one star in ClinVar against AlphaMissense predictions for proteins in our datasets. Only genes with corresponding ClinVar entries were considered. Subsequently, we derived precision (positive predictive value, PPV), recall (true positive rate, TPR), F1 score, aucROC, and Matthews correlation coefficient (MCC) (Table 1). In general, the calculated statistical measures were high for all the groups studied. Most importantly, MCC exceeded 0.6 for all but two groups, with low values possibly stemming from sparse input data for MemMoRFs and compromised ClinVar data quality, especially for CFTR. We also determined the frequency of likely benign and pathogenic mutations in ClinVar relative to protein length (Table 1).
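The threshold-based measures named above follow directly from a 2 × 2 contingency table. The following is a minimal sketch using the standard definitions, not the scripts used in this study; the function name and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return PPV, TPR, F1 and MCC from confusion-matrix counts.

    Pathogenic is treated as the positive class. Undefined ratios
    (zero denominators) are reported as 0.0, which is a convention
    assumed here, not something stated in the text.
    """
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"PPV": ppv, "TPR": tpr, "F1": f1, "MCC": mcc}


# Made-up counts, for illustration only (not data from Table 1).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Unlike PPV, TPR, and F1, MCC uses all four cells of the table, which is why it is less sensitive to an imbalance between benign and pathogenic entries.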
Our initial analysis centered on mitochondrial proteins of bacterial origin. Given the unique sequence attributes of these proteins, prediction biases were anticipated. Intriguingly, the pathogenic variation frequency for these proteins was higher than that of the entire human protein ensemble. The important cellular function of these proteins in energy balance might hint at their role as housekeeping genes. Drawing from a specific database (https://housekeeping.unicamp.br) 17, we cross-referenced 1,011 housekeeping genes with 299 mitochondrial genes from our collection, and only a modest overlap of 98 genes was observed. The anticipated elevation in pathogenic mutation frequency was evident in the housekeeping gene dataset.
Mutation frequencies and AlphaMissense efficiency on transmembrane (TM) proteins were also assessed. We segregated residues into TM and non-TM subsets using the Human Transmembrane Proteome database 18. Counterintuitively, AlphaMissense performed better on TM regions (88% correct and 6% failed predictions versus 85% and 8% for soluble regions, respectively; Table 1 and Fig. 1a,b). This is unexpected, since hydrophobicity reduces sequence variance and thus the evolutionary information available from sequence alignments. However, the spatial constraints of transmembrane domains lacking intrinsically disordered regions might boost the AlphaFold-based AlphaMissense predictions 19. Remarkably, pathogenic mutations were more prevalent in TM domains than benign ones (Table 1).
Then we focused on specific membrane protein subsets. While a surge in pathogenic mutations for GPCRs in ClinVar was anticipated, this was not observed. In contrast, ABC proteins manifested elevated pathogenic mutation frequencies in the ClinVar database. Such disparities might result from the disease association of specific protein classes or from research biases. Importantly, the type and quality of data can profoundly impact these types of analyses. For instance, when comparing AlphaMissense's predictions against ClinVar data for the CFTR/ABCC7 protein, benign mutations were infrequent, whereas pathogenic mutations predominated. The MCC for the CFTR ClinVar/AlphaMissense comparison was low (0.478).
Membrane-interacting protein residues were also investigated. One dataset included interfacial binding site (IBS) residues 20 while the other contained membrane molecular recognition features (MemMoRFs; lipid-interacting disordered regions) 21. For IBS residues, pathogenic mutations were approximately twice as frequent as benign ones (0.760 vs. 0.340), likely reflecting the functional significance of these residues. Similar trends were evident for the MemMoRF set, although it is crucial to recognize the limited sample size for this category, which might explain the diminished MCC when comparing ClinVar and AlphaMissense outcomes. Moreover, the intrinsic disorder and low sequence conservation of these regions might also influence AlphaMissense's predictive power on these proteins 10.
Finally, the potential sources of low MCC values were investigated. In the case of CFTR, we tested AlphaMissense predictions against a gold standard CFTR mutation database, CFTR2 (The Clinical and Functional TRanslation of CFTR (CFTR2); available at http://cftr2.org). The CFTR2 database exhibited benign mutation frequencies comparable to other groups but a marked increase in pathogenic mutations. The MCC calculated with this benchmark set (0.725) was among the highest of all protein groups. We assumed that the very low MCC for the MemMoRF group may have been caused by the high prevalence of disordered residues in these proteins. Because of the small size of this dataset, we tested this possibility on soluble proteins by excluding from the calculations those residues that exhibit a pLDDT score lower than 50 in AlphaFold structures, used as a proxy for intrinsically disordered regions 22. A small increase was observed for PPV, TPR, and F1, but not for aucROC and MCC values (SOL-pLDDT50 in Table 1) when compared to all soluble proteins. Therefore, we assumed that the low results for proteins with MemMoRFs may have arisen from the limited capability of AlphaFold to predict their structures, since the MemMoRF-containing protein set involves several single-pass, bitopic transmembrane proteins. We investigated this possibility indirectly using a transmembrane protein set with failed AlphaFold predictions 23, which also yielded very low MCC scores (lowAF in Table 1). Interestingly, excluding residues with a pLDDT score lower than 50 (lowAF-pLDDT50 in Table 1) increased the TPR, F1, and MCC scores. The latter score for this set became 0.573.
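The pLDDT-based exclusion can be pictured as a simple filter applied before the metrics are computed. This is a minimal sketch under stated assumptions: the record layout, the function name, and the choice to drop residues without a pLDDT value are all hypothetical and are not taken from the analysis scripts.

```python
def filter_by_plddt(variants, plddt, cutoff=50.0):
    """Keep only variants located on residues with pLDDT >= cutoff.

    variants: iterable of (uniprot_acc, position, clinvar_label, am_label)
    plddt:    dict mapping (uniprot_acc, position) -> pLDDT score
    Residues with no pLDDT entry are dropped (an assumption of this sketch).
    """
    return [
        v for v in variants
        if plddt.get((v[0], v[1]), 0.0) >= cutoff
    ]


# Made-up records for illustration; "P00000" is a placeholder accession.
variants = [
    ("P00000", 10, "pathogenic", "likely_pathogenic"),
    ("P00000", 11, "benign", "likely_pathogenic"),
    ("P00000", 12, "benign", "likely_benign"),
]
plddt = {("P00000", 10): 91.2, ("P00000", 11): 34.5, ("P00000", 12): 50.0}
confident = filter_by_plddt(variants, plddt)
```

The metrics are then recomputed on the retained records only, so the comparison with the unfiltered set shows how much low-confidence residues contribute to the errors.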
Variability in AlphaMissense predictions across different groups of proteins. The observed differences in True Positive Rate (TPR) and F1 scores implied that the distribution of benign and pathogenic mutations is not uniform across protein groups. To gain a deeper insight and understand AlphaMissense's predictive properties, we investigated the frequency and distribution of its SNV predictions across various protein categories (Table 1). Typically, benign mutations were more frequent, with values between 3 and 3.5, as opposed to pathogenic mutations, which ranged from approximately 1.5 to 2. Given that AlphaMissense predictions cover all possible missense mutations and are not biased by human factors, it is reasonable to deduce that only about 30-35% of the possible human missense mutations are pathogenic. A few of our protein sets deviated from this trend. Housekeeping genes displayed slightly lower benign and higher pathogenic mutation frequencies. Both the IBS dataset and the transmembrane regions of transmembrane proteins demonstrated a large reduction in benign and an increase in pathogenic mutation frequencies. This elevated pathogenic frequency in the latter two datasets likely stems from the inclusion of functionally critical sites, which are more susceptible to mutations.
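The step from per-residue frequencies to the 30-35% estimate is simple arithmetic. The sketch below assumes that a frequency is the number of predicted variants divided by the total number of residues, and that ambiguous predictions are left out of the ratio; both are assumptions of this illustration, and the input values are round numbers taken from the ranges quoted above.

```python
def per_residue_frequency(n_variants, total_length):
    """Number of variants per residue for a protein set (assumed definition)."""
    return n_variants / total_length


def pathogenic_fraction(benign_freq, pathogenic_freq):
    """Share of pathogenic calls among benign plus pathogenic calls."""
    return pathogenic_freq / (benign_freq + pathogenic_freq)


# Round values inside the quoted ranges (benign 3-3.5, pathogenic 1.5-2).
low = pathogenic_fraction(benign_freq=3.5, pathogenic_freq=1.5)
high = pathogenic_fraction(benign_freq=3.25, pathogenic_freq=1.75)
```

With these inputs the fraction falls at 0.30 and 0.35, matching the quoted range.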
We next examined whether the reverse mutations demonstrated similar average AlphaMissense scores. For each variation, we calculated the mean scores and paired them with their reverse counterpart for visualization. We highlighted variation pairs that showed a difference of at least 0.2 in their average scores (Fig. 2a). The pathogenicity labels of three pairs are changed from pathogenic to benign (highlighted by asterisks). The contrasting mean values of the Cys/Ser mutation, categorized as likely-pathogenic, and the Ser/Cys, which is deemed likely-benign, can be rationalized based on amino acid properties and structural implications. Cysteine plays a pivotal structural role, particularly in forming disulfide bridges. In a simplified form, this makes the replacement of Serine with Cysteine more tolerable than the other way around, as Serine cannot replicate Cysteine's capability in forming disulfide bridges. Accordingly, Cys/Ser pathogenic mutation frequency (0.011) is 5.5 times higher than Ser/Cys pathogenic frequency (0.002) in the ClinVar dataset. The asymmetry of the Leu/Pro replacement can be understood as Pro restricts the available conformational space. The greater disruptiveness of the Leu/Ser replacement compared to Ser/Leu can be attributed to the structural importance of the hydrophobic Leucine, which has a high alpha-helix propensity, in contrast to the hydrophilic Serine that often occurs on protein surfaces 24.
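The forward/reverse comparison can be sketched as two steps: average the scores for every ordered (reference, alternative) pair, then report pairs whose two directions differ by at least 0.2. This is an illustrative sketch only; the function names and the input records are hypothetical, and it is not the calc_revfreq.py script.

```python
from collections import defaultdict


def mean_scores(records):
    """Average scores per ordered substitution.

    records: iterable of (ref_aa, alt_aa, score) tuples.
    Returns a dict mapping (ref_aa, alt_aa) -> mean score.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ref, alt, score in records:
        entry = sums[(ref, alt)]
        entry[0] += score
        entry[1] += 1
    return {pair: total / n for pair, (total, n) in sums.items()}


def asymmetric_pairs(means, min_diff=0.2):
    """List (ref, alt, forward_mean, reverse_mean) with |difference| >= min_diff.

    Each unordered pair is reported once, in alphabetical order of the
    one-letter codes.
    """
    found = []
    for (ref, alt), forward in means.items():
        reverse = means.get((alt, ref))
        if reverse is not None and ref < alt and abs(forward - reverse) >= min_diff:
            found.append((ref, alt, forward, reverse))
    return sorted(found)


# Made-up scores for illustration only.
records = [("C", "S", 0.7), ("C", "S", 0.5), ("S", "C", 0.3),
           ("A", "G", 0.40), ("G", "A", 0.35)]
pairs = asymmetric_pairs(mean_scores(records))
```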
We also analyzed how the mean scores of all variations correlated with the symmetric BLOSUM62 matrix, a representation derived from amino acid substitution frequencies based on sequence alignments. BLOSUM62 and mean AlphaMissense scores calculated from all possible amino acid substitutions correlated well (correlation coefficient: −0.678, p = 6.39 × 10^−27, Fig. 2b). Interestingly, numerous average scores for less favorable substitutions fell below the likely-pathogenic threshold set by AlphaMissense. This trend may arise from the higher ratio of variations predicted as likely-benign. Notably, the averages for the Cys/Ser, Pro/Thr, and Met/Thr variations, which have a BLOSUM62 substitution score of −1, lie slightly below 0.34, placing them in the likely-benign category (Fig. 2b).
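The reported coefficient is a correlation between two paired lists: one BLOSUM62 score and one mean AlphaMissense score per substitution. The sketch below computes a Pearson coefficient in plain Python; whether the study used exactly this estimator is an assumption, and the paired values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pairs: substitution-matrix score vs. mean pathogenicity score.
blosum_scores = [-3, -2, -1, 0, 1, 2]
mean_am_scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
r = pearson_r(blosum_scores, mean_am_scores)
```

A negative coefficient is the expected sign here, because favorable substitutions (high BLOSUM62 score) should receive low pathogenicity scores.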
Analyzing functional hotspots using AlphaMissense - CFTR as an example. We assessed the AlphaMissense predictions for the CFTR protein, which has attracted substantial attention within the scientific community, primarily because of its association with cystic fibrosis 25. For our study, we relied on the CFTR2 database (CFTR2_7April2023.xlsx, https://cftr2.org) to annotate mutations. Impressively, out of the 102 pathogenic and 20 benign mutations listed in the CFTR2 database, AlphaMissense mispredicted only four pathogenic (I601F, A613T, I1234V, and V1240G with scores 0.49, 0.39, 0.08, and 0.5637) and four benign (F508C, L997F, T1053I, and R1162L with scores 0.87, 0.74, 0.35, and 0.89) mutations to the opposite or ambiguous category. Performance metrics for AlphaMissense on CFTR against the ClinVar and CFTR2 databases are listed in Table 1, and the corresponding false predictions are shown in Fig. 3a using the AlphaFold-predicted structure (AF-P13569-F1-AM_v4) 26, demonstrating no clustering of false predictions in specific structural areas, such as interfaces or ATP binding sites. The individual AlphaMissense scores for the 122 CFTR2 mutations are visualized in Fig. 3b.
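The phrase "opposite or ambiguous category" can be made concrete by binning the eight quoted scores. In this sketch the lower cutoff of 0.34 is the value mentioned in the text, while the upper cutoff of 0.564 and the handling of scores exactly at a cutoff are assumptions based on the published AlphaMissense classes; the function name is hypothetical.

```python
BENIGN_BELOW = 0.34       # cutoff mentioned in the text
PATHOGENIC_ABOVE = 0.564  # assumed upper cutoff for the ambiguous band


def am_class(score):
    """Map an AlphaMissense score to one of three classes."""
    if score < BENIGN_BELOW:
        return "likely_benign"
    if score > PATHOGENIC_ABOVE:
        return "likely_pathogenic"
    return "ambiguous"


# Scores quoted above for the mispredicted CFTR2 variants.
cftr2_pathogenic = {"I601F": 0.49, "A613T": 0.39, "I1234V": 0.08, "V1240G": 0.5637}
cftr2_benign = {"F508C": 0.87, "L997F": 0.74, "T1053I": 0.35, "R1162L": 0.89}

pathogenic_calls = {name: am_class(s) for name, s in cftr2_pathogenic.items()}
benign_calls = {name: am_class(s) for name, s in cftr2_benign.items()}
```

Under these cutoffs, only I1234V among the pathogenic variants lands in the opposite class; the other three fall in the ambiguous band, with V1240G just below the upper cutoff.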
For spatial representation of these mutations we used the AlphaFold-predicted CFTR structure colored according to the mean AlphaMissense score calculated for SNVs, since multiple nucleotide changes result in more pathogenic amino acid substitutions (Fig. 2b) and mask valuable information (Fig. 4a,b). The ATP binding sites of CFTR, especially, warrant attention. The formation of an ATP binding site is an intricate interplay between one Walker A motif from a Nucleotide Binding Domain (NBD) and a signature motif from the opposite NBD. In comparison to the functional site-2, both the count of CFTR2-sourced mutations and the AlphaMissense scores were lower at site-1 (15 versus 3 and 0.584 versus 0.493, respectively; Fig. 4b,c), which is degenerate and incapable of ATP hydrolysis 27. The difference in the mean AlphaMissense scores decreased (0.725 versus 0.675) when calculated not only from possible SNVs but from all amino acid variations. The structural landscape around the F508 residue provides more insight. The CH4 coupling helix, which interacts with the F508 residue, presents a greater number of both predicted and CFTR2-based mutations in comparison to CH2, which is a structural counterpart of CH4 (Fig. 4d,e). No CFTR2 mutations are present in the other coupling helices. CH1, 2, 3, and 4 mean AlphaMissense scores are 0.336, 0.411, 0.136, and 0.648, respectively (0.478, 0.598, 0.237, and 0.773 when calculated from all possible amino acid variations). Interestingly, CH1 was found to be devoid of CFTR2 mutations, but in vitro experiments in this region revealed that the R170G mutation, which has a likely-benign AlphaMissense label, impairs the domain-domain assembly and would be pathogenic if harbored by an individual 28.
The F508 residue is not only an epicenter for deleterious mutations but has also been extensively researched. While CFTR2 lists no additional pathogenic mutations for this residue, a range of experimental works have substituted the Phe with all nineteen other amino acids to discern the impacts on the functional expression of CFTR 29. All F508 substitutions were predicted as likely pathogenic in the AlphaMissense dataset. However, experimental data suggest that, apart from the F508C variant, the F508V mutation might also be functionally permissive 29, deviating from AlphaMissense's likely-pathogenic prediction. Two other variants, labeled as "unknown" or of "varying significance" in the CFTR2 database, show discrepancies between in vitro experiments and AlphaMissense predictions. Specifically, the F1052V mutation, predicted by AlphaMissense as likely-pathogenic, demonstrates functional expression, with 57% mature protein form and 60% functionality relative to the wild type 30. Conversely, the S912L variant, predicted as benign, appears to be a potential false negative AlphaMissense prediction. This is based on CF phenotypes displayed by individuals with S912L CFTR 31, which may be explained by its substantially reduced function, at 16% of the wild type, despite an expression level nearly on par at 92% relative to the wild type 30. However, earlier research suggests that the S912L variant should be viewed as neutral in isolation, and highlights how complex alleles contribute to the broad phenotypic variability seen in CF 32,33.

Discussion
We embarked on an in-depth analysis of AlphaMissense predictions, ranging from broad protein groups down to the individual CFTR protein. Our objective was to gain insights that would aid the interpretation of predictions for specific target proteins, since heterogeneous performance on different protein groups can be expected. For benchmarking purposes, we turned to ClinVar, given its substantial repository of curated and reviewed entries. Remarkably, AlphaMissense exhibited consistent performance across various protein categories, evidenced by an MCC value exceeding 0.6 (Table 1). While this contradicted the expectation of degraded performance for some protein groups, exceptions arose where either the volume of benchmark data was sparse or the quality of the data was lower. These cases included MemMoRFs and ClinVar's CFTR data, respectively. Our results indicate that AlphaMissense performs well when compared to the CFTR2 database and suggest that AlphaMissense likely performs better than expected based on benchmarks calculated from ClinVar. Our assessment based on CFTR2 is in contrast with the study of McDonald et al. 31; the differences likely arise from our exclusion of entries with unknown consequences and ambiguous AlphaMissense predictions. The discrepancies observed between AlphaMissense predictions and studies on CFTR, such as for the S912L CFTR mutation 30-33, are not unexpected, especially when the mutations in question are part of complex alleles in cystic fibrosis or other diseases. We also emphasize that AlphaFold's pLDDT scores can provide insights into AlphaMissense performance, as the quality of the structures may further indicate the reliability of AlphaMissense predictions (lowAF in Table 1).
Both within ClinVar and the AlphaMissense SNV predictions, benign mutations typically outnumbered their pathogenic counterparts by a factor of approximately two in several protein groups. Intriguing deviations from this trend were noted in groups such as mitochondrial proteins, housekeeping genes, transmembrane regions of membrane proteins, and IBS residues, a pattern that aligns with expectations. The IBS dataset, with its notably high pathogenic frequency, exclusively contains functional positions (Table 1). The pathogenicity of CFTR coupling helices was also predicted with remarkable congruency with CFTR2 data (Fig. 4d,e). These observations accentuate the potential of AlphaMissense predictions as a valuable tool for aiding the identification of functionally crucial sites. To facilitate hotspot detection and access to AlphaMissense data, we established a dedicated web resource available at https://alphamissense.hegelab.org, which also provides structure files with mapped AlphaMissense scores for visualization, e.g. in PyMOL with our coloring plugin coloram.py, for facilitating local analysis 26. These enhancements aid mutational hotspot detection, paving the way for more detailed and user-friendly analyses.
Missense data was retrieved from ClinVar 35 as of 26th September 2023 and made available at Zenodo (clinvar_result.txt) 26. The dataset representing the human proteome was obtained from UniProt Release 2023_04, specifically from the file UP000005640_9606.dat (reference proteomes from https://www.uniprot.org/help/downloads) 36. This dataset proved instrumental in mapping Ensembl IDs from ClinVar to UniProt accession numbers, since the built-in online ID mapping tool at UniProt matched only a very low number of entries.
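A mapping of this kind can be read from the cross-reference lines of a UniProt flat file. The sketch below is not the code used in this study; it assumes the usual flat-file layout in which `AC` lines carry accessions, `DR   Ensembl;` lines carry transcript, protein, and gene identifiers, and `//` ends an entry. The function name and the identifiers in the sample entry are made up for illustration.

```python
def ensembl_to_uniprot(dat_lines):
    """Map Ensembl IDs (version suffix removed) to primary UniProt accessions."""
    mapping = {}
    primary = None
    for line in dat_lines:
        if line.startswith("AC   ") and primary is None:
            # The first accession on the first AC line is the primary one.
            primary = line[5:].split(";")[0].strip()
        elif line.startswith("DR   Ensembl;") and primary:
            for field in line[5:].split(";")[1:]:
                parts = field.split()
                if not parts:
                    continue
                token = parts[0].rstrip(".")
                if token.startswith("ENS"):
                    mapping[token.split(".")[0]] = primary
        elif line.startswith("//"):
            primary = None  # end of entry
    return mapping


# Made-up entry for illustration; these are placeholder identifiers.
sample = [
    "ID   EXAMPLE_HUMAN           Reviewed;         100 AA.",
    "AC   P00000; Q00000;",
    "DR   Ensembl; ENST00000000001.1; ENSP00000000001.1; ENSG00000000001.1. [P00000-1]",
    "//",
]
id_map = ensembl_to_uniprot(sample)
```

Reading the file line by line keeps memory use low, which matters for a proteome-scale flat file.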
Human protein structures were downloaded from AlphaFoldDB (version 4; https://alphafold.ebi.ac.uk/download#proteomes-section) 37. The gen_pdb_occupancy.py script was used to insert the mean AlphaMissense score for each residue into the occupancy and B factor columns of structure files. All of these structures are available at Zenodo as a zip file for bulk download. Individual structure files can be accessed manually or programmatically as https://alphamissense.hegelab.org/pdb/AF-{UNIPROT_ACC}-F1-AM_v4.pdb.
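For programmatic access, the URL pattern given above can be filled in from a UniProt accession. This is a minimal sketch; the function name is hypothetical and no network request is made here.

```python
URL_TEMPLATE = "https://alphamissense.hegelab.org/pdb/AF-{acc}-F1-AM_v4.pdb"


def am_structure_url(uniprot_acc):
    """Return the structure-file URL for a UniProt accession."""
    acc = uniprot_acc.strip().upper()
    if not acc.isalnum():
        raise ValueError(f"unexpected accession: {uniprot_acc!r}")
    return URL_TEMPLATE.format(acc=acc)


# P13569 is the CFTR accession used in the structure name quoted in the text.
cftr_url = am_structure_url("P13569")
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen` from the standard library.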
Analysis. All data analyses were carried out using Python-based tools to ensure flexibility and scalability.
To facilitate a lightweight and seamless interaction with the data stored in PostgreSQL 12, we employed the SQLAlchemy 2.0.21 library 41, renowned for its capability to provide a high-level, Pythonic interface to relational databases. Matplotlib 3.7.0 was used for generating plots that delineate various aspects of the data 42. Structural visualization of proteins was done using PyMOL (version 2.4, Schrödinger, LLC.), a molecular graphics system with an embedded Python interpreter. To bridge the predictions of AlphaMissense with these structures, MDAnalysis 2.4.2 was employed 43. This Python toolkit allowed us to incorporate the AlphaMissense scores directly into the PDB files, specifically inserting them into both the occupancy and B-factor columns.
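To show what writing scores into the occupancy and B-factor columns involves, the sketch below edits the fixed-width fields of PDB `ATOM` records directly (columns 55-60 and 61-66). It deliberately uses plain string slicing in place of MDAnalysis, so it is a stand-in for illustration and not the gen_pdb_occupancy.py script; the function name and the score dictionary are hypothetical.

```python
def set_occupancy_and_bfactor(pdb_text, mean_scores):
    """Write per-residue scores into PDB columns 55-60 and 61-66.

    mean_scores maps residue number -> mean score (hypothetical input).
    Residues without a score are left unchanged.
    """
    out = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            score = mean_scores.get(int(line[22:26]))  # columns 23-26: resSeq
            if score is not None:
                field = f"{score:6.2f}"
                line = line[:54] + field + field + line[66:]
        out.append(line)
    return "\n".join(out)


# Build one fixed-width CA record so the column positions are explicit.
example_line = (
    "ATOM  {:5d}  CA  {:3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    .format(1, "MET", 1, 0.0, 0.0, 0.0, 1.0, 50.0)
)
updated = set_occupancy_and_bfactor(example_line, {1: 0.42})
```

Storing the score in the B-factor column is convenient because molecular viewers such as PyMOL can color a structure by that column.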
The ClinVar entries and AlphaMissense predictions of the above protein groups were compared using ana_clinvar_set.py, ana_clinvar_resi.py, and ana_clinvar_set_plddt.py when full protein sequences, specific residues (e.g. IBS, MemMoRF, and TM residues), and residues with high pLDDT scores were analyzed, respectively. Since aucROC calculation requires not only a contingency table but all the true labels and predictions, aucROC was calculated with separate scripts named calc_*_aucroc.py. The outputs were collected in an Excel table (table1.xlsx).
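The reason aucROC needs every label and score is visible in its rank-based definition: it is the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, with ties counted as one half. The following is a minimal quadratic-time sketch of that definition, not one of the calc_*_aucroc.py scripts; the function name and the inputs are made up for illustration.

```python
def auc_roc(labels, scores):
    """Area under the ROC curve from true labels (1/0) and raw scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = pathogenic, 0 = benign) and scores.
auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

For large datasets a sort-based implementation would be faster, but the pairwise form makes the definition explicit.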
The AlphaMissense scores were averaged for all possible amino acid changes for each residue in the full dataset using the calc_revfreq.py script (output is stored in aaaa_revfreq.pkl). The AlphaMissense scores were also averaged for pairwise amino acid changes (aaa_freq.pkl) to compare them with the BLOSUM62 substitution matrix. The substitution matrix was taken from the BioPython 1.81 package (https://biopython.org/docs/latest/api/Bio.Align.substitution_matrices.html). The aa_substitutions.ipynb notebook contains the code for analysis including linear regression and plotting the panels of Fig. 1.
Distributions of AlphaMissense scores for CFTR benign and pathogenic variations listed in the CFTR2 database were calculated and plotted (Fig. 2a) with the ana_mutspreds.ipynb notebook. The CFTR structure AF-P13569-F1-AM_v4.pdb was visualized in PyMOL and colored using the coloram.py script (Fig. 2b). Residues indicated as pathogenic in the ClinVar database are displayed with spheres using the show_clinvar_patho.py script. AlphaMissense mean values referenced in the main text for ATP binding sites (Fig. 2c) and for NBD/TMD interfaces (Fig. 2c) were calculated using the atpbsites-mean.py and interfaces-mean.py scripts, respectively. ATP binding sites and coupling helices in these panels were highlighted by setting the cartoon_transparency to 0.5 for all other parts of the structure.

Fig. 1 Distribution of AlphaMissense predictions in transmembrane (a) and soluble regions (b) of TM proteins. Transmembrane and soluble parts were determined for HTP entries with a confidence score higher than 85. Benign and pathogenic AlphaMissense predictions for SNVs present in ClinVar were collected and split into true and false categories for plotting. Ambiguous AlphaMissense predictions (6% and 7% for TM and soluble regions, respectively) were not included.

Fig. 2 Symmetries of AlphaMissense amino acid substitutions. (a) Mean AlphaMissense scores for variations that display a minimum score difference of 0.2 when compared to the reverse amino acid change. Asterisks mark those changes that get the opposite label (benign/pathogenic) in the case of the reverse change. (b) Mean AlphaMissense scores for each variation grouped by their BLOSUM62 score. Dashed and dash-dotted lines indicate the cutoffs of the ambiguous AlphaMissense predictions. A solid black line was fitted (r = −0.678, p = 6.39 × 10^−27). Orange circles: amino acid substitutions possible with a single nucleotide change; blue circles: all other substitutions.