Genome-wide prediction of disease variant effects with a deep protein language model

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.

), we obtained a mapping between the NM codes and NP codes (RefSeq protein records).We then matched those NP codes with UniProt IDs, using the same set of 42,336 protein isoforms from UniProt.Following this mapping, from NM code to NP code to UniProt ID, we obtained 386,033 missense variants with clinical significance contributing to 848,320 effects on 12,179 UniProt isoforms (~2.2 effects per variant on average) across 5,683 unique genes.These variants were used in the analysis of ClinVar over alternative isoforms (Fig. 4).

Clinical benchmarks of missense variants (ClinVar and HGMD/gnomAD)
For the benchmarks incorporating only the primary isoforms (Fig. 2, Extended Data Fig. 1A-B, Extended Data Fig. 4 and Extended Data Fig. 6A), we used 46,726 variants with high-quality ClinVar labels (i.e., at least 1 "Gold Stars") across ~3K disease genes downloaded from the EVE portal (https://evemodel.org/).Importantly, this dataset includes all ClinVar labels for these genes (including variants not covered by EVE).
Access to the full HGMD dataset (https://www.hgmd.cf.ac.uk/ac/index.php)was provided upon request (see Acknowledgements) 26 .Of the entire set of 331,912 HGMD variants, 172,461 were missense variants, of which 158,725 were mapped to a unique UniProt isoform, of which 98,086 were annotated as high-confidence disease-causing variants (DM).
To create a list of common missense variants, we used exomed-derived variants from the gnomAD database (version 2.1.1),by downloading the file gnomad.exomes.r2.1.1.sites.vcf.bgz.From this VCF file, we extracted 5,624,824 missense variants with a PASS filter, of which 4,876,367 were called in over 200,000 alleles (according to the AN value in the VCF file) and were considered high-quality variants.Of these, 26,359 variants (0.5%) had MAF > 0.01 and were considered common variants.
Altogether, we had 124,445 variants in the HGMD/gnomAD benchmark (98,086 pathogenic and 26,359 likely benign variants).Of these, 92,942 variants (70,714 pathogenic and 22,228 benign) affected the primary UniProt isoform and were used in this benchmark throughout this work.

Indels and stop gain ClinVar benchmarks
The set of all ClinVar variants was downloaded from ClinVar's FTP site on January 24, 2023 (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz).Of the 1,596,737 variants in version GRCh38 of the human reference genome, we detected 3,511 high-quality (i.e., at least 1 "Gold Stars") in-frame indels.The protein sequence matching each variant was determined by the reported RefSeq identifier.The amino acid changes reported for 17 of the 3,511 indels didn't match the RefSeq protein sequence.Filtering out these 17 variants, and 24 additional variants which were duplicates or included non-standard amino-acids, the final indel benchmark included 1,679 benign and 1,791 pathogenic variants.Notably, 3% of these indels included insertions or deletions of more than 17 residues (50nt), which are often considered structural variation rather than indels.
We further detected 36,489 high-quality (i.e., at least 1 "Gold Stars") stop-gain variants.To determine whether the 50bp rule 45 applies for each of these variants, we recovered the coordinate of the last exon junction in the coding region of the affected gene based on the exon annotations in RefSeq.Of the 36,489 variants, 110 didn't have exon annotations and were filtered out.The remaining dataset included 36,034 pathogenic and 345 benign stop-gain variants.
EVE scores and MSA coverage EVE scores for missense variants affecting the ~3K disease-associated genes analyzed by the model were downloaded from the official EVE portal (https://evemodel.org/).From the same portal, we also downloaded multiple sequence alignments (MSAs) for the same ~3K disease genes.For each MSA, the target (human) protein sequence is reported as a combination of uppercase and lowercase letters.The uppercase letters correspond to residues that are part of the MSA profile, whereas lowercase letters correspond to residues that are not.Accordingly, we defined the coverage of a human protein (reported in Fig. 1C-D) to be the fraction of the target sequence letters that are uppercase.

Supplementary Table 1: Deep mutational scans used for model evaluation
Our analysis of protein isoforms is based on 42,336 manually-reviewed protein isoform sequences taken from UniProt 18 (in February 2022).To get ClinVar labels across all annotated variants and map them to UniProt protein sequences, we downloaded the variant_summary.txt.gzfile from ClinVar (on April 29, 2022).This dataset contained information about ~1.3M variants, including clinical significance and NM code (RefSeq mRNA record).From a separate dataset on ClinVar's website (hgvs4variation.txt.gz,downloaded on May 3, 2022