Predicting functional effect of missense variants using graph attention neural networks

Zhang, Haicang; Xu, Michelle S.; Fan, Xiao; Chung, Wendy K.; Shen, Yufeng

doi:10.1038/s42256-022-00561-w

Article
Published: 15 November 2022

Nature Machine Intelligence volume 4, pages 1017–1028 (2022)

Abstract

Accurate prediction of damaging missense variants is critically important for interpreting a genome sequence. Although many methods have been developed, their performance has been limited. Recent advances in machine learning and the availability of large-scale population genomic sequencing data provide new opportunities to considerably improve computational predictions. Here we describe the graphical missense variant pathogenicity predictor (gMVP), a new method based on graph attention neural networks. Its main component is a graph with nodes that capture predictive features of amino acids and edges weighted by co-evolution strength, enabling effective pooling of information from the local protein context and functionally correlated distal positions. Evaluation of deep mutational scan data shows that gMVP outperforms other published methods in identifying damaging variants in TP53, PTEN, BRCA1 and MSH2. Furthermore, it achieves the best separation of de novo missense variants in neurodevelopmental disorder cases from those in controls. Finally, the model supports transfer learning to optimize gain- and loss-of-function predictions in sodium and calcium channels. In summary, we demonstrate that gMVP can improve interpretation of missense variants in clinical testing and genetic studies.
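The pooling idea in the abstract (a node for the variant position gathering information from context positions, with edge features such as co-evolution strength influencing how much each neighbour contributes) can be illustrated with a generic single-head attention step. This is only a minimal sketch under stated assumptions, not the gMVP implementation: the function names, the weight matrices `w_q`, `w_k`, `w_e`, `w_v`, and the choice of adding an edge projection to the keys are all hypothetical and chosen for illustration.

```python
import math


def matvec(mat, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(center, neighbors, edges, w_q, w_k, w_e, w_v):
    """Pool neighbour features into the centre node with one attention head.

    center:    feature vector of the variant position (length d)
    neighbors: list of n feature vectors of context positions (each length d)
    edges:     list of n edge feature vectors (each length c), assumed here
               to stand in for co-evolution strength
    w_q, w_k, w_v: hypothetical h x d projection matrices
    w_e:           hypothetical h x c projection matrix for edge features
    """
    query = matvec(w_q, center)
    scores = []
    for node, edge in zip(neighbors, edges):
        # Assumption: the key combines the neighbour's features and its edge features.
        key = [a + b for a, b in zip(matvec(w_k, node), matvec(w_e, edge))]
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / math.sqrt(len(query)))
    weights = softmax(scores)
    values = [matvec(w_v, node) for node in neighbors]
    pooled = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return pooled, weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = attention_pool(
        center=[1.0, 0.0],
        neighbors=[[1.0, 0.0], [0.0, 1.0]],
        edges=[[0.0], [5.0]],
        w_q=identity, w_k=identity, w_e=[[1.0], [0.0]], w_v=identity,
    )
    print(weights, pooled)
```

In this toy setting a larger edge feature raises the attention weight of the corresponding neighbour, which is the qualitative behaviour the abstract attributes to co-evolution-weighted edges; the real model's architecture should be taken from the paper and its code repository.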

**Fig. 2: Evaluating gMVP and published methods using cancer somatic mutation hotspots and random variants in population.**

**Fig. 3: Evaluating gMVP and published methods in identifying damaging variants in known disease genes such as *TP53*, *PTEN*, *BRCA1* and *MSH2*.**

**Fig. 4: Evaluating gMVP and published methods in distinguishing rare DNMs in cases with neurodevelopmental disorders from those in controls.**

**Fig. 5: Evaluating gMVP and other published methods in classifying pathogenic and neutral variants, and in predicting GOF and LOF variants in ion-channel genes.**

**Fig. 6: Interpreting gMVP predictions with conservation, protein structure and genetic coding constraints.**

Data availability

Pre-computed gMVP scores for all possible missense variants in canonical transcripts on human hg38 can be downloaded from https://www.dropbox.com/s/nce1jhg3i7jw1hx/gMVP.2021-02-28.csv.gz?dl=0. The training data of the main model were downloaded from http://www.discovehrshare.com/downloads (DiscovEHR), http://www.hgmd.cf.ac.uk/ac/index.php (HGMD), https://www.uniprot.org/docs/humpvar (UniProt) and https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/ (ClinVar). Other datasets supporting the findings of this study are available in the paper and the Supplementary Information.
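For readers working with a downloaded score file, one way to stream a gzip-compressed CSV in Python is sketched below. The column names `gene_symbol` and `gMVP` are placeholders that are not taken from the paper, so the header of the actual file should be checked before use; the helper names are likewise hypothetical.

```python
import csv
import gzip


def iter_score_rows(path):
    """Yield each row of a gzip-compressed CSV file as a dict keyed by its header."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def top_scores(path, gene, gene_col="gene_symbol", score_col="gMVP", n=5):
    """Return up to n rows for one gene, sorted by descending score.

    gene_col and score_col are assumed column names, not confirmed by the source.
    """
    rows = [row for row in iter_score_rows(path) if row[gene_col] == gene]
    rows.sort(key=lambda row: float(row[score_col]), reverse=True)
    return rows[:n]
```

Streaming with `csv.DictReader` avoids decompressing the whole file to disk; only the rows for the requested gene are held in memory.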

Code availability

The code for the model design and for the training and testing procedures is available on GitHub (https://github.com/ShenLab/gMVP/) and Zenodo⁸¹.

References

Boettcher, S. et al. A dominant-negative effect drives selection of TP53 missense mutations in myeloid malignancies. Science 365, 599–604 (2019).
Huang, K. L. et al. Pathogenic germline variants in 10,389 adult cancers. Cell 173, 355–370.e14 (2018).
Jin, S. C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. 49, 1593–1601 (2017).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584.e23 (2020).
Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
Rehm, H. L., Berg, J. S. & Plon, S. E. ClinGen and ClinVar—enabling genomics in precision medicine. Hum. Mutat. 39, 1473–1475 (2018).
He, X. et al. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet. 9, e1003671 (2013).
Nguyen, H. T. et al. Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders. Genome Med. 9, 114 (2017).
Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet. https://doi.org/10.1002/0471142905.hg0720s76 (2013).
Carter, H., Douville, C., Stenson, P. D., Cooper, D. N. & Karchin, R. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genom. 14, S3 (2013).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2015).
Jagadeesh, K. A. et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 48, 1581–1586 (2016).
Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).
Qi, H. et al. MVP predicts the pathogenicity of missense variants by deep learning. Nat. Commun. 12, 510 (2021).
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161 (2018).
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2019).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
Iqbal, S. et al. Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants. Proc. Natl Acad. Sci. USA 117, 28201–28211 (2020).
Hicks, M., Bartha, I., di Iulio, J., Venter, J. C. & Telenti, A. Functional characterization of 3D protein structures informed by human genetic diversity. Proc. Natl Acad. Sci. USA 116, 8960–8965 (2019).
Sivley, R. M., Dou, X. Y., Meiler, J., Bush, W. S. & Capra, J. A. Comprehensive analysis of constraint on the spatial distribution of missense variants in human protein structures. Am. J. Hum. Genet. 102, 415–426 (2018).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Liang, S., Mort, M., Stenson, P. D., Cooper, D. N. & Yu, H. PIVOTAL: prioritizing variants of uncertain significance with spatial genomic patterns in the 3D proteome. Preprint at bioRxiv https://doi.org/10.1101/2020.06.04.135103 (2021).
Chang, M. T. et al. Accelerating discovery of functional mutant alleles in cancer. Cancer Discov. 8, 174–183 (2018).
Jia, X. et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am. J. Hum. Genet. 108, 163–175 (2021).
Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).
Mighell, T. L., Evans-Dutson, S. & O’Roak, B. J. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype–phenotype relationships. Am. J. Hum. Genet. 102, 943–955 (2018).
Kotler, E. et al. A systematic p53 mutation library links differential functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71, 178–190.e8 (2018).
de Juan, D., Pazos, F. & Valencia, A. Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249–261 (2013).
Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).
Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems 5998–6008 (NeurIPS, 2017).
Veličković, P. et al. Graph attention networks. In 6th International Conference on Learning Representations (Univ. Cambridge, 2018).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2014).
Stenson, P. D. et al. Human gene mutation database (HGMD (R)): 2003 update. Hum. Mutat. 21, 577–581 (2003).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucl. Acids Res. 42, D980–D985 (2014).
Mottaz, A., David, F. P., Veuthey, A. L. & Yip, Y. L. Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics 26, 851–852 (2010).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 2015 International Conference on Learning Representations (ICLR, 2015).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/abs/1603.04467 (2016).
Alirezaie, N., Kernohan, K. D., Hartley, T., Majewski, J. & Hocking, T. D. ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am. J. Hum. Genet. 103, 474–483 (2018).
Feng, B. J. PERCH: a unified framework for disease gene prioritization. Hum. Mutat. 38, 243–251 (2017).
Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR Study. Science 354, aaf6814 (2016).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proc. Natl Acad. Sci. USA 111, E455–E464 (2014).
Heyne, H. O. et al. Predicting functional effects of missense variants in voltage-gated sodium and calcium channels. Sci. Transl. Med. 12, eaay6848 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Abrusán, G. & Marsh, J. A. Alpha helices are more robust to mutations than beta strands. PLoS Comput. Biol. 12, e1005242 (2016).
Gao, M., Zhou, H. & Skolnick, J. Insights into disease-associated mutations in the human proteome through protein structural analysis. Structure 23, 1362–1369 (2015).
Li, S.-C., Goto, N. K., Williams, K. A. & Deber, C. M. Alpha-helical, but not beta-sheet, propensity of proline is determined by peptide environment. Proc. Natl Acad. Sci. USA 93, 6676–6681 (1996).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Yang, J. Y. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Kumar, S., Clarke, D. & Gerstein, M. B. Leveraging protein dynamics to identify cancer mutational hotspots using 3D structures. Proc. Natl Acad. Sci. USA 116, 18962–18970 (2019).
Anishchenko, I., Ovchinnikov, S., Kamisetty, H. & Baker, D. Origins of coevolution between residues distant in protein 3D structures. Proc. Natl Acad. Sci. USA 114, 9122–9127 (2017).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Rao, R. et al. MSA transformer. In Proc. 38th International Conference on Machine Learning 8844–8856 (PMLR, 2021).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. In 2021 International Conference on Learning Representations (ICLR, 2021).
Lal, D. et al. Gene family information facilitates variant interpretation and identification of disease-associated genes in neurodevelopmental disorders. Genome Med. 12, 28 (2020).
Zhang, X. et al. Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions. Genet. Med. 23, 69–79 (2021).
Starita, L. M. et al. Variant interpretation: functional assays to the rescue. Am. J. Hum. Genet. 101, 315–325 (2017).
Brnich, S. E. et al. Recommendations for application of the functional evidence PS3/BS3 criterion using the ACMG/AMP sequence variant interpretation framework. Genome Med. 12, 3 (2019).
Hartl, D. L. & Clark, A. G. Principles of Population Genetics 4th edn (Sinauer Associates, 1989).
Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
Charlesworth, B. & Hill, W. G. Selective effects of heterozygous protein-truncating variants. Nat. Genet. 51, 2 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Mulder, N. et al. H3Africa: current perspectives. Pharmgenomics Pers. Med. 11, 59–66 (2018).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc. 14th International Conference on Artificial Intelligence and Statistics 315–323 (JMLR, 2011).
Ke, G., He, D. & Liu, T.-Y. Rethinking positional encoding in language pre-training. In 2021 International Conference on Learning Representations (ICLR, 2021).
Bateman, A. Uniprot: a universal hub of protein knowledge. Protein Sci. 28, 32–32 (2019).
Remmert, M., Biegert, A., Hauser, A. & Soding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2012).
Herrero, J. et al. Ensembl comparative genomics resources. Database 2016, bav096 (2016).
Klausen, M. S. et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins 87, 520–527 (2019).
Armean, I. M. et al. Enhanced access to extensive phenotype and disease annotation of genes and genetic variation in Ensembl. Eur. J. Hum. Genet. 27, 1721–1721 (2019).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Ge, R., Kakade, S. M., Kidambi, R. & Netrapalli, P. Rethinking learning rate schedules for stochastic optimization. In 2019 International Conference on Learning Representations (ICLR, 2018).
Zhang, H. & Shen, Y. ShenLab/gMVP: v1.0.0-alpha. Zenodo https://doi.org/10.5281/zenodo.7134878 (2022).

Acknowledgements

This work was supported by NIH grants (nos. R01GM120609, R03HL147197, U01HG008680 and K99HG011490) and the Columbia University Precision Medicine Joint Pilot Grants Program. We thank Y. Zhao, G. Zhong, M. AlQuraishi and D. Knowles for helpful discussions.

Author information

Authors and Affiliations

Department of Systems Biology, Columbia University, New York, NY, USA
Haicang Zhang, Xiao Fan & Yufeng Shen
Columbia College, Columbia University, New York, USA
Michelle S. Xu
Department of Pediatrics, Columbia University, New York, NY, USA
Xiao Fan & Wendy K. Chung
Department of Medicine, Columbia University, New York, NY, USA
Wendy K. Chung
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Yufeng Shen
JP Sulzberger Columbia Genome Center, Columbia University, New York, NY, USA
Yufeng Shen

Contributions

Y.S. conceived and guided the study. H.Z. implemented the algorithms and performed the main analyses. All authors contributed to data analysis, interpretation and manuscript writing.

Corresponding author

Correspondence to Yufeng Shen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Xiaoming Liu, Wim Vranken, Amit R Majithia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Text, Figs. 1–15 and Tables 11–18.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Xu, M.S., Fan, X. et al. Predicting functional effect of missense variants using graph attention neural networks. Nat Mach Intell 4, 1017–1028 (2022). https://doi.org/10.1038/s42256-022-00561-w

Received: 09 July 2021
Accepted: 07 October 2022
Published: 15 November 2022
Issue Date: November 2022
DOI: https://doi.org/10.1038/s42256-022-00561-w