Abstract
With large-scale population sequencing projects gathering pace, there is a need for strategies that advance disease gene prioritization [1,2]. Metrics that describe how well a gene tolerates protein-altering variation can aid clinical interpretation of human genomes and advance disease gene discovery [1–4]. Previously reported methods analyzed the total variant load in a gene [1–4] but not the distribution pattern of variants within it. Using data from 138,632 exome and genome sequences [2], we developed the gene variation intolerance rank (GeVIR), a continuous gene-level metric for 19,361 genes that prioritizes both dominant and recessive Mendelian disease genes [5], outperforms missense constraint metrics [3] and is comparable, but complementary, to loss-of-function (LOF) constraint metrics [2]. GeVIR can also prioritize short genes, for which LOF constraint cannot be estimated with confidence [2]. The majority of the most intolerant genes identified here have no defined phenotype and are candidates for severe dominant disorders.
Data availability
The GERP++ file can be found at http://mendel.stanford.edu/SidowLab/downloads/gerp/hg19.GERP_scores.tar.gz. The ClinVar files can be found at ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz and ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt. The CCR files can be found at https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.autosomes.v2.20180420.bed.gz and https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.xchrom.v2.20180420.bed.gz. The OMIM genemap2.txt file can be found, after registration, at https://omim.org/downloads. The gnomAD gene constraint metric file can be found at https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz. The gnomAD exomes variants and coverage files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/exomes/gnomad.exomes.r2.0.2.sites.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/coverage/combined_tars/gnomad.exomes.r2.0.2.coverage.all.tar, respectively. The gnomAD genomes variants files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf.bgz. The gnomAD genes, transcripts and exons files can be found at http://broadinstitute.org/~konradk/exac_browser/exac_browser.tar.gz. The Ensembl coding and peptide sequences from build GRCh37/hg19 can be found at https://grch37.ensembl.org/biomart/martview (data set: Human genes (GRCh37.p13); Attributes → Sequences → ‘Coding sequence’ and ‘Peptide’). The homozygous LOF tolerant genes (that is, nulls) can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/homozygous_lof_tolerant_twohit.tsv. 
The cell essential and non-essential genes from CRISPR–Cas experiments can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/CEGv2_subset_universe.tsv and https://github.com/macarthur-lab/gene_lists/blob/master/lists/NEGv1_subset_universe.tsv, respectively. The mouse heterozygous lethal genes can be obtained from http://www.mousemine.org/ by querying the database with the following search terms: path = ‘OntologyAnnotation.ontologyTerm’ type = ‘MPTerm’; path = ‘OntologyAnnotation.subject’ type = ‘SequenceFeature’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject’ type = ‘Genotype’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject.zygosity’ op = ‘=’ value = ‘ht’ code = ‘B’; path = ‘OntologyAnnotation.ontologyTerm.name’ op = ‘CONTAINS’ value = ‘lethal’. The human–mouse ortholog mapping file can be found at http://www.informatics.jax.org/downloads/reports/HMD_HumanPhenotype.rpt. The HGNC approved gene symbols can be found at https://www.genenames.org/download/statistics-and-files.
Code availability
Code for calculating GeVIR/VIRLOF scores, data analysis and figures can be found at https://github.com/gevirank/gevir. Computed GeVIR/VIRLOF scores are available in Supplementary Table 2.
References
1. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
2. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at bioRxiv https://doi.org/10.1101/531210 (2019).
3. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
4. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
5. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
6. Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
7. Gussow, A. B., Petrovski, S., Wang, Q., Allen, A. S. & Goldstein, D. B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17, 9 (2016).
8. Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
9. Sivley, M. Comprehensive analysis of constraint on the spatial distribution of missense variants in human protein structures. Am. J. Hum. Genet. 102, 415–426 (2018).
10. Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2018).
11. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
12. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
13. Motenko, H., Neuhauser, S. B., O’Keefe, M. & Richardson, J. E. MouseMine: a new data warehouse for MGI. Mamm. Genome 26, 325–330 (2015).
14. Eppig, J. T. et al. The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease. Nucleic Acids Res. 43, D726–D736 (2015).
15. Hart, T. et al. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 7, 2719–2727 (2017).
16. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
17. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).
18. Kobayashi, Y. et al. Pathogenic variant burden in the ExAC database: an empirical approach to evaluating population data for clinical variant interpretation. Genome Med. 9, 13 (2017).
19. Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
20. Steinberg, J., Honti, F., Meader, S. & Webber, C. Haploinsufficiency predictions without study bias. Nucleic Acids Res. 43, e101 (2015).
21. Yates, B. et al. Genenames.org: the HGNC and VGNC resources in 2017. Nucleic Acids Res. 45, D619–D625 (2017).
22. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
23. Virtanen, P. et al. SciPy 1.0–fundamental algorithms for scientific computing in Python. Preprint at https://arxiv.org/abs/1907.10121 (2019).
Acknowledgements
This work was supported by the Engineering and Physical Sciences Research Council (EP/N509565/1). M.T. was funded by the Newlife Foundation (grant no. 14–15/15). We also acknowledge the support of the Manchester Academic Health Science Centre. We thank gnomAD and the groups that provided exome and genome variant data to this resource. A full list of contributing groups can be found at https://gnomad.broadinstitute.org/about.
Author information
Contributions
N.A., M.T. and A.B. conceived and designed the research. N.A. executed the analysis. N.A. and M.T. performed the primary writing. M.T. and A.B. supervised all aspects of the research, and reviewed and edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Comparison of top genes ranked by GeVIR with a list of genes sorted by number of CCRs at 95th or greater percentile (7,000 genes).
a, Cumulative number of genes associated exclusively with autosomal dominant (AD) diseases in OMIM (n = 770). b, Cumulative number of genes associated exclusively with autosomal recessive (AR) diseases in OMIM (n = 1,553). c, AD class F1 score calculated at each cumulative subset of top-ranked genes, considering AD genes as true positives and AR genes as false positives. d, Canonical-transcript protein length for each block of 1,000 ranked genes (that is, 1–1,000, 1,001–2,000, …, 6,001–7,000). Standard box plot notation is used: the upper and lower hinges are the 75th and 25th percentiles; the inner segment is the median; notches are calculated using a Gaussian-based asymptotic approximation; and the upper and lower whiskers extend from the hinges to the largest and smallest values within 1.5 times the interquartile range. Outliers are not shown because genes with extreme protein length in the data set (for example, TTN, ~36,000 amino acids) would distort the figure. Correlation between protein length and gene rank was measured with Spearman’s rank correlation coefficient.
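The cumulative AD class F1 score in panel c can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the repository listed under Code availability). The function name cumulative_ad_f1, the label strings and the choice of recall denominator (all AD genes in the evaluation set, passed as total_ad) are assumptions made for illustration only.

```python
from typing import List, Sequence


def cumulative_ad_f1(labels: Sequence[str], total_ad: int) -> List[float]:
    """Return the AD class F1 score for each cumulative top-k subset.

    labels: one label per gene, best-ranked gene first. "AD" counts as a
        true positive, "AR" as a false positive; any other label is ignored.
    total_ad: number of AD genes in the whole evaluation set. Using this as
        the recall denominator is an assumption of this sketch.
    """
    scores: List[float] = []
    true_pos = 0
    false_pos = 0
    for label in labels:
        if label == "AD":
            true_pos += 1
        elif label == "AR":
            false_pos += 1
        called = true_pos + false_pos
        precision = true_pos / called if called else 0.0
        recall = true_pos / total_ad if total_ad else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall (0 if both are 0).
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


# Hypothetical toy ranking of four genes, two of which are AD genes.
toy_scores = cumulative_ad_f1(["AD", "other", "AR", "AD"], total_ad=2)
```

In this toy ranking the score drops when the AR gene enters the subset and rises again once the second AD gene is reached, which is the behaviour a curve of this kind is meant to expose.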
Supplementary information
Supplementary Information
Supplementary Figures 1–5, Note and Tables 1, 3, 4, 5, 7 and 8
Supplementary Data 1
Supplementary Tables 2 and 6
Cite this article
Abramovs, N., Brass, A. & Tassabehji, M. GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes. Nat Genet 52, 35–39 (2020). https://doi.org/10.1038/s41588-019-0560-2
DOI: https://doi.org/10.1038/s41588-019-0560-2