A general framework for estimating the relative pathogenicity of human genetic variants

Journal name: Nature Genetics
Volume: 46
Pages: 310–315
DOI: 10.1038/ng.2892

Abstract

Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation–Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.

Figures

  1. Figure 1: Relationship of scaled C scores and categorical variant consequences.

    (a) Proportion of substitutions with a specific consequence for each scaled C score bin. (b) Proportion of substitutions with a specific consequence after first normalizing by the total number of variants observed in that category. The legend includes in parentheses the median and range of scaled C score values for each category. Consequences were obtained from Ensembl VEP (ref. 16; Supplementary Note); for example, noncoding refers to changes in annotated noncoding transcripts. Detailed counts of functional assignments in each C score bin are provided in Supplementary Table 8. (c) Violin plots of the median C scores of potential nonsense (stop-gain) variants for genes that harbor at least 5 known pathogenic mutations (disease; ref. 48); are predicted to be essential (ref. 23); harbor variants associated with complex traits (GWAS; ref. 41); harbor at least 2 loss-of-function mutations in 1000 Genomes Project data (LoF; ref. 49); encode olfactory receptor proteins; or are in a random selection of 500 genes (other; Supplementary Note).

  2. Figure 2: Relationship between scaled C scores and genetic variation.

    (a) Mean DAF by scaled C score for variants listed by the 1000 Genomes Project (ref. 14) or ESP (ref. 24). Dashed lines indicate mean DAF values, and confidence intervals indicate 1.96 × s.e.m. for DAFs in each bin. (b) Under-representation of polymorphic sites in 1000 Genomes Project data. (c) Under-representation of chimpanzee lineage–derived variants. Under-representation is defined as the proportion of 1000 Genomes Project (b) or chimpanzee-derived (c) variants in a specific scaled C score bin divided by the frequency with which that scaled C score is observed for all possible mutations of the human reference assembly (10^(−C score/10)). The stronger under-representation of chimpanzee-derived variants relative to 1000 Genomes Project variants is expected given that the former are mostly fixed or high-frequency variants (and have survived many generations of purifying selection), whereas the latter are mostly low-frequency variants. Depletion values in b,c for C score bins other than 0 are significantly different from expectation (binomial proportion test, all P < 1 × 10^−11).

  3. Figure 3: Sensitivity of methods in distinguishing pathogenic and benign variants.

    Receiver operating characteristics (ROCs) are shown discriminating curated, pathogenic mutations defined by the ClinVar database (ref. 27) from matched, likely benign ESP alleles (DAF ≥ 5%; ref. 24) with the same categorical consequence. (a) Genome-wide variants for which GerpS, PhCons and phyloP scores are defined (n = 16,334). (b) Analysis limited to missense changes (n = 15,154), with missing values imputed to an upper limit of each score. (c) Analysis limited to missense changes for which PolyPhen, SIFT and Grantham scores are all defined (n = 13,358). Versions of the plot in c that exclude overlap between PolyPhen training data and the ClinVar database or use a CADD model trained without PolyPhen as a feature are shown in Supplementary Figure 12. Area under the curve (AUC) values are provided for each of the scores used.

  4. Figure 4: Ranking of pathogenic ClinVar variants among the variants identified by whole-genome sequencing in 11 human individuals from diverse populations.

    (a) Cumulative distribution of the rankings of 9,831 pathogenic ClinVar variants when 'spiked' into each of 11 personal genomes. For example, C scores for ~30% of ClinVar variants rank in the top 0.1% of all variants within a personal genome, and most rank in the top 1%. About 25% of pathogenic ClinVar SNVs are not scored by PolyPhen or SIFT because of missing values or the restriction of these methods to missense variation; note also that rankings for PolyPhen and SIFT are computed among missense variants only and are therefore derived from far fewer total variants (see a plot restricted to missense variation in Supplementary Fig. 16). (b) Quantile-quantile plot of C scores for the SNVs identified in the 11 individual genomes and pathogenic ClinVar SNVs. For a given scaled C score observed in an individual, the fraction of that individual's variants with a C score at least that high was computed (y axis). The C score corresponding to this quantile of the distribution of all possible variants is displayed on the x axis. High C scores are under-represented compared to the set of all possible variants. In contrast, known disease-causal variants from ClinVar have large C scores relative to the set of all possible variants. This fact can be exploited to prioritize causal variants identified from whole-genome sequencing of individual genomes as in a (see also Supplementary Tables 10 and 11).

  5. Figure 5: C scores for GWAS SNPs are higher than for nearby control SNPs and are dependent on study sample size.

    The average scaled C score (y axis) is plotted for each category of SNPs, as indicated by color, relative to the sample size of the association study in which the SNP was identified (x axis). Sample size bins are log2-scaled and mutually exclusive; for example, the bin labeled 1,024 represents all SNPs from studies with between 512 and 1,024 samples. Error bars, ±1 s.e.m. Each shaded rectangle represents the overall (across all sample sizes) scaled C score mean ± 1 s.e.m. for each category, as indicated by color.
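
The two bin-based quantities defined in the captions of Figures 2 and 5 can be illustrated with a short sketch. It is not the authors' code: the function names and the example counts are hypothetical, the expected frequency uses the caption's expression 10^(−C score/10) as written, and treating the upper edge of each sample size bin as inclusive (so that 1,024 samples falls in the bin labeled 1,024) is an assumption.

```python
def expected_fraction(scaled_c: float) -> float:
    """Frequency expected for a scaled C score among all possible
    mutations, using the Figure 2 caption's expression 10^(-C/10)."""
    return 10 ** (-scaled_c / 10)


def under_representation(n_in_bin: int, n_total: int, scaled_c: float) -> float:
    """Observed proportion of variants in a scaled C score bin divided by
    the expected frequency; values below 1 indicate depletion."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    observed = n_in_bin / n_total
    return observed / expected_fraction(scaled_c)


def sample_size_bin(n_samples: int) -> int:
    """Label of the log2-scaled bin for a study: the smallest power of two
    that is >= n_samples (inclusive upper edge is an assumption)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return 1 << (n_samples - 1).bit_length()


if __name__ == "__main__":
    # Hypothetical counts: 5 of 1,000 observed variants fall in the bin at
    # scaled C score 20, where the expected frequency is 10^-2.
    print(under_representation(5, 1000, 20))
    # Studies with 513 to 1,024 samples share the bin labeled 1,024.
    print(sample_size_bin(600), sample_size_bin(1024))
```

With the hypothetical counts above, the observed proportion (0.005) is half the expected frequency (0.01), which would be plotted as an under-representation of about 0.5.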

References

  1. Cooper, G.M. et al. Single-nucleotide evolutionary constraint scores highlight disease-causing mutations. Nat. Methods 7, 250–251 (2010).
  2. Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
  3. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).
  4. Ward, L.D. & Kellis, M. Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol. 30, 1095–1106 (2012).
  5. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
  6. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
  7. Ng, P.C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).
  8. Cooper, G.M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
  9. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
  10. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
  11. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  12. Kimura, M. The Neutral Theory of Molecular Evolution (Cambridge University Press, Cambridge and New York, 1983).
  13. Paten, B. et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. 18, 1829–1843 (2008).
  14. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
  15. Paten, B., Herrero, J., Beal, K., Fitzgerald, S. & Birney, E. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res. 18, 1814–1828 (2008).
  16. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010).
  17. Meyer, L.R. et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69 (2013).
  18. Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
  19. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
  20. Grantham, R. Amino acid difference formula to help explain protein evolution. Science 185, 862–864 (1974).
  21. Franc, V. & Sonnenburg, S. Optimized cutting plane algorithm for large-scale risk minimization. J. Mach. Learn. Res. 10, 2157–2192 (2009).
  22. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
  23. Liao, B.Y. & Zhang, J. Null mutations in human and mouse orthologs frequently result in different phenotypes. Proc. Natl. Acad. Sci. USA 105, 6987–6992 (2008).
  24. Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
  25. Makrythanasis, P. et al. MLL2 mutation detection in 86 patients with Kabuki syndrome: a genotype-phenotype study. Clin. Genet. doi:10.1111/cge.12081 (16 January 2013).
  26. Giardine, B. et al. HbVar database of human hemoglobin variants and thalassemia mutations: 2007 update. Hum. Mutat. 28, 206 (2007).
  27. Baker, M. One-stop shop for disease genes. Nature 491, 171 (2012).
  28. Patwardhan, R.P. et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 30, 265–270 (2012).
  29. Patwardhan, R.P. et al. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat. Biotechnol. 27, 1173–1175 (2009).
  30. O'Roak, B.J. et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43, 585–589 (2011).
  31. O'Roak, B.J. et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250 (2012).
  32. Sanders, S.J. et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241 (2012).
  33. Neale, B.M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).
  34. Iossifov, I. et al. De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285–299 (2012).
  35. Rauch, A. et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674–1682 (2012).
  36. de Ligt, J. et al. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 367, 1921–1929 (2012).
  37. Cooper, G.M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
  38. Ng, S.B. et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat. Genet. 42, 790–793 (2010).
  39. Rohland, N. & Reich, D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 22, 939–946 (2012).
  40. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012).
  41. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
  42. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).
  43. Gerstein, M.B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
  44. Schaub, M.A., Boyle, A.P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
  45. González-Pérez, A. & Lopez-Bigas, N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am. J. Hum. Genet. 88, 440–449 (2011).
  46. Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
  47. Weedon, M.N. et al. Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic agenesis. Nat. Genet. 46, 61–64 (2014).
  48. Stenson, P.D. et al. The Human Gene Mutation Database: 2008 update. Genome Med. 1, 13 (2009).
  49. MacArthur, D.G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012).
  50. Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17, 57–86 (1986).
  51. Fujita, P.A. et al. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 39, D876–D882 (2011).
  52. Rosenbloom, K.R. et al. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Res. 40, D912–D917 (2012).
  53. Hubisz, M.J., Pollard, K.S. & Siepel, A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinform. 12, 41–51 (2011).
  54. Davydov, E.V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP. PLoS Comput. Biol. 6, e1001025 (2010).
  55. McVicker, G., Gordon, D., Davis, C. & Green, P. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 5, e1000471 (2009).
  56. Hoffman, M.M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
  57. Tennessen, J.A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
  58. Khurana, E., Fu, Y., Chen, J. & Gerstein, M. Interpretation of genomic variants using a unified biological network approach. PLoS Comput. Biol. 9, e1002886 (2013).
  59. Liu, X., Jian, X. & Boerwinkle, E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 32, 894–899 (2011).

Author information

  1. These authors contributed equally to this work.

    • Martin Kircher &
    • Daniela M Witten

Affiliations

  1. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • Martin Kircher,
    • Brian J O'Roak &
    • Jay Shendure
  2. Department of Biostatistics, University of Washington, Seattle, Washington, USA.

    • Daniela M Witten
  3. HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, USA.

    • Preti Jain &
    • Gregory M Cooper
  4. Present address: Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, USA.

    • Preti Jain &
    • Brian J O'Roak

Contributions

G.M.C. and J.S. designed the study. M.K. processed the annotation data and scores and developed and implemented the simulator and scripts required for scoring. P.J. and B.J.O. prepared and provided data sets and annotations. D.M.W. and M.K. developed the model and performed model training. D.M.W. performed the analysis of individual features and interactions. M.K., D.M.W., G.M.C. and J.S. analyzed the model's performance on different data sets. G.M.C. analyzed the GWAS data. J.S., G.M.C., M.K. and D.M.W. wrote the manuscript with input from all authors.

Competing financial interests

The authors (M.K., D.M.W., G.M.C. and J.S.) have filed a provisional patent application with the US Patent and Trademark Office on the basis of CADD.

Supplementary information

PDF files

  1. Supplementary Text and Figures (4,118 KB)

    Supplementary Figures 1–18, Supplementary Tables 1–12 and Supplementary Note