Review Article

Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data

Nature Reviews Genetics volume 12, pages 628–640 (2011)

Abstract

Genome and exome sequencing yield extensive catalogues of human genetic variation. However, pinpointing the few phenotypically causal variants among the many variants present in human genomes remains a major challenge, particularly for rare and complex traits wherein genetic information alone is often insufficient. Here, we review approaches to estimate the deleteriousness of single nucleotide variants (SNVs), which can be used to prioritize disease-causal variants. We describe recent advances in comparative and functional genomics that enable systematic annotation of both coding and non-coding variants. Application and optimization of these methods will be essential to find the genetic answers that sequencing promises to hide in plain sight.

Key points

  • Genome and exome sequencing yield extensive catalogues of genetic variation in many individuals, but purely genetic approaches are often insufficiently powered to specifically identify the few variants that are causally related to any given phenotype. Indeed, variant interpretation is an increasingly important challenge at the interface of genetics, statistics and biology.

  • Non-uniform estimates of the prior probability for variants to be biologically functional will be required to address this challenge. For disease studies, this can be translated into the need to estimate variant deleteriousness.

  • Nearly all computational methods to predict deleteriousness use comparative sequence analysis, exploiting the fact that natural selection removes deleterious variants and tends to conserve the identities of important positions within genes and genomes.

  • Assessment of protein-altering variants leverages both biochemical and evolutionary information, whereas non-coding variation is more challenging to study, given a lack of understanding of the molecular functionality of non-coding sequences relative to coding sequences.

  • Experimental assessments of the functional impact of variants have historically relied on low-throughput assays. However, projects such as the Encyclopedia of DNA Elements (ENCODE) and the clever use of next-generation sequencing technologies are increasingly facilitating large-scale, systematic experimental assessment of genomic variation of many types.

  • Ultimately, unified predictive methods that are applicable to both coding and non-coding variants that leverage both functional and evolutionary information will be crucial for the meaningful interpretation of personal genomes. However, important unknowns and unsolved phenomena, including the relative abundance and penetrance of coding versus non-coding variants, disagreements between evolutionary and experimental definitions of molecular functionality, and the vocabularies that define transcriptional regulatory elements, must first be addressed.
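The role of prior probability raised in the key points can be illustrated with a small calculation. The following Python sketch is only an illustration and not a method from this article: it uses the odds form of Bayes' rule to combine an assumed prior probability that a variant is causal with an assumed Bayes factor summarizing the genetic evidence. The function name, the variant labels and all numerical values are hypothetical.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Return P(causal | data) from a prior and a Bayes factor (odds form of Bayes' rule)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors for three variant classes (illustrative values only).
hypothetical_priors = {
    "missense_conserved": 0.10,
    "synonymous": 0.01,
    "intergenic": 0.001,
}

# Assume the same strength of genetic evidence for every variant.
assumed_bayes_factor = 20.0

# Rank variants by posterior probability, highest first.
ranked = sorted(
    ((name, posterior_probability(prior, assumed_bayes_factor))
     for name, prior in hypothetical_priors.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, posterior in ranked:
    print(f"{name}: posterior = {posterior:.3f}")
```

With identical genetic evidence, the variants are ranked entirely by their priors, which is why non-uniform estimates of deleteriousness can help to prioritize candidates when genetic data alone are underpowered.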



Acknowledgements

We thank C. Brown and E. Stone for comments on an earlier draft and R. Patwardhan for sharing data.

Author information

Affiliations

  1. HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA.

    • Gregory M. Cooper
  2. Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA.

    • Jay Shendure

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Gregory M. Cooper or Jay Shendure.

Glossary

Private

A genetic variant that is confined to a single individual, family or population.

Prior probability

Otherwise simply known as the 'prior', this is the probability of a hypothesis (or parameter value) without reference to the available data. Priors can be derived from first principles or be based on general knowledge or previous experiments.

Deleterious

A genetic variant that lowers the fitness of an organism: that is, it decreases survival or reproductive success.

Conserved

Shared identity of either protein or nucleotide sequences, which can be indicative of constraint.

Neutral

Sequences that are free to evolve in the absence of natural selection and are therefore subject only to random mutational and genetic drift processes.

Phylogenetic scope

The taxonomic range captured by a given comparative sequence analysis — for example, mammals or eukaryotes.

Constrained

Sequences that are under purifying selection to maintain function, which often, but not always, results in sequence conservation.
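The distinction between the glossary terms 'conserved' and 'constrained' can be made concrete with a toy calculation. The Python sketch below is a deliberately simplified illustration, assuming a small gap-free alignment whose first row is the reference sequence: it reports, for each column, the fraction of other sequences that share the reference base. It measures conservation only; it is not any published constraint score, and methods that infer constraint compare observed changes with a neutral expectation on a phylogeny. The function name and the example sequences are hypothetical.

```python
from typing import List


def column_identity(alignment: List[str]) -> List[float]:
    """Per column, the fraction of non-reference rows matching the reference (first row)."""
    if len(alignment) < 2:
        raise ValueError("need a reference plus at least one other sequence")
    reference = alignment[0]
    others = alignment[1:]
    if any(len(seq) != len(reference) for seq in others):
        raise ValueError("all sequences must have the same aligned length")
    scores = []
    for i, ref_base in enumerate(reference):
        matches = sum(1 for seq in others if seq[i] == ref_base)
        scores.append(matches / len(others))
    return scores


# Hypothetical three-species alignment; the first row is the reference.
toy_alignment = ["ACGT", "ACGA", "ACTA"]
identity = column_identity(toy_alignment)
print(identity)
```

High identity at a position is consistent with constraint but does not establish it, because closely related species share bases by descent even at neutral positions.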

About this article

DOI

https://doi.org/10.1038/nrg3046
