Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data

Key Points

  • Genome and exome sequencing yield extensive catalogues of genetic variation in many individuals, but purely genetic approaches are often insufficiently powered to specifically identify the few variants that are causally related to any given phenotype. Indeed, variant interpretation is an increasingly important challenge at the interface of genetics, statistics and biology.

  • Non-uniform estimates of the prior probability for variants to be biologically functional will be required to address this challenge. For disease studies, this can be translated into the need to estimate variant deleteriousness.

  • Nearly all computational methods to predict deleteriousness use comparative sequence analysis, exploiting the fact that natural selection removes deleterious variants and tends to conserve the identities of important positions within genes and genomes.

  • Assessment of protein-altering variants leverages both biochemical and evolutionary information, whereas non-coding variation is more challenging to study, given a lack of understanding of the molecular functionality of non-coding sequences relative to coding sequences.

  • Experimental assessments of the functional impact of variants have historically relied on low-throughput assays. However, projects such as the Encyclopedia of DNA Elements (ENCODE) and the clever use of next-generation sequencing technologies are increasingly facilitating large-scale, systematic experimental assessment of genomic variation of many types.

  • Ultimately, unified predictive methods that apply to both coding and non-coding variants, and that leverage both functional and evolutionary information, will be crucial for the meaningful interpretation of personal genomes. However, important unknowns and unsolved phenomena must first be addressed, including the relative abundance and penetrance of coding versus non-coding variants, disagreements between evolutionary and experimental definitions of molecular functionality, and the vocabularies that define transcriptional regulatory elements.
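
The role of a non-uniform prior in the points above can be illustrated with a minimal Python sketch. It is not a method from the article: the function names, the candidate labels, the prior values and the shared Bayes factor are all hypothetical, chosen only to show how identical genetic evidence leads to very different posterior probabilities once variants are given different prior probabilities of being deleterious.

```python
def posterior_causal(prior: float, bayes_factor: float) -> float:
    """Combine a prior probability with a Bayes factor (evidence from the
    genetic data) into a posterior probability, using the odds form of
    Bayes' rule: posterior odds = prior odds * Bayes factor."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if bayes_factor <= 0.0:
        raise ValueError("bayes_factor must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def rank_candidates(variants):
    """Sort (name, prior, bayes_factor) tuples by descending posterior."""
    scored = [(name, posterior_causal(prior, bf)) for name, prior, bf in variants]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: the genetic evidence (Bayes factor) is identical,
# but the annotation-derived priors are assumed values that differ by class.
candidates = [
    ("intergenic_unconserved", 0.0001, 20.0),
    ("missense_conserved", 0.10, 20.0),
    ("synonymous", 0.001, 20.0),
]

if __name__ == "__main__":
    for name, posterior in rank_candidates(candidates):
        print(f"{name}: {posterior:.4f}")
```

With a uniform prior all three candidates would be tied; the ranking here comes entirely from the assumed priors, which is the sense in which estimates of deleteriousness can add power to a purely genetic analysis.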

Abstract

Genome and exome sequencing yield extensive catalogues of human genetic variation. However, pinpointing the few phenotypically causal variants among the many variants present in human genomes remains a major challenge, particularly for rare and complex traits wherein genetic information alone is often insufficient. Here, we review approaches to estimate the deleteriousness of single nucleotide variants (SNVs), which can be used to prioritize disease-causal variants. We describe recent advances in comparative and functional genomics that enable systematic annotation of both coding and non-coding variants. Application and optimization of these methods will be essential to find the genetic answers that sequencing promises to hide in plain sight.

Figure 1: Assessing variant deleteriousness to boost discovery power of genetic analyses.
Figure 2: Functional and evolutionary annotations highlight disease variation at the HBB locus.
Figure 3: High-throughput experimental assessment of variant function.

Acknowledgements

We thank C. Brown and E. Stone for comments on an earlier draft and R. Patwardhan for sharing data.

Author information

Corresponding authors

Correspondence to Gregory M. Cooper or Jay Shendure.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Gregory M. Cooper's homepage

Jay Shendure's homepage

Encyclopedia of DNA Elements (ENCODE)

The Genome 10K Project

University of California, Santa Cruz (UCSC) Genome Bioinformatics

Human Gene Mutation Database (HGMD)

National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies

Online Mendelian Inheritance in Man (OMIM)

Glossary

Private

A genetic variant that is confined to a single individual, family or population.

Prior probability

Otherwise simply known as the 'prior', this is the probability of a hypothesis (or parameter value) without reference to the available data. Priors can be derived from first principles or be based on general knowledge or previous experiments.

Deleterious

A genetic variant that lowers the fitness of an organism: that is, it decreases survival or reproductive success.

Conserved

Shared identity of either protein or nucleotide sequences, which can be indicative of constraint.

Neutral

Sequences that are free to evolve in the absence of natural selection and are therefore subject only to random mutational and genetic drift processes.

Phylogenetic scope

The taxonomic range captured by a given comparative sequence analysis — for example, mammals or eukaryotes.

Constrained

Sequences that are under purifying selection to maintain function, which often, but not always, results in sequence conservation.
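
The distinction between 'conserved' and 'constrained' in the glossary can be made concrete with a small Python sketch. It computes a toy per-column identity score from a multiple sequence alignment; the function name and the example alignment are hypothetical. This measures conservation only. It is not GERP or any other published constraint score, because it ignores the phylogeny and the neutral substitution rate that are needed to infer constraint.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each alignment column, the fraction of sequences that
    carry the most common residue. Gap characters ('-') are never counted
    as the shared residue but still count in the denominator."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [base for base in column if base != "-"]
        if not residues:
            scores.append(0.0)  # column consists entirely of gaps
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical four-species alignment of a short region.
example_alignment = ["ACGT", "ACGA", "ACCT", "A-GT"]

if __name__ == "__main__":
    print(column_conservation(example_alignment))
```

A high identity score at a position is consistent with constraint but does not establish it: closely related species share sequence simply because too little time has passed for neutral changes to accumulate, which is why phylogenetic scope matters.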

About this article

Cite this article

Cooper, G., Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 12, 628–640 (2011). https://doi.org/10.1038/nrg3046

  • DOI: https://doi.org/10.1038/nrg3046
