Sequencing studies in human genetics: design and interpretation

Key Points

  • The interpretation of next-generation sequencing data is technically and conceptually much more challenging than the interpretation of the data used in genome-wide association studies.

  • Minimizing false-positive signals in sequencing studies depends on careful management of the overall workflow and, in particular, on appropriate statistical criteria used to support claims of significant association.

  • One key feature in the interpretation of sequence data is that most researchers currently distinguish among variants in their prior probabilities of influencing disease, either implicitly or explicitly. Considerable development of appropriate ways to do this, however, is still required.

  • Population genetic, phylogenetic and other data sources can help to establish frameworks for distinguishing among the prior probabilities of variants influencing disease.

  • Although establishing appropriate statistical criteria for interpreting sequence data remains a work in progress, good study designs mandate careful consideration of, and appropriate correction for, the real number of tests that are inherent in any given study design.

  • Interpretation of sequence data should always take into account the narrative potential that is inherent in any human genome, in that all genomes carry many functional and probably deleterious (in an evolutionary sense) rare variants that could be used to argue that the mutations influence traits of interest.

  • Whereas functional characterization of pathogenic mutations is essential in order to derive translational benefits from genetic discoveries, functional characterization should not be used to buttress weak statistical arguments for pathogenicity. In general, with only narrowly defined exceptions, evidence of pathogenicity should come from the genetics alone.
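
As a rough illustration of correcting for the real number of tests, the sketch below applies a Bonferroni adjustment. This is one simple and conservative choice and is not prescribed by the key points above. The test counts (about 20,000 genes and three hypothetical variant-qualifying models per gene) are assumptions chosen only for the example.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Hypothetical study: gene-based tests across roughly 20,000 genes,
# each repeated under 3 variant-qualifying models (assumed numbers).
n_genes = 20_000
n_models = 3
threshold_one_model = bonferroni_threshold(0.05, n_genes)
threshold_all_models = bonferroni_threshold(0.05, n_genes * n_models)

print(f"one model:  p < {threshold_one_model:.2e}")
print(f"all models: p < {threshold_all_models:.2e}")
```

Every additional model, phenotype definition or filter that is tried adds to the test count, whether or not it is eventually reported, so the stricter threshold is the relevant one in this hypothetical design.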

Abstract

Next-generation sequencing is becoming the primary discovery tool in human genetics. There have been many clear successes in identifying genes that are responsible for Mendelian diseases, and sequencing approaches are now poised to identify the mutations that cause undiagnosed childhood genetic diseases and those that predispose individuals to more common complex diseases. There are, however, growing concerns that the complexity and magnitude of complete sequence data could lead to an explosion of weakly justified claims of association between genetic variants and disease. Here, we provide an overview of the basic workflow in next-generation sequencing studies and emphasize, where possible, measures and considerations that facilitate accurate inferences from human sequencing studies.

Acknowledgements

The authors thank the reviewers for their helpful comments. D.B.G. thanks L. Biesecker (NHGRI) for helpful discussions that contributed to the development of this Review. S.P. is a National Health and Medical Research Council (NHMRC) CJ Martin Fellow.

Author information

Corresponding author

Correspondence to David B. Goldstein.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (box)

To provide a conservative assessment of narrative potential in a typical exome we used whole-exome sequence data from a 'healthy' control female with no reported disease diagnosis. (PDF 207 kb)

Supplementary information S2 (box)

Putative LGD homozygous variants (PDF 166 kb)

Supplementary information S3 (table)

Five putative LGD variants (AF ≤ 1%) in genes reported within OMIM database. (PDF 193 kb)

Supplementary information S4 (table)

Putative missense variants (AF ≤ 1%; meeting damaging score in at least three of the four algorithms used for this narrative-potential illustration) in genes reported within OMIM database. (PDF 244 kb)

Glossary

Priors

Used to reflect assumptions about the involvement of different classes of mutations before the evidence available from a given study is considered.
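
A minimal sketch of how such priors could enter an analysis, assuming the odds form of Bayes' rule: the prior probability for a variant class is combined with the evidence from a study, summarized here as a Bayes factor. The class labels, prior values and Bayes factor are hypothetical and serve only to show the arithmetic.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Update a prior probability with a Bayes factor (odds form of Bayes' rule)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical priors for two variant classes, given the same study evidence.
evidence = 100.0  # assumed Bayes factor from the study data
for label, prior in [("loss-of-function", 0.01), ("synonymous", 0.0001)]:
    print(label, round(posterior_probability(prior, evidence), 4))
```

With identical evidence, the class with the higher prior ends up with a much higher posterior probability, which is why the choice of priors needs to be made explicit.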

Cluster density

The density of clonal double-stranded DNA fragment clusters bound to an Illumina flow cell, typically expressed as clusters per mm². It is used as a quality-control metric early during the sequencing reaction: low cluster densities will result in a lower sequencing yield in the resulting fastq library, whereas very high cluster densities will result in poor sequence quality.

Locus heterogeneity

Refers to the number of different genes in the genome that can carry mutations that influence risk of a given disease.

Allelic heterogeneity

Refers to the number of different mutations at a single gene that can influence risk of disease.

Structural variation

Occurs in DNA regions generally greater than 1 kb in size, and includes genomic imbalances (namely, insertions and deletions (also known as copy number variants)), inversions and translocations.

De novo mutations

Non-inherited novel mutations in an individual that result from a germline mutation.

Indel

A form of genetic variation, distinct from single-nucleotide variants, that consists of small insertion and deletion mutations.

Insert size

The length of the fragmented sequence between ligated adaptors. In paired-end sequencing, the insert size generally ranges from 200 to 500 bp.

Batch effects

Differences observed for samples that are experimentally handled in different ways that are unrelated to the biological or scientific variables being studied. If batch effects are not properly accounted for in sequence studies, they can generate false signals of association between genetic variation and the traits under study.

Library

The collection of processed genome fragments that are prepared for sequencing. In a bioinformatics context, the term may also generally refer to the set of sequences found in a single fastq file.

Variant call format files

(VCF files). A flexible text file format developed within the 1000 Genomes Project that contains data specific to one or more genomic sites, including site coordinates, reference allele, observed alternative allele (or alleles) and base-call quality metrics (see Further information).
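
To make the layout concrete, the sketch below parses the eight fixed, tab-separated columns of a single VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The `VcfSite` type, the `parse_vcf_line` helper and the example line are hypothetical; real analyses would normally rely on a dedicated VCF library rather than hand-written parsing.

```python
from typing import NamedTuple

class VcfSite(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: list
    qual: str   # kept as text because QUAL may be "." (missing)
    info: dict

def parse_vcf_line(line: str) -> VcfSite:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    chrom, pos, _id, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_fields[key] = value if value else True  # flag entries carry no value
    return VcfSite(chrom, int(pos), ref, alt.split(","), qual, info_fields)

# Made-up example line with two alternative alleles and three INFO entries.
example = "1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;AF=0.5,0.1;DB"
site = parse_vcf_line(example)
print(site.chrom, site.pos, site.ref, site.alts, site.info["DP"])
```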

Polymorphism-to-divergence ratios

Comparing sequence divergence across species with population polymorphism data (for example, McDonald–Kreitman test) facilitates identifying where selective forces are acting on the genomic sequence.
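
A minimal sketch of the McDonald–Kreitman comparison, assuming the usual 2×2 layout of nonsynonymous and synonymous counts for fixed differences (Dn, Ds) and polymorphisms (Pn, Ps). The counts are illustrative values assumed for the example, and the helper names are hypothetical. Fisher's exact test is implemented directly from the hypergeometric distribution so that the block needs only the standard library.

```python
from math import comb

def neutrality_index(pn: int, ps: int, dn: int, ds: int) -> float:
    """Ratio of the polymorphism ratio (Pn/Ps) to the divergence ratio (Dn/Ds)."""
    return (pn / ps) / (dn / ds)

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k in the top-left cell.
        return comb(row1, k) * comb(total - row1, col1 - k) / comb(total, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - (total - row1)), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-9))

# Illustrative counts (assumed): fixed differences and polymorphisms.
dn, ds, pn, ps = 7, 17, 2, 42
ni = neutrality_index(pn, ps, dn, ds)
p_value = fisher_exact_two_sided(dn, ds, pn, ps)
print(f"neutrality index = {ni:.3f}, Fisher p = {p_value:.4f}")
```

A neutrality index well below 1, as in this made-up table, means a relative excess of nonsynonymous fixed differences, which is the pattern usually read as evidence of positive selection. Values above 1 point towards an excess of nonsynonymous polymorphism.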

Site frequency spectra

Distributions of allele frequencies, defined by the number of sites that have each of the possible allele frequencies. Different forms of selection perturb the site frequency spectrum in known ways.
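
The definition can be sketched in a few lines: for each site, count the chromosomes carrying the derived allele, then tally how many sites fall into each count class. The data matrix and the `site_frequency_spectrum` helper are hypothetical and assume that the derived allele is known at every site.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes_by_site, n_chromosomes):
    """Entry k of the result is the number of sites whose derived allele is seen on k chromosomes."""
    counts = Counter(sum(site) for site in haplotypes_by_site)
    return [counts.get(k, 0) for k in range(n_chromosomes + 1)]

# Hypothetical data: 5 sites scored on 4 chromosomes (1 = derived allele).
sites = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]
sfs = site_frequency_spectrum(sites, n_chromosomes=4)
print(sfs)
```

An excess of low-frequency classes relative to a neutral expectation is, for example, the kind of shift associated with purifying selection, although demographic history can produce similar patterns.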

About this article

Cite this article

Goldstein, D., Allen, A., Keebler, J. et al. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet 14, 460–470 (2013). https://doi.org/10.1038/nrg3455

  • DOI: https://doi.org/10.1038/nrg3455
