Sequencing depth and coverage: key considerations in genomic analyses

Key Points

  • The average depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.

  • The breadth of coverage is the percentage of target bases that have been sequenced at least a given number of times.

  • Hybrid sequencing approaches are being introduced to overcome problems in genome assembly and in placing highly repetitive sequence in a genome.

  • For DNA resequencing studies, the required sequencing capacity depends on the size of the regions of interest, the types of variant and the disease model being studied.

  • The accuracy of variant calling is affected by sequence quality, uniformity of coverage and the threshold of false-discovery rate that is used.

  • The power to identify and accurately quantify RNA molecules is dependent on their lengths and abundance, and on the number of sequenced reads.

  • In human cells, 80% of transcripts that are expressed at >10 fragments per kilobase of exon per million reads mapped (FPKM) can be accurately quantified with ~36 million 100-bp paired-end sequenced reads.

  • Depth of coverage is affected by the accuracy of genome alignment algorithms and by the uniqueness or the 'mappability' of sequencing reads within a target genome.

  • Sequence depth influences the accuracy with which rare events can be quantified in RNA sequencing, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and other quantification-based assays.

  • Sequence depth must be traded off against the need for control samples and replicates.
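
The theoretical depth LN/G and the breadth of coverage defined in the key points can be illustrated with a minimal Python sketch. The function names, the per-base depth list and the example numbers (a 3-Gb haploid genome, 100-bp reads) are assumptions chosen for illustration and are not taken from the Review.

```python
def average_depth(read_length, n_reads, genome_length):
    """Theoretical average depth LN/G (L = read length, N = reads, G = haploid genome length)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_length * n_reads / genome_length


def breadth_of_coverage(per_base_depth, min_depth=1):
    """Percentage of target bases sequenced at least min_depth times."""
    depths = list(per_base_depth)
    if not depths:
        raise ValueError("per_base_depth must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


# Hypothetical example: 900 million 100-bp reads over a 3-Gb haploid genome.
depth = average_depth(read_length=100, n_reads=900_000_000, genome_length=3_000_000_000)

# Hypothetical per-base depths for a five-base target region.
breadth = breadth_of_coverage([0, 3, 10, 12, 30], min_depth=10)
```

Under these assumptions the theoretical depth is 30× and the breadth at a threshold of 10× is 60%. In practice the observed depth is usually lower and less uniform than LN/G suggests, because of mappability and alignment effects noted in the key points.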

Abstract

Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often limit the amount of sequence that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs, which include de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and chromosome conformation capture (3C)).

Figure 1: Sequencing depths for different applications.
Figure 2: The three different types of peaks in chromatin immunoprecipitation followed by sequencing experiments.

Acknowledgements

The Computational Genomics Analysis and Training Centre is funded by a UK Medical Research Council Strategic Award.

Author information

Corresponding authors

Correspondence to David Sims or Chris P. Ponting.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Depth

The average number of times that a particular nucleotide is represented in a collection of random raw sequences.

Sequence capture

The enrichment of fragmented DNA or RNA species of interest by hybridization to a set of sequence-specific DNA or RNA oligonucleotides.

GC bias

The difference between the observed GC content of sequenced reads and the expected GC content based on the reference sequence.

Variant calling

The process of identifying consistent differences between the sequenced reads and the reference genome; these differences include single base substitutions, small insertions and deletions, and larger copy number variants.

Low-complexity sequences

DNA regions that have a biased nucleotide composition, which are enriched with simple sequence repeats.

Clonal evolution

An iterative process of clonal expansion, genetic diversification and clonal selection that is thought to drive the evolution of cancers, which gives rise to metastasis and resistance to therapy.

Dynamic range

The range of expression levels over which genes and transcripts can be accurately quantified in gene expression analyses. In theory, RNA sequencing offers an infinite dynamic range, whereas microarrays are limited by the range of signal intensities.

Long non-coding RNAs

(lncRNAs). RNA molecules that are transcribed from non-protein-coding loci; such RNAs are >200 nt in length and show no predicted protein-coding capacity.

Cap analysis of gene expression

(CAGE). In contrast to RNA sequencing, CAGE produces short 'tag' sequences that represent the 5′ end of the RNA molecule. As CAGE does not sequence across an entire cDNA, it requires a lower depth of sequencing than RNA sequencing to quantify low-abundance transcripts.

Spike-in control RNAs

A pool of RNA molecules of known length, sequence composition and abundance that is introduced into an experiment to assess the performance of the technique.

Fragments per kilobase of exon per million reads mapped

(FPKM). A method for normalizing read counts over genes or transcripts. Read counts are first normalized by gene length and then by library size. After normalization, the expression value of each gene is less dependent on these variables.
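
The normalization described for FPKM can be written out as a short Python sketch. The function name, argument names and example values are hypothetical; the formula follows directly from the definition (fragments per kilobase of exon, per million mapped fragments).

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    per_kilobase = fragment_count / (exon_length_bp / 1_000)   # normalize by gene length
    return per_kilobase / (total_mapped_fragments / 1_000_000)  # normalize by library size


# Hypothetical gene: 500 fragments over 2 kb of exon in a library of 25 million mapped fragments.
value = fpkm(fragment_count=500, exon_length_bp=2_000, total_mapped_fragments=25_000_000)
```

With these illustrative numbers the gene sits at 10 FPKM: 250 fragments per kilobase, divided by 25 million-fragment units of library size.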

Saturation

In the context of sequence depth, the point at which the addition of extra reads to an analysis yields no improvement in the number of significant effects identified.

Parametric methods

Methods that rely on assumptions regarding the distribution of sampled data. In differential expression analysis of RNA sequencing data, sampled reads are assumed to follow a Poisson or negative binomial distribution.

CLIP–seq

(Crosslinking immunoprecipitation followed by sequencing). A method for interrogating RNA–protein interactions, in which RNAs are crosslinked to proteins by ultraviolet radiation and then fragmented. After immunoprecipitation of the protein of interest, the RNA is converted to cDNA and sequenced.

iCLIP

(Individual nucleotide-resolution crosslinking and immunoprecipitation). An extension of CLIP–seq that provides single-nucleotide resolution. It relies on the fact that most cDNA synthesis reactions terminate at the crosslinked bases of the RNA; these prematurely terminated cDNAs are purified and sequenced.

PAR–CLIP

(Photoactivatable-ribonucleoside-enhanced crosslinking immunoprecipitation). An extension of CLIP–seq, in which the photoactivatable uridine analogue 4-thiouridine (4SU) is incorporated into RNA. Upon activation with ultraviolet radiation, these bases form covalent crosslinks with bound proteins. Following conversion to cDNA, uncrosslinked uridines become thymidines, whereas crosslinked uridines become cytosines, thus indicating the protein-binding sites in the RNA.

CHART

(Capture hybridization analysis of RNA targets). A method that uses biotinylated oligonucleotides to pull down complementary RNAs (which are generally long non-coding RNAs) and their associated DNA after crosslinking. The resulting DNA is then sequenced to identify sequences that are associated with the RNA.

CHiRP

(Chromatin isolation by RNA purification). A method to capture DNA that is associated with RNA (particularly long non-coding RNAs); it is based on a similar principle to CHART.

DNaseI-seq

(DNase I hypersensitive site sequencing). A method to identify regions of open chromatin. Regions of open chromatin are sensitive to DNase I digestion, whereas regions of closed chromatin are not. Sequencing of fragment ends after DNase I digestion thus reveals the locations of open chromatin.

MeDIP–seq

(Methylated DNA immunoprecipitation followed by sequencing). A method to identify regions of methylated DNA, in which immunoprecipitation is carried out using an antibody that recognizes methylated cytosine and the resulting immunoprecipitated DNA fragments are sequenced.

CAP–seq

(CxxC affinity purification sequencing). A method to identify genomic regions that are enriched for unmethylated CpG dinucleotides on the basis of binding of the CxxC domain to such regions. A recombinant CxxC domain from the KDM2B protein is biotinylated and is bound to DNA. After fragmentation, DNA bound to the biotinylated CxxC domain is recovered and sequenced.

Peaks

Regions of the genome with an enrichment of mapped reads compared with a control track or a local background. Produced by peak callers, these are often the output of location-based experiments.

Point-source factor

A protein factor that yields narrow and localized peaks in chromatin immunoprecipitation followed by sequencing experiments, such as sequence-specific transcription factors or some modified histones that occur in localized regions.

Broad-source factor

A protein factor or modification that marks extended genomic regions, such as many modified histones.

Mixed-source factor

A protein factor or modification that produces peaks which are similar to those of both point-source and broad-source factors.

Technical replicates

Replicates that are derived from the same initial biological sample (as opposed to biological replicates). The variation between two such samples will be due to the variation that is introduced by the technique used rather than the underlying variation in the biology.

PCR duplicates

Pairs of reads that originated from the same molecule in the original biological sample and that are filtered out in many analyses.

Library complexity

The number of unique biological molecules that are represented in a sequencing library.
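
A rough way to relate PCR duplicates to library complexity is to count distinct alignment positions. The Python sketch below assumes, for illustration only, that reads sharing the same chromosome, start position and strand are duplicates of one original molecule; this is a common heuristic rather than the definition given in this glossary, and the function name and input format are hypothetical.

```python
from collections import Counter


def duplicate_summary(alignments):
    """Return (distinct positions, duplicate fraction) for (chrom, start, strand) tuples.

    Assumption: reads with identical chromosome, start and strand come from the
    same original molecule. Real pipelines use more information than this.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    unique = len(counts)
    duplicate_fraction = (total - unique) / total if total else 0.0
    return unique, duplicate_fraction


# Hypothetical alignments: the first two reads share a position and strand.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "-"), ("chr2", 5, "+")]
unique, dup_fraction = duplicate_summary(reads)
```

Here three distinct molecules are inferred from four reads, giving a duplicate fraction of 0.25. The count of distinct positions is only a lower-bound proxy for library complexity, because it reflects the molecules that happened to be sampled at the current depth.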

About this article

Cite this article

Sims, D., Sudbery, I., Ilott, N. et al. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15, 121–132 (2014). https://doi.org/10.1038/nrg3642
