Sequencing depth and coverage: key considerations in genomic analyses

Sims, David; Sudbery, Ian; Ilott, Nicholas E.; Heger, Andreas; Ponting, Chris P.

doi:10.1038/nrg3642

Review Article
Published: 17 January 2014

Nature Reviews Genetics volume 15, pages 121–132 (2014)

Key Points

The average depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.
The breadth of coverage is the percentage of target bases that have been sequenced a given number of times.
Hybrid sequencing approaches are being introduced to overcome problems in genome assembly and in placing highly repetitive sequence in a genome.
For DNA resequencing studies, the required sequencing capacity depends on the size of the regions of interest, the types of variant and the disease model being studied.
The accuracy of variant calling is affected by sequence quality, uniformity of coverage and the threshold of false-discovery rate that is used.
The power to identify and accurately quantify RNA molecules is dependent on their lengths and abundance, and on the number of sequenced reads.
In human cells, 80% of transcripts that are expressed at >10 fragments per kilobase of exon per million reads mapped (FPKM) can be accurately quantified with ~36 million 100-bp paired-end sequenced reads.
Depth of coverage is affected by the accuracy of genome alignment algorithms and by the uniqueness or the 'mappability' of sequencing reads within a target genome.
Sequence depth influences the accuracy by which rare events can be quantified in RNA sequencing, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and other quantification-based assays.
Sequence depth must be traded off against the need for control samples and replicates.
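
The first two key points can be illustrated with a short calculation. The following is a minimal sketch, not a method taken from the Review: the function names and the example numbers are hypothetical, and breadth is interpreted here as the percentage of bases covered at least a chosen number of times, which is an assumption about the intended threshold.

```python
from typing import Sequence


def mean_depth(read_length: int, n_reads: int, genome_length: int) -> float:
    """Theoretical average depth LN/G (L = read length, N = number of reads,
    G = haploid genome length)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_length * n_reads / genome_length


def breadth_of_coverage(per_base_depth: Sequence[int], min_depth: int = 1) -> float:
    """Percentage of target bases sequenced at least `min_depth` times.

    Assumption: "a given number of times" is read as a minimum threshold.
    """
    if not per_base_depth:
        raise ValueError("per_base_depth must not be empty")
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return 100.0 * covered / len(per_base_depth)


# Hypothetical example: 900 million 100-bp reads on a 3-Gb haploid genome.
depth = mean_depth(read_length=100, n_reads=900_000_000, genome_length=3_000_000_000)

# Hypothetical per-base depths over an 8-base target region.
breadth = breadth_of_coverage([0, 3, 12, 30, 31, 8, 0, 15], min_depth=10)

print(f"mean depth: {depth:.1f}x, breadth at >=10x: {breadth:.1f}%")
```

The two numbers answer different questions: a high average depth does not guarantee a high breadth if coverage is uneven across the target.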

Abstract

Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often limit the amount of sequence that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs: de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and chromosome conformation capture (3C)).

**Figure 1: Sequencing depths for different applications.**

**Figure 2: The three different types of peaks in chromatin immunoprecipitation followed by sequencing experiments.**

Acknowledgements

The Computational Genomics Analysis and Training Centre is funded by a UK Medical Research Council Strategic Award.

Author information

Authors and Affiliations

Department of Physiology, Computational Genomics Analysis and Training Programme, Medical Research Council Functional Genomics Unit, Anatomy and Genetics, Le Gros Clark Building, University of Oxford, Parks Road, Oxford, OX1 3PT, UK
David Sims, Ian Sudbery, Nicholas E. Ilott, Andreas Heger & Chris P. Ponting

Corresponding authors

Correspondence to David Sims or Chris P. Ponting.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Depth: The average number of times that a particular nucleotide is represented in a collection of random raw sequences.
Sequence capture: The enrichment of fragmented DNA or RNA species of interest by hybridization to a set of sequence-specific DNA or RNA oligonucleotides.
GC bias: The difference between the observed GC content of sequenced reads and the expected GC content based on the reference sequence.
Variant calling: The process of identifying consistent differences between the sequenced reads and the reference genome; these differences include single base substitutions, small insertions and deletions, and larger copy number variants.
Low-complexity sequences: DNA regions that have a biased nucleotide composition, which are enriched with simple sequence repeats.
Clonal evolution: An iterative process of clonal expansion, genetic diversification and clonal selection that is thought to drive the evolution of cancers, which gives rise to metastasis and resistance to therapy.
Dynamic range: The range of expression levels over which genes and transcripts can be accurately quantified in gene expression analyses. In theory, RNA sequencing offers an infinite dynamic range, whereas microarrays are limited by the range of signal intensities.
Long non-coding RNAs (lncRNAs): RNA molecules that are transcribed from non-protein-coding loci; such RNAs are >200 nt in length and show no predicted protein-coding capacity.
Cap analysis of gene expression (CAGE): In contrast to RNA sequencing, CAGE produces short 'tag' sequences that represent the 5′ end of the RNA molecule. As CAGE does not sequence across an entire cDNA, it requires a lower depth of sequencing than RNA sequencing to quantify low-abundance transcripts.
Spike-in control RNAs: A pool of RNA molecules of known length, sequence composition and abundance that is introduced into an experiment to assess the performance of the technique.
Fragments per kilobase of exon per million reads mapped (FPKM): A method for normalizing read counts over genes or transcripts. Read counts are first normalized by gene length and then by library size. After normalization, the expression value of each gene is less dependent on these variables.
Saturation: In the context of sequence depth, the point at which the addition of extra reads to an analysis yields no improvement in the number of significant effects identified.
Parametric methods: Methods that rely on assumptions regarding the distribution of sampled data. In differential expression analysis of RNA sequencing data, sampled reads are assumed to follow a Poisson or negative binomial distribution.
CLIP–seq (crosslinking immunoprecipitation followed by sequencing): A method for interrogating RNA–protein interactions, in which RNAs are crosslinked to proteins by ultraviolet radiation and then fragmented. After immunoprecipitation of the protein of interest, the RNA is converted to cDNA and sequenced.
iCLIP (individual nucleotide-resolution crosslinking and immunoprecipitation): An extension of CLIP–seq that produces base-pair resolution. It relies on the fact that most cDNA synthesis reactions terminate at the crosslinked bases of the RNA; the resulting truncated cDNAs are purified and sequenced.
PAR–CLIP (photoactivatable-ribonucleoside-enhanced crosslinking immunoprecipitation): An extension of CLIP–seq, in which the photoactivatable uridine analogue 4-thiouridine (4SU) is incorporated into RNA. Upon activation with ultraviolet radiation, these bases form covalent crosslinks with bound proteins. Following conversion to cDNA, uncrosslinked uridines become thymidines, whereas crosslinked uridines become cytosines, thus indicating the protein-binding sites in the RNA.
CHART (capture hybridization analysis of RNA targets): A method that uses biotinylated oligonucleotides to pull down complementary RNAs (which are generally long non-coding RNAs) and their associated DNA after crosslinking. The resulting DNA is then sequenced to identify sequences that are associated with the RNA.
CHiRP (chromatin isolation by RNA purification): A method to capture DNA that is associated with RNA (particularly long non-coding RNAs); it is based on a similar principle to CHART.
DNaseI-seq (DNase I hypersensitive site sequencing): A method to identify regions of open chromatin. Regions of open chromatin are sensitive to DNase I digestion, whereas regions of closed chromatin are not. Sequencing of fragment ends after DNase I digestion thus reveals the locations of open chromatin.
MeDIP–seq (methylated DNA immunoprecipitation followed by sequencing): A method to identify regions of methylated DNA, in which chromatin immunoprecipitation is carried out using an antibody that recognizes methylated cytosine and the resulting immunoprecipitated DNA fragments are subjected to sequencing.
CAP–seq (CxxC affinity purification sequencing): A method to identify genomic regions that are enriched for unmethylated CpG dinucleotides on the basis of binding of the CxxC domain to such regions. A recombinant CxxC domain from the KDM2B protein is biotinylated and is bound to DNA. After fragmentation, DNA bound to the biotinylated CxxC domain is recovered and sequenced.
Peaks: Regions of the genome with an enrichment of mapped reads compared with a control track or a local background. Produced by peak callers, these are often the output of location-based experiments.
Point-source factor: A protein factor that yields narrow and localized peaks in chromatin immunoprecipitation followed by sequencing experiments, such as sequence-specific transcription factors or some modified histones that occur in localized regions.
Broad-source factor: A protein factor or modification that marks extended genomic regions, such as many modified histones.
Mixed-source factor: A protein factor or modification that produces peaks which are similar to those of both point-source and broad-source factors.
Technical replicates: Replicates that are derived from the same initial biological sample (as opposed to biological replicates). The variation between two such samples will be due to the variation that is introduced by the technique used rather than the underlying variation in the biology.
PCR duplicates: Pairs of reads that originated from the same molecule in the original biological sample and that are filtered out in many analyses.
Library complexity: The number of unique biological molecules that are represented in a sequencing library.
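
Two of the glossary quantities lend themselves to a small worked example: FPKM normalization and the counting of distinct fragments as a rough proxy for library complexity. The sketch below is illustrative only. The function names and input values are hypothetical, and treating fragments with identical coordinates and strand as PCR duplicates is a common heuristic assumed here, not a procedure described in the glossary.

```python
from typing import Iterable, Tuple

# A fragment is identified by (chromosome, start, end, strand).
Fragment = Tuple[str, int, int, str]


def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Normalize a fragment count by gene length (per kb), then by
    library size (per million mapped fragments)."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and library size must be positive")
    per_kilobase = fragments / (exon_length_bp / 1_000)
    return per_kilobase / (total_mapped_fragments / 1_000_000)


def distinct_and_duplicates(fragments: Iterable[Fragment]) -> Tuple[int, int]:
    """Return (distinct fragments, presumed duplicates).

    Assumption: fragments sharing coordinates and strand came from the
    same original molecule. The distinct count is only a lower bound on
    the true library complexity, because unsequenced molecules are unseen.
    """
    seen = set()
    total = 0
    for fragment in fragments:
        total += 1
        seen.add(fragment)
    return len(seen), total - len(seen)


# Hypothetical gene: 500 fragments over 2 kb of exon, 25 million mapped fragments.
expression = fpkm(fragments=500, exon_length_bp=2_000, total_mapped_fragments=25_000_000)

# Hypothetical alignments; the first two share coordinates and strand.
observed = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "-"),
    ("chr1", 420, 690, "+"),
    ("chr2", 100, 350, "+"),
]
n_distinct, n_duplicates = distinct_and_duplicates(observed)

print(f"FPKM: {expression:.1f}; distinct: {n_distinct}; duplicates: {n_duplicates}")
```

In this example the gene sits exactly at the 10 FPKM level mentioned in the key points, and one of the five observed fragments is flagged as a presumed duplicate.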

About this article

Cite this article

Sims, D., Sudbery, I., Ilott, N. et al. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15, 121–132 (2014). https://doi.org/10.1038/nrg3642

Published: 17 January 2014
Issue Date: February 2014
DOI: https://doi.org/10.1038/nrg3642
