Box 1: Sequencing coverage theory
Much of the original work on sequencing coverage stemmed from early genome mapping efforts. In 1988, Lander and Waterman (ref. 96) described the theoretical redundancy of coverage (c) as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length. The figure shows the theoretical coverage (shown as diagonal lines; c = 1× or 30×) according to the Lander–Waterman formula for human genome or exome sequencing. The coverage that is achieved by sequencing technologies according to the manufacturers' websites is also indicated (see the figure).
In practice, biases in sample preparation, sequencing, and genomic alignment and assembly can leave regions of the genome without coverage (that is, gaps) and other regions with much higher coverage than theoretically expected. GC-rich regions, such as CpG islands, are particularly prone to low depth of coverage, partly because these regions remain annealed during amplification (ref. 97). Consequently, it is important to assess the uniformity of coverage, and thus data quality, by calculating the variance in sequencing depth across the genome (ref. 98).
The term depth may also be used to describe how much of the complexity in a sequencing library has been sampled. All sequencing libraries contain finite pools of distinct DNA fragments, and in a sequencing experiment only some of these fragments are sampled. The number of distinct fragments sequenced is positively correlated with the depth of the true biological variation that has been sampled.
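The Lander–Waterman relationship in this box, c = LN/G, can be illustrated with a minimal Python sketch. The function names, the round genome length of 3,000,000,000 bp and the 100 bp read length are assumptions chosen for illustration and are not taken from the original text. The variance helper is one simple way of summarizing uniformity of depth and is not necessarily the measure used in the cited work.

```python
import math
import statistics


def theoretical_coverage(read_length, n_reads, genome_length):
    """Redundancy of coverage c = L * N / G (Lander-Waterman)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_length * n_reads / genome_length


def reads_for_coverage(target_coverage, read_length, genome_length):
    """Smallest whole number of reads N for which L * N / G >= target_coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_length / read_length)


def depth_variance(depths):
    """Population variance of per-position (or per-window) sequencing depth."""
    return statistics.pvariance(depths)


if __name__ == "__main__":
    G = 3_000_000_000  # assumed round figure for the haploid human genome (bp)
    L = 100            # hypothetical read length (bp)
    print(reads_for_coverage(30, L, G))             # reads needed for c = 30x
    print(theoretical_coverage(L, 900_000_000, G))  # coverage from 9e8 reads
    print(depth_variance([28, 30, 32]))             # toy per-window depths
```

Under these assumptions, theoretical 30× coverage of the genome needs 900 million 100 bp reads. As the box notes, the coverage actually achieved can deviate from this figure, which is why a measure of the spread of observed depth is worth reporting alongside it.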
Computational Genomics Analysis and Training Programme, Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, Le Gros Clark Building, University of Oxford, Parks Road, Oxford OX1 3PT, UK.
- David Sims,
- Ian Sudbery,
- Nicholas E. Ilott,
- Andreas Heger &
- Chris P. Ponting
Competing interests statement
The authors declare no competing interests.
David Sims is the lead scientist in the Medical Research Council Computational Genomics Analysis and Training (CGAT) programme at the University of Oxford, UK. He started his scientific career with a Ph.D. in bioinformatics and functional genomics from University College London, UK, and went on to work as a postdoctoral scientist in breast cancer genomics at the Institute of Cancer Research in London before taking up his current post in Oxford.
Ian Sudbery is a postdoctoral fellow with the Medical Research Council Computational Genomics Analysis and Training programme (CGAT) at the University of Oxford, UK. He obtained his Ph.D. in functional genomics at the Wellcome Trust Sanger Institute, Hinxton, UK, and then worked as a postdoctoral fellow on genome resequencing and gene-regulation projects at the Sanger Institute and at Harvard Medical School, Boston, Massachusetts, USA, before joining CGAT.
Nicholas E. Ilott is a postdoctoral fellow with the Medical Research Council Computational Genomics Analysis and Training programme (CGAT) at the University of Oxford, UK. He obtained his Ph.D. from King's College London, UK, in 2010, where he studied the influence of prenatal nicotine exposure on striatal gene expression in rats. Since joining CGAT in 2011, he has been using RNA sequencing data sets to understand the role of long non-coding RNAs in inflammation.
Andreas Heger is the Technical Director of the Medical Research Council Computational Genomics Analysis and Training programme (CGAT) at the University of Oxford, UK. Andreas obtained a Ph.D. in protein sequence analysis at the European Molecular Biology Laboratory–European Bioinformatics Institute, Cambridge, UK. After a postdoctoral position at the University of Helsinki, Finland, he joined Chris P. Ponting's laboratory at the Functional Genomics Unit, the University of Oxford, where he worked on comparative genomics in flies and mammals. He joined CGAT in 2010.
Chris P. Ponting is the Deputy Director of the Medical Research Council Functional Genomics Unit within the University of Oxford, UK, and the Director of the Computational Genomics Analysis and Training (CGAT) programme. His early research was on the evolution and the functions of protein domains, but the lure of newly sequenced human and mouse genomes proved irresistible and he has since been involved in numerous evolutionary functional genomics projects.
The average number of times that a particular nucleotide is represented in a collection of random raw sequences.
- Sequence capture
The enrichment of fragmented DNA or RNA species of interest by hybridization to a set of sequence-specific DNA or RNA oligonucleotides.
- GC bias
The difference between the observed GC content of sequenced reads and the expected GC content based on the reference sequence.
- Variant calling
The process of identifying consistent differences between the sequenced reads and the reference genome; these differences include single base substitutions, small insertions and deletions, and larger copy number variants.
- Low-complexity sequences
DNA regions that have a biased nucleotide composition and are enriched in simple sequence repeats.
- Clonal evolution
An iterative process of clonal expansion, genetic diversification and clonal selection that is thought to drive the evolution of cancers and to give rise to metastasis and resistance to therapy.
- Dynamic range
The range of expression levels over which genes and transcripts can be accurately quantified in gene expression analyses. In theory, RNA sequencing offers an infinite dynamic range, whereas microarrays are limited by the range of signal intensities.
- Long non-coding RNAs
(lncRNAs). RNA molecules that are transcribed from non-protein-coding loci; such RNAs are >200 nt in length and show no predicted protein-coding capacity.
- Cap analysis of gene expression
(CAGE). In contrast to RNA sequencing, CAGE produces short 'tag' sequences that represent the 5′ end of the RNA molecule. As CAGE does not sequence across an entire cDNA, it requires a lower depth of sequencing than RNA sequencing to quantify low-abundance transcripts.
- Spike-in control RNAs
A pool of RNA molecules of known length, sequence composition and abundance that is introduced into an experiment to assess the performance of the technique.
- Fragments per kilobase of exon per million reads mapped
(FPKM). A method for normalizing read counts over genes or transcripts. Read counts are first normalized by gene length and then by library size. After normalization, the expression value of each gene is less dependent on these variables.
- Saturation
In the context of sequence depth, the point at which the addition of extra reads to an analysis yields no improvement in the number of significant effects identified.
- Parametric methods
Methods that rely on assumptions regarding the distribution of sampled data. In differential expression analysis of RNA sequencing data, sampled reads are assumed to follow a Poisson or negative binomial distribution.
- CLIP–seq
(Crosslinking immunoprecipitation followed by sequencing). A method for interrogating RNA–protein interactions, in which RNAs are crosslinked to proteins by ultraviolet radiation and then fragmented. After immunoprecipitation of the protein of interest, the RNA is converted to cDNA and sequenced.
- iCLIP
(Individual nucleotide-resolution crosslinking and immunoprecipitation). An extension of CLIP–seq that provides single-nucleotide resolution. It relies on the fact that most cDNA synthesis reactions terminate at the crosslinked bases of the RNA; these prematurely terminated cDNAs are purified and sequenced.
- PAR-CLIP
(Photoactivatable-ribonucleoside-enhanced crosslinking immunoprecipitation). An extension of CLIP–seq, in which the photoactivatable uridine analogue 4SU is incorporated into RNA. Upon activation with ultraviolet radiation, these bases form covalent crosslinks with bound proteins. Following conversion to cDNA, uncrosslinked uridines become thymidines, whereas crosslinked uridines become cytosines, thus indicating the protein-binding sites in the RNA.
- CHART
(Capture hybridization analysis of RNA targets). A method that uses biotinylated oligonucleotides to pull down complementary RNAs (which are generally long non-coding RNAs) and their associated DNA after crosslinking. The resulting DNA is then sequenced to identify sequences that are associated with the RNA.
- ChIRP
(Chromatin isolation by RNA purification). A method to capture DNA that is associated with RNA (particularly long non-coding RNAs); it is based on a similar principle to CHART.
- DNase-seq
(DNase I hypersensitive site sequencing). A method to identify regions of open chromatin. Regions of open chromatin are sensitive to DNase I digestion, whereas regions of closed chromatin are not. Sequencing of fragment ends after DNase I digestion thus reveals the locations of open chromatin.
- MeDIP–seq
(Methylated DNA immunoprecipitation followed by sequencing). A method to identify regions of methylated DNA, in which chromatin immunoprecipitation is carried out using an antibody that recognizes methylated cytosine and the resulting immunoprecipitated DNA fragments are subjected to sequencing.
- CAP-seq
(CxxC affinity purification sequencing). A method to identify genomic regions that are enriched for unmethylated CpG dinucleotides on the basis of binding of the CxxC domain to such regions. A recombinant CxxC domain from the KDM2B protein is biotinylated and is bound to DNA. After fragmentation, DNA bound to the biotinylated CxxC domain is recovered and sequenced.
- Peaks
Regions of the genome with an enrichment of mapped reads compared with a control track or a local background. Produced by peak callers, these are often the output of location-based experiments.
- Point-source factor
A protein factor that yields narrow and localized peaks in chromatin immunoprecipitation followed by sequencing experiments, such as sequence-specific transcription factors or some modified histones that occur in localized regions.
- Broad-source factor
A protein factor or modification that marks extended genomic regions, such as many modified histones.
- Mixed-source factor
A protein factor or modification that produces peaks which are similar to those of both point-source and broad-source factors.
- Technical replicates
Replicates that are derived from the same initial biological sample (as opposed to biological replicates). The variation between two such samples will be due to the variation that is introduced by the technique used rather than the underlying variation in the biology.
- PCR duplicates
Pairs of reads that originated from the same molecule in the original biological sample and that are filtered out in many analyses.
- Library complexity
The number of unique biological molecules that are represented in a sequencing library.
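The FPKM normalization defined in the glossary above (read counts normalized first by gene length and then by library size) can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular tool; the function name, argument names and example numbers are hypothetical.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments.

    fragment_count: fragments assigned to the gene or transcript
    exon_length_bp: summed exon length of the gene or transcript (bp)
    total_mapped_fragments: all mapped fragments in the library
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and library size must be positive")
    # Step 1: normalize by gene length (fragments per kilobase of exon).
    per_kilobase = fragment_count / (exon_length_bp / 1_000)
    # Step 2: normalize by library size (per million mapped fragments).
    return per_kilobase / (total_mapped_fragments / 1_000_000)


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments over 2,000 bp of exon
    # in a library of 20 million mapped fragments.
    print(fpkm(500, 2_000, 20_000_000))
```

Because both divisions are simple rescalings, doubling the library size while doubling the fragment count leaves the value unchanged, which is the sense in which the normalized value is less dependent on gene length and library size.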