Review Article

Identifying and mitigating bias in next-generation sequencing methods for chromatin biology

Nature Reviews Genetics volume 15, pages 709–721 (2014)

Abstract

Next-generation sequencing (NGS) technologies have been used in diverse ways to investigate various aspects of chromatin biology by identifying genomic loci that are bound by transcription factors, occupied by nucleosomes or accessible to nuclease cleavage, or loci that physically interact with remote genomic loci. However, reaching sound biological conclusions from such NGS enrichment profiles requires many potential biases to be taken into account. In this Review, we discuss common ways in which biases may be introduced into NGS chromatin profiling data, approaches to diagnose these biases and analytical techniques to mitigate their effect.

Key points

  • In next-generation sequencing (NGS) chromatin profiling experiments, technical artefacts may be introduced at any stage, most importantly during DNA fragmentation, selection of the fragment population of interest, DNA amplification, DNA sequencing itself and read mapping to a reference genome.

  • The effect of technical biases on experimental results will depend, to a large extent, on the genomic scale of the feature being analysed and the scale on which the bias is manifested. Bias will have the greatest effect when the length scale of the bias is similar to the scale of the feature.

  • Genomic experiments should be planned to recognize the potential confounding effects of biases and the limits of the technology. Proper controls to understand and characterize the potential biases in chromatin profiling should be included and sequenced to sufficient depth in such experiments.

  • Nuclease-induced fragmentation is usually biased by DNA sequence in ways that can produce patterns that might seem to have biological importance.

  • Basic principles of statistical analysis should be applied to the analysis of chromatin profiling experiments: variability and bias should be taken into account, and the fit of statistical models to observed data should be characterized.
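The point about sequence-biased nuclease fragmentation can be made concrete with a small diagnostic. The following is a minimal sketch, not a method taken from this Review: it assumes cut positions are available from a control digestion and compares how often each k-mer spans a cut site with how often it occurs in the reference sequence. The function name `kmer_cleavage_bias`, its parameters and the choice of window width are all hypothetical.

```python
from collections import Counter


def kmer_cleavage_bias(sequence, cut_sites, k=4):
    """Ratio of k-mer frequency at cut sites to its background frequency.

    sequence  : reference DNA string (hypothetical input)
    cut_sites : 0-based positions; a cut at i lies between bases i-1 and i
    k         : even window width centred on the cut (an assumed choice)
    """
    half = k // 2
    sequence = sequence.upper()

    # Background: every k-mer in the sequence, counted once per position.
    background = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )

    # Observed: the k-mer spanning each cut site; cuts too close to
    # either end of the sequence are skipped.
    observed = Counter()
    for pos in cut_sites:
        start = pos - half
        if start >= 0 and start + k <= len(sequence):
            observed[sequence[start:start + k]] += 1

    total_obs = sum(observed.values())
    total_bg = sum(background.values())
    if total_obs == 0:
        return {}

    # A ratio above 1 means the k-mer is cut more often than expected
    # from its abundance alone.
    return {
        kmer: (observed[kmer] / total_obs) / (background[kmer] / total_bg)
        for kmer in observed
    }
```

Ratios far from 1 in a control sample would suggest that apparent cleavage patterns in the experimental sample may partly reflect sequence preference rather than chromatin structure.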



Acknowledgements

The authors thank members of the laboratories of X.S.L. and M. Brown for their discussions. This work is supported by the US National Institutes of Health grant R01GM099409.

Author information

Affiliations

  1. Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, Massachusetts 02115, USA; and Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA.

    • Clifford A. Meyer
    • X. Shirley Liu


Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Clifford A. Meyer or X. Shirley Liu.

Glossary

ChIP–seq

(Chromatin immunoprecipitation followed by next-generation DNA sequencing). A method to identify DNA-associated protein-binding sites.

MNase-seq

A method in which micrococcal nuclease (MNase) digestion of chromatin is followed by next-generation sequencing to identify loci of high nucleosome occupancy.

FAIRE–seq

(Formaldehyde-assisted isolation of regulatory elements followed by sequencing). A method to determine regulatory regions of the genome.

DNase-seq

A method in which DNase I digestion of chromatin is combined with next-generation sequencing to identify regulatory regions of the genome, including enhancers and promoters.

Hi-C

An extension of chromosome conformation capture that uses next-generation sequencing to observe long-range interaction frequencies between different regions of the genome.

ChIA-PET

(Chromatin interaction analysis by paired-end tag sequencing). A method that combines chromatin immunoprecipitation-based enrichment and chromatin proximity ligation with paired-end next-generation sequencing to determine genome-wide chromatin interactions.

ATAC-seq

(Assay for transposase-accessible chromatin using sequencing). A method that combines next-generation sequencing with in vitro transposition of sequencing adapters into native chromatin.

Random barcoding

A technique that ligates a diverse assortment of short random DNA sequences to an unamplified DNA sample, which can be used to distinguish duplicates produced by PCR from those originating from the unamplified DNA.
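As a rough illustration of how random barcodes separate PCR duplicates from independent molecules, the sketch below groups reads by mapping position and barcode and keeps one read per group. The read representation (dicts with `chrom`, `pos`, `strand` and `barcode` keys) and the function name are hypothetical, and barcode sequencing errors are ignored for simplicity.

```python
from collections import defaultdict


def collapse_pcr_duplicates(reads):
    """Keep one read per (chrom, pos, strand, barcode) combination.

    reads: iterable of dicts with hypothetical keys
           'chrom', 'pos', 'strand' and 'barcode'.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["barcode"])
        groups[key].append(read)

    # Reads sharing both position and barcode are treated as PCR copies
    # of one original molecule. Reads at the same position with
    # different barcodes are kept as independent molecules.
    return [members[0] for members in groups.values()]
```

Without barcodes, all reads at the same position would have to be either kept or collapsed together; with them, only the reads that share a barcode are collapsed.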

Spike-in

Controls that are known quantities of readily identifiable nucleic acids, which are added to a sample prior to critical steps in an experimental protocol. Such controls may be used for bias assessment and calibration purposes.
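One simple way spike-in controls might be used for calibration is sketched below. It assumes that the same amount of spike-in material was added to every sample, so that differences in spike-in read counts reflect technical rather than biological variation. The function names and the choice of the smallest count as the reference are illustrative assumptions, not a published procedure.

```python
def spike_in_scale_factors(spike_counts):
    """Per-sample scale factors derived from spike-in read counts.

    spike_counts: dict mapping sample name -> number of reads assigned
                  to the spike-in (hypothetical input).
    """
    if not spike_counts:
        return {}
    if any(count <= 0 for count in spike_counts.values()):
        raise ValueError("every sample needs a positive spike-in count")

    # Scale every sample down to the sample with the fewest spike-in reads.
    reference = min(spike_counts.values())
    return {sample: reference / count for sample, count in spike_counts.items()}


def apply_scale(signal, factor):
    """Multiply a list of per-bin read counts by a scale factor."""
    return [value * factor for value in signal]
```

For example, a sample with twice as many spike-in reads as the reference sample receives a factor of 0.5, so its per-bin counts are halved before samples are compared.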

Splines

Flexible smooth nonlinear functions that are defined piecewise by polynomials for fitting nonlinear trends.

Locally estimated scatterplot smoothing

(LOESS). A simple yet robust method for fitting nonlinear trends.

Quantile regression

A statistical regression method that estimates the median or other quantile of the response variables and that is robust against outliers.
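The robustness to outliers comes from the asymmetric absolute ("pinball") loss that quantile regression minimizes. The sketch below illustrates only the simplest, intercept-only case, in which the fitted value is a single constant; full quantile regression fits the same loss with covariates. The function names are hypothetical.

```python
def pinball_loss(residual, tau):
    """Asymmetric absolute loss for quantile level tau (0 < tau < 1)."""
    return residual * tau if residual >= 0 else residual * (tau - 1)


def fit_constant_quantile(values, tau=0.5):
    """Constant that minimizes the summed pinball loss over the values.

    A minimizer always exists among the observed data points, so
    scanning them is sufficient for this intercept-only illustration.
    """
    return min(
        sorted(values),
        key=lambda c: sum(pinball_loss(v - c, tau) for v in values),
    )


# One extreme value pulls the mean far from the bulk of the data,
# whereas the tau = 0.5 fit (the median) stays with the bulk.
data = [1, 2, 3, 4, 100]
mean_fit = sum(data) / len(data)
median_fit = fit_constant_quantile(data, tau=0.5)
```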

Surrogate variable analysis

A statistical analysis to identify and model variables that are not explicitly annotated but that have measurable effects.

About this article


DOI

https://doi.org/10.1038/nrg3788
