Analysis of computational footprinting methods for DNase sequencing experiments

Abstract

DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. As is often the case with high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods (HINT, DNase2TF and PIQ) consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.

Figure 1: FLR-Exp evaluation metric.
Figure 2: Clustering of bias estimates.
Figure 3: Effects of sequence bias on methods.
Figure 4: Evaluation of computational footprinting methods.
Figure 5: Impact of TF binding time on computational footprinting.

Accession codes

Accessions

Gene Expression Omnibus

Sequence Read Archive

References

  1. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  2. Crawford, G.E. et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 16, 123–131 (2006).
  3. Sabo, P.J. et al. Genome-wide identification of DNase I hypersensitive sites using active chromatin sequence libraries. Proc. Natl. Acad. Sci. USA 101, 4537–4542 (2004).
  4. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
  5. Boyle, A.P. et al. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res. 21, 456–464 (2011).
  6. Piper, J. et al. Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq data. Nucleic Acids Res. 41, e201 (2013).
  7. Sung, M.-H.H., Guertin, M.J., Baek, S. & Hager, G.L. DNase footprint signatures are dictated by factor dynamics and DNA sequence. Mol. Cell 56, 275–285 (2014).
  8. Gusmao, E.G., Dieterich, C., Zenke, M. & Costa, I.G. Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications. Bioinformatics 30, 3143–3151 (2014).
  9. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455 (2011).
  10. Cuellar-Partida, G. et al. Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 28, 56–62 (2012).
  11. Sherwood, R.I. et al. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat. Biotechnol. 32, 171–178 (2014).
  12. Yardımcı, G.G., Frank, C.L., Crawford, G.E. & Ohler, U. Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Res. 42, 11865–11878 (2014).
  13. Kähärä, J. & Lähdesmäki, H. BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data. Bioinformatics 31, 2852–2859 (2015).
  14. Stergachis, A.B. et al. Conservation of trans-acting circuitry during mammalian regulatory evolution. Nature 515, 365–370 (2014).
  15. He, H.H. et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nat. Methods 11, 73–78 (2014).
  16. Meyer, C.A. & Liu, X.S. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat. Rev. Genet. 15, 709–721 (2014).
  17. Park, P.J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009).
  18. Teytelman, L., Thurtle, D.M., Rine, J. & van Oudenaarden, A. Highly expressed loci are vulnerable to misleading ChIP localization of multiple unrelated proteins. Proc. Natl. Acad. Sci. USA 110, 18602–18607 (2013).
  19. The difficulty of a fair comparison. Nat. Methods 12, 273 (2015).
  20. Davis, J. & Goadrich, M. The relationship between precision-recall and ROC curves. Proc. 23rd International Conference on Machine Learning (ICML 2006) 233–240 (2006).
  21. Tewari, A.K. et al. Chromatin accessibility reveals insights into androgen receptor activation and transcriptional specificity. Genome Biol. 13, R88 (2012).
  22. Sharp, Z.D. et al. Estrogen-receptor-alpha exchange and chromatin dynamics are ligand- and domain-dependent. J. Cell Sci. 119, 4101–4116 (2006).
  23. McNally, J.G., Müller, W.G., Walker, D., Wolford, R. & Hager, G.L. The glucocorticoid receptor: rapid exchange with regulatory sites in living cells. Science 287, 1262–1265 (2000).
  24. Malnou, C.E. et al. Heterodimerization with different Jun proteins controls c-Fos intranuclear dynamics and distribution. J. Biol. Chem. 285, 6552–6562 (2010).
  25. Nakahashi, H. et al. A genome-wide map of CTCF multivalency redefines the CTCF code. Cell Rep. 3, 1678–1689 (2013).
  26. Lazarovici, A. et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381 (2013).
  27. Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  28. Zhang, Y. et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 9, R137 (2008).
  29. Yu, J. et al. An integrated network of androgen receptor, polycomb and TMPRSS2-ERG gene fusions in prostate cancer progression. Cancer Cell 17, 443–454 (2010).
  30. Guertin, M.J., Zhang, X., Coonrod, S.A. & Hager, G.L. Transient estrogen receptor binding and p300 redistribution support a squelching mechanism for estradiol-repressed genes. Mol. Endocrinol. 28, 1522–1533 (2014).
  31. John, S. et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268 (2011).
  32. Mathelier, A. et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 42, D142–D147 (2014).
  33. Robasky, K. & Bulyk, M.L. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 39, D124–D128 (2011).
  34. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
  35. Boyle, A.P., Guinney, J., Crawford, G.E. & Furey, T.S. F-seq: a feature density estimator for high-throughput sequence tags. Bioinformatics 24, 2537–2538 (2008).
  36. Hesselberth, J.R. et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat. Methods 6, 283–289 (2009).
  37. Sabo, P.J. et al. Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc. Natl. Acad. Sci. USA 101, 16837–16842 (2004).
  38. Madden, H.H. Comments on the Savitzky-Golay convolution method for least-squares fit smoothing and differentiation of digital data. Anal. Chem. 50, 1383–1386 (1978).
  39. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm and yeast genomes. Genome Res. 15, 1034–1050 (2005).
  40. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002).
  41. Grant, C.E., Bailey, T.L. & Noble, W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
  42. Stormo, G.D. DNA binding sites: representation and discovery. Bioinformatics 16, 16–23 (2000).
  43. Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
  44. Korhonen, J., Martinmäki, P., Pizzi, C., Rastas, P. & Ukkonen, E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25, 3181–3182 (2009).
  45. Wilczynski, B., Dojer, N., Patelak, M. & Tiuryn, J. Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs. BMC Bioinformatics 10, 82 (2009).
  46. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
  47. Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
  48. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).
  49. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).

Acknowledgements

This work was supported by the Interdisciplinary Center for Clinical Research (IZKF Aachen), RWTH Aachen University Medical School, Aachen, Germany (to E.G.G., M.A. and I.G.C.), and the Excellence Initiative of the German Federal and State Governments and the German Research Foundation (grant GSC 111 to M.A. and I.G.C.).

Author information

E.G.G., M.Z. and I.G.C. designed the research. E.G.G. wrote HINT program code. E.G.G., M.A. and I.G.C. analyzed data. E.G.G., M.Z. and I.G.C. wrote the manuscript.

Correspondence to Ivan G Costa.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 FLR-Exp results for different cell-type pairs.

Correlation between Kolmogorov-Smirnov (KS) test statistics from FLR scores and expression fold change for the cell-type pairs H1-hESC versus K562 (left), H1-hESC versus GM12878 (middle) and GM12878 versus K562 (right), for footprints predicted by HINT-BC, DNase2TF, Neph and FLR (from top to bottom, respectively). We observe high FLR-Exp (Spearman correlation) values (r > 0.8) in all cases. Moreover, the FLR-Exp rankings of methods are similar across cell pairs: H1-hESC/K562 versus H1-hESC/GM12878, r = 0.99; H1-hESC/K562 versus GM12878/K562, r = 0.96; and H1-hESC/GM12878 versus GM12878/K562, r = 0.97.
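The two statistics named in this caption can be sketched in a few lines of standard-library Python. This is only an illustration of a two-sample KS statistic and a Spearman correlation; the function names and the toy inputs in the usage note are assumptions and are not taken from the authors' code or data.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sum((p - mx) ** 2 for p in x) ** 0.5
    sy = sum((q - my) ** 2 for q in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this sketch, a call such as `spearman(ks_values, fold_changes)` over one value per transcription factor (both lists hypothetical) would give a single correlation in the spirit of FLR-Exp. The sketch returns the unsigned KS statistic only; how the sign is attached for the comparison with fold change is not specified here.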

Supplementary Figure 2 FLR-Exp results for different footprint quality metrics.

Correlation between Kolmogorov-Smirnov (KS) test statistics and expression fold change for the cell-type pair H1-hESC versus K562, using FLR (left), FS (middle) or TC (right) as the quality metric for the footprints. Footprints were predicted with HINT-BC, DNase2TF, Neph and FLR (from top to bottom, respectively). Using FLR as the quality metric yields the highest Spearman correlation values (FLR-Exp). In contrast, TC exhibits small correlation values (r < 0.4) and presents several cases in which the signs of the KS statistic and the fold change disagree (off-diagonal points). Note that FS also has a high average correlation with expression fold change across all evaluated data sets and methods (average r = 0.73) and yields a ranking of footprinting methods similar to that of FLR (r = 0.89). Therefore, FS can be used as an alternative to the FLR score as a footprint quality metric.

Supplementary Figure 3 Clustering of sequence bias estimates.

Ward's minimum variance clustering on the pairwise Spearman correlation coefficients (r) of sequence bias estimates from all ENCODE Tier 1 and 2 DNase-seq data sets and naked DNA DNase-seq data sets. DNase-seq experiments were based on single-hit (red) or double-hit (blue) protocols or on naked DNA (yellow). We observe a high average correlation between sequence biases estimated from DNase-seq data sets originating from the same protocol: single-hit = 0.89; double-hit = 0.84. In contrast, the average correlation between bias estimates from different protocols is lower: single-hit versus double-hit = 0.39. The sequence bias estimates based on the three naked DNA data sets have an average correlation of 0.96.

Supplementary Figure 4 Correlation between the performance of methods and their OBS from the He et al. data set.

The x-axis represents the observed sequence bias. The y-axis represents the ratio between the AUC at 10% FPR for a particular method and that of the TC-Rank method. In accordance with He et al.¹, we observe that the FS-Rank method has a significant negative correlation (r = −0.4144; adjusted p-value < 0.001) with the sequence bias score, while no significant correlation is found for any of the other evaluated methods (HINT, HINT-BCN, HINT-BC and PWM-Rank). Note that the correlation value for the FS-Rank method differs from that reported by He et al.¹. This stems from a different strategy for finding the DHSs and MPBSs used in the evaluation data set. Nevertheless, we were able to observe a strong bias for the FS-Rank method, as in He et al.¹.

1. He, H.H. et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nat. Methods 11, 73–78 (2014).

Supplementary Figure 5 Evaluation of sequence bias correction strategies and CG-content contribution.

(a) Distribution of AUC (at 10% FPR) differences between HINT-BC and HINT; HINT-BCN and HINT; HINT-BC and HINT-BCN for all 233 TFs of the comprehensive dataset. TFs are ranked by the difference between HINT-BC and HINT-BCN. There is a clear increase in AUC values between sequence bias-corrected methods (HINT-BC and HINT-BCN) and the uncorrected method HINT (p-value < 10⁻³⁰; Mann-Whitney-Wilcoxon test). Moreover, HINT-BC has higher AUC values for all but seven TFs in the comparison with HINT-BCN. (b) CG content of TF motifs. We observe no correlation between CG content of the motifs and the individual AUC of each method: HINT r = 0.0144, HINT-BC r = 0.0254 and HINT-BCN r = 0.0108 (p-value > 0.05; Spearman correlation test). Furthermore, we observe no correlation between CG content of motifs and differences in AUC: HINT-BC − HINT-BCN r = 0.0188, HINT-BC − HINT r = 0.0724 and HINT-BCN − HINT r = 0.0644 (p-value > 0.05; Spearman correlation test).
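The "AUC at 10% FPR" statistic used throughout these captions is a partial area under the ROC curve. The following is a minimal sketch of one way to compute it, assuming binary labels (1 for a true binding site, 0 otherwise) and that higher scores indicate stronger footprints. Dividing the area by the cutoff, so that a perfect ranking scores 1, is an assumption of this sketch and may differ from the normalization used in the paper; the function name is hypothetical.

```python
def partial_auc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Walk down the ranking; items with equal scores form a single ROC step.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoidal integration, interpolating the segment that crosses max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr
```

Calling `partial_auc(scores, labels, max_fpr=1.0)` reduces to the ordinary ROC AUC, which corresponds to the "AUC at 100% FPR" variant mentioned in Supplementary Figure 9.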

Supplementary Figure 6 Average DNase-seq signals around selected TFs with ChIP-seq evidence in H1-hESC (DU) cells.

These TFs had the highest AUC gain between HINT-BC and HINT: (a) ATF3, (b) EGR1, (c) NRF1, (d) RAD21, (e) SP1 and (f) SP4. In the top panel of each graph, we show the strand-specific average DNase-seq signal on naked DNA DNase-seq experiments (MCF-7 cell type); the middle panel shows the strand-specific estimated DHS sequence bias signal; and the bottom panel shows the (1) uncorrected – observed DNase I cleavage signal and (2) corrected – DNase-seq signal after the bias correction. Signals in the bottom graph were standardized to be in the interval [0,1]. The motif logo represents all underlying DNA sequences centered on the TFBSs. The bias correction led to a substantial change in the average DNase-seq signal patterns surrounding several TFs. For EGR1, for instance, the bias-corrected DNase-seq signal presents three clear depletions, which match the high-affinity regions of the EGR1 motif (two CC and one C). In contrast, the uncorrected EGR1 DNase-seq signal presents a single peak in the center of the motif. The same observations can be made for other TFs, such as NRF1 (with affinity regions (C/G)(C/G)(G/C)C and G(G/C)(C/G)(C/G)C) and SP4 (with affinity region CGCCC). Such patterns reflect bias corrections that are clearly beneficial to the accuracy of footprinting methods.
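The standardization to the interval [0,1] mentioned for the bottom panels can be illustrated with a simple min-max rescaling. This is a sketch under the assumption that min-max scaling was used; the caption does not state the exact procedure, and the function name is hypothetical.

```python
def standardize(signal):
    """Rescale a signal linearly so that its minimum maps to 0 and its maximum to 1."""
    low, high = min(signal), max(signal)
    if high == low:
        # A flat signal has no range to rescale; map every position to 0.
        return [0.0 for _ in signal]
    return [(value - low) / (high - low) for value in signal]
```

Rescaling the uncorrected and corrected profiles separately in this way puts both on a common scale, so that their shapes can be compared in one panel regardless of absolute read depth.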

Supplementary Figure 7 Association between 6-mer CG content and DNase-seq sequence bias.

We sorted 6-mers by their bias estimates and grouped similarly ranked 6-mers. We show scatter plots of CG content versus average sequence bias for 6-mer groups on DNase-seq data generated with the (a) single-hit (DU) and (b) double-hit (UW) protocols and (c) naked DNA experiments. There is a strong positive correlation between DNase-seq sequence bias and CG content for all DHS sequence bias estimates from both single-hit and double-hit protocols (p-value < 0.01). Interestingly, we observe a negative correlation for two naked DNA experiments: K562 and IMR90 (p-value < 10⁻⁵).
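The grouping procedure described in this caption can be sketched as follows: sort k-mers by bias estimate, split the sorted list into consecutive groups, and average both bias and CG content within each group. The function names, the group size and the toy bias values in the usage note are assumptions for illustration only.

```python
def cg_content(kmer):
    """Fraction of C and G bases in a k-mer."""
    return sum(base in "CG" for base in kmer.upper()) / len(kmer)


def group_by_bias_rank(bias_by_kmer, group_size):
    """Return (mean bias, mean CG content) for consecutive groups of bias-ranked k-mers."""
    ranked = sorted(bias_by_kmer, key=bias_by_kmer.get)
    groups = []
    for start in range(0, len(ranked), group_size):
        chunk = ranked[start:start + group_size]
        mean_bias = sum(bias_by_kmer[k] for k in chunk) / len(chunk)
        mean_cg = sum(cg_content(k) for k in chunk) / len(chunk)
        groups.append((mean_bias, mean_cg))
    return groups
```

For example, with the made-up estimates `{"AAAAAA": 0.2, "AAATTT": 0.4, "ACGTAC": 0.9, "CCGGCC": 1.8}` and `group_size=2`, the low-bias group has a mean CG content of 0 and the high-bias group a mean CG content of 0.75. Each returned pair corresponds to one point in a scatter plot of the kind described above.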

Supplementary Figure 8 Analysis of footprint ranking strategies.

Distribution of AUC values (at 10% FPR) obtained with distinct ranking strategies for the site-centric methods (a) BinDNase, (b) Centipede, (c) Cuellar, (d) FLR and (e) PIQ, and for (f) the segmentation methods DNase2TF and Wellington. Ranking strategies (x-axis) are ordered by decreasing median AUC. The site-centric methods are tested based on probability (P) cutoffs of 0.8, 0.85, 0.9, 0.95 and 0.99 and on their own ranking strategy (Own rank). Segmentation methods are tested based on the TC metric ranking and their own ranking strategy (Own rank). Methods not shown in this figure do not contain an intrinsic ranking methodology. In all cases, using TC-based strategies/cutoffs was significantly better than the original ranking of the methods (p-value < 10⁻¹²; Mann-Whitney-Wilcoxon test). Concerning site-centric methods, the use of a probability threshold (P) of 0.9 was best for all methods, with the exception of BinDNase, where 0.8 was best. The box plot depicts the distribution median value (middle dot) and first and third quartiles (box extremities). The whiskers represent the 1.5 IQR and external dots represent outliers (data greater than or smaller than 1.5 IQR).

Supplementary Figure 9 Accuracy of methods based on TF ChIP-seq evaluation strategy.

Accuracy distribution for all 15 footprinting methods regarding all TF ChIP-seq validation sets (ordered by Friedman Ranking). Accuracies are shown for the statistics: (a) AUC at 100% FPR, (b) AUC at 10% FPR, (c) AUC at 1% FPR and (d) AUPR. We used the Friedman-Nemenyi hypothesis test for statistical evaluation (see Supplementary Tables 3–6). The box plot depicts the distribution median value (middle dot) and first and third quartiles (box extremities). The whiskers represent the 1.5 IQR and external dots represent outliers (data greater than or smaller than 1.5 IQR).
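The Friedman ranking used to order the methods can be sketched as follows: within each validation set, methods are ranked by accuracy (rank 1 is best), and the ranks are then averaged across validation sets. The sketch also computes the basic Friedman chi-square statistic without a tie correction. The Nemenyi post-hoc comparison is not implemented, and the function names and input layout are assumptions.

```python
def rank_descending(values):
    """Rank 1 for the largest value; tied values share their average rank."""
    return [
        1 + sum(other > v for other in values)
        + (sum(other == v for other in values) - 1) / 2
        for v in values
    ]


def friedman(scores):
    """scores[i][j] is the accuracy of method j on validation set i.

    Returns the mean rank of each method and the Friedman chi-square
    statistic (no tie correction).
    """
    n = len(scores)      # number of validation sets
    k = len(scores[0])   # number of methods
    rank_rows = [rank_descending(row) for row in scores]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2
```

Sorting methods by ascending mean rank gives an ordering of the kind used on the x-axis. When every validation set ranks the methods identically, the statistic reaches its maximum of n(k − 1).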

Supplementary Figure 10 Average sequence bias and DNase-seq signals around nuclear receptors.

Results are shown for the TFs: (a) AR (R1881), (b) GR (with DEX), (c) ER (40 min) and (d) ER (160 min). In the top panel, we show the strand-specific average DNase-seq signal on naked DNA DNase-seq experiments (MCF-7 (DU) for single-hit data sets and IMR90 (UW) for double-hit data sets); the middle panel shows the strand-specific estimated DHS sequence bias signal; and the bottom panel shows the (1) uncorrected – observed DNase-seq signal and (2) corrected – DNase-seq signal after the bias correction with the DHS sequence bias estimates. Signals in the bottom graph were standardized to be in the interval [0,1]. The motif logo represents all underlying DNA sequences centered on the TFBSs. While corrected DNase-seq profiles from ER have a better match with the underlying motif, this is not the case for AR and GR. However, we observed a small gain in the AUC score when comparing HINT-BC and HINT. This difference is in the upper quartile range for all 233 TFs analyzed. These results indicate that cleavage bias correction also brings improvements to footprint prediction of nuclear receptors. However, all these TFs have low AUC scores in all footprinting methods, i.e., they fall in the lower quartiles of HINT-BC or TC-Rank AUC scores. This indicates that short binding time indeed poses a challenge in footprint prediction.

Supplementary Figure 11 Average sequence bias and DNase-seq signals around binding sites of de novo motifs found using Neph footprints.

Results are shown for de novo motifs: (a) #0458 and (b) #0500 binding on cell type H7-hESC (UW). In the top panel, we show the strand-specific average DNase-seq signal on naked DNA DNase-seq experiments (MCF-7 cell type); the middle panel shows the strand-specific estimated DHS sequence bias signal; and the bottom panel shows the (1) uncorrected – observed DNase-seq signal and (2) corrected – DNase-seq signal after the bias correction using DHS sequence bias estimates. Signals in the bottom graph were standardized to be in the interval [0,1]. The motif logo represents all underlying DNA sequences centered on the TFBSs. These motifs were discovered in the footprint analysis of Neph et al.¹ and indicated in He et al.² to be artifacts of sequence bias. Bias-corrected DNase-seq profiles reveal no clear footprint shape. Furthermore, we compared the overlap between footprints generated by HINT-BC and Neph in H7-hESC (UW) cells. We considered only the MPBSs that overlapped DHSs in H7-hESC. We observed that 24.99% (motif #0458) and 28.58% (motif #0500) of MPBSs were associated with a Neph footprint. In contrast, only 0.73% (motif #0458) and 1.71% (motif #0500) of MPBSs overlapped with a HINT-BC footprint. Altogether, this indicates that these motifs are indeed potential artifacts of sequence bias and reinforces the importance of bias correction prior to any DNase-seq analysis.

1. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90 (2012).
2. He, H.H. et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nat. Methods 11, 73–78 (2014).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11 and Supplementary Tables 1–6 (PDF 2811 kb)

Supplementary Data Set 1

Results of TF ChIP-seq–based evaluation and TF specific statistics. (XLSX 209 kb)

Supplementary Data Set 2

Results of FLR-Exp evaluation (XLSX 282 kb)

Cite this article

Gusmao, E., Allhoff, M., Zenke, M. et al. Analysis of computational footprinting methods for DNase sequencing experiments. Nat Methods 13, 303–309 (2016). https://doi.org/10.1038/nmeth.3772
