FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation

Abstract

The ability to simulate high-throughput chromatin conformation (Hi-C) data is foundational for benchmarking Hi-C data analysis methods. Here we present a nonparametric strategy named FreeHi-C to simulate Hi-C data from the interacting genome fragments. Data from FreeHi-C exhibit high fidelity to biological Hi-C data. FreeHi-C boosts the precision and power of differential chromatin interaction detection through data augmentation under preserved false discovery rate control.

Fig. 1: FreeHi-C enables simulating high-fidelity Hi-C data.
Fig. 2: Data augmentation with FreeHi-C-simulated replicates improves DCI detection.

Data availability

To study the operating features of FreeHi-C, we utilized two publicly available human Hi-C datasets as examples of large genomes with independent experiments using four cell cultures, which are referred to as four biological replicates, from the GM12878 (ref. 3) cell line and another four from A549 (ref. 25). Raw FASTQ files for GM12878 were downloaded from GEO (ref. 26) under the accession code GSE63525 and raw sequences for A549 were obtained from the ENCODE portal (ref. 27; https://www.encodeproject.org) with accession code ENCSR662QKG. For evaluation of FreeHi-C performance on small genomes, we leveraged three different stages of malaria parasite P. falciparum red blood cell cycles (ref. 16). Raw sequences for P. falciparum were downloaded from GEO (ref. 26) under the accession code GSE50199. GM12878 and A549 were both processed at 40 kb resolution, and P. falciparum at 10 kb. For validating the differential interaction detection with a differential expression analysis, we utilized RNA-seq gene expression data from the ENCODE portal (accession ENCSR000AED for GM12878 and ENCSR000CTM for A549). Similarly, the CTCF ChIP–seq peak signal files were also downloaded from ENCODE under accession ENCSR000DZN for GM12878 and ENCSR000DPF for A549. The data used in this paper are summarized in Supplementary Table 1. All the simulated data used in the analysis and Juicebox (ref. 28) visualization data are available at Zenodo (https://doi.org/10.5281/zenodo.3345896).
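For scripting downloads or bookkeeping, the accessions listed above can be collected in a small manifest. This is only a sketch: the dictionary layout, the key names (`source`, `accession`, `resolution_bp`) and the helper `describe` are assumptions made for illustration and are not part of FreeHi-C.

```python
# Accession codes and resolutions are copied from the Data availability statement.
# The structure, key names and helper function are illustrative assumptions.
HIC_DATASETS = {
    "GM12878": {"source": "GEO", "accession": "GSE63525", "resolution_bp": 40_000},
    "A549": {"source": "ENCODE", "accession": "ENCSR662QKG", "resolution_bp": 40_000},
    "P. falciparum": {"source": "GEO", "accession": "GSE50199", "resolution_bp": 10_000},
}

# External validation data from the ENCODE portal, keyed by cell line.
RNA_SEQ = {"GM12878": "ENCSR000AED", "A549": "ENCSR000CTM"}
CTCF_CHIP_SEQ = {"GM12878": "ENCSR000DZN", "A549": "ENCSR000DPF"}


def describe(name):
    """Return a one-line summary of a Hi-C dataset in the manifest."""
    entry = HIC_DATASETS[name]
    kb = entry["resolution_bp"] // 1000
    return f"{name}: {entry['source']} {entry['accession']} at {kb} kb"


if __name__ == "__main__":
    for dataset in HIC_DATASETS:
        print(describe(dataset))
```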

Code availability

The FreeHi-C pipeline is implemented in Python with C-accelerated core calculations, and it fits naturally into high-performance computing environments for parallelization. The source code and instructions for running FreeHi-C are publicly available at https://github.com/keleslab/FreeHiC.

References

  1. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).
  2. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
  3. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
  4. Roy, S. et al. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res. 44, 1977–1978 (2016).
  5. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).
  6. Lun, A. T. & Smyth, G. K. diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data. BMC Bioinformatics 16, 258 (2015).
  7. Forcato, M. et al. Comparison of computational methods for Hi-C data analysis. Nat. Methods 14, 679–685 (2017).
  8. Ursu, O. et al. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 34, 2701–2707 (2018).
  9. Djekidel, M. N., Chen, Y. & Zhang, M. Q. FIND: differential chromatin interactions detection using a spatial Poisson process. Genome Res. 28, 412–422 (2018).
  10. Stansfield, J. C., Cresswell, K. G., Vladimirov, V. I. & Dozmorov, M. G. HiCcompare: an R package for joint normalization and comparison of Hi-C datasets. BMC Bioinformatics 19, 279 (2018).
  11. Ardakany, A. R., Ay, F. & Lonardi, S. Selfish: discovery of differential chromatin interactions via a self-similarity measure. Bioinformatics 35, i145–i153 (2019).
  12. Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 24, 999–1011 (2014).
  13. Yang, T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).
  14. DeMaere, M. Z. & Darling, A. E. Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies. GigaScience 7, gix103 (2017).
  15. DeMaere, M. Z. & Darling, A. E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 20, 46 (2019).
  16. Ay, F. et al. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res. 24, 974–988 (2014).
  17. Stansfield, J. C., Cresswell, K. G. & Dozmorov, M. G. multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments. Bioinformatics 35, 2916–2923 (2019).
  18. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
  19. Zheng, X. & Zheng, Y. CscoreTool: fast Hi-C compartment analysis at high resolution. Bioinformatics 34, 1568–1570 (2017).
  20. Giorgetti, L. et al. Structural organization of the inactive X chromosome in the mouse. Nature 535, 575–579 (2016).
  21. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
  22. Zheng, Y., Ay, F. & Keles, S. Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies. eLife 8, e38070 (2019).
  23. Fisher, R. A. Statistical Methods for Research Workers (Genesis Publishing Pvt Ltd, 2006).
  24. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
  25. Dixon, J. R. et al. Integrative detection and analysis of structural variation in cancer genomes. Nat. Genet. 50, 1388–1398 (2018).
  26. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets update. Nucleic Acids Res. 41, D991–D995 (2012).
  27. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  28. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).

Acknowledgements

This work was supported by NIH grants HG009744, HG003747 and HG007019 to S.K.

Author information

S.K. and Y.Z. conceived the project, designed the research and developed the method. Y.Z. developed the simulation framework and performed the experiments. Both authors contributed to the preparation of the manuscript.

Correspondence to Sündüz Keleş.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Hi-C contact matrices of chromosome 1 for replicates 2, 3, 4, and 6 of GM12878.

In each panel, the first row shows the chromosome-wide Hi-C contact matrices of the original biological replicate and the corresponding FreeHi-C and Sim3C simulated replicates, with sequencing depth matched to that of the original replicate. The contact matrices in the second row display zoomed-in views of the genomic interactions. The numbers at the bottom left of each matrix represent the color scale.

Supplementary Fig. 2 Genomic distance stratified comparison of the Hi-C signal for replicates 2, 3, 4, and 6 of GM12878.

Genomic distance stratified comparison of the Hi-C signal, as quantified by the number of valid read pairs, between the original biological replicates and corresponding FreeHi-C simulated replicates of varying depths for replicates 2, 3, 4, and 6 of GM12878. The sample size for each box is the number of genomic interactions on chromosome 1 whose genomic distance matches the x-axis. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5 x the inter-quartile range.
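The boxplot convention used in these captions (median center line, 25th and 75th percentile box limits, whiskers covering all points within 1.5× the inter-quartile range) can be made concrete with a short sketch. The function name `boxplot_summary` and the choice of linear interpolation between order statistics for the percentiles are assumptions; the plotting software used for the figures may compute quantiles slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Summarize values using the boxplot convention described in the captions.

    Box limits are the 25th and 75th percentiles, the center line is the
    median, and each whisker ends at the most extreme data point lying within
    1.5 x IQR of the box. Percentiles use linear interpolation (an assumption).
    """
    data = sorted(float(v) for v in values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


if __name__ == "__main__":
    # Hypothetical read-pair counts for one genomic-distance stratum.
    print(boxplot_summary([1, 2, 3, 4, 5, 100]))
```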

Supplementary Fig. 3 Hi-C contact matrices and hierarchical clustering of P. falciparum for the ring, the trophozoite, and the schizont stages.

a. Column one corresponds to the original sample on chromosome 7, and columns two and three are FreeHi-C simulations at the original depth and at twice the depth of the biological sample on chromosome 7, respectively. The numbers at the bottom left of each matrix represent the color scale. b. and c. Hierarchical clustering of the FreeHi-C replicates simulated to the same sequencing depth as the original P. falciparum schizont stage (b, up simulation) or the trophozoite stage (c, down simulation) samples. Regardless of up (b) or down (c) simulation, FreeHi-C replicates capture the known relationship between the three stages.

Supplementary Fig. 4 Differential chromatin interaction detection between biological replicates of GM12878 for evaluating the false discovery rates of diffHic and multiHiCcompare, across a series of sequencing depths using FreeHi-C and downsampling at 40 kb.

Target FDR levels are delineated as the panel labels on the right-hand side. In the downsampling experiments, rep2, rep3, rep4, and rep6 of GM12878 are downsampled to 1/4, 1/2, and 3/4 of the original sequencing depths. Observed false discovery rates of diffHic and multiHiCcompare are depicted in dark purple and light purple, respectively. Biological replicates are also simulated with FreeHi-C to 1/4, 1/2, 3/4, the same as, three times, and five times the original sequencing depths. Resulting false discovery rates of simulated replicates are depicted in red and orange. For both diffHic and multiHiCcompare, a quasi-likelihood negative binomial generalized log-linear model was employed to test the coefficients, with BH procedure adjustment for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n = 3.
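The BH (Benjamini-Hochberg, ref. 18) adjustment mentioned in this and the following captions can be sketched in a few lines. This is a generic illustration of the procedure, not the code path used inside diffHic or multiHiCcompare; the function name `bh_adjust` is an assumption.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the adjusted value
    is the minimum of p * m / r' taken over all ranks r' >= r, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for four candidate interactions.
    raw = [0.01, 0.04, 0.03, 0.005]
    adj = bh_adjust(raw)
    # Interactions called significant at a target FDR of 0.05.
    print([i for i, q in enumerate(adj) if q <= 0.05])
```

An interaction is then declared differential at a target FDR level when its adjusted p-value does not exceed that level.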

Supplementary Fig. 5 Differential chromatin interaction detection between biological replicates of A549 for evaluating the false discovery rates of diffHic and multiHiCcompare, across a series of sequencing depths using FreeHi-C and downsampling at 40 kb.

This is the analysis depicted in Supplementary Fig. 4 repeated with A549. For both diffHic and multiHiCcompare, a quasi-likelihood negative binomial generalized log-linear model was employed to test the coefficients, with BH procedure adjustment for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n = 3.

Supplementary Fig. 6 Data augmentation with FreeHi-C simulated replicates improves differential chromatin interactions (DCIs) detection.

a delineates observed false discovery rates of within-sample comparisons for A549 data (i.e., comparisons of replicate(s) of A549 with other replicate(s) of A549). The dashed lines are y = x. b displays precision, computed as the percentage of top significant DCIs of each specific analysis in the gold standard differential chromatin interaction list, as a function of top-ranking DCIs. The gold standard set is defined by comparing the full set of 4 replicates of GM12878 with 4 replicates of A549 filtered by FDR ≤ 0.01. |logFC| refers to the absolute value of natural log transformed fold-change. c. Observed versus targeted FDR for differential chromatin interaction detection under variations of replicates per condition (left: two biological replicates per condition; right: one biological replicate per condition; middle: two biological replicates for one condition and FreeHi-C augmentation of the single biological replicate of the other condition). Middle panel corresponds to data augmentation with FreeHi-C and exhibits FDR control. d. Power, presented as the percentage of candidate DCIs declared as significant at the given FDR threshold, under variations on replicates per condition. Power with two replicates per condition (left panel) depicts the gold standard in this setting. e. Ranked up (green) DCIs of the uneven number of replicates per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differentially expressed (DE) genes. In contrast, DCIs that are ranked lower (blue) among the top N DCIs, N = 1000, 5000, ..., 30000, as a result of FreeHi-C augmentation do not significantly overlap with the DE genes. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. f. Ranked up (green) and down (blue) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differential CTCF peaks. 
Both sets show significant enrichment for CTCF peaks, with the exception of the top 100 DCIs. Dashed lines depict the p-value cutoff of 0.05. Differential CTCF ChIP-seq peaks are defined as peaks that are uniquely enriched in only one cell line. Differential chromatin interaction detection is performed by HiCcompare, which converts the normalized contact counts into Z-scores, and by multiHiCcompare, which uses a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). The p-values are adjusted by the Benjamini-Hochberg procedure for multiple comparisons. For all the boxplots in this figure, the center lines correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size.
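The two summary quantities defined in this caption, precision among the top-ranking DCIs and power at a given FDR threshold, reduce to simple proportions. The sketch below assumes each interaction is identified by a hashable key such as a `(chromosome, bin1, bin2)` tuple and that the candidates are already ranked by significance; the function names and the example values are hypothetical.

```python
def precision_at_top_n(ranked_dcis, gold_standard, top_n):
    """Percentage of the top_n ranked DCIs that appear in the gold standard list."""
    top = ranked_dcis[:top_n]
    if not top:
        return 0.0
    gold = set(gold_standard)
    hits = sum(1 for dci in top if dci in gold)
    return 100.0 * hits / len(top)


def power_at_fdr(adjusted_pvalues, fdr_threshold):
    """Percentage of candidate DCIs declared significant at the FDR threshold."""
    if not adjusted_pvalues:
        return 0.0
    significant = sum(1 for p in adjusted_pvalues if p <= fdr_threshold)
    return 100.0 * significant / len(adjusted_pvalues)


if __name__ == "__main__":
    # Hypothetical interactions keyed by (chromosome, bin1, bin2), most significant first.
    ranked = [("chr1", 10, 50), ("chr1", 20, 90), ("chr2", 5, 8), ("chr3", 1, 2)]
    gold = {("chr1", 10, 50), ("chr2", 5, 8)}
    print(precision_at_top_n(ranked, gold, top_n=2))
    print(power_at_fdr([0.001, 0.02, 0.2, 0.5], fdr_threshold=0.05))
```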

Supplementary Fig. 7 False discovery rate control for differential chromatin interaction detection within pairs of A549 or GM12878 replicates with or without FreeHi-C augmentation.

a. Observed versus targeted FDR for differential chromatin interaction detection between pairs of A549 replicates (first column) and augmentation with 1 to 3 FreeHi-C simulated replicates (columns 2-4). These data depict the individual pairs of replicates summarized in Figure 2a. b. False discovery rate control as in Figure 2a with the GM12878 replicates (n = 6). Dashed lines are y = x. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 8 Power of differential chromatin interaction detection between pairs of A549 or GM12878 replicates with or without FreeHi-C data augmentation across a sequence of FDR thresholds.

Panels a and b report on the y-axes the percentage of candidate differential chromatin interactions and the number of differential chromatin interactions, respectively. The first two columns are provided as a reference to highlight that with one replicate per condition (n = 6), the number of differential chromatin interactions detected between conditions (third column; n = 16) can be smaller than that within conditions (number of false discoveries in the first two columns). Columns 4 to 6 illustrate the striking increase in the numbers of detected differential chromatin interactions with FreeHi-C data augmentation. The y-axes are on a log scale (n = 16). c. Ranked up (green) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differentially expressed (DE) genes (n = 16). In contrast, DCIs that are ranked lower (blue) among the top N DCIs, N = 1000, 5000, ..., 25000, as a result of FreeHi-C augmentation do not significantly overlap with the DE genes. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. d. Ranked up (green) and down (blue) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differential CTCF peaks (n = 16). Both sets show significant enrichment for differential CTCF peaks, except among the top 100 DCIs, where the ranked-up DCIs are significantly enriched and the ranked-down DCIs are not. Dashed lines depict the p-value cutoff of 0.05. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons.
Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 9 Data augmentation with FreeHi-C replicates boosts the power while preserving false discovery rate control.

a. Observed false discovery rates of within sample comparisons for GM12878 data (i.e., comparisons of two biological replicates of GM12878 with another set of two biological replicates from GM12878; n = 3). This is an analog of Figure 2e for GM12878. b and c delineate cases with 2 biological replicates (n = 36) and 3 biological replicates (n = 16) per condition, respectively. Power is presented as the percentage of candidate DCIs declared as significant at the given FDR threshold. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5 x the inter-quartile range.

Supplementary Fig. 10 Percentage of top significant DCIs in the gold standard differential chromatin interaction list, that is based on the comparison of the full set of 4 replicates of GM12878 with 4 replicates of A549, under the general setting of multiple replicates per condition and the corresponding FreeHi-C data augmentation settings.

Each condition includes 3 biological replicates (n = 16). The gold standard differential chromatin interaction set is defined as the most significant interactions in the comparison of 4 replicates of GM12878 with 4 replicates of A549 thresholded at FDR of 0.001. Leveraging FreeHi-C replicates boosts the precision of top significant DCIs. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5 x the inter-quartile range.

Supplementary Fig. 11 Percentage of top significant DCIs in the gold standard differential chromatin interaction list, that is based on the comparison of the full set of 4 replicates of GM12878 with 4 replicates of A549, under the general setting of multiple replicates per condition and the corresponding FreeHi-C data augmentation settings.

Each condition includes 2 biological replicates (n = 36). The gold standard differential chromatin interaction set is defined as the most significant interactions in the comparison of 4 replicates of GM12878 with 4 replicates of A549 thresholded at FDR of 0.001. Leveraging FreeHi-C replicates boosts the precision of top significant DCIs. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 12 Evaluation of differential chromatin interactions detected by the three replicates per condition settings with external RNA-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the differentially expressed genes between GM12878 and A549 cells (n = 16). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 3 biological replicates (c), and 3 biological replicates plus one FreeHi-C simulation for each of the biological replicates (d). The dashed lines represent the observed percentages of differentially expressed genes overlapping with differential interactions. n = 10000 independent experiments.
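A randomization test of this kind compares the observed overlap percentage against a null distribution built from repeated random draws. The caption does not specify how the null draws were generated, so the sketch below rests on a stated assumption: each iteration samples, from a gene universe, a random gene set of the same size as the DE gene set and records the percentage overlapping DCI-associated genes. All names here (`randomization_test`, `universe`, `dci_genes`) are hypothetical.

```python
import random


def randomization_test(universe, de_genes, dci_genes, n_iter=10_000, seed=0):
    """Return (observed overlap percentage, empirical one-sided p-value).

    Assumed null model: random gene sets of the same size as de_genes are
    drawn without replacement from universe; the p-value is the fraction of
    draws whose overlap percentage is at least the observed one, with a +1
    correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    pool = sorted(universe)
    de = set(de_genes)
    dci = set(dci_genes)
    observed = 100.0 * len(de & dci) / len(de)
    at_least_as_extreme = 0
    for _ in range(n_iter):
        draw = rng.sample(pool, len(de))
        pct = 100.0 * sum(1 for gene in draw if gene in dci) / len(de)
        if pct >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (n_iter + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy example: 100 genes, 20 near DCIs, 10 DE genes that all fall near DCIs.
    genes = [f"g{i}" for i in range(100)]
    observed, p = randomization_test(genes, genes[:10], genes[:20], n_iter=1000)
    print(observed, p)
```

The observed percentage plays the role of the dashed line in panels b-d, and the recorded null percentages would form the histogram it is compared against.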

Supplementary Fig. 13 Evaluation of differential chromatin interactions detected by the two replicates per condition settings with external RNA-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the differentially expressed genes between GM12878 and A549 cells (n = 36). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 2 biological replicates (c), and 2 biological replicates plus one FreeHi-C simulation for each of the biological replicates (d). The dashed lines represent the observed percentage of differentially expressed genes overlapping with differential interactions. n = 10000 independent experiments.

Supplementary Fig. 14 Evaluation of differential chromatin interactions detected with external differential CTCF ChIP-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the CTCF ChIP-seq peaks (n = 36 for 2 replicates; n = 16 for 3 replicates). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differential CTCF ChIP-seq peaks are defined by peaks that are uniquely enriched in only one cell line. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 3 biological replicates (c), and 2 biological replicates (d). The dashed lines represent the observed percentage of differential CTCF ChIP-seq peaks overlapping with differential interactions. n = 10000 independent experiments.

Supplementary information

Supplementary Information

Supplementary Figs. 1–14, Table 1 and note.

Reporting Summary

Cite this article

Zheng, Y., Keleş, S. FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation. Nat Methods (2019) doi:10.1038/s41592-019-0624-3
