  • Brief Communication

FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation

Abstract

The ability to simulate high-throughput chromatin conformation (Hi-C) data is foundational for benchmarking Hi-C data analysis methods. Here we present a nonparametric strategy named FreeHi-C to simulate Hi-C data from the interacting genome fragments. Data from FreeHi-C exhibit high fidelity to biological Hi-C data. FreeHi-C boosts the precision and power of differential chromatin interaction detection through data augmentation under preserved false discovery rate control.

Fig. 1: FreeHi-C enables simulating high-fidelity Hi-C data.
Fig. 2: Data augmentation with FreeHi-C-simulated replicates improves DCI detection.

Data availability

To study the operating features of FreeHi-C, we utilized two publicly available human Hi-C datasets as examples of large genomes with independent experiments using four cell cultures, which are referred to as four biological replicates, from the GM12878 (ref. 3) cell line and another four from A549 (ref. 25). Raw FASTQ files for GM12878 were downloaded from GEO (ref. 26) under the accession code GSE63525 and raw sequences for A549 were obtained from the ENCODE portal (ref. 27) (https://www.encodeproject.org) with accession code ENCSR662QKG. For evaluation of FreeHi-C performance on small genomes, we leveraged three different stages of the red blood cell cycle of the malaria parasite P. falciparum (ref. 16). Raw sequences for P. falciparum were downloaded from GEO (ref. 26) under the accession code GSE50199. GM12878 and A549 were both processed at 40 kb resolution, and P. falciparum at 10 kb. For validating the differential interaction detection with a differential expression analysis, we utilized RNA-seq gene expression data from the ENCODE portal (accession ENCSR000AED for GM12878 and ENCSR000CTM for A549). Similarly, the CTCF ChIP–seq peak signal files were also downloaded from ENCODE under accession ENCSR000DZN for GM12878 and ENCSR000DPF for A549. The data used in this paper are summarized in Supplementary Table 1. All the simulated data used in the analysis and Juicebox (ref. 28) visualization data are available at Zenodo (https://doi.org/10.5281/zenodo.3345896).
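The Hi-C accessions and processing resolutions listed above can be gathered into a small lookup table. The following is only an illustrative sketch: the dictionary layout and the helper name `bin_index` are hypothetical and are not part of the FreeHi-C pipeline.

```python
# Hypothetical summary of the Hi-C datasets named in the Data availability text.
# Accessions and resolutions are taken from the text; the structure is an assumption.
HIC_DATASETS = {
    "GM12878": {"source": "GEO", "accession": "GSE63525", "resolution_bp": 40_000},
    "A549": {"source": "ENCODE", "accession": "ENCSR662QKG", "resolution_bp": 40_000},
    "P. falciparum": {"source": "GEO", "accession": "GSE50199", "resolution_bp": 10_000},
}


def bin_index(position_bp, resolution_bp):
    """Map a 0-based genomic coordinate to its fixed-size bin (hypothetical helper)."""
    if position_bp < 0 or resolution_bp <= 0:
        raise ValueError("position must be >= 0 and resolution must be > 0")
    return position_bp // resolution_bp


# Example: the same coordinate falls in different bins at 40 kb and 10 kb.
gm_bin = bin_index(85_000, HIC_DATASETS["GM12878"]["resolution_bp"])
pf_bin = bin_index(85_000, HIC_DATASETS["P. falciparum"]["resolution_bp"])
```

Fixed-size binning of this kind is the usual meaning of "processed at 40 kb resolution"; how FreeHi-C bins reads internally is not described in this section.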

Code availability

The FreeHi-C pipeline is implemented in Python with C-accelerated core calculations, and it fits naturally into high-performance computing environments for parallelization. The source code and instructions for running FreeHi-C are publicly available at https://github.com/keleslab/FreeHiC.

References

  1. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).

  2. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).

  3. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).

  4. Roy, S. et al. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res. 44, 1977–1978 (2016).

  5. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).

  6. Lun, A. T. & Smyth, G. K. diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data. BMC Bioinformatics 16, 258 (2015).

  7. Forcato, M. et al. Comparison of computational methods for Hi-C data analysis. Nat. Methods 14, 679–685 (2017).

  8. Ursu, O. et al. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 34, 2701–2707 (2018).

  9. Djekidel, M. N., Chen, Y. & Zhang, M. Q. FIND: differential chromatin interactions detection using a spatial Poisson process. Genome Res. 28, 412–422 (2018).

  10. Stansfield, J. C., Cresswell, K. G., Vladimirov, V. I. & Dozmorov, M. G. HiCcompare: an R package for joint normalization and comparison of Hi-C datasets. BMC Bioinformatics 19, 279 (2018).

  11. Ardakany, A. R., Ay, F. & Lonardi, S. Selfish: discovery of differential chromatin interactions via a self-similarity measure. Bioinformatics 35, i145–i153 (2019).

  12. Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 24, 999–1011 (2014).

  13. Yang, T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).

  14. DeMaere, M. Z. & Darling, A. E. Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies. GigaScience 7, gix103 (2017).

  15. DeMaere, M. Z. & Darling, A. E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 20, 46 (2019).

  16. Ay, F. et al. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res. 24, 974–988 (2014).

  17. Stansfield, J. C., Cresswell, K. G. & Dozmorov, M. G. multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments. Bioinformatics 35, 2916–2923 (2019).

  18. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).

  19. Zheng, X. & Zheng, Y. CscoreTool: fast Hi-C compartment analysis at high resolution. Bioinformatics 34, 1568–1570 (2017).

  20. Giorgetti, L. et al. Structural organization of the inactive X chromosome in the mouse. Nature 535, 575–579 (2016).

  21. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).

  22. Zheng, Y., Ay, F. & Keles, S. Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies. eLife 8, e38070 (2019).

  23. Fisher, R. A. Statistical Methods for Research Workers (Genesis Publishing Pvt Ltd, 2006).

  24. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  25. Dixon, J. R. et al. Integrative detection and analysis of structural variation in cancer genomes. Nat. Genet. 50, 1388–1398 (2018).

  26. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets update. Nucleic Acids Res. 41, D991–D995 (2012).

  27. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  28. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).

Acknowledgements

This work was supported by NIH grants HG009744, HG003747 and HG007019 to S.K.

Author information

Contributions

S.K. and Y.Z. conceived the project, designed the research and developed the method. Y.Z. developed the simulation framework and performed the experiments. Both authors contributed to the preparation of the manuscript.

Corresponding author

Correspondence to Sündüz Keleş.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Hi-C contact matrices of chromosome 1 for replicates 2, 3, 4, and 6 of GM12878.

In each panel, the first row shows the chromosome-wide Hi-C contact matrices of the original biological replicate and the corresponding FreeHi-C and Sim3C simulated replicates, with sequencing depth matching that of the original replicate. The contact matrices in the second row display zoomed-in views of the genomic interactions. The numbers at the bottom left of each matrix represent the color scale.

Supplementary Fig. 2 Genomic distance stratified comparison of the Hi-C signal for replicates 2, 3, 4, and 6 of GM12878.

Genomic distance stratified comparison of the Hi-C signal, as quantified by the number of valid read pairs, between the original biological replicates and corresponding FreeHi-C simulated replicates of varying depths for replicates 2, 3, 4, and 6 of GM12878. The sample size for each box is the number of genomic interactions on chromosome 1 whose genomic distance matches the x-axis. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range.
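The boxplot convention used in this and the following captions (median, 25th and 75th percentiles, whiskers spanning data within 1.5× the inter-quartile range) can be made concrete with a short sketch. The function name `boxplot_summary` and the choice of linear-interpolation quantiles are assumptions made for illustration; the plotting software used for the figures may compute quartiles slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Return the median, quartiles, whisker ends and outliers of a sample.

    Whiskers reach the most extreme data points lying within
    1.5 x IQR of the box limits; points beyond them are outliers.
    """
    data = sorted(float(v) for v in values)
    if len(data) < 2:
        raise ValueError("need at least two data points")
    # 'inclusive' quartiles use linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Toy read-pair counts at one genomic distance; 100 lies beyond the upper whisker.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

For the toy sample, the box spans 3 to 7, the median is 5, the whiskers end at 1 and 8, and the value 100 is reported as an outlier.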

Supplementary Fig. 3 Hi-C contact matrices and hierarchical clustering of P. falciparum for the ring, the trophozoite, and the schizont stages.

a. Column one corresponds to the original sample on chromosome 7 and columns two to three are FreeHi-C simulations at the original and twice the depth of the biological sample on chromosome 7, respectively. The numbers at the left bottom of each matrix represent the color scale. b. and c. Hierarchical clustering of the FreeHi-C replicates simulated to the same sequencing depth as the original P. falciparum schizont stage (b, up simulation) or the trophozoite stage (c, down simulation) samples. Regardless of up (b) or down (c) simulation, FreeHi-C replicates capture the known relationship between the three stages.

Supplementary Fig. 4 Differential chromatin interaction detection between biological replicates of GM12878 for evaluating the false discovery rates of diffHic and multiHiCcompare, across a series of sequencing depths using FreeHi-C and downsampling at 40kb.

Target FDR levels are delineated as the panel labels on the right-hand side. In the downsampling experiments, rep2, rep3, rep4, and rep6 of GM12878 are downsampled to 1/4, 1/2, and 3/4 of the original sequencing depths. Observed false discovery rates of diffHic and multiHiCcompare are depicted in dark purple and light purple, respectively. Biological replicates are also simulated to 1/4, 1/2, 3/4, the same as, three times, and five times the original sequencing depths with FreeHi-C. Resulting false discovery rates of simulated replicates are depicted in red and orange. For both diffHic and multiHiCcompare, a quasi-likelihood negative binomial generalized log-linear model was employed to test the coefficients with BH procedure adjustment for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n = 3.
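Downsampling a replicate to a fraction of its sequencing depth can be illustrated with binomial thinning of binned contact counts. This is a sketch under stated assumptions: the caption does not say how downsampling was carried out (it may well have been done at the read level), and the function name `thin_counts` and the dictionary-of-bin-pairs representation are hypothetical.

```python
import random


def thin_counts(contact_counts, fraction, seed=0):
    """Keep each read pair independently with probability `fraction`.

    contact_counts maps (bin_i, bin_j) -> number of valid read pairs.
    Bin pairs that lose all of their read pairs are dropped.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    thinned = {}
    for bin_pair, count in contact_counts.items():
        # One Bernoulli draw per read pair gives a binomial count.
        kept = sum(1 for _ in range(count) if rng.random() < fraction)
        if kept > 0:
            thinned[bin_pair] = kept
    return thinned


# Toy contact map thinned to the three depths named in the caption.
toy_counts = {(0, 1): 40, (0, 2): 12, (3, 3): 100}
thinned_by_depth = {f: thin_counts(toy_counts, f) for f in (0.25, 0.5, 0.75)}
```

Thinning can only lower depth, which is why the deeper settings in the caption (three and five times the original depth) require simulation rather than downsampling.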

Supplementary Fig. 5 Differential chromatin interaction detection between biological replicates of A549 for evaluating the false discovery rates of diffHic and multiHiCcompare, across a series of sequencing depths using FreeHi-C and downsampling at 40kb.

This is the analysis depicted in Supplementary Fig. 4 repeated with A549. For both diffHic and multiHiCcompare, a quasi-likelihood negative binomial generalized log-linear model was employed to test the coefficients with BH procedure adjustment for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n = 3.
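The BH (Benjamini-Hochberg, ref. 18) adjustment mentioned in these captions can be sketched in a few lines. This is a generic textbook implementation written for illustration, with the hypothetical name `bh_adjust`; it is not the code path used inside diffHic or multiHiCcompare.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The adjusted value for the p-value of rank r (1 = smallest) among m
    tests is min over ranks s >= r of p_(s) * m / s, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Four toy tests; calls with adjusted p <= 0.05 would be declared significant.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

In a within-condition comparison such as replicate versus replicate, no true differences are expected, so interactions that pass the adjusted threshold are counted as false discoveries when the observed rate is compared with the target FDR.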

Supplementary Fig. 6 Data augmentation with FreeHi-C simulated replicates improves differential chromatin interactions (DCIs) detection.

a delineates observed false discovery rates of within-sample comparisons for A549 data (i.e., comparisons of replicate(s) of A549 with other replicate(s) of A549). The dashed lines are y = x. b displays precision, computed as the percentage of top significant DCIs of each specific analysis in the gold standard differential chromatin interaction list, as a function of top-ranking DCIs. The gold standard set is defined by comparing the full set of 4 replicates of GM12878 with 4 replicates of A549 filtered by FDR ≤ 0.01. |logFC| refers to the absolute value of natural log transformed fold-change. c. Observed versus targeted FDR for differential chromatin interaction detection under variations of replicates per condition (left: two biological replicates per condition; right: one biological replicate per condition; middle: two biological replicates for one condition and FreeHi-C augmentation of the single biological replicate of the other condition). Middle panel corresponds to data augmentation with FreeHi-C and exhibits FDR control. d. Power, presented as the percentage of candidate DCIs declared as significant at the given FDR threshold, under variations on replicates per condition. Power with two replicates per condition (left panel) depicts the gold standard in this setting. e. Ranked up (green) DCIs of the uneven number of replicates per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differentially expressed (DE) genes. In contrast, DCIs that are ranked lower (blue) among the top N DCIs, N = 1000, 5000, ..., 30000, as a result of FreeHi-C augmentation do not significantly overlap with the DE genes. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. f. Ranked up (green) and down (blue) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differential CTCF peaks. 
Both sets show significant enrichment for CTCF peaks, with the exception of the top 100 DCIs. Dashed lines depict the p-value cut-off of 0.05. Differential CTCF ChIP-seq peaks are defined as peaks that are uniquely enriched in only one cell line. Differential chromatin interaction detection is performed by HiCcompare, which converts the normalized contact counts into Z-scores, and by multiHiCcompare, which uses a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). The p-values are adjusted by the Benjamini-Hochberg procedure for multiple comparisons. For all the boxplots in this figure, the center lines correspond to the medians, box limits correspond to the 25th and 75th percentiles, and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size.
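Panel b of this figure reports precision as the percentage of the top significant DCIs that appear in the gold standard list. A minimal sketch of that calculation follows; the function name `precision_at_top_k` and the representation of a DCI as a pair of bin indices are assumptions made for illustration.

```python
def precision_at_top_k(ranked_dcis, gold_standard, k):
    """Percentage of the k top-ranked DCIs found in the gold standard set.

    ranked_dcis is ordered from most to least significant;
    gold_standard is a set of DCIs (here, pairs of bin indices).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_dcis[:k]
    if not top:
        raise ValueError("ranked_dcis is empty")
    hits = sum(1 for dci in top if dci in gold_standard)
    return 100.0 * hits / len(top)


# Toy ranking: two of the four candidates belong to the gold standard.
ranked = [(1, 5), (2, 9), (3, 4), (7, 8)]
gold = {(1, 5), (3, 4)}
precision_curve = {k: precision_at_top_k(ranked, gold, k) for k in (1, 2, 4)}
```

Evaluating the function over a grid of k values yields the precision curve as a function of the number of top-ranking DCIs.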

Supplementary Fig. 7 False discovery rate control for differential chromatin interaction detection within pairs of A549 or GM12878 replicates with or without FreeHi-C augmentation.

a. Observed versus targeted FDR for differential chromatin interaction detection between pairs of A549 replicates (first column) and augmentation with 1 to 3 FreeHi-C simulated replicates (columns 2-4). These data depict the individual pairs of replicates summarized in Figure 2a. b. False discovery rate control as in Figure 2a with the GM12878 replicates (n = 6). Dashed lines are y = x. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians; box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 8 Power of differential chromatin interaction detection between pairs of A549 or GM12878 replicates with or without FreeHi-C data augmentation across a sequence of FDR thresholds.

Panels a and b report in the y-axes the percentage of candidate differential chromatin interactions and number of differential chromatin interactions, respectively. The first two columns are provided as a reference to highlight that with one replicate per condition (n = 6), the number of differential chromatin interactions detected between conditions (third column; n = 16) can be smaller than that of within conditions (number of false discoveries in the first two columns). Columns 4 to 6 illustrate the striking increase in the numbers of detected differential chromatin interactions with FreeHi-C data augmentation. The y-axes are on log scale (n = 16). c. Ranked up (green) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differentially expressed (DE) genes (n = 16). In contrast, DCIs that are ranked lower (blue) among the top N DCIs, N = 1000, 5000, ..., 25000, as a result of FreeHi-C augmentation do not significantly overlap with the DE genes. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. d. Ranked up (green) and down (blue) DCIs of the one biological replicate per condition setting as a result of FreeHi-C data augmentation are significantly enriched for differential CTCF peaks (n = 16). Both sets show significant enrichment for CTCF peaks, with the exception of significant enrichment for the ranked-up DCIs compared to non-enrichment of ranked down DCIs for differential CTCF peaks among the top 100 DCIs. Dashed lines depict p-value cutoff of 0.05. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. 
Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 9 Data augmentation with FreeHi-C replicates boosts the power while preserving false discovery rate control.

a. Observed false discovery rates of within sample comparisons for GM12878 data (i.e., comparisons of two biological replicates of GM12878 with another set of two biological replicates from GM12878; n = 3). This is an analog of Figure 2e for GM12878. b and c delineate cases with 2 biological replicates (n = 36) and 3 biological replicates (n = 16) per condition, respectively. Power is presented as the percentage of candidate DCIs declared as significant at the given FDR threshold. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.
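Power, as defined in this caption, is the percentage of candidate DCIs declared significant at a given FDR threshold. The sketch below computes it from a list of adjusted p-values; the function name `power_at_fdr` and the use of a non-strict inequality at the threshold are assumptions.

```python
def power_at_fdr(adjusted_pvalues, fdr_threshold):
    """Percentage of candidate interactions called significant at the threshold."""
    if not adjusted_pvalues:
        raise ValueError("no candidate interactions supplied")
    significant = sum(1 for p in adjusted_pvalues if p <= fdr_threshold)
    return 100.0 * significant / len(adjusted_pvalues)


# Toy adjusted p-values for four candidate DCIs, scanned over FDR thresholds.
toy_adjusted = [0.001, 0.02, 0.04, 0.2]
power_curve = {t: power_at_fdr(toy_adjusted, t) for t in (0.01, 0.05, 0.1)}
```

Scanning a sequence of thresholds in this way gives one power value per FDR level, which is the quantity summarized by the boxplots across replicate combinations.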

Supplementary Fig. 10 Percentage of top significant DCIs in the gold standard differential chromatin interaction list, that is based on the comparison of the full set of 4 replicates of GM12878 with 4 replicates of A549, under the general setting of multiple replicates per condition and the corresponding FreeHi-C data augmentation settings.

Each condition includes 3 biological replicates (n = 16). The gold standard differential chromatin interaction set is defined as the most significant interactions in the comparison of 4 replicates of GM12878 with 4 replicates of A549 thresholded at FDR of 0.001. Leveraging FreeHi-C replicates boosts the precision of top significant DCIs. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 11 Percentage of top significant DCIs in the gold standard differential chromatin interaction list, that is based on the comparison of full set of 4 replicates of GM12878 with 4 replicates of A549, under the general setting of multiple replicates per condition and the corresponding FreeHi-C data augmentation settings.

Each condition includes 2 biological replicates (n = 36). The gold standard differential chromatin interaction set is defined as the most significant interactions in the comparison of 4 replicates of GM12878 with 4 replicates of A549 thresholded at FDR of 0.001. Leveraging FreeHi-C replicates boosts the precision of top significant DCIs. Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range.

Supplementary Fig. 12 Evaluation of differential chromatin interactions detected by the three replicates per condition settings with external RNA-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the differentially expressed genes between GM12878 and A549 cells (n = 16). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 3 biological replicates (c), and 3 biological replicates plus one FreeHi-C simulation for each of the biological replicates (d). The dashed lines represent the observed percentages of differentially expressed genes overlapping with differential interactions. n = 10000 independent experiments.
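The randomization tests in panels b-d compare an observed overlap percentage with a null distribution built from 10,000 independent draws. The caption does not specify how the null sets were generated, so the following is only a sketch under one possible assumption: DCI-containing bins are replaced by an equal number of bins drawn uniformly at random. The function name and the bin-level representation are hypothetical.

```python
import random


def overlap_randomization_test(de_gene_bins, dci_bins, n_bins, n_draws=10000, seed=0):
    """Empirical p-value for the share of DE-gene bins that overlap DCI bins.

    Null model (an assumption, not taken from the source): the same number
    of bins as in dci_bins is drawn without replacement from range(n_bins).
    """
    de = set(de_gene_bins)
    dci = set(dci_bins)
    if not de or not dci:
        raise ValueError("both bin sets must be non-empty")
    rng = random.Random(seed)
    observed = 100.0 * len(de & dci) / len(de)
    at_least_as_extreme = 0
    for _ in range(n_draws):
        random_bins = set(rng.sample(range(n_bins), len(dci)))
        null_value = 100.0 * len(de & random_bins) / len(de)
        if null_value >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    p_value = (1 + at_least_as_extreme) / (1 + n_draws)
    return observed, p_value


# Toy case: every DE-gene bin coincides with a DCI bin among 1,000 bins.
observed_pct, p_value = overlap_randomization_test(
    de_gene_bins=range(10), dci_bins=range(10), n_bins=1000, n_draws=200
)
```

The observed percentage plays the role of the dashed line in the panels, and the p-value is the share of random draws that reach or exceed it.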

Supplementary Fig. 13 Evaluation of differential chromatin interactions detected by the two replicates per condition settings with external RNA-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the differentially expressed genes between GM12878 and A549 cells (n = 36). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differentially expressed genes are detected by DESeq2 based on the negative binomial model. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 2 biological replicates (c), and 2 biological replicates plus one FreeHi-C simulation for each of the biological replicates (d). The dashed lines represent the observed percentage of differentially expressed genes overlapping with differential interactions. n = 10000 independent experiments.

Supplementary Fig. 14 Evaluation of differential chromatin interactions detected with external differential CTCF ChIP-seq data of GM12878 and A549 cells.

a. Significance of overlap of DCIs detected at varying FDRs with the CTCF ChIP-seq peaks (n = 36 for 2 replicates; n = 16 for 3 replicates). Differential chromatin interaction detection is performed by multiHiCcompare using a quasi-likelihood negative binomial generalized log-linear model to test the coefficients (one-sided test). P-values have been adjusted by BH procedure for multiple comparisons. Differential CTCF ChIP-seq peaks are defined by peaks that are uniquely enriched in only one cell line. Center lines of the boxplots correspond to the medians, box limits correspond to the 25th and 75th percentiles and whiskers comprise all data points within 1.5× the inter-quartile range. n refers to the sample size. b-d. are the individual randomization test results for the corresponding comparisons in a: 4 biological replicates (b), 3 biological replicates (c), and 2 biological replicates (d). The dashed lines represent the observed percentage of differential CTCF ChIP-seq peaks overlapping with differential interactions. n = 10000 independent experiments.

Supplementary information

Supplementary Information

Supplementary Figs. 1–14, Table 1 and note.

Reporting Summary

About this article

Cite this article

Zheng, Y., Keleş, S. FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation. Nat Methods 17, 37–40 (2020). https://doi.org/10.1038/s41592-019-0624-3

  • DOI: https://doi.org/10.1038/s41592-019-0624-3
