DeepC: predicting 3D genome folding using megabase-scale transfer learning

Abstract

Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.

Fig. 1: Predicting Hi-C interactions from DNA sequence.
Fig. 2: Validation of deepC predictions.
Fig. 3: DeepC for dissecting the determinants of genome folding and predicting the impact of variation.

Data availability

Hi-C data from Rao et al. is available under GSE63525. Chromatin feature data from ENCODE, Roadmap and other publicly available data are listed in detail with accession numbers in Supplementary Table 1. Additional ENCODE data used for chromatin segmentation and visualization are listed with accession numbers in Supplementary Table 5. Tiled-C and NG Capture-C validation data are available under the Gene Expression Omnibus superseries GSE137437. Source data are provided with this paper.

Code availability

All code for training and employing deepC networks, as well as trained models, is available at https://github.com/rschwess/deepC; all code for training and employing chromatin feature networks is available at https://github.com/rschwess/deepHaem.

References

  1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
  2. Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
  3. Schreiber, J., Libbrecht, M., Bilmes, J. & Noble, W. S. Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. Preprint at bioRxiv https://doi.org/10.1101/103614 (2017).
  4. Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496 (2016).
  5. Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47, e60 (2019).
  6. Qi, Y. & Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS Comput. Biol. 15, e1007024 (2019).
  7. Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D. & Fishman, V. Quantitative prediction of enhancer–promoter interactions. Genome Res. 30, 72–84 (2020).
  8. Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun. 10, 5449 (2019).
  9. Buckle, A., Brackley, C. A., Boyle, S., Marenduzzo, D. & Gilbert, N. Polymer simulations of heteromorphic chromatin predict the 3D folding of complex genomic loci. Mol. Cell 72, 786–797.e11 (2018).
  10. Bianco, S. et al. Polymer physics predicts the effects of structural variants on chromatin architecture. Nat. Genet. 50, 662–667 (2018).
  11. Hnisz, D., Day, D. S. & Young, R. A. Insulated neighborhoods: structural and functional units of mammalian gene control. Cell 167, 1188–1200 (2016).
  12. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
  13. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
  14. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
  15. Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. Preprint at http://arxiv.org/abs/1511.07122 (2015).
  16. van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
  17. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 4, 3320–3328 (2014).
  18. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  19. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  20. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
  21. Bonev, B. et al. Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
  22. Zhang, Y. et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat. Commun. 9, 750 (2018).
  23. Liu, Q., Lv, H. & Jiang, R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics 35, i99–i107 (2019).
  24. Davies, J. O. J. et al. Multiplexed analysis of chromosome conformation at vastly improved sensitivity. Nat. Methods 13, 74–80 (2016).
  25. Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).
  26. Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015).
  27. Shin, H. et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 44, e70 (2015).
  28. Zufferey, M., Tavernari, D., Oricchio, E. & Ciriello, G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 19, 217 (2018).
  29. Oudelaar, A. M. et al. Dissection of the 4D chromatin structure of the α-globin locus through in vivo erythroid differentiation with extreme spatial and temporal resolution. Preprint at bioRxiv https://doi.org/10.1101/763763 (2019).
  30. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. 2nd International Conference on Learning Representations (ICLR 2014) Workshop Track (2013).
  31. Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).
  32. Schmiedel, B. J. et al. 17q21 asthma-risk variants switch CTCF binding and regulate IL-2 production by T cells. Nat. Commun. 7, 13426 (2016).
  33. Robson, M. I., Ringel, A. R. & Mundlos, S. Regulatory landscaping: how enhancer-promoter communication is sculpted in 3D. Mol. Cell 74, 1110–1122 (2019).
  34. Dixon, J. R., Gorkin, D. U. & Ren, B. Chromatin domains: the unit of chromosome organization. Mol. Cell 62, 668–680 (2016).
  35. Marti-Renom, M. A. et al. Challenges and guidelines toward 4D nucleome data and model standards. Nat. Genet. 50, 1352–1358 (2018).
  36. Spielmann, M., Lupiáñez, D. G. & Mundlos, S. Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018).
  37. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods https://doi.org/10.1038/s41592-020-0958-x (2020).
  38. Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
  39. Schwessinger, R. et al. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res. 27, 1730–1742 (2017).
  40. Downes, D. J. et al. An integrated platform to systematically identify causal variants and genes for polygenic human traits. Preprint at bioRxiv https://doi.org/10.1101/813618 (2019).
  41. Telenius, J., Consortium, T. W. & Hughes, J. R. NGseqBasic—a single-command UNIX tool for ATAC-seq, DNaseI-seq, Cut-and-Run, and ChIP–seq data mapping, high-resolution visualisation, and quality control. Preprint at bioRxiv https://doi.org/10.1101/393413 (2018).
  42. Zhang, Y. et al. Model-based analysis of ChIP–Seq (MACS). Genome Biol. 9, R137 (2008).
  43. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Multimed. Tools Appl. 77, 10437–10453 (2015).
  44. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
  45. Telenius, J. M. et al. CaptureCompendium: a comprehensive toolkit for 3C analysis. Preprint at bioRxiv http://doi.org/10.1101/2020.02.17.952572 (2020).
  46. Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
  47. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  48. Zacher, B. et al. Accurate promoter and enhancer identification in 127 ENCODE and roadmap epigenomics cell types and tissues by GenoSTAN. PLoS ONE 12, e0169249 (2017).
  49. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) 265–284 (2016).
  50. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  51. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  52. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
  53. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
  54. Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009).
  55. Zeileis, A. & Grothendieck, G. Zoo: S3 infrastructure for regular and irregular time series. J. Stat. Softw. 14, 1–27 (2005).
  56. van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011).

Acknowledgements

We thank R. Beagrie for help in refining the manuscript. This work was supported by the MRC (grant no. MC_UU_00016/14 to J.R.H.), the Wellcome Trust via a Strategic Award (no. 106130/Z/14/Z to J.R.H.) and the Institutional Strategic Support Fund (reference no. 105605/Z/14/Z to J.R.H.), the Wellcome Trust Genomic Medicine and Statistics PhD Program (grant nos. 203728/Z/16/Z to R.S. and 203141/Z/16/Z to R.C.B.) and the Stevenson Junior Research Fellowship at University College, Oxford (to A.M.O.). G.L. is supported by a Wellcome Trust supporting award (no. 090532/Z/09/Z). Y.W.T. is supported by the European Research Council under the European Union’s Seventh Framework Program (grant no. FP7/2007-2013), ERC grant agreement no. 617071. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS.

Author information

Contributions

R.S., G.L. and J.R.H. conceived the project. R.S., R.C.B., Y.W.T. and G.L. designed the neural network architectures. R.S. optimized and trained the neural networks and performed downstream analysis. R.S., M.G., D.D., A.M.O. and J.R.H. designed and evaluated the validation strategy. M.G. performed NG Capture-C experiments. D.D. performed Tiled-C experiments. R.S., A.M.O. and J.T. performed bioinformatic analysis of NG Capture-C and Tiled-C. R.S. performed integrative analysis and prepared the figures. R.S., G.L. and J.R.H. wrote the manuscript with input from all authors.

Corresponding authors

Correspondence to Gerton Lunter or Jim R. Hughes.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Percentile normalizing Hi-C data for deep learning.

The Hi-C interactions are percentile-binned in a distance-stratified manner. For every genomic distance, in steps equal to the bin size, the Hi-C signal is split into unequally sized percentile classes, ranging from the bottom 20% to the top 5%. The classes are assigned the values 1 to 10, yielding the Hi-C skeleton. The unequal class sizes give a finer distinction among high Hi-C interaction values, while minor differences among low interaction values are compressed. Effectively, this procedure reduces the proximity signal and enhances domains and domain boundaries.
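The binning described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the deepC implementation: the function name `percentile_skeleton` is hypothetical, the input is assumed to be a dense symmetric contact matrix, and only the 20% bottom and 5% top class sizes come from the caption. The intermediate class widths are assumptions chosen so that the ten classes sum to 100%.

```python
import numpy as np

# Assumed class widths in percent, from lowest to highest signal.
# Only the first (20) and last (5) values are stated in the caption;
# the widths in between are illustrative.
CLASS_WIDTHS = (20, 20, 15, 10, 10, 5, 5, 5, 5, 5)


def percentile_skeleton(hic, widths=CLASS_WIDTHS):
    """Bin a symmetric Hi-C matrix into classes 1..len(widths), one diagonal at a time."""
    hic = np.asarray(hic, dtype=float)
    n = hic.shape[0]
    # Upper percentile edge of every class except the last (20, 40, 55, ...).
    edges_pct = np.cumsum(widths)[:-1]
    skeleton = np.zeros((n, n), dtype=int)
    for d in range(n):  # d is the genomic distance in bins
        diag = np.diagonal(hic, offset=d)
        thresholds = np.percentile(diag, edges_pct)
        # Class = 1 + number of thresholds strictly below the value.
        classes = 1 + np.searchsorted(thresholds, diag, side="left")
        idx = np.arange(n - d)
        skeleton[idx, idx + d] = classes
        skeleton[idx + d, idx] = classes
    return skeleton


# Usage on a random symmetric matrix standing in for real Hi-C counts.
rng = np.random.default_rng(0)
toy = rng.random((100, 100))
toy = (toy + toy.T) / 2
skeleton = percentile_skeleton(toy)
```

Because thresholds are computed separately for each diagonal, every genomic distance receives the full range of class values, which is what removes the distance-decay (proximity) signal.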

Extended Data Fig. 2 Comparison of deepC training with and without transfer learning.

Training a deepC model with the same architecture, but without pre-seeding the lower convolutional layers with the chromatin feature model weights, results in the emergence of triangular structures. Their positions, however, do not match the Hi-C structures. In contrast, with pre-seeding the predicted domains overlap well with the Hi-C skeleton.

Extended Data Fig. 3 Tissue-specific deepC predictions.

Shown is a region on chromosome 2 around the MEIS1 locus. DeepC predicts a small domain with insulation from the upstream regions (black arrow) in a tissue-specific manner. The domain is only visible in K562 Hi-C data and matches tissue-specific CTCF binding. Tiled-C confirms the tissue-specific domain. For contrast, Tiled-C data were bounded between the 5th and 95th percentiles.

Extended Data Fig. 4 NG Capture-C validation of deepC predictions.

a) Example region with overlap of GM12878: Hi-C, skeleton and deepC prediction; NG Capture-C tracks, distance-normalized NG Capture-C tracks and CTCF ChIP-seq track (red). Shown are a CTCF viewpoint (purple triangle) and an intra-domain viewpoint (blue triangle) that does not overlap any active elements. Dashed lines in the distance-normalized NG Capture-C tracks indicate the expected interaction value. Dotted black lines highlight deepC prediction details that correspond to boundaries in the NG Capture-C tracks. b) K562 data of the same region.

Extended Data Fig. 5 Mapping important features for genome folding.

Shown are GM12878 deepC predictions over the IKZF2 locus on chromosome 2 (a) and focused on the IKZF2 promoter (b). Aligned are DHS as well as ChIP-seq tracks for CTCF and histone modifications. Shown in green is the saliency score, a proxy for the importance of each base in predicting the chromatin interactions of that region. The saliency score shows sharp peaks overlapping CTCF binding sites and broader peaks overlapping active gene promoters. Resolving the saliency score at base-pair resolution (b) highlights CTCF and general transcription factor binding motifs.

Supplementary information

Supplementary Information

Supplementary Figs. 1–21 and Tables 3–5.

Reporting Summary

Supplementary Tables

Supplementary Table 1. Source of human and mouse chromatin features used. Supplementary Table 2. NG Capture-C validation probes.

Source data

Source Data Fig. 1

Distance-stratified Pearson correlation between prediction and Hi-C skeleton across chromosomes in a cross-validation scheme, smoothed and unsmoothed (M0–M2).
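A distance-stratified Pearson correlation can be computed by correlating the two maps one diagonal at a time. The sketch below is a minimal NumPy version under stated assumptions: the function name `distance_stratified_pearson` is hypothetical, and both the prediction and the skeleton are assumed to be dense square matrices of equal shape, which may differ from how the source data were actually laid out.

```python
import numpy as np


def distance_stratified_pearson(pred, target, max_distance=None):
    """Return {distance_in_bins: Pearson r}, computed diagonal by diagonal."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape or pred.shape[0] != pred.shape[1]:
        raise ValueError("pred and target must be square matrices of equal shape")
    n = pred.shape[0]
    if max_distance is None:
        max_distance = n - 1
    result = {}
    for d in range(min(max_distance, n - 1) + 1):
        x = np.diagonal(pred, offset=d)
        y = np.diagonal(target, offset=d)
        # Pearson r is undefined for fewer than two points or constant input.
        if x.size < 2 or x.std() == 0 or y.std() == 0:
            result[d] = float("nan")
        else:
            result[d] = float(np.corrcoef(x, y)[0, 1])
    return result


# Usage with random matrices standing in for a prediction and a skeleton.
rng = np.random.default_rng(1)
target = rng.random((50, 50))
pred = target + 0.1 * rng.random((50, 50))
per_distance = distance_stratified_pearson(pred, target, max_distance=20)
```

Stratifying by distance keeps the strong dependence of contact frequency on genomic separation from inflating a single genome-wide correlation value.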

Source Data Fig. 2

a, Pearson correlation between NG Capture-C profiles and virtual4C profiles from the Hi-C skeleton and predictions, respectively. b1, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878. b2, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878.

Source Data Fig. 3

a, Meta-profiles of saliency over stratified genomic elements. b, Predicted effects of genomic element deletion screen. c, Predicted effect of 1,000 randomly sampled SNPs within CTCF sites.

About this article

Cite this article

Schwessinger, R., Gosden, M., Downes, D. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 17, 1118–1124 (2020). https://doi.org/10.1038/s41592-020-0960-3
