Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale

Abstract

To learn how genomic sequence influences multiscale three-dimensional (3D) genome architecture, this manuscript presents a sequence-based deep-learning approach, Orca, that predicts 3D genome architecture directly from sequence, from kilobase to whole-chromosome scale. Orca captures the sequence dependencies of structures including chromatin compartments and topologically associating domains, as well as diverse types of interactions, from CTCF-mediated to enhancer–promoter and Polycomb-mediated interactions, with cell-type specificity. Orca enables various applications, including predicting structural variant effects on multiscale genome organization, and it recapitulated the effects of experimentally studied variants of varying sizes (300 bp to 90 Mb). Moreover, Orca enables in silico virtual screens to probe the sequence basis of 3D genome organization at different scales. At the submegabase scale, it predicted specific transcription factor motifs underlying cell-type-specific genome interactions. At the compartment scale, virtual screens of sequence activities suggest a model for the sequence basis of chromatin compartments with a prominent role for transcription start sites.

Fig. 1: Predicting multiscale 3D genome architecture from sequence.
Fig. 2: Multiscale sequence-based prediction of SV effects on genome structure.
Fig. 3: Identification of cell-type-specific motifs that underlie predicted submegabase-scale genome interactions.
Fig. 4: Virtual screen profiling of sequence dependencies of chromatin compartments identifies a prominent role of TSS sequences.

Data availability

The GRCh38/hg38 reference genome and 3D genome datasets under 4DN accession numbers 4DNFI9GMP2J8, 4DNFI643OYP9 and 4DNFILP99QJS were used for training the Orca sequence models. All coordinates in the manuscript refer to GRCh38/hg38 unless otherwise indicated. SV experimental validation datasets were downloaded from NCBI GEO accessions GSE137372, GSE66383, GSE78109 and EBI ENA accession PRJEB5236. Data used and generated in this manuscript were also deposited into Zenodo: https://zenodo.org/record/6234936, https://zenodo.org/record/4594676 and https://zenodo.org/record/6227750.

Code availability

All code, models and data for running Orca are available from the GitHub repository https://github.com/jzhoulab/orca (https://doi.org/10.5281/zenodo.6257290). A user-friendly web server is available at https://orca.zhoulab.io. The manuscript analysis code is available at https://github.com/jzhoulab/orca_manuscript (https://doi.org/10.5281/zenodo.6257292).

References

  1. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).

  2. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).

  3. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).

  4. van Steensel, B. & Furlong, E. E. M. The role of transcription in shaping the spatial organization of the genome. Nat. Rev. Mol. Cell Biol. 20, 327–337 (2019).

  5. Kosak, S. T. et al. Subnuclear compartmentalization of immunoglobulin loci during lymphocyte development. Science 296, 158–162 (2002).

  6. Dixon, J. R. et al. Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331–336 (2015).

  7. Amat, R. et al. Rapid reversible changes in compartments and local chromatin organization revealed by hyperosmotic shock. Genome Res. 29, 18–28 (2019).

  8. Sima, J. et al. Identifying cis elements for spatiotemporal control of mammalian DNA replication. Cell 176, 816–830.e18 (2019).

  9. Alipour, E. & Marko, J. F. Self-organization of domain structures by DNA-loop-extruding enzymes. Nucleic Acids Res. 40, 11202–11212 (2012).

  10. Fudenberg, G., Abdennur, N., Imakaev, M., Goloborodko, A. & Mirny, L. A. Emerging evidence of chromosome folding by loop extrusion. Cold Spring Harb. Symp. Quant. Biol. 82, 45–55 (2017).

  11. Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).

  12. Sanborn, A. L. et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl Acad. Sci. USA 112, E6456–E6465 (2015).

  13. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).

  14. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).

  15. Krietenstein, N. et al. Ultrastructural details of mammalian chromosome architecture. Mol. Cell 78, 554–565.e7 (2020).

  16. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).

  17. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  18. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. https://doi.org/10.1101/gr.200535.115 (2016).

  19. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. https://doi.org/10.1038/s41588-018-0160-6 (2018).

  20. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  21. Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods. https://doi.org/10.1038/s41592-019-0360-8 (2019).

  22. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  23. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020).

  24. Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124 (2020).

  25. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).

  26. Abdennur, N. & Mirny, L. A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics 36, 311–316 (2020).

  27. Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. https://doi.org/10.1038/ng.3834 (2017).

  28. Zhang, D. et al. Alteration of genome folding via contact domain boundary insertion. Nat. Genet. 52, 1076–1087 (2020).

  29. Suzukawa, K. et al. Identification of a breakpoint cluster region 3′ of the ribophorin I gene at 3q21 associated with the transcriptional activation of the EVI1 gene in acute myelogenous leukemias with inv(3)(q21q26). Blood 84, 2681–2688 (1994).

  30. Gröschel, S. et al. A single oncogenic enhancer rearrangement causes concomitant EVI1 and GATA2 deregulation in leukemia. Cell 157, 369–381 (2014).

  31. Lupiáñez, D. G. et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 161, 1012–1025 (2015).

  32. Franke, M. et al. Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538, 265–269 (2016).

  33. Croft, B. et al. Human sex reversal is caused by duplication or deletion of core enhancers upstream of SOX9. Nat. Commun. 9, 5319 (2018).

  34. Young, R. A. Control of the embryonic stem cell state. Cell 144, 940–954 (2011).

  35. Vierbuchen, T. et al. AP-1 transcription factors and the BAF complex mediate signal-dependent enhancer selection. Mol. Cell 68, 1067–1082.e12 (2017).

  36. Rao, S. S. P. et al. Cohesin loss eliminates all loop domains. Cell. https://doi.org/10.1016/j.cell.2017.09.026 (2017).

  37. Belaghzal, H. et al. Liquid chromatin Hi-C characterizes compartment-dependent chromatin interaction dynamics. Nat. Genet. https://doi.org/10.1038/s41588-021-00784-4 (2021).

  38. Meuleman, W. et al. Constitutive nuclear lamina-genome interactions are highly conserved and associated with A/T-rich sequence. Genome Res. 23, 270–280 (2013).

  39. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

  40. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).

  41. Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).

  42. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Preprint at bioRxiv. https://doi.org/10.1101/2021.07.29.454384 (2021).

  43. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).

  44. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. & Wilson, A. G. Averaging weights leads to wider optima and better generalization. Preprint at https://arxiv.org/abs/1803.05407 (2018).

  45. Chen, T., Xu, B., Zhang, C. & Guestrin, C. Training deep nets with sublinear memory cost. Preprint at https://arxiv.org/abs/1604.06174 (2016).

  46. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D1284 (2018).

  47. Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).

Acknowledgements

This work was performed using the high-performance computing resources, supported by the BioHPC, at the University of Texas Southwestern Medical Center. J.Z. is supported by the Cancer Prevention and Research Institute of Texas grant (no. RR190071), National Institutes of Health grant no. DP2GM146336 and the UT Southwestern Endowed Scholars program. The author thanks C. Park and K. Chen for feedback on an early draft of this manuscript.

Author information

Contributions

J.Z. conceived and designed the study, developed the computational methods, performed the analysis and wrote the manuscript.

Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Competing interests

The author declares no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of Orca model predictions for the HFF cell type.

a). A multiscale sequence-based prediction example zooming from whole-chromosome into a position on a holdout test chromosome. Predictions from 1–256 Mb scales are compared with micro-C experimental observations. Missing values in micro-C data are shown in gray, and these regions are also indicated in the 64–256 Mb prediction heatmaps because predictions at major assembly gaps or unmappable regions are of unknown accuracy. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. b). Scatter plot comparison of the predicted interaction scores with the micro-C measured interaction scores (log fold over background) on the holdout test chromosomes. 10,000 randomly subsampled scores are shown in each panel. The overall Pearson correlations across the entire test chromosomes are annotated. The genome interactions are represented by the log fold over background scores for both prediction and experimental data. Predictions for 1–32 Mb levels are from the Orca-32Mb model and 64–256 Mb levels are from the Orca-256Mb model.
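The log fold over genomic-distance-based background score used in these panels can be illustrated with a minimal sketch. It assumes that the background at each genomic distance is simply the mean of the corresponding diagonal of the contact matrix, and it adds a small pseudocount `eps`; both choices, like the function names, are illustrative assumptions, and the normalization used by Orca may differ. A plain Pearson correlation, as annotated in panel b, is included for comparing predicted and observed scores.

```python
import math


def distance_background(matrix):
    """Mean contact value at each genomic distance (diagonal offset)."""
    n = len(matrix)
    background = []
    for d in range(n):
        values = [matrix[i][i + d] for i in range(n - d)]
        background.append(sum(values) / len(values))
    return background


def log_fold_over_background(matrix, eps=1e-9):
    """Log ratio of every entry to the mean of its own diagonal.

    eps is a hypothetical pseudocount that avoids log(0).
    """
    n = len(matrix)
    background = distance_background(matrix)
    return [
        [
            math.log((matrix[i][j] + eps) / (background[abs(i - j)] + eps))
            for j in range(n)
        ]
        for i in range(n)
    ]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy symmetric contact matrix (made-up numbers, not real micro-C data).
contacts = [
    [8.0, 4.0, 1.0],
    [4.0, 8.0, 2.0],
    [1.0, 2.0, 8.0],
]
scores = log_fold_over_background(contacts)
```

In this toy matrix the first off-diagonal has mean 3, so the entry with value 4 scores above zero (enriched relative to distance) and the entry with value 2 scores below zero (depleted).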

Extended Data Fig. 2 Performance of Orca model predictions for cross-cell-type genome interaction difference.

a). Scatter plot comparison of the predicted cell type differences of genome interactions (H1-ESC - HFF) with the micro-C measured interaction score differences on the holdout chromosomes. 10,000 randomly subsampled scores are shown in each panel. The overall Pearson correlations across the entire test chromosomes are annotated. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. b). Prediction performance for position pairs with the strongest absolute log-fold differences between the two cell types (top 1 percentile). The performance of models predicting the cell type labels (the cell type with stronger interaction) is measured by the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUROC) is annotated. The AUROC score can be interpreted as the probability of a randomly selected positive example (that is stronger in HFF) being ranked higher than a randomly selected negative example (that is stronger in H1-ESC). Predictions for 1–32 Mb levels are from the Orca-32Mb models and 64–256 Mb levels are from the Orca-256Mb models.
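The ranking interpretation of the AUROC given in panel b can be checked directly on a small example. The sketch below computes the AUROC as the fraction of positive and negative pairs in which the positive example receives the higher score, counting ties as one half; the function name and the toy scores are illustrative assumptions and are not taken from the Orca code.

```python
def auroc_by_ranking(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    labels use 1 for the positive class and 0 for the negative class.
    Tied scores count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predicted scores; label 1 means the interaction is stronger in HFF.
toy_scores = [0.9, 0.4, 0.7, 0.2]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc_by_ranking(toy_scores, toy_labels)  # 3 of 4 pairs ranked correctly
```

The pairwise loop is quadratic in the number of examples, so it is suited to illustration rather than to the full set of position pairs; a rank-based formula gives the same value more efficiently.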

Extended Data Fig. 3 Example Orca predictions of Polycomb-mediated interactions.

Predicted and observed H1-ESC and HFF genome interactions for two regions from a holdout chromosome, a). chr10:116850000-117850000 and b). chr10:100450000-101450000 are shown. The predicted and observed Polycomb-mediated interactions are marked with black triangles. ChIP-seq signal tracks for CTCF and H3K27me3 for the two cell types are also shown. Polycomb-mediated interactions are predicted to be specific to H1-ESC in both examples, consistent with experimental micro-C and ChIP-seq data.

Extended Data Fig. 4 Example Orca predictions of promoter-enhancer interactions.

Predicted and observed H1-ESC and HFF genome interactions for two regions from holdout chromosomes, a) chr8:127400000-128400000 and b) chr9:94360000-95360000 are shown. The predicted and observed enhancer-promoter interactions are marked (promoter positions or promoter-promoter interactions are marked with red triangles, enhancer-promoter or enhancer-enhancer interactions are marked with black triangles; we only marked a subset of all interactions observed). ChIP-seq signal tracks for CTCF and H3K4me3, H3K27ac, and H3K4me1 for the two cell types are also shown. The predicted enhancer-promoter interactions are consistent with micro-C observations and enhancer histone mark signal from ChIP-seq data.

Extended Data Fig. 5 Visualized predictions of transposon-mediated boundary element insertion effects in multiple insertion sites.

All insertions with previously categorized effects (boundary creation, boundary strengthening, and no domain-level effect) in Zhang et al. (ref. 28) are shown. The experimental measurements by in situ Hi-C in HAP1 cells are compared with H1-ESC model predictions. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. Arrows indicate the insertion sites. The genome coordinates are in hg19.

Extended Data Fig. 6 Comparison of Orca prediction with Capture Hi-C experimental measurement for structural variants from Franke et al. 2016.

Capture Hi-C data from mice with SVs are compared with predictions for effects of equivalent human structural variants. Predicted log fold over background scores at the 4 Mb level are scaled with the distance-expectation curve from capture Hi-C.

Extended Data Fig. 7 Multiplexed in silico mutagenesis screen results are highly correlated with single-mutation in silico mutagenesis screen results.

a). Predicted structural impact scores (1 Mb) of single disruptions (left) and multiplexed disruptions are shown on the y-axis, with disruption positions on the x-axis. 10 bp disruption sites screened cover the center 0.8 Mb of the 1 Mb region. The first three rows are three independent runs (for single disruption only the disrupted sequences are random across the runs, and for multiplexed disruption both the multiplex design of disruption sites and the disrupted sequences are random), and the last row shows the minimum of the three at each position. b). Relationship between the correlation of single and multiplexed disruption profiles (y-axis) and the number of runs combined (x-axis).
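The last row of panel a, the minimum of the three runs at each position, amounts to an elementwise minimum over the per-position score profiles. A minimal sketch follows, assuming each run is a list of scores indexed by disruption position; the function name and the numbers are made up for illustration.

```python
def combine_runs_by_minimum(runs):
    """Elementwise minimum across runs of equal length."""
    if not runs:
        raise ValueError("at least one run is required")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must cover the same positions")
    return [min(values) for values in zip(*runs)]


# Three hypothetical runs of impact scores over the same three positions.
toy_runs = [
    [0.10, 0.50, 0.30],
    [0.20, 0.40, 0.60],
    [0.05, 0.70, 0.20],
]
combined = combine_runs_by_minimum(toy_runs)
```

Here `combined` is `[0.05, 0.4, 0.2]`: a position keeps a high score only if every run scored it highly.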

Extended Data Fig. 8 Visualization of virtual screen sequence activity on chromatin compartment alteration.

A subset of 1,000 contiguous source sequences is shown, out of all 27,981 source sequences (12,800 bp each) covering chr8, chr9 and chr10. Target locations are ordered by the main mode of compartment change detected at the target site (from top: A>B to bottom: B>A), which is quantified by the loading of the first principal component of the whole sequence structural impact score (32 Mb) matrix.

Extended Data Fig. 9 Random sequence permutation effects on sequence compartment A and compartment B activity.

Comparison of chromatin compartment activities of 25,600 bp sequences permuted with different segment lengths (at each permutation segment length, 2 bp, 4 bp, …, 256 bp, every 25,600 bp sequence is divided into segments, and the segments are then randomly shuffled and concatenated). Compartment B activity is compared with sequence A/T content at the same locations.
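The segment permutation described in this caption (divide, shuffle, concatenate) can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the manuscript's implementation: the function names are hypothetical, a trailing segment shorter than `segment_length` is kept as its own segment, and the example sequence is a short made-up string instead of a 25,600 bp genomic sequence.

```python
import random


def shuffle_segments(sequence, segment_length, rng):
    """Cut the sequence into fixed-length segments, shuffle, and rejoin."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    segments = [
        sequence[i:i + segment_length]
        for i in range(0, len(sequence), segment_length)
    ]
    rng.shuffle(segments)  # in-place shuffle using the supplied generator
    return "".join(segments)


def at_fraction(sequence):
    """Fraction of A or T bases; any other character counts as non-A/T."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("A") + upper.count("T")) / len(upper)


toy_sequence = "ACGTTGCAAT"
toy_rng = random.Random(0)  # fixed seed so the shuffle is reproducible
toy_permuted = shuffle_segments(toy_sequence, 2, toy_rng)
```

Shuffling segments only reorders bases, so the A/T fraction of the whole sequence is unchanged at every segment length; what the permutation removes is sequence context longer than one segment.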

Extended Data Fig. 10 Predicted effects of disrupting genomic regions by randomly permuting sequences.

At each disruption site indicated by an arrow, the 1.28 Mb sequence centered at the position is permuted in 4 bp segments. Permuted compartment A sequences show B compartment interaction patterns, while disrupted compartment B sequences remain in the B compartment.

Supplementary information

Supplementary Information

Supplementary Figs. 1–25.

Reporting Summary

Supplementary Table 1

Supplementary Tables 1–6.

Supplementary Data 1

Sequence-based multiscale genome interaction prediction examples for H1-ESC and HFF cells randomly sampled from the holdout chromosomes.

Supplementary Data 2

Predicted multiscale structural variant effects for all transposon insertion sites tested.

Supplementary Data 3

Predicted multiscale structural variant effects for all structural variants tested.

Cite this article

Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat Genet 54, 725–734 (2022). https://doi.org/10.1038/s41588-022-01065-4

  • DOI: https://doi.org/10.1038/s41588-022-01065-4
