Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale

Zhou, Jian

doi:10.1038/s41588-022-01065-4

Technical Report
Published: 12 May 2022

Jian Zhou ORCID: orcid.org/0000-0002-3721-4550¹

Nature Genetics volume 54, pages 725–734 (2022)

Abstract

To learn how genomic sequence influences multiscale three-dimensional (3D) genome architecture, this manuscript presents a sequence-based deep-learning approach, Orca, that predicts directly from sequence the 3D genome architecture from kilobase to whole-chromosome scale. Orca captures the sequence dependencies of structures including chromatin compartments and topologically associating domains, as well as diverse types of interactions from CTCF-mediated to enhancer–promoter interactions and Polycomb-mediated interactions with cell-type specificity. Orca enables various applications including predicting structural variant effects on multiscale genome organization and it recapitulated effects of experimentally studied variants at varying sizes (300 bp to 90 Mb). Moreover, Orca enables in silico virtual screens to probe the sequence basis of 3D genome organization at different scales. At the submegabase scale, it predicted specific transcription factor motifs underlying cell-type-specific genome interactions. At the compartment scale, virtual screens of sequence activities suggest a model for the sequence basis of chromatin compartments with a prominent role of transcription start sites.

**Fig. 1: Predicting multiscale 3D genome architecture from sequence.**

**Fig. 2: Multiscale sequence-based prediction of SV effects on genome structure.**

**Fig. 3: Identification of cell-type-specific motifs that underlie predicted submegabase-scale genome interactions.**

**Fig. 4: Virtual screen profiling of sequence dependencies of chromatin compartments identifies a prominent role of TSS sequences.**

Data availability

The GRCh38/hg38 reference genome and 3D genome datasets under 4DN accession numbers 4DNFI9GMP2J8, 4DNFI643OYP9 and 4DNFILP99QJS were used for training the Orca sequence models. All coordinates in the manuscript refer to GRCh38/hg38 unless otherwise indicated. SV experimental validation datasets were downloaded from NCBI GEO accessions GSE137372, GSE66383, GSE78109 and EBI ENA accession PRJEB5236. Data used and generated in this manuscript were also deposited into Zenodo: https://zenodo.org/record/6234936, https://zenodo.org/record/4594676 and https://zenodo.org/record/6227750.

Code availability

All code, models and data for running Orca are available from the GitHub repository https://github.com/jzhoulab/orca (https://doi.org/10.5281/zenodo.6257290). A user-friendly web server is available at https://orca.zhoulab.io. The manuscript analysis code is available at https://github.com/jzhoulab/orca_manuscript (https://doi.org/10.5281/zenodo.6257292).

References

Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
van Steensel, B. & Furlong, E. E. M. The role of transcription in shaping the spatial organization of the genome. Nat. Rev. Mol. Cell Biol. 20, 327–337 (2019).
Kosak, S. T. et al. Subnuclear compartmentalization of immunoglobulin loci during lymphocyte development. Science 296, 158–162 (2002).
Dixon, J. R. et al. Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331–336 (2015).
Amat, R. et al. Rapid reversible changes in compartments and local chromatin organization revealed by hyperosmotic shock. Genome Res. 29, 18–28 (2019).
Sima, J. et al. Identifying cis elements for spatiotemporal control of mammalian DNA replication. Cell 176, 816–830.e18 (2019).
Alipour, E. & Marko, J. F. Self-organization of domain structures by DNA-loop-extruding enzymes. Nucleic Acids Res. 40, 11202–11212 (2012).
Fudenberg, G., Abdennur, N., Imakaev, M., Goloborodko, A. & Mirny, L. A. Emerging evidence of chromosome folding by loop extrusion. Cold Spring Harb. Symp. Quant. Biol. 82, 45–55 (2017).
Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).
Sanborn, A. L. et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl Acad. Sci. USA 112, E6456–E6465 (2015).
Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
Krietenstein, N. et al. Ultrastructural details of mammalian chromosome architecture. Mol. Cell 78, 554–565.e7 (2020).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. https://doi.org/10.1101/gr.200535.115 (2016).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. https://doi.org/10.1038/s41588-018-0160-6 (2018).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods. https://doi.org/10.1038/s41592-019-0360-8 (2019).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020).
Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124 (2020).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Abdennur, N. & Mirny, L. A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics 36, 311–316 (2020).
Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. https://doi.org/10.1038/ng.3834 (2017).
Zhang, D. et al. Alteration of genome folding via contact domain boundary insertion. Nat. Genet. 52, 1076–1087 (2020).
Suzukawa, K. et al. Identification of a breakpoint cluster region 3′ of the ribophorin I gene at 3q21 associated with the transcriptional activation of the EVI1 gene in acute myelogenous leukemias with inv (3)(q21q26). Blood. 84, 2681–2688 (1994).
Gröschel, S. et al. A single oncogenic enhancer rearrangement causes concomitant EVI1 and GATA2 deregulation in leukemia. Cell 157, 369–381 (2014).
Lupiáñez, D. G. et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 161, 1012–1025 (2015).
Franke, M. et al. Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538, 265–269 (2016).
Croft, B. et al. Human sex reversal is caused by duplication or deletion of core enhancers upstream of SOX9. Nat. Commun. 9, 5319 (2018).
Young, R. A. Control of the embryonic stem cell state. Cell 144, 940–954 (2011).
Vierbuchen, T. et al. AP-1 transcription factors and the BAF complex mediate signal-dependent enhancer selection. Mol. Cell 68, 1067–1082.e12 (2017).
Rao, S. S. P. et al. Cohesin loss eliminates all loop domains. Cell. https://doi.org/10.1016/j.cell.2017.09.026 (2017).
Belaghzal, H. et al. Liquid chromatin Hi-C characterizes compartment-dependent chromatin interaction dynamics. Nat. Genet. https://doi.org/10.1038/s41588-021-00784-4 (2021).
Meuleman, W. et al. Constitutive nuclear lamina-genome interactions are highly conserved and associated with A/T-rich sequence. Genome Res. 23, 270–280 (2013).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Preprint at bioRxiv. https://doi.org/10.1101/2021.07.29.454384 (2021).
Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. & Wilson, A. G. Averaging weights leads to wider optima and better generalization. Preprint at https://arxiv.org/abs/1803.05407 (2018).
Chen, T., Xu, B., Zhang, C. & Guestrin, C. Training deep nets with sublinear memory cost. Preprint at https://arxiv.org/abs/1604.06174 (2016).
Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D1284 (2018).
Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).

Acknowledgements

This work was performed using the high-performance computing resources, supported by the BioHPC, at the University of Texas Southwestern Medical Center. J.Z. is supported by the Cancer Prevention and Research Institute of Texas grant (no. RR190071), National Institutes of Health grant no. DP2GM146336 and the UT Southwestern Endowed Scholars program. The author thanks C. Park and K. Chen for feedback on an early draft of this manuscript.

Author information

Authors and Affiliations

Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA
Jian Zhou

Authors

Jian Zhou

Contributions

J.Z. conceived and designed the study, developed the computational methods, performed the analysis and wrote the manuscript.

Corresponding author

Correspondence to Jian Zhou.

Ethics declarations

Competing interests

The author declares no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of Orca model predictions for the HFF cell type.

a). A multiscale sequence-based prediction example zooming from whole-chromosome into a position on a holdout test chromosome. Predictions from 1–256 Mb scales are compared with micro-C experimental observations. Missing values in micro-C data are shown in gray, and these regions are also indicated in the 64–256 Mb prediction heatmaps because predictions at major assembly gaps or unmappable regions are of unknown accuracy. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. b). Scatter plot comparison of the predicted interaction scores with the micro-C measured interaction scores (log fold over background) on the holdout test chromosomes. 10,000 randomly subsampled scores are shown in each panel. The overall Pearson correlations across the entire test chromosomes are annotated. The genome interactions are represented by the log fold over background scores for both prediction and experimental data. Predictions for 1–32 Mb levels are from the Orca-32Mb model and 64–256 Mb levels are from the Orca-256Mb model.
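The "log fold over genomic-distance-based background" score used in these panels can be illustrated with a minimal sketch. The caption does not give the exact normalization, so the code below assumes the simplest version: the background at each bin distance is the mean of the corresponding diagonal of a square contact matrix, and the score is the natural log of observed over background. The function names, the `eps` pseudocount and the plain-list matrix format are illustrative assumptions, not the Orca implementation.

```python
import math


def distance_background(matrix):
    """Mean contact value on each diagonal, one value per bin distance.

    Assumption: the matrix is square and symmetric, so only the upper
    triangle is read.
    """
    n = len(matrix)
    background = []
    for d in range(n):
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        background.append(sum(diagonal) / len(diagonal))
    return background


def log_fold_over_background(matrix, eps=1e-9):
    """Natural log of observed contacts over the distance-matched mean.

    `eps` is a hypothetical pseudocount that avoids log(0).
    """
    n = len(matrix)
    background = distance_background(matrix)
    return [
        [math.log((matrix[i][j] + eps) / (background[abs(i - j)] + eps))
         for j in range(n)]
        for i in range(n)
    ]


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With scores computed this way for both prediction and experiment, the flattened matrices can be passed to `pearson` to obtain a correlation of the kind annotated in panel b. A matrix whose values depend only on bin distance yields scores of zero everywhere, which is the sense in which the score removes the distance trend.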

Extended Data Fig. 2 Performance of Orca model predictions for cross-cell-type genome interaction difference.

a). Scatter plot comparison of the predicted cell type differences of genome interactions (H1-ESC - HFF) with the micro-C measured interaction score differences on the holdout chromosomes. 10,000 randomly subsampled scores are shown in each panel. The overall Pearson correlations across the entire test chromosomes are annotated. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. b). Prediction performance for position pairs with the strongest absolute log-fold differences between the two cell types (top 1 percentile). The performance of models predicting the cell type labels (the cell type with stronger interaction) is measured by the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUROC) is annotated. The AUROC score can be interpreted as the probability of a randomly selected positive example (that is stronger in HFF) being ranked higher than a randomly selected negative example (that is stronger in H1-ESC). Predictions for 1–32 Mb levels are from the Orca-32Mb models and 64–256 Mb levels are from the Orca-256Mb models.
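The probabilistic reading of AUROC given in panel b can be checked directly by counting pairs. The sketch below is a generic illustration and not code from the manuscript: it assumes labels coded as 1 for the positive class (stronger in HFF) and 0 for the other class (stronger in H1-ESC), and it counts ties as half a win, which is a common convention rather than something the caption specifies.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Assumptions: `labels` holds 1 for positives and 0 for negatives;
    tied scores contribute 0.5. The quadratic loop is for clarity only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

For example, `auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])` compares four pairs, of which three are ordered correctly, so the result is 0.75.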

Extended Data Fig. 3 Example Orca predictions of Polycomb-mediated interactions.

Predicted and observed H1-ESC and HFF genome interactions for two regions from a holdout chromosome, a). chr10:116850000-117850000 and b). chr10:100450000-101450000 are shown. The predicted and observed Polycomb-mediated interactions are marked with black triangles. ChIP-seq signal tracks for CTCF and H3K27me3 for the two cell types are also shown. Polycomb-mediated interactions are predicted to be specific to H1-ESC in both examples, consistent with experimental micro-C and ChIP-seq data.

Extended Data Fig. 4 Example Orca predictions of promoter-enhancer interactions.

Predicted and observed H1-ESC and HFF genome interactions for two regions from holdout chromosomes, a) chr8:127400000-128400000 and b) chr9:94360000-95360000 are shown. The predicted and observed enhancer-promoter interactions are marked (promoter positions or promoter-promoter interactions are marked with red triangles, enhancer-promoter or enhancer-enhancer interactions are marked with black triangles; we only marked a subset of all interactions observed). ChIP-seq signal tracks for CTCF and H3K4me3, H3K27ac, and H3K4me1 for the two cell types are also shown. The predicted enhancer-promoter interactions are consistent with micro-C observations and enhancer histone mark signal from ChIP-seq data.

Extended Data Fig. 5 Visualized predictions of transposon-mediated boundary element insertion effects in multiple insertion sites.

All insertions with previously categorized effects (boundary creation, boundary strengthening, and no domain-level effect) in Zhang et al.²⁴ are shown. The experimental measurements by in situ Hi-C in HAP1 cells are compared with H1-ESC model predictions. The genome interactions are represented by the log fold over genomic-distance-based background scores for both prediction and experimental data. Arrows indicate the insertion sites. The genome coordinates are in hg19.

Extended Data Fig. 6 Comparison of Orca prediction with Capture Hi-C experimental measurement for structural variants from Franke et al. 2016.

Capture Hi-C data from mice carrying SVs are compared with predictions for the effects of equivalent human structural variants. Predicted log fold over background scores at the 4 Mb level are scaled with the distance-expectation curve from capture Hi-C.

Extended Data Fig. 7 Multiplexed in silico mutagenesis screen results are highly correlated with single-mutation in silico mutagenesis screen results.

a). Predicted structural impact scores (1 Mb) of single disruptions (left) and multiplexed disruptions (right) are shown on the y-axis, with disruption positions on the x-axis. The 10 bp disruption sites screened cover the center 0.8 Mb of the 1 Mb region. The first three rows are three independent runs (for single disruption only the disrupted sequences are random across the runs, and for multiplexed disruption both the multiplex design of disruption sites and the disrupted sequences are random), and the last row shows the minimum of the three at each position. b). Relationship between the correlation of single and multiplexed disruption profiles (y-axis) and the number of runs combined (x-axis).

Extended Data Fig. 8 Visualization of virtual screen sequence activity on chromatin compartment alteration.

A subset of 1,000 contiguous source sequences among all 27,981 source sequences of 12,800 bp covering chr8, chr9 and chr10 is shown. Target locations are ordered by the main mode of compartment change detected at the target site (from top: A>B to bottom: B>A), which is quantified by the loading of the first principal component of the whole sequence structural impact score (32 Mb) matrix.
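Ordering target locations by a first-principal-component loading can be sketched as follows. This is an illustration under stated assumptions, not the manuscript's analysis code: the impact score matrix is assumed to have one row per source sequence and one column per target location, columns are mean-centered before the decomposition, and the leading component is found by power iteration. The function names, the seed and the iteration count are hypothetical. The sign of a principal component is arbitrary, so which end of the ordering corresponds to A>B must be decided separately.

```python
import math
import random


def first_pc_loadings(matrix, iterations=200, seed=0):
    """Loadings of the first principal component, one per column.

    Assumption: rows are source sequences, columns are target locations.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_cols)]
    for _ in range(iterations):
        # Apply X^T X to v without building the covariance matrix.
        proj = [sum(r[j] * v[j] for j in range(n_cols)) for r in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance along v; keep the current vector
        v = [x / norm for x in w]
    return v


def order_by_loading(matrix):
    """Column indices sorted from lowest to highest first-PC loading."""
    loadings = first_pc_loadings(matrix)
    return sorted(range(len(loadings)), key=lambda j: loadings[j])
```

Power iteration is used here only to keep the sketch free of external dependencies; a standard PCA routine from a numerical library would serve the same purpose.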

Extended Data Fig. 9 Random sequence permutation effects on sequence compartment A and compartment B activity.

Comparison of chromatin compartment activities of 25,600 bp sequences permuted at different segment lengths (for each permutation segment length, 2 bp, 4 bp, …, 256 bp, every 25,600 bp sequence is divided into segments, which are then randomly shuffled and concatenated). Compartment B activity is compared with sequence A/T content at the same locations.
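The permutation procedure described in the parenthesis (divide, shuffle, concatenate) is simple enough to sketch. The function name and the optional `rng` argument are illustrative assumptions; the caption does not say how a trailing partial segment would be handled, which does not arise for 25,600 bp sequences and power-of-two segment lengths up to 256 bp.

```python
import random


def permute_segments(sequence, segment_length, rng=None):
    """Split `sequence` into consecutive segments, shuffle them, and rejoin.

    Base composition is preserved exactly; sequence order is preserved
    only within each segment. `rng` is an optional random.Random instance
    for reproducibility (an assumption of this sketch).
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    rng = rng or random.Random()
    segments = [sequence[i:i + segment_length]
                for i in range(0, len(sequence), segment_length)]
    rng.shuffle(segments)
    return "".join(segments)
```

Shorter segments destroy more of the local sequence context while leaving overall A/T content unchanged, which is why the comparison against A/T content at the same locations is informative.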

Extended Data Fig. 10 Predicted effects of disrupting genomic regions by randomly permuting sequences.

At each disruption site indicated by the arrow, the 1.28 Mb sequence centered at the position is permuted in 4 bp segments. Permuted compartment A sequences show B compartment interaction patterns, while disrupted compartment B sequences remain in the B compartment.

About this article

Cite this article

Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat Genet 54, 725–734 (2022). https://doi.org/10.1038/s41588-022-01065-4

Received: 11 April 2021
Accepted: 29 March 2022
Published: 12 May 2022
Issue Date: May 2022
DOI: https://doi.org/10.1038/s41588-022-01065-4
