Abstract
In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
Data availability
Datasets analyzed in this study are publicly available from: GEO (www.ncbi.nlm.nih.gov/geo/, Hi-C: GSE63525, GSE104334, GSE96107, 5C: GSE77142, Capture-C: GSE116794), 4D Nucleome Data Portal (https://data.4dnucleome.org/, Micro-C: 4DNESWST3UBH, 4DNES14CNC1I), UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/), ENCODE data portal (www.encodeproject.org/), JASPAR (http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/hg38/tsv/), GENCODE (https://www.gencodegenes.org/human/) and FANTOM5 (https://fantom.gsc.riken.jp/data/).
Code availability
The trained model, additional documentation and code for training and predicting with Akita are available at: https://github.com/calico/basenji/tree/master/manuscripts/akita.
Change history
19 October 2020
In the version of this article initially published, the links for https://github.com/calico/basenji/tree/master/manuscripts/akita and https://github.com/calico/basenji/blob/master/manuscripts/akita/params.json led to an obsolete website rather than to the stated URLs. The errors have been corrected in the PDF and HTML versions of the article.
References
1. Merkenschlager, M. & Nora, E. P. CTCF and cohesin in genome folding and transcriptional gene regulation. Annu. Rev. Genomics Hum. Genet. 17, 17–43 (2016).
2. Krijger, P. H. L. & de Laat, W. Regulation of disease-associated gene expression in the 3D genome. Nat. Rev. Mol. Cell Biol. 17, 771–782 (2016).
3. Fudenberg, G., Abdennur, N., Imakaev, M., Goloborodko, A. & Mirny, L. A. Emerging evidence of chromosome folding by loop extrusion. Cold Spring Harb. Symp. Quant. Biol. 82, 45–55 (2017).
4. Rodríguez-Carballo, E. et al. The HoxD cluster is a dynamic and resilient TAD boundary controlling the segregation of antagonistic regulatory landscapes. Genes Dev. 31, 2264–2281 (2017).
5. Despang, A. et al. Functional dissection of the Sox9-Kcnj2 locus identifies nonessential and instructive roles of TAD architecture. Nat. Genet. 51, 1263–1271 (2019).
6. Cao, F., Zhang, Y., Loh, Y. P., Cai, Y. & Fullwood, M. J. Predicting chromatin interactions between open chromatin regions from DNA sequences. Preprint at bioRxiv https://doi.org/10.1101/720748 (2019).
7. Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D. & Fishman, V. Quantitative prediction of enhancer-promoter interactions. Genome Res. 30, 72–84 (2020).
8. Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun. 10, 5449 (2019).
9. Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47, e60 (2019).
10. Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496 (2016).
11. Trieu, T., Martinez-Fundichely, A. & Khurana, E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol. 21, 79 (2020).
12. Forcato, M. et al. Comparison of computational methods for Hi-C data analysis. Nat. Methods 14, 679–685 (2017).
13. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
14. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
15. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
16. Koo, P. K., Anand, P., Paul, S. B. & Eddy, S. R. Inferring sequence-structure preferences of RNA-binding proteins with convolutional residual networks. Preprint at bioRxiv https://doi.org/10.1101/418459 (2018).
17. Shrikumar, A., Greenside, P., Shcherbina, A. & Kundaje, A. Not just a black box: Learning important features through propagating activation differences. Preprint at arXiv https://arxiv.org/abs/1605.01713v3 (2016).
18. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
19. Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
20. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).
21. Yang, T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).
22. Nora, E. P. et al. Targeted degradation of CTCF decouples local insulation of chromosome domains from genomic compartmentalization. Cell 169, 930–944.e22 (2017).
23. Wutz, G. et al. Topologically associating domains and chromatin loops depend on cohesin and are regulated by CTCF, WAPL, and PDS5 proteins. EMBO J. 36, 3573–3599 (2017).
24. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
25. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
26. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
27. Rhee, H. S. & Pugh, B. F. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419 (2011).
28. Nakahashi, H. et al. A genome-wide map of CTCF multivalency redefines the CTCF code. Cell Rep. 3, 1678–1689 (2013).
29. Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).
30. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
31. Bonev, B. et al. Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
32. Schmidt, D. et al. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148, 335–348 (2012).
33. Kaaij, L. J. T., Mohn, F., van der Weide, R. H., de Wit, E. & Bühler, M. The ChAHP complex counteracts chromatin looping at CTCF sites that emerged from SINE expansions in mouse. Cell 178, 1437–1451.e14 (2019).
34. Kraft, K. et al. Serial genomic inversions induce tissue-specific architectural stripes, gene misexpression and congenital malformations. Nat. Cell Biol. 21, 305–310 (2019).
35. Schwessinger, R. et al. DeepC: Predicting chromatin interactions using megabase scaled deep neural networks and transfer learning. Preprint at bioRxiv https://doi.org/10.1101/724005 (2019).
36. Krietenstein, N. et al. Ultrastructural details of mammalian chromosome architecture. Mol. Cell 78, 554–565.e7 (2020).
37. Davis, C. A. et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
38. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
39. Goloborodko, A., Venev, S., Abdennur, N. A. & Di Tommaso, P. mirnylab/distiller-nf: v.0.3.3 https://doi.org/10.5281/zenodo.3350937 (Zenodo, 2019).
40. Abdennur, N. & Mirny, L. Cooler: scalable storage for Hi-C data and other genomically-labeled arrays. Bioinformatics 36, 311–316 (2019).
41. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
42. Rao, S. S. P. et al. Cohesin loss eliminates all loop domains. Cell 171, 305–320.e24 (2017).
43. Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
44. Abadi, M. et al. TensorFlow (2015).
45. Chollet, F. et al. Keras (GitHub, 2015).
46. Kandasamy, K. et al. Tuning hyperparameters without grad students: scalable and robust Bayesian optimisation with dragonfly. Preprint at arXiv https://arxiv.org/abs/1903.06694 (2019).
47. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
48. Flyamer, I. et al. Phlya/adjustText (Zenodo) https://doi.org/10.5281/zenodo.1494343 (2018).
49. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Preprint at bioRxiv https://doi.org/10.1101/787903 (2019).
50. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine-mapping. Preprint at bioRxiv https://doi.org/10.1101/501114 (2019).
51. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013-2015 (2015).
52. Hsieh, T.-H. S. et al. Resolving the 3D landscape of transcription-linked mammalian chromatin folding. Mol. Cell 78, 539–553.e8 (2020).
53. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
54. van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011).
55. Reback, J. et al. pandas-dev/pandas: Pandas 1.0.3 (Zenodo, 2020); https://doi.org/10.5281/zenodo.3715232
56. Perez, F. & Granger, B. E. IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).
57. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
58. Waskom, M. et al. seaborn: v0.5.0 (November 2014) (Zenodo, 2014); https://doi.org/10.5281/zenodo.12710
59. Fulco, C. P. et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
60. Beagan, J. A. et al. YY1 and CTCF orchestrate a 3D chromatin looping switch during early neural lineage commitment. Genome Res. 27, 1139–1152 (2017).
61. Weintraub, A. S. et al. YY1 Is a structural regulator of enhancer-promoter loops. Cell 171, 1573–1588.e28 (2017).
Acknowledgements
We thank V. Agarwal, H. Yuan and E. Nora for feedback on the manuscript; L. Chumpitaz-Diaz and M. Pittman for feedback on tutorials; N. Abdennur and P. Kerpedjiev for help with higlass visualization and V. Heinrich for sharing mapped Capture-C reads. We also thank J. Ulirsch, Q. Wang and H. Finucane for sharing GTEx SuSiE fine mapping. G.F. and K.S.P. were funded by Gladstone Institutes, the National Heart, Lung and Blood Institute (grant no. HL098179) and the National Institute of Mental Health (grant no. MH109907).
Author information
Contributions
G.F. and D.R.K. conceived the project, developed models and analyzed data, with input from K.S.P. at all stages. All authors developed the final manuscript.
Ethics declarations
Competing interests
D.R.K. is a paid employee of Calico Life Sciences, LLC. All other authors declare no competing interests.
Additional information
Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Akita transforms from 1D to 2D representations and benefits from multi-task training.
a, Illustration of transformation from 1D profiles to 2D maps. To convert 1D profiles to 2D maps, we averaged the values at pairs of genomic bins i and j for each filter. This operation transforms a tensor with dimensions [512 length, 64 filters] to a tensor with dimensions [512 length, 512 length, 64 filters]. We also concatenated a positional encoding of the distance between bins, \(|i - j|\), as an additional filter, producing a [512 length, 512 length, 65 filters] tensor. b, Evaluation of transformation from 1D to 2D. We considered the following operations to transform 1D vector representations derived from the DNA sequence to 2D for Hi-C prediction, holding all other hyper-parameters constant. For every pair of vectors \(o_i\) and \(o_j\) for 1D sequence positions i and j, we computed vector t(i, j), with filters indexed by k, via: (1) “dot”: Element-wise multiplication between each vector position, \(t(i,j,k) = o_i(k)\,o_j(k)\). (2) “geo”: Addition of one to all vector values, element-wise multiplication between each position, square root of each position, subtraction of one from all vector values, \(t(i,j,k) = \sqrt{(o_i(k) + 1)(o_j(k) + 1)} - 1\). (3) “max”: Element-wise max between each vector position, \(t(i,j,k) = \max(o_i(k),o_j(k))\). (4) “concat”: Concatenate the two vectors, \(t(i,j) = [o_i,o_j]\). (5) “avg”: Element-wise mean between each vector position, \(t(i,j,k) = (o_i(k) + o_j(k))/2\). c, Multi-task training improves accuracy relative to single dataset training. We trained Akita models for each of the five datasets alone and compared overall Pearson’s R on the test set to the jointly trained multi-task model. Multi-task training benefitted all datasets except for the highest-performing H1hESC dataset. We note that our multi-task framework thus offers a powerful approach to train on many datasets simultaneously and efficiently.
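The five pairwise operations and the distance encoding described in this caption can be sketched with NumPy broadcasting. This is a minimal illustration, not the code used to train Akita: the function names one_to_two and append_distance are hypothetical, and the "geo" branch assumes all input values are at least -1 so that the square root is defined.

```python
import numpy as np

def one_to_two(o, mode="avg"):
    """Turn a 1D representation o of shape [length, filters] into a 2D one.

    Returns [length, length, filters], or [length, length, 2 * filters]
    for mode="concat". Function name and signature are illustrative only.
    """
    o_i = o[:, None, :]  # position i, broadcast over j
    o_j = o[None, :, :]  # position j, broadcast over i
    if mode == "dot":
        return o_i * o_j
    if mode == "geo":
        # Assumes o >= -1 so the product under the root is non-negative.
        return np.sqrt((o_i + 1.0) * (o_j + 1.0)) - 1.0
    if mode == "max":
        return np.maximum(o_i, o_j)
    if mode == "avg":
        return (o_i + o_j) / 2.0
    if mode == "concat":
        length, filters = o.shape
        shape = (length, length, filters)
        return np.concatenate(
            [np.broadcast_to(o_i, shape), np.broadcast_to(o_j, shape)], axis=-1
        )
    raise ValueError(f"unknown mode: {mode}")

def append_distance(t):
    """Append |i - j| as one extra filter to a [length, length, filters] tensor."""
    idx = np.arange(t.shape[0])
    dist = np.abs(idx[:, None] - idx[None, :]).astype(t.dtype)
    return np.concatenate([t, dist[..., None]], axis=-1)
```

With an input of shape [512, 64], one_to_two(o, "avg") followed by append_distance would give the [512, 512, 65] tensor described in panel a; smaller inputs behave the same way.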
Extended Data Fig. 2 Correlations of Akita’s predictions with experimental data reach those between replicates.
a, MSE vs. Spearman R for each target, where each dot shows values for these metrics for an individual region of the test set. Light gray shows values for predictions versus the full experimental dataset, as in Fig. 1c. Purple shows these quantities if reads for each full dataset are randomly split into two datasets. The same normalization and smoothing steps used to generate training data from the full dataset were used to transform each map prior to calculating MSE or Spearman R. Predictions generally show lower MSE and higher correlations than split datasets. This indicates that our model has extracted the majority of the signal in these data, and that current performance is limited at least in part by sequencing depth of even the currently best available datasets. b, MSE vs. Spearman R for the HFF dataset. Light grey and purple as in (a). To obtain contact maps for biological replicates, as defined in Krietenstein et al., 2020, reads were re-processed and aggregated across technical replicates for the same biological replicate (bioRep). Biological replicates represent independently cultured and processed cells, whereas technical replicates represent independent sample preparations from the same cell culture. Green shows results for two biological replicates, and dark grey shows results for predictions versus the first biological replicate. Normalization and smoothing applied as in (a). Since splitting leads to slightly lower MSEs and higher correlations than those between biological replicates, this indicates that splitting reads in half computationally leads to a similar, albeit more stringent, barometer of model performance than the comparison between biological replicates. c, Maps for predictions (top row), bioRep1 (middle) and bioRep2 (bottom) for the same three regions displayed in Fig. 2.
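The per-region comparison and the read-splitting control described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the helper names are hypothetical, the normalization and smoothing steps are omitted, and splitting is approximated by binomial thinning of binned counts, which assumes each read is assigned to one of the two halves independently with probability 0.5.

```python
import numpy as np

def rank_average(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def region_metrics(map_a, map_b):
    """MSE and Spearman R between two maps for one region, skipping non-finite pixels."""
    a = np.ravel(np.asarray(map_a, dtype=float))
    b = np.ravel(np.asarray(map_b, dtype=float))
    keep = np.isfinite(a) & np.isfinite(b)
    a, b = a[keep], b[keep]
    mse = float(np.mean((a - b) ** 2))
    # Spearman R is the Pearson correlation of the ranks.
    rho = float(np.corrcoef(rank_average(a), rank_average(b))[0, 1])
    return mse, rho

def split_counts(counts, rng):
    """Split a symmetric integer contact matrix into two random halves."""
    counts = np.asarray(counts)
    half_upper = rng.binomial(np.triu(counts), 0.5)  # thin upper triangle incl. diagonal
    half_a = half_upper + np.triu(half_upper, k=1).T  # mirror to keep symmetry
    half_b = counts - half_a
    return half_a, half_b
```

Each half would then be passed through the same normalization and smoothing as the full dataset before calling region_metrics, as the caption describes.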
Extended Data Fig. 3 Akita predictions relate to the aggregate CTCF and accessibility signals as well as binarized features of experimental maps.
a-c, Correlations between the strength of Akita’s predictions, strength of experimental patterns, CTCF, and DNase. a, Map signal strength, measured by mean of squared map values, for predictions versus targets. In regions with more complex features, Akita tends to make more complex predictions. Correlation printed above each plot indicates Spearman’s R across all regions of the test set (n=413, p<1e-6 in all cases). b, Signal strength for predictions versus signal strength for CTCF ChIP-seq, measured by mean squared profile values. Akita predicts more prominent locus-specific patterns in regions with greater CTCF binding. c, Signal strength for predictions versus DNase-seq. d–f, Akita predictions recapitulate positions of boundaries and dots called from experimental data. d, Experimental HFF Micro-C data for the same regions in Fig. 2. TAD boundaries are overlaid as green lines, for boundary strength >1 and insulation score < -0.5 (15,273 genome-wide). Dots (also termed ‘loops’ or ‘peaks’) are overlaid as purple circles, for strength >2 (36,671 dots genome-wide). Both calculated as in Krietenstein et al., 2020 (ref. 36). e, Predictions overlaid with the same features. f, A/B compartment profiles for the indicated regions calculated at 32,768 bp (2^15) resolution, calculated using cooltools (https://github.com/mirnylab/cooltools) from chromosome-wide experimental maps. Note that these 1 Mb regions all largely fall in the A-compartment (values >0), and that B-compartment regions often display the more uniform maps seen in Fig. 1b. Also note that called TAD boundaries and peaks likely have both false positives and false negatives, as derived features extracted by related algorithms from the same Hi-C data can show surprisingly low overlap (ref. 12), and are dependent on the exact thresholds used. Indeed, binarized features alone appear to have minimal predictive value for functional enhancer interactions in CRISPRi tiling screens (ref. 59).
The limitations of binarized features underscore a key goal for Akita, which is to enable post-TAD analyses of genome folding data.
Extended Data Fig. 4 Akita displays limited cell-type specificity in predictions.
a, Predicted versus experimental log(observed/expected) values for each bin pair in every region of the test set, separately for each target. This shows predictions are correlated with experimental data across cell types. Color shows log10 number of bin pairs for each set of predicted versus experimental values. Corr shows Spearman R. b, Considering every region in the test set across cell types, we find: Left: models make highly correlated predictions for different cell types (Spearman R(pred(i,j,k1),pred(i,j,k2)), where k1 and k2 index cell types and the correlation is taken across all genomic regions i, and pixels j). Middle: genome folding assayed experimentally is correlated, but less so (Spearman R(data(i,j,k1),data(i,j,k2))). Right: predicted differences across cell types from our models correlate, albeit weakly, with observed differences (Spearman R(pred(i,j,k1) - pred(i,j,k2), data(i,j,k1) - data(i,j,k2))). Note different scales for Spearman R. c, Example of a region showing largely consistent folding across cell types (chr20:50759680–51808256) for targets and predictions. Tracks show binned CTCF ChIP-seq fold-change over control and DNase-seq density. d, Example of a region showing gains and losses of specific features across cell types (chr5:5179392–6227968) at bin ~300. While the predicted differences across cell types from models correlate with observed differences (b, right), our predictions are not particularly visually distinct for different cell types (c,d). At present, our models appear to primarily tune the dynamic range for the entire prediction, rather than predicting gains and losses of a subset of features (d). Also note in (d) that CTCF is still bound in HCT116 in this region as determined by ChIP-seq, despite the loss of a strong boundary around bin 300.
In the future, we hypothesize that pairing improved model architectures and training procedures with a greater number of high-resolution genome folding datasets will enable our models to learn more cell type-specific representations of genome folding, as is currently possible for TF binding, chromatin state, and gene expression (ref. 18).
Extended Data Fig. 5 In silico mutagenesis enables rapid screening of transcription factor influence on genome folding.
a, Experimental HFF Micro-C target data for three regions in our held-out test dataset. b, Predictions for these regions. c, Predictions for these regions after randomly mutagenizing all CTCF motifs in these regions, averaged over 10 random samples. d, Number of CTCF motifs per 2048 bp bin. CTCF motif matches obtained from JASPAR (ref. 24), and profiles computed separately for the number of motifs on the positive strand (>0) and negative strand (<0). e, Predictions for these regions after mutagenizing all NR3C2 motifs in these regions, averaged over 10 random samples. NR3C2 motifs cover a similar number of base pairs per region as CTCF, but their perturbation has little impact on Akita’s predictions. f, Positions of positively oriented (>0) and negatively oriented (<0) NR3C2 motifs per bin. g, Average disruption, mean((pred - pred_mut)^2), versus the average number of kb perturbed per region. Note that YY1, suggested to be involved in genome folding (refs. 60,61), is predicted to have little aggregate genome-wide impact following motif mutagenesis. This suggests that YY1 may operate at a subset of loci in certain developmental contexts (ref. 60), or that its influence depends on the presence of nearby CTCF motifs or other complex factors and so evaded our model. h, Change in signal, mean((pred)^2) - mean((pred_mut)^2), versus the average number of kb perturbed per region. This reveals a trend toward negative scores for motifs with many occurrences. i, Average disruption versus the total number of overlaps with CTCF motifs. The strong trend argues that many high scoring motifs likely have large predicted impacts due to frequent overlaps with CTCF motifs, rather than independent effects. j, Change in signal versus the total number of overlaps with CTCF motifs.
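The motif-mutagenesis screen and the two summary scores in panels g and h can be sketched as below. Several pieces are assumptions made for illustration: predict is a hypothetical stand-in for a trained model that maps a DNA string to an array of predicted map values, intervals holds half-open [start, end) motif positions, and the scores are computed on the prediction averaged over random samples (the caption does not say whether averaging happens before or after scoring).

```python
import numpy as np

NUCLEOTIDES = np.array(list("ACGT"))

def mutagenize(seq, intervals, rng):
    """Replace every base inside the [start, end) intervals with a random nucleotide."""
    arr = np.array(list(seq))
    for start, end in intervals:
        arr[start:end] = rng.choice(NUCLEOTIDES, size=end - start)
    return "".join(arr)

def disruption(pred, pred_mut):
    """Average disruption: mean((pred - pred_mut)^2)."""
    return float(np.mean((pred - pred_mut) ** 2))

def change_in_signal(pred, pred_mut):
    """Change in signal: mean(pred^2) - mean(pred_mut^2)."""
    return float(np.mean(pred ** 2) - np.mean(pred_mut ** 2))

def screen_motif(seq, intervals, predict, n_samples=10, seed=0):
    """Score one motif family in one region.

    `predict` is a hypothetical callable standing in for the trained model.
    """
    rng = np.random.default_rng(seed)
    pred = predict(seq)
    samples = [predict(mutagenize(seq, intervals, rng)) for _ in range(n_samples)]
    pred_mut = np.mean(samples, axis=0)  # average prediction over random samples
    return disruption(pred, pred_mut), change_in_signal(pred, pred_mut)
```

A positive change in signal means the mutated prediction has weaker features than the original; a negative value means mutagenesis strengthened them.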
Extended Data Fig. 6 Akita learns an orientation-specific role for CTCF and enables mutagenesis of regions defined by ChIP-seq.
a-d, Akita learns an orientation-specific role for CTCF. a, Predicted map signal strength before versus after in silico perturbations, either for mutagenizing all CTCF motifs (black) or inverting all CTCF motifs (blue). Points show each region in the test set (n=413). Signal strength quantified by mean squared map values. Inversions tend to show smaller perturbations to overall signal strength (blue points deviate less from the x=y line than black points). b, Average disruption for mutagenizing all CTCF motifs or inverting all CTCF motifs. Inversion disrupts maps to a similar extent as mutagenesis (points fall both above and below the x=y line to a similar extent). Jointly with (a), Akita thus predicts changing motif orientation largely alters the positioning of contact patterns, rather than their overall salience across the genome. c, Change in signal strength versus disruption for inverting all CTCF motifs, mean((pred)^2) - mean((pred_inv-CTCF)^2). Points show each region of the test set. This indicates that while motif inversions greatly change predicted contact patterns, they can both increase (-) and decrease (+) the signal strength, or salience, of contact patterns. d, Change in signal strength versus disruption for mutagenizing all CTCF motifs in each region of the test set, mean((pred)^2) - mean((pred_mut-CTCF)^2). The positive change in signal strength upon mutagenesis shows these perturbations largely decrease feature strength in predicted maps. e–g, Akita enables studying the impact of sequences underlying ChIP-seq regions without defined motifs. e, Predicted change in signal versus average disruption for in silico mutagenesis of DNA sequences underlying cohesin peaks. Each point represents one of the 10,268 H1hESC Rad21 cohesin peaks overlapping regions in our test set. Mutagenesis is performed either randomly for all nucleotides under the peak (blue) or only for nucleotides that do not overlap a JASPAR CTCF motif (orange).
f, Boxplots for predicted average disruption, stratified by the number of CTCF motifs overlapping the cohesin ChIP peak. Boxplots generated with seaborn defaults for the same n=10,268 peaks (boxes show quartiles, whiskers extend 1.5 times IQR beyond low and high quartiles, points outside this range shown individually). We found that mutagenesis of Rad21 ChIP-seq peaks without CTCF motifs was less disruptive than mutagenesis of peaks with CTCF motifs. Interestingly, we observed no clear trend of increased average disruption for increased numbers of CTCF motifs beyond the first. g, Boxplots as for (f) but with masking the positions of CTCF motifs in these peaks and repeated mutagenesis. On average this led to weaker disruptions of predicted maps (also see the spread of orange versus blue in (e)). However, the trend where mutagenesis of Rad21 ChIP-seq peaks without CTCF motifs was less disruptive than mutagenesis of peaks with CTCF motifs still held. This argues that Akita relies on additional sequence context beyond the immediate 19 bp motif in JASPAR to correctly predict its impact on Hi-C maps, similar to how additional sequence context was found to be relevant for CTCF binding assayed by ChIP-exo (ref. 27).
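Inverting motifs in silico, as in panels a-c, amounts to replacing each motif with its reverse complement while leaving the surrounding sequence untouched. The sketch below is one plausible way to do this, not necessarily how it was done for the figure; invert_intervals is a hypothetical helper and assumes the [start, end) intervals do not overlap.

```python
# Translation table for complementing DNA; N is left unchanged.
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def invert_intervals(seq, intervals):
    """Flip the orientation of each [start, end) interval in place.

    Assumes intervals do not overlap; sequence length is preserved.
    """
    out = list(seq)
    for start, end in intervals:
        out[start:end] = reverse_complement(seq[start:end])
    return "".join(out)
```

For example, invert_intervals("AACCGTTT", [(2, 5)]) turns the "CCG" segment into "CGG". Because inversion keeps the nucleotide content of each motif, comparing it with random mutagenesis separates the effect of orientation from the effect of losing the motif.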
Extended Data Fig. 7 Impacts of predicted disruptions relate to evolutionary conservation and functional annotation categories.
a-c, Predicted nucleotide-level impacts correlate with evolutionary conservation in and around CTCF motifs. Results from saturation mutagenesis of 500 bp regions around 500 randomly selected strong CTCF motifs, annotated by JASPAR with p-value < 1e-6, as for Fig. 3d. For each mutation, we computed the disruption score as the L2 norm of the predicted contact difference maps between the reference and alternative alleles. We aggregated scores across the model outputs by taking the mean. For visualization, these figures include a 0.001 pseudocount before taking the natural logarithm. We constructed a single score for each position by taking the maximum across alternative alleles. a, The mean log disruption across regions is greatest within CTCF motifs, but is also high in the flanking regions. b, The mean PhyloP score across regions is greatest within CTCF motifs, with peaks in similar places to nucleotide-level disruption scores. PhyloP values were extracted from the mammalian 30-way alignment for the same regions as in (a). c, Scatter plots for disruption versus PhyloP scores for n=5,220 sites within CTCF motifs (left), n=7,830 sites in the flanking 15bp (center), and n=73,341 sites beyond 100bp (right). We observed a statistically significant but moderate Pearson correlation within the CTCF motifs (left) and in the directly flanking regions (center), which drops off farther away (right). d, Scatter plot for log disruption versus motif strength, computed as the absolute change of the FIMO score, for n=1,817 mutations that showed some evidence of influencing the CTCF motif. The wide range in Akita scores for a given change in FIMO score argues that Akita integrates nucleotide influences on genome folding beyond those described by a position weight matrix approach. e, f, Large-scale mutagenesis reveals impactful annotation categories for single nucleotide variants. 
To quantify the impact of nucleotides within and near CTCF motifs relative to other genomic features we formed a set of unbiased mutations across the genome. We randomly selected 100,000 positions striding by 256 bp within the test set genomic regions and then selected a random alternative nucleotide. For each mutation, we computed the disruption score as the L2 norm of the predicted contact difference maps between the reference and alternative allele, averaging across outputs. e, Distributions of nucleotide disruption scores split by annotation category, compared to nucleotides outside of these annotation categories. We observed elevated scores in CTCF motifs, their flanking regions (CTCF Flank 10, CTCF Flank 100), promoters (500 bp from GENCODE-annotated transcription start site), and enhancers (FANTOM5-annotated). For visualization we added a 0.001 pseudocount before taking the natural logarithm. f, Two example sites without an annotation category. For visualization we added a pseudocount of 1 before taking the natural logarithm (log1p). This suggests there are important DNA sequences for genome folding that remain uncharacterized. g, Predicted maps for a high-scoring non-CTCF GTEx variant. Predicted maps underlying the score for chr7_5898574_G_T_b38 shown in Fig. 4g. Left: prediction for the reference allele. Middle: prediction for the alternative allele. Right: prediction for the difference (reference - alternate), where green indicates higher predicted contact frequency for the reference allele and pink indicates higher predicted contact frequency for the alternate allele. Top row: full prediction region. Bottom row: zoom into the boundary modified by the variant. Note the different color scales. Grey lines show the position of the variant, at the center of the prediction region. Akita predicts this variant modifies the strength of a nearby boundary.
While it is difficult to see the influence of this single nucleotide change over the full prediction region, the difference becomes apparent upon subtraction of predicted maps. Specifically, this change indicates stronger predicted insulation at this boundary for the alternate allele (exp(0.02) ~= 2% decrease in contact frequency over this boundary).
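The nucleotide-level scoring used throughout this caption (L2 norm of the reference-minus-alternate difference map, mean across model outputs, maximum across alternative alleles, then a log with a small pseudocount) can be sketched as follows. The code is illustrative only: predict is a hypothetical stand-in for the trained model and is assumed to return an array of shape [pixels, outputs].

```python
import numpy as np

def l2_disruption(pred_ref, pred_alt):
    """L2 norm of the difference map for each output, averaged across outputs."""
    diff = np.asarray(pred_ref, dtype=float) - np.asarray(pred_alt, dtype=float)
    return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))

def saturation_scores(seq, positions, predict):
    """Per-position score: the maximum disruption over the three alternative alleles.

    `predict` is a hypothetical callable returning [pixels, outputs].
    """
    pred_ref = predict(seq)
    scores = {}
    for pos in positions:
        per_allele = []
        for alt in "ACGT":
            if alt == seq[pos]:
                continue  # skip the reference allele
            mutated = seq[:pos] + alt + seq[pos + 1:]
            per_allele.append(l2_disruption(pred_ref, predict(mutated)))
        scores[pos] = max(per_allele)
    return scores

def log_score(score, pseudocount=0.001):
    """Natural log with a pseudocount, as used for visualization."""
    return float(np.log(score + pseudocount))
```

For the genome-wide screen in panels e and f, a single random alternative allele is drawn per position instead of taking the maximum over all three, so l2_disruption would be called once per sampled mutation.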
Extended Data Fig. 8 Model predicts a redundant boundary at Lmo2.
Left: Predicted genome folding for unperturbed Lmo2 locus above the CTCF ChIP-seq profile for the region. Predictions in this figure used hg19 sequence as input and Akita’s output for HFF Micro-C. Right: Numbers above maps indicate the (start,end) position of bins that were deleted, highlighted by purple shading on the zoomed-in CTCF ChIP-seq profile below the predicted WT map. Akita predicts that deleting bins encompassing individual CTCF peaks (top row) would only mildly alter genome folding, and deletion of all three (bottom right) would be more impactful than either pair (bottom left and middle).
Extended Data Fig. 9 Cross-species predictions reveal impact of B2 SINE elements on genome folding in mouse embryonic stem cells.
a, MSE versus Spearman R for mouse regions that overlap regions syntenic to the human test set (mm10-syn-test, n=156 regions). MSE and Spearman R are both calculated per region for every (target, prediction) pair. Target Hi-C data was acquired from mouse embryonic stem cells (ref. 31), mapped to mm10 and processed similarly to the previous human datasets. Predictions in this figure were made using mm10 sequence as input and Akita’s output for the H1hESC Micro-C dataset. b, (left) Signal strength of predictions versus targets, for mm10-syn-test, calculated as the mean squared values in each map (same 156 regions shown as above). The model trained on human data shows an overall shift towards overly salient predictions in mouse relative to its predictions for human data (see Extended Data Fig. 3a for comparison). Black line shows x=y for reference here and below. (right) Squared error between targets and predictions correlates with the number of B2 SINE elements in the region (from RepeatMasker, ref. 51). c, Masking B2 SINE elements in input DNA sequences improved MSE for 93/156 predictions (~60%, left), and Spearman R for 106/156 predictions (~67%, right). This suggests that the mouse genome has evolved ways to mitigate the impact of its numerous B2 SINE elements on genome folding, which is supported by recent studies (ref. 33). d, e, Examples of improved predictions for two regions from the mm10-syn-test set after masking B2 SINEs, with the total number of B2 SINE elements per bin in the region displayed below each map. Initial predictions indicated in (a) with orange and green dots. d, chr5:106334208–107382784 (deltaCorr:0.26, corrMutB2:0.72). Rectangle highlights a feature that is incorrectly predicted to be absent prior to masking B2 SINEs, and is correctly predicted following masking B2 SINEs. e, chr14:61751296–62799872 (deltaCorr:0.18, corrMutB2:0.69).
Rectangle highlights a feature that is incorrectly predicted to be present prior to masking B2 SINEs, and is correctly predicted following masking B2 SINEs.
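The per-region metrics named in this legend (MSE, Spearman R and signal strength as the mean squared value of a map) can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names are hypothetical, each map is assumed to be flattened to a list of pixel values, and the authors' own pipeline may differ in detail.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_r(target, pred):
    """Spearman R: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(target), average_ranks(pred))


def mse(target, pred):
    """Mean squared error between target and predicted pixels."""
    return mean((t - p) ** 2 for t, p in zip(target, pred))


def signal_strength(pixels):
    """Mean squared pixel value of one map."""
    return mean(v * v for v in pixels)
```

Computing `mse` and `spearman_r` once per region, before and after masking, gives the paired values from which counts such as 93/156 improved regions would be tallied.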
Extended Data Fig. 10 A model trained with mouse genomic data correctly learns the minimal influence of B2 SINE sequences on genome folding.
a, MSE vs. Spearman R for a mm10-trained model on mm10 data (blue, n=384 regions shown), and the hg38-trained model on mm10 data (orange, n=156 regions shown). Each point represents a region from the respective test set. The mm10 model was trained using Hi-C data from Bonev et al.31 (mESC, CN, ncx_CN, NPC, ncx_NPC) and Micro-C from Hsieh et al.52 (mESC) with the same multi-task framework used to train our hg38 model. b, c, For the mm10-trained model, masking B2 SINE elements worsened MSE for 243/384 (63%) and Spearman R for 254/384 (66%) regions. MSE and Spearman R are both calculated per region for every (target, prediction) pair, over all pixels in the upper triangular region of predicted maps (n=99681 pixels). Together, a–c indicate that the mm10-trained model correctly learns that B2 SINE elements have little impact on local genome folding and that mutagenizing these elements leads to slightly worse predictive performance, in contrast with the hg38-trained model (see Extended Data Fig. 9). d, Predictions for the regions from Extended Data Fig. 9 using the mm10-trained model. Note that the region from chr5 overlaps the training set for the mm10-trained model and the region from chr14 overlaps the test set.
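The pixel count quoted in this legend can be checked with a short sketch. The map size of 448 × 448 bins and the exclusion of the two diagonals closest to the main diagonal are assumptions not stated in this legend; they are used here because they reproduce n=99681. The function names are hypothetical.

```python
def upper_triangle_pixels(n_bins, diagonal_offset):
    """List (i, j) pixels with j - i >= diagonal_offset in an n_bins x n_bins map."""
    return [
        (i, j)
        for i in range(n_bins)
        for j in range(i + diagonal_offset, n_bins)
    ]


def upper_triangle_count(n_bins, diagonal_offset):
    """Closed-form count of the same pixels: m(m + 1) / 2 with m = n_bins - offset."""
    m = n_bins - diagonal_offset
    return m * (m + 1) // 2


# Assumed map geometry: 448 bins, first two diagonals excluded.
n_pixels = len(upper_triangle_pixels(448, 2))
```

Under these assumptions each per-region MSE or Spearman R is computed over one vector of 99,681 pixel values.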
Supplementary information
Supplementary Information
Supplementary Table 1, legend for Data 1, Notes 1–3 and references.
Supplementary Data 1
Predictions across the test set for the human-trained model.
About this article
Cite this article
Fudenberg, G., Kelley, D.R. & Pollard, K.S. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 17, 1111–1117 (2020). https://doi.org/10.1038/s41592-020-0958-x
DOI: https://doi.org/10.1038/s41592-020-0958-x