  • Letter

Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings

Abstract

Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks (refs. 1–6), including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired whole genome sequencing and gene expression from 839 individuals in the ROSMAP study (ref. 7) to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods: they often fail to predict the correct direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
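The cross-individual evaluation described above can be illustrated with a minimal sketch. It assumes that evaluation reduces to a per-gene Pearson correlation between observed and predicted expression across the same individuals, which is how the Extended Data legends report results. The function names `pearson_r` and `per_gene_r` and the toy values are hypothetical and are not taken from the authors' repository.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)


def per_gene_r(observed, predicted):
    """Map each gene to the correlation across individuals.

    Both arguments are dicts: gene -> list of values, one per individual,
    with individuals in the same order in both dicts (an assumption).
    """
    return {g: pearson_r(observed[g], predicted[g])
            for g in observed if g in predicted}


# Toy values, made up for illustration only.
observed = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [1.0, 2.0, 3.0, 4.0]}
predicted = {"geneA": [0.1, 0.2, 0.3, 0.4], "geneB": [0.4, 0.3, 0.2, 0.1]}
scores = per_gene_r(observed, predicted)
```

In this toy case `geneA` is predicted with the correct direction (r close to 1) and `geneB` with the wrong direction (r close to −1). The sign of the per-gene coefficient is what separates a correct from an incorrect direction of effect, so reporting only absolute values would hide the failure mode the abstract describes.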

Fig. 1: Evaluation of Enformer across genomic regions and select loci.
Fig. 2: Evaluation of Enformer on prediction of gene expression across individuals.

Data availability

Genotype and RNA-seq data for the Religious Orders Study and Rush Memory and Aging Project (ROSMAP) samples are available from the Synapse AMP-AD Data Portal (accession code syn2580853) as well as the RADC Research Resource Sharing Hub at www.radc.rush.edu.

Code availability

Scripts for running the analyses presented, as well as intermediate results, are available from https://github.com/mostafavilabuw/EnformerAssessment (ref. 25).

References

  1. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  2. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  3. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  4. Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).

  5. Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet. 54, 725–734 (2022).

  6. Park, C. Y. et al. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk. Nat. Genet. 53, 166–173 (2021).

  7. De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5, 180142 (2018).

  8. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  9. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  10. Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods https://doi.org/10.1038/s41592-022-01562-8 (2022).

  11. Maslova, A. et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA 117, 25655–25666 (2020).

  12. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).

  13. Kim, D. S. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. https://doi.org/10.1038/s41588-021-00947-3 (2021).

  14. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).

  15. Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. https://doi.org/10.1038/s41576-022-00532-2 (2022).

  16. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).

  17. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).

  18. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).

  19. Reshef, Y. A. et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat. Genet. 50, 1483–1493 (2018).

  20. Bennett, D. A. et al. Religious Orders Study and Rush Memory and Aging Project. J. Alzheimers Dis. 64, S161–S189 (2018).

  21. Mostafavi, S. et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease. Nat. Neurosci. 21, 811–819 (2018).

  22. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014).

  23. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).

  24. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) Vol. 70 3319–3328 (PMLR, 2017).

  25. Sasse, A., Ng, B. & Spiro, A. E. mostafavilabuw/EnformerAssessment: EnformerEvaluationV1. Zenodo https://doi.org/10.5281/zenodo.8274879 (2023).

Acknowledgements

We thank D. R. Kelley for helpful comments on this manuscript. We thank the participants of ROS and MAP for their essential contributions and gifts to this project. This work was supported by NIH grants P30AG10161 (to D.A.B.), P30AG72975 (to D.A.B.), R01AG15819 (to D.A.B.), R01AG17917 (to D.A.B.), U01AG46152 (to D.A.B. and P.L.D.), U01AG61356 (to D.A.B. and P.L.D.), R01AG057911 (to C.G.), R01AG06179 (to C.G.) and R01AG036836 (to P.L.D.), as well as a CIFAR research fellowship and an NSERC Discovery Grant (to S.M.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

Conceived the study: S.M. and M.C. Study design: S.M., A.S. and M.C. Data generation and quality control analyses: B.N., A.E.S., C.G., P.L.D., S.T. and D.A.B. Analyses and interpretation: A.S., A.E.S., B.N., S.M. and M.C. Wrote the initial draft: S.M., A.S. and B.N. Read and provided comments on the manuscript: M.C., B.N., A.E.S., P.L.D., C.G., S.T. and D.A.B. Supervised the project: S.M. and M.C.

Corresponding authors

Correspondence to Maria Chikina or Sara Mostafavi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Kaur Alasoo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Sensitivity analysis for Enformer Predictions.

(a) Density plot in which each dot represents a gene (n = 13,397). The x-axis shows Pearson’s r coefficients for Enformer predictions from the single most relevant track (‘CAGE,brain,adult’) and the y-axis shows those for the fine-tuned cortex model built from all human tracks. Color depicts local density. (b) Pearson’s r coefficients across 839 individuals between observed expression and the predicted CAGE track from a single forward-stranded input sequence centered at the TSS (x-axis) versus the average over forward-stranded sequences shifted by −3, −2, −1, 0, 1, 2 and 3 bp, plus a reverse-stranded input sequence centered at the TSS (y-axis). Data are shown for a random subset of loci (n = 30). Orange line: the diagonal on which the x-axis and y-axis values are equal. The correlation coefficient between the x-axis and y-axis values is R = 0.94. (c) Absolute Pearson’s r coefficients between Enformer predictions and observed gene expression for genes with one causal SNP and for all other genes. Causal genes were determined with the SuSiE algorithm (‘Susie-Causal’). Edges of the box indicate the 25th and 75th percentiles, and the central mark indicates the median (N1 = 183 genes fine-mapped with SuSiE, N2 = 6,625 genes without fine-mapped variants; two-sided Wilcoxon rank-sum test; for each gene, the r coefficient was computed using n = 839 individuals).
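The averaging scheme in panel (b) can be sketched as follows. This is an illustrative reading of the legend and not the authors' implementation: it averages a scalar prediction over seven forward-strand windows (shifts of −3 to 3 bp) plus one reverse-complemented window centered at the TSS. The `predict` callable stands in for any sequence-to-expression model, and all names here (`shifted_window`, `averaged_prediction`, `reverse_complement`) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shifted_window(chrom_seq, center, length, shift):
    """Window of `length` bp centered at `center + shift` (0-based index)."""
    start = center + shift - length // 2
    if start < 0 or start + length > len(chrom_seq):
        raise ValueError("window falls outside the sequence")
    return chrom_seq[start:start + length]


def averaged_prediction(predict, chrom_seq, center, length,
                        shifts=range(-3, 4)):
    """Mean of `predict` over shifted forward windows and one reverse window.

    `predict` is a placeholder for a model call that maps a DNA string
    to a single number; it is an assumption of this sketch.
    """
    windows = [shifted_window(chrom_seq, center, length, s) for s in shifts]
    # One reverse-stranded input, centered at the unshifted position.
    windows.append(reverse_complement(
        shifted_window(chrom_seq, center, length, 0)))
    preds = [predict(w) for w in windows]
    return sum(preds) / len(preds)


def gc_fraction(seq):
    """Toy stand-in for a model: fraction of G or C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


toy_sequence = "ACGT" * 10
toy_average = averaged_prediction(gc_fraction, toy_sequence, center=20, length=8)
```

With the default shifts, the model is called eight times per locus. The high agreement reported in the legend (R = 0.94) indicates that, for the sampled loci, this averaging changes the cross-individual correlations only modestly.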

Extended Data Fig. 2 Performance of the shallow CNN model.

(a) Density plot of observed population-average expression of test set genes (n = 3,401 genes) in cerebral cortex versus the simple CNN’s predicted gene expression from the reference sequences. This plot displays only genes that could be assigned to Enformer’s test set. Colors depict local density. (b) The y-axis shows Pearson’s r coefficients between observed expression values and the simple CNN’s predicted values per individual. The x-axis shows the negative log10 P value computed with a gene-specific null model (one-sided t-test, n = 50 independent samples per gene; Supplementary Methods). The color represents the predicted mean expression. The red dashed line indicates a Benjamini–Hochberg FDR of 0.05.
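The FDR threshold in panel (b) refers to the Benjamini–Hochberg procedure. A minimal sketch of that standard adjustment is given below, assuming only a list of per-gene P values. The gene-specific null model that produces those P values is described in the Supplementary Methods and is not reproduced here; the function name and the toy P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P values, made up for illustration only.
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = benjamini_hochberg(toy_pvalues)
significant = [q <= 0.05 for q in toy_adjusted]
```

A gene is called significant at an FDR of 0.05 when its adjusted value is at most 0.05, which corresponds to the points to the right of the dashed line in the panel.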

Supplementary information

Supplementary Information

Supplementary Methods and Supplementary Figs. S1–S11.

Reporting Summary

Supplementary Tables

Sheet 1: Supplementary Table 1. This table provides information on genes whose expression prediction was evaluated. Sheet 2: Header descriptions for Supplementary Table 1. Sheet 3: Supplementary Table 2. This table provides information about the driver gene analysis. Sheet 4: Header descriptions for Supplementary Table 2. Sheet 5: Supplementary Table 3. This table provides ISM values for the set of tested variants. Sheet 6: Header descriptions for Supplementary Table 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sasse, A., Ng, B., Spiro, A.E. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 55, 2060–2064 (2023). https://doi.org/10.1038/s41588-023-01524-6

  • DOI: https://doi.org/10.1038/s41588-023-01524-6
