An automated framework for efficiently designing deep convolutional neural networks in genomics

A preprint version of the article is available at bioRxiv.

Abstract

Convolutional neural networks (CNNs) have become a standard for analysis of biological sequences. Tuning of network architectures is essential for a CNN’s performance, yet it requires substantial knowledge of machine learning and a commitment of time and effort. This process thus imposes a major barrier to broad and effective application of modern deep learning in genomics. Here we present Automated Modelling for Biological Evidence-based Research (AMBER), a fully automated framework to efficiently design and apply CNNs for genomic sequences. AMBER designs optimal models for user-specified biological questions through state-of-the-art neural architecture search (NAS). We applied AMBER to the task of modelling genomic regulatory features and demonstrated that the predictions of the AMBER-designed model are significantly more accurate than the equivalent baseline non-NAS models and match or even exceed published expert-designed models. Interpretation of the AMBER architecture search revealed its design principles of utilizing the full space of computational operations for accurately modelling genomic sequences. Furthermore, we illustrated the use of AMBER to accurately discover functional genomic variants in allele-specific binding and disease heritability enrichment. AMBER provides an efficient automated method for designing accurate deep learning models in genomics.
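As a rough illustration of what a neural architecture search does, the following toy sketch uses a policy-gradient (REINFORCE-style) controller to pick one operation per layer. This is not AMBER's API or its exact algorithm: the operation names, the `mock_reward` function and all hyperparameters are hypothetical stand-ins, and a real search would train each sampled CNN on genomic data and use its validation performance as the reward.

```python
import math
import random

# Hypothetical search space: one operation chosen per layer (names are illustrative).
OPERATIONS = ["conv_k4", "conv_k8", "dilated_conv_k8", "max_pool", "identity"]
NUM_LAYERS = 4


def softmax(logits):
    """Convert a list of logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Sample one operation index per layer from the controller's policy."""
    return [
        rng.choices(range(len(OPERATIONS)), weights=softmax(row))[0]
        for row in logits
    ]


def mock_reward(arch):
    """Stand-in for the validation performance of a trained child model.

    The per-operation scores are arbitrary assumptions made for this sketch.
    """
    scores = {
        "conv_k4": 0.6,
        "conv_k8": 0.8,
        "dilated_conv_k8": 0.9,
        "max_pool": 0.4,
        "identity": 0.2,
    }
    return sum(scores[OPERATIONS[i]] for i in arch) / len(arch)


def search(steps=300, lr=0.5, seed=0):
    """Run a REINFORCE-style search with a moving-average baseline."""
    rng = random.Random(seed)
    logits = [[0.0] * len(OPERATIONS) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        arch = sample_architecture(logits, rng)
        reward = mock_reward(arch)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for layer, choice in enumerate(arch):
            probs = softmax(logits[layer])
            for op in range(len(OPERATIONS)):
                # Gradient of log p(choice) w.r.t. logit `op` is 1[op == choice] - p(op).
                grad = (1.0 if op == choice else 0.0) - probs[op]
                logits[layer][op] += lr * advantage * grad
    # Report the most probable operation for each layer.
    return [
        OPERATIONS[max(range(len(OPERATIONS)), key=row.__getitem__)]
        for row in logits
    ]


if __name__ == "__main__":
    print(search())
```

Calling `search()` returns a list of `NUM_LAYERS` operation names. Which operations the controller settles on depends entirely on the assumed reward function and the random seed.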

Fig. 1: Method and workflow overview of AMBER.
Fig. 2: AMBER searched architectures outperform sampled architectures.
Fig. 3: Illustration of AMBER architecture search logistics.
Fig. 4: Benchmarking variant effect prediction with allele-specific binding.
Fig. 5: Benchmarking heritability enrichment in disease GWAS.

Data availability

All data used in this study are publicly available and the URLs are provided in the corresponding sections in Methods. Training data for the genomic regulatory features were downloaded from http://deepsea.princeton.edu/help/ as described in ref. 4. The ground-truth data for allele-specific binding analysis were obtained from the supplementary data of ref. 29. The UK Biobank GWAS summary statistics data are reported in ref. 40 and downloaded from https://alkesgroup.broadinstitute.org/UKBB/.

Code availability

The AMBER package is available on GitHub at https://github.com/zj-zhang/AMBER; the analysis presented in this study is available on GitHub at https://github.com/zj-zhang/AMBER-Seq. The AMBER code is also archived on Zenodo at https://zenodo.org/record/4384777 (ref. 47).

References

  1. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. https://doi.org/10.1038/s41576-019-0122-6 (2019).

  2. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).

  3. LeCun, Y. & Bengio, Y. in The Handbook of Brain Theory and Neural Networks (ed. Arbib, M. A.) 3361(10) (MIT Press, 1995).

  4. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  5. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  6. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  7. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).

  8. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).

  9. Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).

  10. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  11. Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nat. Methods 15, 290–298 (2018).

  12. Zhang, Z. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods 16, 307–310 (2019).

  13. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).

  14. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 18, 67 (2017).

  15. Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).

  16. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations 1–14 (ICLR, 2014).

  17. Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 1800–1807 (IEEE, 2017).

  18. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  19. Zoph, B. & Le, Q. V. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations (ICLR, 2017).

  20. Pham, H., Guan, M. Y., Zoph, B., Le, Q. V. & Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning 4095–4104 (PMLR, 2018).

  21. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

  22. Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).

  23. Real, E., Aggarwal, A., Huang, Y. & Le, Q. V. Regularized evolution for image classifier architecture search. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 4780–4789 (2019).

  24. Liu, H., Simonyan, K. & Yang, Y. Darts: differentiable architecture search. In International Conference on Learning Representations (ICLR, 2019).

  25. He, X., Zhao, K. & Chu, X. AutoML: a survey of the state-of-the-art. Knowl. Based Syst. 212, 106622 (2021).

  26. Lee, H., Grosse, R., Ranganath, R. & Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. 26th International Conference On Machine Learning, ICML 2009 609–616 (ACM, 2009); https://doi.org/10.1145/1553374.1553453

  27. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8697–8710 (IEEE, 2018).

  28. Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. In 4th International Conference on Learning Representations (ICLR, 2016).

  29. Wagih, O., Merico, D., Delong, A. & Frey, B. J. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. Preprint at bioRxiv https://doi.org/10.1101/253427 (2018).

  30. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).

  31. Bryne, J. C. et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008).

  32. Machanick, P. & Bailey, T. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011).

  33. Zhang, P. et al. Negative cross-talk between hematopoietic regulators: GATA proteins repress PU.1. Proc. Natl Acad. Sci. USA 96, 8705–8710 (1999).

  34. Metcalf, D. et al. Inactivation of PU.1 in adult mice leads to the development of myeloid leukemia. Proc. Natl Acad. Sci. USA 103, 1486–1491 (2006).

  35. Wang, F. & Tong, Q. Transcription factor PU.1 is expressed in white adipose and inhibits adipocyte differentiation. Am. J. Physiol. Cell Physiol. 295, C213–C220 (2008).

  36. Lin, L. et al. Adipocyte expression of PU.1 transcription factor causes insulin resistance through upregulation of inflammatory cytokine gene expression and ROS production. Am. J. Physiol. Endocrinol. Metab. 302, E1550 (2012).

  37. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).

  38. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  39. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

  40. Loh, P. R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).

  41. Zhang, Z., Zhou, L., Gou, L. & Wu, Y. N. Neural architecture search for joint optimization of predictive power and biological knowledge. Preprint at https://arxiv.org/abs/1909.00337 (2019).

  42. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992).

  43. Machiela, M. & Chanock, S. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 31, 3555–3557 (2015).

  44. Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

  45. Dandine-Roulland, C. & Perdry, H. Genome-wide data manipulation, association analysis and heritability estimates in R with Gaston 1.5. In 46th European Mathematical Genetics Meeting (EMGM, 2018).

  46. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

  47. Zhang, Z. Code for ‘An automated framework for efficiently designing deep convolutional neural networks in genomics’. Zenodo https://doi.org/10.5281/ZENODO.4384777 (2020).

Acknowledgements

We thank all members of the Troyanskaya laboratory for helpful discussions. The work in this paper was performed using the high-performance computing resources of the Simons Foundation. O.G.T. is a CIFAR fellow.

Author information

Contributions

Z.Z. and O.G.T. conceived the study. Z.Z. implemented the experiments. C.Y.P. and C.L.T. contributed research materials and analytic tools. Z.Z. and O.G.T. wrote the paper.

Corresponding author

Correspondence to Olga G. Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–6.

Reporting Summary

Supplementary Data

Supplementary Tables 1–3.

About this article

Cite this article

Zhang, Z., Park, C.Y., Theesfeld, C.L. et al. An automated framework for efficiently designing deep convolutional neural networks in genomics. Nat Mach Intell 3, 392–400 (2021). https://doi.org/10.1038/s42256-021-00316-z

  • DOI: https://doi.org/10.1038/s42256-021-00316-z
