Abstract

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Code availability

Selene is open-source software (license BSD 3-Clause Clear). Project homepage: https://selene.flatironinstitute.org. GitHub: https://github.com/FunctionLab/selene. Archived version: https://github.com/FunctionLab/selene/archive/0.2.0.tar.gz.

Data availability

Cistrome (ref. 14), Cistrome file ID 33545, measurements from GSM970258: http://dc2.cistrome.org/api/downloads/eyJpZCI6IjMzNTQ1In0%3A1fujCu%3ArNvWLCNoET6o9SdkL8fEv13uRu4b/. ENCODE (ref. 21) and Roadmap Epigenomics (ref. 22) chromatin profiles: files listed in Supplementary Table 1 of ref. 4. IGAP age at onset survival (refs. 16, 17): https://www.niagads.org/datasets/ng00058 (P-values-only file). The case studies used processed datasets from these sources. They can be downloaded at the following Zenodo links: Cistrome, https://zenodo.org/record/2214130/files/data.tar.gz; ENCODE and Roadmap Epigenomics chromatin profiles, https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz; IGAP age at onset survival, https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz. Source data for Figs. 2 and 3 are available online.
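
For readers who want to script the retrieval of the processed datasets, the following is a minimal sketch rather than part of Selene itself. It assumes the three Zenodo URLs listed above are still reachable and uses only the Python standard library. The names `ZENODO_ARCHIVES`, `archive_path`, and `fetch_archive`, as well as the `selene_data` output directory, are hypothetical and chosen for illustration only.

```python
import tarfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Zenodo archive URLs copied from the data availability statement.
# The dictionary keys are illustrative labels, not official dataset names.
ZENODO_ARCHIVES = {
    "cistrome": "https://zenodo.org/record/2214130/files/data.tar.gz",
    "chromatin_profiles": "https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz",
    "variant_effect_prediction": "https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz",
}


def archive_path(name, dest_dir="selene_data"):
    """Return the local path where the named archive would be saved."""
    url = ZENODO_ARCHIVES[name]
    filename = Path(urlparse(url).path).name  # e.g. "data.tar.gz"
    return Path(dest_dir) / name / filename


def fetch_archive(name, dest_dir="selene_data"):
    """Download one archive if it is not already present, then unpack it.

    Returns the directory that holds the archive and its extracted contents.
    """
    target = archive_path(name, dest_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urllib.request.urlretrieve(ZENODO_ARCHIVES[name], str(target))
    with tarfile.open(target, "r:gz") as tar:
        tar.extractall(target.parent)
    return target.parent
```

Calling `fetch_archive("cistrome")` would place the archive and its contents under `selene_data/cistrome/`. Each dataset gets its own subdirectory so that identically named files from different archives cannot overwrite one another.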

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).

  2. Ching, T. et al. J. R. Soc. Interface 15, 20170387 (2018).

  3. Segler, M. H. S., Preuss, M. & Waller, M. P. Nature 555, 604–610 (2018).

  4. Zhou, J. & Troyanskaya, O. G. Nat. Methods 12, 931–934 (2015).

  5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Nat. Biotechnol. 33, 831–838 (2015).

  6. Kelley, D. R., Snoek, J. & Rinn, J. L. Genome Res. 26, 990–999 (2016).

  7. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. Genome Biol. 18, 67 (2017).

  8. Kelley, D. R. et al. Genome Res. 28, 739–750 (2018).

  9. Quang, D. & Xie, X. Nucleic Acids Res. 44, e107 (2016).

  10. Sundaram, L. et al. Nat. Genet. 50, 1161–1170 (2018).

  11. Min, S., Lee, B. & Yoon, S. Brief. Bioinform. 18, 851–869 (2017).

  12. Budach, S. & Marsico, A. Bioinformatics 34, 3035–3037 (2018).

  13. Avsec, Z. et al. bioRxiv Preprint at https://www.biorxiv.org/content/10.1101/375345v1 (2018).

  14. Mei, S. et al. Nucleic Acids Res. 45, D658–D662 (2017).

  15. Troyanskaya, O. G. et al. Selene CLI operations and outputs. Selene https://selene.flatironinstitute.org/overview/cli.html (2018).

  16. Ruiz, A. et al. Transl. Psychiatry 4, e358 (2014).

  17. Huang, K.-L. et al. Nat. Neurosci. 20, 1052–1061 (2017).

  18. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).

  19. Li, H. Bioinformatics 27, 718–719 (2011).

  20. ENCODE Project. Reference sequences. ENCODE: Encyclopedia of DNA Elements https://www.encodeproject.org/data-standards/reference-sequences/ (2016).

  21. ENCODE Project Consortium. Nature 489, 57–74 (2012).

  22. Kundaje, A. et al. Nature 518, 317–330 (2015).

Acknowledgements

The authors acknowledge all members of the Troyanskaya lab for helpful discussions. In addition, the authors thank D. Simon for setting up the website and automating updates to the site. The authors are pleased to acknowledge that this work was performed using the high-performance computing resources at Simons Foundation and the TIGRESS computer center at Princeton University. This work was supported by NIH grants R01HG005998, U54HL117798, R01GM071966, and T32HG003284; HHS grant HHSN272201000054C; and Simons Foundation grant 395506, all to O.G.T. O.G.T. is a CIFAR fellow.

Author information

Author notes

  1. These authors contributed equally: Kathleen M. Chen, Evan M. Cofer.

Affiliations

  1. Flatiron Institute, Simons Foundation, New York, NY, USA

    • Kathleen M. Chen, Jian Zhou & Olga G. Troyanskaya
  2. Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA

    • Evan M. Cofer, Jian Zhou & Olga G. Troyanskaya
  3. Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, NJ, USA

    • Evan M. Cofer
  4. Department of Computer Science, Princeton University, Princeton, NJ, USA

    • Olga G. Troyanskaya

Authors

  1. Kathleen M. Chen

  2. Evan M. Cofer

  3. Jian Zhou

  4. Olga G. Troyanskaya

Contributions

K.M.C. and J.Z. conceived the Selene library. K.M.C. and E.M.C. designed, implemented, and documented Selene. K.M.C. performed the analyses described in the manuscript. O.G.T. supervised the project. K.M.C., E.M.C., and O.G.T. wrote the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Olga G. Troyanskaya.

About this article

DOI

https://doi.org/10.1038/s41592-019-0360-8