  • Article

Deep generative design of RNA family sequences

Abstract

RNA engineering has immense potential to drive innovation in biotechnology and medicine. Despite its importance, a versatile platform for the automated design of functional RNA is still lacking. Here, we propose RNA family sequence generator (RfamGen), a deep generative model that designs RNA family sequences in a data-efficient manner by explicitly incorporating alignment and consensus secondary structure information. RfamGen can generate novel and functional RNA family sequences by sampling points from a semantically rich and continuous representation. We have experimentally demonstrated the versatility of RfamGen using diverse RNA families. Furthermore, we confirmed the high success rate of RfamGen in designing functional ribozymes through a quantitative massively parallel assay. Notably, RfamGen successfully generates artificial sequences with higher activity than natural sequences. Overall, RfamGen significantly improves our ability to design functional RNA and opens up new potential for generative RNA engineering in synthetic biology.

Fig. 1: Overview of RfamGen.
Fig. 2: Overview of RfamGen evaluation.
Fig. 3: Latent space has semantically rich representation of sequences.
Fig. 4: RfamGen generates functional sequences of diverse RNA families.
Fig. 5: Massively parallel assay of generated sequences from RfamGen.

Data availability

The sequencing data from the massively parallel assay have been deposited in the Sequence Read Archive under accession code PRJNA1044007. All other data are provided in this paper. Source data are provided with this paper.

Code availability

The RfamGen code and custom Python scripts are available on GitHub (https://github.com/Shunsuke-1994/rfamgen). The code is also deposited on Zenodo (https://zenodo.org/doi/10.5281/zenodo.10187598) (ref. 62).

References

  1. Wilson, D. S. & Szostak, J. W. In vitro selection of functional nucleic acids. Annu. Rev. Biochem. 68, 611–647 (1999).

  2. Guo, P. et al. Engineering RNA for targeted siRNA delivery and medical application. Adv. Drug Deliv. Rev. 62, 650–666 (2010).

  3. Kim, C. M. & Smolke, C. D. Biomedical applications of RNA-based devices. Curr. Opin. Biomed. Eng. 4, 106–115 (2017).

  4. Kim, J. & Franco, E. RNA nanotechnology in synthetic biology. Curr. Opin. Biotech. 63, 135–141 (2020).

  5. Thavarajah, W., Hertz, L. M., Bushhouse, D. Z., Archuleta, C. M. & Lucks, J. B. RNA engineering for public health: innovations in RNA-based diagnostics and therapeutics. Annu. Rev. Chem. Biomol. 12, 263–286 (2021).

  6. Dykstra, P. B., Kaplan, M. & Smolke, C. D. Engineering synthetic RNA devices for cell control. Nat. Rev. Genet. 23, 215–228 (2022).

  7. Liang, J. C., Bloom, R. J. & Smolke, C. D. Engineering biological systems with synthetic RNA molecules. Mol. Cell 43, 915–926 (2011).

  8. Qi, L. S. & Arkin, A. P. A versatile framework for microbial engineering using synthetic non-coding RNAs. Nat. Rev. Microbiol. 12, 341–354 (2014).

  9. Etzel, M. & Mörl, M. Synthetic riboswitches: from plug and pray toward plug and play. Biochemistry 56, 1181–1198 (2017).

  10. Kobori, S. & Yokobayashi, Y. Analyzing and tuning ribozyme activity by deep sequencing to modulate gene expression level in mammalian cells. ACS Synth. Biol. 7, 371–376 (2018).

  11. Strobel, B. et al. High-throughput identification of synthetic riboswitches by barcode-free amplicon-sequencing in human cells. Nat. Commun. 11, 714 (2020).

  12. Rotrattanadumrong, R. & Yokobayashi, Y. Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning. Nat. Commun. 13, 4847 (2022).

  13. Dotu, I. et al. Complete RNA inverse folding: computational design of functional hammerhead ribozymes. Nucleic Acids Res. 42, 11752–11762 (2014).

  14. Yamagami, R., Kayedkhordeh, M., Mathews, D. H. & Bevilacqua, P. C. Design of highly active double-pseudoknotted ribozymes: a combined computational and experimental study. Nucleic Acids Res. 47, gky1118 (2018).

  15. Najeh, S., Zandi, K., Perreault, J. & Kharma, N. Computational design and experimental verification of pseudoknotted ribozymes. RNA https://doi.org/10.1261/rna.079148.122 (2023).

  16. Eddy, S. R. & Durbin, R. RNA sequence analysis using covariance models. Nucleic Acids Res. 22, 2079–2088 (1994).

  17. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, 1998).

  18. Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200 (2020).

  19. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  20. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  21. Iwano, N., Adachi, T., Aoki, K., Nakamura, Y. & Hamada, M. Generative aptamer discovery using RaptGen. Nat. Comput. Sci. 2, 378–386 (2022).

  22. Iuchi, H. et al. Representation learning applications in biological sequence analysis. Comput. Struct. Biotechnol. J. 19, 3198–3208 (2021).

  23. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  24. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. Proceedings of 2nd International Conference on Learning Representations (ICLR) (eds Bengio, Y. & LeCun, Y.) (2014).

  25. Yao, Z., Weinberg, Z. & Ruzzo, W. L. CMfinder—a covariance model based RNA motif finding algorithm. Bioinformatics 22, 445–452 (2006).

  26. Rivas, E. Evolutionary conservation of RNA sequence and structure. Wiley Interdiscip. Rev. RNA 12, e1649 (2021).

  27. Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).

  28. Wilburn, G. W. & Eddy, S. R. Remote homology search with hidden Potts models. PLoS Comput. Biol. 16, e1008085 (2020).

  29. Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137 (2015).

  30. Rivas, E., Clements, J. & Eddy, S. R. A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nat. Methods 14, 45–48 (2017).

  31. Li, C., Qian, W., Maclean, C. J. & Zhang, J. The fitness landscape of a tRNA gene. Science 352, 837–840 (2016).

  32. Weinberg, Z. et al. New classes of self-cleaving ribozymes revealed by comparative genomics analysis. Nat. Chem. Biol. 11, 606–610 (2015).

  33. Li, S., Lünse, C. E., Harris, K. A. & Breaker, R. R. Biochemical analysis of hatchet self-cleaving ribozymes. RNA 21, 1845–1851 (2015).

  34. Zheng, L. et al. Structure-based insights into self-cleavage by a four-way junctional twister-sister ribozyme. Nat. Commun. 8, 1180 (2017).

  35. Andreasson, J. O., Savinov, A., Block, S. M. & Greenleaf, W. J. Comprehensive sequence-to-function mapping of cofactor-dependent RNA catalysis in the glmS ribozyme. Nat. Commun. 11, 1663 (2020).

  36. Kobori, S., Nomura, Y., Miu, A. & Yokobayashi, Y. High-throughput assay and engineering of self-cleaving ribozymes by sequencing. Nucleic Acids Res. 43, e85–e85 (2015).

  37. Kobori, S. & Yokobayashi, Y. High-throughput mutational analysis of a twister ribozyme. Angew. Chem. Int. Ed. 55, 10354–10357 (2016).

  38. Xiang, J. S. et al. Massively parallel RNA device engineering in mammalian cells with RNA-Seq. Nat. Commun. 10, 4327 (2019).

  39. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

  40. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  41. Townshend, B., Kennedy, A. B., Xiang, J. S. & Smolke, C. D. High-throughput cellular RNA device engineering. Nat. Methods 12, 989–994 (2015).

  42. Im, D. J., Ahn, S., Memisevic, R. & Bengio, Y. Denoising criterion for variational auto-encoding framework. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31 (2017).

  43. Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).

  44. Trinquier, J., Uguzzoni, G., Pagnani, A., Zamponi, F. & Weigt, M. Efficient generative modeling of protein sequences using simple autoregressive models. Nat. Commun. 12, 5800 (2021).

  45. Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

  46. Wang, J. et al. AAV-delivered suppressor tRNA overcomes a nonsense mutation in mice. Nature 604, 343–348 (2022).

  47. Albers, S. et al. Engineered tRNAs suppress nonsense mutations in cells and in vivo. Nature 618, 842–848 (2023).

  48. Kofman, C. et al. Computationally-guided design and selection of high performing ribosomal active site mutants. Nucleic Acids Res. 50, 13143–13154 (2022).

  49. Krüger, A. et al. Community science designed ribosomes with beneficial phenotypes. Nat. Commun. 14, 961 (2023).

  50. Ausländer, S. et al. A general design strategy for protein-responsive riboswitches in mammalian cells. Nat. Methods 11, 1154–1160 (2014).

  51. Kusner, M. J., Paige, B. & Hernández-Lobato, J. Grammar variational autoencoder. Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70. 1945–1954 (2017).

  52. Kawano, S. et al. Tutorial videos of bioinformatics resources: online distribution trial in Japan named TogoTV. Brief. Bioinforma. 13.2, 258–268 (2012).

  53. Janssen, S. & Giegerich, R. Ambivalent covariance models. BMC Bioinforma. 16, 178 (2015).

  54. Fu, H. et al. Cyclical annealing schedule: a simple approach to mitigating KL vanishing. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (eds Burstein, J. et al.) 240–250 (ACL, 2019).

  55. Rivas, E., Clements, J. & Eddy, S. R. Estimating the power of sequence covariation for detecting conserved RNA structure. Bioinformatics 36, 3072–3076 (2020).

  56. Weinberg, Z. & Breaker, R. R. R2R—software to speed the depiction of aesthetic consensus RNA secondary structures. BMC Bioinforma. 12, 3 (2011).

  57. McCarthy, T. J. et al. Ligand requirements for glmS ribozyme self-cleavage. Chem. Biol. 12, 1221–1226 (2005).

  58. Behrens, A., Rodschinka, G. & Nedialkova, D. D. High-resolution quantitative profiling of tRNA abundance and modification status in eukaryotes by mim-tRNAseq. Mol. Cell 81, 1802–1815.e7 (2021).

  59. Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614–620 (2014).

  60. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  61. Bingham, E. et al. Pyro: deep universal probabilistic programming. J. Mach. Learn. Res. 20, 973–978 (2019).

  62. Sumi, S. et al. rfamgen. Zenodo https://doi.org/10.5281/zenodo.10187598 (2023).

Acknowledgements

This work was supported by JST CREST (grant nos. JPMJCR21F1 and JPMJCR23B3) and JSPS (grant nos. 21J15897 and 20H05626). Computations were partially performed on the NIG supercomputer at the ROIS National Institute of Genetics. We thank C. Li, J. Zhang (University of Michigan) and Y. Yokobayashi (Okinawa Institute of Science and Technology) for providing DMS, K. Hui for proofreading the manuscript, and R. K. Kawaguchi (Kyoto University), L. Maya (Imperial College London) and M. Hirosawa (Kyoto University) for reading the manuscript.

Author information

Contributions

S.S., M.H. and H.S. managed the project. S.S. conceived the idea, developed the RfamGen program and conducted all experiments. S.S., M.H. and H.S. wrote the manuscript.

Corresponding authors

Correspondence to Michiaki Hamada or Hirohide Saito.

Ethics declarations

Competing interests

S.S., M.H. and H.S. are listed as inventors on a patent application (US Provisional Patent Application no. 63/514389). H.S. owns shares in aceRNA Technologies Ltd and is an outside director of the company.

Peer review

Peer review information

Nature Methods thanks Michael Mohsen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Architecture of RfamGen.

A sequence is represented as a triplet (tr, ss, bp). RfamGen is a variational autoencoder (VAE) trained to encode and decode these triplets while embedding them into a 16-dimensional latent space (z). Further details are described in the Supplementary Note.
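
As a rough illustration of what embedding into a 16-dimensional latent space involves, the sketch below shows the reparameterized sampling step and the KL term of a generic Gaussian VAE. It assumes a diagonal Gaussian posterior and a standard normal prior, which the caption does not state; the function names and the zero-valued encoder outputs are hypothetical placeholders, and the actual RfamGen encoder and decoder are described in the Supplementary Note.

```python
import math
import random

LATENT_DIM = 16  # latent dimensionality stated in the caption


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (generic VAE step)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


# Hypothetical encoder outputs for one input; a trained encoder would
# compute these from the (tr, ss, bp) representation of a sequence.
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
logvar = [0.0] * LATENT_DIM

z = reparameterize(mu, logvar, rng)      # one point in the latent space
kl = kl_to_standard_normal(mu, logvar)   # zero when posterior equals prior
```

In a full model, z would be passed to a decoder and the KL term added to a reconstruction loss; both are omitted here.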

Extended Data Fig. 2 Examples of semantics captured by latent space of RfamGen.

(a) t-SNE visualizations of the latent space of RfamGen trained on RF00005 (tRNA), colored by phylogenetic information (top left) and anticodon information (top right), and on RF00004 (U2 spliceosomal RNA, bottom left) and RF00059 (TPP riboswitch, bottom right), both colored by phylogenetic information. (b) t-SNE visualization of the latent space of RF00008, shown together with sequences containing U1A protein-binding motifs (UUGCAC in a loop) (top). Representative examples of the discovered sequences with high bit scores (bottom).

Extended Data Fig. 3 Evaluation of generated sequences by argmax and random sampling from a reconstructed CM.

(a) Computational procedure for evaluating argmax and random sampling from a covariance model (CM). For argmax sampling, 1,000 points were randomly sampled from the latent space and one argmax sequence was taken from each decoded CM. For random sampling, 1,000 points were randomly sampled from the latent space and 10 random sequences were drawn from each decoded CM. The sampled sequences were aligned to the CM registered in the Rfam database, and bit scores were then calculated. (b) Scatter plot of the average bit scores of sequences generated by argmax and random sampling across 628 RNA families. (c) PAGE separation of the cleavage products of glmS ribozymes generated by the two sampling methods from a CM (argmax and random sampling). A representative gel image from N = 3 replicates is shown. Arrowheads indicate cleaved products.
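
The difference between the two sampling modes in (a) can be sketched with a toy example. The code below is a deliberate simplification: it treats a decoded model as a list of independent per-position emission probabilities over A, C, G and U, whereas a real CM also has base-pair emissions and state transitions. The profile values and function names are hypothetical and are not taken from RfamGen.

```python
import random

ALPHABET = "ACGU"


def argmax_sequence(profile):
    """Take the most probable nucleotide at every position (ties: first symbol)."""
    return "".join(
        ALPHABET[max(range(len(ALPHABET)), key=lambda i: probs[i])]
        for probs in profile
    )


def random_sequences(profile, n, rng):
    """Draw n sequences by sampling each position from its emission probabilities."""
    return [
        "".join(rng.choices(ALPHABET, weights=probs, k=1)[0] for probs in profile)
        for _ in range(n)
    ]


# Hypothetical decoded profile for a 4-nt toy model (each row sums to 1).
toy_profile = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.85, 0.05],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.00, 0.00, 1.00],
]

rng = random.Random(0)
best = argmax_sequence(toy_profile)            # one deterministic sequence
samples = random_sequences(toy_profile, 10, rng)  # ten stochastic sequences
```

Argmax sampling yields exactly one sequence per decoded model, while random sampling yields a set whose diversity depends on how flat the emission probabilities are.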

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–11.

Reporting Summary

Supplementary Table 1

Parameters for each RfamGen experiment.

Supplementary Table 2

Oligonucleotides used in this study.

Source data

Source Data Fig. 2

The numerical and statistical source data of Fig. 2.

Source Data Fig. 3

The numerical and statistical source data of Fig. 3.

Source Data Fig. 4

The numerical and statistical source data of Fig. 4.

Source Data Fig. 5

The numerical and statistical source data of Fig. 5.

Source Data Extended Data Fig. 2

The numerical and statistical source data of Extended Data Fig. 2.

Source Data Extended Data Fig. 3

The numerical and statistical source data of Extended Data Fig. 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sumi, S., Hamada, M. & Saito, H. Deep generative design of RNA family sequences. Nat Methods 21, 435–443 (2024). https://doi.org/10.1038/s41592-023-02148-8

  • DOI: https://doi.org/10.1038/s41592-023-02148-8
