
  • Perspective

Language models for biological research: a primer

Abstract

Language models are playing an increasingly important role in many areas of artificial intelligence (AI) and computational biology. In this primer, we discuss the ways in which language models, both those based on natural language and those based on biological sequences, can be applied to biological research. This primer is intended primarily for biologists interested in applying these cutting-edge AI technologies to their own research. We provide guidance on best practices and key resources for adapting language models for biology.


Fig. 1: Approaches to using language models for biological research.
Fig. 2: Example uses of natural language models for biological research.
Fig. 3: Choosing the right approach for adapting a language model.


Code availability

Code for our interactive example of protein language models is available and can be run on Google Colab at https://colab.research.google.com/drive/1zIIRGeqpXvKyz1oynHsyLYRxHHiIbrV5?usp=sharing. The same code, as well as the associated data, is available on GitHub at https://github.com/swansonk14/language_models_biology.
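The Colab notebook above demonstrates protein language models in practice. As background for readers new to these models, the sketch below illustrates the first step any such model performs: converting an amino-acid sequence into a list of token ids. The vocabulary, token ordering, and special tokens here are illustrative assumptions for exposition, not the exact scheme used by ESM-2 or any other specific model.

```python
# Illustrative sketch of protein-sequence tokenization.
# The vocabulary and special tokens are assumptions for this example.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Special tokens: <cls> marks the sequence start, <eos> the end;
# <mask> is used during masked-language-model training, and <unk>
# stands in for any character outside the vocabulary.
SPECIAL = ["<cls>", "<eos>", "<mask>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(AMINO_ACIDS))}

def tokenize(sequence: str) -> list[int]:
    """Map a protein sequence to token ids, bracketed by <cls>/<eos>."""
    ids = [VOCAB["<cls>"]]
    for residue in sequence.upper():
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids

print(tokenize("MKT"))  # → [0, 14, 12, 20, 1]
```

A real protein language model would feed these ids into a transformer to produce per-residue embeddings, which downstream tasks (such as the variant-effect and structure predictions discussed in the text) then build on.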


Acknowledgements

We would like to thank M. Karelina, F. Ekman, S. Simon and J. Chang for feedback and Z. Huang for help with the figures. K.S. acknowledges support from the Knight-Hennessy Scholarship. E.S. acknowledges support from the National Institutes of Health (T15LM007033). J.Z. is supported by funding from the CZ Biohub.

Author information

Authors and Affiliations

Authors

Contributions

E.S. and K.S. wrote and edited the manuscript. J.Z. supervised and edited the manuscript.

Corresponding author

Correspondence to James Zou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Simon, E., Swanson, K. & Zou, J. Language models for biological research: a primer. Nat Methods 21, 1422–1429 (2024). https://doi.org/10.1038/s41592-024-02354-y

