  • Analysis

Large language models improve annotation of prokaryotic viral proteins

Abstract

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.

Fig. 1: VPF function prediction using PLMs uncovers novel biology.
Fig. 2: Functional category classification of PHROG VPFs with PLM-based protein embeddings.
Fig. 3: Investigation of PLM-based embedding of PHROG VPFs.
Fig. 4: Functional category classifier validation and discovery with the EFAM database of VPFs curated from the ocean virome.
Fig. 5: Identification of a tyrosine integrase within marine picocyanobacteria.
Fig. 6: Discovery of a Major Capsid Protein.

Data availability

The PHROGs VPF database v3 (https://phrogs.lmge.uca.fr/) was downloaded on 26 January 2022. Re-annotation data were downloaded after v4 release. The EFAM VPF database was downloaded from the project repository on CyVerse Data Commons on 7 September 2022. PHANNs protein sequences and annotations (https://phanns.com/downloads) were downloaded on 17 January 2023. geNomad HMM database and annotations (v1.3) were downloaded from Zenodo (https://zenodo.org/record/7793532) on 10 August 2023. Protein sequences used for integrase and MCP investigation were collected from the following databases: MGnify, IMG-VR, NCBI nr and UniRef50.

Protein sequence embeddings generated for PHROGs and EFAM sequences are available at https://doi.org/10.5281/zenodo.8339381. Additional data generated for this study are available in a public Google Cloud Platform bucket (http://storage.googleapis.com/viral_protein_family_plm_embeddings). See README on the project repository https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model for details on downloading the data. Source data are provided with this paper.

Code availability

Code for generating embeddings and using the classifier is available on GitHub at https://github.com/kellylab/viral-protein-function-plm and https://doi.org/10.5281/zenodo.10182746. We made a no-code Google Colaboratory notebook available to use the classifier that is linked from the GitHub repository README. Code to produce figures and tables is available on GitHub at https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model and https://doi.org/10.5281/zenodo.10182750.

Acknowledgements

We thank T. Hackl, K. Kauffman and C. Matrishin for helpful discussions. Z.N.F. was supported by the Einstein Medical Scientist Training Program (1T32GM149364). S.J.B. was supported by grants from the National Science Foundation (OCE-2049004 and OCE-2304066) and the Simons Foundation (Award ID 917971). L.K. is supported in part by NIH NHLBI grant R01HL069438. Computational resources were supported by an award from the Google Cloud Research Credits program (GCP19980904) to L.K. We thank the NVIDIA Academic Hardware Grant Program for the graphics processing units used in this work.

Author information

Contributions

L.K. and Z.N.F. conceived and designed the experiments. Z.N.F. performed the experiments. Z.N.F., L.K. and S.J.B. analysed the data and produced the figures. Z.N.F. wrote the paper. L.K. and S.J.B. edited the paper.

Corresponding author

Correspondence to Libusha Kelly.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Jie Ren and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of four different PLM-based representations for viral VPF functional classification.

Embedded proteins were used to train and evaluate PHROGs functional annotation classification. Performance is measured as F1-score over five-fold training-testing splits of PHROGs VPFs (n=5). Each study is described by the model architecture, protein source, and whether the PLM is trained with a multi-task training objective (MT). Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median. BFD, Big Fantastic Database; LSTM, long short-term memory.
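As a rough illustration of the evaluation described in this caption, the sketch below scores a classifier with F1 over five train/test splits. It is not the authors' pipeline: the macro averaging, the round-robin fold assignment and the `majority_class_predict` stand-in classifier are all assumptions made so that the example runs with the Python standard library alone.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging scheme is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def majority_class_predict(train_labels, n_test):
    """Hypothetical stand-in classifier: always predicts the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test


def cross_validated_f1(labels, k=5):
    """Return one F1 score per fold (five values for k=5, matching n=5)."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        y_train = [labels[i] for i in train]
        y_test = [labels[i] for i in test]
        y_pred = majority_class_predict(y_train, len(y_test))
        scores.append(macro_f1(y_test, y_pred))
    return scores
```

Calling `cross_validated_f1` on a list of category labels returns five scores, which is the kind of per-fold distribution a box plot like this one summarizes. In practice the stand-in classifier would be replaced by a model trained on the embeddings.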

Extended Data Fig. 2 Evaluation of embedding similarities of constituent families between functional categories.

(a) Distribution of family average sequence-sequence similarity. (b) Distribution of family-family centroid similarity. Sample sizes for a and b: DNA, n=1,065; connector, n=133; head, n=946; integration, n=105; lysis, n=299; moron, n=458; other, n=560; tail, n=1,219; transcription, n=303. (c) Significance of pairwise category distribution comparisons using a two-sided independent t-test with Bonferroni correction.

Left: lysis vs. integration, p=1.493e-02; head vs. other, p=1.014e-11; head vs. lysis, p=1.096e-02; head vs. DNA, p=2.098e-10; head vs. transcription, p=4.978e-07; head vs. connector, p=1.207e-05; head vs. integration, p=2.085e-09; moron vs. other, p=1.205e-04; moron vs. DNA, p=2.084e-03; moron vs. transcription, p=9.269e-03; moron vs. connector, p=1.484e-02; moron vs. integration, p=5.970e-05; tail vs. other, p=4.449e-16; tail vs. lysis, p=4.479e-04; tail vs. DNA, p=3.759e-15; tail vs. transcription, p=1.814e-09; tail vs. connector, p=3.121e-07; tail vs. integration, p=1.482e-11.

Right: transcription vs. DNA, p=1.010e-111; transcription vs. connector, p=4.416e-22; transcription vs. tail, p=7.299e-260; transcription vs. head, p=0.000e+00; transcription vs. other, p=0.000e+00; transcription vs. lysis, p=0.000e+00; transcription vs. moron, p=0.000e+00; transcription vs. integration, p=0.000e+00; DNA vs. tail, p=1.943e-208; DNA vs. head, p=0.000e+00; DNA vs. other, p=0.000e+00; DNA vs. lysis, p=0.000e+00; DNA vs. moron, p=0.000e+00; DNA vs. integration, p=0.000e+00; connector vs. tail, p=4.234e-06; connector vs. head, p=7.253e-30; connector vs. other, p=2.294e-228; connector vs. lysis, p=1.662e-196; connector vs. moron, p=0.000e+00; connector vs. integration, p=2.827e-271; tail vs. head, p=1.145e-243; tail vs. other, p=0.000e+00; tail vs. lysis, p=0.000e+00; tail vs. moron, p=0.000e+00; tail vs. integration, p=0.000e+00; head vs. other, p=0.000e+00; head vs. lysis, p=0.000e+00; head vs. moron, p=0.000e+00; head vs. integration, p=0.000e+00; other vs. lysis, p=9.303e-21; other vs. moron, p=1.099e-188; other vs. integration, p=6.600e-87; lysis vs. moron, p=4.171e-204; lysis vs. integration, p=5.199e-138; moron vs. integration, p=1.478e-25.

Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median.
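For readers unfamiliar with the statistics named in panel c, the following minimal sketch shows a pooled-variance t statistic and a Bonferroni adjustment. It assumes equal variances and that the correction multiplies each p-value by the number of comparisons in its panel; the caption states neither. Converting the statistic to a p-value needs the t distribution (available, for instance, through SciPy's `scipy.stats.ttest_ind`), which is left out here to stay within the standard library. The entries reported as p=0.000e+00 are most likely values too small to represent in double precision, which the last line illustrates.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's t statistic for two independent samples (equal variances assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(
        pooled_var * (1 / n_a + 1 / n_b)
    )


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Products far below about 1e-308 underflow to exactly 0.0 in double precision.
underflow_example = 1e-200 * 1e-200
```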

Extended Data Fig. 3 Inter-category similarity for PHROGs functional categories.

Pairwise family centroid similarities were calculated for every combination of families between the two categories. Score is the average over all comparisons.
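The score described above can be sketched as follows. The example assumes that a family centroid is the element-wise mean of its member embeddings and that similarity is cosine similarity; the function names and the toy two-dimensional vectors are illustrative only and do not come from the study's code.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_category_similarity(families_a, families_b):
    """Average centroid similarity over every family pair across two categories.

    Each argument is a list of families; each family is a list of embeddings.
    """
    centroids_a = [centroid(family) for family in families_a]
    centroids_b = [centroid(family) for family in families_b]
    scores = [
        cosine_similarity(ca, cb) for ca in centroids_a for cb in centroids_b
    ]
    return sum(scores) / len(scores)


# Toy data: one family in category A, two families in category B.
category_a = [[[1.0, 0.0], [1.0, 0.0]]]
category_b = [[[0.0, 1.0]], [[1.0, 0.0]]]
score = inter_category_similarity(category_a, category_b)
```

With the toy data, the single category A centroid is orthogonal to one category B centroid and identical to the other, so the averaged score is 0.5.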

Extended Data Fig. 4 EFAM function classifier calibration analysis.

(a) EFAM VPFs that have hits to annotated PHROG HMMs (test set) are used to evaluate the model calibration for each category. For each class, predicted probabilities across all VPFs in the test set are binned into 10 partitions and the fraction of true positives in each bin is calculated. A perfectly calibrated model (dotted line) has a true positive proportion equal to the mean predicted probability for each bin. Points below the dotted line indicate overconfidence and points above it indicate underconfidence. (b) Histogram of the number of predictions across the test set for each probability bin.
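The binning procedure in this caption can be sketched in a few lines. The example assumes ten equal-width bins over [0, 1] and binary one-vs-rest labels for a single class; the caption does not say whether the bins are equal-width or quantile-based, so treat that choice as an assumption.

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, fraction of positives, count) per bin.

    y_true holds 0/1 labels for one class; y_prob holds predicted probabilities.
    Empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))

    rows = []
    for members in bins:
        if not members:
            continue
        count = len(members)
        mean_prob = sum(prob for _, prob in members) / count
        frac_positive = sum(label for label, _ in members) / count
        rows.append((mean_prob, frac_positive, count))
    return rows


# Toy data: two low-probability negatives and two high-probability positives.
rows = calibration_curve([0, 0, 1, 1], [0.25, 0.25, 0.75, 0.75])
```

A bin whose fraction of positives is lower than its mean predicted probability falls below the diagonal, which corresponds to an overconfident prediction. The per-bin counts are what a histogram like the one in panel b would display.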

Extended Data Fig. 5 Decision threshold evaluation for function classifier predictions on EFAM VPFs.

EFAM VPFs with PHROG hits were used as ground truth for prediction with the function classifier. Classifier thresholds are determined by considering false discovery rate (FDR) and F1-score (F1). The final decision threshold for each class is the decision boundary with maximal F1 subject to FDR ≤ 0.1.
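A minimal sketch of this selection rule for a single class is given below. It assumes one-vs-rest 0/1 labels, uses the observed probabilities as candidate thresholds, and breaks ties in favour of the lowest threshold; none of these details is specified in the caption.

```python
def select_threshold(y_true, y_prob, max_fdr=0.1):
    """Return the threshold with maximal F1 among those with FDR <= max_fdr.

    Returns None if no candidate threshold satisfies the FDR constraint.
    """
    best_f1, best_threshold = None, None
    for threshold in sorted(set(y_prob)):
        pairs = list(zip(y_true, y_prob))
        tp = sum(1 for t, p in pairs if p >= threshold and t == 1)
        fp = sum(1 for t, p in pairs if p >= threshold and t == 0)
        fn = sum(1 for t, p in pairs if p < threshold and t == 1)
        if tp + fp == 0:
            continue  # nothing predicted positive, FDR undefined
        fdr = fp / (tp + fp)
        f1 = 2 * tp / (2 * tp + fp + fn)
        if fdr <= max_fdr and (best_f1 is None or f1 > best_f1):
            best_f1, best_threshold = f1, threshold
    return best_threshold


# Toy data: only thresholds of 0.8 and 0.9 remove every false positive.
chosen = select_threshold([0, 0, 1, 1, 1], [0.2, 0.6, 0.5, 0.8, 0.9])
```

In the toy data, thresholds at or below 0.6 admit at least one false positive and exceed the FDR limit, so the choice is between 0.8 (F1 = 0.8) and 0.9 (F1 = 0.5), and 0.8 is selected.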

Extended Data Fig. 6 Comparison of PLM embedding similarity and sequence identity for PHROG VPFs.

(a) The intra-family pairwise sequence embedding similarity, measured using cosine similarity, and sequence identity, measured using global alignment identity, were calculated for all annotated PHROG VPFs. Families are colored by functional category annotation. Solid lines represent a linear regression for each functional category, with shading representing a 95% bootstrapped confidence interval for the regression estimate. (b) Linear regression results for each category. R values are Pearson correlation coefficients. P values were calculated using the Wald test.
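The per-category fit in panel b can be illustrated with an ordinary least-squares sketch that returns the slope, intercept and Pearson correlation coefficient. The data points are invented for illustration. The Wald-test p-value is omitted because it needs the t distribution; SciPy's `scipy.stats.linregress` reports one, although whether the authors used that function is an assumption.

```python
import math


def linear_fit(x, y):
    """Least-squares line through (x, y) plus the Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_value = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r_value


# Hypothetical values standing in for per-family sequence identity (x)
# and embedding cosine similarity (y).
slope, intercept, r_value = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Because the toy points lie exactly on the line y = 2x + 1, the fit recovers a slope of 2, an intercept of 1 and a correlation of 1.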

Extended Data Fig. 7 Comparative protein structure modelling of an integrase family sequence supports annotation as a tyrosine recombinase.

(a) Target/template alignment between the Prochlorococcus PAC1 sequence (indicated as Model_01), and the template sequence 1Z1B, the phage lambda integrase. Red arrows point to active site residues Arg 212, Lys 235, His 308, Arg 311, His 333, and Tyr 342. Boxed amino acid regions represent secondary structure. (b) Homology model of Prochlorococcus PAC1 sequence based on template 1Z1B. Colors indicate individual monomers of the homo-tetramer template protein structure in both a and b.

Supplementary information

Supplementary Information

Supplementary Tables 1, 3 and 4.

Reporting Summary

Supplementary Tables

Supplementary tables in xlsx format.

Source data

Source Data Fig. 2

Raw data for visualization and statistical source data.

Source Data Fig. 3

Raw data for visualization and statistical source data.

Source Data Fig. 4

Raw data for visualization and statistical source data.

Source Data Fig. 5

Sequence data and sequence metadata.

Source Data Fig. 6

Sequence data and sequence metadata, and raw data for visualization.

Source Data Extended Data Fig. 1

Raw data for visualization.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 4

Raw data for visualization.

Source Data Extended Data Fig. 5

Raw data for visualization.

Source Data Extended Data Fig. 6

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Flamholz, Z.N., Biller, S.J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 9, 537–549 (2024). https://doi.org/10.1038/s41564-023-01584-8

  • DOI: https://doi.org/10.1038/s41564-023-01584-8
