Large language models improve annotation of prokaryotic viral proteins

Flamholz, Zachary N.; Biller, Steven J.; Kelly, Libusha

doi:10.1038/s41564-023-01584-8

Analysis
Published: 29 January 2024

Nature Microbiology volume 9, pages 537–549 (2024)

Abstract

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.

**Fig. 1: VPF function prediction using PLMs uncovers novel biology.**

**Fig. 2: Functional category classification of PHROG VPFs with PLM-based protein embeddings.**

**Fig. 3: Investigation of PLM-based embedding of PHROG VPFs.**

**Fig. 4: Functional category classifier validation and discovery with the EFAM database of VPFs curated from the ocean virome.**

**Fig. 5: Identification of a tyrosine integrase within marine picocyanobacteria.**

**Fig. 6: Discovery of a Major Capsid Protein.**

Data availability

The PHROGs VPF database v3 (https://phrogs.lmge.uca.fr/) was downloaded on 26 January 2022. Re-annotation data were downloaded after the v4 release. The EFAM VPF database was downloaded from the project repository on CyVerse Data Commons on 7 September 2022. PHANNs protein sequences and annotations (https://phanns.com/downloads) were downloaded on 17 January 2023. The geNomad HMM database and annotations (v1.3) were downloaded from Zenodo (https://zenodo.org/record/7793532) on 10 August 2023. Protein sequences used for integrase and MCP investigation were collected from the following databases: MGnify, IMG-VR, NCBI nr and UniRef50.

Protein sequence embeddings generated for PHROGs and EFAM sequences are available at https://doi.org/10.5281/zenodo.8339381. Additional data generated for this study are available in a public Google Cloud Platform bucket (http://storage.googleapis.com/viral_protein_family_plm_embeddings). See README on the project repository https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model for details on downloading the data. Source data are provided with this paper.

Code availability

Code for generating embeddings and using the classifier is available on GitHub at https://github.com/kellylab/viral-protein-function-plm and https://doi.org/10.5281/zenodo.10182746. A no-code Google Colaboratory notebook for running the classifier is linked from the GitHub repository README. Code to produce figures and tables is available on GitHub at https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model and https://doi.org/10.5281/zenodo.10182750.

Acknowledgements

We thank T. Hackl, K. Kauffman and C. Matrishin for helpful discussions. Z.N.F. was supported by the Einstein Medical Scientist Training Program (1T32GM149364). S.J.B. was supported by grants from the National Science Foundation (OCE-2049004 and OCE-2304066) and the Simons Foundation (Award ID 917971). L.K. is supported in part by NIH NHLBI grant R01HL069438. Computational resources were supported by an award from the Google Cloud Research Credits program (GCP19980904) to L.K. We thank the NVIDIA Academic Hardware Grant Program for the graphics processing units used in this work.

Author information

Authors and Affiliations

Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
Zachary N. Flamholz & Libusha Kelly
Department of Biological Sciences, Wellesley College, Wellesley, MA, USA
Steven J. Biller
Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, NY, USA
Libusha Kelly

Contributions

L.K. and Z.N.F. conceived and designed the experiments. Z.N.F. performed the experiments. Z.N.F., L.K. and S.J.B. analysed the data and produced the figures. Z.N.F. wrote the paper. L.K. and S.J.B. edited the paper.

Corresponding author

Correspondence to Libusha Kelly.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Jie Ren and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of four different PLM-based representations for viral VPF functional classification.

Embedded proteins were used to train and evaluate PHROGs functional annotation classification. Performance is measured as F1-score over five-fold training-testing splits of PHROGs VPFs (n=5). Each study is described by the model architecture, protein source, and whether the PLM is trained with a multi-task training objective (MT). Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median. BFD- Big Fantastic Database; LSTM- long short-term memory.

Extended Data Fig. 2 Evaluation of embedding similarities of constituent families between functional categories.

(a) Distribution of family average sequence-sequence similarity. (b) Distribution of family-family centroid similarity. (a-b: DNA- n=1,065; connector- n=133; head- n=946; integration- n=105; lysis- n=299; moron- n=458; other- n=560; tail- n=1,219; transcription- n=303) (c) Significance of pairwise category distribution comparison using a two-sided independent t-test with Bonferroni correction (left- lysis vs. integration- p=1.493e-02; head vs. other- p=1.014e-11; head vs. lysis- p=1.096e-02; head vs. DNA- p=2.098e-10; head vs. transcription- p=4.978e-07; head vs. connector- p=1.207e-05; head vs. integration- p=2.085e-09; moron vs. other- p=1.205e-04; moron vs. DNA- p=2.084e-03; moron vs. transcription- p=9.269e-03; moron vs. connector- p=1.484e-02; moron vs. integration- p=5.970e-05; tail vs. other- p=4.449e-16; tail vs. lysis- p=4.479e-04; tail vs. DNA- p=3.759e-15; tail vs. transcription- p=1.814e-09; tail vs. connector- p=3.121e-07; tail vs. integration- p=1.482e-11, right- transcription vs. DNA- p=1.010e-111; transcription vs. connector- p=4.416e-22; transcription vs. tail- p=7.299e-260; transcription vs. head- p=0.000e+00; transcription vs. other- p=0.000e+00; transcription vs. lysis- p=0.000e+00; transcription vs. moron- p=0.000e+00; transcription vs. integration- p=0.000e+00; DNA vs. tail- p=1.943e-208; DNA vs. head- p=0.000e+00; DNA vs. other- p=0.000e+00; DNA vs. lysis- p=0.000e+00; DNA vs. moron- p=0.000e+00; DNA vs. integration- p=0.000e+00; connector vs. tail- p=4.234e-06; connector vs. head- p=7.253e-30; connector vs. other- p=2.294e-228; connector vs. lysis- p=1.662e-196; connector vs. moron- p=0.000e+00; connector vs. integration- p=2.827e-271; tail vs. head- p=1.145e-243; tail vs. other- p=0.000e+00; tail vs. lysis- p=0.000e+00; tail vs. moron- p=0.000e+00; tail vs. integration- p=0.000e+00; head vs. other- p=0.000e+00; head vs. lysis- p=0.000e+00; head vs. moron- p=0.000e+00; head vs. integration- p=0.000e+00; other vs. lysis- p=9.303e-21; other vs. moron- p=1.099e-188; other vs. integration- p=6.600e-87; lysis vs. moron- p=4.171e-204; lysis vs. integration- p=5.199e-138; moron vs. integration- p=1.478e-25). Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median.

Extended Data Fig. 3 Inter-category similarity for PHROGs functional categories.

Pairwise family centroid similarities were calculated for every combination of families between the two categories. Score is the average over all comparisons.
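The score described in this caption can be illustrated with a minimal sketch. It assumes that each family is a list of embedding vectors, that a family centroid is the element-wise mean of those vectors, and that similarity is cosine similarity; the caption does not name the metric, so cosine is an assumption here, and the function names are hypothetical rather than taken from the authors' repository.

```python
import math
from itertools import product

def centroid(vectors):
    """Element-wise mean of a family's embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def inter_category_similarity(families_a, families_b):
    """Average centroid similarity over every pair of families
    drawn from two functional categories."""
    cents_a = [centroid(f) for f in families_a]
    cents_b = [centroid(f) for f in families_b]
    scores = [cosine(ca, cb) for ca, cb in product(cents_a, cents_b)]
    return sum(scores) / len(scores)

# Toy example: one family in category A, two families in category B.
cat_a = [[[1.0, 0.0], [1.0, 0.0]]]
cat_b = [[[0.0, 1.0]], [[1.0, 0.0]]]
score = inter_category_similarity(cat_a, cat_b)
```

In the toy data, the single centroid of category A is orthogonal to one category B centroid and identical to the other, so the averaged score lies midway between 0 and 1.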

Extended Data Fig. 4 EFAM function classifier calibration analysis.

(a) EFAM VPFs that have hits to annotated PHROG HMMs (test set) are used to evaluate the model calibration for each category. For each class, probabilities across all VPFs in the test set are binned into 10 partitions and the fraction of true positives for each bin is calculated. A perfectly calibrated model (dotted line) has a true positive proportion equal to the mean predicted probability for each bin. Points below the perfect-calibration line indicate overconfidence and points above it indicate underconfidence. (b) Histogram of the number of predictions across the test set for each probability bin.
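The binning procedure in panel (a) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: it assumes 10 equal-width bins on [0, 1] (the caption says only "10 partitions") and binary labels for a single class; `calibration_curve` here is a hypothetical helper written for this example.

```python
def calibration_curve(probs, labels, n_bins=10):
    """Return (mean predicted probability, fraction of true positives, count)
    for each non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than report an undefined fraction
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        curve.append((mean_p, frac_pos, len(members)))
    return curve

# Toy data: two confident negatives and two confident positive calls,
# one of which is wrong.
probs = [0.05, 0.05, 0.95, 0.95]
labels = [0, 0, 1, 0]
curve = calibration_curve(probs, labels)
```

A bin whose true-positive fraction is lower than its mean predicted probability, like the high-probability bin in the toy data, corresponds to an overconfident model for that probability range. The per-bin counts returned as the third element are the quantity plotted in a histogram such as panel (b).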

Extended Data Fig. 5 Decision threshold evaluation for function classifier predictions on EFAM VPFs.

EFAM VPFs with PHROG hits were used as ground truth for prediction with the function classifier. Classifier thresholds are determined by considering false discovery rate (FDR) and F1-score (F1). The final decision threshold for each class is the decision boundary with maximal F1 with FDR ≤ 0.1.
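The selection rule in this caption (maximal F1 subject to an FDR ceiling of 0.1) can be sketched for a single class as below. The candidate thresholds, the `>=` decision rule and the function name `pick_threshold` are assumptions made for illustration; the caption does not specify how candidate boundaries were enumerated.

```python
def pick_threshold(probs, labels, max_fdr=0.1, candidates=None):
    """Return (best_f1, threshold) among candidate thresholds whose
    false discovery rate does not exceed max_fdr, or None if none qualify."""
    if candidates is None:
        candidates = sorted(set(probs))  # assumption: try each observed score
    best = None
    for t in candidates:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp + fp == 0:
            continue  # no positive calls, so FDR is undefined
        fdr = fp / (tp + fp)
        f1 = 2 * tp / (2 * tp + fp + fn)
        if fdr <= max_fdr and (best is None or f1 > best[0]):
            best = (f1, t)
    return best

# Toy data for one functional category.
probs = [0.9, 0.8, 0.7, 0.6, 0.3]
labels = [1, 1, 0, 1, 0]
best = pick_threshold(probs, labels)
```

In the toy data, lower thresholds recover more true positives but admit false positives that push the FDR above 0.1, so the constraint excludes them even where their F1 would otherwise be competitive.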

Extended Data Fig. 6 Comparison of PLM embedding similarity and sequence identity for PHROG VPFs.

(a) The intra-family pairwise sequence embedding similarity, measured using cosine similarity, and sequence identity, measured using global alignment identity, were calculated for all annotated PHROG VPFs. Families are colored by functional category annotation. Solid line represents a linear regression for each function with shading representing a 95% bootstrapped confidence interval for the regression estimation. (b) Linear regression results for each category. R-value is measured using Pearson correlation coefficient. P-value is calculated using the Wald Test.
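A minimal sketch of the two quantities in this caption follows: the mean pairwise cosine similarity of embeddings within one family, and a Pearson correlation coefficient between per-family embedding similarity and per-family sequence identity. The sequence identities are taken as precomputed inputs, since the global alignment step is outside the scope of the sketch; the function names and the toy values are hypothetical and are not data from the study.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs within a family."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy family of three embeddings, plus toy per-family summary values.
family = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
family_similarity = mean_pairwise_cosine(family)
r = pearson_r([0.2, 0.5, 0.9], [0.1, 0.4, 0.8])
```

For real embeddings and the accompanying P-value, a library routine such as `scipy.stats.linregress` would be the more practical choice; the hand-written version is used here only to keep the example dependency-free.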

Extended Data Fig. 7 Comparative protein structure modelling of an integrase family sequence supports annotation as a tyrosine recombinase.

(a) Target/template alignment between the Prochlorococcus PAC1 sequence (indicated as Model_01), and the template sequence 1Z1B, the phage lambda integrase. Red arrows point to active site residues Arg 212, Lys 235, His 308, Arg 311, His 333, and Tyr 342. Boxed amino acid regions represent secondary structure. (b) Homology model of Prochlorococcus PAC1 sequence based on template 1Z1B. Colors indicate individual monomers of the homo-tetramer template protein structure in both a and b.

Supplementary information

Supplementary Information

Supplementary Tables 1, 3 and 4.

Reporting Summary

Supplementary Tables

Supplementary tables in xlsx format.

Source data

Source Data Fig. 2

Raw data for visualization and statistical source data.

Source Data Fig. 3

Raw data for visualization and statistical source data.

Source Data Fig. 4

Raw data for visualization and statistical source data.

Source Data Fig. 5

Sequence data and sequence metadata.

Source Data Fig. 6

Sequence data and sequence metadata, and raw data for visualization.

Source Data Extended Data Fig. 1

Raw data for visualization.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 4

Raw data for visualization.

Source Data Extended Data Fig. 5

Raw data for visualization.

Source Data Extended Data Fig. 6

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Flamholz, Z.N., Biller, S.J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 9, 537–549 (2024). https://doi.org/10.1038/s41564-023-01584-8

Received: 23 April 2023
Accepted: 08 December 2023
Published: 29 January 2024
Issue Date: February 2024
DOI: https://doi.org/10.1038/s41564-023-01584-8
