Abstract
Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.
Data availability
The PHROGs VPF database v3 (https://phrogs.lmge.uca.fr/) was downloaded on 26 January 2022. Re-annotation data were downloaded after the v4 release. The EFAM VPF database was downloaded from the project repository on CyVerse Data Commons on 7 September 2022. PHANNs protein sequences and annotations (https://phanns.com/downloads) were downloaded on 17 January 2023. The geNomad HMM database and annotations (v1.3) were downloaded from Zenodo (https://zenodo.org/record/7793532) on 10 August 2023. Protein sequences used for the integrase and major capsid protein (MCP) investigations were collected from the following databases: MGnify, IMG-VR, NCBI nr and UniRef50.
Protein sequence embeddings generated for PHROGs and EFAM sequences are available at https://doi.org/10.5281/zenodo.8339381. Additional data generated for this study are available in a public Google Cloud Platform bucket (http://storage.googleapis.com/viral_protein_family_plm_embeddings). See README on the project repository https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model for details on downloading the data. Source data are provided with this paper.
Code availability
Code for generating embeddings and using the classifier is available on GitHub at https://github.com/kellylab/viral-protein-function-plm and https://doi.org/10.5281/zenodo.10182746. A no-code Google Colaboratory notebook for running the classifier is linked from the GitHub repository README. Code to produce figures and tables is available on GitHub at https://github.com/kellylab/viral-protein-function-annotation-with-protein-language-model and https://doi.org/10.5281/zenodo.10182750.
Acknowledgements
We thank T. Hackl, K. Kauffman and C. Matrishin for helpful discussions. Z.N.F. was supported by the Einstein Medical Scientist Training Program (1T32GM149364). S.J.B. was supported by grants from the National Science Foundation (OCE-2049004 and OCE-2304066) and the Simons Foundation (Award ID 917971). L.K. is supported in part by NIH NHLBI grant R01HL069438. Computational resources were supported by an award from the Google Cloud Research Credits program (GCP19980904) to L.K. We thank the NVIDIA Academic Hardware Grant Program for the graphics processing units used in this work.
Author information
Contributions
L.K. and Z.N.F. conceived and designed the experiments. Z.N.F. performed the experiments. Z.N.F., L.K. and S.J.B. analysed the data and produced the figures. Z.N.F. wrote the paper. L.K. and S.J.B. edited the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Microbiology thanks Jie Ren and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Performance of four different PLM-based representations for viral VPF functional classification.
Embedded proteins were used to train and evaluate PHROGs functional annotation classification. Performance is measured as F1-score over five-fold training-testing splits of PHROGs VPFs (n=5). Each study is described by the model architecture, protein source, and whether the PLM is trained with a multi-task training objective (MT). Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median. BFD, Big Fantastic Database; LSTM, long short-term memory.
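The five-fold evaluation described in this caption can be sketched as follows. This is a minimal illustration that assumes folds are drawn at the family level, so that each VPF appears in exactly one test fold; the function name `family_kfold`, the seed and the toy family identifiers are hypothetical and are not taken from the study's code.

```python
import random

def family_kfold(family_ids, k=5, seed=0):
    """Yield (train, test) lists of family IDs for k-fold evaluation."""
    ids = sorted(set(family_ids))
    random.Random(seed).shuffle(ids)
    # Striding over the shuffled list gives k disjoint, near-equal folds.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train, test

# Toy example with hypothetical family identifiers.
families = [f"vpf_{n}" for n in range(23)]
splits = list(family_kfold(families))
```

A classifier would then be trained on each `train` list and scored with F1 on the matching `test` list, giving the five values summarized in each box.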
Extended Data Fig. 2 Evaluation of embedding similarities of constituent families between functional categories.
(a) Distribution of family average sequence-sequence similarity. (b) Distribution of family-family centroid similarity. (a-b: DNA- n=1,065; connector- n=133; head- n=946; integration- n=105; lysis- n=299; moron- n=458; other- n=560; tail- n=1,219; transcription- n=303) (c) Significance of pairwise category distribution comparison using a two-sided independent t-test with Bonferroni correction (left- lysis vs. integration- p=1.493e-02; head vs. other- p=1.014e-11; head vs. lysis- p=1.096e-02; head vs. DNA- p=2.098e-10; head vs. transcription- p=4.978e-07; head vs. connector- p=1.207e-05; head vs. integration- p=2.085e-09; moron vs. other- p=1.205e-04; moron vs. DNA- p=2.084e-03; moron vs. transcription- p=9.269e-03; moron vs. connector- p=1.484e-02; moron vs. integration- p=5.970e-05; tail vs. other- p=4.449e-16; tail vs. lysis- p=4.479e-04; tail vs. DNA- p=3.759e-15; tail vs. transcription- p=1.814e-09; tail vs. connector- p=3.121e-07; tail vs. integration- p=1.482e-11; right- transcription vs. DNA- p=1.010e-111; transcription vs. connector- p=4.416e-22; transcription vs. tail- p=7.299e-260; transcription vs. head- p=0.000e+00; transcription vs. other- p=0.000e+00; transcription vs. lysis- p=0.000e+00; transcription vs. moron- p=0.000e+00; transcription vs. integration- p=0.000e+00; DNA vs. tail- p=1.943e-208; DNA vs. head- p=0.000e+00; DNA vs. other- p=0.000e+00; DNA vs. lysis- p=0.000e+00; DNA vs. moron- p=0.000e+00; DNA vs. integration- p=0.000e+00; connector vs. tail- p=4.234e-06; connector vs. head- p=7.253e-30; connector vs. other- p=2.294e-228; connector vs. lysis- p=1.662e-196; connector vs. moron- p=0.000e+00; connector vs. integration- p=2.827e-271; tail vs. head- p=1.145e-243; tail vs. other- p=0.000e+00; tail vs. lysis- p=0.000e+00; tail vs. moron- p=0.000e+00; tail vs. integration- p=0.000e+00; head vs. other- p=0.000e+00; head vs. lysis- p=0.000e+00; head vs. moron- p=0.000e+00; head vs. integration- p=0.000e+00; other vs. lysis- p=9.303e-21; other vs. moron- p=1.099e-188; other vs. integration- p=6.600e-87; lysis vs. moron- p=4.171e-204; lysis vs. integration- p=5.199e-138; moron vs. integration- p=1.478e-25). Boxes represent interquartile range; whiskers represent the entire distribution with the exception of outliers (diamonds); horizontal line indicates median.
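For readers unfamiliar with the Bonferroni correction named in this caption, a minimal sketch is given here. Each raw p-value is multiplied by the number of comparisons and capped at 1. The function name and the raw p-values are hypothetical; the caption does not state which set of comparisons defines the correction factor, and values printed as 0.000e+00 presumably reflect floating-point underflow rather than exact zeros.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: p * m, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.001, 0.02, 0.5]
adjusted = bonferroni(raw)
```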
Extended Data Fig. 3 Inter-category similarity for PHROGs functional categories.
Pairwise family centroid similarities were calculated for every combination of families between the two categories. Score is the average over all comparisons.
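The score described in this caption can be sketched as follows, assuming that a family centroid is the mean of its sequence embeddings and that similarity is cosine similarity (the caption does not name the metric, so both are assumptions). All function names and the two-dimensional toy embeddings are hypothetical.

```python
import math
from itertools import product

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def inter_category_similarity(cat_a, cat_b):
    """Average centroid similarity over every family pair between two categories.

    Each category is a list of families; each family is a list of embeddings.
    """
    cents_a = [centroid(fam) for fam in cat_a]
    cents_b = [centroid(fam) for fam in cat_b]
    scores = [cosine(a, b) for a, b in product(cents_a, cents_b)]
    return sum(scores) / len(scores)

# Toy categories: one "head" family and two "tail" families.
head = [[[1.0, 0.0], [1.0, 0.0]]]
tail = [[[0.0, 1.0]], [[1.0, 0.0]]]
score = inter_category_similarity(head, tail)
```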
Extended Data Fig. 4 EFAM function classifier calibration analysis.
(a) EFAM VPFs that have hits to annotated PHROG HMMs (test set) are used to evaluate the model calibration for each category. For each class, probabilities across all VPFs in the test set are binned into 10 partitions and the fraction of true positives for each bin is calculated. A perfectly calibrated model (dotted line) has a true positive proportion equal to the mean predicted probability for each bin. Points below the dotted line indicate overconfidence and points above it indicate underconfidence. (b) Histogram of the number of predictions across the test set for each probability bin.
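The binning procedure in panel (a) can be sketched as follows for a single class. This is a minimal illustration that assumes 10 equal-width probability bins; the function name and the toy probabilities and labels are hypothetical.

```python
def calibration_curve(probs, labels, n_bins=10):
    """Return (mean_predicted, fraction_positive, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_p, frac_pos, len(members)))
    return curve

# Toy predictions for one class: label 1 marks a true member of the class.
probs = [0.05, 0.15, 0.85, 0.95, 0.95, 1.0]
labels = [0, 0, 1, 1, 0, 1]
curve = calibration_curve(probs, labels)
```

In this toy case the top bin has a mean predicted probability near 0.97 but only two of three positives, so it would fall below the diagonal, which corresponds to overconfidence. The third element of each tuple gives the counts plotted in the panel (b) histogram.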
Extended Data Fig. 5 Decision threshold evaluation for function classifier predictions on EFAM VPFs.
EFAM VPFs with PHROG hits were used as ground truth for prediction with the function classifier. Classifier thresholds are determined by considering false discovery rate (FDR) and F1-score (F1). The final decision threshold for each class is the decision boundary with maximal F1 with FDR ≤ 0.1.
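The threshold rule stated in this caption can be sketched as follows for a single class. This is a minimal illustration that assumes candidate thresholds are the observed predicted probabilities and that a prediction is positive when its probability is at or above the threshold; the function name and toy data are hypothetical.

```python
def pick_threshold(probs, labels, max_fdr=0.1):
    """Return (threshold, f1) with maximal F1 subject to FDR <= max_fdr.

    Returns None if no candidate threshold satisfies the FDR constraint.
    """
    best = None
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp == 0:
            continue  # F1 is zero; nothing useful at this threshold
        fdr = fp / (tp + fp)
        f1 = 2 * tp / (2 * tp + fp + fn)
        if fdr <= max_fdr and (best is None or f1 > best[1]):
            best = (t, f1)
    return best

# Toy predictions for one class.
probs = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 1, 0, 1, 1, 1]
best = pick_threshold(probs, labels)
```

Here the lower thresholds recall more true positives but admit false positives that push FDR above 0.1, so the selected threshold is the lowest one with no false positives.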
Extended Data Fig. 6 Comparison of PLM embedding similarity and sequence identity for PHROG VPFs.
(a) The intra-family pairwise sequence embedding similarity, measured using cosine similarity, and sequence identity, measured using global alignment identity, were calculated for all annotated PHROG VPFs. Families are colored by functional category annotation. Solid line represents a linear regression for each function with shading representing a 95% bootstrapped confidence interval for the regression estimation. (b) Linear regression results for each category. R-value is measured using Pearson correlation coefficient. P-value is calculated using the Wald Test.
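The per-category regression in panel (b) can be sketched as follows. This is a minimal illustration of an ordinary least-squares fit and the Pearson correlation coefficient between per-family sequence identity and embedding similarity; the function name and the toy values are hypothetical, and the Wald test P-value is omitted.

```python
import math

def fit_line(x, y):
    """Return (slope, intercept, pearson_r) for a least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Toy per-family averages for one functional category.
identity = [0.2, 0.4, 0.6, 0.8]        # global alignment identity
similarity = [0.70, 0.80, 0.90, 1.00]  # embedding cosine similarity
slope, intercept, r = fit_line(identity, similarity)
```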
Extended Data Fig. 7 Comparative protein structure modelling of an integrase family sequence supports annotation as a tyrosine recombinase.
(a) Target/template alignment between the Prochlorococcus PAC1 sequence (indicated as Model_01), and the template sequence 1Z1B, the phage lambda integrase. Red arrows point to active site residues Arg 212, Lys 235, His 308, Arg 311, His 333, and Tyr 342. Boxed amino acid regions represent secondary structure. (b) Homology model of Prochlorococcus PAC1 sequence based on template 1Z1B. Colors indicate individual monomers of the homo-tetramer template protein structure in both a and b.
Supplementary information
Supplementary Information
Supplementary Tables 1, 3 and 4.
Supplementary Tables
Supplementary tables in xlsx format.
Source data
Source Data Fig. 2
Raw data for visualization and statistical source data.
Source Data Fig. 3
Raw data for visualization and statistical source data.
Source Data Fig. 4
Raw data for visualization and statistical source data.
Source Data Fig. 5
Sequence data and sequence metadata.
Source Data Fig. 6
Sequence data and sequence metadata, and raw data for visualization.
Source Data Extended Data Fig. 1
Raw data for visualization.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 4
Raw data for visualization.
Source Data Extended Data Fig. 5
Raw data for visualization.
Source Data Extended Data Fig. 6
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Flamholz, Z.N., Biller, S.J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat Microbiol 9, 537–549 (2024). https://doi.org/10.1038/s41564-023-01584-8