Deep embeddings to comprehend and visualize microbiome protein space

Understanding the function of microbial proteins is essential to reveal the clinical potential of the microbiome. The application of high-throughput sequencing technologies allows for fast and increasingly cheaper acquisition of data from microbial communities. However, many of the inferred protein sequences are novel and not catalogued, hence the possibility of predicting their function through conventional homology-based approaches is limited, which indicates the need for further research on alignment-free methods. Here, we leverage a deep-learning-based representation of proteins to assess its utility in alignment-free analysis of microbial proteins. We trained a language model on the Unified Human Gastrointestinal Protein catalogue and validated the resulting protein representation on the bacterial part of the SwissProt database. Finally, we present a use case on proteins involved in SCFA metabolism. Results indicate that the deep learning model manages to accurately represent features related to protein structure and function, allowing for alignment-free protein analyses. Technologies that contextualize metagenomic data are a promising direction to deeply understand the microbiome.

In just over a decade, a substantial body of evidence linked gut microbiome dysbiosis with diseases ranging from obesity 1 , inflammatory bowel disease 2-4 , diabetes 5,6 , cancer 7,8 , depression 9 and other psychiatric disorders 10,11 . It shows the profound impact of the microbiome on human health and is a testament to rapid technological progress in sequencing technologies. Since the mid-2000s, the bulk of our insight into the role of the microbiome came from high-throughput and cost-effective 16S rRNA marker gene sequencing experiments that allow for taxonomic discrimination between microorganisms. Though informative, microbiome analysis based solely on taxonomy is prone to bias, due to incomplete reference databases and does not provide detailed information about microbiome function 12 . One of the areas of high interest and relevance is our ability to deduce the gene function from sequence, as it provides more insight into the microbiome's role in human health. Functional analysis of microbiome data can be performed based on high-throughput, large-scale shotgun metagenomics and other multi-omics experiments that are now becoming accessible for large-scale studies. Gene sequence fragments generated during a shotgun sequencing experiment can be functionally annotated, using homology-based tools such as BLAST 13 or HMMER 14 that search fragments of sequences against reference databases such as Pfam or Gene Ontology (GO) 15 . Similarly to 16S sequencing, functional assignment can be biased, due to incomplete reference databases; so far, only up to 50% of all microbial protein sequences may be annotated 16 . Despite remarkable progress in the last decades, developing precise methods for function prediction is still a major challenge in bioinformatics (see CAFA 17 initiative). The volume of metagenomic data is making the problem even more difficult to deal with. Thus, introducing an in silico method to help contextualize protein functions could prove highly beneficial for realizing the full potential behind metagenomics and multi-omics.
Deep learning is a proven technique for dealing with intricate problems and has been shown to work well for tasks like speech recognition, natural language processing, or image classification 18 . Recently, it has been successfully employed for analysing biological sequences, like genomes, proteomes 19 or metagenomes 20 . Perhaps the best-known example of the use of deep learning in biology was the protein structure prediction problem.

Results
Alignment-free deep protein embeddings represent structure-and function-related ontologies. Metagenomic data may generate an amount of information on the order of tens of millions of reads, which may be assembled into millions of protein sequences. For traditional sequence homology or profile-based approaches, this amount of data is manageable, but requires significant computing power. For deep learning, on the other hand, such a large amount of data provides an opportunity to be exploited for training and assures a robust representation of analysed sequences.
To build the deep representation, we trained the BiLSTM model on the Unified Human Gastrointestinal Protein catalog (UHGP), which contains 625 million microbial protein sequences clustered with MMseqs2 linclust into 20,239,340 representative sequences at 95% amino acid sequence identity 16,49 . From the trained model, we take a hidden-state vector that acts as a protein representation (see "Methods" and Fig. 1A).
Our ultimate goal is to produce reliable embeddings for metagenomic data, hence we first validated the model on proteins derived from ten metagenomic samples that were not included in the UHGP catalog (see "Methods" for details). For that task we chose samples from two metagenomic studies not included in the UHGP dataset (PRJEB37249 & PRJNA762199). From the latter one we selected only healthy volunteers samples to validate results with a healthy human gut microbiome, while the former study (PRJEB37249) focuses on a single enterotype (Bact2) from a Body Mass Index Spectrum cohort The model yielded substantially lower Exponential Cross-Entropy (ECE) loss than the analogous model trained on the Pfam database 50 on both validating datasets (see Table 1). Although ECE loss does not directly measure the quality of obtained embeddings, it was proven that the lower ECE the better the embeddings are in secondary structure and contact predictions 40 . Pfam is a cross-sectional curated dataset built on top of UniProtKB and is limited to identified protein families (~77% of UniProtKB sequences). The UHGP, on the other hand, is a more comprehensive database for gut metagenomic samples which in many cases (up to ~ 40%) are not represented in protein classification databases (eg. InterPro), and consequently in Pfam 16 . This emphasizes the importance of training an embedding model on a set of proteins consistent with the investigated dataset, i.e. human gut metagenomic proteins. Taken together, this leads to improved model performance.
Although the representation is aimed for metagenomic data, we need proteins with a specified function and origin to validate it. Therefore, for our analysis, we used bacterial proteins from the SwissProt database clustered into 201,622 representatives at 97% sequence identity. SwissProt is a reliable source, linking proteins to many ontologies that enable a multilevel description of sequences (e.g. Table 2). For simplicity, we call this collection of proteins Bacterial SwissProt (see "Methods"). We generated embeddings for all Bacterial SwissProt sequences using the embedding model trained on the UHGP dataset. The model trained on Pfam cannot be validated on Bacterial SwissProt as those datasets significantly overlap. Embeddings were then reduced from 2,048 dimensional vectors with Principal Component Analysis (PCA) to 50 dimensions (81.8% of variance explained). Such a representation is used in all our analyses (Reduced Embeddings in Fig. 1B). Rationale for selected parameters can be found in the "Methods" section.
To get a deeper understanding of the type of information encoded within deep representations, we created an evaluation task of recovering the label of a given protein from the labels of its nearest neighbors for a crosssection of various ontologies. If the label is correctly recovered, it indicates that the representation is consistent within this ontology (Fig. 2). Using different neighborhood sizes, we can estimate how local the representation is. This study focuses on investigating the representation and its features, not aiming at creating or evaluating a universal label predictor.
To evaluate the consistency of the representation, we selected a number of ontologies from Bacterial SwissProt, related to Function, Structure, or organism of Origin ( Table 2). The ontologies significantly vary in the number of classes and Bacterial SwissProt coverage. Hence, the recovery task for each ontology may have a varying degree of difficulty. For this reason, we compared deep embeddings to general scalable sequence-based representations that do not use deep learning. Those baseline embeddings are 3-mers with term frequency-inverse document frequency (TFIDF) transformation 51,52 , and amino-acid frequencies vectors (see "Methods"), similarly to seminal works in this field 37,40,45 . Additionally, we define the upper bound for the task by including MMseqs2 search results, a state-of-the-art tool specifically designed for the protein search task. It should be emphasized that  www.nature.com/scientificreports/ MMSeqs2 does not produce a vector representation and is not versatile as embeddings, which can be used in other types of vector analyses (visualization, clustering, semantic search). In order to measure label recovery performance of our and baseline representations, we used a cross-validation-based approach. We removed labels of 20% randomly selected proteins in the dataset. Next, we trained a k-Nearest-Neighbor (kNN) classifier. Then, for every protein without a label, we predicted its label based on all k nearest neighbors. We repeated this procedure 5 times for each k.
Many proteins are annotated with more than one label within each ontology (for example, a protein may have multiple Pfam domains). To overcome this challenge, we used the Intersection over Union (IoU) metric and example-based Precision, Recall, and F1 Score metrics 53 .
Embedding performance on structure-, function-and taxonomy-related ontologies. Despite a varying number of classes in each task, the results from all ontologies unrelated to taxonomy were similar ( Fig. 2 Table 3). This suggests a comparable degree of difficulty among them, which we hypothesize that is due to the correlations between labels (e.g. KOs are correlated with Pfam domains). The performance of all methods drops for taxonomic labels, esp. genus, family, and order (Fig. 2). EggNOG ontology, that combines information about function and taxonomy, achieves IoU values that are between those obtained for only function-and only taxonomy-related ontologies. Moreover, baseline representations show that the task's difficulty increases with a larger neighborhood (larger k). Despite that, MMseqs2, as a tool designed specifically for the protein search, was able to find similar proteins even from larger neighborhoods. The deep representation results, with a simple kNN classifier on top, were slightly worse in all metrics and ontologies.

and Supplementary
The deep representation and MMseqs2 perform best at recovering labels from ontologies based on protein structures (Gene3D, SUPFAM), while function-or domain-related ontologies obtained a slightly lower metric.
MMseqs2 searches for proteins by comparing sequence k-mers in a very clever and efficient way. If two proteins share the same structure or function, even with not so similar sequences in general, they usually share similar sequential patterns that define those functions or structures. MMseqs2 can find those patterns in sequences of both proteins.
However, vector representations presented here work differently. They produce a vector summarizing the whole protein sequence. Baseline representations (3-mers with TFIDF and amino-acid frequencies vectors) treat different parts of a sequence with equal importance, so the essential sequential patterns are lost in the burden of many neutral mutations. On the contrary, the deep model during the training can learn that some sequential patterns often occur in a training dataset with only minor changes (conserved regions) and have the most KO is a database of molecular functions. Each molecular function is represented in terms of a manually defined functional ortholog that together create molecular networks (pathways). Each functional ortholog is defined from experimentally characterized genes and proteins in specific organisms, which are then used to assign orthologous genes in other organisms, based on sequence similarity 177,018 6614

GO (Gene Ontology) Function
GO is a controlled terminology that can be used to consistently and structurally identify genes and gene products. The GO terms are organized within a directed acyclic graph (DAG), and each GO term has a described relationship to one or more other terms in the same domain (i.e. biological process, molecular function, or cellular location) www.nature.com/scientificreports/ www.nature.com/scientificreports/ significant impact on the rest of the sequence 54 . During the process of embedding a sequence, the model can put significantly more attention on those sequence fragments. This way deep embeddings can contain essential information to obtain results comparable to MMseqs2 on function-and structure-related ontologies without directly comparing the sequences. The taxonomy case is different ( Fig. 2 bottom 4 panels). Proteins with the same organism of origin still can share sequential patterns that can be found using MMseq2 search. However, the deep model will not focus on those motifs, as they neither occur often in the training dataset nor have substantial impact on the rest of the sequence. They may have a marginal impact on the deep embedding, and so it will not have any advantages over baseline representations.
These results indicate that the deep representation space encodes features related to protein structure and function 55 , and does not represent features related to the taxonomy.
Despite not achieving the highest metrics, we recognize deep embeddings as a very promising method combining advantages of a top end sequence-based tool (effectiveness in finding functionally similar proteins) and vector representations (versatility and efficiency once they are computed; Supplementary Table 6). Future developments of the approach may boost both effectiveness and efficiency.
Low dimensional representation of protein sequence space goes beyond sequence similarity. Representations learned by deep models are information-rich, but more difficult to understand due to the high dimensionality of the embedding. Further reduction in dimensionality with UMAP allows us to plot and visually interpret the embedding space built by the model.

Deep embedding model creates a functionally structured representation space. To better
understand which proteins were the easiest to recover based on the embedding, we defined Recovery Error Rate as 1-average IoU metric obtained on each protein across all ontologies and all k values. The use of this metric enabled us to localize regions with low & high Recovery Error Rates (Fig. 3). In Fig. 3A, we show that proteins with low Recovery Error Rates are located in smaller clusters, while proteins with high error rates are concentrated in the center of the UMAP visualization.
To investigate the functional structure of the representation space, we overlay it with labels defined by Kegg Orthology ID (KO) (Fig. 4). The proteins that do not have a KO assigned are colored in grey-we see that they are placed in the central part of the plot. Most of the proteins are clearly clustered by their functional annotation. Furthermore, by focusing on specific space locations, we can see that close KO clusters share other functional features: domains (Fig. 4A,B), EC number class (Fig. 4A,D), or structural and molecular features (Fig. 4E). It suggests that the deep representation does not focus only on one functional ontology, but rather on an abstract protein function defined on many levels. The visualisation explains the high label recovery results and expands analogous analysis conducted on a smaller scale with only 25 COGs 40 . Compared to the k-mer based representation, the deep representation is significantly more structured (Supplementary Fig. 1).
We hypothesize that the regions of high Recovery Error Rate (RER) are occupied by rare proteins. Rare proteins form small functional classes in Bacterial SwissProt, and the smaller the functional class is, the more Proteins colored by Recovery Error Rate, the metric that quantifies how hard it was to recover protein's labels based on its neighbors, the metric that quantifies how hard it was to recover protein's labels based on its neighbors. (B) Proteins colored by percentage of transmembrane residues in a protein chain; adopted from Perdigão et al. 56 Fig. 4). We see that high RER proteins generally belong to smaller classes and are responsible for more varied functions. For example, transcription and translation KEGG categories exhibit low RER, while Protein families: metabolism or Biosynthesis of other secondary metabolites show high RER. The latter, is a large category consisting of many pathways involved in biosynthesis of phytochemical, antibacterial, fungal, and other compounds. Overall, analyses show that groups with high mean RER are more diverse or are rich in proteins that are rare or less described in databases than groups with low mean RER ( Supplementary Figs. 2, 3 and 4).
Short and transmembrane proteins. The embedding model is sensitive to the length of the protein (Fig. 3C) and a significant number of short proteins is present in the central, lesser understood part of UMAP visualization. Short proteins (≤ 50 residues), underestimated for a long time, gained interest in recent years when it was discovered that they are involved in important biological processes such as cell signaling, metabolism, and growth 57 . The presence of a high Recovery Error Rate region might be a result of insufficient information on small proteins, which are still underrepresented in databases. Following Sberro et al., based on the NCBI GenPept database, over 90% of small protein families have no known domain and almost half are not present in reference genomes 58 . For a detailed description of the protein set see Supplementary Table 4. The whole space can be interactively explored in our application (https:// prote in-explo rer. ardig en. com).
Transmembrane proteins constitute approx. 30% of all known proteins. Unlike globular proteins, they are on average larger and must exhibit a pattern of hydrophobic residues to fit into the cell membrane 59 . In order to define transmembrane proteins we used a transmembrane score (a percentage of transmembrane residues) adopted from Perdigão et al. 56 . In Fig. 3B, we can see that the model separates transmembrane proteins well, which is in line with previous research on deep protein representations 44,60 . However, part of transmembrane proteins lie within the high-recovery error region of the UMAP plot. Despite substantial pharmacological and www.nature.com/scientificreports/ biological relevance, they are less understood and underrepresented in databases, as structural experiments on them are difficult to conduct. Lower ECE loss obtained on metagenomic proteins (compare Alignment-free deep protein embeddings represent structure-and function-related ontologies section) suggest that the deep embedding model trained on a more general catalog of metagenomic proteins (UHGP) is less biased towards well-known model organisms, hence, better suited for rare, short or transmembrane proteins.

A sample use case-phosphotransferases (EC 2.7.2).
To demonstrate the use of embedding representation in a real-life scenario, we used a group of phosphotransferases. We have chosen them due to their importance in maintaining the human gut microbiome homeostasis. Acetate, butyrate, and propionate kinases are especially crucial in the process of forming short-chain fatty acids (SCFAs). SCFAs are produced in the colon by bacteria during the fermentation of resistant starch and non-digestible fibers. Their lowered level is often observed in patients suffering from inflammatory bowel diseases (IBD) such as Crohn's disease and ulcerative colitis 61 . SCFAs serve as an important fuel for intestinal epithelial cells and participate in preserving gut barrier integrity. Recent findings indicate their role in energy metabolism (lipid metabolism), immunomodulation, regulation of intestinal epithelial cells, proliferation and cancer protection. Although promising, the research has been conducted mainly on murine or in vitro models, thus the results have to be interpreted with caution [61][62][63] .
Proteins classified as phosphotransferases were chosen based on their EC number. We decided to use this annotation as EC numbers are a manually assigned nomenclature that describes enzymes based on the chemical reactions they catalyze. Their hierarchical structure allows for a fine-grained analysis. We examined the domain architecture of EC 2.7.2 proteins using the Pfam database. The domain architecture is the main structure that defines a protein's function. We found that four domain architectures were dominant among analysed proteins. 31% of analysed proteins contained one amino acid kinase domain (PF00696), 29% of proteins had one phosphoglycerate kinase domain (PF00162), 20% contained one acetate kinase domain (PF00871), and 18% of proteins had two coincident domains PF00696 & PF01472, i.e., amino acid kinase domain and PUA domain.
In total, we studied 1,302 proteins exhibiting eight unique specific functions (ECs) and four distinct domain architectures (See Supplementary Table 5). Different domain architectures suggest that these proteins have different amino acid sequences and would be difficult to identify as similar with baseline bioinformatic methods based on sequence similarity alone.
To investigate how accurately the embedding representation reflects the functional relationships between the proteins, we visualized them using UMAP (Fig. 5A). We understand that both domain architecture and enzymatic activity impact protein embeddings and their ordination in UMAP space. Almost all proteins were grouped according to their domain architecture, and proteins with similar domain architectures, such as proteins having only PF00696 domain and proteins having two domains PF00696 & PF01472, were also placed closer to each other. Despite clear domain-based grouping, proteins that share the same domain architecture, but catalyze different chemical reactions, are separated. We hypothesize that protein domain architecture has a stronger influence on the embeddings than EC number (Fig. 5A 64 . The only inconsistency we can note are two butyrate kinase proteins (Fig. 5A cyan circles; PF00871) that were placed far from their counterparts.
To further analyse two outlying butyrate kinases, we inspected sequences of outlier's domains sequence and compared it to sequences of PF00696 and PF00871 domains. We performed multiple sequence alignment (MSA) of outlying protein's domains, the PF00696 domains and the PF00871 domains sequences (see "Methods"). MSA results showed that outlying protein's domain sequences are distant from PF00871 domain sequences from further proteins, including other butyrate kinases. However, outliers were more closely aligned with sequences that belonged to PF00696 domains ( Supplementary Data 1-3). We hypothesize that the domain sequence of the two outlying proteins is more similar to PF00696 domain than PF00871, which made UMAP place them closer to the former (Fig. 5A,B).
We hypothesize that the deep representation reflects the functional similarities between proteins that are based on domain architecture (Pfam domains) or enzymatic activity (EC number). This emphasizes the significant advantage of deep embeddings, as they do not only focus on single, human-created ontology, such as e.g., EC numbers, but rather fuse all information to characterize proteins on multiple levels. It combines the strengths of approaches that focus on motifs, domains (Pfam), and 3D structure (GENE 3D) to understand protein function space comprehensively. To better understand the differences between sequence-based distance and deep embeddings, we compared the Euclidean distance between EC 2.7.2 proteins and randomly chosen 500 proteins from the Bacterial SwissProt dataset. As a baseline, we selected sequence-based distance calculated with Clustal Omega 65 . The distance measure used by Clustal Omega for pairwise distances of unaligned sequences is the percent identity between two analysed sequences (see "Methods"). www.nature.com/scientificreports/ The embedding-based distances within and between the EC 2.7.2 subclasses are smaller than to randomly selected proteins, which do not hold for the sequence-based distance (Fig. 5C,D). The mean embedding-based distance between EC 2.7.2 proteins is significantly smaller compared to the distance between EC 2.7.2 proteins and 500 random proteins. Not only proteins from the same cluster group are closer to each other but also proteins  www.nature.com/scientificreports/ from different EC 2.7.2 clusters are located significantly closer to proteins from other EC 2.7.2 clusters than to random proteins. Mean percent identity between proteins does not reflect the clear separation between EC 2.7.2 and random proteins. Mean sequenced-based distance between proteins from the same cluster is smaller than between EC 2.7.2 and random proteins, however it does not bring proteins closer from different EC 2.7.2 clusters (Fig. 5C). This proves that the embedding model can go beyond sequence similarity and find relations between proteins with significantly different sequences and domain architectures. We believe that deep protein embeddings may enable searching for proteins that are functionally similar at different levels of specificity. Taking EC number classification scheme as example, while localizing the searched protein sequence in deep embeddings space we can find a cluster of proteins that belong to a general assemblage of transferases (EC 2.7) and splits into more specific subclusters such as EC 2.7.2 (see Fig. 5 and our application). We expect that this will help functionally define new, undiscovered bacterial proteins that implement similar functions (e.g. novel 2.7.2 subclass) with a significantly different sequence.

Discussion
The human microbiome plays a crucial role in human health, and changes in its composition can be related to various diseases, such as diabetes, cancer, or psychiatric disorders. To fully understand the complex relation between the microbiome and human health, it is necessary to look not just at the taxonomic level but also at a functional level. Despite various approaches to retrieve protein functions 66,67 , a large portion of microbial proteins remain functionally uncharacterized. This paper presents a novel context for using the Bidirectional LSTM model to visualize and contextualize the microbial protein space. We show that our model accurately represents protein features related to structure and function, overcoming some limitations of standard bioinformatics methods such as HMMER or BLAST. However, more research and development is needed to establish deep-learning-based tools that will take them over. The deep learning model creates an abstract, numerical representation of proteins in an embedding space. This embedding encodes information from various protein ontologies and combines knowledge on protein structure and function, overcoming the limitations of methods based on sequence similarity. For certain tasks processing embeddings is more efficient then sequences, although generating embeddings is still computationally expensive. The embedding is also more suitable for a large range of further downstream algorithms, such as classification, clustering and visualization. Combining embeddings with a dimensionality reduction method, such as UMAP, may enable creating a reference protein map and facilitate protein research.
One of the significant challenges that any data-driven solution must face is data bias. Our results indicate that using a catalog of metagenomic proteins (UHGP) for training made the model less biased towards well-known model organisms. Despite this, model validation required the use of experimentally verified data, which limited the scope of our validation to well-known proteins and prevented genuine validation on small or transmembrane proteins. We assume that with the growing interest in these proteins, their presence in the databases and number of their annotations will increase, which will allow for a more thorough validation.
We are witnessing rapid progress in both the deep learning field and in metagenomics, which generate massive amounts of data. We believe embedding models are an attractive alternative to database-bound, computationally intensive methods unsuitable for such influx of data. We also assume that recent advances in deep learning, like the latest intensive research on Transformer-based architectures, will only improve results presented in our work. Other appealing approach would be to join the strengths of relatively computationally-cheap embedding models with other computational technologies that can accurately predict the features of individual genes (for example: protein 3D structure using AlphaFold [21][22][23] ) and finally perform experimental validation on most promising targets. Such approach enables such efficient contextualization of metagenomic data and may be used to better understand the microbiome for health. Finally, we hope that the research presented here and in other related works will lead to concrete tools that will enable adoption of the approach in microbiome and other metagenomics studies.

Methods
Embedding model training. In the training, we took advantage of the Unified Human Gastrointestinal Protein catalog clustered at 95% sequence identity (UHGP-95) to limit the impact of the most common sequences. Further clustering may improve the model 40 . UHGP-95 contains exactly 19,228,304 protein sequences, from which we randomly selected 5% to track training progress (validation set) and set aside another 5% for the final model evaluation (test set). The rest of the data (18,266,888 sequences) was used to train the model. Due to GPU memory limitations we clipped all proteins to 1,500 amino acids. This impacted only 0.9% of proteins from the training set as the others were shorter.
We used a 3-layered Bidirectional LSTMs (BiLSTM) model with 1024 hidden units in each layer. The LSTMbased architectures are relatively well established in the protein informatics, being applied to predict, i.a. subcellular localization 47 , secondary structure 68 or protein crystallization 69 . Moreover, we have chosen the LSTM architecture as it gave the best results in Remote Homology detection in the TAPE benchmark 39 and achieved superior performance over Transformer-based architecture in the ProtTrans benchmark 44 . On the other hand, the most recent findings show the superiority of Transformer-based architectures 23,40,70 in protein informatics. We assume that those and even further advances in deep learning, especially applied to protein sequences, will only improve results presented in our work.
The model was trained by the AdamW optimizer for 225,331 weights updates with a mini-batch of size 1024, which corresponds to 12 epochs and approximately 48 h on 4 Tesla V100 GPUs. The learning rate was set to 1e-3, except the first 8,000 steps that were used as a warmup. The process was implemented in the PyTorch library 71  www.nature.com/scientificreports/ Computing embeddings. To obtain a vector representation of a protein (embedding) from the BiLSTM model, we extracted vectors of hidden states for each amino acid and averaged them. This is in contrast to natural language processing practice, which uses the hidden state vector corresponding to the last word (here it would be the last amino acid) rather than the average representation of all words. However, there is evidence suggesting the superiority of averaged presentation in the field of protein processing 45 . This may be due to the fact that proteins are usually much longer than sentences, and LSTM-based models cannot fit the whole amino acid sequence in just one state.
Validation on metagenomic proteins. The  To remove near-identical protein sequences, we deduplicated the remaining set using mmseq2 easyclust 76 with an identity threshold set to 97% and coverage set to 0.8 (mmseqs easy-cluster uniprot_sprot. fasta swiss97_clust tmp -e inf -c ʘ.5 --min-seq-id ʘ.1 --cov-mode 1 --cluster-mode 1 --threads 20). Removing duplicates ensured no cliques in the kNN graph, which we used in the kNN label recovery and UMAP visualizations. Cliques would lead to trivial solutions during kNN classification and "lonely islands" in UMAP visualizations.
After the deduplication step, we obtained 201,622 proteins, and this set we named Bacterial SwissProt.
Baseline representations. For a general sequence-based baseline representation, we used the bag of k-mers method 77 , which produces embedding for a protein by the following procedure: (a) generate all possible k-mers (subsequences of length k) from protein sequence, (b) count occurrences of each possible k-mers in the sequence, (c) sort counts alphabetically by k-mers sequence. Sorted counts form a vector representing the sequence.
Higher k leads to more specific representation but exponentially increases dimensionality, which is equal to the number of all possible k-mers (N = 20 k ). In our work, we chose k = 3, which resulted in 8,000-dimensional vectors. Choosing k = 4 would lead to 160,000 dimensions, which would be hard to manage computationally. On the other hand, k = 2 would be convenient with 400 dimensions but less specific than k = 3.
Finally, we have applied term frequency-inverse document frequency (TFIDF) transformation on the 3-mer representation, which accentuates rare k-mers. We have used sklearn's TfidfTransformer 78 to implement the transformation.
To complement the k-mer-based baseline we added a representation of the amino acid frequencies vector.
MMseqs2 search baseline. For a task-specific, state-of-the-art baseline we have used MMseqs2 search 76 .
It is a fast and sensitive sequence search tool that is broadly applied in metagenomics. We needed to increase MMseqs2 search sensitivity (from default 5.7 to 9.0) and suppress e-value thresholding to obtain up to 201 nearest proteins from the search. These are not the best parameters if one is looking only for the several nearest proteins, but it was necessary to compare larger neighbourhoods. Full commands are listed below.
Label recovery. For the analysis, we used the Bacterial SwissProt described above. We generated deep and k-mer-based representations for each protein. Next, we reduced the dimensionality of all representations to 50 using the Principal Component Analysis (PCA) algorithm (Fig. 1B). Vectors of 50 dimensions are computationally efficient for downstream analyses and at the same time explain 81.8% of variance of the full embeddings. We narrowed down the set of analysed proteins to only those with assigned labels in given ontology for each ontology analysed. We divided these sets of proteins into five equal parts to estimate recovery efficiency www.nature.com/scientificreports/ through fivefold cross-validation. For every fold, we constructed a kNN graph (https:// github. com/ lmcin nes/ pynnd escent) of the data from the four remaining folds. The graph was then used to predict classes for each protein in the fold, by querying the nearest proteins (N = 51) and propagating their labels as a prediction. As the protein can be assigned to many classes (multi-label classification), we used the Intersection over Union (IoU) as the main metric. IoU is the ratio between the correctly predicted labels and the union of all predictions with all ground-truth labels for a given protein (1). IoU ranges between 0 and 1, where 1 means perfect label recovery. For single-label tasks, IoU reduces to accuracy.

UMAP visualisations.
To visualise protein embedding space, we further reduced dimensionality of the PCA Embeddings with UMAP 48 (Uniform Manifold Approximation and Projection; https:// github. com/ lmcin nes/ umap), a nonlinear dimensionality reduction method. UMAP was chosen over another common nonlinear dimensionality reduction method, t-SNE (t-distributed Stochastic Neighbor Embedding), as it preserves more of the global structure with superior runtime performance 79 .
We set the number of neighbors (n_neighbors) to 50 to balance representing the local and global structure of the data. Also, we set the minimal distance (min_dist) to 0.3 to ensure the visibility of all proteins on the scatterplots. The rest of the parameters were left at default values.
Cluster analysis. Selecting proteins. Proteins assigned to EC 2.7.2 subclass were chosen for the analysis. In the analysis, we used 8 available EC 2.7.2 sub-subclasses out of 14, as our bacterial dataset lacked proteins described by 6 other sub-subclasses. Sub-subclasses used in this analysis are EC 2. 7 ). We assigned a Pfam ID to each protein using mapping available in SwissProt. 4 domain architectures were found dominant among 1,302 analysed proteins. 31% of analysed proteins contained one amino acid kinase domain (PF00696), 29% had one phosphoglycerate kinase domain (PF00162), 20% one acetate kinase domain (PF00871) and 18% had two coincident domains (PF00696 and PF01472), i.e. amino acid kinase domain and PUA domain.
We visualized EC 2.7.2 proteins in the same manner as described above in UMAP visualizations.
Comparison to sequence (Clustal Omega for distance matrix). To infer about the ability of the embedding model to group more closely proteins sharing a function, we compared the distance between EC 2.7.2 proteins and 500 randomly chosen proteins from the Bacterial SwissProt database (excluding EC 2.7.2 proteins). We wanted to analyse if embeddings distance between proteins is compatible with corresponding amino acid sequence distance. Embedding distance was calculated as an Euclidean distance between 50 PCA components. Those 50 PCA components are the result of dimensionality reduction of 2048 protein embeddings, generated by the model. Sequence distance was calculated using Clustal Omega 80 (clustalo --infile $sequence_file --seqtype = Protein --distmat-out $distance_matrix -clustering-out = $clustering --outfile = $alignment --threads = 16 --percent-id --full), a bioinformatic tool for multiple sequence alignment. This tool takes a fasta file with unaligned protein amino acid sequences as input and calculates percent of sequence identity between those sequences giving a pairwise distance matrix. The distance measure used by Clustal Omega for pairwise distances of unaligned sequences is percent identity between two analysed sequences.
We have chosen to draw 500 proteins to have a big enough sample and at the same time limit required computations (the number of distances to compute grows quadratically with the number of proteins). Results were almost identical when we drew 100, 1000, or different 500 proteins. We believe that in this case, 500 proteins are enough to model the distribution of "other proteins". The selected 500 proteins are listed in the notebook on our Github repository (https:// github. com/ ardig en/ micro biome-prote in-embed dings/ blob/ master/ 03-ec-2. 7.2/ ec-2. 7.2-analy sis. ipynb).
Outliers analysis. To infer sequence similarity we performed multiple sequence alignment (MSE) between outlying protein's, PF00696 and PF00871 domain sequences. HMMER software 67 was used to find domain positions in each protein. First we created a hmmer profile database using target domains (hmmbuild $hmm_database $alignment_file, hmmpress $hmm_database) and searched domains in outlying proteins (hmmscan $hmm_