Date: 2023-09-07

We used our model, which can produce structure-aware embeddings of protein sequences, to build large, searchable databases of protein representations that can be queried to find proteins with similar structures using only their sequence information. The last piece of our pipeline produces protein structure alignments using sequences alone for the proteins that are predicted to have the most similar structures.

TM-Vec search

TM-Vec embedding model

The TM-Vec model is trained on pairs of protein sequences and their TM-scores (the measure of protein structure similarity we use), and leverages the latest advances in deep protein language models. When protein sequences are fed into the pipeline, a pretrained deep protein language model ProtTrans (ProtT5-XL-UniRef50) is used to produce embeddings for every residue in the protein35. These residue embeddings are then fed into a twin neural network that we train, called ϕ. Supplementary Fig. 1 shows the function ϕ, which takes residue embeddings and produces a flattened vector representation of dimension 512 for each protein. ϕ is composed of several transformer encoder layers (see the TM-Vec training section for transformer details), followed by average pooling, dropout and fully connected layers. Finally, we calculate the cosine similarity between the reduced representations of the two proteins in the pair, and our training objective is to minimize the L1 distance between this cosine similarity and the TM-score of the pair. Therefore, for any pair of protein sequences, a forward pass of our model can predict the TM-score of the pair, and can also be used to produce structure-aware embeddings for each protein sequence.
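The training objective described above can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by ϕ; the function names `cosine_similarity` and `tm_l1_loss` and the toy three-dimensional vectors are hypothetical stand-ins for the 512-dimensional outputs, not part of the TM-Vec code base.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tm_l1_loss(embedding_pairs, tm_scores):
    """Mean absolute difference between cosine similarity and the true TM-score."""
    errors = [abs(cosine_similarity(u, v) - t)
              for (u, v), t in zip(embedding_pairs, tm_scores)]
    return sum(errors) / len(errors)

# Toy embeddings (assumption: 3 dimensions instead of 512).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction, predicted TM-score 1.0
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal, predicted TM-score 0.0
]
scores = [1.0, 0.2]  # made-up ground truth TM-scores
loss = tm_l1_loss(pairs, scores)
```

In this reading, the cosine similarity itself serves as the predicted TM-score, so the same forward pass yields both the prediction and the embeddings.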

TM-Vec database creation

To build a large database of structure-aware protein embeddings, we started with large databases of protein sequences, including SWISS-Prot68, CATH37 and UniRef50 (ref. 16). After encoding each protein sequence, we built an indexed vector-searchable database of protein embeddings using the Faiss package41. When this database was queried with a new sequence, we first embedded the protein using a forward pass of the TM-Vec embedding model and then returned the nearest neighbors of the query according to cosine similarity (the proteins in our database with the highest predicted structural similarity or TM-score). Although one of our goals was to return the nearest neighbors in structure space for any query proteins, another goal was to include the structural alignments for the nearest neighbors with the query protein, using sequences alone. Thus, the predicted most similar proteins (structurally), their predicted TM-scores and their predicted structural alignments can all be returned by the TM-Vec + DeepBLAST pipeline, and the number of proteins for which to retrieve this information is a user-defined parameter (the pipeline will return the user-defined top n).
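The query step can be sketched as a brute-force nearest-neighbor search by cosine similarity. This is only an illustration under stated assumptions: the pipeline uses a Faiss index rather than the exhaustive loop shown here, and the function `top_n_neighbors`, the protein names and the two-dimensional embeddings are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_n_neighbors(query, database, n):
    """Return (name, cosine similarity) for the n database entries closest to the query."""
    q = normalize(query)
    scored = []
    for name, embedding in database.items():
        e = normalize(embedding)
        scored.append((name, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first; the similarity doubles as the predicted TM-score.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]

# Toy database (assumption: 2-dimensional embeddings, made-up names).
database = {"protA": [1.0, 0.0], "protB": [0.6, 0.8], "protC": [0.0, 1.0]}
hits = top_n_neighbors([1.0, 0.1], database, n=2)
```

The parameter `n` plays the role of the user-defined number of neighbors for which predicted TM-scores and alignments are returned.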

DeepBLAST alignment module

The DeepBLAST module uses a differentiable Needleman–Wunsch algorithm (Supplementary Fig. 2). Proteins X and Y are fed into the pretrained protein language model ProtTrans35 to obtain embeddings \({H}_{X}\) and \({H}_{Y}\). These residue-level embeddings are then propagated through the match embeddings (M) and gap embeddings (G) to obtain the match scores μ and the gap scores g. The match and gap scores are used to evaluate the differentiable dynamic programming algorithm and generate a predicted alignment traceback. These alignments can then be fine-tuned using a training dataset of ground truth alignments.

Protein language modeling for alignment

To obtain an alignment from dynamic programming, scoring parameters for matches and gaps must be obtained. Here, we use a number of pretrained protein language models to estimate these scoring parameters. These pretrained models ultimately construct a function, mapping a sequence of residues represented as one-hot encodings to a set of residue vectors, providing an alternative representation of these proteins. Often, these models will learn the representations by being trained to predict randomly masked residues within a protein sequence. Multiple studies have shown the merits of these models when performing protein structure prediction, remote homology and protein design31,32,33,34,35,36,69. Here, we use the pretrained ProtTrans language model35 to represent two proteins X and Y by embeddings \({{{H}_{X}}}\in {{\mathbb{R}}}^{p\times d}\) and \({{{H}_{Y}}}\in {{\mathbb{R}}}^{q\times d}\), where p and q represent the lengths of proteins X and Y, and d is the embedding dimension of the language model. Given these representations, we can construct mappings M and G to obtain match scores and gap scores for the differentiable dynamic programming as follows

$${{\mu }}={\sigma }_{\mu }\left(M({{{H}_{X}}}) M{({{{H}_{Y}}})}^{T}\right)\in {{\mathbb{R}}}^{p\times q}, \quad {{g}}={\sigma }_{g}\left(G({{{H}_{X}}}) G{({{{H}_{Y}}})}^{T}\right)\in {{\mathbb{R}}}^{p\times q}$$

The functions \(M:{{\mathbb{R}}}^{t\times d}\to {{\mathbb{R}}}^{t\times d}\) and \(G:{{\mathbb{R}}}^{t\times d}\to {{\mathbb{R}}}^{t\times d}\) are intermediate functions that take as input a set of t residue vectors. These functions are parameterized by convolutional neural networks, which can be fine-tuned through the backpropagation enabled by the differentiable dynamic programming. The activation functions \({\sigma }_{\mu }\) and \({\sigma }_{g}\) are the softplus and log-sigmoid functions, respectively, which ensure that the match scores μ are strictly positive and the gap scores g are strictly negative. These constraints are used to penalize gaps and reward matches. They also help to enforce identifiability of the model, which we have found to improve the accuracy of the model in practice.
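The sign constraints on the scores can be seen in a small sketch. It assumes M and G are identity maps (the real functions are convolutional networks) and uses toy two-dimensional residue vectors; `pairwise_scores` is a hypothetical helper, and the naive `softplus` shown here is not numerically stable for large inputs.

```python
import math

def softplus(x):
    """log(1 + exp(x)); positive for any finite input."""
    return math.log1p(math.exp(x))

def log_sigmoid(x):
    """log(1 / (1 + exp(-x))) = -softplus(-x); negative for any finite input."""
    return -softplus(-x)

def pairwise_scores(hx, hy, activation):
    """Apply the activation to every inner product between rows of hx and rows of hy."""
    return [[activation(sum(a * b for a, b in zip(row_x, row_y))) for row_y in hy]
            for row_x in hx]

# Toy residue embeddings (assumption: p = 2, q = 3, d = 2, with M and G as identity).
h_x = [[0.5, -1.0], [1.5, 0.2]]
h_y = [[1.0, 0.0], [-0.3, 0.7], [0.2, 0.2]]

mu = pairwise_scores(h_x, h_y, softplus)      # match scores, shape p x q
g = pairwise_scores(h_x, h_y, log_sigmoid)    # gap scores, shape p x q
```

Whatever the sign of the underlying inner product, every entry of `mu` is positive and every entry of `g` is negative, which is the property the activations are chosen for.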

Differentiable dynamic programming

Our proposed differentiable dynamic programming framework does not learn any parameters; it is designed purely to enable backpropagation to fine-tune the scoring functions M and G. Differentiable dynamic programming has been extensively explored in the context of dynamic time warping70,71. Koide et al.72 and Ofitserov et al.73 suggested that a differentiable Needleman–Wunsch alignment algorithm could be derived, but the implementation has remained elusive. Here, we provide a GPU-accelerated implementation of the differentiable Needleman–Wunsch algorithm.

Previous work71 has shown that backpropagation can be performed on dynamic programming algorithms by introducing smoothed maximum and argmax functions. Doing so will enable the computation of derivatives while providing a tight approximation to the optimal dynamic programming solution. The traditional Needleman–Wunsch algorithm can be defined with the following recursion

$${v}_{i,\,j}={\mu }_{i,\,j}+{\mathrm{max}}\left\{\begin{array}{ll}{v}_{i-1,\,j-1}&{\mathrm{(Match)}}\\ {g}_{i,\,j}+{v}_{i-1,\,j}&{\mathrm{(Insert)}}\\ {g}_{i,\,j}+{v}_{i,\,j-1}&{\mathrm{(Delete)}}\end{array}\right.$$ (1)

where the alignment score \({v}_{i,j}\) is evaluated on position i in the first sequence X and on position j in the second sequence Y. Sequences X and Y are of lengths n and m, respectively. \({\mu }_{i,j}\) represents the log-odds score of residues \({X}_{i}\) and \({Y}_{j}\) being aligned and \({g}_{i,j}\) represents the log-odds score of an insertion or a deletion at positions i and j. Owing to the structure of dynamic programming problems, \({v}_{n,m}\) is guaranteed to be the optimal alignment score between the two sequences. Furthermore, the optimal alignment can be obtained by tracing the highest-scoring path through the alignment matrix via argmax operations.
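Recursion (1) can be written out directly as a short sketch. The all-zero boundary row and column are an assumption made here for simplicity, and `needleman_wunsch_score` is a hypothetical name; position-specific match scores `mu` and gap scores `g` are passed as n × m nested lists.

```python
def needleman_wunsch_score(mu, g):
    """Fill the alignment matrix with the hard-max recursion and return v[n][m]."""
    n, m = len(mu), len(mu[0])
    # v has one extra row and column for the boundary (assumed to be zero).
    v = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap = g[i - 1][j - 1]
            v[i][j] = mu[i - 1][j - 1] + max(
                v[i - 1][j - 1],     # match
                gap + v[i - 1][j],   # insert
                gap + v[i][j - 1],   # delete
            )
    return v[n][m]

# Toy scores: matches on the diagonal are rewarded, every gap costs 1.
mu = [[2.0, 1.0], [1.0, 2.0]]
g = [[-1.0, -1.0], [-1.0, -1.0]]
score = needleman_wunsch_score(mu, g)
```

For these toy scores the best path takes the two diagonal matches, giving a score of 4.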

As neither the max nor the argmax operations are differentiable, the alignment scores and the traceback cannot be differentiated in the traditional formulation of the traceback operations needed to generate alignments. Accordingly, Mensch et al.71 introduced smoothed differentiable operators

$${\mathrm{max}}_{\Omega }(x)=\log \left(\mathop{\sum}\limits_{i}\exp ({x}_{i})\right),\quad {\mathrm{argmax}}_{\Omega }(x)=\frac{\exp (x)}{{\sum }_{i}\exp ({x}_{i})}$$

where the smoothed max operator \({\mathrm{max}}_{\Omega }\) is given by the log-sum-exp function and the smoothed argmax operator \({\mathrm{argmax}}_{\Omega }\) is given by the softmax function. As the softmax function is the gradient of \({\mathrm{max}}_{\Omega }\), the traceback matrix can also be obtained by differentiating the resulting alignment matrix. The resulting traceback matrix will yield the expected alignment between the two proteins.
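The relationship between the two smoothed operators can be checked numerically. The sketch below implements log-sum-exp and softmax in plain Python (with the usual max-shift for stability) and compares the softmax output with a finite-difference gradient of log-sum-exp; the function names are illustrative only.

```python
import math

def smooth_max(xs):
    """Log-sum-exp, computed with a max shift to avoid overflow."""
    shift = max(xs)
    return shift + math.log(sum(math.exp(x - shift) for x in xs))

def smooth_argmax(xs):
    """Softmax, the gradient of log-sum-exp with respect to its inputs."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def numerical_gradient(f, xs, eps=1e-6):
    """Central finite-difference gradient of a scalar function of a list."""
    grad = []
    for k in range(len(xs)):
        up = list(xs)
        down = list(xs)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

xs = [1.0, 2.0, 3.0]
probabilities = smooth_argmax(xs)
finite_difference = numerical_gradient(smooth_max, xs)
```

The entries of `probabilities` and `finite_difference` agree to within the finite-difference error, and the smoothed max is an upper bound on the hard max.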

As the loss function is defined as the difference between the predicted traceback matrix and the ground truth traceback matrix, the derivatives of the traceback matrix also need to be defined. This requires both the computations of the directional derivatives and the local Hessians of the alignment matrix (Algorithm 2).

In practice, dynamic programming can be a major computational bottleneck owing to GPU data transfer and the quadratic runtime of the Needleman–Wunsch algorithm. To address this, we implemented a GPU-accelerated differentiable Needleman–Wunsch algorithm inspired by Manavski et al.74. As can be seen from the benchmarks shown in Supplementary Fig. 7d, this algorithm is an order of magnitude faster than the naive CPU-bound Needleman–Wunsch implementation. Furthermore, this algorithm enables batching, allowing multiple alignments to be processed in parallel. As shown in Supplementary Fig. 7d, larger batch sizes can further improve the scaling compared with CPU-bound alignments.

Algorithm 1. Compute \({{\rm{DeepBLAST}}}_{\Omega }(\theta )\) and \(\nabla {{\rm{DeepBLAST}}}_{\Omega }(\theta )\)

Require:\(\theta =[\mu ,g]\in {{\mathbb{R}}}^{2\times p\times q}\)

Forward pass

\({v}_{0,0}^{M}=1;{v}_{0,.}^{* }=0;{v}_{.,0}^{* }=0\)

for i ∈ {1…p}, j ∈ {1…q} do

\({v}_{i,\,j}={\mathrm{ma{x}}}_{{{\Omega }}}\ \left({\mu }_{i,\,j}+({v}_{i-1,\,j-1},{g}_{i,\,j}+{v}_{i-1,\,j},{g}_{i,\,j}+{v}_{i,\,j-1})\right)\)

\({\omega }_{i,\,j}=\nabla {\mathrm{argma{x}}}_{{{\Omega }}}\ \left({\mu }_{i,\,j}+({v}_{i-1,\,j-1},\ {g}_{i,\,j}+{v}_{i-1,\,j},\ {g}_{i,\,j}+{v}_{i,\,j-1})\right)\in {{\mathbb{R}}}^{3}\)

end for

Backward pass

\({e}_{p,q+1}=0;\ {e}_{p+1,q}=0;\ {e}_{p+1,q+1}=1\)

for i ∈ {p…1}, j ∈ {q…1} do

\({e}_{i,\,j}={\omega }_{i+1,\,j+1}^{m}{e}_{i+1,\,j+1}+{\omega }_{i+1,\,j}^{x}{e}_{i+1,\,j}+{\omega }_{i,\,j+1}^{y}{e}_{i,\,j+1}\)

end for

\(W={(\omega )}_{i,\,j,k = 1}^{p+1,q+1};\ E={(e)}_{i,\,j = 1}^{p+1,q+1}\) ▷ intermediate computations to be used in Algorithm 2

return \({{{{\rm{DeepBLAST}}}}}_{{{\Omega }}}(\theta )={v}_{p,q},\ \nabla {{{{\rm{DeepBLAST}}}}}_{{{\Omega }}}(\theta )={(e)}_{i,\,j = 1}^{p,q}\)
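The forward and backward passes of Algorithm 1 can be sketched in plain Python as follows. This is an unbatched CPU illustration, not the GPU implementation. It assumes an all-zero boundary for v and a terminal match weight of 1 at position (p+1, q+1), so that \({e}_{p,q}=1\); the function name `deepblast_forward_backward` is hypothetical.

```python
import math

def smooth_max(xs):
    """Log-sum-exp with a max shift."""
    shift = max(xs)
    return shift + math.log(sum(math.exp(x - shift) for x in xs))

def softmax(xs):
    """Gradient of log-sum-exp."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def deepblast_forward_backward(mu, g):
    """Return the smoothed alignment score v[p][q] and its gradient e (p x q)."""
    p, q = len(mu), len(mu[0])
    v = [[0.0] * (q + 1) for _ in range(p + 1)]                 # zero boundary (assumption)
    w = [[(0.0, 0.0, 0.0)] * (q + 2) for _ in range(p + 2)]     # (match, insert, delete) weights

    # Forward pass: smoothed recursion and local softmax weights.
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            gap = g[i - 1][j - 1]
            candidates = (v[i - 1][j - 1], gap + v[i - 1][j], gap + v[i][j - 1])
            v[i][j] = mu[i - 1][j - 1] + smooth_max(candidates)
            w[i][j] = tuple(softmax(candidates))

    # Backward pass: accumulate the expected traceback.
    e = [[0.0] * (q + 2) for _ in range(p + 2)]
    e[p + 1][q + 1] = 1.0
    w[p + 1][q + 1] = (1.0, 0.0, 0.0)                           # terminal edge (assumption)
    for i in range(p, 0, -1):
        for j in range(q, 0, -1):
            e[i][j] = (w[i + 1][j + 1][0] * e[i + 1][j + 1]
                       + w[i + 1][j][1] * e[i + 1][j]
                       + w[i][j + 1][2] * e[i][j + 1])

    gradient = [row[1:q + 1] for row in e[1:p + 1]]
    return v[p][q], gradient

mu = [[1.0, 0.2, 0.1], [0.3, 1.5, 0.2]]
g = [[-0.5] * 3 for _ in range(2)]
score, expected_alignment = deepblast_forward_backward(mu, g)
```

Under these assumptions, `expected_alignment[i][j]` equals the derivative of the smoothed score with respect to the match score at that position, which can be confirmed with finite differences.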

Algorithm 2. Compute \(\langle \nabla {{\rm{DeepBLAST}}}_{\Omega }(\theta ),Z\rangle\) and \({\nabla }^{2}{{\rm{DeepBLAST}}}_{\Omega }(\theta )Z\)

Require \(\theta =[\mu ,g]\in {{\mathbb{R}}}^{2\times p\times q},Z=[{z}_{\mu },{z}_{g}]\in {{\mathbb{R}}}^{2\times p\times q}\)

Forward pass

\({v}_{0,0}=1;\ {v}_{0,.}=0;\ {v}_{.,0}=0\)

for i ∈ {1…p}, j ∈ {1…q} do

\({\dot{v}}_{i,\,j}={z}_{{\mu }_{i,\,j}}+{\omega }_{i,\,j}^{m}({\dot{v}}_{i-1,\,j-1})+{\omega }_{i,\,j}^{x}({z}_{{g}_{i,\,j}}+{\dot{v}}_{i-1,\,j})+{\omega }_{i,\,j}^{y}({z}_{{g}_{i,\,j}}+{\dot{v}}_{i,\,j-1})\)

\({\dot{\omega }}_{i,\,j}\!=\!-{J}_{{{\Omega }}}({\omega }_{i,\,j})\left({\omega }_{i,\,j}^{m}({\dot{v}}_{i-1,\,j-1}),\ {\omega }_{i,\,j}^{x}({z}_{{g}_{i,\,j}}\!+\!{\dot{v}}_{i-1,\,j}),\!\ {\omega }_{i,\,j}^{y}({z}_{{g}_{i,\,j}}\!+\!{\dot{v}}_{i,\,j-1})\right)\in {{\mathbb{R}}}^{3}\)

end for

Backward pass

\({e}_{p,q+1}=0;\ {e}_{p+1,q}=0;\ {e}_{p+1,q+1}=1\)

for i ∈ {p…1}, j ∈ {q…1} do

\({\dot{e}}_{i,\,j}={\dot{\omega }}_{i+1,\,j+1}^{m}{e}_{i+1,\,j+1}+{\omega }_{i+1,\,j+1}^{m}{\dot{e}}_{i+1,\,j+1}+{\dot{\omega }}_{i+1,\,j}^{x}{e}_{i+1,\,j}+{\omega }_{i+1,\,j}^{x}{\dot{e}}_{i+1,\,j}+{\dot{\omega }}_{i,\,j+1}^{y}{e}_{i,\,j+1}+{\omega }_{i,\,j+1}^{y}{\dot{e}}_{i,\,j+1}\)

end for

return \(\langle \nabla {{{{\rm{DeepBLAST}}}}}_{{{\Omega }}}(\theta ),Z\rangle ={\dot{v}}_{p,q},\ {\nabla }^{2}{{{{\rm{DeepBLAST}}}}}_{{{\Omega }}}(\theta )Z={(\dot{e})}_{i,\,j = 1}^{p,q}\)

Alignment loss function

By defining a loss function between the predicted alignment and the structural alignment from TM-align, we can evaluate the accuracy of DeepBLAST and fine-tune the functions M and G. Mensch et al.71 proposed using the Euclidean distance between the predicted and ground truth alignments as a loss function. However, we found that a cross-entropy loss provided more reasonable alignment results. This loss is given by

$$L({e}^{* },e)=\mathop{\sum}\limits_{i,\,j}{e}_{i,\,j}^{* }\log ({e}_{i,\,j})+(1-{e}_{i,\,j}^{* })\log (1-{e}_{i,\,j})$$ (2)

where e* is the ground truth alignment and e is the predicted alignment. As shown by Mensch et al.71, the predicted traceback matrix represents the expectation across all possible predicted alignments, which is represented as a matrix of probabilities. As a result, the resulting alignment problem can be interpreted as a classification task to identify whether two residues between a pair of proteins are alignable. This provides additional motivation for using cross-entropy as a loss function.
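A minimal sketch of this loss is given below. Equation (2) is written as a log-likelihood; the sketch returns its negative so that lower values are better, which is an assumption about the sign convention. The clipping constant `eps` and the function name `alignment_cross_entropy` are also illustrative choices.

```python
import math

def alignment_cross_entropy(e_true, e_pred, eps=1e-12):
    """Negative of the sum in equation (2) over all residue pairs (i, j)."""
    total = 0.0
    for row_true, row_pred in zip(e_true, e_pred):
        for t, p in zip(row_true, row_pred):
            p = min(max(p, eps), 1.0 - eps)   # keep the logarithms finite
            total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total

# Toy 2 x 2 traceback matrices: ground truth aligns the diagonal.
truth = [[1.0, 0.0], [0.0, 1.0]]
confident = [[0.9, 0.1], [0.1, 0.9]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]

loss_confident = alignment_cross_entropy(truth, confident)
loss_uninformative = alignment_cross_entropy(truth, uninformative)
```

Each cell is treated as an independent binary classification of whether residues i and j are aligned, so a prediction that concentrates probability on the true edges receives a lower loss than an uninformative one.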

Datasets

TM-Vec search

TM-Vec was trained on pairs of protein–domain sequences, along with data about the structural alignment for the pair. For every pair of proteins in our training dataset, we ran the method TM-align, which is an algorithm for protein structure comparison that is independent of protein sequences. TM-align produces a TM-score between 0 and 1, where a score below 0.2 represents a pair of unrelated proteins; a score above 0.5 implies that proteins are in the same fold; and 1 is a perfect match, indicating the same protein structure. Part of our pipeline involved validating whether our model could predict the TM-scores of pairs of proteins.

Protein-chain-pairs dataset

The model that we ultimately used to encode protein sequences was trained on pairs of protein chains. We sampled pairs of chains from SWISS-MODEL, which contains more than 500,000 chains. We made two different protein-chain-pair datasets, one with protein chains up to 300 residues long, and another with protein chains up to 1,000 residues long. For example, when we filtered out protein chains that were longer than 300 residues, we were left with 277,000 chains. With these chains in hand, we made pairs of chains, ensuring that we oversampled pairs of proteins with similar folds, using information from Gene3D75 about the predicted domains within protein chains. For all our pairs of protein chains, we ran TM-align using their SWISS-MODEL structures. We pulled out the TM-scores and sequence identity for every pair of chains. Last, we split our dataset into training, validation and test sets. For the chain-pairs dataset with chains up to 300 residues long, our train/validation split (randomly split during training) had 141 million pairs, and our held-out test dataset had 1 million pairs. Our chain-pairs dataset with chains up to 1,000 residues long had 320 million pairs.

Domain-pairs dataset

To determine whether our model could approximate TM-scores for domains and remote homologs, we built a dataset of pairs from the heavily curated CATH domains dataset. We started with the CATH nonredundant dataset of protein domains with no more than 40% sequence similarity. This dataset comprised 31,000 protein domains. We then filtered out domains that were longer than 300 residues, leaving 30,000 domains. All pairwise combinations of these 30,000 domains would lead to 450 million pairs; however, we aimed to build a balanced dataset, and dissimilar protein structures represented the vast majority of pairs (that is, domains with very different folds). Therefore, we undersampled pairs of CATH domains that came from different folds. The CATH dataset that we used for our experiments included 23 million pairs of domains.
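A quick check of the pair count quoted above, counting unordered pairs of 30,000 domains:

$$\binom{30{,}000}{2}=\frac{30{,}000\times 29{,}999}{2}=449{,}985{,}000\approx 4.5\times {10}^{8}$$

The 23 million sampled pairs therefore correspond to roughly 5% of all possible domain pairs.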

We further split this dataset into training/validation and testing splits, and we evaluated performance on CATHS40 on left-out domain pairs (where the domain pair was not in the training/validation dataset), left-out domains (either one or both domains not in the training/validation dataset) and left-out folds (either one or both domains from folds that were not in the training/validation dataset). Here, the fold family was from the topology classification in the CATH hierarchy. Our training/validation dataset contained 19 million pairs, our left-out pairs dataset contained 100,000 pairs, our left-out domains dataset contained 100,000 pairs, and our left-out folds dataset contained 500,000 pairs.

Malidup and Malisam datasets

Some of our sequence alignment benchmarks were performed on the curated Malisam44 and Malidup43 protein structural alignment benchmarking datasets. All the structural alignments analyzed were taken from the original benchmarks43,44. We also used Malidup to benchmark TM-Vec and DeepBLAST. Malidup consists of 241 pairwise structure alignments for homologous domains within the same chain. These pairs are structurally similar remote homologs. Malisam consists of 130 pairs of analogous motifs.

Structure alignment dataset

We trained DeepBLAST on 1.5 million alignments from the PDB47 obtained using TM-align15. These proteins were obtained from a curated collection of 40,000 protein structures76. Details of the model specification and training can be found in ref. 77.

Bacteriocins dataset

The bacteriocin sequences and metadata we used were from the bacteriocin database BAGEL45, and the putative unannotated bacteriocins were from Morton et al.58.

MIP novel fold dataset

In the MIP study, protein structures were predicted for 200,000 diverse microbial protein sequences, representing 148 putative novel folds, and the authors calculated TM-scores for pairs of proteins with novel folds48. We evaluated our TM-score predictions on 184,000 pairs of MIP proteins for which at least one protein in the pair had a novel fold.

ProtTucker benchmark dataset

ProtTucker was built to embed protein domains in a structure-aware way and uses CATH domains for its contrastive learning approach29. For this benchmark, we followed the ProtTucker training–lookup–test splits for the purpose of direct comparison with their method. Their training and lookup datasets consisted of 66,000 and 69,000 CATH domains, respectively. The test dataset did not include any domains with an HSSP-value > 0 with any of the lookup domains78 and consisted of 219 domains. We created a domain-pairs dataset from their set of 66,000 training domains in the same manner as our other CATH domain-pairs dataset by sampling pairs of domains and then running TM-align to produce TM-scores for the pairs. Our final training dataset included 35 million domain pairs.

DIAMOND benchmark dataset

The DIAMOND benchmark51 consisted of a large query dataset and a large lookup dataset of single and multidomain proteins. The lookup dataset was from the 14 September 2019 release of UniRef50 (ref. 16), which contained 37.5 million sequences; the authors then reduced this to a representative dataset of 7.74 million protein sequences with protein family annotations (SCOP)53. The query dataset was from the 25 October 2019 release of the NCBI nr database and also used the SCOP family annotations for proteins; the authors reduced this dataset to include at most 1,000 protein sequences for each SCOP superfamily, resulting in a dataset of 1.71 million queries. Finally, the authors locally shuffled both the query and the lookup sequences in this benchmark in 40-letter windows outside their annotation ranges.

Embedding methods data

For this evaluation we used the CATH NR-S40 dataset (NR-S40) (ref. 37), a collection of approximately 30,000 proteins of maximally 40% sequence identity, representing a diverse sampling of each tier in the CATH hierarchy. The dataset was partitioned into training, validation and test sets. All the benchmarks were conducted on the test set, and all trainable methods in the comparison study were trained using the training and validation sets.

TM-Vec training

The TM-Vec models trained on CATHS40 and SWISS-MODEL chains up to 300 residues long both had 17.3 million trainable parameters and were 199 MB in size. These models contained two transformer encoder layers. The TM-Vec models trained on CATHS100 domains (ProtTucker training domains) and SWISS-MODEL chains (up to 1,000 residues long) both had 34.1 million trainable parameters and were 391 MB in size. These models contained four transformer encoder layers.

The pretrained deep protein language model that we used, ProtTrans (ProtT5-XL-UniRef50), had no trainable parameters in our pipeline (the model parameters were frozen), as we used the model exclusively for extracting residue embeddings with a dimension of 1,024. Our transformer encoder layers had four attention heads and a dimension of 2,048 in their feedforward network. We used the Adam optimizer to train the weights, with an initial learning rate of 1 × 10−4 and a batch size of 32. The TM-Vec model trained on SWISS-MODEL chains up to 300 residues long was trained on eight Nvidia V100 GPUs for 5 days, which corresponded to five epochs of training.

DeepBLAST training

The final DeepBLAST model consisted of eight convolutional layers of dimension 1,024 to parameterize the match embeddings M and gap embeddings G. We used the same ProtTrans model to estimate residue vectors. The resulting model had more than 1.2 billion parameters. We used the Adam optimizer to train the weights, with an initial learning rate of 5 × 10−4, and the pretrained model weights were frozen. A batch size of 360 alignments was used for training. DeepBLAST was trained for 20 epochs on 24 Nvidia A100 GPUs for 6 days. The DeepBLAST model was trained on a dataset of 5 million alignments obtained from TM-align. Alignments containing more than 10 consecutive gaps or with a TM-score of less than 0.6 were excluded from the training dataset.

DeepBLAST alignment accuracy assessment

Alignment accuracy was assessed on a held-out test dataset of 1 million structural alignments. Validation loss was recorded during training, and we stopped training once the validation loss stopped decreasing (Supplementary Fig. 9). To determine how well DeepBLAST generalizes, a subset of more than 120,000 alignments drawn from the TM-align alignments held out from DeepBLAST training was analyzed. To evaluate the accuracy of the alignments, precision and recall were computed from the number of correctly identified matching residues. As each alignment can be represented as a bipartite graph in which the edges represent matching residues between two proteins, precision and recall can be extracted by comparing the edge sets of the predicted alignment and the known alignment. Supplementary Fig. 9 shows the distribution of correctly identified alignment edges, with a median recall and precision of 87%, suggesting that these models can generalize well beyond the training dataset.
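The edge-set comparison can be sketched as follows, with each alignment given as a collection of (i, j) residue index pairs. The function name and the toy alignments are hypothetical.

```python
def alignment_precision_recall(predicted_edges, true_edges):
    """Precision and recall of predicted matching-residue edges against a reference."""
    predicted = set(predicted_edges)
    truth = set(true_edges)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Toy alignments: each tuple pairs a residue index in X with one in Y.
true_edges = [(0, 0), (1, 1), (2, 2), (3, 4)]
predicted_edges = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)]
precision, recall = alignment_precision_recall(predicted_edges, true_edges)
```

Here three of the five predicted edges are correct (precision 0.6) and three of the four reference edges are recovered (recall 0.75).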

DIAMOND benchmark

The metric that we used to evaluate the performance of our method on the DIAMOND benchmark was sensitivity, which we defined as the percentage of the time the family annotations of the query protein were among the family annotations of the returned top n nearest neighbor proteins. For example, for the top 10 nearest neighbors, this quantifies the percentage of the time that the family annotations of the query protein are included in the family annotations of the returned top 10 nearest neighbor proteins.
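One way to compute this metric is sketched below. It assumes that a query counts as a hit when all of its family annotations appear among the pooled annotations of its top n neighbors; this is one reading of the definition above, and the function name and toy annotations are hypothetical.

```python
def top_n_sensitivity(query_families, neighbor_families, n):
    """Percentage of queries whose family annotations are covered by their top n neighbors.

    query_families: list of sets, one per query.
    neighbor_families: list (one entry per query) of ranked lists of sets,
        nearest neighbor first.
    """
    hits = 0
    for families, neighbors in zip(query_families, neighbor_families):
        pooled = set().union(*neighbors[:n])
        if families <= pooled:   # subset test (assumed interpretation of "among")
            hits += 1
    return 100.0 * hits / len(query_families)

# Toy annotations for three queries and their two nearest neighbors each.
queries = [{"a"}, {"b", "c"}, {"d"}]
neighbors = [
    [{"a"}, {"x"}],
    [{"b"}, {"c"}],
    [{"x"}, {"d"}],
]
sensitivity_top1 = top_n_sensitivity(queries, neighbors, n=1)
sensitivity_top2 = top_n_sensitivity(queries, neighbors, n=2)
```

With these toy annotations only the first query is covered at n = 1, whereas all three are covered at n = 2.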

Bacteriocin benchmark

We compared TM-Vec with three structure prediction methods for this benchmark: AlphaFold2, ESMFold and OmegaFold. ColabFold11 was used to run AlphaFold2 (ref. 10) using default parameters and the MMseqs2 pipeline. ESMFold v.1 was used for ESMFold structure predictions, and OmegaFold model 1 was used for OmegaFold structure predictions.

Embedding methods benchmarks

As shown in Fig. 2b, we compared TM-Vec with six other representations: one sequence-based method, ProtTrans35; and five different structure-based methods: cliques, GRAFENE79, ORCA80, CNN (influenced by DeepFRI81) and GCN (influenced by the Kipf and Welling graph autoencoder (GAE))82. Each structure-based method in some manner consumes a thresholded distance matrix, or contact map, and is used to output a fixed-sized feature vector that is meant to encode structural information.

The structure-based methods cliques, GRAFENE and ORCA output so-called manually engineered features; in particular, these feature vectors are histograms over known nonredundant graph substructures called graphlets. We introduce cliques as a simple baseline that consists of counting the ratio of nonoverlapping cliques up to size 7 inside a given contact map. ORCA and GRAFENE count more advanced graphlet substructures including graphlet orbits (which consider the relative node identity within the graphlet).

We also evaluated against two other methods that admit learned structure-based representations: DeepFRI and the Kipf and Welling GAE. Each method consists of training an autoencoder on contact maps and extracting average-pooled representations from one of the hidden layers in the inference mode. DeepFRI is a CNN autoencoder, whereas the GAE is a graph autoencoder. Both models are trained to minimize the binary cross-entropy of the original contact map and its reconstruction.

Of the five selected structure-based methods, four were permutation invariant; the exception was DeepFRI, which considers the canonical sequence ordering and treats the input matrix as an image. In addition, the manually crafted feature vectors do not scale well with graph density and hence cannot be evaluated for larger angstrom thresholds.

Evaluation metrics shown in Fig. 2b include cluster-adjusted mutual information and triplet-scoring AUPR. Each benchmark was applied to the top five most represented categories of each of the four CATH tiers separately. For cluster-adjusted mutual information, we applied spectral clustering using five clusters to the input feature vectors and calculated the adjusted mutual information between the cluster assignments and the actual label assignments. For triplet-scoring AUPR, we chose triplets in which two of the three shared the same label assignment, whereas the third was drawn from a different category. We constructed a balanced classification problem by considering the same-label pairs as the positive class and the same number of differently labeled pairs as the negative class. We used the cosine similarity among the selected positive and negative pairs as a classification prediction and calculated the AUPR.
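The triplet-scoring procedure can be sketched as follows. The sketch assumes that each triplet contributes one positive pair (anchor, same-label item) and one negative pair (anchor, different-label item), and it estimates AUPR by average precision; both choices, as well as all names and toy embeddings, are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda item: item[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def triplet_aupr(embeddings, triplets):
    """Score each (anchor, positive, negative) triplet as one positive and one negative pair."""
    scores, labels = [], []
    for anchor, positive, negative in triplets:
        scores.append(cosine(embeddings[anchor], embeddings[positive]))
        labels.append(1)
        scores.append(cosine(embeddings[anchor], embeddings[negative]))
        labels.append(0)
    return average_precision(scores, labels)

# Toy embeddings for two categories, "a" and "b".
embeddings = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "b1": [0.0, 1.0], "b2": [0.1, 0.9]}
triplets = [("a1", "a2", "b1"), ("b1", "b2", "a1")]
aupr = triplet_aupr(embeddings, triplets)
```

Because the toy embeddings separate the two categories cleanly, every same-label pair outranks every different-label pair and the score reaches its maximum of 1.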

Supplementary Tables 3 and 4 show the results of our comparison of TM-Vec with several methods on the CATHS20 and ProtTucker benchmarks. The commands used to run FoldSeek, HHBlits, MMseqs2 and Diamond are included in the TM-Vec software repository.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.