Abstract
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This raises the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
Introduction
Data representations play a crucial role in the statistical analysis of biological data. At its core, a representation is a distillation of raw data into an abstract, high-level and often lower-dimensional space that captures the essential features of the original data. This can subsequently be used for data exploration, e.g. through visualization, or task-specific predictions where limited data is available. Given the importance of representations it is no surprise that we see a rise in biology of representation learning^{1}, a subfield of machine learning where the representation is estimated alongside the statistical model. In the analysis of protein sequences in particular, recent years have produced a number of studies that demonstrate how representations can help extract important biological information automatically from the millions of observations acquired through modern sequencing technologies^{2,3,4,5,6,7,8,9,10,11,12,13,14}. While these promising results indicate that learned representations can have substantial impact on scientific data analysis, they also raise the question: what is a good representation? This elementary question is the focus of this paper.
A classic example of representation learning is principal component analysis (PCA)^{15}, which learns features that are linearly related to the original data. Contemporary techniques dispense with the assumption of linearity and instead seek highly nonlinear relations^{1}, often by employing neural networks. This has been particularly successful in natural language processing (NLP), where representations of word sequences are learned from vast online textual resources, extracting general properties of language that support subsequent specific language tasks^{16,17,18}. The success of such word sequence models has inspired their use for modeling biological sequences, leading to impressive results in application areas such as remote homologue detection^{19}, function classification^{20}, and prediction of mutational effects^{6}.
Since representations are becoming an important part of biological sequence analysis, we should think critically about whether the constructed representations efficiently capture the information we desire. This paper discusses this topic, with a focus on protein sequences, although many of the insights apply to other biological sequences as well^{13}. Our work consists of two parts. First, we consider representations in the transfer-learning setting. We investigate the impact of network design and training protocol on the resulting representation, and find that several current practices are suboptimal. Second, we investigate the use of representations for the purpose of data interpretation. We show that explicit modeling of the representation geometry allows us to extract robust and identifiable biological conclusions. Our results demonstrate a clear potential for designing representations actively, and for analyzing them appropriately.
Results
Representation learning has at least two uses: In transfer learning we seek a representation that improves a downstream task, and in data interpretation the representation should reveal the data’s underlying patterns, e.g. through visualization. Since the first has been at the center of recent literature^{4,5,8,9,10,20,21}, we place our initial focus there, and turn later to data interpretation.
Representations for transfer learning
Transfer learning addresses the problems caused by limited access to labeled data. For instance, when predicting the stability of a given protein, we only have limited training data available as it is experimentally costly to measure stability. The key idea is to leverage the many available unlabeled protein sequences to learn (pretrain) a general protein representation through an embedding model, and then train a problem-specific task model on top using the limited labeled training data (Fig. 1).
In the protein setting, learning representations for transfer learning can be implemented at different scopes. It can be addressed at a universal scope, where representations are learned to reflect general properties of all proteins, or it can be implemented at the scope of an individual protein family, where an embedding model is pretrained only on closely related sequences. Initially, we will focus on the universal setting, but will return to family-specific models in the second half of the paper.
When considering representations in the transfer-learning setting, the quality, or meaningfulness, of a representation is judged merely by the level of predictive performance obtained by one or more downstream tasks. Our initial task will therefore be to study how this performance depends on common modeling assumptions. A recent study established a benchmark set of predictive tasks for protein sequence representations^{5}. For our experiments below, we will consider three of these tasks, each reflecting a particular global protein property: (1) classification of protein sequences into a set of 1195 known folds^{22}, (2) fluorescence prediction for variants of the green fluorescent protein in Aequorea victoria^{23}, and (3) prediction of the stability of protein variants obtained in high-throughput experimental design experiments^{24}.
Fine-tuning can be detrimental to performance
In the transfer-learning setting, the pretraining phase and the task learning phase are conceptually separate (Fig. 1, left), but it is common practice to fine-tune the embedding model for a given task, which implies that the parameters of both models are in fact optimized jointly^{5}. Given the large number of parameters typically employed in embedding models, we hypothesize that this can lead to overfitted representations, at least in the common scenario where only limited data is available for the task learning phase.
To test this hypothesis we train three models, an LSTM^{25}, a Transformer^{26}, and a dilated residual network (Resnet)^{27}, on a diverse set of protein sequences extracted from Pfam^{28}, where we either keep the embedding model fixed (Fix) or fine-tune it to the task (Fin). To evaluate the impact of the representation model itself, we consider both a pretrained version (Pre) and randomly initialized representation models that are not trained on data (Rng). Such models will map similar inputs to similar representations, but should otherwise not perform well. Finally, as a naive baseline representation, we consider the direct one-hot encoding of each amino acid in the sequence. In all cases, we extract global representations using an attention-based averaging over local representations (Fig. 1, right).
Table 1 shows that fine-tuning the embedding clearly reduces test performance in two out of three tasks, confirming that fine-tuning can have significant detrimental effects in practice. Incidentally, we also note that the randomly initialized representation performs remarkably well in several cases, which echoes results known from random projections^{29}.
Implication: fine-tuning a representation to a specific task carries the risk of overfitting, since it often increases the number of free parameters substantially, and should therefore take place only under rigorous cross validation. Fixing the embedding model during task training should be the default choice.
Constructing a global representation as an average of local representations is suboptimal
One of the key modeling choices for biological sequences is how to handle their sequential nature. Inspired by developments in natural language processing, most of the recent representation learning advances for proteins use language models, which aim to reproduce their own input, either by predicting the next character given the sequence observed so far, or by predicting the entire sequence from a partially obscured input sequence. The representation learned by such models is a sequence of local representations (r_{1}, r_{2}, . . . , r_{L}) each corresponding to one amino acid in the input sequence (s_{1}, s_{2}, . . . , s_{L}). To successfully predict the next amino acid, r_{i} should contain information about the local neighborhood around s_{i}, together with some global signal reflecting properties of the complete sequence. In order to obtain a global representation of the entire protein, the variable number of local representations must be aggregated into a fixed-size global representation. A priori, we would expect this choice to be quite critical to the nature of the resulting representation. Standard approaches for this operation include averaging with uniform^{4,10} or learned attention^{5,30,31} weights or simply using the maximum value. However, the complex non-local interactions known to occur in a protein suggest that it could be beneficial to allow for more complex aggregation functions. To investigate this issue, we consider two alternative strategies (Fig. 1, right):
The first strategy (Concat) avoids aggregation altogether by concatenating the local representations r = [r_{1}, r_{2}, . . . , r_{L}, p, p, p] (with additional padding p to adjust for variable sequence length). This approach preserves all information stored in the local representations r_{i}. To make a fair comparison to the averaging strategy, we maintain the same overall representation size by scaling down the size of the local representations r_{i}. In our case, with a global representation size of 2048, and a maximal sequence length of 512, this means that we restrict the local representation to only four dimensions.
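As an illustration of the difference between pooling and concatenation, the following minimal Python sketch contrasts uniform averaging, maximum pooling, and the Concat strategy on toy local representations. The function names, the toy values, and the use of zeros for the padding p are assumptions made for this example and are not taken from the models described here.

```python
from typing import List

Vector = List[float]


def mean_pool(local: List[Vector]) -> Vector:
    """Uniform average over positions; the output has the local dimension."""
    return [sum(column) / len(local) for column in zip(*local)]


def max_pool(local: List[Vector]) -> Vector:
    """Element-wise maximum over positions."""
    return [max(column) for column in zip(*local)]


def concat_pad(local: List[Vector], max_len: int, pad: float = 0.0) -> Vector:
    """Concatenate r_1..r_L and pad to max_len positions (zero padding is assumed)."""
    if len(local) > max_len:
        raise ValueError("sequence is longer than max_len")
    dim = len(local[0])
    flat = [value for vector in local for value in vector]
    return flat + [pad] * (dim * (max_len - len(local)))


# Toy input: three residues with four-dimensional local representations.
local_reps = [[0.0, 1.0, 2.0, 3.0], [2.0, 1.0, 0.0, 3.0], [4.0, 1.0, 4.0, 0.0]]
pooled_mean = mean_pool(local_reps)
pooled_max = max_pool(local_reps)
global_rep = concat_pad(local_reps, max_len=512)  # 512 positions x 4 dims = 2048
```

Pooling discards the position of each local vector, whereas the concatenated vector keeps every r_{i} at a fixed offset, at the cost of a much smaller per-residue dimension for the same global size.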
As a second strategy (Bottleneck), we investigate the possibility of learning the optimal aggregation operation, using an autoencoder, a simple neural network that as output predicts its own input, but forces it through a low-dimensional bottleneck^{32}. The model thus learns a generic global representation during pretraining, in contrast to the strategies above in which the global representation arises as a deterministic operation on the learned local representations. We implement the Bottleneck strategy within the Resnet (convolutional) setting, where we have well-defined procedures for down- and upsampling the sequence length.
When comparing the two proposed aggregation strategies on the three protein prediction tasks (Stability, Fluorescence, Remote Homology), we observe a quite dramatic impact on performance (Table 2). The Bottleneck strategy, where the global representation is learned, clearly outperforms the other strategies. This was expected, since already during pretraining this model is encouraged to find a more global structure in the representations. More surprising are the results for the Concat strategy, as these demonstrate that even if we restrict the local representation to be much smaller than in standard sequential models, the fact that there is no loss of information during aggregation has a significant positive influence on the downstream performance.
Implication: if a global representation of proteins is required, it should be learned rather than calculated as an average of local representations.
Reconstruction error is not a good measure of representation quality
Any choice of embedding model will have a number of hyperparameters, such as the number of nodes in the neural network or the dimensionality of the representation itself. How do we choose such parameters? A common strategy is to make these choices based on the reconstruction capabilities of the embedding model, but is it reasonable to expect that this is also the optimal choice from the perspective of the downstream task?
As an example, we will consider the task of finding the optimal representation size. We trained and evaluated several Bottleneck Resnet models with varying representation dimensions and applied them to the three downstream tasks. The results show a clear pattern where the reconstruction accuracy increases monotonically with latent size, with the sharpest increase in the region of 10 to 500, but with marginal improvements all the way up to the maximum size of 10,000 (Supplementary Fig. S1). However, if we consider the three downstream tasks, we see that the performance in all three cases starts decreasing at around size 500–1000, thus showing a discrepancy between the optimal choice with respect to the reconstruction objective and the downstream task objectives. It is important to stress that the reconstruction accuracy is measured on a validation set, so our observation is not a matter of overfitting to the training data. We employed a carefully constructed train/validation/test partition of UniProt^{33} provided by Armenteros et al.^{21}, to avoid overlap between the sets. The results thus show that there is enough data for the embedding model to support a large representation size, while the downstream tasks prefer a smaller input size. The exact behavior will depend on the task and the available data for the two training phases, but we can conclude that there is generally no reason to believe that reconstruction accuracy and the downstream task accuracy will agree on the optimal choice of hyperparameters. Similar findings were reported in the TAPE study^{5}.
Implication: in transfer learning, optimal values for hyperparameters (e.g. representation size) can in general not be estimated during pretraining. They must be tuned for the specific task.
Representations for data interpretation: shaped by scope, model architecture, and data preprocessing
We now return to the use of representations for data interpretation. If a representation accurately describes the structure in the underlying dataset, we might expect it to be useful not only as input to a downstream model, but also as the basis for direct interpretation, for instance through visualization. In this context, it is important to realize that different modeling choices can lead to dramatically different interpretations of the same data. More troubling, even when using the same model assumptions, repeated training instances can also deviate substantially, and we must therefore analyze our interpretations with care. In the following, we explore these effects in detail.
Recent models for proteins tend to learn universal, cross-family representations of protein space. In bioinformatics, there is, however, a long history of analyzing proteins per family. Since the proteins in the same family share a common three-dimensional structure, an underlying correspondence exists between positions in different sequences, which we can approximate using multiple sequence alignment techniques. After establishing such an alignment, all input sequences will have the same length, making it possible to use simple fixed-size input models, rather than the sequential models discussed previously. One advantage is that models can now readily detect patterns at and correlations between absolute positions of the input, and directly observe both conservation and coevolution. In terms of interpretability, this has clear advantages. An example of this approach is the DeepSequence model^{2,12}, in which the latent space of a Variational Autoencoder (VAE) was shown to clearly separate the input sequences into different phyla, and capture covariance among sites on par with earlier coevolution methods. We reproduce this result using a VAE on the β-lactamase family PF00144 from Pfam^{28}, using a 2-dimensional latent space (Fig. 2, bottom right).
If we use the universal, full-corpus, sequence models (LSTM, Resnet, Transformer) and the Bottleneck Resnet from the previous sections to embed the same set of proteins from the β-lactamase family and use t-SNE^{34} to reduce the dimensionality of the protein representations into a two-dimensional space, we see no clear phylogenetic separation in the case of LSTM and Resnet, and very little for the Transformer and the Bottleneck Resnet (Fig. 2, top row). The fact that the phyla are much less clearly resolved in these sequential models is perhaps unsurprising, since these models have been trained to represent the space of all proteins, and therefore do not have the same capacity to separate details of a single protein family. Indeed, to compensate for this, recent work has introduced the concept of evotuning, where a universal representation is fine-tuned on a single protein family^{4,35}.
When training exclusively on β-lactamase sequences (Fig. 2, bottom row) we observe more structure for all models, but only the Transformer and Bottleneck Resnet are able to fully separate the different phyla. Comparing this to an alignment-based VAE model, we still see large differences in protein representations, despite the fact that all models now are trained on the same corpus of proteins.
The observed differences between representations are a combined effect arising from the following factors: (1) the inductive biases underlying the different model architectures, (2) the domain-specific knowledge inserted through preprocessing sequences when constructing an alignment, and (3) the post-processing of representation space to make it amenable to visualization in 2D (the three leftmost columns in Fig. 2 were processed using t-SNE, see Supplementary Figs. S3 and S4 for equivalent plots using PCA). Often, these contributions are interdependent, and therefore difficult to disentangle. For instance, the VAE can use a simple model architecture only because the sequences have been preprocessed into an alignment. Likewise, the simplicity of the VAE makes it possible to limit the size of the bottleneck to only 2 dimensions, and thereby avoid the need for post hoc dimensionality reduction, which can itself have a substantial impact on the obtained representation (Supplementary Figs. S3 and S4). Ideally, we would wish to directly obtain 2D representations for the sequential models as well, but all attempts to train variants of the LSTM, Resnet, and Transformer models with 2D latent representations were unfruitful. This suggests that the additional complexity inherent in the sequential modeling of unaligned sequences places restrictions on how simple we can make the underlying latent representation (see discussion in Supplementary Material).
Continuous progress is being made in the area of sequential modeling and its use for protein representation learning^{6,7,8,10,36,37}. In particular, transformers, when scaled up to hundreds of millions of parameters, have been shown capable of recovering the covariances among sites in a protein^{37,38}. When embedding the β-lactamase sequences using these large pretrained transformer models, we indeed also see an improved separation of phyla (Supplementary Fig. 5). It remains an open question whether representations extracted from such large transformer models will eventually be able to capture more information than what can be extracted using a simple model and a high-quality sequence alignment.
Implication: the scope of data (all proteins vs. single protein families), whether data is preprocessed into alignments, the model architecture, and potential post hoc dimensionality reduction all have a fundamental impact on the resulting representations, and the conclusions we can hope to draw from them. However, these contributions are often interdependent and difficult to disentangle in practice.
Representation space topology carries relevant information
The star-like structure of the VAE representation in Fig. 2, and the associated phyla color-coding, strongly suggest that the topology of this particular representation space is related to the tree topology of the evolutionary history underlying the protein family^{39}. As an example of the potential and limits of representation interpretability, we will proceed with a more detailed analysis of this space.
To explore the topological origin of the representation space, we estimate a phylogenetic tree of a subset of our input data (n = 200), and encode the inner nodes of the tree to our latent space using a standard ancestral reconstruction method (see Methods). Although the fit is not perfect—a few phyla are split and placed on opposite sides of the origin—there is generally a good correspondence (Fig. 3). We see that the reconstructed ancestors to a large extent span a meaningful tree, and it is thus clear that the representation topology in this case reflects relevant topological properties from the input space.
Implication: Although neural networks are high-capacity function estimators, we see empirically that topological constraints in input space are maintained in representation space. The latent manifold is thus meaningful and should be respected when relying on the representation for data interpretation.
Geometry gives robust representations
Perhaps the most exciting prospect of representation learning is the possibility of gaining new insights through the interpretation and manipulation of the learned representation space. In NLP, the celebrated word2vec model^{40} demonstrated that simple arithmetic operations on representations yielded meaningful results, e.g. “Paris − France + Italy = Rome”, and similar results are known from image analysis. The ability to perform such operations on proteins would have substantial impact on protein engineering and design, for instance making it possible to interpolate between biochemical properties or functional traits of a protein. What is required of our representations to support such interpolations?
To qualify the discussion, we note that standard arithmetic operations such as addition and subtraction rely on the assumption that the learned representation space is Euclidean. The star-like structure observed for the alignment-based VAE representation in Fig. 3 suggests that a Euclidean interpretation may be misleading: If we define similarities between pairs of points through the Euclidean distance between them, we implicitly assume straight-line interpolants that pass through uncharted territory in the representation space when moving between ‘branches’ of the star-like structure. This does not seem fruitful.
Mathematically, the Euclidean interpretation is also problematic. In general, the latent variables of a generative model are not statistically identifiable, such that it is possible to deform the latent representation space without changing the estimated data density^{41,42}. The Euclidean topology is also known to cause difficulties when learning data manifolds with different topologies^{43,44}. With this in mind, the Euclidean assumption is difficult to justify beyond arguments of simplicity, as Euclidean arithmetic is not invariant to general deformations of the representation space. It has recently been pointed out that shortest paths (geodesics) and distances between representation pairs can be made identifiable even if the latent coordinates of the points themselves are not^{42,45}. The trick is to equip the learned representation with a Riemannian metric which ensures that distances are measured in data space along the estimated manifold. This result suggests that perhaps a Riemannian set of operations is more suitable for interacting with learned representations than the usual Euclidean arithmetic operators.
To investigate this hypothesis, we develop a suitable Riemannian metric, such that geodesic distances correspond to expected distances between one-hot encoded proteins, which are integrated along the manifold. The VAE defines a generative distribution p(X∣Z) that is governed by a neural network. Here Z is a latent variable, and X a one-hot encoded protein sequence. To define a notion of distance and shortest path we start from a curve c in latent space, and ask what its natural length is. We parametrize the curve as \(c:[0,1]\to \mathcal{Z}\), where \(\mathcal{Z}\) is the latent space, and write c_{t} to denote the latent coordinates of the curve at time t. As the latent space can be arbitrarily deformed it is not sensible to measure the curve length directly in the latent space, and the classic geometric approach is to instead measure the curve length after a mapping to input space^{42}. For proteins, this amounts to measuring latent curve lengths in the one-hot encoded protein space. The shortest paths can then be found by minimizing curve length, and a natural distance between latent points is the length of this path.
An issue with this approach is that the VAE decoder is stochastic, such that the decoded curve is stochastic as well. To arrive at a practical solution, we recall that shortest paths are also curves of minimal energy^{42} defined as
\[\mathcal{E}(c)=\int_{0}^{1}\left\Vert \partial_{t}X_{t}\right\Vert^{2}\,\mathrm{d}t,\]
where X_{t} ~ p(X∣Z = c_{t}) denotes the protein sequence corresponding to latent coordinate c_{t}. Due to the stochastic decoder, the energy of a curve is a random variable. For continuous X, recent work^{45} has shown promising results when defining shortest paths as curves with minimal expected energy. In the Methods section we derive a similar approach for discrete one-hot encoded X and provide the details of the resulting optimization problem and its numerical solution.
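To make the notions of curve length and curve energy concrete, the sketch below approximates both by finite differences for a latent curve that is pushed through a decoder. It is a simplified illustration only: the decoder is a hypothetical deterministic toy function (a one-dimensional latent coordinate mapped onto the unit circle), so the expectation over the stochastic decoder and the discrete one-hot derivation given in Methods are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Latent = Sequence[float]


def length_and_energy(curve: List[Latent],
                      decode: Callable[[Latent], List[float]]) -> Tuple[float, float]:
    """Finite-difference length and energy of a latent curve, measured after decoding.

    `curve` holds c_t at equally spaced times t in [0, 1].
    """
    decoded = [decode(z) for z in curve]
    dt = 1.0 / (len(decoded) - 1)
    length, energy = 0.0, 0.0
    for x0, x1 in zip(decoded, decoded[1:]):
        squared_step = sum((b - a) ** 2 for a, b in zip(x0, x1))
        length += math.sqrt(squared_step)
        energy += squared_step / dt  # ||dX/dt||^2 * dt with dX/dt ~ (x1 - x0) / dt
    return length, energy


def toy_decode(z: Latent) -> List[float]:
    """Hypothetical decoder: maps a 1D latent coordinate onto the unit circle."""
    return [math.cos(z[0]), math.sin(z[0])]


steps = 100
latent_curve = [[math.pi * i / steps] for i in range(steps + 1)]
length, energy = length_and_energy(latent_curve, toy_decode)
endpoint_gap = math.dist(toy_decode(latent_curve[0]), toy_decode(latent_curve[-1]))
```

In this toy case the decoded curve is a half circle, so its length along the manifold is close to π, whereas the straight-line distance between the decoded endpoints is 2. For a curve traversed at constant speed the energy equals the squared length, which is why minimizing energy also yields shortest paths.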
To study the potential advantages of using geodesic over Euclidean distances, we analyze the robustness of our proposed distance. Since VAEs are not invariant to reparametrization we do not expect pairwise distances to be perfectly preserved between different initializations of the same model, but we hypothesize that the geodesics should provide greater robustness. We train the model 5 times with different seeds (see Supplementary Fig. S9) and calculate the same subset of pairwise distances. We normalize each set of pairwise distances by their mean and compute the distance standard deviation across trained models. When using normalized Euclidean distance we observe a mean standard deviation of 0.23, while for normalized geodesic distances we obtain a value of 0.11 (Fig. 4a). This significant difference indicates that geodesic distances are more robust to model retraining than their Euclidean counterparts.
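The robustness statistic described above can be sketched in a few lines of Python. The function names and the toy distance values are invented for illustration, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, pstdev
from typing import List


def normalize_by_mean(distances: List[float]) -> List[float]:
    """Rescale one model's pairwise distances so that their mean is 1."""
    average = mean(distances)
    return [d / average for d in distances]


def mean_std_across_models(distance_sets: List[List[float]]) -> float:
    """distance_sets[k][p] is the distance of point pair p under trained model k.

    Returns the standard deviation across models, averaged over pairs.
    """
    scaled = [normalize_by_mean(distances) for distances in distance_sets]
    per_pair_std = [pstdev(values) for values in zip(*scaled)]
    return mean(per_pair_std)


# Toy values (not real results): the second model is a rescaled copy of the
# first, so after normalization the two models agree perfectly.
rescaled_runs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
# Here the ordering of distances is reversed between the two models.
reversed_runs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]

stable_score = mean_std_across_models(rescaled_runs)
unstable_score = mean_std_across_models(reversed_runs)
```

Normalizing by the mean removes differences in overall scale between training runs, so the statistic only penalizes changes in the relative distances between pairs.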
Implication: distances and interpolation between points in representation space can be made robust by respecting the underlying geometry of the manifold.
Geodesics give meaning to representations
To further investigate the usefulness of geodesics, we revisit the phylogenetic analysis of Fig. 3, and consider how well distances in representation space correlate with the corresponding phylogenetic distances. The first two panels of Fig. 4b show the correlation between 500 subsampled Euclidean distances and phylogenetic distances in a Transformer and a VAE representation, respectively. We observe very little correlation in the Transformer representation, while the VAE fares somewhat better. The third panel of Fig. 4b shows the correlation between geodesic distances and phylogenetic distances for the VAE. We observe that the geodesic distances significantly increase the linear correlation, in particular for short-to-medium distances. Finally, in the last panel, we include as a baseline the expected Hamming distance, i.e. latent points decoded into their categorical distribution from which we draw 10 samples/sequences and calculate the average Hamming distance. We observe that the geodesics in latent space are a reasonable proxy for this expected distance in output space.
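The expected Hamming distance baseline can be approximated by sampling, as in the following minimal sketch. It assumes that the decoder output for a latent point is available as a list of per-position probability vectors, that 10 sequences are drawn from each of the two points, and that the Hamming distance is averaged over all sample pairs; the exact pairing scheme, the function names, and the two-letter toy alphabet are assumptions for illustration.

```python
import random
from typing import List

Profile = List[List[float]]  # per-position probabilities over the alphabet


def sample_sequence(profile: Profile, alphabet: str, rng: random.Random) -> str:
    """Draw one sequence by sampling each position independently."""
    return "".join(rng.choices(alphabet, weights=probs, k=1)[0] for probs in profile)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def expected_hamming(profile_a: Profile, profile_b: Profile, alphabet: str,
                     n_samples: int = 10, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected Hamming distance between two decodings."""
    rng = random.Random(seed)
    samples_a = [sample_sequence(profile_a, alphabet, rng) for _ in range(n_samples)]
    samples_b = [sample_sequence(profile_b, alphabet, rng) for _ in range(n_samples)]
    total = sum(hamming(a, b) for a in samples_a for b in samples_b)
    return total / (n_samples * n_samples)


# Toy decoder outputs over a two-letter alphabet, chosen to be deterministic.
toy_alphabet = "AC"
profile_1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # always decodes to "AAC"
profile_2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # always decodes to "ACA"
distance_12 = expected_hamming(profile_1, profile_2, toy_alphabet)
distance_11 = expected_hamming(profile_1, profile_1, toy_alphabet)
```

With real decoder outputs the per-position distributions are not one-hot, so the estimate varies with the seed and becomes more stable as the number of samples increases.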
Visually, the correspondence is also striking (Fig. 5). Well-optimized geodesics follow the manifold very closely, and to a large extent preserve the underlying tree structure. We see that the irregularities described before (e.g. the incorrect placement of the yellow subtree in the top right corner) are recognized by both the phylogenetic reconstruction and our geodesics, which is visually clear from the thick bundle of geodesics running diagonally to connect these regions.
Implication: Analyzing geodesic distances instead of Euclidean distances in representation space better reflects the underlying manifold, allowing us to extract biological distances that are more meaningful.
Data preprocessing affects the geometry
We have established that the preprocessing of protein sequences into an alignment has a strong effect on the learned representation. But how do alignment quality and sequence selection biases affect the learned representations? To build alignments, it is common to start with a single query sequence, and iteratively search for sequences similar to this query. If the intent is to make statements only about this particular query sequence (e.g. predicting effects of variants relative to this protein) then a common practice is to remove columns in the alignment for which the query sequence has a gap. This query-centric bias is further enhanced by the fact that the search for relevant sequences occurs iteratively based on similarity, and is thus bound to have greater sequence coverage for sequences close to the query. These effects would suggest that representations learned from query-centric alignments might be better descriptions of sequences close to the query.
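The column removal step that produces a query-centric alignment can be illustrated with a small Python sketch. The toy alignment, the gap character, and the helper names are assumptions for this example; the sketch also reports, for each sequence, the fraction of its residues that survives the trimming.

```python
from typing import List

GAP = "-"  # assumed gap symbol for this toy example


def drop_query_gap_columns(alignment: List[str], query_index: int = 0) -> List[str]:
    """Keep only the alignment columns in which the query sequence has a residue."""
    query = alignment[query_index]
    keep = [i for i, symbol in enumerate(query) if symbol != GAP]
    return ["".join(row[i] for i in keep) for row in alignment]


def retained_fraction(original: str, trimmed: str) -> float:
    """Fraction of a sequence's residues (non-gap symbols) left after trimming."""
    residues_before = sum(symbol != GAP for symbol in original)
    residues_after = sum(symbol != GAP for symbol in trimmed)
    return residues_after / residues_before


# Toy alignment: the first row is the query, the last row is the most distant.
toy_alignment = [
    "MK--LV",
    "MKA-LV",
    "M-GHLI",
]
trimmed_alignment = drop_query_gap_columns(toy_alignment)
fractions = [retained_fraction(before, after)
             for before, after in zip(toy_alignment, trimmed_alignment)]
```

In the toy alignment the query keeps all of its residues, while the most distant sequence loses the largest share, which is the mechanism by which trimming favors sequences close to the query.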
To test this hypothesis, we look at a more narrow subset of the β-lactamase family, covering only the class A β-lactamases. This subset was included as part of the DeepSequence paper^{2} and will serve as our representative example of a query-centric alignment. The class A β-lactamases consist of two subclasses, A1 and A2, which are known to display consistent differences in multiple regions of the protein. The query sequence in this case is the TEM from Escherichia coli, which belongs to subclass A1. Following earlier characterization of the differences between the subclasses, we consider a set of representative sequences from each of the subclasses, and probe how they are mapped to representation space (Class A1: TEM1, SHV1, PSE1, RTG2, CumA, OXY1, KLUA1, CTXM1, NMCA, SME1, KPC2, GES1, BEL1, BPS1. Class A2: PER1, CEF1, VEB1, TLA2, CIA1, CGA1, CME1, CSP1, SPU1, TLA1, CblA, CfxA, CepA). When training a representation model on the original alignment (Fig. 6a), we indeed see that the ability to reconstruct (decode) meaningful sequences from representation values differs dramatically between the A1 and A2 classes.
It is common practice to weight input sequences in alignments by their density in sequence space, which compensates for the sampling bias mentioned above^{46}. While this is known to improve the quality of the model for variant effect prediction^{2}, it only partially compensates for the underlying bias between the classes in our case (Fig. 6b). If we instead retrieve full-length sequences for all proteins, redo the alignment using standard software (Clustal Omega^{47}), and maintain the full alignment length, we see that the differences between the classes become much smaller (Fig. 6c, d). The reason is straightforward: as the distance from the query sequence increases, larger parts of a protein will occur within the regions corresponding to gaps in the query sequence. If such columns are removed, we discard more information about the distant sequences, and therefore see larger uncertainty (i.e. entropy) for the decoder of such latent values. Note that these differences in representation quality are not immediately clear through visual inspection alone (Supplementary Fig. S7).
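One common form of density-based reweighting assigns each sequence a weight equal to the inverse of the number of sequences in its neighborhood. The sketch below illustrates this idea; the neighborhood threshold of 0.2 in fractional Hamming distance, the function names, and the toy alignment are assumptions for this example and may differ from the scheme used in the cited work.

```python
from typing import List


def fractional_hamming(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def sequence_weights(alignment: List[str], theta: float = 0.2) -> List[float]:
    """Weight each sequence by 1 / (number of sequences closer than theta).

    Every sequence counts itself as a neighbor, so the count is at least 1.
    """
    weights = []
    for sequence in alignment:
        neighbors = sum(1 for other in alignment
                        if fractional_hamming(sequence, other) < theta)
        weights.append(1.0 / neighbors)
    return weights


# Toy alignment: three near-identical sequences and one distant sequence.
toy_alignment = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAAAG", "CCCCCCCCCC"]
weights = sequence_weights(toy_alignment)
effective_size = sum(weights)
```

The three near-duplicates share a single unit of weight, so the densely sampled region no longer dominates training. The pairwise loop is quadratic in the number of sequences and is only meant to show the principle.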
Implication: Alignment-based representations depend critically on the nature of the multiple sequence alignment. In particular, training on query-centric alignments results in representations that primarily describe sequence variation around a single query sequence. In general, density-based reweighting of sequences should be used to counter selection bias.
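The density-based reweighting referred to above can be illustrated with a short sketch. It assumes one common scheme, in which each sequence is weighted by the inverse of the number of sequences lying within a normalized Hamming distance threshold of it. The function name `sequence_weights`, the integer encoding of the alignment, and the threshold of 0.2 are illustrative assumptions and are not taken from this paper.

```python
import numpy as np

def sequence_weights(msa, threshold=0.2):
    """Weight each aligned sequence by the inverse size of its neighborhood.

    msa: integer array of shape (n_sequences, alignment_length).
    threshold: normalized Hamming distance below which two sequences
        count as neighbors (hypothetical default).
    """
    msa = np.asarray(msa)
    # Pairwise normalized Hamming distances, shape (n_sequences, n_sequences).
    dist = (msa[:, None, :] != msa[None, :, :]).mean(axis=2)
    # Every sequence is its own neighbor, so each count is at least 1.
    neighbors = (dist < threshold).sum(axis=1)
    return 1.0 / neighbors

# Toy alignment: three identical sequences and one distant sequence.
toy_msa = np.array([
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [4, 3, 2, 1, 0],
])
weights = sequence_weights(toy_msa)
```

In the toy alignment the three identical sequences share one unit of weight, while the isolated sequence keeps a full unit, which is the sense in which densely sampled regions of sequence space are down-weighted.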
Geodesics provide more meaningful interpolation
The output distributions obtained by decoding from representation space provide interpretable insights into the nature of the representation. We illustrate this by constructing an interpolant along the geodesic from a subclass A1 member to a subclass A2 member (Fig. 7a). We calculate the entropy of the output distribution (summed over all sequence positions) along the interpolant and observe that there is a clear transition with elevated entropy around point 5 (highlighted in red). To investigate which regions of the protein are affected, we calculate the Kullback-Leibler divergence between the output distributions of the endpoints (Fig. 7b). Zooming in on these particular regions (Fig. 7c, left), and following them along the interpolant, we see that the representation naturally captures transitions between amino acid preferences at different sites. Most of these correspond to sites already identified in prior literature, for instance disappearance of the cysteine at position 77, the switch between N → D at position 136, and D → N at position 179^{48}. We also see an example where a region in one class aligns to a gap in the other (position 50-52). The linear interpolation (Fig. 7c, right) has similar statistics at the endpoints, but displays an almost trivial interpolation trajectory, which effectively interpolates linearly between the probability levels of the output classes at the endpoints (note for instance the minor preference for cysteine in the A2 region at position 77).
Implication: Geodesics provide natural interpolants between points in representation space, avoiding high entropy regions, and thereby providing interpolated values that are better supported by data.
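The two diagnostics used above, the entropy of the output distribution summed over sequence positions and the per-position Kullback-Leibler divergence between the endpoint distributions, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the toy distributions are invented, and the path is a plain mixture of the endpoint distributions standing in for decoded points along a curve, not the geodesic computed in the paper.

```python
import numpy as np

def total_entropy(probs, eps=1e-12):
    """Entropy (in nats) of per-position categorical distributions, summed over positions.

    probs: array of shape (seq_len, n_classes); each row sums to 1.
    """
    probs = np.asarray(probs, dtype=float)
    return float(-(probs * np.log(np.clip(probs, eps, 1.0))).sum())

def per_position_kl(p, q, eps=1e-12):
    """KL(p || q) for each sequence position; inputs have shape (seq_len, n_classes)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    log_ratio = np.log(np.clip(p, eps, 1.0)) - np.log(np.clip(q, eps, 1.0))
    return (p * log_ratio).sum(axis=1)

# Toy endpoint distributions (hypothetical): position 0 switches its preferred
# class between the endpoints, position 1 is uninformative at both.
p_start = np.array([[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]])
p_end = np.array([[0.01, 0.97, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]])

# Stand-in for decoding along a curve: a mixture of the endpoint distributions.
path = [(1 - t) * p_start + t * p_end for t in np.linspace(0.0, 1.0, 5)]
entropies = [total_entropy(p) for p in path]
kl = per_position_kl(p_start, p_end)
```

Positions with a large KL value are those where the two endpoints disagree, which is how regions of interest can be singled out before following them along an interpolant.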
Discussion
Learned representations of protein sequences can substantially improve systems for making biological predictions, and may also help to reveal previously undiscovered biological information. In this paper, we have illuminated parts of the answer to the question of what constitutes a meaningful representation of proteins. One of the conclusions is that the question itself does not have a single general answer, and must always be qualified with a specification of the purpose of the representation. A representation that is suitable for making predictions may not be optimal for a human investigator to better understand the underlying biology, and vice versa. The enticing idea of a single protein representation for all tasks thus seems unworkable in practice.
Designing purposeful representations
Designing a representation for a given task requires reflection over which biological properties we wish the representation to encapsulate. Different biological aspects of a protein will place different demands on the representations, but it is not straightforward to enforce specific properties in a representation. We can, however, steer the representation learning by (1) picking appropriate model architectures, (2) preprocessing the data, (3) choosing suitable objective functions, and (4) placing prior distributions on parts of the model. We discuss each of these in turn.
Informed network architectures can be difficult to construct as the usual neural network ‘building blocks’ are fairly elementary mathematical functions that are not immediately linked to high-level biological information. Nonetheless, our discussion of length-invariant sequence representations is a simple example of how one might inform the model architecture of the biology of the task. It is generally acknowledged that global protein properties are not linearly related to local properties. It is therefore not surprising that model performance improves significantly when we allow the model to learn such a nonlinear relationship instead of relying on the common linear average of local representations. It would be interesting to push this idea beyond the ResNet architecture that we explored here, in particular in combination with the recent large-scale transformer-based language models. We speculate that while similar ‘low-hanging fruit’ may remain in currently applied network architectures, they are limited, and more advanced tools are needed to encode biological information into network architectures. The internal representations in attention-based architectures have been shown to recover known physical interactions between proteins^{37,38}, opening the door to the incorporation of prior information about known physical interactions in a protein. Recent work on permutation and rotation invariance/equivariance in neural networks^{49,50} holds promise, though it has yet to be explored exhaustively in representation learning.
Data preprocessing and feature engineering are frowned upon in contemporary ‘end-to-end’ representation learning, but they remain an important part of model design. In particular, preprocessing using the vast selection of existing tools from computational biology is a valuable way to encode existing biological knowledge into the representation. We saw a significant improvement in the representation capabilities of unsupervised models when trained on aligned protein sequences, as this injects prior knowledge about comparable sequence positions in a set of sequences. While recent work increasingly aims to learn such signals directly from data^{7,37,38}, it remains unclear if the advantages provided by multiple alignments can be fully encapsulated by these methods. Other preprocessing techniques, such as the reweighting of sequences, are currently also dependent on having aligned sequences. These examples suggest that if we move too fast towards ‘end-to-end’ learning, we risk throwing the baby out with the bathwater, by discarding years of experience embedded in existing tools.
Relevant objective functions are paramount to any learning task. Although representation learning is typically conducted using a reconstruction loss, we demonstrate that optimal representations according to this objective are generally suboptimal for any specific transfer-learned task. This suggests that hyperparameters of representations should be chosen based on downstream task-specific performance, rather than reconstruction performance on a hold-out set. This is, however, a delicate process, as optimizing the parameters of the representation model on the downstream task is associated with a high risk of overfitting. We anticipate that principled techniques for combining reconstruction objectives on the large unsupervised data sets with task-specific objectives in a semi-supervised learning setting will provide substantial benefits in this area^{51}.
Informative priors can impose softer preferences than those encoded by hard architecture constraints. The Gaussian prior in VAEs is such an example, though its preference is not guided by biological information, which appears to be a missed opportunity. In the studies of β-lactamase, we, and others^{2,39}, observe a representation structure that resembles the phylogenetic tree spanned by the evolution of the protein family. Recent hyperbolic priors^{52} that are designed to emphasize hierarchies in data may help to more clearly bring forward such evolutionary structure. Since we observe that the latent representation better reflects biology when endowed with a suitable Riemannian metric, it may be valuable to use corresponding geometric priors^{53}.
Analyzing representations appropriately
Even with the most valiant efforts to incorporate prior knowledge into our representations, they must still be interpreted with great care. We highlight the particular example of distances in representation space, and emphasize that the seemingly natural Euclidean distances are misleading. The nonlinearity of encoders and decoders in modern machine learning methods means that representation spaces are generally non-Euclidean. We have demonstrated that by bringing the expected distance from the observation space into the representation space in the form of a Riemannian metric, we obtain geodesic distances that correlate significantly better with phylogenetic distances than what can be attained through the usual Euclidean view. This is an exciting result as the Riemannian view comes with a set of natural operators akin to addition and subtraction, such that the representation can be engaged with operationally. We expect this to be valuable for e.g. protein engineering, since it gives an operational way to combine representations from different proteins.
In this study, we employed our geometric analysis only on the latent space of a variational autoencoder, which is well-suited due to its smooth mapping from a fixed-dimensional latent space to a fixed-dimensional output space. Expanding beyond single protein families is hindered by the fact that we cannot decode from an aggregated global representation in a sequential language model. A natural question is whether Bottleneck strategies like the one we propose could make such analysis possible. If so, it would present new possibilities for defining meaningful distances between remote homologues in latent space^{19}, and potentially allow for improved transfer of GO/EC annotations between proteins.
Finally, the geometric analysis comes with several implications that are relevant beyond proteins. It suggests that the commonly applied visualizations where latent representations are plotted as points on a Euclidean screen may be highly misleading. We therefore see a need for visualization techniques that faithfully reflect the geometry of the representations. The analysis also indicates that downstream prediction tasks may gain from leveraging the geometry, although standard neural network architectures do not yet have such capabilities.
Methods
Variational autoencoders
A variational autoencoder assumes that data X is generated from some (unknown) latent factors Z through the process p_{θ}(X∣Z). The latent variables Z can be viewed as the compressed representation of X. Latent space models try to model the joint distribution of X and Z as p_{θ}(X, Z) = p_{θ}(Z)p_{θ}(X∣Z). The generating process can then be viewed as a two-step procedure: first a latent variable Z is sampled from the prior and then data X is sampled from the conditional p_{θ}(X∣Z) (often called the decoder). Since X is discrete by nature, p_{θ}(X∣Z) is modeled as a Categorical distribution p_{θ}(X∣Z) ~ Cat(C, l_{θ}(Z)) with C classes and l_{θ}(Z) being the log-probabilities for each class. To make the model flexible enough to capture higher-order amino acid interactions, we model l_{θ}(Z) as a neural network. Even though data X is discrete, we use continuous latent variables Z ~ N(0, 1).
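The two-step generative process can be made concrete with a small sketch: draw Z from the standard normal prior, map it through a decoder network to per-position log-probabilities, and sample a discrete sequence from the resulting Categorical distributions. The network below is a randomly initialized two-layer perceptron, and all sizes, weights and names (`decode_log_probs`, `latent_dim`, `seq_len`, and so on) are hypothetical placeholders rather than the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the paper's settings).
latent_dim, hidden, seq_len, n_classes = 2, 16, 10, 21

# Randomly initialized decoder weights standing in for a trained network.
W1 = rng.normal(size=(latent_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, seq_len * n_classes))
b2 = np.zeros(seq_len * n_classes)

def decode_log_probs(z):
    """Map a latent point z to log-probabilities of shape (seq_len, n_classes)."""
    h = np.tanh(z @ W1 + b1)
    logits = (h @ W2 + b2).reshape(seq_len, n_classes)
    # Numerically stable log-softmax over the classes at each position.
    logits = logits - logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Step 1: sample the latent variable from the prior N(0, I).
z = rng.standard_normal(latent_dim)

# Step 2: sample one class per sequence position from the decoder distribution.
log_probs = decode_log_probs(z)
probs = np.exp(log_probs)
x = np.array([rng.choice(n_classes, p=p / p.sum()) for p in probs])
```

Here `x` is an integer-encoded sequence of length `seq_len`; mapping the integers back to amino acid letters would require an alphabet, which is left out of the sketch.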
Construction of entropy network
To ensure that our VAE decodes to high uncertainty in regions of low data density, we construct an explicit network architecture with this property. That is, the network p_{θ}(X∣Z) should be certain about its output in regions where we have observed data, and uncertain in regions where we have not. This has been shown to be important to get well-behaved Riemannian metrics^{42,54}. In a standard VAE with posterior modeled as a normal distribution \(\mathcal{N}(\mu_{\theta}(\mathbf{Z}),\sigma_{\theta}^{2}(\mathbf{Z}))\), this amounts to constructing a variance network \(\sigma_{\theta}^{2}(\mathbf{Z})\) that increases away from data^{45,55}. However, no prior work has been done on discrete distributions, such as the Categorical distribution C(μ_{θ}(Z)) that we are working with. In this model we do not have a clear division of the average output (mean) and uncertainty (variance), so we control the uncertainty through the entropy of the distribution. Recall that for a categorical distribution, the entropy is
\[ H(\mathbf{X}\mid\mathbf{Z}) = -\sum_{i=1}^{C} p(\mathbf{X}\mid\mathbf{Z})_{i}\,\log p(\mathbf{X}\mid\mathbf{Z})_{i}. \]
The most uncertain case corresponds to when H(X∣Z) is largest, i.e. when p(X∣Z)_{i} = 1/C for i = 1, . . . , C. Thus, we want to construct a network p_{θ}(X∣Z) that assigns equal probability to all classes when we are away from data, but is still flexible when we are close to data. Taking inspiration from^{55}, we construct a function α = T(z) that maps distance in latent space to the zero-one domain (\(T:[0,\infty)\mapsto[0,1]\)). T is a trainable network of the model, with the functional form \(T(\mathbf{z})=\mathtt{sigmoid}\left(\frac{6.9077-\beta\cdot V(\mathbf{z})}{\beta}\right)\) with \(V(\mathbf{z})=\min_{j\in\{1,\ldots,K\}}\|\mathbf{z}-\boldsymbol{\kappa}_{j}\|_{2}^{2}\), where κ_{j} are trainable cluster centers (initialized using k-means). This function essentially estimates how close a latent point z is to the data manifold, returning 1 if we are close and 0 when far away. Here K indicates the number of cluster centers (hyperparameter) and β is an overall scaling (trainable, constrained to the positive domain). With this network we can ensure a well-calibrated entropy by picking
\[ \tilde{p}_{\theta}(\mathbf{X}\mid\mathbf{Z})_{i} = \alpha\, p_{\theta}(\mathbf{X}\mid\mathbf{Z})_{i} + (1-\alpha)\,\mathbb{L}, \]
where \({\mathbb{L}}=\frac{1}{C}\). For points far away from data, we have α = 0 and return \({\mathbb{L}}\) regardless of category (class), giving maximal entropy. When near the data, we have α = 1 and the entropy is determined by the trained decoder p_{θ}(X∣Z)_{i}.
Figure 8 shows the difference in entropy of the likelihood between a standard VAE (left) and a VAE equipped with our developed entropy network (right). The standard VAE produces arbitrary entropy, and is often more confident in its predictions far away from the data. Our network increases entropy as we move away from data.
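The entropy calibration described in this section can be sketched in a few lines. The sketch rests on two assumptions that should be read as such: the numerator of T is taken to be 6.9077 − β·V(z), so that T decreases with distance from the cluster centers, and the calibrated output is taken to be the mixture α·p_θ + (1 − α)/C, which is inferred from the description of the two limiting cases α = 0 and α = 1. The cluster centers, decoder output and function names are hypothetical.

```python
import numpy as np

def closeness(z, centers, beta):
    """alpha = T(z): close to 1 near the cluster centers, close to 0 far away."""
    # V(z): squared Euclidean distance to the nearest cluster center.
    v = ((centers - z) ** 2).sum(axis=1).min()
    x = (6.9077 - beta * v) / beta
    # sigmoid(x) written via tanh to avoid overflow for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def calibrated_probs(decoder_probs, alpha):
    """Mix the decoder output with the uniform distribution 1/C (assumed form)."""
    n_classes = decoder_probs.shape[-1]
    return alpha * decoder_probs + (1.0 - alpha) / n_classes

def entropy(probs, eps=1e-12):
    """Entropy (in nats) of a single categorical distribution."""
    return float(-(probs * np.log(np.clip(probs, eps, 1.0))).sum())

# Hypothetical cluster centers and a confident decoder output over C = 4 classes.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
decoder_probs = np.array([0.9, 0.05, 0.03, 0.02])

alpha_near = closeness(np.array([0.0, 0.0]), centers, beta=1.0)
alpha_far = closeness(np.array([100.0, 100.0]), centers, beta=1.0)
h_near = entropy(calibrated_probs(decoder_probs, alpha_near))
h_far = entropy(calibrated_probs(decoder_probs, alpha_far))
```

With these toy values the entropy near a center stays close to that of the decoder output, while far from all centers it reaches the maximum log C, which is the behavior the construction is meant to enforce.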
Distance in sequence space
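To calculate geodesic distances we first need to define geodesics over the random manifold defined by p(X∣Z). These geodesics are curves c that minimize expected energy^{42} defined as
\[ \overline{\mathcal{E}}(c) = \mathbb{E}\left[\int \left\|\partial_{t}\mathbf{X}_{t}\right\|^{2}\,\mathrm{d}t\right], \]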
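where X_{t} ~ p(X∣Z = c_{t}) is the decoding of a latent point c_{t} along the curve c. This energy requires a meaningful (squared) norm in data space. Recall that protein sequence data x, y is embedded into a one-hot space, i.e.
\[ \mathbf{x},\mathbf{y}\in\{0,1\}^{C},\qquad \sum_{d=1}^{C}x_{d}=\sum_{d=1}^{C}y_{d}=1, \]
where we assume that p(x_{d} = 1) = a_{d}, p(y_{d} = 1) = b_{d} for d = 1, . . . , C. It can easily be shown that the squared norm between two such one-hot vectors can either be 0 or 2:
\[ \|\mathbf{x}-\mathbf{y}\|_{2}^{2}=\begin{cases}0, & \mathbf{x}=\mathbf{y},\\ 2, & \mathbf{x}\neq\mathbf{y}.\end{cases} \]
Treating x and y as independent, the probabilities of these two events are given as
\[ P\left(\|\mathbf{x}-\mathbf{y}\|_{2}^{2}=0\right)=\sum_{d=1}^{C}a_{d}b_{d},\qquad P\left(\|\mathbf{x}-\mathbf{y}\|_{2}^{2}=2\right)=1-\sum_{d=1}^{C}a_{d}b_{d}. \]
The expected squared distance is then given by
\[ \mathbb{E}\left[\|\mathbf{x}-\mathbf{y}\|_{2}^{2}\right]=2\left(1-\sum_{d=1}^{C}a_{d}b_{d}\right). \]
Extending this measure to two sequences of length L, with a_{l,d} and b_{l,d} denoting the corresponding probabilities at position l, gives
\[ \sum_{l=1}^{L}2\left(1-\sum_{d=1}^{C}a_{l,d}\,b_{l,d}\right). \tag{10} \]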
The energy of a curve can then be evaluated by integrating this sequence measure (10) along the given curve,
where Δt = ∣∣c_{i+1} − c_{i}∣∣_{2}. Geodesics can then be found by minimizing this energy (11) with respect to the unknown curve c. For an optimal curve c, its length is given by \(\sqrt{\overline{\mathcal{E}}(c)}\).
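The expected squared distance between two independently drawn one-hot sequences has the closed form 2(1 − Σ_d a_d b_d) per position, summed over positions. The sketch below evaluates that closed form and checks it against an explicit enumeration over all pairs of one-hot vectors. The function names and toy probabilities are illustrative assumptions, and the sketch covers only the sequence measure, not the discretized curve energy.

```python
import numpy as np

def expected_sq_distance(a, b):
    """Closed-form expected squared distance between two one-hot sequences.

    a, b: arrays of shape (seq_len, n_classes) holding per-position class
    probabilities; the two sequences are assumed to be drawn independently.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((2.0 * (1.0 - (a * b).sum(axis=1))).sum())

def brute_force_single_site(a_site, b_site):
    """Enumerate all pairs of one-hot vectors at one position (for checking)."""
    n_classes = len(a_site)
    eye = np.eye(n_classes)
    total = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            sq_norm = ((eye[i] - eye[j]) ** 2).sum()  # 0 if i == j, else 2
            total += a_site[i] * b_site[j] * sq_norm
    return total

# Toy probabilities for two positions and three classes (hypothetical values).
a = np.array([[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]])
b = np.array([[0.1, 0.3, 0.6], [0.0, 1.0, 0.0]])

closed_form = expected_sq_distance(a, b)
enumerated = sum(brute_force_single_site(a[l], b[l]) for l in range(len(a)))
```

The second position shows the extreme case: two deterministic but different one-hot vectors always differ, so that position contributes exactly 2.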
Optimizing geodesics
In principle, the geodesics could be found by direct minimization of the expected energy. However, empirically we observed that this strategy was prone to diverge, since the optimization landscape is very flat near the initial starting point. We therefore instead discretize the entropy landscape into a 2D grid, and form a graph based on this. In this graph each node will be a point in the grid, which is connected to its eight nearest neighbors, with the edge weight being the distance weighted with the entropy. Then, using Dijkstra’s algorithm^{56} we can rapidly find a robust initialization of each geodesic. To obtain the final geodesic curve we fit a cubic spline^{57} to the discretized curve found by Dijkstra’s algorithm, and afterwards do 10 gradient steps over the spline coefficients with respect to the curve energy (11) to refine the solution.
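The graph-based initialization can be sketched with Dijkstra's algorithm on an 8-connected grid. The text specifies the edge weight only as the distance weighted with the entropy, so the concrete choice below (Euclidean step length multiplied by the mean entropy of the two grid cells) is an assumption, as are the function name `grid_shortest_path` and the toy entropy landscape. The spline fit and the gradient refinement are not included.

```python
import heapq
import numpy as np

def grid_shortest_path(cost, start, goal):
    """Dijkstra's algorithm on a 2D grid where each cell has eight neighbors.

    cost: 2D array of non-negative per-cell values (e.g. decoder entropy).
    start, goal: (row, col) tuples.
    Returns the path as a list of cells and its total weight.
    """
    n_rows, n_cols = cost.shape
    dist = np.full(cost.shape, np.inf)
    prev = {}
    dist[start] = 0.0
    heap = [(0.0, start)]
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < n_rows and 0 <= nc < n_cols:
                # Assumed edge weight: step length times mean cost of both cells.
                w = np.hypot(dr, dc) * 0.5 * (cost[r, c] + cost[nr, nc])
                nd = d + w
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], float(dist[goal])

# Toy landscape: low entropy everywhere except a high-entropy wall in column 2.
entropy_grid = np.ones((5, 5))
entropy_grid[0:4, 2] = 100.0
path, length = grid_shortest_path(entropy_grid, start=(0, 0), goal=(0, 4))
```

In the toy landscape the only low-entropy crossing of the wall is the cell in the bottom row, so the returned path detours through it instead of taking the straight line along the top row.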
Phylogeny and ancestral reconstruction
The n = 200 points used for the ancestral reconstruction were chosen as latent embeddings from the training set that were closest to the trainable cluster centers \(\{\boldsymbol{\kappa}_{i}\}_{i=1}^{n}\) found during the estimation of the entropy network. We used FastTree2^{58} with standard settings for estimation of phylogenetic trees and subsequently applied the codeml program^{59} from the PAML package for ancestral reconstruction of the internal nodes of the tree.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All data used in this manuscript originates from publicly available databases. The sequence data used for pretraining and data for the different protein tasks are available as part of the TAPE repository (https://github.com/songlab-cal/tape). Predefined, curated train/validation/test splits of UniProt were extracted as part of the UniLanguage repository (https://github.com/alrojo/UniLanguage). Data for the β-lactamase family was extracted from the Pfam database (https://pfam.xfam.org/family/PF00144, accessed Jan 2020). Preprocessed data is available through the scripts provided in our code repository.
Code availability
The source code for the paper is freely available^{60} online under an open source license (https://github.com/MachineLearningLifeScience/meaningful-protein-representations).
References
Bengio, Y., Courville, A. & Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2019).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rao, R. et al. Evaluating protein transfer learning with TAPE. In Advances in neural information processing systems 32, 9689–9701 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118, e2016239118 (2021).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 1–11 (2021).
Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20, 723 (2019).
Madani, A. et al. Progen: Language modeling for protein generation. arXiv: 2004.03497 (2020).
Elnaggar, A. et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Trans. Pattern Anal. Mach. Intell. 1–1 (2021).
Lu, A. X., Zhang, H., Ghassemi, M. & Moses, A. Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization. bioRxiv: 2020.09.04.283929 (2020).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 4, btab083 (2021).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 1–10 (2021).
Jolliffe, I. Principal Component Analysis (Springer, 1986).
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving Language Understanding by Generative Pre-Training. Tech. rep. (OpenAI, 2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (2019).
Liu, Y. et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv: 1907.11692 (2019).
Morton, J. et al. Protein Structural Alignments From Sequence. bioRxiv: 2020.11.03.365932 (2020).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 1–14 (2021).
Armenteros, J. J. A., Johansen, A. R., Winther, O. & Nielsen, H. Language modelling for biological sequences - curated datasets and baselines. bioRxiv (2020).
Hou, J., Adhikari, B. & Cheng, J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 34, 1295–1303 (2018).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Comput. 9, 1735–1780 (1997).
Vaswani, A. et al. Attention is all you need. In Advances in neural information processing systems 5998–6008 (2017).
Yu, F., Koltun, V. & Funkhouser, T. Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition 472–480 (2017).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2018).
Bingham, E. & Mannila, H. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining 245–250 (2001).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light Attention Predicts Protein Location from the Language of Life. bioRxiv: 2021.04.25.441334 (2021).
Monteiro, J., Alam, M. J. & Falk, T. On The Performance of Time-Pooling Strategies for End-to-End Spoken Language Identification. In Proceedings of the 12th Language Resources and Evaluation Conference 3566–3572 (European Language Resources Association, 2020).
Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37, 233–243 (1991).
The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2018).
Van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Rao, R., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. In International Conference on Learning Representations (2020).
Vig, J. et al. BERTology Meets Biology: Interpreting Attention in Protein Language Models. In International Conference on Learning Representations (2021).
Ding, X., Zou, Z. & Brooks, C. L. Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 1–13 (2019).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv: 1301.3781 (2013).
Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
Hauberg, S. Only Bayes should learn a manifold (on the estimation of differential geometric structure from data). arXiv: 1806.04994 (2018).
Falorsi, L. et al. Explorations in Homeomorphic Variational Auto-Encoding. In ICML18 Workshop on Theoretical Foundations and Applications of Deep Generative Models (2018).
Davidson, T. R., Falorsi, L., Cao, N. D., Kipf, T. & Tomczak, J. M. Hyperspherical Variational Auto-Encoders. In Uncertainty in Artificial Intelligence (2018).
Arvanitidis, G., Hansen, L. K. & Hauberg, S. Latent space oddity: On the curvature of deep generative models. In International Conference on Learning Representations (2018).
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. & Aurell, E. Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical Review E. 87, 012707 (2013).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Philippon, A., Slama, P., Dény, P. & Labia, R. A structure-based classification of class A β-lactamases, a broadly diverse family of enzymes. Clin. Microbiol. Rev. 29, 29–57 (2016).
Cohen, T. S., Geiger, M. & Weiler, M. Intertwiners between Induced Representations (with Applications to the Theory of Equivariant Neural Networks). arXiv: 1803.10743 (2018).
Weiler, M., Geiger, M., Welling, M., Boomsma, W. & Cohen, T. S. 3D Steerable CNNs: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems (2018).
Min, S., Park, S., Kim, S., Choi, H.-S. & Yoon, S. Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information. arXiv: 1912.05625 (2019).
Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R. & Teh, Y. W. Continuous hierarchical representations with Poincaré variational autoencoders. In Advances in neural information processing systems (2019).
Kalatzis, D., Eklund, D., Arvanitidis, G. & Hauberg, S. Variational Autoencoders with Riemannian Brownian Motion Priors. In International Conference on Machine Learning (2020).
Tosi, A., Hauberg, S., Vellido, A. & Lawrence, N. D. Metrics for Probabilistic Geometries. In Conference on Uncertainty in Artificial Intelligence (2014).
Skafte, N., Jørgensen, M. & Hauberg, S. Reliable training and estimation of variance networks. In Advances in Neural Information Processing Systems (2019).
Dijkstra, E. W. A note on two problems in connexion with graphs. Numer. Math. 1, 269–271 (1959).
Ahlberg, J. H., Nilson, E. N. & Walsh, J. L. The theory of splines and their applications. Can. Math. Bull. 11, 507–508 (1968).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).
Adachi, J. & Hasegawa, M. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. 28 (Institute of Statistical Mathematics Tokyo, 1996).
Detlefsen, N. S., Hauberg, S. & Boomsma, W. Source code repository for this paper. Version 1.0.0, https://doi.org/10.5281/zenodo.6336064 (2022).
Acknowledgements
This work was funded in part by the Novo Nordisk Foundation through the MLLS Center (Basic Machine Learning Research in Life Science, NNF20OC0062606). It also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (757360). NSD and SH were supported in part by a research grant (15334) from VILLUM FONDEN. WB was supported by a project grant from the Novo Nordisk Foundation (NNF18OC0052719). We thank Ole Winther, Jesper Ferkinghoff-Borg, and Jesper Salomon for feedback on earlier versions of this manuscript. Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU hardware used for this research.
Author information
Contributions
N.S.D., S.H. and W.B. jointly conceived and designed the study. N.S.D., S.H. and W.B. conducted the experiments. All authors contributed to the writing of the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Andrew Leaver-Fay and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Detlefsen, N.S., Hauberg, S. & Boomsma, W. Learning meaningful representations of protein sequences. Nat Commun 13, 1914 (2022). https://doi.org/10.1038/s41467-022-29443-w