Introduction

The explosion of available biological sequence data has led to multiple computational approaches aiming to infer three-dimensional structure, biological function, fitness, and evolutionary history of proteins from sequence data1,2. Recently, self-supervised deep learning models based on natural language processing methods, especially attention3 and transformers4, have been trained on large ensembles of protein sequences by means of the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones5,6,7,8,9,10. These models, which capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. In particular, they can predict structural contacts from single sequences in an unsupervised way7, presumably by transferring knowledge from their large training set11. Neural network architectures based on attention are also employed in the Evoformer blocks in AlphaFold12, as well as in RoseTTAFold13 and RGN214, and they contributed to the recent breakthrough in the supervised prediction of protein structure.

Protein sequences can be classified into families of homologous proteins, which descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins thus provides substantial information about functional and structural constraints1. The statistics of MSA columns, representing amino-acid sites, make it possible to identify functional residues that are conserved during evolution, and correlations of amino-acid usage between columns contain key information about functional sectors and structural contacts15,16,17,18. Indeed, through the course of evolution, contacting amino acids need to maintain their physico-chemical complementarity, which leads to correlated amino-acid usage at these sites: this is known as coevolution. Potts models, also known as Direct Coupling Analysis (DCA), are pairwise maximum entropy models trained to match the empirical one- and two-body frequencies of amino acids observed in the columns of an MSA of homologous proteins2,19,20,21,22,23,24,25,26. They capture the coevolution of contacting amino acids, and provided state-of-the-art unsupervised predictions of structural contacts before the advent of protein language models. Note that coevolutionary signal also aids supervised contact prediction27.

While most protein language neural networks take individual amino-acid sequences as inputs, some others have been trained to perform inference from MSAs of evolutionarily related sequences. This second class of networks includes MSA Transformer28 and the Evoformer blocks in AlphaFold12, both of which interleave row (i.e. per-sequence) attention with column (i.e. per-site) attention. Such an architecture is conceptually extremely attractive because it can incorporate coevolution in the framework of deep learning models using attention. In the case of MSA Transformer, simple combinations of the model’s row attention heads have led to state-of-the-art unsupervised structural contact prediction, outperforming both language models trained on individual sequences and Potts models28. Beyond structure prediction, MSA Transformer is also able to predict mutational effects29,30 and to capture fitness landscapes31. In addition to coevolutionary signal caused by structural and functional constraints, MSAs feature correlations that directly stem from the common ancestry of homologous proteins, i.e. from phylogeny. Does MSA Transformer learn to identify phylogenetic relationships between sequences, which are a key aspect of the MSA data structure?

Here, we show that simple, and universal, combinations of MSA Transformer’s column attention heads, computed on a given MSA, strongly correlate with the Hamming distances between sequences in that MSA. This demonstrates that MSA Transformer encodes detailed phylogenetic relationships. Is MSA Transformer able to separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency? To address this question, we generate controlled synthetic MSAs from Potts models trained on natural MSAs, either without or with phylogeny. For this, we perform Metropolis Monte Carlo sampling under the Potts Hamiltonians, either at equilibrium or along phylogenetic trees inferred from the natural MSAs. Using the top Potts model couplings as proxies for structural contacts, we demonstrate that unsupervised contact prediction via MSA Transformer is substantially more resilient to phylogenetic noise than contact prediction using inferred Potts models.

Results

Column attention heads capture Hamming distances in separate MSAs

We first considered separately each of 15 different Pfam seed MSAs (see “Methods – Datasets” and Supplementary Table 1), corresponding to distinct protein families, and asked whether MSA Transformer has learned to encode phylogenetic relationships between sequences in its attention layers. To test this, we randomly split each MSA into a training set and a test set, and trained a logistic model [Eqs. (5) and (6)], based on the column-wise means of MSA Transformer’s column attention heads, to predict all pairwise Hamming distances in the training set—see Fig. 1 for a schematic, and “Methods – Supervised prediction of Hamming distances” for details. Figure 2 and Table 1 show the results of fitting these specialized logistic models.

Fig. 1: MSA Transformer: column attentions and Hamming distances.

a MSA Transformer is trained using the masked language modeling objective of filling in randomly masked residue positions in MSAs. For each residue position in an input MSA, it assigns attention scores to all residue positions in the same row (sequence) and column (site) in the MSA. These computations are performed by 12 independent row/column attention heads in each of 12 successive layers of the network. b Our approach for Hamming distance matrix prediction from the column attentions computed by the trained MSA Transformer model, using a natural MSA as input. For each i = 1, …, M, j = 0, …, L and l = 1, …, 12, the embedding vector $x_{ij}^{(l)}$ is the i-th row of the matrix $X_j^{(l)}$ defined in “Methods – MSA Transformer and column attention”, and the column attentions are computed according to Eqs. (2) and (3).

Fig. 2: Fitting logistic models to predict Hamming distances separately in each MSA.

The column-wise means of MSA Transformer’s column attention heads are used to predict normalised Hamming distances as probabilities in a logistic model. Each MSA is randomly split into a training set comprising 70% of its sequences and a test set composed of the remaining sequences. For each MSA, a logistic model is trained on all pairwise distances in the training set. Regression coefficients are shown for each layer and attention head (first column), as well as their absolute values averaged over heads for each layer (second column). For four example MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), for the training and test sets (third and fourth columns). Darker shades correspond to larger Hamming distances.

Table 1 Quality of fit for logistic models trained to predict Hamming distances separately in each MSA

For all alignments considered, large regression coefficients concentrate in early layers in the network, and single out some specific heads consistently across different MSAs—see Fig. 2, first and second columns, for results on four example MSAs. These logistic models reproduce the Hamming distances in the training set very well, and successfully predict those in the test set—see Fig. 2, third and fourth columns, for results on four example MSAs. Note that the block structures visible in the Hamming distance matrices, and well reproduced by our models, come from the phylogenetic ordering of sequences in our seed MSAs, see “Methods – Datasets”. Quantitatively, the coefficients of determination (R²) computed on the test sets are above 0.84 for all the MSAs studied—see Table 1.

A striking result from our analysis is that the regression coefficients appear to be similar across MSAs—see Fig. 2, first column. To quantify this, we computed the Pearson correlations between the regression coefficients learnt on the larger seed MSAs. Figure 3 demonstrates that regression coefficients are indeed highly correlated across these MSAs.

Fig. 3: Pearson correlations between regression coefficients in larger MSAs.

Sufficiently deep (≥ 100 sequences) and long (≥ 30 residues) MSAs are considered (mean/min/max Pearson correlations: 0.80/0.69/0.87).

MSA Transformer learns a universal representation of Hamming distances

Given the substantial similarities between our models trained separately on different MSAs, we next asked whether a common model across MSAs could capture Hamming distances within generic MSAs. To address this question, we trained a single logistic model, based on the column-wise means of MSA Transformer’s column attention heads, on all pairwise distances within each of the first 12 of our seed MSAs. We assessed its ability to predict Hamming distances in the remaining 3 seed MSAs, which thus correspond to entirely different Pfam families from those in the training set. Figure 4 shows the coefficients of this regression (first and second panels), as well as comparisons between predictions and ground truth values for the Hamming distances within the three test MSAs (last three panels). We observe that large regression coefficients again concentrate in the early layers of the model, although somewhat less markedly than in the individual models. Furthermore, the common model captures the main features of the Hamming distance matrices in the test MSAs well.

Fig. 4: Fitting a single logistic model to predict Hamming distances.

Our collection of 15 MSAs is split into a training set comprising 12 of them and a test set composed of the remaining 3. A logistic regression is trained on all pairwise distances within each MSA in the training set. Regression coefficients (first panel) and their absolute values averaged over heads for each layer (second panel) are shown as in Fig. 2. For the three test MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), also as in Fig. 2 (last three panels). We further report the R2 coefficients of determination for the regressions on these test MSAs—see also Supplementary Table 2.

In Supplementary Table 2, we quantify the quality of fit for this model on all our MSAs. In all cases, we find very high Pearson correlations between the predicted distances and the ground truth Hamming distances. Furthermore, the median value of the R² coefficient of determination is 0.6, confirming the good quality of fit. In the three shortest and the two shallowest MSAs, the model performs below this median, while all MSAs for which R² is above median have depth M ≥ 52 and length L ≥ 67. We also compute, for each MSA, the slope of the linear fit when regressing the ground truth Hamming distances on the distances predicted by the model. MSA depth is highly correlated with the value of this slope (Pearson r ≈ 0.95). This bias may be explained by the under-representation in the training set of Hamming distances and attention values from shallower MSAs, as their number is quadratic in MSA depth.

Ref. 28 showed that some column attention matrices, summed along one of their dimensions, correlate with phylogenetic sequence weights (see “Methods – Supervised prediction of Hamming distances”). This indicates that the model is, in part, attending to maximally diverse sequences. Our study demonstrates that MSA Transformer actually learns pairwise phylogenetic relationships between sequences, beyond these aggregate phylogenetic sequence weights. It also suggests an additional mechanism by which the model may be attending to these relationships, focusing on similarity instead of diversity. Indeed, while our regression coefficients with positive sign in Fig. 4 are associated with (average) attentions that are positively correlated with the Hamming distances, we also find several coefficients with large negative values. They indicate the existence of important negative correlations: in those heads, the model is actually attending to pairs of similar sequences. Besides, comparing our Figs. 2 and 4 with Fig. 5 in ref. 28 shows that the attention heads that matter most in our study differ from those highlighted in the analysis of ref. 28 (Sec. 5.1). Specifically, here we find that the fifth attention head in the first layer in the network is associated with the largest positive regression coefficient, while the sixth one was most important there. Moreover, still focusing on the first layer of the network, the other most prominent heads here were not significant there. MSA Transformer’s ability to focus on similarity may also explain why its performance at predicting mutational effects can decrease significantly when using MSAs that include a duplicate of the query sequence (see ref. 29, Supplementary Fig. 9 and Table 10): in these cases, the model predicts masked tokens with very high confidence using information from the duplicate sequence.

How much does the ability of MSA Transformer to capture phylogenetic relationships arise from its training? To address this question, we trained a common logistic model as above to predict Hamming distances, but using column attention values computed from a randomly re-initialized version of the MSA Transformer network. We used the same protocol as in MSA Transformer’s original pre-training to randomly initialize the entries of the network’s row- and column-attention weight matrices $W_{\mathrm{Q}}^{(l,h)}$, $W_{\mathrm{K}}^{(l,h)}$ and $W_{\mathrm{V}}^{(l,h)}$ (see “Methods – MSA Transformer and column attention”), as well as the entries of the matrix used to embed input tokens, the weights in the feed-forward layers, and the positional encodings. Specifically, we sampled these entries (with the exception of bias terms and of the embedding vector for the padding token, which were set to zero) from a Gaussian distribution with mean 0 and standard deviation 0.02. The results obtained in this case for our regression task are reported in Supplementary Table 3. They demonstrate that, although random initialization can yield better performance than random guessing (which may partly be explained by Gordon’s Theorem32), the trained MSA Transformer gives vastly superior results. This confirms that the masked language modeling pre-training has driven it towards precisely encoding distances between sequences.
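
A minimal Python sketch of this re-initialization control is given below. It is ours, not the authors' code: selecting parameters by name is illustrative (actual module and parameter names depend on the fair-esm implementation), and the additional zeroing of the padding-token embedding row mentioned above is omitted for brevity.

```python
# A minimal sketch (assumptions flagged in the lead-in) of re-initializing all
# weights of the pre-trained MSA Transformer from a Gaussian with mean 0 and
# standard deviation 0.02, while zeroing bias terms.
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()

with torch.no_grad():
    for name, param in model.named_parameters():
        if name.endswith("bias"):
            param.zero_()                         # bias terms set to zero
        else:
            param.normal_(mean=0.0, std=0.02)     # Gaussian(0, 0.02) for all other entries
```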

For each layer and attention head in the network, MSA Transformer computes one matrix of column attention values per site—see Eq. (4). This is in contrast with row attention, which is tied (see “Methods – MSA Transformer and column attention”). Our results are more surprising than they would be if the model’s column attentions were also tied. Indeed, during pre-training, by tuning its row-attention weight matrices to achieve optimal tied attention, MSA Transformer discovers covariance between MSA sites in early layers; if column attentions were tied in the same way, they would analogously capture covariance between MSA sequences, which is directly related to Hamming distance.

Finally, to explore the contribution of each column to performance in our regression task, we employed our common logistic model (trained on the means of column attention matrices) to predict Hamming distances using column attentions from individual sites. We find that the most highly conserved sites (corresponding to columns with low entropy) lead to predictions whose errors have among the smallest standard deviations—see Supplementary Table 4. Note that we focused on standard deviations to mitigate the biases of the common logistic model (see above). This indicates that highly conserved sites lead to more stable predictions.

MSA Transformer efficiently disentangles correlations from contacts and phylogeny

MSA Transformer is known to capture three-dimensional contacts through its (tied) row attention heads28, and we have shown that it also captures Hamming distances, and thus phylogeny, through its column attention heads. Correlations observed between the columns of an MSA can arise both from coevolution due to functional constraints and from phylogeny (see Fig. 5). How efficiently does MSA Transformer disentangle correlations from contacts and phylogeny? We address this question in the concrete case of structure prediction. Because correlations from contacts and phylogeny are always both present in natural data, we constructed controlled synthetic data by sampling from Potts models (Fig. 5b), either independently at equilibrium, or along a phylogenetic tree inferred from the natural MSA using FastTree33. The Potts models we used were trained on each of 15 full natural MSAs (see “Methods – Datasets” and Supplementary Table 1) using the generative method bmDCA26,34—see “Methods – Synthetic MSA generation via Potts model sampling along inferred phylogenies”. This setup allows us to compare data where all correlations come from couplings (pure Potts model) to data that comprises phylogenetic correlations on top of these couplings. For simplicity, let us call “contacts” the top scoring pairs of amino-acid sites according to the bmDCA models used to generate our MSAs, and refer to the task of inferring these top scoring pairs as “contact prediction”.

Fig. 5: Correlations from coevolution and from phylogeny in MSAs.

a Natural selection on structure and function leads to correlations between residue positions in MSAs (coevolution). b Potts models, also known as DCA, aim to capture these correlations in their pairwise couplings. c Historical contingency can lead to correlations even in the absence of structural or functional constraints.

Contact maps inferred by plmDCA24,25 and by MSA Transformer for our synthetic datasets are shown in Supplementary Fig. 4. For datasets generated with phylogeny, more false positives, scattered across the whole contact maps, appear in the inference by plmDCA than in that by MSA Transformer. This is shown quantitatively in Table 2, which reports the area under the receiver operating characteristic curve (ROC-AUC) for contact prediction for two different cutoffs on the number of contacts. We also quantify the degradation in performance caused by phylogeny by computing the relative drop Δ in ROC-AUC due to the injection of phylogeny in our generative process, for each Pfam family and for both plmDCA and MSA Transformer. On average, Δ is two to three times (depending on the cutoff) higher for plmDCA than for MSA Transformer. We checked that these outcomes are robust to changes in the strategy used to compute plmDCA scores. In particular, the average Δ for plmDCA becomes even larger when we average scores coming from independent models fitted on the 10 subsampled MSAs used for MSA Transformer—thus using the exact same method as for predicting contacts with MSA Transformer (see “Methods – Assessing performance degradation due to phylogeny in coupling inference”). The conclusion is the same if 10 (or 6, for Pfam family PF13354) twice-deeper subsampled MSAs are employed.

Table 2 Impact of phylogeny on contact prediction by plmDCA and MSA Transformer

These results demonstrate that contact inference by MSA Transformer is less deteriorated by phylogenetic correlations than contact inference by DCA. This resilience might explain the remarkable result that structural contacts are predicted more accurately by MSA Transformer than by Potts models even when MSA Transformer’s pre-training dataset minimizes diversity (see ref. 28, Sec. 5.1).

Table 2 also shows that plmDCA performs better than MSA Transformer on the synthetic MSAs generated without phylogeny. Because these sequences are sampled independently and at equilibrium from Potts models inferred from the natural MSAs, they are by definition well-described by Potts models. However, these sequences incorporate the imperfections of the inferred Potts models (see the inferred contact maps versus the experimental ones in Supplementary Fig. 2), in addition to lacking the phylogenetic relationships that exist in natural MSAs. These differences with the natural MSAs that were used to train MSA Transformer might explain why it performs less well than plmDCA on these synthetic MSAs, while the opposite holds for natural MSAs (see ref. 28 and Supplementary Figs. 2 and 3). Note that directly comparing the performance of inference between natural and synthetic data is difficult because the ground-truth contacts are not the same and because synthetic data relies on inferred Potts models and inferred phylogenetic trees with their imperfections. However, this does not impair our comparisons of the synthetic datasets generated without and with phylogeny, or of plmDCA and MSA Transformer on the same datasets. Furthermore, an interesting feature that can be observed in Supplementary Fig. 4, and is quantified in Supplementary Table 5, is that MSA Transformer tends to recover the experimental contact maps from our synthetic data generated by bmDCA. Specifically, some secondary structure features that were partially lost in the bmDCA inference and generation process (see the experimental contact maps in Supplementary Fig. 2) become better defined again upon contact inference by MSA Transformer. This could be because MSA Transformer has learnt the structure of contact maps, including the spatial compactness and shapes of secondary structures.

Discussion

MSA Transformer is known to capture structural contacts through its (tied) row attention heads28. Here, we showed that it also captures Hamming distances, and thus phylogenetic information, through its column attention heads. This separation of the two signals in the representation of MSAs built by MSA Transformer comes directly from its architecture with interleaved row and column attention heads. It makes sense, given that some correlations between columns (i.e. amino-acid sites) of an MSA are associated with contacts between sites, while similarities between rows (i.e. sequences) arise from relatedness between sequences15. Specifically, we found that simple combinations of column attention heads, tuned to individual MSAs, can predict pairwise Hamming distances between held-out sequences with very high accuracy. The largest coefficients in these combinations are found in early layers in the network. More generally, this study demonstrated that the regressions trained on different MSAs had major similarities. This motivated us to train a single model across a heterogeneous collection of MSAs, and this general model was still found to accurately predict pairwise distances in test MSAs from entirely distinct Pfam families. This result hints at a universal representation of phylogenetic relationships in MSA Transformer. Furthermore, our results suggest that the network has learned to quantify phylogenetic relatedness by attending not only to dissimilarity28, but also to similarity relationships.

Next, to test the ability of MSA Transformer to disentangle phylogenetic correlations from functional and structural ones, we focused on unsupervised contact prediction tasks. Using controlled synthetic data, we showed that unsupervised contact prediction is more robust to phylogeny when performed by MSA Transformer than by inferred Potts models.

Language models often capture important properties of the training data in their internal representations35. For instance, those trained on single protein sequences learn structure and binding sites36, and those trained on chemical reactions learn how atoms rearrange37. Our finding that detailed phylogenetic relationships between sequences are learnt by MSA Transformer, in addition to structural contacts, and in an orthogonal way, demonstrates how precisely this model represents the MSA data structure. We note that, without language models, analyzing the correlations in MSAs can reveal evolutionary relatedness and sub-families15, as well as collective modes of correlation, some of which are phylogenetic and some functional18. Furthermore, Potts models capture the clustered organization of protein families in sequence space26, and the latent space of variational autoencoder models trained on sequences38,39,40 qualitatively captures phylogeny39. Here, we demonstrated the stronger result that detailed pairwise phylogenetic relationships between sequences are quantitatively learnt by MSA Transformer.

Separating coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency constitutes a key problem in analyzing the sequence-to-function mapping in proteins15,18. Phylogenetic correlations are known to obscure the identification of structural contacts by traditional coevolution methods, in particular by inferred Potts models20,21,41,42,43,44, motivating various corrections17,21,22,24,45,46,47,48. From a theoretical point of view, disentangling these two types of signals is a fundamentally hard problem49. In this context, the fact that protein language models such as MSA Transformer learn both signals in orthogonal representations, and separate them better than Potts models, is remarkable.

Here, we have focused on Hamming distances as a simple measure of phylogenetic relatedness between sequences. It would be very interesting to extend our study to other, more detailed, measures of phylogeny. One may ask whether they are encoded in deeper layers in the network than those most involved in our study. Besides, we have mainly considered attentions averaged over columns, but exploring in more detail the role of individual columns would be valuable, especially given the impact we found for column entropies. More generally, our results suggest that the performance of protein language models trained on MSAs could be assessed by evaluating not only how well they capture structural contacts, but also how well they capture phylogenetic relationships. In addition, the ability of protein language models to learn phylogeny could make them particularly well-suited to generating synthetic MSAs capturing the data distribution of natural ones50. It also raises the question of their possible usefulness for inferring phylogenies and evolutionary histories.

Methods

Datasets

The Pfam database51 contains a large collection of related protein regions (families), typically associated with functional units called domains that can be found in multiple protein contexts. For each of its families, Pfam provides an expert-curated seed alignment that contains a representative set of sequences. In addition, Pfam provides deeper “full” alignments, which are automatically built by searching against a large sequence database using a profile hidden Markov model (HMM) built from the seed alignments.

For this work, we considered 15 Pfam families, and for each we constructed (or retrieved, see below) one MSA from its seed alignment—henceforth referred to as the “seed MSA”—and one from its full alignment—henceforth referred to as the full MSA. The seed MSAs were created by first aligning Pfam seed alignments (Pfam version 35.0, Nov. 2021) to their HMMs using the hmmalign command from the HMMER suite (http://hmmer.org, version 3.3.2), and then removing columns containing only insertions or gaps. We retained the original Pfam tree ordering, with sequences ordered according to phylogeny inferred by FastTree33. In the case of family PF02518, out of the initial 658 sequences, we kept only the first 500 in order to limit the memory requirements of our computational experiments to less than 64 GB. Of the full MSAs, six (PF00153, PF00397, PF00512, PF01535, PF13354) were created from Pfam full alignments (Pfam version 34.0, Mar. 2021), removing columns containing only insertions or gaps, and finally removing sequences where 10% or more characters were gaps. The remaining nine full MSAs were retrieved from the online repository https://github.com/matteofigliuzzi/bmDCA (publication date: Dec. 2017) and were previously considered in ref. 26. These alignments were constructed from full Pfam alignments from an earlier release of Pfam.

An MSA is a matrix $\mathcal{M}$ with L columns, representing the different amino-acid sites, and M rows. Each row i, denoted by $\boldsymbol{x}^{(i)}$, represents one sequence of the alignment. We will refer to L as the MSA length, and to M as its depth. For all but one (PF13354) of our full MSAs, M > 36,000. Despite their depth, however, our full MSAs include some highly similar sequences due to phylogenetic relatedness, a usual feature of large alignments of homologous proteins. We computed the effective depth20 of each MSA $\mathcal{M}$ as

$$M_{\mathrm{eff}}^{(\delta)} := \sum_{i=1}^{M} w_i, \quad \text{with} \quad w_i := \left|\left\{ i' : d_{\mathrm{H}}\big(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(i')}\big) < \delta \right\}\right|^{-1},$$
(1)

where $d_{\mathrm{H}}(\boldsymbol{x}, \boldsymbol{y})$ is the (normalized) Hamming distance between two sequences $\boldsymbol{x}$ and $\boldsymbol{y}$, i.e. the fraction of sites where the amino acids differ, and we set δ = 0.2. While $M_{\mathrm{eff}}^{(0.2)}/M$ can be as low as 0.06 for our full MSAs, this ratio is close to 1 for all seed MSAs: it is almost 0.83 for PF00004, and larger than 0.97 for all other families.
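
The following minimal Python sketch (ours, not the authors' code) illustrates the normalized Hamming distance and the effective depth of Eq. (1), assuming the MSA is stored as a NumPy array of integer-encoded residues of shape (M, L).

```python
# A minimal sketch of Eq. (1): sequence weights and effective depth, assuming
# an integer-encoded MSA of shape (M, L). The fully vectorized distance matrix
# uses O(M^2 L) memory, which is fine for seed-sized MSAs.
import numpy as np

def effective_depth(msa, delta=0.2):
    """Return M_eff^(delta) and the weights w_i = 1 / |{i' : d_H(x_i, x_i') < delta}|."""
    # Pairwise normalized Hamming distances, shape (M, M).
    dists = np.mean(msa[:, None, :] != msa[None, :, :], axis=-1)
    neighbours = (dists < delta).sum(axis=1)   # includes the sequence itself
    weights = 1.0 / neighbours
    return weights.sum(), weights

# Toy MSA with M = 4 sequences and L = 5 sites.
toy_msa = np.array([[0, 1, 2, 3, 4],
                    [0, 1, 2, 3, 5],
                    [6, 7, 8, 9, 10],
                    [6, 7, 8, 9, 10]])
m_eff, w = effective_depth(toy_msa)
print(m_eff, w)   # the two identical sequences each get weight 0.5
```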

Finally, for each Pfam domain considered, we retrieved one experimental three-dimensional protein structure, corresponding to a sequence present in the full MSA, from the PDB (https://www.rcsb.org). All these structures were obtained by X-ray crystallography and have R-free values between 0.13 and 0.29. Information about our MSAs is summarized in Supplementary Table 1.

All these families have been previously considered in the literature and shown to contain coevolutionary signal detectable by DCA methods26, making our experiments on contact prediction readily comparable with previous results. While the precise choice of Pfam families is likely immaterial for our investigation of the column attention heads computed by MSA Transformer, our domains’ short lengths are convenient in view of MSA Transformer’s large memory footprint—which is O(LM²) + O(L²).

MSA Transformer and column attention

We used the pre-trained MSA Transformer model introduced in ref. 28, retrieved from the Python Package Index as fair-esm 0.4.0. We briefly recall that this model was trained, with a variant of the masked language modeling (MLM) objective52, on 26 million MSAs constructed from UniRef50 clusters (March 2018 release), and contains 100 million trained parameters. The input to the model is an MSA with L columns and M rows. First, the model prepends a special beginning-of-sentence token to each row in the input MSA (this is common in language models inspired by the BERT architecture52). Then, each residue (or token) is embedded independently, via a learned mapping from the set of possible amino-acid/gap symbols into $\mathbb{R}^d$ (d = 768). To these embeddings, the model adds two kinds of learned6 scalar positional encodings53, designed to allow the model to distinguish between (a) different aligned positions (columns), and (b) different sequence positions (rows). (Note that removing the latter kind was shown in ref. 28 to have only limited impact.) The resulting collection of M × (L + 1) d-dimensional vectors, viewed as an M × (L + 1) × d array, is then processed by a neural architecture consisting of 12 layers. Each layer is a variant of the axial attention54 architecture, consisting of a multi-headed (12 heads) tied row attention block, followed by a multi-headed (12 heads) column attention block, and finally by a feed-forward network. (Note that both attention blocks, and the feed-forward network, are in fact preceded by layer normalization55.) The roles of row and column attention in the context of the MLM training objective are illustrated in Fig. 1a. Tied row attention incorporates the expectation that 3D structure should be conserved amongst sequences in an MSA; we refer the reader to ref. 28 for technical details. Column attention works as follows: let $X_j^{(l)}$ be the M × d matrix corresponding to column j in the M × (L + 1) × d array output by the row attention block in layer l, with l = 1, …, 12. At each layer l and each head h = 1, …, 12, the model learns three d × d matrices $W_{\mathrm{Q}}^{(l,h)}$, $W_{\mathrm{K}}^{(l,h)}$ and $W_{\mathrm{V}}^{(l,h)}$ (note that these matrices, mutatis mutandis, could be of dimension $d \times d'$ with $d' \neq d$), used to obtain three M × d matrices

$$Q_j^{(l,h)} = X_j^{(l)} W_{\mathrm{Q}}^{(l,h)}, \quad K_j^{(l,h)} = X_j^{(l)} W_{\mathrm{K}}^{(l,h)}, \quad V_j^{(l,h)} = X_j^{(l)} W_{\mathrm{V}}^{(l,h)},$$
(2)

whose rows are referred to as “query”, “key”, and “value” vectors respectively. The column attention from MSA column j ∈ {0, …, L} (where j = 0 corresponds to the beginning-of-sentence token), at layer l, and from head h, is then the M × M matrix

$$A_j^{(l,h)} := \mathrm{softmax}_{\mathrm{row}}\!\left(\frac{Q_j^{(l,h)} {K_j^{(l,h)}}^{\mathrm{T}}}{\sqrt{d}}\right),$$
(3)

where we denote by $\mathrm{softmax}_{\mathrm{row}}$ the application of $\mathrm{softmax}(\xi_1, \ldots, \xi_d) = (e^{\xi_1}, \ldots, e^{\xi_d}) / \sum_{k=1}^{d} e^{\xi_k}$ to each row of a matrix independently, and by $(\cdot)^{\mathrm{T}}$ matrix transposition. As in the standard Transformer architecture4, these attention matrices are then used to compute M × d matrices $Z_j^{(l,h)} = A_j^{(l,h)} V_j^{(l,h)}$, one for each MSA column j and head h. Projecting the concatenation $Z_j^{(l,1)} | \cdots | Z_j^{(l,12)}$, a single M × d matrix $Z_j^{(l)}$ is finally obtained at layer l. The collection $(Z_j^{(l)})_{j=0,\ldots,L}$, thought of as an M × (L + 1) × d array, is then passed along to the feed-forward layer.
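
The following minimal NumPy sketch (ours; toy dimensions and randomly drawn weight matrices, purely for illustration) spells out the column attention computation of Eqs. (2) and (3) for a single column j, layer l and head h.

```python
# A minimal sketch of Eqs. (2)-(3): column attention for one MSA column.
# X_j stands for the M x d matrix of embeddings of that column; W_Q, W_K, W_V
# stand for the learned projection matrices (random here, for illustration).
import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 16                       # toy dimensions; the real model uses d = 768
X_j = rng.normal(size=(M, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

Q, K, V = X_j @ W_Q, X_j @ W_K, X_j @ W_V        # Eq. (2): three M x d matrices
A_j = softmax_rows(Q @ K.T / np.sqrt(d))         # Eq. (3): M x M column attention
Z_j = A_j @ V                                    # M x d values passed downstream
print(A_j.shape, A_j.sum(axis=1))                # each attention row sums to one
```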

Supervised prediction of Hamming distances

Row i of the column attention matrices $A_j^{(l,h)}$ in Eq. (3) consists of M positive weights summing to one—one weight per row index $i'$ in the original MSA. According to the usual interpretation of the attention mechanism3,4, the role of these weights may be described as follows: When constructing a new internal representation (at layer l) for the row-i, column-j residue position, the network distributes its focus, according to these weights, among the M available representation vectors associated with each MSA row-$i'$, column-j residue position (including $i' = i$). Since row attention precedes column attention in the MSA Transformer architecture, we remark that, even at the first layer, the row-$i'$, column-j representation vectors that are processed by that layer’s column attention block can encode information about the entire row $i'$ in the MSA.

In ref. 28 (Sec. 5.1), it was shown that, for some layers l and heads h, averaging the M × M column attention matrices \({A}_{j}^{(l,h)}\) in Equation (3) from all MSA columns j, and then averaging the result along the first dimension, yields M-dimensional vectors whose entries correlate reasonably well with the phylogenetic sequence weights wi defined in Equation (1). Larger weights are, by definition, associated with less redundant sequences, and MSA diversity is known to be important for coevolution-based methods—particularly in structure prediction tasks. Thus, these correlations can be interpreted as suggesting that the model is, in part, explicitly attending to a maximally diverse set of sequences.
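
The sketch below (ours; the array shapes, layer and head indices are assumptions, and the data here are random) illustrates the check just described: column attention maps are averaged over all columns and then over their first dimension, and the resulting M-dimensional vector is correlated with the sequence weights of Eq. (1).

```python
# A minimal sketch of correlating per-sequence averaged column attentions with
# phylogenetic sequence weights w_i. `col_attn` is assumed to have shape
# (layers, heads, L + 1, M, M); `weights` has length M.
import numpy as np
from scipy.stats import pearsonr

def attention_weight_correlation(col_attn, weights, layer, head):
    A_mean = col_attn[layer, head].mean(axis=0)   # average over columns: (M, M)
    per_sequence = A_mean.mean(axis=0)            # average along the first dimension: (M,)
    r, _ = pearsonr(per_sequence, weights)
    return r

# Toy data, purely to illustrate the shapes involved (L + 1 = 8, M = 5).
rng = np.random.default_rng(0)
col_attn = rng.dirichlet(np.ones(5), size=(12, 12, 8, 5))  # rows sum to one
weights = rng.uniform(size=5)
print(attention_weight_correlation(col_attn, weights, layer=0, head=4))
```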

Beyond this, we hypothesize that MSA Transformer may have learned to quantify and exploit phylogenetic correlations in order to optimize its performance in the MLM training objective of filling in randomly masked residue positions. To investigate this, we set up regression tasks in which, to predict the Hamming distance y between the i-th and the $i'$-th sequence in an MSA $\mathcal{M}$ of length L, we used the entries $a_{i,i'}^{(l,h)}$ at position $(i, i')$ (henceforth $a^{(l,h)}$ for brevity) from the 144 matrices

$$\boldsymbol{A}^{(l,h)} := \frac{1}{2(L+1)} \sum_{j=0}^{L} \left( A_j^{(l,h)} + {A_j^{(l,h)}}^{\mathrm{T}} \right), \quad \text{with} \quad 1 \le l \le 12 \ \text{and} \ 1 \le h \le 12.$$
(4)

These matrices are obtained by averaging, across all columns j = 0, …, L, the symmetrised column attention maps $A_j^{(l,h)}$ computed by MSA Transformer, when taking $\mathcal{M}$ as input. We highlight that column j = 0, corresponding to the beginning-of-sentence token, is included in the average defining $\boldsymbol{A}^{(l,h)}$.
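
A minimal sketch of how such matrices can be obtained with the pre-trained MSA Transformer from fair-esm is given below. It assumes that the forward pass with need_head_weights=True returns "col_attentions" with shape (batch, layers, heads, columns, rows, rows); this should be checked against the installed fair-esm version before relying on it.

```python
# A minimal sketch of Eq. (4): symmetrized, column-averaged attention matrices
# A^(l,h) from MSA Transformer's column attentions (shape assumption above).
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# msa is a list of (label, aligned_sequence) pairs, all of the same length.
msa = [("seq0", "MKTAYIAKQR"), ("seq1", "MKTAYLAKQR"), ("seq2", "MRTAYIAEQR")]
_, _, tokens = batch_converter([msa])            # shape (1, M, L + 1)

with torch.no_grad():
    out = model(tokens, need_head_weights=True)

col_attn = out["col_attentions"][0]              # assumed (layers, heads, L + 1, M, M)
# Symmetrize each map and average over all columns, including j = 0 (BOS), as in Eq. (4).
A = 0.5 * (col_attn + col_attn.transpose(-1, -2)).mean(dim=2)   # (12, 12, M, M)
print(A.shape)
```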

We fit fractional logit models via quasi-maximum likelihood estimation56 using the statsmodels package (version 0.13.2)57. Namely, we model the relationship between the Hamming distance y and the aforementioned symmetrised, and averaged, attention values $\boldsymbol{a} = (a^{(1,1)}, \ldots, a^{(12,12)})$, as

$$\mathbb{E}\,[y \mid \boldsymbol{a}] = G_{\beta_0, \boldsymbol{\beta}}(\boldsymbol{a}), \quad \text{with} \quad G_{\beta_0, \boldsymbol{\beta}}(\boldsymbol{a}) := \sigma\!\left(\beta_0 + \boldsymbol{a}\,\boldsymbol{\beta}^{\mathrm{T}}\right),$$
(5)

where $\mathbb{E}\,[\,\cdot \mid \cdot\,]$ denotes conditional expectation, $\sigma(x) = (1 + e^{-x})^{-1}$ is the standard logistic function, and the coefficients $\beta_0$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{144})$ are determined by maximising the sum of Bernoulli log-likelihoods

$$\ell(\beta_0, \boldsymbol{\beta} \mid \boldsymbol{a}, y) = y \log\!\left[G_{\beta_0, \boldsymbol{\beta}}(\boldsymbol{a})\right] + (1 - y) \log\!\left[1 - G_{\beta_0, \boldsymbol{\beta}}(\boldsymbol{a})\right],$$
(6)

evaluated over a training set of observations of y and a. Note that this setup is similar to logistic regression, but allows for the dependent variable to take real values between 0 and 1 (it can be equivalently described as a generalized linear model with binomial family and logit link). For simplicity, we refer to these fractional logit models simply as “logistic models”. Our general approach to predict Hamming distances is illustrated in Fig. 1b.
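
A minimal statsmodels sketch of this fractional logit fit is given below; X_train and y_train are hypothetical stand-ins (random here) for the 144 averaged attention values per sequence pair and the corresponding normalized Hamming distances.

```python
# A minimal sketch of the fractional logit fit of Eqs. (5)-(6) via
# quasi-maximum likelihood: a GLM with Binomial family and logit link,
# whose response may take any value in [0, 1].
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_pairs, n_features = 1000, 144            # toy stand-ins for real data
X_train = rng.uniform(size=(n_pairs, n_features))
y_train = rng.uniform(size=n_pairs)        # fractional responses in (0, 1)

X_design = sm.add_constant(X_train)        # adds the intercept beta_0
model = sm.GLM(y_train, X_design, family=sm.families.Binomial())
result = model.fit()                       # maximizes the Bernoulli log-likelihood of Eq. (6)

y_pred = result.predict(X_design)          # E[y | a] = sigma(beta_0 + a . beta)
print(result.params[:5], y_pred[:5])
```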

Using data from our seed MSAs (cf. Supplementary Table 1), we performed two types of regression tasks. In the first one, we randomly partitioned the set of row indices in each separate MSA $\mathcal{M}$ into two subsets $I_{\mathcal{M},\mathrm{train}}$ and $I_{\mathcal{M},\mathrm{test}}$, with $I_{\mathcal{M},\mathrm{train}}$ containing 70% of the indices. We then trained and evaluated one model for each $\mathcal{M}$, using as training data the Hamming distances, and column attentions, coming from (unordered) pairs of indices in $I_{\mathcal{M},\mathrm{train}}$, and as test data the Hamming distances, and column attentions, coming from pairs of indices in $I_{\mathcal{M},\mathrm{test}}$. The second type of regression task was a single model fit over a training dataset consisting of all pairwise Hamming distances, and column attentions, from the first 12 of our 15 MSAs. We then evaluated this second model over a test set constructed in an analogous way from the remaining 3 MSAs.

Synthetic MSA generation via Potts model sampling along inferred phylogenies

To assess the performance of MSA Transformer at disentangling signals encoding functional and structural (i.e. fitness) constraints from phylogenetic correlations arising from historical contingency, we generated and studied controlled synthetic data. Indeed, disentangling fitness landscapes from phylogenetic history in natural data poses a fundamental challenge49—see Fig. 5 for a schematic illustration. This makes it very difficult to assess the performance of a method at this task directly on natural data, because gold standards where the two signals are well-separated are lacking. We resolved this conundrum by generating synthetic MSAs according to well-defined dynamics such that the presence of phylogeny can be controlled.

First, we inferred unrooted phylogenetic trees from our full MSAs (see “Methods – Datasets”), using FastTree version 2.1 (ref. 33) with its default settings. Our use of FastTree is motivated by the depth of the full MSAs, which makes it computationally prohibitive to employ more precise inference methods. Deep MSAs are needed for the analysis described below, since it relies on accurately fitting Potts models.

Then, we fitted Potts models on each of these MSAs using bmDCA26 (https://github.com/ranganathanlab/bmDCA, version 0.8.12) with its default hyperparameters. These include, in particular, regularization strengths for the Potts model fields and couplings, both set at $\lambda = 10^{-2}$. With the exception of family PF13354, we trained all models for 2000 iterations and stored the fields and couplings at the last iteration; in the case of PF13354, we terminated training after 1480 iterations. In all cases, we verified that, during training, the model’s loss had converged. The choice of bmDCA is motivated by the fact that, as has been shown in refs. 26, 34, model fitting on natural MSAs using Boltzmann machine learning yields Potts models with good generative power. This sets it apart from other DCA inference methods, especially pseudo-likelihood DCA (plmDCA)24,25, which is the DCA standard for contact prediction, but cannot faithfully reproduce empirical one- and two-body marginals, making it a poor choice of generative model26.

Using the phylogenetic trees and Potts models inferred from each full MSA, we generated synthetic MSAs without or with phylogeny, as we now explain. In the remainder of this subsection, let $\mathcal{M}$ denote an arbitrary MSA from our set of full MSAs, L its length, and M its depth.

Consider a sequence of L amino-acid sites. We denote by $x_i \in \{1, \ldots, q\}$ the state of site $i \in \{1, \ldots, L\}$, where q = 21 is the number of possible states, namely the 20 natural amino acids and the alignment gap. A general Potts model Hamiltonian applied to a sequence $\boldsymbol{x} = (x_1, \ldots, x_L)$ reads

$$H(\boldsymbol{x}) = -\sum_{i=1}^{L} h_i(x_i) - \sum_{j=1}^{L} \sum_{i=1}^{j-1} e_{ij}(x_i, x_j),$$
(7)

where the fields $h_i(x_i)$ and couplings $e_{ij}(x_i, x_j)$ are parameters that can be inferred from data by DCA methods2,20. In our case, they are inferred from $\mathcal{M}$ by bmDCA26,34. The Potts model probability distribution is then given by the Boltzmann distribution associated with the Hamiltonian H in Equation (7):

$$P(\boldsymbol{x}) = \frac{e^{-H(\boldsymbol{x})}}{Z},$$
(8)

where Z is a constant ensuring normalization. In this context, we implement a Metropolis–Hastings algorithm for Markov Chain Monte Carlo (MCMC) sampling from P, where an iteration step consists of a proposed move (mutation) in which a site i is chosen uniformly at random, and its state $x_i$ may be changed into another state chosen uniformly at random. Each of these attempted mutations is accepted or rejected according to the Metropolis criterion, i.e. with probability

$$p = \min\left[1, \exp(-\Delta H)\right],$$
(9)

where ΔH is the difference in the value of H after and before the mutation.
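
A minimal, self-contained Python sketch of this Metropolis–Hastings scheme is given below (ours, not the cited implementation); the toy fields and couplings are drawn at random purely for illustration, whereas in our case they would be those inferred by bmDCA.

```python
# A minimal sketch of Metropolis-Hastings sampling from the Potts distribution
# of Eqs. (7)-(9). Fields h have shape (L, q); couplings e have shape
# (L, L, q, q) and are symmetrized so that e[i, j, a, b] == e[j, i, b, a].
import numpy as np

rng = np.random.default_rng(0)
L, q = 10, 21                                    # toy sizes; q = 20 amino acids + gap
h = rng.normal(scale=0.1, size=(L, q))
e = rng.normal(scale=0.1, size=(L, L, q, q))
e = 0.5 * (e + e.transpose(1, 0, 3, 2))          # enforce coupling symmetry
for i in range(L):
    e[i, i] = 0.0

def delta_H(x, i, new_state):
    """Energy change when site i of sequence x is mutated to new_state."""
    old_state = x[i]
    dH = h[i, old_state] - h[i, new_state]
    for j in range(L):
        if j != i:
            dH += e[i, j, old_state, x[j]] - e[i, j, new_state, x[j]]
    return dH

def metropolis_step(x):
    i = rng.integers(L)
    new_state = rng.integers(q)
    dH = delta_H(x, i, new_state)
    # Metropolis criterion, Eq. (9): accept with probability min(1, exp(-dH)).
    if dH <= 0 or rng.random() < np.exp(-dH):
        x[i] = new_state
    return x

x = rng.integers(q, size=L)                      # random initial sequence
for _ in range(5000):                            # many proposed mutations toward equilibrium
    x = metropolis_step(x)
print(x)
```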

Generating independent equilibrium sequences under a Potts model

To generate a synthetic MSA without phylogeny from each $\mathcal{M}$, we performed equilibrium MCMC sampling from the Potts model with Hamiltonian H in Eq. (7), using the Metropolis–Hastings algorithm. Namely, we started from a set of M randomly and independently initialized sequences, and proposed a total number N of mutations on each sequence. Suitable values for N are estimated by bmDCA during its training, to ensure that Metropolis–Hastings sampling reaches thermal equilibrium after N steps when starting from a randomly initialized sequence26. We thus used the value of N estimated by bmDCA at the end of training. This yielded a synthetic MSA of the same depth M as the original full MSA $\mathcal{M}$, composed of independent equilibrium sequences.

Generating sequences along an inferred phylogeny under a Potts model

We also generated synthetic data using MCMC sampling along our inferred phylogenetic trees42, using an open-source implementation available at https://github.com/Bitbol-Lab/Phylogeny-Partners (version 2.0). We started from an equilibrium ancestor sequence sampled as explained above, and placed it at the root (note that, while FastTree roots its trees arbitrarily, root placement does not matter; see below). Then, this sequence was evolved by successive duplication (at each branching of the tree) and mutation events (along each branch). Mutations were again modeled using for acceptance the Metropolis criterion in Eq. (9) with the Hamiltonian in Eq. (7). As the length b of a branch gives the estimated number of substitutions that occurred per site along it33, we generated data by making a number of accepted mutations on this branch equal to the integer closest to bL. Since we traversed the entire inferred tree in this manner, the resulting sequences at the leaves of the tree yield a synthetic MSA of the same depth as the original full MSA $\mathcal{M}$. Finally, we verified that the Hamming distances between sequences in these synthetic MSAs were reasonably correlated with those between corresponding sequences in the natural MSAs—see Supplementary Fig. 1.
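
The sketch below illustrates this procedure on a toy tree; the function accepted_mutations is a hypothetical stand-in for the Metropolis dynamics of the previous sketch (here it simply applies random substitutions), and the tree encoding is ours, not that of the cited implementation.

```python
# A minimal sketch of sequence generation along a phylogeny: the root sequence
# is duplicated at each branching and accumulates round(b * L) accepted
# mutations along each branch of length b (substitutions per site).
import numpy as np

rng = np.random.default_rng(0)
L, q = 10, 21

def accepted_mutations(x, n_accepted):
    """Stand-in for n_accepted Metropolis-accepted mutations under the Potts model."""
    x = x.copy()
    for _ in range(n_accepted):
        x[rng.integers(L)] = rng.integers(q)
    return x

# A toy tree: each node is (branch_length_to_parent, list_of_children);
# leaves have an empty child list.
toy_tree = (0.0, [(0.3, [(0.1, []), (0.2, [])]),
                  (0.5, [(0.05, []), (0.15, [])])])

def evolve(tree, parent_seq, leaves):
    branch_length, children = tree
    seq = accepted_mutations(parent_seq, round(branch_length * L))
    if not children:
        leaves.append(seq)           # each leaf contributes one MSA row
    for child in children:           # duplication at each branching
        evolve(child, seq, leaves)

root_seq = rng.integers(q, size=L)   # in practice, an equilibrium Potts sample
leaves = []
evolve(toy_tree, root_seq, leaves)
synthetic_msa = np.array(leaves)     # depth = number of leaves
print(synthetic_msa.shape)
```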

Because we start from an ancestral equilibrium sequence, and then employ the Metropolis criterion, all sequences in the phylogeny are equilibrium sequences. Thus, some of the correlations between the sequences at the leaves of the tree can be ascribed to the couplings in the Potts model, as in the case of independent equilibrium sequences described above. However, their relatedness adds extra correlations, arising from the historical contingency in their phylogeny. Note that separating these ingredients is extremely tricky in natural data49, which motivates our study of synthetic data.

Our procedure for generating MSAs along a phylogeny is independent of the placement of the tree’s root. Indeed, informally, a tree’s root placement determines the direction of evolution; hence, root placement should not matter when evolution is a time-reversible process. That evolution via our mutations and duplications is a time-reversible process is a consequence of the fact that we begin with equilibrium sequences at the (arbitrarily chosen) root. More formally, for an irreducible Markov chain with transition matrix $\mathcal{P}$ and state space Ω, and for any n ≥ 1, let $\mathrm{Markov}_n(\pi, \mathcal{P})$ denote the probability space of chains $(X_k)_{0 \le k \le n}$ with initial distribution π on Ω. If π is the chain’s stationary distribution and π satisfies detailed balance, then, for any number of steps n ≥ 1, any chain $(X_k)_{0 \le k \le n} \in \mathrm{Markov}_n(\pi, \mathcal{P})$ is reversible in the sense that $(X_{n-k})_{0 \le k \le n} \in \mathrm{Markov}_n(\pi, \mathcal{P})$. In our case, since the Metropolis–Hastings algorithm constructs an irreducible Markov chain whose stationary distribution satisfies detailed balance, and since duplication events are also time-reversible constraints imposed at each branching node, all ensemble observables are independent of root placement as long as the root sequences are sampled from the stationary distribution.

Assessing performance degradation due to phylogeny in coupling inference

DCA methods and MSA Transformer both offer ways to perform unsupervised inference of structural contacts from MSAs of natural proteins. In the case of DCA, the established methodology24,25,26 is to (1) learn fields and couplings [see Eq. (7)] by fitting the Potts model, (2) change the gauge to the zero-sum gauge, (3) compute the Frobenius norms, for all pairs of sites (i, j), of the coupling matrices $(e_{ij}(x, y))_{x,y}$, and finally (4) apply the average product correction (APC)17, yielding a coupling score $E_{ij}$. Top scoring pairs of sites are then predicted as being contacts. In the case of MSA Transformer28, a single logistic regression (shared across all possible input MSAs) was trained to regress contact maps from a sparse linear combination of the symmetrized and APC-corrected row attention heads (see “Methods – MSA Transformer and column attention”).
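
A minimal sketch of steps (3) and (4) is given below; it assumes couplings already expressed in the zero-sum gauge and stored as an array of shape (L, L, q, q), and it uses one common APC convention (the treatment of gap states and the exact APC normalization vary between DCA implementations).

```python
# A minimal sketch of Frobenius-norm coupling scores with the average product
# correction (APC), under the assumptions stated in the lead-in.
import numpy as np

def coupling_scores(e):
    L = e.shape[0]
    # Step (3): Frobenius norm of each q x q coupling matrix.
    F = np.sqrt(np.sum(e ** 2, axis=(2, 3)))
    np.fill_diagonal(F, 0.0)
    # Step (4): APC, E_ij = F_ij - F_i. * F_.j / F_.. (off-diagonal means).
    row_mean = F.sum(axis=1) / (L - 1)
    total_mean = F.sum() / (L * (L - 1))
    E = F - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(E, 0.0)
    return E

rng = np.random.default_rng(0)
L, q = 10, 21
e = rng.normal(scale=0.01, size=(L, L, q, q))    # toy couplings, for illustration
scores = coupling_scores(e)
print(scores.shape)                              # top-scoring pairs are predicted contacts
```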

We applied these inference techniques, normally used to predict structural contacts, to our synthetic MSAs generated without and with phylogeny (see above). As proxies for structural contacts, we used the pairs of sites with top coupling scores in the Potts models used to generate the MSAs. Indeed, when presented with our synthetic MSAs generated at equilibrium, DCA methods for fitting Potts models should recover the ranks of these coupling scores well. Hence, their performance in this task provides a meaningful baseline against which we can measure both the performance obtained when a phylogeny was used to generate the data and the performance of MSA Transformer.

As a DCA method to infer these coupling scores, we used plmDCA24,25 as implemented in the PlmDCA Julia package (https://github.com/pagnani/PlmDCA, version 0.4.1), which is the state-of-the-art DCA method for contact inference. We fitted one plmDCA model per synthetic MSA, using default hyperparameters throughout; these include, in particular, regularization strengths set at $\lambda = 10^{-2}$ for both fields and couplings, and automatic estimation of the phylogenetic cutoff δ in Eq. (1). We verified that these settings led to good inference of structural contacts on the original full MSAs by comparing them to the PDB structures in Supplementary Table 1—see Supplementary Fig. 2. For each synthetic MSA, we computed coupling scores $E_{ij}$ for all pairs of sites.

While Potts models need to be fitted on deep MSAs to achieve good contact prediction, MSA Transformer’s memory requirements are considerable even at inference time, and the average depth of the MSAs used to train MSA Transformer was 1192 (ref. 28). Consequently, we could not run MSA Transformer on any of the synthetic MSAs in their entirety. Instead, we subsampled each synthetic MSA 10 times, by selecting each time a number $M_{\mathrm{sub}}$ of row indices uniformly at random, without replacement. We used $M_{\mathrm{sub}} \approx 380$ for family PF13354 due to its greater length, and $M_{\mathrm{sub}} \approx 500$ for all other families. Then, we computed for each subsample a matrix of coupling scores using MSA Transformer’s row attention heads and the estimated contact probabilities from the aforementioned logistic regression. Finally, we averaged the resulting 10 matrices to obtain a single matrix of coupling scores. We used a similar strategy (and the same randomly sampled row indices) to infer structural contact scores from the natural MSAs—see Supplementary Fig. 3. Consistent with the findings of ref. 28, MSA Transformer generally performs better than plmDCA (Supplementary Fig. 2) at contact inference.
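
A minimal sketch of this subsampling-and-averaging strategy with fair-esm is given below; it assumes that the MSA Transformer object exposes predict_contacts(), returning the contact probabilities derived from its row attention heads, which should be verified against the installed fair-esm version.

```python
# A minimal sketch of subsampling an MSA and averaging MSA Transformer contact
# scores over subsamples (API assumption stated in the lead-in). `msa` is a
# list of (label, sequence) pairs of aligned, equal-length sequences.
import numpy as np
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def averaged_contact_scores(msa, n_subsamples=10, m_sub=500, seed=0):
    rng = np.random.default_rng(seed)
    score_maps = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(msa), size=min(m_sub, len(msa)), replace=False)
        sub_msa = [msa[i] for i in idx]
        _, _, tokens = batch_converter([sub_msa])
        with torch.no_grad():
            contacts = model.predict_contacts(tokens)[0]   # assumed (L, L) tensor
        score_maps.append(contacts.numpy())
    return np.mean(score_maps, axis=0)                     # averaged coupling scores
```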

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.