Abstract
Generative models emerge as promising candidates for novel sequence-data-driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^{2} and 10^{3}). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^{68} possible sequences, which nevertheless constitute only the astronomically small fraction 10^{−80} of all amino-acid sequences of the same length. These findings illustrate both the potential and the difficulty of exploring sequence space via generative sequence models.
Introduction
The impressive growth of sequence databases is accompanied by increasingly powerful techniques in data-driven modeling, which help to extract the rich information hidden in the raw data. In the context of protein sequences, unsupervised learning techniques are of particular interest: only about 0.25% of the more than 200 million amino-acid sequences currently available in the Uniprot database^{1} have manual annotations, which can be used for supervised methods.
Unsupervised methods may benefit from evolutionary relationships between proteins: while mutations modify amino-acid sequences, selection keeps their biological functions and their three-dimensional structures remarkably conserved. The Pfam protein family database^{2}, e.g., lists more than 19,000 families of homologous proteins, offering rich datasets of sequence-diversified but functionally conserved proteins.
In this context, generative statistical models are rapidly gaining interest. The natural sequence variability across a protein family is captured via a probability P(a_{1},..., a_{L}) defined for all amino-acid sequences (a_{1},..., a_{L}). Sampling from P(a_{1},..., a_{L}) can be used to generate new, non-natural amino-acid sequences, which ideally should be statistically indistinguishable from the natural sequences. However, the task of learning P(a_{1},..., a_{L}) is highly nontrivial: the model has to assign probabilities to all 20^{L} possible amino-acid sequences. For typical protein lengths L = 50−500, this amounts to 10^{65}−10^{650} values, to be learned from the M = 10^{3}−10^{6} sequences contained in most protein families. Selecting adequate generative-model architectures is thus of outstanding importance.
The currently best explored generative models for proteins are so-called coevolutionary models^{3}, such as those constructed by the Direct Coupling Analysis (DCA)^{4,5,6} (a more detailed review of the state of the art is provided below). They explicitly model the usage of amino acids in single positions (i.e., residue conservation) and correlations between pairs of positions (i.e., residue coevolution). The resulting models are mathematically equivalent to Potts models^{7} in statistical physics, or to Boltzmann machines in statistical learning^{8}. They have found numerous applications in protein biology.
The effect of amino-acid mutations is predicted via the log-ratio \(\log\{P(\text{mutant})/P(\text{wild type})\}\) between mutant and wild-type probabilities. Strong correlations to mutational effects determined experimentally via deep mutational scanning have been reported^{9,10}. Promising applications are the data-driven design of mutant libraries for protein optimization^{11,12,13}, and the use of Potts models as sequence landscapes in quantitative models of protein evolution^{14,15}.
Contacts between residues in the protein fold are extracted from the strongest epistatic couplings between double mutations, i.e., from the direct couplings giving DCA its name^{6}. These couplings are essential input features in the wave of deep-learning (DL) methods currently revolutionizing the field of protein-structure prediction^{16,17,18,19}.
The generative implementation bmDCA^{5} is able to generate artificial but functional amino-acid sequences^{20,21}. Such observations suggest novel but almost unexplored approaches towards data-driven protein design, which complement current approaches based mostly on large-scale experimental screening of randomized sequence libraries or time-intensive biomolecular simulation, typically followed by sequence optimization using directed evolution, cf. refs. ^{22,23} for reviews.
Here we propose a simple model architecture called arDCA, based on a shallow (one-layer) autoregressive model paired with generalized logistic regression. Such models are computationally very efficient: they can be learned in a few minutes, as compared to days for bmDCA and more involved architectures. Nevertheless, we demonstrate that arDCA provides highly accurate generative models, comparable to the state of the art in mutational-effect and residue-contact prediction. Their simple structure makes them more robust in the case of limited data. Furthermore, and this may have important applications in homology detection^{24}, our autoregressive models are the only generative models we know of that allow for calculating exact sequence probabilities, and not only non-normalized sequence weights. Thereby arDCA enables the comparison of the same sequence in different models for different protein families. Last but not least, the entropy of arDCA models, which is related to the size of the functional sequence space associated with a given protein family, can be computed much more efficiently than in bmDCA.
Before proceeding, we provide here a short review of the state of the art in generative protein modeling. The literature is extensive and rapidly growing, so we will concentrate on the methods most directly relevant to the scope of our work.
We focus on generative models purely based on sequence data. The sequences belong to homologous protein families and are given in the form of multiple sequence alignments (MSA), i.e., as a rectangular matrix \(\mathcal{D}=(a_i^m \mid i=1,\dots,L;\ m=1,\dots,M)\) containing M aligned proteins of length L. The entries \(a_i^m\) equal either one of the standard 20 amino acids or the alignment gap “–”. In total, we have q = 21 possible different symbols in \(\mathcal{D}\). The aim of unsupervised generative modeling is to learn a statistical model P(a_{1},..., a_{L}) of (aligned) full-length sequences, which faithfully reflects the variability found in \(\mathcal{D}\): sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Furthermore, a new artificial MSA \(\mathcal{D}'\) sampled sequence by sequence from the model P(a_{1},..., a_{L}) should be statistically and functionally indistinguishable from the natural MSA \(\mathcal{D}\) given as input.
A way to achieve this goal is the above-mentioned use of Boltzmann-machine learning based on conservation and coevolution, which leads to pairwise-interacting Potts models, i.e., bmDCA^{5}, and related methods^{25,26,27}. An alternative implementation of bmDCA, including the decimation of statistically irrelevant couplings, has been presented in ref. ^{28} and is the one used as a benchmark in this work; the Mi3 package^{29} also provides a GPU-accelerated implementation.
However, Potts models or Boltzmann machines are not the only generative-model architectures explored for protein sequences. Latent-variable models like restricted Boltzmann machines^{30} or Hopfield-Potts models^{31} learn dimensionally reduced representations of proteins; using sequence motifs, they are able to capture groups of collectively evolving residues^{32} better than DCA models, but are less accurate in extracting structural information from the training MSA^{31}.
An important class of generative models based on latent variables are variational autoencoders (VAE), which achieve dimensional reduction in the flexible and powerful setting of deep learning. The DeepSequence implementation^{33} was originally designed and tested for predicting the effects of mutations around a given wild type. It currently provides one of the best mutational-effect predictors, and we will show below that arDCA provides comparable quality of prediction for this specific task. The DeepSequence code has been modified in ref. ^{34} to explore its capacity to generate artificial sequences that are statistically indistinguishable from the natural MSA; its performance was shown to be substantially less accurate than that of bmDCA. Another implementation of a VAE was reported in ref. ^{35}; also in this case the generative performance is inferior to bmDCA, but the organization of latent variables was shown to carry significant information on functionality. Furthermore, some generated mutant sequences were successfully tested experimentally. Interestingly, it was also shown that training a VAE on unaligned sequences decreases performance as compared to the pre-aligned MSAs used by all before-mentioned models. This observation was complemented by ref. ^{36}, which reported a VAE implementation trained on non-aligned sequences from UniProt, with lengths 10 < L < 1000. The VAE had good reconstruction accuracy for small L < 200, which however dropped significantly for larger L. The latent space also in this case shows an interesting organization in terms of function, which was used to generate in-silico proteins with desired properties, but no experimental test was provided. The paper does not report any statistical test of the generative properties (such as a Pearson correlation of two-point correlations), and since the code is not yet publicly available, a quantitative comparison to our results is currently impossible.
Another interesting DL architecture is that of a Generative Adversarial Network (GAN), which was explored in^{37} on a single family of aligned homologous sequences. While the model has a very large number of trainable parameters (~60 M), it seems to reproduce well the statistics of the training MSA, and most importantly, the authors could generate an enzyme with only 66% identity to the closest natural one, which was still found to be functional in vitro. An alternative implementation of the same architecture was presented in^{38}, and applied to the design of antibodies; also in this case the resulting sequences were validated experimentally.
Not all generative models for proteins are based on sequence ensembles. Several research groups explored the possibility of generating sequences with a given three-dimensional structure^{39,40,41}, e.g., via a VAE^{42} or a graph neural network^{43}, or by inverting structural prediction models^{44,45,46,47}. It is important to stress that this is a very different task from ours (our work does not use structure), so a direct comparison between our work and these approaches is difficult. It would be interesting to explore, in future work, the possibility of unifying the different approaches and using sequence and structure jointly for constructing improved generative models.
In summary, for the specific task of interest here, namely generating an artificial MSA statistically indistinguishable from the natural one, one can take as reference models bmDCA^{5,28} in the context of Potts-model-like architectures, and DeepSequence^{33} in the context of deep networks. We will show in the following that arDCA performs comparably to bmDCA, and better than DeepSequence, at a strongly reduced computational cost. From anecdotal evidence in the works mentioned above, and in agreement with general observations in machine learning, it appears that deep architectures may be more powerful than shallow ones, provided that very large datasets and computational resources are available^{33}. Indeed, we will show that for the related task of single-mutation predictions around a wild type, DeepSequence outperforms arDCA on rich datasets, while the inverse is true on small datasets.
Results
Autoregressive models for protein families
Here we propose a computationally efficient approach based on autoregressive models, cf. Fig. 1 for an illustration of the approach and the model architecture. We start from the exact decomposition

$$P(a_1,\dots,a_L) = P(a_L \mid a_{L-1},\dots,a_1)\,P(a_{L-1} \mid a_{L-2},\dots,a_1)\cdots P(a_2 \mid a_1)\,P(a_1) \qquad (1)$$

of the joint probability of a full-length sequence into a product of more and more involved conditional probabilities P(a_{i}∣a_{i−1},..., a_{1}) of the amino acids a_{i} in single positions, conditioned on all previously seen positions a_{i−1},..., a_{1}. While this decomposition is a direct consequence of Bayes’ theorem, it suggests an important change in viewpoint on generative models: while learning the full P(a_{1},..., a_{L}) from the input MSA \(\mathcal{D}\) is a task of unsupervised learning (sequences are not labeled), learning the factors P(a_{i}∣a_{i−1},..., a_{1}) becomes a task of supervised learning, with (a_{i−1},..., a_{1}) being the input (feature) vector, and a_{i} the output label (in our case a categorical q-state label). We can thus build on the full power of supervised learning, which is methodologically more explored than unsupervised learning^{48,49,50}.
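The chain-rule factorization of Eq. (1) can be illustrated with a minimal sketch: one conditional distribution per position, with the joint probability obtained as their product. All sizes and parameter values below are illustrative toys, not those of a real protein family.

```python
import numpy as np

# Toy illustration of the autoregressive chain rule: the joint distribution
# over sequences is the product of per-position conditionals, each of which
# could be learned as a supervised multi-class classifier.

q, L = 3, 4                     # q states per position, sequence length
rng = np.random.default_rng(0)

# One conditional table P(a_i | a_{i-1},...,a_1) per position i,
# stored densely over all q**i prefixes (tractable only at toy scale).
conditionals = []
for i in range(L):
    t = rng.random((q,) * i + (q,))
    conditionals.append(t / t.sum(axis=-1, keepdims=True))

def joint_prob(seq):
    """P(a_1,...,a_L) as the product of the conditionals."""
    p = 1.0
    for i, a in enumerate(seq):
        p *= conditionals[i][tuple(seq[:i]) + (a,)]
    return p

# The factorization is exactly normalized: the probabilities of all q**L
# sequences sum to one, with no intractable partition function.
total = sum(joint_prob(s) for s in np.ndindex(*(q,) * L))
```

Because every conditional is normalized, the product is automatically a normalized distribution over sequences, which is the key structural advantage exploited below.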
In this work, we choose the following parameterization, previously used in the context of statistical mechanics of classical^{51} and quantum^{52} systems:

$$P(a_i \mid a_{i-1},\dots,a_1) = \frac{\exp\left\{h_i(a_i)+\sum_{j=1}^{i-1}J_{ij}(a_i,a_j)\right\}}{z_i(a_{i-1},\dots,a_1)}, \qquad (2)$$

with \(z_i(a_{i-1},\dots,a_1)=\sum_{a_i}\exp\{h_i(a_i)+\sum_{j=1}^{i-1}J_{ij}(a_i,a_j)\}\) being a normalization factor. In machine learning, this parameterization is known as soft-max regression, the generalization of logistic regression to multi-class labels^{50}. This choice, as detailed in the section “Methods”, enables a particularly efficient parameter learning by likelihood maximization, and leads to a speedup of 2–3 orders of magnitude over bmDCA, as reported in Table 1. Because the resulting model is parameterized by a set of fields h_{i}(a) and couplings J_{ij}(a, b) as in DCA, we dub our method arDCA.
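A minimal sketch of the soft-max conditional of Eq. (2): a length-q probability vector over a_i, built from a field h_i and couplings J_ij to the already fixed positions j < i. The parameter values here are random placeholders, not learned ones.

```python
import numpy as np

# Soft-max (multi-class logistic) conditional of an arDCA-style model.
q, L = 21, 5
rng = np.random.default_rng(1)
h = rng.normal(size=(L, q))               # fields h_i(a)
J = rng.normal(size=(L, L, q, q)) * 0.1   # couplings J_ij(a, b); only j < i is used

def conditional(i, prefix):
    """P(a_i = . | a_{i-1},...,a_1) for a given prefix (a_1,...,a_{i-1})."""
    logits = h[i].copy()
    for j, a_j in enumerate(prefix):      # sum over previous positions only
        logits += J[i, j, :, a_j]
    w = np.exp(logits - logits.max())     # max-shift for numerical stability
    return w / w.sum()                    # the local z_i normalizes over the q states

p = conditional(3, [0, 5, 17])
```

Note that normalization only ever involves a sum over q states, never over all q^L sequences.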
Besides comparing the performance of this model to bmDCA and DeepSequence, we will also use simple “fields-only” models, also known as profile models or independent-site models. In these models, the joint probability of all positions in a sequence factorizes over all positions, P(a_{1},..., a_{L}) = ∏_{i=1,...,L} f_{i}(a_{i}), without any conditioning on the sequence context. Using maximum-likelihood inference, each factor f_{i}(a_{i}) equals the empirical frequency of amino acid a_{i} in column i of the input MSA \(\mathcal{D}\).
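A profile model can be fitted in a few lines: each column's factor f_i(a) is the empirical amino-acid frequency in that MSA column. The toy MSA below uses q = 3 symbols; a real MSA would use q = 21 and typically pseudocounts for regularization.

```python
import numpy as np

# Minimal profile (independent-site) model fitted by maximum likelihood.
msa = np.array([[0, 1, 2],
                [0, 1, 1],
                [0, 2, 2],
                [1, 1, 2]])          # M=4 sequences, L=3 positions (toy data)
q = 3
M, L = msa.shape

# Column-wise frequencies f_i(a): the maximum-likelihood profile parameters.
f = np.zeros((L, q))
for i in range(L):
    counts = np.bincount(msa[:, i], minlength=q)
    f[i] = counts / M

def profile_logprob(seq):
    """log P_prof(seq) = sum_i log f_i(a_i); -inf if an unseen state occurs."""
    with np.errstate(divide="ignore"):
        return np.log(f[np.arange(L), seq]).sum()
```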
A few remarks are needed.
Eq. (2) has striking similarities to standard DCA^{4}, but also important differences. The two have exactly the same number of parameters, but their meaning is quite different. While DCA has symmetric couplings J_{ij}(a, b) = J_{ji}(b, a), the parameters in Eq. (2) are directed and describe the influence of site j on site i for j < i only, i.e., only one triangular part of the J-matrix is filled.
The inference in arDCA is very similar to plmDCA^{53}, i.e., to DCA based on pseudolikelihood maximization^{54}. In particular, both in arDCA and plmDCA the gradient of the likelihood can be computed exactly from the data, while in bmDCA it has to be estimated via Monte Carlo Markov Chain (MCMC), which requires the introduction of additional hyperparameters (such as the number of chains, the mixing time, etc.) that can have an important impact on the quality of the inference, see^{55} for a recent detailed study.
In plmDCA, however, each a_{i} is conditioned on all other a_{j} in the sequence, and not only on partial sequences. The resulting directed couplings are usually symmetrized, akin to standard Potts models. On the contrary, the J_{ij}(a, b) that appear in arDCA cannot be interpreted as “direct couplings” in the DCA sense, cf. below for details on arDCA-based contact prediction. Moreover, plmDCA has limited capacities as a generative model^{5}: symmetrization moves parameters away from their maximum-likelihood values, probably causing a loss in model accuracy. No such symmetrization is needed for arDCA.
arDCA, contrary to all other DCA methods, allows for calculating the probabilities of single sequences. In bmDCA, we can only determine sequence weights, but the normalizing factor, i.e., the partition function, remains inaccessible to exact calculation; expensive thermodynamic integration via MCMC sampling is needed to estimate it. The conditional probabilities in arDCA are individually normalized; instead of summing over q^{L} sequences, we need to sum L times over the q states of individual amino acids. This may turn out to be a major advantage when the same sequence is to be compared across different models, as in homology detection and protein-family assignment^{56,57}, cf. the example given below.
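The exact-probability property can be sketched directly: log P is a sum of L local log-softmax terms, each normalized by a local z_i over only q states. The sizes below are toy-scale so that global normalization can even be verified by brute force; the parameters are random placeholders.

```python
import numpy as np

# Exact sequence log-probability in an arDCA-style model:
# log P = sum_i [ h_i(a_i) + sum_{j<i} J_ij(a_i, a_j) - log z_i ].
q, L = 3, 4
rng = np.random.default_rng(2)
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q)) * 0.1

def log_prob(seq):
    lp = 0.0
    for i, a in enumerate(seq):
        logits = h[i] + sum(J[i, j, :, seq[j]] for j in range(i))
        m = logits.max()
        lp += logits[a] - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    return lp

# Brute-force check of global normalization (feasible only at toy scale):
# the exact probabilities of all q**L sequences sum to one.
total = sum(np.exp(log_prob(s)) for s in np.ndindex(*(q,) * L))
```

The same L-term sum yields exact, comparable log-probabilities for any sequence under any family's model, with no partition-function estimation.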
The ansatz in Eq. (2) can be generalized to more complicated relations. We have tested a twolayer architecture but did not observe advantages over the simple softmax regression, as will be discussed at the end of the paper.
Thanks, in particular, to the possibility of calculating the gradient exactly, arDCA models can be inferred much more efficiently than bmDCA models. Typical inference times are given in Table 1 for five representative families, and show a speedup of about 2–3 orders of magnitude with respect to the bmDCA implementation of ref. ^{28}, both running on a single Intel Xeon E5-2620 v4 2.10 GHz CPU. We also tested the Mi3 package^{29}, which is able to learn similar bmDCA models in about 60 min for the PF00014 family and 900 min for the PF00595 family, while running on two TITAN RTX GPUs, thus remaining much more computationally demanding than arDCA.
The positional order matters
Eq. (1) is valid for any order of the positions, i.e., for any permutation of the natural positional order in the amino-acid sequences. This is no longer true when we parameterize the P(a_{i}∣a_{i−1},..., a_{1}) according to Eq. (2): different orders may give different results. In Supplementary Note 1, we show that the likelihood depends on the order and that we can optimize over orders. We also find that the best orders are correlated with the entropic order, where we select first the least entropic, i.e., most conserved, variables, progressing successively towards the most variable positions of highest entropy. The site entropy \(s_i=-\sum_a f_i(a)\log f_i(a)\) can be directly calculated from the empirical frequencies f_{i}(a) of all amino acids a in site i.
Because the optimization over the L! possible site orderings is very time-consuming, we use the entropic order as a practical heuristic choice. In all our tests, described in the next sections, the entropic order does not perform significantly worse than the best optimized order we found.
A closetoentropic order is also attractive from the point of view of interpretation. The most conserved sites come first. If the amino acid on those sites is the most frequent one, basically no information is transmitted further. If, however, a suboptimal amino acid is found in a conserved position, this has to be compensated by other mutations, i.e., necessarily by more variable (more entropic) positions. Also, the fact that variable positions come last, and are modeled as depending on all other amino acids, is well interpretable: these positions, even if highly variable, are not necessarily unconstrained, but they can be used to finely tune the sequence to any suboptimal choices done in earlier positions.
For this reason, all following tests are done using increasing entropic order, i.e., with sites ordered before model learning by increasing empirical s_{i} values. Supplementary Figs. 1–3 show a comparison with alternative orderings, such as the direct one (from 1 to L), several random ones, and the optimized one, cf. also Table 1 for some results.
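The entropic ordering itself is a one-liner once the site entropies are computed from the column frequencies, as in this toy sketch (3 symbols instead of 21):

```python
import numpy as np

# Entropic site ordering: compute s_i = -sum_a f_i(a) log f_i(a) per column,
# then reorder columns from most conserved (lowest s_i) to most variable.
msa = np.array([[0, 1, 2],
                [0, 0, 1],
                [0, 2, 1],
                [0, 1, 0]])   # toy MSA: column 0 is fully conserved
q = 3
M, L = msa.shape

s = np.zeros(L)
for i in range(L):
    f = np.bincount(msa[:, i], minlength=q) / M
    nz = f[f > 0]                       # 0 log 0 = 0 by convention
    s[i] = -(nz * np.log(nz)).sum()

entropic_order = np.argsort(s)          # increasing entropy
msa_reordered = msa[:, entropic_order]  # columns fed to the model in this order
```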
arDCA provides accurate generative models
To check the generative property of arDCA, we compare it with bmDCA^{5}, i.e., the most accurate generative version of DCA obtained via Boltzmann machine learning. bmDCA was previously shown to be generative not only in a statistical sense, but also in a biological one: sequences generated by bmDCA were shown to be statistically indistinguishable from natural ones, and most importantly, functional in vivo for the case of chorismate mutase enzymes^{20}. We also compare the generative property of arDCA with DeepSequence^{33,34} as a prominent representative of deep generative models.
To this aim, we compare the statistical properties of natural sequences with those of independently and identically distributed (i.i.d.) samples drawn from the different generative models P(a_{1},..., a_{L}). At this point, another important advantage of arDCA comes into play: while generating i.i.d. samples from, e.g., a Potts model requires MCMC simulations, which in some cases may have very long decorrelation times and thus become tricky and computationally expensive^{28,55} (cf. also Supplementary Note 2 and Supplementary Fig. 4), drawing a sequence from the arDCA model P(a_{1},..., a_{L}) is very simple and does not require any additional parameter. The factorized expression Eq. (1) allows for sampling amino acids position by position, following the chosen positional order, cf. the detailed description in Supplementary Note 2.
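Ancestral sampling from the factorized model can be sketched as follows: draw a_1 from P(a_1), then each a_i from P(a_i | a_{i−1},..., a_1), with no MCMC and no convergence checks. The parameters are random placeholders at toy scale.

```python
import numpy as np

# Position-by-position (ancestral) sampling from an autoregressive model.
q, L = 21, 10
rng = np.random.default_rng(3)
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q)) * 0.05

def sample_sequence():
    seq = []
    for i in range(L):
        # Conditional distribution given the already sampled prefix.
        logits = h[i] + sum(J[i, j, :, seq[j]] for j in range(i))
        w = np.exp(logits - logits.max())
        seq.append(rng.choice(q, p=w / w.sum()))
    return seq

sample = sample_sequence()
```

Each call returns an exact i.i.d. sample from the model distribution, at a cost of L softmax evaluations per sequence.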
Figure 2a–c shows the comparison of the one-point amino-acid frequencies f_{i}(a), and the connected two-point and three-point correlations

$$c_{ij}(a,b)=f_{ij}(a,b)-f_i(a)f_j(b),$$
$$c_{ijk}(a,b,c)=f_{ijk}(a,b,c)-f_{ij}(a,b)f_k(c)-f_{ik}(a,c)f_j(b)-f_{jk}(b,c)f_i(a)+2f_i(a)f_j(b)f_k(c) \qquad (3)$$

of the data with those estimated from a sample of the arDCA model. Results are shown for the response-regulator Pfam family PF00072^{2}. Other proteins are shown in Table 1 and Supplementary Note 3, Supplementary Figs. 5–6. We find that, for these observables, the empirical and model averages coincide very well, as well as or even slightly better than in the bmDCA case. In particular, for the one-point and two-point quantities, this is quite surprising: while bmDCA fits them explicitly, i.e., any deviation is due to the imperfect fitting of the model, arDCA does not fit them explicitly, and nevertheless obtains higher precision.
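The connected two-point correlations used in such tests can be computed from a one-hot encoding of the MSA; in the comparisons above, the Pearson correlation is taken between the statistics of the natural and the generated MSA. The sketch below uses toy random data with q = 3 symbols.

```python
import numpy as np

# Connected two-point correlations c_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b).
rng = np.random.default_rng(4)
q, L, M = 3, 5, 2000
msa = rng.integers(0, q, size=(M, L))   # stand-in for an aligned MSA

onehot = np.eye(q)[msa]                 # shape (M, L, q)
f1 = onehot.mean(axis=0)                # one-point frequencies f_i(a)
f2 = np.einsum("mia,mjb->ijab", onehot, onehot) / M   # pair frequencies f_ij(a,b)
c2 = f2 - np.einsum("ia,jb->ijab", f1, f1)            # connected part

# Comparing a statistic against itself gives Pearson correlation 1 exactly;
# in the paper, the two arguments come from the natural and the sampled MSA.
pearson = np.corrcoef(c2.ravel(), c2.ravel())[0, 1]
```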
In Table 1, we also report the results for sequences sampled from DeepSequence^{33}. While its original implementation aims at scoring individual mutations, cf. the section “Predicting mutational effects via in-silico deep mutational scanning”, we apply the modification of ref. ^{34} allowing for sequence sampling. We observe that for most families, the two-point and three-point correlations of the natural data are reproduced significantly less well by DeepSequence than by both DCA implementations, confirming the original findings of ref. ^{34}. Only in the largest family, PF00072 with more than 800,000 sequences, does DeepSequence reach comparable or, in the case of the three-point correlations, even superior performance.
A second test of the generative property of arDCA is given by Fig. 2d–g. Panel d shows the natural sequences projected onto their first two principal components (PC). The other three panels show generated data projected onto the same two PCs of the natural data. We see that both arDCA and bmDCA reproduce quite well the clustered structure of the response-regulator sequences (both show a slightly broader distribution than the natural data, probably due to the regularized inference of the statistical models). On the contrary, sequences generated by a profile model P_{prof}(a_{1},..., a_{L}) = ∏_{i} f_{i}(a_{i}), which assumes independent sites, do not show any clustered structure: the projections are concentrated around the origin in PC space. This indicates that their variability is almost unrelated to the first two principal components of the natural sequences.
From these observations, we conclude that arDCA provides excellent generative models, of at least the same accuracy as bmDCA. This suggests fascinating perspectives in terms of data-guided statistical sequence design: if sequences generated from bmDCA models are functional, arDCA-sampled sequences should be functional, too. But this is obtained at a much lower computational cost, cf. Table 1, and without the need to check for convergence of MCMC, which makes the method scalable to much bigger proteins.
Predicting mutational effects via in-silico deep mutational scanning
The probability of a sequence is a measure of its goodness. For high-dimensional probability distributions, it is generally convenient to work with log-probabilities. Taking inspiration from statistical physics, we introduce a statistical energy

$$E(a_1,\dots,a_L) = -\log P(a_1,\dots,a_L) \qquad (4)$$

as the negative log-probability. We thus expect functional sequences to have very low statistical energies, while unrelated sequences show high energies. In this sense, statistical energy can be seen as a proxy for (negative) fitness. Note that in the case of arDCA, the statistical energy is not a simple sum over the model parameters as in DCA, but also contains the logarithms of the local partition functions z_{i}(a_{i−1},..., a_{1}), cf. Eq. (2).
Now, we can easily compare two sequences differing by one or a few mutations. For a single mutation a_{i} → b_{i}, where amino acid a_{i} in position i is substituted with amino acid b_{i}, we can determine the statistical-energy difference

$$\Delta E(a_i \to b_i) = E(a_1,\dots,b_i,\dots,a_L) - E(a_1,\dots,a_i,\dots,a_L). \qquad (5)$$

If negative, the mutant sequence has lower statistical energy; the mutation a_{i} → b_{i} is thus predicted to be beneficial. On the contrary, a positive ΔE predicts a deleterious mutation. Note that, even if not explicitly stated on the left-hand side of Eq. (5), the mutational score ΔE(a_{i} → b_{i}) depends on the whole sequence background (a_{1},..., a_{i−1}, a_{i+1},..., a_{L}) it appears in, i.e., on all other amino acids a_{j} in all positions j ≠ i.
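A full in-silico scan then reduces to evaluating the energy difference of Eq. (5) for every position and every target state. The sketch below uses toy sizes and random placeholder parameters in place of a trained model.

```python
import numpy as np

# In-silico single-mutant scan: score every substitution a_i -> b at every
# position i by Delta E = E(mutant) - E(wild type), with E = -log P.
q, L = 5, 6
rng = np.random.default_rng(5)
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q)) * 0.1

def energy(seq):
    """Statistical energy -log P(seq) of an arDCA-style model."""
    e = 0.0
    for i, a in enumerate(seq):
        logits = h[i] + sum(J[i, j, :, seq[j]] for j in range(i))
        m = logits.max()
        e -= logits[a] - (m + np.log(np.exp(logits - m).sum()))
    return e

wild_type = list(rng.integers(0, q, size=L))
e_wt = energy(wild_type)

# Full L x q matrix of mutational scores; wild-type entries are exactly 0.
delta_E = np.zeros((L, q))
for i in range(L):
    for b in range(q):
        mutant = wild_type.copy()
        mutant[i] = b
        delta_E[i, b] = energy(mutant) - e_wt
```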
It is now easy to perform an in-silico deep mutational scan, i.e., to determine the mutational scores ΔE(a_{i} → b_{i}) for all positions i = 1, ..., L and all target amino acids b_{i} relative to some reference sequence. In Fig. 3a, we compare our predictions with experimental data over more than 30 distinct experiments and wild-type proteins, and with state-of-the-art mutational-effect predictors. These include, in particular, predictions using plmDCA (aka evMutation^{10}), variational autoencoders (DeepSequence^{33}), and evolutionary distances between the wild type and the closest homologs showing the considered mutation (GEMME^{58}); all of these methods take, in technically different ways, the context-dependence of mutations into account. We also compare to the context-independent prediction using the above-mentioned profile models.
It can be seen that the context-dependent predictors systematically outperform the context-independent predictor, in particular for large MSAs of prokaryotic and eukaryotic proteins. The four context-dependent models perform in a very similar way. There is a small but systematic disadvantage for plmDCA, which was the first published predictor of the ones considered here.
The situation is different in the typically smaller and less diverged viral protein families. In this case, DeepSequence, which relies on data-intensive deep learning, becomes unstable. It also becomes harder to outperform profile models; plmDCA, e.g., does not achieve this. arDCA performs similarly to or, in one out of four cases, substantially better than the profile model.
To go into more detail, we have compared more quantitatively the predictions of arDCA and DeepSequence, currently considered the state-of-the-art mutational predictor. In Fig. 3b, we plot the performance of the two predictors against each other, with the symbol size proportional to the number of sequences in the training MSA of natural homologs. Almost all dots are close to the diagonal (apart from a few viral datasets), with 17/32 datasets having a better arDCA prediction and 15/32 giving an advantage to DeepSequence. The figure also shows that arDCA tends to perform better on smaller datasets, while DeepSequence takes over on larger datasets. In Supplementary Fig. 7, we have also measured the correlations between the two predictors. Across all prokaryotic and eukaryotic datasets, the two show high correlations in the range of 82–95%. These values are larger than the correlations between predictions and experimental results, which are in the range of 50–60% for most families. This observation illustrates that both predictors extract a highly similar signal from the original MSA, but this signal may be quite different from the experimentally measured phenotype. Many experiments actually provide only rough proxies for protein fitness, such as protein stability or ligand-binding affinity. To what extent such variable underlying phenotypes can be predicted by unsupervised learning based on homologous MSAs thus remains an open question.
We thus conclude that arDCA permits a fast and accurate prediction of mutational effects, in line with some of the state-of-the-art predictors. It systematically outperforms profile models and plmDCA, and is more stable than DeepSequence in the case of limited datasets. This observation, together with the better computational efficiency of arDCA, suggests that DeepSequence should be used for predicting mutational effects for individual proteins represented by very large homologous MSAs, while arDCA is the method of choice for large-scale studies (many proteins) or small families. GEMME, which is based on phylogenetic information, astonishingly performs very similarly to arDCA, even though the information taken into account seems different.
Extracting epistatic couplings and predicting residue-residue contacts
The best-known application of DCA is the prediction of residue-residue contacts via the strongest direct couplings^{6}. As argued before, the arDCA parameters are not directly interpretable in terms of direct couplings. To predict contacts using arDCA, we need to go back to the biological interpretation of DCA couplings: they represent epistatic couplings between pairs of mutations^{59}. For a double mutation a_{i} → b_{i}, a_{j} → b_{j}, epistasis is defined by comparing the effect of the double mutation with the sum of the effects of the single mutations, when introduced individually into the wild-type background:

$$\Delta\Delta E(b_i,b_j) = \Delta E(a_i \to b_i,\, a_j \to b_j) - \Delta E(a_i \to b_i) - \Delta E(a_j \to b_j), \qquad (6)$$

where the ΔE in arDCA are defined in analogy to Eq. (5). The epistatic effect ΔΔE(b_{i}, b_{j}) provides an effective direct coupling between amino acids b_{i}, b_{j} in sites i, j. In standard DCA, ΔΔE(b_{i}, b_{j}) is actually given by the direct coupling J_{ij}(b_{i}, b_{j}) − J_{ij}(b_{i}, a_{j}) − J_{ij}(a_{i}, b_{j}) + J_{ij}(a_{i}, a_{j}) between sites i and j.
For contact prediction, we can treat these effective couplings in the standard way (compute the Frobenius norm in zero-sum gauge, then apply the average-product correction, cf. Supplementary Note 5 for details). The results are represented in Fig. 4 (cf. also Supplementary Figs. 8–10). The contact maps predicted by arDCA and bmDCA are very similar, and both capture very well the topological structure of the native contact map. The arDCA method gives in this case a few more false positives, resulting in a slightly lower positive predictive value (panel c). Note, however, that the majority of the false positives of both predictors are concentrated in the upper right corner of the contact maps, in a region where the largest subfamily of response-regulator domains, characterized by the coexistence with a Trans_reg_C DNA-binding domain (PF00486) in the same protein, has a homodimerization interface.
One difference should be noted: for arDCA, the definition of effective couplings via epistatic effects depends on the reference sequence (a_{1},..., a_{L}) into which the mutations are introduced; this is not the case in DCA. So, in principle, each sequence might give a different contact prediction, and accurate contact prediction in arDCA might require a computationally heavy averaging over a large ensemble of background sequences. Fortunately, as we have checked, the predicted contacts hardly depend on the chosen reference sequence. It is therefore possible to take any arbitrary reference sequence belonging to the homologous family and determine epistatic couplings relative to this single sequence. This observation enables an enormous speedup by a factor M, with M being the depth of the MSA of natural homologs.
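Concretely, the effective couplings relative to a single reference sequence can be assembled from single- and double-mutant energy differences. In the sketch below, `delta_E` is a hypothetical helper standing in for the trained model's statistical-energy change of Eq. (5); this is an illustration, not the published implementation:

```python
import numpy as np

def epistatic_couplings(delta_E, ref, q=21):
    """Effective couplings from epistasis around one reference sequence.

    delta_E(mutations): energy change for a list of (site, new_amino_acid)
    mutations applied to `ref` (hypothetical helper wrapping the model).
    Returns J_eff of shape (L, L, q, q) with
    J_eff[i, j, b, c] = ddE(double mutant) - ddE(single i) - ddE(single j).
    """
    L = len(ref)
    # cache all single-mutant energy changes
    dE1 = np.zeros((L, q))
    for i in range(L):
        for b in range(q):
            dE1[i, b] = delta_E([(i, b)])
    J_eff = np.zeros((L, L, q, q))
    for i in range(L):
        for j in range(i + 1, L):
            for b in range(q):
                for c in range(q):
                    ddE = delta_E([(i, b), (j, c)]) - dE1[i, b] - dE1[j, c]
                    J_eff[i, j, b, c] = ddE
                    J_eff[j, i, c, b] = ddE  # keep the tensor symmetric
    return J_eff
```

The loop requires O(L² q²) energy evaluations for one reference sequence, which is precisely the factor-M saving over averaging across all M sequences of the MSA.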
The aim of this section was to compare the contact-prediction performance of arDCA with that of established methods using exactly the same data, i.e., a single MSA of the considered protein family. We have chosen bmDCA for coherence with the rest of the paper, but apart from small quantitative differences, the conclusions remain unchanged when looking at DCA variants based on mean-field or pseudo-likelihood approximations, cf. Supplementary Fig. 9. The recent success of deep-learning-based contact prediction has shown that the performance can be substantially improved if coevolution-based contact prediction for thousands of families is combined with supervised learning based on known protein structures, as done by popular methods like RaptorX, DeepMetaPSICOV, AlphaFold, or trRosetta^{16,17,18,19}. We expect that the performance of arDCA could equally be boosted by supervised learning, but this clearly goes beyond the scope of our work, which concentrates on generative modeling.
Estimating the size of a family’s sequence space
The MSA of natural sequences contains only a tiny fraction of all sequences that would have the functional properties characterizing the protein family under consideration, i.e., that might be found in newly sequenced species or be reached by natural evolution. Estimating this number \(\mathcal{N}\) of possible sequences, or their entropy \(S=\log\mathcal{N}\), is quite complicated in the context of DCA-type pairwise Potts models, requiring advanced sampling techniques^{60,61}.
In arDCA, we can explicitly calculate the sequence probability P(a_{1}, ..., a_{L}). We can therefore estimate the entropy of the corresponding protein family via
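Explicitly (our reconstruction from the surrounding definitions), the entropy can be written as

```latex
S \;=\; -\sum_{a_1,\dots,a_L} P(a_1,\dots,a_L)\,\log P(a_1,\dots,a_L)
  \;=\; -\Big\langle \sum_{i=1}^{L} \log P(a_i \mid a_{i-1},\dots,a_1) \Big\rangle_P\,,
```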
where the second line uses Eq. (4). The ensemble average 〈⋅〉_{P} can be estimated via the empirical average over a large sequence sample drawn from P. As discussed before, extracting i.i.d. samples from arDCA is particularly simple due to the factorized form of the model.
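In code, the entropy estimate amounts to a simple Monte-Carlo average over model samples; in this sketch, `log_prob` is a hypothetical helper returning the exact log-probability of a sequence (the sum of the autoregressive conditionals):

```python
import numpy as np

def estimate_entropy(sample, log_prob):
    """Monte-Carlo estimate of S = -<log P>_P from i.i.d. model samples.

    sample: sequences drawn from the model, shape (N, L)
    log_prob: maps one sequence to its exact log-probability
    Returns (entropy estimate, standard error of the estimate).
    """
    logp = np.array([log_prob(s) for s in sample])
    S = -logp.mean()
    err = logp.std(ddof=1) / np.sqrt(len(logp))
    return S, err
```

The standard error decreases as 1/√N, so the estimate can be refined simply by drawing more (cheap) samples.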
Results for the protein families studied here are given in Table 1. As an example, the entropy density equals S/L = 1.4 for PF00072. This corresponds to \(\mathcal{N}\sim 1.25\cdot 10^{68}\) sequences. While being an enormous number, it constitutes only a tiny fraction of all q^{L} ~ 1.23 ⋅ 10^{148} possible sequences of length L = 112. Interestingly, the entropies estimated using bmDCA are systematically higher than those of arDCA. On the one hand, this is no surprise: both reproduce accurately the empirical one-residue and two-residue statistics, but bmDCA is a maximum-entropy model, which maximizes the entropy given these statistics^{4}. On the other hand, our observation implies that the effective multi-site couplings in E(a_{1},..., a_{L}) resulting from the local partition functions z_{i}(a_{i−1},..., a_{1}) lead to a nontrivial entropy reduction.
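As a quick sanity check, the quoted numbers follow from the entropy density by simple exponentiation (using only the values given in the text):

```python
import math

# Worked example for PF00072: L = 112, q = 21, S/L = 1.4 (from Table 1)
L, q = 112, 21
S_per_site = 1.4
N_family = math.exp(S_per_site * L)   # functional sequences, ~1.25e68
N_total = float(q) ** L               # all length-L sequences, ~1.23e148
fraction = N_family / N_total         # ~1e-80 of sequence space
```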
Discussion
We have presented a class of simple autoregressive models, which provide highly accurate and computationally very efficient generative models for protein-sequence families. While being of comparable or even superior performance to bmDCA across a number of tests (the sequence statistics, the sequence distribution in dimensionally reduced principal-component space, the prediction of mutational effects, and residue-residue contacts), arDCA is computationally much more efficient than bmDCA. The particular factorized form of autoregressive models allows for exact likelihood maximization.
It also allows for the calculation of exact sequence probabilities (instead of sequence weights, as for Potts models). This fact is of great potential interest in homology detection using coevolutionary models, which requires comparing probabilities of the same sequence under distinct models corresponding to distinct protein families. To illustrate this idea in a simple but instructive case, we have identified two subfamilies of the PF00072 protein family of response regulators. The first subfamily is characterized by the existence of a DNA-binding domain of the Trans_reg_C protein family (PF00486), the second by a DNA-binding domain of the GerE protein family (PF00196). For each of the two subfamilies, we have randomly extracted 6000 sequences used to train subfamily-specific profile and arDCA models, with P_{1} being the model for the Trans_reg_C and P_{2} for the GerE subfamily. Using the log-odds ratio \(\log\{P_1(\mathrm{seq})/P_2(\mathrm{seq})\}\) to score all remaining sequences from the two subfamilies, the profile model was able to assign 98.6% of all sequences to the correct subfamily, and 1.4% to the wrong one. arDCA improved this to 99.7% correct and only 0.3% incorrect assignments, reducing the gray zone in subfamily assignment by a factor of 3–4. Furthermore, some of the false assignments of the profile model had quite large scores, cf. the histograms in Supplementary Fig. 11, while the false annotations of the arDCA model had scores closer to zero. Therefore, if we consider a prediction reliable only when no wrong prediction has a larger log-odds score, the fraction of reliable predictions is 97.5% for arDCA, but only 63.7% for the profile model.
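The log-odds classification itself is a one-liner once per-sequence log-probabilities are available. A minimal sketch, in which `log_p1` and `log_p2` are hypothetical helpers wrapping the two trained subfamily models:

```python
import numpy as np

def assign_subfamily(seqs, log_p1, log_p2, threshold=0.0):
    """Classify sequences by the log-odds ratio log{P1(seq)/P2(seq)}.

    log_p1, log_p2: per-sequence log-probability functions of the two
    subfamily models. Returns (+1 for subfamily 1, -1 for subfamily 2)
    per sequence, together with the raw scores.
    """
    scores = np.array([log_p1(s) - log_p2(s) for s in seqs])
    labels = np.where(scores > threshold, 1, -1)
    return labels, scores
```

The magnitude of each score doubles as a confidence measure: assignments with scores near zero fall in the gray zone discussed above.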
The importance of accurate generative models also becomes visible via our results on the size of sequence space (or sequence entropy). For the response regulators used as an example throughout the paper (and similar observations hold for all other protein families we analyzed), we find that “only” about 10^{68} out of all possible 10^{148} amino-acid sequences of the desired length are compatible with the arDCA model, and thus expected to have the same functionality and the same 3D structure as the proteins collected in the Pfam MSA. This means that a random amino-acid sequence has a probability of about 10^{−80} of actually being a valid response-regulator sequence. This number is literally astronomically small, corresponding to the probability of hitting one particular atom when selecting at random among all atoms in our universe. The importance of good coevolutionary modeling becomes even more evident when considering all proteins compatible with the amino-acid conservation patterns in the MSA: the corresponding profile model still results in an effective sequence number of 10^{94}, i.e., a factor of 10^{26} larger than the sequence space respecting also the coevolutionary constraints. As was verified in experiments, conservation alone provides insufficient information for generating functional proteins, while taking coevolution into account leads to finite success probabilities.
Reproducing the statistical features of natural sequences does not necessarily guarantee that the sampled sequences are fully functional proteins. To enhance our confidence in these sequences, we have performed two tests.
First, we have reanalyzed the bmDCA-generated sequences of ref. ^{20}, which were experimentally tested for their in-vivo chorismate-mutase activity. Starting from the same MSA of natural sequences, we have trained an arDCA model and calculated the statistical energies of all non-natural and experimentally tested sequences. As shown in Supplementary Fig. 12, these statistical energies have a Pearson correlation of 97% with the bmDCA energies reported in ref. ^{20}. In both cases, functional sequences are restricted to the region of low statistical energies.
Furthermore, we have used small samples of 10 artificial or natural response-regulator sequences as inputs for trRosetta^{19}, in a setting that allows for protein-structure prediction based only on the user-provided MSA, i.e., no homologous sequences are added by trRosetta, and no structural templates are used. As shown in Supplementary Fig. 13, the predicted structures are very similar to each other, and within a root-mean-square deviation of less than 2 Å from an exemplary PDB structure. The contact maps extracted from the trRosetta predictions are close to identical.
While these observations do not prove that arDCAgenerated sequences are functional or fold into the correct tertiary structure, they are coherent with this conjecture.
Autoregressive models can easily be extended by adding hidden layers in the ansatz for the conditional probabilities P(a_{i}∣a_{i−1},..., a_{1}), with the aim of increasing the expressive power of the overall model. For the families explored here, we found that the one-layer model of Eq. (2) is already so accurate that adding more layers results only in similar, but not superior, performance, cf. Supplementary Note 6. However, for longer or more complicated protein families, the larger expressive power of deeper autoregressive models could be helpful. Ultimately, the generative performance of such extended models should be assessed by testing the functionality of the generated sequences in experiments similar to ref. ^{20}.
Methods
Inference of the parameters
We first describe the inference of the parameters via likelihood maximization. In a Bayesian setting with a uniform prior (we discuss regularization below), the optimal parameters are those that maximize the probability of the data, given as an MSA \(\mathcal{D}=(a_i^m \mid i=1,...,L;\ m=1,...,M)\) of M sequences of aligned length L:
Each parameter h_{i}(a) or J_{ij}(a,b) appears in only one conditional probability P(a_{i}∣a_{i−1},..., a_{1}), and we can thus maximize independently each conditional probability in Eq. (8):
where
is the normalization factor of the conditional probability of variable a_{i}.
Differentiating with respect to h_{i}(a) or to J_{ij}(a, b), with j = 1,..., i − 1, we get the set of equations:
where δ_{a,b} is the Kronecker symbol. Using Eq. (9) we find
The set of equations thus reduces to a very simple form:
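Written out (our reconstruction from the surrounding definitions), these conditions read, for \(j=1,\dots,i-1\):

```latex
f_i(a) \;=\; \big\langle P(a_i=a \mid a_{i-1},\dots,a_1) \big\rangle_{\mathcal{D}}\,,\qquad
f_{ij}(a,b) \;=\; \big\langle \delta_{a_j,b}\, P(a_i=a \mid a_{i-1},\dots,a_1) \big\rangle_{\mathcal{D}}\,,
```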
where \(\langle \bullet \rangle_{\mathcal{D}}=\frac{1}{M}\sum_{m=1}^{M}\bullet^{m}\) denotes the empirical data average, and f_{i}(a), f_{ij}(a, b) are the empirical one-point and two-point amino-acid frequencies. Note that for the first variable (i = 1), which is unconditioned, there is no equation for the couplings, and the equation for the field takes the simple form f_{1}(a) = P(a_{1} = a), which is solved by \(h_1(a)=\log f_1(a)+\text{const.}\)
Unlike the corresponding equations for the Boltzmann learning of a Potts model^{5}, Eq. (12) mixes model probabilities and empirical averages, and there is no explicit equality between the model's one-point and two-point marginals and the empirical one- and two-point frequencies. This means that the ability to reproduce the empirical one-point and two-point frequencies is already a statistical test of the generative properties of the model, and not only of the fitting quality of the current parameter values.
The inference can be done very easily with any gradient-based algorithm, which updates the fields and couplings proportionally to the difference of the two sides of Eq. (12). We used the low-storage BFGS (L-BFGS) method for the inference. We also add an L2 regularization, with regularization strengths of \(\lambda_J = 10^{-4}, \lambda_h = 10^{-6}\) for the generative tests and \(\lambda_J = 10^{-2}, \lambda_h = 10^{-4}\) for mutational effects and contact prediction. A small regularization leads to better results on generative tests, but a larger regularization is needed for the prediction of contacts and of mutational effects. Contact prediction can indeed suffer from overly large parameters, and therefore a larger regularization was chosen, consistently with the one used in plmDCA. Note that the gradients are computed exactly at each iteration, as an explicit average over the data, and hence without the need for MCMC sampling. This provides an important advantage over Boltzmann-machine learning.
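The exact gradient for one conditional can be sketched as follows; this is a minimal NumPy sketch with our own conventions for parameter shapes (h_i of shape (q,), couplings J_i of shape (i, q, q) toward sites j < i), not the published implementation, which runs L-BFGS on top of such gradients:

```python
import numpy as np

def conditional_grad(msa, i, h_i, J_i, lam_h, lam_J, q=21):
    """Exact gradient of the negative log-likelihood of site i's conditional.

    msa: integer array (M, L). The gradient is (model expectation -
    empirical frequency + L2 term), as an explicit average over the data:
    no MCMC is needed because each conditional normalizes exactly.
    """
    M = msa.shape[0]
    # e[m, a] = h_i[a] + sum_{j<i} J_i[j, a, msa[m, j]]
    e = h_i[None, :] + sum(J_i[j][:, msa[:, j]].T for j in range(i))
    p = np.exp(e - e.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # P(a_i = a | a_<i), per sequence
    f_i = np.bincount(msa[:, i], minlength=q) / M
    g_h = p.mean(axis=0) - f_i + 2 * lam_h * h_i
    g_J = np.zeros_like(J_i)
    for j in range(i):
        for b in range(q):
            mask = msa[:, j] == b
            # model term <delta_{a_j,b} P(a_i=a|a_<i)>_D ...
            g_J[j, :, b] = p[mask].sum(axis=0) / M + 2 * lam_J * J_i[j, :, b]
            # ... minus empirical two-point frequency f_ij(a, b)
            g_J[j, :, b] -= np.bincount(msa[mask, i], minlength=q) / M
    return g_h, g_J
```

Each conditional can be optimized independently, which is what makes the overall likelihood maximization exact and embarrassingly parallel over sites.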
Finally, in order to partially compensate for the phylogenetic structure of the MSA, which induces correlations among sequences, each sequence is reweighted by a coefficient w_{m}^{4}:
which leads to the same equations as above, with the only modification that the empirical average becomes \(\langle \bullet \rangle_{\mathrm{data}}=\frac{1}{M_{\mathrm{eff}}}\sum_{m=1}^{M} w_m\, \bullet^{m}\). Typically, w_{m} is given by the inverse of the number of sequences having at least 80% sequence identity with sequence m, and M_{eff} = ∑_{m}w_{m} denotes the effective number of independent sequences. The goal is to remove the influence of very closely related sequences. Note, however, that such reweighting cannot fully capture the hierarchical structure of phylogenetic relations between proteins.
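The standard reweighting can be sketched in a few lines (a vectorized sketch; for very deep MSAs the pairwise identity matrix would be computed in chunks rather than all at once):

```python
import numpy as np

def sequence_weights(msa, theta=0.8):
    """Phylogenetic reweighting: w_m = 1 / #{sequences >= theta identical to m}.

    msa: integer array of shape (M, L). The neighbor count includes
    sequence m itself, so every weight lies in (0, 1].
    Returns (weights, M_eff).
    """
    # pairwise fraction of identical positions, shape (M, M)
    ident = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    neighbors = (ident >= theta).sum(axis=1)   # includes self
    w = 1.0 / neighbors
    return w, w.sum()
```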
Sampling from the model
Once the model parameters are inferred, a sequence can be generated iteratively by the following procedure:

1. Sample the first residue from P(a_{1}).
2. Sample the second residue from P(a_{2}∣a_{1}), where a_{1} was sampled in the previous step.
...
L. Sample the last residue from P(a_{L}∣a_{L−1}, a_{L−2},..., a_{2}, a_{1}).

Each step is very fast because each conditional distribution has only 21 possible values. Both training and sampling are therefore extremely simple and computationally efficient in arDCA.
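The procedure above can be sketched directly in Python (a minimal sketch with our own parameter layout, not the published Julia implementation):

```python
import numpy as np

def sample_sequence(h, J, rng, q=21):
    """Draw one sequence from the arDCA model, site by site.

    h: list of field vectors, h[i] of shape (q,)
    J: list of coupling arrays, J[i] of shape (i, q, q) toward sites j < i
    Each conditional P(a_i | a_<i) is an explicit softmax over q states,
    so no MCMC is needed and samples are exactly i.i.d.
    """
    L = len(h)
    seq = np.empty(L, dtype=int)
    for i in range(L):
        e = h[i].copy()
        for j in range(i):
            e += J[i][j][:, seq[j]]      # coupling to the already-drawn a_j
        p = np.exp(e - e.max())          # stable softmax over q states
        p /= p.sum()
        seq[i] = rng.choice(q, p=p)
    return seq
```

Generating a full sequence thus costs O(L² q) operations, with no equilibration or mixing-time concerns.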
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Code availability
Code in Python and Julia is available at https://github.com/pagnani/ArDCA.git.
Data availability
Data are available at https://github.com/pagnani/ArDCAData and were elaborated using source data freely downloadable from the Pfam database (http://pfam.xfam.org/)^{2}, cf. Supplementary Table 1. The repository also contains sample MSAs generated by arDCA. The input data for Fig. 3 are provided by the GEMME paper^{58}, cf. also Supplementary Table 2.
Change history
01 April 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41467-022-29593-x
References
UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R. & Luciani, A. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
De Juan, D., Pazos, F. & Valencia, A. Emerging methods in protein coevolution. Nat. Rev. Genet. 14, 249–261 (2013).
Cocco, S., Feinauer, C., Figliuzzi, M., Monasson, R. & Weigt, M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 81, 032601 (2018).
Figliuzzi, M., Barrat-Charlaix, P. & Weigt, M. How pairwise coevolutionary models capture the collective residue variability in proteins? Mol. Biol. Evol. 35, 1018–1027 (2018).
Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).
Levy, R. M., Haldane, A. & Flynn, W. F. Potts Hamiltonian models of protein covariation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol. 43, 55–62 (2017).
Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2016).
Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P. & Springer, M. et al. Mutation effects predicted from sequence covariation. Nat. Biotechnol. 35, 128–135 (2017).
Cheng, R. R., Morcos, F., Levine, H. & Onuchic, J. N. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc. Natl Acad. Sci. USA 111, E563–E571 (2014).
Cheng, R. R., Nordesjö, O., Hayes, R. L., Levine, H. & Flores, S. C. et al. Connecting the sequence space of bacterial signaling proteins to phenotypes using coevolutionary landscapes. Mol. Biol. Evol. 33, 3054–3064 (2016).
Reimer, J. M. et al. Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility. Science 366, eaaw4388 (2019).
Bisardi, M., Rodriguez-Rivas, J., Zamponi, F. & Weigt, M. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. Preprint at arXiv: 2106.02441 (2021).
de la Paz, J. A., Nartey, C. M., Yuvaraj, M. & Morcos, F. Epistatic contributions promote the unification of incompatible models of neutral molecular evolution. Proc. Natl Acad. Sci. USA 117, 5873–5882 (2020).
Greener, J. G., Kandathil, S. M. & Jones, D. T. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat. Commun. 10, 1–13 (2019).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
Yang, J. et al. Improved protein structure prediction using predicted inter-residue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Russ, W. P. et al. An evolutionbased model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).
Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Coevolutionary fitness landscapes for sequence design. Angew. Chem. Int. Ed. 57, 5674–5678 (2018).
Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
Jäckel, C., Kast, P. & Hilvert, D. Protein design by directed evolution. Annu. Rev. Biophys. 37, 153–173 (2008).
Wilburn, G. W. & Eddy, S. R. Remote homology search with hidden Potts models. PLoS Comput. Biol. 16, e1008085 (2020).
Barton, J. P., De Leonardis, E., Coucke, A. & Cocco, S. ACE: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics 32, 3089–3097 (2016).
Sutto, L., Marsili, S., Valencia, A. & Gervasio, F. L. From residue coevolution to protein conformational ensembles and functional dynamics. Proc. Natl Acad. Sci. USA 112, 13567–13572 (2015).
Vorberg, S., Seemayer, S. & Söding, J. Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLoS Comput. Biol. 14, e1006526 (2018).
Barrat-Charlaix, P., Muntoni, A. P., Shimagaki, K., Weigt, M. & Zamponi, F. Sparse generative modeling via parameter reduction of Boltzmann machines: application to protein-sequence families. Phys. Rev. E 104, 024407 (2021).
Haldane, A. & Levy, R. M. Mi3-GPU: MCMC-based inverse Ising inference on GPUs for protein covariation analysis. Comput. Phys. Commun. 260, 107312 (2021).
Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. Elife 8, e39397 (2019).
Shimagaki, K. & Weigt, M. Selection of sequence motifs and generative Hopfield-Potts models for protein families. Phys. Rev. E 100, 032128 (2019).
Rivoire, O., Reynolds, K. A. & Ranganathan, R. Evolution-based functional decomposition of proteins. PLoS Comput. Biol. 12, e1004817 (2016).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
McGee, F., Novinger, Q., Levy, R. M., Carnevale, V. & Haldane, A., Generative capacity of probabilistic protein sequence models. Preprint at arXiv: 2012.02296 (2020).
HawkinsHooker, A., Depardieu, F., Baur, S., Couairon, G. & Chen, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Costello, Z. & Martin, H. G. How to hallucinate functional proteins. arXiv 1903.00458 (2019).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
Amimeur, T., Shaver, J. M., Ketchem, R. R., Taylor, J. A. & Clark, R. H. et al. Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. Preprint at bioRxiv 2020.04.12.024844 (2020).
Anand-Achim, N., Eguchi, R. R., Derry, A., Altman, R. B. & Huang, P. Protein sequence design with a learned potential. Preprint at bioRxiv 2020.01.06.895466 (2020).
Ingraham, J., Garg, V. K., Barzilay, R. & Jaakkola, T. S. Generative models for graphbased protein design. In Neural Information Processing Systems (NeurIPS) (2019).
Jing, B., Eismann, S., Suriana, P., Townshend, R. J. & Dror, R., Learning from protein structure with geometric vector perceptrons. Preprint at arXiv: 2009.01411 (2020).
Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 1–12 (2018).
Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411 (2020).
Anishchenko, I., Chidyausiku, T. M., Ovchinnikov, S., Pellock, S. J. & Baker, D. De novo protein design by deep network hallucination. bioRxiv 2020.07.22.211482 (2020).
Fannjiang, C. & Listgarten, J. Autofocused oracles for modelbased design. Preprint at arXiv: 2006.08052 (2020).
Linder, J. & Seelig, G., Fast differentiable DNA and protein sequence optimization for molecular design. Preprint at arXiv: 2005.11275 (2020).
Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).
Bishop, C. M. Pattern Recognition and Machine Learning. (Springer, 2006).
Goodfellow, I., Bengio, Y., Courville, A. & Bengio, Y. Deep Learning. Vol. 1. (MIT Press, Cambridge, 2016).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, And Prediction. (Springer Science & Business Media, 2009).
Wu, D., Wang, L. & Zhang, P. Solving statistical mechanics using variational autoregressive networks. Phys. Rev. Lett. 122, 080602 (2019).
Sharir, O., Levine, Y., Wies, N., Carleo, G. & Shashua, A. Deep autoregressive models for the efficient variational simulation of many-body quantum systems. Phys. Rev. Lett. 124, 020503 (2020).
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. & Aurell, E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E 87, 012707 (2013).
Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S.-I. & Langmead, C. J. Learning generative models for protein fold families. Proteins 79, 1061–1078 (2011).
Decelle, A., Furtlehner, C. & Seoane, B. Equilibrium and nonequilibrium regimes in the learning of restricted Boltzmann machines. Preprint at arXiv: 2105.13889 (2021).
Eddy, S. R. A new generation of homology search tools based on probabilistic inference. In Genome Informatics 2009: Genome Informatics Series. Vol. 23, 205–211. (World Scientific, 2009).
Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2005).
Laine, E., Karami, Y. & Carbone, A. GEMME: a simple and fast global epistatic model predicting mutational effects. Mol. Biol. Evol. 36, 2604–2619 (2019).
Starr, T. N. & Thornton, J. W. Epistasis in protein evolution. Protein Sci. 25, 1204–1218 (2016).
Barton, J. P., Chakraborty, A. K., Cocco, S., Jacquin, H. & Monasson, R. On the entropy of protein families. J. Stat. Phys. 162, 1267–1293 (2016).
Tian, P. & Best, R. B. How many protein sequences fold to a given structure? a coevolutionary analysis. Biophys. J. 113, 1719–1730 (2017).
Acknowledgements
We thank Indaco Biazzo, Matteo Bisardi, Elodie Laine, AnnaPaola Muntoni, Edoardo Sarti, and Kai Shimagaki for helpful discussions and assistance with the data. We especially thank Francisco McGee and Vincenzo Carnevale for providing generated samples from DeepSequence as in ref. ^{34}. Our work was partially funded by the EU H2020 Research and Innovation Programme MSCARISE2016 under Grant Agreement No. 734439 InferNet (M.W.), and by a grant from the Simons Foundation (#454955, F.Z.). J.T. is supported by a Ph.D. Fellowship of the iBio Initiative from the Idex Sorbonne University Alliance.
Author information
Authors and Affiliations
Contributions
A.P., F.Z., and M.W. designed research; J.T., G.U., and A.P. performed research; J.T., G.U., A.P., F.Z., and M.W. analyzed the data; J.T., F.Z., and M.W. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Trinquier, J., Uguzzoni, G., Pagnani, A. et al. Efficient generative modeling of protein sequences using simple autoregressive models. Nat. Commun. 12, 5800 (2021). https://doi.org/10.1038/s41467-021-25756-4