The generative capacity of probabilistic protein sequence models

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the “generative capacity” of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model’s generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE’s lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.


Introduction
Recent progress in decoding the patterns of mutations in protein multiple sequence alignments (MSAs) has highlighted the importance of mutational covariation in determining protein function, conformation and evolution, and has found practical applications in protein design, drug design, drug resistance prediction, and classification [1][2][3] .These developments were sparked by the recognition that the pairwise covariation of mutations observed in large MSAs of evolutionarily diverged sequences belonging to a common protein family can be used to fit maximum entropy "Potts" statistical models [4][5][6] .These contain pairwise statistical interaction parameters reflecting epistasis 7 between pairs of positions.Such models have been shown to accurately predict physical contacts in protein structure 6,[8][9][10] , and have been used to significantly improve the prediction of the fitness effect of mutations to a sequence compared to site-independent sequence variation models which do not account for covariation 11,12 .They are "generative" in the sense that they define the probability, p(S), that a protein sequence S results from the evolutionary process.Intriguingly, the probability distribution p(S) can be used to sample unobserved, and yet viable, artificial sequences.In practice, the model distribution p(S) depends on parameters that are found by maximizing a suitably defined likelihood function on observations provided by the MSA of a target protein family.As long as the model is well specified and generalizes from the training MSA, it can then be used to generate new sequences, and thus a new MSA whose statistics should match those of the original target protein family.We refer to probabilistic models that create new protein sequences in this way as generative protein sequence models (GPSMs).
The fact that Potts maximum entropy models are limited to pairwise epistatic interaction terms and have a simple functional form for p(S) raises the possibility that their functional form is not flexible enough to describe the data, i.e. that the model is not well specified.While a model with only pairwise interaction terms can predict complex patterns of covariation involving three or more positions through chains of pairwise interactions, it cannot model certain triplet and higher patterns of covariation that require a model with more than pairwise interaction terms 13 .For example, a Potts model cannot predict patterns described by an XOR or boolean parity function in which the n-th residue is determined by whether an odd number of the n − 1 previous residues have a certain value (see Supplementary Information).While some evidence has suggested that in the case of protein sequence data the pairwise model is sufficient and necessary to model sequence variation [14][15][16] , some of this evidence is based on averaged properties, and there appears to be some weak evidence for the possibility of rare "higher-order epistasis" affecting protein evolution 7,[17][18][19] , by which we mean the possibility that subsequence frequencies of three or more positions cannot be reproduced by a model with only pairwise interactions.Fitting maximum entropy models with all triplet interactions is not feasible without significant algorithmic innovation, since for a protein of length 100 it would require approximately 10B parameters and enormous MSA datasets to overcome finite sampling error (see Supplementary Information).However, recent developments in powerful machine learning techniques applied to images, language, and other data have shown how complex distributions p(S) can be fit with models using more manageable parameter set sizes.Building on the demonstrated power of incorporating pairwise epistasis into protein sequence models, this has motivated investigation of machine learning strategies for generative modelling of protein sequence variation which can go beyond pairwise interactions, including Restricted Boltzmann Machines (RBMs) 3 , variational autoencoders (VAEs) [20][21][22][23] , General Adversarial Networks (GANs) 24 , transformers [25][26][27][28] , and others.
One model in particular, the VAE 29,30 , has been cited as being well suited for modelling protein sequence covariation, with the potential to detect higher order epistasis 20,21 .The VAE also potentially gives additional insights into the topology of protein sequence space through examination of the "latent" (hidden) parameters of the model, which have been suggested to be related to protein sequence phylogenetic relationships [20][21][22]27 . One mplementation of a VAE-GPSM, "DeepSequence", found that the VAE model was better able to predict experimental measurements of the effect of mutations reported in deep mutational scans than a pairwise Potts model, which was attributed to the VAE's ability to model higher-order epistasis 11,20 .However, it has also been suggested by others that the improvement reported for DeepSequence could be attributed to the use of biologically motivated priors and engineering efforts, rather than because it truly captured higher-order epistasis 21 .Furthermore, while VAE-GPSMs are generative and aim to capture the protein sequence distribution p(S), to our knowledge none of these studies have thoroughly tested what we will call the "generative capacity" 31,32 of the VAE model, meaning the ability of the model to generate new sequences drawn from the model distribution p(S), which are statistically indistinguishable from those of a given "target" protein family.Testing the generative capacity, specifically higher-order covariation, of a GPSM is a fundamental check of whether the model is well specified and generalizes from the training set, two prerequisites to capturing higher-order epistasis.
Here, we perform a series of numerical experiments that explore the generative capacity of GPSMs.These GPSMs include a pairwise Potts Hamiltonian model with pairwise interaction terms (Mi3) 33 , two state-of-the-art implementations of a variational autoencoder (VAE), and a site-independent model which does not model covariation (Indep).One VAE implementation is a "standard" VAE implementation (sVAE), which uses an architecture very similar to that of EVOVAE 22 recently used to model protein sequences, though optimized with a larger number of latent dimensions, and which more closely follows the VAE inference method as it was originally presented 29,30 .The second VAE implementation is "DeepSequence", mentioned above, which uses a deeper neural network and a more sophisticated optimizer (see Methods) 20 .
We evaluate the generative capacity of these models using three standard MSA statistics, and introduce a fourth and novel metric based on averaged higher-order marginals (r 20 ) 16 .The three standard metrics are pairwise covariance correlations 2,16,34,35 , Hamming distance distributions 2,16,36,37 , and statistical energy correlations 16,20,38 .We also test how each GPSM behaves as fewer training sequences are provided, and thus how well the GPSMs generalize, using two training MSA sizes: one representing the estimated size upper bound for a Pfam protein family (10K) (see Supplementary Information) 39 , and one being large enough to eliminate out-of-sample error in our generative capacity measurements (see Results) 40 .We evaluate GPSM performance both for natural datasets and for synthetic datasets in which the ground truth is known.Our comparative analysis thus consists of four intersecting variables: three GPSMs, a suite of four generative capacity metrics, two training MSA sizes, and two dataset types.
Our results show that while VAEs model some epistasis they are unable to reproduce MSA statistics as well as Mi3, but are more accurate than Indep, which cannot model covariation, for all metrics tested.The two VAE implementations we tested perform similarly to each other across all metrics.We find that the VAEs perform more similarly to Mi3 than they do to Indep, but perform more like the Indep model for the r 20 metric than does Mi3, which is the metric that most directly tests a GPSM's ability to model higher-order sequence covariation.This result suggests that r 20 is a more sensitive measure of GPSM generative capacity than the other standard metrics, representing a novel and powerful metric to discriminate between GPSMs by averaging over higher-order covariation terms up to the 10 th order.Our work consolidates several benchmark statistics of generative capacity for Specification Error occurs when the functional form p θ (S) of a GPSM is not flexible enough to accurately model the target distribution p 0 (S) for any choice of parameters θ Out-of-Sample Error occurs when a GPSM fit to a finite training dataset fails to correctly model unseen data, and is a consequence of overfitting Estimation Error occurs due to statistical error in the MSA evaluation metrics when computed from finite evaluation and target MSAs Training MSA drawn from p 0 (S), used to train and parameterize a GPSM Target MSA drawn from p 0 (S) separately from the training MSA, used as a validation dataset for performing metrics Evaluation MSA drawn from p θ (S) for a parameterized GPSM, used to compare to the target MSA GPSMs and introduces a new one, r 20 , offering a novel framework for evaluating and interpreting GPSM accuracy in the context of higher-order covariation.By quantifying and comparing GPSM performance in our innovative epistasis-oriented approach, we hope to better understand the challenges and limitations inherent to generative modelling of natural protein sequence datasets, better gauge the state of the art, and provide insight for future efforts in terms of minimizing the confounding effects of data limitations in generative protein sequence modeling.

Target Distributions
Our goal is to set baseline expectations for the generative capacity of the GPSMs when fit to synthetic or natural protein sequence data of varying training MSA sizes.Generative models of protein MSAs define a distribution p θ (S) for the probability of a sequence S appearing in an MSA dataset given model parameters θ .The model parameters are fit by either exact or approximate maximum likelihood inference of the likelihood L = ∑ S∈MSA p θ (S) over a training MSA, using regularization techniques to prevent overfitting.The sequences in the training MSA are assumed to be identical independent samples from a "target" probability distribution p 0 (S), which is generally unknown 29 .For a model with high generative capacity, p θ (S) will closely approximate p 0 (S) 21 .We test three GPSMs: a pairwise Potts model (Mi3) 33 ; two implementations of a VAE, which are a standard VAE (sVAE) and DeepSequence 20 ; and a site-independent model (Indep), each with a different functional form of p θ (S).
It is not possible to measure the similarity of p θ (S) and p 0 (S) directly because of the high dimensionality of sequence space, since the number of sequence probabilities to compare is equal to q L , where L is the sequence length (typically ∼300) and q is the alphabet size (∼21).Instead, we measure how derived statistics computed from evaluation MSAs generated by each GPSM match those of target MSAs drawn from the target distribution.Three of these are standard metrics: the pairwise Hamming distance distribution, the pairwise covariance scores, and the GPSM's ability to predict p 0 (S) for individual sequences, also called statistical energy (see Methods).We present a fourth metric, the averaged higher order marginal accuracy r 20 .

GPSM Error and Experiments
We divide our tests into two analysis tracks: one synthetic, in which we train the GPSMs on synthetic MSAs generated from a known target distribution; and one natural, in which we train the GPSMs on a representative natural protein family MSA, the kinase super family, sequestered from Uniprot/TrEMBL 41 .These analysis tracks are meant to probe and isolate two distinct forms of error which may cause p θ (S) to deviate from p 0 (S) (see Table 1).The first is "specification error" 42 , which occurs when the functional form of p θ (S) of a model is not flexible enough to accurately model the target probability distribution p 0 (S) for any choice of parameters.A key motivation for choosing a VAE over a Potts model is its potentially lower specification error when higher-order epistasis is present 20 .Indeed, Potts models are limited to pairwise interaction terms of a particular functional form, while VAEs are not.The second form of error is "out-of-sample error" 43 , caused by a paucity of training samples, and is the consequence of overfitting 44 .Even a well-specified model could fail to generalize when fit to a small training dataset, and may mis-predict p 0 (S) for test sequences, so it follows that increasing training MSA size reduces out-of-sample error.Beyond specification and out-of-sample error, which each reflect an aspect of GPSM generalization error, there can be "estimation error" in our MSA test statistics due to the finite MSA sizes we use to estimate their values, which sets an upper bound on how well these statistics can match their target values, depending on the metric 45 .Finally, other errors may arise due to implementation limitations of the inference methods, for instance due to finite

4/18
precision arithmetic or to finite sample effects when Monte Carlo methods are used.
The synthetic analysis allows us to isolate specification error, and minimize both out-of-sample and estimation error, because here the target distribution is known exactly, and we can generate arbitrarily large training, target, and evaluation datasets.The synthetic analysis also allows us to quantify out-of-sample error by modulating the training MSA size.We specify the synthetic target probability distribution p 0 (S) to be exactly the Potts model distribution we inferred based on natural protein kinase sequence data using Mi3 in our natural analysis (see Methods) 33 .The sequences generated from this synthetic target distribution should have statistical properties similar to real, or "natural", protein family MSAs, albeit constrained by the fact that the Hamiltonian model used to generate the synthetic dataset is limited to pairwise epistatic interaction terms only 6,16,46 .Whereas Indep and the VAEs may still fail to model this known probability distribution in the synthetic analysis, our expectation is that Mi3 will be unaffected by specification error in the synthetic tests, since the target MSA is sampled from the same probability distribution used to carry out the inference.In Supplementary Information, we also perform an alternate synthetic test which does not favor Mi3 in this way, in which the target distribution is instead specified by sVAE, finding that both Mi3 and sVAE are able to fit this target sVAE distribution accurately.In our synthetic analysis, we test two synthetic training MSA sizes: (i) 1M sequences, to minimize overfitting effects and consequently out-of-sample error, thereby isolating GPSM specification error in this experiment; and (ii) 10K sequences, to illustrate the expected GPSM performance on typical datasets, as most protein families in Pfam have less than 10K independent effective sequences (see Supplementary Information) 39 .
The natural analysis examines the performance of the models on natural sequence data, which potentially contain higher-order correlations that require a Hamiltonian model with triplet or higher-order interaction terms to capture.On this dataset, the VAEs could potentially outperform Mi3, depending on the importance of higher-order epistatic terms, if present 20 .However, unlike in the synthetic analysis, here we do not know a priori the target distribution and, most importantly, we have only limited datasets for both training and evaluation.In the natural tests, the training and target MSAs each contain ∼10K non-overlapping kinase sequences from the Uniprot/TREMBL database after phylogenetic filtering at 50% sequence identity.After this filtering we consider each sequence to be an independent sample of the evolutionary process, which has the equilibrium distribution p 0 (S).
Our overall testing procedure is outlined in Fig. 1 (see Methods), and the terms discussed are summarized in Table 1.Our training datasets are either a natural protein sequence dataset obtained from Uniprot/TREMBL, or a synthetic training dataset.We then fit the GPSMs to the training datasets, and generate evaluation MSAs from each model.Finally, using our suite of four generative capacity metrics, we compare statistics of the evaluation MSAs to those of "target" MSAs, which contain sequences drawn from the target distributions and therefore represent our expectation.

Pairwise covariance correlations
We first examine the pairwise covariance scores for pairs of amino acids of an MSA defined as C i j αβ = f i j αβ − f i α f j β .Here, f i j αβ are the MSA bivariate marginals, meaning the frequency of amino acid combination α, β at positions i, j in the MSA.f i α and f j β are the univariate marginals, or individual amino acid frequencies at positions i and j.Each covariance term measures the difference between the joint frequency for pairs of amino acids and the product of the single-site residue frequencies, i.e. the expected counts in the hypothesis of statistical independence.The scores equal 0 if the two positions do not covary.Coevolving amino acids are an important aspect of sequence variation in protein MSAs, and a GPSM's ability to reproduce the pairwise covariance scores of the training dataset has been used in the past as a fundamental, non-trivial measure of the GPSM's ability to model protein sequence covariation 2,16,34,35 .
For each GPSM, we compare pairwise covariance scores for all pairs of positions and residues Ĉi j αβ in their respective evaluation MSA to the corresponding target pair C i j αβ in the target MSA using the Pearson correlation coefficient ρ({C i j αβ }, { Ĉi j αβ }) (Fig. 2, Top Row).In the synthetic tests we evaluate this statistic using 500K sequences for both the target and evaluation MSAs, while for the natural test we compare 500K evaluation sequences to the available 10K target sequences.Mi3 accurately reproduces the target covariance scores in all tests (ρ = 0.99 in Fig. 2a, ρ = 0.95 in Fig. 2b, and ρ = 0.88 in Fig. 2c respectively).The somewhat lower value for Mi3 in the natural analysis of ρ = 0.88 is accounted for entirely by increased estimation error in that test, as only 10K target sequences are available for evaluation, and the expected ρ due only to estimation error is ρ ∼0.87, which is computed by comparing the natural 10K target sequences to the natural 10K training sequences using this metric.The high generative capacity of Mi3 is expected, because Mi3 parameters are optimized to exactly reproduce the joint frequencies of pairs of amino acids from the target MSA.The VAEs' inference does not include this constraint, but we find that even when trained on the larger (1M) dataset of synthetic sequences, the covariances computed from the VAEs' evaluation MSAs are generally similar to each other, and smaller in magnitude than those of the target, showing smaller correlation with the target than Mi3 (ρ = 0.9 for sVAE and ρ = 0.92 for DeepSequence in Fig. 2a).For the VAEs, this amount of error in ρ can primarily be attributed to specification error, since training GPSMs on 1M sequences largely eliminates out-of-sample error, and the large evaluation MSAs make estimation error negligible.The VAEs' covariances are further scaled down slightly when fit to the synthetic 10K dataset (ρ = 0.87 for sVAE and ρ = 0.89 for DeepSequence in Fig. 2b).Indep cannot reproduce covariances by definition, so ρ is zero in all tests, as expected.The generative capacity trends for this metric are consistent between the synthetic and natural analyses for all GPSMs, showing the behavior is not due to artificial properties of our synthetic target model.
These results confirm that VAEs can model pairwise epistasis in protein sequence datasets, since they generate pairwise mutational covariances that are correlated with the target values, even in the absence of explicit constraints for reproducing these statistics.However, they scale down the strength of pairwise covariances in both the synthetic and natural analyses and the correlation with the target is lower than 1.Mi3, in contrast, is constrained by design to fit the pairwise covariance scores and does so nearly perfectly.

Higher order marginal statistics
A more stringent test of GPSM generative capacity is to measure the model's ability to reproduce sequence covariation involving more than two positions, or higher-order covariation.We characterize these higher-order covariation patterns in the target MSA and GPSM-generated evaluation MSAs by computing the frequency of non-contiguous amino acid n-tuples, or higher-order marginals (HOMs) corresponding to subsequences, and compare their frequency in each MSA to corresponding values in the target MSAs.For increasing values of n the number of possible n-tuple combinations increases rapidly, requiring increasingly large evaluation MSAs to accurately estimate the frequency of individual n-tuples.For this reason, we limit n-tuple length to n ≤ 10 and only compute a limited subset of all possible position sets for each n.For each n we randomly choose 3K position sets, compute the frequencies of the top twenty most frequent n-tuples for each corresponding position set in the target and evaluation MSAs, as these are well sampled, and for each position set compute the Pearson correlation between these top twenty frequencies.We then average the correlation values for each n over all position sets.We call this metric the Pearson correlation r 20 16 .In this test, estimation error is non-zero because of the extremely large MSAs required to compute n-tuple frequencies, particularly for high n > 5 (see Supplementary Information).We can predict the estimation error caused by finite sampling in the evaluation MSAs by computing the r 20 scores between two non-overlapping MSAs generated by the synthetic target model, which are of the same size as our evaluation MSAs.
In Fig. 2, Bottom Row, we plot the HOM r 20 for varying n.The expected estimation error (black line) represents a generative capacity upper bound, giving the highest measurable r 20 given the evaluation MSA size of 6M for the synthetic analysis and 10K for the natural analysis.The r 20 for Mi3 fit to 1M training sequences is very close to the validation upper-bound for all n, suggesting it has accurately fit the synthetic target distribution and its specification error is close to zero (Fig. 2d).This is expected since the synthetic target model in this test is a Potts model.With 10K training sequences, Mi3 r 20 scores are lower than the 1M result for all n (Fig. 2e), which illustrates that Mi3 is affected by out-of-sample error for typical dataset sizes, as previously described 46 .Indep has much lower r 20 scores than Mi3, as expected, since it does not model pairwise epistasis by design.Indep's r 20 scores are similar across all r 20 experiments, suggesting that it is not strongly affected by out-of-sample nor estimation error for this metric.This is expected, because its parameters are optimized for reproducing single-site frequency statistics only, which can be accurately estimated even from small training MSAs 46 .The r 20 scores for DeepSequence are slightly higher than sVAE's, but both VAEs remain close to each other for all training datasets and n, falling approximately halfway between Mi3 and Indep, and generally well below Mi3 at higher orders.Here in the 1M training MSA experiment, the VAEs' r 20 decreases to ∼0.4 and ∼0.5 at n = 10 for sVAE and DeepSequence respectively, reflecting higher specification error at higher orders (Fig. 2d).With 10K synthetic training sequences, the VAEs' r 20 decreases further for all n due to the addition of out-of-sample error (Fig. 2e).
Whereas the pairwise covariation correlations represent a preliminary indication that VAEs capture epistasis but mispredict its strength, the r 20 results reinforce this finding and extend it into higher orders.Unlike Mi3, the VAEs show specification error even when fit to large datasets from a model which only contains pairwise epistatic interaction terms (Fig. 2d).Because higher-order covariation statistics are constrained by the pairwise statistics, and the VAEs mispredict the pairwise statistics, we expect that the VAEs will exhibit specification error for higher-order epistasis, even though our synthetic test does not address this issue directly.When considered together with the r 20 ,

7/18
the pairwise covariance correlations reveal a novel insight, which is that when trying to gauge GPSM generative capacity at higher orders of covariation, the standard pairwise statistics alone can be misleading.The relative magnitudes of r 20 between models at n = 2 are different at higher n, and the performance decrease as n increases is more severe for the VAEs than Mi3 (Fig. 2, Bottom Row).In the natural analysis, Mi3 performs as close to the target as is measurable within error given the limited amount of data (Fig. 2f).Here, the small 10K target MSA causes large estimation error, making it more difficult to distinguish the performance of the different models.We emphasize that the r 20 performance decrease for Mi3 when tested against the synthetic and natural MSA targets can be accounted for entirely by out-of-sample error and estimation error, consistent with the hypothesis that pairwise interaction terms are sufficient to model protein sequence variation.

Hamming distance distributions
We next evaluate the pairwise Hamming distance distribution metric d(S, S ).The Hamming distance between two protein sequences is the number of amino acids that are different between them, and we obtain a distribution for an MSA by comparing all pairs of sequences.Because it characterizes the range of sequence diversity in an MSA, recapitulation of the Hamming distance distribution has been used in the past as a measure for GPSM performance 2,16,36,37 .In Fig. 3, we compare the pairwise Hamming distance distribution for each GPSM to that of the target distribution, computed with evaluation and target MSAs of 10K sequences each.To quantify the difference between the GPSM and target distributions for this metric, we use the total variation distance (TVD) 47 , which equals 1 when the distributions have no overlap and is 0 when they are identical, defined by All models reproduce the mode Hamming distance of ∼179.For Mi3, we report the same TVD =0.007 when trained on either 1M (Fig. 3a) or 10K (Fig. 3b) synthetic sequences, showing negligible specification error, as expected.When trained on 10K natural sequences, Mi3 TVD increases to 0.012 (Fig. 3c), for reasons discussed further below.Indep severely underestimates the probability of both low and high Hamming distances, as observed at the distribution tails, with TVD ∼0.24 across all experiments.The VAEs perform in between Mi3 and Indep, but much closer to Mi3 than Indep with respect to TVD.Performance differences across all GPSMs for this metric indicate that out-of-sample error has a consistent and detectable, though very small, effect on the fundamental sequence diversity of artificial GPSM-generated MSAs.That Mi3 and the VAEs are highly performant and comparable to each other, but not Indep, corroborates our earlier findings that epistasis is relevant to accurate modelling of protein sequence diversity (Fig. 2).However, because Indep performs much closer to Mi3 and the VAEs for this metric than on any other, and also because this metric cannot discriminate well between Mi3 and the VAEs, we suspect that reproducing the Hamming distance distribution is a much easier hurdle for GPSMs than is reproducing higher-order covariation.This shortcoming of the standard Hamming distance distribution metric becomes apparent when these results are compared to those of our novel metric, r 20 , which does show a significant gap in generative capacity between Mi3 and the VAEs at higher orders (Fig. 2, Bottom Row).
To emphasize the decay of the tails, we rescale all the distributions by their maxima and re-center them around their modes to give them the same peak, and then plot them on a log-log scale (Fig. 3, Bottom Row).The relevance of the distributions' tails lies in their power-law behavior as they approach zero, where the function's exponent is related to the intrinsic dimension of the dataset and therefore to the number of informative latent factors needed to explain the data 36,37,48 .A well-specified GPSM ought to reproduce this exponent, and therefore the tail's decay, since it is a topological property intrinsic to the dataset and independent from the particular choice of variables used to describe the probability density 48 .There is a trend of slightly decreasing generative capacity as training samples decrease, which is detectable only here in the log-scaled Hamming distribution (Fig. 3, Bottom Row).In this modified rendering of the Hamming distance distribution, differences in GPSM generative capacity can be observed at both low (Left Tail) and high (Right Tail) sequence diversity.The Mi3 distribution closely overlaps the target distribution with both 1M (Fig. 3d) and 10K (Fig. 3e) synthetic training sequences.In the 10K natural experiment, Mi3 deviates noticeably from the target on the left tail (Fig. 3f), which represents less evolutionarily diverged sequences.This could be an artifact of the phylogenetic relationships between sequences present in the natural dataset, which may have been incompletely removed by our phylogenetic filtering step for this dataset (see Methods, Supplementary Information), or it could be due to estimation error in measuring the target distribution, as only 10K target sequences are available to estimate the black line in the natural analysis.As before, Indep performance is consistently low across all experiments.The VAEs' performance at low sequence diversity (Fig. 3, Bottom Row, Left Tails) decreases for smaller training dataset size.

Statistical Energy Correlations
A fourth metric we use to evaluate generative capacity is the statistical energy E(S) of individual sequences in the dataset, which we express using the negative logarithm of the predicted sequence probability p(S), where E(S) = − log p(S).E(S) can be computed analytically for Mi3 and Indep, and estimated for VAE models by importance sampling (see Methods).This statistic directly evaluates accuracy of the GPSM distribution values from p θ (S) for a limited number of individual sequences, which has been used to validate GPSMs by comparison to corresponding experimental fitness values 16,20,38,49 .In Fig. 4, we compare artificial statistical energies from the GPSM distribution p θ (S) to those of the target distribution p 0 (S) for a 1K test MSA generated from p 0 (S).This measurement cannot be performed for the natural 10K experiment because p 0 (S) is unknown for the natural data, so we present results only for the 1M and 10K synthetic experiments.We use the 1M (Fig. 4 Juxtaposing the E(S) results to the r 20 results reiterates the striking insight of our work, which is that despite the utility of standard metrics for measuring sequence statistics, they say little about a GPSM's ability to capture higher-order covariation, and therefore necessarily higher-order epistasis.If considered by itself, the E(S) metric would indicate that both VAEs have generative capacity close to that of Mi3, and even Indep could be said to have a large amount of generative capacity, in spite of capturing no covariation by design.But according to r 20 , which directly measures the higher-order marginals, the VAEs and Indep are comparable to Mi3 only at the pairwise level.Critically, at higher orders, the performance difference between the VAEs to each other remains relatively unchanged, but is significantly lower than Mi3, with Indep lower still.This suggests that E(S), just as with the Hamming distance metric and pairwise covariance correlation, represents an easier, and perhaps different, hurdle for GPSMs than measuring generative capacity by the ability to capture higher-order sequence covariation, as done uniquely by r 20 .

Conclusions
In this study we reveal the steep challenges and limits entailed by current measurements of generative capacity in GPSMs trained on either synthetic or currently available natural sequence datasets.Recent state-of-the-art GPSM studies have benchmarked their models by comparing p θ (S) to experimental fitness values from deep mutational scans 20,22 , or by generating artificial sequences that appear to fold into realistic structures based on in silico folding energy 25 .However, these strategies for model evaluation present their own challenges.Point mutation fitness experiments in practice may measure particular contributors to fitness, including replicative capacity, drug resistance, protein stability, and enzymatic activity 20 , and are subject to significant experimental error and other limitations, e.g.those imposed by conservation 50 .Likewise, in silico computational chemistry approaches rely on energy functionals that may lack the precision needed to meaningfully discriminate between highly similar sequences 51 .Additionally, neither protein function nor fitness rely exclusively on the thermodynamic stability of static native structures, but also on the protein's conformational dynamics [52][53][54][55][56] , which are not fully described by folding energy values alone 57 .This could mean that despite generating sequences with realistic in silico folding energy, a GPSM may still not be capturing crucial higher-order epistatic effects.Neither point mutation fitness effects, nor in silico folding energy estimations, are directly related to mutational covariation statistics observed in an MSA in the sense that they do not check if subsets of covarying amino acids in specific positions present in the target MSA are indeed present in the GPSM-generated evaluation MSA.Our novel r 20 metric uniquely delivers that functionality, emphasizing higher-order covariation where previous studies rarely go beyond the pairwise level 2,16,34,35,58 .
Benchmarking coevolution-based protein sequence models in data rich and data poor regimes, as done here, is an effective method for ascertaining where data-driven effects stop, and algorithmic failure begins 40 .Due to the limited availability of natural protein sequence data, this line is inherently blurred in the natural analysis, as we demonstrate across all four of our generative capacity metrics.But in our synthetic analysis track, we have demonstrated the extent to which VAEs, with different implementations, can capture higher-order covariation at orders between three and ten when the target distribution is known, its statistical properties are measurable with a high degree of certainty, and major forms of error are removed, minimized, or accounted for.When given a 10/18 For each scatterplot, Pearson correlation coefficient ρ was computed between each GPSM's statistical energy and that of the synthetic target distribution for each sequence.These statistical energies are unit-less, and so we have shifted the data along the y-axis to compare E(S) on the same scale.Only Indep is insensitive to decreased training sample size for this metric.

11/18
large number of training and target sequences, we found both VAEs' generative capacity to be between that of a site-independent model (Indep) and a pairwise Hamiltonian (Mi3) for all measurements.In the synthetic r 20 tests, our results show that both VAEs' generative capacity is well below Mi3, raising questions about whether VAEs can capture higher-order epistasis significantly better than a pairwise Potts model.In our synthetic analysis the target distribution is a Potts model, and therefore we expect Mi3 to fit the target distribution well by design.However, we find Mi3 also outperforms the VAEs where we do not have this expectation, such as on the natural target distribution and on a target distribution specified by sVAE, as shown in Supplementary Information, suggesting Mi3 generally outperforms VAEs on protein sequence data with respect to generative capacity.The Hamming distance distributions, pairwise covariation correlations, and statistical energy correlations are standard metrics that have been used in the past to measure GPSM accuracy, but we find that they can be inadequate or misleading indicators of a GPSM's ability to capture covariation at higher orders.Taken together, our results suggest that, of the metrics we tested, only r 20 provides the granularity needed to discriminate between different GPSM's ability to model higher-order epistasis, as it directly tests the model's ability to capture higher-order covariation.
Although our results suggest VAEs are less effective for capturing higher-order epistasis than pairwise Potts models, VAEs have demonstrated utility in unsupervised learning and clustering.One VAE-GPSM, "BioSeqVAE", has generated artificial sequences that share a "hallucinated" homology to natural proteins in the training set, which could mean that their folded structures would perform similar functions to their hallucinated natural homologs 23 .Another, "PEVAE", has shown that a VAE-GPSM's latent space captures phylogenetic relationships 21 better than PCA 59 and t-SNE 60 .These VAE-GPSMs furnished a latent space that immediately allowed for function-based protein classification, a benefit unavailable to pairwise Potts models without some effort.
The causes of VAE performance limitations are actively being investigated in literature from multiple directions.One line of inquiry involves a VAE phenomenon known as "posterior collapse" (see Supporting information), in which some dimensions of the VAE latent space become insensitive to the input data.Studies of this phenomenon have led to some insights into VAE behavior, for instance that in some situations the VAE likelihood can contain spurious local maxima 61 , and many different heuristic strategies to understand and avoid this phenomenon have been suggested [62][63][64] .In Supporting Information we test that sVAE, used in our main results, does not exhibit posterior collapse, though we can trigger it for sVAE architectures with more than 7 latent dimensions in the bottleneck layer.Another line of inquiry relates to assumptions typically made about the metric and topology of the latent space, suggesting that the commonly used Euclidean metric space or Gaussian prior distribution may not best describe particular datasets, for instance because of an effect called "manifold mismatch" 65,66 .Techniques closely related to the VAE, such as the "WAE", are also being investigated as alternatives to the VAE, with different performance characteristics 67 .Among these competing frameworks for exploring VAE behavior, we have tested VAE implementations which are currently used to generatively model protein sequences.With more nuanced latent variable models, and with better understanding of protein sequence embeddings, perhaps GPSM generative capacity could extend beyond what has been demonstrated here with state-of-the-art VAEs.
Our epistasis-oriented methodology focuses on measuring higher-order covariation, with the potential for broad applicability to various sequentially ordered data.r 20 -like measurements become possible when the data are sufficient in number, and the correlation structures between elements, both within and across samples, are statistically detectable and meaningful in some context, be it visual, biophysical, or linguistic.The convergence between data categories such as images, proteins, and language with respect to generative modelling evaluation offers the exciting opportunity of a wider, interdisciplinary audience for the work proposed here.Conversely, further development of sophisticated, data-intensive, and direct generative capacity metrics of GPSMs could reveal nuances of the correlation structure of protein sequence datasets that distinguish them from other datasets, helping to explain why r 20 -like metrics can detect higher-order covariation, whereas other metrics cannot.Our work represents not only a revision of currently prevailing paradigms of GPSM benchmarking, but also a challenge to generative protein sequence modelling more broadly, to consider how epistasis and direct higher-order covariation metrics like r 20 can inform their models and results.

Sequence dataset preparation
An outline of our sequence processing is shown in Fig. 1.In Stage 1 we obtain sequences.For the natural analysis, we use an MSA of the kinase protein superfamily which we have previously curated using sequences from the Uniprot/TREMBL database 41 .This MSA is composed of ∼20K sequences of length 232, obtained by filtering a 12/18 larger set of ∼291K sequences to remove any sequences with more than 50% sequence identity to another (see Supplementary Information).For the synthetic analysis, we treat the Potts Hamiltonian model trained on this natural protein kinase MSA as the target model or distribution, and generate MSAs from it which serve as training and target MSAs in our synthetic tests.
In Stage 2 we split the sequences into training and target MSAs.For the natural analysis, we randomly divide the 20K natural sequences into 10K training and 10K target MSAs.For the synthetic analysis we generate both training MSAs used to train the synthetic models, and target MSAs used as reference, and ensure that the synthetic training and target sequences are non-overlapping sets, even though they come from the same target model.
In Stage 3, the processed MSAs are then one-hot encoded and used to train the the GPSMs.In Stage 4, we generate evaluation MSAs from each model, and in Stage 5 we perform our generative capacity measurements by comparing the artificial MSAs to the appropriate target MSA.

Mi3
The Mi3 model is a pairwise Potts Hamiltonian model fit to sequence data using the "Mi3-GPU" software we have developed previously 33 , which performs "inverse Ising inference" to infer parameters of Potts models using a Markov-Chain Monte-Carlo (MCMC) algorithm which entails very few approximations.This software allows us to fit statistically accurate Potts models to MSA data.We have examined Mi3's generative capacity and out-of-sample error in earlier work 33,46 , which we summarize here.
A Potts model is the maximum entropy model for p(S) constrained to reproduce the bivariate marginals f i j αβ of an MSA, i.e. the frequency of amino acid combination α, β ,at positions i, j.The probability distribution p θ (S) for the Potts model takes the form where Z is a normalization constant, Z = ∑ S e −E(S) , and "coupling" J i j αβ and "field" h i α parameters are compactly denoted by the vector θ = {h i α , J i j αβ }.The number of free parameters of the model (non-independent couplings and fields) is equal to the number of non-independent bivariate marginals, which can be shown to be L(L−1) 2 (q − 1) 2 + L(q − 1) for q amino acids 14 , which is ∼10.7Mparameters for our model.This implies that the Potts model is well specified to reproduce the bivariate marginals when generating sequences from p θ (S).
The Mi3 model inference procedure maximizes the log-likelihood with regularization.Maximizing the Potts log-likelihood can be shown to be equivalent to minimizing the difference between the dataset MSA bivariate marginals f i j αβ and the model bivariate marginals of sequences generated from p(S).To account for finite sampling error in the estimate of f i j αβ for an MSA of N sequences we add a small pseudocount of size 1/N, as described previously 46 .We also add a regularization penalty to the likelihood affecting the coupling parameters J i j αβ to bias them towards 0, of form λ ∑ SCAD(J i j αβ , λ , α) using the SCAD function which behaves like λ |J i j αβ | for small J i j αβ but gives no bias for large J i j αβ 68 , using a small regularization strength of λ = 0.001 for all inferences which causes little model bias.
To generate synthetic MSAs from the Potts model we use MCMC over the trial distribution p θ (S) until the Markov-Chains reach equilibrium, as described previously 33 .We can directly evaluate E(S) as the negative log-probability of any sequences for the Mi3 model using equation 1 up to a constant Z, and this constant can be dropped without affecting our results.

Indep
The Indep model is the maximum entropy model for p(S) constrained to reproduce the univariate marginals of an MSA, and is commonly called a "site-independent" model because the sequence variations at each site are independent of the variation at other sites.Because it does not fit the bivariate marginals, it cannot model covariation between positions.It takes the form where Z is a normalization constant, Z = ∑ S e −E(S) , and "field" parameters h i α for all positions i and amino acids α are compactly referred to by the vector θ = {h i α }.The fields of the Indep model generally have different values from the fields of the Potts model.Unlike for the Potts model, maximum likelihood parameters can be determined analytically to be h i α = − log f i α where f i α are the univariate marginals of the dataset MSA.When fitting the Indep model to a dataset MSA of N sequences, we add a pseudocount of 1/N to the univariate marginals to give model marginals f i α , to account for finite sample error in the univariate marginal estimates.The model distribution simplifies to a product over positions, as p θ (S) = ∑ f i s i .The number of independent field parameters is L(q − 1) which equals 4.6K parameters for our model.
To generate sequences from the independent model we independently generate the residues at each position i by a weighted random sample from the marginals f i α , and we directly evaluate the log probability of each sequence E(S) from equation 2.

VAEs
The standard variational autoencoder (sVAE) is a deep, symmetrical, and undercomplete autoencoder neural network composed of a separate encoder q φ (Z|S) and decoder p θ (S|Z) 69 , which map input sequences S to regions within a low-dimensional latent space Z and back.The probability distribution for the sVAE is defined as where the latent space distribution is a unit Normal distribution, Training of a VAE can be understood as maximization of the dataset log-likelihood with the addition of a Kullback-Leibler regularization term D KL [q φ (Z|S), p θ (Z|S)], where p θ (Z|S) is the posterior of the decoder 29,30 .sVAE's architecture is "vanilla" 70 , meaning it is implemented in a standard way and its behavior is meant to be representative of VAEs generally.Nearly identical sVAE architectures have been used as VAE-GPSMs in recent studies 22 .sVAE's encoder and decoder are implemented without advancements such as convolutional layers 20 , multi-stage training 71 , disentanglement learning 70 , Riemannian Brownian motion priors 72 , and more.This allows us to directly interrogate the assumptions and performance of standard variational autoencoding with respect to the training and evaluation of GPSMs in this work.
Our encoder and decoder have 3 layers each and employ standard normalization and regularization strategies 73,74 .sVAE's latent bottleneck layer has 7 nodes, and the model in total has 2.7M inferred parameters.The input layer of the encoder accepts a one-hot encoded sequence and the decoder's output layer values can be interpreted as a Bernoulli distribution of the same dimensions as a one-hot encoded sequence.We have tested various sVAE architectures and hyperparameters with our datasets, as well as DeepSequence as described below, and found qualitatively similar generative capacity results.
To generate a sequence from sVAE, we generate a random sample in latent space from the latent distribution p(Z) pass this value to the decoder to obtain a Bernoulli distribution, from which we sample once.To evaluate the negative log-probability of a sequence E(S) we use importance sampling, averaging over 1K samples from the latent distribution q φ (Z|S) 21 .Other publications use the Evidence Lower Bound (ELBO) estimate as an approximation of the negative log-probability 20 , and we have verified that the ELBO and the negative log-probability are nearly identical in our tests and have equal computational complexity.
In addition to sVAE, we test the DeepSequence VAE 20 .We use the default inference parameters, and use the "SVI" inference implementation which uses a "variational Bayes" inference technique as an extension of the sVAE inference method.Because DeepSequence is designed to output the ELBO rather than the negative log-probability for each sequence, we use the ELBO as an approximation of E(S), and estimate it using an average over 1K samples.To generate sequences we use the same strategy as for sVAE.

sVAE implementation
The standard variational autoencoder (sVAE) is a deep, symmetrical, and undercomplete autoencoder neural network composed of a separate encoder q φ (Z|S) and decoder p θ (S|Z), which map input sequences S to regions of a low-dimensional latent space Z and back. 1 It is a probabilistic model, and in our "vanilla" 2 implementation we assume sequences will be distributed according to a unit normal distribution in latent space, p(Z) = N [0, 1](Z) 1 .Training of a VAE can be understood as maximization of (the logarithm of) the dataset likelihood L = ∑ S p θ (S) = ∑ S p θ (S|Z)p(Z)dZ with the addition of a Kullback-Leibler regularization term D KL [q φ (Z|S), p θ (Z|S)], where p θ (Z|S) is the posterior of the decoder, which allows use of the fitted encoder q φ (Z|S) to perform efficient estimation of the likelihood and its gradient by Monte-Carlo sampling, for appropriate encoder models.
Our sVAE architecture is built on the same basic VAE architecture of "EVOVAE" 3 , which itself appears to be built on the VAE implementation provided by developers for the Keras library 4 .It is composed of 3 symmetrical ELU-activated layers in both the encoder and decoder, each layer with 250 dense (fully-connected) nodes.The encoder and decoder are connected by a latent layer of l nodes, we use l = 7 in the main text.sVAE's input layer accepts one-hot encoded sequences, the output layer is sigmoid-activated, and its node output values can be interpreted as a Bernoulli distribution of the same dimensions as a one-hot encoded sequence.The first layer of the encoder and the middle layer of the decoder have dropout regularization applied with 30% dropout rate, and the middle layer of the encoder uses batch normalization with a batch size of 200 3,5,6 .In all inferences, we hold out 10% of the training sequences as a validation dataset, and perform maximum likelihood optimization using the Keras Adam stochastic gradient optimizer on the remaining 90% 7 .After each training epoch we evaluate the loss function for the training and validation data subsets separately.We have tested using early-stopping regularization to stop inference once the validation loss has not decreased for three epochs in a row, as in previous implementations, but this led to some variability in the model depending on when the early stopping criterion was reached.To avoid this variability, and to make different models more directly comparable, we instead fixed the number of epochs to 32 for all models, since in the early stopping tests this led to near minimum training loss and validation loss, and did not lead to significant overfitting as would be apparent from an increase in the validation loss.sVAE was implemented using Keras, building on previous implementations 3,4 , however with a modification of the loss function relative to both of these, to remove a scaling factor of Lq on the reconstruction loss, which is sometimes used to avoid issues with local minima as described further below.This prefactor leads to a non-unit variance of the latent space distribution of the dataset sequences, violating our definition that the latent space distribution should be normal with unit variance, p(Z) = N [0, 1](Z).In the next section we show that after removing the prefactor the latent space distribution is approximately a unit normal, which more closely follows the original VAE conception 1,8 .Our implementation is available at https://github.com/ahaldane/MSA_VAE.
To generate a sequence from the model we generate a random sample in latent space from the latent distribution N [0, 1], pass this value to the decoder to obtain a Bernoulli distribution, from which we sample once.To evaluate the log-probability of a sequence, we use importance sampling, averaging over 1000 samples from the latent distribution q φ (Z|S) following from the relations 9,10 p θ (S) = p θ (S|Z)p(Z)dZ = q φ (Z|S) p θ (S|Z)p(Z) q φ (Z|S) dZ where, Z i are independent samples from q φ (Z|S) and N is a large number of samples.Here q θ (Z|S) plays the role of a sampling bias function, biasing samples to regions of latent space which are likely to have generated the sequence, leading to an accurate Monte-Carlo estimate of p θ (S).The value p θ (S) can be converted to a unit-less statistical energy as E(S) = − log p θ (S) for direct comparison with Mi3 and Indep statistical energies.Other publications have used the Evidence Lower Bound (ELBO) estimate as an approximation of log p θ (S) 11 , and we have tested (see Fig. 1) that the ELBO and the log-probability are nearly identical, and N = 1000 samples is sufficient for an accurate estimate.The fact that the ELBO and log-probability are nearly identical is a sign that our encoder is well fit, as the difference between these values should equal the KL divergence D KL [q φ (Z|S), p θ (Z|S)] between the "true" posterior of the decoder p θ (Z|S) and the approximate posterior q φ (Z|S), which should be 0 if the encoder q φ (Z|S) has accurately modelled the posterior 1 .

VAE model validation and generalization
To validate our choice of latent space size of l = 7 used in the main text, we tried fitting sVAEs with different latent space sizes from 2 to 10.In Fig. 2 we illustrate the latent space projections of the sequences in the training dataset.According to our specification underlying the VAE theoretically, we expect the latent space distribution to be a multidimensional normal distribution with mean 0 and unit variance.Indeed, as can be seen in the plot, and measured numerically, we generally find the latent space distribution of the dataset has close to unit variance and is approximately normal, although there is some non-normal structure in the distribution.
For latent spaces of l = 8 and l = 10 we observe that some latent dimensions appear to have"collapsed", in particular z 0 for l = 8, and z 1 and z 6 for l = 10.From repeated runs (not shown) we observe that the number of collapsed dimensions varies somewhat depending on the random seed used to initialize the stochastic optimizer, and also depends on the size of the training dataset as more dimensions collapse when fitting 1M sequences than fitting 10K sequences (not shown).For these "collapsed" dimensions, we see that the projected variance of the illustrated sequence in red in Fig. 2 is very close to 1.0, unlike in other dimensions where the projected variance is much smaller.These behaviors are consistent with a well known phenomenon of "posterior collapse" discussed in VAE literature 12 .It has been suggested that VAE posterior collapse can occur due to local minima in the likelihood function which are not global minima 12 , but in some situations can be a sign that additional latent dimensions are uninformative, and that fewer latent dimensions better represent the data 13 .We find that choosing l = 7 gives the best performing model which avoids posterior collapse.Interestingly, the number of "informative" latent variables, i.e. those that do not undergo posterior collapse, turns out to coincide with the intrinsic dimension (ID) of the dataset of training sequences, estimated from the set of pairwise distances using a completely independent approach 14 .In brief, it has been shown that graph distances calculated on k-neighbor graphs can be used to approximate geodesics and thus to generate the distribution of "intrinsic" distances.Close to the maximum, the latter depends exclusively on the dimension of the distance distribution's support.This observation is used to devise a family of estimators for the ID.Using these tools, we estimated an ID of 7 or 8 for the synthetic dataset used in the main text.These numbers are consistent with what was observed in terms of collapse of the posterior distribution: the ID is seemingly related to the number of informative latent variables so that if the number of nodes in the embedding layer is increased past this number, then posterior collapse occurs, indicating that the additional variables are not needed to explain the data.To compare the generative capacity of different GPSMs to determine how general our results are, we computed our MSA statistics for other VAEs besides the l = 7 sVAE shown in the main text.In Fig. 3 we show the r 20 scores for different models when fit to either 10K or 1M synthetic sequences, as in the synthetic tests in the main text.We include the Mi3 and Indep models, as well as sVAEs for different latent space sizes, and also models produced using the DeepSequence VAE software which comes in two variations, the "MLE" and the "SVI" algorithms 11 , for which we use the default or example parameters.All the VAEs perform fairly similarly in this metric, including the DeepSequence VAEs.For the smaller training dataset of 10K sequence the DeepSequence SVI algorithm outperforms the other VAEs, suggesting it is less susceptible to out-of-sample error.These results suggest that our results for the sVAE shown in the main text generalize to other VAEs, including the significantly more complex DeepSequence VAE, and are not strongly dependent on implementation or number of latent variables l.The models with l ∼ 7 perform among the best of the sVAE models for both the 10K and the 1M training datasets, though the difference between the models is small, and this further justifies our choice of l = 7 in the main text.

Minimizing VAE out-of-sample error
The goal of the synthetic test with 1M training sequences in the main text is to eliminate out-of-sample error (overfitting) by using an extremely large training dataset.How large must the training dataset be to mostly eliminate out-of-sample error for the sVAE?In Fig. 4 we show tests for the l = 7 sVAE for increasing training dataset sizes, finding that after 500K sequences the improvement in performance becomes small.This justifies our choice of using 1M synthetic training sequences, as there is little additional improvement to be gained by fitting to 2M sequences at the cost of increasingly prohibitive fitting time.

Using sVAE as the synthetic target distribution
In the main text, our synthetic GPSM tests are performed using a Potts model as the synthetic target distribution.This means that the target distribution is constructed without higher-order interaction terms, and a Potts model is by definition well specified to fit data generated from this target distribution.Here, we show GPSM performance when the synthetic target distribution instead corresponds to a sVAE, which potentially generates data which cannot be fit by a model with only pairwise interaction terms.
In this section we take the synthetic target distribution to be described by the sVAE fit in the main text from 10K natural sequences.As described in the main text, this model potentially generates patterns of higher-order mutational covariation which cannot be fit by the Mi3 model.We then follow the same procedure as for our synthetic 1M test of the main text, but using this target distribution.We generate 1M sequences from the target sVAE distribution which we use as training data for each GPSM, that is for Mi3, sVAE, and Indep.We generate evaluation MSAs from each inferred model and compare it to evaluation MSAs generated from the target distribution, using our test statistics.
In Fig. 5 we show MSA test statistics for the models fit to the sVAE target.We find that the performance of Mi3 fit to this target performs at least as well as the sVAE model fit to the same target.As in the main text 1M synthetic

5/9
test, the correlation scores are estimated from 500K evaluation sequences from the target and each GPSM, the r 20 scores using 6M evaluation sequences, the Hamming distributions from 30K sequences, and the energies are evaluated for 1K sequences using 1000 Monte Carlo samples.For the r 20 test we measure the estimation limit due to the finite size of the evaluation MSAs by computing r 20 between two MSAs of size 6M generated from the target distribution.There is a small difference between the estimation error limit and the Mi3 result, which may be due to out-of-sample error due to the finite 1M training data, or due to specification error, and this difference is smaller than the difference of sVAE fit to the same target distribution (red).In sum, we interpret these tests to show that the Mi3 model is able accurately fit sVAE's target distribution.

How higher-order covariation is represented by pairwise models
One of the questions we address in the main text is whether different GPSMs are well specified to describe protein sequence variation, especially in the case of covariation of many positions in the sequence at once.Of particular interest is whether a model which explicitly includes only pairwise interactions, such as the Potts model, is sufficient to model higher order epistasis, or whether GPSMs with more complex functional forms, such as a VAE, are necessary.
For clarity, we give a brief example describing how Potts models can predict many patterns of higher-order covariation, meaning triplet and higher patterns of residue covariation, despite only modelling pairwise interactions.We illustrate this using a toy model describing sequences of length L = 3 with two residue types A and B, with 2 3 = 8 possible sequences, and show different forms of higher-order covariation which a pairwise model can and cannot fit.Detailed discussion and theoretical results suggesting why pairwise models are often sufficient to model many datasets have been provided by others [15][16][17] .
First, we show how such a Potts model generates triplet covariation.Consider a Potts model with parameters given by J 12 AA = J 23 AA = −s for some interaction strength s and all other field and coupling parameters are 0.This directly couples the character "A" between positions (1,2) and also positions (2,3).These interactions cause pairwise covariation between the directly coupled residues, and in the limit of large s we find C 12 AA = C 23 AA = 0.08, or 8%, but they also cause covariation between the indirectly coupled pair, as C 13 AA = 0.04, or 4%.Furthermore, this Potts model predicts three-body covariation, as can be seen by computing the three-body covariation terms found in cluster expansions in statistical physics given by and we find that C 123 AAA = 0.024, or 2.4%, which is nonzero.This shows that a Potts model generates and can fit higher-order covariation between sets of residues even though the interactions are only pairwise, as a result of indirect covariation through chains and loops of pairwise interactions.An example of MSA triplet statistics which a Potts model is mis-specified to describe is the XOR pattern in which the dataset is composed in equal proportions of copies of the four sequences shown in Table 1.These sequences follow the XOR function in boolean logic, so that the 3rd position is the XOR function applied to the first two positions.One can see that both the A and B residues have a 50% probability at each position, and that for each pair of positions the probability of each of the four combinations AA, AB, BA, BB is 1/4.This means that the pairwise covariances C i j αβ = 0.25 − 0.5 × 0.5 are all 0. Because there are no pairwise covariances, fitting a Potts model to this data will yield a model with no coupling terms, equivalent to an Indep model.Sequences generated from this (or any) Indep model have all three-body covariation terms equal to 0. However, the three-body covariations of the dataset are non-zero and C 123 AAA = 0.125.This shows how a Potts model fit to XOR data will fail to reproduce the correct three-body covariations.More generally, it will fail to model data which follows a boolean parity function, which generalizes the XOR function to longer strings, and is defined so that the last character is set to "B" if there are an odd number of "B" characters in the preceding sequence.

AAA ABB BAB BBA
A motivation for the VAE is that it may potentially be able to model patterns of covariation such as the XOR pattern which a Potts model cannot.Whether a VAE is able to outperform the Potts model when fit to protein sequence data will depend on the prevalence of patterns such as XOR in the data which cannot be fit by a Potts model.If they are undetectable, the Potts model will be well specified and third order parameters are unnecessary.Our results with the natural dataset in the main text suggest no evidence that the Potts model is mis-specified to our dataset, as it is able to reproduce all the MSA statistics we tested up to the limits imposed by estimation and out-of-sample error.

Analysis of estimation error
When computing the r 20 scores we are able to quantify estimation error, as can be seen by the r 20 upper limit illustrated in Fig. 5b (black dotted line).Here we provide quantitative intuition for the behavior of the r 20 score as a function of the evaluation MSA size N, which explains the difficulty in eliminating estimation error entirely.
Consider a particular set of positions for which we estimate the frequency f of each subsequence at those positions in the target distribution, based on a finite MSA of size N generated from the target distribution, giving estimated marginals f .We retain only the top twenty observed subsequences for use in the r 20 computation.The statistical variance in f caused by finite-sampling error will be f (1 − f )/N, following a multinomial sampling process, and we will approximate that all top 20 marginals have similar magnitude and we approximate this error as f (1 − f )/N for all twenty values, where f is the mean value of the top 20 marginals.We can then approximate that the expected Pearson correlation ρ 2 between values estimated from two such MSAs will be ρ 2 ≈ χ 2 /(χ 2 + σ 2 ) where χ 2 is the variance in the values of the top 20 marginals (reflecting the variance of the "signal"), and σ 2 ≈ f (1 − f )/N is the statistical error in each value (representing the variance of the "noise").
f and χ are properties of the protein family being modelled, at each position-set, and do not depend on N.This invariant allows us to extrapolate, since if we solve for f (1 − f )/χ 2 = N(1/ρ 2 − 1), the rhs should be invariant when we change the size of the dataset MSA from N to N 0 or vice versa.If we estimate the rhs for a particular N 0 and measured ρ 0 numerically, we can solve for ρ at higher N since N(1/ρ 2 − 1) = N 0 (1/ρ 2 0 − 1), or The approximations we used to derive this formula will become more accurate for larger N 0 .We have tested this formula by predicting the expected r 20 for MSAs of size N by extrapolating based on the measured r 20 for MSAs of smaller size N 0 , and find it is quite accurate.
This equation shows how extremely large MSAs can be required to reduce estimation errors when evaluating r 20 , as the extrapolated N diverges as ∝ 1/x as x = 1 − ρ 2 approaches 0. For instance, if with an MSA of 6 million sequences we obtain r 20 = 0.8, then we would require 28.5 million sequence to obtain r 20 = 0.95 and 148 million to reach r 20 = 0.99.

Typical natural sequence dataset MSA size
The 10K sequence training datasets we use in the main text are meant to illustrate performance for typical protein family dataset sizes.The size of 10K sequences is the number of estimated effective sequences N eff remaining after curation and phylogenetic filtering for the 20th most frequent protein (Cadherin) in Pfam (Fig. 6, right) 18 .Some of our measurements show significant out-of-sample error for Mi3 and the VAEs based on training sample size alone, suggesting that the vast majority of GPSMs training on natural data could be subject to the level out-of-sample error reported in our results.
In Pfams's Top 20 most frequent protein domains, ranked by total number of sequences, there are between 10 5 and 10 6 total sequences each (Fig. 6, right).In this work, we use the 4th most frequent protein out of this ranking, Pkinase.After curation and phylogenetic filtering of the kinase, we retained only N eff ∼22K, or ∼5% of the original ∼424K kinase sequences (Fig. 6, left).Extending this fraction of ∼5% to the other Top 20 proteins, we estimate that N eff is capped at ∼ 10 5 (100K) for GPSMs trained on single domains, and that proteins outside the Top 20 can generally expect N eff < 10 4 (10K).This tabulation of Pfam data demonstrates that, for the vast majority of proteins with publicly available natural sequence data, contemporaneous GPSMs must have approximately N eff < 10K for training, validation, and testing.

4 LFigure 1 . 5 -
Figure 1.5-stage analysis pipeline diagram.Stage 1: Data Sequestration.Two different MSAs are sequestered: one natural from Uniprot/TREMBL; and the other synthetic, generated by a Potts model fit to the natural MSA.Stage 2: Data Processing.Sequences are indexed and split.Non-overlapping training, target, and evaluation MSAs are shown.Stage 3: GPSM Training.GPSMs are trained.Stage 4: Artificial Sequence Generation.Evaluation MSAs are generated from each GPSM.Stage 5: Generative Capacity Measurements.Computation and visualization of generative capacity metrics is performed.

Figure 2 .
Figure 2. Pairwise covariance correlations (Top Row) and Pearson r 20 correlations (Bottom Row).GPSM MSA statistics are compared to the corresponding target values.GPSMs were trained on 1M synthetic (a, d), 10K synthetic (b, e), or 10K natural (c, f) kinase sequences.Top Row Covariances C i j αβ computed from MSAs of 500K sequences generated by each GPSM (y-axis) vs target covariances (x-axis) computed from MSAs of 500K sythnetic target sequences (a, b) or from 10K natural target sequences (c), with Pearson correlation ρ shown for each comparison.Generative capacity slightly decreases as synthetic training sample size is reduced from 1M in a to 10K in b for all GPSMs, except Indep.Covariances around zero are omitted for simplicity.Bottom Row Pearson r 20 score (y-axis) as a function of Higher-Order-Marginal (HOM) n-tuple length (x-axis), illustrating how well each GPSM predicts HOMs from the target distribution.Lines are colored as in the top panels.r 20 scores for each GPSM are computed using GPSM-generated evaluation MSAs of 6M sequences compared to target MSAs of 6M sequences for the synthetic tests (d, e) or to a target MSA of 10K natural sequences for the natural test (f) due to limited natural sequence data.The black dotted line denotes an estimation upper limit caused by finite sampling estimation error.This is computed as the r 20 scores for two non-overlapping synthetic target MSAs of 6M sequences each (d and e, black circles) or for MSAs of 6M and 10K sequences both generated from the Mi3 model trained on natural sequences (f, black triangles).For panel f, only lengths two through seven are plotted, as the small dataset size limits HOM estimation.Insets emphasize pairwise r 20 .Comparing panels d and e, the generative capacity of all GPSMs decreases with synthetic training sample size according to r 20 .

Figure 3 .
Figure3.Pairwise Hamming distance distributions.These plots illustrate whether sequences generated from the GPSMs reproduce the overall sequence diversity of their respective targets.Mi3 (cyan), DeepSequence (green), sVAE (red), and Indep (gray) distributions are compared to target distribution (dotted black).GPSMs were trained on 1M synthetic (a, d), 10K synthetic (b, e), or 10K natural (c, f) sequences from the corresponding target distribution.All Hamming distributions were computed from 30K-sequence MSAs, except for the natural target, which was computed from a 10K-sequence target MSA due to data limitations.Top Row, Hamming distances d (x-axis) are shown about the mode, and frequency f is normalized as a fraction of total (y-axis).Mi3 perfectly matches the target distribution from a to c, whereas DeepSequence and sVAE overlap each other and share a mode with a slightly higher frequency than the target and Mi3.Indep has a slightly lower mode than all the other GPSMs, and a much higher mode frequency.Because generative capacity for all GPSMs is unchanged from a to c, none of them are sensitive to training or evaluation sample size for this metric.Bottom Row, Re-scaled logarithmic Hamming distance distributions better discriminate between GPSMs with respect to generative capacity than the normal Hamming distance distribution.Before being log-scaled, the Hamming distances d are normalized by the mode d Mo (x-axis), and frequencies f are normalized by the maximum Hamming distance f max (y-axis).This transformation highlights minute differences between distributions at low frequencies in the tails of the distributions on the left-and right-hand sides.From d to e, DeepSequence and sVAE appear sensitive to training sample size for this metric at the log-log scale.
, Left Column) and 10K (Fig. 4, Right Column) synthetic training MSA sizes, and quantify GPSM generative capacity for this metric by the Pearson correlation coefficient ρ({E(S)}, { Ê(S)}) between synthetic target energies E(S) and GPSM energies Ê(S).Mi3 reproduces the synthetic target distribution at both training MSA sizes.Because Mi3 should have very low specification error on the synthetic target, as it is well specified by design, the small amount of error must be due to remaining out-of-sample or numerical errors.As expected, Indep poorly reproduces the target values, with ρ = 0.6 for both MSA training sizes.The VAEs exhibit slightly larger specification error than Mi3 on the 1M training set with correlation of ρ = 0.94, and exhibit further out-of-sample error on the 10K training set with ρ = 0.89.

Figure 4 .
Figure 4. Statistical energy correlations.Statistical energies E(S) of 1K synthetic test sequences from the target distribution as evaluated by Mi3 (cyan), DeepSequence (green), sVAE (red), Indep (gray).Each GPSM was trained on 1M (a, c, e, g) or 10K (b, d, f, h) sequences from the synthetic target distribution.For each scatterplot, Pearson correlation coefficient ρ was computed between each GPSM's statistical energy and that of the synthetic target distribution for each sequence.These statistical energies are unit-less, and so we have shifted the data along the y-axis to compare E(S) on the same scale.Only Indep is insensitive to decreased training sample size for this metric.

Figure 1 .
Figure 1.Comparison of E(S) = − log p θ (S) with the ELBO estimate for the sVAE with l = 7 fit to 1M sequences, evaluated for 1000 sequences S from the validation dataset, with N = 1000 samples for both the ELBO estimate and the E(S) estimate.

Figure 2 .Figure 3 .
Figure 2. Plots of latent space distribution of the training dataset for sVAE models fit with different latent space sizes of 2, 4, 6, 8, and 10 (a,b,c,d,e respectively), fit to 1M synthetic training sequences as in the synthetic test in the main text.For each latent space size we show, for each pair of latent variables, a 2d histogram of the projected means of 10K training dataset sequences in latent space in blue.There is one subplot for l = 2, six subplots for l = 4, etc.Each plot ranges from -4 to 4 on both axes.The latent distribution q φ (Z|S) for single random sequence from the training dataset is shown as a red shading in proportion to probability.a b

Figure 4 .
Figure 4. sVAE performance for l = 7 for varying synthetic training dataset sizes.For each training dataset size, two inferences are run with different random seeds, shown in solid and dashed lines for each training size.

Figure 5 .
Figure 5. Synthetic test of the performance of different GPSMs when the synthetic target distribution is specified by sVAE.a Pairwise covariance correlation scores, as in main text Figure 2a.b r 20 scores, as in main text Figure 2d.c Hamming distance distributions, as in main text Figure 3a.d Statistical energy scores, as in main text Figure 4 panels a, c, e.

Table 1 .
Glossary of error types, and the MSA datasets we use to evaluate these errors.

Table 1 .
Example MSA following the XOR pattern.