Abstract
Despite recent advances in highthroughput combinatorial mutagenesis assays, the number of labeled sequences available to predict molecular functions has remained small for the vastness of the sequence space combined with the ruggedness of many fitness functions. While deep neural networks (DNNs) can capture highorder epistatic interactions among the mutational sites, they tend to overfit to the small number of labeled sequences available for training. Here, we developed Epistatic Net (EN), a method for spectral regularization of DNNs that exploits evidence that epistatic interactions in many fitness functions are sparse. We built a scalable extension of EN, usable for larger sequences, which enables spectral regularization using fast sparse recovery algorithms informed by coding theory. Results on several biological landscapes show that EN consistently improves the prediction accuracy of DNNs and enables them to outperform competing models which assume other priors. EN estimates the higherorder epistatic interactions of DNNs trained on massive sequence spacesa computational problem that otherwise takes years to solve.
Introduction
Recent advances in nextgeneration sequencing have enabled the design of highthroughput combinatorial mutagenesis assays that measure molecular functionality for tens of thousands to millions of sequences simultaneously. These assays have been applied to many different sequences in biology, including proteincoding sequences^{1,2,3}, RNAs^{4,5,6}, bacterial genes^{7,8,9,10}, and the Cas9 target sites^{11,12,13}. The labeled sequences collected from these assays have been used to train supervised machine learning (ML) models to predict functions (e.g., fluorescence, binding, repair outcome, etc.) from the sequence—a key step in the rational design of molecules using MLassisted directed evolution^{14}. However, due to the limitations in techniques for library preparation, these assays can only uncover a small subset of all the possible combinatorial sequences. This raises an important question in learning fitness functions: how can we enable supervised ML models to infer fitness functions using only a small number of labeled sequences?
Inferring fitness functions is a challenging task since mutational sites interact nonlinearly to form the function, a phenomenon known as epistasis in genetics^{15,16}. As a result, linear regression models which assume siteindependent interactions achieve poor accuracy in predicting nonlinear functions. Augmenting linear models with pairwise, secondorder epistatic interactions improves their prediction accuracy^{3}; however, there is now increasing evidence that a large fraction of the variance in the fitness functions can be explained only by higherorder epistatic interactions, which contribute to the ruggedness of fitness landscapes^{17,18}. Modeling rugged fitness landscapes is a hard task since the total number of possible higherorder interactions grows exponentially with the number of mutational sites. As a result, the number of parameters to be estimated (i.e., the problem dimension) also grows with the same exponential rate, which creates statistical challenges in inferring the fitness function since the number of labeled sequences does not scale with the problem dimension. In response, nonlinear ML models constrain the problem dimension by introducing various forms of inductive biases to capture hidden structures in the fitness functions. Random forests, for example, impose a tree structure over sites which favor treelike hierarchical epistatic interactions. While these inductive biases are effective in some fitness functions^{19}, they are too restrictive to capture the underlying higherorder epistatic interactions in other fitness functions^{3}. Overparameterized models in deep learning (DL), such as deep neural networks (DNNs), are expressive enough to model highorder epistatic interactions given a large number of labeled training sequences; however, when the number of labeled sequences is small, they often overfit to the training data and compromise prediction accuracy. It has been observed that regularizing DNNs to induce domainspecific biases improves their prediction accuracy for various tasks in computer vision and natural language processing^{20}. This opens up the question of whether there exists an inductive bias for DNNs trained on biological fitness landscapes that can be imposed using a computationally tractable regularization scheme.
Recent studies in biological landscapes^{3,13,21} have reported that a large fraction of the variance in many fitness functions can be explained by only a few number of (highorder) interactions between the mutational sites. The epistatic interactions in these functions are a mixture of a small number of interactions with large coefficients, and a larger number of interactions with small coefficients; in other words, their epistatic interactions are highly sparse. Promoting sparsity among epistatic interactions is a powerful inductive bias for predictive modeling because it reduces the problem dimension without biasing the model towards a subset of (loworder) interactions. Despite its benefits, promoting sparsity among epistatic interactions has not been studied in DNNs as an inductive bias. The roadblock is in finding a method to promote epistatic sparsity in DNNs. Unfortunately, directly penalizing all or some of the parameters (weights) of DNNs with sparsitypromoting priors is not likely to result in sparse epistatic regularization since the epistatic coefficients are a complex nonlinear function of the weights in DNNs.
Here, we develop a method for sparse epistatic regularization of DNNs. We call our method Epistatic Net (EN) because it resembles a fishing net which catches the epistatic interactions among all the combinatorially possible interactions in DNNs, without any restriction to a subset of (loworder) interactions. In order to find the epistatic interaction as a function of the weights in DNN, we find its spectral representation (also called the WalshHadamard (WH) transform for binary sequences) by evaluating the DNN on the entire combinatorial space of mutations, and then take the WH spectral transform of the resulting landscape using the Fast WH Transform (FWHT). The resulting function of the weights in DNN is penalized to promote epistatic sparsity. For larger sequences this approach for epistatic regularization becomes computationally intractable due to the need to enumerate all possible mutations in DNN. Therefore, we leverage the fast sparsityenabled algorithms in signal processing and coding theory in order to develop a greedy optimization method to regularize DNNs at scale. Our scalable regularization method, called ENS, regularizes DNNs by sampling only a small subset of the combinatorial sequence space by choosing sequences that induce a specific sparse graph structure. The uniform sampling scheme allows us to find the WH transform of the combinatorial DNN landscape efficiently using a fast peeling algorithm over the induced sparse graph^{22}. Results on several biological landscapes, from bacterial to protein fitness functions, shows that EN(S) enables DNNs to achieve consistently higher prediction accuracy compared to competing models and estimate all the higherorder predictive interactions on massive combinatorial sequence space—a computational problem that takes years to solve without leveraging the epistatic sparsity structure in the fitness landscapes.
Results
Regularization using the Epistatic Net (EN)
EN is a novel regularization scheme (Fig. 1b) which evaluates the DNN on all the possible combinatorial mutations of the input sequence; we call the resulting highdimensional vector the DNN landscape. EN takes the WH transform of the DNN landscape and adds the sparsitypromoting ℓ_{1}norm (i.e., the sum of the absolute values) of the WH coefficients (or total sum of the magnitude of epistasis) to the loglikelihood loss. The resulting WH loss is a differentiable function (except at zero) of the weights in DNN and is weighted by a scalar which strikes a balance between the fidelity of DNN to the labeled sequences and sparsity among epistatic interactions (see “Methods”, Supplementary Notes, and Supplementary Fig. 1). We use the stochastic gradient descent (SGD) algorithm to minimize the aggregate loss and update the weights of DNN in every iteration.
For larger sequences (of size d > 25), EN regularization becomes intractable in time and space complexity. This is because EN needs to query the DNN p = 2^{d} times to form the DNN landscape (exponential time complexity in d) and then find the WH transform of the queried DNN landscape (exponential time and space complexity in d). To overcome this, ENS leverages the sparsity in the WH spectral domain to regularize DNN using only a small number of uniformly subsampled sequences from the combinatorial input space (Fig. 1c). ENS decouples the DNN training, following the alternating direction method of multipliers (ADMM) framework^{23}, into two subproblems: (1) finding the ksparse WH spectral transform of DNN in a sample and time efficient manner, and (2) minimizing the sum of the loglikelihood loss and the WH loss. The WH loss penalizes the distance between DNN and a function constructed using the topk WH coefficients recovered in the first subproblem. In order to solve the first subproblem, we design a careful subsampling of the input sequence space^{22} that induces a linear mixing of the WH coefficients such that a greedy belief propagation algorithm (peelingdecoding) over a sparsegraph code recovers the noisy DNN landscape in sublinear sample (i.e., \({{{{{{{\mathcal{O}}}}}}}}(k{{{{{{{\mathrm{log}}}}}}}\,}^{2}p)\)) and time (i.e., \({{{{{{{\mathcal{O}}}}}}}}(k{{{{{{{\mathrm{log}}}}}}}\,}^{3}p)\)) complexity in p (with high probability)^{13,22,24,25}. Briefly, the peelingdecoding algorithm identifies the nodes on the induced sparsegraph code that are connected to only a single WH coefficient and peels off the edges connected to those nodes and their contributions on the overall graph. The algorithm repeats these steps until all the edges are removed. We solve the second subproblem using the SGD algorithm. ENS alternates between these two steps until convergence (see “Methods” and Supplementary Notes).
Inferring four canonical functions in bacterial fitness
We collected four canonical bacterial fitness functions, whose combinatorial landscapes have been measured experimentally in previously published works (see Supplementary Table 1). Figure 2a shows the sparsity level in epistatic interactions of these bacterial fitness functions. We found the coefficients for epistatic interactions by taking the WH transform of the measured combinatorial landscape (see “Methods” section for various ways to preprocess the landscapes). Figure 2a plots the fraction of variance explained as a function of the top WH coefficients. Sparsity levels can be assessed by the proximity of the resulting curve towards the topleft corner of the plot. For comparison, we also plotted synthetic fitness functions that have all possible epistatic interactions up to a certain order of interactions in Fig. 2a. While the sparsity levels vary across fitness functions, the top5 WH coefficients consistently explain more than 80% of the variance across all the landscapes.
Figure 2b shows the prediction performance of DNN with EN regularization on the bacterial landscapes compared to various competing models. All the models are trained on the same randomly sampled subset (i.e., 31%) of the sequences from the measured combinatorial landscapes and tested on a subset of unseen sequences (see Supplementary Notes for more details). The prediction accuracy is reported in terms of the coefficient of determination, R^{2} (i.e., the fraction of the variance in the test set explained from the sequence). DNN with EN regularization consistently outperforms the baseline models in all the landscapes. In particular, DNN with EN regularization performs significantly better than the ENunregularized variant consistently across all data sets (ΔR^{2} > 0.1, P < 0.033), even though DNN is optimized (in terms of architecture) for best validation performance in isolation (i.e., without epistatic regularization) and has been subjected to other forms of common sparsitypromoting regularization techniques applied directly to the weights of the DNN (“Methods”, Supplementary Data 1, and Supplementary Fig. 2).
Figure 2c shows the WH transform of the DNN landscape with and without EN regularization, as well as the WH transform of the landscapes corresponding to the rest of the competing models trained on a training set sampled from the Escherichia coli fitness landscape of Khan et al.^{9} (see Supplementary Figs. 3, 4 for a detailed analysis of the landscapes in the spectral domain). In order to find these landscapes, we queried each model for all the combinatorial mutations. In this plot, the epistatic coefficient indexed by 10100, as an example, shows an order 2 interaction between the mutational sites 1 and 3. The rest of the indices can be interpreted similarly. The WH coefficients in the measured Escherichia coli fitness function show three firstorder interactions with higher magnitude and several higherorder interactions with lower magnitude. The interactions recovered by DNN with EN regularization closely match the epistatic interactions of the measured Escherichia coli fitness function (R^{2} = 0.67), a considerable improvement over DNN without EN regularization (R^{2} = 0.41). EN regularization effectively denoises the WH spectrum of DNN by removing spurious higherorder interactions; nevertheless, given a larger training set, EN would have accepted a larger number of higherorder interactions. The WH coefficients of gradient boosted trees (R^{2} = 0.51) and random forests (R^{2} = 0.36) also show several spurious highorder interactions. Lasso regression finds two of the three measured interactions with higher magnitude, however, recovers a spurious thirdorder interaction which results in a low epistatic correlation coefficient (R^{2} = 0.18). When restricted to up to order 2 interactions, the performance of Lasso improves; it recovers the two interactions with higher coefficients, however, misses the third coefficient and the rest of the small epistatic interactions (R^{2} = 0.53).
Entacmaea quadricolor fluorescent protein
A comprehensive experimental study has reported all the combinatorial mutants that link two phenotypically distinct variants of the Entacmaea quadricolor fluorescent protein^{3}. The variants are different in d = 13 mutational sites. The study shows the existence of several highorder epistatic interactions between the sites, but also reveals extraordinary sparsity in the interactions. We used this protein landscape to assess EN in regularizing DNN for predicting protein function. We split the 2^{13} = 8192 labeled proteins randomly into three sets: training, validation, and test. The size of the test set was fixed to 3000 and the validation set size was set equal to the training set size. We varied the training set size from a minimum of n = 20 proteins to a maximum of n = 100 proteins and evaluated the accuracy of the models in (1) predicting fitness in Fig. 3a in terms of R^{2} (Supplementary Data 2) and (2) recovering the experimentally measured epistatic interactions in Fig. 3b in terms of normalized mean squared error (NMSE) (Supplementary Data 3).
DNN with EN regularization significantly outperforms DNN without EN regularization in terms of prediction accuracy (ΔR^{2} > 0.1, P < 10^{−5}), consistently across all training sizes. Moreover, DNN with EN regularization recovers the experimentally measured epistatic interactions with significantly lower error (ΔNMSE > 0.07, P < 9 × 10^{−5}), consistently across all training sizes. Applying various forms of ℓ_{1} and ℓ_{2}norm regularization on the weights of different layers of the DNN does not change the performance gap between DNN with and without EN regularization (see Supplementary Fig. 5). In particular, in order to achieve the same level of prediction accuracy (R^{2} = 0.7), DNN without EN regularization requires up to 3 times more training samples compared to DNN with EN regularization. Figures 3d, e show the scatter plots of the predicted fluorescence values of DNN and its ENregularized variant, respectively, when both models are trained on n = 60 labeled proteins. The performance gap naturally reduces for larger training sets, however, it stays consistently positive even up to n = 200 (i.e., 2.5% of the entire combinatorial landscape), which is typically larger than the number of available labeled sequences in protein function prediction problems (Supplementary Fig. 5). Our analysis also reveals the improved performance of the epistatic interactions recovered by DNN with EN regularization in predicting the pairwise contacts (residues with smaller than 4.5 Å distance^{26}) and triplet contacts (group of three residues with smaller than 4.5 Å pairwise distances) in the 3D structure of the protein—even though the networks are not trained for protein structure prediction task (Supplementary Fig. 6). DNN with EN regularization predicts contacts with \({\,{{\mbox{F}}}}_{1}^{{{\mbox{order 2}}}\,}=0.76\) and \({\,{{\mbox{F}}}}_{1}^{{{\mbox{order 3}}}\,}=0.68\) compared to DNN without EN regularization with \({\,{{\mbox{F}}}}_{1}^{{{\mbox{order 2}}}\,}=0.67\) and \({\,{{\mbox{F}}}}_{1}^{{{\mbox{order 3}}}\,}=0.66\) (F_{1} score takes the harmonic mean of the precision and recall rates).
The dimension of the fluorescent landscape of Entacmaea quadricolor protein enabled us to use the data set to compare the performance of DNN under EN regularization with its scalable version, ENS. The prediction performance of DNN with ENS regularization showed a slight drop in accuracy due to the approximations made by the ADMM decoupling (Fig. 3c, Supplementary Data 4, and “Methods”). ENS stayed fairly consistent when we decreased the number of proteins sampled from DNN to induce a sparsegraph code. Using as low as 1678 samples (out of the total of 8192 combinatorial proteins, i.e., 20% subsampling) enabled successful regularization of DNN, resulting in a significant performance gap compared to DNN without EN regularization.
Green fluorescent protein from Aequorea victoria (avGFP)
The local fitness landscape of the green fluorescent protein from Aequorea victoria (avGFP) has been investigated in a comprehensive study^{2}. The authors estimated the fluorescence levels of genotypes obtained by random mutagenesis of the avGFP protein sequence at 236 amino acid mutational sites. The final data set included 56,086 unique nucleotide sequences coding for 51,715 different protein sequences. Considering the absence or presence of a mutation at a site, created a data set with input sequence size of d = 236. Regularization in the resulting p = 2^{236}dimensional space was impossible using EN, illustrating the need for ENS. We first analyzed the peeling algorithm by inspecting the WH spectral representation of DNN once trained on the avGFP landscape. Figure 4a shows the firstorder WH coefficients of DNN, recovered using peeling after sampling DNN at 5,074,944 (out of 2^{236} ≈ 10^{71}) proteins following uniform patterns that induce a sparsegraph code. We repeated the same procedure with an independent set of uniformly subsampled sequences (with random offset) and visualized the recovered firstorder WH coefficients in a scatter plot as a function of the recovered coefficients using the first set of proteins in Fig. 4b. When sampled at two different relatively tiny subsets of this massive p = 2^{236}dimensional space, the peeling algorithm recovered similar firstorder coefficients (with R^{2} = 0.99), without assuming any prior knowledge on the WH coefficients of avGFP being loworder (also see Supplementary Fig. 7). The higher variance of the scatter plot around the center shows the small number of coefficients (30 out of 236) that are differentially recovered under the two subsamplings. The peeling algorithm associated 3.2% and 2.9% of the variation of DNN to higherorder interactions, respectively for the first and second subsampling. We compared the secondorder interactions recovered under these subsamplings (Supplementary Fig. 8). Despite the small variation associated with higherorder epistasis, 10% of the recovered secondorder interactions were exactly equal, and the rest of the interactions were locally correlated (R^{2} = 0.60 correlation within blocks of three neighboring interactions).
Next, we trained the same DNN architecture with ENS regularization. Figure 4c shows that the prediction accuracy of DNN with ENS regularization is higher than the baseline algorithms (Supplementary Data 5). The gap between DNN with and without ENS regularization is smaller compared to the previously described protein landscapes. We speculate that this is due to the nature of the local landscape of avGFP around the wildtype protein, where most of the variance can be explained by firstorder interactions and the rest can be explained by higherorder interactions that are spread throughout the WH spectrum. Figure 4d illustrates the histogram of the order of epistatic interactions recovered by invoking the peeling algorithm in every iteration of the ENS regularization scheme. Figure 4e depicts the gain in prediction accuracy after adding the recovered interactions to a purely linear model, suggesting that the difference in prediction accuracy of DNN with and without regularization can be explained (approximately) by a collection of a large number of WH coefficients with small magnitude—this analysis further demonstrates the computational power of ENS in recovering higherorder interactions in such massively large combinatorial space of interactions.
Immunoglobulinbinding domain of protein G (GB1)
A recent study investigated the fitness landscape of all the 20^{4} = 160,000 variants at four amino acid sites (V39, D40, G41, and V54) in an epistatic region of protein G domain B1, an immunoglobulinbinding protein expressed in Streptococcal bacteria^{1}. Onehot binary encoding of the amino acids results in binary sequences of length d = 80. As EN does not scale to regularize DNNs trained on this landscape, we relied on ENS. Figure 4f shows the prediction performance of DNN with ENS regularization compared to the baseline models that were scaled to such a dimension. All the models were trained on a random subset of n = 2000 proteins. ENS subsamples DNN at 215,040 proteins in order to perform the sparse epistatic regularization, which is about 10^{18} times smaller than the entire sequence space. Despite such an enormous level of undersampling, the DNN regularized with ENS consistently outperforms the competing baselines and the ENS unregularized DNN (ΔR^{2} > 0.035, P < 0.05, Supplementary Data 5, and Supplementary Fig. 9). The performance gap between the DNNs with and without ENS regularization is naturally smaller compared to the same gap in the Entacmaea quadricolor fluorescent protein landscape. This is because the protein landscape of Entacmaea quadricolor is defined over 13 mutational sites (with 8192 possible positional interactions and two possible amino acids for each site) while the protein landscape of GB1 is defined over 4 mutational sites (with 16 possible positional interactions and 20 possible amino acids for each site); the former benefits more from promoting sparsity among a larger number of biologically meaningful positional interactions.
Discussion
We showed that several of the functional landscapes in biology have common structures (i.e., inductive bias) in their epistatic interactions that manifest as sparsity in the spectral WalshHadamard (WH) domain. Sparse epistatic regularization of deep neural networks (DNNs) is an effective method to improve their prediction accuracy, especially when the number of available training samples is small compared to the vastness of sequence space. To this end, our Epistatic Net (EN) regularization method combined the advantages offered by the sparsity of biological landscapes with sublinear algorithms in signal processing and coding theory for epistatic regularization of DNNs in the combinatorial space of interactions. Analysis of the recovered higherorder epistatic interactions by the DNNs with and without regularization also revealed the power of EN in finding biologically relevant epistatic interactions.
The superior prediction performance of DNNs with EN regularization comes with the additional computational cost of finding the WH transform of the DNN landscape, which increases the computational complexity of the training algorithm by only a linear factor in the product of the length of the sequence and the epistatic sparsity level. While training can be done offline (e.g., on a server) there are avenues for making the algorithm even more efficient such as using the prior knowledge on the maximum order of interaction to constrain the regularization space. In addition, EN regularization can be extended using generalized Fourier transform to more efficiently encode amino acids compared to the more conventional onehot binary encoding strategies^{27}. Moreover, while this work laid out the algorithmic principles of sparse epistatic regularization in supervised models, unsupervised models, such as Potts model^{28}, Ising model^{29}, and Variational Autoencoders (VAEs)^{30} can benefit from such regularization scheme as well; it would be tempting to hypothesize that these energy landscapes also have structures that appear as highorder sparse coefficients in WH basis.
Overall, our sparse epistatic regularization method expands the machine learning toolkit for inferring and understanding fitness functions in biology. It helps us to visualize, analyze, and regularize the powerful, however less interpretable blackbox models in deep learning in terms of their higherorder interactions in the sequence space. We believe that our work will initiate new research directions towards developing hybrid methodologies that draws power from statistical learning, signal processing, coding theory, and physicsinspired deep learning for protein design and engineering.
Methods
Notation and background
Suppose we are given n (experimental) samples \({({{{{{{{{\bf{x}}}}}}}}}_{i},{y}_{i})}_{i = 1}^{n}\), that is, (sequence, value) pairs from a biological landscape, where x_{i} ∈ {−1, +1}^{d} denotes the binary encoding of d mutational sites in a variant and \({y}_{i}\in {\mathbb{R}}\) is its associated fitness value. We are interested in learning a function f(x) that maps all subsets of mutations to fitness values. In other words, we seek to learn a set function \(f({{{{{{{\bf{x}}}}}}}}):{{\mathbb{F}}}^{d}\to {\mathbb{R}}\), where \({{\mathbb{F}}}^{d}\) denotes the space of all the binary vectors of length d. A key theorem^{31} in mathematics states that any set function (also known as pseudoBoolean function) f(x) = f(x_{1}, x_{2}, …, x_{d}) can be represented uniquely by a multilinear polynomial over the hypercube (x_{1}, x_{2}, …, x_{d}) ∈ {−1, +1}^{d}:
where \({{{{{{{\mathcal{S}}}}}}}}\) is a subset of {1, 2, 3, …, d} = [d] and \({\alpha }_{{{{{{{{\mathcal{S}}}}}}}}}\in {\mathbb{R}}\) is the WH transform coefficient (or equivalently the epistatic coefficient) associated with the monomial (interaction) \({\prod }_{i\in {{{{{{{\mathcal{S}}}}}}}}}{x}_{i}\). For example, the pseudoBoolean function
defined over d = 5 mutational sites, has three monomials with orders 2, 1, and 3 and WH coefficients 12, − 3, and 6, respectively. The WH transform of this function is sparse with k = 3 nonzero coefficients out of a total of 2^{5} = 32 coefficients. Each monomial can be easily explained, for example, the first monomial in the WH transform, that is 12x_{1}x_{4}, indicates that mutation sites 1 and 4 are interacting and the interaction enriches fitness because the sign of the coefficient is positive. On the hand, the second monomial − 3x_{3} shows that a mutation at site 3 depletes fitness. The last monomial 6x_{1}x_{2}x_{5} shows a thirdorder interaction between mutational sites 1, 2, and 5 which also enrich fitness.
If the fitness function is measured (known) for all the combinatorial p = 2^{d} inputs x_{i}, then we can use the Fast WH Transform (FWHT)^{32} to find the WH coefficients in \({{{{{{{\mathcal{O}}}}}}}}(p{{{{{{\mathrm{log}}}}}}}\,p)\) time complexity. The problem is socalled fully determined in such a scenario. However, as discussed in the introduction, in inferring fitness functions, we typically face problems where the number of observed samples (sequences) n is much smaller than the total number of possible sequences, that is, n ≪ p = 2^{d}; in other words, we are in an underdetermined regime. In full generality, we assume that the data is generated according to a noisy nonlinear model
where θ are the parameters of the model, ε_{e} is a random variable drawn from a Gaussian distribution with zero mean and variance \({\sigma }_{e}^{2}\). Under this setting the maximum likelihood estimate is
We denote a deep neural network (DNN) by g_{θ}(x), where θ is a vector of all the weights in DNN. The DNN, g_{θ}(x), takes in a binary input vector x_{i} and predicts the output \(\hat{{y}_{i}}\). Let \({{{{{{{\bf{X}}}}}}}}\in {{\mathbb{R}}}^{p\times d}\) denote a matrix which comprises all the p = 2^{d} enumeration of the binary sequence x_{i} of length d in its rows. We slightly abuse the notation and let \({{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}})\in {{\mathbb{R}}}^{p}\) denote the realvalued vector of DNN outputs over all these binary sequences. We call this highdimensional vector the DNN landscape. In order to find the WH transform of the DNN we can multiply the DNN landscape, g_{θ}(X), by the WH matrix, \({{{{{{{\bf{H}}}}}}}}\in {{\mathbb{R}}}^{p\times p}\). The WH matrix H can be defined using the recursive equation
where H^{2} is the 2 × 2 mother WH matrix defined as \({{{{{{{{\bf{H}}}}}}}}}^{2}=\left[\begin{array}{ll}1&1\\ 1&1\end{array}\right]\) and ⊗ denotes the Kronecker product. The WH matrix is a symmetric unitary matrix; in other words, (1/2^{d})HH = I. Each of the 2^{d} columns of H corresponds to a monomial (\({\prod }_{i\in {{{{{{{\mathcal{S}}}}}}}}}{x}_{i}\)) in the pseudoBoolean representation of set functions and equivalently corresponds to one of the terms in WH transform. In biology literature, this coefficient is known as an epistatic interaction when \( {{{{{{{\mathcal{S}}}}}}}} \ge 2\). The WH transform of the DNN can be calculated as \({{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}})\in {{\mathbb{R}}}^{p}\). Note that in this manuscript we assume H is properly normalized to be unitary.
Epistatic net (EN)
EN regularizes the epistatic interactions in g_{θ}(X) by adding a new WH loss term to the original loglikelihood loss,
where \({{{{{{{\bf{H}}}}}}}}\in {{\mathbb{R}}}^{p\times p}\) is the WH matrix, the ℓ_{0}norm ∥.∥_{0} counts the number of nonzero values in the WH transform of the DNN (i.e., Hg_{θ}(X)), and α is a scalar which strikes balance between the loglikelihood loss and the regularization term. The scalar α is set using crossvalidation. The ℓ_{0}norm is a nonconvex and nondifferentiable term and is not suitable for optimization using the SGD algorithm since the gradient is not welldefined for this term; therefore, following the common practice in convex optimization, we relaxed the ℓ_{0}norm and approximated it by a convex and differentiable (except at zero) sparsity promoting ℓ_{1}norm in EN. We will discuss in the next section that in the scalable version of EN, it is more efficient to approximately solve the ℓ_{0}norm minimization problem using the greedy peelingdecoding algorithm from coding theory, which does not rely on gradient descent optimization.
EN approximately solves the following relaxed optimization problem using the SGD algorithm:
EN
Note that despite our convex relaxation, this optimization problem is still nonconvex since both the loglikelihood loss and the DNN landscape are nonconvex (still differentiable) functions. In general, convergence to the global minimum can not be guaranteed due to nonconvexity of DNN, however, in practice we observe that SGD converges smoothly to a useful stationary locally optimal point. To avoid convergence to locally optimal points with poor generalization performance, the DNN can be trained multiple times with several random initialization, however, as we have elaborated in the experimental section, for most of the experiments in this paper random Xavier initialization resulted in good generalization using a single initialization (no need for multiple initializations).
Scalable Epistatic Net (ENS)
For larger sequences (i.e., d > 25), the optimization algorithm in EN does not scale well with d. There are two factors that prevent EN from scaling to larger sequences: time and space complexity. We elaborate on these two factors. (1) In order to find the DNN landscape, we need to query the DNN p = 2^{d} times. Regardless of how fast DNN inference is, the time complexity of this task grows exponentially with d. For example, it would take years to query the DNN with the simplest structure on all the binary sequences of length d = 236 in the avGFP protein landscape. Furthermore, finding the WH transform of the DNN landscape, even using FWHT with \({{{{{{{\mathcal{O}}}}}}}}(p{{{{{{\mathrm{log}}}}}}}\,p)\) computational cost, will not be possible since the computational cost grows exponentially with d. 2) The WH matrix H is a p × p matrix and the DNN landscape g_{θ}(X) is a pdimensional vector. Regardless of the time required to find those matrices, they need exponential memory to store, which again becomes infeasible for even moderate values of d. We need a method that scales sublinear in p (i.e., \({{{{{{{\mathcal{O}}}}}}}}(\,{{\mbox{polylog}}}\,p)\)) both in time and space complexity.
Here, we develop ENS to approximately solve our optimization problem efficiently. We first perform a change of variables and define the WH transform of the DNN landscape as u = Hg_{θ}(X) and set it as an explicit constraint in the optimization problem. Following this change of variable, we reformulate the optimization problem in equation (7) as,
This change of variable enables us to use an augmented Lagrangian method to decouple the optimization problem in equation (7) into two subproblems: (1) updating the weights of DNN using SGD, and, 2) finding the WH transform of DNN using a fast greedy algorithm based on sparsegraph codes. The alternating direction method of the multipliers (ADMM) is a variant of the augmented Lagrangian methods that use partial updates for the dual variables and provides a principled framework to decouple the optimization problem above. Following the scaleddual form of ADMM^{23}, we decoupled the optimization problem above into two separate minimization problems and a dual update. At iteration t, we first fix \({{{{{{{{\bf{u}}}}}}}}}_{t}\in {{\mathbb{R}}}^{p}\) and solve a θminimization problem, then fix \({\theta }_{t}\in {{\mathbb{R}}}^{p}\) and solve a uoptimization problem, and finally update the dual variable \(\lambda \in {{\mathbb{R}}}^{p}\) as follows:

\(\theta \,{{\mbox{minimization}}}\,\quad \ {\theta }^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{\theta }\mathop{\sum }\nolimits_{i = 1}^{n}\scriptstyle{({y}_{i}{g}_{\theta }({{{{{{{{\bf{x}}}}}}}}}_{i}))}^{2}+\frac{\rho }{2}\parallel {{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}}){{{{{{{{\bf{u}}}}}}}}}^{t}+{\lambda }^{t}{\parallel }_{2}^{2}\)

\({{{{{{{\bf{u}}}}}}}}\,{{\mbox{minimization}}}\,\quad {{{{{{{{\bf{u}}}}}}}}}^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{{{{{{{{\bf{u}}}}}}}}}\alpha \parallel{\!\!}{{{{{{{\bf{u}}}}}}}}{\parallel }_{1}+\frac{\rho }{2}\parallel{\!\!}{{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}}){{{{{{{\bf{u}}}}}}}}+{\lambda }^{t}{\parallel }_{2}^{2}\)

\({{\mbox{dual update}}}\, \ \ \ {\lambda }^{t+1}={\lambda }^{t}+{{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}}){{{{{{{{\bf{u}}}}}}}}}^{t+1}\),
where \(\rho \in {\mathbb{R}}\) is a hyperparameter set using crossvalidation. Note that the time and space scaling issues remain here and will be addressed momentarily. Assuming an infinite time and space budget, the θminimization problem can be tackled using SGD and the uminimization problem can be solved by projecting \({{{{{{{{\bf{w}}}}}}}}}^{t+1}:={{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}})+{\lambda }^{t}\) onto the ℓ_{1}norm ball of radius ρ/α. This projection can be solved using the softthresholding operator in Lasso^{33}:
Unfortunately, all the three steps above still have exponential time and space scaling with d. In what follows we will show how to exploit the sparsity of the WH transform of the DNN landscape u = Hg_{θ}(X) to reformulate new minimization steps such that we need to subsample only a logarithmic factor \({{{{{{{\mathcal{O}}}}}}}}(\,{{\mbox{polylog}}}\,p)\) of rows in H and approximately solve these steps in sublinear time and space complexity in p (i.e., at most polynomial in d). We call this regularization scheme ENS.
The first step to arrive at the ENS regularization scheme is to reformulate the optimizations above such that the WH matrix H appears as a multiplicative term behind the dual variable λ and u. This enables us to convert the uminimization problem from a ℓ_{1}norm ball projection to a sparse WH recovery problem with H as the basis, for which we have fast solvers from signal processing and coding theory. Note that \(\parallel{\!\!}{{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}}){{{{{{{{\bf{u}}}}}}}}}^{t}+{\lambda }^{t}{\parallel }_{2}^{2}=\parallel{\!\!}{{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}}){{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{u}}}}}}}}}^{t}+{{{{{{{\bf{H}}}}}}}}{\lambda }^{t}{\parallel }_{2}^{2}\) and \(\parallel{\!\!}{{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}}){{{{{{{\bf{u}}}}}}}}+{\lambda }^{t}{\parallel }_{2}^{2}=\parallel{\!\!}[{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}})+{{{{{{{\bf{H}}}}}}}}{\lambda }^{t}]{{{{{{{\bf{H}}}}}}}}{{{{{{{\bf{u}}}}}}}}{\parallel }_{2}^{2}\) because H is a unitary matrix. Therefore, we can write the optimization steps above as

\(\theta \,{{\mbox{minimization}}}\,\quad \ \scriptstyle{\theta }^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{\theta }\mathop{\sum }\nolimits_{i = 1}^{n}{({y}_{i}{g}_{\theta }({{{{{{{{\bf{x}}}}}}}}}_{i}))}^{2}+\frac{\rho }{2}\parallel {{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{\bf{X}}}}}}}}){{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{u}}}}}}}}}^{t}+{{{{{{{\bf{H}}}}}}}}{\lambda }^{t}{\parallel }_{2}^{2}\)

\({{{{{{{\bf{u}}}}}}}}\,{{\mbox{minimization}}}\,\quad {{{{{{{{\bf{u}}}}}}}}}^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{{{{{{{{\bf{u}}}}}}}}}\scriptstyle\alpha \parallel {{{{{{{\bf{u}}}}}}}}{\parallel }_{1}+\frac{\rho }{2}\parallel [{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}})+{{{{{{{\bf{H}}}}}}}}{\lambda }^{t}]{{{{{{{\bf{H}}}}}}}}{{{{{{{\bf{u}}}}}}}}{\parallel }_{2}^{2}\)

\({{\mbox{dual update}}}\, \ {{{{{{{\bf{H}}}}}}}}{\lambda }^{t+1}={{{{{{{\bf{H}}}}}}}}{\lambda }^{t}+{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}}){{{{{{{\bf{H}}}}}}}}{{{{{{{{\bf{u}}}}}}}}}^{t+1}\).
Now, the uminimization problem is to find the WH transform of \({{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{\bf{X}}}}}}}})+{{{{{{{\bf{H}}}}}}}}{\lambda }^{t}\) with an ℓ_{1}norm sparsity prior. In order to solve this uminimization problem, we resort to the fast sparsityenabled tools in signal processing and coding theory. This class of greedy algorithms solves the original ℓ_{0}norm minimization problem and finds the kWH sparse landscape (for the specific value of k) in a time and space efficient manner (\({{{{{{{\mathcal{O}}}}}}}}(k\,{{\mbox{polylog}}}\,p)\), i.e., \({{{{{{{\mathcal{O}}}}}}}}(k\,{{\mbox{poly}}}\,d)\)) using sparsegraph codes (see Supplementary Notes for an overview of these methods). To this end, we leverage subsampling of input sequences based on patterns that induce sparsegraph codes^{22}. We denote the rows corresponding to these subsampled sequences as X_{T}, where \( T \sim {{{{{{{\mathcal{O}}}}}}}}(k{{{{{{{\mathrm{log}}}}}}}\,}^{2}p)\). The subsampling induces a linear mixing of WH coefficients such that a belief propagation algorithm (peelingdecoding) over a sparsegraph code recovers a pdimensional noisy landscape with k nonzero WH coefficients in the sublinear sample (i.e., \({{{{{{{\mathcal{O}}}}}}}}(k{{{{{{{\mathrm{log}}}}}}}\,}^{2}p)\)) and time complexity (i.e., \({{{{{{{\mathcal{O}}}}}}}}(k{{{{{{{\mathrm{log}}}}}}}\,}^{3}p)\)) with high probability^{13,22,24,25} (see Supplementary Materials for a full discussion). This addresses both the time and space scalability issues in solving the uminimization problem.
In order to resolve the time and space scalability issues in the θminimization problem and the dual update we introduce a novel approximation. We follow the subsampling patterns dictated by the sparsegraph codes in solving the uminimization problem, and restrict both the θminimization problem and the dual update to those subsamples as well to arrive at,

\(\theta \,{{\mbox{minimization}}}\,\quad \ \scriptstyle{\theta }^{t+1}\;=\;{{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{\theta }\mathop{\sum }\nolimits_{i = 1}^{n}{({y}_{i}{g}_{\theta }({{{{{{{{\bf{x}}}}}}}}}_{i}))}^{2}+\frac{\rho }{2}\parallel {{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{{\bf{X}}}}}}}}}_{T}){{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{{\bf{u}}}}}}}}}^{t}+{\gamma }^{t}{\parallel }_{2}^{2}\)

\({{{{{{{\bf{u}}}}}}}}\,{{\mbox{minimization}}}\,\quad {{{{{{{{\bf{u}}}}}}}}}^{t+1} \scriptstyle\; =\, {{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{{{{{{{{\bf{u}}}}}}}}}\alpha \parallel {{{{{{{\bf{u}}}}}}}}{\parallel }_{0}+\frac{\rho }{2}\parallel [{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{{\bf{X}}}}}}}}}_{T})+{\gamma }^{t}]{{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{\bf{u}}}}}}}}{\parallel }_{2}^{2}\)

\({{\mbox{dual update}}}\,\ {\gamma }^{t+1}={\gamma }^{t}+{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{{\bf{X}}}}}}}}}_{T}){{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{{\bf{u}}}}}}}}}^{t+1}\),
where \({\gamma }^{t}:={{{{{{{{\bf{H}}}}}}}}}_{T}{\lambda }^{t+1}\in {{\mathbb{R}}}^{ T }\) and H_{T} comprises the rows of H that are in T. Note that the change of dual variable γ^{t} = H_{T}λ^{t+1} is only possible because in all the three steps the dual variable λ^{t+1} appears in the WH basis. Note that while the columns of the subsampled WH matrix H_{T} still live in a pdimensional space, this matrix is never instantiated in memory because it only appears as H_{T}u, where u is a ksparse vector. Therefore, H_{T}u is computed on the fly by only finding the columns of the (rowsubsampled) WH matrix H_{T} that corresponds to the nonzero values in u. The final ENS method iterates over these three steps to train the DNN until convergence. We indicate the algorithm to solve each step in brackets:
ENS

\(\scriptstyle\theta \,{{\mbox{minimization}}}\,\quad \ {\theta }^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{\theta }\mathop{\sum }\nolimits_{i = 1}^{n}{({y}_{i}{g}_{\theta }({{{{{{{{\bf{x}}}}}}}}}_{i}))}^{2}+\frac{\rho }{2}\parallel {{{{{{{{\bf{g}}}}}}}}}_{\theta }({{{{{{{{\bf{X}}}}}}}}}_{T}){{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{{\bf{u}}}}}}}}}^{t}+{\gamma }^{t}{\parallel }_{2}^{2}\, [{{{{{\rm{SGD}}}}}}]\)

\(\scriptstyle{{{{{{{\bf{u}}}}}}}}\,{{\mbox{minimization}}}\,\quad {{{{{{{{\bf{u}}}}}}}}}^{t+1}={{{{{{{\rm{arg}}}}}}}}\ {{{{{{{{\rm{min}}}}}}}}}_{{{{{{{{\bf{u}}}}}}}}}\alpha \parallel {{{{{{{\bf{u}}}}}}}}{\parallel }_{0}+\frac{\rho }{2}\parallel [{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{{\bf{X}}}}}}}}}_{T})+{\gamma }^{t}]{{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{\bf{u}}}}}}}}{\parallel }_{2}^{2}\,[{{{{{\rm{Peeling}}}}}}]\)

\({{\mbox{dual update}}}\, \ {\gamma }^{t+1}={\gamma }^{t}+{{{{{{{{\bf{g}}}}}}}}}_{{\theta }^{t+1}}({{{{{{{{\bf{X}}}}}}}}}_{T}){{{{{{{{\bf{H}}}}}}}}}_{T}{{{{{{{{\bf{u}}}}}}}}}^{t+1}\,[{{{{{\rm{Directly}}}}}}\, {{{{{\rm{computed}}}}}}]\)
All the three steps above in the ENS method scale sublinearly with p (i.e., at most polynomial with d) both in terms of time and space complexity.
Experimental setup
The architecture of DNN was selected in isolation (i.e., without any WH regularization). In our architecture search, we considered a fourlayer fully connected DNN with batch normalization and leaky ReLU as the activation function. The dimension of the layers was set to d × fd, fd × fd, fd × d, and the dimension of the final layer was d × 1, where f is an expansion factor. We searched for a value of f that resulted in the best generalization accuracy on an independent data set—a prediction task on DNA repair landscapes^{13} which we did not use for evaluation in this paper. DNN prediction performance was stable around f = 10 with the highest validation accuracy on the independent data set. We selected f = 10 in all our experiments, except for the experiments done on the avGFP landscape^{2}, where due to the sheer dimensionality of the problem (i.e., d = 236), we set f = 1 (on limited independent tests with f = 10 on the same landscape, we observed no considerable difference in prediction accuracy). The weights of the DNN were always initialized with the Xavier uniform initialization^{34}. We used the exact same initialization (random seed) for the baseline DNN with and without EN(S) regularization to ensure that we solely capture the effect of regularization and not the variations due to the initialization of DNN. We used the Adam optimizer in all the steps of the methods requiring SGD and a learning rate of 0.001, which resulted in the best validation accuracy. We set α = 0.1 in EN. For the DNN with EN(S) regularization, a learning rate of 0.01 resulted in the best validation accuracy. In ENS, the hyperparameters α and ρ have to be jointly set since they are dependent. We set α = 1 and ρ = 0.01 in ENS although other value pairs could have resulted in the same accuracy. The validation accuracy of DNN was monitored and used for early stopping to avoid overfitting based on the performance on a holdout validation set (with a maximum of 1000 epochs). We used the exact same validation set to perform hyperparameter tuning of the baseline algorithms, including the Lasso family, random forest, and gradient boosted trees.
For the family of Lasso regression, we performed an extra step to improve the prediction performance. We selected the top most recovered coefficients and performed ordinary least squares (OLS) on the reduced problem. This step improves the robustness and the prediction accuracy of Lasso^{35}. Therefore, in addition to the standard λ regularization parameter, which strikes a balance between sparsity and the fidelity term (i.e., the mean squared error), we also did hyperparameter tuning for the number of top coefficients in the OLS (note that the regular Lasso is included in our hyperparameter search and appears when all the nonzero coefficients are selected to perform OLS). We did a grid search over the hyperparameter λ and the number of top coefficients in Lasso. For λ we considered 50 values spanning the range [10^{−7}, 1]. Overall, this comprised of an exhaustive hyperparameter search to make sure the best performance of Lasso is being captured.
For training gradient boosted trees and random forests baselines, we used packages from sklearn in python. We did hyperparameter tuning for max depth and the number of estimators, using the default values for all other parameters. For max depth, we considered parameters ranging from 1 to a constant times the maximum number of mutations in the fitness function (i.e., d), for the number of estimators we considered values in {10, 50, 100, 200, 300, 400, 500, 1000, 2000, 3000}, and chose the pair that resulted in best validation accuracy. As a general trend, we observed that larger numbers of estimators result in higher validation accuracies before they saturate.
Here, we report the hyperparameters that resulted in the highest validation accuracy, that is, the ones we selected in our experiments. For the avGFP landscape, we set the number of estimators to 300 and max depth to 11 for gradient boosted trees and set the number of estimators to 100 and max depth to 55 for random forests. We set λ = 1 × 10^{−4} for Lasso regression when considering up to firstorder interactions and λ = 1 × 8^{−4} when considering up to secondorder interactions. For the GB1 landscape, we set the number of estimators to 100 and max depth to 2 for both gradient boosted trees and random forests. We set λ = 7 × 10^{−3} for Lasso regression when considering up to firstorder interactions and λ = 2.5 × 10^{−2} when considering up to secondorder interactions. For the protein landscape in Fig. 3, we set the number of estimators to 3000 and the max depth varied between the values in the sets {1, 2, 3, 4} and {1, 2, …, 15} across the random repeats of the experiments with different train, test, and validation set, respectively for gradient boosted trees and random forest; the value with the best validation performance was selected for each repeat. For the bacterial landscapes in Fig. 2, we set the number of estimators to 300 and the max depth varied between the values in the set {1, 2, 3} across the random repeats of the experiments with different train, test, and validation set; the value with the best validation performance was selected for each repeat.
In all the relevant protein and biological data sets, we performed a twosided Ttest for the null hypothesis that the independent prediction from DNN with and without EN regularization (across random Xavier initialization) has identical average (expected) values and reported the pvalues.
Prepossessing the fitness landscapes
For some of the landscapes tested in this paper, we followed the BoxCox power transform method as described in ref. ^{18} to remove possible global nonlinearities from the landscape. Although the effect of removing such nonlinearities was small in our analysis, global nonlinearities in general can produce highorder epistatic interactions that are not truly needed. Removing these nonlinearities can reduce noise and increase epistatic sparsity. Nevertheless, one can completely ignore this preprocessing step and rely on DNN with EN regularization to capture the global nonlinearites and infer the fitness landscape for prediction purposes.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The canonical bacterial fitness data used in Fig. 2 are available in the github repository associated with the Ref. ^{18}. The E. quadricolor fluorescent protein data used in Fig. 3 are available in Supplementary Data 3 of https://doi.org/10.1038/s41467019121308 Ref. ^{3}. The avGFP protein data used in Fig. 4 are available in the figshare data repository under accession code 3102154. The GB1 protein data used in Fig. 4 are available in Supplementary file 1 and Supplementary file 2 of https://elifesciences.org/articles/16965/figures Ref. ^{1}. All other data generated or analyzed in this study are included in this published article (and in its accompanying Supplementary Information and Supplementary Data).
Code availability
A software for the Epistatic Net regularization algorithms has been developed in Python and is publicly available in our github repository at https://github.com/amirmohan/epistaticnet^{36}.
References
Wu, N. C., Dai, L., Olson, C. A., LloydSmith, J. O. & Sun, R. Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5, e16965 (2016).
Sarkisyan, K. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Poelwijk, F., Socolich, M. & Ranganathan, R. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat. Commun. 10, 1–11 (2019).
Kopsidas, G. et al. RNA mutagenesis yields highly diverse mRNA libraries for in vitroprotein evolution. BMC Biotechnol. 7, 18 (2007).
Sanjuán, R. Mutational fitness effects in RNA and singlestranded DNA viruses: common patterns revealed by sitedirected mutagenesis studies. Philos. Trans. R. Soc. B: Biol. Sci. 365, 1975–1982 (2010).
Takata, M. A. et al. Global synonymous mutagenesis identifies cisacting RNA elements that regulate HIV1 splicing and replication. PLoS Pathogens 14, e1006824 (2018).
Weinreich, D., Delaney, N., DePristo, M. & Hartl, D. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114 (2006).
Hall, D., Agan, M. & Pope, S. Fitness epistasis among 6 biosynthetic loci in the budding yeast Saccharomyces cerevisiae. J. Heredity 101, S75–S84 (2010).
Khan, A., Dinh, D., Schneider, D., Lenski, R. & Cooper, T. Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332, 1193–1196 (2011).
De Visser, J. A. G. & Krug, J. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490 (2014).
Listgarten, J. et al. Prediction of offtarget activities for the endtoend design of CRISPR guide RNAs. Nat. Biomed. Eng. 2, 38–47 (2018).
Leenay, R. T. et al. Large dataset enables prediction of repair after CRISPRCas9 editing in primary T cells. Nat. Biotechnol. 37, 1034–1037 (2019).
Aghazadeh, A., Ocal, O. & Ramchandran, K. CRISPRLand: Interpretable largescale inference of DNA repair landscape based on a spectral approach. Bioinformatics 36, i560–i568 (2020).
Wu, Z., Kan, S. J., Lewis, R. D., Wittmann, B. J. & Arnold, F. H. Machine learningassisted directed protein evolution with combinatorial libraries. Proc. Natl Acad. Sci. 116, 8852–8858 (2019).
Wolf, J. B., Brodie, E. D. & Wade, M. J., Epistasis and the Evolutionary Process. Oxford University Press (2000).
Cordell, H. J. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Mol. Genet. 11, 2463–2468 (2002).
Sailer, Z. R. & Harms, M. J. Highorder epistasis shapes evolutionary trajectories. PLoS Comput. Biol. 13, e1005541 (2017).
Sailer, Z. & Harms, M. Detecting highorder epistasis in nonlinear genotypephenotype maps. Genetics 205, 1079–1088 (2017).
Jiang, R., Tang, W., Wu, X. & Fu, W. A random forest approach to the detection of epistatic interactions in casecontrol studies. BMC Bioinform. 10, 1–12 (2009).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learning Res. 15, 1929–1958 (2014).
Ballal, A. et al. Sparse epistatic patterns in the evolution of Terpene synthases. Mol. Biol. Evol. 37, 7, 1907–1924 (2020).
Xiao, L., Bradley, J., Pawar, S. & Ramchandran, K. The SPRIGHT algorithm for robust sparse Hadamard transforms. IEEE Intl. Symp. Info. Theory, 1857–1861 (2014).
Boyd, S., Parikh, N. & Chu, E. Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers. now Publishers Inc. (2011).
Li, X., Bradley, J., Pawar, S. & Ramchandran, K. SPRIGHT: A fast and robust framework for sparse WalshHadamard transform. Preprint at https://arxiv.org/abs/1508.06336 (2015).
Li, X. & Ramchandran, K. An active learning framework using sparsegraph codes for sparse polynomials and graph sketching. Proc. 28^{th} Intl. Conf. Neural Info. Process. Syst. 2, 2170–2178 (2015).
Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with gaussian processes. Proc. Natl Acad. Sci. 110, E193–E201 (2013).
Brookes, D. H., Aghazadeh, A. and Listgarten, J. On the sparsity of fitness functions and implications for learning. bioRxiv https://doi.org/10.1101/2021.05.24.445506 (2021).
Wu, F.Y. The Potts model. Rev. Mod. Phys. 54, 235 (1982).
McCoy, B. M. & Wu, T. T. The twodimensional Ising model. Harvard University Press (2014).
Doersch, C. Tutorial on variational autoencoders. https://arxiv.org/abs/1606.05908 (2016).
Boros, E. & Hammer, P. PseudoBoolean optimization. Discrete Appl. Math. 123, 155–225 (2002).
Cochran, W. T. et al. What is the fast Fourier transform? Proc. IEEE 55, 1664–1674 (1967).
Figueiredo, M., Nowak, R. & Wright, S. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1, 586–597 (2007).
Boris, H. & Rolnick, D. How to start training: the effect of initialization and architecture. Proc. 32^{th} Intl. Conf. Neural Info. Process. Syst. 1, 569–579 (2018).
Friedman, J., Hastie, T. & Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics New York (2001).
Aghazadeh, A. et al. Epistatic Net Software https://doi.org/10.5281/zenodo.5111310 (2021).
Acknowledgements
A.A., O.O., and K.R. were supported by the NSF (1703678) and ARO (W911NF2110117). H.N. was supported by the National Library of Medicine of the NIH (T32LM012417); the content is solely the responsibility of the author and does not necessarily represent the official views of the NIH. O.O.K. was supported by the NSF (1748692). D.H.B and J.L. were supported by the DOE, Office of Biological and Environmental Research, Genomic Science Program Lawrence Livermore National Laboratory’s Secure Biosystems Design Scientific Focus Area (SCW1710). The authors thank Clara WongFannjiang for insightful discussions.
Author information
Authors and Affiliations
Contributions
A.A., H.N., and O.O. designed research; A.A., H.N., O.O., D.B., and Y.H. conducted research; A.A., O.O., O.O.K., J.L., and K.R. wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
Jennifer Listgarten is on the Scientific Advisory Board for Foresite Labs and Patch Biosciences. Other authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Raghav Shroff and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aghazadeh, A., Nisonoff, H., Ocal, O. et al. Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat Commun 12, 5225 (2021). https://doi.org/10.1038/s41467021253713
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021253713
This article is cited by

Current progress and open challenges for applying deep learning across the biosciences
Nature Communications (2022)

Learning protein fitness models from evolutionary and assaylabeled data
Nature Biotechnology (2022)

Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning
Nature Communications (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.