Mega-scale experimental analysis of protein folding stability in biology and design

Advances in DNA sequencing and machine learning are providing insights into protein sequences and structures on an enormous scale1. However, the energetics driving folding are invisible in these structures and remain largely unknown2. The hidden thermodynamics of folding can drive disease3,4, shape protein evolution5–7 and guide protein engineering8–10, and new approaches are needed to reveal these thermodynamics for every sequence and structure. Here we present cDNA display proteolysis, a method for measuring thermodynamic folding stability for up to 900,000 protein domains in a one-week experiment. From 1.8 million measurements in total, we curated a set of around 776,000 high-quality folding stabilities covering all single amino acid variants and selected double mutants of 331 natural and 148 de novo designed protein domains 40–72 amino acids in length. Using this extensive dataset, we quantified (1) environmental factors influencing amino acid fitness, (2) thermodynamic couplings (including unexpected interactions) between protein sites, and (3) the global divergence between evolutionary amino acid usage and protein folding stability. We also examined how our approach could identify stability determinants in designed proteins and evaluate design methods. The cDNA display proteolysis method is fast, accurate and uniquely scalable, and promises to reveal the quantitative rules for how amino acid sequences encode folding stability.


Supplementary Fig. 2 Classification of deep mutational scanning results
We classified all deep mutational scanning results into nine groups shown in Fig. 2b.Here, we show the classification criteria.The description of all metrics is also included in Supplementary Table 2, and the metrics of all domains for the classification are included in Single_DMS_list.csv.

Supplementary Fig. 3 Classification of the natural protein domains investigated in cDNA display proteolysis
Comprehensive group list of wild-type structures classified as G0 in Fig. 2b grouped into domain families.
Supplementary Fig. 4 Comprehensive double mutational data for the notable amino acid pairs (a and b) Analysis of thermodynamic coupling for two notable amino acid pairs.In the first row, we show stabilities for all 20x20 double mutants according to five different experimental metrics.From left to right, we show trypsin K 50 , chymotrypsin K 50 , ∆G inferred from trypsin experiments, ∆G inferred from chymotrypsin experiments, and ∆G inferred from both sets of experiments together.In the second row, we show the results of the additive model.From left to right, the first two plots show the inferred single amino acid terms for all 20 amino acids in the first and second sites of the amino acid pair.Error bars represent the standard deviation of the posterior distributions (n=25) .The middle heatmap shows stability (∆G) for all amino acid pairs according to the additive model (the sum of the two single amino acid terms).The fourth plot shows the observed thermodynamic coupling; e.g. the experimental ∆G (rightmost plot in the first row) minus the prediction from the additive model (middle plot of the second row).The final scatter plot shows experimental stabilities for all double mutants (y-axis) plotted against the results from the additive model (x-axis).
(c) Same analysis as (a) and (b) for two site pairs in MYO3-SH3 domain (2BTT).
(d) Analysis of thermodynamic coupling for all amino acid pairs from a notable amino acid triple.The same amino acid substitutions were also performed for the mutant background with the third amino acid replaced by Ala.From left to right, we show the stabilities (∆G) of all pairs of amino acids, the single amino acid terms in the additive model (error bars show the standard deviation of the posterior distribution), the stabilities for all pairs according to the additive model, and the thermodynamic coupling for all pairs of amino acids.

Supplementary Fig. 5 Testing calibration of classification model for predicting wild-type amino acids
(a) Relationship between predicted cumulative probability and observed cumulative occurrence for each of 19 amino acids and total data.For each of the 19 amino acids (excluding Cys), we order all 4,718 sites from lowest to highest probability for that amino acid, then step through the sites in that order while plotting the fraction of the total cumulative probability (x-axis) and the fraction of all occurrences of that amino acid (y-axis).For the "Total" plot, we order all 89,642 (4,718*19) amino acid possibilities at all sites from lowest probability to highest probability, then step through all amino acid possibilities in that order while plotting the fraction of the total cumulative probability (x-axis) and the fraction of all actual amino acid occurrences (y-axis).
The black diagonal lines show Y=X.
(b) Relationship between modeled amino acid probabilities and actual amino acid frequencies.
For each of the 19 amino acids (excluding Cys), we binned all 4,718 sites into 20 bins according to the probability of that amino acid.Bins are spaced every 0.05 probability units and each bin has a width of 0.1, so sites can appear in two neighboring bins.For each bin (x-axis), the bar shows the true frequency of that amino acid in that bin (y-axis); error bars indicate the standard deviation of the true frequency from bootstrap resampling of all the sites.The black diagonal lines show Y=X (e.g. the predicted probability matches the true frequency).For the "Total" plot, we binned all 89,642 (4,718*19) amino acid possibilities at all sites as before, then counted the fraction of matching amino acids in each bin.Error bars represent the standard deviation of the frequencies from bootstrap resampling of all sites (n=50) .

Supplementary Notes
Structure of the K 50 model to infer K 50 values from next-generation sequencing data We modeled our selection results using the single turnover kinetics model described in Fig. 1b.
We chose this model because we expect that the total concentration of protein-cDNA complex is low compared to the amount of added enzyme and because the model captures the saturation behavior observed by qPCR at high enzyme concentration (Extended Data Fig. 1).Instead of attempting to capture the microscopic complexity of our system (millions of different substrates and potential inhibitors), the purpose of the model is to treat each substrate in a consistent, simplified manner and infer reasonable parameters.
Our model makes two main assumptions.First, we assume that each sequence is cleaved independently, with no competition or product inhibition.As described by Fig. 1 eqs. 2 and 3, cleavage is described by four parameters: enzyme concentration (E), time (t), and the kinetic parameters K 50 and k max .All experiments used a fixed five minute reaction time.Based on qPCR analysis of individual sequences (Extended Data Fig. 1), we fixed the quantity k max * t at 10 0.65 for all sequences.Each sequence's unique stability is defined by the K 50 parameter that represents the enzyme concentration producing the half maximal cleavage rate (Fig. 1b eq.3).Our second main assumption is that we can interpret our K 50 values as representing the dissociation constants (K D ) between each protein sequence and the enzyme (K 50 ≈ K D , Fig. 1b eq. 6).From this assumption, we can determine the folding stability of each sequence (ΔG) based on the relationship between the observed K 50 value and theoretical K 50 values for the fully folded and fully unfolded states (K 50,F and K 50,U , Fig. 1b eqs.5-7).Although we can directly fit K 50 values without making any assumptions about the microscopic basis for K 50 (see Supplementary information for the detail), assuming that K 50 ≈ K D aids our interpretation and enables us to directly fit ΔG values to our data using the Coupled approach described below.
To fit our model to our sequencing counts data, we first assume that the cDNA display process produces an unknown initial distribution of full-length protein-cDNA complexes (the cDNA 0 distribution).The distribution of sequences at enzyme concentration E (the cDNA E distribution) is the product of the initial sequence distribution cDNA 0 and the surviving fraction of each sequence according to Fig. 1  ) Finally, we assume that our deep sequencing counts result from n sel independent selections from the cDNA E distribution, where n sel is the number of sequencing reads that exactly matched our specified DNA sequences.
We apply the K 50 model in two different ways based on whether K 50 values for trypsin and chymotrypsin are Independent or Coupled.The "Independent" procedure is used in Steps 1, 2 and 5 in the section "Procedure for fitting all data".In the independent procedure, the inputs to the model are the sequencing counts data from experiments with one protease, the enzyme concentrations, the reaction time, and the k max constant.We fit the model by sampling two parameters per sequence from normal prior distributions: (1) K 50 , and (2) the initial fraction of each sequence in the cDNA 0 distribution.The "Coupled" procedure is used in Step 5 in the section "Procedure for fitting all data".In the coupled procedure, the inputs to the model are the sequencing counts data from experiments with both proteases, the enzyme concentrations, the reaction time, the k max constant, the K 50,F constants representing the universal K 50 value for sequences in the folded state (one for each protease), and the predicted K 50,U values for all sequences for both proteases from the unfolded state model.We then assume that each sequence has a specific ΔG value that is shared across both proteases.We use this shared ΔG value along with K 50,F and K 50,U (for each protease) to determine K 50 for each protease according to Fig. 1 Eqs. 5 and 7. Finally, we fit the coupled model by sampling two parameters per sequence from normal prior distributions: (1) ΔG, and (2) the initial fractions of each sequence in cDNA 0 .
Full results from both the independent and coupled fitting procedure are provided in Tsuboyama2023_Dataset2_Dataset3_20230416.csv.For our stability parameters (protease-specific K 50 in the independent procedure and ΔG in the coupled procedure) we report the median of the posterior distribution as well as the upper and lower limits of the 95% confidence interval (the 2.5%ile and 97.5%ile values of the posterior distribution).We also used the protease-specific K 50 values from the independent procedure to compute protease-specific ΔG values.We do this using the same K 50,F and K 50,U values used in the coupled procedure according to Fig. 1 Eqs. 5 and 7.These protease-specific ΔG estimates are also reported in Tsuboyama2023_Dataset1_20230416.csv and are only used to examine the consistency between different proteases (e.g.Fig. 1f and Fig. 2d).In some cases, the independently fit K 50 values can lead to impossible values for ΔG.This can occur if K 50 is higher than K 50,F (observed cleavage is slower than our limit for cleavage in the folded state) or if K 50 is lower than K 50,U (observed cleavage is faster than predicted cleavage in the unfolded state).If the median protease-specific K 50 or the confidence interval limits for a particular sequence lead to impossible ΔG values for that sequence, we report dummy values for the corresponding protease-specific ΔG estimates.
Structure of the unfolded state model to infer unfolded K 50 (K 50,U ) from scrambled sequence data Our unfolded state model is similar to the model employed previously 18 with two notable differences.First, instead of assuming that all scrambled sequences are fully unfolded, we assume that each scrambled sequence has its own unknown folding stability, with a prior distribution biased toward low stability (normal prior centered at ΔG = -1, sigma = 4).Second, instead of fitting an unfolded state model for each protease independently, we assume that each scrambled sequence's stability (ΔG) is common across both proteases, and fit the models for each protease together.As a result, the majority of scrambled sequences are modeled as completely unfolded (Extended Data Fig. 2c), but some scrambled sequences are modeled as stable when that interpretation is consistent with both the trypsin and chymotrypsin data.
Our unfolded state has three parts: (1) a position specific scoring matrix (PSSM) that describes how the amino acid sequence in a 9-mer window (the P5 to P4' positions in protease nomenclature) determine the cleavage rate at the P1 position, (2) a local response function describing the saturation of the cleavage rate for a single P1 position, (3) a global response function that determines K 50,U based on the sum of the cleavage rates at all possible P1 positions in the full sequence.
To fit the PSSM, we assumed an identical normal prior distribution of scores at all positions, with several exceptions.Due to known critical importance of the P1 position, we used a wider prior distribution of scores for all amino acids in the P1 position for both proteases.We also used wider prior distributions at all positions (P5-P4') for the amino acids Asp, Glu, and Pro, due to the established large effects of these amino acids on cutting rates.
For the local response function to saturation of the cleavage rate at P1 site k, we used a logistic function: (9) where SS k (site saturation) is the saturation of the cutting rate at site P1=k, aa site is the amino acid identity at site, and logistic is the logistic function f(x) = 1 / (1+e x ).We fit the 21 (20 amino acids + 'X' representing empty sites) x 9 =189 elements of the PSSM for each protease.
For the global response function (determining K 50,U based on the sum of SS k across the full protein sequence), we use a sum of logistic functions with 10 different activation thresholds.where maxK 50,U is the highest possible K 50,U value (K 50,U assuming no cut sites), Scale is the range of possible K 50,U values, and threshold l is the value of the l th activation threshold for the global response function.All K 50 values (including maxK 50,U ) are in log 10 molar units.
The key parameters of the unfolded state model (for a single protease) are the 21 x 9 =189 elements of the PSSM, the maxK 50,U , the scale, and the 10 threshold values.These parameters determine K 50,U for each sequence by Eqs. 9 and 10.In addition to these parameters, we also sample the ΔG values for each scrambled sequence during fitting.These sampled parameters (as well as the universal K 50,F value for all sequences) are sufficient to determine a theoretical K 50 value for each scrambled sequence by re-writing Fig. 1b Eq. 6: (11) where f U is the fraction of unfolded molecules: The input data for the model are the observed K 50 values for all scrambled sequences.The parameters of the model are fit by assuming that all observed K 50 values should agree (with small, normally distributed errors) with the theoretical K 50 values determined by the model parameters.After fitting the model, we used the median of the posterior distributions of PSSM, maxK 50,U , scale, and the 10 threshold parameters as the final model parameters.We used these final model parameters to calculate K 50,U for all sequences in our experiments without considering any uncertainty from the model posterior distribution.

Procedure for fitting all data
Step 1: Estimation of 'effective' protease concentrations for each library: We employed four DNA oligonucleotide libraries for this study.Although we tried to minimize the difference between assay conditions, we also fit "effective" protease concentrations to our data in order to minimize batch-to-batch differences.We used the K 50 model to perform this fitting and fit protease concentrations for trypsin and chymotrypsin entirely independently.The main assumption of this fitting is that each sequence should have the same K 50 when assayed in different libraries.By enforcing that each sequence had a single K 50 value regardless of what library it appears in, we calibrated the protease concentrations in each library against each other.Although we did not use universal control sequences in all four libraries, each library contained 1000 to 2000 sequences that overlapped at least one other library in a fully connected graph.Specifically, the library pairs 1+4, 2+4, 3+4, 1+2, and 2+3 each included 1,000 to 2,000 overlapping sequences.
The overall model included 96 experimental conditions (12 protease concentrations per replicate x 2 replicates x 4 libraries; one of the 12 protease concentrations was the fixed "no protease" starting condition).However, each sequence was only present in 48 of the 96 conditions because any individual sequence was only present in two out of the four libraries.The inputs to fit the model were the sequencing counts data, the reaction time (t), and the k max constant.Additionally, to set the overall scale of the protease concentration series, we fixed the effective protease concentrations for Library 4 at the expected protease concentrations (i.e.three-fold serial dilutions of 25 μM protease (Replicate 1) or 43.3 μM protease (Replicate 2)).We also fixed all of the starting samples at zero protease.Using these model inputs, we sampled the K 50 values (one per sequence), the remaining 66 protease concentrations, and the initial sequence distributions cDNA 0 (a separate cDNA 0 was used for each of the 8 replicates).Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters.Sampling was performed using the No U-Turn Sampler (NUTS) in Numpyro with 50 steps of equilibration and 25 steps of production.We used the medians of the protease concentrations from our 25 posterior samples as our final calibrated protease concentrations for all further analysis (discarding the uncertainties).
Step 2: Estimation of K 50 values of scramble sequences: To train the unfolded state model, we need to determine K 50 values for our scramble sequences, which were included in Library 2. We used the Independent K 50 model for this step.The input data were the sequencing counts data from two replicates (i.e. 12 protease concentrations x 2 replicates = 24 data points per sequence), the reaction time (t), the k max constant, and the effective protease concentrations obtained in Step 1.We sampled the initial sequence distribution cDNA 0 (a separate cDNA 0 for each replicate) and K 50 for all sequences included in Library 2. Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters.Sampling was performed using the No U-Turn Sampler (NUTS) in Numpyro with 100 steps of equilibration and 50 steps of production.
Step 3: Construction of unfolded state model: We trained the unfolded state model for predicting K 50,U using K 50 values obtained in Step 2. The input sequences were scrambled sequences of wild-type domains selected for deep mutational screening.In addition to our set of exactly scrambled sequences (matching the wild-type amino acid composition 100%), we also included scrambled sequences containing 50%, 60%, 70%, 80%, and 90% of the number of hydrophobic amino acids in the original wild-type sequences.These sequences helped ensure the large majority of our scrambled pool was fully unfolded.Additionally, because all sequences in our experiments are padded with G/S/A linkers up to a constant length, we generated scrambled sequences using two different padding procedures.In the first approach, we designed scrambled sequences that matched the original wild-type length and were padded with G/S/A up to 72 amino acids.In the second approach, we designed 72 amino acid-length scrambles approximately matching the composition of an original wild-type domain, regardless of the length of that wild-type.These scrambled sequences required no additional padding.After measuring K 50 for all scrambles, we only used sequences with a 95% confidence interval smaller than 0.5 log 10 molar units for model training for model fitting (64,238 sequences in total, see Extended Data Fig. 3).In addition to the exact experimental sequences, we also augmented the training dataset with dummy sequences where GS linkers were replaced by the blank 'X' amino acid.
The inputs for the model are amino acid sequences created as described above, and their observed K 50 for trypsin and chymotrypsin obtained in Step 2. The parameters of the model are fit by assuming that all observed K 50 values should agree (with small, normally distributed errors) with the theoretical K 50 values.In this model, we sampled the 21 x 9 =189 elements of the PSSM, the site bias, the maxK 50,U , the scale, and the 10 threshold values.These parameters determine K 50,U for each sequence by Eqs. 9 and 10.In addition to these parameters, we also sample the ΔG values for each scrambled sequence during fitting.
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters.Using NUTS model, we sampled the parameters described above, then reported the median of the 100 posteriors after removing the initial 400 steps.In Step 4, we used these final model parameters to calculate K 50,U for all sequences in our experiments without considering any uncertainty from the model posterior distribution.
Step 4: Prediction of unfolded K 50 values (K 50,U ) across the full dataset: Using the final model parameters obtained in Step 3, we predicted K 50,U values for each amino acid sequence in the libraries without considering any uncertainty.Additionally, since the model was constructed to predict unfolded K 50 for sequences with 86 amino acids, we added a Gly linker 'GGG' to both ends, followed by padding by 'X' up to 86 amino acids.
Step 5: Estimation of K 50 values and calculation of ΔG for trypsin and chymotrypsin: We applied the Coupled K 50 model to each of the four libraries separately.The inputs to the model are the sequencing count data from trypsin and chymotrypsin experiments (i.e. 12 protease concentrations x 2 replicates x 2 proteases = 48 data points per sequence), the effective protease concentrations obtained in Step 1, the reaction time, the k max constant (t*k max = 10 0.65 based on qPCR analysis; see Extended Data Fig. 1), the K 50,F constants (3 for trypsin, 2 for chymotrypsin; determined based on the dynamic range of proteolysis experiment; see Extended Data Fig. 5), and the K 50,U values predicted by the unfolded model in Step 4. Using the inputs, we sampled ΔG shared between trypsin and chymotrypsin, and initial sequence distribution cDNA 0 for each protease for each replicate (although our experiments utilized the same batch of the cDNA-protein complex for two replicates).
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters.Using NUTS in Numpyro module, we sampled the posteriors of shared ΔG along with other parameters, then obtained the median of the 50 posterior samples after removing the initial 100 steps.Full results from both the independent and coupled fitting procedure are provided in Tsuboyama2023_Dataset1_20230416.csv and Tsuboyama2023_Dataset2_Dataset3_20230416.csv.For our stability parameters (protease-specific K 50 in the independent procedure and ΔG in the coupled procedure) we report the median of the posterior distribution as well as the upper and lower limits of the 95% confidence interval (the 2.5%ile and 97.5%ile values of the posterior distribution).
We also applied the Independent K 50 model to each of the four libraries separately.The inputs to the model are the sequencing count data (i.e. 12 protease concentrations x 2 replicates = 24 data points per sequence), the effective protease concentrations obtained in Step 1, the reaction time, the k max constant (t*k max = 10 0.65 based on qPCR analysis; see Extended Data Fig. 1).Using the inputs, we sampled K 50 for each protease, and initial sequence distribution cDNA 0 for each protease for each replicate (although we utilized the same batch of the cDNA-protein complex for two replicates).
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters.Using NUTS in Numpyro module, we sampled the posteriors of K 50 for trypsin and K 50 for chymotrypsin along with other parameters, then obtained the median of the 50 posterior samples after removing the initial 100 steps.
Then, we computed protease-specific ΔG values using the protease-specific K 50 values from the Independent model.We do this using the same K 50,F and K 50,U values used in the coupled procedure according to Fig. 1b Eqs. 5 and 7.These protease-specific ΔG estimate s are also reported in Tsuboyama2023_Dataset2_Dataset3_20230416.csv, and are only used to examine the consistency between different proteases (e.g.Fig. 1f and Fig. 2d).In some cases, the independently fit K 50 values can lead to impossible values for ΔG.This can occur if K 50 is higher than K 50,F (observed cleavage is slower than our limit for cleavage in the folded state) or if K 50 is lower than K 50,U (observed cleavage is faster than predicted cleavage in the unfolded state).If the median protease-specific K50 or the confidence interval limits for a particular sequence lead to impossible ΔG values for that sequence, we reported dummy values for the corresponding protease-specific ΔG estimates.
The actual number of sequencing counts, as well as the number of counts predicted for all sequences at all concentrations according to the fitted model parameters, are given in Raw_NGS_count_tables.zip and Pipeline_K50_dG.zip.
Thermodynamic coupling analysis (related to Fig. 4) Thermodynamic coupling refers to the change in folding stability due to the interaction between two amino acids after removing folding stability effects from each amino acid individually.To determine this "nonadditivity", we first modeled our double mutant data using a fully additive model (no thermodynamic coupling).The deviations from this model then reveal the thermodynamic coupling.Our additive model assumes that the absolute stability (ΔG) of each sequence is the sum of an amino acid-dependent term for site one (ΔG 1 ) and an amino acid-dependent term for site two (ΔG 2 ) The forty site-specific terms (one ΔG 1 term for each amino acid at site one and one ΔG 2 term for each amino acid at site two) are not experimentally measurable; they are inferred based on minimizing the error of the additive model.We used Bayesian inference to infer the forty ΔG 1 and ΔG 2 terms for each set of mutants.The inputs to fit the model were the observed 400 ΔG values (20 amino acids at site one x 20 amino acids at site two) for a particular site pair.Using NUTS, we sampled ΔG 1 and ΔG 2 by assuming that the 400 observed ΔG values should agree (with small, normally distributed errors) with the expected ΔG values determined by eq. 13.Both expected and observed ΔG values were clipped to the range of -1 to 5 kcal/mol.We used 100 steps of burn-in and used the median of 50 posterior samples as the final values of the ΔG 1 and ΔG 2 terms.Using these terms, we calculated the expected (additive) ΔG for each sequence, and then the thermodynamic coupling: To calculate the uncertainty in the thermodynamic coupling, we re-fit the additive model 50 times by bootstrap resampling of the 400 observed ΔG values.This ensures the ΔG 1 and ΔG 2 terms are not overly dependent on a single experimental measurement.The model fitting code is provided in Additive_model_Fig4.ipynb.
Wild-type amino acid prediction model (related to Fig. 5) The classification model in Fig. 5  is the logistic function f(x) = 1 / (1+e x ), i indexes the 100 logistic functions defining the weighting function, amp is the learned vector describing the amplitudes of the logistic functions, threshold is the vector describing the centers of the logistic functions, steepness defines the steepness of the logistic functions, and offset the learned vector (length 19 for the 19 non-Cys amino acids) describing the absolute probability offset for each amino acid.
We used Bayesian inference to infer the amp vector (length 100) and offset vector (length 19 for the 19 non-Cys amino acids).The logistic threshold vector was fixed at 100 evenly spaced points between -2 and 7 kcal/mol.The steepness term was fixed at 5. The inputs to fit the model were the observed ΔG values and the wild-type amino acid identities for each site within the natural protein domains.Using NUTS, we sampled amp and offset by assuming that the observed wild-type amino acids were randomly chosen at each site according to the predicted probability distribution for that site, calculated according to eq. 15.We then reported the median and the standard deviation of 100 posterior samples after removing the initial 500 steps.The fitting script is included in Classification_model_Fig5.ipynb.
Derivation of eq.3 in Fig. 1b We modeled the cleavage events, where Protease enzymes (E) and protein substrates (S) form an ES complex to produce cleaved protein products (P).The goal is to get a product formation equation in terms of the total product, initial enzyme and substrate concentrations and kinetic constants.
Also, we defined equilibrium constant K 50 : Based on the model (1), we can obtain the following dynamic formulas: (16) The first two of these are assumed to be at quasi-steady state.The following are additional conservation equations for substrate-product and enzyme: where [S 0 ] is initial amount of substrates Additionally, the reaction conditions in the study were not substrate-excessive but enzyme-excessive: ( Using eqs.1',19, and 20, the following can be derived to find an expression for the enzyme-substrate complex in terms of the initial substrate and enzyme concentration: Substituting eq.22 into eq.18 and using the approximation , the an expression for the dynamics of the product formation in terms of enzyme concentration and substrate can be found: Derivation of eq.6 and eq.7 in Fig. 1b We modeled the cleavage events, where Protease enzymes (E) and folded substrates (F) or unfolded substrates (U) form a FE or UE complex to produce cleaved protein products (P F or P U ).The goal is to get a product formation equation in terms of the total product, initial enzyme and substrate concentrations and kinetic constants.We follow a similar derivation to that above for a single enzyme/substrate: where ( 23), ( 24), ( 25) , and k f and k u are rate 1/ 50, = constant for cleavage of the bound folded substrates and unfolded substrates.Assuming binding and unbinding equations and the folding and unfolding transition rates are in a quasi-equilibrium then eq 23, 24, and 25 hold throughout the time-course.
We write an equation for the overall product formation: Step 1: Write product formation eq. 26 in terms of [FE]   Finally, We can rewrite the product formation eq. 3 in terms of initial substrate concentration, total product, and an observed kinetic rate, which is a function of kinetic rates and initial enzyme concentration,: Step 3, Derivation eq.6 and eq.7 in Fig. 1b By comparing eq.32 with eq.3', we can derive the following equations (including eq.6 in Fig. 1B): Left: Mutational scanning results from cDNA display proteolysis.As in Fig. 2, white represents the folding stability of wild-type and red/blue indicates stabilizing/destabilizing mutations.Black dots indicate the wild-type amino acid, red slashes indicate missing data, and black corner slashes indicate lower confidence ∆G estimates, (95% confidence interval > 0.5 kcal/mol), including ∆G estimates near the edges of the dynamic range.Right: Agreement between variant ∆G values independently determined using assays with trypsin (x-axis) and chymotrypsin (y-axis).Multiple codon variants of the wild-type sequence are shown in red, reliable ∆G values in blue, and less reliable ∆G estimates (same as above) in gray.The black dashed lines represent Y=X.Each plot shows the number of reliable points and the Pearson r-value.(b) Mutational scanning results from robotics-enabled high-throughput purification and chemical denaturation 29 , colored as in (a).(c and d)Difference heat-map (c)  showing the consistency between cDNA display proteolysis (a) and robotics-enabled high-throughput purification and chemical denaturation (b).Dark blue squares indicate highly inconsistent positions where cDNA display proteolysis (a) observes low stability but robotics-assisted chemical denaturation (b) observes high stability.In many cases, the chemical denaturation data indicates that polar substitutions into the hydrophobic core are relatively tolerated (e.g.Y3K, L5R, F30D, F30R, Y45D), whereas cDNA display proteolysis indicates that these substitutions are very destabilizing.

Supplementary Fig. 6
Heat maps for notable domains with functional residues (a-d) Mutational scanning results for four domains.Left: Heat maps show ∆G for substitutions, deletions, and Gly and Ala insertions at each residue.White indicates the wild-type stability and red/blue indicates stabilizing/destabilizing. Black dots indicate the wild-type amino acid, red slashes indicate missing data, and black corner slashes indicate lower confidence ∆G estimates, (95% confidence interval > 0.5 kcal/mol), including ∆G estimates near the edges of the dynamic range.At top, lines show the mean ∆∆G (blue) and the mean normalized GEMME score (red), with functional sites (classified according to Extended Data Fig. 9a) marked with vertical orange lines.Right: Agreement between variant ∆G values independently determined using assays with trypsin (x-axis) and chymotrypsin (y-axis).Multiple codon variants of the wild-type sequence are shown in red, reliable ∆G values in blue, and less reliable ∆G estimates (same as above) in gray.The black dashed line represents Y=X.Each plot shows the number of reliable points and the Pearson r-value for the blue (reliable) points.(e-f) Same as (a-d), but top lines indicate the mean of ∆∆G for hydrophobic amino acid substitutions (blue) and mean normalized GEMME score of hydrophobic amino acids (red).Functional sites are classified according to Extended Data Fig. 9g.Supplementary Fig. 7 Heat maps for three designed domains with notable polar interactions.Mutational scanning results for three domains with notable polar interactions.Left: Heat maps show ∆G for substitutions, deletions, and Gly and Ala insertions at each residue.White indicates the wild-type stability and red/blue indicates stabilizing/destabilizing. Black dots indicate the wild-type amino acid, red slashes indicate missing data, and black corner slashes indicate lower confidence ∆G estimates, (95% confidence interval > 0.5 kcal/mol), including ∆G estimates near the edges of the dynamic range.The polar networks shown in Extended Data Fig. 10b are highlighted in orange, red, and green.Right: Agreement between variant ∆G values independently determined using assays with trypsin (x-axis) and chymotrypsin (y-axis).Multiple codon variants of the wild-type sequence are shown in red, reliable ∆G values in blue, and less reliable ∆G estimates (same as above) in gray.The black dashed line represents Y=X.Each plot shows the number of reliable points and the Pearson r-value for the blue (reliable) points Supplementary Fig. 8 Heat maps for three domains with notable stabilizing mutations Left: Heat maps show ∆G for substitutions, deletions, and Gly and Ala insertions at each residue.White indicates the wild-type stability and red/blue indicates stabilizing/destabilizing. Black dots indicate the wild-type amino acid, red slashes indicate missing data, and black corner slashes indicate lower confidence ∆G estimates, (95% confidence interval > 0.5 kcal/mol), including ∆G estimates near the edges of the dynamic range.The red boxes and arrows highlight sites with notable stabilizing mutations.Middle: Agreement between variant ∆G values independently determined using assays with trypsin (x-axis) and chymotrypsin (y-axis).Multiple codon variants of the wild-type sequence are shown in red, reliable ∆G values in blue, and less reliable ∆G estimates (same as above) in gray.The black dashed line represents Y=X.Each plot shows the number of reliable points and the Pearson r-value for the blue (reliable) points.Right: For four positions with stabilizing mutations, heatmaps show five experimental metrics: the trypsin (T) and chymotrypsin (C) K 50 values, the ∆G values inferred from trypsin and chymotrypsin experiments, and the overall ∆G inferred from both trypsin and chymotrypsin experiments together.
and constants only, by substituting for [] + [] + [] + [] + [get an equation which describes the dependence of [FE] on initial substrates and products, with terms in the denominator that capture sequestration in intermediate bound states.

Table 2 Description of metrics in Supplementary Fig. 2 Metrics Description
used a sum of logistic functions with learned amplitudes to define the weighting function.The overall model is defined below: Because the reaction conditions in the study were not substrate-excessive but enzyme-excessive (i.e.[E] >> [S] or [ES]), [E]≈[E 0 ]: ])