Generating quantitative binding landscapes through fractional binding selections, deep sequencing and data normalization

Quantifying the effects of various mutations on binding free energy is crucial for understanding the evolution of protein-protein interactions and would greatly facilitate protein engineering studies. Yet, measuring changes in binding free energy (ΔΔGbind) remains a tedious task that requires expression of each mutant, its purification, and affinity measurements. We developed a new approach that allows us to quantify ΔΔGbind for thousands of protein mutants in one experiment. Our protocol combines protein randomization, Yeast Surface Display technology, Next Generation Sequencing, and a few experimental ΔΔGbind data points on purified proteins to generate ΔΔGbind values for the remaining numerous mutants of the same protein complex. Using this methodology, we comprehensively map the single-mutant binding landscape of one of the highest-affinity interaction between BPTI and Bovine Trypsin. We show that ΔΔGbind for this interaction could be quantified with high accuracy over the range of 12 kcal/mol displayed by various BPTI single mutants.


Introduction
Protein-protein interactions (PPIs) control virtually all processes in the cell. Mutations at PPI binding interfaces frequently affect free energy of binding (DDGbind), sometimes abrogating and sometimes stabilizing the interaction. This change in binding affinity of one PPI could translate into remodeling of the whole PPI network, frequently leading to dysregulation of signal transduction pathways and disease 1,2 . Therefore, understanding how specific mutations in protein complexes affect their binding affinity is extremely important to both basic biology and to biomedical sciences, where inhibition of a particular PPI might help to treat the related disease.
In the recent years, many groups reported computational methods for predicting DDGbind from structure and/or sequence [3][4][5][6][7][8][9] . While achieving good predictions on average, these methods frequently give large errors in particular cases, revealing that our comprehension of the precise molecular forces that govern binding affinity in PPIs remains incomplete 10 . Our knowledge in this area could be greatly expanded by acquiring large sets of data for DDGbind values in various PPIs, facilitating progress in computational modeling. Yet, experiments that determine DDGbind remain laborious since they involve DNA manipulation, protein expression and purification in different organisms and binding affinity measurements using different techniques. Thus, experimental data describing mutational effects on binding affinity for each particular PPI remain sparse and sometimes inconsistent between different reported experiments 11 . Furthermore, the majority of mutations reported in the literature are mutations to alanine [11][12][13][14] . Such mutations are of limited use for studies of protein evolution and protein engineering since they most frequently lead to complex destabilization and are rarely observed in nature.
A much more attractive and informative approach would be to explore all possible mutational effects for a particular PPI in a single experiment, thus generating a comprehensive binding landscape for this PPI 15,16 . Such binding landscapes could be used to define evolutionary paths accessible to a particular PPI, to characterize energetic contribution of each position, and to locate frequently sought affinity-and specificity-enhancing mutations 4,17 . First efforts in this direction utilized phage display technology that allows to select binders from a large combinatorial library of protein mutants 16,18,19 . Through several rounds of selection, protein mutants compatible with binding to a particular target are selected. Subsequent sequencing of multiple selected clones allows us to calculate the frequency of each amino acid at each position, providing information on binding hot-spots 20 and cold-spots 21 . Further studies in this direction utilized Yeast Surface Display (YSD) 22 for selecting protein binders coupled to Next-Generation sequencing (NGS) to produce binding landscapes for various PPIs 23 . While YSD enables fast sorting using Fluorescently Activated Cell Sorting (FACS), NGS permits more accurate calculation of amino acid frequencies for each of the detected mutants. The ratio between the amino acid frequency in the selected pool of binders and the same frequency in the initial naïve library, referred to as the enrichment value, is calculated for each amino acid at each of the explored position. The enrichment values are then plotted to produce PPI binding landscapes. In such an approach, Whitehead et al. mapped the full single-mutant binding landscape of a de novo designed inhibitor interacting with hemagglutinin and used the enrichment information to design affinity-matured inhibitors 24 .
In spite of great promise of this approach, further studies on different biological systems revealed its potential limitations. While affinity enhancing mutations could be readily identified by this methodology, relatively low correlation (R value of 0.5) between the NGS-derived enrichment values and experimental DDGbind values for purified proteins was observed 17 .
Additional studies showed that DDGbind could be inferred from the NGS-based enrichment values only in the narrow range of energies from -0.8 to +0.5 kcal/mol 25,26 , preventing construction of quantitative binding landscapes for all of the explored mutations with broader range of target affinities. In addition, the above methodology set a requirement on the concentration of the target protein in the selection experiment; the concentration should be similar to the interaction Kd, thus limiting the application of the approach to only subset of all PPIs with medium affinities. We introduce a novel approach that allows us to overcome the abovementioned limitations and to generate quantitative binding landscapes for any PPI, regardless of their Kd value. Here, we demonstrate the applicability of our approach to a particularly difficult target, a complex between Bovine Trypsin (BT) and its inhibitor BPTI that possesses ultra-high affinity of 10 -14 M. We show that through our high-throughput NGS-based approach, we can obtain ΔΔGbind values for all BPTI binding interface mutants that correlate extremely well with experimental results on purified proteins over the range of more than 12 kcal/mol free energy changes. Our method allowed us to comprehensively map the binding landscape for this ultra-high affinity interaction, which would be impossible using any alternative technique.

Results and Discussion
To demonstrate how our approach could be used to produce quantitative binding landscapes, we first prepared the BPTI/BT complex for YSD experiments. For this purpose, the BPTIWT gene was incorporated into the pCTCON vector, that facilitates BPTI expression on the surface of yeast cells with a C-terminal myc-tag (cMyc) for monitoring protein expression ( Figure   1A). Binding of BT to BPTI mutants was accessed by monitoring fluorescence of the FITC fluorophore conjugated to a biotinylated BT via NeutrAvidin. The assessment of binding of BPTIWT to BT by FACS showed a diagonal narrow distribution, demonstrating that BPTI is well expressed on the surface of yeast cells, is properly folded, and binds to BT ( Figure S1).
Next, a combinatorial library was generated containing all single BPTI mutants at positions that are in the direct binding interface with BT. Thus, twelve BPTI positions were randomized to all twenty amino acids with an NNS codon, leaving intact two cysteines (C14 and C38) that form a disulfide bond and thus are crucial for BPTI folding (Figure 1B). The library of 228 (19 x 12) BPTI single mutants was constructed using the TPCR protocol 27 . The BPTI mutant library was expressed on the surface of yeast cells and incubated with a fluorescently labeled BT at concentration of 5 nM. This concentration of BT, although five orders of magnitude higher than the Kd of BT/BPTI interaction, was chosen since it was the minimum concentration of BT that resulted in a considerable spread of FACS binding signals from different BPTI mutants to BT (Figure 1C).

Improving accuracy and extending prediction range by collecting more data
One of the limitations of previous approaches for binding landscape generation was that DDGbind showed linear dependence on the NGS-enrichment value only in the narrow range of DDGbind values close to zero 26 . The methodology could not previously discriminate between different highly destabilizing mutations since such mutations were characterized with the same enrichment values. The same was true for mutations that showed high stabilization of the complex. To overcome this limitation and to increase the range of sensitivity for DDGbind predictions, we introduced our first innovation and used multiple affinity gates from which the mutants were collected during the YSD selection experiment. The multiple gates would allow us to collect information for each mutant several times, and each particular mutant would be enriched in at least one affinity gate. In this particular work, we used four affinity gates for mutant collection: higher than WT affinity (HI), WT-like affinity (WT), slightly lower than WT affinity (SL), and strongly lower than WT affinity (LO) (Figure 1C). The WT affinity gate was set according to the FACS signal produced by BPTIWT binding to BT at same conditions ( Figure 1S). The cells from each gate were then grown, analyzed for binding to BT ( Figure S4) and sequenced with NGS, resulting in 300-900K reads per each population. In addition, the naïve pre-sorted library of BPTI mutants was sequenced.
We further assessed the quality of the NGS data using synonymous mutations as a test.
Since some errors in the data could come from errors in the NGS process, especially for sequences detected with low frequency, we tested different cut-off values below which the data on the BPTI mutant would be discarded. Using different cut-off values, we calculated deviations in enrichment values for synonymous mutations expressing the same BPTI variant. Our data shows that at the cutoff value of 100 sequences per BPTI mutant, deviations in enrichment values were negligible (< 0.001) (Figure S2). Using this threshold, we were able to detect all 228 BPTI single mutants present in the naïve library. No threshold was applied to the sorted populations, since in such population the low number of sequences was caused by the depletion of that mutant from the population.
We thus had in our hands four enrichment values from four affinity gates for each of the 228 BPTI mutants (Figure 2). Closer examination of the data showed that enrichment values in HI and LO affinity gates exhibited pseudo-symmetry, with highly enriched mutations in the HI gate being highly depleted in the LO gate and vice versa. The enrichment value maps could be used to define binding hot-spots for the BPTI/BT interactions (such as position 15, 16 indicated as red starts on top of Figure 2) and more tolerant to mutations positions (such as 11 and 34 indicated as blue stars on Figure 2). However, these maps were not sufficient in determining exact DDGbind values for each of the mutation. In fact, we noticed that some mutations that showed enrichment value of ~1 in the HI gate, that should correspond to the WT-like affinity, were determined to be destabilizing when measured with purified proteins (for example, G12A with experimental DDGbind of +4.35 kcal/mol [28][29][30][31] ). This over-prediction of neutral and affinity-enhancing mutations by our NGS results was due to the fact that in the YSD selection experiment we used much higher concentration of BT compared to the Kd of BPTI/BT interaction, shifting the equilibrium towards protein binding even for those BPTI mutants that possessed weaker affinities compared to the WT protein.
To overcome this problem, we introduced the second innovation, described below.
Normalizing NGS data to get quantitative DDG bind measurements Our second innovation was to obtain a normalization formula for converting the enrichment data from four affinity gates into one DDGbind value. For this purpose, we collected all available DDGbind experimental data for binding of BPTI single mutants to BT, comprising 29 data points [28][29][30][31] .
Plotting the experimental DDGbind vs. enrichment values for each of the four affinity gates showed that DDGbind was linearly dependent on the natural log of enrichment values in HI and LO gates (R-value of 0.87 for each of the gates, Figure S3). The NGS values from HI and LO gates were denoted further as functions X1 and X4, respectively. DDGbind showed a more complicated twovalued function behavior for WT and SL gates. This was expected since for these gates, the highest enrichment values were observed in the narrow range of DDGbind values but decreased for mutations that showed both higher and lower affinities compared to that narrow range of values.
To eliminate this complicated multi-variable behavior and at the same time to utilize the additional information from WT and SL gates, we constructed two additional functions that Using 29 data points and leave-one-out cross-validation approach, where each data point was predicted without the enrichment information for that particular data point, we were able to predict experimental DDGbind values with very high accuracy over the range of more than 12 kcal/mol (Figure 3; R = 0.90, s =1.5 kcal/mol). We used the obtained normalization formula to convert the enrichment values to DDGbind values for all the single BPTI mutants in the library.
Since the enrichment values for these mutants are within the range explored in our small experimental data set, the calculated DDGbind values are expected to have the same accuracy as reported above.  Figure 1B). However, at position 15, one mutation, K15R, was determined to substantially stabilize the complex, in agreement with experimental results on purified proteins 30 . Figure 4 also shows that the same type of mutations (e. g. hydrophobic or polar) frequently produce similar changes in DDGbind for the same position.
We thus established that the BPTI/BT complex with the Kd of 10 -14 M is highly optimized by nature, with most single mutations in BPTI leading to high destabilization of the PPI, and very few neutral and affinity-enhancing mutations.
In summary, we report a novel approach that allows us for the first time to produce quantitative binding landscapes based on binding selections into several affinity gates, NGS of the selected mutants, and normalization procedure using a small data set of experimentally determined DDGbind values. Very recently, a similar direction has been taken by Keating and colleagues to design mutations and improve affinity in peptide/protein interactions 32 . Unlike our study, the authors used IC50 value of yeast titrations to normalize NGS data and to predict DDGbind for various peptide mutants with Kds in the medium affinity range. A similar thinking by two independent studies attest to attractiveness of our methodology for binding landscape mapping.
Yet our work presents a simpler normalization procedure based on actual in vitro affinity measurements and allows to explore PPIs at the limits of the Kd spectra. We demonstrate superior correlations between DDGbind predictions and actual in vitro measurements over a much larger range compared to all previously reported approaches.
To achieve good prediction accuracy, the experimental data points used for normalization should show large spread in DDGbind values, including both affinity-enhancing and affinityreducing mutations. More affinity gates could be used in future experiments, although in this case normalization would require a larger set of experimental data; at least five data points per each parameter should be used in the normalization function to avoid overfitting. Our methodology could be applied to study the evolutionary paths of any PPI regardless of its Kd value and to compare binding landscapes of various PPIs. The approach could be easily extended to studies of double and higher-order mutational steps, providing more comprehensive information on PPI evolution and facilitating new modeling and protein engineering studies.