Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software

Development of software and methods for designing complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that neutral mutations accumulate during the evolution of S-HNLs from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to the design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.


The detailed algorithm of INTMSAlign
A schematic view of INTMSAlign is shown in Supplementary Fig. 1. INTMSAlign uses two files: one contains the sequence of the target protein (STP), and the other (the library in Supplementary Fig. 1) contains sequences belonging to the same protein family as the STP. The library consists of hundreds to thousands of primary sequences of proteins from that family. The library can be prepared in several ways: (i) by selecting sequences from the Blastp web server 1 after submitting the STP, or (ii) by searching the PubMed web server with keywords such as an EC number or a protein name. After the STP and the library are prepared, two parameters must be defined: N_trial and N_pick. INTMSAlign takes the STP together with N_pick sequences picked at random from the library and generates an ofile (Supplementary Fig. 1). This process is repeated until the number of ofiles reaches N_trial (Supplementary Fig. 1). MSA is then performed on every ofile with the program "CLUSTALW".
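The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration and not the INTMSAlign source: the function names `make_ofiles` and `to_fasta`, the (name, sequence) record layout, and the choice to sample without replacement within one ofile are all assumptions; the text only states that N_pick sequences are picked at random.

```python
import random

def make_ofiles(stp, library, n_trial, n_pick, seed=None):
    """Build n_trial 'ofiles', each holding the STP plus n_pick library records.

    stp     -- (name, sequence) tuple for the target protein
    library -- list of (name, sequence) tuples from the same family
    Assumption: sequences are drawn without replacement within one ofile.
    """
    rng = random.Random(seed)
    ofiles = []
    for _ in range(n_trial):
        picked = rng.sample(library, n_pick)
        ofiles.append([stp] + picked)  # the STP is always the first record
    return ofiles

def to_fasta(records):
    """Serialize one ofile as FASTA text, ready to be passed to an MSA program."""
    return "\n".join(">%s\n%s" % (name, seq) for name, seq in records) + "\n"
```

Each returned list would be written to disk and aligned separately, so every alignment contains the STP and can later be indexed by STP residue number.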
After all MSA processes are completed, INTMSAlign has generated a total of N_trial aln files (Supplementary Fig. 1). INTMSAlign integrates all generated aln files by counting, for each residue of the STP, the number of times each of the 20 amino acid residues and the gap appears. The counting procedure is detailed in Supplementary Fig. 2. The sum of the counts of the 20 amino acid residues and the gap was defined for the i-th residue of the STP; the maximum value of i is the sequence length of the STP. The index j corresponds to the 20 amino acid residues (j = 1 to 20) and the gap (j = 21). The j values are assigned in ascending order with the one-letter codes of the amino acid residues arranged alphabetically. For example, j = 1, 2, 3 and 20 represent Ala (A), Cys (C), Asp (D) and Tyr (Y), respectively.
Because the STP is included in every ofile (Supplementary Fig. 1), R_ij is calculated for each of the 20 amino acid residues and the gap; R_ij represents the appearance rate of amino acid residue (or gap) j at the i-th residue of the STP. R_ij is a two-dimensional array with (sequence length of the STP) × 21 elements. The rates are saved as text-format data in what we call the result file (Supplementary Fig. 1).
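The integration of the aln files into appearance rates can be sketched as follows. This is an illustrative reading of the description, not the published counting procedure (which is given in Supplementary Fig. 2). Several points are assumptions: each alignment is a list of equal-length rows with the STP in row 0, columns where the STP has a gap are skipped, only the picked library rows are counted, and characters outside the 21 symbols are ignored. The name `appearance_rates` is hypothetical.

```python
# j = 1..20: residues in alphabetical one-letter order; j = 21: gap.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def appearance_rates(alignments):
    """Return R[i][j] in percent for every STP residue i and symbol j.

    alignments -- list of alignments; each is a list of aligned rows,
                  with the STP as row 0 (assumed layout).
    """
    stp_len = sum(ch != "-" for ch in alignments[0][0])
    counts = [[0] * 21 for _ in range(stp_len)]
    for rows in alignments:
        stp_row, others = rows[0], rows[1:]
        i = 0  # 0-based index of the current STP residue
        for col, ch in enumerate(stp_row):
            if ch == "-":
                continue  # insertion relative to the STP: not counted
            for row in others:
                j = ALPHABET.find(row[col])
                if j >= 0:  # skip symbols such as X or B
                    counts[i][j] += 1
            i += 1
    rates = []
    for row in counts:
        total = sum(row)  # sum over the 20 residues and the gap
        rates.append([100.0 * c / total if total else 0.0 for c in row])
    return rates
```

With this layout, `rates[79][ALPHABET.index("S")]` would be the appearance rate of Ser at the 80th residue of the STP.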

Residue fixation: comparison of two result files to assign correlation residues
INTMSAlign has a function that calculates the appearance rate using only those primary sequences in the library that carry a user-defined amino acid residue (residue X) at a user-defined residue number (the l-th residue); the condition is written l:X. In this study, we call this function "residue fixation" (Supplementary Fig. 3). It can be used to curate the library.
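A residue-fixation filter of the form l:X might look like the sketch below. It assumes the filter is applied to each alignment separately, that the STP is row 0, and that position l is counted along the ungapped STP sequence; the function name `fix_residue` and the toy sequences in the usage note are made up for illustration.

```python
def fix_residue(rows, position, residue):
    """Keep the STP row plus the rows that carry `residue` at STP `position`.

    rows     -- aligned rows of one alignment, STP first (assumed layout)
    position -- 1-based residue number l in the ungapped STP sequence
    residue  -- one-letter code X of the required amino acid
    """
    stp_row = rows[0]
    i = 0
    for col, ch in enumerate(stp_row):
        if ch == "-":
            continue
        i += 1
        if i == position:
            kept = [r for r in rows[1:] if r[col] == residue]
            return [stp_row] + kept
    raise ValueError("position exceeds the length of the STP")
```

For example, `fix_residue(["MKA", "MSA", "MAA", "MTA", "MTG", "-TA"], 2, "T")` keeps the STP and the last three rows, and the appearance rates would then be computed from those rows only.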
To describe residue fixation, this section assumes that two result files, Result-file A and Result-file B, have been prepared from the identical STP and library. Residues that are highly conserved, but as different amino acids, in the two result files can be regarded as correlation residues, because the appearance rate at such residues is perturbed when the combination of primary sequences used for MSA is changed. This corresponds to the definition of correlation residues 2. To extract correlation residues from the result files, a D-score was defined for each residue. The maximum D-score is 100, which is obtained when the i-th residue is perfectly conserved in both result files but as different amino acids. Residues with a high D-score are correlation residues: they mutate in a correlated manner and are highly conserved as different amino acid residues in Result-file A and Result-file B.
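The exact D-score formula is not reproduced in this text, so the sketch below uses a stand-in that is purely an assumption: the sum over j of the squared difference between the two appearance rates, divided by 200. It was chosen only because it is 0 for identical profiles and reaches 100 when a residue is perfectly conserved as different amino acids in the two result files, which matches the stated maximum. Values it yields need not agree with those reported in the supplementary tables. The names `d_scores` and `correlation_residues` are hypothetical.

```python
def d_scores(rates_a, rates_b):
    """Per-residue difference score between two result files (assumed form).

    rates_a, rates_b -- appearance rates in percent, one list of 21 values
                        per STP residue, from Result-file A and B.
    Assumption: D_i = sum_j (A_ij - B_ij)^2 / 200, which has a maximum of 100.
    """
    scores = []
    for row_a, row_b in zip(rates_a, rates_b):
        squared = sum((a - b) ** 2 for a, b in zip(row_a, row_b))
        scores.append(squared / 200.0)
    return scores

def correlation_residues(scores, threshold):
    """Return 1-based residue numbers whose score exceeds the threshold."""
    return [i + 1 for i, d in enumerate(scores) if d > threshold]
```

Residues would then be ranked by score, and those above a user-chosen cutoff passed on as candidate correlation residues.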

Assignment of consensus residues of MeHNL: Ser80, His236 and Lys237
INTMSAlign can assign consensus residues as accurately as other MSA programs. To show this, the appearance rates at two residues, Ser80 and His236, were calculated. Bar graphs of the appearance rates for these residues are shown in Supplementary Fig. 3. Ser80 and His236 were highly conserved: the appearance rates were 84.8% (Ser, Supplementary Fig. 3A) and 89.2% (His, Supplementary Fig. 3B) at the 80th and 236th residues, respectively. Ser80 and His236 are conserved in both S-HNL and the esterases because they are catalytic residues in both enzymes 3.
Next, the appearance rate at Lys237 was calculated. The 237th residue is known as the marker residue: it is Lys in S-HNL and Met in the esterases 3. In the INTMSAlign analysis, the appearance rate of Lys was low, at 1.8% (Supplementary Fig. 3C). In contrast, the appearance rates of Met and Ser were much higher, at 33.0% (Met) and 29.7% (Ser) (Supplementary Fig. 3C). The high appearance rate of Met at the 237th residue suggests that the library is biased toward the esterase family.

Assignment of correlation residues underlying the difference in substrate specificity between S-HNLs and esterases
In this section, we show that INTMSAlign can pick up residues that are independently conserved in S-HNL and in the esterases, that is, correlation residues. The 237th residue of MeHNL is the marker residue distinguishing S-HNL from the esterases. On this basis, two result files were generated using residue fixation: one by fixing the 237th residue as Lys (237:K) and the other by fixing it as Met (237:M). The former and the latter are intended to represent the amino acid appearance rates of S-HNLs only and of esterases only, respectively. Indeed, under the 237:K condition (S-HNL), the 11th, 79th and 239th residues were highly conserved as Thr (84.9%), Glu (84.1%) and Gln (88.7%), respectively, and these residues are also found in MeHNL 4. Under the 237:M condition (esterase), the 11th, 79th and 239th residues were highly conserved as Gly (86.8%), His (90.4%) and Met (92.4%), respectively, and these are also found in SABP2 5.
The D-score was calculated for each residue of MeHNL to extract the correlation residues. The six residues with a D-score above 60.0 are listed in Supplementary Table 2.
Among these residues, the four located in the active site (237th, 239th, 79th and 11th) are plotted in Supplementary Fig. 5. Residue fixation in INTMSAlign works correctly, because the appearance rate at the 237th residue was 100% in both result files, generated under the 237:K (black bar, Supplementary Fig. 5A) and 237:M (red bar, Supplementary Fig. 5A) conditions, respectively.
The 11th residue of MeHNL had the 5th highest D-score (64.0, Supplementary Table 2) among all of the residues. The 11th residue was Thr in S-HNL (black bar, Supplementary Fig. 5D) and Gly in esterase (red bar, Supplementary Fig. 5D). In SABP2, the 11th residue corresponds to Gly12, which is one of two residues that are important for S-HNL activity 3. Residue 79 had the 4th highest D-score (66.7, Supplementary Table 2); it is conserved as Glu in S-HNL and as His in esterase (Supplementary Fig. 5C). Glu79 is one of the active-site residues of S-HNL; the activity of MeHNL (E79A) falls to 1% of that of wild-type MeHNL 6.
To show where these residues are located, four residues (Thr11, Glu79, Lys237 and Gln239) are shown in Supplementary Fig. 6A. The residues are located near the active site of MeHNL and form a hydrogen-bond network with each other (Supplementary Fig. 6A).
Residues 198 and 219 in Supplementary Table 2 are located far from the active site (Supplementary Fig. 6B). Residues remote from the active site often regulate protein dynamics to achieve efficient enzyme catalysis 7; these two residues may therefore control the dynamics of S-HNL and esterase in expressing their functions.

Supplementary figure legend (residue fixation example): The alignment is taken to be one of the aln files (Fig. 1), with LIB1, 2, 4, 9 and 13 regarded as the sequences picked from the library. By defining "2:T", INTMSAlign selects the three sequences (LIB4, 9 and 13; colored red in the figure) that have Thr at the 2nd position of the STP sequence, and calculates the amino acid appearance rate.