Bayesian Analysis of MicroScale Thermophoresis Data to Quantify Affinity of Protein:Protein Interactions with Human Survivin

A biomolecular ensemble exhibits different responses to a temperature gradient depending on its diffusion properties. MicroScale Thermophoresis technique exploits this effect and is becoming a popular technique for analyzing interactions of biomolecules in solution. When comparing affinities of related compounds, the reliability of the determined thermodynamic parameters often comes into question. The thermophoresis binding curves can be assessed by Bayesian inference, which provides a probability distribution for the dissociation constant of the interacting partners. By applying Bayesian machine learning principles, binding curves can be autonomously analyzed without manual intervention and without introducing subjective bias by outlier rejection. We demonstrate the Bayesian inference protocol on the known survivin:borealin interaction and on the putative protein-protein interactions between human survivin and two members of the human Shugoshin-like family (hSgol1 and hSgol2). These interactions were identified in a protein microarray binding assay against survivin and confirmed by MicroScale Thermophoresis.

or biochemical process and sample handing error can result in an anomalous observations. The increased data acquisition rate also has to be complemented with robust, automated procedures to analyze the binding curves and to quantify the uncertainty of determined dissociation constants 5 . To fully realize the power of automation, the anomalous observations need to be treated without human influence. Standard non-linear least square (NLLSQ) regression methods require the identification and exclusion of these anomalous observations from the analysis. While experienced human observers often can identify and remove these observations correctly, this is a time consuming process and it is prone to human error. Here we employ machine learning techniques to automate the analysis and substitute subjective outlier rejection procedures. Our goal is to increase the objectivity of the analysis by taking into account as much of the available evidence as possible and to obtain useful information from an imperfect data.
Clinically, human survivin is a tumorigenesis marker and there is substantial evidence concerning its overexpression in different cancer types 12 . Recent reports shine new light on its function in autoimmune conditions such as rheumatoid arthritis 13,14 , psoriasis 15 and multiple sclerosis 16 . Clinical relevance of Sgos is less explored and could be indirectly connected to the novel therapeutic interest in protein phosphatase 2 A (PP2A) agonists 17 .
Survivin is a small 16 kDa protein that belongs to the Inhibitor Apoptosis Protein (IAP) family. It consists of an N-terminal BIR (Baculovirus Inhibitor of apoptosis protein Repeat) domain and a long α-helical C-terminus 18 . In cell division, survivin is localized at the centrosomes and microtubules, controlling the assembly of CPC at the inner centromeres by binding to phosphorylated histone 3 11 . Survivin attaches to the other proteins of CPC by forming a ternary complex between survivin/borealin/INCENP 6,9 . Borealin plays an important role targeting other members of the CPC, correcting the kinetochore attachment errors and stabilizing the bipolar spindle in human cells 19 . X-ray crystallography studies have reported interaction between the N-terminal of borealin  and other CPC members, such as survivin 7 .
The assembly of CPC to the centromere activates Aurora B kinase that phosphorylates kinetochore substrates and coordinates the dis-attachment between kinetochores and microtubules, which pulls sister chromatids apart.
Sgos (Sgo1 and Sgo2) counteract these pulling forces. In human cells, Sgos protect the cohesin protein complex, which keeps sister chromatids together from cleavage; preventing premature loss of centromeric cohesion, missegregation and mitotic arrest. This occurs through its direct binding with the phosphatase PP2A 20 . Recently, functions of Sgos in spindle associated complex and kinetochore protection have been described. These functions are regulated by CPC and also are required for the centromeric localization of Aurora B. In fission yeast, Sgo2 promotes CPC recruitment to centromeres by Cdc2-dependent phosphorylation of the survivin subunit. Reciprocally, survivin has been shown essential for the enrichment of Sgo2 at centromeres. In humans, the N-terminus of hSgol1, but not hSgol2, mimics the phosphorylated N-terminus of histone 3, which is recognized by the BIR-domain of survivin. Thus, a competition between hSgol1 and phospho-H3 for survivin might regulate the recruitment of CPC to the centromere. The question whether or not there is a physical contact between Sgo1 and survivin is controversial. Direct contact between human survivin and hSgol1 was not detected by yeast two-hybrid system 21 . In contrast, survivin and Sgo1 appears to interact in fission yeast 22 and supporting evidence is also available from biophysical and structural studies 11 . The nature of Sgo2 and survivin interaction is also unclear. Importantly, when Sgo2 is missing it has a stronger effect on survivin localization 10,23,24 .
In this work, we explore the effect of the experimental errors and the uncertainty of determined dissociation constant in MicroScale Thermophoresis experiments. In addition to simulations, we test the method on experimental data, which were obtained from the interaction between survivin and short linear motifs of borealin, hSgol1 and hSgol2. Finally, we point out similarities in the localization and sequence of the putative binding motif in hSgol1 and hSgol2, which suggest a common evolutionary origin.

Results and Discussion
Probabilistic inference from simulated data. When evaluating a new statistical method, it is better to use a different error model for generating simulated data than what is used for inference. That way one can test how well the methods perform when confronted with unexpected errors. Here, we introduce serial dilution concentration errors and add outliers even though the competing inference strategies (non-linear least-squares regression and probabilistic Bayesian inference) are not prepared for them.
To cover a large concentration range, serial dilution schemes are most commonly applied. The highest concentration of an interaction partner is stepwise diluted to the lowest concentration in the series. Each dilution step presumably depends on the ratio of two normally distributed random variables with non-zero mean. Therefore, the concentration errors are not expected to be normally distributed and the resulting concentration distribution is not even symmetric. Through the series of multiplication, the variance and the skewedness of the concentration distributions are gradually compounded (Equations 1 and 2).
Thus, the process generating the concentrations becomes a biased random walk. The dependence on the previous concentration means that the concentration series has a tendency to jump and it is expected to stabilize on consistently lower or higher concentrations than the targeted ones. In addition, the asymmetry of the distribution results in an increased fraction of generated ligand concentrations that are lower than the targeted ones, at the expense of higher-than-targeted concentrations leading to an overall bias. As long as the pipetting coefficient of variation (CoV) is less than 1% (i.e. the pipetting performance adheres to the ISO8655 standard), the deviation from the targeted concentration is negligible on the logarithmic scale. It is worth noting though that current MST instrumentation and practice allows up to 16 dilutions in a single measurement and that the bias starts to be noticeable if the CoV exceeds 5% even on the logarithmic scale.
Using the procedures described in the Material and Methods section we generated ten synthetic test cases with different known model parameters (Table 1; Fig. 1). The robust probabilistic model and a standard non-linear least squares regression (NLLSQ) method were tested against these cases. The results of the fits are summarized in Table 2. Figure 1 shows the simulated data of test case #1 and the inferred binding curves using the two methods. The probabilistic model is illustrated with a swarm of binding curves: each corresponding to a sample from the joint a posteriori probability distribution. Visually, this swarm captures the central tendency of the data better than the single point estimate binding curve determined by the NLLSQ method.
In all tested cases, the NLLSQ regression was less accurate than the probabilistic robust model ( Table 2). The credible intervals (95% probability) of the probabilistic robust models always include the true K D value. The same is true for the confidence interval (at 0.95 confidence level), but the confidence interval is always broader than the credible interval. In the test cases #2 and #5 the NLLSQ point estimate of K D is very inaccurate and it misses the target by one order of magnitude. By comparing test case #1 and #6 one can conclude that concentration errors at realistic levels do not substantially increase to the uncertainty of K D . Test cases #1, #7 and #8 only differ in the number of repeated simulated experiments. Judging from the credible and confidence intervals, both type of fits get more accurate as the repeats increased, but the robust Bayesian approach requires less repeats to achieve the same level of precision. The credible interval narrows dramatically going from 1 to 10 repeats and the typical 3 repeats provide a good compromise between experimental requirements and accuracy of results. Similarly, more precise measurements would not benefit the NLLSQ method as the comparison of test cases #1, #9 and #10 shows. Quite the contrary, the NLLSQ method seems to perform best in test case #10 where the measurement error is the highest. In this case the contrast between the outliers and random variation around the true values is less apparent leading to less bias by the anomalous observations.  Table 1. Synthetic data generating parameters for the ten test cases. K D,true is the known dissociation constant of the interaction, B true and U true represents the ideal fluorescent measurement of any type associated with the labelled protein with the ligand fully bound and ligand absent form, respectively. ε is the normally distributed random variable added to the ideal fluorescent measurement (mean parameter 0, standard deviation in the table). COV is the coefficient of variation of the ideal pipettes used for generating the ligand concentration series.  (Table 1). Symbols with different colors belong to separate simulated dilution series. The thick line represents the NLLSQ fit to all data points whereas thin transparent lines show binding curves calculated using ten randomly selected posterior joint parameter sets.

Identification of survivin binding peptides by printed microarrays.
We tested the Bayesian inference method on potential binding interactions of survivin identified by PEPperCHIP Peptide Microarray (PEPperPRINT, GmbH, Heidelberg, Germany). These microarrays consisted of overlapping peptides originating from candidate proteins from the human proteome including hSgol1, hSgol2 and the N-terminal region of borealin (1-93). The peptides were generated by peptide laser printing 25 (Fig. 2). Among the overlapping peptides, the main responses with the clearest spot morphologies were attributed to the basic hSgol1 peptides KREEKRKANRRKSKR and RKANRRKSKRMSKYK (and to some extent the adjacent DTQERKREEKRKANR peptide) and the hSgol2 peptides ECQVKKVNKMTSKSK and KVNKMTSKSKKRKTS. In addition, a peptide derived from borealin (GSSRVAKTNSLRRRK) also shows clear signal in the microarray at 1 μg/ml soaking concentration of survivin. The spots are consistently visible at 1, 10 and 100 μg/ml survivin concentrations in the assay buffer (Fig. 2). Quantitative analysis of the fluorescent intensities of the specific and control antibodies for borealin, hSgo1l and hSgol2 are shown in Supplementary Fig. S1, S2 and S3, respectively. The consensus motif for hSgol1 and hSgol2 peptides is RKANRRKSKR and KVNKMTSKSK, respectively. Other basic peptides like SPTDPKACNWKKYKF, KTFRKVKDSSSEKKR, EVSKIVTVSTGIKKK or HQSSFLLSASKKKRI did not react with survivin in a similar manner. The identified part of hSgol1 (291-312) is overlapping with a KEN region 26 which is targeted by the ubiquitin ligase activity of anaphase-promoting complex (APC) also known as cyclosome 27 . A similar pattern can be observed in the much larger hSgol2 isoform where the combined, identified region (1066-1085) is followed by two downstream KEN motifs (1115-1117 and 1173-1175).
We aimed to quantify the interaction using the MicroScale Thermophoresis technique. Synthetic peptides that combine the two identified peptides (of hSgol1 and hSgol2, respectively) to a single peptide were designed. In the case of hSgol1 the region was also extended to incorporate the entire KEN motif yielding the hSgol1 peptide sequence KREEKRKANRRKSKRMSKYKEN (291-312) 27 . For testing the hSgol2 interaction the hSgol2 ECQVKKVNKMTSKSKKRKTS sequence (1066-1085) was used. Longer N-terminal segments of borealin are not suitable as positive control in MST measurements, because they are not soluble in typical aqueous buffers. The choice of soluble borealin peptide, GSSRVAKTNSLRRRK (6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20), was motivated primarily by the microarray experiments and the prior knowledge about the structural details of borealin:survivin interaction 7 .

MicroScale Thermophoresis experiments and robust inference of K D values. Primary MicroScale
Thermophoresis data of borealin 6-20 , hSgol1 291-312 and hSgol2 1066-1085 is shown in Fig. 3. The replicates of the thermophoretic progress curves are displayed in the Supplementary material, Supplementary Fig. S4. The median of K D posterior distribution for borealin [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] , hSgol1 291-312 and hSgol2 1066-1085 are 10.8 μM, 80.9 μM and 1.6 μM, respectively (Tables 3 and 4). The two NLLSQ fitting approach was performed with identical initial parameters (the initial parameters are listed displayed in the Supplementary material, Supplementary Table S1). For comparison the reported K D of the N-terminal hSgol1 tetrapeptide is 6.2μM 11 . The MST binding curves between    Supplementary Fig.S4) and 122 nM for hSgol2 291-312 in replicate 1 (Fig. 3), do not fit into the pattern of a sigmoid binding curve. Longer incubation times increase the number of outliers even further (see Supplementary material, Supplementary Fig. S5 and S6 and Table S2). Commonly used NLLSQ method is particularly sensitive to outlier observations 28,29 . Using the robust Bayesian approach (Fig. 3) the extreme observations are a natural part of the distribution tails and have only a limited effect on the central tendency of the fit. When comparing the NLLSQ fits as implemented in the software PALMIST 30 we see that the NLLSQ point estimates sometimes are not part of the Bayesian highest density intervals and the confidence intervals are broader than the Bayesian credible intervals (even if they answer very different questions) (Tables 3 and 4). Thus, we can establish that experimental data frequently include data points which, if not rejected, can cause substantial inaccuracy in K D estimates when using NLLSQ minimization. To analyze which peptide (hSgol1 291-312 or hSgol2 1066-1085 ) has higher affinity to survivin, a binding affinity comparison is included in Supplementary material (Supplementary Fig. S7).

Putative binding region in Shugoshins.
There is a subtle similarity between the two peptides (hSgol1 291-312 and hSgol2 1066-1085 ) in the consensus region (Fig. 4

), which can be described by a hypothetic motif [EQ]-x-[KR]-K-[AV]-N-[KR]-
x-x-S-K. This strict pattern can only be found in hSgol1 and hSgol2 in the human proteome (Uniprot). If we summarize the previously described binding N-terminal sequences 11

(histone and hSgol1) with a strictly defined consensus motif (<x-A-[RK]-[TE]-[RK]) we still obtain 22 matches in the human proteome, but for example hSgol2 is absent due to an incompatible N-terminus with sequence MECPV. hSgol2 N-terminus is not detected even with a more permissive consensus motif (<x-[AV]-[RKH]-[TSE]-[RKE]
) and the notion of specific N-terminal recognition is potentially problematic given the overwhelming evidence for physical survivin-hSgol2 interaction in fission yeast 10,24 . The hSgol1 291-312 and hSgol2 1066-1085 peptides are both highly charged, but their structure does not appear to be identical with hSgol1 291-312 showing signs of increased local order (see Supplementary Result and Discussion and Fig. S8). The sequence conservation in the Sgol1 291-312 and Sgol2 1066-1085 region is strong within mammalian Sgol1s and Sgol2s, respectively (Fig. 4). On the other hand, it is uncertain whether the subtle similarity between mammalian Sgol1 291-312 and Sgol2 1066-1085 regions results from true homology or merely converging parallel evolution. It is worth noting that there are approximately the same number of residues between the evolutionally conserved C-terminal domain and Sgol1 291-312 and Sgol2 1066-1085 regions, respectively (Fig. 4). This must be interpreted in the light that the length of Sgol2 in different species varies greatly, even when compared to our close mammalian relative, gorillas. Another way to judge evolutionary conservation is to compare with other recognizable features for example the KEN box 27 , which appears to be only present in humans (and only in hSgol1 not in hSgol2) in this (limited) comparison. Nevertheless, the binding locations relative to the C-terminal domain appears to be remarkably constant and may play a role in specific positioning of survivin. This suggests a functional link between survivin and the hSgol C-terminal domain.
Thus, the new motif in Shugoshin-like proteins appears to show signs of evolutionary conservation/convergence and may be recognized by survivin more specifically. Since this motif is not located at the N-terminus of Shugoshin-like proteins it does not immediately exclude the possibility of simultaneous binding to phosphorylated histones.

Conclusion
We described a robust Bayesian inference protocol that minimizes bias and allows unsupervised high-throughput MST analysis. This method is sufficiently fast and highly tolerant to frequently observed outlier data points. The resulting a posteriori probability distributions immediately provide the uncertainty of the estimates, which can be directly used for Bayesian A/B testing.
We also reported a putative interaction footprint of survivin on hSgol1 and hSgol2. This evidence adds a piece to the intriguing puzzle about how the chromosome passenger complex is formed and acts in concert with the sister chromatid cohesion apparatus.

Materials and Methods
Protein production and purification. Survivin was expressed in E.coli BL21(DE3) star (Merck) cells using the pHIS8 expression vector used by Verdecia et al. 18 The expression was induced at OD600 0.6-0.8 in Luria Bertani (LB) media by 0.5 mM IPTG and incubated during 3-4 h at 30 °C. The cells were harvested at 6000 g during 20 min and resuspended in lysis buffer containing 50 mM Tris pH 8, 500 mM NaCl, 10 mM imidazole, 0.25 mM Pefabloc ® SC (Sigma), 20 mg/ml DNAse (Sigma) and 0.2 mg/ml Lysozyme (Sigma). The cells were lysed with a high pressure homogenizer, using 3 cycles of 15,000-20,000 bar. Survivin was purified by Ni 2+ chelation chromatography using 5 ml HisTrap FF column (GE Healthcare). Overnight dialysis in 50 mM Tris pH8, 150 mM NaCl and 1 mM DTT buffer and a gel filtration purification using a Superdex 75 100/300GL column (GE Healthcare) were done before chemical labelling. The SDS PAGE of purified survivin before and after chemical labelling is displayed in Supplementary Fig. S9. The borealin 6-20 , hSgol1 291-312 and hSgol2 1066-1085 peptides were chemically synthetized (Genscript).

Peptide microarrays.
Pre-staining of one of the PEPperCHIP ® Peptide Microarrays was done with the secondary 6X His Tag Antibody DyLight680 antibody at a dilution of 1:1000 and with monoclonal anti-HA (12CA5)-DyLight800 control antibody at a dilution of 1:1000 to investigate background interactions with the protein-derived peptides that could interfere with the main assays. Subsequent incubation of other peptide microarray copies with survivin at concentrations of 1 µg/ml, 10 µg/ml and 100 µg/ml in incubation buffer was followed by staining with the secondary 6X His Tag Antibody DyLight680 (Rockland Immunochemicals) antibody and the monoclonal anti-HA (12CA5)-DyLight800 control antibody (Rockland Immunochemicals) as well as by read-out at scanning intensities of 7/7 (red/green). HA and His tag control peptides were simultaneously stained as internal quality control to confirm the assay quality and to facilitate grid alignment for data quantification. Read-out was performed with a LI-COR Odyssey Imaging System, while quantification of spot intensities and peptide annotation were done with PepSlide ® Analyzer. Quantification of spot intensities and peptide annotation were based on the 16-bit gray scale tiff files at scanning intensities of 7/7 that exhibit a higher dynamic range than the 24-bit colorized tiff files.  Table 3. Model parameters inferred from experimental data using the interactions of chemical labelled survivin with Borealin 6-20 in Buffer A after 5 min incubation. Variable B and U represents the T-jump + thermophoresis difference in F normal for the fully bound ligand and unbound labelled protein, respectively. K D is a dissociation constant of the ligand and labelled protein. *Median and HDI 95% interval of the sampled posterior probability distribution. **Maximum likelihood estimate and confidence interval (95%). ***Maximum likelihood estimate and confidence interval using ESP settings (95%) as determined by the software PALMIST 30 . Probabilistic modelling of binding curves and maximum likelihood estimates. The probabilistic models were implemented in python with the help of modules numpy, scipy and pymc3.
A probabilistic model was used to simulate the concentrations in a serial dilution experiment. Each step in the series involves two pipetting events: the previous solution of the series (or the stock) is pipetted and the fresh buffer without the solute which results in a concentration dilution. For microscale thermophoresis experiments (and generally serial dilution experiments) it is recommended to use equal pipetting volumes to minimize the pipette specific systematic differences at different targeted volumes. Thus a reasonable a priori belief is that the variation in the concentration at each dilution step can be described by two normally distributed pipetting events with an assumed mean of the targeted volume and equal variance. The latent concentrations in the dilution series can be thus described with the following equations: , (2)   The targeted concentrations are part of geometric series: The concentration dependence of the binding is described by the law of mass action: where U and B are the measured values (T-jump, thermophoresis fluorescence difference or fluorescence yield directly whichever carries binding information) corresponding to the 100% unbound and bound ligand species, c fl is the concentration of the fluorescent partner molecule and K D is the dissociation constant. ε is the error of the measurement, which is free of dilution error i.e. equivalent to the error at the stock ligand concentration. ε is assumed to be normally distributed for generating synthetic data and to account for the frequently observed "outliers" 20% of the synthetic data are replaced by samples from a uniform distribution that includes the U-B range.
In order to infer the K D of the interaction we constructed an a priori probabilistic model using the Python library pymc3 31 : where τ is the precision parameter of the normal prior distribution of the fluorescent labeled protein concentration (reciprocal of the variance). The other parameters believed to be uniform and when the random variables of K D , U, B, c fl , ν, λ are outside of their bounds, their a priori probability is zero. The probability distribution of variable D(c) corresponds to the thermophoretic/fluorescence yield observations at concentration c and is the only fixed parameter in the model: where, μ, ν and λ corresponds to the location, degrees of freedom and scale parameter of the Student-T distribution. The Student T-distribution is assumed here because it tends to be robust against outlier observations and expected to capture the central tendency of D more efficiently.
To sample 5000 MCMC (Metropolis 32 and No-U-Turn sampling (NUTS) 33 iterations in a single chain on a Linux workstation (i7-3970X CPU at 3.50 GHz clock frequency) takes only one minute at most, which is at least one order of magnitude less time than performing a standard MST experiment.
The standard least squares regression was done by minimizing the squared residual between the observed data and prediction based on the law of mass action function − D c f c U B K ( ( ) ( , , , )) D 2 using the scipy.minimize. leastsq function call which uses the MINPACK library implementation of the Levenberg-Marquardt minimization algorithm 34,35 .
The confidence interval of the least squares estimates of K D , B and U parameters were determined from the minimum and maximum values of the fixed parameter p for which  (12) holds. RSS stands for the residual sum of squares differences between the data and the model in Eq. 4. RSS and RSS(p) are minimized for all parameters and for all parameters except p, respectively. N denotes the number of data points, and P is the number of parameters. F is the F-distribution value, calculated for α level of 0.05 and the degrees of freedom 1 and N − P.
For the NLLSQ evaluation of the experimental thermophoresis progress curves we also used the software PALMIST 30 with the following options: Thermophoresis + T-jump, normalized fluorescence and default starting parameters. We used the same starting parameters in our NLLSQ calculation (see Supplementary Table S1). For estimating the confidence intervals in PALMIST we used the ESP method (0.95 confidence level). The binding free energy are calculated using the van't Hoff formula: ΔG = RT ln K D where T is 297.15 K and R is the universal gas constant.