Predicting disease progression from short biomarker series using expert advice algorithm

Well-trained clinicians may be able to provide diagnosis and prognosis from very short biomarker series using information and experience gained from previous patients. Although mathematical methods can potentially help clinicians to predict the progression of diseases, there is no method so far that estimates the patient state from very short time-series of a biomarker for making diagnosis and/or prognosis by employing the information of previous patients. Here, we propose a mathematical framework for integrating other patients' datasets to infer and predict the state of the disease in the current patient based on their short history. We extend a machine-learning framework of “prediction with expert advice” to deal with unstable dynamics. We construct this mathematical framework by combining expert advice with a mathematical model of prostate cancer. Our model predicted well the individual biomarker series of patients with prostate cancer that are used as clinical samples.

The first step constructs an expert of a target system. There are two options. The first option is to use long time-series observed in the past as they are. We construct experts by simply inserting previous parts of the time-series or the datasets of previous patients. Let x j,l be the lth point of the jth time-series in a database ( j 5 1, 2, …, J, l 5 1, 2, …, L) and f i,t be the ith expert's advice at time t. Numbers J and L are the number of time-series and the number of points in each timeseries, respectively; We assume that the lengths of the time-series are equal, but it is easy to extend to cases of different lengths. Let P be the number of points related to each expert. Then, we can define an expert f (L2P11)(j21)1i,k 5 x j,i1k21 for i 5 1, 2, …, L 2 P 1 1, k 5 1, 2, …, P. The second option is roughly to fit a mathematical model that has a set of parameters to a very short time-series, obtain the initial conditions for each set of parameters, and prepare the set of experts with these parameters. The details of this second rough option are discussed after we introduce a mathematical model of prostate cancer in the later section.
In the second step, TEA weights the trajectories f i,t in the database to generate an appropriate weighting for the most current state. Let y k denote the observation at time k, and let l(?,?) be a loss function. When we include the next point, the weights w i,t are updated according to a formula obtained by modifying the standard expert advice 15 . To achieve this, we sum the loss at each time step with a coefficient as follows: The modified form considers the instability of the underlying dynamics by introducing a coefficient a k (t), which measures the reliability of the prediction at, and increases exponentially with, time k such that a k (t) 5 l k21 or l k with l . 1. Thus, the real values L i,t and L t are the exponentially weighted losses of the ith expert and the predictor up to time t, respectively. We define L 0 5 L i,0 5 0 for simplicity. The weight of the expert is updated as where g is a learning rate. Chernov and Zhdanov 16 proposed a modified expert advice method in which they defined a k (t) 5 r t2k21 and 0 , r , 1. We call this the ''CZ method''. The third step predicts future states of the target system by applying the obtained weighting to the future trajectories in the database 15 . We can generate a point prediction by simply adding the q steps ahead of the trajectories with the weights obtained in the second step as follows: where N denotes the total number of experts. We can also generate a distributional prediction by assuming the distribution of observational errors, and summing this error distribution with the weights. To make these predictions online, we repeat the second and third steps iteratively.
Upper bound of the regret of TEA. We derive an upper bound of the regret of TEA. The primary property peculiar to TEA lies in our definition of a k (t). We define the coefficient as a k (t) 5 l k21 or l k , where l . 1. The choice of these two options depends on each situation. When a given data is too short, we choose the latter, e.g. our prediction about the biomarker of prostate cancer, or PSA (prostate-specific antigen). Let L t and L i,t be the accumulated losses for the proposed method. We call these the exponential accumulated losses to distinguish them from the standard accumulated losses. In addition, we define the regret as R t~Lt { min i~1,:::,N L i,t . The upper bound of the regret with a k (t) 5 l k21 is then given by Proof of the upper bound in our proposed method. We give a proof of Eq. (5) Second, we derive the upper bound of ln(W t /W 0 ). Observe that L k,t~Lk,t{1 zl t{1 l f k,t ,y t ð Þ. Then, we can reformulate ln(W t /W t21 ) as follows: Equation (7) can be regarded as the average of random variable exp(2gl t21 l(f k,t , y t )) with a probability mass function proportional to Here, we assume that x is a random variable satisfying a # x # b, and that the inequality holds when s is any real number. Replace s by 2gl t21 and x by l(f k,t , y t ). Then, the upper bound of Eq. (7) can be found using Eq. (8) as follows: {gl t{1 l p t ,y t z g 2 l 2(t{1) e 2 8 : We assume that l(?,?) is convex as described above. Then, the upper bound of ln(W t /W 0 ) can be derived as Because Eqs. (6) and (10) provide lower and upper bounds of ln(W t / W 0 ), respectively, the following inequality is obtained: {g min k~1,:::,N L k,t {ln Nƒ{gL t z g 2 e 2 8 : By substituting the regret R t~Lt { min k~1,:::,N L k,t into Eq. (11), we finally reach the following inequality: Optimization of the upper bound of TEA. We minimize the upper bound of Eq. (12) over g. First, we differentiate the upper bound with respect to g as follows: The solution is which gives the smallest upper bound. Replacing g in the upper bound of R t with g * , we obtain the following optimal upper bound Â(t): Although this optimal upper bound perhaps seems to be curious at a glance due to its exponential increase with t, this is caused by the definition of the accumulated losses Eqs. (1) and (2) with a k (t) 5 l k21 . This regret can be compared with the normal types of regrets using relationship described in the next section. When e 5 1 and l R 1, the optimal upper bound Â(t) coincides with ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi t ln N=2 p , which is the upper bound obtained in the standard expert advice method 15 .
Comparison between the proposed method and the Chernov-Zhdanov method. Here, we highlight the difference between the CZ method and the proposed TEA method. The first point of difference is the optimal upper bound of the regret. We briefly introduce the optimal upper bound of the CZ method 16 . LetL i,t and L t be the accumulated losses for the ith expert and the predictor for the CZ method, respectively. Then, the optimal upper boundÂ c (t) of the regretR t~Lt { min i~1,:::,NL i,t for the CZ method is given bŷ Note that we assume the case where the value of the decay rate r does not depend on t or k. See Ref. 16 for the proof. Although we cannot directly compare these regrets, we can compare them after normalization. Assuming that the decay rates are equal, namely l 5 r 21 , the regrets R t andR t have the following relation: Then, the following relation is obtained: This result means that the normalized optimal upper bound of the proposed method is always smaller than that of the CZ method when 0 , r , 1. Next, we compare the weights produced by the two methods. Let w i,t and ŵ i,t be the weights of the ith expert at time t in the CZ and TEA methods, respectively. Similarly to the derivation of Eq. (17), the accumulated losses of both methods are related byL i,t~r t{1 L i,t .
Substituting this relation into ŵ i,t , we have Equality (20) means that the proposed TEA method tends to assign reliable experts with heavier weights than the CZ method. This implies that ŵ i,t . X N j~1 ŵ j,t §ŵ i,t . X N j~1ŵ j,t for reliable experts because r 2t11 $ 1.
Examples of time-series prediction for mathematical models. We demonstrate the superiority of the TEA method to both the CZ method and the standard expert advice in online time-series prediction using toy examples. We use the Hénon map 17 and the Ikeda map 18 for our demonstration. These two models are commonly used to test nonlinear time-series analysis methods, which exhibit typical unstable chaotic dynamics. First, we generate time-series for the database using various values of parameters. We then generate a target time-series for prediction using a set of parameter values that is different from those used to generate the database. We prepare M 3 S experts for the database, where M is the number of parameter sets. For each parameter set, we generate S experts with different initial conditions. In numerical simulations of TEA, we set e 5 1 and g~ffi . We also set l 5 r 21 and r 5 0.9. See Algorithm 2 in Ref. 16 for the implementation of the CZ method, and Ref. 15 for that of the standard expert advice. The Hénon map 17 is a two-dimensional map defined as We set the parameters at a 5 1. 35  The proposed TEA method provides the best online prediction in different toy examples. The more experts we use, the smaller the prediction errors become. When a large number of experts are used, the proposed TEA tends to achieve the best online time-series prediction. We need to decay the past information in these examples, because the unstable chaotic dynamics rapidly loses the memories.
Examples of time-series prediction for real datasets. We now consider two real datasets: violin sounds 19 [20][21][22][23] . These time-series are both scalar and real-valued. We divide each time series into two. The first part is used to build the database, and the second constructs the targets for online prediction. We use M 5 1,000 and M 5 120 targets for the analysis of violin sounds and squid giant axon data, respectively. The lengths of the target data are 311 for the violin data and 51 for the squid giant axon data; numbers and lengths of target data are determined by the lengths of the original datasets.
We compare five methods using these real data. These are our TEA method, the CZ method, the standard expert advice, the persistence prediction, and the average prediction. The persistence prediction is a method that we let the current value to be the prediction for the next time point. We compare each pair of the method individually, and count the number of points at which the prediction by one method is better than the other for each target time-series. If one method is superior at more than half the data points, we declare that method the winner on the target data. We exclude the initial ten points from the analysis, because we cannot prepare the learning part. Finally, we count the number of wins and losses for each pair among the five methods. In the TEA numerical simulations, we set e 5 1 and For this dataset, our method and the persistence prediction produce much better results than the other methods. Therefore, we next compare our TEA method with the persistence prediction with respect to the number of experts. We use the binominal test for the analysis, i.e., if the number of wins is greater (smaller) than 531 (469), the method is significantly superior (inferior) to the other method with respect to the 95% confidence level two-sided binominal test. When the number of experts is large, our TEA method is significantly superior to the persistence prediction, as shown in Fig. 2e and Table 1. In the example of squid giant axon 20 , the proposed TEA is also better than the other four methods when the number of experts is large, especially when greater than or equal to 87, as shown in Fig. 3 and Table 1.
In conclusion, our TEA method tends to provide the best prediction when the number of experts is large. The precise number of experts for which this is the case may change depending on the given data, the length of targets, and the decay parameter.
Distribution prediction to the mathematical models. We applied the distribution prediction to time-series of the Hénon map. The distribution prediction will be explained in the later Method section. The setup is similar to that for the point prediction, except that we provide the prediction as a distribution. The results are presented in Figs. 2f and 2g. The width of the distribution prediction is narrow immediately after the learning period (Fig. 2f), then grows gradually as the number of prediction steps increases because of the instability of the underlying dynamics. The predicted confidence interval tends to contain the actual values. When we increase the number of points used for prediction, the width of the distribution prediction becomes narrower (Fig. 2g). We use values of S 5 1,000 and M 5 100 in Figs. 2f and 2g. In the TEA numerical simulations, we set  Fig. 2g. Restricting the range of l to 1 , l , 2 gives a better prediction. We generate the target and experts' time-series as in the previous section. We also obtain the distribution prediction of the Ikeda map using S 5 1,000 and M 5 100, as shown in Fig. S1f. The result is very similar to that of the Hénon map. Again, restricting the range of l to 1 , l , 2 gives a better prediction.
Mathematical models of prostate cancer. TEA can be applied to clinical problems, such as the prediction of prostate-specific antigen (PSA) after some initial treatments, while waiting to start an additional treatment. We apply TEA to the prediction of tumor markers for prostate cancer PSA. Before the technical details, we introduce a mathematical model of prostate cancers in this section. Patients had already received radical prostatectomy as an initial treatment. Then, clinicians followed postoperative PSA levels to determine when to commence salvage treatment. Although the timing at which patients start salvage treatment is an important problem, there is no definitive agreement on when this to be started. Currently, clinicians are determining the start of salvage treatment based on their discretion. The clinical part of this study was approved by the ethics committees of Jikei University School of Medicine and The University of Tokyo. All patients provided written informed consent. Cancer cells tend to thrive under an androgen-rich environment. Meanwhile, lowering androgen levels makes cancer cells grow slowly or rather decline. Because of this characteristics, clinicians suppress the androgen concentration with hormone therapy. However, when cancer cells remain exposed to an androgen-poor environment, they often acquire the ability to grow without androgen. This growth signals a cancer relapse. Intermittent androgen suppression was proposed to delay the relapse of cancer 24 . In intermittent androgen suppression, we start hormone therapy, but stop when PSA levels have decreased sufficiently. Then, we wait until PSA increases and reaches a threshold value. After reaching this threshold, we resume hormone therapy. We repeat this process to delay the relapse. However, clinical trials show that the effects of intermittent androgen suppression depend on individual patients, and are limited 25,26 .
Here, we use a mathematical model 8 of intermittent androgen suppression for prostate cancer [24][25][26] . This model was constructed based on data of Canadian patients 25,26 whose PSA had increased to some extent after radiation therapy, and were later treated by intermittent androgen suppression. Because the model of Ref. 8 has a small number of parameters, it is reasonable to predict the future PSA values with this simple model and very short time-series, although several mathematical models have been proposed to describe dynamics under intermittent androgen suppression [4][5][6]8,10,[27][28][29][30][31] . In the model described in Ref. 8, we assume that there are three classes of cancer cells: androgen dependent cancer cells x 1 , androgen independent cancer cells generated through reversible changes x 2 , and androgen independent cancer cells generated through irreversible changes x 3 . When the hormone therapy is underway, x 1 may change to x 2 or x 3 . When the hormone therapy is stopped, x 2 may return to x 1 , whereas x 3 cannot return to x 1 or x 2 because of genetic mutation. We previously verified two important properties of this model: namely, a piecewise linear model is sufficient to describe the dynamics of PSA, and the androgen concentration need not be explicitly included in the model 8 . Based on these verified properties, we can simply construct the mathematical model as d dt for the on-treatment period, and   d dt for the off-treatment period 8 . Here, d 1 , d 2 , d 3 , d 4 , d 5 , d 6 , e 1 , e 2 , e 3 , and e 4 are model parameters. We assume that a PSA measurement is represented by x 1 1 x 2 1 x 3 for simplicity. Thus, we must specify these 10 parameters for the dynamics and three other parameters for the initial conditions of x 1 , x 2 , and x 3 . If we try to find these 13 parameters directly only from a single target patient, we would need to obtain a long time-series. The application of the proposed TEA algorithm makes the required observation period of PSA measurements shorter by integrating observations from the target patient with the long timeseries data of PSA measurements obtained from previous prostate cancer patients. We note that we only analyze the off-treatment period in this paper, because the target dataset is about the follow-up period after an initial treatment. Therefore, we need 4 control parameters and initial conditions.
Construction of experts for prediction of PSA for prostate cancer.
In this paper, we have two datasets; one is a dataset of Canadian patients with many data points; the other is a dataset of Japanese patients with short time points. We need a long time-series to efficiently estimate model parameters. Therefore, we select Canadian datasets for estimation of parameters and Japanese datasets for predicting targets.
In applying TEA to prostate cancer, we first prepared 72 sets of model parameters, each of which was obtained from one of 72 Canadian prostate cancer patients treated with intermittent androgen suppression. These parameters were obtained from Ref. 8. We note that our prediction target dataset corresponds to the off-treatment period in the model 8 . Second, we chose the number of observation points to use as known data points. This must be at least three because of the model dimensions 8 . Third, using each set of para-meters, we determined the initial model state to minimize the fitting error between the initial three or more PSA measurements and the model output. The optimal initial conditions were selected by minimizing the following cost function: where h(x) 5 10 15 (1 2 x) for x , 0 and h(x) 5 0 for x $ 0, where t k is the kth observation time. We denote the number of observation points used for learning by K. The method of obtaining the initial conditions was similar to that in Ref. 8. Fourth, we ran the model with each set of parameters and the corresponding initial conditions to construct the database of experts f i,t ; thus, we have 72 experts.
Estimation of learning parameters. We applied the second step of the TEA algorithm to determine the weights of the PSA measurements. Then, we applied the third step of the TEA algorithm to obtain the distribution prediction. We determined the optimal decay rate l by minimizing the error between the last learning observation and the prediction. We restricted the range of l to 1 , l , 2 to obtain better predictions. The standard deviation s is estimated as follows. We ran the distribution prediction with the obtained initial conditions and the decay parameter. We set the standard deviation s to the mean of the absolute errors between the median of the distribution prediction and the corresponding observation when the mean is taken during the learning period.
Application of TEA to prediction of PSA for prostate cancer. We predict the values of PSA with distribution prediction. The distribution prediction of PSA with TEA is shown in Fig. 4. Here we evaluate the larger side of the predicted distribution, because overlooking high PSA is highly undesirable in a real clinical setting. We show seven points u t (Q) of the predicted distribution (97.5%, 87.5%, 75%, 65%, 60%, 55%, and 52.5%) in these figures, where u t (Q) is defined as Note that Q is the intended value of the probability, i.e. 0.975, 0.875, 0.75, 0.65, 0.6, 0.55, and 0.525, respectively, in this situation. We obtained the proportion of PSA values that are less than the intended probability for each Q, and counted the PSA data points that are next to the final data points in the learning period; namely, if we are using three data points for learning, we count the fourth data point. In this paper, we focus on the predictability of the next point.
The results are summarized in Table 2. Note that TEA can predict not only the next data point, but also those far in the future. We predicted the future PSA values for 88, 86, 80, and 69 patients when we used the first three, four, five and six time points, respectively. We also conducted numerical simulations using CZ and the standard expert advice. The predicted distributions were different for each method, as shown in Fig. 5. In numerical simulations, we set e 5 1. We arrange the learning rate as four constant values increase v as the number of the learning points increases. We note that M 5 72 is the number of experts. We also arrange the learning TEA exhibits the best performance among the three methods, because each proportion tends to be closest to the specified value of Q. These results imply that our proposed prediction method may be reasonable for real applications in a clinical setting. We also checked the prediction performance in terms of the median using the mean absolute error (MAE) as summarized in Supplementary  Table S1. As a result, TEA shows the best performance in the meaning of the average MAE among the four cases. We note that the CZ method showed good performance in terms of the root mean square error (RMSE), however, we believe that the MAE suits our situation because we employed the absolute error function for the learning period.

Discussion
In general, clinicians provide a salvage treatment with patients who had recurrence after surgery. Although many studies show the clinical benefit of a salvage treatment for patients with prostate cancer, current studies have reported that an earlier salvage treatment, especially for local recurrence, could improve clinical outcomes 32 . These results suggest that post-operative patients with lower PSA values may have a higher frequency of local recurrence that could be efficiently treated by radiotherapy. If clinicians can accurately assess the PSA failure at an earlier stage than the present standard criterion of PSA failure which is that the PSA value increases to 0.2 (ng/ml) or more after surgery, salvage treatments could be more effectively scheduled for each patient, improving the final clinical outcome 33 . However, there is still no standardized criterion to determine the best timing of salvage treatments 32,33 . Combined with a mathematical model 8 , TEA or its further extensions may be able to potentially predict the PSA dynamics in patients before PSA failure. Therefore, the proposed TEA could become the basis of a new standard index for earlier prediction of PSA failure using a simple mathematical solution, that offers important information for a suitable salvage treatment after surgery 7,34,35 .
The more experts we use, the more (numerically) accurate the prediction tends to become (Figs. 2b, 2c, and 2e); in this sense, the accumulation of datasets is important. Additionally, the longer the learning period, the more accurate the TEA prediction tends to become in the toy examples (Fig. 2g). This could be because the toy examples have bounded unstable dynamics. The prediction error does not monotonically decrease with an increase of the learning data points in the example of prostate cancer (Tables 2 and S1), because PSA tends to increase monotonously in time. TEA exhibits the best performance in our analyses. The proposed combination of the expert advice with a predicted distribution enhances the reliability of prediction. This is important in many applications, and especially in medicine.
In summary, we have demonstrated that TEA can infer the state of a target system, by combining its short time-series and the expert www.nature.com/scientificreports advice constructed as a collection of longer time-series. The proposed TEA may be applied to any problems where a short time-series and its database are given, as demonstrated in the violin and squid giant axon examples, although we primarily intend to apply TEA in clinical settings, such as inferring the state of a disease using a short timeseries from the target patient and longer time-series from previous patients with the same disease. We hope that TEA improves the overall survival and/or quality of life for patients.

Methods
Standard expert advice method. The expert advice method 15 is an online predictor in machine learning. We briefly introduce the standard expert advice method in this section. See the book of Cesa-Bianchi and Lugosi 15 for a more detailed introduction. The expert advice consists of experts and a predictor. At each time step, each expert gives the prediction on the future. The predictor makes a prediction for the future by weighting these pieces of advice based on the experts' prediction history. After a new outcome is observed, the predictor updates the experts' weights using the losses produced in the current step. We iterate these steps to realize online prediction. Let f i,t be the ith expert's advice at time t and N be the number of experts. We assign each expert the weight w i,t at time t, and obtain the prediction by averaging the experts' advice as where p t is the prediction at time t, g is a constant, and L i,t is the accumulated loss for the ith expert at time t. Better experts have smaller accumulated losses, and hence have larger weights. The accumulated losses for the ith expert and the predictor at time t are where y k is the observation at time k, and l(x, y) is a convex loss function, typically the absolute error jx 2 yj or squared error (x 2 y) 2 . We evaluate the performance of the predictor by a regret, which is defined as the predictor's accumulated loss minus the accumulated loss for the best expert. Mathematically, the regret R t is defined as where e is the maximum value of jl(?,?)j. Namely, the regret is bounded above by the right-hand side of Eq. (30) (see Ref. 15 for the derivation). We obtain the following optimal constant g * by minimizing the upper bound over g: When replacing g with g * , we obtain the optimal upper bound of the regret R t as e ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi t ln N=2 p . We call the accumulated losses defined in Eqs. (28) and (29) the standard accumulated losses.
Although the standard expert advice can be applied in many cases, the method is not suited to the prediction of unstable systems, in which the recent history should be emphasized to predict the future more accurately. Thus, we extended the standard expert advice by placing greater weights on recent past information. We call our extension the temporal expert advice, or TEA.
Distribution prediction. Here, we extend the TEA method for point prediction to the prediction of distribution, so that we can handle the prediction of biomarkers. For this purpose, we introduce the distribution prediction of the ith expert at time t p i,t (x) as follows: where s is the standard deviation. This distribution is given under the assumption that a point prediction f i,t is disturbed by various errors and that the error is normally www.nature.com/scientificreports distributed around the point prediction. Then, the predictor for the distribution prediction is given byp We determine the optimal decay rate l by minimizing the absolute error between the final learning point and the corresponding observation point. The standard deviation s is set to be the absolute difference between the point prediction fˆt and the observation y t at the final learning point under the estimated l above as s~jy t {f t j, where t is the number of the learning points. We note that we determined these parameters with a modified way in the prediction of PSA because of its instability. www.nature.com/scientificreports