Reinforcement learning derived chemotherapeutic schedules for robust patient-specific therapy

The in-silico development of a chemotherapeutic dosing schedule for treating cancer relies upon a parameterization of a particular tumour growth model to describe the dynamics of the cancer in response to the drug dose. In practice, it is often prohibitively difficult to ensure the validity of patient-specific parameterizations of these models for any particular patient. As a result, sensitivity to these parameters means that dosing schedules that are optimal in principle may perform poorly on particular patients. In this study, we demonstrate that chemotherapeutic dosing strategies learned via reinforcement learning methods are more robust to perturbations in patient-specific parameter values than those derived via classical optimal control methods. By training a reinforcement learning agent on mean-value parameters and allowing the agent periodic access to a more easily measurable metric, relative bone marrow density, in order to optimize the dose schedule while reducing drug toxicity, we develop drug dosing schedules that outperform schedules derived via classical optimal control methods, even when those methods are allowed to leverage the same bone marrow measurements.

In Figure S2 we present the trajectories of the bone marrow and the control obtained via the objective functional in Eq. (3). In Figure S3 we produce a plot similar to Figure 3, showing the proliferative fractions of both the healthy bone-marrow cells and the malignant breast-cancer cells. For this plot, the bone marrow and breast cancer compartments are parameterized with the nominal parameters from Table 1. Notably, in all three scenarios the schedule is able to drive the proliferative fraction of the malignant cells down while maintaining the proliferative fraction of the healthy cells. In Figure S3, ρ_p^bm(t) denotes the proliferative fraction of the bone-marrow cells and ρ_p^bc(t) denotes the proliferative fraction of the breast-cancer cells. Figure S4 demonstrates various dosing schedules on testing patients. We plot the schedules obtained by employing the reinforcement learning agent trained on the nominal parameter set, the mean optimal controller derived treatment, the NTNOC treatment, and the optimal treatment for each particular testing patient (treated as unknown during testing; it was acquired by treating the model parameters as known and solving the optimal control problem with the APOPT algorithm as implemented in GEKKO [1, 2]).
In Figure S5 we present plots of chemotherapy schedules along with the trajectories of the affected bone marrow cells.

Figure S3. A plot of the proliferative fraction ρ_p(t) = P(t)/(P(t) + Q(t)) for both the healthy bone-marrow cells and the malignant breast-cancer cells under the optimal chemotherapeutic control f*(t) (dashed red) for different values of b. The objective functional used to achieve this optimal control is given in Eq. (2). Small values of b weight preservation of the bone marrow as more important, while larger values of b weight total drug delivery as more important.
These plots are generated for the same virtual patients as in the first column of Figure S4.
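The proliferative fraction plotted in Figure S3 is computed directly from the proliferative and quiescent compartment trajectories P(t) and Q(t). A minimal sketch of that computation is below; the arrays are hypothetical placeholders, not the model's actual solutions.

```python
import numpy as np

def proliferative_fraction(P, Q):
    """Proliferative fraction rho_p(t) = P(t) / (P(t) + Q(t)).

    P, Q: arrays of proliferative and quiescent cell counts over time.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return P / (P + Q)

# Hypothetical compartment trajectories (placeholders, not model output):
P = np.array([100.0, 80.0, 60.0, 50.0])
Q = np.array([100.0, 120.0, 140.0, 150.0])
print(proliferative_fraction(P, Q))  # [0.5, 0.4, 0.3, 0.25]
```

Since P and Q are cell counts, the fraction always lies in [0, 1]; in Figure S3 the schedule drives this quantity down for the malignant compartments while keeping it high for the healthy bone marrow.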
The results presented in Figures 7-10 all depend upon the particular sample of testing patients (the values of θ^k_i for all k and i). In order to investigate how these results change with differing batches of testing virtual patients, we now present the results of Figures 7 and 9 for five additional batches of 200 testing virtual patients. Explicitly, we independently sampled five batches of 200 testing virtual patients, in addition to the batch used in the main manuscript text, using the same method as described in the Perturbed Virtual Patients subsection above. For ease of notation, let B^σ_j denote the j-th such batch at perturbation strength σ, with B^σ_0 the batch used in the main text; the histograms for the additional batches are presented in Figures S6c-S6f. We note the similarities in shape between the histograms presented. In fact, we performed two statistical tests to demonstrate the similarity of the histograms produced by batch B^0.20_0 with those produced by batches B^0.20_1 to B^0.20_5: a two-sample Kolmogorov-Smirnov test and a two-sample Anderson-Darling test [4, 5]. Both tests have the null hypothesis that the two samples are drawn from the same distribution. In all five cases we are unable to reject the null hypothesis (p > 0.25).
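The batch-sampling step can be sketched as follows. This is a minimal illustration only: it assumes multiplicative uniform perturbations of strength σ about the nominal parameter vector, which is one common convention; the paper's exact scheme is the one described in the Perturbed Virtual Patients subsection, and the nominal values below are placeholders, not those of Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nominal parameter vector (placeholder values, not Table 1):
theta_nominal = np.array([0.5, 1.2, 0.05, 2.0])

def sample_batch(theta_nominal, sigma, n_patients, rng):
    """Sample one batch of perturbed virtual patients.

    Each patient's parameter vector is obtained by perturbing every
    nominal parameter multiplicatively by up to +/- sigma, drawn
    uniformly and independently (an assumed scheme for illustration).
    """
    eps = rng.uniform(-sigma, sigma, size=(n_patients, theta_nominal.size))
    return theta_nominal * (1.0 + eps)

# One batch of 200 testing virtual patients at perturbation strength 0.20:
batch = sample_batch(theta_nominal, sigma=0.20, n_patients=200, rng=rng)
print(batch.shape)  # (200, 4)
```

Repeating this call six times with independent draws yields the reference batch and the five additional batches compared in Figures S6 and S7.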
In the chemotherapy schedule plots, we also present, in dotted black lines, the theoretically optimal schedule for each patient as a comparison. While the nominal OC schedules remain static (blue lines), the RL and NTNOC schedules adapt to each patient (orange and green lines). The tendency of the RL schedules (orange lines) to lie visually closer to the theoretically optimal schedules (dotted lines) illustrates the greater robustness of the RL agent to perturbations in the parameter values.
Similarly, in Figure S7 we present a related visualization. In Figures S7a-S7c we present histograms separating the reinforcement learner derived treatments from the nominal optimal controller derived treatments on all 6 batches B^k_0 to B^k_5 at all perturbation strengths. The data for batch B^k_0 are presented as solid lines, while the data for batches B^k_1 to B^k_5 are presented as semi-opaque filled histograms stacked on top of one another; this allows one to easily compare the relative shapes of these distributions. In Figures S7d-S7f we present similar data comparing the reinforcement learner derived treatments with the NTNOC derived treatments in the same manner. In all 6 cases, the histograms appear similar to the naked eye. To compare the samples formally, we again applied a two-sample Kolmogorov-Smirnov test and a two-sample Anderson-Darling test: for each of the 6 experiments we compared the five new testing batches B^k_1 to B^k_5 with the reference testing batch B^k_0 under the null hypothesis that the two samples are drawn from the same distribution. In all 30 such cases we were unable to reject the null hypothesis with either test (p > 0.25).
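The two-sample tests used above can be reproduced with standard routines, for example SciPy's ks_2samp and anderson_ksamp. Note that SciPy caps the significance level returned by anderson_ksamp at 0.25, which is consistent with reporting "p > 0.25" when the null cannot be rejected. The samples below are synthetic stand-ins, not the paper's batch data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for outcome samples from two testing batches:
batch_ref = rng.normal(loc=0.0, scale=1.0, size=200)
batch_new = rng.normal(loc=0.0, scale=1.0, size=200)

# Two-sample Kolmogorov-Smirnov test; H0: samples share a distribution.
ks = stats.ks_2samp(batch_ref, batch_new)
print(f"KS statistic = {ks.statistic:.3f}, p = {ks.pvalue:.3f}")

# Two-sample (k-sample) Anderson-Darling test; SciPy caps the returned
# significance level at 0.25, so "p > 0.25" is reported as 0.25.
ad = stats.anderson_ksamp([batch_ref, batch_new])
print(f"AD statistic = {ad.statistic:.3f}, "
      f"significance = {ad.significance_level:.3f}")
```

Failing to reject with both tests, as in all 30 comparisons above, indicates the additional batches are statistically indistinguishable from the reference batch.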