Explainable deep learning for tumor dynamic modeling and overall survival prediction using Neural-ODE

While tumor dynamic modeling has been widely applied to support the development of oncology drugs, there remains a need to increase predictivity, enable personalized therapy, and improve decision-making. We propose the use of Tumor Dynamic Neural-ODE (TDNODE) as a pharmacology-informed neural network to enable model discovery from longitudinal tumor size data. We show that TDNODE overcomes a key limitation of existing models in its ability to make unbiased predictions from truncated data. The encoder-decoder architecture is designed to express an underlying dynamical law that possesses the fundamental property of generalized homogeneity with respect to time. Thus, the modeling formalism enables the encoder output to be interpreted as kinetic rate metrics, with inverse time as the physical unit. We show that the generated metrics can be used to predict patients’ overall survival (OS) with high accuracy. The proposed modeling formalism provides a principled way to integrate multimodal dynamical datasets in oncology disease modeling.


Introduction
The study of tumor growth dynamics has a long history, with early seminal efforts demonstrating the ability of mathematical models to describe experimental data [1] [2].Subsequently, there has been a plethora of mathematical models based on various frameworks (e.g.deterministic, stochastic, game theoretical, etc.) that have integrated aspects of the underlying biological processes (e.g.tumor heterogeneity and angiogenesis) and led to the generation of scientific insights [3].For drug development applications, the modeling of tumor size dynamics from patient populations has become an important tool by which to characterize treatment efficacy.Currently, nonlinear mixed-effects modeling (NLME), based on structural models described with either algebraic or differential equations, is the most widely adopted methodology used by the pharmacometrics community [4].Via modeling, the derived tumor dynamic metrics have found wide utility in supporting the development of anti-cancer drugs, ranging from early efforts [5] showing the ability of such metrics to predict Phase 3 overall survival (OS) from Phase 2 data, to broader applications [6] [7] and use of the tumor dynamic metrics to predict the hazard ratio of clinical trials for a wide variety of solid tumor types [8].While there have been prior efforts to use machine learning (ML) algorithms to map tumor metrics derived from NLME modeling [9] to OS, the derivation of the metrics themselves has not been attempted using ML.Although there has been much progress in model-informed drug development within oncology using tumor dynamic models, there remain many opportunities for additional applications, including personalized therapy [10].
A key area for the advancement of tumor dynamic modeling is increasing the ability to accurately predict future patient outcomes from early observed longitudinal data.If successful, this would increase the impact of predictive modeling for drug development and for personalized therapy.The paths for future progress may entail the utilization of high dimensional data that have become available through technological advances in the biomedical sciences (e.g.Digital Pathology [11], ctDNA [12] etc.), via the development of algorithms, or via a combination of both.Despite the many types of models that have been developed to support clinical decision making in oncology, there is increased appreciation that in order to effectively mine ever larger datasets, Artificial Intelligence (AI) approaches are required to complement existing statistical and mechanistic models [13].
While there are many neural network architectures that can be utilized to describe longitudinal data such as tumor size measurements, the formalism of Neural-Ordinary Differential Equations (Neural-ODE) [14] [15] is an especially effective platform by which to combine the strengths of Deep Learning (DL), with the advantages of ODEs (which is amongst the most commonly used mathematical formalism in tumor dynamics modeling).The use of Neural-ODE in pharmacology has been successfully developed for pharmacokinetics (PK) [16] and pharmacodynamics (PD) [17] modeling, demonstrating promising predictivity results in the settings examined.In this work, we provide the foundational Tumor Dynamic Neural-ODE (TDNODE) modeling framework that consists of an encoder-decoder architecture.A key consideration in the methodological development is integrating the ML model with physical concepts [18], as this may enhance interpretability and enable making physically consistent predictions in temporal extrapolations beyond the training set [18].While largely building upon the important work of [15] whereby a Recurrent Neural Network (RNN) encoder is used together with a Neural-ODE decoder, there are additional developments necessary to make it pharmacology-informed in terms of dynamical characterization of patients' tumor data that leverages the well-established oncology disease modeling framework [10].In this work, we propose a principled way to normalize patients' tumor dynamics data in conjunction with scaling the encoder outputs in correspondence in order to maintain a specific unitsequivariance [19] in the learned vector field.Thus, we not only enable the interpretation of the encoder outputs as tumor dynamic metrics (with the physical unit of inverse time), but also ensure a generalizable approach to predict patients' OS in a manner that rests upon the well-established tumor growth inhibition (TGI)-OS link.We show that by applying the proposed methodology together with our devised data augmentation approach, we can generate a model that makes unbiased predictions of future tumor sizes from early, truncated tumor size data.Furthermore, we show that the encoder-generated TDNODE metrics (that is, patient-specific kinetic parameters produced by feeding the longitudinal tumor data into the encoder) can predict patients' OS using the ML survival model XGBoost [21] with a performance that significantly surpasses the existing TGI-OS approach [21].Finally, we show that the XGBoost model for OS can be interpreted using the Shapley Additive Explanation (SHAP) to quantify how the contribution of TDNODE metrics impacts tumor dynamic predictions [22] [23].

Dynamical systems formulation enabling the interpretation of parameters as kinetic rates
We wish to discover the dynamical law governing patients' tumor dynamics using the Neural-ODE system shown below: Here,  is the final simulation time of interest,  ! is a neural network parameterized by a set of weights  to be learned across the patient population,  ∈ ℝ " represents the patient-specific kinetic parameters and (0) ∈ ℝ # represents the patient-specific initial state of the system both of which are to be learned from individual patient data, and (•): [0, ] → ℝ # represents the time-continuous solution being sought.The patient-specific parameters and initial state represent patientto-patient variability obtained by passing the individual patient tumor data through the corresponding encoders.In particular, the current framework focuses on discovering the set of equations for tumor dynamics, independent of the choice of treatment and/or dosing.
In our formulation, we would like to be interpret  as kinetic parameters with the physical unit of 1/[t] rather than simply an arbitrary abstract representation generated by a neural network with the sole aim of reproducing the tumor dynamic data.We do so by performing time-scaling operation on equation ( 1) and leveraging the notion of equivariance [24]: upon transformation to a dimensionless time  ̂= /and using the chain rule, we have (please see the Appendix for the derivation): We propose that in order to interpret  as kinetic parameters, we should learn vector fields  ! which satisfy the following generalized homogeneity [25] condition: Note that while vector fields  !(, ) that are linear in  enters would clearly satisfy (3), this condition does not equate linearity as there are nonlinear functions which satisfy (3) as well.From equations ( 2) and (3), it follows that: Equation (4) shows that in order for the dynamical system expressed in dimensionless time  ̂ to reproduce a given set dynamical data given in the original time , its kinetic parameters need to scale in direct proportion to the corresponding time scaling factor: that is,  × .As  can be an arbitrarily chosen positive real scaling factor, we introduce a patientdependent data augmentation scheme such that a single temporal data trace (in original time) is truncated, and subsequently mapped to a number of different temporal traces expressed in rescaled time,  .The augmented set of temporal traces under different rescaling is then used to train the model.In summary, we propose a data augmentation scheme involving various choices of time scaling factor , thereby imbuing  with the meaning of kinetic rate parameters.Refer to the Methods section for further details.

Model architecture generates longitudinal tumor predictions and kinetic parameters
We designed TDNODE with the intent to longitudinally predict tumor dynamics given an arbitrarily defined observation window, whereby the input data consists of tumor sum-of-longest-diameter (SLD) measurements and their respective times of measurement.
The main components of the TDNODE architecture consists of a set of two encoders that process sequential inputs and an ODE-solver decoder, as displayed in Figure 1: the Initial Condition Encoder consists of a recurrent neural network (RNN) that creates a 4-dimensional encoding which represents the initial condition of a given patient; the Parameter Encoder contains an attention-based Long Short-Term Memory (LSTM) flanked by laterally-connected linear layers, which produces a 2-dimensional representation of tumor kinetic parameters.The ODE-solver decoder uses the two encoder outputs as the initial condition and kinetic parameters respectively, to generate longitudinal SLD predictions.Following Chen et al. [14], the decoder consists of an ODE system whose vector field is represented as a neural network.In the ODE solver, numerical integration is carried out for the 4-dimensional state vector (which provides a latent representation of the time-varying tumor state) from the current time point to its corresponding values at the next requested time point, according to the dynamics provided by the vector field.Finally, a Reducer is used to reduce the state vector to a single number, so as to be compared with the measured SLD values.Refer to Methods for additional details.

TDNODE produces unbiased predictions of tumor dynamics
We examined the ability of TDNODE to reproduce and extrapolate tumor dynamics using the IMPower150 dataset [26].As the aim of this work is to make longitudinal tumor predictions, we first excluded subjects from the dataset with less than or equal to one tumor size measurement.Subsequently, data were split randomly with respect to subject ID into a training set used to develop TDNODE (889 patients) and a test set used to assess TDNODE's tumor dynamic predictions (216 patients).In the training set, we performed the proposed method of augmentation that considers the subsampled subsets of collected patient time series data, which significantly increased the number of tumor dynamic profiles observed by TDNODE during training.Tumor SLD measurements were normalized with respect to the training cohort mean and standard deviation for both cohorts.Observation times were scaled with respect to the time of last measurement for each subject.To determine the last time of measurement, we deem all measurements made at a time  less than or equal to the patient's observation window ( ! ) since the start of treatment as seen, measurements that TDNODE is able to observe and estimate in its tumor dynamic prediction curve.
All measurements collected at time  greater than the  !-week observation window are conversely deemed as unseen; here TDNODE is assessed in its ability to extrapolate or infer tumor dynamics using only data collected from within its observation window.In this study, we let  != 32 weeks represent the observation window for each subject.
We show in TDNODE was also shown to produce unbiased predictions, as shown in Figure 2 for both the training and test sets.In particular, TDNODE predictions did not exhibit systematic trends with respect to the measured SLD values (Figures 2a-2b).Moreover, TDNODE exhibited no systematic bias in a statistically significant manner in the tumor size predictions with respect to the time of measurement (Figure 2c-2d).These findings demonstrate promise for TDNODE to both longitudinally model and forecast tumor size data, in contrast to classical TGI models which were shown to carry systematic biases of tumor size predictions with respect to observation time [20].Refer to Supplementary Figures 1 and 2 for plots showing these comparisons for each treatment arm in the IMPower150 dataset.
We illustrated TDNODE's predictive abilities using selected patient longitudinal SLD measurements from the test set.To this end, we adjust  ! to be 16, 24, or 32 weeks for all subjects.We observe in Supplementary Figure 3 how the TDNODE predicted tumor profiles for these patients change with respect to increases in observation window.We also observe in Supplementary Table 5 how TDNODE's extrapolation performance improves as the observation window is increased.Taken together, our results show that TDNODE can make qualitatively appropriate continuous predictions of tumor dynamics at the subject level and that TDNODE makes more accurate predictions as the observation window is increased, as expected.Additional patient longitudinal tumor dynamic predictions can be seen in Supplementary Figure 4.

TDNODE enables superior prediction of Overall-Survival compared to existing TGI-OS models
The parameter encoder module of TDNODE produces a 2-dimensional encoding that can be used to predict patients' OS.In a similar manner to using TGI metrics to predict OS [9][27], we used this output as the input to an XGBoost ML modeling approach used previously in [27] (see parameters in Supplementary Table 3).We used the same sets of training-test patient split to evaluate OS predictions.Only encoder outputs from the training set patients were used to construct the OS ML model.This model did not observe test set encoder metrics until after the OS ML model was developed.We evaluated OS predictivity using the c-index [28], which was calculated using the OS ML model's output and the OS status of each patient from the clinical trial.On the training set, we evaluated OS predictivity by 5-fold cross-validation with random splitting of the data.
We compared this statistic to that obtained using existing set of all TGI metrics (consisting of tumor growth rate (KG), tumor shrinkage rate (KS) and the time to tumor growth (TTG) [6]), which were obtained using a non-linear mixed-effects (NLME) model fitted to data from all patients, from which the individual parameters were obtained [8], as is standard practice in pharmacometrics modeling.For both models, we evaluated OS predictivity using just the generated metrics, or in conjunction with eleven baseline covariates (Supplementary Table 2) obtained at the start of each subject's enrollment in the clinical trial [27].Figure 3 displays how the ML-predicted survival probabilities with 95% confidence intervals for each treatment arm align with that of the Kaplan-Meier (KM) curves of the patients in the test set.
The results shown in Table 2 demonstrate that the ML model for predicting OS based on TDNODE metrics (referred to as TDNODE-OS.ML) resulted in significantly increased predictive performance as compared to TGI metrics (referred to TGI-OS.ML) and is true using either the metrics alone, or using the metrics in conjunction with the previously used eleven baseline covariates.Furthermore, in the case of OS prediction using TDNODE derived metrics, we observe no significant loss in performance in OS predictivity between the validation set patients and the test set patients by comparing the c-index values.Furthermore, the predictive performance levels of TDNODE-OS.ML showed marginal improvements after incorporating the eleven baseline covariates.The result suggests that the TDNODE metrics extract sufficient information from the longitudinal tumor size data such that the additional eleven baseline covariates no longer provided additional predictive value.This is in contrast to the case of TGI-OS.ML, where utilizing the eleven baseline covariates resulted in a sizable improvement in OS predictivity.

TDNODE metrics can explain patient Overall Survival
To enhance the interpretability of TDNODE, we designed the formulation such that a single dynamical law is learned from the tumor dynamics data with patient-to-patient variability explained by kinetic parameters expressed with the units of inverse time.In addition to showing that the TDNODE-generated metrics can be used to predict OS, here we demonstrated that the relationship between the metrics and OS can be explained at a qualitative level as can be seen in Figure 4. Using principal component analysis (PCA) [29], we obtained the first principal component from the set of 2-dimensional encoder metrics and showed that it exhibits similar OS predictability to that of the TDNODE-generated metrics when using XGBoost, achieving a c-index of 0.82 on the training set and 0.81 on the test set (see Supplementary Table 6 for details).Shapley Additive Explanation (SHAP) [22] [23] analysis of this XGBoost model shows a positive impact upon the OS expected survival time with respect to increases in this axis.
Given this finding, we evaluated the impact of perturbations of this component on the tumor dynamics of individual patients.In Figure 4, we find that increases in this component yield monotonic decreases in the longitudinal tumor dynamic predictions for all patients in the training and test sets.This finding aligns with that of the SHAP analysis that we performed, as increases in tumor sizes are expected to be associated with decrease in OS.SHAP summary plots of the XGBoost ML model using the 2-dimensional encoder output of TDNODE can be found in Supplementary Figure 5 and Supplementary Figure 6, and additional examples of feature dependence plots for randomly selected patients can be found in Supplementary Figure 7 and Supplementary Figure 8.

Discussion
We presented a deep learning methodology to discover a predictive tumor dynamic model from longitudinal clinical data.
In essence, the methodology leverages neural-ODEs [14][15] formulated in such a way that a single underlying dynamical law (represented by the vector field) is to be discovered from the patient population data, with patient-to-patient variability in the tumor dynamics data explained by differences in both the individual initial state as well their kinetic rate parameters.Furthermore, we introduced an equivariance property in the vector field under time rescaling transformation, so as to enable the interpretation of the learned patient embedding as kinetic parameters, with the units of inverse time.By estimating the individual patients' kinetic parameter values (or metrics) from the longitudinal tumor size data, they can then be used to predict the patients' OS using a ML model.Our proposed use of leveraging longitudinal tumor size data to generate metrics and subsequently OS predictions, follows the well-established TGI-OS paradigm [5] [8] [10] which has been widely applied to support oncology drug development.On the other hand, prior TGI-OS approaches have all relied on the human modellers choosing appropriate parametric functions for both the tumor dynamics equations as well as the statistical models for survival.In this work, we endeavor to leverage machine intelligence to discover the underlying models while retaining a one-to-one correspondence with the traditional TGI-OS paradigm on a conceptual level.Compared to many other alternate approaches for modeling time series data, our proposed TDNODE methodology has the benefit of explainability brought about by the following: (1) the bottleneck of the encoder-decoder architecture enables accounting for patient-to-patient variability in a parsimonious manner; (2) the equivariance condition of the vector field enables interpretation of patient embeddings as kinetic parameters; (3) the dynamical system nature of the formulation enables a direct connection between salient aspects of tumor trajectories and their impact on the predicted OS.
The proposed methodology was applied to data from IMPower150, a phase 3 clinical trial of NSCLC patients.We showed on this dataset that the proposed TDNODE methodology overcame a key obstacle of the current TGI modeling approach, namely its ability to predict in an unbiased manner the future tumor sizes from early longitudinal data.These results can be explained by the formulation of TDNODE in minimizing a loss function and the proposed data augmentation scheme of feeding the early tumor dynamics data into the encoder while ensuring accurate extrapolations for the future dynamics.
In contrast, the approach of population modeling based on nonlinear mixed effects [30] aims to characterize model parameters at both the population and individual levels rather than aiming to minimize errors in model predictions over an unseen horizon.Another benefit of utilizing the deep learning approach was the discovery of tumor dynamics metrics, beyond the tumor growth and shrinkage rates (i.e., KG and KS respectively) that have been widely used in TGI literature [8] [10].We showed that the TDNODE metrics obtained were better predictive of patient OS than the existing ones, as demonstrated by the sizably higher c-indices achieved.
There remain several areas for further research.While the results shown here demonstrate significant promise in the setting of NSCLC patients, it remains to be applied to other solid and hematological cancer types.Of note, as TGI models of the same structure [3] are often applied across different tumor types, it may be advantageous to apply TDNODE to identify the best set of pan-tumor dynamical equations.Additionally, our current TDNODE model does not incorporate dosing or PK; however, such extensions are possible areas for further work.In order to further improve the TNODE predictive performance, hyperparameter optimization and the experimentation of alternate numerical solvers (e.g., [41]) are promising avenues.While this work sets the mathematical foundations for longitudinal tumor data, further expansions to incorporate multimodal, high dimensional data [31] of both static and longitudinal nature remain active areas for future development.

Data summary
Longitudinal tumor sum-of-longest-diameters (SLD) data were collected from the IMpower150 clinical trial in which 1,184 chemotherapy-naive patients with stage IV non-squamous non-small cell lung cancer (NSCLC) were enrolled [32].This Phase 3, randomized, open-label study evaluated the safety and efficacy of atezolizumab (an engineered antiprogrammed death-ligand PD-L1 antibody) in combination with carboplatin + paclitaxel with or without bevacizumab compared to carboplatin + paclitaxel + bevacizumab [32].All patients provided written consent prior to enrollment.Patients assigned to Arm 1 were administered atezolizumab + carboplatin + paclitaxel (n = 400).Patients assigned to Arm 2 were administered atezolizumab + carboplatin + paclitaxel + bevacizumab (n = 392).Patients assigned to Arm 3 were administered carboplatin + paclitaxel + bevacizumab (n = 392).In Arms 1 and 2, atezolizumab was administered as an IV infusion at a dose of 1,200mg Q3W until a loss of clinical benefit was observed.In all arms, carboplatin was administered at 6mg /mL min -1 Q3W for 4 cycles, 6 cycles, or until loss of clinical benefit, whichever came first.In Arms 2 and 3, bevacizumab was administered as an IV infusion at a dose of 15 mg kg -1 Q3W until disease progression, unacceptable toxicity, or death.In all arms, paclitaxel was administered as an IV infusion at a dose of 200mg m -2 Q3W for 4 cycles, 6 cycles, or until loss of clinical benefit, whichever came first.Full details of study design can be found in [26].

Definition of pre-treatment and post-treatment measurements
The dataset contains tumor SLD measurements collected across two distinct segments: 1) pre-treatment and 2) posttreatment.We define the pre-treatment measurements as SLD measurements collected prior to the initiation of active treatment.Conversely, we define post-treatment measurements as SLD measurements collected after the active treatment has been initiated.As the time for the start of treatment is taken to be 0,  (,+ is considered pre-treatment when  (,+ < 0, and post-treatment when  (,+ ≥ 0.

Definition of Observation Window
We also defined a set of observation windows for subjects in both training and test cohorts.Here, an observation window is defined as the quantity of post-treatment time in which a subject has been observed.Let  = { % ,  & , ⋯ ,  ' } represent the set of observation windows for all  subjects such that  ( represents the observation window for subject .This scalar corresponds to the amount of seen data that TDNODE uses to predict tumor dynamics and generate parameter encodings.Thus, any measurements outside the observation window are deemed as unseen by TDNODE and subsequently not used as input.In this study we let  ( = 32 weeks for all patients. From this definition, consider patient  and let () =  + ( (,+ ≤  ( ) represent indices for measurement such that  (,1(() and  (,1(() correspond to the last observed time of measurement and last observed SLD value; that is, the last observed SLD value to be used as input for TDNODE.Hence, for patient , all (integer) indices  between 1 and () represent observed measurements, and all indices greater than () represent measurements unseen by the model in either training or testing.

Normalization of SLD
We compute the mean and standard deviation of all SLD values within the training set.In this regard, let  represent the arithmetic mean of  (,+ computed over all ,  ∈ Ψ ,-.(' , and let  represent the standard deviation of  (,+ computed over all , .Then, we perform Z-score normalization on  (,+ using  and : where  Y (,+ is the Z-score normalized 'th measurement of subject .Here we let Ψ Z ,-.(' and Ψ Z ,/0, represent the normalized SLD values of all patients in the training and test sets, respectively.

Normalization of time
We introduce a patient-specific method to normalize the observation times of each subject in both the training set and test set.For each patient, we declare the last observed time of measurement as a scaling factor that is used on that patient's time series.We then use the last observed measurement time to normalize its respective list of observation times  ( for all , : ( ,+ \ ( , ()] =

Augmentation
We wish for TDNODE to be generalizable such that it can longitudinally model a representative set of tumor dynamic profiles commonly seen in clinical trials.Doing so requires that the training set be large and diverse, containing a wide variety of tumor dynamic profiles upon which TDNODE can generalize when modeling tumor dynamics on unseen data.The result of this subsampling method is an additional 4,671 tumor dynamic profiles to be used during training, which significantly boosts the size of the training set to a total of 5,560 subsampled patients.

Formulation of the patient-dependent initial condition
Here we describe the formulation of the representation of subject 's initial tumor state: where  ( (0) ∈ ℝ # represents subject 's -dimensional initial tumor state, and  ! is the initial condition encoder neural network parametrized by , that takes as input subject 's pre-treatment tumor size measurements ( Y (,%:; ! ) and observations times (( ,%:; ! ) to produce  ( (0).Here  ( is an index that corresponds to the last pre-treatment measurement for subject : where all indices less than or equal to  ( correspond to pre-treatment measurements, and all indices greater than  ( correspond to post-treatment measurements.If a subject has no pre-treatment measurements,  ( is set to 1, indicating that the first measurement represents the pre-treatment tumor size for subject .A description of the initial condition encoder's architecture can be found in the Model Architecture section and for further information please refer to the model code provided at the locations given in Supplementary Note 1.

Implementation of the patient-dependent parameter encoding and temporal rescaling
As discussed in the Results section, we leverage equation ( 4) to transform data using a subject dependent temporal rescaling into a unit time interval.This is implemented in the following manner: the kinetic parameter encoding for subject  is computed as, where  ( ∈ ℝ " represents subject 's -dimensional representation of tumor kinetic parameters and ℎ ! is the TDNODE parameter encoder neural network, parametrized by , that takes as input subject 's post-treatment observed tumor dynamics to produce  ( .We wish to reference only post-treatment tumor dynamics, starting from the last pre-treatment measurement.To do so, we use the value at index  ( to reference the last pre-treatment measurement for subject .
Additionally, since we assume the system evolves autonomously from time zero, we set ( ,; != 0. Note that in equation ( 9) the last observed time  (,1(() serves as the temporal scaling  of equation ( 4).A description of this network's model architecture used to generate  ( can be found in the Model Architecture section and for further information please refer to the model code provided at the locations given in Supplementary Note 1.

Neural-ODE solution
We solve the Neural ODE in equation ( 4) using the initial condition encoding  ( (0) and the parameter encoding  ( .We numerically integrated this ODE system using  !, beginning at  = 0 and ends at  = ( ,* !(the normalized last measurement time for subject ).If batched solving is enabled, the solving process continues until the normalized last measurement time for the subject in the batch with the highest normalized last measurement time.Subsequent to the generation of the -dimensional tumor state evolution, it is passed to a reducer which results in a series of scalar tumor size measurements.The ODE solver used was Dormand-Prince 5(4) (dopri5 as implemented within the torchdiffeq library [35]) and the continuous adjoint solution was used for backpropagation; please see the model code provided at the locations given in Supplementary Note 1 for further information.

Model architecture
TDNODE consists of four modules: an initial condition encoder that transforms pre-treatment tumor size data into a dimensional representation of the subject's initial tumor state, a parameter encoder that transforms post-treatment tumor size data into a -dimensional representation of the subject's observed tumor dynamics, a neural ODE decoder module that computes a continuous series of -dimensional tumor state representations, and a reducer module that produces the final series of continuous scalar predictions.Refer to Figure 1 for a visual representation of the computational graph that utilizes these modules.

Initial condition encoder
The initial condition encoder, denoted as  !takes as input a tensor representing the pre-treatment measurements and times of measurement and generates a batch of -dimensional representation of each subjects' pre-treatment tumor state  ( (0).Here the input is of shape  ×  ;,*.9 × 2, where  is the specified batch size, or number of subjects to process and  ;,*.9 is the number of pre-treatment measurements for the subject with the most pre-treatment measurements in the batch.The initial condition encoder consists of a multi-layer gated recurrent unit (GRU) recurrent neural network (RNN); the output of the RNN is processed by a single fully connected layer, producing a tensor with shape  × .

Parameter encoder
The parameter encoder, denoted as ℎ !, takes as input a tensor representation the truncated post-treatment measurements and times of measurement and generates a -dimensional representation of each subject's post-treatment tumor dynamics up to an arbitrarily defined observation window.The parameter encoder takes as input a tensor of shape and  × ( 0,*.9 − 1) × 4 produces a batch of parameter encodings  ( of shape  × ; here  0,*.9 denotes the number of seen measurements for the subject with the highest number of observed measurements in the batch.Finally, we apply equation ( 9) to obtain a batch of patient-normalized encodings  ( with the same shape.The parameter encoder's architecture utilizes multi-headed attention in as well as with residual fully connected layers.The multi-headed attention layer requires as input a key, value, and query.Here the key and value are generated by separate fully connected layers.The query is generated using a deep residual neural network and a Long Short-Term Memory (LSTM) network with 100 hidden units.The outputs of the query and attention layer are subsequently concatenated.Finally, a deep residual network consisting of fully connected layers is used to generate the patient-specific kinetic parameters.Please refer to the model code referred to in Supplementary Note 1 for further details.

Neural ODE vector field
Upon creation of the batch of parameter encodings  ( and initial conditions  ( (0), we use equation ( 1) to solve the neural ODE system and generate a solution (•): [0, ] → ℝ # with shape  ×  × , where  is the number of measurements to be obtained in the interval 0 to .Here  is the upper bound of time value in numerical integration, which is equivalent to the maximum of the last times of measurement for subjects in the batch.The numerical integration is carried out with a neural network decoder,  !, which takes as input the batch of initial conditions  ( (0) and parameter encodings  ( to produce the time derivatives used to compute the next -dimensional tumor state.This tumor state is then used as input with the parameter encoding to solve to compute the time derivative to compute the next state, and so forth.We used the continuous adjoint solving method in this work.Here we define  ! as a series of fully connected layers with residual connections.Each series of fully connected layers is interspersed with SELU activation functions [33].The resulting solution (•) is then converted back into a scalar space by a Reducer module, described below.For further details, please refer to the model code at the locations provided in the Supplementary Note 1.

Reducer
The generated batch of solutions to the neural ODE system (•) is represented as a batch of  set of -dimensional tumor states that required conversion to a series of scalar SLD predictions.Here we instantiate a simple neural network reducer that takes the -dimensional batch of solutions produced by the NODE decoder and converts it into a batch of  scalar SLD predictions that represent the predicted tumor sizes for each patient.We implemented the Reducer as a series of fully connected layers interspersed with SELU activation functions.

Instantiation
In this study, we set the initial condition encoder output dimension () to 4.Here we set the hidden dimension in the GRU to 10.As the input to the initial condition encoder is a tensor of time-observation pairs, we set its input dimension to 2.
For the parameter encoder, we set the output dimension () to 2. As the input is a tensor of partitioned time-observation pairs, we set the parameter encoder's input dimension to 4. We set the dimensionalities of all networks in the preprocessor encoder network to 4 and use a single-headed self-attention mechanism with output dimensionality set to 100.The LSTM module's output is also set to 100.These outputs are concatenated and used as input to the post-processor modules, which compresses this representation into the 2-dimensional encoding output.
The NODE decoder module takes as input a 6-dimensional tensor consisting of the 4-dimensional initial condition encoding and 2-dimensional parameter encoding.Each fully connected layer has a hidden dimension of 21 and produces a 6-dimensional representation of the tumor state.Since the last 2 dimensions of this output correspond to that of the parameter encoder, we set these values to 0 to signal that the parameter encoding remains constant throughout the solving process.The result is a series of 6-dimension solutions at every time point requested.
Finally, the reducer takes as input the set of 4-dimensional solutions at each step to produce a 1-dimensional series of SLD predictions for each patient using a series of fully connected layers.Note that, since the output of the NODE module is a set of 6-dimensional encodings at every time step, the reducer module uses only the first  = 4 values of the obtained solution.

Partitioning
Equations ( 4) and ( 6) use truncated tumor dynamic profiles to generate initial condition parameter encodings for each patient, respectively.The input to equation ( 4) is a tensor of pre-treatment measurements, where each row corresponds to a single observation and its corresponding time of measurement.Conversely, the input to equation ( 6) is a tensor of partitioned post-treatment observed measurements.Here each row corresponds to a pair of adjacent observations.For instance, for a patient with 4 time series measurements, of which 3 are deemed as seen, a partitioned tensor of shape 2 x 4 is created.In the first row, the first and second measurements are concatenated.In the second row, the second and third measurements are concatenated.Each row has length 4 as each pair of measurements consists of a time and SLD observation.We implement this partitioning mechanism to better enable learning of each patient's post-tumor state.

Batching
To enable higher throughput training, we propose a batching operation to simultaneously generate solutions for multiple patients at a time.Because each patient may have a different number of measurements, we produce masks that screen each patient solution for the predictions that correspond to observed data.For each batch, the union of times is collected as a single array.Labels are also concatenated together, with the position of each label corresponding to the appropriate index in the time tensor.A mask tensor with the same shape is generated as well, with 1s representing valid positions and 0s representing positions to exclude.Pre-treatment and post-treatment tensors of each patient in the batch are stacked.Leftpadding of variable length is applied, with the pad value equivalent to the first time-SLD observation pair.IDs and cutoff indices for each are also stacked and used in each iteration.

Loss Calculation
During model development, batched tensors of shape  ×  representing the continuous time series SLD predictions of  patients are produced during each iteration.This prediction tensor is utilized in conjunction with a label and mask tensor (each of the same shape) to calculate the RMSE for the iteration used to adjust TDNODE's weights via backpropagation.In this loss function, the mask, a one-hot tensor with 1's representing the locations in which to tabulate the loss, is used to parse the prediction tensor for predictions with times that correspond to actual observed SLD measurements.After this filtering is applied, the RMSE is calculated using the labels tensor.Backpropagation is then carried out to optimize the weights of all four TDNODE modules.

Hyperparameter configuration
We trained TDNODE for 150 epochs using ADAM optimization [34] , an L2 weight decay of 1e-3, a learning rate of 5e-5, an ODE tolerance of 1e-4, a batch size of 8, and an observation window of 32 weeks.Refer to Supplementary Table 1 for additional information on the hyperparameter configuration of TDNODE.

Libraries used
In this implementation, we used the torchdiffeq library to carry out the solving process [35].Other notable data science libraries include pandas, numpy, and scipy [36], [37].We used the Pytorch deep learning framework to develop and evaluate TDNODE [38].Survival analysis and SHAP analysis were performed using the lifelines and shap libraries, respectively [23], [39].We used matplotlib to generate the majority of plots in this study [37].Refer to the environment.ymlfile in the model code (see locations provided in Supplementary Note 1) for additional packages used.

Figure 2 :
Figure 2: TDNODE enables unbiased predictions of tumor dynamics with respect to both tumor size and observation time.(a) Prediction versus SLD data on the training set.(b) Prediction versus SLD data on the test set for t > 32 weeks.(c) Training set residuals between tumor dynamic predictions (PRED) and observed SLD data with respect to time.(d) Test set residuals between tumor dynamic predictions and observations with respect to time Bootstrapped LOWESS curves with 95% confidence intervals (CIs)were generated for c-d.

Table 4 .
Table 1that TDNODE can accurately extrapolate tumor dynamics on the test set for  >  !across treatment arms in the IMPower150 dataset.Here the bootstrapped median root-mean-squared error (RMSE) and R 2 score on the extrapolation component of the test set were shown to be 9.69 and 0.88, respectively.We note that these data were never seen by TDNODE during model development and inference.Training set RMSE and R 2 performance can be seen in Supplementary Here TDNODE achieved a bootstrapped median R 2 score of 0.95 on all measurements on both the training and test sets.
Let  = { % ,  & , ⋯ ,  ' } represent the set of SLD measurements from  subjects in the dataset, where  ( = ( (,% ,  (,& , ⋯ ,  (,* ! ) is a list of SLD measurements (typically measured in mm) sorted by time, whereby (,+ corresponds to the th SLD measurement for subject .Here  ( corresponds to the number of SLD measurements for subject .Similarly, let Γ = { % ,  & , ⋯ ,  ' } represent the corresponding observation times of  subjects in the dataset, where  ( = ( (,% ,  (,& , ⋯ ,  (,* ! ) is a sorted list of  ( observation times (typically expressed in weeks), such that  (,+ corresponds to the th observation time for subject .We exclude patients with less than or equal to one post-treatment observation since longitudinal modeling would not be applicable.Using these definitions and the IMPower150 dataset, we have a total number of 1,105 eligible subjects. To increase the diversity of tumor dynamic profiles in the training set, we introduce a new subsampling strategy that we apply to training set observations Γ ,-.(' and Ψ ,-.(' .Recall that () is an index that corresponds to the last observed measurement for subject .Hence, ( ,%:1(() and  Y (,%:1(() represent the normalized post-treatment tumor dynamics for patient , and components of these lists are used to generate subject 's parameter encoding.We took () to truncate each subject's observed tumor dynamics corresponding to the respective observation window,  ( .For each patient  in the training set, data augmentation is performed in the following manner: we consider the set of all integer intervals ranging from one to the number of measurements for the subject  ( , that is () a ≡ {[1, ]|2 ≤  ≤  ( } , and in each case derive the associated sets of normalized SLD and time values: { Y (,+ | ∈ () a } and {( ,+ ( ( , max ())| ∈ () a }.

Table 1 : TDNODE predictive performance of tumor dynamics using a 32-week observation window on the test set.
We evaluated the predictive performance of TDNODE on the unseen portion of the test set.For each patient, we let the observation window   = 32 weeks and only evaluate measurements collected at time values beyond   .Although TDNODE generates a continuous solution of predictions (⋅), the RMSE and R 2 scores are calculated using the discrete set of predictions only at observation times with SLD measurements.The predictive performance across all treatment arms is shown in bold.Variability was measured via median absolute deviation (MAD).