Abstract
Individual organizations, such as hospitals, pharmaceutical companies, and health insurance providers, are currently limited in their ability to collect data that are fully representative of a disease population. This can, in turn, negatively impact the generalization ability of statistical models and scientific insights. However, sharing data across different organizations is highly restricted by legal regulations. While federated data access concepts exist, they are technically and organizationally difficult to realize. An alternative approach would be to exchange synthetic patient data instead. In this work, we introduce the Multimodal Neural Ordinary Differential Equations (MultiNODEs), a hybrid, multimodal AI approach, which allows for generating highly realistic synthetic patient trajectories on a continuous time scale, hence enabling smooth interpolation and extrapolation of clinical studies. Our proposed method can integrate both static and longitudinal data, and implicitly handles missing values. We demonstrate the capabilities of MultiNODEs by applying them to real patient-level data from two independent clinical studies and simulated epidemiological data of an infectious disease.
Introduction
Patient-level data build the foundation for a plethora of healthcare research endeavors such as drug discovery, clinical trials, biomarker discovery, and precision medicine^{1}. Collecting such data is extremely time-consuming and cost-intensive, and additionally access-restricted by ethical and legal regulations in most countries. Individual organizations, such as hospitals, pharmaceutical companies, and health insurance providers are currently limited in their ability to collect data that are fully representative of a disease population. This issue is especially pronounced in clinical studies, where patients are usually recruited based on predefined inclusion and exclusion criteria that introduce cohort-specific statistical biases^{2}. These biases, in turn, can negatively impact the generalization ability of machine learning models, since the usual i.i.d. assumption is violated^{3}. A naive idea to counteract this issue might be to build up large data repositories pooling diverse clinical studies from several organizations. However, a major obstacle here is that sharing patient-level data across different organizations is exceedingly difficult due to legal restrictions, as formulated, for example, in the General Data Protection Regulation of the European Union.
The idea we propagate in this paper is to learn a continuous-time generative machine learning model from clinical study data. Given that the distribution of the real training data was appropriately learned by such a model, the generated synthetic datasets maintain the real data signals, such as variable interdependencies and time-dependent trajectories. Furthermore, these synthetic datasets can overcome crucial limitations of their real counterparts, like missing values or irregular assessment intervals, hence opening the opportunity to make at least subsets of variables from different studies statistically comparable. A further strong motivation for generating synthetic datasets is the aim to use the generated data as an anonymized version of their real-world counterpart and thereby mitigate the increased restrictions for sharing human data^{4,5,6}. However, synthetic patient-level datasets open opportunities that reach far beyond data sharing. For example, trained generative models could be used for synthesizing control arms for clinical trials based on data from previously conducted trials, or from real-world clinical routine data^{7}. This helps address major ethical concerns in disease areas, such as cancer, where it is impossible to leave patients untreated. Both the US Food and Drug Administration and the European Medicines Agency have recognized this issue and taken initiatives to allow for synthetic control arms^{7}.
Over the last years, generative models (mostly generative adversarial networks [GANs]) have found notable success, mostly in the medical imaging domain^{8,9,10,11,12,13}. However, GANs are prone to mode collapse, which raises concerns regarding the coverage of the real patient distribution by the synthetic data. Moreover, these methods are not necessarily suited to cope with the complex nature of clinical data collected in observational, longitudinal cohort studies, which is the main focus of our work: In addition to the previously mentioned issue of irregular measurement frequencies and values missing not at random (e.g., due to participant dropout), clinical studies often comprise several modalities combining time-dependent variables (e.g., measures of disease severity) and static information (e.g., biological sex). One approach specifically designed for the joint modeling and generation of multimodal, time-dependent, and static patient-level data containing missing values is the recently introduced Variational Autoencoder Modular Bayesian Networks (VAMBN)^{4}. However, VAMBN only operates on a discrete time scale, while relevant clinical indicators, such as disease progression expressed through cognitive decline or rising inflammatory markers, are intrinsically time continuous. Recently, Neural Ordinary Differential Equations (NODEs) have been introduced as a hybrid approach fusing neural networks and Ordinary Differential Equations (ODEs)^{14}. While NODEs are time continuous and thus enable smooth interpolation between observed data points and extrapolation beyond the observations in the data, they are not able to integrate static variables.
In this work, we present the Multimodal Neural Ordinary Differential Equations (MultiNODEs) as an extension of NODEs. MultiNODEs allow learning a generative model from multimodal longitudinal and static data that may contain values missing not at random. To demonstrate MultiNODEs’ generative capabilities, we applied the model to clinical, patient-level data from an observational Parkinson’s disease (PD) cohort study (the Parkinson’s Progression Markers Initiative [PPMI]^{15}) and, additionally, a longitudinal Alzheimer’s disease (AD) data collection (National Alzheimer’s Coordinating Center [NACC]^{16}). We compared the generated trajectories and correlation structure with their real counterparts. In this context, we additionally evaluated MultiNODEs’ performance against the previously published VAMBN approach. Furthermore, we assessed MultiNODEs’ interpolation and extrapolation performance. Finally, we investigated the influence of sample size, noisiness of the data, and longitudinal assessment density on the training of MultiNODEs in a systematic benchmark on data simulated from a mathematical model well-known in the field of epidemiology.
Results
Conceptual introduction of the MultiNODEs
MultiNODEs represent an extension of the original NODEs framework^{14} that overcomes the limitations of its predecessor such that an application to incomplete datasets consisting of both static and time-dependent variables becomes feasible. Conceptually, MultiNODEs build on three key components (Fig. 1): (1) latent NODEs, (2) a variational autoencoder (more specifically a Heterogeneous Incomplete Variational Autoencoder [HIVAE], designed to handle multimodal data with missing values^{17}), and (3) an implicit imputation layer^{18}. The latent NODEs enable the learning and subsequent generation of continuous longitudinal variable trajectories. The longitudinal properties of the initial condition (i.e., the starting point for the ODE system solver of the latent NODEs) are defined by the output of a recurrent variational encoder that embeds the longitudinal input data into a latent space (Fig. 1, orange box). To allow for an additional influence of static variables on the estimation of the longitudinal variable trajectories, the second component, a HIVAE, is introduced (Fig. 1, blue box). This component transforms the static information into a distinct latent space, and the resulting embedding is used to augment the latent starting condition of the NODEs by concatenating the static variable embedding and the latent representation of the longitudinal variables (Fig. 1, “augmentation”). The HIVAE component itself holds generative properties and conducts the synthesis of the static variables when MultiNODEs are applied in a generative setting. In summary, MultiNODEs integrate static variables (e.g., biological sex or genotype information) both to inform the learning of longitudinal trajectories, and in the generative process. Finally, to mitigate the original NODEs’ incapability of dealing with missing values, we introduced the imputation layer, which implicitly replaces missing values during model training with learned estimates (Fig. 1, green box).
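The mechanism of the imputation layer can be sketched in a simplified NumPy setting. In MultiNODEs, the per-feature estimates are trainable parameters updated by backpropagation; here they are fixed values, and the helper name `impute` and all numbers are purely illustrative:

```python
import numpy as np

def impute(x, estimates):
    """Replace missing entries (NaN) with per-feature estimates.

    In MultiNODEs the estimates are trainable parameters of the
    imputation layer; here they are fixed values for illustration.
    """
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)                       # True where a value is missing
    return np.where(mask, estimates, x), mask

x = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
estimates = np.array([0.5, 2.0])             # hypothetical learned estimates
filled, mask = impute(x, estimates)          # filled: [[1.0, 2.0], [0.5, 4.0]]
```

Because the replacement is differentiable with respect to the estimates, gradients can flow into them during training, which is what makes the imputation "implicit" rather than a fixed preprocessing step.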
For further details on the model architecture, training, and hyperparameter optimization, we refer to the Method section and Supplementary material, respectively.
Synthetic data generation using MultiNODEs
Generating synthetic data using MultiNODEs starts by randomly sampling a latent representation for both the static and longitudinal variables. The longitudinal variables in data space are then generated by first constructing the initial conditions of the latent ODE system (i.e., concatenating the static latent representation to the longitudinal one), followed by solving the ODE system given these initial conditions, and finally by decoding the result into data space. The static variables are generated by directly transforming their sampled latent representation into data space using the HIVAE decoder.
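The steps above can be sketched as a minimal NumPy pipeline. All components are toy stand-ins for the trained model: random matrices replace the HIVAE and longitudinal decoders, and a linear latent ODE solved with a forward-Euler scheme replaces the learned latent NODEs; dimensions and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for trained components (all weights are hypothetical).
d_stat, d_long, d_out = 2, 3, 4
W_dec_static = rng.normal(size=(d_stat, 5))             # static decoder
W_dec_long = rng.normal(size=(d_stat + d_long, d_out))  # longitudinal decoder
A = -0.1 * np.eye(d_stat + d_long)                      # latent linear dynamics

def generate(n_steps=10, dt=0.5):
    z_static = rng.normal(size=d_stat)        # 1. sample static latent code
    z_long = rng.normal(size=d_long)          # 1. sample longitudinal latent code
    z0 = np.concatenate([z_static, z_long])   # 2. augmented initial condition
    # 3. solve the latent ODE dz/dt = A z with forward Euler
    traj = [z0]
    for _ in range(n_steps - 1):
        traj.append(traj[-1] + dt * (A @ traj[-1]))
    traj = np.stack(traj)
    x_long = traj @ W_dec_long                # 4. decode trajectory to data space
    x_static = z_static @ W_dec_static        # 5. decode static variables
    return x_static, x_long

x_static, x_long = generate()                 # one synthetic "patient"
```

In the actual model, the decoders are neural networks and the latent dynamics are themselves parameterized by a network, but the control flow of the generation procedure is the same.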
MultiNODEs support two different approaches for the initial sampling of the latent representations, namely sampling from the prior distribution employed during model training and sampling from the learned posterior distribution of the input data.
During the posterior sampling procedure, the reparameterization trick^{19} is applied to draw a latent representation from the posterior distribution learned from the training data. The amount of noise added in this process can be tuned, where greater noise leads to a wider spread of the generated marginal distributions of the synthetic data. Alternatively, the latent representations can be sampled from the prior distributions imposed on the latent space during variational model training. We ensure statistical dependence between static and longitudinal variables by drawing their values from a Bayesian network that connects both latent representations such that the longitudinal variables are conditionally dependent on the static variables. More detailed descriptions of both generation procedures are provided in the Method section.
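The reparameterization trick with a tunable noise level can be sketched as follows; the parameter name `noise_scale` is an illustrative assumption for the tuning knob described above:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_posterior(mu, log_var, noise_scale=1.0):
    """Reparameterization trick: z = mu + noise_scale * sigma * eps.

    A larger noise_scale spreads samples further around the learned
    posterior mean, widening the marginals of the generated data.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(np.shape(mu))
    return mu + noise_scale * sigma * eps

mu = np.zeros(3)        # posterior mean of one training subject (toy values)
log_var = np.zeros(3)   # log-variance (sigma = 1)
z = sample_posterior(mu, log_var, noise_scale=0.5)
```

With `noise_scale=0` the sample collapses onto the posterior mean, i.e., the synthetic subject becomes a deterministic reconstruction of the corresponding training subject.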
Application cases: Parkinson’s disease and Alzheimer’s disease
We applied MultiNODEs to longitudinal, multimodal data from two independent clinical datasets with the goal of generating realistic synthetic datasets that maintain the real data properties. Details about the data preprocessing steps are described in the Supplementary material.
The first dataset was the PPMI, an observational clinical study containing 354 de novo PD patients who participated in a range of clinical, neurological, and demographic assessments which form the variables of the dataset. In total, a set of 25 longitudinal and 43 static variables was investigated.
Furthermore, as a second example, we applied MultiNODEs to longitudinal, multimodal data from the NACC. NACC is a database storing patient-level AD data collected across multiple memory clinics. After preprocessing, the dataset used in this study contained 2284 patients, and a set of three longitudinal and four static variables was investigated.
In the following sections, we will focus on the results achieved on the PPMI data and refer to the equivalent experiments based on the NACC data that are presented in the Supplementary material.
MultiNODEs generate realistic synthetic patientlevel datasets
We applied prior as well as posterior sampling for comparison purposes. With each method, we generated the same number of synthetic patients as encountered in the real dataset to allow for a fair comparison. To assess whether the generated data followed the real data characteristics, we conducted thorough comparisons of the marginal distributions using qualitative and visual assessments and, further, quantitatively compared the Jensen–Shannon divergence (JS divergence) between the generated and real distributions. The JS divergence is bounded between 0 and 1, with 0 indicating equal distributions. In addition, we investigated the underlying correlation structure of the measured variables. Finally, we trained a machine learning classifier (Random Forest) that evaluated whether real and synthetic patients showed similar clinical characteristics when compared to real healthy control individuals from their respective studies. Across all these aspects, we evaluated MultiNODEs’ performance in comparison to the previously published VAMBN approach^{4}.
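The JS divergence used for this comparison can be computed as follows (base-2 logarithm, so the value is bounded in [0, 1]); `js_divergence` is an illustrative helper operating on discretized (histogram) distributions:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithm (bounded in [0, 1])."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)                     # mixture distribution

    def kl(a, b):
        nz = a > 0                        # 0 * log(0) is treated as 0
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

identical = js_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0: equal distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # 1.0: no overlap at all
```

For continuous variables, real and synthetic values would first be binned on a shared grid before applying the helper.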
The synthetic data generated using MultiNODEs generally exhibited marginal distributions that bore high similarity to their corresponding real counterparts (Fig. 2, Supplementary Table 1, and Supplementary Fig. 1; equivalent figures for the NACC data are presented in Supplementary Fig. 4). The average JS divergences between the real and synthetic distributions, calculated across all variables and timepoints, amounted to 0.018 ± 0.015 and 0.011 ± 0.009 for the PPMI data generated from the prior and posterior, respectively. For NACC, the average JS divergence was 0.071 ± 0.055 and 0.029 ± 0.031 for prior and posterior sampling, respectively. With respect to PPMI, data generation from the posterior distribution resulted in synthetic data that resembled the real data significantly more closely than those generated from the prior distribution (Mann–Whitney U test, p < 0.02).
Compared to VAMBN, the prior sampling method seemed to be inferior with respect to the average JS divergence when using NACC (U test, p = 0.038). However, no statistically significant difference in the performance of VAMBN compared to MultiNODEs’ posterior sampling could be observed (U test, p = 0.80). For PPMI, no significant differences were found between VAMBN and either of MultiNODEs’ generation approaches (U test, p = 0.31 for the prior approach; U test, p = 0.24 for the posterior).
In order to evaluate whether MultiNODEs learned not only to reproduce univariate distributions but actually captured their interdependencies accurately, we compared the correlation structure of the generated data to that of the real variables. Visualizations of the Spearman rank correlation coefficients showed that both the prior and posterior sampling generated synthetic data which successfully reproduced the real variables’ interdependencies (Fig. 3). Comparing the results against VAMBN-generated data revealed that both generation procedures of MultiNODEs were substantially better at reproducing the real data characteristics: the Frobenius norm of the real data correlation matrix was 45.3; with a Frobenius norm of 25.66, the VAMBN-generated data lay considerably further from this value than the MultiNODEs approaches with 62.63 and 56.47 for the prior and posterior sampling, respectively. This shows that MultiNODEs slightly overestimated the present correlations, while VAMBN underestimated them. Concordantly, the relative error (i.e., the deviation of the respective synthetic dataset’s correlation matrix from the real one, normalized by the norm of the real correlation matrix) was 0.81, 0.62, and 0.46 for VAMBN and MultiNODEs’ prior and posterior sampling, respectively, leaving MultiNODEs with a substantially lower error than the VAMBN approach.
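The reported relative error can be reproduced with a small helper; the two-variable correlation matrices below are toy examples, not the study data:

```python
import numpy as np

def relative_error(real_corr, synth_corr):
    """Frobenius norm of the difference between a synthetic and the real
    correlation matrix, normalized by the norm of the real matrix."""
    diff = np.linalg.norm(real_corr - synth_corr, ord="fro")
    return diff / np.linalg.norm(real_corr, ord="fro")

real = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
synth = np.array([[1.0, 0.6],
                  [0.6, 1.0]])   # correlation underestimated by the generator
err = relative_error(real, synth)
```

A value of 0 indicates identical correlation structure; larger values indicate either over- or underestimated dependencies, which is why the sign of the deviation is read off the raw Frobenius norms in the text above.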
Assessment of the utility of generated synthetic patients for machine learning
To evaluate whether the generated synthetic patients could be reliably used in a machine learning context, we built a Random Forest classifier that aimed to distinguish between healthy individuals and diseased patients. The classifier was trained within a five-fold cross-validation scheme, once using real and once using synthetic diseased patients. In addition, we trained a classifier on each respective synthetic dataset (comprising synthetically generated diseased and healthy subjects) and evaluated their performance on the real data (Table 1). As predictors, we used clinical symptoms and genetic markers that are characteristic of the disease in question. For PD (PPMI), these were the UPDRS scores that describe a series of motor and non-motor symptoms commonly encountered in PD patients; for AD (NACC), we predominantly used cognitive assessments and a genetic risk factor. Technical details about the classifiers can be found in the Supplementary material. Distinguishing real patients from healthy control subjects was possible with a 10-times-repeated five-fold cross-validated performance of 0.97 ± 0.02 area under the receiver operating characteristic curve (AUC) and 0.90 ± 0.01 AUC for PPMI and NACC, respectively. On PPMI, all evaluated generative methods achieved almost equal performance, indicating that the clinical characteristics of synthetic patients followed the same patterns as those of real patients. In addition, the most relevant features were the same across the real and all synthetic-data-trained classifiers (Supplementary Fig. 13).
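The evaluation scheme (10-times-repeated five-fold cross-validation of a Random Forest, scored by AUC) can be sketched with scikit-learn on toy stand-in data; the number of predictors, the class shift, and all hyperparameters are illustrative assumptions, not the study's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Toy stand-in data: 100 "controls" vs. 100 "patients" with 5 predictors;
# the patient class is shifted so the two groups are separable.
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(1.5, 1.0, (100, 5))])
y = np.array([0] * 100 + [1] * 100)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
mean_auc, sd_auc = scores.mean(), scores.std()
```

In the study, the same scheme would be run once with real and once with synthetic diseased patients, and the resulting mean ± sd AUC values compared; feature importances of the fitted forests can be inspected via `clf.feature_importances_` after fitting.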
For NACC, some deviations were found between a classifier’s cross-validated performance on real data and the synthetic-data-based performances. Here, MultiNODEs’ posterior sampling and VAMBN showed similar deviations in opposite directions, with the posterior slightly overperforming and VAMBN slightly underperforming. The performance on the data generated via MultiNODEs’ prior sampling method deviated the most (Table 1). When trained on synthetic data and evaluated on real data, all classifiers underperformed compared to classifiers trained on real data. The feature importances of the predictors were highly similar between the real-data-trained and the respective synthetic-data-trained classifiers.
Generating data in continuous time through smooth interpolation and extrapolation
One particular strength of MultiNODEs, which sets them apart from alternative approaches such as VAMBN, is their ability to model variable trajectories in continuous time. The latent ODE system allows for the estimation of variable trajectories at any arbitrary timepoint and thereby opens possibilities for (1) the generation of smooth trajectories, (2) overcoming panel-data limitations through interpolation, and finally, (3) extrapolation beyond the time span covered by the training data. Again, we evaluated these capabilities based on the PPMI and NACC datasets (for brevity, NACC results are presented in the Supplementary material). For the following, we only focused on MultiNODEs’ posterior sampling approach to generate synthetic subjects.
Comparing the median trajectories of variables from the real data to those generated using MultiNODEs revealed that MultiNODEs accurately learned and reproduced the longitudinal dynamics exhibited in the real data (Fig. 4). Generation from both the prior and posterior distribution led to synthesized median trajectories that closely resembled the real median trajectories. Likewise, the 97.5% and 2.5% quantiles of the synthetic data closely approximated the corresponding real quantiles, indicating a realistic distribution of the synthetic data across the observed timepoints. This observation held true for most of the time-dependent variables (plots for all variables are linked in the Supplementary material).
We further assessed the interpolation and extrapolation capabilities of MultiNODEs. For interpolation, one timepoint was excluded from model training and, subsequently, data were generated for all timepoints including the left-out one. Contrasting the interpolated/imputed values against the corresponding real values showed that MultiNODEs accurately reproduced the longitudinal dynamics of a variable, even for unobserved timepoints (Fig. 5a, c). In this context, we further compared the interpolated values against synthetic data that were generated based on the complete real data trajectory. We observed that the mean JS divergence between the interpolated data and the real data, calculated across all variables, was slightly higher (0.025 ± 0.011) than that between the real data and the synthetic data generated after training MultiNODEs on the complete trajectory (0.016 ± 0.011). Similarly, the relative error between the correlation matrices of the interpolated and the real data was again only marginally higher than that between the complete synthetic data and the real data (0.48 and 0.46, respectively; Supplementary Fig. 4).
In order to test MultiNODEs’ extrapolation capabilities, only the first 24 months of assessment follow-up and the static variables were used during model training. The trained model was then applied to generate data for the remaining, left-out timepoints of the longitudinal variables. In this process, 77 values were extrapolated in total, as not every variable had the same number of follow-up assessments after month 24. Comparing the extrapolated synthetic data to the left-out real data demonstrated reliable extrapolation beyond the training data (Fig. 5b, d). As in the interpolation setting, we also compared the average JS divergence between the extrapolated data and the real data with that between the real data and synthetic data generated after training MultiNODEs on the complete trajectory. As expected, the difference between the JS divergences was larger than in the interpolation setting, with 0.037 ± 0.024 for the extrapolated data and 0.016 ± 0.009 for the synthetic data based on the complete trajectory. The correlation structure of the extrapolated data resulted in a relative error of 0.64, compared to 0.46 when using the complete trajectory for training MultiNODEs (Supplementary Fig. 4).
In addition, the marginal distributions at both the interpolated and extrapolated timepoints also followed those of the real data (Fig. 5c, d).
Systematic model benchmarking on simulated data
To explore the learning properties of MultiNODEs more systematically, we investigated how varying training conditions with respect to measurement frequency, sample size, and noisiness of the data influence MultiNODEs’ generative performance.
The benchmarking data were simulated via the well-established Susceptible-Infected-Removed (SIR) model, which is often used to describe the spread of infectious diseases and follows a highly nonlinear structure: Let S(t) be the number of susceptible individuals at a timepoint t, I(t) be the number of infectious individuals at a timepoint t, and R(t) be the number of removed or recovered individuals at a timepoint t. With β as the transmission rate, γ as the mean recovery/death rate, and N = S(t) + I(t) + R(t) as the fixed population size, the SIR model can be defined by the ODE system presented in Eq. (1):

$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I \qquad (1)$$
Details about the SIR parameter settings are described in the Supplementary material.
As baseline settings for each investigation, we simulated 1000 data points with 10 equidistant assessment timepoints each, distributed over a span of 40 time intervals, and added 5% Gaussian noise to each measurement. That means we added a normally distributed variable with the standard deviation set to 5% of the theoretical range of each of the variables S(t), I(t), and R(t). During the benchmarking, we individually varied the sample size, the number of timepoints, and the noise level. For the timepoint investigation, we compared MultiNODEs trained on 5, 10, and 100 equidistant assessments; for the sample size, we considered 100, 1000, and 5000 samples; and for the noise level, we tested 50%, 75%, and 100% of the maximum encountered value added as noise.
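A baseline simulation in this spirit can be sketched as follows; the Euler step size and the SIR parameter values (`beta`, `gamma`, `N`, `i0`) are illustrative assumptions, not the settings from the Supplementary material:

```python
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, N=1000, i0=1, t_max=40, n_points=10,
                 noise_frac=0.05, rng=None):
    """Simulate one noisy SIR trajectory at equidistant timepoints.

    Integrates the SIR ODEs on a fine forward-Euler grid, samples
    n_points equidistant observations, and adds Gaussian noise with a
    standard deviation of noise_frac times each variable's theoretical
    range (here N, the population size).
    """
    rng = rng or np.random.default_rng(0)
    dt = 0.01
    steps = int(t_max / dt)
    S, I, R = float(N - i0), float(i0), 0.0
    traj = np.empty((steps + 1, 3))
    traj[0] = S, I, R
    for k in range(steps):
        dS = -beta * S * I / N
        dI = beta * S * I / N - gamma * I
        dR = gamma * I
        S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
        traj[k + 1] = S, I, R
    idx = np.linspace(0, steps, n_points).round().astype(int)
    obs = traj[idx] + rng.normal(0.0, noise_frac * N, (n_points, 3))
    return traj[idx], obs

clean, noisy = simulate_sir()   # one clean and one noisy set of observations
```

Drawing many noisy trajectories with varying `n_points`, sample counts, and `noise_frac` reproduces the kind of benchmarking grid described above.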
Varying the number of equidistant, longitudinal timepoints exposed a strong dependency of MultiNODEs on the longitudinal coverage of the time-dependent process (Fig. 6a). While the general trends in the data were appropriately learned for all explored assessment frequencies, the position of the observations in time influenced how closely the learned function approximated the true data-underlying process. Especially the peak of the “Infected” function represented a challenge for MultiNODEs if no data point was located close to it (Fig. 6a, “Infected”). Similarly, the start of the decline in the “Susceptible” function and of the incline in the “Removed” function were shifted, depending on the positioning of measurements. In conclusion, and as expected, a higher observation frequency of the data-underlying time-dependent process substantially increased the fit of MultiNODEs to the process, although general trends could already be approximated at lower assessment frequencies.
Investigating the effect of the sample size on training MultiNODEs, we observed that increasing the sample size led to an expected improvement of the model fit to the SIR dynamics (Fig. 6b). While the general trends could again be learned from limited data (n = 100), sample sizes of 1000 or 5000 substantially reduced the model’s deviation from the true SIR model. With 1000 samples, the learned dynamics were less stable than when trained on 5000 samples, where a smooth dynamic was learned that closely resembled the true underlying process. In conclusion, MultiNODEs can already learn longitudinal dynamics based on only a few data points; however, they tend to underfit under these circumstances and benefit from larger sample sizes.
Adding an increasing noise level to the SIR training data revealed that MultiNODEs remained very robust (Fig. 6c). Only when 100% of the maximal encountered value was introduced as additional noise could a clear deviation from the underlying true model be observed.
Discussion
In this work, we presented MultiNODEs, a hybrid AI approach to generate synthetic patient-level datasets. MultiNODEs are specifically designed to consider the characteristics of clinical studies, extend their predecessor, the Neural ODEs, and enable the application of the latent ODE system to multimodal datasets comprising both time-dependent and static variables with values missing not at random. MultiNODEs learn a latent, time-continuous trajectory from observed data. This concept fits well with processes like disease progression, where relevant observations (e.g., biomarkers and disease symptoms) only indirectly reflect the true, underlying disease mechanism. Consequently, MultiNODEs are well suited for an application to heterogeneous datasets holding complex signals as encountered, for example, in biomedical research.
Our evaluations showed that MultiNODEs successfully generated complex, synthetic medical datasets that accurately reproduced the characteristics of their real-world counterparts. In a direct comparison, MultiNODEs outperformed the state-of-the-art VAMBN approach, most notably with respect to the integrity of the correlation structure. This finding implies that the single data instances generated using MultiNODEs exhibit more realistic properties and that the real data characteristics are not only reproduced at the population level. Of MultiNODEs’ two generative methods, posterior sampling expectedly led to more realistic synthetic patients; however, generating from the prior distribution comes with the benefit that the model itself can be shared and used for data generation without needing any real data points in the process.
Machine learning classifiers that discriminated between real healthy controls and diseased subjects showed almost equal performance when trained on data from synthetic and real diseased subjects, respectively. We only observed small deviations from the performance on real data for the NACC dataset, where classifiers trained and tested on synthetic patients and real healthy controls within a cross-validation setting showed a slightly increased performance compared to those trained on real data. Interestingly, at the same time, we found a lower prediction performance compared to real data when we trained on synthetic subjects and evaluated on the real data. A possible explanation is that noise is introduced during the generation of synthetic data points (e.g., through overestimated correlations between variables). As a consequence, synthetically generated diseased patients can be discriminated from real healthy controls more easily than real diseased patients can. At the same time, a classifier trained on synthetic data (synthetic patients as well as healthy controls) shows a slightly lower prediction performance on real data than a classifier trained entirely on real data. Altogether, our results demonstrate that synthetically generated subjects share the patterns of real patients, but are not completely identical to them.
Besides the reproduction of marginal distributions and the synthesis of realistic data instances, MultiNODEs’ most prominent strength lies in the generation of smooth longitudinal data. The latent ODE system allows MultiNODEs to learn dynamics that are continuous in time and cover the unobserved time intervals of real-world data. Here, both the prior and posterior sampling approaches resulted in synthetic trajectories that follow the real variables’ dynamics.
Furthermore, the time-continuous generative capabilities of MultiNODEs create opportunities to fill gaps in the real data through interpolation and to go beyond the observation time by extrapolating the longitudinal dynamics. Hence, MultiNODEs could be used to support the design of longitudinal clinical studies, in which the maximum observation period, as well as the visit frequency, is always a crucial decision to make. Here, the question of how patients might develop between two visits or after the last one determines the optimal follow-up time needed to demonstrate, for example, the most significant treatment effect. Furthermore, synthetic disease trajectories generated based on data from one clinical study can be compared to those generated based on other studies, even if the visit intervals employed in the real studies were not identical.
Our benchmark experiments on the simulated SIR model data demonstrated that MultiNODEs are applicable under a variety of different data settings. While the general trends of a data-underlying process could already be learned from a relatively limited dataset, similar to any machine learning task, the accuracy and trustworthiness of the model critically depend on the available data. Especially for complex, nonlinear processes, a sufficiently high observation frequency should be considered. Here, the position of the observation timepoints relative to the true underlying process is crucial for MultiNODEs to accurately learn nonlinear dynamics. The sample size of the training data mainly impacts how well MultiNODEs fit the data dynamics, and we observed that lower sample sizes can lead to underfitting and rather rigid ODE systems. On the other hand, only severe noise levels led to a model deviation from the true data-underlying process; thus, with respect to noise, MultiNODEs proved to be highly robust. In conclusion, MultiNODEs’ requirements toward the training data ultimately depend on the complexity of the data-underlying process: learning more complex processes requires more frequent observations and larger sample sizes, while more linear systems can already be learned from rather limited datasets.
One limitation of MultiNODEs in their current form is that categorical variables can only be handled as static, not as longitudinal, variables. This is because the variational encoder for longitudinal data maps trajectories to a latent Gaussian distribution. Sampling from this distribution (even if conditioned on the distribution of the static data) and decoding will result in real-valued features rather than categorical ones. In future work, we will thus explore whether a recurrent version of the HIVAE encoder can be used instead of the recurrent variational long short-term memory (LSTM) encoder.
In addition, MultiNODEs are sensitive to several hyperparameters that need to be tuned to achieve optimal performance. The training process and all relevant hyperparameters are explained in the Methods section.
Synthetic data generated using models trained on sensitive personal information can bear a risk of information disclosure (e.g., attribute disclosure or dataset membership disclosure) if an attacker has information about properties of real patients that are similar to a synthetic subject. Therefore, before synthetic data are distributed, it must be assured that the probability of private information disclosure remains within task-appropriate boundaries^{20}. Disclosure risk often stands in a direct trade-off with data utility, and a sensible compromise should be struck between the two according to the application in question. Several approaches described in the literature can reduce the risk of information disclosure^{21}, one of which is based on the concept of differential privacy^{4}. MultiNODEs themselves provide a way to tune the deviation from the real data when sampling from the posterior distribution by changing the amount of noise injected into the latent space. We would like to note that a rigorous quantification of the re-identification risk is a non-trivial and challenging task in its own right, requiring several assumptions, and is thus beyond the scope of this paper.
Methods
Application case datasets
Both datasets, namely PPMI and NACC, are well-known staples in their respective fields and can be accessed after a successful data access application. For PPMI see https://www.ppmi-info.org/; for NACC we refer to https://naccdata.org/. More details on the investigated variables are presented in the Supplementary material.
Both studies obtained informed consent from their participants for data collection and sharing and followed the Declaration of Helsinki to ensure ethical data collection. Both studies received ethical approval from their respective review boards. We followed their regulations and thus did not seek further ethical approval, as we did not work with human participants ourselves.
Neural ODEs (NODEs)
NODEs are a hybrid of neural networks and ODEs^{14}. They can be seen as an extension of a ResNet^{22} that does not rely on a discrete sequence of hidden layers, but on a continuous hidden dynamical system defined by an ODE.
For \(0 \le t \le T\) and \(z_0 \in R^D\), the dynamics of the hidden layer of a NODE are given as Eq. (2):
\(\frac{dz(t)}{dt} = f\left( z(t), t; \theta \right), \quad z(0) = z_0 \qquad (2)\)
where z(0) may be interpreted as the first hidden layer and z(T) as the solution of the initial value problem at timepoint T. Importantly, f is a feedforward neural network parameterized by θ.
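The dynamics in Eq. (2) can be sketched in plain NumPy. The two-layer network, its random weights, and the explicit Euler integrator below are illustrative assumptions of ours; Chen et al.^{14} use adaptive ODE solvers with adjoint-based gradients rather than this simplified scheme.

```python
import numpy as np

def mlp_dynamics(z, t, W1, b1, W2, b2):
    """A small feedforward network f(z, t; theta) defining dz/dt.

    The current time t is appended to the state so the dynamics can
    depend on it, as in the general NODE formulation of Eq. (2)."""
    h = np.tanh(np.concatenate([z, [t]]) @ W1 + b1)
    return h @ W2 + b2

def odeint_euler(f, z0, t_grid, *params):
    """Integrate dz/dt = f(z, t) with the explicit Euler method.

    Euler is used here only to keep the sketch short and
    dependency-free; it is not the solver used in the paper."""
    z = np.array(z0, dtype=float)
    trajectory = [z.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        z = z + (t1 - t0) * f(z, t0, *params)
        trajectory.append(z.copy())
    return np.stack(trajectory)

rng = np.random.default_rng(0)
D = 3  # latent dimension (illustrative)
W1 = rng.normal(scale=0.1, size=(D + 1, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, D));     b2 = np.zeros(D)

t_grid = np.linspace(0.0, 1.0, 11)
z_path = odeint_euler(mlp_dynamics, np.ones(D), t_grid, W1, b1, W2, b2)
print(z_path.shape)  # (11, 3): one latent state z(t) per timepoint
```

Here z_path[0] plays the role of z(0) and z_path[-1] that of z(T) in Eq. (2).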
NODEs as generative latent time series models
As demonstrated by Chen et al.^{14}, NODEs can be trained as a continuous-time Variational Autoencoder. The basic idea is to learn the initial condition z_{0} of the dynamical system in Eq. (2) from observed time series data using a variational LSTM recurrent encoder^{23}. Hence, Eq. (2) now describes the dynamics of a latent system, resulting in a classical state-observation model. Accordingly, a feedforward neural network decoder is required to project the solution of Eq. (2) back to the observed data at defined timepoints (Supplementary Fig. 10).
Overall, NODEs are trained end-to-end by maximizing the evidence lower bound (ELBO): let the training data be \(D = \{ ( x_{t_i}^n, t_i ) \mid n = 1, \ldots, N,\ i = 1, \ldots, M \}\), where N is the number of patients and \(t_1, \ldots, t_M\) are the observed timepoints / patient visits. That means \(x_{t_i}^n \in R^p\) is the p-dimensional vector of measurements taken for the nth patient at visit t_{i}. The ELBO for NODEs is then given as Eq. (3).
where \(p\left( {z_{t_0}^n} \right) = N\left( {0,I} \right)\), as usual. For details, we refer to Chen et al.^{14}.
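With this Gaussian prior, the KL regularization term of the ELBO has a closed form for a diagonal Gaussian posterior \(q(z_{t_0}^n \mid \cdot) = N(\mu, \mathrm{diag}(\sigma^2))\) against \(p(z_{t_0}^n) = N(0, I)\). A minimal sketch (the function name and the log-variance parameterization are our own conventions, not taken from the paper):

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    This is the regularization part of an ELBO as in Eq. (3); the
    other part is the decoder's reconstruction log-likelihood."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# For mu = 0 and unit variance (log_var = 0) the KL vanishes.
print(kl_diag_gaussian_to_standard(np.zeros(4), np.zeros(4)))  # 0.0
```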
Multimodal Neural ODEs (MultiNODEs)
Handling missing values
To handle missing values (potentially not at random) in longitudinal clinical data, we build on our previously published work, in which we introduced an imputation layer to implicitly estimate missing values during neural network training^{18}: let \(A := \{ x_{t_i,j}^n \mid x_{t_i,j}^n\ {\rm{is\ not\ missing}} \}\) and let \(1_A\) be the indicator function on the set A with cardinality \(|A|\). The imputation layer can be defined as a data transformation \(\tilde x_{t_i,j}^n = x_{t_i,j}^n \times 1_A( x_{t_i,j}^n ) + b_{t_i,j} \times ( 1 - 1_A( x_{t_i,j}^n ) )\), where the parameters \(b_{t_i,j}\) are trainable weights. That means missing values in a patient's data vector are replaced by \(b_{t_i,j}\). The accordingly completed data are subsequently mapped through a recurrent neural network encoder to a static, lower-dimensional vector, which is interpreted as the initial condition of the latent ODE system (Supplementary Fig. 11).
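The imputation-layer transformation itself is a one-liner. The sketch below uses NaN as the missingness indicator and a fixed vector b for illustration; in MultiNODEs the entries of b are weights trained jointly with the rest of the network.

```python
import numpy as np

def imputation_layer(x, b):
    """Apply x_tilde = x * 1_A(x) + b * (1 - 1_A(x)).

    Observed values pass through unchanged; missing entries (NaN)
    are filled with the corresponding entry of b."""
    mask = ~np.isnan(x)            # 1_A: indicator of observed entries
    return np.where(mask, x, b), mask

x = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
b = np.array([9.0, 9.0])           # trained weights in the real model
x_tilde, mask = imputation_layer(x, b)
print(x_tilde)                     # observed values kept, NaNs replaced by 9.0
```

The returned mask is what restricts the reconstruction loss in Eq. (4) to observed entries only.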
To learn parameters \(b_{t_i,j}\) the NODEs’ loss function needs to be adapted. More specifically, we use the modified ELBO criterion presented in Eq. (4).
where \(\hat x_{t_i,j}^n\) denotes the reconstructed data. Note that we only aim at reconstructing the observed data, but not the imputed values. Due to the layer-wise architecture of a neural network, \(\hat x_{t_i,j}^n\) implicitly depends on \(b_{t_i,j}\).
In practice, we initialize \(b_{t_i,j}\) for neural network training as \(\frac{1}{N}\mathop {\sum }\nolimits_{n = 1}^N x_{t_i,j}^n\).
Dealing with multimodal data
In addition to implicit missing value imputation, the second main idea of MultiNODEs is to complement NODEs with a HI-VAE encoder^{17} for static variables (Supplementary Fig. 11). A HI-VAE is an extension of a Variational Autoencoder that can implicitly impute missing values via an input dropout model and handle heterogeneous multimodal data, including categorical and count data, via an accordingly factorized generative model. In addition, a HI-VAE uses a Gaussian Mixture Model (GMM) as prior distribution rather than a single Gaussian. We refer to Nazabal et al.^{17} for details.
The HI-VAE results in a lower-dimensional latent representation z_{stat} of the static variables, which can be used to augment the initial conditions \(z_{t_0}\) learned from the time series data. Consequently, we arrive at the formulation of the latent ODE system given in Eq. (5): \(\frac{dz(t)}{dt} = f( z(t), t; \theta )\) with initial condition \(z(0) = [ z_{t_0}, z_{stat} ]\).
This approach resembles the Augmented Neural ODEs by Dupont et al.^{24}. In contrast to our work, however, no additional features are added during their augmentation step, i.e., z_{stat} = 0. According to Dupont et al., the purpose of Augmented Neural ODEs is to smooth f, whereas we focus here on multimodal data integration.
For training MultiNODEs, we have to jointly consider \(ELBO_{IMP}^{NODE}\) as well as \(ELBO^{HI\text{-}VAE}\). After bringing both quantities to a comparable numerical scale, we use a weighted sum as our final training objective (see Eqs. (6) and (7)):
where β is a tunable hyperparameter. Details about hyperparameter optimization are described in the Supplementary material.
Generating synthetic subjects
We tested two methods to generate synthetic subjects with MultiNODEs:

a.
The first option is to draw samples of the latent static and longitudinal representations from the respective prior distributions \(z_{t_0} \sim N( 0, I )\) and \(z_{stat} \sim GMM( \pi )\). To assure that interdependencies between static and longitudinal variables are conserved, we model their joint distribution \(P( z_{t_0}, z_{stat} )\) using a Bayesian network. This network contains three nodes (random variables) representing (1) the mixture component s of the GMM prior with mixture coefficients π used by the HI-VAE for the static data, (2) the latent static representations \(Z_{stat} \sim GMM( \pi )\), and (3) the latent longitudinal representations \(Z_{t_0} \sim N( 0, I )\), respectively. The network is constrained such that directed edges can only go from s to Z_{stat} and from there to \(Z_{t_0}\). After randomly sampling a mixture component \(s_i\) from a multinomial distribution multinom(π), we can conditionally sample \(z_{stat} \sim Z_{stat} \mid s_i\) and finally \(z_{t_0} \sim Z_{t_0} \mid Z_{stat}\). Subsequently, we concatenate \(z_0 = [ z_{t_0}, z_{stat} ]\) into a vector forming the initial condition of the latent ODE system, solve the ODE system, and decode the solution. We call this approach “prior sampling”.

b.
A second option is to draw \(z_{t_0} \sim q( z_{t_0}^n \mid \{ x_{t_i}^n, t_i \}_i ) = N( \lambda( x_{t_i}^n, t_i ), \sigma( x_{t_i}^n, t_i ) )\) for the longitudinal data and \(z_{stat} \sim q( z_{stat}^n \mid x_{stat}^n, \pi ) = N( \lambda( x_{stat}^n, s^n ), \sigma( x_{stat}^n, s^n ) )\) with \(s^n \sim Categorical( \pi( x_{stat}^n ) )\) for the static data. That means we generate a blurred / noisy version of the original nth patient. We call this approach “posterior sampling” and recommend this procedure for data generation. In our experiments, we doubled the posterior variance during sampling because we found the synthetic data otherwise to lie too close to the real data. Tuning the added noise provides one option to balance identification risk against data utility.

c.
Synthetic data can not only be generated for observed visits, but also for definable timepoints in between (interpolation) and after the end of the study (extrapolation). This is possible because the latent ODE system is continuous in time.
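The two sampling procedures described above ("prior sampling" and "posterior sampling") can be sketched in NumPy. The mixture weights, conditional means, and Gaussian conditionals below are illustrative placeholders standing in for the Bayesian network and encoder distributions that the actual trained model provides.

```python
import numpy as np

rng = np.random.default_rng(1)

def prior_sample(pi, mu_stat, mu_t0_given_stat):
    """'Prior sampling': mixture component -> z_stat -> z_t0 -> concat.

    pi: GMM mixture weights; the conditional means are stand-ins
    for the Bayesian network fitted on the latent representations."""
    s = rng.choice(len(pi), p=pi)                     # s ~ multinom(pi)
    z_stat = rng.normal(loc=mu_stat[s])               # z_stat ~ Z_stat | s
    z_t0 = rng.normal(loc=mu_t0_given_stat * z_stat)  # z_t0 ~ Z_t0 | Z_stat
    return np.concatenate([np.atleast_1d(z_t0), np.atleast_1d(z_stat)])

def posterior_sample(mu, sigma, var_factor=2.0):
    """'Posterior sampling': a noisy copy of one real patient's encoding.

    Doubling the posterior variance (var_factor=2, as in the paper's
    experiments) moves synthetic subjects further from the real data,
    trading utility for privacy."""
    return rng.normal(loc=mu, scale=np.sqrt(var_factor) * sigma)

z0 = prior_sample(pi=np.array([0.3, 0.7]),
                  mu_stat=np.array([-1.0, 1.0]),
                  mu_t0_given_stat=0.5)
print(z0.shape)  # initial condition [z_t0, z_stat] for the latent ODE
```

In either case, the resulting z0 is handed to the latent ODE system of Eq. (5) and decoded into a synthetic trajectory, at observed or at freely chosen timepoints.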
Data preprocessing
A few steps are required to preprocess the clinical data before MultiNODEs can be applied. First, the data must be organized into a three-dimensional tensor of shape samples × timepoints × variables for the longitudinal variables, and a matrix of shape samples × variables for the static ones. The longitudinal variables are then transformed into a progression score by subtracting each variable's baseline value and normalizing by the standard deviation of that variable at baseline.
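Assuming the tensor layout just described, the progression-score transformation might look as follows. This is a sketch under our reading of the text (baseline = first visit, population standard deviation); the exact normalization details may differ in the authors' implementation.

```python
import numpy as np

def to_progression_score(X):
    """Transform a samples x timepoints x variables tensor into
    progression scores: subtract each subject's baseline value and
    divide by the across-subject standard deviation at baseline."""
    baseline = X[:, :1, :]                # each subject's first-visit value
    baseline_sd = X[:, 0, :].std(axis=0)  # per-variable SD across subjects
    return (X - baseline) / baseline_sd

# 2 subjects, 2 visits, 1 variable; baseline values 10 and 14 (SD = 2)
X = np.array([[[10.0], [12.0]],
              [[14.0], [13.0]]])
print(to_progression_score(X)[:, 1, 0])  # visit-2 scores: 1.0 and -0.5
```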
Calculating the relative error for correlation matrices
The relative error between correlation matrices is calculated as the norm of the difference between the real-data correlation matrix and the synthetic-data correlation matrix, divided by the norm of the real-data correlation matrix.
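A minimal sketch of this metric, assuming the Frobenius norm (the text does not state which matrix norm was used):

```python
import numpy as np

def relative_correlation_error(real, synthetic):
    """|| corr(real) - corr(synthetic) || / || corr(real) ||,
    computed with the Frobenius norm over samples x variables arrays."""
    c_real = np.corrcoef(real, rowvar=False)
    c_syn = np.corrcoef(synthetic, rowvar=False)
    return np.linalg.norm(c_real - c_syn) / np.linalg.norm(c_real)

rng = np.random.default_rng(2)
real = rng.normal(size=(500, 3))
print(relative_correlation_error(real, real))  # 0.0 for identical data
```

An error of 0 means the synthetic data reproduce the real correlation structure exactly; larger values indicate a growing structural discrepancy.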
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The PPMI dataset is available under: https://www.ppmi-info.org/. The NACC data are available under: https://naccdata.org/. The data are shared by the data owners after successful application. The data generated for this study cannot be shared by the authors due to the signed data usage agreements with the data owners of the corresponding real data (i.e., PPMI and NACC).
Code availability
The code for MultiNODEs is available at https://github.com/philippwendland/MultiNODEs.
References
Fröhlich, H. et al. From hype to reality: data science enabling personalized medicine. BMC Med. 16, 150 (2018).
Birkenbihl, C., Salimi, Y., Fröhlich, H., Japanese Alzheimer's Disease Neuroimaging Initiative & Alzheimer's Disease Neuroimaging Initiative. Unraveling the heterogeneity in Alzheimer's disease progression across multiple cohorts and the implications for data-driven disease modeling. Alzheimers Dement. 18, 251–261 (2022).
Birkenbihl, C. et al. Differences in cohort study data affect external validation of artificial intelligence models for predictive diagnostics of dementia – lessons for translation into clinical practice. EPMA J. 11, 367–376 (2020).
Gootjes-Dreesbach, L., Sood, M., Sahay, A., Hofmann-Apitius, M. & Fröhlich, H. Variational Autoencoder Modular Bayesian Networks for simulation of heterogeneous clinical study data. Front. Big Data 3, 16 (2020).
Sood, M. et al. Realistic simulation of virtual multiscale, multimodal patient trajectories using Bayesian networks and sparse autoencoders. Sci. Rep. 10, 10971 (2020).
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
Thorlund, K., Dron, L., Park, J. J. & Mills, E. J. Synthetic and external controls in clinical trials – a primer for researchers. Clin. Epidemiol. 12, 457–467 (2020).
Lei, Y. et al. MRI-only based synthetic CT generation using dense cycle consistent generative adversarial networks. Med. Phys. 46, 3565–3581 (2019).
Yang, G. et al. DAGAN: Deep De-Aliasing Generative Adversarial Networks for fast compressed sensing MRI reconstruction. IEEE Trans. Med. Imaging 37, 1310–1321 (2018).
Lin, Z., Jain, A., Wang, C., Fanti, G. & Sekar, V. Using GANs for sharing networked time series data: challenges, initial promise, and open questions. in Proceedings of the ACM Internet Measurement Conference 464–483 (ACM, 2020). https://doi.org/10.1145/3419394.3423643.
Bae, H., Jung, D., Choi, H.-S. & Yoon, S. AnomiGAN: Generative Adversarial Networks for anonymizing private medical data. Pac. Symp. Biocomput. 25, 563–574 (2020).
Jordon, J. & Yoon, J. PATE-GAN: generating synthetic data with differential privacy guarantees. in International Conference on Learning Representations (2019).
Beaulieu-Jones, B. K. et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ. Cardiovasc. Qual. Outcomes 12, e005122 (2019).
Chen, R. T. Q., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. in Advances in Neural Information Processing Systems (eds Bengio, S. et al.) vol. 31 (Curran Associates, Inc., 2018).
Marek, K. et al. The Parkinson Progression Marker Initiative (PPMI). Prog. Neurobiol. 95, 629–635 (2011).
Besser, L. et al. Version 3 of the National Alzheimer’s Coordinating Center’s Uniform Data Set. Alzheimer Dis. Assoc. Disord. 32, 351–358 (2018).
Nazabal, A., Olmos, P. M., Ghahramani, Z. & Valera, I. Handling incomplete heterogeneous data using VAEs. Preprint at https://arxiv.org/abs/1807.03653 (2020).
de Jong, J. et al. Deep learning for clustering of multivariate clinical patient trajectories with missing values. GigaScience 8, giz134 (2019).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at http://arxiv.org/abs/1312.6114 (2014).
Goncalves, A. et al. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. 20, 108 (2020).
Park, N. et al. Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11, 1071–1083 (2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016). https://doi.org/10.1109/CVPR.2016.90.
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Dupont, E., Doucet, A. & Teh, Y. W. Augmented neural ODEs. in Advances in Neural Information Processing Systems (eds Wallach, H. et al.) vol. 32 (Curran Associates, Inc., 2019).
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 826421, “TheVirtualBrain-Cloud”, and from the Deutsche Forschungsgemeinschaft (DFG) funded project “NFDI4Health” (project number 442326535). Data used in the preparation of this article were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database (www.ppmi-info.org/data). For up-to-date information on the study, visit www.ppmi-info.org. PPMI, a public–private partnership, is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners [a list of the full names of all of the PPMI funding partners can be found at www.ppmi-info.org/funding-partners]. The NACC database is funded by NIA/NIH Grant U01 AG016976. NACC data are contributed by the NIA-funded ADCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P30 AG062428-01 (PI James Leverenz, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P30 AG062421-01 (PI Bradley Hyman, MD, PhD), P30 AG062422-01 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Thomas Wisniewski, MD), P30 AG013854 (PI Robert Vassar, PhD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P30 AG062429-01 (PI James Brewer, MD, PhD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG053760 (PI Henry Paulson, MD, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P30 AG049638 (PI Suzanne Craft, PhD), P50 AG005136 (PI Thomas Grabowski, MD), P30 AG062715-01 (PI Sanjay Asthana, MD,
FRCP), P50 AG005681 (PI John Morris, MD), and P50 AG047270 (PI Stephen Strittmatter, MD, PhD).
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
H.F., C.B., P.W. and M.K. conceived the project. P.W. implemented the method. P.W., M.G.F. and M.S. performed the experiments. C.B. and H.F. wrote the manuscript. P.W., M.K., and M.S. revised the manuscript. C.B. and H.F. supervised the work.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wendland, P., Birkenbihl, C., Gomez-Freixa, M. et al. Generation of realistic synthetic data using Multimodal Neural Ordinary Differential Equations. npj Digit. Med. 5, 122 (2022). https://doi.org/10.1038/s41746-022-00666-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-022-00666-x