Introduction

The past decade has witnessed ground-breaking advances in computational health, owing to the explosion of medical data such as electronic health records (EHRs)1,2,3. The secondary use of EHRs has given rise to a wide range of research, especially machine learning (ML)-based digital health solutions for improving the delivery of care4,5,6,7,8. In practice, however, the benefits of data-driven research are largely confined to the healthcare organizations (HCOs) that possess the data9,10. Due to concerns about patient privacy, HCO stakeholders are reluctant to share patient data11,12,13. Access to clinical data is often restricted, or prohibitively expensive to obtain, so ML in biomedical research lags behind other areas of AI.

To accelerate the development of AI methods in medicine, one promising alternative is for the data holder to create synthetic yet realistic data14,15. Unlike data anonymization, synthetic data avoids a "one-to-one" mapping to the genuine records, thereby circumventing the privacy issue while preserving the correlations of the original data distributions for downstream AI applications. The literature reports successes in using synthetic data to improve AI models where this would otherwise not be possible due to limited resources16,17,18. For example, large-scale data-sharing programs have been called for to advance studies related to COVID-19, such as the National COVID Cohort Collaborative (N3C)19 and the Clinical Practice Research Datalink (CPRD) database in the UK20.

Recent advances in generative adversarial networks (GANs)21 and their variants offer efficacious means to generate EHRs for a wide range of clinical applications22,23,24. In recent years, EHR synthesizers have evolved from generating static patient information to producing longitudinal EHR timeseries25,26,27. As longitudinal EHRs contain patient trajectories describing the underlying health condition, synthesizing such EHR timeseries enables new clinical applications related to disease progression28, such as dynamic risk forecasting, predicting the onset of diseases, and survival analysis based on time-to-event data. However, existing studies focus on synthesizing longitudinal EHRs of a single data type25,26,29, whereas clinical decision-making in real practice draws on a variety of information sources in the form of mixed-type timeseries. For example, patient physiological signals and laboratory test results are collected in the EHR as continuous-valued timeseries, while medication and diagnostic information is recorded as discrete-valued data such as binary indicators or categorical ICD codes. The information provided in these mixed-type longitudinal EHRs offers opportunities for more precise and complex clinical analysis. Furthermore, the predictive power and robustness of ML models can be boosted by utilizing longitudinal EHR timeseries of various types and sources.

Existing GANs are limited in simulating mixed-type EHRs for two reasons. First, it is intrinsically difficult to model the underlying joint distribution of mixed-type timeseries within a single unified framework. Since GANs require the network architectures of the generator and discriminator to be fully differentiable30, their success is typically limited to generating real-valued, continuous data, and they face obstacles in directly generating sequences of discrete tokens, such as the ICD codes that also commonly appear in EHRs. Previous methods31,32 circumvent this problem by learning representations of the original data that enable backpropagation in discrete settings, but a generative approach for jointly modeling mixed-type timeseries of heterogeneous nature is still lacking. Second, although mixed-type clinical timeseries differ in syntax and distribution, they are highly correlated and inform one another about the underlying health of an individual33,34,35. It is therefore important to capture the temporal correlations between them when generating synthetic EHR data. For example, the medications prescribed to patients (documented as discrete data) are based on measurements of the patients' physiological status (presented as continuous-valued signals); concurrently, the efficacy of the medical treatments directly affects the patient's physiological condition. Accurately capturing the temporal correlations between the mixed-type patient trajectories is therefore critical for improving clinical decision support.

To address the aforementioned limitations, we propose, for the first time, a GAN framework for simultaneously synthesizing mixed-type longitudinal EHR data (denoted EHR-M-GAN hereafter). Specifically, we focus on generating timeseries in the critical care setting, where intensive care unit (ICU) patients are continuously and closely monitored (see Fig. 1a). Patient trajectories with high dimensionality and heterogeneous data types (both continuous-valued and discrete-valued timeseries) are generated while the underlying temporal dependencies are captured. The main contributions of our work are as follows:

  • A GAN model entitled EHR-M-GAN is proposed for simultaneously generating mixed-type multivariate EHR timeseries with high fidelity, overcoming the challenges of extending GANs to mixed-type data settings (see Fig. 1b). To jointly model the underlying distributions of the heterogeneous features, EHR-M-GAN first maps data from the different observational spaces into a reversible, lower-dimensional, shared latent space through a dual variational autoencoder (dual-VAE). Then, to capture the correlated temporal dynamics of the mixed-type data, a sequentially coupled generator built upon a coupled recurrent network (CRN) is employed. In addition, a conditional version of our model—EHR-M-GANcond—is also implemented, which is capable of synthesizing condition-specific EHR patient data, such as records resulting in ICU mortality or hospital readmission. The code of our proposed work is publicly available on GitHub.

  • Evaluations are performed on three publicly available ICU datasets: MIMIC-III36, eICU37, and HiRID38, covering a total of 141,488 patients. Standardized preprocessing pipelines are applied to the three ICU datasets to provide generalizable machine learning benchmarks. The code for the end-to-end preprocessing pipelines is also available on GitHub.

  • Our EHR-M-GAN outperforms the state-of-the-art benchmarks on a diverse spectrum of evaluation metrics. When compared to real EHR data, both qualitative and quantitative metrics are used to assess the representativeness of the mixed-type data and their inter-dependencies. We further demonstrate the advantages offered by EHR-M-GAN in augmenting clinical timeseries for downstream tasks under various clinical scenarios.

  • In the evaluation of privacy risks, we perform an empirical analysis of EHR-M-GAN based on membership inference attacks39. We further evaluate the performance of EHR-M-GAN under the framework of differential privacy for its application in downstream tasks40.

Fig. 1: Overall schematics.
figure 1

a Data extraction. Electronic health record (EHR) data are routinely collected for patients in intensive care units (ICUs). Intensively monitored vital signs and laboratory measurements are recorded as continuous-valued timeseries, while the presence or absence of medical interventions is collected as discrete-valued timeseries during the ICU admission. These mixed-type EHR data are correlated but distributed differently, and they change over time depending on the diagnoses provided by clinicians. b Network architecture. EHR-M-GAN contains two key components—Dual-VAE and Coupled Recurrent Network (CRN). Step 1: The Dual-VAE is first pretrained to map the heterogeneous data (\({{{{\bf{x}}}}}_{t}^{c},{{{{\bf{x}}}}}_{t}^{d}\)) into shared latent representations (\({{{{\bf{z}}}}}_{t}^{c},{{{{\bf{z}}}}}_{t}^{d}\)). Multiple objective loss constraints are used to bridge the domain/distribution gap, including ELBO loss, matching loss, contrastive loss, and semantic loss (for EHR-M-GANcond only). Both encoders and decoders in the Dual-VAE are implemented with LSTMs. The training process for Step 1 is indicated by the Dual-VAE pretrain path (dashed purple line). Step 2: A CRN is then established as the generator based on the parallel bilateral LSTM block, which takes the random noise vectors (\({{{{\boldsymbol{\upsilon }}}}}_{t}^{c},{{{{\boldsymbol{\upsilon }}}}}_{t}^{d}\)) as inputs (see the Coupled generation path). Step 3: The synthetic latent representations (\({\hat{{{{\bf{z}}}}}}_{t}^{c},{\hat{{{{\bf{z}}}}}}_{t}^{d}\)) provided by the CRN are decoded into synthetic samples (\({\hat{{{{\bf{x}}}}}}_{t}^{c},{\hat{{{{\bf{x}}}}}}_{t}^{d}\)) using the pretrained decoders in the Dual-VAE, as indicated by the Decoding path (solid red line). Step 4: Finally, the adversarial loss is derived from the LSTM-based discriminators and backpropagated to update the network, as indicated by the Adversarial training path (dotted black line). c Evaluation pipeline. The pipeline includes metrics for evaluating the fidelity of both continuous-valued and discrete-valued timeseries, and the correlations within the mixed-type data. A downstream task (in d) is also performed to evaluate the application of synthetic data in a realistic clinical use case. Finally, membership inference attack and differential privacy are used to evaluate our model's privacy risk empirically. d Prediction example. Data within 24 h prior to the patient's endpoint in the ICU (discharge or mortality) are extracted. Both the observation window and the prediction window are fixed at 12 h. The classification task is to use patients' continuous-valued physiological measurements within the observation window as input, to predict the forthcoming discrete-valued medical intervention status in the prediction window. The four outcomes of the intervention status are categorized as follows: Stay On: the intervention starts on and stays on within the prediction window; Onset: the intervention starts off and is turned on within the prediction window; Switch off: the intervention starts on and is stopped within the prediction window; Stay Off: the intervention starts off and stays off within the prediction window.

Results

Evaluation metrics

Evaluating GAN models is a notoriously challenging task. The advantages and pitfalls of commonly used evaluation metrics for GANs are discussed in ref. 41. In this work, a systematic evaluation framework is adopted to assess the quality of synthetic patient EHRs with respect to their fidelity, correlation, utility, and privacy (see Table 1). First, we individually assess the representativeness of the synthetic continuous-valued and discrete-valued timeseries. This includes measuring the distance between the underlying data distributions (Maximum mean discrepancy and Dimension-wise probability), comparing feature-level statistics between the real and synthetic data (Patient trajectories), and assessing the indistinguishability of the synthetic data from the true data (Discriminative score). Second, we evaluate the extent to which our model can reconstruct the interdependency between different features (Pearson pairwise correlations) and the temporal dynamics of the patient trajectories (Autocorrelation function), using a set of qualitative and quantitative metrics. Third, we perform data augmentation by incorporating synthesized EHR timeseries under various settings, and quantitatively assess the improvement provided by EHR-M-GAN in the Downstream tasks of medical intervention prediction in the ICU (i.e., the utility of the synthetic data). Lastly, we measure the privacy-preserving properties of EHR-M-GAN under Membership inference attack. We also evaluate the performance of the same downstream tasks under Differential privacy guarantees (see Fig. 1c and Table 1 for the evaluation pipeline).

Table 1 Summary of the evaluation protocol in this study.

Maximum mean discrepancy

To measure the similarity between the continuous-valued synthetic data and the real data, the maximum mean discrepancy (MMD) is used. MMD assesses whether two sets of samples are drawn from the same distribution; in our case, one set is from the true data x and one from the synthetic data \({x}^{{\prime} }\) generated by GANs. To calculate the statistic, a kernel function \(K:X\times {X}^{{\prime} }\to {\mathbb{R}}\) is used to quantify the similarity between the two distributions. In this study, a sum of Gaussian kernels with different bandwidths is adopted following the implementation in ref. 42, which can be expressed as:

$$K({{{\bf{x}}}},{{{{\bf{x}}}}}^{{\prime} })=\mathop{\sum}\limits_{i}\exp \left(-\frac{\parallel {{{\bf{x}}}}-{{{{\bf{x}}}}}^{{\prime} }{\parallel }_{F}^{2}}{{\sigma }_{i}^{2}}\right)$$
(1)

where σi is the i-th selected bandwidth for calculating MMD. In our study, the real and synthetic samples are multivariate timeseries aligned along a fixed time axis (i.e., 24 data points per patient); we therefore handle these multivariate timeseries as matrices and use the kernel function on the Frobenius norm (\({\left\Vert \cdot \right\Vert }_{F}\)) between them25.

Finally, given samples \({\left\{{{{{\bf{x}}}}}_{i}\right\}}_{i = 1}^{n}\) from the real distribution and samples \({\left\{{{{{\bf{x}}}}}_{j}^{{\prime} }\right\}}_{j = 1}^{m}\) from the synthetic distribution (with n and m denoting the corresponding sample sizes), the estimate of MMD can be defined as:

$$\begin{array}{ll}\widehat{{\rm{MMD}}^{2}}\,=\,\frac{1}{n(n-1)}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}K\left({{{{\bf{x}}}}}_{i},{{{{\bf{x}}}}}_{j}\right)-\frac{2}{mn}\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{m}K({{{{\bf{x}}}}}_{i},{{{{\bf{x}}}}}_{j}^{{\prime} })\\ \qquad\qquad\quad +\,\frac{1}{m(m-1)}\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j\ne i}^{m}K({{{{\bf{x}}}}}_{i}^{{\prime} },{{{{\bf{x}}}}}_{j}^{{\prime} })\end{array}$$
(2)

It can be inferred from Eq. (2) that higher similarity between the two distributions leads to a lower MMD value, with the lower bound of zero indicating that the two distributions are identical.
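For reference, a minimal NumPy sketch of the kernel in Eq. (1) and the estimator in Eq. (2) is given below; the bandwidth values, function names, and data shapes are illustrative assumptions rather than the exact settings used in this study.

```python
import numpy as np

def mixture_gaussian_kernel(x, y, bandwidths=(1.0, 5.0, 10.0)):
    """Sum of Gaussian kernels on the squared Frobenius norm between two
    (timesteps x features) matrices, following Eq. (1)."""
    sq_dist = np.sum((x - y) ** 2)  # squared Frobenius norm
    return sum(np.exp(-sq_dist / s ** 2) for s in bandwidths)

def mmd_squared(real, synth, kernel=mixture_gaussian_kernel):
    """Unbiased estimate of MMD^2 (Eq. 2) between two sets of multivariate
    timeseries of shape (n, T, D) and (m, T, D)."""
    n, m = len(real), len(synth)
    k_rr = sum(kernel(real[i], real[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_ss = sum(kernel(synth[i], synth[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_rs = sum(kernel(real[i], synth[j])
               for i in range(n) for j in range(m)) / (n * m)
    return k_rr - 2 * k_rs + k_ss

# Illustrative usage with random data: 24 timesteps, 5 features per patient.
real = np.random.randn(50, 24, 5)
synth = np.random.randn(60, 24, 5)
print(mmd_squared(real, synth))
```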

As indicated in Table 2, EHR-M-GAN outperforms the state-of-the-art benchmarks on all three datasets in synthesizing continuous-valued timeseries. The conditional version—EHR-M-GANcond—further boosts performance by leveraging the information of the condition-specific inputs. Furthermore, as shown in the ablation study, EHR-M-GAN and EHR-M-GANcond produce smaller MMD values than their variants. Using MIMIC-III as an example, by integrating shared latent space learning via the dual-VAE under multiple loss constraints, GANSL significantly improves on the basic model GANVAE (p < 0.05; an unpaired t-test with a significance level of 0.05 is used throughout the paper unless specified otherwise). By further introducing the sequentially coupled generator and exploiting the information within the mixed-type data, the MMD of EHR-M-GAN shows a nearly 24% improvement over GANVAE. When synthesizing mixed-type timeseries with a single unified network, GANUnified lags behind the proposed EHR-M-GAN in generating continuous-valued timeseries. It can therefore be inferred that, compared with EHR-M-GAN, which extracts useful hierarchical representations for each data type using tailored encoding layers, it is challenging for GANUnified to learn the marginal distributions from raw mixed-type timeseries with a unified architecture.

Table 2 Maximum mean discrepancy (MMD) of continuous-valued synthetic data.

Dimension-wise probability

To evaluate the representativeness of the synthetic discrete-valued timeseries, the dimension-wise probability test is employed. To compare the probability distributions of the real and synthetic binary features, the Bernoulli success probability p ∈ [0, 1] is calculated for each dimension of the discrete-valued timeseries and visualized as a scatterplot. As a sanity check, this investigates whether the probability of a medical intervention being active at a given timestamp is matched between the real data (x-axis) and synthetic data (y-axis). The correlation coefficients (CCs) and root-mean-square errors (RMSEs) based on the Bernoulli success probabilities are also adopted43 to quantitatively measure the distributional divergence between real and synthetic data.
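A minimal sketch of this computation is given below, assuming binary arrays of shape (patients, timesteps, interventions); the helper name and example shapes are illustrative only.

```python
import numpy as np

def dimension_wise_probability(real, synth):
    """Bernoulli success probability for each (timestep, intervention) cell,
    i.e., the fraction of patients with the intervention active, plus the
    CC and RMSE between the real and synthetic probability vectors."""
    p_real = real.mean(axis=0).ravel()   # one probability per dimension
    p_synth = synth.mean(axis=0).ravel()
    cc = np.corrcoef(p_real, p_synth)[0, 1]
    rmse = np.sqrt(np.mean((p_real - p_synth) ** 2))
    return p_real, p_synth, cc, rmse

# Illustrative usage with random binary data.
real = (np.random.rand(100, 24, 4) > 0.7).astype(float)
synth = (np.random.rand(100, 24, 4) > 0.7).astype(float)
_, _, cc, rmse = dimension_wise_probability(real, synth)
print(f"CC={cc:.4f}, RMSE={rmse:.4f}")
```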

As shown in Fig. 2 (see Supplementary Figs. 4 and 5 for results on the eICU and HiRID datasets), the best results are achieved by EHR-M-GAN and EHR-M-GANcond. The close-to-real probability distributions that appear along the diagonal line indicate the remarkable similarity between the real data and the synthetic data provided by our models. The quantified CC and RMSE correspond with the visualization results and are close to the optimal mark (EHR-M-GAN: RMSE = 0.0095, CC = 0.9973). As with the MMD results, the dimension-wise distributions are better captured when modules such as the dual-VAE and the sequentially coupled generator are introduced in EHR-M-GAN. GANUnified suffers from mode collapse (the generator fails to produce outputs with sufficient diversity) and therefore performs poorly compared with the other variants when synthesizing discrete-valued timeseries. As it treats the mixed-type features as a unimodal input without differentiating their heterogeneous nature, no marginal representations are explicitly learned.

Fig. 2: Scatterplot of the dimension-wise probability test on MIMIC-III dataset.
figure 2

Dimension-wise probability calculates the Bernoulli success probability of each dimension, i.e., the probability of a treatment being active at a particular time. The x-axis and y-axis represent the dimension-wise probability for the real data and for the synthetic data generated by different models, respectively. The same color indicates the same treatment (at varying timestamps). Optimal performance appears along the diagonal line. The corresponding CCs ([0, 1], the higher the better) and RMSEs (\(\left[0,+\infty \right)\), the lower the better) are also calculated to quantify the similarity of the probability distributions between the real and synthetic EHR timeseries. Dimension-wise probability plots for the eICU and HiRID datasets can be found in Supplementary Note 4.

Among the state-of-the-art benchmark models, DualAAE shows the best result but remains slightly sub-optimal compared to EHR-M-GAN. In contrast, both skewed distributions and low performance scores are observed for medGAN, as it lacks the ability to capture the temporal correlations within timeseries. SynTEG shows improved performance over medGAN, as it is capable of synthesizing discrete-valued EHR features with timestamps. The non-GAN generative method PrivBayes also performs well among the benchmark synthesizers in modeling the underlying probability distribution of the discrete-valued EHR timeseries. On the other hand, despite the well-known performance of SeqGAN in natural language generation, it does not translate well to synthesizing sequential clinical EHRs. The results of EHR-M-GAN show its superiority in explicitly capturing each dimension of the discrete-valued sequences. This indicates that the proposed EHR-M-GAN mitigates the challenge traditional GANs face in generating discrete-valued features by learning shared latent representations with the dual-VAE.

Patient trajectories

We compare the distribution of patient trajectories per timepoint between the real data and the synthetic data generated by EHR-M-GAN for the MIMIC-III dataset. Five commonly measured vital-sign and laboratory features (Oxygen Saturation, Systolic Blood Pressure, Respiratory Rate, Heart Rate, and Temperature), as well as two medical intervention features (Mechanical Ventilation and Vasopressor), are compared as exemplars in Fig. 3. The proposed model accurately captures the statistical distribution (mean and standard deviation) of both continuous-valued and discrete-valued features, and the temporal dynamics are well preserved in the synthetic timeseries. For example, the variance of Oxygen Saturation gradually increases towards the ICU endpoint in the real data, and this is closely reflected in the synthetic timeseries. Furthermore, EHR-M-GANcond shows superior performance as it can generate correct trajectories for specific patient conditions (see Supplementary Note 4).

Fig. 3: Comparison of the distribution of values at each timepoint (mean and standard deviation) between real and synthetic patient trajectory produced by EHR-M-GAN.
figure 3

Multivariate timeseries in the 24 h before patients' ICU endpoints are generated, including Heart Rate, Respiratory Rate, Systolic Blood Pressure, Oxygen Saturation, Temperature, Mechanical Ventilation, and Vasopressor. The mean value of the real/synthetic feature at each timepoint is plotted by the solid/dotted line, with the shaded area indicating ±1 standard deviation. For Mechanical Ventilation and Vasopressor, the y-axis indicates the probability of the intervention being applied ("On") at a given time. The synthetic patient trajectories generated by EHR-M-GANcond under different conditions can be found in Supplementary Note 4.

Discriminative score

For both continuous-valued and discrete-valued data, the discriminative score is measured as the accuracy of a discriminator trained post hoc to separate real from generated samples. Synthetic data are generated in the same quantity as the hold-out test set of the original data, and the synthetic and real samples are labeled correspondingly to train the binary classifier. In this study, the classifier (critic) is implemented as a single-layer Bi-directional Long Short-Term Memory (Bi-LSTM) model with randomly initialized parameters (as opposed to a critic built upon representations from the trained generative model28). The critic trained on this supervised learning task can be used to characterize the temporal correlations across the patient EHR timeseries.
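A minimal sketch of this procedure is shown below, assuming arrays of shape (samples, timesteps, features); the hidden size, number of epochs, and train/test split are illustrative choices rather than the exact settings used in this study.

```python
import numpy as np
import tensorflow as tf

def discriminative_score(real, synth, epochs=20):
    """Train a randomly initialized single-layer Bi-LSTM to separate real
    from synthetic timeseries and report its held-out accuracy."""
    x = np.concatenate([real, synth], axis=0).astype("float32")
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    idx = np.random.permutation(len(x))
    split = int(0.8 * len(x))
    train, test = idx[:split], idx[split:]

    model = tf.keras.Sequential([
        tf.keras.Input(shape=x.shape[1:]),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x[train], y[train], epochs=epochs, verbose=0)
    _, acc = model.evaluate(x[test], y[test], verbose=0)
    return acc
```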

As indicated by the results in Table 3, EHR-M-GAN and EHR-M-GANcond produce synthetic data that are less distinguishable from real data than those of the benchmarked models. EHR-M-GANcond in particular consistently achieves the best discriminative scores against the other benchmarks for both continuous-valued and discrete-valued timeseries. For discrete-valued data generation, EHR-M-GAN-generated samples achieve a discriminative score of 0.813 on the MIMIC-III dataset, a statistically significant improvement of 4% over the best-performing benchmark (p < 0.05). The overall discriminative scores produced by PrivBayes on the three ICU databases are comparable with those of GAN models such as SynTEG and DualAAE. For continuous-valued timeseries generation, the discriminative score of TimeGAN on the HiRID dataset outperforms the other models as well as EHR-M-GAN, although the difference is not statistically significant (p = 0.4374). By leveraging the additional information from the conditional inputs, EHR-M-GANcond provides significantly better results than TimeGAN (p < 0.05).

Table 3 Discriminative score of synthetic data.

The ablation study demonstrates the effectiveness of EHR-M-GAN for generating high-quality EHR timeseries. The shared latent space representation learning in the dual-VAE (i.e., GANSL) has shown remarkable success compared with GANVAE, which generates the latent embeddings from separate VAEs. The sequentially coupled generator further improves the model by capturing the dynamics between mixed data types. Compared with GANUnified, which models the mixed-type data in a unified network, our proposed model enables effective learning of the marginal distribution of each data type. In addition, as shown in Supplementary Table 9, EHR-M-GAN provides more realistic synthetic samples than the dual-VAE module alone (see Supplementary Note 3 for details).

Interdependency characteristics

In this section, we first employ the Pearson pairwise correlation (PPC), which ranges from −1 to 1, to evaluate how closely the synthetic data model the correlations between continuous-valued and discrete-valued timeseries. Patient trajectories are sampled at 3-h intervals over the 24-h ICU stay to explore the temporal dependencies between different variables. To quantitatively measure the difference between heatmaps generated from real and synthetic samples, we calculate the mean absolute difference between the two PPC matrices (μabs). We also adopt correlation accuracy (CorAcc)44, which quantifies the similarity of two heatmaps within the range of 0 to 1. We discretize the correlation coefficients into seven correlation levels: strong negative ([ − 1, − 0.5)), middle negative ([ − 0.5, − 0.3)), low negative ([ − 0.3, − 0.1)), no correlation ([ − 0.1, 0.1)), low positive ([0.1, 0.3)), middle positive ([0.3, 0.5)), and strong positive ([0.5, 1)). CorAcc is then calculated as the percentage of feature pairs for which the real and synthetic data are assigned to the same correlation level.
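A minimal sketch of μabs and CorAcc under the binning described above might look as follows; the function name is illustrative.

```python
import numpy as np

# Bin edges for the seven correlation levels described above.
LEVELS = [-1.0, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 1.0]

def correlation_metrics(ppc_real, ppc_synth):
    """Compare two Pearson pairwise correlation matrices: mean absolute
    difference (mu_abs) and the fraction of entries assigned to the same
    correlation level (CorAcc)."""
    mu_abs = np.mean(np.abs(ppc_real - ppc_synth))
    level_real = np.digitize(ppc_real, LEVELS[1:-1])
    level_synth = np.digitize(ppc_synth, LEVELS[1:-1])
    cor_acc = np.mean(level_real == level_synth)
    return mu_abs, cor_acc
```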

As observed in Fig. 4, the correlation trends over distinctive features are closely reflected by the synthetic data, with the quantitative measure CorAcc consistently exceeding 0.8 on the three critical care databases. It is also worth noting that EHR-M-GAN can successfully recover temporal dependencies from real patient trajectories at a high granularity. For example, synchronized correlations across timestamps are observed between Respiratory Rate and Heart Rate in the MIMIC-III dataset, and such trends are preserved in the synthetic data. This can be explained by the common regulation of these two features by the autonomic nervous system and their synchronized increase in cases of physiological stress, such as hypoxemia. In summary, the proposed EHR-M-GAN can reconstruct the temporal dynamics and correlations between features in the real data, which is valuable for downstream ML-based classification and prediction applications.

Fig. 4: Pearson pairwise correlation (PPC) between continuous-valued and discrete-valued timeseries.
figure 4

The plots contrast the PPC calculated within the real data (left column) and the synthetic data generated by EHR-M-GAN (right column). Besides the visual inspection, the similarity between the two heatmaps is quantified by CorAcc and μabs. These metrics indicate how well the synthetic data reconstruct the correlations observed in the real patient trajectories. In this figure, SpO2, SBP, RR, HR, and Temp represent Oxygen Saturation, Systolic Blood Pressure, Respiratory Rate, Heart Rate, and Temperature, respectively, and Vent. and Vaso. correspond to Mechanical Ventilation and Vasopressor. PPC is calculated every 3 h over the total 24 h of ICU stay (ticks of the timestamps are omitted).

Next, autocorrelation functions (ACF)45 and the corresponding root-mean-square errors (RMSEs) are calculated to show how well EHR-M-GAN captures the temporal correlations within the timeseries. The ACF measures the relationship between a timeseries and a lagged version of itself. Supplementary Figs. 6–8 show the ACF calculated for selected continuous-valued and discrete-valued variables (the same as in the Pearson pairwise plot) on real and synthetic timeseries. The time lags are specified as hourly intervals up to 24 h before patients' ICU endpoints (ICU discharge or death). Additionally, RMSEs are calculated to quantitatively evaluate the similarity between the corresponding curves produced by the real and synthetic data.
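A minimal sketch of the ACF and the associated RMSE comparison is given below, assuming each variable is stored as an array of shape (patients, timesteps); the function names are illustrative.

```python
import numpy as np

def autocorrelation(series, max_lag=23):
    """Sample autocorrelation of a 1-D timeseries for lags 0..max_lag."""
    x = series - series.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

def acf_rmse(real, synth, max_lag=23):
    """RMSE between the mean ACF curves of real and synthetic data,
    where each input has shape (patients, timesteps)."""
    acf_real = np.mean([autocorrelation(p, max_lag) for p in real], axis=0)
    acf_synth = np.mean([autocorrelation(p, max_lag) for p in synth], axis=0)
    return np.sqrt(np.mean((acf_real - acf_synth) ** 2))
```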

Similar patterns are observed between the ACFs calculated for the real data and their synthetic counterparts, and the quantitative statistics corroborate this observation. Moreover, the overlapping confidence intervals indicate that the synthetic data consistently capture the underlying temporal distributions of the real timeseries. For variables such as Heart Rate, Oxygen Saturation, and Systolic Blood Pressure, the positive ACF coefficients decrease rapidly within the first few hours, followed by a growing negative temporal correlation. The lag with the lowest correlation coefficient is identified at approximately 4 h. Global peaks appear at roughly the 12-h lag for Temperature for both real and synthetic data across the three critical care databases. Meanwhile, the negative correlation strengthens as the time lag increases for Mechanical Ventilation in the original timeseries. The fact that these behaviors are reproduced by EHR-M-GAN demonstrates that our model can effectively capture the temporal characteristics of the original timeseries.

Downstream tasks

As previously discussed, one of the most prominent goals for GANs is to benefit future downstream analyses in real clinical applications. A relevant question in the ICU is whether specialized medical treatments, such as therapeutic interventions or organ support, are required for critically ill patients during the admission. Accurate predictions on such tasks can help clinicians provide actionable, timely interventions in the resource-intensive ICU. Therefore, in this section, clinical intervention prediction tasks are implemented to evaluate the potential of EHR-M-GAN and EHR-M-GANcond for synthesizing high-fidelity data that can further boost the performance of ML classifiers. In line with prior work46,47,48, we establish LSTM-based classifiers to predict the status of mechanical ventilation and vasopressors using continuous-valued multivariate physiological signals as the predictors. A fixed duration of 12 h is used for both the observation window and the prediction window (see Fig. 1). Four outcomes of medical intervention status are defined: Stay on, Onset, Switch off, and Stay off (detailed descriptions can be found in Fig. 1).

We partition the dataset as illustrated in Fig. 5a, and the performance is assessed from two aspects (see Fig. 5b): (i) Traditional approach: To explore whether the synthetic data represent the real data accurately, we compare Train on Real, Test on Real (TRTR) with Train on Synthetic, Test on Real (TSTR), to show whether the performance of a classifier trained on synthetic data from EHR-M-GAN or EHR-M-GANcond generalizes to real data. In addition to the proposed models, synthetic data produced by the baseline models are also used to train the downstream classifiers for comparison. Beyond measuring data utility (where the downstream task is to predict discrete-valued medical interventions, i.e., the outcomes in this scenario, from continuous-valued physiological features, i.e., the predictors), TSTR can also be used to assess a synthesizer's ability to capture the interdependencies between the mixed-type features. (ii) Data augmentation approach: As data augmentation is employed as a means of circumventing the problems caused by under-resourced EHR data, we explore whether synthetic data can be used to improve existing ML algorithms through data augmentation. Therefore, Train on Synthetic and Real, Test on Real (TSRTR) is compared with TRTR to measure the improvement in the classifier's performance when trained on the augmented data25,49. The augmentation ratio α or β is applied when augmenting the sub-train data \({A}_{Tr}^{{\prime} }\) or the synthetic data B, respectively, in the two TSRTR scenarios. Details are explained as follows (also see Fig. 5b for illustration).

Fig. 5: Downstream intervention prediction experimental setup.
figure 5

a Data splitting. During the training stage, the real data is split into 70% training data A and 30% test data \({A}^{{\prime} }\). The test data \({A}^{{\prime} }\) is further split into sub-train data \({A}_{Tr}^{{\prime} }\) and sub-test data \({A}_{Te}^{{\prime} }\) of equal size. Then, the synthetic data B, with size equal to the sub-train data \({A}_{Tr}^{{\prime} }\), is synthesized by EHR-M-GAN (or EHR-M-GANcond) trained on the real training data A. b Data augmentation scenarios. Subsequent experiments are trained on \({A}_{Tr}^{{\prime} }\), B, or \({A}_{Tr}^{{\prime} }\cup B\) and then tested on \({A}_{Te}^{{\prime} }\). In the traditional approach, results based on Train on Real, Test on Real (TRTR) and Train on Synthetic, Test on Real (TSTR) are compared to assess the generalizability of the synthetic data. In the data augmentation approach, i.e., Train on Synthetic and Real, Test on Real (TSRTR), we either augment the real data \({A}_{Tr}^{{\prime} }\) with α (augmentation ratio, 0 to 50%) of the synthetic samples B, or augment the synthetic samples B with β (0 to 50%) of the real data \({A}_{Tr}^{{\prime} }\).

First, as a dearth of data potentially degrades the performance of downstream classifiers, and given that the real data has a limited and fixed sample size, we investigate whether adding synthetic EHR data provided by EHR-M-GAN and EHR-M-GANcond can improve the training of downstream classifiers. The ratio α indicates the proportion of synthetic data (see Fig. 5b) used to augment the real data to improve the quality and robustness of the downstream classifiers. α is set to 10%, 25%, and 50%, representing the availability of synthetic samples provided for augmentation.

Second, the acquisition of healthcare data is generally time-consuming and expensive; another overarching goal for the generative model is therefore to minimize the effort of data collection. In this section, we investigate whether high-fidelity synthetic data can offer a viable solution for boosting downstream classifiers' performance when the availability of real data is limited. This allows us to understand whether the sample size required for real data collection can be reduced while maintaining sufficient predictive power through the use of synthetic data. In this experiment, the synthetic data B is given (to emulate the scenario where synthetic datasets are available for a particular clinical research purpose) and is then combined with limited real data (as might be collected during a clinical trial) to train the downstream classifiers (i.e., the synthetic data is augmented with limited real data). By implementing EHR-M-GAN or EHR-M-GANcond in TSRTR, we investigate the proportion of the real data \({A}_{Tr}^{{\prime} }\) (ratio β) required to maintain the same performance as in TRTR based on the entire synthetic dataset B (see Fig. 5b).
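For clarity, a minimal sketch of how the two TSRTR training sets can be assembled is shown below; the array shapes, random seed, and helper name build_training_sets are illustrative assumptions.

```python
import numpy as np

def build_training_sets(real_subtrain, synthetic, alpha=0.25, beta=0.25,
                        rng=np.random.default_rng(0)):
    """Assemble the training sets for the two TSRTR scenarios:
    (i) real sub-train data augmented with a fraction alpha of the
        synthetic samples, and
    (ii) the full synthetic set augmented with a fraction beta of the
        real sub-train data."""
    n_syn = int(alpha * len(synthetic))
    pick_syn = rng.choice(len(synthetic), n_syn, replace=False)
    scenario_alpha = np.concatenate([real_subtrain, synthetic[pick_syn]])

    n_real = int(beta * len(real_subtrain))
    pick_real = rng.choice(len(real_subtrain), n_real, replace=False)
    scenario_beta = np.concatenate([synthetic, real_subtrain[pick_real]])
    return scenario_alpha, scenario_beta
```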

Traditional approach

Table 4 compares the classification performance for predicting forthcoming medical interventions in the ICU under the TRTR and TSTR settings. As expected, the optimal AUROCs are achieved by the classifiers trained on real data. In comparison, the classifiers trained on the synthetic data provided by the proposed models achieve similar performance. More specifically, synthetic data generated by EHR-M-GANcond demonstrates better generalizability than EHR-M-GAN in downstream applications, such as the task of predicting mechanical ventilation on the HiRID dataset.

Table 4 Downstream task evaluation.

Compared with the baseline models, the proposed EHR-M-GAN shows improved performance in TSTR, as it models the distribution of mixed-type EHRs more accurately while preserving the temporal correlations in the heterogeneous timeseries through its dependency-learning components. The results indicate that the interdependency between the mixed-type EHRs is only weakly captured by GANVAE, as the two streams of inputs are trained in parallel and separately. GANUnified attempts to capture the temporal correlations of mixed-type EHRs by jointly modeling their underlying distribution in a unified network. However, its unified architecture limits the model's capacity to learn the marginal distribution of each data type; the resulting quality of the synthetic EHRs is impaired, and so is its performance in TSTR.

Data augmentation approach (with ratio α)

The results in Table 5 demonstrate that classifiers boosted by EHR-M-GAN consistently outperform TRTR (see Table 4) at an augmentation ratio of 50%. In comparison, an augmentation ratio of only 25% is needed for EHR-M-GANcond to achieve improved results. For example, the classifier trained on MIMIC-III to predict the status of Vasopressor with augmentation ratio α set to 50% significantly increases the AUROC by 6% compared to the classifier trained using only the real data (p < 0.05). Our experimental results demonstrate that the proposed models can be used for data augmentation to overcome the issue of data scarcity and subsequently improve classifier performance.

Table 5 Downstream task evaluation with data augmentation ratio α.

Data augmentation approach (with ratio β)

On the other hand, as shown in Table 6, by augmenting with the synthetic data provided by EHR-M-GAN, only approximately 50% of the real data is required to keep the classification AUROCs on par with, or even significantly better than, those obtained by fully exploiting the real data under TRTR. For EHR-M-GANcond, the proportion of real data needed to maintain comparable predictive power is further reduced to 25%, which corresponds to a 75% reduction in the sample size required for real data collection. Overall, the results in Table 6 demonstrate that, by exploiting only a limited proportion of the real data, EHR-M-GAN and EHR-M-GANcond can robustly maintain the level of prediction performance, thereby alleviating the need to acquire clinical data at scale.

Table 6 Downstream task evaluation with data augmentation ratio β.

Privacy risk evaluation

Patient privacy is a major concern with regard to sharing electronic health records by any means. In contrast to data anonymization, generative models avoid an explicit one-to-one mapping to the underlying original data. However, GANs could still raise concerns about information leakage if they simply "memorize" the training data, or synthesize samples nearly identical to the real samples (often due to mode collapse). In that case, sensitive medical information (e.g., a national insurance number) belonging to a specific patient in the training data could be retrieved during the generative stage, posing challenges for preserving privacy in downstream applications.

In this section, we first quantify the vulnerability of EHR-M-GAN to an adversary's membership inference attack, also known as presence disclosure50,51. The threat model follows membership inference against GANs in the black-box setting50. The attacker is assumed to possess complete knowledge of the full patient record set P, a subset of which is used to train the GAN. During the experiment, the number of samples in the subset used for training EHR-M-GAN is varied to investigate the impact of the availability of training data on the success of the attacker (see Fig. 6a). By observing the synthetic patient records from EHR-M-GAN, the adversary's goal is to determine whether a single known record x in the patient record set P was part of the data used to train EHR-M-GAN. If EHR-M-GAN simply "memorizes" the training data and can only generate synthetic samples (nearly) identical to the real samples, it would be straightforward for the adversary to identify samples that were used as training data. Based on whether the attacker correctly infers whether a given record was in the GAN's training data, accuracy and recall can be calculated.
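For intuition, the sketch below implements a simplified, distance-based black-box attack together with the accuracy and recall computation; it is an illustrative baseline under assumed flattened inputs and a fixed threshold, not the exact attack procedure of ref. 50.

```python
import numpy as np

def distance_attack(candidates, synthetic, threshold):
    """Simplified black-box membership inference: claim a candidate record
    was in the GAN's training set if its distance to the nearest synthetic
    sample falls below a threshold. Inputs are flattened timeseries of
    shape (n, T*D)."""
    claims = []
    for x in candidates:
        nearest = np.min(np.linalg.norm(synthetic - x, axis=1))
        claims.append(nearest < threshold)
    return np.array(claims)

def attack_metrics(claims, is_member):
    """Accuracy and recall of the attacker's membership claims, where
    is_member is a boolean array marking true training-set membership."""
    accuracy = np.mean(claims == is_member)
    recall = np.sum(claims & is_member) / max(np.sum(is_member), 1)
    return accuracy, recall
```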

Fig. 6: Privacy risk evaluation of EHR-M-GAN on MIMIC-III dataset.
figure 6

a Membership inference attack. Membership inference attack against EHR-M-GAN vs. the percentage of the training data. Accuracy and recall are used to evaluate the success rate of such attacks. Lower accuracy or recall indicates less private information disclosed to the attacker by the generative model (0.5 can be seen as the random-guess baseline, at which the GAN provides strong privacy guarantees). Recall indicates the proportion of records successfully claimed by the attacker among all the real data used to train the GAN models. Error bars represent the standard error. b Differential privacy. Performance of medical intervention prediction tasks, under various differential privacy (DP) budgets, measured by Macro-AUROC.

As shown in Fig. 6a, when 90% of the training data is used to develop EHR-M-GAN, the attacker achieves a recall of 0.533 and an accuracy of 0.527 in recovering which records were used for training. This is very close to a random guess (i.e., 0.5), indicating that EHR-M-GAN is sufficiently robust against the membership inference attack. In other words, patient samples used in EHR-M-GAN's training are not recoverable by the threat model. On the other hand, as the percentage of training data decreases, both the accuracy and the recall of the membership inference attack rise; an accuracy of 0.624 and a recall of 0.732 are reached with 20% of the training data. This offers a guideline for future GAN development: incorporating more training data can make the generator less susceptible to such attacks. It is also consistent with conclusions drawn from membership inference experiments in prior research52.

The concept of differential privacy (DP)53, a rigorous mathematical definition of privacy, has emerged as the prevailing notion for statistically analyzing data privacy. (ϵ, δ)-differential privacy is guaranteed for a model \({{{\mathcal{M}}}}\) if, for any pair of adjacent datasets D and \({D}^{{\prime} }\) (differing on a single patient record), it holds that \(P[{{{\mathcal{M}}}}(D)\in S]\le {e}^{\epsilon }P\left[{{{\mathcal{M}}}}\left({D}^{{\prime} }\right)\in S\right]+\delta\). In our case, \({{{\mathcal{M}}}}(\cdot )\) is the GAN model trained on D or \({D}^{{\prime} }\), and S is any subset of the possible outcomes of the generative process. DP mechanisms perturb the computation so that the maximum variation of the output when any single individual is included in or excluded from the dataset is bounded. In practice, recent work on differentially private deep learning has benefited from the differentially private stochastic gradient descent (DP-SGD) algorithm. DP-SGD enforces DP by clipping gradients and adding noise during SGD, thereby ensuring that the influence of any single record in the training dataset on the learned parameters is bounded within the DP budget. In this section, (ϵ, δ)-differential privacy is implemented in EHR-M-GAN using TensorFlow Privacy. We then perform the same downstream tasks of medical intervention prediction using synthetic data generated from the DP-guaranteed EHR-M-GAN, and compare its performance with TSTR (as shown in Table 4).
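To make the mechanism concrete, the following NumPy sketch shows the core DP-SGD aggregation step (per-example clipping followed by Gaussian noise); the clipping bound and noise multiplier are illustrative values, and in practice libraries such as TensorFlow Privacy perform these steps inside their DP optimizers.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                rng=np.random.default_rng(0)):
    """One differentially private gradient aggregation step: clip each
    per-example gradient to a maximum L2 norm, sum the clipped gradients,
    and add Gaussian noise scaled to the clipping bound."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)
```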

Figure 6b shows the TSTR performance of EHR-M-GAN under differential privacy guarantees with varying budgets ϵ (δ fixed at ≤0.001). The value of ϵ determines how strict the privacy guarantee is, with smaller values indicating a stronger privacy restriction. As shown in Fig. 6b, the performance of the downstream tasks based on the synthetic data generated by EHR-M-GAN improves as the DP budget is relaxed (ϵ increases). We observe that the AUROC of the DP-bounded EHR-M-GAN remains at an acceptable level even under strict privacy settings. For example, the AUROC for predicting the treatment of Vasopressor remains at 0.714 (AUROC = 0.725 under TRTR) even when ϵ decreases to 4, an empirically reasonable value for implementing DP in practice54. Future work on privacy-preserving GANs under DP guarantees is expected, in which the fidelity of the synthetic data could be further improved without compromising privacy.

Discussion

In this study, we propose a generative adversarial network entitled EHR-M-GAN, aimed at mitigating the challenge of synthesizing longitudinal EHRs with mixed data types. A comprehensive list of evaluation metrics is introduced for systematic assessment of the fidelity, correlation, utility, and privacy of the synthesis model. First, both EHR-M-GAN and its conditional version, EHR-M-GANcond, demonstrate consistent improvements over the state-of-the-art benchmark GANs in synthesizing timeseries data with high fidelity. This indicates that the distributional characteristics of the EHR timeseries are well preserved in the synthetic data provided by EHR-M-GAN, ensuring its usability for clinical data sharing. Second, as opposed to previous models confined to synthesizing only one specific type of data, EHR-M-GAN can produce mixed-type timeseries and successfully capture the temporal dynamics and correlations between features. By accurately reconstructing the interdependencies and complex clinical relationships between features, downstream studies such as association analysis and outcome prediction can be supported. Notably, the proposed models also outperform the GAN variants that allow mixed-type inputs in the ablation study, indicating that the components of EHR-M-GAN are effective in synthesizing mixed-type timeseries with high fidelity while successfully reconstructing the interdependencies between them. Then, in the downstream task evaluation, taking the prediction of medical interventions in fast-paced critical care environments as an exemplar, the results demonstrate the broad applicability of our model for developing ML-based decision support tools through data augmentation. Lastly, the generative capability of the proposed model avoids a "one-to-one" mapping to the original data and enables the collaborative use and sharing of EHRs by creating realistic novel samples. The assessment of privacy risks further demonstrates that the synthetic data provided by EHR-M-GAN can protect the sensitive information in patient records while maintaining an acceptable level of data utility.

The results in our study have several notable implications for the synthesis of EHR data. First, as the proposed model can provide synthetic longitudinal EHRs of various data types while preserving their underlying correlations, it is now feasible to use the synthesized data to improve the performance of ML models for downstream applications such as predicting the next intervention, or understanding disease dynamics and patient phenotyping, based on both the continuous and discrete components of EHR timeseries55,56. Second, the experimental results indicate that the quality of the synthetic EHR data can be improved by the integration of mixed-type information, in contrast to the benchmarks that utilize single-type data for learning. This also enables us to mimic how information is presented in clinical practice. Furthermore, we can generate condition- or outcome-specific patient trajectories along with corresponding interventions, to facilitate clinical prediction and decision-making. Third, although privacy-utility tradeoffs exist, the synthetic EHR data provided by the proposed model leads to negligible privacy risks under membership inference attacks. This paves the way for a series of applications in clinical research, including, but not limited to, enabling the development of ML models through access to synthetic data, overcoming the paucity of medical data, and improving the robustness of ML algorithms through data augmentation.

Due to the heterogeneous nature of EHR data, there is a need for synthesizing mixed-type EHR timeseries in clinical scenarios beyond the ICU setting of our empirical evaluation. For example, patients' encounters in hospitals are documented as structured EHRs recorded in temporal order. Each visit is typically associated with corresponding medical events in the form of discrete-valued ICD codes27 and continuous-valued measurements. These mixed-type EHR timeseries capture a patient's health status and align better with the clinical decision-making process than single-type data alone. Therefore, developing GANs for mixed-type EHR generation has the potential to pave the way for complex deep-learning systems capable of integrating information from various sources. However, it is worth noting that the validation of our proposed model is based on critical care settings with limited feature dimensionality and can only serve as a proof of concept. When extending the proposed model to other clinical settings, such as synthesizing ICD codes with hundreds or thousands of feature dimensions27, the scalability and utility of our proposed model in dealing with the enlarged, sparse feature space need further investigation.

There are limitations to the current work. First, data curation strategies for clinical timeseries, including truncating, smoothing, and imputation, are applied before the EHR timeseries are used to train the generative models. During data preprocessing, we first extract timeseries of fixed duration (i.e., 24 h before the ICU clinical endpoints), then aggregate patients' physiological and intervention signals hourly based on their mean statistics, and finally complete the missing values in the timeseries using the "Simple Imputation" approach57. Although these preprocessing steps are commonly used in clinical research in critical care settings46, the proposed model cannot handle irregular time intervals between signals or missing values within the timeseries. However, handling irregular timestamps when synthesizing clinical events in EHRs could be useful for time-aware outcome prediction in downstream tasks27. Modeling such time intervals can be non-trivial, as the determining factors sometimes go beyond patients' physiological status, for example hospital resource allocation. Synthesizing timeseries that incorporate missing values could also be beneficial in real-world application scenarios. As ML models are sometimes sensitive to data missingness, imputing incomplete EHR data using generative approaches could improve the performance of ML models and has become an area of active research58. Furthermore, as evaluations are performed on clinical timeseries of fixed length, the model's scalability to timeseries of varying lengths is not assessed. Recent studies have found that the quality of synthetic longitudinal data degenerates over time, also known as the "drift problem"28. Such problems with long sequences should be recognized and mitigated with techniques such as conditional fuzzing and regularization methods28, in both the generation and evaluation steps.

The evaluation of GANs remains a challenging task. Recent findings suggest that systematic assessment of EHR synthesizers is critical before their application in different use cases59. In this study, a comprehensive evaluation list is provided with regard to the fidelity, correlation, utility, and privacy of the synthesis models. It is also worth noting that evaluation metrics should be properly chosen and implemented based on the purpose of the task; otherwise, they may lead to biased results. For example, recent findings28 have reported that the traditional implementation of the discriminative score, which trains the critic from randomly initialized parameters, though widely used29, may lead to unreliable results. This evaluation metric has been improved for more robust assessment by initializing the critic with the parameters of the trained generative model.

Finally, the conditional aspect of our model is currently limited, as it cannot generate patient-specific EHRs conditioned on information at a more granular level. Even though the proposed conditional GAN can synthesize a subgroup of patients with clinician-specified target outcomes or statuses, it is still limited in incorporating personalized information during conditional generation. Future work on GANs for healthcare data can be extended to patient-level EHR generation, such as synthesizing counterfactual information for a target patient for treatment effect estimation60,61. Ultimately, by constructing "synthetic twins" of patients using GANs, the synthesis tool can become more generalizable for precision medicine and support clinical decision-making in delivering personalized healthcare.

Synthetic data provides an alternative to sharing real patient data while preserving patient privacy. The results of our study demonstrate that the proposed EHR-M-GAN and EHR-M-GANcond can generate realistic longitudinal EHR timeseries with mixed data types. By providing synthetic EHR data that better mimic the nature of clinical decision-making, the proposed model can enable faster development of AI-driven clinical tools with increased robustness and adaptability. In addition to improved performance against the existing state-of-the-art benchmark models, augmentation with synthetic data during training boosts predictive performance in downstream clinical tasks. EHR-M-GAN can help eliminate barriers to data acquisition for healthcare studies, thereby overcoming the challenges posed by the paucity of medical data available for research use. Despite the novelty of this study in filling the research gap for synthesizing longitudinal EHRs in mixed-type settings, we acknowledge that a gap remains between real EHR data and the synthetic counterparts produced by current generative methods. Developing advanced EHR synthesizers, especially in mixed-type settings, therefore remains an active area for future research.

Methods

Dataset description

The following three publicly accessible ICU datasets with de-identified EHR data are used to evaluate the performance of EHR-M-GAN in generating longitudinal data:

  • MIMIC-III (Medical Information Mart for Intensive Care)36—a freely accessible database comprising EHR data, with 312 million observations, associated with approximately 60,000 patients admitted to the ICU at Beth Israel Deaconess Medical Center.

  • eICU (eICU Collaborative Research Database)37—a multi-center critical care database containing data, with 827 million observations, for over 200,000 ICU admissions from 208 hospitals located throughout the United States.

  • HiRID (High time-resolution ICU dataset)38—a high-resolution ICU dataset containing more than 3 billion observations from almost 34,000 ICU admissions, monitored at the Department of Intensive Care Medicine, Bern University Hospital, Switzerland.

All these critical care databases include vital sign measurements, laboratory tests, treatment information, survival records, and other routinely collected data from hospital EHR systems. From these clinical observations, we featurize the patient trajectories as the combination of continuous-valued physiological timeseries (such as heart rate, oxygen saturation, and measurements from blood gas tests) and discrete-valued medical intervention timeseries (such as the use of therapeutic devices or intravenous medications). Temporal trajectories 24 h prior to patients' ICU endpoints (discharge or death) are extracted for the three critical care databases. Data are preprocessed following an open-source framework—MIMIC-Extract46—where patients' physiological and intervention signals are aggregated hourly for denser representations. Details of the data curation, including the cohort selection criteria, the full list of features, and the imputation method, are explained in Supplementary Note 2. The summary statistics of the finalized cohorts for the three databases are shown in Table 7.

Table 7 Summary of the cohorts after preprocessing on three critical care databases.

Problem formulation

The longitudinal patient EHR dataset is denoted as \({{{\mathscr{D}}}}={\{({{{{\bf{x}}}}}_{i,1:{T}_{i}})\}}_{i = 1}^{N}\), with each record (e.g., an individual patient) indexed by i ∈ {1, 2, ..., N}. We consider the i-th instance tuple \({{{{\bf{x}}}}}_{i,1:{T}_{i}}=\{{{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{C}}}}},{{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{D}}}}}\}\) to consist of two components (i.e., two types of data). Let \({{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{C}}}}}\in {{\mathbb{R}}}^{| J| }\) denote the J-dimensional continuous-valued timeseries, such as physiological signals from real-time bedside monitors, and let \({{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{D}}}}}\in {{\mathbb{Z}}}^{| K| }\) denote the K-dimensional discrete-valued timeseries, such as life-support interventions whose categorical value indicates their status (presence or absence).
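For illustration only, a mixed-type record can be thought of as two aligned arrays, as in the following sketch; the dimensions are placeholders rather than the actual cohort sizes.

```python
import numpy as np

# Illustrative shapes: N patients, each with T hourly timesteps,
# J continuous physiological features and K binary intervention channels.
N, T, J, K = 1000, 24, 10, 4

record = {
    "continuous": np.zeros((N, T, J), dtype=np.float32),  # x^C, real-valued
    "discrete":   np.zeros((N, T, K), dtype=np.int8),     # x^D, 0/1 status
}
```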

Challenges in mixed-type timeseries generation

There are two main challenges when synthesizing mixed-type EHR timeseries. First, GANs have serious limitations on the types of data they can model30. Specifically, as GANs require the generator and discriminator to be fully differentiable, generating discrete-valued timeseries with traditional GAN architectures raises problems during backpropagation, as no direct gradient can be provided31,32. It is therefore intrinsically difficult to model the underlying joint distribution of mixed-type timeseries within a single unified framework. Second, as mixed-type timeseries are correlated (such as the correlations between ICU patients’ physiological signals and treatment status in the critical care setting), it is important to model the interdependencies among heterogeneous types of timeseries.

Intuition behind EHR-M-GAN

First, to jointly model the distribution of continuous-valued and discrete-valued timeseries using GANs, we build the generative model on the latent space encoded by VAE networks. Instead of directly synthesizing discrete-valued timeseries, which would break backpropagation in GANs, the generator first synthesizes latent representations through which gradients can flow directly, thereby satisfying the prerequisite that the GAN architecture be fully differentiable. The synthetic latent representations for both types of data can then be decoded into raw timeseries using the decoders of the VAEs.

Even though the aforementioned network architecture enables joint modeling of the mixed-type data distribution, it still lacks the capability to capture the inter-dependencies in heterogeneous data. To address this second issue, we devised a dual-VAE module for the pretraining step and a sequentially coupled generator module for the generation step. The dual-VAE incorporates multiple loss constraints, previously adopted in domains such as self-supervised learning (SSL), timeseries representation learning, and domain adaptation, to extract useful hierarchical representations from heterogeneous but correlated data types. The sequentially coupled generator module replaces the traditional LSTM cell with the bilateral LSTM (BLSTM) cell we propose, in which the “communication” between the two types of information is introduced into the network. The temporal dynamics between the mixed-type data are therefore preserved during generation.

Network architecture

As illustrated above, EHR-M-GAN can be factorized into two key components (see Fig. 1b): (1) a dual-VAE framework for learning the shared latent space representations; (2) an RNN-based sequentially coupled generator and its corresponding sequence discriminators.

As shown in Fig. 1b, during the pretraining stage, both continuous-valued and discrete-valued temporal trajectories are first jointly mapped into a shared latent space using the dual-VAE component (Step 1). Then, the sequentially coupled generator in EHR-M-GAN produces synthetic latent representations (Step 2), which can be recovered into features in the observational space by the pretrained decoders of the dual-VAE (Step 3). Finally, the adversarial loss is computed from the discriminators’ outputs and backpropagated to update the network (Step 4). The following sections discuss these components in turn.
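For orientation, the following sketch traces Steps 2–3 of this pipeline with single-layer placeholder modules standing in for the coupled generator and the pretrained decoders; the real modules are recurrent and are described in the following sections, and all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

T, latent_dim, J, K = 24, 16, 10, 5
dec_cont = nn.Linear(latent_dim, J)                                # stand-in for Dec^C
dec_disc = nn.Sequential(nn.Linear(latent_dim, K), nn.Sigmoid())   # stand-in for Dec^D
generator = nn.Linear(2 * latent_dim, 2 * latent_dim)              # stand-in for the coupled generator

# Step 2: generate synthetic latent representations from noise.
noise = torch.rand(T, 2 * latent_dim)              # (v^C, v^D) concatenated, sampled from U(0, 1)
z_hat = generator(noise)
z_hat_cont, z_hat_disc = z_hat.chunk(2, dim=-1)

# Step 3: decode the latents back into the observational space with the pretrained decoders.
x_hat_cont = dec_cont(z_hat_cont)                  # synthetic continuous-valued timeseries (T, J)
x_hat_disc = (dec_disc(z_hat_disc) > 0.5).float()  # synthetic discrete-valued timeseries (T, K)
```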

Dual-VAE pretraining for shared latent space representations

A prerequisite for successfully training EHR-M-GAN to generate reversible latent codes is the assumption that, for the same patient indexed by i, both \({{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{C}}}}}\) and \({{{{\bf{x}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{D}}}}}\) can be encoded into the same latent space \({{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}\subset {{\mathbb{R}}}^{| S| }\), where |S| denotes its dimension. For simplicity, the subscript i is omitted throughout most of the paper. To achieve this, we propose a dual-VAE framework, which exploits two VAE networks to encode both continuous and discrete multivariate timeseries into dense representations within \({{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}\) under multiple constraints.

Supplementary Fig. 2 diagrams the details of the proposed dual-VAE framework for learning the shared latent representations. We start with training two encoders, i.e., \({{Enc}}^{{{{\mathcal{C}}}}}\): \({\phi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{X}}}}}^{{{{\mathcal{C}}}}}}\to {\phi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}}\) and \({{Enc}}^{{{{\mathcal{D}}}}}\): \({\phi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{X}}}}}^{{{{\mathcal{D}}}}}}\to {\phi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}}\), with the embedding functions:

$${{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{C}}}}}={{Enc}}^{{{{\mathcal{C}}}}}({{{{\bf{x}}}}}_{1:T}^{{{{\mathcal{C}}}}})\quad \quad {{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{D}}}}}={{Enc}}^{{{{\mathcal{D}}}}}({{{{\bf{x}}}}}_{1:T}^{{{{\mathcal{D}}}}})$$
(3)

After passing data from \({{{{\mathcal{X}}}}}^{{{{\mathcal{C}}}}}\) and \({{{{\mathcal{X}}}}}^{{{{\mathcal{D}}}}}\) through two encoders, a pair of embedding vectors \(({{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{C}}}}},{{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{D}}}}})\) in the shared latent space \({{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}\) can be obtained. Then the decoders for both domains \({{Dec}}^{{{{\mathcal{C}}}}}:{\psi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}}\to {\psi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{X}}}}}^{{{{\mathcal{C}}}}}}\) and \({{Dec}}^{{{{\mathcal{D}}}}}:{\psi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}}\to {\psi }_{{{{\mathcal{T}}}}\times {{{{\mathcal{X}}}}}^{{{{\mathcal{D}}}}}}\) further reconstruct features based on the latent embeddings using mapping functions that operate in the opposite direction:

$${\tilde{{{{\bf{x}}}}}}_{1:T}^{{{{\mathcal{C}}}}}={{Dec}}^{{{{\mathcal{C}}}}}({{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{C}}}}})\quad \quad {\tilde{{{{\bf{x}}}}}}_{1:T}^{{{{\mathcal{D}}}}}={{Dec}}^{{{{\mathcal{D}}}}}({{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{D}}}}})$$
(4)

Also, to incentivize the dual-VAE to better bridge the gap between the domains of the mixed-type timeseries, we enforce a weight-sharing constraint62,63 within specific layers of both the encoder pair and the decoder pair (see Supplementary Note 1 for details).
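A minimal PyTorch sketch of the paired recurrent encoders with a shared projection layer is given below. Treating the final latent projection as the shared layer is an assumption made for illustration only; the actual layer choice is specified in Supplementary Note 1.

```python
import torch
import torch.nn as nn

class DualEncoders(nn.Module):
    """Enc^C and Enc^D mapping mixed-type timeseries into a shared latent space.
    The final projection to the latent mean/log-variance is shared between the two
    encoders, a simple instance of the weight-sharing constraint (layer choice assumed)."""
    def __init__(self, cont_dim, disc_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn_cont = nn.LSTM(cont_dim, hidden_dim, batch_first=True)
        self.rnn_disc = nn.LSTM(disc_dim, hidden_dim, batch_first=True)
        self.shared_head = nn.Linear(hidden_dim, 2 * latent_dim)  # shared weights -> (mu, logvar)

    def forward(self, x_cont, x_disc):
        h_cont, _ = self.rnn_cont(x_cont)          # (B, T, hidden_dim)
        h_disc, _ = self.rnn_disc(x_disc)
        mu_c, logvar_c = self.shared_head(h_cont).chunk(2, dim=-1)
        mu_d, logvar_d = self.shared_head(h_disc).chunk(2, dim=-1)
        # Re-parameterization: z = mu + sigma * eps, applied per timestep.
        z_cont = mu_c + torch.exp(0.5 * logvar_c) * torch.randn_like(mu_c)
        z_disc = mu_d + torch.exp(0.5 * logvar_d) * torch.randn_like(mu_d)
        return (z_cont, mu_c, logvar_c), (z_disc, mu_d, logvar_d)
```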

In the following subsections, we define the multiple loss constraints used to optimize the dual-VAE, including the ELBO loss, matching loss, and contrastive loss, as well as the semantic loss for EHR-M-GANcond. Among these, the ELBO loss ensures that the mixed-type timeseries can be reconstructed after being encoded into latent representations. The matching loss ensures that heterogeneous types of features from a single patient share contexts during representation learning (instance-wise). The contrastive loss ensures that patients with similar trajectories stay close to each other in the latent space (population-wise). Finally, the semantic loss used in EHR-M-GANcond encourages patients with the same conditional labels (e.g., outcomes) to share similar latent representations. The intuition behind each objective is discussed in turn.

Evidence lower bound (ELBO)

We first incorporate the standard VAE loss, whose optimization objective is the evidence lower bound (ELBO). The VAE assumes a spherical Gaussian prior over the distribution of the latent embeddings, from which features can be reconstructed by sampling. The re-parameterization trick enables differentiable stochastic sampling and network optimization. For the encoder and decoder of the dual-VAE in domain \(d\in \{{{{\mathcal{C}}}},{{{\mathcal{D}}}}\}\), the objective function is defined as:

$$\begin{array}{ll}{{{{\mathcal{L}}}}}_{d}^{{{{\rm{ELBO}}}}}\,=\,-{{\mathbb{E}}}_{{q}_{\phi }({{{\bf{z}}}}| {{{\bf{x}}}})}[\log {p}_{\psi }({{{\bf{x}}}}| {{{\bf{z}}}})]\\ \qquad\qquad\,\,+\,{\beta }_{{{{\rm{KL}}}}}{D}_{{{{\rm{KL}}}}}({q}_{\phi }({{{\bf{z}}}}| {{{\bf{x}}}})\parallel {p}_{\psi }({{{\bf{z}}}}))\end{array}$$
(5)

where \({{{\bf{z}}}} \sim {Enc}({{{\bf{x}}}})\,\triangleq\, {q}_{\phi }({{{\bf{z}}}}| {{{\bf{x}}}}),\tilde{{{{\bf{x}}}}} \sim {Dec}({{{\bf{z}}}})\,\triangleq\, {p}_{\psi }({{{\bf{x}}}}| {{{\bf{z}}}})\), and DKL is the Kullback–Leibler divergence. The first term in Eq. (5) is the expected log-likelihood, which penalizes deviations in reconstructing the inputs, while the second, KL-divergence term regularizes the latent distribution toward its Gaussian prior (normally chosen to be \({{{\mathcal{N}}}}({{{\bf{0}}}},{{{\boldsymbol{I}}}})\)). βKL is a hyperparameter balancing the two terms.
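For concreteness, a minimal PyTorch sketch of Eq. (5) is given below; the Gaussian (MSE) and Bernoulli (BCE) likelihood choices and the summation over timesteps are assumptions made for illustration, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def elbo_loss(x, x_recon, mu, logvar, beta_kl=1.0, discrete=False):
    """Eq. (5): negative expected log-likelihood plus beta-weighted KL to a N(0, I) prior.
    x, x_recon, mu, logvar: (B, T, dim) tensors from the corresponding domain's VAE."""
    if discrete:
        recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    else:
        recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta_kl * kl
```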

Matching loss

Representations derived from the same patient are assumed to capture a shared context. Embedding vectors \(({{{{\bf{z}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{C}}}}},{{{{\bf{z}}}}}_{i,1:{T}_{i}}^{{{{\mathcal{D}}}}})\) projected from the same patient i should therefore be positioned close to each other in the shared latent space (see Supplementary Fig. 2). In this study, we borrow the concept of matching loss from domain alignment in domain adaptation (DA), which enables efficient representation learning across domains/modalities64. Here, the matching loss ensures that a low-dimensional latent space can be shared between heterogeneous features. Hence, a pairwise matching loss is incorporated to drive the encoders to minimize the distance within corresponding representation pairs. In the low-dimensional Euclidean space, we optimize the network using the following objective:

$${{{{\mathcal{L}}}}}^{{{{\rm{Match}}}}}={{\mathbb{E}}}_{{{{\bf{z}}}} \sim {p}_{{{{\bf{z}}}}}}[\mathop{\sum}\limits_{t\in {{{\mathcal{T}}}}}| | {{{{\bf{z}}}}}_{t}^{{{{\mathcal{C}}}}}-{{{{\bf{z}}}}}_{t}^{{{{\mathcal{D}}}}}| {| }^{2}]$$
(6)

The pairwise matching loss achieves its optimum when the distance proxy \({{{{\mathcal{L}}}}}^{{{{\rm{Match}}}}}\) becomes zero.
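A direct translation of Eq. (6) into PyTorch could look as follows; batch-first latent tensors of shape (B, T, |S|) and averaging over the batch are assumptions for illustration.

```python
import torch

def matching_loss(z_cont, z_disc):
    """Eq. (6): squared Euclidean distance between the paired latent trajectories,
    summed over timesteps and latent dimensions, averaged over the batch."""
    return ((z_cont - z_disc) ** 2).sum(dim=(1, 2)).mean()

# Usage (illustrative shapes): two aligned latent trajectories for the same patients.
z_c, z_d = torch.randn(8, 24, 16), torch.randn(8, 24, 16)
print(matching_loss(z_c, z_d))
```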

Contrastive loss

On the flip side, the pairwise reconstruction error measured by the matching loss (i.e., intra-correlations within one instance) neglects the commonalities present across patients (inter-correlations of the data)65. To guarantee a sufficient bound for representation learning, we incorporate a contrastive loss as another distance metric to capture the inter-correlations across the population.

Contrastive learning is a concept recently popularized in self-supervised learning (SSL)66, which aims to capture intrinsic patterns from input data without human annotations. In this study, we instantiate the contrastive loss via NT-Xent, proposed by Chen et al. in SimCLR67. The core idea of contrastive learning is to encourage the network to pull positive pairs closer together and push negative pairs apart in the latent space. We adapt the corresponding auxiliary task for computing the contrastive loss to the scenario of learning representations from mixed-type timeseries: the task is to determine whether a set of representations transformed from the observational space belong to the same patient, which yields the corresponding positive pairs (true) and negative pairs (false).

For patient data with N records, we can obtain N pairs of latent representations from the encoders of the dual-VAE. For the patient indexed by i, \({{{{\bf{h}}}}}_{i}^{{{{\mathcal{C}}}}}\) and \({{{{\bf{h}}}}}_{i}^{{{{\mathcal{D}}}}}\) denote the embeddings derived from the continuous-valued and discrete-valued observational spaces, respectively. Owing to the symmetric architecture of the dual-VAE, we use d and \({d}^{{\prime} }\) to represent the two distinct domains, i.e., \(d,{d}^{{\prime} }\in \{{{{\mathcal{C}}}},{{{\mathcal{D}}}}\}\) and \(d\,\ne\, {d}^{{\prime} }\). The positive pair for patient i is therefore \(({i}^{d},{i}^{{d}^{{\prime} }})\), while the other 2(N − 1) samples are regarded as negatives. The contrastive loss for a positive pair \(({i}^{d},{i}^{{d}^{{\prime} }})\) is then defined as:

$${{{{\mathcal{L}}}}}_{{i}^{d},{i}^{{d}^{{\prime} }}}^{{{{\rm{Contra}}}}}=-\log \frac{\exp \left({{\mathrm{sim}}}\,\left({{{{\bf{h}}}}}_{{i}^{d}},{{{{\bf{h}}}}}_{{i}^{{d}^{{\prime} }}}\right)/\tau \right)}{\mathop{\sum }\nolimits_{{i}^{d{d}^{{\prime} }} = 1}^{2N}{{\mathbb{1}}}_{[{i}^{d{d}^{{\prime} }}\ne {i}^{d}]}\exp \left({{\mathrm{sim}}}\,\left({{{{\bf{h}}}}}_{{i}^{d}},{{{{\bf{h}}}}}_{{i}^{d{d}^{{\prime} }}}\right)/\tau \right)}$$
(7)

where \({{\mathrm{sim}}}\,(u,v)={u}^{T}v/\parallel u\parallel \parallel v\parallel\) denotes the cosine similarity between two vectors, τ > 0 denotes a temperature hyperparameter, \({{\mathbb{1}}}_{[n\ne m]}\in \{0,1\}\) is an indicator evaluating to 1 iff n ≠ m, and \({i}^{d{d}^{{\prime} }}\in \{1,2,...,2N\}\) indexes the latent embeddings from both data types. The final contrastive loss is computed across all N positive pairs, taken in both directions \(({i}^{d},{i}^{{d}^{{\prime} }})\) and \(({i}^{{d}^{{\prime} }},{i}^{d})\), and is defined as:

$${{{{\mathcal{L}}}}}^{{{{\rm{Contra}}}}}=\frac{1}{2N}\mathop{\sum }\limits_{i=1}^{N}\left[{{{{\mathcal{L}}}}}_{{i}^{d},{i}^{{d}^{{\prime} }}}^{{{{\rm{Contra}}}}}+{{{{\mathcal{L}}}}}_{{i}^{{d}^{{\prime} }},{i}^{d}}^{{{{\rm{Contra}}}}}\right]$$
(8)
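A compact PyTorch sketch of Eqs. (7)–(8) is shown below; treating \({{{{\bf{h}}}}}_{i}^{{{{\mathcal{C}}}}}\) and \({{{{\bf{h}}}}}_{i}^{{{{\mathcal{D}}}}}\) as per-patient summary embeddings of shape (N, dim) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(h_cont, h_disc, tau=0.5):
    """Eqs. (7)-(8): NT-Xent over the 2N embeddings, where the two 'views' of patient i
    are its continuous- and discrete-domain embeddings. h_cont, h_disc: (N, dim)."""
    n = h_cont.size(0)
    h = F.normalize(torch.cat([h_cont, h_disc], dim=0), dim=1)   # (2N, dim), unit norm
    sim = h @ h.t() / tau                                        # cosine similarity / temperature
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, -1e9)                            # exclude self-similarity
    # The positive of sample k is its counterpart in the other domain (index k+N or k-N).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)                         # averaged over all 2N positives

# Usage (illustrative shapes):
loss = nt_xent_loss(torch.randn(32, 16), torch.randn(32, 16))
```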

Semantic loss

In EHR-M-GANcond, a semantic loss is imposed to better align patients with the same labels (conditions) into the same latent space clusters. For example, if the label of severe clinical deterioration in the ICU is given for conditional data generation, the corresponding synthetic continuous-valued timeseries (e.g., severely deranged vital signs) should be strongly associated with the discrete-valued timeseries (e.g., intensive medical interventions) under the same label. For each domain, an additional linear classifier is trained to classify the latent embeddings according to their corresponding conditions in the observational space. We implement logistic regression as the linear classifier and use the cross entropy as the semantic loss for each domain. For \(d\in \{{{{\mathcal{C}}}},{{{\mathcal{D}}}}\}\), given the latent embedding vector zd and the conditional information vector y:

$${{{{\mathcal{L}}}}}_{d}^{{{{\rm{Class}}}}}={{\mathbb{E}}}_{{{{{\bf{z}}}}}^{d}\in {{{{\mathcal{H}}}}}^{{{{\mathcal{S}}}}}}{{{\rm{CE}}}}\left({f}_{{{{\rm{linear}}}}}^{d}({{{{\bf{z}}}}}^{d}),{{{\bf{y}}}}\right)$$
(9)

where \({f}_{{{{\rm{linear}}}}}^{d}\) denotes the linear classifier for the corresponding domain, and \({{{\rm{CE}}}}=-{\sum }_{j}{y}_{j}\log ({\widehat{y}}_{j}),\ (j=1,2,...,| L| )\) denotes the cross entropy loss, where \({\hat{y}}_{j}\) is the output of the linear classifier and yj is the ground-truth label for class j in the condition vector y.
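A minimal sketch of Eq. (9) for one domain is given below; the latent dimension, number of classes, and use of integer class labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_dim, n_classes = 16, 4
f_linear = nn.Linear(latent_dim, n_classes)   # logistic-regression-style linear classifier

def semantic_loss(z, y):
    """Eq. (9): cross entropy of the linear classifier's prediction against the condition label.
    z: (B, latent_dim) latent embeddings; y: (B,) integer condition labels."""
    return nn.functional.cross_entropy(f_linear(z), y)
```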

In summary, to train the dual-VAE for learning the shared latent space representation, the total objective function for \(d\in \{{{{\mathcal{C}}}},{{{\mathcal{D}}}}\}\) is:

$${{{{\mathcal{L}}}}}_{d}={\beta }_{0}{{{{\mathcal{L}}}}}_{d}^{{{{\rm{ELBO}}}}}+{\beta }_{1}{{{{\mathcal{L}}}}}^{{{{\rm{Match}}}}}+{\beta }_{2}{{{{\mathcal{L}}}}}^{{{{\rm{Contra}}}}}$$
(10)

Under the conditional learning scenario of EHR-M-GANcond, the total loss becomes:

$${{{{\mathcal{L}}}}}_{d}={\beta }_{0}{{{{\mathcal{L}}}}}_{d}^{{{{\rm{ELBO}}}}}+{\beta }_{1}{{{{\mathcal{L}}}}}^{{{{\rm{Match}}}}}+{\beta }_{2}{{{{\mathcal{L}}}}}^{{{{\rm{Contra}}}}}+{\beta }_{3}{{{{\mathcal{L}}}}}_{d}^{{{{\rm{Class}}}}}$$
(11)

where β0, β1, β2, and β3 are scalar loss weights used to balance the loss terms.

To validate the effectiveness of the multiple losses and the weight-sharing constraint in the proposed dual-VAE network, we perform a corresponding ablation study using the MIMIC-III dataset as an example. The results can be found in Supplementary Note 3. As shown in Supplementary Table 7, all the components of the proposed dual-VAE network contribute to improving EHR-M-GAN’s performance when generating mixed-type timeseries data.

Sequentially coupled generator based on CRN

We propose the sequentially coupled generator for producing latent representations of mixed-type timeseries, built on the network architecture of a coupled recurrent network (CRN). Specifically, the CRN exploits bilateral long short-term memory (BLSTM) cells as its recurrent layer to preserve the temporal dependencies between the continuous- and discrete-valued sequences. The proposed bilateral LSTM can extract and transmit the correlations between the mixed-type timeseries, as opposed to the vanilla LSTM, which has only one recursive connection. In the following, we first discuss the structure of the BLSTM in detail, as the essential recurrent layer of the CRN, and then build the sequentially coupled generator based on the CRN.

Bilateral long short-term memory

Because the traditional LSTM only considers the temporal dynamics of single-type timeseries, it is incapable of extracting and transmitting temporal correlations between heterogeneous features. We therefore propose the bilateral LSTM cell, with two network connections, to characterize the correlations between the two types of data. Given \(d,{d}^{{\prime} }\in \{{{{\mathcal{C}}}},{{{\mathcal{D}}}}\}\), \({{{{\boldsymbol{\upsilon }}}}}_{t}^{d}\) and \({{{{\bf{h}}}}}_{t}^{d}\) denote the input vector (i.e., the random noise during GAN training) and the hidden state vector for domain d at time step t, respectively. An additional set of weights for introducing the hidden state representation \({{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}\) from domain \({d}^{{\prime} }\) is included when updating the input gate \({{{{\bf{i}}}}}_{t}^{d}\), forget gate \({{{{\bf{f}}}}}_{t}^{d}\), output gate \({{{{\bf{o}}}}}_{t}^{d}\), and cell memory \({\tilde{{{{\bf{c}}}}}}_{t}^{d}\). The state transition functions of the BLSTM are:

$$\begin{array}{ll}{{{{\bf{i}}}}}_{t}^{d}=\sigma \left({{{{\bf{W}}}}}_{id\upsilon }{{{{\boldsymbol{\upsilon }}}}}_{t}^{d}+{{{{\bf{W}}}}}_{id{h}^{{d}^{{\prime} }}}{{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}+{{{{\bf{W}}}}}_{id{h}^{d}}{{{{\bf{h}}}}}_{t-1}^{d}+{{{{\bf{b}}}}}_{id}\right)&\\ {{{{\bf{f}}}}}_{t}^{d}=\sigma \left({{{{\bf{W}}}}}_{fd\upsilon }{{{{\boldsymbol{\upsilon }}}}}_{t}^{d}+{{{{\bf{W}}}}}_{fd{h}^{{d}^{{\prime} }}}{{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}+{{{{\bf{W}}}}}_{fd{h}^{d}}{{{{\bf{h}}}}}_{t-1}^{d}+{{{{\bf{b}}}}}_{fd}\right)\\ {{{{\bf{o}}}}}_{t}^{d}=\sigma \left({{{{\bf{W}}}}}_{od\upsilon }{{{{\boldsymbol{\upsilon }}}}}_{t}^{d}+{{{{\bf{W}}}}}_{od{h}^{{d}^{{\prime} }}}{{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}+{{{{\bf{W}}}}}_{od{h}^{d}}{{{{\bf{h}}}}}_{t-1}^{d}+{{{{\bf{b}}}}}_{od}\right)\\ {\tilde{{{{\bf{c}}}}}}_{t}^{d}=\tanh \left({{{{\bf{W}}}}}_{cd\upsilon }{{{{\boldsymbol{\upsilon }}}}}_{t}^{d}+{{{{\bf{W}}}}}_{cd{h}^{{d}^{{\prime} }}}{{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}+{{{{\bf{W}}}}}_{cd{h}^{d}}{{{{\bf{h}}}}}_{t-1}^{d}+{{{{\bf{b}}}}}_{cd}\right)\\ {{{{\bf{c}}}}}_{t}^{d}={{{{\bf{f}}}}}_{t}^{d}\odot {{{{\bf{c}}}}}_{t-1}^{d}+{{{{\bf{i}}}}}_{t}^{d}\odot {\tilde{{{{\bf{c}}}}}}_{t}^{d}\\ {{{{\bf{h}}}}}_{t}^{d}={{{{\bf{o}}}}}_{t}^{d}\odot \tanh \left({{{{\bf{c}}}}}_{t}^{d}\right)\end{array}$$
(12)

As indicated by Eq. (12), the proposed BLSTM network overcomes the vanilla LSTM’s limitation in modeling the correlation between mixed-type timeseries by establishing a supplemental recursive connection. This new connection lets the model decide how much information to pass through the gates from its counterpart. A diagram of the BLSTM cell, contrasted with the vanilla LSTM cell, can be found in Supplementary Fig. 3.
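A compact PyTorch sketch of one BLSTM cell following Eq. (12) is shown below; fusing the per-gate weight matrices into a single projection is an implementation convenience assumed here, not a detail taken from the paper.

```python
import torch
import torch.nn as nn

class BilateralLSTMCell(nn.Module):
    """One domain's bilateral LSTM cell (Eq. (12)): the gates depend on the noise input v_t,
    the domain's own previous hidden state h_{t-1}^d, and the counterpart domain's
    previous hidden state h_{t-1}^{d'}."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One joint projection producing the pre-activations of i, f, o, and c-tilde.
        self.proj = nn.Linear(input_dim + 2 * hidden_dim, 4 * hidden_dim)

    def forward(self, v_t, h_prev_own, h_prev_other, c_prev):
        pre = self.proj(torch.cat([v_t, h_prev_other, h_prev_own], dim=-1))
        i, f, o, c_tilde = pre.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(c_tilde)
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```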

Coupled recurrent network

The architecture of the CRN consists of three layers: the input layers, the recurrent layers, and the fully connected layers. First, the random noise vectors \({{{{\boldsymbol{\upsilon }}}}}_{t}^{d}\) and \({{{{\boldsymbol{\upsilon }}}}}_{t}^{{d}^{{\prime} }}\) for the two domains, sampled from uniform distributions (i.e., \({{{{\boldsymbol{\upsilon }}}}}_{t}^{d},{{{{\boldsymbol{\upsilon }}}}}_{t}^{{d}^{{\prime} }} \sim {{{\mathcal{U}}}}(0,1)\)), are fed separately into the input layers. Then the recurrent layers \({f}_{{{{\rm{rec}}}}}\), built from two streams of BLSTMs, one for each data type, recursively iterate the hidden states of both branches. Finally, the fully connected layers \({f}_{{{{\rm{conn}}}}}^{d}\) and \({f}_{{{{\rm{conn}}}}}^{{d}^{{\prime} }}\) produce the generated latent vectors \({\hat{{{{\bf{z}}}}}}_{t}^{d}\) and \({\hat{{{{\bf{z}}}}}}_{t}^{{d}^{{\prime} }}\) for the decoding stage of the dual-VAE. At time step t, the CRN can be formulated as:

$$\begin{array}{ll}({{{{\bf{h}}}}}_{t}^{d},{{{{\bf{h}}}}}_{t}^{{d}^{{\prime} }})\,={f}_{{{{\rm{rec}}}}}(({{{{\boldsymbol{\upsilon }}}}}_{t}^{d},{{{{\boldsymbol{\upsilon }}}}}_{t}^{{d}^{{\prime} }}),({{{{\bf{h}}}}}_{t-1}^{d},{{{{\bf{h}}}}}_{t-1}^{{d}^{{\prime} }}))\\ {\hat{{{{\bf{z}}}}}}_{t}^{d}\,={f}_{{{{\rm{conn}}}}}^{d}({{{{\bf{h}}}}}_{t}^{d})\\ {\hat{{{{\bf{z}}}}}}_{t}^{{d}^{{\prime} }}\,={f}_{{{{\rm{conn}}}}}^{{d}^{{\prime} }}({{{{\bf{h}}}}}_{t}^{{d}^{{\prime} }})\end{array}$$
(13)

In summary, heterogeneous timeseries that exhibit mutual influence on each other are integrated into the CRN to model their interdependencies. By exploiting the BLSTM cell as its recurrent layer, the two input streams of the generator are encouraged to “communicate” with each other. The CRN is therefore capable of exploiting the interplay between mixed-type data that are correlated over time.
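Building on the BilateralLSTMCell sketched above (assumed to be in scope), a minimal CRN step following Eq. (13) might look as follows; the zero initialization of the hidden states and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoupledRecurrentNetwork(nn.Module):
    """CRN sketch (Eq. (13)): two BilateralLSTMCell streams exchange hidden states at every
    step, and per-domain fully connected layers emit the synthetic latent codes z-hat."""
    def __init__(self, noise_dim, hidden_dim, latent_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.cell_c = BilateralLSTMCell(noise_dim, hidden_dim)
        self.cell_d = BilateralLSTMCell(noise_dim, hidden_dim)
        self.fc_c = nn.Linear(hidden_dim, latent_dim)
        self.fc_d = nn.Linear(hidden_dim, latent_dim)

    def forward(self, v_c, v_d):                      # noise: (B, T, noise_dim) per domain
        B, T, _ = v_c.shape
        h_c = torch.zeros(B, self.hidden_dim, device=v_c.device)
        h_d, c_c, c_d = h_c.clone(), h_c.clone(), h_c.clone()
        z_c, z_d = [], []
        for t in range(T):
            h_c_new, c_c = self.cell_c(v_c[:, t], h_c, h_d, c_c)   # continuous stream sees h_d
            h_d_new, c_d = self.cell_d(v_d[:, t], h_d, h_c, c_d)   # discrete stream sees h_c
            h_c, h_d = h_c_new, h_d_new
            z_c.append(self.fc_c(h_c))
            z_d.append(self.fc_d(h_d))
        return torch.stack(z_c, dim=1), torch.stack(z_d, dim=1)    # (B, T, latent_dim) each
```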

Joint training and optimization

The overall architecture of EHR-M-GAN is shown in Fig. 1. In this section, we give a detailed description of the entire network’s structure and the optimization objective of the model. The steps for the training and optimization of EHR-M-GAN are as follows:

  • The pretraining of dual-VAE: First, a dual-VAE network which consists of a pair of encoders (\({{Enc}}^{{{{\mathcal{C}}}}},{{Enc}}^{{{{\mathcal{D}}}}}\)) and decoders (\({{Dec}}^{{{{\mathcal{C}}}}},{{Dec}}^{{{{\mathcal{D}}}}}\)) is pretrained with both continuous and discrete data. Based on multiple objective constraints in Eq. (10), a shared latent space is learnt using dual-VAE, where the gap between the embedding representations \(({{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{C}}}}},{{{{\bf{z}}}}}_{1:T}^{{{{\mathcal{D}}}}})\) from both domains is minimized.

  • The generation of latent representations based on CRN: Then, during the joint training stage, the sequentially coupled generator built on the CRN takes the random noise vectors \(({{{{\boldsymbol{\upsilon }}}}}_{1:T}^{{{{\mathcal{C}}}}},{{{{\boldsymbol{\upsilon }}}}}_{1:T}^{{{{\mathcal{D}}}}})\) as inputs and iterates across the timesteps t ∈ {1, 2, . . . , T} via its internal transition functions. The synthetic latent embedding representations \(({\hat{{{{\bf{z}}}}}}_{1:T}^{{{{\mathcal{C}}}}},{\hat{{{{\bf{z}}}}}}_{1:T}^{{{{\mathcal{D}}}}})\) for both the continuous and discrete data are thereby obtained.

  • The decoding for the mixed-type timeseries: Next, the generated latent embeddings \(({\hat{{{{\bf{z}}}}}}_{1:T}^{{{{\mathcal{C}}}}},{\hat{{{{\bf{z}}}}}}_{1:T}^{{{{\mathcal{D}}}}})\) are further fed into the pretrained decoders (\({{Dec}}^{{{{\mathcal{C}}}}},{{Dec}}^{{{{\mathcal{D}}}}}\)) and decoded into the corresponding synthetic patient trajectories \(({\hat{{{{\bf{x}}}}}}_{1:T}^{{{{\mathcal{C}}}}},{\hat{{{{\bf{x}}}}}}_{1:T}^{{{{\mathcal{D}}}}})\) in the observational space.

  • The adversarial loss update based on the discriminators: Finally, the adversarial loss can be calculated from the LSTM network-based discriminators \({D}^{{{{\mathcal{C}}}}}\) and \({D}^{{{{\mathcal{D}}}}}\) by distinguishing between the real samples and synthetic timeseries for both data types.

The mathematical expression for the min-max objectives in EHR-M-GAN is provided as follows:

$$\begin{array}{l}\mathop{\min }\limits_{G}\mathop{\max }\limits_{D}{V}_{{{{\rm{EHR-M-GAN}}}}} =\,{{\mathbb{E}}}_{{{{\bf{x}}}} \sim {p}_{{{{\bf{x}}}}}}[\log {D}^{{{{\mathcal{C}}}}}({{{{\bf{x}}}}}^{{{{\mathcal{C}}}}})+\log {D}^{{{{\mathcal{D}}}}}({{{{\bf{x}}}}}^{{{{\mathcal{D}}}}})]\\ \qquad\qquad\qquad\quad\quad\quad\quad\,\,\, +\,{{\mathbb{E}}}_{{{{\boldsymbol{\upsilon }}}} \sim {p}_{{{{\boldsymbol{\upsilon }}}}}}[\log (1-{D}^{{{{\mathcal{C}}}}}({\hat{{{{\bf{x}}}}}}^{{{{\mathcal{C}}}}}))+\log (1-{D}^{{{{\mathcal{D}}}}}({\hat{{{{\bf{x}}}}}}^{{{{\mathcal{D}}}}}))]\end{array}$$
(14)
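As an illustration of Eq. (14), the sketch below computes the losses for the two discriminators and the generator; writing the generator term in the common non-saturating form is an assumption for numerical convenience, not a detail stated in the paper.

```python
import torch

def gan_losses(d_real_c, d_real_d, d_fake_c, d_fake_d):
    """Eq. (14) split into the two usual updates. The arguments are the discriminator
    probabilities D^C(.) and D^D(.) evaluated on real and synthetic batches."""
    eps = 1e-8
    # Discriminator update: maximize log D(x) + log(1 - D(x_hat)) for both data types.
    loss_disc = -(torch.log(d_real_c + eps) + torch.log(d_real_d + eps)
                  + torch.log(1 - d_fake_c + eps) + torch.log(1 - d_fake_d + eps)).mean()
    # Generator update: non-saturating surrogate, maximize log D(x_hat) for both data types.
    loss_gen = -(torch.log(d_fake_c + eps) + torch.log(d_fake_d + eps)).mean()
    return loss_disc, loss_gen
```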

Conditional version of EHR-M-GAN

For the conditional extension, EHR-M-GANcond, the auxiliary label information is first used during the pretraining step of the dual-VAE. Both the encoders and decoders are conditioned on the auxiliary (one-hot) labels from \({{{\mathcal{L}}}}\), to make the model better adapted to particular contexts. In the dual-VAE, the additional semantic loss is also incorporated when optimizing the shared latent space, as in Eq. (11). Meanwhile, the same conditional labels are applied in the sequentially coupled generator and the discriminators, where the classes are fed as additional inputs through concatenation, as in the original CGAN architecture proposed by Mirza et al.68.

The t-SNE visualization of the latent embeddings induced from the dual-VAE can be found in Supplementary Note 4; it indicates that the conditional information carried into EHR-M-GANcond can be beneficial when synthesizing patient trajectories under specific medical conditions. Overall, the adversarial loss for EHR-M-GANcond can be written as follows:

$$\begin{array}{l}\mathop{\min }\limits_{G}\mathop{\max }\limits_{D}{V}_{{{{\rm{EHR}}}}-{{{\rm{M}}}}-{{{{\rm{GAN}}}}}_{{{{\rm{cond}}}}}}={{\mathbb{E}}}_{{{{\bf{y}}}},{{{\bf{x}}}} \sim {p}_{{{{\bf{y}}}},{{{\bf{x}}}}}}[\log {D}^{{{{\mathcal{C}}}}}({{{{\bf{x}}}}}^{{{{\mathcal{C}}}}}| {{{\bf{y}}}})+\log {D}^{{{{\mathcal{D}}}}}({{{{\bf{x}}}}}^{{{{\mathcal{D}}}}}| {{{\bf{y}}}})]\\ \qquad\qquad\qquad\qquad\qquad\qquad+\,{{\mathbb{E}}}_{{{{\bf{y}}}} \sim {p}_{{{{\bf{y}}}}},{{{\boldsymbol{\upsilon }}}} \sim {p}_{{{{\boldsymbol{\upsilon }}}}}}[\log (1-{D}^{{{{\mathcal{C}}}}}({\hat{{{{\bf{x}}}}}}^{{{{\mathcal{C}}}}}| {{{\bf{y}}}}))+\log (1-{D}^{{{{\mathcal{D}}}}}({\hat{{{{\bf{x}}}}}}^{{{{\mathcal{D}}}}}| {{{\bf{y}}}}))]\end{array}$$
(15)

The pseudocodes for dual-VAE and EHR-M-GAN are provided in the Supplementary Note 1.

Baseline models

We compare the performance of EHR-M-GAN with eight state-of-the-art generative methods from the literature. However, as these benchmarks typically face challenges in modeling mixed-type EHR timeseries and can only synthesize single-type EHRs, we compare EHR-M-GAN with the benchmark models using the corresponding partial component of our synthetic results, i.e., either the continuous-valued part or the discrete-valued part. For continuous-valued timeseries generation, the benchmark GAN models include C-RNN-GAN69, R(C)GAN25, and TimeGAN29. For discrete-valued timeseries generation, the classic medGAN32 and seqGAN31, and two recently proposed methods, SynTEG27 and DualAAE26, are used for comparison. Apart from these GAN-based models, we also include PrivBayes70 for synthesizing discrete-valued timeseries, which falls into the class of non-GAN generative approaches using a Bayesian framework17. As the original PrivBayes paper focuses on data anonymization using differential privacy, we implemented its ‘Non-Private’ version for a fair comparison with the other baselines (see Section 4.1 Non-Private Methods in ref. 70). For medGAN and PrivBayes, we feed the flattened temporal sequences as the input, since these models cannot produce timeseries data.

We further perform an ablation study to investigate whether the novel components introduced in the proposed model have advantages over variants that also model mixed-type EHRs. First, we compare EHR-M-GAN with a variant that jointly models the mixed-type data using a single unified VAE network (denoted GANUnified). Second, we test a variant that encodes the mixed-type inputs separately with two independent VAE networks (denoted GANVAE). Then, we assess the effectiveness of the proposed sequentially coupled generator component by implementing GANSL. Lastly, as the dual-VAE module alone can also be used to generate EHR timeseries, it serves as a non-GAN-based benchmark in the ablation study (see Supplementary Note 3). The architectures of the different variants of EHR-M-GAN in the ablation study are detailed as follows (also see Fig. 7 for illustration):

  • GANUnified: It contains a unified VAE module and two separate GANs. The continuous-valued and discrete-valued timeseries are concatenated, via normalization and one-hot encoding, as input to the encoder of the unified VAE network. The decoder receives the concatenation of the generated latent vectors as input and decodes it into synthetic timeseries of the corresponding data types using separate fully connected layers. Each component of GANUnified (the unified encoder and decoder, and the separate generators and discriminators) is implemented with LSTMs, the same as in EHR-M-GAN.

  • GANVAE: It is composed of a pair of VAE networks and a pair of GANs (one for each input type). The continuous-valued and discrete-valued timeseries from the same patients are fed separately into the corresponding paths of GANVAE, which run in parallel. The synthetic outputs for each data type are then combined as the final result. It maintains the basic structure of EHR-M-GAN but lacks the latent space sharing of the dual-VAE and the sequentially coupled generator of the original EHR-M-GAN.

  • GANSL: In addition to GANVAE, it learns the shared latent space representations through the dual-VAE by adding the corresponding loss functions of EHR-M-GAN, including the ELBO, matching, and contrastive losses. This model lacks the sequentially coupled generator.

  • EHR-M-GAN: In addition to GANSL, it incorporates the sequentially coupled generator for learning the correlated temporal dynamics of timeseries of different data types. This is the proposed full model.

  • EHR-M-GANcond: This version is implemented on the basis of the conditional GAN68, where conditional inputs are fed into EHR-M-GAN to generate patient trajectories under specific labels.

Fig. 7: The network architectures in the ablation study.

Three variants of EHR-M-GAN are implemented in the ablation study. Compared with the full EHR-M-GAN model, a GANUnified learns the joint representations of heterogeneous data types in a unified network; b GANVAE maintains the basic architecture of EHR-M-GAN but ignores dependency learning (i.e., separate networks for the two input streams are trained in parallel); c GANSL constructs the shared latent space using the dual-VAE module but omits the sequentially coupled generator for learning the temporal correlations in the mixed-type timeseries.

For training EHR-M-GANcond, auxiliary information on patient status is used as the conditional input. These conditional inputs are selected because synthesizing EHR information for patient subgroups with particular outcomes would be valuable for clinicians in their decision-making process. Other conditional labels (such as patient demographics in categorized format) can also be used in the proposed conditional synthesizer for other research purposes. For the MIMIC-III dataset, the classes are: (1) ICU mortality: the patient died within the ICU; (2) Hospital mortality: the patient was discharged alive from the ICU and died within the hospital; (3) 30-day readmission: the patient was discharged alive from the hospital and readmitted to the hospital within 30 days; (4) No 30-day readmission: the patient was discharged alive from the hospital and had no readmission record within 30 days. For the eICU and HiRID datasets, the corresponding labels are also extracted based on the availability of the patient outcomes (see Table 7).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.