Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications

The recent availability of electronic health records (EHRs) has provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, current generative models are limited in that they generate only a single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making, which encompasses various data types and sources, we propose in this study a generative adversarial network (GAN) entitled EHR-M-GAN that simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly available intensive care unit databases with records from a total of 141,488 unique patients, and performed a privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when the training data was augmented with EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy.


INTRODUCTION
The past decade has witnessed ground-breaking advancements in computational health, owing to the explosion of medical data such as electronic health records (EHRs) Artzi et al. (2020); Raket et al. (2020); Menger et al. (2019). The secondary use of EHRs has given rise to a wide variety of research, especially machine learning (ML)-based digital health solutions for improving the delivery of care Wilkinson et al. (2020); Watson et al. (2019); Futoma et al. (2020); Esteva et al. (2021); Rajkomar et al. (2019). However, in practice, the benefits of data-driven research are limited to the healthcare organizations (HCOs) that possess the data Wirth et al. (2021); Dinov (2016). Due to concerns about patient privacy, HCO stakeholders are reluctant to share patient data Miotto et al. (2018); Kim et al. (2021); Simon et al. (2019). Access to clinical data is often restricted, or can be prohibitively expensive to obtain, meaning that ML in biomedical research lags behind other areas in AI.
To accelerate the progress of developing AI methods in medicine, one promising alternative is for the data holder to create synthetic yet realistic data Jordon et al. (2018); Frid-Adar et al. (2018). Unlike data anonymization, synthetic data avoids a "one-to-one" mapping to the genuine data and therefore offers a way to circumvent the privacy issue while retaining the correlations in the original data.
Existing GANs are limited in simulating mixed-type EHRs for two reasons. First, it is intrinsically difficult to model the underlying joint distribution of mixed-type timeseries using a single unified framework. Since GANs require the network architectures of the generator and discriminator to be fully differentiable Hjelm et al. (2017), their success is typically limited to generating real-valued, continuous data, and they face obstacles when directly generating sequences of discrete tokens, such as ICD codes, that also commonly appear in EHRs. Previous methods Yu et al. (2017); Choi et al. (2017) circumvent this problem by learning representations from the original data, which enables backpropagation in discrete settings, but a generative approach for jointly modelling heterogeneous mixed-type timeseries is still lacking. Second, although mixed-type clinical timeseries differ in syntax and distribution, they are highly correlated and inform one another about the underlying health of an individual Yu et al. (2019); Ghassemi et al. (2017); Wang et al. (2019). It is therefore important to capture the temporal correlations between them when generating synthetic EHR data. For example, the medications prescribed to patients (documented as discrete data) are based on measurements of the patients' physiological status (presented as continuous-valued signals). Conversely, the efficacy of the medical treatments directly affects the patient's physiological condition. Accurately and simultaneously capturing the temporal correlation between the mixed-type patient trajectories is therefore critical for improving clinical decision support.
To address the aforementioned limitations, we propose, for the first time, a GAN framework for simultaneously synthesizing mixed-type longitudinal EHR data (denoted as EHR-M-GAN hereafter). Specifically, we focus on generating timeseries in the critical care setting, where intensive care unit (ICU) patients are continuously and closely monitored (see Fig. 1a). Patient trajectories with high dimensionality and heterogeneous data types (both continuous-valued and discrete-valued timeseries) are generated while the underlying temporal dependencies are captured. The main contributions of our work are as follows:
1. A novel GAN model entitled EHR-M-GAN is proposed for simultaneously generating mixed-type multivariate EHR timeseries with high fidelity, overcoming the challenges of extending GANs to mixed-type data settings (see Fig. 1b). First, to jointly model the underlying distributions of the heterogeneous features, EHR-M-GAN maps data from different observational spaces into a reversible, lower-dimensional, shared latent space through a dual variational autoencoder (dual-VAE). Then, to capture the correlated temporal dynamics of the mixed-type data, a sequentially coupled generator built upon a coupled recurrent network (CRN) is employed. In addition, a conditional version of our model, EHR-M-GAN$_{cond}$, is also implemented, which is capable of synthesizing condition-specific EHR patient data, such as trajectories that result in ICU mortality or hospital readmission. The code of our proposed work is publicly available on GitHub (footnote 1).
2. Evaluations are performed on three publicly available ICU datasets: MIMIC-III Johnson et al. (2016), eICU Pollard et al. (2018) and HiRID Yèche et al. (2021), comprising a total of 141,488 patients. Standardized preprocessing pipelines are applied to the three ICU datasets to provide generalizable machine learning benchmarks. The code for the end-to-end preprocessing pipelines is also available on GitHub (footnote 2).
3. EHR-M-GAN outperforms the state-of-the-art benchmarks on a diverse spectrum of evaluation metrics. Both qualitative and quantitative metrics are used to assess how well the synthetic mixed-type data and their inter-dependencies represent the real EHR data. We further demonstrate the advantages offered by EHR-M-GAN in augmenting clinical timeseries for downstream tasks under various clinical scenarios.
4. In the evaluation of privacy risks, we perform an empirical analysis of EHR-M-GAN based on the membership inference attack Shokri et al. (2017). We then further evaluate the performance of EHR-M-GAN under the framework of differential privacy for its application in downstream tasks Dwork (2006).
Figure 1. b. Step 1: The dual-VAE is first pretrained to map heterogeneous data $(x^{c}_{t}, x^{d}_{t})$ into shared latent representations $(z^{c}_{t}, z^{d}_{t})$. Multiple objective loss constraints are used to bridge the domain/distribution gap. The training process for Step 1 is indicated by the Dual-VAE pretrain path (dashed purple line). Step 2: A CRN is then established as the generator based on the parallel bilateral LSTM block, which takes the random noise vectors $(\upsilon^{c}_{t}, \upsilon^{d}_{t})$ as inputs (see the Coupled generation path). Step 3: The synthetic latent representations $(\hat{z}^{c}_{t}, \hat{z}^{d}_{t})$ provided by the CRN are then decoded into synthetic samples $(\hat{x}^{c}_{t}, \hat{x}^{d}_{t})$ using the pretrained decoders in the dual-VAE, as indicated by the Decoding path (solid red line). Step 4: Finally, the adversarial loss is derived from the discriminators and backpropagated to update the network, as indicated by the Adversarial training path (dotted black line). c. Evaluation pipeline. The pipeline includes metrics for evaluating the synthetic data. d. Prediction example. Data within the 24 hours prior to the patient's endpoint in the ICU (discharge or mortality) is extracted. Both the observation window and the prediction window are fixed at 12 hours. The classification task is to use patients' continuous-valued physiological measurements within the observation window as input to predict the forthcoming discrete-valued medical intervention status in the prediction window. The four outcomes of the intervention status are categorized as follows. Stay On: the intervention begins on and stays on within the prediction window. Onset: the intervention begins off and is turned on within the prediction window. Switch off: the intervention begins on and is stopped within the prediction window. Stay Off: the intervention begins off and stays off within the prediction window.
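The four intervention outcomes defined for the prediction example in Fig. 1d can be written as a small labelling rule. The sketch below is illustrative only: the function name `intervention_outcome` and the hourly 0/1 encoding are assumptions, and "begins on/off" is read here as the status in the first hour of the prediction window.

```python
from typing import Sequence


def intervention_outcome(prediction_window: Sequence[int]) -> str:
    """Label one intervention from its hourly on/off (1/0) status in the prediction window."""
    if not prediction_window:
        raise ValueError("prediction window must contain at least one hour")
    starts_on = prediction_window[0] == 1
    any_on = any(status == 1 for status in prediction_window)
    any_off = any(status == 0 for status in prediction_window)
    if starts_on:
        # Begins on: either it is stopped at some point, or it stays on throughout.
        return "Switch off" if any_off else "Stay On"
    # Begins off: either it is turned on at some point, or it stays off throughout.
    return "Onset" if any_on else "Stay Off"
```

For a 12-hour prediction window, a ventilation signal such as `[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]` would be labelled `Onset` under this reading.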

METHODS
In this section, we first formulate the problem based on the mixed-type temporal EHR data and its corresponding mathematical notation. Then, we discuss the challenges of synthesizing mixed-type EHR timeseries and the intuition behind the proposed model. Finally, we introduce the proposed EHR-M-GAN model in detail.

PROBLEM FORMULATION
The longitudinal patient EHR dataset is denoted as $\mathcal{D} = \{x_{i,1:T_i}\}_{i=1}^{N}$, with each record (e.g., an individual patient) indexed by $i \in \{1, 2, \ldots, N\}$. The i-th instance $x_{i,1:T_i} = \{x^{C}_{i,1:T_i}, x^{D}_{i,1:T_i}\}$ consists of two components (i.e., two types of data). Let $x^{C}_{i,1:T_i} \in \mathbb{R}^{|J|}$ denote the $|J|$-dimensional continuous-valued timeseries, such as physiological signals from real-time bedside monitors, and let $x^{D}_{i,1:T_i} \in \mathbb{Z}^{|K|}$ denote the $|K|$-dimensional discrete-valued timeseries, such as life-support interventions whose categorical values indicate their status (presence or absence).
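To make the notation concrete, a single record can be held as two aligned sequences of equal length. This is a minimal sketch; the class name `PatientRecord`, its field names, and the example values are hypothetical and only illustrate the shapes involved.

```python
from typing import List


class PatientRecord:
    """One record x_{i,1:T_i}: paired continuous and discrete time series."""

    def __init__(self, continuous: List[List[float]], discrete: List[List[int]]) -> None:
        # continuous: T_i rows with |J| real-valued features each
        # discrete:   T_i rows with |K| categorical features each
        if len(continuous) != len(discrete):
            raise ValueError("both components must cover the same T_i time steps")
        self.continuous = continuous
        self.discrete = discrete

    @property
    def length(self) -> int:
        """Number of time steps T_i."""
        return len(self.continuous)


# Hypothetical 3-hour record: heart rate and SpO2 (|J| = 2), ventilation on/off (|K| = 1).
record = PatientRecord(
    continuous=[[82.0, 97.0], [88.0, 95.0], [91.0, 93.0]],
    discrete=[[0], [1], [1]],
)
dataset = [record]  # D = {x_{i,1:T_i}} for i = 1..N, here with N = 1
```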

CHALLENGES IN MIXED-TYPE TIMESERIES GENERATION
There are two main challenges when synthesizing mixed-type EHR timeseries. First, GANs have serious limitations on the type of data they can model Hjelm et al. (2017). Specifically, as GANs require both the generator and the discriminator to be fully differentiable, generating discrete-valued timeseries with traditional GAN architectures causes problems during backpropagation because no direct gradient can be provided Choi et al. (2017); Yu et al. (2017). It is therefore intrinsically difficult to model the underlying joint distribution of mixed-type timeseries using a single unified framework. Second, as the mixed-type timeseries are correlated (for example, ICU patients' physiological signals and their treatment status in the critical care setting), it is important to model the interdependencies among heterogeneous types of timeseries.
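The first challenge can be seen in a toy, one-parameter setting. The sketch below is not part of the model; it only contrasts a smooth output with a thresholded one (all function names are illustrative) and estimates their gradients by finite differences: the discrete output is piecewise constant, so its gradient with respect to the parameter is zero almost everywhere.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def soft_output(theta: float) -> float:
    """Continuous-valued output: a smooth function of the parameter."""
    return sigmoid(theta)


def hard_output(theta: float) -> float:
    """Discrete-valued output: thresholding turns the same signal into a 0/1 token."""
    return 1.0 if sigmoid(theta) >= 0.5 else 0.0


def finite_difference(f, theta: float, eps: float = 1e-4) -> float:
    """Central-difference estimate of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)


theta = 0.8
grad_soft = finite_difference(soft_output, theta)  # close to sigmoid(theta) * (1 - sigmoid(theta))
grad_hard = finite_difference(hard_output, theta)  # the threshold passes no gradient signal
```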

INTUITION BEHIND EHR-M-GAN
First, to jointly model the distribution of continuous-valued and discrete-valued timeseries using GANs, we build the generative model on the latent space encoded by VAE networks. Instead of directly synthesizing discrete-valued timeseries, which would block backpropagation in GANs, the generator first synthesizes latent representations that admit a direct gradient through the network, thereby satisfying the requirement that the GAN architecture be fully differentiable. The synthetic latent representations for both types of data can then be decoded into raw timeseries using the decoders in the VAEs.
Although the aforementioned architecture enables joint modelling of the mixed-type data distribution, it still lacks the capability to capture the inter-dependencies in heterogeneous data. To address this second issue, we devise a dual-VAE module for the pretraining step and a sequentially coupled generator module for the generation step. The dual-VAE incorporates multiple loss constraints, previously adopted in domains such as self-supervised learning (SSL), timeseries representation learning, and domain adaptation (DA), to extract useful hierarchical representations from heterogeneous but correlated data types. The sequentially coupled generator module replaces the traditional LSTM cell with the novel bilateral LSTM (BLSTM) cell we propose, which introduces "communication" between the two types of information into the network. The temporal dynamics between the mixed-type data can therefore be preserved across iterations.

PROPOSED MODEL
As illustrated above, EHR-M-GAN can be factorized into two key components (see Fig. 1b): (1) a dual-VAE framework for learning the shared latent space representations; (2) an RNN-based sequentially coupled generator and its corresponding sequence discriminators.
As shown in Fig. 1b, during the pretraining stage, both continuous-valued and discrete-valued temporal trajectories are first jointly mapped into a shared latent space using the dual-VAE component (Step 1). Then, the sequentially coupled generator in EHR-M-GAN produces the synthetic latent representations (Step 2), which can then be recovered into features in the observational space by the pretrained decoders in the dual-VAE (Step 3). Finally, the adversarial loss is computed from the discriminators' outputs and backpropagated to update the network (Step 4). The following sections discuss these components in turn.

DUAL-VAE PRETRAINING FOR SHARED LATENT SPACE REPRESENTATIONS
One premise of successfully training EHR-M-GAN to generate reversible latent codes is the assumption that, for the same patient indexed by $i$, both $x^{C}_{i,1:T_i}$ and $x^{D}_{i,1:T_i}$ can be encoded into the same latent space $\mathcal{H}^{S} \subset \mathbb{R}^{|S|}$, where $|S|$ denotes its spatial dimension. For the sake of simplicity, the subscript $i$ is omitted throughout most of the paper. To achieve this, we propose a dual-VAE framework, which uses two VAE networks to encode both continuous and discrete multivariate timeseries into dense representations within $\mathcal{H}^{S}$ under multiple constraints.
After passing data from $\mathcal{X}^{C}$ and $\mathcal{X}^{D}$ through the two encoders, a pair of embedding vectors $(z^{C}_{1:T}, z^{D}_{1:T})$ in the shared latent space $\mathcal{H}^{S}$ is obtained. The decoders for both domains, $\mathrm{Dec}^{C}: \prod_{T} \mathcal{H}^{S} \to \prod_{T} \mathcal{X}^{C}$ and $\mathrm{Dec}^{D}: \prod_{T} \mathcal{H}^{S} \to \prod_{T} \mathcal{X}^{D}$, then reconstruct features from the latent embeddings using mapping functions that operate in the opposite direction to the encoders. In addition, to encourage the dual-VAE to better bridge the gap between the domains of the mixed-type timeseries, we enforce a weight-sharing constraint Liu et al. (2017); Liu and Tuzel (2016) within specific layers of both the encoder pair and the decoder pair (see Section S.1.B for details).
In the following subsections, we define multiple loss constraints for the optimization of the dual-VAE, including the ELBO loss, matching loss, and contrastive loss, as well as the semantic loss for EHR-M-GAN$_{cond}$. Among these losses, the ELBO loss ensures that the mixed-type timeseries can be successfully reconstructed after being encoded into latent representations. The matching loss ensures that heterogeneous types of features from a single patient share context during representation learning (instance-wise). The goal of the contrastive loss is to ensure that patients with similar trajectories stay close to each other in the latent space (population-wise). Finally, the semantic loss used in EHR-M-GAN$_{cond}$ encourages patients with the same conditional labels (e.g., outcomes) to share similar latent representations. The intuition behind each objective and its description are discussed in turn.
Evidence Lower Bound (ELBO). We first incorporate the standard VAE loss, with the evidence lower bound (ELBO) as the optimization objective. The VAE assumes a spherical Gaussian prior for the distribution of latent embeddings, from which features can be reconstructed by sampling. The re-parameterization trick enables differentiable stochastic sampling and network optimization. For the encoder and decoder in the dual-VAE for domain $d \in \{C, D\}$, the objective function is defined as:
$$\mathcal{L}^{d}_{ELBO} = -\mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\psi}(x|z)\right] + \beta_{KL}\, D_{KL}\left(q_{\phi}(z|x)\,\|\,p(z)\right) \quad (3)$$
where $z \sim \mathrm{Enc}(x) \triangleq q_{\phi}(z|x)$, $x \sim \mathrm{Dec}(z) \triangleq p_{\psi}(x|z)$, and $D_{KL}$ is the Kullback-Leibler divergence. The first term in Eq. (3) is the expected log-likelihood term, which penalizes deviations in reconstructing the inputs, while the second, KL-divergence term regularizes the latent distribution towards its Gaussian prior $p(z)$ (normally chosen to be $\mathcal{N}(0, I)$). $\beta_{KL}$ is the hyperparameter that balances the weights of the two terms.
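A minimal numerical sketch of such an objective for the continuous domain is given below. It makes assumptions beyond the text: a diagonal Gaussian encoder parameterised by `mu` and `log_var`, a squared-error reconstruction term standing in for the negative log-likelihood (i.e., a unit-variance Gaussian decoder, up to a constant), and hypothetical function names. The discrete domain would typically use a cross-entropy reconstruction term instead.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def squared_error(x: Sequence[float], x_rec: Sequence[float]) -> float:
    """Reconstruction penalty (assumed Gaussian likelihood with unit variance)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


def elbo_loss(x, x_rec, mu, log_var, beta_kl: float = 1.0) -> float:
    """Reconstruction term plus beta_KL-weighted KL regulariser, to be minimised."""
    return squared_error(x, x_rec) + beta_kl * kl_to_standard_normal(mu, log_var)
```

With a perfect reconstruction and a posterior equal to the prior (`mu = 0`, `log_var = 0`), both terms vanish.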
Matching loss. Representations derived from the same patient are typically assumed to capture a shared context. Embedding vectors $(z^{C}_{i,1:T_i}, z^{D}_{i,1:T_i})$ projected from the same patient $i$ should therefore be positioned close together in the shared latent space (see Fig. S2 in the Supplementary materials). We borrow the concept of matching loss from domain alignment in DA, which enables efficient representation learning across domains/modalities Wan et al. (2020). In this study, the matching loss ensures that a low-dimensional latent space can be shared between heterogeneous features: the pairwise matching loss motivates the encoders to minimize the distance within each corresponding pair of representations. In the low-dimensional Euclidean space, we optimize the network using a distance-based objective, $\mathcal{L}_{Match}$, computed between the paired embeddings. The pairwise matching loss achieves its optimum when the distance proxy $\mathcal{L}_{Match}$ becomes zero.
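One plausible form of such a distance is sketched below, assuming the mean squared Euclidean distance over time steps; the exact distance used for the matching loss is not specified in this text, and `matching_loss` is a hypothetical name.

```python
from typing import Sequence


def matching_loss(z_c: Sequence[Sequence[float]], z_d: Sequence[Sequence[float]]) -> float:
    """Mean squared Euclidean distance between paired latent sequences of shape T x |S|."""
    if len(z_c) != len(z_d) or not z_c:
        raise ValueError("latent sequences must be non-empty and of equal length")
    total = 0.0
    for step_c, step_d in zip(z_c, z_d):
        # Squared distance between the two embeddings of the same patient at one time step.
        total += sum((a - b) ** 2 for a, b in zip(step_c, step_d))
    return total / len(z_c)
```

The loss is zero exactly when the continuous and discrete embeddings of a patient coincide at every time step.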
Contrastive loss. On the flip side, the pairwise reconstruction error measured by the matching loss (i.e., intra-correlations within one instance) neglects the commonalities present across patients (inter-correlations of the data) Kiyasseh et al. (2021). To guarantee a sufficient bound for representation learning, we incorporate the contrastive loss as another distance metric to capture the inter-correlations among the population.
Contrastive learning is a concept recently popularized in self-supervised learning (SSL) Liu et al. (2021), which aims to capture intrinsic patterns in input data without human annotations. In this study, we instantiate the contrastive loss via NT-Xent, proposed by Chen et al. in SimCLR Chen, Kornblith, Norouzi and Hinton (2020). The core of contrastive learning is to encourage networks to pull positive pairs closer together and push negative pairs apart in the latent space. We adapt the corresponding auxiliary task for calculating the contrastive loss to the scenario of learning representations from mixed-type timeseries. The objective of the task is to determine whether a set of representations transformed from the observational space belongs to the same patient, which yields the corresponding positive pairs (true) and negative pairs (false).
For patient data of $N$ records, we obtain $N$ pairs of latent representations from the encoders in the dual-VAE. For the patient indexed by $i$, $h^{C}_{i}$ and $h^{D}_{i}$ denote the embeddings derived from the continuous-valued and discrete-valued observational spaces, respectively. Owing to the symmetric architecture of the dual-VAE, we use $d$ and $d'$ to denote the two different domains, i.e., $d, d' \in \{C, D\}$ and $d \neq d'$. The positive pair for patient $i$ is therefore denoted $(i_{d}, i_{d'})$, while the other $2(N-1)$ samples are regarded as negatives. The contrastive loss for a positive pair $(i_{d}, i_{d'})$ is then defined as:
$$\ell(i_{d}, i_{d'}) = -\log \frac{\exp\left(\mathrm{sim}(h^{d}_{i}, h^{d'}_{i})/\tau\right)}{\sum_{k=1}^{N} \sum_{d'' \in \{C, D\}} \mathbb{1}_{[(k, d'') \neq (i, d)]} \exp\left(\mathrm{sim}(h^{d}_{i}, h^{d''}_{k})/\tau\right)}$$
where $\mathrm{sim}(u, v) = u^{\top} v / (\|u\| \|v\|)$ denotes the cosine similarity between two vectors and $\tau > 0$ denotes a temperature hyperparameter.
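The NT-Xent computation can be sketched in plain Python as follows. The function names, the default temperature, and the averaging over all 2N anchors (as in SimCLR) are assumptions made for illustration; each anchor is compared against one positive (the same patient's embedding from the other domain) and 2(N - 1) negatives.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity sim(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_pair(anchor_idx: int, positive_idx: int, views: List[Sequence[float]], tau: float) -> float:
    """Loss for one anchor; the denominator runs over every embedding except the anchor."""
    anchor = views[anchor_idx]
    logits = {k: cosine(anchor, v) / tau for k, v in enumerate(views) if k != anchor_idx}
    log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
    return log_denominator - logits[positive_idx]


def nt_xent(h_c: List[Sequence[float]], h_d: List[Sequence[float]], tau: float = 0.5) -> float:
    """Average NT-Xent loss over both directions (C -> D and D -> C) for N patients."""
    n = len(h_c)
    views = list(h_c) + list(h_d)  # index i holds h^C_i, index n + i holds h^D_i
    total = 0.0
    for i in range(n):
        total += nt_xent_pair(i, n + i, views, tau)
        total += nt_xent_pair(n + i, i, views, tau)
    return total / (2 * n)
```

Embeddings whose continuous and discrete views agree for each patient give a lower loss than embeddings whose views are swapped between patients.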
Semantic loss. In EHR-M-GAN$_{cond}$, a semantic loss is imposed to better align patients with the same labels (conditions) into the same latent space clusters. For example, if the label of severe clinical deterioration in the ICU is given for conditional data generation, the corresponding synthetic continuous-valued timeseries (e.g., severely deranged vital signs) should be strongly associated with the discrete-valued timeseries (e.g., intensive medical interventions) under the same label.
For both domains, additional linear classifiers are trained to classify the latent embeddings according to their corresponding conditions in the observational space. We implement logistic regression as the linear classifier and use the cross entropy as the semantic loss for each domain. For $d \in \{C, D\}$, given the latent embedding vector $z^{d}$ and the conditional information vector $y$, the semantic loss is computed as $\mathrm{CE}\left(f^{d}_{linear}(z^{d}), y\right)$, where $f^{d}_{linear}$ denotes the linear classifier for the corresponding domain. Here $\mathrm{CE} = -\sum_{j} y_{j} \log(\hat{y}_{j})$, $j = 1, 2, \ldots, |L|$, denotes the cross-entropy loss, where $\hat{y}_{j}$ is the output of the linear classifier and $y_{j}$ is the ground-truth label for class $j$ in the condition vector $y$.
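A small sketch of this computation for one latent vector follows, assuming a softmax (multinomial logistic regression) output layer; the function names and the explicit weight and bias arguments are illustrative rather than taken from the released code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def linear_classifier(z: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """f_linear: one logit per class, logit_j = w_j . z + b_j."""
    return [sum(w * a for w, a in zip(row, z)) + b for row, b in zip(weights, bias)]


def semantic_loss(z, y_one_hot, weights, bias) -> float:
    """Cross entropy CE = -sum_j y_j log(y_hat_j) between classifier output and label."""
    y_hat = softmax(linear_classifier(z, weights, bias))
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, y_hat) if y > 0)
```

An untrained classifier with all-zero weights over four classes predicts a uniform distribution, so its loss equals log 4 regardless of the label.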
In summary, to train the dual-VAE for learning the shared latent space representation, the total objective function for $d \in \{C, D\}$ is a weighted combination of the ELBO, matching, and contrastive losses (Eq. 8). Under the conditional learning scenario of EHR-M-GAN$_{cond}$, the semantic loss is added to this combination (Eq. 9). Here $\beta_{0}$, $\beta_{1}$, $\beta_{2}$, and $\beta_{3}$ are scalar loss weights used to balance the loss terms.
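The weighted combination can be sketched as below. The pairing of each weight with a particular loss term (β0 with ELBO, β1 with matching, β2 with contrastive, β3 with semantic) is an assumption made for illustration, as is the function name.

```python
from typing import Optional, Sequence


def dual_vae_total_loss(
    elbo: float,
    match: float,
    contrastive: float,
    semantic: Optional[float] = None,
    betas: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the dual-VAE losses; the semantic term is used only when conditioning."""
    b0, b1, b2, b3 = betas
    total = b0 * elbo + b1 * match + b2 * contrastive
    if semantic is not None:
        # Conditional variant: add the weighted semantic loss.
        total += b3 * semantic
    return total
```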
To validate the effectiveness of the multiple losses and the weight-sharing constraint in the proposed dual-VAE network, we perform the corresponding ablation study using the MIMIC-III dataset as an example. The results can be found in S.3.B in the Supplementary materials. As shown in Table S7, all the components of the proposed dual-VAE network contribute to improving EHR-M-GAN's performance when generating mixed-type timeseries data.

SEQUENTIALLY COUPLED GENERATOR BASED ON CRN
We propose the sequentially coupled generator for generating latent representations of mixed-type timeseries, built on the network architecture of the coupled recurrent network (CRN). Specifically, a CRN uses bilateral long short-term memory (BLSTM) cells as its recurrent layer to preserve the temporal dependencies between the continuous-valued and discrete-valued sequences. The novel bilateral-LSTM architecture we propose can extract and transmit the correlations between the mixed-type timeseries, as opposed to the vanilla LSTM, which has only one recursive connection. In the following, we first discuss the structure of the BLSTM in detail as the essential recurrent layer of the CRN, and then build the sequentially coupled generator based on the CRN.
Bilateral long short-term memory. As the traditional LSTM only considers temporal dynamics from a single type of timeseries, it is incapable of extracting and transmitting temporal correlations from heterogeneous features. We therefore propose the novel bilateral-LSTM cell with two network connections to characterize the correlations between the two types of data.
As indicated by Eq. 10, the proposed BLSTM network overcomes the limitation of the vanilla LSTM network in modelling the correlation between the mixed-type timeseries by establishing a supplemental recursive connection. The new connection allows the model to decide intrinsically how much information from its counterpart should pass through the gates. A diagram of the BLSTM cell in contrast to the vanilla LSTM cell can be found in the Supplementary materials (see Fig. S3).
Coupled recurrent network. The architecture of the CRN consists of three layers: the input layers, the recurrent layers, and the fully connected layers. First, the random noise vectors $\upsilon^{d}_{t}$ and $\upsilon^{d'}_{t}$ for the two domains, which are sampled from uniform distributions (i.e., $\upsilon^{d}_{t}, \upsilon^{d'}_{t} \sim \mathcal{U}(0, 1)$), are separately fed into the input layers. Then the recurrent layers $f_{rec}$, which are built from two streams of BLSTM, one for each data type, recursively update the hidden states of both branches. Finally, the fully connected layers $f^{d}_{conn}$ and $f^{d'}_{conn}$ produce the generated latent vectors $\hat{z}^{d}_{t}$ and $\hat{z}^{d'}_{t}$ for the decoding stage in the dual-VAE. At each time step $t$, the CRN applies these three layers in sequence.
In summary, heterogeneous timeseries that exhibit mutual influence on each other are integrated into the CRN to model their interdependencies. By using the BLSTM cell as its recurrent layer, the two input streams in the generator are encouraged to "communicate" with each other. The CRN is therefore capable of exploiting the interplay between mixed-type data that correlate over time.
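The idea of two communicating streams can be sketched as follows. This is not the authors' BLSTM: it assumes one simple way of coupling, in which each stream's LSTM gates read the counterpart's previous hidden state in addition to its own input and hidden state. Class and function names, the absence of bias terms, and the random (untrained) weights are all illustrative choices.

```python
import math
import random


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class CoupledLSTMCell:
    """LSTM cell whose gates also read the counterpart stream's previous hidden state."""

    def __init__(self, input_size, hidden_size, rng):
        joint = input_size + 2 * hidden_size  # [input, own hidden, counterpart hidden]
        self.w = {gate: init_matrix(hidden_size, joint, rng) for gate in "ifog"}

    def step(self, x, h_own, c_own, h_other):
        joint = list(x) + list(h_own) + list(h_other)
        i = [sigmoid(a) for a in matvec(self.w["i"], joint)]    # input gate
        f = [sigmoid(a) for a in matvec(self.w["f"], joint)]    # forget gate
        o = [sigmoid(a) for a in matvec(self.w["o"], joint)]    # output gate
        g = [math.tanh(a) for a in matvec(self.w["g"], joint)]  # candidate cell state
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_own, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def generate_latents(steps, noise_size, hidden_size, latent_size, seed=0):
    """Roll two coupled streams (C and D) forward and emit one latent vector per step."""
    rng = random.Random(seed)
    cells = {d: CoupledLSTMCell(noise_size, hidden_size, rng) for d in "CD"}
    heads = {d: init_matrix(latent_size, hidden_size, rng) for d in "CD"}  # fully connected layers
    h = {d: [0.0] * hidden_size for d in "CD"}
    c = {d: [0.0] * hidden_size for d in "CD"}
    out = {"C": [], "D": []}
    for _ in range(steps):
        noise = {d: [rng.random() for _ in range(noise_size)] for d in "CD"}  # U(0, 1) noise
        h_prev = dict(h)  # both streams read the same previous hidden states
        for d, other in (("C", "D"), ("D", "C")):
            h[d], c[d] = cells[d].step(noise[d], h_prev[d], c[d], h_prev[other])
            out[d].append(matvec(heads[d], h[d]))
    return out["C"], out["D"]
```

Calling `generate_latents(24, 3, 4, 2)` would yield two 24-step sequences of 2-dimensional latent vectors, one per domain, which would then be passed to the pretrained decoders.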

JOINT TRAINING AND OPTIMIZATION
The overall architecture of EHR-M-GAN is shown in Fig. 1. In this section, we give a detailed description of the entire network's structure and the optimization objective of the model. The steps for the training and optimization of EHR-M-GAN are as follows:
• The pretraining of dual-VAE: First, a dual-VAE network, which consists of a pair of encoders $(\mathrm{Enc}^{C}, \mathrm{Enc}^{D})$ and decoders $(\mathrm{Dec}^{C}, \mathrm{Dec}^{D})$, is pretrained with both continuous and discrete data. Based on the multiple objective constraints in Eq. 8 (for EHR-M-GAN$_{cond}$, see Eq. 9), a shared latent space is learnt using the dual-VAE, in which the gap between the embedding representations $(z^{C}_{1:T}, z^{D}_{1:T})$ from both domains is minimized.
• The generation of latent representations based on CRN: Then, during the joint training stage, the sequentially coupled generator, which is built on the CRN, takes the random noise vectors $(\upsilon^{C}_{1:T}, \upsilon^{D}_{1:T})$ as inputs and iterates across the timesteps $t \in \{1, 2, \ldots, T\}$ using its internal transition functions. The synthetic latent embedding representations $(\hat{z}^{C}_{1:T}, \hat{z}^{D}_{1:T})$ for both continuous and discrete data are thereby obtained.
• The decoding for the mixed-type timeseries: Next, the generated latent embeddings $(\hat{z}^{C}_{1:T}, \hat{z}^{D}_{1:T})$ are fed into the pretrained decoders $(\mathrm{Dec}^{C}, \mathrm{Dec}^{D})$ and decoded into the corresponding synthetic patient trajectories $(\hat{x}^{C}_{1:T}, \hat{x}^{D}_{1:T})$ in the observational space.
• The adversarial loss update based on the discriminators: Finally, the adversarial loss is calculated from the LSTM-based discriminators $D^{C}$ and $D^{D}$, which distinguish between the real samples and the synthetic timeseries for both data types.
The mathematical expression for the min-max objectives in EHR-M-GAN is provided as follows:
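As a rough illustration of what a min-max objective over two discriminators looks like in practice, the sketch below assumes the standard GAN binary cross-entropy for each domain and sums it over the continuous and discrete discriminators. It is not the paper's exact objective; the function names and the non-saturating generator loss are assumptions.

```python
import math
from typing import Dict, Sequence, Tuple


def bce(prob: float, target: float) -> float:
    """Binary cross entropy for one discriminator output in [0, 1]."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(prob + eps) + (1.0 - target) * math.log(1.0 - prob + eps))


def discriminator_loss(real_probs: Sequence[float], fake_probs: Sequence[float]) -> float:
    """The discriminator wants real samples scored 1 and synthetic samples scored 0."""
    real = sum(bce(p, 1.0) for p in real_probs) / len(real_probs)
    fake = sum(bce(p, 0.0) for p in fake_probs) / len(fake_probs)
    return real + fake


def generator_loss(fake_probs: Sequence[float]) -> float:
    """Non-saturating form: the generator wants synthetic samples scored 1."""
    return sum(bce(p, 1.0) for p in fake_probs) / len(fake_probs)


def adversarial_losses(d_out: Dict[str, Tuple[Sequence[float], Sequence[float]]]) -> Tuple[float, float]:
    """d_out maps a domain ("C" or "D") to (scores on real data, scores on synthetic data)."""
    d_loss = sum(discriminator_loss(real, fake) for real, fake in d_out.values())
    g_loss = sum(generator_loss(fake) for _, fake in d_out.values())
    return d_loss, g_loss
```

When both discriminators are undecided (all scores 0.5), the summed discriminator loss equals 4 log 2 under this formulation.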

CONDITIONAL VERSION OF EHR-M-GAN
For the conditional extension, EHR-M-GAN$_{cond}$, the auxiliary label information is first used during the pretraining step of the dual-VAE. Both the encoders and decoders are conditioned on the auxiliary (one-hot) labels from $L$ to make the model better adapted to particular contexts. In the dual-VAE, the additional semantic loss is also incorporated during the optimization of the shared latent space (see Eq. 9).
Meanwhile, the same conditional labels are also applied in the sequentially coupled generator and the discriminators, where the classes are fed as additional inputs through concatenation, as in the original CGAN architecture proposed by Mirza and Osindero (2014).
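Conditioning by concatenation amounts to appending the one-hot label to the input at every time step. The following sketch shows this for a generic sequence; the helper names are hypothetical.

```python
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[float]:
    """One-hot encoding of a class index."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def condition_sequence(sequence: Sequence[Sequence[float]], label: int, num_classes: int) -> List[List[float]]:
    """Append the one-hot label to the feature vector of every time step."""
    tag = one_hot(label, num_classes)
    return [list(step) + tag for step in sequence]
```

For example, a two-step noise sequence with two features per step and label 2 out of 4 classes becomes a two-step sequence with six features per step.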
The t-SNE visualisation of the latent embeddings produced by the dual-VAE can be found in the Supplementary materials (see Section S.4.C); it indicates that the conditional information carried into EHR-M-GAN$_{cond}$ can be beneficial when synthesizing patient trajectories under specific medical conditions. Overall, the adversarial loss for EHR-M-GAN$_{cond}$ is the min-max objective of EHR-M-GAN with the generator and discriminators additionally conditioned on the labels. Pseudocode for the dual-VAE and EHR-M-GAN is provided in the Supplementary materials (see Section S.1.E).

DATASET DESCRIPTION
Three publicly accessible ICU datasets (MIMIC-III, eICU, and HiRID) are used for evaluating the performance of EHR-M-GAN in generating longitudinal EHR data. All these critical care databases include vital sign measurements, laboratory tests, treatment information, survival records, and other routinely collected data from hospital EHR systems. From these clinical observations, we featurize the patient trajectories as the combination of continuous-valued physiological timeseries (such as heart rate, oxygen saturation, and measurements from blood gas tests) and discrete-valued medical intervention timeseries (such as the usage of therapeutic devices or intravenous medications). Temporal trajectories covering the 24 hours prior to patients' ICU endpoints (discharge or death) are extracted from the three critical care databases. Data are preprocessed following an open-source framework, MIMIC-Extract Wang et al. (2020), in which the patients' physiological and intervention signals are aggregated hourly for denser representations. Details on data curation, including the cohort selection criteria, the full list of features, and the imputation method, are explained in the Supplementary Materials (see S.2 Datasets). The summary statistics of the finalised cohorts for the three databases are shown in Table 1.
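The windowing and hourly aggregation described above can be sketched for a single signal as follows. The use of the mean as the hourly aggregate, the `(time_in_hours, value)` input format, and the function name are assumptions for illustration; hours without measurements are left as `None` for a later imputation step.

```python
from typing import List, Optional, Sequence, Tuple


def hourly_window(
    measurements: Sequence[Tuple[float, float]],
    endpoint_hour: float,
    window_hours: int = 24,
) -> List[Optional[float]]:
    """Mean value per hour for the `window_hours` hours before the ICU endpoint."""
    start = endpoint_hour - window_hours
    buckets: List[List[float]] = [[] for _ in range(window_hours)]
    for time_hour, value in measurements:
        # Keep only measurements that fall inside the window before the endpoint.
        if start <= time_hour < endpoint_hour:
            buckets[int(time_hour - start)].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]
```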

BASELINE MODELS
We compare the performance of EHR-M-GAN with eight state-of-the-art generative methods in the literature. However, as these benchmarks typically face challenges when modeling mixed-type EHR timeseries Hjelm et al. (2017) and can only synthesize single-type EHRs, we compare EHR-M-GAN with each benchmark model using the corresponding component of our synthetic results, i.e., either the continuous-valued part or the discrete-valued part. For continuous-valued timeseries generation, the benchmark GAN models include C-RNN-GAN Mogren (2016), R(C)GAN Esteban et al. (2017) and TimeGAN Yoon et al. (2019). For discrete-valued timeseries generation, classic medGAN Choi et al. (2017), seqGAN Yu et al. (2017), and two [...]
Table 1: Summary of the cohorts after preprocessing on three critical care databases. The number of patients and ICU admissions, as well as the dimensions of the continuous-valued and discrete-valued variables, are provided for each dataset. Temporal trajectories covering the 24 hours prior to patients' ICU endpoints are extracted for the three critical care databases. Note that only the first ICU admission is selected for each patient. The conditional labels for training EHR-M-GAN$_{cond}$ and the corresponding counts for each class are also listed.
Figure 2: Compared with the full model of EHR-M-GAN, GAN$_{Unified}$ learns the joint representations of heterogeneous types of data in a unified network; GAN$_{VAE}$ maintains the basic architecture of EHR-M-GAN but ignores the dependency learning (i.e., separate networks for the two streams of inputs are trained in parallel); GAN$_{SL}$ constructs the shared latent space using the dual-VAE module but omits the sequentially coupled generator for learning the temporal correlations in the mixed-type timeseries.
We further perform an ablation study to investigate whether the novel components introduced in the proposed model have advantages over variants that also model mixed-type EHRs. First, as EHR-M-GAN learns the joint representations from heterogeneous types of data using separate (but inherently correlated and weight-sharing) VAE networks, we compare it with a variant that jointly models the mixed-type data using a single unified VAE network (denoted as GAN$_{Unified}$). Then, we test the variant that encodes the mixed-type inputs separately with two independent VAE networks and then combines the resulting synthetic data of the different types as outputs (denoted as GAN$_{VAE}$). Lastly, we assess the effectiveness of the proposed dual-VAE component alone by implementing GAN$_{SL}$. The architectures of the different variants of EHR-M-GAN in the ablation study are detailed as follows (also see Fig. 2 for illustration):
• GAN$_{Unified}$: It contains a unified VAE module and two separate GANs. The continuous-valued and discrete-valued timeseries are concatenated, after normalization and one-hot encoding, as the input to the encoder of the unified VAE network. The decoder receives the concatenation of the generated latent vectors as input and decodes it into synthetic timeseries of the corresponding data types using separate fully connected layers. Each component in the architecture of GAN$_{Unified}$ (unified encoder and decoder, separate generators and discriminators) is implemented with LSTMs, as in EHR-M-GAN.
• GAN$_{VAE}$: It is composed of a pair of VAE networks and GANs (one for each type of input). The continuous-valued and discrete-valued timeseries from the same patients are fed separately into the corresponding paths in GAN$_{VAE}$, which run in parallel. The synthetic outputs of each data type are then combined as the final results. It maintains the basic structure of EHR-M-GAN but lacks the latent space sharing of the dual-VAE and the sequentially coupled generator of the original EHR-M-GAN.
• GAN$_{SL}$: In addition to GAN$_{VAE}$, it learns the shared latent space representations through the dual-VAE by adding the corresponding loss functions of EHR-M-GAN, including the ELBO loss, matching loss and contrastive loss. This model lacks the sequentially coupled generator.
• EHR-M-GAN: In addition to GAN$_{SL}$, it incorporates the sequentially coupled generator for learning the correlated temporal dynamics in timeseries of different data types. This is the proposed full model.
• EHR-M-GAN$_{cond}$: This version is implemented on the basis of the conditional GAN Mirza and Osindero (2014), where the conditional inputs are fed into EHR-M-GAN to generate patients under specific labels.
For training EHR-M-GAN cond , auxiliary information from the patient status is used as conditional input. These conditional inputs are selected since synthesizing EHR information of patient subgroups with potential outcomes would be valuable for clinicians in their decision-making process. Other conditional labels (such as patient demographics in a categorized format) can also be used in the proposed conditional synthesizer for other research purposes. For the MIMIC-III dataset, the classes are (1) ICU mortality: patient died within the ICU; (2) Hospital mortality: patient discharged alive from the ICU, and died within the hospital; (3) 30-day readmission: patient discharged alive from the hospital, and readmitted to the hospital within 30 days; (4) No 30-day readmission: patient discharged alive from the hospital, and had no readmission record to the hospital within 30 days. For the eICU and HiRID datasets, the corresponding labels are also extracted based on the availability of the patient outcomes (see Table 1).
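As an illustration of how the four MIMIC-III classes partition patients, a minimal sketch follows. The function and argument names are hypothetical, and the one-hot encoding of the label is an assumption (a common choice for conditional GANs), not a detail stated for EHR-M-GAN cond .

```python
# Hypothetical helper names; the class order follows the list in the text.
MIMIC_CLASSES = (
    "ICU mortality",
    "Hospital mortality",
    "30-day readmission",
    "No 30-day readmission",
)


def outcome_class(died_in_icu, died_in_hospital, readmitted_30d):
    """Return the index of the conditional class for one patient.

    The checks are ordered so that the four classes are mutually exclusive:
    death in the ICU, death in hospital after ICU discharge, then readmission.
    """
    if died_in_icu:
        return 0
    if died_in_hospital:
        return 1
    return 2 if readmitted_30d else 3


def one_hot(index, n_classes=len(MIMIC_CLASSES)):
    """Encode a class index as a one-hot vector (assumed conditioning format)."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]
```

A patient discharged alive and readmitted within 30 days would then be encoded as `one_hot(outcome_class(False, False, True))`.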

EVALUATION METRICS
Evaluating GAN models is a notoriously challenging task. Advantages and pitfalls of commonly used evaluation metrics for GANs are discussed in Borji (2021). In this work, a systematic evaluation framework is adopted to assess the quality of synthetic patient EHRs with respect to their fidelity, correlation, utility, and privacy (see Table 2). We first individually assess the representativeness of the synthetic continuous-valued and discrete-valued timeseries. This includes measuring the distance between underlying data distributions (such as Maximum mean discrepancy and Dimension-wise probability), comparing the feature-level statistics between the real and synthetic data (Patient trajectories), and assessing the indistinguishability of the synthetic data from the true data (i.e., Discriminative score). Secondly, we evaluate to what extent our model can reconstruct the interdependency between different features (Pearson pairwise correlations) and the temporal dynamics in the patient trajectories (Autocorrelation function), using a set of qualitative and quantitative metrics. Thirdly, we introduce data augmentation by incorporating synthesized EHR timeseries under various settings, and quantitatively assess the improvement provided by EHR-M-GAN in the Downstream tasks for medical intervention prediction in the ICU (i.e., the utility of the synthetic data). Lastly, we measure the privacy-preserving properties of EHR-M-GAN under Membership inference attack. We also evaluate the performance of the same downstream tasks under Differential privacy guarantees (see Fig. 1c and Table 2 for the evaluation pipeline).

MAXIMUM MEAN DISCREPANCY
To measure the similarity between the continuous-valued synthetic data and the real data, maximum mean discrepancy (MMD) is used. MMD can assess whether two sets of samples are drawn from the same distribution; in our case, one set comes from the true data $x$ and the other from the synthetic data $\tilde{x}$ generated by GANs. To calculate the statistic, a kernel function $K : X \times X \to \mathbb{R}$ is used to quantify the similarity between the two distributions. In this study, a sum of Gaussian kernels is adopted following the implementation in Sutherland et al. (2016), where $\sigma_i$ is the value of the i-th selected bandwidth for calculating MMD. As the real and synthetic samples in our study are multivariate timeseries aligned along a fixed time axis (i.e., 24 data points per patient), we handle these multivariate timeseries as matrices and use the Frobenius norm ($\|\cdot\|_F$) of their difference inside the kernel function Esteban et al. (2017).
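For reference, a sum of Gaussian kernels over Frobenius distances is commonly written as below. This is a sketch of the standard form; the $2\sigma_i^2$ normalization and the number of bandwidths $n$ are assumptions, since the exact constants used in the study are not recoverable from the text.

```latex
K(x, \tilde{x}) = \sum_{i=1}^{n} \exp\left( -\frac{\lVert x - \tilde{x} \rVert_F^{2}}{2\sigma_i^{2}} \right)
```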
Finally, given samples $\{x_i\}_{i=1}^{N}$ from the real distribution and samples $\{\tilde{x}_j\}_{j=1}^{M}$ from the synthetic distribution (with $N$ and $M$ denoting the corresponding sample sizes), the MMD is estimated from the kernel values within and between the two sample sets (Eq. 15). It can be inferred from Eq. (15) that higher similarity between the two distributions leads to a lower MMD value, with the lower bound of zero indicating that the two distributions are identical. As indicated in Table 3, EHR-M-GAN outperforms the state-of-the-art benchmarks on all three datasets in synthesizing continuous-valued timeseries. The conditional version, EHR-M-GAN cond , further boosts the performance of the model by leveraging the information of the condition-specific inputs. Furthermore, as shown in the ablation study, EHR-M-GAN and EHR-M-GAN cond produce smaller MMD values when compared to their variants. Using MIMIC-III as an example, compared with the basic model GAN VAE , integrating the shared latent space learning using dual-VAE under multiple loss constraints significantly improves the performance of GAN SL (GAN SL vs. GAN VAE , 0.745 to 0.926, p-value < 0.05 from t-test). By further building the sequentially coupled generator based on BLSTMs and exploiting the information within mixed-type data, the MMD of EHR-M-GAN shows a nearly 24% improvement over GAN VAE . When synthesizing mixed-type timeseries using the unified network, the performance of GAN Unified for generating continuous-valued timeseries lags behind the proposed EHR-M-GAN. It can therefore be inferred that, compared with EHR-M-GAN, which extracts useful hierarchical representations for each data type using tailored encoding layers, it is quite challenging for GAN Unified to learn marginal distributions from raw mixed-type timeseries with a unified architecture.
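The MMD computation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes the biased (V-statistic) estimator of the squared MMD, which is non-negative, and the bandwidth values are arbitrary placeholders.

```python
import math


def frobenius_sq(a, b):
    """Squared Frobenius norm of the difference of two equally shaped matrices."""
    return sum((u - v) ** 2 for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b))


def kernel(a, b, bandwidths):
    """Sum of Gaussian kernels, one term per bandwidth (2*sigma^2 scaling assumed)."""
    d2 = frobenius_sq(a, b)
    return sum(math.exp(-d2 / (2.0 * s * s)) for s in bandwidths)


def mmd2_biased(real, synth, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between two lists of (time x feature) matrices.

    The bandwidths are hypothetical; in practice they would be tuned to the data.
    """
    n, m = len(real), len(synth)
    k_rr = sum(kernel(a, b, bandwidths) for a in real for b in real) / (n * n)
    k_ss = sum(kernel(a, b, bandwidths) for a in synth for b in synth) / (m * m)
    k_rs = sum(kernel(a, b, bandwidths) for a in real for b in synth) / (n * m)
    return k_rr + k_ss - 2.0 * k_rs
```

Identical sample sets give a value of zero under this estimator, and the value grows as the synthetic samples move away from the real ones.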

DIMENSION-WISE PROBABILITY
To evaluate the representativeness of the synthetic discrete-valued timeseries, the dimension-wise probability test is employed. To compare the probability distributions of the real and synthetic binary features, the Bernoulli success probability p ∈ [0, 1] is calculated for the discrete-valued timeseries, and is visualized through a scatterplot. As a sanity check, it investigates whether the probability of the medical intervention being active at the given timestamps is matched between the real data (x-axis) and synthetic data (y-axis). The correlation coefficients (CCs) and root-mean-square errors (RMSEs) are also adopted Baowaly et al. (2019), based on the Bernoulli success probabilities, to quantitatively measure the distribution divergence between real and synthetic data.
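The test described above can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical, and each patient is assumed to be stored as a time-by-intervention matrix of 0/1 values.

```python
import math


def dimension_wise_probability(samples):
    """Bernoulli success probability for every (timestamp, intervention) pair.

    samples: list of patients, each a T x D matrix of 0/1 values.
    Returns a flat list of length T * D.
    """
    n = len(samples)
    t_len, d_len = len(samples[0]), len(samples[0][0])
    return [sum(s[t][d] for s in samples) / n for t in range(t_len) for d in range(d_len)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root-mean-square error between two equally long lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

Plotting the real probabilities against the synthetic ones gives the scatterplot, while `pearson` and `rmse` applied to the same two lists give the CC and RMSE.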
As shown in Fig. 3 (see Fig. S4 and S5 for more results on eICU and HiRID datasets), the optimal results are provided by EHR-M-GAN and EHR-M-GAN cond . The close-to-real probability distributions that appear along the diagonal line indicate the remarkable similarity between the real data and the synthetic data provided by our models. The quantified CC and RMSE also correspond with the visualisation results, and are close to the highest mark (EHR-M-GAN: RMSE = 0.0095, CC = 0.9973). Similar to the results in MMD, the dimension-wise distributions are better captured when modules such as dual-VAE and the sequentially coupled generator are introduced in EHR-M-GAN. GAN Unified suffers from mode collapse (the generator fails to produce outputs with sufficient diversity), and therefore shows poor performance compared with other variants when synthesizing discrete-valued timeseries. As the mixed-type features are treated as unimodal input without differentiating their heterogeneous nature, no marginal representations are explicitly learned. Among all state-of-the-art benchmark models, DualAAE shows the best result but is slightly suboptimal when compared to EHR-M-GAN. In contrast, both a skewed distribution and low performance scores are observed in medGAN, as it lacks the ability to capture the temporal correlations within timeseries. SynTEG shows improved performance over medGAN, as it is capable of synthesizing discrete-valued features in EHRs with timestamps. The non-GAN generative method PrivBayes also shows good performance among all the benchmark synthesizers when modeling the underlying probability distribution of the discrete-valued EHR timeseries. On the other hand, despite the well-known performance of SeqGAN in natural language generation, it is not quite applicable in synthesizing sequential clinical EHRs.
Dimension-wise probability calculates the Bernoulli success probability of each dimension, i.e., the probability of the treatment being active at a particular time. The x-axis and y-axis represent dimension-wise probability for the real data and synthetic data generated from different models, respectively. The same color indicates the same treatment (but with varying timestamps). The optimal performance appears along the diagonal line. The corresponding CCs ([0, 1], the higher the better) and RMSEs ([0, +∞), the lower the better) are also calculated to quantify the probability distribution similarities between the real and synthetic EHR timeseries. Dimension-wise probability plots for the eICU and HiRID datasets can be found in Supplementary materials (see S.4.A).
Generating discrete-valued features is known to be problematic for traditional GANs. Due to their limitation in passing the gradients from the critic models, vanilla GANs cannot update their generators efficiently based on the adversarial loss Yu et al. (2017); Choi et al. (2017). However, the results of EHR-M-GAN show its superiority in explicitly capturing each dimension of the discrete-valued sequences. EHR-M-GAN mitigates this problem by learning the shared latent representations using dual-VAE: discrete-valued timeseries are encoded into a gradient-differentiable space, in which the generators can be further optimized.

PATIENT TRAJECTORIES
We compare the distribution of patient trajectories per timepoint between the real data and synthetic data generated by EHR-M-GAN for the MIMIC-III dataset. Five commonly measured vital sign and laboratory features (Oxygen Saturation, Systolic Blood Pressure, Respiratory Rate, Heart Rate, and Temperature), as well as two medical intervention features (Mechanical Ventilation and Vasopressor), are considered and compared as an exemplar in Fig. 5. It can be inferred that the proposed model can accurately capture the statistical distribution (mean and standard deviation) of both continuous-valued and discrete-valued features. The temporal dynamics are well-preserved in the synthetic timeseries. For example, the variance of Oxygen Saturation gradually increases towards the ICU endpoints in the real data, and this is closely reflected in the synthetic timeseries. Furthermore, EHR-M-GAN cond shows superior performance as it can generate correct trajectories under specific patient conditions (see section S.4.D in Supplementary Materials for results).

DISCRIMINATIVE SCORE
For both continuous-valued and discrete-valued data, the discriminative score is measured as the accuracy of a discriminator trained post-hoc to separate real from generated samples. Synthetic data are generated in the same amount as the hold-out test set from the original data, and the two sets are labeled as synthetic and real, respectively, to train the binary classifier. In this study, the classifier (critic) is implemented with a single-layered Bi-directional Long Short-Term Memory (Bi-LSTM) model (i.e., many-to-one), with its parameters randomly initialized (as opposed to a critic built upon representations from the trained generative model Zhang et al. (2022)). The critic trained on this supervised learning task can be used to characterize the temporal correlations across the patient EHR timeseries.
As indicated by the results in Table 4, EHR-M-GAN and EHR-M-GAN cond produce synthetic data that are less distinguishable from real data than those of the benchmarked models. EHR-M-GAN cond in particular achieves the optimal discriminative scores consistently against other benchmarks for both continuous-valued and discrete-valued timeseries. For discrete-valued data generation, EHR-M-GAN-generated samples achieve a discriminative score of 0.813 on the MIMIC-III dataset, a statistically significant 4% improvement over the best performing benchmark (EHR-M-GAN vs. DualAAE: 0.813 to 0.847, p < 0.05). The overall discriminative scores produced by PrivBayes on the three ICU databases are comparable with GAN models such as SynTEG and DualAAE. For continuous-valued timeseries generation, the discriminative score of TimeGAN on the HiRID dataset outperforms the other models as well as EHR-M-GAN, though the difference is not statistically significant (EHR-M-GAN vs. TimeGAN: 0.724 to 0.716, p = 0.4374). By leveraging the additional information from the conditional inputs, EHR-M-GAN cond can provide a significantly better result than TimeGAN (EHR-M-GAN cond vs. TimeGAN: 0.693 to 0.716, p < 0.05).
The ablation study has demonstrated the effectiveness of EHR-M-GAN for generating high quality EHR timeseries. The shared latent space representation learning in the dual-VAE (i.e., GAN SL ) makes the synthetic data markedly more realistic than separately generating the latent embeddings based on VAEs (as in GAN VAE ). The sequentially coupled generator further improves the model by capturing the dynamics between mixed-type data and iterating over time, therefore making the synthetic timeseries harder to distinguish from the original. Compared with GAN Unified , which models the mixed-type data in a unified network, our proposed model enables effective learning of the marginal distributions of each data type. More importantly, EHR-M-GAN can leverage its dependency learning components to explicitly capture the correlations between heterogeneous types of data.

INTERDEPENDENCY CHARACTERISTICS
In this section, we first employ Pearson pairwise correlation (PPC), which ranges from -1 to 1, to evaluate how closely the synthetic data can model the correlations between continuous-valued and discrete-valued timeseries. Timestamps of the patient trajectories are extracted at 3-hour intervals over the total 24 hours of ICU stay, to explore the temporal dependencies within different variables. To further quantitatively measure the difference between heatmaps generated from real and synthetic samples, we calculate the mean value of the absolute difference between the two PPC matrices ($\mu_{abs}$). We also adopt correlation accuracy (CorAcc) Tao et al. (2021), which quantifies the similarity of two heatmaps within the range of 0 to 1. We discretize the correlation coefficients into 6 correlation levels: strong negative

[Figure panel: HiRID, synthetic data; CorAcc = 84.79%, $\mu_{abs}$ = 0.0385]
As observed, correlation trends over distinctive features are closely reflected by the synthetic data, with the quantitative measure CorAcc consistently exceeding 0.8 on the three critical care databases. It is also worth noting that EHR-M-GAN can successfully recover temporal dependencies with a high granularity from real patient trajectories. For example, synchronized correlations across timestamps are observed between Respiratory Rate and Heart Rate in the MIMIC-III dataset. Such trends are preserved in synthetic data. This can be explained by the common regulation of these two features by the autonomic nervous system and their synchronized increase in cases of physiological stress, such as hypoxemia. In summary, the proposed EHR-M-GAN can reconstruct the temporal dynamics and correlations between features in the real data, which is valuable for downstream ML-based classification and prediction applications.
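The two heatmap comparison measures can be sketched as follows. This is a minimal sketch under stated assumptions: the bin edges that split the correlation range into six levels are hypothetical placeholders (the level boundaries are not given in the text above), and all function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_matrix(columns):
    """Pairwise correlations between feature columns (each a list over patients)."""
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


def mean_abs_difference(c_real, c_synth):
    """Mean absolute element-wise difference between two correlation matrices."""
    diffs = [abs(a - b) for ra, rb in zip(c_real, c_synth) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def level(r, edges):
    """Index of the correlation level that r falls into, given ascending edges."""
    return sum(1 for e in edges if r >= e)


def correlation_accuracy(c_real, c_synth, edges=(-0.5, -0.3, 0.0, 0.3, 0.5)):
    """Fraction of matrix entries whose real and synthetic levels agree.

    The five edges (six levels) are placeholders chosen for illustration only.
    """
    pairs = [(a, b) for ra, rb in zip(c_real, c_synth) for a, b in zip(ra, rb)]
    return sum(1 for a, b in pairs if level(a, edges) == level(b, edges)) / len(pairs)
```

Each column passed to `correlation_matrix` would hold one feature at one extracted timestamp, so the resulting matrix corresponds to one heatmap.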
Then, autocorrelation functions (ACF) Benedetti et al. (2020) and the corresponding root-mean-square errors (RMSEs) are calculated to show how EHR-M-GAN can capture the temporal correlations within the timeseries. ACF measures the relationship between a timeseries and its lagged version. Fig. S6-S8 in the Supplementary materials show the ACF calculated for selected continuous-valued and discrete-valued variables (the same as in the Pearson pairwise plot) on real and synthetic timeseries. The time lags are specified as the hourly intervals up to 24 hours before patients' ICU endpoints (ICU discharge or death). Additionally, RMSEs are calculated to quantitatively evaluate the similarity between the corresponding two curves produced by real data and synthetic data.
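A minimal sketch of this comparison is given below. It uses the standard sample autocorrelation estimator, which is an assumption about the exact estimator used in the study; the function names are illustrative.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation of one timeseries for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    out = []
    for lag in range(max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
        out.append(cov / var if var > 0 else 0.0)
    return out


def rmse(a, b):
    """Root-mean-square error between two ACF curves of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))
```

Computing `acf` for a real and a synthetic variable over the same lags and passing both curves to `rmse` yields the quantitative similarity described above.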
Similar patterns are present in the ACF calculated for real data and their synthetic counterparts, and the quantitative statistics correspond with this observation. Moreover, overlapping confidence intervals indicate that the synthetic data consistently captures the underlying temporal distributions within the real timeseries. For variables such as Heart Rate, Oxygen Saturation, and Systolic Blood Pressure, the positive ACF coefficients rapidly decrease within the first few hours, followed by growing trends of negative temporal correlation. The lag with the lowest correlation coefficient is identified at approximately 4 hours. Specifically, global peaks appear roughly at the 12-hour ticks of Temperature for both real and synthetic data on the three critical care databases. Meanwhile, the negative correlation strengthens as the time lag increases for Mechanical Ventilation in the original timeseries. Since these behaviours are reproduced by EHR-M-GAN, they demonstrate that our model can effectively capture the temporal characteristics in the original timeseries.

Figure 5: Comparison of the distribution of values at each timepoint (mean and standard deviation) between real and synthetic patient trajectories produced by EHR-M-GAN. Multivariate timeseries 24 hours before patients' ICU endpoints are generated, including Heart Rate, Respiratory Rate, Systolic Blood Pressure, Oxygen Saturation, Temperature, Mechanical Ventilation and Vasopressor. The mean value of the real/synthetic feature at each timepoint is plotted by the solid/dotted line, with the shaded area indicating ±1 standard deviation. For Mechanical Ventilation and Vasopressor, the y-axis indicates the probability of such an intervention being applied ("On") at a given time. The synthetic patient trajectories generated by EHR-M-GAN cond under different conditions can be found in Supplementary materials.

DOWNSTREAM TASKS
As previously discussed, one of the most prominent goals for GANs is to benefit future downstream analyses in real clinical applications. A relevant question in the ICU is whether specialized medical treatments, such as therapeutic interventions or organ support, are required for critically ill patients during the admission. Accurate predictions on such tasks can help clinicians to provide actionable, timely interventions in the resource-intensive ICU. Therefore, in this section, clinical intervention prediction tasks are implemented to evaluate the potential of EHR-M-GAN and EHR-M-GAN cond in synthesizing high-fidelity synthetic data to further boost the performance of ML classifiers. In line with prior work Wang et al. (2020); Wu et al. (2017); Suresh et al. (2017), we establish LSTM-based classifiers to predict the status of mechanical ventilation and vasopressors using continuous-valued multivariate physiological signals as the predictors. A fixed duration of 12 hours is used for both the observation window and the prediction window (see Fig. 1). Four outcomes of medical intervention status are defined: Stay on, Onset, Switch off, and Stay off (detailed descriptions can be found in Fig. 1).
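One plausible way to derive the four outcome labels from an hourly intervention flag is sketched below. The precise definitions are given in Fig. 1 and are not reproduced here, so the rule used in this sketch (a window counts as "on" if the intervention is active at its last hour) is an assumption, as are the function names.

```python
def intervention_outcome(on_in_observation, on_in_prediction):
    """Map the intervention status in the two windows to one of four outcomes."""
    if on_in_observation and on_in_prediction:
        return "Stay on"
    if not on_in_observation and on_in_prediction:
        return "Onset"
    if on_in_observation and not on_in_prediction:
        return "Switch off"
    return "Stay off"


def label_from_series(status, obs_hours=12):
    """Label a 24-hour 0/1 intervention series.

    Assumption: each window is summarised by the flag at its final hour;
    the study's own rule (Fig. 1) may differ.
    """
    observation, prediction = status[:obs_hours], status[obs_hours:]
    return intervention_outcome(bool(observation[-1]), bool(prediction[-1]))
```

With a 12-hour observation window followed by a 12-hour prediction window, a series that is off for the first 12 hours and on afterwards would be labelled "Onset".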
We partition the dataset as illustrated in Figure 6a, and the performances are assessed from two aspects (see Figure 6b): (i) Traditional approach: To explore whether the synthetic data can represent the real data accurately, we compare Train on Real, Test on Real (TRTR) with Train on Synthetic, Test on Real (TSTR), to show whether the performance of a classifier trained on synthetic data from EHR-M-GAN or EHR-M-GAN cond can be generalized to real data. In addition to the proposed models, synthetic data produced by the baseline models are also used to train the downstream classifiers for comparison. Besides measuring data utility, where the downstream task is to predict discrete-valued medical interventions (described as outcomes in this scenario) using continuous-valued physiological features (denoted as predictors), TSTR can also be used to assess a data synthesizer's ability to capture the interdependencies between the mixed-type features. (ii) Data augmentation approach: As data augmentation is employed as a means of circumventing the issues caused by under-resourced EHR data, here we explore whether synthetic data can be used to improve existing ML algorithms through data augmentation. Therefore, Train on Synthetic and Real, Test on Real (TSRTR) is compared with TRTR to measure the improvement of the classifier's performance when trained on the augmented data Esteban et al. (2017); Kiyasseh et al. (2020). The augmentation ratio α or β is applied to the synthetic data B or the sub-train data $A_{Tr}$, respectively, in the two different scenarios of TSRTR. Details are explained as follows (also see Figure 6b for illustration).
Firstly, as the dearth of data potentially degrades the performance of downstream classifiers, and given that the real data has a limited and fixed sample size, we investigate whether adding synthetic EHR data provided by EHR-M-GAN and EHR-M-GAN cond can improve the training of downstream classifiers. Ratio α indicates the portion of synthetic data (see Figure 6b) being used to augment the real data to improve the quality and robustness of the downstream classifiers. α is set to 10%, 25%, and 50%, representing the availability of synthetic samples provided for augmentation.
Secondly, the acquisition of healthcare data is generally time-consuming and expensive; therefore, another overarching goal for the generative model is to minimize the effort of collecting data. In this section, we investigate whether high-fidelity synthetic data can offer a viable solution for boosting the downstream classifiers' performance when the availability of real data is limited. This allows us to understand whether the sample size required for real data collection can be reduced while maintaining sufficient predictive power through the use of synthetic data. During the experiment, the synthetic data B is given (to emulate the scenario where synthetic datasets are available for a particular clinical research purpose), and is combined with limited real data (collected during a clinical trial) to train the downstream classifiers (i.e., the synthetic data is augmented with limited real data). Then, by implementing EHR-M-GAN or EHR-M-GAN cond in TSRTR, we investigate the proportion of the real data $A_{Tr}$ (ratio β) required to maintain the same performance as in TRTR, based on the entire synthetic dataset B (see Figure 6b).
Traditional approach. Table 5 compares the classification performances of predicting forthcoming medical interventions in the ICUs under the experimental settings of TRTR and TSTR. As expected, the optimal AUROCs are achieved by the classifiers that are trained on real data. In comparison, the classifiers trained on the synthetic data provided by the proposed models achieve similar performances. More specifically, synthetic data generated by EHR-M-GAN cond demonstrates better generalisability than EHR-M-GAN in the downstream application, such as the task of predicting mechanical ventilation on the HiRID dataset (TRTR vs. TSTR from EHR-M-GAN cond : 0.867 to 0.856, with p = 0.3906). Compared with the baseline models, the proposed EHR-M-GAN shows improved performance in TSTR, as it can model the distribution of mixed-type EHRs more accurately, while preserving the temporal correlations in the heterogeneous timeseries through the dependency learning components. The results indicate that the interdependency between the mixed-type EHRs is weakly captured by GAN VAE , as the two streams of inputs are trained in parallel and separately. GAN Unified attempts to capture the temporal correlations of mixed-type EHRs by jointly modeling their underlying distribution in a unified network. However, because its unified architecture limits the model's capacity to learn the marginal distribution of each data type, the resulting quality of the synthetic EHRs is impaired, and so is its performance in TSTR.
Data augmentation approach (with ratio α). The results in Table 6a demonstrate that classifiers boosted by EHR-M-GAN can consistently outperform TRTR (see Table 5) at the augmentation ratio of 50%. In comparison, an augmentation ratio of only 25% is needed to achieve improved results for EHR-M-GAN cond . For example, the classifier trained on MIMIC-III to predict the status of Vasopressor with augmentation ratio α set to 50% significantly increases the AUROC by 6% when compared to the classifier trained using only the real data (EHR-M-GAN cond vs. TRTR: 0.896 to 0.841, p < 0.05). Our experimental results demonstrate that the proposed models can be used for data augmentation to overcome the issue of data scarcity and subsequently improve the classifiers' performance.
Data augmentation approach (with ratio β). On the other hand, as shown in Table 6b, by augmenting with the synthetic data provided by EHR-M-GAN, only approximately 50% of the real data is required to keep the classification AUROCs on par with, or even significantly better than, those obtained by fully exploiting the real data under TRTR. For EHR-M-GAN cond , the ratio of real data needed to maintain comparable predictive power is further reduced to 25%, which equivalently indicates a 75% reduction of the sample size required in real data collection. Overall, the results presented in Table 6b demonstrate that by exploiting only a limited ratio of the real data, EHR-M-GAN and EHR-M-GAN cond can robustly maintain the level of prediction performance, therefore alleviating the necessity of acquiring clinical data at scale.
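The two TSRTR scenarios discussed in this section differ only in how the training set is assembled, which can be sketched as follows. The function names are hypothetical, and reading α as a fraction of the synthetic set B and β as a fraction of the real sub-train set is an interpretation of the text rather than a confirmed detail.

```python
import random


def augment_with_alpha(real_train, synthetic, alpha, seed=0):
    """Scenario with ratio alpha: all real training data plus a fraction of B."""
    rng = random.Random(seed)
    k = int(round(alpha * len(synthetic)))
    return list(real_train) + rng.sample(list(synthetic), k)


def augment_with_beta(real_train, synthetic, beta, seed=0):
    """Scenario with ratio beta: the entire synthetic set B plus a fraction of real data."""
    rng = random.Random(seed)
    k = int(round(beta * len(real_train)))
    return rng.sample(list(real_train), k) + list(synthetic)
```

In both cases the resulting list would be used to train the downstream classifier, which is then evaluated on held-out real data only.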

PRIVACY RISK EVALUATION
Patient privacy is a major concern with regard to sharing electronic health records by any means. In contrast to data anonymisation, generative models avoid an explicit one-to-one mapping to the underlying original data. However, GANs could potentially raise privacy concerns of information leakage if they simply "memorise" the training data, or synthesize samples nearly identical to the real samples (often due to mode collapse). In that case, sensitive medical information (e.g., national insurance number) belonging to a specific patient used in training GANs can be retrieved during the generative stage, thus posing challenges for preserving privacy in downstream applications.
In this section, we first quantify the vulnerability of EHR-M-GAN to an adversary's membership inference attacks, also known as presence disclosure Hayes et al. (2019); Chen, Yu, Zhang and Fritz (2020). The threat model is implemented as membership inference for GANs in the black-box setting Hayes et al. (2019). The attacker is assumed to possess complete knowledge of the full set of patient records P, a subset of which is used to train the GAN. During the experiment, the number of samples in the subset used for training EHR-M-GAN is varied to investigate the impact of the availability of training data on the success of the attacker (see Figure 7a). By observing the synthetic patient records from EHR-M-GAN, the adversary's goal is to determine whether a single known record x in the patient record set P belongs to the data used in training EHR-M-GAN. If EHR-M-GAN simply "memorises" the training data and can only generate synthetic samples (nearly) identical to the real samples, it would be straightforward for the adversary to identify samples that were used as training data. The accuracy and recall are then calculated according to whether the attacker correctly infers that a given record was or was not in the GAN's training set.
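To make the accuracy and recall computation concrete, a simplified sketch follows. It substitutes a nearest-neighbour distance rule for the attack itself, which is not the black-box attack of Hayes et al. (2019); the threshold and function names are hypothetical, and each record is assumed to be flattened into a single vector.

```python
import math


def distance(a, b):
    """Euclidean distance between two flattened patient records."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def infer_membership(records, synthetic, threshold):
    """Guess 'member' when a record lies within threshold of any synthetic sample."""
    return [min(distance(r, s) for s in synthetic) <= threshold for r in records]


def accuracy_and_recall(predicted, actual):
    """Attack accuracy and recall, given predicted and true membership flags."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    n_members = sum(1 for a in actual if a)
    accuracy = (tp + tn) / len(actual)
    recall = tp / n_members if n_members else 0.0
    return accuracy, recall
```

Values of accuracy and recall near 0.5 on a balanced candidate set would indicate that the attacker does no better than random guessing.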
As shown in Figure 7a, when 90% of the training data is used for developing EHR-M-GAN, the attacker achieves a recall of 0.533 and an accuracy of 0.527 in recovering which records were used for training. This is very close to random guessing (i.e., 0.5), indicating that EHR-M-GAN is sufficiently robust against the membership inference attack. In other words, patient samples used in EHR-M-GAN's training are not recoverable by the threat model. On the other hand, as the percentage of the training data decreases, both accuracy and recall for membership inference attacks rise. An accuracy of 0.624 and a recall of 0.732 are reached with 20% of the training data. This offers a guideline for future applications in developing GANs: incorporating more training data can make the generator less susceptible to such attacks. This is also consistent with the conclusion derived from the experiment on membership inference attacks in prior research Lin et al. (2020).
The concept of differential privacy (DP) Dwork (2008), which is a rigorous mathematical definition of privacy, has emerged as the prevailing notion in the context of statistically analyzing data privacy. (ε, δ)-differential privacy is guaranteed for a model $M$ if, given any pair of adjacent datasets $D$ and $D'$ (differing in a single patient record), it holds that $P[M(D) \in S] \le e^{\epsilon} P[M(D') \in S] + \delta$. In our case, $M(\cdot)$ is the GAN model trained on $D$ or $D'$, and $S$ is any subset of the possible outcomes of the generative process. By perturbing the underlying data distribution, DP bounds the maximum variation of the algorithm when any single individual is included in or excluded from the dataset. In practice, recent works on developing differentially private deep learning models have benefited from the differentially private stochastic gradient descent (DP-SGD) algorithm. DP-SGD implements DP by gradient clipping and noise addition during SGD, thereby ensuring that the impact of any single record in the training dataset on the algorithm parameters is limited to the extent allowed by DP. In this section, (ε, δ)-differential privacy is implemented in EHR-M-GAN using TensorFlow Privacy. We then perform the same downstream tasks on medical intervention prediction using synthetic data generated from DP-guaranteed EHR-M-GAN, and compare its performance with TSTR (as shown in Table 5).
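The clipping and noise-addition steps of DP-SGD can be illustrated with a single parameter update. This is a generic sketch of the algorithm in plain Python, not the TensorFlow Privacy implementation used in the study; the names `max_norm` and `noise_multiplier` are illustrative, and the accounting that converts the noise level into an (ε, δ) guarantee is omitted.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, add noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    noisy_mean = []
    for j in range(len(params)):
        total = sum(g[j] for g in clipped)
        # Gaussian noise scaled to the clipping bound limits any one record's influence.
        total += rng.gauss(0.0, noise_multiplier * max_norm)
        noisy_mean.append(total / n)
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

A larger `noise_multiplier` corresponds to a stricter privacy budget (smaller ε) at the cost of noisier updates, which is the privacy-utility tradeoff examined in this section.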
Figure 7b shows the TSTR performance of EHR-M-GAN under the differential privacy guarantee with varying ε budgets (δ fixed at ≤ 0.001). The value of ε determines how strict the privacy is, where a smaller value indicates a stronger privacy restriction. As suggested in Figure 7b, the performance of the downstream tasks based on the synthetic data generated by EHR-M-GAN improves as the DP budget relaxes (ε increases). We observe that the AUROC of DP-bounded EHR-M-GAN can be maintained at an acceptable level even under a strict privacy setting. For example, the AUROC for predicting the treatment of Vasopressor is maintained at 0.714 (AUROC = 0.725 under TRTR) even when ε decreases to 4, which is an empirically reasonable value for implementing DP in practice Differential Privacy Team (2017). Future work on privacy-preserving GANs under DP guarantees is expected, where the fidelity of the synthetic data can be restored without compromising its privacy.

DISCUSSION AND CONCLUSIONS
In this study, we propose a generative adversarial network entitled EHR-M-GAN, aimed at addressing the challenge of synthesizing longitudinal EHR with mixed data types. A comprehensive list of evaluation metrics is introduced for the systematic assessment of the synthesis model in terms of fidelity, correlation, utility, and privacy. First, both EHR-M-GAN and its conditional version, EHR-M-GANcond, demonstrate consistent improvements over the state-of-the-art benchmark GANs in synthesizing timeseries data with high fidelity. This indicates that the distributional characteristics of the EHR timeseries are well preserved in the synthetic data provided by EHR-M-GAN, which supports its usability for clinical data sharing. Second, as opposed to previous models, which were confined to synthesizing only one specific type of data, EHR-M-GAN can produce mixed-type timeseries and successfully capture the temporal dynamics and correlations between features. By accurately reconstructing the interdependencies and complex clinical relationships between features, downstream studies such as association analysis and outcome prediction can be supported. Notably, the proposed models also outperform the GAN variants that allow mixed-type inputs in the ablation study, indicating that the components of EHR-M-GAN are effective in synthesizing mixed-type timeseries with high fidelity while successfully reconstructing the interdependencies between them. Third, in the downstream task evaluation, with the prediction of medical interventions in fast-paced critical care environments as an exemplar, the results demonstrate the broad applicability of our model in developing ML-based decision support tools through data augmentation. Lastly, the assessment of privacy risks further demonstrates that the synthetic data provided by EHR-M-GAN can protect the sensitive information in patient records while maintaining an acceptable level of data utility.
The results of our study have several notable implications for the synthesis of EHR data. First, as the proposed model can provide synthetic longitudinal EHRs for various data types while preserving their underlying correlations, it is now feasible to use the synthesized data to improve the performance of ML models for downstream applications, such as predicting the next intervention, understanding disease dynamics, and patient phenotyping, based on both the continuous and discrete components of EHR timeseries Alaa and van der Schaar (2019); Lee and Van Der Schaar (2020). Second, the experimental results indicate that the quality of the synthetic EHR data can be improved by integrating mixed-type information, in contrast to the benchmarks that use single-type data for learning. This also enables us to mimic how information is presented in clinical practice. Furthermore, we can generate condition- or outcome-specific patient trajectories along with corresponding interventions to facilitate clinical prediction and decision-making. Third, despite the privacy-utility tradeoff, the synthetic EHR data provided by the proposed model leads to negligible privacy risk under membership inference attacks. This paves the way for a series of applications in clinical research, including but not limited to enabling the development of ML models through access to synthetic data, overcoming the paucity of medical data, and improving the robustness of ML algorithms through data augmentation.
Due to the heterogeneous nature of EHR data, there is a need for synthesizing mixed-type EHR timeseries in various clinical scenarios beyond the ICU setting of our empirical evaluation. For example, patients' encounters in hospitals are documented as structured EHRs recorded in temporal order. Each visit is typically associated with the corresponding medical events, presented in the form of discrete-valued ICD codes Zhang et al. (2021) and continuous-valued measurements. These mixed-type EHR timeseries capture a patient's health status and align better with the clinical decision-making process than single-type data alone. Therefore, developing GANs that target mixed-type EHR generation has the potential to pave the way for complex deep-learning systems that are capable of integrating information from various sources. However, it is worth noting that the validation of our proposed model is based on critical care settings with limited feature dimensions, and can only serve as a proof of concept. When extending the proposed model to other clinical settings, such as synthesizing ICD codes with hundreds or thousands of feature dimensions Zhang et al. (2021), the scalability and utility of the model when dealing with the enlarged, sparse feature space need further investigation.
There are limitations in the current work. First, data curation strategies for clinical timeseries, including truncation, smoothing and imputation, are applied before the EHR timeseries are used for training the generative models. During data preprocessing, we first extract the timeseries with a fixed duration (i.e., 24 hours before the ICU clinical endpoints), then aggregate patients' physiological and intervention signals hourly based on their mean statistics, and finally complete the missing values in the timeseries through the "Simple Imputation" approach Che et al. (2018). Although these preprocessing steps are commonly used in clinical research in critical care settings Wang et al. (2020), the proposed model can model neither the irregular time intervals between signals nor the missing values within the timeseries. However, dealing with the irregularity of timestamps when synthesizing clinical events in EHRs could be useful for predicting time-aware outcomes in downstream tasks Zhang et al. (2021). Modeling such time intervals could be non-trivial, as the determining factors sometimes go beyond the scope of inferring patients' physiological status, such as resource allocation within hospitals. In addition, synthesizing timeseries while incorporating missing values could be beneficial in real-world application scenarios. As ML models are sometimes sensitive to data missingness, imputing the incomplete data in EHRs using generative approaches could improve the performance of ML models, and has become an area of active research Yoon, Jordon and Schaar (2018). Furthermore, as evaluations are performed on clinical timeseries with a fixed length, no comparison is made of the model's scalability across timeseries of varying lengths. Recent studies have found that the quality of synthetic longitudinal data degenerates over time, which is also called the "drift problem" Zhang et al. (2022). Such problems when dealing with long sequences should be recognised and mitigated with techniques such as conditional fuzzing and regularization methods Zhang et al. (2022), in both the generation and evaluation steps.
The evaluation of GANs is still a challenging task. Recent findings suggest that systematic assessment of EHR synthesizers is critical before their application in different use cases Yan et al. (2022). In this study, a comprehensive list of evaluations is provided with regard to the fidelity, correlation, utility and privacy of the synthesis models. It is also worth noting that evaluation metrics should be properly chosen and implemented based on the purpose of the task; otherwise, they may lead to biased results. For example, recent findings Zhang et al. (2022) have reported that the traditional implementation of the discriminative score, which trains the critic from randomly initialised parameters, may lead to unreliable results despite being widely used Yoon et al. (2019). This evaluation metric has been improved for a more robust assessment by using the parameters of the trained generative model to initialise the critic.
Finally, the conditional aspect of our model is currently limited, as it cannot generate patient-specific EHRs conditioned on information at a more granular level. Even though the proposed conditional GAN can synthesize a subgroup of patients with target outcomes or statuses that clinicians specify, it is still limited in incorporating personalised information during conditional generation. Future work on GANs for healthcare data can be extended to patient-level EHR generation, such as synthesizing counterfactual information for a target patient for treatment effect estimation Yoon, Jordon and Van Der Schaar (2018); Qian et al. (2021). Ultimately, by constructing the "synthetic twin" of patients using GANs, the synthesis tool can become more generalisable for precision medicine and support clinical decision-making in delivering personalized healthcare.
Synthetic data provides an alternative to sharing real patient data while preserving patient privacy. The results of our study demonstrate that the proposed EHR-M-GAN and EHR-M-GANcond can generate realistic longitudinal EHR timeseries with mixed data types. By providing synthetic EHR data with higher fidelity and more variety, the proposed model can enable faster development of AI-driven clinical tools with increased robustness and adaptability. In addition to the improved performance over the existing state-of-the-art benchmark models, augmentation with synthetic data during training boosts the predictive performance in downstream clinical tasks. EHR-M-GAN can help eliminate the barriers to data acquisition for healthcare studies, thereby addressing the challenges posed by the paucity of medical data available and approved for research use. Despite the novelty of this study in filling the research gap for synthesizing longitudinal EHRs in mixed-type settings, we acknowledge that there is still a gap between real EHR data and its synthetic counterparts produced by current generative methods. Therefore, developing advanced EHR synthesizers, especially in mixed-type settings, still requires active research in future studies.
by generating samples $\hat{x}_{1:T}$ that are hard for the discriminator to distinguish from real ones. Meanwhile, the discriminator $D$ is optimized to distinguish real samples $x_{1:T}$ from synthetic samples $\hat{x}_{1:T}$. Overall, the training of a GAN is a minimax game with the following objective function:
$$\min_G \max_D V_{GAN} = \mathbb{E}_{x \sim p_x}[\log D(x)] + \mathbb{E}_{\upsilon \sim p_\upsilon}[\log(1 - D(G(\upsilon)))] \quad (S1)$$
The conditional GAN is an extension of the GAN in which both the generator $G$ and the discriminator $D$ receive conditional information $y \in L = \{1, 2, ..., |L|\}$ from $|L|$ classes [2]. In other words, the inputs are augmented by being concatenated with $y$ at each timestamp, i.e., $x_{1:T} \to [y; x_{1:T}]$.

As shown in Fig. S2, the shared latent space is learnt by a dual-VAE network, which contains a pair of encoders (parameterized by $\phi_{Enc_C}$ and $\phi_{Enc_D}$) and a pair of decoders (parameterized by $\psi_{Dec_C}$ and $\psi_{Dec_D}$), one for each type of timeseries. We found the VAE preferable to a vanilla autoencoder in our case, considering that (1) the KL regularization in the VAE strengthens the learning of the compressed latent representations, which further narrows the domain gap for mixed-type features [10]; and (2) the VAE can be easily extended to the conditional learning scenario in EHR-M-GANcond. The encoders map the observations into the latent space, with $\mathrm{Enc}(x)$ given by $q_\phi(z|x)$, while the decoders map the representations into the reconstructed input, with $\mathrm{Dec}(z)$ given by $p_\psi(x|z)$. During implementation, we found that, in addition to pretraining the dual-VAE, integrating the optimization of the decoders into the joint training stage also helps the generative model learn improved representations in the shared latent space. In the dual-VAE, we enforce a weight-sharing constraint [11] across certain layers within both the encoder pair and the decoder pair to further reduce the gap between domains (see Fig. S2). To be specific, only the weights of the last few layers of the encoders and the first few layers of the decoders are shared [12]. This forces the encoders to derive the same high-level representations while maintaining different low-level realizations. Meanwhile, it forces the decoders to share the same high-level semantics and decode them into different low-level feature space observations.

D. Comparison between LSTM and Bilateral-LSTM
To better compare with BLSTMs, we first elaborate on the architecture of the LSTM network. The LSTM uses three gates to control the cell state in order to mitigate the vanishing and exploding gradient problems that appear in recurrent neural networks (RNNs): an input gate $i_t$ that controls the amount of input information passed into the memory cell, a forget gate $f_t$ that controls the amount of past information to be neglected, and an output gate $o_t$ that controls how much of the memory cell is exposed as output. The outputs of $i_t$, $f_t$ and $o_t$ are limited to $[0, 1]$ by the sigmoid activation function. In the transition functions applied at each time step $t$, $c_t$ denotes the context vector, $\sigma$ denotes the sigmoid activation function, and $\odot$ denotes element-wise multiplication. Based on the basic structure of the LSTM, the Bilateral Long Short-Term Memory (BLSTM) network is proposed (see Fig. S3). The equations for the BLSTM units can be found in the Methodology section of the main article.
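For reference, the gate mechanism of a standard LSTM cell can be sketched in a few lines of pure Python. This is the textbook formulation rather than code from this study; the parameter layout (a dictionary mapping each gate name to a hypothetical triple `(W, U, b)`) and the candidate-update name `g` are assumptions made for illustration, and the BLSTM extension is not shown.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One standard LSTM transition; params maps 'i', 'f', 'o', 'g' to (W, U, b)."""
    def pre(name):
        W, U, b = params[name]
        return affine(W, x, U, h_prev, b)

    i = [sigmoid(v) for v in pre("i")]    # input gate
    f = [sigmoid(v) for v in pre("f")]    # forget gate
    o = [sigmoid(v) for v in pre("o")]    # output gate
    g = [math.tanh(v) for v in pre("g")]  # candidate cell update
    # Element-wise products: keep part of the old cell state, add gated new input.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical usage: input size 1, hidden size 2, all-zero weights.
zero_params = {name: ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
               for name in "ifog"}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [2.0, -4.0], zero_params)
```

With all-zero weights every gate evaluates to 0.5 and the candidate update to 0, so the cell state is simply halved, which makes the role of the forget gate easy to see.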

[Algorithm listing; only the step comments are recoverable] Distinguish real and fake samples using the discriminators and estimate the loss. Update the network weights via the Adam optimizer. Synthesize M pairs of coupled mixed-type features for M patients.

DATASETS.
We construct the data preprocessing pipeline based on the work of MIMIC-Extract [13]. Three large-scale, publicly available datasets (MIMIC-III, eICU, and HiRID) are processed with the same standard pipeline. The complete steps for data preprocessing include:
• Cohort selection: Patients in the three ICU databases are selected based on the same predefined criteria (see Section 2.A for details).
• Timeseries features extraction: The timeseries features are then extracted based on the lists provided in Section 2.C. Both continuous-valued and discrete-valued features are selected accordingly.
• Unit conversion and outlier filtering: Because clinical data is often measured in different units, unit conversion is applied (such as converting Fahrenheit to Celsius for Temperature). For outlier filtering, a physiologically valid range is applied to each measurement (see [13] for details).
• Semantic grouping: Next, semantically similar variables are grouped based on clinical concepts (for example, Heart Rate is recorded as ItemID 211 in CareVUE EHR systems and as ItemID 220045 in MetaVision EHR systems). A clinical taxonomy is used to aggregate features that are semantically equivalent [13].
• Hourly aggregation: Timestamps are provided at different granularities in the three databases. Time-varying physiological signals such as Heart Rate are measured frequently (e.g., most parameters under bedside monitoring are recorded every 2 minutes in the HiRID dataset [14]), while other features such as laboratory test results are measured infrequently. Therefore, we aggregate the timeseries into uniform hourly buckets.
• Imputation and normalization: Finally, the imputation method in Section B is applied, followed by normalization, to obtain the final data matrix.
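The unit conversion, outlier filtering and hourly aggregation steps listed above can be sketched as follows. This is a simplified illustration rather than the MIMIC-Extract code: the function names are hypothetical, out-of-range values are simply dropped here, and the temperature range used in the example is an assumed placeholder, not a value taken from [13].

```python
import math

def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32.0) * 5.0 / 9.0

def filter_outliers(records, low, high):
    """Keep (hours_since_start, value) pairs whose value lies within [low, high]."""
    return [(t, v) for t, v in records if low <= v <= high]

def aggregate_hourly(records, n_hours):
    """Average the values falling into each hourly bucket; None marks an empty hour."""
    buckets = [[] for _ in range(n_hours)]
    for t, v in records:
        hour = math.floor(t)
        if 0 <= hour < n_hours:
            buckets[hour].append(v)
    return [sum(b) / len(b) if b else None for b in buckets]

# Hypothetical temperature readings: (hours since start, degrees Fahrenheit).
raw = [(0.2, 98.6), (0.7, 100.4), (1.5, 212.0), (3.1, 97.7)]
celsius = [(t, fahrenheit_to_celsius(v)) for t, v in raw]
valid = filter_outliers(celsius, low=25.0, high=45.0)  # assumed illustrative range
hourly = aggregate_hourly(valid, n_hours=4)
```

Hours without any valid measurement remain empty (`None`) at this stage and are filled in by the subsequent imputation step.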

A. Cohort selection criteria.
In line with the previous literature [13,15], the cohort is selected based on the following criteria: (1) Only the first known ICU admission of each patient is selected. This is because patients who have multiple ICU admission records typically require specific treatments for life-support intervention; (2) The patient has to be an adult at the time of ICU admission (at least 15 years old); (3) The duration of the patient's ICU stay is at least 12 hours and less than 10 days. This is because, for patients with longer ICU stays, physiological changes usually cannot be directly linked to the positive effect of the treatment (but rather compensate for the life support treatment being taken off) [15].
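The three criteria above can be expressed as a simple filter. The sketch below is illustrative only: the record fields (`patient_id`, `admit_time`, `age`, `los_hours`) are hypothetical names, and it assumes one particular reading in which the first admission is picked before the age and length-of-stay checks are applied.

```python
def select_cohort(admissions, min_age=15, min_hours=12, max_hours=240):
    """Return the first ICU admission per patient that satisfies the age
    and length-of-stay criteria (at least 12 hours, less than 10 days)."""
    first_admission = {}
    for adm in sorted(admissions, key=lambda a: a["admit_time"]):
        first_admission.setdefault(adm["patient_id"], adm)
    return [
        adm for adm in first_admission.values()
        if adm["age"] >= min_age and min_hours <= adm["los_hours"] < max_hours
    ]

# Hypothetical admission records.
admissions = [
    {"patient_id": 1, "admit_time": 10, "age": 40, "los_hours": 30},
    {"patient_id": 1, "admit_time": 5, "age": 40, "los_hours": 50},
    {"patient_id": 2, "admit_time": 1, "age": 14, "los_hours": 48},
    {"patient_id": 3, "admit_time": 2, "age": 70, "los_hours": 6},
    {"patient_id": 4, "admit_time": 3, "age": 55, "los_hours": 240},
    {"patient_id": 5, "admit_time": 4, "age": 15, "los_hours": 12},
]
cohort = select_cohort(admissions)
```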

B. Imputation method.
For continuous-valued timeseries, missing data is imputed with the Simple Imputation method [16]. A missing value is imputed as the last observed value, or as the individual-specific mean if no previous observation is available. If there is no observation at all for the subject, the imputation value is set to the global mean of the entire cohort. Compared to imputation methods built upon customized RNN models or explicitly designed for the applied domains, it does not rely on additional information such as prediction labels and is therefore more generalizable. Though simple, this method has been widely applied in clinical timeseries analysis [17], including on the MIMIC-III dataset [13,16,18]. For discrete-valued timeseries, we followed the preprocessing rules in MIMIC-Extract. For intermittent interventions such as oral antibiotics, the status is regarded as 'not applied' when missing. For interventions with a multi-hour continuous duration, such as mechanical ventilation, a missing status is considered to be consistent with the previous status until a new administration occurs. Therefore, the imputation method was not applied to the discrete-valued data.
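The three-level fallback of the imputation rule for continuous-valued features (last observed value, then individual mean, then cohort mean) can be sketched for a single feature as follows. The function name and data layout are assumptions made for illustration, and the global mean is computed here over all observed values in the cohort, which is one possible reading of the description.

```python
def simple_impute(series_by_patient):
    """Impute one feature. series_by_patient maps a patient id to a list of
    hourly values, with None marking a missing measurement."""
    observed = [v for s in series_by_patient.values() for v in s if v is not None]
    global_mean = sum(observed) / len(observed) if observed else None
    imputed = {}
    for pid, series in series_by_patient.items():
        own = [v for v in series if v is not None]
        # Fallback when nothing has been observed yet: individual mean, else global mean.
        fallback = sum(own) / len(own) if own else global_mean
        filled, last = [], None
        for v in series:
            if v is not None:
                last = v
            filled.append(last if last is not None else fallback)
        imputed[pid] = filled
    return imputed

# Hypothetical cohort of three patients.
result = simple_impute({
    "a": [None, 2.0, None, 4.0],
    "b": [None, None],
    "c": [10.0, None],
})
```

Patient "a" starts with the individual mean and is then forward-filled, patient "b" has no observations and receives the cohort mean, and patient "c" is forward-filled only.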

C. Timeseries features extraction.
Features of continuous-valued and discrete-valued timeseries are extracted for three critical care databases based on the following lists (for MIMIC-III dataset, see Table S1, S2; for eICU dataset see Table S3, S4; for HiRID dataset, see Table S5, S6).

Category of treatment | Features | Source
Oxygen therapy | supplemental oxygen | respiratorycharting, nursecharting, treatment

Category of treatment | Features | Source
Oxygen therapy | supplemental oxygen | vm23
 | mechanical ventilation | vm60
Colloids | colloids | vm34
Renal therapy | haemofiltration | vm72
Blood transfusion | packed red blood cells | pm35

During the pretraining stage of the dual-VAE module, we implemented the VAEs with recurrent neural networks based on Google DeepMind's "DRAW" (Deep Recurrent Attentive Writer) [20]. Instead of generating the entire image or timeseries at once, it uses a sequential variational auto-encoding framework that enables the iterative generation of multivariate timeseries. The reconstruction loss on the held-out validation set (i.e., the "one-to-one" mapping) is used for optimizing the hyperparameters of the dual-VAE (see Table A).
Furthermore, to stabilize GAN training and overcome the problem of mode collapse, training strategies such as the feature matching loss are used [21]. Feature matching is a regularizing objective that prevents the generator from overtraining on the current discriminator. It has been shown to be effective in stabilizing GAN training, as it matches the statistics of the discriminator's intermediate features on real and generated data per minibatch, instead of directly maximizing the output of the discriminator. Formally, the feature matching loss is defined on $f(x)$, the feature representation of an intermediate layer of the discriminator.
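A minimal sketch of the feature matching objective, following the general formulation in [21], is given below: the loss is the squared L2 distance between the minibatch means of the discriminator's intermediate features on real and generated samples. The function names are hypothetical, and the features are assumed to be already extracted as plain lists; the exact form used in this study may differ.

```python
def mean_features(features):
    """Average a minibatch of feature vectors (list of equal-length lists)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the mean intermediate discriminator
    features of the real and generated minibatches."""
    mu_real = mean_features(real_features)
    mu_fake = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

# Hypothetical intermediate features for two real and two generated samples.
loss = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
```

Because the generator is asked to match feature statistics rather than to fool the discriminator's final output, its target changes more smoothly as the discriminator is updated.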

B. Ablation study for training dual-VAE
Multiple losses are used when optimizing the shared latent space in the dual-VAE module. In addition to the standard evidence lower bound (ELBO) loss of the VAE, extra losses are used, namely (1) the matching loss, (2) the contrastive loss, and (3) the semantic loss (for the conditional variant of our proposed model). Also, during implementation, the weight-sharing constraint is adopted for specific layers in the dual-VAE's encoder and decoder pairs to extract the high-level representations from mixed-type inputs (see Section S.1.C, Shared latent space learning using dual-VAE, for details). To analyze the contribution of each of these components when training the dual-VAE, we perform an ablation study by varying the corresponding training configurations (see Table S8), using the MIMIC-III dataset as an example. The performance for synthesizing continuous-valued timeseries is evaluated by maximum mean discrepancy (MMD) and discriminative score. For discrete-valued timeseries, the performance is evaluated by dimension-wise probability (DWP), quantified by the root mean squared error (RMSE) averaged across all feature dimensions (see the Dimension-wise probability section in the main text for details), and by discriminative score. The results of the ablation study are shown in Table S8.
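To make the DWP metric concrete, the sketch below computes the Bernoulli success probability of every (time step, feature) dimension of binary timeseries and the RMSE between the real and synthetic probabilities. It is an illustration under assumptions: the function names and the nested-list data layout are hypothetical, and a single RMSE over all dimensions is used as a simple stand-in for the averaged RMSE reported in the study.

```python
import math

def dimension_wise_probability(samples):
    """samples: list of patients, each a T x D nested list of 0/1 values.
    Returns the fraction of patients with an active value per (t, d) dimension."""
    n = len(samples)
    n_steps, n_features = len(samples[0]), len(samples[0][0])
    return [
        sum(patient[t][d] for patient in samples) / n
        for t in range(n_steps)
        for d in range(n_features)
    ]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical data: 2 patients, 2 time steps, 2 binary interventions.
real = [[[1, 0], [1, 1]], [[0, 0], [1, 0]]]
synthetic = [[[1, 0], [1, 1]], [[1, 0], [1, 1]]]
p_real = dimension_wise_probability(real)
p_synth = dimension_wise_probability(synthetic)
dwp_rmse = rmse(p_real, p_synth)
```

A value of zero means that the synthetic data reproduces the marginal activation probability of every dimension exactly.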
Table S8. The ablation study for components of the dual-VAE on the MIMIC-III dataset. 'Baseline' represents the proposed GAN models (EHR-M-GAN or EHR-M-GANcond) with all components included. The quality of synthetic continuous-valued timeseries is evaluated by MMD and discriminative score (for both, lower is better). The quality of synthetic discrete-valued timeseries is evaluated by averaged RMSEs in DWP and discriminative score (for both, lower is better).

As shown in Table S8, both the matching loss and the contrastive loss contribute to the improvement of EHR-M-GAN's performance when generating mixed-type timeseries data. For example, the absence of the contrastive loss leads to a noticeable degradation in the quality of the synthetic discrete-valued timeseries (evaluated by discriminative score). Also, removing the matching loss increases the MMD between real and synthetic continuous-valued timeseries. The weight-sharing scheme between the encoder and decoder architectures in the dual-VAE also boosts the GAN's performance, but only within a limited range. For the EHR-M-GANcond model, the effectiveness of the components that appear in EHR-M-GAN can still be observed. In addition, the semantic loss, which injects conditional information into the networks, plays a major role in synthesizing more realistic patient trajectories. The results in Table S8 show that the impact of the semantic loss exceeds that of the other two losses in learning valid shared latent representations in the dual-VAE.

C. Embedding visualisation.
We apply t-SNE to qualitatively visualise the latent representations generated by EHR-M-GAN and EHR-M-GANcond on the three critical care databases. The latent embedding vectors are produced by the encoders in the dual-VAE while learning the shared latent space representations (see Methods section, p12, for details). The t-SNE embedding results on the raw timeseries are also included for comparison.
It can be seen that the embeddings obtained from EHR-M-GANcond show better separability of the representation clusters in the shared latent space than those of EHR-M-GAN and the raw data. This illustrates the advantage of EHR-M-GANcond in learning the contextual information from patient trajectories. It can therefore be inferred that the conditional extension of the proposed model yields further benefits by synthesizing condition-specific EHR timeseries with respect to distinct patient health statuses.

pages 1462-1471. PMLR, 2015.
21. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

Figure 1: Overall schematics. a. Data extraction. Electronic health record (EHR) data of mixed types are routinely collected for patients in intensive care units (ICUs). b. Network architecture. EHR-M-GAN contains two key components: the Dual-VAE and the Coupled Recurrent Network (CRN). Step 1: The Dual-VAE is first pretrained to map heterogeneous data $(x^c_t, x^d_t)$ into shared latent representations $(z^c_t, z^d_t)$. Multiple objective loss constraints are used to bridge the domain/distribution gap. The training process for Step 1 is indicated by the Dual-VAE pretrain path (dashed purple line). Step 2: Then, a CRN is established as the generator based on the parallel bilateral LSTM block, which takes the random noise vectors $(\upsilon^c_t, \upsilon^d_t)$ as inputs (see the Coupled generation path). Step 3: The synthetic latent representations $(\hat{z}^c_t, \hat{z}^d_t)$ provided by the CRN are then decoded into synthetic samples $(\hat{x}^c_t, \hat{x}^d_t)$ using the pretrained decoders in the Dual-VAE, as indicated by the Decoding path (solid red line). Step 4: Finally, the adversarial loss is derived from the discriminators and backpropagated to update the network, as indicated by the Adversarial training path (dotted black line). c. Evaluation pipeline. The pipeline includes metrics for evaluating the synthetic data. d. Prediction example. Data within the 24 hours prior to the patient's endpoint in the ICU (discharge or mortality) is extracted. Both the observation window and the prediction window are fixed at 12 hours. The classification task is to use patients' continuous-valued physiological measurements within the observation window as input to predict the forthcoming discrete-valued medical intervention status in the prediction window. The four outcomes of the intervention status are categorized as follows: Stay On: The intervention begins with on and stays on within the prediction window; Onset: The intervention begins with off and is turned on within the prediction window; Switch off: The intervention begins with on and is stopped within the prediction window; Stay Off: The intervention begins with off and stays off within the prediction window.
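The four-way labelling of the prediction example in panel d can be sketched as a small function. This is one possible reading of the caption, not the study's code: it assumes that "begins with" refers to the status in the first hour of the prediction window, and the function name and hourly 0/1 encoding are hypothetical.

```python
def intervention_outcome(status, observation_hours=12, prediction_hours=12):
    """Label an hourly 0/1 intervention series by its behaviour in the
    prediction window that follows the observation window."""
    window = status[observation_hours:observation_hours + prediction_hours]
    if not window:
        raise ValueError("status is too short to contain a prediction window")
    if window[0]:
        return "Stay On" if all(window) else "Switch off"
    return "Onset" if any(window) else "Stay Off"

# Hypothetical 24-hour series: intervention off, then started 6 hours
# into the prediction window.
label = intervention_outcome([0] * 12 + [0] * 6 + [1] * 6)
```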

Fig. S2 (see Section S.1.C in Supplementary materials) diagrams the details of the proposed dual-VAE framework for learning the shared latent representations. We start with training two encoders, i.e., Enc C : φ T ×X C → φ T ×H S and Enc D : φ T ×X D → φ T ×H S , with the embedding functions:

recently proposed work - SynTEG Zhang et al. (2021) and DualAAE Lee et al. (2020) are used for comparison. Apart from these GAN-based models, we also incorporate PrivBayes Zhang et al. (2017) to synthesize discrete-valued timeseries, which falls in the class of non-GAN generative approaches using a Bayesian framework Tucker et al. (2020). As the original paper of PrivBayes focuses on data anonymization using differential privacy, we implemented its 'Non-Private' version for a fair comparison with other baselines (see Section 4.1 Non-Private Methods in Zhang et al. (2017)). For medGAN and PrivBayes, we feed the flattened temporal sequence as the input since the models cannot produce timeseries data.

Figure 2: The network architectures in the ablation study. Three variants of EHR-M-GAN are implemented in the ablation study. Compared with the full EHR-M-GAN model, GAN Unified learns the joint representations of heterogeneous data types in a unified network; GAN VAE maintains the basic architecture of EHR-M-GAN but ignores the dependency learning (i.e., separate networks for the two input streams are trained in parallel); GAN SL constructs the shared latent space using the dual-VAE module but omits the sequentially coupled generator for learning the temporal correlations in the mixed-type timeseries.

Figure 3: Scatterplot of the dimension-wise probability test on the MIMIC-III dataset. Dimension-wise probability calculates the Bernoulli success probability of each dimension, i.e., the probability of the treatment being active at a particular time. The x-axis and y-axis represent the dimension-wise probability for the real data and for the synthetic data generated from different models, respectively. The same color indicates the same treatment (but with varying timestamps). The optimal performance appears along the diagonal line. The corresponding CCs ([0, 1], the higher the better) and RMSEs ([0, +∞), the lower the better) are also calculated to quantify the probability distribution similarities between the real and synthetic EHR timeseries. Dimension-wise probability plots for the eICU and HiRID datasets can be found in Supplementary materials (see S.4.A).

Figure 4: Pearson pairwise correlation (PPC) between continuous-valued and discrete-valued timeseries. The plots contrast the PPC calculated within the real data (left column) and the synthetic data generated by EHR-M-GAN (right column). Besides visual inspection, the similarity between the two heatmaps is quantified by CorAcc and µ abs. These metrics indicate how well the synthetic data reconstructs the correlations observed in the real patient trajectories. In this figure, SpO2, SBP, RR, HR and Temp represent Oxygen Saturation, Systolic Blood Pressure, Respiratory Rate, Heart Rate and Temperature, respectively, and Vent. and Vaso. correspond to Mechanical Ventilation and Vasopressor, respectively. PPC is calculated every 3 hours over the total 24 hours of ICU stay (ticks of the timestamps are omitted).

Figure 6: Downstream intervention prediction experimental setup. a. Data splitting. During the training stage, the real data is split into two sets, with 70% training data $A$ and 30% test data $A'$. The test data $A'$ is further split into sub-train data $A_{Tr}$ and sub-test data $A_{Te}$ of equal size. Then, the synthetic data $B$, with size equal to the sub-train data $A_{Tr}$, is synthesized by EHR-M-GAN (or EHR-M-GANcond) trained on the real training data $A$. b. Data augmentation scenarios. Subsequent experiments are trained on set $A_{Tr}$, or $B$, or $A_{Tr} \cup B$, and then tested on $A_{Te}$. In the traditional approach, results based on Train on Real, Test on Real (TRTR) and Train on Synthetic, Test on Real (TSTR) are compared to assess the generalisability of the synthetic data. In the data augmentation approach, i.e., Train on Synthetic and Real, Test on Real (TSRTR), we either augment the real data $A_{Tr}$ with a fraction $\alpha$ (augmentation ratio, 0 to 50%) of the synthetic samples $B$, or augment the synthetic samples $B$ with a fraction $\beta$ (0 to 50%) of the real data $A_{Tr}$.
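The splitting and augmentation scheme in this caption can be sketched as follows. The function is a hypothetical illustration: in the study, the synthetic set is produced by a GAN trained on the 70% training split, whereas here it is simply passed in as a list, and random shuffling with a fixed seed is an assumption.

```python
import random

def split_for_augmentation(real, synthetic, alpha, seed=0):
    """Split real data 70/30, halve the 30% part into sub-train and sub-test,
    and build a TSRTR training set from sub-train plus a fraction alpha of
    the synthetic samples."""
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)
    n_train = round(0.7 * len(shuffled))
    gan_train, held_out = shuffled[:n_train], shuffled[n_train:]
    half = len(held_out) // 2
    sub_train, sub_test = held_out[:half], held_out[half:]
    n_aug = round(alpha * len(synthetic))
    augmented = sub_train + rng.sample(list(synthetic), n_aug)
    return gan_train, sub_train, sub_test, augmented

# Hypothetical data: 100 real records and 15 synthetic records.
real_records = list(range(100))
synthetic_records = [f"s{i}" for i in range(15)]
gan_train, sub_train, sub_test, augmented = split_for_augmentation(
    real_records, synthetic_records, alpha=0.2)
```

Keeping the sub-test set disjoint from both the GAN training data and the classifier training data ensures that TRTR, TSTR and TSRTR results are compared on the same unseen real records.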

Figure 7: Privacy risk evaluation of EHR-M-GAN on the MIMIC-III dataset. a. Membership inference attack. Success of the membership inference attack against EHR-M-GAN vs. the percentage of the training data. Accuracy and recall are used to evaluate the success rate of such attacks. Lower accuracy or recall indicates that less private information is disclosed to the attacker by the generative model (0.5 can be seen as the random-guess baseline, at which strong privacy guarantees are provided by the GAN). Recall indicates the ratio of samples that are successfully claimed by the attacker among all the real data used in training the GAN model. b. Differential privacy. Performance of medical intervention prediction tasks under various differential privacy (DP) budgets, measured by Macro-AUROC.
$[y; x_{1:T}]$. This formulation allows $G$ to generate samples conditioned on the auxiliary information of $|L|$-dimensional categorical labels. In this case, the objective function becomes:
$$\min_G \max_D V_{CGAN} = \mathbb{E}_{y, x \sim p_{y,x}}[\log D(x|y)] + \mathbb{E}_{y \sim p_y, \upsilon \sim p_\upsilon}[\log(1 - D(G(y, \upsilon)|y))] \quad (S2)$$
C. Shared latent space learning using dual-VAE.
The network architecture of dual-VAE during the pretraining stage.
$f(x)$ is the feature representation of the intermediate layer of the discriminator (the layer before the final classification).
[n =m] ∈ {0, 1} is an indicator evaluating to 1 iff n = m.And i dd ∈ {1, 2, ..., 2N } represents the index of latent embeddings from both data types.The final contrastive loss is computed across the total number of |i d − i d | = N positive pairs for both (i d , i d ) and (i d , i d ), and is defined as:

Table 2: Summary of the evaluation protocol in this study. A comprehensive set of evaluation metrics is used to test the Fidelity, Correlation, Utility and Privacy of the synthetic EHR data. Definitions of the evaluation metrics for the corresponding data types are explained. The last column indicates when the corresponding evaluation metric reflects better performance.
TSRTR (downstream task) | Continuous and discrete | A downstream classifier is trained using real and synthetic data as the training set, and (hold-out) real data as the test set. The result is compared with TRTR to see whether the performance can be improved. | Higher AUROCs (with TRTR as baselines)
Privacy
Membership inference attack | Continuous and discrete | A threat model is trained under the black-box setting to determine whether a record was used for training the GAN. This quantifies the risk of sensitive information from real data being revealed by synthetic data.

Table 4: Discriminative score of synthetic data. A discriminative model is trained post hoc to discriminate between synthetic and real samples. The accuracy of the discriminative classifier is used as the discriminative score, where a lower value indicates better performance. The score is bounded below by 0.5, which is reached when the classifier cannot distinguish between the two distributions.
GAN cond 0.784 ± 0.024 0.803 ± 0.022 0.778 ± 0.019

Table 5: Downstream task evaluation. Downstream tasks are evaluated under the training scenarios of Train on Real, Test on Real (TRTR) and Train on Synthetic, Test on Real (TSTR). Predictions of two outcomes of interest, intervention by Mechanical ventilation (Vent.) and by Vasopressors (Vaso.), are selected as exemplary tasks. Macro-AUROC is used to score the performance of the LSTM-based classifiers on the multi-class prediction tasks (labeled as Stay on, Onset, Switch off, Stay off).

Table 6: Downstream task evaluation with data augmentation. Downstream tasks are evaluated under the training scenario of Train on Synthetic and Real, Test on Real (TSRTR). The predictive tasks and evaluation metrics are the same as in Table 5. The upward arrow (↑) indicates that the AUROC value under TSRTR is higher than under TRTR in Table 5 for the corresponding task, while the bold arrow (↑ ↑) indicates that the value is significantly improved according to a t-test (p ≤ 0.05). All data from the sub-train set A_Tr, concatenated with α of the synthetic data B (augmentation ratio α = 10%, 25% or 50%), is used as the training set.
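The construction of the augmented training set can be sketched as follows. This snippet assumes that α denotes a fraction of the synthetic set B and that the synthetic records are drawn uniformly at random without replacement. Neither detail is stated in the caption, and the function name `build_augmented_train_set` is hypothetical.

```python
import random


def build_augmented_train_set(real_train, synthetic, alpha, seed=0):
    """Return all of `real_train` plus a random alpha-fraction of `synthetic`.

    Assumption: alpha is a fraction of the synthetic set, sampled
    uniformly without replacement.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    n_extra = int(round(alpha * len(synthetic)))
    extra = rng.sample(list(synthetic), n_extra)
    return list(real_train) + extra
```

For example, calling it with `alpha=0.25` on 40 synthetic records appends 10 of them to the real sub-train set. The test set stays purely real, as in the TSRTR scenario.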

• Semantic grouping: Next, semantically similar variables are grouped based on clinical concepts (for example, Heart Rate is recorded as ItemID 211 in CareVUE EHR systems and as ItemID 220045 in MetaVision EHR systems). A clinical taxonomy is used to aggregate features that are semantically equivalent [13].
• Hourly aggregation: Timestamps are provided with different granularity in the three databases. Time-varying physiological signals such as Heart Rate are frequently measured (e.g., most parameters under bedside monitoring are recorded every 2 minutes in HiRID dataset

Table S1. List of vital sign and laboratory test features for the MIMIC-III dataset. Features are further extracted based on the preprocessed results of MIMIC-Extract (see Appendix A. Feature set in [13]). The dimension of continuous-valued features for the MIMIC-III dataset during model training is 78.

Table S2. List of medical intervention features for the MIMIC-III dataset, where Features indicates the name of the intervention features during model training, Category of treatment shows the category of treatment that the specific intervention feature belongs to, and Source is the corresponding chart(s) from which the variable is extracted based on 1. The dimension of discrete-valued features for the MIMIC-III dataset during model training is 20.

Table S3. List of vital sign and laboratory test features for the eICU dataset. Features are selected based on the recommendation from Rocheteau et al. [19]. The dimension of continuous-valued features for the eICU dataset during model training is 55.

Table S4. List of medical intervention features for the eICU dataset, where Features indicates the name of the intervention features during model training, Category of treatment shows the category of treatment that the specific intervention feature belongs to, and Source is the corresponding chart(s) from which the variable is extracted based on 2. The dimension of discrete-valued features for the eICU dataset during model training is 19.

Table S5. List of vital sign and laboratory test features for the HiRID dataset. Features are extracted based on the official HiRID preprocessing codes (meta-variables from Merging stage 3) [14]. The dimension of continuous-valued features for the HiRID dataset during model training is 50.

Table S6. List of medical intervention features for the HiRID dataset, where Features indicates the name of the intervention features during model training, Category of treatment shows the category of treatment that the specific intervention feature belongs to, and Source is the corresponding feature names in the official HiRID preprocessing codes (meta-variables from Merging stage 4) [14] on which our extraction is based. The dimension of discrete-valued features for the HiRID dataset during model training is 39.