Abstract
Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach for addressing such privacy concerns is to use synthetic data instead. We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of numerical and categorical features with distinct characteristics, and time-varying features with highly varying sequence lengths. Across numerous evaluations, we demonstrate that the fidelity of EHR-Safe is almost identical to that of real data (<3% accuracy difference for models trained on them) while yielding almost-ideal performance on practical privacy metrics.
Introduction
Electronic Health Records (EHR) provide tremendous potential for enhancing patient care, embedding performance measures in clinical practice, and facilitating clinical research. Statistical estimation and machine learning models trained on EHR data can be used to diagnose diseases such as diabetes^{1}, track patient wellness^{2}, and predict how patients respond to specific drugs^{3}. To develop such models, researchers and practitioners need access to data. However, data privacy concerns and patient confidentiality regulations continue to pose a major barrier to data access^{4,5,6}.
Conventional methods to anonymize data can be tedious and costly^{7,8}. They can distort important features of the original dataset, decreasing the utility of the data significantly, and they can be susceptible to privacy attacks even when the de-identification process is in accordance with existing standards^{9}. Synthetic data open new horizons for data sharing^{10}. Synthetic data can be extremely useful when they have two key properties: (1) high fidelity (i.e., the synthesized data are useful for the task of interest, such as giving similar downstream performance when a diagnostic model is trained on them), and (2) high privacy (i.e., the synthesized data do not reveal any real patient's identity).
Generative models have shown notable success in generating synthetic data^{11,12,13,14,15}. They are trained to synthesize data from a given random noise vector or a feature that the model is conditioned on. This comes with the premise, for privacy preservation, that the data samples synthesized from random vectors should be distinct from the real ones. Among generative models, Generative Adversarial Networks (GANs)^{16} have particularly gained traction as they can synthesize highly realistic samples from the actual distribution of real data. The notable success of GANs in synthesizing high-dimensional complex data has been shown for images^{17}, speech^{18}, text^{19}, and time-series^{15}. Recent works have also adapted GANs for privacy-preserving data generation, with methods such as adding noise to model weights^{20} or modified adversarial training^{21}.
When it comes to synthetic EHR data generation, there are multiple fundamental challenges. EHR data contain heterogeneous features with different characteristics and distributions. There can be numerical features (e.g., blood pressure) as well as categorical features with many (e.g., medical codes) or two (e.g., mortality outcome) categories. We note that EHR data with images and free-form text are beyond the scope of this paper. Some of these features might be static (i.e., not varying during the modeling window), while others are time-varying, such as regular or sporadic lab measurements or diagnoses. Feature distributions might come from quite different families: categorical distributions might be highly non-uniform (e.g., if there are minority groups), and numerical distributions might be highly skewed (e.g., a small proportion of values being very large while the vast majority are small). Ideally, a generative model should have sufficient capacity to model all these types of features. Depending on a patient's condition, the number of visits might vary drastically: some patients might visit a clinic only once, whereas others might visit hundreds of times, leading to a variance in sequence lengths that is typically much higher compared to other time-series data. There might also be a high ratio of missing features across different patients and time steps, as not all lab measurements or other input data might have been collected. An effective generative model should be realistic in synthesizing missing patterns.
GANs have been extended to healthcare data, particularly for EHR: prior works^{22,23,24} apply various GAN variants to EHR data. However, these variants have limitations regarding the aforementioned fundamental aspects of real-world EHR data, such as dealing with missing features, varying sequence lengths (rather than fixed lengths), categorical features (beyond numerical), and static features (beyond time series). These fundamental challenges require a holistic redesign of GAN-based synthetic data generation systems. In this paper, our goal is to push the state-of-the-art by designing a framework that can jointly represent these diverse data modalities while preserving the privacy of the source training data.
EHR-Safe, overviewed in Fig. 1, generates synthetic data that maintain the relevant statistical properties of the downstream tasks while preserving the privacy of the original data. Our methodological innovations are key to this: we introduce approaches for encoding/decoding features, normalizing complex distributions, conditioning adversarial training, and representing missing data. We demonstrate our results on two large-scale real-world EHR datasets: MIMIC-III^{25,26,27} and eICU^{28}. We demonstrate superior synthetic data generation on a range of fidelity and privacy metrics, often outperforming previous works by a large margin.
Results
Datasets
We utilize two real-world de-identified EHR datasets to showcase the EHR-Safe framework: (1) MIMIC-III (https://physionet.org/content/mimiciii/1.4/), (2) eICU (https://eicu-crd.mit.edu/gettingstarted/access/). Both are inpatient datasets that consist of varying-length sequences and include multiple static and temporal features with missing components.
MIMIC-III
The total number of patients is 19,946. Among more than 3000 features, we select 90 heterogeneous features that have high correlations with the mortality outcome (details can be found in Supplementary Information). The 90 features consist of (1) 3 static numerical features (e.g., age), (2) 3 static categorical features (e.g., marital status), (3) 75 temporal numerical features (e.g., respiratory rate), (4) 8 temporal categorical features (e.g., heart rhythm), and (5) 1 measurement time. The sequence lengths vary between 1 and 30.
eICU
The total number of patients is 198,707. There are (1) 3 static numerical features (age, gender, mortality), (2) 1 static categorical feature (condition code), (3) 162 temporal numerical features, and (4) 1 measurement time. Among 162 temporal numerical features, we only select 50 features whose average number of observations is higher than 1 per patient. We set the maximum length of sequence as 50. For longer sequences, we only use the last 50 time steps.
For both datasets, we divide the patients into disjoint train and test datasets with an 80%/20% ratio. We only use the training split to train EHR-Safe. At inference, we generate synthetic train and test datasets from random vectors (note that EHR-Safe can generate an arbitrary number of synthetic samples). We apply standard outlier removal to the original datasets by removing samples whose values fall outside the 0.1–99.9 percentile range. More details on datasets, training, and evaluation can be found in Supplementary Information.
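As a concrete illustration, the percentile-based outlier filter described above can be sketched as follows (the thresholds and function name are ours, not from the paper):

```python
import numpy as np

def remove_outliers(values, lo_pct=0.1, hi_pct=99.9):
    """Drop samples whose values fall outside the [lo_pct, hi_pct] percentile range."""
    lo, hi = np.percentile(values, [lo_pct, hi_pct])
    mask = (values >= lo) & (values <= hi)
    return values[mask]
```

In practice, per-feature filtering like this is applied before any model training so that extreme values do not dominate the normalization statistics.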
Fidelity
The fidelity metrics assess the quality of synthetically generated data by measuring the realism of the synthetic data compared with real data (more details are provided in Supplementary Information). Higher fidelity implies that it is more difficult to differentiate between synthetic and real data. For generative modeling, there is no standard way of evaluating the fidelity of the generated synthetic data samples, and different works often base their evaluations on different methods. In this section, we evaluate the fidelity of synthetic data with multiple quantitative and qualitative analyses, including train-on-synthetic/test-on-real evaluation and KS statistics. More results (including t-SNE analyses, comparison of distributions, propensity scores, and feature importance) can be found in Supplementary Information.
Statistical similarity
We provide quantitative comparisons of statistical similarity between original and synthetic data that compare the distributions of the generated synthetic data and original data for each feature (including the missing patterns). For numerical variables, we report the mean, standard deviation, missing rate, and KS statistic. For categorical data, we report the ratio of each category. We only report results for the 15 temporal numerical features with the lowest missing rates and all static numerical features. Table 1 summarizes the results for temporal and static numerical features; most statistics are well-aligned between original and synthetic data (KS statistics are mostly lower than 0.03). Additional results for the top 50 temporal numerical features and the categorical features can be found in Supplementary Information.
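The KS statistic used above is the maximum gap between the empirical CDFs of the real and synthetic samples of a feature. A minimal hand-rolled version (equivalent in spirit to `scipy.stats.ks_2samp`; names are ours) might look like:

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    real, synth = np.sort(real), np.sort(synth)
    # Evaluate both empirical CDFs at every observed sample point.
    grid = np.concatenate([real, synth])
    cdf_r = np.searchsorted(real, grid, side="right") / len(real)
    cdf_s = np.searchsorted(synth, grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_r - cdf_s)))
```

A value of 0 means the empirical distributions coincide; 1 means they have disjoint supports.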
Utility—ML model development on synthetic vs. real data
As one of the most important use cases of synthetic data is enabling machine learning innovations, we focus on the fidelity metric that compares predictive model performance when trained on synthetic vs. real data. Similar model performance would indicate that the synthetic data capture the relevant informative content for the task.
We focus on the mortality prediction task^{29,30}, one of the most important machine learning tasks for EHR. We train four different predictive models (Gradient Boosting Tree Ensemble (GBDT), Random Forest (RF), Logistic Regression (LR), Gated Recurrent Units (GRU)). Table 2 compares the performance of the predictive models. In most scenarios, they are highly similar in terms of AUC. On MIMIC-III, the best model (GBDT) on synthetic data is only 0.026 worse than the best model on real data, whereas on eICU, the best model (RF) on synthetic data is only 0.009 worse than the best model on real data. In Supplementary Information, we also provide the algorithmic fairness analysis across multiple subgroups divided by static categorical features (such as gender and religion).
Additionally, we evaluate the utility of the synthetic data with a random subset of features and multiple target variables. The goal is to evaluate the predictive capability of each dataset regardless of which features and targets are being used. We choose random subsets with 30 features and two target variables (mortality and gender) and test the hypothesis that the performance difference between models trained on original and synthetic data is greater than X. In a practical setting, the choice of X would enable data owners to define a constraint on the acceptable fidelity of synthetic data. We report results with X = 0.04 for illustrative purposes. We obtain the p-value (computed by a one-sample t-test) that allows us to reject this hypothesis. As can be seen in Table 2, for MIMIC-III mortality prediction, we can reject the hypothesis that the AUC difference is greater than 0.04 with a p-value smaller than 0.01 (the average AUC difference is 0.009). For eICU gender prediction, we achieve a 0.019 average AUC difference with a p-value smaller than 0.001.
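The hypothesis test above amounts to a one-sample t-test of the per-run AUC differences against the threshold X. A sketch of the test statistic is below (the p-value would then come from the Student t CDF with n-1 degrees of freedom, e.g. via `scipy.stats`; the function name and sample values are illustrative):

```python
import numpy as np

def t_statistic(diffs, threshold):
    """One-sample t statistic for H0: mean(diffs) >= threshold.

    Strongly negative values support rejecting H0 in favor of
    mean(diffs) < threshold (i.e., the fidelity gap is acceptably small).
    """
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)
    se = diffs.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    return float((diffs.mean() - threshold) / se)
```

For example, AUC differences clustered around 0.009 tested against X = 0.04 yield a large negative statistic.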
Privacy
Unlike de-identified data, there is no straightforward one-to-one mapping between real and synthetic data (generated from random vectors). However, there may be indirect privacy leakage risks built on correlations between the synthetic data and partial information from real data. We consider three different privacy attacks that represent known approaches adversaries may apply to de-anonymize private data (details are provided in Fig. 2 and Supplementary Information):

Membership inference attack: The adversary explores the probability of data being a member of the training data used for training the synthetic data generation model^{31}.

Re-identification attack: The adversary explores the probability of some features being re-identified using synthetic data and matching to the training data^{32}.

Attribute inference attack: The adversary predicts the value of sensitive features using synthetic data^{33}.
These metrics are highly practical as they represent the expected risks that currently prevent sharing of conventionally anonymized data. Furthermore, they are highly interpretable, as results for these metrics directly measure the risks associated with sharing synthetic data.
Table 3 summarizes the results along with the ideal achievable value for each metric. According to the results shown in Table 3, we observe that the privacy metrics are very close to the ideal in all cases. The risk of inferring whether a sample of the original data was used for training the model is very close to random chance. For the attribute inference attack, we focus on the prediction task of inferring specific attributes (gender, religion, and marital status) using other attributes as features. We compare prediction accuracy when training a kNN classifier with real data against another kNN classifier trained with synthetic data. The results demonstrate that access to synthetic data does not lead to higher prediction performance on specific attributes as compared to access to the original data. More results for privacy with different distance metrics can be found in Supplementary Information.
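For intuition, the attribute inference comparison can be sketched with a toy nearest-neighbor classifier (the paper uses kNN; this simplified 1-NN version and all names here are illustrative, not the paper's exact attack implementation):

```python
import numpy as np

def one_nn_accuracy(train_X, train_y, test_X, test_y):
    """Accuracy of a 1-nearest-neighbor classifier under Euclidean distance."""
    # Pairwise distances between each test point and all training points.
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=-1)
    preds = train_y[np.argmin(d, axis=1)]
    return float(np.mean(preds == test_y))

# The attack fits the same classifier once on real data and once on synthetic
# data, predicting a sensitive attribute from the other attributes; similar
# accuracies indicate the synthetic data leak no additional information.
```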
Discussion
We provide ablation studies on key components of EHR-Safe in Table 4 (top): (1) stochastic normalization, (2) explicit mask modeling, and (3) categorical embedding. All three components are observed to substantially contribute to the quality of synthetic data generation. Supplementary Information further illustrates the impact of stochastic normalization in terms of CDF curves.
In Table 4 (bottom), we compare EHR-Safe to three alternative methods (TimeGAN^{15}, RCGAN^{34}, C-RNN-GAN^{35}) proposed for time-series synthetic data generation. Note that the alternative methods are not designed to handle all the challenges of EHR data, such as varying-length sequences, missingness, and joint representation of static and time-varying features (please see Supplementary Information on how we modify them for these functionalities). Thus, they significantly underperform EHR-Safe, as shown in Table 4.
Post-processing can further improve the statistical similarity of the synthetic data. Perfectly matching the distributions of synthetic and real data might be particularly challenging for features with skewness or CDFs with discrete jumps. For scenarios where EHR-Safe might fall short in matching the distributions, a proposed post-processing method (details can be found in Supplementary Information) can further refine the generated data and improve the fidelity results for statistical similarity. The post-processing method is based on matching the ratios of samples in different buckets for the real and synthetic data. Note that this procedure is not a learning-based method (i.e., it has no trainable parameters). With this procedure, we can significantly improve the statistical similarity: KS statistics become less than or equal to 0.01 for all features. However, the drawbacks are the additional complexity of generating synthetic data and a slight degradation of the utility metrics (e.g., AUC changed from 0.749 to 0.730 on MIMIC-III with Random Forest). There is not much difference in the proposed privacy metrics (e.g., the membership-inference attack metric changed from 0.493 to 0.489 on MIMIC-III).
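The bucket-matching procedure is detailed in Supplementary Information. As a rough approximation of the idea, a rank-based quantile-matching step (our sketch, not the paper's exact procedure) replaces each synthetic value with the real-data quantile at the same rank, so the proportion of samples in each bucket matches the real data:

```python
import numpy as np

def match_distribution(synth, real):
    """Replace each synthetic value by the real-data quantile at its rank."""
    order = np.argsort(synth)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(synth))
    # Mid-rank quantile levels in (0, 1) for each synthetic sample.
    quantiles = (ranks + 0.5) / len(synth)
    return np.quantile(real, quantiles)
```

This has no trainable parameters, consistent with the non-learning-based procedure described above; the trade-off is that it reshapes marginals independently and can slightly perturb joint structure (mirroring the small utility drop reported).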
We demonstrate that EHR-Safe achieves very strong empirical privacy results when considering multiple practical privacy metrics. However, EHR-Safe does not provide theoretical privacy guarantees (e.g., differential privacy) unless its training is modified by randomly perturbing the models^{21,36}. Note that the EHR-Safe framework can be directly adapted for differential privacy. For instance, DP-SGD^{37} can be used to train the encoder-decoder and WGAN-GP models to obtain a generator and decoder that are differentially private with respect to the original data. Since synthetic data are generated through the differentially private generator and decoder using random vectors as inputs, the generated synthetic data are also differentially private with respect to the original data. Even though these approaches can be adopted in EHR-Safe, they may result in a decrease in fidelity, as the added noise would hurt generative model training.
For the proposed metrics, the specific assumptions and models might pose limitations. The proposed fidelity metrics that reflect the downstream machine learning use cases depend on the model type. For future work, it would be interesting to study which fidelity metrics would correspond to the performance of the best achievable model. Similarly, the proposed privacy attacks employ certain assumptions about the methodology and model of the attacker (e.g., nearest neighbor search for very high-dimensional data might be suboptimal). It would be interesting to understand the theoretically achievable privacy.
Most of our results are very close to the ideal achievable performance, indicating one could have high confidence in using our method in the real world. The result with the most room for improvement is statistical similarity, which is not equally high across all features. Improving this consistently across all features can be achieved with further advances in generative modeling.
Various follow-up directions remain important for future work. The EHR data of this paper's focus are heterogeneous structured data, and we show significant advancement over the prior state-of-the-art that focused on more limited data types. A natural extension is to integrate the generative modeling capability for text and image data, as modern EHR datasets often contain both. Realistic generation of text and image data would require high-capacity, deep decoders. However, such decoders would come with extra training challenges, and effective training of them could require a much higher number of data samples. In addition, extra training difficulties would arise due to the fact that training dynamics for different modalities are different. Utilizing foundation models that are pre-trained on publicly available data is shown to be one of the key drivers of the recent research progress for deep learning on image and text data (including generative modeling). However, publicly available general-purpose image and text datasets often come from very different domains, and their relevance to real-world EHR data would be low.
In this paper, we verify the performance of EHR-Safe on two healthcare provider datasets which consist of admitted patients. An important follow-up work would be applying EHR-Safe to outpatient medical datasets from primary care or insurance companies. Scaling synthetic data generation for a complete EHR dataset with many features is another important direction. From a modeling perspective, there is no fundamental limitation for scaling: EHR-Safe can be trained to generate a very high number of features without hitting computational issues. However, we expect degradation in the generation quality for rarely observed features (e.g., almost 90% of the MIMIC-III features are measured less than 1 time per visit, on average). Weak data coverage would constitute the fundamental challenge.
In conclusion, we propose a generative modeling framework for EHR data, EHR-Safe, that can generate highly realistic synthetic EHR data that are robust to privacy attacks. EHR-Safe is based on generative adversarial network modeling applied to the encoded representations of the raw data. We introduce multiple innovations in the EHR-Safe architecture and training mechanisms that are motivated by the key challenges in EHR data. These innovations enable EHR-Safe to demonstrate high fidelity (almost-identical properties with real data when desired downstream capabilities are considered) with almost-ideal privacy preservation.
Methods
This research follows Google AI principles (https://ai.google/principles/), was reviewed by the Google Health Ethics Committee, and uses solely publicly available datasets.
The overall EHR-Safe framework is illustrated in Fig. 1d. To synthesize EHR data, we adopt generative adversarial networks (GANs). EHR data are heterogeneous (see Fig. 1b), including time-varying and static features that are partially available. Direct modeling of raw EHR data is thus challenging for GANs. To circumvent this, we propose utilizing a sequential encoder-decoder architecture to learn the mapping from raw EHR data to low-dimensional representations and vice versa.
While learning the mapping, esoteric distributions of various numerical and categorical features pose a great challenge; for example, some values or numerical ranges might be much more common, dominating the distribution, while the capability of modeling rare cases is crucial. Our proposed feature mapping methods are key to handling such data, converting features to distributions for which the training of the encoder-decoder and GAN is more stable and accurate. The mapped low-dimensional representations, produced by the encoder, are used for GAN training; at inference, the GAN generates new representations, which are then converted to raw EHR data with the decoder. Algorithm 1 overviews the training procedure for EHR-Safe. In the following subsections, we explain the key components.
Feature representations
EHR data often consist of both static and time-varying features. Each static and temporal feature can be further categorized as either numerical or categorical. Measurement time for time-varying features is another important feature. Overall, the five categories of features for patient index i are: (1) measurement time as u, (2) static numerical features (e.g., age) as s^{n}, (3) static categorical features (e.g., marital status) as s^{c}, (4) time-varying numerical features (e.g., vital signs) as t^{n}, (5) time-varying categorical features (e.g., heart rhythm) as t^{c}. The sequence length of time-varying features is denoted as T(i). Note that each patient record may have a different sequence length. With all these features, the given training data can be represented as
\({{{{D}}}}={\{{{{{\bf{s}}}}}^{n}(i),{{{{\bf{s}}}}}^{c}(i),{\{{u}_{\tau }(i),{{{{\bf{t}}}}}_{\tau }^{n}(i),{{{{\bf{t}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{T(i)}\}}_{i = 1}^{N}\)
where N is the total number of patient records.
EHR datasets often contain missing features, as patients might visit clinics sporadically and not all measurements or information are collected at every visit. In order to generate realistic synthetic data, missingness patterns should also be generated in a realistic way. Let us denote the binary mask m with 1/0 values based on whether a feature is observed (m = 1) or not (m = 0). The missingness for the features is represented as
\({{{{{D}}}}}_{{{{{M}}}}}={\{{{{{\bf{m}}}}}^{n}(i),{{{{\bf{m}}}}}^{c}(i),{\{{{{{\bf{m}}}}}_{\tau }^{n}(i),{{{{\bf{m}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{T(i)}\}}_{i = 1}^{N}\)
Note that there is no missingness for measurement time: we assume time is always given whenever at least one time-varying feature is observed.
Figure 3 visualizes how the raw data are converted into four categories of features: (1) measurement time, (2) time-varying features, (3) mask features, (4) static features.
Encoding and decoding categorical features
Handling categorical features poses a unique challenge beyond numerical features, as meaningful discrete mappings need to be learned. One-hot encoding is one possible solution; however, if some features have a large number of categories (such as medical codes), the number of dimensions would significantly increase, hurting GAN training and data efficiency^{38}. We propose encoding and decoding categorical features to obtain learnable mappings to be used for generative modeling. We first encode the categorical features (s^{c}) into one-hot encoded features (s^{co}); here, we use the notation for static categorical features, but it is the same for temporal categorical features. Then, we employ a categorical encoder (CE^{s}) to transform the one-hot encoded features into the latent representations (s^{ce}):
\({{{{\bf{s}}}}}^{ce}=C{E}^{s}({{{{\bf{s}}}}}^{co})\)
where K is the number of categorical features. Lastly, we use the multi-head decoders (\([C{F}_{1}^{s},...,C{F}_{K}^{s}]\)) to recover the original one-hot encoded data from the latent representations.
Both the encoder (CE^{s}) and multi-head decoders (\([C{F}_{1}^{s},...,C{F}_{K}^{s}]\)) are trained with a softmax cross-entropy objective (\({{{{{L}}}}}_{c}\)):
We use separate encoder-decoder models for static and temporal categorical features. The transformed representations are denoted as s^{ce} and t^{ce}, respectively.
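As an illustration of the encode/decode scheme, the toy numpy version below uses a shared encoder matrix and per-feature softmax decoder heads with a summed cross-entropy loss. The paper's components are trained networks; the random linear maps, dimensions, and names here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: K = 2 categorical features with 3 and 4 categories; their
# one-hot encodings are concatenated into a 7-dim vector and compressed
# into a 2-dim latent representation.
sizes = [3, 4]
W_enc = rng.normal(size=(sum(sizes), 2))          # shared categorical encoder
W_dec = [rng.normal(size=(2, c)) for c in sizes]  # one decoder head per feature

def encode(onehot):
    return onehot @ W_enc

def decode_and_loss(emb, onehot):
    """Sum of softmax cross-entropy losses over the per-feature decoder heads."""
    loss, start = 0.0, 0
    for W, c in zip(W_dec, sizes):
        probs = softmax(emb @ W)
        target = onehot[:, start:start + c]
        loss += -np.mean(np.sum(target * np.log(probs + 1e-12), axis=-1))
        start += c
    return float(loss)
```

Training would minimize this loss over W_enc and W_dec so the latent representation preserves the category identities.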
Algorithm 1
Pseudocode of EHR-Safe training.
Input: Original data \({{{{D}}}}={\{{{{{\bf{s}}}}}^{n}(i),{{{{\bf{s}}}}}^{c}(i),{\{{u}_{\tau }(i),{{{{\bf{t}}}}}_{\tau }^{n}(i),{{{{\bf{t}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{T(i)}\}}_{i = 1}^{N}\)
1: Generate missing patterns of \({{{{D}}}}\): \({{{{{D}}}}}_{{{{{M}}}}}={\{{{{{\bf{m}}}}}^{n}(i),{{{{\bf{m}}}}}^{c}(i),{\{{{{{\bf{m}}}}}_{\tau }^{n}(i),{{{{\bf{m}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{T(i)}\}}_{i = 1}^{N}\)
2: Transform categorical data (s^{c}, t^{c}) into one-hot encoded data (s^{co}, t^{co})
3: Train static categorical encoder and decoder:
4: Train temporal categorical encoder and decoder:
5: Transform one-hot encoded data (s^{co}, t^{co}) to categorical embeddings (s^{ce}, t^{ce})
6: Stochastic normalization for numerical features (s^{n}, t^{n}, u) (see Algorithm 2)
7: Train encoder-decoder model using Equation (11)
8: Generate original encoder states e using trained encoder (E), original data \({{{{D}}}}\) and missing patterns \({{{{{D}}}}}_{{{{{M}}}}}\)
9: Train generator (G) and discriminator (D) using WGAN-GP
Output: Trained generator (G), trained decoder (F), trained categorical decoder (CF^{s}, CF^{t})
Stochastic normalization for numerical features
One prominent challenge for training GANs is mode collapse^{38}, i.e., the generative model overemphasizes the generation of some commonly observed data values. Especially for distributions where the probability mass is condensed within a small numerical range, this can be a severe issue. For EHR data, such distributions are indeed observed for many features.
Some numerical clinical features might have values from a discrete set of observations (e.g., high respiratory pressure values coming in multiples of 5: 35, 40, 45, etc.) or from highly non-uniform distributions, yielding cumulative distribution functions (CDFs) that are discontinuous or have significant jumps.
Directly generating numerical features coming from highly discontinuous CDFs can be challenging for GANs, as they are known to suffer from mode collapse and would have a tendency to generate common values for all samples. To circumvent this issue and obtain high fidelity, we propose a normalization/renormalization method, shown in Algorithms 2 and 3, that maps the raw feature distributions to and from a more uniform distribution that is easier to model with GANs. As an example application: (1) estimate the ratio of each unique value in the original feature; (2) assign each unique value a sub-interval of the normalized feature space whose width equals its ratio; (3) map each original value uniformly at random into its sub-interval. For instance, if we have 3 original values (1, 2, 3) with corresponding ratios (0.1, 0.7, 0.2), then value 1 is mapped into the [0, 0.1] range uniformly at random, value 2 into [0.1, 0.8], and value 3 into [0.8, 1.0].
Algorithm 2
Pseudocode of stochastic normalization.
Input: Original feature X
1: Uniq(X) = Unique values of X, N = Length of (X)
2: lowerbound = 0.0, upperbound = 0.0, \(\hat{X}=X\)
3: for val in Uniq(X) do
4: Find index of X whose value = val as idx(val)
5: Compute the frequency (ratio) of val as ratio(val) = Length of idx(val) / N
6: upperbound = lowerbound + ratio(val)
7: \(\hat{X}\)[idx(val)] ~ Uniform(lowerbound, upperbound)
8: params[val] = [lowerbound, upperbound]
9: lowerbound = upperbound
10: end for
Output: Normalized feature (\(\hat{X}\)), normalization parameters (params)
Algorithm 3
Pseudocode of stochastic renormalization.
Input: Normalized feature (\(\hat{X}\)), normalization parameters (params)
1: \(X=\hat{X}\)
2: for param in params.keys do
3: Find index of \(\hat{X}\) whose value is in [param.values] as idx(param)
4: X[idx(param)] = param
5: end for
Output: Original feature X
As shown in Supplementary Information, the proposed stochastic normalization can be highly effective in transforming features with discontinuous CDFs into approximately uniform distributions while allowing for perfect renormalization into the original feature space. Table 4 demonstrates that normalization has a significant impact on the results of EHR-Safe.
We also note that the stochastic normalization method is highly effective for handling skewed distributions that might correspond to features with outliers. Stochastic normalization maps the original feature space (with outliers) into a normalized feature space (with uniform distribution), and then the applied renormalization recreates the skewed distributions with outliers.
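Algorithms 2 and 3 can be implemented directly; the sketch below (function names are ours) also verifies the property claimed above, that renormalization exactly inverts the stochastic normalization:

```python
import numpy as np

def stochastic_normalize(x, rng):
    """Algorithm 2: map each unique value to uniform samples inside a
    sub-interval of [0, 1] whose width equals the value's frequency."""
    x = np.asarray(x, dtype=float)
    x_hat, params, lower = np.empty_like(x), {}, 0.0
    for val in np.unique(x):
        idx = np.where(x == val)[0]
        upper = lower + len(idx) / len(x)
        x_hat[idx] = rng.uniform(lower, upper, size=len(idx))
        params[val] = (lower, upper)
        lower = upper
    return x_hat, params

def stochastic_renormalize(x_hat, params):
    """Algorithm 3: map each normalized value back to the original value
    whose sub-interval contains it."""
    x = np.empty_like(x_hat)
    for val, (lower, upper) in params.items():
        x[(x_hat >= lower) & (x_hat <= upper)] = val
    return x
```

Because each unique value occupies an interval proportional to its frequency, the normalized feature is approximately uniform on [0, 1], which is much easier for the GAN to model than a spiky discrete distribution.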
Encoderdecoder architecture
Given the described encoding scheme for numerical and categorical features, we next describe the architecture employed for jointly extracting representations from multiple types of data, including static, temporal, measurement time, and mask features. We propose to encode these heterogeneous features into joint representations from which the synthetic data samples are generated. High-dimensional sparse data are challenging to model with GANs, as they might cause convergence stability and mode collapse issues, and they might be less data-efficient^{38}. To address this, using an encoder-decoder model is beneficial, as it condenses high-dimensional heterogeneous features into latent representations that are low-dimensional and compact.
The encoder model (E) inputs the static data (s^{n}, s^{ce}), temporal data (t^{n}, t^{ce}), time data (u), and mask data (\({{{{\bf{m}}}}}^{n},{{{{\bf{m}}}}}^{c},{{{{\bf{m}}}}}_{\tau }^{n},{{{{\bf{m}}}}}_{\tau }^{c}\)) and generates the encoder states (e), as shown in Fig. 4 and the equations below.
The decoder model (F) inputs these encoded representations (e) and aims to recover the original static, temporal, measurement time, and mask data.
If the decoder model can recover the original heterogeneous data correctly, it can be inferred that e contains most of the information in the original heterogeneous data.
For temporal, measurement time, and static features, we use the mean squared error (\({{{{{L}}}}}_{m}\)) as the reconstruction loss. Note that we compute the errors only when the features are observed. For the mask features, we use the binary cross-entropy (\({{{{{L}}}}}_{c}\)) as the reconstruction loss because the mask features consist of binary variables. Thus, our full reconstruction loss becomes
\({{{{{L}}}}}_{r}={{{{{L}}}}}_{m}+\lambda {{{{{L}}}}}_{c}\)
where λ is a hyperparameter that balances the cross-entropy loss and the mean squared loss.
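A minimal numpy rendering of this masked reconstruction loss is below; the shapes, the value of λ, and the function name are illustrative rather than the paper's exact implementation:

```python
import numpy as np

def reconstruction_loss(x, x_hat, mask, m_hat, lam=1.0):
    """Masked MSE on observed features plus binary cross-entropy on the mask.

    x, x_hat : true and reconstructed feature values
    mask     : 1 where a feature is observed, 0 where it is missing
    m_hat    : predicted probability that each feature is observed
    """
    # Mean squared error, computed only where features are observed.
    mse = np.sum(mask * (x - x_hat) ** 2) / np.maximum(mask.sum(), 1)
    # Binary cross-entropy between the true mask and predicted mask probabilities.
    eps = 1e-12
    bce = -np.mean(mask * np.log(m_hat + eps) + (1 - mask) * np.log(1 - m_hat + eps))
    return mse + lam * bce
```

Restricting the MSE to observed entries is what lets the model learn feature values and missingness patterns jointly without penalizing reconstructions of unobserved values.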
Adversarial training
The trained encoder model is used to map raw data into encoded representations, which are then used for GAN training so that the trained generative model can generate realistic encoded representations that can be decoded into realistic raw data.
We first utilize the trained encoder to generate the original encoder states (e) from the original raw data; the original dataset thus gets converted into \({{{{{D}}}}}_{e}={\{{{{\bf{e}}}}(i)\}}_{i = 1}^{N}\). Next, we use the generative adversarial network (GAN) training framework to generate synthetic encoder states \(\hat{{{{\bf{e}}}}}\) that form the synthetic encoder states dataset \({\hat{{{{{D}}}}}}_{e}\). More specifically, the generator (G) uses the random vector (z) to generate synthetic encoder states as follows:
\(\hat{{{{\bf{e}}}}}=G({{{\bf{z}}}})\)
The discriminator D then tries to distinguish the original encoder states e from the synthetic encoder states \(\hat{{{{\bf{e}}}}}\). As the GAN framework, we adopt the Wasserstein GAN^{39} with gradient penalty^{40} (WGAN-GP), due to its training stability for heterogeneous data types. The optimization problem can be stated as:
where η is the WGAN-GP gradient-penalty hyperparameter, set to 10. Figure 4 depicts the proposed GAN model, with generator and discriminator architectures based on multi-layer perceptrons (MLPs).
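To make the objective concrete, the sketch below evaluates the critic side of a WGAN-GP objective for a toy linear critic, where the input gradient is available in closed form (a linear critic's gradient is its weight vector everywhere, so the penalty needs no autograd). The linear critic and function names are illustrative stand-ins, not the MLP discriminator used in the paper.

```python
def critic(w, e):
    # Toy linear critic D(e) = <w, e>; its gradient w.r.t. the input is w.
    return sum(wi * ei for wi, ei in zip(w, e))

def wgan_gp_critic_loss(w, reals, fakes, eta=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + eta * (||grad D|| - 1)^2.

    For a linear critic the gradient norm at any real/fake interpolate is
    simply ||w||, so the gradient-penalty term is (||w|| - 1)^2.
    """
    d_real = sum(critic(w, e) for e in reals) / len(reals)
    d_fake = sum(critic(w, e) for e in fakes) / len(fakes)
    grad_norm = sum(wi * wi for wi in w) ** 0.5
    return d_fake - d_real + eta * (grad_norm - 1.0) ** 2
```

With a unit-norm weight vector the penalty vanishes and only the Wasserstein term remains; for a general neural critic the gradient at sampled interpolates would instead be obtained via automatic differentiation.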
Inference
The inference process of EHR-Safe is overviewed in Algorithm 4. After training both the encoder-decoder and GAN models, we can generate synthetic heterogeneous data from any random vector. Note that only the trained generator and decoder are used at inference.
As shown in Fig. 5, the trained generator uses the random vector to generate synthetic encoder states.
Then, the trained decoder (F) uses the synthetic encoder states as the inputs to generate synthetic temporal (\({\hat{{{{\bf{t}}}}}}^{n},{\hat{{{{\bf{t}}}}}}^{ce}\)), static (\({\hat{{{{\bf{s}}}}}}^{n},{\hat{{{{\bf{s}}}}}}^{ce}\)), time (\(\hat{u}\)), and mask (\({\hat{{{{\bf{m}}}}}}^{n},{\hat{{{{\bf{m}}}}}}^{c},{\hat{{{{\bf{m}}}}}}_{\tau }^{n},{\hat{{{{\bf{m}}}}}}_{\tau }^{c}\)) data.
Representations for the static and temporal categorical features are decoded using the decoders in Fig. 6 to generate synthetic static categorical (\({\hat{{{{\bf{s}}}}}}^{c}\)) data and temporal categorical (\({\hat{{{{\bf{t}}}}}}^{c}\)) data.
The generated synthetic data are represented as:
Note that with the trained models, we can generate an arbitrary number of synthetic data samples (even more than the original data).
Algorithm 4
Pseudocode of EHR-Safe inference.
Input: Trained generator (G), trained decoder (F), the number of synthetic data (M), trained categorical decoder (CF^{s}, CF^{t})
1: Sample M random vectors \({{{\bf{z}}}} \sim {{{{N}}}}(0,I)\)
2: Generate synthetic embeddings: \(\hat{{{{\bf{e}}}}}=G({{{\bf{z}}}})\)
3: Decode synthetic embeddings to synthetic data: \({\hat{{{{\bf{s}}}}}}^{n},{\hat{{{{\bf{s}}}}}}^{ce},{\hat{{{{\bf{t}}}}}}^{n},{\hat{{{{\bf{t}}}}}}^{ce},\hat{u},{\hat{{{{\bf{m}}}}}}^{n},{\hat{{{{\bf{m}}}}}}^{c},{\hat{{{{\bf{m}}}}}}_{\tau }^{n},{\hat{{{{\bf{m}}}}}}_{\tau }^{c}=F(\hat{{{{\bf{e}}}}})\)
4: Decode synthetic categorical embeddings: \({\hat{{{{\bf{s}}}}}}^{c}=C{F}^{s}({\hat{{{{\bf{s}}}}}}^{ce}),{\hat{{{{\bf{t}}}}}}^{c}=C{F}^{t}({\hat{{{{\bf{t}}}}}}^{ce})\)
5: Renormalize synthetic numerical data (\({\hat{{{{\bf{s}}}}}}^{n},{\hat{{{{\bf{t}}}}}}^{n},\hat{u}\)) (see Algorithm 3)
Output: Synthetic data \(\hat{{{{{D}}}}}={\{{\hat{{{{\bf{s}}}}}}^{n}(i),{\hat{{{{\bf{s}}}}}}^{c}(i),{\{{\hat{{{{\bf{u}}}}}}_{\tau }(i),{\hat{{{{\bf{t}}}}}}_{\tau }^{n}(i),{\hat{{{{\bf{t}}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{\hat{T}(i)}\}}_{i = 1}^{M}\) and synthetic missing pattern \({\hat{{{{{D}}}}}}_{{{{{M}}}}}={\{{\hat{{{{\bf{m}}}}}}^{n}(i),{\hat{{{{\bf{m}}}}}}^{c}(i),{\{{\hat{{{{\bf{m}}}}}}_{\tau }^{n}(i),{\hat{{{{\bf{m}}}}}}_{\tau }^{c}(i)\}}_{\tau = 1}^{\hat{T}(i)}\}}_{i = 1}^{M}\)
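The steps of Algorithm 4 can be sketched as follows. The generator and decoder here are toy stand-ins for the trained networks, and the categorical decoding and renormalization steps (lines 4–5 of the algorithm) are omitted for brevity.

```python
import random

def sample_synthetic(generator, decoder, m, z_dim, seed=0):
    """Algorithm 4 sketch: sample z ~ N(0, I), map it to a synthetic encoder
    state with the generator, then decode it into a synthetic record.
    `generator` and `decoder` are placeholders for the trained networks."""
    rng = random.Random(seed)
    records = []
    for _ in range(m):
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]  # random vector
        e_hat = generator(z)                 # synthetic encoder state
        records.append(decoder(e_hat))       # synthetic heterogeneous record
    return records

# Toy stand-ins: identity generator; decoder that splits the state into
# hypothetical "static" and "temporal" parts.
gen = lambda z: z
dec = lambda e: {"static": e[:2], "temporal": e[2:]}
data = sample_synthetic(gen, dec, m=3, z_dim=4)
```

Because each sample starts from a fresh random vector, m can be chosen freely, which is why the trained models can produce an arbitrary number of synthetic records.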
Data availability
The data used for the training, validation, and test sets are publicly available. All data were collected entirely from openly available sources. The EHR datasets used in this study can be accessed at the following websites: MIMIC-III (https://physionet.org/content/mimiciii/1.4/) and eICU (https://eicu-crd.mit.edu/gettingstarted/access/).
References
Zhu, T., Li, K., Herrero, P. & Georgiou, P. Deep learning for diabetes: a systematic review. IEEE J. Biomed. Health Inform. 25, 2744–2757 (2020).
Yu, L., Chan, W. M., Zhao, Y. & Tsui, K.-L. Personalized health monitoring system of elderly wellness at the community level in Hong Kong. IEEE Access 6, 35558–35567 (2018).
Liu, R. et al. Systematic pan-cancer analysis of mutation–treatment interactions using large real-world clinicogenomics data. Nat. Med. 28, 1656–1661 (2022).
Abouelmehdi, K., Beni-Hssane, A., Khaloufi, H. & Saadi, M. Big data security and privacy in healthcare: a review. Procedia Comput. Sci. 113, 73–80 (2017).
Iyengar, A., Kundu, A. & Pallis, G. Healthcare informatics and privacy. IEEE Internet Comput. 22, 29–31 (2018).
Ray, P. & Wimalasiri, J. The need for technical solutions for maintaining the privacy of EHR. In Proc. 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, 4686–4689 (IEEE, 2006).
AzarmDaigle, M., Kuziemsky, C. & Peyton, L. A review of cross organizational healthcare data sharing. Procedia Comput. Sci. 63, 425–432 (2015).
Uzuner, Ö., Luo, Y. & Szolovits, P. Evaluating the state-of-the-art in automatic de-identification. J. Am. Med. Inform. Assoc. 14, 550–563 (2007).
Janmey, V. & Elkin, P. L. Re-identification risk in HIPAA de-identified datasets: the MVA attack. AMIA Annu. Symp. Proc. 2018, 1329–1337 (2018).
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
Goodfellow, I. et al. Generative adversarial nets. In Proc. 27th International Conference on Neural Information Processing Systems, Vol. 27, 2672–2680 (2014).
Van den Oord, A. et al. Conditional image generation with PixelCNN decoders. In Proc. 30th International Conference on Neural Information Processing Systems, 4797–4805 (2016).
Van den Oord, A. et al. Wavenet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
Nowozin, S., Cseke, B. & Tomioka, R. f-GAN: training generative neural samplers using variational divergence minimization. In Proc. 30th International Conference on Neural Information Processing Systems, 271–279 (2016).
Yoon, J., Jarrett, D. & Van der Schaar, M. Time-series generative adversarial networks. In Proc. 33rd Conference on Neural Information Processing Systems (2019).
Creswell, A. et al. Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35, 53–65 (2018).
Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In Proc. International Conference on Learning Representations (ICLR) (2018).
Kong, J., Kim, J. & Bae, J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 33, 17022–17033 (2020).
de Masson d’Autume, C., Mohamed, S., Rosca, M. & Rae, J. Training language GANs from scratch. In Proc. 33rd Conference on Neural Information Processing Systems (2019).
Liu, Y., Peng, J., James, J. & Wu, Y. PPGAN: privacy-preserving generative adversarial network. In Proc. 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), 985–989 (IEEE, 2019).
Jordon, J., Yoon, J. & Van der Schaar, M. PATE-GAN: generating synthetic data with differential privacy guarantees. In Proc. 2019 International Conference on Learning Representations (2019).
Jarrett, D., Bica, I. & van der Schaar, M. Time-series generation by contrastive imitation. Adv. Neural Inf. Process. Syst. 34, 28968–28982 (2021).
Choi, E. et al. Generating multi-label discrete patient records using generative adversarial networks. PMLR 68, 286–305 (2017).
Lu, C., Reddy, C. K., Wang, P., Nie, D. & Ning, Y. Multi-label clinical time-series generation via conditional GAN. Preprint at https://arxiv.org/abs/2204.04797 (2022).
Johnson, A., Pollard, T. & Mark, R. MIMIC-III clinical database (version 1.4). PhysioNet https://physionet.org/content/mimiciii/1.4/ (2016).
Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000).
Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multicenter database for critical care research. Sci. Data 5, 180178 (2018).
Sadeghi, R., Banerjee, T. & Romine, W. Early hospital mortality prediction using vital signals. Smart Health 9, 265–274 (2018).
Sheikhalishahi, S., Balaraman, V. & Osmani, V. Benchmarking machine learning models on eICU critical care dataset. Preprint at https://arxiv.org/abs/1910.00964 (2019).
Liu, G. et al. SocInf: membership inference attacks on social media health data with machine learning. IEEE Trans. Comput. Soc. Syst. 6, 907–921 (2019).
Su, D., Huynh, H. T., Chen, Z., Lu, Y. & Lu, W. Re-identification attack to privacy-preserving data analysis with noisy sample-mean. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1045–1053 (2020).
Mehnaz, S. et al. Are your sensitive attributes private? Novel model inversion attribute inference attacks on classification models. In Proc. 31st USENIX Security Symposium (USENIX Security 22), 4579–4596 (2022).
Esteban, C., Hyland, S. L. & Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. Preprint at https://arxiv.org/abs/1706.02633 (2017).
Mogren, O. C-RNN-GAN: continuous recurrent neural networks with adversarial training. Preprint at https://arxiv.org/abs/1611.09904 (2016).
Torkzadehmahani, R., Kairouz, P. & Paten, B. DP-CGAN: differentially private synthetic data and label generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019).
Abadi, M. et al. Deep learning with differential privacy. In Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318 (2016).
Saxena, D. & Cao, J. Generative adversarial networks (GANs): challenges, solutions, and future directions. ACM Comput. Surv. 54, 1–42 (2021).
Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. PMLR 70, 214–223 (2017).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of Wasserstein GANs. In Proc. 31st International Conference on Neural Information Processing Systems, 5769–5779 (2017).
Acknowledgements
This work was approved by Google, and no extramural funding was used for this project.
Author information
Contributions
J.Y., F.B., S.A. and T.P. initiated the project. J.Y. and S.A. designed the model architecture and training methodology. J.Y., M.M., N.F.G., T.J., S.A. contributed to metric developments. J.Y., M.M., N.F.G., A.S.R., P.B., F.K. and D.A. contributed to developing scalable pipelines and software infrastructure. J.Y., M.M., N.F.G., T.J. and S.A. contributed to the overall experimental design and analyses. M.M., N.F.G. and F.B. contributed to data preprocessing. J.Y., M.M., G.L., A.M., F.B., E.K., S.A. and T.P. managed the project. J.Y., M.M., N.F.G., T.J., A.M., F.B., E.K., S.A. and T.P. wrote the paper.
Ethics declarations
Competing interests
This work was approved by Google, and no extramural funding was used for this project. All authors are affiliated with Google. The authors have no other competing interests to declare.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yoon, J., Mizrahi, M., Ghalaty, N.F. et al. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. npj Digit. Med. 6, 141 (2023). https://doi.org/10.1038/s41746-023-00888-7