Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data that preserves the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing it to generate realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrate that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring real EHR data (above 0.9 $R^2$ correlation). HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.


INTRODUCTION
The widespread adoption of electronic health record (EHR) systems has established the foundation for machine learning (ML) and artificial intelligence (AI) applications in healthcare. EHR data is highly complex, comprising over 10,000 unique medical codes for diagnoses, procedures, and medications, as well as thousands of lab measurements. Each patient record can include multiple visits with combinations of diagnoses, procedures, medications, and labs. These combinations create intricate relationships and complex patterns across tens of thousands of medical codes. AI and ML techniques are used to learn and model complex patterns in EHR data, enabling applications such as clinical predictive modeling [1,2], health monitoring [3,4], computational phenotyping [5,6], treatment recommendations [7][8][9], and more. However, the progress of AI and ML in healthcare is often impeded by the difficulty of accessing and sharing large real EHR datasets. Sharing EHR data is challenging due to privacy, security, and legal constraints. While patient de-identification can alleviate some of these concerns by removing obvious patient identifiers such as name, address, and birth date [10,11], studies have shown that the risk of re-identification remains high even after thorough de-identification [12][13][14].
To date, existing methods cannot generate realistic EHR data in its original, high-dimensional form. The high dimensionality of EHR data, along with rare and sparse variables and complex relationships among variables, makes the generation task a difficult one. Consequently, existing approaches all concede to creating lower-dimensional data by either aggregating variables or using a subset of more common variables of a manageable size. For example, the MedGAN method [19] modeled 615 disease categories without longitudinal information; the SynTEG model [15] aggregates codes to higher-level phenotypes and then removes rare phenotypes, resulting in only 1,276 variables; the ehrMGAN approach [21] reduced the variable dimension to fewer than 100; and EVA [16] models frequent co-occurrence patterns in the original EHR data as one-hot vectors, limiting its ability to generate diverse and novel co-occurrence patterns. Our supplementary information provides a table of the dimensionalities handled by existing methods. While these low-dimensional approaches may capture the proper statistics on a small number of variables and support narrow ML use cases relying solely on those variables, the resulting synthetic data is inadequate for broader applications that require high-dimensional data, including comprehensive statistical analysis, patient phenotyping, billing prediction and analysis, disease staging, and comprehensive data sharing.
We propose an approach for generating high-dimensional EHR data in its native form: the Hierarchical Autoregressive Language Model (HALO). This model, shown in Figure 1, takes an autoregressive and probabilistic approach and can capture the hierarchical distribution of EHR records and their temporal relationships. Using a hierarchical approach to model binary sequences of over a million variables, HALO can efficiently learn and represent complex patterns in EHR data.
HALO works by utilizing a pair of modules to represent both the visit- and code-level structures of a patient record. First, it uses a coarse, visit-level module to factorize the probability along each of a patient's visits and to efficiently process and represent a patient's past medical history. It then adds fine, code-level modeling to generate each variable in a given visit based on both that past history and the previous variables in the same visit, for maximum intra-visit cohesion.
We evaluate the performance of HALO by training it on a comprehensive outpatient claims dataset, as well as the MIMIC-III inpatient EHR data [25], and compare the results with a diverse set of existing synthetic EHR data generation techniques such as [15,16,26].
We evaluate the data quality based on its utility in modeling the statistical data distribution and for supporting ML models. HALO can accurately synthesize high-dimensional EHR data via modeling disease code probabilities (≈ 10,000), disease code co-occurrence probabilities within a visit (≈ 1,000,000), and conditional probabilities across consecutive visits (≈ 5,000,000). In our experiments, we found that HALO achieves a correlation coefficient of above 0.9 $R^2$ when compared to real EHR data, demonstrating its ability to generate realistic data.
arXiv:2304.02169v3 [cs.LG] 9 Nov 2023
Figure 1: The proposed HALO model. The architecture of HALO uses an autoregressive, multi-granularity approach that operates at both the visit and code level. It generates next-code probabilities based on the history of all previous visits, as processed through a stack of transformer decoder layers, and on the previous codes in the current visit, as processed through a series of masked linear layers.
In addition to generating high-fidelity and granular EHR data, we show that HALO improves predictive modeling on our EHR dataset by more than 17% compared to the leading baseline. We evaluate the predictive accuracy and perplexity of HALO on a held-out test set, demonstrating its superiority. Furthermore, the synthetic data generated by HALO enable downstream phenotyping ML models to achieve comparable accuracy to models trained on real data, with an AUC of 0.938 for HALO data versus 0.943 for real data. We then demonstrate that combining real data with synthetic data generated by HALO can improve the accuracy of ML models even further compared to using real EHR data alone. Finally, we show that HALO generates realistic data while simultaneously protecting the privacy of patients in the training data, as evaluated by a series of privacy metrics.

Problem Formulation
Structured EHRs are multi-level longitudinal records, where each patient is represented by a sequence of visits. Each visit is characterized by a set of medical codes, reflecting the diagnoses, procedures, and medications administered during that visit. Additional patient information, such as demographics, disease phenotype labels, lab test results, and inter-visit time, can also be included. We begin by formalizing the problem and introducing key notations that will be used throughout.

EHR Data
We represent a patient record $\mathcal{R}$ as a sequence of visits over time such that $\mathcal{R} = V^{(1)}, V^{(2)}, \dots, V^{(T)}$, where each visit $V^{(t)}$ contains a varying number of medical codes from $\mathcal{C}$, lab values from $\mathcal{L}$, and the inter-visit time gap $g^{(t)}$. $\mathcal{C}$ is then the set of all medical codes in our vocabulary, including diagnoses, procedures, and medications, and $\mathcal{L}$ is the set of all labs. Beyond the longitudinal records, a patient record also possesses some static information $\mathcal{S}$ containing demographics such as gender, race, and birth year, and a disease phenotype label $\mathcal{D}$ indicating major and persistent disease conditions.
Matrix Representation: To allow input to HALO and other machine learning models, we then convert $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{D}$ into a matrix representation $\mathbf{R}$. Specifically, we build a matrix containing a sequence of the vector representations for each of the patient's $T$ visits, a preceding start visit and label visit, and a succeeding end visit. The start visit $\mathbf{v}_s$ is a one-hot vector containing a special start code added to $\mathcal{C}$ to signify the start of the record, as is often required by certain model architectures.
The label visit $\mathbf{v}_l$ similarly contains special codes added to $\mathcal{C}$ representing demographic and chronic disease phenotypes from $\mathcal{S}$ and $\mathcal{D}$, respectively. For example, this label visit will have codes representing the patient's gender, racial and ethnic groups, birth year, and any chronic disease labels.
Each subsequent visit $\mathbf{v}_t \in \mathbb{R}^{|\mathcal{C}|}$ is then represented as a multi-hot binary vector representing medical codes, lab values, and inter-visit gaps present in that visit. To represent continuous lab values and visit gaps in a discrete form, we employ a granular discretization. This is achieved by adding multiple range codes to $\mathcal{C}$ for each lab test and for the intervals between visits. By converting all medical information into binary variables, $c_t^i$ represents the presence of the $i$-th code in $\mathcal{C}$ in the $t$-th visit of the patient record $\mathbf{R}$.
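The discretization of continuous values into range codes can be sketched as follows. This is a minimal illustration only: the bin edges, the code naming scheme, and the `to_range_code` helper are assumptions made for the example and are not taken from the HALO implementation.

```python
from bisect import bisect_right


def to_range_code(name, value, edges):
    """Map a continuous value to a discrete range code such as 'GAP_BIN_2'.

    `edges` is a sorted list of (hypothetical) bin boundaries. A value lands
    in bin k when edges[k-1] <= value < edges[k]; the first and last bins
    are open-ended.
    """
    bin_index = bisect_right(edges, value)
    return f"{name}_BIN_{bin_index}"


# Hypothetical bin edges, in days, for the gap between consecutive visits.
gap_edges = [7, 30, 90, 365]

# Each resulting code would be added to the vocabulary as one binary variable.
codes = [to_range_code("GAP", days, gap_edges) for days in (3, 30, 400)]
print(codes)
```

Each range code then behaves like any other medical code in the multi-hot visit vector, which is what lets a single binary vocabulary cover codes, labs, and visit gaps.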
Finally, to signal the end of the patient record in $\mathbf{v}_e$, a special last visit code is added to $\mathcal{C}$, serving a similar purpose to a stop token in natural language generation. This not only enables generative models to learn when to terminate records but also allows $\mathbf{R}$ to be padded through additional columns into a constant length for batch input without altering its content.
Figure 2 depicts the format of the visit vector and the EHR representation, and we provide a table of notations for reference in our supplementary information.
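To make the matrix representation concrete, the sketch below builds a small record as a sequence of multi-hot vectors: a start visit, a label visit, the real visits, an end visit, and zero padding. The vocabulary, the `<START>` and `<END>` token names, and the `build_record_matrix` helper are hypothetical, and storing visits as rows (rather than columns) is an arbitrary choice made for readability.

```python
def build_record_matrix(visits, label_codes, vocab, max_len):
    """Return a list of multi-hot vectors: start, label, visits, end, padding."""
    index = {code: i for i, code in enumerate(vocab)}

    def multi_hot(codes):
        row = [0] * len(vocab)
        for code in codes:
            row[index[code]] = 1
        return row

    rows = [multi_hot(["<START>"]), multi_hot(label_codes)]
    rows += [multi_hot(visit) for visit in visits]
    rows.append(multi_hot(["<END>"]))
    # Pad with all-zero vectors so every record in a batch has the same length.
    rows += [[0] * len(vocab) for _ in range(max_len - len(rows))]
    return rows


# Hypothetical vocabulary: special codes, label codes, then medical codes.
vocab = ["<START>", "<END>", "GENDER_F", "DIABETES", "E11.9", "I10", "Z79.4"]
visits = [["E11.9", "I10"], ["Z79.4"]]

matrix = build_record_matrix(visits, ["GENDER_F", "DIABETES"], vocab, max_len=6)
for row in matrix:
    print(row)
```

Because padding vectors are all zeros and follow the end visit, they can be appended freely without changing the content of the record.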
The generation task is to create $\mathbf{R}'$, a synthetic patient record that is statistically similar to, and offers the utility of, $\mathbf{R}$ without any one-to-one mapping to a real patient. Our HALO method does this by learning the distribution $P(\mathbf{R})$.
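One way to write the distribution being learned, consistent with the visit-level and code-level modules described in the introduction, is the autoregressive factorization below. It is a sketch rather than the authors' exact statement: here $\mathbf{v}_t$ denotes the multi-hot vector of the $t$-th visit, $c_t^i$ the binary indicator of the $i$-th code in that visit, and the special start, label, and end visits are folded into the index $t$ for brevity.

```latex
P(\mathbf{R})
  = \prod_{t} P\left(\mathbf{v}_t \mid \mathbf{v}_1, \dots, \mathbf{v}_{t-1}\right)
  = \prod_{t} \prod_{i=1}^{|\mathcal{C}|}
    P\left(c_t^i \mid \mathbf{v}_1, \dots, \mathbf{v}_{t-1},\; c_t^1, \dots, c_t^{i-1}\right)
```

The outer product corresponds to the coarse, visit-level module, and the inner product to the fine, code-level module that conditions each code on the codes already generated within the same visit.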

Experimental Design
We evaluate our method and compare it to several baselines, comprising both recently proposed models and other logical autoregressive model architectures, on a series of experiments on both outpatient and inpatient EHR datasets. To maintain the fidelity of the original EHR data, our experiments focus on synthesizing original granular medical codes without aggregating or combining codes. Specifically, we seek to answer the following questions.
• Is HALO effective at modeling the underlying data distribution of electronic health records?
• Can HALO produce a synthetic dataset that is statistically similar to real EHR data?
• Can HALO augment real data for more accurate disease phenotyping prediction?
• Can HALO generate realistic continuous variables such as lab results and visit time gaps?
• Can HALO preserve the privacy of patients in the training data?

Datasets and Experimental Setup
Datasets: We use two datasets for our experiments. (1) The outpatient EHR dataset comes from a large real-world US claims dataset. It contains 929,268 patients and binary labels for 11 chronic diseases (specific diseases and patient counts are included in the supplementary information). This yields a final real-world outpatient EHR dataset with an average of 34.16 visits per record and 3.52 codes per visit, with 9,882 unique ICD-10 codes. (2) The inpatient EHR dataset is the MIMIC-III ICU stay dataset [25]. It contains 46,520 patients with 25 disease phenotype labels as defined by the MIMIC benchmark [27]. This dataset has an average of 1.26 visits per record and 15.11 codes per visit, with 6,841 unique ICD-9 codes. Note that this includes patients with just a single visit (and, as we will show, HALO's Code-Level Module allows it to be very effective on those patients). Both datasets share the same patient representation as a series of visits along with chronic disease phenotype labels. We keep the ICD codes in the data without code aggregation or removing any infrequent codes.
Experiment setup: We use a 0.8-0.2 training-test split with an additional 0.9-0.1 training-validation split during training for both outpatient and inpatient datasets. We use the Adam optimizer with a learning rate of 1e-4 (arrived at through experimentation). We use a batch size of 48 and train for 50 epochs, saving the model with the lowest loss on the validation set. We implement and train the model in Python 3.6.9 using the PyTorch 1.9.0+cu111 framework [28] along with the scikit-learn 0.24.2 and NumPy 1.17.2 packages. Finally, all experiments are run on one NVIDIA Tesla V100 GPU with 32 GB RAM. The HALO source code is publicly available on GitHub at https://github.com/btheodorou99/HALO_Inpatient.
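The nested split described above can be sketched in a few lines. This is an illustrative example, not the authors' code: the `split_records` helper, the fixed seed, and the use of shuffling before splitting are assumptions.

```python
import random


def split_records(records, seed=0):
    """Split records 0.8/0.2 into train/test, then 0.9/0.1 into train/validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    n_test = len(shuffled) // 5          # 20% held out for testing
    test, train_val = shuffled[:n_test], shuffled[n_test:]

    n_val = len(train_val) // 10         # 10% of the remainder for validation
    val, train = train_val[:n_val], train_val[n_val:]
    return train, val, test


train, val, test = split_records(range(1000))
print(len(train), len(val), len(test))  # 720 80 200
```

With this scheme, 72% of all records are used for gradient updates, 8% for selecting the checkpoint with the lowest validation loss, and 20% for the final test-set evaluations.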

Baseline Methods
Below we outline the baseline methods and the necessary alterations to those baselines to adapt to our problem setting.
• HALO − Coarse: This baseline is an ablation consisting of just the coarse, visit-level granularity module of the full HALO architecture. It generates each code probability based on all previous visits (grouped into a multi-hot representation) but without the fine, intra-visit modeling, such that $P(c_t^i)$ is modeled by $P(c_t^i \mid \mathbf{v}_1, \dots, \mathbf{v}_{t-1})$. It consists predominantly of 12 transformer decoder blocks in the style of [29], augmented to support multi-hot as opposed to one-hot inputs and outputs within the embedding layer and final activation layer.
• GPT Model [29]: We applied the GPT model without any augmentation to support multi-hot inputs and outputs, instead converting EHRs to a fully one-hot sequential representation. However, this model had to be shrunk from 12 blocks to 3 to fit into memory because the one-hot representation greatly expands the length of the sequences.
• LSTM EHR Model [30]: is a deep, autoregressive LSTM model, adapted to generate structured patient records rather than unstructured text as it had previously been utilized, which is directly analogous to the HALO − Coarse model but uses LSTM blocks instead of transformer decoder blocks.
• SynTEG [15]: is a GAN-based model that uses a transformer and LSTM-based encoder model to generate embeddings of EHRs up to a given visit before feeding those embeddings into a conditional GAN which generates the next visit.
• EVA [16]: is a VAE-based model which uses a bidirectional-LSTM encoder and CNN-based decoder (using deconvolutions to expand the latent encoding to the proper temporal dimension and then masked, dilated 1D convolutions to build the records in an autoregressive manner). The only change we made was to convert the output from one-hot code combinations to multi-hot code probabilities to allow for greater representative power.
Figure 2: The data formatting. a) The visit representation. Each visit is represented as a multi-hot vector containing indices for medical codes, static label codes to cover demographics and disease phenotypes, and special codes describing the shape and temporal ordering of the patient's visit. b) The EHR representation. An EHR is then represented as a matrix constructed as a series of temporally ordered visit vectors.

Evaluating EHR Language Modeling
The first evaluation is conducted by predicting the probabilities and outputs of the test set. In this phase, we assess the performance of HALO against two multi-hot language model baselines, namely HALO − Coarse and LSTM. These baselines explicitly generate a probability distribution without accessing the entire input. Other baseline models, such as the GAN-based SynTEG model, the VAE-based EVA model, and the GPT model, cannot be directly compared in this task because those methods do not make a single probability prediction for each code within the visit. This evaluation assesses the capability of the models to predict the presence of potential medical codes, given a patient's past medical history and the previous codes from the current visit. Note that we explored different orderings of codes (such as most to least prevalent, alphanumeric, random, etc.) but found no noticeable differences; we display the results of that exploration in our supplementary information and settle on a random ordering throughout our experiments. This evaluation is crucial in showcasing a model's ability to learn patterns from the patient population and its potential to perform well in various patient simulation and extension applications. We show the results in Table 1, where we see that HALO outperforms the two compared language model architectures. Upon closer examination, we observed that the LSTM baseline model struggled with the complexity and size of the outpatient EHR dataset, while our proposed model HALO performed comparably to the HALO − Coarse ablation baseline. In contrast, in the inpatient EHR setting, where records contain fewer visits but each visit contains more codes, HALO's multi-granularity approach proved to be highly effective. Specifically, the model achieved a notable 4% reduction in binary cross-entropy (BCE) loss and a 17% increase in F1 score on test data when compared to the single-granularity HALO − Coarse model. Notably, both HALO models significantly outperformed the LSTM baseline in this setting. These results highlight the significant value of our multi-granularity approach in handling the complex and diverse nature of medical codes in different EHR settings.
Additionally, we present perplexity, which evaluates the probability or likelihood of the test set as quantified by a model trained on the training set, normalized by the unit of consideration that we are interested in. In our case, this normalizing unit is the number of medical codes in a patient's medical record (or equivalently the number of ones in $\mathbf{R}$). Perplexity is a metric found commonly in the wider generative modeling domain, especially on the task of natural language generation (e.g. [29]). We introduce it to the task of synthetic EHR generation here. Perplexity is defined mathematically by
$$PP(D) = \sqrt[N]{\frac{1}{P\left(\mathbf{R}^{(1)}, \mathbf{R}^{(2)}, \dots, \mathbf{R}^{(|D|)}\right)}}$$
where $D$ is the test dataset, $\mathbf{R}^{(n)}$ is the $n$-th record in $D$, and $N$ is the normalizing count (the total number of medical codes present in $D$). In practice we calculate the values by summing their log probabilities, using the equivalent form
$$PP(D) = \exp\left(-\frac{1}{N} \sum_{n=1}^{|D|} \log P\left(\mathbf{R}^{(n)}\right)\right)$$
The normalized value then also corresponds to how many of the different normalizing units (medical codes) one would have to randomly pick between on average to achieve the same probability.
Table 1: We include each of our autoregressive, predictive, and likelihood-based models. The bold value denotes the best results. Baseline methods SynTEG, EVA, and GPT are all omitted here because they either do not produce a probability distribution, peek at the outputs, or utilize a different, non-comparable data representation. HALO outperforms both of the baselines, achieving up to a 4% decrease in test set BCE loss, a 17% increase in F1 score, and a 13% lower perplexity per present code as compared to the leading HALO − Coarse baseline. Source data are provided as a Source Data file.
The results of the perplexity evaluation are shown in Table 1 as well. We see similar results as with the classification evaluation. Both HALO and HALO − Coarse perform very well on the outpatient EHR dataset (with HALO performing slightly better), while the LSTM baseline struggles. On the inpatient EHR dataset, HALO easily outpaces both baseline methods, producing a 13% lower perplexity per present code than the HALO − Coarse architecture, which lacks the intra-visit modeling. Thus, in both of these test set evaluations, we see that HALO is much more effective at modeling the underlying distribution of EHRs.
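The log-probability form of the perplexity calculation can be sketched as below. The function name and the toy inputs are assumptions for illustration; in practice the per-record log probabilities would come from a trained model evaluated on the test set.

```python
import math


def perplexity_per_code(record_log_probs, codes_per_record):
    """Perplexity normalized by the number of present medical codes.

    record_log_probs: natural-log probability of each test record under the model.
    codes_per_record: number of codes present (ones in the matrix) in each record.
    """
    total_log_prob = sum(record_log_probs)
    total_codes = sum(codes_per_record)
    return math.exp(-total_log_prob / total_codes)


# Toy check: two records with 3 and 5 codes, where every code contributes a
# probability of 0.25, should give a perplexity of 4 per present code.
log_probs = [3 * math.log(0.25), 5 * math.log(0.25)]
ppl = perplexity_per_code(log_probs, [3, 5])
print(ppl)
```

The toy case matches the interpretation in the text: a perplexity of 4 means the model is, on average, as uncertain as if it were picking uniformly among four options for each present code.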

Statistical Similarity to real EHRs
The second analysis evaluates the statistical similarity of the generated and real data. For each method, we generate a synthetic dataset of the same size as the training dataset. We then compare the unigram and bigram (both within the same visit and across consecutive visits) probabilities for each unique code and pair of codes within the real and synthetic datasets.
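A minimal sketch of this comparison is shown below for unigram and same-visit bigram probabilities. The helper names and toy records are hypothetical, probabilities are normalized per visit, and $R^2$ is computed as the coefficient of determination of the synthetic probabilities against the real ones; the exact normalization and $R^2$ variant used in the paper are assumptions here.

```python
from collections import Counter
from itertools import combinations


def unigram_probs(records):
    """Per-visit probability of each code across a dataset of records."""
    counts, n_visits = Counter(), 0
    for record in records:
        for visit in record:
            counts.update(set(visit))
            n_visits += 1
    return {code: c / n_visits for code, c in counts.items()}


def same_visit_bigram_probs(records):
    """Per-visit probability of each unordered code pair occurring together."""
    counts, n_visits = Counter(), 0
    for record in records:
        for visit in record:
            counts.update(combinations(sorted(set(visit)), 2))
            n_visits += 1
    return {pair: c / n_visits for pair, c in counts.items()}


def r_squared(real, synthetic):
    """Coefficient of determination of synthetic vs. real probabilities."""
    keys = sorted(set(real) | set(synthetic))
    y = [real.get(k, 0.0) for k in keys]
    y_hat = [synthetic.get(k, 0.0) for k in keys]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy datasets: each record is a list of visits, each visit a list of codes.
real = [[["A", "B"], ["A"]], [["B", "C"]]]
synthetic = [[["A", "B"], ["B"]], [["A", "C"]]]

real_uni = unigram_probs(real)
print(real_uni)
print(same_visit_bigram_probs(real))
print(r_squared(real_uni, unigram_probs(synthetic)))
```

Sequential bigrams across consecutive visits can be counted the same way by pairing each code in one visit with each code in the next.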

Statistical comparison results:
We evaluate the data at the visit and record level, considering approximately 10,000 individual codes and over a million bigram codes. We also compare various aggregate statistics, such as the number of visits per record, medical codes per visit, and prevalence of chronic disease labels. The code probability results are presented in Figure 3, and the aggregate statistics in Table 2. Additionally, we provide $R^2$ values for visit-level normalized code probabilities in our high-dimensional outpatient EHR dataset and a lower-dimensional setting. The details can be found in Table 3.
Furthermore, an interactive visualization of 1,000 randomly selected code-level disease prevalence comparisons between our method and real data is accessible at https://vega.github.io/. It allows zooming, panning, and hovering over points for specific disease names. Finally, we provide chronic disease label probabilities, full visit-level code probability plots, probability densities underlying the aggregated statistics, and a discussion of the various failure modes of our baseline methods for that evaluation in our supplementary information. HALO again outperforms the baseline methods in each evaluation.

Key findings:
We observe that the language model architectures (GPT, LSTM, HALO − Coarse, and HALO) can model both the shape of the synthetic records and the temporal dependencies much better on average than the VAE and especially GAN-based baselines. The one exception is the GPT baseline, which struggles to stop record generation on the complex outpatient EHR dataset. As is common for language models in the text generation domain, its overall quality decays for long sequences, and the lack of visit-level grouping in its data representation makes its sequences considerably longer. While each of the compared methods models the unigram code probabilities relatively well, the temporal modeling is better shown in the overall synthetic record and visit lengths, the generation of chronic disease labels, and the sequential bigram evaluation. SynTEG, EVA, and the LSTM baseline struggle with these evaluations (with the LSTM baseline struggling largely due to its overall weakness).
The LSTM and HALO − Coarse language model baselines then falter with respect to same-visit bigram probabilities due to their lack of intra-visit dependency modeling, while the GPT baseline, which models each code individually and so offers that intra-visit modeling, maintains relatively stronger performance there. HALO combines and builds on each baseline's strengths without inheriting their weaknesses, using the compact multi-hot representation to offer a powerful model that does not struggle with any length or feature of data while simultaneously maintaining the intra-visit modeling in an even more powerful and structured way. As such, it can best maintain performance in this high-dimensional setting and produces state-of-the-art results that closely model the true training data in all settings, including record and visit lengths, label probabilities, and all combinations of code probabilities. This signifies that HALO is capable of generating data that looks realistic.
Table 3: The values are $R^2$ values measuring the correlations of the three types of code probabilities for different synthetic datasets against the training data in both high-dimensional and low-dimensional settings. Bold values denote the best results. Although the results showed a drop in performance for each method in the high-dimensional setting, HALO was able to maintain strong performance with minimal decline. Overall, our proposed method achieved state-of-the-art performance, outperforming the baselines in both bigram evaluations in low- and high-dimensional settings. Source data are provided as a Source Data file.

Accurate Disease Phenotyping Using Synthetic EHRs
The final evaluation explores the utility of the synthetic datasets for training disease classifiers. To this end, we utilize two different synthetically-supplemented data setups and machine learning classifiers to predict chronic disease labels based on patients' visits. In each of the two setups, we use a simple bidirectional LSTM with a single-layer fully connected head classifier to predict chronic disease label(s) based on a patient's visits.
Accurate Disease Phenotyping: In the first data setup, we assess model performance in real-world scenarios using synthetic training data. We conduct experiments for each of the 11 chronic disease labels in the outpatient EHR dataset, sourced from the Centers for Medicare and Medicaid Services and the SynPUF dataset [31]. Additionally, we perform experiments for each of the 25 chronic diseases in the inpatient EHR dataset, based on the benchmark proposed in [27].
For each chronic disease, we randomly extract 2,500 records for training, ensuring a balanced distribution of positive and negative labels (50-50). This process is repeated across our six synthetic datasets (one for each method) and one real training dataset, resulting in a total of seven balanced training datasets. The selected size of 2,500 records strikes a balance between having enough training data for machine learning models and maintaining sufficient positive labels for each disease.
We train classifiers on each of these datasets and select the best model for each dataset using a validation set of 250 records, equally representing both classes. Finally, we evaluate the models on test sets consisting of 500 records, equally representing both classes, from the original test set comprising real patient data.
We display the average accuracy and F1 score for each synthetic dataset from each of the compared models as well as the real training data across each of the chronic disease labels in Table 4. Note that we provide the standard deviations of each metric in either table as well, but most of that deviation stems from differences between tasks rather than inconsistent performance within each model.
We provide a full set of results by chronic disease label and also additional synthetically augmented outpatient results in our supplementary information. In both datasets, we can see that the synthetic data from GPT, HALO − Coarse, and HALO largely maintain the performance of real data and offer large improvements over the SynTEG, EVA, and LSTM baselines. HALO's synthetic data offers the best prediction results.
Phenotyping of Rare Conditions: We evaluate the utility of synthetic EHR data in identifying uncommon conditions. We created a highly imbalanced dataset of patients labeled with the cancer chronic disease label from the outpatient EHR dataset. The dataset comprised 50,000 EHR records without the cancer chronic disease label and only 1,000 records with the label. Using this imbalanced data, we trained a classifier and compared its performance to classifiers trained on balanced datasets. For balancing, we added 49,000 positively labeled synthetic records and also used another classifier trained on a dataset balanced using real records.
The evaluation results are summarized in Table 5. Notably, HALO outperformed all baselines, exhibiting significant improvements on the original unbalanced dataset as well as the synthetically augmented datasets. It approached the upper bound performance of the ideal balanced dataset.
This experiment underscores the potential of synthetic EHR data in supporting the identification of uncommon conditions.

Realistic Continuous Variables in Synthetic EHRs
We conclude with a brief exploration to demonstrate the viability of our discretized representation of continuous values, and HALO's effectiveness in using it to model those variables. We build new training datasets including visit gaps in the outpatient EHR dataset and lab values in the inpatient EHR dataset. We use these datasets to train new versions of our model and generate new synthetic datasets of 250,000 and 45,000 records, respectively.
We then show that the distributions of those variables match the real values. In Figure 4a and Figure 4b, we show that HALO accurately replicates the distribution of gaps between patient visits and the pattern of shorter gaps for longer records, respectively. These nuanced patterns are captured on top of the aggregate mean gaps being very similar as well. There are 33.53 days between visits on average within the real outpatient EHR data and 35.77 days on average for HALO's synthetic data.
Using the inpatient dataset, we then demonstrate that HALO replicates not only the presence (in Figure 4c) but also the average values (in Figure 4d) of performed lab tests. Specific labs included (corresponding to points in those two plots) are listed in our supplementary information. Overall, HALO's approach to continuous variables is effective, and it has the potential to generate comprehensive synthetic patient records with multiple variables of different types.
Table 5: We present the classification results on the test set for the simulated rare-disease detection task. We compare models trained on datasets balanced using each synthetic dataset against models trained on the original imbalanced data (representing the rare disease dataset). Additionally, we compare the results against an upper-bound ideal dataset balanced using real data. The best results are highlighted in bold. Among the evaluated models, EVA and SynTEG exhibit limited utility, while the language model architectures LSTM, GPT, and HALO − Coarse offer substantial value. HALO achieves state-of-the-art performance, closely approaching the results of a true, balanced dataset. The source data can be found in the provided Source Data file.

Privacy Evaluation of Synthetic EHRs
In addition to demonstrating the high fidelity of synthetic EHRs generated by HALO, we want to ensure that the privacy of the patients within the original training dataset is protected. To that end, we conduct a commonly used membership inference attack to test its identification risk, and we provide the results of two more evaluations in our supplementary information.

Membership Inference Attack:
This evaluation measures the ability to thwart a membership inference attack. These attacks aim to determine whether any specific real patient record was used in the training dataset to generate the synthetic records. Membership inference attacks are a well-known privacy test in the field of synthetic EHR generation, and addressing them is crucial to ensure the privacy and confidentiality of patient identities.
To demonstrate that HALO is not susceptible to such an attack, we show that we can prevent two different attempts at a membership inference attack, based on the synthetic data generator and the synthetic dataset itself. We generate an attack dataset by first selecting 100,000 records from each real dataset used for training and assigning them a positive label. Then we select 100,000 records from the remaining records not used for training as the negative label set.
Next, we conduct two attacks:
• In the Model Attack, we label the 100,000 records with the highest log probability from the model as positive, predicting that they were part of the training dataset.
• In the Dataset Attack, we label the 100,000 records with the lowest Hamming distance to the closest record in the synthetic dataset as positive. We pick Hamming distance (equivalent to Manhattan distance in our binary setting) as our distance metric between patient records throughout our privacy evaluations in accordance with [32], but any distance metric could be substituted.
These two attacks allow us to test the ability of the synthetic dataset to prevent an attacker from inferring whether a real record was used in the training dataset.
Figure 4: ... b) The mean visit gap, as per visit number, across both real and synthetic datasets reveals that HALO accurately captures the pattern of patients with many records, showing shorter gaps in their subsequent visits. c) Lab presence probabilities: The probability of binary lab presence demonstrates that HALO accurately generates lab variables, even when discretized across multiple variables. d) Mean lab values: The average value of labs, when present, confirms that HALO's synthetic labs closely resemble those of the real dataset. Values in parentheses are $R^2$.
We show the results of the classifications from the attacks in Table 6. The accuracy of both attacks on both datasets is approximately 50%, which is similar to a random guess. This shows that neither the model nor the synthetic dataset reveals any meaningful or compromising information about the patient identity in the training dataset. We also perform the dataset attack with each of our baseline datasets and find that each similarly thwarts it, with attack accuracy of around 50%. Note that we do not perform the model attack with the baseline models because most of them cannot output a probability for an input patient record, and the dataset-based attack is the standard one used throughout the literature in this domain.
Beyond the membership inference attack, we also show that HALO passes attribute inference attack and nearest neighbor adversarial accuracy [33] evaluations in our supplementary information.

For each record in the attack dataset, we find both the log probability of the record from the trained model (Model Attack) and the Hamming distance to the closest record in the synthetic dataset (Dataset Attack). The attacks then label the half of the records with the highest probability or lowest distance, respectively, as in the training set. We see that the accuracy for either attack is right around 50%, which is similar to a random guess. This indicates that the synthetic dataset and the model do not reveal any patient-identifying information about the original training datasets. We also find that each baseline synthetic dataset similarly thwarts the dataset attack. Source data are provided as a Source Data file.

DISCUSSION
In this paper, we proposed HALO, a method for generating high-dimensional synthetic longitudinal EHR data. Our method is specifically designed to handle the sequential, multi-granular, and high-dimensional nature of electronic health records by generating an explicit probability distribution over the codes, visits, and records, and HALO can generate realistic data without needing to aggregate or remove any codes, as past approaches have unanimously done. We then showed that HALO can produce highly realistic synthetic EHR data. Specifically, we showed that HALO can capture the probability distribution underlying the records better than other language model baselines and then produce a synthetic dataset that both looks similar to and offers the utility of real patient records, as measured by medical code occurrence probabilities and machine learning classification tasks augmented with synthetic data. Finally, we also showed that our method offers this performance without compromising privacy through several privacy evaluations.
In conclusion, one of the key advantages of HALO is its ability to generate binary sequences that are over a million variables in length. Its impressive performance makes it a promising avenue for developing and sharing realistic but synthetic EHR datasets that can support diverse applications. This represents an exciting opportunity to expand the use of synthetic data in the healthcare field and could help to address some of the challenges associated with data privacy and security.
While we have shown the impressive performance of HALO in producing high-quality, high-fidelity, and privacy-preserving synthetic data, we now briefly discuss some remaining limitations. First, the architecture is modeled after a large language model. While the multi-modal setup allows the model to condition on more patterns per data point and learn more efficiently, our high-performing generator still requires relatively large training datasets, which might not be available in some settings.
Another important aspect of our model is that it generates synthetic records through a probabilistic process. While it learns real-world patterns during training, there is still a chance that some generated records may not be clinically meaningful. However, this risk can be mitigated through postprocessing with clinical rules that validate the synthetic records. If our model is deployed in the real world, it is important to consider implementing such postprocessing steps to ensure that only clinically relevant synthetic records are produced.
Finally, our HALO model focuses on generating longitudinal EHR data, such as medical codes and lab results. However, other crucial data modalities, such as clinical notes and medical images, are not yet covered by the model. To generate fully comprehensive patient records that include all modalities, it will be necessary to use diverse training data and develop multiple models to handle each modality. This exciting avenue of research is a promising future direction.

METHODS
Our study was granted exempt status from Institutional Review Board (IRB) approval. This study has been found to be exempt pursuant to 45 CFR 46.104(d)(4) "Secondary research for which consent is not required: Secondary research uses of identifiable private information, if (i) The identifiable private information is publicly available; AND (ii) Information is recorded by the investigator in such a manner that the identity of the human subjects cannot readily be ascertained directly or through identifiers linked to the subjects, the investigator does not contact the subjects, and the investigator will not re-identify subjects."

Background and Related Work
Of all the EHR generation methods, rule-based approaches, such as Synthea [34] or SynPUF [31], have proven to be the most effective in delivering practical value. These simple approaches offer de-identification of real records by combining data across multiple patients in a sufficiently privacy-preserving way [31], simulation of patients within a complex yet constrained rule-based system [34], Bayesian probabilistic modeling of aggregated, non-temporal patient records [35], or proprietary methods without detailed explanation [36][37][38]. Many of these systems can only produce synthetic patient data with limited realism and utility. We focus instead on ML methods that have the potential to generate realistic high-dimensional synthetic patient data.

GAN-based Methods
Many synthetic data generation methods use Generative Adversarial Networks (GANs), which involve a generator that creates realistic data and a discriminator that decides if the data is real or fake [39]. GANs were first applied to patient record generation in [19], followed by many other GAN-based approaches [15,17,18,20-24,40]. However, GANs have limitations when generating sequential data like EHRs. They usually produce only a single output with no temporal connections, so most EHR generation methods aggregate EHR data into one time step [22][23][24], create a representation of EHR data [18], or do both [19,20].
GANs also struggle with high-dimensional and sparse data like real-world EHRs, limiting all existing synthetic EHR GAN approaches to producing relatively low-dimensional data through the aggregation of visits and medical codes or the removal of rare codes. For example, there are a few methods in this category which do generate longitudinal data. LongGAN [40] and EHR-M-GAN [21] both focus only on dense lab time series of under a hundred dimensions. CorGAN [17] generates records with 1,071 distinct codes, and the current state-of-the-art GAN approach that we baseline against, SynTEG [15], both combines and removes rare codes before arriving at a final dimensionality of 1,276.
While GANs have the potential to be conditioned on external factors and labels, such as demographics or disease phenotype labels, the ability to do so has not been extensively explored in existing works on EHR generation. Moreover, there are only a limited number of approaches that can generate synthetic EHR data tailored to specific diseases. For example, SmoothGAN [24] focuses on aggregated lab and medication information and does not model individual visits; EHR-M-GAN [21] offers conditional and sequential capabilities, but for low-dimensional (under 100 dimensions) lab time-series information; CONAN and MaskEHR [18,41] model only a single rare-disease population for data augmentation; and EMR-WGAN and HGAN [22,23] can only model low-dimensional (both under 1,000 dimensions) aggregated EHRs.
Deep Sequential Methods
Accurately modeling the longitudinal nature of EHRs is crucial for realistic EHR generation. In recent years, two methods have shown progress in generating sequential EHRs by using either a GAN or a VAE to condition on representations of past patient visits to generate current visits [15,16]. Specifically, SynTEG [15] models the time between visits, and EVA [16] offers a conditional variant. In our experiments, we compare HALO to these two models. However, both SynTEG and EVA often need to perform preprocessing steps to reduce the dimensionality of the vocabulary by aggregating medical codes and removing rare codes.
Language Models
Our objective is to develop an improved method for generating realistic and high-dimensional EHR data by drawing inspiration from natural language generation. Language generation models predict the next word based on the preceding words, thereby learning a probability distribution over language. Similarly, EHR models predict the next visit based on past visits. In addition, our proposed method provides an explicit probability output that allows for direct modeling and evaluation of the underlying data distribution. This approach is particularly beneficial in accurately capturing the complex and high-dimensional nature of EHR data.
The Transformer architecture, introduced in [42], has revolutionized natural language processing and enabled the development of large, attention-based models like BERT [43] and GPT [26,29,44]. Among these models, we draw inspiration from GPT, which relies on a stack of Transformer decoder blocks that use masking to predict the next set of probabilities in parallel, allowing for fast training and scalability. However, applying language models directly to EHR data poses unique challenges. Unlike natural language sequences, EHR data exhibits a hierarchical structure that must be captured, with medical codes associated with specific patient visits, and visits associated with individual patients. Additionally, EHR data contains heterogeneous elements, including demographic variables, structured medical codes, and numeric lab measures, not all of which are discrete tokens. Addressing these challenges requires approaches that leverage the strengths of language models while adapting them to the peculiarities of EHR data.

Hierarchical Autoregressive Language Model (HALO)
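We model the probability of a patient record $\mathcal{R}$, $P(\mathcal{R})$, via a hierarchical autoregressive model, which utilizes both the visit- and code-level structures of a patient record. First, it factorizes the probability along the visit level using the autoregressive identity

$$P(\mathcal{R}) = P(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T) = P(\mathbf{v}_1)\, P(\mathbf{v}_2 \mid \mathbf{v}_1) \cdots P(\mathbf{v}_T \mid \mathbf{v}_1, \ldots, \mathbf{v}_{T-1})$$

to produce what we call our coarse autoregressive sequence. We then continue to factorize the probability of visits further along the code level by converting each $P(\mathbf{v}_t \mid \mathbf{v}_1, \ldots, \mathbf{v}_{t-1})$ into

$$P(c_t^1 \mid \mathbf{v}_1, \ldots, \mathbf{v}_{t-1})\, P(c_t^2 \mid \mathbf{v}_1, \ldots, \mathbf{v}_{t-1}, c_t^1) \cdots P(c_t^{|\mathcal{C}|} \mid \mathbf{v}_1, \ldots, \mathbf{v}_{t-1}, c_t^1, \ldots, c_t^{|\mathcal{C}|-1}),$$

which we call our fine autoregressive sequence. This final probability is then rewritten as the product

$$P(\mathcal{R}) = \prod_{t=1}^{T} \prod_{i=1}^{|\mathcal{C}|} P(c_t^i \mid \mathbf{v}_1, \ldots, \mathbf{v}_{t-1}, c_t^1, \ldots, c_t^{i-1}) \tag{6}$$

where the probability of each code is based on each of the previous visits and each of the previous codes in the current visit. Our multi-granularity approach enables the modeling of high-dimensional sequences of many binary variables per record. This is achieved by grouping prior information into significantly fewer multivariate time steps for previous visits, while retaining the full autoregressive modeling capability for each current visit. Our HALO architecture is designed to reflect this powerful yet compact model, with an efficient structure divided into two distinct granularity levels: visit level and code level. This allows each code to be conditioned on all previous visits and the past codes of the current visit.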
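To make the hierarchical factorization concrete, the following minimal sketch evaluates the log probability of a toy record as a sum of per-code conditional log probabilities and checks that the resulting probabilities sum to one over all possible records. The function `conditional_prob` is a hypothetical stand-in for a trained model (it is not HALO), and all names here are assumptions introduced for illustration.

```python
import itertools
import math


def conditional_prob(prev_visits, prev_codes):
    """Hypothetical stand-in for a model: P(code = 1 | previous visits, previous codes)."""
    history = sum(sum(v) for v in prev_visits)
    logit = 0.5 * history + 0.8 * sum(prev_codes) - 1.0
    return 1.0 / (1.0 + math.exp(-logit))


def record_log_prob(record):
    """log P(R): sum over visits t and codes i of log P(c_t^i | v_1..v_{t-1}, c_t^1..c_t^{i-1})."""
    total = 0.0
    for t, visit in enumerate(record):
        for i, code in enumerate(visit):
            p = conditional_prob(record[:t], visit[:i])
            total += math.log(p if code == 1 else 1.0 - p)
    return total


def total_probability(num_visits, num_codes):
    """Sum P(R) over every possible binary record of the given shape."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=num_visits * num_codes):
        record = [list(bits[t * num_codes:(t + 1) * num_codes]) for t in range(num_visits)]
        total += math.exp(record_log_prob(record))
    return total
```

Because every factor is a valid Bernoulli probability conditioned only on earlier variables, `total_probability(2, 3)` equals 1 up to floating-point error, which is the property that lets the model assign an explicit probability to any record.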

Visit-Level Module
We begin with the coarse, visit-level granularity. We use a stack of $M$ transformer decoder blocks, which have been shown to be effective in the high-dimensional domain of natural language processing, to generate a sequence of visit-level histories, where the $t$-th element in the sequence, $\mathbf{h}_t^{(M)} \in \mathbb{R}^{n_{emb}}$, is an embedding that represents all of a patient's medical history through their $t$-th visit. Those histories then combine to form $\mathbf{H}^{(M)} \in \mathbb{R}^{(T+3) \times n_{emb}}$ (where the 3 in $T+3$ includes the start, label, and end visits), the output of the first module, which serves the purpose of the visit-level priors $\mathbf{v}_1, \ldots, \mathbf{v}_{t-1}$ in Equation 6.

To encode each of the multi-hot visit representations $[\mathbf{v}_1 \cdots \mathbf{v}_T]$ into a fixed-length vector in $\mathbb{R}^{n_{emb}}$, we employ an embedding layer that includes two trainable parameter matrices: a code embedding matrix $\mathbf{W}_e$ and a positional embedding matrix $\mathbf{W}_p$. The code embedding matrix maps each visit code to a dense vector representation, while the positional embedding matrix captures the relative position of each visit in the sequence. Next, we use a decoder model consisting of $M = 12$ transformer decoder blocks to generate a series of visit history representations, which summarize the information contained in all previous visits in the coarse, visit-level sequence. The transformer decoder blocks employ masked multi-head self-attention, which allows the model to attend to all previous visits while preventing information leakage from future visits. This process is written more formally as

$$\mathbf{H}^{(0)} = \mathbf{R}\mathbf{W}_e + \mathbf{W}_p, \qquad \mathbf{H}^{(m)} = \text{transformer\_block}\left(\mathbf{H}^{(m-1)}\right) \quad \forall m \in \{1, \ldots, M\}$$

where $\mathbf{R} \in \mathbb{R}^{(T+3) \times |\mathcal{C}|}$ is the patient record matrix representation, $\mathbf{W}_e \in \mathbb{R}^{|\mathcal{C}| \times n_{emb}}$ is the code embedding matrix, $\mathbf{W}_p \in \mathbb{R}^{(T+2) \times n_{emb}}$ is the positional embedding matrix (to recapture the position and order of the sequence of visits), and each transformer block is based on a decoder block from the original transformer architecture [42], which we describe in more detail in our supplementary information.

Thus, having processed the multi-hot patient visits through the initial, coarse visit-level module of our architecture, we obtain a sequence of visit history representations $\mathbf{H}^{(M)}$, which capture the collective information of all previous visits up to each time step. These representations provide a compressed summary of the patient's visit history, enabling downstream modules to make predictions based on the patient's medical trajectory.
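As a small illustration of the embedding step in the visit-level module, the sketch below shows that multiplying a multi-hot visit by a code embedding matrix amounts to summing the embeddings of the codes present in that visit, after which a positional embedding is added. It also builds the causal pattern that masked self-attention enforces. This is a pure-Python sketch with hypothetical helper names and toy dimensions, not the model's implementation, and it omits the transformer blocks themselves.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def embed_visits(record, code_emb, pos_emb):
    """Row t = (sum of embeddings of the codes present in visit t) + positional embedding t."""
    summed = matmul(record, code_emb)
    return [[s + p for s, p in zip(summed[t], pos_emb[t])] for t in range(len(record))]


def causal_mask(num_visits):
    """mask[t][s] is True when visit t may attend to visit s, i.e. s <= t."""
    return [[s <= t for s in range(num_visits)] for t in range(num_visits)]
```

For example, a visit containing codes 0 and 2 is embedded as the sum of rows 0 and 2 of the code embedding matrix, so the embedding size stays fixed no matter how many codes a visit contains.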
Code-Level Module
However, we still need to add in the code-level priors and generate output probabilities. To construct the input for the fine, code-level module, we offset and concatenate the previous module's visit history embedding outputs with the original record input, $\mathbf{R}$. Specifically, we append the first $T+2$ visit histories with the last $T+2$ visit representations to create $\mathbf{H}'^{(0)}$. Each of the $T+2$ inputs in $\mathbf{H}'^{(0)}$ has a representation of the history of all the previous visits and the codes of the current visit, mirroring both the visit and code priors in Equation 6. The final input representation $\mathbf{H}'^{(0)}$ has size $\mathbb{R}^{(T+2) \times (n_{emb} + |\mathcal{C}|)}$.

To model the distribution of each $P(c_t^i)$, this $\mathbf{H}'^{(0)}$ is then fed through $N = 2$ masked linear layers, which maintain the same dimensionality and use upper triangular masking of the weight matrix to preserve the autoregressive property of the probabilities, with a ReLU activation function between layers. These linear layers are able to efficiently model the high-dimensional, intra-visit patterns where other sequential approaches, such as additional recurrent or transformer modules, would run out of memory. The probabilities are generated formally by

$$\mathbf{H}'^{(n)} = \text{masked\_linear}\left(\mathbf{H}'^{(n-1)}\right) \quad \forall n \in \{1, \ldots, N\}, \qquad \mathbf{O} = \mathbf{H}'^{(N)}[:, n_{emb}:]$$

where the submatrix indexing at the end removes the visit-level history embedding portions of each vector to extract just the code probabilities, and the masked linear layers are achieved by

$$\mathbf{H}'^{(n)} = \max\left(0,\; \mathbf{H}'^{(n-1)} \left(\mathbf{W}^{(n)} \odot \mathbf{M}\right) + \mathbf{b}^{(n)}\right)$$

where the max function is omitted for the final fine layer (sigmoid is used instead), $\odot$ is element-wise matrix multiplication, $\mathbf{M} \in \mathbb{R}^{(n_{emb} + |\mathcal{C}|) \times (n_{emb} + |\mathcal{C}|)}$ is the upper triangular masking matrix (with ones in the upper triangular portion and zeros in the lower portion) to preserve the autoregressive property, and $\mathbf{W}^{(n)} \in \mathbb{R}^{(n_{emb} + |\mathcal{C}|) \times (n_{emb} + |\mathcal{C}|)}$ and $\mathbf{b}^{(n)} \in \mathbb{R}^{n_{emb} + |\mathcal{C}|}$ are the trainable parameters of the module. The output $\mathbf{O} \in \mathbb{R}^{(T+2) \times |\mathcal{C}|}$ is then a matrix of probabilities of each code for each visit after the start visit, built from the visit histories and each previous code in the same visit. Each code corresponds to a conditional probability in the product from Equation 6.
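The masked linear layers can be illustrated with a short pure-Python sketch for a single input vector (one visit history embedding concatenated with that visit's codes). The helper names are hypothetical, and one detail is an assumption on our part: the mask below excludes the diagonal, so that output position $j$ depends only on input positions before $j$ and never on the code it is predicting.

```python
import math


def strict_upper_mask(size):
    """mask[i][j] = 1 only when i < j, so output j sees inputs 0..j-1 (assumed strict)."""
    return [[1.0 if i < j else 0.0 for j in range(size)] for i in range(size)]


def masked_linear(x, weight, bias, mask, final=False):
    """y_j = act(sum_i x_i * W_ij * M_ij + b_j); ReLU for hidden layers, sigmoid for the last."""
    out = []
    for j in range(len(bias)):
        z = bias[j] + sum(x[i] * weight[i][j] * mask[i][j] for i in range(len(x)))
        out.append(1.0 / (1.0 + math.exp(-z)) if final else max(0.0, z))
    return out


def code_level_forward(x, layers, mask):
    """Apply each (weight, bias) pair in turn, using the sigmoid only on the final layer."""
    for n, (weight, bias) in enumerate(layers):
        x = masked_linear(x, weight, bias, mask, final=(n == len(layers) - 1))
    return x
```

A quick way to check the autoregressive property is to flip one input position and confirm that every output at or before that position is unchanged. In the full model, the first $n_{emb}$ outputs would then be sliced away, leaving only the code probabilities.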
We train our model using the binary cross-entropy loss function over each medical code (treating the problem as a multi-label classification problem) with masking applied such that the start visit as well as any padded visits (of all zeros) do not contribute to the loss. The architecture of our model is shown in Figure 1.
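The masked training objective can be sketched as follows. This is an illustrative pure-Python version with hypothetical names, in which `visit_mask` marks the visits that should contribute to the loss (0 for the start visit and padded visits). Averaging over the unmasked entries is an assumption made here; the text does not specify how the loss is normalized.

```python
import math


def masked_bce_loss(probs, targets, visit_mask, eps=1e-12):
    """Mean binary cross-entropy over all codes of the visits whose mask entry is 1."""
    total, count = 0.0, 0
    for p_row, y_row, keep in zip(probs, targets, visit_mask):
        if not keep:
            continue  # start visit or padding: contributes nothing to the loss
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0
```

Here each row of `probs` holds the predicted probabilities for one visit and each row of `targets` holds the true multi-hot codes of that visit.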

Additional Features and Considerations
Finally, we discuss different variants and add-on features of HALO.
Conditional Generation
Our method generates electronic health record (EHR) data by using demographics S and chronic disease phenotypes D as labels, which are represented in our label vocabulary and applied to individual visits, as shown in Figure 2. We selected these labels based on their relevance to downstream use cases. Each label is represented as a binary variable in $\mathbf{v}_t$, indicating the presence of the corresponding disease or demographic group indicator. These indicators are defined by concepts such as specific categories of gender, race, ethnicity, age group, and more. We can easily extend this strategy to include other labels of interest, such as various biomarkers, patient outcomes, or even abstract patient embeddings.
Unconditional Generation
Our setup generates electronic health record (EHR) data with conditional labels by incorporating a "label visit" in the data format, as illustrated in Figure 2. This format enables easy generation of labeled and conditional data, which are highly valuable for using synthetic data in machine learning tasks and as an augmentation tool, particularly for rare cohorts. However, it is important to note that this formatting is optional. If desired, the "label visit" component can be removed from the EHR representation, and the architecture can be trained to generate unconditioned EHRs without any modification.
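The record format with an optional label visit can be sketched as below. The vocabulary layout (medical codes first, then label codes, then one start and one end indicator) and the function name are assumptions made for this illustration; the actual special codes and their ordering may differ.

```python
def build_record(visits, labels, num_codes, num_labels, include_label_visit=True):
    """Return multi-hot rows in the order: start, [label visit], visits..., end."""
    start_idx = num_codes + num_labels  # assumed position of the start indicator
    end_idx = start_idx + 1             # assumed position of the end indicator
    width = end_idx + 1

    def row(indices):
        vec = [0] * width
        for idx in indices:
            vec[idx] = 1
        return vec

    rows = [row([start_idx])]
    if include_label_visit:
        # Label codes are offset so they sit after the medical codes.
        rows.append(row([num_codes + label for label in labels]))
    rows.extend(row(visit) for visit in visits)
    rows.append(row([end_idx]))
    return rows
```

With the label visit included, a patient with $T$ visits yields $T+3$ rows; setting `include_label_visit=False` drops that row and gives the unconditional format.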

Generation of Continuous Variables
Our model can generate not only medical codes but also continuous variables, such as lab values and temporal gaps between visits. However, the availability of these additional variables in the generated data depends on their presence in the original dataset used for training. For example, the outpatient EHR dataset used in our study includes the time between visits, while the inpatient EHR dataset includes lab values.
In previous models, continuous values were typically generated using either GANs, which lack the autoregressive probabilistic modeling that we employ, or value predictors (such as time series analysis models), which we often found to produce average values with insufficient variance. To overcome these limitations, we model continuous variables within the healthcare domain by discretizing lab values and temporal gaps into clinically equivalent buckets. The resulting binary variables are included in the model's context, denoted as C, before being converted back to continuous values through random uniform sampling within the corresponding bucket range. By using this approach, our model generates more realistic and diverse continuous variables than previous methods.
More specifically, to generate discrete versions of continuous variables, such as lab values and temporal gaps, we divide the range of each variable into several "buckets", as represented by the values $b_1, \ldots, b_{|b|}$, where $|b|$ refers to the number of buckets required. We determine the bucket ranges by either seeking advice from clinicians on practical ranges, creating granular but equivalent groupings, or using a histogram construction algorithm [45]. The same approach is applied to temporal gaps as well.
For example, the heart rate lab test, with possible values ranging from 0 to 400 beats per minute, could be broken down into twenty different buckets, splitting the overall span into smaller ranges which offer the same medical meaning for all their contained values. This breakdown could have $b_1 = (0, 40)$ and $b_7 = (90, 100)$. These buckets then convert the single continuous variable into many binary variables. Whenever the continuous variable is present in the original EHR, the single variable representing the corresponding bucket is set to 1 with the rest remaining 0. For instance, if a patient has a heart rate lab measurement of 93 bpm in their seventh visit, the seventh of the new heart rate variables would be 1 and the rest would remain 0. If there was no such lab measurement in the visit, they would all be 0.
These new binary variables are added into the wider code vocabulary C and treated in the same way as all of the other medical codes in the vocabulary by our HALO model during learning and generation. After generation, the specific lab values and inter-visit gaps are converted back into continuous values by uniformly sampling from the corresponding bucket range at the very end.
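The discretization round trip described above can be sketched as follows. The bucket edges for heart rate are hypothetical: only the first bucket (0, 40) and the seventh bucket (90, 100) come from the example in the text, and the remaining edges were chosen arbitrarily to give twenty buckets. Treating buckets as half-open intervals is also an assumption of this sketch.

```python
import bisect
import random

# Hypothetical edges giving 20 buckets; only (0, 40) and (90, 100) come from the text.
HEART_RATE_EDGES = [0, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130,
                    140, 150, 160, 170, 180, 200, 250, 300, 350, 400]


def encode(value, edges):
    """One-hot bucket indicator for a present value; all zeros when the value is absent."""
    one_hot = [0] * (len(edges) - 1)
    if value is not None:
        idx = bisect.bisect_right(edges, value) - 1
        idx = min(max(idx, 0), len(one_hot) - 1)  # clamp values on or beyond the outer edges
        one_hot[idx] = 1
    return one_hot


def decode(one_hot, edges, rng=random):
    """Sample uniformly within the active bucket; return None when no bucket is set."""
    if 1 not in one_hot:
        return None
    idx = one_hot.index(1)
    return rng.uniform(edges[idx], edges[idx + 1])
```

Encoding a heart rate of 93 bpm sets the seventh variable, and decoding that indicator returns a value sampled between 90 and 100, so the generated value keeps the bucket's clinical meaning while retaining some variance.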
This discretization allows us to maintain the same powerful and probabilistic modeling process, matching the probabilistic variance of real continuous values in the same way we match the variance of medical code presences. Moreover, by building appropriately granular buckets, we can avoid losing meaningful information and maintain a full representation of a patient. We explore the performance of this approach further in our experiments.

DATA AVAILABILITY
The MIMIC-III inpatient EHR dataset [25] that we use is publicly available and may be downloaded and used freely after completing the required training and applying for access on physionet.org. Furthermore, we also released the synthetic data for each of our compared methods for both the inpatient and outpatient datasets at https://figshare.com/articles/dataset/HALO_Synthetic_Data/23811162. These datasets can then be used to reproduce the results and data statistics.

CODE AVAILABILITY
We make our code for the inpatient dataset experiments, including dataset construction, model building, training, and evaluation, available at https://github.com/btheodorou99/HALO_Inpatient [46]. Between this and the public availability of that dataset, all inpatient results can be fully reproduced. Furthermore, HALO is also included in the open-source machine learning package for healthcare PyHealth [47], where it is available for easy use in concert with various machine learning tasks.

Masked Attention
with M in the final line being a triangular matrix of −∞ values ensuring that the softmax calculation only allows elements in the sequence to attend to themselves and elements before them.

$l_j^{(t)}$: The $j$-th lab value in $V^{(t)}$
The gap between the $(t-1)$-th and $t$-th visits

record shapes, making for much better performance overall. In this low-dimensional setting, some of the weaker baselines are even able to approach or slightly surpass HALO with respect to their per visit unigram code probabilities, demonstrating the strong performance that is showcased in their respective papers. However, those same baselines are then the ones that have the sharpest dropoff in performance when translating to the high-dimensional setting, while HALO, HALO − Coarse, and GPT are able to maintain stronger performance there. Furthermore, those weaker baselines, HALO − Coarse, and to a lesser degree GPT all perform significantly worse in both of the more complex and difficult bigram probabilities even at this lower dimensionality. HALO is able to largely avoid that problem and maintain its performance there. So, HALO achieves very strong unigram code probabilities and state-of-the-art results in the two bigram probabilities by a wide margin in this low-dimensional setting as well.

Comparison to ChatGPT
While we focus on comparing HALO to other leading baselines designed specifically for synthetic EHR generation, we feel that it is also worthwhile to compare its effectiveness to general-purpose large language models such as OpenAI's ChatGPT. Such models have gained a lot of recent renown for offering state-of-the-art performance in a wide variety of tasks. So, we explored ChatGPT's ability to generate patient records in a variety of formats and were impressed by its plausibility and ability to produce a variety of types of records (both in terms of conditional generation and output format). However, it is notable that despite this impressive ability, the generation process is designed to generate realistic responses overall rather than sample from the underlying distribution of patient records (let alone the distribution of records within a specific training dataset). This difference then prevents downstream usage for data analysis or machine learning tasks. To demonstrate the divide, we generate 1,000 patient records by feeding ChatGPT the prompt "Generate a realistic patient record as a sequence of hospital visits with a set of ICD-9 codes at each visit in the form [[Code1, Code2, ...], [Code3, ...], ...]". We then process and compare this synthetic dataset's statistics to those of our inpatient dataset. We show Per Visit code probabilities for both HALO and ChatGPT in Supplementary Figure 3. While the fact that ChatGPT offers any correlation is an impressive feat, it is nonetheless clear that such language models cannot be reasonably compared to synthetic EHR generation models such as HALO for the types of downstream use cases that are desired.

Record Shapes
We presented aggregate statistics regarding the mean number of visits per record and codes per visit for both the real and synthetic datasets in our main paper. We now present the more detailed corresponding probability density plots for each statistic in Supplementary Figure 4. There we see that HALO and HALO − Coarse outperform the other baselines, accurately mimicking the training dataset's shape not just on average but throughout the distribution.
We also present the chronic disease label probabilities for the outpatient dataset in Supplementary Figure 5. There EVA and HALO perform very well, effectively capturing the patterns of those codes found in the second, label visit.

Synthetic Training
Finally, we presented aggregated results across the chronic disease labels for both the outpatient and inpatient EHR datasets in our main paper, as they are more concise and easier to understand. However, for the sake of completeness, we also provide accuracy by individual label for models trained on each of our compared synthetic datasets and the real training dataset in Supplementary Table 6 for our outpatient EHR dataset and Supplementary Table 7 for our inpatient EHR dataset.
We then also provide additional aggregated results on the outpatient data demonstrating the effect of using synthetic data as an augmentation technique to supplement real data. Specifically, we add additional models for each chronic disease classification task trained on the same real data but augmented with each synthetic dataset in turn. We show those aggregated results of mean test set classification performance across the 11 label-based tasks in Supplementary Table 8. These results mirror those from the original setting of replacing real training data with synthetic data, with HALO performing the best. However, here it offers the most gain over real-only training datasets rather than the least dropoff. Thus, we show that HALO's synthetic data can be effectively used as an augmentation technique to produce better results than can be achieved with real data alone.

Additional Privacy Evaluations
In our main paper, we use a pair of membership inference attacks to evaluate the privacy preservation of HALO and our other compared methods. Here we provide two other attacks from the literature and show that HALO thwarts them as well.

Attribute Inference Attack: The first of the two additional evaluations is the ability to thwart a typical attribute inference attack. This attack determines whether the synthetic dataset leaks specific and sensitive patient attributes based on correlations from demographic and other more common, less sensitive attributes of the patient. Consequently, it tests whether the synthetic dataset can be used to learn individual attributes of real patient data.
To demonstrate that HALO is not susceptible to such an attack, we show that it thwarts the nearest neighbor-based attribute inference attack. In this attack, we use subsets of the synthetic dataset and the original training dataset, randomly sampled to match the size of the original test dataset. We define demographic information, chronic disease labels, and the binary presence of the 500 most common medical codes (determined by the training dataset) as the conditional attributes. The sensitive attributes to be identified are the binary presence of the remaining uncommon medical codes.
To conduct the attack, we find the closest patient in the synthetic dataset for each patient in the training set based on having the most shared conditional attributes. We then predict each of the uncommon attributes to be the same as that closest synthetic patient. Those predicted attributes are compared with the ground truth sensitive patient attributes and graded using F1 Score. We then repeat this attack with real patients from the test dataset in place of

Figure 2: The data formatting. a) The visit representation. Each visit is represented as a multi-hot vector containing indices for medical codes, static label codes to cover demographics and disease phenotypes, and special codes describing the shape and temporal ordering of the patient's visit. b) The EHR representation. An EHR is then represented as a matrix constructed as a series of temporally ordered visit vectors.

Figure 3: Code probability plots. These plots show the Unigram, Sequential Visit Bigram, and Same Record Bigram probabilities for each synthetic dataset. With the exception of SynTEG, all models exhibit some correlation in the unigram and temporal bigram evaluations, but many have weak correlation or consistently yield higher synthetic probabilities due to a lack of temporal consistency and repetition across visits in the records. HALO and, to a lesser extent, HALO − Coarse perform the best in all settings, while HALO is the only one that can realistically produce pairs of codes within and across visits and achieve state-of-the-art results.

Figure 4: Continuous Variable Generation Performance: HALO effectively captures the distribution of continuous variables through its discretization approach, as demonstrated in four scenarios. a) Inter-visit gap probability density: The probability density of inter-visit gaps indicates that HALO closely approximates the true shape of real data. b) Inter-visit gap by visit number: The mean visit gap, as per visit number, across both real and synthetic datasets reveals that HALO accurately captures the pattern of patients with many records, showing shorter gaps in their subsequent visits. c) Lab presence probabilities: The probability of binary lab presence demonstrates that HALO accurately generates lab variables, even when discretized across multiple variables. d) Mean lab values: The average value of labs, when present, confirms that HALO's synthetic labs closely resemble those of the real dataset. Values in parentheses are $R^2$.

$\mathcal{S}$: A patient's static demographic information
$\mathcal{D}$: A patient's chronic disease information
$\mathcal{L}$: The set of all labs
$T \in \mathbb{N}$: The number of visits in $\mathcal{R}$
$\mathcal{C}$: The set of all medical codes
$\mathbf{R} \in \mathbb{R}^{(T+3) \times |\mathcal{C}|}$: The matrix representation of $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{D}$
$\mathbf{v}_t \in \mathbb{R}^{|\mathcal{C}|}$: The vector representation of the $t$-th visit in $\mathbf{R}$
$c_t^i \in \{0, 1\}$: The binary presence of the $i$-th code in $\mathcal{C}$ in $\mathbf{v}_t$

Layer Normalization is then a regularization technique that ensures that all neurons or variables in the corresponding intermediate layer have the same distribution across all features in a given input, which has been shown to produce smoother gradients and more effective training.

arXiv:2304.02169v3 [cs.LG] 9 Nov 2023

Table 1: Test set modeling metrics

Table 2: Aggregate statistics regarding the shape of training and compared synthetic datasets. Aggregate statistics on the number of visits per record and the number of codes per visit. The values are mean (std). HALO outperformed all the baselines while closely approximating the distribution of the true training data. Source data are provided as a Source Data file.

Table 3: Code probability correlations $R^2$ between training and synthetic datasets

Table 4: Chronic disease classification model performance trained on synthetic data. We compared the average performance in terms of accuracy and F1 Score for each of the 11 chronic disease labels in our outpatient dataset and 25 chronic disease labels in our inpatient dataset. The models were trained on each of our synthetic datasets and tested on real data. The reported values represent the mean and standard deviation across the tasks, with bold values indicating the best results. GPT, HALO − Coarse, and HALO's data offer large improvements over the other baselines and perform similarly to real training data. HALO's synthetic data performs the best, with the highest average performance of all synthetic methods. Source data are provided as a Source Data file.

Table 5: Rare disease detection performance on synthetic balanced datasets

Table 2: Table of Notations *