Introduction

There has been growing interest in using synthetic data generation (SDG) techniques to enable broader sharing of health data for research and analysis1,2,3,4,5,6,7,8,9,10,11, and SDG has been highlighted as a key privacy-enhancing technology for data access in the coming decade12. Furthermore, there are recent examples of health research studies using synthetic data that did not require ethics approval because the synthetic data were considered to contain no patient information13, which can greatly accelerate research projects.

There are multiple synthetic health datasets that are being made available to a broad research community such as: the NIH National COVID Cohort Collaborative (N3C)14, the CMS Data Entrepreneur’s Synthetic Public Use files15, synthetic cardiovascular and COVID-19 datasets available from the CPRD in the UK16,17, A&E data from NHS England18, a synthetic dataset from the Dutch cancer registry19, cancer data from Public Health England20, synthetic variants of the French public health system claims and hospital dataset (SNDS)21, and the South Korean data from the Health Insurance Review and Assessment service (the national health insurer)22. Furthermore, recently authors have been making synthetic variants of data used in their research papers publicly available23, to enable open science.

An important criterion for evaluating synthetic data is its utility. Utility is assessed by the data custodian before sharing the synthetic data with the eventual data users. The eventual data users would only have access to the synthetic data and not to the real datasets that were used to train the generative models.

Utility metrics can be defined as broad or narrow24. Broad metrics are generic and do not take into account the specific analytic workloads that the synthetic dataset will be used for25. Most of these metrics focus on the fidelity of the synthetic data to the real data by assessing the similarity of the joint distributions of both datasets. They are useful, for example, to compare and improve SDG methods26,27,28. Narrow metrics are specific to an analysis that is performed with synthetic data. They are also sometimes referred to as workload-aware utility metrics. The data custodian would often not have precise knowledge of individual user workloads in advance, and therefore utility is evaluated on commonly used workloads instead. Our focus in this study is on these narrow metrics.

One definition of narrow utility is replicability. Replicability is the reliability of findings when an existing study is repeated using the same analytical methods but different data29. There are two interpretations of replicability in the context of SDG.

Under one interpretation, replicability is assessed by comparing the analysis results using the real datasets with the results of the same analysis performed on the synthetic data, and is illustrated in Fig. 1. Here the effect size from a specific real dataset, which is a sample from some population, is computed and denoted by \(e_{rs}\); \(e_{rs}\) is then compared to the parameter estimate from the synthetic data, \(e_{sdg}\), for example by evaluating the confidence interval overlap24. It is quite common to evaluate the utility of SDG techniques using this approach1,2,3,4,5,6,7,8,9,10,11. In the current study we define objective criteria for such an evaluation.

Figure 1

Different approaches for evaluating the “narrow” utility of synthetic data in terms of replicability.

Another interpretation of replicability is whether population inferences made using synthetic data are valid30. In this case the comparison is between \(e_{sdg}\) and the population value of the parameter, \(e_{p}\). For this type of utility evaluation, standard metrics such as bias, coverage, precision, and statistical power become more relevant31.

The original proposal for SDG treated it as a form of multiple imputation32. Under the multiple imputation model, multiple datasets, say \(m\), are synthesized and combining rules are used to compute the parameter estimates and variances across the m synthetic datasets33,34. Additional variance adjustment and combining rules were introduced for singly imputed synthetic data (i.e., m = 1)35. Such corrections ensured that variability introduced by the synthesis process is accounted for when computing parameter estimates, their standard errors, and making population inferences from synthetic datasets.

Disclosing m synthetic datasets to the data analysts could also increase the privacy risks. While synthetic data is deemed to have low identity disclosure risks in practice because there is not a one-to-one mapping between synthetic records and real people36,37,38,39,40,41,42,43, it still has other types of disclosure risks, such as membership disclosure44,45,46,47. Therefore, it is important to evaluate the privacy implications when generating and sharing m synthetic datasets.

Previous studies evaluating the effect of the combining rules on analysis results from synthetic data used simulated datasets that were not specific to health data35, performed more qualitative evaluations of study results48,49, or focused primarily on disclosure risks39. These studies did not provide a set of specific recommendations for the application of the multiple imputation combining rules for health data, and did not consider both types of replicability criteria30: (a) the similarity of analysis findings to those from real data, and (b) the validity of population inferences.

In this paper we therefore perform a simulation study to evaluate the two types of replicability criteria, and also answer the following questions:

Q1

How many synthetic datasets should be generated and combined (i.e., what is the appropriate value of m) to maximize the replicability of results using SDG? The values of m varied from 1 to 500 in previous work35,43,48,50,51,52. There has not been a comprehensive assessment of the appropriate number of synthetic datasets to be generated

Q2

What are the privacy risks from sharing m synthetic datasets? There has been limited research on the privacy risks when multiple synthetic datasets from the same real dataset are released

Q3

Would the amplification of the synthetic datasets improve the replicability of SDG results? A naïve amplification, whereby the synthetic data are larger than the real data, will inflate statistical power; however, how will this amplification affect both replicability criteria when the combining rules are applied?

Q4

What are the differences in the performance of two of the more common SDG methods, sequential synthesis and generative adversarial networks, with respect to the replicability of analysis results using the generated datasets?

Our results addressing these questions can inform how well common SDG methods enable replication of analyses using synthetic data, and the overall evaluation approach can be used in future utility benchmarking studies.

Methods

We present the simulation design in the ADEMP format as recommended for simulation studies by Morris et al.31.

Aim

The aim of this simulation study was to evaluate the replicability of common statistical analyses performed on synthetic data, and answer the four questions in the introduction about various factors that may impact replicability.

Data generating mechanisms

Simulating a population

To perform the Monte Carlo simulations, we need a population of patients from which we then sample. There are multiple approaches to simulating a population. One can define distributions of convenience (e.g., Gaussian) for a number of predictor variables, sample from those, and then generate outcome variables from a regression model with arbitrary effect sizes50. This general approach produces a population that is not grounded in realistic health data, and typically treats the predictor variables as independent, an assumption that is unlikely to hold in practice. We instead use an approach that is common in health data simulations, whereby we start from real datasets and then sample with replacement to generate simulated samples31,53.

Datasets

The three health datasets evaluated in this simulation study covered multiple conditions, jurisdictions, and data collection approaches, as summarized in Table 1. More details about these datasets are included in the supplementary materials.

Table 1 The three datasets and their characteristics.

The first was the control arm from a colon cancer clinical trial (N0147 trial)54,55. The second dataset was the 2014 Canadian Community Health Survey (CCHS), which is conducted by Statistics Canada and represents the population of Canada. The third dataset was a prospectively maintained Danish Colorectal Cancer Group (DCCG) database including all Danish patients with a first-time diagnosis of right-sided colonic cancer between 2001 and 201856.

The analytic workload was logistic regression (LR). For each dataset a specific parameter was of interest and that was the focus of our simulations. More details on the parameter of interest and the LR model covariates for each dataset are provided in the supplementary materials.

According to a classification of odds ratio effect sizes57, the effect sizes for the parameter of interest in N0147, CCHS, and DCCG are all small (defined as OR between 1.28 and 1.8) with the largest OR at 1.8 for the N0147 dataset. This is slightly smaller than the median OR in epidemiological studies of 2.16. However, published studies tend to have effect sizes biased upwards compared to all analyses that are conducted58,59. Therefore, the effect sizes we simulate are arguably quite close to the median ones in health research and representative of current research.

Dataset sample size

The sample size for the datasets used in the simulations was the one deemed sufficient to achieve 80% power for the LR parameter of interest using the true effect size, correlations, and event rate based on standard power equations60,61. Because real data do not satisfy all of the assumptions, this calculated value was used as a floor. We then performed a Monte Carlo simulation with 1000 iterations, using that calculated sample size as the starting point, to determine the empirical 80% power sample size, which was used in our studies. For example, if the calculated sample size using the power equations was 100 observations, we then sampled 1000 datasets of 100 observations each from the population we created and computed the empirical power. If that was below 80%, the simulation was re-run with 110 observations, and so on, incrementing the sample size until 80% power was reached. The sample size that achieves 80% power is the one shown in Table 1.
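To make the empirical sample size search concrete, a minimal sketch is shown below. It is an illustration rather than the authors' code: it assumes the simulated population is available as a pandas DataFrame, that a hypothetical helper fit_lr_pvalue() returns the p-value of the LR parameter of interest for a given sample, and that the sample size is incremented in steps of 10 as in the example above.

import numpy as np

def empirical_power(population, n, fit_lr_pvalue, iterations=1000, alpha=0.05, seed=2024):
    """Estimate empirical power for the LR parameter of interest at sample size n."""
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(iterations):
        # Draw a simulated study sample with replacement from the population
        sample = population.sample(n=n, replace=True,
                                   random_state=int(rng.integers(1_000_000)))
        if fit_lr_pvalue(sample) < alpha:   # hypothetical helper returning the Wald p-value
            significant += 1
    return significant / iterations

def find_80pct_sample_size(population, n_start, fit_lr_pvalue, target=0.80, step=10):
    """Start at the analytic sample size (the floor) and increment until
    the empirical power reaches the target (e.g., 80%)."""
    n = n_start
    while empirical_power(population, n, fit_lr_pvalue) < target:
        n += step
    return n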

Study design

We followed a fully factorial design with the following factors: generative model (two types), whether to adjust for multiple synthetic datasets (Y/N), the number of synthetic datasets generated (\(m\), varied from 1 to 20), and the level of data amplification (4 levels). This provides 320 different scenarios for each of the three datasets considered.

Target of analysis

Analytic workload

We used LR models because they are common in health research for diagnostic and prognostic modeling62. A recent systematic review has shown that LR performance is comparable to the use of machine learning models for clinical prediction workloads63. Furthermore, an evaluation of the relative accuracy of LR models compared to other machine learning techniques, such as random forests and SVM, on synthetic versus real datasets across multiple types of SDG methods showed that LR models are only very slightly different64. Therefore, evaluating LR model parameters would have broad applicability for health research.

Estimand

A different model was fit for each dataset. The specific estimand of interest is described below in the context of the LR model. For our analysis the Wald confidence interval was computed.

For the N0147 dataset, we evaluated the impact of bowel obstruction on 5 year survival as a binary outcome65. The CCHS model we constructed evaluated cardiovascular health using the CANHEART index66, which was dichotomized at the “poor” to “intermediate” health boundary, and the covariate of interest was sex67. The DCCG model we constructed examines the relationship between sex and medical complications68,69.
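For illustration, fitting one of these LR models and computing a large-sample Wald confidence interval for the covariate of interest could look like the sketch below. The formula and the names df, outcome, exposure, age, and sex are placeholders, not the actual model specifications, which are given in the supplementary materials.

import statsmodels.formula.api as smf

# Hypothetical specification; "exposure" stands for the covariate of interest
fit = smf.logit("outcome ~ exposure + age + sex", data=df).fit(disp=0)

beta = fit.params["exposure"]      # estimated log-odds ratio (the estimand)
se = fit.bse["exposure"]           # its standard error
wald_ci = (beta - 1.96 * se, beta + 1.96 * se)   # large-sample 95% Wald CI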

Adjustment using multiple imputation combining rules

Assume that we are estimating a particular model parameter of interest \(q_{i}\) with variance \(v_{i}\) using synthetic dataset \(i\), where \(i = 1 \ldots m\). The adjustments for the model parameters and variances are as follows35. The combined model parameter \(\overline{q}_{m}\) is the mean across the m model parameters from the synthetic datasets, \(\overline{q}_{m} = 1/m\sum\nolimits_{i} {q_{i} }\), and \(\overline{v}_{m}\) is the mean variance across the m model parameters from the synthetic datasets, \(\overline{v}_{m} = 1/m\sum\nolimits_{i} {v_{i} }\). The adjusted variance is computed as \(T_{f} = \overline{v}_{m}\left( {k/n + 1/m} \right)\), where k is the size of the synthetic dataset and n is the size of the real dataset, and the adjusted large-sample 95% confidence interval of the model parameter is computed as \(\overline{q}_{m} \pm 1.96\sqrt {T_{f} }\).

This means that as the value of k increases above n the adjusted variance will also increase. This will have an impact on inferential validity, and imposes a cost to data amplification through synthesis.

Note that even with a single synthetic dataset and no amplification, \(k/n = 1\) and \(1/m = 1\) in the combining rules, so the adjusted variance is \(T_{f} = 2\overline{v}_{m}\) and the parameter CI width is still inflated by a factor of \(\sqrt 2\) under the multiple imputation approach. This means that the CI for a model from a single synthetic dataset and the CI from a single synthetic dataset with the combining rules applied are not the same.
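The combining rules above translate directly into code; the sketch below assumes the per-dataset estimates \(q_{i}\) and variances \(v_{i}\) have already been extracted from the m fitted LR models (q_list and v_list are placeholder names).

import numpy as np

def combine_estimates(q, v, n, k):
    """Combining rules as described above: q and v are the length-m vectors of
    parameter estimates and variances from the m synthetic datasets, n is the
    real dataset size and k the synthetic dataset size."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # combined point estimate
    v_bar = v.mean()                      # mean within-dataset variance
    T_f = v_bar * (k / n + 1.0 / m)       # adjusted variance
    half_width = 1.96 * np.sqrt(T_f)      # large-sample 95% CI half-width
    return q_bar, T_f, (q_bar - half_width, q_bar + half_width)

# Example: m = 10 unamplified synthetic datasets (k = n)
# q_bar, T_f, ci = combine_estimates(q_list, v_list, n=1420, k=1420)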

Methods (generative models evaluated)

We used two types of generative models: a sequential synthesis model and a generative adversarial network (GAN). These two types of generative models are representative of those used in practice. Sequential synthesis using decision trees was one of the first machine learning approaches proposed in the literature70,71 and has since been used extensively to synthesize health and social sciences data35,71,72,73,74,75,76,77,78, and applied in research studies on synthetic data48,71,79. More recently GANs have been one of the more used types of generative models in research and practice80,81,82, and have been applied often for the synthesis of health data37,44,46,83,84,85.

Overview of sequential synthesis

The first type of generative models was a sequential decision tree-based synthesizer28. Each model in the sequence was trained using a gradient-boosted decision tree algorithm86,87, with Bayesian optimization and fivefold cross-validation for hyperparameter tuning88. The variable sequence is optimized using a particle swarm algorithm28.

The process of sequential synthesis is illustrated in Fig. 2 for a four-variable dataset: V1 to V4. In the fitting phase, three models are constructed: M1 to M3. As shown, the first model takes as input V1 and produces V2 as the outcome. The nature of the variables, whether categorical or continuous, does not affect the process, as the model adjusts to become either a classification tree or a regression tree accordingly. The second model in the sequence takes V1 and V2 as input with V3 as the outcome, and so on.

Figure 2

Illustration of the sequential synthesis process for a four-variable dataset.

The synthesis step is initiated by sampling from the actual or fitted distribution of the first variable, V1. This creates the synthetic version of that variable sV1. Sampled values are then entered into the first model to generate the distribution of sV2. The synthetic value of sV2 is either sampled according to the predicted probabilities (for categorical variables) or smoothed using a kernel density estimator with boundary correction (for continuous variables)89, with bandwidth computed from the original data.

Having generated two synthetic values, sV1 and sV2, these form the input for model M2 to produce the distribution of sV3. Again, the generated synthetic value is either sampled from that predicted distribution or smoothed. The process proceeds in that manner until all variables are synthesized.
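A heavily simplified sketch of this fit-then-sample loop is given below. It substitutes scikit-learn gradient-boosted trees for the tuned implementation described above, omits the Bayesian hyperparameter optimization, the particle swarm sequence search, and the boundary-corrected kernel smoothing (crude Gaussian noise is used instead), and assumes the variable order and a numerically encoded DataFrame (real_df, order) are already available; it is meant only to illustrate the structure of sequential synthesis.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def fit_sequence(df, order):
    """Fit model j to predict order[j] from order[:j]; variables are assumed to be
    numerically encoded, and variables with few levels are treated as categorical."""
    models = []
    for j in range(1, len(order)):
        X, y = df[order[:j]], df[order[j]]
        model = (GradientBoostingClassifier() if y.nunique() <= 10
                 else GradientBoostingRegressor())
        models.append(model.fit(X, y))
    return models

def synthesize(df, order, models, k, rng):
    """Sample the first variable from its empirical distribution, then generate
    each subsequent variable from the corresponding fitted model."""
    synth = pd.DataFrame({order[0]: rng.choice(df[order[0]].to_numpy(), size=k)})
    for j, model in enumerate(models, start=1):
        X = synth[order[:j]]
        if hasattr(model, "predict_proba"):                 # categorical variable
            probs = model.predict_proba(X)
            synth[order[j]] = [rng.choice(model.classes_, p=p) for p in probs]
        else:                                               # continuous variable
            preds = model.predict(X)
            resid_sd = np.std(df[order[j]].to_numpy() - model.predict(df[order[:j]]))
            # crude Gaussian noise; the described method uses a boundary-corrected KDE
            synth[order[j]] = preds + rng.normal(0.0, resid_sd, size=k)
    return synth

# rng = np.random.default_rng(7)
# models = fit_sequence(real_df, order)   # "order" from the sequence optimization
# synthetic = synthesize(real_df, order, models, k=len(real_df), rng=rng)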

Overview of GAN

The second type of generative model is CTGAN90, which is a conditional GAN architecture.

In its basic form, a vanilla GAN consists of two multi-layer-perceptron neural networks, viz., a generator and a discriminator. The generator and the discriminator play a min–max game. The input to the generator is noise while its output is synthetic data. The discriminator has two inputs: the real training data and the synthetic data generated by the generator. The output of the discriminator indicates whether its input is real or synthetic. The generator is trained to ‘trick’ the discriminator by generating samples that look real. On the other hand, the discriminator is trained to maximize its discriminatory capability.

There are many variations of the vanilla GAN that are widely used for different applications. For instance, Bourou91 provides a review of GANs used in tabular data synthesis. The conditional GAN was first introduced by Mirza92. Of special interest is the CTGAN proposed by Xu93. CTGAN was developed to tackle several challenges in modelling tabular data, among them the multimodal distributions of continuous variables and highly imbalanced categorical variables. CTGAN addresses the first problem with a mode-specific normalization technique, and the second by conditioning the GAN on the categories of the categorical variables during training.
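As a point of reference, training CTGAN with the open-source package from the SDV project might look like the minimal sketch below; the file path, column names, and epoch count are placeholders, and the API shown is assumed from the current ctgan release rather than taken from this study.

import pandas as pd
from ctgan import CTGAN   # open-source CTGAN implementation from the SDV project

real_df = pd.read_csv("real_data.csv")               # placeholder path
discrete_columns = ["sex", "obstruction", "stage"]   # placeholder categorical columns

model = CTGAN(epochs=300)        # default architecture; hyperparameter tuning was limited
model.fit(real_df, discrete_columns)

synthetic_df = model.sample(len(real_df))   # k = n, i.e., no amplification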

Performance measures

Replicability metrics

The performance measures that were used to evaluate replicability are summarized in Table 2 (to evaluate replicability defined as the ability to draw the same conclusions as the analysis on the real dataset94) and Table 3 (to evaluate replicability defined as the validity of population inferences from the synthetic datasets31).

Table 2 The definitions of the metrics that were used to evaluate replicability defined as the ability to draw the same conclusions as the analysis on the real data94.
Table 3 The definitions of the metrics that were used to evaluate replicability defined as the validity of population inferences from the synthetic datasets31.
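To illustrate how two of the real-data comparison metrics in Table 2 can be computed, the sketch below implements decision agreement and confidence interval overlap using their common definitions from the utility literature; the exact formulas used in this study are those specified in Table 2, so the code is an approximation rather than a transcription.

def decision_agreement(real_ci, synth_ci):
    """1 if both analyses reach the same significance conclusion for the
    log-odds ratio (CI excludes zero on the same side), otherwise 0."""
    def decision(ci):
        lo, hi = ci
        return "positive" if lo > 0 else ("negative" if hi < 0 else "null")
    return int(decision(real_ci) == decision(synth_ci))

def ci_overlap(real_ci, synth_ci):
    """Mean proportion of each interval covered by their intersection;
    higher is better, 1 means identical intervals."""
    (rl, rh), (sl, sh) = real_ci, synth_ci
    inter = max(0.0, min(rh, sh) - max(rl, sl))
    return 0.5 * (inter / (rh - rl) + inter / (sh - sl))

# Example with made-up intervals:
# decision_agreement((0.21, 0.95), (0.10, 0.88))  -> 1 (both significant and positive)
# ci_overlap((0.21, 0.95), (0.10, 0.88))          -> approximately 0.88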

Privacy metric—membership disclosure

Privacy risks were computed using a membership disclosure metric95. Membership disclosure evaluates the ability of an adversary to correctly determine whether a target individual is in the original data that was used to train the generative model. The metric is a relative F1 score that compares the accuracy of such an attack to that of a naïve attack which does not use the information in the synthetic data. Previous work has used a threshold of 0.2 to determine if the relative F1 score was low enough67,94,95.

Membership disclosure was evaluated by pooling all of the \(m\) synthetic datasets, although in practice we did not observe a difference in the membership disclosure risk between pooling the \(m\) datasets and evaluating a single dataset. The results shown consider the \(m\) pooled datasets.

To compute the membership disclosure risk we need to have a measure of the population size. For the colon cancer dataset there were 1,365,135 people living with colorectal cancer in the US in 201896, which we set as our population size. For the CCHS dataset we used the population of Canada in 2014 since that was a population survey. The prevalence of colon cancer in Denmark is approximately 30,00097. These values are summarized in Table 1.

Membership disclosure is different from identity disclosure (commonly referred to as re-identification risk), in that a dataset can have a low re-identification risk but still have a high membership disclosure risk. Although the original datasets that were used in this study were deemed to be de-identified already, it is still necessary to assess membership disclosure risk.
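The sketch below outlines one plausible way a relative F1 membership disclosure score could be computed. It assumes the attack labels a target as a member when a sufficiently similar record exists in the pooled synthetic data, and that the naïve baseline labels every target a member; the distance function, threshold, and the exact form of the relative score are placeholders standing in for the definitions of the cited metric95.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import NearestNeighbors

def attack_f1(targets, is_member, synthetic_pool, threshold=0.1):
    """Attack: label a target a member when its nearest pooled synthetic record is
    within a (placeholder) distance threshold; data assumed numerically encoded."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic_pool)
    dist, _ = nn.kneighbors(targets)
    predicted = (dist.ravel() <= threshold).astype(int)
    return f1_score(is_member, predicted)

def relative_f1(targets, is_member, synthetic_pool):
    """Assumed relative score: attack F1 minus the F1 of a naïve attack that labels
    every target a member, compared against the 0.2 threshold from the literature."""
    naive = f1_score(is_member, np.ones_like(is_member))
    return attack_f1(targets, is_member, synthetic_pool) - naive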

Number of simulations

The number of simulation iterations was set to 1000 for each simulated scenario. This is the most common value for the number of simulation iterations used in the medical statistics literature31,98. This is also consistent with assuming a Monte Carlo standard error of 0.7% for a 95% CI coverage evaluation31, which is a key performance parameter in our study.

We drew 1000 datasets with the sample sizes shown in Table 1. These sample sizes give us 80% power to detect the desired effect for each LR model. A generative model was trained for each real data sample using sequential synthesis and CTGAN, which gave us 2000 generative models for each dataset. For each generative model, 20 synthetic datasets were generated and used in our analysis. LR models were fitted using \(m = 1 \ldots 20\) datasets, their results were combined, and the eight metrics described above, as well as membership disclosure, were computed for each \(m\). When generating synthetic datasets, we also evaluated the impact of data amplification. We evaluated four levels of amplification: 1×, 2×, 5×, and 10×. The baseline for the amplification is the 80% power sample size in Table 1 (i.e., amplification is equal to \(k/n\)).

The failure rate during the simulations was highest with the DCCG dataset using the sequential synthesis method, at 1.29% of the 1000 simulation runs. This could be due to the failure of the generative model or lack of convergence in the LR model. The failure rate for the sequential generative model with the N0147 dataset was 0.03%, and for CTGAN with the N0147 dataset it was 0.16%. For the other dataset and generative model combinations there were no failures. When failures occurred, they were treated as missing observations in the analysis.

Statistical testing

We do not perform statistical significance tests to compare the different metrics because, in the context of a simulation, these are not informative: the number of simulation runs can be increased to make very small effects statistically significant. Therefore, the results are presented descriptively, which gives us the information needed to evaluate replicability and privacy.

Neutrality of simulation study

This simulation was intended to be a neutral comparison study so as not to favor any particular generative model99. We argue that we meet two of the criteria for a neutral comparative study completely and meet the third one partially. First, the purpose of our study was to evaluate the replicability across common generative models rather than to evaluate a new proposed generative model. Both of the generative models included in our study have been used often in research and practice. Second, the evaluation criteria were selected based on the existing literature and we have tried to be more inclusive with respect to the selection of metrics. Therefore, there was a rational process to the choice of metrics. For the third criterion, while we are neutral with respect to the two methods included in our study in that we have evaluated them both before25,94,95, we have also performed more research and applied work with the sequential synthesis method9,28.

Ethics

The protocol for this study was approved by the Veritas IRB protocol number 2021-2882-7683-1, and the Children’s Hospital of Eastern Ontario Research Institute research ethics board protocol number 23/23X. The use of the DCCG dataset was approved by the Danish Data Protection Agency (Datatilsynet) number RN-2018-94. This study was performed in accordance with relevant guidelines and regulations. All the datasets used were provided to the research team for secondary analysis and they were already deemed to be de-identified/anonymized. Therefore, the Children’s Hospital of Eastern Ontario Research Institute Research Ethics Board did not require additional consent from the data subjects for this study.

Results

We present the results for the N0147 dataset in the main body of the paper and summarize the findings for the other two datasets which are included in the supplementary materials. We generally found that the CTGAN replicability results were quite poor, and we include those results in the supplementary materials.

In the results we will refer to the findings for a single dataset without the application of the combining rules as “single”. When the parameter estimates and their CI values are adjusted using the combining rules in “Adjustment using multiple imputation combining rules” we will refer to the results as “multiple”. Even for \(m = 1\), when the combining rules are applied the adjusted variance is \(T_{f} = 2\overline{v}_{m}\) with no amplification. This is different from the “single” variance (\(v_{1}\)) which would be just the computed value from the fitted model. Therefore, in the “multiple” case, even for one synthetic dataset, the parameter variance is adjusted upward to account for the generative process.

For the multiple dataset results, decision agreement for the N0147 dataset is shown in Fig. 3; it is high (all above the 0.8 threshold) for all values of \(m\) and decreases slightly as \(m\) increases. The estimate agreement reaches a plateau at \(m = 5\), and at that plateau it is also above the 0.83 threshold. Standardized difference is shown in Fig. 4 along with CI overlap. Standardized difference is consistently above the 0.95 value, and the CI overlap results are also quite high (mostly above 0.8) and increase with higher values of \(m\), reaching a plateau at \(m = 5\). These observations are consistent with the results for the DCCG and CCHS datasets shown in the supplementary materials.

Figure 3

Decision and estimate agreement for the N0147 colon cancer dataset using the sequential synthesis method. The amplification value indicates the multiple of the sample size shown in Table 1 (1420).

Figure 4

Standardized difference and confidence interval overlap for the N0147 colon cancer dataset using the sequential synthesis method. The amplification value indicates the multiple of the sample size shown in Table 1 (1420).

Data amplification affects the single dataset results for estimate agreement, and this is consistent with the pattern that for larger synthetic datasets the parameter estimate will converge to the true mean. CI overlap deteriorated for the single results with amplification. Decision agreement is not affected by amplification since the results were statistically significant and therefore narrower confidence intervals did not change that conclusion. Amplification did not have a material impact on the “multiple” results.

The N0147 and DCCG results using CTGAN provided in the supplementary materials are comparable to Figs. 3 and 4 with higher agreement, standardized difference, and CI overlap with higher values of \(m\), reaching an acceptable plateau at \(m = 5\). The CCHS results with CTGAN are quite poor, with low estimate agreement and confidence interval overlap results.

The bias and power results across different values of m for the N0147 dataset are shown in Fig. 5 at different levels of amplification. The bias is consistently close to zero, and power is close to the nominal 80% level. Bias and power tend to plateau at higher values of \(m\), around \(m = 10\). Amplification increases power only slightly for the "multiple" results. As expected, "single" power increases with amplification because it is a simple increase in sample size without adjustment of the variance.

Figure 5

The bias and power for the N0147 colon cancer dataset using the sequential synthesis method. The amplification value indicates the multiple of the sample size shown in Table 1 (1420).

The bias-eliminated coverage and empirical SE plots across different values of m for the N0147 dataset are shown in Fig. 6 at different levels of amplification. The coverage of the adjusted parameters is consistently close to the 95% nominal level. Empirical SE decreases towards zero and plateaus at higher values of m. This is not surprising since, as \(m\) increases, the average variance values move closer to the average across simulation runs, and the combined estimates become more consistent. Amplification does not change the general patterns observed. The coverage and empirical SE for the "single" results tend to be poor, with coverage far from the nominal 95% level and empirical SE being quite high.

Figure 6

The coverage and empirical SE for the N0147 colon cancer dataset using the sequential synthesis method. The amplification value indicates the multiple of the sample size shown in Table 1 (1420).

The results for the CCHS and DCCG datasets generated by sequential synthesis are very similar to the N0147 ones for the population inference results. These results are included in the supplementary materials.

For CTGAN the findings, included in the supplementary materials, are different. Bias is high and power is quite poor for the N0147 and CCHS datasets. But CTGAN performs quite well on these metrics for the DCCG dataset. Similarly, coverage for N0147 and CCHS exceeds the nominal level, but is at the nominal level for the DCCG dataset. Empirical SE performs similarly across all datasets with a gradual convergence to zero as more replicates are generated. Amplification did not change these patterns for the adjusted datasets.

The membership disclosure results are shown in Table 4. The value does not vary by the number of synthetic datasets that are generated. This is because the risk reaches its maximum with one dataset, and the values in Table 4 reflect their average. The risk value is low (below the suggested 0.2 threshold in the literature) suggesting that the privacy risks are acceptable for the synthetic data irrespective of the number of data replicates. The conclusions are similar for the CTGAN membership disclosure, also shown in the supplementary materials.

Table 4 Averaged membership disclosure values for the three datasets using the sequential synthesis generative model.

Overall, for sequential synthesis, the "multiple" results are superior to the "single" results. In many cases the evaluative metrics plateau at approximately \(m = 10\). This is the case across all eight criteria that were used. For the privacy criterion there is no difference between the "single" and "multiple" results.

Note that the Monte Carlo standard error31 was also computed for the various evaluative metrics. This was negligibly low and would not be visible in the plots.

Discussion

Summary

In this study we evaluated the replicability of findings using fully synthetic datasets through a series of simulations. Two sets of evaluative criteria were used to assess replicability: (a) the similarity of analysis findings to those from real data, and (b) the validity of population inferences. The simulations were based on three heterogeneous datasets covering multiple diseases, conditions, data collection modalities and jurisdictions. Two different, but commonly used, types of generative models were evaluated: a sequential synthesis approach using decision trees and a conditional generative adversarial network approach. The assumed analytical workload was logistic regression.

Generating multiple datasets using sequential synthesis and combining the parameter estimates provides for better replication of results than using a single synthetic dataset without any adjustments to the estimates and confidence intervals.

The results allow us to respond to the questions we posed at the outset of the study:

Q1

Applying the combining rules to 10 synthetic datasets was sufficient to ensure good performance across all eight of our metrics for data generated using sequential synthesis. The replicability of results from single synthetic datasets without any combining rule adjustments was generally poor and can be misleading

Q2

Membership disclosure risk is consistently below the threshold across all generative models and is not materially affected by the value of \(m\) or by amplification

Q3

The generation of amplified datasets only had a marginal impact on replicability in general, and more importantly had a very marginal effect on statistical power when the combining rules were applied

Q4

The replicability of analyses when the synthetic data was generated using sequential synthesis was high, but for CTGAN replicability was quite poor in some datasets, with decision and estimate agreement severely impacted, power far from the nominal value, and high bias on some datasets. Therefore, the ability to replicate real data results from synthetic data will depend on the type of generative model being used

Our results indicate that sequential synthesis gave better replicability than CTGAN. These results are consistent with previous comparative evidence on oncology data, whereby a sequential synthesis generative model using decision trees had better utility than a GAN46,94. There are also implementation differences that may contribute to sequential synthesis performing better. Our sequential synthesis implementation had a more complete process for handling missing data compared to the open source Synthetic Data Vault (SDV) implementation of CTGAN that we used90,100. We observed that the SDV generative models did not reproduce the missingness patterns in the synthetic data as well as sequential synthesis did. Furthermore, the SDV implementation had limited hyperparameter tuning.

While GANs have been used extensively for SDG80,81, there is evidence that performance can vary significantly across different GAN architectures101. This dependence on the type of generative model, even within the same class, suggests that the kind of evaluation we presented here should be conducted for each type of generative model when applied in practice.

Application in practice

Our results indicate that generating a single dataset and performing analysis on that without any adjustments to model parameters and standard errors can result in low replicability. Analytic conclusions should be drawn from models fitted on ten synthetic datasets and their parameters combined to ensure replicability of analyses.

This does not necessarily mean that generative models should be provided to data users to allow them to generate multiple datasets themselves. In general, machine learning (ML) models are known to be susceptible to adversarial attacks that can reveal sensitive information about the individuals in the training datasets102,103. Therefore, it has been argued that sharing ML models may lead to different types of disclosure risks, making (unprotected) ML models equivalent to personally identifiable information104. Hence, there may be additional privacy risk from sharing generative models. Instead, we propose that data custodians should share ten instances of synthetic datasets rather than single synthetic datasets to ensure the replicability of findings.

There is equivocal value to amplification of synthetic data for statistical analyses. Because of the relatively low computational burden of amplification, a 5× (or even 10×) amplification for the ten generated datasets can marginally improve replicability, although one can make the counterargument that the additional complexity of handling larger datasets would not provide a meaningful return in terms of replicability.

Our methodology can also serve as a general framework for evaluating and comparing the replicability of synthetic datasets. Replicability is only one dimension of the utility of synthetic datasets and generative models, but an important one.

Limitations

Our analytic workload was logistic regression models. This type of analysis is one of the most common in health research and therefore the results should still have broad applicability62. However, other types of statistical models should be evaluated in future work.

Our study was focused on evaluating the replicability using fully synthetic datasets. We did not consider partially synthetic datasets nor hybrid synthetic datasets, which may produce a different set of recommendations. We also did not consider other utility metrics, such as the fidelity of the synthetic data. Arguably, fidelity is mostly relevant if it enables replicability25, and therefore having a framework for assessing replicability is a necessary condition for assessing utility in general.

We did not examine the impact of generating multiple synthetic datasets on the results of machine learning models and prognostic model prediction accuracy on unseen cases. We limited our investigations to the commonly used logistic regression models and parameter inferences only.

Our results are limited by the characteristics of the datasets that were used. While there was heterogeneity in these datasets in terms of type, jurisdiction, and context, additional evaluations using our replicability framework would be of value.