A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health

Azizi, Zahra; Lindner, Simon; Shiba, Yumika; Raparelli, Valeria; Norris, Colleen M.; Kublickiene, Karolina; Herrero, Maria Trinidad; Kautzky-Willer, Alexandra; Klimek, Peter; Gisinger, Teresa; Pilote, Louise; El Emam, Khaled

doi:10.1038/s41598-023-38457-3

Download PDF

Article
Open access
Published: 17 July 2023

A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health

Zahra Azizi¹^na1,
Simon Lindner²^na1,
Yumika Shiba^1,3,
Valeria Raparelli^4,5,
Colleen M. Norris^5,6,
Karolina Kublickiene⁷,
Maria Trinidad Herrero⁸,
Alexandra Kautzky-Willer²,
Peter Klimek^10,9,
Teresa Gisinger¹¹,
Louise Pilote^1,12^na2 &
…
Khaled El Emam^13,14,15^na2

Scientific Reports volume 13, Article number: 11540 (2023) Cite this article

1945 Accesses
1 Citations
3 Altmetric
Metrics details

Subjects

Abstract

Sharing health data for research purposes across international jurisdictions has been a challenge due to privacy concerns. Two privacy enhancing technologies that can enable such sharing are synthetic data generation (SDG) and federated analysis, but their relative strengths and weaknesses have not been evaluated thus far. In this study we compared SDG with federated analysis to enable such international comparative studies. The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals. The Canadian data was synthesized and sent to the Austrian team for analysis. The utility of the pooled (synthetic Canadian + real Austrian) dataset was evaluated by comparing the regression results from the two approaches. The privacy of the Canadian synthetic data was assessed using a membership disclosure test which showed an F1 score of 0.001, indicating low privacy risk. The outcome variable of interest was CVH, calculated through a modified CANHEART index. The main and interaction effect parameter estimates of the federated and pooled analyses were consistent and directionally the same. It took approximately one month to set up the synthetic data generation platform and generate the synthetic data, whereas it took over 1.5 years to set up the federated analysis system. Synthetic data generation can be an efficient and effective tool for enabling multi-jurisdictional studies while addressing privacy concerns.

Genome-wide association studies

Article 26 August 2021

Interviews in the social sciences

Article 15 September 2022

Development and validation of a new algorithm for improved cardiovascular risk prediction

Article Open access 18 April 2024

Introduction

Cardiovascular diseases (CVD) continue to represent the leading cause of mortality and morbidity amongst women and men worldwide¹. Biological differences between the sexes such as anatomical and physiological variations in coronary arteries and the autonomic nervous system alter the development and progression of CVD². However, the environment and lifestyle³ as well as individuals’ identity, roles, and relations in society may play an important role. These characteristics are gendered in a way that they affect males and females differently and evolve through early life to adulthood⁴. The effect of these gendered factors can vary between countries with different cultural and political biases⁵. Investigating the impact of sex on CVD across multiple countries requires pooling data sourced from these jurisdictions.

However, sharing and pooling health data across institutions and across national and international jurisdictions has been a challenge⁶. Privacy concerns are key barriers to data sharing and data access^7,8, particularly in EEA countries where the General Data Protection Regulation (GDPR) imposes high standards for data sharing that are often difficult to meet in practice^9,10. This raises a particular challenge given that the GDPR is serving as a template regulation around the globe¹⁰.

One approach to address such privacy concerns has been to perform a distributed data analysis whereby the analysis is performed within each dataset locally and then final results combined through a meta-analysis. As an example, in a study evaluating the effectiveness of different statins in each of 3 Canadian provinces¹¹, the hazard ratios for different statins were combined using a fixed-effects model, with weight being the inverse of the variance of the province-specific parameter estimate¹². However, because the same analysis needs to be executed multiple times by different teams in each province, this general approach has not resulted in timely results in practice¹³.

Another option which can enable the timely sharing of datasets in a privacy protective manner is synthetic data generation (SDG)^14,15. There have been multiple synthetic health data releases in the US^16,17, the UK^18,19,20 and other European countries^21,22. None of these efforts pooled datasets across jurisdictions to enable cross-country analysis.

In this study, we evaluate whether SDG can be applied for pooling data to enable international comparative studies. Our objective was to assess country differences of the effect of sex on the cardiovascular health (CVH) of Canadian and Austrian populations. The datasets used in this study were from the Canadian Community Health Survey (CCHS) that was administered in 2014 in Canada (n = 63,522), and the Austria Health Interview Survey (ATHIS 2014, n = 15,771) which was conducted as part of the European Health Interview Survey series. These surveys collect information on health status, psychosocial factors, and healthcare resource utilization.

The Canadian data was synthesized and sent to Austria to be pooled with the original Austrian dataset, and a multivariable regression model was constructed from the pooled dataset. To generate synthetic data, we used sequential classification and regression trees^23,24. The results were compared to the ground truth results obtained through a federated analysis on the source data. The federated analysis was performed using the DataSHIELD method and tools²⁵. The DataSHIELD approach exchanges the intermediate results among the nodes which means that its analysis gives the same the results as those obtained from pooling the original datasets. The study workflow is shown in Fig. 1.

Starting from relatively similar points with the availability of robust software to perform federated analysis and SDG, and a relatively good knowledge of privacy enhancing technologies within the team, the federated analysis took eighteen months to set up operationally and obtain results, whereas the SDG approach took in total one month to set up, install, and execute. Therefore, testing whether an SDG method can produce the same results as those obtained from source data and if such an approach can be privacy protective, could enable significantly more efficient pooling of data across jurisdictions.

Results

Privacy risks of synthetic data

The privacy of the synthetic CCHS data was assessed using a membership disclosure test (step 3 in Fig. 1). Membership disclosure risk assessment is a common way to evaluate the privacy risks in synthetic datasets^26,27,28,29, and is defined as an adversary, using the information in synthetic data, determines that a real target person was included in the original dataset used as input for synthetic data generation (i.e. was a member of the training dataset). Knowing that an individual was in the training data can reveal sensitive attributes about that individual.

The relative membership disclosure F1 score³⁰ was 0.001, indicating that the ability for an adversary to predict membership is quite poor. The low value means that the synthetic Canadian dataset can be deemed as having low disclosure risks.

Descriptive statistics

The CCHS cycle 2014 included 55.3% females, while the ATHIS Cycle 2014 included 55.7% females (Table 1). The Austrian participants were slightly younger than the Canadians. However, there was an age difference between males and females in the Canadian participants with slightly older females (p < 0.001) but similar in the Austrian participants (p = 0.32). There was a small difference in hypertension between males and females in the Canadian dataset (M vs. F: 24.2% vs. 25.1%), and in the Austrian dataset (M vs F: 21.4% vs. 18.9%). In the Austrian dataset there were more females that were immigrants (M vs F: 7.6% vs. 9.6%) compared to the Canadian dataset where there was no difference in immigration status (M vs. F: 14.5% vs. 14.4%). Otherwise, the two datasets were similar in terms of male vs female comparisons with the following patterns: more females had a lower BMI, more males had diabetes and were smokers, more females were divorced or widowed, more females lived in single occupant households, and more females lived in low- or medium-income households.

Table 1 Comparison of baseline characteristics for the Canadian and Austrian datasets.

Full size table

Comparison of pooled partially synthetic data and federated analysis results

Descriptive statistics

A comparison of the marginal distributions between males and females in Table 2 showed consistently similar results in the federated and pooled analyses of partially synthetic data across all variables, with the standardized mean differences (SMD) consistently below the 0.1 threshold³¹.

Table 2 Descriptive statistics for the federated and pooled analysis.

Full size table

Males tended to be younger, there were more females with normal BMI (M vs. F: 39.9% vs. 52.9%), more males had diabetes (M vs. F: 9.2% vs. 7.5%) and were smokers (M vs. F: 24.1% vs. 18.8%). There were more males that were single (M vs. F: 32.5% vs. 25.3%), and males were more likely to be in a household with a high income (M vs. F: 53.9% vs. 44.4%). Females were more common in single-person households (M vs. F: 21.3 vs. 28.4%) and were more likely to be divorced or widowed (M vs. F: 12.9% vs. 25.3%). There were no significant differences between sexes on hypertension (M vs. F: 23.6% vs. 23.9%), post-secondary education and higher (M vs. F: 51.7% vs. 51.2%), and whether the individual was an immigrant (M vs. F: 13.1% vs. 13.4%).

Determinants of cardiovascular health: univariable analysis

The outcome variable of interest was CVH calculated through a modified CANHEART index in both countries³². Overall, 70.7% of Canadians and 67.9% of Austrians had a CANHEART score greater than three.

Table 3 shows the parameter estimates, confidence intervals, and p-values for the pooled and federated univariable regression analysis. The results were similar between the two methods of analysis, with the substantive conclusions being the same from both approaches.

Table 3 Univariable linear regression using the federated and pooled analysis.

Full size table

Females had better CVH than males (pool vs. fed: 0.18 vs. 0.19), as well as individuals in larger households (pool vs. fed: 0.19 vs. 0.18) and immigrants (pool vs. fed: 0.09 vs. 0.09). Older individuals had worse CVH (pool vs. fed: − 0.17 vs. − 0.17), as well as divorced/widowed individuals (pool vs. fed: − 0.61 vs. − 0.6) and common-law/married individuals (pool vs. fed: − 0.41 vs. − 0.4) compared to single individuals. Lower income individuals also had worse CVH (pool vs. fed: − 0.19 vs. − 0.18). There was a weak positive relationship between higher education and CVH (pool vs. fed: 0.04 vs. 0.04). The weakest relationship was between country and CVH whereby the effect size was similar between federated analysis and pooled analysis (− 0.04 vs. − 0.03), indicating slightly worse CVH among the Austrian respondents.

Determinants of cardiovascular health across countries: interaction analyses

In the multivariable analysis of the main effects, the parameter estimates of the federated and pooled analysis were directionally the same as for the univariable analysis, and the comparison between the federated and pooled analysis yields the same conclusions as for the univariable analysis (see Table 4).

Table 4 Multivariable main effects models for predicting CVH in federated and pooled analyses.

Full size table

In the multivariable analyses considering the country interactions to determine whether country moderates the relationship between the other variables and CVH, the impact of several factors differed between countries (Table 5). For example, although males in Austria have lower CVH than males in Canada, females in Austria had better CVH than females in Canada. Also, at lower levels of education, CVH was lower among the Austrian respondents, but this country difference changed as education levels increased whereby Austrians with high levels of education had higher CVH. At the highest level of education Austrians had better CVH than Canadians. Immigrants had better CVH in Canada compared to Austria, but worse CVH than non-immigrants in both countries.

Table 5 Multivariable model with country interactions for federated and pooled analysis.

Full size table

There is one difference in the interaction parameters between the federated and pooled models. While the significance of the interaction parameter for being married differs between the two approaches, the substantive conclusions are the same in that being married has lower CVH in both countries, and CVH is lower in Austria than in Canada irrespective of marital status.

The effect size for the country variable is larger in the interaction model compared to the univariable model and main effects only multivariable models. The interaction model assumes a contingency effect of country and therefore the country parameter should not be interpreted by itself³³.

Elapsed time comparisons

A significant time elapsed to set-up the necessary servers in multiple locations with the requisite security protocols for the federated analysis (these servers hold the original sensitive datasets and needed to be accessible remotely from a different jurisdiction, requiring the introduction of additional security protocols and checks), and to obtain the necessary approvals (Table 6). The programming required for DataSHIELD had to be done anew since common regression R packages used by the analysts were not usable in a federated context. Once the multiple nodes have been set up the processing speeds are comparable.

Table 6 The difference in elapsed time between the federated analysis and the pooled analysis.

Full size table

These values demonstrate the advantage of synthetic data relatively speaking. An important context here is that the DataSHIELD system was being set up in two academic medical centers, which may have an impact on timing. Plus, this work was done during the COVID-19 pandemic which would have impacted the speed at which multi-institutional and multi-jurisdictional projects progressed.

Discussion

Summary

Our results highlight the country specific effects of sex on CVH and demonstrated slightly better CVH in Canadians compared with Austrians. Marital status, low household income and not being single were associated with worse CVH while female sex, greater household size, higher level of education, and being an immigrant were associated with better CVH in federated and pooled datasets. The magnitude of these factors differed between Austria and Canada.

The result of this secondary analysis of population-based datasets revealed that synthetic data generation methods using sequential classification and regression trees can be used to pool datasets across countries for international studies. The analytical conclusions were the same for the models developed using the pooled partially synthetic dataset as the ground truth model developed using federated analysis in various analytical steps including descriptive, univariable analysis and multivariable main effects and country interaction models. While previous observational studies have compared synthetic and real data^34,35,36, there has been no population-based study testing the use of SDG for pooling datasets across jurisdictions and comparing it to a federated approach.

We provided evidence that synthetic data has similar utility compared to the ground truth generated through federated analysis. While there was one difference in regression model parameters, this was for a weak effect size. Where weak effects are important then the pooled partially synthetic data can be used for exploratory analysis to validate assumptions while procedures for the exchange of the original data are set up.

The significantly lower effort in getting to the results using synthetic data can enable researchers to efficiently share data across jurisdictions. Data synthesis was completed in approximately one month whereas it took eighteen months to set up the federated analysis system across two nodes. It is expected that further substantial work would be needed to set up additional nodes to accommodate the inclusion of other countries in the international analysis.

The use of synthetic data will allow merging a variety of population-based databases globally and across jurisdictions nationally and internationally. For our specific work, this would allow us to assess the association of sex with the cardiovascular health of populations while evaluating the effect of geo-politico-cultural differences in disease risk.

We found that being divorced, widowed, or married was associated with worse CVH compared to being single. Similar results were obtained in an analysis of data from the US, where single participants had better health habits and lower preventable risk factors than married/widowed or divorced in the National Health Interview Survey³⁷. While singles might have better CVH, evidence for the mortality rate from CVD in single participants compared to married participants is still inconsistent^38,39,40,41. Studies have identified the increased prevalence of non-traditional CVH risk factors including stress, depression, recreational drugs, and other socioeconomic risks in non-married groups that can indeed impact these subjects additionally⁴². This may explain the greater risk of CVD and mortality in non-married compared to married subjects in those studies. It is also reported that these acute stressors are even greater in those widowed and divorced (spousal death, divorce)⁴³, which may strengthen the development of CVD compared to single and married in our study.

Lower socioeconomic status is associated with increased risk of CVD and mortality³. Our results are generally supportive demonstrating a positive effect of higher education. There was significant interaction between many covariates and country. Males in Austria have worse CVH than males in Canada. Also, at lower levels of education CVH is worse among the Austrian respondents, but this country specific effect reverses as education levels increase: at the highest level of education Austrians seem to have better CVH than Canadians. Moreover, immigrants have better CVH in Canada than Austria, and non-immigrants have better CVH overall that is also higher in Canada. Being married has worse CVH in both countries, and CVH is lower in Austria than in Canada across all values of marital status. These results suggest groups to be targeted for improving CVH are country specific.

Limitations and future work

One of the limitations of our study is using only a single data synthesis method. Application of other types of data synthesis and comparing the utility of those methods with those from the current study is recommended in future studies. We only pooled two datasets. Multi-jurisdictional studies may pool datasets across more than two jurisdictions, and we did not test utility when multiple datasets are synthesized and pooled.

Other methods for privacy-reserving analysis of multi-jurisdictional data include performing a meta-analysis. However, because the same, potentially complex, analyses must be performed multiple times, the timelines of this approach has in practice proven to be challenging¹³. The use of synthetic data generation can help accelerate the time to results.

Conclusions

Our results indicate high utility for the pooled partially synthetic dataset, and low privacy risks for the synthetic data, in addition to an elapsed time advantage when compared to the federated analysis platform. Our analysis identified factors with a differential effect on CVH depending on country where a person lives. Hence, interventions will need to be country specific.

Methods

The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals.

Datasets used

The CCHS and ATHIS variables/questions that were used in our analysis are included in Supplementary Material A. The first step in the workflow (see Fig. 1) was to harmonize the datasets using Maelström research guidelines for retrospective data⁴⁴.

Data synthesis method

Generative model

We used a sequential synthesis method using sequence-optimized decision trees²⁴. With sequential synthesis models, a variable is synthesized by using the values earlier in the sequence as predictors. All variables used in the analysis were synthesized (step 2 of the workflow as illustrated in Fig. 1). Only the CCHS dataset was synthesized.

Sequential trees have been used to synthesize health and social sciences data^{45,46,47,48,49,50,51,52,53}, and applied in research studies on synthetic data^45,54,55. Additional improvements were implemented to the basic sequential synthesis method for this study. Each model in the sequence was trained using a gradient boosted decision tree^56,57 with Bayesian optimization for hyperparameter selection⁵⁸. Each combination of hyperparameters was selected using fivefold cross validation on the training dataset during tuning.

In the context of the synthesis of categorical variables, synthetic values are generated based on the predicted probabilities. In general, boosted trees do not output correct probabilities and these need to be calibrated, especially as the number of iterations increases⁵⁹. In addition, for imbalanced categorical outcomes, the model is trained with larger weights for the minority class, which gives incorrect probabilities. Therefore, the predicted probabilities are adjusted using beta calibration⁶⁰.

For each continuous variable $X_{i}$ we first convert them to a Gaussian distribution. The empirical cdf was applied to each variable $F_{i} (X_{i} )$, and then the quantile function for the standard normal was applied, $\Phi^{ - 1} (F_{i} (X_{i} ))$, which is passed through for synthesis. After synthesis, the generated values $\hat{X}_{i}$ are converted back as $F_{i}^{ - 1} (\Phi (\hat{X}_{i} ))$.

Combining rules for synthetic data

The original proposal for synthetic data generation treated it as a form of multiple imputation⁶¹. Under the multiple imputation model, multiple datasets, say m, are synthesized and combining rules are used to compute the parameter estimates and variances for partial synthesis across the m synthetic datasets^62,63. Such corrections for the parameter estimates and variances ensured that variability introduced by the synthesis process are accounted for when making population inferences from synthetic datasets.

In the context of the current study, a partial synthesis is performed in that only the Canadian dataset is replaced with the synthetic version.

For a particular model parameter ${\text{q}}_{{\text{i}}}$ with variance ${\text{v}}_{{\text{i}}}$ using synthetic dataset i where $i = 1 \ldots m$. The adjustment for the model parameters and variances are as follows^51,64,65. The combined model parameter ${\overline{\text{q}}}_{m}$ is the mean across the m model parameters from the synthetic datasets ${\overline{\text{q}}}_{m} \, = \,{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 m}}\right.\kern-0pt} \!\lower0.7ex\hbox{$m$}}\,\sum\limits_{i} {q_{i} }$, and ${\overline{\text{v}}}_{m}$ is the mean variance across the m model parameters from the synthetic datasets where ${\overline{\text{v}}}_{m} \; = \,{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 m}}\right.\kern-0pt} \!\lower0.7ex\hbox{$m$}}\,\sum\limits_{i} {v_{i} }$ . The between imputation variance is given by $b_{m} = \frac{1}{m - 1}\sum\limits_{i = 1}^{m} {\left( {q_{i} - \overline{q}_{m} } \right)^{2} }$, and the adjusted variance is computed as $T_{p} = {\raise0.7ex\hbox{${b_{m} }$} \!\mathord{\left/ {\vphantom {{b_{m} } m}}\right.\kern-0pt} \!\lower0.7ex\hbox{$m$}} + \overline{v}_{m}$, and the adjusted large sample 95% confidence interval of the model parameter is computed as ${\overline{\text{q}}}_{m} \, \pm \,1.96\,\sqrt {T_{f} }$. For this study we set $m = 10$, which is consistent with current practice for the analysis of synthetic data^51,55,64,65.

Assessing the privacy risks of the synthetic data

Privacy risk was evaluated using membership disclosure on the ten pooled synthetic datasets. The accuracy of a membership disclosure attack can be measured using the relative F1 score³⁰, which indicates the ability of an adversary to correctly determine the membership status of a record. The details of the method to compute membership disclosure are provided in Supplementary Material C.

Once deemed to have low privacy risks, the synthetic dataset was sent to the Austrian team for analysis. The Austrian team pooled the source ATHIS and the synthetic CCHS datasets from both countries and built the regression models described below. This is referred to as the “pooled” dataset.

Statistical analysis

The analysis was performed on the pooled source ATHIS data and the synthetic CCHS data (steps 4 and 5 in Fig. 1).

Outcome variable: cardiovascular health

Our measure of CVH was the CANHEART index. The original CANHEART index was composed from the sum of the ideal metrics for 6 cardiometabolic risk factors and behaviors including history of smoking, leisure physical activity, daily fruit and vegetable consumption, body mass index, diabetes and hypertension³². However, due to harmonization limitations, we had to create a modified version with available variables in both datasets. The modified CANHEART index was calculated using smoking, body mass index (BMI), diabetes and hypertension variables (see Supplementary Material B). This score ranges from 0 (worse) to 4 (best or ideal cardiovascular health).

For youth, the original CANHEART index did not include hypertension and diabetes in the score due to their low prevalence in that group. However, the index with these scores included has been validated in the juvenile population in a previous study⁶⁶.

Descriptive statistics on pooled dataset

The SMD was used to statistically compare the federated and pooled datasets. SMD was selected as given our large sample size, small, clinically unimportant differences, are likely to be statistically different when using t-tests or chi squared tests. The SMD between the federated and pooled datasets was computed for each synthetic dataset generated and then averaged across all of them. An SMD greater than 0.1 is deemed as a potentially clinically important difference, a threshold often recommended for declaring imbalance in pharmacoepidemiologic research³¹.

Univariable and multivariable models on pooled dataset

Both univariable and multivariable linear regression models were used to determine the association between the predictors and cardiovascular health. The multivariable regression model had as predictors the following variables: sex, education level, marital status, household size, household income, immigrant status, age, and country. Goodness of fit was evaluated with R² for each model.

Comparison between pooled partially synthetic data analysis and federated analysis

One common measure of the utility of synthetic datasets is that the data analysis results using synthetic data are similar to the analysis results using the real data (ground truth results) and that the conclusions are the same⁶⁷. It is quite common to evaluate the utility of synthetic data generation techniques using this approach^34,35,68,69. In our case, the ground truth results using federated analysis served as our real data results.

The utility of the pooled dataset was evaluated by comparing the pooled data regression model with the model constructed from a federated analysis which used both source datasets²⁵. The federated analysis approach gives the correct results as it does not involve any distortion of the variables. The two nodes of the system were in Montreal and Vienna. A distributed analysis on the horizontally partitioned dataset was performed by exchanging interim regression results between the two nodes. Because no raw data is exchanged among the nodes the interim results sharing is not deemed to be a disclosure of personal health information (step 6 in Fig. 1).

If the pooled partially synthetic data is a good proxy for the pooled source data then we would expect the conclusions from the pooled analysis to be the same as the conclusions from federated analysis (step 7 in Fig. 1).

Ethics

The study was approved by the research ethics boards of the McGill University Health Center (Project #2020–5452) and the Medical University of Vienna (1859/2019). All methods were carried out in accordance with relevant guidelines and regulations. Given that the datasets come from national surveys conducted by national statistical offices in each country (Statistics Canada and Statistik Austria), the respondents provided informed consent for the data collection and to the conditions for disclosing the data for further research.

Data availability

The data that support the findings of this study are available from Statistics Canada for the Canadian data and Statistik Austria for the Austrian data. However, restrictions apply to the availability of these datasets. To access the datasets, direct requests must be made to the data custodians as these are not public datasets and there may be conditions and agreements for making them available.

Abbreviations

CVH:: Cardiovascular health
CVD:: Cardiovascular diseases
SDG:: Synthetic data generation
US:: United States
UK:: United Kingdom
CCHS:: Canadian community health survey
ATHIS:: Austria health interview survey
BMI:: Body mass index

References

Virani, S. S. et al. Heart disease and stroke statistics-2020 update: A report from the American Heart Association. Circulation 141(9), e139–e596. https://doi.org/10.1161/CIR.0000000000000757 (2020).
Article PubMed Google Scholar
Huxley, V. H. Sex and the cardiovascular system: The intriguing tale of how women and men regulate cardiovascular function differently. Adv. Physiol. Educ. 31(1), 17–22. https://doi.org/10.1152/advan.00099.2006 (2007).
Article PubMed Google Scholar
Connelly, P. J. et al. The importance of gender to understand sex differences in cardiovascular disease. Can. J. Cardiol. 37(5), 699–710. https://doi.org/10.1016/j.cjca.2021.02.005 (2021).
Article PubMed Google Scholar
Bartz, D. et al. Clinical advances in sex- and gender-informed medicine to improve the health of All: A review. JAMA Intern. Med. 180(4), 574–583. https://doi.org/10.1001/jamainternmed.2019.7194 (2020).
Article PubMed Google Scholar
Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit. Med. 3, 81. https://doi.org/10.1038/s41746-020-0288-5 (2020).
Article PubMed PubMed Central Google Scholar
Prosperi, M., Min, J. S., Bian, J. & Modave, F. Big data hurdles in precision medicine and precision public health. BMC Med. Inform. Decis. Mak. 18(1), 139. https://doi.org/10.1186/s12911-018-0719-2 (2018).
Article PubMed PubMed Central Google Scholar
van Panhuis, W. G. et al. A systematic review of barriers to data sharing in public health. BMC Public Health 14(1), 1144. https://doi.org/10.1186/1471-2458-14-1144 (2014).
Article PubMed PubMed Central Google Scholar
Kalkman, S., Mostert, M., Gerlinger, C., van Delden, J. J. M. & van Thiel, G. J. M. W. Responsible data sharing in international health research: A systematic review of principles and norms. BMC Med. Ethics 20(1), 21. https://doi.org/10.1186/s12910-019-0359-9 (2019).
Article PubMed PubMed Central Google Scholar
T. Rabesandratana, “European data law is impeding studies on diabetes and Alzheimer’s, researchers warn,” Science | AAAS, Nov. 20, 2019. https://www.sciencemag.org/news/2019/11/european-data-law-impeding-studies-diabetes-and-alzheimer-s-researchers-warn (Accessed 21 June 2021).
Bentzen, H. B. et al. Remove obstacles to sharing health data with researchers outside of the European Union. Nat. Med. https://doi.org/10.1038/s41591-021-01460-0 (2021).
Article PubMed PubMed Central Google Scholar
Zhou, Z. et al. Effectiveness of statins for secondary prevention in elderly patients after acute myocardial infarction: an evaluation of class effect. CMAJ 172(9), 1187–1194. https://doi.org/10.1503/cmaj.1041403 (2005).
Article PubMed PubMed Central Google Scholar
Deeks, J. J., Altman, D. G. & Bradburn, M. J. Statistical methods for examining heterogeneity and combining results from several studies in meta-analysis. In Systematic Reviews in Health Care 285–312 (Wiley, 2001). https://doi.org/10.1002/9780470693926.ch15.
Chapter Google Scholar
Suissa, S. et al. CNODES: The Canadian network for observational drug effect studies. Open Med. 6(4), e134 (2012).
PubMed PubMed Central Google Scholar
K. El Emam, L. Mosquera, and R. Hoptroff, Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. Sebastopol, CA: O’Reilly Media, 2020. [Online]. https://www.oreilly.com/library/view/practical-synthetic-data/9781492072737/. Accessed 19 October 2020.
El Emam, K. & Hoptroff, R. The synthetic data paradigm for using and sharing data. Cutter Executive Update 19(6), 1–12 (2019).
Google Scholar
Haendel, M. A. et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J. Am. Med. Inform. Assoc. 28(3), 427–443. https://doi.org/10.1093/jamia/ocaa196 (2021).
Article PubMed Google Scholar
CMS, “CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF),” 2022. https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF (Accessed 17 July 2022).
Wang, Z., Myles, P. & Tucker, A. Generating and evaluating synthetic UK primary care data: Preserving data utility patient privacy. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS) 126–131 (Cordoba, 2019). https://doi.org/10.1109/CBMS.2019.00036.
Chapter Google Scholar
“Synthetic data at CPRD,” Medicines & Healthcare products Regulatory Agency, 2020. https://www.cprd.com/content/synthetic-data (Accessed 24 September 2020).
NHS England, “A&E Synthetic Data.” https://data.england.nhs.uk/dataset/a-e-synthetic-data (Accessed 16 July 2022).
IKNL, “Synthetic dataset,” integraal kankercentrum Nederland, 2021. https://iknl.nl/en/ncr/synthetic-dataset (Accessed 13 January 2023).
“SNDS synthétiques,” Systeme national des donnees de sante, 2021. https://documentation-snds.health-data-hub.fr/formation_snds/donnees_synthetiques/ (Accessed 20 January 2022).
Beata Nowok, “Utility of synthetic microdata generated using tree-based methods,” presented at the UNECE Statistical Data Confidentiality Work Session, Helsinki, Oct. 2015. [Online]. https://unece.org/statistics/events/SDC2015. Accessed 24 February 2020.
El Emam, K., Mosquera, L. & Zheng, C. Optimizing the synthesis of clinical trial data using sequential trees. J. Am. Med. Inform. Assoc. 28(1), 3–13. https://doi.org/10.1093/jamia/ocaa249 (2021).
Article PubMed Google Scholar
Wolfson, M. et al. DataSHIELD: Resolving a conflict in contemporary bioscience–performing a pooled analysis of individual-level data without sharing the data. Int. J. Epidemiol. 39(5), 1372–1382. https://doi.org/10.1093/ije/dyq111 (2010).
Article PubMed PubMed Central Google Scholar
Zhang, Z., Yan, C., Mesa, D. A., Sun, J. & Malin, B. A. Ensuring electronic medical record simulation through better training, modeling, and evaluation. J. Am. Med. Inform. Assoc. 27(1), 99–108. https://doi.org/10.1093/jamia/ocz161 (2020).
Article PubMed Google Scholar
Zhang, Z., Yan, C., Lasko, T. A., Sun, J. & Malin, B. A. SynTEG: A framework for temporal structured electronic health data simulation. J. Am. Med. Inform. Assoc. 28(3), 596–604. https://doi.org/10.1093/jamia/ocaa262 (2021).
Article PubMed Google Scholar
Goncalves, A. et al. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. 20(1), 108. https://doi.org/10.1186/s12874-020-00977-1 (2020).
Article PubMed PubMed Central Google Scholar
Hilprecht, B., Härterich, M. & Bernau, D. Monte Carlo and reconstruction membership inference attacks against generative models. Proc. Priv. Enhanc. Technol. 2019(4), 232–249. https://doi.org/10.2478/popets-2019-0067 (2019).
Article Google Scholar
El Emam, K., Mosquera, L. & Fang, X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 5(4), 10. https://doi.org/10.1093/jamiaopen/ooac083 (2022).
Article Google Scholar
Stuart, E. A., Lee, B. K. & Leacy, F. P. Prognostic score–based balance measures for propensity score methods in comparative effectiveness research. J. Clin. Epidemiol. 66(8), S84-S90.e1. https://doi.org/10.1016/j.jclinepi.2013.01.013 (2013).
Article PubMed PubMed Central Google Scholar
Maclagan, L. C. et al. The CANHEART health index: A tool for monitoring the cardiovascular health of the Canadian population. CMAJ 186(3), 180–187. https://doi.org/10.1503/cmaj.131358 (2014).
Article PubMed PubMed Central Google Scholar
Pedhazur, E. Multiple Regression in Behavioral Research (Holt, Reinhart and Winston, 1982).
MATH Google Scholar
Benaim, A. R. et al. Analyzing medical research results based on synthetic data and their relation to real data results: Systematic comparison from five observational studies. JMIR Medical Informatics 8(2), e16492. https://doi.org/10.2196/16492 (2020).
Article Google Scholar
El Emam, K., Mosquera, L., Jonker, E. & Sood, H. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open 4(1), ooab012. https://doi.org/10.1093/jamiaopen/ooab012 (2021).
Article PubMed PubMed Central Google Scholar
Mosquera, L. et al. A method for generating synthetic longitudinal health data. BMC Med. Res. Methodol. 23(1), 67. https://doi.org/10.1186/s12874-023-01869-w (2023).
Article PubMed PubMed Central Google Scholar
Kaplan, R. M. & Kronick, R. G. Marital status and longevity in the United States population. J. Epidemiol. Community Health 60(9), 760–765. https://doi.org/10.1136/jech.2005.037606 (2006).
Article PubMed PubMed Central Google Scholar
Kilpi, F., Konttinen, H., Silventoinen, K. & Martikainen, P. Living arrangements as determinants of myocardial infarction incidence and survival: A prospective register study of over 300,000 Finnish men and women. Soc. Sci. Med. 133, 93–100. https://doi.org/10.1016/j.socscimed.2015.03.054 (2015).
Article PubMed Google Scholar
Ikeda, A. et al. Living arrangement and coronary heart disease: The JPHC study. Heart 95(7), 577–583. https://doi.org/10.1136/hrt.2008.149575 (2009).
Article CAS PubMed Google Scholar
Schultz, W. M. et al. Marital status and outcomes in patients with cardiovascular disease. J. Am. Heart Assoc. 6(12), e005890. https://doi.org/10.1161/JAHA.117.005890 (2017).
Article PubMed PubMed Central Google Scholar
Dhindsa, D. S., Khambhati, J., Schultz, W. M., Tahhan, A. S. & Quyyumi, A. A. Marital status and outcomes in patients with cardiovascular disease. Trends Cardiovasc. Med. 30(4), 215–220. https://doi.org/10.1016/j.tcm.2019.05.012 (2020).
Article PubMed Google Scholar
Koskenvuo, M., Kaprio, J., Romo, M. & Langinvainio, H. Incidence and prognosis of ischaemic heart disease with respect to marital status and social class. A national record linkage study. J. Epidemiol. Community Health 35(3), 192–196. https://doi.org/10.1136/jech.35.3.192 (1981).
Article CAS PubMed PubMed Central Google Scholar
Schaefer, C., Quesenberry, C. P. & Wi, S. Mortality following conjugal bereavement and the effects of a shared environment. Am. J. Epidemiol. 141(12), 1142–1152. https://doi.org/10.1093/oxfordjournals.aje.a117387 (1995).
Article CAS PubMed Google Scholar
Fortier, I. et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int. J. Epidemiol. 46(1), 103–105. https://doi.org/10.1093/ije/dyw075 (2017).
Article PubMed Google Scholar
Drechsler, J. & Reiter, J. P. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243. https://doi.org/10.1016/j.csda.2011.06.006 (2011).
Article MathSciNet Google Scholar
Arslan, R. C., Schilling, K. M., Gerlach, T. M. & Penke, L. Using 26,000 diary entries to show ovulatory changes in sexual desire and behavior. J. Pers. Soc. Psychol. 121(2), 410–431. https://doi.org/10.1037/pspp0000208 (2021).
Article PubMed Google Scholar
Bonnéry, D. et al. The promise and limitations of synthetic data as a strategy to expand access to state-level multi-agency longitudinal data. J. Res. Educ. Effect. 12(4), 616–647. https://doi.org/10.1080/19345747.2019.1631421 (2019).
Article Google Scholar
Sabay, A., Harris, L., Bejugama, V. & Jaceldo-Siegl, K. Overcoming small data limitations in heart disease prediction by using surrogate data. SMU Data Sci. Rev. 1(3), 25 (2018).
Google Scholar
Michael Freiman, Amy Lauger, and Jerome Reiter, “Data Synthesis and Perturbation for the American Community Survey at the U.S. Census Bureau,” US Census Bureau, Working paper, 2017. [Online]. https://www.census.gov/library/working-papers/2018/adrm/formal-privacy-synthetic-data-acs.html. Accessed 24 February 2020.
B. Nowok, “Utility of synthetic microdata generated using tree-based methods,” presented at the UNECE Statistical Data Confidentiality Work Session, Helsinki, Oct. 2015. [Online]. https://unece.org/statistics/events/SDC2015. Accessed 24 February, 2020.
Raab, G. M., Nowok, B. & Dibben, C. Practical data synthesis for large samples. J. Priv. Confid. 7(3), 67–97. https://doi.org/10.29012/jpc.v7i3.407 (2016).
Article Google Scholar
Nowok, B., Raab, G. M. & Dibben, C. Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R 1. Stat. J. IAOS 33(3), 785–796. https://doi.org/10.3233/SJI-150153 (2017).
Article Google Scholar
Quintana, D. S. “A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis generation. eLife 9, e53275. https://doi.org/10.7554/eLife.53275 (2020).
Article PubMed PubMed Central Google Scholar
C. Little, M. Elliot, R. Allmendinger, and S. Samani, “Generative adversarial networks for synthetic data generation: A comparative study,” presented at the UNECE Expert Meeting on Statistical Data Confidentiality, Poznań, Poland: United Nations Economic Commission for Europe, Dec. 2021, p. 17. [Online]. https://unece.org/statistics/documents/2021/12/working-documents/generative-adversarial-networks-synthetic-data. Accessed 17 January 2022.
Taub, J., Elliot, M. & Sakshaug, W. The impact of synthetic data generation on data utility with application to the 1991 UK samples of anonymised records. Trans. Data Priv. 13(1), 1–23 (2020).
Google Scholar
Bühlmann, P. & Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22(4), 477–505. https://doi.org/10.1214/07-STS242 (2007).
Article MathSciNet MATH Google Scholar
G. Ke et al., “LightGBM: A Highly Efficient Gradient Boosting Decision Tree,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., Curran Associates, Inc., 2017, pp. 3146–3154. [Online]. http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf. Accessed 15 October 2020.
J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, in NIPS’12. Red Hook, NY, USA: Curran Associates Inc., 2012, pp. 2951–2959.
A. Niculescu-Mizil and R. A. Caruana, “Obtaining Calibrated Probabilities from Boosting,” arXiv:1207.1403 [cs, stat], Jul. 2012. [Online]. http://arxiv.org/abs/1207.1403. Accessed 21 October 2020.
M. Kull, T. S. Filho, and P. Flach, “Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, Apr. 2017, pp. 623–631. [Online]. https://proceedings.mlr.press/v54/kull17a.html. Accessed 30 December 2022.
Rubin, D. Discussion: Statistical disclosure limitation. J. Off. Stat. 9(2), 462–468 (1993).
Google Scholar
Raghunathan, T., Reiter, J. & Rubin, D. Multiple imputation for statistical disclosure control. J. Off. Stat. 19, 1–16 (2003).
Google Scholar
Reiter, J. P. Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18(4), 531–543 (2002).
Google Scholar
Reiter, J. Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29(2), 181–188 (2003).
Google Scholar
Loong, B., Zaslavsky, A. M., He, Y. & Harrington, D. P. Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS. Stat. Med. 32(24), 4139–4161. https://doi.org/10.1002/sim.5841 (2013).
Article MathSciNet PubMed PubMed Central Google Scholar
Azizi, Z. et al. Sex, gender, and cardiovascular health in Canadian and Austrian Populations. Can. J. Cardiol. 37(8), 1240–1247. https://doi.org/10.1016/j.cjca.2021.03.019 (2021).
Article PubMed Google Scholar
El Emam, K. Seven ways to evaluate the utility of synthetic data. IEEE Secur. Priv. 18(4), 56–59. https://doi.org/10.1109/MSEC.2020.2992821 (2020).
Article Google Scholar
Azizi, Z., Zheng, M., Mosquera, L., Pilote, L. & El Emam, K. Can synthetic data be a proxy for real clinical trial data ? A validation study. BMJ Open 11, e043497 (2021).
Article PubMed PubMed Central Google Scholar
Beaulieu-Jones, B. K. et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ. Cardiovasc. Qual. Outcomes 12(7), e005122. https://doi.org/10.1161/CIRCOUTCOMES.118.005122 (2019).
Article PubMed PubMed Central Google Scholar

Download references

Funding

The current study is part of the “Gender Outcomes International Group: to Further Well-being Development (GOING-FWD)”, a five-country consortium with the goal of integrating sex and gender in evaluating chronic disease outcomes. (http://gender-net-plus.eu/joint-call/funded-projects/going-fwd/). The GOING-FWD Consortium is funded by the GENDER-NET Plus ERA-NET Initiative (Project Ref. Number: GNP-78): The Canadian Institutes of Health Research (GNP-161904), "La Caixa" Foundation (ID 100010434) with code LCF/PR/DE18/52010001, The Swedish Research Council (2018-00932) and The Austrian Science Fund (FWF, I 4209). VR was funded by the Scientific Independence of Young Researcher Program of the Italian Ministry of University, Education, and Research (RBSI14HNVT). KEE was partially funded by the Canada Research Chairs program, a Discovery Grant RGPIN-2016-06781 from the Natural Sciences and Engineering Research Council of Canada, and by Replica Analytics Ltd.

Author information

These authors contributed equally: Zahra Azizi and Simon Lindner.
These authors jointly supervised this work: Louise Pilote and Khaled El Emam.

Authors and Affiliations

Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
Zahra Azizi, Yumika Shiba & Louise Pilote
Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
Simon Lindner & Alexandra Kautzky-Willer
Faculty of Medicine, McGill University, Montreal, Canada
Yumika Shiba
Department of Translational Medicine, University of Ferrara, Ferrara, Italy
Valeria Raparelli
Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
Valeria Raparelli & Colleen M. Norris
Heart and Stroke Strategic Clinical Networks, Alberta Health Services, Alberta, Canada
Colleen M. Norris
Karolinska Institute, Stockholm, Sweden
Karolina Kublickiene
Clinical & Experimental Neuroscience (NiCE-IMIB-IUIE), School of Medicine, University of Murcia, Murcia, Spain
Maria Trinidad Herrero
Section for Science of Complex Systems, CeMSIIS, Medical University of Vienna, Vienna, Austria
Peter Klimek
Complexity Science Hub Vienna, Vienna, Austria
Peter Klimek
Division of Endocrinology and Metabolism, Medical University of Vienna, Vienna, Austria
Teresa Gisinger
Divisions of Clinical Epidemiology and General Internal Medicine, McGill University Health Centre Research Institute, Montreal, QC, Canada
Louise Pilote
Children’s Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
Khaled El Emam
School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
Khaled El Emam
Replica Analytics Ltd, Ottawa, ON, Canada
Khaled El Emam

Authors

Zahra Azizi
View author publications
You can also search for this author in PubMed Google Scholar
Simon Lindner
View author publications
You can also search for this author in PubMed Google Scholar
Yumika Shiba
View author publications
You can also search for this author in PubMed Google Scholar
Valeria Raparelli
View author publications
You can also search for this author in PubMed Google Scholar
Colleen M. Norris
View author publications
You can also search for this author in PubMed Google Scholar
Karolina Kublickiene
View author publications
You can also search for this author in PubMed Google Scholar
Maria Trinidad Herrero
View author publications
You can also search for this author in PubMed Google Scholar
Alexandra Kautzky-Willer
View author publications
You can also search for this author in PubMed Google Scholar
Peter Klimek
View author publications
You can also search for this author in PubMed Google Scholar
Teresa Gisinger
View author publications
You can also search for this author in PubMed Google Scholar
Louise Pilote
View author publications
You can also search for this author in PubMed Google Scholar
Khaled El Emam
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Z.A., S.L., Y.S., C.N., V.R., K.K., M.H., A.K.W., T.G., P.K., L.P., K.E.E. designed the study, developed the questions and contributed to the development of conceptual framework. Z.A., S.L., K.E.E. and L.P. performed the main analysis, wrote the first draft and worked on critical revisions of the manuscript. Z.A., S.L., Y.S., C.N., V.R., K.K., M.H., A.K.W., T.G., P.K., L.P., and K.E.E. supervised the project from study design to final draft and approved the final version of the manuscript. K.E.E. and L.P. are the guarantor of the study and accept full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish. All authors agree to be accountable for all aspects of the work by ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding authors

Correspondence to Louise Pilote or Khaled El Emam.

Ethics declarations

Competing interests

This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the CHEO Research Institute. KEE is co-founder and has equity in this company. ZA, SL, YS, CN, VR, KK, MH, AKW, TG, PK, and LP have no competing interests to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Azizi, Z., Lindner, S., Shiba, Y. et al. A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health. Sci Rep 13, 11540 (2023). https://doi.org/10.1038/s41598-023-38457-3

Download citation

Received: 25 September 2022
Accepted: 08 July 2023
Published: 17 July 2023
DOI: https://doi.org/10.1038/s41598-023-38457-3

This article is cited by

An evaluation of the replicability of analyses using synthetic health data
- Khaled El Emam
- Lucy Mosquera
- Alaa El-Hussuna
Scientific Reports (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Genome-wide association studies

Interviews in the social sciences

Development and validation of a new algorithm for improved cardiovascular risk prediction

Introduction

Results

Privacy risks of synthetic data

Descriptive statistics

Comparison of pooled partially synthetic data and federated analysis results

Descriptive statistics

Determinants of cardiovascular health: univariable analysis

Determinants of cardiovascular health across countries: interaction analyses

Elapsed time comparisons

Discussion

Summary

Limitations and future work

Conclusions

Methods

Datasets used

Data synthesis method

Generative model

Combining rules for synthetic data

Assessing the privacy risks of the synthetic data

Statistical analysis

Outcome variable: cardiovascular health

Descriptive statistics on pooled dataset

Univariable and multivariable models on pooled dataset

Comparison between pooled partially synthetic data analysis and federated analysis

Ethics

Data availability

Abbreviations

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Publisher's note

Supplementary Information

Supplementary Information.

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

An evaluation of the replicability of analyses using synthetic health data

Comments

Search

Quick links