Introduction

Cardiovascular diseases (CVD) continue to represent the leading cause of mortality and morbidity amongst women and men worldwide1. Biological differences between the sexes such as anatomical and physiological variations in coronary arteries and the autonomic nervous system alter the development and progression of CVD2. However, the environment and lifestyle3 as well as individuals’ identity, roles, and relations in society may play an important role. These characteristics are gendered in a way that they affect males and females differently and evolve through early life to adulthood4. The effect of these gendered factors can vary between countries with different cultural and political biases5. Investigating the impact of sex on CVD across multiple countries requires pooling data sourced from these jurisdictions.

However, sharing and pooling health data across institutions and across national and international jurisdictions has been a challenge6. Privacy concerns are key barriers to data sharing and data access7,8, particularly in EEA countries where the General Data Protection Regulation (GDPR) imposes high standards for data sharing that are often difficult to meet in practice9,10. This raises a particular challenge given that the GDPR is serving as a template regulation around the globe10.

One approach to address such privacy concerns has been to perform a distributed data analysis whereby the analysis is performed within each dataset locally and then final results combined through a meta-analysis. As an example, in a study evaluating the effectiveness of different statins in each of 3 Canadian provinces11, the hazard ratios for different statins were combined using a fixed-effects model, with weight being the inverse of the variance of the province-specific parameter estimate12. However, because the same analysis needs to be executed multiple times by different teams in each province, this general approach has not resulted in timely results in practice13.
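The fixed-effects combination described above can be sketched as follows. This is a minimal illustration of inverse-variance weighting, not the code used in the cited study; the function name and the three site-level estimates are hypothetical.

```python
import math


def fixed_effects_pool(estimates, variances):
    """Inverse-variance weighted (fixed-effects) pooled estimate and its variance."""
    # Each site's weight is the inverse of the variance of its estimate.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # The variance of the pooled estimate is the inverse of the summed weights.
    pooled_variance = 1.0 / total_weight
    return pooled, pooled_variance


# Hypothetical log hazard ratios and variances for three sites (illustrative only).
log_hr = [0.10, 0.20, 0.30]
var = [0.04, 0.02, 0.04]

pooled_log_hr, pooled_var = fixed_effects_pool(log_hr, var)
pooled_hr = math.exp(pooled_log_hr)  # back-transform to the hazard ratio scale
```

Pooling is done on the log scale because that is where the estimate is approximately normally distributed; the site with the smallest variance contributes the most to the pooled value.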

Another option which can enable the timely sharing of datasets in a privacy protective manner is synthetic data generation (SDG)14,15. There have been multiple synthetic health data releases in the US16,17, the UK18,19,20 and other European countries21,22. None of these efforts pooled datasets across jurisdictions to enable cross-country analysis.

In this study, we evaluate whether SDG can be applied for pooling data to enable international comparative studies. Our objective was to assess country differences of the effect of sex on the cardiovascular health (CVH) of Canadian and Austrian populations. The datasets used in this study were from the Canadian Community Health Survey (CCHS) that was administered in 2014 in Canada (n = 63,522), and the Austria Health Interview Survey (ATHIS 2014, n = 15,771) which was conducted as part of the European Health Interview Survey series. These surveys collect information on health status, psychosocial factors, and healthcare resource utilization.

The Canadian data was synthesized and sent to Austria to be pooled with the original Austrian dataset, and a multivariable regression model was constructed from the pooled dataset. To generate synthetic data, we used sequential classification and regression trees23,24. The results were compared to the ground truth results obtained through a federated analysis on the source data. The federated analysis was performed using the DataSHIELD method and tools25. The DataSHIELD approach exchanges intermediate results among the nodes, which means that its analysis gives the same results as those obtained from pooling the original datasets. The study workflow is shown in Fig. 1.

Figure 1. The data synthesis and federated analysis workflow.

Starting from relatively similar points with the availability of robust software to perform federated analysis and SDG, and a relatively good knowledge of privacy enhancing technologies within the team, the federated analysis took eighteen months to set up operationally and obtain results, whereas the SDG approach took in total one month to set up, install, and execute. Therefore, testing whether an SDG method can produce the same results as those obtained from source data and if such an approach can be privacy protective, could enable significantly more efficient pooling of data across jurisdictions.

Results

Privacy risks of synthetic data

The privacy of the synthetic CCHS data was assessed using a membership disclosure test (step 3 in Fig. 1). Membership disclosure risk assessment is a common way to evaluate the privacy risks in synthetic datasets26,27,28,29. Membership disclosure occurs when an adversary, using the information in the synthetic data, determines that a real target person was included in the original dataset used as input for synthetic data generation (i.e. was a member of the training dataset). Knowing that an individual was in the training data can reveal sensitive attributes about that individual.

The relative membership disclosure F1 score30 was 0.001, indicating that the ability for an adversary to predict membership is quite poor. The low value means that the synthetic Canadian dataset can be deemed as having low disclosure risks.

Descriptive statistics

The CCHS cycle 2014 included 55.3% females, while the ATHIS cycle 2014 included 55.7% females (Table 1). The Austrian participants were slightly younger than the Canadians. However, in the Canadian dataset females were slightly older than males (p < 0.001), whereas ages were similar between the sexes in the Austrian dataset (p = 0.32). There was a small difference in hypertension between males and females in the Canadian dataset (M vs. F: 24.2% vs. 25.1%), and in the Austrian dataset (M vs. F: 21.4% vs. 18.9%). In the Austrian dataset there were more females that were immigrants (M vs. F: 7.6% vs. 9.6%) compared to the Canadian dataset where there was no difference in immigration status (M vs. F: 14.5% vs. 14.4%). Otherwise, the two datasets were similar in terms of male vs. female comparisons with the following patterns: more females had a lower BMI, more males had diabetes and were smokers, more females were divorced or widowed, more females lived in single occupant households, and more females lived in low- or medium-income households.

Table 1 Comparison of baseline characteristics for the Canadian and Austrian datasets.

Comparison of pooled partially synthetic data and federated analysis results

Descriptive statistics

A comparison of the marginal distributions between males and females in Table 2 showed consistently similar results in the federated and pooled analyses of partially synthetic data across all variables, with the standardized mean differences (SMD) consistently below the 0.1 threshold31.

Table 2 Descriptive statistics for the federated and pooled analysis.

Males tended to be younger, there were more females with normal BMI (M vs. F: 39.9% vs. 52.9%), more males had diabetes (M vs. F: 9.2% vs. 7.5%) and were smokers (M vs. F: 24.1% vs. 18.8%). There were more males that were single (M vs. F: 32.5% vs. 25.3%), and males were more likely to be in a household with a high income (M vs. F: 53.9% vs. 44.4%). Females were more common in single-person households (M vs. F: 21.3% vs. 28.4%) and were more likely to be divorced or widowed (M vs. F: 12.9% vs. 25.3%). There were no significant differences between sexes on hypertension (M vs. F: 23.6% vs. 23.9%), post-secondary education and higher (M vs. F: 51.7% vs. 51.2%), and whether the individual was an immigrant (M vs. F: 13.1% vs. 13.4%).

Determinants of cardiovascular health: univariable analysis

The outcome variable of interest was CVH calculated through a modified CANHEART index in both countries32. Overall, 70.7% of Canadians and 67.9% of Austrians had a CANHEART score greater than three.

Table 3 shows the parameter estimates, confidence intervals, and p-values for the pooled and federated univariable regression analysis. The results were similar between the two methods of analysis, with the substantive conclusions being the same from both approaches.

Table 3 Univariable linear regression using the federated and pooled analysis.

Females had better CVH than males (pool vs. fed: 0.18 vs. 0.19), as well as individuals in larger households (pool vs. fed: 0.19 vs. 0.18) and immigrants (pool vs. fed: 0.09 vs. 0.09). Older individuals had worse CVH (pool vs. fed: −0.17 vs. −0.17), as well as divorced/widowed individuals (pool vs. fed: −0.61 vs. −0.6) and common-law/married individuals (pool vs. fed: −0.41 vs. −0.4) compared to single individuals. Lower income individuals also had worse CVH (pool vs. fed: −0.19 vs. −0.18). There was a weak positive relationship between higher education and CVH (pool vs. fed: 0.04 vs. 0.04). The weakest relationship was between country and CVH whereby the effect size was similar between federated analysis and pooled analysis (−0.04 vs. −0.03), indicating slightly worse CVH among the Austrian respondents.

Determinants of cardiovascular health across countries: interaction analyses

In the multivariable analysis of the main effects, the parameter estimates of the federated and pooled analysis were directionally the same as for the univariable analysis, and the comparison between the federated and pooled analysis yields the same conclusions as for the univariable analysis (see Table 4).

Table 4 Multivariable main effects models for predicting CVH in federated and pooled analyses.

In the multivariable analyses considering the country interactions to determine whether country moderates the relationship between the other variables and CVH, the impact of several factors differed between countries (Table 5). For example, although males in Austria had lower CVH than males in Canada, females in Austria had better CVH than females in Canada. Also, at lower levels of education, CVH was lower among the Austrian respondents, but this country difference reversed as education levels increased: at the highest level of education Austrians had better CVH than Canadians. Immigrants had better CVH in Canada compared to Austria, but worse CVH than non-immigrants in both countries.

Table 5 Multivariable model with country interactions for federated and pooled analysis.

There is one difference in the interaction parameters between the federated and pooled models. While the significance of the interaction parameter for being married differs between the two approaches, the substantive conclusions are the same: being married is associated with lower CVH in both countries, and CVH is lower in Austria than in Canada irrespective of marital status.

The effect size for the country variable is larger in the interaction model compared to the univariable model and main effects only multivariable models. The interaction model assumes a contingency effect of country and therefore the country parameter should not be interpreted by itself33.

Elapsed time comparisons

A significant amount of time was needed to set up the necessary servers in multiple locations with the requisite security protocols for the federated analysis (these servers hold the original sensitive datasets and needed to be accessible remotely from a different jurisdiction, requiring the introduction of additional security protocols and checks), and to obtain the necessary approvals (Table 6). The programming required for DataSHIELD had to be done anew since common regression R packages used by the analysts were not usable in a federated context. Once the multiple nodes have been set up, the processing speeds are comparable.

Table 6 The difference in elapsed time between the federated analysis and the pooled analysis.

These values demonstrate the relative advantage of synthetic data. An important context here is that the DataSHIELD system was being set up in two academic medical centers, which may have an impact on timing. In addition, this work was done during the COVID-19 pandemic, which would have impacted the speed at which multi-institutional and multi-jurisdictional projects progressed.

Discussion

Summary

Our results highlight the country-specific effects of sex on CVH and demonstrate slightly better CVH in Canadians compared with Austrians. Low household income and not being single (that is, being married, in a common-law relationship, divorced, or widowed) were associated with worse CVH, while female sex, greater household size, higher level of education, and being an immigrant were associated with better CVH in both the federated and pooled analyses. The magnitude of these factors differed between Austria and Canada.

The results of this secondary analysis of population-based datasets show that synthetic data generation methods using sequential classification and regression trees can be used to pool datasets across countries for international studies. The analytical conclusions from the models developed using the pooled partially synthetic dataset were the same as those from the ground truth models developed using federated analysis across all analytical steps, including the descriptive analysis, the univariable analysis, and the multivariable main effects and country interaction models. While previous observational studies have compared synthetic and real data34,35,36, there has been no population-based study testing the use of SDG for pooling datasets across jurisdictions and comparing it to a federated approach.

We provided evidence that synthetic data has similar utility compared to the ground truth generated through federated analysis. While there was one difference in regression model parameters, this was for a weak effect size. Where weak effects are important then the pooled partially synthetic data can be used for exploratory analysis to validate assumptions while procedures for the exchange of the original data are set up.

The significantly lower effort in getting to the results using synthetic data can enable researchers to efficiently share data across jurisdictions. Data synthesis was completed in approximately one month whereas it took eighteen months to set up the federated analysis system across two nodes. It is expected that further substantial work would be needed to set up additional nodes to accommodate the inclusion of other countries in the international analysis.

The use of synthetic data will allow merging a variety of population-based databases globally and across jurisdictions nationally and internationally. For our specific work, this would allow us to assess the association of sex with the cardiovascular health of populations while evaluating the effect of geo-politico-cultural differences in disease risk.

We found that being divorced, widowed, or married was associated with worse CVH compared to being single. Similar results were obtained in an analysis of data from the US, where single participants had better health habits and lower preventable risk factors than married/widowed or divorced in the National Health Interview Survey37. While singles might have better CVH, evidence for the mortality rate from CVD in single participants compared to married participants is still inconsistent38,39,40,41. Studies have identified the increased prevalence of non-traditional CVH risk factors including stress, depression, recreational drugs, and other socioeconomic risks in non-married groups that can indeed impact these subjects additionally42. This may explain the greater risk of CVD and mortality in non-married compared to married subjects in those studies. It is also reported that acute stressors (spousal death, divorce) are even greater in those who are widowed or divorced43, which may contribute more strongly to the development of CVD in these groups than in the single and married groups in our study.

Lower socioeconomic status is associated with increased risk of CVD and mortality3. Our results are generally supportive, demonstrating a positive effect of higher education. There was significant interaction between many covariates and country. Males in Austria have worse CVH than males in Canada. Also, at lower levels of education CVH is worse among the Austrian respondents, but this country-specific effect reverses as education levels increase: at the highest level of education Austrians seem to have better CVH than Canadians. Moreover, immigrants have better CVH in Canada than in Austria, and non-immigrants have better CVH than immigrants overall, with their CVH also higher in Canada. Being married is associated with worse CVH in both countries, and CVH is lower in Austria than in Canada across all values of marital status. These results suggest that the groups to be targeted for improving CVH are country specific.

Limitations and future work

One of the limitations of our study is using only a single data synthesis method. Application of other types of data synthesis and comparing the utility of those methods with those from the current study is recommended in future studies. We only pooled two datasets. Multi-jurisdictional studies may pool datasets across more than two jurisdictions, and we did not test utility when multiple datasets are synthesized and pooled.

Other methods for privacy-preserving analysis of multi-jurisdictional data include performing a meta-analysis. However, because the same, potentially complex, analyses must be performed multiple times, the timeliness of this approach has in practice proven to be challenging13. The use of synthetic data generation can help accelerate the time to results.

Conclusions

Our results indicate high utility for the pooled partially synthetic dataset, and low privacy risks for the synthetic data, in addition to an elapsed time advantage when compared to the federated analysis platform. Our analysis identified factors with a differential effect on CVH depending on country where a person lives. Hence, interventions will need to be country specific.

Methods

The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals.

Datasets used

The CCHS and ATHIS variables/questions that were used in our analysis are included in Supplementary Material A. The first step in the workflow (see Fig. 1) was to harmonize the datasets using Maelström research guidelines for retrospective data44.

Data synthesis method

Generative model

We used a sequential synthesis method using sequence-optimized decision trees24. With sequential synthesis models, a variable is synthesized by using the values earlier in the sequence as predictors. All variables used in the analysis were synthesized (step 2 of the workflow as illustrated in Fig. 1). Only the CCHS dataset was synthesized.
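The sequential principle can be illustrated with a small sketch. Note the assumption: conditional frequency tables stand in for the decision trees used in the study, purely to keep the example short, and the variable names and records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_sequential(records, order):
    """For each variable, tabulate its values conditional on all earlier variables."""
    models = []
    for k, name in enumerate(order):
        table = defaultdict(Counter)
        for rec in records:
            context = tuple(rec[v] for v in order[:k])
            table[context][rec[name]] += 1
        models.append(table)
    return models


def synthesize(models, order, n, rng):
    """Generate n records, sampling each variable given the values already synthesized."""
    synthetic = []
    for _ in range(n):
        rec = {}
        for k, name in enumerate(order):
            context = tuple(rec[v] for v in order[:k])
            counts = models[k][context]
            values = list(counts)
            rec[name] = rng.choices(values, weights=[counts[v] for v in values])[0]
        synthetic.append(rec)
    return synthetic


# Hypothetical categorical records (illustrative only).
records = [
    {"sex": "F", "smoker": "no", "hypertension": "no"},
    {"sex": "F", "smoker": "no", "hypertension": "yes"},
    {"sex": "F", "smoker": "yes", "hypertension": "no"},
    {"sex": "M", "smoker": "yes", "hypertension": "yes"},
    {"sex": "M", "smoker": "no", "hypertension": "no"},
]
order = ["sex", "smoker", "hypertension"]

models = fit_sequential(records, order)
synthetic = synthesize(models, order, 20, random.Random(42))
```

Because this toy version conditions on the complete history without any grouping, it can only reproduce combinations present in the input. Tree-based models instead group similar histories into leaves, which is what allows new combinations to be generated.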

Sequential trees have been used to synthesize health and social sciences data45,46,47,48,49,50,51,52,53, and applied in research studies on synthetic data45,54,55. Additional improvements were implemented to the basic sequential synthesis method for this study. Each model in the sequence was trained using a gradient boosted decision tree56,57 with Bayesian optimization for hyperparameter selection58. Each combination of hyperparameters was selected using fivefold cross validation on the training dataset during tuning.

In the context of the synthesis of categorical variables, synthetic values are generated based on the predicted probabilities. In general, boosted trees do not output correct probabilities and these need to be calibrated, especially as the number of iterations increases59. In addition, for imbalanced categorical outcomes, the model is trained with larger weights for the minority class, which gives incorrect probabilities. Therefore, the predicted probabilities are adjusted using beta calibration60.

Each continuous variable \(X_{i}\) was first transformed to a Gaussian distribution. The empirical cdf \(F_{i}\) was applied to each variable, giving \(F_{i} (X_{i} )\), and then the quantile function of the standard normal distribution was applied, \(\Phi^{ - 1} (F_{i} (X_{i} ))\); the result was passed to the synthesis step. After synthesis, the generated values \(\hat{X}_{i}\) were converted back to the original scale as \(F_{i}^{ - 1} (\Phi (\hat{X}_{i} ))\).
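A minimal sketch of this transformation and its inverse is given below. Two details are assumptions made for the illustration and are not stated in the text: the empirical cdf is scaled by n + 1 so that it never reaches 0 or 1, and the inverse is taken as the nearest order statistic. The function names and ages are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def to_gaussian(values):
    """Map each value x to Phi^{-1}(F(x)), with F the empirical cdf scaled by n + 1."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right counts observations <= x; dividing by n + 1 keeps F inside (0, 1).
    return [_STD_NORMAL.inv_cdf(bisect_right(ordered, x) / (n + 1)) for x in values]


def from_gaussian(z_values, reference):
    """Map synthetic z values back to the original scale via F^{-1}(Phi(z))."""
    ordered = sorted(reference)
    n = len(ordered)
    restored = []
    for z in z_values:
        u = _STD_NORMAL.cdf(z)
        # Invert the scaled empirical cdf: take the order statistic of rank u * (n + 1).
        rank = min(max(round(u * (n + 1)), 1), n)
        restored.append(ordered[rank - 1])
    return restored


ages = [34.0, 51.0, 27.0, 68.0, 45.0]  # hypothetical values
z = to_gaussian(ages)
recovered = from_gaussian(z, ages)
```

Applied to unmodified z values the round trip returns the original observations, and any synthetic z value is mapped to a value on the original scale.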

Combining rules for synthetic data

The original proposal for synthetic data generation treated it as a form of multiple imputation61. Under the multiple imputation model, multiple datasets, say m, are synthesized and combining rules are used to compute the parameter estimates and variances for partial synthesis across the m synthetic datasets62,63. Such corrections for the parameter estimates and variances ensure that the variability introduced by the synthesis process is accounted for when making population inferences from synthetic datasets.

In the context of the current study, a partial synthesis is performed in that only the Canadian dataset is replaced with the synthetic version.

For a particular model parameter, let \(q_{i}\) denote its estimate and \(v_{i}\) its variance obtained from synthetic dataset \(i\), where \(i = 1, \ldots, m\). The adjustments for the model parameters and variances are as follows51,64,65. The combined model parameter \(\overline{q}_{m}\) is the mean across the m model parameters from the synthetic datasets, \(\overline{q}_{m} = \frac{1}{m}\sum\limits_{i = 1}^{m} {q_{i} }\), and \(\overline{v}_{m}\) is the mean variance across the m model parameters from the synthetic datasets, \(\overline{v}_{m} = \frac{1}{m}\sum\limits_{i = 1}^{m} {v_{i} }\). The between-imputation variance is given by \(b_{m} = \frac{1}{m - 1}\sum\limits_{i = 1}^{m} {\left( {q_{i} - \overline{q}_{m} } \right)^{2} }\), the adjusted variance is computed as \(T_{p} = \frac{b_{m}}{m} + \overline{v}_{m}\), and the adjusted large sample 95% confidence interval of the model parameter is computed as \(\overline{q}_{m} \pm 1.96\sqrt {T_{p} }\). For this study we set \(m = 10\), which is consistent with current practice for the analysis of synthetic data51,55,64,65.
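The combining rules above translate directly into a few lines of code. The sketch below is illustrative: the function name is hypothetical, and the example uses four made-up estimates rather than the ten synthetic datasets of the study.

```python
import math
from statistics import mean, variance


def combine_partial_synthesis(estimates, variances, z=1.96):
    """Combine per-dataset estimates q_i and variances v_i across m synthetic datasets."""
    m = len(estimates)
    q_bar = mean(estimates)      # combined parameter estimate
    v_bar = mean(variances)      # mean within-dataset variance
    b_m = variance(estimates)    # between-imputation variance (m - 1 denominator)
    t_p = b_m / m + v_bar        # adjusted variance for partial synthesis
    half_width = z * math.sqrt(t_p)
    return q_bar, t_p, (q_bar - half_width, q_bar + half_width)


# Hypothetical estimates of one regression coefficient from m = 4 synthetic datasets.
q = [0.16, 0.18, 0.20, 0.18]
v = [0.0004, 0.0004, 0.0004, 0.0004]

q_bar, t_p, ci = combine_partial_synthesis(q, v)
```

The between-imputation term is divided by m, so its contribution shrinks as more synthetic datasets are generated and the within-dataset variance dominates the adjusted variance.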

Assessing the privacy risks of the synthetic data

Privacy risk was evaluated using membership disclosure on the ten pooled synthetic datasets. The accuracy of a membership disclosure attack can be measured using the relative F1 score30, which indicates the ability of an adversary to correctly determine the membership status of a record. The details of the method to compute membership disclosure are provided in Supplementary Material C.

Once deemed to have low privacy risks, the synthetic dataset was sent to the Austrian team for analysis. The Austrian team pooled the source ATHIS dataset with the synthetic CCHS dataset and built the regression models described below. This is referred to as the “pooled” dataset.

Statistical analysis

The analysis was performed on the pooled source ATHIS data and the synthetic CCHS data (steps 4 and 5 in Fig. 1).

Outcome variable: cardiovascular health

Our measure of CVH was the CANHEART index. The original CANHEART index was composed of the sum of the ideal metrics for 6 cardiometabolic risk factors and behaviors: history of smoking, leisure physical activity, daily fruit and vegetable consumption, body mass index, diabetes and hypertension32. However, due to harmonization limitations, we had to create a modified version using the variables available in both datasets. The modified CANHEART index was calculated using the smoking, body mass index (BMI), diabetes and hypertension variables (see Supplementary Material B). This score ranges from 0 (worst) to 4 (best or ideal cardiovascular health).
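The scoring logic of the modified index can be sketched as a count of ideal metrics. The thresholds used below (non-smoker, BMI below 25, no diabetes, no hypertension) are assumptions made for illustration only; the definitions actually used are those given in Supplementary Material B, and the function name is hypothetical.

```python
def modified_canheart(smoker, bmi, diabetes, hypertension):
    """Count how many of the four metrics are in their assumed ideal state (0 to 4)."""
    ideal = [
        not smoker,        # assumed ideal: does not smoke
        bmi < 25.0,        # assumed ideal: BMI below 25 kg/m^2
        not diabetes,      # assumed ideal: no diabetes
        not hypertension,  # assumed ideal: no hypertension
    ]
    return sum(ideal)


best = modified_canheart(smoker=False, bmi=22.0, diabetes=False, hypertension=False)
worst = modified_canheart(smoker=True, bmi=31.0, diabetes=True, hypertension=True)
middle = modified_canheart(smoker=False, bmi=27.0, diabetes=False, hypertension=True)
```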

For youth, the original CANHEART index did not include hypertension and diabetes in the score due to their low prevalence in that group. However, the index with these scores included has been validated in the juvenile population in a previous study66.

Descriptive statistics on pooled dataset

The SMD was used to statistically compare the federated and pooled datasets. The SMD was selected because, given our large sample size, small and clinically unimportant differences are likely to be statistically significant when using t-tests or chi-squared tests. The SMD between the federated and pooled datasets was computed for each synthetic dataset generated and then averaged across all of them. An SMD greater than 0.1 is deemed a potentially clinically important difference, a threshold often recommended for declaring imbalance in pharmacoepidemiologic research31.
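A minimal sketch of the SMD computation follows. It uses the commonly applied pooled-standard-deviation form for continuous and binary variables, which is assumed here rather than taken from the text; the function names and the proportions are hypothetical.

```python
import math
from statistics import mean


def smd_continuous(mean_a, sd_a, mean_b, sd_b):
    """Absolute standardized mean difference for a continuous variable."""
    return abs(mean_a - mean_b) / math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)


def smd_binary(p_a, p_b):
    """Absolute standardized mean difference for a binary variable (proportions)."""
    pooled_var = (p_a * (1 - p_a) + p_b * (1 - p_b)) / 2
    return abs(p_a - p_b) / math.sqrt(pooled_var)


# Hypothetical proportion from the federated analysis and from three synthetic datasets.
federated_p = 0.236
synthetic_ps = [0.239, 0.233, 0.241]

# Compute the SMD per synthetic dataset, then average, as described above.
average_smd = mean(smd_binary(federated_p, p) for p in synthetic_ps)
below_threshold = average_smd < 0.1
```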

Univariable and multivariable models on pooled dataset

Both univariable and multivariable linear regression models were used to determine the association between the predictors and cardiovascular health. The multivariable regression model had as predictors the following variables: sex, education level, marital status, household size, household income, immigrant status, age, and country. Goodness of fit was evaluated with \(R^{2}\) for each model.

Comparison between pooled partially synthetic data analysis and federated analysis

One common measure of the utility of synthetic datasets is that the data analysis results using synthetic data are similar to the analysis results using the real data (ground truth results) and that the conclusions are the same67. It is quite common to evaluate the utility of synthetic data generation techniques using this approach34,35,68,69. In our case, the ground truth results using federated analysis served as our real data results.

The utility of the pooled dataset was evaluated by comparing the pooled data regression model with the model constructed from a federated analysis which used both source datasets25. The federated analysis approach gives the correct results as it does not involve any distortion of the variables. The two nodes of the system were in Montreal and Vienna. A distributed analysis on the horizontally partitioned dataset was performed by exchanging interim regression results between the two nodes. Because no raw data is exchanged among the nodes the interim results sharing is not deemed to be a disclosure of personal health information (step 6 in Fig. 1).
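The idea that exchanging interim results reproduces the pooled-data fit can be illustrated with a simple linear regression on horizontally partitioned data. This sketch is not the DataSHIELD API or its algorithm; it is a simplified stand-in in which each node shares only summary sums, and all names and data are hypothetical.

```python
def node_summary(xs, ys):
    """Summary statistics a node can share without disclosing individual records."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def fit_from_summaries(summaries):
    """Least-squares intercept and slope computed from the summed node summaries."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data held at two nodes (rows are split across sites).
node_a_x, node_a_y = [0, 1, 2], [1, 3, 5]
node_b_x, node_b_y = [3, 4], [7, 9]

# Federated fit: only the summaries leave each node.
federated = fit_from_summaries(
    [node_summary(node_a_x, node_a_y), node_summary(node_b_x, node_b_y)]
)

# Reference fit on the physically pooled rows.
pooled = fit_from_summaries(
    [node_summary(node_a_x + node_b_x, node_a_y + node_b_y)]
)
```

Because the sums are additive across nodes, the federated and pooled fits coincide, which is the property that makes the federated analysis a suitable ground truth.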

If the pooled partially synthetic data is a good proxy for the pooled source data then we would expect the conclusions from the pooled analysis to be the same as the conclusions from federated analysis (step 7 in Fig. 1).

Ethics

The study was approved by the research ethics boards of the McGill University Health Center (Project #2020–5452) and the Medical University of Vienna (1859/2019). All methods were carried out in accordance with relevant guidelines and regulations. Given that the datasets come from national surveys conducted by national statistical offices in each country (Statistics Canada and Statistik Austria), the respondents provided informed consent for the data collection and to the conditions for disclosing the data for further research.