Fig. 1 | Nature Communications

Fig. 1

From: Estimating the success of re-identifications in incomplete datasets using generative models

Estimating the population uniqueness of the USA corpus. a For each population, we compare empirical and estimated population uniqueness (boxplots show the median and the 25th and 75th percentiles, with whiskers extending to at most 1.5 times the interquartile range (IQR); 100 independent trials per population). For example, date of birth, location (PUMA code), marital status, and gender uniquely identify 78.7% of the 3 million people in this population (empirical uniqueness), which our model estimates to be 78.2 ± 0.5% (boxplot in black). b Absolute error when estimating USA’s population uniqueness when the disclosed dataset is randomly sampled at fractions from 10% down to 0.1%. Each boxplot (25th, 50th, and 75th percentiles, 1.5 IQR) shows the distribution of the mean absolute error (MAE) for population uniqueness at one sampling fraction across all USA populations (100 trials per population and sampling fraction). The y axis shows both p, the sampling fraction, and \(n_{\mathcal{S}} = p \times n\), the sample size. Our model estimates population uniqueness very well for all sampling fractions, with the MAE increasing slightly when only a very small number of records is available (p = 0.1%, or 3061 records).
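The quantities named in the caption can be illustrated with a minimal Python sketch. It computes only the empirical side (the fraction of records whose attribute combination occurs exactly once), the sample size \(n_{\mathcal{S}} = p \times n\), and the MAE of a set of estimates. It does not implement the article's generative model. The function names and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def empirical_uniqueness(records):
    """Fraction of records whose attribute tuple appears exactly once."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)


def sample_size(p, n):
    """Sample size n_S = p * n for sampling fraction p and population size n."""
    return round(p * n)


def mean_absolute_error(estimates, truth):
    """Mean absolute difference between estimates and the true value."""
    return sum(abs(e - truth) for e in estimates) / len(estimates)


# Toy population (hypothetical values): each record is
# (date of birth, PUMA code, marital status, gender).
population = [
    ("1980-01-01", "03701", "married", "F"),
    ("1980-01-01", "03701", "married", "F"),  # duplicate, so not unique
    ("1975-06-15", "03702", "single", "M"),
    ("1990-12-31", "03701", "single", "F"),
]

uniqueness = empirical_uniqueness(population)    # 2 of 4 records are unique
n_s = sample_size(0.1, 1000)                     # 10% of 1000 records
mae = mean_absolute_error([0.4, 0.6], uniqueness)

print(uniqueness, n_s, mae)
```

In the figure, the empirical value plays the role of `truth`, and the 100 per-trial model estimates play the role of `estimates`.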