Main

The rapid growth of AI and machine learning is reshaping global efforts and 'technology for good' programmes, impacting the lives of billions. These advancements set new benchmarks in accuracy, exemplified by AI surpassing human experts in diagnostics1 and complex games2. However, concerns regarding bias, privacy and manipulation have emerged, prompting the formulation of principles of responsible AI to advocate for safe, ethical and trustworthy AI development3. Within the AI development pipeline, data collection and annotation stages hold paramount importance, influencing the overall system performance. A recent report by NIST emphasizes how reliance on large-scale datasets can introduce bias during AI model training and evaluation4. The data-driven nature of contemporary AI algorithms exacerbates these biases, as any anomalies within the training data can directly impede the learning process. Despite its critical role, data quality often remains undervalued within the AI/ML research community5.

The discourse surrounding dataset quality in AI research is evolving, encompassing both qualitative and quantitative evaluation methods5,6. There is growing recognition of the importance of data quality alongside algorithmic efficiency and model trustworthiness. Past research has established qualitative assessment of dataset quality through methods such as interviews and discussions7,8. Notably, Gebru and co-workers proposed comprehensive datasheets for released datasets to raise transparency and accountability6. This aligns with frameworks by Hutchinson et al. for building responsible data development processes9. Efforts are underway to address data representation, with Kamikubo and colleagues analysing accessibility datasets10. Furthermore, Miceli et al. conducted a qualitative study on documenting context within image datasets11; Paullada et al. further advocated for the combined use of qualitative and quantitative measures in dataset development12. The discussion around data quality extends to socio-cultural data collection, with calls for institutional frameworks and procedures inspired by archival practices to address consent, inclusivity and privacy concerns13. Similarly, research by Peng and co-workers analyses problematic datasets and highlights the need for improved development pipelines14. In the field of NLP, researchers have proposed data statements to understand bias and intent15. Birhane et al. emphasized the importance of consent and responsible curation in large-scale vision datasets16.

More recently, researchers have also discussed the impact of regulations and policies on trustworthy AI17, and data privacy laws, inspired by the General Data Protection Regulation (GDPR), are being implemented globally18,19,20. The GDPR regulates biometric data processing and grants individuals control over their information21. Various research efforts have explored the GDPR’s impact on AI deployment and data management22,23. In Table 1, we summarize data privacy laws for selected countries around the world. Other laws apply to specific kinds of data, such as the Health Insurance Portability and Accountability Act (HIPAA)24 for health data in the US, and the Biometric Information Privacy Act (BIPA)25, which protects biometric information in the state of Illinois, United States. The European Commission’s Ethics Guidelines and the proposed AI Act further emphasize responsible AI development26,27. Recent work has also discussed the impact of the Artificial Intelligence Act on facial processing applications28.

Table 1 Some of the laws surrounding data privacy around the world other than the GDPR21

Despite extensive evaluations, quantitative assessments of dataset fairness and privacy remain limited. Although multiple efforts make use of quantitative and statistical measures in varied contexts of fairness and privacy29,30, no existing work jointly estimates the responsibility of a dataset. Existing works quantify bias in biometric data at the model level31,32 and audit existing algorithms33,34,35, but few explore the impact at the dataset level. Dulhanty et al. propose a model-driven framework for measuring fairness but lack joint quantification of fairness metrics36. Toolkits for evaluating bias through annotations exist37, alongside statistical analyses of relationships between data variables38. Researchers have explored various approaches to quantify privacy in datasets. Li et al. leverage human labelling to identify privacy-sensitive information in images39, whereas Gervais et al. infer locations from purchase history data40. Orekondy et al. propose an algorithm to predict privacy risk in images41. Furthermore, metrics such as l-diversity42, k-anonymity29, t-closeness43 and M-invariance44 quantify privacy leakage by considering factors such as an adversary with knowledge, sensitive attribute representation and protection against specific attacks29,42,43.

Although various studies have emphasized distinct challenges in dataset collection, very few have examined the dimensions of fairness, privacy and regulatory compliance within datasets, specifically through the analysis of dataset annotations. This study leverages concepts from the widely recognized ethical values of AI45,46,47,48 to present a large audit of computer vision datasets focusing on fairness, privacy and regulatory compliance as key dimensions (Fig. 1). The audit presents examples from datasets in biometrics and healthcare and introduces a framework for evaluating the compliance of datasets with fairness, privacy and regulatory norms, values that have been scarcely explored in past literature. Although there are additional ethical considerations that could be evaluated, our work focuses on the assessment of these important responsible AI values. There may be other aspects of responsibility, such as equity or reproducibility, where data play a crucial role49. Further research is necessary to expand the evaluation to encompass a broader range of ethical principles crucial to responsible AI.

Fig. 1: Responsible ML datasets.

Quantifying dataset responsibility across factors of fairness, privacy and regulatory compliance.

To conduct the audit, this work introduces a responsible rubric to assess machine learning datasets (especially those in the biometric and healthcare domains) such as face-recognition and chest X-ray datasets. After reviewing over 100 datasets and excluding those unsuitable due to size or lack of accessibility, we applied our framework to 60 datasets. The proposed framework attempts to quantitatively assess the trustworthiness of training data for ML models in terms of fairness, privacy risks and adherence to regulatory standards. The framework evaluates datasets on diversity, inclusivity and the reliability of annotations for fairness; identifies sensitive annotations for privacy evaluation; and verifies compliance with regulations.

On the basis of the quantitative evaluation, our detailed analysis showcases that many datasets do not adequately address fairness, privacy and regulatory compliance, highlighting the need for better curation practices. We highlight a fairness–privacy paradox, wherein the inclusion of sensitive attributes to enhance fairness might inadvertently risk privacy. Given the complexities of quantitatively examining datasets, we offer recommendations to improve dataset accountability and advocate for the integration of qualitative evaluations, such as datasheets, to promote dataset responsibility.

Quantification of the responsible rubric

The quantification of the three parameters is summarized below (see Fig. 2) and is further detailed in the Methods.

Fig. 2: Quantification framework.

a, The factors involved in fairness quantification: inclusivity, diversity and labels, and the formulation employed for the calculation of the fairness score. b, The factors involved in privacy quantification. c, The factors considered in the quantification of regulatory compliance.

Quantifying fairness (F)

We consider the impact of three factors for quantifying dataset fairness: diversity, inclusivity and labels (see Fig. 2a). Inclusivity quantifies whether different groups of people are represented in the dataset across gender, skin tone, ethnic group and age parameters. These align with common fairness evaluation practices in deep learning research using biometric and healthcare data50,51. Although other demographics such as disability or income are important, they lack annotations in these datasets and therefore could not be used (Supplementary Table 1 details the subgroups considered). We acknowledge that there may be biased subsets of data that are left unexplored by limiting the variables to the chosen demographic groups52. We further acknowledge limitations in gender categorization (male/female) from existing datasets53 and include an 'Other' option. Ethnicity subgroups are inspired by FairFace54, with an additional 'mixed-race' category. Age classifications follow the AgeDB dataset55. Diversity quantifies the distribution of these groups in the dataset, with an assumption that a balanced dataset is the most fair. Although a balanced dataset does not guarantee equal performance, existing work has shown improved fairness with the use of balanced datasets56,57. We note that such a dataset may not be ideal in many cases, but it acts as a simplifying assumption for the proposed formulation. Finally, we consider the reliability of the labels depending on whether they have been self-reported by the individuals in the dataset or are annotated based on apparent characteristics.

Quantifying privacy (P)

To quantify privacy leakage in the publicly available datasets, we identify vulnerable label annotations that can lead to the leakage of private information; the dataset’s annotated labels are thus used to quantify potential privacy leakage. Following a comprehensive review, we identify six attributes that are most commonly encountered in these domains58. These are name identification, sensitive and protected attributes, accessories, critical objects, location inference and medical condition.

Quantifying regulatory compliance (R)

The regulatory compliance score in the dataset is quantified on the basis of three factors: institutional approval, the individual's consent to the data collection, and the facility for expungement/correction of the individual's data from the dataset. Although the absence of a person's consent may not necessarily breach regulatory norms, in the absence of a more nuanced evaluation we use individual consent in the dataset as one of the factors for compliance.

Results

In this work we surveyed a large number of datasets featuring humans. Although fairness and privacy issues persist across other data domains such as objects and scenes59,60, current regulatory norms are designed for people. We limit our discussion to face-based and healthcare imaging datasets; however, it is possible to extend the concepts presented in this study to other domains. After filtering through a total of 100 datasets and discarding those that are decommissioned, small in size (fewer than 100 images) or whose data could not be downloaded, accessed or requested, 60 datasets remained. These datasets are used for the analysis and quantification of the responsible rubric; 52 are face-based biometric datasets (Supplementary Table 2) and eight are chest X-ray-based healthcare datasets (Supplementary Table 3). We quantify the datasets across the dimensions of fairness, privacy and regulatory compliance. Using the specified quantification methodology, a 3-tuple containing scores across the three dimensions is obtained. Figure 3a shows the distribution of the scores.

Fig. 3: Analysis based on fairness, privacy and regulatory scores.

a, The summary of fairness, privacy and regulatory compliance scores through histogram visualization for the datasets we surveyed. Left: the maximum value of the fairness score that can be obtained is 5; however, it is observed that the fairness scores do not exceed 3. Middle: although most datasets in our study preserve privacy in terms of not leaking location or medical information, very few provide perfect privacy preservation. Right: most datasets comply with no regulatory norms or only one. We can observe that most datasets provide a low fairness score and perform poorly on the regulatory compliance metric. b, The impact of weighing different factors in the quantification of the privacy score. c, The FPR scores of the datasets on the basis of the year in which they were published. d, The average FPR scores of the datasets on the basis of the data collection source. The colours in c and d correspond with those in a.

Fairness in datasets

The fairness of datasets is calculated using equation (4) (see Methods). The fairness metric described in this work has a maximum value of 5, with 5 being the fairest. The mean ± s.d. fairness score over the 60 datasets was 0.96 ± 0.64, signifying a wide spread (see Table 2 for more detailed results). The UTKFace dataset is observed to be the fairest among the datasets listed here, with a score of 2.71, providing the greatest representation. It should be noted that the UTKFace dataset achieves only slightly more than half the maximum fairness score. Interestingly, the average fairness score for the eight healthcare datasets was 1.34 ± 0.17, whereas that for the biometric datasets was 0.90 ± 0.67, showing a higher overall fairness of healthcare datasets when compared with biometric datasets.

Privacy preservation in datasets

The privacy preserved in datasets is computed on the basis of the presence of privacy-compromising information in the annotations. A P score, indicating the privacy-preservation capacity, and a PL score, indicating the privacy leakage of the dataset, are calculated. The distribution of P for privacy quantification is presented in Fig. 3a. The best value of P is 6. We observe that the DroneSURF dataset contains no private information, making it perfectly privacy preserving. The healthcare datasets in the study de-identify individuals but naturally leak information on medical conditions, whereas some further provide sensitive information such as location.

Regulatory compliance in datasets

With modern information technology laws in place, the regulatory compliance of datasets is quantified on the basis of institutional approval of the dataset; the individual's consent to data collection; and the facility for expungement/correction of the individual's data from the dataset. On the basis of these criteria, the compliance scores are calculated with a maximum value of 3. The distribution of scores is provided in Fig. 3a. On average, a regulatory score of 0.58 is obtained. We observe that the FB Fairness Dataset (Casual Conversations) satisfies all three regulatory criteria, thereby obtaining the maximum regulatory score, whereas most datasets obtain a score of 0 or 1.

Fairness–privacy paradox in datasets

Many face-based biometric datasets provide sensitive attribute information. Although the presence of these annotations enables fairness analysis, it also leads to privacy leakage, causing a fairness–privacy paradox wherein enhancing one factor hinders the other. One way to remedy the situation could be to publish population statistics in the dataset papers instead of per-sample sensitive attribute labels; however, current fairness algorithms are evaluated through sensitive attribute annotations in the dataset, and their absence can hinder the fairness evaluation process. In differential privacy-based solutions, it has been observed that the performance degradation is unequal across different subgroups61, highlighting the need for these sensitive labels for fairness analysis. The fairness–privacy paradox remains an open problem for datasets containing sensitive attribute information, such as those in biometrics and healthcare imaging. With ongoing discussion regarding concerns for privacy and fairness, privacy laws and proposed AI laws can sometimes provide conflicting guidance, giving researchers and industry reason to approach this paradox with caution during dataset development. Recent work in face recognition is exploring models trained using synthetically generated datasets to circumvent these privacy-related concerns62,63,64. However, these synthetic datasets rely on powerful generative models that use large face datasets in their training. Some diffusion-based models have also been shown to replicate their training data during generation65.

Holistic view of responsibility in datasets

Studying the aforementioned factors in conjunction, we obtain a 3D representation of the datasets. The 3-tuple provides insight into how responsible a dataset may be for downstream training. To observe the behaviour of the 3-tuple visually, we plotted a 3D scatter plot for the face datasets, along with a hypothetical FPR dataset (Fig. 4a). The hypothetical FPR dataset has perfect fairness, privacy and regulatory scores on the basis of our formulation. After applying the DBSCAN algorithm with eps = 1 (the maximum distance between two points to be considered part of one cluster), we observe five clusters with two outliers. We compute the cluster centres (denoted as clusters 1–5) of the five clusters by taking the mean along the three dimensions. The centres of clusters 1–5 are located at (0.67, 5, 2), (1.14, 4, 1), (1.37, 3, 1), (0.69, 4.94, 0.28) and (1.45, 3, 0), respectively, where a cluster centre (F,P,R) denotes the centre’s fairness, privacy and regulatory score, respectively. On calculating the Euclidean distance of these centres from the FPR dataset, we find that they lie at distances of 4.56, 4.79, 5.11, 5.20 and 5.53 units, respectively, with the clusters comprising 4, 7, 3, 32 and 4 datasets, respectively.
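As an illustration of this analysis, the following is a minimal Python sketch of the clustering step, assuming each dataset is represented by its (F, P, R) tuple. The example tuples and the min_samples setting are illustrative assumptions (the text only specifies eps = 1), not the exact audit pipeline.

```python
# A minimal sketch of the cluster analysis over (fairness, privacy, regulatory)
# tuples. The tuples below are illustrative placeholders, not the full set of
# audited face datasets, and min_samples is an assumed setting.
import numpy as np
from sklearn.cluster import DBSCAN

scores = np.array([
    [0.60, 5.0, 2.0],
    [0.80, 5.0, 2.0],
    [0.70, 4.8, 2.0],
    [1.56, 5.0, 3.0],
    [2.71, 5.0, 1.0],
])
fpr_ideal = np.array([5.0, 6.0, 3.0])  # hypothetical dataset with perfect F, P, R

# eps is the maximum distance between two points in the same cluster.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(scores)

# Cluster centres are the per-cluster means; label -1 marks outliers.
for cluster_id in sorted(set(labels) - {-1}):
    centre = scores[labels == cluster_id].mean(axis=0)
    distance = np.linalg.norm(centre - fpr_ideal)
    print(f"cluster {cluster_id}: centre={centre.round(2)}, distance to FPR={distance:.2f}")
for idx in np.flatnonzero(labels == -1):
    print(f"outlier dataset {idx}: distance to FPR={np.linalg.norm(scores[idx] - fpr_ideal):.2f}")
```

The distance of each cluster centre (or outlier) from the ideal FPR point is what ranks the groups of datasets discussed below.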

Fig. 4: Cluster-based analysis.

ad, Cluster analysis based on the 3-tuple quantification of fairness, privacy and regulatory compliance for face-based datasets only (a,b), and face-based datasets alongside healthcare datasets (c,d). a,c, The 3D scatter plot of the different datasets across the three axes, with the FPR dataset plotted with perfect fairness, privacy preservation and regulatory compliance. b,d, The scatter plot after performing DBSCAN clustering with eps = 1. We observe that the FB Fairness Dataset and the UTKFace dataset lie closest to the FPR dataset.

The FB Fairness Dataset66 (1.56, 5, 3) and the UTKFace dataset67 (2.71, 5, 1) emerge as outliers, with Euclidean distances of 3.59 and 3.20 units from the FPR dataset, respectively. When compared with the other clusters, we observe that these two datasets lie closest to the FPR dataset, showcasing their superiority over the other datasets along these axes. Cluster 1 is the next closest cluster, which comprises the LAOFIW68, 10k US Adult Faces Database69, CAFE Database70 and IISCIFD71, with average scores of 0.67, 5 and 2 for fairness, privacy and regulatory compliance, respectively. We observe that cluster 5 is the farthest from the FPR dataset and contains datasets that have regulatory scores of 0. Cluster 4 has datasets with regulatory scores between 0 and 1, whereas closer clusters, such as clusters 2 and 3, have datasets with regulatory scores of 1 and overall higher fairness scores. Datasets in cluster 1 perform extremely well on the privacy and regulatory scores but lag considerably in fairness. Similar observations can be made when the scatter plot includes healthcare datasets along with the face datasets (Fig. 4c,d). The numerical results are tabulated in Table 2. A weighted average of the three scores is calculated by dividing each score by its maximum value and then taking the average, which yields a value in the range 0 to 1 (Table 2). Using this average, we observe that the top three responsible datasets are the FB Fairness Dataset (Casual Conversations), the IISCIFD and the UTKFace dataset. A high regulatory compliance score plays an important role in the overall responsibility score of the FB Fairness and IISCIFD datasets. By contrast, a high fairness score gives UTKFace a high responsibility-rubric value.
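As a worked illustration of this normalization (using the maximum scores of 5, 6 and 3 for fairness, privacy and regulatory compliance stated above), the UTKFace tuple (2.71, 5, 1) gives

$$\frac{1}{3}\left(\frac{2.71}{5}+\frac{5}{6}+\frac{1}{3}\right)\approx\frac{1}{3}\,(0.54+0.83+0.33)\approx 0.57,$$

with higher values indicating a more responsible dataset under this combined measure.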

Table 2 Summary of the different scores obtained for fairness, privacy, and regulatory compliance quantification obtained for the biometric and healthcare datasets in the study

To further understand how the F, P and R scores vary across the datasets, we evaluate them on the basis of the year in which they were collected. We observe a trend towards increasing fairness and regulatory scores over the years (refer to Fig. 3c). We also evaluate the average F, P and R scores over all 60 datasets on the basis of the source of the dataset collection (refer to Fig. 3d). We observe that fairness and regulatory scores are generally higher for datasets that are not web-collected, showing how collecting data from the web can negatively impact these factors. To summarize the observations made over the existing face datasets, we find that

  • Most of the existing datasets perform poorly on all three axes (fairness, privacy and regulatory compliance) under the proposed metric. For example, the UTKFace dataset is among the fairest datasets but performs poorly on regulatory compliance. The LFWA dataset, on the other hand, falls short on all three fronts.

  • Although many works claim fairness as the primary focus of their datasets, these datasets obtain poor fairness scores on evaluation. One such example is the DiveFace dataset. The fairness quantification of datasets using our framework shows that fairness remains a major concern, with 91% of the existing datasets obtaining a fairness score of two or less out of five.

  • A vast number of large-scale datasets in computer vision are web-curated without any institutional approval. These datasets are often released under various CC-BY licences even when they lack individual consent. We found that such datasets also fare poorly on the fairness front because their annotations are not always reliable, posing major risks to overall data responsibility.

  • Following regulatory norms effectively improves the responsibility rubric for a given dataset; however, based on the available information, most datasets are not compliant, with 89% of datasets having a compliance score of 0 or 1.

  • When comparing fairness, privacy and regulatory scores, it is clear that the privacy scores are generally higher. It is worth noting that privacy standards and constraints are already defined and have existed for a few years now21, and datasets are possibly collected with these regulations in mind. This further indicates a need for frameworks and constraints that promote data collection with higher fairness and regulatory standards.

Recommendations

Drawing from the insights gained through applying our framework to a broad spectrum of datasets, we propose several recommendations to enhance the process of dataset collection in the future. These suggestions are designed to address ethical and technical considerations in dataset creation and management:

  • Institutional approval, ethics statement and individual consent: datasets involving humans should receive approval from an institutional review board or an equivalent ethics body (such as an IRB in the US). Future regulations may require consent from individuals to be obtained explicitly for the dataset and its intended use.

  • Facility for expungement/correction of an individual's data: provisions should be made for individuals to request the deletion or amendment of their data within datasets, complying with privacy regulations such as the GDPR. This capability, already present in some datasets (such as the FB Fairness Dataset, IJB-C and UTKFace datasets), should become a standard feature, allowing for greater control over personal data.

  • Fairness and privacy: datasets should be collected from a diverse population, and distribution across sensitive attributes should be provided in a privacy-preserving manner. The proposed fairness and privacy scores can aid in quantifying a dataset’s diversity and privacy preservation.

  • Comprehensive datasheet: dataset creators should curate and provide a datasheet containing information regarding the objectives; intended use; funding agency; the demographic distribution of individuals or images; the licensing information; and the limitations of the dataset. Specifying the intended use allows processing outside that use to be restricted under the GDPR. An excellent resource for the construction of a datasheet is provided by Gebru and colleagues6. We propose modifications to this datasheet by adding questions concerning fairness, privacy and regulatory compliance in datasets (refer to Supplementary Tables 4 and 5 for further discussion), encouraging detailed documentation of datasets. Although our main contribution remains the quantification framework, we encourage the comprehensive incorporation of insights from dataset documentation.

Limitations and future work

The formulation for quantification in this work considers dataset fairness on the basis of the distribution of its labels; however, this approach does not capture the diversity of the images themselves, including the occurrence of duplicate images within specific subgroups. Moreover, it is important to recognize that, for certain applications, an unequal distribution among groups might be preferable; for instance, when a particular group presents more challenges in processing and thus necessitates more data to achieve uniform model performance across all groups. Furthermore, the current formulation of the fairness, privacy and regulatory scores is tailored to human-centric datasets. Although datasets focused on objects could also face fairness challenges, current regulations predominantly focus on mitigating the impact on humans. The examination of object-based datasets remains a task for future exploration. Moreover, the recommendations and datasheets proposed in this study are intended to establish the highest standards, which can be challenging to achieve given the capabilities of current technologies. These recommendations are designed to act as a guiding north star, with the understanding that attaining these ideals necessitates focused research endeavours. The fairness–privacy paradox continues to be a complex issue within the field, and removing data from trained models through unlearning, although a subject of active research, is yet to be resolved. We acknowledge that the various components of the framework for quantification involve multiple design choices, such as the choice of metrics. Although the proposed framework uses a specific set of design choices to measure specific values across specific characteristics, these choices can be adapted as necessary, depending on the application. For example, we use the Shannon diversity index as the metric for computing diversity; however, alternative diversity metrics could be used depending on the specific information they capture.

Conclusion

Whereas contemporary research predominantly focuses on developing trustworthy machine learning algorithms, our work emphasizes assessing the integrity of AI by examining datasets through the lens of fairness, privacy and regulatory compliance. We conduct a large-scale audit of datasets, specifically those related to faces and chest X-rays, and propose recommendations for creating responsible ML datasets. Our objective is to initiate a dialogue on establishing quantifiable criteria for dataset responsibility, anticipating that these criteria will be further refined in subsequent studies. Such progress would facilitate effective dataset examination, ensuring alignment with responsible AI principles. As global data protection laws tighten, the scientific community must reconsider how datasets are crafted. We advocate for the implementation of quantitative measures, combined with qualitative datasheets and the proposed recommendations, to encourage the creation of responsible datasets. This initiative is vital for advancing responsible AI systems. We lay the groundwork for an ethical and responsible AI research and development framework by merging quantitative analysis with qualitative evaluations and practical guidance.

Methods

In this section we describe the methodology adopted for designing the audit framework for responsible datasets.

Quantifying dataset fairness

For fairness computation, we define the set of demographics D = {gender, skin tone, ethnicity, age} and S as the corresponding subgroups in each demographic (refer to Supplementary Table 1 for the subgroups). For example, if D1 = gender then S1 = {male, female, other}. For a given dataset, d denotes the set of demographics annotated in the dataset and s the corresponding subgroups. The inclusivity ri for each demographic i is then the ratio of the number of demographic subgroups present in the dataset to the number of pre-defined subgroups in Si,

$$r_i = |s_i| / |S_i|$$
(1)

The diversity vi is calculated using Shannon’s diversity index72, a popular metric for measuring diversity, especially in ecological studies73. The distribution of the different subgroups for a given demographic di is computed as follows,

$$p_{ij} = \frac{\mathrm{num}(s_{ij})}{\sum_j \mathrm{num}(s_{ij})}$$
(2)
$$v_i = -\frac{1}{\ln(|s_i|)} \sum_j p_{ij} \times \ln(p_{ij}),$$
(3)

where num(sij) denotes the number of samples for the jth subgroup of the ith demographic in the dataset. When the number of samples is not available, we take num to denote the number of individuals in the dataset. Fairness across each of the demographics is thus measured on a scale between 0 and 1.

The label score l is calculated to reflect the reliability of demographic labels: self-reported (1.0), classifier-generated (0.67) or apparent labels (0.33). Self-reported labels indicate that the individuals provided their information as a part of the data collection process. Classifier-generated labels imply that the labels were obtained through an automated process. Finally, apparent labels indicate that the annotations were made by an external annotator after observing the images. Classifier-generated labels are given a higher score than apparent labels owing to their consistency compared with the potential for human bias in apparent labels74,75,76. We acknowledge that there may be different perceptions of reliability in annotation based on the task and nature of the data. In cases in which the labels are collected using more than one category, an average of the corresponding categories’ scores is taken. For healthcare datasets, a score of 1 is given if a medical professional provides or validates the annotations, otherwise a score of 0 is given. The fairness score (F) is computed as,

$$F = \sum_i (r_i \times v_i) + l$$
(4)

Here, a higher F indicates a fairer dataset.

The proposed fairness computation can be extended to additional demographics by incorporating them and their subgroups into D and S, respectively. The calculation for inclusivity and diversity remains the same, with the maximum value of the fairness score adjusted on the basis of the number of demographic groups added. We focus on individual demographics because considering all possible intersections leads to exponential growth in the number of subgroups; however, we acknowledge the importance of intersectionality in fairness research and encourage the exploration of methods for quantifying intersectional fairness in future work.
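To make the computation concrete, the following is a minimal Python sketch of equations (1)–(4), assuming subgroup sample counts are available per annotated demographic; the subgroup lists, counts and function names are illustrative and not part of the released framework.

```python
# A minimal sketch of the fairness score F in equations (1)-(4).
# The pre-defined subgroups and example counts are illustrative stand-ins
# for Supplementary Table 1, not the exact audit configuration.
import math

PREDEFINED_SUBGROUPS = {
    "gender": ["male", "female", "other"],
    "skin_tone": ["type_1", "type_2", "type_3", "type_4", "type_5", "type_6"],
}

def fairness_score(annotated_counts: dict, label_score: float) -> float:
    """annotated_counts maps demographic -> {subgroup: num(s_ij)}; label_score is l."""
    f = 0.0
    for demographic, counts in annotated_counts.items():
        present = {k: v for k, v in counts.items() if v > 0}          # s_i
        r_i = len(present) / len(PREDEFINED_SUBGROUPS[demographic])   # eq. (1)
        total = sum(present.values())
        if len(present) > 1:
            p = [v / total for v in present.values()]                 # eq. (2)
            v_i = -sum(x * math.log(x) for x in p) / math.log(len(present))  # eq. (3)
        else:
            v_i = 0.0  # a single represented subgroup contributes no diversity
        f += r_i * v_i
    return f + label_score                                            # eq. (4)

# Example: gender annotated for two of three subgroups, apparent labels (l = 0.33).
print(fairness_score({"gender": {"male": 700, "female": 300, "other": 0}}, 0.33))
```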

Quantifying dataset privacy

We quantify potential privacy leakage through annotated labels by considering the exposure of the following set of attributes A:

  • A1: Name identification (highest leakage)

  • A2: Sensitive attributes such as gender and race

  • A3: Accessories such as hats and sunglasses

  • A4: Critical objects such as credit cards and signatures

  • A5: Location information such as coordinates and landmarks

  • A6: Medical condition information

Then,

$$\mathbf{A} = \{A_1, A_2, A_3, A_4, A_5, A_6\}$$
(5)

We assess privacy leakage through manual inspection of dataset annotations (publications, websites, repositories). Each attribute in A present in the annotations contributes a point to the privacy leakage score (PL), which is calculated as,

$$\mathrm{PL} = \sum_{i=1}^{6} A_i.$$
(6)

The higher the PL, the greater the potential privacy risk. Finally, the privacy preservation score, P, for a given dataset is estimated as,

$$P = (|A| - \mathrm{PL}).$$
(7)

Due to the difficulty of universally weighting privacy factors, we assign equal weight to each attribute in A. These weights may be varied depending on the requirements of the dataset user. We show how varying the weights across the different factors influences P (Fig. 3b). Our audit framework is flexible and can incorporate additional attributes as necessitated by the specific application domain or dataset under consideration.
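A minimal sketch of this computation, assuming the equal-weight setting and that the presence of each attribute has already been determined by manually inspecting the dataset's annotations, is shown below; the attribute identifiers are illustrative names.

```python
# A minimal sketch of the privacy quantification in equations (5)-(7) under
# equal weighting. Attribute presence is assumed to be established manually
# from the dataset's annotations, publication and repository.
ATTRIBUTES = [
    "name_identification",    # A1
    "sensitive_attributes",   # A2: e.g. gender, race
    "accessories",            # A3: e.g. hats, sunglasses
    "critical_objects",       # A4: e.g. credit cards, signatures
    "location_information",   # A5: e.g. coordinates, landmarks
    "medical_condition",      # A6
]

def privacy_scores(present):
    """Return (PL, P): each leaked attribute adds one point to PL; P = |A| - PL."""
    pl = sum(1 for a in ATTRIBUTES if a in present)   # eq. (6)
    return pl, len(ATTRIBUTES) - pl                   # eq. (7)

# Example: a healthcare dataset that de-identifies names but annotates
# sensitive attributes and medical conditions.
print(privacy_scores({"sensitive_attributes", "medical_condition"}))  # (2, 4)
```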

Quantifying regulatory compliance

Data privacy regulations vary globally, with the GDPR serving as a prominent example21. Beyond legal requirements, datasets can benefit from institutional approvals (for example, from an institutional review board) and ethics and impact statements77,78,79. We propose a regulatory compliance score R for datasets based on three factors (each scored 0 or 1):

  • Institutional approval granted

  • Individual consent obtained

  • Data expungement/correction facility available

A compliant dataset with all criteria met receives a score of 3. Although the absence of a person's consent may not necessarily breach regulatory norms, in the absence of a more nuanced evaluation we use individual consent as one of the factors for compliance. Factors are validated manually from dataset publications, webpages or GitHub pages. Missing information defaults to a score of zero.
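To illustrate, the sketch below combines the three binary compliance factors into R and, under the maximum scores of 5, 6 and 3 used elsewhere in this work, folds the three scores into the normalized average reported in Table 2; the function names are illustrative.

```python
# A minimal sketch of the regulatory compliance score R and of the normalized
# average of the three scores. The criteria flags are assumed to be verified
# manually from the dataset publication, webpage or repository; anything that
# cannot be verified defaults to False (score 0).
def regulatory_score(institutional_approval: bool,
                     individual_consent: bool,
                     expungement_facility: bool) -> int:
    return int(institutional_approval) + int(individual_consent) + int(expungement_facility)

def responsibility_average(f: float, p: float, r: float) -> float:
    """Each score divided by its maximum (5, 6, 3), then averaged to lie in [0, 1]."""
    return (f / 5 + p / 6 + r / 3) / 3

# Example: approval and consent documented, but no expungement facility.
r = regulatory_score(True, True, False)      # R = 2
print(responsibility_average(1.5, 5, r))     # approximately 0.60
```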