Abstract
Artificial Intelligence (AI) has seamlessly integrated into numerous scientific domains, catalysing unparalleled enhancements across a broad spectrum of tasks; however, its integrity and trustworthiness have emerged as notable concerns. The scientific community has focused on the development of trustworthy AI algorithms; however, machine learning and deep learning algorithms, popular in the AI community today, intrinsically rely on the quality of their training data. These algorithms are designed to detect patterns within the data, thereby learning the intended behavioural objectives. Any inadequacy in the data has the potential to translate directly into algorithms. In this study we discuss the importance of responsible machine learning datasets through the lens of fairness, privacy and regulatory compliance, and present a large audit of computer vision datasets. Despite the ubiquity of fairness and privacy challenges across diverse data domains, current regulatory frameworks primarily address human-centric data concerns. We therefore focus our discussion on biometric and healthcare datasets, although the principles we outline are broadly applicable across various domains. The audit is conducted through evaluation of the proposed responsible rubric. After surveying over 100 datasets, our detailed analysis of 60 distinct datasets highlights a universal susceptibility to fairness, privacy and regulatory compliance issues. This finding emphasizes the urgent need for revising dataset creation methodologies within the scientific community, especially in light of global advancements in data protection legislation. We assert that our study is critically relevant in the contemporary AI context, offering insights and recommendations that are both timely and essential for the ongoing evolution of AI technologies.
Main
The rapid growth of AI and machine learning is reshaping global efforts and 'technology for good' programmes, impacting the lives of billions. These advancements set new benchmarks in accuracy, exemplified by AI surpassing human experts in diagnostics1 and complex games2. However, concerns regarding bias, privacy and manipulation have emerged, prompting the formulation of principles of responsible AI to advocate for safe, ethical and trustworthy AI development3. Within the AI development pipeline, data collection and annotation stages hold paramount importance, influencing the overall system performance. A recent report by NIST emphasizes how reliance on large-scale datasets can introduce bias during AI model training and evaluation4. The data-driven nature of contemporary AI algorithms exacerbates these biases, as any anomalies within the training data can directly impede the learning process. Despite its critical role, data quality often remains undervalued within the AI/ML research community5.
The discourse surrounding dataset quality in AI research is evolving, encompassing both qualitative and quantitative evaluation methods5,6. There is growing recognition of the importance of data quality alongside algorithmic efficiency and model trustworthiness. Past research has established qualitative assessment of dataset quality through methods such as interviews and discussions7,8. Notably, Gebru and co-workers proposed comprehensive datasheets for released datasets to raise transparency and accountability6. This aligns with frameworks by Hutchinson et al. for building responsible data development processes9. Efforts are underway to address data representation, with Kamikubo and colleagues analysing accessibility datasets10. Furthermore, Miceli et al. conducted a qualitative study on documenting context within image datasets11; Paullada et al. further advocated for the combined use of qualitative and quantitative measures in dataset development12. The discussion around data quality extends to socio-cultural data collection, with calls for institutional frameworks and procedures inspired by archival practices to address consent, inclusivity and privacy concerns13. Similarly, research by Peng and co-workers analyses problematic datasets and highlights the need for improved development pipelines14. In the field of NLP, researchers have proposed data statements to understand bias and intent15. Birhane et al. emphasized the importance of consent and responsible curation in large-scale vision datasets16.
More recently, researchers have also discussed the impact of regulations and policies on trustworthy AI17, and data privacy laws—inspired by the General Data Protection Regulation (GDPR)—are being implemented globally18,19,20. The GDPR regulates biometric data processing and grants individuals control over their information21. Various research efforts have been undertaken to explore the GDPR’s impact on AI deployment and data management22,23. In Table 1, we summarize data privacy laws for some of the countries around the world. There are other laws specific to certain kinds of data, such as the Health Insurance Portability and Accountability Act (HIPAA)24 for medical health in the US, and the Biometric Information Privacy Act (BIPA)25, which protects biometric information in the state of Illinois, United States. The European Commission’s Ethics Guidelines and the proposed AI Act further emphasize responsible AI development26,27. Recent work has also discussed the impact of the Artificial Intelligence Act on facial processing applications28.
Despite extensive evaluations, quantitative assessments of dataset fairness and privacy remain limited. Although multiple efforts have employed quantitative and statistical measures in varied contexts of fairness and privacy29,30, no existing work jointly estimates the responsibility of a dataset. Existing works quantify bias in biometric data at the model level31,32 and audit existing algorithms33,34,35, but few explore the impact at the dataset level. Dulhanty et al. propose a model-driven framework for measuring fairness but lack joint quantification of fairness metrics36. Toolkits for evaluating bias through annotations exist37, alongside statistical analyses for data variable relationships38. Researchers have explored various approaches to quantify privacy in datasets. Li et al. leverage human labelling to identify privacy-sensitive information in images39, whereas Gervais et al. infer locations from purchase history data40. Orekondy et al. propose an algorithm to predict privacy risk in images41. Furthermore, metrics such as l-diversity42, k-anonymity29, t-closeness43 and M-invariance44 quantify privacy leakage by considering factors such as an adversary with knowledge, sensitive attribute representation and protection against specific attacks29,42,43.
Although various studies have emphasized distinct challenges in dataset collection, very few have examined the dimensions of fairness, privacy and regulatory compliance within datasets, specifically through the analysis of dataset annotations. This study leverages concepts from the widely recognized ethical values of AI45,46,47,48 to present a large audit of computer vision datasets focusing on fairness, privacy and regulatory compliance as key dimensions (Fig. 1). The audit presents examples from datasets in biometrics and healthcare, and introduces a framework for evaluating the compliance of datasets with fairness, privacy and regulatory norms, values that have been scarcely explored in past literature. Although there are additional ethical considerations that could be evaluated, our work focuses on the assessment of the aforementioned important responsible AI values. There may be other aspects of responsibility, such as equity or reproducibility, where data play a crucial role49. Further research is necessary to expand the evaluation to encompass a broader range of ethical principles crucial to responsible AI.
To conduct the audit, this work introduces a responsible rubric to assess machine learning datasets (especially those in the biometric and healthcare domains) such as face-recognition and chest X-ray datasets. After reviewing over 100 datasets and excluding those unsuitable due to size or lack of accessibility, we applied our framework to 60 datasets. The proposed framework attempts to quantitatively assess the trustworthiness of training data for ML models in terms of fairness, privacy risks and adherence to regulatory standards. The framework evaluates datasets on diversity, inclusivity and the reliability of annotations for fairness; identifies sensitive annotations for privacy evaluation; and verifies compliance with regulations.
On the basis of the quantitative evaluation, our detailed analysis showcases that many datasets do not adequately address fairness, privacy and regulatory compliance, highlighting the need for better curation practices. We highlight a fairness–privacy paradox, wherein the inclusion of sensitive attributes to enhance fairness might inadvertently risk privacy. Given the complexities of quantitatively examining datasets, we offer recommendations to improve dataset accountability and advocate for the integration of qualitative evaluations, such as datasheets, to promote dataset responsibility.
Quantification of the responsible rubric
The quantification of the three parameters is summarized below (refer to Fig. 2) and is further detailed in the Methods.
Quantifying fairness (F)
We consider the impact of three factors for quantifying dataset fairness: diversity, inclusivity and labels (see Fig. 2a). Inclusivity quantifies whether different groups of people are represented in the dataset across gender, skin tone, ethnic group and age parameters. These align with common fairness evaluation practices in deep learning research using biometric and healthcare data50,51. Although other demographics such as disability or income are important, they lack annotations in these datasets and therefore could not be used (Supplementary Table 1 details the subgroups considered). We acknowledge that there may be biased subsets of data that are left unexplored by limiting the variables to the chosen demographic groups52. We further acknowledge limitations in gender categorization (male/female) from existing datasets53 and include an 'Other' option. Ethnicity subgroups are inspired by FairFace54, with an additional 'mixed-race' category. Age classifications follow the AgeDB dataset55. Diversity quantifies the distribution of these groups in the dataset, with an assumption that a balanced dataset is the most fair. Although a balanced dataset does not guarantee equal performance, existing work has shown improved fairness with the use of balanced datasets56,57. We note that such a dataset may not be ideal in many cases, but it acts as a simplifying assumption for the proposed formulation. Finally, we consider the reliability of the labels depending on whether they have been self-reported by the individuals in the dataset or are annotated based on apparent characteristics.
Quantifying privacy (P)
To quantify privacy leakage in the publicly available datasets, we identify vulnerable label annotations that can lead to the potential leakage of private information. The dataset’s annotated labels are employed to quantify the potential privacy leakage. Following a comprehensive review, we observe that six attributes in these domains are encountered most commonly58. These include name identification, sensitive and protected attributes, accessories, critical objects, location inference and medical condition.
Quantifying regulatory compliance (R)
The regulatory compliance score in the dataset is quantified on the basis of three factors: institutional approval, the individual's consent to the data collection, and the facility for expungement/correction of the individual's data from the dataset. Although the absence of a person's consent may not necessarily breach regulatory norms, in the absence of a more nuanced evaluation we use individual consent in the dataset as one of the factors for compliance.
Results
In this work we surveyed a large number of datasets featuring humans. Although fairness and privacy issues persist across different data domains such as objects and scenes59,60, current regulatory norms are designed for people. We limit our discussion to face-based and healthcare imaging datasets; however, it is possible to extend the concepts presented in this study to other domains. After filtering over 100 datasets and discarding those that were decommissioned, contained fewer than 100 images or whose data could not be downloaded, accessed or requested, 60 datasets remained. These datasets are used for the analysis and quantification of the responsible rubric: 52 are face-based biometric datasets (Supplementary Table 2) and eight are chest X-ray-based healthcare datasets (Supplementary Table 3). We quantify the datasets across the dimensions of fairness, privacy and regulatory compliance. Using the specified quantification methodology, a 3-tuple containing scores across the three dimensions is obtained. Figure 3a showcases the distribution of the scores.
Fairness in datasets
The fairness of datasets is calculated using equation (4) (see Methods). The fairness metric described in this work provides a maximum value of 5, with 5 being the fairest. The mean ± s.d. fairness score over the 60 datasets was 0.96 ± 0.64, signifying a wide spread (see Table 2 for more detailed results). The UTKFace dataset is observed to be the fairest among the datasets listed here, with a score of 2.71, providing maximum representation. It should be noted that the UTKFace dataset achieves only slightly more than half the maximum fairness score. Interestingly, the average fairness score for the eight healthcare datasets was 1.34 ± 0.17, whereas that for the biometric datasets was 0.90 ± 0.67, showcasing a higher overall fairness of healthcare datasets when compared with biometric datasets.
Privacy preservation in datasets
The privacy preserved in datasets is computed on the basis of the presence of privacy-compromising information in the annotations. A P score indicating the privacy-preservation capacity and a PL score indicating the privacy leakage of the dataset are calculated. The distribution of P for privacy quantification is presented in Fig. 3a. The best value of P is 6. We observe that the DroneSURF dataset contains no privacy-compromising annotations, making it perfectly privacy preserving under this measure. The healthcare datasets in the study de-identify individuals but naturally leak information on medical conditions, whereas some further provide sensitive information such as location.
Regulatory compliance in datasets
With modern information technology laws in place, the regulatory compliance of datasets is quantified on the basis of institutional approval of the dataset; the individual's consent to data collection; and the facility for expungement/correction of the individual's data from the dataset. On the basis of these criteria, the compliance scores are calculated with a maximum value of 3. The distribution of scores is provided in Fig. 3a. On average, a regulatory score of 0.58 is obtained. We observe that the FB Fairness Dataset (Casual Conversations) satisfies all regulatory criteria, thereby obtaining the maximum regulatory score, whereas most datasets obtain a score of 0 or 1.
Fairness–privacy paradox in datasets
Many face-based biometric datasets provide sensitive attribute information. Although the presence of these annotations enables fairness analysis, it also leads to privacy leakage, causing a fairness–privacy paradox where enhancing one factor hinders the other. One way to remedy the situation could be to provide population statistics instead of per-sample sensitive attribute labels in the published dataset papers; however, current fairness algorithms are evaluated through sensitive attribute annotations in the dataset, and their absence can hinder the fairness evaluation process. In differential privacy-based solutions, it has been observed that the performance degradation is unequal across different subgroups61, highlighting the need for these sensitive labels for fairness analysis. The fairness–privacy paradox remains an open problem for datasets containing sensitive attribute information such as biometrics and healthcare imaging. Amid ongoing discussion of privacy and fairness concerns, existing privacy laws and proposed AI laws can offer conflicting guidance, giving researchers and industry reason to approach this paradox with caution during dataset development. Recent work in face recognition is exploring models trained using synthetically generated datasets to circumvent the different privacy-related concerns62,63,64. However, these synthetic datasets are produced by powerful generative models that themselves rely on large face datasets for training. Some diffusion-based models have also been shown to replicate their training data during generation65.
Holistic view of responsibility in datasets
Studying the aforementioned factors in conjunction, we obtain a 3D representation of the datasets. The 3-tuple provides insight into how responsible a dataset may be for downstream training. To observe the behaviour of the 3-tuple visually, we plotted a 3D scatter plot for the face datasets, along with a hypothetical FPR dataset (Fig. 4a). The hypothetical FPR dataset has a perfect fairness, privacy and regulatory score on the basis of our formulation. After applying the DBSCAN algorithm with eps = 1 (the maximum distance between two points to be considered as a part of one cluster), we observe five clusters with two outliers. We compute the cluster centres (denoted as clusters 1–5) of the five clusters by taking the mean along the three dimensions. The centres of clusters 1–5 are located at (0.67, 5, 2), (1.14, 4, 1), (1.37, 3, 1), (0.69, 4.94, 0.28) and (1.45, 3, 0), respectively, where a cluster centre (F,P,R) denotes the centre’s fairness, privacy and regulatory score, respectively. On calculating the Euclidean distance of these centres from the FPR dataset, we find that they lie at a distance of 4.56, 4.79, 5.11, 5.20 and 5.53 units, respectively, with the clusters comprising 4, 7, 3, 32 and 4 datasets, respectively.
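To make this clustering step concrete, the short sketch below illustrates how the (F, P, R) tuples can be clustered with DBSCAN at eps = 1 and compared against the hypothetical FPR dataset at (5, 6, 3), the maxima stated earlier. This is a minimal illustration assuming scikit-learn's DBSCAN; the min_samples setting and the score tuples beyond the two reported in the text are assumptions for demonstration only.

```python
# Minimal sketch: clustering dataset score 3-tuples (F, P, R) and measuring
# their distance from a hypothetical ideal "FPR" dataset. Requires numpy and
# scikit-learn; tuples other than the first two are illustrative placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

scores = np.array([
    [1.56, 5.0, 3.0],  # FB Fairness Dataset (Casual Conversations), as reported in the text
    [2.71, 5.0, 1.0],  # UTKFace, as reported in the text
    [0.67, 5.0, 2.0],  # placeholder values
    [1.10, 4.0, 1.0],  # placeholder values
    [0.70, 4.9, 0.3],  # placeholder values
])

ideal = np.array([5.0, 6.0, 3.0])  # perfect fairness, privacy and regulatory scores

# eps = 1 as in the analysis above; min_samples = 2 is an assumed setting.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(scores)  # -1 marks outliers

for cluster_id in sorted(set(labels) - {-1}):
    centre = scores[labels == cluster_id].mean(axis=0)
    print("cluster", cluster_id, centre, np.linalg.norm(centre - ideal))

for idx in np.where(labels == -1)[0]:
    print("outlier", scores[idx], np.linalg.norm(scores[idx] - ideal))
```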
The FB Fairness Dataset66 (1.56, 5, 3) and the UTKFace dataset67 (2.71, 5, 1) emerge as outliers, with Euclidean distances of 3.59 and 3.20 units from the FPR dataset, respectively. When compared with the other clusters, we observe that these datasets lie closest to the FPR dataset, showcasing their superiority over the other datasets along these axes. Cluster 1 is the next closest cluster, which comprises the LAOFIW68, 10k US Adult Faces Database69, CAFE Database70 and IISCIFD71, with average scores of 0.67, 5 and 2 for fairness, privacy and regulatory compliance, respectively. We observe that cluster 5 is the farthest from the FPR dataset and contains datasets that have regulatory scores of 0. Cluster 4 has datasets containing regulatory scores between 0 and 1, whereas closer clusters—such as clusters 2 and 3—have datasets containing regulatory scores of 1 with overall higher fairness scores. Datasets in cluster 1 perform extremely well on the privacy and regulatory scores but fall considerably short on fairness. Similar observations can be made when the scatter plot includes healthcare datasets along with the face datasets (Fig. 4c,d). The numerical results are tabulated in Table 2. A weighted average of the three scores is calculated by dividing each score by its maximum value and then averaging the results, which provides a value in the range 0 to 1 (Table 2). By utilizing this average, we observe that the top three responsible datasets are the FB Fairness Dataset (Casual Conversations), the IISCIFD and the UTKFace dataset. A high regulatory compliance score plays an important role in the overall responsibility score of the FB Fairness and IISCIFD datasets. By contrast, a high fairness score gives UTKFace a high responsible-rubric value.
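A minimal sketch of this normalized average, assuming the score maxima of 5, 6 and 3 stated earlier for fairness, privacy and regulatory compliance:

```python
# Sketch of the normalized responsibility score described above: each score is
# divided by its maximum attainable value and the three ratios are averaged.
def responsibility(f, p, r, f_max=5.0, p_max=6.0, r_max=3.0):
    return (f / f_max + p / p_max + r / r_max) / 3.0

# Using the scores reported in the text, the FB Fairness Dataset (1.56, 5, 3)
# yields roughly 0.72 and UTKFace (2.71, 5, 1) roughly 0.57.
print(responsibility(1.56, 5, 3), responsibility(2.71, 5, 1))
```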
To further understand how the F, P and R scores vary across the datasets, we evaluate them on the basis of the year in which they were collected. We observe a trend towards increasing fairness and regulatory scores over the years (refer to Fig. 3c). We also evaluate the average F, P and R scores over all 60 datasets on the basis of the source of the dataset collection (refer to Fig. 3d). We observe that fairness and regulatory scores are generally higher for datasets that are not web-collected, showing how collecting data from the web can negatively impact these factors. To summarize the observations made over the existing face datasets, we find the following:
- Most of the existing datasets suffer on all three axes (fairness, privacy and regulatory compliance) as per the proposed metric. For example, the UTKFace dataset is among the fairest datasets but performs poorly on regulatory compliance. On the other hand, the LFWA dataset falls short on all three fronts.
- Although many works claim fairness as the primary focus of their datasets, these datasets obtain poor fairness scores on evaluation. One such example is the DiveFace dataset. The fairness quantification of datasets using our framework shows that fairness remains a major concern, with 91% of the existing datasets obtaining a fairness score of two or less out of five.
- A vast number of large-scale datasets in computer vision are web-curated without any institutional approval. These datasets are often released under various CC-BY licences even when they lack individual consent. We found that these datasets also fare poorly on the fairness front because their annotations are not always reliable, posing major risks to overall data responsibility.
- Following regulatory norms effectively improves the responsibility rubric for a given dataset; however, based on the available information, most datasets are not compliant, with 89% of datasets having a compliance score of 0 or 1.
- When comparing fairness, privacy and regulatory scores, it is clear that the privacy scores are generally higher. It is worth noting that privacy standards and constraints are already defined and have existed for a few years now21, and datasets are possibly collected with these regulations in mind. This further indicates a need for constraints that promote data collection with higher fairness and regulatory standards.
Recommendations
Drawing from the insights gained through applying our framework to a broad spectrum of datasets, we propose several recommendations to enhance the process of dataset collection in the future. These suggestions are designed to address ethical and technical considerations in dataset creation and management:
- Institutional approval, ethics statement and individual consent: datasets involving humans should receive approval from an institutional review board (such as those operating in the US). Future regulations may require consent from individuals to be obtained explicitly for the dataset and its intended use.
- Facility for expungement/correction of an individual's data: provisions should be made for individuals to request the deletion or amendment of their data within datasets, complying with privacy regulations such as the GDPR. This capability, already present in some datasets (such as the FB Fairness Dataset, IJB-C and UTKFace datasets), should become a standard feature, allowing for greater control over personal data.
- Fairness and privacy: datasets should be collected from a diverse population, and distribution across sensitive attributes should be provided in a privacy-preserving manner. The proposed fairness and privacy scores can aid in quantifying a dataset’s diversity and privacy preservation.
- Comprehensive datasheet: dataset creators should curate and provide a datasheet containing information regarding the objectives; intended use; funding agency; the demographic distribution of individuals or images; the licensing information; and the limitations of the dataset. By specifying the intended use, the data can be restricted from being processed outside of the intended use under the GDPR. An excellent resource for the construction of a datasheet is provided by Gebru and colleagues6. We propose modifications to this datasheet by adding questions concerning fairness, privacy and regulatory compliance in datasets (refer to Supplementary Tables 4 and 5 for further discussion), encouraging detailed documentation of datasets. Although our main contribution continues to be in the form of the quantification framework, we encourage the comprehensive incorporation of insights from dataset documentation.
Limitations and future work
The formulation for quantification in this work considers dataset fairness on the basis of the distribution of its labels; however, this approach does not encompass the visual diversity of the images, including the occurrence of duplicate images within specific subgroups. Moreover, it is important to recognize that, for certain applications, an unequal distribution among groups might be preferable; for instance, when a particular group presents more challenges in processing and thus necessitates more data to achieve uniform model performance across all groups. Furthermore, the current formulation for fairness, privacy and regulatory scores is tailored to human-centric datasets. Although datasets focused on objects could also face fairness challenges, current regulations predominantly focus on mitigating the impact on humans. The examination of object-based datasets remains a task for future exploration. Moreover, the recommendations and datasheets proposed in this study are intended to establish the highest standards, which can be challenging to achieve given the capabilities of current technologies. These recommendations are designed to act as a guiding north star, with the understanding that attaining these ideals necessitates focused research endeavours. The fairness–privacy paradox continues to be a complex issue within the field, and removing data from trained models through unlearning—although a subject of active research—is yet to be resolved. We acknowledge that the various components of the framework for quantification involve multiple design choices, such as the choice of metrics. Although the proposed framework uses a specific set of design choices to measure specific values across specific characteristics, these choices can be adapted as necessary, depending on the application. For example, we use the Shannon diversity index as the metric for computing diversity; however, alternative diversity metrics could be used depending on the specific information they capture.
Conclusion
Whereas contemporary research predominantly focuses on developing trustworthy machine learning algorithms, our work emphasizes assessing the integrity of AI by examining datasets through the lens of fairness, privacy and regulatory compliance. We conduct a large-scale audit of datasets, specifically those related to faces and chest X-rays, and propose recommendations for creating responsible ML datasets. Our objective is to initiate a dialogue on establishing quantifiable criteria for dataset responsibility, anticipating that these criteria will be further refined in subsequent studies. Such progress would facilitate effective dataset examination, ensuring alignment with responsible AI principles. As global data protection laws tighten, the scientific community must reconsider how datasets are crafted. We advocate for the implementation of quantitative measures, combined with qualitative datasheets and the proposed recommendations, to encourage the creation of responsible datasets. This initiative is vital for advancing responsible AI systems. We lay the groundwork for an ethical and responsible AI research and development framework by merging quantitative analysis with qualitative evaluations and practical guidance.
Methods
In this section we describe the methodology adopted for designing the audit framework for responsible datasets.
Quantifying dataset fairness
For fairness computation, we define the set of demographics D = {gender, skin tone, ethnicity, age} and S as the corresponding subgroups in each demographic (refer to Supplementary Table 1 for subgroups). For example, if D1 = gender then S1 = {male, female, other}. For a given dataset, d denotes the set of demographics annotated in the dataset, and s, the subgroups corresponding to those demographics. Then, the inclusivity ri for each demographic i is the ratio of the demographic subgroups present in the dataset to the pre-defined demographic subgroups in Si,
The diversity vi is calculated using Shannon’s diversity index72, which is a popular metric for measuring diversity, especially in ecological studies73. The distribution of the different subgroups for a given demographic di is computed as follows,
where num(sij) denotes the number of samples for the jth subgroup of the ith demographic in the dataset. When the number of samples is not available, we consider num to denote the number of individuals in the dataset. Fairness across each of the demographics is thus measured on a scale between 0 and 1.
The label score l is calculated to reflect the reliability of demographic labels: self-reported (1.0), classifier-generated (0.67) or apparent labels (0.33). Self-reported labels indicate that the individuals provided their information as a part of the data collection process. Classifier-generated labels imply that the labels were obtained through an automated process. Finally, apparent labels indicate that the annotations were made by an external annotator after observing the images. Classifier-generated labels are given a higher score due to their consistency compared with the potential for human bias in apparent labels74,75,76. We acknowledge that there may be different perceptions of reliability in annotation based on the task and nature of the data. In cases in which the labels are collected using more than one category, an average of the corresponding categories’ scores is taken. For healthcare datasets, a score of 1 is provided if a medical professional provides/validates the annotations; otherwise a score of 0 is provided. The fairness score (F) is computed as,
Here, a higher F indicates a fairer dataset.
The proposed fairness computation can be extended to additional demographics by incorporating them and their subgroups into D and S, respectively. The calculation for inclusivity and diversity remains the same, with the maximum value of the fairness score adjusted on the basis of the number of demographic groups added. We focus on individual demographics because considering all possible intersections leads to exponential growth in the number of subgroups; however, we acknowledge the importance of intersectionality in fairness research and encourage the exploration of methods for quantifying intersectional fairness in future work.
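Because the equation images are not reproduced here, the following sketch illustrates one plausible reading of the fairness computation described above. It assumes the diversity vi is the Shannon index normalized to [0, 1] and that the scores are aggregated as F = l + Σi ri·vi, which is consistent with the stated maximum of 5 (four demographics plus the label-reliability term); the exact equations (1)–(4) in the paper may differ in detail.

```python
# Illustrative sketch of the fairness score under the assumptions stated above.
import math

def inclusivity(n_present, n_predefined):
    # Ratio of demographic subgroups present in the dataset to those pre-defined in S_i.
    return n_present / n_predefined

def diversity(counts, n_predefined):
    # Shannon diversity over subgroup sample counts, normalized (assumed) to [0, 1].
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(n_predefined) if n_predefined > 1 else 0.0

def fairness_score(per_demographic, label_score):
    # per_demographic: list of (subgroup_counts, n_predefined_subgroups) tuples.
    return label_score + sum(
        inclusivity(sum(1 for c in counts if c > 0), n_pre) * diversity(counts, n_pre)
        for counts, n_pre in per_demographic
    )

# Example: a dataset annotating only gender (3 pre-defined subgroups) with
# self-reported labels (l = 1.0); skin tone, ethnicity and age are unannotated.
print(fairness_score([([4000, 3500, 120], 3)], label_score=1.0))
```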
Quantifying dataset privacy
We quantify potential privacy leakage through annotated labels considering the exposure of the set of attributes A as,
- A1: Name identification (highest leakage)
- A2: Sensitive attributes such as gender and race
- A3: Accessories such as hats and sunglasses
- A4: Critical objects such as credit cards and signatures
- A5: Location information such as coordinates and landmarks
- A6: Medical condition information
Then,
We assess privacy leakage through manual inspection of dataset annotations (publications, websites, repositories). Each attribute in A present in the annotations contributes a point to the privacy leakage score (PL) calculated as,
The higher the PL, the greater the potential privacy risk. Finally, the privacy preservation score, P, for a given dataset is estimated as,
Due to the difficulty of universally weighting privacy factors, we assign equal weight to each attribute in A. These weights may be varied depending on the requirements of the dataset user. We show how varying the weights across the different factors influences P (Fig. 3b). Our audit framework is flexible and can incorporate additional attributes as necessitated by the specific application domain or dataset under consideration.
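A minimal sketch of this scoring, assuming each exposed attribute category contributes its (equal, by default) weight to PL and that P is the remaining budget (P = 6 − PL with unit weights), consistent with the stated best value of 6:

```python
# Sketch of the privacy leakage (PL) and preservation (P) scores described above.
ATTRIBUTES = [
    "name_identification",
    "sensitive_attributes",
    "accessories",
    "critical_objects",
    "location_information",
    "medical_condition",
]

def privacy_scores(annotated_attributes, weights=None):
    # Equal weights by default; PL adds the weight of every attribute category
    # exposed by the dataset's annotations, and P is the remaining budget.
    weights = weights or {a: 1.0 for a in ATTRIBUTES}
    pl = sum(w for a, w in weights.items() if a in annotated_attributes)
    p = sum(weights.values()) - pl
    return pl, p

# Example: a face dataset whose annotations expose gender/race labels and
# accessories such as sunglasses leaks 2 points and preserves 4.
print(privacy_scores({"sensitive_attributes", "accessories"}))  # (2.0, 4.0)
```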
Quantifying regulatory compliance
Data privacy regulations vary globally, with GDPR serving as a prominent example21. Beyond legal requirements, datasets can benefit from institutional approvals (for example, the Institutional Review Board) and ethics and impact statements77,78,79. We propose a regulatory compliance score R for datasets based on three factors (each scored 0 or 1):
- Institutional approval granted
- Individual consent obtained
- Data expungement/correction facility available
A compliant dataset with all criteria met receives a score of 3. Although the absence of a person's consent may not necessarily breach regulatory norms, in the absence of a more nuanced evaluation we use individual consent as one of the factors for compliance. Factors are validated manually from dataset publications, webpages or GitHub pages. Missing information defaults to a score of zero.
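A minimal sketch of this score, following the three binary factors listed above:

```python
# Sketch of the regulatory compliance score: three binary factors, one point each;
# factors that cannot be verified from the available documentation remain False.
def regulatory_score(institutional_approval=False,
                     individual_consent=False,
                     expungement_facility=False):
    return int(institutional_approval) + int(individual_consent) + int(expungement_facility)

# Example: documented institutional approval and a data-removal facility, but no
# record of individual consent, gives R = 2 out of 3.
print(regulatory_score(institutional_approval=True, expungement_facility=True))
```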
Data availability
The data curated for the study are available at https://osf.io/pfujw/, as well as via the project page at https://iab-rubric.org/resources/codes/fpr. As part of the data, we provide .csv files containing information such as the number of people and attribute annotations for the datasets used in this study. The names of the datasets are listed in Table 2 (column 1).
Code availability
The code used for the computation of the metrics, as well as for the analysis, is available at https://osf.io/pfujw/; it is also accessible via the project page at https://iab-rubric.org/resources/codes/fpr. All experimental results and analyses may be replicated by using the data made available for this work.
References
Williams, R. An AI Used Medical Notes to Teach Itself to Spot Disease on Chest X-rays (MIT Review, 2022); https://www.technologyreview.com/2022/09/15/1059541/ai-medical-notes-teach-itself-spot-disease-chest-x-rays/
Raja, A. Hybrid AI Beats Eight World Champions at Bridge (INDIAai, 2022); https://indiaai.gov.in/article/hybrid-ai-beats-eight-world-champions-at-bridge
Responsible AI For All: Adopting the Framework—A Use Case Approach on Facial Recognition Technology (NITI Aayog, 2022); https://www.niti.gov.in/sites/default/files/2022-11/Ai_for_All_2022_02112022_0.pdf
Schwartz, R. et al. Towards A Standard for Identifying and Managing Bias in Artificial Intelligence NIST Special Publication 1270 (NIST, 2022).
Sambasivan, N. et al. "Everyone wants to do the model work, not the data work": data cascades in high-stakes AI. In Proc. 2021 CHI Conference on Human Factors in Computing Systems 1–15 (ACM, 2021).
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).
Heger, A. K., Marquis, L. B., Vorvoreanu, M., Wallach, H. & Wortman Vaughan, J. Understanding machine learning practitioners’ data documentation perceptions, needs, challenges, and desiderata. In Proc. ACM on Human–Computer Interaction Vol. 6, 1–29 (ACM, 2022).
Scheuerman, M. K., Hanna, A. & Denton, E. Do datasets have politics? Disciplinary values in computer vision dataset development. In Proc. ACM on Human–Computer Interaction Vol. 5, 1–37 (ACM, 2021).
Hutchinson, B. et al. Towards accountability for machine learning datasets: practices from software engineering and infrastructure. In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 560–575 (ACM, 2021).
Kamikubo, R., Wang, L., Marte, C., Mahmood, A. & Kacorri, H. Data representativeness in accessibility datasets: a meta-analysis. In Proc. 24th International ACM SIGACCESS Conference on Computers and Accessibility 1–15 (ACM, 2022).
Miceli, M. et al. Documenting computer vision datasets: an invitation to reflexive data practices. In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 161–172 (ACM, 2021).
Paullada, A., Raji, I. D., Bender, E. M., Denton, E. & Hanna, A. Data and its (dis)contents: a survey of dataset development and use in machine learning research. Patterns 2, 100336 (2021).
Jo, E. S. & Gebru, T. Lessons from archives: strategies for collecting sociocultural data in machine learning. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 306–316 (ACM, 2020).
Peng, K. L., Mathur, A. & Narayanan, A. Mitigating dataset harms requires stewardship: lessons from 1000 papers. In 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS, 2021); https://openreview.net/forum?id=KGeAHDH4njY
Bender, E. M. & Friedman, B. Data statements for natural language processing: toward mitigating system bias and enabling better science. Trans. Assoc. Comput. Linguistics 6, 587–604 (2018).
Birhane, A. & Prabhu, V. U. Large image datasets: a pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision 1536–1546 (IEEE, 2021).
Liang, W. et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 4, 669–677 (2022).
Data Protection and Privacy Legislation Worldwide (UNCTAD, 2023); https://unctad.org/page/data-protection-and-privacy-legislation-worldwide
Greenleaf, G. Global Tables of Data Privacy Laws and Bills 6–19 (UNSW Law Research, 2021).
Greenleaf, G. Now 157 Countries: Twelve Data Privacy Laws in 2021/22 3–8 (UNSW Law Research, 2022).
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation) Document no. 32016R0679 (European Union, 2016); http://data.europa.eu/eli/reg/2016/679/oj
Forti, M. The deployment of artificial intelligence tools in the health sector: privacy concerns and regulatory answers within the GDPR. Eur. J. Legal Stud. 13, 29 (2021).
Goldsteen, A., Ezov, G., Shmelkin, R., Moffie, M. & Farkash, A. Data minimization for GDPR compliance in machine learning models. AI Ethics 2, 477–49 (2021).
Health Insurance Portability and Accountability Act of 1996 104–191 (ASPE, 1996); https://aspe.hhs.gov/reports/health-insurance-portability-accountability-act-1996
Biometric Information Privacy Act (Illinois General Assembly, 2008); https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004
Ethics Guidelines for Trustworthy AI (High-Level Expert Group on Artificial Intelligence, 2019); https://www.aepd.es/sites/default/files/2019-12/ai-ethics-guidelines.pdf
Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts (European Commission, 2021); https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence
Hupont, I., Tolan, S., Gunes, H. & Gómez, E. The landscape of facial processing applications in the context of the European AI act and the development of trustworthy systems. Sci. Rep. 12, 10688 (2022).
Samarati, P. & Sweeney, L. Protecting Privacy When Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression (EPIC, 1998).
Dwork, C. Differential privacy: a survey of results. In Theory and Applications of Models of Computation: 5th International Conference 1–19 (Springer, 2008).
Tommasi, T., Patricia, N., Caputo, B. & Tuytelaars, T. A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications 37–55 (Springer, 2017).
Yang, K., Qinami, K., Fei-Fei, L., Deng, J. & Russakovsky, O. Towards fairer datasets: filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 547–558 (ACM, 2020).
Birhane, A., Prabhu, V. U. & Whaley, J. Auditing saliency cropping algorithms. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 4051–4059 (IEEE, 2022).
Mittal, S., Thakral, K., Majumdar, P., Vatsa, M. & Singh, R. Are face detection models biased? In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition 1–7 (IEEE, 2023).
Majumdar, P., Mittal, S., Singh, R. & Vatsa, M. Unravelling the effect of image distortions for biased prediction of pre-trained face recognition models. In International Conference on Computer Vision 3786–3795 (IEEE, 2021).
Dulhanty, C. & Wong, A. Auditing imagenet: towards a model-driven framework for annotating demographic attributes of large-scale image datasets. Preprint at https://arxiv.org/abs/1905.01347 (2019).
Wang, A. et al. Revise: a tool for measuring and mitigating bias in visual datasets. Int. J. Comput. Vis. 130, 1790–1810 (2022).
Holland, S., Hosny, A., Newman, S., Joseph, J. & Chmielinski, K. The dataset nutrition label. Data Protect. Privacy 12, 1–26 (2020).
Li, Y., Troutman, W., Knijnenburg, B. P. & Caine, K. Human perceptions of sensitive content in photos. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 1590–1596 (IEEE, 2018).
Gervais, A., Ritzdorf, H., Lucic, M., Lenders, V. & Capkun, S. Quantifying location privacy leakage from transaction prices. In Computer Security–ESORICS 2016 382–405 (Springer, 2016).
Orekondy, T., Schiele, B. & Fritz, M. Towards a visual privacy advisor: understanding and predicting privacy risks in images. In Proc. IEEE International Conference on Computer Vision 3686–3695 (IEEE, 2017).
Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M. l-Diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 3 (2007).
Li, N., Li, T. & Venkatasubramanian, S. t-Closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering 106–115 (IEEE, 2007).
Xiao, X. & Tao, Y. M-invariance: towards privacy preserving re-publication of dynamic datasets. In Proc. 2007 ACM SIGMOD International Conference on Management of Data 689–700 (ACM, 2007).
Empowering Responsible AI Practices (Microsoft, 2024); https://www.microsoft.com/en-us/ai/responsible-ai
Responsible AI Practices (Google, 2024); https://ai.google/responsibility/responsible-ai-practices/
Roush, B. The White House addresses responsible AI: EO takeaways on fairness. Relativity (20 November 2023); https://www.relativity.com/blog/the-white-house-addresses-responsible-ai-eo-takeaways-on-fairness
Responsible AI Principles (Elsevier, 2024); https://www.elsevier.com/about/policies-and-standards/responsible-ai-principles
Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804 (2023).
Singh, R., Majumdar, P., Mittal, S. & Vatsa, M. Anatomizing bias in facial analysis. In Proc. AAAI Conference on Artificial Intelligence Vol. 36, 12351–12358 (AAAI, 2022).
Zong, Y., Yang, Y. & Hospedales, T. MEDFAIR: benchmarking fairness for medical imaging. In 11th International Conference on Learning Representations (ICLR, 2023).
Wamburu, J. et al. Systematic discovery of bias in data. In 2022 IEEE International Conference on Big Data 4719–4725 (IEEE, 2022).
Levi, G. & Hassner, T. Age and gender classification using convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 34–42 (IEEE, 2015).
Karkkainen, K. & Joo, J. FairFace: face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 1548–1558 (IEEE, 2021).
Moschoglou, S. et al. AgeDB: the first manually collected, in-the-wild age database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 51–59 (IEEE, 2017).
Wang, M., Zhang, Y. & Deng, W. Meta balanced network for fair face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44, 8433–8448 (2021).
Ramaswamy, V. V., Kim, S. S. & Russakovsky, O. Fair attribute classification through latent space de-biasing. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 9301–9310 (IEEE, 2021).
Meden, B. et al. Privacy-enhancing face biometrics: a comprehensive survey. IEEE Trans. Inf. Forensics Secur. 16, 4147–4183 (2021).
Rojas, W. A. G. et al. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS, 2022).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Bagdasaryan, E., Poursaeed, O. & Shmatikov, V. Differential privacy has disparate impact on model accuracy. In 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).
Qiu, H. et al. SynFace: face recognition with synthetic data. In Proc. IEEE/CVF International Conference on Computer Vision 10880–10890 (IEEE, 2021).
Melzi, P. et al. GANDiffFace: controllable generation of synthetic datasets for face recognition with realistic variations. In Proc. IEEE/CVF International Conference on Computer Vision (IEEE, 2023).
Kim, M., Liu, F., Jain, A. & Liu, X. DCFace: Synthetic face generation with dual condition diffusion model. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 12715–12725 (IEEE, 2023).
Carlini, N. et al. Extracting training data from diffusion models. In 32nd USENIX Security Symposium 5253–5270 (USENIX, 2023).
Hazirbas, C. et al. Towards measuring fairness in AI: the casual conversations dataset. IEEE Trans. Biometrics Behav. Identity Sci. 4, 324–332 (2021).
Zhang, Z., Song, Y. & Qi, H. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition 5810–5818 (IEEE, 2017).
Alvi, M., Zisserman, A. & Nellåker, C. Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018).
Bainbridge, W. A., Isola, P. & Oliva, A. The intrinsic memorability of face photographs. J. Exp. Psychol. 142, 1323–1334 (2013).
LoBue, V. & Thrasher, C. The Child Affective Facial Expression (CAFE) set: validity and reliability from untrained adults. Front. Psychol. 5, 1532 (2015).
Katti, H. & Arun, S. Are you from north or south India? A hard face-classification task reveals systematic representational differences between humans and machines. J. Vision 19, 1–1 (2019).
Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
Gaggiotti, O. E. et al. Diversity from genes to ecosystems: a unifying framework to study variation across biological metrics and scales. Evol. Appl. 11, 1176–1193 (2018).
Kahneman, D., Sibony, O. & Sunstein, C. R. Noise: A Flaw in Human Judgment (Hachette, 2021).
Sylolypavan, A., Sleeman, D., Wu, H. & Sim, M. The impact of inconsistent human annotations on AI driven clinical decision making. NPJ Digital Med. 6, 26 (2023).
Miceli, M., Schuessler, M. & Yang, T. Between subjectivity and imposition: power dynamics in data annotation for computer vision. In Proc. ACM on Human–Computer Interaction Vol. 4, 1–25 (ACM, 2020).
Ethics Guidelines (CVPR, 2022); https://cvpr2022.thecvf.com/ethics-guidelines
U.S. State Privacy Laws (LewisRice, 2024); https://tinyurl.com/mwmedz27
Nosowsky, R. & Giordano, T. J. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) privacy rule: implications for clinical research. Annu. Rev. Med. 57, 575–590 (2006).
General Law on the Protection of Personal Data (LGPD) Law No. 13,709 (Presidency of the Republic, 2018); http://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/L13709.htm
The Information Technology (Amendment) Act (Ministry of Law and Justice, 2008); https://eprocure.gov.in/cppp/rulesandprocs/kbadqkdlcswfjdelrquehwuxcfmijmuixngudufgbuubgubfugbububjxcgfvsbdihbgfGhdfgFHytyhRtMTk4NzY=
The Personal Data Protection Bill (Lok Sabha, 2019); https://sansad.in/getFile/BillsTexts/LSBillTexts/Asintroduced/341%20of%202019As%20Int....pdf?source=legislation
Privacy Protection (Transfer of Data to Databases abroad) Regulations, 5761–2001 (Minister of Justice, 2020); https://www.gov.il/BlobFolder/legalinfo/legislation/en/PrivacyProtectionTransferofDataabroadRegulationsun.pdf
Act on the Protection of Personal Information (Act No. 57 of 2003) (Cabinet Secretariat, 2003); https://www.cas.go.jp/jp/seisaku/hourei/data/APPI.pdf
The Law on Legal Protection of Personal Data of the Republic of Lithuania (Teises Aktu Registras, 1996); https://www.e-tar.lt/portal/lt/legalActEditions/TAR.5368B592234C?faces-redirect=true
Privacy Act 1993 (Parliamentary Counsel Office, 1993); https://www.legislation.govt.nz/act/public/1993/0028/latest/DLM296639.html
Nigeria Data Protection Regulation 2019 (National Information Technology Development Agency, 2019); https://olumidebabalolalp.com/wp-content/uploads/2021/01/NDPR-NDPR-NDPR-Nigeria-Data-Protection-Regulation.pdf
Protection of Personal Information Act, 2013 (Government Gazette, 2013); https://www.gov.za/sites/default/files/gcis_document/201409/3706726-11act4of2013protectionofpersonalinforcorrect.pdf
Federal Act on Data Protection (The Federal Council, 1992); https://www.fedlex.admin.ch/eli/cc/1993/1945_1945_1945/en
Personal Data Protection Act (Government Gazette, 2019); https://thainetizen.org/wp-content/uploads/2019/11/thailand-personal-data-protection-act-2019-en.pdf
Law 6698 on Personal Data Protection (Republic of Turkey Presidency, 2016); https://www.resmigazete.gov.tr/eskiler/2016/04/20160407-8.pdf
The California Privacy Rights and Enforcement Act of 2020 (Attorney General's Office, 2019); https://oag.ca.gov/system/files/initiatives/pdfs/19-0017%20%28Consumer%20Privacy%20%29.pdf
Fischer, M. Texas Consumer Privacy Act (Texas Legislature Online, 2019); https://capitol.texas.gov/tlodocs/86R/billtext/pdf/HB04518I.pdf
Capture or Use of Biometric Identifier Act (Texas Legislature Online, 2009); https://statutes.capitol.texas.gov/Docs/BC/htm/BC.503.htm
Substitute House Bill 1493 (House Technology and Economic Development, 2017); https://lawfilesext.leg.wa.gov/biennium/2017-18/Pdf/Bills/House%20Bills/1493-S.pdf?q=20230308063651
Ricanek, K. & Tesafaye, T. MORPH: a longitudinal image database of normal adult age-progression. In 7th International Conference on Automatic Face and Gesture Recognition 341–345 (IEEE, 2006).
Caltech 10k Web Faces (Caltech Vision Lab, 2023); https://www.vision.caltech.edu/datasets/caltech_10k_webfaces
Kumar, N., Belhumeur, P. & Nayar, S. FaceTracer: a search engine for large collections of images with faces. In European Conference on Computer Vision 340–353 (Springer, 2008).
Ryan, A. et al. Automated facial expression recognition system. In 43rd Annual 2009 International Carnahan Conference on Security Technology 172–177 (IEEE, 2009).
Kumar, N., Berg, A. C., Belhumeur, P. N. & Nayar, S. K. Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision 365–372 (IEEE, 2009).
Singh, R. et al. Plastic Surgery: a new dimension to face recognition. IEEE Trans. Inf. Forensics Secur. 5, 441–448 (2010).
Gupta, S., Castleman, K. R., Markey, M. K. & Bovik, A. C. Texas 3D Face Recognition Database. In 2010 IEEE Southwest Symposium on Image Analysis & Interpretation 97–100 (IEEE, 2010).
Wong, Y., Chen, S., Mau, S., Sanderson, C. & Lovell, B. C. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops 74–81 (IEEE, 2011).
Grgic, M., Delac, K. & Grgic, S. SCFace—surveillance cameras face database. Multimedia Tools Appl. 51, 863–879 (2011).
Wolf, L., Hassner, T. & Maoz, I. Face recognition in unconstrained videos with matched background similarity. In Conference on Computer Vision and Pattern Recognition 2011 529–534 (IEEE, 2011).
Riccio, D., Tortora, G., De Marsico, M. & Wechsler, H. EGA — ethnicity, gender and age, a pre-annotated face database. In 2012 IEEE Workshop on Biometric Measurements and Systems for Security and Medical Applications (BIOMS) Proceedings 1–8 (IEEE, 2012).
Mavadati, S. M., Mahoor, M. H., Bartlett, K., Trinh, P. & Cohn, J. F. DISFA: a spontaneous facial action intensity database. IEEE Trans. Affective Comput. 4, 151–160 (2013).
Setty, S. et al. Indian Movie Face Database: a benchmark for face recognition under wide variations. In National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics 1–5 (IEEE, 2013).
Vieira, T. F., Bottino, A., Laurentini, A. & De Simone, M. Detecting siblings in image pairs. Visual Comput. 30, 1333–1345 (2014).
Hancock, P. Stirling/ESRC 3D Face Database (Univ. Stirling, 2023); http://pics.stir.ac.uk/ESRC/
Eidinger, E., Enbar, R. & Hassner, T. Age and gender estimation of unfiltered faces. IEEE Trans. Inf. Forensics Secur. 9, 2170–2179 (2014).
Chen, B.-C., Chen, C.-S. & Hsu, W. H. Cross-age reference coding for age-invariant face recognition and retrieval. In European Conference on Computer Vision 768–783 (Springer, 2014).
Liu, Z., Luo, P., Wang, X. & Tang, X. Deep learning face attributes in the wild. In Proc. IEEE International Conference on Computer Vision 3730–3738 (IEEE, 2015).
Ng, H.-W. & Winkler, S. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP) 343–347 (IEEE, 2014).
Tresadern, P. et al. Mobile biometrics: combined face and voice verification for a mobile platform. IEEE Pervasive Comput. 99, 79–87 (2012).
Lenc, L. & Král, P. Unconstrained Facial Images: database for face recognition under real-world conditions. In Mexican International Conference on Artificial Intelligence 349–361 (Springer, 2015).
Niu, Z. et al. Ordinal regression with multiple output CNN for age estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition 4920–4928 (IEEE, 2016).
Rothe, R., Timofte, R. & Van Gool, L. Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vis. 126, 144–157 (2018).
Bianco, S. Large Age-Gap face verification by feature injection in deep networks. Pattern Recognit. Lett. 90, 36–42 (2017).
Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Proc. Machine Learning Research 77–91 (PMLR, 2018).
Sepas-Moghaddam, A., Chiesa, V., Correia, P. L., Pereira, F. & Dugelay, J.-L. The IST-EURECOM Light Field Face Database. In 2017 5th International Workshop on Biometrics and Forensics 1–6 (IEEE, 2017).
Cao, Q., Shen, L., Xie, W., Parkhi, O. M. & Zisserman, A. VGGFace2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face and Gesture Tecognition 67–74 (IEEE, 2018).
Kushwaha, V. et al. Disguised faces in the wild. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 1–9 (IEEE, 2018).
Maze, B. et al. IARPA Janus Benchmark — C: face dataset and protocol. In 2018 International Conference on Biometrics 158–165 (IEEE, 2018).
Wang, F. et al. The devil of face recognition is in the noise. In Proc. European Conference on Computer Vision 765–780 (Springer, 2018).
Wang, M. et al. Racial faces in the wild: reducing racial bias by information maximization adaptation network. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 692–702 (IEEE, 2019).
Dantcheva, A., Bremond, F. & Bilinski, P. Show me your face and I will tell you your height, weight and body mass index. In 2018 24th International Conference on Pattern Recognition 3555–3560 (IEEE, 2018).
Cheng, J. et al. Exploiting effective facial patches for robust gender recognition. Tsinghua Sci. Technol. 24, 333–345 (2019).
Shi, S. et al. PV-RCNN: point-voxel feature set abstraction for 3D object detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10529–10538 (IEEE, 2020).
Kalra, I. et al. Dronesurf: benchmark dataset for drone-based face recognition. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition 1–7 (IEEE, 2019).
Majumdar, P., Chhabra, S., Singh, R. & Vatsa, M. Subclass contrastive loss for injured face recognition. In 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems 1–7 (IEEE, 2019).
Afifi, M. & Abdelhamed, A. AFIF4: deep gender classification based on adaboost-based fusion of isolated facial features and foggy faces. J. Visual Commun. Image Rep. 62, 77–86 (2019).
Robinson, J. P. et al. Face recognition: too bias, or not too bias? In IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2020).
Morales, A., Fierrez, J., Vera-Rodriguez, R. & Tolosana, R. SensitiveNets: learning agnostic representations with application to face images. IEEE Trans. Pattern Anal. Mach. Intell. 43, 2158–2164 (2020).
Terhörst, P. et al. MAAD-FACE: a massively annotated attribute dataset for face images. IEEE Trans. Inf. Forensics Secur. 16, 3942–3957 (2021).
Cheema, U. & Moon, S. Sejong Face Database: a multi-modal disguise face database. Comput. Vis. Image Understand. 208, 103218 (2021).
Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surgery 4, 475–477 (2014).
Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2097–2106 (IEEE, 2017).
Shih, G. et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology 1, e180041 (2019).
Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (AAAI, 2019).
Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. PadChest: a large chest X-ray image dataset with multi-label annotated reports. Medical Image Anal. 66, 101797 (2020).
Vayá, M. D. L. I. et al. BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients with Extension Part I (IEEE DataPort, 2023).
Cohen, J. P. et al. COVID-19 Image Data Collection: prospective predictions are the future. J. Mach. Learn. Biomed. Imaging 1, 002 (2020).
Acknowledgements
S.M. was partially supported by the UGC-Net JRF Fellowship and IBM fellowship. K.T. is partially supported through the PMRF Fellowship. M.V. was partially supported through the Swarnajayanti Fellowship. This work was partially supported by Facebook AI. All data were stored, and experiments were performed on IITJ servers by the IITJ faculty and students.
Author information
Contributions
All authors contributed to problem formulation, study design, discussion of analysis, and writing and reviewing the manuscript. S.M. and K.T. performed the experimental evaluations.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Girmaw Abebe Tadessee and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Discussion and Tables 1–5.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mittal, S., Thakral, K., Singh, R. et al. On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare. Nat Mach Intell 6, 936–949 (2024). https://doi.org/10.1038/s42256-024-00874-y