Abstract
The adoption of machine learning (ML) and, more specifically, deep learning (DL) applications into all major areas of our lives is underway. The development of trustworthy AI is especially important in medicine due to the large implications for patients’ lives. While trustworthiness concerns various aspects including ethical, transparency and safety requirements, we focus on the importance of data quality (training/test) in DL. Since data quality dictates the behaviour of ML products, evaluating data quality will play a key part in the regulatory approval of medical ML products. We perform a systematic review following PRISMA guidelines using the databases Web of Science, PubMed and ACM Digital Library. We identify 5408 studies, out of which 120 records fulfil our eligibility criteria. From this literature, we synthesise the existing knowledge on data quality frameworks and combine it with the perspective of ML applications in medicine. As a result, we propose the METRIC-framework, a specialised data quality framework for medical training data comprising 15 awareness dimensions, along which developers of medical ML applications should investigate the content of a dataset. This knowledge helps to reduce biases as a major source of unfairness, increase robustness, facilitate interpretability and thus lays the foundation for trustworthy AI in medicine. The METRIC-framework may serve as a base for systematically assessing training datasets, establishing reference datasets, and designing test datasets which has the potential to accelerate the approval of medical ML products.
Introduction
During the last decade, the field of artificial intelligence (AI) and in particular machine learning (ML) has experienced unprecedented advances, largely due to breakthroughs in deep learning (DL)1,2,3,4,5 and increased computational power. Recently, the introduction of easy-to-use yet still extremely capable models such as GPT-46 and Stable Diffusion7 has further expanded the technology to an even broader audience. The large-scale handling and implementation of AI8 into fields such as manufacturing, agriculture and food, automated driving, smart cities and healthcare has since shifted the topic into the centre of attention of not just scholars and companies but the general public.
The introduction of novel and disruptive technologies is typically accompanied by an oscillating struggle between exploiting technological opportunities and mitigating risks. ML is proving to have great potential to improve many aspects of our lives9,10,11. However, the race for implementation and utilisation is currently outpacing comprehension of the technology. The complex and black box character of AI applications has therefore largely steered the public conversation towards safety, security and privacy concerns12,13. A lack of confidence among the general population in the transparency of AI hinders its utilisation for societal benefit and economic growth. It can lead to a slowed adoption of innovations in crucial areas and discourage innovators from unlocking the technology’s full potential. Hence, the demand for regulation (e.g., EU AI Act14, US FDA considerations15) as well as the need for an improved understanding of AI is ever increasing. This is of particular importance in the field of healthcare due to its large impact on people’s lives. The number of ML solutions in medicine (research tools and commercial products) is steadily on the rise, in particular in the fields of radiology and cardiology16,17. Despite breakthroughs up to human-level performance9,18,19,20, ML-backed medical products are mainly used as diagnosis assistance systems17, leaving the final decision to human medical professionals. In particular, medical ML solutions are successfully solving the task of image segmentation21,22,23. Due to the unknown consequences of using AI for medical decision-making, stringent regulatory requirements are of high importance, and fulfilling them efficiently is crucial to accelerate the approval process of new AI products into medical practice. Decision-making needs to be supported by reliable health data to generate consistent evidence. One of the drivers for evidence-based medicine approaches was the introduction of scientific standards in clinical practice24. Since then, data integrity (defined by the ALCOA-principles or ALCOA+25) has become an essential requirement of several guidelines, such as good clinical practice26, good laboratory practice27 or good manufacturing practice28. In the pharmaceutical industry, data integrity plays a similarly important role as a requirement for drug trials. While data integrity focuses on maintaining the accuracy and consistency of a dataset over its entire life cycle, data quality is concerned with the fitness of data for use.
To improve confidence in AI utilisation in general, the focus is put on the development of so-called trustworthy AI, which aims at overcoming the black box character and developing a better understanding. Several approaches and definitions for trustworthy AI have been discussed and published over the past years by researchers29,30,31,32,33, public entities34,35, corporations36, and organisations37,38. Depending on the area of interest, trustworthiness may include (but is far from limited to) topics such as ethics; societal and environmental well-being; security, safety, and privacy; robustness, interpretability and explainability; providing appropriate documentation for transparency and accountability29,30,31,32,33,34,35,36,37,38. In particular, the approach to achieve transparency through documentation has gained much attention in the form of reporting guidelines and best practices. While some initiatives cover the entire ML system and development pipeline (e.g., MINIMAR39, FactSheets40), others are concerned with documentation surrounding the model (e.g., Model Cards41), and still others concentrate on the documentation of datasets (e.g., Datasheets42, STANDING Together43,44, Dataset Nutrition Label45, Data Cards46, Healthsheet47, Data Statements for NLP48). These standardisation efforts are a crucial first step for developing a better understanding of ML systems as a whole and of the interdependence of its components (e.g., data and algorithm). However, these approaches cover only limited information on the content of datasets and their suitability for use in ML. Additionally, we note that reporting guidelines and best practices concerning the documentation of datasets are mostly written from the perspective of providers and creators of datasets42,45, with some explicitly trying to reduce information asymmetry between supplier and consumer40.
One of the most critical parts of an AI system is the quality of its training data since it has a fundamental impact on the resulting system. It lays the foundation and inherently provides limitations for the AI application. If the data used for training a model is bad, the resulting AI will be bad as well (‘garbage in, garbage out’49). Neural networks are prone to learning biases from training data and amplifying them at test time50, giving rise to a much discussed aspect of AI behaviour: fairness51. Many remedies have been put forward to tackle discriminating and unfair algorithm behaviour52,53,54. Yet, one of the main causes of undesirable learned patterns lies in biased training data55,56. Thus, data quality plays a decisive role in the creation of trustworthy AI, and assessing the quality of a dataset is of utmost importance to AI developers, as well as regulators and notified bodies.
The scientific investigation of data quality was initiated roughly 30 years ago. The term data quality was famously broken down into so-called data quality dimensions by Wang and Strong in 199657. These dimensions represent different characteristics of a dataset which together constitute the quality of the data. Throughout the years, general data quality frameworks have taken advantage of this approach and have produced refined lists of data quality dimensions for various fields of application and types of data. Naturally, this has produced different definitions and understandings. Within this systematic review, we transfer the existing research and knowledge about data quality to the topic of AI in medicine. In particular, we investigate the research question: Along which characteristics should data quality be evaluated when employing a dataset for trustworthy AI in medicine? The systematic comparison of previous studies on data quality combined with the perspective on modern ML enables us to develop a specialised data quality framework for medical training data: the METRIC-framework. It is intended for assessing the suitability of a fixed training dataset for a specific ML application, meaning that the model to be trained as well as the intended use case should drive the data quality evaluation. The METRIC-framework provides a comprehensive list of 15 awareness dimensions which developers of AI medical devices should be mindful of. Knowledge about the composition of medical training data with respect to the dimensions of the METRIC-framework should drastically improve comprehension of the behaviour of ML applications and lead to more trustworthy AI in medicine.
We note that data quality itself is a term used in different settings, with different meanings and varying scopes. For the purpose of this review, we focus on the actual content of a dataset instead of the surrounding technical infrastructure. We do so since the content is the part of a dataset which ML applications use to learn patterns and develop their characteristics. We thus exclude research on data quality considerations and frameworks within the topic of data governance and data management58,59. This concerns aspects such as data integration60, information quality management61, ETL processes in data warehouses62, or tools for data warehouses63,64 which do not affect the behavioural characteristics of AI systems. We also omit records discussing case studies of survey data quality65,66, as well as training strategies to cope with bad data67,68,69,70,71,72.
We further point out that the use of the term AI in current discussions is scientifically imprecise since discussions within the healthcare sector almost exclusively revolve around the implementation of ML approaches, in particular of DL approaches. Technically, the term AI spans a much wider range of technologies than just DL as part of the field of ML. Due to the complexity of DL applications and their proficiency in solving tasks deemed to require human intelligence, the terms are currently often used interchangeably in literature. We follow the same vocabulary here (e.g., ‘trustworthy AI’, ‘AI in medicine’) but stress the limitation of our results to ML approaches.
Results
In order to answer the research question ‘Along which characteristics should data quality be evaluated when employing a dataset for trustworthy AI in medicine?’, we conducted an unregistered systematic review following the PRISMA guidelines73. Our predetermined search string contains variations of the following terms: (i) data quality, (ii) framework or dimensions and (iii) machine learning (see Methods for more details and the full search string). The initial search of the databases Web of Science, PubMed and ACM Digital Library was performed on the 12th of April 2024 and yielded 4633 unique results. After title and abstract screening, adding references of the remaining records (‘snowballing’) and full text assessment, we identified 120 records that match our eligibility criteria (see Methods). This represents the literature corpus that serves as a foundation for answering the research question. The full workflow is illustrated in Fig. 1.
In Fig. 2, the papers from our literature corpus are displayed according to their publication year57,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192. The overarching topics contained in the corpus naturally divide the papers into three categories: general data (35 entries), big data (8 entries) and ML data (77 entries). This reflects the historic development of the research field of data quality during the last 30 years.
General data quality
The field first shifted into focus with digital and automatically mass-generated data during the 1980s and 1990s, which created a need for quality evaluation and control on a broad scale. While landmark papers57,74 built the foundation for the field during the first 10 years, the last 20 years have seen general data quality frameworks published more frequently75,76,77,78,79,80,81,82,83,84. The literature corpus additionally contains general data quality frameworks with high specificity to medical applications85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105 while frameworks with high specificity to non-medical topics193,194 were excluded.
The early data quality research in the 1980s and 1990s uncovered the lack of objective measures to assess data quality, which led to the introduction of task-dependent dimensions and the establishment of a data quality framework from the perspective of the data consumer57. Another fundamental challenge in the data quality field is efficient data storage while maintaining quality. This was first investigated with the introduction of a data quality framework from the perspective of the data handler74. Both approaches to data quality proved to be useful and were unified in one framework75. In the following years, the frameworks were further extended76,77, equipped with measures78,79 and refined80,81. Moreover, it became clear that specialised fields such as the medical domain require adapted frameworks.
With the overarching question of how to improve patient care and the rise of electronic health records (EHR) in the 1990s, the need for high data quality in the medical sector increased. Accordingly, one of the first data quality frameworks in healthcare was implemented by the Canadian Institute for Health Information85. The first comprehensive data quality framework specifically for EHR data in the literature corpus was established by conducting a survey of quality challenges in EHR86. It considers, among other characteristics, accuracy, completeness and particularly timeliness. However, accuracy is hard to quantify in the medical context as even the diagnoses of experienced practitioners sometimes do not coincide. Accordingly, the notion of concordance of differing data sources was introduced87. Yet, the data quality frameworks for EHR could only be transferred to other types of medical data to a certain extent. Thus, data quality frameworks for particular data types such as immunisation data, public health data, multi-centre healthcare data or similar were put forward88,89,90,91,92,93,94,95. The various frameworks still suffered from inconsistent terminology and attempts were made to harmonise the definitions and assessment96,97,98,99,100,101,102,103. Particularly, Kahn et al.97 proposed a framework with exact definitions and recently, Declerck et al.103 published a ‘review of reviews’ portraying the different terminologies and attempting to map them to a reference. While these developments have advanced the understanding of data quality in the context of medical applications, frameworks for EHR frequently focus on the data quality of individual patients86,87, neglecting data quality aspects for the overall population. In particular, representativeness is often not a factor86,87 while it is a crucial property for secondary use of data in clinical studies88 or when reusing medical data as training data for ML applications.
Big data quality
As the amount of data from varying sources grew, conventional databases reached their capacity and the field of big data emerged. Big data is generally concerned with handling huge unstructured data streams that need to be processed at a rapid pace, emphasising the need for extended data quality frameworks. This development is reflected by a small wave of papers published between 2015 and 2020106,107,108,109,110,111,112,113. For example, the weaker structure of the data encouraged the use of data quality frameworks that include the data schema as a data quality dimension106,107. Further, the increasing amount of data requires the computational efficiency of the surrounding database infrastructure to be a part of big data quality frameworks108,109,110. Computational efficiency is also a limiting factor when ML methods are applied to big data. While it is generally assumed that more data leads to better results, this has to be balanced with computational capabilities. Hence, a data quality framework was developed that bridges the gap between ML and big data111. We note that the ‘4 V’s’ (volume, velocity, veracity and variety) of big data195 implicitly suggest a framework for big data quality. However, the ‘4 V’s’ are in fact big data traits which can have an effect on data quality but are not considered data quality dimensions196. They therefore do not contribute to answering our research question and are not further discussed. This might change in the future when data from wearables or remote patient monitoring sensors become available for health management.
ML data quality
The performance and behaviour of DL applications heavily depends on the quality of the data used during training as this is the foundation from which patterns are learned. The records of the literature corpus which discuss or empirically evaluate the effect of data quality on DL deal with a wide variety of data types and models. Many records investigate tabular data while utilising both simpler and more advanced architectures114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132. Recently, studies increasingly look at data quality in the context of sequential data (often time series)133,134,135,136,137,138,139, images119,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182, natural language120,121,145,182,183,184,185,186 or other complex types of data122,151,187,188,189,190,191,192. Some papers try to estimate data quality effects on ML models by using synthetic data122,123,189.
Contrary to the big data and general data quality literature from our corpus, the DL papers focus on the evaluation of one or very few specific data quality dimensions without (yet) considering broader theoretical data quality frameworks. Dimensions that are predominantly investigated are those which can easily be manipulated and lend themselves to be applicable to a wide range of datasets irrespective of specific tasks. The most prominent dimension is amount of data123,124,125,126,129,146,147,148,149,150,151,152,153,154,155,156,157,158,159,183,184,185,186,187,188,189 which is empirically shown to benefit performance, albeit in a saturating manner. Another dominant topic is completeness, which the ML community almost exclusively refers to as missing data119,125,126,127,128,133,134,135,182. The effect that data errors have on the DL application is also frequently investigated. Specifically, this is done by separately looking at perturbed features (inputs of a NN)128,129,130,131,132,133,134,136,159,160,161,162,163,164,165,166,167,168,182 and noisy targets (predictions to be generated by a NN)131,132,133,154,155,156,157,158,159,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183. Many ML settings are classification tasks, which is reflected by the corpus often addressing label noise157,158,159,169,182,183. One record highlights the substantial weight that physicians’ annotations carry in medicine158. In order to evaluate the effect of data quality (features or targets) on ML applications, the training data is commonly manipulated. On the feature (input) side, e.g., images are distorted by adjusting contrast whereas time series sequences are disturbed by swapping elements. On the target side, e.g., correct labels are randomly replaced by false ones.
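To make the two manipulation strategies concrete, the following minimal sketch degrades image contrast on the feature side and randomly flips labels on the target side. It assumes images stored as float arrays in [0, 1] and integer class labels; the function names and parameters are illustrative and not taken from any of the reviewed studies.

```python
# A minimal sketch of the manipulation strategies described above; all names and
# parameters are illustrative, not from the reviewed studies.
import numpy as np

def reduce_contrast(images: np.ndarray, factor: float) -> np.ndarray:
    """Feature perturbation: compress pixel values towards each image's mean."""
    means = images.mean(axis=(1, 2), keepdims=True)
    return means + factor * (images - means)

def corrupt_labels(labels: np.ndarray, noise_rate: float, n_classes: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Target perturbation: replace a fraction of labels with random classes."""
    noisy = labels.copy()
    flip = rng.random(labels.shape[0]) < noise_rate
    noisy[flip] = rng.integers(0, n_classes, size=int(flip.sum()))
    return noisy

rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))        # placeholder image batch
labels = rng.integers(0, 10, size=100)    # placeholder class labels
degraded = reduce_contrast(images, factor=0.3)
noisy_labels = corrupt_labels(labels, noise_rate=0.2, n_classes=10, rng=rng)
```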
When it comes to the concrete behaviour change of the DL algorithm, most of the DL papers in the literature corpus investigate the robustness of a model, i.e. the stable behaviour of a model when facing erroneous inputs or a limited amount of data. Only a few records investigate generalisability119,144,145 or distribution shift139,192, a model’s capability of coping with new, unseen data. Another noteworthy exception is Ovadia et al.145 who additionally study predictive uncertainty.
Overall, theoretical data quality frameworks enjoy little attention from the ML community due to the novelty of the ML research field. Papers often focus on few specific data quality dimensions and tasks. Each task comes with its specific data type, necessitating different approaches to manipulate the data and measure these effects. The research dealing with the impact of manipulated data is heavily skewed towards robust behaviour in the sense of predictive performance. Other possibly affected aspects such as explainability or fairness are underrepresented and to some degree neglected, which is a potential shortcoming for safety-critical applications such as medical diagnosis predictions.
METRIC-framework for medical training data
The literature corpus has shown that while similar ideas exist for the assessment of data quality across fields and applications, the idiosyncrasy of each field or application can only be captured by specialised frameworks rather than by a one-model-fits-all framework. The evaluation of data quality plays a particularly important role in the field of ML because its behaviour depends not only on the choice of algorithm but also strongly on its training data. At the same time, ML is implemented in various fields, each processing and requiring different types and qualities of data. We therefore propose a specialised data quality framework for evaluating the quality of medical training data: the METRIC-framework (Fig. 3), which is based on our literature corpus57,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192. We note that the METRIC-framework is specifically not designed to assess the data quality of a dataset in a vacuum. Rather, it was conceived for the situation where the purpose of the desired medical AI is known. Thus, the intention of the METRIC-framework is to assess the appropriateness of a dataset with respect to a specific use case. From now on, we refer to data quality for training (or test) data of medical ML applications only. We point out that our framework does not yet include a guideline on the assessment or measurement of data qualities but rather presents a set of awareness dimensions which play a central role in the evaluation of data quality.
While examining the literature corpus, we found that terms describing data quality appear under varying definitions, or often with no definition at all. While standardisation efforts exist for the terminology in the context of evaluating data quality83,197,198, they are often not employed or did not yet exist for older papers, making comparisons difficult. Therefore, as a first step, we extracted all mentioned data quality dimensions from the literature corpus together with their definitions (if present) and added them to a list. This yielded 461 different terms with 991 mentions across all papers. Second, we hierarchically clustered the terms with respect to their intended meaning and according to their dependencies into clusters, dimensions and subdimensions (see Methods for more details on data extraction). We thus obtained 38 relevant dimensions and subdimensions which are displayed on the outer circle of Fig. 3. In Tables 1–6, we provide a complete list of definitions for all 38 relevant dimensions and subdimensions, as well as their hierarchy, practical examples and references with respect to the literature corpus. We adopted definitions from a recent data quality glossary197 if they existed there and met our understanding of the dimension in the given context of medical training data. If necessary, we included definitions given by Wang et al.57 in a second iteration. If neither of these two sources suggested an appropriate definition, we captured the meaning of the desired term on the basis of the literature corpus and thus determined its definition in the context of medical training data.
The METRIC-framework encompasses three levels of detail: clusters which pool similar dimensions; dimensions which are individual characteristics of data quality; and subdimensions which split larger dimensions into more detailed attributes (compare Fig. 3 from inside to outside). Besides the terms contained in the METRIC-framework, we found several frequently mentioned dataset properties which we, for our purpose, want to separate from the METRIC-framework. We summarise these additional properties under a separate cluster called data management (Fig. 4). The attributes included in this cluster ensure that a dataset is well-documented, legally and effectively usable. In particular, it includes the properties documentation, security and privacy, as well as the well-established FAIR-Principles199. Appropriate documentation of datasets is the topic of multiple initiatives42,43,44,45,46,47 that give guidance for the data creator and handler. The METRIC-framework on the other hand is targeted towards AI developers. It evaluates the suitability of the content of the data for a specific ML task, which is greatly facilitated by appropriate documentation but does not depend on it. Similarly, the FAIR-principles199, requiring data to be findable, accessible, interoperable and reusable, are vital for evaluating datasets for general purpose but are not included in the METRIC-framework since the question of fit for a specific purpose can only be asked once a dataset has already been successfully obtained. Security is another important aspect of data management: Who can access and edit the data? Can it be manipulated? Again, such questions concern the handling of the data, not the evaluation of its content. Finally, privacy (data privacy and patient privacy) is a delicate and heavily discussed topic in the context of healthcare. However, we separate these issues from the METRIC-framework since they concern data collection, creation and handling. We note that aspects such as anonymisation or pseudonymisation may impact the quality of the content of a dataset by, e.g., removing information167. However, the METRIC-framework is designed to evaluate the resulting dataset with respect to its usefulness for a specific task, not the quality of the modifications. Hence, while these properties play a central role in the creation, handling, management and acquisition of data, the METRIC-framework is targeted at the content of a dataset since that is the part the ML algorithm learns from. Therefore, we see the data management cluster as a prerequisite for data quality assessment by the METRIC-framework, which itself divides the concept of data quality for the content of a dataset into five clusters: measurement process, timeliness, representativeness, informativeness and consistency. A summary of the characteristics and key aspects of all five clusters is given in Table 7.
Measurement process
The cluster measurement process captures factors that influence uncertainty during the data acquisition process. Two of the dimensions within this cluster differentiate between technical errors originating from devices during measurement (see device error) and errors induced by humans during, e.g., data handling, feature selection or data labelling (see human-induced error). For the dimension device error, we distinguish between the subdimension accuracy, the systematic deviation from the ground truth (also called bias), and the subdimension precision, the variance of the data around a mean value (also called noise). In practice, a ground truth for medical data is most often not attainable, making accuracy evaluation impossible. In that case, the level and structure of noise in the training data should be compared to the expected noise in the data after AI deployment. If the training data only contains low noise but the AI is utilised in clinical practice on data with much higher noise levels, the performance of the AI application might not be sufficient since the model did not face suitable error characteristics during training. Therefore, lower noise data is not necessarily better and adding noise to the training data might in some instances even improve performance200,201,202. The errors belonging to the dimension human-induced error are of a fundamentally different nature and need to be treated accordingly. This type of error includes human carelessness and outliers in the dataset due to (unintentional) human mistakes. The final subdimension, noisy labels, is one of the most relevant topics in current ML research157,159,169,182. Since supervised learning paradigms are prevalent in the medical domain, proper feature selection and reliable labelling are indispensable. However, human decision-making can be highly irrational and subjective, especially in the medical context203,204,205, representing one of various sources of labelling noise206. Among expert annotators there is often considerable variability206,207. Even in the most common (non-medical) datasets of ML (e.g., MNIST208, CIFAR-100209, Fashion-MNIST210) there is a significant percentage of wrong labels211,212. In contrast to the precision of instruments, noise in human judgements needs to be assessed through so-called noise audits to identify different factors, like pattern noise and occasion noise, in the medical decision process213. Such intra- and inter-observer variability has always been a highly important topic in many medical disciplines, e.g., in radiology where guidelines, training and consensus reading approaches are used to reduce noise214.
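As an illustration of how label noise stemming from inter-observer variability can be quantified before training, the following sketch computes Cohen’s kappa between two simulated annotators. The annotator arrays and the 15% disagreement rate are hypothetical placeholders; any established agreement statistic could be substituted.

```python
# Illustrative quantification of inter-observer variability as a source of
# label noise; the annotator data is simulated and purely hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
annotator_a = rng.integers(0, 2, size=500)             # e.g. benign/malignant calls
disagree = rng.random(500) < 0.15                      # simulate 15% disagreement
annotator_b = np.where(disagree, 1 - annotator_a, annotator_a)

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa between annotators: {kappa:.2f}")  # 1.0 = perfect agreement
```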
Another issue that frequently occurs in the data acquisition process and which plays an important role in ML is the absence of data values with unknown reason. We follow the ML vocabulary by capturing this quality issue with the dimension completeness, while noting that outside of ML contexts, this term is commonly used to describe representativeness, coverage or variety. Most prominently, Wang et al.57 define completeness as ‘breadth, depth, and scope of information’. This definition has been picked up by other researchers, as well100,106,126. In ML, however, completeness is usually measured by the ratio of missing to total values. Apart from the mostly quantitative dimensions within the cluster, the dimension source credibility is concerned with mostly qualitative characteristics. On the one hand, it includes the question whether or not the measured data can be trusted based on the expertise of people involved in data measurement, processing and handling. On the other hand, the subdimension traceability evaluates whether changes from original data to its current state are documented. Being aware of modifications such as the exclusion of outliers, automated image processing in medical imaging or data normalisation, and of the utilised algorithms, is necessary for understanding the composition of the data. Finally, the subdimension data poisoning considers whether the data was intentionally corrupted (e.g., adversarial attacks) to cause distorted outcomes. The entire cluster measurement process is crucial for data quality evaluation in the medical field since errors may propagate through the ML model and lead to false diagnosis or treatment of patients.
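In this ML sense, completeness can be computed directly as the fraction of non-missing values. The sketch below uses pandas on a hypothetical toy table; the column names and records are purely illustrative.

```python
# Completeness in the ML sense: fraction of non-missing values, per feature and
# overall. The toy records and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [54, 61, np.nan, 47],
    "systolic_bp": [120, np.nan, np.nan, 135],
    "diagnosis": ["I10", "I10", "E11", "E11"],
})

per_feature = 1 - df.isna().mean()           # completeness per column
overall = 1 - df.isna().to_numpy().mean()    # completeness of the whole table
print(per_feature)
print(f"Overall completeness: {overall:.2f}")
```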
We note that special consideration has to be given to the field of medical imaging within the measurement process cluster due to the fact that many imaging devices are not classical measurement devices. For instance, in current radiological practice, decisions are still based mainly on visual inspection of images, and the rating of diseases and therapy effects is often done in qualitative terms such as ‘enlarged’, ‘smaller’ or ‘enhanced’. This places a lot of importance on the qualitative dimension source credibility and its subdimension expertise with respect to quality assessment in such use cases. However, over the last two decades significant efforts have been made to establish quantitative imaging biomarkers that transform scanners more into measurement devices quantifying biophysical parameters such as flow, perfusion, diffusion or elasticity. Such quantitative imaging approaches reduce the operator dependency and enable more quantitative evaluation in the dimension device error. Worldwide alliances such as the Quantitative Imaging Biomarkers Alliance (QIBA), launched in 2007 by the Radiological Society of North America215 and now replaced by the Quantitative Imaging Committee (QUIC), the Quantitative Imaging Network (QIN) of the National Cancer Institute in the US216 or the European Imaging Biomarkers Alliance (EIBALL) of the European Society of Radiology217 are committed to making this transformation.
Timeliness
Since medical knowledge and understanding are subject to constant development, it is important to investigate the cluster timeliness, which indicates whether the point in time at which the dataset is used, in relation to the point in time at which it was created and updated, is appropriate for the task at hand. Indications for diagnoses based on medical data may have changed since a dataset was created and labelled, and changes in coding systems (such as the transition from ICD-9 to ICD-10 or ICD-9-CM to ICD-10-CM) may affect mortality and injury statistics218,219. The age of the data dictates whether such investigations are necessary. In such cases, the labels or standards utilised would then have to be appropriately updated to satisfy the subdimension currency. Furthermore, knowledge about the subdimension age might provide information about precision and accuracy of the measurement as it gives insight into the technology used during data acquisition.
Representativeness
Another central cluster, especially for medical applications, is representativeness. Its dimensions are concerned with the extent to which the dataset represents the targeted population (such as patients) for which the application is intended. Whether the population of the dataset covers a sufficient range in terms of age, sex, race or other background information is the topic of the subdimension variety in demographics contained within the dimension variety. This dimension also contains the subdimension variety of data sources concerned with questions such as: Does the data originate from a single site? Were the measurements done with devices from the same or different manufacturers? Appropriately investigating such questions can provide a strong indication for the applicability and generalisability of the ML application in different environments220,221,222,223. The dimension depth of data is one of the main topics of the ML papers in our literature corpus. Apart from the subdimension dataset size already discussed in the previous section, this dimension also includes the subdimension granularity, which considers whether the level of detail (e.g., the resolution of image data) is sufficient for the application, as well as the subdimension coverage, which investigates whether sub-populations (e.g., specific age groups) are still diverse by themselves (e.g., still contain all possible diagnoses in case of classification applications). Finally, the highly discussed dimension target class balance reflects the technical requirements of ML140,141,144,150,159. An algorithm must learn patterns for specific classes from the training data. However, strong imbalances in the class ratio could be caused by, e.g., rare diseases. In order to still be able to properly learn corresponding patterns, it may be helpful to deliberately overrepresent rare classes in the dataset instead of matching their real-world distribution224,225.
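A simple way to deliberately overrepresent a rare class is random oversampling, sketched below with a hypothetical rare-disease label. The 2% prevalence and the 20% target fraction are assumptions chosen for illustration; in practice the ratio must be selected with the intended use case in mind, and more elaborate rebalancing schemes exist.

```python
# Sketch of random oversampling of a rare class; prevalence and target fraction
# are illustrative assumptions.
import numpy as np

def oversample_minority(X: np.ndarray, y: np.ndarray, minority_class: int,
                        target_fraction: float, rng: np.random.Generator):
    """Duplicate random minority-class samples until they make up target_fraction."""
    minority_idx = np.flatnonzero(y == minority_class)
    # solve (m + k) / (n + k) = target_fraction for the number k of duplicates
    k = int((target_fraction * len(y) - len(minority_idx)) / (1 - target_fraction))
    if k <= 0:
        return X, y
    extra = rng.choice(minority_idx, size=k, replace=True)
    return np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])

rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = (rng.random(1000) < 0.02).astype(int)    # ~2% rare-disease class
X_bal, y_bal = oversample_minority(X, y, minority_class=1, target_fraction=0.2, rng=rng)
print(f"class-1 fraction before: {y.mean():.3f}, after: {y_bal.mean():.3f}")
```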
Informativeness
The cluster informativeness considers the connection between the data and the information it provides and whether the data does so in a clear, compact and beneficial way. First of all, understandability considers whether the information in the data is easily comprehended. Second, the dimension redundancy investigates whether such information is concisely communicated (see subdimension conciseness) or whether redundant information is present such as duplicate records (see subdimension uniqueness). The dimension informative missingness answers the question whether the patterns of missing values provide additional information. Che et al.135 find an informative pattern in the case of the MIMIC-III critical care dataset226 which displays a correlation between missing rates of variables and ICD-9 diagnosis labels. Missingness patterns are categorised by the literature into either not missing at random (NMAR), missing at random (MAR) or missing completely at random (MCAR)227,228. Finally, feature importance is concerned with the overall relevance of the features for the task at hand and, moreover, with the value each feature provides for the performance of an ML application since the quantity of data has to be balanced with computational capability. Valuable features might in many cases be as important as dataset size229, which is a frequently discussed topic in the data-centric AI community230.
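A first practical probe for informative missingness, in the spirit of the MIMIC-III observation above, is to encode missingness as an indicator variable and compare its rate across target classes. The following sketch simulates such a situation with a hypothetical laboratory value; the data, feature names and effect sizes are illustrative assumptions.

```python
# Probe for informative missingness: encode missingness as an indicator and
# compare its rate across target classes. Data and effect sizes are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
label = rng.integers(0, 2, size=n)                     # hypothetical outcome label
measured = rng.random(n) < 0.3 + 0.4 * label           # sicker patients measured more often
lactate = np.where(measured, rng.normal(2.0, 0.5, n), np.nan)

df = pd.DataFrame({"lactate": lactate, "label": label})
df["lactate_missing"] = df["lactate"].isna().astype(int)

# A gap in missing rates between the classes suggests informative missingness.
print(df.groupby("label")["lactate_missing"].mean())
```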
Consistency
The dimensions belonging to the cluster consistency illuminate the topic of consistent data presentation from three perspectives. Rule-based consistency summarises subdimensions concerned with format (syntactic consistency), which includes the fundamental and well-discussed topic of data schema106, and with conformity to standards and laws (compliance). These subdimensions ensure that the dataset is easily processable on the one hand and comparable and legally correct on the other. Logical consistency evaluates whether or not the content of the dataset is free of contradictions, both within the dataset (e.g., a patient without kidneys who is diagnosed with kidney stones) and in relation to real-world knowledge (e.g., a 200-year-old patient). The last dimension of the cluster, distribution consistency, concerns the distributions of relevant subsets of the total dataset and their statistical properties. While the subdimension homogeneity evaluates whether subsets have similar or different statistical properties at the same point in time (e.g., can data from different hospitals be identified by statistics?), the subdimension distribution drift deals with varying distributions at different time points. This subdimension can be neglected if the dataset is not continuously changing over time, but distribution drift is sometimes overlooked due to a lack of model surveillance. It is therefore a prominent research topic145, and this risk of being overlooked further underlines the importance of distribution drift for medical applications93.
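The sketch below illustrates two checks from this cluster on a hypothetical toy table: a rule-based logical consistency check for implausible ages and a two-sample Kolmogorov–Smirnov test comparing the distribution of one variable across two sites as a rough homogeneity probe. The thresholds, column names and data are illustrative assumptions.

```python
# Two illustrative consistency checks on a hypothetical toy table: a logical
# consistency rule (implausible age) and a homogeneity probe between two sites.
import pandas as pd
from scipy.stats import ks_2samp

df = pd.DataFrame({
    "age": [47, 62, 200, 35, 58, 71],                  # 200 contradicts real-world knowledge
    "site": ["A", "A", "A", "B", "B", "B"],
    "systolic_bp": [118, 142, 131, 125, 137, 129],
})

# Logical consistency: flag records that violate a simple plausibility rule.
implausible = df[(df["age"] < 0) | (df["age"] > 120)]
print("Implausible ages:\n", implausible)

# Homogeneity: do the two sites follow a similar blood-pressure distribution?
site_a = df.loc[df["site"] == "A", "systolic_bp"]
site_b = df.loc[df["site"] == "B", "systolic_bp"]
result = ks_2samp(site_a, site_b)
print(f"KS statistic {result.statistic:.2f}, p-value {result.pvalue:.2f}")
```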
Discussion
The METRIC-framework (Fig. 3) represents a comprehensive system of data quality dimensions for evaluating the content of medical training data with respect to an intended ML task. We stress again that these dimensions should for now be regarded as awareness dimensions. They provide a guideline along which developers should familiarise themselves with their data. Knowledge about these characteristics is helpful for recognising the reasons for the behaviour of an AI system. Understanding this connection enables developers to improve data acquisition and selection, which may help in reducing biases, increasing robustness and facilitating interpretability, and thus has the potential to drastically improve the AI’s trustworthiness.
With training data being the basis for almost all medical AI applications, the assessment of its quality gains more and more attention. However, we note that providing a division of the term data quality into data quality dimensions is only the first step on the way to overall data quality assessment. The next step will be to equip each data quality dimension with quantitative or qualitative measures to describe their state. The result of this measure then has to be evaluated with respect to the question: Is the state of the dimension appropriate for the desired AI algorithm and its application? These three steps (choosing a measure, obtaining a result, evaluating its appropriateness for the desired task) can be applied to each dimension and subdimension. Appropriately combining the individual outcomes can potentially serve as a basis for a measure of the overall data quality in future work.
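The three-step pattern can be expressed as a small data structure, sketched below for the dimension completeness. This is purely illustrative: the METRIC-framework itself does not prescribe measures or thresholds, and the chosen measure and appropriateness criterion are assumptions made for the example.

```python
# Purely illustrative encoding of the three steps (choose a measure, obtain a
# result, judge its appropriateness); measures and thresholds are assumptions,
# not part of the METRIC-framework itself.
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class DimensionAssessment:
    dimension: str
    measure: Callable[[pd.DataFrame], float]    # step 1: chosen measure
    is_appropriate: Callable[[float], bool]     # step 3: use-case dependent judgement

    def run(self, data: pd.DataFrame) -> dict:
        value = self.measure(data)               # step 2: obtain the result
        return {"dimension": self.dimension, "value": value,
                "appropriate": self.is_appropriate(value)}

completeness = DimensionAssessment(
    dimension="completeness",
    measure=lambda df: float(1 - df.isna().to_numpy().mean()),
    is_appropriate=lambda v: v >= 0.95,          # threshold depends on the use case
)
example = pd.DataFrame({"age": [54, None, 61], "sex": ["f", "m", "m"]})
print(completeness.run(example))
```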
So far the dimensions in the METRIC-framework are not ranked in any way. However, it is clear that some of them are more important than others. Therefore, some dimensions deserve more attention in the assessment process or might even be a criterion for exclusion of a dataset for a certain task. These dimensions should be among the first to be assessed in practice. On the other hand, some dimensions are much more difficult to measure and evaluate than others. This can be due to their qualitative nature, the complexity of the statistical measure, the degree of use-case dependence or the expert knowledge that is needed for the assessment, to name a few. These considerations are of central interest for the development of a complete data quality assessment and examination process.
In Fig. 5, we provide insights that should be taken into consideration when practically assessing data quality. We classify each of the 15 awareness dimensions along two different properties. On the one hand, we estimate whether a dimension requires mostly quantitative or qualitative measures. We observe that about half of the dimensions require mostly quantitative measures while a fifth necessitate more manual inspection by qualitative measures (see left-hand side of Fig. 5). Being able to choose quantitative measures typically implies more objectivity and enables automation, two desirable properties for quality assessment. Dimensions categorised as mostly qualitatively measurable or requiring a mixture of quantitative and qualitative input will typically require specific domain knowledge from the medical field. Such domain knowledge can be difficult to obtain and expensive.
On the other hand, we consider whether the state of a dimension or the evaluation of its appropriateness level is use case dependent (see right-hand side of Fig. 5). This is of interest to developers as use case dependent dimensions require not only additional knowledge, work and time during quality assessment but also during quality improvement of data. Our findings suggest a division of the wheel of data quality after categorising all 15 dimensions. The clusters representativeness and timeliness as well as the dimensions device error and feature importance belong to the group of use case dependent dimensions. Whether a dataset is representative of the targeted population can only be evaluated with knowledge of the use case. Similarly, the importance of features changes between applications. Whether the age and currency of the data (see dimension timeliness) are appropriate can also differ depending on the task. For instance, the coding standard the data should conform to depends on the application. The newest standards are not necessarily the best if in practice these standards are not implemented (see section on Timeliness). Similarly, reducing noise levels in the data is not necessarily better for all applications. It rather depends on the expected noise levels of the application (see section on Measurement process for more detail).
For an overall assessment of the quality of the dataset, we estimate that on average the dimensions of the representativeness cluster together with the dimensions feature importance, distribution consistency and human-induced error are crucial factors. Ignoring a single one of these dimensions potentially has proportionally larger effects on the AI application than other dimensions. This might also depend on the type of ML problem. Actual quantification of the effect of data quality dimensions on ML applications is part of ongoing and future research. Nevertheless, we for now recommend prioritising these six dimensions if it is possible to dedicate time to evaluating or improving a dataset. With the exception of the dimension feature importance, all of the crucial dimensions are simultaneously measured mostly quantitatively making them primary candidates for software tools designed for improving the quality of datasets.
The importance of data quality for medical ML products is undisputed and gaining more and more attention with on-going discussions about fairness and trustworthiness. Parts of future regulation and certification guidelines will not only include ML algorithms but likely also require evaluating the quality of datasets used for their training and testing. Such inclusion of data quality in regulation requires systematic assessment of medical datasets. The METRIC-framework may serve as a base for such a systematic assessment of training datasets, for establishing reference datasets, and for designing test datasets during the approval of medical ML products. This has the potential to accelerate the process of bringing new ML products into medical practice.
Methods
Literature review
In order to answer the research question ‘Along which characteristics should data quality be evaluated when employing a dataset for trustworthy AI in medicine?’, we conducted a systematic review following the PRISMA guidelines73. The goal of such a review is to objectively collect the knowledge of a chosen research area by summarising, condensing and expanding the ideas to further its progress. PRISMA reviews commonly follow four main steps: (i) Searching suitable databases with carefully formulated search strings and extracting matching papers; (ii) screening titles and abstracts to include or exclude papers based on predetermined criteria; (iii) extending the literature list by screening titles and abstracts of all referenced papers from the included papers (called ‘snowballing’); (iv) screening the full text of all still included papers with respect to the eligibility criteria to build the final literature corpus.
Search strategy
Our research question aims at combining the knowledge from the field of general data quality frameworks with insights about the effects that the quality of training data has on ML applications in medicine. This should ultimately lead to a novel framework for data quality in the context of medical training data. Therefore, we built a search string that on the one hand targeted papers about data quality frameworks by combining variations of ‘data quality’ with variations of the terms ‘framework’ and ‘dimensions’. On the other hand, we attempted to collect papers about the connection between the quality of training data and the behaviour of a DL application by again combining variations of the term ‘data quality’, this time with variations of ‘machine learning’, including ‘artificial intelligence’ and ‘deep learning’ (see Search query). We then performed the database search on one general and two thematically suitable online databases: Web of Science, PubMed and ACM Digital Library. We are aware that the choice of databases skews all interpretations to some degree, which is mitigated to some extent by snowballing. All retrieved results were concatenated and duplicates removed, yielding 4633 records.
Search query
The following search string in pseudo-code (visualised in Fig. 6) was executed on the 12th of April 2024 on Web of Science, PubMed and ACM Digital Library:
(("data quality" OR "data-quality" OR "data qualities" OR "quality of data" OR "quality of the data" OR "qualities of data" OR "qualities of the data" OR "quality of training data" OR "quality of the training data" OR "quality of ML data" OR "data bias" OR "data biases" OR "bias in the data" OR "biases in the data" OR "data problem" OR "data problems" OR "problem in the data" OR "problem with the data" OR "problems with the data" OR "data error" OR "data errors" OR "error in the data" ) AND ("dimension" OR "dimensions" OR "AI" OR "artificial intelligence" OR "ML" OR "machine learning" OR "deep learning" OR "neural network" OR "neural networks" ) ) OR ("data quality framework" OR "data quality frameworks" OR "framework of data quality" OR "framework for data quality" )
The chosen databases supported exact (instead of fuzzy) searches, expressed by quotation marks around keywords. The search was applied to the title and abstract fields of all records of the databases.
Eligibility criteria
In Table 8, our chosen eligibility criteria that were applied to the various screening steps are listed. Papers were included if they either provided broad-scale data quality frameworks with general purpose or with specificity to a medical application, or if they discussed or quantified the effects of at least one training data quality dimension on DL behaviour. In contrast, papers were excluded if they (i) either discussed frameworks with specificity to non-medical fields or (ii) only considered single or few data quality dimensions without reference to ML or (iii) focused on the quality of data management and surveys. No limits were imposed with respect to publication date or publisher source (i.e. peer-reviewed or not), while non-English records and inaccessible records were omitted.
We note that in order to be as precise and logical as possible during the practical screening and eligibility checks, we implemented the following eligibility criteria: (I1) Inclusion: No exclusion criteria apply; (I2) Inclusion: Study measures effect of data on DL; (E1) Exclusion: Focus of study is not data quality; (E2) Exclusion: Focus of study is not on general theoretical data quality framework; (E3) Exclusion: Study has high specificity to non-medical field; (E4) Exclusion: Focus of study is quality of data management or surveys. The logic we applied during screening and eligibility checks is: if any exclusion criterion applies, the study is excluded, unless an inclusion criterion applies at the same time.
Literature review process
Titles and abstracts from the records of the database search were screened with respect to the eligibility criteria. This was done by two authors independently to mitigate biases. In case of disagreement, consensus was achieved by discussion. If necessary, a third author was consulted to arrive at the final decision. This step reduced the number of records to 165. The snowballing step expands the scope of the literature corpus to make it more independent of the initially chosen databases and search string which is important to reduce bias. For the process of snowballing, we considered all references from the so far 165 included papers which resulted in adding 775 records to the literature list. Analogously, title and abstract screening was performed on these new entries with the same criteria and workflow as before, leaving 135 additional papers from snowballing. As a final step, all 300 remaining papers were evaluated on the full text with respect to the eligibility criteria. In the end, 120 entries passed all screening steps. For each retrieved record, the decision whether to include or exclude was documented along with the corresponding eligibility criterion. Each record which had passed the screening was eligible for extracting data quality terms.
Data extraction strategy
In order to introduce a comprehensive data quality framework, the 120 selected records were each read by two authors and all terms that were deemed relevant to describe data quality were extracted. See Table 9 for details on extracted vocabulary from each record. We discarded terms if (i) their scope is limited to a specialised data source and not transferable to a general framework, (ii) the term refers to the quality of database infrastructure or (iii) no definition was given and it was impossible to grasp the intended meaning from the context. The accepted terms were copied into an Excel sheet, which served as a starting template for the METRIC-framework. We clustered related concepts into groups according to the terms’ definition or intended meaning. From these small and detailed groups we formed the so-called subdimensions, ensuring that each subdimension is mentioned by at least three references in the literature corpus; otherwise, the level of detail was deemed too great, leading to further grouping.
It seems that with 461 extracted terms, we are beyond the saturation point for finding new data quality dimensions. From a certain point on, additional synonyms do not uncover new concepts. From a bias assessment point of view, it is possible that the literature investigating effects of data quality on ML is skewed towards investigating and reporting dimensions with larger effects. The risk of missing vocabulary due to this is mitigated by the inclusion of broad theoretical frameworks in our literature corpus.
Thorough discussion of all authors about underlying concepts and definitions of the subdimensions resulted in hierarchically grouping these into dimensions and the dimensions into clusters. In parallel to this grouping, all authors reached consensus on definitions for dimensions and subdimensions of the METRIC-framework. The definitions were adopted from a recent data quality glossary197 if they existed there and met our understanding of the vocabulary in the given context of medical training data. If necessary, we included definitions given by Wang et al.57 in a second iteration. If none of these two sources suggested an appropriate definition, we captured the meaning of the desired term on the basis of the literature corpus and thus determined its definition in the context of medical training data (see Tables 1–6).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All data utilised in this study is available. The literature database that serves as a basis for this systematic review is provided in Supplementary Data 1. The extracted data quality vocabulary from the literature database that serves as a basis for the METRIC-framework is provided in Supplementary Data 2.
Code availability
All code utilised for this study is available at https://github.com/danielschw188/ReviewPaper_DataQualityForMLinMedicine.
References
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Deng, L. Artificial intelligence in the rising wave of deep learning: the historical path and future outlook. IEEE Signal Process. Mag. 35, 180–177 (2018).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
OpenAI. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695 (2022).
Chui, M., Yee, L., Hall, B. & Singla, A. The state of AI in 2023: Generative AI’s breakout year. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year (2023).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Teoh, E. R. & Kidd, D. G. Rage against the machine? Google’s self-driving cars versus human drivers. J. Saf. Res. 63, 57–60 (2017).
von Eschenbach, W. J. Transparency and the black box problem: why we do not trust AI. Philos. Technol. 34, 1607–1622 (2021).
UK Government. Chair’s Summary of the AI Safety Summit 2023. https://www.gov.uk/government/publications/ai-safety-summit-2023-chairs-statement-2-november (2023).
Council of the European Union and European Parliament. Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206 (2021).
Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD). https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf (2019).
Muehlematter, U. J., Daniore, P. & Vokinger, K. N. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): a comparative analysis. Lancet Digit. Health 3, e195–e203 (2021).
Zhu, S., Gilbert, M., Chetty, I. & Siddiqui, F. The 2021 landscape of FDA-approved artificial intelligence/machine learning-enabled medical devices: an analysis of the characteristics and intended use. Int. J. Med. Inform. 165, 104828 (2022).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).
Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, 234–241 (Springer International Publishing, Cham, 2015).
Chen, J. et al. TransUNet: Transformers make strong encoders for medical image segmentation. Preprint at https://doi.org/10.48550/arXiv.2102.04306 (2021).
Hatamizadeh, A. et al. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 272–284 (Springer International Publishing, 2022).
Feinstein, A. R. Scientific standards in epidemiologic studies of the menace of daily life. Science 242, 1257–1263 (1988).
WHO Technical Report Series, no. 1033. Annex 4—guideline on data integrity. https://www.gmp-navigator.com/files/guidemgr/trs1033-annex4-guideline-on-data-integrity.pdf (2021).
International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH). Integrated addendum to ICH E6(R1): guideline for good clinical practice. https://www.slideshare.net/ICRInstituteForClini/integrated-addendum-to-ich-e6r1-guideline-for-good-clinical-practice-e6r2 (2016).
Directive 2004/9/EC of the European Parliament and of the Council of 11 February 2004 on the inspection and verification of good laboratory practice (GLP). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A02004L0009-20190726 (2004).
EudraLex - Volume 4 - Good Manufacturing Practice (GMP) guidelines. https://health.ec.europa.eu/medicinal-products/eudralex/eudralex-volume-4_en.
Adadi, A. & Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018).
Liu, H. et al. Trustworthy AI: a computational perspective. ACM Trans. Intell. Syst. Technol. 14, 1–59 (2022).
Li, B. et al. Trustworthy AI: from principles to practices. ACM Comput. Surv. 55, 1–46 (2023).
Kale, A. et al. Provenance documentation to enable explainable and trustworthy AI: a literature review. Data Intell. 5, 139–162 (2023).
Alzubaidi, L. et al. Towards risk-free trustworthy artificial intelligence: significance and requirements. Int. J. Intell. Syst. 2023, 41 (2023).
High-Level Expert Group on Artificial Intelligence. Policy and investment recommendations for trustworthy artificial intelligence. https://digital-strategy.ec.europa.eu/en/library/policy-and-investment-recommendations-trustworthy-artificial-intelligence (2019).
European Commission, Directorate-General for Communications Networks, Content and Technology. The assessment list for trustworthy artificial intelligence (ALTAI). https://digital-strategy.ec.europa.eu/en/library/assessment-list-trustworthy-artificial-intelligence-altai-self-assessment (2020).
Deloitte GmbH Wirtschaftsprüfungsgesellschaft. Trustworthy AI. https://www2.deloitte.com/de/de/pages/innovation/contents/trustworthy-ai.html.
VDE Verband der Elektrotechnik Elektronik Informationstechnik e.V. VCIO-based description of systems for AI trustworthiness characterisation. VDE SPEC 90012 v1.0 (en). https://www.vde.com/resource/blob/2242194/a24b13db01773747e6b7bba4ce20ea60/vcio-based-description-of-systems-for-ai-trustworthiness-characterisationvde-spec-90012-v1-0--en--data.pdf (2022).
Interessengemeinschaft der Benannten Stellen für Medizinprodukte in Deutschland - IG-NB. Questionnaire Artificial Intelligence (AI) in medical devices. https://www.ig-nb.de/?tx_epxelo_file%5Bid%5D=884878&cHash=53e7128f5a6d5760e2e6fe8e3d4bb02a (2022).
Hernandez-Boussard, T., Bozkurt, S., Ioannidis, J. P. A. & Shah, N. H. MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care. J. Am. Med. Inform. Assoc. 27, 2011–2015 (2020).
Arnold, M. et al. Factsheets: increasing trust in AI services through supplier’s declarations of conformity. IBM J. Res. Dev. 63, 6:1–6:13 (2019).
Mitchell, M. et al. Model cards for model reporting. In Proc. Conference on Fairness, Accountability, and Transparency, 220–229 (Association for Computing Machinery, New York, NY, USA, 2019).
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).
The STANDING Together Collaboration. Recommendations for diversity, inclusivity, and generalisability in artificial intelligence health technologies and health datasets. https://doi.org/10.5281/zenodo.10048356 (2023).
Arora, A. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat. Med. 29, 2929–2938 (2023).
Holland, S., Hosny, A., Newman, S., Joseph, J. & Chmielinski, K. The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards, 1–26 (Hart Publishing, Oxford, 2020).
Pushkarna, M., Zaldivar, A. & Kjartansson, O. Data cards: Purposeful and transparent dataset documentation for responsible AI. In Proc. ACM Conference on Fairness, Accountability, and Transparency (ACM, Seoul, South Korea, 2022).
Rostamzadeh, N. et al. Healthsheet: Development of a transparency artifact for health datasets. In Proc. ACM Conference on Fairness, Accountability, and Transparency (ACM, Seoul, South Korea, 2022).
Bender, E. M. & Friedman, B. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Trans. Assoc. Comput. Linguist. 6, 587–604 (2018).
Geiger, R. S. et al. Garbage in, garbage out? Do machine learning application papers in social computing report where human-labeled training data comes from? In Proc. Conference on Fairness, Accountability, and Transparency, 325–336 (2020).
Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proc. Conference on Empirical Methods in Natural Language Processing, 2979–2989 (Association for Computational Linguistics, Copenhagen, Denmark, 2017).
Whittlestone, J., Nyrup, R., Alexandrova, A., Dihal, K. & Cave, S. Ethical and societal implications of algorithms, data, and artificial intelligence: a roadmap for research (The Nuffield Foundation, London, 2019).
Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning fair representations. In Proc. 30th International Conference on Machine Learning, vol. 28, 325–333 (PMLR, Atlanta, Georgia, USA, 2013).
Kim, B., Kim, H., Kim, K., Kim, S. & Kim, J. Learning not to learn: training deep neural networks with biased data. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
Wang, Z. et al. Towards fairness in visual recognition: effective strategies for bias mitigation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8919–8928 (2020).
Suresh, H. & Guttag, J. A framework for understanding sources of harm throughout the machine learning life cycle. In Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’21), October 5–9, 2021, NY, USA. ACM, New York, NY, USA. https://doi.org/10.1145/3465416.3483305 (2021).
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. 54, 1–35 (2021).
Wang, R. Y. & Strong, D. M. Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12, 5–33 (1996).
Khatri, V. & Brown, C. V. Designing data governance. Commun. ACM 53, 148–152 (2010).
Liaw, S.-T., Pearce, C., Liyanage, H., Cheah-Liaw, G. S. & De Lusignan, S. An integrated organisation-wide data quality management and information governance framework: theoretical underpinnings. J. Innov. Health Inform. 21, 199–206 (2014).
Mo, L. & Zheng, H. A method for measuring data quality in data integration. In Proc. International Seminar on Future Information Technology and Management Engineering, 525–527 (2008).
Lindquist, M. Data quality management in pharmacovigilance. Drug Saf. 27, 857–870 (2004).
Souibgui, M., Atigui, F., Zammali, S., Cherfi, S. & Yahia, S. B. Data quality in ETL process: a preliminary study. Proced. Comput. Sci. 159, 676–687 (2019).
Gebhardt, M., Jarke, M., Jeusfeld, M. A., Quix, C. & Sklorz, S. Tools for data warehouse quality. In Proc. Tenth International Conference on Scientific and Statistical Database Management (Cat. No. 98TB100243), 229–232 (1998).
Ballou, D. P. & Tayi, G. K. Enhancing data quality in data warehouse environments. Commun. ACM 42, 73–78 (1999).
Jenkinson, C., Fitzpatrick, R., Norquist, J., Findley, L. & Hughes, K. Cross-cultural evaluation of the Parkinson’s disease questionnaire: tests of data quality, score reliability, response rate, and scaling assumptions in the United States, Canada, Japan, Italy, and Spain. J. Clin. Epidemiol. 56, 843–847 (2003).
Lim, L. L., Seubsman, S.-a & Sleigh, A. Thai SF-36 health survey: tests of data quality, scaling assumptions, reliability and validity in healthy men and women. Health Qual. life outcomes 6, 1–9 (2008).
Candemir, S., Nguyen, X. V., Folio, L. R. & Prevedello, L. M. Training strategies for radiology deep learning models in data-limited scenarios. Radiol. Artif. Intell. 3, e210014 (2021).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6, 1–48 (2019).
Feng, S. Y. et al. A survey of data augmentation approaches for NLP. Preprint at https://doi.org/10.48550/arXiv.2105.03075 (2021)
Larochelle, H., Bengio, Y., Louradour, J. & Lamblin, P. Exploring strategies for training deep neural networks. J. Mach. Learn. Res. 10, 1–40 (2009).
Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. 25th International Conference on Machine Learning, 1096–1103 (2008).
Wang, R. & Tao, D. Non-local auto-encoder with collaborative stabilization for image restoration. IEEE Trans. Image Process. 25, 2117–2129 (2016).
Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Int. J. Surg. 88, 105906 (2021).
Redman, T. C. Data Quality for the Information Age (Artech House, Inc., 1997).
Loshin, D. Dimensions of data quality (2011).
Yoon, V. Y., Aiken, P. & Guimaraes, T. Managing organizational data resources: quality dimensions. Inf. Resour. Manag. J. 13, 5–13 (2000).
Sidi, F. et al. Data quality: A survey of data quality dimensions. In Proc. International Conference on Information Retrieval & Knowledge Management, 300–304 (2012).
Pipino, L. L., Lee, Y. W. & Wang, R. Y. Data quality assessment. Commun. ACM 45, 211–218 (2002).
Sebastian-Coleman, L. Measuring Data Quality for Ongoing Improvement: a Data Quality Assessment Framework (Newnes, 2012).
Stvilia, B., Gasser, L., Twidale, M. B. & Smith, L. C. A framework for information quality assessment. J. Am. Soc. Inf. Sci. Technol. 58, 1720–1733 (2007).
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. & Lee, D. A taxonomy of dirty data. Data Min. Knowl. Discov. 7, 81–99 (2003).
DAMA UK Working Group on Quality Dimensions. The six primary dimensions for data quality assessment. Technical Report, DAMA UK - The premier organisation for data professionals in the UK (DAMA UK, 2013).
International Organization for Standardization and International Electrotechnical Commission. ISO 25012. https://iso25000.com/index.php/en/iso-25000-standards/iso-25012?start=15 (2008).
Corrales, D., Ledezma, A. & Corrales, J. From theory to practice: a data quality framework for classification tasks. Symmetry 10, 248 (2018).
Long, J., Richards, J. & Seko, C. The Canadian Institute for Health Information Data Quality Framework, version 1: a meta-evaluation and future directions. In Proc. Sixth International Conference on Information Quality, 370–383 (2001).
Chan, K. S., Fowles, J. B. & Weiner, J. P. Electronic health records and the reliability and validity of quality measures: a review of the literature. Med. Care Res. Rev. 67, 503–527 (2010).
Weiskopf, N. G. & Weng, C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 20, 144–151 (2013).
Nahm, M. Data quality in clinical research. In: Clinical Research Information 175–201 (Springer, 2012).
Almutiry, O., Wills, G., Alwabel, A., Crowder, R. & Walters, R. Toward a framework for data quality in cloud-based health information system. In Proc. International Conference on Information Society (i-Society 2013), 153–157 (IEEE, 2013).
Chen, H., Hailey, D., Wang, N. & Yu, P. A review of data quality assessment methods for public health information systems. Int. J. Environ. Res. Public Health 11, 5170–5207 (2014).
Bloland, P. & MacNeil, A. Defining & assessing the quality, usability, and utilization of immunization data. BMC Public Health 19, 1–8 (2019).
Vanbrabant, L., Martin, N., Ramaekers, K. & Braekers, K. Quality of input data in emergency department simulations: Framework and assessment techniques. Simul. Model. Pract. Theory 91, 83–101 (2019).
Bian, J. et al. Assessing the practice of data quality evaluation in a national clinical data research network through a systematic scoping review in the era of real-world data. J. Am. Med. Inform. Assoc. 27, 1999–2010 (2020).
Kim, K.-H. et al. Multi-center healthcare data quality measurement model and assessment using OMOP CDM. Appl. Sci. 11, 9188 (2021).
Tahar, K. et al. Rare diseases in hospital information systems—an interoperable methodology for distributed data quality assessments. Methods Inf. Med. 62, 71–89 (2023).
Johnson, S. G., Speedie, S., Simon, G., Kumar, V. & Westra, B. L. A data quality ontology for the secondary use of EHR Data. In AMIA Annu Symposium Proceedings (2015).
Kahn, M. G. et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. eGEMs 4 (2016).
Schmidt, C. O. et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med. Res. Methodol. 21 (2021).
Lewis, A. E. et al. Electronic health record data quality assessment and tools: a systematic review. J. Am. Med. Inform. Assoc. 30, 1730–1740 (2023).
Liu, C., Talaei-Khoei, A., Storey, V. C. & Peng, G. A review of the state of the art of data quality in healthcare. J. Glob. Inf. Manag. 31, 1–18 (2023).
Mashoufi, M., Ayatollahi, H., Khorasani-Zavareh, D. & Talebi Azad Boni, T. Data quality in health care: main concepts and assessment methodologies. Methods Inf. Med. 62, 005–018 (2023).
Syed, R. et al. Digital health data quality issues: systematic review. J. Med. Internet Res. 25, e42615 (2023).
Declerck, J., Kalra, D., Vander Stichele, R. & Coorevits, P. Frameworks, dimensions, definitions of aspects, and assessment methods for the appraisal of quality of health data for secondary use: comprehensive overview of reviews. JMIR Med. Inform. 12, e51560 (2024).
Alipour, J. Dimensions and assessment methods of data quality in health information systems. Acta Med. Mediter. 313–320 (2017).
European Medicines Agency. Data quality framework for EU medicines regulation. https://www.ema.europa.eu/system/files/documents/regulatory-procedural-guideline/data-quality-framework-eu-medicines-regulation_en_1.pdf (2022).
Batini, C., Rula, A., Scannapieco, M. & Viscusi, G. From data quality to big data quality. J. Database Manag. 26, 60–82 (2015).
Eder, J. & Shekhovtsov, V. A. Data quality for medical data lakelands (2020).
Cai, L. & Zhu, Y. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14, 2 (2015).
Gao, J., Xie, C. & Tao, C. Big data validation and quality assurance—issues, challenges, and needs. In Proc. IEEE Symposium on Service-Oriented System Engineering (SOSE) Oxford, UK, 2016, pp. 433–441 (2016).
Ramasamy, A. & Chowdhury, S. Big data quality dimensions: a systematic literature review. J. Inf. Syst. Technol. Manag. https://doi.org/10.4301/S1807-177520201700317 (2020).
Gudivada, V., Apon, A. & Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 10, 1–20 (2017).
Juddoo, S., George, C., Duquenoy, P. & Windridge, D. Data governance in the health industry: investigating data quality dimensions within a big data context. Appl. Syst. Innov. 1, 43 (2018).
Ijab, M. T., Mat Surin, E. S. & Mat Nayan, N. Conceptualizing big data quality framework from a systematic literature review perspective. Malays. J. Comput. Sci. 25–37 (2019).
Cao, W., Hu, L., Gao, J., Wang, X. & Ming, Z. A study on the relationship between the rank of input data and the performance of random weight neural network. Neural Comput. Appl. 32, 12685–12696 (2020).
Johnson, J. M. & Khoshgoftaar, T. M. The effects of data sampling with deep learning and highly imbalanced big data. Inf. Syst. Front. 22, 1113–1131 (2020).
Sahu, A., Mao, Z., Davis, K. & Goulart, A. E. Data processing and model selection for machine learning-based network intrusion detection. In Proc. IEEE International Workshop Technical Committee on Communications Quality and Reliability (CQR) (2020).
Qi, Z.-X., Wang, H.-Z. & Wang, A.-J. Impacts of dirty data on classification and clustering models: an experimental evaluation. J. Comput Sci. Technol. 36, 806–821 (2021).
Hu, J. & Wang, J. Influence of data quality on the performance of supervised classification models for predicting gravelly soil liquefaction. Eng. Geol. 324, 107254 (2023).
Jouseau, R., Salva, S. & Samir, C. On studying the effect of data quality on classification performances. Intelligent Data Engineering and Automated Learning – IDEAL. 82–93 (Springer Cham, 2022).
Tran, N., Chen, H., Bhuyan, J. & Ding, J. Data curation and quality evaluation for machine learning-based cyber intrusion detection. IEEE Access 10, 121900–121923 (2022).
Sha, L., Gašević, D. & Chen, G. Lessons from debiasing data for fair and accurate predictive modeling in education. Expert Syst. Appl. 228, 120323 (2023).
Lake, S. & Tsai, C.-W. An exploration of how training set composition bias in machine learning affects identifying rare objects. Astron. Comput. 40, 100617 (2022).
Bailly, A. et al. Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Comput. Methods Prog. Biomed. 213, 106504 (2022).
Althnian, A. et al. Impact of dataset size on classification performance: an empirical evaluation in the medical domain. Appl. Sci. 11, 796 (2021).
Michel, E., Zernikow, B. & Wichert, S. A. Use of an artificial neural network (ANN) for classifying nursing care needed, using incomplete input data. Med. Inform. Internet Med. 25, 147–158 (2000).
Barakat, M. S. et al. The effect of imputing missing clinical attribute values on training lung cancer survival prediction model performance. Health Inf. Sci. Syst. 5, 16 (2017).
Radliński, Ł. The impact of data quality on software testing effort prediction. Electronics 12, 1656 (2023).
Ghotra, B., McIntosh, S. & Hassan, A. E. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proc. IEEE/ACM 37th IEEE International Conference on Software Engineering (2015).
Zhou, Y. & Wu, Y. Analyses on Influence Of Training Data Set To Neural Network Supervised Learning Performance, 19–25 (Springer, Berlin Heidelberg, 2011).
Bansal, A., Kauffman, R. J. & Weitz, R. R. Comparing the modeling performance of regression and neural networks as data quality varies: A business value approach. J. Manag. Inf. Syst. 10, 11–32 (1993).
Twala, B. Impact of noise on credit risk prediction: does data quality really matter? Intell. Data Anal. 17, 1115–1134 (2013).
Deshsorn, K., Lawtrakul, L. & Iamprasertkun, P. How false data affects machine learning models in electrochemistry? J. Power Sources 597, 234127 (2024).
Blake, R. & Mangiameli, P. The effects and interactions of data quality and problem complexity on classification. J. Data Inf. Qual. 2, 1–28 (2011).
Benedick, P.-L., Robert, J. & Traon, Y. L. A systematic approach for evaluating artificial intelligence models in industrial settings. Sensors 21, 6195 (2021).
Che, Z., Purushotham, S., Cho, K., Sontag, D. & Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8, 6085 (2018).
Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L. & Muller, P.-A. Adversarial attacks on deep neural networks for time series classification. In Proc. International Joint Conference on Neural Networks (IJCNN) (IEEE, Budapest, Hungary, 2019).
Habib, A., Karmakar, C. & Yearwood, J. Impact of ECG dataset diversity on generalization of CNN model for detecting QRS complex. IEEE Access 7, 93275–93285 (2019).
Ito, A., Saito, K., Ueno, R. & Homma, N. Imbalanced data problems in deep learning-based side-channel attacks: analysis and solution. IEEE Trans. Inf. Forensics Secur. 16, 3790–3802 (2021).
Zhang, H., Singh, H., Ghassemi, M. & Joshi, S. ‘Why did the model fail?’ Attributing model performance changes to distribution shifts. In Proc. 40th International Conference on Machine Learning, Vol. 202, 41550–41578 (2023).
Masko, D. & Hensman, P. The impact of imbalanced training data for convolutional neural networks. https://www.kth.se/social/files/588617ebf2765401cfcc478c/PHensmanDMasko_dkand15.pdf (2015).
Buda, M., Maki, A. & Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018).
Johnson, J. M. & Khoshgoftaar, T. M. Survey on deep learning with class imbalance. J. Big Data 6, 1–54 (2019).
Bai, M. et al. The uncovered biases and errors in clinical determination of bone age by using deep learning models. Eur. Radiol. 33, 3544–3556 (2022).
Pan, Y., Xie, F. & Zhao, H. Understanding the challenges when 3D semantic segmentation faces class imbalanced and OOD data. IEEE Trans. Intell. Transp. Syst. 24, 6955–6970 (2023).
Ovadia, Y. et al. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Adv. Neural Inf. Process. Syst. 32 (2019).
Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proc. IEEE International Conference on Computer Vision, 843–852 (2017).
Nuha, F. U. Training dataset reduction on generative adversarial network. Proced. Comput. Sci. 144, 133–139 (2018).
Hong, S. & Shen, J. Impact of training size on deep learning performance in in vivo 1H MRS. In Proc. ISMRM & SMRT Annual Meeting & Exhibition (2021).
Li, Y. & Chao, X. Toward sustainability: trade-off between data quality and quantity in crop pest recognition. Front. Plant Sci. 12, 811241 (2021).
Li, Y., Yang, J. & Wen, J. Entropy-based redundancy analysis and information screening. Digit. Commun. Netw. 9, 1061–1069 (2021).
Fan, F. J. & Shi, Y. Effects of data quality and quantity on deep learning for protein-ligand binding affinity prediction. Bioorg. Med. Chem. 72, 117003 (2022).
Ranjan, R., Sharrer, K., Tsukuda, S. & Good, C. Effects of image data quality on a convolutional neural network trained in-tank fish detection model for recirculating aquaculture systems. Comput. Electron. Agric. 205, 107644 (2023).
Vilaça, L., Viana, P., Carvalho, P. & Andrade, M. T. Improving efficiency in facial recognition tasks through a dataset optimization approach. IEEE Access 12, 32532–32544 (2024).
Barragán-Montero, A. M. et al. Deep learning dose prediction for IMRT of esophageal cancer: the effect of data quality and quantity on model performance. Phys. Med. 83, 52–63 (2021).
Motamedi, M., Sakharnykh, N. & Kaldewey, T. A data-centric approach for training deep neural networks with less data. Preprint at https://doi.org/10.48550/arXiv.2110.03613 (2021).
Xu, G., Yue, Q., Liu, X. & Chen, H. Investigation on the effect of data quality and quantity of concrete cracks on the performance of deep learning-based image segmentation. Expert Syst. Appl. 237, 121686 (2024).
Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L. & Fergus, R. Training convolutional networks with noisy labels. Preprint at https://doi.org/10.48550/arXiv.1406.2080 (2014).
Wesemeyer, T., Jauer, M.-L. & Deserno, T. M. Annotation quality vs. quantity for deep-learned medical image segmentation. Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications (2021).
He, T., Yu, S., Wang, Z., Li, J. & Chen, Z. From data quality to model quality: An exploratory study on deep learning. In Proc. 11th Asia-Pacific Symposium on Internetware, 1–6 (2019).
Dodge, S. & Karam, L. Understanding how image quality affects deep neural networks. In Proc. Eighth International Conference on Quality of Multimedia Experience (QoMEX), 1–6 (2016).
Karahan, S. et al. How image degradations affect deep CNN-based face recognition? In Proc. International Conference of the Biometrics Special Interest Group, 1–5 (2016).
Pei, Y., Huang, Y., Zou, Q., Zhang, X. & Wang, S. Effects of image degradation and degradation removal to cnn-based image classification. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1239–1253 (2019).
Schnabel, L., Matzka, S., Stellmacher, M., Patzold, M. & Matthes, E. Impact of anonymization on vehicle detector performance. In Proc. Second International Conference on Artificial Intelligence for Industries (AI4I) (2019).
Zhong, X. et al. A study of real-world micrograph data quality and machine learning model robustness. npj Comput. Mater. 7, 161 (2021).
Hukkelås, H. & Lindseth, F. Does image anonymization impact computer vision training? In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 140–150 (2023).
Jaspers, T. J. M. et al. Investigating the Impact of Image Quality on Endoscopic AI Model Performance, 32–41 (Springer, Cham, 2023).
Lee, J. H. & You, S. J. Balancing privacy and accuracy: Exploring the impact of data anonymization on deep learning models in computer vision. IEEE Access 12, 8346–8358 (2024).
Güneş, A. M. et al. Impact of imperfection in medical imaging data on deep learning-based segmentation performance: an experimental study using synthesized data. Med. Phys. 50, 6421–6432 (2023).
Rolnick, D., Veit, A., Belongie, S. & Shavit, N. Deep learning is robust to massive label noise. Preprint at https://doi.org/10.48550/arXiv.1705.10694 (2017).
Wang, F. et al. The devil of face recognition is in the noise. In Proc. European Conference on Computer Vision (ECCV), 765–780 (2018).
Peterson, J. C., Battleday, R. M., Griffiths, T. L. & Russakovsky, O. Human uncertainty makes classification more robust. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 9616–9625 (IEEE Computer Society, Los Alamitos, CA, USA, 2019).
Karimi, D., Dou, H., Warfield, S. K. & Gholipour, A. Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020).
Taran, V., Gordienko, Y., Rokovyi, A., Alienin, O. & Stirenko, S. Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions. Advances in Computer Science for Engineering and Education II, 183–193 (Springer, Cham, 2020).
Volkmann, N. et al. Learn to train: improving training data for a neural network to detect pecking injuries in turkeys. Animals 11, 2655 (2021).
Wei, J. et al. Learning with noisy labels revisited: a study using real-world human annotations. Preprint at https://doi.org/10.48550/arXiv.2110.12088 (2021).
Ma, J., Ushiku, Y. & Sagara, M. The effect of improving annotation quality on object detection datasets: a preliminary study. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4850–4859 (2022).
Schmarje, L. et al. Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation (2022).
Agnew, C. et al. Quantifying the effects of ground truth annotation quality on object detection and instance segmentation performance. IEEE Access 11, 25174–25188 (2023).
Costa, D., Silva, C., Costa, J. & Ribeiro, B. Enhancing pest detection models through improved annotations. In Proc. EPIA Conference on Artificial Intelligence, 364–375 (Springer, Cham, 2023).
Cui, J. et al. Impact of annotation quality on model performance of welding defect detection using deep learning. Weld. World 68, 855–865 (2024).
Wang, S., Gao, J., Li, B. & Hu, W. Narrowing the gap: Improved detector training with noisy location annotations. IEEE Trans. Image Process. 31, 6369–6380 (2022).
Whang, S. E., Roh, Y., Song, H. & Lee, J.-G. Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J. 32, 791–813 (2023).
Xu, S. et al. Data quality matters: A case study of obsolete comment detection (2023).
Li, Y., Zhao, C. & Caragea, C. Improving stance detection with multi-dataset learning and knowledge distillation. In Proc. Conference on Empirical Methods in Natural Language Processing, 6332–6345 (2021).
Shimizu, A. & Wakabayashi, K. Examining effect of label redundancy for machine learning using crowdsourcing. J. Data Intell. 3, 301–315 (2022).
Zengin, M. S., Yenisey, B. U. & Kutlu, M. Exploring the impact of training datasets on Turkish stance detection. Turk. J. Electr. Eng. Comput. Sci. 31, 1206–1222 (2023).
Derry, A., Carpenter, K. A. & Altman, R. B. Training data composition affects performance of protein structure analysis algorithms. Pac. Symp. Biocomput. 27, 10–21 (2022).
Nikolados, E.-M., Wongprommoon, A., Aodha, O. M., Cambray, G. & Oyarzún, D. A. Accuracy and data efficiency in deep learning models of protein expression. Nat. Commun. 13 (2022).
Wang, L. & Jackson, D. A. Effects of sample size, data quality, and species response in environmental space on modeling species distributions. Landsc. Ecol. 38, 4009–4031 (2023).
Snodgrass, S., Summerville, A. & Ontañón, S. Studying the effects of training data on machine learning-based procedural content generation. Vol. 13 of Proc. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 122–128 (2017).
Eid, F.-E. et al. Systematic auditing is essential to debiasing machine learning in biology. Commun. Biol. 4, 183 (2021).
Guo, L. L. et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci. Rep. 12, 2726 (2022).
Xu, H., Horn Nord, J., Brown, N. & Daryl Nord, G. Data quality issues in implementing an ERP. Ind. Manag. Data Syst. 102, 47–58 (2002).
Verma, R. M., Zeng, V. & Faridi, H. Data quality for security challenges: case studies of phishing, malware and intrusion detection datasets. In Proc. ACM SIGSAC Conference on Computer and Communications Security, 2605–2607 (2019).
Laney, D. 3D data management: controlling data volume, velocity and variety. https://www.scirp.org/reference/ReferencesPapers?ReferenceID=1611280 (2001).
Wook, M. et al. Exploring big data traits and data quality dimensions for big data analytics application using partial least squares structural equation modelling. J. Big Data 8, 1–15 (2021).
Black, A. & van Nederpelt, P. Dimensions of data quality (DDQ). https://www.dama-nl.org/wp-content/uploads/2020/09/DDQ-Dimensions-of-Data-Quality-Research-Paper-version-1.2-d.d.-3-Sept-2020.pdf (2020).
IEEE standard glossary of software engineering terminology. IEEE Std 610.12-1990 610, 1–84 (1990).
Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Comput. 7, 108–116 (1995).
Grandvalet, Y., Canu, S. & Boucheron, S. Noise injection: theoretical prospects. Neural Comput. 9, 1093–1108 (1997).
Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. Smoothgrad: removing noise by adding noise. Preprint at https://doi.org/10.48550/arXiv.1706.03825 (2017).
Thaler, R. H. & Sunstein, C. R. Nudge: Improving Decisions About Health, Wealth, and Happiness (Yale University Press, 2009).
Kahneman, D. Thinking, Fast and Slow (Farrar, Straus and Giroux, New York, 2011).
Malossini, A., Blanzieri, E. & Ng, R. T. Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics 22, 2114 (2006).
Frénay, B. & Verleysen, M. Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25, 845–869 (2013).
Menze, B. H. et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34, 1993–2024 (2014).
Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).
Krizhevsky, A. Learning multiple layers of features from tiny images. https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf (2009).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://doi.org/10.48550/arXiv.1708.07747 (2017).
Müller, N. M. & Markert, K. Identifying mislabeled instances in classification datasets. In Proc. International Joint Conference on Neural Networks (IJCNN), 1–8 (2019).
Northcutt, C., Jiang, L. & Chuang, I. Confident learning: estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021).
Kahneman, D., Sibony, O. & Sunstein, C. R. Noise: A flaw in Human Judgment (Hachette UK, New York, 2021).
Jaramillo, D. Radiologists and their noise: variability in human judgment, fallibility, and strategies to improve accuracy. Radiology 302, 511–512 (2022).
Radiological Society of North America. https://www.rsna.org.
National Cancer Institute, US. QIN - Quantitative Imaging Network. https://imaging.cancer.gov/programs_resources/specialized_initiatives/qin/about/default.htm.
European Society of Radiology. EIBALL - European Imaging Biomarkers Alliance. https://www.myesr.org/research/eiball/.
Anderson, R. N., Miniño, A. M., Hoyert, D. L. & Rosenberg, H. M. Comparability of cause of death between ICD-9 and ICD-10: preliminary estimates. vol. 49 of National Vital Statistics Reports (2001).
Sebastião, Y. V., Metzger, G. A., Chisolm, D. J., Xiang, H. & Cooper, J. N. Impact of ICD-9-cm to ICD-10-cm coding transition on trauma hospitalization trends among young adults in 12 states. Injury Epidemiol. 8, 4 (2021).
Remedios, S. W. et al. Distributed deep learning across multisite datasets for generalized CT hemorrhage segmentation. Med. Phys. 47, 89–98 (2020).
Onofrey, J. A. et al. Generalizable multi-site training and testing of deep neural networks using image normalization. In Proc. IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 348–351 (2019).
Pooch, E. H., Ballester, P. & Barros, R. C. Can we trust deep learning-based diagnosis? The impact of domain shift in chest radiograph classification. In Proc. Thoracic Image Analysis: Second International Workshop, TIA 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 8, 2020, 74–83 (2020).
Glocker, B., Robinson, R., Castro, D. C., Dou, Q. & Konukoglu, E. Machine learning with multi-site imaging data: an empirical study on the impact of scanner effects. Preprint at https://doi.org/10.48550/arXiv.1910.04597 (2019).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
He, H., Bai, Y., Garcia, E. A. & Li, S. Adasyn: adaptive synthetic sampling approach for imbalanced learning. In Proc. IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (IEEE, Hong Kong, 2008).
Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 1–9 (2016).
Rubin, D. B. Inference and missing data. Biometrika 63, 581–592 (1976).
Schafer, J. L. & Graham, J. W. Missing data: our view of the state of the art. Psychol. Methods 7, 147 (2002).
Mazumder, M. et al. Dataperf: benchmarks for data-centric AI development. Adv. Neural Inf. Process. Syst. 36 (2024).
Zha, D. et al. Data-centric artificial intelligence: a survey. Preprint at https://doi.org/10.48550/arXiv.2303.10158 (2023).
Acknowledgements
The authors acknowledge funding by the EU project TEF-Health. The project TEF-Health has received funding from the European Union’s Digital Europe programme under grant agreement no. 101100700. We would like to thank Stefan Haufe for valuable input on the manuscript. We further thank the project partners of the TEF-Health project for feedback on our study.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
D.S. and T.S. designed and supervised the study. D.S., K.B., M.S., and A.K. carried out the theoretical methods and analysed the data. M.S. and A.K. extracted the data. D.S., K.B., M.S., A.K., and T.S. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schwabe, D., Becker, K., Seyferth, M. et al. The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review. npj Digit. Med. 7, 203 (2024). https://doi.org/10.1038/s41746-024-01196-4