Introduction

The application of artificial intelligence (AI) techniques to digital tissue images has shown great promise for improving pathological diagnosis1,2,3. Such techniques can not only automate time-consuming diagnostic tasks and make analyses more sensitive and reproducible, but also extract new digital biomarkers from tissue morphology for precision medicine4.

Pathology involves a large number of diagnostic tasks, each a potential application for AI. Many of these tasks involve characterizing tissue morphology. Corresponding tissue classification approaches have been developed for identifying tumors in a variety of tissues, including lung5,6, colon7, breast8,9, and prostate9, but also for non-tumor pathology, e.g., kidney transplants10. Further applications include predicting outcomes11,12 or gene mutations5,13,14 directly from tissue images. Similar approaches are also employed to detect and classify cell nuclei, e.g., to quantify the positivity of immunohistochemistry markers like Ki67, ER/PR, Her2, and PD-L115,16.

Testing AI solutions is an important step to ensure that they work reliably and robustly on routine laboratory cases. AI algorithms run the risk of exploiting feature associations that are specific to their training data17. Such “overfitted” models tend to perform poorly on previously unseen data. To obtain a realistic estimate of the prediction performance on real-world data, it is common practice to apply AI solutions to a test dataset. The results are then compared with reference results in terms of task-specific performance metrics, e.g., sensitivity, specificity, or area under the receiver operating characteristic curve (ROC-AUC).

Test datasets may only be used once to evaluate the performance of a finalized AI solution17. They may not be considered during development. This can be considered a consequence of Goodhart’s law stating that measures cease to be meaningful when used as targets18: If AI solutions are optimized for test datasets, they cannot provide realistic performance estimates for real-world data. Test datasets are also referred to as “hold-out datasets” or “(external) validation datasets.” The term “validation,” however, is not used consistently in the machine learning community and can also refer to model selection during development17.

Besides overfitting, AI methods are prone to “shortcut learning”19. Many datasets used in the development of AI methods contain confounding variables (e.g., slide origin, scanner type, patient age) that are spuriously correlated with the target variable (e.g., tumor type)20. AI methods often exploit features that are discriminative for such confounding variables and not for the target variable21. Despite working well for smaller datasets containing similar correlations, such methods fail in more challenging real-world scenarios in ways humans never would22. To minimize the likelihood of spurious correlations between confounding variables and the target variable, test datasets must be large and diversified20. At the same time, test datasets must be small enough to be acquired with realistic effort and cost. Finding a good balance between these requirements is a major challenge for AI developers.

Comparatively little attention has been paid to compiling test datasets for AI solutions in pathology, whereas datasets for training have been considered frequently9,23,24,25,26,27,28. Training data are collected with a different goal than test data: while training datasets should produce the best possible AI models, test datasets should provide the most realistic performance assessment for routine use, which presents unique challenges.

Some publications address individual problems in compiling test datasets in pathology, e.g., how to avoid bias in the performance evaluation caused by site-specific image features in test datasets29. Other publications provide general recommendations for evaluating AI methods for medical applications without considering the specific challenges of pathology30,31,32,33,34.

Appropriate test datasets are critical to demonstrate the utility of AI solutions as well as to obtain regulatory approval. However, the lack of guidance on how to compile test datasets is a major barrier to the adoption of AI solutions in laboratory practice.

This article gives recommendations for test datasets in pathology. It summarizes the results of extensive literature reviews and discussions by a committee of various stakeholders, including commercial AI developers, pathologists, and researchers. This committee was established as part of the EMPAIA project (Ecosystem for Pathology Diagnostics with AI Assistance), aiming to facilitate the adoption of AI in pathology35.

Results

The next sections discuss and provide recommendations on various aspects that must be considered when creating test datasets. For meaningful performance estimates, test datasets must be both diverse enough to cover the variability of data in routine diagnostics and large enough to allow statistically meaningful analyses. Relevant subsets must be covered, and test datasets should be unbiased. Moreover, test datasets must be sufficiently independent of datasets used in the development of AI solutions. Comprehensive information about test datasets must be reported and regulatory requirements must be met when evaluating the clinical applicability of AI solutions.

Target population of images

Compiling a test dataset requires a detailed description of the intended use of the AI solution to be tested. The intended use must clearly indicate for which diagnostic task(s) the solution may be used, and whether the use is limited to images with certain characteristics. All images an AI solution may encounter in its intended use constitute its “target population of images.” A test dataset must be an adequate sample of this target population (see Fig. 1) to provide a reasonable estimate of the prediction performance of the AI solution. For all applications in pathology, the target population is distributed across multiple dimensions of variability (see Table 1).

Fig. 1: Schematic overview of sampling regimes for performance assessment in the entire target population of images or in specific subsets.

Overall performance assessment requires a representative sample along all dimensions of variability, whereas relevant subsets are typically limited along one dimension (e.g., age range or scanner type).

Table 1 Examples of data variabilities within the intended use20,26,36,38,39,40,61,136,137,138.

Biological variability: The visual appearance of tissue varies between normal and diseased states. This is what AI solutions are designed to detect and characterize. But even tissue of the same category can look very different (see Fig. 2). The appearance is influenced by many factors (e.g., genetic, transcriptional, epigenetic, proteomic, and metabolomic) that differ between patients as well as between demographic and ethnic groups36. These factors often vary spatially (e.g., different parts of organs are differently affected) and temporally (e.g., the pathological alterations differ based on disease stage) within a single patient37.

Fig. 2: Examples of variability between biopsy images, illustrating a combination of inter- and intra-individual biological variability (tissue structure) and inter-individual technical variability (staining).

The images show H&E-stained breast tissue of female patients with invasive carcinomas of no special type, scanned at 40× objective magnification.

Technical variability: Processing and digitization of tissue sections consist of several steps (e.g., tissue fixation, processing, cutting, staining, and digitization), all of which can contribute to image variability38. Differences in section thickness and staining solutions can lead to variable staining appearances39. Artifacts frequently occur during tissue processing, including elastic deformations, inclusion of foreign objects, and cover glass scratches40. Differences in illumination, resolution, and encoding algorithms between slide scanner models also affect the appearance of tissue images38.

Observer variability: Images in test datasets are commonly associated with a reference label like a disease category or score determined by a human observer. It is well known that the assessment of tissue images is subject to intra- and inter-observer variability41,42,43,44,45,46,47. This variability results from subjective biases (e.g., caused by training, specialization, and experience) but also from inherent ambiguities in the images48,49.

Routine laboratory work occasionally produces images that are unsuitable for the intended use of an AI solution, e.g., because they are ambiguous or of insufficient quality. Most AI solutions require prior quality assurance steps to ensure that solutions are only applied to suitable images50,51. The boundary between suitable and unsuitable images is usually fuzzy and there are difficult images that cannot be clearly assigned to either category (see Fig. 3).

Fig. 3: Examples of different severity levels of imaging artifacts.

The leftmost images are clearly within the intended use of algorithms for analyzing breast cancer histologies, whereas the rightmost images are clearly unsuitable. However, it is not obvious where to draw the line between these two regimes. The top row shows simulated foreign objects and the bottom row simulated focal blur; the underlying tissue images show H&E-stained breast tissue of female patients with invasive carcinomas of no special type, scanned at 40× objective magnification (same as in Fig. 2).

Defining the target population is challenging and presumes a clear definition of the intended use by the AI developer. The target population of images must be defined before test data are collected. It must be clearly stated which subsets of images fall under the intended use. Such subsets may consist of specific disease variants, demographic characteristics, ethnicities, staining characteristics, artifacts, or scanner types. These subsets typically overlap, e.g., the subset of images of one scanner type contains images from different patient age groups. A particular challenge is to define where the target population ends. Examples of images within and outside the intended use can help human observers sort out unsuitable images as objectively as possible.

Data collection

Test datasets must be representative of the entire target population of images, i.e., sufficiently diverse and unbiased. To minimize spurious correlations between confounding variables and the target variable and to uncover shortcut learning in AI methods, all dimensions of biological and technical variability must be adequately covered for the classes considered20,28, also reflecting the variability of negative cases without visible pathology28,52.

All images encountered in the normal laboratory workflow must be considered. One way to achieve this is to collect all cases that occurred over a given time period52 long enough for a sufficient number of cases to be collected (e.g., one year9). These cases should be collected from multiple international laboratories, since they differ in their spectra of patients and diseases, technical equipment and operating procedures. Data should be collected at the point in the workflow where the AI solution would be applied, taking into account possible prior quality assurance steps in the workflow.

All data in a test dataset must be collected according to a consistent acquisition protocol (see “Reporting”). The best way to ensure this is to prospectively collect test data according to this protocol. Retrospective datasets were typically compiled for a different purpose and are thus likely to be subject to selection bias, which is difficult to adjust for53. If retrospective data are used in a test dataset, a comprehensive description of the acquisition protocol must be available so that potential issues can be identified54.

Annotation

Test datasets for AI solutions contain not only images, but also annotations representing the expected analysis result, e.g., slide-level labels or delineations of tissue regions. In most cases, such reference annotations must be prepared by human observers with sufficient experience in the diagnostic use case. Since humans are prone to intra- and inter-observer variability, annotations in test datasets should be created by multiple observers from different hospitals or laboratories. For unequivocal results, it can be helpful to organize consensus conferences and to use standardized electronic reporting formats41. Any remaining disagreement should be documented with justification (e.g., suboptimal sample quality) and considered when evaluating AI solutions. Semi-automatic annotation methods can help reduce the effort required for manual annotation55,56. However, they can introduce biases themselves and should therefore be monitored by human observers.
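
As a minimal illustration of handling inter-observer variability, the following Python sketch computes Cohen’s kappa between two observers and lists the slides requiring consensus review; the label lists are hypothetical placeholders and the sketch is not part of any cited workflow.

```python
# Minimal sketch: quantify inter-observer agreement on slide-level reference
# labels and list the slides that need consensus review (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

observer_a = ["tumor", "benign", "tumor", "tumor", "benign"]
observer_b = ["tumor", "benign", "benign", "tumor", "benign"]

print(f"Cohen's kappa: {cohen_kappa_score(observer_a, observer_b):.2f}")

# Document the adjudicated label and the reason for each initial disagreement.
disagreements = [i for i, (a, b) in enumerate(zip(observer_a, observer_b)) if a != b]
print("Slides requiring consensus review:", disagreements)
```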

Curation

Unsuitable data that do not fit the intended use of an AI solution should not be included in a test dataset. Such data must usually be detected by human observers, e.g., in a dedicated data curation step or during the generation of reference annotations. To avoid selection bias, it is important not to exclude artifacts or atypical images that are part of the intended use of the product9,52,57.

There are automated tools to support the detection of unsuitable data58. Some approaches detect unsuitable images based on basic image features such as brightness, predominant colors, and sharpness59,60 or by detecting typical artifacts like tissue folds and air bubbles61,62. Other methods analyze domain shifts63,64,65 or use dedicated neural networks trained for outlier detection66. There are also approaches for detecting outliers depending on the tested AI solution63,67,68,69,70. Although these approaches can help exclude unsuitable images from test datasets, they do not yet appear to be mature enough to be used entirely without human supervision.
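
The following sketch illustrates the kind of basic image features mentioned above (brightness and a Laplacian-based sharpness score) for flagging potentially unsuitable tiles; the thresholds are arbitrary placeholders, and flagged images should be reviewed by a human observer rather than excluded automatically.

```python
# Minimal sketch: flag tiles that may be unsuitable based on simple image
# statistics (mean brightness and a Laplacian-based sharpness score).
# The thresholds are arbitrary placeholders.
import numpy as np
from PIL import Image
from scipy.ndimage import laplace

def quality_flags(path, brightness_range=(40, 230), min_sharpness=10.0):
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    flags = []
    if not brightness_range[0] <= gray.mean() <= brightness_range[1]:
        flags.append("brightness")
    if laplace(gray).var() < min_sharpness:  # low variance suggests out-of-focus images
        flags.append("blur")
    return flags
```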

Synthetic data

There are a variety of techniques for extending datasets with synthetic data. Some techniques alter existing images in a generic (e.g., rotation, mirroring) or histology-specific way (e.g., stain transformations26 or emulation of image artifacts40,71,72,73,74,75,76). Other techniques create fully synthetic images from scratch77,78,79,80,81. These techniques are useful for data augmentation1,2,82, i.e., enriching development data in order to avoid overfitting and increase robustness. However, they cannot replace original real-world data for test datasets. Because all of these techniques are based on simplified models of real-world variability, they are likely to introduce biases into a test dataset and make meaningful performance measurement impossible.
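
For illustration, the sketch below shows generic transformations (rotation, mirroring) of the kind used to augment development data; in line with the recommendation above, such synthetic variants should not be used in place of real images in test datasets.

```python
# Minimal sketch of generic transformations (rotation, mirroring) used for
# data augmentation during development; such synthetic variants should not
# replace real images in test datasets.
from PIL import Image

def generic_augmentations(path):
    img = Image.open(path)
    return [
        img.rotate(90, expand=True),
        img.rotate(180),
        img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),  # horizontal mirror
    ]
```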

Sample size

Any test dataset is a sample from the target population of images, thus any performance metric computed on a test dataset is subject to sampling error. In order to draw reliable conclusions from evaluation results, the sampling error must be sufficiently small. Larger samples generally result in lower sampling error, but are also more expensive to produce. Therefore, the minimum sample size required to achieve a maximum allowable sampling error should be determined prior to data collection.

Many different methods have been proposed for sample size determination. Most of them are based on statistical significance tests, which are used to test a prespecified hypothesis about a population parameter (e.g., sensitivity, specificity, ROC-AUC) on the basis of an observed data sample83,84,85. Such sample size determination methods are commonly used in clinical trial planning and are available in many statistical software packages70.

When evaluating AI solutions in pathology, the goal is more often to estimate a performance metric with a sufficient degree of precision than to test a previously defined hypothesis. Confidence intervals (CIs) are a natural way to express the precision of an estimated metric and should be reported instead of or in addition to test results86. A CI is an interval around the sample statistic that is likely to cover the true population value at some confidence level, usually 95%87. The sample statistic can either be the performance metric itself or a difference between the performance metrics of two methods, e.g., when comparing performance to an established solution.

When using CIs, the sample size calculation can be based on the targeted width of the CI: the narrower the interval, the more precise the performance estimate86. Several approaches have been proposed for this purpose88,89,90,91,92. To determine a minimum sample size, assumptions must be made regarding the sample statistic, its variability, and usually also its distributional form. The open-source software “presize” implements several of these methods and provides a simple web-based user interface to perform CI-based sample size calculations for common performance metrics93.
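
As a worked illustration of a CI-width-based calculation, the sketch below uses the standard Wald-type approximation for a proportion (e.g., sensitivity). The assumed sensitivity of 0.90 and target half-width of 0.05 are purely illustrative; exact or score-based methods, such as those implemented in “presize”, are preferable in practice, especially for proportions close to 0 or 1.

```python
# Minimal sketch: Wald-type approximation of the number of positive cases needed
# to estimate sensitivity with a given 95% CI half-width. The expected value of
# 0.90 and the half-width of 0.05 are illustrative assumptions only.
import math

def n_for_proportion(p_expected, half_width, confidence=0.95):
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    return math.ceil(z**2 * p_expected * (1 - p_expected) / half_width**2)

print(n_for_proportion(0.90, 0.05))  # about 139 positive cases
```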

Subsets

AI solutions that are very accurate on average often perform much worse on certain subsets of their target population of images94, a phenomenon known as “hidden stratification.” Such differences in performance can exceed 20%22. Hidden stratification occurs particularly in low-prevalence subsets, but may also occur in subsets with poor label quality or subtle distinguishing characteristics22. There are substantial differences in cancer incidence, e.g., by gender, socioeconomic status, and geographic region95. Hence, hidden stratification may result in disproportionate harm to patients in less common demographic groups and jeopardize the clinical applicability of AI solutions22. Common performance measures computed on the entire test dataset can be dominated by larger subsets and do not indicate whether there are subsets for which an AI solution underperforms96.

To detect hidden stratification, AI solutions must be evaluated independently on relevant subsets of the target population of images (e.g., certain medical characteristics, patient demographics, ethnicities, scanning equipment)22,94 (see Fig. 1). This means in particular that the metadata for identifying the subsets must be available30. Performance evaluation on subsets is an important requirement for obtaining clinical approval from the FDA (see “Regulatory requirements”). Accordingly, such subsets should be specifically delineated within test datasets. Each subset needs to be sufficiently large to allow statistically meaningful results (see “Sample size”). It is important to provide information on why and how subsets were collected so that any issues AI solutions may have with specific subsets can be tracked (see “Reporting”). Identifying subsets at risk of hidden stratification is a major challenge and requires extensive knowledge of the use case and the distribution of possible input images22. As an aid, potentially relevant subsets can also be detected automatically using unsupervised clustering approaches such as k-means22. If a detected cluster underperforms compared to the entire dataset, this may indicate the presence of hidden stratification that needs further examination.
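
A minimal sketch of such a clustering check is shown below, assuming precomputed image embeddings (e.g., from a generic feature extractor) and a per-image correctness indicator; the number of clusters and the choice of features are assumptions for illustration only.

```python
# Minimal sketch: cluster image embeddings and compare per-cluster accuracy
# to the overall accuracy; clusters that underperform markedly are candidates
# for hidden stratification and warrant manual review.
import numpy as np
from sklearn.cluster import KMeans

def cluster_performance(embeddings, correct, n_clusters=10, seed=0):
    """embeddings: (n_images, d) array; correct: boolean array per image."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    overall = float(correct.mean())
    per_cluster = [(c, int((labels == c).sum()), float(correct[labels == c].mean()))
                   for c in range(n_clusters)]
    # Sort ascending by accuracy so the weakest clusters appear first.
    return overall, sorted(per_cluster, key=lambda row: row[2])
```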

Bias detection

Biases can make test datasets unsuitable for evaluating the performance of AI algorithms. Therefore, it is important to identify potential biases and to mitigate them early during data acquisition28. Bias, in this context, refers to sampling bias, i.e., the test dataset is not a randomly drawn sample from the target population of images. Subsets to be evaluated independently may be biased by construction with respect to particular features (e.g., patient age). Here, it is important to ensure that the subsets do not contain unexpected biases with respect to other features. For example, the prevalence of slide scanners should be independent of patient age, whereas the prevalence of diagnoses may vary by age group. Bias detection generally involves comparing the feature distributions of the test dataset and the target population of images. Similar methods can also be used to test the diversity or representativeness of a test dataset.

For features represented as metadata (e.g., patient age, slide scanner, or diagnosis), the distributions of the test dataset and target population can be compared using summary statistics (e.g., mean and standard deviation, percentiles, or histograms) or dedicated representativeness metrics97,98. Detection of bias in an entire test dataset requires a good estimate of the feature distribution of the target population of images. Bias in subsets can be detected by comparing the subset distribution to the entire dataset. Several toolkits for measuring bias based on metadata have been proposed99,100 and evaluated101.
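
For illustration, the sketch below compares a continuous metadata feature (patient age) with a two-sample Kolmogorov-Smirnov test and a categorical feature (scanner type) with a chi-square test; all values are placeholders, and acceptance thresholds should be fixed in the study protocol.

```python
# Minimal sketch: compare metadata distributions with standard two-sample tests.
# All values are illustrative placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
subset_ages = rng.normal(62, 10, size=200)       # ages in the subset under test
reference_ages = rng.normal(60, 12, size=2000)   # ages in the full dataset or target estimate

# Continuous feature (patient age): two-sample Kolmogorov-Smirnov test.
ks_stat, ks_p = stats.ks_2samp(subset_ages, reference_ages)

# Categorical feature (scanner type): chi-square test on a contingency table,
# rows = (subset, remainder), columns = scanner types.
table = np.array([[120, 80, 40],
                  [460, 330, 170]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"age: KS p-value = {ks_p:.3f}; scanner type: chi-square p-value = {chi_p:.3f}")
```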

Detecting bias in the image data itself is more challenging. Numerous features can be extracted from image data and it is difficult to determine the distribution of these features in the target population of images. Similar to automatic detection of unsuitable data, there are automatic methods to reveal bias in image data. Domain shifts63 can be detected either by comparing the distributions of basic image features (e.g., contrast) or by more complex image representations learned through specific neural network models63,66,102. Another approach is to train trivial machine learning models with modified images from which obvious predictive information has been removed (e.g., tumor regions): If such models perform better than chance, this indicates bias in the dataset103,104.
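
A minimal sketch of such a “trivial model” check is given below, using simple color-histogram features as an assumed stand-in for features extracted from images whose predictive content has been masked; cross-validated accuracy clearly above chance would point to residual bias.

```python
# Minimal sketch of a "trivial model" bias check: extract simple color-histogram
# features (an assumed stand-in) from images whose obviously predictive content
# has been masked, and test whether a linear model still predicts the labels
# above chance, which would indicate dataset bias.
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def color_histogram(path, bins=8):
    rgb = np.asarray(Image.open(path).convert("RGB"))
    return np.concatenate([
        np.histogram(rgb[..., c], bins=bins, range=(0, 255), density=True)[0]
        for c in range(3)
    ])

# masked_paths: paths to masked images; labels: target classes (placeholders).
# X = np.stack([color_histogram(p) for p in masked_paths])
# scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
# print("cross-validated accuracy (compare to chance level):", scores.mean())
```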

Independence

In the development of AI solutions, it is common practice to split a given dataset into two sets, one for development (e.g., a training and a validation set for model selection) and one for testing17. AI methods are prone to exploit spurious correlations in datasets as shortcut opportunities19. In this case, the methods perform well on data with similar correlations, but not on the target population. If both development and test datasets are drawn from the same original dataset, they are likely to share spurious correlations, and the performance on the test dataset may overestimate the performance on the target population. Therefore, datasets used for development and testing need to be sufficiently independent. As explained below, it is not sufficient for test datasets to merely contain different images than development datasets17,19.

Owing to memory constraints, histologic whole-slide images (WSIs) are usually divided into small sub-images called “tiles.” AI methods are then applied to each tile individually, and the result for the entire WSI is obtained by aggregating the results of the individual tiles. If tiles are assigned randomly to datasets, tiles from the same WSI can end up in both the development and the test datasets, possibly inflating performance results. A substantial number of published research studies are affected by this problem105. Therefore, to avoid any risk of bias, no tile in a test dataset may originate from the same WSI as any tile in the development set105.
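
The following sketch shows a group-aware split that keeps all tiles from one WSI on the same side of the development/test boundary, using scikit-learn’s GroupShuffleSplit; the data are placeholders, and the same pattern can be applied with site identifiers as groups.

```python
# Minimal sketch: group-aware split so that no WSI contributes tiles to both
# the development and the test set. Placeholder data: 8 tiles from 4 WSIs.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

tiles = np.arange(8)
wsi_ids = np.array(["wsi1", "wsi1", "wsi2", "wsi2", "wsi3", "wsi3", "wsi4", "wsi4"])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
dev_idx, test_idx = next(splitter.split(tiles, groups=wsi_ids))

# No WSI may appear on both sides of the split; the same check applies when
# grouping by clinical site instead of WSI.
assert not set(wsi_ids[dev_idx]) & set(wsi_ids[test_idx])
```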

Datasets can contain site-specific feature distributions29. If these site-specific features are correlated with the outcome of interest, AI methods might use these features for classification rather than the relevant biological features (e.g., tissue morphology) and be unable to generalize to new datasets. A comprehensive evaluation based on multi-site datasets from The Cancer Genome Atlas (TCGA) showed that including data from one site in development and test datasets often leads to overoptimistic estimates of model accuracy29. This study also found that commonly used color normalization and augmentation methods did not prevent models from learning site-specific features, although stain differences between laboratories appeared to be a primary source of site-specific features. Therefore, the images in development and test datasets must originate not only from different subjects, but should also come from different clinical sites31,106,107.
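
As a complementary check, development and test datasets can be verified to come from disjoint clinical sites, assuming per-image metadata records with a hypothetical “site” field:

```python
# Minimal sketch: verify that development and test data originate from disjoint
# clinical sites, assuming per-image metadata records with a "site" field
# (hypothetical schema).
def assert_sites_disjoint(dev_records, test_records):
    overlap = {r["site"] for r in dev_records} & {r["site"] for r in test_records}
    if overlap:
        raise ValueError(f"Development and test data share sites: {sorted(overlap)}")

assert_sites_disjoint(
    dev_records=[{"site": "lab_A"}, {"site": "lab_B"}],
    test_records=[{"site": "lab_C"}],
)
```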

As described in the Introduction section, a given AI solution should only be evaluated once against a given test dataset17. Datasets published in the context of challenges or studies (many of which are based on TCGA4 and have regional biases108) should generally not be used as test datasets: it cannot be ruled out that they were taken into account in some form during development, e.g., inadvertently or as part of pretraining. Ideally, test datasets should not be published at all and the evaluation should be conducted by an independent body with no conflicts of interest30.

Reporting

Adequate reporting of test datasets is essential to determine whether a particular dataset is appropriate for a particular AI solution. Detailed metadata on the coverage of various dimensions of variability is required for detecting bias and identifying relevant subsets. Data provenance must be tracked to ensure that test data are sufficiently disjoint from development data28,29. Requirements for the test data109 and acquisition protocols110 should also be reported so that further data can be collected later. Accurate reporting of test datasets is also important for regulatory approval, where submitted evaluation results must be traceable to the test data111.

Various guidelines for reporting clinical research and trials, including diagnostic models, have been published112. Some of these have been adapted specifically for machine learning approaches113,114 or such adaptation is under development115,116,117,118. However, only very few guidelines elaborate on data reporting119, and there is not yet consensus on structured reporting of test datasets, particularly for computational pathology.

Data acquisition protocols should comprehensively describe how and where the test dataset was acquired, handled, processed, and stored109,110. This documentation should include precise details of the hardware and software versions used and also cover the creation of reference annotations. Moreover, quality criteria for rejecting data and procedures for handling missing data119 should be reported, i.e., aspects of what is not in the dataset. To facilitate data management and analysis, individual images should be referenced via universally unique identifiers120 and image metadata should be represented using standard data models121,122. Protocols should be defined prior to data acquisition when prospectively collecting test data. Completeness and clarity of the protocols should be verified during data acquisition.

Reported information should characterize the acquired dataset in a useful way. For example, summary statistics allow an initial assessment of whether a given dataset is an adequate sample of the target population (see “Bias detection”). Relevant subsets and biases identified in the dataset should be reported as well. Generally, one should collect and report as much information as feasible with the available resources, since retrospectively obtaining missing metadata is difficult or impossible. If there will be multiple versions of a dataset, e.g., due to iterative data acquisition or revision of reference annotations, versioning is needed. Suitable hashing can guarantee the integrity of the entire dataset as well as of its individual samples, and allows datasets to be identified without disclosing their contents.
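
For illustration, the sketch below combines per-file SHA-256 hashes with a dataset-level hash and per-image UUIDs; the manifest layout and file pattern are assumptions, not a prescribed format.

```python
# Minimal sketch: per-file SHA-256 hashes plus a dataset-level hash derived from
# them, so the integrity of individual images and of a whole dataset version can
# be verified and the dataset identified without disclosing its contents.
# The manifest layout (UUID -> file) and the "*.tiff" pattern are assumptions.
import hashlib
import uuid
from pathlib import Path

def file_sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def dataset_fingerprint(image_dir):
    manifest = {str(uuid.uuid4()): p for p in sorted(Path(image_dir).glob("*.tiff"))}
    file_hashes = {uid: file_sha256(p) for uid, p in manifest.items()}
    # Order-independent dataset hash over the sorted per-file hashes.
    dataset_hash = hashlib.sha256("".join(sorted(file_hashes.values())).encode()).hexdigest()
    return manifest, file_hashes, dataset_hash
```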

Regulatory requirements

AI solutions in pathology are in vitro diagnostic medical devices (IVDMDs) because they evaluate tissue images for diagnostic purposes outside the human body. Therefore, regulatory approval is required for sale and use in a clinical setting123. The U.S. Food and Drug Administration (FDA) and European Union (EU) impose similar requirements to obtain regulatory approval. This includes compliance with certain quality management and documentation standards, a risk analysis, and a comprehensive performance evaluation. The performance evaluation must demonstrate that the method provides accurate and reliable results compared to a gold standard (analytical performance) and that the method provides real benefit in a clinical context (clinical performance). Good test datasets are an essential prerequisite for a meaningful evaluation of analytical performance.

EU + UK

In the EU and UK, IVDMDs are regulated by the In vitro Diagnostic Device Regulation (IVDR, formally “Regulation 2017/746”)124. After a transition period, compliance with the IVDR will be mandatory for novel routine pathology diagnostics as of May 26, 2022. The IVDR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the EU has put forward a proposal for an EU-wide regulation on harmonized rules for the assessment of AI125.

The EU proposal125 considers AI-based IVDMDs as “high-risk AI systems” (preamble (30)). For test datasets used in the evaluation of such systems, the proposal imposes certain quality criteria: test datasets must be “relevant, representative, free of errors and complete” and “have the appropriate statistical properties” (Article 10.3). Likewise, it requires test datasets to be subject to “appropriate data governance and management practices” (preamble (44)) with regard to design choices, suitability assessment, data collection, and identification of shortcomings.

USA

In the US, IVDMDs are regulated in the Code of Federal Regulations (CFR) Part 809126. Just like the IVDR, the CFR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the CFR states that products should be accompanied by labeling that states specific performance characteristics (e.g., accuracy, precision, specificity, and sensitivity) related to normal and abnormal populations of biological specimens.

In 2021, the FDA approved the first AI software for pathology127. In this context, the FDA has established a definition and requirements for approval of generic AI software for pathology, formally referred to as “software algorithm devices to assist users in digital pathology”128.

Test datasets used in analytical performance studies are expected to contain an “appropriate” number of images. To be “representative of the entire spectrum of challenging cases” (3.ii.A. and B. of source128) that can occur when the product is used as intended, test datasets should cover multiple operators, slide scanners, and clinical sites and contain “clinical specimens with defined, clinically relevant, and challenging characteristics” (3.ii.B. of source128). In particular, test datasets should be stratified into relevant subsets (e.g., by medical characteristics, patient demographics, scanning equipment) to allow separate determination of performance for each subset. Case cohorts considered in clinical performance studies (e.g., comparing unassisted and software-assisted assessment of pathology slides by intended users) are expected to adhere to similar specifications.

Product labeling according to CFR 809 was also defined in more detail. In addition to the general characteristics of the dataset (e.g., origin of images, annotation procedures, subsets, …), limitations of the dataset (e.g., poor image quality or insufficient sampling of certain subsets) that may cause the software to fail or operate unexpectedly should be specified.

In summary, there are much more specific requirements for test datasets in the US than in the EU. However, none of the regulations clearly specify how the respective requirements can be achieved or verified.

Discussion

Our recommendations on compiling test datasets are summarized in Fig. 4. They are intended to help AI developers demonstrate the robustness and practicality of their solutions to regulatory agencies and end users. Likewise, the advice can be used to check whether test datasets used in the evaluation of AI solutions were appropriate and reported performance measures are meaningful. Much of the advice can also be transferred to image analysis solutions without AI and to similar domains where solutions are applied to medical images, such as radiology or ophthalmology.

Fig. 4: Overview of recommendations on compiling test datasets.

Prior to data acquisition, the acquisition process must be thoroughly planned. In particular, the intended use of the AI solution must be precisely understood in order to derive the requirements for test datasets.

A key finding of the work is that it remains challenging to compile test datasets and that there are still many unanswered questions. The current regulatory requirements remain vague and do not specify in detail important aspects such as the required diversity or size of a test dataset. In principle, the methods described in the “Bias detection” and “Sample size” sections can be used to assess whether a sample is sufficiently diverse or large, respectively. These methods depend on a precise and comprehensive definition of the target population of images. However, since this population usually cannot be formally specified but only roughly described, it can be difficult to apply these methods in a meaningful way in practice.

For regulatory approval, a plausible justification is needed as to why the test dataset used was adequate. Besides following the advice in this paper, it can also be helpful to refer to published studies in which AI solutions have been comprehensively evaluated. Additional guidance can be found in the summary documents of approved AI solutions published by the FDA, which include information on their evaluation106. Notably, many of the AI devices approved by the FDA were evaluated only at a small number of sites106 with limited geographic diversity129. Test sets used in current studies typically involved thousands of slides, hundreds of patients, fewer than five sites, and fewer than five scanner types50,52,130,131.

Today, AI solutions in pathology may not be used for primary diagnosis, but only in conjunction with a standard evaluation by the pathologist128. Therefore, compared to a fully automated usage scenario, requirements for robustness are considerably lower. This also applies to the expected confidence in the performance measurement and the scope of the test dataset used. In a supervised usage scenario, the accuracy of an AI solution determines how often the user needs to intervene to correct results, and thus its practical usefulness. End users are interested in the most meaningful evaluation of the accuracy of AI solutions to assess their practical utility. Therefore, a comprehensive evaluation of the real-world performance of a product, taking into account the advice given in this paper, can be an important marketing tool.

Limitations and outlook

Some aspects of compiling test datasets were not considered in this article. One aspect is how to collaborate with data donors, i.e., how to incentivize or compensate them for donating data. Other aspects include the choice of software tools and data formats for the compilation and storage of datasets or how the use of test datasets should be regulated. These aspects must be clarified individually for each use case and the AI solution to be tested. Furthermore, we do not elaborate on legal aspects of collecting test data, e.g., obtaining consent from patients, privacy regulations, licensing, and liability. For more details on these topics, we refer to other works132. The present paper focuses exclusively on compiling test datasets. For advice on other issues related to validating AI solutions in pathology, such as how to select an appropriate performance metric, how to make algorithmic results interpretable, or how to conduct a clinical performance evaluation with end users, we also refer to other works30,31,33,34,133,134.

For AI solutions to operate with less user intervention and to better support diagnostic workflows, real-world performance must be assessed more accurately than is currently possible. The key to accurate performance measures is the representativeness of the test dataset. Therefore, future work should focus on better characterizing the target population of images and how to collect more representative samples. Empirical studies should be conducted on how different levels of coverage of the variability dimensions (e.g., laboratories, scanner types) affect the quality of performance evaluation for common use cases in computational pathology.

In addition, clear criteria should be developed to delineate the target population from unsuitable data. Currently, the assessment of the suitability of data is typically done by humans, which might introduce subjective bias. Automated methods can help to make the assessment of suitability more objective (see “Curation”) and should therefore be further explored. However, such automated methods must be validated on dedicated test datasets themselves.

Another open challenge is how to deal with changes in the target population of images. Since the intended use for a particular product is fixed, in theory the requirements for the test datasets should also be fixed. However, the target distribution of images is influenced by several factors that change over time. These include technological advances in specimen and image acquisition, distribution of scanner systems used, and shifting patient populations133,135. As part of post-market surveillance, AI solutions must be continuously monitored during their entire lifecycle111. Clear processes are required for identifying changes in the target population of images and adapting performance estimates accordingly.

Conclusions

Appropriate test datasets are essential for meaningful evaluation of the performance of AI solutions. The recommendations provided in this article are intended to help demonstrate the utility of AI solutions in pathology and to assess the validity of performance studies. The key remaining challenge is the vast variability of images in computational pathology. Further research is needed on how to formalize criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.