Main

Automatic image processing with ML is gaining increasing traction in biological and medical imaging research and practice. Research has predominantly focused on the development of new image processing algorithms. The critical issue of reliable and objective performance assessment of these algorithms, however, remains largely unexplored. Algorithm performance in image processing is commonly assessed with validation metrics (not to be confused with distance metrics in the pure mathematical sense) that should serve as proxies for the domain interest. In consequence, the impact of validation metrics cannot be overstated. First, they are the basis for deciding on the practical (for example, clinical) suitability of a method and are thus a key component for translation into biomedical practice. In fact, validation that is not conducted according to relevant metrics could be one major reason why many artificial intelligence (AI) developments in medical imaging fail to reach clinical practice1,2. In other words, the numbers presented in journals and conference proceedings do not reflect how successful a system will be when applied in practice. Second, metrics guide the scientific progress in the field; flawed metric use can lead to entirely futile resource investment and infeasible research directions while obscuring true scientific advancements.

Despite the importance of metrics, an increasing body of work shows that the metrics used in common practice often do not adequately reflect the underlying biomedical problems, diminishing the validity of the investigated methods3,4,5,6,7,8,9,10,11. This especially holds true for challenges, internationally respected competitions that have become the de facto standard for comparative performance assessment of image processing methods. These challenges are often published in prestigious journals12,13,14 and receive tremendous attention from both the scientific community and industry. Among a number of shortcomings in design and quality control that were recently unveiled by a multicenter initiative8, the choice of inappropriate metrics stood out as a core problem. Compared to other areas of AI research, choosing the right metric is particularly challenging in image processing because the suitability of a metric depends on various factors. As a foundation for the present work, we identified three core categories related to pitfalls in metric selection (Fig. 1a):

Fig. 1: Contributions of the Metrics Reloaded framework.
figure 1

a, Motivation: Common problems related to metrics typically arise from inappropriate choice of the problem category (here: ObD confused with SemS; top left), poor metric selection (here: neglecting the small size of structures; top right) and poor metric application (here: inappropriate aggregation scheme; bottom). Pitfalls are highlighted in the boxes; the reported values refer to the average DSC. Green metric values correspond to a good metric value, whereas red values correspond to a poor value. Green check marks indicate desirable behavior of metrics; red crosses indicate undesirable behavior. Adapted from ref. 27 under a Creative Commons license CC BY 4.0. b, Metrics Reloaded addresses these pitfalls. (1) To enable the selection of metrics that match the domain interest, the framework is based on the new concept of problem fingerprinting, that is, the generation of a structured representation of the given biomedical problem that captures all properties that are relevant for metric selection. Based on the problem fingerprint, Metrics Reloaded guides the user through the process of metric selection and application while raising awareness of relevant pitfalls. (2) An instantiation of the framework for common biomedical use cases demonstrates its broad applicability. (3) A publicly available online tool facilitates application of the framework. Second input image reproduced from dermoscopedia (ref. 58) under a Creative Commons license CC BY 4.0; fourth input image reproduced with permission from ref. 59, American Association of Physicists in Medicine.

Inappropriate choice of the problem category: The chosen metrics do not always reflect the biomedical need. For example, object detection (ObD) problems are often framed as segmentation tasks, resulting in the use of metrics that do not account for the potentially critical localization of all objects in the scene15,16 (Fig. 1a).

Poor metric selection: Certain characteristics of a given biomedical problem render particular metrics inadequate. Mathematical metric properties are often neglected, for example, when using the Dice similarity coefficient (DSC) in the presence of particularly small structures (Fig. 1a).

Poor metric application: Even if a metric is well suited for a given problem in principle, pitfalls can occur when applying that metric to a specific dataset. For example, a common flaw pertains to ignoring hierarchical data structure, as in data from multiple hospitals or a variable number of images per patient (Fig. 1a), when aggregating metric values.

These problems are magnified by the fact that common practice often grows historically, and poor standards may be propagated between generations of scientists and in prominent publications. To dismantle such historically grown poor practices and leverage distributed knowledge from various subfields of image processing, we established the multidisciplinary Metrics Reloaded consortium. (We thank the Intelligent Medical Systems laboratory members N. Sautter, P. Vieten and T. Adler for the suggestion of the name, inspired by the Matrix movies.) This consortium comprises international experts from the fields of medical image analysis, biological image analysis, medical guideline development, general ML, different medical disciplines, statistics and epidemiology, representing a large number of biomedical imaging initiatives and societies.

The mission of Metrics Reloaded is to foster reliable algorithm validation through problem-aware, standardized choice of metrics with the long-term goal of (1) enabling the reliable tracking of scientific progress and (2) aiding to bridge the current chasm between ML research and translation into biomedical imaging practice.

Based on a kickoff workshop held in December 2020, the Metrics Reloaded framework (Figs. 1b and 2) was developed using a multistage Delphi process17,18 for consensus building. Its primary purpose is to enable users to make educated decisions on which metrics to choose for a driving biomedical problem. The foundation of the metric selection process is the new concept of problem fingerprinting (Fig. 3). Abstracting from a specific domain, problem fingerprinting is the generation of a structured representation of the given biomedical problem that captures all properties relevant for metric selection. As depicted in Fig. 3, the properties captured by the fingerprint comprise domain interest-related properties (such as the particular importance of structure boundary, volume or center); target structure-related properties (such as the shape complexity or the size of structures relative to the image grid size); dataset-related properties (such as class imbalance); and algorithm output-related properties (such as the theoretical possibility of the algorithm output not containing any target structure).

Fig. 2: Metrics Reloaded recommendation framework from a user perspective.
figure 2

In step 1 – problem fingerprinting, the given biomedical image analysis problem is mapped to the appropriate image problem category (ImLC, SemS, ObD or InS; Fig. 4). The problem category and further characteristics of the given biomedical problem relevant for metric selection are then captured in a problem fingerprint (Fig. 3). In step 2 – metric selection, the user follows the respective colored path of the chosen problem category (ImLC, SemS, ObD or InS) to select a suitable pool of metrics from the Metrics Reloaded pools shown in green. When a tree branches, the fingerprint items determine which exact path to take. Finally, in step 3 – metric application, the user is supported in applying the metrics to a given dataset. During the traversal of the decision tree, the user goes through subprocesses, indicated by the plus sign, which are provided in Extended Data Figs. 1–9 and represent relevant steps in the metric selection process. Ambiguities related to metric selection are resolved via decision guides (Supplementary Note 2.7) that help users make an educated decision when multiple options are possible. A comprehensive textual description of the recommendations for all four problem categories as well as for the selection of corresponding calibration metrics (if any) is provided in Supplementary Notes 2.2–2.6. An overview of the symbols used in the process diagram is provided in Fig. SN 5.1. Condensed versions of the mappings for every category can be found in Supplementary Note 2.2 for ImLC, Supplementary Note 2.3 for SemS, Supplementary Note 2.4 for ObD and Supplementary Note 2.5 for InS.

Fig. 3: Relevant properties of a driving biomedical image analysis problem are captured by the problem fingerprint (selection for SemS shown here).
figure 3

The fingerprint comprises a set of items, each of which represents a specific property of the problem, is either binary or categorical, and must be instantiated by the user. Besides the problem category, the fingerprint comprises domain interest-related, target structure-related, dataset-related and algorithm output-related properties. A comprehensive version of the fingerprints for all problem categories can be found in Figs. SN 2.7–2.9 (ImLC), SN 2.10–2.11 (SemS), SN 2.12–2.14 (ObD) and SN 2.15–2.17 (InS). Pred, prediction; ref, reference.

Based on the problem fingerprint, the user is then, in a transparent and understandable manner, guided through the process of selecting an appropriate set of metrics while being made aware of potential pitfalls related to the specific characteristics of the underlying biomedical problem. The Metrics Reloaded framework currently supports problems in which categorical target variables are to be predicted based on a given n-dimensional input image (possibly enhanced with context information) at pixel, object or image level (Fig. 4). It thus supports problems that can be assigned to one of the following four problem categories: image-level classification (ImLC; image level), ObD (object level), semantic segmentation (SemS; pixel level) or instance segmentation (InS; pixel level). Designed to be imaging modality independent, Metrics Reloaded is suited for application in various image analysis domains, even beyond the field of biomedicine.

Fig. 4: Metrics Reloaded fosters the convergence of validation methodology across modalities, application domains and classification scales.
figure 4

The framework considers problems in which categorical target variables are to be predicted at image, object and/or pixel level, resulting (from top to bottom) in ImLC, ObD, InS or SemS problems. These problem categories are relevant across modalities (here CT, microscopy and endoscopy) and application domains. From left to right: annotation of benign and malignant lesions in CT images59, different cell types in microscopy images60 and medical instruments in laparoscopy images61. Left, reproduced with permission from ref. 59, American Association of Physicists in Medicine; center, reproduced with permission from ref. 60, Springer Nature Limited; right, reproduced with permission from ref. 61, Springer Nature Limited. RBC, red blood cell; WBC, white blood cell.

Here, we present the key contributions of our work in detail, namely (1) the Metrics Reloaded framework for problem-aware metric selection along with the key findings and design decisions that guided its development (Fig. 2), (2) the application of the framework to common biomedical use cases, showcasing its broad applicability (selection shown in Fig. 5) and (3) the open online tool that has been implemented to improve the user experience with our framework.

Fig. 5: Instantiation of the framework with recommendations for concrete biomedical questions.
figure 5

From top to bottom: (1) Image classification for the examples of sperm motility classification62 and disease classification in dermoscopic images63,58. (2) SemS of large objects for the examples of embryo segmentation from microscopy64 and liver segmentation in CT images65,66. (3) Detection of multiple and arbitrarily located objects for the examples of cell detection and tracking during the autophagy process67,68 and MS lesion detection in multimodal brain MRI images69,70. (4) InS of tubular objects for the examples of InS of neurons from the fruit fly71,72,73 and surgical instrument InS61. The corresponding traversals through the decision trees are shown in Supplementary Note 4. An overview of the recommended metrics can be found in Supplementary Note 3.1, including relevant information for each metric. ImLC-2, reproduced from dermoscopedia under a Creative Commons license CC BY 4.0; SemS-2, reproduced with permission from ref. 65, Springer Nature Limited; InS-1, reproduced with permission from ref. 61, Springer Nature Limited.

Metrics Reloaded framework

Metrics Reloaded is the result of a multistage Delphi process, comprising five international workshops, nine surveys, numerous expert group meetings and crowdsourced feedback processes, all conducted between 2020 and 2022. As a foundation of the recommendation framework, we identified common and rare pitfalls related to metrics in the field of biomedical image analysis using a community-powered process, detailed in this work’s sister publication19. We found that common practice is often not well justified, and poor practices may even be propagated from one generation of scientists to the next. Importantly, many pitfalls generalize not only across the four problem categories that our framework addresses but also across domains (Fig. 4). This is because the source of the pitfall, such as class imbalance, uncertainties in the reference or poor image resolution, can occur irrespective of a specific modality or application.

Following the convergence of AI methodology across domains and problem categories, we therefore argue for the analogous convergence of validation methodology.

Cross-domain approach enables integration of distributed knowledge

To break historically grown poor practices, we followed a multidisciplinary cross-domain approach that enabled us to critically question common practice in different communities and integrate distributed knowledge in one common framework. To this end, we formed an international multidisciplinary consortium of 73 experts from various biomedical image analysis-related fields. Furthermore, we crowdsourced metric pitfalls and feedback on our approach in a social media campaign. Ultimately, a total of 156 researchers contributed to this work, including 84 mentioned in the acknowledgements. Consideration of the different knowledge and perspectives on metrics led to the following key design decisions for Metrics Reloaded:

Encapsulating domain knowledge: The questions asked to select a suitable metric are mostly similar regardless of image modality or application: Are the classes balanced? Is there a specific preference for the positive or negative class? What is the accuracy of the reference annotation? Is the structure boundary or volume of relevance for the target application? Importantly, while answering these questions requires domain expertise, the consequences in terms of metric selection can largely be regarded as domain independent. Our approach is thus to abstract from the specific image modality and domain of a given problem by capturing the properties relevant for metric selection in a problem fingerprint (Fig. 3).

Exploiting synergies across classification scales: Similar considerations apply with regard to metric choice for classification, detection and segmentation tasks, as they can all be regarded as classification tasks at different scales (Fig. 4). The similarities between the categories, however, can also lead to problems when the wrong category is chosen (Fig. 1a). Therefore, we (1) address all four problem categories in one common framework (Fig. 2) and (2) cover the selection of the problem category itself in our framework (Extended Data Fig. 1).

Setting new standards: As the development and implementation of recommendations that go beyond the state of the art often requires critical mass, we involved stakeholders of various communities and societies in our consortium. Notably, our crowdsourcing-based approach led to a pool of metric candidates (Fig. SN 2.1) that includes not only commonly applied metrics, but also metrics that have to date received little attention in biomedical image analysis.

Abstracting from inference methodology: Metrics should be chosen based solely on the driving biomedical problem and not be affected by algorithm design choices. For example, the error functions applied in common neural network architectures do not justify the use of corresponding metrics (for example, validating with DSC to match the Dice loss used for training a neural network). Instead, the domain interest should guide the choice of metric, which, in turn, can guide the choice of the loss term.

Exploiting complementary metric strengths: A single metric typically cannot cover the complex requirements of the driving biomedical problem20. To account for the complementary strengths and weaknesses of metrics, we generally recommend using multiple complementary metrics to validate image analysis algorithms. As detailed in our recommendations (Supplementary Note 2), we specifically recommend the selection of metrics from different families.

Validation by consensus building and community feedback: A major challenge for research on metrics is its validation, due to the lack of methods capable of quantitatively assessing the superiority of a given metric set over another. Following the spirit of large consortia formed to develop reporting guidelines (for example, CONSORT21, TRIPOD22 and STARD23), we built the validation of our framework on three main pillars: (1) Delphi processes to challenge and refine the proposals of the expert groups that worked on individual components of the framework; (2) community feedback obtained by broadcasting the framework via society mailing lists and social media platforms; and (3) instantiation of the framework for a range of different biological and medical use cases.

Involving and educating users: Choosing adequate validation metrics is a complex process. Rather than providing a black box recommendation, Metrics Reloaded guides the user through the process of metric selection while raising awareness of pitfalls that may occur. In cases in which the trade-offs between different choices must be considered, decision guides (Supplementary Note 2.7) assist in deciding between competing metrics while respecting individual preferences.

Problem fingerprints encapsulate relevant domain knowledge

To encapsulate relevant domain knowledge in a common format and then enable a modality-agnostic metric recommendation approach that generalizes over domains, we developed the concept of problem fingerprinting (Fig. 3). As a foundation, we crowdsourced all properties of a driving biomedical problem that are potentially relevant for metric selection via surveys issued to the consortium (Supplementary Methods). This process resulted in a list of binary and categorical variables (fingerprint items) that must be instantiated by a user to trigger the Metrics Reloaded recommendation process. Common issues often relate to selecting metrics from the wrong problem category (Fig. 1a). To avoid such issues, problem fingerprinting begins with mapping a given problem with all its intrinsic and dataset-related properties to the corresponding problem category via the category mapping shown in Extended Data Fig. 1. The problem category is a fingerprint item itself.

In the following, we refer to all fingerprint items with the notation FPX.Y, where Y is a numerical identifier, and the index X represents one of the following families:

FP1 – Problem category refers to the problem category generated by S1 (Extended Data Fig. 1).

FP2 – Domain interest-related properties reflect user preferences and are highly dependent on the target application. A semantic image segmentation that serves as the foundation for radiotherapy planning, for example, would require exact contours (FP2.1 – particular importance of structure boundaries = TRUE). On the other hand, for a cell segmentation problem that serves as prerequisite for cell tracking, the object centers may be much more important (FP2.3 – particular importance of structure center(line) = TRUE). Both problems could be tackled with identical network architectures, but the validation metrics should be different.

FP3 – Target structure-related properties represent inherent properties of target structure(s) (if any), such as the size, size variability and the shape. Here, the term target structures can refer to any object/structure of interest, such as cells, vessels, medical instruments or tumors.

FP4 – Dataset-related properties capture properties inherent to the provided data to which the metric is applied. They primarily relate to class prevalences, uncertainties of the reference annotations and whether the data structure is hierarchical.

FP5 – Algorithm output-related properties encode properties of the output, such as the availability of predicted class scores.

Note that not all properties are relevant for all problem categories. For example, the shape and size of target structures are highly relevant for segmentation problems but irrelevant for image classification problems. The complete problem category-specific fingerprints are provided in Supplementary Note 1.3.
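To make the concept concrete, the following minimal Python sketch represents a SemS fingerprint as a structured object with binary items. The field names are illustrative abbreviations, not the authoritative identifiers, which are defined in Supplementary Note 1.3.

```python
# A minimal sketch (not part of the framework itself): a SemS problem fingerprint
# represented as a structured object with binary items. The field names below are
# illustrative abbreviations; the authoritative, category-specific item lists are
# given in Supplementary Note 1.3.
from dataclasses import dataclass

@dataclass
class SemSFingerprint:
    problem_category: str = "SemS"                     # FP1
    importance_of_structure_boundaries: bool = False   # FP2.x (domain interest)
    importance_of_structure_center: bool = False       # FP2.x (domain interest)
    small_structures_relative_to_grid: bool = False    # FP3.x (target structure)
    high_class_imbalance: bool = False                 # FP4.x (dataset)
    predicted_class_scores_available: bool = False     # FP5.x (algorithm output)

# Example: radiotherapy planning, where exact contours matter.
fingerprint = SemSFingerprint(importance_of_structure_boundaries=True)
print(fingerprint)
```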

Metrics Reloaded addresses all three types of metric pitfalls

Metrics Reloaded was designed to address all three types of metric pitfalls identified in ref. 19 and illustrated in Fig. 1a. More specifically, each of the three steps shown in Fig. 2 addresses one type of pitfall:

Step 1 – Fingerprinting. A user should begin by reading the general instructions of the recommendation framework (Supplementary Note 1.1). Next, the user should convert the driving biomedical problem to a problem fingerprint. This step not only is a prerequisite for applying the framework across application domains and classification scales, but also specifically addresses the inappropriate choice of the problem category via the integrated category mapping. Once the user’s domain knowledge has been encapsulated in the problem fingerprint, the actual metric selection is conducted according to a domain-agnostic and modality-agnostic process.

Step 2 – Metric selection. A Delphi process yielded the Metrics Reloaded pool of reference-based validation metrics (Fig. SN 2.1). Notably, this pool contains metrics that are currently not widely known in some biomedical image analysis communities. A prominent example is the Net Benefit (NB)24 metric, popular in clinical prediction tasks and designed to determine whether basing decisions on a method would do more good than harm. A diagnostic test, for example, may lead to early identification and treatment of a disease, but typically will also cause a number of individuals without disease to be subjected to unnecessary further interventions. NB allows the consideration of such trade-offs by putting benefits and harms on the same scale so that they can be directly compared. Another example is the expected cost (EC) metric25, which can be seen as a generalization of accuracy with many desirable added features but is not well known in the biomedical image analysis communities26. Based on the Metrics Reloaded pool, the metric recommendation is performed with a business process model and notation (BPMN)-inspired flowchart (Fig. SN 5.1), in which conditional operations are based on one or multiple fingerprint properties (Fig. 2). The main flowchart has three substeps, each addressing the complementary strengths and weaknesses of common metrics. First, common reference-based metrics, which are based on the comparison of the algorithm output to a reference annotation, are selected. Second, the pool of standard metrics can be complemented with custom metrics to address application-specific complementary properties. Third, non-reference-based metrics assessing speed, memory consumption or carbon footprint, for example, can be added to the metric pool(s). In this paper, we focus on the step of selecting reference-based metrics, because this is where synergies across modalities and scales can be exploited.
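As an illustration of these two less widely known metrics, the following sketch computes NB for a binary task at a chosen probability threshold (NB = TP/N - FP/N * t/(1 - t)) and a simple cost- and prior-weighted EC. The threshold, costs and toy data are illustrative; the exact definitions used by the framework are given in Supplementary Note 2.

```python
# Minimal sketches of Net Benefit (NB) and expected cost (EC) for a binary task.
# The threshold, costs and toy data are illustrative; the exact definitions used
# by the framework are given in Supplementary Note 2.
import numpy as np

def net_benefit(y_true, y_score, threshold):
    """Decision-curve-style NB at probability threshold t: TP/N - FP/N * t/(1 - t)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    n = len(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))

def expected_cost(y_true, y_pred, cost_fp=1.0, cost_fn=1.0):
    """EC as a prior- and cost-weighted error rate (cost of correct decisions = 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    prior_pos, prior_neg = np.mean(y_true == 1), np.mean(y_true == 0)
    fnr = np.mean(y_pred[y_true == 1] == 0)  # miss rate on the positive class
    fpr = np.mean(y_pred[y_true == 0] == 1)  # false alarm rate on the negative class
    return prior_pos * cost_fn * fnr + prior_neg * cost_fp * fpr

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.2, 0.8, 0.6, 0.3, 0.7, 0.9])
print(net_benefit(y_true, y_score, threshold=0.5))
print(expected_cost(y_true, (y_score >= 0.5).astype(int), cost_fn=5.0))  # missing disease costs more
```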

These synergies are showcased by the substantial overlap between the different paths that, depending on the problem category, are taken through the mapping during metric selection. All paths comprise several subprocesses (indicated by the plus sign), each of which holds a subsidiary decision tree representing one specific step of the selection process. Traversal of a subprocess typically leads to the addition of a metric to the problem-specific metric pool. In multi-class prediction problems, dedicated metric pools for each class may need to be generated as relevant properties may differ from class to class. A three-dimensional (3D) SemS problem, for example, could require the simultaneous segmentation of both tubular and non-tubular structures (for example, liver vessels and tissue). These require different metrics for validation. Although this is a corner case, our framework addresses this issue in principle. In ambiguous cases, that is, when the user can choose between two options in one step of the decision tree, a corresponding decision guide details the trade-offs that need to be considered (Supplementary Note 2.7). For example, the intersection over union (IoU) and the DSC are mathematically closely related. The concrete choice typically boils down to a simple user or community preference.
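For reference, the relationship between the two overlap measures can be written explicitly; for a predicted mask A and a reference mask B,

$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}, \qquad \mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} = \frac{\mathrm{DSC}}{2 - \mathrm{DSC}},$$

so the two metrics are monotonically related and rank segmentations of a single structure identically; the choice between them changes the numerical scale, not the ordering of results.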

Figure 2 along with the corresponding subprocesses S1–S9 (Extended Data Figs. 1–9) captures the core contribution of this paper, namely the consensus recommendation of the Metrics Reloaded consortium according to the final Delphi process. For all ten components, the required Delphi consensus threshold (>75% agreement) was met. In all cases of disagreement, which ranged from 0% to 7% for Fig. 2 and S1–S9, each remaining point of criticism was respectively only raised by a single person. The following paragraphs present a summary of the four different colored paths through step 2 – metric selection of the recommendation tree (Fig. 2) for the task of selecting reference-based metrics from the Metrics Reloaded pool of common metrics. More comprehensive textual descriptions can be found in Supplementary Note 2.

Image-level classification

ImLC is conceptually the most straightforward problem category, as the task is simply to assign one of multiple possible labels to an entire image (Supplementary Note 2.2). The validation metrics are designed to measure two key properties: discrimination and calibration.

Discrimination refers to the ability of a classifier to discriminate between two or more classes. This can be achieved by counting metrics that operate on the cardinalities of a fixed confusion matrix (that is, the true/false positives/negatives in the binary classification case). Prominent examples are sensitivity, specificity or F1 score for binary settings and Matthews correlation coefficient (MCC) for multi-class settings. Converting predicted class scores to a fixed confusion matrix (in the binary case by setting a potentially arbitrary cutoff) can, however, be regarded as problematic in the context of performance assessment27. Multi-threshold metrics, such as area under the receiver operating characteristic curve (AUROC), are therefore based on varying the cutoff, which enables the explicit analysis of the trade-off between competing properties such as sensitivity and specificity.
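As a minimal, hedged sketch (assuming scikit-learn, which the framework itself does not prescribe), the following shows how counting metrics computed from a fixed confusion matrix at one cutoff differ from a multi-threshold metric computed from the raw class scores; the data are toy values.

```python
# Minimal sketch (assuming scikit-learn, which the framework does not prescribe):
# counting metrics computed from a fixed confusion matrix at one cutoff versus a
# multi-threshold metric computed from the raw predicted class scores. Toy data.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.2, 0.6, 0.7, 0.9, 0.1, 0.4, 0.3, 0.8])  # predicted class scores
y_pred = (y_score >= 0.5).astype(int)                          # fixed (potentially arbitrary) cutoff

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
print(f1_score(y_true, y_pred))           # counting metric at the chosen cutoff
print(matthews_corrcoef(y_true, y_pred))  # also applicable in multi-class settings
print(roc_auc_score(y_true, y_score))     # multi-threshold metric: varies the cutoff
```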

While most research in biomedical image analysis focuses on the discrimination capabilities of classifiers, a complementary important property is the calibration of a model. An uncertainty-aware model should yield predicted class scores that represent the true likelihood of events28, as detailed in Supplementary Note 2.6. Overoptimistic or underoptimistic classifiers can be especially problematic in prediction tasks where a clinical decision may be made based on the risk of the patient developing a certain condition. Metrics Reloaded hence provides recommendations for validating the algorithm performance both in terms of discrimination and calibration. We recommend the following process for classification problems (Fig. 2 and Supplementary Note 2.2):

1. Select multi-class metric (if any): Multi-class metrics have the unique advantage of capturing the performance of an algorithm for all classes in a single value. With the ability to take into account all entries of the multi-class confusion matrix, they provide a holistic measure of performance without the need for customized class-aggregation schemes. We recommend using a multi-class metric if a decision rule applied to the predicted class scores is available (FP2.6). In certain use cases, especially in the presence of ordinal data, there is an unequal severity of class confusions (FP2.5.2), meaning that different costs should be applied to different misclassifications reflected by the confusion matrix. In such cases, we generally recommend EC as a metric. Otherwise, depending on the specific scenario, accuracy, balanced accuracy (BA) and MCC may be viable alternatives. The concrete choice of metric depends primarily on the prevalences (frequencies) of classes in the provided validation set and the target population (FP4.1/2), as detailed in subprocess S2 (Extended Data Fig. 2) and the corresponding textual description in Supplementary Note 2.2.

    As class-specific analyses are not possible with multi-class metrics, which can potentially hide poor performance on individual classes, we recommend an additional validation with per-class counting metrics (optional) and multi-threshold metrics (always recommended).

2. Select per-class counting metric (if any): If a decision rule applied to the predicted class scores is available (FP2.6), a per-class counting metric, such as the Fβ score, should be selected. Each class of interest is separately assessed, preferably in a ‘one-versus-rest’ fashion. The choice depends primarily on the decision rule (FP2.6) and the distribution of classes (FP4.2). Details can be found in subprocess S3 for selecting per-class counting metrics (Extended Data Fig. 3).

3. Select multi-threshold metric (if any): Counting metrics reduce the potentially complex output of a classifier (the continuous class scores) to a single value (the predicted class), such that they can work with a fixed confusion matrix. To compensate for this loss of information and obtain a more comprehensive picture of a classifier’s discriminatory performance, multi-threshold metrics work with a dynamic confusion matrix reflecting a range of possible thresholds applied to the predicted class scores. While we recommend the popular, well-interpretable and prevalence-independent AUROC as the default multi-threshold metric for classification, average precision can be more suitable in the case of high class imbalance because it incorporates predictive values, as detailed in subprocess S4 for selecting multi-threshold metrics (Extended Data Fig. 4).

4. Select calibration metric (if any): If calibration assessment is requested (FP2.7), one or multiple calibration metrics should be added to the metric pool, as detailed in subprocess S5 for selecting calibration metrics (Extended Data Fig. 5); a minimal sketch of two generic calibration quantities follows this list.
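A minimal sketch of what such calibration assessment can look like is shown below. The Brier score (a proper scoring rule) and a simple binned calibration error are generic examples chosen for illustration, not the specific metrics prescribed by subprocess S5; scikit-learn is assumed.

```python
# Minimal sketch of calibration assessment for a binary classifier. The Brier
# score (a proper scoring rule) and a simple binned calibration error are shown
# as generic examples; the metrics actually recommended by subprocess S5 are
# listed in Extended Data Fig. 5 and Supplementary Note 2.6. Toy data.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.7, 0.3, 0.6, 0.2, 0.4, 0.8, 0.5, 0.3])

print(brier_score_loss(y_true, y_prob))  # mean squared difference between score and outcome

# Binned calibration error: mean |observed frequency - mean predicted score| over bins.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(np.mean(np.abs(prob_true - prob_pred)))
```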

Semantic segmentation

In SemS, classification occurs at pixel level. However, it is not advisable to simply apply the standard classification metrics to the entire collection of pixels in a dataset for two reasons. Firstly, pixels of the same image are highly correlated. Hence, to respect the hierarchical data structure, metric values should first be computed per image and then be aggregated over the set of images. Note in this context that the commonly used DSC is mathematically identical to the popular F1 score applied at pixel level. Secondly, in segmentation problems, the user typically has an inherent interest in structure boundaries, centers or volumes of structures (FP2.1, FP2.2 and FP2.3). The family of boundary-based metrics (subset of distance-based metrics) therefore requires the extraction of structure boundaries from the binary segmentation masks as a foundation for segmentation assessment. Based on these considerations and given all the complementary strengths and weaknesses of common segmentation metrics27, we recommend the following process for segmentation problems (Fig. 2 and Supplementary Note 2.3):

1. Select overlap-based metric (if any): In segmentation problems, counting metrics such as the DSC or IoU measure the overlap between the reference annotation and the algorithm prediction. As they can be considered the de facto standard for assessing segmentation quality and are well interpretable, we recommend using them by default unless the target structures are consistently small relative to the grid size (FP3.1) and the reference may be noisy (FP4.3.1). Depending on the specific properties of the problem, we recommend the DSC or IoU (default recommendation), the Fβ score (preferred when there is a preference for either false positive (FP) or false negative (FN)) or the centerline Dice similarity coefficient (clDice; for tubular structures). Details can be found in subprocess S6 for selecting overlap-based metrics (Extended Data Fig. 6).

2. Select boundary-based metric (if any): Key weaknesses of overlap-based metrics include shape unawareness and limitations when dealing with small structures or high size variability27. Our general recommendation is therefore to complement an overlap-based metric with a boundary-based metric (a minimal sketch of both metric families follows this list). If annotation imprecisions should be compensated for (FP2.5.7), our default recommendation is the normalized surface distance (NSD). Otherwise, the fundamental user preference guiding metric selection is whether errors should be penalized by existence or distance (FP2.5.6), as detailed in subprocess S7 for selecting boundary-based metrics (Extended Data Fig. 7).
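The sketch below illustrates the two metric families for binary masks: DSC stands in for the overlap-based family, and the average symmetric surface distance stands in for the boundary-based family (the recommended NSD additionally applies a tolerance to the boundary distances, which is omitted here for brevity).

```python
# Minimal sketch of the two metric families for binary masks: DSC (overlap based)
# and the average symmetric surface distance (boundary based). The recommended NSD
# additionally applies a tolerance to the boundary distances, which is omitted here.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred, ref):
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    return 2.0 * intersection / (pred.sum() + ref.sum())  # undefined if both masks are empty

def boundary(mask):
    return np.logical_xor(mask, binary_erosion(mask))     # pixels removed by one erosion step

def avg_symmetric_surface_distance(pred, ref):
    bp, br = boundary(pred.astype(bool)), boundary(ref.astype(bool))
    dist_to_ref = distance_transform_edt(~br)[bp]   # distance of each predicted boundary pixel to the reference boundary
    dist_to_pred = distance_transform_edt(~bp)[br]  # and vice versa
    return np.concatenate([dist_to_ref, dist_to_pred]).mean()

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
ref = np.zeros((64, 64), dtype=bool); ref[12:42, 12:42] = True
print(dsc(pred, ref), avg_symmetric_surface_distance(pred, ref))
```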

Object detection

ObD problems differ from segmentation problems in several key features with respect to metric selection. Firstly, they involve distinguishing different instances of the same class and thus require the step of locating objects and assigning them to the corresponding reference object. Secondly, the granularity of localization is comparatively rough, which is why no boundary-based metrics are required (otherwise the problem would be phrased as an InS problem). Finally, and crucially important from a mathematical perspective, the absence of true negatives (TNs) in ObD problems renders many popular classification metrics (for example, accuracy, specificity and AUROC) invalid. In binary problems, for example, suitable counting metrics can only be based on three of the four entries of the confusion matrix. Based on these considerations and taking into account all the complementary strengths and weaknesses of existing metrics27, we propose the following steps for ObD problems (Fig. 2 and Supplementary Note 2.4):

1. Select localization criterion: An essential part of the validation is to decide whether a prediction matches a reference object. To this end, (1) the location of both the reference objects and the predicted objects must be adequately represented (for example, by masks, bounding boxes or center points), and (2) a metric for deciding on a match (for example, mask IoU) must be chosen. As detailed in subprocess S8 for selecting the localization criterion (Extended Data Fig. 8), our recommendation considers both the granularity of the provided reference (FP4.4) and the required granularity of the localization (FP2.4).

2. Select assignment strategy: As the localization does not necessarily lead to unambiguous matchings, an assignment strategy needs to be chosen to potentially resolve ambiguities that occurred during localization. As detailed in subprocess S9 for selecting the assignment strategy (Extended Data Fig. 9), the recommended strategy depends on the availability of continuous class scores (FP5.1) as well as on whether double assignments should be punished (FP2.5.8).

3. Select classification metric(s) (if any): Once objects have been located and assigned to reference objects, generation of a confusion matrix (without TN) is possible. The final step therefore simply comprises choosing suitable classification metrics for validation (a minimal sketch of the matching and counting mechanics follows at the end of this subsection). Several subfields of biomedical image analysis have converged to choosing solely a counting metric, such as the Fβ score, as the primary metric in ObD problems. We follow this recommendation when no continuous class scores are available for the detected objects (FP5.1). Otherwise, we disagree with the practice of basing performance assessment solely on a single, potentially suboptimal cutoff on the continuous class scores. Instead, we follow the recommendations for ImLC and propose complementing a counting metric (subprocess S3; Extended Data Fig. 3) with a multi-threshold metric (subprocess S4; Extended Data Fig. 4) to obtain a more holistic picture of performance. As multi-threshold metric, we recommend average precision or free-response receiver operating characteristic (FROC) score, depending on whether an easy interpretation (FROC score) or a standardized metric (average precision) is preferred. The choice of per-class counting metric depends primarily on the decision rule (FP2.6).

Note that the previous description implicitly assumed single-class problems, but generalization to multi-class problems is straightforward by applying the validation for each class. It is further worth mentioning that metric application is not trivial in ObD problems as the number of objects in an image may be extremely small, or even zero, compared to the number of pixels in an image. Special considerations with respect to aggregation must therefore be made (Supplementary Note 2.4).
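The mechanics of these steps can be illustrated with a short sketch: predicted boxes are matched greedily by descending class score using an IoU localization criterion, and a counting metric is then computed from the resulting TP, FP and FN (no TN exists). The box format, IoU threshold and toy data are illustrative; the actual criterion and assignment strategy follow from S8 and S9.

```python
# Minimal sketch of the ObD validation mechanics: an IoU localization criterion,
# a greedy assignment by descending class score and a counting metric computed
# from TP, FP and FN (no TN exists). Box format, IoU threshold and toy data are
# illustrative; the recommended criterion and strategy follow from S8 and S9.
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def match_and_count(pred_boxes, pred_scores, ref_boxes, iou_threshold=0.5):
    """Greedy one-to-one assignment by descending score; returns TP, FP, FN and F1."""
    unmatched = list(range(len(ref_boxes)))
    tp = fp = 0
    for i in np.argsort(pred_scores)[::-1]:
        ious = [box_iou(pred_boxes[i], ref_boxes[j]) for j in unmatched]
        if ious and max(ious) >= iou_threshold:
            unmatched.pop(int(np.argmax(ious)))  # each reference object can be matched only once
            tp += 1
        else:
            fp += 1
    fn = len(unmatched)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    return tp, fp, fn, f1

pred_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
ref_boxes = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(match_and_count(pred_boxes, pred_scores=[0.9, 0.4], ref_boxes=ref_boxes))
```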

Instance segmentation

InS combines the tasks of ObD and SemS. Thus, the pitfalls and recommendations for InS problems are closely related to those for segmentation and ObD27. This is directly reflected in our metric selection process (Fig. 2 and Supplementary Note 2.5):

1. Select ObD metric(s): To overcome problems related to instance unawareness (Fig. 1a), we recommend selection of a set of detection metrics to explicitly measure detection performance. To this end, we recommend almost the same process as for ObD, with two exceptions. Firstly, given the fine granularity of both the output and the reference annotation, our recommendation for the localization strategy differs, as detailed in subprocess S8 (Extended Data Fig. 8). Secondly, as depicted in S3 (Extended Data Fig. 3), we recommend panoptic quality29 as an alternative to the Fβ score (a minimal sketch of panoptic quality follows this list). This metric is especially suited for InS, as it combines the assessment of overall detection performance and segmentation quality of successfully matched (true positive (TP)) instances in a single score.

2. Select segmentation metric(s) (if any): In a second step, metrics to explicitly assess the segmentation quality for the TP instances may be selected. Here, we follow the exact same process as in SemS (subprocesses S6 and S7; Extended Data Figs. 6 and 7). The primary difference is that the segmentation metrics are applied for each instance.
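For illustration, panoptic quality can be computed from an already established matching as the product of segmentation quality (mean IoU of TP instances) and recognition quality; the minimal sketch below assumes the matching and per-instance IoU values are given.

```python
# Minimal sketch of panoptic quality (PQ) given an already established matching
# between predicted and reference instances. 'matched_ious' holds the IoU of each
# TP pair; fp and fn count unmatched predicted and reference instances.
def panoptic_quality(matched_ious, fp, fn):
    tp = len(matched_ious)
    if tp + fp + fn == 0:
        return 1.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality: mean IoU of TP instances
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)         # recognition (detection) quality
    return sq * rq                               # PQ = SQ x RQ

print(panoptic_quality([0.80, 0.90, 0.75], fp=1, fn=2))
```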

Importantly, the development process of the Metrics Reloaded framework was designed such that the pitfalls identified in the sister publication of this work19 are comprehensively addressed. Table 1 makes the recommendations and design decisions corresponding to specific pitfalls explicit.

Table 1 Metrics Reloaded addresses common and rare pitfalls in metric selection, as compiled in ref. 19

Once common reference-based metrics have been selected and, where necessary, complemented by application-specific metrics, the user proceeds with the application of the metrics to the given problem.

Step 3 – Metric application. Although the application of a metric to a given dataset may appear straightforward, numerous pitfalls can occur27. Our recommendations for addressing them are provided in Extended Data Table 1. Following the taxonomy provided in the sister publication of this work19, they are categorized into recommendations related to metric implementation, aggregation, ranking, interpretation and reporting. While several aspects are covered in related work (for example, ref. 30), an important contribution of the present work is the metric-specific summary of recommendations captured in the metric cheat sheets (Supplementary Note 3.1). A further major contribution is our implementation of all Metrics Reloaded metrics in the open-source framework Medical Open Network for Artificial Intelligence (MONAI), available at https://github.com/Project-MONAI/MetricsReloaded/ (Supplementary Methods).
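As a small illustration of the aggregation recommendations (assuming pandas; column names are illustrative), per-image metric values from a hierarchical dataset should first be aggregated within each patient and only then across patients, so that patients contributing many images do not dominate the summary statistic.

```python
# Minimal sketch (assuming pandas; column names are illustrative) of hierarchy-aware
# aggregation: per-image metric values are first averaged within each patient and
# only then across patients, so that patients with many images do not dominate.
import pandas as pd

per_image = pd.DataFrame({
    "patient": ["A", "A", "A", "B", "C", "C"],
    "dsc":     [0.90, 0.88, 0.92, 0.60, 0.80, 0.82],
})

pooled_mean = per_image["dsc"].mean()                      # ignores the hierarchical structure
per_patient = per_image.groupby("patient")["dsc"].mean()   # aggregate per patient first
hierarchical_mean = per_patient.mean()                     # then across patients
print(pooled_mean, hierarchical_mean)
```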

Metrics Reloaded is broadly applicable in biomedical image analysis

To validate the Metrics Reloaded framework, we used it to generate recommendations for common use cases in biomedical image processing (Supplementary Note 4). The traversal through the decision tree of our framework is detailed for eight selected use cases corresponding to the four different problem categories (Fig. 5):

ImLC (Figs. SN 5.5–5.8): frame-based sperm motility classification from time-lapse microscopy video of human spermatozoa (ImLC-1) and disease classification in dermoscopic images (ImLC-2).

SemS (Figs. SN 5.9 and 5.10): embryo segmentation in microscopy images (SemS-1) and liver segmentation in computed tomography (CT) images (SemS-2).

ObD (Figs. SN 5.6, 5.7, 5.11 and 5.12): cell detection and tracking during the autophagy process in time-lapse microscopy (ObD-1) and multiple sclerosis (MS) lesion detection in multimodal brain magnetic resonance imaging (MRI) images (ObD-2).

InS (Figs. SN 5.6, 5.7 and 5.9–5.12): InS of neurons from the fruit fly in 3D multicolor light microscopy images (InS-1) and surgical instrument InS in colonoscopy videos (InS-2).

The resulting metric recommendations (Fig. 5) demonstrate that a common framework across domains is sensible. In the showcased examples, shared properties of problems from different domains result in almost identical recommendations. In the SemS use cases, for example, the specific image modality is irrelevant for metric selection. What matters is that a single object with a large size relative to the grid size should be segmented—properties that are captured by the proposed fingerprint. In Supplementary Note 4, we present recommendations for several other biomedical use cases.

The Metrics Reloaded online tool allows user-friendly metric selection

Selecting appropriate validation metrics while considering all potential pitfalls that may occur is a highly complex process, as demonstrated by the large number of figures in this paper. Some of the complexity, however, also results from the fact that the figures need to capture all possibilities at once. For example, many of the figures could be simplified substantially for problems based on only two classes. To leverage this potential and to improve the general user experience with our framework, we developed the Metrics Reloaded online tool (Supplementary Methods), which captures our framework in a user-centric manner and can serve as a trustworthy common access point for image analysis validation.

Discussion

Conventional scientific practice often grows through historical accretion, leading to standards that are not always well justified. This holds particularly true for the validation standards in biomedical image analysis.

The present work represents a comprehensive investigation and, importantly, constructive set of recommendations challenging the state of the art in biomedical image analysis algorithm validation with a specific focus on metrics. With the intention of revisiting—literally ‘re-searching’—common validation practices and developing better standards, we brought together experts from traditionally disjunct fields to leverage distributed knowledge. Our international consortium of more than 70 experts from the fields of biomedical image analysis, ML, statistics, epidemiology, biology and medicine, representing a large number of relevant biomedical imaging initiatives and societies, developed the Metrics Reloaded framework that offers guidelines and tools to choose performance metrics in a problem-aware manner. The expert consortium was compiled primarily to cover the required expertise from various fields, but it also comprised researchers from different countries, (academic) ages, roles and backgrounds (Supplementary Methods). Importantly, Metrics Reloaded comprehensively addresses all pitfalls related to metric selection (Table 1) and application (Extended Data Table 1) that were identified in this work’s sister publication19.

Metrics Reloaded is the result of a 2.5-year-long process involving numerous workshops, surveys and expert group meetings. Many controversial debates were conducted during this time. Even deciding on the exact scope of the paper was anything but trivial. Our consortium eventually agreed on focusing on biomedical classification problems with categorical reference data and thus exploiting synergies across classification scales. Generating and handling fuzzy reference data (for example, from multiple observers) is a topic of its own31,32 and was decided to be out of scope for this work. Furthermore, the inclusion of calibration metrics in addition to discrimination metrics was originally not intended because calibration is a complex topic, and the corresponding field is relatively young and currently highly dynamic. This decision was reversed due to high demand from the community, expressed through crowdsourced feedback on the framework.

Extensive discussions also revolved around the inclusion criteria for metrics, considering the trade-off between established (potentially flawed) and new (not yet stress-tested) metrics. Our strategy for arriving at the Metrics Reloaded recommendations balanced this trade-off by using common metrics as a starting point and making adaptations where needed. For example, weighted Cohen’s kappa, originally designed for assessing inter-rater agreement, is the state-of-the-art metric used in the medical imaging community when handling ordinal data. Unlike other common multi-class metrics, such as (balanced) accuracy or MCC, it allows the user to specify different costs for different class confusions, thereby addressing the ordinal rating. However, our consortium deemed the (not widely known) metric EC generally more appropriate due to its favorable mathematical properties. Importantly, our framework does not intend to impose recommendations or act as a ‘black box’; instead, it enables users to make educated decisions while considering ambiguities and trade-offs that may occur. This is reflected by our use of decision guides (Supplementary Note 2.7), which actively involve users in the decision-making process (for the example above, for instance, see DG2.1).

A further important challenge that our consortium faced was how to best provide recommendations in case multiple questions are asked for a single given dataset. For example, a clinician’s ultimate interest may lie in assessing whether tumor progression has occurred in a patient. While this would be phrased as an ImLC task (given two images as input), an interesting surrogate task could be a segmentation task that assesses the quality of tumor delineation and provides explainability for the results. Metrics Reloaded addresses the general challenge of multiple different driving biomedical questions corresponding to one dataset pragmatically by generating a recommendation separately for each question. The same holds true for multi-label problems, for example, when multiple different types of abnormalities potentially co-occur in the same image/patient.

Another key challenge we faced was the validation of our framework due to the lack of ground truth ‘best metrics’ to be applied for a given use case. Our solution builds upon three pillars. Firstly, we adopted established consensus-building approaches utilized for developing widely used guidelines such as CONSORT21, TRIPOD22 or STARD23. Secondly, we challenged our initial recommendation framework by acquiring feedback via a social media campaign. Finally, we instantiated the final framework to a range of different biological and medical use cases. Our approach showcases the benefit of crowdsourcing as a means of expanding the horizon beyond the knowledge peculiar to specific scientific communities. The most prominent change effected in response to the social media feedback was the inclusion of the aforementioned EC, a powerful metric from the speech recognition community. Furthermore, upon popular demand, we added recommendations on assessing the calibration of model outputs, now captured by subprocess S5 (Extended Data Fig. 5).

After many highly controversial debates, the consortium ultimately converged on a consensus recommendation, as indicated by the high agreement in the final Delphi process (median agreement with the subprocesses: 93%). While some subprocesses (S1, S7 and S8) were unanimously agreed on without a single negative vote, several issues were raised by individual researchers. While most of them were minor (for example, concerning wording), a major debate revolved around calibration metrics. Some members, for example, questioned the value of stand-alone calibration metrics altogether. The reason for this view is the critical misconception that the predicted class scores of a well-calibrated model express the true posterior probability of an input belonging to a certain class33—for example, a patient’s risk for a certain condition based on an image. As this is not the case, several researchers argued for basing calibration assessment solely on proper scoring rules (such as the Brier score), which assess the quality of the posteriors better than the stand-alone calibration metrics. We have addressed all these considerations in our recommendation framework, including a detailed rationale for our recommendations (Supplementary Note 2.6).

While we believe our framework covers the vast majority of biomedical image analysis use cases, suggesting a comprehensive set of metrics for every possible biomedical problem may be out of its scope. The focus of our framework lies in correcting poor practices related to the selection of common metrics. However, in some use cases, common reference-based metrics may, as a matter of principle, be unsuitable. In fact, the use of application-specific metrics may be required in some cases. A prominent example is InS problems in which the matching of reference and predicted instances is infeasible, causing overlap-based localization criteria to fail. Metrics such as the Rand index34 and variation of information35 address this issue by avoiding one-to-one correspondence between predicted and reference instances. To make our framework applicable to such specific use cases, we integrated the step of choosing application-specific metrics in the main workflow (Fig. 2). Examples of such application-specific metrics can be found in related work36,37.

Metrics Reloaded primarily provides guidance for the selection of metrics that measure some notion of the ‘correctness’ of an algorithm’s predictions on a set of test cases. It should be noted that holistic algorithm performance assessment also includes other aspects. One of them is robustness. For example, the accuracy of an algorithm for detecting disease in medical scans should ideally be the same across different hospitals that may use different acquisition protocols or scanners from different manufacturers. Recent work, however, shows that even the exact same models with nearly identical test set performance in terms of predictive accuracy may behave very differently on data from different distributions38.

Reliability is another important algorithmic property to be taken into account during validation. A reliable algorithm should have the ability to communicate its confidence and raise a flag when the uncertainty is high and the prediction should be discarded39. For calibrated models, this can be achieved via the predicted class scores, although other methods based on dedicated model outputs trained to express the confidence or on density estimation techniques are similarly popular. Importantly, an algorithm with reliable uncertainty estimates or increased robustness to distribution shift might not always be the best performing in terms of predictive performance40. For safe use of classification systems in practice, careful balancing of robustness and reliability against accuracy might be necessary.

So far, Metrics Reloaded focuses on common reference-based methods that compare model outputs to corresponding reference annotations. We made this design choice due to our hypothesis that reference-based metrics can be chosen in a modality-agnostic and application-agnostic manner using the concept of problem fingerprinting. As indicated by the step of choosing potential non-reference-based metrics (Fig. 2), however, it should be noted that validation and evaluation of algorithms should go far beyond purely technical performance41,42. In this context, Jannin introduced the global concept of ‘responsible research’ to encompass all possible high-level assessment aspects of a digital technology43, including environmental, ethical, economic, social and societal aspects. For example, there are increasing efforts specifically devoted to the estimation of energy consumption and greenhouse gas emission of ML algorithms44,45,46. For these considerations, we refer the reader to available tools such as the Green Algorithms calculator47 or Carbontracker48.

It must further be noted that while Metrics Reloaded places a focus on the selection of metrics, adequate application is also important. Detailed failure case analysis49 and performance assessment on relevant subgroups, for example, have been highlighted as critical components for better understanding when and where an algorithm may fail50,51. Given that learning-based algorithms rely on the availability of historical datasets for training, there is a real risk that any existing biases in the data may be picked up and replicated or even exacerbated when an algorithm makes predictions52,53. This is of particular concern in the context of systemic biases in healthcare, such as the scarcity of representative data from underserved populations and often higher error rates in diagnostic labels in particular subgroups54,55. Relevant meta information such as patient demographics, including biological sex and ethnicity, needs to be accessible for the test sets such that potentially disparate performance across subgroups can be detected56. Here, it is important to make use of adequate aggregations over the validation metrics as disparities in minority groups might otherwise be missed.
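A minimal sketch of such a subgroup analysis (assuming pandas and scikit-learn; the demographic attribute and metric are illustrative) computes the metric separately per subgroup rather than only reporting a pooled value.

```python
# Minimal sketch (assuming pandas and scikit-learn; the demographic attribute and
# the metric are illustrative) of a subgroup analysis: the metric is computed per
# subgroup rather than only as a single pooled value.
import pandas as pd
from sklearn.metrics import recall_score

results = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "F", "M"],
    "y_true": [1, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0],
})

pooled_sensitivity = recall_score(results["y_true"], results["y_pred"])
per_group = results.groupby("sex").apply(
    lambda g: recall_score(g["y_true"], g["y_pred"])  # sensitivity per subgroup
)
print(pooled_sensitivity)
print(per_group)
```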

Finally, it must be noted that our framework addresses metric choice in the context of technical validation of biomedical algorithms. For translation of an algorithm into, for example, clinical routine, this validation may be followed by a (clinical) validation step assessing its performance compared to conventional, non-algorithm-based care according to patient-related outcome measures, such as overall survival57.

A key remaining challenge for Metrics Reloaded is its dissemination such that it will substantially contribute to raising the quality of biomedical imaging research. To encourage widespread adherence to new standards, entry barriers should be as low as possible. While the framework with its vast number of subprocesses may seem very complex at first, it is important to note that from a user perspective only a fraction of the framework is relevant for a given task, making the framework more tangible. This is notably illustrated by the Metrics Reloaded online tool, which substantially simplifies the metric selection procedure. As is common in scientific guideline and recommendation development, we intend to regularly update our framework to reflect current developments in the field, such as the inclusion of new metrics or biomedical use cases. This is intended to include an expansion of the framework’s scope to further problem categories, such as regression and reconstruction. To accommodate future developments in a fast and efficient manner, we envision our consortium building consensus through accelerated Delphi rounds organized by the Metrics Reloaded core team. Once consensus is obtained, changes will be implemented in both the framework and online tool and highlighted so that users can easily identify changes to the previous version, which will ensure full transparency and comparability of results. In this way, we envision the Metrics Reloaded framework and online tool as a dynamic resource reliably reflecting the current state of the art at any given time point in the future, for years to come18.

Of note, while the provided recommendations originate from the biomedical image analysis community, many aspects generalize to imaging research as a whole. Particularly, the recommendations derived for individual fingerprints (for example, implications of class imbalance) hold across domains, although it is possible that for different domains the existing fingerprints would need to be complemented by further features that this community is not aware of.

In conclusion, the Metrics Reloaded framework provides biomedical image analysis researchers with systematic guidance on choosing validation metrics across different imaging tasks in a problem-aware manner. Through its reliance on methodology that can be generalized, we envision the Metrics Reloaded framework to spark a scientific debate and hopefully lead to similar efforts being undertaken in other areas of imaging research, thereby raising research quality on a much larger scale than originally anticipated. In this context, our framework and the process by which it was developed could serve as a blueprint for broader efforts aimed at providing reliable recommendations and enforcing adherence to good practices in imaging research.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.