## Introduction

Deep learning (DL) has become a popular approach for achieving high performance on medical image analysis tasks. In the pursuit of higher classification performance, important aspects of model behavior, such as test-retest variability, remain overlooked, yet not all DL models are equal with respect to their repeatability. Consistent predictions are of utmost importance for such models to prove their potential as reliable and safe clinical support. However, DL models face substantial repeatability issues1,2. Empirically, minor changes in an image can lead to vastly different predictions by DL models. In clinical practice, this repeatability issue could lead to dangerous medical errors. Figure 1 illustrates this issue. Two cervical cancer screening images from the same precancerous cervix that were taken during the same visit led to completely different predictions. A binary DL model (without dropout layers) trained to distinguish between a normal cervix and one with a precancerous lesion (0: normal, 1: pre-cancer) predicted a normal cervix on one image and classified the second image as precancerous. The two predictions lay at opposite extremes of the output range, i.e., 0.01 and 0.98, suggesting high certainty for both outputs.

Dropout is the process of randomly removing units from a neural network during training to regularize learning and avoid overfitting3,4. At inference, dropout is usually disabled to leverage all the connections of the model. Gal et al.5 proposed enabling dropout at test time as a Bayesian approximation to sample multiple different predictions. From these Monte Carlo (MC) predictions, it is possible to derive uncertainty metrics that are indicative of model performance6, an approach that has already been explored for multiple medical image classification tasks7,8,9. The final prediction is usually generated by averaging over all MC predictions. We will refer to these models utilizing dropout as MC models.

Repeatability describes the variation between independent tests taken under the same conditions. In this work, we focus on repeatability of a single model using different images of the same anatomical region from the same patient taken the same day. For the public knee osteoarthritis dataset, only one image per knee for a given time point was available, hence, a second image was generated using minor data augmentation. To the best of our knowledge, few studies focus on methodologies to increase repeatability. However, some work notes the importance of repeatability for medical image analysis by assessing the test-retest reliability of their classification or segmentation models2,10,11,12,13,14,15. Kim et al.2 evaluated the test-retest variability for disease classification on chest radiographs and obtained limits of agreement (LoA) of ± 30% indicating variability within the test-retest predictions. Various post-processing techniques such as blurring or sharpening, which could naturally occur in real-life settings and alter the appearance of images, caused higher test-retest variability compared to positional changes. Multiple other factors have been shown to impact repeatability such as inter-rater variability in the labels, image quality, noise, or model uncertainty due to lack of knowledge and limited number of images, i.e., epistemic uncertainty2,16. For instance, images leading to high inter-rater variability among experts are likely to generate similar variability, especially at class boundaries17, since the model was trained based on the ratings of these experts. While some of these factors leading to low repeatability cannot be eliminated in practice (e.g., inter-rater variability), reliable DL models should be robust to minor changes in position, lighting, focus, etc.

Calibrated models output probabilities that reflect the probability of the observed outcome (e.g., all the predictions of 0.9 from a perfectly calibrated model should have the positive class as ground truth 90% of the time). Good calibration allows robust rejection of low-probability predictions, as the output probabilities more truthfully represent the likelihood of being wrong. Modern neural networks are poorly calibrated as a consequence of recent advances in architecture and training18. Multiple works have focused on developing methods for post-hoc calibration of models18,19,20, usually using the validation set to adjust the test predictions. However, an inherently better-calibrated output could mitigate the need for prediction re-calibration. The Brier score is a common metric to assess calibration, as it indicates how close the predicted probabilities are to the true likelihood. A Brier score of 0 indicates perfect calibration.
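For a binary problem, the Brier score is simply the mean squared difference between the predicted probabilities and the binary outcomes. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes.

    0 indicates perfect calibration and accuracy; for a binary problem,
    an uninformative constant prediction of 0.5 scores 0.25.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# A perfectly confident, correct model scores 0.
print(brier_score([0, 1, 1], [0.0, 1.0, 1.0]))  # → 0.0
```

For binary labels this matches the usual definition implemented, e.g., by scikit-learn's `brier_score_loss`.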

Repeatability is an important and required characteristic of medical image analysis tools, as it reflects the ability of the model to repeatedly generate a certain classification performance. Among models with the same accuracy, more repeatable ones show smaller variability in accuracy for a single measurement per patient. Hence, repeatable models generate more consistent classification performance, leading to less variability.

While most works describing the development of DL models for medical image classification focus on accuracy and classification performance21,22,23, very few assess the repeatability of these models. This study proposes Monte Carlo (MC) dropout at test time as a method to improve repeatability and systematically assesses this approach on different tasks, model types, and network architectures. All the selected medical tasks have an underlying continuous scale of disease severity but are routinely binned into binary or ordinal classes to simplify treatment decisions and ratings. Although specifically training networks to assess disease severity might be a preferred approach24,25,26, this is rarely done in practice27,28,29. The methodology and analysis were chosen based on the consideration that the underlying variable of interest, i.e., disease severity, of these medical tasks is better represented by a spectrum than by clearly distinct categories. In this work, we evaluate the model repeatability of four types of DL models: binary classification, multi-class classification, ordinal classification, and regression, each with and without MC dropout. We test the repeatability of these models’ predictions on four different medical image classification tasks: knee osteoarthritis grading, cervical cancer screening, breast density estimation, and retinopathy of prematurity (ROP) disease severity grading. True test-retest scenarios are studied with private datasets containing multiple images per patient for a given time point and anatomical region. Few public datasets exist with multiple images from the same anatomical region taken during the same visit.
As we acknowledge the importance of reproducibility in research, a fourth dataset that is publicly available, the Multicenter Osteoarthritis Study, was added to the study and a second image per patient was generated by applying simple data augmentation to the original image, i.e., horizontal flip, to simulate test-retest reliability. Based on our results, we present recommendations for model choices that can lead to improved repeatability. Finally, we assess the calibration of regular models compared to MC models.

## Results

### Repeatability and classification performance

The repeatability of each model was assessed on all available images of the same patient during the same visit. MC dropout models were associated with increased repeatability and accuracy for all models and tasks excluding regression models (Table 1 and Fig. 2). Bland-Altman plots for all the tasks and model types are summarized in Fig. 2. An alternative way to compare the severity score from the test and retest images is presented in Supplementary Information. Ideally, all cases would lie near a horizontal line crossing the y-axis at 0, which means the difference between test-retest scores is low. For every task, the MC models showed better test-retest reliability than their conventional counterparts, with the exception of the regression models. This is illustrated by the narrower 95% LoA and the higher concentration of differences near 0 on the y-axis. Model outputs exhibit larger differences near class boundaries. However, this effect is attenuated for MC models and almost absent for regression models. The range of predicted values remained similar for MC models, indicating that the effect of the MC model is not simply regressing scores towards the mean. Moreover, the increase in repeatability was in most cases associated with an improvement in classification performance (Table 1).

Repeatability and classification metrics for each approach can be found in Table 1. Repeatability of MC models for binary, multi-class, and ordinal models showed statistically significant improvements on at least one metric for all tasks. On average, across all tasks and classification models (i.e., excluding regression), the disagreement rate improved by 7 percentage points and the 95% limits of agreement by 16 percentage points. Classification performance followed the same trend as the repeatability and increased for all classification MC models with the exception of the ROP task, which was exposed to a domain shift (see Discussion). Figure 3 illustrates cases where multi-class models showed poor repeatability while the MC multi-class model was significantly more repeatable. While there are minor differences between test and retest images, we expect the model to be robust to changes in view, lighting, or zoom, as the disease severity or breast density does not change from one image to another. Adding MC iterations to regression models did not lead to consistent improvement in classification or repeatability performance. Regression models generally showed better repeatability compared to the other multi-class models (i.e., n-class and ordinal).

### Impact of number of MC iterations

Additionally, we evaluated the impact of the number of MC iterations used at test time to compute the final prediction on the repeatability, i.e., 95% LoA, of multi-class models for all tasks, as illustrated in Fig. 4. This analysis was limited to the multi-class models as they are the most commonly used for medical classification tasks. All models suggest that training with dropout, even without any MC iterations during testing, yields better test-retest performance than non-dropout models (Fig. 4). Repeatability could be further improved by generating more MC samples. After about 20 MC iterations, additional samples had little to no impact on repeatability.

### Architecture comparison

Figure 5 compares, for the same task (i.e., knee osteoarthritis grading) and model type (i.e., multi-class), the DenseNet and ResNet architectures with respect to repeatability. Regardless of the model’s architecture, the behavior remains the same: the test-retest variability is lower, i.e., repeatability is increased, when using multiple MC samples for the prediction. The disagreement rate decreased by 9 and 15 percentage points and the LoA improved by 11 and 15 percentage points for the DenseNet and ResNet architectures, respectively.

### Calibration

Output probabilities are better calibrated for MC models than for the regular models, as depicted in Fig. 6. Brier scores associated with MC models are lower for all tasks, with an average decrease of 0.031, and the calibration curves are closer to the identity line, i.e., the perfect calibration curve. Calibration curves of multi-class model outputs were displayed for knee osteoarthritis, cervix, and breast density classification, while the binary models were chosen for ROP as the impact of adding MC was greater for this task compared with the multi-class models (see Table 1).

## Discussion

Our results demonstrate that MC dropout models lead to a significant increase in repeatability, i.e., improvement of at least one repeatability metric, while improving most classification metrics for binary, multi-class, and ordinal models. Concretely, this means higher class and score agreements between the test and retest outputs. The repeatability increased regardless of the disease imaged or the model architecture (DenseNet or ResNet). However, MC iterations did not benefit all regression models and even lowered classification performance for knee osteoarthritis and ROP classification. Regression models showed higher repeatability compared with non-MC multi-class and ordinal models, so the potential gain was more modest. For the two datasets where MC dropout did not improve repeatability on the regression model, i.e., cervical and ROP datasets, the highest repeatability was already reached by the regression model. Hence, in these cases, the models might have reached a limit in repeatability where MC dropout is of no extra help. While the lowest test-retest variability was reached for the regression model on the knee and cervical images, the model was associated with a lower quadratic κ and/or accuracy. Both accuracy and repeatability need to be reported to thoroughly assess deep learning models, especially in clinical settings.

The observed differences between test-retest images of the same patient were not constant along the mean axis, as seen on the Bland-Altman plots in Fig. 2. Near the class boundaries, images show more variability with only a few cases with a difference near zero, which creates an arch-like pattern in the plots. MC dropout models stand out by their ability to improve repeatability at the class boundaries, where non-MC models display more oscillation between classes. Non-MC models tend to avoid ambivalent class predictions in favor of choosing one class, creating poor repeatability at class boundaries. When the model is equivocal about the class, MC dropout models have a better ability to output the same prediction score. This phenomenon can be partly explained by the training scheme of classification models. During training, models are optimized to predict classes with high certainty, discouraging the model from outputting ambivalent predictions (e.g., predicting 0.5 for a binary model), which leads to uncalibrated models18. Ideally, the output softmax or sigmoid probability of a model should reflect the uncertainty of the model between two or more classes. However, in practice, this is not the case, leading to large differences at the class boundaries due to the misclassification of at least one of the images. This effect is alleviated with MC models, leading to better-calibrated outputs and higher repeatability.

Fewer repeatability metrics showed a statistical difference between MC dropout and conventional models for the ROP disease severity classification task. Unlike knee osteoarthritis, cervical, and breast density classification, the ROP models were tested on views of the eye that the model has not seen during training (section Retinopathy of Prematurity). This domain shift might be adding variability in the model’s prediction impacting the global performance and repeatability, effectively abating the benefits of MC dropout models. Nonetheless, MC models still showed higher repeatability under domain shift than no-dropout models.

MC models are computationally more expensive than their conventional counterparts as they require multiple forward passes at testing time. Our results in Fig. 4 indicate that after approximately 20 MC iterations, there is no further gain in repeatability for any of the tasks on multi-class models. For settings where time and computational resources are limited, training with dropout layers, even without sampling multiple MC predictions, helps regularize the training and reduces overfitting3.

Due to the high number of model types studied (8) and datasets (4), each model was trained only once. Varying the data splits for the training, validation, and test sets could help gauge the variability of the metrics for each model type. The analysis is limited to ResNet and DenseNet architectures for classification. Other architectures could behave differently with MC dropout. Future studies should focus on more extensive model architectures for classification and segmentation tasks. Finally, all the medical tasks studied in this work are prone to inter-rater variability. However, not all labels from the knee osteoarthritis, cervical, and breast density datasets were derived from multiple experts, which can affect the models’ performance.

We evaluated the repeatability of four model types on four medical tasks using distinct model architectures (ResNet18, ResNet50, DenseNet121). We demonstrated that MC sampling during test time leads to more reliable models providing more stable, repeatable, and calibrated predictions on different images from the same patient with or without a slight domain shift. MC dropout models reduced test-retest variability at the class boundaries where repeatability is the most challenging and crucial. Only regression models did not show a consistent improvement when leveraging MC sampling. Repeatability metrics increased with an increasing number of MC iterations; after around 20 MC iterations, no further improvement of repeatability could be reached. MC sampling is flexible as it is applicable to any model type and architecture while being easily implementable. Future work should assess the impact of MC models on repeatability for other model architectures and other tasks such as segmentation.

## Methods

All images were de-identified prior to data access; ethical approval for this study was therefore not required.

### Knee osteoarthritis

Knee osteoarthritis is the most common musculoskeletal disorder30 and was the eleventh-highest contributor to global disability in 201031. Osteoarthritis can be diagnosed on radiographs; however, early diagnosis can be challenging in clinical practice and is prone to inter-rater variability, justifying the emergence of AI models for osteoarthritis grading30. The severity is typically measured using the Kellgren-Lawrence (KL) scale from 0 to 4, where 0 corresponds to none, 1 to doubtful, 2 to mild, 3 to moderate, and 4 to severe32.

The publicly available longitudinal Multicenter Osteoarthritis Study (MOST) dataset contains 18,926 knee radiographs from 3017 patients of one or both knees when including only grades from 0 to 4 on the Kellgren-Lawrence scale32. Grades outside the Kellgren-Lawrence scale were excluded from the dataset for this work. 40% of the cases were labeled as grade 0, 15% as grade 1, 17% as grade 2, 19% as grade 3, and 9% as grade 4. The patients were split into training, validation, and test sets representing 65%, 10%, and 25% of the images, respectively. The binary models were trained to distinguish between knees with no or doubtful osteoarthritis (negative class) and knees with mild, moderate, or severe osteoarthritis (positive class). Images were center cropped to a size of 224x224 pixels and scaled to intensity values of 0 to 1. MOST does not include multiple images of the same knee during the same visit. Model predictions were therefore generated for all original test images; the images were then flipped horizontally and the models retested on them to emulate a test-retest setting. Hence, the repeatability was measured on the same radiograph from the same patient at a given time point with and without the horizontal flip. Since flipping is applied as data augmentation during training, we expect all models to be robust to this affine transformation (see Section Classification model training for details on training data augmentation).

### Cervical

Cervical cancer is the fourth most common cancer worldwide and the leading cause of cancer-related deaths of women in western, eastern, middle, and southern Africa33. Vaccinations against high-risk strains of the Human Papilloma Virus (HPV) have been proven to prevent up to 90% of cervical cancers34. Until HPV vaccination programs reach every eligible woman worldwide, and in light of the high prevalence of high-risk HPV types, there will be great demand for effective, low-cost screening to prevent the development of invasive cervical cancer. In addition to HPV testing, the visual assessment of the cervix using photographs can help to detect precancerous lesions in low-resource settings35,36,37.

The cervical cancer screening dataset consisted of 3509 cervical photographs from 1760 patients from two studies38,39. For most patients, we had access to two cervical photographs taken during the same session.

Each image was classified using cytological and histological data from the patient as one of the following three categories: Normal (1148 images, 33%), Gray zone, i.e., the presence of precancerous lesions was equivocal, (1159 images, 33%), Precancer/cancer (1202 images, 34%).

The dataset was split into training (65%), validation (10%), and test sets (25%) on a patient level, resulting in datasets containing 2283, 350, and 876 images (training/validation/test), preserving the class distributions described above within each subset. All images were de-identified before this study. All cervical images were cropped using bounding boxes from a RetinaNet trained for cervix detection, resized to 256x256 pixels, and scaled to intensity values of 0 to 1. The cervigram classification models were trained using all photographs for each patient in the training dataset. For the binary classification models, we utilized only images that were classified as either normal or pre-cancer/cancer. For all patients in the test dataset for whom both images were available, repeatability was assessed as the difference in predictions between the two photographs.

### Breast density

Breast cancer is the second most common cause of cancer deaths among women in the USA, with an estimated number of more than 41,000 deaths in 201940. The density of a woman’s breast is determined by the amount of fibroglandular tissue. It can be classified (with increasing density) based on its appearance on x-ray mammography as almost entirely fatty, scattered fibroglandular densities, heterogeneously dense, and extremely dense41. Importantly, the risk of developing breast cancer rises with increasing breast density42. Furthermore, women with extremely dense breast tissue have been shown43 to benefit from additional MRI screening. The development of AI models based on expert labels for breast density assessment could help to mitigate intra- and interobserver variability and the inconsistency of current quantitative measurements with expert raters28.

The Digital Mammographic Imaging Screening Trial (DMIST) dataset consists of a total of 108,230 mammograms from 21,729 patients acquired at 33 institutions, with an average of five mammograms of different standard mammography views per patient44. Breast density labels were generated according to the BI-RADS criteria41 by a total of 92 different radiologists. The dataset consisted of 12,428 (11.5%) fatty, 47,909 (44.2%) scattered, 41,325 (38.2%) heterogeneously dense, and 6568 (6.1%) extremely dense samples and was split into training (70,293), validation (10,849), and test datasets (27,048 images) on a patient level, preserving the label distribution of the full dataset. All images were de-identified before this study. We cropped all images to a size of 224x224 pixels. The breast density classification models were trained using all available views for each patient in the training dataset using either four labels or a simplified binary labeling system with fatty and scattered as one class, and heterogeneously dense and extremely dense as the other class. Repeatability was assessed as the maximum difference between all available views for each patient in the test dataset.

### Retinopathy of prematurity

ROP is the leading cause of preventable childhood blindness worldwide45. It is diagnosed based on the appearance of the retinal vessel tree on retinal photographs and classified into three discrete disease severity classes: normal, pre-plus, and plus disease46. However, the disease spectrum is continuous26, and the use of discrete class labels to train DL classifiers is complicated by inter-rater variability, particularly for cases close to the class boundaries17,47. High inter-rater variability, an insufficient number of ophthalmologists and neonatologists with the expertise and willingness (e.g., due to significant malpractice liability) to manage ROP, and the rising incidence of ROP worldwide motivate the development of AI models for ROP classification and screening48.

The ROP dataset consists of 5511 retinal photographs acquired at eight different study centers48. For each patient, retinal photographs were acquired in 5 different standard fields of view (posterior, nasal, temporal, inferior, superior). Only the posterior, temporal, and nasal views were used in this study. Images were classified as normal, pre-plus disease, or plus disease following previously published methods49. The final label is based on the independent image-based diagnosis by 3 expert graders in combination with the full clinical diagnosis by an expert ophthalmologist. Of the 5511 images in the dataset, 4535 (82.3%) were classified as normal, 804 (14.6%) as pre-plus disease, and 172 (3.1%) as plus disease. The binary models were trained to distinguish between normal and pre-plus/plus disease. The dataset was split on a patient level into training, validation, and test datasets containing 4322/722/467 images while preserving the overall class distribution within each subset. Following the approach of ref. 48, we trained ROP classification models using normalized pre-segmented vessel maps as input (size of 480x640). ROP classification models were trained using only the posterior field of view, as ROP refers to arterial tortuosity and venous dilation within the posterior pole of the retina50. However, it was shown that experts use characteristics beyond the posterior view to assess ROP severity50. Hence, repeatability was tested using the posterior, temporal, and nasal views of all patients in the test dataset.

### Classification model training

For each dataset, we trained binary, multi-class, and ordinal51 classification models, as well as regression models each with and without dropout, resulting in a total of 8 models per dataset. We used the following ImageNet pretrained models for each dataset based on which performed the best for the conventional multi-class classification model: DenseNet121 (cervix), ResNet50 (knee osteoarthritis, breast density), and ResNet18 (ROP). Models were trained using binary cross-entropy, cross-entropy, CORAL51, and mean squared error (MSE) losses for binary, multi-class, ordinal, and regression models, respectively. Affine transformations, i.e., rotation ± 15 degrees and random horizontal flips with 50% probability, were applied as data augmentation during training. The code was implemented using the MONAI framework (version 0.5.2)52 based on the PyTorch library (version 1.9.0)53.

Models with dropout were trained using spatial dropout with a dropout rate of 0.1 for cervical images and DMIST, and 0.2 for knee osteoarthritis and ROP. The dropout rates were determined based on preliminary explorations to optimize the model’s classification performance and values from the literature54,55,56. Channels are independently and randomly zeroed at each dropout layer and forward pass, with probability given by the dropout rate (sampled from a Bernoulli distribution). For the DenseNet121 architecture, the dropout layers were applied after every dense layer, while for the ResNets the dropout layers were applied after each residual block. At test time, the dropout was enabled to generate N = 50 slightly different predictions and the final prediction was obtained by averaging over all the MC samples5. The choice of the number of MC predictions was based on values commonly found in the appropriate literature and experience; however, the optimal number of predictions to reach maximum repeatability was assessed in the results section (see Fig. 4).
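At test time, the procedure reduces to keeping the dropout masks active and averaging N stochastic forward passes. The pure-NumPy sketch below illustrates this mechanism with a toy one-layer "network"; the forward pass, names, and values are illustrative and not the paper's PyTorch implementation:

```python
import numpy as np

def mc_dropout_predict(forward, x, n_iter=50, seed=0):
    """Average n_iter stochastic forward passes (MC dropout at test time).

    `forward(x, rng)` must apply its own Bernoulli dropout masks using `rng`.
    Returns the MC mean (final prediction) and the spread across samples.
    """
    rng = np.random.default_rng(seed)
    preds = [forward(x, rng) for _ in range(n_iter)]
    return float(np.mean(preds)), float(np.std(preds))

# Toy forward pass: a linear layer whose units are dropped with rate p = 0.2.
w = np.array([0.5, 1.0, -0.3, 0.8])

def forward(x, rng, p=0.2):
    keep = rng.random(w.shape) >= p          # Bernoulli keep-mask per unit
    return float((w * keep / (1 - p)) @ x)   # inverted-dropout rescaling

mean_pred, std_pred = mc_dropout_predict(forward, np.ones(4), n_iter=50)
```

With the inverted-dropout rescaling, the MC mean stays close to the deterministic (no-dropout) output of 2.0, while the spread across samples provides the uncertainty signal discussed above. In a PyTorch model, the same effect is obtained by keeping the dropout modules in training mode during inference.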

This section enumerates the training parameters associated with the different datasets. These parameters were obtained by referring to previous work on these datasets29,57,58 or by initial dataset exploration. All models were trained with an Adam optimizer59 and a learning rate scheduler reducing the learning rate by a factor of 0.1 with a patience of 10 epochs. Table 2 enumerates the main training parameters that differed from one dataset to another.

### Evaluation

For direct comparison of a model’s predictions, we summarized each model’s outputs as a continuous severity score. For the binary and regression models, the output of the models was directly used without further modifications. For the multi-class model, we utilized the ordinality of all four classification problems and defined the continuous severity score as a weighted average using softmax probability of each class as described in Equation (1). For knee osteoarthritis (5 classes), the values lie in the range of 0 to 4, for breast density (4 classes) in the range of 0 to 3, and for cervical and ROP classification (3 classes), in the range of 0 to 2.

$$score=\mathop{\sum }\limits_{i=1}^{k}{p}_{i}\times (i-1)$$
(1)

with k being the number of classes and pi the softmax probability of class i. For the ordinal model, the classification problem of k ranks (i.e., classes) is modified into k − 1 binary classification problems60, leading to one output unit less than for the traditional classification model. For instance, for a 3-class problem, the ground truth would be encoded as follows: class 1 → [0, 0]; class 2 → [1, 0]; class 3 → [1, 1]. The continuous prediction score for ordinal models is obtained by summing the output neurons. Similarly to the multi-class models, values range from 0 to 2, 0 to 3, and 0 to 4, for 3-class, 4-class, and 5-class problems, respectively.
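Both continuous severity scores can be computed in a few lines; a minimal sketch (function names are ours):

```python
import numpy as np

def multiclass_severity(softmax_probs):
    """Equation (1): expected class index, sum over i of p_i * (i - 1),
    with 1-based class index i, i.e., a softmax-weighted average."""
    p = np.asarray(softmax_probs, dtype=float)
    return float(np.dot(p, np.arange(len(p))))  # weights 0 .. k-1

def ordinal_severity(sigmoid_outputs):
    """Ordinal score for the k-1 binary-output encoding: sum of the units."""
    return float(np.sum(sigmoid_outputs))

print(multiclass_severity([0.0, 0.0, 1.0]))  # → 2.0 (class 3 of 3)
print(ordinal_severity([1.0, 1.0]))          # → 2.0 (encoding [1, 1])
```

For confident one-hot softmax outputs, both scores coincide with the class index; for equivocal outputs, they interpolate between adjacent classes, which is what enables the repeatability analysis on a continuous scale.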

Repeatability was evaluated using the classification disagreement rate and the 95% LoA from the Bland-Altman plots. Since normality was not reached for the differences for the LoA, non-parametric LoA were calculated using empirical percentiles61. The LoA were presented as a fraction of the possible value range. The classification disagreement rate corresponds to the proportion of patients with different classification outcomes for different images acquired during the same session over the total number of patients. The classification accuracy and quadratic weighted Cohen’s κ were also reported. For the regression models, thresholds to binarize predictions for accuracy and Cohen’s κ calculation were computed by splitting the range of predictions equally (e.g., 3-class problem: s ≤ 0.67 → class 1; 0.67 < s ≤ 1.33 → class 2; s > 1.33 → class 3). Model calibration was assessed using the Brier score.

Statistical difference between models was determined using a two-sided t-test and metric bootstrapping (500 iterations). Models with a p value smaller than 0.05 were considered significantly different. The normality of the distribution was verified using the Shapiro-Wilk test (α = 0.05).
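The bootstrap-plus-t-test comparison can be sketched as below. The accuracy metric, the toy data, and the normal approximation to the two-sided p-value (reasonable at 500 bootstrap replicates, avoiding a SciPy dependency) are our illustrative choices, not the paper's exact implementation:

```python
import math
import numpy as np

def bootstrap_metric(y_true, y_pred, metric, n_boot=500, seed=0):
    """Resample cases with replacement and recompute the metric n_boot times."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    idx = rng.integers(0, n, size=(n_boot, n))  # bootstrap index sets
    return np.array([metric(y_true[i], y_pred[i]) for i in idx])

def two_sided_p(a, b):
    """Welch t statistic on two samples, with a normal approximation
    to the two-sided p-value."""
    t = (a.mean() - b.mean()) / math.sqrt(a.var(ddof=1) / len(a)
                                          + b.var(ddof=1) / len(b))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

acc = lambda yt, yp: float(np.mean(yt == yp))

# Example: a perfect model vs. a constant-class model on balanced toy labels.
y_true = np.array([0, 1] * 50)
model_a = y_true.copy()             # 100% accuracy
model_b = np.zeros(100, dtype=int)  # 50% accuracy
p = two_sided_p(bootstrap_metric(y_true, model_a, acc),
                bootstrap_metric(y_true, model_b, acc, seed=1))
```

Here the two bootstrap accuracy distributions are clearly separated, so the resulting p-value falls far below the 0.05 significance threshold.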

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.