Objective evaluation of deep uncertainty predictions for COVID-19 detection

Deep neural networks (DNNs) have been widely applied for detecting COVID-19 in medical images. Existing studies mainly apply transfer learning and other data representation strategies to generate accurate point estimates. The generalization power of these networks is always questionable due to being developed using small datasets and failing to report their predictive confidence. Quantifying uncertainties associated with DNN predictions is a prerequisite for their trusted deployment in medical settings. Here we apply and evaluate three uncertainty quantification techniques for COVID-19 detection using chest X-Ray (CXR) images. The novel concept of uncertainty confusion matrix is proposed and new performance metrics for the objective evaluation of uncertainty estimates are introduced. Through comprehensive experiments, it is shown that networks pertained on CXR images outperform networks pretrained on natural image datasets such as ImageNet. Qualitatively and quantitatively evaluations also reveal that the predictive uncertainty estimates are statistically higher for erroneous predictions than correct predictions. Accordingly, uncertainty quantification methods are capable of flagging risky predictions with high uncertainty estimates. We also observe that ensemble methods more reliably capture uncertainties during the inference. DNN-based solutions for COVID-19 detection have been mainly proposed without any principled mechanism for risk mitigation. Previous studies have mainly focused on on generating single-valued predictions using pretrained DNNs. In this paper, we comprehensively apply and comparatively evaluate three uncertainty quantification techniques for COVID-19 detection using chest X-Ray images. The novel concept of uncertainty confusion matrix is proposed and new performance metrics for the objective evaluation of uncertainty estimates are introduced for the first time. Using these new uncertainty performance metrics, we quantitatively demonstrate when we could trust DNN predictions for COVID-19 detection from chest X-rays. It is important to note the proposed novel uncertainty evaluation metrics are generic and could be applied for evaluation of probabilistic forecasts in all classification problems.

www.nature.com/scientificreports/ pretrained DNNs 5 . Proposed solutions are evaluated by point prediction-based performance metrics such as accuracy, sensitivity, specificity, and area under receiver operating characteristic (AUC). It is important to note that the transition from normal to COVID-19 is not always clear-cut. Difficult-to-diagnosis cases from radiology images could even lead to disagreement between experienced medical doctors. In fact, studies have shown that radiologists disagree with their colleagues 25% of the time and themselves 20% of the time 9 . As the model decision has a direct impact on patient's treatment, it is of critical to know how confident DNNs are about their predictions. This information could be used for identifying patients which may best benefit from a medical second opinion. DNNs flagging potentially erroneous predictions due to high uncertainty can be used to mimic the common practice of requesting a second opinion from another health practitioner in medial settings 10 .
Such an uncertainty-aware decision-making pipeline could greatly improve the overall diagnosis performance.
In this paper, we comprehensively and quantitatively investigate the competency of DNNs for generating reliable uncertainty estimates for COVID-19 diagnosis. We first check the impact of pretraining using ImageNet and CXR image datasets on the network performance. Then MC-dropout (MCD), ensemble, and ensemble MC-dropout (EMCD) are implanted for quantifying uncertainties associated with point predictions of DNNs. Motivated by 11,12 , we introduce a novel uncertainty-aware confusion matrix from which novel performance metrics (Uncertainty sensitivity, Uncertainty recall, Uncertainty specificity, and Uncertainty precision) can be computed. These performance metrics can be used for comprehensive and quantitative evaluation of uncertainty estimates. The uncertainty estimate evaluation is conducted in a similar manner to that of binary classification evaluation. Through experiments, we try to shed light on whether uncertainty quantification methods proposed in literature can provide high uncertainty for erroneous predictions. This investigation is done both qualitatively (visually) and quantitatively (using new uncertainty evaluation metrics). The proposed metrics can be used for proper evaluation of uncertainty generated by different types of machine learning models. As these have a high similarity to the traditional confusion matrix, users can easily understand and apply them in the process of fair evaluation. At the same time, the proposed metrics can be used for engineering novel uncertainty-aware training algorithms for neural networks. These uncertainty-aware algorithms not only improve the model accuracy, but also take care of its uncertainty estimates. This will result in more trustworthy models that can detect whether their prediction is reliable or not.
The rest of this paper is organised as follows. Section "Related work" briefly reviews papers reporting applications of deep uncertainty quantification for COVID-19 diagnosis. Uncertainty quantification techniques are described in section "Uncertainty quantification techniques". Section "Predictive uncertainty evaluation" introduces metrics for quantitative evaluation of predictive uncertainty estimates. The dataset and experiments are described in sections "Dataset" and "Experiments" respectively. Section "Simulations and results" reports conducted simulations and obtained results. Finally, section "Conclusion" concludes the paper.

Related work
Several methods have been proposed in recent years for enabling DNNs to encompass uncertainty and generate probabilistic predictions. Many of the proposed solutions are based on the Bayesian theory 13 . Several approximate methods have been proposed to address the intractability of the exact Bayesian inference due to its massive computational burden. This include but not limited to variational inference 14,15 , MCD 16 , ensemble 17 . All these methods could be put in the two-step category of uncertainty via classification as they first train the model (classifier here) and then postprocess predictions to generate an uncertainty score 10 . There have been also attempts to generate uncertainty estimates without resorting to sampling methods. Direct uncertainty prediction methods train DNNs to directly generate uncertainty scores or distribution parameters in one single round of scoring 10,18,19 . Despite made progress in this field, reliable generation of uncertainty estimates is still an open question and subject to further investigation.
There is an abundance of papers reporting applications of DNNs generating single-valued predictions for COVID-19 diagnosis from medical images [5][6][7][8] . The input modalities are often chest X-ray and CT scan images which are processed using a wide variety of DNNs including pretrained networks for feature extraction, segmentation, and generative adversarial networks (GANs) 20 . For example, Singh and Singh 21 proposed an automated COVID-19 diagnosis method based on DNNs and Wavelet decomposition. Their model could classify input sample (i.e. patient X-Ray image) as Normal, Viral pneumonia, or COVID-19. Robust diagnostic tools are only part of the solution against COVID-19 pandemic. As pointed out by Pang et al. 22 , it is necessary to have a response plan for this pandemic. To this end, the concept of digital twin city and federated learning has been combined to achieve a collaborative paradigm in which multiple cities share their local status and strategies. Another study focused on the preventive measures to contain COVID-19 threat has been conducted in 23 . The authors used correlation coefficients and multiple linear regression to determine key factors contributing to COVID-19 containment. Lockdown and social distancing have been reported as two key factors to suppress the spread rate of COVID-19.
A comprehensive review of medical imaging applications of DNNs for COVID-19 has been provided in 5 . While proposed solutions differ in terms of utilized networks and the task nature, they all focus on deterministic decisions generated by DNNs. This comprehensive review clearly shows that the literature is quite naive on applying deep uncertainty quantification techniques for processing COVID-19 datasets. Discussion about the reliability and confidence of proposed models has been often overlooked in these studies.
There are a few studies reporting the importance of predictive uncertainty estimates for reliable COVID-19 detection and growth rate monitoring. Alberti and Faranda 24 focused on statistical predictions of COVID-19 by fitting asymptotic distributions to actual data. They studied the epidemic evolution of COVID-19 infections in China and Italy. They reported that prediction about epidemic evolution is accompanied with high uncertainty at early stages of epidemic growth. As the epidemic reaches its peak, the uncertainty decreases. They found that www.nature.com/scientificreports/ long term extrapolation of epidemics counts depend not only on the quality of data, but also on the stage of the epidemics. The MC-Dropweights method 25 was used in 26 to estimate uncertainties associated with COVID-19 predictions. It was shown that there is a strong correlation between model uncertainty and prediction correctness. The paper clearly highlights that availability of estimated uncertainties could potentially alert radiologists on false predictions and will accelerate the acceptance of deep learning-based solution in clinical practice. In another study 27 , a variational inference with MC-Dropweights based Bayesian neural network model was proposed for COVID-19. The model could estimate cost-sensitive calibrated predictive uncertainty in the present of asymmetric cost. In medical domain, the cost of different misclassification is different which makes the cost function asymmetric. Authors in 28 apply four DNNs pretrained on ImageNet dataset to process CXR and CT images. Extracted features are used to develop an ensemble of neural networks for epistemic uncertainty quantification.
Obtained results clearly highlight the need for uncertainty estimate to build trust in DNNS for COVID-19 detection. That is why Alizadehsani et al. 29 proposed a semi-supervised approach for COVID-19 diagnosis using limited amount of labeled data. They utilized GAN to learn (unsupervised training) key features of CT images of COVID-19 patients. The trained GAN was then further trained in supervised manner from limited number of labeled CT images. The output of the discriminator network of GAN was then used for COVID-19 diagnosis (classification) based on CT images. They utilized the probabilistic output of the discriminator as some sort of uncertainty measure to reject classification of samples that the discriminator is uncertain about. RAMIREZ et al. 30 also took a semi-supervised approach to uncertainty estimation. They proposed a COVID-19 diagnosis system based on X-ray images with uncertainty estimates. To make use of unlabeled data, they utilized MixMatch (a semi-supervised deep learning method) which uses a hybrid loss function defined on labeled and unlabeled data. To compute the loss for unlabeled data, they were accompanied with pseudo-labels. Each pseudo-label was the average model output of a perturbed input sample. In 31 , the choice of different cost functions to measure the uncertainty related to a segmentation task has been investigated. By using different architectures (MC dropout UNets, deep UNet ensembles, and conventional UNets without dropout), they tried to segment lung nodules on CT images using two different loss functions namely: weighted categorical cross-entropy (WCC) and soft Dice. In their study, more meaningful uncertainty information was achieved when they trained models by WCC loss function.
Authors in 32 also propose a probabilistic generalization of the non-parametric KNN approach for developing a deep uncertainty-aware classifier. The proposed probabilistic neighbourhood component analysis method maps samples to probability distributions in a latent space and then minimizes a form of nearest-neighbour loss for developing classifiers. It is shown that the proposed method generates less overconfident predictions for out of distribution samples compared to common DNNs and Bayesian neural networks. Despite that, the paper does not provide any quantitative and qualitative evaluation about predictive uncertainty estimates for correctly classified and misclassified samples.
Missing values may also cause uncertain predictions. In 33 , a two-phase approach is proposed for dynamic quality of service (QoS) prediction. In the first phase, if the user has invoked the service in the previous time slice, it will be used to predict QoS value for the user in the next time slice. If the user has not invoked the service in the previous time slice, covering algorithm is used to predict the missing value.

Uncertainty quantification techniques
For a Bayesian network, finding the posterior distribution is intractable for real-world problems (since there are hundreds/thousands of parameters) though there are some alternative ways such as different sampling methods but they are computationally costly. Thus some methods such as Monte Carlo Dropout, Ensemble Bayesian Networks, and Ensemble Monte Carlo Droput were designed to approximate posterior distribution and could be applied to different model architectures with millions of parameters. That is the reason they are used for uncertainty quantification in this paper. These three methods are reviewed below.
MCD. The most difficult part of the Bayesian network is finding the posterior distribution. This is often computationally intractable. One way to overcome this drawback is to use sampling methods. Gal 16 showed that MC samples of the posterior can be obtained by performing several stochastic forward passes at test time (keeping dropout on). The output posterior distribution could be approximated this way with minimum computational burden. The predictive mean ( µ pred ) of the model for a typical test input over MC iterations is estimated as below: where x is the test input. p(y = c|x,ω t ) is the probability that y belongs to c (the output of softmax), and ω t is the set of parameters of the model on the t th forward pass. T is the number of MC iterations (forward passes). The variance of the final distribution is also called predictive uncertainty. As per 16 , the predictive entropy (PE) can be treated as the uncertainty estimate generated by the trained model: where c ranges over both classes. The smaller the PE, the more confident the model about its predictions. It is note that, in the uncertainty literature, for the classification task, the entropy is used as a metric for quantifying that how much a prediction is related to each individual class. In other words, it helps us to quantify how a prediction is far from its true label. The pseudo-code of MCD is given in Algorithm 1. where ω t are the parameters of the model and C ranges over both classes.The pseudo-code of EMCD is given in Algorithm 3.

Predictive uncertainty evaluation
Similar to the idea of confusion matrix, here we define quantitative performance metrics for predictive uncertainty estimates. In contrast to the research done in 26 , the purpose is to do an objective and quantitative evaluation of the predictive uncertainty estimates. Predictions are first compared with ground truth labels and put into two groups: correct and incorrect. Predictive uncertainty estimates are also compared with a threshold and cast into two groups: certain and uncertain. The combination of correctness and confidence groups results in four possible outcomes as shown in Fig. 1: (i) correct and certain indicated true certainty (TC), (ii) incorrect and uncertain indicated by true uncertainty (TU), correct and uncertain indicated by false uncertainty (FU), and (iv) incorrect and certain indicated by false certainty (FC). TC and TU are the diagonal and favourite outcomes. These correspond to TN and TP outcomes in the traditional confusion matrix respectively. FU is a fortunate outcome as an uncertain prediction is correct. FC is the worst outcome as the network has confidently made an incorrect prediction. According to these, we define multiple quantitative performance metrics to objectively quantify predictive uncertainty estimates: • Uncertainty sensitivity (USen): USen is calculated as the number of incorrect and uncertain predictions divided by the total number of incorrect predictions: USen or uncertainty recall (URec) corresponds to sensitivity (recall) or true positive rate of the conventional confusion matrix. USen is of paramount importance as it quantifies the power of the model to communicate its confidence in misclassified samples. • Uncertainty Specificity (USpe): USpe is calculated as the number of correct and certain predictions (TC) divided by the total number of correct predictions: USpe or correct certain ratio is similar to the specificity performance metric. • Uncertainty precision (UPre): UPrec is calculated as the number of incorrect and uncertain predictions divided by the total number of uncertain predictions: www.nature.com/scientificreports/ UPre has the same concept of precision in traditional binary classification. • Uncertainty accuracy (UAcc): Similar to the accuracy of classifiers, the UAcc is calculated as the number of all diagonal outcomes divided by the total number of outcomes: A reliable model will generate a high UAcc.
The best USen, USpe, UPre, and UAcc values are one, whereas the worst are zero. It is always desirable to have these metrics as close as possible to one. Having USen and USpe, and UPre close to one means that the network is self-aware of what it knows and what it does not know. Such a network can tell us when the user can trust its predictions as it reliably gauges and communicates its lack of confidence (as captured in predictive uncertainty estimates).

Dataset
In this study, CXR images sourced from two databases (COVID and Non-COVID) are used for model training and testing. It is note that Institutional approval was granted for the use of the patient datasets in research studies for diagnostic and therapeutic purposes. Approval was granted on the grounds of existing datasets. Informed consent was obtained from all of the patients in this study. All methods were carried out in accordance with relevant guidelines and regulations. Ethical approval for the use of these data was obtained from the Tehran Omid hospital.
Non-COVID dataset. Cohen  COVID-19 dataset. The main COVID-19 dataset used in this study for model development and evaluation contains 522 CXR images from 391 COVID-19 patients and 131 normal subjects (The dataset is collected in Iran). It is important to note that the normal class represents patients that did not have COVID-19. The term normal here does not imply that these patients do not have any emerging disease.

Experiments
The main COVID-19 dataset has a limited number of images. This makes developing reliable DNNs from scratch impractical. To address this issue, we develop the deep model in a transfer learning setting 4 . The common research practice is to pick an existing deep network pretrained on natural image datasets such as ImageNet and then finetune its weights on the medical images. It has been recently shown that this approach is not optimal for medical imaging 43 . Motivated by these findings, we pretrain a DenseNet121 44 using thousands of CXR images of the non-COVID datasets described in section "Non-COVID dataset".
The whole dataset is split to 75-25% between training and testing subsets. All images are resized to 224 × 224 and standardised before being fed to convolutional layers of DenseNet121. This results in 50,176 convolutional features which are then processed by fully connected layers with a softmax on top of them. Relu activation function, 300 epochs, and dropout rate of 0.25 are used for model development using three uncertainty quantification techniques. The Adam algorithm with a learning rate of 0.001 is applied to optimize the cross entropy loss function. For the MCD model, the number of neurons in three fully connected layers is set to 512, 256 and 64 respectively. The ensemble model consists of 30 individual networks in which hidden layers are randomly chosen between 2 and 3. Also the number of neurons in fully connected layers is randomly chosen between (512, 1024), (128, 512), (8,128) respectively. The ensemble MCD is designed similar to the ensemble. The only difference is that the evaluation of each network is done by the MCD algorithm. To implement the experiments, Python language and libraries such as Numpy, PyTorch, and Pandas have been used. Each experiment has been repeated 100 times for different random seeds. The experiments were run on Google Colab with its default settings (GPU: 1xTesla K80, compute 3.7, 2496 CUDA cores, 12GB GDDR5 VRAM).

Simulations and results
We first compare the performance of DenseNet121 networks pretrained using ImageNet and CXR datasets. Figure 2 shows the violin plot of the accuracy and AUC performance metrics for these two networks trained and evaluated 100 times. Both performance metrics are greater and more consistent for networks pretrained using CXR datasets. A paired t-test is also run at 95% confidence level to determine whether the mean of performance metrics are statistically different. The obtained p-values for accuracy and AUC are 10 −39 and 10 −19 .
As both values are much smaller than 0.05, it can be concluded that there are statistically significant differences in accuracy and AUC of these two models.
Results shown in Fig. 2 indicate that the proposed pipeline does a promising job for COVID-19 detection from medical images. The model AUC and accuracy are 0.95 ± 0.02 and 93.94% ± 1.73% which are in par or www.nature.com/scientificreports/ better than results reported in similar studies 45 . Using CXR dataset (pipeline) for pretraining networks yields higher accuracy compared to using ImageNet dataset. Therefore, we choose networks that are trained on CXR through the rest of the paper. Before starting to analyze predictive uncertainty estimation results, we first check the calibration of predictions generated by DNNs. Figure 3 shows the expected calibration error (ECE) which is a plot of sample accuracy as a function of confidence 46 . To calculate ECE, predictions are grouped in different bins (here M bins) according to their confidence (the value of the max softmax output). The calibration error of each bin measures the difference between the fraction of correctly classified predictions (accuracy) and the mean of the probabilities (confidence). ECE is a weighted average of this error across all bins 46 : where acc(B m ) and conf (B m ) are the accuracy and confidence in the m-th bin: where 1(·) is the indicator function. Figure 4 displays the predictive posterior distribution for two non-COVID-19 (normal) CXR images obtained using MCD method (200 MC iterations). The class with the bigger softmax output for the distribution mean is reported as the predicted outcome. The uncertainty estimate associated with this outcome is also calculated using Eq. (2). The wider the predictive posterior distribution, the less confident the model. For the top case shown in Fig. 4, the predicted output is normal and the model is confident about its correct decision (a low predictive uncertainty estimate of 0.17). In contrast, the model prediction is wrong for the below CXR image in Fig. 4. However, the model generates a wide posterior distribution resulting in a high predictive uncertainty estimate (0.45). This way, it communicates its lack of confidence in this specific prediction and says I do not know. As the model is not confident about its prediction, the image could be sent to a medical expert for a second opinion 10 . In Fig. 3, the more the blue part deviation from the corresponding red parts, the less calibrated the neural network model. A perfectly calibrated model will generate the identity line in this chart. The current plot clearly shows     Fig. 3 indicate that all models overconfidently classify samples resulting in misleading outcomes. EMCD has the smallest ECE value of 9.85. We then investigate the overall model ability in gauging and reporting its lack of confidence in its prediction. Figure 5 displays the contour plots of predicted probabilities and uncertainty estimates for three uncertainty quantification methods. The estimated marginal distributions of probabilities and uncertainty estimates are also shown on the sides of plots grouped by correctly classified and misclassified predictions. The plots clearly show that the centers of two groups are well apart from each other resulting in high accuracy. The estimated distribution for predicted probabilities of the correct group is much more compact compared to incorrect group. At the same time, the estimated uncertainties are higher for misclassified images. The visual inspection of three subplots indicates the uncertainty estimate mean for erroneous predictions is on the right of the uncertainty estimate mean for correct predictions. This qualitative investigation highlights the model capability in gauging and communicating its confidence (or lack of confidence) in generated predictions. This finding is of paramount practical importance as reliable uncertainty estimates provide additional valuable information to predicted probabilities. These could be used to flag uncertain predictions and request a second opinion by a medical expert.
We also comprehensively evaluate predictive uncertainty estimates using performance metrics introduced in section "Predictive uncertainty evaluation". Figure 6 displays uncertainty accuracy (UAcc), uncertainty sensitivity   www.nature.com/scientificreports/ (USen), uncertainty specificity (USpe), and uncertainty precision (UPre) calculated for uncertainty thresholds between 0.1 and 0.9. UAcc, USpe, and UPre are positively correlated with the uncertainty threshold. This correlation is negative for USen. UAcc, USen, and USpe all achieve values close to one for a wide range of thresholds. Achieving a high USen means that all three uncertainty quantification methods are able to flag incorrect predictions with high uncertainty. These methods are quite capable of flagging erroneous predictions for further investigation. None of the uncertainty quantification methods achieves a high UPre close to one. This is because there are many correct predictions that have a high uncertainty (FU). This can be observed in the long tail of the estimated distributions of predictive uncertainties in Fig. 5 (top side plots). It is also important to note that the number of correctly classified images is much greater than the number of misclassified images. This makes the number of TU predictions always much smaller than FU resulting in a low Upre. This is an expected pattern for models with high accuracy. UAcc, USpe, and Upre achieve their maximum values for thresholds close to one. Thresholds close to zero lead to the highest values for USpe. Selecting the best uncertainty threshold value depends on the users' preferences, e.g., sensitivity vs. specificity. Setting it to 0.3 results in a good trade off between four uncertainty performance metrics for three uncertainty quantification methods. Table 1 reports uncertainty performance metrics for all methods. UAcc for the MCD method is 71.2% which is much smaller than UAcc for both ensemble methods. The ensemble method achieves the highest UAcc amongst uncertainty quantification methods and its UAcc (77.8%) is slightly superior to that of EMCD (76.3%). The same pattern holds for other performance metrics of three methods.

Conclusion
In this paper, we investigate the competency of deep uncertainty quantification techniques for the task of COVID-19 detection from CXR images. A novel confusion matrix and multiple performance metrics for the evaluation of predictive uncertainty estimates are introduced. Our investigations reveal that deep learning models pretrained using medical imaging datasets outperform models pretrained using natural datasets such as ImageNet. Through comprehensive evaluation, we also find that ensemble methods better capture uncertainties associated with their predictions resulting in more trustworthy diagnosis solutions. The proposed uncertainty confusion matrix also shows that uncertainty quantification methods achieve high uncertainty sensitivity and specificity. However, they often fail at producing uncertainty estimates resulting in high precision.
There are many rooms for improving the proposed uncertainty evaluation metrics and its application for DNN development. For future work, we will include the proposed uncertainty evaluation metrics as the loss function in the process of training DNNs. This will lead to networks that are optimized based on performance metrics of both point predictions and uncertainty estimates. Figure 6. Quantitative evaluation of three uncertainty quantification techniques using performance metrics introduced in section "Predictive uncertainty evaluation". Uncertainty accuracy, sensitivity, specificity, and prediction are calulated for threshold values between 0.1 and 0.9.