DropConnect is effective in modeling uncertainty of Bayesian deep networks

Deep neural networks (DNNs) have achieved state-of-the-art performance in many important domains, including medical diagnosis, security, and autonomous driving. In domains where safety is highly critical, an erroneous decision can result in serious consequences. While perfect prediction accuracy is not always achievable, recent work on Bayesian deep networks shows that it is possible to know when DNNs are more likely to make mistakes. Knowing what DNNs do not know is desirable for increasing the safety of deep learning technology in sensitive applications; Bayesian neural networks attempt to address this challenge. However, traditional approaches are computationally intractable and do not scale well to large, complex neural network architectures. In this paper, we develop a theoretical framework to approximate Bayesian inference for DNNs by imposing a Bernoulli distribution on the model weights. This method, called Monte Carlo DropConnect (MC-DropConnect), gives us a tool to represent model uncertainty with little change in the overall model structure or computational cost. We extensively validate the proposed algorithm on multiple network architectures and datasets for classification and semantic segmentation tasks. We also propose new metrics to quantify uncertainty estimates, enabling an objective comparison between MC-DropConnect and prior approaches. Our empirical results demonstrate that the proposed framework yields significant improvement in both prediction accuracy and uncertainty estimation quality compared to the state of the art.


Scientific Reports | (2021) 11:5458 | https://doi.org/10.1038/s41598-021-84854-x

Overconfident yet erroneous predictions can bias the expert. Given information about the model's confidence, an expert could rely more on their own judgment when the automated system is essentially guessing at random. Most studies on uncertainty estimation techniques are inspired by Bayesian statistics. Bayesian Neural Networks (BNNs) 12 are the probabilistic version of traditional NNs, with a prior distribution placed on the weights of the network. Such networks are intrinsically suitable for generating uncertainty estimates, as they produce a distribution over the output for a given input sample 13 . These probabilistic systems are computationally expensive for large neural network models due to the huge number of parameters and the intractable inference of the model posterior. This limitation has prompted the scientific community to develop scalable, approximate BNNs.
Variational inference 14 is the most common approach for approximating the model posterior using a simple variational distribution such as a Gaussian 15 . The parameters of the distribution are then set to minimize its difference from the true posterior (usually by minimizing the Kullback-Leibler divergence). Using a Gaussian distribution considerably increases the required number of parameters and makes the approach computationally expensive. In this paper, we propose a mathematically grounded method called Monte Carlo DropConnect (MC-DropConnect) to approximate variational inference in BNNs. The main contributions of this paper are:
1. We propose imposing a Bernoulli distribution directly on the weights of the deep neural network to estimate the posterior distribution over its weight matrices. We derive the required equations and show that this generalization provides a computationally tractable approximation of a BNN, using only existing tools and no additional model parameters.
2. We propose metrics to evaluate the uncertainty estimation performance of Bayesian models in the classification and segmentation settings. Using these metrics, we show that our method is superior to the recently proposed MC-Dropout technique.
3. We present an in-depth analysis of the uncertainty estimates in both classification and segmentation settings to investigate the robust generalization of MC-DropConnect. Our extensive evaluations show that the proposed uncertainty-informed decision scheme significantly improves prediction accuracy compared to standard techniques.
Our experimental results (achieved using the proposed method and metrics) provide a new benchmark for other researchers to evaluate and compare their uncertainty estimation in pursuit of safer and more reliable deep networks. The rest of this paper is organized as follows: work related to approximating Bayesian inference and estimating uncertainty is presented in the Related Work section. The Methodology section explains our proposed method along with the mathematical proofs to approximate variational inference in deep neural networks. We then present our findings and their interpretations in the Experimental Results and Discussion section. Finally, the Conclusion section concludes the paper with future research directions.

Related work
In recent years, many studies have been conducted on approximate Bayesian inference for neural networks using deterministic approaches 13 , Markov Chain Monte Carlo with Hamiltonian Dynamics 16 , and variational inference 15 . In particular, Neal et al. introduced the Hamiltonian Monte Carlo for Bayesian neural network learning which gives a set of posterior samples 16 . This method does not require the direct calculation of the posterior but is computationally prohibitive.
Recently, Gal et al. 17 showed that Dropout, a well-known regularization technique 18 , is mathematically equivalent to approximate variational inference in the deep Gaussian process 19 . This method, commonly known as MC-Dropout, uses a Bernoulli approximating variational distribution on the network units and introduces no additional parameters for the approximate posterior. Its main disadvantage is that it often requires many forward-pass samples, which makes it resource-intensive 20 . Moreover, a fully Bayesian network approximated with this method (i.e. dropout applied to all layers) suffers from excessive regularization 21 : it learns slowly and does not achieve high prediction accuracy. While Bernoulli dropout is the most common approach in the literature due to its ease of use and computational speed, several dropout variants with other distributions, such as Gaussian dropout, have been studied 18,22 . Concrete dropout 23 was later proposed, using a continuous relaxation of dropout's discrete masks to allow automatic tuning of the dropout probability in large models; however, it introduces bias into the gradients of the model and reduces its prediction performance. Motivated by concrete dropout, Boluki et al. proposed a learnable Bernoulli dropout (LBD) mechanism for general deep neural networks, in which the dropout probabilities are defined as variational parameters and are jointly trained with the other parameters of the DNN 24 . Their experimental results show that LBD achieves improved accuracy and uncertainty estimates in image classification and semantic segmentation. Multiplicative Normalizing Flows 25 is another family of approximate posteriors for the parameters of a variational BNN capable of producing uncertainty estimates; however, this technique does not scale well to very large convolutional networks.
Another proposed approach is Deep Ensembles 26 , which has been shown to achieve high-quality uncertainty estimates. This method takes a frequentist approach to estimating model uncertainty by training several models and calculating the variance of their output predictions. The technique is quite resource-intensive, as it requires storing several separate models and performing forward passes through all of them at inference time. An alternative was proposed by DeVries et al., who learn the uncertainty directly from the given input 27 .
Several approaches have been designed to compute uncertainty estimates in the segmentation setting. The most commonly used approach is to induce a probability distribution by using dropout over extracted feature values to obtain independent pixel-wise probabilities 21,28 . However, these approaches have been shown to be prone to producing spatially inconsistent outputs 29 . In contrast, a body of work has designed approaches that produce a diverse set of outcomes to account for the inherent ambiguities observed in real-world applications. Several approaches trained models with an oracle set loss, which only accounts for the closest prediction to the ground truth [30][31][32] . Kohl et al. proposed the probabilistic U-Net, in which a separate network named prior-net is trained along with the base segmentation network and maps the input to an embedding hypothesis space 29 . This network is thus able to generate multiple plausible segmentations by sampling different points from the learned hypothesis embedding space.

Methodology
In this section, we address the limitations of BNNs, variational inference as the standard technique in Bayesian modeling, and DropConnect as a method for regularizing NNs. We then use these tools to approximate Bayesian networks using standard NNs equipped with Bernoulli distributions applied directly to their weights. Finally, we explain the methods used for measuring and evaluating model uncertainty.
Bayesian neural networks. From a probabilistic perspective, standard NN training via optimization is equivalent to maximum likelihood estimation (MLE) for the weights. MLE ignores any uncertainty that we may have in the proper weight values. BNNs extend NNs to address this shortcoming by placing a prior distribution (often a Gaussian) over the NN's weights. This brings vital advantages such as automatic model regularization and uncertainty estimates on predictions 13,15 . Given a BNN model with L layers parametrized by weights $w = \{W_i\}_{i=1}^{L}$ and a dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$, Bayesian inference calculates the posterior distribution of the weights given the data, $p(w|\mathcal{D})$. The predictive distribution of an unknown label $y^*$ for a test input $x^*$ is given by:

$$p(y^*|x^*, \mathcal{D}) = \int p(y^*|x^*, w)\, p(w|\mathcal{D})\, dw \qquad (1)$$

which shows that making a prediction about the unknown label is equivalent to using an ensemble of infinitely many neural networks with various configurations of the weights. This is computationally intractable for neural networks of any size, and the posterior distribution $p(w|\mathcal{D})$ cannot generally be evaluated analytically. This limitation has prompted the scientific community to develop ways to approximate BNNs to make them easier to train 33,34 .
One common approach is to use variational inference to approximate the posterior distribution of the weights. It introduces a variational distribution $q_\theta(w)$, parametrized by $\theta$, that minimizes the Kullback-Leibler (KL) divergence to the true posterior distribution:

$$\mathrm{KL}\big(q_\theta(w)\,\|\,p(w|\mathcal{D})\big) \qquad (2)$$

Minimising the KL divergence is equivalent to minimizing the negative evidence lower bound (ELBO):

$$\mathcal{L}(\theta) = -\int q_\theta(w)\, \log p(\mathbf{y}|\mathbf{X}, w)\, dw + \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big) \qquad (3)$$

with respect to the variational parameters $\theta$. The first term (commonly referred to as the expected log likelihood) encourages $q_\theta(w)$ to place its mass on configurations of the latent variables that explain the observed data. The second term (referred to as the prior KL) encourages $q_\theta(w)$ to stay close to the prior, preventing the model from overfitting. With properly selected prior and variational distributions, the prior KL term can be evaluated analytically, while the expected log likelihood (i.e. the integral term) cannot be computed exactly for a non-linear neural network. Our goal in the next section is to develop an explicit and accurate approximation for this expectation. Our approach extends the results of Gal et al. 35 and uses a Bernoulli approximating variational distribution together with Monte Carlo sampling.
DropConnect. DropConnect 36 , known as the generalized version of Dropout 18 , is a method for regularizing deep neural networks. Here, we briefly review Dropout and DropConnect applied to a single fully-connected layer of a standard NN. For a single $K_{i-1}$-dimensional input $v$, the $i$th layer of an NN with $K_i$ units outputs a $K_i$-dimensional activation vector $a_i = \sigma(W_i v)$, where $W_i$ is the $K_i \times K_{i-1}$ weight matrix and $\sigma(\cdot)$ is the nonlinear activation function (biases are included in the weight matrix with a corresponding fixed input of one for ease of notation).
When Dropout is applied to the output of a layer, the output activations can be written as $a_i^{DO} = \sigma(z_i \odot (W_i v))$, where $\odot$ denotes the Hadamard product and $z_i$ is a $K_i$-dimensional binary vector whose elements are drawn independently as $z_i^{(k)} \sim \mathrm{Bernoulli}(p_i)$ for $k = 1, \ldots, K_i$, with $p_i$ the probability of keeping an output activation. DropConnect is the generalization of Dropout in which the Bernoulli dropping is applied directly to each weight rather than to each output unit; the output activation is thus re-written as $a_i^{DC} = \sigma((Z_i \odot W_i) v)$. Here, $Z_i$ is a binary matrix of the same shape as $W_i$, i.e. $K_i \times K_{i-1}$. Wan et al. 36 showed that DropConnect helps regularize large neural network models and outperforms Dropout on a range of data sets.
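As a concrete illustration, the difference between the two masking schemes can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: $\sigma(\cdot)$ is taken to be ReLU, and the function names and toy dimensions are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(W, v, p_keep, rng):
    # Dropout: one Bernoulli draw per output unit (K_i-dimensional mask z_i).
    z = rng.binomial(1, p_keep, size=W.shape[0])
    return np.maximum(z * (W @ v), 0.0)  # sigma(.) assumed to be ReLU

def dropconnect_layer(W, v, p_keep, rng):
    # DropConnect: one Bernoulli draw per weight (K_i x K_{i-1} mask Z_i).
    Z = rng.binomial(1, p_keep, size=W.shape)
    return np.maximum((Z * W) @ v, 0.0)

W = rng.standard_normal((4, 3))  # K_i = 4 units, K_{i-1} = 3 inputs
v = rng.standard_normal(3)
a_do = dropout_layer(W, v, 0.5, rng)
a_dc = dropconnect_layer(W, v, 0.5, rng)
```

Note that Dropout draws $K_i$ Bernoulli variables per layer while DropConnect draws $K_i \times K_{i-1}$, one per weight, which is why Dropout is the special case in which all weights of one row share a mask value.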
DropConnect for approximate Bayesian neural networks. Assume the same Bayesian NN with L layers parametrized by weights $w = \{W_i\}_{i=1}^{L}$. We perform variational learning by approximating the variational distribution $q(W_i|\Theta_i)$ for every layer $i$ as:

$$W_i = \Theta_i \odot Z_i \qquad (4)$$

where $\Theta_i$ is the matrix of variational parameters to be optimised, and $Z_i$ is the binary matrix whose elements are distributed as:

$$z_i^{(l,k)} \sim \mathrm{Bernoulli}(p_i) \qquad (5)$$

Here, $z_i^{(l,k)}$ is the random binary value associated with the weight connecting the $l$th unit of the $(i-1)$th layer to the $k$th unit of the $i$th layer, and $p_i$ is the probability that the random variables $z_i^{(l,k)}$ take the value one (assuming the same probability for all the weights in a layer). Therefore, $z_i^{(l,k)} = 0$ corresponds to the weight being dropped out. We start by rewriting the first term of Eq. (3) as a sum over all N samples. We then use Eq. (4) to re-parametrize the integrand so that it depends only on the Bernoulli distribution instead of on $w$ directly, and estimate the intractable integral with Monte Carlo sampling over $w$ using a single sample per data point:

$$-\int q_\theta(w)\, \log p(\mathbf{y}|\mathbf{X}, w)\, dw \approx -\sum_{n=1}^{N} \log p(y_n|x_n, \hat{w}_n) \qquad (6)$$

Note that $\hat{w}_n$ is not a maximum a posteriori estimate, but a realisation of random variables drawn from the Bernoulli distribution, $\hat{w}_n \sim q_\theta(w)$, which is identical to applying DropConnect to the weights of the network. The sum of the log probabilities is the loss of the NN, so we set:

$$E_{NN} = -\frac{1}{N}\sum_{n=1}^{N} \log p\big(y_n \,|\, \hat{y}(x_n, \hat{w}_n)\big) \qquad (7)$$

where $\hat{y}(x_n, \hat{w}_n)$ is the random output of the BNN. $E_{NN}$ is defined according to the task, with the sum-of-squares loss and the softmax loss commonly selected for regression and classification respectively.
The second term in Eq. (3) can be approximated following 35 : the KL term between the Bernoulli approximating distribution and a Gaussian prior is equivalent (up to an additive constant) to an L2 regularization over the variational parameters. Thus, the objective function can be re-written as:

$$\mathcal{L} \approx E_{NN} + \lambda \sum_{i=1}^{L} \|\Theta_i\|_2^2 \qquad (8)$$

which is a scaled unbiased estimator of Eq. (3). More interestingly, it is identical to the objective function used in a standard neural network with L2 weight regularization and DropConnect applied to all the weights of the network. Therefore, training such a neural network with stochastic gradient descent has the same effect as minimizing the KL divergence in Eq. (2). This scheme, like a BNN, results in a set of parameters that best explains the observed data while preventing over-fitting. After training the NN with DropConnect and proper regularization, we follow Eq. (1) to generate the inference. We replace the posterior $p(w|\mathcal{D})$ with the approximate posterior distribution $q_\theta(w)$ and approximate the integral with Monte Carlo integration:

$$p(y^*|x^*, \mathcal{D}) \approx \int p(y^*|x^*, w)\, q_\theta(w)\, dw \approx \frac{1}{T}\sum_{t=1}^{T} p(y^*|x^*, \hat{w}_t) \qquad (9)$$

with $\hat{w}_t \sim q_\theta(w)$. This means that at test time, unlike common practice, the DropConnect layers are kept active so as to retain the Bernoulli distribution over the network weights. Each forward pass through the trained network then generates a Monte Carlo sample from the posterior distribution, and several such forward passes are needed to approximate the posterior distribution of the softmax class probabilities. According to Eq. (9), the mean of these samples can be interpreted as the network prediction. We call this approach MC-DropConnect; it generalizes the previous work referred to as MC-Dropout 35 , and we will show its superiority in terms of higher prediction accuracy and more precise uncertainty estimation in different ML tasks.
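The test-time procedure — DropConnect kept active, T stochastic forward passes, softmax outputs averaged as in Eq. (9) — can be sketched as follows. This is a hedged NumPy sketch of a tiny fully-connected classifier; the function name, toy weights, and ReLU/softmax choices are ours, not the paper's implementation.

```python
import numpy as np

def mc_dropconnect_predict(weights, x, T, p_keep, seed=0):
    """Average T stochastic softmax outputs, with a fresh Bernoulli
    mask drawn for every weight matrix on every forward pass."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(T):
        h = x
        for i, W in enumerate(weights):
            Z = rng.binomial(1, p_keep, size=W.shape)  # DropConnect stays active at test time
            h = (Z * W) @ h
            if i < len(weights) - 1:
                h = np.maximum(h, 0.0)  # hidden-layer ReLU
        e = np.exp(h - h.max())
        samples.append(e / e.sum())  # softmax class probabilities
    samples = np.stack(samples)      # shape (T, num_classes)
    return samples.mean(axis=0), samples  # prediction = mean of the MC samples

rng = np.random.default_rng(1)
weights = [rng.standard_normal((8, 5)), rng.standard_normal((3, 8))]
p_mean, p_samples = mc_dropconnect_predict(weights, np.ones(5), T=50, p_keep=0.5)
```

The spread of `p_samples` around `p_mean` is what the next section converts into a scalar uncertainty estimate.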
Measuring the model uncertainty. Generally, there are two types of uncertainty in Bayesian modeling 10 .
Model uncertainty, also known as Epistemic uncertainty, measures what the model does not know due to the lack of training data. This uncertainty captures our ignorance about which model generated our collected data, thus can be explained away given enough data 9 . Aleatoric uncertainty, however, captures noise (such as motion or sensor noise) that is inherently present in the data and cannot be reduced by collecting more data 28 .
After computing the results of the stochastic forward passes through the model, we can estimate the model's confidence in its output. In the classification setting, several metrics have been introduced to measure uncertainty. One straightforward approach, used by Kendall et al., is to take the variance of the MC samples from the posterior distribution as the output model uncertainty for each class 21 . Predictive entropy is also suggested by Gal et al.; it captures both epistemic and aleatoric uncertainty and, in our case, is not the proper choice because we are interested specifically in regions of the data space where the model is uncertain 9 .
To specifically measure the model uncertainty for a new test sample $x^*$, we can view it as the amount of information we would gain about the model parameters if we were to receive the true label $y^*$. Intuitively, if the model is well-established in a region, knowing the output label conveys little information; in contrast, knowing the label would be informative in regions of the data space where the model is uncertain 37 . Therefore, the mutual information (MI) between the true label and the model parameters is defined as:

$$I(y^*, w \,|\, x^*, \mathcal{D}) = H\big[y^* \,|\, x^*, \mathcal{D}\big] - \mathbb{E}_{p(w|\mathcal{D})}\Big[H\big[y^* \,|\, x^*, w\big]\Big] \qquad (10)$$

where $H[y^*|x^*, \mathcal{D}] = -\sum_c p(y^*=c|x^*, \mathcal{D}) \log p(y^*=c|x^*, \mathcal{D})$ is the entropy of the predictive distribution and $c$ ranges over all classes. This is not analytically tractable for deep NNs; we use Eq. (9) to approximate the predictive distribution as:

$$p_{MC}(y^*=c \,|\, x^*) = \frac{1}{T}\sum_{t=1}^{T} p(y^*=c \,|\, x^*, \hat{w}_t) \qquad (11)$$

where $p_{MC}(y^*=c|x^*)$ is the average of the softmax probabilities of input $x^*$ belonging to class $c$ over T Monte Carlo samples. Finally, MI can be re-written as:

$$\hat{I}(y^*, w \,|\, x^*, \mathcal{D}) = -\sum_c p_{MC}(y^*=c|x^*) \log p_{MC}(y^*=c|x^*) + \frac{1}{T}\sum_{t=1}^{T}\sum_c p(y^*=c|x^*, \hat{w}_t) \log p(y^*=c|x^*, \hat{w}_t) \qquad (12)$$

which can be computed from the model configurations $\hat{w}_t$ obtained by DropConnect at the T Monte Carlo runs. Note that the range of the obtained uncertainty values is not fixed across different data sets, network architectures, numbers of MC samples, etc. We therefore report the normalized mutual information

$$I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}} \in [0, 1] \qquad (13)$$

to facilitate the comparison across various data sets and configurations, where $I_{min}$ and $I_{max}$ are the minimum and maximum uncertainty values computed over the whole data set.
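The MC estimate of the mutual information and its normalization can be computed directly from the stack of softmax samples. A minimal NumPy sketch (our own illustrative code, with a toy two-sample example; the small `eps` guards against log-of-zero and zero division):

```python
import numpy as np

def normalized_mi(probs, eps=1e-12):
    """probs: array of shape (T, N, C) holding softmax outputs of T stochastic
    forward passes for N samples and C classes. Returns I_norm per sample."""
    p_mc = probs.mean(axis=0)                                        # MC-averaged softmax, shape (N, C)
    h_pred = -(p_mc * np.log(p_mc + eps)).sum(axis=1)                # entropy of the average
    h_exp = -(probs * np.log(probs + eps)).sum(axis=2).mean(axis=0)  # average per-pass entropy
    mi = h_pred - h_exp                                              # epistemic uncertainty per sample
    return (mi - mi.min()) / (mi.max() - mi.min() + eps)             # normalize over the data set

# Sample 0: all passes agree (certain); sample 1: passes disagree (uncertain).
probs = np.array([[[0.99, 0.01], [0.90, 0.10]],
                  [[0.99, 0.01], [0.10, 0.90]],
                  [[0.99, 0.01], [0.90, 0.10]],
                  [[0.99, 0.01], [0.10, 0.90]]])
i_norm = normalized_mi(probs)
```

The disagreeing sample receives an uncertainty near 1 while the consistent one is near 0, matching the intuition that MI is high exactly where the stochastic passes contradict each other.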
Uncertainty evaluation metrics. The proposed MC-DropConnect approach is a light-weight, scalable method to approximate Bayesian inference in deep neural networks. This enables us to perform inference and estimate the uncertainty in DNNs at once. Unlike model predictions, there is no ground truth for uncertainty values which makes evaluating the uncertainty estimates a challenging task. Therefore, there is no clear and direct approach to define a good uncertainty estimate.
We propose metrics that incorporate the ground-truth label, the model prediction, and the uncertainty value to evaluate the uncertainty estimation performance of such models. Figure 1 shows the processing steps required to prepare these quantities for our metrics in a segmentation example. Note that these metrics can be used for both classification and semantic segmentation tasks, since semantic segmentation is identical to pixel-wise classification; the conversions applied to a single pixel illustrate the classification case.
We first compute the map of correct and incorrect predictions (the correctness map) by matching the ground-truth labels and model predictions. Likewise, we apply a threshold $I_T \in [0, 1]$ to the continuous uncertainty values $I_{norm}$ to split the predictions into certain ($I_{norm} < I_T$) and uncertain ($I_{norm} \geq I_T$) groups. Therefore, when making inference in the Bayesian setting, we generally face four scenarios: incorrect-uncertain (iu), correct-uncertain (cu), correct-certain (cc), and incorrect-certain (ic) predictions (see Fig. 1). The following metrics reflect the characteristics of a good uncertainty estimator:
1. Correct-certain ratio ($R_{cc}$): if a model is certain about its prediction, the prediction should have a high probability of being correct. This can be written as a conditional probability:

$$R_{cc}(I_T) = P_{I_T}(\text{correct} \,|\, \text{certain}) = \frac{P(\text{correct}, \text{certain})}{P(\text{certain})} = \frac{N_{cc}}{N_{cc} + N_{ic}} \qquad (14)$$

where $N$ denotes the count of each combination and $R$ denotes a ratio.
2. Incorrect-uncertain ratio ($R_{iu}$): if a model makes an incorrect prediction, it is desirable for the uncertainty to be high:

$$R_{iu}(I_T) = P_{I_T}(\text{uncertain} \,|\, \text{incorrect}) = \frac{N_{iu}}{N_{iu} + N_{ic}} \qquad (15)$$
In this scenario, the model is capable of flagging a wrong prediction with a high epistemic uncertainty value to help the user take further precautions.
Note that the converse of the above two assumptions is not necessarily the case. This means that if a model is making a correct prediction on a sample, it does not necessarily need to be certain. A model might, for instance, be able to correctly detect an object, but with a relatively higher uncertainty because it has rarely seen that instance with such a pose or condition.
3. Uncertainty accuracy (UA): finally, the overall accuracy of the uncertainty estimation can be measured as the ratio of the desired cases explained above ($N_{cc}$ and $N_{iu}$) over all possible cases:

$$UA(I_T) = \frac{N_{cc} + N_{iu}}{N_{cc} + N_{cu} + N_{ic} + N_{iu}} \qquad (16)$$

Clearly, for all the metrics proposed above, higher values correspond to a better-performing model. The value of each metric depends on the uncertainty threshold; we therefore plot each metric against the threshold $I_T$ and compare methods using the area under each curve (AUC). This summarizes the value of each metric over various uncertainty thresholds in a single scalar.
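A minimal sketch of these metrics, assuming a boolean correctness array and normalized uncertainty values as defined above (the function name, toy data, and threshold grid are illustrative, not from the paper's code):

```python
import numpy as np

def uncertainty_metrics(correct, i_norm, i_t):
    """Counts of the four cases (cc, ic, cu, iu) at threshold i_t, and the
    ratios R_cc = P(correct|certain), R_iu = P(uncertain|incorrect), and UA."""
    certain = i_norm < i_t
    n_cc = int(np.sum(correct & certain))
    n_ic = int(np.sum(~correct & certain))
    n_cu = int(np.sum(correct & ~certain))
    n_iu = int(np.sum(~correct & ~certain))
    r_cc = n_cc / max(n_cc + n_ic, 1)  # guard empty "certain" group
    r_iu = n_iu / max(n_iu + n_ic, 1)  # guard no incorrect predictions
    ua = (n_cc + n_iu) / (n_cc + n_ic + n_cu + n_iu)
    return r_cc, r_iu, ua

correct = np.array([True, True, True, False, False])
i_norm = np.array([0.1, 0.2, 0.8, 0.9, 0.3])
r_cc, r_iu, ua = uncertainty_metrics(correct, i_norm, i_t=0.5)

# Sweep the threshold and summarize the UA curve with a trapezoidal AUC.
ts = np.linspace(0.0, 1.0, 21)
ua_curve = [uncertainty_metrics(correct, i_norm, t)[2] for t in ts]
ua_auc = sum((a + b) / 2 for a, b in zip(ua_curve, ua_curve[1:])) * (ts[1] - ts[0])
```

At $I_T = 0.5$ in this toy example, three of the five predictions fall in the desired (cc, iu) cases, so UA = 0.6.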
Medical data collection methodology. All medical data collection in this paper follows relevant guidelines and regulations. Specifically, all CT scans were anonymized to remove any patient-specific information. Patient consent was waived under our protocol as the data were de-identified (Protocol PA12-1084). The data collection was approved by Institutional Review Board 4 of the MD Anderson Cancer Center, whose chair designee is Vera J. DeLaCruz (IRB 4 IRB00005015).

Experimental results and discussion
In this section, we assess the performance of uncertainty estimates obtained from DropConnect CNNs on classification and semantic segmentation tasks. We also compare the uncertainty obtained from our proposed method with a state-of-the-art method, MC-Dropout, on a range of data sets and show considerable improvement in prediction accuracy and uncertainty estimation quality. We quantitatively evaluate the uncertainty estimates using our proposed evaluation metrics. Note that in all experiments throughout the paper, the MC-DropConnect and MC-Dropout techniques were never used simultaneously in the same network: a network trained with Dropout or DropConnect regularization is tested with the same Dropout or DropConnect, respectively. All experiments were performed using the TensorFlow (version 1.13.1) framework 38 .

Classification. We implement fully Bernoulli Bayesian CNNs using DropConnect to assess the theoretical insights explained above in the classification setting. We show that applying the mathematically principled DropConnect to all the weights of a CNN results in test accuracy comparable with state-of-the-art techniques in the literature while considerably improving the model's uncertainty estimation.
We adopt the LeNet structure (described in 39 ) for MNIST 40 and a fully-convolutional network (FCNet) for the CIFAR-10 dataset 41 . FCNet is composed of three blocks, each containing two convolutional layers (filter size of three and stride of one) followed by a max-pooling layer (with filter size and stride of two). The numbers of filters in the convolutional layers of the three blocks are 32, 64, and 128, respectively. Each convolutional layer is also followed by a batch normalization layer and a ReLU activation function. We refer to the Bayesian CNN with DropConnect applied to all the weights of the network as "MC-DropConnect" and compare it with "None" (no dropout or DropConnect), as well as "MC-Dropout" 20 , which has dropout applied after all layers. To make the comparison fair, Dropout and DropConnect are applied with the same rate of p = 0.5 . We evaluate the networks using two testing techniques. The first is the standard test applied to each structure, keeping everything in place (no weight or unit drop). The second incorporates the Bayesian methodology, generating the MC test equivalent to model averaging over T = 100 stochastic forward passes.
Our experimental results (Table 1, Fig. 2) show that MC-DropConnect yields marginally improved prediction accuracy when applying MC sampling. More importantly, the uncertainty estimation metrics show a significant improvement when using MC-DropConnect. Example predictions are provided in Fig. 3. We also test the LeNet networks (trained on MNIST) on rotated and background MNIST data. These are distorted versions of MNIST which can be considered out-of-distribution examples 42 that the model has never seen before. This test investigates the generalization of the predictive uncertainty under domain shift.
As shown in Fig. 3, the MC-DropConnect BNN often yields a high uncertainty estimate when the prediction is wrong and makes accurate predictions when it is certain. We observed fewer failure cases using MC-DropConnect compared with MC-Dropout (also reflected in the $R_{iu}$ and $R_{cc}$ values in Fig. 2). Similar observations were made in Fig. 4, which illustrates the distribution of the model uncertainty over the correct and incorrect predictions separately. It implies that the MC-DropConnect approximation produces significantly higher model uncertainty for incorrect predictions than for correct ones. Some correct predictions are nevertheless flagged as uncertain (false positives, FPs) in Fig. 3. These cases often correspond to visually complicated samples where the network is not confident. Such FPs are useful and can be considered red flags for situations in which a model is more likely to make inaccurate predictions.
Enhanced performance with uncertainty-informed referrals. An uncertainty estimate with such characteristics (i.e. high uncertainty as an indication of erroneous prediction, as well as informative FPs) provides valuable information in real-life settings where control is handed to automated systems and errors can become life-threatening to humans. These include applications such as self-driving cars, autonomous control of drones, and automated decision making and recommendation systems in the medical domain. An automated cancer detection system, for example, trained on a limited amount of data (often the case due to the expensive or time-consuming data collection process) could encounter test samples lying outside its observed data distribution. It is therefore prone to making unreasonable decisions or recommendations, which could bias the decision made by the expert. Uncertainty estimation can be utilized in such scenarios to detect this undesirable behavior of automated systems and to enhance overall performance by flagging appropriate subsets for further analysis.
We set up an experiment to test the usefulness of the proposed uncertainty estimation in mimicking the clinical work-flow, referring samples with high uncertainty for further testing. First, the model predictions are sorted according to their corresponding epistemic uncertainty (measured by the mutual information metric). We then compute the prediction accuracy as a function of confidence, by sweeping the level of tolerated uncertainty and thus the fraction of retained data (see Fig. 5). We observed a monotonic increase in prediction accuracy, with MC-DropConnect outperforming MC-Dropout, for decreasing levels of tolerated uncertainty and a decreasing fraction of retained data. We also compare against removing the same fraction of samples randomly, i.e. with no use of uncertainty information, which confirms how informative the uncertainty is about prediction performance. Note that in practice, the uncertainty cutoff threshold should be selected as the threshold yielding the best prediction performance on the validation set, and should not be changed when using the test set.
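The referral procedure above can be sketched on synthetic data in which errors are made more likely for uncertain samples (a NumPy sketch with hypothetical names and a toy error model, not the paper's code):

```python
import numpy as np

def referral_curve(correct, uncertainty, fractions):
    """Accuracy on the retained subset after referring the most uncertain
    samples: for each fraction, keep that share of most-certain predictions."""
    order = np.argsort(uncertainty)  # most certain first
    accs = []
    for f in fractions:
        keep = order[: max(int(round(f * len(order))), 1)]
        accs.append(float(np.mean(correct[keep])))
    return accs

rng = np.random.default_rng(0)
uncertainty = rng.uniform(size=1000)
# Toy model: the chance of an error grows with the uncertainty value.
correct = rng.uniform(size=1000) > 0.4 * uncertainty
accs = referral_curve(correct, uncertainty, fractions=[0.5, 0.8, 1.0])
```

Because errors concentrate among the high-uncertainty samples, accuracy on the retained subset rises as more uncertain samples are referred away, which is exactly the monotonic behavior observed in Fig. 5.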
Convergence of the MC-DropConnect. Even though the proposed MC-DropConnect method yields better prediction accuracy and uncertainty estimation, it comes at the price of a prolonged test time, because the network must be evaluated stochastically multiple times and the results averaged. Therefore, while the training times of the models and their probabilistic variants are identical, the test time is scaled by the number of averaged forward passes. This matters in practice for applications in which test-time efficiency is critical. To evaluate the MC-DropConnect approximation, we assessed the prediction accuracy as a function of the number of Monte Carlo samples.

CamVid with SegNet. CamVid 44 is a road scene understanding data set which contains 367, 100, and 233 training, validation, and test images respectively, with 12 classes. Images are of size 360 × 480 and include both bright and dark scenes. We chose SegNet as the network architecture for the semantic segmentation task to make the results of our approach comparable to those of 21 .
CityScapes with ENet. CityScapes 45 is one of the most popular data sets for urban scene understanding, with 5000, 500, and 1525 images for training, validation, and testing. Images are of size 2048 × 1024, collected in 50 different cities, and contain 20 different classes. Due to the large size of the images and the larger number of classes, we chose ENet 46 , a network that requires fewer FLOPs and parameters. The spatial dropout layers used in this framework are replaced with regular dropout and DropConnect layers for our purpose.

3D CT-Organ with V-Net. Since uncertainty estimates can play a crucial role in the medical diagnostics field, we also tested our model uncertainty estimation approach on semantic segmentation of body organs in abdominal 3D CT scans. The CT-Organ dataset includes 226 unique CT scans captured by General Electric and Siemens scanners at a single hospital. The study was approved by the Institutional Review Board (IRB) at the University of Texas MD Anderson Cancer Center; the informed consent requirement was waived by the IRB as only de-identified data were used. The scans are down-sampled to 512 × 512 pixels and contain between 186 and 730 slices (mean = 420, std = 95). We used the volumetric CT scans from 180 patients for training and the rest for testing the models. We used V-Net 47 , one of the most commonly used architectures for segmentation of volumetric medical images. The data include six classes: background, liver, spleen, kidney, bone, and vessel.
Qualitative observations. Figure 7 shows example segmentation and model uncertainty results from the various Bayesian frameworks on different datasets, and compares the qualitative performance of MC-DropConnect with that of MC-Dropout. The correctness and confidence maps highlight the misclassified and uncertain pixels, respectively. Our observations show that MC-DropConnect produces high-quality uncertainty estimation maps, outperforming MC-Dropout, i.e. it displays higher model uncertainty when models make wrong predictions. We generally observe that higher uncertainty values are associated with three main scenarios. First, the boundaries of object classes (capturing the ambiguity of label transitions). Second, a strong relationship between the frequency at which a class label appears and the model uncertainty: models generally have significantly higher uncertainty for rare class labels (those less frequent in the data, such as the pole and sign-symbol classes in CamVid) and, conversely, are more confident about class labels that are more prevalent in the data sets. Third, objects that are visually difficult or ambiguous to the model: for example, the (bicyclist, pedestrian) classes in CamVid and the (car, truck) classes in CityScapes are visually similar, which makes correct predictions difficult and yields higher uncertainty values.
Quantitative observations. We report the semantic segmentation results in Table 2 and Fig. 8. We find that MC-DropConnect generally improves the accuracy of the predicted segmentation masks for all three model-data set pairs. As in the classification task, we computed the segmentation accuracies at varying levels of model confidence; the results are provided in Table 3. For all three data set-model pairs, we observed very high levels of accuracy at the 90th percentile of confidence. This indicates that the proposed method yields a model uncertainty estimate that is an effective measure of confidence in the prediction.
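The confidence-percentile evaluation reported in Table 3 can be reproduced in spirit as follows: at each percentile, keep only the most certain fraction of pixels and report accuracy on that subset. This is an illustrative sketch with our own function name; it assumes a generic per-pixel uncertainty score such as the mutual information above.

```python
import numpy as np

def accuracy_at_confidence_percentiles(correct, uncertainty, percentiles=(0, 50, 90)):
    """correct: (N,) boolean per-pixel correctness.
    uncertainty: (N,) per-pixel uncertainty score (lower = more certain).
    For each percentile q, keep the (100 - q)% most certain pixels and
    report the prediction accuracy on that subset."""
    out = {}
    for q in percentiles:
        thresh = np.percentile(uncertainty, 100 - q)  # q=0 keeps all pixels
        keep = uncertainty <= thresh                  # q=90 keeps the 10% most certain
        out[q] = float(correct[keep].mean())
    return out
```

If the uncertainty estimate is well calibrated, accuracy should rise monotonically with the percentile, which is the pattern Table 3 reports for all three data set-model pairs.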

Conclusion
We have presented MC-DropConnect as a mathematically grounded and computationally tractable approximate inference method for Bayesian neural networks. This framework outputs a measure of model uncertainty at no additional computational cost, i.e., by extracting information from existing models that has so far been thrown away. We also developed new metrics to evaluate the uncertainty estimates of models across ML tasks such as regression, classification, and semantic segmentation. We created probabilistic variants of some of the most widely used frameworks (in both classification and semantic segmentation tasks) using MC-DropConnect, and then applied the proposed metrics to evaluate and compare the uncertainty estimation performance of the various models. Empirically, we observed that MC-DropConnect improves both the prediction accuracy and the quality of the uncertainty estimates. As for the fixed weight-dropping mechanism, it would be interesting to investigate a learnable weight dropping rate (similarly to Boluki et al. 24) as a more flexible alternative. While we have effectively validated this method in classification and segmentation tasks, future work should investigate the feasibility of MC-DropConnect in regression tasks. Leveraging the uncertainty in the training process to enrich the model's knowledge of the data domain is another interesting research direction that should be investigated.

Figure 7. Qualitative results for semantic segmentation and uncertainty estimates on CamVid, CityScapes, and CT-Organ datasets. Each row depicts a single sample and includes the input image with ground truth, prediction, correctness, and confidence (using the mutual information metric) maps for both MC-Dropout and MC-DropConnect. The correctness map is a binary map showing correct and incorrect predictions. The confidence map is the thresholded map of uncertainty values computed over all classes. In all cases, the threshold is set manually to the one that achieves the highest UA. Correct and certain regions are shown in white in the correctness and confidence maps, respectively.

Code availability
All scripts related to this work can be accessed without restriction at https://github.com/hula-ai/mc_dropconnect.

Table 3. Pixel-wise accuracy of the Bayesian frameworks as a function of confidence, from the 0th percentile (all pixels) through to the 90th percentile (the 10% most certain pixels). This shows that the estimated model uncertainty is an effective measure of prediction accuracy.