Uncertainty-inspired open set learning for retinal anomaly identification

Failure to recognize samples from classes unseen during training is a major limitation of artificial intelligence in real-world recognition and classification of retinal anomalies. We establish an uncertainty-inspired open set (UIOS) model, which is trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calculates an uncertainty score to express its confidence. Our UIOS model with thresholding strategy achieves an F1 score of 99.55%, 97.01% and 91.91% for the internal testing set, external target categories (TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1 score of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS correctly predicts high uncertainty scores, which would prompt the need for a manual check, in the datasets of non-target-category retinal diseases, low-quality fundus images, and non-fundus images. UIOS provides a robust method for real-world screening of retinal anomalies.


Introduction
The retina is the part of the central nervous system responsible for phototransduction. Retinal diseases are the leading cause of irreversible blindness and visual impairment worldwide.
Treatment at the early stage of disease is important to reduce serious and permanent damage. Therefore, timely diagnosis and appropriate treatment are important for preventing vision loss and even irreversible blindness. Diagnosis of retinal diseases requires the expertise of trained ophthalmologists, yet the large number of patients with retinal diseases far exceeds the capacity of the limited number of specialists. A solution to this service gap is image-based screening, which alleviates the workload of ophthalmologists.
Fundus photography-based screening has been shown to be successful to prevent irreversible vision impairment and blindness caused by diabetic retinopathy [1].
There are also some successful applications of deep learning in classifying multiple retinal diseases [15].
However, a major drawback of standard AI models in real-world implementation is the problem of open set recognition. AI models are trained on a closed set, i.e., a limited number of categories and limited sample characteristics. But the real world is an open set environment, where some samples may fall outside the categories of the training set or have atypical features. Previous studies have demonstrated that the performance of deep learning models declines when applied to out-of-distribution (OOD) data, such as low-quality images and atypical cases [16][17][18]. Furthermore, if a testing image shows a retinal disease not included in the training set, or is even a non-fundus image, a standard AI model will still assign it to one of the disease categories in the training data, which can lead to misdiagnosis. Meanwhile, in practice it is impossible to collect data covering all fundus abnormalities with sufficient sample size to train a model. Therefore, it is highly necessary to develop an open set learning model that can reliably flag anomalies. Standard artificial intelligence (AI) and UIOS models were trained with the same dataset of 9 categories of retinal photos. In testing, the standard AI model assigns a probability value (p_i) to each of the 9 categories, and the one with the highest probability is output as the diagnosis. Even when the model is tested with an image of a rare retinal disease outside of the training set, it still outputs one of the 9 categories, which may lead to misdiagnosis. In contrast, UIOS outputs an uncertainty score (µ) besides the probabilities (p_i) for the 9 categories. When the model is fed an image with obvious features of a retinal disease in the 9 categories, the uncertainty-based classifier outputs a prediction with a low uncertainty score below the threshold θ, indicating that the diagnosis is reliable.
Conversely, when the input contains ambiguous features or is an anomaly outside the training categories, the model assigns a high uncertainty score above the threshold θ to explicitly indicate that the prediction is unreliable and requires a double-check from an ophthalmologist to avoid misdiagnosis. UIOS thus supports reliable diagnosis of the retinal diseases included in the training set, as well as the screening of other OOD samples, without the need to collect and label additional data.
In this study, we developed a novel AI model, uncertainty-inspired open set (UIOS), based on an evidential uncertainty deep neural network. As shown in Fig. 1, if the test image shows a fundus disease included in the training set with distinct features, our proposed UIOS model gives a diagnosis with a low uncertainty score, indicating that the decision is reliable. On the contrary, if the test image belongs to a category outside the training dataset, is of low quality, or is not a fundus image, UIOS gives a prediction with a high uncertainty score, suggesting that the diagnosis given by the AI model is unreliable. If so, a manual check by an experienced grader or ophthalmologist is required. Therefore, with the estimated uncertainty, our AI model is capable of giving reliable diagnoses for retinal diseases involved in the training data and of avoiding confusion from OOD samples.
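The triage logic described above amounts to a simple gate on the uncertainty score. A minimal sketch in Python (the function name and the example probability vectors are hypothetical; 0.1158 is the threshold θ derived later from the validation set):

```python
def triage(probabilities, uncertainty, threshold=0.1158):
    """Return the model's diagnosis when the uncertainty score is below the
    threshold theta; otherwise flag the case for manual review.

    `probabilities` is the per-category probability vector p_i and
    `uncertainty` is the overall uncertainty score mu."""
    if uncertainty < threshold:
        label = max(range(len(probabilities)), key=probabilities.__getitem__)
        return ("reliable", label)
    return ("needs manual check", None)

# A confident in-distribution prediction is accepted ...
print(triage([0.05, 0.9, 0.05], uncertainty=0.03))   # ('reliable', 1)
# ... while an ambiguous or OOD sample is routed to an ophthalmologist.
print(triage([0.4, 0.35, 0.25], uncertainty=0.62))   # ('needs manual check', None)
```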

Performance in the internal testing dataset
In the internal testing set with 2,010 images, our UIOS achieved an F1 score ranging from 93.12% to 99.27% for the 9 categories, especially for pathologic myopia (PM, 98.84%), glaucoma (GL, 98.53%), retinal detachment (RD, 99.27%), and diabetic retinopathy (DR, 98.04%) (Table 1). The average area under the curve (AUC) (Fig. 2), precision (Supplementary Table 1), F1 score (Table 1), sensitivity (Supplementary Table 2), and specificity (Supplementary Table 3) of the UIOS model were 99.79%, 97.57%, 97.29%, 97.04%, and 99.75%, respectively, better than those of the standard AI model; the difference was statistically significant for F1 (p=0.004) but not for AUC (p=0.565). Furthermore, UIOS also outperformed the standard AI model in terms of the confusion matrix (Supplementary Fig. 1). Note that when an image is flagged by the UIOS model as "uncertain", i.e., its uncertainty score exceeds the threshold, it is referred for double-checking by ophthalmologists and excluded when calculating the final diagnostic performance metrics.
The distribution of the uncertainty scores in the primary testing set was similar to that in the validation set, except that 9.75% of samples had uncertainty scores above the threshold (Fig. 3; Fig. 2(c)).
In addition, we compared the performance of UIOS with other commonly used uncertainty methods, including Monte-Carlo drop-out (MC-Drop), ensemble models (Ensemble), test-time augmentations (TTA), and entropy across the categorical class probabilities of the standard AI model (Entropy). Our UIOS model consistently outperformed these uncertainty approaches in terms of F1 score, both on the original internal testing set (Supplementary Table 5) and on the dataset in which samples with uncertainty scores above the threshold had been referred for double-checking by ophthalmologists (Supplementary Table 6). Statistical analyses showed that the differences were significant, except for the comparison of UIOS to Ensemble in the internal testing set with thresholding (Supplementary Table 7). The receiver operating characteristic (ROC) curves of the different uncertainty methods are shown in Supplementary Fig. 2.

Performance in the external datasets
To further evaluate the generality of UIOS for screening fundus diseases, we also conducted experiments on two external datasets: target categories from JSIEC1000 (TC-JSIEC) and unseen target categories (TC-unseen), with 435 and 3,716 images, respectively (Fig. 1). Furthermore, when applying our thresholding strategy (UIOS+thresholding) to flag samples with uncertainty scores above the threshold for manual checking by ophthalmologists, we observed a further significant improvement in the confusion matrix and a significant reduction in misclassified samples (Supplementary Fig. 1). The standard AI model not only gave wrong diagnostic results, but also assigned them high probability values, which led to mis-/under-diagnosis. In contrast, although UIOS could also give wrong diagnostic results, those predictions were indicated to be unreliable by the high uncertainty scores assigned to them. The high uncertainty score suggested the need for an ophthalmologist to read the images again to prevent mis-/under-diagnosis.
We further compared the performance of our proposed UIOS to other uncertainty approaches in these two external testing sets. The results showed that our UIOS model achieved higher F1 scores (Supplementary Table 10). While the standard AI model takes the fundus disease category with the highest probability score as the final diagnosis, our UIOS not only gives the probability scores but also provides the corresponding uncertainty score to reflect the reliability of the prediction. If the uncertainty score is below the threshold θ, the model prediction is considered reliable; conversely, if the uncertainty score is above the threshold θ, the result is considered unreliable, and manual double-checking is required to avoid possible misdiagnosis. US: uncertainty score. θ: threshold.

Open set anomaly detection
In

Discussion
In the past few years, deep learning-based methods for the detection of retinal diseases have shown a rapidly growing trend [13][14][15]. In addition, our proposed uncertainty paradigm is highly scalable and can be combined with, and enhance the performance of, currently common baseline models for retinal disease screening. Besides assigning a probability to OOD samples as the standard AI model does, our UIOS model also assigns a high uncertainty score to indicate that the final decision is unreliable and needs a double-check. US: uncertainty score.
Recently, numerous methods have been developed to detect abnormalities in fundus images using various deep neural networks [19][20][21][22]. These models were trained with normal images only and detected abnormal images in the testing set. Although they have achieved an AUC of 0.8 to 0.9, these methods can only differentiate abnormal from normal images. Moreover, a standard softmax classifier can assign OOD samples low-entropy probability values, leading to high-confidence predictions that are incorrect. Consequently, relying solely on entropy may not provide robust detection or handling of out-of-distribution data. Evidential subjective-logic uncertainty calculates the uncertainty score directly from the evidence collected by the feature extractor network [32][33][34].
The potential capacity of subjective logic to estimate the reliability of classification has been explored by Han et al. [33], who introduced the Dirichlet distribution into subjective logic (SL) to derive the probabilities of different classes and the overall uncertainty score. However, they did not explore how to detect OOD samples based on uncertainty in a quantitative manner. Our previous studies have introduced evidential uncertainty for uncertainty estimation in lesion segmentation of medical images [35,36].
Recently, two groups reported that estimating uncertainty improved the prediction of cancer by digital histopathology [37,38]. However, the uncertainty was estimated for binary classification only. In this study, we have improved evidential uncertainty and formalized uncertainty thresholding based on the internal validation dataset to conduct confidence evaluation on the testing datasets for detecting fundus anomalies.
In general, compared to these uncertainty approaches, our evidential learning-based uncertainty method has two advantages: 1) Our UIOS method directly calculates the belief masses of the different categories and the corresponding uncertainty score by mapping the features learned by the backbone network to the space of Dirichlet parameter distributions. Therefore, UIOS is trainable end-to-end, making it easy to implement and deploy. 2) The Dirichlet-based evidential uncertainty method provides well-calibrated uncertainty estimates. It offers reliable uncertainty measurements that align with the true confidence level of the model's predictions, which is supported by the results of this study. This is crucial for applications where accurate assessment of uncertainty is essential, especially in medical diagnosis or critical decision-making scenarios [39,40]. We recognize limitations and the need for improvements in the current study. First, some samples exhibited correct predictions with uncertainty higher than the threshold, resulting in additional labor requirements. Therefore, additional efforts are necessary to enhance UIOS's ability to learn ambiguous features, to further improve its reliability in predicting fundus diseases while reducing the need for manual reconfirmation. Second, we focused solely on classifying fundus images into one main disease category. In the next phase, we will collect more data with multi-label classifications and explore uncertainty evaluation methods for reliable multi-label disease detection. Third, the model will be tested on more datasets. Samples with high uncertainty scores will be further assessed, and retraining will be performed with the expanded dataset. Fourth, our proposed UIOS with the thresholding strategy will be applied to other image modalities (such as OCT, CT, MRI, and histopathology) and combined with artificial intelligence techniques for diagnosing specific diseases.

As indicated in Supplementary
In conclusion, the UIOS model combined with the thresholding strategy is capable of accurately classifying the retinal diseases included in the training data while flagging OOD samples for manual checking, providing a robust approach for real-world screening of retinal anomalies.

Target categories fundus photo datasets
This study was approved by the Joint Shantou International Eye Center Institutional Review Board and adhered to the principles of the Declaration of Helsinki. Data were desensitized and did not require subject notification. The data collection and labelling procedure is shown in Supplementary Fig. 4. Fundus images from 5 eye clinics with different models of fundus cameras were collected. Two trained graders performed the annotation independently. If their results were inconsistent, a retinal sub-specialist with more than 10 years' experience would make the final decision. The numbers of images in each category within each dataset are listed in Supplementary Table 15.
We collected 10,034 fundus images covering 8 different fundus diseases or normal condition.
They were named the primary target categories (TC) dataset. These images were randomly divided into training (6,016), validation (2,008), and testing (2,010) sets, in a ratio of approximately 3:1:1 (Supplementary Table 16).
To further validate the capacity of our proposed UIOS to screen retinal diseases, we also conducted experiments on the public dataset of JSIEC [15], which contained 1,000 fundus images from different subjects with 39 types of diseases and conditions.
Among them, 435 fundus images fell within the target categories and were set as the TC-JSIEC dataset.

Non-fundus photo datasets
Three non-fundus photo public datasets were used to evaluate the performance of AI models in detecting OOD samples. The first was the VOC2012 dataset, with 17,125 natural images of 21 categories [41]. The second was RETOUCH dataset which consisted of 6,936 2-D retinal optical coherence tomography images [42]. The third was our OCTA dataset collected from our eye clinic, consisting of 304 2D retinal OCTA images.

Framework of the standard AI model
As shown in Fig. 1, the standard AI model consisted of a backbone network for extracting the feature information in fundus images, with a Softmax classifier layer adopted to produce the prediction results based on the features from the backbone network. For deep learning-based disease detection, pre-trained ResNet-50 [43] has been widely used as a backbone network to extract the rich feature information contained in medical images and has achieved excellent performance [44][45][46][47]. Therefore, in this study, we employed pre-trained ResNet-50 as our backbone network.
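To illustrate the failure mode discussed in the Introduction, a small NumPy sketch (the logit values are hypothetical) shows why a softmax layer always commits to one of the 9 training categories: its output sums to 1 by construction, so even a non-fundus input receives a full probability distribution over the known classes.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical backbone logits for an in-distribution fundus image ...
in_dist = softmax(np.array([6.0, 0.5, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]))
# ... and for a non-fundus image: the logits are flatter, but softmax still
# distributes all probability mass over the 9 training categories.
ood = softmax(np.array([1.2, 1.0, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6]))

print(in_dist.argmax(), ood.argmax())  # 0 0 — both still output a "diagnosis"
```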

Framework of UIOS
As shown in Fig. 1, our proposed UIOS architecture was simple and mainly consisted of a backbone network to capture feature information and an uncertainty-based classifier that generated the final diagnosis together with an uncertainty score, leading to more reliable decision-making without loss of accuracy. To ensure experimental objectivity, we adopted pre-trained ResNet-50 as our backbone to capture the feature information contained in fundus images. After a fundus image passed through ResNet-50, the final decision and the corresponding overall uncertainty score were obtained by our uncertainty-based classifier, which was composed of three steps. Specifically, consider a K-class retinal fundus disease detection task.
Step (1): Obtain the evidence feature $E = [e_1, \dots, e_K]$ for the different fundus diseases by applying the Softplus activation function, ensuring the feature values are larger than 0:

$E = \mathrm{Softplus}(F_{Out}),$

where $F_{Out}$ is the feature information obtained from the ResNet-50 backbone.
Step (2): Parameterize $E$ to a Dirichlet distribution:

$\alpha_k = e_k + 1,$

where $\alpha_k$ and $e_k$ are the $k$-th category Dirichlet distribution parameter and evidence, respectively.
Step (3): Calculate the belief masses and the corresponding uncertainty score:

$b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S},$

where $S = \sum_{k=1}^{K}(e_k + 1) = \sum_{k=1}^{K}\alpha_k$ is the Dirichlet strength. It can be seen from Eq. 3 that the probability assigned to category $k$ is proportional to the observed evidence for category $k$. Conversely, the less total evidence obtained, the greater the total uncertainty. The belief assignment can be considered a subjective opinion. The probability of the $k$-th retinal fundus disease was computed as $p_k = \frac{\alpha_k}{S}$ based on the Dirichlet distribution [48] (the definition of the Dirichlet distribution is detailed in Sec. 4.6). In addition, to further improve the performance of our UIOS, we also explore a novel loss function to guide its optimization; the details are given in Sec. 4.7.
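The three steps can be sketched in NumPy (a sketch with a hypothetical 9-dimensional feature vector standing in for the backbone output $F_{Out}$):

```python
import numpy as np

K = 9  # number of fundus disease categories in the training set

def softplus(x):
    return np.log1p(np.exp(x))

def evidential_head(f_out):
    """Steps (1)-(3): backbone features -> evidence -> Dirichlet -> opinion."""
    e = softplus(f_out)        # (1) non-negative evidence e_k
    alpha = e + 1.0            # (2) Dirichlet parameters alpha_k = e_k + 1
    S = alpha.sum()            # Dirichlet strength S = sum_k alpha_k
    b = e / S                  # (3) belief mass b_k = e_k / S
    u = K / S                  #     overall uncertainty score u = K / S
    p = alpha / S              # expected probability p_k = alpha_k / S
    return b, u, p

# Abundant evidence for one class -> low uncertainty score ...
_, u_clear, _ = evidential_head(np.array([80.0, -4, -4, -4, -4, -4, -4, -4, -4]))
# ... flat, weak evidence -> uncertainty score close to 1.
_, u_flat, _ = evidential_head(np.full(K, -4.0))
print(round(u_clear, 3), round(u_flat, 3))  # 0.101 0.982
```

Note the identity $\sum_k b_k + u = 1$: the belief masses and the uncertainty always share one unit of total mass.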

Definition of Dirichlet distribution
The Dirichlet distribution is parameterized by its K concentration parameters $\alpha = [\alpha_1, \dots, \alpha_K]$. Its probability density function is

$D(p \mid \alpha) = \begin{cases} \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}, & p \in S_K, \\ 0, & \text{otherwise,} \end{cases}$

where $S_K$ is the K-dimensional unit simplex,

$S_K = \left\{ p \;\middle|\; \sum_{k=1}^{K} p_k = 1,\; 0 \le p_1, \dots, p_K \le 1 \right\},$

and $B(\alpha)$ is the K-dimensional multinomial beta function.
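A direct transcription of this density in plain Python (log-gamma is used for $B(\alpha)$ to avoid overflow):

```python
import math

def dirichlet_pdf(p, alpha):
    """Density D(p | alpha), with B(alpha) = prod_k Gamma(alpha_k) / Gamma(sum_k alpha_k).
    Returns 0 for points off the unit simplex (or on its boundary)."""
    if abs(sum(p) - 1.0) > 1e-9 or any(x <= 0 for x in p):
        return 0.0
    log_b = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    log_pdf = sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_b
    return math.exp(log_pdf)

# alpha = [1, 1] is uniform on the 1-simplex, so the density is 1 everywhere ...
print(dirichlet_pdf([0.3, 0.7], [1.0, 1.0]))   # 1.0
# ... while alpha = [2, 2] concentrates mass near the centre of the simplex.
print(dirichlet_pdf([0.5, 0.5], [2.0, 2.0]))   # ≈ 1.5
```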

Loss function
The cross-entropy (CE) loss function has been widely employed in most previous disease detection studies. In our work, subjective logic (SL) associates the Dirichlet distribution with the belief distribution under the framework of evidential uncertainty theory to obtain the probabilities of the different fundus diseases and the corresponding overall uncertainty score, based on the evidence collected from the backbone network. We can therefore work out the Dirichlet distribution parameters $\alpha = [\alpha_1, \dots, \alpha_K]$ and obtain the multinomial opinions $D(p_i \mid \alpha_i)$, where $p_i$ is the class assignment probability on a simplex. Similar to TMC [33], the CE loss is modified as

$\mathcal{L}_{UN} = \mathcal{L}_{UN\text{-}CE} + \lambda \mathcal{L}_{KL},$

where $\mathcal{L}_{UN\text{-}CE}$ ensures that the correct prediction for each sample yields more evidence than the other classes, while $\mathcal{L}_{KL}$ ensures that incorrect predictions yield less evidence. $\lambda$ is a balance factor that is gradually increased, so as to prevent the model from paying too much attention to the KL divergence in the initial stage of training, which might result in a lack of good exploration of the parameter space and cause the network to output a flat uniform distribution.
The first term integrates the CE loss over the Dirichlet distribution,

$\mathcal{L}_{UN\text{-}CE} = \int \left[ \sum_{k=1}^{K} -y_k \log p_k \right] \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp = \sum_{k=1}^{K} y_k \left( \psi(S) - \psi(\alpha_k) \right),$

where $\psi(\cdot)$ is the digamma function and $B(\cdot)$ is the multinomial beta function for the concentration parameter $\alpha$. The second term is

$\mathcal{L}_{KL} = KL\left[ D(p \mid \tilde{\alpha}) \,\middle\|\, D(p \mid [1, \dots, 1]) \right] = \log \frac{\Gamma\!\left(\sum_{k=1}^{K} \tilde{\alpha}_k\right)}{\Gamma(K) \prod_{k=1}^{K} \Gamma(\tilde{\alpha}_k)} + \sum_{k=1}^{K} (\tilde{\alpha}_k - 1) \left[ \psi(\tilde{\alpha}_k) - \psi\!\left( \sum_{j=1}^{K} \tilde{\alpha}_j \right) \right],$

where $\tilde{\alpha} = y + (1 - y) \odot \alpha$ is the adjusted parameter of the Dirichlet distribution, which avoids penalizing the evidence of the ground-truth class to 0, and $\Gamma(\cdot)$ is the gamma function.
The uncertainty loss $\mathcal{L}_{UN}$ guides model optimization based on the feature distribution parameterized by the Dirichlet concentration. However, the Dirichlet parameterization also changes the original feature distribution, which may reduce the classifier's confidence in the parameterized features and thus limit performance. Therefore, to ensure confidence in the parameterized features during training, we further introduce a temperature cross-entropy loss ($\mathcal{L}_{TCE}$) to directly guide model optimization based on the parameterized features:

$\mathcal{L}_{TCE} = -\sum_{k=1}^{K} y_k \log \frac{\exp(b_k / \tau)}{\sum_{j=1}^{K} \exp(b_j / \tau)},$

where $b_k$ is the belief mass for the $k$-th class and $\tau$ is a temperature coefficient that adjusts the belief value distribution; its value is initialized at 0.01 and gradually increased to 1 to prevent low confidence in the belief mass distribution in the initial stage of training.
Therefore, the final loss function for optimizing our proposed model is formalized as

$\mathcal{L} = \mathcal{L}_{UN} + \mathcal{L}_{TCE}$

(the ablation experiments on the effectiveness of the loss function are shown in Supplementary Table 18).
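Assuming the evidential forms given above (the TMC-style integrated CE term, the KL term against the uniform Dirichlet, and a temperature softmax over the belief masses), the full objective can be sketched with NumPy and SciPy. The function names are ours, and the annealing schedules for λ and τ are left to the caller:

```python
import numpy as np
from scipy.special import digamma, gammaln

def loss_un_ce(alpha, y):
    """L_UN-CE = sum_k y_k (psi(S) - psi(alpha_k)) for one-hot label y."""
    return float((y * (digamma(alpha.sum()) - digamma(alpha))).sum())

def loss_kl(alpha, y):
    """KL[D(p | alpha_tilde) || D(p | 1)] with alpha_tilde = y + (1 - y) * alpha,
    so the evidence of the ground-truth class is not penalised."""
    a = y + (1.0 - y) * alpha
    log_norm = gammaln(a.sum()) - gammaln(float(a.size)) - gammaln(a).sum()
    return float(log_norm + ((a - 1.0) * (digamma(a) - digamma(a.sum()))).sum())

def loss_tce(belief, y, tau):
    """Temperature cross-entropy on the belief masses b_k."""
    z = belief / tau
    log_softmax = z - (z.max() + np.log(np.exp(z - z.max()).sum()))
    return float(-(y * log_softmax).sum())

def uios_loss(evidence, y, lam, tau):
    alpha = evidence + 1.0
    belief = evidence / alpha.sum()
    return loss_un_ce(alpha, y) + lam * loss_kl(alpha, y) + loss_tce(belief, y, tau)
```

When all the wrong-class evidence is already zero, the adjusted parameters are all ones and the KL term vanishes, as intended.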

Uncertainty thresholding strategy
In this study, the threshold θ was determined using the distribution of uncertainty scores in our validation dataset, as shown in Supplementary Table 5. Prediction results with uncertainty scores below the threshold θ were more likely to be correct, i.e., diagnostic results of high confidence.
Conversely, decisions with uncertainty scores higher than θ were considered more likely to be unreliable and to need assessment by an ophthalmologist. To obtain the optimal threshold value, we computed the ROC curve, i.e., all possible true positive rates (TPRs) and false positive rates (FPRs) for detecting wrong predictions in the validation dataset, based on the wrong-prediction labels $\bar{U} = [\bar{u}_1, \dots, \bar{u}_n]$ and the uncertainty scores $U = [u_1, \dots, u_n]$, where n is the total number of samples in the validation dataset and $\bar{U}$ is obtained by

$\bar{u}_i = \begin{cases} 1, & P_i \neq Y_i, \\ 0, & P_i = Y_i, \end{cases}$

with $P_i$ and $Y_i$ the final prediction result and the ground truth of the $i$-th sample in the validation dataset. Inspired by Youden's index [49], the objective function based on the TPRs, FPRs, and thresholds of the validation dataset is formalized as $\ell(\theta) = 2 \cdot TPR(\theta) - FPR(\theta)$, and the final optimal threshold is calculated by $\theta = \arg\max_{\theta} \ell(\theta)$. We obtained an optimal threshold θ of 0.1158; a model prediction is considered reliable if its uncertainty score is below θ, and unreliable otherwise.
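The threshold search can be sketched in NumPy: label each validation sample as wrongly predicted or not, sweep candidate thresholds over the observed uncertainty scores, and maximise ℓ(θ), treating a wrong prediction whose uncertainty reaches θ as a true positive (a sketch on toy data; the scores below are hypothetical):

```python
import numpy as np

def optimal_threshold(uncertainty, wrong):
    """uncertainty: scores u_i; wrong: 1 if P_i != Y_i else 0."""
    best_theta, best_score = None, -np.inf
    for theta in np.unique(uncertainty):
        flagged = uncertainty >= theta            # samples sent for manual review
        tpr = (flagged & (wrong == 1)).sum() / max((wrong == 1).sum(), 1)
        fpr = (flagged & (wrong == 0)).sum() / max((wrong == 0).sum(), 1)
        score = 2.0 * tpr - fpr                   # Youden-inspired objective
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta

# Toy validation scores: wrong predictions tend to carry higher uncertainty.
u = np.array([0.02, 0.05, 0.08, 0.30, 0.55, 0.70])
w = np.array([0,    0,    0,    1,    1,    1   ])
print(optimal_threshold(u, w))  # 0.3: flags all wrong, no correct, predictions
```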

Experimental deployment
We trained our UIOS and the comparison methods, including the standard AI model, Monte-Carlo drop-out (MC-Drop), ensemble models (Ensemble), test-time augmentations (TTA), and entropy across the categorical class probabilities (Entropy), on the public platform PyTorch with an Nvidia GeForce RTX 3090 GPU (24 GB). Adam was adopted as the optimizer; its initial learning rate and weight decay were both set to 0.0001. The batch size was set to 64. To improve the computational efficiency of the model, we resized the images to 256×256. Meanwhile, online random left-right flipping was applied for data augmentation. In addition, to reduce the time and effort of training multiple models for the ensemble, we used snapshot ensembles [50] to obtain multiple weights for ResNet-50 from different checkpoints in a single training run. We also compared and analyzed the AUC and F1 scores of the different methods.
Additional data sets supporting the findings of this study were not publicly available due to the confidentiality policy of the Chinese National Health Council and institutional patient privacy regulations. However, they were available from the corresponding authors upon request. For replication of the findings and/or further academic and AI-related research activities, data may be requested from corresponding author H. Chen within 10 working days. Source data are provided in this paper.