Automated self-supervised learning for skin lesion screening

Melanoma, the deadliest form of skin cancer, has seen a steady increase in incidence rates worldwide, posing a significant challenge to dermatologists. Early detection is crucial for improving patient survival rates. However, performing total body screening (TBS), i.e., identifying suspicious lesions or ugly ducklings (UDs) by visual inspection, can be challenging and often requires sound expertise in pigmented lesions. To assist users of varying expertise levels, an artificial intelligence (AI) decision support tool was developed. Our solution identifies and characterizes UDs from real-world wide-field patient images. It employs a state-of-the-art object detection algorithm to locate and isolate all skin lesions present in a patient’s total body images. These lesions are then sorted based on their level of suspiciousness using a self-supervised AI approach, tailored to the specific context of the patient under examination. A clinical validation study was conducted to evaluate the tool’s performance. The results demonstrated an average sensitivity of 95% for the top-10 AI-identified UDs on skin lesions selected by the majority of experts in pigmented skin lesions. The study also found that the tool increased dermatologists’ confidence when formulating a diagnosis, and the average majority agreement with the top-10 AI-identified UDs reached 100% when assisted by our tool. With the development of this AI-based decision support tool, we aim to address the shortage of specialists, enable faster consultation times for patients, and demonstrate the impact and usability of AI-assisted screening. Future developments will include expanding the dataset to include histologically confirmed melanoma and validating the tool for additional body regions.


Introduction
Malignant melanoma is the most serious form of skin cancer, due to its ability to metastasise and quickly spread to other organs. However, if diagnosed early enough, it can be removed by surgical intervention. Worldwide projections show an expected increase of 50% in incidence and 68% in death rate 1 by 2040. Combined with the shortage of dermatologists to cope with such demand, this makes it necessary to explore options to support experts in fighting this growing disease burden. Diagnosing skin cancer requires experience and unique skills that can only be acquired through proper training, which limits the professionals who can perform it properly. In a study funded by the Swiss Cancer League 2 , it was demonstrated that dermatological training of general practitioners (GPs) had a profound effect on their performance in skin cancer diagnosis. However, this improvement was only temporary 3 , with its benefits fading within 12 months of the intervention. To support dermatologists in the most time-consuming and error-prone task, total body screening, it is necessary to develop a reliable tool with high diagnostic accuracy that sustains improvement in the long term and enables the involvement of additional non-experts such as nurses, technicians or GPs. Artificial Intelligence (AI) has attracted considerable interest in the analysis of dermoscopic images, as seen in competitions organized by the ISIC foundation 4 and in several high-impact publications. The study referenced in 5 highlights the effectiveness of AI-based tools in diagnosing skin cancer through dermoscopic imaging of a single lesion. Despite outperforming board-certified dermatologists in certain scenarios, the use of AI in routine consultations still faces a significant challenge in performing total body screening (TBS). AI support in TBS could therefore overcome this bottleneck and further improve skin cancer detection and diagnosis. During the TBS, the expert visually inspects the whole
body and selects a few lesions, also known as ugly ducklings (UDs), for further evaluation with a dermoscope. The concept of the UD as a potential melanoma candidate that deviates from the individual's nevus pattern was first introduced in 6 . Only recently have AI-based research studies tackled the screening process by determining and isolating suspicious lesions or UDs while considering the patient's context, as shown in studies such as 7 . However, this pioneering study 7 has a significant limitation concerning generalization and real-world application, because it relied on datasets hand-labeled by three experienced dermatologists. Since each patient's context is different and manual labeling is costly and time-intensive, an alternative, labeling-independent approach is clearly necessary. To address this issue, 8 proposes the use of a Variational Autoencoder (VAE) for outlier (ugly duckling) detection. Still, certain limitations were identified in this context as well: the non-uniformity of the dataset, which was obtained from multiple sources, and the pretraining of the VAE on a small patient data pool, a practice that could introduce bias into the predictions. Furthermore, validating the algorithm against a single dermatologist raises several concerns, encompassing limited generalizability, a lack of sample diversity, the potential for single-participant bias, and limited error detection.
To address these limitations, we propose adopting a self-supervised architecture. This architecture is trained on real-world data (RWD) acquired during routine consultations at the Dermatology Clinic of the University Hospital Zurich (USZ). Moreover, we propose validating this approach against multiple experienced dermatologists, and lastly, we present a clinical validation involving diverse groups of interest. We can summarize the main contributions of this study as follows:
• Use of real-world data of total body screening (RWD TBS), acquired in a standardized manner during routine consultations at the Dermatology Clinic of the USZ, for training, validation and testing;
• Adoption and development of a self-supervised approach for UD sign characterization based on analysis of a single patient's context;
• Clinical validation of the algorithm's predictions against different groups of interest and evaluation of its impact as an AI-based decision support tool for melanoma screening.

General Approach
A self-supervised AI algorithm was developed for the automatic detection and characterization of UDs in high-risk patients (i.e., those with more than 100 nevi). An established collaboration with the Dermatology Clinic of the USZ and the CRPP provided access to a real-world dataset and to the medical experts essential for a reliable clinical validation of AI-assisted total body screening. To simplify the project, we broke down the overall goal into smaller, manageable objectives. The first step consisted of identifying and extracting skin lesions from total body images of patients. The extracted lesions were then embedded into a meaningful representation in which similar lesions were grouped together and scored according to their similarity to the average appearance of the lesions. In the final step, we clinically validated our AI pipeline with medical experts.

Data Acquisition
Informed consent was obtained from all participants regarding the use of their data for this study, which was approved by Swiss Ethics under project 2020-01937. The Dermatology Clinic of the USZ owns FotoFinder© ATBM devices, which take high-quality polarized total-body images in a semi-automatic fashion. To date, the Dermatology Clinic of the USZ has documented 90 patients. For the present study, the dataset was limited to the assessment of lesions in the dorsal region of 20 patients.

Skin Lesion Detection
To apply the UD criteria to patients for skin lesion screening, all lesions must first be identified. This process can be modeled as an object detection problem. In the first step, we divided our data into three independent patient sets, as is common in supervised learning: a training set to train the algorithm, a validation set to optimize the generalizability of the model, and a test set to evaluate its performance. The next step consisted of creating labels for the supervised model. After consulting with a medical expert, it was decided to include anything that could potentially be considered a skin lesion, such as freckles, in the labeling process, since it is difficult, even for experts, to clearly distinguish what is important and what is not. On the one hand, it is important to provide the AI algorithm with as much patient context as possible if it is to meet the UD criterion. On the other hand, it was decided that any unfavorable prediction resulting from this labeling strategy could be discarded at a later stage in the overall AI pipeline. The labeling process can be quite lengthy for patients with many skin lesions. To speed it up, a blob detector such as the one from OpenCV was used before starting the manual labeling process. One of the challenges of our desired application is to detect small objects in high-resolution images. Most deep learning approaches scale the images down to a lower resolution to make them computationally feasible. However, this usually results in a loss of information due to the lower input resolution. The common practice to overcome this problem in aerial imagery 9 , also followed by 10 , is to create overlapping regions of the original images and labels before sending them to the AI model for training or inference. This approach was also followed in our study to ensure we did not miss any skin lesions. After successful training and optimization of our object detection model, the predictions from the extracted
regions are merged back together for testing, and Non-maximum Suppression (NMS) is applied to the merged results with an Intersection over Union (IoU) threshold of 10%. The final version of the model was trained for 900 epochs on 15 patients, validated on 3, and tested on 4. We calculated test results using well-known object detection metrics, including Recall, Precision, Average Recall (AR) and Average Precision (AP).
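As a concrete illustration, the tiling-and-merging step described above can be sketched as follows. This is a minimal sketch, not the actual training code: the tile size, overlap, and greedy NMS implementation are our own illustrative choices, with only the 10% IoU threshold taken from the text.

```python
import numpy as np

def make_tiles(img_h, img_w, tile=1280, overlap=256):
    """Return (y0, x0) origins of overlapping tiles covering the full image.
    Tile size and overlap are illustrative; images smaller than one tile
    yield a single origin at (0, 0)."""
    stride = tile - overlap
    ys = list(range(0, max(img_h - tile, 0) + 1, stride)) or [0]
    xs = list(range(0, max(img_w - tile, 0) + 1, stride)) or [0]
    # make sure the last tile touches the image border
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    return [(y, x) for y in ys for x in xs]

def nms(boxes, scores, iou_thr=0.1):
    """Greedy NMS on merged predictions; boxes are (x1, y1, x2, y2)
    in global image coordinates, iou_thr is the 10% threshold from the text."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thr]
    return keep
```

In practice the tile-local predictions are first shifted by their tile origin into global coordinates before NMS is applied to the concatenated set.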

Model Selection
For this specific application, one-stage detection models were assumed to be the best choice in terms of real-time applicability. The general strategy for model selection was to aim for fast inference first and switch to two-stage detection models in case of low accuracy. YOLOR 11 was the architecture that best met our criteria, summarized below, on the general MS COCO dataset compared to other available models.
Model selection criteria: • No additional training data are used.
• Low inference time for high resolution imagery.
• High accuracy for small objects measured by Average Precision for Small Objects (APS).
• Backbone architecture well suited for identifying local patterns, i.e., CNNs.
We made the following changes to the official YOLOR architecture for our specific needs:
• Changed the model architecture from 80 target classes to 1.
• Tailored the aspect ratios of the anchor boxes by applying KNN algorithms to our training data.
• Disabled normalization for color channels to avoid unwanted predictions for shadow areas in the image.
• Increased the maximum number of allowed bounding boxes per image.
• Disabled NMS in inference mode, since it is applied after merging the overlapping regions.

Ugly Duckling Detector
Our main goal was to find an informative feature representation for the extracted skin lesions that approximates the workflow of dermatologists, who compare lesions based on their similarity. To sort the skin lesions identified and extracted by our object detection model according to their suspiciousness, we had to embed them in a latent space that provides a meaningful representation according to lesion similarity. In this sense, UDs should be embedded further away from the majority of the lesions within the latent space. A self-supervised approach allowed us to work around the challenges presented in our introduction, namely bias due to mislabeled data or the context of training patients, and the lack of publicly available labeled datasets. There exist many potential candidates for accomplishing this goal. We tried to find the model that best meets the following criteria:
• Low number of model parameters, since we provide only the lesions of our test patient during training to avoid bias due to the context of other patients.
• High accuracy values when using KNN for image classification on general datasets such as ImageNet.
Considering these requirements, DINO, first introduced by Facebook in 2021, was selected for our representation task, with ResNet18 as the backbone architecture 12 .

Poorly Illuminated Lesions Filter
One of the first steps before applying our self-supervised approach was to filter out unwanted lesions, such as poorly illuminated lesions, which influenced the UD selection by producing high outlier values. Our general strategy was to initially exclude these lesions and later acquire them from another image perspective with better illumination conditions. A probabilistic approach was used to identify and exclude poorly illuminated lesions. This approach was provided with a hand-crafted feature, namely the mean intensity level of their frame pixels. The predicted bounding boxes always included part of the skin around the lesions, making the pixels of their frames a perfect candidate for detecting poor illumination conditions. The probabilistic approach excluded as poorly illuminated all lesions whose feature value fell more than two standard deviations below the mean.
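The filter can be sketched as below. This is an illustrative reconstruction: the width of the border region treated as the "frame" is our own assumption, while the mean-intensity feature and the two-sigma outlier rule are taken from the text.

```python
import numpy as np

def frame_brightness(crop, border=3):
    """Mean intensity of the border ('frame') pixels of a lesion crop,
    i.e. the surrounding skin included in the bounding box.
    The border width of 3 pixels is an illustrative assumption."""
    g = crop.mean(axis=-1) if crop.ndim == 3 else crop  # to grayscale
    mask = np.zeros(g.shape, dtype=bool)
    mask[:border, :] = mask[-border:, :] = True
    mask[:, :border] = mask[:, -border:] = True
    return float(g[mask].mean())

def well_illuminated(crops, border=3, n_sigma=2.0):
    """Flag lesions whose frame brightness is NOT a two-sigma low outlier
    relative to all lesions of the same patient image."""
    b = np.array([frame_brightness(c, border) for c in crops])
    return b >= b.mean() - n_sigma * b.std()
```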

Preprocessing of Detected Skin Lesions
In deep learning approaches, the input dimensions must be identical for each data instance; if this is not the case, the model scales the dimensions to match. Most of our extracted lesions have a square shape. In the remaining cases, however, we need to ensure that no scaling takes place that could mislead the UD selection, e.g., by changing the width-to-height ratio. As a solution, for lesions whose width-to-height ratio is not 1, we proposed constant padding of the shorter dimension using the mean frame pixel values.
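A minimal sketch of this padding step, assuming RGB lesion crops as numpy arrays; the exact frame region used for the mean is our assumption (here, the two border rows or columns of the shorter dimension):

```python
import numpy as np

def pad_to_square(crop):
    """Pad a lesion crop to a 1:1 aspect ratio with a constant value so that
    later resizing cannot distort the width-to-height ratio. The constant is
    the mean of the frame pixels along the shorter dimension."""
    h, w = crop.shape[:2]
    if h == w:
        return crop.astype(float)
    side = max(h, w)
    if h < w:
        fill = crop[[0, -1], :].mean(axis=(0, 1))  # mean of top/bottom frame rows
    else:
        fill = crop[:, [0, -1]].mean(axis=(0, 1))  # mean of left/right frame columns
    out = np.full((side, side, crop.shape[2]), fill, dtype=float)
    y0 = (side - h) // 2
    x0 = (side - w) // 2
    out[y0:y0 + h, x0:x0 + w] = crop  # paste the original, unscaled crop
    return out
```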

Self-supervised Ugly Duckling Approach
The next step involved a pretext task to create pseudo labels in our chosen self-supervised approach. For our pretext task, many data augmentation processes from the official DINO publication were excluded or adapted to our task. The following list gives an overview of the data augmentation processes used:
• Upscaling of each lesion image to 224x224 pixels, ensuring that no lesion is downscaled.
• Random brightness jittering to further counteract poorly illuminated conditions.
• Random rescaling and cropping as part of the general augmentation strategy of DINO.
• Rescaling of each color channel by a factor of 10 to increase the significance of color differences.
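The list above can be sketched with plain numpy as follows; the actual pipeline builds on DINO's torchvision transforms, and the jitter strength, crop fraction, and the exact form of the x10 channel rescaling (here, amplification around the per-channel mean) are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale_nearest(img, size=224):
    """Nearest-neighbour upscaling to size x size; lesion crops are smaller
    than the target, so no lesion is downscaled."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[ys][:, xs]

def brightness_jitter(img, strength=0.3):
    """Random multiplicative brightness change (strength is an assumption)."""
    return np.clip(img * rng.uniform(1 - strength, 1 + strength), 0, 255)

def random_crop(img, min_frac=0.6):
    """Random rescale-and-crop, a stand-in for DINO's multi-crop strategy."""
    h, w = img.shape[:2]
    ch = int(h * rng.uniform(min_frac, 1.0))
    cw = int(w * rng.uniform(min_frac, 1.0))
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    return img[y0:y0 + ch, x0:x0 + cw]

def rescale_channels(img, factor=10.0):
    """Amplify colour differences by rescaling each channel around its mean;
    one plausible reading of the 'factor of 10' rescaling in the text."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    return (img - mean) * factor + mean
```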
To avoid possible biases arising from the contexts of other patients, the model was trained directly and only on the test patient's lesions, for at least 200 epochs. After 200 epochs, a specific stopping criterion was applied: when the ranking of the top 10 UDs no longer changes, the model is considered converged and training is terminated. The maximum number of epochs was limited to 300 for this study. In inference mode, only the preprocessing steps on the image resolution and color channels were performed. The embeddings of the lesions were created using the teacher backbone. Using the cosine similarity distance, a min-max-normalized UD score was calculated for each lesion embedding relative to the median of the lesion embeddings. The top 10 scoring lesions were then proposed as our AI UDs. Figure 1 summarizes our proposed approach.
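The scoring step can be sketched as follows: a minimal reconstruction of the cosine-distance-to-median ranking described above, with the embedding array assumed to come from the trained teacher backbone.

```python
import numpy as np

def ugly_duckling_scores(embeddings, top_k=10):
    """Rank lesions by cosine distance from the median embedding of the patient.

    embeddings: (n_lesions, dim) array from the teacher backbone.
    Returns min-max-normalized UD scores and the indices of the top_k outliers."""
    ref = np.median(embeddings, axis=0)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    r = ref / np.linalg.norm(ref)
    cos_dist = 1.0 - e @ r                                   # distance to median
    scores = (cos_dist - cos_dist.min()) / (cos_dist.max() - cos_dist.min() + 1e-12)
    top = np.argsort(scores)[::-1][:top_k]                   # most outlying first
    return scores, top
```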

Clinical Validation Study Design
In our study, we asked participants to select suspicious-looking lesions on 20 dorsal patient images. They had to select lesions that, according to them, would require further examination with a dermoscope for melanoma screening, starting with the most unusual ones. After each patient image, they were also asked to rate their confidence in their selection on a scale from 1 to 5. Unlike other AI studies on the UD sign, we additionally conducted a clinical validation to assess the performance of our algorithm and better understand the impact of AI as a screening support tool. To this end, participants were asked to repeat the task with the same 20 patient images while being able to see the AI predictions. Patient images with AI predictions were displayed along with their UD scores, with the top-10 AI UDs colored red and the others colored green. For one patient, only the top-9 AI UDs were mistakenly colored red.

Participants
We wanted to include participants with various levels of expertise to explore the different impacts AI can have on their decisions and to understand the agreement among them.
Our participants consisted of:

Technical Design
To facilitate participation in the clinical validation, a web tool was developed. It allowed participants to draw bounding boxes on the lesions of the patient images and included an instructional video at the beginning explaining their task.

Data Post-Processing
After the validation was finished, it was observed that some participants did not draw the bounding boxes perfectly around the lesions. Therefore, the data had to be manually cleaned to ensure that the drawn bounding boxes could be matched correctly to the corresponding AI-detected lesions, where detected. A binary array was then created for each participant, with ones for the lesions they selected and zeros for the others. Lesions ranked beyond the top 20 and poorly illuminated lesions were disregarded in this evaluation.
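The matching step can be sketched as below. The IoU-based matching rule and its 50% threshold are our illustrative assumptions; the text only states that drawn boxes were matched to AI-detected lesions after cleaning.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def selection_vector(ai_boxes, drawn_boxes, iou_thr=0.5):
    """Binary array over the AI-detected lesions of one patient image:
    1 where some participant-drawn box overlaps the detected lesion."""
    return np.array([
        int(any(box_iou(ai, d) >= iou_thr for d in drawn_boxes))
        for ai in ai_boxes
    ])
```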

Performance Metrics
The agreement with our AI algorithm was measured primarily by calculating sensitivity values, defined as follows:

Top-u AI Sensitivity w.r.t. Participants Selection
Any lesion selected by a participant is considered an ugly duckling (UD) from their perspective. True Positives are thus the participant's UDs found by the AI algorithm among its top-u ranked lesions. False Negatives, accordingly, are the participant's UDs that were ranked below the top-u lesions by the AI algorithm or were not detected at all.
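This definition translates directly into code; a minimal sketch with hypothetical lesion ids:

```python
def top_u_ai_sensitivity(participant_sel, ai_ranking, u=10):
    """Sensitivity of the top-u AI ranking w.r.t. one participant's selection.

    participant_sel: set of lesion ids the participant marked as UDs.
    ai_ranking: lesion ids sorted by descending AI UD score; lesions the
    detector missed are simply absent from this list."""
    top_u = set(ai_ranking[:u])
    tp = len(participant_sel & top_u)   # participant UDs inside the AI top-u
    fn = len(participant_sel - top_u)   # ranked below top-u, or never detected
    return tp / (tp + fn) if tp + fn else float("nan")
```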

Average Participants Sensitivity w.r.t. Majority Selection of Experts
Similarly, we define the average sensitivity for each participant using the majority vote of the experts as the ground truth, namely the lesions selected by at least 2 of the 3 dermatologists with >10 years of experience. For each participant and each patient image, the sensitivity value with respect to the majority selection of experts was calculated and averaged at the end. True Positives are the lesions that were selected by the participant in question and were part of the respective majority selection of experts. False Negatives are the lesions that were not found by the participant in question but were part of the respective majority selection of experts. The average sensitivity of individual experts w.r.t. their majority selection was included as well.

Precision values are given for a selected confidence threshold of 20% and an IoU of 50%. Visual inspection revealed that the False Negatives were mainly caused by NMS filtering, which disregards bounding boxes for lesions close to each other, or by freckles that were scored below the 20% confidence level. False Positives, in turn, were mainly caused by missing labels for freckles. The Precision level at this confidence threshold is relatively high. The Average Precision-Recall curve, computed with an IoU threshold of 50%, also reflected this behavior: the model achieves perfect Precision at high confidence levels. Moreover, increasing the IoU threshold shows only a small decrease in the performance metrics. Visual inspection also made apparent that there were some patients for whom the model did not perform adequately, namely hairy patients, patients with tattoos, and patients with out-of-distribution lesions.
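The majority-vote ground truth and per-participant sensitivity defined at the beginning of this section can be sketched as follows (lesion ids here are hypothetical):

```python
def majority_selection(expert_selections, quorum=2):
    """Ground truth: lesions selected by at least `quorum` of the experts."""
    counts = {}
    for sel in expert_selections:
        for lesion in sel:
            counts[lesion] = counts.get(lesion, 0) + 1
    return {l for l, c in counts.items() if c >= quorum}

def avg_sensitivity_vs_majority(per_image_selections, per_image_expert_sels):
    """Average, over patient images, of one participant's sensitivity w.r.t.
    the experts' majority selection."""
    vals = []
    for sel, experts in zip(per_image_selections, per_image_expert_sels):
        gt = majority_selection(experts)
        if not gt:
            continue  # no majority UDs on this image
        tp = len(sel & gt)
        vals.append(tp / len(gt))  # FN = gt - sel, so TP/(TP+FN) = TP/|gt|
    return sum(vals) / len(vals)
```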

Poorly Illuminated Lesions Filter
Patient images with poorly illuminated lesions were used to evaluate the effectiveness of the filter method. When a distribution-based approach with the brightness level as the only feature was used, lesions located in shaded areas were identified accurately. An outlier threshold of 2 standard deviations below the mean brightness level of the skin lesions' frame pixels was visually confirmed to be a robust choice for detecting poorly illuminated lesions. The chosen feature displayed approximately Gaussian distributions, with the poorly illuminated lesions mostly located in the left tail. However, it was observed that brightness alone did not work well as a feature for patients with tanned areas. These patients exhibited a superposition of Gaussian distributions with different means, making it challenging to set an outlier threshold that would yield desirable results.

Self supervised Approach for Ugly Duckling Detection
Figure 3A shows a dorsal image of a patient, highlighting the 10 highest-rated lesions with red-bordered boxes and their corresponding scores. The remaining lesions are shown with green bounding boxes. This result provides an initial means of visually assessing the performance of our proposed AI algorithm. In Figure 3B, we present a 2D t-SNE plot of the embeddings for the same patient, using our UD score metric to group the lesions into three clusters: lesions with scores ≤50% (green), lesions with scores >50% (blue), and the top 10 scoring lesions (red).

Sensitivity of our Skin Lesion Detector w.r.t. Ugly Ducklings
We further evaluated the sensitivity of our lesion detector by analyzing all selected UDs identified by participants on the 20 patient images.We achieved a recall of 100% for our lesion detector.

Number of Ugly Ducklings
Figure 4A shows that dermatologists typically choose on average 3 to 4 UDs per patient image without AI assistance. However, there is often significant variability in the number of selected lesions. In contrast, students selected an average of 7 to 8 skin lesions per patient image and even reached the technical limit of 20 lesions on some images. With AI assistance, all participants except GPs and dermatologists with ≤10 years of experience selected more skin lesions on average for further risk assessment.

Average Participants Sensitivity w.r.t. Majority Selection of Experts
The agreement among participants with respect to the majority selection of experts was one of the most intriguing questions we explored. As shown in Figure 4B, dermatologists with ≤10 years of experience achieved an average sensitivity of 65%-70% with respect to the majority selection of experts without AI assistance, which even decreased slightly with AI assistance. Among the experts themselves, the average sensitivity was 84% both with and without AI assistance. Students had an average sensitivity of 70% without and 76% with AI assistance, while GPs had a relatively low average sensitivity of 59% without and 58% with AI assistance. Lastly, we provide the sensitivity values of the top-10 AI UDs: our AI predictions achieve an average sensitivity of 93%, and even reach 100% on average when presented to the experts. The interquartile range (IQR) of each group is quite large both with and without AI assistance. With AI assistance, a slight decrease in the IQR is observable for every group excluding the GPs. The AI predictions have an IQR of zero in both cases.

Top-10 AI Sensitivity w.r.t. Participants Selection
In Figure 4C, we show that without any assistance from AI, we achieve a top-10 AI sensitivity of 81 to 82% for dermatologists, while for the students and GPs we had lower averages of 66% and 68%, respectively. With AI assistance, we achieve high values for dermatologists, ranging from 91 to 99%, and for students and GPs we achieve average sensitivities of 83% and 86%, respectively. When provided with AI assistance, the IQR for each group decreased, and for some dermatologists with >5 years of experience an IQR of zero was even achieved.

Confidence Level
In Figure 4D we show the absolute confidence values found for each participant, grouped by expertise. We see a clear difference in confidence level between students and dermatologists. Interestingly, dermatologists with ≤10 years of experience exhibit a strong confidence level compared to the other dermatologist groups. We also noticed that for some groups, namely dermatologists with ≤5 and ≥10 years of experience, the IQR of their confidence decreased to zero after being presented with the AI predictions. In Figure 4F, we additionally show the relative differences in participants' confidence levels before and after they were exposed to the AI predictions. A clear upward trend in the average is observed for all groups, with the greatest increase shown by the experts and students. However, for some images, we noticed that the confidence level decreased after participants were shown the AI predictions.

Top-u AI Sensitivity w.r.t. Participants Selection
Figure 4E displays the average top-u AI sensitivity for values of u ranging from 1 to 50, across participant groups. Without the aid of AI, dermatologists achieved average sensitivity values between 60% and 70% for the top-5 AI UDs and 90% for more than 20 lesions. On average, students and GPs attained lower top-u AI sensitivity values, but their agreement with the AI predictions increased after seeing them. With AI assistance, dermatologists quickly converged to AI sensitivity values above 90% at u = 9-10. Dermatologists with over 5 years of experience achieved 100% AI sensitivity at u = 24, whereas those with less than 5 years of experience had slightly lower agreement than their more experienced colleagues. Although the top-u AI sensitivity values for students and GPs with AI assistance were lower than those for dermatologists, there was a clear trend that they followed the AI predictions more when these were provided.

Model Comparison of Top-u AI Sensitivity w.r.t. Majority Selection of Experts
In Figure 4G, we compare the top-u AI sensitivity of our chosen model architecture, DINO, to that of MoCo v2, with respect to the majority selection of experts. Both models were tested with the same 3 backbones (ResNet18, ResNet34 and ResNet50). Our analysis shows that DINO outperforms MoCo v2, with DINO achieving over 90% sensitivity at u=9 for the ResNet18 and ResNet50 backbones, while MoCo v2 only reaches this level of sensitivity at u=14 for the same backbones.

Discussion
In the present study, we introduce a novel approach to total body melanoma screening incorporating self-supervised detection of suspicious lesions based on each patient's context. The results presented evaluate the impact of this approach on clinicians' decision-making and confirm its validity through comparison with the assessments of experienced pigmented lesion experts. Total body screening is the most time-consuming and error-prone task; a reliable support tool would therefore enable dermatologists to better optimize consultation time.
The architecture comprises two main high-level tasks: first, the automatic detection and extraction of lesions from wide-field images, and second, the characterization of suspicious lesions through self-supervised clustering. A detailed scheme of the complete pipeline is presented in Figure 1. The dataset for this study was collected ad hoc during routine consultations at the Dermatology Clinic of the USZ and consisted of 90 patients; for the purposes of this study, the analysis was constrained to the dorsal region of 20 patients. To perform the skin lesion detection and extraction, a one-stage object detection model was employed. The model was trained using a semi-automatic process that combined a blob detector with manual labeling to accelerate the process. The performance of this stage is summarized in Figure 2. The approach used for lesion detection resulted in high recall rates for all four test subjects, reaching 95% at an IoU threshold of 50%. This is a critical factor in the detection of potential melanoma cases, ensuring that no lesions are missed. Furthermore, the deep learning model achieved a relatively high precision rate (>90% P50) while maintaining a high recall rate (>80% R50), highlighting its effectiveness in detecting skin lesions even when higher IoU thresholds are used. The worst performance was observed on 'atypical' patients, such as those with a significant amount of hair. However, in all cases, the suspicious lesions or ugly ducklings (UDs) selected by the experienced dermatologists were consistently detected by the object detector.
Regarding the characterization of suspicious lesions or UDs, we propose a self-supervised architecture that allows possible outliers to be evaluated considering only the patient's context. As illustrated in Figure 3, the algorithm visually presents suspicious-looking lesions to the user by drawing a red bounding box around them. The model was able to distinguish UDs effectively from common freckles and other average-looking skin lesions. Moreover, the visual representation using t-SNE shows a clearly meaningful embedding of UDs versus average-looking lesions. Despite minor illumination issues that require attention, the overall approach appears to be a promising option. The quantitative comparison of the two architectures, DINO and MoCo v2, using 3 different backbones is shown in Figure 4G. We conclude that DINO converged faster towards the 90% sensitivity mark, with MoCo v2 requiring roughly 50% more suggested AI ugly ducklings to reach it.
Finally, a clinical validation study was conducted to evaluate the performance of our tool and measure its impact under real-world consultation conditions. The study included different groups of interest with varying levels of melanoma screening experience, as introduced in the Clinical Validation Study Design. It demonstrated that our top-10 AI predictions agreed well with dermatologists, with an average sensitivity of 82%. Furthermore, after reviewing our predictions, dermatologists' agreement with our algorithm increased, reaching an average of 95%. We therefore conclude that if this algorithm were deployed in routine consultations, nurses or other non-experts could, after initial training, reach sensitivity values similar to those of experts. Another noteworthy outcome is the number of ugly ducklings selected by experienced dermatologists for each patient: on average, they went from 3-4 to 4-5 UDs when provided with the AI suggestions. This implies that the AI support drew attention to some lesions that were previously overlooked due to their location or the high number of lesions in the image. For two groups, dermatologists with ≤10 years of experience and GPs, the number of selected lesions did not increase with AI assistance. However, they changed their initial selection by following the algorithm's proposals more closely. Inexperienced students, due to their lack of experience and uncertainty with the task, chose more UDs, which also explains their high sensitivity values towards the majority selection of experts and low sensitivity towards the AI.
Despite the promising results, we have identified several limitations. Although total body imaging was available for each patient, for the sake of a proof of concept we restricted ourselves to the dorsal region of 20 patients. The next steps should therefore include validating the model on different body regions and on more patients. The challenge of acquiring total body imaging makes our sample size relatively small, which limits our capacity to evaluate the model's generalization performance. However, the Dermatology Clinic of the USZ plans to extend the validation campaign to further evaluate the model's robustness. Enlarging the number of patients and the clinical validation should contribute to increasing the reliability and generalization capacity of our tool. Some patients presented a considerable amount of hair, which impacted both the detection and the UD characterization. Despite the limited number of such subjects, we should consider a proper way of handling these cases in the future. Additionally, an improved lighting setup will be implemented to prevent shadowed regions in the imaging, thereby avoiding potential detection issues.

Conclusion
We introduce a novel AI-assisted total body screening tool that achieves expert-level accuracy in identifying suspicious lesions in wide-field images, improving upon the results of previous studies 7 . The architecture includes a state-of-the-art skin lesion detection system, followed by a self-supervised "ugly duckling" characterization module trained on real-world data acquired at the Dermatology Clinic of the USZ. By eliminating the need for time-consuming manual labeling of UDs and performing predictions on a patient-by-patient basis, we enhance the generalization capacity of our system. These tools ultimately enable the separation of total body screening from routine consultations and even facilitate the involvement of non-expert staff, who can assist dermatologists using reliable tools. The screening time saved can then be reallocated by dermatologists to single-lesion assessment and discussions with patients.

Figure 1 .
Figure 1. Summary of the proposed AI pipeline for self-supervised Ugly Duckling selection in wide-field imagery (top) and clinical evaluation (bottom).

Figure 2 .
Figure 2. Top: Average Precision and Recall curve for 4 unseen patient images using an IoU threshold of 50%. Bottom: Averaged performance results of 4 unseen manually labeled patient images for YOLOR. Recall and Precision for an IoU threshold of 50% and a previously fine-tuned confidence threshold of 20% on validation patients are shown, in addition to the Average Precision and Recall values for IoU thresholds of 50% and 75%.

Figure 3 .
Figure 3. A: AI results on the dorsal image of one patient, showing the top 10 highest-scoring lesions in red bounding boxes with their UD scores on top. B: 2D t-SNE visualization of our embeddings in the latent space. Green marks lesions rated equal to or less than 50%, blue those above 50%, and red the top 10 according to our UD score metric.

Figure 4 .
Figure 4. A: Comparing the number of UDs selected per patient image and participant group, with and without AI assistance. B: Average sensitivity values of each group with respect to the majority selection of experts, with and without AI assistance. Top-10 AI sensitivity values averaged over each patient image are also presented. C: Comparing the top-10 AI sensitivity for each patient image with and without AI assistance, w.r.t. each participant. D: Comparing the confidence levels of participants without and with AI assistance for each patient image. E: Average top-u AI sensitivity values for each group, with and without AI assistance. F: Comparing differences in confidence levels between groups for cases with and without AI assistance, for each patient image. G: Comparing the average top-u AI sensitivity values for the MoCo and DINO models, w.r.t. the majority selection of experts, for each group without AI assistance.