Introduction

Melanoma, the leading cause of skin cancer deaths worldwide, has increased in incidence over the last few decades1,2,3. Early detection of this disease can reduce the size and extent of surgery as well as the adverse effects of late-stage systemic therapies, and is thus beneficial for patients, especially if the disease is detected before it metastasizes4. Traditionally, physicians diagnose lesions by visual inspection and dermoscopy5,6. Diagnostic quality therefore depends on the expertise and experience of the dermatologist7. Considering that demand for such experts is increasing due to changing epidemiology, demographic trends, and skilled advertising8, while such experts remain difficult to find, innovative diagnostic approaches are required, especially for atypical or uncommon cases.

Recently, artificial intelligence (AI) systems for melanoma detection have emerged as promising tools, with numerous retrospective studies reporting that AI algorithms can match or even surpass the diagnostic accuracy of experienced dermatologists in artificial settings7,9,10,11. While the findings of these retrospective studies are undoubtedly encouraging, hinting at improved patient care alongside a reduced dermatologist workload, prospective evaluations remain scarce12. Prospective studies typically allow for more unbiased, complete, and tailored evaluations, but the few existing prospective analyses of AI-based melanoma detection suffer from limitations such as a single-center design and a relatively small number of lesion samples, particularly with respect to rare melanoma subtypes13,14,15.

In this study, we address these limitations, thereby taking a substantial step towards comprehensive, prospective evaluations of AI-based melanoma detection by evaluating the open-source AI algorithm “All Data Are Ext”16 (ADAE) on a heterogeneous dataset. ADAE is a binary melanoma classifier that ranked first in the Society for Imaging Informatics in Medicine (SIIM) and International Skin Imaging Collaboration (ISIC) Challenge 202017. The dataset we introduce shows substantial domain diversity (see Fig. 1 for its feature distribution). We ensure a broad representation of potential real-life clinical and technical settings through a multicenter design involving eight German university hospitals as well as four distinct hardware configurations. In addition, we demonstrate a strong performance of the algorithm on difficult-to-diagnose lesions, supporting the integration of AI as a supportive tool for dermatologists in particularly challenging cases.

Fig. 1: Feature distribution of our dataset.

The feature distribution of our dataset (n = 11,460 images, i.e., six images for each of the 1910 lesions) using t-distributed stochastic neighbor embedding (t-SNE; 2 components, 1000 iterations). Points are colored according to lesion type (melanoma or non-melanoma).
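As a minimal, hypothetical sketch (not the study’s actual code), such an embedding can be produced with scikit-learn; the feature vectors and labels below are random placeholders standing in for the per-image features and lesion classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))       # placeholder for per-image feature vectors
is_melanoma = rng.integers(0, 2, size=500)   # placeholder for lesion labels

# 2 components; scikit-learn's default of 1000 optimization iterations
# matches the setting stated in the caption.
embedding = TSNE(n_components=2, random_state=0).fit_transform(features)

for label, name in [(0, "non-melanoma"), (1, "melanoma")]:
    mask = is_melanoma == label
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=4, label=name)
plt.legend()
plt.show()
```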

Methods

Study design

This prospective, multicenter study was approved by the respective institutional review boards of the Technical University Dresden (BO-EK-53012021), the Friedrich-Alexander University Erlangen-Nuremberg (69_21 Bc), the University Duisburg-Essen (20-9784-BO), the University Hospital Mannheim (2010-318N-MA), the LMU Munich (21-0182), the University Regensburg (20-2190-103), as well as the University Würzburg (293/20_z) and adheres to the Declaration of Helsinki guidelines. Specific IRB approval was not required from the Charité Berlin because the Berufsordnung der Ärztekammer Berlin (professional code of conduct of the medical association Berlin), §15(2), states that additional approval is not necessary for a study across multiple centers if there is approval from another IRB of a German University or medical association. The STARD 2015 reporting standards18 were followed, and written informed consent was obtained from all participating patients.

Data on clinically suspected melanomas that were subsequently excised, consisting of dermoscopic images and patient-specific metadata (including age, Fitzpatrick skin type, lesion localization, and diameter), were prospectively gathered as part of routine clinical practice from eight university hospitals in Germany (located in Berlin, Dresden, Erlangen, Essen, Mannheim, Munich, Regensburg, and Wuerzburg) between April 2021 and March 2023.

Participants

Participants were eligible for this study if they were at least 18 years of age and presented with clinically melanoma-suspicious skin lesions. Patients were excluded if these lesions had undergone pre-biopsy procedures, were located near the eye or beneath the fingernails or toenails, or had person-identifying features (such as tattoos) in the immediate vicinity of the lesion (the latter due to data privacy concerns).

Data collection

After informed consent was obtained, imaging and dermoscopic examination were performed, and melanoma-suspicious lesions were subsequently excised. All lesions were histopathologically diagnosed by at least one experienced (dermato)pathologist at the respective hospital. During the clinical examination, a dermatologist captured six dermoscopic images of each suspected melanoma lesion while deliberately introducing random variations in the orientation/angle, position, and operational mode of the dermatoscope, including both polarized and nonpolarized settings. A dermatologist is defined here as a physician who diagnoses and treats skin diseases in a Department of Dermatology but has not necessarily completed board certification. To mitigate the influence of potential confounding variables, dermatologists were explicitly instructed to avoid known artifacts (such as skin markings). All images were acquired using one of four distinct hardware configurations that were used consistently across the participating medical centers (see Supplementary Methods).

Model training and testing

As we employed the ready-to-use ADAE algorithm for binary classification of lesion images into melanoma and non-melanoma, no additional training was needed. ADAE was trained solely on public data from the SIIM-ISIC 2020 and 2019 challenges (the latter includes the 2018 data), comprising a total of 58,457 lesions, 5106 of which were labeled melanoma19,20,21,22. The ensemble consists of 18 convolutional neural network (CNN) models (16 EfficientNets B3–B7, 1 SE-ResNeXt101, 1 ResNeSt101), four of which additionally incorporate patient metadata (including sex, age, and lesion location). Since each configuration is trained with 5-fold cross-validation and the best model of each fold is kept, the final ensemble totals 90 models (for a detailed description of the algorithm and its training procedure, please refer to the original paper16).
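For illustration only, ensemble inference can be sketched as averaging the member models’ melanoma probabilities, with metadata-aware members additionally receiving the patient metadata; the callables, the `uses_metadata` flag, and the unweighted mean are assumptions of this sketch rather than ADAE’s actual implementation16.

```python
import numpy as np

def ensemble_predict(models, image, metadata=None):
    """Average melanoma probabilities over all ensemble members.

    `models` is a list of callables returning a melanoma probability;
    members flagged via a hypothetical `uses_metadata` attribute also
    receive patient metadata (e.g., sex, age, lesion location).
    """
    probs = [
        model(image, metadata) if getattr(model, "uses_metadata", False)
        else model(image)
        for model in models
    ]
    return float(np.mean(probs))
```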

To ensure that the algorithm runs correctly, it was tested on the SIIM-ISIC 2020 data using a publicly available script on GitHub (https://github.com/ISIC-Research/ADAE/blob/main/predict.py)23. The resulting AUROC scores match those reported in the literature (ISIC validation: ours 0.945 vs. literature 0.949; ISIC test: ours 0.951 vs. literature 0.950)13,17.

Since our dataset includes six images per lesion, ADAE was adapted with real test-time augmentation (R-TTA) to utilize these additional images, as this has proven beneficial with respect to diagnostic performance, robustness, and uncertainty estimation24. With R-TTA, all six real images of a lesion are fed to the algorithm at test time, and the resulting outputs are aggregated into one final prediction. A comparison of the diagnostic accuracy of ADAE with versus without R-TTA is shown in Supplementary Figs. 6 and 7, which also include the scores of the individual models comprising the ensemble.
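A minimal sketch of this aggregation step, assuming a `predict_proba` callable that maps a single dermoscopic image to a melanoma probability; aggregation by the unweighted mean is an assumption here, and the exact rule follows Hekler et al.24.

```python
import numpy as np

def rtta_predict(predict_proba, lesion_images, threshold):
    """Aggregate per-image probabilities of one lesion into a final call."""
    probs = [predict_proba(img) for img in lesion_images]  # six images per lesion
    lesion_prob = float(np.mean(probs))                    # assumed aggregation rule
    return lesion_prob, lesion_prob >= threshold           # True = melanoma
```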

To find a suitable threshold differentiating positive from negative predictions, a validation set was split from the data, namely the data of hospital 8, as it has its own unique technical domain while being large enough to allow for a representative estimate. On this validation set, the threshold was set such that a sensitivity of at least 85% is exceeded, as sensitivities of approximately 80–85% are realistic in a clinical setting9,25,26.
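A minimal sketch of this threshold search, assuming `y_true` and `y_prob` hold the hospital-8 validation labels and ADAE’s predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, y_prob, target_sensitivity=0.85):
    """Return the highest threshold whose sensitivity exceeds the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    # `thresholds` is sorted from high to low, so `tpr` (sensitivity) is
    # non-decreasing; take the first (i.e., highest) threshold above target.
    idx = int(np.argmax(tpr > target_sensitivity))
    return thresholds[idx]
```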

Statistics and reproducibility

The difference between ADAE’s and the dermatologists’ diagnostic accuracy was quantified primarily through balanced accuracy and secondarily via sensitivity and specificity. For each endpoint, pairwise two-sided Wilcoxon signed-rank tests were used to compare the respective metrics. To evaluate the generalizability of the algorithm on different subsets, the Breslow–Day test for homogeneity of the odds ratio was used27. Its null hypothesis H0 is that the odds ratio is constant across the strata of a variable k; rejecting it indicates a significant association between prediction performance and k. The algorithm’s predictive ability was assessed using the AUROC. Differences in model performance were assessed by statistically comparing the corresponding AUROCs using DeLong’s method28.
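As a sketch, the pairwise comparison can be implemented with SciPy’s signed-rank test, assuming the paired units are the per-resample metric values from the bootstrap procedure described below; DeLong’s AUROC comparison is not part of SciPy and is available elsewhere, for example in the R package pROC.

```python
from scipy.stats import wilcoxon

def compare_paired_metrics(metric_adae, metric_derm):
    """Two-sided Wilcoxon signed-rank test on paired metric values.

    `metric_adae` and `metric_derm` are equal-length arrays of a metric
    (e.g., balanced accuracy) evaluated on the same resamples.
    """
    statistic, p_value = wilcoxon(metric_adae, metric_derm,
                                  alternative="two-sided")
    return statistic, p_value
```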

To reduce the impact of stochastic events, mean values for each metric were calculated using 1000 bootstrap iterations. The corresponding 95% confidence intervals (CIs) were determined using the nonparametric percentile method. P-values smaller than 0.05 were considered statistically significant. Statistical analysis was performed using SciPy 1.11.229 and R30.
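A minimal sketch of this bootstrap procedure, assuming per-lesion labels and binary predictions, with scikit-learn’s balanced accuracy as an example metric:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_metric(y_true, y_pred, metric=balanced_accuracy_score,
                     n_boot=1000, seed=0):
    """Mean and 95% percentile CI of `metric` over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    scores = np.asarray(scores)
    return scores.mean(), np.percentile(scores, [2.5, 97.5])
```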

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Patient characteristics

Our dataset comprises images of a total of 1910 skin lesions, clinically suspected to be melanoma, from 1716 patients at eight German university hospitals, collected between April 2021 and March 2023 (for detailed patient characteristics, see Table 1). Patient age at diagnosis ranged from 18 to 96 years, with a median of 62 years. While all skin types are present in the study population, Fitzpatrick skin types II and III are the most prevalent, whereas types V and VI are less frequent.

Table 1 Patient characteristics of our dataset

Lesion characteristics

The dataset consists of 750 melanomas (including rare subtypes such as spitzoid or desmoplastic melanomas), 885 nevi, and 275 other diagnoses (for detailed lesion characteristics, see Supplementary Data 1). Additional information about the diagnosis, such as the exact subtype (see Table 4) and the dermatologists’ self-assessed confidence in their diagnosis (on a scale from 1, low confidence, to 5, high confidence), was collected. Furthermore, lesion size and location were recorded. The lesions were collected from eight different university hospitals (i.e., the data sources) with four distinct technical setups (i.e., hardware configurations, each a specific combination of camera and dermatoscope used to capture the images; see Supplementary Methods), thus ensuring domain diversity.

ADAE outperforms dermatologists in balanced accuracy and sensitivity

To assess the performance of AI algorithms in diagnosing suspected melanoma lesions in prospectively acquired data, we compared the prediction quality of ADAE to that of the dermatologists who examined the patients in person and who also had access to patient- and lesion-specific metadata. This assessment is based primarily on balanced accuracy and secondarily on sensitivity and specificity. P-values smaller than 0.05 are considered statistically significant and were determined using pairwise two-sided Wilcoxon signed-rank tests. The algorithm was enhanced with real test-time augmentation (R-TTA, introduced by Hekler et al. as MV-Real)24 by providing multiple images per lesion at test time (see also Supplementary Figs. 6 and 7). That is, each image was initially classified individually as either melanoma or non-melanoma (the latter including basal and squamous cell carcinoma, dermatofibroma, keratosis, nevus, and vascular lesions), and the results were ultimately aggregated into one final prediction for the respective lesion. Performance was measured against the histopathological diagnoses made by experienced pathologists as ground truth.

Overall, at the predetermined 85% sensitivity threshold (see Methods, “Model training and testing”), ADAE showed higher balanced accuracy than the dermatologists originally diagnosing the lesions (ADAE: 0.798, 95% confidence interval (CI) 0.779–0.814 vs. dermatologists: 0.781, 95% CI 0.760–0.802; p = 4.0e−145), with significantly higher sensitivity (0.922, 95% CI 0.900–0.942 vs. 0.734, 95% CI 0.701–0.770; p = 3.3e−165) but significantly lower specificity (0.673, 95% CI 0.641–0.702 vs. 0.828, 95% CI 0.804–0.852; p = 3.3e−165). For the test set results, see Tables 2 and 3; for differences stratified by domains, see Supplementary Figs. 1–4. In total, ADAE detected 602 of 653 melanomas (0.922 sensitivity), while dermatologists detected 479 of 653 (0.734 sensitivity) (for the individual melanoma subtype results, see Table 4; for a visualization of the differences, see Supplementary Fig. 3). A total of 623 of 653 melanomas (0.954 sensitivity) were detected by either the AI, the dermatologist, or both. Concurrently, ADAE classified 618 of 918 non-melanomas correctly (0.673 specificity), while dermatologists classified 760 of 918 non-melanomas correctly (0.828 specificity). A total of 833 of 918 non-melanomas (0.907 specificity) were classified correctly by either the AI, the dermatologist, or both. Thus, the combination exceeded the individual outcomes of both AI and dermatologists, with ADAE alone exhibiting a higher detection rate, but also a higher false-positive rate, than the dermatologists. Hence, a synergistic approach could benefit patients more than relying on human or machine alone.

Table 2 Diagnostic performance of dermatologists and ADAE for biological subsets of our test set (i.e., validation samples are not included here)
Table 3 Diagnostic performance of dermatologists and ADAE for other subsets of our test set (i.e., validation samples are not included here)
Table 4 Detection rate of melanoma, stratified by melanoma subtype

Subset analyses

Additionally, multiple subset analyses were performed to identify disparities in diagnostic performance, revealing substantial differences within certain subsets.

Lesion subtypes

The algorithm showed significantly higher sensitivity than dermatologists for all melanoma subtypes except nodular melanoma (p < 0.001 for all comparisons, see Table 4). It also showed significantly higher specificity for basal cell carcinoma, blue nevus, and actinic keratosis, but significantly lower specificity for dysplastic/Clark nevus, benign keratosis, and acral nevus (p < 0.001 for all comparisons, see Table 5).

Table 5 Detection rate of non-melanoma, stratified by non-melanoma subtype

Data source

Moreover, the algorithm achieved higher balanced accuracy than dermatologists on data derived from five of the seven test-set hospitals (i.e., excluding the validation data from hospital 8, see Table 3), but performed significantly worse on data from hospital 1 (0.772, 95% CI 0.742–0.803 vs. 0.783, 95% CI 0.751–0.817; p = 6.7e−52) and hospital 3 (0.615, 95% CI 0.521–0.716 vs. 0.758, 95% CI 0.659–0.860; p = 7.4e−165). The dermatologists’ sensitivity was lower for all data except those from hospital 3 (0.897, 95% CI 0.806–0.976 vs. 0.923, 95% CI 0.823–1.000; p = 6.7e−23), while their specificity was higher without exception. Although largely consistent with the overall test set results, this observation highlights hospital 3 as an outlier.

Patient age and lesion location

Furthermore, ADAE achieved significantly higher balanced accuracy in patients younger than 35 years (0.890, 95% CI 0.859–0.920 vs. 0.767, 95% CI 0.636–0.897; p = 1.9e−161), and for lesions on the head or neck (0.775, 95% CI 0.726–0.822 vs. 0.660, 95% CI 0.603–0.714; p = 3.3e−165) but performed significantly worse for lesions on the palms or soles (0.649, 95% CI 0.508–0.774 vs. 0.798, 95% CI 0.642–0.944; p = 2.0e−158) compared to dermatologists.

Classification complexity

The dermatologists indicated the level of confidence in their own diagnosis on a scale from 1 (low confidence) to 5 (high confidence), signifying the perceived classification complexity of the lesion. Based on these data, the algorithm demonstrated significantly higher balanced accuracy than the dermatologists on lesions that were assigned lower confidence scores (confidence score 1: 0.754, 95% CI 0.608–0.895 vs. 0.508, 95% CI 0.357–0.688; p = 3.3e−164; confidence score 2: 0.761, 95% CI 0.689–0.829 vs. 0.588, 95% CI 0.491–0.680; p = 3.3e−165; confidence score 3: 0.767, 95% CI 0.729–0.804 vs. 0.662, 95% CI 0.615–0.709; p = 3.3e−165; confidence score 4: 0.811, 95% CI 0.782–0.839 vs. 0.805, 95% CI 0.775–0.838; p = 5.8e−18) but lower balanced accuracy for lesions diagnosed with a confidence score of 5 (0.820, 95% CI 0.779–0.858 vs. 0.899, 95% CI 0.871–0.925; p = 3.3e−165). Hence, the AI algorithm seems less susceptible to diagnostic difficulty. This also indicates that dermatologists assess the reliability of their own predictions fairly accurately and, especially when unsure, could benefit greatly from such an AI prediction.

Likewise, we binned the confidence of the AI algorithm into a comparable scale from 1 (low confidence) to 5 (high confidence). The AI’s and dermatologists’ confidence ratings correlate only weakly (see Supplementary Fig. 8). Based on this stratification, the algorithm demonstrated significantly higher balanced accuracy than dermatologists on lesions with AI confidence scores of 2 or higher (AI confidence score 2: 0.679, 95% CI 0.634–0.727 vs. 0.653, 95% CI 0.603–0.705; p = 1.4e−96; AI confidence score 3: 0.889, 95% CI 0.851–0.921 vs. 0.794, 95% CI 0.748–0.835; p = 3.3e−165; AI confidence score 4: 0.943, 95% CI 0.920–0.962 vs. 0.813, 95% CI 0.776–0.851; p = 3.3e−165; AI confidence score 5: 0.979, 95% CI 0.956–0.995 vs. 0.928, 95% CI 0.891–0.962; p = 3.3e−165) but lower balanced accuracy for lesions with an AI confidence score of 1 (0.526, 95% CI 0.470–0.579 vs. 0.624, 95% CI 0.558–0.691; p = 9.0e−165). Taking the confidence of the AI algorithm into account thus holds potential for more accurate and trustworthy predictions.
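The exact binning rule is not specified here; one plausible sketch, purely an assumption for illustration, derives the score from the distance of the predicted probability to the decision threshold, cut into five equal-width bins:

```python
import numpy as np

def ai_confidence_score(prob, threshold):
    """Map a melanoma probability to a 1 (low) to 5 (high) confidence score.

    Hypothetical rule: confidence grows with the distance of `prob` from
    the decision threshold, divided into five equal-width bins.
    """
    distance = abs(prob - threshold)
    max_distance = max(threshold, 1.0 - threshold)  # farthest possible distance
    score = int(np.ceil(5 * distance / max_distance))
    return min(max(score, 1), 5)                    # clip to the 1-5 scale
```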

ADAE generalizes robustly on different subsets

The predictive quality of AI algorithms may depend on certain features of the test set, such as data source, patient age, lesion size, or location. To analyze whether the algorithm in this study demonstrates robust generalizability, we evaluated the association between the predictions and specific stratified domains included in the heterogeneous dataset used for classifier testing (see Tables 2 and 3 for test set characteristics). To this end, the Breslow–Day test for homogeneity of the odds ratio was used. For inter-AI comparisons, the area under the receiver operating characteristic curve (AUROC), in combination with DeLong’s method, is used as a more objective alternative to accuracy, since it is independent of the threshold setting. P-values smaller than 0.05 are considered statistically significant.
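As a sketch, the Breslow–Day test is available in statsmodels; the 2×2 tables below, one per stratum (e.g., per hospital) and cross-tabulating true class against prediction correctness, are purely illustrative numbers.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One hypothetical 2x2 table per stratum:
# rows = melanoma / non-melanoma, columns = correct / incorrect prediction.
tables = [
    np.array([[55, 5], [60, 20]]),
    np.array([[40, 8], [70, 30]]),
    np.array([[62, 4], [80, 25]]),
]

result = StratifiedTable(tables).test_equal_odds()  # Breslow-Day test
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
```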

ADAE prediction generalizability

Overall, there was no significant association between ADAE’s prediction performance and patient age (p = 0.104), skin type (p = 0.587; excluding unknown skin types and types V + VI due to low sample size (<25)), lesion location (p = 0.233; excluding unknown locations and oral/genital lesions due to low sample size (<25)), the technical domain (i.e., camera setup; p = 0.068), or the dermatologists’ diagnostic confidence score, although the latter was borderline significant (p = 0.050). However, significant associations with lesion diameter (p = 0.009) and data source (i.e., the originating hospital; p = 0.027) were identified, indicating that performance varies with these features. Specifically, the algorithm performed significantly worse on data from hospital 3, as indicated by the lower AUROC score (hospital 3: 0.775, 95% CI 0.660–0.877 vs. other hospitals combined: 0.921, 95% CI 0.906–0.934; p = 0.013). Without the outlier dataset (hospital 3), there was no significant association between predictive performance and data source (p = 0.416). Furthermore, the AUROC was significantly higher for lesions with diameters greater than 6 mm (≤6 mm: 0.802, 95% CI 0.747–0.857 vs. >6 mm: 0.917, 95% CI 0.901–0.932; p = 7.2e−5) and for pigmented lesions (pigmented: 0.920, 95% CI 0.905–0.934 vs. non-pigmented: 0.673, 95% CI 0.469–0.859; p = 0.019). These findings suggest that the algorithm generalizes robustly across most domains, while being influenced by lesion diameter, pigmentation, and data source. However, it is worth noting that some metadata, such as patient age and lesion location, are used by ADAE as input.

While not all differences exhibited by ADAE within each subset were significant, the disparities between the extreme strata of some subsets were. Specifically, for patient age, ADAE achieved a higher AUROC score for the youngest patients than for the oldest ones (<35: 0.974, 95% CI 0.942–0.997 vs. >74: 0.879, 95% CI 0.847–0.909; p = 1.1e−5). Similarly, for the Fitzpatrick skin type, the lightest type was associated with lower AUROCs than the darkest type with a sufficient sample size (type I: 0.865, 95% CI 0.802–0.922 vs. type IV: 0.985, 95% CI 0.943–1.000; p = 4.0e−4). Thus, even though the algorithm generalizes robustly, certain trends are still evident, signifying partial dependencies on some groups of lesions or patients that can still affect diagnostic performance.

Dermatologist prediction generalizability

Among dermatologists, on the other hand, there were no significant associations between prediction performance and skin type (excluding unknown skin types and types V + VI due to low sample size (<25); p = 0.750), lesion diameter (p = 0.164), pigmentation (p = 0.781), technical domain (p = 0.862), or data source (p = 0.527). There were, however, significant associations of the prediction performance with patient age (p = 2.9e−4), lesion location (excluding unknown locations and oral/genital locations due to low sample size (<25); p = 7.5e−6), and the dermatologists’ diagnostic confidence (p < 1.0e−6). These findings indicate that the prediction quality of the dermatologists depends on these features. Specifically, the dermatologists’ specificity decreased with increasing patient age (<35 years: 0.963, 95% CI 0.932–0.988 vs. 35–54 years: 0.863, 95% CI 0.819–0.902; p = 3.3e−165; vs. 55–74 years: 0.813, 95% CI 0.766–0.859; p = 1.3e−157; vs. >74 years: 0.697, 95% CI 0.632–0.757; p = 3.3e−165). Relatedly, sensitivity was significantly lower for patients younger than 35 years (<35 years: 0.571, 95% CI 0.308–0.833 vs. all other age groups combined: 0.737, 95% CI 0.703–0.771; p = 3.7e−137) but followed trends similar to those of the specificity for age groups older than 35 years (35–54 years: 0.780, 95% CI 0.692–0.850 vs. 55–74 years: 0.739, 95% CI 0.688–0.788; p = 1.0e−96; vs. >74 years: 0.716, 95% CI 0.659–0.773; p = 8.8e−53). Additionally, the dermatologists’ balanced accuracy was significantly lower for lesions on the head or neck (head/neck: 0.660, 95% CI 0.603–0.714 vs. all other locations combined: 0.810, 95% CI 0.788–0.832; p = 3.3e−165) but higher for lesions that received higher dermatologists’ confidence scores (confidence score 1: 0.508, 95% CI 0.357–0.688 vs. confidence score 2: 0.588, 95% CI 0.491–0.680; p = 1.3e−86; vs. confidence score 3: 0.662, 95% CI 0.615–0.709; p = 1.2e−149; vs. confidence score 4: 0.806, 95% CI 0.775–0.838; p = 3.3e−165; vs. confidence score 5: 0.899, 95% CI 0.871–0.925; p = 3.3e−165).

Discussion

In this multicenter study with prospectively collected samples, we evaluated the diagnostic performance of ADAE in differentiating between melanoma and non-melanoma skin lesions and compared it to the dermatologists’ diagnostic accuracy as recorded during real-life patient examinations. One of the main strengths of the study, in addition to the prospective data collection, is its comprehensive test set, which includes a wide range of melanoma subtypes and lesion locations encountered in routine care. Altogether, ADAE performed better than dermatologists in terms of balanced accuracy and sensitivity but achieved a lower specificity. Moreover, the algorithm generalized robustly to domains such as patient age, skin type, lesion location, and camera setup, but its performance was affected by lesion diameter and pigmentation. The dermatologists’ diagnostic accuracy, in contrast, correlated significantly with patient age and lesion location.

To evaluate AI algorithms for melanoma detection, we measured the performance of ADAE against dermatologists using a prospectively collected, heterogeneous dataset. ADAE performed better in terms of balanced accuracy (ADAE: 0.798 vs. dermatologists: 0.781) and achieved a higher sensitivity (0.922 vs. 0.734) at the cost of a lower specificity (0.673 vs. 0.828). Additionally, our findings suggest that AI algorithms may be better suited than dermatologists for diagnosing skin lesions of younger patients or lesions on the head or neck, as indicated by the balanced accuracy in these subsets (<35 years: 0.890 vs. 0.767 and head/neck: 0.775 vs. 0.660, respectively). In contrast, dermatologists were significantly better at diagnosing lesions on acral skin, i.e., on the palms or soles (0.649 vs. 0.798). Interestingly, in our study, the algorithm exhibited significantly higher diagnostic accuracy in cases where dermatologists tended to be unsure, and vice versa, thus highlighting the potential synergies between AI and human experts. Our findings are in line with previous studies7,10,11,13,31,32,33 that demonstrated the potential advantages arising from the cooperation of dermatologists with AI. Our study addresses the limitations of previous work, specifically through a multicenter design encompassing a substantial number of dermatologists, a larger cohort of lesions, and rare melanoma subtypes. Marchetti et al.13 previously analyzed ADAE in terms of classification performance and its impact on dermatologists’ decisions, but their analysis was limited by a small test set. We compare our results to theirs to underscore their external validity. Our overall AUROC was slightly higher than that reported by Marchetti et al. (without R-TTA: 0.898, 95% CI 0.884–0.913 vs. 0.858). The subset analyses were also largely similar between the studies: the performance of ADAE was worse for older patients and those with type I skin in both studies, and specificity was lower for larger lesions despite the increase in AUROC. However, unlike in our study, Marchetti et al. reported a lower specificity of the algorithm for head/neck lesions. These findings suggest that a collaborative13,31 rather than a comparative7,10,11,32 approach may ultimately lead to an improvement in patient care, achieving an increased detection rate while reducing the number of unnecessary excisions compared to relying solely on either dermatologists or AI.

When we investigated the effect of the different variables on diagnostic performance, we found differences in those effects between the AI algorithm and dermatologists. Specifically, the performance of the AI was affected by lesion diameter and pigmentation; it performed worse for lesions with diameters smaller than 6 mm and for lesions without pigmentation, while dermatologists discriminated lesions of all sizes and pigmentation states relatively consistently. Conversely, the dermatologists’ decisions, but not those of ADAE, were influenced by patient age and lesion location. This further underscores the advantages of a holistic approach, as the diagnostic strengths of the AI and dermatologists may compensate for each other’s weaknesses in generalization.

While the algorithm was largely unaffected by the data source (i.e., the hospital), it performed significantly worse on data from one particular hospital, hospital 3. While its sensitivity there is comparable to that at the other hospitals (hospital 3: 0.897 vs. other hospitals: 0.923), its specificity is significantly worse (0.333 vs. 0.684). One contributing factor is the presence of relatively larger lesions (mean of 14.3 mm vs. 12.3 mm). Additionally, the proportion of non-pigmented skin lesions is higher (10.6% vs. 2.13%). Furthermore, the population of this hospital comprised older individuals compared to the other hospitals (age at diagnosis: 27–96, median 65.5 years vs. 18–95, median 63 years) and exhibited a slightly different distribution of skin types from the overall study population (i.e., a preponderance of types I and II). Indeed, we found that ADAE performed worse for non-pigmented skin lesions, older patients, and those with lighter skin types, and that its specificity decreased with increasing lesion size, which may explain the deviating performance.

While our study comprises multiple centers, they are all located in Germany. Thus, our findings might not translate to other ethnicities or skin types (especially types V and VI, which are underrepresented in our study). Furthermore, we are limited to a binary classification (melanoma vs. non-melanoma), which does not fully model the complexity of clinical reality, where lesions are classified into multiple classes. Finally, our comparison of AI and dermatologists considered diagnostic accuracy and generalization but not other metrics and aspects, such as computing costs and explainability, which are typically problematic for ensembles; nor did it investigate a prospective impact on dermatologists’ management decisions. Explainability in particular is a feature of AI that is both required by EU standards for transparency in AI34,35 and demanded by physicians and patients alike36,37,38,39. In a recent retrospective study, a dermatologist-like explainable AI enhanced trust and confidence in diagnosing melanoma among 116 participating clinicians, promoting its future use in care33. Future research could build upon this work by evaluating different AI architectures, such as model soups40,41, which average the weights of multiple models to improve performance, in order to address the drawbacks of ensembles, namely computational costs and explainability.

In conclusion, ADAE showed better performance than dermatologists in terms of balanced accuracy and sensitivity, but worse specificity. It generalizes robustly across most domains of a heterogeneous, prospectively collected test set. Thus, it could be particularly useful in medical settings, where hospitals often differ substantially in imaging hardware and, at times, in patient populations. Ultimately, AI algorithms can support physicians in identifying melanomas more accurately, especially in difficult cases in which human dermatologists are unsure of their diagnoses. Future research should address the shortcomings of the algorithm, such as its lack of explainability and low specificity, both of which are particular obstacles to the clinical use of AI.