Introduction

Skin cancer is the most common neoplasm worldwide. Early detection and diagnosis are critical for the survival of affected patients. For early-stage skin cancer detection, a complete physical examination is of paramount importance; however, visual inspection is often not sufficient, and less than one quarter of U.S. patients will have a dermatologic examination in their lifetime1. Dermoscopy is a diagnostic tool that allows for improved recognition of numerous skin lesions compared with naked-eye examination alone; however, this improvement depends on the level of training and experience of clinicians2. In recent years, advances have been made in noninvasive tools to improve skin cancer diagnostic performance, including the use of artificial intelligence (AI) for clinical and/or dermoscopic image diagnosis in dermatology.

Convolutional neural networks (CNNs) are a type of machine learning (ML) model that simulates the processing of biological neurons and represents the state of the art for pattern recognition in medical image analysis1,2. As diagnosis in dermatology relies heavily on both clinical and dermoscopic image recognition, CNNs have the potential to complement or improve diagnostic performance. Published studies have demonstrated that CNN-based AI algorithms can perform similarly to, or even outperform, specialists in skin cancer diagnosis3. This has created an ‘AI revolution’ in the field of skin cancer diagnosis. Recently, a few dermatology AI systems have been CE (Conformité Européenne) approved by the European Union and are in use in practice, making it of paramount importance to understand the data behind these algorithms4.
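For readers unfamiliar with the architecture, the sketch below is a deliberately minimal, illustrative CNN for dichotomous lesion classification; it is not the architecture of any included study, and the layer sizes, input resolution, and class count are arbitrary assumptions.

```python
# Minimal illustrative sketch (not any study's model): a small CNN that maps a
# 224x224 RGB lesion image to benign/malignant class logits. Assumes PyTorch.
import torch
import torch.nn as nn

class TinyLesionCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local colour/texture filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global average pooling
            nn.Flatten(),
            nn.Linear(32, num_classes),                   # benign vs malignant logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 4 dermoscopic images yields 4 pairs of class logits.
logits = TinyLesionCNN()(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 2])
```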

While relevant systematic reviews have been published in the past few years, the importance of this work, which combines a high-quality systematic review with a meta-analysis, is that it quantitatively addresses where we currently stand with AI for skin cancer detection. The main objective of this study was to perform a systematic review and meta-analysis to critically evaluate the evidence published to date on the performance of AI algorithms in skin cancer classification compared with clinicians.

Methods

Guidelines followed

This systematic review was based on the PRISMA guidelines. A flow diagram is presented in Fig. 1. The study has also been registered in the International Prospective Register of Systematic Reviews (PROSPERO ID: CRD42022368285).

Fig. 1: PRISMA flow diagram of included studies.

Search strategy

Three electronic databases (PubMed, Embase, and the Cochrane Library) were searched by a librarian (J.M.). Studies published up to August 2022 were included. We uploaded all titles and abstracts retrieved by the electronic searches into Rayyan and removed any duplicates. We then retrieved the full texts of studies whose titles or abstracts met the inclusion criteria for detailed inspection. Two reviewers (M.P.S. and J.S.) independently assessed the eligibility of the retrieved papers and resolved any discrepancies through discussion.

Study population—selection

The following PICO (Population, Intervention or exposure, Comparison, Outcome) elements were applied as inclusion criteria for the systematic review: (i) Population: images of patients with skin lesions, (ii) Intervention: artificial intelligence diagnosis/classification, (iii) Comparator: diagnosis/classification by clinicians, (iv) Outcome: diagnosis of skin lesions. Only primary studies comparing the performance of artificial intelligence versus dermatologists or other clinicians were included.

Studies on the diagnosis of inflammatory dermatoses, studies without extractable data, non-English publications, and animal studies were excluded.

Data extraction

For studies fulfilling the inclusion criteria, two independent reviewers extracted data using a standardized, predefined form. The following data were extracted and recorded: (i) Database, (ii) Title, (iii) Year of publication, (iv) Author, (v) Journal, (vi) Prospective vs retrospective study, (vii) Image database used for training and internal vs external dataset used for testing, (viii) Type of images included: clinical and/or dermoscopy, (ix) Histopathology confirmation of diagnosis, (x) Inclusion of clinical information, (xi) Number and expertise of participants (expert dermatologists, non-expert dermatologists, and generalists), (xii) Name and type of AI algorithm, (xiii) Included diagnoses, (xiv) Statistics on diagnostic performance (sensitivity [Sn], specificity [Sp], receiver operating characteristic [ROC] curve, area under the curve [AUC]). The main comparison conducted was the diagnostic performance of the AI algorithm versus that of clinicians. When available, the change in diagnostic performance of dermatologists with the support of the AI algorithm was recorded, as well as the change in diagnostic performance after including clinical data (data in supplementary material).
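For illustration only, one record of such a predefined extraction form could be represented as a structured object; the sketch below is hypothetical, simply mirrors the fields listed above, and is not the authors' actual form.

```python
# Hypothetical sketch of one row of the predefined data extraction form.
# Field names mirror items (i)-(xiv) above; all values shown are invented.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractedStudy:
    database: str                      # (i) source database
    title: str                         # (ii) title
    year: int                          # (iii) year of publication
    author: str                        # (iv) author
    journal: str                       # (v) journal
    prospective: bool                  # (vi) prospective vs retrospective
    test_set: str                      # (vii) "internal" or "external"
    image_type: str                    # (viii) clinical and/or dermoscopy
    histopathology_confirmed: bool     # (ix) histopathology confirmation
    metadata_included: bool            # (x) clinical information included
    participants: str                  # (xi) number and expertise of raters
    algorithm: str                     # (xii) name/type of AI algorithm
    diagnoses: list = field(default_factory=list)  # (xiii) included diagnoses
    sensitivity: Optional[float] = None            # (xiv) performance metrics
    specificity: Optional[float] = None
    auc: Optional[float] = None

record = ExtractedStudy(
    database="PubMed", title="Example study", year=2020, author="Doe et al.",
    journal="Example Journal", prospective=False, test_set="internal",
    image_type="dermoscopy", histopathology_confirmed=True, metadata_included=False,
    participants="10 expert dermatologists", algorithm="CNN (example)",
    diagnoses=["melanoma", "nevus"], sensitivity=0.85, specificity=0.75, auc=0.88,
)
```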

Risk of bias assessment

Two review authors independently assessed the quality of the included studies and the risk of bias using QUADAS-25. Based on the signalling questions, we classified each QUADAS-2 domain as having a low (0), high (1), or unknown (2) risk of bias.

Meta-analysis

Nineteen of the 53 studies were included in the meta-analysis. These studies met the following criteria: dermoscopic images only; diagnosis of skin cancer; dichotomous classification (benign/malignant, melanoma/nevus); extractable data from the original article (to calculate true positives [TP], false positives [FP], true negatives [TN], and false negatives [FN]); and distinction of the clinicians' level of expertise (expert dermatologists vs non-expert dermatologists vs generalists). For study purposes and to obtain a global estimate, we grouped all levels of clinical expertise as ‘overall clinicians’. During data processing, two additional analyses that were not pre-specified in the PROSPERO protocol were performed: clinicians vs AI algorithms in prospective vs retrospective studies and in internal vs external test (validation) sets, respectively. Internal vs external test sets were defined according to Cabitza6 and Shung et al.7. An ‘internal test set’ was defined as a non-overlapping, ‘held-out’ subset of the original patient group data that was not used for AI algorithm development and training and was used to test the AI model. An ‘external test set’ was defined as a set of new data originating from cohorts, facilities, or repositories other than those used for model development and training (e.g., a dataset originating from a different country or institution). Two investigators classified the included studies into internal vs external test sets; if both internal and external test sets were used, we classified the study as external for study purposes. We decided to perform these non-pre-specified analyses given the relevance of the results for understanding the data8.
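To make the distinction concrete, the sketch below is a purely illustrative example (hypothetical arrays, not study data, assuming NumPy and scikit-learn) of how an internal held-out test set differs from an external test set drawn from a separate cohort.

```python
# Illustrative only: "internal" = held-out split of the development cohort;
# "external" = data from a different cohort/institution, used only for evaluation.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_dev, y_dev = rng.normal(size=(1000, 64)), rng.integers(0, 2, size=1000)  # development cohort
X_ext, y_ext = rng.normal(size=(300, 64)), rng.integers(0, 2, size=300)    # different institution

# Internal test set: a non-overlapping, held-out subset of the development data,
# never used for model development or training.
X_train, X_test_internal, y_train, y_test_internal = train_test_split(
    X_dev, y_dev, test_size=0.2, random_state=0, stratify=y_dev
)

# External test set: the separate cohort, reserved entirely for evaluation.
X_test_external, y_test_external = X_ext, y_ext
```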

We extracted binary diagnostic accuracy data and constructed contingency tables to calculate Sn and Sp. We conducted a meta-analysis of studies providing 2 × 2 tables to estimate the accuracy of AI and clinicians (confirmatory approach). If an included study provided several 2 × 2 tables, we assumed these data to be independent of each other. We fitted a hierarchical summary receiver operating characteristic (HSROC) model as well as a bivariate model of the accuracy of AI and clinicians. ROC curves were constructed to simplify the plotting of graphical summaries of the fitted models. A likelihood ratio test was used to compare models, and a p-value less than 0.05 was considered statistically significant. Analyses were performed using the Stata 17.0 statistical software package (code in supplementary material).
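The pooling itself was done in Stata (code in supplementary material); purely as an illustration of the per-study inputs to the bivariate and HSROC models, the hypothetical sketch below derives Sn, Sp, and their logits from a single 2 × 2 table.

```python
# Illustrative only: per-study Sn/Sp from a hypothetical 2 x 2 table; bivariate
# models pool these quantities across studies on the logit scale.
import math

def sn_sp(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def logit(p: float) -> float:
    return math.log(p / (1 - p))

# Hypothetical study: 40 TP, 10 FN, 30 FP, 120 TN.
sn, sp = sn_sp(tp=40, fp=30, fn=10, tn=120)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")                              # Sn = 0.80, Sp = 0.80
print(f"logit(Sn) = {logit(sn):.2f}, logit(Sp) = {logit(sp):.2f}")  # 1.39 and 1.39
```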

Results

A total of 53 comparative studies (beginning with Piccolo et al. in 20029) fulfilled the inclusion criteria (Fig. 1). Most studies focused on dermoscopic images (n = 31), followed by clinical images (n = 14) and both image types (n = 8). Detailed extracted data are shown in Table 1 for dermoscopic imaging studies, Table 2 for clinical imaging studies, and Table 3 for combined clinical and dermoscopic imaging studies.

Table 1 Included studies general characteristics, dataset used, and performance evaluating dermoscopy
Table 2 Included studies general characteristics, dataset used, and performance evaluating clinical images
Table 3 Included studies general characteristics, dataset used, and performance evaluating both dermoscopic and clinical images

Regarding the risk of bias, most studies had an uncertain risk (58%), and 14 (26%) had a low risk of bias. Details of the QUADAS-2 scores for each study included in the systematic review are shown in Fig. 2.

Fig. 2: QUADAS-2 results of the assessment of risk of bias in the included studies.

The QUADAS-2 tool was used to assess the risk of bias in the included studies across 4 domains (participants, index test, reference standard, and analysis). Low risk (cyan) refers to the number of studies with a low risk of bias in the respective domain. Unclear (gray) refers to the number of studies with an unclear risk of bias in the respective domain due to a lack of information reported by the study. High risk (purple) refers to the number of studies with a high risk of bias in the respective domain. a Risk of bias assessment. b Applicability concerns.

Databases used

Only institutional or private databases were used in 20 articles (37.7%). In all, 16 articles (30.2%) used exclusively open-source data; the most commonly used databases were ‘ISIC’ and ‘HAM10000’10,11. Eighteen studies (33.9%) used a combination of institutional and public datasets. Twenty-two studies (41.5%) used only images of lesions confirmed by histopathology, while 27 (50.9%) included images diagnosed by expert consensus as the gold standard. Four studies (7.5%) did not specify a method of diagnosis confirmation. Fourteen studies (26.4%) used an external database for testing the algorithm, while 39 studies (73.6%) tested with an internal dataset (Tables 1–3).

Study type and participants included

A total of 50 studies (94.3%) were retrospective and 3 (5.7%) were prospective. Twenty-seven studies (50.9%) included only specialists, in some cases detailing the level of expertise (expert dermatologists vs non-expert dermatologists). Twenty-three studies (43.3%) included dermatologists and other non-specialist clinicians (dermatology residents and/or generalists), and 3 studies (5.7%) included only generalists.

Diagnosis included and metadata

Forty-three studies (81.1%) considered differential diagnosis between skin tumors only, while 10 (18.8%) also included inflammatory or other pathologies (multiclass algorithms). Eighteen articles (33.9%) included clinical information on the patients (metadata), mainly age, sex, and lesion location.

Artificial intelligence assistance

Of the total number of articles included in the review, 11 (20.7%) evaluated potential changes in clinicians' diagnostic performance or therapeutic decisions with AI assistance. Nine of the 11 studies showed an improvement in global diagnostic performance when using AI collaboration, 6 of which showed the greatest percentage of improvement in the generalist group.

Diagnostic performance of artificial intelligence algorithm versus clinicians, from dermoscopic images of skin lesions

Thirty-one studies evaluated diagnostic performance with dermoscopic images (Table 1). Overall, 61.2% (n = 19) of these studies showed better performance of AI compared with clinicians, 29.0% (n = 9) showed comparable performance, and in 9.7% (n = 3) specialists outperformed AI.

Dichotomous classification (‘benign’ vs ‘malignant’)

Eighteen studies (58.0%) used AI with a dichotomous classification (‘benign’ vs ‘malignant’). AI outperformed clinicians in 61.1% (n = 11)12,13,14,15,16,17,18,19,20,21,22,23, with the difference being statistically significant in 54.5% of them12,15,16,18,20,21. A total of 27.7% (n = 5) showed comparable performance between AI and clinicians9,24,25,26,27. In 11.1% (n = 2), clinicians performed better than AI28,29, 1 of them with statistical significance29. Five studies16,17,18,19,28 evaluated the collaboration between AI and clinicians (‘augmented intelligence’). All of them showed improved diagnostic accuracy when clinicians were supported by AI algorithms, with the effect being more relevant for less experienced clinicians; statistical significance was demonstrated in two16,17.

Multiclass and combined classification

Eight of the 31 studies used multiclass classification; in 4 of them, AI had a better performance30,31,32,33; in 3 studies, the diagnostic accuracy was comparable34,35,36; and in 1, clinicians outperformed AI37. Two of the 8 studies evaluated AI assistance, both showing improvement in diagnostic accuracy for human raters, with the least experienced clinicians benefiting the most32,35. Five of the 31 dermoscopy studies developed both dichotomous and multiclass algorithms, 4 of them resulting in better performance of AI over humans38,39,40,41.

Diagnostic performance of artificial intelligence algorithms versus clinicians, using clinical images

A total of 14 AI articles evaluating CNN-based classification approaches that used clinical images only were included (Table 2). Of these, 42.8% (n = 6) showed a better performance of AI algorithms, 28.6% (n = 4) obtained comparable results, and in 28.6% (n = 4) clinicians outperformed AI.

Dichotomous classification (‘benign’ vs ‘malignant’)

Six studies42,43,44,45,46 developed an AI algorithm with dichotomous outcomes; performance was comparable or superior to clinicians in 5 of them42,43,44,45. One study showed better performance for clinicians46.

Multiclass and combined classification

Five studies47,48,49,50,51 incorporated AI algorithms with multiclass classification. Zhao et al.48 and Pangti et al.49 obtained superior performance of the AI algorithms, while Chang et al.47 showed comparable performance between AI and specialists. In one study, clinicians outperformed the AI algorithm50.

Three studies52,53,54 with clinical images used both dichotomous and multiclass algorithms. Han et al.53 observed an improvement in diagnostic Sn and Sp with the assistance of the AI algorithm for both classifications, with the improvement being statistically significant for less experienced clinicians.

Diagnostic performance of artificial intelligence algorithms versus clinicians, from both clinical and dermoscopic images

Eight studies included clinical and dermoscopic images as part of their analysis21,55,56,57,58,59,60,61. Overall, 75% (n = 6) resulted in comparable performance, and 25% (n = 2) showed better performance for AI algorithms in comparison to clinicians. Only 1 study obtained statistical significance57.

Dichotomous classification

Six studies applied a dichotomous classification; Haenssle et al.57 was the only one in which the AI algorithm outperformed clinicians, despite the incorporation of metadata. The five remaining studies showed comparable performance between AI and clinicians.

Multiclass and combined classification

Huang et al.61 classified lesions into 6 categories, with AI being superior to specialists in average accuracy. Finally, Esteva et al.55 used two multiclass classifications, showing comparable performance between AI and clinicians in both.

Meta-analysis

A total of 19 studies were included in the meta-analysis. Table 4 shows the summary estimates calculated to compare performance between AI and clinicians with different levels of experience.

Table 4 Meta-analysis results, summary estimates of sensitivity, specificity, and likelihood ratio according to subgroups

Only 1 prospective study met the inclusion criteria and was included in the meta-analysis.

AI vs overall clinicians’ meta-analysis

When analyzing the whole group of clinicians, without accounting for expertise level, AI obtained a Sn of 87.0% (95% CI 81.7–90.9%) and a Sp of 77.1% (95% CI 69.8–83.0%), while overall clinicians obtained a Sn of 79.8% (95% CI 73.2–85.1%) and a Sp of 73.6% (95% CI 66.5–79.6%), with a statistically significant difference for both Sn and Sp according to the likelihood ratio test (p < 0.001 for both). The forest plot is available in Fig. 3a, b. The ROC curve shapes confirmed these differences (Fig. 4). Supplementary Fig. 1a, b shows the subanalysis adjusted for retrospective vs prospective design.
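As a worked illustration (our own back-of-the-envelope calculation, not a value quoted from the study tables), diagnostic likelihood ratios follow directly from pooled Sn and Sp: using the AI estimates above, LR+ = Sn / (1 − Sp) ≈ 0.870 / 0.229 ≈ 3.8 and LR− = (1 − Sn) / Sp ≈ 0.130 / 0.771 ≈ 0.17.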

Fig. 3: Forest plot detailing the sensitivity and specificity for all groups of clinicians (‘overall’) and artificial intelligence algorithms from each study included in the meta-analysis according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and all clinicians (‘overall’) (right). b Specificity for artificial intelligence (left) and all clinicians (‘overall’) (right).

Fig. 4: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and all group of clinicians (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

AI vs generalists clinicians’ meta-analysis

When analyzing AI performance vs generalists, AI obtained a Sn of 92.5% (95% CI 88.9–94.9%) and a Sp of 66.5% (95% CI 56.7–75.0%), while generalists obtained a Sn of 64.6% (95% CI 47.1–78.9%) and a Sp of 72.8% (95% CI 56.7–84.5%), with the difference being statistically significant for both Sn and Sp according to the likelihood ratio tests (p < 0.001 for both). The ROC curve shapes confirmed these differences, with higher heterogeneity and wider confidence intervals for generalists (Fig. 5). A subgroup analysis comparing internal vs external test sets was not possible, given that all included studies in this subgroup used an internal test set (Fig. 6a, b).

Fig. 5: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and generalists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 6: Forest plots of studies showing artificial intelligence vs generalists sensitivity and specificity.

a Sensitivity for artificial intelligence (left) and for generalists (right). b Specificity for artificial intelligence (left) and for generalists (right).

AI vs non-expert dermatologists’ meta-analysis

AI obtained a Sn of 85.4% (95% CI 78.9–90.2%) and a Sp of 78.5% (95% CI 70.6–84.8%), while non-expert dermatologists obtained a Sn of 76.4% (95% CI 71.1–80.9%) and a Sp of 67.1% (95% CI 57.2–75.6%), with a statistically significant difference in both Sn and Sp (p < 0.001 for both). The ROC curve shapes confirmed these results (Fig. 7). The forest plot is available in Fig. 8a, b. In the internal vs external test set subgroup analysis (Fig. 8a, b), AI achieved better Sn with the external test set and greater Sp with an internal test set. For non-expert dermatologists, no changes in Sn were observed; however, they achieved better Sp with the external test set. In the prospective vs retrospective subgroup analysis (Supplementary Fig. 2), only 1 prospective study met the inclusion criteria and was included in the meta-analysis; a trend towards better Sn in retrospective versus prospective studies was observed.

Fig. 7: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and non-expert dermatologists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 8: Forest plots of studies showing artificial intelligence vs non-expert dermatologists sensitivity and specificity according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and for non-expert dermatologists (right). b Specificity for artificial intelligence (left) and for non-expert dermatologists (right).

AI vs expert dermatologists’ meta-analysis

AI obtained a Sn of 86.3% (95% CI 80.4–90.7%) and a Sp of 78.4% (95% CI 71.1–84.3%), while expert dermatologists obtained a Sn of 84.2% (95% CI 76.2–89.8%) and a Sp of 74.4% (95% CI 65.3–81.8%); this difference was statistically significant for both Sn and Sp according to the likelihood ratio test (p < 0.001 for both). The ROC curve shapes were comparable for AI and expert dermatologists, with narrow confidence intervals (Fig. 9). The subgroup analysis by internal vs external test set showed that AI had better Sn with the external test set, while Sp was better with the internal test set. For expert dermatologists, there was no difference in Sn, while Sp was better with the external test set (Fig. 10a, b). The subgroup analysis by study design, retrospective vs prospective (Supplementary Fig. 3), included only one prospective study.

Fig. 9: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and expert dermatologists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 10: Forest plots of studies showing artificial intelligence vs expert dermatologists sensitivity and specificity according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and expert dermatologists (right). b Specificity for artificial intelligence (left) and for expert dermatologists (right).

Discussion

In the present study, the meta-analysis of included studies found an overall Sn of 87% and Sp of 77% for AI algorithms, and an overall Sn of 79% and Sp of 73% for all clinicians (‘overall clinicians’). Differences between AI and all clinicians were statistically significant. Performance of AI algorithms and specialists was comparable. The difference between AI performance (Sn 92%, Sp 66%) and the generalist subgroup (Sn 64%, Sp 72%) was more marked than the difference between AI and expert dermatologists. In studies that evaluated AI assistance (‘augmented intelligence’), the overall diagnostic performance of clinicians improved significantly when using AI algorithms62,63,64. This improvement was more important for clinicians with less experience. This is in line with the results of this meta-analysis, in which the difference was greater for generalists than for expert dermatologists, and opens an opportunity for AI assistance among less-experienced clinicians. To the best of our knowledge, this is the first systematic review and meta-analysis on the diagnostic accuracy of health-care professionals versus AI algorithms using dermoscopic or clinical images of cutaneous neoplasms. The inclusion of a meta-analysis is key to better understanding, quantitatively, the current state of the art of AI algorithms for the automated diagnosis of skin cancer.

In general, the included studies presented diverse methodologies and significant heterogeneity regarding the types of images included, the classifications used, the characteristics of the participants, and the methodology for presenting results. This heterogeneity should be taken into account when generalizing and meta-analyzing the findings and when interpreting the results of this study. Research in AI and its potential applications in clinical practice has increased exponentially during the last few years in different areas of medicine, not only in dermatology65. Other systematic reviews have also reported that, in experimental settings, most algorithms are able to achieve at least comparable results when compared with clinicians; however, they also describe limitations similar to those described here66,67,68,69. Only a few studies have evaluated the role of AI algorithms in real clinical scenarios in dermatology. Our study confirms that only 5.7% of the included studies were prospective, and only one of the prospective studies was suitable for meta-analysis62,63. This contrasts with recent data in other medical areas showing an increase in the clinical use of AI70 and highlights the relevance of understanding the role of AI in skin cancer and dermatology. However, prospective studies pose a real challenge for AI algorithms on the path to daily clinical practice, as the algorithms must then face specific tests such as ‘out-of-distribution’ images or cases.

Based on the results of this systematic review and meta-analysis, several challenges become evident when applying AI in clinical practice. First, databases are essential when training an AI algorithm. Small databases, the inclusion of only specific populations, or limited variation in skin phototypes limit the extrapolation of results71,72,73. The lack and underrepresentation of certain ethnic groups and skin types in current datasets has been mentioned as a potential source of perpetuating healthcare disparities73. Based on the results of our systematic review, we can confirm that, in at least half of the studies, algorithms were trained using the same datasets over and over, which translates into a lack of representation of specific groups. The diversity of techniques and camera types (e.g., professional cameras vs smartphones) used to capture images, image quality, and possible artifacts such as pencil marks, rulers, or other objects are also variables that must be considered when evaluating the performance of AI algorithms71,72,74. A second limitation is the lack of inclusion of metadata in the AI algorithms. In the real world, we manage additional layers of patient information, including demographic data, personal and family history, habits, evolution of the disease, and a complete physical examination including palpation and side illumination, not only 2-D visual examination. These elements are important to render a correct differential diagnosis and to guide clinical decision-making, and so far very few AI models incorporate them. Therefore, real-world diagnosis differs from static 2-D image evaluation. Regarding the design of human evaluation in experimental and retrospective studies, in most cases it aims to determine whether a lesion is benign or malignant, or to provide a specific diagnosis. This differs from clinical practice in a real-life setting, in which decisions are generally behavioral (whether to follow up, biopsy, or remove a lesion) rather than exclusively providing a specific diagnosis based on the clinical evaluation. The scarcity of prospective studies that account for this real-world clinical evaluation restricts the generalization of these positive AI results, which are mainly based on retrospective studies. In addition, the management of patient information and privacy, and the legal and regulatory aspects of applying AI-based software in clinical practice, also represent an emerging challenge75.

The current evidence gathered in this article supports collaboration between AI and clinicians (‘augmented intelligence’), especially for non-expert physicians. In the future, AI algorithms are likely to become a relevant tool to improve the evaluation of skin lesions by generalists in primary care centers, or by clinicians with less access to specialists63. AI algorithms could also allow for prioritization of referral or triage, improving early diagnosis. There are currently ongoing studies evaluating the application of AI algorithms in real clinical settings, which will demonstrate the applicability of these results in clinical practice. The first prospective randomized controlled trial, by Han et al.62, showed that diagnostic accuracy improved when a group of clinicians used AI assistance, with the greatest improvement among generalists. The results of this recent randomized clinical trial partially confirm the potentially positive role of AI in dermatology and that the benefit is more pronounced for generalists, aligning with the findings of the present meta-analysis.

With the aim of reducing the current barriers, we propose generating and applying guidelines that standardize the methodology of AI studies. One proposal is the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology, published by Daneshjou et al.76. These guidelines should cover the complete workflow, from the moment images are captured to protocols on databases, participant experience, statistical reporting, and definitions of how accuracy is measured, among many other aspects. This would allow different studies to be compared and better-quality evidence to be generated. For example, Esteva et al.52 defined ‘overall accuracy’ as the average of individual inference class accuracies, which might differ from other definitions. In addition, it is mandatory to collaborate with international collaborative databases (e.g., ISIC, available at www.isic-archive.com) to provide accessible public benchmarks, ensure repeatability, and include a diverse range of skin types and ethnicities to avoid underrepresentation of certain groups. These strategies would make current datasets more diverse and generalizable.

The main strengths of the present study were the extensive and systematic search of 3 databases, encompassing studies from the early days of AI up to the most recently published studies; the strict criteria applied for the evaluation of studies and extraction of data, following the available guidelines for systematic reviews; and the performance of a meta-analysis, which allows for a quantitative assessment of the current AI data.

Limitations include the possibility of having missed articles available in databases other than those searched, or published in other languages, thus constituting a potential selection bias. Also, AI is a rapidly evolving field, and new relevant articles might have emerged while the data were being analyzed; to the best of our knowledge, no landmark studies were published in the meantime. Publication bias cannot be ruled out, since articles with statistically significant results are more likely to be published. In addition, as shown in our results, more than half of the studies (64.1%) utilized the same public databases (e.g., ISIC and HAM10000), generating a possible overlap of images between the training and testing groups. Furthermore, most studies used the same dataset for training and testing the algorithm (73.6% used an internal test set), which might further bias the results. As observed in the subgroup analysis of the present study, estimated Sn and Sp differed for both AI and clinicians depending on whether an internal or external test set was used; however, these were post-hoc analyses and should be interpreted with caution. An external test set is key for the proper evaluation of AI algorithms6, to ‘validate’ that an algorithm retains its performance when presented with data from other datasets. Limited details regarding the human readers' assessments were available, which could also affect the results. We also grouped all skin cancers together for analysis, although accuracy varies across skin cancer types (e.g., melanoma vs basal cell carcinoma vs squamous cell carcinoma) for both humans and AI algorithms. The application of QUADAS-2 introduces a potential information bias, as it is an operator-dependent tool that generates subjective, qualitative results. Regarding the meta-analysis, we faced two main limitations. Firstly, the heterogeneity between studies makes it difficult to interpret or generalize the results obtained. Secondly, due to the lack of necessary data, the number of studies included in the meta-analysis was smaller than the number included in the systematic review. Finally, a minimal number of prospective studies was included in the systematic review and only one entered the meta-analysis; therefore, those results must be interpreted with caution. Nevertheless, in this post-hoc analysis, prospective studies showed worse performance of AI algorithms compared with clinicians, confirming the relevance of the complete physical examination and other clinical variables such as history and palpation. This also highlights the lack of published real-world data, given that most studies were retrospective reader studies.

Conclusion

This systematic review and meta-analysis demonstrated that the diagnostic performance of AI algorithms was better than that of generalists and non-expert dermatologists and, although the differences were statistically significant, was comparable to that of expert dermatologists in clinical practice, as the differences were minimal. As most studies were performed in experimental settings, future studies should focus on prospective, real-world settings and on AI assistance. Our study suggests that it is time to move forward to real-world studies and randomized clinical trials to accelerate progress for the benefit of our patients. The only randomized study available has shown better diagnostic accuracy when AI algorithms are used as ‘augmented intelligence’62. We envision a fruitful collaboration between AI and humans, leveraging the strengths of both to enhance diagnostic capabilities and patient care.