Introduction

As a result of increasing data availability and computational power, artificial intelligence (AI) algorithms have reached a level of sophistication that enables them to take on complex tasks previously only conducted by human beings1. Several AI algorithms are now approved by the United States Food and Drug Administration (FDA) for medical use2,3,4. Though there are currently no image-based dermatology AI applications that have FDA approval, several are in development2.

Skin cancer diagnosis relies heavily on the interpretation of visual patterns, making it a complex task that requires extensive training in dermatology and dermatoscopy5,6. AI algorithms have been shown to accurately diagnose skin cancers, even outperforming experienced dermatologists in image classification tasks in constrained settings7,8,9. However, these algorithms can be sensitive to data distribution shifts. AI-human partnerships could therefore provide performance improvements that surmount the limitations of either human clinicians or AI alone. Notably, Tschandl et al. demonstrated in their 2020 paper that the accuracy of clinicians supported by AI algorithms surpassed that of either clinicians or AI algorithms working separately10. This approach of an AI-clinician partnership is considered the most likely clinical use of AI in dermatology, given the ethical and legal concerns of fully automated diagnosis. There is therefore an urgent need to better understand how the use of AI by clinicians affects decision making11. The goal of this study was to evaluate the diagnostic accuracy of clinicians with vs. without AI assistance using a systematic review and meta-analysis of the available literature.

Results

Literature search and screening

For this systematic review and meta-analysis, 2983 records were initially retrieved, of which 1972 abstracts were screened after automatic duplicate removal by Covidence (Fig. 1). After 1936 articles were deemed irrelevant and excluded, the full text of 36 articles was reviewed. A total of 12 studies were included in the systematic review10,12,13,14,15,16,17,18,19,20,21,22 and ten studies were included in the meta-analysis10,12,13,14,15,17,19,20,21,22; the information needed to create contingency tables of AI-assisted and unassisted medical professionals was unavailable in the remaining two studies16,18.

Fig. 1: Study selection.

Flow diagram of the study selection process.

Study characteristics

Tables 1 and 2 present the characteristics of the included studies. Half of the studies were conducted in Asia (50%; South Korea = 5, China = 1) and the other half in North/South America (25%; USA = 1, Argentina = 1, Chile = 1) and Europe (25%; Austria = 1, Germany = 1, Switzerland = 1). More studies were performed in experimental (67%, n = 8) than in clinical settings (33%, n = 4). A quarter of the studies included only dermatologists (25%, n = 3), and more than half (58%, n = 7) included a combination of dermatology specialists (e.g., dermatologists and dermatology residents) and non-dermatology medical professionals (e.g., primary care physicians, nurse practitioners, medical students); among these, two studies included lay persons, but these data were not included in the meta-analysis. Two studies (17%) included only non-dermatology medical professionals. The median number of study participants was 18.5, ranging from 7 to 302.

Table 1 Study characteristics of the 12 included studies
Table 2 Characteristics of the algorithms and data used in the included studies

Clinical information was provided to study participants in addition to images or in-person visits in half of the studies (50%, n = 6). For diagnosis, outpatient clinical images were most frequently provided (42%, n = 5), followed by dermoscopic images (33%, n = 4) and in-person visits (25%, n = 3). The diagnostic task was either choosing the most likely diagnosis (58%, n = 7) or rating the lesion as malignant vs. benign (42%, n = 5). Most studies (75%, n = 9) used a paired design in which the same reader diagnosed the same case first without and then with AI assistance, whereas two studies provided different images for the two tasks. A fully crossed design (i.e., all readers diagnosing all cases in both modalities) was used in four studies. One study reported diagnoses only with AI support and therefore did not allow analysis of the effect of AI16. The reference standard was either a varying combination of histopathology, a dermatologist panel's diagnosis, or the treating physician's diagnosis from medical records, clinical follow-up or in vivo confocal microscopy (75%, n = 9), or histopathologic diagnosis of all images (17%, n = 2). One study used, as the reference standard, either histopathology or concordance between the study participant and the two AI tools under study17. Most AI algorithms did not provide any explanation of their outputs beyond the top-1 or top-3 diagnoses with their respective probabilities or a binary malignancy score. Content-based image retrieval (CBIR) was the only explainability method used, namely in two of the studies (17%), and Tschandl et al.10 was the only study that examined the effects of different representations of AI output on the diagnostic performance of physicians. The definition of the target condition varied across studies, but all studies included at least one skin cancer among the differential diagnoses. A summary of the methodological quality assessments can be found in Supplementary Table 1. Although κ was low (κ = 0.33), Bowker's test of symmetry23 was not significant; hence, the two raters were considered to have the same propensity to select categories. All three assessors agreed with the final quality assessments.

Meta-analyses results

The summary estimate of sensitivity for clinicians overall was 74.8% (95% CI 68.6–80.1) and of specificity 81.5% (73.9–87.3). The overall diagnostic accuracy increased with AI assistance, to a pooled sensitivity of 81.1% (74.4–86.5) and specificity of 86.1% (79.2–90.9). The SROC curves and forest plots of the ten studies for clinicians without vs. with AI assistance are shown in Figs. 2 and 3, respectively; less heterogeneity is observed in the sensitivity of unassisted clinicians than in that of AI-assisted clinicians.

Fig. 2: SROC Curves.

Performance of clinicians with no AI assistance (a) compared to AI-assisted clinicians (b) in the included studies. SE sensitivity, SP specificity.

Fig. 3: Forest plots.

Meta-analysis results of the diagnostic performance of clinicians without (a) or with (b) AI assistance.

To investigate the effect of AI assistance in more detail, we conducted subgroup analyses based on clinical experience level, test task and image type (Table 3). We observed that dermatologists had the highest diagnostic accuracy in terms of sensitivity and specificity. Residents (including dermatology residents and interns) were the second most accurate group, followed by non-dermatologists (including primary care providers, nurse practitioners and medical students). Notably, AI assistance significantly improved the sensitivity and specificity of all groups of clinicians. The non-dermatologist group appeared to benefit the most from AI assistance in terms of improvement of pooled sensitivity (+13 percentage points) and specificity (+11 percentage points). For the classification task, the sensitivity of both binary classification (malignant vs. benign) and top diagnosis improved with AI assistance, whereas AI assistance significantly improved pooled specificity only for the top-diagnosis task, reaching a specificity of 88.8% (86.5–90.8). No significant difference was observed by image type.

Table 3 Subgroup analysis by clinician’s experience level, image type and classification task

There was no evidence of a small-study effect in the regression test of funnel plot asymmetry for clinicians either without (p = 0.33) or with AI assistance (p = 0.23); see Supplementary Fig. 1 for funnel plots. The Spearman correlation test indicated that a positive threshold effect was unlikely in both groups. Sensitivity analyses revealed that excluding outliers slightly increased the pooled sensitivity and specificity in both groups, whereas the pooled sensitivity and specificity remained largely unchanged when the low-quality study was excluded (Supplementary Table 2).

Discussion

This systematic review and meta-analysis included 12 studies and 67,700 diagnostic evaluations of potential skin cancer by clinicians with and without AI assistance. Our findings highlight the potential of AI-assisted decision-making in skin cancer diagnosis. All clinicians, regardless of their training level, showed improved diagnostic performance when assisted by AI algorithms. The degree of improvement, however, varied across specialties, with dermatologists exhibiting the smallest increase in diagnostic accuracy and non-dermatologists, including primary care providers, demonstrating the largest improvement. These results suggest that AI assistance may be especially beneficial for clinicians without extensive training in dermatology. Given that many dermatological AI devices have recently obtained regulatory approval in Europe, including some CE-marked algorithms utilized in the analyzed studies24,25, AI assistance may soon be a standard part of a dermatologist's toolbox. It is therefore important to better understand the interaction between humans and AI in clinical decision-making.

While several studies have been conducted to evaluate the dermatologic use of new AI tools, our review of published studies found that most have only compared human clinician performance with that of AI tools, without considering how clinicians interact with these tools. Two of the studies in this systematic review and meta-analysis reported that clinicians perform worse when the AI tool provides incorrect recommendations10,19. This finding underscores the importance of accurate and reliable algorithms in ensuring that AI implementation enhances clinical outcomes, and highlights the need for further research to validate AI-assisted decision-making in medical practice. Notably, in a recent study by Barata et al.26, the authors demonstrated that a reinforcement learning model that incorporated human preferences outperformed a supervised learning model. Furthermore, it improved the performance of participating dermatologists in terms of both diagnostic accuracy and optimal management decisions of potential skin cancer when compared to either a supervised learning model or no AI assistance at all. Hence, the development of algorithms in collaboration with clinicians appears to be important for optimizing clinical outcomes.

Only two studies explored the impact of one explainability technique (CBIR) on physicians' diagnostic accuracy or perceived usefulness. The real clinical utility of explainability methods needs to be further examined, and current methods should be viewed as tools to interrogate and troubleshoot AI models27. Additionally, prior research has shown that human behavioral traits can affect trust in and reliance on AI assistance in general28,29. For example, a clinician's perception of and confidence in the AI's performance on a given task may influence whether they decide to incorporate AI advice into their decision30. Moreover, research has also shown that the human's confidence in their decision, the AI's confidence level, and whether the human and AI agree all influence whether the human incorporates the AI's advice30. To ensure that AI assistance supports and improves diagnostic accuracy, future research should investigate how factors such as personality traits29, cognitive style28 and cognitive biases31 affect diagnostic performance in real clinical situations. Such research would help inform the integration of AI into clinical practice.

Our findings suggest that AI assistance may be particularly beneficial for less experienced clinicians, consistent with prior studies of human-AI interaction in radiology32. This highlights the potential of AI assistance as an educational tool for non-dermatologists and for improving diagnostic performance in settings such as primary care or for dermatologists in training. In a subgroup analysis, we observed no significant difference between AI-assisted non-dermatologist medical professionals and unassisted dermatologists (data not shown). However, this area warrants further research.

Some limitations need to be considered when interpreting the findings. First, among the ten studies that provided sufficient data for meta-analysis, there were differences in design, number and experience level of participants, target condition definition, classification task, and algorithm output and training. Taken together, this heterogeneity implies that direct comparisons should be interpreted carefully. Furthermore, caution is warranted in interpreting the subgroup analyses because of the small number of studies per subgroup (up to seven) and the data structure (i.e., repeated measures), since the same participants examined the clinical images both without and with AI assistance in most studies. Given the low number of studies, we refrained from performing further subgroup analyses, such as comparing specific cancer diagnoses in the subset of articles where these were available. Despite these limitations, our results from this meta-analysis support the notion that AI assistance can have a positive effect on clinician diagnostic performance. We were able to adjust for potential sources of heterogeneity, including diagnostic task and clinician experience level, when comparing the diagnostic accuracy of clinicians with vs. without AI assistance. Moreover, no signs of publication bias and a low likelihood of threshold effects were observed. Lastly, the findings were robust in that the pooled sensitivity and specificity remained nearly unchanged after excluding outliers or low-quality studies.

Of note, few studies provided participating clinicians with both clinical data and dermoscopic images, which would be available in a real-life clinical situation. Previous research has shown that the use of dermoscopy enables a relative improvement in the diagnostic accuracy for melanoma of almost 50% compared to the naked eye5. In one such study, participants were explicitly not allowed to use dermoscopy during the patient examination19. Overall, only four studies were conducted in a prospective clinical setting, and three of these could be included in the meta-analysis. Thus, most diagnostic ratings in this meta-analysis were made in experimental settings and do not necessarily reflect the decisions made in a real-world clinical situation.

One of the main concerns regarding the accuracy of AI tools relates to the quality of the data they have been trained on33. As only three studies used publicly available datasets, evaluation of the data quality is difficult. Furthermore, darker skin tones were underrepresented in the datasets of the included studies, which is a known problem in the field, as most papers do not report skin tone information34. However, datasets with diverse skin tones have been developed and made publicly available in an effort to reduce disparities in AI performance in dermatology35,36. Moreover, few studies provided detailed information about the origin and number of images used for training, validation, and testing of the AI tool, and different definitions of these terms were used across studies. There is a need for better transparency guidelines for AI tool reporting to enable users and readers to understand the limits and capabilities of these diagnostic tools. Efforts are being made to develop guidelines adapted for this purpose, including the STARD-AI37, TRIPOD-AI and PROBAST-AI38 guidelines, as well as the dermatology-specific CLEAR Derm guidelines39. In addition, PRISMA-AI40 guidelines for systematic reviews and meta-analyses are being developed. These are promising initiatives that will hopefully make both the reporting and evaluation of AI diagnostic tool research more transparent.

Conclusion

The results of this systematic review and meta-analysis indicate that clinicians benefit from AI assistance in skin cancer diagnosis regardless of their experience level. Clinicians with the least experience in dermatology may benefit the most from AI assistance. Our findings are timely as AI is expected to be widely implemented in clinical work globally in the near future. Notably, only four of the identified studies were conducted in clinical settings, three of which could be included in the meta-analysis. Therefore, there is an urgent need for more prospective clinical studies conducted in real-life settings where AI is intended to be used, in order to better understand and anticipate the effect of AI on clinical decision making.

Methods

Search strategy and selection criteria

We searched four electronic databases, PubMed, Embase, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore) and Scopus, for peer-reviewed articles on AI-assisted skin cancer diagnosis, without language restriction, from January 1, 2017, until November 8, 2022. Search terms were combined for four key concepts: (1) AI, (2) skin cancer, (3) diagnosis, (4) doctors. The full search strategy is available in the Supplementary material (Supplementary Table 3). We chose 2017 as the cutoff for this review since this was the year when deep learning was first reported to perform at a level comparable to dermatologists, notably in the seminal study by Esteva et al.9, which suggested that AI technology had reached a clinically useful level for assisting skin cancer diagnosis.

We applied Google Translate software for abstract screening of non-English articles. Manual searches were performed for conference proceedings, including NeurIPS, HICSS, ICML, ICLR, AAAI, CVPR, CHIL and ML4Health, and to identify additional relevant articles by reviewing bibliographies and citations of the screened papers and searching Google Scholar.

We included studies comparing the diagnostic accuracy of clinicians detecting skin cancer with and without AI assistance. If studies provided diagnostic data from medical professionals other than physicians, these data were also included in the analysis, as long as the study also included physicians. However, we excluded studies if (1) diagnosis was not made from either images of skin lesions or in-person visits (e.g., pathology slides), (2) diagnostic accuracy was only compared between clinicians and an AI algorithm, (3) non-deep learning techniques were used, or (4) the articles were editorials, reviews, or case reports. We did not restrict participants' expertise, study design or sample size, reference standard, or skin diagnosis, provided at least one skin malignancy was included in the study. We contacted nine authors to request additional data and clarifications required for the meta-analysis and received data from five of them10,12,13,14,15 and clarifications from two16,17. In four studies10,14,15,17, raw data were not available for all experiments or lesions, and our meta-analysis included the data that were available. Studies with insufficient data to construct contingency tables16,18 were included in the systematic review but not in the meta-analysis.

Three reviewers performed eligibility assessment, data extraction, and study quality evaluations (IK, JK, ZRC). Commonly used standardized programs were employed for duplicate removal, title and abstract screening, and full-text review (Covidence), and for data extraction (Microsoft Excel). Paired reviewers independently screened the titles and abstracts using predefined criteria and extracted data. Disagreement was resolved by discussion with the third reviewer. IK imported the extracted data into the summary table for the systematic review, and two reviewers (JK and ZRC) verified it. JK imported the extracted data and prepared it for meta-analysis, and two reviewers (ZRC and IK) verified it. A biostatistician (AL) reviewed and confirmed the final data for meta-analysis. All co-authors reviewed the final tables and figures. This systematic review and meta-analysis followed the PRISMA-DTA guidelines41, and the study protocol was registered with PROSPERO, CRD42023391560.

Data analysis

We extracted key information, including true positive, false positive, false negative, and true negative counts for clinicians with and without AI assistance. We generated contingency tables, where possible, to estimate diagnostic test accuracy in terms of pooled sensitivity and specificity. Additional information about the AI algorithm (e.g., architecture, image sources, validation and AI assistance method), participants, patients, target condition, reference standard, study setting and design, and funding was also extracted.
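As an illustration of this step, the sketch below shows how per-study sensitivity and specificity with exact (Clopper–Pearson) confidence intervals can be derived from such a contingency table. It is not the analysis code used for the meta-analysis, and the counts are hypothetical placeholders, not data from the included studies.

```python
# Illustrative sketch only: sensitivity and specificity with Clopper-Pearson
# (exact) confidence intervals from extracted 2x2 counts.
# The counts below are hypothetical placeholders, not data from the included studies.
from statsmodels.stats.proportion import proportion_confint

contingency = {
    "without_AI": {"tp": 80, "fp": 30, "fn": 20, "tn": 170},
    "with_AI":    {"tp": 90, "fp": 24, "fn": 10, "tn": 176},
}

for group, c in contingency.items():
    sens = c["tp"] / (c["tp"] + c["fn"])   # sensitivity = TP / (TP + FN)
    spec = c["tn"] / (c["tn"] + c["fp"])   # specificity = TN / (TN + FP)
    sens_lo, sens_hi = proportion_confint(c["tp"], c["tp"] + c["fn"], method="beta")
    spec_lo, spec_hi = proportion_confint(c["tn"], c["tn"] + c["fp"], method="beta")
    print(f"{group}: sensitivity {sens:.2f} ({sens_lo:.2f}-{sens_hi:.2f}), "
          f"specificity {spec:.2f} ({spec_lo:.2f}-{spec_hi:.2f})")
```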

A revised tool for the methodological quality assessment of diagnostic accuracy studies (QUADAS-2)42 was used to assess the risk of bias and concerns of applicability of each study in four domains: patient selection, index test, reference standard, and flow and timing (Supplementary Table 1). A pair of reviewers independently evaluated the domains, compared the ratings, and, in case of conflict, reconciled the discrepancies through discussions led by the third reviewer (IK, JK, ZRC).
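As a minimal illustration of how agreement between two QUADAS-2 assessors can be quantified with Cohen's κ and Bowker's test of symmetry (as reported in the Results), the sketch below uses a hypothetical 3 × 3 cross-tabulation of low/unclear/high judgements; it is not the authors' analysis code, which was run in Stata.

```python
# Illustrative sketch only: agreement between two QUADAS-2 assessors.
# Rows = rater 1, columns = rater 2; categories = low / unclear / high risk of bias.
# The counts are hypothetical placeholders, not the study's actual ratings.
import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa
from statsmodels.stats.contingency_tables import SquareTable

ratings = np.array([[10, 2, 0],
                    [ 3, 6, 1],
                    [ 0, 1, 5]])

kappa = cohens_kappa(ratings).kappa                       # chance-corrected agreement
bowker = SquareTable(ratings).symmetry(method="bowker")   # H0: raters have the same
                                                          # propensity to select categories
print(f"kappa = {kappa:.2f}, Bowker chi2 = {bowker.statistic:.2f}, p = {bowker.pvalue:.2f}")
```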

We used the Metandi package43 for Stata 17 (StataCorp, College Station, TX) to compute summary estimates of sensitivity and specificity with 95% confidence intervals (95% CI) for clinicians with AI assistance compared to clinicians without AI assistance, using a bivariate model44. Summary receiver operating characteristic (SROC) curves were plotted to visually present the summary estimates of sensitivity and specificity with the 95% confidence region and the 95% prediction region, i.e., the region into which the sensitivity and specificity of future studies are likely to fall. Bivariate models were fitted separately for clinicians with vs. without AI assistance because the Metandi package could not handle the paired design of the data. We applied a random-effects model to account for the anticipated heterogeneity across studies, potentially due to the variance of the data, including the use of different AI algorithms, medical professionals, and study settings. Heterogeneity was assessed by visual inspection of graphics, including the SROC curves and forest plots45,46. Additionally, we conducted bivariate meta-regression analyses using the Meqrlogit package (Stata 17) by the presence or absence of AI assistance, separately for each experience level in dermatology (dermatologists, residents, non-dermatology medical professionals), type of diagnostic task (binary classification or top diagnosis) and type of image (clinical or dermoscopic), to compare diagnostic accuracy by AI assistance and adjust for the potential heterogeneity caused by these factors47. To investigate the presence of a positive threshold effect, the Spearman correlation coefficient between sensitivity and specificity was computed48. Pre-planned sensitivity analyses were conducted by excluding potential outliers49, studies with poor methodology (where at least three domains were rated as unclear or high risk of bias), and studies with reference standards other than histopathology alone. We examined publication bias using Deeks' funnel plot asymmetry test, a regression of the diagnostic log odds ratio against the inverse square root of the effective sample size50. We calculated κ statistics to evaluate agreement between the QUADAS-2 assessors. All statistical significance was determined at p < 0.05.
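As a minimal illustration of two of these checks, the Python sketch below computes a Spearman correlation between logit-transformed sensitivity and false positive rate (to probe for a threshold effect) and a Deeks-style weighted regression of the diagnostic log odds ratio on the inverse square root of the effective sample size. It is not the analysis code used in this study, which was run in Stata, and the per-study counts are hypothetical placeholders.

```python
# Illustrative sketch only: threshold-effect check and small-study-effect (Deeks) test.
# Per-study 2x2 counts (tp, fp, fn, tn) are hypothetical placeholders.
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr

counts = np.array([[80, 30, 20, 170],
                   [45, 10, 15,  90],
                   [60, 25, 12, 140],
                   [33,  8, 11,  70],
                   [90, 40, 18, 200]], dtype=float)
tp, fp, fn, tn = counts.T
sens = tp / (tp + fn)
spec = tn / (tn + fp)

# Threshold effect: a strong positive Spearman correlation between logit(sensitivity)
# and logit(1 - specificity) across studies suggests a common threshold effect.
rho, p_rho = spearmanr(np.log(sens / (1 - sens)), np.log((1 - spec) / spec))

# Deeks' test: weighted regression of ln(diagnostic odds ratio) on 1/sqrt(effective
# sample size); a slope significantly different from zero suggests small-study effects.
log_dor = np.log((tp * tn) / (fp * fn))
n_pos, n_neg = tp + fn, fp + tn
ess = 4 * n_pos * n_neg / (n_pos + n_neg)        # effective sample size
deeks = sm.WLS(log_dor, sm.add_constant(1 / np.sqrt(ess)), weights=ess).fit()

print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2f}); Deeks slope p = {deeks.pvalues[1]:.2f}")
```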