Introduction

Skin cancer is the most common neoplasm worldwide. Early detection and diagnosis are critical for the survival of affected patients. For early-stage skin cancer detection, a complete physical examination is of paramount importance; however, visual inspection is often not sufficient, and less than one quarter of U.S. patients will have a dermatologic examination in their lifetime1. Dermoscopy is a diagnostic tool that allows for improved recognition of numerous skin lesions compared with naked-eye examination alone; however, this improvement depends on the level of training and experience of clinicians2. In recent years, advances have been made in noninvasive tools to improve skin cancer diagnostic performance, including the use of artificial intelligence (AI) for clinical and/or dermoscopic image diagnosis in dermatology.

Convolutional neural networks (CNNs) are a type of machine learning (ML) model that simulates the processing of biological neurons and represents the state of the art for pattern recognition in medical image analysis1,2. As diagnosis in dermatology relies heavily on both clinical and dermoscopic image recognition, CNNs have the potential to complement or improve diagnostic performance. Published studies have demonstrated that CNN-based AI algorithms can perform similarly to, or even outperform, specialists in skin cancer diagnosis3. This has created an ‘AI revolution’ in the field of skin cancer diagnosis. Recently, a few dermatology AI systems have been CE (Conformité Européenne) approved by the European Union and are in use in practice, making it of paramount importance to understand the data behind these algorithms4.
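For readers unfamiliar with the architecture, the sketch below is a deliberately minimal, illustrative CNN for dichotomous lesion classification; it is not the architecture of any included study, and the layer sizes, input resolution, and class count are arbitrary assumptions.

```python
# Minimal illustrative sketch (not any study's model): a small CNN that maps a
# 224x224 RGB lesion image to benign/malignant class logits. Assumes PyTorch.
import torch
import torch.nn as nn

class TinyLesionCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local colour/texture filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global average pooling
            nn.Flatten(),
            nn.Linear(32, num_classes),                   # benign vs malignant logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 4 dermoscopic images yields 4 pairs of class logits.
logits = TinyLesionCNN()(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 2])
```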

While relevant systematic reviews have been published in the past few years, the importance of this work, which combines a high-quality systematic review with a meta-analysis, is that it quantitatively addresses where we currently stand with AI for skin cancer detection. The main objective of this study was to perform a systematic review and meta-analysis to critically evaluate the evidence published to date on the performance of AI algorithms in skin cancer classification compared with clinicians.

Methods

Guidelines followed

This systematic review was based on the PRISMA guidelines. A flow diagram is presented in Fig. 1. The study has also been registered in the International Prospective Register of Systematic Reviews (PROSPERO ID: CRD42022368285).

Fig. 1: PRISMA flow diagram of included studies.

Search strategy

Three electronic databases (PubMed, Embase, and the Cochrane Library) were searched by a librarian (J.M.). Studies published up to August 2022 were included. We uploaded all titles and abstracts retrieved by the electronic searches into Rayyan and removed any duplicates. We then retrieved the full texts of studies whose titles or abstracts met the inclusion criteria for detailed inspection. Two reviewers (M.P.S. and J.S.) independently assessed the eligibility of the retrieved papers and resolved any discrepancies through discussion.

Study population—selection

The following PICO (Population, Intervention or exposure, Comparison, Outcome) elements were applied as inclusion criteria for the systematic review: (i) Population: images of patients with skin lesions, (ii) Intervention: artificial intelligence diagnosis/classification, (iii) Comparator: diagnosis/classification by clinicians, (iv) Outcome: diagnosis of skin lesions. Only primary studies comparing the performance of artificial intelligence versus dermatologists or other clinicians were included.

Studies on the diagnosis of inflammatory dermatoses, studies without extractable data, non-English publications, and animal studies were excluded.

Data extraction

For studies fulfilling the inclusion criteria, two independent reviewers extracted data using a standardized, predefined form. The following data were extracted and recorded: (i) Database, (ii) Title, (iii) Year of publication, (iv) Author, (v) Journal, (vi) Prospective vs retrospective study, (vii) Image database used for training and internal vs external dataset used for testing, (viii) Type of images included: clinical and/or dermoscopy, (ix) Histopathology confirmation of diagnosis, (x) Inclusion of clinical information, (xi) Number and expertise of participants (expert dermatologists, non-expert dermatologists, and generalists), (xii) Name and type of AI algorithm, (xiii) Included diagnoses, (xiv) Statistics on diagnostic performance (sensitivity [Sn], specificity [Sp], receiver operating characteristic [ROC] curve, area under the curve [AUC]). The main comparison conducted was the diagnostic performance of the AI algorithm versus that of clinicians. When available, the change in diagnostic performance of dermatologists with the support of the AI algorithm was recorded, as well as the change in diagnostic performance after including clinical data (data in supplementary material).
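For illustration only, one record of such a predefined extraction form could be represented as a structured object; the sketch below is hypothetical, simply mirrors the fields listed above, and is not the authors' actual form.

```python
# Hypothetical sketch of one row of the predefined data extraction form.
# Field names mirror items (i)-(xiv) above; all values shown are invented.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractedStudy:
    database: str                      # (i) source database
    title: str                         # (ii) title
    year: int                          # (iii) year of publication
    author: str                        # (iv) author
    journal: str                       # (v) journal
    prospective: bool                  # (vi) prospective vs retrospective
    test_set: str                      # (vii) "internal" or "external"
    image_type: str                    # (viii) clinical and/or dermoscopy
    histopathology_confirmed: bool     # (ix) histopathology confirmation
    metadata_included: bool            # (x) clinical information included
    participants: str                  # (xi) number and expertise of raters
    algorithm: str                     # (xii) name/type of AI algorithm
    diagnoses: list = field(default_factory=list)  # (xiii) included diagnoses
    sensitivity: Optional[float] = None            # (xiv) performance metrics
    specificity: Optional[float] = None
    auc: Optional[float] = None

record = ExtractedStudy(
    database="PubMed", title="Example study", year=2020, author="Doe et al.",
    journal="Example Journal", prospective=False, test_set="internal",
    image_type="dermoscopy", histopathology_confirmed=True, metadata_included=False,
    participants="10 expert dermatologists", algorithm="CNN (example)",
    diagnoses=["melanoma", "nevus"], sensitivity=0.85, specificity=0.75, auc=0.88,
)
```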

Risk of bias assessment

Two review authors independently assessed the quality of the included studies and the risk of bias using QUADAS-25. Based on the signalling questions, we classified each QUADAS-2 domain as having a low (0), high (1), or unknown (2) risk of bias.

Meta-analysis

Nineteen of the 53 studies were included in the meta-analysis. These studies met the following criteria: dermoscopic images only; diagnosis of skin cancer; dichotomous classification (benign/malignant, melanoma/nevus); extractable data from the original article (to calculate true positives [TP], false positives [FP], true negatives [TN], and false negatives [FN]); and distinction of the clinicians' level of expertise (expert dermatologists vs non-expert dermatologists vs generalists). For study purposes and to obtain a global estimate, we grouped all levels of clinical expertise as ‘overall clinicians’. During data processing, two additional analyses that were not pre-specified in the PROSPERO protocol were performed: clinicians vs AI algorithms in prospective vs retrospective studies and in internal vs external test (validation) sets, respectively. Internal vs external test sets were defined according to Cabitza6 and Shung et al.7. An ‘internal test set’ was defined as a non-overlapping, ‘held-out’ subset of the original patient group data that was not used for AI algorithm development and training and was used to test the AI model. An ‘external test set’ was defined as a set of new data originating from cohorts, facilities, or repositories other than those used for model development and training (e.g., a dataset originating from a different country or institution). Two investigators classified the included studies into internal vs external test sets; if both internal and external test sets were used, we classified the study as external for study purposes. We decided to perform these non-pre-specified analyses given the relevance of the results for understanding the data8.
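To make the distinction concrete, the sketch below is a purely illustrative example (hypothetical arrays, not study data, assuming NumPy and scikit-learn) of how an internal held-out test set differs from an external test set drawn from a separate cohort.

```python
# Illustrative only: "internal" = held-out split of the development cohort;
# "external" = data from a different cohort/institution, used only for evaluation.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_dev, y_dev = rng.normal(size=(1000, 64)), rng.integers(0, 2, size=1000)  # development cohort
X_ext, y_ext = rng.normal(size=(300, 64)), rng.integers(0, 2, size=300)    # different institution

# Internal test set: a non-overlapping, held-out subset of the development data,
# never used for model development or training.
X_train, X_test_internal, y_train, y_test_internal = train_test_split(
    X_dev, y_dev, test_size=0.2, random_state=0, stratify=y_dev
)

# External test set: the separate cohort, reserved entirely for evaluation.
X_test_external, y_test_external = X_ext, y_ext
```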

We extracted binary diagnostic accuracy data and constructed contingency tables to calculate Sn and Sp. We conducted a meta-analysis of studies providing 2 × 2 tables to estimate the accuracy of AI and clinicians (confirmatory approach). If an included study provided several 2 × 2 tables, we assumed these data to be independent of each other. We fitted a hierarchical summary receiver operating characteristic (HSROC) model as well as a bivariate model of the accuracy of AI and clinicians. ROC curves were constructed to simplify the plotting of graphical summaries of the fitted models. A likelihood ratio test was used to compare models, and a p-value less than 0.05 was considered statistically significant. Analyses were performed using the Stata 17.0 statistical software package (code in supplementary material).
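The pooling itself was done in Stata (code in supplementary material); purely as an illustration of the per-study inputs to the bivariate and HSROC models, the hypothetical sketch below derives Sn, Sp, and their logits from a single 2 × 2 table.

```python
# Illustrative only: per-study Sn/Sp from a hypothetical 2 x 2 table; bivariate
# models pool these quantities across studies on the logit scale.
import math

def sn_sp(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def logit(p: float) -> float:
    return math.log(p / (1 - p))

# Hypothetical study: 40 TP, 10 FN, 30 FP, 120 TN.
sn, sp = sn_sp(tp=40, fp=30, fn=10, tn=120)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")                              # Sn = 0.80, Sp = 0.80
print(f"logit(Sn) = {logit(sn):.2f}, logit(Sp) = {logit(sp):.2f}")  # 1.39 and 1.39
```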

Results

A total of 53 comparative studies (beginning with Piccolo et al. in 20029) fulfilled the inclusion criteria (Fig. 1). Most studies focused on dermoscopic images (n = 31), followed by clinical images (n = 14) and both image types (n = 8). Detailed extracted data are shown in Table 1 for dermoscopic imaging studies, Table 2 for clinical imaging studies, and Table 3 for combined clinical and dermoscopic imaging studies.

Table 1 Included studies general characteristics, dataset used, and performance evaluating dermoscopy
Table 2 Included studies general characteristics, dataset used, and performance evaluating clinical images
Table 3 Included studies general characteristics, dataset used, and performance evaluating both dermoscopic and clinical images

Regarding the risk of bias, most studies had an uncertain risk (58%), and 14 (26%) had a low risk of bias. Details of the QUADAS-2 scores for each study included in the systematic review are shown in Fig. 2.

Fig. 2: QUADAS-2 results of the assessment of risk of bias in the included studies.

The QUADAS-2 tool was used to assess the risk of bias in the included studies across 4 domains (participants, index test, reference standard, and analysis). Low risk (cyan) refers to the number of studies with a low risk of bias in the respective domain. Unclear (gray) refers to the number of studies with an unclear risk of bias in the respective domain due to a lack of information reported by the study. High risk (purple) refers to the number of studies with a high risk of bias in the respective domain. a Risk of bias assessment. b Applicability concerns.

Databases used

Only institutional or private databases were used in 20 articles (37.7%). In all, 16 articles (30.2%) used exclusively open-source data; the most commonly used databases were ‘ISIC’ and ‘HAM10000’10,11. Eighteen studies (33.9%) used a combination of institutional and public datasets. Twenty-two studies (41.5%) used only images of lesions confirmed by histopathology, while 27 (50.9%) included images diagnosed by expert consensus as the gold standard. Four studies (7.5%) did not specify a method of diagnosis confirmation. Fourteen studies (26.4%) used an external database for testing the algorithm, while 39 studies (73.6%) tested with an internal dataset (Tables 1–3).

Study type and participants included

A total of 50 studies (94.3%) were retrospective and 3 (5.7%) were prospective. Twenty-seven studies (50.9%) included only specialists, in some cases detailing the level of expertise (expert dermatologists vs non-expert dermatologists). Twenty-three studies (43.3%) included dermatologists and other non-specialist clinicians (dermatology residents and/or generalists), and 3 studies (5.7%) included only generalists.

Diagnosis included and metadata

Forty-three studies (81.1%) considered differential diagnosis between skin tumors only, while 10 (18.8%) also included inflammatory or other pathologies (multiclass algorithms). Eighteen articles (33.9%) included clinical information on the patients (metadata), mainly age, sex, and lesion location.

Artificial intelligence assistance

Of the total number of articles included in the review, 11 (20.7%) evaluated potential changes in clinicians' diagnostic performance or therapeutic decisions with AI assistance. Nine of the 11 studies showed an improvement in global diagnostic performance when using AI collaboration, 6 of which showed the greatest percentage of improvement in the generalist group.

Diagnostic performance of artificial intelligence algorithm versus clinicians, from dermoscopic images of skin lesions

Thirty-one studies evaluated diagnostic performance with dermoscopic images (Table 1). Overall, 61.2% (n = 19) of these studies showed better performance of AI compared with clinicians, 29.0% (n = 9) showed comparable performance, and in 9.7% (n = 3) specialists outperformed AI.

Dichotomous classification (‘benign’ vs ‘malignant’)

Eighteen studies (58.0%) used AI with a dichotomous classification (‘benign’ vs ‘malignant’). AI outperformed clinicians in 61.1% (n = 11)12,13,14,15,16,17,18,19,20,21,22,23, with the difference being statistically significant in 54.5% of them12,15,16,18,20,21. A total of 27.7% (n = 5) showed comparable performance between AI and clinicians9,24,25,26,27. In 11.1% (n = 2), clinicians performed better than AI28,29, 1 of them with statistical significance29. Five studies16,17,18,19,28 evaluated the collaboration between AI and clinicians (‘augmented intelligence’). All of them showed improved diagnostic accuracy when clinicians were supported by AI algorithms, with the effect being more relevant for less experienced clinicians; statistical significance was demonstrated in two16,17.

Multiclass and combined classification

Eight of the 31 studies used multiclass classification; in 4 of them, AI had a better performance30,31,32,33; in 3 studies, the diagnostic accuracy was comparable34,35,36; and in 1, clinicians outperformed AI37. Two of the 8 studies evaluated AI assistance, both showing improvement in diagnostic accuracy for human raters, with the least experienced clinicians benefiting the most32,35. Five of the 31 dermoscopy studies developed both dichotomous and multiclass algorithms, 4 of them resulting in better performance of AI over humans38,39,40,41.

Diagnostic performance of artificial intelligence algorithms versus clinicians, using clinical images

A total of 14 AI articles evaluating CNN-based classification approaches that used clinical images only were included (Table 2). Of these, 42.8% (n = 6) showed a better performance of AI algorithms, 28.6% (n = 4) obtained comparable results, and in 28.6% (n = 4) clinicians outperformed AI.

Dichotomous classification (‘benign’ vs ‘malignant’)

Six studies42,43,44,45,46 developed an AI algorithm with dichotomous outcomes; performance was comparable or superior to clinicians in 5 of them42,43,44,45. One study showed better performance for clinicians46.

Multiclass and combined classification

Five studies47,48,49,50,51 incorporated AI algorithms with multiclass classification. Zhao et al.48 and Pangti et al.49 obtained superior performance of the AI algorithms, while Chang et al.47 showed comparable performance between AI and specialists. In one study, clinicians outperformed the AI algorithm50.

Three studies52,53,54 with clinical images used both dichotomous and multiclass algorithms. Han et al.53 observed an improvement in diagnostic Sn and Sp with the assistance of the AI algorithm for both classifications, with the improvement being statistically significant for less experienced clinicians.

Diagnostic performance of artificial intelligence algorithms versus clinicians, from both clinical and dermoscopic images

Eight studies included clinical and dermoscopic images as part of their analysis21,55,56,57,58,59,60,61. Overall, 75% (n = 6) resulted in comparable performance, and 25% (n = 2) showed better performance for AI algorithms in comparison to clinicians. Only 1 study obtained statistical significance57.

Dichotomous classification

Six studies applied a dichotomous classification; Haenssle et al.57 was the only one in which the AI algorithm outperformed clinicians, despite the incorporation of metadata. The five remaining studies showed comparable performance between AI and clinicians.

Multiclass and combined classification

Huang et al.61 classified lesions into 6 categories, with AI being superior to specialists in average accuracy. Finally, Esteva et al.55 used two multiclass classifications, showing comparable performance between AI and clinicians in both.

Meta-analysis

A total of 19 studies were included in the meta-analysis. Table 4 shows the summary estimates calculated to compare performance between AI and clinicians with different levels of experience.

Table 4 Meta-analysis results, summary estimates of sensitivity, specificity, and likelihood ratio according to subgroups

Only 1 prospective study met the inclusion criteria and was included in the meta-analysis.

AI vs overall clinicians’ meta-analysis

When analyzing the whole group of clinicians, without accounting for expertise level, AI obtained a Sn of 87.0% (95% CI 81.7–90.9%) and a Sp of 77.1% (95% CI 69.8–83.0%), while overall clinicians obtained a Sn of 79.8% (95% CI 73.2–85.1%) and a Sp of 73.6% (95% CI 66.5–79.6%), with a statistically significant difference for both Sn and Sp according to the likelihood ratio test (p < 0.001 for both). The forest plot is available in Fig. 3a, b. The ROC curve shapes confirmed these differences (Fig. 4). Supplementary Fig. 1a, b shows the subanalysis adjusted for retrospective vs prospective design.
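As a worked illustration (our own back-of-the-envelope calculation, not a value quoted from the study tables), diagnostic likelihood ratios follow directly from pooled Sn and Sp: using the AI estimates above, LR+ = Sn / (1 − Sp) ≈ 0.870 / 0.229 ≈ 3.8 and LR− = (1 − Sn) / Sp ≈ 0.130 / 0.771 ≈ 0.17.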

Fig. 3: Forest plot detailing the sensitivity and specificity for all groups of clinicians (‘overall’) and artificial intelligence algorithms from each study included in the meta-analysis according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and all clinicians (‘overall’) (right). b Specificity for artificial intelligence (left) and all clinicians (‘overall’) (right).

Fig. 4: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and all group of clinicians (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

AI vs generalists clinicians’ meta-analysis

When analyzing AI performance vs generalists, AI obtained a Sn of 92.5% (95% CI 88.9–94.9%) and a Sp of 66.5% (95% CI 56.7–75.0%), while generalists obtained a Sn of 64.6% (95% CI 47.1–78.9%) and a Sp of 72.8% (95% CI 56.7–84.5%), with the difference being statistically significant for both Sn and Sp according to the likelihood ratio tests (p < 0.001 for both). The ROC curve shapes confirmed these differences, with higher heterogeneity and wider confidence intervals for generalists (Fig. 5). A subgroup analysis comparing internal vs external test sets was not possible, given that all included studies in this subgroup used an internal test set (Fig. 6a, b).

Fig. 5: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and generalists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 6: Forest plots of studies showing artificial intelligence vs generalists sensitivity and specificity.

a Sensitivity for artificial intelligence (left) and for generalists (right). b Specificity for artificial intelligence (left) and for generalists (right).

AI vs non-expert dermatologists’ meta-analysis

AI obtained a Sn of 85.4% (95% CI 78.9–90.2%) and a Sp of 78.5% (95% CI 70.6–84.8%), while non-expert dermatologists obtained a Sn of 76.4% (95% CI 71.1–80.9%) and a Sp of 67.1% (95% CI 57.2–75.6%), with a statistically significant difference in both Sn and Sp (p < 0.001 for both). The ROC curve shapes confirmed these results (Fig. 7). The forest plot is available in Fig. 8a, b. In the internal vs external test set subgroup analysis (Fig. 8a, b), AI achieved better Sn with the external test set and greater Sp with an internal test set. For non-expert dermatologists, no changes in Sn were observed; however, they achieved better Sp with the external test set. In the prospective vs retrospective subgroup analysis (Supplementary Fig. 2), only 1 prospective study met the inclusion criteria and was included in the meta-analysis; a trend towards better Sn in retrospective versus prospective studies was observed.

Fig. 7: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and non-expert dermatologists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 8: Forest plots of studies showing artificial intelligence vs non-expert dermatologists sensitivity and specificity according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and for non-expert dermatologists (right). b Specificity for artificial intelligence (left) and for non-expert dermatologists (right).

AI vs expert dermatologists’ meta-analysis

AI obtained a Sn of 86.3% (95% CI 80.4–90.7%) and a Sp of 78.4% (95% CI 71.1–84.3%), while expert dermatologists obtained a Sn of 84.2% (95% CI 76.2–89.8%) and a Sp of 74.4% (95% CI 65.3–81.8%); this difference was statistically significant for both Sn and Sp according to the likelihood ratio test (p < 0.001 for both). The ROC curve shapes were comparable for AI and expert dermatologists, with narrow confidence intervals (Fig. 9). The subgroup analysis by internal vs external test set showed that AI had better Sn with the external test set, while Sp was better with the internal test set. For expert dermatologists, there was no difference in Sn, while Sp was better with the external test set (Fig. 10a, b). The subgroup analysis by study design, retrospective vs prospective (Supplementary Fig. 3), included only one prospective study.

Fig. 9: Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and expert dermatologists (right).

ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 10: Forest plots of studies showing artificial intelligence vs expert dermatologists sensitivity and specificity according to type of test set (external vs internal).

a Sensitivity for artificial intelligence (left) and expert dermatologists (right). b Specificity for artificial intelligence (left) and for expert dermatologists (right).

Discussion

In the present study, the meta-analysis of included studies found an overall Sn of 87% and Sp of 77% for AI algorithms, and an overall Sn of 79% and Sp of 73% for all clinicians (‘overall clinicians’). Differences between AI and all clinicians were statistically significant. Performance of AI algorithms and specialists was comparable. The difference between AI performance (Sn 92%, Sp 66%) and the generalist subgroup (Sn 64%, Sp 72%) was more marked than the difference between AI and expert dermatologists. In studies that evaluated AI assistance (‘augmented intelligence’), the overall diagnostic performance of clinicians improved significantly when using AI algorithms62,63,64. This improvement was more important for clinicians with less experience. This is in line with the results of this meta-analysis, in which the difference was greater for generalists than for expert dermatologists, and opens an opportunity for AI assistance among less-experienced clinicians. To the best of our knowledge, this is the first systematic review and meta-analysis on the diagnostic accuracy of health-care professionals versus AI algorithms using dermoscopic or clinical images of cutaneous neoplasms. The inclusion of a meta-analysis is key to better understanding, quantitatively, the current state of the art of AI algorithms for the automated diagnosis of skin cancer.

In general, the included studies presented diverse methodologies and significant heterogeneity regarding the types of images included, the classifications used, the characteristics of the participants, and the methodology for presenting results. This heterogeneity should be taken into account when generalizing and meta-analyzing the findings and when interpreting the results of this study. Research in AI and its potential applications in clinical practice has increased exponentially during the last few years in different areas of medicine, not only in dermatology65. Other systematic reviews have also reported that, in experimental settings, most algorithms are able to achieve at least comparable results when compared with clinicians; however, they also describe limitations similar to those described here66,67,68,69. Only a few studies have evaluated the role of AI algorithms in real clinical scenarios in dermatology. Our study confirms that only 5.7% of the included studies were prospective, and only one of the prospective studies was suitable for meta-analysis62,63. This contrasts with recent data in other medical areas showing an increase in the clinical use of AI70 and highlights the relevance of understanding the role of AI in skin cancer and dermatology. However, prospective studies pose a real challenge for AI algorithms on the path to daily clinical practice, as the algorithms must then face specific tests such as ‘out-of-distribution’ images or cases.

Based on the results of this systematic review and meta-analysis, several challenges become evident when applying AI in clinical practice. First, databases are essential when training an AI algorithm. Small databases, the inclusion of only specific populations, or limited variation in skin phototypes limit the extrapolation of results71,72,73. The lack and underrepresentation of certain ethnic groups and skin types in current datasets has been mentioned as a potential source of perpetuating healthcare disparities73. Based on the results of our systematic review, we can confirm that, in at least half of the studies, algorithms were trained using the same datasets over and over, which translates into a lack of representation of specific groups. The diversity of techniques and camera types (e.g., professional cameras vs smartphones) used to capture images, image quality, and possible artifacts such as pencil marks, rulers, or other objects are also variables that must be considered when evaluating the performance of AI algorithms71,72,74. A second limitation is the lack of inclusion of metadata in the AI algorithms. In the real world, we manage additional layers of patient information, including demographic data, personal and family history, habits, evolution of the disease, and a complete physical examination including palpation and side illumination, not only 2-D visual examination. These elements are important to render a correct differential diagnosis and to guide clinical decision-making, and so far very few AI models incorporate them. Therefore, real-world diagnosis differs from static 2-D image evaluation. Regarding the design of human evaluation in experimental and retrospective studies, in most cases it aims to determine whether a lesion is benign or malignant, or to provide a specific diagnosis. This differs from clinical practice in a real-life setting, in which decisions are generally behavioral (whether to follow up, biopsy, or remove a lesion) rather than exclusively providing a specific diagnosis based on the clinical evaluation. The scarcity of prospective studies that account for this real-world clinical evaluation restricts the generalization of these positive AI results, which are mainly based on retrospective studies. In addition, the management of patient information and privacy, and the legal and regulatory aspects of applying AI-based software in clinical practice, also represent an emerging challenge75.

The current evidence gathered in this article supports collaboration between AI and clinicians (‘augmented intelligence’), especially for non-expert physicians. In the future, AI algorithms are likely to become a relevant tool to improve the evaluation of skin lesions by generalists in primary care centers, or by clinicians with less access to specialists63. AI algorithms could also allow for prioritization of referral or triage, improving early diagnosis. There are currently ongoing studies evaluating the application of AI algorithms in real clinical settings, which will demonstrate the applicability of these results in clinical practice. The first prospective randomized controlled trial, by Han et al.62, showed that diagnostic accuracy improved when a group of clinicians used AI assistance, with the greatest improvement among generalists. The results of this recent randomized clinical trial partially confirm the potentially positive role of AI in dermatology and that the benefit is more pronounced for generalists, aligning with the findings of the present meta-analysis.

With the aim of reducing the current barriers, we propose generating and applying guidelines that standardize the methodology of AI studies. One proposal is the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology, published by Daneshjou et al.76. These guidelines should cover the complete workflow, from the moment images are captured to protocols on databases, participant experience, statistical reporting, and definitions of how accuracy is measured, among many other aspects. This would allow different studies to be compared and better-quality evidence to be generated. For example, Esteva et al.52 defined ‘overall accuracy’ as the average of individual inference class accuracies, which might differ from other definitions. In addition, it is mandatory to collaborate with international collaborative databases (e.g., ISIC, available at www.isic-archive.com) to provide accessible public benchmarks, ensure repeatability, and include a diverse range of skin types and ethnicities to avoid underrepresentation of certain groups. These strategies would make current datasets more diverse and generalizable.

The main strengths of the present study were the extensive and systematic search of 3 databases, encompassing studies from the early days of AI up to the most recently published studies; the strict criteria applied for the evaluation of studies and extraction of data, following the available guidelines for systematic reviews; and the performance of a meta-analysis, which allows for a quantitative assessment of the current AI data.

Limitations include the possibility of having missed articles available in databases other than those searched, or published in other languages, thus constituting a potential selection bias. Also, AI is a rapidly evolving field, and new relevant articles might have emerged while the data were being analyzed; to the best of our knowledge, no landmark studies were published in the meantime. Publication bias cannot be ruled out, since articles with statistically significant results are more likely to be published. In addition, as shown in our results, more than half of the studies (64.1%) utilized the same public databases (e.g., ISIC and HAM10000), generating a possible overlap of images between the training and testing groups. Furthermore, most studies used the same dataset for training and testing the algorithm (73.6% used an internal test set), which might further bias the results. As observed in the subgroup analysis of the present study, estimated Sn and Sp differed for both AI and clinicians depending on whether an internal or external test set was used; however, these were post-hoc analyses and should be interpreted with caution. An external test set is key for the proper evaluation of AI algorithms6, to ‘validate’ that an algorithm retains its performance when presented with data from other datasets. Limited details regarding the human readers' assessments were available, which could also affect the results. We also grouped all skin cancers together for analysis, although accuracy varies across skin cancer types (e.g., melanoma vs basal cell carcinoma vs squamous cell carcinoma) for both humans and AI algorithms. The application of QUADAS-2 introduces a potential information bias, as it is an operator-dependent tool that generates subjective, qualitative results. Regarding the meta-analysis, we faced two main limitations. Firstly, the heterogeneity between studies makes it difficult to interpret or generalize the results obtained. Secondly, due to the lack of necessary data, the number of studies included in the meta-analysis was smaller than the number included in the systematic review. Finally, a minimal number of prospective studies was included in the systematic review and only one entered the meta-analysis; therefore, those results must be interpreted with caution. Nevertheless, in this post-hoc analysis, prospective studies showed worse performance of AI algorithms compared with clinicians, confirming the relevance of the complete physical examination and other clinical variables such as history and palpation. This also highlights the lack of published real-world data, given that most studies were retrospective reader studies.

Conclusion

This systematic review and meta-analysis demonstrated that the diagnostic performance of AI algorithms was better than that of generalists and non-expert dermatologists and, although the differences were statistically significant, was comparable to that of expert dermatologists in clinical practice, as the differences were minimal. As most studies were performed in experimental settings, future studies should focus on prospective, real-world settings and on AI assistance. Our study suggests that it is time to move forward to real-world studies and randomized clinical trials to accelerate progress for the benefit of our patients. The only randomized study available has shown better diagnostic accuracy when AI algorithms are used as ‘augmented intelligence’62. We envision a fruitful collaboration between AI and humans, leveraging the strengths of both to enhance diagnostic capabilities and patient care.