A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis

Scientific research on artificial intelligence (AI) in dermatology has increased exponentially. The objective of this study was to perform a systematic review and meta-analysis to evaluate the performance of AI algorithms for skin cancer classification in comparison to clinicians with different levels of expertise. Based on PRISMA guidelines, 3 electronic databases (PubMed, Embase, and Cochrane Library) were screened for relevant articles up to August 2022. The quality of the studies was assessed using QUADAS-2. A meta-analysis of sensitivity and specificity was performed for the accuracy of AI and clinicians. Fifty-three studies were included in the systematic review, and 19 met the inclusion criteria for the meta-analysis. Considering all studies and all subgroups of clinicians, we found a sensitivity (Sn) of 87.0% and specificity (Sp) of 77.1% for AI algorithms, and a Sn of 79.8% and Sp of 73.6% for all clinicians (overall); differences were statistically significant for both Sn and Sp. The difference between AI performance (Sn 92.5%, Sp 66.5%) and generalists (Sn 64.6%, Sp 72.8%) was greater than the difference between AI and expert clinicians. Performance of AI algorithms (Sn 86.3%, Sp 78.4%) and expert dermatologists (Sn 84.2%, Sp 74.4%) was clinically comparable. Limitations of AI algorithms in clinical practice should be considered, and future studies should focus on real-world settings and on AI assistance.


Study population selection
The following PICO (Population, Intervention or exposure, Comparison, Outcome) elements were applied as inclusion criteria for the systematic review: (i) Population: images of patients with skin lesions, (ii) Intervention: artificial intelligence diagnosis/classification, (iii) Comparator: diagnosis/classification by clinicians, (iv) Outcome: diagnosis of skin lesions. Only primary studies comparing the performance of artificial intelligence versus dermatologists or clinicians were included.
Studies on the diagnosis of inflammatory dermatoses, studies without extractable data, non-English publications, and animal studies were excluded.

Data extraction
For studies fulfilling the inclusion criteria, two independent reviewers extracted data in a standardized and predefined form. The following data were extracted and recorded: (i) Database, (ii) Title, (iii) Year of publication, (iv) Author, (v) Journal, (vi) Prospective vs retrospective study, (vii) Image database used for training and internal vs external dataset for testing, (viii) Type of images included (clinical and/or dermoscopic), (ix) Histopathology confirmation of diagnosis, (x) Inclusion of clinical information, (xi) Number and expertise of participants (expert dermatologists, non-expert dermatologists, and generalists), (xii) Name and type of AI algorithm, (xiii) Included diagnoses, and (xiv) Statistics on diagnostic performance (sensitivity [Sn], specificity [Sp], receiver operating characteristic [ROC] curve, area under the curve [AUC]). The main comparison conducted was the diagnostic performance of the AI algorithm against clinician diagnostic performance. When available, the change in diagnostic performance of dermatologists with the support of the AI algorithm was included, as well as the change in diagnostic performance after including clinical data (data in supplementary material).

Risk of bias assessment
Two review authors independently assessed the quality of the included studies and the risk of bias using QUADAS-2 5. Based on the questions, we classified each QUADAS-2 domain as low (0), high (1), or unclear (2) risk of bias.
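As an illustration of this coding scheme, the snippet below tallies hypothetical QUADAS-2 ratings for a single study; the 0/1/2 codes follow the scheme above and the domain names follow Fig. 2 (the ratings themselves are invented for the example).

```python
# Minimal sketch of the QUADAS-2 coding described above:
# 0 = low, 1 = high, 2 = unclear risk of bias.
from collections import Counter

RISK_LABELS = {0: "low", 1: "high", 2: "unclear"}

# Hypothetical ratings for one study across the four domains used in Fig. 2.
study_ratings = {
    "participants": 0,
    "index test": 2,
    "reference standard": 0,
    "analysis": 1,
}

# Tally how many domains fall into each risk category for this study.
tally = Counter(RISK_LABELS[code] for code in study_ratings.values())
print(tally)  # Counter({'low': 2, 'unclear': 1, 'high': 1})
```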

Meta-analysis
Nineteen out of 53 studies were included in the meta-analysis. These studies met the following criteria: dermoscopic images only; diagnosis of skin cancer; dichotomous classification (benign/malignant, melanoma/nevus); extractable data from the original article (to calculate true positives [TP], false positives [FP], true negatives [TN], and false negatives [FN]); and distinction in level of expertise of clinicians (expert dermatologists vs non-expert dermatologists vs generalists). For study purposes and to obtain a global estimate, we grouped all levels of clinical expertise as 'overall clinicians'. During data processing, two extra analyses that were not prespecified in the PROSPERO protocol were performed: clinicians vs AI algorithms in prospective vs retrospective studies, and in internal vs external test (validation) sets. Internal vs external test sets were defined according to Cabitza 6 and Shung et al. 7. 'Internal test set' was defined as a non-overlapping, 'held-out' subset of the original patient group data that was not used for AI algorithm development and training, used to test the AI model. 'External test set' was defined as a set of new data originating from different cohorts, facilities, or repositories other than the data used for model development and training (e.g., a dataset originating in a different country or institution). Two investigators classified the included studies into internal vs external test sets. If both internal and external test sets were used, we classified them as external for study purposes. We decided to perform these non-prespecified analyses given the relevance of the results for understanding the data 8.
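The classification rule for test sets can be summarized in a few lines. The sketch below is only an illustration of the rule stated above; the function and field names are ours, not part of the study's extraction form.

```python
# Minimal sketch of the test-set classification rule described above
# (names are illustrative, not from the study's extraction form).
def classify_test_set(uses_internal: bool, uses_external: bool) -> str:
    """Label a study's test set; studies using both count as 'external'."""
    if uses_external:       # any external data -> 'external' for study purposes
        return "external"
    if uses_internal:
        return "internal"
    return "unclear"

assert classify_test_set(uses_internal=True, uses_external=True) == "external"
assert classify_test_set(uses_internal=True, uses_external=False) == "internal"
```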
We extracted binary diagnostic accuracy data and constructed contingency tables to calculate Sn and Sp. We conducted a meta-analysis of studies providing 2 × 2 tables to estimate the accuracy of AI and clinicians (confirmatory approach). If an included study provided various 2 × 2 tables, we assumed these data to be independent from each other. We fitted a hierarchical summary receiver operating characteristic (HSROC) model as well as a bivariate model of the accuracy of AI and clinicians. ROC curves were constructed to simplify the plotting of graphical summaries of fitted models. A likelihood ratio test was used to compare models. A p-value less than 0.05 was considered statistically significant. Analyses were performed using the Stata 17.0 statistics software package (codes in supplementary material).
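As a minimal illustration of this step (not the authors' Stata code), the snippet below derives Sn and Sp from a single 2 × 2 table, together with the logit transforms and approximate within-study variances that serve as inputs to a bivariate random-effects model; the cell counts are hypothetical.

```python
# Minimal sketch: per-study sensitivity and specificity from a 2 x 2 table,
# with the logit transforms and within-study variances that feed a
# bivariate (Reitsma-type) random-effects meta-analysis model.
import math

def study_accuracy(tp, fp, tn, fn):
    """Return Sn, Sp, their logits, and within-study logit variances.

    Note: a 0.5 continuity correction is commonly added to zero cells
    before this computation; it is omitted here for brevity.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    logit_sn = math.log(sn / (1 - sn))
    logit_sp = math.log(sp / (1 - sp))
    # Approximate within-study variances of the logits (delta method).
    var_logit_sn = 1 / tp + 1 / fn
    var_logit_sp = 1 / tn + 1 / fp
    return sn, sp, logit_sn, logit_sp, var_logit_sn, var_logit_sp

# Hypothetical 2 x 2 table: 85 TP, 20 FP, 70 TN, 15 FN.
print(study_accuracy(tp=85, fp=20, tn=70, fn=15))  # Sn = 0.85, Sp ~ 0.78
```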

Results
A total of 53 comparative studies (since Piccolo et al. in 2002 9) fulfilled the inclusion criteria (Fig. 1). Most of the studies focused on dermoscopic images (n = 31), followed by clinical images (n = 14), or both (n = 8). Detailed extracted data are shown in Table 1 for dermoscopic imaging studies, Table 2 for clinical imaging studies, and Table 3 for clinical and dermoscopic imaging studies.
Regarding the risk of bias, most of the studies had an unclear risk (58%), and 14 (26%) had a low risk of bias. Details of the QUADAS-2 scores for each study included in the systematic review are shown in Fig. 2.

Databases used
Only institutional or private databases were used in 20 articles (37.7%). In all, 16 articles (30.2%) used exclusively open-source data; the most commonly used databases were 'ISIC' and 'HAM10000' 10,11. Eighteen studies (33.9%) used a combination of institutional and public datasets. Twenty-two studies (41.5%) used only images of lesions confirmed with histopathology, while 27 (50.9%) included images diagnosed by expert consensus as the gold standard. Four studies (7.5%) did not specify a method of diagnosis confirmation. Fourteen studies (26.4%) used an external database for testing the algorithm, while 39 studies (73.6%) tested with an internal dataset (Tables 1-3).

Study type and participants included
A total of 50 studies (94.3%) were retrospective and 3 (5.7%) were prospective. Twenty-seven studies (50.9%) included only specialists, in some cases detailing the level of expertise (expert dermatologists vs non-expert dermatologists). Twenty-three studies (43.3%) included dermatologists and other non-specialist clinicians (dermatology residents and/or generalists), and 3 studies (5.7%) included only generalists.

Diagnosis included and metadata
Forty-three studies (81.1%) considered differential diagnosis between skin tumors only, while 10 (18.8%) also included inflammatory diagnoses or other pathologies (multiclass algorithms). Eighteen articles (33.9%) included clinical information on the patients (metadata), mainly age, sex, and lesion location.

Artificial intelligence assistance
Of the total number of articles included in the review, 11 (20.7%) evaluated potential changes in diagnostic performance or therapeutic decisions of clinicians with AI assistance. Nine of the 11 studies showed an improvement in global diagnostic performance when using AI collaboration, 6 of which showed a higher percentage of improvement in the generalists group.

Diagnostic performance of artificial intelligence algorithm versus clinicians, from dermoscopic images of skin lesions
Thirty-one studies evaluated diagnostic performance with dermoscopic images (Table 1). In general, 61.2% (n = 19) of the studies showed a better performance of AI when compared to clinicians. A total of 29.0% (n = 9) resulted in a comparable performance, and in 9.7% (n = 3) specialists outperformed AI.

Multiclass and combined classification
Eight of the 31 studies used multiclass classification; in 4 of them, AI had a better performance 30-33; in 3 studies the diagnostic accuracy was comparable 34-36; and in 1, clinicians outperformed AI 37. Two of the 8 studies evaluated AI assistance, both showing improvement in diagnostic accuracy for human raters, with the least experienced clinicians benefiting the most 32,35. Five of the 31 dermoscopy studies developed both dichotomous and multiclass algorithms, 4 of them resulting in a better performance of AI over humans 38-41.

Diagnostic performance of artificial intelligence algorithm versus clinicians, from clinical images of skin lesions

Multiclass and combined classification
Five studies 47-51 incorporated AI algorithms with multiclass classification. Zhao et al. 48 and Pangti et al. 49 obtained superior performance of the AI algorithms, while Chang et al. 47 showed comparable performance between AI and specialists. In one study, clinicians outperformed the AI algorithm 50. Three studies 52-54 with clinical images used both dichotomous and multiclass algorithms. Han et al. 53 observed an improvement in diagnostic Sn and Sp with the assistance of the AI algorithm for both classifications, which was statistically significant for less experienced clinicians.

Diagnostic performance of artificial intelligence algorithm versus clinicians, from combined clinical and dermoscopic images

Dichotomous classification
Six studies applied dichotomous classification, with Haenssle et al. 57 being the only one to obtain a better performance for the AI algorithm over clinicians despite the incorporation of metadata. The five remaining studies showed a comparable performance between AI and clinicians.

Multiclass and combined classification
Huang et al. 61 classified lesions into 6 categories, with AI being superior to specialists in average accuracy. Finally, Esteva et al. 55 used two multiclass classifications, showing comparable performance between AI and clinicians in both.

Meta-analysis
A total of 19 studies were included in the meta-analysis. Table 4 shows the summary estimates calculated to compare performance between AI and clinicians with different levels of experience.
Only 1 prospective study met the inclusion criteria and was included in the meta-analysis.

AI vs overall clinicians' meta-analysis
When analyzing the whole group of clinicians, not accounting for expertise level, AI obtained a Sn of 87.0% (95% CI 81.7-90.9%) and Sp of 77.1% (95% CI 69.8-83.0%), and overall clinicians obtained a Sn of 79.8% (95% CI 73.2-85.1%) and Sp of 73.6% (95% CI 66.5-79.6%), with a statistically significant difference for both Sn and Sp according to the likelihood ratio test (p < 0.001 for both). The Forest plot is available in Fig. 3a, b. The ROC curve shapes confirmed the prior differences (Fig. 4). Supplementary Fig. 1a, b shows the subanalysis adjusted for retrospective vs prospective design.

AI vs generalist clinicians' meta-analysis
When analyzing AI performance vs generalists, AI obtained a Sn of 92.5% (95% CI 88.9-94.9%) and Sp of 66.5% (95% CI 56.7-75.0%), and generalists a Sn of 64.6% (95% CI 47.1-78.9%) and Sp of 72.8% (95% CI 56.7-84.5%), the difference being statistically significant for both Sn and Sp according to the likelihood ratio tests (p < 0.001 for both). The ROC curve shapes confirmed the prior differences, with higher heterogeneity and a wider confidence interval for generalists (Fig. 5). Subgroup analysis comparing internal vs external test sets was not possible, given that all included studies in this subgroup used an internal test set (Fig. 6a, b).
AI vs non-expert dermatologists' meta-analysis
AI obtained a Sn of 85.4% (95% CI 78.9-90.2%) and Sp of 78.5% (95% CI 70.6-84.8%), while non-expert dermatologists obtained a Sn of 76.4% (95% CI ), with a statistically significant difference in both Sn and Sp (p < 0.001 for both). The ROC curve shapes confirmed these results (Fig. 7). The Forest plot is available in Fig. 8a, b. In the internal vs external test set subgroup analysis (Fig. 8a, b), AI achieved better Sn in the external test set, and greater Sp with an internal test set. For non-expert dermatologists, no changes in Sn were observed; however, they achieved better Sp in the external test set. In the prospective vs retrospective subgroup analysis (Supplementary Fig. 2), only 1 prospective study met the inclusion criteria and was included in the meta-analysis. A trend towards better Sn in retrospective versus prospective studies was observed.

AI vs expert dermatologists' meta-analysis
AI obtained a Sn of 86.3% and Sp of 78.4%, while expert dermatologists obtained a Sn of 84.2% and Sp of 74.4% (Table 4). The ROC curve shapes were comparable for both AI and expert dermatologists, with narrow confidence intervals (Fig. 9). The subgroup analysis by internal vs external test set showed that AI had better Sn in the external test set, while Sp was better with an internal test set. For expert dermatologists there was no difference in Sn; Sp was better in the external test set (Fig. 10a, b). The subgroup analysis regarding study design, retrospective vs prospective (Supplementary Fig. 3), found only one study.

Discussion
In the present study, we found an overall Sn and Sp of 87% and 77% for AI algorithms and an overall Sn of 79% and Sp of 73% for all clinicians ('overall clinicians') when performing a meta-analysis of the included studies. Differences between AI and all clinicians were statistically significant. Performance of AI algorithms vs specialists was comparable between both groups. The difference between AI performance (Sn 92%, Sp 66%) and the generalists subgroup (Sn 64%, Sp 72%) was more marked than the difference between AI and expert dermatologists. In studies that evaluated AI assistance ('augmented intelligence'), the overall diagnostic performance of clinicians was found to improve significantly when using AI algorithms 62-64. This improvement was more important for those clinicians with less experience. This is in line with this meta-analysis' results, where the difference was greater for generalists than for expert dermatologists, and opens an opportunity for AI assistance in the group of less-experienced clinicians.

To the best of our knowledge, this is the first systematic review and meta-analysis on the diagnostic accuracy of health-care professionals versus AI algorithms using dermoscopic or clinical images of cutaneous neoplasms. The inclusion of a meta-analysis is key to better understanding, quantitatively, the current state of the art of AI algorithms for the automated diagnosis of skin cancer. In general, the included studies presented diverse methodologies and significant heterogeneity regarding the type of images included, the different classifications, the characteristics of the participants, and the methodology for presenting the results. This is important to consider when analyzing, generalizing, and meta-analyzing the obtained findings and should be taken into consideration when interpreting this study's results.

Research in AI and its potential applications in clinical practice has increased exponentially during the last few years in different areas of medicine, not only in dermatology 65. Other systematic reviews have also reported that, in experimental settings, most algorithms are able to achieve at least comparable results when compared with clinicians; however, they also describe limitations similar to those described here 66-69. Only a few studies have evaluated the role of AI algorithms in real clinical scenarios in dermatology. Our study confirms that only 5.7% of studies were prospective and only one of the prospective studies was suitable for meta-analysis 62,63. This contrasts with recent data in other medical areas showing an increase in the clinical use of AI 70 and highlights the relevance of understanding the role of AI in skin cancer and dermatology. However, prospective studies pose a real challenge for AI algorithms to become part of daily clinical practice, as they face specific tests such as 'out-of-distribution' images or cases.

Based on this systematic review and meta-analysis, several challenges have become evident when applying AI in clinical practice. First, databases are essential when training an AI algorithm. Small databases, inclusion of only specific populations, or limited variation in skin phototypes limit the extrapolation of results 71-73. The lack and underrepresentation of certain ethnic groups and skin types in current datasets has been mentioned as a potential source of perpetuating healthcare disparities 73. Based on the results of our systematic review, we can confirm that the same datasets were used to train the algorithms over and over in at least half of the studies. This translates into a lack of representation of specific groups. The diversity of techniques and camera types (e.g., professional cameras vs smartphones) used to capture images and their quality, and possible artifacts such as pencil marks, rulers, or other objects, are variables that must also be considered when evaluating the performance of AI algorithms 71,72,74.

A second limitation is the lack of inclusion of metadata in the AI algorithms. In the real world, we manage additional layers of information from patients, including demographic data, personal and family history, habits, evolution of the disease, and a complete physical examination, including palpation, side illumination, and not only 2-D visual examination. These elements are important to render a correct differential diagnosis and to guide clinical decision-making, and so far, very few AI models incorporate them. Therefore, real-world diagnosis is different from static 2-D image evaluations. Regarding the design of human evaluation in experimental and retrospective studies, in most cases it aims to determine whether a lesion is benign or malignant, or to provide a specific diagnosis. This differs from clinical practice in a real-life setting, in which decisions are generally behavioral, whether following up, taking a biopsy, or removing a lesion, beyond exclusively providing a specific diagnosis based on the clinical evaluation. The scarcity of prospective studies accounting for this real-world clinical evaluation restricts the generalization of the positive results of AI, which are mainly based on retrospective studies. In addition, the management of patient information and privacy, and the legal aspects of regulation regarding the application of AI-based software in clinical practice, also represent an emerging challenge 75.
The current evidence gathered from this article supports collaboration between AI and clinicians ('augmented intelligence'), especially for non-expert physicians. In the future, AI algorithms are likely to become a relevant tool to improve the evaluation of skin lesions by generalists in primary care centers, or by clinicians with less access to specialists 63. AI algorithms could also allow for prioritization of referral or triage, improving early diagnosis. Currently, there are ongoing studies evaluating the application of AI algorithms in real clinical settings, which will demonstrate the applicability of these results in clinical practice. The first prospective randomized controlled trial, by Han et al. 62, showed that when a group of clinicians used AI assistance, diagnostic accuracy improved. This improvement was greater for generalists. The results of this recent randomized clinical trial partially confirm the potentially positive role of AI in dermatology. They also confirm that the benefit is more pronounced for generalists, aligning with the findings of the present meta-analysis.
With the aim of reducing the current barriers, we propose to generate and apply guidelines that standardize the methodology for AI studies. One proposal is the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology, published by Daneshjou et al. 76. These guidelines should cover the complete workflow, starting from the moment images are captured, through protocols on databases, experience of participants, statistical data, and the definition of how to measure accuracy, among many others. This will allow us to compare different studies and generate better-quality evidence. For example, Esteva et al. 52 defined 'overall accuracy' as the average of individual inference class accuracies, which might differ from other definitions. In addition, it is mandatory to collaborate with international collaborative databases (e.g., ISIC, available at www.isicarchive.com) to provide accessible public benchmarks, ensure repeatability, and include a diverse group of skin types and ethnicities to avoid underrepresentation of certain groups. These strategies would make current datasets more diverse and generalizable.
The main strengths of the present study were the extensive and systematic search in 3 databases, encompassing studies from the early days of AI up to the most recently published studies; the strict criteria applied for the evaluation of studies and extraction of data, following the available guidelines for systematic reviews; and the performance of a meta-analysis, which allows for a quantitative assessment of the current AI data.
Limitations include the possibility of not having incorporated articles available in databases other than the ones included, or in other languages, thus constituting a potential selection bias. Also, AI is a rapidly evolving field, and new relevant articles might have emerged while analyzing the data; to the best of our knowledge, no landmark studies were published in the meantime. Publication bias cannot be ruled out, since articles with statistically significant results are more likely to be published. Also, as shown in our results, more than half of the studies (64.1%) utilized the same public databases (e.g., ISIC and HAM10000), generating a possible overlap of the images in the training and testing groups. Furthermore, most studies used the same dataset for training and testing the algorithm (73.6% used an internal test set), which might further bias the results. As observed in the subgroup analysis of the present study, there were differences in estimated Sn and Sp for both AI and clinicians depending on whether an internal vs external test set was used. However, these were post-hoc analyses and should be interpreted with caution. An external test set is key for the proper evaluation of AI algorithms 6, to 'validate' that the algorithm will retain its performance when presented with data from other datasets. Limited details regarding the human readers' assessments were available, which could also affect the results. We also grouped all skin cancers as one group for analysis; however, variations in accuracy exist for different skin cancers (e.g., melanoma vs basal cell carcinoma vs squamous cell carcinoma), both for humans and for AI algorithms. The application of QUADAS-2 carries a potential information bias, as it is an operator-dependent tool that generates subjectivity and qualitative results.

Regarding the meta-analysis, we faced two main limitations. Firstly, the heterogeneity between studies makes it difficult to interpret or generalize the results obtained. Secondly, due to the lack of necessary data, the number of studies included in the meta-analysis was reduced when compared to the studies included in the systematic review. Finally, there was a minimal number of prospective studies included in the systematic review and only one was included in the meta-analysis; therefore, those results must be interpreted with caution. Nevertheless, in this post-hoc analysis, prospective studies showed worse performance of AI algorithms compared to clinicians, confirming the relevance of the complete physical examination and of other clinical variables such as history, palpation, etc. This also shows a lack of published real-world data, given that most studies were retrospective reader studies.

Conclusion
This systematic review and meta-analysis demonstrated that the diagnostic performance of AI algorithms was better than that of generalists and non-expert dermatologists; although differences with expert dermatologists were statistically significant, they were minimal, and performance was clinically comparable. As most studies were performed in experimental settings, future studies should focus on prospective, real-world settings and on AI assistance. Our study suggests that it is time to move forward to real-world studies and randomized clinical trials to accelerate progress for the benefit of our patients. The only randomized study available has shown better diagnostic accuracy when using AI algorithms as 'augmented intelligence' 62. We envision a fruitful collaboration between AI and humans, leveraging the strengths of both to enhance diagnostic capabilities and patient care.

Fig. 2 | QUADAS-2 results of the assessment of risk of bias in the included studies. The QUADAS-2 tool was used to assess the risk of bias in the included studies in terms of 4 domains (participants, index test, reference standard, and analysis). Low risk (cyan) refers to the number of studies that have a low risk of bias in the respective domain. Unclear (gray) refers to the number of studies that have an unclear risk of bias in the respective domain due to lack of information reported by the study. High risk (purple) refers to the number of studies that have a high risk of bias in the respective domain. a Risk of bias assessment. b Applicability concerns.

Fig. 3 | Forest plot detailing the sensitivity and specificity for all groups of clinicians ('overall') and artificial intelligence algorithms from each study included in the meta-analysis, according to type of test set (external vs internal). a Sensitivity for artificial intelligence (left) and all clinicians ('overall') (right). b Specificity for artificial intelligence (left) and all clinicians ('overall') (right).

Fig. 4 | Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and all groups of clinicians (right). ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 5 | Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and generalists (right). ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 6 | Forest plots of studies showing artificial intelligence vs generalists sensitivity and specificity. a Sensitivity for artificial intelligence (left) and for generalists (right). b Specificity for artificial intelligence (left) and for generalists (right).

Fig. 7 | Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and non-expert dermatologists (right). ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 8 | Forest plots of studies showing artificial intelligence vs non-expert dermatologists sensitivity and specificity, according to type of test set (external vs internal). a Sensitivity for artificial intelligence (left) and for non-expert dermatologists (right). b Specificity for artificial intelligence (left) and for non-expert dermatologists (right).

Fig. 9 | Hierarchical ROC curves of studies for comparing performance between artificial intelligence algorithms (left) and expert dermatologists (right). ROC receiver operating characteristic. Each circle size represents the individual study sample size (circle size is inversely related to study variance).

Fig. 10 | Forest plots of studies showing artificial intelligence vs expert dermatologists sensitivity and specificity, according to type of test set (external vs internal). a Sensitivity for artificial intelligence (left) and expert dermatologists (right). b Specificity for artificial intelligence (left) and for expert dermatologists (right).

Table 1 | Included studies general characteristics, dataset used, and performance evaluating dermoscopy. HP histopathology confirmation, I/E internal/external test set, P prospective, R retrospective, B both (a subset of lesions were biopsy proven and a subset based on clinical/consensus diagnosis), CD clinical data (metadata) available, CNN convolutional neural network, DCNN deep convolutional neural network, AK actinic keratosis, BCC basal cell carcinoma, BKL benign keratosis, SK seborrheic keratosis, DF dermatofibroma, MEL melanoma, NT not trained, SCC squamous cell carcinoma, VASC vascular lesion, Sn sensitivity, Sp specificity, Acc accuracy, NPV negative predictive value, PPV positive predictive value, OR odds ratio, ROC receiver operating characteristic curve, AI artificial intelligence.

Table 2 | Included studies general characteristics, dataset used, and performance evaluating clinical images. HP histopathology confirmation, I/E internal/external test set, P prospective, R retrospective, B both (a subset of lesions were biopsy proven and a subset based on clinical/consensus diagnosis), CD clinical data (metadata) available, CNN convolutional neural network, DCNN deep convolutional neural network, AK actinic keratosis, BCC basal cell carcinoma, BKL benign keratosis, SK seborrheic keratosis, DF dermatofibroma, MEL melanoma, NT not trained, SCC squamous cell carcinoma, VASC vascular lesion, Sn sensitivity, Sp specificity, Acc accuracy, NPV negative predictive value, PPV positive predictive value, ROC receiver operating characteristic curve, AI artificial intelligence.

Table 3 | Included studies general characteristics, dataset used, and performance evaluating both dermoscopic and clinical images. HP histopathology confirmation, I/E internal/external test set, P prospective, R retrospective, B both (a subset of lesions were biopsy proven and a subset based on clinical/consensus diagnosis), CD clinical data (metadata) available, CNN convolutional neural network, DCNN deep convolutional neural network, AK actinic keratosis, BCC basal cell carcinoma, BKL benign keratosis, SK seborrheic keratosis, DF dermatofibroma, MEL melanoma, NT not trained, SCC squamous cell carcinoma, VASC vascular lesion, Sn sensitivity, Sp specificity, Acc accuracy, NPV negative predictive value, PPV positive predictive value, ROC receiver operating characteristic curve, AI artificial intelligence.

Table 4 | Meta-analysis results, summary estimates of sensitivity, specificity, and likelihood ratio according to subgroups.