Introduction

The digitisation of healthcare, accelerated by the COVID-19 pandemic, has led to an accumulation of ‘big data’, so called not only for its substantial volume but also for its complexity and diversity. The availability of such data and advances in computing capacity provide a unique opportunity to revolutionise healthcare using artificial intelligence (AI)1.

Machine learning (ML) is a subfield of AI that uses computational models to make intelligent predictions from training datasets without direct human intervention2. In recent years, deep learning (DL) has become the most widely used computational approach in the field of ML3. DL, inspired by the information-processing patterns of the human brain, is based on multi-layered artificial neural networks that learn from big data. Broadly, three types of dataset are used in a DL study: training, validation and testing. A training dataset is used to derive DL models, where the algorithms are ‘fitted’ to perform particular functions. A validation dataset is then used to evaluate the performance of the model whilst fine-tuning its architecture. The test dataset provides the final evaluation of the model4.
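The three-way dataset split described above can be sketched in a few lines (an illustrative, stdlib-only example; the dataset size, split proportions and random seed are arbitrary choices, not taken from any reviewed study):

```python
import random

def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle and partition a dataset into training, validation and test subsets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                # fits the model's parameters
    val = shuffled[n_train:n_train + n_val]   # tunes architecture/hyperparameters
    test = shuffled[n_train + n_val:]         # held out for the final evaluation
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Note that for clinical image datasets the split should be made at the participant level rather than the image level, so that images of the same person do not appear in both the training and test sets.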

Interest in medical applications of DL has largely been in the fields of radiology, ophthalmology, pathology, and dermatology5. As a visual specialty with large image databases, dermatology has considerable potential to augment disease diagnosis and severity assessment using DL, leading to improved healthcare efficiency and reduced costs. A major focus of DL research (bolstered by expanding big data, image quality, computing capacity and DL techniques) is in assisting clinicians in the diagnosis of skin cancers from images6,7,8. This has achieved encouraging results with a meta-analysis of 70 studies suggesting that the accuracy of computer-aided (including DL-based) diagnosis of melanoma was comparable to that of dermatologists6.

Whilst this area of dermatology remains a focus of research, easing the burden of non-lesion dermatological diseases warrants attention owing to their high prevalence, visibility, psychosocial impact, need for long-term treatment and associated costs. An estimated 20–25% of the population is affected by chronic inflammatory skin diseases, the most common of which include eczema and psoriasis9. The delayed access to healthcare and unpredictable clinical course of inflammatory skin diseases further adds to the already substantial impact on quality of life and underlines the potential for improved, early diagnosis and close monitoring approaches that leverage DL image analysis.

This systematic review assesses the evidence for using DL image analyses in the diagnosis and severity monitoring of skin diseases, beyond benign and malignant skin lesions.

Results

Study screening

The searches identified 13,857 references. After removing duplicates, 12,320 titles and abstracts were screened. Subsequent full-text screening of 268 studies identified 64 studies that met the eligibility criteria (Fig. 1 and Supplementary Material 1). No studies prior to 2012 met our eligibility criteria. This is in keeping with the paradigm shift towards the use of DL in ML research in 2012, when AlexNet (a DL architecture) was shown to significantly outperform traditional ML image analysis methods10.

Fig. 1: PRISMA flowchart of study records.

PRISMA flowchart showing the study selection process.

Quality assessment

Risk of bias and applicability concerns for all 64 included studies were assessed using our modified QUADAS-2 framework (Supplementary Material 2). 59 studies (92%) had overall high risk of bias judgements, and 62 studies (97%) had overall high level applicability concerns (Fig. 2 and Supplementary Material 3). Of 9 studies that used external datasets to validate or test their DL algorithms, 8 studies (89%) still had an overall high risk of bias and all 9 studies (100%) had overall high level applicability concerns (Supplementary Material 4).

Fig. 2: Summary of quality assessment results of all studies (using modified QUADAS-2).

Quality assessment of all included studies.

With respect to risk of bias, the participant and outcome domains were more commonly rated as high/unclear (92% and 83%, respectively), in contrast to the reference standard and index test domains (9% and 9%, respectively) (Fig. 2). To determine the reference standard, most studies (61%, n = 39) had datasets verified by at least one clinician. With respect to the index test, 91% (n = 58) accounted for overfitting, underfitting, and/or optimism when assessing DL algorithm performance against the reference standard.

All 64 studies scored high or unclear in the participant domain of applicability concerns (Fig. 2). All externally validated/tested DL algorithms (n = 9) also had high level applicability concerns in this domain (Supplementary Material 4). There was poor reporting of participant characteristics such as Fitzpatrick skin type, age, and gender, as well as poor generalisability of the study settings. 63% and 38% of all studies had high/unclear applicability concerns in the index test and outcome domains, respectively (Fig. 2). This reduced to 0% and 11%, respectively, when considering only externally validated/tested DL algorithms (Supplementary Material 4).

General study characteristics

Overall, 144 skin diseases were studied. Of these, the most frequently studied diseases were acne (n = 30), psoriasis (n = 27), eczema (n = 22), rosacea (n = 12), vitiligo (n = 12) and urticaria (n = 8) (Tables 1 and 2). The most common skin disease categories were inflammatory, follicular, pigmentary and infectious disorders (Table 3).

Table 1 Baseline characteristics of all studies.
Table 2 Outcomes of deep learning algorithms for the diagnosis of the six most studied diseases.
Table 3 Outcomes of deep learning algorithms for the diagnosis of the five main categories of skin disease.

47 of 64 (73%) included studies reported research funding, 6 (9%) did not and 11 (17%) were unclear (Supplementary Material 5). The authors were most frequently affiliated to China (n = 20), India (n = 9) and the USA (n = 5), and private datasets were mostly from Asia (73%, n = 35) (Supplementary Material 6).

Study design

Most studies (88%, n = 56) used retrospectively collected data and most (85%, n = 55) used the same image dataset for both training and validation/testing (Table 1). Few studies (14%, n = 9) used independent external data to validate or test their DL algorithms (Supplementary Material 7). Overall, 24 studies (37.5%) evaluated the algorithm in an independent dataset, a clinical setting or prospectively. No randomised controlled trials (RCTs) of DL in skin diseases were found.

DL algorithms were developed predominantly for disease diagnosis (81%, n = 52), rather than severity assessment (19%, n = 12). Diagnostic DL algorithms were most commonly developed for acne (n = 24), psoriasis (n = 23) and eczema (n = 21) (Table 1). Disease severity DL algorithms were most commonly developed for acne (n = 6) and psoriasis (n = 4).

Participants and images

Of those studies performing training (n = 60), internal/external validation (n = 34) and internal/external testing (n = 52) of DL algorithms, the number of participants was reported by 18% (median 2000 participants, IQR 416–5860; n = 11), 24% (median 626 participants, IQR 167–3102; n = 8) and 15% (median 185 participants, IQR 90–340; n = 8), respectively (Table 1). Participant age was reported in 13 (20%) studies and sex was reported in 12 (19%) studies.

In the minority of studies reporting participant ethnicity and/or Fitzpatrick skin type (19%, n = 12; Table 1), there was representation across most ethnicities and skin types. Of 10 studies reporting Fitzpatrick skin types, 4 specified the number of participants per Fitzpatrick skin type group: most (>85%) participants had skin types II–IV. In the other 6 studies, 5 specified that participants were mostly skin types III–IV and 1 study stated that participants were mostly skin types II–III.

Most image datasets (88%, n = 56) comprised macroscopic images of skin, hair or nails. Dermoscopic images were most commonly used for psoriasis (n = 5) and eczema (n = 4) (Table 1). In contrast to participant characteristics, the number of images used in training, validation and testing datasets was reported by most studies: 60 (93%), 61 (91%) and 62 (96%) studies, respectively. Generally, a greater number of images was used to train DL algorithms (median 2555 images, IQR 902–8550) than to validate (median 1032 images, IQR 274–2000) or test (median 331 images, IQR 157–922) DL algorithms. The ratio of the median number of images to the median number of participants was 1.3 for training, 1.6 for validation and 1.8 for testing datasets. This indicates that individual participants contributed more than one image through, for example, multiple photographs of anatomically distinct sites or splitting/modification of an image (Table 1).
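The image-to-participant ratios quoted above follow directly from the reported medians; a quick check of the arithmetic:

```python
# Median images and median participants per dataset type, as reported in
# Table 1 of the review.
medians = {
    "training":   (2555, 2000),
    "validation": (1032, 626),
    "testing":    (331, 185),
}
ratios = {name: round(images / participants, 1)
          for name, (images, participants) in medians.items()}
print(ratios)  # {'training': 1.3, 'validation': 1.6, 'testing': 1.8}
```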

DL algorithms

Five studies used more than one type of DL algorithm, hence the total number of algorithms was 69 across 64 studies. Overall, the commonest types of DL algorithm were convolutional neural networks (CNN) and deep convolutional neural networks (DCNN) (80%, n = 55 of 69 algorithms) (Fig. 3 and Table 1). CNN and DCNN are considered interchangeable terms, as ‘deep’ refers to the number of layers in the algorithm architecture and most modern CNNs consist of a large number of layers11. The first CNN/DCNN study included in our review appeared in 2017. By 2021, 85% (n = 17 of 20) of studies applied CNN/DCNN algorithms. Ensemble DL algorithms, which combine multiple DL algorithms to improve prediction performance, first appeared in 2018 but were used less frequently than CNN/DCNN in subsequent years. Multilayer perceptrons (MLP) (3%, n = 2 of 69 algorithms) and artificial neural networks (ANN) (3%, n = 2 of 69 algorithms), which are now considered outdated types of DL, were also less commonly used.
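The convolution operation from which CNNs take their name can be illustrated with a stdlib-only sketch. Here a hand-set 3×3 vertical-edge filter is slid over a toy image; in a real CNN the filter weights are learned from data, and the ‘deep’ in DCNN refers to stacking many such layers:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (stride 1, no padding) of a 2D list of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Toy 4x4 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1]] * 4
# A vertical-edge filter: responds where brightness increases left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(conv2d(image, kernel))  # [[3, 3], [3, 3]]
```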

Fig. 3: Type of deep learning algorithms included in the systematic review, by year of publication.

The number of different types of deep learning (DL) algorithms is presented by year of publication of the studies. As five studies used multiple DL algorithms, the total number of algorithms sums to 69 across the 64 included studies.

Most studies (77%, n = 49) reported the reference standard of the DL algorithm; 36 (73%) used a clinician assessment of images, of which 27 (75%) were dermatologists. The remaining 27% (n = 13 of 49) used multiple reference standards inconsistently across datasets or other reference standards including biopsies, blood tests and curated databases (Table 1). The severity scales used for disease severity grading DL algorithms were varied (Supplementary Material 8).

Transparency

Most studies (95%, n = 61) disclosed the source of images. The most common sources were hospital/university databases (47%, n = 30), and many studies also used public databases (22%, n = 14). Image datasets were fully or partially available in under one third of studies (31%, n = 20). DL algorithm code was available in 26 (41%) studies. Seven studies (11%) provided no details on the architecture of the DL algorithms (Table 1). With regard to transparency of reporting of primary and secondary outcomes, 26 of 64 studies (41%) provided the raw values used to calculate accuracy, sensitivity or specificity.

Accuracy of diagnostic DL algorithms: six most studied diseases

Accuracy (the primary outcome) was the most commonly reported outcome for assessing the performance of all DL algorithms (75%, n = 48). The median diagnostic accuracy of the DL algorithms for the six most studied diseases (acne, psoriasis, eczema, rosacea, vitiligo, urticaria) was high, ranging from 81% for urticaria (n = 2) to 94% for both acne (IQR 86–98, n = 11) and rosacea (IQR 90–97, n = 4) (Table 2). The accuracies of the externally validated/tested diagnostic DL algorithms were higher for acne (median 92%, n = 2) and eczema (96%, n = 2) compared with psoriasis (74%, n = 1), however direct comparison was limited by the small number of studies. Most diagnostic DL algorithms for the six most studied diseases performed multiclass classification (79%, n = 26 of 33), rather than binary classification (21%, n = 7 of 33) (Supplementary Material 9).

Accuracy of diagnostic DL algorithms: five categories of disease

The median diagnostic accuracy of DL algorithms for the five categories of skin diseases (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) was high, ranging from 88% for both skin infections (IQR 60–95, n = 17) and pigmentary disorders (IQR 80–99, n = 5) to 100% for alopecia (n = 2) (Table 3). The diagnostic accuracies of DL algorithms for inflammatory disorders (median 92%, IQR 80–96; n = 30) and follicular disorders of skin (median 93%, IQR 87–97; n = 16) were similarly high.

The median diagnostic accuracy of externally validated/tested DL algorithms was high for inflammatory disorders (83%, IQR 53–100; n = 6) and follicular disorders of skin (84%, n = 3), although numerically lower than that of all DL algorithms. Both studies reporting diagnostic accuracy of DL algorithms for alopecia used external testing and had an accuracy of 100%. In contrast, the median accuracy of externally validated/tested DL algorithms for diagnosing skin infections was low (59%, IQR 50–74; n = 7).

Accuracy of severity grading DL algorithms

The analysis of DL algorithms for disease severity grading was limited by a paucity of studies (n = 12, Supplementary Material 8). The accuracy of DL algorithms in grading psoriasis severity was 93–100% (n = 2), however external validation/testing was not performed (Supplementary Material 10). The single study of a DL algorithm for grading eczema severity did perform external validation and reported 88% accuracy. Of 4 studies assessing DL algorithms that grade acne severity (median accuracy 76%, IQR 68–85), one performed external testing and reported lower accuracy (68%).

Secondary outcomes of diagnostic DL algorithms: six most studied diseases

A total of 23 studies reported AUC. The median AUC of diagnostic DL algorithms was high, ranging from 0.90 (IQR 0.87–0.94, n = 4) for rosacea to 0.98 (IQR 0.93–0.99, n = 4) for acne (Table 2). The AUC of externally validated/tested DL algorithms was similarly high.

Overall, 29 studies reported specificity. The median specificity of diagnostic DL algorithms was high, ranging from 88% (IQR 80–98, n = 4) for vitiligo to 100% (n = 2) for urticaria (Table 2). Externally validated/tested algorithms had similarly high specificity, all above 96% (n = 6).

A total of 43 studies reported sensitivity. The median sensitivity of diagnostic DL algorithms was variable, ranging from 63% (IQR 42–92, n = 6) in rosacea to 91% (IQR 80–95, n = 5) in vitiligo. The range of sensitivity values for each disease was wide, contrasting with the narrower ranges for specificity. Externally validated/tested diagnostic DL algorithms generally had lower sensitivities than the overall dataset, ranging from 42% (n = 1) in rosacea to 87% (n = 2) in acne.

31 and 8 studies reported PPV and NPV, respectively. The median PPV of diagnostic DL algorithms varied from 77% for urticaria (n = 2) to 91% for vitiligo (n = 3) (Table 2). In contrast, the NPV of diagnostic DL algorithms was >90% for all six diseases, which was also a consistent finding for externally validated/tested DL algorithms.

Secondary outcomes of diagnostic DL algorithms: five categories of disease

In line with the above findings, diagnostic DL algorithms for the five disease categories (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) were broadly highly specific but had variable sensitivity.

The median specificity of diagnostic DL algorithms ranged from 97% (IQR 93–99, n = 15) for follicular disorders of skin to 100% (n = 2) for alopecia (Table 3). With respect to inflammatory skin diseases (the most frequently studied disease category), the median specificity of diagnostic DL algorithms was 98% (IQR 95–99, n = 40) and this remained high when only externally validated/tested algorithms were considered (100%, IQR 98–100; n = 10).

The median sensitivity of diagnostic DL algorithms ranged from 77% in inflammatory skin diseases (IQR 63–92, n = 47) and skin infections (IQR 63–93, n = 33) to 87% (IQR 67–94, n = 19) in follicular disorders (Table 3). When considering only externally validated/tested diagnostic DL algorithms, the median sensitivities remained variable and were lowest in inflammatory disorders (58%, IQR 48–72; n = 12) and skin infections (70%, IQR 56–80; n = 17), compared with follicular disorders (87%, IQR 63–92; n = 4). The range of sensitivity values for each disease category was also wide, contrasting with the narrower ranges for specificity.

Secondary outcomes of severity grading DL algorithms

Although data were limited, the specificity of disease severity DL algorithms was high and ranged from 94–95% for acne (n = 2) to 97–100% for psoriasis (n = 2) (Supplementary Material 10). AUC was reported in only one study of a severity grading DL algorithm, in psoriasis (AUC 0.99). The sensitivity of disease severity grading DL algorithms ranged from 82–84% for acne (n = 2) to 93–96% for psoriasis (n = 2). PPV was reported for severity grading DL algorithms in acne (range 54–86%, n = 3) and psoriasis (93%, n = 1). No studies reported these metrics for externally validated/tested DL algorithms of disease severity (Supplementary Material 10).

Discussion

This systematic review provides a comprehensive evaluation of DL image analysis studies of skin diseases. Skin conditions are the fourth leading cause of non-fatal disease burden worldwide12. Our review encompasses the commonest long-term skin conditions in the global population including eczema, psoriasis, acne, rosacea, vitiligo and urticaria. The reported diagnostic accuracy of DL algorithms was broadly encouraging for common inflammatory skin diseases, in contrast to skin infections, for which the diagnostic accuracy of externally validated/tested DL algorithms was generally low. Although diagnostic DL algorithms were mostly specific, there was variation in their sensitivities, which were notably lower in the fewer yet more robust studies that performed external validation or testing. While relatively few studies assessed DL algorithms for disease severity grading, the highest accuracy was reported in psoriasis, followed by eczema and acne.

Importantly, our findings on the reliability and applicability of current DL studies indicate that they should be interpreted with caution. There are key limitations, which bring the real-world clinical applicability of the reported DL algorithms into question. These include heterogeneity of study design and a lack of RCTs. Although 47% of studies utilised images from hospital or university databases, public databases were used in 22%, which may not be representative of the target population or healthcare setting of the algorithms’ intended use. The generalisability of the DL algorithms was also limited by poor capture of number, age, gender and skin colour of study participants. Ethnicity and/or Fitzpatrick skin type was reported in only 19% of studies and, amongst those reporting these characteristics, skin types I, V and VI were underrepresented. The reference standard for evaluating DL algorithm performance was inconsistent, with some studies not defining the ‘ground truth’ or specifying the type of clinician who assessed disease diagnosis/severity. There was also poor transparency of reporting of outcome metrics including confidence intervals for specificity and sensitivity, and numerator/denominator data. There were omissions in the reporting of data class-balance, and bias towards images of particular phenotypes and from specific geographical locations (Asia), leading to potentially skewed training datasets. Although authors implemented measures to mitigate model overfitting, the extent to which this was successful requires external validation/testing, which was only performed in 14% of studies.

There is a notable paucity of evidence synthesis of DL image analyses in non-cancer skin diseases; however, similar limitations are highlighted in prior reviews of DL in skin cancer detection8,13,14, and are mirrored across other medical specialities at the forefront of DL image analysis research such as radiology and ophthalmology15,16,17,18,19. Improved reporting, consistent use of out-of-sample external validation/testing and well-defined clinical cohorts are common themes in calls to improve the quality and interpretability of studies. A systematic review of 14 skin cancer image datasets similarly demonstrated poor ethnicity and/or Fitzpatrick skin type reporting20, in line with recent scoping reviews21,22. Most studies in our review were at early developmental stages, with 84% stating that further prospective work or trials are required before clinical use. Notably, there are currently no FDA-approved AI/ML-enabled medical devices relating to dermatology; of 521 devices listed in the latest FDA update, most are applied to radiology (75%), followed by cardiovascular (11%) specialties23.

The strengths of this systematic review include a broad search strategy, which provides a comprehensive overview of DL image analysis in dermatology from its inception in the field. Our protocol, which adhered to PRISMA and SWiM guidelines, was developed with multidisciplinary input from experts in clinical dermatology, deep learning, and systematic reviews. Article searches, screening, data extraction and quality assessment were carried out by at least two independent researchers. We modified the QUADAS-2 tool to systematically assess the quality of included DL articles, which may be a valuable resource for future research in this rapidly expanding field.

Limitations include the lack of an existing formal quality assessment tool for AI/DL studies. Recent progress has been made, with PROBAST-AI24 and QUADAS-AI25 currently under development, in addition to publication of the CLEAR Derm consensus guidelines of best practice for image-based DL assessment26. Our modified QUADAS-2 tool is in line with the CLEAR Derm guidelines and enables in-depth, AI-centred evaluation of both risk of bias and applicability. We were unable to perform a meta-analysis due to the high degree of heterogeneity in the reporting of study design and outcome metrics. The small number of studies on each individual disease, together with variation in disease severity grading scales, also precluded accurate inter-disease comparisons. Therefore, comparisons across studies should be interpreted with caution. Comparison of outcomes across diagnostic studies is particularly challenging since some DL algorithms perform binary classification and others perform multiclass classification; however, most diagnostic DL algorithms for the six most studied diseases in this systematic review performed multiclass classification, with few performing binary classification. Our search strategy was restricted to papers published in English, which may have omitted some studies, particularly given the dominance of investigators affiliated to China. To reduce result heterogeneity, we excluded studies that reported only results derived from pooling together data from non-lesional and lesional diagnoses; seven studies were excluded for this reason, suggesting that bias due to selective reporting27 may be an issue to be aware of in this evidence base. However, these exclusions may also have biased towards inclusion of studies of more simplistic DL algorithms.

Our findings are timely and clinically relevant. The recent prioritisation of teledermatology within healthcare workflows, enabling ready availability of skin images, has accelerated the need to understand the potential of DL image analysis. The deployment of AI technologies forms an integral part of national and international strategy to address the inequity of access to care for those with inflammatory skin conditions28. It may improve patient selection for early intervention by identifying those at risk of worse outcomes (severe disease), to help address the rising global burden of skin conditions. Our review indicates that DL image analysis has exciting potential, particularly in the diagnosis and severity assessment of common, highly treatable skin conditions (e.g. acne, eczema, psoriasis), however current studies have methodological and reporting concerns. There is a need for prospective image dataset curation with detailed clinical metadata, and external validation and testing of DL algorithms. Collaborative large-scale efforts from global dermatology networks29,30 to collect high-quality images from well-phenotyped cohorts are vital. A relative paucity of studies of disease severity versus diagnostic DL algorithms was also identified. Standardised reporting frameworks and performance evaluation metrics are critical for the interpretation of the wealth of emerging data. SPIRIT-AI and CONSORT-AI31 offer guidance for DL clinical trials. DL-specific reporting guidelines such as TRIPOD-AI24,26,31 and formalised regulatory, evaluation and data governance pathways32 will facilitate the transition of DL from the current experimental phase to realising its full potential in advancing healthcare efficiency and disease outcomes.

Methods

Search strategy and selection criteria

This systematic review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines33. The study protocol was prospectively registered on PROSPERO (CRD42022309935). The eligibility criteria were structured using the population, intervention, comparison, outcome, and study type (PICOS) framework (Supplementary Material 11)34. For population, we included skin, hair or nail diseases of any severity in all ages, ethnicities and skin types. Studies assessing wounds, and benign or malignant skin lesions were excluded. The intervention was DL algorithms applied to macroscopic and/or dermoscopic images of skin, hair or nail diseases to diagnose or grade the severity of disease. Although we included studies using any comparators, we defined our reference standard for evaluating DL algorithm performance as an assessment by a clinician. Studies using any outcome measures to report DL algorithm performance (e.g. accuracy, sensitivity, specificity) were included. Studies other than original research articles (e.g. letters, editorials, conference proceedings and reviews) were excluded.

We searched five electronic databases (PubMed, Embase, Web of Science, ACM digital library and IEEE Xplore). Search terms were selected based on consensus expert opinion and adapted for each database (Supplementary Material 12). All primary research articles published in peer-reviewed journals from 1st January 2000 to 23rd June 2022 were considered for inclusion.

Each study was screened using a two-stage process, using the systematic review web-tool Rayyan35. After removal of duplicates, five members of the research team (SPC, BJK, AP, WRT, SPT) independently screened titles and abstracts for potentially eligible studies so that each record was blindly assessed by at least two reviewers. Full texts of studies included from the initial screening stage were assessed for final eligibility by at least two researchers independently. Any disagreements were resolved by consensus or a third reviewer.

Data analysis

Data from each included study were extracted by one member of the research team (SPC, BJK, AP, WRT, SMLL, JS, SPT) and checked by another member. Baseline characteristics, study design, characteristics and outcomes of DL algorithms were extracted using a predefined data extraction sheet. Any disagreements were resolved by consensus or a third reviewer. One author (BJK) amalgamated the extracted data and performed the data analysis.

Studies were categorised into diagnostic and severity grading studies. Diagnostic studies use DL to make predictions about disease diagnosis based on images. Severity studies make predictions about disease severity based on images, as measured by relevant severity scales, e.g. the Psoriasis Area and Severity Index (PASI) for psoriasis. Results were summarised separately for diagnosis and severity grading studies. Study types were categorised according to Prediction model Risk Of Bias Assessment Tool (PROBAST) definitions36, which we modified based on consensus among the research team to suit DL studies (Supplementary Material 13).

Given the large number of skin diseases covered in the review, we first summarised findings for the six most commonly studied diseases, followed by findings for five key skin disease categories (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders and skin infections). We referred to acne, rosacea and hidradenitis suppurativa as ‘follicular disorders of skin’ to distinguish them from alopecia, since studies of the latter use images of hair rather than skin.

To explore the impact of possible bias introduced by studies that use the same dataset to both train and validate/test algorithms, we also separately present the results for studies that use independent external data for validation and/or testing.

Due to high heterogeneity of the studies with respect to DL techniques and methods of evaluation, a meta-analysis would not have been suitable or informative. Instead, we conducted a narrative synthesis, following Synthesis Without Meta-analysis (SWiM) guidelines37. We performed descriptive statistical analyses, including calculation of median, interquartile range (IQR) and range using Microsoft Excel Version 2208. The primary outcome was the accuracy of DL algorithms in diagnosing and/or grading the severity of disease. Secondary outcomes included sensitivity, specificity, area under the receiver operating characteristic curve (AUC), positive predictive value (PPV) and negative predictive value (NPV).
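The primary and secondary outcome metrics listed above are simple functions of the binary confusion-matrix counts; a minimal sketch with hypothetical counts (AUC, which integrates sensitivity and specificity over all classification thresholds, is omitted):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Diagnostic performance metrics from binary confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv":         tp / (tp + fp),  # positive predictive value
        "npv":         tn / (tn + fn),  # negative predictive value
    }

# Hypothetical example: 90 true positives, 10 false negatives,
# 5 false positives and 95 true negatives.
m = diagnostic_metrics(tp=90, fp=5, fn=10, tn=95)
print(m["accuracy"], m["sensitivity"], m["specificity"])  # 0.925 0.9 0.95
```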

Quality assessment

There is a lack of quality assessment frameworks specific for AI/DL studies, with tools such as PROBAST-AI and QUADAS-AI still in development24,25. We modified the QUADAS-2 framework25 so that it could be used to assess risk of bias and applicability of DL studies in this systematic review and also provide a valuable resource for future research in the field (Supplementary Material 2). Questions were added to probe the robustness of DL algorithms in dermatology, such as the reporting of Fitzpatrick skin type, use of our defined reference standard (clinician assessment), and external validation/testing of the algorithm. All studies were blindly assessed using the modified QUADAS-2 by two reviewers (SPC, WRT) and any disagreements resolved through consensus.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.