Introduction

The digitisation of healthcare, accelerated by the COVID-19 pandemic, has led to an accumulation of ‘big data’, so called not only for its substantial volume but also for its complexity and diversity. The availability of such data and advances in computing capacity provide a unique opportunity to revolutionise healthcare using artificial intelligence (AI)1.

Machine learning (ML) is a subfield of AI that uses computational models to make intelligent predictions from training datasets without direct human intervention2. In recent years, deep learning (DL) has become the most widely used computational approach in the field of ML3. DL, inspired by the information-processing patterns of the human brain, is based on multi-layered artificial neural networks that learn from big data. Broadly, three types of dataset are used in a DL study: training, validation and testing. A training dataset is used to derive DL models, where the algorithms are ‘fitted’ to perform particular functions. A validation dataset is then used to evaluate the performance of the model whilst fine-tuning its architecture. The test dataset provides the final evaluation of the model4.
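The three-way dataset split described above can be sketched in a few lines (an illustrative, stdlib-only example; the dataset size, split proportions and random seed are arbitrary choices, not taken from any reviewed study):

```python
import random

def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle and partition a dataset into training, validation and test subsets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                # fits the model's parameters
    val = shuffled[n_train:n_train + n_val]   # tunes architecture/hyperparameters
    test = shuffled[n_train + n_val:]         # held out for the final evaluation
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Note that for clinical image datasets the split should be made at the participant level rather than the image level, so that images of the same person do not appear in both the training and test sets.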

Interest in medical applications of DL has largely been in the fields of radiology, ophthalmology, pathology, and dermatology5. As a visual specialty with large image databases, dermatology has considerable potential to augment disease diagnosis and severity assessment using DL, leading to improved healthcare efficiency and reduced costs. A major focus of DL research (bolstered by expanding big data, image quality, computing capacity and DL techniques) is in assisting clinicians in the diagnosis of skin cancers from images6,7,8. This has achieved encouraging results with a meta-analysis of 70 studies suggesting that the accuracy of computer-aided (including DL-based) diagnosis of melanoma was comparable to that of dermatologists6.

Whilst this area of dermatology remains a focus of research, easing the burden of non-lesion dermatological diseases warrants attention owing to their high prevalence, visibility, psychosocial impact, need for long-term treatment and associated costs. An estimated 20–25% of the population is affected by chronic inflammatory skin diseases, the most common of which include eczema and psoriasis9. The delayed access to healthcare and unpredictable clinical course of inflammatory skin diseases further adds to the already substantial impact on quality of life and underlines the potential for improved, early diagnosis and close monitoring approaches that leverage DL image analysis.

This systematic review assesses the evidence for using DL image analyses in the diagnosis and severity monitoring of skin diseases, beyond benign and malignant skin lesions.

Results

Study screening

The searches identified 13,857 references. After removing duplicates, 12,320 titles and abstracts were screened. Subsequent full-text screening of 268 studies identified 64 studies that met the eligibility criteria (Fig. 1 and Supplementary Material 1). No studies prior to 2012 met our eligibility criteria. This is in keeping with the paradigm shift towards the use of DL in ML research in 2012, when AlexNet (a DL architecture) was shown to significantly outperform traditional ML image analysis methods10.

Fig. 1: PRISMA flowchart of study records.

PRISMA flowchart showing the study selection process.

Quality assessment

Risk of bias and applicability concerns for all 64 included studies were assessed using our modified QUADAS-2 framework (Supplementary Material 2). 59 studies (92%) had overall high risk of bias judgements, and 62 studies (97%) had overall high level applicability concerns (Fig. 2 and Supplementary Material 3). Of 9 studies that used external datasets to validate or test their DL algorithms, 8 studies (89%) still had an overall high risk of bias and all 9 studies (100%) had overall high level applicability concerns (Supplementary Material 4).

Fig. 2: Summary of quality assessment results of all studies (using modified QUADAS-2).

Quality assessment of all included studies.

With respect to risk of bias, the participant and outcome domains were more commonly rated as high/unclear (92% and 83%, respectively), in contrast to the reference standard and index test domains (9% and 9%, respectively) (Fig. 2). To determine the reference standard, most studies (61%, n = 39) had datasets verified by at least one clinician. With respect to the index test, 91% (n = 58) accounted for overfitting, underfitting, and/or optimism when assessing DL algorithm performance against the reference standard.

All 64 studies scored high or unclear in the participant domain of applicability concerns (Fig. 2). All externally validated/tested DL algorithms (n = 9) also had high level applicability concerns in this domain (Supplementary Material 4). There was poor reporting of participant characteristics such as Fitzpatrick skin type, age, and gender, as well as poor generalisability of the study settings. 63% and 38% of all studies had high/unclear applicability concerns in the index test and outcome domains, respectively (Fig. 2). This reduced to 0% and 11%, respectively, when considering only externally validated/tested DL algorithms (Supplementary Material 4).

General study characteristics

Overall, 144 skin diseases were studied. Of these, the most frequently studied diseases were acne (n = 30), psoriasis (n = 27), eczema (n = 22), rosacea (n = 12), vitiligo (n = 12) and urticaria (n = 8) (Tables 1 and 2). The most common skin disease categories were inflammatory, follicular, pigmentary and infectious disorders (Table 3).

Table 1 Baseline characteristics of all studies.
Table 2 Outcomes of deep learning algorithms for the diagnosis of the six most studied diseases.
Table 3 Outcomes of deep learning algorithms for the diagnosis of the five main categories of skin disease.

47 of 64 (73%) included studies reported research funding, 6 (9%) did not and 11 (17%) were unclear (Supplementary Material 5). The authors were most frequently affiliated to China (n = 20), India (n = 9) and the USA (n = 5), and private datasets were mostly from Asia (73%, n = 35) (Supplementary Material 6).

Study design

Most studies (88%, n = 56) used retrospectively collected data and most (85%, n = 55) used the same image dataset for both training and validation/testing (Table 1). Few studies (14%, n = 9) used independent external data to validate or test their DL algorithms (Supplementary Material 7). Overall, 24 studies (37.5%) evaluated the algorithm in an independent dataset, a clinical setting or prospectively. No randomised controlled trials (RCTs) of DL in skin diseases were found.

DL algorithms were developed predominantly for disease diagnosis (81%, n = 52), rather than severity assessment (19%, n = 12). Diagnostic DL algorithms were most commonly developed for acne (n = 24), psoriasis (n = 23) and eczema (n = 21) (Table 1). Disease severity DL algorithms were most commonly developed for acne (n = 6) and psoriasis (n = 4).

Participants and images

Of those studies performing training (n = 60), internal/external validation (n = 34) and internal/external testing (n = 52) of DL algorithms, the number of participants was reported by 18% (median 2000 participants, IQR 416–5860; n = 11), 24% (median 626 participants, IQR 167–3102; n = 8) and 15% (median 185 participants, IQR 90–340; n = 8), respectively (Table 1). Participant age was reported in 13 (20%) studies and sex was reported in 12 (19%) studies.

In the minority of studies reporting participant ethnicity and/or Fitzpatrick skin type (19%, n = 12; Table 1), there was representation across most ethnicities and skin types. Of 10 studies reporting Fitzpatrick skin types, 4 specified the number of participants per Fitzpatrick skin type group: most (>85%) participants had skin types II–IV. In the other 6 studies, 5 specified that participants were mostly skin types III–IV and 1 study stated that participants were mostly skin types II–III.

Most image datasets (88%, n = 56) comprised macroscopic images of skin, hair or nails. Dermoscopic images were most commonly used for psoriasis (n = 5) and eczema (n = 4) (Table 1). In contrast to participant characteristics, the number of images used in training, validation and testing datasets was reported by most studies: 60 (93%), 61 (91%) and 62 (96%) studies, respectively. Generally, a greater number of images was used to train DL algorithms (median 2555 images, IQR 902–8550) than to validate (median 1032 images, IQR 274–2000) or test (median 331 images, IQR 157–922) DL algorithms. The ratio of the median number of images to the median number of participants was 1.3 for training, 1.6 for validation and 1.8 for testing datasets. This indicates that individual participants contributed more than one image through, for example, multiple photographs of anatomically distinct sites or splitting/modification of an image (Table 1).
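The image-to-participant ratios quoted above follow directly from the reported medians; a quick check of the arithmetic:

```python
# Median images and median participants per dataset type, as reported in
# Table 1 of the review.
medians = {
    "training":   (2555, 2000),
    "validation": (1032, 626),
    "testing":    (331, 185),
}
ratios = {name: round(images / participants, 1)
          for name, (images, participants) in medians.items()}
print(ratios)  # {'training': 1.3, 'validation': 1.6, 'testing': 1.8}
```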

DL algorithms

Five studies used more than one type of DL algorithm, hence the total number of algorithms was 69 across 64 studies. Overall, the commonest types of DL algorithm were convolutional neural networks (CNN) and deep convolutional neural networks (DCNN) (80%, n = 55 of 69 algorithms) (Fig. 3 and Table 1). CNN and DCNN are considered interchangeable terms, as ‘deep’ refers to the number of layers in the algorithm architecture and most modern CNNs consist of a large number of layers11. The first CNN/DCNN study included in our review appeared in 2017. By 2021, 85% (n = 17 of 20) of studies applied CNN/DCNN algorithms. Ensemble DL algorithms, which combine multiple DL algorithms to improve prediction performance, first appeared in 2018 but were used less frequently than CNN/DCNN in subsequent years. Multilayer perceptrons (MLP) (3%, n = 2 of 69 algorithms) and artificial neural networks (ANN) (3%, n = 2 of 69 algorithms), which are now considered outdated types of DL, were also less commonly used.
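The convolution operation from which CNNs take their name can be illustrated with a stdlib-only sketch. Here a hand-set 3×3 vertical-edge filter is slid over a toy image; in a real CNN the filter weights are learned from data, and the ‘deep’ in DCNN refers to stacking many such layers:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (stride 1, no padding) of a 2D list of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Toy 4x4 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1]] * 4
# A vertical-edge filter: responds where brightness increases left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(conv2d(image, kernel))  # [[3, 3], [3, 3]]
```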

Fig. 3: Type of deep learning algorithms included in the systematic review, by year of publication.

The number of different types of deep learning (DL) algorithms is presented by year of publication of the studies. As five studies used multiple DL algorithms, the total number of algorithms sums to 69 across the 64 included studies.

Most studies (77%, n = 49) reported the reference standard of the DL algorithm; 36 (73%) used a clinician assessment of images, of which 27 (75%) were dermatologists. The remaining 27% (n = 13 of 49) used multiple reference standards inconsistently across datasets or other reference standards including biopsies, blood tests and curated databases (Table 1). The severity scales used for disease severity grading DL algorithms were varied (Supplementary Material 8).

Transparency

Most studies (95%, n = 61) disclosed the source of images. The most common sources were hospital/university databases (47%, n = 30), and many studies also used public databases (22%, n = 14). Image datasets were fully or partially available in under one third of studies (31%, n = 20). DL algorithm code was available in 26 (41%) studies. Seven studies (11%) provided no details on the architecture of the DL algorithms (Table 1). With regard to transparency of reporting of primary and secondary outcomes, 26 of 64 studies (41%) provided the raw values used to calculate accuracy, sensitivity or specificity.

Accuracy of diagnostic DL algorithms: six most studied diseases

Accuracy (the primary outcome) was the most commonly reported outcome for assessing the performance of all DL algorithms (75%, n = 48). The median diagnostic accuracy of the DL algorithms for the six most studied diseases (acne, psoriasis, eczema, rosacea, vitiligo, urticaria) was high, ranging from 81% for urticaria (n = 2) to 94% for both acne (IQR 86–98, n = 11) and rosacea (IQR 90–97, n = 4) (Table 2). The accuracies of the externally validated/tested diagnostic DL algorithms were higher for acne (median 92%, n = 2) and eczema (96%, n = 2) compared with psoriasis (74%, n = 1), however direct comparison was limited by the small number of studies. Most diagnostic DL algorithms for the six most studied diseases performed multiclass classification (79%, n = 26 of 33), rather than binary classification (21%, n = 7 of 33) (Supplementary Material 9).

Accuracy of diagnostic DL algorithms: five categories of disease

The median diagnostic accuracy of DL algorithms for the five categories of skin diseases (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) was high, ranging from 88% for both skin infections (IQR 60–95, n = 17) and pigmentary disorders (IQR 80–99, n = 5) to 100% for alopecia (n = 2) (Table 3). The diagnostic accuracies of DL algorithms for inflammatory disorders (median 92%, IQR 80–96; n = 30) and follicular disorders of skin (median 93%, IQR 87–97; n = 16) were similarly high.

The median diagnostic accuracy of externally validated/tested DL algorithms was high for inflammatory disorders (83%, IQR 53–100; n = 6) and follicular disorders of skin (84%, n = 3), although numerically lower than that of all DL algorithms. Both studies reporting diagnostic accuracy of DL algorithms for alopecia used external testing and had an accuracy of 100%. In contrast, the median accuracy of externally validated/tested DL algorithms for diagnosing skin infections was low (59%, IQR 50–74; n = 7).

Accuracy of severity grading DL algorithms

The analysis of DL algorithms for disease severity grading was limited by a paucity of studies (n = 12, Supplementary Material 8). The accuracy of DL algorithms in grading psoriasis severity was 93–100% (n = 2), however external validation/testing was not performed (Supplementary Material 10). The single study of a DL algorithm for grading eczema severity did perform external validation and reported 88% accuracy. Of 4 studies assessing DL algorithms that grade acne severity (median accuracy 76%, IQR 68–85), one performed external testing and reported lower accuracy (68%).

Secondary outcomes of diagnostic DL algorithms: six most studied diseases

A total of 23 studies reported AUC. The median AUC of diagnostic DL algorithms was high, ranging from 0.90 (IQR 0.87–0.94, n = 4) for rosacea to 0.98 (IQR 0.93–0.99, n = 4) for acne (Table 2). The AUC of externally validated/tested DL algorithms was similarly high.

Overall, 29 studies reported specificity. The median specificity of diagnostic DL algorithms was high, ranging from 88% (IQR 80–98, n = 4) for vitiligo to 100% (n = 2) for urticaria (Table 2). Externally validated/tested algorithms had similarly high specificity, all above 96% (n = 6).

A total of 43 studies reported sensitivity. The median sensitivity of diagnostic DL algorithms was variable, ranging from 63% (IQR 42–92, n = 6) in rosacea to 91% (IQR 80–95, n = 5) in vitiligo. The range of sensitivity values for each disease was wide, contrasting with the narrower ranges for specificity. Externally validated/tested diagnostic DL algorithms generally had lower sensitivities than the overall dataset, ranging from 42% (n = 1) in rosacea to 87% (n = 2) in acne.

31 and 8 studies reported PPV and NPV, respectively. The median PPV of diagnostic DL algorithms varied from 77% for urticaria (n = 2) to 91% for vitiligo (n = 3) (Table 2). In contrast, the NPV of diagnostic DL algorithms was >90% for all six diseases, which was also a consistent finding for externally validated/tested DL algorithms.

Secondary outcomes of diagnostic DL algorithms: five categories of disease

In line with the above findings, diagnostic DL algorithms for the five disease categories (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) were broadly highly specific but had variable sensitivity.

The median specificity of diagnostic DL algorithms ranged from 97% (IQR 93–99, n = 15) for follicular disorders of skin to 100% (n = 2) for alopecia (Table 3). With respect to inflammatory skin diseases (the most frequently studied disease category), the median specificity of diagnostic DL algorithms was 98% (IQR 95–99, n = 40) and this remained high when only externally validated/tested algorithms were considered (100%, IQR 98–100; n = 10).

The median sensitivity of diagnostic DL algorithms ranged from 77% in inflammatory skin diseases (IQR 63–92, n = 47) and skin infections (IQR 63–93, n = 33) to 87% (IQR 67–94, n = 19) in follicular disorders (Table 3). When considering only externally validated/tested diagnostic DL algorithms, the median sensitivities remained variable and were lowest in inflammatory disorders (58%, IQR 48–72; n = 12) and skin infections (70%, IQR 56–80; n = 17), compared with follicular disorders (87%, IQR 63–92; n = 4). The range of sensitivity values for each disease category was also wide, contrasting with the narrower ranges for specificity.

Secondary outcomes of severity grading DL algorithms

Although data were limited, the specificity of disease severity DL algorithms was high and ranged from 94–95% for acne (n = 2) to 97–100% for psoriasis (n = 2) (Supplementary Material 10). AUC was reported in only one study of a severity grading DL algorithm, in psoriasis (AUC 0.99). The sensitivity of disease severity grading DL algorithms ranged from 82–84% for acne (n = 2) to 93–96% for psoriasis (n = 2). PPV was reported for severity grading DL algorithms in acne (range 54–86%, n = 3) and psoriasis (93%, n = 1). No studies reported these metrics for externally validated/tested DL algorithms of disease severity (Supplementary Material 10).

Discussion

This systematic review provides a comprehensive evaluation of DL image analysis studies of skin diseases. Skin conditions are the fourth leading cause of non-fatal disease burden worldwide12. Our review encompasses the commonest long-term skin conditions in the global population including eczema, psoriasis, acne, rosacea, vitiligo and urticaria. The reported diagnostic accuracy of DL algorithms was broadly encouraging for common inflammatory skin diseases, in contrast to skin infections, for which the diagnostic accuracy of externally validated/tested DL algorithms was generally low. Although diagnostic DL algorithms were mostly specific, there was variation in their sensitivities, which were notably lower in the fewer yet more robust studies that performed external validation or testing. While relatively few studies assessed DL algorithms for disease severity grading, the highest accuracy was reported in psoriasis, followed by eczema and acne.

Importantly, our findings on the reliability and applicability of current DL studies indicate that they should be interpreted with caution. There are key limitations, which bring the real-world clinical applicability of the reported DL algorithms into question. These include heterogeneity of study design and a lack of RCTs. Although 47% of studies utilised images from hospital or university databases, public databases were used in 22%, which may not be representative of the target population or healthcare setting of the algorithms’ intended use. The generalisability of the DL algorithms was also limited by poor capture of number, age, gender and skin colour of study participants. Ethnicity and/or Fitzpatrick skin type was reported in only 19% of studies and, amongst those reporting these characteristics, skin types I, V and VI were underrepresented. The reference standard for evaluating DL algorithm performance was inconsistent, with some studies not defining the ‘ground truth’ or specifying the type of clinician who assessed disease diagnosis/severity. There was also poor transparency of reporting of outcome metrics including confidence intervals for specificity and sensitivity, and numerator/denominator data. There were omissions in the reporting of data class-balance, and bias towards images of particular phenotypes and from specific geographical locations (Asia), leading to potentially skewed training datasets. Although authors implemented measures to mitigate model overfitting, the extent to which this was successful requires external validation/testing, which was only performed in 14% of studies.

There is a notable paucity of evidence synthesis of DL image analyses in non-cancer skin diseases; however, similar limitations are highlighted in prior reviews of DL in skin cancer detection8,13,14, and are mirrored across other medical specialities at the forefront of DL image analysis research such as radiology and ophthalmology15,16,17,18,19. Improved reporting, consistent use of out-of-sample external validation/testing and well-defined clinical cohorts are common themes in calls to improve the quality and interpretability of studies. A systematic review of 14 skin cancer image datasets similarly demonstrated poor ethnicity and/or Fitzpatrick skin type reporting20, in line with recent scoping reviews21,22. Most studies in our review were at early developmental stages, with 84% stating that further prospective work or trials are required before clinical use. Notably, there are currently no FDA-approved AI/ML-enabled medical devices relating to dermatology; of 521 devices listed in the latest FDA update, most are applied to radiology (75%), followed by cardiovascular (11%) specialties23.

The strengths of this systematic review include a broad search strategy, which provides a comprehensive overview of DL image analysis in dermatology from its inception in the field. Our protocol, which adhered to PRISMA and SWiM guidelines, was developed with multidisciplinary input from experts in clinical dermatology, deep learning, and systematic reviews. Article searches, screening, data extraction and quality assessment were carried out by at least two independent researchers. We modified the QUADAS-2 tool to systematically assess the quality of included DL articles, which may be a valuable resource for future research in this rapidly expanding field.

Limitations include the lack of an existing formal quality assessment tool for AI/DL studies. Recent progress has been made, with PROBAST-AI24 and QUADAS-AI25 currently under development, in addition to publication of the CLEAR Derm consensus guidelines of best practice for image-based DL assessment26. Our modified QUADAS-2 tool is in line with the CLEAR Derm guidelines and enables in-depth, AI-centred evaluation of both risk of bias and applicability. We were unable to perform a meta-analysis due to the high degree of heterogeneity in the reporting of study design and outcome metrics. The small number of studies on each individual disease, together with variation in disease severity grading scales, also precluded accurate inter-disease comparisons. Therefore, comparisons across studies should be interpreted with caution. Comparison of outcomes across diagnostic studies is particularly challenging since some DL algorithms perform binary classification and others perform multiclass classification; however, most diagnostic DL algorithms for the six most studied diseases in this systematic review performed multiclass classification, with few performing binary classification. Our search strategy was restricted to papers published in English, which may have omitted some studies, particularly given the dominance of investigators affiliated to China. To reduce result heterogeneity, we excluded studies that reported only results derived from pooling together data from non-lesional and lesional diagnoses; seven studies were excluded for this reason, suggesting that bias due to selective reporting27 may be an issue to be aware of in this evidence base. However, these exclusions may also have biased towards inclusion of studies of more simplistic DL algorithms.

Our findings are timely and clinically relevant. The recent prioritisation of teledermatology within healthcare workflows, enabling ready availability of skin images, has accelerated the need to understand the potential of DL image analysis. The deployment of AI technologies forms an integral part of national and international strategy to address the inequity of access to care for those with inflammatory skin conditions28. It may improve patient selection for early intervention by identifying those at risk of worse outcomes (severe disease), to help address the rising global burden of skin conditions. Our review indicates that DL image analysis has exciting potential, particularly in the diagnosis and severity assessment of common, highly treatable skin conditions (e.g. acne, eczema, psoriasis), however current studies have methodological and reporting concerns. There is a need for prospective image dataset curation with detailed clinical metadata, and external validation and testing of DL algorithms. Collaborative large-scale efforts from global dermatology networks29,30 to collect high-quality images from well-phenotyped cohorts are vital. A relative paucity of studies of disease severity versus diagnostic DL algorithms was also identified. Standardised reporting frameworks and performance evaluation metrics are critical for the interpretation of the wealth of emerging data. SPIRIT-AI and CONSORT-AI31 offer guidance for DL clinical trials. DL-specific reporting guidelines such as TRIPOD-AI24,26,31 and formalised regulatory, evaluation and data governance pathways32 will facilitate the transition of DL from the current experimental phase to realising its full potential in advancing healthcare efficiency and disease outcomes.

Methods

Search strategy and selection criteria

This systematic review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines33. The study protocol was prospectively registered on PROSPERO (CRD42022309935). The eligibility criteria were structured using the population, intervention, comparison, outcome, and study type (PICOS) framework (Supplementary Material 11)34. For population, we included skin, hair or nail diseases of any severity in all ages, ethnicities and skin types. Studies assessing wounds, and benign or malignant skin lesions were excluded. The intervention was DL algorithms applied to macroscopic and/or dermoscopic images of skin, hair or nail diseases to diagnose or grade the severity of disease. Although we included studies using any comparators, we defined our reference standard for evaluating DL algorithm performance as an assessment by a clinician. Studies using any outcome measures to report DL algorithm performance (e.g. accuracy, sensitivity, specificity) were included. Studies other than original research articles (e.g. letters, editorials, conference proceedings and reviews) were excluded.

We searched five electronic databases (PubMed, Embase, Web of Science, ACM digital library and IEEE Xplore). Search terms were selected based on consensus expert opinion and adapted for each database (Supplementary Material 12). All primary research articles published in peer-reviewed journals from 1st January 2000 to 23rd June 2022 were considered for inclusion.

Each study was screened using a two-stage process, using the systematic review web-tool Rayyan35. After removal of duplicates, five members of the research team (SPC, BJK, AP, WRT, SPT) independently screened titles and abstracts for potentially eligible studies so that each record was blindly assessed by at least two reviewers. Full texts of studies included from the initial screening stage were assessed for final eligibility by at least two researchers independently. Any disagreements were resolved by consensus or a third reviewer.

Data analysis

Data from each included study were extracted by one member of the research team (SPC, BJK, AP, WRT, SMLL, JS, SPT) and checked by another member. Baseline characteristics, study design, characteristics and outcomes of DL algorithms were extracted using a predefined data extraction sheet. Any disagreements were resolved by consensus or a third reviewer. One author (BJK) amalgamated the extracted data and performed the data analysis.

Studies were categorised into diagnostic and severity grading studies. Diagnostic studies use DL to make predictions about disease diagnosis based on images. Severity studies make predictions about disease severity based on images, as measured by relevant severity scales, e.g. the Psoriasis Area and Severity Index (PASI) for psoriasis. Results were summarised separately for diagnosis and severity grading studies. Study types were categorised according to Prediction model Risk Of Bias Assessment Tool (PROBAST) definitions36, which we modified based on consensus among the research team to suit DL studies (Supplementary Material 13).

Given the large number of skin diseases covered in the review, we first summarised findings for the six most commonly studied diseases, followed by findings for five key skin disease categories (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders and skin infections). We referred to acne, rosacea and hidradenitis suppurativa as ‘follicular disorders of skin’ to distinguish them from alopecia, since studies of the latter use images of hair rather than skin.

To explore the impact of possible bias introduced by studies that use the same dataset to both train and validate/test algorithms, we also separately present the results for studies that use independent external data for validation and/or testing.

Due to high heterogeneity of the studies with respect to DL techniques and methods of evaluation, a meta-analysis would not have been suitable or informative. Instead, we conducted a narrative synthesis, following Synthesis Without Meta-analysis (SWiM) guidelines37. We performed descriptive statistical analyses, including calculation of median, interquartile range (IQR) and range using Microsoft Excel Version 2208. The primary outcome was the accuracy of DL algorithms in diagnosing and/or grading the severity of disease. Secondary outcomes included sensitivity, specificity, area under the receiver operating characteristic curve (AUC), positive predictive value (PPV) and negative predictive value (NPV).
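The primary and secondary outcome metrics listed above are simple functions of the binary confusion-matrix counts; a minimal sketch with hypothetical counts (AUC, which integrates sensitivity and specificity over all classification thresholds, is omitted):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Diagnostic performance metrics from binary confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv":         tp / (tp + fp),  # positive predictive value
        "npv":         tn / (tn + fn),  # negative predictive value
    }

# Hypothetical example: 90 true positives, 10 false negatives,
# 5 false positives and 95 true negatives.
m = diagnostic_metrics(tp=90, fp=5, fn=10, tn=95)
print(m["accuracy"], m["sensitivity"], m["specificity"])  # 0.925 0.9 0.95
```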

Quality assessment

There is a lack of quality assessment frameworks specific for AI/DL studies, with tools such as PROBAST-AI and QUADAS-AI still in development24,25. We modified the QUADAS-2 framework25 so that it could be used to assess risk of bias and applicability of DL studies in this systematic review and also provide a valuable resource for future research in the field (Supplementary Material 2). Questions were added to probe the robustness of DL algorithms in dermatology, such as the reporting of Fitzpatrick skin type, use of our defined reference standard (clinician assessment), and external validation/testing of the algorithm. All studies were blindly assessed using the modified QUADAS-2 by two reviewers (SPC, WRT) and any disagreements resolved through consensus.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.