Automated severity scoring of atopic dermatitis patients by a deep neural network

Scoring atopic dermatitis (AD) severity with the Eczema Area and Severity Index (EASI) in an objective and reproducible manner is challenging. Automated measurement of erythema, papulation, excoriation, and lichenification severity using images has not yet been investigated. Our aim was to determine whether convolutional neural networks (CNNs) could assess erythema, papulation, excoriation, and lichenification severity at a level of competence comparable to dermatologists. We created a standard dataset of 8,000 clinical images showing AD. Each component of the EASI was scored from 0 to 3 by three dermatologists. We trained four CNNs (ResNet V1, ResNet V2, GoogLeNet, and VGG-Net) with the image dataset and determined which CNN was the most suitable for erythema, papulation, excoriation, and lichenification scoring. The brightness of the images in each dataset was adjusted from −80% to +80% of the original brightness (i.e., 9 levels in steps of 20%) to investigate whether the CNNs still scored accurately when image brightness changed. Compared to the dermatologists' scoring, accuracy rates of the CNNs were 99.17% for erythema, 93.17% for papulation, 96.00% for excoriation, and 97.17% for lichenification. CNNs trained with brightness-adjusted images achieved high accuracy without the need to standardize camera settings. These results suggest that CNNs perform at a level of competence comparable to dermatologists for scoring erythema, papulation, excoriation, and lichenification severity.

www.nature.com/scientificreports/ The EASI consists of four components: erythema, induration/papulation, excoriation, and lichenification, each scored from 0 to 3 according to severity (none, mild, moderate, and severe). Another important component of the EASI is the affected body surface area: the body is divided into head/neck, upper limbs, trunk, and lower limbs, and each region receives 0 to 6 points for the AD-affected area. The EASI score is calculated from the four severity-related components and the affected-area points via a mathematical function [2]. Scoring AD severity with the EASI in an objective and reproducible manner is challenging. To obtain an accurate EASI score, observers must be trained and validated, so education on EASI scoring is important. However, standardizing conventional educational programs is difficult, as seen for PASI education [4]. In addition, EASI measurement is time consuming and difficult to perform each time a patient visits the clinic.
Convolutional neural networks (CNNs) are a branch of deep learning algorithms that have been applied to detect skin cancer, diabetic retinopathy, and onychomycosis [5-8]. In these reports, the accuracy of CNNs trained with a large number of clinical photographs was comparable to that of specialist clinicians [5-8]. These results were achieved through validation with a large number of clinical photographs and through advances in CNN architectures. Therefore, with a validated dataset of clinical AD photographs, CNNs were expected to be trainable to distinguish erythema, induration/papulation, excoriation, and lichenification scores, which are the individual components of the EASI. Our aim was to determine whether CNNs could assess erythema, induration/papulation, excoriation, and lichenification severity at a level of competence comparable to dermatologists. We trained four CNN models (ResNet V1, ResNet V2, GoogLeNet, and VGG-Net) with an image dataset and examined which CNN was most suitable for scoring each component of the EASI.
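The "mathematical function" is not spelled out in the text. As a minimal sketch, the commonly published EASI formula sums the four sign severities per region, multiplies by the region's area score, and weights each region by a fixed multiplier. The multipliers below are the standard ones for patients aged 8 years or older; the function name `easi_score` and the dictionary layout are illustrative assumptions, not part of this study.

```python
# Region multipliers in the commonly published EASI formula for patients
# aged 8 years or older (assumption: not stated in this paper).
REGION_WEIGHTS = {
    "head_neck": 0.1,
    "upper_limbs": 0.2,
    "trunk": 0.3,
    "lower_limbs": 0.4,
}

SIGNS = ("erythema", "induration_papulation", "excoriation", "lichenification")


def easi_score(regions):
    """Return the total EASI score (range 0 to 72).

    `regions` maps each region name to a dict holding the four sign
    severities (each 0 to 3) and an `area` score (0 to 6).
    """
    total = 0.0
    for name, weight in REGION_WEIGHTS.items():
        region = regions[name]
        for sign in SIGNS:
            if not 0 <= region[sign] <= 3:
                raise ValueError(f"{name}.{sign} must be between 0 and 3")
        if not 0 <= region["area"] <= 6:
            raise ValueError(f"{name}.area must be between 0 and 6")
        # Sum of the four sign severities, scaled by area and region weight.
        severity_sum = sum(region[sign] for sign in SIGNS)
        total += severity_sum * region["area"] * weight
    return total
```

The CNNs described in this paper would supply only the four sign severities per image; the area score still has to come from a clinician or a separate method.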

Methods
Datasets and CNN training. We used clinical images from Seoul St. Mary's Hospital to construct AD datasets. Data on the images were collected via a retrospective chart review, and all data were fully anonymized before we accessed them. In total, 24,852 clinical images of AD were acquired from 2009 to 2017, and the lesion area of each image was cropped to 224 by 224 pixels. Poorly focused and poor-quality images were excluded. Each image was scored from 0 to 3 for each component of the EASI by three dermatologists, with the final score determined by consensus among the dermatologists (Fig. 1). For each EASI sign, 500 images were selected for each severity score (0 to 3), creating a dataset of 2,000 scored images per EASI component. Of the 8,000 cropped images selected, 5,600 images (1,400 images each for erythema, induration/papulation, excoriation, and lichenification) were used to train the CNNs. The remaining 2,400 images (600 images each for erythema, induration/papulation, excoriation, and lichenification) were used to validate the CNNs (Fig. 2). For external validation, 400 images for each EASI sign were selected from Uijeongbu St. Mary's Hospital in the same way. This study was reviewed and approved by the Institutional Review Board of the Catholic University of Korea (CMC Central IRB: KC18RESI0827).
Twelve CNNs achieved good performance for image classification in the ImageNet Large Scale Visual Recognition Challenge [9-12]: VGG-Net with 16 and 19 layers (i.e., VGG16 and VGG19); GoogLeNet V1, V2, V3, and V4; ResNet V1 with 50, 101, and 152 layers; and ResNet V2 with 50, 101, and 152 layers. These 12 CNNs have also achieved excellent performance for dermatology image classification. For this reason, they were trained in this study to classify the severity of each EASI component.
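The 1,400/600 split per EASI component can be sketched as follows. The text does not say how images were allocated, so this sketch assumes a random split stratified by severity score (350 training and 150 validation images per score); the function name `split_by_score`, the seed, and the placeholder image identifiers are hypothetical.

```python
import random


def split_by_score(images_by_score, n_train_per_score, seed=0):
    """Split images into training and validation sets, score by score.

    `images_by_score` maps a severity score (0 to 3) to a list of image
    identifiers. Returns two lists of (image, score) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, validation = [], []
    for score, images in sorted(images_by_score.items()):
        shuffled = list(images)
        rng.shuffle(shuffled)
        train += [(img, score) for img in shuffled[:n_train_per_score]]
        validation += [(img, score) for img in shuffled[n_train_per_score:]]
    return train, validation


# One EASI component: 500 placeholder image names per severity score.
erythema = {s: [f"erythema_{s}_{i:03d}.png" for i in range(500)] for s in range(4)}
train_set, validation_set = split_by_score(erythema, n_train_per_score=350)
```

Stratifying keeps the four severity scores equally represented in both sets, which matches the balanced 500-per-score design of the dataset.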
Evaluation of the CNNs. The output of the trained CNNs was four continuous numbers between 0 and 1 for each input image, which could be interpreted as the probability of each severity level. For example, if an image X was given to one of the CNNs, the output was y_1, y_2, y_3, and y_4, which were the probabilities of severity scores 0, 1, 2, and 3, respectively. To identify misclassified severity scores, the specificity and sensitivity of each severity score were analyzed as the threshold changed from 0.01 to 1.00, and their receiver operating characteristic (ROC) curves were plotted using the following equations:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives for a given severity score.

In this study, because the trained CNNs performed multiclass classification, the performance of each CNN was also analyzed using a confusion matrix. A confusion matrix is a visualization tool typically used in multiclass supervised learning and contains information about the actual classifications and the classifications predicted by a classification model. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class [9-12]. Each element is the conditional probability of a predicted output from the CNNs given the actual value.
Adjustment of image brightness. The brightness of the images in each dataset was adjusted from −80% to +80% of the original brightness (i.e., 9 levels in steps of 20%) to investigate whether the CNNs still scored accurately when image brightness changed. Additionally, the differences in accuracy between the CNNs trained with only the original images and those trained with the brightness-adjusted images were investigated.
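The evaluation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: each ROC curve is built one-vs-rest for a single severity score, the predicted score is taken as the index of the largest output, and the confusion matrix is row-normalized so that each element is the fraction of images of an actual score that received a given predicted score. All function names are hypothetical.

```python
def predicted_score(probs):
    """Return the severity score (index) with the highest output probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def confusion_matrix(actual, outputs, n_classes=4):
    """Row-normalized confusion matrix: rows are actual, columns predicted."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for a, probs in zip(actual, outputs):
        counts[a][predicted_score(probs)] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def roc_points(actual, outputs, score, thresholds=None):
    """One-vs-rest ROC points (1 - specificity, sensitivity) for one score."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]  # 0.01 to 1.00
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for a, probs in zip(actual, outputs):
            called_positive = probs[score] >= t
            if a == score:
                if called_positive:
                    tp += 1
                else:
                    fn += 1
            else:
                if called_positive:
                    fp += 1
                else:
                    tn += 1
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        points.append((1 - specificity, sensitivity))
    return points
```

Plotting the points returned by `roc_points` for each of the four scores gives one ROC curve per severity level for a given EASI component.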
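The nine brightness levels can be generated as in the sketch below. The text does not state how brightness was changed, so the sketch assumes multiplicative scaling of 8-bit pixel values (a level of +0.2 multiplies every value by 1.2) with clipping to the range 0 to 255; `adjust_brightness` and `brightness_variants` are hypothetical names, and the image is represented as a plain nested list of single-channel values for simplicity.

```python
# Nine levels: -0.8, -0.6, ..., 0.0, ..., +0.6, +0.8 (steps of 20%).
BRIGHTNESS_LEVELS = [step / 100 for step in range(-80, 81, 20)]


def adjust_brightness(pixels, level):
    """Scale 8-bit pixel values by (1 + level) and clip to 0..255.

    `pixels` is a list of rows, each a list of integer values.
    Assumption: brightness is adjusted multiplicatively.
    """
    factor = 1.0 + level
    return [
        [min(255, max(0, round(value * factor))) for value in row]
        for row in pixels
    ]


def brightness_variants(pixels):
    """Return a dict mapping each brightness level to the adjusted image."""
    return {level: adjust_brightness(pixels, level) for level in BRIGHTNESS_LEVELS}
```

Applying `brightness_variants` to every training image yields nine versions per image (including the unchanged original at level 0.0), which also enlarges the training set.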

Use of human participants.
(i) Research was performed in accordance with relevant guidelines/regulations. (ii) Informed consent was obtained from all participants and/or their legal guardians.

Results
Confusion matrices for erythema, induration/papulation, excoriation, and lichenification for the CNNs are shown in Fig. 3. Misclassifications by the CNNs occurred mainly between severity scores of 1 and 2, but their probabilities were low.
When we verified our model with the Uijeongbu St. Mary's Hospital dataset, the accuracy of severity scoring was 90.63% for erythema, 89.06% for induration/papulation, 87.50% for excoriation, and 85.94% for lichenification.

Discussion
This paper used CNNs to measure AD severity. The application of deep neural networks in dermatology is mainly limited to the diagnosis of skin cancers [5,13-17]. Although making diagnoses with a deep neural network is important, using a deep neural network to take over time-consuming tasks from physicians is also important. One of these tasks in dermatology is measuring the EASI score. The use of CNNs may increase the accuracy of AD severity scoring, allowing treatment response to be assessed accurately and improving rapport with patients, which in turn may improve treatment compliance.
The EASI is an investigator-assessed instrument identified as one of the three best-validated outcome measures for AD [18,19]. The EASI was chosen by the International Harmonizing Outcomes Measures for Eczema initiative, after extensive systematic evaluation of its measurement properties, as the preferred core instrument to  [19,20]. Currently, the EASI is often used in clinical practice and in trials of AD. The problems with measuring the EASI are that it is time consuming and has intermediate interobserver reliability. Training in the EASI takes approximately 30 min [2,21]. The time required to measure the EASI in one patient is 6.0 ± 4.5 min (mean ± SD) [19]. EASI training does not take much time, but checking the EASI in complicated cases can take as long as 10 min, reducing the time for patient care and education in clinical practice settings. Improving interobserver reliability requires validation between observers, which increases the training time for the EASI and requires educational lectures and reference photographs [4]. Therefore, a reliable measuring system could support observers, improve interobserver reliability, and shorten the time needed to measure the EASI. Studies are underway to develop reliable measuring systems for diseases such as melasma, vitiligo, and psoriasis [22-24]. For AD, a deep neural network may solve these problems. Deep neural networks, including CNNs, achieve state-of-the-art performance in numerous vision tasks, including image classification, object detection, and segmentation. However, no reports have applied CNNs to measure severity scores in skin diseases, including AD.
According to our results, for erythema and lichenification scoring, ResNet V1 with 101 layers achieved an accuracy greater than 99%. Erythema is assessed by the degree of redness, and it seems to allow high accuracy because few other factors affect the CNNs. The lichenification score is determined by skin thickness and wrinkle depth. In clinical photographs, the depth of wrinkles tends to be represented by shadows that are relatively dark compared to the surrounding skin. Since this tendency is clear for lichenification, it may explain why the CNNs showed a high accuracy of 97%. However, since recognizing depth in a 2-dimensional image is difficult and induration/papulation is often accompanied by erythema, those severity scores may be less accurate (e.g., 93% for induration/papulation). As CNNs become more accurate and as the amount of training data increases, we expect these limitations to be overcome. External validation with the Uijeongbu St. Mary's Hospital dataset showed that the accuracy of our model was 85% to 90% for each component of the EASI. These results appeared to be due to the intermediate interobserver reliability of the EASI. If dermatologists from Uijeongbu and Seoul St. Mary's Hospitals had scored the severity of each component of the EASI in agreement, the external validation might also have shown high accuracy. This result suggests that more accurate models could be created if more dermatologists participated, and we expect that models usable globally can be created in the future.
Standardizing camera conditions such as the shutter speed, iris, and film speed is thought to be necessary to standardize the light intensity or brightness of photographs when taking clinical images in dermatology clinics [6]. Since not all clinical images can be taken under the same conditions in the real world, the brightness of the clinical pictures was adjusted and the adjusted images were used to train the CNNs. The result was a large difference in the accuracy of severity scoring between CNNs trained with the brightness-adjusted images and CNNs trained without them. Training with the brightness-adjusted images also effectively enlarged the dataset, which seemed to increase the accuracy. This process can be automated in software and is recommended to increase the accuracy of CNNs.
This system had some limitations. The system would be better if more clinical images per EASI component had been used to train the CNNs. This study was conducted on a Korean population, and the Fitzpatrick skin type of Koreans is usually 2-4. Therefore, patients with darker skin were not included in this study. However, we suspect that this method could also work in a population with darker skin with appropriate adjustments. This study was a pilot to investigate whether CNNs could be used for EASI scoring, and the CNNs achieved high accuracy. Measuring the EASI requires both the severity score of each component and the proportion of the lesion area, and further study is needed to determine how to recognize the area score automatically.
The results from this pilot study suggest that CNNs could be used for clinical scoring of atopic dermatitis and to assist dermatologists in measuring the EASI.