Introduction

Until a few years ago, the computer-aided diagnosis of skin lesions from images involved extracting the lesion boundary to distinguish it from the surrounding healthy skin (i.e., skin lesion segmentation), followed by computing features from the obtained segmentation according to rules developed by dermatologists, such as the ABCD rule and the CASH rule1,2, and ultimately using these features to train classical machine learning models (e.g., support vector machines and random decision forests3,4,5,6,7,8) to recommend diagnoses. Since skin lesion segmentation is an intermediate task in the dermatological analysis pipeline, the use of deep learning to predict the diagnosis directly from the images, bypassing segmentation, is now commonplace9,10,11,12 and is evident in other imaging modalities as well13,14,15,16,17. We project a similar trend in which models deemphasize predicting the diagnosis and instead prioritize producing accurate predictions for the ultimate clinical task (e.g., clinical management).

While deep learning-based diagnosis of dermatological conditions from images is reaching the performance levels of medical professionals9,10,18,19, no work has been published on directly predicting the management of the disease. Even in scenarios where the diagnosis is decided by an automated prediction model, the general physician or the dermatologist must still decide on the disease management (be it the treatment plan or some other course of action, e.g., requesting other exams or future follow-ups). Moreover, in some cases, accurately diagnosing the underlying skin condition may not be possible from an image alone. For example, a recent study evaluating the ‘majority decision’ obtained from over a hundred dermatologists for melanoma classification reported a sensitivity of 71.8% with respect to the ground truth diagnosis20. Thus, when the visual presentation of a lesion is ambiguous, rather than diagnosing the condition, the correct action may be to perform a biopsy to gain further information. Machine learning-based approaches that classify the underlying skin condition and use the predicted skin condition to directly decide on a disease management (e.g., Han et al.21) may fail to distinguish among the different management decisions that exist within a single class. A management decision (e.g., scheduling a follow-up visit to monitor the skin lesion progression) may even be necessary to confirm a diagnosis (when there is insufficient information within the image), and therefore must precede it. For example, the clinical management decision for a nevus without atypical characteristics may be that no further action is required, whereas for a nevus with atypical characteristics, a dermatologist may opt for a clinical follow-up or an excision, depending on the severity of the atypical characteristics. Therefore, it is desirable to explore the performance of an artificial intelligence-based automatic skin disease management prediction system. Such a system can suggest management decisions to a clinician (i.e., as a second opinion) or directly to patients in under-served communities22. Moreover, when there are fewer management decisions to choose from than there are diagnosis classes (since multiple subsets of disease classes may be prescribed the same course of action), predicting the management decisions is likely a simpler computational problem than predicting the diagnosis and then inferring the management.

Previous work on clinical management prediction for skin lesions includes comparing management predictions made by MelaFind (a handheld imaging device developed by MELA Sciences Inc. that acquires 10 spectral bands) to histologic slides as the reference labels23 and to decisions made by dermatologists23,24. Carrara et al.25 used shallow artificial neural networks to predict whether a skin lesion should be excised based on lesion descriptors extracted from multispectral images (15 spectral bands). Marchetti et al.26 compared the diagnostic accuracy of an ensemble of automated diagnosis prediction methods (including 2 machine learning-based methods) to the management decisions made by 8 dermatologists for a set of 100 dermoscopic images, but did not directly predict the management decisions using a learning-based approach. To the best of our knowledge, we are the first to use machine learning (shallow or deep) to predict management decisions from only RGB images of skin lesions, without relying on explicit diagnosis predictions, and, in fact, the first to use a deep learning-based approach for this task on any skin lesion imaging modality. We evaluate our proposed method on the Interactive Atlas of Dermoscopy dataset27,28,29, the largest publicly available database containing both dermoscopic and clinical skin lesion images with the associated management decisions, and show that predicting management decisions directly is more accurate than inferring the management decision from a predicted diagnosis. We also validate our model on the publicly available Melanoma Classification Benchmark (MClass-D)18,30 and show that our model exhibits excellent generalization performance when evaluated on data from a different source, and that our model’s clinical management predictions are in agreement with those made by 157 dermatologists.

Results and discussion

The Interactive Atlas of Dermoscopy dataset was used to compare the performance of a model trained to predict the clinical management decisions (\(\rm{MGMT}_{\rm{pred}}\)) with that of inferring the management decisions from the outputs of a diagnosis prediction model (\(\rm{MGMT}_{\rm{infr}}\)). This dataset contains 1,011 lesion cases spanning 20 diagnosis labels (Table 1) grouped into 5 categories28: basal cell carcinoma (BCC), nevus (NEV), melanoma (MEL), seborrheic keratosis (SK), and others (MISC), and 3 management decisions: ‘clinical follow up’ (CLNC), ‘excision’ (EXC), and ‘no further examination’ (NONE). The MClass-D dataset30 was used to compare the diagnosis and the management prediction performance of our model with that of dermatologists. This dataset contains 100 dermoscopic images comprising 80 benign nevi and 20 melanomas, as well as the responses of 157 dermatologists when asked to make a clinical management decision for each of these 100 images: ‘biopsy/further treatment’ (EXC) or ‘reassure the patient’ (NOEXC).

Table 1 Breakdown of the seven-point criteria evaluation dataset29 by management and diagnosis labels and the train-valid-test splits used to train the model.
Figure 1

An overview of the three prediction models. All the models take the clinical and the dermoscopic images of the skin lesion and the patient metadata as input. Note that we also perform an input ablation study (A multi-task prediction model section; Table 4). (a) The first model predicts the lesion diagnosis probabilities, \(\rm{DIAG}_{\rm{pred}}\). (b) The second model predicts the management decision probabilities, \(\rm{MGMT}_{\rm{pred}}\). (c) The third is a multi-task model and predicts the seven-point criteria (\(\rm{Criterion}\{1,2,\ldots ,7\}_{\rm{pred, multi}}\)) in addition to \(\rm{DIAG}_{\rm{pred, multi}}\) and \(\rm{MGMT}_{\rm{pred, multi}}\). The argmax operation assigns 1 to the most likely label and 0 to all others. For (a), the predicted diagnosis \(\rm{DIAG}_{\rm{pred}}\) is used to arrive at a management decision using either (a1) binary labeling, \(\rm{MGMT}_{\rm{infr, binary}}\), or (a2) prior-based inference, \(\rm{MGMT}_{\rm{infr, all}}\). Similarly, the outputs of (b) can be used to directly predict a management decision using either (b1) binary labeling, \(\rm{MGMT}_{\rm{pred, binary}}\), or (b2) all the labels, \(\rm{MGMT}_{\rm{pred, all}}\). As explained in the text, the diagnosis labels are basal cell carcinoma (BCC), nevus (NEV), melanoma (MEL), seborrheic keratosis (SK), and others (MISC), and the management decision labels are ‘clinical follow up’ (CLNC), ‘excision’ (EXC), and ‘no further examination’ (NONE). In the case of binary management decisions, we predict whether a lesion should be excised (EXC) or not (NOEXC).

Interactive atlas of dermoscopy dataset

Predicting whether a lesion should be excised or not

The outputs of the diagnosis prediction model are mapped to a binary management decision (\(\rm{MGMT}_{\rm{infr, binary}}\); Fig. 1a1) of whether a lesion should be excised (EXC) or not (NOEXC). All malignancies (MEL and BCC) are mapped to EXC and all other diagnoses to NOEXC. Similarly, the outputs of the management prediction model are mapped to a binary decision (\(\rm{MGMT}_{\rm{pred, binary}}\); Fig. 1b1) by retaining the EXC class from \(\rm{MGMT}_{\rm{pred}}\) as-is and grouping CLNC and NONE to form NOEXC. These binary mapping-based approaches serve as our baselines, and we observe that \(\rm{MGMT}_{\rm{infr, binary}}\) correctly predicts 218 of the 395 test cases (overall accuracy = 55.19%), whereas \(\rm{MGMT}_{\rm{pred, binary}}\) yields a superior classification performance of 289 correct predictions (overall accuracy = 73.16%), outperforming the inference-based management decision by 17.97 percentage points.
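
For concreteness, the two binary baselines reduce to a few lines of Python. The sketch below is ours, not the authors' code, and the 0-indexed label orderings are our assumption:

```python
import numpy as np

# Assumed label orderings (our convention, not specified in this form in the paper):
# DIAG = [BCC, NEV, MEL, SK, MISC], MGMT = [CLNC, EXC, NONE]
MALIGNANT = [0, 2]  # BCC and MEL

def mgmt_infr_binary(diag_probs):
    """Map predicted diagnoses to EXC (True) / NOEXC (False)."""
    return np.isin(np.argmax(diag_probs, axis=1), MALIGNANT)

def mgmt_pred_binary(mgmt_probs):
    """Keep EXC as-is; group CLNC and NONE into NOEXC."""
    return np.argmax(mgmt_probs, axis=1) == 1
```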

A data-driven approach to inferring management decision from diagnosis predictions

Figure 2

Quantitative evaluation of the \(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\) predictions. (a) Violin plots of the distance measures of the probabilistic predictions show that the \(\rm{MGMT}_{\rm{pred, all}}\) predictions are significantly closer to the target labels on the test data. (b, c) ROC curves and (d, e) confusion matrices of \(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\) respectively, along with a cell-wise diagnosis breakdown. Note that \(\rm{MGMT}_{\rm{infr, all}}\) has a tendency to over-excise lesions.

Since cases belonging to a disease label can be managed in multiple ways, a data-driven approach using conditional probabilities (Interactive Atlas of Dermoscopy Dataset section, Equation (3)) can be adopted to infer the probabilistic management decisions from the diagnosis predictions, and this does not have to be restricted to a binary management. These inferred management decisions (\(\rm{MGMT}_{\rm{infr, all}}\); Fig. 1a2) can then be compared to the probabilistic outputs of the management prediction model (\(\rm{MGMT}_{\rm{pred, all}}\); Fig. 1b2).

Figure 2a shows the distribution of the four sets of distance measures for examining the correctness of the probabilistic \(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\) predictions with respect to the target labels, where each dot represents a test case. For (1 − cosine similarity), the mean [95% CI] distance is lower for \(\rm{MGMT}_{\rm{pred, all}}\) as compared to \(\rm{MGMT}_{\rm{infr, all}}\) (0.3584 [0.3260–0.3909] versus 0.4703 [0.4490–0.4915]; Cohen’s d = 0.4033). We observe similar patterns for the Jensen-Shannon divergence (0.3551 [0.3320–0.3783] versus 0.4397 [0.4249–0.4544]; Cohen’s d = 0.4311), the Wasserstein distance (0.1358 [0.1246–0.1469] versus 0.2687 [0.2581–0.2793]; Cohen’s d = 1.2064), and the Hellinger distance (0.4131 [0.3868–0.4394] versus 0.5111 [0.4944–0.5278]; Cohen’s d = 0.4404).

The final management predictions from the two approaches (\(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\)) are obtained by extracting the most likely label over the probabilistic predictions, and their quantitative results are presented in Table 2. The ROC curves for the two approaches are shown in Fig. 2b,c and their respective confusion matrices, with each cell in the confusion matrices also indicating a diagnosis-wise breakdown of the test samples, are shown in Fig. 2d,e.

Table 2 Comparing skin lesion management prediction results obtained using \(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\). All the prediction models have been trained using all the input data modalities (i.e., clinical image, dermoscopic image, and patient metadata). Mean ± standard deviation reported for all the metrics for the 3-fold cross validation.

We observe that the overall accuracy and AUROC of \(\rm{MGMT}_{\rm{infr, all}}\) (62.53% and 0.7741) are considerably lower than those of \(\rm{MGMT}_{\rm{pred, all}}\) (69.87% and 0.8443), indicating that predicting the management decisions directly leads to a better accuracy than predicting the diagnosis and then inferring the management. Another interesting observation is that the \(\rm{MGMT}_{\rm{infr, all}}\) predictions favor EXC (excision) over the other labels (as can be observed from the dominant blue-colored cells in the rightmost column of Fig. 2d), which, although it leads to an excellent sensitivity (0.9835) for the EXC class, yields unacceptable classification performance for the other two classes (0.2 and 0.0 for NONE and CLNC, respectively). For example, none of the clinical follow-up cases were predicted correctly by \(\rm{MGMT}_{\rm{infr, all}}\), and 106 of these cases (94.64%) were incorrectly assigned excision, i.e., over-treatment. Similarly, the algorithm wrongly predicted excising 32 cases (40%) that, in fact, needed no further examination. On the other hand, \(\rm{MGMT}_{\rm{pred, all}}\) yields a higher overall accuracy without favoring any particular class. Finally, three-fold cross validation results show that this improvement in performance holds true for all metrics across multiple training, validation, and testing partitions of the dataset, with sufficiently low standard deviations across all folds.

A multi-task prediction model

Figure 3

Evaluating the multi-modal multi-task model. (a) ROC curve and (b) precision-recall curve for the management prediction task. Confusion matrices for (c) the management prediction task and (d) the diagnosis prediction task along with the diagnosis-wise breakdown for the management labels.

It has been shown that models optimized to jointly predict related tasks perform better than models trained on individual tasks separately31. As such, we expect to observe an improvement in the management prediction accuracy of our multi-task model trained to simultaneously predict the seven-point criteria32 of the lesions (\(\rm{Criterion}\{1,2,\ldots ,7\}_{\rm{pred, multi}}\)), the diagnosis label (\(\rm{DIAG}_{\rm{pred, multi}}\)), and the management decision (\(\rm{MGMT}_{\rm{pred, multi}}\)). We plot the ROC curve (Fig. 3a) and the confusion matrix (Fig. 3c) for \(\rm{MGMT}_{\rm{pred, multi}}\) for this multi-task model. As expected, we improve the overall management prediction accuracy by 3.8 percentage points (from 69.87% to 73.67%). Moreover, since we have fairly imbalanced classes (see Table 1; for example, there are 243 EXC cases as compared to only 40 NONE cases in the test partition), where ROC curves can present an “overly optimistic view” of an algorithm’s performance33, we also plot the precision-recall curve for the multi-task model in Fig. 3b. A detailed analysis of class-wise performance is presented in Table 3. Three-fold cross validation results show the robustness of this multi-task model to different training, validation, and testing partitions of the dataset. In addition to its higher management prediction accuracy, this multi-task model may be regarded as less opaque and more trustworthy since its final management prediction is linked to clinically meaningful predictions, i.e., the seven-point criteria and the diagnosis. Finally, an input data ablation study for estimating the importance of each input modality (i.e., clinical image, dermoscopic image, and patient metadata) was conducted, in which six prediction models were trained using various combinations of input data modalities. Their quantitative results are presented in Table 4, and the p-values for pairwise comparisons of their predictions using the mid-p McNemar’s test are shown in Fig. 4. We draw the following key observations:

1. Dermoscopic images may be more useful than clinical images for predicting management decisions. We compare the experiments where dermoscopic images and clinical images are used without (‘D’ versus ‘C’) and with (‘DM’ versus ‘CM’) the patient metadata. Without the metadata (‘D’ versus ‘C’), using dermoscopic images instead of clinical images significantly improves all the metrics by \(7.94 \pm 1.87\%\) (\(p = 2.49 \times 10^{-3}\)). Similarly, in the presence of metadata (‘DM’ versus ‘CM’), using dermoscopic images significantly improves all the metrics by \(8.11 \pm 2.36\%\) (\(p = 4.98 \times 10^{-3}\)) as compared to using clinical images.

2. The value of adding a clinical image is questionable when a dermoscopic image is already present. We compare the experiments where a clinical image is added to a dermoscopic image, both in the absence (‘CD’ versus ‘D’; \(p = 1.25 \times 10^{-1}\)) and the presence (‘CDM’ versus ‘DM’; \(p = 4.57 \times 10^{-1}\)) of patient metadata, and observe no consistent pattern of either improvement or degradation in the metrics.

3. The inclusion of patient metadata may improve the management prediction accuracy. Whether using only clinical images (‘CM’ versus ‘C’), only dermoscopic images (‘DM’ versus ‘D’), or both (‘CDM’ versus ‘CD’), all but one metric improved with the inclusion of metadata, by \(2.23 \pm 2.68\%\); the most impactful contribution of the metadata was the \(10.63\%\) improvement in sensitivity for ‘CDM’ versus ‘CD’, and the only metric that decreased was the precision for ‘CDM’ versus ‘CD’ (a \(2.59\%\) decrease). However, these improvements are not statistically significant, with \(p = 1.67 \times 10^{-2}\) (‘CM’ versus ‘C’), \(p = 6.14 \times 10^{-2}\) (‘DM’ versus ‘D’), and \(p = 8.94 \times 10^{-1}\) (‘CDM’ versus ‘CD’).

Table 3 Skin lesion management prediction results \(\rm{MGMT}_{\rm{pred, multi}}\) obtained using a multi-modal multi-task model.
Table 4 Input data modality ablation study for skin lesion management prediction results \(\rm{MGMT}_{\rm{pred, multi}}\) obtained using a multi-task model.
Figure 4

Evaluating the statistical significance of each input data modality’s contribution to improving the management decision prediction \(\rm{MGMT}_{\rm{pred, multi}}\). ‘C’, ‘D’, and ‘M’ refer to the clinical image, the dermoscopic image, and the patient metadata respectively, and the row and the column names refer to the experiments in the ablation study presented in Table 4. For each pair of experiments (i) and (j), the cell (i, j) contains the p-value of the mid-p McNemar’s test performed on the corresponding pair of predictions.

While predicting management decisions, we posit that the clinical penalty of misclassifying certain management decisions is more severe than that of others. For example, consider a lesion whose correct management decision is excision. Incorrectly predicting ‘no further examination’ for such a lesion is a more severe mistake than predicting ‘clinical follow up’, since with a follow-up, the decision to excise may still be corrected at a future examination. We can extend this reasoning to cases where the model predicts NONE when the target label is CLNC. In general, an EXC or a CLNC misclassified as NONE is a more severe mistake than a NONE misclassified as EXC or CLNC, because in the latter scenario, the best course of action can ultimately be determined by the dermatologist at the clinical visit.
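
One way to make this asymmetry concrete is an illustrative cost matrix. The weights below are our own, chosen only to reflect the ordering of severities described above; the paper does not train or evaluate with such costs:

```python
import numpy as np

MGMT = ["CLNC", "EXC", "NONE"]
# cost[true][predicted]: hypothetical penalties consistent with the text
cost = np.array([
    [0, 1, 2],  # true CLNC: predicting NONE is worse than over-referring (EXC)
    [2, 0, 4],  # true EXC: predicting NONE (a missed excision) is the most severe
    [1, 1, 0],  # true NONE: over-treatment can be corrected at the clinic visit
])

def mean_clinical_penalty(conf):
    """Average per-case penalty for a 3x3 confusion matrix of counts."""
    conf = np.asarray(conf)
    return float((conf * cost).sum() / conf.sum())
```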

Since the multi-task model has also been trained to predict the lesion diagnosis, the confusion matrix for the diagnosis prediction task is shown in Fig. 3d. Looking at the relationship between the diagnosis and the management labels (Table 1), we notice that all the malignant skin lesions, namely melanomas (MEL) and basal cell carcinomas (BCC), map to the same management label, i.e., excision (EXC). This means that if we can accurately predict a lesion to be either BCC or MEL, we can infer that it has to be excised. Therefore, if we were to first diagnose skin lesions and then infer their management, we would misclassify 46 malignant cases (the number of BCC or MEL cases misclassified as neither BCC nor MEL; Fig. 3d), and thus incorrectly predict their management. On the other hand, if we directly predict the management decisions, we only misclassify 3 malignant cases (1 BCC and 2 MEL; Fig. 3c).

MClass-D dataset

Figure 5

Evaluating the multi-task model on the MClass-D dataset. (a) Confusion matrices and (b) ROC curves for \(\rm{MGMT}_{\rm{pred}}\) and \(\rm{MGMT}_{\rm{infr}}\) predictions with both \(\rm{MGMT}_{\rm{GT, agg}}\) and \(\rm{MGMT}_{\rm{GT, true}}\) as target clinical management labels.

Next, we validate our trained prediction model on the publicly available MClass-D benchmark30. For this, we use the multi-task prediction model from the A multi-task prediction model section to simultaneously predict the diagnosis labels (DIAG) and the clinical management decisions (MGMT) for the 100 dermoscopic images in the MClass-D dataset. We use the multi-task model trained on the Interactive Atlas of Dermoscopy as-is and do not fine-tune it on the MClass-D dataset.

The prediction classes for DIAG are benign (BNGN) or malignant (MLGN), whereas those for MGMT are excision (EXC) or not (NOEXC). While the diagnosis ground truth labels from the ISIC Archive are available for the lesions, there are multiple ways of choosing a target label for the clinical management decision. Therefore, we look at two possible ways of assigning the “ground truth” management decision: using the aggregated recommendations of the 157 dermatologists present in the dataset (\(\rm{MGMT}_{\rm{GT, agg}}\)), or using the diagnosis ground truth to derive the “true” management decision (\(\rm{MGMT}_{\rm{GT, true}}\)), where “true” indicates the ideal management decision if the underlying diagnosis were known. For each of the two scenarios, we compare the performance of the directly predicted management decision (\(\rm{MGMT}_{\rm{pred}}\)) to that of a management decision inferred from the predicted diagnosis (\(\rm{MGMT}_{\rm{infr}}\)), similar to the Predicting whether a lesion should be excised or not section.

The confusion matrices and the ROC curves for these two sets of predictions (\(\rm{MGMT}_{\rm{infr}}\) and \(\rm{MGMT}_{\rm{pred}}\)), compared against both methods of choosing the “ground truth” management labels, are presented in Fig. 5a,b respectively. When we set \(\rm{MGMT}_{\rm{GT, agg}}\) as the target labels (left column and red curves of Fig. 5a,b respectively), we observe that predicting the management decision directly (\(\rm{MGMT}_{\rm{pred}}\)) performs well for both the management classes without favoring any single class and achieves a notable improvement in the area under the ROC curve, as compared to inferring the management decision (\(\rm{MGMT}_{\rm{infr}}\)) from the model’s diagnosis prediction. Additionally, as discussed in the A multi-task prediction model section, not all misclassification errors are equal, and the clinical penalty of misclassifying an EXC as NOEXC is much more severe than that of other errors. While an ROC curve shows the performance over all probability thresholds, the AUROC does not consider the actual decision of the model. When using a default probability threshold of 0.5, we note that directly predicting the management decisions incurs far fewer such mistakes than inferring the management (16 versus 36). Similarly, when setting \(\rm{MGMT}_{\rm{GT, true}}\) as the target management labels, we observe that although the areas under the ROC curves are similar (Fig. 5b, green curves), the confusion matrices (Fig. 5a, right column) reveal that \(\rm{MGMT}_{\rm{pred}}\) leads to better overall performance across both classes and fewer instances of EXC being misclassified as NOEXC (6 versus 12).

To evaluate the agreement between the model’s predictions and those of the 157 dermatologists, we calculate two agreement measures: Cohen’s kappa and Fleiss’ kappa. The Cohen’s kappa between our model’s predictions and the aggregated recommendations of the 157 dermatologists is 0.5424. This is higher than the agreement between all pairs of dermatologists (\(0.4124 \pm 0.1032\)), and is comparable to the agreement between one dermatologist and the aggregated recommendations of all the others, repeated for all dermatologists (\(0.5497 \pm 0.0899\)). Next, the Fleiss’ kappa for the agreement among the recommendations of the 157 dermatologists is 0.4086. To capture the agreement between our model’s predictions and those of the dermatologists with Fleiss’ kappa, we calculate the agreement among a set of 156 dermatologists’ recommendations and the model’s predictions, repeating this while leaving out one dermatologist at a time, which yields a score of \(0.4080 \pm 0.0006\). To address concerns that the recommendations of 156 dermatologists might overshadow the model’s predictions in the score calculated above, we repeat this experiment with sets of 10 predictions, comprising the recommendations of 9 randomly sampled dermatologists and the model’s predictions, over 1000 repetitions, yielding a score of \(0.3961 \pm 0.0301\). These results indicate that our model’s clinical management predictions agree with those made by dermatologists as much as the dermatologists agree amongst themselves.

Although Brinker et al.18 achieve a better performance at classifying melanomas than our model, we believe this can be attributed to multiple factors. First, Brinker et al. trained their prediction model on over 12,000 images and reported the mean of the results obtained from 10 trained models. Our model, on the other hand, is trained on considerably fewer images (413) and the reported results are from a single training run. Second, the training, validation, and testing partitions for Brinker et al. all come from the same data source, i.e., the ISIC Archive, whereas our model was trained on the Interactive Atlas of Dermoscopy and evaluated on images from the ISIC Archive, leading to a domain shift. CNNs have been shown to exhibit poor generalizability for skin lesion classification tasks when trained and evaluated on separate datasets34. Despite this, our multi-task prediction model is able to adapt to the new domain and exhibits strong generalization performance for clinical management predictions.

Limitations

Although this study provides a proof of concept of the potential advantages of using deep learning to directly predict the clinical management decisions of skin lesions over inferring management decisions from predicted diagnosis labels, it suffers from some limitations. First, the dataset that our model is trained on, the Interactive Atlas of Dermoscopy, only contains 20 diagnosis labels and 3 management labels, which is not an exhaustive list of all diagnoses and management decisions. Second, although we trained the models on the Interactive Atlas of Dermoscopy with a reasonable effort on hyperparameter tuning and fine-tuning, we did not pursue maximizing the classification accuracy. This means that even though our trained prediction model performs well on a held-out test set and generalizes well when evaluated on data from a different source than the one it was trained on, better classification performance may be achievable with careful optimization of the prediction models. Finally, we acknowledge that unlike a dermatologist, who has access to richer and non-image patient metadata such as patient history, demographics, patient preferences, and difficulty of diagnosis, our model only makes predictions based on the attributes present in these two datasets. However, this is not a technical limitation of our approach, and rich multi-modal patient information can be incorporated as and when such attributes become available.

Conclusion

In this work, we proposed a model to predict the management of skin lesions using clinical and dermoscopic lesion images and patient metadata. We showed that predicting the management decisions directly is significantly more accurate than predicting the diagnoses first and then inferring the management decision. Moreover, we also observed a considerable increase in the management prediction accuracy with a multi-task model trained to simultaneously predict the seven-point criteria, the diagnoses, and the corresponding management labels.

Furthermore, evaluation of our model on another dataset showed excellent cross dataset generalizability and strong agreement with the recommendations of dermatologists.

Our goal with this work is not to propose a method that overrides dermatologists, but rather one that provides a second opinion. Deep learning-based approaches for diagnosis, although commonplace as a clinical tool now35,36,37, were far from it a decade ago, and we predict a similar shift towards automated algorithms recommending the clinical management of diseases. Since we have proposed a learning-based approach, the model’s predictions can be made more robust and closer to dermatologists’ decisions by leveraging more complex patient attributes. Future research directions include collecting and testing on other datasets with other skin conditions and treatments to assess the value of directly predicting management labels while deemphasizing latent tasks such as diagnosis prediction.

Materials and methods

Dataset

We have adopted the Interactive Atlas of Dermoscopy dataset28, a credible and extensively validated dataset that has been widely used to teach dermatology residents38,39,40, to train and evaluate our prediction models. The dataset contains clinical and dermoscopic images of skin lesions, patient metadata (patient gender and the location and the elevation of the lesion), the corresponding seven-point criteria32 for the dermoscopic images, and the diagnosis and the management labels for 1,011 cases, with a mean [standard deviation] age of 28.08 [18.70] years; 489 males (48.37%); 294 malignant cases (29.08%); and a skin lesion diameter of 8.84 [5.39] mm. Following Kawahara et al.28, we split the dataset into training, validation, and testing partitions in the ratio of approximately 2 : 1 : 2 (413 : 203 : 395 to be precise) and maintain a similar distribution of the management labels across all three subsets. A breakdown of the dataset according to the management and the diagnosis labels, along with the details of the three splits, is presented in Table 1, and more detailed breakdowns of the dataset according to the diagnosis classes and the patient metadata are presented as Supplementary Information (Supplementary Tables 1 and 2, respectively). We also present the evaluation of the multi-task prediction model on the MClass-D dataset18, a collection of 100 dermoscopic images from the ISIC Archive with the corresponding diagnosis labels and the clinical management decisions of the 157 dermatologists surveyed. The dermatologists came from 12 university hospitals in Germany and 43.9% of them were board-certified. The melanomas in the dataset were histopathology-verified and the nevi were diagnosed as benign either by expert consensus or by a biopsy.

The prediction models

In this section, we describe three management prediction models, a detailed breakdown of which is presented in Fig. 6. In order to train prediction models that leverage both the clinical and the dermoscopic images as well as the patient metadata available in the dataset, we use a multi-modal framework28 and train two models: the first to predict the diagnosis and the second to predict the management decision. For both of these models, we adopt an InceptionV3 backbone41 pretrained on the ImageNet dataset42 as the feature extraction model and drop the final output layer. We combine the extracted features from both the clinical and the dermoscopic images and compute the global average pooled responses, to which we then concatenate the patient metadata as a one-hot encoded vector. Next, we add a \(1 \times 1\) convolutional layer for the prediction task (either the diagnosis or the management) as the final classification layer with the associated loss. Both models are trained with the categorical cross-entropy loss, denoted by \(L_{\rm{DIAG}}\) and \(L_{\rm{MGMT}}\) for the diagnosis and the management prediction models, respectively. Since there is an inherent class imbalance in the dataset, we adopt a mini-batch sampling and weighting approach28. The loss function used to train these two single prediction task models is as follows:

$$\begin{aligned} L_{\langle \rm{task}\rangle } \equiv L\left( (x_c, x_d, x_m), y_{\langle \rm{task}\rangle } | \Theta \right) = - \frac{1}{|b|} \sum _{i=1}^{|b|} \sum _{j=1}^{n_{\langle \rm{task}\rangle }} w_j \cdot y_{\langle \rm{task}\rangle , j}^{(i)} \cdot \log \left( \phi \left( x^{(i)} | \Theta \right) _{j}\right) , \end{aligned}$$
(1)

where \(x_c, x_d, x_m\) denote the clinical image, the dermoscopic image, and the patient metadata, respectively, |b| denotes the size of the mini-batch, ‘task’ denotes either the diagnosis or the management prediction task, and \(y_{\langle \rm{task}\rangle }\) and \(n_{\langle \rm{task}\rangle }\) denote the target variable and the number of classes for the corresponding task, respectively. \(w_j\) denotes the weight assigned to the \(j^{\rm{th}}\) class (calculated similarly to Kawahara et al.28), and \(\phi \left( x^{(i)} | \Theta \right) _{j}\) denotes the predicted probability for the \(j^{\rm{th}}\) class given an input \(x^{(i)}\) by the model parameterized by \(\Theta\).
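
A minimal sketch of this single-task architecture and the loss in Equation (1) in tf.keras is given below. Whether the backbone is shared between the two image modalities, the input sizes, the metadata vector length, and all variable names are our assumptions, and details may differ from the authors' implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

N_META, N_CLASSES = 20, 3  # metadata length and class count (assumed values)

def weighted_cce(class_weights):
    """Weighted categorical cross-entropy, mirroring Equation (1)."""
    w = tf.constant(class_weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        return -tf.reduce_mean(tf.reduce_sum(w * y_true * tf.math.log(y_pred), axis=-1))
    return loss

def build_single_task_model():
    x_c = layers.Input((299, 299, 3), name="clinical")
    x_d = layers.Input((299, 299, 3), name="dermoscopic")
    x_m = layers.Input((N_META,), name="metadata")  # one-hot patient metadata

    backbone = InceptionV3(include_top=False, weights="imagenet")
    feats = layers.Concatenate()([
        layers.GlobalAveragePooling2D()(backbone(x_c)),
        layers.GlobalAveragePooling2D()(backbone(x_d)),
        x_m,
    ])
    # 1x1 convolutional classification layer (equivalent to a dense softmax
    # once the pooled feature vector is reshaped to a 1x1 spatial map)
    h = layers.Reshape((1, 1, -1))(feats)
    h = layers.Conv2D(N_CLASSES, 1, activation="softmax")(h)
    out = layers.Reshape((N_CLASSES,), name="mgmt")(h)
    return Model([x_c, x_d, x_m], out)

model = build_single_task_model()
model.compile(optimizer=tf.keras.optimizers.SGD(1e-3, momentum=0.9),
              loss=weighted_cce([1.0, 1.0, 1.0]))  # w_j estimated from training data
```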

Figure 6

A breakdown of the inputs, outputs, loss functions, and architecture of the three prediction models. Global average pooled feature responses from the clinical and the dermoscopic images are extracted and concatenated (denoted by the plus symbol) with one-hot encoded patient meta-data, and the three models are trained with \(L_{\rm{DIAG}}\), \(L_{\rm{MGMT}}\), and \(L_{\rm{multi}}\) respectively. The first model predicts the diagnosis labels (\(\rm{DIAG}_{\rm{pred}}\)) which are then used along with the management priors to obtain inferred management decisions (\(\rm{MGMT}_{\rm{infr}}\)), whereas the second model predicts the management decisions directly (\(\rm{MGMT}_{\rm{pred}}\)). Finally, the last model is a multi-task one and is trained to predict the seven-point criteria, the diagnosis, and the management (outputs enclosed in the dashed box).

It has been shown that models optimized to jointly predict related tasks perform better on the individual tasks than models trained on each individual task separately31,43. Therefore, we train a third model by extending the multi-modal multi-task framework28 to simultaneously predict the seven-point criteria, the diagnosis, and the management decision. The architecture remains the same as in the two models described above, except for the last layer, where we add a \(1 \times 1\) convolutional layer for each prediction task as the final classification layer with the multi-task loss. The multi-task loss, denoted by \(L_{\rm{multi}}\), accounts for all 9 prediction tasks, namely the seven-point criteria, the lesion diagnosis, and the lesion management, and is the sum of the prediction losses for each of the tasks. As with the previous two models, we adopt the same mini-batch sampling and weighting approach. The loss function used to train this multi-task prediction model is defined as:

$$\begin{aligned} L_{\rm{multi}} \equiv {L}\left( (x_c, x_d, x_m), y_{\rm{diag}}, y_{\rm{mgmt}}, z | \Theta \right)&= L_{\rm{DIAG}}\left( (x_c, x_d, x_m), y_{\rm{diag}} | \Theta \right) \nonumber \\&\quad + L_{\rm{MGMT}}\left( (x_c, x_d, x_m), y_{\rm{mgmt}} | \Theta \right) \nonumber \\&\quad + \sum _{k=1}^{7} L\left( (x_c, x_d, x_m), z_k | \Theta \right) , \end{aligned}$$
(2)

where \(L(\cdot )\) denotes the categorical cross-entropy loss (as described in Equation (1)) and \(\Theta\) denotes the parameters of the multi-task model. The target variables are \(y_{\rm{diag}}\), \(y_{\rm{mgmt}}\), and \(z \in \mathbb {Z}^7\), which denote, respectively, the diagnosis label, the management label, and the vector of integer scores for the seven-point criteria, with \(z_k\) denoting the score for the \(k^{\rm{th}}\) criterion.
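
Under the same assumptions as the previous sketch, the multi-task variant only changes the heads: compiling a multi-output Keras model with one categorical cross-entropy per head makes the total loss the sum over the nine tasks, as in Equation (2). The per-criterion class counts below are placeholders:

```python
from tensorflow.keras import layers, Model

def build_multi_task_model(inputs, feats):
    """Attach one 1x1-conv softmax head per task to the shared features."""
    def head(n_classes, name):
        h = layers.Reshape((1, 1, -1))(feats)
        h = layers.Conv2D(n_classes, 1, activation="softmax")(h)
        return layers.Reshape((n_classes,), name=name)(h)

    outputs = {"diag": head(5, "diag"), "mgmt": head(3, "mgmt")}
    for k in range(1, 8):  # seven-point criteria; 3 classes is a placeholder
        outputs[f"criterion_{k}"] = head(3, f"criterion_{k}")
    return Model(inputs, outputs)

# L_multi is then realized by Keras summing the nine per-head losses:
# model.compile(optimizer=...,
#               loss={name: "categorical_crossentropy" for name in model.output_names})
```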

Making management predictions

Interactive atlas of dermoscopy dataset

Since we ultimately seek the management decision for each patient, we evaluate all the models based on their management prediction performance. We examine two types of management decisions: predicting whether a lesion should be excised or not (our baseline) and predicting all the management decisions. The first model (Fig. 1a) is trained to predict the diagnosis, and so we infer the management decisions \(\rm{MGMT}_{\rm{infr}}\) from its diagnosis predictions (\(\rm{DIAG}_{\rm{pred}}\)) either by predicting the binary management decision \(\rm{MGMT}_{\rm{infr, binary}}\): EXC versus NOEXC (Fig. 1a1), or by predicting all management decisions \(\rm{MGMT}_{\rm{infr, all}}\), which for our dataset are EXC, CLNC, and NONE (Fig. 1a2). The second model is trained to predict the management decisions \(\rm{MGMT}_{\rm{pred}}\) directly, either binary \(\rm{MGMT}_{\rm{pred, binary}}\) (Fig. 1b1) or over all decisions \(\rm{MGMT}_{\rm{pred, all}}\) (Fig. 1b2). As for the third model, since it is trained to predict the diagnosis and the management along with the seven-point criteria (Fig. 1c), we follow the same approach as for the first two models to obtain management predictions. For all the prediction models, we also perform a three-fold cross validation to support the robustness of our results. The dataset was partitioned into three folds while ensuring that the class-wise proportions of the different categories (seven-point criteria, diagnosis labels, and management decisions) remain similar across the training, validation, and testing partitions28. Moreover, in order to study the contribution of the three input data modalities (clinical image, dermoscopic image, and patient metadata) to the final management prediction, we also carry out an input ablation study on the multi-task prediction model (i.e., the third model; Fig. 1c), where we train and evaluate six multi-task prediction models with different combinations of the three input modalities.

The binary management decisions, \(\rm{MGMT}_{\rm{infr, binary}}\) (Fig. 1a1) and \(\rm{MGMT}_{\rm{pred, binary}}\) (Fig. 1b1), are obtained using the binary mapping described in the Results and discussion section. Next, given that there are multiple ways to manage a disease category (e.g., in Table 1, NEV cases are managed using all three management labels), we adopt a data-driven approach (Fig. 1a2) to calculate the likelihood of each management decision given a diagnosis prediction. We use the distribution of the management decisions across the diagnosis classes in the training data to estimate the prior for assigning a management class \(m_i\) to a patient assigned the diagnosis class \(d_j\), denoted as \(p({\mathrm{MGMT}} = m_i | {\mathrm{DIAG}} = d_j)\). At inference time, given a patient’s data x, we estimate the probability of management \(m_i\) by marginalizing over all possible diagnosis classes:

$$\begin{aligned} P({\mathrm{MGMT}} = m_i | x) = \sum _{d_j} p({\mathrm{DIAG}} = d_j | x) \cdot \underbrace{p({\mathrm{MGMT}} = m_i | {\mathrm{DIAG}} = d_j)}_{\text{prior from dataset}}. \end{aligned}$$
(3)
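
A minimal sketch of this inference, assuming integer-encoded training labels and hypothetical variable names:

```python
import numpy as np

def management_priors(train_diag, train_mgmt, n_diag=5, n_mgmt=3):
    """Estimate p(MGMT = m_i | DIAG = d_j) from training-set co-occurrences."""
    counts = np.zeros((n_diag, n_mgmt))
    for d, m in zip(train_diag, train_mgmt):
        counts[d, m] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

def infer_management(diag_probs, priors):
    """Equation (3): marginalize the diagnosis out of the joint distribution."""
    return np.asarray(diag_probs) @ priors  # p(MGMT = m_i | x) for each m_i
```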

MClass-D dataset

The multi-task model used to evaluate the images from MClass-D predicts both the lesion diagnosis (\(\rm{DIAG}_{\rm{pred}}\)) and the clinical management (\(\rm{MGMT}_{\rm{pred}}\)). The management labels inferred (\(\rm{MGMT}_{\rm{infr}}\)) from the diagnosis predictions are obtained by the binary mapping described in the Predicting whether a lesion should be excised or not section. To recap, a lesion predicted to be malignant (MLGN) is mapped to the ‘excise’ (EXC) label and a lesion predicted to be benign (BNGN) is mapped to ‘do not excise’ (NOEXC), meaning that the inferred management decision (\(\rm{MGMT}_{\rm{infr}}\)) has a direct mapping from the predicted diagnosis (\(\rm{DIAG}_{\rm{pred}}\)).

Next, we look at the two different ways of obtaining the “ground truth” management labels. First, we aggregate the recommendations of the 157 dermatologists by majority voting to obtain a single decision for each image (\(\rm{MGMT}_{\rm{GT, agg}}\)), and use these as one type of target labels against which to compare the directly predicted management decisions (\(\rm{MGMT}_{\rm{pred}}\)) and the inferred management decisions (\(\rm{MGMT}_{\rm{infr}}\)). The second type of target labels is formed by generating the “true” clinical management labels through a direct mapping from the disease diagnosis to the clinical management. This is supported by the fact that, in an ideal world, we would want all malignancies (MLGN) to be excised (EXC) and all benign lesions (BNGN) not to be (NOEXC). As such, the “true” clinical management labels (\(\rm{MGMT}_{\rm{GT, true}}\)) are obtained by directly mapping the ground truth diagnosis classes to the corresponding management labels.
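
Both labelings reduce to a few lines; a sketch, assuming `votes` is a 157 × 100 binary array of the dermatologists' decisions (1 = EXC) and `diag_gt` the binary diagnosis ground truth (1 = MLGN):

```python
import numpy as np

def mgmt_gt_agg(votes):
    """Majority vote of the 157 dermatologists (odd count, so no ties)."""
    return (votes.mean(axis=0) > 0.5).astype(int)

def mgmt_gt_true(diag_gt):
    """Direct mapping: MLGN -> EXC, BNGN -> NOEXC."""
    return np.asarray(diag_gt).astype(int)
```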

Evaluation

To compare the performance of the baseline binary labeling approaches, we report the per-class sensitivity averaged over the two classes for the two sets of binary predictions, \(\rm{MGMT}_{\rm{infr, binary}}\) and \(\rm{MGMT}_{\rm{pred, binary}}\).

Next, for each of the two sets of management predictions (\(\rm{MGMT}_{\rm{pred, all}}\) and \(\rm{MGMT}_{\rm{infr, all}}\)), we obtain probabilistic predictions. To compare the performance of the two models, we choose to evaluate using two methods: (a) using the probabilistic management predictions, and (b) using the most likely label (i.e., choosing the single label with the highest predicted probability). While the evaluation for the latter is rather straightforward with accuracy values and confusion matrices, we formulate the following methodology for evaluating the quality of the probabilistic management predictions: given a set of predicted probability values (over management classes) and the corresponding target management labels, we report distance measures between the probabilistic predictions and the one-hot encoded representations of the target management labels.

Statistical analysis

The primary outcome measures are class-wise sensitivity, specificity, precision, AUROC, and overall accuracy for the diagnosis and the management prediction tasks.

To compare the probabilistic predictions for the management decision obtained using \(\rm{MGMT}_{\rm{infr, all}}\) and \(\rm{MGMT}_{\rm{pred, all}}\), we use four distance measures to compare the similarity of these probability vectors to the one-hot encoded target labels: cosine similarity, Jensen-Shannon divergence, Wasserstein distance, and Hellinger distance. Since a lower value is better for all these metrics except the cosine similarity, we instead use the (1 − cosine similarity) value for consistency across measures and visualize them using a swarm plot overlaid onto a box plot.
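
These four measures can be computed with standard SciPy routines. A sketch (not the authors' evaluation code), assuming `pred` is a predicted probability vector over the management classes and `target` its one-hot target:

```python
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon
from scipy.stats import wasserstein_distance

def distance_measures(pred, target):
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    support = np.arange(len(pred))  # class indices as the distribution support
    return {
        # scipy's `cosine` already returns (1 - cosine similarity)
        "one_minus_cosine": cosine(pred, target),
        # scipy's `jensenshannon` returns the JS *distance*; squaring it
        # gives the divergence
        "jensen_shannon": jensenshannon(pred, target) ** 2,
        "wasserstein": wasserstein_distance(support, support, pred, target),
        "hellinger": np.sqrt(0.5 * np.sum((np.sqrt(pred) - np.sqrt(target)) ** 2)),
    }

# e.g., a confident wrong prediction versus the one-hot target label EXC
print(distance_measures([0.7, 0.2, 0.1], [0.0, 1.0, 0.0]))
```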

We use the two-sided Wilcoxon signed-rank test44 to compare the two sets of distance measures for each of the four measures, since the differences between the two sets cannot be assumed to be normally distributed. We perform bootstrapping45 and sub-sampling46 1000 times with a sample size of N/2 (where N is the size of the test set), with the convergence criteria satisfied47. For all the distance measures, we report the means and the 95% confidence intervals along with Cohen’s d values48. Results are considered statistically significant at the \(p < 0.001\) level. For the ablation study, we use the mid-p McNemar’s test49,50,51 to compare the management prediction accuracies of the six models, where each model is trained with a different combination of input data modalities, and the results are considered statistically significant at the \(p < 0.05\) level.
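
The mid-p variant of McNemar's test follows directly from the exact binomial test on the discordant pairs. A sketch, assuming boolean per-case correctness vectors for the two models being compared:

```python
import numpy as np
from scipy.stats import binom

def mcnemar_midp(correct_a, correct_b):
    """Mid-p McNemar's test on the discordant pairs of two classifiers."""
    correct_a = np.asarray(correct_a, bool)
    correct_b = np.asarray(correct_b, bool)
    b = int(np.sum(correct_a & ~correct_b))  # A correct, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B correct
    n, k = b + c, min(b, c)
    # exact two-sided binomial p-value minus the point probability of the
    # observed table (the "mid-p" correction)
    return 2 * binom.cdf(k, n, 0.5) - binom.pmf(k, n, 0.5)
```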

For the evaluation on the MClass-D dataset, we use two inter-rater measures to assess the similarity of our model’s predictions with those of the 157 dermatologists: Cohen’s kappa52 and Fleiss’ kappa53. For Cohen’s kappa, we calculate the agreement between the model’s predictions and the labels obtained by aggregating the recommendations of all 157 dermatologists (\(\rm{MGMT}_{\rm{GT, agg}}\)), and compare it with the average agreement between any two dermatologists. To account for the variability among the predictions of multiple dermatologists, which might not be reflected in the aggregated recommendation, we also compare this with the agreement between one dermatologist and the aggregated recommendations of all the others, repeating this over all 157 dermatologists in a leave-one-out fashion and reporting the average agreement. Unlike Cohen’s kappa, Fleiss’ kappa can assess the agreement among more than two raters, and therefore we first calculate the agreement among all 157 dermatologists. To calculate the agreement of the model’s predictions with those of the dermatologists, we first calculate the Fleiss’ kappa for a set of 157 predictions obtained from 156 dermatologists and our model, and repeat this 157 times in a leave-one-out fashion, reporting the average agreement. However, this could lead to concerns that the agreement among the 156 dermatologists might dominate the kappa value, so we further carry out the same experiment with a set of 10 management decisions, obtained from the recommendations of 9 dermatologists sampled at random from the dataset and our model’s predictions. We repeat this 1000 times and report the average agreement.
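
A sketch of these agreement computations using scikit-learn and statsmodels, under the same assumed `votes` array as before (157 dermatologists × 100 images) and a `model_pred` vector of the model's binary decisions:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def cohen_vs_aggregate(model_pred, votes):
    gt_agg = (votes.mean(axis=0) > 0.5).astype(int)  # MGMT_GT,agg
    return cohen_kappa_score(model_pred, gt_agg)

def fleiss_leave_one_out(model_pred, votes):
    """Swap the model in for one dermatologist at a time and average kappa."""
    kappas = []
    for i in range(votes.shape[0]):
        raters = np.vstack([np.delete(votes, i, axis=0), model_pred])
        table, _ = aggregate_raters(raters.T)  # rows: images, columns: raters
        kappas.append(fleiss_kappa(table))
    return float(np.mean(kappas)), float(np.std(kappas))
```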

All statistical analyses were performed in Python using NumPy54, SciPy55, statsmodels56, PyCM57, and scikit-learn58 libraries, and all visualizations were created in Python using matplotlib59 and seaborn60 libraries.

Implementation details

The Keras framework61 was used to implement all the deep learning models. We follow a training paradigm similar to that of Kawahara et al.28. For all the models, the ImageNet-pretrained weights are frozen at the beginning and the models are fine-tuned with a learning rate of \(10^{-3}\) for 50 epochs, followed by iteratively ‘un-freezing’ one Inception block at a time (starting from the Inception block closest to the output all the way to the second Inception block) and fine-tuning for 25 epochs with a learning rate of \(10^{-3}\). We use real-time data augmentation with rotations, horizontal and vertical flipping, zooming, and height and width shifts for these initial 275 epochs. Lastly, we turn off data augmentation and fine-tune for 25 epochs. We use stochastic gradient descent with a weight decay of \(10^{-6}\) and a momentum of 0.9 to optimize the weights.
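
The schedule can be summarized in the following sketch. The block-grouping helper and the data-generator arguments are hypothetical, and the paper's weight decay of \(10^{-6}\) would additionally be configured on the optimizer; details may differ from the authors' code:

```python
import tensorflow as tf

def fine_tune(model, inception_blocks, aug_gen, plain_gen, val_data):
    """Progressive unfreezing: new layers first, then one Inception block at a time."""
    def compile_and_fit(epochs, data):
        model.compile(optimizer=tf.keras.optimizers.SGD(1e-3, momentum=0.9),
                      loss="categorical_crossentropy")
        model.fit(data, validation_data=val_data, epochs=epochs)

    for block in inception_blocks:  # freeze the entire pretrained backbone
        for layer in block:
            layer.trainable = False
    compile_and_fit(50, aug_gen)    # train the new layers on augmented data

    # unfreeze output-side blocks one at a time, down to the second block
    for block in reversed(inception_blocks[1:]):
        for layer in block:
            layer.trainable = True
        compile_and_fit(25, aug_gen)

    compile_and_fit(25, plain_gen)  # final 25 epochs without augmentation
```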