Prospective validation of dermoscopy-based open-source artificial intelligence for melanoma diagnosis (PROVE-AI study)

The use of artificial intelligence (AI) has the potential to improve the assessment of lesions suspicious for melanoma, but few clinical studies have been conducted. We validated the accuracy of an open-source, non-commercial AI algorithm for melanoma diagnosis and assessed its potential impact on dermatologist decision-making. We conducted a prospective, observational clinical study to assess the diagnostic accuracy of the AI algorithm (ADAE) in predicting melanoma from dermoscopy skin lesion images. The primary aim was to assess the reliability of ADAE’s sensitivity at a predefined threshold of 95%. Patients who had consented to a skin biopsy to exclude melanoma were eligible. Dermatologists also estimated the probability of melanoma and indicated management choices before and after real-time exposure to ADAE scores. All lesions underwent biopsy. Four hundred thirty-five participants were enrolled and contributed 603 lesions (95 melanomas). Participants had a mean age of 59 years, 54% were female, and 96% were White individuals. At the predetermined 95% sensitivity threshold, ADAE had a sensitivity of 96.8% (95% CI: 91.1–98.9%) and specificity of 37.4% (95% CI: 33.3–41.7%). The dermatologists’ ability to assess melanoma risk significantly improved after ADAE exposure (AUC 0.7798 vs. 0.8161, p = 0.042). Post-ADAE dermatologist decisions also had equivalent or higher net benefit compared with biopsying all lesions. We validated the accuracy of an open-source melanoma AI algorithm and showed its theoretical potential for improving dermatology experts’ ability to evaluate lesions suspicious for melanoma. Larger randomized trials are needed to fully evaluate the potential of adopting this AI algorithm into clinical workflows.

Metadata: Four (4) of the EfficientNet models also expect the following metadata fields: (1) age (which for the training and original test data of the challenge was binned into 5-year brackets to prevent PHI leakage), (2) sex ('female', 'male', or 'unknown'), (3) broad anatomic location ('head/neck', 'oral/genital', 'upper extremity', 'palms/soles', 'torso', 'anterior torso', 'posterior torso', 'lateral torso', 'lower extremity', 'unknown'), (4) the number of images per patient, derived from patient identifiers, and (5) the size of the image.
Internal prediction mechanism: The fully connected layer (head) at the end of the model processing chain produces unbounded values (between -Infinity and +Infinity). To normalize these scores, the softmax function (see https://en.wikipedia.org/wiki/Softmax_function) is applied across the output classes, yielding a set of values between 0 and 1 that sum to 1 for each given image. Seventeen of the 18 models internally predict a class label from among nine (9) target classes (AK = actinic keratosis, BCC = basal cell carcinoma, BKL = benign keratosis, DF = dermatofibroma, MEL = melanoma, NV = nevus, SCC = squamous cell carcinoma, VASC = vascular lesion, UNK = unknown diagnosis); the remaining model internally predicts a class label from among four (4) target classes (BKL, MEL, NV, and UNK). For the purpose of the challenge, the submitting team chose to simply extract the (post-softmax) value for the MEL (melanoma) output class.
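For illustration, the following is a minimal sketch of this extraction step, assuming outputs from one of the nine-class models; the class ordering, function name, and use of PyTorch mirror the description above but are our own assumptions, not the released code.

```python
import torch

# Assumed ordering of the nine target classes; the released
# checkpoints may index these differently.
CLASSES = ["AK", "BCC", "BKL", "DF", "MEL", "NV", "SCC", "VASC", "UNK"]
MEL_IDX = CLASSES.index("MEL")

def melanoma_score(logits: torch.Tensor) -> torch.Tensor:
    # logits: (n_images, n_classes) unbounded head outputs.
    # Softmax maps each row to values in (0, 1) summing to 1;
    # the MEL column is the per-image melanoma score.
    probs = torch.softmax(logits, dim=1)
    return probs[:, MEL_IDX]
```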
Combining and cross-model ensembling of scores: Original submission: the post-softmax values for the 4 (rotations) × 2 (flipping states) test-time-augmented copies of each image passed through a given model are first averaged. This score is computed in batches of images until scores for the full set of images have been computed. Next, the 5 folds of a given model are averaged [in the original submission uploaded to Kaggle, due to a coding error, only the 5th/last fold of each model was used in the ensemble; although the algorithm still won the competition, the AUC reported on the Kaggle competition site improves significantly when all 90 folds are used]. For the cross-model ensembling, the model authors chose to rank-transform the 18 lists of scores (for, say, the training set), which yields a list of equidistant scores between 0 and 1 for all images. The final score of any given image (in a set of images) is then computed by averaging these 18 rank-transformed scores into a value between 0 and 1.
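A minimal sketch of this rank-based cross-model ensembling, assuming the 18 per-model, fold-averaged score lists are already computed; the exact rank normalization used here (rank/n) is our assumption.

```python
import numpy as np
from scipy.stats import rankdata

def rank_ensemble(model_scores: np.ndarray) -> np.ndarray:
    # model_scores: shape (18, n_images), each row a model's
    # fold-averaged MEL scores for the same image set.
    # Each row is replaced by its normalized ranks -- equidistant
    # values in (0, 1] -- and the rows are then averaged into the
    # final per-image score.
    ranks = np.vstack([rankdata(row) / row.size for row in model_scores])
    return ranks.mean(axis=0)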
Present study: As the need arose to score individual images, we found that averaging the raw (post-softmax) scores across the 90 folds yielded suboptimal results. The original submission authors chose to rank-transform the data to reduce noise across the spectrum of values. Given the high class imbalance in both the training and test data (approximately 3% of images had a melanoma (positive) class label), and the fact that the challenge was scored on AUC (rather than on accuracy at a 50% a priori threshold), the models generally produce low absolute values, even for positive cases. For instance, across the entire dataset, the raw-score averages yielded a threshold of 0.005356 for 95% sensitivity, corresponding to an average rank/percentile of 0.658. This creates an artifact whereby a single large value from one model fold (0.5 or greater) is enough to push the average above the 95% sensitivity threshold: a lone fold score of 0.5 contributes 0.5/90 ≈ 0.0056 to the mean, which already exceeds it. To avoid this, the authors of the present study decided to apply a log-transform to the 90 individual fold scores prior to averaging. Over the 2020 challenge data, the raw and log-transformed aggregations attained AUCs of 0.9492 and 0.9502, respectively, although these were not significantly distinguishable (p = 0.4177; DeLong's test).
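A sketch of this log-transformed aggregation under our reading of the above (average in log space, then map back); the epsilon guard against log(0) is our addition.

```python
import numpy as np

def log_average(fold_scores: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # fold_scores: shape (90, n_images), post-softmax MEL scores.
    # An arithmetic mean lets one fold score of ~0.5 dominate when the
    # operating threshold is ~0.005; averaging the logs (a geometric-
    # mean-style aggregate) damps such single-fold outliers.
    return np.exp(np.log(fold_scores + eps).mean(axis=0))
```

Because the final exponentiation is monotone, thresholding and AUC computed on this aggregate are unaffected by whether the scores are mapped back out of log space.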
We compared the AUC of the full 18-model ADAE with that of the 14 models that did not incorporate clinical metadata and found no significant difference: the 18-model ensemble achieved an AUC of 0.857, and the 14-model ensemble without metadata achieved 0.860 (p = 0.411; DeLong's test for two correlated ROC curves).
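DeLong's test has no implementation in the standard Python scientific stack, so a comparison of this kind can be approximated with a paired bootstrap over the same lesions; the sketch below is that substitute technique, not the test used above, and all names in it are our own.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_p(y, scores_a, scores_b, n_boot=2000, seed=0):
    # Approximate two-sided p-value for the AUC difference of two
    # correlated ROC curves (same labels y, two score vectors),
    # via paired resampling of lesions.
    y, scores_a, scores_b = map(np.asarray, (y, scores_a, scores_b))
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if np.unique(y[idx]).size < 2:  # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx])
                     - roc_auc_score(y[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    # Double the smaller tail of the bootstrap difference distribution.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return min(p, 1.0)
```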
Potential effect of image choice: Most images were captured using the Canfield Scientific (Parsippany, NJ) Veos DS3™ system, but dermatologists used what was the standard practice in their clinic, which could also include the Canfield Veos SLR. In most cases a contact polarized dermoscopy image was chosen as the clinically most representative image (88%, after excluding 79 unknowns); Derm1 = 95%, Derm2 = 61%, Derm3 = 94%, Derm4 = 83%, Derm5 = 20%, Derm6-11 = 91%. No difference was identified in the AUC of ADAE between the image uploaded to the study web-app and a randomly selected dermoscopy image from the remaining unselected images (0.857 vs. 0.861, p = 0.740). Additionally, no differences in ADAE AUC were found among contact polarized, contact non-polarized, and non-contact polarized images on the subset of lesions (n = 526) with all 3 image types available (Figures S1-S2).