Introduction

Despite recent therapeutic advances in the treatment of advanced-stage melanomas, the number of melanoma deaths in the United States is estimated to exceed 90,000 over the next decade1. Emerging evidence suggests that the majority of melanoma mortality occurs in patients with a recurrence of the disease that was early-stage (Stage I or Stage II) at the time of diagnosis2. However, the recurrence of melanoma is typically not detected until the onset of symptomatic metastatic progression. Thus, there is a critical need for accurate prognostic tools to identify high-risk patients who would benefit from enhanced surveillance3. Additionally, identifying high-risk patients can support clinicians in determining who should receive adjuvant immunotherapy, which was recently approved for early-stage melanomas4. This decision is particularly important as immunotherapy has been associated with a high rate of morbid and potentially fatal immune-related adverse events (irAEs) occurring in up to 40% of treated patients5,6,7. Accordingly, balancing potential benefits with these risks requires a reliable understanding of which patients are most likely to experience disease recurrence and thereby would be most likely to benefit from early systemic therapy. Overall, early detection of melanoma recurrence can help minimize the number of patients exposed to treatment toxicities, prevent metastatic progression, and improve patient survival.

Research to date on early-stage melanoma recurrence has been limited by the sample size of individual studies8,9, duration of follow-up10,11, or access to well-phenotyped cohorts with detailed patient and tumor characteristics11,12. As a result, it has been difficult to reliably determine disease recurrence status and assess downstream outcomes in this population. In the present study, we leverage detailed multi-institutional registries across the Mass General Brigham healthcare system (MGB) and the Dana-Farber Cancer Institute (DFCI) to develop risk prediction models for early-stage melanoma recurrence. As the largest providers of dermatology services in the state of Massachusetts, these hospital systems offer access to detailed clinical features and tumor characteristics from a large population of patients who were diagnosed with melanoma both within and outside of the academic setting.

We build on prior literature on this topic by incorporating a wide array of clinical and histopathologic variables in our risk prediction assessment10,12,13,14,15,16,17,18,19. With the intent to stratify melanoma by survival outcomes, the American Joint Committee on Cancer (AJCC) melanoma staging system was established based on analyses of a large international melanoma database, which determined that Breslow thickness, ulceration, and historically mitotic rate were the most important prognostic factors in patients with localized melanoma16,17. Patient age at diagnosis and anatomic site of the primary melanoma were also shown to be significantly associated with recurrence of primary melanoma in separate investigations15,18,20. Furthermore, a recent study suggested that the presence of lymphovascular invasion and the baseline autoimmune disease were also independently associated with melanoma recurrence10. However, there is insufficient information on the cumulative predictive capacity of these features in prognosticating melanoma recurrence. To address this knowledge gap, we propose a comprehensive risk stratification approach using machine-learning modeling of a wide range of clinical and tumor synoptic features from EHRs. Similar approaches have been previously applied with success in other cancer settings21, achieving the area under the receiver operating characteristic curve (AUC) of 0.79–0.8022,23. To the best of our knowledge, our study presents a comprehensive analysis of melanoma recurrence combining state-of-the-art machine-learning techniques and granular EHR data. The primary goal of this study is to build machine-learning models for reliable prediction of early-stage melanoma recurrence and identification of significant independent risk factors using the extracted clinical and histopathologic features. Specifically, we performed two types of prediction: (1) recurrence vs. non-recurrence classification; (2) time-to-event recurrence risk prediction.

Results

Patient characteristics

We identified 1720 stage I/II cutaneous melanomas, with 1172 (68%) melanomas from MGB and 548 (32%) melanomas from DFCI (Fig. 1), that were diagnosed between 2000 and 2020. The non-recurrent melanomas were categorized into two groups: one group with a minimum of 5-year follow-up duration (to minimize the risk of false non-recurrences); another group 3:1 best matched to the recurrent melanomas in terms of follow-up duration. The first group was compared to recurrent melanomas for the binary recurrence classification tasks. The second group was compared to recurrent melanomas for the time-to-event recurrence prediction tasks. Comprehensive clinical and histopathologic features were extracted from EHRs (Supplementary Table 1).

Fig. 1: MGB and DFCI primary melanoma registry design and cohort definitions.
figure 1

Non-recurrent melanomas were categorized into two groups: one group with a minimum of 5-year follow-up duration; another group 3:1 best matched to the recurrent melanomas in terms of follow-up duration. The first group was compared to recurrent melanomas in the binary recurrence classification tasks. The second group was compared to recurrent melanomas in the time-toevent recurrence prediction tasks.

The basic patient characteristics of the entire study population are described in Table 1. All characteristics are detailed in Supplementary Table 3. The median follow-up for the entire cohort was 7.2 (IQR: 3.6–11.6) years. Overall, 310 out of 1720 (18%) melanomas recurred, among which 151 (48.7%) were distant recurrences. The median time from diagnosis to recurrence was 1.9 (IQR: 0.9–3.9) years. There was a small difference in the distribution of the year of diagnosis between the recurrent group and the non-recurrent group (p-value: 0.02). However, the mean and median years of diagnosis were identical for both groups (2010). The recurrent group had a higher mortality rate (49% vs. 22%, p-value <0.001) and was older at the time of diagnosis (65 vs. 60 years old, p-value <0.001) when compared to the non-recurrent group. The percentage of males in the recurrent group was higher (64% vs. 55%, p-value: 0.004) than in the non-recurrent group. Most of the population self-identified as Caucasian in both groups (recurrence: 98%; non-recurrence: 99%). Among the 310 recurrent melanomas, the respective number of cases that recurred within 5 years and 7 years after the diagnosis of the primary melanoma is 255 (82%) and 285 (92%) (Fig. 2). The majority of the entire cohort (74% for recurrences and 95% for non-recurrences) was with negative or not indicated regional lymph node histology. There were more patients in the recurrence group who didn’t have the sentinel lymph node biopsy performed due to age or comorbidity (19%) than in the non-recurrence group (2%).

Table 1 Characteristics of the study population.
Fig. 2: Distribution of recurrent melanomas in the study population.
figure 2

Among the 310 recurrent melanomas, 255 (82%) recurred within five years and 285 (92%) recurred within 7 years of primary melanoma diagnosis.

The patient characteristics of the MGB and DFCI cohorts for the binary recurrence classification tasks are shown in Supplementary Table 4. In the MGB cohort, the median follow-up was 11.4 (IQR: 7.2–15.1) years, and 215 of 881 (24%) melanomas recurred, among which 98 (46%) cases were distant recurrences. The median time from diagnosis to recurrence was 1.9 (IQR: 0.9–4.2) years. In the DFCI cohort, the median follow-up was 7.0 (IQR: 6.1–8.1) years, and 95 of 435 (22%) melanomas recurred, among which 53 (56%) cases were distant recurrences. The median time from diagnosis to recurrence was 1.8 (IQR: 1.1–3.5) years.

The patient characteristics of the MGB and DFCI cohorts for time-to-event recurrence prediction tasks are shown in Supplementary Table 5. The respective median follow-up duration for the MGB and DFCI cohorts was 5.9 (IQR: 1.9–11.3) years and 5.8 (IQR: 3.4–7.8) years. After the non-recurrent melanomas were identified by 3:1 best matching to the recurrent melanomas in terms of follow-up duration, there were no significant differences in follow-up duration and year of diagnosis between the recurrences and non-recurrences in each cohort.

Melanoma recurrence classification

Melanomas were labeled as having a recurrence or not, and five machine-learning algorithms were applied to classify recurrent and non-recurrent melanomas. Recurrence prediction model performances with successive additions of patient demographics, medical history, and tumor characteristics are shown in Fig. 3a. The area under the receiver operating characteristic curve (AUC), positive predictive value (PPV), sensitivity, specificity, and accuracy of models that used all extracted features are detailed in Supplementary Table 6.

Fig. 3: Model performances in the binary recurrence classification using patient demographics, medical history, and tumor characteristics.
figure 3

a AUC and PPV in internal validation (first row) and external validation (second row) when experimenting on the original cohorts (MGB: 215 recurrences vs. 666 non-recurrences; DFCI: 95 recurrences vs. 340 non-recurrences). b AUC and PPV in internal validation (first row) and external validation (second row) when experimenting on the core complete cohorts: negative or not indicated regional lymph node histology, known mitotic rate, and known ulceration (MGB: 142 recurrences vs. 410 non-recurrences; DFCI: 74 recurrences vs. 317 non-recurrences). The best model performance among SVM, RF, GB, MLP, and LR models is specified below each plot.

When using only demographic features, all models failed to yield an acceptable prediction performance (external AUC values <0.65). After adding medical history features, the AUC values were significantly improved in both internal and external validation for GB models (p-value: <0.01), while the PPV was not significantly improved (p-value: >0.05). After integrating all extracted tumor characteristics, both AUC and PPV were significantly improved for all models (p-value: <0.001), indicating that tumor characteristics were the most dominant features for the melanoma recurrence classification in our study.

When using all extracted features, GB models achieved the highest AUC in both internal (0.845; 95% CI: 0.840–0.850) and external (0.812; 95% CI: 0.804–0.819) validations. Additionally, GB models achieved the highest performance in terms of PPV (internal: 0.803; 95% CI: 0.797–0.810; external: 0.785; 95% CI: 0.774–0.795), and accuracy (internal: 0.772; 95% CI: 0.768–0.777; external: 0.741; 95% CI: 0.735–0.746) (Supplementary Table 6). The AUC and PPV of the GB model in the external validation are significantly different from other models (p-value < 0.001). SVM models did not achieve competitive performance in both internal and external validations.

In addition to experiments on the original MGB and DFCI cohorts (Supplementary Table 4), we also conducted sensitivity analyses using subgroups with no missing values for core tumor features (histologic type, tumor site, AJCC stage, Breslow thickness, anatomic level, laterality, mitotic rate, and ulceration), and only negative sentinel lymph nodes or not indicated sentinel lymph node biopsy (labeled “core complete cohorts”). There were 142 recurrences vs. 410 non-recurrences in the MGB cohort, and 74 recurrences vs. 317 non-recurrences in the DFCI cohort. The results are presented in Fig. 3b. Compared to the original cohorts, the GB model achieved consistent performance in the external validation (AUC: 0.812 vs. 0.809, p-value: 0.577, PPV: 0.785 vs. 0.772, p-value: 0.055).

The receiver operating characteristic (ROC) curves using GB models are shown in Fig. 4. Each experiment was repeated 50 times. All 50 ROC curves, the mean ROC curve, and the standard deviation are presented. The curves in Fig. 4a, b demonstrated similar shapes. The ROC curves for RF, LR, MLP, and SVM are presented in Supplementary Fig. 2.

Fig. 4: ROC curves with 50 repeats for the binary recurrence classification by GB models.
figure 4

a ROC curves in internal and external validations with the original cohorts. b ROC curves in internal and external validations with the core complete cohorts: negative or not indicated regional lymph node histology, known mitotic rate, and known ulceration. Consistent results between the original and core complete cohorts are achieved.

Sensitivity analyses for recurrence classification

In the original MGB and DFCI cohorts (Supplementary Table 4), a 5-year minimum follow-up duration for non-recurrent melanomas was applied to reduce the likelihood of false non-recurrence. We further performed recurrence vs. non-recurrence classification experiments using GB and RF models when the minimum follow-up duration for non-recurrent melanomas was 7 years (labeled “seven years”) and when all tumor features and median income were without missing values, and regional lymph node histology was negative or not indicated (labeled “complete cohorts”). The sample sizes and results are summarized in Table 2.

Table 2 Sensitivity analyses for recurrence classification by GB and RF models.

In the complete cohorts, the number of recurrences in the MGB cohort was 116 and the number of recurrences in the DFCI cohort was 45. The prediction performance (internal AUC: 0.805, PPV: 0.785; external AUC: 0.752, PPV: 0.737) deteriorated significantly (p-values: <0.001) compared to the performance of the original analysis (internal AUC: 0.845, PPV: 0.803; external AUC: 0.812, PPV: 0.785). When the minimum follow-up duration for non-recurrent melanomas was 7 years, the GB model achieved slightly better results in the external validation (AUC: 0.827, PPV: 0.793, p-value: 0.001 for AUC; p-value: 0.175 for PPV) compared to the performance of the original analysis (AUC: 0.812, PPV: 0.785).

Feature importance in recurrence classification

We further examined the primary features used in the recurrence vs. non-recurrence classification by conducting permutation importance24 with each classifier. All features after one-hot encoding are described in Supplementary Table 2. The experiments were performed with 50 repeats and AUC used for scoring. Figure 5 and Supplementary Fig. 3 demonstrated the 20 most predictive features in the recurrence classification tasks with GB, RF (Fig. 5), MLP, LR, and SVM (Supplementary Fig. 3) models. Figure 5 showed that in the experiments on all three types of cohorts (original, core complete, and complete) by GB and RF models (best models), the two most important features were consistently Breslow thickness and mitotic rate. Other predictive features included insurance type, age at diagnosis, median income, AJCC stage, tumor site, total surgical margin, radial growth, and anatomic level. In the MLP, LR, and SVM models (Supplementary Fig. 3), the most important features included insurance type, mitotic rate, AJCC stage, anatomic level, tumor site, and sex. In these models, Breslow thickness was not among the top 10 important features.

Fig. 5: Feature importance in recurrence classification by RF and GB models.
figure 5

a The 20 most important features when experimenting on the original cohorts. b The 20 most important features when experimenting on the core complete cohorts: negative or not indicated regional lymph node histology, known mitotic rate, and known ulceration. c The 20 most important features when experimenting on the complete cohorts: negative or indicated regional lymph node histology, known median income, and all tumor features available. The box extends from the first quartile to the third quartile of the feature importance values for each feature, with a line at the median. The whiskers extend from the box by 1.5x the interquartile range. Flier points are those past the end of the whiskers.

Given the collinearity of anatomic level, Breslow thickness, AJCC stage, and ulceration, we also conducted the following experiments by (1) keeping AJCC stage (removing anatomic level, Breslow thickness, and ulceration) to examine the importance of AJCC stage; (2) by keeping Breslow thickness and ulceration (removing anatomic level and AJCC stage) to see the importance of ulceration. In these two experiments, all other unmentioned features were kept in the models. Results are presented in Supplementary Table 7 and Supplementary Fig. 4. Supplementary Table 7 showed that overall consistent performances were achieved. Supplementary Fig. 4 demonstrated that when anatomic level, Breslow thickness, and ulceration were removed, AJCC stage became one of the top two features in the GB model and became more important in the RF model. When anatomic level and AJCC stage were removed, ulceration became more important in the GB model compared to the results when all extracted features were used. In these experiments, mitotic rate remained as one of the two most important features.

Time-to-event melanoma recurrence risk prediction

We built time-to-event models for melanoma recurrence risk prediction using four well-known machine-learning algorithms. Along with the original cohorts, we also conducted experiments on the “core complete” and the “complete” cohorts. The sample sizes and results are presented in Table 3 (GB-T and RF-T) and Supplementary Table 8 (Coxnet and CoxPH).

Table 3 Time-to-event melanoma recurrence prediction by GB-T and RF-T models.

For the original cohorts, GB-T and RF-T models achieved better performance than Coxnet and CoxPH (p-value: <0.001). In both internal and external validations, the GB-T model achieved better time-dependent AUC (internal 0.853 vs. 0.845, p-value: 0.013; external 0.820 vs. 0.810, p-value: <0.001) and concordance index (internal 0.820 vs. 0.813, p-value: 0.007; external: 0.809 vs. 0.807, p-value: 0.045) than the RF-T model.

When experimenting on the core complete cohorts, the prediction performance of the GB-T model deteriorated significantly in the internal validation (time-dependent AUC: 0.853 vs. 0.829; concordance index: 0.820 vs. 0.788; p-values: <0.001) and decreased slightly in the external validation (time-dependent AUC: 0.820 vs. 0.804, p-value: <0.001; concordance index: 0.820 vs. 0.808; p-value: 686) compared to the performance of the original cohorts. In the complete cohorts, the number of recurrences was 116 in the MGB cohort and 45 in the DFCI cohort. The time-dependent AUC and concordance index of the GB-T model were further reduced in both internal (time-dependent AUC: 0.808, concordance index: 0.768) and external (time-dependent AUC: 0.673, concordance index: 0.752) validations. In the external validation, the best time-dependent AUC (0.692) achieved by the Coxnet and CoxPH models was significantly worse than the ones on the original cohorts (p-value: <0.001).

Recurrence probabilities for seven randomly selected melanomas from the original DFCI cohort predicted by the four time-to-event models trained by the original MGB cohort are presented in Fig. 6. Among the seven examples, ID3, ID0, and ID6 recurred with an increased duration from diagnosis of the primary melanoma to recurrence. The RF-T and GB-T models predicted them to have the highest recurrence probability proportional to their duration from diagnosis to recurrence.

Fig. 6: Recurrence probabilities for seven randomly selected melanomas in the external cohort (DFCI) predicted by the four time-to-event models trained by the MGB cohort.
figure 6

ID3 (red), ID0 (blue), and ID6 (pink) recurred with an increased duration from diagnosis of the primary melanoma to recurrence. The RF-T and GB-T models predicted them to have the highest recurrence probability proportional to their duration from diagnosis to recurrence. In the comparison of ID0 (recurrence) and ID5 (non-recurrence) where they had similar duration from diagnosis to recurrence or to last follow-up (ID0: 2.22, ID5: 2.34), all four models predicted ID0 to have a higher recurrence probability.

Feature importance in time-to-event recurrence risk prediction

Feature importance in time-to-event prediction is presented in Fig. 7 and Supplementary 5. Results on the three types of cohorts were consistent, where Breslow thickness and mitotic rate were the two most important features. Additional important features included insurance type, age at diagnosis, median income, and absent radial growth phase. When experimenting with the case where anatomic level, Breslow thickness, and ulceration were removed and with the case where anatomic level and AJCC stage were removed (Supplementary Table 9 and Supplementary Fig. 6), overall consistent performances were achieved. In the first case, AJCC stage became one of the top three features by the GB-T and RF-T models. In the second case, ulceration became more important in both GB-T and RF-T models compared with the results when all extracted features were used.

Fig. 7: Feature importance in time-to-event recurrence prediction by RF-T and GB-T models.
figure 7

a The 20 most important features when experimenting on the original cohorts. b The 20 most important features when experimenting on the core complete cohorts: negative or not indicated regional lymph node histology, known mitotic rate, and known ulceration. c The 20 most important features when experimenting on the complete cohorts: negative or not indicated regional lymph node histology, known median income, and all tumor features available. The box extends from the first quartile to the third quartile of the feature importance values for each feature, with a line at the median. The whiskers extend from the box by 1.5x the interquartile range. Flier points are those past the end of the whiskers.

Sample size verification

We performed sample size calculations to ensure that our sample size was sufficient to evaluate the capacity of the machine-learning models. Using the observed true-positive rate of 0.79, the observed false-positive rate of 0.21, and a resulting standard difference of 1.2 in the internal validation, a sample size of 24 was required to achieve a power of 0.8 and a type I error rate of 0.05. As we had 215 recurrent melanomas in the MGB cohort and non-recurrent melanomas were randomly sampled to match the number of recurrent cases, the total number of samples was 430. With fivefold cross-validation, there were 86 testing samples in each experiment, which was sufficient for internal validation. Using the observed true-positive rate of 0.78, the observed false-positive rate of 0.22, and a resulting standard difference of 1.1 in the external validation, a sample size of 26 was required to achieve a power of 0.8 and a type I error rate of 0.05. As we had 95 recurrent melanomas in the DFCI cohort and non-recurrent melanomas were randomly sampled to match the number of recurrent cases, the total number of samples was 190. Thus, our models were adequately powered (>80%) to make predictions.

Discussion

The increasing number of deaths associated with early-stage melanoma recurrence and the recently expanded indications for adjuvant immunotherapy in the management of early-stage melanomas highlight the need for prognostic tools to support clinicians in counseling patients and determining the optimal surveillance and risk stratification strategy in this population. We have performed the largest and most comprehensive study to date assessing the ability of machine-learning algorithms to predict early-stage melanoma recurrence using clinicopathologic features extracted from EHR data in a real-world clinical setting.

In the melanoma recurrence vs. non-recurrence classification, we were able to delineate melanomas with recurrence from non-recurrence. The GB models achieved the best performance in both internal (AUC: 0.845; PPV: 0.803) and external (AUC: 0.812; PPV:0.785) validations. These results demonstrate that, as previously shown with other primary malignancies25, we can reliably predict early-stage melanoma recurrence. A literature survey25 published recently reviewed the use of machine-learning techniques in predicting various outcomes related to cancer diagnosis and prognosis, including susceptibility, recurrence, and survival. Studies investigating cancer recurrence generally reported AUC in the range from 0.726 to 0.84625. The AUC values of these models have been shown to depend primarily on how much information is incorporated into the model, what type of information is used, and what specific outcomes are investigated. Our AUC values are consistent with the recent studies that investigated the recurrence of prostate, non-small cell lung, colorectal, and biliary cancers, with reported AUC ranging from 0.581 to 0.89426,27,28,29,30,31. The incorporation of more potentially predictive features tends to improve the performance of models. One study predicting oral cancer recurrence was able to achieve a reported accuracy of 0.917 after incorporating genomics-related data into the model25,32. Models based primarily on image analysis of digital pathology often reported high AUC values in the range from 0.861 to 0.99333,34,35. Based on the existing literature, it is reasonable to expect that by incorporating additional data, such as genomics and digital histopathology, our model performance would be significantly improved.

In the time-to-event recurrence prediction, the GB-T models achieved the best performance (internal time-dependent AUC: 0.853, concordance index: 0.820; external time-dependent AUC: 0.820, concordance index: 0.809), demonstrating that our models can achieve reliable performance in the time-to-event analysis. We expect that incorporating additional samples and relevant features, such as the histopathologic images and genomic data, will further improve the model performance.

In the recurrence group, 18% (55/310) of early-stage melanomas recurred after 5 years following the primary melanoma diagnosis, and among this population, 36% (20/55) recurred distantly (Stage IV at the time of recurrence). Current post-surgical monitoring guidelines for early-stage melanoma recurrence include serial skin examinations, regional lymph node examinations, and/or imaging modalities36,37. These recommendations and long-term surveillance follow-up vary depending on the stage of disease and individualized risk factors (i.e., family history)36,37. As demonstrated in Fig. 6, recurrence probabilities can differ significantly in a sample of 7 melanomas from our cohort over a 10-year time period depending on clinical and histopathologic features. The recurrence probability of melanoma ID04 (>0.8) was significantly higher as compared to melanoma ID01 (<0.1) at the 10-year follow-up mark. The time-to-event analysis provides an additional tool to support clinicians in the decision-making process of customizing surveillance and counseling patients diagnosed with early-stage primary melanomas.

We also assessed the impact of individual risk factors on machine-learning prediction of melanoma recurrence. These were ranked using feature importance in the prediction by each model, and the results provided further support for previously established risk factors in addition to identifying new trends that have not been reported previously. Breslow thickness has long been established as one of the most predictive features for melanoma recurrence38, and the results of this study support this finding. In our comprehensive feature model (including Breslow thickness, anatomic level, AJCC stage, and ulceration), Breslow thickness was one of the most important features both in the recurrence vs. non-recurrence classification by the RF and GB models and in the time-to-event recurrence prediction by the RF-T and GB-T, which typically yielded the most accurate prediction.

Currently, Breslow thickness and ulceration are the criteria used to determine the tumor (T) stage in early-stage disease39. While AJCC stage is another commonly cited risk factor for melanoma recurrence12, the stage was not in the top five most important features for recurrence prediction. This is likely due to how the machine-learning models account for collinearity. For example, when the model extracted predictive signals from Breslow thickness, it would not extract the same signals contained in AJCC stage. In addition, Breslow thickness was incorporated as a continuous variable, which may provide more detailed information as compared to the broad AJCC stage, leading to a higher feature ranking in our RF, GB, RF-T, and GB-T models. After performing sensitivity analyses by removing anatomic level, Breslow thickness, and ulceration (Supplementary Table 7 and Supplementary Fig. 4 for the binary recurrence classification, Supplementary Table 9 and Supplementary Fig. 6 for time-to-event prediction), AJCC stage was one of the two most important features by the GB, GB-T, and RF-T models and became a more important ranking feature in the RF model compared to the original cohorts. In particular, stage 1A was ranked as a higher predictive feature compared to other stages.

Mitotic rate was another critical feature in the discrimination of recurrence vs. non-recurrence by the RF and GB models and in the time-to-event prediction by the RF-T and GB-T models. Mitotic rate was the second most important feature in the MLP and LR models for the recurrence classification (Supplementary Fig. 3) and in the Coxnet and CoxPH models for the time-to-event prediction (Supplementary Fig. 5). Previous studies have identified tumor mitotic rate or presence of mitoses as a risk factor for recurrence, which our results support10,40,41. In the most recent AJCC melanoma staging system, mitotic rate is no longer a T1 staging criterion14,39,42,43. The most recent ASCO-SSO guidelines for sentinel lymph node biopsy also removed mitotic rate greater than 1 per mm2 as a high-risk factor for sentinel lymph node biopsy consideration, however mitotic rate can still be taken into account when considering a patient’s overall risk based on clinicopathologic features43,44. Previously, the mitotic rate was included in the 7th edition of the AJCC staging criteria13. Per the guidelines, all melanomas with mitoses ≥1 mm2 were upstaged from 1 A to 1B.13, however this criterion was removed due to the poor inter-user reliability of mitotic rates45. Despite this, given how our models performed, with the mitotic rate being a more important feature than Breslow thickness in some machine-learning models, this feature should be increasingly evaluated in future studies and during consultations in clinical care. In our analyses, the mitotic rate was defined as a continuous variable, allowing the machine-learning algorithm to use detailed information for recurrence prediction, as compared to the AJCC 7th edition categories of <1 mm2 and ≥1 mm2 13. In addition, the prior mitotic rate criterion only affected melanomas diagnosed as stages 1A and 1B. Our results show that mitotic rate can be broadly incorporated in the risk assessment for melanoma recurrence across all stages (1A, 1B, 2A, 2B, and 2C) of early-stage primary melanomas.

Finally, this study also included social determinants of health such as patient median income and insurance type. Median income was in the top ten predictive risk factors in our best performing models (RF, GB, RF-T, and GB-T), and the insurance type “self-pay” was in the top ten predictive risk factors in the GB, LR, and MLP models, and was in the top three predictive risk factors in the GB-T models. To investigate the reasons for our observations, we used a linear regression analysis to examine the association between thickness and patient demographics (sex, age at diagnosis, race, ethnicity, insurance type, median income, and marital status). We found that patients with Self-Pay and Medicaid insurance were at significantly greater risk for presenting with thicker melanomas at the time of diagnosis (Beta: 0.64, p < 0.001 for Self-Pay; 1.1, p = 0.025 for Medicaid). The findings are consistent with prior literature identifying delays in primary melanoma diagnosis and increased diagnosis of advanced-stage melanoma among uninsured or publicly insured patients. In one study, individuals who are underinsured have been found to present with thicker tumors (OR: 2.19) compared with non-Medicaid insured patients46. Similarly, Medicaid patients were found to also present with thicker tumors at diagnosis compared to those not on Medicaid coverage (OR: 2.37)47. Since Medicaid is the primary insurance coverage for low-income individuals and self-pay patients are likely underinsured individuals, our results demonstrate socioeconomic disparities in the timeliness of melanoma diagnosis. These SES risk factors are also surrogate markers that affect melanoma care, from diagnosis to treatment and surveillance. Medicaid patients experience delays in surgical treatment of primary melanomas by 6 weeks48. Although, it is unknown whether delay in surgical treatment affects mortality, timing of melanoma treatment should be investigated in future studies to assess associations between risk of recurrence and timing of recurrence.

The overall results of our feature importance ranking remained consistent after sensitivity analyses where only melanomas with negative or not indicated regional lymph node histology, and melanomas without no missing values for ulceration/mitotic rate were included. Breslow thickness and mitotic rate were consistently the two most important features for classification by RF and GB binary classification models (Fig. 5) and for time-to-event prediction by RF-T and GB-T models (Fig. 7). Median income and the insurance type, self-pay, remained in the top five important features classification by RF and GB binary classification models and for time-to-event prediction by RF-T and GB-T models. We also observed only a mild decrease in AUC (0.812 to 0.809 for classification; 0.820 to 0.804 for time-to-event prediction) in these more restricted settings. This is expected as our sample size was reduced, however, the consistent results demonstrate the robustness of the GB and GB-T models in the binary classification and time-to-event prediction of melanoma recurrence. Despite limitations with missing clinical or histology features, our sensitivity analyses show our models yield robust results.

This study was also limited by its retrospective nature. However, with 1,720 included melanomas, our study represents an improvement in sample size and predictive power compared to many prior studies investigating risk factors associated with melanoma recurrence. Additionally, the mitotic rate variable in this study was extracted from pathology reports, which is subject to high levels of interobserver variation. Future investigations incorporating automatic mitotic feature detection from H&E whole slide images using similar deep learning approaches to other cancers (e.g., breast cancer)49,50 are necessary to evaluate the performance of a more objective mitotic rate variable in predicting melanoma recurrence. Despite this limitation, mitotic rate was nevertheless consistently selected as a top performing feature in our models. It is also possible that some of the non-recurrent melanomas included in our study experienced recurrence after the end of study follow-up (false non-recurrence). However, this is likely to affect a minority of the population since (1) more than 80% of our study population experienced recurrence within 5 years (and more than 90% within 7 years), and (2) we specifically selected our control population to have a minimum of 5 years of follow-up in the binary melanoma recurrence classification tasks. Furthermore, a sensitivity analysis was performed to compare the performance of binary classification when the criteria for non-recurrent melanomas was extended to 7 years. The results of the sensitivity analysis (Table 2) show consistent performance (AUC: 0.827) compared to the primary models with criteria of at least 5 years of follow-up. Overall, this study also provided longer follow-up than available in previous investigations, further decreasing the possibility of misclassification of non-recurrent melanomas. There were no minimum follow-up constraints for non-recurrent melanomas in the time-to-event analysis, and therefore the risk of false non-recurrence is not applicable to this analysis. In the time-to-event recurrence prediction, the censoring status for non-recurrences was classified as either dead or censored. The time to event for recurrences was defined as the duration from primary melanoma diagnosis to recurrence. The time to event for non-recurrences was defined as the duration from primary melanoma diagnosis to the date of death or date of censoring.

In summary, we demonstrate consistent and reliable performance of machine-learning models for predicting melanoma recurrence in the largest cohort of patients diagnosed with early-stage melanomas to date. We delineate the most significant features for recurrence prediction, which include Breslow thickness, mitotic rate, AJCC stage, median income, insurance type, and age at diagnosis. Overall, our study provides important insights for clinicians to counsel patients on the contributions of individualized risk factors for early-stage melanoma recurrence. However, despite the incorporation of a comprehensive array of over 36 demographic, clinical, and histopathologic features, there is a plateau in model performance for the recurrence classification (AUC: 0.812) and for the time-to-event recurrence prediction (time-dependent AUC: 0.820). The predictive capabilities of these models can benefit from the incorporation of additional features, including digital histopathology images, genomics data, and novel tumor biomarkers. Nevertheless, our presented models may be deployed clinically to aid in the identification of high-risk early-stage melanoma patients that may benefit from increased surveillance or adjuvant immunotherapy.

Methods

This study was conducted to determine the capabilities of machine-learning algorithms to predict early-stage melanoma recurrence using patient demographics, medical history, and melanoma tumor characteristics. We performed two types of prediction by using nine machine-learning algorithms: (1) melanoma recurrence classification; (2) time-to-event melanoma recurrence risk prediction. Models were all validated internally and externally. First, we used the stratified fivefold cross-validation on the MGB cohort (internal validation). Second, we trained models using the MGB cohort and evaluated them independently on the DFCI cohort (external validation) (Supplementary Fig. 1).

Materials

Patients diagnosed with Stage I or Stage II melanoma at MGB and DFCI between January 2000 and February 2020 were included in this study. Melanoma histology (confirmed by ICD-O-3 codes), diagnosis date of the primary melanoma, and recurrence date were extracted from the cancer registrars at both institutions. Cutaneous melanomas with no evidence of metastasis at the time of diagnosis and of the following histological types were included: lentigo maligna, nodular, superficial spreading, and melanoma NOS (not otherwise specified). Acral, mucosal, and uveal melanomas were excluded from the study. The Mass General Brigham Institutional Review Board (IRB) approved the study (Protocol # 2020P002179). The need for consent was waived by the IRB as the study meets the criteria for exemption 45 CFR 46.104(d)(#). Secondary research for which consent is not required if the research involves only information collection and analysis involving the investigator’s use of identifiable health information when that use is regulated under 45 CFR parts 160 and 164, subparts A and E.

The Research Patient Data Registry (RPDR)51 and the Enterprise Data Warehouse52 (EDW) are two institutional clinical databases at MGB and DFCI, which were used to extract clinical and medical history information. The RPDR contains multimodal clinical information, including basic demographics, International Classification of Diseases (ICD) codes, and lab results. The EDW provides detailed documentation of medication administrations. For this study, we extracted the following variables from RPDR: date of birth, sex, race, ethnicity, insurance type, marital status, zip code, date of death or last follow-up. Median income was extracted from the U.S. Census data based on the patient’s zip code53. ICD codes from all visits of a patient before the diagnosis of early-stage melanoma were used to calculate Charlson Comorbidity score54 (CCS) and to extract medical history features (Supplementary Table 1). The ICD codes used to identify non-melanoma skin cancer, benign neoplasms of skin, cutaneous autoimmune diseases, and other systemic autoimmune diseases were specified in Supplementary Tables 1012. Mortality information was additionally ascertained via the institutional cancer registrars using linkage with obituary databases and the National Death Index (NDI). For patients who did not have known mortality information, their last encounter with the system was considered their censoring date.

We obtained data for collecting features of interest (Supplementary Table 1) using established protocols for electronic data extraction from institutional clinical databases. For all features with incomplete information, rigorous manual chart review and data extraction were performed. Manual chart reviews by two independent reviewers were conducted to ascertain the recurrence status (recurrence vs. non-recurrence) and recurrence type (locoregional recurrence vs. distant recurrence). All melanomas that were Stage IV at the time of recurrence, based on the American Joint Committee on Cancer (AJCC) 8th edition staging guidelines39, were labeled as having a distant recurrence. All recurrent melanomas without distant metastases were labeled as having a locoregional recurrence. In classification tasks, melanomas that did not recur and were followed up at least 5 years were labeled as “non-recurrence”. If a patient died without melanoma recurrence and the follow-up duration was less than 5 years, the patient was not included in the binary classification tasks. In the time-to-event prediction tasks, all melanomas without recurrence were labeled as “non-recurrence” regardless of the follow-up duration. The time to event is the duration from diagnosis to recurrence if the melanoma recurred; otherwise, the time to event is the duration from diagnosis to date of death or last follow-up.

Figure 1 shows the flow diagram of how the study population was obtained. The MGB population was utilized for model development and the DFCI population was utilized for external validation of the model. In total, 36 features were extracted for inclusion in our registry (Supplementary Table 1), which can be categorized into three groups: demographics, medical history, and tumor characteristics. Specifically, seven demographic features include age at diagnosis, sex, race, ethnicity, median income, insurance type, and marital status. Medical history includes seven features: Charlson comorbidity score (CCS), history of prior cutaneous melanoma (HPCM), history of non-melanoma skin cancer (HNMSC), history of situ or benign neoplasms of skin (HSBN), history of other malignancy (HOM), history of cutaneous autoimmune disease (HCAID), and history of systemic autoimmune disease (HSAID). Tumor characteristics consist of 22 features: histological type, tumor site, AJCC stage at diagnosis, Breslow thickness, anatomic level, mitotic rate, ulceration, laterality, tumor infiltrating lymphocytes (TIL), tumor infiltrating lymphocytes type, precursor lesion, precursor type, radial growth phase, vertical growth phase, vertical growth type (VGT), microsatellites, regression, lymphovascular invasion (LVI), perineural invasion, total surgical margins of excision(s), check if the surgical margins of the excision(s) met management guidelines (margin check)36,55, and regional lymph node histology (RLNH). All histopathologic features included in this study were collected from treatment-naïve biopsies of primary melanomas at the time of melanoma diagnosis and the included tumors were all early-stage at diagnosis. Therefore, no patients included in our study have received treatment with systemic therapy for their primary early-stage melanomas.

Based on clinical guidelines for melanoma, all melanomas with tumor categories above T1b (Breslow thickness >1.00 mm) are recommended to undergo a sentinel lymph node biopsy (SLNB)44. SLNB was not recommended for patients diagnosed with primary melanomas that are <0.8 mm thick, non-ulcerated lesions (T1a)44. Patients diagnosed with primary melanomas that are 0.8 to 1.0 mm thick or are <0.8 mm thick, ulcerated lesions (T1b) may be offered a SLNB after a clinical discussion of the risks and benefits of the procedure with their provider44. As a result, patients with stage IA melanoma were not recommended to undergo a SLNB due to the low overall rate of SLNB positivity in this population. Additionally, in real-world clinical settings, patients may defer the SLNB due to various reasons, e.g., comorbidity, frailty, unwillingness to undergo an additional invasive procedure, etc. Thus, we detailed the variable regional lymph node histology (RLNH) incorporating this complexity of SLNB in a real-world clinical setting, which includes the following values: 0. all nodes negative; 1. not indicated; 2. not performed due to age or comorbidity; 3. not performed due to an unknown reason; 4. deferred by the patient.

To reduce the degree of omission for many included synoptic features, we expanded our manual phenotyping efforts, including reviews of manually scanned pathology reports uploaded into the electronic medical records and reviews of all available clinical notes in the electronic medical records, and incorporated input from dermatopathologists with institutional experience in synoptic feature reporting across time. Given that our data was extracted from real-world electronic medical records (non-clinical trial settings), some variables, like precursor lesion and perineural invasion, may not be included in the pathology report if they were not identified in the biopsy by the pathologist. As a result, and with input from dermatopathologists with expert knowledge of the reporting schema at our institutions, precursor lesion, precursor type, microsatellites, regression, lymphovascular invasion, and perineural invasion were assumed absent if not listed as present in the pathology report. All melanomas without a pathology report available and melanomas with positive microscopic satellites were excluded from this study.

Categorical features were converted by one-hot encoding (Supplementary Table 2). For missing continuous features (Supplementary Table 1), we assigned the median values of the samples to individuals, including median income, Charlson comorbidity score, mitotic rate, and total surgical margins.

Machine-learning methods for binary recurrence classification

We compared the classification performances of five classic supervised machine-learning algorithms using the extracted clinicopathologic features. The algorithms include Support Vector Machine (SVM)56, Gradient Boosting (GB)57,58, Random Forest (RF)59, Logistic Regression (LR)60, and Multi-layer Perceptron (MLP)61,62. Model parameters were optimized by cross-validated grid-search over a parameter grid on the MGB cohort. The kernel of SVM was the radial basis function. The GB had 100 estimators. The L2 regularization was used for LR to reduce overfitting. The MLP was performed with a logistic activation function. Model performance was evaluated based on the area under the receiver operating characteristic curve (AUC), positive predictive value (PPV), sensitivity, specificity, and accuracy. Mean and 95% confidence intervals were reported. All experiments were implemented using scikit-learn Python library24.

Binary melanoma recurrence classification

For our primary outcome, we evaluated the ability of machine-learning models to predict the recurrence (recurrence vs. non-recurrence). Stratified fivefold cross-validation was applied in the internal validation, which preserves the percentage of samples for each class in each fold. In the external validation, the entire MGB cohort was used for training models and the DFCI cohort was left out for independently evaluating the models. Each experiment was repeated 50 times, and each time non-recurrent melanomas were randomly sampled to match the number of recurrent cases (Supplementary Fig. 1a). We further investigated the predictive features used in the recurrence prediction by conducting permutation importance24. The permutation importance for feature evaluation was repeated for 50 times and the AUC was used for scoring.

Sensitivity analyses

Ensuring sentinel lymph node biopsy negativity is important to be able to exclude possibility of nodal involvement at time of diagnosis. However, given that our data was extracted from EHRs in a real-world clinical setting and to remain consistent with clinical guidelines, we didn’t exclude patients who didn’t perform sentinel lymph node biopsy at time of diagnosis in our main analysis. In the sensitivity analysis, we excluded the melanomas for which the sentinel lymph node biopsy was indicated but not performed, ulceration was missing, or mitotic rate was unknown. We also conducted experiments on a complete data for which median income and all tumor features were without missing values. In our main analysis, a minimum of 5-year follow-up was used to ensure sufficient time to observe a recurrence in the non-recurrence population. Though the melanoma recurrence rate decreases significantly after 5 years (Fig. 2), the minimum 5-year follow-up duration may not have captured all recurrences. Therefore, we further performed binary recurrence classification experiments when the minimum follow-up duration for the non-recurrent melanomas was 7 years.

Machine-learning methods for time-to-event recurrence prediction

Melanoma recurrence is a time-to-event outcome. It is important to predict the risk probability as time goes by. We compared the time-to-event prediction performances of four supervised machine-learning algorithms using the extracted clinicopathologic features. The algorithms include GradientBoostingSurvivalAnalysis63 (GB-T), RandomSurvivalForest64 (RF-T), CoxnetSurvivalAnalysis65 (Coxnet), and CoxPHSurvivalAnalysis66 (CoxPH). Similar to the classification tasks, model parameters were optimized by cross-validated grid-search over a parameter grid on the MGB cohort. Model performance was evaluated based on the time-dependent AUC67 and concordance index for right-censored data68. Each experiment was repeated 50 times. Mean and 95% confidence intervals were reported. All experiments were implemented by using scikit-survival Python library69.

We also investigated the predictive features by conducting permutation importance24. The permutation importance for feature evaluation was repeated for 50 times and the concordance index was used for scoring. Sensitivity analysis on cohorts with negative or not indicated sentinel lymph node biopsy and with available tumor features was also conducted.

Statistical methods

The minimum sample size at a power of 0.8 and a type I error rate of 0.05 were calculated to ensure that our study sample size was sufficient to evaluate the capacity of models for predicting melanoma recurrence70. To compare groups, we used Pearson’s Chi-squared test or Fisher’s exact test for categorical variables, and the t-test or Kruskal–Wallis test for continuous variables, provided by the stats package in R version 4.1.071.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.