Introduction

Melanoma accounts for only about 1–5% of skin cancers1. Among melanoma cases, over 90% are cutaneous melanomas, one of the most aggressive forms of skin cancer, with a high mortality rate2,3. Improving prognostic assessment in cutaneous melanoma patients is crucial to better plan personalized treatments. Although increasingly advanced treatments for melanoma have been continuously introduced into clinical practice over the years, e.g. with the advent of immune checkpoint inhibitors and targeted therapy with anti-BRAF/anti-MEK drugs for patients harboring BRAF mutations4,5,6, they can cause toxicity and overtreatment, especially in early-stage patients7. No less relevant, these treatments can entail significant resource expenditure due to their high costs, exceeding $20,000 per patient per month8. Currently, clinical prognostic methods for evaluating the risk of recurrence include multiple parameters, such as Breslow tumor thickness, ulceration, and local or nodal metastasis, which form the basis of the American Joint Committee on Cancer (AJCC) pathologic tumor stage9,10. Although routinely applied in clinical practice, these prognostic methods have some pitfalls. Among them, complete staging requires lymph node examination, which is the subject of ongoing scientific debate due to the associated post-operative morbidity and/or infection11. Meanwhile, genomic-based tools complementing the traditional staging system are being developed in order to evaluate their prognostic power in comparison with traditional factors12,13. However, these tools are still in the experimental phase and have not yet been adopted in clinical practice. Thus, finding more reliable and widely applicable prognostic biomarkers for melanoma patients is urgent. Within this emerging scenario, digital pathology image-based prediction models can be designed. Owing to ongoing technological developments, e.g. cloud storage systems and increased computer processing power, whole slide images (WSIs), i.e. digitized slides, have become the predominant imaging modality in pathology departments across the world14. In recent years, artificial intelligence and its deep learning branch based on Convolutional Neural Networks (CNNs) have shown potential in solving challenging imaging problems in the medical field15,16,17. Thanks to the extraction of a huge number of image characteristics, invisible to the human eye, correlations between image patterns and a pre-defined outcome can be recognized. A specific task can be accomplished either by designing an ad hoc CNN trained on a huge amount of imaging data18 or by using the so-called transfer learning technique on relatively small datasets19,20. While ad hoc CNNs are trained both to extract features and to make predictions, the transfer learning technique applies CNNs pre-trained on thousands of images as feature extractors on the images under study, and then uses a standard machine learning classifier to make the prediction. Several research studies have extensively investigated the role of deep learning, and in particular of CNNs, in diagnostic tasks related to melanoma. For example, classification of histopathological melanoma images has been performed by means of deep learning methods21,22. Conversely, the investigation of prognostic tasks based on WSI analysis via deep learning techniques is still in its early stages.
Sentinel lymph node status, anti-PD-L1 response, as well as visceral recurrence and death in melanoma patients have recently been probed through the application of deep learning to WSIs23,24,25. Given the demonstrated relevance of applying deep learning to WSIs for prognostic tasks in melanoma patients, further investigation is warranted. In this study, we propose a deep learning model which makes use of features extracted by transfer learning to predict 1-year disease-free survival in patients with cutaneous melanoma. Whole slide images from a cohort of 43 patients of the Clinical Proteomic Tumor Analysis Consortium Cutaneous Melanoma (CPTAC-CM) public database were first analyzed to design the predictive model. Then, the model was validated on an independent test set, i.e. a validation cohort of 11 cutaneous melanoma patients referred to our Institute. This study represents the first use of transfer learning to predict disease-free survival in cutaneous melanoma patients through direct analysis of WSIs.

Results

Study design and data collection

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Scientific Board of Istituto Tumori ‘Giovanni Paolo II’, Bari, Italy (prot. 17729/2020). The Scientific Board determined that the study did not require written consent from subjects, as it is retrospective and involves minimal risk.

A binary classification task was formulated to predict 1-year disease-free survival in melanoma patients starting from the analysis of whole slide images (WSIs) of cutaneous melanoma. Hence, patients were divided into disease-free (DF) and non-disease-free (non-DF) cases.

In the current paper, experimental data from two distinct cohorts of patients were analysed. The predictive model was designed and first validated by exploiting data from the CPTAC-CM public database26, which is part of The Cancer Imaging Archive (TCIA)27. The database contains cases of 49 patients with 1-year follow-up, counting 35 DF patients and 14 non-DF patients. Haematoxylin and eosin (H&E) images of cutaneous melanoma and some clinical data, such as gender, tumor site, pathological stage (stage), pathologic primary tumor staging (T) and age, were downloaded from the CPTAC-CM online website. With respect to the pathologic staging of regional lymph nodes (N), the patients were distributed as follows: 34.9% N0, 6.8% N1, 9.3% N2, 9.3% N3 and 39.7% NX, where NX means that it was not possible to determine whether the tumor had spread to the lymph nodes. Only patients with stage I–III melanomas, and whose histological images were judged to be of good quality by our expert pathologists, were considered eligible for our analysis. Thus, a final cohort of 43 patients, comprising 31 DF patients and 12 non-DF patients, was retained. The cancer staging was evaluated according to either the seventh or the eighth edition of the AJCC classification (36 patients and 7 patients, respectively). However, this difference does not introduce bias into our analysis, since the two editions differ only in the evaluation of patients with T1 melanoma28. Clinical data characteristics of the 43 patients from the CPTAC-CM dataset are provided in Table 1.

Table 1 Clinical data referred to CPTAC-CM public dataset.

The model was further tested on a validation cohort of 11 cutaneous melanoma patients, comprising 8 DF and 3 non-DF cases at 1-year follow-up, whose data, consisting of H&E slides as well as clinical data, were provided by our Institute. The WSIs were acquired with a CMOS camera (DFK 33UX183, Nikon) and saved as pyramidal images. Clinical data are summarized in Table 2.

Table 2 Clinical data referred to the validation cohort of patients.

Disease-free survival prediction

The design and first validation of the proposed predictive model were performed on data related to patients of the CPTAC-CM dataset. The identification of quantitative imaging biomarkers was carried out by means of the three CNNs described in the Methods section. Briefly, the WSIs were manually annotated by expert histopathologists of our Institute to identify a Region Of Interest (ROI), which was automatically divided into tiles. As described in detail in the Methods section, only tiles with high cell density were retained and then partitioned into crops of dimension equal to one-fourth of the tile size. The three deep models built on the three CNNs were first evaluated separately and then ensembled. In Table 3, the median AUC values reached by the three deep models, first at crop level and then at WSI level, are compared. To pass from crop level to WSI level, a vote score thresholding procedure was implemented, in which a unique classification score for the WSI was defined as the 75th percentile of the distribution of the classification scores of its crops (see "Methods"). Median AUC values of 50.5%, 56.2% and 54.8% were achieved at crop level by ResSVM, DenseSVM and InceptionSVM, respectively. The implementation of the vote score thresholding procedure to obtain a unique AUC value per WSI, and hence per patient, led to an AUC improvement of up to 10% for each of the three deep models: median AUC values of 64.7%, 64.8% and 64.9% were returned by ResSVM, DenseSVM and InceptionSVM, respectively. The performances were balanced among the models, thus highlighting the robustness of the proposed data analysis pipeline.

Table 3 Comparison of the median percentage AUC values achieved on both crops and entire WSIs by means of the three deep models, ResSVM, DenseSVM and InceptionSVM.

The three deep models were outperformed by the ensemble model, DeepSVM, which reached a median AUC value of 69.5%, as shown in Table 3.

An association test between the 1-year disease-free survival outcome and each clinical factor revealed no significant associations in either the public or the validation cohort of patients. Following the data analysis pipeline, we then designed an SVM classifier which exploits all the available clinical variables (see Table 1) according to a fivefold cross validation scheme on the CPTAC-CM public dataset. In our experimental analysis, we attempted to implement two feature selection algorithms: Random Forest, which computes Gini impurity to identify the most important features29, and a forward sequential feature selection algorithm30, which selects features on the training set during cross validation until there is no further improvement in the prediction of an SVM classifier on the same training set (a sketch of this step is given below). However, no significant improvement in performance was obtained. For this reason, we report only the results achieved by the model using all clinical variables. Indicated as Clinical in Fig. 1, this model reached a median AUC value of 57.3%, a median accuracy value of 52.0%, a median sensitivity value of 44.4%, a median specificity value of 57.1%, a median F1-score value of 58.2%, and a median Geometric mean (G-mean) value of 53.5%.
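As an illustration only, the snippet below sketches how such a forward sequential selection wrapped around an SVM could be reproduced with scikit-learn; the original analysis was implemented in MATLAB, and all variable names and data here are hypothetical placeholders.

```python
# Hypothetical sketch of forward sequential feature selection wrapped around an SVM,
# mirroring the procedure described above (requires scikit-learn >= 1.1).
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_clinical = rng.normal(size=(43, 6))   # placeholder clinical matrix (43 patients, 6 variables)
y = rng.integers(0, 2, size=43)         # placeholder DF (0) / non-DF (1) labels

selector = SequentialFeatureSelector(
    SVC(kernel="rbf"),
    n_features_to_select="auto",
    tol=1e-4,                           # stop when the CV score no longer improves
    direction="forward",
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
selector.fit(X_clinical, y)
print("Selected clinical variables:", np.flatnonzero(selector.get_support()))
```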

Figure 1
figure 1

Boxplots of the performances of the 1-year disease-free survival predictive models in terms of percentage AUC, accuracy, sensitivity, specificity, F1-score and G-mean. The results are evaluated by implementing a fivefold cross validation scheme for five rounds on the CPTAC-CM public dataset.

With respect to the Clinical model, the DeepSVM model led to a marked improvement in all the performance evaluation metrics except the F1-score (see Fig. 1): a median AUC value of 69.5%, a median accuracy value of 72.7%, a median sensitivity value of 68.8%, a median specificity value of 75.0%, a median F1-score value of 42.3%, and a median G-mean value of 75.0% were obtained. The sensitivity and specificity values were well balanced. The DeepSVM + Clinical model, defined by combining the scores achieved separately by DeepSVM and Clinical via a soft voting technique, did not lead to an overall improvement of the results with respect to the DeepSVM model. The only metric that increased was the median specificity value (83.3%), but at the expense of the median sensitivity value (59.4%). However, compared with the Clinical model alone, the DeepSVM + Clinical model returned better performance for all the evaluation metrics except the F1-score (see Fig. 1).

The model that showed the best performance was the one exploiting image information alone, i.e. DeepSVM. This best performing model was finally tested on the validation cohort of patients recruited from our Institute, reaching an AUC value of 66.7%, an accuracy value of 72.7%, a sensitivity value of 100%, a specificity value of 62.5%, an F1-score value of 66.6%, and a G-mean value of 79.1%. The resulting Receiver Operating Characteristic (ROC) curve is shown in Fig. 2. These results indicate that the proposed model is reasonably robust and generalizable.

Figure 2
figure 2

ROC Curve related to the DeepSVM model on the validation cohort. The dotted line represents the random guess.

Discussion

The post-surgery evaluation of the disease-free survival outcome has gained increasing attention in the management of melanoma patients as a consequence of the continuous progress made in defining more suitable therapies over the last few years. Whether WSI analysis can improve the prediction of prognostic outcomes in melanoma with respect to considering clinical data alone is currently under scrutiny within the scientific community23,24,25,31,32. For example, Peng et al.31 developed LASSO Cox prediction models based on the integration of clinical variables, gene signatures and WSI features to predict recurrence-free survival in melanoma patients. The integration of WSI features with baseline clinical variables improved the performance obtained by using clinical variables only. However, they made use of handcrafted features, which, as demonstrated elsewhere33, can be affected by human bias. The advent of deep learning has attracted great interest from both clinical and technical professionals working together in multidisciplinary teams at cancer centers, since it has opened the way to automatic feature extraction from raw images without human intervention. The extracted features are able to capture complex structures underlying image texture, which are usually hidden to the human eye15. In this study, we aimed to contribute to the development of a cost-effective prognostic model by integrating clinical variables with quantitative WSI information automatically extracted via deep learning. We developed a deep learning-based model which makes use of features extracted by transfer learning, with the aim of learning prognostic biomarkers from WSIs to predict 1-year disease-free survival in stage I–III cutaneous melanoma patients. To the best of our knowledge, this is the first work which exploits transfer learning for this prognostic task. Transfer learning makes it possible to handle the limited size of the datasets available in this study. Small datasets are very common in clinical studies due to the complexity and high costs of patient data collection. It is well known that CNNs trained and validated on small datasets tend to lose predictive power, being more prone to overfitting and unstable predictions34. The application of transfer learning combined with cross validation and ensemble model strategies makes it possible to overcome these technical issues, also rendering the achieved results more generalizable. As a proof of concept, in this paper we first designed and validated the model on a public dataset and afterwards tested the robustness and generalizability of the model on a cohort of melanoma patients recruited from our Institute. Basically, we made use of pre-trained networks as feature extractors only (transfer learning) and SVMs as standard classifiers. The combination of clinical and imaging data used by the DeepSVM + Clinical model led to a clear improvement in performance compared with the Clinical model using clinical variables only: the AUC values reached were 68.2% and 57.3%, respectively. However, clinical variables did not add significant information when combined with the imaging features. Indeed, the best predictive classification performances were obtained, in terms of median AUC and accuracy, with values of 69.5% and 72.7%, respectively, when only the quantitative imaging biomarkers extracted by the CNNs were evaluated.
The abovementioned results were achieved on the CPTAC-CM public dataset and remained comparable when the model was tested on the validation cohort of melanoma patients (AUC value of 66.7% and accuracy value of 72.7%, respectively).

A fundamental peculiarity of the proposed model is the automatic identification of quantitative imaging information directly from the raw WSIs. In other words, we used a computerized system to automatically extract information that is usually evaluated manually and visually by pathologists. Our model was able to automatically capture fine tumor or lymphocytic infiltrate characteristics, such as the morphology of tumor nuclei and the density distribution of lymphocytes, which are well known to be associated with metastasis and survival outcomes35,36, and hence with disease-free survival.

The main limitation of this study is the relatively small size of the analysed datasets. The generalizability of the proposed model should be further validated on wider cohorts of patients. The next step of our research will be to train the algorithm on WSIs of patients recruited from our Institute, with the aim of improving the predictive performance, and then to validate its robustness in multi-centric studies. Here, we have extracted high-dimensional features, which refer to global cues of an image, such as shapes or entire objects. We are planning to add key low-dimensional features, i.e. features related to local image characteristics, to the developed model.

Beyond validation on larger datasets, in future extensions of our work, eXplainable Artificial Intelligence (XAI) models37 will be integrated to make the decisions of the proposed model more intelligible to humans. Understanding and trusting the outcomes produced by artificial intelligence algorithms is essential for easier adoption of artificial intelligence in clinical practice. In our case, the optimization of our model through XAI could lead to an effective and low-cost prognostic model to be used for managing the care of melanoma patients in routine clinical practice.

Finally, the present study proposed an artificial intelligence model to determine which cutaneous melanoma patients could show 1-year disease-free survival, thanks to the direct identification of quantitative imaging biomarkers from WSIs. The promising results achieved in this preliminary work suggest that our proposal, after further validation on wider cohorts of patients as well as technical refinement, has the potential to fulfil the predictive task, substantially improving melanoma patient management in terms of time and costs, and representing a complementary tool with respect to current genetic and manual immunohistochemistry methods.

Methods

Image pre-processing

Histopathological images are pyramidal images and hence cannot be fed directly into an artificial intelligence algorithm. Therefore, an image pre-processing phase was performed, as shown in the upper panel of Fig. 3. Two expert histopathologists of our Institute selected and then annotated the most representative WSI per patient to mark a Region Of Interest (ROI). Each ROI was split into tiles of 224 × 224 pixels at 20× magnification using the QuPath open-source software38. An automated cell detection to identify both tumor cells and lymphocytes was run on each tile, and only tiles with high cell density were retained. Cell density was computed as the ratio between the number of cells within the tile and the tile area. The distribution of cell density over the tiles of a given ROI was then defined. A tile was considered to contain a high cell density, i.e. high cell content information, if its cell density exceeded the 90th percentile of the distribution. All other tiles were discarded (see Fig. 3). The retained tiles were then partitioned into crops of dimension equal to one-fourth of the tile size. The centres of the crops were obtained as points randomly sampled from a 2D Gaussian distribution centred on the tile. A maximum of 50 crops per tile was generated. The number of crops varied from tile to tile, since only crops with less than 25% background pixels (Luma > 170) were retained for further analysis. A total of 12,575 crops were extracted from the WSIs of the CPTAC-CM dataset, while 4011 crops were obtained from the WSIs of the validation cohort from our Institute. Finally, the colour of each crop was normalized with a standard WSI normalization algorithm, known as Macenko's method39, to overcome inconsistencies arising during the staining process of the histological slides, due to different stain manufacturers, staining procedures or storage times.
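For illustration, the sketch below reproduces the crop-sampling and background-filtering step in Python (the original pipeline relied on QuPath and MATLAB). The crop side being one-fourth of the tile side, the Gaussian spread, and the Rec. 601 Luma weights are assumptions, and all names are hypothetical.

```python
# Minimal sketch of the crop-sampling step described above; tiles are assumed to be
# 224x224 RGB numpy arrays and Luma is computed with Rec. 601 weights (an assumption).
import numpy as np

TILE = 224
CROP = TILE // 4            # crop side assumed to be one-fourth of the tile side
MAX_CROPS = 50
BG_LUMA = 170               # a pixel with Luma > 170 is treated as background
MAX_BG_FRACTION = 0.25      # crops with >= 25% background pixels are discarded

def sample_crops(tile, rng):
    """Sample up to MAX_CROPS crops whose centres follow a 2D Gaussian centred on the tile."""
    crops = []
    for _ in range(MAX_CROPS):
        # Gaussian-distributed centre, clipped so that the crop stays inside the tile
        cy, cx = rng.normal(loc=TILE / 2, scale=TILE / 6, size=2)
        cy = int(np.clip(cy, CROP / 2, TILE - CROP / 2))
        cx = int(np.clip(cx, CROP / 2, TILE - CROP / 2))
        crop = tile[cy - CROP // 2: cy + CROP // 2, cx - CROP // 2: cx + CROP // 2, :]
        # Rec. 601 Luma; keep the crop only if background pixels are below the limit
        luma = 0.299 * crop[..., 0] + 0.587 * crop[..., 1] + 0.114 * crop[..., 2]
        if np.mean(luma > BG_LUMA) < MAX_BG_FRACTION:
            crops.append(crop)
    return crops

rng = np.random.default_rng(0)
dummy_tile = rng.integers(0, 256, size=(TILE, TILE, 3), dtype=np.uint8)
print(len(sample_crops(dummy_tile, rng)), "crops retained from the dummy tile")
```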

Figure 3
figure 3

Workflow of the proposed approach. Image pre-processing was performed first: WSIs were annotated by expert pathologists; the identified Regions Of Interest (ROIs), one per WSI, were tessellated into tiles of 224 × 224 pixels; cell detection was performed, and only tiles with high cell density were retained and then divided into smaller crops, which were finally colour-normalized. Data analysis was then performed: the crops were taken as input by three pre-trained CNNs. Each CNN extracted thousands of imaging features, which then underwent a feature selection process; the selected features were employed to build an SVM classifier, which expressed a decision (DF or non-DF) first at crop level and then at slide level through the implementation of a vote score thresholding. Corresponding to the three CNNs, three models, called ResSVM, DenseSVM and InceptionSVM, were defined. An ensemble model, named DeepSVM, was designed by combining the classification scores of the three models. A further model, named Clinical, which took clinical data as input and used an SVM classifier to output classification scores at WSI level, was defined. A soft-voting procedure was implemented to combine the classification scores of DeepSVM and Clinical at WSI level and then at patient level. The final model was called DeepSVM + Clinical. Predictive performances were assessed by standard evaluation metrics.

Data analysis pipeline

The data analysis pipeline following image pre-processing is depicted in the bottom panel of Fig. 3. We used cutting-edge pre-trained CNNs implemented in the MATLAB R2019a (MathWorks, Inc., Natick, MA, USA) software, namely ResNet5040, DenseNet20141 and InceptionV342, to extract high-dimensional imaging features from the crops. ResNet5040 is a 50-layer-deep CNN which makes it possible to train much deeper networks while maintaining compelling performance by stacking residual blocks on top of each other. All crops were resized to 224 × 224 pixels before being fed into ResNet50, and a total of 2048 features were extracted. DenseNet20141 is composed of layers receiving additional inputs from all preceding layers and passing their feature maps to all subsequent layers. All crops were resized to 224 × 224 pixels before being fed into DenseNet201, and a total of 1920 features were extracted. InceptionV342 has an architectural design with repeated components called inception modules. All crops were resized to 299 × 299 pixels before being fed into InceptionV3, and a total of 2048 features were extracted. The weights of the pre-trained CNNs derived from training on the ImageNet dataset. The networks acquired knowledge from a huge number of natural (non-medical) ImageNet images during the training phase, and in this paper this knowledge was transferred to previously unseen images (WSIs). In our study, only the labels of the crops, inherited from the WSIs to which they belong (DF vs non-DF cases), were needed for the classification phase (SVM classifiers, see below), while the transfer learning feature extraction itself did not require any task-specific labels.
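As an illustrative sketch only (the original work used MATLAB; here Keras is used, and the global-average-pooling output layer is an assumption), transfer-learning feature extraction with an ImageNet-pretrained ResNet50 can be expressed as follows.

```python
# Illustrative Python/Keras sketch of transfer-learning feature extraction with ResNet50.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# ImageNet-pretrained ResNet50 without its classification head; global average
# pooling yields a 2048-dimensional feature vector per crop.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

crops = np.random.randint(0, 256, size=(8, 224, 224, 3)).astype("float32")  # dummy batch of crops
features = extractor.predict(preprocess_input(crops), verbose=0)
print(features.shape)  # (8, 2048): one 2048-dimensional descriptor per crop
```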

Based on the features extracted by the three CNNs, three deep models, called ResSVM, DenseSVM and InceptionSVM, respectively, were designed.

The three deep models shared the same backbone architecture, described in the following. After feature extraction, the initial public cohort of 43 patients was divided in turn into training and test sets according to a fivefold cross validation procedure repeated for 5 rounds. In this manner, all the crops associated with the WSI of a patient belonged either to the training set or to the test set, depending on the set to which the patient was assigned. We implemented a nested waterfall feature selection technique consisting of two basic feature selection methods. First, features were selected according to their discriminative power, assessed through the computation of the Area Under the receiver operating characteristic Curve (AUC), which accounts for the general discriminative capability of each feature with respect to a binary classification problem: this metric takes percentage values ranging from 50%, indicating random guessing, to 100%, indicating perfect discriminatory ability43. Features whose AUC value was less than 60% were dropped. Afterwards, principal component analysis (PCA)44 based on the explained variance criterion was performed to further reduce the number of features selected by the first algorithm. The first p components were chosen so as to explain at least 80% of the total variance and were then taken as input by a Support Vector Machine (SVM) classifier with a radial basis kernel function45. A first decision was returned at crop level by the classifier: a classification score ranging from 0 to 1 was assigned to each crop separately. To obtain a single classification score for each WSI, and hence for each individual patient, a vote score thresholding procedure was implemented: the distribution of the scores attributed by the classifier to the crops of a given WSI was computed, and the score corresponding to its 75th percentile was assigned as the final classification score of the WSI and thus of the patient to which the WSI corresponded. Hence, three classification scores were assigned to each individual patient, one for each of the three deep models. These three classification scores were combined according to an ensemble procedure: if the classification scores returned by at least two out of the three deep models exceeded the value th = 0.28, corresponding to the proportion of non-DF patients in the sample population, the maximum score was assigned as the final score. Conversely, if the classification scores returned by at least two out of the three deep models were less than th, the minimum score was assigned as the final score. In the following, we refer to this model as DeepSVM.
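A minimal sketch of this per-fold pipeline is reported below, assuming Python/scikit-learn (the original implementation used MATLAB); the orientation-invariant handling of the per-feature AUC and all function names are assumptions made for illustration.

```python
# Hedged sketch of the per-fold pipeline: AUC-based feature filtering, PCA retaining 80% of the
# variance, an RBF-SVM at crop level, 75th-percentile aggregation per WSI, and the
# two-out-of-three ensemble rule.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

TH = 0.28  # ensemble threshold (proportion of non-DF patients in the public cohort)

def fit_single_deep_model(X_train, y_train):
    """Filter features by per-feature AUC (>= 60%), then fit PCA (80% variance) + RBF-SVM."""
    aucs = np.array([roc_auc_score(y_train, X_train[:, j]) for j in range(X_train.shape[1])])
    keep = np.maximum(aucs, 1 - aucs) >= 0.60   # orientation-invariant AUC filter (an assumption)
    model = make_pipeline(PCA(n_components=0.80, svd_solver="full"),
                          SVC(kernel="rbf", probability=True))
    model.fit(X_train[:, keep], y_train)
    return keep, model

def wsi_score(model, keep, X_crops):
    """Aggregate the crop-level scores of one WSI into a single score (75th percentile)."""
    crop_scores = model.predict_proba(X_crops[:, keep])[:, 1]
    return np.percentile(crop_scores, 75)

def ensemble_score(scores):
    """Two-out-of-three voting on the (ResSVM, DenseSVM, InceptionSVM) scores of one patient."""
    scores = np.asarray(scores)
    return scores.max() if np.sum(scores > TH) >= 2 else scores.min()

print(ensemble_score([0.41, 0.35, 0.12]))  # two scores exceed TH -> the maximum (0.41) is returned
```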

As a last step of the analysis, an SVM classifier exploiting the clinical features summarized in Table 1 was designed. This model, called Clinical, returned a classification score for each individual patient.

Finally, a soft voting technique, consisting in averaging the classification scores of the different models, was implemented to combine the scores obtained by DeepSVM and Clinical; a minimal example is given below. The resulting model was named DeepSVM + Clinical. Of the two models, DeepSVM and DeepSVM + Clinical, the one returning the best predictive performance on the CPTAC-CM public dataset was further validated by using the 43 patients of the CPTAC-CM public dataset as training set and the validation cohort of 11 patients recruited from our Institute as test set.
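```python
# Soft voting as described above: the DeepSVM and Clinical scores of a patient are averaged.
def soft_vote(deepsvm_score: float, clinical_score: float) -> float:
    return (deepsvm_score + clinical_score) / 2.0

print(soft_vote(0.62, 0.40))  # -> 0.51
```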

Statistical analysis and performance evaluation

The association between each clinical characteristic and disease-free survival was evaluated by means of statistical tests on the overall dataset: the Wilcoxon-Mann-Whitney test46 was used for continuous features, whereas the Chi-squared test47 was employed for features measured on an ordinal scale. A result was considered statistically significant when the p-value was less than 0.05.
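As a sketch only, these two tests could be reproduced as follows (SciPy is assumed here; all data are random placeholders, not the study data).

```python
# Illustrative association tests: Wilcoxon-Mann-Whitney for a continuous variable,
# Chi-squared for an ordinal one; significance is declared when p < 0.05.
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(0)
age = rng.normal(60, 12, size=43)        # placeholder continuous clinical variable
outcome = rng.integers(0, 2, size=43)    # placeholder DF (0) / non-DF (1) outcome
stage = rng.integers(1, 4, size=43)      # placeholder ordinal variable (stage I-III)

_, p_age = mannwhitneyu(age[outcome == 0], age[outcome == 1])
contingency = np.array([[np.sum((stage == s) & (outcome == o)) for o in (0, 1)] for s in (1, 2, 3)])
_, p_stage, _, _ = chi2_contingency(contingency)
print(f"p(age) = {p_age:.3f}, p(stage) = {p_stage:.3f}")
```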

The predictive performances of the developed models were evaluated in terms of AUC and, once the optimal threshold was identified by Youden's index on the ROC curve48, standard metrics such as accuracy, sensitivity, and specificity were also computed:

$$\text{Accuracy} = (\text{TP} + \text{TN}) / (\text{TP} + \text{TN} + \text{FP} + \text{FN}),$$
$$\text{Sensitivity} = \text{TP} / (\text{TP} + \text{FN}),$$
$$\text{Specificity} = \text{TN} / (\text{TN} + \text{FP}),$$

where TP and TN stand for True Positive (number of non-DF cases correctly classified) and True Negative (number of DF cases correctly classified), while FP (number of DF cases identified as non-DF cases) and FN (number of non-DF cases identified as DF cases) are the False Positive and False Negative cases, respectively. In this paper, since we solved a binary classification problem (DF cases vs non-DF cases) on imbalanced data (see Tables 1, 2), we also computed two further metrics, i.e. the F1-score and the Geometric mean (G-mean), which have been suggested as suitable performance measures for the analysis of imbalanced datasets49.

The F1-score works well for imbalanced datasets since the relative contributions of precision and sensitivity are weighted equally:

$$\text{F1-score} = 2 \times (\text{sensitivity} \times \text{precision}) / (\text{sensitivity} + \text{precision}),$$

with precision = TP/(TP + FP).

The Geometric Mean (G-mean) measures the balance between the classification performances on the two classes, thus avoiding overfitting the majority class and underfitting the minority class:

$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}.$$
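A minimal sketch of this evaluation step (Youden's index for the operating threshold, followed by the metrics defined above) is given below, assuming scikit-learn and treating non-DF as the positive class; the labels and scores shown are hypothetical.

```python
# Sketch of the performance evaluation: Youden's index selects the operating threshold on the
# ROC curve, then the metrics defined above are computed (non-DF is the positive class).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate(y_true, scores):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = thresholds[np.argmax(tpr - fpr)]          # Youden's J = sensitivity + specificity - 1
    y_pred = (scores >= best).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    prec = tp / (tp + fp) if tp + fp else 0.0
    return {
        "AUC": roc_auc_score(y_true, scores),
        "Accuracy": (tp + tn) / len(y_true),
        "Sensitivity": sens,
        "Specificity": spec,
        "F1-score": 2 * sens * prec / (sens + prec) if sens + prec else 0.0,
        "G-mean": np.sqrt(sens * spec),
    }

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])          # hypothetical non-DF (1) / DF (0) labels
scores = np.array([0.8, 0.3, 0.4, 0.6, 0.2, 0.7, 0.5, 0.1])
print(evaluate(y_true, scores))
```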

Ethics approval and consent to participate

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Scientific Board of Istituto Tumori ‘Giovanni Paolo II’, Bari, Italy (prot. 17729/2020). The authors affiliated to Istituto Tumori “Giovanni Paolo II”, IRCCS, Bari are responsible for the views expressed in this article, which do not necessarily represent those of the Institute.