An eXplainable Artificial Intelligence analysis of Raman spectra for thyroid cancer diagnosis

Raman spectroscopy shows great potential as a diagnostic tool for thyroid cancer due to its ability to detect biochemical changes during cancer development. This technique is particularly valuable because it is non-invasive and label/dye-free. Compared to molecular tests, Raman spectroscopy analyses can more effectively discriminate malignant features, thus reducing unnecessary surgeries. However, one major hurdle to using Raman spectroscopy as a diagnostic tool is the identification of significant patterns and peaks. In this study, we propose a Machine Learning procedure to discriminate healthy/benign versus malignant nodules that produces interpretable results. We collect Raman spectra obtained from histological samples, select a set of peaks with a data-driven and label-independent approach, and train the algorithms on the relative prominence of the peaks in the selected set. The performance of the considered models, quantified by the area under the Receiver Operating Characteristic curve, exceeds 0.9. To enhance the interpretability of the results, we employ eXplainable Artificial Intelligence and compute the contribution of each feature to the prediction of each sample.

The article is organized as follows. In the "Results" section we present the feature engineering procedure applied to the dataset of Raman spectra and the Artificial Intelligence workflow, consisting of the identification of the optimal Machine Learning classifier and the interpretation of its outcomes through XAI, and we investigate the limitations of the proposed approach when applied to spectra with anomalous properties. In the "Discussion" section we focus on the insights and implications of this work and present perspectives for future research. Finally, in the "Methods" section, we provide a technical description of the clinical, spectroscopy and computational steps of the study.

Results
The study proposed in the present research consists of three conceptual blocks, highlighted in Fig. 1: (i) the clinical step, which includes patient enrollment, surgical excision of thyroid glands and pathological evaluation; (ii) the spectroscopy step, in which the Raman spectra associated with each histological sample are obtained; (iii) the Artificial Intelligence step, in which we implement a Machine Learning classifier to distinguish the spectra labeled as healthy/benign from those diagnosed with carcinoma, and then interpret the predictions provided by the model through an XAI analysis. The procedures for carrying out the first two steps are described in detail in the "Methods" section. In the following, instead, we shall focus on the results of the Artificial Intelligence workflow: the classification performance of different Machine Learning algorithms, the fingerprints of the spectra that most influence predictions and, finally, the limitations of the model in the classification of some specific case studies, hereinafter called ambiguous spectra, which present anomalous characteristics and are therefore particularly interesting in view of a possible application of the proposed framework in a clinical context.

Data and feature engineering
The dataset employed in this study comprises 59 Raman spectra obtained from histological samples (tissue slices) excised from the thyroids of individuals with suspected cancer. The samples were examined by the Unit of Endocrine Organs and Neuromuscolar Pathology of Fondazione Policlinico Universitario Campus Bio-Medico, which labeled them as healthy tissues (14 instances), benign adenoma (11), or one of the three most common types of carcinoma: PTC (25), FC (4), and FV-PTC (5). The aims of the analysis are to implement a Machine Learning algorithm capable of classifying Raman spectra, distinguishing healthy or benign nodules (25) from those associated with a cancer diagnosis (34), and to identify the main determinants of the model's predictions using XAI.
The computational workflow starts from a preprocessing stage for the identification of peaks, described in the "Methods" section, in which spectra are interpolated, normalized and fitted with a univariate Gaussian mixture model. This preprocessing phase detects 32 peaks in the spectra and assigns a mean Raman shift value µ_i and a standard deviation σ_i to each of them, which define an interval [µ_i − σ_i, µ_i + σ_i] for each peak. To avoid redundancy, two intervals are initially removed, namely those corresponding to i = 27 and i = 30, as they are entirely contained within at least one of the other intervals. Subsequently, we merge pairs of partially overlapping intervals (i, j) into a single interval [min(µ_i − σ_i, µ_j − σ_j), max(µ_i + σ_i, µ_j + σ_j)]; this results in the merging of the intervals originally labelled as i = 23 and i = 24. The selection process described above yields a set of 29 non-overlapping intervals, which are henceforth identified with new indexes ranging from 1 to 29. The boundaries of these intervals can be found in Supplementary Table S1. It should be noted that these spectral bands have been identified through a completely data-driven and unsupervised approach from the analysis of the aggregate distribution of all spectra, without any information regarding their diagnostic labels.
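The interval pruning and merging described above can be sketched as follows (a minimal illustration; the peak means and standard deviations are hypothetical, not the values fitted in the study):

```python
# Sketch of the interval-selection step: drop intervals fully contained in
# another, then merge partially overlapping ones into their union.
def build_intervals(peaks):
    """peaks: list of (mu, sigma) pairs from the Gaussian mixture fit."""
    ivals = [(mu - s, mu + s) for mu, s in peaks]
    # 1) remove intervals entirely contained within another interval
    kept = [iv for iv in ivals
            if not any(o != iv and o[0] <= iv[0] and iv[1] <= o[1] for o in ivals)]
    # 2) merge partially overlapping intervals into [min of lows, max of highs]
    kept.sort()
    merged = [kept[0]]
    for lo, hi in kept[1:]:
        if lo <= merged[-1][1]:                    # overlaps the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# hypothetical peaks: the second is contained in the first, the
# third and fourth partially overlap
print(build_intervals([(1000, 10), (1005, 3), (1150, 8), (1160, 8), (1300, 5)]))
```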
As previously reported in the literature, the ability to distinguish between spectra of healthy/benign and carcinoma tissues is attributed to the presence or prominence of various types of spectral lines, including reduced and oxidised cytochrome and carotenoids 18. Figure 2 shows the typical structure of the Raman spectra considered in this study, highlighting the characteristic fingerprints of healthy, benign or different types of carcinoma (PTC, FV-PTC, FC) tissues. The relevant characteristic bands that allow healthy/benign and carcinoma tissues to be distinguished correspond to specific categories of molecules, highlighted in Fig. 2. To compare the spectra in a Machine Learning framework, we create features based on the highest intensity value P_k (prominence) in each of the 29 intervals. These features rely on the 812 ratios P_k/P_ℓ (k, ℓ = 1, …, 29, k ≠ ℓ) between prominences referred to different intervals. However, the features in the pair (P_k/P_ℓ, P_ℓ/P_k) are not independent, and it is not clear which one to choose beforehand. In fact, it is not possible to make a selection by comparing prominence values, since they generally change their hierarchy depending on whether the spectrum corresponds to a healthy/benign or cancerous tissue. Additionally, choosing only, e.g., P_k/P_ℓ with k < ℓ is arbitrary. Thus, for each pair (P_k/P_ℓ, P_ℓ/P_k), we evaluate the distributions of P_k/P_ℓ and P_ℓ/P_k on the entire dataset and keep the quantity characterized by the largest ratio between mean value and standard deviation, discarding the other. After this selection process, the number of features is reduced to 406.
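The ratio-based feature construction and the mean/standard-deviation selection rule can be sketched as follows (using a hypothetical prominence matrix in place of the study's measured values):

```python
import numpy as np

# Hypothetical prominences: rows = 59 spectra, columns = the 29 intervals.
rng = np.random.default_rng(0)
P = rng.uniform(0.1, 1.0, size=(59, 29))

features = {}
for k in range(29):
    for l in range(k + 1, 29):
        r_kl, r_lk = P[:, k] / P[:, l], P[:, l] / P[:, k]
        # keep the direction with the larger mean/std ratio, discard the other
        if r_kl.mean() / r_kl.std() >= r_lk.mean() / r_lk.std():
            features[f"P{k+1}/P{l+1}"] = r_kl
        else:
            features[f"P{l+1}/P{k+1}"] = r_lk

print(len(features))                               # 29 * 28 / 2 = 406 retained features
```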

Artificial intelligence workflow
Figure 3 outlines the Artificial Intelligence procedure that has been implemented in this study to develop a Machine Learning classifier of healthy/benign and carcinoma spectra and interpret its outcomes through XAI. The workflow contains two nested loops: an outer loop, represented in Fig. 3 as a green rectangle, which consists of multiple executions of the Synthetic Minority Over-sampling TEchnique (SMOTE) 53, and an inner loop, represented by a red rectangle, where a leave-one-out classification procedure is performed. This computational pipeline has been specifically designed to address the case study under consideration, which is based on a dataset of limited size in which the two classes to be distinguished are moderately unbalanced. In particular, the leave-one-out cycle makes it possible to optimize the availability of information to train the algorithm, despite the dataset containing a small number of samples. During the ith leave-one-out trial, where i ranges from 1 to 59, spectrum S_i is treated as a test instance, and the feature set is initially reduced using the Boruta feature selection process applied to the 58 remaining training samples S_j (j ≠ i). This algorithm evaluates the importance of each feature by measuring the variation in model performance under random shuffling. To ensure control and reproducibility of this process, we conduct 100 Boruta iterations on the dataset of 58 items and 406 features, corresponding to different values of the internal parameter random_state ranging from 1 to 100. We select the N_i features that are chosen by all the Boruta runs and pass them to the next steps of the Machine Learning workflow depicted in Fig. 3. Then, to compensate for imbalances in the training set consisting of 58 spectra and N_i features, we oversample the minority class therein by applying the SMOTE approach. The set thus obtained is used to train a Machine Learning algorithm, which is then validated on the test instance, namely the spectrum S_i.
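The structure of the nested workflow can be illustrated with the following sketch on synthetic data. Note that the simple minority-class duplication below is only a stand-in for SMOTE, the variance filter is a placeholder for the 100-run Boruta consensus, and the feature matrix is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(59, 20))                      # synthetic stand-in for the ratio features
y = np.array([0] * 25 + [1] * 34)                  # 25 healthy/benign, 34 cancer
X[y == 1] += 0.8                                   # make the two classes separable

scores = np.empty(len(y))
for i in range(len(y)):                            # leave-one-out: S_i is the test instance
    tr = np.delete(np.arange(len(y)), i)
    Xt, yt = X[tr], y[tr]
    keep = Xt.std(axis=0) > 0.5                    # placeholder for the Boruta consensus
    Xsel = Xt[:, keep]
    counts = np.bincount(yt)
    minority = np.flatnonzero(yt == counts.argmin())
    extra = rng.choice(minority, size=counts.max() - counts.min())
    Xb = np.vstack([Xsel, Xsel[extra]])            # naive duplication, stand-in for SMOTE
    yb = np.concatenate([yt, yt[extra]])
    clf = RandomForestClassifier(n_estimators=50, max_depth=5,
                                 criterion="entropy", random_state=0).fit(Xb, yb)
    scores[i] = clf.predict_proba(X[i:i + 1, keep])[0, 1]

print(round(roc_auc_score(y, scores), 3))
```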
In this workflow we consider the following Machine Learning algorithms: Random Forest, XGBoost, Support Vector Machine and Gaussian Naïve Bayes; for each of them, we explore the internal parameter space in order to identify the optimal configuration. The performance of an algorithm on the entire dataset, i.e., on the whole leave-one-out cycle, is quantified through the area under curve (AUC). This metric is obtained by evaluating the algorithm performance at varying classification thresholds, representing the results as points in a plane where the horizontal and vertical coordinates correspond to the false positive and true positive rates, respectively, and finally computing the area comprised between the receiver operating characteristic (ROC) curve, namely the line connecting the points, and the horizontal axis.
Since the SMOTE algorithm includes random steps, as explained in detail in the "Methods" section, we account for the variability arising from its application by performing 100 leave-one-out cycles for each of the Machine Learning algorithms listed above, keeping their internal parameters fixed. Each trial is associated with a distinct value, ranging from 1 to 100, of the random_state parameter of the SMOTE algorithm that is implemented on the training set. As a result, for each model and each configuration of internal parameters, we acquire a distribution of 100 AUC values, whose median is used as a proxy of the model's effectiveness. Then, the best classifier is obtained upon comparison of the median AUC values obtained for the different algorithms and internal parameter configurations.
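Aggregating the per-seed results into the median-AUC proxy is straightforward; the AUC values below are illustrative, not those of the study:

```python
import numpy as np

# One AUC value per SMOTE random_state (illustrative, only 5 shown)
aucs = np.array([0.93, 0.95, 0.94, 0.96, 0.94])
median = np.median(aucs)                           # proxy of the model's effectiveness
q75, q25 = np.percentile(aucs, [75, 25])
iqr = q75 - q25                                    # spread across the SMOTE runs
print(median, iqr)
```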
Finally, we inspect the functioning of the best classifier through the XAI approach, by collecting the SHAP values of the different (feature, prediction) pairs and averaging each of them over the 100 SMOTE runs. SHAP values enhance the interpretability of the classifier, as they quantify the impact of the different features on the model's predictions, revealing connections between Raman spectral properties and diagnoses. In the following subsections we show the results of the Artificial Intelligence workflow concerning the Machine Learning and XAI steps.

Machine learning classifier
In this study, we identify as the best classifier the one with the highest median AUC over the 100 runs of the SMOTE algorithm. If two or more classifiers have the same median AUC, we select the one with the lowest interquartile range (IQR) of the AUC distribution. As highlighted in Fig. 3, we compared the performance of multiple algorithms, namely Random Forest, XGBoost, Support Vector Machine, and Gaussian Naïve Bayes. The parameter space explored for each algorithm is detailed in the "Methods" section. The best algorithm in terms of AUC for the leave-one-out healthy/benign-versus-cancer tissue classification is Random Forest (median 0.9441, interquartile range 0.0049) with n_estimators=50, max_depth=5 or 10, and either criterion='entropy' or 'log_loss' (providing the same results). For definiteness, we will henceforth take as a reference the case with n_estimators=50, max_depth=5, and criterion='entropy'. The performance of all the examined algorithms, for the different configurations of their internal parameters, is reported in the Supplementary Information (including Supplementary Table S2). Although the classification outcomes of XGBoost and Support Vector Machine depend on the chosen values of their internal parameters, their median AUC values are independent of the specific configuration. For the Gaussian Naïve Bayes classifier, no internal parameter variation has been performed, thus providing a single median AUC value, computed over 100 SMOTE algorithm runs.
Figure 4 shows the median ROC curves corresponding to the best Random Forest classifier, along with those obtained for the XGBoost and Support Vector Machine algorithms with arbitrary internal parameters, and for the Gaussian Naïve Bayes one. From the analysis of the ROC curves, it is possible to identify, for each model, the optimal classification threshold as the one that maximizes a specific metric of interest. Following a widely established criterion, we set as the target metric to be maximized G = √(Sensitivity × Specificity), namely the geometric mean of sensitivity and specificity, quantifying the balance between these two performance indicators. To determine the optimal classification threshold we choose, for each SMOTE run, the one maximizing G. The distribution of classification thresholds found in this way for the best classifier has median 0.5 and IQR 0.065. Figure 5 shows the normalized confusion matrix produced in this optimal case by aggregating the predictions from the 100 runs with varying SMOTE random seeds. The analogous results for the other Machine Learning algorithms are reported in Supplementary Fig. S1.
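Selecting the threshold that maximizes G can be done directly from the ROC curve; the labels and scores below are illustrative, not the study's predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])    # illustrative labels
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
G = np.sqrt(tpr * (1 - fpr))                       # sensitivity = TPR, specificity = 1 - FPR
best = thresholds[np.argmax(G)]                    # optimal classification threshold
print(best, G.max())
```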
Since the presented classification outcomes are averages over 100 SMOTE runs, it is important to evaluate how much the randomness entailed in the artificial oversampling of the minority class in the training set impacts the predictions of test set instances. The stability of the classification outcomes provided by the best classifier with threshold 0.5 is satisfactory: 51 spectra out of 59 are classified in the same way in all the runs, 3 spectra show a classification variability below 10%, 3 between 10% and 15%, and only 2 have higher rates of discrepancy.
Although the proposed model has been optimized for the healthy/benign-versus-cancer tissue classification, it is natural to wonder whether its outcomes, expressed in terms of prediction probability, can be retroactively used to discern diagnostic categories in more detail, also distinguishing healthy spectra from those generated by benign nodules, as well as the different types of carcinoma. After computing the median prediction probabilities of each spectrum over the 100 SMOTE runs, we aggregate the results based on diagnostic categories and then compare the respective distributions. Median prediction probabilities corresponding to the different labels are: 0 (with

XAI analysis
As a reference for the XAI analysis, we consider the best performing Random Forest classifier, averaging over all the runs the SHAP values associated with the features provided in input by Boruta. A SHAP value of 0 is automatically assigned to a feature in a given run if that feature is not selected. In general, SHAP values indicate how a specific feature influences the prediction associated with a given instance. In our analysis, negative and positive SHAP values correspond to a feature's contribution towards assigning the healthy/benign and cancer labels to an instance, respectively. Each data point in the summary plot depicted in Fig. 6 represents the SHAP value of a particular feature for a specific instance. Higher absolute SHAP values indicate a greater feature relevance in the prediction.
The most influential features, i.e. those with the top 20 mean absolute SHAP values on the entire dataset, are the following:
• P_24/P_11, with interval #24 containing the line 1376 cm−1 (oxidised cytochrome b) and interval #11 not containing lines associated with known categories of molecules; higher values drive the classifier towards the healthy/benign prediction.
• P_29/P_17, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_8/P_20, with intervals #8 and #20 not containing lines associated with known categories of molecules; higher values drive the classifier mostly towards the cancer prediction.
• P_29/P_18, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #18 not containing lines associated with known categories of molecules; higher values drive the classifier towards the healthy/benign prediction.
• P_2/P_17, with interval #2 not containing lines associated with known categories of molecules and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_29/P_13, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #13 containing the line 1003 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_24/P_17, with interval #24 containing the line 1376 cm−1 (oxidised cytochrome b) and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_29/P_8, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #8 not containing lines associated with known categories of molecules; higher values drive the classifier mostly towards the healthy/benign prediction.
• P_16/P_17, with interval #16 containing the line 1125 cm−1 (reduced cytochrome c) and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_18/P_17, with interval #18 not containing lines associated with known categories of molecules and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_24/P_18, with interval #24 containing the line 1376 cm−1 (oxidised cytochrome b) and interval #18 not containing lines associated with known categories of molecules; higher values drive the classifier towards the healthy/benign prediction.
• P_24/P_27, with interval #24 containing the line 1376 cm−1 (oxidised cytochrome b) and interval #27 containing the line 1516 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_24/P_13, with interval #24 containing the line 1376 cm−1 (oxidised cytochrome b) and interval #13 containing the line 1003 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_20/P_13, with interval #20 not containing lines associated with known categories of molecules and interval #13 containing the line 1003 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_17/P_11, with interval #17 containing the line 1155 cm−1 (carotenoids) and interval #11 not containing lines associated with known categories of molecules; higher values drive the classifier towards the cancer prediction.
• P_23/P_17, with interval #23 not containing lines associated with known categories of molecules and interval #17 containing the line 1155 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_9/P_20, with intervals #9 and #20 not containing lines associated with known categories of molecules; higher values drive the classifier mostly towards the cancer prediction.
• P_29/P_27, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #27 containing the line 1516 cm−1 (carotenoids); higher values drive the classifier towards the healthy/benign prediction.
• P_17/P_12, with interval #17 containing the line 1155 cm−1 (carotenoids) and interval #12 not containing lines associated with known categories of molecules; higher values drive the classifier towards the cancer prediction.
• P_29/P_9, with interval #29 containing the line 1638 cm−1 (oxidised cytochrome c) and interval #9 not containing lines associated with known categories of molecules; higher values drive the classifier towards the healthy/benign prediction.
These results are further analyzed and interpreted in the "Discussion" section.

Ambiguous samples
Although the proposed model has provided very satisfactory performance in terms of median AUC, and can be straightforwardly interpreted through the XAI approach, it is worth investigating its limitations when applied to the classification of spectra with anomalous properties. For this purpose, we consider 13 additional instances, henceforth called ambiguous samples, whose characteristics differ in many respects from the canonical ones expected on the basis of their diagnosis. The 100 runs of the best classifier are applied to all 72 available samples, namely the 59 spectra included in the original dataset and the newly added 13 ambiguous ones. Such runs correspond to values of the SMOTE random_state internal parameter ranging from 1 to 100. The results presented below are obtained through a two-step process:
• classification of the 59 unambiguous spectra with the same leave-one-out procedure displayed in Fig. 3, performed on the original dataset (which does not comprise the ambiguous samples);
• classification of the 13 ambiguous spectra with the same algorithm, trained on the 59 unambiguous ones.
Predictably, the model performance is reduced in the presence of ambiguous spectra. The resulting AUC distribution from the 100 runs has a median of 0.7949 and an interquartile range (IQR) of 0.0135. Figure 7 displays the confusion matrix obtained by combining the predictions from the 100 SMOTE runs on the dataset consisting of 72 spectra. The model erroneously classifies 9 of the 13 ambiguous samples in all runs, one in 99% of the runs, and one in 75% of the runs. Of the 9 samples misclassified in all runs:
• 2 contain PTC cancerous tissue that is erroneously classified as healthy/benign, since the carotenoid lines are not clearly visible in their Raman spectra;
• 1 is healthy/benign, but is classified as cancerous due to the low visibility of the oxidised cytochrome b line at 1376 cm−1;
• 6 are healthy/benign from a histological point of view, but are classified as cancerous due to the presence of mutations, revealed through an immunohistochemical analysis, that determine the presence of carotenoid lines in the spectra. Indeed, the SHAP values associated with these instances indicate that prominence ratios involving carotenoids have the largest impact on the algorithm's decision, as expected, since these lines are a characteristic feature of samples associated with a carcinoma diagnosis, particularly PTC and FV-PTC.
The sample misclassified in 99% of the runs corresponds to a case of FC, difficult to classify due to the absence of carotenoid lines and the scarce representation of this subclass in the dataset. Finally, the sample misclassified in 75% of the runs is labelled as healthy/benign, but is classified as cancerous due to the low visibility of the oxidised cytochrome b line at 1376 cm−1.

Discussion
In this research work we have developed an Artificial Intelligence workflow capable of interpreting the Raman spectrum of a given specimen and providing a highly reliable prediction of the thyroid lesion's malignancy. A strength of this approach is that the classifier implementation is based on a completely data-driven feature engineering step. Indeed, the preprocessing pipeline identifies the relevant intervals, from which peak prominences and impactful variables are extracted, in a completely unbiased manner, without using any information about the diagnostic labels associated with the spectra. Besides being accurate, the predictions provided by the best classifier are also directly interpretable. The results of the XAI analysis indicate a clear pattern consistent with established knowledge: a spectrum with dominant carotenoid lines tends to indicate a cancer diagnosis, while a spectrum with dominant oxidised cytochrome b and c lines is generally associated with a healthy/benign diagnosis. Among the top 20 most impactful features, the prominence referred to the band at 1155 cm−1 (interval #17), corresponding to carotenoids, is the most recurrent, as it appears 8 times, 5 of which in the top ten of the same ranking. On the other hand, prominences associated with the carotenoid bands at 1003 cm−1 (interval #13) and 1516 cm−1 (interval #27) appear only a few times and have a lower impact on the predictions. This finding is noteworthy because it suggests a potentially different importance hierarchy for cancer biomarkers. Further investigation is needed to determine whether this hierarchy holds when analyzing a larger dataset. The most impactful features also include the two lines at 1376 cm−1 (interval #24) and 1638 cm−1 (interval #29), corresponding to oxidised cytochromes b and c, respectively. It is worth mentioning that only one reduced cytochrome c line, specifically that at 1125 cm−1 (interval #16), appears among the influential model features reported in Fig. 6. This is because reduced cytochrome c lines are not able to definitively differentiate between healthy/benign and cancer-related spectra, as they are prominent in spectra with both FC and FV-PTC tumor diagnoses, while being undetectable in the case of PTC.
The present study was conducted with a dataset of limited size, characterized by a subject imbalance between the healthy/benign and cancer categories. The Machine Learning pipeline, shown in Fig. 3, employs two nested iterative procedures, SMOTE and leave-one-out, both designed to tackle datasets with such problematic features, providing satisfactory performance. Nonetheless, the limited size of the dataset and the scarce representation of the FC and FV-PTC subclasses prevented us from implementing a more targeted Artificial Intelligence workflow that could distinguish between the different thyroid carcinoma categories. On the other hand, the performed analysis provided encouraging indications that the proposed algorithm could accomplish this kind of classification when trained on a larger dataset. Indeed, in a previous study 18, a clustering linkage algorithm highlighted a ranking of the different types of carcinoma tissue based on their similarity with the healthy/benign one; the extremes of this hierarchy are FC (most similar) and PTC (least similar). Remarkably, the same ordering seems to emerge also in the present research work, by comparing the distributions of the median prediction probabilities, computed over the 100 SMOTE runs, referred to spectra belonging to different diagnostic categories. The corroboration of this finding would benefit from a larger dataset containing enough representatives of each category, which we plan to investigate in future research to further validate our workflow and its potential applicability in a clinical setting.
To identify the limitations of the proposed model, we tested its ability to classify spectra that have anomalous characteristics, inconsistent with those expected on the basis of their diagnostic label. The application of the optimal algorithm to the 6 instances with mutations is noteworthy. These samples, misclassified in 100% of SMOTE runs, are considered healthy based on histological analysis, but an immunohistochemical test reveals the presence of mutations resulting in peaks corresponding to carotenoids. It is possible that these samples were excised before the onset of the disease. However, medical opinions on this matter are divided, and it is not universally accepted that tissues exhibiting these characteristics will inevitably progress to cancer. In such cases, it is generally recommended to operate on the patient. While the classifier may make formal assignment errors on such histological samples, it highlights an interesting class of tissue from a clinical perspective.
The results of the study suggest that the use of Artificial Intelligence for the healthy/benign-versus-cancer classification of histological samples can lay the foundations for promising innovations in the clinical field, allowing the development of new devices to support diagnosis. In fact, the proposed workflow has the potential to enable fast and nearly real-time lesion classification, standardize Raman spectra interpretation and reduce costs associated with patient management, especially if its application is extended to samples that can be acquired with less invasive procedures, such as fine needle aspiration.

Methods

Study enrollment, clinical evaluation and tissue preparation
The enrollment phase, managed by the Unit of Metabolic Bone and Thyroid Diseases of Fondazione Policlinico Universitario Campus Bio-Medico, lasted from January 2018 to October 2021. All patients underwent a US scan of the thyroid gland and neck area, performed with a frequency range of 10-12 MHz on a MyLab 50 (Esaote, Genova, Italy) by 2 experienced endocrinologists at the Metabolic Bone and Thyroid Disorders Unit. The observed nodules were classified according to the ACR TI-RADS risk stratification criteria 54. In doubtful cases the endocrinologists conducted a separate session to reach a unified consensus. Patients with clinical or US characteristics indicating the need to perform fine needle aspiration according to the literature 55 were asked to sign the informed consent to participate in the study. Only patients who underwent surgery (total thyroidectomy) after a cytological diagnosis of an indeterminate, suspicious or malignant nodule, according to the international guidelines 56, were definitively enrolled in this study. The excised thyroid glands were delivered to the Unit of Endocrine Organs and Neuromuscolar Pathology of the same Institution in a properly labelled container. Here, after evaluating the gross anatomy of the samples, tissue slices of about 1 cm × 1 cm × 3 mm were cut, including both healthy and neoplastic areas and avoiding surgical margins. Tissue slices were snap frozen on the cold plate of a cryostat. A 5 µm section was stained with haematoxylin/eosin to confirm the presence of healthy and cancerous tissues, and then 4 consecutive cryostatic sections were cut at a thickness of 30 µm, collected on separate slides and stored at a temperature of −20 °C.
The Raman analysis was exclusively performed on these frozen unfixed samples. For definitive histology, the residual slices were defrosted, fixed in formalin, and embedded in paraffin along with the paired surgical samples. The final diagnosis was made in agreement with the current edition of the WHO classification of endocrine tumours 57. Paraffin sections from neoplastic areas of each patient were used for immunohistochemical analysis of Galectin3 (Gene Tex), CD56 (Agilent), and HBME1 (Agilent) using an automated immunostainer (Omnis, Agilent) 18.

Raman measurements
We acquired Raman spectra using a Renishaw InVia Micro-Raman spectrometer and a solid-state diode laser source at 532 nm, with a nominal output power of about 100 mW, for excitation. In our experimental arrangement, we focused the laser beam onto the sample (an unfixed 30 µm section) and gathered the back-scattered unpolarized intensity using either a Leica 50× LWD objective or an Olympus 100× objective mounted on a Leica DM2700 M confocal microscope. The investigated areas of cancerous and healthy tissues were defined by matching the consecutive sections with the haematoxylin/eosin-stained one described above.
The laser beam could be focused on the sample in a spot of a few microns in diameter, thus allowing for the separation of the signal contribution originating from the cells under investigation. Neutral-density filters were used to reduce the power of the laser beam incident on the sample and prevent photo-damage. Our setup employs a holographic edge filter to ensure high-contrast rejection of the elastically scattered light. A 1800 grooves/mm diffraction grating disperses the Raman inelastic scattering contribution, and a 1024 × 256 pixel, Peltier-cooled CCD camera detects the scattered light. We collected punctual spectra using the extended scan mode across the 100-3600 cm −1 Raman-shift range, with a spectral resolution of approximately 1 cm −1 . For each sample, we carried out five measurements at selected points, with five scans acquired for each point. The cumulative integration time for each point was approximately 50 seconds. The Renishaw Wire software was used to collect the raw spectra and to perform data reduction, such as background and fluorescence subtraction.

Preprocessing of the spectra and feature extraction
The spectra are preprocessed using the following steps. Firstly, all spectra are interpolated to a Raman shift grid with equal spacing of 1 cm −1 . Next, each spectrum is normalized to have a sum of one (area under the curve equal to one), and then cubic spline smoothing is performed. Peak detection is then carried out on each preprocessed spectrum, and the resulting local maxima are collected from all spectra. A univariate Gaussian mixture model with unequal variances is used to fit the distribution of the local maxima across the 59 samples, and the optimal model is selected based on the Bayesian Information Criterion (BIC) 58 . We use the R (version 4.2.2) packages gsignal 59 (version 0.3-5) to find peaks and mclust 60 (version 6.0.0) to fit a Gaussian mixture model to the histogram of the local maxima.
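The pipeline above is implemented in R (gsignal, mclust) in this study; the following Python translation with scipy and scikit-learn is an illustrative sketch only, and the smoothing factor `s`, the prominence threshold, and `max_k` are placeholder values that would need tuning per dataset.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.signal import find_peaks
from sklearn.mixture import GaussianMixture

def preprocess(shift, intensity, grid, s=1e-7):
    """Interpolate to a 1 cm^-1 grid, normalise to unit area, smooth."""
    y = np.interp(grid, shift, intensity)             # common Raman-shift grid
    y = y / y.sum()                                   # area under curve = 1
    return UnivariateSpline(grid, y, k=3, s=s)(grid)  # cubic spline smoothing

def peak_positions(grid, y, prominence=1e-4):
    """Raman shifts of the local maxima of one preprocessed spectrum."""
    idx, _ = find_peaks(y, prominence=prominence)
    return grid[idx]

def fit_peak_clusters(positions, max_k=10):
    """Pool the local maxima from all spectra and fit univariate Gaussian
    mixtures with unequal variances, keeping the model with the lowest BIC."""
    X = np.asarray(positions, dtype=float).reshape(-1, 1)
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, max_k + 1)]
    return min(models, key=lambda m: m.bic(X))  # component means = consensus peaks
```

The means of the selected mixture components then play the role of the consensus peak locations used downstream.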

Boruta feature importance
In order to mitigate the effects of noise and data redundancy, we utilize a wrapper method for feature selection based on the Boruta framework 61 . This procedure identifies only those features that are uncorrelated with each other and significantly improve the performance of the machine learning algorithm. The Boruta feature selection tool is built on a supervised Random Forest algorithm, whose founding concept it exploits: randomizing the training samples and perturbing the system helps to mitigate the negative impact of random fluctuations and correlations in the learning model. In the Boruta framework, the original set of features is expanded by adding shadow features, constructed by randomly shuffling the values of each original indicator. This augmented dataset is then used to train a Random Forest algorithm, which makes predictions and evaluates the importance of both the original and shadow features. Boruta selects the features that, within the dataset, provide statistically more accurate predictions than those obtainable by replacing them with their shadow counterparts, after a series of independent shuffling operations. As a result, the competition among features in Boruta does not require an arbitrary importance threshold to determine which variables are relevant, as is often necessary in traditional feature selection techniques.
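The shadow-feature idea can be illustrated with a stripped-down sketch (not the full Boruta procedure, which wraps the hit counts in a binomial test and iteratively confirms or rejects features); the trial count and forest size below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_hits(X, y, n_trials=10):
    """Simplified Boruta core: count how often each real feature is more
    important than the best shadow feature across repeated shufflings."""
    rng = np.random.default_rng(0)
    n_feat = X.shape[1]
    hits = np.zeros(n_feat, dtype=int)
    for t in range(n_trials):
        shadows = rng.permuted(X, axis=0)    # shuffle each column: breaks link to y
        Xa = np.hstack([X, shadows])
        rf = RandomForestClassifier(n_estimators=100, random_state=t).fit(Xa, y)
        imp = rf.feature_importances_
        real, shadow = imp[:n_feat], imp[n_feat:]
        hits += (real > shadow.max()).astype(int)
    return hits  # Boruta would now test these counts against a binomial null
```

A feature whose hit count is significantly above chance would be confirmed as relevant; one significantly below would be rejected.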

SMOTE algorithm
Imbalanced classification refers to the task of building predictive models on classification datasets where one class has significantly fewer examples than the other. The main difficulty of working with imbalanced datasets is that standard machine learning techniques often ignore the minority class, leading to poor performance on it. A common solution to this problem is to oversample the minority class, which involves duplicating its examples in the training dataset prior to model fitting. Although this can balance the class distribution, it does not add any new information to the model. A more effective strategy is to generate new instances by synthesizing them from the samples already existing in the minority class. This approach, the Synthetic Minority Over-sampling Technique (SMOTE), creates synthetic examples by interpolating between a minority-class sample and one of its nearest minority-class neighbours in feature space.
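The SMOTE interpolation step can be sketched in a few lines; the imbalanced-learn package provides the standard implementation, so this hand-rolled version is for illustration only, with `k` and the uniform interpolation weight following the usual formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a minority seed, pick
    one of its k nearest minority neighbours, and draw a random point on the
    segment between them."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    out = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        s = rng.integers(0, len(X_min))                  # random seed sample
        nb = X_min[rng.choice(idx[s][1:])]               # random neighbour, excluding self
        out[j] = X_min[s] + rng.random() * (nb - X_min[s])
    return out
```

Because every synthetic point lies on a segment between two existing minority samples, the oversampled class stays inside the convex hull of the original minority data.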

Figure 1. General workflow of the analysis.

Figure 2. Raman spectra. Typical Raman spectra of the examined thyroid tissues, labelled according to the histology report. Blue squares correspond to the characteristic Raman peaks of reduced cytochrome c, orange stars indicate the spectral lines of oxidised cytochrome c, green triangles those of oxidised cytochrome b, and red circles those of carotenoids.

Figure 3. Detailed workflow of the Machine Learning and eXplainable Artificial Intelligence (XAI) analysis. After preprocessing, 100 runs of the synthetic minority over-sampling technique (SMOTE) with different random seeds are executed. In each SMOTE run, a leave-one-out classification is implemented, and in the ith leave-one-out iteration (where i ranges from 1 to 59) the Boruta algorithm selects N i relevant features, which are used to construct the training set; then, before implementing the different Machine Learning algorithms, SMOTE is applied to oversample the minority class. The classification algorithms employed in this study are random forest (RF), XGBoost (XGB), support vector machine (SVM), and Gaussian Naïve Bayes (GNB). Their performances are quantified by the AUC metric, the area under the receiver operating characteristic (ROC) curve. The impact of features on the prediction for each instance is evaluated through the Shapley (SHAP) values, averaged over all SMOTE runs.
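The core of this loop, leave-one-out training of a Random Forest and collection of held-out probabilities, can be sketched with scikit-learn. The hyperparameters are those reported for the best model; the Boruta selection and SMOTE oversampling that the full workflow applies to each training fold are omitted here for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

def loo_probabilities(X, y, seed=0):
    """Held-out class-1 probability for every sample via leave-one-out.
    In the full workflow, feature selection and oversampling happen inside
    the loop so the left-out spectrum never influences training."""
    probs = np.empty(len(y), dtype=float)
    for train, test in LeaveOneOut().split(X):
        rf = RandomForestClassifier(n_estimators=50, max_depth=5,
                                    criterion="entropy", random_state=seed)
        rf.fit(X[train], y[train])
        probs[test] = rf.predict_proba(X[test])[:, 1]
    return probs
```

A single run's AUC is then `roc_auc_score(y, loo_probabilities(X, y))`; the 100 SMOTE runs amount to repeating this with different random seeds and summarizing by the median.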

Figure 4. Receiver operating characteristic (ROC) curves for one of the random forest (RF) classifiers that maximize the median AUC (n_estimators = 50, max_depth = 5, criterion = 'entropy'), for the XGBoost (XGB) and support vector machine (SVM) algorithms with arbitrary internal parameters, and for the Gaussian Naïve Bayes (GNB) algorithm. The plots referring to XGB and SVM have been obtained in the configurations num_parallel_tree = 100, max_depth = 3, n_jobs = 1, and c = 1, kernel = 'entropy', respectively. The True Positive Rate and False Positive Rate coordinates of the points on the ROC curves are median values computed over 100 SMOTE runs.
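Taking the median of ROC points over repeated runs can be implemented by interpolating each run's curve onto a common false-positive-rate grid (vertical averaging); this sketch assumes that choice of aggregation, and the grid resolution is arbitrary.

```python
import numpy as np
from sklearn.metrics import roc_curve

def median_roc(y_true, prob_runs, n_grid=101):
    """Pointwise-median ROC across repeated runs: interpolate each run's
    TPR onto a common FPR grid, then take the median at each grid point."""
    grid = np.linspace(0.0, 1.0, n_grid)
    tprs = [np.interp(grid, *roc_curve(y_true, p)[:2]) for p in prob_runs]
    return grid, np.median(tprs, axis=0)
```

Since each interpolated curve is nondecreasing, the pointwise median is also a valid nondecreasing ROC curve.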

Figure 5. Confusion matrix obtained by collecting the predictions of 100 SMOTE runs, with different random seeds, for a Random Forest model with n_estimators = 50, max_depth = 5, and criterion = 'entropy'. Such a model provides the best performance in terms of AUC (median 0.9441, interquartile range 0.0049) among the considered ones.

Figure 6. Summary plot of the mean SHAP values, computed over 100 runs of the SMOTE algorithm with different random seeds, for a Random Forest model with n_estimators = 50, max_depth = 5, and criterion = 'entropy'.
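In practice, SHAP values for tree ensembles are obtained from a dedicated explainer (for instance the shap package's TreeExplainer; the article does not name the implementation, so that is an assumption). For intuition, the underlying quantity can be computed exactly for a handful of features by enumerating feature subsets, with "absent" features replaced by a background value; this brute-force sketch is ours, not the paper's method.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shap(predict, x, background):
    """Exact Shapley values of predict(x) for each feature, where features
    outside the coalition take the background value. O(2^n): toy sizes only."""
    n = len(x)

    def f(subset):
        idx = np.array(subset, dtype=np.intp)
        z = background.copy()
        z[idx] = x[idx]                        # coalition features take x's values
        return float(predict(z.reshape(1, -1))[0])

    phi = np.zeros(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for r in range(n):                     # coalition sizes 0 .. n-1
            for S in combinations(rest, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += w * (f(S + (i,)) - f(S))
    return phi
```

By the efficiency property, the values sum to the difference between the prediction at `x` and the prediction at the background, which is what makes per-sample contribution plots like Fig. 6 additive.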

Figure 7. Confusion matrix quantifying the aggregated performance of 100 SMOTE runs of a Random Forest model with n_estimators = 50, max_depth = 5, and criterion = 'entropy', applied to all 72 available samples, namely the 59 spectra included in the original dataset and the 13 ambiguous spectra. The results are obtained through a two-step process: first, the 59 unambiguous spectra are classified with a leave-one-out procedure not involving the ambiguous ones; then, the 13 ambiguous spectra are classified with the same algorithm trained only on the 59 unambiguous spectra.
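The second step of this procedure reduces to a single fit/predict, since the ambiguous spectra never enter any training fold; the function below is an illustrative sketch with hypothetical array names, using the reported hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def score_held_out(X_unamb, y_unamb, X_amb, seed=0):
    """Train once on the unambiguous spectra and score the held-out
    ambiguous ones, whose predictions are therefore fully out-of-sample."""
    rf = RandomForestClassifier(n_estimators=50, max_depth=5,
                                criterion="entropy", random_state=seed)
    rf.fit(X_unamb, y_unamb)
    return rf.predict(X_amb)
```

Aggregating these predictions with the leave-one-out predictions on the 59 unambiguous spectra yields the 72-sample confusion matrix of Fig. 7.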
The thyroidectomies were carried out at the Unit of Thoracic Surgery of the aforementioned Institution. The study population included 54 subjects (34 females, 20 males) affected by thyroid nodular pathology, with an age distribution centered at 46.3 years (standard deviation 11.2 years). Specimens removed during surgery were promptly submitted unfixed to the Unit of Endocrine Organs and Neuromuscolar Pathology.

Scientific Reports (2023) 13:16590 | https://doi.org/10.1038/s41598-023-43856-7