Raman spectroscopy and topological machine learning for cancer grading

Conti, Francesco; D’Acunto, Mario; Caudai, Claudia; Colantonio, Sara; Gaeta, Raffaele; Moroni, Davide; Pascali, Maria Antonietta

doi:10.1038/s41598-023-34457-5

Article
Open access
Published: 04 May 2023

Scientific Reports volume 13, Article number: 7282 (2023)

Abstract

Over the last decade, Raman spectroscopy has been establishing itself as a highly promising technique for the classification of tumour tissues: it yields biochemical maps of the tissues under investigation, making it possible to observe how their biochemical constituents (proteins, lipid structures, DNA, vitamins, and so on) differ. In this paper, we aim to show that techniques emerging from the cross-fertilization of persistent homology and machine learning can support the classification of Raman spectra extracted from cancerous tissues for tumour grading. In more detail, topological features of Raman spectra and machine learning classifiers are combined in an automatic classification pipeline in order to select the best-performing pair. The case study is the grading of chondrosarcoma in four classes: cross-validation and leave-one-patient-out validation have been used to assess the classification accuracy of the method. The binary classification achieves a validation accuracy of 81% and a test accuracy of 90%. Notably, the test dataset was collected at a different time and with different equipment. These results are achieved by a support vector classifier trained with the Betti Curve representation of the topological features extracted from the Raman spectra, and compare favourably with the existing literature. The added value of such results is that the model for the prediction of the chondrosarcoma grading could easily be implemented in clinical practice, possibly integrated into the acquisition system.

Introduction

Raman spectroscopy (RS) is a noninvasive optical technique sensitive to the molecular composition of biological tissues. RS can therefore be used to optically probe the molecular changes associated with diseased tissues, making it possible to classify cancer malignancy grades¹. A Raman spectrum is a plot of scattered intensity as a function of the energy difference between the incident and scattered photons, and is obtained by pointing a monochromatic laser beam at the tissue under investigation. The loss or gain in photon energy corresponds to the difference between the final and initial vibrational energy levels of the molecules at the specific spots of the tissue investigated. This difference is expressed as a shift in wavenumber, which is characteristic of individual molecules and results in spectrally narrow peaks that can potentially be associated with the vibration of a specific chemical bond².
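As a worked illustration of the wavenumber shift described above, the following minimal sketch converts between wavelengths and Raman shift. The function names are hypothetical, the 532 nm excitation in the example is the laser wavelength reported in the acquisition setup, and the sign convention (a positive shift for energy lost by the photon, i.e. Stokes scattering) is an assumption of this sketch.

```python
def raman_shift_cm1(excitation_nm: float, scattered_nm: float) -> float:
    """Raman shift (cm^-1) between excitation and scattered wavelengths in nm."""
    # A wavelength of L nm corresponds to a wavenumber of 1e7 / L cm^-1.
    return 1e7 / excitation_nm - 1e7 / scattered_nm


def scattered_wavelength_nm(excitation_nm: float, shift_cm1: float) -> float:
    """Wavelength (nm) of the scattered photon for a given Raman shift (cm^-1)."""
    return 1e7 / (1e7 / excitation_nm - shift_cm1)


if __name__ == "__main__":
    # A 1000 cm^-1 Stokes shift under 532 nm excitation lands near 561.9 nm.
    print(round(scattered_wavelength_nm(532.0, 1000.0), 1))
```

The 400–3400 cm\(^{-1}\) range quoted for the instrument therefore corresponds to scattered light at wavelengths longer than the 532 nm excitation.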

Since the grading of cancer tissues is one of the main challenges for pathologists, RS is establishing itself as one of the most promising new techniques for supporting pathologists in making diagnoses as accurate as possible, limiting the false positives and false negatives that are unfortunately still common today, and increasing the overall accuracy of diagnostic protocols^3,4,5,6,7. Recently, RS has been applied to chondrogenic tumour classification with excellent results⁸. Chondrogenic tumours are the second largest group of bone tumours worldwide, and their histologic pattern suggests a deep relationship to hyaline cartilage. Chondrosarcomas are tumours whose malignant cells produce a cartilaginous matrix. When they occur in previously normal bones, they are generally classified as primary chondrosarcomas, whereas secondary chondrosarcomas result from the malignant transformation of a benign cartilaginous lesion. Chondrosarcomas are classified into three malignancy grades: the first (CS G1), the second (CS G2) and the third (CS G3). In addition to these three grades, enchondroma (EC) is the noncancerous counterpart. Distinguishing between EC and CS G1 is a rather critical issue for pathologists, generating many false positive and false negative diagnoses^9,10. RS has proved extremely useful in addressing this problem⁸. Multivariate analysis is the basic discriminant approach for handling Raman data to perform a diagnosis. The first application of multivariate statistics to chondrosarcoma was based primarily on the principal component analysis and linear discriminant analysis (PCA-LDA) algorithm together with leave-one-out cross-validation, yielding sensitivities of 70% between EC and G1 and 90% between G1 and G2. These results indicated that Raman spectroscopy combined with multivariate analysis techniques can be used to explore the biochemical intravariability of the cancerous tissue under investigation⁸. A more recent paper¹¹ exploited a more complex processing scheme, CLARA (CLAssification through wavelet transform of RAman spectra). CLARA is a two-stage classification method: the first stage directly uses the 1D signal to discriminate between EC, CS G1 and CS G2, G3, while the second stage applies the wavelet transform to Raman spectra in order to discriminate between EC and CS G1. CLARA achieves a 97% accuracy in the 3-label classification.

In this paper, we propose a novel method that leverages topological features extracted from the Raman spectrum to enhance the classification capability of standard machine learning techniques. Even though the experimental dataset is not large, the results show that the method yields a classification model which not only achieves high accuracy on previously unseen data samples but can also be easily integrated into a Raman spectroscopic system as an automatic tool for supporting clinicians in grading the tumour.

The following section describes both the experimental data (Dataset 1 and Dataset 2) and the processing pipeline. In the “Results” section, the processing pipeline is applied to the experimental data and the classification results are reported; this section also includes two ablation studies showing that the pipeline is more effective than using only topological data analysis or only machine learning classifiers. The section closes with: (i) a thorough comparison with the best result we found in the state of the art; (ii) a description of the results achieved on new data (acquired on new subjects using a different acquisition system). The “Discussion and conclusions” section discusses the best results achieved and concludes the paper.

Materials and methods

Data acquisition

The data acquisition was carried out with a Thermo Fisher Scientific DXR2xi Raman microscope. A total of 10 patients, who were being treated at the Azienda Ospedaliera Universitaria Pisana, Pisa, were enrolled in the study under the Ethical Committee agreement. Details can be found in the paper⁸. Formalin-fixed paraffin-embedded tumour tissue sections (see, e.g., Fig. 1) were collected on glass slides and submitted to RS analysis after the dewaxing step (see, e.g., Fig. 2). The protocol to remove paraffin and formalin consisted of immersing the histopathological sections in two successive xylene baths of 10 min each, and then washing the sections in phosphate-buffered saline (PBS) to remove residual formalin. To give an idea of the variability of the datasets, Fig. 3 shows Raman spectra from Dataset 1 (Fig. 3a) and from Dataset 2 (Fig. 3b).

The Raman spectroscopy measurements were configured with the following experimental parameters: laser wavelength 532 nm; laser power 5–10 mW; 400–3400 cm\(^{-1}\) full range grating; 10×, 50× and 100× objectives; 25 µm pinhole; 5 cm\(^{-1}\) (FWHM) spectral resolution. Each Raman spectrum was recorded with an integration time of 1 s and 10 scans. As a first step, an overview of the tissue morphology was carried out to identify the regions of interest, collecting a number of mosaic images at low (10×) and intermediate (50×) magnification. The Raman spectra were then acquired with a 100× objective. The signal-to-noise ratio was optimized and the sample fluorescence minimized through preliminary measurements used to set the best experimental parameters. Multiple measurements were performed in different regions within the various samples in order to assess intra-sample variability. No pre-treatment of the samples was necessary before Raman measurements. Minimal preprocessing, including background removal and baseline correction, was performed using the tools of the DXR2xi GUI, and a fifth-order polynomial correction was used to compensate for the tissue fluorescence. Peaks were identified with a dedicated tool of the Omicron 9.0 software.
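To make the fifth-order polynomial correction concrete, here is a minimal sketch of one way such a baseline could be estimated and subtracted with NumPy. It is not the DXR2xi GUI procedure used in the study: the function name is hypothetical, and a plain least-squares fit over the whole spectrum is an assumption (practical fluorescence removal usually fits iteratively so that Raman peaks do not pull the baseline upwards).

```python
import numpy as np
from numpy.polynomial import polynomial as P


def subtract_polynomial_baseline(wavenumbers, intensities, order=5):
    """Fit a polynomial of the given order and subtract it from the spectrum.

    Returns (corrected_intensities, fitted_baseline).
    """
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(intensities, dtype=float)
    # Rescale the wavenumber axis to [-1, 1] so that high powers stay well conditioned.
    x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    coefficients = P.polyfit(x_scaled, y, order)
    baseline = P.polyval(x_scaled, coefficients)
    return y - baseline, baseline


if __name__ == "__main__":
    # Synthetic spectrum: smooth background plus one narrow Gaussian peak.
    wn = np.linspace(400.0, 3400.0, 601)
    background = 5.0 + 1e-9 * (wn - 400.0) ** 3
    peak = 10.0 * np.exp(-0.5 * ((wn - 1450.0) / 8.0) ** 2)
    corrected, _ = subtract_polynomial_baseline(wn, background + peak)
    print(wn[np.argmax(corrected)])  # position of the strongest remaining feature
```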

Raman hyperspectral chemical maps were collected, ranging from \(50 \times 50\,\upmu \hbox {m}^2\) (step size 1 \(\upmu \hbox {m}\)) to approximately \(200 \times 200\,\upmu \hbox {m}^{2}\) (step size \(4\, \upmu \hbox {m}\)), each recording several hundred spectra. Raman maps provide the fundamental advantage of localizing Raman spectra to specific positions, giving local information about chemical composition. Step sizes were chosen so that the collection time of each map was less than 7 h.

Ten supplemental spectra were acquired with an Xplora Plus (Horiba), using a similar experimental setup and preprocessing procedure, in order to test the classification method on previously unseen data samples. In this way, the results of the final test reported at the end of the “Results” section show that the proposed classification method is neither subject-dependent nor vendor-specific (DXR Thermo Fisher data for model training, Xplora Horiba data for final model testing).

In this paper, we use the following labels for the machine learning part: EC = 0, CS G1 = 1, CS G2 = 2, CS G3 = 3. In the following, class names and numeric labels are used interchangeably. Finally, the acquired data have been split into two datasets:

Dataset 1: 400 spectra from ten subjects, belonging to the following chondrosarcoma malignancy classes: [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]. The subjects have, respectively, the following numbers of spectra: [32, 31, 37, 24, 38, 38, 50, 50, 49, 51].
Dataset 2: 10 spectra from ten subjects (no intersection with Dataset 1), belonging to the following chondrosarcoma malignancy classes: [1, 2, 2, 1, 2, 0, 3, 3, 0].
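The layout of Dataset 1 can be sketched as two parallel arrays, one holding the subject of each spectrum and one holding its class label. The variable and function names below are hypothetical, and the ordering of spectra (grouped by subject) is an assumption; only the class list and the per-subject counts come from the description above.

```python
from collections import Counter

CLASS_NAMES = {0: "EC", 1: "CS G1", 2: "CS G2", 3: "CS G3"}
SUBJECT_CLASS = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
SPECTRA_PER_SUBJECT = [32, 31, 37, 24, 38, 38, 50, 50, 49, 51]


def build_index(subject_class, spectra_per_subject):
    """Return (groups, labels): the subject id and class label of every spectrum."""
    groups, labels = [], []
    for subject_id, (label, count) in enumerate(zip(subject_class, spectra_per_subject)):
        groups.extend([subject_id] * count)
        labels.extend([label] * count)
    return groups, labels


if __name__ == "__main__":
    groups, labels = build_index(SUBJECT_CLASS, SPECTRA_PER_SUBJECT)
    print(len(labels))  # total number of spectra
    for label, count in sorted(Counter(labels).items()):
        print(CLASS_NAMES[label], count)
```

With these counts the four classes turn out to be balanced at the spectrum level (100 spectra each), even though the number of subjects per class differs.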

Data analysis

The core idea of our study is to employ the tools of Topological Data Analysis (TDA) and machine learning (ML) to classify the Raman spectra described in the previous section. Using topological and geometrical ideas on medical data is not new and has already demonstrated substantial potential in multiple research papers^12,13,14,15,16,17,18.

In our approach, we evaluated the effectiveness of topological features by combining them with established machine learning algorithms, including support vector classifiers, random forest classifiers, and Ridge regression. We preferred general machine learning algorithms to deep learning for two reasons. Firstly, neural networks (NNs) are highly task dependent, whereas the adopted processing pipeline¹⁹ based on TDA and ML is intended to be very general. Secondly, for completeness, a convolutional neural network (CNN) was trained on the persistence images obtained in the “Combining TDA and ML: the classification pipeline” section; since the results were not at all satisfactory, this experiment was excluded from this work.

Mathematical background

This section is mainly devoted to the description of TDA, a relatively new branch of applied mathematics that aims to bridge the gap between computational topology and discrete Morse theory in the study of high dimensional data. The interested reader can find more information about this topic in^20,21. More precisely, we introduce Persistent Homology (PH) as one of the main concepts of TDA. Roughly speaking, PH studies the geometry of spaces by looking at the evolution of k-dimensional holes at different scales. It keeps track of the appearance and disappearance of such holes, which are the topological features, in the form of intervals \((\text {birth},\, \text {death})\). The persistence of a topological feature is the span of its detectability, and it is a measure of its importance. In particular, features with a longer lifespan are more likely to be key features in describing the shape of the data space, while features with a short lifespan can often be treated as noise. The collection of the intervals \((\text {birth},\, \text {death})\) is called the Persistence Diagram (PD). Mathematically, a persistence diagram is a multiset, that is, a set where elements can appear multiple times, i.e. each element has a multiplicity. Different metrics can be defined on the space of persistence diagrams. Leaving aside the precise mathematical definition of these metrics (for which we refer the reader to^20,21), an essential property is that the process that associates a PD with data is stable with respect to these metrics. This means that a small perturbation in the data yields a small perturbation of the associated PD. This property is of fundamental importance in applications because it guarantees robustness against noise and repeatability. The main drawback of PDs is that the space of multisets lacks fundamental properties required in a machine learning context. For this reason, a number of representation methods have been devised in order to exploit the expressiveness of PDs in ML algorithms. For more information on such representation methods, we refer the reader to^22,23,24,25. The key idea of all these methods is to embed the space of persistence diagrams in a larger Hilbert space in a stable way, i.e. to vectorize the PD. After this last step, the topological features extracted by persistent homology can be used directly in a machine learning algorithm. Figure 4 shows the classical paradigm for topological data analysis.
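The notions of birth, death and persistence can be illustrated on the simplest case, connected components (dimension 0) of a point cloud under a Vietoris-Rips filtration. The sketch below is a self-contained illustration rather than the software used in the study: the function names are hypothetical, higher-dimensional holes are not computed, and the death value is taken to be the distance at which two components become connected (an assumption about the scale convention; with a ball-radius convention the values would be halved).

```python
import math
from itertools import combinations


def rips_h0_diagram(points):
    """Finite (birth, death) pairs in dimension 0 for a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the length of the edge
    that merges it into another component (Kruskal's algorithm on the complete
    graph). The single component that never dies is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge merges two distinct components
            parent[root_i] = root_j
            diagram.append((0.0, length))
    return diagram


def persistence(diagram):
    """Lifespan (death - birth) of every interval in the diagram."""
    return [death - birth for birth, death in diagram]


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
    diagram = rips_h0_diagram(cloud)
    print(diagram)
    print(persistence(diagram))
```

In this toy cloud the isolated point at (3, 4) produces the longest-lived interval, which is the sense in which long lifespans flag prominent structure and short ones resemble noise.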

Combining TDA and ML: the classification pipeline

This section describes the topological pipeline employed in this study. We refer to Fig. 5 for a general scheme of our approach. The classification pipeline consists of an automatic grid search for the optimal choice of (i) the PH-based representation of the input data and (ii) the ML classifier for cancer staging. This pipeline has already been presented in¹⁹, where a variety of tests on benchmark datasets were carried out. The present work describes its first application to experimental data. See Fig. 6 for a graphical example of our pipeline. The first step of the pipeline is to compute the PDs from the Raman spectra. For this we chose the Vietoris-Rips filtration²⁷. In this approach, each point of the spectrum is treated as a point in the Euclidean space \(\mathbb {R}^2\). We grow balls centered at each point of the signal, and when i balls pairwise intersect, an \((i-1)\)-simplex is added to the simplicial complex with birth value r, the current radius. An alternative approach might have been to use the lower star filtration^20,28. Without going into details, Vietoris-Rips was preferred because the lower star filtration does not generate points in \(H_1\) for 1D signals. Hence, starting from the Raman spectra (Fig. 6a), restricted to the wavenumber range 400–1800 \(\text {cm}^{-1}\), the persistence diagram of homology in dimensions 0 and 1 is computed with a Vietoris-Rips filtration using the Python Ripser package²⁹ (Fig. 6b). The PD is then vectorized using four different vectorization methods with different combinations of parameters. More specifically, the PDs are vectorized using the following setup:

Persistence Images²² (PI) with bandwidth \(\sigma \in \{0.1, 1, 10\}\) and resolution \(n \in \{5, 10, 25\}\) (Fig. 6c);
Persistence Landscapes²³ (PL) with resolution \(n \in \{25, 50, 75, 100\}\) (Fig. 6d);
Persistence Silhouette²⁴ (PS) with resolution \(n \in \{25, 50, 75, 100\}\) (Fig. 6e);
Betti Curve²⁵ (BC) with resolution \(n \in \{25, 50, 75, 100\}\) (Fig. 6f).
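Of the four vectorizations listed above, the Betti Curve is the simplest to write down: it counts how many intervals of the diagram are alive at each value of the filtration parameter, sampled on a grid of n points. The following is a minimal NumPy sketch under stated assumptions: the function name is hypothetical, the grid spans the smallest birth to the largest finite death unless given explicitly, an interval is counted as alive on the half-open range [birth, death), and intervals with infinite death are discarded. Library implementations may differ in these conventions.

```python
import numpy as np


def betti_curve(diagram, resolution=100, t_min=None, t_max=None):
    """Sample the Betti curve of a persistence diagram on a regular grid.

    diagram: iterable of (birth, death) pairs for one homology dimension.
    Returns (grid, counts), where counts[k] is the number of intervals with
    birth <= grid[k] < death.
    """
    dgm = np.asarray(list(diagram), dtype=float).reshape(-1, 2)
    dgm = dgm[np.isfinite(dgm).all(axis=1)]  # drop intervals that never die
    if dgm.shape[0] == 0:
        raise ValueError("the diagram has no finite intervals")
    if t_min is None:
        t_min = dgm[:, 0].min()
    if t_max is None:
        t_max = dgm[:, 1].max()
    grid = np.linspace(t_min, t_max, resolution)
    births, deaths = dgm[:, 0], dgm[:, 1]
    alive = (births[None, :] <= grid[:, None]) & (grid[:, None] < deaths[None, :])
    return grid, alive.sum(axis=1)


if __name__ == "__main__":
    grid, counts = betti_curve([(0.0, 2.0), (1.0, 3.0)], resolution=4)
    print(grid, counts)
```

The resulting vector has a fixed length equal to the resolution, regardless of how many points the diagram contains, which is what allows it to be passed to a standard classifier.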

It is important to highlight that PDs produce points in different homological dimensions, and this information must be treated carefully. In more detail, following the rich TDA literature, we employed four different approaches to deal with information originating from different dimensions. In the first (resp. second) approach, only the points in dimension \(H_i\) for \(i = 0\) (resp. \(i=1\)) are considered in the vectorization. In the third one, the actual homology dimension is neglected, and all the points are vectorized together regardless of the dimension in which they show up. Finally, in the fourth approach, \(H_0\) and \(H_1\) are vectorized separately, and then the corresponding vectors are concatenated. We will refer to these approaches as \(H_0\), \(H_1\), \(H_0 + H_1\) (fused) and \(H_0 + H_1\) (concat) respectively. The resulting vectors are the input for different machine learning classifiers. The classifiers employed in our pipeline are:

Support Vector Classifier³⁰ (SVC) with RBF kernel and \(C \in \{1, 2, 3, 5, 10, 20\}\);
Random Forest Classifier³¹ (RFC) with \(\#\text {trees} = 100\);
Ridge Regression³² (RR) with \(\alpha = 1\).
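The classifier settings listed above could be instantiated with scikit-learn roughly as follows. This is a sketch, not the authors' code: the helper name and dictionary keys are hypothetical, `RidgeClassifier` is assumed to be the classification counterpart of the Ridge Regression entry, and the fixed `random_state` is added here only for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC

SVC_C_VALUES = (1, 2, 3, 5, 10, 20)


def build_classifiers(random_state=0):
    """Return a dict mapping a readable name to an unfitted classifier."""
    classifiers = {f"SVC(C={c})": SVC(kernel="rbf", C=c) for c in SVC_C_VALUES}
    classifiers["RFC"] = RandomForestClassifier(
        n_estimators=100, random_state=random_state
    )
    # Assumption: the Ridge entry is used as a classifier, not as a regressor.
    classifiers["RR"] = RidgeClassifier(alpha=1.0)
    return classifiers


if __name__ == "__main__":
    for name in build_classifiers():
        print(name)
```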

These are well-known and standard ML classifiers; in this work, we used the implementation of the Scikit-learn library³³. The pipeline performs a grid search over the four approaches, the different vectorizations and the classifiers, and returns the accuracy of each method for each of the ten runs of a leave-one-patient-out cross-validation³⁴ (LOPO). We stress that the design of our experimentation, including vectorizations, classifiers and LOPO, is motivated by two main reasons: (i) to remain consistent with the previous work on the pipeline¹⁹ and with other TDA papers; (ii) the limited amount of available data allows for a meticulous search for the optimal configuration.
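A leave-one-patient-out validation of the kind described above can be sketched with scikit-learn's `LeaveOneGroupOut`, using the patient identifier of each spectrum as the group. The function name and the toy data are hypothetical; the feature vectors stand in for whatever vectorization of the persistence diagram is being evaluated.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC


def lopo_accuracies(features, labels, groups, classifier):
    """Accuracy on each held-out patient; one run per distinct group id."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups):
        model = clone(classifier)  # fresh, unfitted copy for every run
        model.fit(features[train_idx], labels[train_idx])
        scores.append(model.score(features[test_idx], labels[test_idx]))
    return np.array(scores)


if __name__ == "__main__":
    # Toy data: 4 patients, 2 classes, 5 one-dimensional "spectra" per patient.
    rng = np.random.default_rng(0)
    groups = np.repeat([0, 1, 2, 3], 5)
    labels = np.repeat([0, 0, 1, 1], 5)
    features = labels[:, None] + 0.05 * rng.standard_normal((20, 1))
    print(lopo_accuracies(features, labels, groups, SVC(kernel="rbf", C=1)))
```

Because each patient in Dataset 1 belongs to a single class, every LOPO run tests on spectra of one class only, so the per-run accuracy is effectively a per-patient recall.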

Ethics statements

The study was approved by the local Ethical Committee Comitato Etico Regionale per la Sperimentazione Clinica della Regione Toscana sezione AREA VASTA NORD OVEST (protocol number 14249). Ten patients affected by primary chondrogenic tumours of the skeleton were enrolled in this study. Informed consent was collected from all patients. All the experiments were carried out in accordance with Good Clinical Practice (GCP) and with the ethical principles of the Declaration of Helsinki. All patients were diagnosed and treated at Azienda Ospedaliera Universitaria Pisana, Pisa, in 2018.

Results

In this section, we explore the results achieved by the pipeline described in the “Combining TDA and ML: the classification pipeline” section. Because the dataset is small, we were able to perform a large number of experiments without any computational restriction. For a more detailed description of the experimental data, please refer to the “Data acquisition” section. In our first experiment (“Supervised results” section), we performed supervised learning on Dataset 1, training classifiers on different groupings of the labels. Specifically, we trained the classifier with 4 labels (“LOPO validation 4 labels” section), with 3 labels (EC vs. CS G1 vs. CS G2 and CS G3, “LOPO validation 3 labels (EC vs. CS G1 vs. CS G2 and G3)” section) and as two binary classifiers (EC vs. CS, “LOPO validation 2 labels (EC vs. CS)” section; EC and CS G1 vs. CS G2 and CS G3, “LOPO validation 2 labels (EC, CS G1 vs. CS G2, G3)” section). As explained in the “Introduction” section, the most clinically meaningful subdivision is the binary classification EC vs. CS. Nevertheless, other subdivisions that may be clinically useful were also investigated as a proof-of-concept study of the applicability of our method to the more challenging task of supporting the pathologist in tumour grading, i.e. the subdivision no cancer vs. mild cancer vs. severe cancer in the “LOPO validation 3 labels (EC vs. CS G1 vs. CS G2 and G3)” section. Indeed, the results obtained with these subdivisions, although of limited validity due to the small number of patients, encourage us to extend the experimentation in order to validate our method further. Moreover, in the “Unsupervised clustering” section we performed unsupervised learning (clustering) on Dataset 1. In the “Comparison with CLARA¹¹” section, we compare our results with the state of the art from the paper¹¹. Finally, in the “Final test with new data” section, the best model of the “LOPO validation 2 labels (EC vs. CS)” section (supervised, 2-label classification), trained on Dataset 1, was tested on Dataset 2 in order to assess its generalization capability.

The pipeline takes the Raman spectra as input, computes the PDs by means of a Vietoris Rips filtration, vectorizes the PDs, and feeds such vectors to a machine learning classifier: basically, we start from a vector, and we end up with another vector of topological features. For this reason, in “Supervised learning without TDA” section, an ablation study has been carried out by feeding the Raman spectra directly to the machine learning classifiers, and comparing with the results achieved. We recall that the pipeline performs a grid search between a large number of methods {vectorization method, classifier}. Moreover, we treat separately the different homology dimensions discussed in “Combining TDA and ML: the classification pipeline” section. For this reason, in the following, for each experiment, we report both a table showing the best accuracy among all methods for each run of the LOPO validation and each homology dimension, as well as a table showing the best single method (as average accuracy) for each homology dimension.
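To give a sense of the size of this grid, the sketch below enumerates the configurations implied by the parameter lists in the pipeline description. It assumes that every homology approach is combined with every vectorization setting and every classifier setting (a full Cartesian product, which the text does not state explicitly); the names and dictionary keys are hypothetical.

```python
from itertools import product

HOMOLOGY_APPROACHES = ["H0", "H1", "H0+H1 (fused)", "H0+H1 (concat)"]

VECTORIZATIONS = [
    ("PI", {"sigma": sigma, "resolution": n})
    for sigma in (0.1, 1, 10)
    for n in (5, 10, 25)
] + [
    (name, {"resolution": n})
    for name in ("PL", "PS", "BC")
    for n in (25, 50, 75, 100)
]

CLASSIFIERS = [("SVC", {"C": c}) for c in (1, 2, 3, 5, 10, 20)] + [
    ("RFC", {"n_trees": 100}),
    ("RR", {"alpha": 1}),
]


def enumerate_grid():
    """All (homology approach, vectorization, classifier) triples."""
    return list(product(HOMOLOGY_APPROACHES, VECTORIZATIONS, CLASSIFIERS))


if __name__ == "__main__":
    grid = enumerate_grid()
    lopo_runs = 10  # one run per patient in Dataset 1
    print(len(VECTORIZATIONS), len(CLASSIFIERS), len(grid), len(grid) * lopo_runs)
```

Under this assumption there are 21 vectorization settings and 8 classifier settings, giving 672 configurations and 6720 model fits over the ten LOPO runs, which is feasible only because the dataset is small.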

Supervised results

In the first, somewhat naive experiment, we split all the spectra into training and test sets without requiring that all the spectra from a given patient fall entirely in the training set or entirely in the test set. We then adopted a leave-one-patient-out cross-validation (LOPO) approach, which keeps each patient's spectra on one side of the split, to prevent overfitting. Moreover, we did not always carry out a 4-class classification, but also conducted 3-class and 2-class studies in accordance with the existing literature.

Tenfold cross-validation with 4 labels

The first experiment uses all the 400 spectra from Dataset 1 and performs a tenfold cross-validation. We highlight that, in doing so, spectra coming from the same patient can occur both in the training set and in the test set. We report the classification accuracy of each run and each homology dimension in Table 1, while Table 2 reports the single best method for each homology dimension. The accuracy results are extremely satisfying and fully justify a study of Raman spectra for grading the malignancy of chondrosarcoma.
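The patient overlap mentioned above can be checked directly from the per-subject spectrum counts of Dataset 1. The sketch below is illustrative only: the text does not say how the ten folds were drawn, so a deterministic round-robin assignment (spectrum i goes to fold i mod 10) is assumed here, and the function names are hypothetical.

```python
SPECTRA_PER_SUBJECT = [32, 31, 37, 24, 38, 38, 50, 50, 49, 51]


def subject_of_each_spectrum(spectra_per_subject):
    """Subject id of every spectrum, with spectra grouped by subject."""
    return [s for s, count in enumerate(spectra_per_subject) for _ in range(count)]


def tenfold_shared_subjects(groups, n_folds=10):
    """Per fold, the number of subjects present in both training and test sets."""
    shared = []
    for fold in range(n_folds):
        test = {g for i, g in enumerate(groups) if i % n_folds == fold}
        train = {g for i, g in enumerate(groups) if i % n_folds != fold}
        shared.append(len(test & train))
    return shared


def lopo_shared_subjects(groups):
    """Same count for leave-one-patient-out splits (zero by construction)."""
    shared = []
    for held_out in sorted(set(groups)):
        test = {g for g in groups if g == held_out}
        train = {g for g in groups if g != held_out}
        shared.append(len(test & train))
    return shared


if __name__ == "__main__":
    groups = subject_of_each_spectrum(SPECTRA_PER_SUBJECT)
    print(tenfold_shared_subjects(groups))
    print(lopo_shared_subjects(groups))
```

With this assignment every fold contains spectra from all ten patients on both sides of the split, whereas LOPO never shares a patient between training and test, which is why the tenfold accuracies should be read as optimistic.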

Table 1 Accuracy of the pipeline with 4 labels and a tenfold cross-validation approach.

Comparison with CLARA¹¹