Automatic and objective oral cancer diagnosis by Raman spectroscopic detection of keratin with multivariate curve resolution analysis

We have developed an automatic and objective method for detecting human oral squamous cell carcinoma (OSCC) tissues with Raman microspectroscopy. We measure 196 independent Raman spectra from 196 different points of one oral tissue sample and globally analyze these spectra using Multivariate Curve Resolution (MCR) analysis. Discrimination of OSCC tissues is made automatically and objectively by spectral matching between the MCR-decomposed Raman spectra and the standard Raman spectrum of keratin, a well-established molecular marker of OSCC. We use a total of 24 tissue samples: 10 OSCC and 10 normal tissues from the same 10 patients, and 3 OSCC and 1 normal tissues from different patients. Following the newly developed protocol presented here, we have been able to detect OSCC tissues with 77 to 92% sensitivity (depending on how positivity is defined) and 100% specificity. The present approach lends itself to a reliable clinical diagnosis of OSCC substantiated by the “molecular fingerprint” of keratin.

Component Analysis (PCA) in conjunction with statistical multi-parameter analyses. The key advantage of PCA is that once a spectral data set obtained from tissues is analyzed and separated into several particular categories, then a new spectrum can automatically be assigned to one of those categories, for example, cancerous vs. normal. However, PCA does not extract detailed molecular spectral information from the categorized spectra and its physical basis of categorization tends to remain unclear.
Biological tissues are highly heterogeneous and their Raman spectra vary widely depending on the position where they are measured. Furthermore, the molecular compositions of tissues are so complicated that their raw Raman spectra can hardly be interpreted. In order to accomplish a global tissue analysis effective for cancer diagnosis, we need to (1) collect Raman spectra from as many points as possible in a tissue sample, (2) estimate the number of principal spectral components contained in this large number of Raman spectra, (3) decompose the raw spectra into spectrally interpretable components and finally (4) objectively characterize tissues according to the extracted spectral information. The methodology employed up to now relies greatly on specialized "spectroscopic eyes", which has not facilitated its practical application in cancer diagnosis. The aim of the present study is to develop an automatic and objective method for discriminating oral cancer tissue by detecting keratin without any specialized knowledge of spectroscopy. We (1) collected a total of 196 Raman spectra from one oral tissue sample, (2) estimated the number of principal spectral components by Singular Value Decomposition (SVD), (3) applied Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) 20,21 analysis to decompose a large set of complicated spectra into spectrally interpretable components and (4) carried out spectral matching between these MCR-decomposed spectral components and the keratin standard spectrum, using the Unit Normalized Euclidean Distance (UNED), to objectively discriminate OSCC from normal tissues.
The present method fully utilizes the Raman spectral information (molecular fingerprint) of the marker molecule, keratin; in contrast, in PCA approaches, Raman spectra are treated just as a two-dimensional signature for pattern recognition analysis without much reference to their physicochemical meaning. The keratin signature is identified automatically and objectively with the use of UNED, making the whole analysis readily accessible to non-specialists in spectroscopy.

Results
Determination of the number of principal spectral components contained in the observed spectra. We first determined the number of principal spectral components, k, based on the signal-to-noise ratio (S/N) consideration described in the Methods Section. We tried several threshold values and finally set the threshold to S/N = 4. If the threshold value is higher, we have fewer spectral components (smaller k values), in which keratin signatures are likely to be mixed up with other protein signatures. If the threshold value is lower, we have more spectral components (larger k), in which keratin signatures are likely to be contaminated with noise and dispersed among several spectral components. The present threshold value, S/N = 4, is the optimized value for the present data set of 196 × 24 = 4704 Raman spectra from 14 patients. This threshold is to be further optimized with a larger amount of data from a larger number of patients in the future. For the present, we used the threshold value, S/N = 4, to show that the following automatic analysis proceeds successfully once the threshold value is fixed. The determined k values are shown in Fig. 1 for Patient-1 to Patient-10 (OSCC and normal tissue samples), Patient-11 to Patient-13 (OSCC) and Patient-14 (normal). These different k values reflect the variation among samples obtained from different patients.
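The selection of k described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `count_components`, the Savitzky-Golay window and polynomial order, and in particular the S/N definition (norm of the smoothed spectrum divided by norm of the residual) are assumptions made for the example, since the exact S/N formula is not reproduced in this text.

```python
import numpy as np
from scipy.signal import savgol_filter

def count_components(A, threshold=4.0, window=11, polyorder=3):
    """Estimate k for a data matrix A (m spectral points x n spectra)."""
    # Columns of U are the SVD-decomposed spectra.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 0
    for i in range(U.shape[1]):
        original = U[:, i]
        signal = savgol_filter(original, window, polyorder)  # "Signal"
        noise = original - signal                            # "Noise"
        # ASSUMED S/N definition: ratio of Euclidean norms.
        snr = np.linalg.norm(signal) / (np.linalg.norm(noise) + 1e-12)
        if snr > threshold:
            k += 1
    return k
```

With such a definition, an SVD spectrum dominated by white noise gives a ratio well below the threshold of 4, whereas a smooth, band-like spectrum gives a ratio far above it, so the count is not very sensitive to the exact threshold in clean data.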
MCR-ALS fitting to decompose the observed spectra into spectrally interpretable components. After determining the number of principal spectral components in tissue, we applied MCR-ALS analysis (see Methods) to decompose the complicated raw spectra into spectrally interpretable components. The MCR spectral components of the OSCC tissue of Patient 1 are shown in Fig. 2(a-f). The normalized residual $R_{ij} = |(A_{ij} - (WH)_{ij})/A_{ij}|$ at the i-th row and the j-th column is less than 5 to 7%, indicating that the principal signatures in the original data set A are well represented by the product WH of the MCR decomposition. The MCR-decomposed spectra thus obtained are readily compared with the standard keratin spectrum in Fig. 2g. We notice that one of the MCR components, Fig. 2b, shows excellent correspondence with the standard keratin spectrum. The result for the normal oral tissue of Patient 1 is shown in Fig. 3(a-e). The normalized residual is less than 6%. The keratin spectrum does not seem to match any spectral component from the normal oral tissue. The spectral components in Fig. 3a, b and e are ascribed to autofluorescence. The spectral component in Fig. 3c is likely to contain protein signatures, with a characteristic band of the phenylalanine residue at 1003 cm⁻¹. The prominent signatures in Fig. 3d are from the glass substrate.
Although we discuss the assignments of the decomposed components here, neither the MCR decomposition nor the subsequent spectral matching evaluation depends on them. The process is fully automatic and does not require spectral assignments in our protocol.
Spectral matching between principal MCR spectral components and the standard spectrum of keratin. By the preceding MCR analysis, we obtained decomposed spectral components. In order to evaluate how "similar" these spectral components are to the standard keratin spectrum, without relying on specialized "spectroscopic eyes", we employed the idea of spectral matching. We tried several indicators of spectral matching, including Spectral Angle (SA), Euclidean Distance (ED) and Spectral Information Divergence (SID) 23-25, and found that Unit Normalized Euclidean Distance (UNED) (described in Methods) provides the clearest measure of spectral similarity.
We calculated UNEDs between each MCR-decomposed spectrum and the standard keratin spectrum in the region 800 to 1800 cm⁻¹. We then picked the minimum UNED value among the decomposed spectra in each tissue sample to quantify the "highest similarity" of each sample. We first used the ten paired samples (OSCC/normal) from the same ten patients for comparison. The result is shown in Fig. 4, with ten red points for OSCC and ten blue points for normal tissue samples.
We note that OSCC points tend to have smaller UNED values than the corresponding normal points. This trend indicates that the decomposed spectral components in the OSCC tissue samples have higher similarity with the standard keratin spectrum than those in the normal samples. The 95% confidence interval of the OSCC points is 0.16 < UNED < 0.26, while that of the normal points is 0.29 < UNED < 0.38. These two 95% confidence intervals do not overlap with each other, showing that UNED clearly distinguishes OSCC and normal oral tissues with high sensitivity and specificity. If we simply take the upper bound of the OSCC confidence interval (UNED = 0.26) as the threshold value, we can separate the OSCC and normal groups with 70% sensitivity in OSCC tissues (7 of 10 detected; failure for Patients 5, 7 and 10) and 100% specificity in normal tissues. The three false points, which were histologically diagnosed as cancerous, may well correspond to the preliminary stage of cancer that has more cancerous tissues than normal (see discussion below).

Figure 2. MCR-ALS spectral components from Patient-1 OSCC tissue sample (a-f) and the standard keratin spectrum (g); the spectral component (b) shows excellent correspondence with the standard keratin spectrum (g).
With the threshold UNED = 0.26, we analyzed the other three independent OSCC tissue samples and one independent normal tissue sample. The three OSCC tissue samples show UNED similarity values of 0.22, 0.13 and 0.22, respectively. These values are all smaller than 0.26. The normal tissue sample shows the value 0.38, which is much higher than 0.26. If we include the three independent OSCC samples, the sensitivity increases from 70% (7 of 10) to 77% (10 of 13). The 100% specificity does not change when the one independent normal sample is added to the analysis.
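The arithmetic behind these figures can be checked with a short sketch. The UNED values of the four independent samples and the counts for the paired samples (7 of 10 OSCC below threshold, 10 of 10 normal above it) are taken from the text; the helper names `is_positive` and `percent` are illustrative only.

```python
THRESHOLD = 0.26  # upper bound of the OSCC 95% confidence interval

def is_positive(min_uned, threshold=THRESHOLD):
    """A sample is called OSCC-positive when its minimum UNED is below the threshold."""
    return min_uned < threshold

def percent(hits, total):
    return 100.0 * hits / total

independent_oscc = [0.22, 0.13, 0.22]
independent_normal = [0.38]

# Paired samples: 7 of 10 OSCC samples fall below the threshold.
sens_paired = percent(7, 10)

# Adding the three independent OSCC samples.
new_hits = sum(is_positive(u) for u in independent_oscc)
sens_all = percent(7 + new_hits, 10 + len(independent_oscc))

# Specificity: all 10 paired normal samples plus the independent one are negative.
new_negatives = sum(not is_positive(u) for u in independent_normal)
spec_all = percent(10 + new_negatives, 10 + len(independent_normal))

print(round(sens_paired), round(sens_all), round(spec_all))
```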

Discussion
In the present study, we have developed an automatic and objective method for oral cancer diagnosis by applying the MCR-ALS analysis with spectral matching. In spectral matching, we compared the distance, UNED, between normalized MCR-decomposed spectral components and the normalized standard keratin spectrum to evaluate their "similarity". The UNED "similarity" value tells us how much of the characteristic Raman signature of keratin the MCR-decomposed spectra contain. When the UNED value is small, the decomposed spectral component contains a strong keratin signature. When the UNED value is large, the decomposed spectral component contains less keratin signature. Our results indicated that, in the OSCC tissue samples, one of the decomposed spectral components tended to show high similarity, whereas no spectral component showed high similarity in the normal tissue samples. The keratin signature was successfully captured from the OSCC tissue samples but not from the normal ones. Keratin in the normal tissue samples was not detected by MCR primarily because the amount of keratin is much smaller in normal tissues than in OSCC 3. In addition, the spatial distribution of keratin may also play a role in the MCR decomposition. Note that the MCR decomposition is based on differences not only in the spectral profile but also in the spatial distribution. It is likely that the keratin spatial distribution in OSCC is different from that in normal tissues. OSCC tissues may consist of a homogeneous population of cells at one particular stage of differentiation, whereas normal tissues consist of cells at different stages of differentiation 26,27. Cancer cells in OSCC tissues are likely to have more chance to stay at the G2 phase with aberrant keratin syntheses and to produce a localized spatial distribution of keratin. In contrast, normal cells mostly progress through different stages, the M, G1 and G2 phases, so that keratin is randomly distributed spatially.
In the present study, we randomly measured the points in tissue samples to obtain global spectral information. We anticipate that specific areas of tissues can be examined by this approach for comparison with immunohistochemical staining results, and that the spatial distribution of keratin, $H_i$ in Equation 4, can be used to further substantiate the discrimination of OSCC tissues against normal tissues.
In spectral matching, the threshold UNED = 0.26 was set to discriminate OSCC against normal oral tissues from the same patient. The UNED value could also help elucidate the metastasis condition of cancer. The marginal region (UNED = 0.26 to 0.29 in Fig. 4) probably represents metastatic cancer cells gradually invading normal tissues. With the UNED = 0.26 discrimination, the UNED values were larger than 0.26 for the three false points in Patients 5, 7 and 10; hence, these were not identified as cancerous. However, their UNED values were smaller than the corresponding values of the normal samples. If we compare the UNED values within the same patient, the OSCC tissue samples always show lower UNED values than the normal ones (for Patient-10, they almost overlap). In that sense, the two OSCC tissue samples from Patients 5 and 7 can probably be diagnosed as suspicious (though not identified as cancerous) for having more cancerous tissues than normal. If we regard these suspicious samples as positive, the sensitivity increases from 77% (10 of 13) to 92% (12 of 13). The pair comparison of the UNED values may provide further information on the metastasis condition of oral cancer.

Methods
Tissue samples. Use of tissue samples was approved by the Institutional Review Board of the Taichung Veterans General Hospital. All experiments were performed in accordance with the approved guidelines and regulations. Informed consent was obtained from all subjects. Samples from fourteen oral cancer patients included ten paired samples (cancer and normal tissue samples from the same patient), three independent cancer tissue samples and one independent normal tissue sample. Cancerous oral tissue samples were histologically confirmed as oral squamous cell carcinoma (OSCC). All tissue samples, immediately after surgical removal, were flash-frozen at −196 °C and stored in liquid nitrogen. Tissue samples stored at liquid nitrogen temperature were then embedded in optimum cutting temperature (OCT) compound, sectioned with a microtome into slices approximately ten micrometers thick and mounted on glass slides. The standard keratin sample was prepared from human stratum corneum, which is known to contain 80% keratin 28. Practically, it was obtained from stratum corneum cut out from the heel of one of the authors. The sample was soaked in a 1:1 mixture of methanol and chloroform overnight and was then immersed in deionized water 29. The standard keratin sample that we have taken from human stratum corneum is extensively used as the standard antigen in immunostaining detection of keratin in squamous cell carcinoma (SCC) tissues 3.

Raman microspectroscopy. We used a laboratory-constructed Raman microspectrometer for all the Raman measurements. The 488 nm line of an Ar-ion laser (CVI Melles Griot) was used for excitation with a power of about 1 mW at the sample point. The laser beam was focused into the sample by using a non-immersion 40X, NA = 0.6, objective (Olympus, LUCPlanFL N). The laser spot size at the sample was estimated to be about 1 μm. The back-scattered light was collected by the same objective lens and was focused onto the entrance slit of a polychromator (Andor, SR303i-BNS). A 1200-grooves/mm grating was used to disperse the scattered light. The signal was detected by a CCD detector (Andor, DU401A-BV) cooled to −80 °C. The acquisition time was 60 s for each measurement. The Raman spectrum of indene was acquired for wavenumber calibration 30. The Raman spectra of all samples were recorded in the 300-2000 cm⁻¹ wavenumber region, which covers most Raman signatures observed from oral tissues.
In the present study, we placed emphasis on extracting global molecular information of tissues. Therefore, we tried to globally and randomly measure as many points as possible without specific localization in a tissue sample. A piezo X-Y stage (Physik Instrumente) was used to scan 7 × 7 = 49 points in one region of a tissue sample, with a distance of 5 μm between two adjacent points (Fig. 5). The same measurement was repeated four times at different regions of the sample, and a total of 4 × 49 = 196 spectra were collected for each sample for subsequent analysis.
Data analysis. The flow chart of data analysis is shown in Fig. 6. Wavenumber calibration based on the standard spectrum of indene was carried out prior to the analysis. The analysis consists of the following three steps: (1) determination of the number of principal spectral components contained in the observed spectra, (2) multivariate curve resolution-alternating least squares (MCR-ALS) fitting to decompose the observed spectra into spectrally interpretable components, (3) spectral matching between principal MCR spectral components and the standard spectrum.
(1) Determination of the number of principal spectral components. To determine the number of principal spectral components in the observed spectra, we introduced a new protocol based on signal-to-noise ratio (S/N) consideration. First, SVD analysis was performed to obtain SVD-decomposed spectra, $\mathrm{Intensity}_{\mathrm{SVDoriginal}}$. Then, these SVD-decomposed spectra were smoothed by the Savitzky-Golay (polynomial) method to obtain $\mathrm{Intensity}_{\mathrm{SVDsmooth}}$, which was regarded as the "Signal". Then, for each SVD-decomposed spectrum $i$, $\mathrm{Intensity}_{\mathrm{SVDoriginal}}$ can be written as,

$$\mathrm{Intensity}^{i}_{\mathrm{SVDoriginal}} = \mathrm{Intensity}^{i}_{\mathrm{SVDsmooth}} + \mathrm{Noise}^{i} \qquad (1)$$

where the residual $\mathrm{Noise}^{i}$ is regarded as the "Noise". The S/N ratio of each SVD-decomposed spectrum is evaluated from these two terms. By using this method, we can automatically select the spectral components that have S/N ratios higher than a prefixed threshold value. In the present study, we fixed the threshold at 4. SVD spectra with S/N ratios higher than 4 were included in the subsequent analysis.

Scientific Reports | 6:20097 | DOI: 10.1038/srep20097

(2) MCR-ALS fitting to decompose the observed spectra into spectrally interpretable components. The experimental raw spectral data set A can be written as an m × n matrix, where m is the number of data points in one spectrum and n is the number of spectra in the data set; each column vector of A, $A_i = (A_{1i}, \ldots, A_{mi})^{T}$, represents the i-th Raman spectrum having m data points. We decompose A into a product of two matrices W and H,

$$A \approx WH \qquad (3)$$

where W is an m × k matrix, H is a k × n matrix and k is the number of components determined by the S/N consideration in the first step. Equation (3) can be written in matrix form as,

$$\left( A_1, \ldots, A_n \right) \approx \left( W_1, \ldots, W_k \right) \begin{pmatrix} H_1 \\ \vdots \\ H_k \end{pmatrix} \qquad (4)$$

where the column vector $W_i$ is the i-th component spectrum and the row vector $H_i$ is its intensity pattern over the measured points. During the process of the MCR analysis, the W and H matrices are forced to be non-negative; i.e., W ≥ 0, H ≥ 0. The final solutions are obtained by iterative refinement to minimize the Frobenius norm $\|A - WH\|^{2}$. The SVD spectral components are used as initial guesses of the spectral components in the iteration process. The negative values in the SVD spectra are truncated to zero. The present MCR-ALS analysis does not require orthogonality among the column vectors in W or the row vectors in H.
Note that the component spectra and the intensity patterns are definitely not orthogonal to one another; in contrast, they are assumed to be orthogonal in other spectral decomposition methods like PCA and SVD. The details of the MCR-ALS method are given elsewhere 20,31 .
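The alternating scheme can be illustrated with a minimal sketch. It assumes the simplest way of enforcing non-negativity (an unconstrained least-squares solve followed by truncation of negative values to zero), a fixed number of iterations, and a sign convention for the SVD initial guess; these choices and the function name `mcr_als` are assumptions for illustration and are not necessarily those of the solver used in this work (see refs 20,31 for the actual method).

```python
import numpy as np

def mcr_als(A, k, n_iter=200):
    """Decompose A (m x n) into non-negative W (m x k) and H (k x n), A ~ W H."""
    # Initial guess: first k SVD spectra, negative values truncated to zero.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W = U[:, :k]
    # ASSUMPTION: flip each SVD spectrum so that its sum is positive,
    # because the overall sign of a singular vector is arbitrary.
    signs = np.where(W.sum(axis=0) < 0, -1.0, 1.0)
    W = np.clip(W * signs, 0.0, None)
    for _ in range(n_iter):
        # Fix W, solve A ~ W H for H in the least-squares sense, enforce H >= 0.
        H = np.clip(np.linalg.lstsq(W, A, rcond=None)[0], 0.0, None)
        # Fix H, solve A^T ~ H^T W^T for W, enforce W >= 0.
        W = np.clip(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0.0, None)
    return W, H
```

Nothing in the loop orthogonalizes the columns of W or the rows of H, which is what distinguishes this kind of decomposition from PCA or SVD. The quality of the fit can be inspected afterwards, for example through `np.linalg.norm(A - W @ H) / np.linalg.norm(A)`.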
(3) Spectral matching between principal MCR spectral components and standard spectral component. The decomposed spectra $W_1, \ldots, W_k$ and the standard spectrum $S_{\mathrm{std}}$ were normalized so that their norms are unity. These normalized vectors can be written as,

$$W_{i,\mathrm{unit}} = \frac{W_i}{\|W_i\|}, \qquad S_{\mathrm{std,unit}} = \frac{S_{\mathrm{std}}}{\|S_{\mathrm{std}}\|}$$

The UNED between the i-th decomposed spectrum and the standard spectrum is then given by

$$\mathrm{UNED}_i = \sqrt{\sum_{j=1}^{m} \left( W_{i,\mathrm{unit},j} - S_{\mathrm{std,unit},j} \right)^{2}}$$

where j is the index of the j-th element of the m-dimensional vectors $W_{i,\mathrm{unit}}$ and $S_{\mathrm{std,unit}}$. Figure 8 schematically shows the principle of the UNED analysis, represented in 2-D space for simplicity. UNED represents the distance between a normalized MCR-decomposed spectral component and the normalized standard spectrum; its minimum value is 0 (two identical vectors) and its maximum value is 2 (two vectors pointing in opposite directions). Therefore, the smaller the UNED value is, the larger the similarity. Thus, from UNED, we can evaluate the distance between the two normalized spectral vectors to know how similar the two spectra are.
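As a minimal sketch of this step, the UNED and the per-sample minimum used in the Results can be computed as below. The function names `uned` and `min_uned` are illustrative, and the sketch assumes that both spectra are already sampled on the same wavenumber axis (for example, restricted to the region used for matching).

```python
import numpy as np

def uned(spectrum, standard):
    """Euclidean distance between two spectra after scaling each to unit norm."""
    a = np.asarray(spectrum, dtype=float)
    b = np.asarray(standard, dtype=float)
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    return float(np.linalg.norm(a_unit - b_unit))

def min_uned(W, standard):
    """Smallest UNED among the k component spectra stored as columns of W (m x k)."""
    return min(uned(W[:, i], standard) for i in range(W.shape[1]))

# Limiting cases: same shape gives 0, opposite direction gives 2.
print(uned([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(uned([1.0, 0.0], [-1.0, 0.0]))
```

Because both vectors are scaled to unit norm first, the measure depends only on the spectral shape and not on the absolute intensity of a component.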