Introduction

Breast cancer is the second most common cancer and the leading cause of cancer-related death amongst women1. According to the Brazilian Mortality Information System, 14,206 women died in 2013 due to this disease2. In 2014, the estimate was about 49,240 cases, and in 2018 it was expected to reach 59,700 new cases of breast cancer in Brazil alone1. This neoplasm is relatively rare in women < 35 years old, and its incidence increases progressively above this age, especially after age 50 years3. Breast cancer is therefore a major public health problem, particularly considering the costs of detection and treatment4. The control of breast cancer has been a priority and has been part of the Brazilian Strategic Action Plan for Confronting Non-transmissible Chronic Diseases since 20115.

Only one in three cases of breast cancer can be cured if discovered at an early stage2 and there are no effective ways of reducing the incidence of this disease6. The best alternative approach to tackle breast cancer is the concept that the earlier the disease is detected, the more effective is the treatment. Early detection through screening is the only method that has proven to be effective in reducing mortality1. Screening programs are an important health policy practice where the asymptomatic phase of disease is long enough to allow direct or indirect detection of pre-cancerous lesions. A significant degree of transformation in such lesions found in this phase would allow determination of their clinical significance and implementation of effective treatment to improve the patient’s prognosis. Such a screening test that diagnoses early disease needs to be acceptable to patients and available at a reasonable cost5.

Mammography is the recommended method for routine screening of breast cancer worldwide6. This technique, performed with an x-ray machine, is a radiological examination for evaluation of the breasts. It can be used to check for breast cancer-like lesions in apparently healthy women by finding nodules or calcifications. Exposure to this radiation rarely causes cancer, although the risk increases when the examination is performed at a high frequency. Besides being considered painful, relatively expensive, and a source of much discomfort and even embarrassment to patients, its sensitivity varies from 88 to 93%, while its specificity varies from 85 to 94%6. These metrics indicate, respectively, the proportion of women with breast cancer who will present a positive mammogram signalling disease presence, and the proportion of women without breast cancer who will have a normal mammogram6. Other breast cancer screening tests include breast self-examination (BSE), clinical examination of breasts (CBE), nuclear magnetic resonance (NMR), and ultrasonography. However, the time from initial patient examination until diagnosis can be too lengthy; about 70% of breast cancer cases lead to complete removal of the breast(s). Many examinations are required to identify the presence of neoplasm: mammogram, breast exam, biopsy, magnetic resonance imaging (MRI) and ultrasound.

Infrared (IR) spectroscopy is a vibrational technique capable of analysing biomolecules, such as nucleic acids (asymmetric PO2 in DNA and RNA at ~ 1,225 cm−1), carbohydrates (C–O stretching at ~ 1,155 cm−1), proteins (amide II at ~ 1,550 cm−1 and amide I at ~ 1,660 cm−1) and lipids (C=O stretching at ~ 1,750 cm−1), that exhibit characteristic features in the IR region7. Attenuated total reflection Fourier-transform IR (ATR-FTIR) spectroscopy has been used to analyse several biofluids due to its fast spectral acquisition, minimum sample preparation and sample volume, and its non-destructive nature to the sample8. Recent research has yielded excellent diagnostic results compared to traditional methods in various types of cancer, such as ovarian9, cervical10, and prostate11, and in the diagnosis of neurodegenerative diseases such as Alzheimer’s12. Herein, we present the results of using ATR-FTIR spectroscopy together with chemometrics for classification of patients with breast cancer in a large-scale screening program using blood biopsies.

Results

The FTIR spectral data in the fingerprint region (900–1,800 cm−1) were pre-processed by Savitzky–Golay smoothing (window of 7 points, 2nd order polynomial fitting) followed by AWLS baseline correction and normalization to the Amide I peak (1,650 cm−1). The raw and pre-processed spectral data are shown in Fig. 1, where breast cancer and healthy control spectra visually overlap throughout the whole spectral region, indicating the need for chemometric techniques to distinguish samples in such complex matrices. The pre-processed spectral data underwent chemometric analysis by several classification techniques (Table 1). Amongst the classification techniques tested, SPA-SVM presented the best classification performance, with an accuracy of 92.9% (94% sensitivity and 91% specificity) for detecting breast cancer samples in an external test set (15% of samples, n = 71 patients). Approximately 70% of samples (n = 334 patients) were used for model construction and another 15% (n = 71 patients) for internal validation. Overall classification performance, represented by the F-Score and G-Score values, was good (93%), indicating equal performance with or without considering imbalanced data. Figure 2 shows the receiver operating characteristic (ROC) curves for all models. The best ROC curve (area under the curve [AUC] = 0.929) was found for SPA-SVM, indicating an excellent predictive performance. PCA-SVM (AUC = 0.886) and GA-SVM (AUC = 0.871) were, respectively, the second and third best classification algorithms, demonstrating a good classification performance.

Figure 1

ATR-FTIR spectra of plasma samples in the bio-fingerprint region (1,800–900 cm−1). (a) Raw spectral data for breast cancer (BC) and healthy controls (HC) samples; (b) pre-processed spectral data (Savitzky–Golay smoothing [window of 7 points, 2nd order polynomial fitting] followed by AWLS baseline correction and normalization to the Amide I peak) for breast cancer (BC) and healthy controls (HC) samples.

Table 1 Statistical results in % for the test set using the PCA-LDA/QDA/SVM, SPA-LDA/QDA/SVM and GA-LDA/QDA/SVM to discriminate healthy controls and breast cancer samples.

Figure 2

Receiver operating characteristic (ROC) curve. Where, PCA-LDA: principal component analysis linear discriminant analysis; PCA-QDA: principal component analysis quadratic discriminant analysis; PCA-SVM: principal component analysis support vector machines; SPA-LDA: successive projections algorithm linear discriminant analysis; SPA-QDA: successive projections algorithm quadratic discriminant analysis; SPA-SVM: successive projections algorithm support vector machines; GA-LDA: genetic algorithm linear discriminant analysis; GA-QDA: genetic algorithm quadratic discriminant analysis; GA-SVM: genetic algorithm support vector machines. AUC: area under the curve.

The spectral variables selected by the best classification model (SPA-SVM) are shown in Fig. 3. In total, 16 wavenumbers (901, 959, 980, 999, 1,018, 1,277, 1,364, 1,402, 1,464, 1,489, 1,582, 1,311, 1,626, 1,643, 1,661, and 1,742 cm−1) were responsible for class differentiation using SPA-SVM. The tentative biochemical assignments of these variables based on Movasaghi et al.13 are shown in Table 2.

Figure 3

Selected wavenumbers by the successive projections algorithm support vector machines (SPA-SVM) model.

Table 2 Selected wavenumbers by the SPA-SVM to distinguish healthy controls and breast cancer samples.

Discussion

Breast cancer accounts for approximately 15% of all female cancer deaths and has a 5-year survival rate ranging from approximately 40% in low-income countries to ≥ 80% in developed countries14. Its incidence is continually increasing worldwide. This is partly due to a change in the distribution of risk factors: e.g., in developed countries such as the UK, there have been significant increases in women giving birth later in life and in the number of women childless by age 45 years. In addition, there has been an increasing adoption of Westernized lifestyles in developing countries14, which may be a risk factor for breast cancer.

Mammography-based breast cancer screening is a common practice for early detection of breast cancers, where its efficiency has been demonstrated in randomized controlled trials and observational studies; hence, most organizations that issue recommendations endorse regular mammography as an important part of preventive care15. However, although mammography-based breast cancer screening is associated with reduced morbidity and mortality, the majority of women who undergo screening will not develop breast cancer in their lifetime15. In addition to the low risk of cumulative exposure to radiation over time and the great discomfort or shame associated with mammography-based screening, false positive results may lead to additional tests and investigations potentially causing psychological distress and anxiety. Conversely, negative results (i.e., where no signs of abnormality are found in the screening) may falsely reassure women when cancer is actually present14. Moreover, mammography-based screening may also not benefit all women who are diagnosed with breast cancer, since it may lead to harm in women who undergo further biopsy for abnormalities that may not be breast cancer15. For these reasons, less invasive and more accurate breast cancer screening strategies are urgently needed.

Herein, ATR-FTIR spectroscopy in conjunction with chemometric techniques was used to detect breast cancer in a total cohort of 476 patients recruited over 2 years for an early-stage breast cancer screening program in Natal, Brazil. Breast cancer detection among normal samples was successfully performed based on the blood plasma spectra with 93% accuracy (94% sensitivity, 91% specificity, AUC = 0.929) in an external (blind) cohort of 71 patients using the SPA-SVM algorithm. Sixteen spectral features were responsible for class differentiation in the fingerprint region (Table 2). These are predominantly associated with phosphodiesters (P–O vibrations), polysaccharides (C–O stretching), proteins (CH3 bending, Amide III, Amide I band), nucleic acids (C=O stretching and C–C ring breathing mode), and lipids (C=O stretching and (C=C)cis). C–O vibrations in carbohydrates, P–O vibrations in phosphodiesters, and protein vibrations have been previously associated with breast cancer in serum15,16. Serum applications for breast cancer detection have been performed using IR spectroscopy by Backhaus et al.15, where 98% sensitivity and 95% specificity (using cluster analysis) and 92% sensitivity and 100% specificity [using artificial neural networks (ANN)] were obtained in a study carried out with 196 patients. Likewise, Elmi et al.16 detected breast cancer in serum-based IR spectroscopy with 76% sensitivity and 72% specificity for breast cancer cases using principal component analysis linear discriminant analysis (PCA-LDA) in a study with 86 samples (43 breast cancer, 43 healthy controls). Considering the larger number of patients, the results reported herein compare favourably, with sensitivity and specificity both > 90%; they are comparable to results obtained by more sophisticated methods such as quantum cascade laser IR imaging, where sensitivity and specificity have been reported at 94% and 86%, respectively, using a random forest classifier17.
However, there are no studies reporting breast cancer screening based on plasma samples using IR spectroscopy for a large cohort of samples. Herein, 476 patients were studied, resulting in a diagnostic accuracy, sensitivity and specificity above 90% for cancer detection.

Methods

Samples

In this study, we evaluated two groups of women. The first, Breast Cancer (BC), refers to a group of women diagnosed with breast cancer, with or without neoadjuvant treatment, whose samples were collected by professionals trained at the Liga Contra o Câncer Hospital (Natal/RN, Brazil) over a period of 2 years. The second, Healthy Controls (HC), refers to a group of women with no previous or current diagnosis of breast cancer, whose samples were collected at the Prontoclínica Dr. Paulo Gurgel (Natal/RN, Brazil) during the same time period. In both groups, patients were > 18 years old, and family history related to some type of cancer was not taken into account. The Institutional Ethics Committee for Human Research of the Hospital Universitário Onofre Lopes (HUOL), of the Federal University of Rio Grande do Norte (UFRN), Brazil, approved this study (Ethical Approval Number: 44113115.1.1001.5292) and informed consent was obtained from all subjects. All methods in this study were carried out in accordance with the approved guidelines. Samples from both groups were obtained after each patient had read and signed a Free Informed Consent Form. BD Vacutainer tubes with 5 mL EDTA were used with disposable vacuum syringes. Thereafter, the samples were centrifuged for 10 min and frozen at approximately − 20 °C until the time of analysis. A total of 476 samples were obtained.

ATR-FTIR spectroscopy

The samples were removed from the freezer 15 min before analysis to allow thawing. Samples were randomized and, to minimize temporal or instrumental effects, a similar number of samples from both groups were measured on each day. The absorption spectra were obtained using an attenuated total reflection Fourier-transform infrared (ATR-FTIR) spectrometer model IRAffinity-1S (Shimadzu Corp., Kyoto, Japan). The spectra were obtained in the range between 600 and 4,000 cm−1, with 32 co-added scans and 4 cm−1 spectral resolution (2 cm−1 data spacing). The ATR crystal was cleaned with alcohol (70% v/v) and acetone (P.A.) for each new sample and before setting the new background. A 10-μL sample aliquot was taken and the measurement performed. This procedure was repeated in triplicate. The measurement time for each sample was approximately 5 min.

The three spectra collected per sample were first averaged and the following pre-processing was applied to the dataset: truncation to the biofingerprint region (900–1,800 cm−1 with 468 wavenumber data points), Savitzky–Golay (SG) smoothing to remove random noise (window = 7 points, 2nd order polynomial fitting), automatic weighted least squares (AWLS) baseline correction, and normalization to the Amide I peak (1,650 cm−1).
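
The pre-processing chain described above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the PLS-Toolbox routines used in the study: the baseline step is a simple iterative polynomial fit standing in for AWLS, the function names are hypothetical, the window length and polynomial order are left as parameters, and the spectrum is synthetic.

```python
import numpy as np

def savgol_smooth(y, window=7, order=2):
    """Savitzky-Golay smoothing via a local least-squares polynomial fit."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, order + 1, increasing=True)  # columns t^0, t^1, ...
    weights = np.linalg.pinv(design)[0]    # weights giving the fitted value at t = 0
    padded = np.pad(y, half, mode="edge")  # repeat end values so the length is kept
    return np.convolve(padded, weights[::-1], mode="valid")

def polynomial_baseline(x, y, degree=2, n_iter=50):
    """Stand-in for AWLS (assumption): refit a polynomial, clipping points above it."""
    xs = (x - x.mean()) / (x.max() - x.min())  # rescale for a well-conditioned fit
    work = y.astype(float).copy()
    for _ in range(n_iter):
        fit = np.polyval(np.polyfit(xs, work, degree), xs)
        work = np.minimum(work, fit)
    return fit

def preprocess(wavenumbers, spectrum, window=7, order=2, amide_i=1650.0):
    """Smooth, subtract the baseline, then scale so the Amide I point equals 1."""
    smoothed = savgol_smooth(spectrum, window, order)
    corrected = smoothed - polynomial_baseline(wavenumbers, smoothed)
    ref = np.argmin(np.abs(wavenumbers - amide_i))  # point closest to 1,650 cm-1
    return corrected / corrected[ref]

# Synthetic example (not real data): one Gaussian band on a sloping baseline.
wn = np.linspace(900.0, 1800.0, 468)
raw = np.exp(-((wn - 1650.0) / 30.0) ** 2) + 0.0005 * wn
processed = preprocess(wn, raw)  # window of 7 points as reported in the Results
```

Dividing by the intensity at the Amide I position puts all spectra on a common scale, so differences in sample thickness or contact with the ATR crystal weigh less on the classifier.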

Data analysis

The spectral data import, pre-processing and construction of multivariate classification models were performed using the MATLAB R2014b environment version 8.4 (MathWorks, Inc., Natick, USA) with the PLS-Toolbox version 7.9.3 (Eigenvector Research, Inc., Manson, USA) and laboratory-made routines. All spectra were organized into a data matrix, where samples were represented as rows and the wavenumbers as columns. The samples were divided into three different subsets by the Kennard–Stone (KS) sample selection algorithm18: training (70%), validation (15%) and test (15%) sets. The training set was used to build the classification models, while the validation set was used to optimize them and evaluate their internal performance. Finally, the test set was used to evaluate the model classification performance towards external samples.
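
A minimal pure-Python sketch of the Kennard–Stone selection rule is given below for illustration. The text does not state how the three subsets were drawn in sequence, so the two-stage split in `ks_split` (training first, then validation from the remainder) is an assumption, as are the function names and the toy data.

```python
import math

def kennard_stone(samples, n_select):
    """Return indices of n_select (>= 2) samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two most distant samples.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_select and remaining:
        # Add the candidate whose nearest already-selected sample is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected[:n_select]

def ks_split(samples, n_train, n_val):
    """Assumed two-stage split: KS for training, KS again for validation, rest is test."""
    train = kennard_stone(samples, n_train)
    rest = [i for i in range(len(samples)) if i not in train]
    val = [rest[i] for i in kennard_stone([samples[i] for i in rest], n_val)]
    test = [i for i in rest if i not in val]
    return train, val, test

# Toy one-variable data (hypothetical): indices 0 and 4 are the extremes.
toy = [[0.0], [1.0], [2.0], [3.0], [10.0]]
picked = kennard_stone(toy, 3)
```

Because KS picks samples that span the measured space uniformly, the training set covers the extremes of the spectral variation, while the held-out sets tend to fall inside that range.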

The computational analysis consisted of testing three algorithms for feature extraction and selection: principal component analysis (PCA)19, successive projections algorithm (SPA)20 or genetic algorithm (GA)21; followed by discriminant analysis classifiers: linear discriminant analysis (LDA)22, quadratic discriminant analysis (QDA)22 or support vector machines (SVM)23. These algorithms were coupled as feature extraction/selection and classification as: PCA-LDA, PCA-QDA, and PCA-SVM; SPA-LDA, SPA-QDA, and SPA-SVM; and GA-LDA, GA-QDA, and GA-SVM.

PCA is a feature extraction method widely used for data reduction19. It decomposes the pre-processed spectral data into a small number of principal components (PCs) containing scores (variance on sample direction) and loadings (variance on wavenumber direction). The PCA scores are used to assess similarities/dissimilarities between the samples, while the PCA loadings are used to investigate potential spectral markers. SPA is a forward feature selection method20. Its purpose is to select wavenumbers whose information content is minimally redundant in order to solve co-linearity problems. The model starts with one wavenumber, then incorporates a new one at each iteration until it reaches a specified number of wavenumbers. SPA does not modify the original data space as PCA does. In SPA, the projections are used only for variable selection purposes. Thus, the relationship between the spectral variables is preserved.
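
The projection step of SPA can be sketched as follows. This is a simplified illustration, not the routine used in the study: it builds a single chain from a given starting variable, whereas a full SPA also compares chains with different starting points and lengths. The function names and the toy matrix are hypothetical, and in practice the columns would usually be mean-centred first.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spa_chain(X, start, n_select):
    """Build one SPA chain: repeatedly pick the least collinear remaining column.

    X is a list of rows (samples x variables); start is the first column index.
    """
    cols = [[row[j] for row in X] for j in range(len(X[0]))]  # working copies
    selected = [start]
    while len(selected) < n_select:
        ref = cols[selected[-1]]
        ref_norm2 = dot(ref, ref)
        best, best_norm2 = None, -1.0
        for j, col in enumerate(cols):
            if j in selected:
                continue
            # Project column j onto the subspace orthogonal to the last pick.
            scale = dot(col, ref) / ref_norm2
            cols[j] = [c - scale * r for c, r in zip(col, ref)]
            norm2 = dot(cols[j], cols[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        selected.append(best)
    return selected

# Toy matrix (hypothetical): column 1 is a multiple of column 0, column 2 is independent.
X_toy = [[1.0, 2.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
chain = spa_chain(X_toy, start=0, n_select=2)
```

In the toy matrix, the collinear column vanishes after projection, so the independent column is selected next; the selected indices refer to original wavenumbers, which is why the data space is left unchanged.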

On the other hand, the GA uses a combination of selection, recombination and mutation to select a set of variables21. The GA aims to reduce the original data to a small number of wavenumbers following a natural evolutionary process based on Darwin’s theory, where the best set of wavenumbers, in this case considered as a chromosome, is selected according to a fitness function. The GA routine was carried out over 100 generations with 200 chromosomes each, where mutation and crossover probabilities were set to 10% and 60%, respectively. The best solution in GA, in terms of fitness value, was obtained after three realizations starting from different random initial populations. Similarly to SPA, GA does not modify the original data space as PCA does. The SPA/GA fitness is calculated as the inverse of the cost function \(G\), which is defined as follows24:

$$ G = \frac{1}{{N_{V} }} \mathop \sum \limits_{n = 1}^{{N_{V} }} g_{n} $$
(1)

where \(N_{V}\) is the number of validation samples and \(g_{n}\) is defined as:

$$ g_{n} = \frac{{r^{2} \left( {x_{n} ,m_{I\left( n \right)} } \right)}}{{\min_{I\left( m \right) \ne I\left( n \right)} r^{2} \left( {x_{n} ,m_{I\left( m \right)} } \right)}} $$
(2)

where the numerator is the squared Mahalanobis distance between object \(x_{n}\) of class index \(I\left( n \right)\) and the sample mean \(m_{I\left( n \right)}\) of its true class; and the denominator is the squared Mahalanobis distance between object \(x_{n}\) and the centre of the closest wrong class. The advantages of these variable reduction methods (PCA, SPA and GA) prior to discriminant analysis lie in the fact that they efficiently remove co-linearity in the dataset, thus preserving only non-redundant information; they solve dimensionality problems for LDA and QDA; and they speed up the computational time for SVM.
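
As a small worked illustration of Eqs. (1) and (2), the sketch below computes \(G\) from squared Mahalanobis distances that are assumed to be already available. The function name and the numerical values are hypothetical and serve only to show the arithmetic.

```python
def cost_G(d2, true_class):
    """Average risk G over validation samples.

    d2[n][k] is the squared Mahalanobis distance between validation sample n
    and the mean of class k; true_class[n] is the index of its true class.
    """
    total = 0.0
    for dists, k_true in zip(d2, true_class):
        closest_wrong = min(d for k, d in enumerate(dists) if k != k_true)
        total += dists[k_true] / closest_wrong  # g_n in Eq. (2)
    return total / len(d2)                      # Eq. (1)

# Two hypothetical validation samples, two classes.
d2_example = [[1.0, 4.0],   # sample 0 (true class 0): g = 1/4
              [9.0, 3.0]]   # sample 1 (true class 1): g = 3/9
G = cost_G(d2_example, [0, 1])
fitness = 1.0 / G
```

A sample that lies close to its own class mean and far from the other class gives a small \(g_{n}\), so a low \(G\) (high fitness) corresponds to a variable subset that separates the classes well.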

LDA and QDA are discriminant analysis classifiers based on a Mahalanobis distance calculation between the samples; the main difference between them is that LDA assumes that classes have similar variance structures and hence uses a pooled covariance matrix, while QDA assumes that classes have different variance structures and therefore uses the variance–covariance matrix of each class individually22. The LDA classification score for sample \(i\) of class \(k\) (\(L_{ik}\)) is calculated in a non-Bayesian form by the following equation22,25:

$$ L_{ik} = \left( {{\mathbf{x}}_{i} - { }{\overline{\mathbf{x}}}_{k} } \right)^{{\text{T}}} {\mathbf{C}}_{{{\text{pooled}}}}^{ - 1} \left( {{\mathbf{x}}_{i} - { }{\overline{\mathbf{x}}}_{k} } \right) $$
(3)

where \({\mathbf{x}}_{i}\) is a vector with the input variables for sample \(i\); \({\overline{\mathbf{x}}}_{k}\) is the mean of class \(k\); and \({\mathbf{C}}_{{{\text{pooled}}}}\) is the pooled covariance matrix between the classes. The QDA classification score for sample \(i\) of class \(k\) (\(Q_{ik}\)) is estimated using the variance–covariance for each class \(k\) (\({\mathbf{C}}_{k}\)) in a non-Bayesian form as follows22,25:

$$ Q_{ik} = \left( {{\mathbf{x}}_{i} - { }{\overline{\mathbf{x}}}_{k} } \right)^{{\text{T}}} {\mathbf{C}}_{k}^{ - 1} \left( {{\mathbf{x}}_{i} - { }{\overline{\mathbf{x}}}_{k} } \right) $$
(4)
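
The scores in Eqs. (3) and (4) can be illustrated with a short NumPy sketch. The way the pooled covariance is formed (class covariances weighted by their degrees of freedom) is an assumption, since the text does not specify it; the function names and the toy data are hypothetical. A sample is assigned to the class with the smallest score.

```python
import numpy as np

def class_statistics(X, y):
    """Per-class means and covariances, plus a pooled covariance matrix."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}
    # Assumed pooling: class covariances weighted by their degrees of freedom.
    pooled = sum((np.sum(y == k) - 1) * covs[k] for k in classes) / (len(y) - len(classes))
    return means, covs, pooled

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^-1 (x - mean)."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

def lda_scores(x, means, pooled):
    return {k: mahalanobis_sq(x, m, pooled) for k, m in means.items()}   # Eq. (3)

def qda_scores(x, means, covs):
    return {k: mahalanobis_sq(x, m, covs[k]) for k, m in means.items()}  # Eq. (4)

# Toy data (hypothetical): two well-separated classes in two variables.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
means, covs, pooled = class_statistics(X, y)
x_new = np.array([0.5, 0.5])
L = lda_scores(x_new, means, pooled)
Q = qda_scores(x_new, means, covs)
predicted = min(L, key=L.get)  # class with the smallest LDA score
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse covariance explicitly; it still requires a non-singular matrix, which is one reason variable reduction is applied before LDA and QDA.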

SVM is a powerful supervised classification method that nonlinearly transforms the input sample space into a feature space using a kernel function that maximizes the margins of separation between the sample groups, and then constructs a linear hyperplane that discriminates the samples from different groups in this feature space23. In this study, a radial basis function (RBF) kernel was utilized. The RBF is calculated as follows26:

$$ k\left( {{\varvec{x}}_{i} ,{\varvec{z}}_{j} } \right) = \exp \left( { - \gamma \left\| {{\varvec{x}}_{i} - {\varvec{z}}_{j} } \right\|^{2} } \right) $$
(5)

where \({\varvec{x}}_{i}\) and \({\varvec{z}}_{j}\) are sample measurement vectors, and \(\gamma\) is a tuning parameter that controls the RBF width. In the RBF kernel function, the \(\gamma\) parameter was set to 1. The SVM classification rule is obtained by the following equation26:

$$ f\left( x \right) = {\text{sign}}\left( {\mathop \sum \limits_{i = 1}^{{N_{SV} }} \alpha_{i} y_{i} k\left( {{\varvec{x}}_{i} ,{\varvec{z}}_{j} } \right) + b} \right) $$
(6)

where \(N_{SV}\) is the number of support vectors; \(\alpha_{i}\) is the Lagrange multiplier; \(y_{i}\) is the class membership (± 1); \(k\left( {{\varvec{x}}_{i} ,{\varvec{z}}_{j} } \right)\) is the kernel function; and \(b\) is the bias parameter. These SVM parameters were obtained and optimized via the validation set.
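
The kernel and decision rule of Eqs. (5) and (6) can be sketched in a few lines of plain Python. The sketch assumes the usual squared Euclidean norm in the RBF exponent; the support vectors, multipliers and bias below are hypothetical values chosen for illustration, not the fitted model.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma = 1 as stated in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decide(z, support_vectors, alphas, labels, bias, gamma=1.0):
    """Sign of the weighted kernel sum over the support vectors, Eq. (6)."""
    s = sum(a * y * rbf_kernel(sv, z, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + bias >= 0 else -1

# Hypothetical model with one support vector per class.
svs = [[0.0, 0.0], [2.0, 0.0]]
alphas = [1.0, 1.0]
labels = [+1, -1]
bias = 0.0
near_first = svm_decide([0.1, 0.0], svs, alphas, labels, bias)   # closer to class +1
near_second = svm_decide([1.9, 0.0], svs, alphas, labels, bias)  # closer to class -1
```

Because the kernel decays with squared distance, each support vector mainly influences samples in its neighbourhood; a larger \(\gamma\) makes that neighbourhood narrower.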

Quality performance

The statistical parameters for the evaluation of the classification models were: accuracy (AC), sensitivity (SENS), specificity (SPEC), Youden’s Index (YOU), positive predictive value (PPV), negative predictive value (NPV), F-Score and G-Score. AC is the percentage of correct classifications achieved by the model. SENS measures the proportion of positive results that are correctly identified, while SPEC measures the proportion of negative results that are correctly identified. In this case–control study, sensitivity can be understood as the probability of finding a positive result when the disease is present, while specificity can be understood as the probability of finding a negative result when the disease is not present. Youden’s index (YOU) evaluates the classifier’s ability to avoid failure. The PPV measures the proportion of positive predictions that are correct (its value varies between 0 and 1); the NPV measures the proportion of negative predictions that are correct (its value varies between 0 and 1); the F-score represents the weighted average of the precision and sensitivity; and the G-score accounts for the model precision and sensitivity without the influence of positive and negative class sizes27. These parameters are calculated based on the equations shown in Table 3. In addition, a receiver operating characteristic (ROC) curve was generated for all models. The area under the curve (AUC) value was calculated to evaluate how well the model can distinguish the samples between the different classes analysed.

Table 3 Equations to calculate the figures of merit for model evaluation.
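
For illustration, the figures of merit can be computed from the four confusion-matrix counts as sketched below. Because the equations of Table 3 are not reproduced in this text, the formulas here are common textbook definitions and should be read as assumptions; in particular, the F-score and G-score are taken as the harmonic and geometric means of precision (PPV) and sensitivity, which may differ from the definitions in Table 3. The counts in the example are hypothetical.

```python
import math

def figures_of_merit(tp, tn, fp, fn):
    """Figures of merit from true/false positive and negative counts (fractions 0-1)."""
    sens = tp / (tp + fn)   # positives correctly identified
    spec = tn / (tn + fp)   # negatives correctly identified
    ppv = tp / (tp + fp)    # precision
    npv = tn / (tn + fn)
    return {
        "AC": (tp + tn) / (tp + tn + fp + fn),
        "SENS": sens,
        "SPEC": spec,
        "YOU": sens + spec - 1.0,
        "PPV": ppv,
        "NPV": npv,
        "F": 2.0 * ppv * sens / (ppv + sens),  # assumed: harmonic mean
        "G": math.sqrt(ppv * sens),            # assumed: geometric mean
    }

# Hypothetical counts: 10 diseased (9 detected), 10 healthy (8 correctly cleared).
merits = figures_of_merit(tp=9, tn=8, fp=2, fn=1)
```

Multiplying each value by 100 gives the percentages in which the results are reported.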