Introduction

Lung cancer is among the leading causes of cancer death in the world, especially in developing countries. In the past several years, the morbidity and mortality related to lung cancer have been on the rise, mostly due to late diagnosis and metastasis [1, 2, 3]. Therefore, early diagnosis is an important factor for improving the prognosis of lung cancer patients. The vast majority of lung cancer patients in an early stage exhibit no symptoms, and the cancer is commonly detected as an abnormal shadow on a chest roentgenogram or a chest computed tomography (CT) scan [4]. It is therefore urgent to identify better methods that can provide more information for lung cancer diagnosis, especially during the early stages. The biomarkers currently used in the clinic for the early diagnosis of lung cancer are carcinoembryonic antigen (CEA), cytokeratin-19 fragments (CYFRA 21-1) and neuron-specific enolase (NSE) [5, 6, 7]. However, all of these biomarkers have a poor positive predictive value for lung cancer patients, especially for those in an early stage, and some of these biomarkers are not specific to lung cancer.

Applications of proteomic technologies in recent years have significantly broadened our understanding of the molecular mechanisms of numerous diseases and have aided in biomarker screening and drug target discovery. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) is considered a high-efficiency comparative proteomic approach that possesses diverse advantages over traditional protein separation and identification techniques [8]. Moreover, SELDI-TOF-MS and pattern recognition software have been successfully used in the identification of specific markers for certain diseases, such as prostate cancer, ovarian cancer and rheumatism [8, 9, 10].

In this study, serum specimens from lung cancer patients and non-cancer controls were subjected to magnetic bead-based SELDI-TOF-MS. Characteristic peaks for lung cancer patients were identified, and a diagnostic model was constructed. The diagnostic model was then validated using a blind test set, the results of which suggest that the model may be of clinical significance in lung cancer diagnosis.

Materials and methods

Patient recruitment and sample collection

Serum specimens from 30 lung cancer patients (CA), 30 pulmonary tuberculosis (benign lung disease) patients (TB), and 33 healthy individuals (HC) were collected. Another 8 randomly selected patients with lung cancer, 10 cases of pulmonary tuberculosis and 10 healthy volunteers were enrolled to test the validity of the diagnostic model constructed in this study. The demographic characteristics of the lung cancer patients and the control subjects are shown in Table 1. Informed consent was obtained from every participant, and the study received ethical committee approval. The blood samples were collected in 4 mL BD vacutainer tubes without anticoagulant, allowed to clot at room temperature for up to 1 h, and centrifuged at 4 °C for 5 min at 1000×g. The pooled sera were frozen and stored at -80 °C for future analysis.

Table 1 Demographic characteristics of lung cancer patients and control subjects.

Proteomic fraction preparation and SELDI-TOF-MS assay

Serum samples were denatured in the presence of U9 lysis buffer (9 mol/L urea, 2% CHAPS, 50 mmol/L Tris-HCl, pH 9.0; Bio-Rad) and then mixed with WCX-2 (NaAc, pH 4.0) buffer. The sample was then incubated with WCX-2-pretreated magnetic beads (Changchun Bokun Science and Technology, China) for 60 min. The beads were then washed twice with NaAc and eluted with 1% trifluoroacetic acid (TFA) [11]. The protein fraction was then lyophilized and prepared for SELDI-TOF-MS (Bio-Rad, USA) according to the manufacturer's instructions. Masses were acquired over the range m/z 1000 to 50 000.

Protein identification and bioinformatics analysis

Protein database searching was performed with the MASCOT search engine (http://www.matrixscience.com/; Matrix Science, London, UK), which compared the monoisotopic peaks against the NCBI nonredundant protein database (http://www.ncbi.nlm.nih.gov/). The allowed mass tolerance was less than 0.05% [12, 13, 14]. Proteins with MASCOT scores greater than 63 were considered significant (P<0.05).
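As an illustration of the tolerance criterion only, the following minimal sketch checks whether an observed mass lies within 0.05% of a candidate theoretical mass. The function names and the example masses are hypothetical and are not taken from the MASCOT software, whose internal matching procedure is not described here.

```python
def relative_error_percent(observed_mass: float, theoretical_mass: float) -> float:
    """Absolute mass deviation expressed as a percentage of the theoretical mass."""
    return abs(observed_mass - theoretical_mass) / theoretical_mass * 100.0


def within_tolerance(observed_mass: float,
                     theoretical_mass: float,
                     tolerance_percent: float = 0.05) -> bool:
    """True when the deviation is strictly below the allowed tolerance (0.05% by default)."""
    return relative_error_percent(observed_mass, theoretical_mass) < tolerance_percent


if __name__ == "__main__":
    # Hypothetical masses chosen only to show one accepted and one rejected match.
    print(within_tolerance(10000.0, 10004.0))  # deviation of about 0.04%
    print(within_tolerance(10000.0, 10006.0))  # deviation of about 0.06%
```

With a relative tolerance, the absolute window widens with mass: at 10 000 Da it is roughly 5 Da either side, whereas at 1000 Da it is roughly 0.5 Da.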

Data processing and pattern recognition

Principal component analysis (PCA) was performed to cluster the samples (MATLAB, MathWorks, USA). The score plots aided in visualizing the data, and a diagnostic model was constructed with the marker proteins using a linear discriminant analysis method. The classification performance (specificity and sensitivity) was assessed using the area under the curve (AUC) values of the receiver operating characteristic (ROC) curves [15, 16]. The mass spectrometry analysis system was applied to identify the characteristic molecules corresponding to the featured peaks in the Metlin (http://metlin.scripps.edu/) and HMDB (http://www.hmdb.ca/) databases. The protein hits were verified using text mining techniques.
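To make the AUC criterion concrete, the sketch below computes the area under an empirical ROC curve as the fraction of (case, control) pairs in which the case receives the higher classifier score, with ties counted as one half. This is a generic illustration, not the MATLAB procedure used in this study; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    labels: 1 for cases (e.g. lung cancer), 0 for controls.
    Ties between a positive and a negative score contribute 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical classifier scores for two cases followed by two controls.
    print(roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

An AUC of 1.0 corresponds to perfect separation of cases and controls, and 0.5 to a classifier that performs no better than chance.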

Results

Identification of the serum protein profile

The protein fractions of the serum samples from 30 lung cancer patients, 30 pulmonary tuberculosis patients and 33 healthy controls were enriched using WCX magnetic beads, which is a particularly effective method for the detection of low-molecular-weight proteins and peptides, and were analyzed by SELDI-TOF-MS. Following baseline correction and peak alignment, protein signals were obtained for all samples. Representative mass chromatogram data for lung cancer patients, pulmonary tuberculosis patients and healthy controls are shown in Figure 1.

Figure 1

Representative data of mass chromatogram analysis. The data were obtained from MS assays of lung cancer patients, pulmonary tuberculosis patients and healthy controls. The x-axis represents the mass-to-charge ratio (m/z), and the y-axis represents the relative intensity.

Protein fingerprint analysis of serum protein from lung cancer patients

The mass spectrum data of the 121 samples were analyzed using Biomarker Wizard (version 3.1.0) to identify the peaks that differed between the lung cancer patients and the control individuals. The Shapiro-Wilk test was used to evaluate the normality of the distribution of the peaks, and the homogeneity of the variance was assessed by Levene's test. All of the peaks were sorted by the P value from the ANOVA and the Student-Newman-Keuls post-hoc test using SPSS 16.0 (SPSS Inc, USA). Seventeen significant discriminating m/z peaks (4188.21, 4548.39, 4763.26, 4983.18, 5069.19, 5351.19, 5486.84, 6212.41, 6445.26, 6573.47, 9725.37, 11705.4, 11769.7, 15126.9, 15335.2, 15938.7, and 19790.5) were found between the lung cancer group and the control groups (P<0.05), and are shown in Table 2. The importance of these peaks, as determined by the Biomarker Patterns Software, is listed as well. The most important peak was assigned an importance index of 100. The importance of each of the other peaks was calculated relative to that of the top peak, giving a value below 100. The peaks at 4763.26, 5069.19, 5351.19, 5486.84, 6212.41, 11705.4, 11769.7, 15335.2, and 15938.7 Da were higher in the lung cancer patients than in the control groups, and the peaks at 4188.21, 4548.39, 4983.18, 6445.26, 6573.47, 9725.37, 15126.9, and 19790.5 Da were lower in the lung cancer patients than in the control groups.
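The relative importance index described above can be sketched as a simple rescaling in which the top-ranked peak is set to 100. The raw scores below are hypothetical placeholders, since the raw values produced by the software are not reported here, and the function name is likewise an assumption made for illustration.

```python
from typing import Dict


def relative_importance(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw importance scores so that the highest-scoring peak equals 100."""
    top = max(raw_scores.values())
    if top <= 0:
        raise ValueError("the top raw score must be positive")
    return {peak: 100.0 * score / top for peak, score in raw_scores.items()}


if __name__ == "__main__":
    # Hypothetical raw scores for three of the discriminating peaks.
    raw = {"6445.26": 8.0, "9725.37": 4.0, "11705.4": 2.0}
    for peak, index in relative_importance(raw).items():
        print(peak, index)
```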

Table 2 Discriminant m/z peaks between the lung cancer group and the control groups.

Diagnostic model construction and validation

The 17 m/z peaks that could discriminate between the lung cancer group and the control groups were analyzed with Biomarker Patterns Software (version 5.0) to select peaks for the establishment of a diagnostic biomarker pattern. The m/z peaks at 6445, 9725, 11705, and 15126 were selected by the pattern recognition software as the best markers to construct a diagnostic model for lung cancer (Figure 2). This four-peak model, established in the training set, could discriminate lung cancer patients from healthy individuals as well as from pulmonary tuberculosis patients with a sensitivity of 93.3% (28/30) and a specificity of 90.5% (57/63). The decision tree is presented in Figure 3, and the characteristics of the diagnostic model are shown in Figure 4. The prediction accuracy was validated using a blind test set consisting of 28 randomly selected individuals. The sensitivity and specificity of the prediction are shown in Table 3. We combined database searching with literature mining to determine the identities of the proteins corresponding to the featured peaks. Three of the featured proteins were identified as chaperonin (M9725), hemoglobin subunit beta (M15335) and serum amyloid A (M11548). There was no protein match for the 6445 Da peak in the searched databases, indicating that it might be a novel protein.

Figure 2

Four characteristic peaks in lung cancer patients. Protein spectra of serum samples from two different lung cancer patients (CA), two pulmonary tuberculosis patients (TB) and two healthy controls (HC). The x-axis represents the mass-to-charge ratio (m/z), and the y-axis represents the relative intensity.

Figure 3

Boosting decision tree classification of the participants. The root node (top) and descendant nodes are shown as ellipses, and the terminal nodes (Nodes 1–7) are shown as rectangles. Each node gives a mass value followed by the condition "lower than or equal to" an intensity value. If the answer to the question in a node of the tree is yes, proceed down to the left node; otherwise (i.e., no), proceed down to the right node. At the terminal nodes, the decision tree assigns samples to one of three groups: samples in terminal nodes 2, 3, 4, and 6 are assigned to TB, those in terminal node 1 to CA, and those in terminal nodes 5 and 7 to HC. The numbers in the rectangles represent the actual clinical diagnosis of the samples assigned to that terminal node by the decision tree (e.g., in terminal node 1, the decision tree assigned 31 samples to CA, but only 28 of them were CA according to the clinical diagnosis).
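The traversal rule in this caption (go left when the peak intensity is lower than or equal to the node's cut-off, otherwise go right) can be sketched as follows. The tree below is a toy with hypothetical cut-offs and a hypothetical arrangement of nodes; it is a single tree rather than a boosted ensemble, and it does not reproduce the seven-terminal-node tree of Figure 3, whose intensity values are not given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # predicted group: "CA", "TB" or "HC"


@dataclass
class Split:
    peak: str         # m/z peak whose intensity is tested at this node
    threshold: float  # hypothetical intensity cut-off
    left: "Node"      # followed when intensity <= threshold (answer "yes")
    right: "Node"     # followed otherwise (answer "no")


Node = Union[Leaf, Split]


def classify(node: Node, intensities: Dict[str, float]) -> str:
    """Walk from the root to a terminal node and return its group label."""
    while isinstance(node, Split):
        if intensities[node.peak] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Toy tree: peaks are from the four-peak model, cut-offs and layout are invented.
TOY_TREE = Split(
    "9725", 5.0,
    left=Split("6445", 2.0, left=Leaf("CA"), right=Leaf("TB")),
    right=Split("15126", 3.0, left=Leaf("TB"), right=Leaf("HC")),
)

if __name__ == "__main__":
    sample = {"9725": 4.0, "6445": 1.0, "15126": 9.0}  # hypothetical intensities
    print(classify(TOY_TREE, sample))
```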

Figure 4

ROC of the boosting decision tree.

Table 3 Sensitivity and specificity of decision tree model.

To explore the clinical significance of the constructed model, its validity was tested with a blind test set consisting of 8 randomly selected lung cancer patients, 10 pulmonary tuberculosis patients and 10 healthy volunteers. The sensitivity of the diagnostic model was 75.0% (6/8), and the specificity was 95.0% (19/20).
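For reference, the reported percentages follow directly from the counts given in the text. The sketch below recomputes them; the function names are illustrative, and the false-negative and false-positive counts are simply derived from the reported fractions (for example, 28/30 implies 2 false negatives).

```python
def sensitivity_percent(true_pos: int, false_neg: int) -> float:
    """Percentage of lung cancer patients correctly classified as lung cancer."""
    return 100.0 * true_pos / (true_pos + false_neg)


def specificity_percent(true_neg: int, false_pos: int) -> float:
    """Percentage of non-cancer subjects (TB and HC) correctly classified as non-cancer."""
    return 100.0 * true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Training set: 28 of 30 CA detected; 57 of 63 controls correctly excluded.
    print(round(sensitivity_percent(28, 2), 1), round(specificity_percent(57, 6), 1))
    # Blind test set: 6 of 8 CA detected; 19 of 20 controls correctly excluded.
    print(round(sensitivity_percent(6, 2), 1), round(specificity_percent(19, 1), 1))
```

With only 8 lung cancer cases in the blind test set, each misclassified case shifts the sensitivity by 12.5 percentage points, which is why the test-set size is noted as a limitation.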

Discussion

During the last several decades, the identification of novel biomarkers for complex diseases has become increasingly successful because of the emergence of high-throughput proteomic techniques such as SELDI-TOF-MS [17]. Biomarkers, especially biomarker patterns, are considered to be reliable and powerful tools for the early diagnosis, differential diagnosis, and therapy of some diseases [18, 19]. Analysis of serum proteins by SELDI-TOF-MS provides new information about small proteins with high serum abundances [20, 21]. Magnetic beads with large surfaces are better able to enrich proteins from serum than other materials. Joint application of magnetic beads and SELDI-TOF-MS might be a more powerful strategy to discover novel serum biomarkers with low abundances.

In this study, the protein fingerprints in the sera from 30 lung cancer patients, 30 pulmonary tuberculosis patients and 33 healthy controls were analyzed using magnetic beads and SELDI-TOF-MS, and seventeen characteristic m/z peaks were identified using Biomarker Wizard. Theoretically, multiple markers are much more powerful and reliable in the diagnosis of a disease than a single marker alone. In our study, the 17 discriminating m/z peaks were analyzed by Biomarker Patterns Software, and four protein peaks, those at 6445, 9725, 11705, and 15126 m/z, were selected as markers for lung cancer diagnosis. This four-peak model, established in the training set, could discriminate lung cancer patients from healthy individuals and from pulmonary tuberculosis patients with a sensitivity of 93.3% (28/30) and a specificity of 90.5% (57/63). When the validity of the model was tested using a blind test set, the sensitivity of the diagnostic model was 75.0% (6/8), and the specificity was 95.0% (19/20). Despite the limited size of the test set, these data indicate that our study provides a novel and potent tool to distinguish lung cancer patients from tuberculosis patients and healthy individuals using serum.

In conclusion, our findings suggest that the application of magnetic beads and SELDI-TOF-MS could be a potent strategy for the identification of serum biomarkers. More importantly, the diagnostic model we constructed using the protein peaks at 6445, 9725, 11705, and 15126 m/z could successfully distinguish lung cancer patients from tuberculosis patients and normal controls, which might be of clinical significance in the early diagnosis of lung cancer. Database searching combined with literature mining showed that the featured peaks were chaperonin (M9725), hemoglobin subunit beta (M15335), serum amyloid A (M11548) and an unknown protein. A previous report has shown that serum amyloid A could be a promising serum biomarker for lung cancer, which is consistent with the results of our study. However, the clinical significance of chaperonin and hemoglobin subunit beta in lung cancer diagnosis deserves further investigation.

Author contribution

Qi-bin SONG and Hua-zong ZENG designed research; Wei-guo HU and Yi YAO performed research; Peng WANG contributed new reagents and samples; Wei-guo HU and Peng WANG analyzed data; and Qi-bin SONG, Wei-guo HU, and Hua-zong ZENG wrote the paper.