Plasma Metabolite Profiling and Chemometric Analyses of Lung Cancer along with Three Controls through Gas Chromatography-Mass Spectrometry

Lung cancer has been the leading cause of cancer death worldwide for several decades. This study focuses on the metabolite profiling of plasma from lung cancer (LC) patients and three control groups, namely healthy non-smokers (NS), smokers (S) and chronic obstructive pulmonary disease (COPD) patients, using gas chromatography-mass spectrometry (GC-MS) in order to identify the comparative and distinguishing metabolite pattern for lung cancer. Metabolites were identified through the National Institute of Standards and Technology (NIST) mass spectral (Wiley registry) and Fiehn Retention Time Lock (RTL) libraries. Mass Profiler Professional (MPP) software was used for alignment and for all statistical analyses. Of 1,877 aligned metabolites, 32 differed significantly among the three controls and lung cancer at p ≤ 0.001. A Partial Least Square Discriminant Analysis (PLSDA) model was generated using the statistically significant metabolites, which on external validation provided high sensitivity (100%) and specificity (78.6%). Elevated levels of fatty acids, glucose and acids were observed in lung cancer in comparison with the control groups, apparently due to enhanced glycolysis, gluconeogenesis, lipogenesis and acidosis, indicating a metabolic signature for lung cancer.

Lung cancer has been the leading cause of cancer death worldwide for several decades. Despite tremendous efforts, long-term survival has not improved significantly over the last 25 years. The 5-year survival rate of lung cancer patients remains only 15% [1], but may increase up to 80% if the lung cancer is detected at an early stage [2]. According to the International Agency for Research on Cancer (IARC) report for 2012, lung cancer is one of the most frequent cancers in the world and has the highest incidence worldwide (1.8 million, 13% of the total). As far as mortality is concerned, lung cancer is again at the top (1.6 million, 19.4% of the total) [3]. Several studies have been conducted on molecular biomarkers for the early detection of lung cancer at the genomic, epigenomic, proteomic and metabolomic levels [4-7] in order to reduce its mortality rate. Metabolomics in the post-genomic era is a powerful tool for profiling differences in metabolites among normal, precancerous and cancerous cells or tissues. Moreover, metabolomics has gained considerable importance due to recent advances in experimental methodologies and technologies, and the ability to process large amounts of data. On this basis, metabolomics approaches can permit early diagnosis or real-time monitoring of the effects of a disease [8].
Metabolic studies of lung cancer in human tissues and biofluids have been reported in the last few years. Kami et al. reported metabolomic profiling of lung and prostate tumor tissues by Capillary Electrophoresis Mass Spectrometry (CE-MS) [9]. Rocha et al. studied the metabolic differentiation between tumor and non-involved adjacent lung tissues by High Resolution Magic Angle Spinning Nuclear Magnetic Resonance (HRMAS-NMR) spectroscopy [10]. They found increased levels of lactate, phosphocholine (PC) and glycerophosphocholine (GPC) in tumors, while glucose, myo-inositol, inosine/adenosine and acetate levels were decreased. Carrola et al. applied Nuclear Magnetic Resonance (NMR) based metabonomics to blood plasma and urine [11] to search for metabolic signatures of lung cancer. Using a more global profiling approach, Jordan and colleagues reported the NMR analysis of paired tissue and serum samples from 14 subjects with two different lung cancer histological types (adenocarcinoma and squamous cell carcinoma), as well as of serum from 7 healthy individuals [12]. In another publication, a panel of 8 metabolites was identified for the diagnosis of breast, lung, colon or prostate cancers with high sensitivity and specificity [13].
A few targeted metabolic profiling studies of blood plasma/serum have been reported for lung cancer biomarker discovery. Maeda and coworkers reported differences in the plasma amino acid profiles of healthy controls and non-small-cell lung cancer (NSCLC) patients, as assessed by Liquid Chromatography Mass Spectrometry (LC/MS) [14]. Targeted analysis of lysophosphatidylcholines (lysoPC) showed irregular levels of lysoPC isomers with different fatty acyl positions in the plasma of lung cancer patients as compared to controls [15]. In another targeted analysis, serum lipid metabolite profiling of 58 lung cancer samples using Fourier transform ion cyclotron resonance MS has been reported [16].
Recent advances in NMR, GC-MS and LC/MS techniques have enabled the use of more global metabolomic approaches for the identification of novel biomarkers for specific diseases [7,17,18] as well as new targets for drug discovery and development. Among these techniques, GC-MS has proved particularly useful due to its high sensitivity and resolution, reproducibility and cost effectiveness. Moreover, in comparison to LC/MS, the availability of large GC-MS electron impact (EI) spectral libraries further aids the identification of biomarkers in various pathological conditions [19]. Only a few reports based on GC-MS analysis of lung cancer metabolites have been published. Metabolites in the serum and urine of 19 lung cancer patients and 15 patients with other lung diseases were analyzed using GC-MS [20]. Serum metabolomic analysis by GC-MS was performed on 29 healthy volunteers and 33 lung cancer patients [7]. A few studies on GC-MS based volatile organic compounds (VOC) as lung cancer biomarkers have also been reported [21-25]. In all the investigations cited above, either a limited number of samples was used or only one healthy control group was used to discriminate lung cancer metabolites. In the present study, we used 384 samples with three control groups, including healthy non-smokers, smokers and persons with COPD, in order to identify disease-related metabolites through comprehensive comparison. Previously, we developed a comprehensive, straightforward, reproducible and efficient sample preparation method, based on a 2D-C18 fractionation approach, which covers a wide range of metabolites for metabolite profiling [26]. In this investigation, all the samples were analyzed through the 2D-C18 method for the first time to investigate differentiating metabolite patterns between lung cancer and controls, followed by chemometric analyses.

Methods
Solvents and reagents. All solvents used for GC-MS analysis were of analytical grade. Methanol, hexane and ammonium hydroxide were purchased from Tedia (Tedia way, Fairfield, USA), while isopropanol and hydrochloric acid (37%) were purchased from Fisher Scientific (Loughborough, Leicestershire, U.K.). Formic acid and myristic-d27 acid were purchased from Sigma-Aldrich (St. Louis, MO, USA). MSTFA (N-Methyl-N-(trimethylsilyl) trifluoroacetamide) and methoxylamine hydrochloride were purchased from Acros Organic (New Jersey, USA). Deionized water (Milli-Q) was used throughout the study (Millipore, Billerica, MA, USA). Blood samples from male and female volunteers were collected at the JPMC Karachi, Pakistan, after consent. About 8 mL of blood was drawn in the morning from overnight-fasted volunteers into BD Vacutainer tubes (BD Franklin Lakes, NJ, USA, REF 367856) containing K2-ethylenediaminetetraacetic acid as an anticoagulant. Plasma was separated immediately by centrifugation at 4,500 rpm for 10 min at 4 °C. Finally, the plasma was aliquoted and frozen at −80 °C. A code was given to each sample. Sample collection description and codes are given in Tables 1 and 2.
Sample preparation. The method was carried out in accordance with our previous protocol [26] with some modifications. Samples were processed in 96-well plates. In each plate, a 100 µL aliquot of plasma from each sample was mixed with 800 µL of methanol, 20 µL of the internal standard myristic-d27 acid (1 mg/mL stock solution) was added, and the mixture was left on ice for 30 minutes. The precipitated proteins were then removed by centrifugation at 12,000 rpm for 10 min (Eppendorf Centrifuge 5804 C/R). Aliquots (600 µL) of the resulting clear supernatants were loaded onto the C18 96-well plate (Strata C18-E, 55 µm particle size, 70 Å pore size, 100 mg sorbent/well, Phenomenex, USA) and drawn through the solid phase under vacuum. Prior to extraction, the phase was activated with 2 × 300 µL of methanol and then further conditioned with 2 × 300 µL of water. After loading of the sample on the plate, the phase was washed with 2 × 200 µL of water and eluted with 600 µL of methanol. The eluates were collected in 96-well collection plates and then evaporated under N2 at room temperature. The dry samples were stored at 4 °C until analysis. The SPE extractions were performed on a solid phase extraction vacuum manifold, AH0-7502 (Phenomenex, USA).
Derivatization and GC-MS analysis. The dried extracts of all the samples were derivatized by adding 50 µL of methoxylamine hydrochloride in pyridine (15 mg/mL), vortexing and leaving for 2 hr at 35 °C. BSTFA with 1% trimethylchlorosilane (TMCS) was then added and the samples were kept at 70 °C for 60 min to form trimethylsilyl (TMS) derivatives. GC-MS parameters were the same as those reported in our previous paper [26]. GC-MS analysis was performed using a 7890A gas chromatograph (Agilent Technologies, USA), equipped with an Agilent Technologies GC sampler 120 (PAL LHX-AG12) autosampler and coupled to an Agilent 7000 Triple Quad system (Agilent Technologies, USA) and an HP-5MS 30 m × 250 µm (i.d.) fused-silica capillary column (Agilent J&W Scientific, Folsom, CA, USA), chemically bonded with a 5% diphenyl 95% dimethylpolysiloxane cross-linked stationary phase (0.25 µm film thickness), according to our previous report [26].
GC-MS data preprocessing and statistical analysis. Metabolite profiles of the blood samples were acquired using the optimized GC-MS assay. Data processing was performed using Agilent Mass Hunter Qualitative Analysis (version B.04.00). Peak integration and deconvolution were performed in Mass Hunter (parameters were the same as previously reported [26], except for an SNR threshold of 3.0). Putative identification of low molecular weight metabolites was established by comparing the mass spectra of the peaks with those available in the NIST mass spectral (Wiley registry NIST 11) and Fiehn RTL libraries. The identification of peaks was based on a 70% similarity index. All the GC-MS spectra were exported in CEF format and uploaded to MPP for peak alignment, normalization, significance testing, fold change and multivariate analysis of both identified and unidentified compounds. The data were filtered using the full scan range (m/z 50 to 650), a retention time window of 6.5 to 35 min and a minimum absolute abundance of 5,000 counts. Alignment parameters were set to a retention time tolerance of 0.05, a match factor of 0.3 and a delta MZ of 0.2. Data were normalized to unit scale. After normalization, baseline differences in metabolism between the four groups were eliminated. For baseline correction, all the compounds were treated equally regardless of their intensity: the mean abundance of each entity was subtracted from the corresponding values in each sample. A total of 1,877 entities were found across all samples after alignment. Entities were filtered by frequency (those which appeared in more than 50% of samples in at least one group of samples were chosen), p ≤ 0.001, fold change > 3 and coefficient of variation (CV) < 25%. Statistical significance was assessed using one-way ANOVA, with a probability level of 0.001 as the criterion for significance. Thirty-two entities were found to be significantly different between lung cancer and controls.
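The frequency, CV and fold-change filters and the mean-based baseline correction described above can be illustrated with a minimal sketch. This is not the MPP implementation: the function names, the example abundances and the choices of computing CV within a single group and fold change as a ratio of raw group means are all assumptions made for illustration.

```python
from statistics import mean, stdev

def frequent_in_any_group(presence, min_fraction=0.5):
    """presence maps group name -> list of booleans (entity detected in sample?)."""
    return any(sum(flags) / len(flags) > min_fraction for flags in presence.values())

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)

def fold_change(case_values, control_values):
    """Ratio of group means, reported as a value >= 1 regardless of direction."""
    ratio = mean(case_values) / mean(control_values)
    return ratio if ratio >= 1 else 1 / ratio

def mean_center(values):
    """Baseline correction: subtract the entity's mean abundance from every sample."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical data for one entity (arbitrary abundance units, not from the study).
presence = {"NS": [True, False, False, False], "LC": [True, True, True, False]}
lc = [900.0, 1000.0, 1100.0]
ns = [250.0, 300.0, 350.0]

keep = (
    frequent_in_any_group(presence)          # detected in > 50% of one group
    and coefficient_of_variation(lc) < 25.0  # CV < 25% (assumed: within group)
    and fold_change(lc, ns) > 3.0            # fold change > 3
)
print(keep, mean_center(lc))
```

In this toy case the entity is detected in 75% of the LC samples, has a CV of 10% and a fold change of about 3.3, so it passes all three filters.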
Tukey's honest significant difference (HSD) post hoc test was then applied to identify which entities were responsible for significant differences among the four groups. Hierarchical clustering was performed by applying Pearson's uncentered-absolute distance metric with complete linkage. Class prediction was performed using a PLSDA model. The PLSDA model was constructed from the 32 entities of the filtered data using four components, auto scaling and N-fold validation with three folds and ten repeats. Sensitivity and specificity were also measured from the constructed model. 40 samples were randomly selected and validated through the constructed model.
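The N-fold validation scheme (three folds, ten repeats) can be sketched as follows. This is an illustrative partitioning routine only, under the assumption that each repeat reshuffles the samples and that every fold serves once as the test set; the function name and seed are hypothetical, no classifier is fitted, and the split is not stratified by class, which MPP may or may not do.

```python
import random

def repeated_n_fold(sample_ids, n_folds=3, n_repeats=10, seed=0):
    """Yield (repeat, train_ids, test_ids) for repeated N-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        # Deal the shuffled samples into n_folds nearly equal parts.
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for k, test_ids in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
            yield repeat, train_ids, test_ids

# Hypothetical example with 12 sample indices.
splits = list(repeated_n_fold(range(12)))
print(len(splits))  # 3 folds x 10 repeats = 30 train/test splits
```

Within each repeat every sample is tested exactly once and used for training twice, which is the property that allows a confusion matrix to be accumulated over all samples.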

Results and discussion
A total of 384 plasma samples from healthy non-smokers, smokers, COPD patients and lung cancer patients (96 samples in each group) were profiled by GC-MS. The 2D-C18 sample preparation method was used for the enrichment of metabolites, based on our previous findings [26]. Data files were subjected to extensive statistical analysis using MPP software in order to identify the comparative and statistically distinguishing metabolites in the search for lung cancer biomarkers.
Significance testing and fold change. The purpose of significance testing and fold change analysis is to identify statistically differentiating metabolites by applying appropriate tests and conditions. Thirty-two metabolites out of 1,877 were found to be significantly different among the three controls (NS, S and COPD) and lung cancer using one-way ANOVA at a probability level of 0.001 and fold change > 3 (Table 3). Some of these metabolites were identified by library matching (Table 3), while the remainder were not identified at this similarity index (Table 3). The EI/MS spectra of the unidentified compounds are shown in the supplementary information (Fig. S1).
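For readers unfamiliar with the test, the one-way ANOVA F statistic for a single entity across the four groups can be computed as in the sketch below. The abundance values are hypothetical and the code is not the MPP implementation; the p-value compared with 0.001 would be obtained from the F distribution with the returned degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical normalized abundances of one entity in NS, S, COPD and LC.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)  # 9.0 3 8
```

A large F indicates that the between-group variation dominates the within-group variation; which pairs of groups differ is then a question for the post hoc test.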
After the completion of ANOVA, Tukey's honest significant difference (HSD) post hoc test was applied in order to find out which entities or metabolites were significantly differently expressed among controls and lung cancer. It was found that a large number of metabolites were significantly different between lung cancer and the three control groups. Compared with lung cancer, 31 metabolites in COPD, 30 in smokers and 27 in healthy non-smokers were significantly different. Only five metabolites were statistically different between smokers and COPD, showing the close resemblance between these two groups. In the healthy group, 11 and 12 metabolites were statistically significant as compared to COPD and smokers, respectively. A summary of Tukey's HSD post hoc test is shown in the supplementary information (Table S1), while the identities of the statistically significant metabolites which differed among the four groups are also provided in the supplementary information (Table S2). The Venn diagram shows the overlap of statistically differentiating metabolites between controls and lung cancer. In the comparison of lung cancer with smokers and COPD, no peaks overlapped in all the samples. 27 out of 32 overlapped in smokers and COPD, showing their close resemblance. However, 29 peaks were unique to the lung cancer group, which created differences between lung cancer and controls, while only 1 and 2 peaks overlapped between lung cancer and COPD and between lung cancer and smokers, respectively (Fig. 1A). In contrast, comparison of lung cancer with smokers and healthy non-smokers showed only 1 overlapping peak in all samples, while 20 peaks overlapped in healthy non-smokers and smokers. In this comparison, 24 peaks were unique to lung cancer, which created a difference between lung cancer and controls, while only 2 and 5 peaks overlapped between lung cancer and smokers and between lung cancer and healthy non-smokers, respectively (Fig. 1B).
Clustering. Cluster analysis is a powerful method to organize either entities (compounds) or groups of samples into clusters, based on the similarity of their profiles. Hierarchical clustering was performed to produce a dendrogram for clustering of sample groups using normalized intensities of the thirty-two significant metabolites (Fig. 2). The length of the vertical lines in the dendrogram is a measure of dissimilarity, so shorter lines indicate a closer relationship between groups. This approach clustered the four groups (three controls and the lung cancer group) into classes I, II and III (Fig. 2). Two groups, lung cancer (LC) and COPD, clustered together in class I with a dissimilarity level of only 0.206 (Fig. 2). In class II, three groups, LC, COPD and smokers (S), were joined at a dissimilarity level of 0.461 (Fig. 2). Clustering of all four groups in class III at a dissimilarity level of 0.924 (Fig. 2) indicated that healthy non-smokers (NS) are the most dissimilar from the other three groups, i.e. S, COPD and LC. Almost all the LC and COPD patients had a smoking background, which results in the close relationship of these three groups. A heat map using non-averaged samples (visualizing all samples) with normalized intensities of the thirty-two significant metabolites is shown in Fig. 3. From this figure, it is clear that the lung cancer profile differs markedly from the three controls when all the samples of each group are considered. There is also good reproducibility within each group, and most of the significantly differentiating metabolites are highly expressed in lung cancer as compared to the controls. Each histological subgroup of lung cancer was also compared with the control groups (Fig. S3 of supplementary material). Squamous cell carcinoma and small cell carcinoma of the lung are strongly related to smoking habit, and this is supported by our clustering analysis of significant metabolites in Fig. S3 (A and B), while adenocarcinoma of the lung did not cluster with smokers, as adenocarcinoma is the most common form of lung cancer among people who have seldom or never smoked in their lifetimes (Fig. S3C). Non-small cell lung cancer also did not cluster with smokers, which may be because most of the samples in this class were adenocarcinoma (a type of non-small cell lung cancer) (Fig. S3D).
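The clustering procedure can be illustrated with a small complete-linkage sketch. It assumes that the "Pearson uncentered-absolute" metric corresponds to one minus the absolute uncentered correlation (cosine similarity without mean removal); the group profiles below are hypothetical two-value vectors chosen for illustration, so the merge heights are not those of Fig. 2.

```python
from math import sqrt

def uncentered_abs_distance(x, y):
    """1 - |uncentered Pearson correlation| between two profiles (assumed metric)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - abs(dot / norm)

def complete_linkage(profiles):
    """Agglomerate named profiles; return the merge history as (members, height)."""
    clusters = {(name,): [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(
                    uncentered_abs_distance(a, b)
                    for a in clusters[keys[i]]
                    for b in clusters[keys[j]]
                )
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, left, right = best
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left + right, d))
    return merges

# Hypothetical group-mean profiles (not data from the study).
profiles = {
    "LC": [1.0, 0.0],
    "COPD": [1.0, 0.2],
    "S": [1.0, 1.0],
    "NS": [-0.2, 1.0],
}
merges = complete_linkage(profiles)
for members, height in merges:
    print(members, round(height, 3))
```

With these toy profiles LC and COPD merge first, smokers join next and non-smokers join last, mirroring the order of classes I to III described above.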
Class prediction model and test. A model was built using the thirty-two statistically significant metabolites. The Partial Least Square Discrimination (PLSD) algorithm was used to classify samples into discrete classes. The classes in the input data were randomly divided into three equal parts; two parts were used for training, and the remaining part was used for testing. The process was repeated ten times, with a different part used for testing in each iteration. Thus each row was used at least once in training and testing, and a confusion matrix was generated. The results of the confusion matrix (a matrix which gives the accuracy of prediction for each class) are presented in supplementary information Table S3. Figure 4 shows the PLS-DA scores plot. A clear separation trend was observed between the three controls (healthy non-smokers, smokers and COPD) and the lung cancer samples in the PLS-DA scores plot (Fig. 4). The smokers and COPD groups lie close to each other, as 27 entities were common between them (Fig. 1A). The lung cancer group was clearly separated from the control groups, as at least 24 entities in the lung cancer group were significantly different from the controls (Figure 1), and this is also seen in the heat map (Fig. 3). Sensitivity and specificity were also measured from the constructed model. Sensitivity was calculated as the ratio of true positives (cancer samples which were correctly predicted) to the total number of cancer samples tested, whereas specificity was determined as the ratio of true negatives (control samples which were correctly predicted) to the total number of control samples tested. Sensitivity and specificity were found to be 96.2% and 92.0%, respectively, and the overall accuracy of the model was found to be 93.1%. External validation measures the predictive capability (sensitivity and specificity) of a calculated model.
The model was externally validated on an independent, blind-test set of 38 plasma samples (8 healthy non-smokers, 10 smokers, 10 COPD and 10 lung cancer patients). The PLSDA classifier correctly predicted LC in 10 out of 10 patients, healthy non-smokers in 8 out of 8, COPD in 9 out of 10 and smokers in 5 out of 10, resulting in 100% sensitivity and 78.6% specificity. 50% of the smokers were incorrectly predicted by the model as COPD, possibly due to the common smoking history of both groups. All the sample prediction reports are shown in Figure S2 of the supporting information.
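The external validation figures follow directly from the reported counts and the definitions of sensitivity and specificity given above. The short sketch below recomputes them; it assumes that a control sample counts as a true negative only when it is assigned to its own control class, which is the reading under which the reported 78.6% is reproduced. The function and variable names are illustrative.

```python
# Correct predictions and group size for each class in the blind-test set.
external = {"LC": (10, 10), "NS": (8, 8), "COPD": (9, 10), "S": (5, 10)}

def sensitivity_specificity(results, case_group="LC"):
    """Sensitivity from the case group, specificity pooled over all control groups."""
    tp, n_cases = results[case_group]
    tn = sum(correct for g, (correct, _) in results.items() if g != case_group)
    n_controls = sum(total for g, (_, total) in results.items() if g != case_group)
    return tp / n_cases, tn / n_controls

sens, spec = sensitivity_specificity(external)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
# sensitivity=100.0%, specificity=78.6%
```

Here 22 of the 28 control samples (8 + 9 + 5) were assigned to their own class, giving 22/28 = 78.6%.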
Pathway analysis. Pathway analysis was carried out in MPP software using the thirty-two significantly differentiating metabolites. It revealed disturbances in several pathways, including pyruvate metabolism and the citric acid (TCA) cycle, fatty acid, triacylglycerol and ketone body metabolism, bile acid and bile salt metabolism, ATP Binding Cassette (ABC) family protein mediated transport and G-Protein Coupled Receptor (GPCR) downstream signaling.
Pyruvate metabolism and citric acid (TCA) cycle. All cells in our bodies require oxygen and nutrients, and energy is constantly needed to perform cellular functions. Proliferating cells need abundant nutrients for rapid growth; cancer cells therefore require a plentiful supply of nutrients. Most cancer cells are highly dependent on glucose for energy. Our experimental data showed that the level of glucose differed between lung cancer and control plasma samples: high levels of glucose were found in the plasma samples of lung cancer patients as compared to controls. Warburg reported the conversion of glucose to lactic acid in the presence of oxygen as a specific metabolic abnormality of cancer cells [27] (Mishra and Verma, 2010). A high level of lactic acid was also found in the plasma samples of lung cancer patients. The high level of glucose in lung cancer therefore does not indicate a decrease in glycolysis, as lactic acid is also up-regulated in lung cancer. Glycolysis results in the breakdown of glucose, but several reactions in the glycolysis pathway are reversible and participate in the re-synthesis of glucose, so gluconeogenesis may be responsible for the increased levels of glucose in lung cancer. Pathway analysis through MPP shows alteration or disturbance in lactic acid, carbon dioxide and phosphoric acid, which are involved in pyruvate metabolism and the citric acid (TCA) cycle, between controls and lung cancer. This is shown in Fig. S4 of the supplementary material.
Fatty acid, triacylglycerol and ketone body metabolism. Alterations in the metabolism of several lipids are often observed in lung cancer samples, including over-expression of fatty acid synthase (FAS). Comparatively high levels of fatty acids, including palmitic acid, octadecanoic acid and stearic acid, as well as cholesterol, were found in the plasma samples of lung cancer patients as compared to controls. FAS serves to store the energy derived from carbohydrate metabolism. Fatty acids are esterified to phospholipids, such as phosphatidylcholine [28].
They are activated to acyl-CoA in a 2-step reaction, forming diacylglycerides with glycerol 3-phosphate. These diacylglycerides then react with CDP-choline to form phosphatidylcholine. Pathway analysis through MPP shows alteration in phosphoric acid, palmitate, carbon dioxide, glycerol and arachidonic acid, which are involved in fatty acid, triacylglycerol and ketone body metabolism, between controls and lung cancer, as shown in Fig. S5 of the supplementary material. Over-expression of FAS has been observed in many lung cancer studies [10,11,29]. Experimental studies have indicated that various oncogenic signaling pathways lead to increased FAS expression [30,31]. Recently, SREBP (Sterol Regulatory Element-Binding Protein, a transcription factor that is a direct target of the PI3K/Akt and MAPK pathways) has been reported to regulate lipid synthesis and uptake through up-regulation of key enzymes of lipogenesis [32,33]. The high glucose content may be due to the high energy requirement of lung cancer cells, which results in carbohydrate metabolism and lipogenesis to provide energy in the form of glucose.
GPCR downstream signaling. In cancer cells (lung, gastric, colorectal, pancreatic and prostatic cancers), abnormal expression of GPCRs and/or their ligands has been observed [34,35]. Pathway analysis shows an increase in phosphoric acid, glycerol and arachidonic acid levels in lung cancer; these metabolites are involved in the GPCR downstream signaling pathway and are derived from the endocannabinoids anandamide (AEA) and 2-arachidonoyl glycerol (2-AG). The resulting altered pattern of receptor expression is shown in Fig. S6 of the supplementary material. This consequently leads to changes in fatty acid synthesis and glucose utilization [36].
ABC family protein mediated transport. ABC transporters are membrane proteins which use energy from ATP hydrolysis to actively transport a variety of compounds across the membrane, including ions, sugars, amino acids, lipids, toxins and anticancer drugs. ABC transporters are involved in tumor resistance. ABCB1, also known as MDR1 or P-glycoprotein, is involved in lipid transport, which is its main function [37]. Pathway analysis shows alteration of phosphoric acid and cholesterol, which are involved in ABC family protein mediated transport, as shown in Fig. S7 of the supplementary material.
Bile acid and bile salt metabolism. Bile acids are steroidal amphipathic molecules derived from the catabolism of cholesterol. The catabolism of cholesterol to bile acids is an important route for the elimination of cholesterol from the body, accounting for approximately 50% of the cholesterol eliminated daily. Bile acids are involved in signal transduction pathways that regulate apoptosis [38]. Pathway analysis shows alteration of phosphoric acid and cholesterol, which are involved in bile acid and bile salt metabolism, as shown in Fig. S8 of the supplementary material. An increasingly acidic environment (decreased pH) is common in cancer cells due to the production of lactic acid. Our experimental data show high levels of lactic acid, phosphoric acid and benzoic acid in lung cancer patients as compared to controls. The acidic environment of cancer typically results in necrosis or apoptosis through p53 and caspase-3-dependent mechanisms [39]. Consequently, up-regulation of glycolysis requires resistance to apoptosis or up-regulation of membrane transporters to maintain pH. These changes may result in a malignant phenotype and facilitate local invasion and metastasis formation [39].
Concluding remarks. Our study has shown that GC-MS-based metabolite profiling of blood plasma using the 2D-C18 fractionation approach, followed by chemometric analyses, is able to identify biomarker metabolites which can significantly differentiate lung cancer from three control groups (healthy non-smokers, smokers and COPD) with high sensitivity (96.2%) and specificity (92.05%). Two groups, lung cancer (LC) and COPD, are very close to each other (dissimilarity level of only 0.206 by cluster analysis). Elevated levels of almost all the fatty acids, glucose and acids were found in lung cancer patients in comparison to the controls. Generally, glycolysis is increased in cancer, but in this study a high level of glucose was found in lung cancer samples as compared to controls. However, the high level of glucose in lung cancer does not indicate a decrease in glycolysis, as lactic acid is also up-regulated in lung cancer. From the pathway analysis, it was concluded that glycolysis results in the breakdown of glucose, but that several processes, such as gluconeogenesis, carbohydrate metabolism and lipogenesis, may be responsible for the increased levels of glucose in lung cancer, providing energy in the form of glucose. An increasingly acidic environment (decreased pH) and alterations in the metabolism of several lipids favor lung cancer growth. A promising finding is the newly built model based on thirty-two significant metabolites, which accurately classifies lung cancer and controls on external validation. Unfortunately, only 37% of the metabolites were characterized and correlated with their pathways. Identification of the unknown metabolites with high-resolution techniques could expand the known human metabolome and ultimately help in biomarker identification for lung cancer.