High-resolution metabolomic biomarkers for lung cancer diagnosis and prognosis

Lung cancer is the leading cause of cancer mortality, largely because of the lack of early-diagnosis technology. Low-dose computed tomography (LDCT) is one of the main techniques used for cancer screening. However, LDCT still carries a risk of radiation exposure and is not suitable for the general public. In this study, plasma metabolic profiles of lung cancer were obtained using a comprehensive metabolomic method that combines different liquid chromatography methods with a Q-Exactive high-resolution mass spectrometer. Metabolites with different polarities (amino acids, fatty acids, and acylcarnitines) can be detected and identified as differential metabolites of lung cancer in small volumes of plasma. Logistic regression models were further developed to identify cancer stages and types using those significant biomarkers. Using the Variable Importance in Projection (VIP) and the area under the curve (AUC) scores, we identified the top 5, 10, and 20 metabolites that can be used to differentiate lung cancer stages and types. With the five most significant metabolites, the discrimination accuracy and AUC score reach 0.829 and 0.869, respectively. This study demonstrates that using five or more metabolites (Palmitic acid, Heptadecanoic acid, 4-Oxoproline, Tridecanoic acid, Ornithine, etc.) has potential for early lung cancer screening. This finding is useful for transferring the diagnostic technology onto a point-of-care device for lung cancer diagnosis and prognosis.

*Shi-ang Qi and Qian Wu contributed equally, and both are the first authors.
**Corresponding author: Jie Chen, Youguang Huang, and Yunchao Huang; jc65@ualberta.ca (Jie Chen); huangyouguang2008@126.com (Youguang Huang); huangyunchao2013@163.com (Yunchao Huang).
Figure S1. The PCA score plots of metabolic profile of QC, health group and lung cancer group.
Figure S2. The OPLS-DA score plots and validation graphs of metabolic profile.
Figure S3. Pathway analysis of the differential metabolites between healthy controls and lung cancer patients.
Figure S4. Enrichment analysis based on KEGG of the differential metabolites.
Figure S5. Enrichment analysis based on SMPDB of the differential metabolites.
Table S1. Qualitative identification results of differential plasma metabolites of lung cancer.
Table S2. The combined results of the core differential plasma metabolites of lung cancer.
Table S3. Performance of logistic regression models with various biomarkers for discriminating healthy controls, early-stage patients, and advanced-stage lung cancer patients.
Table S4. Performance of logistic regression models with various biomarkers for discriminating different lung cancer types.
Table S5. Summary of the top 20 significant differential metabolites of lung cancer.

List of supplementary materials
A. Method: The detailed information of metabolomic analyses.
Table S3. Performance of logistic regression models with various biomarkers for discriminating healthy controls, early-stage patients, and advanced-stage lung cancer patients.
Metabolite Name, (a) RPLC Neg, (b) RPLC Pos, and (c) HILIC; Trend, average normalized intensity of lung cancer patients compared with the controls.

LC-MS and LC-MS/MS
An Ultimate-3000 UPLC system coupled to a Q Exactive hybrid quadrupole-Orbitrap MS system (Thermo Scientific) was used for the sample analysis. Before injection, the residues were resuspended in platform-specific solutions. A combination of three conditions:

Data Preprocessing
MS raw data were acquired using the software Xcalibur (version 3.1, Thermo Scientific). Spectra selection, retention time alignment, and peak identification were performed using Compound Discoverer (version 3.0, Thermo Scientific) to obtain a data matrix containing molecular weight, retention time, peak intensity, and annotation result. A workflow named "Untargeted Metabolomics with statistics detect unknowns with ID using Online Database and mzLogic" was chosen for processing. Key processing parameters were: Mass
Using the metabolites in Table S6, a linear relationship, log(P/(1-P)) = 4.84E-05 − 3.62E-05 × Palmitic acid − 7.29E-04 × Heptadecanoic acid + 6.43E-03 × Ornithine − 1.92E-02 × Tridecanoic acid + 1.48E-05 × Stearic acid, was used for calculations. The optimal threshold (cut-off point) for the above logistic model is 0.338. Figures S6 (a) (b) and Table S6 show that, with this held-out validation, the AUC increases from 85.5% to 88.7% when the top ten significant metabolites are used to distinguish early-stage lung cancer patients from healthy people.
We also established a discriminant model for distinguishing advanced-stage lung cancer patients from healthy people. A discovery cohort of 50 healthy control samples and 29 advanced-stage (stage III and stage IV) samples was used to find prominent metabolites and train the discriminant model. A held-out validation cohort consisting of 25 healthy controls and 15 advanced-stage cancer patients was used to validate the performance of the trained model. Using the metabolites in Table S7, a linear relationship, log(P/(1-P)) = 3.31E-05 − 4.22E-05 × Palmitic acid − 2.42E-03 × Heptadecanoic acid + 7.69E-03 × Ornithine − 9.44E-03 × Tridecanoic acid − 2.01E-07 × Stearic acid, was used. The optimal threshold (cut-off point) for the above logistic model is 0.271. The results in Figures S6 (c) (d) and Table S7 show that the models discriminate healthy controls from advanced-stage lung cancer patients, and healthy controls from early-stage lung cancer patients, with validation accuracy above 0.85 in each case.
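To make the use of the two fitted relationships above concrete, the following minimal sketch evaluates log(P/(1-P)) for one sample and compares P with the reported cut-off. The coefficients and cut-offs are copied from the text. Everything else is an assumption made for illustration: that P is the probability of the lung cancer class, that a sample is called positive when P is at or above the cut-off, that the metabolite values are in the same normalized intensity units used to fit the models, and the names `LogisticModel`, `probability`, and `is_positive`, which are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogisticModel:
    intercept: float
    coefficients: dict  # metabolite name -> coefficient
    threshold: float    # reported optimal cut-off on P

    def probability(self, intensities):
        """Return P from log(P/(1-P)) = intercept + sum(coef * intensity)."""
        logit = self.intercept + sum(
            coef * intensities[name] for name, coef in self.coefficients.items()
        )
        # Numerically stable logistic function (avoids overflow in exp).
        if logit >= 0:
            return 1.0 / (1.0 + math.exp(-logit))
        e = math.exp(logit)
        return e / (1.0 + e)

    def is_positive(self, intensities):
        # Assumption: positive means P >= cut-off; the text does not state this.
        return self.probability(intensities) >= self.threshold


# Coefficients as reported for healthy controls vs. early-stage patients.
EARLY_STAGE = LogisticModel(
    intercept=4.84e-05,
    coefficients={
        "Palmitic acid": -3.62e-05,
        "Heptadecanoic acid": -7.29e-04,
        "Ornithine": 6.43e-03,
        "Tridecanoic acid": -1.92e-02,
        "Stearic acid": 1.48e-05,
    },
    threshold=0.338,
)

# Coefficients as reported for healthy controls vs. advanced-stage patients.
ADVANCED_STAGE = LogisticModel(
    intercept=3.31e-05,
    coefficients={
        "Palmitic acid": -4.22e-05,
        "Heptadecanoic acid": -2.42e-03,
        "Ornithine": 7.69e-03,
        "Tridecanoic acid": -9.44e-03,
        "Stearic acid": -2.01e-07,
    },
    threshold=0.271,
)

if __name__ == "__main__":
    # Made-up intensities, for illustration only (not study data).
    sample = {
        "Palmitic acid": 1000.0,
        "Heptadecanoic acid": 50.0,
        "Ornithine": 200.0,
        "Tridecanoic acid": 30.0,
        "Stearic acid": 800.0,
    }
    for label, model in (("early", EARLY_STAGE), ("advanced", ADVANCED_STAGE)):
        print(label, round(model.probability(sample), 3), model.is_positive(sample))
```

Keeping the intercept, coefficients, and cut-off together in one object makes it harder to pair a model with the wrong threshold, which matters here because each reported model has its own cut-off.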
The optimal threshold (cut-off point) for the above logistic regression model is 0.572.
The optimal threshold (cut-off point) for the above model is 0.541.
The optimal threshold (cut-off point) for the above model is 0.327.
The optimal threshold (cut-off point) for the above model is 0.238.
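The criterion used to select the optimal cut-off points is not stated here. One common choice, assumed in the sketch below and not confirmed as the method used in this study, is Youden's J statistic: the cut-off that maximizes sensitivity + specificity − 1 over the candidate thresholds. The function name `youden_threshold` and the example labels and scores are hypothetical.

```python
def youden_threshold(labels, scores):
    """Return (cut_off, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 for patients, 0 for healthy controls.
    scores: predicted probabilities P from a logistic model.
    A sample is called positive when its score >= cut_off (assumption).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_cut, best_j = None, float("-inf")
    # Each distinct observed score is a candidate cut-off.
    for cut in sorted(set(scores)):
        true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cut)
        true_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cut)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only (not study data).
    labels = [0, 0, 0, 1, 0, 1, 1]
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8]
    cut_off, j = youden_threshold(labels, scores)
    print(cut_off, j)
```

Other criteria, such as the point on the ROC curve closest to the top-left corner, can give a different cut-off on the same data, so the selection rule should be reported alongside the threshold.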