Introduction

Metabolomics offers a comprehensive view of small molecule concentrations within a biological system and plays a pivotal role in the discovery of disease biomarkers for diagnostic purposes1. Liquid chromatography mass spectrometry (LC-MS) is a widely practiced experimental technique in global metabolomic studies2,3. High sensitivity, stability, reproducibility, and detection throughput are key advantages of untargeted LC-MS4. Despite its capacity to measure thousands of ion peaks, conventional metabolomic study by LC-MS remains challenging due to laborious data processing such as peak picking and alignment, metabolite annotation by comparison with authenticated databases, and data normalisation to control unwanted variability in large-scale studies5,6. The broader application of metabolomics in precision medicine may be impeded by obstacles such as complex data processing, high inter-batch variability, and burdensome metabolite identification7.

Untargeted metabolomics has been conducted on various human biological fluids, including serum and plasma, for the discovery of biomarkers in cancers such as hepatocellular carcinoma8, pancreatic9, prostate10, and lung cancers11,12,13. However, such biomarker discovery studies utilising metabolomics face significant challenges regarding reproducibility, likely due to signal drifts in cross-batch or cross-platform analysis14 and the limited integration of data from different laboratory samples15. Furthermore, unknown metabolites are excluded when comparing detected features to authenticated databases16, which may hinder our ability to discover new biomarkers associated with diseases. Several previous studies have combined machine learning with LC-MS for in vitro disease diagnosis and improved the efficiency of LC-MS data analysis. For example, Huang et al. applied machine learning to extract serum metabolic patterns from laser desorption/ionisation mass spectrometry to detect early-stage lung adenocarcinoma11; Chen et al. applied machine learning models to targeted metabolomic data to identify non-invasive biomarkers for gastric cancer diagnosis17; Shen et al. developed a deep-learning-based Pseudo-Mass Spectrometry Imaging method and applied it to the prediction of gestational age in pregnant women, as well as the diagnosis of endometrial cancer and colon cancer18. However, these studies still face challenges such as batch effects and unknown metabolites in metabolomics7. Consequently, a new analytical approach is urgently needed to overcome the experimental bottlenecks and reveal disease-associated profiles comprising both identified and unknown components derived from LC-MS peaks.

Deep learning has been widely applied in various omics data analyses, holding promise for addressing the complexities of metabolomic data19. The encoding and modelling capabilities of deep learning offer a potential solution to overcome the aforementioned bottlenecks in handling intricate and high-dimensional data, mitigating bias in machine learning algorithms20,21. However, deep learning requires high-quality data and a sufficient number of samples; otherwise, it suffers from issues such as the curse of dimensionality and overfitting of predictive models22. Moreover, integrating large datasets collected from multiple hospitals may introduce significant variation. Furthermore, as deep learning methods are usually perceived as “black-box” processes, the importance of model interpretability for prediction in the context of biomedical research is increasingly recognised23,24,25. Therefore, a deep learning model with both interpretability for biological soundness and the capability to mitigate batch effects is highly desirable to enhance the reliability of large-scale metabolomic analyses for diagnostic purposes.

In this study, we develop an ensemble end-to-end deep learning method named deep learning-based mass spectrum profiler (DeepMSProfiler) for untargeted metabolomic data analysis. We first apply this method to differentiate healthy individuals and patients with benign lung nodules or lung adenocarcinoma using 859 serum samples from three distinct hospitals, followed by an extended analysis of lipid metabolomic data derived from 928 cell lines to reveal metabolites and proteins associated with multiple cancer types. Bypassing peak extraction and identification, as well as the potential errors introduced by conventional machine learning approaches, our method directly converts raw LC-MS data into outputs such as predicted classification, heatmaps illustrating key metabolite signals specific to each class, and metabolic networks that influence the predicted classes. Importantly, DeepMSProfiler effectively removes undesirable batch effects and variations across different hospitals and infers the unannotated metabolites associated with specific classifications. Furthermore, it leverages an ensemble-model strategy that optimises feature attribution from multiple individual models. DeepMSProfiler achieved an area under the receiver operating characteristic curve (AUC) score of 0.99 in an independent testing dataset, along with an accuracy of 96.1% in detecting early-stage lung adenocarcinoma. The results are explainable by locating the relevant biological components that contribute to each prediction. Our method provides a straightforward and reliable approach for metabolomic applications in disease diagnosis and mechanism discovery.

Results

The overview of the ensemble end-to-end deep-learning model

The DeepMSProfiler method includes three main components: the serum-based mass spectrometry, the ensemble end-to-end model, and the disease-related LC-MS profiles (Fig. 1a). In the first component, the raw LC-MS-based metabolomic data was generated using 859 human serum samples (Fig. 1a left) collected from 210 healthy individuals, 323 patients with benign lung nodules, and 326 patients with lung adenocarcinoma. The space of the LC-MS raw data contains three dimensions: retention time (RT), mass-to-charge ratio (m/z), and intensity. By projecting along the m/z and RT dimensions, the data can be mapped from three-dimensional space into the time and mass-to-charge domains, respectively (Fig. 1b left to middle). Ion current chromatograms and primary mass spectra can then be generated and used for metabolite identification (Fig. 1b middle to right). Conventional step-by-step methods of metabolomic analysis5,22 (Supplementary Fig. 1 top) may lose a large number of metabolic signals. To address these issues, DeepMSProfiler directly takes untargeted LC-MS raw data as model input and builds an end-to-end deep learning model to profile disease-related metabolic signals (Supplementary Fig. 1 bottom).

Fig. 1: The DeepMSProfiler method using LC-MS-based untargeted serum metabolome.
figure 1

a The overview of DeepMSProfiler. Serum samples of different populations (top left) were collected and sent to the instrument (bottom left) for liquid chromatography-mass spectrometry (LC-MS) analysis. The raw LC-MS data, containing information on retention time (RT), mass-to-charge ratio (m/z), and intensity, is used as input to the ensemble model (middle). Multiple single convolutional neural networks form the ensemble model (centre) to predict the true label of the input data and generate three outputs (right), including the predicted sample classes, the contribution heatmaps of classification-specific metabolic signals, and the classification-specific metabolic networks. b The data structure of raw data. The mass spectra of different colours (centre) represent the corresponding m/z and ion intensity of ion signal groups recorded at different RT frames. All sample points are distributed in a three-dimensional space (left) which can be mapped along three axes to obtain chromatograms, mass spectra, and two-dimensional matrix data. Chromatograms and mass spectra are used for conventional qualitative and quantitative analysis (right), while the two-dimensional matrix serves as input data for convolutional neural networks. c The structure of a single end-to-end model. The input data undergoes pre-pooling to reduce dimensionality and becomes three-channel data. As the model passes through each convolutional layer (conv) in the feature extractor module, the weights associated with the original signals change continuously. The sizes of different frames in the enlarged layers (top) represent different receptive fields, with DenseNet allowing the model to generate more flexible receptive field sizes. After the last fully connected layer (FC), the classification results are produced.

The main model adopts an ensemble strategy and consists of multiple sub-models (Fig. 1a middle). The ensemble strategy is considered to provide better generalisation26, owing to the complexity of the hypothesis space27, the local optima reached during search27, the parameter diversity introduced by the random training process28,29, and bias-variance theory30,31,32. The structure of each sub-model consists of three parts: a pre-pooling module, a feature extraction module, and a classification module (Fig. 1c). The pre-pooling module transforms three-dimensional data into two-dimensional space through a max-pool layer, effectively reducing dimensionality and redundancy while preserving global signals (Fig. 1c left). The feature extraction module is based on a convolutional neural network and extracts category-related features for the classification task (Fig. 1c middle). DenseNet121 is chosen as the backbone of the feature extraction module because it achieved the highest accuracy with the fewest parameters (Supplementary Data 1). Additionally, the densely connected convolutional design of DenseNet12133 makes the model more flexible in terms of receptive field size, allowing adaptation to different RT intervals of metabolic signal peaks. After the non-linear transformations of the feature extraction layers, different weights can be assigned to the original input signal peaks for subsequent classification. We evaluated the effect of the number of sub-models on performance and consequently chose to use 18 sub-models for DeepMSProfiler (Supplementary Fig. 2). The classification module implements a simple dense neural network to compute the probabilities of the different classes (Fig. 1c right).
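As a concrete illustration of this architecture, the sketch below builds one sub-model in Keras: a max-pooling pre-pooling step, a DenseNet121 feature extractor, and a single dense classification layer. The input shape, pooling size, and the 1 × 1 convolution used to produce the three channels expected by DenseNet121 are illustrative assumptions rather than the exact DeepMSProfiler configuration.

```python
# Minimal sketch of one DeepMSProfiler-style sub-model (assumptions noted in comments).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sub_model(input_shape=(1024, 1024, 1), n_classes=3):
    """Pre-pooling -> DenseNet121 feature extractor -> dense classifier."""
    inputs = layers.Input(shape=input_shape)
    # Pre-pooling module: reduce resolution and redundancy (pool size is an assumption).
    x = layers.MaxPooling2D(pool_size=2)(inputs)
    # Map the single intensity channel to the three channels DenseNet121 expects
    # (one possible stand-in for the three-channel conversion described above).
    x = layers.Conv2D(3, kernel_size=1, padding="same")(x)
    # Feature extraction module: DenseNet121 backbone with ImageNet pre-trained weights.
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=(512, 512, 3), pooling="avg")
    x = backbone(x)
    # Classification module: one dense layer producing class probabilities.
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```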

In the third component, each input sample results in three outputs from the model, including the predicted classification of the sample, the heatmap locating the key metabolic signals, and the metabolic network that influences the predicted category (Fig. 1a right). In the predicted classification, the category with the highest probability is assigned as the predicted label. In the heatmap, the key metabolic signals associated with each classification are inferred by a perturbation-based method, and their m/z and RT can be located. Finally, the model infers the underlying metabolite-protein networks and metabolic pathways associated with these key metabolic signals directly from m/z (see Methods).

Model performance metrics

To test DeepMSProfiler’s ability to classify disease states and discover disease-related profiles, we performed a global metabolomic analysis of serum samples from healthy individuals and patients with benign lung nodules or lung cancer. We collected serum samples from three different hospitals to construct and validate the model. Benign nodule cases were followed for up to 4 years, and lung adenocarcinoma samples were pathologically examined (see Methods). We built the model using 859 untargeted LC-MS samples, of which 686 served as the discovery dataset and 173 as the independent testing dataset (Fig. 2a). The samples were generated from 10 batches, as shown in Supplementary Table 1. Statistical analysis shows a significant correlation between lesion size and disease type, but the correlations with other clinical factors are not significant (Supplementary Table 2). To avoid confounding effects of clinical features, we performed further distribution analysis, which shows no significant difference in the distribution of lesion diameter and patient age in either the discovery dataset or the independent testing dataset (Supplementary Fig. 3). The samples in the discovery dataset were randomly divided into two subsets: a training dataset (80%) for parameter optimisation and a validation dataset (20%) to cross-validate the performance of different models. Remarkably, the accuracies of DeepMSProfiler in the training dataset, validation dataset, and independent testing dataset are 1.0, 0.92, and 0.91, respectively (Supplementary Table 3).

Fig. 2: The prediction performance of DeepMSProfiler.
figure 2

a The sample allocation chart. The outer ring indicates the types of diseases and the inner ring indicates the sex distribution. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. b Predicted receiver operating characteristic (ROC) curves of different methods. Random: performance baseline in a random state. Comparison of performance metrics of different methods (n = 50): accuracy (c), precision (d), recall (e), and F1 score (f). The blue areas show the different conventional analysis processes using machine learning methods, and the red areas display different end-to-end analysis processes using deep learning methods. The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers plotted as individual points. g Model accuracy rates for different age groups. The sample sizes for different groups are 52, 69, 40, and 12, respectively. h Model accuracy rates for different lesion diameter groups. The sample sizes for different groups are 27, 37, 18, 13, and 34, respectively. The boxplot shows the minimum, first quartile, median, third quartile and maximum values. i Prediction accuracy and parameter scale of different model architectures. j The confusion matrix of the DeepMSProfiler model. The numbers inside the boxes are the number of matched samples between the true label and the predicted label. The ratio in parentheses is the number of matched samples divided by the number of all samples of the true label.

In the independent testing dataset, DeepMSProfiler significantly outperforms traditional methods and single deep learning models (Fig. 2b–h). Compared with Support Vector Machine (SVM), Random Forest (RF), Deep Learning Neural Network (DNN), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost) based on traditional methods, and Densely Connected Convolutional Networks (DenseNet121) using raw data, DeepMSProfiler presents the highest area under the curve (AUC) of 0.99 (Fig. 2b). Notably, DeepMSProfiler exhibits higher specificity than other models while maintaining high sensitivity, indicating its ability to accurately identify true negatives (Supplementary Table 4).

Compared with the other models, our model achieves the best performance across multiple evaluation metrics: accuracy of 95% (95% CI, 94%–97%) (Fig. 2c), precision of 96% (95% CI, 94%–97%) (Fig. 2d), recall of 95% (95% CI, 94%–96%) (Fig. 2e), and F1 of 98% (95% CI, 97%–98%) (Fig. 2f). Compared to XGBoost, our model performs better across different groups of lesion sizes and ages (Fig. 2g, h), except for samples from patients over 70 years old. DeepMSProfiler is also superior to commonly used single deep learning models, such as DenseNet121 (Fig. 2i). When using the ensemble strategy, we did not set different weights for each sub-model, so each sub-model is equally involved in the final prediction. The confusion matrices of prediction performance for each sub-model (Supplementary Fig. 4) show their contributions to the overall results. The robust performance, coupled with its efficiency in terms of computational resources, makes DeepMSProfiler a promising choice for these classification tasks (Fig. 2i).

Furthermore, DeepMSProfiler demonstrates consistent performance across different categories. The AUCs for lung adenocarcinoma, benign lung nodules, and healthy individuals all reach 0.99 (Supplementary Fig. 5), and their respective classification accuracies are 85.7%, 90.8%, and 97.0% (Fig. 2j). Most importantly, our model performs well in detecting stage-I lung adenocarcinoma with an accuracy of 96.1%, indicating its potential as an effective method for early lung cancer screening.

Insensitivity to batch effects

Batch effects are one of the most common error sources in the analysis of metabolomic data. To evaluate the impact of batch effects on the untargeted LC-MS data, we first generated 3 biological replicates as reference samples for each of the 10 batches. These reference replicates were taken from a mixture of 100 healthy human serum samples, and each of them contains equal amounts of isotope-labelled standards, including 13C-lactate, 13C3-pyruvate, 13C-methionine, and 13C6-isoleucine (see Methods). The differences in the data structure of reference replicates from different batches are visualised in 3D and 2D illustrations (Fig. 3a and Supplementary Fig. 6), which indicate changes in peak shape and area as well as RT shifts among batches. Comparison of individual isotope-labelled peaks in samples from different batches also shows that the batch effects mainly manifest as RT shifts and as differences in peak shape and area for the same metabolites (Fig. 3b).

Fig. 3: Explainable deep learning method avoids limitations of batch effects in conventional methods.
figure 3

a Batch effects in 3D point array and 2D mapped heatmap of reference samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks of the same concentration in different samples. Different colours represent the batches to which the samples belong. c The visualisation of dimensionality reduction of normalisation by the Reference Material method. Below: different colours represent different classes; Above: different colours indicate different batches. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. d The visualisation of dimensionality reduction for the output data of the hidden layers in DeepMSProfiler. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block4 and Block5 are the outputs of the fourth and fifth conv layers in the fifth feature extraction module. Upper: different colours indicate different sample batches; Lower: different colours represent different population classes. e Correlation of the output data of the hidden layer with the batch and class information in DeepMSProfiler. The horizontal axis represents the layer names. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block10 and Block16 are the outputs of the tenth and sixteenth conv layers in the fifth feature extraction module. The blue line represents the batch-related correlations, and the orange line illustrates the classification-related correlations. f The accuracy rates of traditional methods (blue), corrected methods based on reference samples (purple), and DeepMSProfiler (red) in the independent testing dataset (n = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers plotted as individual points.

We then investigated batch effect correction and compared the performance of DeepMSProfiler with conventional correction methods (see Methods). As shown in Fig. 3c, after correction by the Reference Material (Ref-M) method34, we still observed 3 clusters in the principal component analysis (PCA) profiles, which represent the sample dots from three different hospitals. While Ref-M effectively addresses the batch effect within samples from the same hospital, residual variation across hospitals remains (Fig. 3c). Samples of batches 1–6 and 9–10 were obtained from the Sun Yat-Sen University Cancer Centre, and samples of batch 8 came from the First Affiliated Hospital of Sun Yat-Sen University. Samples of batch 7 were a mixture from three different hospitals: lung cancer samples came from the Affiliated Cancer Hospital of Zhengzhou University, lung nodule samples came from the First Affiliated Hospital of Sun Yat-Sen University, and healthy samples came from the Sun Yat-Sen University Cancer Centre. The 100 healthy human serum samples used as reference during conventional procedures were all from the Sun Yat-Sen University Cancer Centre, which might be the main reason why it is difficult to correct the batch effect for batches 7–9. To illustrate DeepMSProfiler’s end-to-end process of automatic batch effect removal, we extracted the outputs of the hidden layers and visualised the flow of data during the forward propagation of the network (see Methods). From the input layer to the output layer, the similarity between different batches becomes progressively higher, while the similarity between different classes becomes progressively lower (Fig. 3d and Supplementary Fig. 7). DenseNet121 is a deep neural network with 431 layers, of which 120 are convolutional layers (Supplementary Data 1). Conv4 and Conv5 refer to the outputs of the fourth and fifth densely connected modules, with 112 layers between them. Figure 3d and Supplementary Fig. 7 illustrate the intermediate changes from the output of the fourth densely connected module to that of the fifth. When the batch effects are removed, the classification becomes clearer. We further quantified this process using different metrics that measure the correlation between the PCA clusters and the given labels, namely the K-nearest neighbour batch effect test score35, the local inverse Simpson’s index36, the adjusted Rand index (ARI), normalised mutual information (NMI), and the average silhouette coefficient (ASC) (see Methods). The closer to the output layer, the less relevant the data is to batch labels and the more relevant to class labels (Fig. 3e). This explains how batch effect removal is achieved progressively through the hidden layers (Fig. 3d). This capability might be gained through supervised learning. Our findings suggest that in the forward propagation process, the DeepMSProfiler model excludes batch-related information from the input data layer by layer, while retaining class-related information.
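The layer-wise analysis described above can be approximated with a few lines of Keras and scikit-learn, as sketched below: the output of a named hidden layer is extracted, embedded with PCA, clustered, and scored against batch and class labels with ARI, NMI, and the silhouette coefficient. The kBET and LISI scores used in the study come from dedicated packages and are omitted here; the variable names (model, X, batch_labels, class_labels, layer_names) are illustrative.

```python
# Sketch: probe how batch- vs. class-related structure changes through the hidden layers.
import numpy as np
from tensorflow.keras.models import Model
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

def layer_label_correlation(model, X, batch_labels, class_labels, layer_names):
    scores = {}
    for name in layer_names:
        probe = Model(model.input, model.get_layer(name).output)    # truncated network
        feats = probe.predict(X, batch_size=8).reshape(len(X), -1)  # flatten feature maps
        emb = PCA(n_components=2).fit_transform(feats)
        clusters = KMeans(n_clusters=3, n_init=10).fit_predict(emb)
        scores[name] = {
            "batch_ARI": adjusted_rand_score(batch_labels, clusters),
            "class_ARI": adjusted_rand_score(class_labels, clusters),
            "batch_NMI": normalized_mutual_info_score(batch_labels, clusters),
            "class_NMI": normalized_mutual_info_score(class_labels, clusters),
            "batch_ASC": silhouette_score(emb, batch_labels),
            "class_ASC": silhouette_score(emb, class_labels),
        }
    return scores
```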

Further, we compared the performance of our deep learning method against machine learning methods with and without batch effect correction. Batch effect correction improved the accuracies of machine learning classifiers such as SVM, RF, AdaBoost, and XGBoost. However, DeepMSProfiler, without any additional correction, surpassed the machine learning methods with or without batch effect correction in terms of prediction accuracy (Fig. 3f).

The impact of unknown mass spectrometric signals

To investigate the impact of unknown metabolite signals on classification prediction, we first performed the conventional analysis with peak extraction and metabolite annotation using existing databases such as the Human Metabolome Database (HMDB) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (see Methods). We found that 83.5% of all detected features remain unknown metabolites (Fig. 4a). Excluding these unknown metabolites undermines the prediction accuracy (Fig. 4b), indicating that this large number of unknown metabolites has a significant impact on classification performance. One of the advantages of our approach over traditional methods is the ability to retain the complete metabolomic features, including the unknown metabolites.

Fig. 4: The unknown space in databases and previous studies.
figure 4

a Statistics of annotated metabolite peaks. Blue colour represents all peaks, while orange, purple, and yellow colours indicate metabolites annotated in HMDB, KEGG, and all databases, respectively. The overlap between orange and purple includes 414 metabolites annotated in both HMDB and KEGG. HMDB: Human Metabolome Database; KEGG: Kyoto Encyclopedia of Genes and Genomes. b The feature selection plot illustrates the effect of different contribution score thresholds with and without removal of unknown metabolites. The horizontal axis represents the change in threshold, while the vertical axis shows the accuracy of the model using the remaining features. The shadings of the solid lines (mean) represent error bars (standard deviation). c Collection standard of published lung cancer serum metabolic biomarkers. SCLC: Small Cell Lung Cancer; LUSC: Lung Squamous Cell Carcinoma; LUAD: Lung Adenocarcinoma; NMR: Nuclear Magnetic Resonance; MS: Mass Spectrometry. d The number counts of known biomarkers published in the current literature. e Molecular weight distribution plot of known biomarkers. f Accuracy comparison between the ablation experiment and DeepMSProfiler (n = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values. In the ablation experiment, we investigated the effect of varying the publication count (PC) of known biomarkers in the literature. Specifically, we removed from the original data the metabolic signals that were not reported as known biomarkers, matched by m/z. We retained only the metabolic signals with publication counts greater than 1, greater than 3, and greater than 8 for model development. All ablated data were analysed using the same DeepMSProfiler architecture as the original unprocessed data. The vertical axis shows the accuracy of models built on the datasets of different publication counts and our DeepMSProfiler.

We then tested the limitations of lung cancer prediction using only biomarkers identified from annotated metabolites. We collected 826 biomarkers for lung cancer based on serum mass-spectrometry analysis from 49 publications (Fig. 4c). We deduced the molecular weights of the biomarkers from the HMDB database and the information in these publications (see Methods). Only 42.7% of the biomarkers discovered by traditional methods appear in more than two articles, and their reproducibility is suboptimal (Fig. 4d). In addition, the molecular weights of these metabolites are mainly distributed in the range of 200–400 Da (Fig. 4e). We found that DeepMSProfiler using the complete raw data is far more accurate than the same model using only the m/z signals corresponding to the reported biomarkers (Fig. 4f). This indicates that there are still unknown metabolomic signals related to lung cancer in the serum samples that have not been unveiled by current research. In contrast, DeepMSProfiler derives the classifications directly from the complete signals in the raw data.

Explainability to uncover the black box

After constructing the deep learning model, we sought to explain the classification basis of the black-box model and identify the key signals for specific classifications. We adopted a perturbation-based method, Randomised Input Sampling for Explanation (RISE)37, to compute feature contributions. We slightly modified RISE to improve its speed and efficiency, and developed a method to evaluate the importance of RISE scores for the different classifications (see Methods).
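For orientation, the sketch below shows the core of a RISE-style perturbation analysis: random low-resolution binary masks are upsampled to the input size, applied to a sample, and the model's class probabilities are used to weight each mask, yielding a per-class contribution map over the RT-m/z plane. This is a simplified re-implementation for illustration, not the authors' modified RISE code; it assumes the input height and width are divisible by the mask grid size.

```python
# Simplified RISE-style contribution scoring for one sample `x` of shape (H, W, C).
import numpy as np

def rise_saliency(model, x, n_masks=1000, grid=8, p_keep=0.5, batch_size=32):
    H, W = x.shape[:2]
    rng = np.random.default_rng(0)
    # Low-resolution random binary grids, upsampled to the input size by pixel repetition.
    small = (rng.random((n_masks, grid, grid)) < p_keep).astype(np.float32)
    masks = np.kron(small, np.ones((H // grid, W // grid), dtype=np.float32))[..., None]
    saliency = None
    for start in range(0, n_masks, batch_size):
        m = masks[start:start + batch_size]
        preds = model.predict(x[None] * m, verbose=0)           # (B, n_classes)
        contrib = np.tensordot(preds.T, m[..., 0], axes=1)      # (n_classes, H, W)
        saliency = contrib if saliency is None else saliency + contrib
    return saliency / (n_masks * p_keep)                        # per-class contribution maps
```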

Interestingly, we found a “background category” phenomenon in some of the single models (Fig. 5a). For each class in a single tri-classification model of DenseNet121, the classification performance gradually deteriorates as the metabolic signals with higher contribution scores are removed. However, there is one category that is always unaffected by all the features involved in the classification decision, while maintaining a very high number of true positives and false positives. These findings imply that the tri-classification model only predicts the probabilities of two of the categories, and calculates the probability of the third category from the results of the other two. In other words, the classification-related metabolites associated with the “foreground category” negatively contribute to the “background category”. Furthermore, the categories used as “background category” are not consistent across the different models. We were intrigued to test whether this phenomenon occurs exclusively in metabolomic data, so we conducted a seven-classification task using the Photo-Art-Cartoon-Sketch (PACS) image dataset38. We observed a similar phenomenon in the resultant feature scoring of different models (Supplementary Fig. 8). This suggests that “background category” may generally exist in multi-classification task by single models, although its underlying mechanism is currently unclear and may require future investigation.

Fig. 5: Feature constructions of prediction and significant metabolic signals relevant to biological pathways.
figure 5

a Prediction performance and feature scoring by different single models. b Prediction performance and feature scoring by DeepMSProfiler. c Heatmap matrices of classification contribution in healthy individuals (Healthy), benign lung nodules (Benign), and lung adenocarcinoma (Malignant). The horizontal and vertical axes of the matrix are the prediction label and the true label, respectively. The heatmaps at the upper left, the middle, and the bottom right represent the true healthy individuals, the true benign nodules, and the true lung adenocarcinoma, respectively. The horizontal and vertical axes of each heatmap are RT and m/z, respectively. The classification contributions of metabolites corresponding to true healthy individuals (d), benign nodules (e), and lung adenocarcinoma (f). The horizontal axis represents the retention time and the vertical axis represents the m/z of the corresponding metabolites. The colours represent the contribution score of the metabolites. The redder the colour, the greater the contribution to the classification. Metabolite-protein networks for healthy individuals (g), benign nodules (h), and lung adenocarcinoma (i). Pathway enrichment analysis using the signalling networks for healthy individuals (j), benign nodules (k), and lung adenocarcinoma (l) (FDR < 0.05). m/z: mass-to-charge ratio; RT: retention time.

In contrast, the phenomenon of the “background category” no longer exists when the feature contributions are calculated by our ensemble model. As shown in Fig. 5b, when we progressively eliminated metabolic signals in each category according to their contribution, the performance of the ensemble model decreased accordingly. Each individual model captures different features that contribute to its corresponding classifications, while the ensemble model combines these features to improve the accuracy of disease prediction and reduce the possibility of overfitting.

Metabolomic profiles in lung adenocarcinoma, benign nodules, and healthy individuals

To analyse the global metabolic differences between lung adenocarcinoma, benign nodules, and healthy individuals, we extracted the heatmaps of feature contributions computed by RISE from DeepMSProfiler (Fig. 5c). As shown in Fig. 5d–f, the horizontal and vertical axes of the heatmaps represent retention time and m/z, respectively. By mapping the label information to the heatmaps, we are able to locate the metabolites corresponding to different m/z and retention times and obtain their feature contribution scores. In true-positive healthy and benign nodule samples, the metabolic signals with the most significant contribution are uniformly located between 200 and 400 m/z and within 1–3 min (Fig. 5d, e). In comparison, the metabolic signals located between 200 and 600 m/z and within 1–4 min contribute most in lung adenocarcinoma samples, but signals in other regions also have relatively high scores (Fig. 5f).

As higher contribution scores in the heatmaps represent more important correlations, we screened signals with scores above 0.70 and attempted to identify the corresponding metabolic profiles in each classification (Supplementary Data 2). As observed in Fig. 5b, retaining only the metabolic signals with a contribution score above 0.7 keeps the overall accuracy around 0.8, which maintains effective classification. Considering the RT shifts among different batches, we matched metabolic peaks only by m/z. We then fed these m/z signals, together with metabolites identified by tandem mass spectrometry (MS2), into the analysis tool PIUMet, which is based on protein–protein and protein-metabolite interactions39, to build disease-associated feature networks (Fig. 5g–i). In the network shown in Fig. 5i, 82 proteins and 121 metabolites are matched in the lung adenocarcinoma samples, including 9 already identified by MS2 and 111 hidden metabolites found through the correlations between key metabolic peaks. As such, the current analysis based on protein-protein and protein-metabolite interactions allows the discovery of unknown metabolic signals associated with diseased states, although the resolution of the current model might be relatively low in distinguishing all individual peaks contributing to the disease classification. To explore the biological explainability, we also selected, among the features extracted by PIUMet, 11 metabolites (Supplementary Table 5) with available authentic standards in our laboratory to verify their presence in the lung cancer serum samples. Indeed, these metabolites could be identified in the lung cancer serum as described in our previous study40. We further analysed the metabolic networks to explore the biological relevance associated with each classification. The heatmaps (Fig. 5d, e) and pathway analysis (Fig. 5j, k) consistently show that healthy individuals and benign nodules share similar metabolic profiles. In contrast, the cancer group presents a distinct profile with specific pathways and increased counts of metabolites or proteins in the pathways shared with healthy individuals or benign nodules (Fig. 5f, l). The detailed metabolite and protein candidates are shown in Supplementary Data 3. Taken together, our network and pathway analyses demonstrate the interpretability of DeepMSProfiler based on deep learning.
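Given a per-class contribution heatmap, locating the key signals reduces to thresholding and converting matrix indices back to RT and m/z. The sketch below assumes the 1024 × 1024 sampling grid described in Methods (RT from 0.5 min in 0.016-min steps, m/z from 50 in steps of 1) and a score cutoff of 0.70; the row/column orientation is an assumption and may need to be swapped for a given heatmap.

```python
# Sketch: extract key metabolic signals (RT, m/z, score) from a contribution heatmap.
import numpy as np
import pandas as pd

def key_signals(heatmap, score_cutoff=0.70, rt0=0.5, rt_step=0.016, mz0=50, mz_step=1):
    rows, cols = np.where(heatmap > score_cutoff)   # bins above the cutoff
    return pd.DataFrame({
        "RT_min": rt0 + rows * rt_step,             # rows assumed to index RT bins
        "mz": mz0 + cols * mz_step,                 # columns assumed to index m/z bins
        "score": heatmap[rows, cols],
    }).sort_values("score", ascending=False)
```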

Application of the model to colon cancer

To examine the transferability of DeepMSProfiler, we obtained a public colon cancer LC-MS dataset containing 236 samples from MetaboLights (ID: MTBLS1129). The dataset comprises 197 colon cancer samples and 39 healthy human samples. We randomly divided this dataset into a discovery dataset and an independent testing dataset at a 4:1 ratio. The discovery dataset contained 157 colon cancer samples and 31 healthy control samples. The independent testing dataset contained 40 colon cancer samples and 8 healthy control samples.

Due to the differences in cancer types and mass spectrometry procedures between the colon cancer dataset and the lung adenocarcinoma dataset, we re-trained the DeepMSProfiler model. The colon cancer data was randomly divided into a discovery dataset and an independent testing dataset, and the discovery dataset was further randomly divided into training and validation datasets multiple times. On the independent testing dataset of the colon cancer data, our model achieved an accuracy of 97.9% (95% CI, 97.7%–98.1%), a precision of 98.7% (95% CI, 98.6%–98.8%), a recall of 93.4% (95% CI, 92.9%–94.1%), and an F1 of 95.8% (95% CI, 95.4%–96.2%) (Supplementary Fig. 9). These results suggest excellent transferability of DeepMSProfiler.

Discovery of metabolite-protein networks in pan-cancer

In a continued effort to investigate the capabilities of DeepMSProfiler in analysing metabolomic data across multiple cancer types, raw lipid metabolomic data of 928 cell lines spanning 23 cancer types were collected from the Cancer Cell Line Encyclopaedia (CCLE) database2 and processed by DeepMSProfiler. Notably, in addition to the raw metabolomic data, these cell lines also have valuable data on annotated metabolites, methylation, copy number variations, and mutations2. DeepMSProfiler constructed a model encompassing the 23 distinct categories, followed by feature extraction from the 23-category model to identify the crucial metabolic signals of each category. Due to the limited number of samples for many cancer types, particularly biliary tract, pleura, prostate, and salivary gland cancers, each with fewer than 10 samples, we did not set aside a separate independent testing dataset for performance validation. Twenty sub-models were trained, and for each sub-model, 80% of all samples were randomly allocated for training to ensure that every sample could contribute to the training process, especially for cancer types with very few samples. The final ensemble model used for the explainable analysis achieved 99.3% accuracy, 97.2% sensitivity, and 100% specificity. Next, the prize-collecting Steiner forest optimisation algorithm39 was employed to unveil the correlations between pivotal metabolic signals and proteins using the HMDB41, Recon242 and iRefIndex43 databases (see Methods).

As a result, we successfully generated disease-specific metabolite-protein networks (Fig. 6a–c) along with a contribution score heatmap (Fig. 6e), where contribution scores exceeding 0.70 were considered indicative of disease-specific metabolites. Metabolites identified within the metabolite-protein networks were directly inferred from the mass-to-charge ratio (m/z) of metabolic signals in the raw data using feature spectra extracted by the DeepMSProfiler model. Notably, we identified 14 metabolites and 3 proteins that co-occur within the 23 cancer-related metabolite-protein networks (Fig. 6d). Finally, we correlated the metabolic data with the methylation information and subsequently verified the associations between the PLA and UGT gene families and the disease-specific metabolites of high contribution (Fig. 6f). Previous studies44,45,46,47,48 have reported the important roles of the PLA and UGT gene families in a variety of diseases, such as PLA2G7 and PLA2G6 in breast and prostate cancers and neurodegenerative diseases, as well as UGT3A2 in head and neck cancers. This evidence supports our findings by DeepMSProfiler. In summary, our extended analysis spanning pan-cancer scenarios highlights the capability of DeepMSProfiler to discover potential disease-associated metabolites and proteins.

Fig. 6: Metabolite and protein associations across 23 cancer types.
figure 6

Metabolite-protein networks for (a) lung cancer, (b) gastric cancer, and (c) leukaemia. Yellow squares: metabolites. Red circles: proteins. Blue labels: metabolites and proteins shared in 23 cancer metabolite-protein networks. d Metabolites and proteins shared in the metabolite-protein networks of 23 cancer types. e Heatmap of the classification contribution of different lipid metabolites across 23 cancer types. f Correlation of important pan-cancer-related metabolites with methylation of the PLA and UGT gene families.

Discussion

Metabolomics faces challenges in precision medicine due to complex analytical processes, metabolic diversity, and database limitations5,6. DeepMSProfiler starts with raw untargeted metabolomic data and retains the essential information, enabling more effective global analysis. It offers an alternative approach by directly processing the raw data of metabolomic samples, bypassing time-consuming steps such as quality control or reference sample preparation and subsequent normalisation analysis.

In metabolomic studies, systematic variations in the measured metabolite profiles may occur during sample collection, processing, and analysis, or even across different batches of reagents or instruments. Batch effects can significantly impact the interpretation of results, leading to inconsistencies when replicating findings across different studies15. While batch effects can manifest as variations in retention time (RT) offset, peak area, and peak shape, conventional quantitative methods often prioritise peak area integration while overlooking peak shape49,50. Significantly, our results demonstrate that DeepMSProfiler is able to automatically eliminate cross-hospital variations during the end-to-end forward propagation process (Fig. 3d), effectively revealing classification profiles.

Moreover, DeepMSProfiler can address the challenge of unidentified metabolites. LC-MS metabolomics can reveal tens of thousands of metabolite peaks in a biological sample, a substantial number of which remain unidentified or unannotated in existing databases. In this study, we demonstrated that among all detected peaks, only 16.5% are identified by HMDB and KEGG. The presence of such a significant proportion of unknown metabolites has a considerable influence on the accuracy of classification (Fig. 4b). Indeed, annotating metabolomic peaks has remained a major focus in the field16. A common approach involves comparing the exact mass of detected peaks with authenticated standards, along with either the retention time or the fragmentation spectra obtained through tandem mass spectrometry (MS2). Despite significant development of molecular structural databases and MS2 spectral databases, their current capabilities and coverage remain limited51. In addition, network analysis, which examines complex peak relationships and clusters, has also been developed to facilitate the comprehensive identification of metabolites52. In this study, we employed a deep learning method to capture the original signals in LC-MS metabolomic analysis without compromising data integrity. We further implemented a direct transition from m/z to pathway annotations by taking advantage of the network-based analysis tool PIUMet39, effectively identifying 82 proteins and 121 metabolites in the cancer group, compared with 9 metabolites annotated by MS2.

Furthermore, our method covers the metabolites identified by conventional annotation and simultaneously uncovers previously undetected disease-specific features. In traditional metabolomic analysis, biomarkers specific to the disease of interest are usually sought by comparing metabolite levels between control and case samples; therefore, peak alignment and metabolite annotation are crucial to the end results. Here, by employing the end-to-end strategy, we unveiled the complete biological outputs that contribute to the distinct metabolomic profiles of each group. For example, tryptophan metabolism was identified among the characteristics of the lung adenocarcinoma profile (Fig. 5l). This result is consistent with our previous discovery by the conventional annotation method that metabolites in the tryptophan pathway were decreased in early-stage lung adenocarcinoma compared with benign nodules and healthy controls40. Serine and glycine are also important for nucleotide synthesis by mediating one-carbon metabolism, which is relevant to therapeutic strategies targeting non-small cell lung cancer53,54,55,56. Intriguingly, we also observed the contribution of bile secretion in the lung adenocarcinoma profile (Fig. 5l), which aligns with another report of aberrant bile acid metabolism in invasive lung adenocarcinoma57. However, it should be noted that the resolution of our model may be too limited to distinguish all individual peaks contributing to the disease classification.

We additionally demonstrated that, among deep learning models, ensemble models are more stable and class-balanced than single models. Although we have not fully comprehended the reason for the occurrence of the “background category”, the ensemble strategy effectively mitigated this phenomenon (Fig. 5a, b). Our investigation of the PACS image dataset suggests that the “background category” may generally exist in multi-classification tasks using single models. Understanding its underlying mechanism requires further investigation with a broader range of datasets.

The high-resolution heatmaps generated by DeepMSProfiler display the feature contributions to the predicted classes and the precise locations of specific metabolomic signals (Fig. 5c), providing explainable analysis to assure researchers of the biological soundness of the prediction. With its capability for batch effect removal, comprehensive metabolomic profiling, and ensemble strategy, DeepMSProfiler demonstrates consistent and robust performance across different categories. It achieves AUCs over 0.99 for the predictions of lung adenocarcinoma, benign nodules, and healthy samples, and an accuracy of 96.1% for early-stage (stage-I) lung adenocarcinoma. Moreover, its extended analysis in pan-cancer illustrates its ability to uncover potential disease-associated metabolites and proteins beyond lung cancer. In conclusion, our DeepMSProfiler offers a straightforward and reliable method suitable for applications in disease diagnosis and mechanism discovery, potentially advancing the use of metabolomics in precision medicine. Its effective end-to-end strategy applied to raw metabolomic data can benefit a broader population in non-invasive clinical practices for disease screening and diagnosis.

Methods

Clinical sample collection

This study was approved by the Ethics Committees of the Sun Yat-Sen University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen University, and the Affiliated Cancer Hospital of Zhengzhou University. A total of 210 healthy individuals, 323 patients with benign nodules, and 326 patients with lung adenocarcinoma were enrolled. Cases of lung adenocarcinoma were collected prior to lung resection surgery and had pathological confirmation. Serum from benign nodules was collected from individuals undergoing annual physical examinations. Participants with benign nodules were defined as those with computed tomography (CT) scans stable over 3–5 years of follow-up at the time of analysis. The sample collection period was from January 2018 to October 2022. The sex of the participants was determined by self-report. Informed consent was obtained from all participants. The study design and conduct complied with all relevant regulations regarding the use of human study participants and were conducted in accordance with the principles of the Declaration of Helsinki.

In addition, we collected serum samples from 100 healthy blood donors, including 50 males and 50 females, aged between 40 and 55 years, from the Department of Cancer Prevention and Medical Examination, Sun Yat-Sen University Cancer Centre. All these samples were mixed in equal amounts and the resultant mixture was aliquoted and stored. These mixtures were used as reference samples for quality control and data normalisation in the conventional metabolomic analysis as previously described34.

Serum metabolite extraction

Fasting blood samples were collected in serum separation tubes without the addition of anticoagulants, allowed to clot for 1 h at room temperature, and then centrifuged at 2851 × g for 10 min at 4 °C to collect the serum supernatant. The serum was aliquoted and then frozen at −80 °C until metabolite extraction.

Reference serum and study samples were thawed, and a combined extraction method (methyl tert-butyl ether/methanol/water) was used to extract metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The mixture was vortexed and incubated for 1 h on ice. Then 188 μL of MS-grade water containing internal standards (13C-lactate, 13C3-pyruvate, 13C-methionine and 13C6-isoleucine, all from Cambridge Isotope Laboratories) was added and vortexed. The mixture was centrifuged at 15,000 × g for 10 min at 4 °C, and the bottom phase was transferred to two tubes (each containing 125 μL) for LC-MS analysis in positive and negative modes. Finally, the samples were dried in a high-speed vacuum concentrator.

Untargeted liquid chromatography-mass spectrometry

The dried metabolites were resuspended in 120 μL of 80% acetonitrile, vortexed for 5 min, and centrifuged at 15,000 × g for 10 min at 4 °C. The supernatant was transferred to a glass amber vial with a micro insert for metabolomic analysis. Untargeted metabolomic analysis was performed on an ultra-performance liquid chromatography-high resolution mass spectrometry (UPLC-HRMS) platform. The metabolites were separated using a Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, the mobile phase comprised 95% acetonitrile (A) and 50% acetonitrile (B), both containing 10 mmol/L ammonium acetate and 0.1% formic acid. In negative mode, the mobile phase was composed of 95% and 50% acetonitrile for phases A and B, respectively, both containing 10 mmol/L ammonium acetate and adjusted to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B; 0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B; 16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was maintained at 40 °C, and the autosampler was set at 10 °C. The flow rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive orbitrap mass spectrometer (Thermo Fisher Scientific) with an electrospray ionisation (ESI) source was operated in full scan mode coupled with ddMS2 monitoring mode for mass data acquisition. The following mass spectrometer settings were used: spray voltage +3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb; auxiliary gas 10 arb; probe heater temperature 350 °C; scan range 70–1050 m/z; resolution 70,000. Xcalibur 4.1 (Thermo Fisher Scientific) was used for data acquisition.

In this study, all serum samples were analysed by LC-MS in 10 batches. To assess data quality, a mixed quality control (QC) sample was generated by pooling 10 μL of supernatant from each sample in the batch. Six QC samples were injected at the beginning of the analytical sequence to assess the stability of the UPLC-MS system, with additional QC samples injected periodically throughout the batch. Serum pooled from 100 healthy donors was used as reference material in each batch to monitor extraction and batch effects. All untargeted metabolomic analysis was performed at the Sun Yat-Sen University Metabolomics Centre.

Public dataset collection

The raw dataset of pan-cancer lipid metabolomics data from CCLE was downloaded from the Metabolomics Workbench database under accession ST001142 (https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001142). There are 946 samples in total, covering 23 cancer types. The quantitative lipid metabolite matrix and the DNA methylation matrix were downloaded from the appendix of the article2.

The LC-MS dataset of colon cancer was downloaded from the MetaboLights database (https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors), with 236 samples in total, including 197 colon cancer cases and 39 healthy controls. Due to the differences in disease samples, classification purposes, instruments, and LC-MS parameters between the public dataset and the private lung adenocarcinoma dataset, the DeepMSProfiler model needed to be re-trained on the public dataset.

Data format conversion

The raw format files of LC-MS data were converted to mzML format using the MSConvert software. The data used to train the end-to-end model were sampled directly from the mzML format without any further processing, so the raw data could be used directly as input to the model. In the mzML file, the ion intensity and mass-to-charge ratio of each ion point for each time interval are recorded. Ion points were mapped into a 3D space by their RT and m/z. A 2D matrix was sampled from this 3D point array using a max-pooling kernel, with RT 0.5 min and m/z 50 as the sampling starting points and RT 0.016 min and m/z 1 as the sampling intervals, which together define the sampling ranges of retention time and mass-to-charge ratio. Using the sampling interval as a sliding window, the maximum ion intensity within each interval was taken to obtain a 1024 × 1024 two-dimensional matrix of ion intensities.
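A minimal version of this conversion, assuming the sampling constants above and that scan start times are reported in minutes, could look as follows (pyteomics and NumPy; the function name is illustrative).

```python
# Sketch: max-pool raw MS1 ion intensities from an mzML file into a 1024 x 1024 (RT x m/z) matrix.
import numpy as np
from pyteomics import mzml

def mzml_to_matrix(path, rt0=0.5, rt_step=0.016, mz0=50.0, mz_step=1.0, size=1024):
    mat = np.zeros((size, size), dtype=np.float32)
    with mzml.read(path) as reader:
        for spectrum in reader:
            if spectrum.get("ms level") != 1:                                   # keep MS1 scans only
                continue
            rt = float(spectrum["scanList"]["scan"][0]["scan start time"])      # assumed minutes
            row = int((rt - rt0) / rt_step)
            if not 0 <= row < size:
                continue
            mz = spectrum["m/z array"]
            inten = spectrum["intensity array"].astype(np.float32)
            cols = ((mz - mz0) / mz_step).astype(int)
            keep = (cols >= 0) & (cols < size)
            # Maximum intensity within each RT/m/z bin (max-pooling).
            np.maximum.at(mat[row], cols[keep], inten[keep])
    return mat
```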

Extraction and annotation of metabolic peaks

We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher Scientific) for peak alignment and extraction. These steps resulted in a matrix containing retention time, mass-to-charge ratio, and peak area information for each metabolite. To eliminate potential batch-to-batch variation, we used the Ref-M method to correct peak areas. This involved dividing the peak area of each feature in the study sample by the peak area of the reference compound from the same batch, yielding relative abundances. We then used several data annotation tools, such as MetID, MetDNA and NetID in MZmine, to annotate the metabolite features and combined the annotation results52,58,59,60. These analysis tools include mass spectral information from databases such as HMDB, MassBank and MassBank of North America (MoNA)41,61. In addition, we performed data annotation using MS2 data based on Compound Discoverer v3.1 and finally selected bio-compounds with MS2 matches to certified standards, or compounds with inferred annotations based on complete matches in mzCloud (score > 85) or ChemSpider, as the precise annotation results62.

Raw data visualisation

The mzML files were read using the Pyteomics package in Python, and the records for all time points within the sampling range were traversed63. For each time index in the mzML files, the preset scan configuration, the scan window, the ion injection time, the intensity array, and the m/z array are recorded. The intensity array and m/z array were selected to form an array of data points, with retention time, mass-to-charge ratio, and intensity as the row names. The intensity values were log2 transformed. The 3D point cloud data was then visualised using the Matplotlib Toolkits package in Python64. The 2D matrices were obtained by down-sampling the 3D point cloud using median and maximum pooling kernels, with convolution spans of RT 0.1 min and m/z 0.001. Heatmaps and contours were plotted using Matplotlib. Retention time-intensity curves were also plotted using Matplotlib with an m/z span of 0.003.
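A simple heatmap view of the resulting matrix can be drawn with Matplotlib as sketched below; the axis extents reuse the sampling constants from the data-conversion step and are assumptions if different settings were used.

```python
# Sketch: heatmap of log2 ion intensities over the m/z (x) and RT (y) axes.
import numpy as np
import matplotlib.pyplot as plt

def plot_matrix(mat, rt0=0.5, rt_step=0.016, mz0=50, mz_step=1):
    extent = [mz0, mz0 + mat.shape[1] * mz_step,        # x: m/z range
              rt0 + mat.shape[0] * rt_step, rt0]        # y: RT range (first row at the top)
    plt.imshow(np.log2(mat + 1), aspect="auto", extent=extent, cmap="viridis")
    plt.xlabel("m/z")
    plt.ylabel("RT (min)")
    plt.colorbar(label="log2 intensity")
    plt.show()
```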

Dataset division and assignment

The dataset of each batch was randomly divided into a training dataset and an independent testing dataset at a ratio of 4:1. The data from the first to the seventh batch contained 90 samples each, including 30 healthy individuals, 30 lung nodule samples, and 30 lung adenocarcinoma samples. The data for the eighth and ninth batches did not contain healthy samples, and the data for the tenth batch only contained nodule samples. To avoid the effect of class imbalance, we constrained the sample type and sex ratios to be the same in the training and independent testing datasets. Because the samples came from patients of different ages and sexes, and the lesion sizes of lung nodule and lung adenocarcinoma patients also varied, sex, age, and lesion size were used as additional constraints for dataset division to prevent these attributes from affecting the validity of the model. The difference in the distribution of sample attributes between the training dataset and the independent testing dataset was verified by an unpaired t-test.
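One straightforward way to approximate such a constrained split is to stratify on a composite key, as sketched below with scikit-learn; the study's exact constraint handling (balancing age and lesion size and checking with a t-test) is more involved than this, so the sketch is an illustration only.

```python
# Sketch: 4:1 split of one batch, stratified on a class + sex composite key.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_batch(samples: pd.DataFrame, test_size=0.2, seed=42):
    """`samples` is assumed to have 'class' and 'sex' columns; returns train/test frames."""
    strata = samples["class"].astype(str) + "_" + samples["sex"].astype(str)
    train, test = train_test_split(
        samples, test_size=test_size, stratify=strata, random_state=seed)
    return train, test
```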

Deep learning model construction in detail

In this step, we aimed to construct a model to predict the class label of each metabolomic sample. For this, we first set X and Y as the input and label spaces, respectively. A single end-to-end model consisted of three parts: a dimension converter based on pooling, a feature extractor based on convolutional neural networks, and a multi-layer perceptron (MLP) classifier. The input data taken directly from the raw data was extremely large and contained substantial redundant information, so a pooling step was required to reduce the resolution for downstream convolution operations. The input data was reduced by a max-pooling layer to obtain D(X), and then passed through the feature extractor dominated by convolutional layers to obtain F(D(X)). The convolutional neural network has local translation invariance and is well adapted to the RT offset problem in metabolomic data. Due to the relatively large size of the data, more than 5 layers of convolutional operations were required to reduce the dimensionality to a level that the available computing power could handle. Different architectures were compared on the tuning set. The architectures included VGGNet (VGG16, VGG19), Inception (InceptionV3), ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet (EfficientNetB0-B5)33,65,66,67. In addition, two optimised models based on DenseNet121 were created to simplify the DenseNet network: the direct connection route replaced the last dense layer of DenseNet121 with a convolutional layer, and the optimisation route replaced the last dense layer of DenseNet with a convolutional layer that retained a one-hop connection. The pre-training parameters of the pre-trained models were derived from ImageNet. Each architecture was tested on the TensorFlow + Keras platform and the PyTorch platform, respectively. To reduce overfitting, we used only one linear layer for the MLP classifier. In the TensorFlow + Keras model, there was a SoftMax activation layer before the output layer. The output of the model was C(F(D(X))).

The positive and negative spectral data were processed by separate convolutional branches for feature extraction, and their features were combined before being input to the fully connected layer. The pre-training parameters of the two branches were shared. For a model trained on both positive and negative spectral data, a cross-entropy loss was used.
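
A minimal sketch of this two-branch variant, under the same hypothetical shapes as above, is given below; separate DenseNet121 branches (each initialised from the same ImageNet weights) process the positive- and negative-mode matrices, and their features are concatenated before the fully connected layer.

```python
# Minimal sketch of a two-branch model for positive and negative spectra.
import tensorflow as tf
from tensorflow.keras import layers, models

def branch(name):
    inp = layers.Input(shape=(1024, 1024, 3), name=name)   # hypothetical shape
    x = layers.MaxPooling2D(pool_size=4)(inp)
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=(256, 256, 3))
    base._name = f"densenet121_{name}"                      # avoid layer-name clashes
    return inp, layers.GlobalAveragePooling2D()(base(x))

pos_in, pos_feat = branch("positive")
neg_in, neg_feat = branch("negative")
merged = layers.Concatenate()([pos_feat, neg_feat])         # combine branch features
outputs = layers.Dense(3, activation="softmax")(merged)
model = models.Model([pos_in, neg_in], outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```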

Model training

20% of the discovery dataset was set aside as a tuning set, which was used to guide model architecture selection, hyperparameter selection, and model optimisation, and the remaining 80% was used for model training. Sample category balance was used as a constraint for this split. Model architectures were evaluated by both validation performance and operational performance: we counted the number of model parameters to assess model complexity, and the average over 10 runs was taken as the runtime. Hyperas was used to preliminarily select the optimal model hyperparameters and initialisation weights68. The optimal initialisation method was he_normal, but we opted for pretraining on the ImageNet dataset because of its comparable performance and faster execution. After narrowing the parameter search space, we used grid search for hyperparameter tuning.

Ensemble strategy

DeepMSProfiler is an ensemble of several trained end-to-end sub-models, in which the average of the classification probabilities predicted by all sub-models is used as the final prediction probability. The ensemble model outputs a score vector of length 3, one score for each of the three classes, and the category with the maximum score is selected as the predicted classification result.

Each end-to-end sub-model was trained on the discovery dataset. The architecture of each sub-model was the same, but some hyperparameters differed: two learning rates, 1e-3 and 1e-4, were used. The optimiser was Adam with beta_1 = 0.9, beta_2 = 0.999, epsilon = 0.001, decay = 0.0, and amsgrad = False. The batch size was set to 8 and training was run for 200 epochs. A model typically took about 2 h to train on a GP100GL (16 GB) GPU server. Each sub-model contributed equally to the final prediction, without specific weights. The independent testing dataset was not used in model training or hyperparameter selection.
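
The ensemble prediction step can be sketched as follows, assuming a list of trained sub-models and a test set are already available.

```python
# Minimal sketch of the ensemble strategy: average sub-model probabilities,
# then take the argmax as the predicted class.
import numpy as np

def ensemble_predict(sub_models, x_test):
    # Each sub-model returns an (n_samples, 3) probability matrix
    probs = np.mean([m.predict(x_test) for m in sub_models], axis=0)
    return probs, np.argmax(probs, axis=1)

# probs, labels = ensemble_predict(sub_models, x_test)
```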

Machine learning models for comparison

To compare DeepMSProfiler with existing tools, we selected several common traditional machine learning methods to build tri-classification models based on the peak area data obtained in the previous steps. These methods included Extreme Gradient Boosting (XGBoost), RF, Adaptive Boosting (Adaboost), SVM, and DNN. The training and independent testing datasets were divided in the same way as for the deep learning algorithm, and the numbers of estimators for the Adaboost and XGBoost algorithms were the same as for DeepMSProfiler. XGBoost was implemented with the XGBClassifier function in the xgboost library; the other machine learning methods were implemented with the SciKitLearn library. SVM was implemented using the svm module with a linear kernel, RF with the RandomForestClassifier function, Adaboost with the AdaBoostClassifier function, and DNN with the MLPClassifier function. The optimal hyperparameters were obtained by grid search.
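
A minimal sketch of these baselines is shown below; X_train, y_train, X_test, and y_test are assumed to be defined, and the hyperparameter grids are illustrative placeholders rather than the exact settings used in this study.

```python
# Minimal sketch of the baseline classifiers with grid search.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

models = {
    "SVM": SVC(kernel="linear", probability=True),
    "RF": RandomForestClassifier(),
    "Adaboost": AdaBoostClassifier(),
    "DNN": MLPClassifier(max_iter=1000),
    "XGBoost": XGBClassifier(),
}
param_grids = {                                      # illustrative grids only
    "SVM": {"C": [0.1, 1, 10]},
    "RF": {"n_estimators": [20, 100]},
    "Adaboost": {"n_estimators": [20, 100]},
    "DNN": {"hidden_layer_sizes": [(64,), (128,)]},
    "XGBoost": {"n_estimators": [20, 100]},
}
for name, clf in models.items():
    search = GridSearchCV(clf, param_grids[name], cv=5)
    search.fit(X_train, y_train)
    print(name, search.best_params_, search.score(X_test, y_test))
```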

Performance metrics

We evaluated the performance of the model on the independent testing dataset. The evaluation metrics included accuracy, precision, sensitivity, and F1 score, with micro-averaging used for the multi-class model. Confidence intervals were estimated using 1000 bootstrap iterations. During the bootstrapping procedure, our model was evaluated as an ensemble of 20 models trained on the discovery dataset. In addition, we calculated a confusion matrix and ROC-AUC curves to demonstrate the performance of the model in the three-way classification of lung adenocarcinoma, benign nodules, and healthy individuals. The specificity at sensitivities of 0.7 and 0.9 was calculated from the sensitivity-specificity curve, which was interpolated using the NEAST method.
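
The bootstrap estimation of confidence intervals can be sketched as follows, assuming NumPy arrays of true labels and ensemble-predicted labels; micro-averaged F1 is used as the example metric.

```python
# Minimal sketch of a 1000-iteration bootstrap confidence interval.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
scores = []
n = len(y_true)
for _ in range(1000):
    idx = rng.integers(0, n, n)                      # resample with replacement
    scores.append(f1_score(y_true[idx], y_pred[idx], average="micro"))
ci_low, ci_high = np.percentile(scores, [2.5, 97.5])
print(f"micro-F1 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```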

Visualisation of “black-box”

In the end-to-end neural network prediction, the data flowed in a chain of \(\mathbf{X}\to D(\mathbf{X})\to F(D(\mathbf{X}))\to C(F(D(\mathbf{X})))\) from the input layer through the hidden layers to the output layer. In the feature extraction layers, which are dominated by convolutional layers, information was passed in the same chain manner. After inputting X, we obtained the corresponding output L of different hidden layers to open the black-box process. To observe the space of intermediate features, PCA was used to reduce the dimensionality of the hidden layer outputs to principal components. The PCA result was visualised with the Matplotlib package in Python.
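
A minimal sketch of this step is given below, assuming a trained Keras model, a test set, and a hypothetical hidden-layer name.

```python
# Minimal sketch: extract a hidden-layer output and project it with PCA.
import tensorflow as tf
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

feature_model = tf.keras.Model(model.input, model.get_layer("hidden_layer").output)
features = feature_model.predict(x_test).reshape(len(x_test), -1)   # flatten spatial dims
pcs = PCA(n_components=2).fit_transform(features)
plt.scatter(pcs[:, 0], pcs[:, 1], c=y_test)          # colour points by class label
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```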

To evaluate the correlation of the hidden layer outputs with the batch labels and type labels, respectively, we calculated the normalised mutual information (NMI), adjusted Rand index (ARI), and average silhouette coefficient (ASC) using the following formulas, where L denotes the layer output and C denotes the cluster labels used for the cluster evaluation.

$${\rm{MI}}\left({\bf{L}},{\bf{C}}\right)={\sum }_{i}^{{{Max}}_{L}}{\sum }_{j}^{{{Max}}_{C}}{P}_{i,j}\log \frac{{P}_{i,j}}{{P}_{i}{P}_{j}}$$
(1)
$${\rm{NMI}}\left({\bf{L}},{\bf{C}}\right)=\frac{2\,{\rm{MI}}\left({\bf{L}},{\bf{C}}\right)}{H\left({\bf{L}}\right)+H\left({\bf{C}}\right)}$$
(2)

In the above equations, the mutual information (MI) is computed from the layer outputs L and the label clusters C. \({P}_{i,j}\) represents the joint distribution probability of i and j, while \({P}_{i}\) and \({P}_{j}\) refer to the marginal distribution probabilities of i and j, respectively. \(H\left({\bf{L}}\right)\) and \(H\left({\bf{C}}\right)\) represent the entropy values of L and C, respectively. The clusters of the layer outputs were obtained with the K-nearest neighbour algorithm.

$${\rm{ARI}}\left({\bf{L}},{\bf{C}}\right)=\frac{2\left({TP}\times {TN}-{FN}\times {FP}\right)}{\left({TP}+{FN}\right)\left({FN}+{TN}\right)+\left({TP}+{FP}\right)\left({FP}+{TN}\right)}$$
(3)

In the above equation, TP is the number of point pairs belonging to the same cluster in both the real and experimental cases, FN is the number of point pairs belonging to the same cluster in the real case but not in the experimental case, FP is the number of point pairs not belonging to the same cluster in the real case but belonging to the same cluster in the experimental case, and TN is the number of point pairs not belonging to the same cluster in either case. The range of ARI is [−1, 1], and a larger value indicates better agreement between the clustering and the real labels.

$${\rm{ASC}}\left({\bf{L}},{\bf{C}}\right)=\frac{{\sum }_{i}^{N}\frac{{b}_{i}-{a}_{i}}{\max \left[{a}_{i},\,{b}_{i}\right]}}{{N}_{C}}$$
(4)

The layer output was first dimensionally reduced by PCA, and the clusters were specified by the real labels. In the above equation, \({a}_{i}\) is the average dissimilarity of vector i to the other points in the same cluster, and \({b}_{i}\) is the minimum dissimilarity of vector i to points in other clusters.
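
For reference, the following sketch computes the three metrics with their scikit-learn equivalents on a hidden-layer output; here K-means is used as a stand-in for the clustering step, and the feature matrix and label vector are assumed to be available.

```python
# Minimal sketch of NMI, ARI, and average silhouette coefficient (Eqs. 1-4)
# computed with scikit-learn on a hidden-layer feature matrix.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, silhouette_score)

pcs = PCA(n_components=10).fit_transform(features)            # reduce layer output
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(pcs)   # cluster the layer output
nmi = normalized_mutual_info_score(type_labels, clusters)
ari = adjusted_rand_score(type_labels, clusters)
asc = silhouette_score(pcs, type_labels)                      # clusters given by true labels
print(nmi, ari, asc)
```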

Feature contributions

In previous studies of feature contributions, different lines of work have used different methods to compute the contribution of features to the final classification; these methods help to better understand features and their impact on model predictions. Gradient-based methods, such as GradCAM, calculate the gradients of the last convolutional layer by backpropagation of the category with the highest confidence69. Owing to its convenience, this approach is widely used in computer vision tasks, but it has a significant limitation: the resolution of the feature contribution heatmap is extremely low and insufficient to distinguish most signal peaks. The size of the feature contribution heatmap corresponds to that of the last convolutional layer of the model, and the weight of the feature contribution is the average of the gradients of all features. Perturbation-based methods, such as RISE and Local Interpretable Model-Agnostic Explanations, instead measure the importance of features by obscuring some pixels in the raw data37,70 and observing how much this affects the black-box prediction. Perturbation-based methods provide higher resolution and more accurate contribution estimates, but their runtimes are longer. To improve the computing speed in this study, we made some improvements to RISE, using boost sampling for the mask process.

Using RISE, we determined the feature contributions of RT and m/z for each sample according to its true category. The feature contribution heatmap uses RT as the horizontal axis and m/z as the vertical axis to show the feature contribution at different positions for each sample. The average feature contribution of all samples correctly predicted as their true category was taken as the feature contribution of that category. In parallel, the peak extraction in the previous steps provided the RT range and the median m/z for each signal peak, and the feature contribution at the corresponding RT and median m/z coordinates was taken as the contribution of that signal peak.
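
A minimal sketch of a RISE-style perturbation estimate is given below; the mask grid size, number of masks, and keep probability are illustrative, and the model, input matrix, and target class index are assumed to be available.

```python
# Minimal sketch of a RISE-style saliency estimate: random binary masks are
# applied to the input, and each mask is weighted by the masked prediction score.
import numpy as np

def rise_saliency(model, x, class_idx, n_masks=500, grid=8, p_keep=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    h, w = x.shape[:2]
    saliency = np.zeros((h, w), dtype=np.float32)
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        # Upsample the coarse mask to the input resolution
        mask = np.kron(coarse, np.ones((h // grid + 1, w // grid + 1)))[:h, :w]
        masked = x * mask[..., None]                 # keep the channel dimension
        score = model.predict(masked[None], verbose=0)[0, class_idx]
        saliency += score * mask
    return saliency / n_masks
```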

Network analysis and pathway enrichment

The extracted metabolic peaks with a contribution score greater than 0.70 for the lung cancer classification were retained. Their mass-to-charge ratios, available substance identification information, and classification contribution scores were used as input data. For some of the metabolic signal peaks, we accurately identified the molecular formulae and substance names by secondary mass spectrometry and used these as substance identification information. Owing to the limitations of existing databases, many unknown signals could not be identified by secondary mass spectrometry; therefore, PIUMet was also adopted to search for hidden metabolites and related proteins.

PIUMet built disease-related metabolite-protein networks based on the prize-collecting Steiner Forest algorithm. First, PIUMet integrated iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the relationship between m/z, metabolites and proteins, and generated an overall metabolite-protein network. The edges were weighted by the confidence level of the correlation reported in these databases. iRefIndex provides details on the physical interactions between proteins, which are detected through different experimental methods. The protein-metabolite relationships in Recon2 are based on their involvement in the same reactions. HMDB includes proteins that are involved in metabolic pathways or enzymatic reactions, as well as metabolites that play a role in protein pathways, based on known biochemical and molecular biological data. The disease-related metabolite peaks obtained by DeepMSProfiler were matched to the metabolite nodes of the overall network by their m/z, and directly to the terminal metabolite nodes of the overall network after annotation. The corresponding feature contributions obtained by DeepMSProfiler served as prizes for these metabolite nodes. The network structure was then optimised using the prize-collecting Steiner Forest algorithm to minimise the total network cost and connect all terminal nodes, thereby removing low-confidence relationships and obtaining disease-related metabolite sub-networks.

Metabolite identification is an important issue in metabolomics research, and there are different levels of identification confidence. Referring to the highest level considered71, we analysed authentic chemical standards and validated 11 of the metabolites discovered by PIUMet from m/z alone (Supplementary Table 5). The disease-related metabolites and proteins were then used for pathway analysis39. These hidden metabolites and proteins from PIUMet were processed for KEGG pathway enrichment analysis using MetaboAnalyst (v6.0). We used the joint pathway analysis in MetaboAnalyst, choosing the hypergeometric test for enrichment analysis and degree centrality as the topology measure. The integrated version of the KEGG pathways (year 2019) was adopted by MetaboAnalyst. Pathways were filtered using a p value cut-off of 1e-572. The corresponding SYMBOL IDs of the proteins were converted to KEGG IDs by the ClusterProfiler package in R73.

Ablation experiment

We searched the PubMed database and retrieved a total of 5088 articles using the terms “serum”, “lung cancer” and “metabolism” from 2010 to 2022. By reading their titles and abstracts, we excluded publications that used non-serum samples such as urine and tissue, as well as publications that used non-mass-spectrometry methods such as chromatography, nuclear magnetic resonance, and infrared spectroscopy. We then further screened the selected literature to exclude studies that did not report metabolic biomarkers. Finally, 49 publications remained, reporting 811 serum metabolic biomarkers for lung cancer. Some of the literature provided the retention time and mass-to-charge ratio of the biomarkers, whereas other studies gave only the name of the identified biomarker; for the latter, we searched the molecular weights of these metabolites in the HMDB database to match the corresponding m/z, taking the effect of adducts into account. Based on the number of publications reporting each biomarker, we retained the m/z values corresponding to biomarkers that exceeded a threshold number of publications. We then filtered the raw data to exclude signals that did not fall within the 3 ppm intervals around these m/z values. The filtered raw data were used as input to the model.
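
The 3 ppm filtering step can be sketched as follows; the reference m/z values shown in the usage line are hypothetical.

```python
# Minimal sketch: keep only raw-data points whose m/z falls within +/-3 ppm
# of a literature-derived biomarker m/z.
import numpy as np

def ppm_filter(mz, biomarker_mz, ppm=3.0):
    mz = np.asarray(mz)
    keep = np.zeros(len(mz), dtype=bool)
    for ref in biomarker_mz:
        tol = ref * ppm * 1e-6                       # 3 ppm window around each reference
        keep |= np.abs(mz - ref) <= tol
    return keep

# mask = ppm_filter(mz_array, [132.1019, 182.0812])  # hypothetical reference m/z values
```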

Statistical analysis

All statistical analyses were performed using the stats package in Python, and data distributions were plotted using the Seaborn package in Python. The correlation between patient information and labels was calculated using Pearson's, Spearman's and Kendall's correlation coefficients: Pearson's correlation coefficient was preferred for linear correlations, whereas Spearman's and Kendall's rank correlation coefficients were used to detect non-linear correlations. P-values below 0.05 were considered significant.
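
A minimal sketch of the correlation analysis is shown below, assuming numeric vectors for a patient attribute and the class label.

```python
# Minimal sketch of the three correlation tests between an attribute and the label.
from scipy import stats

pearson_r, pearson_p = stats.pearsonr(attribute, label)
spearman_r, spearman_p = stats.spearmanr(attribute, label)
kendall_t, kendall_p = stats.kendalltau(attribute, label)
print(pearson_p, spearman_p, kendall_p)              # p < 0.05 considered significant
```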

Figure preparation

The main figures in this paper were assembled in Adobe Illustrator. The photograph of the mass spectrometry instrument was taken of the actual equipment. The data structure diagrams were generated by fitting simulated functions in Python. Some cartoon components were drawn with FigDraw (www.figdraw.com) under a free-use license code (TAIRAff614).

AI-assisted technologies in the writing process

At the end of the preparation of this work, the authors used ChatGPT to proofread the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.