Introduction

Metabolomics aims at characterizing metabolic biomarkers by analytically describing complex biological samples1. At present, metabolomics based on liquid chromatography-mass spectrometry (LC/MS) is capable of simultaneously monitoring thousands of metabolites in bio-fluids, cells and tissues, and is widely applied to various aspects of biomedical research. In particular, metabolomics analysis of LC/MS data can aid the choice of therapy2, provide powerful tools for drug discovery by revealing drug mechanisms of action and potential side effects3, and help to identify biomarkers4,5,6 of various diseases such as hepatocellular carcinoma (HCC)7, colorectal cancer8, insulin resistance9, and so on.

Several factors (e.g., unwanted experimental and biological variations and technical errors) may hamper the identification of differential metabolic profiles and the effectiveness of metabolomics analysis (e.g., paired or nested studies)10,11,12,13,14. To remove specific types of unwanted variations, signal drift correction (when quality control samples are available), batch effect removal (when internal standards or quality control samples are available) and scaling (not suitable when the self-averaging property does not hold) are adopted13. These commonly used strategies are generally grouped into two categories: (1) method-driven normalization approaches, which extrapolate an external model based upon internal standards or quality control samples, and (2) data-driven normalization approaches, which scale or transform the metabolomics data15,16,17,18,19,20. As reported in Ejigu’s work, the method-driven strategies may not be practical for several reasons, especially their unsuitability for untargeted metabolomics data, while data-driven ones are better choices for untargeted LC/MS based metabolomics data15. The capacities of 11 data-driven normalization methods (“normalization methods” in short for the rest of this paper) for processing nuclear magnetic resonance (NMR) based metabolomics data were systematically compared21. Two methods (the Quantile and the Cubic Splines) were identified as the “best” performing normalization methods, while two others (the Contrast and the Li-Wong) could “hardly” reduce bias at all and could not improve the comparability between samples21. For gas chromatography-mass spectrometry (GC/MS) based metabolomics, a comparative study of the performances of 8 normalization methods identified two (the Auto Scaling and the Range Scaling) as having the “overall best performance”12.
Similar to NMR and GC/MS, LC/MS is one of the most popular sources of current metabolomics data, and it is of great importance to analyze the differential influence of those methods on LC/MS based data. Ejigu et al. measured the performance of 6 methods according to their “average metabolite-specific coefficient of variation (CV)”15. The CVs showed that the Cyclic Loess and the Cubic Splines performed “slightly better” than the other methods, but no statistically significant difference among the CVs of those methods was observed15.

For the past decade, no fewer than 16 methods have been developed for normalizing LC/MS based metabolomics data13,22,23, some of which (e.g., the VSN24, the Quantile25, the Cyclic Loess26) were directly adopted from methods previously used for processing transcriptomics data. Both metabolomics and transcriptomics data are high-dimensional; however, the dimensionality of transcriptomics data can reach tens of thousands, while that of metabolomics data is only a few thousand. Moreover, unlike in transcriptomics, correlation among metabolites identified from metabolomics data may not indicate a common biological function27. Apart from these differences, there are significant similarities between the two types of omics data: (1) right-skewed distribution23, (2) great data sparsity28, (3) a substantial amount of noise29,30 and (4) significantly varied sample sizes31,32. Due to these similarities, it is feasible to apply some of the normalization methods used in transcriptomics data analysis to metabolomics data.

Those 16 methods specifically normalizing LC/MS based metabolomics data can be classified into two groups21. Methods in group one (including the Contrast Normalization33, the Cubic Splines34, the Cyclic Loess35, the Linear Baseline Scaling25, the MSTUS22, the Non-Linear Baseline Normalization36, the Probabilistic Quotient Normalization37 and the Quantile Normalization25) aim at removing the unwanted sample-to-sample variations, while methods of the second group (including the Auto Scaling38, the Level Scaling12, the Log Transformation39, the Pareto Scaling40, the Power Scaling41, the Range Scaling42, the VSN43,44 and the Vast Scaling45) adjust biases among various metabolites to reduce heteroscedasticity. However, the performance and the sample size dependence of those methods widely adopted in current metabolomics studies (e.g., the Pareto Scaling and the VSN)28,46 have not yet been exhaustively compared in the context of LC/MS metabolomics data analysis.

Moreover, several comprehensive metabolomics pipelines are currently available online, in which various normalization algorithms are integrated as one step of the corresponding analysis chain. These online pipelines include the MetaboAnalyst28, the Metabolomics Workbench47, the MetaDB48, the MetDAT49, the MSPrep50, the Workflow4Metabolomics51 and the XCMS online52. Based on a comprehensive review, the number of normalization algorithms provided by these pipelines varies significantly, from 2 (the Workflow4Metabolomics) to 13 (the MetaboAnalyst). Six of those 7 pipelines provide fewer than half of the 16 methods analyzed in this study. The MetaboAnalyst is the only pipeline offering 13 methods, but some methods reported as “well-performed” in LC/MS based metabolomics analysis (e.g., the VSN and the PQN)28,37,46 are not provided. The inadequate coverage of these methods may narrow the applicability of those pipelines. Moreover, since the suitability of a normalization method was reported to depend greatly on the nature of the analyzed data53, a comparative performance evaluation among methods is essential for both professional and inexperienced researchers to determine the most appropriate method. However, no comparative evaluation among those normalization methods was conducted in the above pipelines. So far, the Normalyzer53 is the only online tool offering a comparative evaluation of 12 different normalization methods for high-throughput omics data53. In particular, this tool accepts a variety of data types including metabolomics, proteomics, DNA microarray and real-time polymerase chain reaction data53. However, since the Normalyzer was designed to process a wide range of omics data, it does not cover 8 of the 16 methods specifically used in LC/MS based metabolomics studies.
Thus, there is an urgent need for a publicly available tool that comparatively and comprehensively evaluates the performances of methods used specifically for normalizing LC/MS based metabolomics data.

In this study, a comprehensive comparison of the normalization capacities of 16 methods was conducted. Firstly, the differential metabolic features selected based on each method were validated by a benchmark spike-in dataset and by experimentally validated markers. To further understand the influence of sample size on method performance, 10 sub-datasets of various sample sizes were generated to evaluate the variation of normalization performance among the 16 methods, and to categorize these methods into 3 groups (superior, good and poor performance groups). Finally, a web-based tool for comprehensively evaluating the performance of all 16 methods was constructed. In sum, this study could serve as valuable guidance for the selection of suitable normalization methods in analyzing LC/MS based metabolomics data.

Materials and Methods

Benchmark datasets collection and sub-datasets generation

Five criteria were used to select datasets from the MetaboLights (http://www.ebi.ac.uk/metabolights/)32 in this study: (1) data type set as “study”; (2) technology set as “mass spectrometry”; (3) organism set as “homo sapiens”; (4) study validation set as “fully validated”; (5) untargeted LC/MS based metabolomics data with >100 samples, selected by manual literature and dataset reviews. Based on the above criteria, 4 benchmark datasets were collected for analysis: the positive (ESI+) and negative (ESI−) ionization modes of both MTBLS2854 and MTBLS1755. For MTBLS17, only the dataset of experiment 1, with >100 studied samples, was included. In the remainder of this paper, MTBLS17 refers to the dataset of experiment 1 in Ressom’s work55. Both ESI+ and ESI− of MTBLS28 provided LC/MS based metabolomics profiles of 1,005 samples (469 lung cancer patients and 536 healthy individuals)54, while MTBLS17 ESI+ and ESI− gave profiles of 189 samples (60 HCC patients and 129 people with cirrhosis) and 185 samples (59 HCC patients and 126 people with cirrhosis), respectively55.

To construct training and validation datasets and sub-datasets of various sample size, random sampling and k-means clustering were applied. Taking MTBLS28 ESI+ as an example, 1,005 samples were divided into training dataset (400 lung cancer patients and 500 healthy individuals) and validation dataset (105 samples) by random sampling. Moreover, to generate the sub-datasets from training dataset, the k-means clustering56 was used to sample 10 sub-datasets of various sample size. In particular, the number of lung cancer patients versus that of healthy individuals were 50 vs. 40, 100 vs. 80, 150 vs. 120, 200 vs. 160, 250 vs. 200, 300 vs. 240, 350 vs. 280, 400 vs. 320, 450 vs. 360, and 500 vs. 400 for 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% of the samples in the training group, respectively.
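As a simplified illustration of how a sub-dataset of fixed class sizes can be drawn from the training data, the sketch below uses plain random sampling; the paper itself uses k-means clustering56 to pick representative samples, and all sample identifiers here are hypothetical:

```python
import random

def draw_subdataset(case_ids, control_ids, n_cases, n_controls, seed=42):
    """Draw a sub-dataset with a fixed number of cases and controls.

    Simplified stand-in for the paper's k-means-based sampling: each class
    is sampled independently so the case/control ratio is preserved exactly.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    sub_cases = rng.sample(case_ids, n_cases)
    sub_controls = rng.sample(control_ids, n_controls)
    return sub_cases, sub_controls

# 10% sub-dataset of MTBLS28 ESI+: 50 lung cancer patients vs. 40 healthy controls
cases = [f"cancer_{i}" for i in range(400)]      # hypothetical identifiers
controls = [f"healthy_{i}" for i in range(500)]  # hypothetical identifiers
sub_cases, sub_controls = draw_subdataset(cases, controls, 50, 40)
```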

LC/MS based metabolomics data pre-processing

Biological variance and technical error are two key factors introducing biases into metabolomics data. Biological variance arises from the spread of metabolic signals detected from various biological samples57, while technical error results from machine drift58. In particular, biological variances (e.g., varying concentration levels of bio-fluids, different cell sizes, varying sample measurements) are commonly encountered in metabolomics data13, while technical errors (e.g., a sudden drop in peak intensities or measurements on different instruments) are the major issues in large-scale metabolomics studies58. Apart from the methods described above that are widely adopted to remove biological variances22, quality-control (QC) samples were used to significantly reduce technical errors58.

Moreover, sparsity is inherent to metabolomics data: missing values typically account for 10-40% of all measurements and can affect up to 80% of all metabolic features59. Directly assigning zero to the missing values could be useful for cluster analysis, but it may lead to poor performance or even malfunction when a normalization method is applied50, especially for those methods based on the logarithm (e.g., the Log Transformation)50,53. Several missing value imputation methods are currently available, among which the KNN algorithm60 was reported as the most robust one for analyzing mass spectrometry based metabolomics data60. Therefore, the KNN algorithm was adopted in this work to impute the missing signals of the metabolic features.
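The idea behind KNN imputation can be sketched as follows; this is an illustrative Python re-implementation under simplifying assumptions (distances over shared observed features only, unweighted averaging), not the R implementation the study actually used:

```python
import math

def knn_impute(data, k=3):
    """Impute missing values (None) by averaging the k nearest samples.

    data: list of samples, each a list of feature intensities (None = missing).
    Distance between two samples is a root-mean-square difference over the
    features observed in both, so samples with different missingness patterns
    remain comparable. Minimal sketch; no tie-breaking or weighting.
    """
    imputed = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # rank the other samples that observed feature j by distance to sample i
            neighbours = []
            for other in data:
                if other is row or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                neighbours.append((dist, other[j]))
            neighbours.sort(key=lambda t: t[0])
            top = [v for _, v in neighbours[:k]]
            imputed[i][j] = sum(top) / len(top)
    return imputed

matrix = [[1.0, 2.0, 3.0],
          [1.1, None, 3.1],
          [0.9, 2.1, 2.9],
          [5.0, 6.0, 7.0]]
filled = knn_impute(matrix, k=2)  # missing value replaced by mean of 2 nearest rows
```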

In this study, a widely adopted data pre-processing procedure54,60,61 was applied, which included sample filtering, data matrix construction and signal filtering & imputing (Fig. 1). In particular, (1) samples with signal interruption or without a detectable internal standard were removed based on Mathé’s work54; (2) peak detection, retention time correction and peak alignment54 were applied to the UHPLC/Q-TOF-MS raw data (in CDF format) using the xcmsSet, the group and the retcor functions in the XCMS package62, with both the full width at half-maximum (fwhm) and the retention time window (bw) set to 10; (3) metabolic features detected in <20% of QC samples61 or with large variations54 were removed based on Mathé’s work, and missing signals of the remaining metabolic features were imputed by the KNN algorithm60. The detailed workflow of data pre-processing used in this study is illustrated in Fig. 1.

Figure 1

The overall research design and flowchart of this study.

Normalization methods analyzed in this study

16 methods were analyzed in this work, which include the Auto Scaling (unit variance scaling, UV)38, the Contrast Normalization33, the Cubic Splines34, the Cyclic Locally Weighted Regression (Cyclic Loess)35, the Level Scaling12, the Linear Baseline Scaling25, the Log Transformation39, the MS Total Useful Signal (MSTUS)22, the Non-Linear Baseline Normalization (Li-Wong)36, the Pareto Scaling40, the Power Scaling41, the Probabilistic Quotient Normalization (PQN)37, the Quantile Normalization25, the Range Scaling42, the Variance Stabilization Normalization (VSN)43,44 and the Vast Scaling45.

Auto Scaling (unit variance scaling, UV) is one of the simplest methods adjusting metabolic variances21, which scales metabolic signals based on the standard deviation of metabolomics data. This method makes all metabolites of equal importance, but analytical errors may be amplified due to dilution effects21. Auto scaling has been used to improve the diagnosis of bladder cancer using gas sensor arrays63 and to identify urinary nucleoside markers from urogenital cancer patients64.
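Auto scaling centres each metabolite on its mean and divides by its standard deviation, so every metabolite ends up with mean 0 and unit variance. A minimal sketch in Python (the study’s own implementations are in R, Supplementary Note S2):

```python
import statistics

def auto_scale(feature):
    """Auto (unit-variance) scaling of one metabolite's signals across samples:
    subtract the mean and divide by the standard deviation, giving every
    metabolite equal weight regardless of its original intensity scale."""
    mean = statistics.mean(feature)
    sd = statistics.stdev(feature)
    return [(x - mean) / sd for x in feature]

scaled = auto_scale([10.0, 12.0, 14.0, 16.0])  # mean 0, standard deviation 1
```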

Contrast Normalization originates from the integration of MA plots and logged Bland-Altman plots, and assumes the presence of non-linear biases21. The log function used in this method cannot process zeros and negative numbers, which therefore have to be converted to an extremely small positive value21. The contrast method has been employed to reveal the role of polychlorinated biphenyls in non-alcoholic fatty liver disease by metabolomics analysis65.

Cubic Splines is one of the non-linear baseline methods assuming the existence of non-linear relationships between baseline and individual spectra21. Cubic splines has been adopted to reduce variability in DNA microarray experiments by normalizing all signal channels to a target array34. Moreover, this method has been performed to evaluate differential effects of clinical and biological variables in breast cancer patients66.

Similar to contrast normalization, Cyclic Locally Weighted Regression (Cyclic Loess) also derives from the combination of the MA plot and the logged Bland-Altman plot, assuming the existence of non-linear bias21. However, cyclic loess is the most time-consuming of the studied normalization methods, and its runtime grows exponentially as the number of samples increases67. This method has been used to discover microRNA candidates regulating human osteosarcoma68.

Level Scaling transforms metabolic signal variation into variation relative to the average metabolic signal by scaling according to the mean signal12. This method is especially suitable for the circumstances when huge relative variations are of great interest (e.g., studying the stress responses, identifying relatively abundant biomarkers)12. Level Scaling has been used to identify urinary nucleoside markers from urogenital cancer patients64.

Linear Baseline Scaling maps each sample spectrum to the baseline based on the assumption of a constant linear relationship21. However, this assumption of a linear correlation among sample spectra may be oversimplified21. This method has been applied to identify differential metabolomics profiles among the banana’s 5 different senescence stages69. Moreover, linear baseline scaling has been used to profile the toxicity of capecitabine in patients with inoperable colorectal cancer70.

Log Transformation converts skewed metabolomics data to a symmetric distribution via a non-linear transformation, which is usually used to adjust heteroscedasticity and to transform relations among metabolites from multiplicative to additive12. Since relations among metabolites may not always be additive, this method is needed to identify multiplicative relations with linear techniques12. This method has been used to delineate the potential role of sarcosine in prostate cancer progression71.

MS Total Useful Signal (MSTUS) utilizes the total signals of metabolites that are shared by all samples by assuming that the number of increased and decreased metabolic signals is relatively equivalent22,72. However, the validity of this hypothesis is questionable since an increase in the concentration of one metabolite may not necessarily be accompanied by a decrease in that of another metabolite72,73. MSTUS has been reported as among the best choices for overcoming sample variability in urinary metabolomics73 and used to identify diagnostic and prognostic markers for lung cancer patients54.
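The MSTUS idea, dividing each sample by the total signal of the metabolites shared by all samples, can be sketched as follows; for simplicity every feature is assumed here to be shared by all samples, whereas in practice only the common features enter the sum:

```python
def mstus_normalize(samples):
    """MSTUS sketch: divide each sample's intensities by that sample's total
    signal, so systematic differences in overall signal (e.g., urine dilution)
    cancel out. Assumes all features are detected in every sample."""
    normalized = []
    for sample in samples:
        total = sum(sample)  # "total useful signal" of this sample
        normalized.append([x / total for x in sample])
    return normalized

# a 10-fold overall intensity difference between two samples disappears
norm = mstus_normalize([[2.0, 3.0, 5.0], [20.0, 30.0, 50.0]])
```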

Non-Linear Baseline Normalization (Li-Wong) is one of the normalization methods aiming at removing unwanted sample-to-sample variations21. This method was first used to analyze oligonucleotide arrays based on a multiplicative parametrization36,74, and has since been adopted to improve NMR-based metabolomics analysis21. This method has already been successfully integrated into the dChip74.

Different from auto scaling, Pareto Scaling uses the square root of the standard deviation of the data as its scaling factor40. Compared with auto scaling, this method therefore reduces the weights of large fold changes in metabolite signals more strongly, but extremely large fold changes may still dominate21. Pareto scaling has been performed to improve pattern recognition for targeted75 and untargeted76 metabolomics data.
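The only change relative to auto scaling is the scaling factor, the square root of the standard deviation, which leaves more of the original spread in high-variance metabolites. A sketch:

```python
import math
import statistics

def pareto_scale(feature):
    """Pareto scaling sketch: centre on the mean and divide by the square root
    of the standard deviation (auto scaling divides by the standard deviation
    itself, forcing the scaled spread to exactly 1)."""
    mean = statistics.mean(feature)
    sd = statistics.stdev(feature)
    return [(x - mean) / math.sqrt(sd) for x in feature]

# a high-variance metabolite keeps a spread larger than 1 after Pareto scaling
scaled = pareto_scale([10.0, 20.0, 30.0, 40.0])
```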

Power Scaling aims at correcting for pseudo scaling and heteroscedasticity12. Different from the log transformation, this method is able to handle zero values12. Power scaling has been used to study serum amino acid profiles and their variations in colorectal cancer patients77.
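The contrast with the log transformation can be shown directly. Assuming the square-root form of the power transformation, zeros pass through unchanged, whereas the logarithm needs a workaround such as an offset (the +1 offset below is a common convention, not taken from the paper):

```python
import math

def power_transform(feature):
    """Square-root (power) transform: defined for zero intensities as-is."""
    return [math.sqrt(x) for x in feature]

def log_transform(feature, offset=1.0):
    """Log transform: log(0) is undefined, so zeros require an offset
    (a common workaround; the choice of offset is a modelling decision)."""
    return [math.log(x + offset) for x in feature]

signals = [0.0, 1.0, 100.0, 10000.0]       # note the zero intensity
powered = power_transform(signals)          # works directly on the zero
logged = log_transform(signals)             # only works because of the +1 offset
```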

Probabilistic Quotient Normalization (PQN) transforms the metabolomics spectra according to an overall estimation on the most probable dilution37. This algorithm has been reported to be significantly robust and accurate comparing to the integral and the vector length normalizations37. PQN has been used to discover potential diagnostic technique for ovarian and breast cancers from urine metabolites78.
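A minimal sketch of the PQN procedure, using the feature-wise median across samples as the reference spectrum (one common choice; implementations differ in how the reference is built):

```python
import statistics

def pqn_normalize(samples):
    """PQN sketch: build a reference spectrum from feature-wise medians,
    estimate each sample's most probable dilution as the median of its
    feature-wise quotients to the reference, then divide by that quotient."""
    n_features = len(samples[0])
    reference = [statistics.median(s[j] for s in samples) for j in range(n_features)]
    normalized = []
    for sample in samples:
        quotients = [x / r for x, r in zip(sample, reference) if r != 0]
        dilution = statistics.median(quotients)  # most probable dilution factor
        normalized.append([x / dilution for x in sample])
    return normalized

# the middle sample is a 2-fold "concentrated" copy; PQN rescales it back
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
norm = pqn_normalize(data)
```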

Quantile Normalization aims at achieving the same distribution of metabolic feature intensities across all samples, and the quantile-quantile plot in this method is used to visualize the distribution similarity21. Quantile normalization has been used to probe differential molecular profiling between pancreatic adenocarcinoma and chronic pancreatitis79, and currently adopted to improve NMR-based metabolomics analysis21.
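Quantile normalization can be sketched as follows: intensities are ranked within each sample and each value is replaced by the mean across samples at that rank, so all samples end up with an identical intensity distribution (ties are ignored in this simplified version):

```python
def quantile_normalize(samples):
    """Quantile normalization sketch: sort each sample, average across samples
    at each rank, then map those rank means back to each value's original
    position. After this, every sample shares the same distribution."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]
    normalized = []
    for sample in samples:
        order = sorted(range(n), key=lambda j: sample[j])  # feature indices by rank
        out = [0.0] * n
        for rank, j in enumerate(order):
            out[j] = rank_means[rank]
        normalized.append(out)
    return normalized

norm = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```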

Range Scaling scales the metabolic signals by the variation of the biological responses63. A disadvantage of this method is that, unlike other scaling methods that take all measurements into account via the standard deviation, only a limited number of values (usually only 2) is used to describe the variation, which makes this algorithm relatively sensitive to outliers12. Because all variation levels of the metabolites are treated equally by range scaling, it has been used to fuse mass spectrometry based metabolomics data42.

Variance Stabilization Normalization (VSN) is one of the non-linear methods, aiming at keeping the variance approximately constant across the whole data range21. The method is reported to be a preferred approach for exploratory analyses such as principal component analysis80. VSN was originally developed for normalizing single- and two-channel microarray data81, and is currently used to determine metabolic profiles of liver tissue during early cancer development82.

As an extension of the auto scaling, Vast Scaling scales the metabolic signals based on the coefficient of variation12. Vast scaling has been used to identify prognostic factors for breast cancer patients from the magnetic resonance based metabolomics83.
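The mean-centred scaling variants above, level, range and vast scaling, differ only in their scaling factor: the mean, the max-min range, and the standard deviation with an extra weight by the inverse coefficient of variation, respectively. A compact sketch:

```python
import statistics

def level_scale(feature):
    """Level scaling: centre on the mean and divide by the mean, so values
    become variation relative to the average signal."""
    m = statistics.mean(feature)
    return [(x - m) / m for x in feature]

def range_scale(feature):
    """Range scaling: centre on the mean and divide by the max-min range;
    only two values define the scale, hence the sensitivity to outliers."""
    m = statistics.mean(feature)
    return [(x - m) / (max(feature) - min(feature)) for x in feature]

def vast_scale(feature):
    """Vast scaling: auto scaling multiplied by mean/sd (the inverse
    coefficient of variation), down-weighting noisy metabolites."""
    m = statistics.mean(feature)
    sd = statistics.stdev(feature)
    return [(x - m) / sd * (m / sd) for x in feature]

feature = [10.0, 20.0, 30.0]
```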

Detailed descriptions of these methods could be found in Supplementary Note S1, and their source codes programmed in this study could be found in Supplementary Note S2.

Assessment of the normalization performance by classification algorithm

Firstly, the differential metabolic features were identified by the VIP value (>1) of partial least squares discriminant analysis (PLS-DA)84 in the R package ropls85, together with the p-value (<0.05) of Student’s t-test71. All computational assessments were conducted in R (http://www.r-project.org) version 3.2.4 running on the 64-bit Mac OS X El Capitan (v10.11.5) platform. Source codes of related programs designed in this study could be found in Supplementary Note S2.
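The t-test half of this feature filter can be sketched as follows; only the two-sample Student t statistic is shown (the PLS-DA VIP computation and the p-value lookup are omitted, and the intensity values below are made up for illustration):

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Two-sample Student t statistic (equal-variance, pooled form) for one
    metabolic feature measured in two patient groups. Features with a large
    |t| (small p-value) are candidate differential features."""
    na, nb = len(group_a), len(group_b)
    ma, mb = statistics.mean(group_a), statistics.mean(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

# hypothetical intensities of one feature in cases vs. controls
t = t_statistic([5.1, 5.3, 4.9, 5.2], [6.0, 6.2, 5.9, 6.1])
```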

Secondly, a classification algorithm was applied to assess the performance of each normalization method based on the identified differential metabolic features. Several classification algorithms have been adopted to evaluate the performance of normalization methods, including the Support Vector Machine (SVM)21, the k-Nearest Neighbors (k-NN)86, the Gaussian Mixture Model (GMM)87, and so on. As illustrated in Fig. 1, the SVM algorithm in the R package e1071 (http://cran.r-project.org/web/packages/e1071) was selected to assess normalization performance in this study. In the process of training the classification models, 10-fold cross validation was used to optimize parameters, and the validation dataset was then used to assess the classification performance of the selected differential features by receiver operating characteristic (ROC) plots generated by the R package ROCR88. Source codes of the classification algorithm programmed in this study could be found in Supplementary Note S2.
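The AUC values used throughout the evaluation can be obtained from classifier scores with the rank-based (Mann-Whitney) formulation, sketched below; the study itself used the R packages e1071 and ROCR, and the scores and labels here are hypothetical:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a randomly chosen positive sample receives a
    higher score than a randomly chosen negative one; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical SVM decision scores on 4 validation samples (1 = case, 0 = control)
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1])
```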

Identification of the performance relationship among normalization methods

The hierarchical clustering56,89,90 was adopted to identify the relationship of sample-size-dependent performance among the 16 methods. Firstly, the area under the curve values (AUCs) of a specific method across the 10 sub-datasets of various sample sizes were used to generate a 10-dimensional vector. Secondly, hierarchical clustering was adopted to investigate the relationship among the vectors, and therefore among the corresponding methods. As an assessment of consistency between different distance metrics, two metrics (the Manhattan and the Euclidean) were applied:

$$d_{\mathrm{Manhattan}}(a,b)=\sum_{i=1}^{10}\left|\mathrm{AUC}_{a,i}-\mathrm{AUC}_{b,i}\right|\qquad(1)$$

$$d_{\mathrm{Euclidean}}(a,b)=\sqrt{\sum_{i=1}^{10}\left(\mathrm{AUC}_{a,i}-\mathrm{AUC}_{b,i}\right)^{2}}\qquad(2)$$

In Eq. (1) and Eq. (2), i indexes the AUC values of methods a and b. The clustering approach adopted was Ward’s minimum variance method91, which minimizes the total within-cluster variance. In this work, the Ward’s minimum variance module in R was used92. Source codes of the hierarchical clustering algorithm programmed in this study could be found in Supplementary Note S2.
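The two distance metrics applied to the 10-dimensional AUC vectors can be sketched as follows; the AUC vectors shown are hypothetical placeholders, since the study computed them from the sub-datasets and clustered them with Ward’s method in R:

```python
import math

def manhattan(auc_a, auc_b):
    """Eq. (1): Manhattan distance between two methods' AUC vectors."""
    return sum(abs(a - b) for a, b in zip(auc_a, auc_b))

def euclidean(auc_a, auc_b):
    """Eq. (2): Euclidean distance between two methods' AUC vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(auc_a, auc_b)))

# hypothetical AUCs of two methods across the 10 sub-datasets
auc_method_a = [0.71, 0.72, 0.70, 0.73, 0.74, 0.72, 0.73, 0.75, 0.74, 0.76]
auc_method_b = [0.51, 0.50, 0.52, 0.49, 0.51, 0.50, 0.52, 0.51, 0.50, 0.52]
d1 = manhattan(auc_method_a, auc_method_b)
d2 = euclidean(auc_method_a, auc_method_b)
```

These pairwise distances are exactly what a hierarchical clustering routine with Ward linkage would consume to build the dendrograms of Fig. 3.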

Construction of a web-based tool for evaluating the performance of 16 normalization methods

A web-based tool named MetaPre for comprehensively evaluating the normalization performance of all 16 methods was constructed and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. MetaPre was developed in the R environment, and further extended using HTML, CSS and JavaScript. The R package Shiny (http://shiny.rstudio.com/) was used to construct the web application (comprising a front end and a back end). The R packages DiffCorr93 and vsn from the Bioconductor Project94 were utilized to support background processes. The MetaPre server was deployed on the Apache HTTP web server v2.2.15 (http://httpd.apache.org).

Results and Discussion

Validation of the differential metabolic features selected based on 16 normalization methods

Supplementary Table S1 showed the number of differential metabolic features identified by PLS-DA based on the 16 normalization methods. As demonstrated, the numbers of features selected based on some methods were identical, while the numbers identified by others varied significantly. Since the SVM classifier in this work was built on those features, their validity was crucial for assessing the performances of the 16 methods. In this study, two lines of evidence were provided for this assessment. First, a benchmark spike-in dataset from Franceschi’s work95 was analyzed. As shown in Supplementary Table S2, the performances in identifying spike-in compounds based on the 16 methods were equivalent to that of Franceschi’s work, which indirectly reflected the reliability of the strategy applied in this study. Secondly, 2 markers (creatine riboside and 561.3432) from the positive and another 2 markers (cortisol sulfate and N-acetylneuraminic acid) from the negative ionization mode were experimentally validated in Mathé’s work54. Supplementary Table S3 listed the number of experimentally validated markers identified by this work from the same datasets as in Mathé’s work (MTBLS28 ESI+ and ESI−). Across all methods and sample sizes, the absolute majority (91.6%) identified all experimentally validated markers, which could serve as another line of evidence for the validity of the metabolic features selected by this study.

Variation of normalization performances among 16 methods based on benchmark datasets

Table 1 demonstrated the prediction accuracy (ACC) of each method trained on 10 sub-datasets based on MTBLS28 (ESI+ and ESI−). For the training set of 900 samples from MTBLS28 ESI+, the ACC values of 11 methods fell in the range from 0.6095 (the Level Scaling) to 0.6952 (the Log Transformation, the Power Scaling and the Range Scaling). The ACC values of 4 methods (the VSN, the PQN, the Cyclic Loess and the Cubic Splines) exceeded 0.7, while that of another method (the Contrast) was only 0.5143. For the training set of 900 samples from MTBLS28 ESI−, the ACC values of 14 methods fell in the range from 0.6095 (the Level Scaling) to 0.6857 (the Cyclic Loess and the VSN). The ACC value of only one method (the Quantile) exceeded 0.7, while that of another method (the Contrast) was only 0.3333. Moreover, Supplementary Table S4 showed the ACC values of each method trained on 10 sub-datasets based on MTBLS17 (ESI+ and ESI−). For the training set of 170 samples from MTBLS17, the Contrast method consistently underperformed compared with the other methods, similar to its behavior on MTBLS28. However, the top-ranked normalization methods for each ionization mode of each dataset varied significantly, which is in accordance with Chawade’s conclusion that the effectiveness of a method in normalizing data relies on the nature of the analyzed data53. Thus, this significant variation indicated that it is essential to take various sample sizes into account when comparing the performance of normalization methods.

Table 1 Performance evaluation of 16 normalization methods across 10 sub-datasets based on the benchmark data MTBLS28 (ESI+ and ESI−).

The receiver operating characteristic (ROC) curves and the area under the curve values (AUCs) were used to illustrate the performances of the 16 methods in Fig. 2 and Supplementary Table S5. Figure 2a–d illustrated the ROC curves of MTBLS28 ESI+, MTBLS28 ESI−, MTBLS17 ESI+ and MTBLS17 ESI−, respectively. The training dataset of Fig. 2a and b consisted of 900 samples (400 lung cancer patients and 500 healthy individuals), and that of Fig. 2c and d consisted of 170 samples (50 HCC patients and 120 people with cirrhosis). The grey diagonal represented an invalid model with a corresponding AUC value equal to 0.5. As illustrated in Fig. 2a–d, the Contrast method showed a poor normalization performance on all 4 datasets, while the VSN and the Log Transformation consistently outperformed the others. However, the performance ranks of the remaining methods fluctuated dramatically, which also called for a collective assessment of normalization performance based on various sample sizes.

Figure 2

Normalization performance of 16 methods measured by receiver operating characteristic (ROC) curves based on four benchmark datasets: (a) MTBLS28 ESI+, (b) MTBLS28 ESI−, (c) MTBLS17 ESI+ and (d) MTBLS17 ESI−. The training dataset of (a) and (b) consisted of 900 samples (400 lung cancer patients and 500 healthy individuals), and that of (c) and (d) consisted of 170 samples (50 HCC patients and 120 people with cirrhosis). The grey diagonal represented an invalid model with a corresponding area under the curve (AUC) value equal to 0.5. All lines were generated by the LOESS regression.

Categorization of 16 methods based on their normalization performances

AUCs of a specific method among 10 sub-datasets were calculated to construct a 10 dimensional vector. The resulting 16 vectors were then hierarchically clustered based on two popular distance metrics (the Manhattan in Fig. 3 and the Euclidean in Supplementary Figure S1). Cluster analysis of 16 methods was conducted based on 4 benchmark datasets: (a) MTBLS28 ESI+, (b) MTBLS28 ESI−, (c) MTBLS17 ESI+ and (d) MTBLS17 ESI−. As shown in Fig. 3a–d, 16 methods were divided by the corresponding dendrogram on the left side of each figure into three areas: top, middle and bottom areas colored by green, blue and magenta, respectively. Clearly, 3 methods (the VSN, the Log Transformation and the PQN) were consistently ranked into the top area of all 4 figures, while one method (the Contrast) always stayed in the bottom area. Therefore, 16 normalization methods could be categorized into 3 groups (A, B and C) by comprehensively considering their performances across all 4 benchmark datasets.

Figure 3

Cluster analysis of 16 normalization methods according to their AUC values (across 10 various sample sizes) calculated based on four benchmark datasets: (a) MTBLS28 ESI+, (b) MTBLS28 ESI−, (c) MTBLS17 ESI+ and (d) MTBLS17 ESI−. The data were presented in matrix format in which columns represent specific training dataset of various sample size and rows represent each normalization method. Each cell in heat map represents AUC value of a normalization method trained on one specific training sample. The cell of the highest AUC value was set as exact blue with those lower AUC values gradually fading towards red (the lowest AUC value). Hierarchical clustering analyses were conducted using Manhattan metric and Ward’s minimum variance algorithm.

As illustrated by Fig. 4, the normalization methods in group A (the VSN, the Log Transformation and the PQN) demonstrated the best performance among all 16 methods, which made group A (G-A) the Superior Performance Group. The VSN and the PQN had previously been found to be robust and well-performing methods in metabolomics for various dilutions of biological samples37,96. The Log Transformation was reported to be a powerful tool for making skewed distributions symmetric12; it was therefore a very suitable method for treating metabolomics data (the distribution of which is right-skewed)23. Moreover, one method in G-A (the VSN) was also found to be the most capable in reducing variation between technical replicates in proteomics, and consistently performed well in identifying differential expression profiles97. The Contrast was the only method in group C (G-C, the Poor Performance Group), and its performance was consistently the worst across the 10 sub-datasets among all 16 methods. As reported by Kohl et al.21, the Contrast hardly reduced bias at all and could not improve the comparability among samples21.

Figure 4

Method groups categorized according to the normalization performances across various sample sizes based on four benchmark datasets: (a) MTBLS28 ESI+, (b) MTBLS28 ESI−, (c) MTBLS17 ESI+ and (d) MTBLS17 ESI−. (G-A) superior performance group; (G-B1) good performance group including methods occasionally classified into the top green area of Fig. 3; (G-B2) good performance group including methods consistently staying in the middle blue area of Fig. 3; (G-C) poor performance group. All lines were generated by the LOESS regression.

Moreover, the remaining 12 methods in group B (the Good Performance Group) could be further divided into G-B1 (6 methods occasionally classified into the top area of Fig. 3) and G-B2 (6 methods consistently staying in the middle area of Fig. 3). As illustrated in Fig. 4, although they slightly underperformed compared with G-A, the methods in G-B1 showed good normalization performance across the 10 sub-datasets of various sample sizes. Furthermore, most of the methods in G-B2 followed a similar fluctuation trend across sample sizes, with the Li-Wong standing out as an outlier. The Li-Wong performed the worst among the assessed methods in reducing within- and between-group variations96, and could hardly reduce the biases among samples at all21.

Similar to the results based on the Manhattan metric (Fig. 3), the 16 methods could also be re-categorized using the Euclidean metric. As illustrated in Supplementary Figure S1, the categorization based on the Euclidean metric identified 3 groups containing exactly the same methods as those found with the Manhattan metric, indicating that the method categorization was independent of the choice of distance metric. Moreover, in Supplementary Figure S1d, the Li-Wong was clustered into the bottom area (magenta) together with the Contrast, which again reflected its unsuitability for analyzing LC/MS based metabolomics data21,96.

Online interactive analysis tool for normalizing LC/MS based metabolomics data

With the R package Shiny (http://shiny.rstudio.com/), an interactive web tool named MetaPre was developed in this study and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. MetaPre, built to normalize LC/MS based metabolomics data, can be easily accessed with any modern web browser such as Chrome, Firefox, IE or Safari. Meanwhile, a local version of MetaPre was freely provided in this study and can be readily downloaded from GitHub at https://github.com/libcell/MetaPre. The procedure for using the online version of MetaPre is illustrated in Fig. 5 and includes 4 steps: (1) uploading the dataset; (2) data pre-processing; (3) data normalization; (4) performance evaluation.

Figure 5

General operational procedure for using MetaPre.

Uploading the dataset provided the option to upload data with or without QC samples. In large-scale metabolomics studies (especially LC/MS based ones), not all samples can be analyzed in the same experimental batch61. To cope with this difficulty, QC samples are frequently applied58,61. In MetaPre, batch correction based on QC samples was provided, making this tool one of the few currently available online servers51,98 offering such a function.

Data pre-processing offered functions to correct metabolic features and impute missing signals. For data with QC samples, MetaPre first applied within-block signal correction61 to correct metabolic features, and multiple popular imputation algorithms were then provided to fill in missing signals. For data without QC samples, only missing-signal imputation was performed.
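One widely used simple imputation strategy for metabolomics can be sketched as follows. This half-minimum rule is only an illustrative example of such an algorithm, not necessarily one of the methods offered by MetaPre; it assumes signals are missing because they fell below the detection limit.

```python
import numpy as np

def half_min_impute(X):
    """Replace missing intensities (NaN) with half of the minimum
    observed value of the corresponding feature (column).

    X: (n_samples, n_features) matrix with NaN marking missing signals.
    """
    X = X.copy()
    feature_min = np.nanmin(X, axis=0)   # per-feature observed minimum
    rows, cols = np.where(np.isnan(X))   # positions of missing signals
    X[rows, cols] = 0.5 * feature_min[cols]
    return X
```

More sophisticated alternatives (e.g., k-nearest-neighbour imputation) follow the same interface: take the raw peak table, return a complete matrix.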

Data normalization integrated the 16 normalization methods discussed in this study to remove unwanted biological variations. After any of these methods was selected, the normalized data matrix was displayed on the web page and a corresponding csv file could be downloaded directly. Moreover, two box plots visualizing the data distributions before and after normalization were shown on the web page.
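As a small numerical illustration of why the Log Transformation belongs in this toolbox, the sketch below (synthetic data, not MetaPre code) shows how log-transforming right-skewed intensities, typical of raw LC/MS peak tables, makes their distribution nearly symmetric.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Synthetic right-skewed intensities (lognormal), mimicking raw peak areas.
raw = rng.lognormal(mean=5.0, sigma=1.0, size=5000)

# Log transformation; generalised-log variants add an offset to handle zeros.
logged = np.log2(raw)

print(f"skewness before: {skew(raw):.2f}, after: {skew(logged):.2f}")
```

The sample skewness drops from a strongly positive value to approximately zero, which is exactly the symmetrizing effect reported for the Log Transformation12.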

Performance evaluation was quantified based on the AUC values of the constructed SVM models. First, differential metabolic features were identified by the VIP values (>1) of a PLS-DA model. Then, SVM models were constructed based on these identified differential features. After k-fold cross-validation, the ROC curve together with its AUC value was calculated and displayed on the web page.
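The SVM/AUC part of this evaluation can be sketched in a few lines. This is an illustrative Python analogue using scikit-learn on synthetic data, not the R code behind MetaPre; the synthetic classification problem stands in for a normalized, feature-selected metabolite table with two sample groups.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for a peak table after normalization and VIP-based
# feature selection: 100 samples, 20 retained features, 2 groups.
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# 5-fold cross-validated AUC of an RBF-kernel SVM classifier.
auc = cross_val_score(SVC(kernel='rbf'), X, y,
                      cv=5, scoring='roc_auc').mean()
print(f"cross-validated AUC: {auc:.3f}")
```

An AUC near 1 indicates that, after normalization and feature selection, the two groups are well separated; an AUC near 0.5 indicates no better than random discrimination.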

MetaPre is a valuable online tool for selecting suitable methods to normalize LC/MS based metabolomics data, and a useful complement to the currently available tools in modern metabolomics analysis.

Conclusion

Based on the 4 datasets tested in this work, 16 methods for normalizing LC/MS based metabolomics data were categorized into three groups according to their normalization performances across various sample sizes: the superior (3 methods), good (12 methods) and poor (1 method) performance groups. The VSN, the Log Transformation and the PQN were identified as the methods with the best normalization performance, while the Contrast consistently underperformed the other 15 methods across all sub-datasets of the different benchmark data. Moreover, an interactive web tool comprehensively evaluating the performance of all 16 methods for normalizing LC/MS based metabolomics data was constructed and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. In summary, this study could serve as a guide for selecting suitable normalization methods when analyzing LC/MS based metabolomics data.

Additional Information

How to cite this article: Li, B. et al. Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis. Sci. Rep. 6, 38881; doi: 10.1038/srep38881 (2016).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.