Kernel weighted least square approach for imputing missing values of metabolomics data

Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data analysis in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.

Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data analysis in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https:// github. com/ Nishi thPaul/ tWLSA.
Metabolomics datasets produced by mass spectrometry (MS) often contain a wide number of missing cells in the data matrix that can be generated from various sources, including both technological and biological hazards. Generally, there are approximately 10% to 40% missing values in metabolomics datasets [1][2][3] . The reasons include: (i) the metabolite concentration peak is below the analytical method's detectable threshold; (ii) the metabolite concentration peak is not initially present in the chromatogram; (iii) overlapping signal separation; (iv) deconvolution may give false negatives during the separation of overlapping signals, (v) computational and/or measurement error, (vi) the concentration of the metabolite is present in the sample but vanishes during downstream processing, and (vii) the concentration of a particular metabolite is identified in one sample, but does not exist at a significant concentration in another sample 1,3-6 . These missing values can be categorised as (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR). If a missing variable is not related to any observed variable or response it is MCAR. If a missing variable is linked with one or more observed variables, but not to the response, it is MAR. The response associated with missing is MNAR. In metabolomics datasets, if the concentration of a metabolite is not seen in one group of samples, but is present in another group of samples, the missing values most likely occur for a biological reason and can be classified as MNAR. However, if the peak of metabolite concentration is smaller than the analytical method's detection threshold, this missing type is a combination of biological and technological issues and can be considered as MNAR. Finally, MCAR is caused by only technological reasons, for example, errors related to peak picking software, in which the peak was evident but not included in the raw data.
The easiest and most straight forward method of dealing with missing values is the filtering method. In this method, variables 7,8 or samples 9,10 are removed. In recent times, this is applicable only when the data matrix includes a greater percentage of missing data. To handle the missing value problem, an alternative approach is the imputation technique. The conventional and widely used missing imputation techniques in different studies and software for imputing missing data are half of the minimum value replacement 2,11 , mean replacement 12 , median replacement 12 , k-nearest neighbour (kNN) 13 16 , zero imputation 17 , multiple imputations with expectation maximisation (EM) algorithm and Monte Carlo Markov chain (MCMC) method 18 , expectation-maximization principal component analysis (EM-PCA) 19 , and random forest (RF) imputation 20 . Recently developed techniques include Gibbs sampler-based left-censored missing value imputation approach (GSimp) 21 , quantile regression imputation of left-censored data (QRILC) 2 , kNN on observations with variable pre-selection ("kNN-obs-sel") 22 , BayesMetab 23 , robust missing imputation using mean absolute error (rmiMAE) 24 , multivariate imputation by chained equations (MICE) 25 , and others. Several missing imputation techniques are described in the literature. However, the selection of the missing imputation technique has a profound impact on univariate and multivariate (unsupervised and supervised) data analyses and interpretation 1,[26][27][28] . Therefore, the appropriate handling of missing data is very important according to the structure or nature of the original data for downstream analysis. The pattern of metabolomics datasets is very complicated because metabolomics datasets contain outliers 29 , non-normality, and inherent correlation structure 30 . However, the missing value imputation techniques, such as mean, kNN, EM-PCA, PPCA, BPCA, and RF, are sensitive to outliers 25 . All the aforementioned techniques can only handle the problem of missing values. They cannot significantly and simultaneously reduce the outlier problem. This is because the conventional imputation algorithms do not directly consider any outlier-robust function or any outlier identification and substitute algorithms. Furthermore, existing outlier resolving techniques do not consider missing value problems. For these reasons, we have developed a novel kernel-weight-based missing imputation (KMI) method that can simultaneously overcome both the missing value imputation problems and outliers. We compared our proposed method with widely used conventional techniques and recently developed techniques.
To evaluate the performance of the proposed weight-based missing imputation method compared to the other existing missing value imputation methods, we took into account nine widely used well-known missing imputation methods: zero imputation, mean imputation, median imputation, half of the minimum value imputation, kNN imputation, BPCA imputation, PPCA imputation, EM-PCA imputation, and RF imputation. We also considered five recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. We measured the performances of the missing imputation methods, including the proposed technique, using both artificial and real data analysis in the absence and presence of different rates of outliers.

Material and methods
In this dissertation, we developed a new missing data imputation method by minimising the two-way kernel weighted square error loss function. To compare the competence of the proposed method, we considered nine widely used traditional missing imputation techniques as described above. Substituting all missing values are by zero is known as zero imputation. In the mean, median, and half of the minimum value imputation, missing data for each metabolite are substituted by the corresponding metabolite average, median, and half of the minimum value, respectively. Missing data substitution using kNN, EM-PCA, and RF are found in the "impute", "missMDA" and "missForest" packages, respectively of the R platform. Moreover, BPCA and PPCA imputation can be done using "pcaMethods" package in Bioconductor. As comparators of our proposed missing imputation method, we also considered five recently developed missing imputation techniques: GSimp, QRILC, BayesMetab, rmiMAE, and MICE. Among the techniques, rmiMAE is a comparatively more robust missing imputation technique which is computed by minimising the two-way mean absolute error loss function, i.e., L1 (Least absolute deviation) loss function like minimizing 1 n n j=1 e ij = 1 n n j=1 |x ij − r i c j | , which is more robust against outliers than L2 (Least square error) loss function like minimizing n j=1 (e ij ) 2 = n j=1 (x ij − r i c j ) 2 . To reduce the influence of outliers in the least square error loss function, here, we used the weighted squared error loss function, where the weight function is w j = exp − 2(mad(x j )) 2 (x ij − median(x j )) 2 . The speciality of the weight function is that the weight will be close to zero if the corresponding observation is apart from its median and if the corresponding observation is the neighbour of the median, the weight will be close to one. A detailed description of the proposed missing value imputation method using a two-way kernel weighted square error loss function is given below.
Missing data imputation using two-way kernel weighted least square error approach (proposed). Let X = x ij be metabolomics data, where i = 1, 2, . . . , p represents the metabolites and j = 1, 2, . . . , n represents the samples. Thus, in the metabolomics data X, different rows indicate different metabolites, and the columns indicate different samples.
Each cell of the metabolomics data could be represented as the product of the metabolite (row) effect and the sample (column) effect. Mathematically, it is written in a bilinear form, where r i and c j represent the i-th row effect (i.e. metabolite effect) and the j-th column effect (i.e. sample effect), respectively. The observed metabolomic data matrix usually contains missing cells and outliers. Thus, both the missing cell and outliers in the data matrix can be estimated by considering the effect of the corresponding row and column. In Eq. (1), r i and c j are both unknown. Therefore, our motive is to determine r i and c j to forecast the ij-th missing cell or outlying cell. To estimate r i and c j , consider model www.nature.com/scientificreports/ where x ij is the yield corresponding to the effect of the ith metabolite (row) and jth sample (column), r i indicates the factors of the ith metabolite, and c j indicates the factors of the j-th sample and ∈ ij indicates the error term. From model (2), we must estimate r i and c j simultaneously. To estimate r i and c j , we developed a weighted least square approach using a kernel weight function w j = exp − 2(mad(x j )) 2 (x ij − median(x j )) 2 and updated r i and c j by an iterative procedure, where mad represents the median absolute deviation. The speciality of the kernel weight function is that it lies between zero and one. The weight will be close to zero if the corresponding observation is apart from its median. If the corresponding observation is the neighbour of the median, the weight will be close to one. In the kernel weight function, is the tuning parameter, where the value of is chosen by k-fold cross-validation. The details of the appropriate selection procedure are given in Supplementary Information 1 ( Supplementary Fig. S1). If the data set is clean (i.e. no outliers), then will be zero. In this condition, all the weights will be 1, that is, the technique will be the classical least-squares approach. The steps for estimating r i and c j are given below: Step 1 To initialise the j-th column (sample) effect (c j ), calculate the j-th column median of X. Column median is computed by excluding the missing values j = 1, 2, . . . , n.
Step 2 Using the weighted least square approach, estimate the i-th row effect (i.e. metabolite effect) r i by minimising n j=1 e ij 2 = n j=1 w ij x ij − r i c j 2 , based on the i-th row of X, by eliminating the missing values, i = 1, 2, . . . , p.
Step 3 Revise the j-th column effect c j , using the weighted least square approach by minimising p i=1 w ij x ij − r i c j 2 , based on the j-th column of X, by eliminating the missing values, j = 1, 2, . . . , n.
Step 4 Repeat Steps 2 and 3 until it satisfies the rule |r new −r old |+|c new −c old | n+p ≤ ε ; here ε is a very small positive number, which depends on the researcher's interest. Here, we choose ε = 0.01.
• Calculate the first remainder matrix (X R1 ) as X R1 = X −X (1) = X −r 1ĉ1 (excluding the missing cells of the data matrix) • Using steps 1-4 on X R1 , compute the second fitted bilinear form as,X R1 =r 2ĉ2 and calculate the second remainder matrix (X R2 ) as, The number of r is selected in such a way that the total row variations of r k=1r kĉk can explain (1 − α)100% variations of X (using the concept of singular value decomposition; the details of the r selection procedure are given in Appendix 1 of the supplementary materials), where α is chosen by the researcher interest. In this case, α = 0.05. Therefore, the approximation of X is: Step 6 Substitute the missing values and the outlying cells of X by the corresponding cells of X (r) that produce the reconstructed full and clean data matrix X . Here, the inter quartile range (IQR) rule 31 was used to detect outliers.
The application procedure of the proposed method in metabolomics data is given below. The metabolomics dataset may contain several groups of samples in their data structure. If a metabolomics dataset contains k groups of samples, then the dataset is split according to the groups as where g 1 is the column number (subjects) of group-1, g 2 is the column number (subjects) of group-2, and so on g 1 + g 2 + · · · + g k = n.
Therefore, we checked whether the metabolomics data matrix X contained multiple groups in the samples. If X contains multiple groups, then partition matrix X as X = ( X 1 X 2 · · · X k ) according to k groups of samples, www.nature.com/scientificreports/ If X contains k groups, then apply Steps 1-6 for each partitioned data matrix and compute X 1 ,X 2 , · · ·X k . Thus, the reconstructed full and clean data matrix X = (X 1X2 · · ·X k ) . Otherwise, apply Steps 1-6 for the data matrix X and compute the reconstructed full and clean data matrix X .
User can install the package in R platform using the following R code Artificially generated metabolomics data. To simulate metabolomics datasets, we used the following additive linear model: where x ijk is the concentration of the i th metabolite, j th group, and k th sample; the average concentration for the i-th metabolite is µ i ; g ij represents the j th group effect of the i th metabolite and the random error term of the ith metabolite, j-th group, and k-th sample is ∈ ijk . To generate the data, we considered, µ i ∼ uniform(5, 10) and ∈ ijk ∼ N(0, 1) . To measure the efficiency of the proposed technique, we created three types of metabolomics datasets: (i) without a class level in the samples, (ii) two class levels (two groups) in the samples, and (iii) three class levels (three groups) in the samples. In the case of two-and three-class level-based datasets, we also generated two types of metabolites: (a) equal concentration (EE) metabolites and (b) differential concentrations (DE) metabolites. DE metabolites were classified into two groups: upregulated and down-regulated metabolites. For up-concentrated metabolites, we used g ij ∼ N(0, 1) the healthy group and g ij ∼ N(2, 1) the disease group. Similarly, for down-regulated metabolites, we used g ij ∼ N(2, 1) the healthy group and g ij ∼ N(0, 1) the disease group. For EE metabolites, g ij ∼ N(0, 1) in both groups. We generated 200 metabolites and 90 samples for each dataset.
In two-and three-class datasets, we considered 80 metabolites as DE and 120 metabolites as EE. We generated 100 datasets for each type of dataset. We also incorporated various rates (5%, 10%, 15%, and 20%) of missing cells in the data matrix. Among the total missing values, 60% MAR and 40% for lower values. To investigate the efficiency of our proposed technique in the presence of outliers, we also included various rates (3%, 5%, 7%, and 10%) of outliers in the artificial datasets. In the i-th metabolite, we provided N(5*μ i , σ i 2 ) as outliers, where μ i and σ i 2 are the mean and variance of the i-th metabolite; these outliers were distributed randomly in the dataset; thus, outliers may occur anywhere in the dataset.

Real metabolomics data.
To measure the performance of our proposed missing imputation method, we first considered two publicly available fully defined real metabolomics data matrices. One is the Human Cachexia dataset 32 , collected from 1H -NMR profiles of urinary metabolites that are available in the R-specmine library. The other is the treated dataset 33 , which is also available in the R-metabolomics library. Since, these two data matrices did not contain any missing values, to investigate the efficiency of the proposed technique compared to the other techniques we randomly incorporated different rates (5%, 10%, 15% and 20%) of missing values and also computed the mean square error (MSE) between the reconstructed data and original data. We also considered two datasets: hepatocellular carcinoma (HCC) with 26.52% missing values/cells 34 and MDA-MB-231 breast cancer dataset with 15.81% missing values 35 to evaluate the performance of the proposed missing value imputation method. The HCC and MDA-MB-231 datasets were also modified by artificially including various rates (3%, 5%, 7%, and 10%) of outliers to investigate the performance of the proposed method. Outliers are distributed randomly and follow N(5*μ i , σ i 2 ), where μ i and σ i 2 are the mean and variance of the i-th metabolite, respectively.
x 1(g 1 +···+g k−1 +1) x 1(g 1 +···+g k−1 +2) · · · x 1(g 1 +···+g k ) x 2(g 1 +···+g k−1 +1) x 2(g 1 +···+g k−1 +2) · · · x 2(g 1 +···+g k ) . . . . . . . . . . . . www.nature.com/scientificreports/ Artificial data analysis results. In simulation studies, we first measured the performance of the proposed missing imputation technique compared to the other ten missing imputation methods (zero, mean, median, half of the minimum value, kNN, BPCA, PPCA, EM-PCA, RF imputations and rmiMAE) using the distance-based measurement. We computed the MSE between the original simulated dataset and the reconstructed missing imputed dataset in both the presence and absence of outliers. We generated three types of simulated metabolomics datasets and 100 datasets for each type and calculated the average MSE from 100 MSEs for each type of dataset for different rates of outliers (0%, 3%, 5%, 7%, and 10%) and different rates (5%, 10%, 15%, and 20%) of missing values. For the datasets with no class level in the samples, the results of the above calculation are shown in Fig. 1. Similarly, for two class levels (two groups) in the sample datasets and three class levels (three groups) in the sample datasets, the results of the aforementioned calculation are given in the Supplementary Information in Fig. S1 and Fig. S2. In the same way, a comparison of the performance of our proposed method with the recently developed techniques (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) using the datasets with no class level in the samples are given in the Supplementary Information in Fig. S3. In all these figures, the proposed missing value imputation technique produced lower average MSEs for various rates (0%, 3%, 5%, 7%, and 10%) of outliers, as well as for various rates (5%, 10%, 15%, and 20%) of missing values. Therefore, our missing imputation method was better than the other existing techniques. Second, we evaluated the performance of our developed KMI method using the misclassification error rate (MER), receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) through DE metabolite identification for two groups and three groups of datasets. To calculate the performance indices (MER, ROC curve, and AUC values), we identified the DE metabolites from the different reconstructed datasets (missing were www.nature.com/scientificreports/ imputed by different methods) using a t-test for the two class level dataset and analysis of variance (ANOVA) for the multiclass level dataset. Since the DE and EE metabolites were known in the simulated dataset, we computed the MER, ROC curve, and AUC for different missing imputed datasets in both the absence and presence of various rates of outliers. The above calculation procedures are provided in Supplementary Information in Fig. S4. The ROC curve of DE calculation for two-class datasets with 5% missing data and various rates of outliers are depicted in Fig. 2. Similarly, for three classes of simulated datasets, the ROC curve of the DE calculation is also shown in the Supplementary Information in Fig. S5. Similarly, for 10%, 15%, and 20% missing values, the ROC curves are given in the Supplementary Information (Fig. S6-S11). In addition, Table 1 presents the MER and AUC values of the DE calculation for two-class datasets with 5% missing as well as various rates of outliers. Moreover, for the two classes of datasets with 5% missing as well as various rates of outliers, the MER and AUC values of DE identification are also presented in the Supplementary Information in Table S1. Similarly, for   Fig. S5 to S11, and Tables S1 to S7 show that the proposed missing imputation method produced lower average MER and higher average AUC values for different rates (5%, 10%, 15%, and 20%) of missing values and various rates (0%, 3%, 5%, 7%, and 10%) of outliers. Therefore, the proposed KMI technique was better than the existing missing value imputation techniques. Finally, we measured the performance of our proposed KMI technique through sample classification using only DE metabolites. Although taking only the differentially expressed variables may give over-optimistic values for prediction performance, however, to increase accuracy, it is often used as the feature selection approach. To overcome this problem we used the cross-validation approach. The performance measure calculation procedure of different imputation methods based on sample classification (using a SVM classifier) is given in the Supplementary Information in Fig. S12. The ROC curve based on sample classification using a test dataset for two-class simulated datasets with 5% and 10% missing values and various rates (3%, 5%, 7%, and 10%) of outliers are presented in Fig. 3 and Supplementary Fig. S13, respectively. Figure 3 and Fig. S13 show that our proposed KMI technique gave a higher average true positive rate at any point of average false positive rate compared to the other missing imputation methods in the presence of different rates of outliers (3%, 5%, 7%, and 10%). We also computed the average MER and AUC in the appearance of 5% missing data as well as the different percentages of outliers using two-and three class level datasets, which are presented in Tables 2 and 3. Similarly, for 10% missing data, as well as different percentages of outliers using two-and three class level datasets, the average MER and AUC are given in Supplementary Tables S8 and S9. Tables 2 and 3 show that the proposed KMI technique produced lower average MER and higher average AUC values at various rates of missing values and different rates of outliers for two-and three class level simulated metabolomics data. Therefore, in simulation studies, our proposed KMI technique was better than the existing missing value imputation methods.
Real data analysis results. Here, we used four real metabolomics datasets to evaluate the efficiency of our newly developed KMI technique compared to other missing imputation methods for real data analysis. Since the Human Cachexia and treated datasets are fully defined, to explore the performance of our proposed technique we artificially incorporated various percentage of missing values (5%, 10%, 15% and 20%) and reconstructed the data matrix using several missing value imputation methods including the proposed one. We measured the MSE between the original and reconstructed datasets. We also repeated the aforementioned calculation 100 times and  (Fig. 4a) and the treated dataset (Fig. 4b). Therefore, our proposed imputation method displayed comparatively better performance than the other ten conventional missing value imputation methods. Moreover, we conducted a comparative study of the efficiency of our proposed missing imputation technique and five recently developed techniques (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) using  Fig. S14. We also measured the competency of our proposed KMI technique using MER and AUC of sample classification for both the two-class hepatocellular carcinoma dataset and the three-class MDA-MB-231 dataset. To evaluate the performance of all well-known missing value imputation methods in the presence of outliers, we modified both datasets by artificially incorporating different rates of outliers (3%, 5%, 7%, and 10%). The performance measure calculation procedure for different missing imputation techniques is shown in Fig. 5. The calculation of performance measures (MER and AUC) using the HCC dataset and the MDA-MB-231 dataset are shown in Tables 4 and 5, respectively. The data indicated that our proposed KMI technique produced a lower average MER and higher AUC values compared to other missing imputation methods in the appearance of various rates of outliers. Therefore, both simulation studies and real data analysis showed that our proposed missing value imputation method performed better than the existing missing value imputation methods.

Discussion
We examined the performance of each missing imputation technique by optimising the parameter settings using a trial-and-error basis to avoid biased comparisons. For example, in the case of kNN imputation, we chose k, for which the MSE and MER were smaller and the accuracy was maximum. The performance of different missing imputation techniques may depend on the structure and the value/intensity of data. Therefore, we presently generated three types of simulated metabolomics datasets and 100 datasets for each type and calculated the average MSE from 100 MSEs for each type of dataset at different rates of outliers (0%, 3%, 5%, 7%, and 10%) and different rates (5%, 10%, 15%, and 20%) of missing values. MAR may occur at any position in the data matrix, thus, we generated 100 modified real datasets, including different MAR positions of the data matrix, to measure the performance of different missing imputation techniques. To compute the performance of various missing imputation methods through MER and AUC using the classification technique, we divided the dataset into two parts: the test dataset and the training dataset. To reduce the sampling error during the calculation of MER and AUC, we generated 100 training datasets and 100 test datasets for each case and computed the average MER and AUC for measuring the performance of different missing imputation methods. The detailed calculation procedure of different performance measures calculated by different missing imputation methods for the artificial dataset is shown in the Supplementary Information in Fig. S4 and Fig. S12. As well, information for the artificial dataset is presented in Fig. 5. We calculated the execution time (speed of execution) for different methods, including the proposed method, for different numbers of metabolites and samples (Supplementary Information Table S10). The URL of the R package and the user manual of our proposed method are https:// github. com/ Nishi thPaul/ tWLSA.

Conclusion
The Selection of the missing imputation method affects consecutive metabolomics data analysis. Moreover, metabolomics data generated from different platforms often contain missing values and outliers. Thus, in this study, we developed a new outlier-robust kernel-weight-based two-way alternating weighted least square approach for imputing missing values. We also measured the performance of our proposed KMI technique compared to the existing conventional methods (zero, mean, median, half of the minimum value, kNN, BPCA, PPCA, EM-PCA, and RF imputations) and recently developed missing imputation methods (GSimp, QRILC, BayesMetab, rmiMAE, and MICE) through both artificial and real metabolomics data analysis. Based on our computational results, the presently developed missing value imputation method is better than the existing missing value imputation methods in both the absence and presence of outliers. For this reason, our recommendation is to apply our proposed two-way kernel weighted least square-based missing value imputation method instead of existing missing imputation methods to substitute the missing values in metabolomics datasets for consecutive univariate, multivariate, and exploratory metabolomics data analysis.

Data availability
Comparisons were evaluated using R language. To identify the DE metabolites in R we used 't.test' function (for two groups) and 'anova' function (for more than two groups) from "stats" package. "ROCR" and "pROC" packages have been used to draw receiver operating characteristic (ROC) curve as well as to calculate the area under the ROC curve. We also used the support vector machine 'svm' function from "e1071" package to calculate the misclassification error rate (MER) for sample classification. Moreover, The source code and packages of different missing imputation techniques are given below, Proposed: R package and the user manual of our proposed method are available at https:// github. com/ Nishi thPaul/ tWLSA. rmiMAE: The R code for rmiMAE is available at https:// github. com/ Nishi thPaul/ missi ngImp utati on/ blob/ main/ rmiMAE.R. GSimp: The R code for GSimp is available at https:// github. com/ Wande Rum/ GSimp. kNN: Here, kNN imputation technique has been implemented using "impute" package from bioconductor. The reference manual of this package can be found www.nature.com/scientificreports/ at https:// www. bioco nduct or. org/ packa ges/ relea se/ bioc/ manua ls/ impute/ man/ impute. pdf. EM-PCA: We used "missMDA" package in R to impute missing values by EM-PCA method. The reference manual of this package can be found at https:// cran.r-proje ct. org/ web/ packa ges/ missM DA/ missM DA. pdf. RF: "missForest" package has been used to impute the missing values by Random Forest (RF) method. The reference manual of this package can be found at https:// cran.r-proje ct. org/ web/ packa ges/ missF orest/ missF orest. pdf. BPCA and PPCA: BPCA and PPCA imputation techniques have been implemented using "pcaMethods" package from bioconductor. The reference manual is available at https:// www. bioco nduct or. org/ packa ges/ relea se/ bioc/ manua ls/ pcaMe thods/ man/ pcaMe thods. pdf. QRILC: Here QRILC imputation technique has been implemented using "imputeLCMD" package in R. The reference manual can be found at https:// cran.r-proje ct. org/ web/ packa ges/ imput eLCMD/ imput eLCMD. pdf. BayesMetab: R code for BayesMetab method is available at https:// bmcbi oinfo rmati cs. biome dcent ral. com/ artic les/ 10. 1186/ s12859-019-3250-2. MICE: Here, "missCompare" package in R has been used to impute the missing values by MICE method. The reference manual of the package can be found at https:// cran.r-proje ct. org/ web/ packa ges/ missC ompare/ missC ompare. pdf Received: 4 January 2021; Accepted: 13 May 2021