Differential Abundance Analysis with Bayes Shrinkage Estimation of Variance (DASEV) for Zero-Inflated Proteomic and Metabolomic Data

Mass spectrometry (MS) is frequently used for proteomic and metabolomic profiling of biological samples. Data obtained by MS are often zero-inflated. Those zero values are called point mass values (PMVs). Zero values can be further grouped into biological PMVs and technical PMVs. The former type is caused by true absence of a compound and the later type is caused by a technical detection limit. Methods based on a mixture model have been developed to separate the two types of zeros and to perform differential abundance analysis comparing proteomic/metabolomic profiles between different groups of subjects. However, we notice that those methods may give unstable estimate of the model variance, and thus lead to false positive and false negative results when the number of non-zero values is small. In this paper, we propose a new differential abundance analysis method, DASEV, which uses an empirical Bayes shrinkage method to more robustly estimate the variance and enhance the accuracy of differential abundance analysis. Simulation studies and real data analysis show that DASEV substantially improves parameter estimation of the mixture model and outperforms current methods in identifying differentially abundant features.


Calculation of Hyperparameters
We assume σ 2 k follows an inverse-gamma distribution with shape parameter d 0 /2 and scale parameter d 0 s 2 0 /2.
The mean and variance of the inverse-gamma distribution are .
Let m and v be the sample mean and smaple variance, respectively. Based on the method of moments, s 0 and d 0 can be calculated as: Scenario LFC OR TPMV Ratio Results 1 log(2) 1 Random Main text Figure 1 to 4. Supplementary Figures S2, S3 S4, S7, S8, S14, and S15.  Table S1: A list of simulation scenarios. LFC: log fold change between control and case group mean abundance of non-BPMVs. OR: odds ratio between control and case group BPMV proportions. TPMV: TPMV proportions. TPMV proportions were obtained based on randomly selected BPMV proportions, mean abundance, detect limit, and variance. Ratio: ratio of dissonant vs consonant features.

Additional Simulation Results
We performed additional simulation studies to more comprehensively assess the performance of DASEV under various practical situations. A list of the six simulation scenarios considered in this paper is provided in Table S1. Under the rst scenario, other than testing for H M 0 that is presented in the main text, we also performed test for H B 0 . We considered both smaller (n= 10 or 20 per group) and larger (n= 100 or 200 per group) sample sizes, and compared DASEV to two additional parametric methods, accelerated failure time model (AFT) [1] and two part t-test (2T) [2,3]. The AFT method assumes all PMVs are TPMVs and characterizes the data by a log-normal distribution with a detection limit. The 2T method integrates a chi-square test to compare PMV proportion and a t-test to compare log-transformed non-PMVs. DASEV had better performance than AFT and 2T under both larger and smaller sample size situations (Supplementary Figure S8). DASEV held the highest TPR among the very top-ranked features. As the number of top-ranked features increased, 2T obtained similar performance. AFT's performance did not improve much by increasing the sample size. All other methods performed much better with larger sample size. Although AFT held lower observed FDR for sample size 100 and 200 per group, the counts of TPs were much smaller than other methods. DASEV obtained reasonable counts of TPs and observed FDR at all three thresholds.
The second scenario aimed to evaluate DASEV's performance in testing difference in BPMV proportions. We considered hypothesis testing for both H P 0 and H B 0 . DASEV outperformed TLK in testing H P 0 (Supplementary Figure  S5). DASEV was able to identify more truly dierentially abundant features than TLK. The observed FDR from DASEV was smaller than the threshold while the observed FDR from TLK was much larger than the threshold for the sample size of 100. As the sample size increased to 200, the observed FDR for both methods became closer to the thresholds. For testing H B 0 , AFT yielded the highest TPR curve (Supplementary Figure S9). AFT also resulted the best observed FDR and counts of TPs. However, when we examined the estimation of log fold change among dierentially abundant features, the estimation from AFT had a much larger deviation from the true value compared to the other methods (Supplementary Figure S16). In contrast, among features which were not dierentially abundant, all four methods showed similar deviation. Therefore, the dierence in zero proportion between groups were mis-characterized as dierence in the mean of lognormal distribution by AFT since it considered all zero values as left censored from a lognormal distribution. Thus, although AFT appeared to have high performance in identifying dierentially abundant features, the inference from AFT was unreliable under such situations. Compare to AFT, DASEV obtained better estimation on group means. In addition, DASEV performed better than TLK and 2T.
Our third to fth scenarios considered the situation where both BPMV proportions and non-BPMV mean abundance are dierent between groups. In reality, dierentially abundant features can be either dissonant (lower BPMV proportion with lower mean for non-BPMVs) or consonant (higher BPMV proportion with lower mean for non-BPMVs). Therefore, we considered the following three scenarios: equal amount of dissonant and consonant features (the third scenario), more dissonant features (the fourth scenario), and more consonant features (the fth scenario). For each scenario, 10% features were randomly selected as dierentially abundant features. In the third scenario, we set a ratio of 1:1 for dierentially abundant features to have higher versus lower mean abundance in the case group than in the control group. We also set a ratio of 1:1 for those features to have higher versus lower BPMV proportion in case group compared to the control group. In the fourth scenario, we set both ratios to 9:1. In the fth scenario, we set the ratio for dierentially abundant features to have higher versus lower mean abundance in the case group to 9:1, and the ratio for those features to have higher versus lower BPMV proportion in case group to 1:9. As a result, the ratios between dissonant and consonant features were 1:1 , 82:18, and 18:82 for those three scenarios, respectively. Simulation results for testing H B 0 are presented in Supplementary Figure S10 to S12. All methods performed better when there were more consonant features (Supplementary Figures S11 and S12). TPR curves were generally lower if the data contained more dissonant features. The observed FDRs were similar for scenarios with more dissonant or more consonant features. However, counts of TPs were much higher in the later scenario for sample size 100 and 200 per group. DASEV outperformed other methods in most situations (Supplementary Figure S10 to S12). The only exception was that for the fth scenario with small sample size of 10 or 20 per group (Supplementary Figure S12), AFT performed slightly better than DASEV. This again is likely due to mis-characterizing dierence in BPMV proportions as dierence in non-BPMV means, which happened to increase the dierence in non-BPMV means for consonant features. Therefore, for a dataset dominated by consonant features as we showed in the fth scenario, AFT had higher power. However, for dissonant features, the dierence in non-BPMV means caused by mis-characterization of dierence in BPMV proportions were in the opposite direction to the true dierence in non-BPMV means so that those two would cancel out. As shown by the fourth scenario (Supplementary Figure S11), when there were more dissonant features, AFT had the lowest performance compared to other methods. We also explored numbers of consonant and dissonant features identify by each method when equal amount of those features existed (the third scenario). AFT identied more consonant features and the other three methods resulted similar amount of consonant and dissonant features (results not shown).
Our sixth simulation scenario was designed to examine methods' performance while no TPMVs were present. DASEV outperformed all other methods (Supplementary Figure S13). The results show a similar pattern to what we obtained for the rst simulation scenario (Supplementary Figure S8). DASEV had the highest TPR among the very top-ranked features. As the number of top-ranked features increased, 2T obtained similar performance as DASEV. All methods other than AFT performed much better with larger sample size. Although AFT held lower observed FDR for sample size 100 and 200 per group, The counts of TPs were much smaller than other methods. DASEV obtained reasonable number of TPs and observed FDR at all three FDR thresholds.