MAP: model-based analysis of proteomic data to detect proteins with significant abundance changes

Isotope-labeling-based mass spectrometry (MS) is widely used in quantitative proteomic studies. With this technique, the relative abundance of thousands of proteins can be efficiently profiled in parallel, greatly facilitating the detection of proteins differentially expressed across samples. However, this task remains computationally challenging. Here we present a new approach, termed Model-based Analysis of Proteomic data (MAP), for this task. Unlike many existing methods, MAP does not require technical replicates to model technical and systematic errors, and instead utilizes a novel step-by-step regression analysis to directly assess the significance of observed protein abundance changes. We applied MAP to compare the proteomic profiles of undifferentiated and differentiated mouse embryonic stem cells (mESCs), and found it has superior performance compared with existing tools in detecting proteins differentially expressed during mESC differentiation. A web-based application of MAP is provided for online data processing at http://bioinfo.sibs.ac.cn/shaolab/MAP.


Supplementary note 1: A comparison of different strategies of protein intensity normalization
To start a unbiased comparison of two proteomic profiles, the protein intensities in two profiles should first be normalized to make them comparable. In MAP, we proposed a new strategy of protein intensity normalization based on the "trimmed" total intensity of each profile, which was calculated as the sum of protein intensities over proteins that were not identified as outliers in both profiles. In previous studies, usually the total intensity of all detected proteins was used as a reference for global normalization [1,2]. However, it has been noted in RNA-seq data analysis that genes highly expressed in only one experimental condition may cause sampling artifact and thus could introduce obvious bias to the following comparisons [3]. For example, hemoglobin genes are dramatically upregulated during mammalian erythropoiesis and may comprise a considerable portion of the total transcriptome and proteome at late stages [4], which should be taken into account in the data analysis. Thus, in MAP we label the proteins with extremely high intensities as outliers and exclude them from calculating normalization factors. To test this new strategy, we compared the ratios of protein intensities normalized by the total intensity and by the "trimmed" total intensity of each profile. Similar to what was observed with RNA-seq data [3], the distribution of the log2-ratios of protein intensities normalized by the total intensity of each profile obviously deviated from zero, and this bias was largely eliminated by using the "trimmed" total intensity instead ( Supplementary Fig. S1b).
Next, we further incorporated the ribosome profiling data of undifferentiated and differentiated mESCs to evaluate the performance of two protein intensity normalization strategies. Inspired by the analysis presented in a previous study [5], we picked out the genes that only showed weak translation changes during mESC differentiation from ribosome profiling data (genes with |log2-ratio|<0.2). For these genes, the distribution of their log2-ratios of protein intensities normalized by trimmed total intensity did not show significant deviation from 0 in all three runs ( Supplementary   Fig. S3c). However, the distribution of their log2-ratios of protein intensities normalized by total intensity showed a significant deviation from 0 in all three runs ( Supplementary Fig. S3c), which is not consistent with the observation from ribosome profiling data for these genes. Then, we separately applied MAP to the protein intensities normalized by two strategies. Interestingly, the top proteins up-regulated in differentiated mESCs detected by using the protein intensities normalized by trimmed total intensity showed obviously higher consistency scores compared to those detected by total intensity-based normalization ( Supplementary Fig. S3d). On the other hand, the top proteins up-regulated in undifferentiated mESCs based on these two normalization strategies did not show clear difference in their consistency scores. These findings indicate that normalization of protein intensities simply based on the total intensity may introduce an apparent bias, which could clearly reduce the accuracy of the detection of differentially expressed proteins.

Supplementary note 2: Discussion of using the Z-statistic defined in MAP to describe protein expression change
In many previous studies, it has been shown that the contribution of technical and systematical errors to the log2-ratio of protein intensities observed for each protein depends on the intensity level of this protein [6,7]. Thus, the raw protein ratios observed in different mass spectrometry (MS) runs may not be directly comparable, since a considerable difference often can be observed between the intensity levels of each protein detected in different runs ( Supplementary Fig. S1c).
We compared the log2-ratios of protein intensities between the proteomic profiles of undifferentiated and differentiated mESCs generated in different MS runs, and found they showed a low correlation with each other (Supplementary Fig. S2d). On the other hand, the contribution of technical and systematical errors to the Z-statistic defined in MAP can be assumed to follow standard normal distribution. Thus, it should be more comparable across different runs than the raw log-ratios of protein intensities. To support this hypothesis, we also calculated the correlation between the Z-statistics of each protein across different runs, and found they have a much better correlation ( Supplementary Fig. S2e). Thus, the new Z-statistic could be better used to represent protein intensity changes.
Inspired by this finding, we defined the average Z-statistic for each protein to represent its overall intensity change across all comparisons, which is similar to the Z-score defined in Stouffer's Z test [8]. Here we use Lef1 as an example, which has been validated to be differentially expressed at protein level during mESC differentiation by western blot [1]. It can be seen that although Lef1 was only detected to have significant expression change in one MS run (P-value=0.005 without correcting for multiple testing), its average Z-statistic over all three runs is still quite significant (P-value=0.003 after correcting for multiple testing), as it showed a consistent expression change in all three runs (Fig. S4a). We then tried using the average Z-statistics of proteins to identify proteins differentially expressed between undifferentiated and differentiated mESCs, and found the top differentially expressed proteins identified by this method could also achieve a much improved consistency score compared to those selected simply by the best Pvalue of three runs (Fig. S4b).

Supplementary note 3: Discussion of merging the parallel technical replicates generated in each MS run
In this study, proteomic profiling of undifferentiated and differentiated mESCs was performed using conventional 4-channel iTRAQ technique. In each MS run, four proteomic profiles were generated, including two parallel technical replicates from iTRAQ channel 114 and 115 for undifferentiated mESCs as well as two parallel technical replicates for differentiated mESCs from channel 116 and 117. Before carrying out the analysis shown in main text, the iTRAQ intensities from channel 114 and 115 were added to generate the proteomic profile of undifferentiated mESCs, and those from channel 116 and 117 were also combined as the proteomic profile of differentiated mESCs. Then, the two combined proteomic profiles of undifferentiated and differentiated mESCs were used as input of MAP model to detect differentially expressed proteins between them. Meanwhile, we additionally generated two combined proteomic profiles, by averaging the iTRAQ intensities from channel 114 and 116 as well as those from channel 115 and 117, respectively, as two technical replicates used for model fitting using the method proposed in Zhang et al.
To illustrate the necessity of this procedure, we repeated the differential protein expression analysis between undifferentiated and differentiated mESCs using the original four proteomic profiles directly obtained from channel 114-117 in each of three MS runs. In this analysis, the proteomic profile of undifferentiated mESCs from channel 114 was compared with the profile of differentiated mESCs from channel 116 using MAP model, and the proteomic profile from channel 115 was compared with the one from channel 117, respectively. Then, we systematically analyzed the protein expression changes detected from these comparisons. Interestingly, the correlation of the Z-statistics derived from different runs in this analysis (Pearson Correlation Coefficients shown in the white boxes of Supplementary Fig. S2f), was found to be clearly lower than that obtained from the combined proteomic profiles ( Supplementary Fig. S2e). This finding indicates that combining the iTRAQ intensities from parallel technical replicates generated in the same MS run could suppress the impact of instrumental errors, and thus help to improve the reproducibility of downstream analysis.

Supplementary note 4: Comparing the performance of MAP and other existing tools in differential protein expression analysis based on benchmark differentially expressed proteins
Here, we additionally tried defining benchmark differentially expressed proteins (DEPs) between undifferentiated and differentiated mESCs and using them to evaluate the performance of different methods in differential protein expression analysis. To be fair to all three methods being compared, i.e. MAP, the method proposed in Zhang et al. and MaxQuant, we defined the benchmark up/down-regulated proteins during mESC differentiation as those that were identified as significantly up/down-regulated proteins (using P-value<0.1 as cutoff) by all three methods in at least two of the three runs, respectively. In this way, we got 206 up-regulated and 110 downregulated proteins as benchmark DEPs. Then, we checked whether these benchmark DEPs could be efficiently recalled from the comparison of proteomic profiles generated in each single run using the three methods. It can be seen that in all three runs, MAP recalled more benchmark DEPs than the other two methods (Supplementary Fig. S4a-b), indicating it has a better sensitivity in detection of truly differentially expressed proteins.
To fully assess the sensitivity and specificity of three methods, we further examined whether the significance level of each protein's intensity change derived by them from the proteomic data generated in each single run can be used to accurately distinguish the benchmark DEPs from other proteins. It should be note that, the benchmark DEPs were defined in a stringent way. Thus, it's not surprising to find that all three methods achieved a high accuracy in separating benchmark DEPs from the others, with an area under the receiver operating characteristic (ROC) curve (AUC) above 0.9 in all three runs (Fig. S4d). However, we still observed that MAP achieved a higher AUC score than the other two methods in all three runs (Fig. S4d). To test the significance of this difference, we performed 10,000 times of random sampling of non-DEPs and repeated the above analysis. At each time, the benchmark DEPs were matched with 2000 randomly selected other proteins detected in each run, and the AUC score was then recalculated for each of three methods.
Again, MAP was found to have a better AUC score compared to the other two methods in classifying the benchmark DEPs against each of the 10,000 sets of randomly selected other proteins ( Supplementary Fig. S4c). Taken together, these findings strongly support a favorable performance of our MAP model in differential analysis of iTRAQ proteomic data.

Supplementary note 5: Permutation-based analysis for false discovery rate (FDR) estimation
Here we first describe how we estimated FDR for the second-best P-value of each protein, e.g. 2 for protein , obtained from the comparisons of proteomic data of undifferentiated and differentiated mESCs generated in all three runs, which represents the type-I error rate of the differentially expressed proteins defined by using 2 as cutoff. In previous studies, several permutation-based methods have been proposed for FDR estimation in differential analysis of gene expression microarray data [9,10]. The key strategy of these methods is that, when estimating FDR for the differentially expressed genes defined by a specific significance cutoff, it's better to include only the non-differentially expressed genes in building the null distribution of test statistic using random permutations. To provide a fast estimation of FDRs for thousands of topranked second-best P-values, we chose to use a fixed set of non-differentially expressed proteins (non-DEPs) using a specific cutoff, which were defined as proteins with the second-best P-values above 0.01, and then performed 10,000 times of random permutations. At each time of random permutation, the Z-statistics of these non-DEPs detected in each run were randomly permuted, and then the second-best P-value of each non-DEP was re-calculated. Finally, the estimated FDR for 2 was calculated as To test the effectiveness of this approach, we borrowed the second-best P-values ̃2 derived from the comparisons of technical replicates generated in three runs using MAP to estimate an FDR for each 2 , which was calculated as # ( 2 ) = ℎ ̃2 ≤ 2 ℎ 2 ≤ 2 , and used it as a reference to assess the FDR estimated using the permutation-based approach.
We first compared the FDRs estimated from technical replicates with those estimated by random permutations without excluding DEPs, and observed a clear overestimation of FDRs by the permutation-based approach ( Supplementary Fig. S5a). Then, by excluding DEPs from random permutations, we found the overestimation of FDRs was greatly reduced, indicating a favorable performance by this approach (Fig. S4c). As the result, it can be obviously seen that the topranked DEPs defined based on the second-best P-values have a low FDR ( Supplementary Fig.   S5b).
Next, we describe how the FDR for the average Z-statistic of each protein, e.g. Z for protein , was estimated, which represents the type-I error rate of differentially expressed proteins defined by using the absolute value of Z , i.e. |Z |, as cutoff. Similarly, to have a fast estimation of FDRs, in the random permutation analysis we only used a fixed set of non-DEPs, which were defined as proteins with adjusted P-values derived from the average Z-statistics above 0.01 here. At each of 10,000 times of random permutations, the Z-statistics of these non-DEPs detected in each run were randomly permuted, and then the average Z-statistic of each non-DEP was re-calculated.
Finally, the estimated FDR for Z was calculated as Again, to test the effectiveness of this approach, we borrowed the average Z-statistics Z derived from the comparisons of technical replicates to estimate an FDR for each Z , which was calculated as In this way, it can be found that the top-ranked DEPs selected based on average Z-statistics was also associated with a low FDR ( Supplementary Fig. S5c), and the FDRs estimated by random permutations generally were supported by those estimated from technical replicates ( Supplementary Fig. S5d). Additionally, it is worth to mention that the FDR estimated for the top DEPs ranked by second-best P-values agreed well with that estimated for the same number of top-ranked DEPs selected by average Z-statistics ( Supplementary Fig. S5b-c), although they were derived from independent permutation analysis and were calculated using different formula.
Since the vast majority of top-ranked DEPs selected by the second-best P-values were also identified as top DEPs based on average Z-statistics (for example, 393 proteins were found to have second-best P-values<0.01, and 380 of them were also found to have adjusted P-values calculated from average Z-statistics lower than 0.01; see Supplementary Table S1 for details), we think this observation is quite reasonable. It suggests that controlling the FDR level, if it's properly estimated, could be a practical and robust way to determine the cutoff of statistical significance in differential expression analysis, as indicated by quite a number of previous studies [11].

Supplementary note 6: The extended MAP (eMAP) model for simultaneously comparing multiple proteomic profiles
It's often needed to perform differential protein expression analysis across multiple cellular contexts. As a natural extension of our original MAP model for pairwise comparisons, we additionally developed a new statistical framework, termed extended MAP (eMAP) model, for simultaneously comparing multiple proteomic profiles generated in the same MS run and directly detecting proteins/peptides with significant intensity changes across these profiles.
Similar to traditional Analysis of Variance (ANOVA), in eMAP we assume that, for every detected protein i, it has equal abundance/expression level among the samples being compared, which is used as the null hypothesis. Thus, the log2-transformed intensities of this protein in the N input proteomic profiles can be considered as drawn from the same normal distribution ( , 2 ). Here indicates the expected intensity level of this protein and 2 is used to represent the impact of technical and systematic errors (or noise), and is empirically modeled as a function of its intensity level, which is the global variance function that needs to be inferred from input data.
To directly build global variance function from the proteomic profiles being compared, we adopt a similar step-by-step regression procedure as that used in our original MAP model. Here Technically, eMAP first plots the sample variance of each protein's log2-intensities against its mean log2-intensity, which we called VA plot, and this plot is then scanned by a sliding window of size proteins moving from left to right ( = 400 and step size=100 by default). In this way, the whole VA plot is covered by a series of windows, which are used to first derive the local approximations of global variance function. Again, eMAP introduces an assumption that proteins falling in each window have similar intensity levels, and thus the impact of technical and systematic errors to their log2 intensities can be approximately quantified by the same variance parameter σ 2 . To estimate the σ 2 for each window, the observed sample variance of all the proteins in this window are ordered from smallest to largest, and then plotted against the corresponding theoretical quantiles ̂ of Chi-square distribution , and the slope is used to as an estimation of the variance parameter σ 2 for this window. When the sliding window finishes scanning the VA plot, non-linear least-square regression is applied to fit a exponential function between the variance σ 2 estimated for each window and the average of the mean log2-intensities over proteins in this window that were used for linear regression as , which is then used to infer the 2 for every detected protein by 2 = exp( 1 + 2 • ̅ ) + 3 .
Finally, a Chi-square statistic is defined for each protein by using the 2 inferred for it to rescale the observed sample variance of its log2 intensities as , which is expected to follow Chi-square distribution , where Γ( , ) denotes the gamma distribution with shape parameter and inverse scale parameter . Next, a P-value is calculated for the average Chi-square statistic of protein as , which will be further adjusted using the Benjamini-Hochberg approach for multiple testing. It should be of note that, all the P-values define in this section are one-tailed P-values, which represent the likelihood of observing an equal or greater sample variance of log2 protein intensities compared to that expected under the null hypothesis.
To test the validity of eMAP model, we applied it to compared the original four proteomic profiles of undifferentiated and differentiated mESCs generated in each of the three runs. Again, here we use the comparison of four proteomic profiles generated in the first MS run to illustrate the workflow of eMAP. First, we generated a VA plot by plotting the sample variance of each protein's log2-intensities against its mean log2-intensity ( Supplementary Fig. S6a), and then used a sliding window moving from left to right to scan the VA plot. Of note, proteins in the window should be considered as a mixture of differentially and non-differentially expressed proteins. To directly infer the impact of technical and systematic errors from the non-differentially expressed proteins in the mixture, the observed sample variances of all proteins in this window were ordered from smallest to highest and then plotted against the corresponding theoretical quantiles of the Chi-square distribution 3 2 divided by 3. It can be seen that the ordered sample variances located on the left side, which could be assumed to be mainly associated with non-differentially expressed proteins, exhibited a strong linear relationship with the corresponding theoretical quantiles (right panel of Supplementary Fig. S6a). Then, an ordinary least-square linear regression analysis was applied between the smallest W of the sample variances and the corresponding theoretical quantiles (we chose W=30% here, as it's difficult to expect a very large fraction of the detected proteins are non-differentially expressed between any possible pair of four input proteomic profiles), and a liner model with very high coefficient of determination was derived (right panel of Supplementary Fig. S6a). The slope σ 2 of the fitted linear model can be used to as a local approximation of the global variance function for proteins covered by this window.
As the sliding window moving in a stepwise manner from left to right, the whole VA plot was covered by a series of windows. For each window, the same linear regression analysis was repeatedly applied to estimate the σ 2 as a local approximation of the global variance function.
Remarkably, the coefficient of determination R 2 was found to be higher than 0.95 for all linear regressions, strongly supporting the validity of our approach. Next, the σ 2 inferred for each window was plotted against the mean log2-protein intensity averaged over proteins falling in this window, and an exponential function was fitted between the variance and the mean with coefficient of determination R 2 =0.994 ( Supplementary Fig. S6b). This exponential function is right the global variance function we want to infer for the four profiles under comparison. Finally, a Pvalue was calculated for each protein to describe the significance of the variation of its intensities across the four input proteomic profiles ( Supplementary Fig. S6c).
Similar to what was done in MAP, an additional analysis was also carried in eMAP to summarize all the local linear regressions. In this analysis, the ordered sample variances in each window were rescaled by the σ 2 inferred for this window. Then, the mean and standard deviation of the rescaled sample variances across all windows were plotted according to their order against the corresponding theoretical quantiles of Chi-square distribution 3 2 divided by 3 ( Supplementary   Fig. S6d). This analysis could be used to inspect whether the parameter W needs to be further adjusted. Here we found that, after rescaling, the smallest 30% of the sample variances were found to match with the line y=x very well. Meanwhile, the ordered sample variances on the right side obviously deviated from line y=x, which suggests a considerable proportion of them are difficult to be explained by the null hypothesis and may actually be associated with proteins differentially expressed between undifferentiated and differentiated mESCs. Thus, they should not be included in inferring the local and global variance function.
After separately applying eMAP to compare the proteomic profiles of undifferentiated and differentiated mESCs generated in each of three runs, we collected the output from all three runs, and identified proteins with significantly elevated variances of protein intensities using P-values calculated based on the average Chi-square statistic of each protein over three runs after adjusting for multiple testing, which we called hyper-variable proteins (HVPs). In total, we obtained 383 HVPs with adjusted P-values lower than 0.01 (Supplementary Table S2), the vast majority of which were also identified as differentially expressed proteins (DEPs) between undifferentiated and differentiated mESCs by using MAP model ( Supplementary Fig. S6e), indicating a high consistency between these two analyses. Furthermore, by incorporating the ribosomal profiling data of undifferentiated and differentiated mESCs, we found that the RNA transcripts from genes associated with the 383 HVPs showed a significantly higher change of ribosome occupancy during the differentiation of mESCs than the other genes detected in ribosomal profiling, indicating these genes tend to be differentially translated ( Supplementary Fig. S6f). Taken together, these findings indicate that our eMAP model can be used to statistically model the variations of protein intensities across multiple proteomic profiles and reliably identify proteins with significant expression changes among the profiles being compared.  Here the pairwise PCCs may either be calculated between two separate comparisons using proteomic profiles from the same MS run but different channels, or using those from different runs.    (a) False discovery rate (FDR) estimated for the second best P-value of each protein using the permutation-based approach without excluding the differentially expressed proteins were plotted against that estimated from the comparisons of technical replicates for this second best P-value.

Reference
Here FDRs associated with the benchmark DEPs defined previously were explicitly indicated.