Complementary Post Transcriptional Regulatory Information is Detected by PUNCH-P and Ribosome Profiling

Two novel approaches were recently suggested for genome-wide identification of protein aspects synthesized at a given time. Ribo-Seq is based on sequencing all the ribosome protected mRNA fragments in a cell, while PUNCH-P is based on mass-spectrometric analysis of only newly synthesized proteins. Here we describe the first Ribo-Seq/PUNCH-P comparison via the analysis of mammalian cells during the cell-cycle for detecting relevant differentially expressed genes between G1 and M phase. Our analyses suggest that the two approaches significantly overlap with each other. However, we demonstrate that there are biologically meaningful proteins/genes that can be detected to be post-transcriptionally regulated during the mammalian cell cycle only by each of the approaches, or their consolidation. Such gene sets are enriched with proteins known to be related to intra-cellular signalling pathways such as central cell cycle processes, central gene expression regulation processes, processes related to chromosome segregation, DNA damage, and replication, that are post-transcriptionally regulated during the mammalian cell cycle. Moreover, we show that combining the approaches better predicts steady state changes in protein abundance. The results reported here support the conjecture that for gaining a full post-transcriptional regulation picture one should integrate the two approaches.


Supplementary Information Contents
The supplementary is organized as follows and contains the following information: 2 Methods

FACS analysis of the synchronized cells
The figure below includes the cell count and DNA content for the G1 and M conditions, demonstrating that the cell cycle arrest was efficient.

Ribosomal profiling experiment replicates
Two ribosomal profiling experiments, which measured mRNA levels in parallel, were conducted, one with 3 replicates (rep1, rep2, rep3), and one with 1 replicate (rep4). As can be seen in Figure 1 and 2, the Spearman correlation between the replicates is significantly high, and the results of all analyses performed in the paper are robust to utilizing individual replicates .

Determining differentially expressed genes from Ribo-Seq
Differentially expressed genes between M and G1 are calculated according to [2], a method called DESeq based on the negative binomial distribution, with variance and mean linked by local regression. Briefly, Anders et al. [2] devised a statistical test to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation. If reads were independently sampled from a population with given, fixed fractions of genes, the read counts would follow a multinomial distribution, which can be approximated by the Poisson distribution [3,4]. However, it has been noted that the assumption of Poisson distribution is too restrictive [5,6]: it predicts smaller variations than what is seen in the data. Therefore, the resulting statistical test does not control type-I error (the probability of false discoveries). To address this so-called overdispersion problem, it has been proposed to model count data with negative binomial (NB) distributions [7], with parameters uniquely determined by mean μ and variance σ 2 , and this approach is used in the edgeR package for analysis of SAGE and RNA-Seq [6,8]. However, the number of replicates in data sets of interest is often too small to estimate both parameters, mean and variance, reliably for each gene. For edgeR, Robinson and Smyth assumed [9] that mean and variance are related by σ 2 = μ + αμ 2 , with a single proportionality constant α that is the same throughout the experiment and that can be estimated from the data. Hence, only one parameter needs to be estimated for each gene, allowing application to experiments with small numbers of replicates. Anders et al. extend this model by allowing more general, data-driven relationships of variance and mean, provide an effective algorithm for fitting the model to data, and show that it provides better fits. As a result, more balanced selection of differentially expressed genes throughout the dynamic range of the data can be obtained. DESeq has three sets of parameters that need to be estimated from the data: 1. Library size parameters. 2. Gene abundance parameters under each experimental condition.
3. The smooth functions that model the dependence of the raw variance on the expected mean.

Estimating Library Size Factor
The expectation values of all gene counts from a sample are proportional to the sample's library size. The effective library size can be estimated from the count data. Compute the geometric mean of the gene counts across all samples in the experiment as a pseudo-reference sample. Each library size parameter is computed as the median of the ratio of the sample's counts to those of the pseudo-reference sample. The counts can be transformed to a common scale using size factor adjustment.

Estimate the gene abundance
To estimate the gene abundance for each experimental condition you use the average of the counts from the samples transformed to the common scale (Eq. 6 in [2]).

Estimating Negative Binomial Distribution Parameters
In the model, the variances of the counts of a gene are considered as the sum of a shot noise term and a raw variance term. The shot noise term is the mean counts of the gene, while the raw variance can be predicted from the mean, i.e., genes with a similar expression level have similar variance across the replicates (samples of the same biological condition). A smooth function that models the dependence of the raw variance on the mean is obtained by fitting the sample mean and variance within replicates for each gene using local regression function. Sample variances transformed to the common scale are calculated according to Eq. 7 in [2], while the shot noise term is estimated according to Eq. 8 in [2], and the sample variance is calculated by adding the shot noise bias term to the raw variance according to Eq.9 in [2].

Testing for Differential Expression
Having estimated the mean-variance dependence, one can test for differentially expressed genes between the samples. Define, as test statistic, the total counts in each condition, and their overall sum. Parameters of the new negative binomial distributions for the count sums can be calculated according to Eqs. 12-14 in [2], and the numerical calculation of the p-values for the statistical significance of the change between the experimental conditions (differential expression) is detailed in Eq. 11. The p-values are empirically adjusted from the multiple tests for false discovery rate (FDR) with the Benjamini-Hochberg procedure [10].

DAVID analysis
We added the following to the DAVID defaults: Literature: GENERIF_SUMMARY Protein_Interactions: DIP. We define our entire gene set as background and generate a Chart Report, that is an annotationterm-focused view which lists annotation terms and their associated genes under study.
DAVID EASE Score Threshold (Maximum Probability): The threshold of EASE Score, a modified Fisher Exact P-Value, for gene-enrichment analysis. It ranges from 0 to 1. Fisher Exact P-Value = 0 represents perfect enrichment. Usually P-Value is equal or smaller than 0.05 to be considered strongly enriched in the annotation categories.

Modules of differentially post-transcriptionally expressed genes and physical interactions
We performed a clustering analysis (Newman algorithm [30], see methods), on the proteinprotein interactions network using the previously described differentially expressed genes according to Ribo-Seq (RP) and PUNCH-P (PP) respectively.  Integrin-mediated cell adhesion 6.7e-02

The reported results cannot trivially be explained by biological and technical variability within each procedure
To demonstrate that the reported results cannot be explained by technical variability within each procedure, i.e. to show that the improved prediction of PSS when adding RP (and mRNA) to PP (and vice versa) is not due to any randomness that occurs among different technical repeats, but due to additional/"orthogonal" "information" provided by RP, instead of testing the regressors PP, PP+RP (based on the average across the four replicates) (PP+RP+mRNA) depicted in Figure  5, we tested the regressors of all combinations of the four RP replicates RPi, RPi+RPj, RPi+RPj+mRNA, and showed that the correlation with PSS (and improvement in correlation with the addition of variables) is lower. We performed the following analyses: Utilizing the four RP replicates from the two ribosomal profiling experiments, one with 3 replicates (Rep1, Rep2, Rep3), and one with 1 replicate (Rep4), as described above, we performed the regressor analysis illustrated in Figure 5 of the main text, only now replacing the PP and averaged RP (across the 4 replicates) measurements by all replicate pairs (see Supplementary Figure 2), for RP coverage > 0. As can be seen, while the regressors based on both PP and RP achieve a steady increase in the correlations with steady state protein levels (PSS), the regressor based only on RP replicates plateaus. In order to further demonstrate that each of the techniques, RP and PP, uncovers biologically relevant protein-protein interactions that cannot be detected by the other technique, three PPI network colouring schemes were defined, where "black" nodes represent differentially expressed (DE) genes between G1 and M phase of the cell cycle. In the first case, the black nodes were defined as genes that are DE according to RP but not based on PP (RP-PP); in the second case the black nodes were defined as genes that are DE according to PP but not based on RP (PP-RP); in the third case the black nodes were defined as genes that are DE according to both RP and PP; similarly to the previous analysis. We computed the mean distance (md) between all black nodes in each of the aforementioned three cases. Shorter distances between DE PPI nodes means more meaningful biological signals, as if indeed we uncover real regulatory changes in signalling pathways, we expect them to be clustered/close in the PPI network (we expect to see physical interactions between DE genes). The mean distance in the case of RP∩PP (125 genes) was shorter (2.01) than in the case of the RP-PP (999 genes) and the PP-RP (203 genes) groups (2.12 and 2.13, respectively), depicted in Figure 7 of the main text. We re-executed this analysis, only now instead of calculating the RP DE genes according to all four replicates, we calculated them according to all pairs of replicates, resulting in 6 DE groups, and then utilized the 3 nonoverlapping ones instead of the PP and averaged RP, resulting in 3 independent analyses ( groups, which we will name RP5 and RP6 respectively: RP5∩RP6: 2.17 (138 genes), RP5-RP6: 2.42 (900 genes), RP6-RP5: 2.10 (1000 genes). As can be seen, in most of the cases the intersection does not achieve the shortest distance (as in the case of RP vs. PP), supporting the conjecture that the relations reported in main text are not trivially due variation among replicas. To empirically test the significance of the shorter distance achieved by combing PP and RP, as opposed to using only technical replicates of RP, we devised the following empirical p-value: since RP1∩RP2 attained the shortest distance (which is 2.11), we sampled uniformly at random 125 genes from the PPI network 1000 times, and computed the mean distance between them, mdi, the p-value being (# of times (2.11 -mdi) ≤ (2.11 -2.01))/1000, which is indeed < 0.001.
At the next step our objective was to show that both PP and RP can be used for detecting relevant differentially transcriptional and post transcriptional regulated genes, and that each of these methods exclusively detects relevant genes. We performed pathway and biological process enrichment for each of the DE groups, 1. RP∩PP (125 genes). 2. RP-PP (1,090 genes). 3. PP-RP (200 genes). To achieve our objective, we aimed to show that relevant pathways and biological processes are significantly enriched with DE genes in all three cases, see Supplementary Information Table 1 (section 2.2) above ( Figure 6 of the main text includes selected pathways and biological processes (DAVID analysis) which are significantly enriched by the 3 groups of DE genes, here we examine only our pathway enrichment analysis). Using the same DE groups as in the above PPI analysis, we compared the pathway enrichment results of RP∩PP, with that of RP1∩RP2, RP3∩RP4, RP5∩RP6, taking only pathways enriched by at least two of the groups and that passed FDR. The results are summarized in table 3 below, the p-values reported for the RP groups are based on the average, as can clearly be seen, utilizing RP∩PP uncovers more significant and relevant pathways.

RP, PP, and mRNA Pearson correlation with PSS
In order to be comparable to previous studies which tried to estimate how much of the variance of steady state protein levels (PSS) can be explained by mRNA levels, and which performed Pearson correlations [11][12][13] (as opposed to the Spearman correlations performed throughout our study), we calculated the Pearson correlations between steady state protein levels and RP, PP and mRNA levels respectively. In our opinion it is more 'correct' to employ Spearman correlations which unlike Pearson do not assume linearity, as when comparing mRNA levels with ribosomal density and protein levels that means that we assume there is no translation regulation, which is known to be incorrect. Moreover, even if the relationship was linear, since all experimental measurements have a saturation range, that linear relationship would have been distorted.