PECAplus: statistical analysis of time-dependent regulatory changes in dynamic single-omics and dual-omics experiments

Simultaneous dynamic profiling of mRNA and protein expression is increasingly popular, and there is a critical need for algorithms to identify regulatory layers and time dependency of gene expression. A group of scientists from United States and Singapore present PECAplus, a comprehensive set of statistical analysis tools to address this challenge. Protein expression control analysis (PECA) computes the probability scores for change in mRNA and protein-level regulatory parameters at each time point, deconvoluting gene expression regulation in the presence of measurement noise. PECAplus adapted PECA’s mass action model to a variety of proteomic data including pulsed SILAC and generic protein expression data. It also features analysis modules to fit smooth curves on rugged time series observations, and to facilitate time-dependent interpretation of the data for genes and biological functions. They demonstrate the core modules with two time course datasets of mammalian cells responding to unfolded proteins and pathogens.


Significantly up-regulated genes in metabolic processes
Supplementary Figure 1. a. PECA-N produced greater CPS scores than PECA Core for the genes belonging to metabolism and respiratory electron transport at 16 hrs at the RNA level. b. The ratio of the number of significantly up-regulated RNAs between PECA-N and PECA core at the same CPS score thresholds among 1720 genes in the metabolic processes. and PECA-pS in the synthetic LPS response data set. For PECA-pS (horizontal axis), the rate parameters in the adjacent time intervals at 0 and 12 hours were used for comparison with the rates from the ODE-based approach at the respective time points.  Figure 4. GP smoothing is robust to variation of parameters F and within the proposed range. We illustrate the impact of the parameters for the SLC39A14 gene (two biological replicates). The two parameters work in concert to control the overall smoothness of curves. We optimized parameters for small to medium sized time series that are typical for biological experiments. We varied each parameter from small to large with respect to the default value while fixing the other parameter at default value. The covariance weight parameter F was varied from 1 to 4 while fixing = 1. Local variance parameter was varied from 0.5 to 2 while fixing F = 2. Supplementary Figure 5. The GP-based imputation (used in PECAplus) is more robust to leave-one-out and leave-two-out tests than K-nearest neighbor (knn) imputation. In the ER stress data, we removed one or two intensity values from randomly selected time points for each gene individually, and challenged both methods to impute the values. We then computed imputation errors as the difference between imputed values and observed (but erased) values. The panels show the distributions of the imputation errors across all genes at all time points. The distributions are centered more narrowly around 0 (no error) for GP smoothing than for knn imputation.
2 Gaussian Process model for smoothing and imputation

Model description
In this section, we provide detailed description of the Gaussian Process (GP) model [5], which smoothes rugged time series data and imputes missing observations in PECAplus. A time series for a gene, whether it is mRNA or protein, is expressed as a function of time f (h) with the following kernel: For any finite set of time points, we assume that the distribution of the function f between any two time points h i and h j follows Gaussian distribution denotes the Gaussian kernel specified below. Thanks to this flexible definition of the stochastic process, GP is a widely used, highly flexible technique for inference of time series data in many areas of application.

Gaussian kernel and tuning parameters
The Gaussian kernel is defined as The value of σ 2 y does not affect the value off * so we simplify the kernel to F and values are user-defined parameters. These two parameters jointly control the shape of smoothed curve, with subtle difference in their roles: the parameter F controls the amount of correlation between time points (not necessarily observation time points), whereas the parameter controls the variability of values in neighboring time points, i.e. ruggedness of curve in local temporal neighborhoods. For this reason, we call F the absolute covariance weight parameter and the local variance parameter.
If the user sets a small value of F (e.g. 0.1 ∼ 0.5), then the model assumes that the expression values in any two time points are less correlated. This leads the model to think that the variation along the time course is more attributable to random noise than systematic changes, and therefore results in a flat curve. By contrast, a large F (e.g. 2 ∼ 5) will honor correlated changes along time and produce a curve closer to the original observations with reduced amount of smoothing.
Meanwhile, the parameter controls the amount of fluctuations allowed in neighboring time points. If the user sets a small value of (e.g. 0.1 ∼ 0.5), then the curve has to be fit tightly to the observed data, therefore losing flexibility in the curve. This tends to produce more locally rugged patterns (less smoothed). By contrast, if the user sets a large value of (e.g. 2 ∼ 5), then the curve no longer has to tightly fit the observed values and can have smoother shape along the time course.
In the current implementation, we set the default values at (F, ) = (2, 1). Supplementary Figure 4A shows the impact of the covariance weight parameter F on the smoothed curve when the local variance parameter is fixed at 1. As explained above, at a fixed value of , larger values of F render the curve to be more "correlated" in nearby time points (nearby observed values), i.e. closely fitting the observed data points. Supplementary Figure 4B shows that smaller produces less smooth a curve closely fitting the observed data. We have chosen F = (2, 1) based on empirical evidence over a number of data sets we analyzed, and we recommend that these values are not adjusted too much by users who are not familiar with the modeling. However, the python script automatically produces these plots, which helps the user make a more informed decision on optimal parameters for their own data.

Smoothing and missing data imputation
The above model can be used to calculate the expected (de-noised) expression level E(y) at any time point x for both smoothing and imputation. We first use all observed data points We apply the same to each observation time point with non-missing data to perform smoothing on the entire time series.
To evaluate whether the imputation scheme is efficient, we performed a numerical study comparing GP-based imputation with K-nearest neighbor (KNN)-based imputation. To do this, we took the mRNA data from the ER stress study and modified the data as follows. We randomly selected one time point in each gene and erased the observed intensity value, making it a missing data (Supplementary Figure 5A). After erasing one data point per gene, we challenged both methods to impute the values. To evaluate the performance, we compared the difference between imputed values and true underlying values. Supplementary Figure 5A shows that GP-based smoothing is overall superior to the KNN method, giving smaller imputation errors. This performance difference became more pronounced when we erased two data points per gene (Supplementary Figure 5B).

PECA-N: Incorporating biological network information into the PECA model
PECA-N employs the same statistical model as the original PECA Core [6]. In PECA, the prior probability of change point in a rate ratio parameter at time t is the same for every gene, which is estimated from the data across all genes. In PECA-N, we employ the Markov random field (MRF) prior [2,7], where the prior probability of change point in a gene is adjusted by the change point status of other first degree neighbor genes in a user-provided biological network. We used the protein-protein interaction data from the STRING database [4]. To estimate the model parameters, we constructed a Markov chain Monte Carlo (MCMC) sampler that combines standard Metropolis-Hastings updates and dimension switching updates in the form of reversible-jump MCMC [3]. The following section describes the full details on the PECA-N model and Bayesian inference of the model parameters.

Likelihood and prior distributions
First, suppose that the experiment has measurements for I genes across T times points in N biological replicates. Then, the likelihood of the entire PECA Core model is and i, j, and t index gene, replicate, and time, respectively. We specify prior distributions that are the least subjective with wide variance parameters: for fixed C i for all i, where N , U, G denote normal, uniform, and gamma distributions respectively. Here C i is the change point set for gene i, i.e. the collection of time points in which rate ratios in adjacent time periods are different. This set is empty when the rate ratio remains constant over time, and becomes a non-empty set when there is at least one change point. Accordingly, |C i | denotes the number of change points in gene i, which is a random variable on its own and will be inferred during the estimation procedure. We impose the following prior for C i : where ϕ it is the probability that gene i has a change point probability at time t, i.e. the rate ratios κ i,t−1 and κ it are different. The posterior expectation of this parameter becomes the final change point score (CPS) of gene i at time t, i.e. P (κ i,t−1 = κ it |x, y), where x and y denote the two layers of expression data.
The difference between PECA Core and PECA-N is how ϕ it is defined and estimated. In PECA Core, the parameter is estimated solely based on the data for gene i. In PECA-N, by contrast, this parameter is estimated with the help of the network information (or module information) from related genes is incorporated (see below), using the MRF prior.

Definition of module information
There are different types of network information in the literature. Most biological networks are provided as a list of gene pairs that interact with one another. One can convert gene groups (e.g., Gene Ontology [1] or GO terms) into this format by forming cliques for a group of genes, i.e. enumerating all pair-wise interactions in the group. We refer both formats of network data as module information hereafter.

MCMC sampler
When the module information is utilized, PECA-N will run the MCMC sampler twice. The first MCMC sampling will be done following the sampler of PECA Core. On the completion of the first MCMC run, we estimate the MRF prior coefficient by finding a maximizer of with respect to γ t , where In summary, the prior can be written as where the prior for {κ it } is omitted conditional on the fact that the parameters are uniformly distributed on unit interval [0,1], and φ denotes standard normal density. The model parameters are updated in the following order: for all i. This whole cycle is repeated for 1,000 iterations for burn-in period and M = 10, 000 iterations for the main iteration with thinning of 10 samples, in both simulation and data analysis sections that follow. We use hat and tilde symbols to denote current and proposed values respectively.
1. We first start with η ji0 . We run the random walk Metropolis algorithm chain in the logspace by generating the proposal log(η ji0 ) from N (log(η ji0 ), 1) or equivalentlyη ji0 from η ji0 exp (N (0, 1)). Running the chain in the log-space alleviates the need to tune for step sizes for each η ji0 . Since this parameter is involved in the expected values of y ijt at all time points, the likelihood has to be evaluated at all time points for updating each of these parameters.
2. Next, we draw the variance parameter τ 2 i by Gibbs sampling from inverse gamma distribution IG(a τ + N (T + 1)/2, b τ + j,t (y jit − η jit ) 2 /2). i for each protein i. We run the chain in the logit-space, i.e. draw a proposal value logit(κ s i ) from N (logit(κ s i ), 1) and accept or reject afterwards. 4. Finally, we update the change point set C i . There are two different moves: birth of a new change point and removal (death) of an existing change point. Since these two moves are reversible in notation, we just describe the birth move here. Suppose thatκ i covers a time period (h t , h t+m ) that contains at least one observation time(s). Then we propose a birth of a new change point h * ∈ {h t+1 , . . . , h t+m−1 } within the interval (chosen from one of the intermediate time points) and break the current rate parameter into two daughter parameters, namely (κ i ,κ i, +1 ) where it is required to meet with a random perturbation such that with u ∼ Uniform(0, 1). Under this transformation, the Jacobian is ). Hence the Metropolis-Hastings ratio for the birth move just equals the posterior ratio times the Jacobian since the acceptance probability of this proposal is min{1, likelihood ratio × prior ratio × proposal ratio × Jacobian}, where the prior and proposal ratios are the ratios of Uniform distribution over unit intervals. Then the Metropolis-Hastings ratio becomes

PECA-pS: A PECA model when pulse-labelled proteomic data is available
PECA-pS uses pulsed-SILAC data for the proteomic data to estimate synthesis and degradation rates and infer regulatory changes across the time points in synthesis and degradation separately. The model for the synthesis rate parameter takes the amount of mRNA available at the beginning of each time period into estimation, while the model for the degradation rate is formulated as a function of protein abundance at the beginning of each time period and the rate parameter, disregarding the abundance of mRNA.
after proper normalization of the data. Our goal is to infer the protein synthesis rate κ s it and the degradation rate κ d it during the interval (h t , h t+1 ) of length ∆h t = (h t+1 − h t ) for protein i. More importantly, the mean parameters are related between adjacent time points as follows: for t = 0, 1, . . . , T − 1. This mathematical assumption was put in place after we noticed that, without such transformation, the majority of signals were detected mostly on highly abundant genes only. The transformation log(x+1) "de-sensitizes" the change point scores in very highly abundant genes only, especially when the data are very noisy.
To detect the change in these rate parameters, we formulated a change point model similar to PECA Core to describe the probability distribution of κ s i = (κ s i0 , · · · , κ s i,T −1 ) as follows. We first note that the synthesis rate κ s it and the degradation rate κ d it are always positive since they are rate parameters by definition. Second, unlike the PECA Core and PECA-N models, we define the change point set C i for synthesis and degradation separately, since synthesis and degradation rates may not share the same change point in the same gene across time points. For gene i, let C

Likelihood and prior distributions
First, the likelihood of the entire model can be written as The two equations above dictate the monotone decreasing and increasing time series for degradation and synthesis, respectively. We specify prior distributions that are the least subjective with wide variance parameters: for all i, where N , LN , G denote normal, log-normal, and gamma distributions, respectively. We also assume that the change point set C (s) i has the following prior: The priors are set similarly for the change point set for degradation parameter C

MCMC sampler
The model parameters are updated in the following order: for all i. This whole cycle is repeated for 1,000 iterations for burn-in period and M = 10, 000 iterations for the main iteration with thinning of 10 samples, in both simulation and data analysis sections that follow. We use hat and tilde symbols to denote current and proposal values respectively.
1. We first start with η ji0 . We run the random walk Metropolis algorithm chain in the logspace by generating the proposal log(η ji0 ) from N (log(η ji0 ), 1) or equivalentlyη ji0 from η ji0 exp (N (0, 1)). Running the chain in the log-space alleviate the need to tune for step sizes for each η ji0 . Since this parameter is involved in the mean values at all time points, the likelihood has to be evaluated at all time points for updating each of these parameters. Similarly for η (H) ji0 .
2. Next, we draw the variance parameter (τ with a random perturbation such that with u ∼ Uniform(0, 1). Under this transformation, the Jacobian is ). Hence the Metropolis-Hastings ratio for the birth move just equals the posterior ratio times the Jacobian since the acceptance probability of this proposal is min{1, likelihood ratio × prior ratio × proposal ratio × Jacobian}.
Then the Metropolis-Hastings ratio becomes 5 PECA-R: Inferring synthesis and degradation rates from proteomic data not derived from a pulse-labeling experiment PECA-R aims to estimate synthesis and degradation rates separately from proteomic concentration data (along with mRNA). For example, the data might arise from label-free experiments or proteins that were tagged with mass labels, such as TMT. The model expresses the total concentration change as a sum of increase in concentration due to new synthesis and decreased due to degradation. Note that if pulse-labeling, e.g. pulsed-SILAC, data is available, we highly recommend using PECA-pS instead of PECA-R.
In PECA-R, the synthesis and degradation rate parameters are estimated under the following assumptions: • When the total concentration increases, it is due to the increase in the synthesis rate unless the mRNA concentration increased drastically and can explain the protein concentration increase even with an unchanged synthesis rate; • When the total protein concentration decreases, it is due to the increase in the degradation rate unless the mRNA concentration dropped dramatically and can explain the decrease in protein concentration given constant synthesis and degradation rates.
The reason for imposing the aforementioned assumptions on the parameter space is straightforward. In label-free or TMT data, we only observe total protein changes, without separate abundance measurements for newly synthesized and existing proteins. Hence when the protein concentration changes, this model has to make a decision as to whether the synthesis rate and/or the degradation rate changed, considering the changes in mRNA concentration.
Since the total protein concentration changes can be explained by infinitely many combinations of the two rate parameters, the statistical significance score (CPS) is often more diluted in PECA-R than those values from PECA-pS. However, the PECA-pS model is not applicable unless pulse labelled samples are available, and PECA-R is the next best option for non-pulse labelled data within the PECAplus package if the estimation of synthesis and degradation is the ultimate aim of the analysis.
Note that estimating rates of synthesis and degradation (which is done with PECA-pS and PECA-R) is a goal different from simply extracting significant change points (which is done with PECA Core and PECA-N). Estimating rates aims at estimating relative ?speeds? of synthesis and degradation -they are not absolute, i.e. only applicable when comparing across genes, and have to be used with care. In contrast, significance analysis by PECA Core provides significance scores (CPS) and false discovery rates and information on the direction of the regulation (up or down) for each gene and each time point without providing rates. While PECA-R also provides CPS values, they are not as reliable for estimating significance of change as those from PECA Core.
Suppose that we have parallel mRNA and protein expression data X = {x jit }, Y = {y jit } for protein i = 1, . . . , I in replicates j = 1, . . . , N observed over time points (h 0 , . . . , h T ). Time h 0 indicates the time point before the samples are treated or the baseline of subsequent time points. We assume that the protein expression measurements follow log normal distributions y jit ∼ LN ln(η jit ), τ 2 i after proper normalization of the data. Our goal is to infer the protein synthesis rate κ s it and the degradation rate κ d it during the interval (h t , h t+1 ) of length ∆h t = (h t+1 − h t ) for protein i.