Characterization and non-parametric modeling of the developing serum proteome during infancy and early childhood

Children develop rapidly during the first years of life, and understanding the sources and associated levels of variation in the serum proteome is important when using serum proteins as markers for childhood diseases. The aim of this study was to establish a reference model for the evolution of a healthy serum proteome during early childhood. Label-free quantitative proteomics analyses were performed for 103 longitudinal serum samples collected from 15 children at birth and between the ages of 3–36 months. A flexible Gaussian process-based probabilistic modelling framework was developed to evaluate the effects of different variables, including age, living environment and individual variation, on the longitudinal expression profiles of 266 reliably identified and quantified serum proteins. Age was the most dominant factor influencing approximately half of the studied proteins, and the most prominent age-associated changes were observed already during the first year of life. High inter-individual variability was also observed for multiple proteins. These data provide important details on the maturing serum proteome during early life, and evaluate how patterns detected in cord blood are conserved in the first years of life. Additionally, our novel modelling approach provides a statistical framework to detect associations between covariates and non-linear time series data.


Data description
We analyze mass-spectrometry proteomics data from a group of healthy children. The study design is as follows: each child visits the clinic at the ages of 3, 6, 12, 18, 24 and 36 months, and a blood sample is drawn at each visit. The samples are then analyzed together to obtain the serum protein levels at these time points.
For each child, we have the following explanatory variables: age (sample date minus birth date), location (Finnish or Estonian), gender, season (sample date minus a common reference date for all children), and id (the identifier of the child).
The response variables are serum protein levels measured by mass spectrometry and quantified with the MaxQuant software. The protein intensities are on the log2 scale. There are 3 technical replicates for each sample, and the median of the 3 technical replicates is taken as the final protein intensity. For each protein in a sample, we compute the median using only those technical replicates in which the protein was detected.
Note that many proteins are measured in a single sample, but we analyze each protein independently. We also remove proteins that are detected in fewer than 50% of all available samples. In total, we analyzed 266 proteins.
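As an illustration, the replicate summarization and detection-rate filtering described above can be sketched in Python as follows (function names and the NaN-for-missing convention are ours, not from the paper's actual pipeline):

```python
import numpy as np

def summarize_protein(replicates):
    """Median over technical replicates, using only those replicates
    in which the protein was detected (encoded here as non-NaN)."""
    reps = np.asarray(replicates, dtype=float)
    detected = reps[~np.isnan(reps)]
    return np.nan if detected.size == 0 else float(np.median(detected))

def filter_proteins(intensity, min_frac=0.5):
    """Keep proteins (columns) detected in at least min_frac of the
    samples (rows); missing values are NaN."""
    detected_frac = np.mean(~np.isnan(intensity), axis=0)
    return intensity[:, detected_frac >= min_frac]
```

For example, `summarize_protein([10.0, nan, 11.0])` ignores the undetected replicate and returns the median of the remaining two values.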

Additive Gaussian process and model selection
We describe our computational methods for a single protein, omitting the protein index for simplicity. We can represent our data as follows: we have a continuous response variable Y and 5 explanatory variables X, of which 2 (age, season) are continuous and 3 (location, gender, id) are binary/categorical. We want to find out which explanatory variables best explain the response variable. The first question we want to answer is which modeling framework to use. Commonly used frameworks include linear mixed models and generalized linear mixed models. In practice we find that linear mixed models cannot fit the data well, since the data show nonlinear trends. We therefore use Gaussian processes (GPs) to handle the nonlinearity in the data. The general form is

y = f(x) + ε,    (1)

where f(x) can be any smooth non-linear function and ε is the noise. A GP is parameterized by its mean m(x) and covariance function k(x, x′). In practice we usually center the data points so that the mean is 0. This practice leads to the specification of a zero-mean GP prior, which has a less complex formulation while keeping the same modeling power. Assuming f(x) has non-zero mean µ and the noise has zero mean in Equation 1, we can rewrite it as (y − µ) = (f(x) − µ) + ε, which shows the effect of centering.
The key part of GP modeling is the choice of the covariance functions. We can think of the function f(x) as a composition of more basic nonlinear functions that each involve only one or two explanatory variables. Here "composing" means addition and multiplication, e.g.

f(x) = f1(x1) + f2(x2) + f3(x3) + f4(x4, x5),

where multiplication refers to the interaction of covariates x4 and x5 (or their covariance functions) in f4(x4, x5). Different ways of "composing" lead to different models.
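The additive and multiplicative composition of base kernels can be sketched as follows. This is a minimal numpy illustration, not the paper's GPstuff implementation; the kernel parameters and the particular composition (`age + location + age×location`) are chosen only as an example:

```python
import numpy as np

def sq_exp(x, x2, length_scale=1.0, magnitude=1.0):
    """Squared exponential kernel for a continuous covariate."""
    d = x[:, None] - x2[None, :]
    return magnitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def categorical(c, c2, magnitude=1.0):
    """Kernel equal to magnitude^2 when two samples share a category, else 0."""
    return magnitude**2 * (c[:, None] == c2[None, :]).astype(float)

def composed_cov(age, loc):
    """Example composition: f(x) = f1(age) + f2(location) + f3(age, location),
    where the interaction f3 is the product of the two base kernels."""
    K = sq_exp(age, age)
    K += categorical(loc, loc)
    K += sq_exp(age, age, length_scale=0.5) * categorical(loc, loc)
    return K
```

Addition of kernels corresponds to adding independent GP components; a product of a continuous and a categorical kernel yields a component that varies over age separately within each location group.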
Let us use M to denote a model and θ to denote the parameters of the covariance functions of model M (including the noise variance). We use cross-validation to compare different models. Specifically, we split the data set into small portions, then iteratively use one portion as test data and the others as training data. In the end, we obtain the predictive density of one portion y_i conditioned on the rest y_−i, i.e.

p(y_i | y_−i) = ∫_θ p(y_i | θ) p(θ | y_−i) dθ.

Note that the predictive density can be obtained using sampling techniques, since p(θ | y_−i) can be sampled from and p(y_i | θ) can be computed analytically. See [1] and [2] for more details.
We use two types of cross-validation for our time series data: leave-one-out cross-validation (LOOCV) and stratified cross-validation (SCV). LOOCV leaves out only a single data point (a single time point of an individual), while SCV leaves out all data points (with respect to time) of an individual. We compare two models M1 and M2 by the following formula:

(1/n) Σ_{i=1}^{n} [ log p(y_i | y_−i, M1) − log p(y_i | y_−i, M2) ],    (2)

which compares the average prediction accuracy of the two models. If Equation 2 is greater than 0, then M1 is better than M2; otherwise M2 is better than M1. This comparison does not, however, provide a probability quantifying how much better one model is than the other: whether M1 is better than M2 at all data points, or only marginally better at a few data points, we reach the same conclusion, which is not desirable in many cases. We therefore approximate the distribution of y_i using the Bayesian bootstrap [3], which assumes y_i takes values only from the observations and has zero probability at all other values. The probabilities of the observed values follow the n-dimensional Dirichlet distribution Dir(1, 1, ..., 1). More specifically, we bootstrap the samples N times. Each time we obtain the same n observations, with the ith observation receiving weight w_bi (b = 1, ..., N; i = 1, ..., n) from the Dirichlet distribution. Consequently, the value of Equation 2 can change for each bootstrap sample b. We then summarize the N bootstrap results to obtain the probability that M1 is better than M2:

P(M1 better than M2) = (1/N) Σ_{b=1}^{N} δ{ Σ_{i=1}^{n} w_bi [ log p(y_i | y_−i, M1) − log p(y_i | y_−i, M2) ] },    (3)

where δ{·} is the Heaviside step function, w_bi is the bootstrap weight of the ith data point in the bth bootstrap iteration, and the weights w_b are sampled from the n-dimensional Dirichlet distribution Dir(1, 1, ..., 1). See [4] for more details.
We call the result of Equation 3 the loo factor or the scv factor, depending on the cross-validation technique used. The above strategy also works when comparing multiple models: instead of evaluating the Heaviside step function in each bootstrap iteration, we simply take the model with the highest rank by sorting the weighted mean log predictive densities in each iteration.
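The Bayesian bootstrap comparison of Equation 3 can be sketched compactly in Python. The inputs `lpd1` and `lpd2` stand for the per-point log predictive densities log p(y_i | y_−i) under each model (in the paper these come from LOOCV or SCV); the function name and defaults are ours:

```python
import numpy as np

def prob_m1_better(lpd1, lpd2, n_boot=4000, seed=0):
    """Bayesian-bootstrap probability that model M1 beats M2 (Equation 3).
    Each bootstrap iteration draws observation weights from Dir(1, ..., 1)
    and checks the sign of the weighted mean log predictive density gap."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(lpd1, dtype=float) - np.asarray(lpd2, dtype=float)
    n = diff.size
    w = rng.dirichlet(np.ones(n), size=n_boot)  # shape (n_boot, n)
    stat = w @ diff                             # weighted mean gap per iteration
    return float(np.mean(stat > 0))             # Heaviside step, averaged
```

If M1 dominates M2 at every data point the probability is 1 regardless of the weights; for ties it collapses to 0 under the strict inequality used here.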

Practical application
This section provides the technical details of how we perform model selection for the proteomics data. In our analysis, the protein intensities and the continuous covariates (age, season) are all standardized to have mean 0 and standard deviation 1. This makes it convenient to set the parameter priors in the GP regression.
4. id (categorical), categorical kernel times constant kernel; the magnitude prior is the same as for location.
5. season (continuous), periodic kernel gpcf_periodic with period = 12/(standard deviation of the season covariate before standardization); the length-scale prior is the same as for age, and the magnitude prior is a Student's t-distribution with location 0, scale 1, and 4 degrees of freedom. Note that we fix the period to exactly 12 months.
We assume zero-mean Gaussian noise, and the prior for the variance σ² is a scaled inverse chi-squared distribution (prior_sinvchi2()) with ν = 1 degree of freedom and scale (variance) σ² = 0.1. Regarding the interaction terms, all pairwise interactions of the covariates in a model are included by default. We further set up or eliminate some interaction terms according to the following rules: 1. Continuous covariates such as age and season are given a different length-scale prior that allows for more rapid changes. For example, the length-scale prior for age in an interaction term is the positive half of the Student's t-distribution (prior_t()) with location 0, scale 1, and 4 degrees of freedom. This prior allows small length scales.
2. Interactions of two continuous covariates are not considered, since they would bring too much flexibility to the model and make the parameter inference challenging.
3. We do not consider interaction terms involving id and another binary covariate (location or gender), because the product kernel is in essence the same as the id kernel, as shown in M 4 , M 5 and M 11 .
4. When considering an interaction term between a continuous covariate and a binary covariate, the constant-kernel part of the binary covariate is dropped, since the kernel of the continuous covariate already models the magnitude. In other words, the interaction term is a product of a squared exponential kernel and a categorical kernel.
5. When considering an interaction term between two categorical covariates, there are by default two constant kernels in the interaction term, one for each covariate. Only one constant kernel is kept, since the other is redundant.
6. When a model has more than 20 parameters, we first estimate the parameters using MCMC (see Section 3.2) and then delete interaction terms according to their explained variances. Interaction terms that explain less than 0.02 of the total variance are deleted. If there are still more than 20 parameters, we delete the interaction terms with the smallest explained variances until the final model contains 20 parameters. With more than 20 parameters, the fast central composite design (CCD) inference algorithm (see Section 3.2) is unreliable. We did not use MCMC in SCV since it was very time-consuming.
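The pruning in rule 6 can be sketched as plain bookkeeping. The data structure below is hypothetical (the paper derives explained variances from the MCMC fit); it only illustrates the two-stage rule of dropping weak interactions first by a variance threshold and then by rank:

```python
def prune_interactions(terms, max_params=20, min_frac=0.02):
    """terms: list of dicts with keys 'name', 'var' (explained variance),
    'n_params', and 'interaction' (bool). Main effects are never pruned."""
    total = sum(t['var'] for t in terms)
    # Stage 1: drop interactions below the explained-variance threshold.
    kept = [t for t in terms
            if not t['interaction'] or t['var'] / total >= min_frac]
    # Stage 2: if still too many parameters, drop the weakest interactions.
    while sum(t['n_params'] for t in kept) > max_params:
        inter = [t for t in kept if t['interaction']]
        if not inter:
            break
        kept.remove(min(inter, key=lambda t: t['var']))
    return kept
```

An interaction explaining 1% of the total variance would be removed in stage 1 even if the parameter budget were not exceeded.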
We use both LOOCV and SCV to compare the aforementioned 16 models. The results show that LOOCV is good at analyzing the effects of shared covariates, while SCV is good at analyzing the effects of subgroup-specific covariates. For example, when comparing M 4 versus M 1 using LOOCV for a protein with a location effect, there is barely any difference between the two models, because we can borrow strength from the other time points of an individual to estimate the left-out time point. If we use SCV instead, M 4 is clearly better than M 1 , since M 4 uses the location mean for prediction whereas M 1 uses the whole-sample mean. Conversely, when we compare M 2 versus M 1 using SCV for a protein with an age effect, the predictions are centered around the whole-sample mean, which introduces large errors and can easily mask the slight improvement from age; as a result, there is little difference between M 2 and M 1 . If we use LOOCV to compare M 2 and M 1 , the age effect can be better detected, since the baseline protein level of an individual can be reliably estimated from its other time points, which circumvents the problem in SCV. We conclude that LOOCV is good at analyzing the effects of covariates that are shared by all individuals, while SCV is good at analyzing the effects of categorical covariates that are subgroup specific.
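The difference between the two cross-validation schemes is simply how the folds are built. A minimal sketch (index-based fold construction; function names are ours):

```python
import numpy as np

def loocv_folds(n):
    """Leave-one-out: each fold holds out a single data point."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def scv_folds(subject_ids):
    """Stratified CV: each fold holds out all samples of one individual."""
    subject_ids = np.asarray(subject_ids)
    folds = []
    for s in np.unique(subject_ids):
        test = np.where(subject_ids == s)[0].tolist()
        train = np.where(subject_ids != s)[0].tolist()
        folds.append((train, test))
    return folds
```

With LOOCV the training set always contains the held-out individual's other time points, which is exactly why individual baselines can be borrowed; with SCV they are absent by construction.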

Inference
For each model, we use 4 independent MCMC (Markov chain Monte Carlo) chains to infer the parameters, each initialized with different parameter values. Slice sampling [6] is used for the actual MCMC sampling. 2500 samples are generated from each chain; the first 100 are discarded as burn-in, and the rest are thinned by a factor of 6, leaving 400 approximately independent samples per chain. We then concatenate the samples of the 4 chains, giving 1600 samples. After that, we use the potential scale reduction factor R̂ [7] to check the convergence of the chains. If 0.9 ≤ R̂ ≤ 1.1, we consider the chains to have converged. If not, we repeat the process at most 4 times, each time combining new samples with the previous ones.
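The convergence check can be illustrated with a basic (non-split) potential scale reduction factor for a single parameter; this is a textbook sketch of R̂, not the exact estimator used in the paper's toolbox:

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction factor R-hat for one parameter.
    chains: array of shape (m_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)         # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```

Chains sampling the same distribution give R̂ ≈ 1; a chain stuck in a different region inflates the between-chain variance and pushes R̂ well above the 1.1 threshold.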
When performing LOOCV, we would need to run the cross-validation n times to obtain the predictive density of each data point, i.e. p(y_i | y_−i) = ∫_θ p(y_i | θ) p(θ | y_−i) dθ. This is usually time-consuming, so we instead use importance sampling to approximate the predictive density. Note that the leave-one-out posterior p(θ | y_−i) is very close to the full posterior p(θ | y), since they differ by only a single data point. We can therefore use the full posterior p(θ | y) as the proposal distribution in the importance sampling, which makes approximating the leave-one-out posterior p(θ | y_−i) inexpensive. See [1] for more details.
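The classical importance-sampling LOO estimator based on this idea can be sketched as follows (this is the standard harmonic-mean form of IS-LOO under a full-posterior proposal; the per-draw likelihood values are assumed inputs here, and in the paper they would come from the fitted GP):

```python
import numpy as np

def is_loo(log_lik):
    """Importance-sampling approximation of log p(y_i | y_-i).
    log_lik: array of shape (S, n) with log p(y_i | theta_s) for S
    posterior draws theta_s from the full posterior p(theta | y)."""
    log_lik = np.asarray(log_lik, dtype=float)
    S = log_lik.shape[0]
    # Harmonic-mean identity:
    # log p(y_i | y_-i) ≈ -[logsumexp_s(-log_lik[s, i]) - log S]
    neg = -log_lik
    m = neg.max(axis=0)                          # stabilize the exponentials
    lse = m + np.log(np.exp(neg - m).sum(axis=0))
    return -(lse - np.log(S))
```

In practice the raw importance weights 1/p(y_i | θ_s) can have high variance, which is why smoothed variants are often preferred; the sketch keeps the basic form for clarity.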
SCV is also time-consuming if a separate MCMC inference is performed for each fold. To make the inference time affordable, we use the central composite design (CCD) [8] to infer the parameters instead of MCMC. CCD assumes a unimodal posterior; it finds and assigns weights to representative points around the posterior mode, and then approximates the predictive density using these representative points. CCD is much faster than MCMC and reliably accurate for models with fewer than 20 parameters. We therefore reduce the model size by removing negligible interaction terms when a model has more than 20 parameters.

Model selection
There are two types of questions we are interested in.
1. Is there an age/seasonal effect in the data? An age/seasonal effect means that the protein intensity changes with a common pattern over time in all individuals. Since age and season are continuous covariates shared by all individuals, we use LOOCV to compare the models. The model space for LOOCV is constrained to {M 1 , M 2 , M 3 , M 6 }. The selected best model then explains the age/seasonal effect. The following rules are adopted to select the relevant models.
(a) The best model is M 2 (bestModelInd=2), its posterior rank probability P (M 2 |data) is higher than 0.5 (bestModelRank>0.5), its loo factor versus M 1 is larger than 0.95 (loofactor_2vs1>0.95), and MCMC converged. This rule says that the protein in question has a significant age effect but no significant seasonal effect.
There do not seem to be any proteins with a significant seasonal effect in our results. We think this is probably due to the sparse sampling in our data, i.e. 6 time points per individual are not enough to reliably detect a seasonal effect.
2. Is there a location/gender effect in the data? A location/gender effect means that the individuals belonging to one group (by location or gender) differ systematically from another group. Since gender and location are categorical covariates specific to subgroups, we use SCV to compare all the models in the model space defined in Section 3.1. If the best model (highest posterior rank probability) contains gender or location, then we conclude that there is a gender/location effect. To be considered significant, we require that (1) the posterior rank probability is higher than 0.8 and (2) the scv factors of the best model versus both the age model M 2 and the constant model M 1 are greater than 0.9. If the posterior rank probability is between 0.5 and 0.8, and the scv factors of the best model versus both M 2 and M 1 are greater than 0.9, then we consider the location/gender effect suggestive. Note that the SCV results can differ from the LOOCV results.
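The location/gender decision rule can be condensed into a small classifier. The thresholds come from the text above; the function name and the string-based model bookkeeping are ours:

```python
def classify_effect(best_model, rank_prob, scv_vs_age, scv_vs_const):
    """Sketch of the location/gender rule: best_model is the covariate of
    the top-ranked model; rank_prob is its posterior rank probability;
    scv_vs_age and scv_vs_const are the scv factors versus the age model
    M2 and the constant model M1."""
    if best_model not in ('location', 'gender'):
        return 'none'
    if scv_vs_age > 0.9 and scv_vs_const > 0.9:
        if rank_prob > 0.8:
            return 'significant'
        if rank_prob > 0.5:
            return 'suggestive'
    return 'none'
```

Requiring the best model to beat both M 1 and M 2 guards against mistaking a pure age trend for a group difference.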