An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data

Biomedical research typically involves longitudinal study designs where samples from individuals are measured repeatedly over time and the goal is to identify risk factors (covariates) that are associated with an outcome value. General linear mixed effect models are the standard workhorse for statistical analysis of longitudinal data. However, analysis of longitudinal data can be complicated for reasons such as difficulties in modelling correlated outcome values, functional (time-varying) covariates, nonlinear and non-stationary effects, and model inference. We present LonGP, an additive Gaussian process regression model that is specifically designed for statistical analysis of longitudinal data, which solves these commonly faced challenges. LonGP can model time-varying random effects and non-stationary signals, incorporate multiple kernel learning, and provide interpretable results for the effects of individual covariates and their interactions. We demonstrate LonGP’s performance and accuracy by analysing various simulated and real longitudinal -omics datasets.

B iomedical research often involves longitudinal studies where individuals are followed over a period of time and measurements are repeatedly collected from the subjects of the study. Longitudinal studies are effective in identifying various risk factors that are associated with an outcome, such as disease initiation, disease onset or any disease-associated molecular biomarker. Characterisation of such risk factors is essential in understanding disease pathogenesis, as well as in assessing an individuals' disease risk, patient stratification, treatment choice evaluation, in a future personalised medicine paradigm, and planning disease prevention strategies.
There are several classes of longitudinal study designs, including prospective vs. retrospective studies and observational vs. experimental studies, and each of these can be implemented with a particular application-specific experimental design. Also, as the risk factors (or covariates) can be either static or timevarying, statistical analysis tools need to be versatile enough so that they can be appropriately tailored to every application. Traditionally, analysis of variance (ANOVA), general linear mixed effect models (LME), and generalised estimating equations are widely used in analysing longitudinal data due to their simplicity and interpretability 1 . Although numerous advanced extensions of these statistical techniques have been proposed, longitudinal data analysis is still complicated for several reasons, such as difficulties in choosing covariance structures to model correlated outcomes, handling irregular sampling times and missing values, accounting for time-varying covariates, choosing appropriate nonlinear effects, modelling non-stationary (ns) signals, and accurate model inference.
Modern statistical methods for timeseries and longitudinal data analysis make less assumptions about the underlying data generating mechanisms. These methods use predominantly nonparametric models, such as splines 2 , and more recently latent stochastic processes, such as Gaussian processes (GP) 3,4 . While spline models can implement complex nonlinear functions, they are less efficient in modelling effects of covariate interactions. GP is a principled, probabilistic approach to learn non-parametric models, where nonlinearity is implemented through kernels 5 . A GP modelling framework is adopted in this work due to its flexibility and probabilistic formulation.
GPs have become a popular method for non-parametric modelling, especially for time-series data, and a wide variety of kernel functions have been proposed for different modelling tasks. A GP model can be made additive by defining the kernel function to be a sum of kernels. Similarly, a product of two or more kernels is also a valid kernel 5 . Thus, GPs can be made more interpretable and flexible by decomposing the kernel into a sum of individual and product (interaction) kernels much in the same way, conceptually, as with standard linear models. Here we can view the individual kernels as flexible nonlinear functions, which corresponds to the linear terms in linear regression. Plate 6 was among the first to formulate additive GPs by proposing a sum of univariate and multivariate kernels in an attempt to balance between model complexity and interpretability. Duvenaud et al. 7 considered an additive kernel that includes all input interaction terms and proposes a method for learning point estimates of kernel parameters by maximising the marginal likelihood. More complex kernel functions and structures were considered later 8 . Gilboa et al. 9 proposed Bayesian inference for additive GPs, whereas a hypothesis testing framework for nonlinear effects with GP was later proposed 10 . Bayesian semi-parametric models 4 and additive GP regression together with Bayesian inference methods 11 were proposed in the context of longitudinal study designs. Schulam et al. 12 presented a method that combines linear components, spline components, and GP components to model a data set with a hierarchical structure. Computationally efficient model inference for additive GP models (AGPM) using sparse approximations and variational inference was recently proposed 13 .
We present LonGP, a flexible and interpretable non-parametric modelling framework together with a versatile software implementation that solves commonly faced challenges in longitudinal data analysis. LonGP implements an additive GP regression model, with appropriate product kernels, that is specifically designed for longitudinal biomedical data with complex experimental designs. LonGP inherits the favourable features of GPs and multiple kernel learning. Our method extends previous GP (as well as linear mixed effect) models in several ways. Contrary to previous GP methods, LonGP implements a multi-level model that is conceptually similar to the commonly used linear models, and thus enables modelling individual-specific time-varying random effects, for example. LonGP also models ns signals using ns kernel functions and provides interpretable results for the effects of individual covariates and their interactions. We also develop a fully-Bayesian, predictive inference for LonGP and use that to carry out model selection, i.e. to identify covariates that are associated with a given study outcome value.
We demonstrate LonGP's performance and accuracy by analysing various simulated and real longitudinal -omics data sets, including high-throughput longitudinal proteomics and metagenomics data. We also compare LonGP with LME and GPs with automatic relevance determination (GP-ARD) kernel. LonGP with its full functionality is developed as an open-source software tool, which provides great convenience and flexibility of nonparametric longitudinal data analysis for applied research.

Results
Additive GP. Linear models and their mixed effect variants have become a standard tool for longitudinal data analysis. However, a number of challenges still persist in longitudinal analysis, e.g. when data contains nonlinear and ns effects.
GP are a flexible class of models that have become popular in machine learning and statistics. Realizations from a GP correspond to random functions and, consequently, GPs naturally provide a prior for an unknown regression function that is to be estimated from data. Thus, GPs differ from standard regression models in that they define priors for entire nonlinear functions, instead of their parameters. While nonlinear effects can be incorporated into standard linear models by extending the basis functions e.g. with higher order polynomials, GPs can automatically detect any nonlinear as well as ns effects without the need of explicitly defining basis functions 5 . By definition, the prior probability density of GP function values f( where elements of the N-by-N covariance matrix are defined by the GP kernel function [K X,X (θ)] i,j = k(x i , x j |θ) with parameters θ. Mean in Eq. (1) can in general depend on X but zero mean is often assumed in practice. Covariance (also called the kernel function) of the normal distribution defines the smoothness of the function f, i.e. how fast the regression function can vary. Intuitively speaking, although GP is defined such that any finitedimensional marginal has a Gaussian distribution, GP regression is a non-parametric method in the sense that the regression function f has no explicit parametric form. More formally, GP contains countably infinite many parameters that define the regression function, which are the function values f at all possible inputs x 2 X. For a comprehensive introduction to GPs we refer the reader to the book 5 and the Methods section.
GP models can be made more flexible and interpretable by making them additive, where the kernel (covariance) is a sum of kernels (covariances) and each kernel models the effect of individual covariates or their interactions, i.e. f(x) = f (1) . Intuitively one can think that each GP component f ( j) now implements a nonlinear function that specifies the corresponding effect, and the overall effect of several covariates is then the sum of these nonlinear functions. This is achieved by using specific kernels for different types of covariates, such as squared exponential (se) kernel for continuous covariates, constant (co), binary (bi), and categorical (ca) kernels for discrete covariates, and products of these kernels for interaction terms. Moreover, ns signals can be accounted for by incorporating ns kernels. Figure 1 shows an example where biomarker data y is simulated from an AGPM that depends on continuous covariates age (age) and time from a disease event (diseaseAge) as well as discrete covariates ID (id) and location (loc) as follows: . This example provides an intuitive illustration of nonlinear and ns effects of different kernels mentioned above. For example, continuous covariate age has a nonlinear effect on y and similarly continuous covariate diseaseAge has a nonlinear and ns effect on y, where the largest change in the effect is localized at the time of disease onset. The overall cumulative effect is then defined by the sum of the individual nonlinear effects (Fig. 1, bottom row, second panel from right), and measurements of biomarker y are corrupted by additive noise (Fig. 1, bottom row, right panel). In case a study contains other covariates or interaction terms, the additive GP regression provides a very flexible modelling framework that can be adjusted to a number of different applications.
Longitudinal studies typically involve two interrelated statistical questions: prediction of an outcome and model selection. While standard linear models are commonly constructed using hypothesis testing, here we develop a Bayesian predictive model selection method for the proposed AGPM that combines several state-of-the-art methodologies, including both Markov chain Monte Carlo (MCMC) sampling and approximate inference using central composite design (CCD). Furthermore, our model selection strategy involves assessing the predictive performance using cross-validation (with or without importance sampling), Bayesian bootstrap and a model search strategy for accurate model selection. For details of LonGP's statistical methodology, see Methods section. We tested LonGP on simulated data sets and two real data sets, including a longitudinal metagenomics 14 and a proteomics data sets 15 , which are described below.
Simulated data sets. We first carried out a large simulation study to test and demonstrate LonGP's ability to correctly infer associations between covariates and target variables from longitudinal data. Here, we are primarily interested in answering two questions: is LonGP able to select the correct model as well as the correct covariates that were used to generate the data and can we detect disease-associated signals. We simulated nonlinear and ns -omics data sets from five different generative AGPM: To set up our simulation scenario, we first use P = 40 individuals (which are divided into 20 cases and 20 controls for AGPM4 and AGPM5 due to the presence of sero effect in cases), each with n i = 13 data points ranging from 0 to 36 months with an increment of three months, thus specifying the age covariate. Other covariates are randomly simulated using the following rules. The disease occurrence time is sampled uniformly from 0 to 36 months for each case subject and diseaseAge is computed accordingly. We make the effect of diseaseAge ns by transforming it with the sigmoid function from Eq. (16), such that majority of changes occur in the range of −12 to +12 months. The loc and gender are i.i.d. and sampled from a Bernoulli distribution with p = 0.5 for each individual, where gender and group act as irrelevant covariates. The continuous covariates are subjected to standardisation after being generated, such that the mean of each covariate is 0 and standard deviation is 1. We then sample latent function values and data from all the five models with the kernels described above (for details, see Methods), where the lengthscales for continuous (standardised) covariates are set to 1 for the shared components and 0.8 for the interaction components. Case Control f (4) , loc = 0 f (4) , loc = 1 f (5) , loc = 0 f (5) , loc = 1 10 20 30  0  10  20  30  0  10  20  30 0  10  20  30  0  10  20  30  0  10 20 30 We set the variances of each shared component to 4 and noise to 3, i.e. σ 2 age ¼ σ 2 diseaseAge ¼ σ 2 loc ¼ σ 2 id ¼ 4 and σ 2 ε ¼ 3. With these specifications, we generate 100 data sets for each AGPM. A randomly generated longitudinal data set from AGPM5 is visualised in Fig. 1 (Note, the order of latent functions is changed for better visualisation).
In the inference, all covariates including irrelevant group and gender are used, which means that there are 2 5 = 32 candidate models to choose from. Interaction terms are allowed for all covariates except for diseaseAge. Table 1 shows the distribution of selected models for each generating AGPM, with the numbers in bold font indicating correctly identified models. Table 1 shows that LonGP can achieve between 88% and 98% accuracy in inferring the correct model with these parameter settings. Results in Table 1 also show that it becomes more challenging to identify the correct model as the generating model becomes more complex, which is expected. LonGP can accurately detect the disease related signal as well, since the diseaseAge covariate is included in the final model for 97% of the simulation runs for both AGPM4 and AGPM5 models (see Table 1). Moreover, LonGP is notably specific in detecting the diseaseAge covariate as the percentage of false positives is only 0%, 1% and 0% for AGPM1, AGPM2 and AGPM3, respectively (see Table 1).
To better characterise LonGP's performance in different scenarios, we tested how the amount of additive noise affects the results. We varied the noise variance as σ 2 ε 2 f1; 3; 5; 8g and kept all other settings unchanged, effectively changing the signal to noise ratio or the effect size relative to the noise level. Figure 2a shows that the model selection accuracy increases consistently as the noise variance decreases. We next tested how the number of study subjects (i.e. the sample size P) affects the inference results. We set the number of case-control pairs to {(10, 10), (20,20), (30, 30), (40, 40)} and kept all other settings unchanged. As expected, Fig. 2b shows how LonGP's model selection accuracy increases as the sample size increases. Similarly, LonGP maintains its high sensitivity and specificity in detecting the diseaseAge covariate across the additive noise variances and samples sizes considered here (see Supplementary Tables 1 and 2).
Finally, we also quantified how the sampling interval (i.e. the number of time points per individual) affects the inference results. We varied the sampling intervals as {2, 3, 4, 6} (months) corresponding to n i ∈ {19, 13, 10, 7} time points for each individual and kept all other simulation settings unchanged. Supplementary Table 3 shows that, again, the model selection accuracy changes consistently with the number of measurement time points. Supplementary Table 4 shows that changing the sampling interval has a small but systematic effect on the sensitivity and specificity of detecting the diseaseAge covariate.
To demonstrate LonGP's performance relative to previous methods, we analysed the same simulated data sets using three traditional methods: (a) LME, (b) LME with second-order polynomial terms (LME-P), and (c) GP-ARD. We include GP-ARD in performance comparisons because it is the most commonly used method for assessing relevance of variables in GP regression. The ARD kernel contains an individual lengthscale parameter for each input covariate, and the relevance of each covariate is determined by the estimated length-scale value, large (small) values indicate lower (higher) relevance. In LME and LME-P, the same effects in the generating models are considered as for LonGP. Specifically, individual variations are modelled as random effects and others are modelled as fixed effects. In GP-ARD, only shared effects are considered and interactions are not considered. See Supplementary Method 1 for detailed descriptions.  AGPM1  98  2  0  0  0  0  0  100  AGPM2  0  95  2  1  0  2  1  99  AGPM3  0  0  95  0  0  5  0  100  AGPM4  0  3  0  92  3  2  97  3  AGPM5  0  0  3  8  88  1  97  3 The data is simulated with P = 40 individuals (20 cases and 20 controls), noise variance σ   Figure 3 shows the number of times the correct models were identified and the number of times the diseaseAge term was detected in the final model, for the same experiment settings as in Table 1. LonGP has a notably better accuracy than the traditional methods in selecting the correct model ( Fig. 3a), as well as significantly better sensitivity (AGPM4-5 in Fig. 3b) and specificity (AGPM1-3 in Fig. 3b) in detecting the disease related effect. Full results of LME, LME-P, and GP-ARD over all simulated data sets are provided in Supplementary Note 1 and Supplementary Data 1.
Overall, our results suggest that LonGP can accurately infer the correct model structure and also detect a relatively weak disease related signal with as few as 10 case-control pairs and notable noise variance. Moreover, the model selection accuracy increases as the number of individuals (biological replicates), the number of time points, and signal to noise ratio increases.
Longitudinal metagenomics data set. We used LonGP to analyse a longitudinal metagenomics data set 14 . In this data set, 222 children from Estonia, Finland, and Russia were followed from birth until the age of three years through the collection of longitudinal stool samples, which were subsequently analysed by metagenomic sequencing. The aim of this study was to characterise the developing gut microbiome in infants from countries with different socio-economic status and to determine the key factors affecting the early gut microbiome development. Here, we model the microbial pathway profiles (i.e. total count of metagenomic reads mapping to bacterial genes involved in a pathway) quantifying the functional potential of the metagenomic communities. There are in total N = 785 metagenomic samples. To focus our analysis on pathways with sufficiently strong signal, we include in our analysis pathways that have been detected (i.e. at least one sequence read maps to genes of a pathway) in at least 64% (=500/785) of the samples. Let c ij denote the number of reads mapping to genes in the jth (j = 1, …, 394) pathway in sample i (i = 1, …, 785) and C i is the total number of sequencing reads for sample i. The target variable is defined by log 2 ðc ij =C i Á medianðC 1 ; C 2 ; :::; C N Þ þ 1Þ.
We selected the following 7 covariates for our additive GP regression based on their known interaction with the gut microbiome: age, bfo, caesarean, est, fin, rus, and id. bfo indicates whether an infant was breastfed at the time of sample collection; caesarean indicates if an infant was born by Caesarean section; est, fin, and rus are bi covariates indicating the home country of the study subjects (Estonia, Finland, and Russia, respectively); id denotes the study subjects. We use SE kernel for age and bfo, ca kernel for id, and bi kernel for caesarean, est, fin, and rus. Interactions are allowed for all covariates except for bfo.
We applied LonGP to analyse each microbial pathway as a target variable separately and inferred the covariates for each target variable as described above. The selected models and explained variances of the components for all 394 pathways are available in Supplementary Data 2. A key discovery in Vatanen et al. 14 was that Lipid A biosynthesis pathway was significantly enriched in the gut microbiomes of Finnish and Estonian children compared to Russian children. Our analysis confirmed the linear model based analysis 14 by selecting the following model for Lipid A biosynthesis pathway: y ¼ f ca se ðid ageÞ þ ε, which shows the difference between the Russian and Finnish study groups. Explained variance of bfo was 0.2% and bfo was thus excluded from the final model. Figure 4a shows the normalised Lipid A biosynthesis data together with the additive GP predictions using kernels y ¼ f  14 with an exception that the apparent nonlinearity is captured by the AGPM, but otherwise the new model conveys the same information. Our analysis also identified many novel pathways with differences between Finnish, Estonian, and Russian microbiomes and is reported in Supplementary Data 2.
Longitudinal proteomics data set. We next analysed a longitudinal proteomics data set from a type 1 diabetes (T1D) study 15 . Liu et al. measured the intensities of more than 2000 proteins from plasma samples of 11 children who developed T1D and 10 healthy controls. For each child, nine longitudinal samples were analysed with the last sample for each case collected at the time of T1D diagnosis, resulting in a total of 189 samples. Detection of T1D associated autoantibodies in the blood is currently held as the best early marker that predicts the future development of T1D, and most of the individuals turning positive for multiple T1D autoantibodies will later on develop the clinical disease. Identifying early markers for T1D that would be detected even before the appearance of T1D associated autoantibodies is a grand challenge. It would allow early disease prediction and possibly even intervention.
Liu et al. used a linear mixed model with quadratic terms to detect proteins that behave differently between cases and controls. However, they only regressed on age since they did not take into  15 ) of the cases and therefore could not model changes associated with seroconversion. We use LonGP to re-analyse this longitudinal proteomics data set 15 and try to find additional proteins with differing plasma expression profiles between cases and controls in general, as well as focusing on changes occurring close to seroconversion. Note that the age at which the T1D autoantibodies are detected is different for each individual. For each individual, the GP sero effect is then localized at the individualspecific seroconversion time point, making the sero effect consistent in the sero age coordinate but difficult (or impossible) to detect in the absolute age coordinate. The sero effect aims to detect nonlinear and ns effects that appear at specific times before and after the seroconversion, possibly near the time of the seroconversion. The modelling is done with the following covariates: age, sero (measurement time minus seroconversion time, see Methods), group (case or control), gender, and id. 1538 proteins with less than 50% missing values are kept for further analysis. We follow the same preprocessing steps 15 to get the normalised protein intensities. We use SE kernel for age, input warped ns SE kernel for sero, bi kernel for group as well as for gender, and ca kernel for id. Interactions are allowed for all covariates except for sero. The selected models and explained variances of each component for all 1538 proteins are reported in Supplementary Data 3.
We detected 38 proteins that are associated with the group covariate. In the original analyses by Liu et al. 15 [ Table 1 and  Supplementary Table S3], 18 of these proteins had the same temporal expression trend between the cases and the controls. As an example, we found the levels of Carbohydrate sulfotransferase 3 (UniProt Accession Q7LGC8) to be higher in cases than controls. The selected model for the protein is y ¼ f bi se ðgroup ageÞ against the real protein intensity to better visualise the predicted group difference.
We also detected altogether 47 proteins whose expression levels were changed relative to the time of seroconversion (sero covariate), with 20 of them having the same expression trend between the cases and the controls based on the analyses by Liu et al. 15 ( Table 1 and Supplementary Table S3). For two selected proteins, Prosaposin (Uniprot Accession P07602) and Opioid-binding protein/cell adhesion molecule (Uniprot Accession Q14982), protein expression levels were best explained by the LonGP model y ¼ f  Figure 4c shows the contribution of the sero component together with the real (centred) protein intensities as a function of seroconversion age for protein P07602. The sero component increases and then stabilises at a higher baseline after seroconversion in the cases. This is shown by the lower baseline of cases before seroconversion and higher baseline after seroconversion. Supplementary Fig. 1 shows the predicted mean of each component as well as the cumulative effects for protein P07602. Supplementary Fig. 2 shows a different type of sero effect for protein Q14982 where a temporary increase in protein intensity is observed close to the seroconversion time for many T1D patients, in contrast to the slowly decreasing age trend. Supplementary Fig. 3 shows the predicted individual components and the cumulative effects for protein Q14982.

Discussion
General LME is a simple yet powerful modelling framework that has been widely accepted in biomedical literature. Still, applications of linear models can be challenging, especially when the underlying data generating mechanisms contain unknown nonlinear effects and correlation structures or ns signals.
Here we have described LonGP, a non-parametric additive GP model for longitudinal data analysis, which we demonstrate to solve many of the commonly faced modelling challenges. As LonGP builds on GP regression, it can automatically handle irregular sampling time points and time-varying covariates. Missing values are also easily accounted for via bi mask kernels without any extra effort. More generally, LonGP provides a flexible framework to choose appropriate covariance structures for the correlated outcomes via the GP kernel functions, and the chosen kernels are properly adjusted to the given data by carrying out Bayesian inference for the kernel parameters. GP are known to be capable of approximating any continuous function. Thus, LonGP is applicable to any longitudinal data set. Furthermore, incorporating ns kernels into the kernel mixture easily adapts LonGP for ns signals. This allows us to model longitudinal phenomenon whose statistical properties are not time-shift invariant, which is especially useful for modelling e.g. pathophysiological mechanisms and changes that can have faster dynamics around a disease onset time than changes at other time points. While it is in principle possible to model ns signals with linear models, ns GP regression with Bayesian inference can be conveniently formulated and implemented using ns kernel functions, as we have shown here. Similar to standard GP regression methods, LonGP also provides predictions with quantified uncertainties (Eqs. (19) and (22)). As an example, corresponding to Fig. 4b, c, Supplementary Figs. 7 and 8 show the one standard deviation around the predictive mean. As the effect of individual covariates and their interactions can be quantified from the kernel mixture, LonGP provides an interpretable, nonparametric probabilistic analysis framework.
LonGP is equipped with an advanced Bayesian predictive inference method that utilises several recent, state-of-the-art techniques which make model inference accurate and improves running time especially for larger data sizes and more complex models. Finally, LonGP can be easily tailored for a variety of different longitudinal study designs. For example, multiple disease sub-types can be accounted for by using a ca kernel instead of a bi (case-control) kernel. Similarly, continuous phenotypes can be modelled using continuous kernels. For example, the ability to model the extensive, nonlinear age-associated changes observed in serum protein expression levels during early childhood 16 should improve the detection of disease-associated signals from such data.
LonGP has at least three features which makes it more efficient in our simulated data scenarios than the standard LME model. First, kernels automatically implement arbitrary nonlinear effects, whereas LME model is limited to linear (or second-order polynomial) effects. This is accentuated by having several nonlinear effects for individual covariates or their interactions. Moreover, characterising the posterior of the kernel parameters further improves LonGP's ability to identify nonlinear effects: instead of optimizing the kernel parameters to a given data set we also infer their uncertainty, and thus improve predicting new/unseen data points and inferring the covariate effects at the end. Second, LonGP contains ns effects that can be difficult to model using linear models. Third, LonGP naturally implements individualspecific time-varying random effects, which we consider relevant in modelling real biomedical longitudinal data sets, too.
Compared with traditional linear regression methods, LonGP is also useful in finding relatively weak signals that have an arbitrary shape. The dominant factor for Prosaposin (P07602) expression in the longitudinal proteomics data set 15 is age (explained variance 25%), while the disease related effect sero (explained variance 5.5%) is a minor factor, as shown in Supplementary Fig. 1. Glucose-induced secretion of Prosaposin has been observed from a murine pancreatic beta-cell line 17 . Based on the LonGP analysis of the longitudinal proteomics data 15 , the expression of Prosaposin in plasma decreases with age, but the baseline expression of the protein stabilises at a higher level in the cases after seroconversion. Similar changes were also detected by LonGP for secretogranin-3 (Q8WXD2), a protein with important functions in insulin secretory granules 18 , and protein FAM3C (Q92520) also secreted from a murine beta-cell line in response to glucose 17 . However, no statistically significant differences were detected in the expression values of these proteins between cases and controls in the original analyses 15 . Seroconversion-associated changes in plasma levels of these proteins might reflect changes in the function or status of pancreatic beta-cells already before the onset of T1D. These, as well as other seroconversion-associated proteins revealed by our study provide a list of candidate proteins for further analysis with a more extensive sample size using, for example, targeted proteomics approaches. Similarly, in the longitudinal metagenomics data set 14 , we also observed nonlinear effects for many of the covariates, some of which warrant further experimental studies. Revealing such disease related effects is essential in understanding mechanisms of disease progression and uncovering biomarkers for diagnostic purposes.
Apart from LME, only a few software packages exist for longitudinal data analysis. LonGP is accompanied by a software package that has all the functionality described here. Overall, supported by our results and open-source software implementation, we believe LonGP can be a valuable tool in longitudinal data analysis.

Methods
Notation. We model target variables (gene/protein/bacteria/etc;) one at a time. Let us assume that there are P individuals and there are n i time-series measurements from the ith individual. The total number of data points is thus N ¼ P P i¼1 n i . We denote the target variable by a column vector y = (y 1 , y 2 , ... y N ) T and the covariates by X = (x 1 , x 2 , ..., x N ), where x i = (x i1 , x i2 , ..., x id ) T is a d-dimensional column vector and d is the number of covariates. We denote the domain of the jth variable by X j and the joint domain of all covariates is X ¼ X 1 X 2 ::: X d . In general, we use a bold font letter to denote a vector, an uppercase letter to denote a matrix, and a lowercase letter to denote a scale value.
Gaussian process. GP can be seen as a distribution of nonlinear functions 5    where elements of the N-by-N covariance matrix are defined by the kernel [K X, We use the following hierarchical GP model where π(ϕ) defines a prior for the kernel parameters (including σ 2 ε ), σ 2 ε is the noise variance, and I is the N-by-N identity matrix. For a Gaussian noise model, we can marginalise f analytically 5 Additive GP. To define a flexible and interpretable model, we use the following AGPM with D kernels where each f ðjÞ ðxÞ $ GPð0; k ðjÞ ðx; x′jθ ðjÞ ÞÞ is a separate GP with kernel-specific parameters θ (j) and ε is the additive Gaussian noise. By definition, for any finite collection of inputs X = (x 1 , x 2 , ..., x N ), each GP f (j) (X) follows a multivariate Gaussian distribution. Since a sum of multivariate Gaussian random variables is still Gaussian, the latent function f also follows a multivariate Gaussian distribution. Denote Θ ¼ ðθ ð1Þ ; θ ð2Þ ; :::; θ ðDÞ ; σ 2 ε Þ, then the marginal likelihood for the target variable y is where the latent function f has been marginalised out as in Eq. (6). To simplify notation, we define For the purposes of identifying covariate subsets that are associated with a target variable, we assume that each GP depends only on a small subset of covariates f ðjÞ ðxÞ : X ðjÞ ! X, where X ðjÞ ¼ Q X i ; i 2 I j f1; ; dg and Y is the domain for target variable. I j are indices of the covariates associated with the jth kernel.
Kernel functions for covariates. Longitudinal biomedical studies typically include a variety of continuous, ca, and bi covariates. Typical continuous covariates include age, time from a disease event (sampling time point minus disease event time point), and season (time from beginning of a year). Typical ca or bi covariates include group (case or control), gender, and id (id of an individual). In practice, to set up the AGPM, one needs to choose appropriate kernels for different covariates and their subsets (or interactions). We designed the following kernels to reflect common domain knowledge of longitudinal study designs, which covers most common modelling needs. Note that all kernels, including interactions, are automatically determined given a set of input covariates according to the algorithm in Supplementary Method 2.
Stationary kernels. In LonGP, we use the following specific stationary kernels which only involve one or two covariates. se kernel for continuous covariates where ' se is the length-scale parameter, σ 2 se is the magnitude parameter and θ se ¼ ð' se ; σ 2 se Þ. Length-scale ' se controls the smoothness and magnitude parameter σ 2 se controls the magnitude of the kernel. Periodic kernel for continuous covariates k pe ðx i ; x j jθ pe Þ ¼ σ 2 pe exp À where ' pe is the length-scale parameter, σ 2 pe is the magnitude parameter, γ is the period parameter, and θ pe ¼ ð' pe ; σ 2 pe ; γÞ. Length-scale ' pe controls the smoothness, σ 2 pe controls the magnitude, and γ is the period of the kernel. In our model, γ corresponds to a year. co kernel where θ ¼ ðσ 2 co Þ is the magnitude parameter of the co signal. ca kernel for discrete-valued covariates bi (mask) kernel for bi covariates Product kernel between any two valid kernels, such as k bi (⋅) and k se (⋅) (similarly for any other pair of kernels) where θ ðp′Þ bi and θ ðq′Þ se are kernel parameters for the pth and qth covariates, respectively.
Non-stationary (ns) kernel. It may be realistic to assume that the target variable (e.g. a protein) changes rapidly only near a special event, such as disease initiation or onset. This poses a challenge for GP modelling with se kernel since the kernel is stationary: changes are homogeneous across the whole time window. ns GPs can be implemented by using special ns kernels, such as the neural network kernel, by defining the kernel parameters to depend on input covariates [19][20][21] or via input or output warpings 22 . We propose to use the input warping approach and define a bijective mapping ω:(−∞, + ∞) → (−c, c) for a continuous time/age covariate t as where a, b, and c are predefined parameters: a controls the size of the effective time window, b controls its loc, and c controls the maximum range. The ns kernel is then defined as where θ se are the parameters of the SE kernel. Supplementary Fig. 4 shows an example transformation with a = 0.5, b = 0 and c = 40, where we limit the disease related change to be within one year of the disease event. Effectively, all changes in the transformed space correspond approximately to ±12 month time window in the original space. Supplementary  Fig. 5 shows randomly sampled functions using stationary and ns SE kernels with the same kernel parameters. The ns SE kernel naturally models signals that are spike-like or exhibit a level difference between before and after the disease event, which can be interpreted as a permanent disease effect.
The same parameters as Supplementary Fig. 4 are used for ns kernels in all experiments of the Results section.
Kernel specification in practice. The data sets analysed in this work include 11 covariates and covariate pairs which we model using the following kernels (see next section for prior specifications). age: The shared age effect is modelled with a slowly changing stationary SE kernel. Time from a disease event or diseaseAge: We use the product of the bi kernel and the ns SE kernel (assuming cases are coded as 1 and controls as 0). season: We assume that the target variable exhibits an annual period and is modelled with the periodic kernel. group: We model a baseline difference between the cases and controls, which corresponds to average difference between the two groups, using the product of the bi kernel and the co kernel. gender: We use the same kernel as for group covariate. loc: bi covariate indicating if an individual comes from a certain loc. We use the same kernel as for group covariate. id: We assume baseline differences between different individuals and model that by the product of the categorical kernel and the co kernel. group × age: We assume that the differences between cases and controls varies across age. That difference is modelled by the product of the bi kernel and the stationary SE kernel. gender × age: The same kernel as for group × age is used for this interaction term. It implements a different age trend for males and females. id × age: We assume different individuals exhibit different age trends. This longitudinal random effect is modelled by the product of the ca kernel and the SE kernel. This kernel is especially helpful for modelling individuals with outlying data points. group × gender: This interaction term assumes that male (or female) cases have a baseline difference compared to others. The product of two bi kernels and the co kernel is used. Although discrete covariates are modelled as a product of the co kernel and the bi or ca kernel, the co kernel is not explicitly included in our notation.
In practice, we often observe missing values in the covariates. Missing values can be due to technical problems in measurements or because some covariates may not be applicable for certain samples, e.g. diseaseAge is not applicable to controls since they do not have a disease. In LonGP, we construct a bi flag vector for each covariate. The missing values are flagged as 0 and non-missing values are flagged as 1. Then, we construct a bi kernel for this flag vector and multiply it with any kernel that involves the covariate. Consequently, any kernel involving a missing value is evaluated to 0, which means that their contribution to the target variable is 0. All missing values are handled in this way by default and we do not use any extra notations for it. Interaction terms always refer to product kernels with non-missing values, assuming missing values are already handled.
Prior specifications. Before the actual GP regression, we standardise the target variable and all continuous covariates such that the mean is zero and the standard deviation is one. This helps in defining generally applicable priors for the kernel parameters. After the GP regression, the predictions are transformed back to the original scale. We visualise the results in the original scale after centering the data by subtracting the mean.
We define a prior pðΘÞ ¼ Q D j¼1 pðθ ðjÞ Þ pðσ 2 ε Þ for the kernel parameters as follows. For continuous covariates without interactions, we use the log normal prior (μ = 0 and σ 2 = (log(1) − log(0.1)) 2 /4) for the length-scales (' se and ' pe ) and the square root student-t prior (μ = 0, σ 2 = 1, and ν = 20) for the magnitude parameters (σ 2 se and σ 2 pe ). This length-scale prior penalises small length-scales such that smoothness less than 0.1 has very small probability and the mode is approximately at 0.3. For continuous covariates with interactions, the prior for the magnitude parameters is the same as for without interactions and the half truncated student-t prior (μ = 0, σ 2 = 1, ν = 4) is used for the length-scale, which allows smaller length-scales.
Scaled inverse chi-squared prior (σ 2 = 0.01 and ν = 1) is used for the noise variance parameter σ 2 ε . The period parameter γ of the periodic kernel is predefined by the user. Square root student-t prior (μ = 0, σ 2 = 1, and ν = 4) is used for the magnitude parameter σ 2 co of all co kernels. Supplementary Fig. 6 visualises all the above-described priors with their default hyperparameter values.
Model inference and prediction. Given the AGPM, we are next interested in the posterior inference of the model conditioned on data (y, X). Assume, for now, that for each additive component f (j) the kernel k (j) (⋅), its inputs X ðjÞ and prior are specified. We use two different inference methods, MCMC and a deterministic evaluation of the posterior with the CCD.
For MCMC we use the slice sampler as implemented in the GPStuff package 23,24 to sample the parameter posterior pðΘjy; XÞ / pðyjX; ΘÞpðΘÞ; ð18Þ where the likelihood is defined in Eq. (8). After convergence checking from 4 independent Markov chains (details in Supplementary Method 3), we obtain S posterior samples fΘ s g S s¼1 , where Θ s ¼ ðθ ð1Þ s ; θ ð2Þ s ; :::; θ ðDÞ s ; σ 2 ε;s Þ. We use the posterior samples to approximate the predictive density for test data X Ã ¼ ðx Ã 1 ; x Ã 2 ; :::; x Ã n Þ pðf Ã jy; X; X Ã Þ ¼ R pðf Ã jy; X; X Ã ; ΘÞpðΘjy; XÞdΘ where μ s ¼ K X Ã ;X ðΘ s ÞK y ðΘ s Þ À1 y ð20Þ are the standard GP prediction equations adapted to additive GPs with K X Ã ;X ðΘ s Þ ¼ P D j¼1 K ðjÞ X Ã ;X ðθ ðjÞ s Þ encoding the sum of cross-covariances between the inputs X and test data points X * (K X Ã ;X Ã is defined similarly) and K y (Θ s ) is defined in Eq. (9).
As an alternative approach to slice sampling for higher dimensional models, we also use a deterministic finite sum using the CCD to approximate the predictive densities for GPs 25,26 . CCD assumes a split-Gaussian posterior q(⋅) for (logtransformed) parameters γ = log(Θ) and defines a set of R points fγ r g R r¼1 (fractional factorial design, the mode, and so-called star points along whitened axes) to estimate the predictive density with a finite sum pðf Ã jy; X; X Ã Þ % P R r¼1 pðf Ã jy; X; X Ã ; γ r Þqðγ r ÞΔ r where N(μ r , Σ r ) is computed as in Eqs. (20,21), q(γ r ) is the split-Gaussian posterior, and Δ r are the area weights for the finite sum 26 .
Predictions and visualisations for an individual kernel k (j) (1 ≤ j ≤ D) are obtained by replacing μ s and Σ s in Eqs. (19) and (22)  Model comparison. We have described how to build and infer an AGPM for a given target variable using a set of kernels and a set of covariates for each kernel. A model M can be specified by a 3-tuple ðD; fk ðjÞ g D j¼1 ; fI j g D j¼1 Þ, where D ≥ 1. However, all covariates may not be relevant for the prediction task and often the scientific question is to identify a subset of the covariates that are associated with the target variable. For model selection, we use two cross-validation variants and Bayesian bootstrap as described below.
Leave-one-out cross-validation. We use leave-one-out cross-validation (LOOCV) to compare the models when a continuous covariate such as age, diseaseAge or season is added to a model. In this case, a single time point of an individual is left out as test data and the rest are kept as training data. We use MCMC to infer the parameters of a given model and calculate the following leave-one-out predictive density pðy i jy Ài ; X; MÞ ¼ where we have omitted X and M in the notation for simplicity and Θ s is a MCMC sample from the full posterior p(Θ|y). However, directly applying Eq. (26) usually results in high variance and is not recommended. We use a recently developed Pareto smoothed importance sampling to control the variance by smoothing the importance ratios p(Θ s |y −i )/p(Θ s |y) 27,28 .
The importance sampling phase is fast and it is shown to be accurate 27 . Therefore, we only need to run MCMC inference once for the full training data. Once the leave-one-out predictive probabilities in Eq. (25) are obtained for all the data points, the GP models are compared using Bayesian bootstrap described later this section.
Stratified cross-validation. In stratified cross-validation (SCV), we leave out all time points of an individual as test data and use the rest as training data. SCV is used when a ca/bi covariate, such as group or gender, is added to the model. Let y i denote all measured time points corresponding to an individual i (X i is defined similarly) and y −i = y\y i . Similar to LOOCV, we compute the predictive density of the test data points y i pðy i jy Ài ; X; MÞ ¼ Z pðy i jΘ; X; MÞpðΘjy Ài ; X; MÞdΘ: This can be calculated by setting f * ← y i , X * ← X i , y ← y −i , and X ← X −i in Eq. (22). Since importance sampling does not work well in this case, we apply the CCD inference P times (once for each individual). Also, we use CCD with SCV as it is much faster than MCMC.
Model comparison using Bayesian bootstrap. After obtaining the leave-one-out predictive densities (Eqs. (25) or (27)) for a collection of models, we use Bayesian bootstrap to compare the involved models. Let us start with a simple case where two models M 1 and M 2 are compared. In the LOOCV setting, we compare the models by computing the average difference of their log-predictive densities 1 N X N i¼1 logðpðy i jy Ài ; X; M 1 ÞÞ À logðpðy i jy Ài ; X; M 2 ÞÞ which measures the difference of the average prediction accuracy of the two models. If Eq. (28) is greater than 0, then model M 1 is better than M 2 , otherwise model M 2 is better than M 1 . Comparison in Eq. (28) does not provide a probabilistic quantification of how much better one model is compared to the other. We thus approximate the relative probability of a model being better than another model using Bayesian bootstrap 29 , which assumes y i only takes values from the observations y = (y 1 , y 2 , ... y N ) T and has zero probability at all other values. In Bayesian bootstrap, the probabilities of the observation values follow the N-dimensional Dirichlet distribution Dir(1, 1, ..., 1). More specifically, we bootstrap the samples N B times (b = 1, …, N B ) and each time we get the same N observations y, with each observation taking weight w bi (i = 1, …, N) from the N-dimensional Dirichlet distribution. The N B bootstrap samples are then summarised to obtain the probability of M 1 being better than M 2 1 N B w bi log pðy i jy Ài ; X; M 1 Þ pðy i jy Ài ; X; where δ{⋅} is the Heaviside step function and w bi is the bootstrap weight for the ith data point in the bth bootstrap iteration 27 . We call the result of Eq. (29) LOOCV factor (LOOCVF). The above strategy also works when comparing multiple models. Instead of calculating the heaviside step function in the bth bootstrap iteration, we simply choose the model with the highest rank by sorting the models using where m indices the model. In the end, we count the occurrences N m of each model being the best across all N B bootstrap samples and we compute the posterior probability of model M m as N m /N B , which we term as the posterior rank probability.
For SCV, we replace y i with y i and y −i with y −i in Eqs. (28,29) and follow the same procedure as above to compare the models. Eq. (29) is then termed as the SCV factor (SCVF). In practice, we set the threshold of the LOOCVF to be 0.8 and SCVF to be 0.95, i.e. the LOOCVF (resp. SCVF) of the extended model versus the original model needs to be larger than 0.8 (resp. 0.95) for a continuous covariate (resp. bi covariate) to be added.
Although Eq. (30) can be used to compare any subset of models, complex models will dominate the posterior rank probability when compared together with simpler models. Hence, LonGP only uses it to compare candidate models of similar complexity (see next Section and Supplementary Method 2).
Step-wise additive GP regression algorithm. The space of all models is large and thus an exhaustive search for the best model over the whole model space would be too slow in practice. Two commonly used model (or feature) selection methods include forward and backward search techniques. Starting with the most complex model, as in the backward search approach, is not practical in our case, so we propose to use a greedy forward search approach similar to step-wise linear regression model building. That is, we start from the base model that only includes the id covariate. Then we add continuous covariates to the model sequentially until the model cannot be further improved. During each iteration, we first identify the covariate that improves the model the most (Eq. (30)) and test if the LOOCVF of a new proposed model versus the current model exceeds the threshold of 0.8 (Eq. (29)). While including a continuous covariate, we also include relevant interaction terms (allowed interaction terms defined by the user). After adding continuous covariates, we add discrete (ca or bi) covariates sequentially to the model until it cannot be further improved. As with continuous covariates, during each iteration, we first identify the discrete covariate that improves the model the most and test if the SCVF of a new proposed model versus the current model exceeds the threshold of 0.95. While including a discrete covariate, we also include relevant interaction terms (allowed interactions specified by user). Details of our forward search algorithm are given in Supplementary Method 2 together with a pseudo-algorithm description. We note that although step-wise model selection strategies are commonly used with essentially all modelling frameworks, they have the danger of overfitting a given data. To avoid overfitting, we implement our search algorithm such that an additional component is added to the current model only if the more complex model improves the model fit significantly, as measured by the LOOCVF and SCVF.
Once all the covariates have been added, the kernel parameters of the final model are sampled using MCMC and kernel-specific predictions on the training data X are computed using Eq. (19). Additionally, a user can choose to exclude kernels that have a small effect size as measured by the fraction of total variance explained. We require component specific variances to be at least 1%. The software is implemented using features from the GPStuff package 24 and implementation is discussed in Supplementary Method 4.

Data availability
All data generated or analysed during this study are included in this published article (and its supplementary information files).