Continuous time Gaussian process dynamical models in gene regulatory network inference

One of the focus areas of modern scientific research is to reveal mysteries related to genes and their interactions. The dynamic interactions between genes can be encoded into a gene regulatory network (GRN), which can be used to gain understanding on the genetic mechanisms behind observable phenotypes. GRN inference from time series data has recently been a focus area of systems biology. Due to low sampling frequency of the data, this is a notoriously difficult problem. We tackle the challenge by introducing the so-called continuous-time Gaussian process dynamical model (GPDM), based on Gaussian process framework that has gained popularity in nonlinear regression problems arising in machine learning. The model dynamics are governed by a stochastic differential equation, where the dynamics function is modelled as a Gaussian process. We prove the existence and uniqueness of solutions of the stochastic differential equation. We derive the probability distribution for the Euler discretised trajectories and establish the convergence of the discretisation. We develop a GRN inference method based on the developed framework. The method is based on MCMC sampling of trajectories of the GPDM and estimating the hyperparameters of the covariance function of the Gaussian process. Using benchmark data examples, we show that our method is computationally feasible and superior in dealing with poor time resolution.


Introduction
In 2017, Jeffrey C. Hall, Michael Rosbash, and Michael W. Young were awarded the Nobel prize in physiology or medicine for their discoveries of molecular mechanisms controlling the circadian rhythm of plants.Indeed, one of the focus areas of modern scientific research is to reveal mysteries related to genes, their interactions, and their connection to observable phenotypes.Application areas of analysing interactions of genes are not limited to plants and their circadian clocks.In the field of biomedicine, for example, knowledge of genes and their interactions plays a crucial role in prevention and cure of diseases.
Interactions between genes are typically represented as a gene regulatory network (GRN) whose nodes correspond to different genes, and a directed edge denotes a direct causal effect of some gene on another gene.The usual problem statement is to infer the network topology from given gene expression data.Classically, GRN inference has been based on analysing steady state data corresponding to gene knockout experiments, based on silencing one gene and observing changes in the steady state expressions of other genes.However, carrying out knockout experiments on a high number of genes is costly and technically infeasible.Moreover, for example circadian clocks of plants are oscillating systems, and in practice it can be difficult to determine whether a particular measurement corresponds to a steady state.In contrast, methods that infer GRNs from time series data can infer the networks on a more global level using data from few experiments.Therefore GRN inference from time series data has gained more and more attention recently.We present a method which is mainly designed for inference from time series data, but steady state measurements can be straightforwardly implemented as well.
We model the time series data {y j } m j=0 as samples from a continuous trajectory, y j = x(t j ) + v j where v j represents measurement noise.The continuous trajectory is assumed to satisfy a nonlinear stochastic differential equation (1.1) dx = f (x)dt + dw, where w is some driving process noise.Now x is an R n -valued function, and thus also f is vector-valued, f (x) = .
It is known that the dynamics of one gene can only be influenced by few other genes, that is, a component f j only depends on some components of x.
If function f j depends on x i , in the corresponding GRN there is a link from gene i to gene j.The task in GRN inference is to discover this regulatory interconnection structure between variables.
What is characteristic to the GRN inference problem is that the data tends to be rather poor in terms of temporal resolution and the overall amount of data.Low temporal resolution has a deteriorating effect on network inference -in linear systems (f (x) = Ax in (1.1)) this is illustrated by the fact that matrices A and e A∆T do not share even approximatively the same sparsity pattern when ∆T is big.In addition, many methods use derivatives estimated directly from the time series data using difference approximations or curve fitting techniques.Using different techniques can significantly effect the results, and therefore avoiding such derivative approximations would be desirable.
In this article, we introduce a continuous-time version of the so-called Gaussian process dynamical model (GPDM) [34].Essentially, we model the dynamics function f in (1.1) as a Gaussian process with some covariance function [30].This defines x as a stochastic process.We prove existence and uniqueness of solutions of the corresponding differential equation, and the convergence of the Euler discretisation.We derive a probability distribution for the discretised trajectory, enabling direct MCMC sampling of realisations of this process.This way the derivative approximations are avoided.Network inference is then done by estimating the hyperparameters of the chosen covariance function.The technique of performing variable selection based on estimating variable-specific hyperparameters is known as automatic relevance determination (ARD) [20,23].In our approach, missing measurements as well as non-constant sampling frequency are easy to treat.In the end, we are mainly interested in which of the variable-specific hyperparameters are non-zero.Therefore we introduce indicator variables for these hyperparameters, whose posterior distribution can be considered as a multivariate Bernoulli distribution.Notably, our approach based on MCMC sampling allows to extract more information than just the posterior mean, for example covariances between links.We demonstrate that our approach is superior in dealing with poor time resolution while still remaining computationally feasible.
Linear version of the method is presented in [1], where the MCMC samplers for the trajectory and the network topology are introduced.Other methods that model the dynamics function using a Gaussian process are presented in [3,16,26,27].The main difference with these methods is that they all treat the problem as a nonlinear regression problem with inputoutput pairs [3] y j , y j − y j−1 t j − t j−1 m j=1 or using some other method to approximate the derivatives from the time series data.As mentioned above, we avoid this derivative estimation by fitting continuous-time trajectories to the time series data.
Several GRN inference problems from different types of data have been posed as competitive challenges by the Dialogue for Reverse Engineering Assessments and Methods (DREAM) project.Results and conclusions, as well as some top performers are introduced in [21,14].The inference problems are based mainly on two types of data, namely time series data, and steady state measurements corresponding to gene knockout or knockdown experiments.A review [27] on methods based on time series data concluded that methods based on nonparametric nonlinear differential equations performed best.Such methods include Gaussian process models and random forest models [15].Other types of ODE models tend to make rather restrictive assumptions on the dynamics, such as linear dynamics [17], or a library of nonlinear functions [5,25].Mechanistic models [2,24] try to fit the data using dynamical systems constructed from enzyme kinetics equations.
The rest of the article is organised as follows.In Section 2, we introduce the continuous-time Gaussian process dynamical models, and we prove the existence and uniqueness of solutions, and the convergence of the Euler discretisation.This is applied in Section 3 where we derive the probability distribution for the discretised trajectory.The network inference method is introduced in Section 4 including incorporating several time series and knockout/knockdown experiments.Section 5 is devoted to benchmark data experiments.We apply the method to the DREAM4 In Silico Network Challenge data [21,22,29,31] using either the time series data only, or including also the steady state experiments.In Section 6, we provide a short summary and discuss future prospects.

Continuous-time Gaussian process dynamical models
Discrete-time GPDMs (a.k.a.Gaussian process state space models [9,8]) were originally introduced in [34], whose treatment was based on the GP latent variable models [19].They are an effective tool for analysing time series data that is produced by a dynamical system that is unknown to us, or somehow too complicated to be presented using classical modeling techniques.In the original paper the method was used for human motion tracking from video data.Motion tracking problems remain the primary use of GPDMs [6,10], but other types of applications have emerged as well, such as speech analysis [13], traffic flow prediction [35], and electric load prediction [12].
In this section we study theoretical properties of the continuous time GPDM trajectory defined as the solution x ∈ R n on t ∈ [0, T ] for some fixed T to the stochastic differential equation (2.1) where the initial state x 0 is normally distributed, x 0 ∼ N(m, P 0 ) for some covariance matrix P 0 , u t is a smooth deterministic input function, and w t is an n-dimensional Brownian motion with diagonal covariance matrix Q = diag(q 1 , ..., q n ).Finally, f = [f 1 , ..., f n ] ⊤ where each component f i = f i (u, x, ω) conditioned on a trajectory x is modelled as a Gaussian process.
For simplicity, we assume that each f i is centred (see Remark 2.2) and has a covariance k i depending on u (input variable) and x (state variable).That is, Ef i (u, x, ω) = 0 and Ef i (u, x, ω)f i (v, z, ω) = k i (u, v, x, z).
Remark 2.1.By Mercer's theorem, each covariance k can be represented as Hence a Gaussian f with covariance k can be modelled by where ξ k ∼ N(0, λ 2 k ) are mutually independent.From this it is clear that for given x, f (u, x) is Gaussian, whereas for random x, it is usually not.
Throughout the article we make the following assumption on the covariances k i .Assumption 2.1.For every i = 1, . . ., n, there exists a constant Example 2.1.The GRN inference algorithm developed below is based on the squared exponential covariance functions and estimating the hyperparameters β i,j .The hyperparameters satisfy γ i > 0 and β i,j ≥ 0. If β i,j > 0, it indicates that gene j is a regulator of gene i.The Assumption 2.1 is satisfied with the constant L i = γ i max 1≤j≤n β i,j .
Before stating and proving existence and uniqueness result for (2.1) we need one technical lemma.
Lemma 2.1.Suppose that Assumption 2.1 holds and let x, z ∈ R n be arbitrary.Then for any p ≥ 1 there exists a constant C depending on p and the numbers L 1 , . . .L n such that Proof.Since f is a Gaussian vector, it suffices to prove the claim only for p = 2. Furthermore, by triangle inequality, it suffices to prove that for each component f i we have

Now
E|f i (u t , x, ω)−f i (u t , z, ω)| 2 = k i (u t , u t , x, x)+k i (u t , u t , z, z)−2k(u t , u t , x, z), and Assumption 2.1 implies which concludes the proof.
Corollary 2.1.Suppose that Assumption 2.1 holds and let x, z ∈ R n be random variables.Then for any p ≥ 1 there exists a constant C depending on p and the numbers L 1 , . . .L n such that Proof.The claim follows from Lemma 2.1 together with the fact that f i conditioned on x and z is Gaussian with covariance k i .
The following existence and uniqueness result for the stochastic differential equation (2.1) justifies the use of the continuous-time GPDM model.Theorem 2.1.Suppose that Assumption 2.1 is satisfied.Then (2.1) admits a unique solution x.
Proof.We use Picard iteration and define x 0 t = x 0 , and for j ≥ 1 we set Then Let us define F j t = σ(x j−2 s , x j−1 s , x j s , s ∈ [0, t]).That is, the smallest σalgebra that makes the trajectories of x j−2 s , x j−1 s , x j s up to t measurable.Taking conditional expectation together with Corollary 2.1 then gives We now claim that This follows by induction.For j = 1 we have In particular, this gives converges, and the convergence is uniform.Finally, since f (u s , x, ω) is continuous in x by Gaussianity and Lemma 2.1, we observe that the limit x = lim j→∞ x j satisfies (2.1).
Remark 2.2.We stress that while we assumed the Gaussian process f to be centred for the sake of simplicity, the extension to a non-centred case is rather straightforward.Indeed, if for each component then the existence and uniqueness follows from the above proof by centering f first.We leave the details to the reader.
The following result studies the basic properties of the solution.
Proof.Clearly, each x j in the proof of Theorem 2.1 is continuous.Consequently, the solution x is continuous as a uniform limit of continuous trajectories.The Hölder continuity then follows from (2.1) and the Hölder continuity of the Brownian motion w.Indeed, since f (u s , x, ω) is continuous in x and x is bounded as a continuous function on a bounded interval [0, T ], it follows that f (u s , x s , ω) is also bounded.Finally, the existence of all moments follow from the fact that f (u s , x s , ω) has all the moments finite as well as sup t∈[0,T ] |w t | has all the moments finite.The method's numerical implementation will be based on the Euler discretised equation (2.1).Define therefore a partition Denote the discretised trajectory corresponding to the partition π M by X M τ k , and recall that its dynamics are given by (2.4) and k = 1, ..., M .Later, we will obtain a probability distribution for the discrete trajectory X = [X τ 0 , X τ 1 , ...X τ M ], but first we show the pointwise (in ω) convergence to the continuous solution of (2.1) as the temporal discretisation is refined.
We study the continuous version defined for t

holds and consider arbitrary discretisation such that sup
As in the proof of Theorem 2.1, conditioning again on both continuous and discrete trajectories implies Let now M be large enough such that C|π M | < 1.We get Iterating then gives for some other constant C. Since as M → ∞ and |π M |M is bounded by assumption, it follows that for some unimportant constant C.But now which gives the result.

Probability distribution of the discretised trajectory
Assume that f ∼ f θ where θ belongs to some parameter space.We now derive the probability distribution p for the discrete trajectory X = X M .The discretisation level index M is dropped now, since we only use one discretisation level from now on.It holds that For given f , the trajectory X is a Markov process, and therefore its distribution satisfies Let us introduce notation X := [X τ 1 , . . ., X τ M ] ⊤ and X := [X τ 0 , . . ., X τ M −1 ] ⊤ .Same notation is also used for the different dimensions of the trajectory.
Then it holds that where ∆τ is a diagonal matrix whose element (k, k) is δτ k , and Now p(X|f, θ) in the integral (3.1) depends only on the values of f at points X.By definition of a Gaussian process, the integral can equivalently be computed over a collection of finite-dimensional, normally distributed random variables The integral in (3.1) can be computed analytically (see Appendix B), Applying the Woodbury identity to the exponent gives and the determinant lemma gives (recall Q is a diagonal matrix with q i 's on the diagonal) Finally, the desired probability distribution is Note that above it was implicitly assumed that the covariance K i (X) is positive definite.This assumption is only violated if X τ j = X τ k for some j = k or if the covariance function k i is degenerate.In this case the integral should be computed over a lower-dimensional variable F i , but the end result would not change.
Note also that (3.2) corresponds to the finite dimensional distribution of the continuous Euler scheme (2.5) evaluated at discretisation points.Since (2.5) converges strongly to the solution x of (2.1), the finite dimensional distributions converge as well.This means that (3.2) is a finite dimensional approximation of the distribution of x.
3.1.Discretisation refinement.In this section we give a heuristic discussion on how integral operators are obtained at the limit as the discretisation is refined.
The exponent in the probability distribution satisfies Now if X i would be replaced by samples from a differentiable function, say Z = [z(τ 0 ), z(τ 1 ), . . ., z(τ M )], then it would hold that (cf. the Wiener measure, which is actually obtained if when the discretisation is refined so that sup k δτ k → 0, where F i (x) is the Fredholm integral operator defined by Then, let us take into consideration the determinant |∆τ K i (X)∆τ +∆τ q i |.Obviously, this determinant approaches zero as the discretisation is refined.However, we can write the determinant as the product of the eigenvalues Now the coefficient |∆τ |q M i approaches zero as the discretisation is refined (whereupon also M → ∞).The product term, on the other hand, remains bounded from both below and above.The eigenvalues of K i (X)∆τ tend to the eigenvalues of the Fredholm operator defined above.These eigenvalues have an accumulation point at zero and they converge fast enough so that the product remains bounded.This can be observed from the following lemma.
Proof.Recall that det(A) = eig(A).Therefore it holds that Since ∆τ 1/2 K i (X)∆τ 1/2 is positive (semi)definite, it holds that each eigenvalue is non-negative.In addition, Taking into account the restrictions on the eigenvalues, and the fact that log(1 + ξ) is a concave, increasing function of ξ, it is straightforward to deduce that the right hand side of (3.3) reaches its minimal value when ∆τ 1/2 K i (X)∆τ 1/2 has one eigenvalue T γ min and all other eigenvalues are zero, and its maximal value when all M eigenvalues are T γ max /M .The lower bound now follows directly, and for the upper bound,

Network inference method
Consider then the original problem, that is, estimating the hyperparameters from given time series data.Denote Y = [y 0 , y 1 , ..., y m ] where y j is assumed to be a noisy sample from the continuous trajectory x, that is, y j = x(t j ) + v j , and v j is a Gaussian noise with zero mean and covariance R = diag(r).We intend to draw samples from the parameter posterior distribution using an MCMC scheme.Therefore, we only need the posterior distribution up to constant multiplication.Denoting the hyperparameters collectively by θ, the hyperparameter posterior distribution is Here p(Y |x, θ) is the Gaussian measurement error distribution, p(x|θ) will be approximated by (3.2) for the discretised trajectory X, and p(θ) is a prior for the hyperparameters.This prior consists of independent priors for each parameter.The integration with respect to the trajectory x is done by MCMC sampling.In the network inference algorithm, we consider only the squared exponential covariance function (2.2).The function f i has mean where a i and b i are regarded as positive hyperparameters corresponding to basal transcription (b i ) and mRNA degradation (a i ).
For sampling the hyperparameters β i,j in the squared exponential covariance function (2.2), we introduce an indicator variable as in [1].That is, each hyperparameter is a product β i,j = S i,j H i,j , where S i,j ∈ {0, 1} and H i,j ≥ 0. The state of the sampler consists of the indicator variable S, the hyperparameters (i, j = 1, ..., n) H i,j , γ i , r i , q i , a i , b i and the discrete trajectory X.They are sampled using a Gibbs sampler (or more precisely, Metropolis-Hastings within Gibbs sampler) as described below.
For the Gibbs sampler, notice that p(X|θ) given in (3.2) is readily factorised in form (4.1) This factorisation makes it natural to sample S, H, γ, a = {a 1 , ..., a n }, and b = {b 1 , ..., b n } one dimension at a time.However, each factor P i still depends on the full trajectory X, so the trajectory sampling is done separately.Also, when using the Crank-Nicolson sampling (see Section A.2), the sampling of q is intertwined with the trajectory sampling, so they are sampled together.This two-phase sampling scheme is described in the following algorithm.Here the algorithm is presented in its basic form.Some ways to make the sampling more efficient are presented in Appendix A. We assume that the initial time τ 0 coincides with the time of the first measurement t 0 , so that p(X τ 0 |θ) = N(y 0 , R).In the algorithm, this is included in the data fit term p(Y |x, θ).
Algorithm 4.1.Denote the l th samples by parenthesised superindex, e.g., X (l) is the trajectory of the l th sample.The new candidate samples are denoted by a hat.
Indicator and hyperparameter sampling: For i = 1, ..., n: • Sample the i th row of S by drawing ĵ from uniform distribution over {1, ..., n}.Then • Sample H i = [H i,1 , ..., H i,n ], γ i , a i , and b i using random walk sampling, that is, add small changes to each component, drawn from zero-mean normal distribution.If the candidate sample is negative, take its absolute value.• Accept the new samples with probability where p is the hyperparameter prior, and the factors P i are defined in (4.1).
• Sample R with random walk sampling, with acceptance probability Trajectory sampling: • Sample Xi = X where m b = ⌊M/2⌋.• Sample Q using the random walk sampling.
• Accept X and Q with probability ) , X (l) ) where C j ∈ R (M +1)×1 gives the element from the full trajectory X corresponding to the measurement y j .In the case {t 0 , ..., t m } ⊂ {τ 0 , ..., τ M }, C j is a vector with one at position k satisfying t j = τ k , and zeros elsewhere.
The algorithm of course contains a burn-in period, and additional thinning, that is, not every sample l is collected.As the number of samples increases, it holds that The element (i, j) of this matrix gives the probability that β i,j is not zero.
For q i and r i we use noninformative inverse Gamma priors.For H i,j ≥ 0 we use the Laplace prior p(H i,j ) ∝ exp (−ρ i H i,j ) where ρ i = max j [y j ] i − min j [y j ] i as well as for the mean function parameters, a i ∼ exp(−a i /ρ i ) and b i ∼ exp(−b i /ρ i ).The covariance scaling parameter is given a prior 10ν i where ν i gives an estimate of the average squared derivative, computed from the time series data by For S we use p(S) ∝ η |S| 0 where |S| 0 gives the number of ones in S, and the parameter η > 0 can be set to obtain a desired sparsity level for the solution.This prior means that the existence of a link is independent of other links, and the prior probability for the existence of any given link is η 1+η .
4.1.Incorporating several time series and knockout/knockdown experiments.Several time series experiments can be easily incorporated.
For fixed f , the probability distributions for different time series are independent.In the end, this leads to the same format of the probability distribution (3.2), but the trajectories are concatenated.Then X contains the concatenated trajectories, except for the first point in each separate discretised trajectory, and X contains all trajectories, except for the last points in each trajectory.
In a knockout experiment a particular gene is "de-activated", meaning that its expression is artificially put to zero.The system is then allowed to evolve sufficiently long time for it to attain a steady state.At the steady state z i corresponding to knockout of gene i, it should therefore hold that f j (z i ) = 0 for all j, except j = i.It is not possible to deduce anything of f i (z i ), since the dynamics of the i th gene are artificially tampered with.
A gene knockdown experiment is similar to a gene knockout experiment, but the genes are only repressed instead of made completely inactive.Again, the corresponding measurement is a steady state experiment, and it is taken into account in exactly the same way as a knockout experiment.Also in a knockdown experiment the dynamics of the knocked-down gene are artificially modified, so the knockdown experiment of gene i should not be taken into account in function f i .
When using all of the knockout and knockdown data, we assume that there is one point x ss where f i (x ss ) = 0 for all i.This steady state value is sampled, and its prior is a normal distribution whose mean is the sample mean of all steady state measurements including the actual steady state measurement, knockout measurements, knockdown measurements, and the multifactorial data (in the 10-gene challenge).The covariance of the prior distribution of x ss is the sample covariance of this data, divided by the number of the steady state measurements.This corresponds to the sample covariance of the mean.We assume that at the steady state, it holds that f i (x ss ) = v i,ss where v i,ss ∼ N(0, M i,ss ), and at the knockout and knockdown points f i (x j,ko ) = v i,ko where v i,ko ∼ N(0, M i,ko ).Also the covariances M i,ss and M i,ko are sampled, and they are given noninformative inverse gamma prior distributions.The incorporation of the knockout/knockdown data to p(X|θ) in (3.2) is done by doing the following replacements: x ss , y i,ko/kd ]) where y i,ko/kd denotes the collection of all knockout/knockdown measurements except for the ko/kd of gene i.

Benchmark data example
We benchmark our method using the data from the DREAM4 in silico network challenge.The challenge consists of network inference tasks with network sizes 10 and 100, with five networks in each size.The data consists of five time series for the 10-gene challenge and ten time series for the 100-gene challenge, where different perturbations have been applied on some genes for the first half of the time.The time series illustrate the system's adaptation to the perturbation, and its relaxation when the perturbation is removed.Each time series consists of 21 time points.In addition, steady state values are provided in the dataset as well as gene knockout and knockdown data corresponding to each gene.For the 10-gene challenge, multifactorial data is provided, which corresponds to steady state values under mild perturbations on the basal transcription rate.This corresponds to data collected from different cells, for example.
We have compared our method (GPDM) to the challenge best performers using all available data, and with other methods using only the time series data.As in the DREAM4 challenge, we use the standard classifier scores for the comparison, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR).The 10gene challenge winner, Petri Nets with Fuzzy Logic (PNFL) is introduced in [18].The 100-gene challenge winner is introduced in [28].The method is based only on the knockout data, with some post-processing.A similar scoring method without post-processing, the median-corrected Z-score (MCZ) method [11] achieved the second highest score in the 100-gene challenge.
Methods inferring GRNs from only time series data are reviewed in [27], where the best performer (in terms of average AUPR value) was a method called Causal Structure Identification (CSI) [16,26], which is based on Gaussian process regression as well.We compare our method to the discretetime version of CSI, since its performance was better.In addition, we have included a new method, dynGENIE3 [15] to the comparison.It is designed for inference from time series data.However, as suggested in [11], any method inferring networks from time series data can be combined with a method inferring GRNs from steady state data, such as the MCZ.Unfortunately, the MCZ requires knockouts or knockdowns of all genes, which can hardly be expected in a real experiment.Nevertheless, the combinations dynGENIE3*MCZ and GPDM*MCZ are included in the full data comparison.The scores for the combined methods are the products of the individual scores, favouring links that score high in both methods.It should be noted that our method (as well as the PNFL) can utilise also partial knockout data together with time series data.To demonstrate this, we run the algorithm on the 10-gene data challenge using the time series data, steady state measurement, and knockout data for genes 2,4, and 6 (randomly selected).
The parameter η controlling the level of sparsity of the networks was 0.25 for the 10-gene case and 0.025 for the 100-gene case.In the 10-gene case one Markov chain was generated with a burn-in period of 3000 samples.After the burn-in, every 10th sample was stored until 20000 samples was collected.In the 100-gene case, three chains were generated with burn-in periods of 1500 samples.Every 10th sample was stored until 3000 samples was collected from each of the three chains resulting in 9000 samples in total.5.1.The 10-gene network results.The results of the 10-gene network challenge data are summarised in Table 1.When using all data from the challenge, our method scores a little bit higher (average AUPR) than the DREAM4 10-gene challenge winner PNFL.The average scores are very close Table 1.AUROC/AUPR values for the DREAM4 in silico 10-gene network inference challenge data, using either all data or only time series data.The values for PNFL are taken from [31], for dynGENIE3 (and dynGENIE3*MCZ) from [15, Suppl.information], and for CSI from [27 to each other but in the different networks there are some rather significant differences.Our method reaches a fairly high AUPR in network 2, which seemed to be very difficult for all challenge participants.The best AUPR for network 2 among the challenge participants was 0.660, and the PNFL's 0.547 was the second highest.The poor performance of most methods with network 2 is attributed to low effector gene levels in the wild type measurement [18].On the other hand, our method's performance is less satisfactory with network 3, where the PNFL achieves almost perfect reconstruction.This might be due to a fairly high in-degree (four) of two nodes in the true network.Only three out of eight of these links get higher confidence value than 0.5 in our method.Based on Table 1 and [27, Table 1], network 3 also seems to be the one where the knockout data has the biggest impact.It may be that the PNFL makes better use of this data.Also the GPDM*MCZ combination scores fairly well with network 3, but in network 2 it loses clearly to the GPDM applied to all data directly.The GPDM consistently outperforms other methods by a large margin (with the exception of network 3) in GRN inference from time series data.Unfortunately, the PNFL method's code is not publicly available, and therefore it has not been used in this comparison.
When using only a part of the knockout data (genes 2,4, and 6), the results improved somewhat compared to using the time series data only.As perhaps expected, there seems to be some correlation between the improvement and the number of outgoing links related to the knocked-out genes (2,8,4,5,3 for networks 1 to 5, respectively).5.2.The 100-gene network results.The results of the 100-gene challenge data are summarised in Table 2.When using all data, GPDM scores a bit higher than the winner of the DREAM4 100-gene challenge (in particular in networks 4 and 5), but lower than the combination of dynGENIE3 and MCZ.The highest score is achieved by the combination GPDM*MCZ, which, unlike in the 10-gene case, beats the GPDM on full data.However, it should be noted that both the DREAM4 winner as well as the MCZ are based solely on the knockout and knockdown data, but their implementation requires knockout of every gene, which is hardly realistic in a real Table 2. AUROC/AUPR values for the DREAM4 in silico 100-gene network inference challenge data, using either all data or only time series data.The values for the challenge winner are taken from [31], for dyn-GENIE3 (and dynGENIE3*MCZ) from [15, Suppl.information], and for CSI from [27,  experiment.Nevertheless, it seems that with the 100-gene network, our method cannot always combine different types of data in an optimal way.This may be due to the large number of steady state points where the dynamics function f should vanish.These steady state measurements seem to be rather noisy, and therefore the noise parameter M ko associated to the knockout/knockdown measurements tends to be rather high, thereby reducing the influence of the knockout data.This hypothesis is supported by the fact that the results actually deteriorate slightly when also the knockdown data is included as opposed to using only the knockout data with the time series data.On the other hand, as in the 10-gene case, the GPDM method outperforms its competitors by a clear margin in all five networks when inferring the networks from time series data alone.
Remark 5.1.The AUROC and AUPR values for the method CSI are taken from [27], where the self interactions are included in the computation of these values.The self interactions are given a weight zero, and hence all methods get 10 or 100 "free" true negatives, depending on the network size.This has some increasing effect on the AUROC values for networks of size 10 (they report mean AUROC of 0.55 for random networks as opposed to 0.5).The effect on the 100-gene network results and on all AUPR values is negligible.

Discussion
In this article we introduced a continuous-time version of the Gaussian process dynamical model, and derived theory behind the model.We developed a GRN inference method based on this framework and considered benchmark data examples, where the method was favourably compared to state-of-the-art methods in GRN inference from time series data.There are other methods using Gaussian process regression for GRN inference, but a common aspect about earlier methods is that they are based on estimating derivatives directly from the time series data, and then treating the problem as an input-output regression problem.It seems that our approach based on sampling the continuous-time trajectories is very good in dealing with time series data with low sampling frequency.
Interesting future research topics include applying this new method to solve different biological and biomedical real-data problems.From theoretical perspective, it would be desirable to relax smoothness requirements and to consider process noise with memory and/or dependence on the system's state.
Appendix A. Efficient sampling schemes A.1.Pseudo-input scheme.Gaussian process regression suffers from a very unfavourable scaling of the computational load with respect to the number of data points.This problem is further aggravated by our scheme, where the number of data points used in the GP regression is in fact the number of discretisation points in the continuous time trajectory.However, we can resort to a pseudo-input scheme, where this scaling becomes linear.
In the pseudo-input scheme [33], the underlying Gaussian process f is characterised through so-called pseudo-data P := {(x j , fj )} p j=1 , where fj = f (x j ).The number of pseudo-inputs p is specified by the user, based on the available computing power and the size of the original problem.The pseudo-inputs are not related to the inputs of the actual data, but instead they can be considered as hyperparameters, and they can be estimated by a maximum likelihood approach or they can be sampled as well.Another approach is to use only a subset of the actual input-output data (a so-called active set) in the regression [32].We use the pseudoinput approach of [33], but the main idea then follows [32], that is the value f (x) at a generic point x is approximated by E(f (x)|P ).When the pseudo-outputs fj are integrated out, the approximation leads to replacement of the matrices K i (X) in (3.2) by K i (X) ≈ K i (X, P )K i (P ) −1 K i (X, P ) ⊤ , where K i (X, P ) ∈ R M ×p is a matrix whose element (j, k) is k i (X τ j−1 , xk ).Similarly K i (P ) ∈ R p×p is a matrix whose element (j, k) is k i (x j , xk ).The approximation used in [33] is more accurate, but its computational cost is much higher when it is not used only for regression.
With this approximation, it is possible to use the Woodbury identity and the matrix determinant lemma again to obtain for the exponent in (3.2) ∆τ K i (X, P )K i (P ) −1 K i (X, P ) ⊤ ∆τ + q i ∆τ −1 = (q i ∆τ ) −1 − 1 q i K i (X, P ) q i K i (P ) + K i (X, P ) ⊤ ∆τ K i (X, P ) −1 K i (X, P ) ⊤ .
Here q i ∆τ is a diagonal matrix and the full matrix inverse is computed for a p × p matrix instead of M × M .The downside is that the determinant term becomes ∆τ K i (X, P )K i (P ) −1 K i (X, P ) ⊤ ∆τ + q i ∆τ = |K i (P )| −1 |q i ∆τ | K i (P ) + 1 q i K i (X, P ) ⊤ ∆τ K i (X, P ) where |K i (P )| must be computed separately.Notice, for example, that |K i (P )| tends to zero if two pseudo-inputs tend to each other, so it has an effect of pushing the pseudo-input points apart from each other.In practical implementation, a small increment εI is added to the matrix K i (P ) to ensure numerical stability.This corresponds to assuming that the pseudooutputs fj are corrupted by small noise (with variance εI).We sample the pseudoinputs using random walk sampling, using a uniform prior for the pseudoinputs in the hypercube covering the actual data.
A.2. Crank-Nicolson sampling.In the presented algorithm, the discretised trajectory X is sampled using MCMC.When the discretisation is refined, the acceptation rate tends to decrease when conventional samplers are used.This can be avoided by Crank-Nicolson sampling [4,7], if the target distribution has a density with respect to a Gaussian measure, p(x) = Φ(x)N(x; m, P ).
Crank-Nicolson sampling plays well along with the pseudo-input scheme.The term (q i ∆τ ) −1 in the matrix inverse approximation above, and the term |q i ∆τ | in the determinant correspond exactly to the Wiener measure on the discretised trajectory.Notice that also the data fit term p(Y |x, θ) is Gaussian.In order to get a sampler producing reasonable trajectory candidates, we factorise the Wiener measure W(dx) = m j=1 N x t j − x t j−1 ; 0, Q(t j − t j−1 ) B (t j−1 ,t j ) (dx), where B (t j−1 ,t j ) (dx) is the Brownian bridge measure on interval (t j−1 , t j ), that is fixed to values x t j−1 and x t j at the end points.Finally, the Gaussian measure that is used in the Crank-Nicolson sampler is N(Y |x, θ) m j=1 B (t j−1 ,t j ) (dx), and the factors m j=1 N x t j − x t j−1 ; 0, Q(t j − t j−1 ) are implemented in the acceptance probability.
A.3.Importance sampling.It tends to happen that many links are included in all the collected samples even if a high number of samples is collected.For example, in the DREAM challenge the results were required as an ordered list of links.In order to order the links that have probability 1, it is possible to use importance sampling as a "tiebreaker" every now and then.This means computing probabilities p i,j := p s i,j = 1 {s k,l , (k, l) = (i, j)}, x, θ for each link (i, j), where θ = H, γ, q, a, b .To compute an importance sample, denote by Ŝ0 the topology matrix which is otherwise the same as the current sample S (l) , but s i,j = 0. Similarly Ŝ1 is otherwise the same as