Gene regulatory network inference from sparsely sampled noisy data

The complexity of biological systems is encoded in gene regulatory networks. Unravelling this intricate web is a fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases. The major obstacle in inferring gene regulatory networks is the lack of data. While time series data are nowadays widely available, they are typically noisy, with low sampling frequency and overall small number of samples. This paper develops a method called BINGO to specifically deal with these issues. Benchmarked with both real and simulated time-series data covering many different gene regulatory networks, BINGO clearly and consistently outperforms state-of-the-art methods. The novelty of BINGO lies in a nonparametric approach featuring statistical sampling of continuous gene expression profiles. BINGO’s superior performance and ease of use, even by non-specialists, make gene regulatory network inference available to any researcher, helping to decipher the complex mechanisms of life.


Supplementary notes
Supplementary note 1: Solvability of the GPDM stochastic differential equation

Consider the stochastic differential equation

dx_t = f(u_t, x_t, ω) dt + dw_t,    (1)

on t ∈ [0, T] for some fixed T, where ω is an element of the sample space Ω. The initial state x_0 is assumed to be normally distributed, x_0 ∼ N(m, P_0) for some covariance matrix P_0, and u_t is a smooth deterministic input function. The process noise w_t is an n-dimensional Brownian motion with diagonal covariance matrix Q = diag(q_1, ..., q_n). Each component of f = [f_1, ..., f_n]^T, f_i = f_i(u, x, ω), conditioned on a trajectory x, is modelled as a Gaussian process. For simplicity, we first assume that each f_i is centred (see Remark 2) and has a covariance function k_i depending on the input u and the state x, that is, E f_i(u, x, ω) = 0 and E f_i(u, x, ω) f_i(v, z, ω) = k_i(u, v, x, z).
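As a concrete illustration of the model class, the continuous-time GPDM can be simulated by first drawing one sample of the drift f and then applying an Euler-Maruyama scheme. The sketch below is a minimal one-dimensional example; the random-Fourier-feature approximation of a squared-exponential covariance and all numerical values are illustrative assumptions, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed sample of a centred GP drift f with a squared-exponential
# covariance, approximated here by random Fourier features.  The
# approximation and all numerical values are illustrative assumptions.
n_feat = 100
omega = rng.normal(size=n_feat)            # frequencies (unit lengthscale)
phase = rng.uniform(0.0, 2.0 * np.pi, n_feat)
weights = rng.normal(size=n_feat) * np.sqrt(2.0 / n_feat)

def f(x):
    """Drift value f(x, omega) for the fixed sample drawn above."""
    return weights @ np.cos(omega * x + phase)

# Euler-Maruyama simulation of dx_t = f(x_t) dt + dw_t on [0, T]
T, M, q = 1.0, 200, 0.1                    # horizon, steps, noise variance
dt = T / M
x = np.empty(M + 1)
x[0] = 0.0                                 # deterministic initial state
for k in range(M):
    x[k + 1] = x[k] + f(x[k]) * dt + np.sqrt(q * dt) * rng.normal()
```

For a fixed ω, the drift is an ordinary Lipschitz function, which is exactly the situation exploited in the solvability proof below.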
Remark 1. By Mercer's theorem, each covariance k can be represented as

k(u, v, x, z) = Σ_k λ_k^2 φ_k(u, x) φ_k(v, z)

with some basis functions {φ_k}. The Gaussian f with this covariance k can be modelled by

f(u, x) = Σ_k ξ_k φ_k(u, x),

where the ξ_k ∼ N(0, λ_k^2) are mutually independent. From this it is clear that for given x, f(u, x) is Gaussian, whereas for random x, it usually is not.
Throughout the article we make the following assumption on the covariances k_i.

Assumption 1. For every i = 1, ..., n, there exists a constant L_i such that

k_i(u, u, x, x) + k_i(u, u, z, z) − 2 k_i(u, u, x, z) ≤ L_i |x − z|^2

for all u, x, and z.

Before stating and proving the existence and uniqueness result for (1), we need one technical lemma.

Lemma 1. Suppose that Assumption 1 holds and let x, z ∈ R^n be arbitrary. Then for any p ≥ 1 there exists a constant C, depending on p and the numbers L_1, ..., L_n, such that E|f(u_t, x, ω) − f(u_t, z, ω)|^p ≤ C|x − z|^p.
Proof. Since f is a Gaussian vector, it suffices to prove the claim for p = 2 only. Furthermore, by the triangle inequality, it suffices to prove the claim for each component f_i. Now E|f_i(u_t, x, ω) − f_i(u_t, z, ω)|^2 = k_i(u_t, u_t, x, x) + k_i(u_t, u_t, z, z) − 2 k_i(u_t, u_t, x, z), and Assumption 1 implies E|f_i(u_t, x, ω) − f_i(u_t, z, ω)|^2 ≤ L_i |x − z|^2, which concludes the proof.
Corollary 1. Suppose that Assumption 1 holds and let x, z ∈ R^n be random variables. Then for any p ≥ 1 there exists a constant C, depending on p and the numbers L_1, ..., L_n, such that E|f(u_t, x, ω) − f(u_t, z, ω)|^p ≤ C E|x − z|^p.

Proof. The claim follows from Lemma 1 together with the fact that f_i conditioned on x and z is Gaussian with covariance k_i.
The following existence and uniqueness result for the stochastic differential equation (1) justifies the use of the continuous-time GPDM.
Theorem 1. Suppose that Assumption 1 is satisfied. Then (1) admits a unique solution x.
Proof. We use Picard iteration and define x_t^0 := x_0, and for j ≥ 1 we set

x_t^j := x_0 + ∫_0^t f(u_s, x_s^{j−1}, ω) ds + w_t.    (2)

Then

sup_{s∈[0,t]} |x_s^{j+1} − x_s^j| ≤ ∫_0^t |f(u_s, x_s^j, ω) − f(u_s, x_s^{j−1}, ω)| ds.

Taking expectation, conditioning, and using Corollary 1 then gives

E sup_{s∈[0,t]} |x_s^{j+1} − x_s^j|^p ≤ C ∫_0^t E sup_{r∈[0,s]} |x_r^j − x_r^{j−1}|^p ds.    (3)

We now claim that

E sup_{s∈[0,t]} |x_s^{j+1} − x_s^j|^p ≤ (Ct)^j/j! · K,  where K := E sup_{s∈[0,T]} |x_s^1 − x_s^0|^p.

This follows by induction from (3), provided that K is finite. For j = 1 we have

sup_{s∈[0,t]} |x_s^1 − x_s^0| ≤ ∫_0^t |f(u_s, x_0, ω)| ds + sup_{s∈[0,t]} |w_s|,

so K is indeed finite, since the supremum of the Gaussian process f and the supremum of w_t have all moments finite. In particular, the claim gives

Σ_{j≥1} ( E sup_{s∈[0,T]} |x_s^{j+1} − x_s^j|^p )^{1/p} < ∞.

On the other hand, from (2) we get

sup_{s∈[0,T]} |x_s^j| ≤ |x_0| + ∫_0^T |f(u_s, x_s^{j−1}, ω)| ds + sup_{s∈[0,T]} |w_s|.

Consequently, taking expectation gives moment bounds for sup_{s∈[0,T]} |x_s^j| that are uniform in j, and thus the sequence x^j converges uniformly to an integrable random variable. Finally, since f(u_s, x, ω) is continuous in x by Gaussianity and Lemma 1, we observe that the limit x = lim_{j→∞} x^j satisfies (1).

Remark 2. We stress that while we assumed the Gaussian process f to be centred for the sake of simplicity, the extension to the non-centred case is rather straightforward. Indeed, if for each component f_i the mean function m_i satisfies a Lipschitz condition of the form |m_i(u, x) − m_i(u, z)| ≤ L_i |x − z|, then existence and uniqueness follow from the above proof by first centring f. We leave the details to the reader.
The following result studies the basic properties of the solution.
Theorem 2. Suppose that Assumption 1 holds. Then the solution x to (1) is Hölder continuous of any order γ < 1/2. Furthermore, sup_{t∈[0,T]} |x_t| has all moments finite.
Proof. Clearly, each x^j in the proof of Theorem 1 is continuous. Consequently, the solution x is continuous as a uniform limit of continuous trajectories. The Hölder continuity then follows from (1) and the Hölder continuity of the Brownian motion w. Indeed, since f(u_s, x, ω) is continuous in x, and x is bounded as a continuous function on the bounded interval [0, T], it follows that f(u_s, x_s, ω) is also bounded. Finally, the existence of all moments follows from the fact that f(u_s, x_s, ω) and sup_{t∈[0,T]} |w_t| both have all moments finite.

Supplementary note 2: Convergence of the Euler discretisation
The second part of the theoretical considerations concerns the convergence of the Euler discretised trajectory, defined on a partition π_M = {0 = τ_0 < τ_1 < · · · < τ_M = T} by

X^M_{τ_k} = X^M_{τ_{k−1}} + f(u_{τ_{k−1}}, X^M_{τ_{k−1}}, ω) δτ_k + w_{τ_k} − w_{τ_{k−1}},

where δτ_k := τ_k − τ_{k−1} and k = 1, ..., M. Later, we will obtain a probability distribution for the discrete trajectory X = [X_{τ_0}, X_{τ_1}, ..., X_{τ_M}], but first we show the pointwise (in ω) convergence to the continuous solution of (1) as the temporal discretisation is refined. We study the continuous version defined for t ∈ [τ_{k−1}, τ_k] by

X̄^M_t = X^M_{τ_{k−1}} + f(u_{τ_{k−1}}, X^M_{τ_{k−1}}, ω)(t − τ_{k−1}) + w_t − w_{τ_{k−1}}.

Note that X̄^M_{τ_k} = X^M_{τ_k} for all k.
Theorem 3. Suppose that Assumption 1 holds and consider arbitrary discretisations π_M such that |π_M|M remains bounded as M → ∞. Then for any p ≥ 1 it holds that ‖ sup_{t∈[0,T]} |x_t − X̄^M_t| ‖_p ≤ C|π_M|. Moreover, for any ε > 0 we have |π_M|^{−(1−ε)} sup_{t∈[0,T]} |x_t − X̄^M_t| → 0 almost surely.
Proof. Let t ∈ [τ_{k−1}, τ_k] and denote z_t := ‖ sup_{s∈[0,t]} |x_s − X̄^M_s| ‖_p, where ‖·‖_p denotes the p-norm. Comparing (1) with the continuous Euler scheme and proceeding as in the proof of Theorem 1 yields an integral inequality for z_t. Let now M be large enough that C|π_M| < 1. Iterating the inequality over the subintervals of the partition then gives a bound with some other constant C̃. Since |π_M| → 0 as M → ∞ and |π_M|M is bounded by assumption, it follows that z_t ≤ C|π_M| first for t ∈ [0, τ_1], and then on the whole interval, proving the first claim. Finally, the second claim is a direct consequence of the Borel-Cantelli lemma.
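The convergence statement can be checked numerically by running the Euler scheme at several resolutions on the same Brownian path. In the sketch below a fixed Lipschitz drift stands in for a sampled GP drift (an assumption for this illustration), and the finest grid serves as the reference solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed Lipschitz drift stands in for a sampled GP drift (an
# assumption for this illustration); the noise is additive Brownian
# motion, as in the text.
f = np.tanh
T, q = 1.0, 0.1
M_fine = 2 ** 12
dW = np.sqrt(q * T / M_fine) * rng.normal(size=M_fine)

def euler(level):
    """Euler trajectory on 2**level steps, driven by the same Brownian
    path aggregated from the fine grid."""
    M = 2 ** level
    step = M_fine // M
    dw = dW.reshape(M, step).sum(axis=1)
    dt = T / M
    x = 0.0
    path = [x]
    for k in range(M):
        x = x + f(x) * dt + dw[k]
        path.append(x)
    return np.array(path)

ref = euler(12)                            # finest grid as the reference
err = [abs(euler(l)[-1] - ref[-1]) for l in (4, 6, 8)]
```

Refining the grid while keeping the Brownian path fixed shrinks the pathwise endpoint error, which is the discrete analogue of the almost sure convergence in Theorem 3.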

Supplementary note 3: Probability distribution of the discretised trajectory
For the discrete-time GPDM, the probability distribution of the discretised trajectory is computed in [1, Appendix A], and for the most part the derivation here is similar. As in the main text, only one discretisation level is considered hereinafter, and the discretisation index is dropped, that is, X = X^M. For notational simplicity, we first assume that the Gaussian process f is centred, that is, m_i(u, x) = 0. It holds that

p(X|θ) = ∫ p(X|f, θ) p(f|θ) df.    (6)

For given f, the trajectory X is a Markov process, and therefore its distribution factorises into the product of the initial distribution and the one-step transition densities. Let us introduce the notation X := [X_{τ_1}, ..., X_{τ_M}]^T and X̄ := [X_{τ_0}, ..., X_{τ_{M−1}}]^T. The same notation is also used for the individual dimensions of the trajectory.
Then it holds that (6) depends only on the values of f at the points X. By the definition of a Gaussian process, the integral can equivalently be computed over a collection of finite-dimensional, normally distributed random variables F_i, and then (6) can be computed analytically. Applying the Woodbury identity to the exponent and the matrix determinant lemma to the determinant (recall that Q is a diagonal matrix with the q_i's on the diagonal) finally yields the desired probability distribution (7). A nonzero mean is taken into account by centring the increments with the mean function. Note that above it was implicitly assumed that the covariance K_i(X) is positive definite. This assumption is only violated if X_{τ_j} = X_{τ_k} for some j ≠ k, or if the covariance function k_i is degenerate. In that case the integral should be computed over a lower-dimensional variable F_i, but the end result would not change.
Note also that (7) corresponds to the finite-dimensional distribution of the continuous Euler scheme (5) evaluated at the discretisation points. Since (5) converges strongly to the solution x of (1), the finite-dimensional distributions converge as well. This means that (7) is a finite-dimensional approximation of the distribution of x.
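The two matrix identities used in the derivation above can be verified with a small numerical check. The matrices below are generic illustrations, not the paper's actual kernel matrices; only the diagonal term q·∆τ·I mirrors the structure in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
M, p = 8, 3
U = rng.normal(size=(M, p))                # generic low-rank factor
C = np.eye(p)
q, dtau = 0.5, 0.1
A = q * dtau * np.eye(M)                   # diagonal term, as q_i*dtau*I

S = A + U @ C @ U.T

# Woodbury identity: (A + U C U^T)^{-1}
#   = A^{-1} - A^{-1} U (C^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
Ainv = np.eye(M) / (q * dtau)
inner = np.linalg.inv(np.linalg.inv(C) + U.T @ Ainv @ U)
woodbury = Ainv - Ainv @ U @ inner @ U.T @ Ainv

# Matrix determinant lemma:
#   det(A + U C U^T) = det(C^{-1} + U^T A^{-1} U) det(C) det(A)
det_lemma = (np.linalg.det(np.linalg.inv(C) + U.T @ Ainv @ U)
             * np.linalg.det(C) * (q * dtau) ** M)
```

Both identities reduce an M × M inversion or determinant to a p × p one plus diagonal operations, which is the computational point exploited again in the pseudo-input scheme of Supplementary note 6.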
In the derivation of the probability distribution p(X|θ), an integral of the exponential function with a quadratic exponent was computed. Here it is shown how such an integral is computed analytically.
Consider the integral

I = ∫_{R^n} exp(−J(x)) dx,  where J(x) = (1/2) x^T A x − b^T x + c,

and A is symmetric and positive definite. Now J can be written as

J(x) = J_min + (1/2) (x − x_min)^T A (x − x_min),

where J_min = min_x J(x) and x_min is the (unique) vector attaining this minimum. Then

I = exp(−J_min) (2π)^{n/2} det(A)^{−1/2}.

Finally, the minimum is obtained by straightforward differentiation, and it is attained at x_min = A^{−1} b with J_min = c − (1/2) b^T A^{−1} b. In the derivation of p(X|θ) above, this is applied to the integral over each F_i.

Supplementary note 4: Network inference algorithm
As mentioned in the main text, each component f_i in (1) is modelled as a Gaussian process with the squared exponential covariance function

k_i(x, z) = γ_i exp( −(1/2) Σ_{j=1}^{n+m} β_{i,j} (x_j − z_j)^2 ),

where n is the system dimension and m is the dimension of the input u in (1). The method is based on estimating posterior probabilities for β_{i,j} being nonzero, for i = 1, ..., n and j = 1, ..., n + m. Indeed, if β_{i,j} = 0, the function f_i does not depend on x_j, that is, gene j is not a regulator of gene i. In a similar way, the method also estimates the target genes of the external inputs u_j for j = 1, ..., m. The hyperparameters β_{i,j} are given a so-called spike-and-slab prior, meaning that there is a positive probability that β_{i,j} = 0. One way to treat such random variables in practice is to represent them as a product β_{i,j} = S_{i,j} H_{i,j} of an indicator variable S_{i,j} ∈ {0, 1} and a magnitude variable H_{i,j} ∈ R_+. The indicator variable matrix S then has the interpretation of the adjacency matrix of the underlying gene regulatory network.
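The way a zero weight removes a regulator from the covariance can be seen directly in code. The sketch below is one plausible reading of the construction; the exact parameterisation and all numerical values are illustrative assumptions.

```python
import numpy as np

def se_cov(x, z, gamma, beta):
    """Squared-exponential covariance with per-coordinate weights beta.
    Setting beta[j] = 0 removes all dependence on coordinate j, i.e.
    gene j is not a regulator.  The exact parameterisation is an
    illustrative assumption."""
    d = x - z
    return gamma * np.exp(-0.5 * np.sum(beta * d * d))

# Spike-and-slab factorisation beta_ij = S_ij * H_ij
S = np.array([1, 0, 1])                 # indicator (adjacency) row
H = np.array([2.0, 5.0, 0.3])           # magnitude row
beta = S * H                            # beta[1] = 0: gene 1 is no regulator

x = np.array([0.1, 0.9, 0.4])
z = np.array([0.1, 0.2, 0.4])           # differs from x only in coordinate 1
```

Since beta[1] = 0, the covariance between x and z above equals γ·exp(0): the large difference in coordinate 1 has no effect, exactly as a non-regulator should behave.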
The network inference algorithm is based on MCMC sampling of the probability distribution

p(x, θ|Y) ∝ p(Y|x, θ) p(x|θ) p(θ).

The distribution p(x|θ) is approximated by the discretised p(X|θ) given in (7), and p(θ) consists of the hyperparameter priors specified below. The output of the algorithm is the average of the collected samples of network topologies S. The trajectory X and all other hyperparameters besides S are integrated out by sampling. The probability distribution p(X|θ) is readily factorised in the form (8). Each factor P_i depends only on the i-th component/row of the vectors/matrices S, H, γ, q, a = {a_1, ..., a_n}, and b = {b_1, ..., b_n}, and so it is natural to sample them one dimension at a time. However, each factor P_i still depends on the full trajectory X, so the trajectory sampling is done separately. Also, when Crank-Nicolson sampling (explained below) is used, the sampling of q is intertwined with the trajectory sampling, so they are sampled together. This two-phase sampling scheme is described in the following algorithm, presented here in its basic form; some ways to make the sampling more efficient are discussed below. We assume that the initial time τ_0 coincides with the time of the first measurement t_0, so that p(X_{τ_0}|θ) = N(y_0, R). In the algorithm, this is included in the data fit term p(Y|x, θ).
Algorithm 1. Denote the l-th samples by a parenthesised superscript, e.g., X^(l) is the trajectory of the l-th sample. New candidate samples are denoted by a hat. A parameter summary is given in Supplementary table 1. The prior probability distributions used in the method are described below.
Indicator and hyperparameter sampling: For i = 1, ..., n:
• Sample the i-th row of S by drawing ĵ from the uniform distribution over {1, ..., n}. Then set Ŝ_{i,ĵ} = 1 − S_{i,ĵ}, that is, flip the indicator of the link from node ĵ to node i.
• Sample H_i = [H_{i,1}, ..., H_{i,n}], γ_i, a_i, and b_i using random walk sampling, that is, add small changes to each component, drawn from a zero-mean normal distribution. If a candidate sample is negative, take its absolute value.
• Accept the new samples with the Metropolis-Hastings acceptance probability, where p is composed of the hyperparameter priors and the factors P_i are defined in (8).
• Sample R̂ with random walk sampling, with the corresponding acceptance probability.

Trajectory sampling:
• Sample Q̂ using random walk sampling.
• Accept X̂ and Q̂ with the acceptance probability involving the data fit term, where C_j ∈ R^{(M+1)×1} picks from the full trajectory X the element corresponding to the measurement y_j. In the case {t_0, ..., t_m} ⊂ {τ_0, ..., τ_M}, C_j is a vector with a one at the position k satisfying t_j = τ_k, and zeros elsewhere.
The algorithm additionally uses a burn-in period and thinning, that is, not every sample l is stored.
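The two kinds of moves in Algorithm 1 can be illustrated on a toy problem. The sketch below is not BINGO itself: the GP likelihood is replaced by a simple linear-Gaussian fit, and all constants (noise level, priors, step sizes) are illustrative assumptions. It shows the indicator-flip move for S and the reflected random-walk move for the magnitudes H, with posterior link probabilities read off as indicator averages.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: y depends on coordinate 0 but not on coordinate 1.
N = 50
X = rng.normal(size=(N, 2))
y = 1.5 * X[:, 0] + 0.1 * rng.normal(size=N)

def log_post(S, H):
    """Toy log-posterior: Gaussian fit with coefficients S*H, an Exp(1)
    prior on the magnitudes H and a Bernoulli(0.1) sparsity prior on
    the indicators S.  All constants are illustrative."""
    resid = y - X @ (S * H)
    return (-0.5 * resid @ resid / 0.01
            - np.sum(H)
            + np.sum(S) * np.log(0.1) + np.sum(1 - S) * np.log(0.9))

S = np.array([0, 0])
H = np.array([1.0, 1.0])
lp = log_post(S, H)
counts = np.zeros(2)
n_iter, burn = 2000, 500
for it in range(n_iter):
    if it % 2 == 0:
        # indicator move: flip one randomly chosen S_j
        j = rng.integers(2)
        S_new = S.copy()
        S_new[j] = 1 - S_new[j]
        H_new = H
    else:
        # magnitude move: reflected random walk on H
        S_new = S
        H_new = np.abs(H + 0.1 * rng.normal(size=2))
    lp_new = log_post(S_new, H_new)
    if np.log(rng.uniform()) < lp_new - lp:   # Metropolis acceptance
        S, H, lp = S_new.copy(), H_new.copy(), lp_new
    if it >= burn:
        counts += S
post = counts / (n_iter - burn)               # posterior link probabilities
```

After burn-in, the true regulator (coordinate 0) is switched on in essentially every sample, while the spurious one stays mostly off, mirroring how the averaged S samples form the inferred adjacency matrix.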

Supplementary note 5: Incorporating several time series and knockout/knockdown experiments
Several time series experiments can easily be incorporated. For fixed f, the probability distributions of the different time series are independent. In the end, this leads to the same form of the probability distribution (7), but with the trajectories concatenated. Then X contains the concatenated trajectories, except for the first point of each separate discretised trajectory, and X̄ contains all trajectories, except for the last point of each trajectory.
In a knockout experiment a particular gene is "de-activated", meaning that its expression is artificially put to zero. From an experiment where gene i has been knocked out, it is not possible to deduce anything about f i , since the dynamics of the i th gene are artificially tampered with. Therefore these experiments are excluded from the cost functions corresponding to f i .
In a steady state experiment, the system is allowed to evolve for a long time without any excitation, so that it finally attains a steady state, where it should hold that f(x_ss) = 0. In the method, some noise is added to steady state measurements, and therefore, at a steady state point x_ss, it is assumed that f_i(x_ss) = v_i,ss, where v_i,ss ∼ N(0, M_ss). The steady state data are incorporated into (7) by replacing K_i(X), X_i − X̄_i, ∆τ, and q_i I with their counterparts augmented by the steady state data, respectively.
A steady state experiment can also be a knockout experiment. At the steady state z i corresponding to knockout of gene i, it should hold that f j (z i ) = 0 for all j, except j = i, since the dynamics of gene i have been artificially tampered with.
A gene knockdown experiment is similar to a gene knockout experiment, but the genes are only repressed instead of completely inactivated, and it is taken into account in exactly the same way as a knockout experiment.
When using all of the knockout and knockdown steady state data, we assume that there is one point x_ss where f_i(x_ss) = 0 for all i. This steady state value is sampled, and its prior is a normal distribution whose mean is the sample mean of all steady state measurements, including the actual steady state measurement, the knockout measurements, the knockdown measurements, and the multifactorial data (in the DREAM4 10-gene challenge). The covariance of the prior distribution of x_ss is the sample covariance of these data, divided by the number of steady state measurements; this corresponds to the sample covariance of the mean. We assume that at the steady state it holds that f_i(x_ss) = v_i,ss, where v_i,ss ∼ N(0, M_i,ss), and at the knockout and knockdown points f_i(x_j,ko) = v_i,ko, where v_i,ko ∼ N(0, M_i,ko). The covariances M_i,ss and M_i,ko are also sampled, and they are given noninformative inverse gamma prior distributions. The knockout/knockdown data are incorporated into p(X|θ) in (7) by replacing X_i − X̄_i, ∆τ, and q_i I with their augmented counterparts, respectively, and K_i(X) is replaced by K_i([X, x_ss, y_i,ko/kd]), where y_i,ko/kd denotes the collection of all knockout/knockdown measurements except for the knockout/knockdown of gene i.

Supplementary note 6: Pseudo-input scheme
Gaussian process regression suffers from a very unfavourable scaling of the computational load with respect to the number of data points. This problem is further aggravated by our scheme, where the number of data points used in the GP regression is in fact the number of discretisation points in the continuous time trajectory. However, we can resort to a pseudo-input scheme, where this scaling becomes linear.
In the pseudo-input scheme [2], the underlying Gaussian process f is characterised through so-called pseudo-data P, consisting of p pseudo-input points and the corresponding pseudo-outputs f̃_j. The number of pseudo-inputs p is specified by the user, based on the available computing power and the size of the original problem. The pseudo-inputs are not related to the inputs of the actual data; instead, they can be considered as hyperparameters, and they can be estimated by a maximum likelihood approach or sampled as well. Another approach is to use only a subset of the actual input-output data (a so-called active set) in the regression [3]. We use the pseudo-input approach of [2], but the main idea then follows [3], that is, the value f(x) at a generic point x is approximated by E(f(x)|P). When the pseudo-outputs f̃_j are integrated out, the approximation leads to the replacement of the matrices K_i(X) in (7) by K_i(X, P) K_i(P)^{−1} K_i(P, X). The approximation used in [2] is more accurate, but its computational cost is much higher when it is not used only for regression.
With this approximation, it is possible to use the Woodbury identity and the matrix determinant lemma again to obtain the expression (9) for the exponent in (7). There, q_i ∆τ is a diagonal matrix, and the full matrix inverse is computed for a p × p matrix instead of an M × M matrix. The downside is that the determinant term becomes the expression (10), where |K_i(P)| must be computed separately. Notice, for example, that |K_i(P)| tends to zero if two pseudo-inputs tend to each other, so it has the effect of pushing the pseudo-input points apart from each other. In the practical implementation, a small increment εI is added to the matrix K_i(P) to ensure numerical stability. This corresponds to assuming that the pseudo-outputs f̃_j are corrupted by small noise (with covariance εI). We sample the pseudo-inputs using random walk sampling, with a uniform prior for the pseudo-inputs in the hypercube covering the actual data.
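The low-rank replacement of K_i(X) can be sketched numerically. In the example below the kernel, the lengthscale, and the pseudo-input locations are arbitrary illustrative choices; the point is that a p × p solve reproduces the full M × M kernel matrix to high accuracy.

```python
import numpy as np

rng = np.random.default_rng(4)

def se(A, B, ell=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

M, p = 200, 20
X = rng.uniform(0.0, 1.0, size=(M, 1))     # discretisation points
P = np.linspace(0.0, 1.0, p)[:, None]      # pseudo-input locations

K_XP = se(X, P)
K_PP = se(P, P) + 1e-8 * np.eye(p)         # jitter eps*I for stability
K_approx = K_XP @ np.linalg.solve(K_PP, K_XP.T)   # K(X,P) K(P)^{-1} K(P,X)

K_full = se(X, X)
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
```

The jitter term plays the same role as the increment εI in the text: it keeps K_i(P) numerically invertible when pseudo-inputs come close to each other.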

Supplementary note 7: Crank-Nicolson sampling
In the presented algorithm, the discretised trajectory X is sampled using MCMC. When the discretisation is refined, the acceptance rate tends to decrease when conventional samplers are used. This can be avoided by Crank-Nicolson sampling [4,5], if the target distribution has a density with respect to a Gaussian measure, p(z) = Φ(z)N(z; m, P).
Crank-Nicolson sampling works well together with the pseudo-input scheme. The term (q_i ∆τ)^{−1} in (9) and the term |q_i ∆τ| in the determinant (10) correspond exactly to the Wiener measure on the discretised trajectory. Notice that the data fit term p(Y|x, θ) is also Gaussian. In order to obtain a sampler producing reasonable trajectory candidates, the Wiener measure is factorised into

W(dx) = Π_{j=1}^m N(x_{t_j} − x_{t_{j−1}}; 0, Q(t_j − t_{j−1})) B_{(t_{j−1},t_j)}(dx),

where B_{(t_{j−1},t_j)}(dx) is the Brownian bridge measure on the interval (t_{j−1}, t_j), fixed to the values x_{t_{j−1}} and x_{t_j} at the end points. Finally, the Gaussian measure used in the Crank-Nicolson sampler is the product of the Brownian bridge measures, and the factors Π_{j=1}^m N(x_{t_j} − x_{t_{j−1}}; 0, Q(t_j − t_{j−1})) are implemented in the acceptance probability.
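The basic mechanism can be illustrated in its simplest (preconditioned Crank-Nicolson) form with a standard Gaussian reference measure. The target, the dimension, and all constants below are illustrative assumptions, not the trajectory sampler of the method itself.

```python
import numpy as np

rng = np.random.default_rng(5)

def pcn_sample(log_phi, dim, beta=0.2, n_iter=5000):
    """Preconditioned Crank-Nicolson sampler for a target proportional
    to exp(log_phi(z)) * N(z; 0, I).  The proposal preserves the
    Gaussian reference measure, so the acceptance ratio involves only
    log_phi, never the (dimension-dependent) Gaussian density."""
    z = np.zeros(dim)
    lp = log_phi(z)
    out = np.empty((n_iter, dim))
    for it in range(n_iter):
        z_new = np.sqrt(1.0 - beta ** 2) * z + beta * rng.normal(size=dim)
        lp_new = log_phi(z_new)
        if np.log(rng.uniform()) < lp_new - lp:
            z, lp = z_new, lp_new
        out[it] = z
    return out

# Illustrative likelihood term pulling the first coordinate towards 2
chain = pcn_sample(lambda z: -2.0 * (z[0] - 2.0) ** 2, dim=10)
```

Because the proposal leaves the Gaussian reference measure invariant, the acceptance rate does not degenerate as the dimension (i.e., the discretisation) is refined, which is exactly the property exploited for the trajectory X.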

Supplementary note 8: Prior specifications
In the experiments, the time series were scaled so that the difference between the maximal and minimal expression value of each gene was one, so that the parameter priors would be consistent across dimensions. The scaling is not strictly necessary, since the priors are either scale free, or are scaled accordingly if the data or the time axis is scaled. The priors for the parameters are as follows:
• Noninformative inverse gamma priors for the process noise covariance q_i, the measurement noise covariance r_i, and the steady state covariance M_i,ss.
• Exponential priors for a_i ≥ 0, b_i ≥ 0, and β_{i,j}, with rates depending on V(Y_i), the variation of the i-th component of the trajectory per time unit (approximated from the data), and on ran(Y_j), the range of the j-th trajectory. Note that ran(Y_j) = 1 if the time series are scaled as described above.
• A truncated gamma prior for γ_i, whose parameters involve σ(∆Y_i), an estimate of the variance of the derivative of the i-th component of the trajectory.
• An inverse gamma prior for the knockout measurement covariance M_i,ko, where N_i,ko is the number of knockout measurements taken into account when inferring links pointing to gene i.
Ideally, M_i,ko would also have a noninformative prior, but it was observed that this variable had a tendency to become either very small, thereby giving all weight to the knockout data and neglecting the time series data, or very large, with the opposite effect. This might be due to some mismatch between the time series data and the knockout data. Nevertheless, using all data simultaneously still seemed to produce the best results, but in order to achieve a good balance between the two data types, the values of M_i,ko have to be confined to a suitable range using an informative prior such as this one.
Other details of the numerical examples are presented in Supplementary table 2. The only parameter that the user has to choose is the sparsity parameter η in the topology prior p(S). As noted in the main text, in all experiments this parameter was set to η = 1/n, where n is the dimension of the system. This prior roughly corresponds to having one incoming link for each node. This choice was not optimal: it seems that in the DREAM4 dataset the performance was better with a bigger η, whereas with the simulated circadian clock and the IRMA data, the performance was better with a smaller η. This might be due to the richness of the data. In the DREAM4 data there are several time series for each network, and the majority of the network could be inferred correctly. The circadian clock network, on the other hand, was inferred from one time series, and in this case it is not possible to recover the full network at once. In all experiments, the time discretisation of the Euler discretised trajectory was four times finer than the measurement discretisation, that is, there were three sampling points between two measurements. For the DREAM4 100-gene case, the method was parallelised over three processors, each carrying out MCMC sampling independently. The number of pseudo-inputs was 50 in all experiments.
Supplementary note 9: Remarks on the benchmark data examples

Remark 3. The AUROC and AUPR values for the method CSI for the DREAM4 data are taken from [6], where the self interactions are included in the computation of these values. The self interactions are given a weight zero, and hence all methods get 10 or 100 "free" true negatives, depending on the network size. This has some increasing effect on the AUROC values for networks of size 10 (they report a mean AUROC of 0.55 for random networks, as opposed to 0.5). The effect on the 100-gene network results and on all AUPR values is negligible.
Remark 4. In the ARNI method, the user has to choose the type and the order of the basis functions. In the DREAM4 10-gene case, we tried all basis function sets provided in the Matlab implementation with a variety of orders, and the best performing combinations were tried on the 100-gene case. The best overall performance in the different DREAM4 inference tasks was achieved with polynomial basis functions of degree 3. The values reported in all the results in the main text are obtained with these basis functions. In the article [7], a method for basis function selection has been introduced, but it was not implemented.
The ARNI method considers a regression problem with input-output pairs formed from the time series data {y_j}. We made a small modification to the implementation by replacing the inputs with y_j, which improved the method's performance.
We could not exactly reproduce the dynGENIE3 results reported in [8] for the DREAM4 in silico network inference challenge data. We obtained similar results, but the scores for the different networks varied from the reported scores. Finally, we decided to include results from our own simulations taking the perturbations into account, whereby the results improved slightly. The inputs were incorporated by including five (or ten in the 100-gene case) additional signals in the time series, of which the j-th signal consisted of 10 ones and 11 zeros in the j-th experiment, and only zeros in the other experiments.
We used the "random forest" option with K = n in the DREAM4 experiment (as in [8]), but in the other experiments we used K = √n, which is the default setting in the dynGENIE3 code.
For GRNTE, all time series were used at once as replicates with the DREAM4 data. However, the IRMA data is not evenly sampled, and sampling times differ for different time series. Therefore the method was applied separately on each time series, and an average network was used as the end result.

Remark 5. In the analysis of all the results, we have ignored self-regulation, as was done in the DREAM4 challenge. Therefore, in the IRMA network, the maximum number of possible links is 20, whereas in references [9,10] the self-regulation is somehow taken into account (details are not given), and the number of possible links is 25. Moreover, it seems that in [10] one link (Gal4 → Swi5) has been omitted from the ground truth. With these criteria, using the ground truth network from [11], shown in Supplementary figure 1, ELM-GRNNminer had 6 true positives (out of the eight links in the ground truth network) and 3 false positives (out of the 20 − 8 = 12 absent links), and TD-ARACNE had 5 true positives and 2 false positives. These numbers were used to plot the predictions in Figure 2 of the main text.