Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks

Gene expression datasets consist of thousands of genes with relatively small sample sizes (i.e. they are large-$p$-small-$n$). Moreover, dependencies of various orders co-exist in these datasets. In the Undirected probabilistic Graphical Model (UGM) framework the Glasso algorithm has been proposed to deal with high dimensional micro-array datasets by enforcing sparsity. Modifications of the default Glasso algorithm have also been developed to overcome the problem of complex interaction structure. In this work we advocate the use of a simple score-based Hill Climbing (HC) algorithm that learns Gaussian Bayesian Networks (BNs) built on Directed Acyclic Graphs (DAGs). We compare HC with Glasso and its modifications in the UGM framework on their capability to reconstruct GRNs from micro-array data of the Escherichia coli genome. We benefit from the analytical properties of the Joint Probability Density (JPD) function, on which both directed and undirected PGMs build, to convert DAGs to UGMs. We conclude that dependencies in complex data are learned best by the HC algorithm, which represents them most accurately and efficiently, simultaneously modelling the strong local and the weaker but significant global connections that coexist in the gene expression dataset. The HC algorithm adapts intrinsically to the complex dependency structure of the dataset, without forcing a specific structure in advance. On the contrary, Glasso and its modifications model unnecessary dependencies at the expense of the probabilistic information in the network and introduce a structural bias in the JPD function that can only be relieved by including many parameters.


Introduction
The reconstruction of Gene Regulatory Networks (GRNs) from gene expression data is an open problem that has attracted a great deal of interest for decades. GRNs model the interaction structure of genes in a network and are important to understand the biological functions in living organisms as well as the regulation of diseases. The increasing availability and improved systematic storage of gene expression data obtained with DNA microarrays [1] revolutionized the use of mathematical and computational models for GRNs in the past two decades [2,3].
Mathematical models range from more complex models relying on sets of differential equations that directly describe dynamic changes in GRNs [4,5] to simpler models that describe GRNs building on a graph representation (graphical models or network models). The latter attract attention due to their capacity to visualize the complex interaction structure of micro-array data in a graph. The simplest and most widely used example of a graphical model is the pairwise Correlation Network (CN) [6], but lately more advanced Probabilistic Network Models (PNMs), which use machine learning algorithms to comprehensively model conditional and/or partial dependencies, have gained in popularity.
One of the first applications of PNMs to reconstruct GRNs was made by Friedman et al. [7]. Their work substantially promoted later research in the field. They analyzed the use of Bayesian Networks (BNs), a subclass of PNMs relying on conditional dependencies modeled in a Directed Acyclic Graph (DAG), on data of S. cerevisiae cell-cycle measurements. In their work they set the field to model GRNs with two approaches, multinomial BNs and Gaussian BNs.1

1 We leave out the analysis of Correlation Networks (CNs), which rely on the (Σ) representation generally generated from a thresholded sample correlation matrix, because they cannot properly regularize high dimensional data. We refer the interested reader to Graafland et al. [27] for an extensive comparison study between BNs and CNs. The representations of the Gaussian JPD function in terms of (Σ⁻¹) and (β, ν) are described in the Methods section "Probabilistic Gaussian Network Models" and subsections "Probabilistic PN (BN) models".
As illustrated in Figure 1, the algorithms under study, Glasso (and modifications) and Hill Climbing, differ in the parameter set (and hence the representation) they aim to learn. They do, however, coincide in their machine learning spirit; both algorithms attempt to find a parameter set that optimizes a score function in which a statistical criterion (penalization) is integrated. The score functions and their statistical criteria are described in more detail in the Methods sections "Learning PN (BN) structure from data". The learning methods (the steps in the algorithms) are chosen to efficiently execute this optimization task and are described in the Methods section "Learning with Hill Climbing and Glasso". In the case of Hill Climbing the search space is restricted to that of DAGs, and in the case of Glasso to that of symmetric positive definite precision matrices, which, converted to binary format, are translated into undirected graphs. These constraints ensure that the edges in the networks are associated with, respectively, the regression coefficients β in the (β, ν) parametrization of the JPD function and the entries of the precision matrix Σ⁻¹ in the Σ⁻¹ parametrization.
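To make the score-based search concrete, the following sketch implements a minimal, addition-only Hill Climbing over DAGs for Gaussian data, scoring candidates with the BIC (log-likelihood plus an l0-type penalty on the number of parameters). It is an illustrative simplification of what implementations such as bnlearn's hc do (those also allow edge deletions and reversals), not the code used in this study; all names are our own.

```python
import numpy as np

def node_loglik(data, i, parents):
    """Gaussian log-likelihood of node i given its parents (ML linear regression)."""
    n = data.shape[0]
    y = data[:, i]
    X = np.column_stack([np.ones(n)] + [data[:, j] for j in parents])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    nu = resid @ resid / n                      # ML conditional variance
    return -0.5 * n * (np.log(2 * np.pi * nu) + 1)

def bic_score(data, parent_sets):
    """BIC = log-likelihood - (log n / 2) * number of parameters (l0 penalty)."""
    n = data.shape[0]
    ll = sum(node_loglik(data, i, ps) for i, ps in parent_sets.items())
    k = sum(len(ps) + 2 for ps in parent_sets.values())  # betas + intercept + variance
    return ll - 0.5 * np.log(n) * k

def creates_cycle(parent_sets, child, parent):
    """Would adding the arc parent -> child close a directed cycle?"""
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parent_sets[node])     # walk up the ancestors of `parent`
    return False

def hill_climb(data, max_iter=50):
    """Greedy, addition-only hill climbing over DAGs maximizing the BIC score."""
    N = data.shape[1]
    parent_sets = {i: [] for i in range(N)}
    score = bic_score(data, parent_sets)
    for _ in range(max_iter):
        best = None
        for child in range(N):
            for parent in range(N):
                if parent == child or parent in parent_sets[child]:
                    continue
                if creates_cycle(parent_sets, child, parent):
                    continue
                trial = {i: list(ps) for i, ps in parent_sets.items()}
                trial[child].append(parent)
                s = bic_score(data, trial)
                if s > score and (best is None or s > best[0]):
                    best = (s, trial)
        if best is None:        # no single-arc addition improves the score: stop
            break
        score, parent_sets = best
    return parent_sets, score
```

On data simulated from a chain X0 → X1 → X2, this search recovers the correct two-edge skeleton and then stops, because adding the marginal-but-not-conditional X0-X2 dependency no longer pays its BIC penalty.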
Once learned, a PNM is in general not bound to its initial parametrization/representation. At some cost, an analytical transformation of the parameter set can change the representation of the PNM. In Figure 1 the purple arrow represents a direct transformation between parameter sets from BNs to PNs for which an analytical formula exists. The analytical transformation is described in the Methods section "Transformation of probabilistic BN model to probabilistic PN model". The transformation causes the initial BN to lose some information on its independency structure (see how the graphs of BNs and PNs encode independence statements about the JPD function in the Methods section "Dependencies in BN and PN structure"), but allows us to compare Hill Climbing and Graphical lasso in the same UNM framework.
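For Gaussian BNs the standard closed form of this transformation (which we assume is the one meant here; the Methods section gives the exact formula) is Θ = (I − B)ᵀ D⁻¹ (I − B), with B the matrix of regression coefficients β_ij and D = diag(ν) the conditional variances. A minimal numpy sketch with our own notation, assuming zero means:

```python
import numpy as np

def bn_to_precision(B, nu):
    """Convert Gaussian BN parameters (B, nu) to the precision matrix Theta.

    B[i, j] = regression coefficient of parent X_j in the CPD of X_i
    nu[i]   = conditional variance of X_i given its parents
    Theta   = (I - B)^T diag(nu)^-1 (I - B), the parametrization Glasso targets.
    """
    N = len(nu)
    A = np.eye(N) - B
    return A.T @ np.diag(1.0 / np.asarray(nu)) @ A

# Chain X0 -> X1 -> X2: here the undirected model keeps the same skeleton
# (theta[0, 2] = 0); a common child (collider) would instead add an edge.
B = np.zeros((3, 3))
B[1, 0] = 0.9      # X1 regressed on X0
B[2, 1] = 0.9      # X2 regressed on X1
theta = bn_to_precision(B, nu=[1.0, 0.25, 0.25])
```

The resulting Θ is symmetric by construction, so each directed arc becomes one undirected edge carrying a precision-matrix entry, exactly the type of parameter Glasso estimates.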

Data, reference network and algorithm settings
We use two publicly available datasets for the Escherichia coli genome to evaluate our structure learning algorithms. On the one hand we use the E. coli micro-array dataset that is available from the Many Microbe Microarrays database (M3D) [1], from which PNMs are learned with the four algorithms HC, Glasso, HGlasso and SFGlasso. In particular we use the latest version at the time of writing, E coli v4 Build 6. This dataset contains the uniformly normalized expression levels of 4297 genes in E. coli measured under 466 experimental conditions using Affymetrix arrays. Expression levels from the same experimental conditions are averaged and their mean expression provides one of the 466 sample points in the dataset. A reference network, on the other hand, is constructed from a dataset containing evidence about transcriptional regulations: the interactions between Transcription Factors (TFs), which arise from TF encoding genes, and their target genes. For the E. coli genome, a number of studies generated information on transcriptional regulations; the Regulon Data Base (RegulonDB) is the primary database [31] in which this information is gathered. We will use the most complete file in the RegulonDB, called network tf gene.txt.
We require the algorithms to construct networks using only the expression levels in M3D of the 1683 genes for which evidence is reported in the RegulonDB. The known interactions in the RegulonDB will then act as a reference network that enables us to evaluate the topological accuracy of the GRNs learned from the micro-array data. A similar strategy is applied in [29]. The 1683 genes contain a total of 173 TF encoding genes, of which 172 count as the origin of all transcriptional regulation interactions in the reference network. Together they account for 3381 unique interactions without self-loops. As interactions go from TF encoding genes to target genes, the reference RegulonDB network is by nature a directed causal network. We expect the 172 TF encoding genes in RegulonDB to (implicitly) determine the dependency structure of the M3D dataset.
The number of edges needed to construct a PNM from the micro-array dataset plays an important role in the quality and the practical possibilities of a structure learning algorithm. With this in mind we generate networks of different sizes for all algorithms. To vary the number of edges |E| we vary, for Glasso, HGlasso and SFGlasso respectively, the initial parameters λ and λ1, λ2, λ3 and α (see Methods section "Learning PN structure from data"). With respect to Hill Climbing we obtain networks with different numbers of edges by varying the number of iterations while using the standard BIC score (an action that gives similar results as applying the BICγ score and varying the parameter γ [28]). The directed BNs learned by Hill Climbing will be transformed to UNMs. This process, described in the Methods section "Transformation of probabilistic BN model to probabilistic PN model", will cost some efficiency (extra, unnecessary parameters/edges will be added), but allows for comparison with the products of (SF-, H- and) Glasso, as the transformed edges will encode the same type of parameters.
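The extra edges added by the DAG-to-UNM conversion come from moralization: dropping arc directions and connecting parents that share a child. A small sketch (hypothetical helper, not the code used in the study) that makes the overhead countable:

```python
def moralize(parent_sets):
    """Undirected edge set of the moral graph of a DAG given as {child: [parents]}."""
    edges = set()
    for child, parents in parent_sets.items():
        for p in parents:
            edges.add(frozenset((child, p)))      # drop the arc direction
        for i, p in enumerate(parents):           # 'marry' co-parents of each child
            for q in parents[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

# v-structure X0 -> X2 <- X1: 2 directed arcs become 3 undirected edges,
# the extra parent-parent edge being the 'unnecessary' parameter mentioned above.
dag = {0: [], 1: [], 2: [0, 1]}
moral = moralize(dag)
```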

True Positives versus Network size
The topological accuracy of a single learned GRN is measured by the number of True Positive edges (TPs) according to the reference RegulonDB network. We define an undirected edge in the learned UNMs as a TP if there exists an associated directed edge in the RegulonDB network. Also, the number of connected Transcription Factors (TFs) in the estimated network is measured as an indication of the hub structure of the learned networks. As mentioned above, these measures are analyzed in light of the network size, to measure the capability of the algorithms to produce accurate, yet compact and sparse networks. In Figure 2 results on TPs and TFs with respect to the total number of estimated edges |E| in the network are reported for the four algorithms. The dots indicate the number of TPs and the dot labels, in the form of numbers, indicate the number of TFs found in the corresponding network.
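Both measures follow directly from the learned undirected edge set and the directed reference edges; a sketch with made-up gene labels (our own helper, not the evaluation code of the study):

```python
def count_tp_tf(learned_undirected, reference_directed, tf_genes):
    """TPs: learned undirected edges matched by a directed reference edge
    (in either direction). TFs: transcription-factor genes appearing in at
    least one learned edge."""
    ref = {frozenset(e) for e in reference_directed}
    tps = sum(1 for e in learned_undirected if frozenset(e) in ref)
    connected = {g for e in learned_undirected for g in e}
    tfs = len(connected & set(tf_genes))
    return tps, tfs

reference = [("fhlA", "hypE"), ("hyfR", "hyfJ")]   # TF -> target gene
learned = [("hypE", "fhlA"), ("hyfA", "hyfB")]     # undirected edges
tps, tfs = count_tp_tf(learned, reference, tf_genes={"fhlA", "hyfR"})
```

Here the learned hypE-fhlA edge counts as a TP because the reference contains fhlA → hypE, while the child-child edge hyfA-hyfB does not.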
The UNMs obtained with HC contain the highest number of TPs and the highest number of TFs for sparse networks up to 5000 edges. The biggest gain in TPs with respect to the number of edges |E| occurs in the range from 1000 to 1800 edges. In this range the number of recovered TFs also grows fastest (around 20 new TFs per 100 added edges). This simultaneous acceleration in growth is intuitive, as all TPs in RegulonDB involve at least one TF. The classic Glasso recovers fewer TP edges than HC in sparse networks; its recovery rate is especially lower in the range of 1000 to 1800 edges, which could be related to the fact that in this range Glasso only recovers between 1 and 2 new TFs per 100 edges. In bigger networks of more than 5000 edges the number of TPs exceeds the HC values, and we expect that in networks of around 20000 edges (not displayed) the numbers of TFs will be almost equal. In the discussion we go into more detail on the strategies of Glasso and HC to discover TPs.
With respect to HGlasso, it depends on the network size and on the combination of the values of the parameters λ1, λ2 and λ3 whether HGlasso can outperform Glasso or not. Take for example the combination λ1 = 0.9, λ2 = variable and λ3 = 14 (orange dots). These networks contain more TFs than classic Glasso networks, but fewer TPs are found. The impact of the relaxed penalty term in the score function for hubs is thus visible, but is not realized through the capture of more TPs by the HGlasso algorithm. The best results of HGlasso are obtained with λ1 = variable, λ2 = 0.95 and λ3 = 7 (grey dots); medium networks (upward of 3000 edges) with these parameters not only contain more TFs, but also outperform the classic Glasso on TPs. Finally, we see that SFGlasso performs worse than the classic Glasso and HGlasso on both the number of TPs and the number of TFs. The discovery rate of TFs is slow, resulting in few added TPs. It seems that the imposed scale-free degree distribution does not fit well the dependency structure that is induced by the TFs in the micro-array data.

Log-likelihood versus network size
The relative probabilistic accuracy of a single learned GRN with respect to another learned GRN is measured by their difference in log-likelihood. The log-likelihood quantifies how well the model explains the data in terms of the likelihood of the available data given the particular probabilistic model (see Methods section "Log-likelihood definition and calculation"). Figure 3 displays the log-likelihood of the networks versus the number of links in the networks. For easy comparison with Figure 2, we again placed the number of integrated TFs as dot labels.
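For a zero-mean Gaussian model with precision matrix Θ, the log-likelihood of n samples with sample covariance S is ℓ = (n/2)(log det Θ − tr(SΘ) − N log 2π); this is presumably the quantity defined in the Methods section referenced above. A numpy sketch with our own helper name:

```python
import numpy as np

def gaussian_loglik(theta, S, n):
    """Log-likelihood of n (zero-mean) samples with sample covariance S
    under a Gaussian model with precision matrix theta."""
    N = theta.shape[0]
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "theta must be positive definite"
    return 0.5 * n * (logdet - np.trace(S @ theta) - N * np.log(2 * np.pi))
```

Note that only the model's precision matrix and the data's sample covariance enter, so networks learned by any of the four algorithms can be compared on the same footing once expressed as a Θ.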
The log-likelihood curve of Hill Climbing improves rapidly until 2000 edges and levels off between 2000 and 7000 edges. Note that the same is true for the number of found TFs, which grows rapidly until 2000 edges, reaching 153 TFs, and then slows down until 7000 edges, reaching the total of 173 TFs. The true positives curve in Figure 2 also begins to flatten around 2000 edges. Hence, in the case of HC, the validation measures TPs and TFs can be directly associated with the amount of information that is extracted from the micro-array data and captured in the network.
In the case of H-, SF- and Glasso the relation between log-likelihood and topological measures is more complex. The Glasso log-likelihood curve remains far below the HC curve for small and medium networks. From Figure 2 we learn that Glasso values of the topological measures eventually approach HC values when including more and more edges. Figure 3 illustrates that this is not true (or in any case not true for regularized networks) for log-likelihood values. In the discussion we examine the relation between log-likelihood and TPs for Glasso and HC.
All HGlasso networks score better on log-likelihood than the classical Glasso. For HGlasso networks that improve on Glasso networks in terms of TPs (e.g. HGlasso networks with λ2 = 0.95, λ3 = 10 upward of 3000 edges, grey dots) this improvement is even more pronounced. Still, the HC log-likelihood values are reached by the adapted HGlasso algorithm neither at small nor at medium edge sizes. As in the case of TPs, SFGlasso scores worse than Glasso on log-likelihood for small to medium network sizes. We do observe that the more edges are added to the SFGlasso network, the closer its log-likelihood values come to the Glasso values.

Illustration of network structure
To illustrate the above results on topological and probabilistic accuracy, we zoom into a part of the reference RegulonDB network that contains the two TF encoding genes hyfR and fhlA, and we compare it with the networks learned from M3D with the four structure learning algorithms. Window a in Figure 4 shows the two TF encoding genes hyfR and fhlA (coloured red) and their direct neighbours in the RegulonDB network (among the direct neighbours are three other TFs that are also coloured red: crp, fnr and nsrR). The RegulonDB is a directed network in which the meaning of the edges is causal. Symmetric dependencies that could exist between children of hyfR and/or fhlA due to their common parent are not documented in RegulonDB and hence are not displayed in window a. Windows b-e show the same subset of nodes as window a and form part of the networks learned from M3D by, respectively, HC, Glasso, HGlasso and SFGlasso. These networks are all (converted to, in the case of HC) undirected PNMs (or UNMs), in which edges do not indicate causal influence but direct dependency (the edges represent non-zero entries in the symmetric precision matrix). The exact encoding of dependencies in the networks is described in the Methods section "Dependencies in PN". The grey nodes indicate genes that are integrated in the network, and red edges indicate edges that find their direct counterpart in the reference network in window a, i.e. edges that are denominated TPs. The edge labels display the weight of the edges, the estimated partial variation. In window d, blue edges and blue node outlines indicate edges and nodes that are included in the HGlasso network but not in the Glasso network of similar size in window c, whereas grey edges and black node outlines indicate edges and nodes that exist in both networks.
The results in the above subsections are illustrated in the subnetworks. In a sparse network, Hill Climbing (window b) discovers both TF encoding genes, hyfR and fhlA, and with them two TP edges, between hyfR and hyfJ and between fhlA and hypE. Glasso, HGlasso and SFGlasso (windows c, d and e) discover only one TF, hyfR, and find one TP related to hyfR.
Besides finding the most TPs, Hill Climbing also discovers the most unique child-child dependencies between the target genes that are regulated by the transcription factors HyfR and FhlA. Child-child dependencies are not considered TPs in the RegulonDB network, but they do improve the quality of the UNM network structure. For example, the community {hyfA, hyfB . . . hyfJ} and the TP between hyfR and hyfJ can be an indication that the set {hyfA, hyfB . . . hyfJ} \ {hyfJ} is also progeny of HyfR.
Glasso finds a smaller variety of child-child dependencies between nodes that are regulated by HyfR and FhlA. Instead, the subgroups of children are interconnected with more edges between them. These edges are probably true but, from an information theoretic perspective, redundant in the sense that they lower the entropy of the community structure in the network (see also [27]). Moreover, the values of the parameters, the partial variations, in Glasso differ in magnitude from the values of the parameters in HC. The low partial variations as estimated by (H)Glasso lower the log-likelihood values with respect to Hill Climbing. In the discussion we will come back to the estimation of the magnitude of the parameters and its impact on log-likelihood values.
In window d, HGlasso improves the network structure with respect to Glasso with an additionally connected gene, hyfG, while using the same number of edges. Also, parameter estimation seems more accurate, as HGlasso includes higher partial variation values. In general, the inclusion of more significant edges (for some combinations of λ1, λ2, λ3) and more precise parameter estimation raise the log-likelihood values of HGlasso with respect to Glasso, as shown in Figures 2 and 3.
The SFGlasso in window e, on the other hand, excludes four more genes than the classic Glasso. Not many dependencies are found in this part of the subnetwork. Moreover, the estimation of the parameters is worse in this part of the subnetwork: partial variations are of even smaller magnitude than for Glasso. These results are in accordance with Figure 2, showing that SFGlasso networks are less complete than (H)Glasso networks, and with Figure 3, showing that the least amount of information can be induced from SFGlasso networks.

Discussion
Our results on sparse and medium networks showed high log-likelihood values for HC with respect to the classic Glasso and its modifications. Moreover, the results for sparse networks (up to 5000 edges) show that the HC algorithm finds the most true interactions (in the form of TPs). We first discuss the log-likelihood results in light of the topology of the learned networks and of the score functions that are integrated in the algorithms.
Then we discuss the results on TPs in light of the same score functions. In the second part of the discussion we place the results of HC alongside two other partial-variation learning algorithms, SPACE and ESPACE, that are evaluated in [29] on the same E. coli dataset. In the third part we compare the results of HC in this study with the results of HC in a recent study on high-dimensional climate data and find some generalities about Gaussian BNs that are learned with HC and applied to high-dimensional real-world data [27].
When analyzing log-likelihood values, we should firstly keep in mind that not all interactions that can be inferred from the M3D data are equally informative, and thus they do not contribute equally to the log-likelihood. It is intuitive to regard the evidence in the RegulonDB that takes the form of TF-target gene interactions as the backbone structure of the M3D data, whereas local interaction structure between children of the same TFs is less informative with respect to the complete M3D database. In this light, the HC algorithm produces network topologies that contribute to high log-likelihood values, finding relatively more TPs and many more TFs than Glasso and its modifications in sparse networks, and just enough links to determine the local interaction structure. Secondly, accurate weight estimation of the incorporated parameters (the weights of the edges) raises log-likelihood values. The HC algorithm is able to accurately estimate the weight of the parameters. This is the fruit of the l0 penalty used in the score function, which only penalizes the number of parameters and not their weight. The l1 regularization in Glasso, on the other hand, while penalizing the number of parameters, also penalizes their weight. The l1 penalty thus imposes a structural bias on the selected parameters in the JPD, resulting in inaccurate parameter estimation and substantially lower log-likelihood values. The score functions including l0 or l1 penalization are given in detail in the Methods section "Learning BN (PN) structure from data".
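The bias mechanism is easiest to see in one dimension: with an orthonormal design, the l1 (lasso) solution is a soft-thresholded version of the ML estimate, shrinking every retained parameter by λ, while an l0-type penalty either keeps the ML estimate untouched or drops it (hard thresholding). A toy illustration (our own simplification, not the Glasso update itself):

```python
def soft_threshold(beta_ml, lam):
    """l1 (lasso) estimate in the orthonormal case: select AND shrink."""
    if beta_ml > lam:
        return beta_ml - lam
    if beta_ml < -lam:
        return beta_ml + lam
    return 0.0

def hard_threshold(beta_ml, lam):
    """l0 (BIC-like) estimate: select, but do not shrink."""
    return beta_ml if abs(beta_ml) > lam else 0.0

# A strong edge (beta_ml = 2.0) survives both penalties,
# but only the l1 estimate is biased towards zero.
lasso_est = soft_threshold(2.0, 0.5)   # shrunken
l0_est = hard_threshold(2.0, 0.5)      # unbiased
```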
The answer to the question why HC includes more TPs and TFs than Glasso and its modifications has to do with the integrated score functions and with the steps in the learning algorithms. In every iteration HC can choose to incorporate one parameter. This edge is selected to maximize the score function, which consists partly of the log-likelihood and partly of the l0 penalization term that penalizes the number of parameters. On the one hand there are highly informative links (TPs), of which there are relatively few, that alter the log-likelihood substantially. On the other hand there are local links, of which there are many, that are relatively less informative. HC is designed to raise the log-likelihood with the least number of parameters, and thus it first includes TPs and informative local links (that enlarge communities) and only then fills the network with less informative local links (that fill communities). The score function of Glasso, on the other hand, consists of the same log-likelihood part and of the l1 penalization term that penalizes, besides the number, also the weight of the edges. Local links have a stronger direct correlation, and it is for this reason that Glasso includes more of them, whether informative or uninformative, as local links survive the application of the statistical criterion. More informative links such as TPs have weaker direct correlation, and in sparse networks Glasso shrinks those links to zero.
The differences in results between Glasso and its modifications can be explained along the same lines. Due to the adapted l1 regularization, HGlasso performs better on log-likelihood than the classical Glasso for every combination of the initial parameters λ1, λ2, λ3. The new score function always alleviates the penalization on parameters that are related to hubs, resulting in higher log-likelihood values with respect to the classic Glasso. HGlasso, however, still depends heavily on λ1 (the penalty term for non-hubs) to introduce overall sparsity, and thus cannot reduce the mean parameter bias sufficiently to reach HC values for small and medium networks. With respect to the measure of true positives, it depends on the combination of tuning parameters and on the network size whether HGlasso outperforms Glasso. A successful combination of tuning parameters leads HGlasso to find more TPs. These networks generally contained more TFs than the associated Glasso network. However, the incorporation of more TFs did not always lead to more True Positives in the network. We found no standard method to discover combinations of initial parameters with which Glasso could be outperformed. Finding such a combination is a time consuming process of trial and error.
The algorithm SFGlasso does incorporate a hub structure, but the networks generally contain fewer TFs than those of Glasso and the combinations of HGlasso. As judged by the low log-likelihood values, the algorithm thus selects 'wrong' hubs and also places great bias on the estimates of informative links. These results may have to do with the poor adjustment of SFGlasso to high dimensional data. Generally, in scale-free networks, the coefficient α of the power law with which the degree distribution decays lies between 2 and 3. However, we had to reduce α to values below 0.5, as otherwise the resulting networks stayed empty (even when using only two iterations). Moreover, in this work we only display results after two iterations, because with more than two iterations the lasso algorithm took an extraordinarily long time to converge.
The analysis of HGlasso and SFGlasso with respect to Glasso is a strong indication that GRNs are not fully characterized by scale-free degree distributions and/or hub nodes. The structure of a GRN seems to be more complex and probably not completely definable. We expect algorithms that focus only on these specific characteristics, while leaving out other characteristics of the real-world data, to improve on the classic Glasso only to a certain extent.
As mentioned before, two other algorithms were applied to (a less up-to-date version of) the micro-array data used in this paper: the algorithms SPACE and ESPACE [29]. The algorithm SPACE uses sparse covariance selection techniques but differs from Glasso; it was especially designed for the large-p-small-n data in the framework of GRNs and for powerful identification of hubs [32]. The ESPACE algorithm is an extension of the SPACE method and explicitly accounts for prior information on hub genes, which, in the case of the E. coli data, amounts to knowing in advance that TFs are the highly connected nodes in the true E. coli GRN. We can loosely compare the HC algorithm with the performance of the algorithms SPACE and ESPACE. In table 6 of their work, Yu et al. [29] report a percentage of 4.35% of TPs in a network of 386 edges for SPACE, whereas ESPACE found 12.89% TPs in a network of 349 edges. In our work HC attains a percentage of 4.85% of TPs in a network of 350 edges, outperforming SPACE. We may conclude with some precaution, as the datasets are not entirely equal, that HC outperforms the exploratory algorithm SPACE, but cannot compete with ESPACE, which starts with prior information about the true hubs.
Finally, the great power of HC when exploring E. coli data with unknown structure is that one can extract information about the true deterministic underlying network structure from every edge, as was illustrated by the graphs in Figure 4. The fact that no edge is redundant with respect to the dataset enables us to learn from every 'false' positive edge, whereas in the case of Glasso and its variants this is not true: not every edge is data-significant, nor does every edge tell us more about the network structure. The same conclusion was drawn for high dimensional climate data, for which we found that HC provided an informative community structure that can be analyzed well with centrality measures (i.e. betweenness centrality, see [27]). The HC approach prioritizes an efficient edge distribution by favouring heterogeneous selection of edges between communities over uniform selection of edges that lie within communities. We saw that this approach explores very well deterministic features in high dimensional real-world complex systems, like spatial communities interconnected by teleconnections (climate) or regulatory interactions (GRNs). Complex network centrality measures such as community structure and betweenness centrality therefore have high potential for probabilistic BN networks applied to real-world datasets.

Conclusions
The use of the Hill Climbing algorithm that arises in the context of Gaussian Bayesian networks offers a sound approach for the reconstruction of gene regulatory networks from high-dimensional, low-sample-size micro-array data when no initial information is at hand about the underlying complex dependency structure. The HC algorithm picks only the most significant dependencies from the Gaussian data and in this manner naturally includes the complex dependency structure of the GRN, which may consist of hubs, a scale-free degree distribution, outliers, a combination of the former, or other real-world characteristics. The algorithm naturally leaves out uninformative dependencies and variables.
The Bayesian network can easily be transformed to an Undirected Gaussian Probabilistic Network Model, paying the price of some loss of information on the independence structure with respect to the initial directed Bayesian Network. If one prefers an undirected PNM over a directed PNM (for easy interpretation due to symmetric links, or for sparse estimation of the inverse covariance matrix), this study shows that the transformation from BN to UNM is worth this loss of information, as the UNMs obtained by Hill Climbing still outperform UNMs obtained by state-of-the-art UNM structure learning algorithms when applied to high-dimensional, low-sample-size (micro-array) data that contains an unknown complex dependency structure.
This conclusion is drawn with respect to state-of-the-art structure learning algorithms that arise in the context of Undirected Gaussian Network Models: the Glasso algorithm and the variants of Glasso that were developed to integrate complex dependency structure. These algorithms model unnecessary dependencies at the expense of the probabilistic information in the network and introduce a structural bias in the probability function that can only be relieved by including many parameters. In the case of the E. coli gene expression data used in this work, unnecessary dependencies also go at the expense of the number of true positive edges, the latter as judged by a reference network compiled from evidence gathered in the RegulonDB.

Probabilistic Gaussian Network Models (PGNMs)
The term refers to the choice of a multivariate Gaussian Joint Probability Density (JPD) function to associate graph edges with model parameters in a given PNM, such that the probabilistic model encodes in the JPD function a large number of random variables that interact with each other in a complex way by means of a graphical model. The multivariate Gaussian JPD function can take various representations in which the dependencies between the variables are described by different types of parameters. The best-known representation of the Gaussian JPD function is in terms of marginal dependencies, i.e., dependencies of the form (X_i, X_j | ∅), as present in the covariance matrix Σ. Let X be an N-dimensional multivariate Gaussian variable; then its probability density function P(X) is given by

  P(X) = (2π)^{-N/2} |Σ|^{-1/2} exp{-(1/2)(X - µ)ᵀ Σ⁻¹ (X - µ)},   (1)

where µ is the N-dimensional mean vector and Σ the N × N covariance matrix. In the following we describe in some detail two types of PGNMs, in which the parameters reflect respectively direct dependencies (X_i, X_j | X \ {X_i, X_j}) and general conditional dependencies (X_i | S) with S ⊆ X (direct dependencies are the least restrictive case of conditional dependencies).
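Equation (1) can be evaluated directly; a small numpy check (our own helper) that at X = µ the density reduces to ((2π)^N |Σ|)^{-1/2}:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density of equation (1)."""
    N = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** N * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
```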

Probabilistic PN models
The Gaussian JPD function in equation (1) can be formulated more generally using a set of factors Φ = {φ_1(S_1), . . . , φ_k(S_k)} that describe dependencies between arbitrary (overlapping) subsets of variables S_1, . . . , S_k which comply with ∪_k S_k = X. This representation of the JPD is called the Gibbs function and is written as [30]

  P(X_1, . . . , X_N) = (1/Z) P̃(X_1, . . . , X_N),   (2)

with the unnormalized measure

  P̃(X_1, . . . , X_N) = ∏_k φ_k(S_k)   (3)

and Z the normalizing partition function. The Gibbs distribution where all of the factors are over subsets of single variables or pairs of variables is called a pairwise Markov network. The factors in a pairwise Markov network correspond to direct dependencies, i.e., dependencies of the form (X_i, X_j | X \ {X_i, X_j}). In a Gaussian distribution these dependencies are present in the inverse covariance matrix or precision matrix Σ⁻¹. The information form of the Gaussian JPD function in terms of the precision matrix Θ = Σ⁻¹,

  P(X) ∝ exp{-(1/2) XᵀΘX}   (4)

(assuming, without loss of generality, zero mean), is equivalent to the Gibbs function in equation (2) with factors defined on every variable and every pair of variables, i.e. Φ = {Φ^n, Φ^e} with φ^n_i = exp{-(1/2) θ_ii X_i²} and φ^e_ij = exp{-θ_ij X_i X_j}. The corresponding PGNM, in which the notion of variables and pairs of variables is extended to the notion of nodes and undirected edges in a graph, is called the Probabilistic PN model. The graph of a PN encodes the probability function in equation (4) as follows. Each node corresponds to a variable X_i ∈ X; the presence of an edge X_i − X_j implies the presence of the factor φ^e_ij in P(X), and hence direct dependency of X_i and X_j. Moreover, the absence of an edge between X_i and X_j in the graph implies the absence of the factor φ^e_ij in P(X) and, thus, the existence of a set of variables S ⊆ X \ {X_i, X_j} that makes X_i and X_j conditionally independent in probability [30].
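The equivalence between the information form and the pairwise factorization can be checked numerically: the product of node factors exp{-(1/2)θ_ii x_i²} and edge factors exp{-θ_ij x_i x_j} over i < j reproduces exp{-(1/2) xᵀΘx} (zero-mean, unnormalized; sign convention chosen so the product matches). A quick numpy check with an arbitrary positive definite Θ:

```python
import numpy as np

def information_form(x, theta):
    """Unnormalized Gaussian density exp{-1/2 x^T Theta x}."""
    return np.exp(-0.5 * x @ theta @ x)

def pairwise_factors(x, theta):
    """Product of node and edge factors of the pairwise Markov network."""
    N = len(x)
    p = 1.0
    for i in range(N):
        p *= np.exp(-0.5 * theta[i, i] * x[i] ** 2)       # node factors
        for j in range(i + 1, N):
            p *= np.exp(-theta[i, j] * x[i] * x[j])       # edge factors
    return p

theta = np.array([[ 2.0, -0.5,  0.0],
                  [-0.5,  2.0, -0.5],
                  [ 0.0, -0.5,  2.0]])
x = np.array([0.3, -1.2, 0.7])
```

Note that the zero entry theta[0, 2] makes its edge factor identically 1, i.e. the edge X_0 − X_2 is absent from the graph.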
The graph structure of PNs in this work is estimated simultaneously with the values of the parameters in Θ that define this structure. This simultaneous learning process is explained in the Methods section "Learning PN structure from data".

Probabilistic BN models
Alternatively, P(X) in equation (1) can be characterized with conditional dependencies of the form X_i | S with S ⊆ X. The representation of the JPD is then a product of Conditional Probability Densities (CPDs):

P(X_1, . . . , X_N) = ∏_{i=1}^{N} P_i(X_i | Π_{X_i}),   (5)

with P_i(X_i | Π_{X_i}) = N( µ_i + Σ_{X_j ∈ Π_{X_i}} β_ij (X_j − µ_j), ν_i ),   (6)

whenever the set of random variables {X_i | Π_{X_i}}_{i ∈ N} is independent [33]. In this representation N is the normal distribution, µ_i is the unconditional mean of X_i, ν_i is the conditional variance of X_i given the set Π_{X_i}, and β_ij is the regression coefficient of X_j when X_i is regressed on Π_{X_i}. We call Π_{X_i} the parent set of the variable X_i. The corresponding PGNM in this case is the Probabilistic BN model. The graph of a BN model is a DAG encoding the corresponding probability distribution as in equation (5). Each node corresponds to a variable X_i ∈ X; the presence of an arc X_j → X_i implies the presence of the factor P_i(X_i | . . . X_j . . . ) in P(X), and thus conditional dependence of X_i and X_j. Moreover, the absence of an arc between X_i and X_j in the graph implies the absence of the factors P_i(X_i | . . . X_j . . . ) and P_j(X_j | . . . X_i . . . ) in P(X) and, thus, the existence of a set of variables S ⊆ X\{X_i, X_j} that makes X_i and X_j conditionally independent in probability [30,34].
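The factorization in equation (5) can be illustrated with a toy linear Gaussian chain X1 → X2 → X3 (a Python sketch; all coefficients and variances are illustrative). The joint density of the implied multivariate Gaussian equals the product of the univariate conditional densities:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Toy linear Gaussian BN X1 -> X2 -> X3 with illustrative parameters:
# X1 ~ N(0, 1), X2 = 0.8*X1 + e2, X3 = -0.5*X2 + e3.
b21, b32 = 0.8, -0.5
v1, v2, v3 = 1.0, 0.5, 2.0

# Implied joint covariance of (X1, X2, X3): Sigma = (I-B)^-1 D (I-B)^-T.
B = np.array([[0, 0, 0], [b21, 0, 0], [0, b32, 0]])
D = np.diag([v1, v2, v3])
I = np.eye(3)
Sigma = np.linalg.inv(I - B) @ D @ np.linalg.inv(I - B).T

x = np.array([0.3, 1.1, -0.4])
joint = multivariate_normal(np.zeros(3), Sigma).pdf(x)

# Product of conditional densities, as in equation (5).
product = (norm(0, np.sqrt(v1)).pdf(x[0])
           * norm(b21 * x[0], np.sqrt(v2)).pdf(x[1])
           * norm(b32 * x[1], np.sqrt(v3)).pdf(x[2]))
assert np.isclose(joint, product)
```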
The graph structure of the BN identifies the parent set Π_{X_i} in equation (5). With this structure available, one easily learns the corresponding parameter set (β, ν); in our case the parameters β_ij and ν_i are a maximum likelihood fit of the linear regression of X_i on its parent set Π_{X_i}. To estimate parameter values from the graph structure we use the appropriate function in the R-package bnlearn [35]. The challenge of learning the graph structure itself is explained in the Methods section "Learning BN structure from data".
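In practice such a maximum likelihood fit reduces to ordinary least squares on the parent set plus a residual variance. A minimal Python sketch with simulated data (coefficients and sample size are illustrative, not the bnlearn implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated data: X3 depends on parents X1 and X2 (illustrative coefficients).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=n)

# Maximum likelihood fit of X3 on its parent set {X1, X2}:
# beta from ordinary least squares, nu from the residual variance.
P = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(P, x3, rcond=None)
nu = np.mean((x3 - P @ beta) ** 2)

# Estimates recover the generating parameters (0.7, -0.3) and nu = 0.5^2.
assert np.allclose(beta, [0.7, -0.3], atol=0.05)
assert abs(nu - 0.25) < 0.02
```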

Learning PN structure from data
A Precision Network (PN) is learned with the help of a structure learning algorithm that estimates the inverse covariance matrix, i.e. the precision matrix Σ^{−1}, of the underlying Gaussian distribution. Converted into binary format, the estimate Θ of Σ^{−1} provides the undirected adjacency matrix A of a pairwise Markov network. From the adjacency matrix A of the graph of a PN, the structure of the factor set Φ = {Φ^n, Φ^e} of the associated Gaussian JPD function (outlined in equation (4)) can be read off directly. The Methods section "Probabilistic PN models" explains how pairwise Markov networks encode the corresponding Gaussian PGNM.
The Graphical lasso (Glasso) can be regarded as the default structure learning algorithm for learning PNs from large-p-small-n datasets. Glasso is a score-based algorithm built on a convex score. This score essentially consists of the Maximum Likelihood Estimate (MLE) of the precision matrix of a Gaussian probability function, to which an l_1 penalty term is added [14]:

Score(Θ, S, λ) = log det Θ − tr(SΘ) − λ ||Θ||_1.   (7)

Here Θ = Σ^{−1}, S is the sample covariance matrix calculated directly from the data D, and λ is a scalar, the penalization coefficient. Networks of different sizes can be generated by varying the penalization parameter λ. In this work we generate networks ranging from zero edges to complete networks by varying λ from 1 to 0. A short outline of the steps in the Graphical lasso algorithm is given in the Methods section "Learning with Hill Climbing and Glasso". The Glasso function is implemented in the R-package glasso [14].
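Although we use the R-package glasso, the same estimator is widely available. A minimal Python sketch using scikit-learn's GraphicalLasso (its parameter alpha plays the role of λ in equation (7); the ground-truth precision matrix below is an illustrative assumption) shows how the penalized estimate is binarized into a PN adjacency matrix:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)

# Illustrative ground-truth sparse precision matrix (chain structure).
Theta_true = np.array([[2.0, -0.9, 0.0],
                       [-0.9, 2.0, -0.9],
                       [0.0, -0.9, 2.0]])
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(3), Sigma_true, size=2000)

# Larger alpha (the penalization coefficient) yields sparser estimates.
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_

# Binarized off-diagonal entries give the adjacency matrix of the PN graph.
A = (np.abs(Theta_hat) > 1e-4).astype(int)
np.fill_diagonal(A, 0)

# The two strong chain edges are recovered.
assert A[0, 1] == 1 and A[1, 2] == 1
```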
The Hub Graphical lasso (HGlasso) learns a PN that contains hub nodes by combining a lasso (l_1) penalty and a sparse group lasso (l_2) penalty [36]. The estimated inverse covariance matrix Θ is decomposed as Θ = V + V^T + Z, where Z is a sparse matrix and V is a matrix that contains the hub nodes. The corresponding score is

Score(Θ, S, λ_1, λ_2, λ_3) = log det Θ − tr(SΘ) − λ_1 ||Z||_1 − λ_2 ||V||_1 − λ_3 Σ_j ||V_{·j}||_2,   (8)

with Θ restricted to Θ = V + V^T + Z. In this score λ_3 controls the selection of hub nodes, and λ_2 controls the sparsity of each hub node's connections to other nodes. We obtain networks of different sizes by varying λ_1, λ_2 and λ_3. The HGlasso function is implemented in the R-package hglasso [36].
The Scale-Free Graphical lasso (SFGlasso) aims to include even more structural information than mere sparsity or hubs [25]. Hubs are expected in this type of network, but the focus lies on learning models that possess the so-called "scale-free" property, a property often claimed to appear in real-world networks. This feature is mathematically expressed by a degree distribution p(d) that follows a power law: p(d) ∝ d^{−α} (from a certain degree d upwards). In the score function of the classic Glasso, the l_1 edge regularization is replaced with a power law regularization. The objective score function is

Score(Θ, S, α, β) = log det Θ − tr(SΘ) − α Σ_i log( ||Θ_{¬i}||_1 + ε_i ) − Σ_i β_i |θ_ii|,   (9)

with Θ_{¬i} = {θ_ij | j ≠ i}. This score function is not convex (convexity being a requirement for applying Glasso); however, the iterative scheme below can be proven to increase it monotonically. The score Score(Θ, S, α, β) is sequentially improved by the elements of a sequence Θ^n that iteratively maximize the following reweighted convex l_1 regularization problems:

Θ^{n+1} = argmax_Θ { log det Θ − tr(SΘ) − Σ_{i≠j} λ_ij(Θ^n) |θ_ij| − Σ_i β_i |θ_ii| },   (10)

where λ_ij(Θ^n) = α ( 1/(||Θ^n_{¬i}||_1 + ε_i) + 1/(||Θ^n_{¬j}||_1 + ε_j) ). This re-weighting reduces the regularization coefficients of nodes with high degree, encouraging the appearance of hubs with high degree.
Following the setup in the experiment section of [25] we take β_i = 2α/ε_i and ε_i equal to the θ_ii estimated in the last iteration; in this way ε_i is of the same order of magnitude as ||Θ^n_{¬j}||_1. Generally, in scale-free networks α lies between 2 and 3. However, we had to reduce α to values below 0.5, as otherwise the resulting networks stay empty (even when using only 2 iterations). To optimize equation (10) and find Θ^{n+1} we iteratively use the glasso function in the R-package glasso with λ = λ(Θ^n), defined above. In this work we display results after 2 iterations. With 3-5 iterations the lasso algorithm took an extraordinarily long time to converge.
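The reweighting step can be sketched numerically (Python; alpha and the previous iterate Θ^n are illustrative stand-ins, with ε_i taken as the current diagonal entries as described above). High-degree nodes receive smaller edge penalties:

```python
import numpy as np

# Illustrative previous estimate Theta_n: node 1 has higher degree than node 2.
alpha = 0.4
Theta_n = np.array([[2.0, -0.9, 0.0],
                    [-0.9, 2.0, -0.1],
                    [0.0, -0.1, 1.5]])

# ||Theta_{-i}||_1: l1 norm of row i without the diagonal element.
off = np.abs(Theta_n).copy()
np.fill_diagonal(off, 0.0)
row_norms = off.sum(axis=1)
eps = np.diag(Theta_n)  # epsilon_i taken as the current theta_ii

# lambda_ij = alpha * (1/(||Theta_{-i}||_1 + eps_i) + 1/(||Theta_{-j}||_1 + eps_j))
lam = alpha * (1.0 / (row_norms + eps)[:, None] + 1.0 / (row_norms + eps)[None, :])

# The edge towards the higher-degree node 1 is penalized less than the edge
# towards the low-degree node 2, encouraging hub formation.
assert lam[0, 1] < lam[0, 2]
```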

Learning BN structure from data
The graph of a BN is estimated with the help of a structure learning algorithm that finds the conditional dependencies between the variables and encodes this information in a DAG. Graphical (dis-)connection in the DAG implies conditional (in-)dependence in probability (see Methods section "Dependencies in BN"). From the structure of a BN, a factorization of the underlying JPD function P(X) of the multivariate random variable X (as given by equation (5)) can be deduced. The Methods section "Probabilistic BN models" explains how networks can be extended to their corresponding Probabilistic Graphical Network Models (PGNMs).
In general there are three types of structure learning algorithms: constraint-based, score-based, and hybrid structure learning algorithms, the latter being a combination of the first two.
Constraint-based algorithms use conditional independence tests of the form Test(X_i, X_j | S; D) with increasingly large candidate separating sets S_{X_i,X_j} to decide whether two variables X_i and X_j are conditionally independent. All constraint-based algorithms are based on the work of Pearl on causal graphical models [37], and the first practical implementation was the PC algorithm [38]. In contrast, score-based algorithms apply general machine learning optimization techniques to learn the structure of a BN. Each candidate network is assigned a network score reflecting its goodness of fit, which the algorithm then attempts to maximise [39]. In [28] we compared structure learning algorithms belonging to the three different classes on accuracy and speed for high-dimensional complex data. We found that score-based algorithms perform best. Algorithms in this class are able to handle high-variable-low-sample-size data and find networks of all desired sizes. Constraint-based algorithms can only model complex data up to a certain size and, as a consequence, for climate data they only reveal local network topology. Hybrid algorithms perform better than constraint-based algorithms on complex data, but worse than score-based algorithms.
In this work we use a simple score-based algorithm, the Hill Climbing (HC) algorithm [39], to learn the BN structure. The HC algorithm starts with an empty graph and iteratively adds, removes or reverses an edge, maximizing the score function. This algorithm is formalized in the Methods section "Learning with Hill Climbing and Glasso". HC is implemented in the R-package bnlearn.
We used the Bayesian Information Criterion (BIC) score (corresponding to BIC_0 in [28]), which is defined as:

BIC(G, D) = Σ_{i=1}^{N} ( log P(X_i | Π_{X_i}) − (|Θ_{X_i}|/2) log n ),   (11)

where G refers to the graph (DAG) for which the BIC score is calculated, P refers to the probability density function that can be deduced from the graph (see Methods section "Probabilistic BN models"), Π_{X_i} refers to the parents of X_i in the graph (i.e., nodes Y with relation Y → X_i in the graph), n is the sample size, and |Θ_{X_i}| is the number of parameters of the local density function P(X_i | Π_{X_i}).
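The score in equation (11) decomposes over nodes, which is what makes local search efficient. A minimal Python sketch of the local term for one Gaussian node (the simulated data and the helper local_bic are illustrative assumptions, not the bnlearn implementation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 500

# Data for a node X with a single true parent Z (illustrative coefficients).
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(scale=1.0, size=n)

def local_bic(child, parents):
    """Log-likelihood of the local Gaussian CPD minus the complexity
    penalty (|Theta_Xi|/2) * log(n), in the spirit of equation (11)."""
    if parents:
        P = np.column_stack(parents)
        beta, *_ = np.linalg.lstsq(P, child, rcond=None)
        mean = P @ beta
    else:
        mean = np.zeros_like(child)
    nu = np.mean((child - mean) ** 2)
    loglik = norm(mean, np.sqrt(nu)).logpdf(child).sum()
    n_params = len(parents) + 1  # regression coefficients plus the variance
    return loglik - 0.5 * n_params * np.log(len(child))

# Adding the true parent improves the local BIC score.
assert local_bic(x, [z]) > local_bic(x, [])
```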

Dependencies in BN and PN structure
In the following we describe how a BN (or DAG) and a PN (or Pairwise Markov Network) encode conditional dependencies. New nomenclature is indicated with an asterisk and illustrated in Figure 5a and c.

Dependencies in BN
In a BN two nodes X and Y are conditionally dependent given a set S (denoted by D(X, Y |S)) if and only if they are graphically connected, that is, if and only if there exists a path U* between X and Y satisfying the following two conditions:
• Condition (1): for every collider* C on U (a node C such that the part of U that goes over C has the form of a V-structure, i.e., → C ←), either C or a descendant* of C is in S.
• Condition (2): no non-collider on U is in S.
If the above conditions do not hold, we call X and Y conditionally independent given the set S (denoted by I(X, Y |S)). Marginal dependency between two nodes can be encoded by any path U with no V-structures. In Figure 5b six conditional (in)dependence statements are highlighted in a simple DAG. In the caption of Figure 5 one of the statements is proved by means of conditions (1) and (2).
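Condition (1) can be made concrete with a small simulation of the V-structure X → Z ← Y (a Python sketch; the coefficients are arbitrary): marginally X and Y are independent, but conditioning on the collider Z opens the path and induces dependence:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000

# V-structure X -> Z <- Y: X and Y are marginally independent,
# but conditioning on the collider Z makes them dependent.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]
conditional = partial_corr(x, y, z)

assert abs(marginal) < 0.05   # I(X, Y | {}): the path is blocked by the collider
assert conditional < -0.5     # D(X, Y | {Z}): conditioning on Z opens the path
```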

Dependencies in PN
In a PN two nodes X and Y are conditionally dependent given a set S (denoted by D(X, Y |S)) if and only if there exists a path U* between X and Y satisfying: no node Z ∈ S is on U. If the above condition does not hold, we call X and Y conditionally independent given the set S (denoted by I(X, Y |S)). Marginal dependency between two nodes can be encoded by any path U. In Figure 5d four conditional (in)dependence statements are highlighted in a simple pairwise Markov network.

Learning with Hill Climbing and Glasso
Algorithm 1 Hill Climbing [28] Input: a data set D from X, an initial (usually empty) DAG G and a score function Score(G, D) as given in equation (11).
Output: the DAG G max that maximises Score(G, D).

1. Compute the score of G, S_G = Score(G, D), and set S_max = S_G and G_max = G.

2. Repeat as long as S_max increases:
(a) for every (or some; simple hill climbing) possible arc addition, deletion or reversal in G_max resulting in a DAG:
i. compute the score of the modified DAG G*, S_G* = Score(G*, D);
ii. if S_G* > S_max, set S_max = S_G* and G_max = G*.

By means of Algorithms 1 and 2 we outline Hill Climbing and Graphical lasso. For a more detailed description (and an explanation of the equalities in Glasso) we refer the reader to [39] and [14], respectively. The input of both algorithms consists of the dataset D (the sample correlation matrix S in Algorithm 2 is just (1/(n − 1)) DᵀD for standardized variables), consisting of n independent samples of the multivariate Gaussian variable X, and a score function to optimize.

Algorithm 2 Graphical Lasso
Input: the sample correlation matrix S generated from a data set D from X and the penalization coefficient λ.
Output: the estimated precision matrix Θ (in binary format the undirected PN graph) that maximises the Score(Θ, D) as given in equation (7).

1. Initialize W = S + λI.

2. Repeat until convergence, cycling over the columns j = 1, . . . , N:
(a) partition W and S, placing the j-th row and column last: W_11 denotes W without the j-th row and column, w_12 the j-th column of W without its diagonal element w_22 (and analogously S_11, s_12, s_22);
(b) solve the lasso problem β = argmin_β { (1/2) ||W_11^{1/2} β − W_11^{−1/2} s_12||² + λ ||β||_1 };
(c) fill in the corresponding row and column of W using w_12 = W_11 β.

3. Finally, using the notation of step 2(a) for Θ, for each j, first recover θ_22 from the equation 1/θ_22 = w_22 − w_12ᵀβ and then recover θ_12 from θ_12 = −βθ_22.

The output of HC is a DAG, whereas the output of Glasso is the estimated precision matrix Θ, which, in binary format, is the adjacency matrix of the associated undirected graph.
Hill Climbing simply visits all (or some; 'simple' Hill Climbing) neighbouring networks that differ in one edge from the current network (step 1) and then moves to the network with the highest score, or directly to the first network found with a better score in the case of simple HC (step 2). The algorithm stops when no neighbouring network has a higher score than the current network. This could be at a local optimum.
Glasso transforms the initial score function (equation (7)) into a lasso problem and applies a coordinate descent approach to solve it: the algorithm fixes all dependencies in the current estimate of the correlation matrix W except those of one variable (coordinate), i.e. except one column and row (step 2a). Then it estimates the dependencies of this variable that best solve the element-wise lasso regression problem (step 2b) and fills in the corresponding row and column in the updated correlation matrix W (step 2c). Next, it moves to the next coordinate and solves the same problem, this time with the former solution integrated in the fixed coordinates (integrated in W_11). This process (step 2) is repeated until convergence. Finally, in the last cycle, the row θ_12 and the diagonal element θ_22 of Θ are recovered from W and β (step 3).

Transformation of probabilistic BN model to probabilistic PN model
Moralization turns the graph of a directed Gaussian Bayesian network into the graph of an undirected Markov network. Moralization introduces an undirected edge between any two nodes with a common child and subsequently drops all edge directions. Thus, each set of parents and children (X_i | Π_{X_i}) forms a fully connected subset in the moral graph. The moral graph M(BN) of a BN is a minimal I-map; however, the mapping is not necessarily perfect: not all independencies in the BN are necessarily preserved in M(BN).
An undirected PGNM can be associated with the moral graph in more than one way. To associate M(BN) with the special case of a probabilistic PN model that encodes the JPD formulated in equation (4), i.e. a pairwise Markov network, the parameter set (β, ν) of the initial BN has to be transformed. The following equality between the precision matrix Θ and the parameters (β, ν) of a Gaussian Bayesian network holds [40]:

Θ = (I − B)ᵀ diag(ν)^{−1} (I − B),   (12)

where B is the N × N matrix with entries B_ij = β_ij (and B_ij = 0 whenever X_j ∉ Π_{X_i}). The new weights of the edges and parameters of the pairwise Markov network are the entries of the precision matrix:

θ_ii = 1/ν_i + Σ_k β²_ki/ν_k,   θ_ij = −β_ij/ν_i − β_ji/ν_j + Σ_k β_ki β_kj/ν_k  (i ≠ j).   (13)

The entry θ_ij is zero if there is no edge i, j in M(BN). (Occasionally, θ_ij can take the value zero as a result of the summation on the right-hand side of equation (13).)
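The transformation of equation (12) can be checked numerically. The Python sketch below (with illustrative parameters) builds Θ from (β, ν) for a BN with a common child and verifies both the inversion identity and the moralization edge between the two parents:

```python
import numpy as np

# BN over (X1, X2, X3) with arcs X1 -> X3 <- X2 (illustrative values):
# X3 = 0.5*X1 + 0.7*X2 + e3.  B[i, j] = beta_ij, the coefficient of parent j
# in the regression of node i.
B = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.5, 0.7, 0.0]])
nu = np.array([1.0, 1.0, 0.8])
I = np.eye(3)

# Precision matrix from the BN parameters, as in equation (12).
Theta = (I - B).T @ np.diag(1.0 / nu) @ (I - B)

# Consistency check: Theta is the inverse of the implied covariance matrix.
Sigma = np.linalg.inv(I - B) @ np.diag(nu) @ np.linalg.inv(I - B).T
assert np.allclose(Theta, np.linalg.inv(Sigma))

# Moralization: X1 and X2 are parents of the common child X3, so the moral
# graph has an edge X1 - X2 and theta_12 = beta_31*beta_32/nu_3 is nonzero.
assert np.isclose(Theta[0, 1], 0.5 * 0.7 / 0.8)
```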
In this work we moralize and extract the precision matrix of all BNs that were learned with the Hill Climbing algorithm. In practice we use the R-package bnlearn for the moralization and the topological analysis, and sparsebnUtils [41] for the extraction of the precision matrix.

Log-likelihood definition and calculation
The likelihood of the data D given a model M is the density of the data under the given model M: P(D | M). For discrete density functions the likelihood of the data equals the probability of the data under the model. The likelihood is almost always simplified by taking the natural logarithm: continuous likelihood values are typically small, and differentiation of the likelihood function (with the purpose of a maximum likelihood search) is often hard. Log-likelihood values can be used interchangeably with likelihood values for model comparison and maximum likelihood search, as the natural logarithm is a monotonically increasing function. In formulas:

log P(D | M) = Σ_{k=1}^{n} log P_PNM(D_k),   (14)

with P_PNM the probability density function as modelled by the corresponding PGNM with a multivariate Gaussian probability, and D_k the k-th data realization. In this work we considered two types of PGNMs, precision and Bayesian PGNMs, deduced from PN and BN graphs, respectively. In the case of a PGNM given by a PN we get:

log P(D | M) = Σ_{k=1}^{n} log N(D_k ; µ, Θ^{−1}).   (15)

Entries in the sum are evaluations of the multivariate normal density function, executed with the R-package mvtnorm [42].
In the case of a PGNM given by a BN, from equation (5), we have

log P(D | M) = Σ_{k=1}^{n} Σ_{i=1}^{N} log P_i(d_k^i | d_k^{Π_{X_i}}),   (16)

where d_k^{Π_{X_i}} is the subset of D_k containing the k-th data realization of the parent set Π_{X_i} of X_i. From equation (6) we know that the conditional univariate densities in the sum in equation (16) are univariate normal, and we evaluate them with the basic R-package stats.

Figure 1: Schematic illustration of the representation forms of the Gaussian probability density function P in terms of the conditional parameters (β, ν) and the precision matrix (Σ^{−1}), respectively associated with Bayesian Networks (BNs) and Precision Networks (PNs). Arrows refer to the associated learning algorithms (brown: the score-based Hill Climbing algorithm; red: Graphical lasso), together with the type of integrated statistical criterion (l_0 or l_1 penalization), and to the analytic transformation of the initially learned parameter set from BNs to PNs: PN(HC) (purple).
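The two routes to the log-likelihood, via the joint multivariate density (the PN form, equation (15)) and via the sum of univariate conditional log-densities (the BN form, equation (16)), can be checked against each other. A Python sketch with an illustrative chain BN:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(5)

# Chain BN X1 -> X2 -> X3 with illustrative parameters.
b21, b32 = 0.9, -0.6
nu = np.array([1.0, 0.7, 1.2])
B = np.array([[0, 0, 0], [b21, 0, 0], [0, b32, 0]])
I = np.eye(3)
Sigma = np.linalg.inv(I - B) @ np.diag(nu) @ np.linalg.inv(I - B).T

D = rng.multivariate_normal(np.zeros(3), Sigma, size=50)

# Log-likelihood via the joint multivariate density (the PN route).
ll_joint = multivariate_normal(np.zeros(3), Sigma).logpdf(D).sum()

# Log-likelihood as a sum of univariate conditional log-densities (the BN route).
ll_cond = (norm(0, np.sqrt(nu[0])).logpdf(D[:, 0])
           + norm(b21 * D[:, 0], np.sqrt(nu[1])).logpdf(D[:, 1])
           + norm(b32 * D[:, 1], np.sqrt(nu[2])).logpdf(D[:, 2])).sum()

# Both routes evaluate the same joint density.
assert np.isclose(ll_joint, ll_cond)
```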