A unified data representation theory for network visualization, ordering and coarse-graining

Representation of large data sets has become a key question of many scientific disciplines in the last decade. Several approaches for network visualization, data ordering and coarse-graining have accomplished this goal. However, there has been no underlying theoretical framework linking these problems. Here we show an elegant, information theoretic data representation approach as a unified solution of network visualization, data ordering and coarse-graining. The optimal representation is the one hardest to distinguish from the original data matrix, as measured by the relative entropy. The representation of network nodes as probability distributions provides an efficient visualization method and, in one dimension, an ordering of network nodes and edges. Coarse-grained representations of the input network enable both efficient data compression and hierarchical visualization, achieving high quality representations of larger data sets. Our unified data representation theory will help the analysis of extensive data sets, by revealing the large-scale structure of complex networks in a comprehensible form.


General network representation theory
We consider the case in which the input network is given by the symmetric, node-node co-occurrence A (adjacency) matrix having probabilistic entries a_ij ≥ 0. If we start instead with the H edge-node co-occurrence (incidence) matrix, capable of describing hypergraphs as well, then A ∼ HᵀH is simply given by the elements

a_ij = (1/h_**) Σ_k h_ki h_kj .

Here and in the following, asterisks indicate indices over which we have summed. Throughout the paper we use the most general form of the input matrices, without assuming their normalization. Similarly, there is no need to normalize the information theoretic measures over A, such as the S information content or the I mutual information given in the Methods section. The network is represented by a B co-occurrence matrix, and a natural way to quantify the quality of the representation is the relative entropy (D). The relative entropy measures the extra description length when B is used to encode the data described by the original matrix A:

D(A||B) = Σ_ij a_ij ln[ (a_ij / a_**) / (b_ij / b_**) ] .

Although D(A||B) is not a metric and not symmetric in A and B, it is an appropriate and widely applied measure of statistical remoteness 31, quantifying the distinguishability of B from A. Thus, the highest quality representation is achieved when the relative entropy approaches 0, and our general goal is to obtain the optimal representation

B* = argmin_B D(A||B) .

Since D(A||B) = H(A, B) − S(A), where H(A, B) is the (unnormalized) cross-entropy, we could equivalently minimize the cross-entropy for B.
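The unnormalized relative entropy above can be evaluated in a few lines. A minimal sketch in Python (the function name is our own; we assume natural logarithms and that B is positive wherever A is):

```python
import numpy as np

def rel_entropy(A, B):
    """Unnormalized relative entropy D(A||B) of two nonnegative
    co-occurrence matrices; B must be positive wherever A is."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m = A > 0
    # D(A||B) = sum_ij a_ij ln[(a_ij / a_**) / (b_ij / b_**)]
    return float(np.sum(A[m] * (np.log(A[m] / A.sum())
                                - np.log(B[m] / B.sum()))))

A = np.array([[0.0, 2.0],
              [2.0, 1.0]])
d_self = rel_entropy(A, A)        # a perfect representation: D = 0
d_scaled = rel_entropy(A, 3 * A)  # D is invariant under rescaling of B
```

Note that only the entries on the support of A enter the sum directly; entries of B outside the support contribute only through the b_** normalization, as discussed below.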
Although the minimization of D(A||B) appears in the minimal discrimination information, also known as the minimum cross-entropy (MinxEnt) approach of Kullback 32, there the goal is the opposite of ours, namely to find the optimal 'real' distribution, A, while the 'approximate' distribution, B, is kept fixed. In this sense, our optimization is an inverse MinxEnt problem 33. This kind of optimization also appears as a refinement step to improve importance sampling in Monte Carlo methods (for highly restricted A-s), under the name of the cross-entropy method 34. In order to avoid confusion and to emphasize the differences, in the following we only use the term relative entropy.
Although D(A||B) ≥ 0 can be arbitrarily large, a trivial representation is always available: the uncorrelated product state B_0, given by the elements

b⁰_ij = a_i* a_*j / a_** .

This way D_0 ≡ D(A||B_0) = I ≤ S is the I mutual information, thus the optimized value of D can always be normalized by I, or alternatively as η ≡ D/S ≤ 1. Here, η gives the ratio of the extra description length needed to the optimal description length of the system. The optimization of the relative entropy is local in the sense that the global optimum of a network comprising independent subnetworks is also locally optimal for each subnetwork.
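The product state B_0 and the identity D_0 = I can be checked numerically. A minimal sketch under the same conventions as above (function names are our own):

```python
import numpy as np

def entropy(A):
    """Unnormalized Shannon entropy S = -sum_ij a_ij ln(a_ij / a_**)."""
    A = np.asarray(A, dtype=float)
    m = A > 0
    return float(-np.sum(A[m] * np.log(A[m] / A.sum())))

def product_state(A):
    """Trivial representation b0_ij = a_i* a_*j / a_**."""
    A = np.asarray(A, dtype=float)
    return np.outer(A.sum(axis=1), A.sum(axis=0)) / A.sum()

def rel_entropy(A, B):
    """Unnormalized relative entropy D(A||B)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    m = A > 0
    return float(np.sum(A[m] * (np.log(A[m] / A.sum())
                                - np.log(B[m] / B.sum()))))

A = np.array([[0.0, 2.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
D0 = rel_entropy(A, product_state(A))  # equals the mutual information I
eta = D0 / entropy(A)                  # normalized quality, eta <= 1
```

Since b⁰_** = a_**, the ratio b⁰_ij/b⁰_** reduces to a_i* a_*j / a_**², and D_0 reproduces the mutual information term by term.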
The finiteness of D_0 ensures that if i and j are connected in the original network (a_ij > 0), then they are guaranteed to be connected in a meaningful representation as well, enforcing b_ij > 0, since otherwise D would diverge. In the opposite case, when we have a connection in the representation without a corresponding edge in the original graph (b_ij > 0 while a_ij = 0), b_ij does not appear directly in D, only globally, through the b_** normalization. Nevertheless, the B matrix of the optimal representation (where D is small) is close to A, since the total variation of the normalized distributions is bounded, via Pinsker's inequality, as

(1/2) Σ_ij | a_ij/a_** − b_ij/b_** | ≤ √( D(A||B) / (2 a_**) ) .

Thus, in the optimal representation of the network all connected network elements are connected, while only a strongly suppressed amount of false positive connections is present.

Network visualization and data ordering
Since force-directed layout schemes 21 have an energy or quality function, optimized by efficient techniques borrowed from many-body physics 36 and computer science 37, graph layout could in principle serve as a quantitative tool. However, these approaches inherently struggle with an information shortage problem, since the edge weights only provide half the data needed to initialize these techniques. For instance, for the initialization of the Fruchterman-Reingold 38 (Kamada-Kawai 39) method we need to set both the strength of an attractive force (optimal distance) and of a repulsive force (spring constant) between the nodes in order to have a balanced system. Due to the lack of sufficient information, such graph layout techniques become somewhat ill-defined and additional subjective considerations are needed to double the information encoded in the input data, traditionally by a nonlinear transformation of the attractive force parameters onto the parameters of the repulsive force.
While in usual graph layout schemes the graph nodes are represented by points (without spatial extension) in a d-dimensional background space, connected by (straight, curved or more elaborate) lines, in our approach the nodes are extended objects, namely probability distributions (ρ) over the background space. The role of edges is played by the overlaps of the node distributions. Importantly, in our representation the shape of the nodes encodes just that additional set of information which has been lost and then arbitrarily re-introduced in the above mentioned network visualization methods. In the following we consider the simple case of Gaussian distributions, having a width of σ and norm h = ∫ρ (see equation (6) of the Methods section), but we have also tested the non-differentiable case of a homogeneous distribution in a spherical region of radius σ. For a given graphical representation the B co-occurrence matrix is built up from the overlaps of the distributions ρ_i and ρ_j, analogously to the construction of A from H, as

b_ij = ∫ ρ_i(x) ρ_j(x) dx .

The trivial data representation of B_0 can be obtained by an initialization where all the nodes are at the same position, with the same distribution function (apart from a varying h_i ∝ a_i* normalization to ensure the proper statistical weight of the rows). This way, initially D_0 = I is the mutual information of the input matrix, irrespective of the chosen distribution function. Here we note that the b_ii diagonal self-overlaps cannot change during the repositioning of the distributions in the layout, only by changing the shape parameters of the distributions, namely the σ_i widths or h_i normalizations. The numerical optimization can in general be carried out straightforwardly by a usual simulated annealing scheme starting from the initialization B_0. Alternatively, in the differentiable case we can also use a Newton-Raphson iteration as in the Kamada-Kawai method 39 (for details see the Methods section). In terms of the layout, the finiteness of D_0 ensures that connected nodes overlap in the layout as well, even for distributions having a finite support. Moreover, independent parts of the network (nodes or sets of nodes without connections between them) tend to lie apart from each other in the layout. Additionally, if two rows (and columns) of the input matrix are proportional to each other, then it is optimal to represent them with the same distribution function in the layout, as though the two rows were fused together.
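For isotropic Gaussians the overlap integral has a standard closed form, which makes B cheap to evaluate. A minimal sketch (the function name and array layout are our own; we assume equal-axis, d-dimensional Gaussians with positions x_i, widths σ_i and norms h_i):

```python
import numpy as np

def gaussian_overlap_matrix(x, sigma, h):
    """Overlap matrix b_ij = integral rho_i(t) rho_j(t) dt for isotropic
    d-dimensional Gaussians:
    b_ij = h_i h_j (2 pi (s_i^2 + s_j^2))^(-d/2)
           * exp(-|x_i - x_j|^2 / (2 (s_i^2 + s_j^2)))."""
    x = np.asarray(x, dtype=float)            # shape (N, d): positions
    sigma = np.asarray(sigma, dtype=float)    # shape (N,): widths
    h = np.asarray(h, dtype=float)            # shape (N,): norms
    d = x.shape[1]
    s2 = sigma[:, None] ** 2 + sigma[None, :] ** 2
    dist2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return (h[:, None] * h[None, :]
            * (2 * np.pi * s2) ** (-d / 2)
            * np.exp(-dist2 / (2 * s2)))
```

The diagonal entries b_ii depend only on σ_i and h_i, illustrating the remark above that self-overlaps change only through the shape parameters, not through repositioning.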
When using Gaussian distributions, our visualization method can be conveniently interpreted as a force-directed method. If the normalized overlap, b_ij/b_**, is smaller at a given edge than the normalized edge weight, a_ij/a_**, then it leads to an attractive force, while the opposite case induces a repulsive force. For Gaussian distributions all the nodes overlap in the representations, typically leading to D > 0 in the optimal representation. However, for distributions with a finite support, such as the above mentioned homogeneous spheres, perfect layouts with D = 0 can be achieved even for sparse graphs. In d = 2 this concept is reminiscent of the celebrated concept of planarity 40. However, our concept can be applied in any dimension. Furthermore, any network with I = 0 (e.g. a fully connected graph) is perfectly represented in any dimension by B_0, that is, by simply putting all the nodes at the same position. Here we note that the concept of cross-entropy has already appeared in the field of graph drawing, such as in the methods of Refs. 41,42. Among others, the most important difference between these methods and ours is that in our case the relative entropy (or cross-entropy) is calculated over N × N-point distributions for N nodes, while in the related papers 41,42 only 2-point distributions of the form {a_ij, 1 − a_ij} are considered.
Our method is illustrated in Fig. 1 on the Zachary karate club network 43, which has become a cornerstone of graph algorithm testing. It is a weighted social network of friendships between N_0 = 34 members of a karate club at a US university, which after a while fell apart into two communities, indicated by different colors in Fig. 1. Our network layout technique works in any dimension, as illustrated in d = 1, 2 and 3. In each case the communities are clearly recovered and, as expected, the quality of the layout improves as the dimensionality of the embedding space increases.
Nevertheless, the one dimensional case deserves special attention, since it serves as an ordering of the elements as well (after resolving possible degeneracies with small perturbations), as illustrated in Fig. 1e. Although our network layout works only for symmetric co-occurrence matrices, the ordering can be extended to hypergraphs with asymmetric H matrices as well, since the orderings of the two adjacency matrices HHᵀ and HᵀH readily yield an ordering for the rows and columns of the matrix H.
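The reduction from a hypergraph to two orderings is pure plumbing around any one-dimensional layout routine. In the sketch below, layout_1d stands for an arbitrary d = 1 optimizer returning one coordinate per node; toy_layout_1d is a purely illustrative stand-in for testing, not the relative-entropy layout of this paper:

```python
import numpy as np

def hypergraph_orderings(H, layout_1d):
    """Given an (asymmetric) incidence matrix H and any one-dimensional
    layout routine returning a coordinate per node, derive orderings for
    the rows (via H H^T) and the columns (via H^T H) of H."""
    H = np.asarray(H, dtype=float)
    row_order = np.argsort(layout_1d(H @ H.T))
    col_order = np.argsort(layout_1d(H.T @ H))
    return row_order, col_order

def toy_layout_1d(A):
    """Illustrative stand-in only: place node i at the normalized
    cumulative row weight up to i (NOT the relative-entropy layout)."""
    w = A.sum(axis=1)
    return np.cumsum(w) / w.sum()

H = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
rows, cols = hypergraph_orderings(H, toy_layout_1d)
```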
When applying a local scheme (e.g. simulated annealing) for the optimization of the representations, we generally run into computationally hard situations. These correspond to local minima, in which the layout cannot be improved by single node updates, since whole parts of the network should be updated (rescaled, rotated or moved over each other) instead. Being a general difficulty in many optimization problems, this could be expected to be insurmountable in our approach as well. In the following we show that the relative entropy based coarse-graining scheme, given in the next section, can in practice efficiently help us through these difficulties in polynomial time.

Coarse-graining of networks
Since finding the optimal simplified, coarse-grained description of a network at a given scale is generally expected to be an NP-hard problem, we have to rely on approximate heuristics with a reasonable run-time. In the following we use a local coarse-graining approach, where in each step a pair of rows (and columns) is replaced by coarse-grained rows, giving the best approximative new network G′ in terms of D(G||G′), where G is the H or A matrix of the initial network. Although the applied formulas and coarse-graining steps are different, the idea of pairwise, information theoretic coarse-graining also appeared recently in the method of Ref. 28 to detect the presence of mesoscale structures in complex networks. In our method, for G = H the coarse-graining step means that instead of the original rows k and l, we use two new rows, proportional to each other, while the h′_k* = h_k* and h′_l* = h_l* probabilities are kept fixed:

h′_ki = h_k* (h_ki + h_li) / (h_k* + h_l*),    h′_li = h_l* (h_ki + h_li) / (h_k* + h_l*).

This way the rows involved are replaced by their normalized average. Figuratively, 'bonds' are thus formed between the nodes in each step with a given D(G||G′) 'scale'. As in the case of the graph layout, if two rows (or columns) are proportional to each other, they can be fused together, since their coarse-graining leads to D = 0. Consequently, we can alternatively think of the coarse-graining step as a fusion or grouping of the rows involved. For an illustration of the fused data matrices see the lower panels of Fig. 2a-d. We note that proportional rows (or columns) can generally be fused together also initially, in the input data, as a prefiltering step, before starting to find an optimal representation. Since in a coarse-graining step the matrix is modified, for the remaining steps the D difference should be updated if at least one member of the pair is a neighbour of the fused elements. For n elements this takes O(n³) time until the whole network is fused together. When G = A, the coarse-graining step is carried out simultaneously and identically for the rows and columns. The optimal coarse-graining is illustrated in Fig. 1f for the Zachary karate club network. The heights in the dendrogram indicate the D values of the representations at which the fusion steps happen. Although it seems somewhat tedious in a later step to always measure D from the original input network for a given (k, l) pair, there is a simple rule to calculate it from the actually existing coarse-grained data alone. If D_k and D_l are the values at which the rows (and columns) k and l were formed via fusion (being zero initially), then from the apparent D′ value, measuring the formation of a bond directly from the coarse-grained rows k and l, we get

D = D′ + D_k + D_l .

Since the normalization of the rows and columns is preserved during coarse-graining, finally we arrive at D = I = D_0. This nicely coincides with the above proposed initialization step of the network layout approach.
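A single coarse-graining step for G = H, together with its D(G||G′) cost, can be sketched as follows (the function name is ours; because the row sums and the total normalization are preserved, only the two changed rows contribute to D):

```python
import numpy as np

def fuse_rows(H, k, l):
    """One coarse-graining step for G = H: replace rows k and l by their
    normalized average (keeping the row sums h_k* and h_l* fixed) and
    return the new matrix together with the cost D(G||G')."""
    H = np.asarray(H, dtype=float)
    hk, hl = H[k].sum(), H[l].sum()
    shape = (H[k] + H[l]) / (hk + hl)   # common shape of the fused rows
    Hp = H.copy()
    Hp[k] = hk * shape                  # h'_k* = h_k*
    Hp[l] = hl * shape                  # h'_l* = h_l*
    cost = 0.0
    for r in (k, l):
        m = H[r] > 0
        cost += float(np.sum(H[r][m] * np.log(H[r][m] / Hp[r][m])))
    return Hp, cost
```

As stated in the text, fusing two proportional rows costs exactly D = 0, so such rows can be merged already as a prefiltering step.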

Hierarchical layout
Although the introduced coarse-graining scheme may be of significant interest whenever probabilistic matrices appear, here we are primarily interested in its application to network layout, to obtain a hierarchical visualization 44-49. Our bottom-up coarse-graining results can be readily incorporated into the network layout scheme in a top-down way by initially (at high temperature) starting with one node (comprising the whole system), and successively undoing the fusion steps (cutting bonds) until the original system is recovered. Between each such extension step the layout can be optimized as before.
We have found that this hierarchical layout scheme produces significantly improved layouts compared to a local optimization, such as simple simulated annealing. By incorporating the coarse-graining in a top-down approach, we first arrange the position of the large-scale parts of the network, and only refine the picture in later steps. The refinement steps happen when the position and extension of the large-scale parts have already been sufficiently optimized. After such a refinement step, the nodes, which had so far moved together, are treated separately. At a given scale (having N ≤ N_0 nodes), the D value of the coarse-graining provides a lower bound for the D value of the obtainable layout. Our hierarchical visualization approach is illustrated in Fig. 2 with snapshots of the layout and the coarse-grained representation matrix of the Zachary karate club network 43 at N = 5, 15, 25 and 34. As an illustration on a larger and more challenging network, in Fig. 3 we show the result of the hierarchical visualization on the giant component of the weighted human diseasome network 50. In this network we have N_0 = 516 nodes, representing diseases, connected by mutually associated genes. The colors indicate the known disease groups. In the numerical optimization for this network we primarily focused on the positioning of the nodes, thus the optimization of the widths and normalizations was only turned on as fine-tuning after an initial layout had been obtained.
Since the coarse-graining serves here as an efficient approximation scheme to speed up the optimization process, it is pointless to spend huge computational effort on carrying out the coarse-graining steps in a large network precisely. Leaving out (some of) the recalculation steps leads to a faster and still reasonable approximation. As a quadratic, O(N²), approach we can fuse together all the nodes with their best match, according to the initially obtained D values. This procedure uses O(log N) global fusion steps. An even faster approach is to leave out the recalculation steps completely. In this simplified scheme, the coarse-graining happens on the minimal spanning tree of the original matrix of pairwise calculated D values. While for smaller networks (N ∼ 10³) the full approach is feasible, for larger networks we suggest using the faster, approximative methods.

Discussion
In this paper we have shown that a straightforward information theoretic idea provides an elegant, unified solution to such long-standing problems as matrix ordering, network visualization and data coarse-graining. First, we demonstrated that the minimization of the relative entropy yields a novel visualization technique, in which the A input matrix is represented by the B co-occurrence matrix of extended distributions embedded in a d-dimensional space. As another application of the same approach we obtained a hierarchical coarse-graining scheme, where the input matrix is represented by its subsequently coarse-grained versions. Since these applications are two sides of the same representation theory, they turned out to be superbly compatible, leading to an even more powerful hierarchical visualization technique. Although we have focused on visualization in d-dimensional continuous space, the representation theory can be applied more generally, incorporating also the case of discrete embedding spaces. A possible future application is the optimal embedding of a (sub)graph into another graph. Our relative entropy based visualization with e.g. Gaussians can be naturally interpreted as a force-directed method. Traditional force-directed methods have prompted huge efforts on the computational side to achieve scalable algorithms applicable to the large data sets of real life. Here we naturally cannot and do not wish to compete with such advanced techniques, but believe that our approach can be a good starting point for further scalable implementations. As shown, network visualization is already interesting in one dimension, yielding an ordering of the elements of the network. Our approach is formulated directly for the adjacency or incidence matrix of weighted networks, incorporating also the cases of bipartite graphs and hypergraphs. Since in this paper our primary intention was merely to demonstrate our theoretical framework, the more involved analysis of further interesting networks will be the subject of forthcoming articles. With our network representation theory we hope to get closer to discovering knowledge from the huge data matrices in science.

Methods
The (unnormalized) Shannon entropy of A, expressing the amount of information in the system, is given by

S = − Σ_ij a_ij ln( a_ij / a_** ),

while the I mutual information between the rows and columns of A is given by

I = Σ_ij a_ij ln( a_ij a_** / (a_i* a_*j) ).

The parametrization of the Gaussian distributions used in the layout is the following in d dimensions:

ρ_i(x) = h_i (2π σ_i²)^(−d/2) exp( −|x − x_i|² / (2σ_i²) ),     (6)

where x_i is the position, σ_i the width and h_i = ∫ρ_i the norm of node i. Here we note that in practice, special care may be needed for the a_ii diagonal elements of A, describing the probability of the co-occurrence of an element with itself. If the nodes represent individual entities, rather than some properties or groups, then such self co-occurrences are impossible, leading to a_ii ≡ 0, which should be included in the representation scheme as well, by requiring b_ii ≡ 0. While the solution of this case is rather straightforward, for the sake of simplicity we omitted its detailed description here.
For the numerical optimization of the network layout, we have implemented a simple, general purpose simulated annealing scheme, and for Gaussian distributions a much faster Newton-Raphson update, applied also in the Kamada-Kawai method 39 and in Refs. 41,42. In practice, we used a separate Newton-Raphson step for the d coordinates and for the σ_i widths and h_i normalizations. In each step, the node with the largest gradient amplitude (||J||) is updated in the direction and with a parameter step size obtained from the second derivative matrix F as −F⁻¹J. Since F is not always positive definite, special care is needed when the relative entropy increases in such a step. In such a case, a sufficiently small step size is applied in the direction of the gradient vector instead. This way our technique has the same computational complexity as the widely applied Kamada-Kawai method 39.
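To illustrate the fallback described above (a small step along the gradient whenever a full step would increase the relative entropy), the following simplified sketch optimizes node positions only, with equal fixed widths and norms, using numerical gradients and backtracking instead of the Newton-Raphson update; all names are ours and this is not the paper's production scheme:

```python
import numpy as np

def optimize_layout(A, d=2, sigma=1.0, steps=120, seed=0):
    """Greedy descent on D(A||B) over node positions only, with equal
    fixed Gaussian widths and norms."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    x = 0.01 * rng.standard_normal((n, d))   # start near a single point

    def D(x):
        d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        # equal-width Gaussian overlaps up to a constant factor,
        # which cancels in the normalized ratio b_ij / b_**
        B = np.exp(-d2 / (4.0 * sigma ** 2))
        m = A > 0
        return float(np.sum(A[m] * (np.log(A[m] / A.sum())
                                    - np.log(B[m] / B.sum()))))

    def grad(x, eps=1e-5):
        g = np.zeros_like(x)
        for i in range(n):
            for a in range(d):
                x[i, a] += eps; up = D(x)
                x[i, a] -= 2 * eps; dn = D(x)
                x[i, a] += eps
                g[i, a] = (up - dn) / (2 * eps)
        return g

    d_start = d_cur = D(x)
    for _ in range(steps):
        g = grad(x)
        lr = 0.1
        while lr > 1e-9:                     # shrink the step if D grows
            trial = x - lr * g
            d_try = D(trial)
            if d_try < d_cur:
                x, d_cur = trial, d_try
                break
            lr /= 2.0
    return x, d_start, d_cur
```

The backtracking guarantees that D never increases, mirroring the special care taken when F is not positive definite.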

Figures
FIG. 1: Illustration of the power of our unified representation theory on the Zachary karate club network 43. The optimal layout (η = 2.1%) in terms of d = 2 dimensional Gaussians is shown by a density plot in a and by circles of radii σ_i in b. In c the best layout, obtained in d = 3 (η = 1.7%), is shown with sphere radii chosen proportional to σ_i. In d the original data matrix of the network is shown with an arbitrary ordering. In e the d = 1 layout (η = 4.5%) yields an optimal ordering of the original data matrix of the network. In f the optimal coarse-graining of the data matrix yields a tool to zoom out from the network in accordance with the underlying community structure. We stress that the coarse-graining itself does not yield a unique ordering of the nodes, therefore an arbitrarily chosen compatible ordering is shown in this panel.

FIG. 2: Hierarchical visualization of the Zachary karate club network 43. In our hierarchical visualization technique the coarse-graining procedure guides the optimization of the layout in a top-down way. As the N number of nodes increases, the relative entropy of both the coarse-grained description (red, •) and the layout (blue, •) decreases. The panels a-d show snapshots of the optimal layout and the corresponding coarse-grained input matrix at the level of N = 5, 15, 25 and 34 nodes, respectively. For simplicity, here the h_i normalization of each distribution is kept fixed to be ∝ a_i* during the process, leading finally to η = 4.4%.

FIG. 3: Visualization of the human diseasome. The best layout (η = 3.1%) obtained by our hierarchical visualization technique for the human diseasome is shown by circles of radii σ_i in a and by a traditional graph in b. The nodes represent diseases, colored according to known disease categories 50.