Assessing diversity in multiplex networks

Diversity, understood as the variety of different elements or configurations that an extensive system has, is a crucial property that allows the system to maintain its functionality in a changing environment, where failures, random events or malicious attacks are often unavoidable. Despite the relevance of preserving diversity in the context of ecology, biology, transport, finance, etc., the elements or configurations that contribute most to the diversity are often unknown, and thus they cannot be protected against failures or environmental crises. This is because there is no generic framework that allows identifying which elements or configurations have crucial roles in preserving the diversity of the system. Existing methods treat the level of heterogeneity of a system as a measure of its diversity, which makes them unsuitable when systems are composed of a large number of elements with different attributes and types of interactions. Besides, with limited resources, one needs to find the best preservation policy, i.e., one needs to solve an optimization problem. Here we aim to bridge this gap by developing a metric between labeled graphs to compute the diversity of the system, which allows identifying the most relevant components based on their contribution to a global diversity value. The proposed framework is suitable for large multiplex structures, which consist of a set of elements represented as nodes, with different types of interactions represented as layers. The proposed method allows us to find, in a genetic network (HIV-1), the elements with the highest diversity values, while in a European airline network we systematically identify the companies that maximize (and those that least compromise) the variety of options for routes connecting different airports.

NOTE S1. PROOF THAT D IS A METRIC BETWEEN NETWORKS

Equation (1) from the main text shows that D(p, q) = 0 if, and only if, both networks possess the same transition matrix and, consequently, the same adjacency matrix. Because the Jensen-Shannon divergence is the square of a metric between probability distributions, D is a metric between layers and, in fact, a metric between labelled graphs.

Figure S1 and Table S1 present a small example of how the metric works. The two networks are very similar: they have the same number of nodes, and all nodes have the same degree. As can be seen in Table S1, nodes 1 and 4 have zero dissimilarity. Node 1 has the same adjacency matrix in both networks and is connected to different nodes at distance 2; hence node 1 has the same node distance distribution in both networks. The same holds for node 4, which is connected to nodes 1 and 2 in both layers. In this small example it is easy to see that the distance between networks is zero if all nodes share, with their counterparts in the other layers, the same adjacency matrix.

Table S1.

Figure S1.
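The building block of this argument can be checked numerically. The sketch below does not implement D itself (Equation (1) is in the main text and is not reproduced here); it only computes the square root of the Jensen-Shannon divergence between two discrete distributions, which is the quantity known to be a metric. The function names and the example vectors, which stand in for rows of a transition matrix, are assumptions made for illustration.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logarithm) of two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; entries with a == 0 contribute nothing.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (a metric)."""
    # max(..., 0.0) guards against tiny negative values from rounding.
    return float(np.sqrt(max(jensen_shannon_divergence(p, q), 0.0)))

# Hypothetical transition-probability rows of the same node in three layers.
row_a = [0.0, 0.5, 0.5, 0.0]
row_b = [0.0, 0.5, 0.0, 0.5]
row_c = [0.5, 0.0, 0.0, 0.5]

# Identity: identical rows are at distance zero.
assert js_distance(row_a, row_a) == 0.0
# Symmetry.
assert js_distance(row_a, row_b) == js_distance(row_b, row_a)
# Triangle inequality (small tolerance for floating-point rounding).
assert js_distance(row_a, row_c) <= (
    js_distance(row_a, row_b) + js_distance(row_b, row_c) + 1e-12
)
print(js_distance(row_a, row_b), js_distance(row_a, row_c))
```

With the base-2 logarithm the distance is bounded by 1, and that bound is reached when the two distributions have disjoint supports, as is the case for `row_a` and `row_c` here.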
Here we discuss and compare existing measures and methods that are used either to compute dissimilarities between labeled nodes or to quantify heterogeneity in multiplex structures. Table S2 presents, to the best of our knowledge, the most commonly used methods.

| Measure | Description | References |
|---|---|---|
| Graph Edit Distance (GED) | Counts only the number of uncommon edges between two networks, without considering topological differences between them. | [1][2][3] |
| Quantum Jensen-Shannon divergence (QJSD) | It has not been proved to be a metric between networks. It is computed as the square root of the Jensen-Shannon divergence between the eigenvalues of the normalized Laplacian matrix. Its main drawbacks are the lack of local information and the number of isospectral networks with different topological features. | [4][5][6] |
| Node and layer activity vector | The node-activity value is a binary operator returning 1 if the node possesses at least one first neighbor. The layer-activity vector contains the node-activity values of all nodes in the layer. To quantify the relative overlap between two layers at the level of node activity, the Hamming distance between the two corresponding layer-activity vectors was proposed in [7]. Since it returns zero whenever the networks share the same set of active nodes, it is a pseudometric between networks. Therefore, pairs of connected networks are indiscernible using this measure. | [7,8] |
| Interlayer mutual information | Computes how correlated the degree distributions of a pair of layers are. Its main drawback is the lack of information when networks with the same degree distribution but different topological structure are compared. For instance, a pair of networks can possess a high interlayer mutual information value without having any links in common. | [9] |
| Average edge overlap | Global measure of the multiplex system that computes the expected number of layers on which an edge is present. Its main drawback is the lack of information concerning local and global features of the system. | [9] |

To highlight the fact that our measure looks beyond the degree distribution, we compare a Barabási-Albert (BA) scale-free network (m = 2) and two networks generated by the dk model [10], one with k = 1, preserving its degree sequence, and one with k = 2.5, preserving the degree sequence, degree correlation, clustering coefficient and clustering spectrum. We compute the node dissimilarity D_i corresponding to the node with the highest degree in the BA network, and the layer dissimilarities, as shown in Figure S2. Although each corresponding node in these networks has the same degree, D recognizes that the nodes are connected in a different way and gives different dissimilarity values. Measures based on the node degree, or on node activity, are not able to acknowledge this fact.

In a previous work, our group proposed a pseudo-metric between graphs, a measure that is not designed to consider the identity of the nodes and to whom they are connected [11]. Consequently, this previously proposed measure cannot be applied to structures in which the position of specific nodes and their relationship with all others in the network is relevant. Labels are relevant, for example, in climate networks, where each node is connected to the others depending on the variable considered, or in social networks, in which the same group of individuals is connected through different social ties.
To illustrate the limitation of the measure presented in [11] when applied to multiplex networks, we present Figure S3, in which nodes 1, 2 and 3 are connected in different ways by two links. As the distance proposed in [11] does not consider the identity of the nodes, networks A and B are seen as identical (D = 0). The measure developed in this work considers the identity of the nodes and captures the topological differences between networks A and B (D = 0.3109697).
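The role of node labels can be made concrete with a small sketch. Figure S3 is not reproduced here, so the two edge lists below are an assumption chosen only to match its description (three nodes joined by two links in two different ways). The code does not compute D or the pseudo-metric of [11], and it does not reproduce the value 0.3109697. It only contrasts label-blind summaries (degree counts, layer-activity vectors) with label-aware ones (uncommon edges, per-node neighbor sets). All function names are illustrative.

```python
from collections import Counter

# Hypothetical layers: assumed edge lists, not taken from Figure S3.
NODES = [1, 2, 3]
LAYER_A = {(1, 2), (2, 3)}  # node 2 holds both links
LAYER_B = {(1, 2), (1, 3)}  # node 1 holds both links

def neighbors(edges, nodes):
    """Map each labelled node to its set of first neighbors."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def degree_counts(edges, nodes):
    """Label-blind summary: how many nodes have each degree."""
    return Counter(len(s) for s in neighbors(edges, nodes).values())

def activity_vector(edges, nodes):
    """Layer-activity vector: 1 if the node has at least one first neighbor."""
    nbrs = neighbors(edges, nodes)
    return [1 if nbrs[v] else 0 for v in nodes]

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def uncommon_edges(edges_a, edges_b):
    """Number of undirected edges present in exactly one of the two layers."""
    as_sets_a = {frozenset(e) for e in edges_a}
    as_sets_b = {frozenset(e) for e in edges_b}
    return len(as_sets_a ^ as_sets_b)

def nodes_with_different_neighbors(edges_a, edges_b, nodes):
    """Label-aware check: nodes whose neighbor sets differ between layers."""
    na, nb = neighbors(edges_a, nodes), neighbors(edges_b, nodes)
    return [v for v in nodes if na[v] != nb[v]]

# Label-blind views cannot tell the layers apart.
same_degrees = degree_counts(LAYER_A, NODES) == degree_counts(LAYER_B, NODES)
activity_gap = hamming(activity_vector(LAYER_A, NODES),
                       activity_vector(LAYER_B, NODES))

# Label-aware views can.
edge_gap = uncommon_edges(LAYER_A, LAYER_B)
changed = nodes_with_different_neighbors(LAYER_A, LAYER_B, NODES)

print(same_degrees, activity_gap, edge_gap, changed)
```

In this toy case the degree counts coincide and the Hamming distance between activity vectors is zero, while two edges are uncommon and every node has a different neighbor set. This is the kind of difference that a measure respecting node identity is meant to register.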

Layer difference values
Matrix S2 presents the difference values between layers. For the nodes with the two highest and the two lowest values of node diversity (nodes 1, 11, 58 and 60), we present their difference matrices and local diversity values.
Difference matrix of node 1, Matrix S3. Its local diversity value is U_1 = 0.9844. The complete dataset is available as Supporting Information material in an Excel file at [12]. In the same file, the reader will find the diversity value for each one of the 1114 genes.
tat is an essential regulatory element. It is an HIV trans-activator and plays an important role in regulating the transcription of the viral genome [13][14][15][16][17].
nef and vif are considered to belong to the class of accessory regulatory proteins. nef is involved in multiple functions during the replication cycle of the virus and plays an important role in increasing virus infectivity. vif is important for the infectivity of HIV-1 virions, depending on the cell type [13][14][15][16][17].
The env and gag genes belong to the class of viral structural proteins. gag codes for the precursor gag-polyprotein, which is processed by viral protease during maturation of the protein matrix, and env is responsible for a mechanism that embeds in the viral envelope to enable the virus to attach to and fuse with target cells [13][14][15][16][17].

Pierre Auger Collaboration: the network consists of layers corresponding to different working tasks within the Pierre Auger Collaboration. All submissions between 2010 and 2012 were considered, and each report was assigned to L = 16 layers according to its keywords and its content: Neutrinos, Detector, Enhancements, Anisotropy, Pointsource, Mass-composition, Horizontal, Hybrid-reconstruction, Spectrum, Photons, Atmospheric, SD-reconstruction, Hadronic-interactions, Exotics, Magnetic and Astrophysical-scenarios. Readers should refer to [18] for details. The multiplex is weighted (see Table S4).

Homo sapiens genetic interaction: the network concerns Homo sapiens genetic interactions. There are 18222 nodes and 7 layers: Direct interaction, Physical association, Suppressive genetic interaction defined by inequality, Association, Colocalization, Additive genetic interaction defined by inequality and Synthetic genetic interaction defined by inequality. See [19,20] for a better description of the data and Table S5 for the results.

Hepatitusc multiplex GPI network: the multiplex genetic and protein interactions network of the Hepatitis C virus. The network contains 105 nodes and 3 layers: Physical association, Direct interaction and Colocalization.
Readers should refer to [19,20] for a better description of the data and