Evaluating relevance and redundancy to quantify how binary node metadata interplay with the network structure

Networks are real systems modelled through mathematical objects made up of nodes and links arranged into peculiar and deliberate (or partially deliberate) topologies. Studying these real-world topologies allows several properties of interest to be revealed. In real networks, nodes are also identified by a certain number of non-structural features, or metadata. Given the current possibility of collecting massive quantities of such metadata, it becomes crucial to identify automatically which are the most relevant for the observed structure. We propose a new method that, independently of the network size, is able not only to report the relevance of binary node metadata, but also to rank them. Such a method can be applied to networks from any domain, and we apply it in two heterogeneous cases: a temporal network of technology transfer and a protein-protein interaction network. Together with the relevance of node metadata, we investigate the redundancy of these metadata, displaying the results on a Redundancy-Relevance diagram, which is able to highlight the differences among vectors of metadata from both a structural and a non-structural point of view. The obtained results provide insights of a practical nature into the importance of the observed node metadata for the actual network structure.


Supplementary Information
Computational complexity of the phase diagram
The computation of the phase diagram requires a complete enumeration of all the possible $\binom{n}{n_1}$ combinations of binary metadata over the network nodes. The computational complexity of the phase diagram is bounded by this number of combinations, which can be estimated, in the worst case (i.e. when $n_1 = n/2$), to be $O(2^n)$ times the number of metadata.
Using Stirling's approximation, $n! \sim \sqrt{2\pi n}\,(n/e)^n$, we can write:
$$\binom{n}{n/2} = \frac{n!}{\left[(n/2)!\right]^2} \sim \frac{\sqrt{2\pi n}\,(n/e)^n}{\pi n\,\left(n/(2e)\right)^n} = 2^n \sqrt{\frac{2}{\pi n}}.$$
In general, the complexity of computing the phase diagram depends strictly on the ratio between $n$ and $n_1$, so it ranges from linear, yet trivial, cases when $n_1 = 1$ or $n_1 = n$ to exponential cases such as the worst case $n_1 = n/2$ considered here.
The need to compute and store the combinations associated with the phase diagram causes serious memory issues even for small networks. Consider, for instance, a network with $n = 50$ and $n_1 = 25$: storing all the possible binary vectors, of which there are about $\binom{n}{n_1} \sim 10^{14}$, would require $n \cdot \binom{n}{n_1} \approx 6 \cdot 10^{15}$ bits, that is, ∼ 0.8 petabytes of memory. This is not the case for $n_1 = 5$, for which there are only $\binom{n}{n_1} = 2118760$ configurations to be examined. In summary, estimating a limiting number of nodes is a complicated, case-dependent task, and exhaustive enumeration can become infeasible even for networks with a few tens of nodes.
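The figures quoted for n = 50 can be checked with a few lines of Python. This is only a sketch for illustration: the function names are hypothetical, the storage model assumes n bits per configuration (as in the estimate above), and a petabyte is taken to be 10^15 bytes. The Stirling-based estimate 2^n sqrt(2/(pi n)) for the worst case n_1 = n/2 is included for comparison with the exact count.

```python
import math


def count_configurations(n: int, n1: int) -> int:
    """Exact number of binary vectors of length n with exactly n1 ones."""
    return math.comb(n, n1)


def storage_bits(n: int, n1: int) -> int:
    """Bits needed to store every configuration, assuming n bits per vector."""
    return n * count_configurations(n, n1)


def to_petabytes(bits: int) -> float:
    """Convert bits to petabytes (assumption: 1 PB = 1e15 bytes)."""
    return bits / 8 / 1e15


def central_binomial_estimate(n: int) -> float:
    """Stirling-based estimate of C(n, n/2): 2^n * sqrt(2 / (pi * n))."""
    return 2.0 ** n * math.sqrt(2.0 / (math.pi * n))


# Compare the two cases discussed in the text for a 50-node network.
for n1 in (5, 25):
    count = count_configurations(50, n1)
    size_pb = to_petabytes(storage_bits(50, n1))
    print(f"n1={n1}: {count} configurations, {size_pb:.3g} PB")

# Ratio between the Stirling-based estimate and the exact worst-case count.
print(central_binomial_estimate(50) / count_configurations(50, 25))
```

The exact count grows by roughly eight orders of magnitude between n_1 = 5 and n_1 = 25, which is why exhaustive enumeration is only practical when the metadata vector is very unbalanced.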

PseudoCode
Algorithm 1 Computation of the significance of node metadata
1: Load a graph G with n nodes and m links
2: Load a set V of vectors of binary node metadata v_i
3: for any vector of metadata v_i ∈ V do
4:   Compute: n_1, H, D, H_min, D_min, H_max, D_max

28:   end if
29: end for
30: Sort r in non-decreasing order

Explanation of the indexes
For the node metadata, we refer to several indexes from those constituting the Global Innovation Index (GII) reports. The GII reports are generally considered a leading reference on innovation and they are co-published by Cornell University, INSEAD and the World Intellectual Property Organization (WIPO). The reports are published annually and available at the web address http://www.globalinnovationindex.org. The indicators that we take into account are: GDP per capita (GDP), Institutions (INST), Human capital and research (HCR), Infrastructure (INFR), Market sophistication (MS), Business sophistication (BS), Knowledge, technology and scientific outputs (KTSO), and Creative outputs (CO).
In particular, for each country, we refer to the GDP per capita in PPP (purchasing power parity) in dollars, as extracted from the World Bank World Development Indicators databases. We also consider the score for seven different indexes, defined as pillars in the GII reports. Indeed, the GII refers to two sub-indices: the Innovation Input Sub-Index and the Innovation Output Sub-Index, each built around these pillars. The Innovation Input Sub-Index has five enabler pillars: Institutions (INST), Human capital and research (HCR), Infrastructure (INFR), Market sophistication (MS), and Business sophistication (BS). Enabler pillars are related to aspects of the environment that are favourable to innovation within an economy. The other two pillars concern the innovation activities within an economy and are related to innovation outputs. They are: Knowledge, technology and scientific outputs (KTSO) and Creative outputs (CO). All the formal descriptions of the GII, as well as of its constituent indexes, are reported in the official reports available at the web address: http://globalinnovationindex.org.
In order to compute the relevance of node metadata, we binarize the values of the considered indicators, labelling countries as over-performing (c_i = 1) or under-performing (c_i = 0) with respect to the mean of a certain indicator. Such a procedure seems appropriate in the case of the EEN, since the considered indicators display a relatively homogeneous distribution across the years, as shown in Figure 1. In general, however, binarization is not appropriate for every distribution of scalar quantities. When the distribution of metadata is heterogeneous, e.g. when it presents a fat tail, we suggest adopting other methods for partitioning the distribution, such as the characteristic scores and scales (CSS) method.
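The mean-based binarization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the country labels are hypothetical, the scores are made-up numbers rather than GII data, and treating a value exactly equal to the mean as under-performing is an assumption, since the text does not specify how ties are handled.

```python
from statistics import mean
from typing import Dict


def binarize_by_mean(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each node to 1 if its score is above the mean, else 0.

    Assumption: a score exactly equal to the mean is labelled 0.
    """
    threshold = mean(list(scores.values()))
    return {node: int(value > threshold) for node, value in scores.items()}


# Made-up indicator scores for four hypothetical countries (not real GII data).
example_scores = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 10.0}
print(binarize_by_mean(example_scores))
```

With a strongly skewed indicator such as the made-up one above, a single large value pulls the mean up and leaves almost every node in the under-performing class, which is the situation in which a different partitioning method is preferable.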
   cellular organization: cellular transport and transport mechanisms
A  transport and sensing: categories "transport facilitation" and "regulation of / interaction with cellular environment"
R  stress and defense: cell rescue, defense and virulence
D  genome maintenance: DNA processing and cell cycle
C  cellular fate / organization: categories "cell fate" and "cellular communication / signal transduction" and "control of cellular organization"
U  uncharacterized: categories "not yet clear-cut" and uncharacterized