Likelihood-based approach to discriminate mixtures of network models that vary in time

Discriminating between competing explanatory models as to which is most likely responsible for the growth of a network is a problem of fundamental importance for network science. The rules governing this growth are attributed to mechanisms such as preferential attachment and triangle closure, with a wealth of explanatory models based on these. These models are deliberately simple, commonly with the network growing according to a constant mechanism for its lifetime, to allow for analytical results. We use a likelihood-based framework on artificial data, where the network model changes at a known point in time, and demonstrate that we can recover the change point from analysis of the network. We then apply the framework to real datasets and show how the importance of different network growth mechanisms changes over time.


Likelihood Calculation
The main text describes the general form of a model likelihood given observations of an evolving graph G_1, G_2, ..., G_t. Here we provide more detail on how each term P(∆_i = δ_i | G_{i−1} = g_{i−1}, M) is evaluated. In all of the data sets used in this paper, the graphs evolve by adding stars. For example, in the arXiv citation dataset the graph evolves when a new paper arrives and cites other papers in the dataset. The operation model here is a new node arriving and connecting to m existing nodes. The m nodes of the star are then chosen in turn, without replacement, to avoid self-loops or multilinks. To calculate the exact probability of this operation we must evaluate the probability of each order of selection occurring. So if we want the likelihood that the chosen set of nodes was {2, 3}, we must consider the likelihood of picking 2 then 3 and also of picking 3 then 2.
Consider the observation in figure 1, where we observe a new node (4) joining the network by connecting to nodes (2) and (3), and suppose we wish to compare which of two object models, M_1 and M_2, is the better explanation for this observation. As (4) is a new node, the probability we are interested in is the probability of picking nodes (2) and (3) according to model M_1 or M_2. As there is no record of which of the edges (4, 2) and (4, 3) arrived first, we consider both orders. Suppose we want to compare the BA model with the random model for the pictured addition of one node and two links. In general we have l(M | observation) = P(pick node 2, then 3 | M) + P(pick node 3, then 2 | M).
For the BA model, where nodes are chosen with probability proportional to degree (in figure 1 node 2 has degree 2 and nodes 1 and 3 have degree 1), we get the likelihood l(BA | observation) = (2/4)(1/2) + (1/4)(2/3) = 5/12, and for the random model, where nodes are chosen uniformly, we get l(random | observation) = (1/3)(1/2) + (1/3)(1/2) = 1/3. In this case, therefore, the BA model is more likely (the likelihood ratio of BA to random being 5/4). If a node connects to a small number of others then each ordering can be considered explicitly in this way. However, the number of orderings grows quickly with the number of chosen nodes m (there being m! possible orderings of the node selections) and hence, for m > 5, we use a sampling procedure: we calculate the average likelihood over a sample of possible orderings and multiply this by the number of orderings.
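As a minimal sketch of this ordering calculation (not the FETA implementation), the following Python function sums the selection probability over all orderings of a star's chosen nodes and falls back to sampling for m > 5. The function name `star_likelihood` and the three-node example graph, with node 2 having degree 2 and nodes 1 and 3 having degree 1, are illustrative assumptions.

```python
import itertools
import math
import random

def star_likelihood(weights, chosen, exact_max=5, n_samples=1000):
    """Likelihood that a new node connects to the node set `chosen`,
    with existing nodes drawn without replacement in proportion to
    `weights` (node -> selection weight). All orderings are summed
    exactly for small stars; for m > exact_max a sample of orderings
    is averaged and scaled by the number of orderings, m!."""
    def order_prob(order):
        p, remaining = 1.0, dict(weights)
        for node in order:
            total = sum(remaining.values())
            p *= remaining.pop(node) / total  # draw without replacement
        return p

    m = len(chosen)
    if m <= exact_max:
        return sum(order_prob(o) for o in itertools.permutations(chosen))
    samples = (order_prob(random.sample(list(chosen), m)) for _ in range(n_samples))
    return math.factorial(m) * sum(samples) / n_samples

# BA weights: node degrees (node 2 has degree 2); random model: uniform weights.
ba = star_likelihood({1: 1, 2: 2, 3: 1}, [2, 3])   # 2/4 * 1/2 + 1/4 * 2/3 = 5/12
rnd = star_likelihood({1: 1, 2: 1, 3: 1}, [2, 3])  # 1/3 * 1/2 + 1/3 * 1/2 = 1/3
```

Here the ratio `ba / rnd` comes out at 5/4, favouring the BA model for this observation.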

Data requirements for the framework
Our framework requires that the history of the network's growth is known to a sufficient temporal resolution such that the edges arriving in each time increment form a star graph. The simplest scenario is if just a single link arrives at each time increment; this is the case for the Facebook wall posts and StackExchange MathOverflow datasets. In many datasets, including all our artificial datasets, each time increment will comprise link arrivals forming a star (an existing or new node connecting to more than one other new or existing node simultaneously). This is subject to the ordering problem discussed in the previous section: in principle the calculation can be exact, but the combinatorics mean that sampling must be used. In some data sets more complex situations arise, for example, two unrelated links arriving simultaneously. In principle these could also be handled by considering every possible arrival order. However, in the data sets we have used such events are extremely rare: they account for less than 0.1% of the data, and the ordering makes a negligible difference to the calculated likelihood, so an arbitrary ordering can be assumed without changing the results. Data sets where a large number of unrelated links arrive at exactly the same instant cannot be analysed with this methodology, but this still leaves a large and increasing number of datasets amenable to analysis.
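A quick way to check this requirement against a dataset is to test whether the edges in each time increment share a common endpoint. The helper below is a sketch, not part of FETA:

```python
from collections import Counter

def is_star(edges):
    """True if the edges arriving in one time increment form a star,
    i.e. a single hub node is an endpoint of every edge
    (a lone edge is trivially a star)."""
    if len(edges) <= 1:
        return True
    # The hub, if one exists, must be the most frequent endpoint.
    counts = Counter(node for edge in edges for node in edge)
    hub = counts.most_common(1)[0][0]
    return all(hub in edge for edge in edges)

print(is_star([(4, 2), (4, 3)]))  # True: new node 4 is the hub
print(is_star([(1, 2), (3, 4)]))  # False: two unrelated links
```

Increments failing this check are the rare "unrelated simultaneous links" case discussed above.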

Datasets
This paper used four publicly available network datasets as case studies for fitting (time-varying) mixtures of models. To make these compatible with the modelling framework, we cleaned the datasets to remove any duplicate links and any nodes which do not connect to the largest connected component upon joining the network. These cleaned datasets are available within the software repository [1] in the form of a tab-separated file, where each line is of the form SOURCE_NODE DEST_NODE TIMESTAMP, specifying a string source and destination node for each link and the timestamp (Unix epoch) at which the link was recorded. Table 1 details each dataset, showing the original source and the number of nodes/edges after cleaning, together with a description of what an edge (u, v, t) between nodes u and v created at time t represents for that network. For simplicity we consider these networks as undirected.
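As an illustration of this file format, the sketch below parses such a file and applies a simplified version of the cleaning described above: it drops duplicate links, and drops links where neither endpoint has been seen before, rather than tracking the largest connected component exactly. The function `load_clean_edges` is hypothetical and is not the FETA cleaning routine.

```python
def load_clean_edges(lines):
    """Parse tab-separated SOURCE_NODE DEST_NODE TIMESTAMP lines into
    (source, destination, timestamp) tuples, treating links as
    undirected. Duplicate links (in either direction) are dropped, as
    are links where both endpoints are unseen -- a simplification of
    restricting nodes to the largest connected component."""
    seen_links, seen_nodes, edges = set(), set(), []
    for line in lines:
        src, dst, ts = line.rstrip("\n").split("\t")
        link = frozenset((src, dst))  # undirected: {u, v} == {v, u}
        if link in seen_links:
            continue  # duplicate link
        if seen_nodes and src not in seen_nodes and dst not in seen_nodes:
            continue  # would start a new component; drop it
        seen_links.add(link)
        seen_nodes.update((src, dst))
        edges.append((src, dst, int(ts)))
    return edges
```

Reading from a file on disk would then be `load_clean_edges(open(path))`.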

Extra dataset results
As well as the real-data results in figures 8 and 9 with the experiments described in the main text, we generated networks for the remaining Facebook wall posts and citation network datasets. In the case of the citation network (Figure 2), the 'first order' degree-based statistics (top row) from the artificial networks are fairly close to the real data, but adding changepoints does not appear to confer any advantage. The latter is not surprising: there was little change in the model parameters of the time-varying model (Figure 9, main text). The degree assortativity and clustering coefficient are not realised well by either model. That said, the model components used here were deliberately simple so that the four networks could be compared as in Figure 9 of the main text; if the aim were to find the most realistic model, more complex models such as nonlinear preferential attachment [6] or aging [7] could be considered. For the Facebook wall posts network (Figure 3), the degree-based statistics were well reproduced by both models, with the changepoints adding little advantage, while the degree assortativity and clustering coefficient were poorly captured by both. We suggest that a more realistic model might be achieved by incorporating community structure, for example through a dynamic variant of the stochastic block model.

FETA code for reproducing the experiment in the manuscript
We have provided a Java-based codebase, FETA (Framework for Evolving Topology Analysis) [1], implementing various aspects of this framework, with tutorials provided. The user can generate a network given an operation and object model, calculate the likelihood of a model given network observations (i.e. a dataset fitting the requirements in section 2), fit a time-varying mixture model to network observations, and extract a time series of different network measurements from a dataset. An API is also provided for users to write their own object models to test on data.