Abstract
A major line of contemporary research on complex networks is based on the development of statistical models that specify the local motifs associated with macrostructural properties observed in actual networks. This statistical approach becomes increasingly problematic as network size increases. In the context of current research on efficient estimation of models for large network data sets, we propose a fast algorithm for maximum likelihood estimation (MLE) that affords a significant increase in the size of networks amenable to direct empirical analysis. The algorithm we propose in this paper relies on properties of Markov chains at equilibrium, and for this reason it is called equilibrium expectation (EE). We demonstrate the performance of the EE algorithm in the context of exponential random graph models (ERGMs) a family of statistical models commonly used in empirical research based on network data observed at a single period in time. Thus far, the lack of efficient computational strategies has limited the empirical scope of ERGMs to relatively small networks with a few thousand nodes. The approach we propose allows a dramatic increase in the size of networks that may be analyzed using ERGMs. This is illustrated in an analysis of several biological networks and one social network with 104,103 nodes.
Introduction
Developing efficient approaches to data analysis and statistical inference is becoming increasingly important due to the widespread availability of large data sets in many fields of science. This is particularly the case for relational data typically taking the form of square arrays recording the presence of one or more relations among units of interest represented as network nodes^{1,2}. Network representation of data is a fundamental tool for understanding and modeling a wide range of complex systems. Among the available models for relational data, exponential random graph models (ERGMs) are generally considered as “The most promising class of statistical models for expressing structural properties of social networks observed at one moment in time”^{3}. ERGMs have found broad application in the analysis of social networks^{4}, as well as biological networks^{5}.
Formally, ERGMs may be written in the following form:
which expresses the probability of observing a network with a fixed number nodes in a given state x. Here z_{A}(x) are network statistics from the space of states x, which are counts of theoretically meaningful or empirically relevant network subgraphs (often called “configurations” in ERGM terminology). The summation is over all configurations A, θ_{A} denotes the model parameter associated to z_{A}(x), θ is the vector of all these parameters and k(θ) the normalizing constant ensuring that the probability distribution sums to one. The reader may also recognize equation (1) as an exponential family in canonical form (see Ch. 8 of BarndorffNielsen^{6}), a Gibbs distribution, or a Markov random field^{7}.
Different structural features are present in different networks. Yet, empirical network data are also characterized by recurrent structural regularities whose identification is crucial for understanding the behavior of complex network systems. For instance, much recent research has concerned “motifs,” small subgraphs occurring more frequently than might be expected by chance (and hence often similar to ERGMs configurations)^{8,9}. Motifs have been considered as the building blocks of complex networks. To determine if a given motif is overrepresented, the frequency of the motif in an observed network is typically compared to the average frequency in an appropriate random network (null model)^{10,11,12}. ERGMs permit inference on the under or overrepresentation of specific configurations conditional on the presence of other configurations, or structural features. In ERGMs, network statistics may be defined for particular subgraphs that may be of general or contingent interest in the particular network under study like, for example, reciprocity or triadic closure^{3,4,13,14,15}. Network nodes may have attributes^{16}, the expression for z_{A}(x) may be rather complicated, and the number of parameters may be large. Common examples of network statistics adopted in empirical research may be found elsewhere^{3,4,17}.
The model parameters fit the observed network x_{obs} if the following method of moments (MoM) condition^{18} is satisfied for all A:
where \({E}_{\pi (\theta )}({z}_{A}(x))={\sum }_{x}{z}_{A}(x)\pi (x,{\boldsymbol{\theta }})\) denotes the expected network statistics with respect to the probability distribution (1) with parameters θ to be estimated; z_{A}(x_{obs}) denotes network statistics in the observed network. If the estimated θ_{A} is significantly larger than zero, then the corresponding network statistic z_{A} occurs more frequently than might be expected by chance given all the other parameters of the model, and the corresponding configuration is overrepresented. A wellknown property of the exponential family (e.g., Ch. 8 of BarndorffNielsen^{6}) is that E_{π(θ)}(z_{A}(x)) is a monotonically increasing function of θ_{A}. Thus, the estimated θ_{A} measures the corresponding structural features in observed networks.
The problem of parameter estimation (2) coincides, in our context, with the problem of maximum likelihood estimation (MLE). The computational challenge involved in using ERGMs is the intractable normalizing constant k(θ), that makes MLE computable only by Monte Carlo techniques^{14}. Throughout the paper, we will be assuming that a MLE exists and that the model is nondegenerate^{14,19}. It is known that this latter condition does not hold for all ERGM models (e.g., the edgetriangle model and other simple Markov random graph models^{13,19,20} with phase transitions, where nonconvergent estimations are common), but more robust specifications, specifically the “alternating” statistics used in this paper (see the beginning of Section Results for the exact ERGM specifications adopted), have been shown to be much better behaved in terms of stability of statistics across a wide range of parameter values^{3}.
Existing computational methods for MLE/MoM of ERGM parameters via equation (2) do not scale up easily to large data. Even though, to date, computational costs have constrained the scope of MLE, it remains widely adopted in numerous research settings, including the analysis of temporal networks^{21,22}. Pseudolikelihood and quasilikelihood methods have been used when MLE cannot be computed, but it has been shown that these methods do not produce reliable results^{20,23,24}. Estimation via conditional independence sampling methods, particularly snowball sampling, has been recently introduced to alleviate some of these issues of scale^{25,26,27}.
Estimates of the model parameters θ may be achieved using a number of computational approaches such as, for example, Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE)^{23,28,29,30,31}, variants of stochastic approximation for the method of moments^{18,22,32}, and Bayesian estimation^{33,34,35}. Of these different approaches, Bayesian estimation of ERGMs is the most computationally intensive due to the double intractability of the posterior distribution^{34}. In other types of models, however, new methods based on variational Bayesian inference^{36} can be relatively efficient, leveraging the sparse structure of large networks, e.g., graph partitioning (via blockmodeling analysis and/or analysis of modularity)^{37,38}. These methods, however, have yet to be adopted for the estimation of ERGMs. In this paper, we derive an efficient Monte Carlo approach for MLE. MCMCMLE methods are somewhat faster than the MoM because they use importance sampling but are crucially dependent on the choice of the importance distribution. Stochastic approximation for the MoM is often more robust. For the network models we describe here, the existing MCMCMLE and MoM methods are similar in practical terms with respect to the network size that can be estimated. A considerable variety of stochastic approximation methods have been developed for different problems. In particular, efficient stochastic gradient methods^{39} are often adopted for maximum likelihood estimation when many independent observations are available. The MoM^{18} adopts the RobbinsMonro algorithm^{40} and PolyakRuppert averaging^{41} and is often used to solve (2) when the left side of (2) can be computed only by MCMC simulation^{42}. In this paper, we compare the performance of our new approach with that of the MoM^{18}.
MCMC simulation and, in particular, the MetropolisHastings algorithm^{43,44} may be used to generate a network x for fixed values of the parameters θ, which we denote by x(θ). Equation (2) formulates the inverse problem, i.e., to find the value of θ that, given x_{obs}, satisfy (2), which we denote by θ(x_{obs}). Computing θ(x_{obs}) is much more computationally expensive than computing x(θ). The largest networks for which these methods have been applied to find the MLE of ERGM parameters contain at most a few thousand nodes. In contrast to existing approaches for MLE, what we propose does not require the simulation of a large number of Markov chains until convergence. Rather, it relies on properties of the Markov chain at equilibrium. For this reason, we call the proposed approach equilibrium expectation (EE).
The remainder of this paper is organized as follows: in the next section we give a brief description of Markov chain Monte Carlo, and then propose a new method, Equilibrium Expectation, for MLE of ERGM parameters. Then in the Results section, we demonstrate the performance of the EE algorithm on some simulated and empirical networks, of sizes well beyond what is currently possible in practice.
Markov Chain Monte Carlo
If the value of the normalizing constant k(θ) in equation (1) is not known, the values of probability distribution (1) cannot be computed. Markov chain Monte Carlo simulation is typically used to address this issue. MCMC simulation allows approximation of the target probability (1) and computation of expected properties \({E}_{\pi (\theta )}({z}_{A}(x))={\sum }_{A}{z}_{A}(x)\pi (x,{\boldsymbol{\theta }})\) of the model. The MetropolisHastings algorithm uses a Markov process that asymptotically reaches a unique stationary distribution π(θ). Given that the system is in state x the new state x′ is proposed with probability q(x → x′). A Markov process is uniquely defined by its transition probabilities, P(x → x′, θ) = q(x → x′)α(x → x′, θ), i.e., the probability of transitioning from any given state x to any state x′. Detailed balance with respect to a given distribution π(x, θ), which in turn implies that π(x, θ) is a stationary distribution for the Markov chain, is satisfied if the acceptance probability of the new state is given by
Formally, the MetropolisHastings algorithm may be written as Algorithm 1.
If t is larger than the burnin time, t_{B}, of the Markov process, then x_{t} values are drawn from π(x, θ), where t_{B} is the time taken by the Markov chain to forget its initial state and reach the stationary regime. The algorithm generates z_{A}(t) = z_{A}(x_{t}) sequences. From the Markov chain ergodic theorem it follows that, under regularity conditions, if the number of steps T is large, then the expected values of the model statistics E_{π(θ)} (z_{A}(x)) can be estimated by the average of z_{A}(t) along the path of the Markov chain, \(\overline{{z}_{A}(t > {t}_{B})}\)^{44}. The MetropolisHastings algorithm provides a general framework that results in a large number of different MCMC samplers that differ in the proposal q(x → x′). Verification of MCMC convergence is not a simple matter, and different methods have been suggested^{45}. One heuristic rule for evaluating convergence of the Markov process x_{t}, in our ERGM context, is to monitor the z_{A}(x_{t}) sequences.
Consider the expected change E_{π(θ)}(Δz_{A}) of z_{A}, E_{π(θ)}(Δz_{A}) = E_{π(θ)}(z_{A}(x_{t+1}) − z_{A}(x_{t})). If equilibrium stationary distribution is reached, then statistics z_{A}(x_{t}) converge and fluctuate around their expected values E_{π(θ)} (z_{A}), which do not depend on t, and hence
We can write E_{π(θ)} (Δz_{A}) as a function of the transition probabilities. Given x_{t} = x, the expected change along the path of the Markov chain in z_{A}(x_{t+1}) − z_{A}(x_{t}) is obtained as
and the expected value of Δz_{A}(x, θ) with respect to π(x, θ) is
From (4), it follows that after the MCMC burnin, when equilibrium stationary distribution is reached, we have that for all the statistics z_{A}(x),
Estimation strategy: Equilibrium Expectation
In the context of ERGMs, MLE of parameters is obtained from equation (2). Existing computational methods for MLE, such as MCMCMLE, the MoM, or Bayesian estimation, use iterative algorithms that successively modify θ until (2) is satisfied within given criteria for MCMC convergence. To this end, MCMC simulation is performed by drawing a large number of simulated networks for various values of θ, which we denote by x_{S}(θ). The simulated network x_{S} is a network drawn from probability distribution π(x, θ).
Each time a simulated network x_{S}(θ) is drawn, the convergence criterion (7) should be satisfied, but this makes the standard estimation algorithms very computationally expensive. We suggest a much less computationally expensive approach for MLE.
We use equation (7) and rewrite it as:
Equation (8) reads as follows: if network x is drawn from probability distribution π(x, θ), then the expected value of Δz_{A}(x, θ) is zero. Equation (8) is valid only at equilibrium, that is, when π(x, θ) is the limiting equilibrium distribution of the Markov chain that has transition probabilities P(x → x′, θ). Δz_{A}(x, θ) may be computed via Monte Carlo integration^{42,46,47}, as suggested in Section Estimation algorithm. The expectation with respect to π(x, θ) could be computed if a large Monte Carlo sample of networks independent and identically distributed (i.i.d.) from π(x, θ) were available i.e., \({x}_{{S}_{1}},{x}_{{S}_{2}},\cdots ,{x}_{{S}_{n}}\). The existence of this large i.i.d. sample of networks is assumed here for expository purposes in deriving an estimation strategy, and later we will remove this assumption to develop an estimation method which can be applied to observed empirical networks. Making this assumption, the LHS of (8) may be computed by Monte Carlo integration as \({E}_{\pi ({\boldsymbol{\theta }})}({\rm{\Delta }}{z}_{A}(x{\boldsymbol{\theta }}))=\frac{1}{n}{\sum }_{i}{\rm{\Delta }}{z}_{A}({x}_{{S}_{i}},{\boldsymbol{\theta }})\), and (8) may be approximated by
If we have a large sample of networks i.i.d. from π(x, θ), then we can efficiently compute the LHS of (9) and solve system of equations (9) with respect to θ. Thus, we can estimate θ. When MCMC simulation is performed, the number of steps should be larger than the burnin time, which may be large. In contrast, Δz_{A}(x, θ) in (9) may be computed by Monte Carlo integration, there is no burnin, and the number of steps may be small.
Equation (2) may be written as f_{A}(x_{obs}, θ) = 0, where f_{A}(x_{obs}, θ) = E_{π(θ)} (z_{A}(x) − z_{A}(x_{obs})); hence, the true parameter values, \({{\boldsymbol{\theta }}}^{\ast }\), may be estimated from
Typically, to estimate \({{\boldsymbol{\theta }}}^{\ast }\), the estimation of each observed network is performed, and the resulting estimates \({\boldsymbol{\theta }}({x}_{{S}_{i}})\) are averaged over the observations as follows: \(\hat{{\boldsymbol{\theta }}}\)* \(=\frac{1}{n}{\sum }_{i}{\boldsymbol{\theta }}({x}_{{S}_{i}})\). If the network x_{S} is very large, then \({\boldsymbol{\theta }}({x}_{{S}_{i}})\) is the desired estimate of the true \({{\boldsymbol{\theta }}}^{\ast }\). Thus, if the network x_{S} is very large, then the true \({{\boldsymbol{\theta }}}^{\ast }\) may be estimated from f_{A}(x_{obs}, θ) = 0, and the summation in equation (10) may be dropped. In statistics and statistical physics, this property is called the ergodicity of systems: the time averaging is equivalent to the ensemble averaging. Here, time averaging is the average over networks generated by the Markov process, and ensemble averaging is the average over the space of all the system’s states, which grows with the network size. If the system is ergodic, then this property may also be applied to equation (9) i.e., if the network x_{s} is very large, then the summation in equation (9) may be dropped. Thus, the true \({{\boldsymbol{\theta }}}^{\ast }\) value may be found from the following equilibrium expectation condition: for all A,
where Δz_{A}(x_{S}, θ) is given by (5). We can find the value of θ^{EE} that, given x_{S}, satisfies (11); this value is denoted as θ^{EE}(x_{S}). If the network x_{S} is very large, then θ^{EE}(x_{S}) is the desired estimate of \({{\boldsymbol{\theta }}}^{\ast }\). Otherwise, a large sample of networks is required as described above, and the true \({{\boldsymbol{\theta }}}^{\ast }\) value may be found from (9).
The estimation method described above may be theoretically interesting, but is of limited value for the estimation of realworld data, for which we have only a single observed network x_{obs} drawn from some unknown probability distribution, rather than a large i.i.d. Monte Carlo sample of networks drawn from a known probability distribution, as we assumed earlier.
It is not, however, lacking in usefulness when applied to such an observed network. In fact, it is actually Contrastive Divergence^{48} (CD) as applied to ERGM parameter estimation^{31,49,50,51}. The CD algorithm, rather than running an MCMC simulation to convergence, instead starts from the observed data and makes only some number k of MCMC updates (the CD algorithm with k updates is called CDk). CD1 has been shown to be equivalent to maximum pseudolikelihood under certain conditions^{52} and it has also been shown that CDk forms a series of increasingly close approximations to the MLE as k → ∞^{49}.
To see that the estimation strategy we have just described is equivalent (but independently developed) to CD1, it may be useful to consider the implementation described by Krivitsky^{31} of using CD to find the initial estimates to seed MCMCMLE. In this implementation, an ERGM MCMCMLE implementation is simply converted to CDk by modifying the MCMC sampler to start from x_{obs} and reverting the chain to x_{obs} every k steps. And so if k = 1 then the network is not actually modified and the CD1 estimate of θ is just the solution of (10). Details of the CD1 algorithm as applied in the work described here are given in the Supplementary Information.
As previously noted, maximum pseudolikelihood (and hence CD1) does not necessarily produce reliable results for ERGM estimation. However they (and CDk for k > 1) are useful for finding initial θ values to seed methods such as MCMCMLE that are sensitive to initialization conditions^{31,53}. In this paper we indeed use CD1 to find initial parameter estimates.
We will now describe how this strategy can be modified, for more realistic applications, having an observed network drawn from some unknown probability distribution. Let \({{\boldsymbol{\pi }}}^{\ast }\)(x) be the unknown probability distribution from which the realworld data x_{obs} are drawn. π(x, θ) is then the probability distribution corresponding to the model specification we have chosen (in this paper the ERGM model for some specified choices of network statistics). We assume this model to be appropriate for the observed data – we do not discuss model selection here – and, as discussed in the Introduction, we also assume the MLE exists and the model is not degenerate. Formally, we can write that for any θ, \({{\pi }}^{\ast }\)(x) ≠ π(x, θ). Equation (7) can be applied only if the equilibrium stationary distribution of the Markov chain follows a statistical model π(x, θ). If \({{\pi }}^{\ast }\)(x) ≠ π(x, θ), then the conditions of Equation (7) are not satisfied, and it cannot be applied. In other words, if observed networks are not drawn from π(x, θ), then the LHS of equation (7) is not equal to the LHS of equation (9). In this case, we cannot compute the LHS of (7), and hence we cannot find a value of θ such that (7) is satisfied. However, if a solution to (2) exists, then it is possible to find x_{S} drawn from π(x, θ) such that the following condition is satisfied:
For network x_{S}, drawn from probability distribution π(x, θ), (11) may be used as described above. Using (10) and (12), the solution of (2) may be obtained for an observed network drawn from an unknown probability distribution. The network drawn from π(x, θ) may be obtained by an MCMC simulation. In the estimation strategy just described, the parameters θ are iteratively adjusted until (11) is satisfied. To estimate parameters for networks drawn from an unknown distribution \({{\pi }}^{\ast }\)(x), both θ and x are modified: θ is modified in the same way as before, and x is also modified (equilibrated) so that, after some burnin, x will be drawn from the probability distribution π(x, θ).
Indeed, starting from the previously described method to find the solution of (11), we can derive the EE algorithm (described in the following section) for the solution of (11), (12). The EE algorithm for the estimation of an observed network x_{obs} performs MetropolisHastings moves, that is, it actually makes changes to a network for accepted proposals, a step which was not necessary for solving (11). The EE algorithm requires MCMC simulation, but – in contrast to MCMCMLE, the MoM or Bayesian estimation – the EE algorithm does not draw many simulated networks for various θ values and, therefore, is considerably faster.
Estimation algorithm
The EE algorithm is described in Algorithm 2. To provide some intuition, the description is maintained at a rather general level here. A more detailed and formal description of the EE algorithm is included in the Supplementary Information. Let m = 1000 and M indicate the number of steps of the EE algorithm. K_{A} is a positive constant, the values of which could be for example K_{A} ≈ 10^{−4} ⋅ (∂Δz_{A}(x_{obs}, θ)/∂θ_{A})^{−2}; a better choice is suggested in the Supplementary Information. The final parameter estimate from the CD1 algorithm is used as the starting point for the EE algorithm, but the efficiency of the EE algorithm is more sensitive to K_{A} values than to the initial parameter estimates θ_{0}.
Steps 4–5 of the EE algorithm are identical to steps 3–4 of the MetropolisHastings algorithm presented in Section Chain Monte Carlo: the move x → x′ is proposed, and the acceptance probability (3) is calculated. If the move to x′ is accepted, then the change in the sufficient statistics z_{A}(x′) − z_{A}(x) is computed.
Step 13 modifies θ iteratively until (11) is satisfied. This parameter update depends on the property of the exponential family that E_{π(θ)}(z_{A}(x)) is a monotonically increasing function of θ_{A}, which was mentioned in the Introduction. Because of this, for each configuration A we want to increase θ_{A} when dz_{A} is negative and decrease it when dz_{A} is positive, in order to find θ_{A} such that dz_{A} is zero. This is achieved by adding \({K}_{A}\cdot {\rm{sign}}(d{z}_{A})d{z}_{A}^{2}\) to the current θ_{A} value (recall that K_{A} is positive).
Different rootfinding algorithms may be applied to solve the system of equation (11) when Δz_{A}(x, θ) is available. For example, bisection or quasiNewton methods^{46,54} may be applied if Δz_{A}(x, θ) were a continuous functions of θ. Efficient root finding methods use derivative information. The precise derivative information is either not available or very computationally expensive, but it may be shown (see Supplementary Information) that, for any x,
It is trivial to find roots of continuous monotonic functions, and step 13 implements one possibility. A more robust approach to solving Δz_{A}(x, θ) = 0 would be to use stochastic approximation methods, which are typically applied to solve (2).
Since x is not constant, z_{A}(x) is also not constant. Recently, an auxiliary parameter MCMC method was proposed to perform MCMC simulation, constraining the value of one of the network statistics. Only the network statistic z_{L}(x), which is the edge count, was constrained to the observed value, and this was achieved by using a special proposal q(x → x′): the IFD sampler^{55}. The corresponding parameter θ_{L} was adapted^{55} in the same way as done in step 13 of the EE algorithm. In the present paper, all the network statistics z_{A}(x) are constrained and, crucially, without the need to develop a special proposal distribution. Indeed, they may be constrained by exploiting (13), i.e., the monotonic dependence of Δz_{A}(x, θ) and E_{π(θ)}(z_{A}(x)) on θ_{A}, and by properly modifying the θ_{A} values. We need an algorithm that converges to the values of θ and x that satisfy (11) and (12) for all the network statistics z_{A}(x). This is achieved by accumulating the accepted change statistics (line 7). As a result, dz_{A} = dz_{A}(t) + Δz_{A}(x, θ) and dz_{A} = z_{A}(x) − z_{A}(x_{obs}). If equilibrium is reached, then x = x_{S} is drawn from π(x, θ) and hence Δz_{A}(x, θ) = 0 and (11) is satisfied. In addition, if dz_{A} = 0, then z_{A}(x) = z_{A}(x_{obs}) and (12) is satisfied. It is shown in Section Estimation strategy: Equilibrium Expectation that the solution of (11) and (12) gives the solution of (2).
The EE algorithm produces θ_{A}(t) sequences. The number of steps M should be large enough for θ_{A}(t) to converge. After θ_{A}(t) reaches convergence, its values fluctuate around some constant values θ_{A}, which we estimate as \({\theta }_{A}=\overline{{\theta }_{A}(t > {t}_{B})}\). In addition to θ_{A}(t), the sequences dz_{A}(t) = z_{A}(t) − z_{A}(x_{obs}) are generated by the EE algorithm. It is convenient to plot both z_{A} − z_{A}(x_{obs}) and θ_{A}(t) sequences to visually check for convergence of the algorithm. We use the tratio test to check if (12) is satisfied:
where t_{B} is the burnin time, starting from which convergence is observed from the plots of θ_{A}(t) and z_{A}(t) − z_{A}(x_{obs}), and \(\overline{{z}_{A}(t > {t}_{B})}\) denotes the network statistic averaged over t > t_{B}, while SD(⋅) is the standard deviation operator. We propose an appropriate indication of good convergence so that the suggested algorithm gives a converged solution to (2) if, for all A, (i) τ_{A} < 0.1 and (ii) θ_{A} converges.
Note that if K_{A} = 0, then the EE algorithm does not differ from the MetropolisHastings algorithm. MCMC estimation algorithms make use of MCMC simulation; hence, the efficiency of MCMC estimation algorithms depends on the efficiency of MCMC samplers. In turn, the efficiency of the MetropolisHastings algorithm depends on the proposal q(x → x′). The same holds also for the EE algorithm: the algorithm may be used with different proposals q(x → x′), and its efficiency depends on a good choice of the proposal distribution. Proposals for ERGMs are reviewed elsewhere^{55}. Both the updating step of the MoM^{18} and that of the EE algorithm depend on z_{A}(x) − z_{A}(x_{obs}). However, in the case of the MoM, x is the equilibrium network configuration, which can be obtained by, e.g., the MetropolisHastings algorithm if the MCMC simulation time is larger than the burnin time of the Markov process. In the EE algorithm, x is a current nonequilibrium state. The MoM and MCMCMLE^{28} require many converged outputs of the MetropolisHastings algorithm, while the EE algorithm does not need such outputs. Instead, the EE algorithm generates one converged output.
Results
First, we test the EE algorithm by computing the bias and the standard deviation of the estimates that this algorithm produces. Simulated networks of various sizes N (here, N is the number of nodes in a network) are generated by the MetropolisHastings algorithm (as implemented in the PNet^{56} program). The ERGM model is defined by the AS, AT, A2P, ρ and ρ_{B} network statistics. The network statistics we use in this paper were suggested by Snijders et al.^{3,20} and are detailed in the Supplementary Information. The true values of the corresponding model parameters \({\theta }_{A}^{\ast }\) are represented by the horizontal lines in Fig. 1, with the estimated values of the parameters shown as boxplots, for both the EE algorithm and a variant of the stochastic approximation algorithm (MoM)^{18} for MLE, widely used for ERGM estimation. It is clear that both EE and the MoM give accurate estimates of the true \({\theta }_{A}^{\ast }\) values. Both the variance and the bias \({\theta }_{A}{\theta }_{A}^{\ast }\) of the estimates obtained using different methods have similar values. In Fig. 2, we report the computational times needed to obtain the corresponding estimates. As described in the Methods section, steps 4–6 of the EE algorithm are equivalent to steps 3–5 of the MetropolisHastings algorithm, used in the MoM. In our implementation of the MoM and EE algorithms, these steps are carried out in exactly the same way. This allows the evaluation of the efficiency of the MoM and EE algorithms by comparing the computational times for the corresponding estimations. Both the MoM and EE algorithms are used with the Basic sampler^{55}, and estimation is performed on a Cray XC50 machine available at the Swiss National Supercomputing Centre (CSCS). Figure 2 shows that for small networks (N = 1,000), the efficiencies of the EE and MoM algorithms are not very different, while for larger networks (N = 10,000), the EE algorithm is 2 orders of magnitude faster than the MoM. These results suggest that the EE algorithm may be used for the analysis of large datasets.
For a more complete comparison of the MoM and EE, we now compare their respective performance on empirical datasets. Using the MoM, it is possible to estimate simulated networks with 10,000 nodes within several hours, but the estimation of empirical networks takes longer to converge. Despite several attempts to improve the efficiency of the computational methods for MLE, the existing algorithms are still too computationally expensive for empirical networks of this or larger size. Using the more efficient IFD sampler^{55}, we can estimate six biological networks via both EE and the MoM. The estimation results are reported in the Supplementary Information. The results reported in Supplementary Tables S2 and S3 show that the MoM and EE produce equivalent estimates, but EE may be two orders of magnitude faster than the MoM, even for relatively small empirical networks with 1,781 and 5,038 nodes. Although we cannot estimate larger networks with the MoM, the results reported in Fig. 2 suggest that for larger networks, EE may be orders of magnitude faster.
For an illustrative example of a biological network, we study an Arabidopsis thaliana proteinprotein interaction (PPI) network^{57,58}, including some protein attributes on the nodes. In this network, nodes represent proteins, and (undirected) edges represent literaturecurated binary interactions between proteins. Arabidopsis thaliana proteins are annotated with various properties, as described in the Supporting Online Material of Arabidopsis Interactome Mapping Consortium^{58}. We make use of the following protein attributes. We consider one binary “plantspecific” node attribute (genes defined as plantspecific, absent from other eukaryotic lineages^{58}) and define activity ρ and interaction ρ_{B} network statistics on this attribute. We consider 2 categorical attributes “E class” and “kinasephosphorylated”, derived from the Supporting Online Material of Arabidopsis Interactome Mapping Consortium^{58} as described in the Supplementary Information, and define mismatch network statistics on these attributes. The estimated parameters of the resulting model are given in Table 1. The significant positive AT parameter indicates that triangle motifs are significantly overrepresented. The positive plantspecific interaction parameter shows that plantspecific proteins interact preferentially with each other (with the negative activity effect showing that, relative to other proteins, they are less likely to interact at all). The positive mismatch kinasephosphorylated effect shows the propensity of kinases to interact with phosphorylated proteins, as we would expect given the action of kinases to phosphorylate proteins. The positive mismatch E class shows the propensity of proteins to interact if they belong to different E classes^{59}.
We also demonstrate the performance of the algorithm in a case study involving the analysis of a larger empirical social network. More specifically, we study the Livemocha network of friendship relations among members of an online language learning community available from the repository at http://konect.unikoblenz.de ^{60,61}. Founded in 2007, Livemocha was an online community of peertopeer language learning that provided language instruction to approximately 12 million users from more than 190 countries. Livemocha closed in 2016 after being included by the Times magazine in its list of 50 best websites in 2010. The Livemocha dataset that we analyze consists of an undirected network with 104,103 nodes and 2,193,083 ties. We estimate AS, AT and edge parameters using the EE algorithm and IFD sampler. The convergence criteria τ_{A} < 0.1 is satisfied, and the values of estimated parameters θ_{A}(t) are presented in Fig. 3 as a function of the algorithm step t. One can see that θ_{A}(t) converge. The θ_{A}(t = 0) values are obtained using the CD1 algorithm, and the smaller K_{A} values of the EE algorithm are used for t > 2.5 × 10^{6} (the computational details are given in the Supplementary Information). The values of the estimated parameters are significantly different from zero. The θ_{A}(t) values are averaged over the last 5 × 10^{5} steps, and the following estimates of the model parameters are obtained: θ_{L} = −19.321, θ_{AS} = 2.7355, and θ_{AT} = 0.6453. A different convergence test was suggested by Snijders for the MoM algorithm: estimated parameter values may be used to estimate E_{π(θ)}(z_{A}(x)) in (2) via the MetropolisHastings algorithm, and the tstatistics^{18} may be computed. We checked that this convergence test is also satisfied: for all A, the absolute values of the tstatistics are less than 0.3. However, the convergence test, as suggested by Snijders, is computationally expensive. When the EE algorithm is applied, it is more convenient to check the values of τ_{A} and to visualize θ_{A}(t) dependencies.
Discussion
We propose a Monte Carlo Markov chain based approach to the maximum likelihood estimation of parameters of statistical models with intractable normalizing constants belonging to the linear exponential family. The algorithm we propose is similar to the MetropolisHastings algorithm, but allows Monte Carlo simulation to be performed while constraining the values of model statistics so that the equilibrium expectation condition (11) is satisfied. In contrast to the widely adopted MetropolisHastings algorithm, which is guaranteed to converge within the limit of infinite simulation time, the EE algorithm provides no such guarantee. Further work is required to establish the general convergence conditions of the algorithm. This work is currently in progress.
We demonstrate the merits of the algorithm by estimating the parameters of ERGMs on large network data sets. ERGMs are popular statistical models for the analysis of singleobservation network data. ERGMs may be used to determine motif significance, and are widely used to connect motifs to structural features in social and other kinds of networks. We compute the bias of the estimates produced by the EE algorithm and show that its value is close to the bias of a commonly adopted computational method for MLE. However, EE is faster, and the estimation time increases with the number of network nodes as N^{1.5}. These results suggest that the EE algorithm may be relied upon to support statistical inference on very large complex systems with networklike local dependencies. We then study several biological networks and a large social network. We show that the triangle motif is significantly overrepresented in all the networks studied. We show that accurate maximum likelihood estimates of ERGM parameters may be obtained for a large empirical network with 104,103 nodes and 2, 193,083 ties. The smallest network for which we applied the EE algorithm is a network with 418 nodes (see Supplementary Information).
The algorithm is inappropriate for the curved exponential family^{30} when the number of parameters to be estimated differs from the number of statistics. Currently, we are applying the EE algorithm to study directed networks for which different network statistics need to be computed^{3}. Future research directions involve an attempt to couple the EE algorithm with the expectationmaximization algorithm^{62} for incomplete data, and an extension to bipartite networks. In addition, we believe the computational approach we have proposed in this paper may be applicable to a wider range of statistical models for data characterized by complex networklike dependencies.
Data availability
The datasets analyzed in the current study are available in the following repositories: Livemocha http://konect.unikoblenz.de/networks/livemocha, http://socialcomputing.asu.edu/datasets/Livemocha; A. thaliana PPI: http://interactome.dfci.harvard.edu/A_thaliana; Yeast PPI: igraph^{63} Nexus repository (as of March 2017, the igraph Nexus repository is no longer available; we downloaded this data set on 10 November 2016, and it is available upon request); Human PPI: https://icon.colorado.edu/#!/networks, http://interactome.dfci.harvard.edu/H_sapiens; C. elegans PPI: (downloaded from the web address specified in Huang et al.^{64} on 2 March 2017, which is no longer available, data available on request); E. coli regulatory: http://www.statnet.org/index.shtml; Drosophila optic medulla: https://icon.colorado.edu/#!/networks, http://openconnecto.me/graphservices/download/.
Code availability
The EE algorithm is implemented in the Estimnet program, available from the authors upon request and from the webpage http://www.estimnet.org. The algorithm may be accessed from the webpage https://github.com/Byshkin/EquilibriumExpectation.
References
Borgatti, S. P., Mehra, A., Brass, D. J. & Labianca, G. Network analysis in the social sciences. Science 323, 892–895 (2009).
Butts, C. T. Revisiting the foundations of network analysis. Science 325, 414–416 (2009).
Snijders, T. A. B., Pattison, P. E., Robins, G. L. & Handcock, M. S. New specifications for exponential random graph models. Sociol. Methodol. 36, 99–153 (2006).
Lusher, D., Koskinen, J. & Robins, G. Exponential random graph models for social networks: Theory, methods, and applications (Cambridge University Press, 2013).
Saul, Z. M. & Filkov, V. Exploring biological network structure using exponential random graph models. Bioinformatics 23, 2604–2611 (2007).
BarndorffNielsen, O. Information and exponential families in statistical theory (John Wiley & Sons, 2014).
Geman, S. & Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision, 564–584 (Elsevier, 1987).
Milo, R. et al. Network motifs: simple building blocks of complex networks. Science 298, 824–827 (2002).
ShenOrr, S. S., Milo, R., Mangan, S. & Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68 (2002).
ArtzyRandrup, Y., Fleishman, S. J., BenTal, N. & Stone, L. Comment on “Network motifs: simple building blocks of complex networks” and “Superfamilies of evolved and designed networks”. Science 305, 1107 (2004).
Ciriello, G. & Guerra, C. A review on models and algorithms for motif discovery in protein–protein interaction networks. Brief. Funct. Genomic. Proteomic. 7, 147–156 (2008).
Kovanen, L., Kaski, K., Kertész, J. & Saramäki, J. Temporal motifs reveal homophily, genderspecific patterns, and group talk in call sequences. Proc. Natl. Acad. Sci. USA 110, 18070–18075 (2013).
Frank, O. & Strauss, D. Markov graphs. J. Am. Stat. Assoc. 81, 832–842 (1986).
Hunter, D. R., Krivitsky, P. N. & Schweinberger, M. Computational statistical methods for social network models. J. Comput. Graph. Stat. 21, 856–882 (2012).
Newman, M. E., Watts, D. J. & Strogatz, S. H. Random graph models of social networks. Proc. Natl. Acad. Sci. USA 99, 2566–2572 (2002).
Newman, M. E. & Clauset, A. Structure and inference in annotated networks. Nat. Commun. 7, 11863 (2016).
Pallotti, F., Lomi, A. & Mascia, D. From network ties to network structures: Exponential random graph models of interorganizational relations. Qual. Quant. 47, 1665–1685 (2013).
Snijders, T. A. B. Markov chain Monte Carlo estimation of exponential random graph models. J. Soc. Struct. 3, 1–40 (2002).
Handcock, M. S. Statistical models for social networks: Inference and degeneracy. In Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, 229–240 (National Academies Press, 2003).
Robins, G., Snijders, T. A. B., Wang, P., Handcock, M. & Pattison, P. Recent developments in exponential random graph (p*) models for social networks. Soc. Networks 29, 192–215 (2007).
Snijders, T. A. B., Koskinen, J. & Schweinberger, M. Maximum likelihood estimation for social network dynamics. Ann. Appl. Stat. 4, 567–588 (2010).
Snijders, T. A. B. The statistical evaluation of social network dynamics. Sociol. Methodol. 31, 361–395 (2001).
Hummel, R. M., Hunter, D. R. & Handcock, M. S. Improving simulationbased algorithms for fitting ERGMs. J. Comput. Graph. Stat. 21, 920–939 (2012).
van Duijn, M. A., Gile, K. J. & Handcock, M. S. A framework for the comparison of maximum pseudolikelihood and maximum likelihood estimation of exponential family random graph models. Soc. Networks 31, 52–62 (2009).
Pattison, P. E., Robins, G. L., Snijders, T. A. B. & Wang, P. Conditional estimation of exponential random graph models from snowball sampling designs. J. Math. Psychol. 57, 284–296 (2013).
Stivala, A. D., Koskinen, J. H., Rolls, D. A., Wang, P. & Robins, G. L. Snowball sampling for estimating exponential random graph models for large networks. Soc. Networks 47, 167–188 (2016).
Thiemichen, S. & Kauermann, G. Stable exponential random graph models with nonparametric components for large dense networks. Soc. Networks 49, 67–80 (2017).
Geyer, C. J. & Thompson, E. A. Constrained Monte Carlo maximum likelihood for dependent data. J. Roy. Stat. Soc. B Met. 54, 657–699 (1992).
Geyer, C. J. Markov chain Monte Carlo maximum likelihood. In Keramides, E. M. (ed.) Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156–163 (Interface Foundation of North America, 1991).
Hunter, D. R. & Handcock, M. S. Inference in curved exponential family models for networks. J. Comput. Graph. Stat. 15, 565–583 (2006).
Krivitsky, P. N. Using contrastive divergence to seed Monte Carlo MLE for exponentialfamily random graph models. Comput. Stat. Data Anal. 107, 149–161 (2017).
Okabayashi, S. et al. Long range search for maximum likelihood in exponential families. Electron. J. Stat. 6, 123–147 (2012).
Lehmann, E. L. & Casella, G. Theory of point estimation (Springer Science & Business Media, 2006).
Caimo, A. & Friel, N. Bayesian inference for exponential random graph models. Soc. Networks 33, 41–55 (2011).
Liang, F., Jin, I. H., Song, Q. & Liu, J. S. An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants. J. Am. Stat. Assoc. 111, 377–393 (2016).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).
Zhang, P. & Moore, C. Scalable detection of statistically significant communities and hierarchies, using message passing for modularity. Proc. Natl. Acad. Sci. USA 111, 18144–18149 (2014).
Zhang, P., Krzakala, F., Reichardt, J. & Zdeborová, L. Comparative study for inference of hidden classes in stochastic block models. J. Stat. Mech. Theory Exp. 2012, P12021 (2012).
Roux, N. L., Schmidt, M. & Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. Adv. Neural Inf. Process. Syst. 25, 2663–2671 (2012).
Robbins, H. & Monro, S. A stochastic approximation method. Ann. Math. Stat. 40, 400–407 (1951).
Polyak, B. T. & Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Contr. Optim. 30, 838–855 (1992).
Christian, P. R. & Casella, G. Monte Carlo statistical methods (Springer, 1999).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953).
Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970).
Cowles, M. K. & Carlin, B. P. Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Stat. Assoc. 91, 883–904 (1996).
Miranda, M. J. & Fackler, P. L. Applied computational economics and finance (MIT press, 2004).
Newman, M. & Barkema, G. Monte Carlo Methods in Statistical Physics chapter 1–4 (Oxford University Press: New York, USA, 1999).
Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Comput. 14, 1771–1800 (2002).
Asuncion, A., Liu, Q., Ihler, A. & Smyth, P. Learning with blocks: Composite likelihood and contrastive divergence. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 33–40 (2010).
Hummel, R. M. Improving estimation for exponentialfamily random graph models. Ph.D. thesis, The Pennsylvania State University, https://etda.libraries.psu.edu/catalog/11493 (2010).
Fellows, I. E. Why (and when and how) contrastive divergence works. arXiv preprint arXiv:1405.0602 (2014).
Hyvärinen, A. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Comput. 18, 2283–2292 (2006).
CarreiraPerpiñan, M. A. & Hinton, G. E. On contrastive divergence learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 33–40 (2005).
Ortega, J. M. & Rheinboldt, W. C. Iterative solution of nonlinear equations in several variables (SIAM, 1970).
Byshkin, M. et al. Auxiliary parameter MCMC for exponential random graph models. J. Stat. Phys. 165, 740–754 (2016).
Wang, P., Robins, G. & Pattison, P. PNet: program for the estimation and simulation of p* exponential random graph models. Department of Psychology, The University of Melbourne (2009).
Swarbreck, D. et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36, D1009–D1014 (2008).
Arabidopsis Interactome Mapping Consortium. Evidence for network evolution in an Arabidopsis interactome map. Science 333, 601–607 (2011).
Mazzucotelli, E. et al. The E3 ubiquitin ligase gene family in plants: regulation by degradation. Curr. Genom. 7, 509–522 (2006).
Liaw, M.L. Review of livemocha. Lang. Learn. Technol. 15, 36–40, http://www.lltjournal.org/item/2722 (2011).
Zafarani, R. & Liu, H. Social computing data repository at ASU, http://socialcomputing.asu.edu (2009).
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met. 1–38 (1977).
Csárdi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems, 1695, http://igraph.org (2006).
Huang, X.T., Zhu, Y., Chan, L. L. H., Zhao, Z. & Yan, H. An integrative C. elegans protein–protein interaction network with reliability assessment based on a probabilistic graphical model. Mol. BioSyst. 12, 85–92 (2016).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis, http://ggplot2.org (SpringerVerlag New York, 2009).
Acknowledgements
The authors are grateful to Dr. Peng Wang for making the source code for the PNet program available. Mark Girolami, Illia Horenko, Ritabrata Dutta, Pavel Krivitsky and Akihiko Nishimura provided helpful discussions on earlier versions of the manuscript. We are grateful to Federica Bianchi for her careful and valuable assistance and advice on earlier drafts. We thank Melbourne Bioinformatics at the University of Melbourne, grant number VR0261, and the Swiss National Supercomputing Centre for high performance computing facilities. M.B. thanks Svitlana Byshkina for her patience. This work was supported by Swiss National Science Foundation grant number 167362, National Research Program 75 (“Big Data”) awarded to A. Lomi, and by Swiss National Science Foundation grant number 105218_163196 (“Statistical Inference on LargeScale Mechanistic Network Models”) awarded to A. Mira.
Author information
Authors and Affiliations
Contributions
M.B., A.S., A.M., G.R. and A.L. conceived the research, designed the research, conducted the analyses and wrote the manuscript. M.B. conceived the original EE algorithm.
Corresponding author
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Byshkin, M., Stivala, A., Mira, A. et al. Fast Maximum Likelihood Estimation via Equilibrium Expectation for Large Network Data. Sci Rep 8, 11509 (2018). https://doi.org/10.1038/s41598018297258
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598018297258
This article is cited by

Maximum entropy networks for large scale social network node analysis
Applied Network Science (2022)

Testing biological network motif significance with exponential random graph models
Applied Network Science (2021)

Tieformation process within the communities of the Japanese production network: application of an exponential random graph model
Applied Network Science (2019)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.