Fast Maximum Likelihood Estimation via Equilibrium Expectation for Large Network Data

A major line of contemporary research on complex networks is based on the development of statistical models that specify the local motifs associated with macro-structural properties observed in actual networks. This statistical approach becomes increasingly problematic as network size increases. In the context of current research on efficient estimation of models for large network data sets, we propose a fast algorithm for maximum likelihood estimation (MLE) that affords a significant increase in the size of networks amenable to direct empirical analysis. The algorithm we propose in this paper relies on properties of Markov chains at equilibrium, and for this reason it is called equilibrium expectation (EE). We demonstrate the performance of the EE algorithm in the context of exponential random graph models (ERGMs) a family of statistical models commonly used in empirical research based on network data observed at a single period in time. Thus far, the lack of efficient computational strategies has limited the empirical scope of ERGMs to relatively small networks with a few thousand nodes. The approach we propose allows a dramatic increase in the size of networks that may be analyzed using ERGMs. This is illustrated in an analysis of several biological networks and one social network with 104,103 nodes.


Fast Maximum Likelihood Estimation via Equilibrium Expectation for Large Network Data
A major line of contemporary research on complex networks is based on the development of statistical models that specify the local motifs associated with macro-structural properties observed in actual networks.This statistical approach becomes increasingly problematic as network size increases.In the context of current research on efficient estimation of models for large network data sets, we propose a fast algorithm for maximum likelihood estimation (MLE) that affords a significant increase in the size of networks amenable to direct empirical analysis.The algorithm we propose in this paper relies on properties of Markov chains at equilibrium, and for this reason it is called equilibrium expectation (EE).We demonstrate the performance of the EE algorithm in the context of exponential random graph models (ERGMs) a family of statistical models commonly used in empirical research based on network data observed at a single period in time.Thus far, the lack of efficient computational strategies has limited the empirical scope of ERGMs to relatively small networks with a few thousand nodes.The approach we propose allows a dramatic increase in the size of networks that may be analyzed using ERGMs.This is illustrated in an analysis of several biological networks and one social network with 104,103 nodes.
Developing efficient approaches to data analysis and statistical inference is becoming increasingly important due to the widespread availability of large data sets in many fields of science.This is particularly the case for relational data typically taking the form of square arrays recording the presence of one or more relations among units of interest represented as network nodes 1,2 .Network representation of data is a fundamental tool for understanding and modeling a wide range of complex systems.Among the available models for relational data, exponential random graph models (ERGMs) are generally considered as "The most promising class of statistical models for expressing structural properties of social networks observed at one moment in time" 3 .ERGMs have found broad application in the analysis of social networks 4 , as well as biological networks 5 .
Formally, ERGMs may be written in the following form: which expresses the probability of observing a network with a fixed number nodes in a given state x.Here z A (x) are network statistics from the space of states x, which are counts of theoretically meaningful or empirically relevant network subgraphs (often called "configurations" in ERGM terminology).The summation is over all configurations A, θ A denotes the model parameter associated to z A (x), θ is the vector of all these parameters and k(θ) the normalizing constant ensuring that the probability distribution sums to one.The reader may also recognize equation (1) as an exponential family in canonical form (see Ch. 8 of Barndorff-Nielsen 6 ), a Gibbs distribution, or a Markov random field 7 .
Different structural features are present in different networks.Yet, empirical network data are also characterized by recurrent structural regularities whose identification is crucial for understanding the behavior of complex network systems.For instance, much recent research has concerned "motifs, " small subgraphs occurring more frequently than might be expected by chance (and hence often similar to ERGMs configurations) 8,9 .Motifs have been considered as the building blocks of complex networks.To determine if a given motif is over-represented, the frequency of the motif in an observed network is typically compared to the average frequency in an appropriate random network (null model) [10][11][12] .ERGMs permit inference on the under or over-representation of specific configurations conditional on the presence of other configurations, or structural features.In ERGMs, network statistics may be defined for particular subgraphs that may be of general or contingent interest in the particular network under study like, for example, reciprocity or triadic closure 3,4,[13][14][15] .Network nodes may have attributes 16 , the expression for z A (x) may be rather complicated, and the number of parameters may be large.Common examples of network statistics adopted in empirical research may be found elsewhere 3,4,17 .
The model parameters fit the observed network x obs if the following method of moments (MoM) condition 18 is satisfied for all A: denotes the expected network statistics with respect to the probability distribution (1) with parameters θ to be estimated; z A (x obs ) denotes network statistics in the observed network.If the estimated θ A is significantly larger than zero, then the corresponding network statistic z A occurs more frequently than might be expected by chance given all the other parameters of the model, and the corresponding configuration is over-represented.A well-known property of the exponential family (e.g., Ch. 8 of Barndorff-Nielsen 6 ) is that E π(θ) (z A (x)) is a monotonically increasing function of θ A .Thus, the estimated θ A measures the corresponding structural features in observed networks.
The problem of parameter estimation (2) coincides, in our context, with the problem of maximum likelihood estimation (MLE).The computational challenge involved in using ERGMs is the intractable normalizing constant k(θ), that makes MLE computable only by Monte Carlo techniques 14 .Throughout the paper, we will be assuming that a MLE exists and that the model is non-degenerate 14,19 .It is known that this latter condition does not hold for all ERGM models (e.g., the edge-triangle model and other simple Markov random graph models 13,19,20 with phase transitions, where non-convergent estimations are common), but more robust specifications, specifically the "alternating" statistics used in this paper (see the beginning of Section Results for the exact ERGM specifications adopted), have been shown to be much better behaved in terms of stability of statistics across a wide range of parameter values 3 .
Existing computational methods for MLE/MoM of ERGM parameters via equation ( 2) do not scale up easily to large data.Even though, to date, computational costs have constrained the scope of MLE, it remains widely adopted in numerous research settings, including the analysis of temporal networks 21,22 .Pseudo-likelihood and quasi-likelihood methods have been used when MLE cannot be computed, but it has been shown that these methods do not produce reliable results 20,23,24 .Estimation via conditional independence sampling methods, particularly snowball sampling, has been recently introduced to alleviate some of these issues of scale [25][26][27] .
Estimates of the model parameters θ may be achieved using a number of computational approaches such as, for example, Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE) 23,[28][29][30][31] , variants of stochastic approximation for the method of moments 18,22,32 , and Bayesian estimation [33][34][35] .Of these different approaches, Bayesian estimation of ERGMs is the most computationally intensive due to the double intractability of the posterior distribution 34 .In other types of models, however, new methods based on variational Bayesian inference 36 can be relatively efficient, leveraging the sparse structure of large networks, e.g., graph partitioning (via blockmodeling analysis and/or analysis of modularity) 37,38 .These methods, however, have yet to be adopted for the estimation of ERGMs.In this paper, we derive an efficient Monte Carlo approach for MLE.MCMCMLE methods are somewhat faster than the MoM because they use importance sampling but are crucially dependent on the choice of the importance distribution.Stochastic approximation for the MoM is often more robust.For the network models we describe here, the existing MCMCMLE and MoM methods are similar in practical terms with respect to the network size that can be estimated.A considerable variety of stochastic approximation methods have been developed for different problems.In particular, efficient stochastic gradient methods 39 are often adopted for maximum likelihood estimation when many independent observations are available.The MoM 18 adopts the Robbins-Monro algorithm 40 and Polyak-Ruppert averaging 41 and is often used to solve (2) when the left side of (2) can be computed only by MCMC simulation 42 .In this paper, we compare the performance of our new approach with that of the MoM 18 .
MCMC simulation and, in particular, the Metropolis-Hastings algorithm 43,44 may be used to generate a network x for fixed values of the parameters θ, which we denote by x(θ).Equation (2) formulates the inverse problem, i.e., to find the value of θ that, given x obs , satisfy (2), which we denote by θ(x obs ).Computing θ(x obs ) is much more computationally expensive than computing x(θ).The largest networks for which these methods have been applied to find the MLE of ERGM parameters contain at most a few thousand nodes.In contrast to existing approaches for MLE, what we propose does not require the simulation of a large number of Markov chains until convergence.Rather, it relies on properties of the Markov chain at equilibrium.For this reason, we call the proposed approach equilibrium expectation (EE).
The remainder of this paper is organized as follows: in the next section we give a brief description of Markov chain Monte Carlo, and then propose a new method, Equilibrium Expectation, for MLE of ERGM parameters.Then in the Results section, we demonstrate the performance of the EE algorithm on some simulated and empirical networks, of sizes well beyond what is currently possible in practice.

Markov Chain Monte Carlo
If the value of the normalizing constant k(θ) in equation ( 1) is not known, the values of probability distribution (1) cannot be computed.Markov chain Monte Carlo simulation is typically used to address this issue.MCMC simulation allows approximation of the target probability (1) and computation of expected properties of the model.The Metropolis-Hastings algorithm uses a Markov process that asymptotically reaches a unique stationary distribution π(θ).Given that the system is in state x the new state x′ is proposed with probability q(x → x′).A Markov process is uniquely defined by its transition probabilities, P(x → x′, θ) = q(x → x′)α(x → x′, θ), i.e., the probability of transitioning from any given state x to any state x′.Detailed balance with respect to a given distribution π(x, θ), which in turn implies that π(x, θ) is a stationary distribution for the Markov chain, is satisfied if the acceptance probability of the new state is given by Formally, the Metropolis-Hastings algorithm may be written as Algorithm 1.
If t is larger than the burn-in time, t B , of the Markov process, then x t values are drawn from π(x, θ), where t B is the time taken by the Markov chain to forget its initial state and reach the stationary regime.The algorithm generates z A (t) = z A (x t ) sequences.From the Markov chain ergodic theorem it follows that, under regularity conditions, if the number of steps T is large, then the expected values of the model statistics E π(θ) (z A (x)) can be estimated by the average of z A (t) along the path of the Markov chain, z t t ( ) A B > 44 .The Metropolis-Hastings algorithm provides a general framework that results in a large number of different MCMC samplers that differ in the proposal q(x → x′).Verification of MCMC convergence is not a simple matter, and different methods have been suggested 45 .One heuristic rule for evaluating convergence of the Markov process x t , in our ERGM context, is to monitor the z A (x t ) sequences.
Consider the expected change If equilibrium stationary distribution is reached, then statistics z A (x t ) converge and fluctuate around their expected values E π(θ) (z A ), which do not depend on t, and hence We can write E π(θ) (Δz A ) as a function of the transition probabilities.Given x t = x, the expected change along the path of the Markov chain in z A (x t+1 ) − z A (x t ) is obtained as and the expected value of From (4), it follows that after the MCMC burn-in, when equilibrium stationary distribution is reached, we have that for all the statistics z A (x),

Estimation strategy: Equilibrium Expectation
In the context of ERGMs, MLE of parameters is obtained from equation (2).Existing computational methods for MLE, such as MCMCMLE, the MoM, or Bayesian estimation, use iterative algorithms that successively modify θ until (2) is satisfied within given criteria for MCMC convergence.To this end, MCMC simulation is performed by drawing a large number of simulated networks for various values of θ, which we denote by x S (θ).The simulated network x S is a network drawn from probability distribution π(x, θ).

Algorithm 1. Metropolis-Hastings algorithm.
Each time a simulated network x S (θ) is drawn, the convergence criterion (7) should be satisfied, but this makes the standard estimation algorithms very computationally expensive.We suggest a much less computationally expensive approach for MLE.
We use equation (7) and rewrite it as: Equation ( 8) reads as follows: if network x is drawn from probability distribution π(x, θ), then the expected value of Δz A (x, θ) is zero.Equation ( 8) is valid only at equilibrium, that is, when π(x, θ) is the limiting equilibrium distribution of the Markov chain that has transition probabilities P(x → x′, θ).Δz A (x, θ) may be computed via Monte Carlo integration 42,46,47 , as suggested in Section Estimation algorithm.The expectation with respect to π(x, θ) could be computed if a large Monte Carlo sample of networks independent and identically distributed (i.i.d.) from π(x, θ) were available i.e.,  x x x , , , . The existence of this large i.i.d.sample of networks is assumed here for expository purposes in deriving an estimation strategy, and later we will remove this assumption to develop an estimation method which can be applied to observed empirical networks.Making this assumption, the LHS of ( 8) may be computed by Monte Carlo integration as ( ) , and ( 8) may be approximated by ( ) If we have a large sample of networks i.i.d.from π(x, θ), then we can efficiently compute the LHS of (9) and solve system of equations ( 9) with respect to θ.Thus, we can estimate θ.When MCMC simulation is performed, the number of steps should be larger than the burn-in time, which may be large.In contrast, Δz A (x, θ) in ( 9) may be computed by Monte Carlo integration, there is no burn-in, and the number of steps may be small.Equation ( 2) may be written as f A (x obs , θ) = 0, where ; hence, the true parameter values, ⁎ θ , may be estimated from Typically, to estimate θ ⁎ , the estimation of each observed network is performed, and the resulting estimates x ( ) are averaged over the observations as follows: θ is the desired estimate of the true ⁎ θ .Thus, if the network x S is very large, then the true ⁎ θ may be estimated from f A (x obs , θ) = 0, and the summation in equation ( 10) may be dropped.In statistics and statistical physics, this property is called the ergodicity of systems: the time averaging is equivalent to the ensemble averaging.Here, time averaging is the average over networks generated by the Markov process, and ensemble averaging is the average over the space of all the system's states, which grows with the network size.If the system is ergodic, then this property may also be applied to equation (9) i.e., if the network x s is very large, then the summation in equation ( 9) may be dropped.Thus, the true ⁎ θ value may be found from the following equilibrium expectation condition: for all A,

A S EE
where Δz A (x S , θ) is given by (5).We can find the value of θ EE that, given x S , satisfies (11); this value is denoted as θ EE (x S ).If the network x S is very large, then θ EE (x S ) is the desired estimate of θ ⁎ .Otherwise, a large sample of networks is required as described above, and the true ⁎ θ value may be found from (9).The estimation method described above may be theoretically interesting, but is of limited value for the estimation of real-world data, for which we have only a single observed network x obs drawn from some unknown probability distribution, rather than a large i.i.d.Monte Carlo sample of networks drawn from a known probability distribution, as we assumed earlier.
It is not, however, lacking in usefulness when applied to such an observed network.In fact, it is actually Contrastive Divergence 48 (CD) as applied to ERGM parameter estimation 31,[49][50][51] .The CD algorithm, rather than running an MCMC simulation to convergence, instead starts from the observed data and makes only some number k of MCMC updates (the CD algorithm with k updates is called CD-k).CD-1 has been shown to be equivalent to maximum pseudo-likelihood under certain conditions 52 and it has also been shown that CD-k forms a series of increasingly close approximations to the MLE as k → ∞ 49 .
To see that the estimation strategy we have just described is equivalent (but independently developed) to CD-1, it may be useful to consider the implementation described by Krivitsky 31 of using CD to find the initial estimates to seed MCMCMLE.In this implementation, an ERGM MCMCMLE implementation is simply converted to CD-k by modifying the MCMC sampler to start from x obs and reverting the chain to x obs every k steps.And so if k = 1 then the network is not actually modified and the CD-1 estimate of θ is just the solution of (10).Details of the CD-1 algorithm as applied in the work described here are given in the Supplementary Information.
As previously noted, maximum pseudo-likelihood (and hence CD-1) does not necessarily produce reliable results for ERGM estimation.However they (and CD-k for k > 1) are useful for finding initial θ values to seed methods such as MCMCMLE that are sensitive to initialization conditions 31,53 .In this paper we indeed use CD-1 to find initial parameter estimates.
We will now describe how this strategy can be modified, for more realistic applications, having an observed network drawn from some unknown probability distribution.Let π ⁎ (x) be the unknown probability distribution from which the real-world data x obs are drawn.π(x, θ) is then the probability distribution corresponding to the model specification we have chosen (in this paper the ERGM model for some specified choices of network statistics).We assume this model to be appropriate for the observed data -we do not discuss model selection here -and, as discussed in the Introduction, we also assume the MLE exists and the model is not degenerate.Formally, we can write that for any θ, ⁎ π (x) ≠ π(x, θ).Equation ( 7) can be applied only if the equilibrium stationary distribution of the Markov chain follows a statistical model π(x, θ).If π ⁎ (x) ≠ π(x, θ), then the conditions of Equation ( 7) are not satisfied, and it cannot be applied.In other words, if observed networks are not drawn from π(x, θ), then the LHS of equation ( 7) is not equal to the LHS of equation (9).In this case, we cannot compute the LHS of ( 7), and hence we cannot find a value of θ such that (7) is satisfied.However, if a solution to (2) exists, then it is possible to find x S drawn from π(x, θ) such that the following condition is satisfied: A S A obs For network x S , drawn from probability distribution π(x, θ), (11) may be used as described above.Using ( 10) and ( 12), the solution of (2) may be obtained for an observed network drawn from an unknown probability distribution.The network drawn from π(x, θ) may be obtained by an MCMC simulation.In the estimation strategy just described, the parameters θ are iteratively adjusted until (11) is satisfied.To estimate parameters for networks drawn from an unknown distribution ⁎ π (x), both θ and x are modified: θ is modified in the same way as before, and x is also modified (equilibrated) so that, after some burn-in, x will be drawn from the probability distribution π(x, θ).
Indeed, starting from the previously described method to find the solution of ( 11), we can derive the EE algorithm (described in the following section) for the solution of ( 11), (12).The EE algorithm for the estimation of an observed network x obs performs Metropolis-Hastings moves, that is, it actually makes changes to a network for accepted proposals, a step which was not necessary for solving (11).The EE algorithm requires MCMC simulation, but -in contrast to MCMCMLE, the MoM or Bayesian estimation -the EE algorithm does not draw many simulated networks for various θ values and, therefore, is considerably faster.

Estimation algorithm
The EE algorithm is described in Algorithm 2. To provide some intuition, the description is maintained at a rather general level here.A more detailed and formal description of the EE algorithm is included in the Supplementary Information.Let m = 1000 and M indicate the number of steps of the EE algorithm.K A is a positive constant, the values of which could be for example K A ≈ 10 −4 ⋅ (∂Δz A (x obs , θ)/∂θ A ) −2 ; a better choice is suggested in the Supplementary Information.The final parameter estimate from the CD-1 algorithm is used as the starting point for the EE algorithm, but the efficiency of the EE algorithm is more sensitive to K A values than to the initial parameter estimates θ 0 .
Steps 4-5 of the EE algorithm are identical to steps 3-4 of the Metropolis-Hastings algorithm presented in Section Chain Monte Carlo: the move x → x′ is proposed, and the acceptance probability (3) is calculated.If the move to x′ is accepted, then the change in the sufficient statistics z A (x′) − z A (x) is computed.
Step 13 modifies θ iteratively until (11) is satisfied.This parameter update depends on the property of the exponential family that E π(θ) (z A (x)) is a monotonically increasing function of θ A , which was mentioned in the Introduction.Because of this, for each configuration A we want to increase θ A when dz A is negative and decrease it when dz A is positive, in order to find θ A such that dz A is zero.This is achieved by adding K d z dz sign( ) Different root-finding algorithms may be applied to solve the system of equation ( 11) when Δz A (x, θ) is available.For example, bisection or quasi-Newton methods 46,54  It is trivial to find roots of continuous monotonic functions, and step 13 implements one possibility.A more robust approach to solving Δz A (x, θ) = 0 would be to use stochastic approximation methods, which are typically applied to solve (2).
Since x is not constant, z A (x) is also not constant.Recently, an auxiliary parameter MCMC method was proposed to perform MCMC simulation, constraining the value of one of the network statistics.Only the network statistic z L (x), which is the edge count, was constrained to the observed value, and this was achieved by using a special proposal q(x → x′): the IFD sampler 55 .The corresponding parameter θ L was adapted 55 in the same way as done in step 13 of the EE algorithm.In the present paper, all the network statistics z A (x) are constrained and, crucially, without the need to develop a special proposal distribution.Indeed, they may be constrained by exploiting (13), i.e., the monotonic dependence of Δz A (x, θ) and E π(θ) (z A (x)) on θ A , and by properly modifying the θ A values.We need an algorithm that converges to the values of θ and x that satisfy (11) and ( 12) for all the network statistics z A (x).This is achieved by accumulating the accepted change statistics (line 7).As a result, dz A = dz A (t) + Δz A (x, θ) and is satisfied.In addition, if dz A = 0, then z A (x) = z A (x obs ) and ( 12) is satisfied.It is shown in Section Estimation strategy: Equilibrium Expectation that the solution of ( 11) and (12) gives the solution of (2).
The EE algorithm produces θ A (t) sequences.The number of steps M should be large enough for θ A (t) to converge.After θ A (t) reaches convergence, its values fluctuate around some constant values θ A , which we estimate as t t ( ) > .In addition to θ A (t), the sequences dz A (t) = z A (t) − z A (x obs ) are generated by the EE algorithm.It is convenient to plot both z A − z A (x obs ) and θ A (t) sequences to visually check for convergence of the algorithm.We use the t-ratio test to check if ( 12) is satisfied: where t B is the burn-in time, starting from which convergence is observed from the plots of θ A (t) and z A (t) − z A (x obs ), and z t t ( ) > denotes the network statistic averaged over t > t B , while SD(⋅) is the standard deviation operator.We propose an appropriate indication of good convergence so that the suggested algorithm gives a converged solution to (2) if, for all A, (i) |τ A | < 0.1 and (ii) θ A converges.
Note that if K A = 0, then the EE algorithm does not differ from the Metropolis-Hastings algorithm.MCMC estimation algorithms make use of MCMC simulation; hence, the efficiency of MCMC estimation algorithms depends on the efficiency of MCMC samplers.In turn, the efficiency of the Metropolis-Hastings algorithm depends on the proposal q(x → x′).The same holds also for the EE algorithm: the algorithm may be used with different proposals q(x → x′), and its efficiency depends on a good choice of the proposal distribution.Proposals for ERGMs are reviewed elsewhere 55 .Both the updating step of the MoM 18 and that of the EE algorithm depend on z A (x) − z A (x obs ).However, in the case of the MoM, x is the equilibrium network configuration, which can be obtained by, e.g., the Metropolis-Hastings algorithm if the MCMC simulation time is larger than the burn-in time of the Markov process.In the EE algorithm, x is a current non-equilibrium state.The MoM and MCMCMLE 28 require many converged outputs of the Metropolis-Hastings algorithm, while the EE algorithm does not need such outputs.Instead, the EE algorithm generates one converged output.

Results
First, we test the EE algorithm by computing the bias and the standard deviation of the estimates that this algorithm produces.Simulated networks of various sizes N (here, N is the number of nodes in a network) are generated by the Metropolis-Hastings algorithm (as implemented in the PNet 56 program).The ERGM model is defined by the AS, AT, A2P, ρ and ρ B network statistics.The network statistics we use in this paper were suggested by Snijders et al. 3,20 and are detailed in the Supplementary Information.The true values of the corresponding model parameters θ A ⁎ are represented by the horizontal lines in Fig. 1, with the estimated values of the parameters shown as boxplots, for both the EE algorithm and a variant of the stochastic approximation algorithm (MoM) 18 for MLE, widely used for ERGM estimation.It is clear that both EE and the MoM give accurate estimates of the true ⁎ θ A values.Both the variance and the bias A A ⁎ θ θ − of the estimates obtained using different methods have similar values.In Fig. 2, we report the computational times needed to obtain the corresponding estimates.As described in the Methods section, steps 4-6 of the EE algorithm are equivalent to steps 3-5 of the Metropolis-Hastings algorithm, used in the MoM.In our implementation of the MoM and EE algorithms, these steps are carried out in exactly the same way.This allows the evaluation of the efficiency of the MoM and EE algorithms by comparing the computational times for the corresponding estimations.Both the MoM and EE algorithms are used with the Basic sampler 55 , and estimation is performed on a Cray XC50 machine available at the Swiss National Supercomputing Centre (CSCS).Figure 2 shows that for small networks (N = 1,000), the efficiencies of the EE and MoM algorithms are not very different, while for larger networks (N = 10,000), the EE algorithm is 2 orders of magnitude faster than the MoM.These results suggest that the EE algorithm may be used for the analysis of large datasets.
For a more complete comparison of the MoM and EE, we now compare their respective performance on empirical datasets.Using the MoM, it is possible to estimate simulated networks with 10,000 nodes within several hours, but the estimation of empirical networks takes longer to converge.Despite several attempts to improve the efficiency of the computational methods for MLE, the existing algorithms are still too computationally expensive for empirical networks of this or larger size.Using the more efficient IFD sampler 55 , we can estimate six biological networks via both EE and the MoM.The estimation results are reported in the Supplementary Information.The results reported in Supplementary Tables S2 and S3 show that the MoM and EE produce equivalent estimates, but EE may be two orders of magnitude faster than the MoM, even for relatively small empirical networks with 1,781 and 5,038 nodes.Although we cannot estimate larger networks with the MoM, the results reported in Fig. 2 suggest that for larger networks, EE may be orders of magnitude faster.For an illustrative example of a biological network, we study an Arabidopsis thaliana protein-protein interaction (PPI) network 57,58 , including some protein attributes on the nodes.In this network, nodes represent proteins, and (undirected) edges represent literature-curated binary interactions between proteins.Arabidopsis thaliana proteins are annotated with various properties, as described in the Supporting Online Material of Arabidopsis Interactome Mapping Consortium 58 .We make use of the following protein attributes.We consider one binary "plant-specific" node attribute (genes defined as plant-specific, absent from other eukaryotic lineages 58 ) and define activity ρ and interaction ρ B network statistics on this attribute.We consider 2 categorical attributes "E class" and "kinase-phosphorylated", derived from the Supporting Online Material of Arabidopsis Interactome Mapping Consortium 58 as described in the Supplementary Information, and define mismatch network statistics on these attributes.The estimated parameters of the resulting model are given in Table 1.The significant positive AT parameter indicates that triangle motifs are significantly over-represented.The positive plant-specific interaction parameter shows that plant-specific proteins interact preferentially with each other (with the negative activity effect showing that, relative to other proteins, they are less likely to interact at all).The positive mismatch kinase-phosphorylated effect shows the propensity of kinases to interact with phosphorylated proteins, as we would expect given the action of kinases to phosphorylate proteins.The positive mismatch E class shows the propensity of proteins to interact if they belong to different E classes 59 .
We also demonstrate the performance of the algorithm in a case study involving the analysis of a larger empirical social network.More specifically, we study the Livemocha network of friendship relations among members of an online language learning community available from the repository at http://konect.uni-koblenz.de 60,61.Founded in 2007, Livemocha was an online community of peer-to-peer language learning that provided language instruction to approximately 12 million users from more than 190 countries.Livemocha closed in 2016 after being included by the Times magazine in its list of 50 best websites in 2010.The Livemocha dataset that we analyze consists of an undirected network with 104,103 nodes and 2,193,083 ties.We estimate AS, AT and edge parameters using the EE algorithm and IFD sampler.The convergence criteria |τ A | < 0.1 is satisfied, and the values of estimated parameters θ A (t) are presented in Fig. 3 as a function of the algorithm step t.One can see that θ A (t) converge.The θ A (t = 0) values are obtained using the CD-1 algorithm, and the smaller K A values of the EE algorithm are used for t > 2.5 × 10 6 (the computational details are given in the Supplementary Information).The values of the estimated parameters are significantly different from zero.The θ A (t) values are averaged over the last 5 × 10 5 steps, and the following estimates of the model parameters are obtained: θ L = −19.321,θ AS = 2.7355, and θ AT = 0.6453.A different convergence test was suggested by Snijders for the MoM algorithm: estimated parameter values may be used to estimate E π(θ) (z A (x)) in ( 2) via the Metropolis-Hastings algorithm, and the t-statistics 18 may be computed.We checked that this convergence test is also satisfied: for all A, the absolute values of the t-statistics are less than 0.3.However, the convergence test, as suggested by Snijders, is computationally expensive.When the EE algorithm is applied, it is more convenient to check the values of τ A and to visualize θ A (t) dependencies.

Discussion
We propose a Monte Carlo Markov chain based approach to the maximum likelihood estimation of parameters of statistical models with intractable normalizing constants belonging to the linear exponential family.The algorithm we propose is similar to the Metropolis-Hastings algorithm, but allows Monte Carlo simulation to be performed while constraining the values of model statistics so that the equilibrium expectation condition (11) is satisfied.In contrast to the widely adopted Metropolis-Hastings algorithm, which is guaranteed to converge within the limit of infinite simulation time, the EE algorithm provides no such guarantee.Further work is required to establish the general convergence conditions of the algorithm.This work is currently in progress.
We demonstrate the merits of the algorithm by estimating the parameters of ERGMs on large network data sets.ERGMs are popular statistical models for the analysis of single-observation network data.ERGMs may be used to determine motif significance, and are widely used to connect motifs to structural features in social and other kinds of networks.We compute the bias of the estimates produced by the EE algorithm and show that its value is close to the bias of a commonly adopted computational method for MLE.However, EE is faster, and the estimation time increases with the number of network nodes as N 1.5 .These results suggest that the EE algorithm may be relied upon to support statistical inference on very large complex systems with network-like local dependencies.We then study several biological networks and a large social network.We show that the triangle motif is significantly over-represented in all the networks studied.We show that accurate maximum likelihood estimates of ERGM parameters may be obtained for a large empirical network with 104,103 nodes and 2, 193,083 ties.The smallest network for which we applied the EE algorithm is a network with 418 nodes (see Supplementary Information).The algorithm is inappropriate for the curved exponential family 30 when the number of parameters to be estimated differs from the number of statistics.Currently, we are applying the EE algorithm to study directed networks for which different network statistics need to be computed 3 .Future research directions involve an attempt to couple the EE algorithm with the expectation-maximization algorithm 62 for incomplete data, and an extension to bipartite networks.In addition, we believe the computational approach we have proposed in this paper may be applicable to a wider range of statistical models for data characterized by complex network-like dependencies.

Figure 1 .Figure 2 .
Figure 1.Estimation of simulated networks by suggested EE algorithm and MoM 18 .Horizontal lines show the true θ A⁎ value and the estimates from each of the two methods for each of the four network sizes are shown as boxplots (generated with ggplot265 ).Each boxplot represents estimates for 120 networks, except that for MoM, when N = 5000 only 118 estimations converged and when N = 10,000 only 49 of the 120 estimations converged.

Figure 3 .
Figure 3. Results of estimation of ERGM parameters for Livemocha networks with 104,103 nodes and 2,193,083 ties using the EE algorithm.The starting point is the result of the CD-1 algorithm.Producing these results took 12 hours on one core of the Intel E5-2650 machines available at https://intranet.ics.usi.ch/HPC.
may be applied if Δz A (x, θ) were a continuous functions of θ.Efficient root finding methods use derivative information.The precise derivative information is either not available or very computationally expensive, but it may be shown (see Supplementary Information) that, for any x, SCientifiC REPORTS | (2018) 8:11509 | DOI:10.1038/s41598-018-29725-8A A

Table 1 .
Parameter estimates with 95% confidence interval (see Supplementary Information) for the Arabidopsis thaliana PPI network, estimated using the EE algorithm with the IFD sampler.Estimation of this 2,160 nodes network took only 3 minutes on the Lenovo NeXtScale x86 system at Melbourne Bioinformatics.