Driven by growing interest across the sciences, a large number of empirical studies have been conducted in recent years on the structure of networks, ranging from the Internet and the World Wide Web to biological networks and social networks. The data produced by these studies are often rich and multimodal, yet at the same time they may contain substantial measurement error1,2,3,4,5,6,7. Accurate analysis and understanding of networked systems requires a way of estimating the true structure of networks from such rich but noisy data8,9,10,11,12,13,14,15. Here we describe a technique that allows us to make optimal estimates of network structure from complex data in arbitrary formats, including cases where there may be measurements of many different types, repeated observations, contradictory observations, annotations or metadata, or missing data. We give example applications to two different social networks, one derived from face-to-face interactions and one from self-reported friendships.
Most empirical studies of networks take a ‘naive’ view of structural data, meaning that one assumes that the data are the network. For instance, in a study of a protein–protein interaction network16,17,18, one might compile a list of known protein interactions and represent them as a network of protein nodes joined by interaction edges. But this network represents the pattern of measured interactions, not the pattern of actual interactions. The two could, and probably do, differ substantially, because of both error in the measurements and missing data5,19. As another example, in studies of friendship networks20,21, one commonly assembles a network simply by asking people who their friends are. The resulting network thus represents who people say they are friends with, not who they are actually friends with. The two can differ if, for instance, participants and experimenters apply different standards for what constitutes a friendship, or if participants fail to report some friendships at all1,2,8,22.
At the same time, many studies return data much richer than just a simple measurement of connections. Protein–protein interaction networks, for example, are commonly assembled from the results of many complementary experiments involving a variety of techniques, further enriched by knowledge of protein function, genetics or other features. Friendship networks can likewise be probed in different ways, using surveys, online data, observations of face-to-face interactions and others, possibly enhanced with metadata on participant location, occupation, age and many other characteristics. Taken together, these many types of data may be able to give a more accurate and nuanced picture of network structure than any single one can alone.
The problem of determining network structure from experimental data, which often goes under the heading of network reconstruction, has been studied particularly in the biological sciences (for instance, in the context of gene regulatory networks, metabolic networks and protein networks5,12,23,24). A range of methods have been developed for use with data from high-throughput laboratory techniques such as microarrays, RNA sequencing and tandem affinity purification19,25,26,27,28,29. The issue of errors and unreliability in network data has also been recognized in the social sciences, where there has been extensive discussion of sources of error in social surveys, its effects on measurements and ways of estimating and minimizing it1,2,6,7,8. There is also domain-specific literature on problems such as predicting missing nodes or edges in networks9,10,30,31,32 and name disambiguation in bibliometrics33,34,35,36, typically making use of assumptions about correlations in network structure. Combinations of these methods can be used to create hybrid algorithms for resampling and Monte Carlo estimation of network structure9,10,11,13,15. There is also a significant volume of work on the related problem of estimating network structure from non-network data (see ref. 37 for a review).
Here we present a general formalism for the optimal inference of network structure from rich but noisy data, and show how it can be applied to a range of data types. Generically, the question we want to answer is this: given the results of a set of measurements performed on a system of interest, what is our best estimate of the structure of the underlying network? The data could take many forms. They could be rich, hierarchical, multilevel and multimodal, but they may also be unreliable and error prone. Some of the data may have no bearing at all on the network structure. Others may be related only obliquely to it. Furthermore, we may not know in advance which data are relevant and which are not, or how accurate any of the measurements are. Remarkably, under these seemingly daunting circumstances, we can nonetheless make progress.
Suppose that we are interested in the structure of a certain n-node network and for the moment let us concentrate on the commonest case of an unweighted undirected network. (We describe some generalizations to weighted and directed data below and in the Supplementary Information.) Let us denote the true structure of the network, which we do not know, by an $n \times n$ symmetric adjacency matrix $A$, having elements $A_{ij} = 1$ if nodes i and j are connected by an edge and 0 otherwise. This structure, commonly called the ground truth, is the thing we are trying to estimate.
We now make a set of measurements of the system, measurements that can take many forms as discussed above, perhaps including direct measurements of network structure but also potentially including indirect measurements, metadata, or ‘red herrings’ that have nothing to do with the network at all. The network structure and the data are related to one another by a data model, expressed in the form of a probability function P(data|A, θ) that specifies the probability of making the particular set of measurements we did, given the ground-truth network A plus, optionally, some additional model parameters, which we collectively denote by θ. In general, we do not know the form of this probability distribution (in most cases, it will be a complicated function), but the option to include parameters θ allows us to specify a family of functions that encompasses a broad spectrum of possibilities. Our goal will be, given such a family, first to determine the values of the parameters, which effectively chooses a particular member of the family and thereby fixes the relationship between the network structure and the data, and then, given those values, to estimate the network structure itself.
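To do this, we apply Bayes' rule to the joint probability of the network and the parameters given the data:
$$P(A,\theta \mid \text{data}) = \frac{P(\text{data} \mid A,\theta)\, P(A \mid \theta)\, P(\theta)}{P(\text{data})}. \tag{1}$$
Then, summing over all possible network structures $A$, we get $P(\theta \mid \text{data}) = \sum_A P(A,\theta \mid \text{data})$, which we maximize to find the most probable value of the parameters θ given the observed data, the so-called maximum a posteriori estimate. In fact, for convenience, we maximize not $P(\theta \mid \text{data})$ but its logarithm, whose maximum falls in the same place. Employing the well-known Jensen inequality $\log \sum_i x_i \ge \sum_i q_i \log (x_i / q_i)$, which holds for positive $x_i$ and non-negative $q_i$ that sum to one, we can write
$$\log P(\theta \mid \text{data}) = \log \sum_A P(A,\theta \mid \text{data}) \ge \sum_A q(A) \log \frac{P(A,\theta \mid \text{data})}{q(A)}, \tag{2}$$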
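where $q(A)$ is any probability distribution over networks $A$ satisfying $\sum_A q(A) = 1$. It is trivially the case that exact equality between left- and right-hand sides of equation (2) is achieved when
$$q(A) = \frac{P(A,\theta \mid \text{data})}{\sum_{A'} P(A',\theta \mid \text{data})}, \tag{3}$$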
and hence this choice maximizes the right-hand side with respect to q. A further maximization with respect to θ will then give us the optimal parameter values we seek. To put that another way, a double maximization of the right-hand side of equation (2) with respect to both q and θ will give us our answer for θ. This can be carried out by maximizing first with respect to $q(A)$ using equation (3) and then with respect to θ, repeating until the result converges. Differentiating equation (2) while holding $q(A)$ constant, we find the maximum with respect to θ to be the solution of
$$\sum_A q(A)\, \nabla_\theta \log P(A,\theta \mid \text{data}) = 0. \tag{4}$$
Our calculation consists of iterating equations (3) and (4) from random initial values to convergence. The final result is a value for the parameters θ, which we can then use to estimate the ground-truth network. In fact, however, it turns out that this last step is unnecessary: the calculations we have already performed give us the ground-truth network structure as a by-product; indeed, they give us the entire posterior probability distribution over structures, since from equation (3) the quantity q(A) = P(A, θ|data)/P(θ|data) = P(A|data, θ). In other words, it is precisely the probability of the network having true structure A given the observed data and the parameters θ.
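The equality claimed for the choice of q(A) in equation (3) can be verified by direct substitution. The short derivation below is a sketch that assumes the bound in equation (2) has the standard Jensen form, with right-hand side $\sum_A q(A) \log [P(A,\theta \mid \text{data})/q(A)]$, and uses only the normalization $\sum_A q(A) = 1$.

```latex
\begin{align}
\sum_A q(A) \log \frac{P(A,\theta \mid \text{data})}{q(A)}
  &= \sum_A q(A) \log P(\theta \mid \text{data})
     && \text{since } q(A) = \frac{P(A,\theta \mid \text{data})}{P(\theta \mid \text{data})} \\
  &= \log P(\theta \mid \text{data}) \sum_A q(A) \\
  &= \log P(\theta \mid \text{data}).
\end{align}
```

The bound is therefore tight for this q(A), so each maximization over q restores equality before the next maximization over θ.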
The method derived here is an example of an expectation-maximization or EM algorithm38. As described, the method is a general one that can be used with many different networks and data models. Let us see how it is applied in practice.
Our first example application is to a social network of US university students. The data come from a ‘reality mining’ study39, which aimed to establish the real-world social network of a set of individuals by measuring their physical proximity over time. The 96 students participating in the study were given mobile phones that used special software to record when they were in proximity with one another. The resulting record of pairwise proximity measurements is both richer and poorer than a direct network measurement, in exactly the manner considered in this paper. It is richer in the sense that interactions between individuals may be measured repeatedly and not just once, but poorer in the sense that proximity is an error-prone indicator of actual interaction—two individuals may find themselves coincidentally in proximity, as they pass on the street say, without being acquainted or having any social interaction.
We take as our data set the measurements made during the reality mining study for eight consecutive Wednesdays in March and April of 2005. (We choose weekly observations to remove weekly periodic effects, and March and April because they fall during the university term.) This gives us eight sets of observations, one for each day, in which an observed edge means that two individuals were in physical proximity at some time during that day.
The data model we adopt for these data is a particularly simple one, in which the edge measurements (the observations of proximity) are assumed to be independent identically distributed random variables, conditioned on the ground truth $A_{ij}$. That is, the probability of observing an edge between nodes i and j depends only on the matrix element $A_{ij}$ and in the same way for all i, j. This dependence can be parametrized by two quantities: the true-positive rate α, which is the probability of observing an edge where one truly exists, and the false-positive rate β, the probability of observing an edge where none exists. (Note that these are the empirical true- and false-positive rates, that is, the frequency with which the measurements agree or disagree with the ground truth, rather than the true- and false-positive rates for our final inferred networks, which we cannot normally calculate.) In addition, we will assume a uniform prior probability ρ of the existence of an edge in any position, so that our model is parametrized by three parameters α, β and ρ.
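If for each node pair i, j, we make $N$ measurements and observe an edge to be present in $E_{ij}$ of them then, as shown in the Methods, our expectation-maximization equations give the following estimates for the three parameters:
$$\hat{\alpha} = \frac{\sum_{i<j} Q_{ij} E_{ij}}{N \sum_{i<j} Q_{ij}}, \qquad \hat{\beta} = \frac{\sum_{i<j} (1 - Q_{ij}) E_{ij}}{N \sum_{i<j} (1 - Q_{ij})}, \qquad \hat{\rho} = \frac{1}{\binom{n}{2}} \sum_{i<j} Q_{ij}. \tag{5}$$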
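(We use symbols with hats to denote estimated values of variables.) The quantity $Q_{ij}$ appearing here is the posterior probability that there is an edge between nodes i and j for these parameter values, which is given by
$$Q_{ij} = \frac{\rho\, \alpha^{E_{ij}} (1-\alpha)^{N - E_{ij}}}{\rho\, \alpha^{E_{ij}} (1-\alpha)^{N - E_{ij}} + (1-\rho)\, \beta^{E_{ij}} (1-\beta)^{N - E_{ij}}}. \tag{6}$$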
The full calculation involves iterating equations (5) and (6) until convergence is reached, and the results tell us the estimates of the three parameters α, β and ρ, as well as the entire posterior probability distribution over possible ground-truth networks, which is given by $P(A \mid \text{data}, \theta) = \prod_{i<j} Q_{ij}^{A_{ij}} (1 - Q_{ij})^{1 - A_{ij}}$. The posterior distribution allows us to compute estimates of any other network quantities we might be interested in, such as degrees, correlations or clustering coefficients (see Supplementary Section 5) and can also be used as an input to further calculations (for instance, of community structure14).
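The iteration of equations (5) and (6) can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the results reported here: the function names, the fixed starting point (the text uses random initial values), the clamping of parameters away from 0 and 1, and the toy observation counts are all assumptions made for the example.

```python
from itertools import combinations


def clamp(x, eps=1e-9):
    """Keep a probability strictly inside (0, 1) to avoid 0/0 in the updates."""
    return min(max(x, eps), 1.0 - eps)


def edge_posterior(e, n_obs, alpha, beta, rho):
    """Equation (6): posterior probability of an edge seen e times out of n_obs."""
    on = rho * alpha**e * (1 - alpha) ** (n_obs - e)
    off = (1 - rho) * beta**e * (1 - beta) ** (n_obs - e)
    return on / (on + off)


def fit_independent_edges(E, n, n_obs, start=(0.5, 0.1, 0.2), tol=1e-10, max_iter=10_000):
    """Alternate equations (6) and (5) until the parameters stop changing.

    E maps node pairs (i, j) with i < j to observation counts; absent pairs
    count as zero. start is an assumed (alpha, beta, rho) with alpha > beta.
    """
    pairs = list(combinations(range(n), 2))
    counts = [E.get(p, 0) for p in pairs]
    alpha, beta, rho = start
    for _ in range(max_iter):
        # Posterior edge probabilities for the current parameters.
        q = [edge_posterior(e, n_obs, alpha, beta, rho) for e in counts]
        sum_q = sum(q)
        sum_qe = sum(qi * e for qi, e in zip(q, counts))
        # Parameter updates of equation (5) for the current posterior.
        new = (
            clamp(sum_qe / (n_obs * sum_q)),
            clamp((sum(counts) - sum_qe) / (n_obs * (len(pairs) - sum_q))),
            clamp(sum_q / len(pairs)),
        )
        change = max(abs(a - b) for a, b in zip(new, (alpha, beta, rho)))
        alpha, beta, rho = new
        if change < tol:
            break
    # Recompute the posterior once more so it matches the returned parameters.
    Q = {p: edge_posterior(e, n_obs, alpha, beta, rho) for p, e in zip(pairs, counts)}
    return alpha, beta, rho, Q


# Toy data invented for illustration: 6 nodes observed on 8 days.
toy_counts = {(0, 1): 6, (1, 2): 5, (0, 2): 7, (3, 4): 4, (2, 3): 1, (4, 5): 1}
alpha_hat, beta_hat, rho_hat, Q = fit_independent_edges(toy_counts, n=6, n_obs=8)
print(f"alpha={alpha_hat:.3f} beta={beta_hat:.3f} rho={rho_hat:.3f}")
```

Starting with α larger than β fixes the interpretation of an observation as evidence for an edge; the opposite start would describe the same data with the roles of edges and non-edges exchanged.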
Applying equations (5) and (6) to the reality mining data, the algorithm converges rapidly and reliably to estimates $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\rho}$ of the three parameters. The small value of $\hat{\beta}$ tells us that there are very few false positives: an edge is observed where none exists less than 1% of the time. On the other hand, even if the false-positive rate is low, the probability of being wrong when one does observe an edge can still be high. This probability, called the false discovery rate, is given by $(1-\rho)\beta / [\rho\alpha + (1-\rho)\beta]$, which has an estimated value of 0.2270 in the present case, meaning that more than one in every five observed edges is in error. Moreover, the relatively small value of $\hat{\alpha}$ implies that there are also a large number of false negatives: around 58% of pairs of individuals who are, in fact, connected in the underlying network are not observed in proximity on any one day. This is understandable. Most people do not see all of their acquaintances every day.
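The false discovery rate formula is simple enough to evaluate directly. The sketch below uses illustrative parameter values chosen for the example (they are not the fitted estimates from the study) to show how a false-positive rate of 1% can still produce a false discovery rate of roughly 28% when the network is sparse, that is, when ρ is small.

```python
def false_discovery_rate(alpha, beta, rho):
    """Probability that an observed edge is absent from the ground truth."""
    true_positive_mass = rho * alpha          # edge exists and is observed
    false_positive_mass = (1 - rho) * beta    # edge absent but observed anyway
    return false_positive_mass / (true_positive_mass + false_positive_mass)


# Illustrative values only: a sparse network with a 1% false-positive rate.
example_fdr = false_discovery_rate(alpha=0.5, beta=0.01, rho=0.05)
print(f"false discovery rate = {example_fdr:.3f}")
```

Because non-edges greatly outnumber edges in a sparse network, even a small β applied to many non-edges yields a number of false observations comparable to the number of true ones.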
Figure 1a shows the inferred ground-truth network, with edge thicknesses varying to indicate the probability $Q_{ij}$ of individual edges. In Fig. 1b we show the relationship between the number of observations $E_{ij}$ of a particular edge and the posterior probability $Q_{ij}$. As the figure shows, an edge observed zero times or once has a low $Q_{ij}$ (less than 0.1), so a single observation is probably a false alarm. However, two or more observations of the same edge result in a much larger $Q_{ij}$ (greater than 0.9), indicating a strong inference that the edge exists in the ground truth. The sharp transition between low and high values of $Q_{ij}$ means that it is possible to infer the presence or absence of edges with good reliability despite the high error rate in the data.
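The sharp transition in the posterior probability as a function of the number of observations can be reproduced qualitatively by tabulating the posterior edge probability for every possible count. The parameter values below are hypothetical, chosen only to be of the same general magnitude as the discussion above (small β, moderate α, sparse network); they are not the fitted values.

```python
def edge_posterior(e, n_obs, alpha, beta, rho):
    """Posterior probability of an edge observed e times in n_obs measurements."""
    on = rho * alpha**e * (1 - alpha) ** (n_obs - e)
    off = (1 - rho) * beta**e * (1 - beta) ** (n_obs - e)
    return on / (on + off)


# Hypothetical parameter values, for illustration only.
ALPHA, BETA, RHO, N_OBS = 0.4, 0.005, 0.03, 8

curve = [edge_posterior(e, N_OBS, ALPHA, BETA, RHO) for e in range(N_OBS + 1)]
for e, q in enumerate(curve):
    print(f"E = {e}: Q = {q:.4f}")
```

With these values the posterior stays below 0.1 for zero or one observation and rises steeply at two observations, because each extra observation multiplies the odds of an edge by the factor α(1 − β) / [β(1 − α)], which is large when β is small.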
For our second example, we study a more traditional friendship network, taken from the National Longitudinal Study of Adolescent Health (the ‘Add Health’ study)21. This study compiled networks of friendships between students at a number of US high schools by asking participants to name their friends. Again, the data are both richer and poorer than a simple network measurement. They are richer in the sense that we have two measurements of each friendship, from the point of view of each of the two participants, but poorer in the sense that those measurements can (and often do) disagree, indicating that respondents are not reliable in the reports they give or that they are employing different standards for what constitutes a friendship. Following ref. 8, we represent this situation by giving each participant i their own individual true- and false-positive rates $\alpha_i$ and $\beta_i$. Once again, one can derive closed-form expressions for these parameters and for the posterior probabilities $Q_{ij}$ of edges in the ground-truth network (see the Methods). The analysis can be applied to any of the schools in the Add Health study; we use one of the smaller ones as our example, solely because it allows us to make a clear picture of the resulting network.
Again the expectation-maximization algorithm converges quickly and reliably, giving network-average estimates of the true-positive rate $\hat{\alpha}_i$ and the false-positive rate $\hat{\beta}_i$, together with an estimate $\hat{\rho}$ of the prior edge probability. These values indicate that non-existent friendships are rarely falsely reported as existing (low average $\beta_i$), although, once again, arguably the more interesting quantity is the false discovery rate, the probability of a friendship that is reported being false. This probability, which is equal to $(1-\rho)\beta_i / [\rho\alpha_i + (1-\rho)\beta_i]$, is significantly larger, having a network-average estimated value of 0.3309. In other words, about one in three reported friendships does not really exist. There is also a relatively high rate of failure to report friendships that do exist (many of the $\alpha_i$ are significantly less than 1). The latter is perhaps less surprising given the design of the study: students were limited to naming at most ten friends, so those with more than ten would be obliged to omit some.
Figure 1c shows the inferred network of friendships, with edge widths again indicating the probability $Q_{ij}$ that an edge exists, and node sizes now varying to indicate how reliable the nodes are, in terms of the fraction of reported friendships that actually exist (which is equal to one minus the false discovery rate, also called the precision). Reports made by nodes depicted with large diameter are reliable; those made by smaller nodes are not. Armed with these results, one can now calculate a multitude of further quantities, including any function of network structure.
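One possible way to compute further quantities from the edge probabilities is sketched below. It assumes a posterior in which edges are independent with probabilities $Q_{ij}$, as in the two models used here: linear quantities such as the expected degree follow directly from sums of $Q_{ij}$, while other functions of structure can be estimated by averaging over networks sampled from the posterior. The function names and the choice of triangle count as the example statistic are illustrative assumptions, not part of the original analysis.

```python
import random


def expected_degrees(Q, n):
    """Posterior mean degree of each node: the sum of Q_ij over its pairs.

    Q holds one entry per unordered pair, keyed as (i, j).
    """
    degrees = [0.0] * n
    for (i, j), q in Q.items():
        degrees[i] += q
        degrees[j] += q
    return degrees


def sample_network(Q, rng):
    """Draw one network, including each edge independently with probability Q_ij."""
    return {pair for pair, q in Q.items() if rng.random() < q}


def count_triangles(edges, n):
    """Number of triangles in an undirected edge set."""
    neighbours = [set() for _ in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    # Each triangle is found once from each of its three edges.
    return sum(len(neighbours[i] & neighbours[j]) for i, j in edges) // 3


def posterior_mean(Q, statistic, samples=1000, seed=0):
    """Monte Carlo estimate of the posterior mean of statistic(edges)."""
    rng = random.Random(seed)
    return sum(statistic(sample_network(Q, rng)) for _ in range(samples)) / samples


# Hypothetical edge probabilities for a four-node example.
Q_example = {(0, 1): 0.95, (1, 2): 0.9, (0, 2): 0.8, (2, 3): 0.1}
print(expected_degrees(Q_example, 4))
print(posterior_mean(Q_example, lambda edges: count_triangles(edges, 4)))
```

Sampling also gives the spread of a quantity across plausible networks, not just its mean, which is one practical benefit of having the full posterior rather than a single estimated network.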
These are just two examples of possible applications. The particular data models applied here are quite flexible and could be applied to other networks, but there are also many other models one could use. Note, for instance, that the two models above both make the assumption that edges are conditionally independent. This works well for these particular examples but it is not a requirement. The methods described can be applied to models with dependent edges too, which might be appropriate, for instance, for data sets derived from longitudinal (time-dependent) network studies. See the Supplementary Information for further discussion and a number of additional examples of possible models.
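In the reality mining example, edge observations are assumed to be independent (Bernoulli) random variables, conditioned on the ground truth $A_{ij}$ for the appropriate node pair i, j, with true-positive rate α and false-positive rate β. Suppose that for each node pair i, j, we make $N_{ij}$ measurements and observe an edge to be present in $E_{ij}$ of those measurements. Then, under this independent edge model,
$$P(\text{data} \mid A, \alpha, \beta) = \prod_{i<j} \left[ \alpha^{E_{ij}} (1-\alpha)^{N_{ij} - E_{ij}} \right]^{A_{ij}} \left[ \beta^{E_{ij}} (1-\beta)^{N_{ij} - E_{ij}} \right]^{1 - A_{ij}}. \tag{7}$$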
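If the prior probability of an edge in any position is ρ, then the prior probability of the entire network is $P(A \mid \rho) = \prod_{i<j} \rho^{A_{ij}} (1-\rho)^{1 - A_{ij}}$. We also assume that the prior probability distributions on α, β and ρ themselves are all uniform in the interval [0,1]. Combining equations (1) and (7), we then have
$$P(A, \theta \mid \text{data}) = \frac{1}{P(\text{data})} \prod_{i<j} \left[ \rho\, \alpha^{E_{ij}} (1-\alpha)^{N_{ij} - E_{ij}} \right]^{A_{ij}} \left[ (1-\rho)\, \beta^{E_{ij}} (1-\beta)^{N_{ij} - E_{ij}} \right]^{1 - A_{ij}}. \tag{8}$$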
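Taking the log, substituting into equation (4), and differentiating with respect to α, we find that the maximum a posteriori estimate of the true-positive rate satisfies
$$\sum_A q(A) \sum_{i<j} A_{ij} \left[ \frac{E_{ij}}{\alpha} - \frac{N_{ij} - E_{ij}}{1 - \alpha} \right] = 0. \tag{9}$$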
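Defining the posterior probability of an edge between i and j by $Q_{ij} = P(A_{ij} = 1 \mid \text{data}, \theta) = \sum_A q(A) A_{ij}$ and rearranging equation (9), we then get
$$\hat{\alpha} = \frac{\sum_{i<j} Q_{ij} E_{ij}}{\sum_{i<j} Q_{ij} N_{ij}}. \tag{10}$$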
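Similarly, differentiating with respect to β and ρ, we arrive at
$$\hat{\beta} = \frac{\sum_{i<j} (1 - Q_{ij}) E_{ij}}{\sum_{i<j} (1 - Q_{ij}) N_{ij}}, \qquad \hat{\rho} = \frac{1}{\binom{n}{2}} \sum_{i<j} Q_{ij}. \tag{11}$$
To calculate $Q_{ij}$ itself, we substitute equation (8) into equation (3). The sum over networks in the denominator factorizes over node pairs, giving
$$q(A) = \prod_{i<j} Q_{ij}^{A_{ij}} (1 - Q_{ij})^{1 - A_{ij}}, \tag{12}$$
where
$$Q_{ij} = \frac{\rho\, \alpha^{E_{ij}} (1-\alpha)^{N_{ij} - E_{ij}}}{\rho\, \alpha^{E_{ij}} (1-\alpha)^{N_{ij} - E_{ij}} + (1-\rho)\, \beta^{E_{ij}} (1-\beta)^{N_{ij} - E_{ij}}}. \tag{13}$$
Note that if we make no measurements for a pair of nodes i, j, so that $N_{ij} = E_{ij} = 0$ (the case of ‘missing data’), this expression correctly gives $Q_{ij}$ equal to the estimated prior edge probability $\hat{\rho}$.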
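The missing-data limit can be checked in one line. The worked step below assumes the per-pair form of the posterior edge probability, that is, equation (6) with $N$ replaced by $N_{ij}$; with $N_{ij} = E_{ij} = 0$ every power of α and β equals 1.

```latex
\begin{equation}
Q_{ij}\Big|_{N_{ij} = E_{ij} = 0}
  = \frac{\rho\,\alpha^{0}(1-\alpha)^{0}}
         {\rho\,\alpha^{0}(1-\alpha)^{0} + (1-\rho)\,\beta^{0}(1-\beta)^{0}}
  = \frac{\rho}{\rho + (1-\rho)}
  = \rho .
\end{equation}
```

In other words, a pair with no measurements contributes no evidence, and its posterior edge probability falls back to the prior.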
Turning to the Add Health friendship network example, measurements of edges in this data set come from unilateral statements made by participants. Let $E_{ij}$ in this case represent the number of times node i identifies node j as a friend. (Normally this number will be zero or one, but we allow arbitrary values for the sake of generality.) In effect, $E_{ij}$ constitutes a directed network, and self-reported friendship networks are sometimes depicted as being directed. However, we consider the underlying ground-truth network to be undirected. Only our observations of it are directed.
Study participants may vary in the reliability with which they identify their friends. A participant whose identifications agree, generally, with those of their friends, is probably a reliable observer; one whose identifications disagree is probably not. We do not have to impose these assumptions on our calculation, however. They will be automatically reflected in the solution found by the expectation-maximization algorithm.
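In our calculations, we employ a data model in which each node i has its own true-positive rate $\alpha_i$ and false-positive rate $\beta_i$. Then the likelihood of a set of observations given a ground-truth network A is
$$P(\text{data} \mid A, \theta) = \prod_{i<j} \left[ \alpha_i^{E_{ij}} (1-\alpha_i)^{N_{ij} - E_{ij}} \right]^{A_{ij}} \left[ \beta_i^{E_{ij}} (1-\beta_i)^{N_{ij} - E_{ij}} \right]^{1 - A_{ij}} \left[ \alpha_j^{E_{ji}} (1-\alpha_j)^{N_{ji} - E_{ji}} \right]^{A_{ji}} \left[ \beta_j^{E_{ji}} (1-\beta_j)^{N_{ji} - E_{ji}} \right]^{1 - A_{ji}}, \tag{14}$$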
where $N_{ij}$ is the total number of observations of node j made by node i. Note that we explicitly include terms in $E_{ij}$ and $E_{ji}$ separately, since these numbers are distinct. (On the other hand, $A_{ij} = A_{ji}$ since the ground-truth network is assumed undirected. We write $A_{ij}$ and $A_{ji}$ separately in the above expression purely to preserve symmetry.)
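Again assuming a prior probability of ρ on each ground-truth edge and uniform priors on the parameters, applying equation (1), and taking logs, we arrive at the log-likelihood:
$$\log P(A, \theta \mid \text{data}) = \sum_{i<j} \Big\{ A_{ij} \big[ \log\rho + E_{ij}\log\alpha_i + (N_{ij} - E_{ij})\log(1-\alpha_i) + E_{ji}\log\alpha_j + (N_{ji} - E_{ji})\log(1-\alpha_j) \big] + (1 - A_{ij}) \big[ \log(1-\rho) + E_{ij}\log\beta_i + (N_{ij} - E_{ij})\log(1-\beta_i) + E_{ji}\log\beta_j + (N_{ji} - E_{ji})\log(1-\beta_j) \big] \Big\} - \log P(\text{data}). \tag{15}$$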
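Applying equation (4), performing the derivatives and rearranging, we then find the following estimates for the parameters:
$$\hat{\alpha}_i = \frac{\sum_j Q_{ij} E_{ij}}{\sum_j Q_{ij} N_{ij}}, \qquad \hat{\beta}_i = \frac{\sum_j (1 - Q_{ij}) E_{ij}}{\sum_j (1 - Q_{ij}) N_{ij}}, \qquad \hat{\rho} = \frac{1}{\binom{n}{2}} \sum_{i<j} Q_{ij}. \tag{16}$$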
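As before, $Q_{ij}$ is the posterior probability of an edge between i and j, which can be calculated by a method analogous to the one we used for our first model above. Combining equations (1) and (14) and using $A_{ij} = A_{ji}$, we write
$$P(A, \theta \mid \text{data}) = \frac{1}{P(\text{data})} \prod_{i<j} \left[ \rho\, \alpha_i^{E_{ij}} (1-\alpha_i)^{N_{ij} - E_{ij}} \alpha_j^{E_{ji}} (1-\alpha_j)^{N_{ji} - E_{ji}} \right]^{A_{ij}} \left[ (1-\rho)\, \beta_i^{E_{ij}} (1-\beta_i)^{N_{ij} - E_{ij}} \beta_j^{E_{ji}} (1-\beta_j)^{N_{ji} - E_{ji}} \right]^{1 - A_{ij}}. \tag{17}$$
We evaluate this probability at the estimated values of the parameters and the complete posterior distribution over ground-truth networks A is then given by
$$P(A \mid \text{data}, \theta) = \prod_{i<j} Q_{ij}^{A_{ij}} (1 - Q_{ij})^{1 - A_{ij}}, \qquad Q_{ij} = \frac{\rho\, \alpha_i^{E_{ij}} (1-\alpha_i)^{N_{ij} - E_{ij}} \alpha_j^{E_{ji}} (1-\alpha_j)^{N_{ji} - E_{ji}}}{\rho\, \alpha_i^{E_{ij}} (1-\alpha_i)^{N_{ij} - E_{ij}} \alpha_j^{E_{ji}} (1-\alpha_j)^{N_{ji} - E_{ji}} + (1-\rho)\, \beta_i^{E_{ij}} (1-\beta_i)^{N_{ij} - E_{ij}} \beta_j^{E_{ji}} (1-\beta_j)^{N_{ji} - E_{ji}}}. \tag{18}$$
Note that this expression for $Q_{ij}$ is explicitly symmetric with respect to the indices i and j, as it should be, since $Q_{ij} = Q_{ji}$ by definition.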
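The per-node version of the algorithm can be sketched in the same way as the single-parameter one. The code below is a minimal illustration under stated assumptions: every node has the same number of opportunities n_obs to name every other node (so that all $N_{ij}$ are equal), the starting values are fixed rather than random, parameters are clamped away from 0 and 1 to avoid division by zero, and the toy reports are invented. All function and variable names are hypothetical.

```python
from itertools import combinations


def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(x, eps), 1.0 - eps)


def pair_posterior(i, j, E, n_obs, alpha, beta, rho):
    """Posterior edge probability from i's reports about j and j's about i."""
    on, off = rho, 1.0 - rho
    for a, b in ((i, j), (j, i)):
        e = E.get((a, b), 0)
        on *= alpha[a] ** e * (1 - alpha[a]) ** (n_obs - e)
        off *= beta[a] ** e * (1 - beta[a]) ** (n_obs - e)
    return on / (on + off)


def fit_node_reliability(E, n, n_obs=1, max_iter=2000, tol=1e-9):
    """EM with one (alpha_i, beta_i) per node and a shared prior rho.

    E maps ordered pairs (i, j) to the number of times i named j.
    """
    alpha, beta, rho = [0.7] * n, [0.05] * n, 0.1  # assumed starting point
    pairs = list(combinations(range(n), 2))
    Q = {}
    for _ in range(max_iter):
        # Symmetric posterior edge probabilities for the current parameters.
        for i, j in pairs:
            Q[i, j] = Q[j, i] = pair_posterior(i, j, E, n_obs, alpha, beta, rho)
        # Per-node parameter updates for the current posterior.
        new_alpha, new_beta = [], []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            q_sum = sum(Q[i, j] for j in others)
            e_sum = sum(E.get((i, j), 0) for j in others)
            qe_sum = sum(Q[i, j] * E.get((i, j), 0) for j in others)
            new_alpha.append(clamp(qe_sum / (n_obs * q_sum)))
            new_beta.append(clamp((e_sum - qe_sum) / (n_obs * (len(others) - q_sum))))
        new_rho = clamp(sum(Q[p] for p in pairs) / len(pairs))
        change = max(
            abs(new_rho - rho),
            max(abs(x - y) for x, y in zip(new_alpha, alpha)),
            max(abs(x - y) for x, y in zip(new_beta, beta)),
        )
        alpha, beta, rho = new_alpha, new_beta, new_rho
        if change < tol:
            break
    # Q holds the posterior computed in the final iteration.
    return alpha, beta, rho, Q


# Invented reports: nodes 0-2 name each other, node 3 names everyone,
# node 4 names no one and is named only by node 3.
reports = {(0, 1): 1, (1, 0): 1, (0, 2): 1, (2, 0): 1, (1, 2): 1, (2, 1): 1,
           (3, 0): 1, (3, 1): 1, (3, 2): 1, (3, 4): 1}
alpha_hat, beta_hat, rho_hat, Q = fit_node_reliability(reports, n=5)
print(f"Q[0,1]={Q[0, 1]:.3f}  Q[0,3]={Q[0, 3]:.3f}  Q[0,4]={Q[0, 4]:.3f}")
```

In this toy input node 3 names every other node, so its estimated true- and false-positive rates are both pushed to the upper clamp and its reports carry essentially no information about which edges exist.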
This calculation returns not only an estimate of the ground-truth network but also an estimate of the reliability of each of the nodes, parametrized by their true-positive and false-positive rates, which tell us both how often a node truthfully reports an edge that does exist and how often it falsely reports an edge that does not. Note that even in the (common) case where each edge is observed at most once, so that $E_{ij}$ can take only the values zero and one, the parameter estimates $\hat{\alpha}_i$ and $\hat{\beta}_i$ and the posterior probabilities $Q_{ij}$ can take a wide range of values, by contrast with the case of the reality mining network, where there are only as many possible values of $Q_{ij}$ as there are values of $E_{ij}$ (see Fig. 1b). For instance, even if both nodes i and j report the existence of an edge between them ($E_{ij} = E_{ji} = 1$), if neither node is considered reliable then the algorithm may say that the probability $Q_{ij}$ of the edge actually existing is low. If either of them is considered reliable, on the other hand, then $Q_{ij}$ will be larger. Finally, if one is unreliable and claims an edge, while the other is reliable but does not, then $Q_{ij}$ will be particularly small.
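The three scenarios just described can be made concrete by evaluating the posterior edge probability for a single pair of nodes with hand-picked reliabilities. The rates used below for a "reliable" and an "unreliable" reporter, and the prior ρ, are hypothetical values chosen for illustration; each node is assumed to make one report about the other.

```python
def mutual_posterior(e_ij, e_ji, rates_i, rates_j, rho):
    """Posterior edge probability from one report in each direction.

    rates_i and rates_j are (true-positive rate, false-positive rate) pairs.
    """
    on, off = rho, 1.0 - rho
    for e, (alpha, beta) in ((e_ij, rates_i), (e_ji, rates_j)):
        on *= alpha if e else 1 - alpha   # report (or silence) given an edge
        off *= beta if e else 1 - beta    # report (or silence) given no edge
    return on / (on + off)


# Hypothetical reliabilities, for illustration only.
RELIABLE = (0.9, 0.01)
UNRELIABLE = (0.5, 0.3)
RHO = 0.05

both_unreliable = mutual_posterior(1, 1, UNRELIABLE, UNRELIABLE, RHO)
one_reliable = mutual_posterior(1, 1, RELIABLE, UNRELIABLE, RHO)
contradicted = mutual_posterior(1, 0, UNRELIABLE, RELIABLE, RHO)

print(f"both report, both unreliable:        Q = {both_unreliable:.3f}")
print(f"both report, one reliable:           Q = {one_reliable:.3f}")
print(f"unreliable claims, reliable denies:  Q = {contradicted:.3f}")
```

With these values the ordering matches the text: a mutual report from two unreliable nodes gives a low probability, a mutual report involving a reliable node gives a high one, and a claim by an unreliable node that a reliable node does not confirm gives the lowest of the three.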
The reality mining data39 are available at http://realitycommons.media.mit.edu/realitymining.html and the high-school friendship data21 are available at http://www.cpc.unc.edu/projects/addhealth/documentation/publicdata.
The author thanks E. Bruch, G. Cantwell, T. Martin, G. Reinert and M. Riolo for useful comments. This work was funded in part by the US National Science Foundation under grants DMS–1407207 and DMS–1710848. This work uses data from Add Health, a programme project designed by J. R. Udry, P. S. Bearman and K. Mullan Harris, and funded by a grant P01–HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. A special acknowledgment is due to R. R. Rindfuss and B. Entwisle for assistance in the original design. Anyone interested in obtaining data files from Add Health should contact Add Health, Carolina Population Center, 123 W. Franklin Street, Chapel Hill, NC 27516-2524 (email@example.com). No direct support was received from grant P01-HD31921 for this analysis.
Supplementary notes, supplementary figures 1–3