Neighbor-Neighbor Correlations Explain Measurement Bias in Networks

In numerous physical models on networks, dynamics are based on interactions that exclusively involve properties of a node’s nearest neighbors. However, a node’s local view of its neighbors may systematically bias perceptions of network connectivity or the prevalence of certain traits. We investigate the strong friendship paradox, which occurs when the majority of a node’s neighbors have more neighbors than does the node itself. We develop a model to predict the magnitude of the paradox, showing that it is enhanced by negative correlations between degrees of neighboring nodes. We then show that by including neighbor-neighbor correlations, which are degree correlations one step beyond those of neighboring nodes, we accurately predict the impact of the strong friendship paradox in real-world networks. Understanding how the paradox biases local observations can inform better measurements of network structure and our understanding of collective phenomena.

Local interactions among nodes in a complex network can lead to an astounding array of collective phenomena. Examples include viral outbreaks in social networks, cascading failures in the power grid and financial networks, synchronization of coupled oscillators, and opinion dynamics and consensus formation in human groups. Researchers have linked the structure of complex networks to the dynamics of collective phenomena unfolding on them: highly connected nodes amplify viral outbreaks [1-3], while community structure affects the dynamics of synchronization [4] and the spread of social contagions [5].
A node's own local view of a network, however, may be systematically biased. One source of bias is Feld's friendship paradox: the number of connections, or degree, of a node is, on average, smaller than the average of its neighbors' degrees [6]. Recently, more subtle forms of the paradox have been proposed. The strong friendship paradox [7] states that the degree of a node tends to be smaller than the median of its neighbors' degrees. Roughly speaking, this is equivalent to the node having fewer neighbors than do a majority of its neighbors. But unlike the original friendship paradox and some recent generalizations [8-11], the strong friendship paradox does not arise as a straightforward result of sampling from skewed distributions [7]. The strong friendship paradox can dramatically distort local measurements in a network, leading to the "majority illusion" [12], in which a globally rare attribute may be overrepresented in the local neighborhoods of many nodes. Physical systems whose dynamics are governed by majority rule, from Ising spin interactions [13] to more complex voting models [14], may be affected by this paradox.
In this manuscript, we develop a stochastic model to predict the magnitude of the strong friendship paradox. Specifically, we show that (a) increasingly disassortative networks exhibit a larger paradox, and (b) accurately modeling it requires considering degree correlations one step beyond those of neighboring nodes.
Given a network with degree distribution $p(k)$, we define the global probability of the strong friendship paradox as $P_{\mathrm{paradox}} = \sum_k p(k) f(k)$, where $f(k)$ is the probability that a randomly chosen node with degree $k$ experiences the paradox. Formally, we define
$$f(k) = P\big(k < \mathrm{Median}(k'_1, \ldots, k'_k)\big),$$
where $k'_i$ is the degree of the node's $i$th neighbor. Of course, networks can have structure beyond that given by the degree distribution. The dK-series framework [15] specifies network structure as a series of joint degree distributions of subgraphs of $d$ nodes. Thus, a network's 1K-structure is specified by the degree distribution $p(k)$. The 2K-structure captures degree correlations of nodes in connected pairs. This is specified by the joint degree distribution $e(k, k')$, the probability that an edge links two nodes with degrees $k$ and $k'$. It follows that the degree distribution of an edge's endpoint is
$$q(k) = \sum_{k'} e(k, k') = \frac{k\,p(k)}{\langle k \rangle}.$$
Similarly, a network's 3K-structure is specified by the joint degree distribution of connected triplets, either wedges or triangles. We find that these higher-order degree correlations can be substantial in real-world networks, possibly reflecting their macroscopic organization into a core-periphery structure, and that accounting for them is necessary for a quantitative understanding of the strong friendship paradox.
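The definitions of $p(k)$, $f(k)$ and the global paradox probability can be made concrete with a minimal sketch. It assumes an undirected graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical illustrations, not part of the study.

```python
from collections import Counter
from statistics import median

def paradox_by_degree(adj):
    """Return (p, f): degree distribution p(k) and observed paradox frequency f(k).

    adj maps each node to the set of its neighbors (undirected graph).
    """
    deg = {v: len(nb) for v, nb in adj.items()}
    n_k = Counter(deg.values())      # number of nodes of each degree
    hits = Counter()                 # nodes of each degree that are in the paradox
    for v, nb in adj.items():
        # Strong friendship paradox: own degree below the median neighbor degree.
        if nb and deg[v] < median(deg[u] for u in nb):
            hits[deg[v]] += 1
    n = len(adj)
    p = {k: c / n for k, c in n_k.items()}
    f = {k: hits[k] / c for k, c in n_k.items()}
    return p, f

def global_paradox(adj):
    """P_paradox = sum_k p(k) f(k)."""
    p, f = paradox_by_degree(adj)
    return sum(p[k] * f[k] for k in p)

# Hypothetical toy graph: hub 0 with four leaves, plus one extra edge 1-2.
toy = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
print(global_paradox(toy))
```

In this toy graph every node except the hub has a degree below the median of its neighbors' degrees, so four of the five nodes experience the paradox.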
The strong friendship paradox depends only on the comparison between the degrees of a node and its neighbors. The probability $Q_>$ that a node sees a neighbor with degree larger than its own can be written as
$$Q_> = \sum_k p(k) \sum_{k'>k} P(k'|k) = \langle k \rangle \sum_k \sum_{k'>k} \frac{e(k, k')}{k}, \tag{1}$$
since the neighbor degree distribution of a degree-$k$ node is $P(k'|k) = e(k, k')/q(k)$, with $q(k) = k\,p(k)/\langle k \rangle$. This expression uses information about the network's 2K-structure, which is globally measured by the assortativity coefficient [16]
$$r = \frac{1}{\sigma_q^2} \sum_{k,k'} k k' \left[ e(k, k') - q(k)\,q(k') \right], \tag{2}$$
where the variance of $k$ is taken with respect to the distribution $q(k)$:
$$\sigma_q^2 = \sum_k k^2 q(k) - \Big[ \sum_k k\,q(k) \Big]^2.$$
In assortative networks ($r > 0$), nodes preferentially link to other nodes with similar degree, while in disassortative networks ($r < 0$), they prefer to link to others with dissimilar degree, e.g., high to low degree nodes. Since $k$ is in the numerator of the sum for $r$ but in the denominator of Eq. (1), given the normalization $\sum_{k,k'} e(k, k') = 1$, we may expect disassortativity to magnify the paradox in networks, and assortativity to suppress it. Previous numerical results for the conventional friendship paradox [10] support this prediction.
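As an illustration of these two quantities, the sketch below estimates $Q_>$ and the assortativity coefficient directly from an undirected edge list. It is a minimal example under stated assumptions: the helper names are hypothetical, isolated nodes are ignored because they never appear in the edge list, and $r$ is undefined for regular graphs, where the variance vanishes.

```python
from collections import Counter

def degree_pairs(edges):
    """Return (deg, pairs): node degrees and all ordered (k, k') endpoint pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Each undirected edge is seen once from each endpoint.
    pairs = [(deg[u], deg[v]) for u, v in edges]
    pairs += [(deg[v], deg[u]) for u, v in edges]
    return deg, pairs

def q_greater(edges):
    """Average over nodes of the fraction of neighbors with larger degree."""
    deg, pairs = degree_pairs(edges)
    # A neighbor of a degree-k node contributes 1/k to that node's fraction.
    return sum(1.0 / k for k, kp in pairs if kp > k) / len(deg)

def assortativity(edges):
    """Pearson correlation of endpoint degrees, taken over the edge-end distribution."""
    _, pairs = degree_pairs(edges)
    m = len(pairs)
    mean = sum(k for k, _ in pairs) / m
    var = sum(k * k for k, _ in pairs) / m - mean ** 2
    cov = sum(k * kp for k, kp in pairs) / m - mean ** 2
    return cov / var  # raises ZeroDivisionError for regular graphs

# Hypothetical toy graph: a star with hub 0 and four leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(q_greater(star), assortativity(star))
```

The star is the extreme disassortative case: every edge joins the hub to a leaf, and four of five nodes see only a higher-degree neighbor.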

Results
The 2K model. Given a randomly chosen node with degree $k$, define an indicator function $x_i$, $i = 1 \ldots k$, to track the degree of the node's $i$th neighbor:
$$x_i = \begin{cases} 1 & \text{if } k'_i > k, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
To a close approximation (and exactly, for odd $k$), the node is in the paradox regime when $\sum_{i=1}^{k} x_i > k/2$. To understand how network structure affects the strong friendship paradox, we now examine $\mu_x(k)$, the probability that a neighbor (say the $i$th one) of a randomly chosen degree-$k$ node has degree greater than $k$:
$$\mu_x(k) = E[x_i \mid k] = \sum_{k'>k} \frac{e(k, k')}{q(k)}. \tag{4}$$
If we assume that degrees of neighbors are independent and identically distributed random variables, the probability for a degree-$k$ node to observe the strong friendship paradox is then given by the binomial distribution:
$$f(k) = \sum_{n > k/2} \binom{k}{n} \mu_x(k)^n \left[ 1 - \mu_x(k) \right]^{k-n}. \tag{5}$$
For large $k$, $f(k)$ is close to Gaussian. In terms of the normal distribution's cumulative distribution function $\Phi$,
$$f(k) \approx \Phi\!\left( \frac{\sqrt{k}\,\left[ \mu_x(k) - 1/2 \right]}{\sigma_x(k)} \right), \tag{6}$$
where $\sigma_x^2(k) = \mu_x(k)\left[ 1 - \mu_x(k) \right]$ is the variance of $x_i$. To demonstrate how assortativity modifies the strong friendship paradox, we consider a network with $e(k, k')$ that has a bivariate log-normal distribution, a long-tailed distribution defined on the positive domain of $k$, with equal means $m$, equal variances $s^2$, and correlation coefficient $c$ for $\ln k$ and $\ln k'$. This form of the distribution allows for analytical treatment of the problem. Thus, the assortativity can be written as
$$r = \frac{e^{c s^2} - 1}{e^{s^2} - 1}.$$
It follows that $f(k)$ decreases with $k$. As the network becomes more disassortative ($c < 0$), $f(k)$ undergoes an increasingly sharp transition from 1 to 0 around $k = e^m$ (Fig. 1(a)). Given that most nodes have low degree, this leads to a globally stronger paradox in more disassortative networks (Fig. 1(b)), consistent with our prediction.
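The 2K estimate can be sketched numerically as follows. The binomial and Gaussian forms assume independent neighbor indicators with common mean $\mu_x(k)$, as in the text. The function `mu_lognormal` is an assumption of this sketch rather than a formula quoted from the paper: it treats $k$ as continuous and takes $(\ln k, \ln k')$ to be bivariate normal with means $m$, variances $s^2$ and correlation $c$, so that $\mu_x(k) = \Phi\big(\tfrac{m - \ln k}{s}\sqrt{(1-c)/(1+c)}\big)$. All function names are hypothetical.

```python
from math import comb, erf, exp, log, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_binomial(k, mu):
    """Probability that more than k/2 of k independent neighbors have larger degree."""
    return sum(comb(k, n) * mu ** n * (1 - mu) ** (k - n)
               for n in range(k // 2 + 1, k + 1))

def f_gaussian(k, mu):
    """Normal approximation to f_binomial (no continuity correction); needs 0 < mu < 1."""
    sigma = sqrt(mu * (1 - mu))
    return phi(sqrt(k) * (mu - 0.5) / sigma)

def mu_lognormal(k, m, s, c):
    """Assumed closed form for mu_x(k) under a bivariate log-normal e(k, k'); |c| < 1."""
    return phi((m - log(k)) / s * sqrt((1 - c) / (1 + c)))

# Hypothetical parameters: the transition near k = e^m sharpens as c decreases.
m, s = log(5.0), 1.0
for c in (-0.5, 0.0, 0.5):
    print(c, [round(f_gaussian(k, mu_lognormal(k, m, s, c)), 3) for k in (2, 5, 20)])
```

Under these assumptions $\mu_x(k) = 1/2$ at $k = e^m$ for every $c$, while a more negative $c$ pushes $\mu_x(k)$ further from $1/2$ on either side, which steepens the transition of $f(k)$.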
The structure of real-world networks creates conditions for the paradox. Table 1 reports the observed fraction of nodes in these networks that see a majority of their neighbors with a larger degree. This fraction is very large in all networks, ranging from 75% to 90%. Table 1 shows that the observed fractions of nodes experiencing the paradox are close to the global probabilities predicted by the 2K model, when $\mu_x(k)$ is set to the actual frequency with which a neighbor of a degree-$k$ node has larger degree. However, a breakdown by degree class reveals significant deviations. Figure 2 plots the paradox probability $f(k)$ for a degree-$k$ node (blue dots). We define the degree at which the 2K estimate (Eq. (6)) of paradox probability is 0.5 as the critical degree $k_c$ of the network. By construction, $k_c = \mathrm{Median}(q(k))$. Nodes with degree $k < k_c$ are likely to experience the paradox, while those with $k > k_c$ are unlikely to do so. The 2K model (dotted line) overestimates the paradox for low-degree nodes and underestimates it for high-degree nodes. This suggests that the 2K model is insufficient, and that we need to take into account structure beyond degree correlations of connected pairs of nodes.
The 3K model. If neighbor degrees are identically distributed but correlated random variables, Eq. (6) must be modified to represent a multivariate rather than a single binomial distribution. To deal with the correlation, we now consider a pair of neighbors, with degrees $k'_i$ and $k'_j$, of a single degree-$k$ node, and their indicator functions $x_i$ and $x_j$ as defined in Eq. (3). The corresponding multivariate normal approximation then gives
$$f(k) \approx \Phi\!\left( \frac{\sqrt{k}\,\left[ \mu_x(k) - 1/2 \right]}{\sqrt{\sigma_x^2(k) + (k-1)\,\mathrm{cov}(x_i, x_j \mid k)}} \right), \tag{10}$$
where $\sigma_x^2(k) = \mu_x(k)\left[ 1 - \mu_x(k) \right]$. Unlike in Eq. (6), where $f(k)$ is completely determined by $\mu_x(k)$, the 3K model requires the covariance term to be specified. Using values determined empirically from real-world networks as in the 2K model, we obtain very accurate paradox probability estimates (solid line in Fig. 2). These estimates also improve on the global 2K results shown in Table 1 for all cases except Youtube and English words, where the two estimates are nearly identical due to their close agreement for low degree values that represent a large fraction of nodes in the network.
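A short numerical sketch shows how a covariance between neighbor indicators changes the estimate. It assumes the sum of $k$ identically distributed indicators with pairwise covariance `cov` is approximately normal, with variance $k\sigma_x^2 + k(k-1)\,\mathrm{cov}$; the function name and the parameter values are hypothetical.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_3k(k, mu, cov):
    """Normal approximation to P(sum of x_i > k/2) for correlated indicators.

    mu  : probability that a single neighbor has larger degree
    cov : covariance between the indicators of two distinct neighbors
    """
    sigma2 = mu * (1 - mu)
    # Variance of the mean of x_1..x_k: (sigma^2 + (k - 1) cov) / k.
    var_mean = (sigma2 + (k - 1) * cov) / k
    if var_mean <= 0:
        raise ValueError("covariance too negative for this k and mu")
    return phi((mu - 0.5) / sqrt(var_mean))

# Hypothetical values: positive covariance pulls the estimate toward 1/2.
for mu in (0.6, 0.4):
    print(mu, round(f_3k(10, mu, 0.0), 3), round(f_3k(10, mu, 0.1), 3))
```

With `cov = 0` the expression reduces to the independent-neighbor estimate. A positive covariance lowers the estimate when $\mu_x(k) > 1/2$ and raises it when $\mu_x(k) < 1/2$, which is the direction of the 2K model's errors for low-degree and high-degree nodes described above.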
To understand the effect of the covariance term, consider the 3K-distribution $t(k'_i, k, k'_j)$, the joint degree distribution of a connected ordered triplet of nodes with degrees $(k'_i, k, k'_j)$. Conditioning on the degree $k$ of the focal node gives the joint degree distribution of its two neighbors:
$$P(k'_i, k'_j \mid k) = \frac{t(k'_i, k, k'_j)}{\sum_{k'_i, k'_j} t(k'_i, k, k'_j)}.$$
The indicator function covariance term in Eq. (10) is
$$\mathrm{cov}(x_i, x_j \mid k) = \sum_{k'_i > k} \sum_{k'_j > k} P(k'_i, k'_j \mid k) - \mu_x^2(k),$$
where $\mu_x(k)$ is given by Eq. (4). Thus, the covariance takes into account correlations only up to the level of chains $(k'_i, k, k'_j)$. Any higher-order correlations beyond 3K, such as those involving connected subgraphs of four nodes, would no longer be consistent with a normal approximation for $f(k)$, since they would involve information beyond the second moment of the indicator function. The remarkable success of the 3K model in Fig. 2 suggests that such higher-order correlations are not needed to explain the paradox, or that they are negligible in real-world networks.
Define the neighbor-neighbor correlation as
$$\rho_x(k) = \frac{\mathrm{cov}(x_i, x_j \mid k)}{\sigma_x^2(k)}.$$
Note that this correlation, like $\sigma_x(k)$, is based not on the neighbors' degrees but on the indicator function comparing them to the node's degree. Figure 3 shows empirically determined values of $\rho_x(k)$ for the real-world networks we studied. Recall that in the 2K model, the probability that a degree-$k$ node has a neighbor with degree greater than $k$ is determined completely by $e(k, k')$ and is unrelated to the degrees of the other neighbors. One might reasonably expect low-degree nodes to have mostly neighbors of higher degree, high-degree nodes to have mostly neighbors of lower degree, and medium-degree nodes to have a mix of both. Figure 3, however, depicts a different scenario: medium-degree nodes prefer to have neighbors with similar degree to one another, whether those neighbors have higher or lower degree. To see how these correlations may be indicative of the macroscopic organization of a network, we plot the distribution of $\bar{x}$, the fraction of higher-degree neighbors, for nodes with $k = k_c$. In the technological networks of Skitter and Google, such medium-degree nodes link more often to high-degree nodes, possibly reflecting a hierarchical network structure with medium-degree nodes at the top level and high-degree nodes at the next level. The remaining networks show a broad distribution of $\bar{x}$, consistent with a core-periphery network structure where medium-degree nodes link to higher-degree nodes in the core and to lower-degree nodes in the periphery [17,18].
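The neighbor-neighbor correlation can be estimated from a graph by pooling all nodes of the same degree, as in the following sketch. It assumes an undirected graph stored as a dictionary of neighbor sets and counts ordered pairs of distinct neighbors; the pooling scheme, function name and toy graph are assumptions for illustration and may differ from the estimator used for Fig. 3.

```python
from collections import defaultdict

def neighbor_neighbor_correlation(adj):
    """Return {k: rho_x(k)} for degrees k >= 2 where the indicator variance is nonzero."""
    deg = {v: len(nb) for v, nb in adj.items()}
    n_x = defaultdict(int)    # (node, neighbor) pairs seen from degree-k nodes
    s_x = defaultdict(int)    # sum of x_i over those pairs
    n_xx = defaultdict(int)   # ordered pairs of distinct neighbors
    s_xx = defaultdict(int)   # sum of x_i * x_j over those pairs
    for v, nb in adj.items():
        k = deg[v]
        if k < 2:
            continue          # a pair of neighbors needs k >= 2
        h = sum(1 for u in nb if deg[u] > k)   # neighbors with larger degree
        n_x[k] += k
        s_x[k] += h
        n_xx[k] += k * (k - 1)
        s_xx[k] += h * (h - 1)
    rho = {}
    for k in n_x:
        mu = s_x[k] / n_x[k]
        var = mu * (1 - mu)
        if var > 0:
            rho[k] = (s_xx[k] / n_xx[k] - mu * mu) / var
    return rho

# Hypothetical toy graph: A has two higher-degree neighbors, B has two lower-degree ones.
toy = {
    "H1": {"A", "P1", "P2"}, "H2": {"A", "P3", "P4"},
    "A": {"H1", "H2"}, "B": {"L1", "L2"},
    "P1": {"H1"}, "P2": {"H1"}, "P3": {"H2"}, "P4": {"H2"},
    "L1": {"B"}, "L2": {"B"},
}
print(neighbor_neighbor_correlation(toy))
```

Here the two degree-2 nodes each have neighbors that agree with one another (both higher, or both lower), so the indicators are perfectly correlated at $k = 2$ even though half of all neighbors of degree-2 nodes have larger degree.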

Discussion
The connection between local measurement bias and network structure revealed by the strong friendship paradox is crucial for several reasons. It is often impractical to observe large networks in their entirety: instead, researchers estimate network properties by exploring local neighborhoods of select nodes. The paradox, however, may systematically bias local views of network structure, including the sampled degree distribution [19]. The strong friendship paradox also affects measurements of information in networks. Consider a network where nodes have attributes and estimate their prevalence from local observations. When attribute and degree are correlated, the paradox can create an illusion that the attribute is common even when it is globally rare [12]. Finally, quantifying measurement bias may be necessary for predicting the evolution of dynamic processes such as domain formation by majority rule in interacting spin systems [13], or synchronization of frequencies in complex networks such as electrical power grids [20]. Accounting for neighbor-neighbor correlations could be instrumental to the success of network models for such systems.
In this paper, we have studied the strong friendship paradox in networks, a phenomenon that distorts nodes' observations of local network structure. The paradox leads most nodes to observe that a majority of their neighbors have a larger degree than their own. We have developed an analytical model of the strong friendship paradox, enabling highly accurate predictions of its strength in networks. In contrast to Feld's friendship paradox [6], which exists in any network with variance in the degree distribution, the strong friendship paradox requires information about higher-order network structure. Specifically, negative correlations between degrees of connected nodes (given by the network's 2K-structure) will magnify the paradox, especially in networks with a skewed degree distribution. The impact of disassortativity, however, is modulated by degree correlations between nodes' neighbors. These correlations (given by the network's 3K-structure) are necessary to accurately quantify the paradox. The success of the 3K model in explaining the paradox is consistent with the observation [15] that the 3K-structure is sufficient to capture known network properties. In order to mitigate the effects of local measurement bias in networks, it is important to account for the strong friendship paradox and how it is impacted by higher-order network structure.

Methods
Data description. We study six networks from a variety of domains, including social networks (friendship links on the LiveJournal blogging site, soc-LiveJournal1 [21], and community structure on Youtube, com-Youtube [21]), technological networks (the Skitter internet graph, as-skitter [21], and the Google web hyperlink graph, web-Google [21]), a scientific citation graph (Arxiv, cit-HepPh [21]), and relationships between English words [22]. Table 2 shows some basic properties of the networks. These networks vary in size from 34.5K nodes (Arxiv) to almost 4M nodes (LiveJournal), and in assortativity from 0.045 (LiveJournal) to −0.08 (Skitter).

Table 2. List of real-world networks and their basic profiles. Columns: Name, Type, Nodes, Edges, Assortativity. Note that directed edges, if they exist, are treated as undirected edges.