The paradox of second-order homophily in networks

Homophily, the tendency of nodes to connect to others of the same type, is a central issue in the study of networks. Here we take a local view of homophily, defining notions of first-order homophily of a node (its individual tendency to link to similar others) and second-order homophily of a node (the aggregate first-order homophily of its neighbors). Through this view, we find a surprising result for homophily values that applies with only minimal assumptions on the graph topology. It can be phrased most simply as “in a graph of red and blue nodes, red friends of red nodes are on average more homophilous than red friends of blue nodes”. This gap in averages defies simple intuitive explanations, applies to both globally heterophilous and globally homophilous networks, and is reminiscent of, but structurally distinct from, the Friendship Paradox. The existence of this gap suggests intrinsic biases in homophily measurements between groups, and hence is relevant to empirical studies of homophily in networks.

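Step-by-step proof of the list version
Let's say a red node $i$ has homophily $h_i$ and degree $d_i$. This means it has $h_i d_i$ red friends and $(1 - h_i) d_i$ blue friends. There are $n$ total nodes, and $k$ of them are red.
Let's represent the means of the red and blue histograms by counting each red node's homophily as a separate term (some $h_i$'s may be equal but that's okay). Red node $i$ appears $h_i d_i$ times among the red friends of red nodes and $(1 - h_i) d_i$ times among the red friends of blue nodes, so, with all sums running over the red nodes,
$$\mu^{(R)}_R = \frac{\sum_i h_i^2 d_i}{\sum_i h_i d_i}, \qquad \mu^{(R)}_B = \frac{\sum_i h_i (1 - h_i) d_i}{\sum_i (1 - h_i) d_i}.$$
We'd like to prove that $\mu^{(R)}_R > \mu^{(R)}_B$. Let's bring this to the common denominator and remove it, because the denominator of each fraction is positive (at least one homophily value is positive, at least one is less than 1, all homophily values are between 0 and 1, and all degrees are positive):
$$\sum_i h_i^2 d_i \sum_j (1 - h_j) d_j \;-\; \sum_j h_j (1 - h_j) d_j \sum_i h_i d_i > 0.$$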
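We can now find common terms to get
$$\sum_i \sum_j T_{ij} > 0, \qquad T_{ij} = d_i d_j \, h_i (1 - h_j)(h_i - h_j).$$
Then note that if $h_i \neq h_j$, at least one of the terms for $(i, j)$ and $(j, i)$ is not zero.
Proving that by contradiction: if the above is wrong, then $T_{ij} = T_{ji} = 0$ with $h_i \neq h_j$, and with algebra we get
$$h_i = 0 \ \text{or} \ h_j = 1 \quad (1), \qquad h_j = 0 \ \text{or} \ h_i = 1 \quad (2).$$
If $h_i = 0$ to satisfy (1), $h_i$ can't be 1 for (2), so $h_j$ has to be 0 to satisfy (2). Then $h_i = h_j$. Similarly, if $h_j = 1$ to satisfy (1), $h_j$ can't be 0 for (2) and $h_i$ has to be 1 to satisfy (2). Then $h_i = h_j$. Contradiction.
We know from homophily diversity in red nodes that there is at least one pair $(i, j)$ such that $h_i \neq h_j$. (If there is no homophily diversity in red nodes, the sum is 0.) Next, we'd like to prove that for all such pairs $(i, j)$, the sum of the two terms for $(i, j)$ and $(j, i)$, $T_{ij} + T_{ji}$, is positive. Indeed,
$$T_{ij} + T_{ji} = d_i d_j (h_i - h_j)\,\bigl[h_i (1 - h_j) - h_j (1 - h_i)\bigr] = d_i d_j (h_i - h_j)^2 > 0.$$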
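The pairwise argument can be checked numerically. The sketch below is only an illustration: it assumes the $(i, j)$ term has the form $T_{ij} = d_i d_j h_i (1 - h_j)(h_i - h_j)$, which is what cross-multiplying the two list-version means gives, and the function names `pair_term` and `gap_numerator` are hypothetical.

```python
import random

def pair_term(h_i, h_j, d_i, d_j):
    # Assumed form of the (i, j) term of the gap numerator.
    return d_i * d_j * h_i * (1 - h_j) * (h_i - h_j)

def gap_numerator(h, d):
    # Cross-multiplied difference of the two list-version means.
    s_h2d = sum(hi * hi * di for hi, di in zip(h, d))        # sum of h_i^2 d_i
    s_1hd = sum((1 - hi) * di for hi, di in zip(h, d))       # sum of (1 - h_j) d_j
    s_h1hd = sum(hi * (1 - hi) * di for hi, di in zip(h, d)) # sum of h_j (1 - h_j) d_j
    s_hd = sum(hi * di for hi, di in zip(h, d))              # sum of h_i d_i
    return s_h2d * s_1hd - s_h1hd * s_hd

rng = random.Random(0)
m = 6  # number of red nodes in this toy example (arbitrary choice)
h = [rng.random() for _ in range(m)]
d = [rng.randint(1, 20) for _ in range(m)]

# The double sum of pair terms should reproduce the gap numerator.
double_sum = sum(pair_term(h[i], h[j], d[i], d[j])
                 for i in range(m) for j in range(m))

# Each symmetric pair should collapse to d_i d_j (h_i - h_j)^2.
pair_sums_ok = all(
    abs(pair_term(h[i], h[j], d[i], d[j]) + pair_term(h[j], h[i], d[j], d[i])
        - d[i] * d[j] * (h[i] - h[j]) ** 2) < 1e-9
    for i in range(m) for j in range(m)
)
```

Because each symmetric pair reduces to a square times positive degrees, the numerator cannot be negative, and it is strictly positive as soon as two red nodes differ in homophily.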
Since terms for $h_i = h_j$ are 0 and there exist $i$ and $j$ such that $h_i \neq h_j$, the whole gap numerator is positive, so the red gap for the list version with homophily diversity in red nodes is positive.

Step-by-step proof of the special case of the singular version
We start with the observation that the second-order red homophily of a red node $i$ with homophily $h_i$ and degree $d_i$ is equal to the expected first-order homophily of a random red friend of theirs.
Let the nodes be partitioned into a set R of k red nodes and a set B of n − k blue nodes.
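For clarity, let a red node's first-order homophily be denoted $h_i$, and let a blue node's first-order homophily be denoted $p_i$. Similarly, let a red node's degree be $d_i$ and a blue node's degree be $c_i$. Let $N(i)$ denote the neighbors of a node $i$, and for two sets of nodes, let $E(S, T)$ be the set of edges with one end in $S$ and the other end in $T$. If we first choose a random red node $i$ and then choose a random red friend from among their $h_i d_i$ red friends, the expected first-order homophily we see, which is equal to the mean second-order red homophily of red nodes, is
$$\mu^{(R,\mathrm{sing})}_R = \frac{1}{k} \sum_{i \in R} \frac{1}{h_i d_i} \sum_{j \in N(i) \cap R} h_j.$$
Rewriting to see how each $h_i$ is counted with $i$ being a red friend of red nodes:
$$\mu^{(R,\mathrm{sing})}_R = \frac{1}{k} \sum_{i \in R} h_i \sum_{j \in N(i) \cap R} \frac{1}{h_j d_j}.$$
There is a third way to represent this sum, too, adapting an argument of Kramer et al.: each red-red edge contributes once from each of its endpoints, so
$$\mu^{(R,\mathrm{sing})}_R = \frac{1}{k} \sum_{\{i, j\} \in E(R, R)} \left( \frac{h_j}{h_i d_i} + \frac{h_i}{h_j d_j} \right).$$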
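We can apply a similar approach to a random red friend of a random blue node. If we first choose a random blue node $i$ and then choose a random red friend from among their $(1 - p_i) c_i$ red friends, the expected first-order homophily we see, which is equal to the mean second-order red homophily of blue nodes, is
$$\mu^{(R,\mathrm{sing})}_B = \frac{1}{n - k} \sum_{i \in B} \frac{1}{(1 - p_i) c_i} \sum_{j \in N(i) \cap R} h_j.$$
Again, we can rewrite this to see how each $h_i$ is counted with $i$ being a red friend of blue nodes:
$$\mu^{(R,\mathrm{sing})}_B = \frac{1}{n - k} \sum_{i \in R} h_i \sum_{j \in N(i) \cap B} \frac{1}{(1 - p_j) c_j}.$$
For our special case, assume: every red node has the same degree $d$; every blue node has the same degree $c$ and the same first-order homophily $p$ (no homophily diversity in blue nodes); there is homophily diversity in red nodes, with all homophily values in $[0, 1]$; no restriction on the relationship between $d$ and $c$; no restriction on $k$ as it relates to $n$; no restriction on $n$.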
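Let's first prove that $\mu^{(R,\mathrm{sing})}_R \geq \lambda_R$, where $\lambda_R = \frac{1}{k} \sum_{i \in R} h_i$ is the mean first-order homophily of red nodes. Writing the mean as a sum over red-red edges and using $d_i = d$,
$$\mu^{(R,\mathrm{sing})}_R = \frac{1}{kd} \sum_{\{i, j\} \in E(R, R)} \left( \frac{h_j}{h_i} + \frac{h_i}{h_j} \right) \geq \frac{2\,|E(R, R)|}{kd} = \lambda_R.$$
Above, the inequality for reciprocals ($x + 1/x \geq 2$ for $x > 0$) can be strict because there is homophily diversity in red nodes. Notice that the number of red edges $|E(R, R)|$ is
$$|E(R, R)| = \frac{1}{2} \sum_{i \in R} h_i d = \frac{k d \lambda_R}{2}.$$
Now, let's see how $p$ relates to $\lambda_R$. First, we know from the balance equation that
$$\sum_{i \in R} (1 - h_i) d_i = \sum_{j \in B} (1 - p_j) c_j.$$
Or, in our case,
$$k d (1 - \lambda_R) = c \sum_{j \in B} (1 - p_j).$$
We'll compute the mean of the $p_j$, denoted $\bar{p}$, right now and equate it to $p$ at the end. From the above,
$$\bar{p} = 1 - \frac{k d (1 - \lambda_R)}{(n - k) c}.$$
Here, $\bar{p} = p$.
Going back to $\mu^{(R,\mathrm{sing})}_B$, which is
$$\mu^{(R,\mathrm{sing})}_B = \frac{1}{n - k} \sum_{i \in R} h_i \, \frac{(1 - h_i) d}{(1 - p) c} = \frac{d}{(n - k)(1 - p) c} \sum_{i \in R} h_i (1 - h_i).$$
From what we know about $p$, $(n - k)(1 - p) c = k d (1 - \lambda_R)$, so
$$\mu^{(R,\mathrm{sing})}_B = \frac{\sum_{i \in R} h_i (1 - h_i)}{k (1 - \lambda_R)}.$$
We already know $\mu^{(R,\mathrm{sing})}_R \geq \lambda_R$, so we need to show
$$\lambda_R \geq \frac{\sum_{i \in R} h_i (1 - h_i)}{k (1 - \lambda_R)}.$$
Multiplying by $(1 - \lambda_R)$ and expanding $\lambda_R$ we get
$$\left( \frac{1}{k} \sum_{i \in R} h_i \right) \left( \frac{1}{k} \sum_{i \in R} (1 - h_i) \right) \geq \frac{1}{k} \sum_{i \in R} h_i (1 - h_i);$$
the last statement is true by (reverse) Chebyshev's sum inequality, since the sequences $h_i$ and $1 - h_i$ are oppositely ordered.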
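As a small numerical illustration of the last step, the sketch below checks the Chebyshev-type inequality $\lambda_R (1 - \lambda_R) \geq \frac{1}{k} \sum_i h_i (1 - h_i)$ on a random sample. It assumes that, in the special case, the blue mean reduces to $\sum_i h_i (1 - h_i) / (k (1 - \lambda_R))$; the sample range and variable names are arbitrary choices for the example. The difference between the two sides equals the variance of the $h_i$, which is why the inequality is strict whenever red nodes differ in homophily.

```python
import random

rng = random.Random(1)
# Hypothetical red homophily sample, kept away from 0 and 1.
h = [rng.uniform(0.05, 0.95) for _ in range(1000)]
k = len(h)

lam = sum(h) / k                              # mean red homophily, lambda_R
var = sum((x - lam) ** 2 for x in h) / k      # population variance of h
mean_h_1mh = sum(x * (1 - x) for x in h) / k  # mean of h_i (1 - h_i)

# Reverse Chebyshev sum inequality for oppositely ordered sequences.
chebyshev_holds = lam * (1 - lam) >= mean_h_1mh

# Assumed special-case expression for the blue mean.
mu_B = mean_h_1mh / (1 - lam)
```

In this sample `lam - mu_B` matches `var / (1 - lam)` up to rounding, so the blue mean sits below $\lambda_R$ by an amount that grows with the spread of red homophily.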

Constructing a two-type network from parameters: double configuration model
To create a network of red and blue nodes with parameters $n$, $k$, $d$, $c$, and homophily samples $h_1, \dots, h_k$ and $p_{k+1}, \dots, p_n$, set or calculated as described in the discussion of simulation and gap size prediction below, we use what we call a double configuration model. For each red node $i$, we take its homophily $h_i$, multiply it by $d$, and round it to get its within-type degree $d_{iw}$. We subtract it from $d$ to get its outside-type degree $d_{io}$. We collect the $d_{iw}$'s into a degree sequence and create the red subgraph with this degree sequence using the configuration model algorithm. We perform analogous operations on blue nodes (using $p_j$ and $c$) to calculate the $c_{jw}$'s and $c_{jo}$'s and to create the blue subgraph.
We then sum the $d_{io}$'s and the $c_{jo}$'s ($\sum_{i \in R} d_{io}$ and $\sum_{j \in B} c_{jo}$), both of which are the number of edges between types, and see if they equal each other, as they must. If they don't, we look at the sum that's larger, get a random node from that type, decrease its outside-type degree by 1, and repeat the comparison of the sums until they are equal. In the simulated example just below, they were equal right away.
Once the sums are equal, we take each red node $i$ and put $d_{io}$ copies of it into a new set $R_o$. We take each blue node $j$ and put $c_{jo}$ copies of it into a new set $B_o$. We permute both sets, create a ranking for them based on the permutation, and create edges between the elements of the two sets if their rankings are equal. In the end, the copies of red node $i$ get $d_{io}$ pairings to blue nodes, just like node $i$ should. We take the information on which red node copy is connected to which blue node copy in this scenario, and connect their original nodes between the red subgraph and the blue subgraph that we have, getting the final graph.
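The construction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, the configuration-model step is written as simple stub matching (so self-loops and multi-edges are possible), and two details are assumptions not stated in the text: one stub is dropped when a within-type degree sum is odd, and the balancing step only picks nodes whose outside-type degree is still positive.

```python
import random

def pair_stubs(stubs, rng):
    """Stub matching: shuffle the stubs and pair consecutive entries."""
    stubs = list(stubs)
    if len(stubs) % 2:  # ASSUMPTION: drop one random stub if the count is odd
        stubs.pop(rng.randrange(len(stubs)))
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

def double_configuration_model(h, p, d, c, seed=None):
    """Return an edge list; red nodes are 0..k-1, blue nodes are k..n-1."""
    rng = random.Random(seed)
    k = len(h)
    red = list(range(k))
    blue = list(range(k, k + len(p)))

    # Within-type and outside-type degrees (Python's round() rounds halves to even).
    d_in = {i: round(h[i] * d) for i in red}
    d_out = {i: d - d_in[i] for i in red}
    c_in = {j: round(p[j - k] * c) for j in blue}
    c_out = {j: c - c_in[j] for j in blue}

    # Balance the number of cross-type stubs by trimming the larger side.
    surplus = sum(d_out.values()) - sum(c_out.values())
    while surplus != 0:
        side = d_out if surplus > 0 else c_out
        node = rng.choice([v for v in side if side[v] > 0])  # ASSUMPTION: skip zeros
        side[node] -= 1
        surplus += -1 if surplus > 0 else 1

    # Red subgraph and blue subgraph via stub matching.
    edges = pair_stubs((i for i in red for _ in range(d_in[i])), rng)
    edges += pair_stubs((j for j in blue for _ in range(c_in[j])), rng)

    # Cross-type edges: permute both stub lists and pair equal ranks.
    red_out = [i for i in red for _ in range(d_out[i])]
    blue_out = [j for j in blue for _ in range(c_out[j])]
    rng.shuffle(red_out)
    rng.shuffle(blue_out)
    edges += list(zip(red_out, blue_out))
    return edges

# Small symmetric example: 10 red and 10 blue nodes, all of degree 10.
h_demo = [0.6] * 5 + [0.4] * 5
p_demo = [0.6] * 5 + [0.4] * 5
demo_edges = double_configuration_model(h_demo, p_demo, d=10, c=10, seed=0)
```

In the symmetric example the two cross-type sums agree immediately, so no trimming happens and every node keeps its target degree (counting a self-loop twice).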

Simulation
Here, we'd like to show that the result holds in the most symmetrical contexts and in data. We have proven the singular version result for the case with no homophily diversity in blue nodes, so in the simulation below we'll use the following parameters: $n = 10000$, $k = 5000$, $d = 100$ (same for all red nodes); half of the red nodes have homophily 0.6, the other half have homophily 0.4, and the same is true for blue nodes. We'll calculate the common degree of blue nodes $c$ from the balance equation (we round $c$ to make it an integer). Because of the symmetry of the situation, it is also equal to 100. Then, we create the network via the double configuration model as described above and compute the red and the blue gaps for both the list and the singular versions of second-order homophily. The results are presented in Figure S1.
Figure S1: We see the gap in both the list and the singular version. Note that the means and the gaps are the same between versions, but that's not usually the case. The networks for each case were constructed separately.
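To make the two gap definitions concrete, here is a minimal sketch that computes the red gap in both versions on a small hand-built graph. The adjacency-dictionary representation, the function names, and the toy graph are all assumptions made for illustration; nodes with no red friends are skipped, which is also an assumption.

```python
from statistics import mean

def first_order(adj, color):
    """Fraction of each node's neighbors that share its color."""
    return {v: sum(color[u] == color[v] for u in nbrs) / len(nbrs)
            for v, nbrs in adj.items()}

def red_gaps(adj, color):
    """Return (list-version gap, singular-version gap) for red friends."""
    h = first_order(adj, color)
    list_vals = {"red": [], "blue": []}  # pooled homophily of red friends
    sing_vals = {"red": [], "blue": []}  # one per-node average per entry
    for v, nbrs in adj.items():
        red_friends = [h[u] for u in nbrs if color[u] == "red"]
        if not red_friends:  # ASSUMPTION: nodes without red friends are skipped
            continue
        list_vals[color[v]].extend(red_friends)
        sing_vals[color[v]].append(mean(red_friends))
    list_gap = mean(list_vals["red"]) - mean(list_vals["blue"])
    sing_gap = mean(sing_vals["red"]) - mean(sing_vals["blue"])
    return list_gap, sing_gap

# Toy graph: red nodes 0, 1, 2 and blue nodes 3, 4.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3, 4], 3: [0, 2, 4], 4: [2, 3]}
color = {0: "red", 1: "red", 2: "red", 3: "blue", 4: "blue"}
list_gap, sing_gap = red_gaps(adj, color)
```

The list version pools every (node, red friend) pair, so high-degree nodes weigh more, while the singular version first averages within each node and then across nodes. Both gaps are positive on this toy graph, but they differ in size.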

Full derivation of list version gap size formula for a special case
We start with $n$ nodes, $k$ red nodes, $r = k/n$, sets $R$ and $B$ for red and blue nodes, and common degrees $d \in [0, n]$ for $i \in R$ and $c \in [0, n]$ for $j \in B$. First-order homophily $h_i$ of $i \in R$ is distributed around mean $\lambda_R$ with standard deviation $\sigma_R$, but then limited to $[0, 1]$ (clipping). Similarly, first-order homophily $p_j$ of $j \in B$ is distributed around mean $\lambda_B$ with standard deviation $\sigma_B$, but then limited to $[0, 1]$ (clipping). We observe that the expected number of edges between red and blue nodes is equal to
$$\sum_{i \in R} (1 - h_i)\, d = \sum_{j \in B} (1 - p_j)\, c.$$
This means that setting all parameters freely isn't possible. We can solve for $c$, for example:
$$c = d \, \frac{\sum_{i \in R} (1 - h_i)}{\sum_{j \in B} (1 - p_j)}.$$
We can see how $c$ would respond to changes in $\lambda_R$ and $\sigma_R$ given fixed values of everything else in Figure S2. Note that as we increase $\sigma_R$, more clipping occurs and the sample of red nodes' first-order homophily resembles a uniform distribution with sample mean 0.5, regardless of the given $\lambda_R$.
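The parameter constraint can be illustrated with a short sketch that draws clipped homophily samples and solves the balance equation for $c$. It assumes the balance equation has the form $d \sum_{i \in R} (1 - h_i) = c \sum_{j \in B} (1 - p_j)$ and that the underlying distribution is normal; the function names and the example parameters are hypothetical.

```python
import random

def clipped_normal_sample(mean, sd, size, rng):
    """Draw normal values and clip them to [0, 1]."""
    return [min(1.0, max(0.0, rng.gauss(mean, sd))) for _ in range(size)]

def solve_blue_degree(h, p, d):
    """Blue degree c implied by the (assumed) balance equation."""
    return d * sum(1 - x for x in h) / sum(1 - y for y in p)

# Symmetric case from the simulation section: c should come out equal to d.
h_sym = [0.6] * 5 + [0.4] * 5
p_sym = [0.6] * 5 + [0.4] * 5
c_sym = solve_blue_degree(h_sym, p_sym, d=100)

# Asymmetric case with clipped samples (parameter values are arbitrary).
rng = random.Random(0)
h_sample = clipped_normal_sample(0.7, 0.2, 5000, rng)
p_sample = clipped_normal_sample(0.5, 0.1, 5000, rng)
c_asym = round(solve_blue_degree(h_sample, p_sample, d=100))
```

In the asymmetric case red nodes send fewer edges across types than blue nodes would at equal degree, so the implied blue degree falls below $d$.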
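A red node with homophily $h_i$ would be counted $h_i d$ times in red nodes' second-order homophily values, and $(1 - h_i) d$ times in blue nodes' second-order homophily values. So, the total number of points in the red histogram would be $\sum_{i \in R} h_i d$ and in the blue histogram $\sum_{i \in R} (1 - h_i) d$. So the mean of the red distribution would be:
$$\mu^{(R)}_R = \frac{\sum_{i \in R} h_i \cdot h_i d}{\sum_{i \in R} h_i d} = \frac{\sum_{i \in R} h_i^2}{\sum_{i \in R} h_i}.$$
And the mean of the blue distribution would be:
$$\mu^{(R)}_B = \frac{\sum_{i \in R} h_i (1 - h_i) d}{\sum_{i \in R} (1 - h_i) d} = \frac{\sum_{i \in R} h_i (1 - h_i)}{\sum_{i \in R} (1 - h_i)}.$$
If all $h_i$ for $i \in R$ are equal to $h$, the equations become:
$$\mu^{(R)}_R = \frac{k h^2}{k h} = h, \qquad \mu^{(R)}_B = \frac{k h (1 - h)}{k (1 - h)} = h,$$
meaning there is no gap.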
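If the $h_i$'s are not identical ($\sigma_R \neq 0$), it is more difficult to deal with $\mu^{(R)}_R$ and $\mu^{(R)}_B$. Ignoring clipping, write $h_i = \lambda_R + \sigma_R H_i$, where the $H_i$ are independent standard normal variables. Then the numerator of $\mu^{(R)}_R$ is
$$\sum_{i \in R} h_i^2 = \sigma_R^2 \sum_{i \in R} H_i^2 + 2 \lambda_R \sigma_R \sum_{i \in R} H_i + |R| \lambda_R^2,$$
where $\sum_{i \in R} H_i^2$ is a chi-square-distributed variable whose mean equals $|R|$, and the cross term has mean 0. Then:
$$E\Bigl[\sum_{i \in R} h_i^2\Bigr] = |R| (\sigma_R^2 + \lambda_R^2), \qquad E\Bigl[\sum_{i \in R} h_i\Bigr] = |R| \lambda_R.$$
Then $\mu^{(R)}_R \approx \lambda_R + \sigma_R^2 / \lambda_R$. This also seems consistent with the data. Now we'd like to show that $\mu^{(R)}_B \approx \lambda_R - \sigma_R^2 / (1 - \lambda_R)$. This is clear, since
$$\mu^{(R)}_B = \frac{\sum_{i \in R} h_i - \sum_{i \in R} h_i^2}{\sum_{i \in R} (1 - h_i)} \approx \frac{|R| \lambda_R - |R| (\sigma_R^2 + \lambda_R^2)}{|R| (1 - \lambda_R)} = \lambda_R - \frac{\sigma_R^2}{1 - \lambda_R}.$$
The gap would be equal to:
$$g = \mu^{(R)}_R - \mu^{(R)}_B \approx \frac{\sigma_R^2}{\lambda_R} + \frac{\sigma_R^2}{1 - \lambda_R} = \frac{\sigma_R^2}{\lambda_R (1 - \lambda_R)}.$$
In general, because of the denominators and the square, $g$ would be $> 0$ for $0 < \lambda_R < 1$ and $\sigma_R > 0$ (and also $\sigma_R < 0$, but that's not important). $g$ would be defined for $0 < \lambda_R < 1$.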
Note: this might be an abuse of notation because we initially derived this for $\sigma_R \neq 0$, but if $\sigma_R = 0$, $g = 0$.
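A quick way to sanity-check the gap size prediction is to compare it with the gap computed directly from a sample of red homophily values. The sketch below assumes the special-case gap takes the closed form $g = \sigma_R^2 / (\lambda_R (1 - \lambda_R))$, which is what the mean-and-variance argument yields when clipping is ignored; the function names and parameter values are hypothetical.

```python
import random

def predicted_gap(lam, sigma):
    """Assumed closed form for the list-version gap in the special case."""
    return sigma ** 2 / (lam * (1 - lam))

def sampled_gap(lam, sigma, size, seed=0):
    """Gap computed from a clipped normal sample (common degree cancels out)."""
    rng = random.Random(seed)
    h = [min(1.0, max(0.0, rng.gauss(lam, sigma))) for _ in range(size)]
    mu_red = sum(x * x for x in h) / sum(h)
    mu_blue = sum(x * (1 - x) for x in h) / sum(1 - x for x in h)
    return mu_red - mu_blue

g_pred = predicted_gap(0.5, 0.1)
g_samp = sampled_gap(0.5, 0.1, size=20000)
```

With a small $\sigma_R$ almost no clipping occurs and the two values agree closely. As $\sigma_R$ grows, clipping pulls the sample away from the normal assumption and the closed form should be expected to drift from the sampled gap.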

Examples where the Friendship and the Homophily Paradoxes co-occur and don't
Figure S3: Example networks where the Friendship and the Homophily Paradoxes co-occur and don't.