Topological measures for identifying and predicting the spread of complex contagions

The standard measure of distance in social networks – average shortest path length – assumes a model of “simple” contagion, in which people only need exposure to influence from one peer to adopt the contagion. However, many social phenomena are “complex” contagions, for which people need exposure to multiple peers before they adopt. Here, we show that the classical measure of path length fails to define network connectedness and node centrality for complex contagions. Centrality measures and seeding strategies based on the classical definition of path length frequently misidentify the network features that are most effective for spreading complex contagions. To address these issues, we derive measures of complex path length and complex centrality, which significantly improve the capacity to identify the network structures and central individuals best suited for spreading complex contagions. We validate our theory using empirical data on the spread of a microfinance program in 43 rural Indian villages.

the size of the bridge between these two neighborhoods as 0, while a complex contagion can spread between them. Supplementary Figure 1 displays one of these edge cases, where two neighborhoods share no overlapping members, even though a complex contagion can spread between them.

Supplementary Figure 2. Visualizing the measure of bridge width across a variety of neighborhood configurations. These configurations include overlapping and non-overlapping neighborhoods, as well as neighborhoods that are and are not connected by a sufficiently wide bridge. Tj refers to the adoption threshold of target node j. Oij refers to the overlap between the inclusive neighborhoods of nodes i and j. Rij refers to the size of the reinforcement set connecting the neighborhoods of nodes i and j. BWij refers to the width of the bridge connecting node i to node j. See the "Methods" section of the main text for the formal definition of each of the terms visualized.

For this reason, we updated the definition of bridge width to account for cases where two neighborhoods are connected but share no overlapping members, allowing the measure of bridge width to generalize beyond lattices. We develop the following logic for identifying when a sufficient bridge exists that can spread a contagion between two neighborhoods that share connections but are not necessarily adjacent. Supplementary Figure 2 provides a visual display of how our approach accurately characterizes the width of bridges for both adjacent and non-adjacent but connected neighborhoods. As stated in the main text, for a contagion with an adoption [...]

We ascribe every bridge in G a binary value indicating whether the bridge is sufficiently wide to enable diffusion. We indicate this binary value in notation by placing sharp brackets around the term for bridge width: [...]

The above definition of bridge width can be readily adapted to heterogeneous distributions of thresholds by requiring that Rij contain only nodes from Dij that can be activated by N[i]. Specifically, this requires that we keep each node x in Rij only if Oix ≥ Tx, i.e., if there are enough ties from N[i] to satisfy Tx.

The Add Health Dataset
The Add Health dataset was constructed from an in-school survey, administered to 90,118 students from 84 distinct communities throughout the US in 1994-1995 (2). All network data are publicly available at the following GitHub repository: https://github.com/drguilbe/complexpaths.¹ The survey was designed to gather data on students' social networks. Each student was given a paper-and-pencil questionnaire and a copy of a roster listing every student in the school; if the community had two schools, the students were also provided with the roster of the "sister" school. Students were asked to "List your closest (male/female) friends. List your best (male/female) friend first, then your next best friend, and so on. (Girls/boys) may include (boys/girls) who are friends and (boy/girl) friends". This dataset was chosen for the purposes of our study because the social networks represent empirically grounded peer networks with significant topological variation.

¹ The data and code for replicating this study can be cited as: Guilbeault [...]

Design of the simulated diffusion experiment using the Add Health dataset
For each viable network in the Add Health dataset, we simulated diffusion separately using six seeding strategies: (i) the "complex" seeding strategy, where diffusion was initiated using the seed node identified as having the highest complex centrality; (ii) the "degree" seeding strategy, where diffusion was initiated using the seed node identified as having the highest degree centrality; (iii) the "betweenness" seeding strategy, where diffusion was initiated using the seed node identified as having the highest betweenness centrality; (iv) the "eigenvector" seeding strategy, where diffusion was initiated using the seed node identified as having the highest eigenvector centrality; (v) the "k-core" seeding strategy, where diffusion was initiated using the seed node identified as having the highest coreness; and (vi) the "percolation" seeding strategy, where diffusion was initiated using the seed node identified as having the highest percolation centrality. In all cases, diffusion was initiated with a seeding budget of Ti, where the seed node and Ti - 1 of the nodes in its neighborhood were initially activated (if a node's neighborhood contained more than Ti - 1 nodes, then Ti - 1 nodes from this neighborhood were randomly selected as the initial seed set, along with the seed node). In cases where more than one seed node was identified as having the highest centrality according to any of the measures, we randomly selected one of the nodes to be the seed. For each network and each seeding strategy, we ran simulations using thresholds ranging from minimal complexity (Ti = 2) to maximal complexity (Ti = 6) [...]

[...] the leaders who seeded the program, and also with respect to the households without leaders that adopted and provided reinforcement for other households to follow suit. To measure the social network structure of each village, Banerjee et al. administered surveys to each household, which identified social relations across twelve dimensions: those who visit the respondent's home, those whose homes the respondent visits, kin in the village, nonrelatives with whom the respondent socializes, those from whom the respondent receives medical advice, those from whom the respondent would borrow money, those to whom the respondent would lend money, those from whom the respondent would borrow material goods (e.g., kerosene and rice), those to whom the respondent would lend material goods, those from whom the respondent gets advice, those to whom the respondent gives advice, and those with whom the respondent goes to pray (at a temple, church, or mosque). Banerjee [...]

Since it is not possible to directly determine the empirical adoption thresholds that characterized each household's willingness to adopt, we calculate a household's expected complex centrality as its average centrality across a range of adoption thresholds. This methodology is displayed in Supplementary Figure 6. We first simulate diffusion from each household while holding the thresholds of all households constant across a range of absolute adoption thresholds, from Ti = 2 to Ti = 6. For example, we set the adoption threshold of each household to Ti = 2 and then simulate diffusion when seeding from each possible household. As in our Add Health simulation, we adopt a clustered seeding approach. We take the same approach for each Ti from Ti = 2 to Ti = 6. In each case, when activating a given household as the seed, we set the number of nodes to activate from the seed's neighborhood to Ti - 1, identical to our simulated experiments on the Add Health dataset. We then take the average of each household's complex centrality under each value of Ti.

As the final step, for each village, we identify the household with the highest centrality according to each extant centrality measure (degree, eigenvector, betweenness, k-core, and percolation), in addition to identifying the node with the highest average complex centrality.
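As an illustration of the simulation design described above, the following is a minimal Python sketch and not the authors' implementation. It assumes an undirected network stored as a dictionary of neighbor sets, a single absolute threshold shared by all nodes, and synchronous updating; the function names `clustered_seed_set` and `simulate_complex_contagion` and the toy ring lattice are hypothetical.

```python
import random

def clustered_seed_set(adj, seed, threshold, rng):
    """Return the seed node plus up to threshold - 1 randomly chosen neighbors."""
    neighbors = sorted(adj[seed])
    k = min(threshold - 1, len(neighbors))
    return {seed} | set(rng.sample(neighbors, k))

def simulate_complex_contagion(adj, seeds, threshold):
    """Run a threshold cascade until no more activations are possible.

    An inactive node adopts once at least `threshold` of its neighbors
    are active (absolute, homogeneous threshold; an assumption of this sketch).
    """
    active = set(seeds)
    while True:
        newly = {v for v in adj
                 if v not in active and len(adj[v] & active) >= threshold}
        if not newly:
            return active
        active |= newly

# Hypothetical toy network: a ring of 10 nodes, each tied to its two
# nearest neighbors on either side.
n = 10
ring = {i: {(i + d) % n for d in (-2, -1, 1, 2)} for i in range(n)}

rng = random.Random(0)
for t in (2, 3):
    seeds = clustered_seed_set(ring, 0, t, rng)
    adopters = simulate_complex_contagion(ring, seeds, t)
    print(f"T = {t}: {len(seeds)} seeds, {len(adopters)} adopters")
```

On this toy ring, a threshold of 2 lets the contagion reach every node, whereas a threshold of 3 stalls near the seed set, which illustrates why the seeding budget is tied to Ti.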

To evaluate our predictions, we compare the ability of each centrality measure to identify influential households, where household influence is measured empirically as the fraction of a household's neighbors who adopted after the seed household adopted (see Supplementary Table 1 and Supplementary Table 2 for full details on our statistical approach).

Statistical analysis
To average and compare the results of seeding strategies in the case of homogeneous thresholds (where diffusion outcomes vary significantly by Ti for all seeding strategies), we first normalize the diffusion outcomes across all possible nodes for each network and each value of Ti. We use min-max normalization to standardize diffusion outcomes to a scale from 0 to 1, thus enabling us to average diffusion outcomes across networks with different homogeneous Ti distributions, while preserving the capacity to clearly identify which seeding strategy performed best. Min-max normalization is implemented using the following formula:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

where x is the diffusion outcome for a given seed node, and the minimum and maximum are taken over all possible seed nodes for the same network and the same value of Ti. We then average the normalized diffusion outcomes across networks based on seeding strategies to produce an aggregate representation of which seeding strategy performed best across a full range of threshold conditions. Network conditions that involved heterogeneously distributed thresholds did not require normalization.
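A minimal Python sketch of min-max normalization, i.e., rescaling each value x to (x - min) / (max - min), is given here for concreteness. The function name and the example adopter counts are hypothetical, and mapping a constant input to 0 is an assumption of this sketch rather than something specified in the text.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a constant input carries no ranking
        # information, so every value is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical numbers of adopters produced by four seed nodes on one
# network under one threshold condition.
adopters = [12, 40, 40, 96]
normalized = min_max_normalize(adopters)
print(normalized)
```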
For the purposes of visualization in Fig. 3, we use min-max normalization to standardize the number of adopters generated across all seeding strategies, holding the network and threshold constant. This allows us to rank seeding strategies for each network-threshold configuration in terms of their diffusion success, where the seeding strategy with the maximum number of adopters from the set of simulation results across seeding strategies is normalized to 1. Similar to the definition of betweenness centrality (5), we also display the values of complex centrality (on the x-axis) using min-max normalization. We then average the rankings for each seeding strategy across all threshold values for each network, such that each network is associated with an average normalized number of adopters for each seeding strategy, giving 74 datapoints for each seeding strategy (one for each network), and 444 datapoints in total.
To compare the diffusion outcomes of different seeding strategies in our analysis of the Add Health data, we use the nonparametric Wilcoxon Signed Rank Test, which is a paired test that compares the ranks of each seeding strategy, paired at the level of each trial in our simulation experiment (where each trial refers to simulated diffusion with a specific fixed threshold applied to a specific network). That is, when comparing two seeding strategies, this test first determines whether a given seeding strategy gave rise to more adopters than another seeding strategy for each network, under each value of Ti (which is held constant for all nodes in each graph). Across networks, this measure reveals the number of times that one seeding strategy gave rise to more adopters than another strategy (i.e., was ranked higher) across all networks. This test then determines whether the network-level rankings between two conditions are equivalent, or whether they significantly differ. A significant p-value in this case indicates that one seeding strategy gave rise to significantly more adopters on average than another seeding strategy. We use the two-tailed version of the Wilcoxon Signed Rank Test.
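To make the paired comparison concrete, the sketch below computes a two-tailed Wilcoxon signed-rank test in plain Python. It is an illustration under stated assumptions, not the routine used in the study: zero differences are dropped, tied absolute differences receive average ranks, and the p-value uses the large-sample normal approximation without continuity correction. The function name and the example data are hypothetical; in practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-tailed Wilcoxon signed-rank test using the normal approximation.

    Returns (W+, W-, p). Zero differences are discarded.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return w_plus, w_minus, p

# Hypothetical adopter counts for two seeding strategies across 10 trials,
# paired by (network, threshold).
strategy_a = [10, 12, 15, 9, 20, 18, 14, 16, 11, 13]
strategy_b = [9, 10, 12, 5, 15, 12, 7, 8, 2, 3]
w_plus, w_minus, p_value = wilcoxon_signed_rank(strategy_a, strategy_b)
print(w_plus, w_minus, round(p_value, 4))
```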
In all cases where betweenness centrality (5,6) is calculated in this study, it is calculated based on the standard equation, where the betweenness centrality of a node v is given by the expression:

$$ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$

where σst is the total number of shortest paths from node s to node t and σst(v) is the number of those paths that pass through v. As is standard, we normalize betweenness centrality such that g ∈ [0,1] using min-max normalization.
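For readers who want to see the computation, the sketch below calculates shortest-path betweenness on an unweighted, undirected graph using Brandes' accumulation scheme, followed by min-max normalization. This is a swapped-in illustrative algorithm and not necessarily the routine used in the study; the function name and the toy path graph are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Shortest-path betweenness for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
raw = betweenness(path)
lo, hi = min(raw.values()), max(raw.values())
g = {v: (b - lo) / (hi - lo) for v, b in raw.items()}  # min-max normalized
print(raw, g)
```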
In all cases where eigenvector centrality (6) [...]

In all cases where degree centrality is calculated in this study, it is calculated as per standard methodology, simply as the absolute number of connections held by a particular node (6); or, more formally, for a given node i, degree centrality is calculated as the number of cells in the adjacency matrix A such that ai,j = 1, divided by 2 in the context of unweighted symmetrical ties, to adjust for symmetries in the adjacency matrix.
The measure of clustering coefficient in this work refers to the standard global clustering coefficient (6) [...]

As a robustness test, we compare complex centrality seeding to less popular seeding methods that are nevertheless still based on simple path length, i.e., closeness centrality and reach centrality. Closeness centrality is defined as the reciprocal of the sum of the lengths of the shortest (simple) paths between a given node and all other nodes in a graph. Thus, the more central a node, the closer it is to all other nodes (according to the metric of distance supplied by simple path length). Formally, closeness is defined as (7):

$$ C(x) = \frac{1}{\sum_{y} d(y, x)} $$

where d(y, x) is the distance between vertices x and y. Closeness centrality is often represented in its normalized form, which reflects the average length of the shortest paths instead of their sum, as given by the previous formula multiplied by N - 1, where N is the number of nodes in the graph. This normalization adjustment allows comparisons between nodes of graphs of different sizes.
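The definition of closeness can be sketched in a few lines of Python. The example assumes a connected, unweighted, undirected graph with at least two nodes, so that every distance d(y, x) is finite; the function names and the toy graph are hypothetical.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source along simple paths."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness(adj, x, normalized=True):
    """Reciprocal of the summed distances from x to all other nodes.

    With normalized=True the value is multiplied by N - 1.
    Assumes the graph is connected (an assumption of this sketch).
    """
    total = sum(shortest_path_lengths(adj, x).values())
    raw = 1 / total
    return raw * (len(adj) - 1) if normalized else raw

# Hypothetical toy graph: a path 0 - 1 - 2.
toy = {0: {1}, 1: {0, 2}, 2: {1}}
print(closeness(toy, 1), closeness(toy, 0), closeness(toy, 1, normalized=False))
```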
Reach centrality is a recent measure of centrality that seeks to capture the proportion of other nodes on a graph that are 'reachable' in a diffusion process from a given node (8) [...]

Optimal percolation centrality identifies which nodes are most likely to collapse the largest connected component of a graph (defined in terms of simple paths) when these nodes are removed (9). In practice, percolation centrality amounts to the product of the reduced degree of a node (k - 1) and the total reduced degree of all nodes at the optimal distance d. Optimal results are frequently reached when d is either 3 or 4. Optimal percolation centrality was calculated in this paper using the collective influence (CI) algorithm defined by Morone & Makse (9), using the implementation built into the influential package for the statistical programming language R.

The coreness of a node is calculated using the k-shell decomposition algorithm (10). A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k. A vertex has coreness c if it belongs to a c-core but not to any (c+1)-core.
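The two definitions above can be sketched in Python as follows. This is an illustrative reading of the text, not the implementation in the R influential package: `collective_influence` multiplies a node's reduced degree by the summed reduced degree of the nodes at distance exactly d, and `coreness` peels off minimum-degree vertices one at a time. Both function names and the toy graph are hypothetical.

```python
from collections import deque

def collective_influence(adj, node, d):
    """(k_node - 1) times the sum of (k_j - 1) over nodes j at distance exactly d."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue  # do not search beyond distance d
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    frontier = [v for v, dv in dist.items() if dv == d]
    return (len(adj[node]) - 1) * sum(len(adj[v]) - 1 for v in frontier)

def coreness(adj):
    """Core number of every node, found by repeatedly removing a minimum-degree node."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])  # the shell index never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 tied to 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(coreness(toy))
print(collective_influence(toy, 0, 1), collective_influence(toy, 1, 1))
```

Here the triangle forms the 2-core, while the pendant node belongs only to the 1-core.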

Robustness of the measure of locally sufficient bridges
For the sake of analytic clarity, our main text presents our measure of locally sufficient bridges on graphs subjected to simplifying assumptions along key dimensions: (i) degree uniformity, where every node was given the same number of contacts in their neighborhood, and (ii) threshold type, where every node was assigned a fixed absolute adoption threshold referring to the number of adopters to which one needs to be exposed to adopt. Here we relax these assumptions and show that this measure still provides a highly robust predictor of global cascades.

Robustness of the measure of locally sufficient bridges to scale-free networks with homogeneous absolute and fractional thresholds
Here we validate our measure of locally sufficient bridges on scale-free networks to illustrate robustness to degree heterogeneity. We simulated outcomes on 50 randomly generated scale-free networks, produced using Holme and Kim's (12)

Robustness of correlation for k-regular graphs between average bridge width size and average proportion of adopters
Supplementary Figure 9 shows that the correlation between the average bridge width of a [...]

To average the diffusion outcomes on the same graph across different homogeneous threshold conditions, the final number of adopters for each network was standardized using min-max normalization for each threshold condition prior to averaging. This normalization strategy displays the average ranking of each seeding strategy on each network, averaged within each threshold regime. Error bars display 95% confidence intervals.

Figure 3 in the main text compares centrality-based seeding strategies while associating each seeding strategy with its average diffusion outcome across a range of homogeneous threshold distributions (Ti = 2, Ti = 3, Ti = 4, Ti = 5, and Ti = 6). Here, we confirm that our results are robust to comparing seeding strategies in a disaggregated fashion, i.e., by comparing seeding strategies within each homogeneous threshold regime separately. Supplementary Figure 10 shows that seeding with complex centrality produces the highest expected proportion of adopters compared to all other centrality measures, under each homogeneous threshold regime examined. [...] including an additional predictor variable, βcomplex.cent, corresponding to the complex centrality associated with each focal seed node identified by each seeding strategy, finds that the complex centrality of a focal seed node is strongly and positively correlated with inducing a higher proportion of adopters, while controlling for seeding strategy and adoption threshold, and while clustering standard errors at the network level (p < 0.01, βcomplex.cent = 0.50, CI = [0.47, 0.53]).

Robustness of complex centrality seeding to network composition and influence model.
Here, we demonstrate the ability for complex centrality to outperform extant measures of centrality in identifying influential nodes across a range of popular influence models. In addition to the complex contagion model, we compare extant seeding strategies in the Independent Cascade (IC) model and the Linear Threshold (LT) model (11). The complex contagion model assumes that all agents require some degree of reinforcement from multiple peers. The IC and LT models, by contrast, provide environments where simple and complex contagion dynamics can coexist: depending on the model's parameters, some agents may require reinforcement from multiple peers to adopt, whereas other agents in the same population may be able to adopt with exposure to only a single peer, exhibiting the logic of simple contagion. Like the complex contagion model, IC and LT start with an initial set of seeds, and diffusion proceeds in discrete time steps. In IC, when node i becomes active in step t, it is given only one chance to activate each inactive neighbor w, where it succeeds with probability θ. If i succeeds, then w will become active in t + 1. After node i attempts to activate w at step t, node i is unable to make any future attempts to activate w. In LT, each node i is assigned a threshold Ti uniformly at random from the interval [0,1]. Each node in LT is influenced by each neighbor j according to a weight bij, where the sum of weights among i's neighbors is less than or equal to 1. Adoption thresholds in LT thus represent the weighted fraction of i's neighbors that must become active to trigger adoption by i. That is, for node i at step t, node i will become active if the summed weight of its active neighbors is greater than or equal to Ti. In all models, diffusion runs until no more activations are possible.
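The update rules of the two models can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the LT sketch assumes uniform weights bij = 1 / (degree of i), so that the summed weight of active neighbors is simply the active fraction of i's neighborhood, and thresholds are passed in explicitly rather than drawn at random. Function names and the toy path graph are hypothetical.

```python
import random

def independent_cascade(adj, seeds, theta, rng):
    """IC model: each newly active node gets one chance, with probability
    theta, to activate each of its currently inactive neighbors."""
    active = set(seeds)
    frontier = sorted(seeds)
    while frontier:
        next_frontier = []
        for i in frontier:
            for w in sorted(adj[i]):
                if w not in active and rng.random() < theta:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier  # only newly activated nodes try next step
    return active

def linear_threshold(adj, seeds, thresholds):
    """LT model with uniform weights (assumption): node i activates when the
    active fraction of its neighbors is at least thresholds[i]."""
    active = set(seeds)
    while True:
        newly = set()
        for i in adj:
            if i in active or not adj[i]:
                continue
            if len(adj[i] & active) / len(adj[i]) >= thresholds[i]:
                newly.add(i)
        if not newly:
            return active
        active |= newly

# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
rng = random.Random(0)
print(len(independent_cascade(path, {0}, 0.5, rng)))
print(linear_threshold(path, {0}, {0: 0.9, 1: 0.5, 2: 0.5, 3: 1.0}))
```

In the LT example, nodes 1 and 2 behave like simple contagion adopters (one active neighbor out of two suffices), which illustrates how simple and complex dynamics can coexist in the same population.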
In each influence model, we initiate diffusion from all possible seed nodes, and we use each measure of centrality to identify which of these seed nodes is most successful at triggering diffusion. We compare each centrality across a range of seeding budgets corresponding to the proportion of nodes on a graph that are initially activated as seeds. For each node in a graph, we activate that node and a random subset of its neighbors, where the size of this subset is the size of the seeding budget minus one (for the central node). Given the importance of clustered social influence for complex contagions, we adopt a clustered seeding strategy, such that if the seeding budget exceeds the size of the most central node's neighborhood, we iteratively activate nodes that are directly connected to the neighbors of the most central node until we reach the seeding budget. Finally, we use each centrality measure to identify which seed node among all possible seed nodes is the most influential. We then compare each centrality measure in terms of its ability to successfully identify influential seed nodes in the spread of complex contagions. We examine the robustness of these results to a suite of both theoretical and empirical topologies.
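One way to read the clustered seeding procedure is sketched below in Python. It is an interpretation rather than the authors' code: the seed set starts with the focal node, then adds randomly chosen neighbors, and, if the budget is still not met, moves outward to nodes tied to the previously activated layer. The function name `clustered_seeds` and the toy graph are hypothetical, and the budget is expressed here as a number of nodes rather than a proportion.

```python
import random

def clustered_seeds(adj, center, budget, rng):
    """Select `budget` seeds clustered around `center`, layer by layer."""
    seeds = {center}
    frontier = [center]
    while len(seeds) < budget and frontier:
        # Inactive nodes directly connected to the most recently added layer.
        candidates = sorted({w for v in frontier for w in adj[v]} - seeds)
        if not candidates:
            break  # the component is exhausted before the budget is reached
        rng.shuffle(candidates)
        chosen = candidates[: budget - len(seeds)]
        seeds.update(chosen)
        frontier = chosen
    return seeds

# Hypothetical toy graph: node 0 has neighbors 1 and 2; node 3 hangs off 1
# and node 4 hangs off 2.
toy = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2}}
rng = random.Random(0)
print(clustered_seeds(toy, 0, 2, rng))
print(clustered_seeds(toy, 0, 4, rng))
```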
To further evaluate the efficacy of our centrality measure, we compare our measure against a canonical approach in computer science: a greedy algorithm that simulates diffusion from every possible seed separately and then selects the optimally influential set of nodes with the greatest expected diffusion based on their individual performance (11). We focus our analysis on random scale-free networks with tunable clustering. We begin by comparing different seeding [...]

Supplementary Figure 11 shows that, in scale-free graphs, selecting seeds with complex centrality leads to strikingly higher levels of diffusion than seeding with standard centrality measures, across a range of threshold conditions, seeding budgets, and influence models. Panel A of Supplementary Figure 11 shows that across seeding budgets from 0.01% to 1%, seeding with complex centrality substantially increases the number of adopters when thresholds are absolute and distributed heterogeneously, as compared to seeding with degree centrality (n = 60, [...] Figure 11 shows that the greedy algorithm outperforms degree, betweenness, and eigenvector centrality in the IC model (11). However, panel C of Supplementary Figure 11 also shows that in IC, complex centrality outperforms all centrality-based seeding methods and the greedy algorithm. In supplementary analyses, we show that this finding is robust to variation in θ (which specifies the likelihood of successful peer influence in IC). Lastly, panel D of Supplementary Figure 11 shows that in the LT model, complex centrality also significantly outperforms degree, betweenness, and eigenvector centrality seeding methods, performing equivalently to the optimal greedy algorithm.

Robustness of complex centrality seeding to scale-free networks with varying levels of clustering.
To test the robustness of our results to a wide range of scale-free networks that vary in terms of average clustering coefficient, we generated scale-free networks using Holme and Kim's (10)

Robustness to homogeneous absolute and fractional thresholds in the scale-free graphs
In Supplementary Figure 13, we present our results on scale-free graphs with heterogeneous threshold distributions, which we foreground in the main text because they capture expected heterogeneity in a population. Here in Supplementary Figure 13 we show that our seeding results also hold in graphs with homogeneous distributions of absolute and fractional thresholds in scale-free networks (N = 1000; = 3; m = 4; p = .5). Panel A of Supplementary Figure 13 shows that seeding with complex centrality leads to significantly greater diffusion than extant seeding strategies in scale-free graphs with homogeneous absolute thresholds; and panel B of Supplementary Figure 13 shows that seeding with complex centrality leads to significantly greater diffusion than seeding with extant seeding strategies in scale-free graphs with homogeneous fractional thresholds.
Supplementary Figure 14. Comparing seeding strategies in the Independent Cascade model, while varying θ, the probability that a given peer interaction in the network will enable diffusion. The proportion of adopters in the Independent Cascade model averaged over 30 unique scale-free networks ( = 3, m = 4, p = .5, N = 1000) for seeding strategies based on node centrality (complex, degree, betweenness, and eigenvector) and the greedy algorithm. (A) θ = 0.2; (B) θ = 0.3. Error bars display 95% confidence intervals.

Robustness to varying the probability of adoption in the independent cascade model
Supplementary Figure 14 shows that complex centrality outperforms all other centrality-based seeding strategies across a range of θ values in the IC model (where θ refers to the probability that any interaction between an adopter and nonadopter in the network will permit diffusion). In the main results, we reported simulations of IC with θ = 0.1, where complex centrality was shown to outperform all other centrality measures and the greedy algorithm. Complex centrality is the most successful in this environment because, when θ is low, reinforcing ties among peers can be essential for enabling a cascade. As θ increases, contagion dynamics become "simpler" in that single-tie encounters become more likely to trigger adoption without peer reinforcement.

Robustness of complex centrality seeding in comparison to closeness and reach centrality
For succinctness in the main text, we report the advantages of seeding with complex centrality in comparison to the most popular centrality-based seeding strategies based on simple path length, i.e., degree, betweenness, and eigenvector centrality. Here we show that complex centrality also substantially outperforms established centrality measures that are less frequently used in seeding.
Supplementary Figure 15 shows that, across a range of seeding budgets, complex centrality triggers substantially greater diffusion than closeness centrality (5) and reach centrality (6) [...]

Robustness of complex centrality seeding to k-regular graphs.
In Supplementary Figure 8, we present our results on scale-free graphs with nonuniform degree distributions, which are of relevance to seeding in extant empirical social networks that normally have nonuniform degree distributions. Here we show that our seeding results also hold in graphs with a uniform degree distribution (i.e., k-regular graphs), which are regularly employed in structured social contexts (14,15). Supplementary Figure 16 shows that seeding with complex centrality leads to significantly greater diffusion than seeding with extant seeding strategies in k-regular graphs of varying levels of randomness in tie distribution. Panel A of Supplementary Figure 16 shows that across a range of seeding budgets, seeding with complex centrality in k-regular graphs with homogeneous absolute thresholds substantially increases the number of adopters, as compared to betweenness centrality (p < .001) and the greedy seeding algorithm [...]

Robustness to conventionally-generated scale-free graphs
Here, we confirm that our theory is consistent with scale-free networks generated by the conventional algorithm from the Barabási-Albert model (13). Supplementary Figure 17 illustrates that nodes with the highest complex centrality consistently led to a significantly higher [...] Test, Two-Tailed; percolation centrality is normalized using min-max normalization).
Note that nodes with the highest complex centrality do not have higher degree, betweenness, eigenvector, or percolation centrality than nodes with the highest k-coreness. Yet, Figure 3F shows that nodes with the highest complex centrality are nevertheless structurally distinct from nodes with the highest k-coreness. Figure 3F shows that seeds with the highest complex centrality have the lowest k-core centrality ([...]

Robustness to hyperparameter d in seeding with percolation centrality
The optimal percolation centrality method, also known as the collective influence (CI) algorithm and defined by Morone and Makse (2015), is tuned by a hyperparameter, d, which specifies the distance from the focal node (i.e., the number of steps along simple paths) within which alter nodes will be assessed in terms of their reduced degree. Our main results presented in Figures 3 and 4 assign d the standard and default value of 3 steps. For thoroughness, we show here that altering d does not lead to any significant improvements in the overall proportion of adopters

Robustness of BSS diffusion model to statistical controls
In this final supplementary section, we illustrate that the results presented in Figure 4 are robust to a range of statistical tests and socioeconomic control variables. Supplementary Table 2 displays the fit of an OLS model that uses each centrality measure to predict the fraction of each seed household's neighborhood that adopted the BSS program (using 'leader' households only), while controlling for all socioeconomic variables included in Banerjee et al.'s (2013) survey, with additional fixed effects at the village level (3). The intercept identifies the expectation when randomly identifying leader households. The results are robust to varying the seeding strategy used as the referent strategy for the intercept. We see that, even when subject to all of the above controls, only seeding with complex centrality is associated with a significant increase in the fraction of adopters relative to randomly selected leader households (p = 0.001, βcomplex = 0.07, CI = [0.03, 0.11]). This effect still holds when clustering standard errors at the village level (p < 0.05, βcomplex = 0.07). Overall, the above model accounts for 76% of the variance in the ability of leader households to trigger adoption of the BSS program among their network neighbors.
Supplementary Table 3. OLS model using each centrality measure to predict the fraction of each seed household's network neighborhood that adopted the BSS program (using all households as possible seeds), while controlling for all socioeconomic variables included in Banerjee et al.'s (2013) survey, with additional fixed effects at the village-level. The results are robust to varying the seeding strategy used as the referent strategy for the intercept.
Supplementary Table 3 replicates the model in Supplementary Table 2, while examining the capacity of each centrality measure to predict the fraction of each seed household's neighborhood that adopted the BSS program, when using any potential adopting household as a seed (3). The intercept identifies the expectation when randomly identifying seed households. The results are robust to varying the seeding strategy used as the referent strategy for the intercept. We see that, even when subject to all of the above controls, seeding with complex centrality is associated with a highly significant increase in the fraction of adopters relative to randomly selected households (p < 0.01, βcomplex = 0.09, CI = [0.03, 0.14]). This effect still holds when clustering standard errors at the village level (p < 0.05, βcomplex = 0.09). No other seeding strategies were identified as inducing a significant increase in the rate of adoption, relative to randomly selected seed households.
Thus, again, we see that complex centrality significantly improves the capacity to identify influential households in the spread of the BSS program, beyond extant and state-of-the-art centrality measures. Overall, the above model accounts for 53% of the variance in the ability of households to trigger adoption of the BSS program among their network neighbors.