Networks portray a multitude of interactions through which people meet, ideas are spread and infectious diseases propagate within a society1,2,3,4,5. Identifying the most efficient ‘spreaders’ in a network is an important step towards optimizing the use of available resources and ensuring the more efficient spread of information. Here we show that, in contrast to common belief, there are plausible circumstances where the best spreaders do not correspond to the most highly connected or the most central people6,7,8,9,10. Instead, we find that the most efficient spreaders are those located within the core of the network as identified by the k-shell decomposition analysis11,12,13, and that when multiple spreaders are considered simultaneously the distance between them becomes the crucial parameter that determines the extent of the spreading. Furthermore, we show that infections persist in the high-k shells of the network in the case where recovered individuals do not develop immunity. Our analysis should provide a route for an optimal design of efficient dissemination strategies.
Spreading is a ubiquitous process, which describes many important activities in society2,3,4,5. The knowledge of the spreading pathways through the network of social interactions is crucial for developing efficient methods to either hinder spreading in the case of diseases, or accelerate spreading in the case of information dissemination. Indeed, people are connected according to the way they interact with one another in society and the large heterogeneity of the resulting network greatly determines the efficiency and speed of spreading. In the case of networks with a broad degree distribution (number of links per node)6, it is believed that the most connected people (hubs) are the key players, being responsible for the largest scale of the spreading process6,7,8. Furthermore, in the context of social network theory, the importance of a node for spreading is often associated with the betweenness centrality, a measure of how many shortest paths cross through this node, which is believed to determine who has more ‘interpersonal influence’ on others9,10.
Here we argue that the topology of the network organization plays an important role such that there are plausible circumstances under which the highly connected nodes or the highest-betweenness nodes have little effect on the range of a given spreading process. For example, if a hub exists at the end of a branch at the periphery of a network, it will have a minimal impact in the spreading process through the core of the network, whereas a less connected person who is strategically placed in the core of the network will have a significant effect that leads to dissemination through a large fraction of the population. To identify the core and the periphery of the network we use the k-shell (also called k-core) decomposition of the network11,12,13,14. Examining this quantity in a number of real networks enables us to identify the best individual spreaders in the network when the spreading originates in a single node. For the case of a spreading process originating in many nodes simultaneously, we show that we can further improve the efficiency by considering spreading origins located at a determined distance from one another.
We study real-world complex networks that represent archetypical examples of social structures. We investigate (1) the friendship network between 3.4 million members of the LiveJournal.com community15, (2) the network of email contacts in the Computer Science Department of University College London (Zhou, S., private communication), (3) the contact network of inpatients (CNI) collected from hospitals in Sweden16 and (4) the network of actors who have costarred in movies labelled by imdb.com as adult17 (see Supplementary Section SI for details).
To study the spreading process we apply the susceptible–infectious–recovered (SIR) and susceptible–infectious–susceptible (SIS) models2,3,18 on the above networks (see Methods). These models have been used to describe disease spreading as well as information and rumour spreading in social processes where an actor constantly needs to be reminded19. We denote the probability that an infectious node will infect a susceptible neighbour as β. In our study we use relatively small values for β, so that the infected percentage of the population remains small. In the case of large β values, where spreading can reach a large fraction of the population, the role of individual nodes is no longer important and spreading would cover almost all the network, independently of where it originated from.
The location of a node is defined using the k-shell decomposition analysis11,12,13. This process assigns an integer index or coreness, kS, to each node, representing its location according to successive layers (k shells) in the network. The kS index is a fairly robust measure, and the node ranking is not influenced significantly in the case of incomplete information (for details see Supplementary Fig. S6 in Section SII). Small values of kS define the periphery of the network and the innermost network core corresponds to large kS (see Fig. 1a and Supplementary Section SII). Figure 1b–d illustrates the fact that the size of the population infected in a spreading process (shown in this example in the CNI network) is not necessarily related to the degree of the node, k, where the spreading started. Spreading may be very different even when it starts from hubs of similar degree, as a comparison of Fig. 1b and c shows. Instead, the location of the spreading origin given by its kS index predicts more accurately the size of the infected population. For instance, Fig. 1b and d show that nodes in the same kS layer produce similar spreading areas even if they have different k (by definition, in a given layer there could be many nodes with k≥kS).
The above example suggests that the position of the node relative to the organization of the network determines its spreading influence more than a local property of a node, such as the degree k. To quantify the influence of a given node i in an SIR spreading process we study the average size of the population Mi infected in an epidemic originating at node i with a given (kS,k). The infected population is averaged over all the origins with the same (kS,k) values:

$$M(k_S,k)=\sum_{i\in\Upsilon(k_S,k)}\frac{M_i}{N(k_S,k)},$$

where ϒ(kS,k) is the union of all N(kS,k) nodes with (kS,k) values.
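The averaging over origins that share the same (kS,k) values can be sketched in a few lines of Python. This is only an illustration: the function name `average_by_class` and the dictionary inputs (`M` for the per-origin infected fraction Mi, `k_shell` for kS, `degree` for k) are assumptions made here, not part of the original analysis.

```python
from collections import defaultdict


def average_by_class(M, k_shell, degree):
    """Average the per-origin outbreak size M[i] over all origins i
    that share the same (kS, k) pair.

    M, k_shell and degree are dicts keyed by node (hypothetical inputs).
    Returns a dict mapping (kS, k) -> mean of M[i] over that class.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, m in M.items():
        key = (k_shell[i], degree[i])
        sums[key] += m      # running sum of M_i in this class
        counts[key] += 1    # N(kS, k): number of origins in this class
    return {key: sums[key] / counts[key] for key in sums}
```

Plotting the returned values on a (kS,k) grid would give a map of the kind discussed in the text, with one cell per class of origins.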
The analysis of M(kS,k) in the studied social networks reveals three general results (see Fig. 2): (1) For a fixed degree, there is a wide spread of M(kS,k) values. In particular, there are many hubs located at the periphery of the network (large k, low kS) that are poor spreaders. (2) For a fixed kS, M(kS,k) is approximately independent of the degree of the nodes. This result is revealed in the vertically layered structure of M(kS,k), suggesting that infected nodes located in the same k shell produce similar epidemic outbreaks M(kS,k) independent of the value of k of the infection origin. (3) The most efficient spreaders are located in the inner core of the network (large kS region), fairly independently of their degree. These results indicate that the k-shell index of a node is a better predictor of spreading influence. When an outbreak starts in the core of the network (large kS) there exist many pathways through which a virus can infect the rest of the network; this result is valid regardless of the node degree. The existence of these pathways implies that, during a typical epidemic outbreak from a random origin, nodes located in high-kS layers are more likely to be infected and they will be infected earlier than other nodes (see Supplementary Section SIII). The neighbourhood of these nodes makes them more efficient in sustaining an infection in the early stages, thus enabling the epidemic to reach a critical mass such that it can fully develop. Similar results on the efficiency of high-kS nodes are obtained from the analysis of M(kS,CB) in Fig. 2, where CB is the betweenness centrality of a node in the network9,10: the value of CB is not a good predictor for spreading efficiency.
To quantify the importance of kS in spreading we calculate the ‘imprecision functions’ εkS(p), εk(p) and εCB(p). These functions estimate for each of the three indicators kS, k and CB how close to the optimal spreading is the average spreading of the pN (0<p<1) chosen origins in each case (see Methods and Supplementary Section SIV). The strategy to predict the spreading efficiency of a node based on kS is consistently more accurate than a method based on k in the studied p range (Fig. 3a). The CB-based strategy gives poor results compared with the other two strategies.
Our finding is not specific to the social networks shown in Fig. 2. In Supplementary Section SV we analyse the spreading efficiency in other networks not social in origin, such as the Internet at the router level20, with similar conclusions. The key insight of our finding is that in the studied networks a large number of hubs are located in the peripheral low-kS layers (Fig. 3b shows the location of the 25 largest hubs in the CNI; see also Supplementary Section SV) and therefore contribute poorly to spreading. The existence of hubs in the periphery is a consequence of the rich topological structure of real networks. In contrast, in a fully random network obtained by randomly rewiring a real network preserving the degree of each node (such a random network corresponds to the configuration model21; see Supplementary Section SVI) all the hubs are placed in the core of the network (see the red scatter plot in Fig. 3c) and they contribute equally well to spreading. In such a randomized structure the same information is contained in the k shell as in the degree classification because there is a one-to-one relation between the two quantities, which is approximately linear, kS∝k (Fig. 3c and Supplementary Fig. S13). Examples of real networks that are similar to a random structure are the network of product space of economic goods22 and the Internet at the AS level (analysed in Supplementary Section SV).
Our study highlights the importance of the relative location of a single spreading origin. Next, we address the question of the extent of an epidemic that starts at multiple origins simultaneously. Figure 3d shows the extent of SIR spreading in the CNI network when the outbreak simultaneously starts from the n nodes with the highest degree k or the highest kS index. Even though the high-kS nodes are the best single spreaders, in the case of multiple spreading the nodes with highest degree are more efficient than those with highest kS. This result is attributed to the overlap of the infected areas of the different spreaders: large-kS nodes tend to be clustered close to one another, whereas hubs can be more spread in the network and, in particular, they need not be connected with one another. Clearly, the step-like features in the plot of highest-kS nodes (red solid curve in Fig. 3d) suggest that the infected percentage remains constant as long as the infected nodes belong in the same k shell. Including just one node from a different k shell results in a significantly increased spreading. This result suggests that a better spreading strategy using n spreaders is to choose either the highest-k or kS nodes with the requirement that no two of the n spreaders are directly linked to each other. This scheme then provides the largest infected area of the network, as shown in Fig. 3d.
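One way to read the requirement that no two of the n spreaders be directly linked is as a greedy selection over a ranked node list. The sketch below is an assumption about how such a selection could be done, not the procedure used for Fig. 3d: the function name, the adjacency-dictionary format and the tie-breaking by input order are all hypothetical.

```python
def pick_nonadjacent_spreaders(adj, score, n):
    """Greedily pick up to n nodes in decreasing order of score[v]
    (for example degree k or k-shell index kS), skipping any node
    that is directly linked to an already chosen one.

    adj: dict node -> set of neighbours (undirected network).
    Ties in score keep the iteration order of adj (an assumption).
    """
    chosen = []
    for v in sorted(adj, key=lambda node: score[node], reverse=True):
        if all(u not in adj[v] for u in chosen):
            chosen.append(v)
            if len(chosen) == n:
                break
    return chosen
```

If fewer than n mutually unlinked nodes exist, the function simply returns the shorter list.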
Many contagious infections, including most sexually transmitted infections23, do not confer full immunity after infection as assumed in the SIR model, and therefore are suitably described by the SIS epidemic model, where an infectious node returns to the susceptible state with probability λ. In an SIS epidemic the number of infectious nodes eventually reaches a dynamic-equilibrium ‘endemic’ state, where as many infectious individuals become susceptible as susceptible nodes become infectious18. In contrast to SIR, in the initial state of our SIS simulations 20% of the network nodes are already infected. The spreading efficiency of a given node i in SIS spreading is the persistence, ρi(t), defined as the probability that node i is infected at time t (ref. 7). In an endemic SIS state, ρi(t) becomes independent of t (see Supplementary Section SVII). Previous studies have shown that the largest persistence is found in the network hubs, which are re-infected frequently owing to the large number of neighbours7,24,25. However, we find that this result holds only in randomized network structures. In the real network topologies studied here, we find that viruses persist mainly in high-kS layers instead, almost irrespective of the degree of the nodes in the core.
In the case of random networks, it is found that viruses propagate to the entire network above an epidemic threshold given by $\beta>\beta_c^{\mathrm{rand}}\equiv\lambda\langle k\rangle/\langle k^2\rangle$ (refs 24, 26). In real networks, such as the CNI network, the threshold βc is different from $\beta_c^{\mathrm{rand}}$. Furthermore, in real networks, we find that viruses can survive locally even when β<βc, but only within the high-kS layers of the network, whereas virus persistence in peripheral kS layers is negligible (Fig. 4a–c). As the k-shell structure depends on the network assortativity, the lower threshold is in agreement with the observation that high positive assortativity27 may decrease the epidemic threshold.
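For a given degree sequence, the random-network threshold λ〈k〉/〈k2〉 quoted above is straightforward to evaluate. The helper below is a small sketch; its name and the plain list-of-degrees input are assumptions made for illustration.

```python
def random_network_threshold(degrees, lam):
    """Epidemic threshold lam * <k> / <k^2> for a random network
    with the given degree sequence.

    degrees: iterable of node degrees; lam: recovery probability.
    """
    degrees = list(degrees)
    n = len(degrees)
    mean_k = sum(degrees) / n                   # <k>
    mean_k2 = sum(k * k for k in degrees) / n   # <k^2>
    return lam * mean_k / mean_k2
```

Because 〈k2〉 grows quickly with degree heterogeneity, adding a few large hubs to the degree sequence lowers the value returned by this function.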
The importance of high-kS nodes in SIS spreading is confirmed when we analyse the asymptotic probability that nodes of given (kS,k) values will be infected. This probability is quantified by the persistence function $\rho(k_S,k)\equiv\sum_{i\in\Upsilon(k_S,k)}\rho_i(t\to\infty)/N(k_S,k)$ as a function of (kS,k) at different β values (Fig. 4a and b). High-kS layers in networks might be closely related to the concept of a core group in sexually transmitted infection research23. The core groups are defined as subgroups in the general population characterized by high partner turnover rate and extensive intergroup interaction23.
Similar to the core group, the dense subnetwork formed by nodes in the innermost k shells helps the virus to consistently survive locally in the inner-core area and infect other nodes adjacent to the area. These k shells preserve the existence of a virus, in contrast to, for example, isolated hubs at the periphery. Note that a virus cannot survive in the degree-preserving randomized version of the CNI network, owing to the absence of high-k shells.
The importance of the inner-core nodes in spreading is not influenced by the infection probability values, β. In both models, SIS and SIR, we find that the persistence ρ or the average infected fraction M, respectively, is systematically larger for nodes in inner k shells compared with nodes in outer k shells, over the entire β range that we studied (Fig. 4c,d). Thus, the k-shell measure is a robust indicator for the spreading efficiency of a node.
Finding the most accurate ranking of individual nodes for spreading in a population can influence the success of dissemination strategies. When spreading starts from a single node the kS value is enough for this ranking, whereas in the case of many simultaneous origins spreading is greatly enhanced when we additionally repel the spreaders with large degree or kS. In the case of infections that do not confer immunity on recovered individuals, the core of the network in the large-kS layers forms a reservoir where infection can survive locally.
The k-shell decomposition.
Nodes are assigned to k shells according to their remaining degree, which is obtained by successive pruning of nodes with degree smaller than the kS value of the current layer. We start by removing all nodes with degree k=1. After removing all the nodes with k=1, some nodes may be left with one link, so we continue pruning the system iteratively until there is no node left with k=1 in the network. The removed nodes, along with the corresponding links, form a k shell with index kS=1. In a similar fashion, we iteratively remove the next k shell, kS=2, and continue removing higher-k shells until all nodes are removed. As a result, each node is associated with one kS index, and the network can be viewed as the union of all k shells. The resulting classification of a node can be very different than when the degree k is used.
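The pruning procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `k_shell` and the adjacency-dictionary input are assumptions, and isolated nodes, which the text does not discuss, are assigned kS=0 here.

```python
def k_shell(adj):
    """Assign a k-shell index kS to every node by iterative pruning.

    adj: dict node -> set of neighbours (undirected, no self-loops).
    Returns a dict node -> kS.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}  # remaining degree
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        to_prune = [v for v in remaining if degree[v] <= k]
        if not to_prune:
            k += 1  # nothing left to prune at this level; move inwards
            continue
        while to_prune:
            v = to_prune.pop()
            if v not in remaining:
                continue  # already removed via another neighbour
            remaining.discard(v)
            shell[v] = k
            # Removing v lowers the remaining degree of its neighbours,
            # which may push them into the current shell as well.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        to_prune.append(u)
    return shell
```

For a triangle with a four-link hub attached to one of its corners, where the hub's other three links go to leaf nodes, the sketch places the hub in shell kS=1 despite its degree of 4, and the three triangle nodes in kS=2.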
The spreading models.
To study the spreading process we apply the SIR and SIS models. In the SIR model, all nodes are initially in the susceptible state (S) except for one node in the infectious state (I). At each time step, the I nodes infect their susceptible neighbours with probability β and then enter the recovered state (R), where they become immunized and cannot be infected again. The SIS model aims to describe spreading processes that do not confer immunity on recovered individuals: infected individuals still infect their neighbours with probability β, but at each time step they return to the susceptible state with probability λ (here we use λ=0.8), after which they can be reinfected at subsequent time steps; otherwise, with probability 1−λ, they remain infectious.
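The two update rules can be written down compactly. The sketch below follows the description above under some stated assumptions: updates are synchronous, the outbreak size counts the origin itself, and all function names, the adjacency-dictionary format and the `runs` and `seed` parameters are hypothetical choices made for illustration.

```python
import random


def sir_outbreak_size(adj, origin, beta, rng):
    """Number of nodes ever infected in one SIR run started at origin.

    Each infectious node tries to infect every susceptible neighbour
    with probability beta, then recovers at the end of the time step.
    """
    infectious = {origin}
    recovered = set()
    while infectious:
        newly_infected = set()
        for v in infectious:
            for u in adj[v]:
                susceptible = (u not in infectious and u not in recovered
                               and u not in newly_infected)
                if susceptible and rng.random() < beta:
                    newly_infected.add(u)
        recovered |= infectious      # I -> R after one time step
        infectious = newly_infected
    return len(recovered)


def mean_infected_fraction(adj, origin, beta, runs=100, seed=0):
    """Estimate M_i: average infected fraction over repeated SIR runs."""
    rng = random.Random(seed)
    total = sum(sir_outbreak_size(adj, origin, beta, rng) for _ in range(runs))
    return total / (runs * len(adj))


def sis_step(adj, infected, beta, lam, rng):
    """One synchronous SIS update; returns the new set of infected nodes."""
    nxt = set()
    for v in infected:
        for u in adj[v]:
            if u not in infected and rng.random() < beta:
                nxt.add(u)
        # v becomes susceptible with probability lam, else stays infectious
        if rng.random() >= lam:
            nxt.add(v)
    return nxt
```

Iterating `sis_step` from a random initial set and recording how often each node appears in the infected set would give an estimate of the persistence in the endemic state.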
The imprecision function.
The betweenness centrality, CB(i), of a node i is defined as follows: Consider two nodes s and t and the set σst of all possible shortest paths between these two nodes. If the subset of this set that contains the paths that pass through the node i is denoted by σst(i), then the betweenness centrality of this node is given by

$$C_B(i)=\sum_{s\neq t}\frac{|\sigma_{st}(i)|}{|\sigma_{st}|},$$

where |·| denotes the number of paths in a set and the sum runs over all nodes s and t in the network.
The imprecision function ε(p) quantifies the difference between the average spreading of the pN nodes (0<p<1) with highest kS, k or CB and the average spreading of the pN most efficient spreaders (N is the number of nodes in the network). Thus, it tests the merit of using kS, k and CB to identify the most efficient spreaders. For a given β value and a given fraction of the system p we first identify the set of the pN most efficient spreaders as measured by Mi (we designate this set by ϒeff). Similarly, we identify the pN individuals with the highest k-shell index (ϒkS). We define the imprecision of k-shell identification as εkS(p)≡1−MkS/Meff, where MkS and Meff are the average infected percentages averaged over the ϒkS and ϒeff groups of nodes respectively. εk and εCB are defined similarly to εkS.
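Given the per-origin spreading Mi and any node ranking, the imprecision defined above reduces to two sorted selections. The following sketch makes a few assumptions not fixed by the text: pN is rounded down (with a minimum of one node), ties in the ranking are broken by input order, and the function name and dictionary inputs are hypothetical.

```python
def imprecision(M, score, p):
    """Imprecision 1 - M_score / M_eff for a ranking given by score.

    M: dict node -> average infected fraction M_i for that origin.
    score: dict node -> indicator value (kS, k or C_B).
    p: fraction of nodes to select, 0 < p < 1.
    """
    n = max(1, int(p * len(M)))  # size of each selected group
    top_by_score = sorted(M, key=lambda i: score[i], reverse=True)[:n]
    top_by_spread = sorted(M, key=lambda i: M[i], reverse=True)[:n]
    m_score = sum(M[i] for i in top_by_score) / n
    m_eff = sum(M[i] for i in top_by_spread) / n
    return 1 - m_score / m_eff
```

A value of 0 means the indicator selects nodes that spread as well as the true best spreaders; larger values indicate a less accurate indicator.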
We thank NSF-SES, NSF-EF, ONR, DTRA, Epiwork and the Israel Science Foundation for support. F.L. is supported by Riksbankens Jubileumsfond. We thank L. Braunstein, J. Brujić, kc claffy, D. Krioukov and C. Song for discussions and S. Zhou for providing the email dataset. The use of the hospital dataset was approved by the Regional Ethical Review Board in Stockholm (Record 2004=5:8).
Journal of Systems Science and Complexity (2019)