Letter

Identification of influential spreaders in complex networks

Nature Physics volume 6, pages 888–893 (2010)

Abstract

Networks portray a multitude of interactions through which people meet, ideas are spread and infectious diseases propagate within a society1,2,3,4,5. Identifying the most efficient ‘spreaders’ in a network is an important step towards optimizing the use of available resources and ensuring the more efficient spread of information. Here we show that, in contrast to common belief, there are plausible circumstances where the best spreaders do not correspond to the most highly connected or the most central people6,7,8,9,10. Instead, we find that the most efficient spreaders are those located within the core of the network as identified by the k-shell decomposition analysis11,12,13, and that when multiple spreaders are considered simultaneously the distance between them becomes the crucial parameter that determines the extent of the spreading. Furthermore, we show that infections persist in the high-k shells of the network in the case where recovered individuals do not develop immunity. Our analysis should provide a route for an optimal design of efficient dissemination strategies.

Main

Spreading is a ubiquitous process, which describes many important activities in society2,3,4,5. The knowledge of the spreading pathways through the network of social interactions is crucial for developing efficient methods to either hinder spreading in the case of diseases, or accelerate spreading in the case of information dissemination. Indeed, people are connected according to the way they interact with one another in society and the large heterogeneity of the resulting network greatly determines the efficiency and speed of spreading. In the case of networks with a broad degree distribution (number of links per node)6, it is believed that the most connected people (hubs) are the key players, being responsible for the largest scale of the spreading process6,7,8. Furthermore, in the context of social network theory, the importance of a node for spreading is often associated with the betweenness centrality, a measure of how many shortest paths cross through this node, which is believed to determine who has more ‘interpersonal influence’ on others9,10.

Here we argue that the topology of the network organization plays an important role such that there are plausible circumstances under which the highly connected nodes or the highest-betweenness nodes have little effect on the range of a given spreading process. For example, if a hub exists at the end of a branch at the periphery of a network, it will have a minimal impact in the spreading process through the core of the network, whereas a less connected person who is strategically placed in the core of the network will have a significant effect that leads to dissemination through a large fraction of the population. To identify the core and the periphery of the network we use the k-shell (also called k-core) decomposition of the network11,12,13,14. Examining this quantity in a number of real networks enables us to identify the best individual spreaders in the network when the spreading originates in a single node. For the case of a spreading process originating in many nodes simultaneously, we show that we can further improve the efficiency by considering spreading origins located at a determined distance from one another.

We study real-world complex networks that represent archetypical examples of social structures. We investigate (1) the friendship network between 3.4 million members of the LiveJournal.com community15, (2) the network of email contacts in the Computer Science Department of University College London (Zhou, S., private communication), (3) the contact network of inpatients (CNI) collected from hospitals in Sweden16 and (4) the network of actors who have costarred in movies labelled by imdb.com as adult17 (see Supplementary Section SI for details).

To study the spreading process we apply the susceptible–infectious–recovered (SIR) and susceptible–infectious–susceptible (SIS) models2,3,18 on the above networks (see Methods). These models have been used to describe disease spreading as well as information and rumour spreading in social processes where an actor constantly needs to be reminded19. We denote the probability that an infectious node will infect a susceptible neighbour as β. In our study we use relatively small values for β, so that the infected percentage of the population remains small. In the case of large β values, where spreading can reach a large fraction of the population, the role of individual nodes is no longer important and spreading would cover almost all the network, independently of where it originated from.

The location of a node is defined using the k-shell decomposition analysis11,12,13. This process assigns an integer index or coreness, kS, to each node, representing its location according to successive layers (k shells) in the network. The kS index is a quite robust measure and the node ranking is not influenced significantly in the case of incomplete information (for details see Supplementary Fig. S6 in Section SII). Small values of kS define the periphery of the network and the innermost network core corresponds to large kS (see Fig. 1a and Supplementary Section SII). Figure 1b–d illustrates the fact that the size of the population infected in a spreading process (shown in this example in the CNI network) is not necessarily related to the degree of the node, k, where the spreading started. Spreading may be very different even when it starts from hubs of similar degrees, as the comparison of Fig. 1b and c shows. Instead, the location of the spreading origin given by its kS index predicts more accurately the size of the infected population. For instance, Fig. 1b and d show that nodes in the same kS layer produce similar spreading areas even if they have different k (by definition, in a given layer there could be many nodes with k≥kS).

Figure 1: When the hubs may not be good spreaders.

a, A schematic representation of a network under the k-shell decomposition. The two nodes of degree k=8 (blue and yellow nodes) in this network are in different locations: one lies at the periphery (kS=1) whereas the other hub is in the innermost core of the network, that is, it has the largest kS (kS=3). b–d, The extent of the efficiency of the spreading process cannot be accurately predicted on the basis of a measure of the immediate neighbourhood of the node, such as the degree k. For the contact network of inpatients (CNI), we compare infections originating from single nodes with the same degree k=96 (nodes A and B) or the same index kS=63 (nodes A and C), with infection probability β=0.035. In the corresponding plots, the colours indicate the probability that a node will be infected when spreading starts in the corresponding origin, as long as this probability is higher than 25%. The results are based on 10,000 different realizations for each case. In the first case, where origin A has kS=63, spreading reaches a much wider area more frequently, in contrast to origin B (kS=26), where the infection remains largely localized in the immediate neighbourhood of B. Spreading is very similar between origins A and C, which have the same kS value, although the degree of C is much smaller than A. The importance of the network organization is also highlighted when we randomly rewire the network (preserving the same degree for all nodes). In this case the standard picture is recovered: the extent of spreading coincides and both hubs contribute equally well to spreading (see Supplementary Section SVI).

The above example suggests that the position of the node relative to the organization of the network determines its spreading influence more than a local property of a node, such as the degree k. To quantify the influence of a given node i in an SIR spreading process we study the average size of the population Mi infected in an epidemic originating at node i with a given (kS,k). The infected population is averaged over all the origins with the same (kS,k) values:

$$M(k_S,k)=\sum_{i\in\Upsilon(k_S,k)}\frac{M_i}{N(k_S,k)}$$

where ϒ(kS,k) is the union of all N(kS,k) nodes with (kS,k) values.
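The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_by_group` and the dictionary-based inputs (`M`, `k_shell`, `degree`, each keyed by node) are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict

def average_by_group(M, k_shell, degree):
    """Average the per-origin outbreak sizes M[i] over all origins
    that share the same (kS, k) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, size in M.items():
        key = (k_shell[node], degree[node])
        totals[key] += size   # running sum of M_i in this (kS, k) group
        counts[key] += 1      # number of origins N(kS, k) in this group
    return {key: totals[key] / counts[key] for key in totals}
```

Each key of the returned dictionary is a (kS, k) pair, so the result can be plotted directly as a two-dimensional map of the kind shown in Fig. 2.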

The analysis of M(kS,k) in the studied social networks reveals three general results (see Fig. 2): (1) For a fixed degree, there is a wide spread of M(kS,k) values. In particular, there are many hubs located at the periphery of the network (large k, low kS) that are poor spreaders. (2) For a fixed kS, M(kS,k) is approximately independent of the degree of the nodes. This result is revealed in the vertically layered structure of M(kS,k), suggesting that infected nodes located in the same k shell produce similar epidemic outbreaks M(kS,k) independent of the value of k of the infection origin. (3) The most efficient spreaders are located in the inner core of the network (large kS region), fairly independently of their degree. These results indicate that the k-shell index of a node is a better predictor of spreading influence. When an outbreak starts in the core of the network (large kS) there exist many pathways through which a virus can infect the rest of the network; this result is valid regardless of the node degree. The existence of these pathways implies that, during a typical epidemic outbreak from a random origin, nodes located in high-kS layers are more likely to be infected and they will be infected earlier than other nodes (see Supplementary Section SIII). The neighbourhood of these nodes makes them more efficient in sustaining an infection in the early stages, thus enabling the epidemic to reach a critical mass such that it can fully develop. Similar results on the efficiency of high-kS nodes are obtained from the analysis of M(kS,CB) in Fig. 2, where CB is the betweenness centrality of a node in the network9,10: the value of CB is not a good predictor for spreading efficiency.

Figure 2: The k-shell index predicts the outcome of spreading more reliably than the degree k or the betweenness centrality CB.

The networks used are (top to bottom) email contacts (β=8%), the CNI network (β=4%), the actor network (β=1%) and the LiveJournal.com friendship network (β=1.5%). a,c,e,g, Average infected size of the population M(kS,k) when spreading originates in nodes with (kS,k). b,d,f,h, The infected size M(kS,CB) when spreading originates in nodes of a given combination of kS and CB. In both cases, spreading is larger for nodes of higher kS, whereas nodes of a given k or CB value can result in either small or large spreading, depending on the value of kS. (There is an exception at large kS and small k of the LiveJournal database, which is due to artificial closed groups of virtual characters that connect with one another for the purpose of online gaming and, unlike the rest of the database, do not correspond to regular users.)

To quantify the importance of kS in spreading we calculate the ‘imprecision functions’ εkS(p), εk(p) and εCB(p). These functions estimate for each of the three indicators kS, k and CB how close to the optimal spreading is the average spreading of the pN (0<p<1) chosen origins in each case (see Methods and Supplementary Section SIV). The strategy to predict the spreading efficiency of a node based on kS is consistently more accurate than a method based on k in the studied p range (Fig. 3a). The CB-based strategy gives poor results compared with the other two strategies.

Figure 3: k-shell structure of the CNI network.

a, The imprecision functions εkS(p), εk(p) and εCB(p), for β=4%. Even though both k-shell and k identification strategies yield comparable results for p=2%, the k-shell strategy is consistently more accurate for 2%<p<10%, with εkS approximately half εk. The CB identification of the most efficient spreaders is the least accurate, with εCB exceeding 40%. b, We visualize the CNI network as a set of concentric circles of nodes representing inpatients, each circle corresponding to a particular k shell. The kS indices of a given layer increase as we move from the periphery to the centre of the network28,29. Node size is proportional to the logarithm of the degree of the node. We highlight the 25 inpatients with the largest degree values. Note that inpatients with high k values are not concentrated at the ‘centre’ of the network but instead are scattered throughout different k shells. We highlight the position of the three nodes, A, B and C, of the origins that were used in the example of Fig. 1. c, Scatter plot of the node degree k as a function of kS for all the nodes in the CNI network (black symbols) and the degree-preserving randomized version of the same network (red symbols). Note that there are many inpatients with large k and low kS values in the original network, whereas in the randomized network all the hubs are located in the inner core of the network. We also show the positions of the three origins used in Fig. 1. d, When spreading starts from multiple origins, the set of nodes with highest degree (blue continuous line) can spread significantly more than the set of highest-kS nodes (red continuous line), because in the latter case most of these nodes are connected to one another. If we only consider in this set nodes that are not directly linked, then both the sets of highest-k or kS nodes yield a similar result (dashed lines), where spreading is significantly enhanced. Results are shown for β=3% in the CNI.

Our finding is not specific to the social networks shown in Fig. 2. In Supplementary Section SV we analyse the spreading efficiency in other networks not social in origin, such as the Internet at the router level20, with similar conclusions. The key insight of our finding is that in the studied networks a large number of hubs are located in the peripheral low-kS layers (Fig. 3b shows the location of the 25 largest hubs in the CNI; see also Supplementary Section SV) and therefore contribute poorly to spreading. The existence of hubs in the periphery is a consequence of the rich topological structure of real networks. In contrast, in a fully random network obtained by randomly rewiring a real network preserving the degree of each node (such a random network corresponds to the configuration model21; see Supplementary Section SVI) all the hubs are placed in the core of the network (see the red scatter plot in Fig. 3c) and they contribute equally well to spreading. In such a randomized structure the same information is contained in the k shell as in the degree classification because there is a one-to-one relation between the two quantities, which is approximately linear, kS ∝ k (Fig. 3c and Supplementary Fig. S13). Examples of real networks that are similar to a random structure are the network of product space of economic goods22 and the Internet at the AS level (analysed in Supplementary Section SV).
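One common way to obtain a degree-preserving randomization is to apply repeated double-edge swaps, and the sketch below illustrates that idea. It is not claimed to be the procedure used for the results above; the function name `degree_preserving_rewire`, the edge-list input and the fixed retry budget are all assumptions made for illustration, and the graph is assumed to be simple and undirected.

```python
import random

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by double-edge swaps:
    (a,b),(c,d) -> (a,d),(c,b). Every node keeps its degree."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:      # try both orientations of the second edge
            c, d = d, c
        if a == d or c == b:        # swap would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # would duplicate an edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges
```

Rejected swaps are simply retried, so the number of successful swaps can fall short of `n_swaps` on very small or very dense graphs.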

Our study highlights the importance of the relative location of a single spreading origin. Next, we address the question of the extent of an epidemic that starts at multiple origins simultaneously. Figure 3d shows the extent of SIR spreading in the CNI network when the outbreak simultaneously starts from the n nodes with the highest degree k or the highest kS index. Even though the high-kS nodes are the best single spreaders, in the case of multiple spreading the nodes with highest degree are more efficient than those with highest kS. This result is attributed to the overlap of the infected areas of the different spreaders: large-kS nodes tend to be clustered close to one another, whereas hubs can be more spread in the network and, in particular, they need not be connected with one another. Clearly, the step-like features in the plot of highest-kS nodes (red solid curve in Fig. 3d) suggest that the infected percentage remains constant as long as the infected nodes belong in the same k shell. Including just one node from a different k shell results in a significantly increased spreading. This result suggests that a better spreading strategy using n spreaders is to choose either the highest-k or kS nodes with the requirement that no two of the n spreaders are directly linked to each other. This scheme then provides the largest infected area of the network, as shown in Fig. 3d.
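The selection rule suggested here, choosing high-ranked nodes while requiring that no two of them are directly linked, can be sketched as a greedy pass over a ranked node list. The greedy order, the function name `pick_nonadjacent_spreaders` and the `adj`/`score` inputs are assumptions for illustration; `score` can hold either the degree k or the kS index.

```python
def pick_nonadjacent_spreaders(adj, score, n):
    """Greedily pick up to n nodes in decreasing order of score,
    skipping any node that is a neighbour of an already chosen one."""
    chosen = []
    blocked = set()
    for node in sorted(adj, key=lambda v: score[v], reverse=True):
        if node in blocked:
            continue
        chosen.append(node)
        blocked.add(node)
        blocked.update(adj[node])   # neighbours may no longer be selected
        if len(chosen) == n:
            break
    return chosen
```

The function may return fewer than n nodes when the graph does not contain n mutually non-adjacent nodes in the ranked order.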

Many contagious infections, including most sexually transmitted infections23, do not confer full immunity after infection as assumed in the SIR model, and therefore are suitably described by the SIS epidemic model, where an infectious node returns to the susceptible state with probability λ. In an SIS epidemic the number of infectious nodes eventually reaches a dynamic-equilibrium ‘endemic’ state, where as many infectious individuals become susceptible as susceptible nodes become infectious18. In contrast to SIR, in the initial state of our SIS simulations 20% of the network nodes are already infected. The spreading efficiency of a given node i in SIS spreading is the persistence, ρi(t), defined as the probability that node i is infected at time t (ref. 7). In an endemic SIS state, ρi(t) becomes independent of t (see Supplementary Section SVII). Previous studies have shown that the largest persistence is found in the network hubs, which are re-infected frequently owing to the large number of neighbours7,24,25. However, we find that this result holds only in randomized network structures. In the real network topologies studied here, we find that viruses persist mainly in high-kS layers instead, almost irrespectively of the degree of the nodes in the core.
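A minimal sketch of how the persistence of each node could be estimated from one SIS run is given below. It rests on several assumptions that are not stated in the text: updates are synchronous, an infectious node can still transmit in the step in which it recovers, and persistence is estimated as the fraction of post-burn-in steps in which a node is infected. The function name and the parameters `steps` and `burn_in` are hypothetical.

```python
import random

def sis_persistence(adj, beta, lam, steps, burn_in,
                    initial_fraction=0.2, seed=0):
    """Estimate rho_i as the fraction of time steps after burn_in
    in which node i is infectious, from a single SIS run."""
    rng = random.Random(seed)
    nodes = list(adj)
    n_start = max(1, int(initial_fraction * len(nodes)))
    infected = set(rng.sample(nodes, n_start))
    counts = dict.fromkeys(nodes, 0)
    for t in range(steps):
        new_infections = set()
        for v in infected:
            for u in adj[v]:
                if u not in infected and rng.random() < beta:
                    new_infections.add(u)
        # each infectious node returns to the susceptible state with prob. lam
        recovering = {v for v in infected if rng.random() < lam}
        infected = (infected - recovering) | new_infections
        if t >= burn_in:
            for v in infected:
                counts[v] += 1
    return {v: counts[v] / (steps - burn_in) for v in nodes}
```

In practice the estimate would be averaged over many runs and checked for independence of `burn_in` before being grouped by (kS,k).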

In the case of random networks, it is found that viruses propagate to the entire network above an epidemic threshold given by $\beta > \beta_c^{\mathrm{rand}} \equiv \lambda\langle k\rangle/\langle k^2\rangle$ (refs 24, 26). In real networks, such as the CNI network, the threshold βc is different from $\beta_c^{\mathrm{rand}}$. Furthermore, in real networks, we find that viruses can survive locally even when β<βc, but only within the high-kS layers of the network, whereas virus persistence in peripheral kS layers is negligible (Fig. 4a–c). As the k-shell structure depends on the network assortativity, the lower threshold is in agreement with the observation that high positive assortativity27 may decrease the epidemic threshold.
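As a small worked illustration, and assuming the random-network threshold takes the form λ⟨k⟩/⟨k²⟩ with ⟨k⟩ and ⟨k²⟩ the first and second moments of the degree sequence, the value can be computed directly from a list of node degrees. The function name is hypothetical.

```python
def random_network_threshold(degrees, lam):
    """Return lam * <k> / <k^2> for a given degree sequence."""
    n = len(degrees)
    mean_k = sum(degrees) / n                   # first moment <k>
    mean_k2 = sum(k * k for k in degrees) / n   # second moment <k^2>
    return lam * mean_k / mean_k2
```

Because ⟨k²⟩ grows quickly when a few nodes have very large degree, a broad degree distribution pushes this threshold towards small values.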

Figure 4: SIS spreading in the CNI network and β dependence for SIS and SIR.

a,b, Virus persistence ρ(kS,k) as a function of k and kS values of inpatients in the CNI network for β=2% and β=4%, respectively, where 20% of the individuals are initially infected. The infection survives mainly in nodes with large kS values. c, We form four groups of nodes of the CNI network on the basis of their k-shell values. For all values of β, the average virus persistence ρ is consistently higher in the inner k shells. d, Influence of the infection probability β on the spreading efficiency of nodes, grouped according to their k-shell values, for SIR spreading. The solid black line refers to the average infected percentage over all network nodes. Nodes in higher-k shells are consistently the most efficient, independently of the β value.

The importance of high-kS nodes in SIS spreading is confirmed when we analyse the asymptotic probability that nodes of given (kS,k) values will be infected. This probability is quantified by the persistence function ρ(kS,k), evaluated as a function of (kS,k) at different β values (Fig. 4a and b). High-kS layers in networks might be closely related to the concept of a core group in sexually transmitted infection research23. The core groups are defined as subgroups in the general population characterized by high partner turnover rate and extensive intergroup interaction23.

Similar to the core group, the dense subnetwork formed by nodes in the innermost k shells helps the virus to consistently survive locally in the inner-core area and infect other nodes adjacent to the area. These k shells preserve the existence of a virus, in contrast to, for example, isolated hubs at the periphery. Note that a virus cannot survive in the degree-preserving randomized version of the CNI network, owing to the absence of high-k shells.

The importance of the inner-core nodes in spreading is not influenced by the infection probability values, β. In both models, SIS and SIR, we find that the persistence ρ or the average infected fraction M, respectively, is systematically larger for nodes in inner k shells compared with nodes in outer k shells, over the entire β range that we studied (Fig. 4c,d). Thus, the k-shell measure is a robust indicator for the spreading efficiency of a node.

Finding the most accurate ranking of individual nodes for spreading in a population can influence the success of dissemination strategies. When spreading starts from a single node the kS value is enough for this ranking, whereas in the case of many simultaneous origins spreading is greatly enhanced when we additionally repel the spreaders with large degree or kS. In the case of infections that do not confer immunity on recovered individuals, the core of the network in the large-kS layers forms a reservoir where infection can survive locally.

Methods

The k-shell decomposition.

Nodes are assigned to k shells according to their remaining degree, which is obtained by successive pruning of nodes with degree smaller than the kS value of the current layer. We start by removing all nodes with degree k=1. After removing all the nodes with k=1, some nodes may be left with one link, so we continue pruning the system iteratively until there is no node left with k=1 in the network. The removed nodes, along with the corresponding links, form a k shell with index kS=1. In a similar fashion, we iteratively remove the next k shell, kS=2, and continue removing higher-k shells until all nodes are removed. As a result, each node is associated with one kS index, and the network can be viewed as the union of all k shells. The resulting classification of a node can be very different from the one obtained when the degree k is used.
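The pruning procedure can be sketched as follows. This is a minimal illustration, assuming an undirected graph without self-loops stored as a dictionary `adj` that maps each node to its neighbours; the function name is hypothetical, and isolated nodes are assigned kS=0, a convention the text does not address.

```python
def k_shell_decomposition(adj):
    """Return {node: kS} by iteratively pruning nodes whose
    remaining degree is at most the current shell index k."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    k_shell = {}
    k = 0
    while remaining:
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:      # already pruned via another path
                continue
            remaining.remove(v)
            k_shell[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1      # u loses the link to the pruned node
                    if degree[u] <= k:
                        stack.append(u)
        k += 1
    return k_shell
```

For a star with one hub and eight leaves, every node, including the hub of degree 8, ends up in the kS=1 shell, which is the situation of the peripheral hub in Fig. 1a.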

The spreading models.

To study the spreading process we apply the SIR and SIS models. In the SIR model, all nodes are initially in the susceptible state (S) except for one node in the infectious state (I). At each time step, the I nodes infect their susceptible neighbours with probability β and then enter the recovered state (R), where they become immunized and cannot be infected again. The SIS model aims to describe spreading processes that do not confer immunity on recovered individuals: infected individuals still infect their neighbours with probability β, but at each time step they either return to the susceptible state with probability λ (here we use λ=0.8), after which they can be reinfected at subsequent time steps, or remain infectious with probability 1−λ.
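The SIR rule described above can be sketched as a short simulation. The sketch assumes synchronous time steps and counts the origin itself in the outbreak size; both choices, together with the function names and the `runs` and `seed` parameters, are assumptions made for illustration.

```python
import random

def sir_outbreak_size(adj, origin, beta, rng):
    """Run one SIR realization from a single origin and return
    the number of nodes that were ever infected."""
    infectious = {origin}
    recovered = set()
    while infectious:
        newly = set()
        for v in infectious:
            for u in adj[v]:
                susceptible = (u not in infectious and
                               u not in recovered and u not in newly)
                if susceptible and rng.random() < beta:
                    newly.add(u)
        recovered |= infectious     # I nodes recover after one time step
        infectious = newly
    return len(recovered)

def mean_outbreak_fraction(adj, origin, beta, runs=1000, seed=0):
    """Average infected fraction M_i over independent realizations."""
    rng = random.Random(seed)
    total = sum(sir_outbreak_size(adj, origin, beta, rng)
                for _ in range(runs))
    return total / (runs * len(adj))
```

Calling `mean_outbreak_fraction` once per origin gives the per-node values Mi that the grouped averages and the imprecision functions are built from.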

The imprecision function.

The betweenness centrality, CB(i), of a node i is defined as follows: Consider two nodes s and t and the set σst of all possible shortest paths between these two nodes. If the subset of this set that contains the paths that pass through the node i is denoted by σst(i), then the betweenness centrality of this node is given by

$$C_B(i)=\sum_{s\neq t}\frac{\sigma_{st}(i)}{\sigma_{st}}$$

where the sum runs over all nodes s and t in the network and the ratio compares the numbers of paths in the two sets.
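For small graphs, this definition can be evaluated directly by counting shortest paths with breadth-first search. The sketch below assumes the standard form in which CB(i) sums, over ordered pairs s≠t with i different from both endpoints, the fraction of shortest s–t paths that pass through i. It is a cubic-time illustration with hypothetical function names, not an efficient algorithm for networks of the size studied here.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance from source and the number
    of distinct shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                sigma[u] = 0
                queue.append(u)
            if dist[u] == dist[v] + 1:
                sigma[u] += sigma[v]
    return dist, sigma

def betweenness(adj):
    """Sum over ordered pairs (s, t) of the fraction of shortest
    s-t paths through each intermediate node i."""
    info = {s: bfs_counts(adj, s) for s in adj}
    cb = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for i in dist_s:
                if i == s or i == t:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on a shortest s-t path iff the distances add up
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    cb[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return cb
```

Because the sum runs over ordered pairs, each undirected pair contributes twice; dividing by two gives the convention used by some software packages.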

The imprecision function ε(p) quantifies the difference between the average spreading of the pN nodes (0<p<1) with highest kS, k or CB and the average spreading of the pN most efficient spreaders (N is the number of nodes in the network). Thus, it tests the merit of using k shell, k and CB to identify the most efficient spreaders. For a given β value and a given fraction of the system p we first identify the set of the pN most efficient spreaders as measured by Mi (we designate this set by ϒeff). Similarly, we identify the pN individuals with the highest k-shell index (ϒkS). We define the imprecision of k-shell identification as εkS(p)≡1−MkS/Meff, where MkS and Meff are the infected percentages averaged over the ϒkS and ϒeff groups of nodes, respectively. εk and εCB are defined similarly to εkS.
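The definition translates into a short function once the per-node spreading values Mi are available. In this sketch the inputs `M` and `score` are hypothetical dictionaries keyed by node, pN is rounded down to an integer (with at least one node kept), and ties in the ranking are broken by input order; none of these choices is specified in the text.

```python
def imprecision(M, score, p):
    """Return 1 - M_score / M_eff, comparing the top pN nodes ranked
    by score with the top pN nodes ranked by actual spreading M."""
    n_top = max(1, int(p * len(M)))
    by_score = sorted(M, key=lambda v: score[v], reverse=True)[:n_top]
    by_spread = sorted(M, key=lambda v: M[v], reverse=True)[:n_top]
    m_score = sum(M[v] for v in by_score) / n_top
    m_eff = sum(M[v] for v in by_spread) / n_top
    return 1.0 - m_score / m_eff
```

Passing the kS index, the degree k or the betweenness CB as `score` gives εkS, εk and εCB respectively; a value of 0 means the indicator selects nodes that spread as well as the true top spreaders.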

References

  1. Caldarelli, G. & Vespignani, A. (eds) Large Scale Structure and Dynamics of Complex Networks (World Scientific, 2007).
  2. Infectious Diseases of Humans: Dynamics and Control (Oxford Science Publications, 1992).
  3. Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation (Wiley Series in Mathematical & Computational Biology, 2000).
  4. Modeling Infectious Diseases in Humans and Animals (Princeton Univ. Press, 2008).
  5. Diffusion of Innovation 4th edn (Free Press, 1995).
  6. Error and attack tolerance of complex networks. Nature 406, 378–382 (2000).
  7. Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86, 3200–3203 (2001).
  8. Breakdown of the Internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685 (2001).
  9. Centrality in social networks: Conceptual clarification. Social Networks 1, 215–239 (1979).
  10. Theoretical foundations for centrality measures. Am. J. Sociology 96, 1478–1504 (1991).
  11. Graph Theory and Combinatorics: Proceedings of the Cambridge Combinatorial Conference in Honor of P. Erdös Vol. 35 (Academic, 1984).
  12. Network structure and minimum degree. Social Networks 5, 269–287 (1983).
  13. A model of Internet topology using k-shell decomposition. Proc. Natl Acad. Sci. USA 104, 11150–11154 (2007).
  14. Clustering in complex networks. II. Percolation properties. Phys. Rev. E 74, 056116 (2006).
  15. LiveJournal.
  16. The contact network of inpatients in a regional healthcare system. A longitudinal case study. Math. Population Studies 14, 269–284 (2007).
  17. The Internet Movie Database.
  18. The mathematics of infectious diseases. SIAM Rev. 42, 599–653 (2000).
  19. Statistical Physics of Social Dynamics. Rev. Mod. Phys. 81, 591–646 (2009).
  20. DIMES: Let the internet measure itself. ACM SIGCOMM Comput. Commun. Rev. 35, 71–74 (2005).
  21. A critical point for random graphs with a given degree sequence. Random Struct. Algorithms 6, 161–180 (1995).
  22. The product space conditions the development of nations. Science 317, 482–487 (2007).
  23. Gonorrhea Transmission Dynamics and Control (Springer-Verlag, 1984).
  24. Immunization of complex networks. Phys. Rev. E 65, 036104 (2002).
  25. Halting viruses in scale-free networks. Phys. Rev. E 65, 055103 (2002).
  26. Resilience of the Internet to random breakdowns. Phys. Rev. Lett. 85, 4626–4630 (2000).
  27. Assortative mixing in networks. Phys. Rev. Lett. 89, 208701 (2002).
  28. Large Network visualization tool.
  29. Large scale networks fingerprinting and visualization using the k-core decomposition. Adv. Neural Inform. Process. Systems 18, 41–51 (2006).

Acknowledgements

We thank NSF-SES, NSF-EF, ONR, DTRA, Epiwork and the Israel Science Foundation for support. F.L. is supported by Riksbankens Jubileumsfond. We thank L. Braunstein, J. Brujić, kc claffy, D. Krioukov and C. Song for discussions and S. Zhou for providing the email dataset. The use of the hospital dataset was approved by the Regional Ethical Review Board in Stockholm (Record 2004/5:8).

Author information

Affiliations

  1. Center for Polymer Studies and Physics Department, Boston University, Boston, Massachusetts 02215, USA

    • Maksim Kitsak & H. Eugene Stanley
  2. Cooperative Association for Internet Data Analysis (CAIDA), University of California-San Diego, La Jolla, California 92093, USA

    • Maksim Kitsak
  3. Levich Institute and Physics Department, City College of New York, New York, New York 10031, USA

    • Lazaros K. Gallos & Hernán A. Makse
  4. Minerva Center and Department of Physics, Bar-Ilan University, Ramat Gan, Israel

    • Shlomo Havlin
  5. Department of Sociology, Stockholm University, S-10691, Stockholm, Sweden

    • Fredrik Liljeros
  6. Information, Operations and Management Sciences Department, Stern School of Business, New York University, New York, New York 10012, USA

    • Lev Muchnik

Contributions

All authors contributed equally to the work presented in this paper.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Hernán A. Makse.

Supplementary information

PDF files

  1. Supplementary Information

About this article

DOI

https://doi.org/10.1038/nphys1746
