Abstract
The popularization of information spreading in online social networks facilitates daily communication among people. Although much work has been done to study the effect of interactions among people on spreading, less work has considered the pattern of spreading behaviour when people make their decisions independently. By comparing microblogging, an important medium for information spreading, with the disordered spin glass system, we find interesting correspondences between them, and the effect of aging can be observed in both systems. Based on the analogy with the Trap Model of spin glasses, we derive a model with a unified power-function form for the growth of independent spreading activities. Our model takes several key factors into consideration, including the memory effect, the dynamics of human interest, and the fact that older messages are more difficult to discover. We validate our model with a real-world microblogging data set. Our work indicates that, in addition to various features, some invariable rules should be considered during spreading prediction. This work also contributes a useful methodology for studying human dynamics.
Introduction
Information spreading is of great significance to communication in people’s daily lives, the marketing strategies of corporations, and the concentration of public opinion1. To investigate how individuals respond to received information, most existing research has studied the interactions among them. However, to the best of our knowledge, how individuals independently decide whether or not to spread information has not yet been well answered. In this paper, we study the cumulative dynamics of independent spreading behaviour by probing retweeting, one of the most typical information spreading processes in online social networks. Our target problem is to explore whether there exists a model with a simple unified form that governs the growth of all independent retweeting activities in our data set.
Retweeting is the behaviour of copying received messages and reposting them on microblogging platforms. The decision to retweet can be made under the influence of others or independently. It is intuitive that there exist certain patterns of interactions among people, such as the cascading effect2,3,4 and the co-existence of competition and cooperation5. However, most existing studies, whether theoretical6,7,8 or empirical9,10,11, do not distinguish independent spreading activities from others, although they make up a large portion of all spreading activities. Therefore, rather than focusing on the hierarchical structure, we focus on the retwitters who directly retweet initialized messages from influentials, without any influence from others. The reason we only consider influentials is that there is hardly any information about spreading dynamics if a message is retweeted by only a few retwitters, though we do not make any particular assumptions about influentials in our model derivation. One benefit of studying direct retwitters is that, by filtering for them, networks can be reduced to a much smaller scale. Meanwhile, this filtering process does not reduce the representativeness of retweeting activities, since a large number of spreading processes originating from influential users end within their one-degree follower networks11.
Specifically, through empirical data, we trace the reposted messages (retweets) of popular initialized/original messages (tweets) by direct retwitters in Sina Weibo, China’s most popular microblogging service. Direct retwitters can be identified by text parsing utilities. Then, similar to other studies12,13,14, the change of the number of retweets allows us to investigate the dynamics of independent spreading behaviour. We will investigate whether there is a model with a unified form that applies to all responding activities.
Of the existing explanatory models of information spreading, some intend to explain the microscopic interactions among individuals15, some intend to characterize the resulting effect12, 16, and some bridge the gap between microscopic mechanisms and macroscopic phenomena2, 4, 5. We aim to come up with a modeling methodology that is able to derive a mathematical model for phenomenal results based on some intuitive microscopic conjectures. To achieve this goal, we model the growth of retweets by analogy with spin glass models. The motivations for this methodology are two-fold. The first is the inspiration from the works of Johansen and co-workers14, 17, 18, who reported several experiments on the Internet that could be explained by models similar to the Trap Model of spin glasses. The second is the realization that an aging effect is likely to exist in the growth of retweeting activities, since it is harder to discover older messages in microblogging. To derive our model, we investigate the Ising Model and the Random Energy Model19 for microscopic mechanisms and the definition of the ground state of spin glass systems. Then we investigate the Trap Model for the explanation of the aging effect in spin glasses. We identify several correspondences between the retweeting behaviour of microblogging users after the publication of tweets from influentials, and the relaxation responses of spin glasses after magnetic fields are switched off at low temperature. Based on some intuitive conjectures, we derive a power-function model to describe the change of the cumulative number of retweets over time. These conjectures, such as the memory effect, the dynamics of human interest, and the fact that it gets more difficult to discover older messages, are the key elements in our modeling process. They are demonstrated, to some extent, by empirical experiments or relevant references. Then we fit our data with the derived model.
The results show that although the content of messages and the influence of original publishers are diverse, our model fits well to most of our empirical data. We also show the predictability of our model.
Our work contributes a useful methodology, the analogy with physical systems, for studying human dynamics. The discovered rule, which applies to the growth of different retweeting activities with a unified form, reveals the nature of complexity in retweeting activities. We hope that our work will shed some light on the study of human dynamics20,21,22,23,24. Our work also indicates that, in addition to the various features adopted in well-tuned machine learning models, some invariable rules, such as the power-law growth of independent retweeting activities and the memory effect in human behaviour, should be taken into consideration during the prediction of information spreading.
Results
In this section, we introduce the whole modeling process. The section is outlined as follows. First, to obtain a set of independent retwitters, data preparation and preprocessing are carried out. Then we derive the model for the growth of retweets by analogy with relevant spin glass models. Next, we fit empirical data with the derived model. Then, we carry out several experiments to validate our model with real data from Sina Weibo. Finally, we show the predictability of the derived model.
Data Preparation and Preprocessing
For this study, we obtained two data sets, a message set and a link set. The message set allows us to trace every retweet of an original tweet and its corresponding retwitter. The link set describes the following relationships among users. Thus, we are able to locate every single retweeting activity on real-world network structures.
A tweet could be spread to users distant from its original publisher. Since all the followers of current retwitters could receive the message and expose it to their own descendants, the entire network structure on which the original tweet spreads might be too large to study. Goel et al.11 found that a vast majority of spreading cascades terminate within one degree of an initial seed. We further defined Direct Follower Networks (DFNs) in our former paper25 to describe the spreading activities that happen within the first layer of followers of an influential user. Figure 1 shows a schematic diagram of a DFN and two overlapped DFNs from real data. The formal definition of DFNs is as follows.
Definition 1 A Direct Follower Network associated with an influential user u, is a directed graph with a node set \( {\mathcal F} \) and a link set \( {\mathcal L} \). The set \( {\mathcal F} \) consists of all the followers of u with distance 1. The set \( {\mathcal L} \) consists of all the directed links among the nodes in \( {\mathcal F} \).
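Definition 1 can be illustrated with a minimal sketch. It assumes the link set is available as (follower, followee) pairs; the function name `build_dfn` and this pair layout are illustrative assumptions, not part of the data set description.

```python
def build_dfn(links, u):
    """Return (F, L) for the Direct Follower Network of user u.

    links: iterable of (follower, followee) pairs (assumed layout).
    """
    links = set(links)
    # F: every user who follows u directly (distance 1).
    followers = {a for a, b in links if b == u}
    # L: every directed link whose two endpoints both lie in F.
    dfn_links = {(a, b) for a, b in links if a in followers and b in followers}
    return followers, dfn_links


# Toy example: a and b follow u; c follows a but not u.
example_links = [("a", "u"), ("b", "u"), ("c", "a"), ("a", "b"), ("b", "c")]
F, L = build_dfn(example_links, "u")
```

In the toy example only the link between a and b survives, because c is not a direct follower of u.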
On this simplified network structure, we trace the retweets of 3,506 original tweets initialized by the top 10 most active (in the sense of the amount of initialized tweets) influentials in our data set. Then we calculate the ratios of the retweeting activities on DFNs to the total amounts of retweets. The results show that the ratios exceed 80% for nearly 90% of original tweets, which means the overwhelming majority of retweeting activities happen on DFNs. In addition, as shown in Fig. 2, the hourly intensity of retweeting activities can be roughly classified into single-peak and multi-peak patterns. Multi-peak patterns involve more complicated factors than single-peak patterns. These multi-peak patterns account for less than 5% of all samples in our data set. Therefore, they are removed in the following analyses.
After the simplification of network structure, we discuss the independence among retwitters. A retwitter could directly retweet a message after receiving it exactly once from the author, or he could receive the message multiple times from other retwitters and then retweet it. The retwitters in the former case, who are reckoned as direct retwitters, are considered independent of each other because their decisions to retweet are not affected by other retwitters. Figure 3 demonstrates direct and indirect retwitters. The so-called triadic closure makes multiple exposures happen. The effect of multiple exposures on information diffusion in social networks is non-trivial and is of great importance for researchers from a broad range of fields26, 27. In our data set, the ratios of the numbers of direct retwitters to the numbers of retwitters in DFNs exceed 70% for more than 90% of all tweet samples, which means the overwhelming majority of retwitters in DFNs are direct retwitters.
We identify these direct retwitters by text parsing utilities such as regular expressions. Specifically, among all the retwitters, every direct retwitter lies on the structure of DFNs, since other retwitters can only see the original message through some mediators. As shown in Fig. 4, we identify direct retwitters by parsing the retweet tag in the text of retweets and locating the positions of retwitters relative to the author. Admittedly, it is possible that a user could bypass his mediators and “directly” repost a message, even though he saw the message from mediators. We ignore this situation since it is a much less convenient way of retweeting in microblogging. After the filtering and parsing processes mentioned above, we obtain our data set for modelling independent retweeting behaviour.
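The parsing step can be sketched as follows. The retweet tag format is not specified in the text; the sketch assumes that each mediator appears in the retweet text as `//@name:`, and the helper names `mediators` and `is_direct_retweet` are hypothetical.

```python
import re

# Assumed tag format: each intermediate reposter appears as "//@name:" in the text.
RETWEET_TAG = re.compile(r"//@([^:\s]+):")


def mediators(text):
    """Names of intermediate reposters found in a retweet's text, in order of appearance."""
    return RETWEET_TAG.findall(text)


def is_direct_retweet(text, retwitter, followers_of_author):
    """Treat a retweet as direct if it carries no mediator tag and
    its retwitter follows the author (i.e. lies on the DFN)."""
    return retwitter in followers_of_author and not mediators(text)


chain = mediators("nice //@bob: lol //@alice: wow")
```

A retweet whose text contains at least one tag is classified as indirect under this assumption, regardless of whether the retwitter also follows the author.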
Model Derivation
We first introduce the motivations for modeling retweeting behaviour by analogy with spin glass models. Then we describe the connections between them in detail. Finally, we give the mathematical details of our model.
Our motivations for modeling the growth of retweets by analogy with spin glass models are two-fold. The first is the inspiration from the works of Johansen and co-workers14, 17, 18, who reported several experiments that characterize the dynamical responses of Internet users to some bursty events. One of their experiments probes the downloading activities of a scientific paper after the publication of a related online interview. The relaxation dynamics following the bursty events is explained as a barrier crossing in the Trap Model28 of spin glasses. The second motivation is the realization that an aging effect is likely to exist in the growth of retweeting activities. The spin glass models we draw on aim to give reasonable interpretations of aging, the slow magnetic relaxation dynamics that exist at all time scales in spin glass materials. A typical feature of aging is that the longer the waiting time, the slower the relaxation. In our Weibo systems, messages are ordered chronologically. The older a message is, the harder it is to find for retweeting. Hence, the growth of retweets behaves similarly to the aging effect of spin glasses.
Here we give a brief description of the spin glass models related to our work. A spin glass is a disordered system. Several intermetallic alloys and insulating compounds are used as spin glass experimental samples. These materials show some common behaviours, such as aging. The Ising Model is used to describe a set of interacting (p-)spins on a lattice19. The minimal model assumes that the spins can only take on the values 1 or −1. A configuration of a system, which is assumed to comprise N spins, is an assignment of a value to each spin. The system has \({2}^{N}\) configurations. Based on the Ising Model, the Random Energy Model (REM)19 assumes that (1) the energy levels \({E}_{i}\) corresponding to each configuration are independent and identically distributed normal random variables; (2) the probability that a system is in a certain configuration is proportional to \(\exp (-{E}_{i}/T)\), where T represents thermodynamic temperature. The solution to the model indicates that an exponentially large number of configurations are distributed on a concentrated ground state. The probability that a system is in other states is almost zero.
Based on the REM, the Trap Model28 characterizes the effect of aging in the magnetic relaxation processes of spin glasses. The model assumes that a metallic spin-glass instance can be decomposed into many independent spin glass subsystems. The energy state of a subsystem is characterized by its own configurations, and the energy landscape of the whole system is rough and hierarchical. The reference energy \({f}_{0}\), which is specified by the ground state revealed by the REM, forms a “plane” on the landscape. There are many local minima on that plane, which trap the subsystems in metastable configurations. The depth of these traps is exponentially distributed, with lower probability for deeper traps. The deeper the trap is, the longer a subsystem stays in it. A transition between two configurations that are on the plane is almost instantaneous, since there is no energy difference between the configurations. Hence, most of the time, subsystems are trapped in these pits with relatively lower energy levels. A subsystem needs extra energy to escape a trap and move to another energy level. However, due to the constraint of structure, a transition of a subsystem between configurations occurs only between adjacent traps. It is unlikely that a subsystem can randomly move to any trap on the plane. The ensemble of the transitions of all subsystems leads to the magnetic relaxation of the whole system. The model also introduces an exponential decaying factor for the convergence of the integral.
A Weibo system, which is characterized by a Direct Follower Network of an author, consists of many independent retwitters. Each follower corresponds to a subsystem in spin glasses. The ground energy level of a retwitter corresponds to the least active state of the retwitter. People carry out different tasks in their daily lives. Once a person is dealing with a task, he is “trapped” by it. Since he is now more active compared to the ground energy state, the energy landscape formed by these tasks is filled with humps rather than pits. The difficulty and complexity of these tasks can differ. More difficult tasks take people longer to deal with. The property associated with the difficulty of tasks is assumed to be exponentially distributed, and this distribution in our Weibo systems will be demonstrated later. Now an energy landscape similar to that of spin glasses has been established. Analogous to spin glass relaxation, the ensemble of the transitions of retwitters among the humps created by original tweets leads to the growth of the number of retweets.
We introduce two important conjectures about retweeting behaviour. The first is the memory effect, which characterizes the correlations among a series of tasks. Intuitively, the effect implies that people deal with tasks in a logical order and do not move randomly between tasks. There is considerable evidence that human activities show a memory effect, i.e., they are not Markovian29. Considering the REM, adjacent system configurations on a grid differ only in the values of a few particles. A system can only move to its currently adjacent states, which limits large variations in energy levels. The second conjecture characterizes the bursty nature of the dynamics of human interest20. One of the basic properties is that the longer a person has been doing one thing, the more likely he is to lose interest and move on to another task21.
The main corresponding relationships between spin glasses and Weibo systems are summarized in Table 1.
The main conjectures in spin glass systems also correspond to the conjectures in Weibo systems. They are the exponentially distributed landscape, the memory effect and the dynamics of human interest. These conjectures are demonstrated, to some extent, by either empirical experiments or relevant references.
Next, we give the mathematical details of model derivation. The configurations in REM correspond to different states when an individual is dealing with different tasks. Intuitively, more difficult tasks will cost people longer time to deal with. For instance, in Weibo systems, we will spend different lengths of time for messages with different contents. Let f denote the property associated with a task that will affect the length of time when people deal with it. The property could be associated with the complexity of mathematical problems, or the attractiveness of tweets. The distribution of f is assumed to be exponential, which is
where \({f}_{0}\) is the reference state in which people stay in their normal status of activeness. \({f}_{0}\) corresponds to the ground state derived in the REM. The difference from the Trap Model is that \({f}_{0}\) is our lowest energy state, and f is always equal to or greater than \({f}_{0}\). The exponential form of the distribution will be demonstrated later by our Weibo data set.
The landscape is filled with humps, which means that people move from the ground state to more active states when they deal with tasks. The expectation of f is
Parameter T corresponds to the activeness of a user. The more active he is, the more probable he will be to deal with complex tasks. Parameter x corresponds to how simple, on average, users’ tasks are. The simpler the tasks, the lower the humps on the landscape.
According to the Arrhenius Law, the time for dealing with a task with property f is,
where \({\tau }_{0}\) is the minimum time for users to deal with a task. This response time is a very small quantity compared to the time span under our consideration.
According to Equations (1) and (3), we obtain the distribution of τ as,
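The displayed equations referred to in this passage are not reproduced in this text. As a sketch only, the chain from the landscape to the waiting-time distribution can be written out under the assumption that the standard Trap Model forms are adapted to humps above \({f}_{0}\). The expressions below, including the symbol \(\rho\) for the density of f and \(\psi\) for the density of τ, are assumptions chosen to be consistent with the surrounding definitions of \({f}_{0}\), T, x and \({\tau }_{0}\); they are not a transcription of Equations (1) to (4).

```latex
\begin{aligned}
\rho(f) &= \frac{x}{T}\exp\!\left(-\frac{x\,(f-f_0)}{T}\right), \qquad f \ge f_0, \\
\langle f \rangle &= f_0 + \frac{T}{x}, \\
\tau &= \tau_0 \exp\!\left(\frac{f-f_0}{T}\right)
  \;\Longrightarrow\; f - f_0 = T\ln\frac{\tau}{\tau_0}, \quad \frac{df}{d\tau} = \frac{T}{\tau}, \\
\psi(\tau) &= \rho(f)\,\frac{df}{d\tau}
  = \frac{x}{T}\left(\frac{\tau_0}{\tau}\right)^{x}\frac{T}{\tau}
  = \frac{x\,\tau_0^{\,x}}{\tau^{1+x}}, \qquad \tau \ge \tau_0 .
\end{aligned}
```

Under these assumed forms, an exponentially distributed landscape combined with an Arrhenius-type processing time yields a power-law distribution of τ, and the mean hump height \(T/x\) rises with activeness T and falls as x increases.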
The memory effect indicates that, when people are dealing with a sequence of tasks, there must be a reasonable order29. It limits large variations of f during state transitions. In order to involve this effect, we introduce \(r(u)=r(\tau /{t}_{w})\) to denote the likelihood that a user deals with a task with processing time τ. \({t}_{w}\) is a constant, and it will be eliminated in our following analysis. The likelihood function should satisfy the properties \(dr/du < 0\) and \({\lim }_{\tau \to \infty }r(\tau /{t}_{w})=0\). At any specific time point, let \({P}_{h}(\tau ,{t}_{w})\) denote the hitting probability that a user is dealing with a task with processing time τ. We have,
Hence, by re-organizing constant parameters, we have
where A is the normalization constant.
As mentioned in the study of human interest dynamics21, the longer a person keeps doing something, the more likely he is to change his current interest. Hence, the longer the time τ a user has been trapped in his previous task, the more likely he is to see a certain message at present. Let \({p}_{m}\) denote the probability that users move from doing other tasks to message browsing on microblogging platforms. To involve the above effect, the transition probability from a task with processing time τ at some point to message browsing in Weibo is modified by multiplying by the factor \({p}_{m}\exp (-t/\tau )\). Note that this factor and the memory effect factor r(u) do not affect the final form of the power-law relaxation equation, as long as they satisfy the required properties28.
Let N denote the final number of retweets of an original message. At time point t after the publication of the message, let n(t) denote the cumulative number of retweets at t. Those who have not retweeted the message yet will try \(\langle 1/\tau \rangle dt\) times with transition probability \({p}_{m}\) and hitting probability \({P}_{h}(\tau )\). We assume that once a user decides to retweet a message, his action is instantaneous. \(\langle 1/\tau \rangle \) denotes the average hopping frequency when users move from doing other tasks to tweet browsing. We calculate \(\langle 1/\tau \rangle \) for all terms with τ. Letting \(r(\tau /{t}_{w})={t}_{w}/\tau \), we have,
Hence,
The general solution to the above differential equation is
where \({C}_{0}\) is a constant. Its value depends on the satisfaction of the boundary condition. We will discuss the values of parameters later. Equation (10) shows the cumulative dynamics of retweeting. It has the form of \(c-a{t}^{-b}\).
Model Fitting
We fit the power-function model \(n(t)=c-a{t}^{-b}\) to our real-world tweet spreading data set.
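The fitting procedure itself is described in the Methods, which are not part of this text. As a stand-in only, the following sketch fits \(n(t)=c-a{t}^{-b}\) by scanning b over a grid and solving for a and c in closed form, since the model is linear in \({t}^{-b}\) once b is fixed. The function name `fit_power_growth`, the grid range and the synthetic data are all assumptions made for illustration.

```python
def fit_power_growth(times, counts, b_grid=None):
    """Fit n(t) = c - a * t**(-b) by scanning b and solving (a, c) by least squares.

    times must be positive. Returns (a, b, c, sse) for the best grid value of b.
    """
    if b_grid is None:
        b_grid = [k / 100 for k in range(1, 301)]  # assumed search range: 0.01 .. 3.00
    n = len(times)
    mean_y = sum(counts) / n
    best = None
    for b in b_grid:
        z = [t ** (-b) for t in times]           # regressor z = t^(-b)
        mean_z = sum(z) / n
        szz = sum((zi - mean_z) ** 2 for zi in z)
        if szz == 0:
            continue
        szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, counts))
        slope = szy / szz                        # slope of y against z equals -a
        c = mean_y - slope * mean_z
        a = -slope
        sse = sum((c - a * zi - yi) ** 2 for zi, yi in zip(z, counts))
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best


# Synthetic check: data generated from a = 100, b = 0.5, c = 100.
times = list(range(1, 201))
counts = [100 - 100 * t ** -0.5 for t in times]
a, b, c, sse = fit_power_growth(times, counts)
```

A grid scan is chosen here only because it needs no external library; a nonlinear least-squares routine would serve the same purpose.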
We choose the top 5 most active influentials as the root users of their corresponding Direct Follower Networks. They are also the authors of the initialized tweets, which are the seeds of the chains of retweets that we trace. The identities of the influentials are diverse, including the official account of a news agency, two popular accounts that publish jokes and witticisms, a famous writer and a popular actress. There are altogether 2623 seed tweets initialized by these influentials.
We fit the 2623 chains of samples with the same model but different parameters (See Methods). The results show that for 2087 samples, which account for about 80% of all the samples, the parameter b lies in (0, 1). We choose the normalized Root Mean Square Error (RMSE) as our performance measure. Some fitting examples are illustrated in Fig. 5. The figures show that fitting results with errors lower than around 0.06 demonstrate the effectiveness of our model. Results with high errors are rare, and the reason for high errors is probably that there are several peaks of retweeting intensity during the whole process of spreading. We examine the samples with relatively large fitting errors, and discover that around 70% of the original tweets corresponding to these samples are published between 10 pm and 6 am the next day. It is mainly the circadian rhythm30 that results in the spontaneous emergence of multi-peak retweeting patterns among direct retwitters. The distribution of all fitting errors is illustrated in Fig. 6(a). It shows that the errors are less than 0.05 for around 85.9% of retweeting curves. The empirical data are well fitted by our model.
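The performance measure can be sketched as below. The text does not state how the RMSE is normalized; this sketch assumes division by the range of the observed values, which is one common convention, and the function name `normalized_rmse` is illustrative.

```python
import math


def normalized_rmse(observed, predicted):
    """RMSE divided by the range of the observed values (assumed normalization)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    span = max(observed) - min(observed)
    if span == 0:
        raise ValueError("observed values have zero range")
    return math.sqrt(mse) / span


# Two points, each off by 1, over an observed range of 10.
example_error = normalized_rmse([0, 10], [1, 9])
```

With a different normalization, such as division by the final number of retweets, the thresholds quoted in the text would not carry over to this function unchanged.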
To further demonstrate the effectiveness of our power-function model, we fit the empirical data with two alternative forms, an exponential function (11) and a logarithmic function (12).
The fitting method is identical to that for power-function fitting. Results are illustrated in Fig. 6(b). We find that the samples whose fitting errors are less than 0.05 account for 1.68% and 13.6% of total samples for the exponential model and the logarithmic model respectively, whereas this number is 85.9% for the power-function model. Hence, we conclude that the derived power-function model outperforms both the exponential and logarithmic models.
In order to show the distribution of the waiting time of retweeting τ, we plot the CCDF (Complementary Cumulative Distribution Function) of τ in Fig. 7. Within the time span under our consideration, which is 72 hours (4320 minutes), the tail of the distribution of τ approximately follows a straight line (the red line). However, over a relatively larger time span, the distribution shows an exponential tail of the form \(f(x)\sim \exp (-\theta x),\ \theta \ge 0\). To estimate it, we adopt the maximum distance between the CCDF of the empirical distribution and the CCDF of the estimated exponential distribution as our criterion, which is similar to the Kolmogorov–Smirnov statistic. We then estimate the exponent θ and find that the optimal value is approximately \(2.73\times {10}^{-5}\). The red line in Fig. 7(b) shows the fitting result. The criterion is approximately 0.01782, which is small compared to 1.
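The distance criterion can be sketched as follows. The optimisation used to find θ is not described in the text, so the grid search below is an assumption, as are the function names and the synthetic sample, which is built from exact exponential quantiles so that the expected distance is known in advance.

```python
import math


def ks_distance_exponential(samples, theta):
    """Largest gap between the empirical CCDF and the model CCDF exp(-theta * x)."""
    xs = sorted(samples)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = math.exp(-theta * x)
        # Empirical CCDF just before and just after the jump at x.
        worst = max(worst, abs((n - i) / n - model), abs((n - i - 1) / n - model))
    return worst


def best_theta(samples, candidates):
    """Candidate theta with the smallest distance (simple grid search, an assumption)."""
    return min(candidates, key=lambda th: ks_distance_exponential(samples, th))


# Synthetic sample: mid-point quantiles of an exponential distribution with rate 0.002.
n_points = 200
true_theta = 0.002
samples = [-math.log(1 - (k - 0.5) / n_points) / true_theta for k in range(1, n_points + 1)]
distance = ks_distance_exponential(samples, true_theta)
chosen = best_theta(samples, [0.001, 0.002, 0.004])
```

For this synthetic sample the model CCDF sits exactly halfway across each step of the empirical CCDF, so the distance equals 0.5 divided by the sample size.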
Within a small time span, in which most spreading processes saturate, the distribution of τ is approximately power-law. This result is consistent with Equation (4).
The above results validate the power-function model of the cumulative dynamics of independent spreading activities. The model implies a clear trend of saturation and an aging effect, which means that the longer the time since the last retweeting activity, the longer the wait for the next one. Model comparison demonstrates that the decay of the intensity of spreading is neither as fast as exponential nor as slow as logarithmic.
Model Validation
We conduct several empirical experiments to validate some key facts in our model, including
1. the relationship between parameters a and c.
2. the exponentially distributed landscape.
3. the connection between b and temperature.
These experiments are based on those samples with parameter b in (0, 1), which account for 80% of all samples.
To determine the value of the constant \({C}_{0}\) in model (10), we need to investigate whether the boundary condition \(n(t=1)=0\) is satisfied. If we precisely choose the minimal response time of retweeting as our base unit of time, it is possible to adjust the cumulative curves to meet the desired boundary condition. In this case, the boundary condition holds and a equals c. However, there exist some technical limitations in collecting our data, and for the convenience of analysis we choose one minute as our time unit. We calculate, for each sample, the fraction of retweets that occur within the first minute; it is less than 5% for 96% of all samples. Hence, there are some small deviations in calculating the boundary condition. We draw a scatterplot of a and c in Fig. 8(a). At first glance it seems that a = c. However, a closer examination reveals that there are more points below the diagonal.
In addition, the bias in estimating the waiting time τ may affect the boundary condition. If we adopt the power-function model of τ described by Equation (4), we can see from Fig. 7(a) that this power-law model is likely to overestimate the actual processing time near zero, which means that the cumulative number of retweets predicted by the model grows more slowly than the actual number. A slower growth of the model indicates a relatively larger a in the form \(n(t)=c-a{t}^{-b}\), and the curve generated by our model tends to shift to the right of the actual curve. Due to this shift, the model may produce negative values near zero. Hence, it is quite possible that a > c when t = 1, which is consistent with the results in Fig. 8(a). The effect of the above facts on the boundary condition may be inevitable, but it is small. The above analysis demonstrates the consistency between the derived model and our empirical data.
The energy landscapes of disordered systems, social systems and biological systems are well studied31,32,33. The landscape of very low energies in the REM is exponentially distributed. In our Weibo systems, the ground state corresponds to users’ normal status in their daily activities. Once a user begins to browse and read tweets, he will be “trapped” in humps created by the tweets he browses. Most of the tweets take up only a small amount of his time, but some tweets attract his attention. Hence, the landscape in Weibo systems is created by the tweets. Furthermore, the time one spends on each tweet is determined by the height of the hump. The popularity or attractiveness of a tweet is well defined by how many times it has been retweeted during a relatively long period. Figure 8(b) shows the CCDF of the numbers of retweets of the original tweets in our data set. The figure indicates that the CCDF is approximately a straight line with a log-transformed y-axis. This result provides evidence that the energy landscape of Weibo systems is likely to be exponentially distributed.
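The boundary-condition argument follows directly from the fitted form \(n(t)=c-a{t}^{-b}\) and can be written out as a short worked step; nothing beyond that form is assumed.

```latex
\begin{aligned}
n(1) &= c - a\cdot 1^{-b} = c - a, \\
n(1) = 0 \;&\Longleftrightarrow\; a = c, \qquad
n(1) < 0 \;\Longleftrightarrow\; a > c, \\
\lim_{t\to\infty} n(t) &= c \quad (b > 0), \qquad
\frac{dn}{dt} = a\,b\,t^{-(b+1)} .
\end{aligned}
```

So a negative model value at t = 1 is equivalent to a > c, c plays the role of the saturation level, and the retweeting intensity decays as a power law in t.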
Rate parameter b is crucial since it determines the overall shape of the growth of retweeting activities. The spin glass parameter x in Equation (1) in Reference 28, to which b corresponds, is temperature dependent. Since temperature represents the average kinetic energy of microscopic motions of particles, the physical model indicates that there is a connection between b and the average activeness of retwitters. The actual meaning that b reveals is related to the so-called social temperature concept, which is defined as the probability of one’s acceptance of others’ opinion34, 35.
Here, we introduce an indicator to represent the average activeness of retwitters. The number of a user’s followees \(\Omega \) determines the amount of messages the user receives, and the number of messages ω the user retweets during a certain period determines to what extent the user approves of others’ opinions. These two quantities are direct indicators of users’ activeness. We then define the ratio \({r}_{act}=\omega /\Omega \) as an indicator of retweeting activeness for each retwitter. If temperature rises, retwitters will be more active and the system will relax faster. For each retweeting sample, we calculate \({r}_{act}\) for each of its retwitters. The distribution of \({r}_{act}\) for one sample is not fat-tailed. We then fit \({r}_{act}\) with an exponential distribution and record the exponent λ for each sample. As shown in Fig. 9, as b increases, the values of λ in each group decrease. Since \(1/\lambda \) reflects the average activeness among the retwitters of a sample, the results demonstrate that as “social temperature” increases, the level of activeness among users increases too. This is consistent with our former analysis. The above empirical findings demonstrate the real significance of parameter b.
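The indicator and the exponential fit can be sketched as below. The text does not say how the exponential distribution was fitted; the sketch assumes the maximum-likelihood estimate, for which the rate is the reciprocal of the sample mean. The function names and the toy numbers are illustrative.

```python
def activeness_ratios(retweet_counts, followee_counts):
    """r_act = omega / Omega per retwitter; users with no followees are skipped."""
    return [w / o for w, o in zip(retweet_counts, followee_counts) if o > 0]


def exponential_rate_mle(values):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    mean = sum(values) / len(values)
    if mean <= 0:
        raise ValueError("mean must be positive")
    return 1.0 / mean


# Toy sample: (omega, Omega) for four retwitters; the last has no followees.
ratios = activeness_ratios([2, 5, 0, 3], [10, 20, 50, 0])
lam = exponential_rate_mle(ratios)
```

Because the estimate is the reciprocal of the mean, \(1/\lambda \) is exactly the average \({r}_{act}\) of the sample under this assumption.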
Predictability
In order to demonstrate the function of our derived model, we carry out several prediction experiments, including two based on historical data and one based on microscopic data.
First, we try to predict the saturated number of retweets N by using historical data. We fit the curve of the cumulative number of retweets with our power-function model at the 8th, 12th and 16th hour respectively after the publication of the original tweet for all our samples. We thus obtain 3 groups of parameters a, b and c. We use these 9 parameters as features to train a support vector regression model for the prediction of the number of retweets at the 72nd hour. We use ten-fold cross-validation to evaluate the accuracy of the prediction. The distribution of the absolute percentage errors between the predicted number of retweets and the actual number is shown in Fig. 10(a). The red line indicates the median of the error, which is 24.7%. The errors are lower than 40% for more than 70% of the samples.
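The error summary reported for this experiment can be sketched as follows. Only the evaluation step is shown; the regression model is not reproduced, and the function names and toy numbers are illustrative assumptions.

```python
import statistics


def absolute_percentage_errors(actual, predicted):
    """|predicted - actual| / actual * 100 for each sample (actual must be non-zero)."""
    return [abs(p - a) / a * 100 for a, p in zip(actual, predicted)]


def summarize_errors(errors, threshold=40.0):
    """Median error and the share of samples whose error is below the threshold."""
    share = sum(e < threshold for e in errors) / len(errors)
    return statistics.median(errors), share


# Toy values standing in for retweet counts at the 72nd hour.
errors = absolute_percentage_errors([100, 200, 400], [110, 150, 700])
median_error, share_below_40 = summarize_errors(errors)
```

The default threshold of 40 mirrors the 40% level quoted in the text.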
In view of the difficulty of accurate prediction of the number of retweets for every single original message, the performance of the above prediction seems moderate.
The first experiment implies that an accurate qualitative classification model could be more practical. In order to examine the predictability of historical data, rather than the effect of network structure and message contents, we select only the samples that contain no multimedia content and are published by the same author. The 200 samples with the largest b are labeled as positive, and the 200 samples with the smallest b are labeled as negative. We use the parameters fitted at the 8th, 12th and 16th hour after publication as features to train a support vector machine for classification. We adopt ten-fold cross-validation to evaluate the performance. The average accuracy, precision and recall are 85.25%, 90.60% and 79.50%, respectively. The results show that, by fitting historical data to our power-function model, we can obtain a high-performance classifier for predicting the magnitude of parameter b.
Next, we show the predictability by using microscopic data. In order to demonstrate the relationship between microscopic features and macroscopic measurements, by using a machine learning technique, we perform an experiment to predict the relaxation exponent b. The external magnetic field, which affects the relaxation of a spin glass system, corresponds to the “field” applied by the author and the original tweet in a Weibo system. We retrieve several microscopic features from authors’ profiles and the content of messages, and perform a support vector regression for the prediction of b. These features are detailed in Table 2. The features are extracted only from the profiles of the authors and the content of the original messages, since we believe that it is only the charm of the influentials and the topics in their messages that define the external magnetic field and then attract people to retweet.
We perform a ten-fold cross-validation to test the performance of our prediction. The absolute errors are shown in Fig. 10(b). Since \(b\in \mathrm{(0,1)}\), the mean value of the absolute error, which is 0.064, is not large. This experiment shows that the differences in b are partly due to the effect of the differences in the identities of authors and also the content of messages. The gap between microscopic mechanisms and macroscopic phenomena can be bridged in this way.
The above experiments demonstrate, to some extent, the predictive capability of our model.
An Application
Here we demonstrate a possible application of our model: examining whether or not a spreading process is approaching saturation.
If the parameter b in the model \(n(t)=c-a{t}^{-b}\) is positive, then n → c as \(t\to \infty \), which means that the spreading process approaches saturation. Conversely, if a spreading process has saturated, the parameter b estimated from the saturated curve will be greater than 0. At the early stage of spreading, however, n grows in a divergent way, and the fitted parameters a and b will both be negative. Hence, if we estimate the model parameters from data truncated at successively later times, the value of parameter b will increase monotonically from negative to positive.
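The sign argument can be made explicit with a short calculation. It is a sketch that assumes only that n(t) is non-decreasing for t > 0, as any cumulative count must be:

\[\frac{dn}{dt}=\frac{d}{dt}\left(c-a{t}^{-b}\right)=ab\,{t}^{-b-1}\ge 0\quad \Longleftrightarrow \quad ab\ge 0.\]

Parameters a and b therefore share the same sign. For \(a>0\) and \(b>0\), \({t}^{-b}\to 0\) and n approaches c from below. For \(a<0\) and \(b<0\), the model reads \(n(t)=c+|a|{t}^{|b|}\), which grows without bound and describes the divergent early stage.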
By estimating the value of model parameter b, we can examine whether or not a spreading process has reached saturation. We conduct a simple experiment to explore the application potential of our model. We truncate our data at the 4th, 6th, …, 82nd hour and fit our power-function model to each truncated series to obtain parameters a, b and c. We record the change of the values of b at different hours. Then we record the time points \({t}_{l}\), \({t}_{m}\) and \({t}_{h}\) at which the cumulative number of retweets reaches 65%, 80% and 95% of the saturated number of retweets. According to the change of b, we record the values \({b}_{l}\), \({b}_{m}\) and \({b}_{h}\) that correspond to the time points \({t}_{l}\), \({t}_{m}\) and \({t}_{h}\). We calculate \({b}_{l}\), \({b}_{m}\) and \({b}_{h}\) for each tweet spreading sample. The distributions of \({b}_{l}\), \({b}_{m}\) and \({b}_{h}\) for the 5 influentials are illustrated in Fig. 11.
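Locating the three time points is a simple scan over the cumulative curve. The sketch below assumes the curve is stored as a non-decreasing list indexed by time step; the function name and the sample series are hypothetical.

```python
def first_time_reaching(cumulative, fraction):
    """Index of the first time step at which the cumulative count reaches
    `fraction` of its final (saturated) value."""
    if not cumulative:
        raise ValueError("cumulative must be non-empty")
    target = fraction * cumulative[-1]
    for t, count in enumerate(cumulative):
        if count >= target:
            return t
    raise ValueError("fraction is never reached; expected fraction <= 1")


# Hypothetical cumulative retweet counts, one value per time step.
cumulative = [0, 12, 41, 67, 83, 91, 96, 99, 100, 100]

t_l = first_time_reaching(cumulative, 0.65)
t_m = first_time_reaching(cumulative, 0.80)
t_h = first_time_reaching(cumulative, 0.95)
print(t_l, t_m, t_h)
```

The values of b fitted on data truncated at these time steps then give the three quantities whose distributions are compared.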
The results show that, for all 5 influentials, the peak values of the distributions of b increase from \({b}_{l}\) to \({b}_{h}\). When the number of retweets reaches 95% of the final saturated number, all peak values of the distributions of b exceed zero. We could use this feature to roughly identify whether or not a retweeting process is currently approaching saturation. Specifically, we could monitor and record the growth of the cumulative number of retweets dynamically, and fit the recorded data with our power-function model at a certain time point. We would choose a threshold larger than zero based on historical data. When the value of parameter b in the fitted model exceeds the threshold, we could conclude that the spreading process has approximately reached saturation. Accurate identification of the saturation of spreading will be beneficial to the scheduling of interventions, such as influence maximization and advertising promotion using microblogging services.
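The monitoring rule described above can be sketched as follows, assuming that b has already been fitted on data truncated at a series of hours. The function name, the fitted values and the threshold are hypothetical and serve only as an illustration.

```python
def first_saturation_hour(b_by_hour, threshold):
    """Return the earliest hour at which the fitted b exceeds the threshold,
    or None if it never does."""
    for hour, b in sorted(b_by_hour.items()):
        if b > threshold:
            return hour
    return None


# Hypothetical values of b fitted on data truncated at each hour.
fitted_b = {4: -0.62, 6: -0.35, 8: -0.10, 10: 0.04, 12: 0.12, 14: 0.21}

print(first_saturation_hour(fitted_b, threshold=0.1))
print(first_saturation_hour(fitted_b, threshold=0.5))
```

A threshold above zero makes the rule more conservative than simply testing the sign of b.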
Discussion
Both our model and data show clear saturation patterns of independent retweeting behaviour. The reasons may be a mixture of fading interest and the limited screen size of personal devices. Microblogging is a platform for high-speed information exchange. People quickly lose interest in one topic and shift to another. In addition, because of the limited screen size of personal devices and the speed at which new content is generated, people easily miss large amounts of information and are unlikely to make the effort to dig up old messages. The combined effect of these factors results in the saturation of spreading.
A possible explanation for the exponentially distributed landscape could be given based on the Boltzmann distribution, which characterizes the distribution of particles over the energy states of isolated systems in thermal equilibrium. As in thermodynamic systems, retwitters in our systems are reasonably assumed to be separable, and their interest orientations may correspond to particles’ degeneracy states. The energy of a retwitter may represent his or her activeness in the sense of retweeting. The states of retwitters are associated only with their energy. Retwitters could be distributed over different energy states. According to the principle of equal a priori probabilities, the distribution with the largest number of micro-states occurs with the highest probability. This Most Probable Distribution leads to the exponential form of the Boltzmann distribution.
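The step from the Most Probable Distribution to the exponential form can be sketched with the standard textbook argument. The symbols below are introduced only for illustration: \({n}_{i}\) retwitters occupy the state with energy \({\varepsilon }_{i}\) and degeneracy \({g}_{i}\), and the total number of retwitters N and the total energy E are held fixed. With Stirling's approximation, the number of micro-states W satisfies

\[\mathrm{ln}\,W\approx N\,\mathrm{ln}\,N-\sum _{i}{n}_{i}\,\mathrm{ln}\frac{{n}_{i}}{{g}_{i}},\qquad \sum _{i}{n}_{i}=N,\qquad \sum _{i}{n}_{i}{\varepsilon }_{i}=E.\]

Maximizing \(\mathrm{ln}\,W\) under the two constraints with Lagrange multipliers gives

\[{n}_{i}\propto {g}_{i}\,{e}^{-\beta {\varepsilon }_{i}},\]

where β is the multiplier attached to the energy constraint. The occupation of a state therefore decays exponentially with its energy.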
Among our fitting results, we find that most retweeting samples with relatively large fitting errors show multi-peak patterns in their retweeting intensity curves. The main reason is the circadian rhythm in people’s daily life. When an influential publishes a tweet in the daytime, most active retwitters have enough time to respond to the message. If, however, the message is published around people’s sleeping period, a portion of the retwitters are inactive at that moment. Retweeting activities then appear to pause at night and resume in the morning. This leads to multi-peak patterns and relatively large fitting errors. We plan to take this effect into account in our future work.
Parameter b changes if we choose different lengths of time intervals for fitting. The reason is that as the original message gets older, the number of users who will see it gets smaller because of the limit of screen sizes. Retweeting happens when two conditions hold: first, users must be willing to retweet a certain message; second, users must have the chance to see the message. The number of users who are willing to retweet a certain message may be relatively stable. However, the number of users who will see the message decreases after its publication, unless other influentials retweet it during the spreading process. The older a message gets, the more effort users need to find it.
The significance of our work is two-fold. First, this work demonstrates that applying physical theories to research fields outside physics, such as social sciences and economics24,36,37,38, is of great significance. It is true that data fitting methods are widely used to summarize the relationships among variables and to infer values. However, when different scenarios are considered, the fitting procedures need to be repeated in a possibly high-dimensional parameter space. In addition, the actual meanings of the fitted parameters could be ambiguous. The analogy with physical systems helps us derive a model that explains the formation of the data with microscopic mechanisms and meaningful parameters. Once the changes of parameters are measured, the model can be applied in other scenarios. Second, our work indicates that some invariable rules should be taken into consideration during the prediction of information spreading. Traditionally, various features retrieved from users, message text and network structures are adopted in well-tuned models for prediction. However, we find that our derived model with a unified form governs the independent retweeting behaviours with different authors and contents in our data set. Some features, such as the memory effect, seem to be common to different retweeting activities. Hence, some common rules, which have been ignored so far, may be useful for prediction.
Materials and Methods
Data Description
We obtain two data sets from Sina Weibo for our study. The message data set consists of more than 69 million tweets/retweets. Each tweet has 6 associated properties: (1) message text, (2) original publisher’s ID and nickname, (3) timestamp, (4) number of retweets, (5) number of comments, and (6) a flag indicating whether the tweet is originally published or retweeted. The link data set consists of 3.7 billion directed links among 80 million users.
Model Fitting Method
For each of the 2623 seeds, all of its retweets are collected. We use regular expressions to parse the retweet tag in the text of retweets. Then a chain of retwitters of the root author could be retrieved chronologically for each branch of retweets of the initialized tweet. In this way, we could locate the retwitters who directly retweet the root user’s tweets, rather than indirectly retweet it from other retwitters.
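The parsing step can be sketched as follows. The sketch assumes that each intermediate retwitter appears in the retweet text as a tag of the form `//@nickname:`, with the most recent retwitter first; the exact pattern used in the study is not given, so the regular expression and the function names are hypothetical.

```python
import re

# Assumed tag format: "//@nickname:" with an ASCII or full-width colon.
RETWEET_TAG = re.compile(r"//@([^:：/\s]+)[:：]")


def retweet_chain(text):
    """Nicknames of intermediate retwitters, earliest first.

    Tags are assumed to be listed from the most recent retwitter to the
    earliest one, so the matches are reversed.
    """
    return list(reversed(RETWEET_TAG.findall(text)))


def is_direct_retweet(text):
    """True if the text carries no retweet tag, i.e. the message was
    retweeted straight from the root author."""
    return RETWEET_TAG.search(text) is None


sample = "nice//@bob:agree//@alice:look at this"
print(retweet_chain(sample))
print(is_direct_retweet("interesting"))
```

Only retweets for which `is_direct_retweet` holds would count toward the independent retweeting curve of the root tweet.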
For each initialized tweet, we plot the curve of its cumulative number of retweets with respect to time and fit the curve with our power-function model. We choose the Root Mean Square Error (RMSE) as the objective error function to measure how well the real-world data fit our model. In addition, to make the fitting errors of different samples comparable, we divide each RMSE by the maximum (saturated) number of retweets of the corresponding initialized tweet. This division has no effect on parameter optimization, and it provides a proper way of evaluating the goodness of fit. Equation (13) defines our objective function.
\[F(a,b,c)=\frac{1}{{M}_{N}}\sqrt{\frac{1}{N}\sum _{t=1}^{N}{({M}_{t}-{\hat{M}}_{t})}^{2}}\qquad (13)\]
where \({M}_{t}\) is the cumulative number of retweets in the empirical data at time t and \({\hat{M}}_{t}\) is the model output at time t. The time granularity is set to 1 minute, and the maximum of t is set to 72 hours, since most spreading processes will saturate in DFNs by the end of the third day after the initialization of the seed tweets. Hence N is 4,320. \({M}_{N}\) is the saturated number of retweets, i.e., the maximum retweeting number.
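The normalized RMSE described above can be written as a short function. This is a sketch under the assumption that the observed series is sampled at t = 1, 2, …, N, so that its last element is the saturated number of retweets; the function names are hypothetical.

```python
import math


def power_model(t, a, b, c):
    """Model output n(t) = c - a * t**(-b)."""
    return c - a * t ** (-b)


def normalized_rmse(observed, a, b, c):
    """RMSE between the observations and the model, divided by the final
    observed count. observed[i] is the cumulative count at t = i + 1."""
    n_points = len(observed)
    squared_error = sum(
        (m - power_model(t, a, b, c)) ** 2
        for t, m in enumerate(observed, start=1)
    )
    return math.sqrt(squared_error / n_points) / observed[-1]


# Synthetic series generated from the model itself (hypothetical parameters).
observed = [power_model(t, 50.0, 0.5, 100.0) for t in range(1, 101)]
print(normalized_rmse(observed, 50.0, 0.5, 100.0))
print(normalized_rmse(observed, 50.0, 0.4, 100.0))
```

Because the divisor depends only on the data, it rescales the error without moving the minimizing parameters.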
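We assume that the objective function to be minimized is \(F(a,b,c)\) with respect to parameters \(a\), \(b\) and \(c\). The first-order total derivative of \(F\) with respect to \(b\) is
\[\frac{dF}{db}=\frac{\partial F}{\partial b}+\frac{\partial F}{\partial a}\frac{da}{db}+\frac{\partial F}{\partial c}\frac{dc}{db}.\]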
It is difficult to calculate the first-order and second-order total derivatives of the error function with respect to parameter b, since b occurs as the exponent of variable t. However, if we set \(\partial F/\partial a\mathrm{=0}\) and \(\partial F/\partial c\mathrm{=0}\), then \(dF/db\) equals \(\partial F/\partial b\), and parameters a and \(c\) can be expressed in terms of b. Hence, to avoid the expensive derivative computation, we adopt a Quasi-Newton method to find the optimal parameter set. The convexity of F with respect to b is difficult to prove. However, the search algorithm yields good results.
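The reduction to a one-dimensional search over b can be sketched as follows. For a fixed b the model is linear in a and c, so both follow from ordinary least squares. The sketch then minimizes the remaining error over b with a golden-section search, which stands in here for the Quasi-Newton method and assumes that the error is unimodal on the chosen bracket. The function names, the bracket and the synthetic data are hypothetical.

```python
import math


def solve_a_c(times, counts, b):
    """Least-squares a and c for fixed b in the model c - a * t**(-b)."""
    xs = [t ** (-b) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    slope = sxy / sxx  # slope of counts against t**(-b) equals -a
    return -slope, mean_y - slope * mean_x


def profile_error(times, counts, b):
    """RMSE at the given b with a and c set to their least-squares values."""
    a, c = solve_a_c(times, counts, b)
    squared = sum((y - (c - a * t ** (-b))) ** 2 for t, y in zip(times, counts))
    return math.sqrt(squared / len(times))


def fit_power_model(times, counts, b_low=0.05, b_high=2.0, tol=1e-8):
    """Golden-section search over b (assumes a unimodal error on the bracket)."""
    ratio = (math.sqrt(5) - 1) / 2
    lo, hi = b_low, b_high
    x1 = hi - ratio * (hi - lo)
    x2 = lo + ratio * (hi - lo)
    f1 = profile_error(times, counts, x1)
    f2 = profile_error(times, counts, x2)
    while hi - lo > tol:
        if f1 < f2:
            # Minimum lies in [lo, x2]; the old x1 becomes the new x2.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - ratio * (hi - lo)
            f1 = profile_error(times, counts, x1)
        else:
            # Minimum lies in [x1, hi]; the old x2 becomes the new x1.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + ratio * (hi - lo)
            f2 = profile_error(times, counts, x2)
    b = (lo + hi) / 2
    a, c = solve_a_c(times, counts, b)
    return a, b, c


# Synthetic, noise-free data; times start at 1 because t = 0 is undefined.
times = list(range(1, 201))
counts = [1000.0 - 800.0 * t ** (-0.6) for t in times]
a_fit, b_fit, c_fit = fit_power_model(times, counts)
print(round(a_fit, 3), round(b_fit, 3), round(c_fit, 3))
```

The bracket only covers positive b, so this sketch applies to saturated curves; fitting the divergent early stage would need a bracket that includes negative values.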
References
Guille, A., Hacid, H., Favre, C. & Zighed, D. A. Information diffusion in online social networks: A survey. ACM SIGMOD Record 42, 17–28 (2013).
Iribarren, J. L. & Moro, E. Branching dynamics of viral information spreading. Physical Review E 84, 046116 (2011).
Wang, D. et al. A genealogy of information spreading on microblogs: A galton-watson-based explicative model. In Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM), 2391–2399 (IEEE, 2013).
Kawamoto, T. & Hatano, N. Viral spreading of daily information in online social networks. Physica A: Statistical Mechanics and its Applications 406, 34–41 (2014).
Jiang, C., Chen, Y. & Liu, K. R. Evolutionary dynamics of information diffusion over social networks. IEEE Transactions on Signal Processing 62, 4573–4586 (2014).
Pastor-Satorras, R. & Vespignani, A. Epidemic spreading in scale-free networks. Physical Review Letters 86, 3200 (2001).
Liu, Z. & Hu, B. Epidemic spreading in community networks. Europhysics Letters (EPL) 72, 315 (2005).
Lü, L., Chen, D.-B. & Zhou, T. The small world yields the most effective information spreading. New Journal of Physics 13, 123005 (2011).
Newman, M. E., Forrest, S. & Balthrop, J. Email networks and the spread of computer viruses. Physical Review E 66, 035101 (2002).
Lerman, K. & Ghosh, R. Information contagion: An empirical study of the spread of news on digg and twitter social networks. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 90–97 (AAAI, 2010).
Goel, S., Watts, D. J. & Goldstein, D. G. The structure of online diffusion networks. In Proceedings of the 13th ACM Conference on Electronic Commerce, 623–638 (ACM, 2012).
Yang, J. & Leskovec, J. Patterns of temporal variation in online media. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM), 177–186 (ACM, 2011).
Crane, R. & Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105, 15649–15653 (2008).
Johansen, A. & Sornette, D. Download relaxation dynamics on the www following newspaper publication of url. Physica A: Statistical Mechanics and its Applications 276, 338–345 (2000).
Goldenberg, J., Libai, B. & Muller, E. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters 12, 211–223 (2001).
Matsubara, Y., Sakurai, Y., Prakash, B. A., Li, L. & Faloutsos, C. Rise and fall patterns of information diffusion: model and implications. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 6–14 (ACM, 2012).
Johansen, A. Response time of internauts. Physica A: Statistical Mechanics and its Applications 296, 539–546 (2001).
Johansen, A. Probing human response times. Physica A: Statistical Mechanics and its Applications 338, 286–291 (2004).
Derrida, B. Random-energy model: An exactly solvable model of disordered systems. Physical Review B 24, 2613 (1981).
Barabasi, A.-L. The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005).
Han, X.-P., Zhou, T. & Wang, B.-H. Modeling human dynamics with adaptive interest. New Journal of Physics 10, 073010 (2008).
Xie, J., Zhang, C. & Wu, M. Modeling microblogging communication based on human dynamics. In Proceedings of the 8th IEEE International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2290–2294 (IEEE, 2011).
Iribarren, J. L. & Moro, E. Impact of human activity patterns on the dynamics of information diffusion. Physical Review Letters 103, 038702 (2009).
Jia, T., Wang, D. & Szymanski, B. K. Quantifying patterns of research-interest evolution. Nature Human Behaviour 1, 0078 (2017).
Zhou, C., Zhao, Q. & Lu, W. Impact of repeated exposures on information spreading in social networks. PLoS One 10, e0140556 (2015).
Centola, D. The spread of behavior in an online social network experiment. Science 329, 1194–1197 (2010).
Bao, P., Shen, H.-W., Chen, W. & Cheng, X.-Q. Cumulative effect in information diffusion: empirical study on a microblogging network. PLoS One 8, e76027 (2013).
Bouchaud, J.-P. Weak ergodicity breaking and aging in disordered systems. Journal de Physique I 2, 1705–1713 (1992).
Chierichetti, F., Kumar, R., Raghavan, P. & Sarlos, T. Are web users really markovian? In Proceedings of the 21st ACM International Conference on World Wide Web (WWW), 609–618 (ACM, 2012).
Jo, H.-H., Karsai, M., Kertész, J. & Kaski, K. Circadian pattern and burstiness in mobile phone communication. New Journal of Physics 14, 013055 (2012).
Seyed-Allaei, H., Seyed-Allaei, H. & Ejtehadi, M. R. Energy-landscape networks of spin glasses. Physical Review E 77, 031105 (2008).
Bi, D., Lopez, J. H., Schwarz, J. & Manning, M. L. Energy barriers and cell migration in densely packed tissues. Soft Matter 10, 1885–1890 (2014).
Samarakoon, A. et al. Aging, memory, and nonhierarchical energy landscape of spin jam. Proceedings of the National Academy of Sciences 201608057 (2016).
He, M., Li, B. & Luo, L. Sznajd model with "social temperature" and defender on small-world networks. International Journal of Modern Physics C 15, 997–1003 (2004).
Krause, S. M. & Bornholdt, S. Opinion formation model for markets with a social temperature and fear. Physical Review E 86, 056106 (2012).
Helbing, D., Farkas, I. & Vicsek, T. Simulating dynamical features of escape panic. Nature 407, 487–490 (2000).
Yakovenko, V. M. & Rosser, J. B. Jr Colloquium: Statistical mechanics of money, wealth, and income. Reviews of Modern Physics 81, 1703 (2009).
Castellano, C., Fortunato, S. & Loreto, V. Statistical physics of social dynamics. Reviews of Modern Physics 81, 591 (2009).
Acknowledgements
This work was supported by the National Key Research and Development Project of China (No. 2017YFC09019010 entitled New Generation Intelligent Building Platform Techniques, and 2016YFB0901901), the National Natural Science Foundation of China (No. 61425027), the 111 International Collaboration Program of China under Grant B06002, and Innovation Joint Research Center for Cyber-Physical-Society System.
Contributions
C.Z. conceived the study, prepared all the figures and wrote the manuscript. C.Z. and W.L. processed the data and carried out the primary analysis. C.Z., Q.Z. and W.L. analyzed the results and reviewed the manuscript.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhou, C., Zhao, Q. & Lu, W. Cumulative Dynamics of Independent Information Spreading Behaviour: A Physical Perspective. Sci Rep 7, 5530 (2017). https://doi.org/10.1038/s41598-017-05899-5
DOI: https://doi.org/10.1038/s41598-017-05899-5