Entropy Measures of Human Communication Dynamics

Human communication is commonly represented as a temporal social network and evaluated in terms of its uniqueness. We propose a set of new entropy-based measures for human communication dynamics represented within the temporal social network as event sequences. Using real-world datasets and random interaction series of different types, we find that real human contact events always differ significantly from random ones. This distinctiveness increases over time, and the proposed entropy measures let us observe sociological processes that take place within dynamic communities.


Introduction
Despite living in social communities and witnessing people communicate, at first glance we may not recognize clear patterns or trends of dynamic changes in communication; the general impression may be that people interact almost randomly. Even though many studies [1][2][3][4] show that human interactions are not random, some vital questions still need to be addressed: how specific these interactions are and how stable they remain over time. Additionally, communication traces are the main source of the interactions represented by social networks [5]; hence, questions about communication dynamics simultaneously address the problem of stability of temporal networks.
Although temporal social networks have been studied for several years, there is no fixed and commonly agreed set of measures quantifying their dynamics. This is partly because there are many representations of temporal networks, such as event sequences, interval graphs, time windows, etc. It is hard to develop a comprehensive measure that covers all these models. Therefore, we may expect that the development of dynamic measures will proceed differently than in the case of static networks.
One of the most important concepts introduced in the temporal setting is the time-respecting path, i.e. a path connecting nodes $v_i$ and $v_j$ in such a way that all intermediate nodes are visited in non-decreasing time order [6]. Starting with that metric, it was possible to define a number of natural subsequent measures, such as temporal connectedness [5] between nodes, representing the reachability of a destination node from a source node in a given time, temporal diameter as the largest temporal distance between any two nodes, and characteristic temporal path length, which defines the temporal distance over all pairs of nodes [7]. Another important aspect of time-varying networks is the interevent time distribution [8], which defines the frequency of events; it can be used to verify how bursty the behavior in a given network is. To quantify differences in burstiness, the expected number of short-time interactions is used to characterize the early-time dynamics of a temporal network [9]. Lastly, a number of centrality measures were adapted or developed from scratch to describe the position of a node in the network, in particular: temporal betweenness [10], temporal closeness [11], and temporal degree [10].
Entropy-based measures, in turn, were utilized by Takaguchi et al. [12] to evaluate the predictability of the partner sequence for individuals. In 2013, Kun Zhao et al. [13] proposed an entropy-based measure to quantify how many typical configurations of social interactions can be expected at any given time, taking into account the history of the network dynamic processes.
We use entropy to capture the dynamics of human communication recorded as event sequences (ES), which depict human interactions and are also one of the basic lossless representations of a temporal network [14]. In general, an event sequence is a time-ordered list of interactions between pairs of individuals/agents within a given social group.
We propose three main approaches to computing entropy for temporal networks represented as an event sequence: 1) the first-order entropy, based on the probability of a node appearing as a speaker or, in other words, as the initiator of an event; 2) the second-order entropy, based on the probability of event occurrence, that is, the probability of interaction between a unique pair of nodes; 3) the third-order entropy, based on the probability of a succession, i.e. the probability of a unique pair of consecutive events. Each type of entropy captures a different aspect of dynamics and has the potential to be useful for different applications. For each new entropy measure, its maximum value can be estimated for a given number of nodes. This value is used for normalization and for the definition of relative entropy measures that allow us to compare entropies across datasets.
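As an illustration of the three orders, the following minimal Python sketch estimates each probability as an empirical frequency in the event sequence. The function names, the representation of an event as a `(sender, receiver, timestamp)` tuple, and the base-2 logarithm are assumptions made for this example, not definitions taken from the text.

```python
from collections import Counter
from math import log2

def shannon(counts):
    """Shannon entropy in bits (base 2 is an assumption) of empirical counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def entropies(events):
    """events: time-ordered list of (sender, receiver, timestamp) tuples."""
    # First order: how often each node appears as the initiator of an event.
    senders = Counter(s for s, _, _ in events)
    # Second order: how often each directed pair (sender, receiver) interacts.
    pairs = Counter((s, r) for s, r, _ in events)
    # Third order: how often each pair of consecutive events (succession) occurs.
    ordered = [(s, r) for s, r, _ in events]
    successions = Counter(zip(ordered, ordered[1:]))
    return (shannon(senders.values()),
            shannon(pairs.values()),
            shannon(successions.values()))

events = [("a", "b", 1), ("b", "a", 2), ("a", "b", 3), ("c", "a", 4)]
print([round(x, 3) for x in entropies(events)])  # [1.5, 1.5, 1.585]
```

In this toy sequence, node a initiates half of the events, so the first-order entropy stays below the value of log2(3) that three equally active initiators would give.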
This paper is organized as follows. In the first section, we present the results of our experiments, followed by the main findings and conclusions. The second section broadly discusses the meaning of the findings and provides some insight for further work. The last section contains a detailed description of our experiments: the experimental setup, the datasets used and the definitions of all entropies.

Results
We compute entropy values for four datasets of real human interactions: (1) face-to-face meetings at the HyperText conference, (2) text messages exchanged between students over 6 semesters (NetSense), (3) email communication in a manufacturing company, and (4) face-to-face interactions between patients and hospital staff members. We compute the timeline of entropy by taking a window from the beginning of the network's existence to the point in time for which we want to know the entropy value. In other words, we compute entropy cumulatively for an on-line stream of interaction data. To provide a baseline for the real event sequences, we generate 100 artificial event sequences for each dataset with the same numbers of nodes, events and timestamps by randomly reselecting the pairs of nodes involved in each event. In static networks, such a procedure would be called rewiring. The average value of entropy for the random event sequences is computed and compared against the value for the real network using the Z-score, a distance measure that, in general, shows the number of standard deviations by which the entropy of the real sequence lies above the mean value for the random streams. Negative Z-scores mean that the entropies for real data are smaller than the random ones, and the greater the difference, the more negative the Z-score. The general concept of the experiments is presented in Fig. 1.
Figure 1. General schema of experiments. K = 100 was used. From the original (real) event sequence, event timestamps are extracted as a base for the random sequence generator. The entropy value is computed for the real event sequence and for the artificial sequences. We compare the results for real data with the summarized results for artificial data using the Z-score.
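The cumulative computation described above can be sketched as follows for the second-order entropy. This is a minimal illustration rather than the implementation used in the experiments: the function name, the tuple layout `(sender, receiver, timestamp)` and the base-2 logarithm are assumptions.

```python
from collections import Counter
from math import log2

def cumulative_second_order_entropy(events):
    """Yield (timestamp, entropy) after each event, using all events seen so far.

    events: time-ordered list of (sender, receiver, timestamp) tuples.
    """
    counts = Counter()
    for n, (s, r, t) in enumerate(events, start=1):
        counts[(s, r)] += 1  # extend the window by one event
        # Entropy of the directed-pair distribution over the first n events.
        h = -sum(c / n * log2(c / n) for c in counts.values())
        yield t, h

events = [("a", "b", 1), ("a", "b", 2), ("b", "a", 3), ("c", "a", 4)]
for t, h in cumulative_second_order_entropy(events):
    print(t, round(h, 3))
```

Recomputing the sum after every event keeps the sketch short; for long streams one would more likely evaluate the entropy only at selected points in time.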
The first observation about the nature of entropy is that its maximum value is non-decreasing over time, since it depends directly on the non-decreasing number of distinct nodes in the event sequence. By normalizing entropy with its maximum, we obtain relative entropies within the range [0,1]. Our experimental results show that the entropy of random networks tends to reach the maximum value faster for first-order entropies and slower for higher-order ones. In Fig. 2A, the entropy curves for random sequences have a shape suggesting that they converge to some maximum value, i.e. 1 in the case of normalized entropy. Fig. 2B shows a similar tendency for non-normalized entropy. However, the relative entropy values for the real network seem to stabilize earlier and around a smaller value. This is clearly visible for the first-order entropy, and the higher-order ones show a converging shape as well. Similar observations were made for all other examined datasets.
We split each dataset into parts selected empirically for more convenient analysis. The clearest observations were noted for non-normalized second-order entropy, see Fig. 3, even though the same phenomena can be seen for all datasets and all entropies. The main finding that can be derived from our results is that entropy decreases over time, except for some rare cases, which are explained later on. The results for face-to-face contacts on the first two days of the conference, see Fig. 3A, are similar in terms of their dynamics; however, the last day is significantly different. This suggests that participants know each other much better on the last day and interact much more consciously, i.e. with a smaller number of peers. A similar effect is observable for university students, see Fig. 3B. The entropy decreases with each consecutive year of study and is the lowest for the last, sixth semester. Further, the results for email communication in the manufacturing company show that the value of entropy decreases over consecutive months, with the exception of June 2010, see Fig. 3C. We suppose that this month breaks from the pattern because of holidays: it may be the month when the majority of employees go on vacation, which significantly changes the dynamics of communication. Similarly, for face-to-face contacts among hospital staff and patients, Fig. 3D, we can note that entropy decreases on consecutive days except on the 6th of December. This day is usually celebrated as Saint Nicholas Day, which makes people significantly change their common pattern of communication.
We also measure the distance of the real sequence entropy from the random sequence entropy using the Z-score. The results confirm that there is a clear difference between reality and randomness. A sample plot of the Z-score is presented in Fig. 2C. For more results, see Supplementary Fig. 6. We can observe that the Z-score decreases over time or, in other words, the difference between reality (smaller entropy that is stable over time) and randomness (greater entropy that is still growing in time) becomes more and more clear over time.
To show the differences between datasets, we compare their entropy values, using the normalized versions to exclude the effect of network size (different numbers of individuals), separately for the first- and second-order entropies, see Fig. 4. The greatest first-order value and the lowest deviation are observed in the manufacturing company. This suggests that almost every employee shows up in the company every working day and interacts with most of the other workers with similar frequency and stability of contacts (the greatest second-order entropy). Communication in the company therefore appears to be decentralized and rather 'flat'. Patients in the hospital appear and disappear (low first-order entropy), but while they are present, they interact more randomly than students, who communicate much more within their encapsulated social/learning groups (low second-order entropy). The randomness of interactions between hospital staff members and patients is comparable to that among conference participants (second-order entropy): they do not know each other well, even though the first-order values suggest that there is less rotation among conference attendees than in the hospital. The diversity of contacts (high standard deviation of second-order entropy) is the greatest in the hospital, which means that, depending on time, the social groups are more or less integrated, e.g. interactions among staff members and interactions between patients differ. Interactions among students and among employees are the most stable (low standard deviation). Based on these observations, we conclude that different approaches to entropy computation (entropy orders) can measure different aspects of communication dynamics.

Discussion
The results of our experiments provide some interesting insights about human communication dynamics. Firstly, we can confirm the general intuition that people do not communicate randomly. This intuitive fact now also finds quantitative confirmation in the temporal network context.
The second important observation is that entropy decreases over time, i.e. over consecutive periods. Referring to the examined datasets, we can explain it by the human tendency to narrow the circle of friends with whom one usually communicates. In other words, as people get to know each other, they discover their preferences for interlocutors. This is the opposite of the early stage of group formation, when people communicate more or less randomly. The effect is clearly seen in Fig. 3B for the NetSense dataset, which contains the text communication of freshman students who start their studies at a new university. Similarly, we can observe decreasing entropy in the other datasets, independently of the trend of the random sequence entropy.
Another observation is that the distance from the entropy of the real sequence to the entropy of the random sequence, in general, increases over time; see Fig. 2C for a sample of Z-score values. Similar trends arise for all other datasets. A group of people unfamiliar with each other engages in nearly random interactions, which become increasingly non-random as the familiarity of people in the group increases with time.
We recognize some potential of entropy-based measures for solving problems such as the detection of social communities from dynamic data about human activities. Our hypothesis is that entropy is able to distinguish different groups in the event sequence, since the groups may have different dynamic profiles of interactions (different entropy levels), e.g. among hospital staff members and, separately, among patients.
It should be noted that we considered events in the sequence to be directed interactions in our experiments. However, in some applications it may be more meaningful to treat events as undirected contacts.

Methods
In this section, we present in detail all methods, measures and datasets used in the experiments.

Temporal network representation
All experiments are performed on event sequences (ES) [15], which are lossless representations of a temporal social network and the most popular form of traces of human communication [14]. Since the event sequence is the most atomic representation, it fits real processes better than aggregated approaches such as an aggregated weighted network [16] or a time-window graph [17].
An event sequence (ES) is a time-ordered list of events, and each event $ev_{ijk}$ captures a single time-stamped interaction between two individuals in the observed system, i.e. $ev_{ijk}$ is a triple $ev_{ijk} = (s_i, r_j, t_k)$, where $s_i$ is the sender/initiator and $r_j$ the receiver of the interaction at time $t_k$. We also assume that an event can happen only between two different individuals (nodes): $s_i \neq r_j$. We also define $e_{ij}$ as an edge between two nodes, that is $e_{ij} = (s_i, r_j)$. It exists if there is any event from $s_i$ to $r_j$ at any time. Note that edges are directed: $(s_i, r_j) \neq (r_j, s_i)$, i.e. $e_{ij} \neq e_{ji}$. The set of all edges derived from a given event sequence ES is denoted as $E$. Let us define $V$ as the set of all distinct individuals (nodes) participating in all considered events, i.e. $V = \{s, r : (s, r) \in E \lor (r, s) \in E, s \neq r\}$. $N$ denotes the size of the set $V$: $N = |V|$. For further consideration, let us define the space of possible edges $\Omega(E)$, i.e. the set of all possible pairs $\{(s, r) : s, r \in V, s \neq r\}$. Hence, $|\Omega(E)| = N(N - 1)$. Some measures in the experiments are computed for the aggregated network, which is a static generalization of the event sequence ES, that is, simply a directed graph $G$ defined by a tuple: $G = (V, E)$.
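To make these definitions concrete, the short Python sketch below derives the node set, the directed edge set and the space of possible edges from a toy event sequence. The function name `aggregate`, the tuple layout `(sender, receiver, timestamp)` and the sample data are hypothetical and serve only as an illustration.

```python
from itertools import permutations

def aggregate(events):
    """Derive V, E and the space of possible edges from an event sequence."""
    for s, r, _ in events:
        if s == r:
            raise ValueError("an event must connect two different nodes")
    E = {(s, r) for s, r, _ in events}   # directed edges: (a, b) differs from (b, a)
    V = {v for edge in E for v in edge}  # nodes seen as sender or receiver
    omega = set(permutations(V, 2))      # all ordered pairs of distinct nodes
    return V, E, omega

events = [("a", "b", 1), ("b", "a", 2), ("a", "c", 5)]  # toy data
V, E, omega = aggregate(events)
N = len(V)
print(N, len(E), len(omega))  # 3 3 6
```

The pair `(V, E)` corresponds to the aggregated directed graph, and the size of `omega` equals N(N - 1), here 3 · 2 = 6.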

Entropy-based measures for temporal network
In this section, we propose a holistic approach: new measures designed especially to quantify the properties of temporal networks in terms of their inner dynamic processes. The proposed measures are the main novelty of this work, even though they build on entropy, a concept well known in physics and information theory. Entropy is a probabilistic description of the properties of a general system that captures its level of randomness. In particular, based on the event sequence (ES) as the representation of a temporal network, we propose various entropy measures.
In general, we utilize entropy $S$, known in information theory as information entropy or Shannon entropy, which is defined as follows:
$$S = -\sum_{i \in O} p(i) \log p(i),$$
where $p(i)$ is the occurrence probability of state or object $i$, and $O$ is the set of all possible states/objects [18]. For each real event sequence, we generate corresponding random event sequences to provide a baseline for our experiments.
The new event sequences were generated preserving the timestamps and the set of nodes from the real event sequence. Hence, the acquired event sequences are the same in size and have the same set of nodes, but a different distribution and order of events. We generated an event sequence with the following algorithm:
1. Take the real event sequence ES and extract the distinct nodes from the events' senders and receivers to create the set of nodes $V$.
2. Take the next event from the real event sequence, starting from the first one, and keep its timestamp $t_k$.
3. Randomly select the sender $s_i \in V$ (according to the selected distribution).
4. Randomly select the receiver $r_j \in V$ (according to the selected distribution).
5. If the sender and receiver are the same, repeat step 4.
6. Create the event $ev_{ijk} = (s_i, r_j, t_k)$.
7. If it is the last event in the real sequence ES, stop; otherwise go to step 2.
We tested random selection with uniform, normal, and exponential distributions. The results of the experiments show that the differences between the distributions in terms of entropy are not significant; hence, we used only the uniform distribution for random generation.
For each real event sequence, we generated 100 random event sequences.
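The generation procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the tuple layout `(sender, receiver, timestamp)` and the optional `rng` argument are hypothetical, and only the uniform selection is covered.

```python
import random

def random_event_sequence(events, rng=random):
    """Return a randomized copy of events that keeps timestamps and node set.

    events: time-ordered list of (sender, receiver, timestamp) tuples.
    rng: the random module or a random.Random instance (for reproducibility).
    """
    # Step 1: distinct nodes among senders and receivers (sorted for determinism).
    nodes = sorted({v for s, r, _ in events for v in (s, r)})
    randomized = []
    for _, _, t in events:                     # step 2: keep the timestamp
        sender = rng.choice(nodes)             # step 3: uniform choice of sender
        receiver = rng.choice(nodes)           # step 4: uniform choice of receiver
        while receiver == sender:              # step 5: redraw on a self-loop
            receiver = rng.choice(nodes)
        randomized.append((sender, receiver, t))  # step 6: create the event
    return randomized                          # step 7: stop after the last event

real = [("a", "b", 1), ("b", "c", 2), ("c", "a", 2), ("a", "c", 7)]
baseline = [random_event_sequence(real, random.Random(k)) for k in range(100)]
```

Seeding a separate `random.Random` instance per baseline sequence, as in the usage lines, is one way to make the 100 random sequences reproducible.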

Evaluation
We used the Z-score to evaluate the distance between the entropy value of the real network and its random analogues, see Fig. 2C. The Z-score value is defined as follows:
$$Z = \frac{S - \mu}{\sigma},$$
where $S$ is the observation from the real data and $\mu$, $\sigma$ are the mean and standard deviation of the random variable, respectively. In our case, the observation $S$ is the value of the appropriate entropy ($S_1$, $S_2$, $S_3$) for the real event sequence. The 100 randomly generated event sequences, in turn, are aggregated by the mean $\mu$ and standard deviation $\sigma$ of their entropy values.
Figure 7. NetSense data: the number $N$ of participating nodes (interacting students), which is non-decreasing over time because the set of events only grows. This is the direct reason for the non-decreasing maximum value of entropy used for normalization.
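As a small numerical illustration of the Z-score used in the evaluation, the Python sketch below compares one entropy value of a real sequence with the entropy values of random sequences. The function name is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, pstdev

def z_score(real_entropy, random_entropies):
    """Number of standard deviations by which real_entropy exceeds the random mean."""
    mu = mean(random_entropies)
    sigma = pstdev(random_entropies)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("random entropies have zero variance")
    return (real_entropy - mu) / sigma

# Toy values: the real entropy lies below both random entropies.
print(z_score(1.0, [2.0, 4.0]))  # -2.0
```

A negative result, as here, corresponds to a real sequence whose entropy is smaller than that of its random baseline.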

Figure 2. The NetSense dataset, the 1st semester. A) Values of normalized entropies. Solid lines refer to the original event sequence and dashed ones present the average value for the baseline (random sequences). B) Values of non-normalized entropies. C) Z-score for non-normalized second-order entropy with the computed trend and marked standard deviation (gray area).

Figure 3. Value of non-normalized second-order entropy for all examined datasets. Solid lines refer to the real event sequences; upper dashed lines refer to the average values of random sequences. Each dataset is divided into parts for more convenient analysis. Parts were selected empirically. The different levels of entropy for random sequences (especially for NetSense and the hospital) come from either a smaller or a greater number of interacting nodes in a given period. A) On consecutive days of the conference, the entropy of communication decreases, which is especially clear for the last day of the conference. B) Students tend to be more selective in their communication in later semesters than at the beginning of their studies. C) Manufacturing company employees communicate with similar dynamics over time, but a decreasing tendency of the entropy can still be observed, with the exception of one month, probably related to the holiday period. D) Contacts between hospital staff and patients show decreasing entropy over consecutive days, with the exception of the 6th of December, usually celebrated as Saint Nicholas Day, which may influence the dynamics of contacts.

Table 1. Datasets in numbers