Reconstructing signed relations from interaction data

Positive and negative relations play an essential role in human behavior and shape the communities we live in. Despite their importance, data about signed relations is rare and commonly gathered through surveys. Interaction data is more abundant, for instance, in the form of proximity or communication data. So far, though, it could not be utilized to detect signed relations. In this paper, we show how the underlying signed relations can be extracted from such data. Employing a statistical network approach, we construct networks of signed relations in four communities. We then show that these relations correspond to the ones reported by the individuals themselves. Additionally, using inferred relations, we study the homophily of individuals with respect to gender, religious beliefs, and financial backgrounds. Finally, we study group cohesion in the analyzed communities by evaluating triad statistics in the reconstructed signed network.


Introduction
Social interactions and signed relations are distinct yet related facets of human behavior. Social interactions are short-lived contacts during which individuals exercise directed or reciprocal influence over one another [30]. Individuals can interact via different means, and their interactions may repeatedly occur over time. Signed relations, such as friendship and enmity, are interpersonal relations characterized by a sign (positive or negative) reflecting how one person feels or thinks about another. Signed relations are long-lived and change less frequently as more effort is required to form or change them.
While social interactions and signed relations are different, they are coupled to each other, with relations acting as drivers for interactions. A positive relation commonly induces more interactions, while a negative one hinders them [14]. Moreover, humans perceive surrounding patterns of positive and negative relations [8], to which they adapt [13]. Over time, such adaptations can lead to interactions appearing mostly within cohesive groups, potentially leading to echo chambers. Negative links may be formed across opposing groups, pushing communities towards segregation and, eventually, to polarization [10,31].
To understand such phenomena quantitatively, we require data on positive and negative relations, which is rare. Interaction data is the more abundant alternative. However, it does not directly inform us about the relations among individuals. This leads to the problem of inferring meaningful information from interaction data alone. Usually, this problem is addressed by taking the network perspective, where nodes represent individuals and edges their interactions [11,26,23,6,5]. Network filtering [27] and backboning methods [33] can extract relevant connections from observed noisy interactions and find successful applications in biology [37,21] and economics [9]. Alternative methods use thresholding rules [39] or take a topic modeling perspective [35]. All these methods, though, can at most be applied to the study of unsigned relations. For the recovery of signed relations, we require novel approaches.
We introduce a statistical network method to infer weighted signed relations from a collection of unsigned, repeated interactions. We will refer to it as the Φ-method. It relies on the main assumption that a statistical over-representation of interactions signals a positive relation and an under-representation signals a negative relation [22]. This assumption is motivated by the longstanding theoretical argument that individuals with positive relations are more likely to interact [28,14] and by empirical evidence for it across different communities [15,25,36]. Moreover, the idea that a negative relation induces fewer interactions is supported by the argument that people avoid individuals who are considered a source of discomfort rather than pleasure [12,16].
To demonstrate our Φ-method, we utilize four classical interaction datasets of social communities. These are a karate club in a university [40] (KC), a windsurfer community [8] (WS), a high school in France [20] (HS) and participants in the Nethealth project [19] (NH). These social communities are chosen because they, in addition to interactions, contain information about social relations that can be used to validate our method.
With our method, we reconstruct the underlying relational networks of the four communities. The inferred signed relations allow us to study pairs and triads of individuals in a new light. We illustrate the strength of having access to the complete relational structure of communities, which we represent using a weighted signed network. To this end, we investigate the pairwise homophily, relational triads, and cohesiveness of groups in the communities. Note that we refer to social communities (KC, WS, HS, NH) rather than to those detected by community-detection algorithms.
Inference of signed networks. To infer the weighted signed networks $S_i$ for the four communities KC, HS, WS and NH (extended details provided in Methods), we first construct an interaction network $G_i$. An edge $e_{v \to w}$ in $G_i$ is created every time an interaction between individuals $v$ and $w$ is observed in the respective dataset. Furthermore, each dataset contains a small set of reported relations obtained by directly surveying a subset of the individuals. Such reported relations are either binary (i.e., positive or not positive) or continuous (i.e., how strong they are).
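As a minimal sketch of this construction step, the following Python snippet counts repeated directed interactions from a list of observed events and derives the out- and in-degrees of each individual. The event list, the function names, and the choice to ignore self-interactions are illustrative assumptions, not part of the original datasets or method.

```python
from collections import Counter

def interaction_counts(events):
    """Count directed interactions: one multi-edge per observed (v, w) event."""
    counts = Counter()
    for v, w in events:
        if v != w:  # assumption of this sketch: self-interactions are ignored
            counts[(v, w)] += 1
    return counts

def degrees(counts):
    """Out- and in-degrees, counting every repeated interaction."""
    k_out, k_in = Counter(), Counter()
    for (v, w), a in counts.items():
        k_out[v] += a
        k_in[w] += a
    return k_out, k_in

# Hypothetical toy events, not taken from any of the four datasets.
events = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]
A = interaction_counts(events)
k_out, k_in = degrees(A)
```

Keeping the counts in a dictionary keyed by ordered pairs preserves the direction of each interaction; for undirected data such as proximity events, one could store each event in both directions.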

Results
In Fig. 1, we visualize the interaction network only for HS. This network, $G_{HS}$, records interactions between students in a French high school divided into 9 classes. From $G_{HS}$ we infer the weighted signed network $S_{HS}$. In $S_{HS}$, we observe clusters of positive relations with weak negative ties between the clusters. This pattern matches the class separation within the high school. If we compare $S_{HS}$ to the declared friendships provided in the survey (Fig. 1 (right)), we see that most declared friendships are within classes and only a few across classes.
To obtain the weight and the sign $s_{v \to w}$ of the links in $S_{HS}$, we use the Φ-method. For each pair $(v, w)$ of individuals, the weight of the relation $s_{v \to w}$ is obtained as a linear combination of the probability that the two individuals interact more than expected and the probability that they interact less than expected (see Methods for details). The coefficients of this linear combination are estimated based on the few reported relations in the community. Once determined, they allow us to infer both positive and negative relations between all individuals, going beyond previous approaches [34].
Accurate prediction of reported relations. Using the Φ-method, we accurately predict the reported relations between individuals. To evaluate this accuracy, we perform both an in-sample and an out-of-sample prediction task where the dependent variable is the reported relation and the predictor is the value of $s_{v \to w}$. We detail the results of the prediction tasks in Table 1. For HS, NH, and KC, the reported signed relations are categorical (individuals being friends or not, or individuals feeling a strong, weak or no relation at all). Hence, we evaluate $S_i$ by means of standard classification methods and list the resulting sensitivity, specificity, and balanced accuracy (see Methods). All these scores are remarkably high, above 80%, for both the in-sample and the out-of-sample predictions. For WS, the reported signed relations are continuous. Thus, we model them with a linear regression. We evaluate the goodness of fit using the $R^2$ and the root-mean-squared error (RMSE). These continuous relations are harder to model, as they were obtained through a convoluted interview process. Our goodness of fit suffers from this, with an $R^2$ just above 0.3.
We find that the Φ-method is robust in handling unseen data. For the HS and NH datasets, we preserve a very similar accuracy between the in-sample and the out-of-sample prediction. The same holds for the $R^2$ in the WS dataset. The most considerable accuracy loss occurs in the case of the small KC dataset, where the specific train-test split has a significant impact. In the supplementary material, we further show that the Φ-method outperforms other approaches for predicting relations based on thresholding rules or network modularity.
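The three classification scores named above follow their standard definitions, which the short Python sketch below makes explicit for binary relations. The label vectors are hypothetical toy data, and the function assumes that both classes occur among the true labels.

```python
def classification_scores(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy for binary labels (1 = friend)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of reported friendships that are recovered
    specificity = tn / (tn + fp)  # share of non-friendships that are recovered
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# Hypothetical reported and predicted relations, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
sens, spec, bal = classification_scores(y_true, y_pred)
```

Balanced accuracy is a natural choice here because reported friendships are typically far rarer than non-friendships, so plain accuracy would be dominated by the majority class.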
Table 1: Quality of the model for in-sample and out-of-sample predictions. We report the sensitivity, specificity, and balanced accuracy for the binary HS, NH, and KC. For the continuous relations in WS, we report the $R^2$ and the root-mean-squared error (RMSE). Overall, the model quality is good for the binary relations and worse for the continuous ones. The model is robust, as the out-of-sample prediction loses only little compared to the in-sample prediction.
Homophily. Homophily is the phenomenon of similar individuals being more likely to form positive relations. In the inferred signed networks $S_{HS}$ and $S_{NH}$, we find strong gender homophily, i.e., the specific case in which similarity is defined by gender. To test the presence of this phenomenon, we compare two probabilities (in percentage): i) the probability that individuals with a positive relation also have the same gender and ii) the probability that randomly sampled pairs of individuals have the same gender. These are shown in Fig. 2 in the i) outer and ii) inner circles. We only have data about genders in the NH and HS datasets, so we restrict the analysis to these two datasets. We find that the probability that individuals with a positive relation are also of the same gender is larger than the reference probability of randomly sampled pairs being of the same gender (Fig. 2). Precisely, compared to the reference case, it is approximately 20% and 30% more likely that individuals with a positive relation have the same gender in the HS and NH dataset, respectively. By performing a binomial test, we verify that these results are statistically significant (see Methods for details).
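The comparison of the two probabilities, together with a binomial test, can be sketched in Python as follows. This is an illustrative reading rather than the procedure from the Methods: it assumes a one-sided test in which the number of same-gender pairs among positive relations is compared to a binomial distribution with the reference probability, and it enumerates all pairs instead of sampling them. All data and names are hypothetical.

```python
from itertools import combinations
from math import comb

def same_attribute_share(pairs, attribute):
    """Return (number of pairs sharing the attribute, total number of pairs)."""
    pairs = list(pairs)
    same = sum(1 for v, w in pairs if attribute[v] == attribute[w])
    return same, len(pairs)

def binomial_upper_tail(k, n, p):
    """P(K >= k) for K ~ Binomial(n, p): one-sided test for over-representation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical attributes and inferred positive relations.
gender = {"a": "F", "b": "F", "c": "M", "d": "M"}
positive_pairs = [("a", "b"), ("c", "d"), ("a", "c")]

k, n = same_attribute_share(positive_pairs, gender)
# Reference case: all possible pairs (an assumption replacing random sampling).
same_ref, n_ref = same_attribute_share(combinations(gender, 2), gender)
p_ref = same_ref / n_ref
p_value = binomial_upper_tail(k, n, p_ref)
```

In this toy example, two of the three positive pairs share a gender against a reference share of one third; with so few pairs the test is of course not significant, which is why the analysis relies on the larger HS and NH networks.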
Apart from gender, we find that religion and parental income homophily are of lesser importance to university students. This is shown in Fig. 2 by comparing 64.8 vs 49.0 for gender to 60.7 vs 55.5 for religion and 51.5 vs 45.9 for parental income. Only for the NH dataset do we have such additional information. The probability that friends have similar religious beliefs or parental income is only slightly larger than in the reference case, but the difference is nevertheless significant.
Beyond dyadic properties. Thanks to our analysis, we have attributed a signed relation to each pair of individuals. The datasets contain additional information about the membership of these individuals in different groups (e.g., classes, memberships). By looking at triads composed of three individuals, we can now characterize these groups. Considering only the sign of relations, four types of triads $T_\tau$ can appear: (+ + +) ($T_1$), (+ + −) ($T_2$), (+ − −) ($T_3$), (− − −) ($T_4$). For each triad $t = (v, w, z)$ of a given type $T_\tau$, we assign a weight $\omega_t$ by multiplying the weighted signs $s_{v \to w}$, $s_{w \to z}$, and $s_{z \to v}$ [32]. We define group cohesion by means of triads $T_1$ with three positive relations (+ + +). Group conflict, on the other hand, is defined by those triads $T_2$ that have one negative link (+ + −).

Through the weights of the triads, we can quantify the importance of each type of triad for groups (see Methods for details). We can distinguish formal groups (e.g., classes) from informal groups, for example the two groups in KC centered around the leaders JA and HI. Analyzing the networks of signed relations $S_{HS}$, $S_{KC}$ and $S_{WS}$, we find that cohesion strongly outweighs conflict only in HS, which contains formal groups. In contrast, informal groups emerging in WS and KC show weaker cohesion and a higher presence of conflict. Specifically, Table 2 shows that (+ + +) ($T_1$) triads have high importance within the groups of HS (0.98 and 0.96). In the informal groups of WS and KC, their importance drops to as low as 0.45. Moreover, in the JA group of KC, conflict has as much importance as cohesion. Across all analyzed communities, the importance of relational triads with many negative relations, (+ − −) ($T_3$) and (− − −) ($T_4$), is marginal.
Our analysis of KC further highlights leaders' influence on group formation. While, at the time of the data collection, KC consisted of a single community, it eventually split into two groups centered around two leaders, JA and HI [40]. Analyzing these two groups separately, we find that the triads involving their leaders are strongly cohesive: (+ + +) ($T_1$) triads involving HI and JA have an importance of 0.72 and 0.59, respectively (see Table 2 for details). However, when considering triads not involving the leaders, we only find cohesion in HI's group (0.63). JA's group instead is dominated by conflict (0.54). Hence, we have revealed that the presence of the influential leader is the major characteristic defining the group.
Table 2: (Top) Importance of triad types (+ + +) and (+ + −) for different communities. Each community features groups, and the importance of the triads is calculated within these groups. In all groups but the one of John A. (JA) in KC, the importance of cohesion outweighs conflict. (Bottom) Left are triads in KC involving the leaders of the groups (squared node), right triads not involving the leaders. Mr. Hi's group is always characterized by cohesion, while John A.'s shows mostly conflict when he is not present.

Discussion
Our work contributes to the study of human relations by unlocking data sources previously not usable for such investigations. To infer signed relations between individuals, we have employed data about face-to-face contacts (HS), SMS and phone calls (NH), proximity (WS) and co-attendance (KC). Traditionally, weighted signed relations are obtained with surveys, an expensive and poorly scalable approach. Interaction data, instead, is abundantly available. Despite the different types of data, we have shown that our methodology is well suited to extract signed relations. Therefore, social scientists, behavioral researchers, and psychologists can now use interaction data in new ways.
Our central assumption is that positive relations imply more interactions and negative relations fewer. This way of linking interactions to relations is a long-standing assumption in social science [14], which has been widely tested for positive relations [15,25,36]. In the case of negative relations, instead, it has rarely been explored, mainly due to a lack of data. The Φ-method fills this gap.
Our broader perspective allows quantifying social phenomena such as homophily, cohesion, and conflict within groups. For instance, we have confirmed that gender homophily is essential in establishing positive relations, such as friendship. Additionally, we have found that leaders can strongly influence the cohesion of a group. This result can be related to the theories of social status and structural balance, according to which individuals adapt their behavior in response to their surroundings [38,29,13,2].
Finally, the ability to infer signed relations from interaction data makes it possible to study how relations evolve over time. Social theories about structural balance, status, or social impact postulate different mechanisms for relational changes. We can now test these mechanisms by leveraging the fine-grained temporal resolution of interaction data. This opportunity paves the way for future research to explore the evolution of signed relations and their effect on communities with an unprecedented resolution.

Data
We require data about social communities containing both interactions and declared relations, gathered through surveys. While such data is, in general, scarcely available, we leverage four datasets fulfilling our requirements. They vary in size, number and type of interactions, and form of surveyed relations. We summarize this information in Table 3.
The data ranges from small communities of under 50 individuals to larger ones encompassing hundreds of people. In these datasets, an interaction $e_{v \to w}$ indicates proximity, co-location, or a communication event (phone call, SMS, or WhatsApp message) between two individuals $v$ and $w$. In the two datasets HS and NH, interactions were collected automatically. Thus, they feature the most interactions: up to roughly $2 \cdot 10^6$ for NH. In the other two datasets, instead, interactions were recorded manually by researchers. The surveyed relations $r_{vw}$ either indicate a quasi-continuous closeness, belonging to one of four factions, or a binary friendship, i.e., people being friends or not.

Windsurfer (WS). The study of the windsurfer community took place in California in the fall of 1986, with the authors being long-time members of this community [8]. The windsurfers were naturally dividing themselves into two groups, newcomers and older members, but there was no display of intergroup conflict. They were observed over 31 days, each day for two 30 min intervals. The interactions can loosely be defined as proximity events, people sitting together for lunch, or social exchanges. Looking at the interaction network (Fig. 3a) makes it clear that most interactions took place within the two informal groups. All community members were interviewed shortly after the conclusion of the observation period. They were asked to perform a sorting task to identify how close they were to each other. This closeness is rescaled to a number in (0, 1) and represents the relations in this dataset. Even though the authors describe a dataset of 54 surfers, only data about 43 of them was released.
Zachary's Karate Club (KC). This dataset contains interactions between 34 members of a university karate club over three years. The recorded interactions occurred not during the karate lessons but in different contexts. Like the windsurfer community, the karate club had two factions that "were never organisationally crystallized" and "[...] not named" [40]. However, the factions had two leaders: the club president (John A.) and the karate instructor (Mr. Hi). These factions arose due to a dispute between the leaders over an increase in the costs of lessons. At a certain point, the club split into two clubs, one led by John A. and the other by Mr. Hi. The club members mainly chose the leader they wanted to join according to the factions they were in before the split [40]. The interaction network (Fig. 3b) makes these factions visible before the split, while inter-faction contacts are still present. Before the split, club members were asked which faction they saw themselves in and whether that sentiment was strong or weak. These declarations form the relations in our analysis. The data also contains information about each member's final group after the split.
French Highschool (HS). As a third community, we consider a high school in France. The authors of [20] recorded face-to-face interactions between students from four programs organized into nine classes. This was done using RFID trackers, which only trigger when individuals are close and facing each other. The interactions were recorded while the students were at school over five days. Interactions are mainly concentrated within classes, which becomes apparent when considering the network visualization (Fig. 3c). Nevertheless, students interacted with alters from other classes, possibly during breaks. On top of the interactions, information was collected about positive social relations, i.e., friendship. Unfortunately, no information about negative relations was collected.
Nethealth Project (NH). Lastly, we studied the Nethealth Project, a long-lasting (2015-2019) study conducted by the Center for Network Science and Data at the University of Notre Dame [19]. It investigates the social networks and health of initially around 700 undergraduate students and comprises pairwise interaction data as well as responses to surveys administered in 8 waves over the study period. Interactions were recorded through communication events in the form of incoming and outgoing calls and messages from the participants' phones. We construct the interaction network (Fig. 3d) only including people who have at some point participated in the study and have given their consent to the use of their data. The sheer size of the interaction network does not allow us to extract much information from its visualization. However, we see that the degrees of the nodes vary greatly, from 0 to 89950. The data contains surveyed friendships, which constitute the relations we use in our work. As there were multiple 'waves' of surveys, in our analysis, we focus on one wave, namely the second one. This wave contains the most individuals, as subsequently there were some drop-outs. We then only consider interactions happening between the first and second surveys. Our results remain stable over the other waves.

Inferring signed relations
The Φ-method. The Φ-method relies on the central assumption that over- and under-representations of interactions signal positive and negative relations, respectively, a longstanding hypothesis in social sciences [14]. To quantify these over- and under-representations, we compare the observed interaction counts between individuals to a network null model, the hypergeometric ensemble of random graphs (HypE) [4]. By employing a network null model, we define an expectation for the number of interactions between individuals. This expectation should account for all factors that bias the observed number of interactions beyond the effect of signed relations [5]. In this work, we specifically account for the heterogeneity in the activities of the different individuals. That is, we account for the fact that a very active individual is more likely to interact with others irrespective of whether they share a positive or negative relation.
Similarly to a standard configuration model [7], HypE allows explicit modeling of such heterogeneous activities and enables the estimation of network- and dyadic-sampling probabilities through closed-form expressions [4]. It does so by modeling the network generation as a sampling process without replacement from a carefully designed urn. The urn is filled with a given number of balls, each representing a possible directed edge between two nodes $v$ and $w$. An edge $e_{v \to w}$ from $v$ to $w$ is considered to be in this set of possible edges if the nodes have non-zero out- and in-degrees $k^{out}_v$ and $k^{in}_w$, respectively. To account for the different levels of activity of different individuals, we specify the maximum number $\Xi_{vw}$ of possible edges between each pair of individuals to be proportional to the activity, i.e., the degree, of each individual in the network. To do so, we define a matrix $\boldsymbol{\Xi}$, whose entries $\Xi_{vw}$ are given by $k^{out}_v k^{in}_w$. It directly follows that $\sum_{vw} \Xi_{vw} = m^2$ is the total number of possible edges, and thus the number of balls in the urn, with $m$ the number of observed interactions. A network realization $\mathbf{X}$ with $m$ edges is given by sampling $m$ balls from this urn without replacement. This sampling procedure is akin to hypergeometric sampling, and the probability of finding the observed network configuration $\mathbf{A}$ is given by:
$$P(\mathbf{X} = \mathbf{A}) = \binom{m^2}{m}^{-1} \prod_{v,w} \binom{\Xi_{vw}}{A_{vw}}. \tag{1}$$
Equation (1) defines HypE, the network ensemble that we use to estimate the pair-wise over- and under-representation of interactions. This ensemble has the benefits of incorporating interdependencies between pairs of individuals, preserving individuals' activity and attractiveness, and being analytically tractable. For more details, we refer to [4]. While in this work we focus only on incorporating the activity of individuals into our null model, it is in principle possible to extend the null model to account for more complex factors, e.g., block or sub-group structures [3]. However, these extensions are beyond the scope of this article.
From Eq. (1), we extract the two marginal probabilities $P(X_{vw} < A_{vw})$ and $P(X_{vw} > A_{vw})$, where $A_{vw}$ is the observed number of interactions between $v$ and $w$ and $X_{vw}$ is a hypergeometric random variable:
$$P(X_{vw} < A_{vw}) = \sum_{x=0}^{A_{vw}-1} \binom{\Xi_{vw}}{x} \binom{m^2 - \Xi_{vw}}{m - x} \binom{m^2}{m}^{-1}, \tag{2}$$
$$P(X_{vw} > A_{vw}) = 1 - P(X_{vw} \le A_{vw}). \tag{3}$$
Intuitively, when the first probability is high, it is unlikely to find as many interactions as we observed, indicating an over-representation [5,17] and, therefore, a positive relation. The same reasoning holds for the second probability, indicating a negative relation. Extending the approach of [22], we construct the signed relations by taking the difference of these probabilities, weighted according to some constants, in what we call the Φ-method:
$$\phi_{vw}(a, b) = a \, P(X_{vw} < A_{vw}) + b \, P(X_{vw} > A_{vw}). \tag{4}$$
As shown in the following, we can learn the community-dependent constants $a$ and $b$ when we have access to data about the relations between a small number of individuals in the community. When this data is not available, we assume a symmetric influence of over- and under-representation, i.e., $a = -b = 1$.
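To make the procedure concrete, the following Python sketch computes the two marginal probabilities and the resulting Φ-score for every ordered pair, using only the standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names and toy counts are hypothetical, each marginal is treated as a univariate hypergeometric variable with $m$ draws from an urn of $m^2$ balls of which $\Xi_{vw}$ belong to the pair, and the symmetric default $a = -b = 1$ is used.

```python
from math import comb

def hypergeom_pmf(x, total, marked, draws):
    """P(X = x) when drawing `draws` balls without replacement from `total`
    balls, of which `marked` belong to the pair of interest."""
    return comb(marked, x) * comb(total - marked, draws - x) / comb(total, draws)

def phi_scores(A, a=1.0, b=-1.0):
    """Signed scores a * P(X < A_vw) + b * P(X > A_vw) for all ordered pairs."""
    k_out, k_in = {}, {}
    for (v, w), count in A.items():
        k_out[v] = k_out.get(v, 0) + count
        k_in[w] = k_in.get(w, 0) + count
    m = sum(A.values())   # number of observed interactions
    total = m * m         # number of balls in the urn
    phi = {}
    for v in k_out:
        for w in k_in:
            if v == w:
                continue  # assumption of this sketch: self-pairs are not scored
            xi = k_out[v] * k_in[w]  # possible edges for the pair (v, w)
            observed = A.get((v, w), 0)
            p_less = sum(hypergeom_pmf(x, total, xi, m) for x in range(observed))
            p_leq = p_less + hypergeom_pmf(observed, total, xi, m)
            p_more = max(0.0, 1.0 - p_leq)
            phi[(v, w)] = a * p_less + b * p_more
    return phi

# Hypothetical interaction counts: two tight pairs and one weak bridge.
A = {("a", "b"): 4, ("b", "a"): 4, ("c", "d"): 4,
     ("d", "c"): 4, ("a", "c"): 1, ("c", "a"): 1}
phi = phi_scores(A)
```

In this toy network the frequently interacting pair (a, b) receives a strongly positive score, while the pair (a, d), which never interacts although both individuals are active, receives a negative one. For datasets with millions of interactions, a numerically stable hypergeometric implementation would be preferable to exact binomial coefficients.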
Constructing the signed networks: training on data. Whenever we have access to data about interactions and relations between some individuals, we can train the Φ-method to find optimal parameters $\hat{a}$ and $\hat{b}$ to infer signed relations. By extrapolating the learned parameters to all pairs in the community, we compute Eq. (4) and construct full signed networks from only a few reported relations. We employ simple machine learning techniques to estimate the parameters in Eq. (4). Our aim is to classify the reported relation $r_{vw}$ based on the value of $\phi_{vw}(a, b)$:
$$r_{vw} \sim \phi_{vw}(a, b) + c. \tag{5}$$
Whenever we have binary relations, e.g., $r_{vw} \in$ {Friend, Not Friend}, we perform the classification in Eq. (5) by means of a logistic regression. In the case of continuous relations, e.g., when $r_{vw}$ refers to some 'closeness' $\in (0, 1)$, we use a linear regression. If multiple categories are possible, e.g., $r_{vw} \in$ {Friend, Positive Attitude, Neutral, Negative Attitude, Enemy}, multinomial or cumulative link methods [1] are employed, depending on whether the categories are ordered or not. The classification just described gives us estimates $\hat{a}$ and $\hat{b}$ for the parameters in Eq. (4), obtained for the subset of individuals for which reported relations $r_{vw}$ exist. With these, we can extrapolate our findings to the whole community, generating the signed network $S$, whose links are $s_{v \to w} = \phi_{vw}(\hat{a}, \hat{b})$.
In Table 4, we report the coefficients estimated for all datasets. These coefficients are community-dependent. However, $\hat{a}$ is always positive, and $\hat{b}$ is always negative. This finding demonstrates that having a high over-representation in interactions increases the probability of having a surveyed friendship. Similarly, having a high under-representation decreases this probability. Additionally, the only dataset with a large negative $\hat{b}$ is KC. This community is also the only one in which a known conflict arose. For the other communities, $\hat{b}$ tends to be small in absolute value, giving weakly negative relations.
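For continuous relations, the estimation step can be illustrated with an ordinary least-squares fit of $a$, $b$ and $c$, sketched below in plain Python. The data are synthetic and generated from known coefficients so that the fit can be checked; the function names, the normal-equation solver and the toy values are assumptions of this sketch. For binary relations, a logistic regression would take the place of the least-squares step.

```python
def solve_linear_system(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [rhs] for row, rhs in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_linear(p_less, p_more, relations):
    """Least-squares estimates (a, b, c) for r ~ a * P(X < A) + b * P(X > A) + c."""
    rows = [(pl, pm, 1.0) for pl, pm in zip(p_less, p_more)]
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * r for row, r in zip(rows, relations)) for i in range(3)]
    return solve_linear_system(XtX, Xty)

# Synthetic example: closeness generated from known, hypothetical coefficients.
p_less = [0.9, 0.7, 0.5, 0.2, 0.1]
p_more = [0.05, 0.1, 0.4, 0.3, 0.8]
closeness = [0.6 * pl - 0.2 * pm + 0.3 for pl, pm in zip(p_less, p_more)]
a_hat, b_hat, c_hat = fit_linear(p_less, p_more, closeness)
```

Because the synthetic closeness values are noise-free, the fit recovers the generating coefficients; with real survey data the estimates would carry uncertainty, and only the estimates of $a$ and $b$ would enter the signed network.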
The coefficient $c$ in Eq. (5) provides a baseline from which the value of $\phi_{vw}(a, b)$ can be related to the reported relations. Thus, we do not employ this value in constructing the signed network $S$.
Table 4: Estimated coefficients $\hat{a}$ and $\hat{b}$ for over- and under-representation for the four datasets studied. $|\hat{b}|$ is always smaller than $|\hat{a}|$ for all datasets, indicating the presence of weak negative links. Only for KC do we have a large negative coefficient. This is expected, as it is the only community in which a known conflict emerged.
Comparing Φ to other methods. In the following, we show that the Φ-method outperforms two other methods used to infer relations. The first one is a threshold method $M_T$. The user defines a threshold on the interactions above which individuals are assumed to be friends. Similarly, they are assumed to be enemies below this threshold. We assume one threshold for all pairs in the community, and this threshold can be learned from the known relations. Specifically, we use the interaction counts $A_{vw}$ as the predictor in the regression methods. This method disregards any heterogeneities in the individuals, their different levels of activity in the community, or their popularity.
We can partly alleviate this by factoring in the degrees of the individuals when defining their relations. By quantifying the expected number of interactions between two individuals based on their degrees, we reach a formulation akin to the one used in the well-known network modularity [24,18]. We call this model the modularity method $M_M$. Formally, it can be written as follows (for directed networks): [...] In the undirected case, total degrees are substituted, $k^{out}_v = k_v$ and $k^{in}_w = k_w$, and the right-hand side is divided by two. While the modularity method now partly accounts for heterogeneities, it disregards that the [...]
[...] The sum runs over all triads $t$ in the set $T_\tau$. The subscript $vw \in t$ signifies that the link between $v$ and $w$ is in the triad $t$. Note that we use the absolute value of the Φ-measure. Thus, we consider the weight of the relation when evaluating the importance of a given triad. This way, triads containing mainly weak links will contribute less to the importance.
To obtain a number comparable across communities, we normalize the importance $n_\tau$ of each triad type over the total importance of all triad types, i.e., we compute $n_\tau / N$, where $N = n_{(+++)} + n_{(++-)} + n_{(+--)} + n_{(---)}$. Such a normalization gives us the relative importance, which is the number we report for the different datasets in Table 2 in the main text.
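The triad statistics described above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it assumes symmetric relations stored per unordered pair, classifies each triad by its number of negative links, weights it by the product of the absolute link weights, and normalizes over all four triad types. All names and values are hypothetical.

```python
from itertools import combinations

TRIAD_TYPES = {0: "+++", 1: "++-", 2: "+--", 3: "---"}

def triad_importance(phi, members):
    """Relative importance of each triad type within a group.

    phi maps frozenset({v, w}) to a signed weight (symmetric relations are an
    assumption of this sketch); triads with a missing or zero link are skipped.
    """
    importance = {name: 0.0 for name in TRIAD_TYPES.values()}
    for v, w, z in combinations(members, 3):
        links = [phi.get(frozenset(pair)) for pair in ((v, w), (w, z), (z, v))]
        if any(s is None or s == 0 for s in links):
            continue
        negatives = sum(1 for s in links if s < 0)
        weight = abs(links[0] * links[1] * links[2])  # product of absolute weights
        importance[TRIAD_TYPES[negatives]] += weight
    total = sum(importance.values())
    if total == 0:
        return importance
    return {name: value / total for name, value in importance.items()}

# Hypothetical signed weights for a group of four individuals.
phi = {
    frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.8,
    frozenset(("a", "c")): 0.5, frozenset(("a", "d")): -0.2,
    frozenset(("b", "d")): -0.4, frozenset(("c", "d")): 0.5,
}
relative = triad_importance(phi, ["a", "b", "c", "d"])
```

In this toy group, the single strongly weighted (+ + +) triad accounts for more than half of the total importance, even though only one of the four triads is of that type, which shows how weighting by link strength differs from simply counting triads.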

Figure 1: (left) Interaction network $G_{HS}$ from the HS dataset. Nodes represent individuals and edges recorded interactions between them. Multiple interactions are shown by parallel edges. (center) Inferred signed network $S_{HS}$, shown only for a subset of individuals. Positive relations are represented by blue edges (darker colour refers to larger weight). (right) Network of declared friendship relations among individuals. We report a summary of the evaluation in a confusion matrix.

Figure 2: (Left) Gender homophily in HS and NH. (Right) Religion and income homophily in NH. The outer ring shows the probability (in percentage) that individuals with a positive relation also have the same gender, religion or parental income. The inner circle refers to the random sampling. While all three types of homophily are present, gender homophily is the strongest.

Figure 3: Interaction networks visualized for (a) WS, (b) KC, (c) HS and (d) NH. Link weights in the figures are proportional to interaction counts.

Table 3: Summary of the main features of the data.