The language of opinion change on social media under the lens of communicative action

Which messages are more effective at inducing a change of opinion in the listener? We approach this question within the frame of Habermas’ theory of communicative action, which posits that the illocutionary intent of the message (its pragmatic meaning) is the key. Thanks to recent advances in natural language processing, we are able to operationalize this theory by extracting the latent social dimensions of a message, namely archetypes of social intent of language, that come from social exchange theory. We identify key ingredients to opinion change by looking at more than 46k posts and more than 3.5M comments on Reddit’s r/ChangeMyView, a debate forum where people try to change each other’s opinion and explicitly mark opinion-changing comments with a special flag called delta. Comments that express no intent are about 77% less likely to change the mind of the recipient, compared to comments that convey at least one social dimension. Among the various social dimensions, the ones that are most likely to produce an opinion change are knowledge, similarity, and trust, which resonates with Habermas’ theory of communicative action. We also find other new important dimensions, such as appeals to power or empathetic expressions of support. Finally, in line with theories of constructive conflict, yet contrary to the popular characterization of conflict as the bane of modern social media, our findings show that voicing conflict in the context of a structured public debate can promote integration, especially when it is used to counter another conflictive stance. By leveraging recent advances in natural language processing, our work provides an empirical framework for Habermas’ theory, finds concrete examples of its effects in the wild, and suggests its possible extension with a more faceted understanding of intent interpreted as social dimensions of language.


Sociopolitical topic classification
In order to extract our data set of posts with a sociopolitical topic from r/ChangeMyView, we build a supervised classifier. The development and training of such a classifier is described in Section 5.1 in Materials & Methods. Here, we report further information on the quality of this classifier.
First, we build a test set with the same procedure we used for the training data of the classifier; that is, we aggregate posts from different subreddits after we categorize each subreddit as sociopolitical or not (see Section 5.1 for details). On this test set, we obtain an average F1 score of 89.5%. However, since the activity on Reddit varies significantly in the nine years we consider, we further investigate whether the performance of this classifier changes over time. Table SI1 reports the F1 score as measured on the posts from each year. We find that the quality in classification presents very limited variance over different years (i.e., within ±2.5 p.p. from the global F1). Table SI2 reports additional metrics for this classifier.
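The per-year check described above can be sketched as follows. This is only an illustration, not the authors' evaluation code: the record layout `(year, true_label, predicted_label)` and all function names are assumptions, and the F1 here is the positive-class F1 (the averaging scheme behind the reported "average F1" is not specified in this section).

```python
from collections import defaultdict

def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = sociopolitical, 0 = other)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_by_year(records):
    """records: iterable of (year, true_label, predicted_label) tuples (hypothetical layout)."""
    by_year = defaultdict(lambda: ([], []))
    for year, t, p in records:
        by_year[year][0].append(t)
        by_year[year][1].append(p)
    return {year: f1_score(ts, ps) for year, (ts, ps) in sorted(by_year.items())}

def max_deviation_pp(records):
    """Largest absolute gap, in percentage points, between a yearly F1 and the global F1."""
    records = list(records)
    global_f1 = f1_score([t for _, t, _ in records], [p for _, _, p in records])
    return max(abs(f - global_f1) for f in f1_by_year(records).values()) * 100
```

Under these assumptions, the statement that yearly scores stay within ±2.5 p.p. of the global F1 corresponds to `max_deviation_pp(records) <= 2.5`.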
Second, we build a validation set by selecting a random sample of r/ChangeMyView posts that we categorize as sociopolitical or not, according to the definition given by Moy and Gastil [16] (see Research Design section, page 2). Table SI3 reports a random excerpt of this validation data set, which provides concrete examples of the "sociopolitical" category. On this validation set, the classifier obtains an F1-score of 82% when considering the posts where text is present in the body of the post. Table SI2 also reports additional metrics on this data set.

Data
We apply the classifier to find all r/ChangeMyView posts with a sociopolitical topic. Here, we further characterize the r/ChangeMyView data set gathered this way. Figure SI1a shows the number of posts per month over the nine-year time span considered in our analysis. Unlike the number of posts, which fluctuates over time, the fraction of posts that are sociopolitical is rather stable, suggesting that the discussion is not dominated by any event in particular. Figure SI1b reports the distribution of the number of posts per author in this set.
Finally, in our analysis we focus on the distinction between comments that received a ∆ from the original author (which indicates that the author admits to having changed their view after reading the comment) and comments that did not receive one. Of the 3 690 687 comments, 38 165 were awarded a ∆. As an additional control group, we find 504 550 comments that did receive an answer from the original poster, and therefore received their attention, but did not obtain a ∆: this is further evidence that the original author did not consider such comments view-changing. Figure SI1c reports, for posts answered by the original author, the distribution of the number of comments per post in each of these two categories (with and without ∆); posts that have comments both with and without ∆ count towards both distributions.
Figure SI2 shows the probability distribution of comment length across dimensions. The typical length of messages may vary considerably across dimensions; for example, comments conveying status tend to be much shorter than knowledge-exchange comments.
Table SI1. Classification performance of our sociopolitical classifier on the test set of Reddit posts over the years. In italic, years for which there are no posts in r/ChangeMyView.
Figure SI3 shows the fraction of comments with dimension d among all the comments within a given length range. We consider five length classes, corresponding to the quintiles of the length distribution. The probability of finding a dimension in a comment increases linearly with the comment length. Figure SI4 shows the cross-correlations between the dimension scores for all dimension pairs (plus sentiment scores and comment length, measured in number of words). On the left, we report correlations computed on the original values s_d(m). On the right, we report correlations computed on the weight-discounted values d(m). The weight-discounting reduces the cross-correlations considerably.
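The length-class analysis behind Figure SI3 can be sketched as below. This is a minimal illustration under stated assumptions: each comment is a dictionary with hypothetical keys `n_words` and `dimensions`, and the quintile cut points come from Python's `statistics.quantiles` with its default method, which may differ from the binning used for the figure.

```python
import bisect
import statistics

def fraction_by_length_class(comments, dimension):
    """Fraction of comments conveying `dimension` in each of five length classes.

    comments: list of dicts with 'n_words' (int) and 'dimensions' (set of str);
    both keys are hypothetical names chosen for this sketch.
    """
    lengths = [c["n_words"] for c in comments]
    cuts = statistics.quantiles(lengths, n=5)  # four cut points -> five classes
    totals = [0] * 5
    hits = [0] * 5
    for c in comments:
        k = bisect.bisect_left(cuts, c["n_words"])  # class index in 0..4
        totals[k] += 1
        hits[k] += dimension in c["dimensions"]
    # None marks an empty class (possible when many comments share one length)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Plotting the returned list against the class index would give one curve of the kind shown in Figure SI3 for the chosen dimension.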
Table SI4 shows the results of a logistic regression that predicts whether a comment got a ∆ using (i) all the dimensions, including fun, and (ii) sentiment scores in addition to all the dimensions. The social dimension of fun is not significant; we therefore remove it from the other regression models. Table SI5 repeats the analysis from Table 2 but includes information about the length of the message. We do so by including as an independent variable the quantity Z(log l), where l is the length of the message and Z is a Z-score standardization. We apply a logarithmic scaling because the length of a message is broadly distributed. Note that this regression model is spurious, since there is an interdependence between the length of a message and the social dimensions it conveys: it usually takes more words to express some social dimensions than others. For example, expressing knowledge typically requires a relatively large number of words to articulate an argument, whereas status can be conveyed effectively with just a few words of admiration.
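As a small illustration of the length covariate Z(log l), the sketch below standardizes log-lengths across a set of messages. The choice of the natural logarithm and of the population standard deviation are assumptions; the text does not state which were used.

```python
import math
import statistics

def z_log_length(lengths):
    """Return Z(log l) for each message length l (lengths must be positive)."""
    logs = [math.log(l) for l in lengths]  # natural log: an assumption
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)          # population std: an assumption
    return [(x - mean) / std for x in logs]
```

After this transformation the covariate has zero mean and unit variance, so its odds ratio is expressed per standard deviation of log-length and is on a scale comparable to other standardized variables.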

Opinion change
Despite the interdependence between message length and social dimensions, the results in Table SI5 show that many of our findings are robust: in particular, all of the dimensions are highly significant (as in Table 2). Figure SI5 shows the odds ratios calculated considering only comments that got a reply from the OP. Figure SI5a shows the odds ratios of a social dimension being conveyed by comments with ∆ compared to comments with no ∆. Figure SI5b shows the odds ratios of a social dimension being conveyed by posts for which a ∆ was awarded, compared to posts whose authors did not give any ∆. Figure SI5c shows a matrix of interaction between the intent of the poster and that of the commenter. The results are qualitatively very similar to those obtained when considering all the comments, including those that got no reply from the OP.
Tables SI6, SI7, and SI8 report the results of experiments we conducted as robustness checks. Table SI6 reports the results of the regression model fit on a balanced data set obtained by undersampling negative examples. The results are very similar to those reported in Table 2 in the main manuscript. The predictive power is slightly improved, simply because we remove the added difficulty of class imbalance and the additional noise provided by the negative examples. Table SI7 reports the results of a regression model fit on a randomized data set that we obtained by randomly shuffling the social dimensions associated with each example, so that the association between opinion change and social dimensions is disrupted. In this experiment, no social dimension appears as significant, as one would expect.
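The two data manipulations behind Tables SI6 and SI7 can be sketched as follows. This is one possible reading, not the authors' code: examples are assumed to be `(dimension_vector, got_delta)` pairs, the balanced set keeps as many negatives as positives, and the randomization permutes whole dimension vectors across examples while leaving the ∆ labels in place.

```python
import random

def undersample_negatives(examples, seed=0):
    """Keep every positive example and an equally sized random subset of negatives."""
    rng = random.Random(seed)
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    return positives + rng.sample(negatives, min(len(positives), len(negatives)))

def shuffle_dimensions(examples, seed=0):
    """Permute dimension vectors across examples, keeping the labels fixed.

    This breaks any association between social dimensions and opinion change
    while preserving the marginal distribution of both.
    """
    rng = random.Random(seed)
    vectors = [v for v, _ in examples]
    rng.shuffle(vectors)
    return [(v, label) for v, (_, label) in zip(vectors, examples)]
```

Fitting the same regression on the output of `shuffle_dimensions` serves as a null model: any dimension that still appeared significant there would point to a problem in the pipeline rather than to a real effect.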
Last, Table SI8 reports the results of the regression that we obtain when considering a more conservative threshold of the sociopolitical classifier (i.e., moving the threshold from 0.50 to 0.75). Such a change does not affect any of the significance levels that we record in the main regression model, and most coefficients do not vary by more than 0.01 compared to those presented in Table 2 in the main manuscript. The variation in the pseudo-R² is also minimal. We can therefore conclude that, even if the accuracy of our sociopolitical classifier is not perfect, our results seem robust to minor classification errors.
Homophily and opinion change. Figure SI6a shows the likelihood that different dimensions are present in comments that are exchanged by two users with different profiles. Figure SI6a (left) shows the odds ratios of a dimension being present in a comment when the commenter and the author of the commented post participated in some political subreddits, but do not have one in common. Figure SI6a (right) shows the odds ratios of a dimension being present in a comment when the commenter and the poster belong to different ideologies (left-wing vs. right-wing). Comments flowing between users with different profiles are more likely to convey power and conflict and less likely to contain support or status. Knowledge-rich discussion also happens between people who have different stances. Figure SI6b shows the result of a logistic regression model that includes as independent variables the social dimensions of the comment, the political side of the commenter, and the political side of the poster. The dependent variable is, as in Table 2, whether the comment is marked with a ∆ or not (that is, whether it is labelled as opinion-changing). We aggregate the resulting coefficients in order to obtain odds ratios for all of the nine possible configurations of these variables. Then, we normalize such odds ratios with respect to the interaction between two individuals with no detected political side. From these results, we can observe (1) heterophily: individuals are more likely to change their opinion when confronted by an individual of the opposite political side; and (2) asymmetry between left-wing and right-wing individuals: a left-wing individual is more likely to receive a ∆ from a right-wing one than vice versa.
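For readers less familiar with the quantities plotted in Figure SI6, the sketch below shows a plain odds ratio computed from counts and the normalization to a baseline configuration. It is a generic illustration: the comparison group in `odds_ratio`, the dictionary keys, and the function names are assumptions, and the aggregation of regression coefficients used for Figure SI6b is not reproduced here.

```python
def odds_ratio(with_dim_a, without_dim_a, with_dim_b, without_dim_b):
    """Odds of a dimension being present in group A divided by the odds in group B.

    Counts are numbers of comments with / without the dimension in each group.
    """
    odds_a = with_dim_a / without_dim_a
    odds_b = with_dim_b / without_dim_b
    return odds_a / odds_b

def normalize_to_baseline(odds_ratios, baseline):
    """Divide every odds ratio by the one of the baseline configuration.

    odds_ratios: dict mapping (commenter_side, poster_side) -> odds ratio;
    the key layout is hypothetical.
    """
    reference = odds_ratios[baseline]
    return {config: value / reference for config, value in odds_ratios.items()}
```

After normalization the baseline configuration (two users with no detected political side) has value 1, so values above 1 indicate configurations in which a ∆ is relatively more likely.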

Table SI4. Odds ratios obtained by a logistic regression that considers all social dimensions (including the non-significant fun), with and without traditional sentiment scores. We indicate with asterisks the statistically significant correlations (with one, two, or three asterisks corresponding to p < 0.05, p < 0.01 and p < 0.001 respectively). P-values are corrected according to the Benjamini-Hochberg procedure, to reduce the chance of spurious correlation emerging because of the high number of factors we consider.
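The multiple-testing correction mentioned in the caption can be illustrated with a short sketch of the standard Benjamini-Hochberg adjustment, followed by the asterisk convention used in the tables. Function names are illustrative, and this is not taken from the authors' code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjustment
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def stars(p):
    """Asterisks for p < 0.05, p < 0.01 and p < 0.001, as in the table captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

In this sketch the asterisks would be assigned from the adjusted values, e.g. `[stars(p) for p in benjamini_hochberg(raw_p_values)]`.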
Table SI5. Odds ratios obtained by logistic regression when also considering the length of the comment. Each column corresponds to a model with a specific set of variables. A description of each variable is given in Table 3. We indicate with asterisks the statistically significant correlations (with one, two, or three asterisks corresponding to p < 0.05, p < 0.01 and p < 0.001 respectively). P-values are corrected according to the Benjamini-Hochberg procedure, to reduce the chance of spurious correlation emerging because of the high number of factors we consider.
Figure SI5. Odds ratios of containing a dimension in opinion-changing messages versus the others in the case of (a) comments and (b) posts. Here, we considered only comments with an answer from the original poster. On the right (c), we report only the statistically significant odds ratios (p < 0.01) for interactions between dimensions in comments and posts. Results are qualitatively similar to Figure 1.
Figure SI6. Analysis of political alignment, in relation to (a) social dimensions and (b) opinion change. Panel (b) reports the odds ratio of presence of a ∆ in comments between people with a given alignment combination, as estimated by a logistic regression model including these factors and the social dimensions. We normalize the odds ratio with respect to the interaction between two individuals with no detected political side.