A computational reward learning account of social media engagement

Social media has become a modern arena for human life, with billions of daily users worldwide. The intense popularity of social media is often attributed to a psychological need for social rewards (likes), portraying the online world as a Skinner Box for the modern human. Yet despite such portrayals, empirical evidence for social media engagement as reward-based behavior remains scant. Here, we apply a computational approach to directly test whether reward learning mechanisms contribute to social media behavior. We analyze over one million posts from over 4000 individuals on multiple social media platforms, using computational models based on reinforcement learning theory. Our results consistently show that human behavior on social media conforms qualitatively and quantitatively to the principles of reward learning. Specifically, social media users spaced their posts to maximize the average rate of accrued social rewards, in a manner subject to both the effort cost of posting and the opportunity cost of inaction. Results further reveal meaningful individual difference profiles in social reward learning on social media. Finally, an online experiment (n = 176), mimicking key aspects of social media, verifies that social rewards causally influence behavior as posited by our computational account. Together, these findings support a reward learning account of social media engagement and offer new insights into this emergent mode of modern human behavior.


Supplementary Methods
individual-level averages (Ind.). The latter were generated by computing the mean (M) and the median (Md) for each individual, and then summarized with the mean-of-means and the median-of-medians of the individual-level averages.

Additional information about Study 1
Inclusion in the original Instagram dataset (collected by ref. 1) was based on participation in at least one of Instagram's weekly photography contests. Contest participation was denoted by the addition of a hashtag with the prefix "#whp-" to an Instagram post. All media uploaded by a random selection of 2,100 users with at least one "#whp-" hashtag (including media that were not tagged with "#whp-" hashtags) were gathered and their information retrieved and stored.
Study 1 was based on a subset (with at least 10 posts, n = 2,039) of these users.
To quantify contest participation in our dataset, we compared the number of posts with a "#whp-" hashtag to the total number of posts, and found that "#whp-" hashtags comprised 2.3% of the total number of posts. The number of "#whp-" hashtags per user ranged from 1 to 275, with a median of 19 (comprising 0.0005% to 71% of posts). We show below that the number of contest participations did not predict social media behavior ("No evidence for association between Instagram photo contest participation and social media behavior").

Additional information about Study 2
For Study 2, we obtained public data from three topic-focused social media forums (Men's fashion: styleforum.net, Women's fashion: forum.purseblog.com, Gardening: garden.org), where users could give each other likes as feedback on posts. These forums are organized in "threads" (which users can start) focused on a specific topic or question. Because our focus was how likes affected posting behavior, we focused for simplicity on threads with a high proportion of image posts rather than textual exchange (where many factors other than likes are likely to affect behavior). For this purpose, we selected three high-profile threads. In all three datasets, we removed all posts that did not include user-generated images (in Supplementary Note 5 below, we confirm that results are qualitatively identical when including text-based posts) or that quoted other posts, and ordered the posts sequentially for each individual user. To adjust for potential differences between threads in average posting latency, the statistical analyses included either fixed (Men's and Women's fashion forums) or random (Gardening forum) effects for thread. For simplicity and consistency with Study 1, the model-based analysis did not distinguish between threads. The datasets were anonymized (i.e., no information about post content or usernames was retained) and contained only the time stamps and likes associated with each post.

Computational modeling
Simulation of key empirical regularities in operant conditioning research. The RL model is based on the normative theoretical framework developed by Niv and colleagues 2 to explain free operant behavior in animals. Because their theory focused on optimal equilibrium (rather than learning) behavior, designing the RL model to fit the dynamics of social media data required multiple adaptations (e.g., novel updating equations and gradient computations, a simplified parametrization, a different policy definition). Therefore, we verified that the RL model can accurately reproduce the classic, qualitative behavioral patterns of animals trained in Skinner boxes. This is important for establishing the theoretical validity of the RL model as a tool for identifying reward learning on social media.
We conducted three sets of simulations, each aimed at reproducing a standard empirical regularity from the operant conditioning literature. Each simulation ended after the model had emitted 200,000 responses and was repeated five times (corresponding to five "artificial rats"). As is typical in operant conditioning research 3, we analyzed well-learned, "steady state" behavior (the last 100,000 responses). Importantly, the RL model was not altered in any way relative to our analysis of the social media data.
First, we used our RL model to simulate behavior on classic variable interval (where the first response after a pre-specified, random time interval is rewarded) and variable ratio (where each response has a pre-specified probability of being rewarded) schedules of reinforcement 3.
These reinforcement schedules are the cornerstones of free operant behavior research. The typical pattern of results is higher response rates on both (i) interval schedules with shorter, relative to longer, interval durations, and (ii) ratio schedules with lower, relative to higher, ratio requirements. The RL model reproduces both patterns of results (Supplementary Figure 1A-B).
Furthermore, as expected by theory 2, the model's estimate of the average reward rate, R̄, was consistently positively related to the response rate (i.e., negatively related to the average response latency). Second, we verified that the RL model reproduces the key difference between ratio and interval schedules: the response rate on ratio schedules is higher than on interval schedules with a matched (yoked) reward rate 3. The theoretical explanation for this result is that shorter inter-response intervals (i.e., τPost) increase the probability of reward on ratio, but not interval, schedules. We first simulated responding on variable ratio schedules, and then used the exact durations between reward deliveries to set the interval durations for the yoked simulation. Again, the RL model accurately reproduced this pattern (Supplementary Figure 1C). Finally, we simulated responding on Differential-Reinforcement-of-Low-rates (DRL) schedules 4. In DRL schedules, the animal has to wait a fixed minimum duration (given by the schedule) since its last response to receive reward. Any premature response resets the schedule, meaning that the animal needs to estimate the interval elapsed since its last response and time its next response accordingly (as well as suppress the natural tendency to start responding before the usual time of reward). We find that our RL model learns to time the schedule duration (Supplementary Figure 1D). Similar to data from DRL experiments, we find that the standard deviation of the response latencies (i.e., of τPost) is positively related to the length of the target interval (cf. the scalar property of timing 5). Together, these simulations demonstrate that the RL model accurately reproduces key result patterns from the operant conditioning literature, supporting the theoretical validity of the RL model as an account of reward learning on social media.

Supplementary Figure 1. The RL model reproduces key empirical regularities in operant conditioning research. (A) The simulated response rate (i.e., 1/mean(Response Latency)) is higher on schedules with lower ratio requirements. The ratio requirement refers to the mean number of responses required to receive reward (each response was rewarded with probability P = 1/ratio requirement). Each point is the mean of N = 5 independent simulations; error bars are 1 SE. (B) The simulated response rate is higher on schedules with shorter average interval durations (each individual interval was drawn from an exponential distribution with mean equal to the interval duration). A new interval is set after each incurred reward. Each point is the average of N = 5 independent simulations. (C) The response rate is higher on random ratio schedules than on interval schedules with a matched (yoked) reward rate. Each data point is N = 1 simulation run. (D) The model learns the waiting time (DRL duration) required to receive reward. The mean Response Latency was consistently above the DRL duration. Each point is the average of N = 5 independent simulations. The model parameters were set to α = 0.001, P = 1, C = 0.01 for all simulations.
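To make these simulations concrete, the sketch below implements a toy average-reward learner with an exponential latency policy on variable ratio and variable interval schedules. It is a minimal illustration in the spirit of the RL model, not the model itself (the exact updating equations are given in the main-text Methods); the effort-cost form, step sizes, clipping, and schedule values are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(schedule, n_responses=20_000, alpha=0.001, eta=0.01, C=0.01):
    """Toy average-reward learner: the policy is the mean of an exponential
    latency distribution, nudged by a REINFORCE-style gradient step."""
    theta = 0.0                  # log of the policy mean latency
    rbar = 0.1                   # running estimate of the average reward rate
    state = None                 # schedule-specific bookkeeping
    taus = []
    for _ in range(n_responses):
        m = np.exp(theta)
        tau = rng.exponential(m)           # sampled response latency (tau_Post)
        r, state = schedule(tau, state)
        # net utility: reward, minus an effort cost assumed to decay with tau,
        # minus the opportunity cost of waiting tau at the average rate rbar
        util = r - C * np.exp(-tau) - rbar * tau
        theta = np.clip(theta + eta * util * (tau / m - 1.0), -5.0, 5.0)
        rbar += alpha * (r / max(tau, 1e-6) - rbar)
        taus.append(tau)
    return 1.0 / np.mean(taus[n_responses // 2:])   # steady-state response rate

def variable_ratio(p):
    """Each response is rewarded with probability p = 1 / ratio requirement."""
    return lambda tau, state: (float(rng.random() < p), state)

def variable_interval(mean_interval):
    """The first response after an exponentially distributed interval is rewarded."""
    def sched(tau, state):
        elapsed, wait = state if state else (0.0, rng.exponential(mean_interval))
        elapsed += tau
        if elapsed >= wait:
            return 1.0, None               # rewarded; a new interval starts next
        return 0.0, (elapsed, wait)
    return sched

for req in (5, 20, 80):
    print(f"VR {req:>2}: rate = {simulate(variable_ratio(1 / req)):.3f}")
for iv in (5, 20, 80):
    print(f"VI {iv:>2}: rate = {simulate(variable_interval(iv)):.3f}")
```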
Model estimation. Computational model estimation was conducted at the individual level using maximum likelihood techniques, with an exponential likelihood function. To avoid local minima in parameter fitting, optimization was initiated with a minimum of np × 10 randomly selected start values, where np is the number of free parameters. The Akaike Information Criterion (AIC), which penalizes model complexity, was used for model comparison. Relative model fit was assessed with individual-level AIC weights (AICW), which can be interpreted as the probability that a given model, within the candidate set, best accounts for the data 6. Bayesian model comparison was conducted with the VBA package 7, using the AIC as an approximation to model evidence.
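As an illustration of the AICW computation, the following minimal sketch converts a set of AIC values (made-up numbers) into Akaike weights 6.

```python
import numpy as np

def aic_weights(aics):
    """Akaike weights: the relative probability that each model in the
    candidate set best accounts for the data (ref. 6)."""
    delta = np.asarray(aics, dtype=float) - np.min(aics)  # AIC differences
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical per-individual AICs for, e.g., RL vs. No Learning:
print(aic_weights([1012.3, 1030.8]))   # ~[0.9999, 0.0001]
```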
All variables were initialized at 0, except R̄. Within a dataset, the initial value of R̄ was set to the median number of likes received for the first post, divided by the median latency between the first and the second post (i.e., R̄_0 = mdn(R_t=0) / mdn(τPost_t=1)). Note that we removed the first post for each user from the analysis of the empirical data, because τPost is undefined for the first post. Although this formulation uses data from individual i, which is also used for model estimation, the contribution from any given individual is negligible. The initial value of R̄ was not crucial for the model fit: preliminary analyses showed that setting it to 0 or to the median number of likes at t = 1 produced similar results (although model fit is better if R̄ is initialized to a positive value), while estimating the initial value as a free parameter did not reliably improve model fit.

We examined the temporal distribution of likes (for the datasets where like timestamps were available) and found that the number of likes followed a heavily skewed, long-tailed distribution over time. In other words, most likes were provided in close temporal proximity to the post. Testing this quantitatively, we found that the number of likes provided within the first hour of a post was strongly predictive of the total number of likes the post would receive (Spearman's rho = .88, p < .00001), indicating that our analytical simplification (treating all likes for a post as delivered at once) did not reduce realism.

In the simulations, the expected number of likes had initial value i and changed across simulation time points with slope s. Both i and s were estimated from the empirical data, for the respective dataset, using mixed models with a Poisson link function. In these simulations, the expected value of R (the social reward) was independent of τPost. In other words, the reward for any given post was independent of the model policy.
In an alternative set of simulations, we explored the consequences of this assumption by making R dependent on τPost. In this setup, R was maximized at a specific value M of τPost (e.g., 1 day; M varied randomly across simulations), and decreased exponentially around M as a function of the absolute difference between M and τPost. We found that the RL model can, with sufficient time, adjust its policy to approximate M, and that the expected difference between high and low R is similar to that of the main simulation (results available from the corresponding author; cf. Figure 1E). This shows that the predictions of the RL model do not depend on the assumptions about R.
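For concreteness, this alternative reward assumption can be written as a one-line function; r_max and the decay scale below are hypothetical values used only for illustration.

```python
import numpy as np

def expected_likes(tau_post, M, r_max=10.0, scale=1.0):
    """Expected reward peaks when tau_post equals M (e.g., 1 day) and
    decreases exponentially with the absolute difference |tau_post - M|."""
    return r_max * np.exp(-np.abs(tau_post - M) / scale)
```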

Supplementary Figure 4. Additional example individuals in Study 1 (Instagram), supplementing Figure 2C.

Experimental manipulation of social reward rates
Methods. We invited participants (with a minimum 95% approval rate) on Amazon Mechanical Turk to take part in a study on "humor on social media". 179 participants completed the study and were paid $3 in compensation. The study was approved by the ethical review board of the University of Amsterdam, The Netherlands. All participants provided informed consent.
Participants were instructed that they would take part in a study of humor on social media, in which they would interact with 19 other online participants ("users"). To resemble the typical structure of social media, the experiment involved a (simulated) "feed", where the participant could observe images shared by other users and "post" (share with the other users) their own images (Supplementary Figure 5). More specifically, the participant could "post" a type of humorous image known as a "meme" (see Supplementary Figure 5A). The purpose of this was to create a sense of self-expression, while also preventing participants from generating unethical or nonsensical content. Next, the participant was asked (but not required) to provide three "informative and descriptive" nouns or adjectives (termed "tags" to correspond to social media terminology) for the image they selected (e.g., "funny"). The purpose of the tags was to associate an effort cost with posting, as is typical on real social media platforms. Finally, the average social reward rate was manipulated within participants (low/high order counterbalanced, with random assignment to order condition) in order to test the influence of the social reward rate on posting response latencies.
After the experiment, the participants were asked to report how many followers they had on Instagram, Twitter, and Facebook, and how many likes they received on average for a post on the social media platform they typically used. In addition, we administered the nine-item Social Media Disorder Scale 8.

Supplementary Figure 5. To post a meme, participants pressed the spacebar and then selected one from a random selection of 6 meme images. Participants were then asked to provide up to 3 "tags" (short keywords) to describe the selected meme. This was included to provide an analogue of the effort cost of posting on social media. Next, participants received feedback (likes, represented as filled hearts) on the posted meme. The average number of likes (4.5 vs. 14.5 per post) was manipulated within participants (in counterbalanced order across participants) to test the influence of social rewards on posting response latencies. The number of likes received for the preceding post was displayed in the simulated "feed". Numbers ("#1") are placeholders for meme images due to copyright reasons.

Statistical analysis and data exclusions.
We quantified response latencies as the interval from the first opportunity to post (indicated to the participant by the visual prompt "Press Spacebar to post a Meme") until the participant pressed the spacebar (see Supplementary Figure 5). To analyze the response latencies, we used multilevel GLMMs specifying a gamma distribution with a log link function (using the glmmTMB package 9), as often recommended for response time data 10.
In the main analysis, data were excluded from three participants who spontaneously reported that they did not believe the likes were generated by real participants.
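The analyses were run with glmmTMB in R; as a rough single-level analogue, the sketch below fits a gamma GLM with a log link in Python (statsmodels) on placeholder data, omitting the by-participant random effects that glmmTMB estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
# Placeholder data: one row per posting response
df = pd.DataFrame({
    "latency": rng.gamma(shape=2.0, scale=3.0, size=200),  # seconds
    "high_reward": np.tile([0, 1], 100),                   # reward condition
})
X = sm.add_constant(df[["high_reward"]])
# Gamma distribution with a log link, as in the glmmTMB analysis
fit = sm.GLM(df["latency"], X,
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(fit.params)
```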

Supplementary Note 2
We report in the main text that Granger causality analyses of the empirical data showed that accrued likes "Granger cause" τPost. The lag number, which is the key analysis parameter specified for Granger causality analysis, was optimized based on analyses of simulated data.
Specifically, we simulated generative models in which the ground truth of causality (RL model) and of no causality (No Learning model and No Learning with Drift [NLD] model), respectively, was known, and applied Granger causality analysis to the simulated data in both cases.
These models represent different possible hypotheses about the mechanisms that generated the observed data. In the RL model, τPost reflects a model policy (the mean of an exponential distribution) that is dynamically updated to maximize accrued rewards (see "Methods" in the main text). In the No Learning model, τPost reflects a fixed response "policy" or tendency (the mean of an exponential distribution) that is unrelated to, and thereby unaffected by, received reward. The No Learning model is used for model comparison in the main text. In the NLD model, the response tendency (the mean of an exponential distribution) drifted following a Gaussian random walk with mean 0 and standard deviation σ (the stochastic nature of this formulation prevents accurate model estimation).
Before applying it to empirical data, we tuned the lag number in the Granger causality analysis so that it correctly identified Granger causality in simulated data generated by the RL model, and rejected Granger causality in data generated by the two models without learning. To this end, we repeatedly simulated each of the models 1,000 times (to approximate the size of the empirical datasets) with random parameter values drawn from the unit range, and applied Granger causality analysis methods for panel data 11.
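The sketch below illustrates the logic of this tuning step on single simulated series; the toy generative models are simplified stand-ins for the RL and No Learning models, and the per-series test shown here is not the panel-data method of ref. 11.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(3)

def simulate_user(learning, n=200, eps=0.2):
    """Toy generative model: under `learning`, more likes than expected
    shorten the next posting latency; otherwise latencies are unaffected."""
    taus, likes, m = [], [], 5.0
    for _ in range(n):
        r = rng.poisson(4.0)
        if learning:
            m = max(0.5, m - eps * (r - 4.0))   # more likes -> faster posting
        taus.append(rng.exponential(m))
        likes.append(r)
    return np.column_stack([taus, likes])

# Test whether likes (column 2) Granger-cause tau_Post (column 1)
for label, learn in [("learning", True), ("no learning", False)]:
    res = grangercausalitytests(simulate_user(learn), maxlag=1, verbose=False)
    print(label, "lag-1 p =", round(res[1][0]["ssr_ftest"][1], 4))
```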

Supplementary Note 4
In the basic RL model, we assumed that the utility of likes followed an identity function (i.e., U(R) = R). To test for diminishing marginal utility, we also estimated a variant of the model with a non-linear (concave) utility function. We found no evidence for diminishing marginal utility in Study 2, as the basic RL model fit best in all three datasets (see Supplementary Table 2). This is likely due to the, on average, much lower number of likes per post in Study 2 than in Study 1 (see Supplementary Table 1).
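One standard way to implement such a comparison is an identity utility versus a concave (power) utility; the exponent below is illustrative, and the paper's exact non-linear utility function may differ.

```python
import numpy as np

def utility_identity(likes):
    return likes                        # basic RL model: U(R) = R

def utility_concave(likes, rho=0.7):
    """Diminishing marginal utility: each additional like adds less value
    (power utility with an illustrative exponent rho < 1)."""
    return np.power(likes, rho)
```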

Supplementary Note 5
In our main analysis of Study 2, we only included posts with user-generated images, in order to minimize the influence of factors other than likes (e.g., textual exchange) on posting behavior. The results were qualitatively identical when text-based posts were also included (see "Additional information about Study 2" in the Supplementary Methods).

Supplementary Note 6
To assess the specificity of the RL model, we conducted additional model comparisons that varied its key features. Finally, we compared the RL model to a model inspired by foraging theory.
Effect of time-dependent terms. The RL model explicitly incorporates time-dependent effort and opportunity cost terms (see "Description of model" in the main text) that scale with τPost.
To ascertain that these terms, which were based on established theory for free-operant tasks 2, contributed to the explanatory power of the RL model, we compared it to two alternative models that did not include τPost-dependent terms. Both alternative models utilized the same policy gradient and likelihood function as the original model, but differed in how the prediction errors and R̄ were computed (model RL2). In model RL3, the effort cost parameter C was additionally removed, which simplifies Supplementary Equation 4 further. We compared the RL model to RL2 and RL3 using AICW. The RL model provided the best explanation of the data in all four datasets (all exceedance probabilities = 1; see Supplementary Tables).

Effect of effort cost formulation. In the RL model, the effort cost is an exponentially decreasing function of τPost. This formulation is based on established RL theory for free operant tasks 2. However, it is possible that the effort cost on social media takes different forms. We evaluated two alternative effort cost formulations: (i) exponentially increasing with time, and (ii) fixed. We found no evidence that either a fixed effort cost or a cost that increased with post latency improved model fit (Supplementary Table 4).
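The three candidate effort-cost formulations can be summarized as follows; the cost magnitude C and timescale k are schematic stand-ins for the model's actual parametrization (main text).

```python
import numpy as np

def cost_decreasing(tau, C=0.01, k=1.0):
    return C * np.exp(-tau / k)   # RL model: effort cost shrinks with latency

def cost_increasing(tau, C=0.01, k=1.0):
    return C * np.exp(tau / k)    # alternative (i): cost grows with latency

def cost_fixed(tau, C=0.01):
    return C                      # alternative (ii): latency-independent cost
```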

Effect of instrumental policy.
A key component of the RL model is that it allows instrumental learning of the response policy (eq. 2-4, main text). In other words, the model can learn that slower/faster post latencies result in more reward, reflecting the hypothesis that social media users strategically adjust their rate of engagement to maximize social rewards. Alternatively, one can envision a "Pavlovian" policy, where responses are faster following positive prediction errors ("approach") and slower following negative prediction errors ("avoidance").
We implemented this "Pavlovian Policy" model by changing the policy update equation (equation 4, main text) so that the policy is directly updated with the (cube root of) the value prediction error: τPost_t+1 = τPost_t − ∛δ_t. We take the cube root to reduce the influence of especially large prediction errors on the policy, which preliminary analysis showed was detrimental to model fit.
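A schematic, side-by-side sketch of the two update rules (step sizes are illustrative, and the instrumental gradient abbreviates eq. 2-4 of the main text):

```python
import numpy as np

def instrumental_update(tau_mean, tau, delta, eta=0.01):
    """RL model (schematic): gradient step on the exponential-policy mean,
    so latencies that yielded positive prediction errors become more likely."""
    return tau_mean + eta * delta * (tau / tau_mean - 1.0)

def pavlovian_update(tau_mean, delta, eta=1.0):
    """Pavlovian Policy model: respond faster after positive prediction
    errors and slower after negative ones; the cube root damps large deltas."""
    return tau_mean - eta * np.cbrt(delta)
```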
Model comparison showed that the RL model explained the data best in all four datasets (t-tests against equal weights, largest p-value = .009; see Supplementary Tables).

Comparison with a model based on foraging theory. Foraging theory provides a general framework for how organisms should maximize reward in decision-making situations that extend over time 12. Although foraging theory is more concerned with deriving optimal decision-making rules than with the precise mechanisms that produce behavior, some studies have successfully compared computational models based on foraging theory and RL 13. Following this approach, we developed a stylized model inspired by the principles of foraging theory, in order to assess the specificity of the RL model as an explanation of reward maximization on social media. A core principle of foraging theory is that organisms should maximize their net rate of intake by foraging in a given patch until the current reward rate falls below the average of the environment (the marginal value theorem 14). Because the social media environment involves many unobservables that would be required for a direct application of foraging theory and the marginal value theorem (e.g., travel time, distinct patches, extended foraging bouts 15), our F-model is by necessity relatively abstract. We first describe the model, and then outline its relationship to foraging theory.
The core component of the F-model is the decision to forage (i.e., post) when the expected reward meets or exceeds a threshold T (i.e., E(R)_t ≥ T), where T is a free parameter (0 ≤ T ≤ ∞) that determines the threshold. τPost follows an exponential distribution whose mean is the time point t* at which the threshold is met. E(R)_t is a linear function of the time since the last post (tLast) and the average reward rate, weighted by a free parameter P (0 ≤ P ≤ ∞). Intuitively, E(R)_t increases linearly with the time since the last post, with a slope given by the weighted average reward rate: E(R)_t = P · R̄ · tLast. Practically, we solve for t* by numerically searching for the root (i.e., the zero) of the function E(R)_t − T. The average per-unit-time reward rate was calculated as a recency-weighted mean with updating parameter α (0 ≤ α ≤ 1), similar to the RL model: R̄_t+1 = R̄_t + α(R_t/τPost_t − R̄_t).

The F-model is built on the assumption that the expected value of foraging (i.e., posting) goes to 0 directly after a post and increases with time since the last post. In other words, likes are assumed to have a refractory time, and this refractory time depends on the average reward rate. The assumption that foraging reduces available reward is standard in foraging theory 16. Like the RL model, the F-model predicts that posting should be more frequent when the reward rate is high, as this maximizes the per-unit-time incurred reward.
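A minimal sketch of the F-model response rule, using the linear E(R)_t above; the root is found numerically, although this linear case has the closed form t* = T/(P · R̄).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)

def f_model_tau(rbar, T=10.0, P=1.0):
    """F-model sketch: post around the time t* at which the expected reward
    E(R)_t = P * rbar * t first reaches the threshold T."""
    t_star = brentq(lambda t: P * rbar * t - T, 1e-9, 1e9)
    return rng.exponential(t_star)      # tau_Post ~ Exponential(mean = t*)

def update_rbar(rbar, reward, tau, alpha=0.1):
    """Recency-weighted mean of the per-unit-time reward rate."""
    return rbar + alpha * (reward / tau - rbar)
```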
The threshold parameter T can be interpreted in two ways that follow from foraging theory. First, it can be seen as an estimate of the overall average reward in the environment (which we cannot directly observe). Under this interpretation, the F-model decision rule (Supplementary Equation 9) is equivalent to the marginal value theorem: the forager should leave the "patch" (a given social media platform, e.g., Instagram) if the expected value of foraging is lower than the average environmental reward, or conversely, forage in the patch if the expected value is higher than the average environmental value. A second interpretation is based on foraging theory for "sit and wait" predators that forage in one patch (rather than select between patches), and whose prey disperse after a foraging attempt (e.g., a school of small fish) 17-19. For such predators, the optimal response time is equal to the refractory, or return, time of the prey 17-19. Under this interpretation, T reflects the forager's estimated return time, or the time at which the available reward (i.e., likes) returns to baseline.
We estimated the F-model and found that, while it provided a better explanation of the data than the No Learning model, the RL model fit better in all four datasets (see Supplementary Tables).

Supplementary Note 7
As in Study 1, the effect of R̄ on τPost was larger for individuals for whom the RL model provided a better fit.

Supplementary Note 8
To verify the robustness of the statistical results presented in the main text, we conducted additional analyses. The analyses of R̄ reported in the main text are based on a dichotomization of the rank-transformed and scaled (for each individual) R̄ variable. Here, we additionally report analyses with R̄ as a continuous term (either rank-transformed and standardized, or only standardized) (Supplementary Table 7). We also report the same set of results (including the analyses of the discrete High vs. Low R̄ reported in the main text) using an alternative regression modeling approach based on cluster-robust standard errors (Supplementary Table 7). This methodology is more conservative than mixed models and makes fewer assumptions; it is therefore popular in econometrics and adjacent fields 20. We found that the different model formulations consistently showed that a higher estimated R̄ was predictive of shorter response latencies.
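As an illustration of the cluster-robust approach, the sketch below fits an OLS regression with standard errors clustered by user on placeholder data; the real analysis mirrors the mixed-model specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "log_latency": rng.normal(2.0, 1.0, 300),
    "rbar_high": rng.integers(0, 2, 300),     # High vs. Low R-bar indicator
    "user": np.repeat(np.arange(30), 10),
})
# OLS with standard errors clustered by user, instead of user random effects
fit = smf.ols("log_latency ~ rbar_high", df).fit(
    cov_type="cluster", cov_kwds={"groups": df["user"]})
print(fit.summary().tables[1])
```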

Supplementary Note 9
Previous research has shown that social comparison plays an important role in determining how many likes are required for a social media post to be experienced as successful 21, and that receiving fewer likes than close others generates negative affect 22. Therefore, we asked whether social comparison also contributes to reward learning on social media (see "Alternative social comparison models" below). Models that also included downward social comparison (advantageous inequality, or pride/gloating 25) provided an inferior account of the data, a pattern that further adheres to known dynamics of social comparison 20 (see Supplementary Table 8). Together, these exploratory results suggest that social comparison may contribute to reward learning dynamics on social media.
Alternative social comparison models. The RL model implements upward social comparison (or disadvantageous inequality). We additionally tested two social comparison models that also included downward social comparison (advantageous inequality).
The first alternative social comparison model (ASC1) allowed upward and downward social comparison to be governed by separate free parameters. The second alternative social comparison model (ASC2) simplified ASC1 by allowing both forms of social comparison to be determined by a single free parameter. We compared the RL model to ASC1 and ASC2 using AICW. The RL model provided the best explanation of the data in all three datasets of Study 2 (see Supplementary Table 8).

Supplementary Note 10
Robustness analysis. Multiple quantitative criteria indicated that four clusters provided the best k-means cluster solution for the whole dataset (see main text). We assessed the robustness of this conclusion in two complementary ways.
First, we randomly split the dataset into two equally sized partitions, and assessed the optimal number of clusters in each partition. We repeated this process ten times, and found that four clusters provided the best cluster solution in each of the 20 randomly determined partitions, which indicates that the four cluster solution reported in the main text was not determined by outliers or the exact sample composition.
Second, we randomly shuffled the rows of the dataset, independently for each column (i.e., for the 3 model parameters). This efficiently removed the pair-wise correlation between the model parameters (from |r| = .15-.23 to .01), and should therefore eliminate the four "computational phenotypes" we identified (as these reflect different parameter value profiles).
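A sketch of this column-wise shuffling follows, with a silhouette-based choice of k standing in for the multiple quantitative criteria used in the actual analysis; the data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)

def shuffle_columns(X):
    """Permute each column independently: marginal distributions are kept,
    but the between-parameter correlations are destroyed."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def best_k(X, ks=range(2, 7)):
    """Choose the number of clusters by silhouette score."""
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
              random_state=0).fit_predict(X)) for k in ks}
    return max(scores, key=scores.get)

X = rng.normal(size=(500, 3))   # placeholder for the 3 fitted RL parameters
print("original k:", best_k(X), "| shuffled k:", best_k(shuffle_columns(X)))
```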
Indeed, we found that the shuffled dataset was best fit by a 3-cluster solution, where one cluster comprised 78% of all individuals (in contrast, the largest cluster in the original data comprised only 41% of the sample). This analysis indicates that the four-cluster solution reported in the main text was not a structural necessity (in which case the shuffled data should have shown the same structure and the same cluster solution), but was unique to the four computational phenotype profiles. Together, these two analyses demonstrate both the stability and the specificity of the four social reward learning phenotypes.

Supplementary Note 11

Participant-specific reward condition. Due to the random reward distribution and the relatively few responses per participant, the actual difference between high and low reward could differ markedly between individuals, which might lead to imprecise estimates. To account for this possibility, we repeated the analysis with the by-participant median likes per reward condition as predictor. We found that using this semi-continuous predictor gave somewhat more precise results (β = -0.062, SE = 0.023, z = -2.72, p = 0.007) than the categorical reward condition predictor used in the main analyses, which corroborates the conclusion that changes in the social reward rate drive changes in response latency.

In the main text, we report the direct effect of High vs. Low R̄ on response latencies (i.e., without the categorical reward condition predictor) for participants with at least 5 responses. By adding both the model-derived and the experiment-based predictors to the model, we next evaluated their shared explanatory value. We found that both the model-derived and experiment-based estimates were reduced in magnitude (from 0.284 to 0.274, and from 0.109 to 0.076, respectively), and that only the model-derived regressor remained conventionally significant (model-derived: z = 6.0, p < .0001; experimental: z = 1.69, p = .09). This indicates that the regressors, as expected, partially explained the same variance, but that the individual-specific model-derived regressor better predicts response latencies. Finally, we repeated the analysis for participants with at least 10 responses (as in our analysis of Studies 1-2) and found comparable results (n = 97, β = 0.31, SE = 0.052, z = 6.18, p < .0001).

Supplementary Note 12
Because the Instagram dataset used in Study 1 was based on participation in a photography contest on Instagram (see "Additional information about Study 1" in the Supplementary Methods for details), we tested whether the number of such contest participations, as indexed by the number of posts with "#whp-" hashtags, was associated with social media behavior (as measured by our RL model). We predicted the (log) number of "#whp-" tags from the estimated parameters of the RL model, together with the total number of posts (which is naturally the most important predictor), using linear regression. We found no evidence that the number of "#whp-" tags was associated with the RL parameters (neither with the basic RL model nor with the RL model augmented with a non-linear utility function): the lowest p-value for any estimated parameter was ~0.25. Together, these results indicate that contest participation did not have any clear influence on posting behavior, and thus our results are likely to generalize beyond this context (as further suggested by the results of Study 2).