Beliefs about Others’ Abilities Alter Learning from Observation

Learning what is dangerous by observing others can be safer and more efficient than individual learning. The efficiency of observational learning depends on how observational information is used, which we propose depends on our beliefs about others. Here, we investigated how the described and actual abilities of another individual (a demonstrator) influenced performance and psychophysiology during learning of an observational avoidance task. Participants were divided into two groups. In each group there were two demonstrators who were described as either high (Described-High group) or low (Described-Low group) in their ability to learn the task. In both groups, one demonstrator had a high ability (Actual-High) and the other had a low ability (Actual-Low) to learn. Participants performed worse in the Described-Low compared to the Described-High group. Pupil dilation, and behavioral data in combination with reinforcement learning modeling, suggested that the described ability influenced performance by affecting the level of attention towards the observational information. Skin conductance responses and pupil dilation provided us with a separate measure of learning in addition to choice behavior.


Stimuli
Before each phase, simple monochrome figures/avatars were used to index whose turn it was to make a choice: the participant's or that of one of the two demonstrators. The participants initially selected which of 9 possible avatars would represent them during the task, and the demonstrators' avatars were randomly selected from those remaining. The experiment was shown on a 24-inch screen with a resolution of 1600x1200 pixels. Participants watched the screen from a distance of approximately 60 cm.
The pictures used as choice stimuli consisted of 40 randomly generated geometric figures of equal luminance. For each participant, 22 pictures were randomly drawn and randomized into 11 pairs in which one stimulus was randomly selected to be the optimal while the other was the suboptimal. Of these pairs, one was used in the training block and the rest in the remaining 10 blocks.
To minimize alterations in pupil dilation due to changes in the presentation of visual stimuli on the screen during the experiment, auditory stimuli were used to signal when it was time for the demonstrator or participant to make a choice, and to indicate that the demonstrator received a shock. These auditory stimuli are referred to as the "go-sound" and the "shock-sound". The "go-sound" consisted of a 1 s long D6 sine tone, while the "shock-sound" was a 232 ms long non-aversive beep oscillating around C6. The sounds were played to the participants through headphones.

Demonstrators' behavior
Demonstrators' actual ability was manipulated by allowing a simple RL algorithm to control the behavior of the Actual-High demonstrators, while the Actual-Low demonstrators always chose randomly. The RL algorithm controlling the behavior of the Actual-High demonstrators was based on the Q-learning algorithm (1), where choices were assigned Q-values representing the expected outcome of making the choice. At the start of each block, Q-values were set to 0. Choices were made based on the softmax function, which calculates the probability of making either choice, here denoted A and B:
$$P(A) = \frac{e^{\beta Q_A}}{e^{\beta Q_A} + e^{\beta Q_B}}, \qquad P(B) = 1 - P(A)$$
where the inverse temperature $\beta$ was set to 0.4. The outcome of a choice, $r$, was used to calculate a prediction error, $\delta$, the difference between the actual and the expected outcome (the outcome of a shock was set to -10, and the outcome of an omitted shock was set to 10). The prediction error was used to update the Q-value of the chosen option:
$$Q_{\text{chosen}} \leftarrow Q_{\text{chosen}} + \alpha\,\delta, \qquad \delta = r - Q_{\text{chosen}}$$
where the learning rate $\alpha$ was set to 0.3. See SI Fig. 1 for a visualization of the demonstrators' performance across the experiment.
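As an illustration of the demonstrator algorithm described above, the following is a minimal Python sketch rather than the code used in the experiment. The inverse temperature (0.4), learning rate (0.3), outcome values (-10 and 10) and zero initial Q-values are taken from the text; the function names, the number of trials per block and the shock probabilities passed in `p_shock` are assumptions made for illustration only.

```python
import math
import random

BETA = 0.4                      # inverse temperature (from the text)
ALPHA = 0.3                     # learning rate (from the text)
SHOCK, NO_SHOCK = -10.0, 10.0   # outcome values (from the text)


def softmax_prob_a(q_a, q_b, beta=BETA):
    """Probability of choosing A under a two-option softmax.

    Algebraically equal to exp(beta*q_a) / (exp(beta*q_a) + exp(beta*q_b)).
    """
    return 1.0 / (1.0 + math.exp(-beta * (q_a - q_b)))


def simulate_block(n_trials, p_shock, rng, learner=True):
    """Simulate one block and return the list of choices ('A' or 'B').

    p_shock maps each choice to an ASSUMED shock probability.
    learner=True  -> Q-learning demonstrator (Actual-High).
    learner=False -> random demonstrator (Actual-Low); Q-values are
                     still updated but never used for choosing.
    """
    q = {"A": 0.0, "B": 0.0}    # Q-values are reset at the start of a block
    choices = []
    for _ in range(n_trials):
        p_a = softmax_prob_a(q["A"], q["B"]) if learner else 0.5
        choice = "A" if rng.random() < p_a else "B"
        outcome = SHOCK if rng.random() < p_shock[choice] else NO_SHOCK
        delta = outcome - q[choice]      # prediction error
        q[choice] += ALPHA * delta       # Q-value update
        choices.append(choice)
    return choices
```

For example, `simulate_block(10, {"A": 0.2, "B": 0.8}, random.Random(0))` simulates one ten-trial block in which A is the (hypothetically) safer option.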

How the value of observational information depends on demonstrator behavior
To investigate how the ability of the demonstrator might affect the value of observational information, we simulated participants undergoing the experiment and learning through observation of a demonstrator that either learned well or made random choices. The simulated participants learned the task by pairing the observed choices with observed outcomes, in addition to making choices themselves and learning from them. Simulating ten thousand blocks showed a small advantage of learning from a demonstrator with low ability (the probability of making the optimal choice increased by approximately 0.029), see SI Fig. 2.
SI Fig. 1. Demonstrators' performance over trials across the experiment. Actual-High demonstrators learned quickly while Actual-Low demonstrators performed around chance throughout.
SI Fig. 2. Mean performance per trial of ten thousand simulated blocks of observational learning from a demonstrator with either a high or a low ability to learn, combined with individual learning.
Results show a small advantage of learning from a demonstrator with actual low ability.

Instructions
Participants were instructed that they were to observe the behavior of previous participants in a similar experiment. They were further told that the behavior of these previous participants had been observed and evaluated in order to divide them into two groups of equal sizes based on their level of ability, either high (Described-High) or low (Described-Low). The information regarding how the evaluation had been done was very vague, and it was not stated that the evaluation necessarily relied on any objective measure of performance (e.g., number of shocks received).
Participants were instructed that the probabilities of receiving a shock based on choice were the same for both them and the previous participant, although making the same choice as the demonstrator in a trial where the demonstrator did not receive a shock would not necessarily result in avoidance of shock for themselves.

Questionnaires
Following completion of the experiment, participants were asked to fill out a set of questionnaires to assess personality traits that could possibly influence the efficiency of observational learning. Anxiety was measured both as state and trait using the State-Trait Anxiety Inventory (STAI). Empathic concern, fantasy and perspective taking were measured using the Interpersonal Reactivity Index (IRI).
Social conformity was measured using a subset of the questions in the Swedish universities Scales of Personality (SSP).

Eye-tracking
Blinks were identified as instances when the diagonal pupil width was 0 for 50-400 ms, and pupil widths (diagonal and horizontal) during blinks ±60 ms were linearly interpolated. Pupil data were filtered using a 10 Hz low-pass filter. Pupil diameter was calculated as the average pupil width based on the diagonal and horizontal measures. Pupil dilation responses were calculated as the relative change in diameter compared to the mean pupil size during the 0.5 s presentation of the fixation cross following the stimulus presentation but before the "go-phase".
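The blink handling described above could be sketched as follows in Python. This is an illustrative sketch only: the 50-400 ms blink criterion and the ±60 ms padding come from the text, whereas the function names, the sampling interval `sample_ms` and the treatment of blinks at the edges of a recording are assumptions.

```python
def find_blinks(width, sample_ms, min_ms=50, max_ms=400):
    """Return (start, end) index pairs (end exclusive) of runs of zero
    pupil width whose duration lies between min_ms and max_ms."""
    blinks, i, n = [], 0, len(width)
    while i < n:
        if width[i] == 0:
            j = i
            while j < n and width[j] == 0:
                j += 1
            if min_ms <= (j - i) * sample_ms <= max_ms:
                blinks.append((i, j))
            i = j
        else:
            i += 1
    return blinks


def interpolate_blinks(width, sample_ms, pad_ms=60):
    """Linearly interpolate each blink plus pad_ms on either side.

    ASSUMPTION: anchor samples are clamped to the ends of the recording,
    so a blink touching an edge is interpolated from whatever value is there.
    """
    out = list(width)
    pad = int(round(pad_ms / sample_ms))
    n = len(out)
    for start, end in find_blinks(width, sample_ms):
        lo = max(start - pad - 1, 0)   # last sample kept before the padded blink
        hi = min(end + pad, n - 1)     # first sample kept after the padded blink
        for k in range(lo + 1, hi):
            frac = (k - lo) / (hi - lo)
            out[k] = out[lo] + frac * (out[hi] - out[lo])
    return out
```

The same interpolation would be applied separately to the diagonal and horizontal width signals, using the blink intervals detected in the diagonal signal.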

Skin conductance - preparation and scoring
Skin conductance data was analyzed using the AcqKnowledge software (BIOPAC Systems). Raw data was filtered using a 1 Hz low pass filter and a 0.01 Hz high pass filter. Responses were scored as the first peak-to-peak response starting within a time-window of 0.5-4.5 s following the onset of each event of interest (observed choice, observed/heard outcome). Responses were normalized within subject.

Behavioral analyses
Performance was analyzed using a Logistic Generalized Mixed Model (LGMM), which modeled the effect of described ability and actual ability as well as trial (centered) on choice, categorized according to optimality (optimal/suboptimal). We used by-participant random intercepts and random slopes of actual ability within participant. The reported main and interaction effects were evaluated with "Type II" analysis of deviance tests based on the Wald statistic.
The effect of the described ability on the absolute deviation between the estimated and the actually delivered number of shocks during a block was analyzed using a Linear Mixed Model (LMM) with a by-participant random intercept. We further analyzed the effect of the absolute deviation per block on mean performance per block using an LMM with a by-participant random intercept and random slopes of the absolute deviation within participant. The reported main and interaction effects were evaluated with "Type II" analysis of deviance tests based on the Wald statistic.

Pupil dilation
We analyzed the pupil dilation responses using growth curve analysis, GCA (2). For the proactive pupil dilation responses, we analyzed orthogonal linear and quadratic changes in pupil dilation over time (first- and second-order polynomials of time). For the reactive responses, we also included cubic changes in pupil dilation over time (third-order polynomial) in our analysis. Participant and participant-by-condition random effects were used on all time terms. For the proactive responses, we analyzed fixed effects of described ability and actual ability on all time terms, while for the reactive responses we analyzed fixed effects of the observational prediction errors on all time terms. The effect of the observational choice prediction error, see equation [8], was analyzed for all trials, and the effect of the observational prediction error, see equation [4], was analyzed for trials where the demonstrator received a shock (since no-shock trials did not include any time-locked event signaling the omission of shock).
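The orthogonal time terms used in growth curve analysis can be illustrated with a short Python sketch. The source does not state how the polynomials were constructed, so the Gram-Schmidt procedure, the function name and the use of equally spaced time points below are assumptions chosen for illustration.

```python
def orthogonal_poly(n_points, degree):
    """Return `degree` orthonormal polynomial time terms over n_points
    equally spaced samples (linear first, then quadratic, and so on).

    Built by Gram-Schmidt orthogonalisation of t, t^2, ... against the
    constant term and all lower-order terms.
    """
    t = [float(i) for i in range(n_points)]
    basis = [[1.0] * n_points]              # constant term, dropped on return
    for d in range(1, degree + 1):
        v = [x ** d for x in t]
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        basis.append([vi / norm for vi in v])
    return basis[1:]
```

Because the resulting terms are mutually uncorrelated and uncorrelated with the intercept, the estimate for the linear term does not change when quadratic or cubic terms are added to the model, which is the usual motivation for using orthogonal rather than raw polynomials.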

Gaze patterns
We analyzed gaze direction during stimulus presentation, for both the observational stage and the individual stage, using two areas of interest (AOIs) defined as the Optimal and Suboptimal stimuli. The AOIs consisted of two areas (193 x 192 pixels) centered around the two stimuli. Gaze direction was analyzed using logistic GCA with the AOI Optimal as the target. We included orthogonal linear, quadratic and cubic changes over time (first-, second- and third-order polynomials). We first analyzed the fixed effects of the time terms only, to investigate the probability of directing gaze towards the optimal choice over time. Second, to investigate whether this probability would be affected by a trial-by-trial measure of the certainty that the optimal choice was the optimal one, we analyzed the fixed effects of a certainty parameter (derived from our winning RL model) on all time terms for the separate stages (observational/individual). The certainty parameter was defined as the difference between the expected values of the optimal and the suboptimal choices, $\text{certainty} = Q_{\text{optimal}} - Q_{\text{suboptimal}}$. Thus, expecting a high value from the optimal choice and a low value from the suboptimal choice would lead to high certainty (see SI Model description for additional information regarding the expected values). All models used by-participant random intercepts and random slopes of all main effects within participant (a full random structure was not possible in the models that included certainty as a predictor, due to convergence problems).

Skin conductance responses
We analyzed the effect of the observational choice prediction error, see equation [8], and the observational prediction error, see equation [4], on skin conductance responses, SCRs, using LMMs.
The models included fixed effects of the normalized observational prediction error of interest and of trial number. The models included by-participant random intercepts and random slopes for all variables within participant, i.e. a full random structure (3).

Model description
We formulated several reinforcement learning models that differed in their use of observational learning parameters. Here we describe the general structure of the models. All models are based on [...]. All models included the individual learning rate as a free parameter. Based on this setup, we constructed a set of models that varied in how the observational learning rate, the imitation parameter and the inverse temperature parameter were defined:
• The observational learning rate was either i) fixed to 0, ii) set equal to the individual learning rate, iii) one free parameter, or iv) two free parameters (one for the Actual-High and one for the Actual-Low demonstrator), to allow for differences in learning rates depending on the actual ability of the demonstrator.
• The imitation parameter was either i) fixed to 0, ii) one free parameter, or iii) two free parameters (one per demonstrator), to allow for differences in copying depending on the actual ability of the demonstrator.
• The inverse temperature was either i) one free parameter or ii) two free parameters (one per demonstrator), to allow for differences in choice selection depending on the actual ability of the demonstrator.
This allowed us to test 4 x 3 x 2 = 24 different models, see SI Table 1.
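The size of the model space can be checked by enumerating the combinations, as in the Python sketch below. The option labels, dictionary keys and function name are hypothetical, and the label `equal_to_individual` reflects an assumed reading of the second option for the observational learning rate. The count of free parameters adds one for the individual learning rate, which is free in every model.

```python
from itertools import product

# Hypothetical labels for the options listed in the model description.
OBS_LEARNING_RATE = ["fixed_0", "equal_to_individual", "one_free", "two_free"]
IMITATION = ["fixed_0", "one_free", "two_free"]
INV_TEMPERATURE = ["one_free", "two_free"]

# Number of free parameters contributed by each option.
N_FREE = {"fixed_0": 0, "equal_to_individual": 0, "one_free": 1, "two_free": 2}


def model_space():
    """Enumerate every combination of parameter definitions."""
    models = []
    for obs, imi, temp in product(OBS_LEARNING_RATE, IMITATION, INV_TEMPERATURE):
        # +1 for the individual learning rate, which is always free.
        k = 1 + N_FREE[obs] + N_FREE[imi] + N_FREE[temp]
        models.append({"obs_lr": obs, "imitation": imi,
                       "inv_temp": temp, "n_free": k})
    return models
```

Under these assumptions, the simplest model has two free parameters (individual learning rate and one inverse temperature) and the most complex has seven.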

Model fitting & model comparison
The free parameters for each model were fitted individually for each participant over all trials by minimizing the negative log-likelihood, −ln(L), of each model. This was done in R (4) using the mle2 function from the bbmle package, employing the optim optimization function and the BFGS optimization method. To avoid local minima, each set of parameters was fitted 10 times with randomized initial parameters, after which the best-fitting parameters were chosen. The learning rate parameters and imitation rate parameters were constrained to the interval [0,1], while the inverse temperature parameters were constrained to the interval [0,5].
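The multi-start logic can be illustrated with the following Python sketch. It is not the authors' procedure: the fitting in the paper was done in R with mle2 and BFGS, whereas this sketch swaps in a simple random local search so that it runs with the standard library alone. It also fits only a plain two-option Q-learner with one learning rate and one inverse temperature, not any of the observational models. The ten restarts and the parameter bounds come from the text; all names and the search settings are assumptions.

```python
import math
import random

BOUNDS = [(0.0, 1.0), (0.0, 5.0)]   # learning rate, inverse temperature


def neg_log_likelihood(params, choices, outcomes):
    """-ln(L) of a two-option Q-learner; choices are 0 or 1."""
    alpha, beta = params
    q = [0.0, 0.0]
    nll = 0.0
    for c, r in zip(choices, outcomes):
        diff = beta * (q[c] - q[1 - c])
        # -ln softmax probability of the chosen option, ln(1 + exp(-diff)),
        # written in a numerically stable form.
        nll += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
        q[c] += alpha * (r - q[c])
    return nll


def fit_multistart(choices, outcomes, n_starts=10, n_steps=200, seed=0):
    """Random local search from n_starts random initial points;
    returns the best (params, nll) found within BOUNDS."""
    rng = random.Random(seed)
    best_params, best_nll = None, float("inf")
    for _ in range(n_starts):
        params = [rng.uniform(lo, hi) for lo, hi in BOUNDS]
        nll = neg_log_likelihood(params, choices, outcomes)
        step = 0.25
        for _ in range(n_steps):
            # Propose a perturbed point, clipped to the bounds.
            cand = [min(hi, max(lo, p + rng.gauss(0.0, step) * (hi - lo)))
                    for p, (lo, hi) in zip(params, BOUNDS)]
            cand_nll = neg_log_likelihood(cand, choices, outcomes)
            if cand_nll < nll:
                params, nll = cand, cand_nll
            else:
                step = max(step * 0.97, 1e-3)   # shrink step after a rejection
        if nll < best_nll:
            best_params, best_nll = params, nll
    return best_params, best_nll
```

Keeping the best result across restarts is what guards against local minima; the choice of local optimizer inside each restart is interchangeable.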
Model comparisons were carried out using the AIC weights, $w\mathrm{AIC}_i$, for each model and participant (5). AIC weights are calculated from the AIC values, which measure the goodness of fit of a model while also taking into account its complexity:
$$\mathrm{AIC}_i = 2k_i + 2\left(-\ln L_i\right)$$
where $k$ is the number of fitted parameters and $-\ln(L)$ is the negative log-likelihood. The AIC weight of model $i$ is then
$$w\mathrm{AIC}_i = \frac{\exp\left(-\tfrac{1}{2}\Delta_i\right)}{\sum_{m=1}^{M}\exp\left(-\tfrac{1}{2}\Delta_m\right)}, \qquad \Delta_i = \mathrm{AIC}_i - \min_{m}\mathrm{AIC}_m$$
where $M$ denotes the number of models compared. AIC weights provide a measure of the weight of evidence for each model in a given set of candidate models. We compared models by looking at the mean $w\mathrm{AIC}_i$, the mean rank order and the number of wins per model across participants and separated by group (Described-High/Described-Low), see SI Table 1.
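The AIC weight computation can be sketched in a few lines of Python. This assumes the standard definitions, AIC = 2k + 2(−ln L) and weights proportional to exp(−Δ/2), where Δ is each model's AIC minus the smallest AIC in the set; the function names are illustrative.

```python
import math


def aic(nll, k):
    """AIC from a negative log-likelihood and a number of free parameters."""
    return 2.0 * k + 2.0 * nll


def aic_weights(aics):
    """AIC weights for a set of candidate models (one participant)."""
    best = min(aics)
    # Relative likelihood of each model given the data.
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]
```

For instance, two models whose AIC values differ by 2 receive weights of roughly 0.73 and 0.27, so a difference of that size is only weak evidence for the better model.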

Behavioral results
The significant interaction between described ability and actual ability was followed by pairwise comparisons adjusted for multiple comparisons using Bonferroni correction. This revealed a significant difference between the Described-High/Actual-Low and Described-Low/Actual-Low conditions only (β=0.71, SE=0.26, Z=2.71, p=0.041).
Table 3. Parameter estimates of the proactive pupil dilation response in the 1 s time-window preceding the presentation of the outcome of the demonstrator's choice. Described-High is coded as 0, Described-Low is coded as 1. Actual-High is coded as 0, Actual-Low is coded as 1.

Reactive Pupil dilation responses
The reactive pupil dilation responses when participants observed the demonstrator's choices were greater when the model-derived observational choice prediction error was large, i.e. when the choice was more surprising, see SI Fig. 3.
Table 5. Parameter estimates of the reactive pupil dilation response in the 2 s time-window following the presentation of the demonstrator's choice.

Gaze patterns
Previous studies on gaze direction during viewing of emotional pictures have shown that humans initially direct visual attention towards both pleasant and unpleasant/threatening (6), as compared to neutral (7), stimuli. Studies on gaze direction during decision making have shown that gaze direction predicts the preferred choice (8), and more so the closer in time to the decision (9). To the best of our knowledge, no one has studied gaze direction during own or a demonstrator's choice decision in an avoidance task. Analyses of gaze patterns during stimulus presentation using logistic GCA with [...] for details), which improved our models significantly (p<0.001, during both the demonstrator and individual phase). As predicted, participants who were more certain that the optimal compared to the suboptimal choice would lead to avoidance of shock were more likely to direct their gaze towards the optimal choice over time compared to those who were less certain, see SI Fig. 4B, further supporting the hypothesis that gaze is directed towards the (believed) optimal choice rather than the dangerous/aversive one during both observed and own choices (8).
SI Fig. 4. A. The probability of participants directing gaze towards the optimal compared to the suboptimal choice following the onset of stimulus presentation. B. The probability of directing gaze towards the optimal choice was modulated by the RL model-derived certainty (difference in expected value between optimal and suboptimal choice) at each trial. Participants directed their gaze more towards the optimal choice when they were more certain that this choice was better than the other. For the trials within the highest quantile of certainty, the probability of directing gaze toward the optimal choice exceeded 0.6. For the trials within the lowest quantile of certainty, the probability of directing gaze toward the optimal choice remained around chance. Grey ribbons show actual data within one standard deviation from the mean, while the black lines represent the model-derived probabilities. The dotted line indicates the chance probability, 0.5.

Skin conductance responses
To investigate psychophysiological measures of observational learning and further validate our RL model, we analyzed the effect of the observational prediction errors on skin conductance responses, SCRs, following observation of choice and upon hearing that the demonstrator received a shock. SCR is commonly used as a measure of autonomic sympathetic arousal (10), has been shown to covary with pupil dilation responses (11), and is used as an index of learning in classical conditioning (12) and decision making tasks (13). We expected higher SCRs following more surprising events (i.e. larger absolute prediction errors), since these should be more arousing. As predicted, we saw a significant effect of the observational Choice PE on SCRs following the presentation of the demonstrator's choices (

Parameter analyses
We used the winning model (above, SI 4.2.) to investigate patterns of parameter distribution. To do this, we first ran a cluster analysis using a Gaussian finite mixture model fitted by an expectation-maximization (EM) algorithm (mclust package) to detect three clusters of parameter combinations: i) low values of both α and β, ii) medium to high values of α and medium to low values of β, iii) medium to high values of α and high values of β (see SI Fig. 5). Fitted parameters from participants in the Described-High group were categorized as belonging mainly to clusters i and ii (DH; i:10, ii:9, iii:2), while fitted parameters from participants in the Described-Low group were categorized as belonging mainly to cluster ii (DL; i:5, ii:16, iii:1). Although close to trending, this difference between groups was not significant (p=0.16). However, if we categorized participants using a median split of mean overall performance there was a significant difference between high performing (mean performance > 0.8)