Multiple associative structures created by reinforcement and incidental statistical learning mechanisms

Learning the structure of the world can be driven by reinforcement but also occurs incidentally through experience. Reinforcement learning theory has provided insight into how prediction errors drive updates in beliefs but less attention has been paid to the knowledge resulting from such learning. Here we contrast associative structures formed through reinforcement and experience of task statistics. BOLD neuroimaging in human volunteers demonstrates rigid representations of rewarded sequences in temporal pole and posterior orbito-frontal cortex, which are constructed backwards from reward. By contrast, medial prefrontal cortex and a hippocampal-amygdala border region carry reward-related knowledge but also flexible statistical knowledge of the currently relevant task model. Intriguingly, ventral striatum encodes prediction error responses but not the full RL- or statistically derived task knowledge. In summary, representations of task knowledge are derived via multiple learning processes operating at different time scales that are associated with partially overlapping and partially specialized anatomical regions.


Supplementary Note 4: Consolidation of sequence knowledge between session 1 and 2 of the pre-scan learning task
We tested whether participants consolidated their sequence knowledge between the first and the second session of the pre-scan learning task, prior to scanning. If participants were able to anticipate the next sequence element better during the second learning session, they should be faster to respond to the first movement that initiates the path from A to B, B to C, or C to D. We therefore repeated the original ANOVA with an additional factor of day (2x2x3 repeated-measures ANOVA with factors day (day1/day2), sequence (RewSeq/ConSeq) and transition (AB, BC, CD)). In addition to the effects of transition, sequence and their interaction already reported in the main text, this analysis revealed a main effect of day (F(1,25)=55.44, p<0.001), with overall faster RTs on day 2; there was also an interaction between day and sequence type (F(1,25)=11.15, p=0.003; Suppl Fig 1f), which was due to a larger RT improvement on day 2 for RewSeq compared to ConSeq elements. Although the triple interaction of day x sequence x transition reached significance in a conventional ANOVA (F(2,50)=8.304, p<0.001), this interaction was not included in the winning model obtained from a Bayesian repeated-measures ANOVA (Model with Day + Sequence + Transition + Day*Sequence + Transition*Sequence: P(M|Data)=0.587, BFm=25.59, BF10=1). Overall, this suggests that both explicit knowledge of the RewSeq and implicit knowledge of the unrewarded ConSeq were consolidated from the first to the second session of the pre-scan learning task, with boosted consolidation due to reward.
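The relationship between the posterior model probability P(M|Data) and BFm reported above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes a uniform prior over the candidate models and a model space of 19 models for a three-factor design, neither of which is stated in the text. The function name `bf_model` is hypothetical. BFm is taken here as the ratio of posterior model odds to prior model odds.

```python
def bf_model(posterior_prob: float, n_models: int) -> float:
    """Ratio of posterior to prior model odds (often labelled BFm).

    ASSUMPTION: every one of the n_models candidate models has the same
    prior probability 1 / n_models. The prior is not given in the text.
    """
    if not 0.0 < posterior_prob < 1.0:
        raise ValueError("posterior_prob must lie strictly between 0 and 1")
    if n_models < 2:
        raise ValueError("need at least two candidate models")
    prior_prob = 1.0 / n_models
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    return posterior_odds / prior_odds


# ASSUMPTION: 19 candidate models for the day x sequence x transition design.
bf_winning = bf_model(posterior_prob=0.587, n_models=19)
print(f"BFm under a uniform prior: {bf_winning:.2f}")
```

Under these assumptions the result lies close to the reported BFm for the winning model; small deviations would be expected from rounding of the reported posterior probability.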

Supplementary Note 5: Post-scan memory of stimuli reveals differences between RewSeq and ConSeq specific to the recognition of correct stimuli
[...] of fMRI testing (i.e., after both scans had been completed). One stimulus was probed at a time and the others greyed out, and it was shown in its correct location and color (recognition of correct stimuli), or in a wrong location or wrong color (error detection of wrong stimuli). Participants' percentage correct scores for judging these stimuli as 'correct' versus 'wrong' were entered into a 2x2 repeated-measures ANOVA with factors sequence (RewSeq/ConSeq) and type of recall (recognition/error detection). This analysis showed an effect of sequence (F(1,25)=30.20, p<0.001), with better memory for RewSeq over ConSeq, and an interaction of sequence x type of recall (F(1,25)=20.16, p<0.001), showing that differences in memory between RewSeq and ConSeq were more pronounced when stimuli were shown in their correct position/color (Suppl Fig 4e). While type of recall only showed a trend (F(1,25)=3.49, p=0.074), the winning model in a Bayesian repeated-measures ANOVA was the full model, including both main effects and the interaction (P(M|data)=0.979, BFm=185.80), suggesting that reward boosted memory formation and that an error could be detected more easily than a correct stimulus recalled.

Supplementary Note 6: rr-cc is capturing cross-stimulus suppression effects
The (rr-cc) contrast was designed to capture associations between rewarded sequence elements via cross-stimulus suppression. However, it may also have been driven by a main difference in BOLD activity between rewarded and control sequence elements. To confirm our interpretation as cross-stimulus suppression effects, we first examined the impact of the temporal delay between successive sequence elements. Neural adaptation effects are expected to scale with the temporal delay between stimuli, with stronger suppression and thus a smaller BOLD signal for stimuli presented closer in time. We therefore modelled the temporal delay between occurrences of rewarded sequence elements and the temporal delay between occurrences of control sequence elements parametrically. We extracted parameter estimates from regions of interest (ROIs) defined as spheres around peak coordinates in the above contrast (see Methods and Supplementary Figure 2c). Even though this is a very demanding test that is rarely performed in repetition suppression experiments, we found that BOLD activity for the second of two successive elements of RewSeq (rr) but not ConSeq (cc) indeed scaled with the temporal distance between the stimuli in two of our four ROIs: temporal pole and amyg/hippo (temporal distance rr: p(mPFC)=0.83; p(pOFC)=0.79; p(tempPole)=0.049, t(25)=1.72; p(amyg/hippo)=0.008, t(25)=2.54; temporal distance cc: all p>0.1; one-sample t-tests; Figs 2c,d and Supplementary Figure 2a). This bolstered our interpretation of a shared neural representation of rewarded sequence elements in these regions. In a second test, we confirmed that none of our ROIs were simply showing a main effect difference between rewarded and control elements. We examined rewarded and control elements when they were preceded by an element not part of their own sequence (xr or xc). None of our ROIs showed a significant difference for the contrast xr-xc (one-sample t-tests: all p>0.1; Supplementary Figure 2b). Thus, our ROI-defining contrast captured relationships between pairs of stimuli as probed by cross-stimulus suppression, rather than differences in the overall BOLD main activation to rewarded and control stimuli.
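The ROI tests above are one-sample t-tests on per-participant parameter estimates; t(25) corresponds to 26 participants. A minimal sketch of that computation follows, using only the Python standard library. The function name `one_sample_t` and the example values are hypothetical and are not data from the study.

```python
import math
import statistics


def one_sample_t(values, popmean=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("zero variance: t statistic is undefined")
    t_stat = (mean - popmean) / (sd / math.sqrt(n))
    return t_stat, n - 1


# HYPOTHETICAL parameter estimates for one ROI, one value per participant.
example_betas = [0.2, -0.1, 0.4, 0.3, 0.0, 0.5, 0.1, -0.2, 0.3, 0.2]
t_stat, df = one_sample_t(example_betas)
print(f"t({df}) = {t_stat:.2f}")
```

The p-value would then be read from the t distribution with the returned degrees of freedom, for example with a statistics package of choice.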

Supplementary Note 7: Sequence fusion effects are not driven by correctly ordered (forward) pairs
Our ROI-defining sequence fusion contrast (Fig 2) compared repeated presentations of stimuli from the rewarded sequence (rr) with repeated presentations of any two control stimuli (cc). The rationale was that if stimuli from the rewarded sequence have more overlapping neural representations, this should lead to more cross-stimulus suppression, compared to control sequence stimulus repetitions (contrast: rr-cc). To ensure that the observed effects were not driven by only the correctly ordered forwards pairs, we repeated the analysis but split rr and cc pairs into repetitions that were forwards-directed (ForwPair: AB, xBC, xxCD) and other within-sequence pairs that were not forwards-directed in the correct order but still transitions within the same sequence (OtherWithinPair: BA, CB, DC, AD, DA, AC, CA, BD, DB). Suppl Fig 2e shows the effects separately for these two sub-groups of trials. Note that forwards pairs exclude occurrences of ABC and ABCD, i.e., those where full third- or fourth-order sequence relationships were fulfilled. Suppression for rewarded over control pairs was present for both ForwPairs and OtherWithinPairs. If anything, suppression was stronger for the OtherWithinPairs. While testing each effect individually (rr-cc for just ForwPairs or rr-cc for just OtherWithinPairs) is not orthogonal to our ROI selection, the interaction between the effect of rr-cc in ForwPairs compared to OtherWithinPairs is orthogonal to ROI selection. This is true in general because the mean of two effects is not correlated with, and thus independent of, the difference between the same two effects.
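The orthogonality argument can be made explicit with a short derivation. This is a sketch under stated assumptions rather than the authors' own formalization: the symbols X, Y, M and D are introduced here for illustration only, with X the rr-cc effect in ForwPairs and Y the rr-cc effect in OtherWithinPairs, and ROI selection is treated as depending on their unweighted mean M.

```latex
\begin{align}
M &= \tfrac{1}{2}(X + Y), \qquad D = X - Y, \\
\operatorname{Cov}(M, D)
  &= \tfrac{1}{2}\bigl[\operatorname{Var}(X) - \operatorname{Cov}(X, Y)
     + \operatorname{Cov}(Y, X) - \operatorname{Var}(Y)\bigr] \\
  &= \tfrac{1}{2}\bigl[\operatorname{Var}(X) - \operatorname{Var}(Y)\bigr].
\end{align}
```

The covariance between mean and difference therefore vanishes when the two effects have equal variance. Zero covariance amounts to full independence only under additional assumptions, such as joint normality of X and Y.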
The direct test of the interaction in a 2x2 repeated-measures ANOVA with factors sequence (rr/cc) and pair type (ForwPair/OtherWithinPair) showed no interaction between pair type and sequence in amyg/hippo and pOFC (both F(1,25)<0.5, p>0.5), a trend-wise interaction in temporal pole (F(1,25)=3.218, p=0.085) and a significant interaction in mPFC (F(1,25)=4.345, p=0.047). In both of these latter cases, the interaction or trend-wise interaction arose because there was less, rather than more, suppression for forward pairs, and the winning model in a Bayesian repeated-measures ANOVA contained only the factor sequence (tempPole: BFm=7.323, P(M|data)=0.647; mPFC: BFm=6.99, P(M|data)=0.636). Thus, in all cases, the rr-cc effect was not driven by only the subset of correctly ordered sequence pairs.
We note that this also implies that sequence fusion effects did not rely on transitions truly experienced during the pre-scan learning task, when only ForwPairs but no OtherWithinPairs were experienced. This points towards a more abstract representation of which elements belong or do not belong to the rewarded sequence.

Supplementary Note 8: Correct sequence order encoding depends on third- and fourth-order structure
We probed whether increases in BOLD activation in temporal pole and pOFC when transitioning through the correctly ordered rewarded sequence (i.e. A, AB, ABC, ABCD) were dependent on the third- and fourth-order contingencies. Alternatively, this signal could be present even when only the pair structure (AB, BC, CD) is fulfilled. For this analysis, we focused on ABC versus xBC (a pair BC not preceded by A, bold indicates time-locking), and ABCD versus xxCD because A and AB by definition do not rely on higher-order chains.
We extracted parameter estimates from a GLM that was almost identical to GLM1, except that instead of regressors (1) rr and (3) corrOrderRewSeq, we modelled one regressor with all occurrences of xBC and xxCD, and one regressor with all occurrences of ABC/ABCD, plus other rr pairs of no interest here in their own separate regressor. We ran a 2x2 repeated-measures ANOVA on the resulting parameter estimates with factors Area (pOFC/tempPole) x HigherOrder (fulfilled yes/no, i.e. ABC/ABCD vs xBC/xxCD) and found a significant effect of HigherOrder (F(1,25)=5.712; p=0.025) but no effect of Area and no Area x HigherOrder interaction. Consistently, the winning model in a Bayesian repeated-measures ANOVA was a model with the HigherOrder factor but no other factors (BFm=4.936, P(M|data)=0.552; Fig 3c).

Supplementary Note 9: Statistical learning of spatial distance and transition frequency is not driven by rewarded elements
To confirm that our measures of statistically acquired knowledge were indeed reflecting knowledge of relationships between all twelve stimuli, we repeated the analysis with a second GLM that split stimuli into those belonging to the rewarded and those belonging to the control sequence (RewSeq and ConSeq, respectively). The original GLM had one joint onset regressor for all twelve stimuli and parametric regressors across these twelve stimuli. In the control GLM, the onsets of rewarded and control stimuli were instead modelled separately and separately associated with parametric regressors for spatial distance and transition frequency. We note that this analysis would be expected to be slightly less powerful, as parametric regressors rely on 4 out of 12 stimuli and thus a third of the data in each case.
Nevertheless, six out of eight effects reached significance (all p<0.05; Suppl Fig 4d) and the remaining two pointed in the same direction as in the initial analysis: for mPFC, transition frequency did not reach significance when fitted on RewSeq trials alone (p=0.18) and for amyg/hippo, transition frequency did not reach significance for ConSeq trials alone (p=0.15). Taken together, this is strong evidence that our measures of statistical learning were not driven by rewarded stimuli alone.

Supplementary Note 10: Transition frequency, not probability, is encoded in mPFC and amyg/hippo
Statistical associations ('task model') could be represented in terms of the conditional probabilities of transitions between stimuli given the initial stimulus or as pure state transition frequencies. The two types of representations have many similarities but the latter is arguably a more global, flexible, and abstract representation of the task space that is less dependent on the precise nature of the experiences during the time when it was acquired; the representation of transition from one stimulus to another is not normalized by the number of occasions the first stimulus has been experienced during learning. Intriguingly, activity in both mPFC and amyg/hippo was explained by a representation of state-transition frequencies (see above) but not by the conditional probability of a [...] t(25)=3.12), and a trend-wise difference in amyg/hippo (p=0.0575, t(25)=1.99; Fig 4b). Importantly, the evidence for a transition frequency representation remained significant when conditional transition probability was simultaneously included in the model. Note that the effects of conditional transition probability and pure transition frequency are dissociable from the expected stimulus frequency per se, which was only identified in visual areas (Suppl Fig 4b).
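The distinction between the two candidate representations can be illustrated with a small sketch. It is not the authors' regressor construction: the function names and the toy stimulus stream are hypothetical. Transition frequency is the raw count of each ordered pair, whereas conditional probability divides that count by the number of transitions leaving the first stimulus.

```python
from collections import Counter


def transition_counts(sequence):
    """Raw frequency of each ordered transition (first, second)."""
    return Counter(zip(sequence, sequence[1:]))


def conditional_probs(counts):
    """Normalize each count by all transitions leaving the same first stimulus."""
    totals = Counter()
    for (first, _second), n in counts.items():
        totals[first] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# HYPOTHETICAL stimulus stream, not the task's actual statistics.
stream = list("ABABABABCD")
counts = transition_counts(stream)
probs = conditional_probs(counts)

for pair in sorted(counts):
    print(pair, "frequency:", counts[pair], "conditional probability:", probs[pair])
```

In this toy stream the transitions A to B and C to D have the same conditional probability but different frequencies, which is the kind of case in which the two representations make different predictions.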

Supplementary Note 11: Dissociating tempPole-pOFC and hippo/amyg-mPFC networks
In the main text, we showed that BOLD responses in temporal pole and pOFC reflected the correctly ordered rewarded sequence, while mPFC and amyg/hippo carried knowledge of statistical relationships between all stimuli. To formally assess whether these two networks indeed carry different information, we ran additional analyses relating to the contrasts that differentiated between these areas: correct order and spatial/statistical transition. The first 2 x 2 ANOVA focused on the parameter estimates extracted from the correct order contrast in all four ROIs (Fig 3b). It included the factors Network (tempPole and pOFC versus amyg/hippo and mPFC) and Node (anterior/posterior, where pOFC and mPFC are the anterior nodes of the two networks, both found within frontal cortex, and tempPole and amyg/hippo are the posterior nodes in the temporal lobe). There was a significant effect of Network (2 x 2 repeated-measures ANOVA: F(1,25)=9.893, p=0.004) but no effect of Node or Network x Node (both p>0.2). Consistent with this result, a Bayesian repeated-measures ANOVA showed that the winning model contained only the factor Network (P(M|data)=0.65, BFm=7.34).
The second 2 x 2 x 2 ANOVA comprised the same conditions and the additional factor Contrast (spatial distance or transition frequency). This ANOVA focused on BOLD responses related to statistical knowledge (Fig 4b). Again, we found a significant effect of Network (F(1,25)=30.82, p<0.001) and no other significant main effects or interactions, only a trend for a Contrast x Network interaction (2 x 2 x 2 repeated-measures ANOVA: F(1,25)=3.472, p=0.074).
The conclusions drawn from this analysis were bolstered by a Bayesian repeated-measures ANOVA that revealed three similarly good models of the data, all of which included a factor of Network (Model 1 had a main effect of Network only: P(M|data)=0.195, BFm=4.37; Model 2 included two main effects of Contrast and Network: P(M|data)=0.256, BFm=6.19; Model 3 contained three effects Contrast + Network + Network x Contrast: P(M|data)=0.251, BFm=6.02). In summary, in all cases, there was evidence for a difference between the activity patterns in the pOFC-tempPole and mPFC-amyg/hippo networks.
The identification of different response patterns in the two networks is consistent with a large body of work suggesting that there are major anatomical differences between pOFC and tempPole on the one hand and hippo/amyg and mPFC on the other hand. There are strong monosynaptic connections within but not across these two networks 5-7. For example, temporal pole and pOFC clusters, despite being in different lobes, are connected via the uncinate fascicle 8,9 while the hippocampus and mPFC are interconnected via the fornix 8,10. These network connections are not just clear in tracer data but can also be appreciated using human resting-state data; there is strong within-network activity coupling (between pOFC and tempPole and between hippo/amyg and mPFC) but weaker across-network coupling (Supplementary Figure 5; source: Human Connectome Project (HCP) Data). Altogether, this provides robust evidence that the patterns of BOLD activation in the pOFC-temporal pole and amyg/hippo-mPFC networks were dissociable, with pOFC and temporal pole reflecting knowledge of the correctly ordered rewarded sequence and amyg/hippo and mPFC showing signatures of statistical learning. As previously noted, both types of learning, statistical and reward-learning, mediated aspects of stimulus-stimulus learning in our task.
The dissociations between regions have less to do with the type of association than with the mechanism of learning by which the association was derived.

[...] shown in Fig 2, right is added for completeness; * indicates p<0.05 in one-sample t-test. b, The main difference in BOLD response to rewarded and control elements was not significant in any of our ROIs (xr-xc), suggesting that the ROI-defining contrast (rr-cc) indeed captured cross-stimulus repetition suppression effects, and thus probed relationships within the rewarded and control sequences, rather than main activation differences. Main activation differences are tested here based on trials where a rewarded stimulus was preceded by any non-rewarded stimulus (xr; and the same for the control sequence: xc). This test was independent of the ROI-defining contrast. c, ROI spheres centered on the peak locations of the activations shown in [...]

Whole-brain results for the encoding of correct sequence order. a, RewSeq elements that follow the correct order are contrasted with RewSeq elements that came in any other order and the same comparison for the ConSeq is subtracted. This highlights strong bilateral activation in pOFC and unilateral activation in temporal pole (all cluster-corrected). b, RewSeq elements that follow the correct order are contrasted with ConSeq elements that follow the correct order directly. This contrast is shown at a lower threshold (z>2.3) but reveals a similar pattern of results.

The hippocampus showed effects of transition frequency (a) and spatial distance (b); contrasts are the same as in Supplementary Figure 4a but with views showing the hippocampus (both z>3.1/p<0.001 cluster-corrected). c, All effects in the hippocampus are illustrated for a spherical ROI centered on a coordinate taken from Garvert et al., eLife, 2017. This shows BOLD signatures related to statistical learning but no knowledge of the rewarded over and above the control sequence; error bars denote SE; * indicates p<0.05 in one-sample t-test.