University of Birmingham The relationship between reinforcement and explicit control during visuomotor adaptation

The motor system’s ability to adapt to environmental changes is essential for maintaining accurate movements. Such adaptation recruits several distinct systems: cerebellar sensory-prediction error learning, success-based reinforcement, and explicit control. Although much work has focused on the relationship between cerebellar learning and explicit control, there is little research regarding how reinforcement and explicit control interact. To address this, participants first learnt a 20° visuomotor displacement. After reaching asymptotic performance, binary, hit-or-miss feedback (BF) was introduced either with or without visual feedback, the latter promoting reinforcement. Subsequently, retention was assessed using no-feedback trials, with half of the participants in each group being instructed to stop aiming off target. Although BF led to an increase in retention of the visuomotor displacement, instructing participants to stop re-aiming nullified this effect, suggesting explicit control is critical to BF-based reinforcement. In a second experiment, we prevented the expression or development of explicit control during BF performance, by either constraining participants to a short preparation time (expression) or by introducing the displacement gradually (development). Both manipulations strongly impaired BF performance, suggesting reinforcement requires both recruitment and expression of an explicit component. These results emphasise the pivotal role explicit control plays in reinforcement-based motor learning.

The motor system's ability to adapt to environmental changes is essential for maintaining accurate movements. Such adaptation recruits several distinct systems: cerebellar sensory-prediction error learning, success-based reinforcement, and explicit control. Although much work has focused on the relationship between cerebellar learning and explicit control, there is little research regarding how reinforcement and explicit control interact. To address this, participants first learnt a 20° visuomotor displacement. After reaching asymptotic performance, binary, hit-or-miss feedback (BF) was introduced either with or without visual feedback, the latter promoting reinforcement. Subsequently, retention was assessed using no-feedback trials, with half of the participants in each group being instructed to stop aiming off target. Although BF led to an increase in retention of the visuomotor displacement, instructing participants to stop re-aiming nullified this effect, suggesting explicit control is critical to BF-based reinforcement. In a second experiment, we prevented the expression or development of explicit control during BF performance, by either constraining participants to a short preparation time (expression) or by introducing the displacement gradually (development). Both manipulations strongly impaired BF performance, suggesting reinforcement requires both recruitment and expression of an explicit component. These results emphasise the pivotal role explicit control plays in reinforcementbased motor learning.
In a constantly changing environment, our ability to adjust motor commands in response to novel perturbations is a critical feature for maintaining accurate performance 1 . These adaptive processes have often been studied in the laboratory through the introduction of a visual displacement during reaching movements 2 . The observed visuomotor adaptation, characterized by a reduction in performance errors, was believed to be primarily driven by a cerebellar-dependent process that gradually reduces the mismatch between the predicted and actual sensory outcome (sensory prediction error) of the reaching movement 1,3,4 . Cerebellar adaptation is a stereotypical, slow and implicit process and therefore does not require the individual to be aware of the perturbation to take place 5,6 . However, a single-process framework cannot account for the great variety of results observed during visuomotor adaptation tasks 7 . Specifically, it has recently been shown that several other non-cerebellar learning mechanisms also play a pivotal role in shaping behaviour during adaptation paradigms such as explicit control 8,9 and reward-based reinforcement [10][11][12][13][14][15] .
Explicit control usually consists of employing simple heuristics such as aiming off target in the direction opposite to a visual displacement, to quickly and accurately account for it 5 . However, this requires explicit knowledge of the perturbation, which in turn usually requires experiencing large and unexpected errors 8,[16][17][18] . Explicit control contrasts with cerebellar adaptation in that it is idiosyncratic 9 , volitional, and can lead to fast adaptation rates 19 . Importantly, in this work, we consider explicit control as the contribution to performance that can be suppressed (or expressed) by participants upon request 20 , as opposed to the additional requirement of being able to verbalise a strategy. Critically, cerebellar adaptation takes place regardless of the presence or absence of any explicit process, even at the cost of accurate performance 5 .
More recently, another putative mechanism contributing to motor adaptation has been proposed, through which the memory of actions that led to successful outcomes (hitting the target) is strengthened, and therefore more likely to be re-expressed 14,21 . Such reinforcement is considered to be an implicit process, but distinct from cerebellar adaptation in that it is not driven by sensory prediction error but task success or failure 10,11 . To examine this phenomenon, several studies employed a binary, hit-or-miss feedback (BF), paradigm which promotes reinforcement over cerebellar processes 11,12,22 . For example, in one study, participants receiving only binary feedback following successful adaptation expressed stronger retention than participants who had received a combination of visual and binary feedback 12 . The authors argued this could be due to greater involvement of reinforcement-based process that is less susceptible to forgetting 12 .
With the multiple processes framework of motor adaptation, the question of interaction between the distinct systems becomes central to understanding the problem as a whole, and it remains an under-investigated question for reward-based reinforcement. In decision-making literature, it has long been suggested that two distinct "model-based" and "model-free" systems interact 23,24 and even require communication to be optimal 25,26 . Interestingly, model-based processes share many characteristics with explicit control during motor adaptation, in that they are both more explicit, rely on an internal model of the world (explicit control 27,28 ; model-based decision-making 29 ), and are closely related to working memory capacity (explicit control 30,31 ; model-based decision-making 32,33 ) and pre-frontal cortex processes (explicit control 30 ; model-based decision-making 26,34 ). On the other hand, the concept of reinforcement in motor adaptation comes directly from the model-free systems described in decision-making literature 28 , and is often labelled as such. It is considered more implicit, relies on immediate action-reward contingencies and is thought to recruit the basal ganglia in both cases (visuomotor adaptation 22 ; decision-making 23 ). Despite these interesting similarities, unlike model-based and model-free decision-making, the relationship between explicit control and reinforcement during visuomotor adaptation paradigms is currently unknown. Evidence of this relationship exists from a recent study which showed participants needed to experience a large reaching error in order to express a reinforcement-based memory 18 . In addition, there is a wealth of evidence which shows explicit control also requires experiencing large errors 16,17,27 . Thus, it is possible that the formation of a reinforcement-based memory requires, or at least benefits, from some form of explicit control 35 .
To address this possibility, we first examined the contribution of explicit control to the reinforcement-based improvements in retention following binary feedback 12,22 . Secondly, we used a forced reaction time (forced RT) paradigm 36 to investigate the importance of being able to express explicit control when encountering binary (reinforcement-based) feedback.

Experiment 1: Explicit control occurs during reinforcement-based retention.
We first sought to investigate the role of explicit control in the retention of a reinforced visual displacement memory. In experiment 1, participants made fast 'shooting' movements towards a single target (Fig. 1a). After a baseline block involving veridical vision (60 trials) and an adaptation block (75 trials) where a 20° counter-clockwise (CCW) visuomotor displacement was learnt with online visual feedback (VF), participants experienced the same displacement for 2 blocks (asymptote blocks; 100 trials each) with either only binary feedback (BF group, Fig. 1b, top) to promote reinforcement, or BF and VF together (VF group, Fig. 1b, bottom). Following this, retention was assessed through 2 no-feedback blocks (100 trials each), during which both BF and VF were removed. Before these no-feedback blocks, half of the participants were told to "carry on" as they were ("Maintain" group) and the remaining ones were informed of the nature of the perturbation, and to stop re-aiming off target to account for it ("Remove" group). Thus, there were four groups: BF-Maintain, BF-Remove, VF-Maintain and VF-Remove (N = 20 for each group).
Group performance is shown in Fig. 2a. All groups showed similar baseline performance ( Fig. 2b; H(3) = 4.59 p = 0.20; see Methods for detailed information on statistical analysis), and had fully adapted to the visuomotor displacement prior to the asymptote/reinforcement blocks (average reach angle in the last 20 trials of adaptation, Fig. 2c; H(3) = 2.56 p = 0.46). Interestingly, at the start of the first asymptote block, participants in both BF groups showed a dip in performance, effectively drifting back toward baseline before adjusting back and returning to plateau performance. This "dip effect" was completely absent in the VF groups, and has previously been observed independently of our study when switching to BF after a displacement is abruptly introduced 12 . Therefore, success rate was compared independently across groups in the first 30 trials (Fig. 2d) and the remaining 170 trials (Fig. 2e) of the asymptote block. Both BF groups exhibited lower success rates than the VF groups in the early asymptote phase (H(3) = 46.79, p < 0.001, Tukey's test p < 0.001 for BF-Maintain vs VF-Maintain and vs VF-Remove, and for BF-Remove vs VF-Maintain and vs VF-Remove). This was also seen in the late asymptote phase (H(3) = 31.29, p < 0.001, Tukey's test p < 0.001 for BF-Maintain vs VF-Maintain and vs VF-Remove, and for BF-Remove vs VF-Maintain and vs VF-Remove), although performance greatly improved for both BF groups compared to the early phase (Z = 3.692 and Z = −3.81 for BF-Remove and BF-Maintain, respectively, p < 0.001 for both). Of note, both BF groups express a slight decrease in reach angle at the beginning of the second asymptote block, but removing this second dip does not qualitatively alter the result (H(3) = 27.46, p < 0.001, Tukey's test p < 0.001 for BF-Maintain vs VF-Maintain and vs VF-Remove, and p < 0.01 for BF-Remove vs VF-Maintain and vs VF-Remove). Finally, no across-group difference in RTs or movement duration was found during the asymptote blocks (Supplementary fig. S1a,b).
Participants then performed a series of 2 no-feedback blocks. Similar to Shmuelof et al. 12 , we assessed retention by looking at the last 20 trials of the second block. However, our results are fundamentally the same irrespective of the trials used to represent retention. Overall, the BF-Maintain group showed greater retention relative to all other groups, largely maintaining the reach angle values achieved during the asymptote phase, whereas there was no difference between the other groups ( Fig. 2f; H(3) = 27.66, p < 0.001, Tukey's test p = 0.001 for BF-Remove vs BF-Maintain and p < 0.001 for BF-Maintain vs both VF groups; p = 0.6 for BF-Remove vs VF-Remove; p = 1 for BF-Remove vs VF-Maintain; p = 0.68 for VF-Maintain vs VF-Remove). We therefore replicated previous work which showed that BF led to enhanced retention of a visual displacement when compared to VF 12 . However, this effect of BF was abolished by asking participants to remove any re-aiming strategy they had developed (BF-remove). This suggests the increase in retention following BF was mainly a consequence of the greater development and expression of explicit control. Experiment 2: Re-aiming is necessary for maintaining performance under binary feedback. If the conclusion from our first experiment is correct, then successful asymptote performance under BF only should be dependent on the ability to develop and express explicit control. Therefore, in experiment 2 we restricted participant's capacity to recruit an explicit component by using a forced RT adaptation paradigm [36][37][38] (Fig. 1c, see methods for details). Specifically, two groups adapted to a 20° CCW visuomotor displacement by performing reaching movements to 4 targets (Fig. 1d), with the amount of available preparation time (i.e. time between target appearance and movement onset) being restricted. A first group was allowed to express slow RTs (SRT; RT constraints were 930 to 1100 ms after target onset; N = 10), while the second group was only allowed very fast RTs (FRT; 130 to 300 ms; N = 10; Fig. 1c and Supplementary fig. S2a). The latter condition has been shown to prevent time-demanding explicit processes such as mental rotations necessary to express re-aiming in reaching tasks 36,38,39 . Critically, this paradigm prevented expression of re-aiming, but may not prevent development of an explicit component, at least reliably. Therefore, to ensure any between-group difference was task-dependent and not related to inter-individual differences in awareness or understanding of the task, we explained in detail the nature of the perturbation and the optimal policy to counter it. In addition, a third condition was designed in which participants were kept unaware of the visual displacement by introducing the perturbation gradually 16,18 (N = 10; Fig. 1d, bottom), and were not informed of any optimal policy to employ. Participants in this group were given no RT constraint whatsoever. Finally, it should be mentioned that a large portion of participants in the Gradual group reported noticing a slight perturbation by the end of the adaptation block when informally asked after the experiment. However, they underestimated its amplitude significantly at best, reporting effects of the order of 5°. Nevertheless, for the sake of simplicity we will qualify this group as "unaware", although we acknowledge they reported very partial, reduced awareness of the perturbation. Feedback-instruction task perturbation and feedback schedule for the BF groups (top) and VF groups (bottom). The white and grey areas represent blocks where VF was available or not available, respectively, as indicated with a crossed or noncrossed eye. Blocks in which hits (with 5° tolerance on each side of the target) were followed by a pleasant sound are indicated with a small speaker symbol. The y-axis represents the value of the discrepancy between hand movement and task feedback. The double dashed vertical lines represents the time point at which "Maintain" or "Remove" instructions were given. The number of trials and names for each block are indicated at the bottom of each schedule. (c) Experiment 2: forced RT. Schedule of tone playback and target appearance before each trial during the forced RT task (SRT and FRT conditions). The green area represents the allowed movement initiation timeframe, and the red dots represent target onset times for each condition. The grey areas represent the tones. (d) Forced RT task perturbation and feedback schedule for the SRT and FRT groups (top) and for the Gradual group (bottom). Grey areas represent blocks without VF. The green tick and red cross represent binary feedback cues for a hit (5° tolerance on each side of the target) and miss, respectively. The white and grey areas represent blocks in which VF was available or not available, respectively, as indicated with a crossed or noncrossed eye, and the y-axis represents the value of the discrepancy between hand movement and task feedback. Overall group performance is displayed in Fig. 3a. During baseline, average reach direction was similar for all groups ( Fig. 3b; H(2) = 0.45, p = 0.79). To examine whether the FRT and SRT groups displayed different rates of learning during adaptation, we applied an exponential model to each participant's adaptation data. Note, this was not done for the gradual group whose adaptation rate was restricted by the incremental visuomotor displacement. Surprisingly, we found no significant difference between the FRT and SRT group's learning rates (U = 74; p = 0.34; Supplementary fig. S2b). Indeed, one would expect the SRT group to express faster learning since they can express strategies to account for the perturbation 19,36,38,40 . This is most likely a consequence of the small size of the perturbation encountered (i.e. 20°), which leaves less margin for strategic re-aiming 20,40,41 . At the end of the adaptation block, all groups adapted successfully, with no significant difference in reaching direction ( Fig. 3c; H(2) = 2.34, p = 0.31). However, despite the lack of statistical significance, the mean reach direction for the FRT group was slightly under 15° (mean: 14.87°), which represents the limit of the reward region in the subsequent block. We discuss the implications of this later.
Participants then experienced an asymptote block with BF, similar to the first experiment, with the exception that hit-miss feedback was provided with a green tick and a red cross onscreen, because audio BF would potentially temporally align with movement initiation cues and confuse participants. Several other studies have already employed visual BF successfully 11,22,42 . During asymptotic performance, where participants were restricted to binary feedback, the SRT group showed a striking ability to maintain performance within the rewarded region whereas the two other groups clearly could not ( Fig. 3d; H(2) = 17.5, p < 0.001, Bonferroni-corrected (see Methods), Tukey's test p < 0.001 vs FRT and p = 0.001 vs Gradual). Next we compared success rates across groups for early BF trials (Fig. 3e) and the remainder of BF trials (Fig. 3f) independently. Early success rates were significantly lower for the Gradual group compared to the SRT (H(2) = 9.2, p = 0.02, Bonferroni-corrected, Tukey's test p = 0.011), and a similar but non-significant trend was observed between the FRT and SRT groups (Tukey's test p = 0.059). The absence of a significant difference in early success rate between the FRT and SRT groups cannot be explained by average reach angles, as the FRT group actually express a larger decrease in reach angle during that timeframe compared to the Gradual group (Fig. 3a). Rather, the greater variability in reach angle within individuals in the FRT as opposed to the Gradual group is likely to cause this result (average individual variance; FRT: 47.5; Gradual: 18.9). However, success rate during the remaining trials reached significance for both the FRT and Gradual groups compared to the SRT group (H(2) = 16.67, p < 0.001, Bonferroni-corrected, Tukey's test p < 0.001 for both FRT and Gradual). Surprisingly, no dip in performance was observed for the SRT group in the early phase of the BF blocks, suggesting that informing participants of the perturbation and how to overcome it at the beginning of the experiment is sufficient to prevent this drop in reach angle.
Next, to ensure the low end adaptation reach angles expressed by the FRT group did not explain the low success rates, we removed every participant who expressed less than 15° reach angle at the end of the adaptation from each group (e.g. 43 ). Henceforth, we refer to those participants as non-adapters, as opposed to adapters. This procedure resulted in 1, 5 and 2 participants being removed in the SRT, FRT and Gradual groups, respectively. Performance for the adapters was fundamentally the same as the original groups (Fig. 4a), except for end adaptation reach angles, which were now all above 15° (Fig. 4b; SRT 17.0 ± 1.2; FRT 16.9 ± 1.2; Gradual 16.7 ± 1.4). Specifically, the SRT-adapter group still showed a clear ability to remain in the rewarded region during binary feedback performance (asymptotic blocks), whereas the other two adapter groups could not ( Fig. 4c; H(2) = 14.0, p = 0.002, Bonferroni-corrected, Tukey's test p = 0.028 vs FRT-adapter and p = 0.001 vs Gradual-adapter). Because the full groups (i.e. non-Adapters included) did not express a drop in success rate during early asymptote trials, we compared Adapters' success rates during asymptote as a whole, rather than splitting them between early and late performance. The SRT-adapter group still displayed greater success than the Gradual-adapter group ( Fig. 4d; H(2) = 13.74, p = 0.002, Bonferroni-corrected, Tukey's test p < 0.001). However, the difference between the SRT-adapter and the FRT-adapter group was now non-significant (Tukey's test p = 0.12). Despite this, the reach angle differences clearly show that successful binary performance remained strongly affected by one's capacity to develop and express explicit control even for the successful adapters, as shown by the Gradual-adapter and FRT-adapter groups, respectively (Fig. 4a).
Finally, since trials were reinitialised if participants failed to initiate reaching movements within the allowed timeframe, we compared the average occurrence of these failed trials between the FRT and SRT groups (Supplementary fig. S2c) to ensure any between-group difference cannot be explained by this. Both groups expressed similar amounts of failed attempts per trial (U = 100, p = 0.73). In addition, movement times were significantly faster across all blocks for the FRT group compared to the SRT group (Supplementary fig. S2d; H(2) = 11.78, p = 0.005, Tukey's test p = 0.002), although they remained strictly under 400 ms for all groups as in the first experiment (Fig. 1c). This difference is to be expected due to the tendency to express faster velocities in movements with rapid initiation 44 . RTs expressed by the Gradual group were between the SRT and FRT constraints (Supplementary fig. S2a; Gradual group RT range 385 to 1610 ms).
Overall these findings demonstrate that preventing explicit control by restricting its expression or making participants unaware of the nature of the task results in the partial incapacity of participants to perform successfully during binary feedback performance. It should be noted, however, that performance did not reduce back to baseline entirely, as participants in both the FRT and Gradual groups were still able to express intermediate reach angle values in the order of 10 to 15°.

Discussion
Previous work has led to the idea that BF induces the recruitment of a model-free reinforcement system that strengthens and consolidates the acquired memory of a visuomotor displacement 10,12,22 . Here, we investigated the role of explicit control in the context of BF, and our results suggest that it may have a more central role in explaining some BF-induced behaviours than previously expected. In the first experiment, the increased retention observed in the BF-Maintain group was suppressed if participants were told to "stop aiming off target" (BF-Remove group). In the second experiment, preventing expression of explicit control by using a secondary task or preventing its development with a gradual introduction of the perturbation resulted in participants being unable to maintain accurate performance during BF blocks. This suggests an explicit component is necessary for performing a BF reaching task, at least within the present study's experimental design.
The initial performance drop observed at the introduction of BF for both BF groups suggests that participants cannot immediately account for a visuomotor displacement they have already successfully adapted to 12 . A possible explanation is that the cerebellar memory is not available anymore, most likely because removing VF results in a context change, which is known to prevent retrieval and expression of an otherwise available memory [45][46][47] . Considering this, the restoration of performance observed after this dip could not be explained by recollection of the cerebellar memory, suggesting another mechanism took place. Two possible candidates to explain this drift back are model-free reinforcement [10][11][12]22 and explicit processes 7,8,41 .
Reinforcement learning is usually considered to operate through experiencing success 10,11,48 . It is thus difficult to argue for a reinforcement-based reversion to good performance during BF because participants in the trough of the dip did not experience a large amount of success (Supplementary fig. S3), if any. Furthermore, participants experienced little "plateau" performance during the previous block, making formation of a model-free reinforcement memory unlikely, because it is considered a rather slow learning process as opposed to model-based reinforcement 10,49 ; though the adaptation block remains longer compared to Shmuelof and colleagues 12 . On the other hand, both BF groups experienced a large amount of unexpected errors during this drop, which may promote a more explicit approach [16][17][18]27,35 . In line with this, the SRT group in the forced RT task, which had been informed of the displacement and of the right policy to counter it, did not express such a dip when starting the BF block.
The forced RT task addresses this question more directly, and shows that impeding explicit control with a secondary task 36,38 prevents participants from restoring performance over BF blocks, confirming our interpretation. Interestingly, both the FRT and Gradual groups did not show a return to baseline during asymptote. Likely, the FRT group was aware of the optimal policy, and could partially express it, leading to these intermediate reach angles. In line with this, previous work on forced RT paradigms shows that adapting the constraints based on each individual's baseline proficiency at this task more efficiently prevents explicit control 38 . Furthermore, even in the presence of BF, the Gradual group showed a striking inability to find the optimal policy, suggesting the lack of structural understanding of the task strongly impeded their exploration 35,48 . This overall incapacity of the Gradual group to express an efficient explorative approach is consistent with previous findings showing that rewarding success alone, without providing any explanation of the task structure, is not sufficient to make participants reliably learn an optimal policy 48,50 .
Previous studies employing the forced RT paradigm have shown it usually leads to slower learning rates during adaptation because participants can less easily employ explicit control from the beginning 19,36,38 . In contrast, no such difference in learning rate was observed in our forced RT groups. This is possibly due to the difference in size of the perturbation between our study (20°) compared to others 36,38 (30°), making the explicit contribution potentially smaller during the adaptation phase 7 .
Our findings qualitatively replicate results from a previous study employing a similar design 12 . However, it should be noted that our paradigm differs in several ways. First, retention was assessed using feedback removal rather than visual error clamps, although there is evidence that both methods lead to quantitatively similar results 51 . Second, our displacement was only 20° of amplitude and no additional displacement was introduced after the asymptote blocks. There is now a growing wealth of evidence that the cerebellum cannot account for more than 15 to 20° displacements 20,38,52 , with the remaining discrepancy usually being accounted for through explicit re-aiming 41 . Therefore, the absence of a second, larger displacement, if anything, should only result in a less explicit performance. Nevertheless, instructing participants to remove any explicit re-aiming policy (Remove groups) resulted in a near-complete nullification of the binary feedback effect, suggesting it is mainly underlain by a simple re-aiming process. However, the Maintain instruction alone was not sufficient to produce this high SCientifiC REPORTS | (2018) 8:9121 | DOI:10.1038/s41598-018-27378-1 retention profile, as the VF-Maintain group did not express it. We believe this can be explained in two ways. First, experiencing no feedback may result in a stronger context change for the VF groups compared to the BF groups, because the latter experienced the absence of VF during the asymptote blocks beforehand. Thus, this should lead to a stronger drop in reaching angle at the beginning of the no feedback trials for the VF groups, as observed here. Alternatively, the VF-Maintain group experienced 200 more trials with visual feedback at asymptote. Consequently, it is very likely that the cerebellar memory at the beginning of the no-feedback blocks was stronger 11 , and the explicit contribution was less for this group compared to the BF-Maintain group 7,19,41,53 . This would therefore result in the slow drop in reach angle observed during early no-feedback trials due to gradual decay of the cerebellar memory 45,51,54 . Critically, both possibilities are not incompatible, and may well occur together.
A notable feature of retention performance is that both BF-and VF-Remove groups show a residual bias of around 5° in their reach angle in the direction opposite to the displacement. Participants in the Remove conditions were not aware of this upon asking them after the experiment. This has been reliably observed in studies using no-feedback blocks to assess retention 21,55 (but see 51 ). Possible explanations include use-dependent plasticity-induced bias 56,57 , perceptual bias 58 or an implicit model-free reinforcement-based memory, although this study cannot provide any account toward one or the other. Note however that although the BF-Remove group expressed slightly more bias than its VF counterpart, this clearly did not reach statistical significance, meaning this cannot be explained by feedback type alone. Regardless, the implicit and lasting nature of this phenomenon makes it a promising focus for future research with clinical applications 13,15 .
Overall, our findings point towards a central role of explicit control during BF-induced behaviours in this study. In line with this, 14/54 participants had to be removed from the BF groups in the feedback-instruction task (experiment 1) because of poor performance in the asymptote blocks (see methods), suggesting that structural learning was required to perform accurately 35,48,50 . Though this is a significant proportion of participants, it should be noted that other studies using BF-based reaching also found a similar percentage of "learners" and "non-learners" 42,43 . Although not expected in our study, this seemingly consistent outcome across a variety of BF experimental designs raises questions regarding either the reliability of this learning mechanism across individuals or the tasks used to examine it. The possibility that this dichotomy between participants is due to structural learning is in line with the dip observed in the BF groups and the absence of dip in the (i.e. informed) SRT group. If correct, then predictors of structural learning capacity should also predict an individual's ability to learn a visuomotor displacement under BF, a hypothesis that will be tested in future studies. Finally, our view is that implicit, model-free reinforcement takes a great amount of time and practice to form 49,59 , and usually arises from initially model-based performance in behavioural literature 23,60 , as illustrated by popular reinforcement models (e.g. DYNA 61,62 ). Two interesting possibilities are that 200 trials of BF alone are not sufficient to result in a strong, habit-like enhancement of retention 60 , or that such behavioural consolidation must take place through sleep 60,63 . Future work is required to address these hypotheses.
In conclusion, this study provides further insight into the use of reinforcement during motor learning, and suggests that successful reinforcement is tightly coupled to the development and expression of explicit control. We suggest that explicit control bears many similarities with model-based reinforcement, thus creating important questions regarding the link between model-based and model-free reinforcement systems during motor learning. At the very least, future studies investigating reinforcement during visuomotor adaptation should proceed with care in order to map which behaviour is the consequence of implicitly reinforced memories or explicit control.

Methods
Participants. 80 participants (20 males) aged 18-37 (M = 20.9 years) and 30 participants (11 males) aged 18-34 (M = 22.1 years) were recruited for experiment one and two, respectively, and pseudo-randomly assigned to a group after providing written informed consent. All participants were enrolled at the University of Birmingham. They were remunerated either with course credits or money (£7.5/hour). They were free of psychological, cognitive, motor or auditory impairment and were right-handed. The study was approved by and done in accordance with the local research ethics committee of the University of Birmingham.
General procedure. Participants were seated before a horizontal mirror reflecting a screen above (refresh rate 60 Hz) that displayed the workspace and their hand position (Fig. 1a), represented by a green cursor (diameter 0.3 cm). Hand position was tracked by a sensor taped on the right hand index of each participant and connected to a Polhemus 3SPACE Fastrak tracking device (Colchester, Vermont U.S.A.; sampling rate 120 Hz). Programs were run under MatLab (The Mathworks, Natwick, MA), with Psychophysics Toolbox 3 64 . Participants performed the reaching task on a flat surface under the mirror, with the reflection of the screen matching the surface plane. All movements were hidden from the participant's sight. When each trial started, participants entered a white starting box (1 cm width) on the centre of the workspace with the cursor, which triggered target appearance. Targets (diameter 0.5 cm) were 8 cm away from the starting position. Henceforth, the target position directly in front of the participant will be defined as the 0° position and other target positions will be expressed with this reference. Participants were instructed to perform a fast "swiping" movement through the target. Once they reached 8 cm away from the starting box, the cursor disappeared and a yellow dot (diameter 0.3 cm) indicated their end position. When returning to the starting box, a white circle displaying their radial distance appeared to guide them back.
Task design. Experiment 1: Feedback-instruction. For each trial, participants reached to a target located 45° counter-clock wise (CCW). Participants first performed a baseline block (60 trials) with veridical cursor feedback, followed by a 75 trials adaptation block in which a 20° CCW displacement was applied (Fig. 1b). In the following 2 blocks (100 trials each), participants either experienced the same perturbation with only BF, or with BF and VF. BF consisted of a pleasant sound selected based on each participant's preference from a series of 26 sounds before the task, unbeknownst of the final purpose. When participants' cursor reached less than 5° away from the centre of the target, the sound was played, indicating a hit; otherwise no sound was played, indicating a miss. For the BF group, no cursor feedback was provided, except for one "refresher" trial every 10 trials where VF was present. Participants in the VF group could see the cursor position during the outbound reach of the trial, along with the BF. Finally, participants went through 2 no-feedback blocks (100 trials each) with BF and VF completely removed. Before those blocks, participants were either told to "carry on" ("Maintain" group) or informed of the nature of the perturbation, and asked to stop using any explicit approach to account for it ("Remove" group). Therefore, we had four groups in a 2 × 2 factorial design (BF versus VF and Maintain versus Remove). Finally, if a trial's reaching movement duration was greater than 400 ms or less than 100 ms long, the starting box turned red or green, respectively, to ensure participants performed ballistic movements, and didn't make anticipatory movements. Participants who expressed a success rate inferior to 40% during asymptote blocks were excluded (BF-Remove N = 6; BF-Maintain N = 8). Although this exclusion rate was high, it was crucial to exclude participants who were unable to maintain asymptote performance in order to reliably measure retention. Experiment 2: Forced RT. In this experiment, participants were forced to perform the same reaching task at slow (SRT) or fast reaction times (FRT), the latter condition preventing explicit re-aiming by enforcing movement initiation before any mental rotation can be applied to the motor command 36,39 . A third group (Gradual) also performed the task with no RT constraints.
In the SRT/FRT groups, for each trial, entering the starting box with the cursor triggered a series of five 100 ms long pure tones (1 kHz) every 500 ms (Fig. 1c). Before the fifth tone, a target appeared at one of four possible locations equally dispatched across a span of 360° (0-90-180-270°). Participants were instructed to initiate their movement exactly on the fifth tone (Fig. 1c). Targets appeared 1000 ms (SRT) or 200 ms (FRT) before the beginning of the fifth tone. Movement initiations shorter than 130 ms are likely anticipatory movements 37 , and explicit control starts to be difficult to express under 300 ms 36,38 . Therefore, in both conditions, movements were successful if participants exited the starting box between 70 ms before the start of the fifth tone and the end of the fifth tone, that is, from 130 ms to 300 ms after target appearance in the FRT condition. If movements were initiated too early or too late, a message "too fast" or "too slow" was displayed and the cursor did not appear upon exiting the starting box. The trial was then reinitialised and a new target selected. Finally, if participants repeatedly missed movement initiation, making trial duration over 25 seconds, RT constraints were removed, to allow trial completion before cerebellar memory time-dependent decay 51,54,65 . Participants in the SRT and FRT groups were informed of the displacement and of the optimal policy to counter it, to ensure that any effect was related to expression, rather than development of explicit control. They were also instructed to attempt using the optimal policy as much as possible when sensible, but not at the expense of the secondary RT task, so as to preserve the pace of the experiment and prevent time-dependent memory decay.
To attain proficiency in the RT task, SRT and FRT participants performed a training block (pseudo-random order of VF and BF trials) of at least 96 trials, or until they could initiate movements on the fifth tone reliably (at the first attempt) at least for 75% of the previous 8 trials. All participants achieved this in 96 to 157 trials. Once this was achieved, participants first performed a 40 trials baseline (Fig. 1d), followed by introduction of a 20° CCW displacement for 260 trials. Participants then underwent a 200-trials asymptote block with only BF (1 "refresher" trial every 10 trials). The BF consisted of a green tick or a red cross if participants hit or missed the target, respectively. Visual (instead of audio) BF was used to avoid BF sounds from lining up with the tones, which could potentially confuse participants. The Gradual group underwent the same schedule, except that no tone or RT constraint were used, and the perturbation was introduced gradually from the 41 st to the 240 th trial (increment of 0.4°/trial) occurring independently for each target. This ensured participants experienced as few large errors as possible to prevent awareness of the perturbation and therefore explicit control. After the experiment, participants in the Gradual group were informed of the displacement, and subsequently asked if they noticed it. If they answered positively, they were asked to estimate the size of the displacement. Data analysis. All data and analysis code is available on our open science framework page (osf.io/hrgzq).
All analyses were performed in MatLab. We used Lilliefors test to assess whether data were parametric, and we compared groups using Kruskal-Wallis or Wilcoxon signed-rank tests when appropriate, as most data were non-parametric. Post-hoc tests were done using Tukey's procedure. As we analysed the data from experiment two twice ( Fig. 3 and 4), success rates and reach angles during asymptote were Bonferroni-corrected with corrected pvalues (multiplied by 2).
Learning rates were obtained by fitting an exponential function to adaptation block reach angle curves with a non-linear least-square method and maximum 1000 iterations (average R 2 = 0.86 ± 0.14 for feedback-instruction task and R 2 = 0.58 ± 0.26 for forced-RT task): where y is the hand direction for trial x, a is a scaling factor, b is the starting value and β is the learning rate. Reach angles were defined as angular error to target of the real hand position at the end of a movement. Trials were considered outliers and removed if movement duration was over 400 ms or less than 100 ms, end point reach angle was over 40° off target, and for the SRT and FRT groups in the forced-RT task, if failed initiation attempts continued for more than 25 sec. In total, outliers accounted for 3755 trials (8%) in the feedback-instruction task and 1013 trials (6%) in the forced-RT task. Even though 4 targets were used during the forced-RT task, trials were reset and a new random target was selected when participants failed to initiate movements on the 5 th tone. Therefore, all possible target positions would not be represented for each epoch, and epochs were consequently not used.