Reduced model-based decision-making in gambling disorder

Compulsive behaviors (e.g., addiction) can be viewed as an aberrant decision process where inflexible reactions automatically evoked by stimuli (habit) take control over decision making to the detriment of a more flexible (goal-oriented) behavioral learning system. These behaviors are thought to arise from learning algorithms known as “model-based” and “model-free” reinforcement learning. Gambling disorder, a form of addiction without the confound of neurotoxic effects of drugs, has been shown to involve impaired goal-directed control, but the way in which problem gamblers (PG) orchestrate model-based and model-free strategies has not been evaluated. Forty-nine PG and 33 healthy participants (CP) completed a two-step sequential choice task for which model-based and model-free learning have distinct and identifiable trial-by-trial learning signatures. The influence of common psychopathological comorbidities on those two forms of learning was investigated. PG showed impaired model-based learning, particularly after unrewarded outcomes. In addition, PG exhibited faster reaction times than CP following unrewarded decisions. Troubled mood, higher impulsivity (i.e., positive and negative urgency) and current and chronic stress reported via questionnaires did not account for those results. These findings demonstrate specific reinforcement learning and decision-making deficits in behavioral addiction that advance our understanding and may be important dimensions for designing effective interventions.

Modern theories of addictive behaviors are built on basic neural and cognitive decision mechanisms, and posit an imbalance between past-oriented habits (e.g., drinking alcohol automatically in a given context) and present- and future-oriented goals (e.g., limiting alcohol use), thus resulting in a lack of consideration for the consequences of one's actions [1][2][3]. Deficits in goal-directed learning and control (e.g., prepotent response inhibition, set-shifting) have been observed across a range of disorders characterized by compulsivity, such as addiction [4][5][6] and obsessive-compulsive disorder 7,8. The case of gambling disorder (GD) is of particular interest. Recently reclassified alongside substance use disorder 9, mainly because those syndromes share clinical (e.g., craving, escalation in use) and neurobiological (e.g., abnormal fronto-striatal network) characteristics 10,11, GD offers the opportunity to understand addiction without the potentially confounding neurotoxicity associated with acute or chronic use of psychoactive substances 12.
To save effort and energy 13,14, adaptive choice behavior relies on optimal orchestration between two forms of instrumental decision systems: the goal-directed system learns about the contingency between actions and outcomes and ensures that behavior is appropriate given our motivational state and/or desire for these outcomes, while the 'habitual' system enables actions that have been trained or 'stamped in' to the extent that these actions become stimulus- rather than goal-driven 15. The way in which those systems interact in healthy and psychopathological conditions has received considerable attention in recent years [16][17][18][19].
Whether compulsive behaviors are automatically driven by contextual elements without outcome expectations (i.e., habit) 20,21 or if they remain mainly goal-oriented [22][23][24] or both 25 is still debated 26 . While animal model studies

Results
Sample characteristics. Our initial sample consisted of 82 participants: 49 PG and 33 CP. After removal of 4 participants who did not meet the task-engagement criteria (see Data analyses), our final sample consisted of 78 participants: 45 PG and 33 CP. Table 1 depicts the demographic and clinical variables of PG and CP as well as between-group comparisons.
Analyses of choice behavior. The regression analysis (see Table 2 and Fig. 1) confirmed the basic signatures of MF and MB strategies, expressed as significant effects of both previous outcome (MF learning; β (SE) = 0.55 (0.06), p < 0.001) and the interaction between previous outcome and transition type (MB learning; β (SE) = 0.32 (0.06), p < 0.001). Moreover, the regression revealed that PG and CP did not appear to differ in their MF choice contributions, as evidenced by the absence of a significant two-way interaction between group and previous outcome (p = 0.67). Critically, we observed a significant three-way interaction between group (PG versus CP), previous outcome, and previous transition type (β (SE) = −0.12 (0.06), p < 0.05), indicating an attenuated MB learning signature in PG. Consistent with the MB strategy being the optimal reward-harvesting strategy in this task, the proportion of rewarded trials differed significantly between the two groups: the CP group was rewarded more often (mean reward rate: 57.31%) than the PG group (mean reward rate: 54.83%) (F(1,78) = 7.23, p < 0.01, ηp² = 0.09). Secondly, logistic regressions which separately examined previously rewarded and unrewarded trials (see Table 3) revealed that in both cases, the entire population expressed a basic MB effect (expressed as a significant main effect of transition; rewarded trials (β (SE) = −0.12 (0.05), p < 0.05); unrewarded trials (β (SE) = 0.51 (0.1), p < 0.001)). More importantly, this MB estimate was significantly lowered in PG only after a negative outcome, as shown by a significant negative group × previous transition interaction (β (SE) = 0.16 (0.05), p < 0.01) after a negative outcome but not after a positive outcome (β (SE) = −0.09 (0.1), p = 0.36).

Response time (RT) analyses. In the mixed ANOVA comparing the second step's response time according to the transition between both groups (see Fig. 2) [...] Based on the past finding that MB control is associated with slower reaction times than MF control 46 and because we found that PG had an MB deficit after an unrewarded trial, we also examined whether previous losses resulted in faster next-trial RTs in PG compared to CP. We used a second mixed ANOVA to analyze the effect of the previous trial's outcome on the first-choice response time in both groups. A significant main effect of the outcome was found [...]
Clinical analyses. To evaluate the impact on learning strategies of the clinical variables for which there was a difference between PG and CP (i.e., positive and negative urgency, depression, trait anxiety, chronic stress and psychiatric comorbidities), we ran several logistic regressions with the probability of staying with the previous first-step choice as the dependent variable, and the type of outcome, the transition type and the score on the target clinical questionnaire as independent variables. No significant interaction between any of the clinical variables and either reward type or transition × reward type was found (p > 0.05).

Discussion
The present study aimed to contribute to the understanding of impaired reinforcement learning mechanisms in behavioral addiction. Based on analyses of choices and reaction times, we found that PG rely less on MB RL predictions while making decisions on a two-step task, especially after an unrewarded trial. This finding sheds light on potentially important mechanisms involved in the inflexible behaviors found in individuals with GD, which are now considered in detail.
Attenuated MB learning signature based on choices was found in PG, with less consideration for transition types, thus leading to fewer rewards. This finding echoes the main idea that impaired MB RL strategy is strongly associated with a symptom dimension comprising compulsive behavior 19 . Further, our results dovetail well with previous studies employing different choice paradigms (e.g., the Fabulous Fruit Task, a reinforcer devaluation test) that found that individuals with drug addiction rely too much on habits instead of goal-directed choices 29 .
In support of the idea of impaired MB control in the clinical sample, we found that PG showed less slowing after rare transitions than CP, which likely reflects reduced MB control 48,49. Interestingly, the reduction in MB control in PG was particularly pronounced in choices that followed a negative outcome, compared to positive ones. Thus, whereas a negative outcome signaled to CP the need for additional cognitive control adjustment (MB control) to avoid further negative outcomes, PG failed to recruit these additional control mechanisms. This could occur for a number of possible reasons.
First, the novel finding we provide is that PG are more impulsive than their controls after a non-rewarded trial, as evidenced by faster decisions (expressed as first-stage choice RTs). This phenomenon is in line with previous work reporting that losses (or non-rewarded actions) affect choice by favoring impulsive actions in healthy participants on gambling tasks 50,51. Our study suggests that impulsive decisions enhance reliance on habits to the detriment of model-based control, possibly due to the lack of inhibition of the habit system in the context of frustration. Second, PG could be less sensitive to extinction, a phenomenon characterizing habit formation that can be due to reduced loss aversion 15, hypersensitivity to rewards, or incorrect identification of a statistically unlikely sequence of wins as a separate situation from more commonly experienced losses 52,53. In line with observed deficits of extinction learning in PG, recent studies suggest that GD could arise from an inflexible association between an action and its reward, even if its outcome is devalued [52][53][54].
Finally, although the illusion of control and uncontrolled cue-dependent relapse are common psychological explanations for behaviours observed in gambling addiction, the nature of the choice paradigm used here yields data too limited to address these possible explanations. Indeed, PG were not more likely than their controls to repeat the previous first-step choice after an unrewarded trial, independently of the transition type. Together, those findings support a specific MB deficit in the context of reward expectancy violation, a phenomenon putatively associated with a hyperdopaminergic state 41 that interferes with inhibition of the basal ganglia, for which D2 receptors are critical 55,56. Clearly, additional work is necessary to draw more robust conclusions on the neurocognitive determinants of actions following unrewarded trials, which the present work merely suggests. In addition, we found no association between any of the clinical variables discriminating the groups (chronic stress, state and trait anxiety, depression, negative and positive urgency) and the MB signature. This finding indicates that co-occurrence between PG and other psychopathological conditions is not the main reason why PG have goal-directed deficits.
Our findings hold some useful clinical implications. Interestingly, modest clinical outcomes (e.g., low remission rates) in the treatment of gambling disorder 57 could be due to the lack of consideration for the contribution of rudimentary stimulus-response associations to the addictive behavior, in favor of the idea that addiction mainly results from reinforced goal-directed actions (see the self-medication hypothesis) 24. Because MB RL and cognitive control both involve overcoming habitual, stimulus-driven actions 58, interventions aimed at improving executive functioning may positively impact the MB contribution. Specifically, electrical stimulation (i.e., tDCS) of the dorsolateral prefrontal cortex has been shown to impact a variety of deliberative functions including risk-taking 59, working memory 60 and classification learning 61. Stimulation of the left ventrolateral prefrontal cortex was shown to improve MB control and its weight in the decisional balance 62 (but see ref. 63 for negative results). Following this recent effort, further research is needed to test the influence of neurocognitive interventions on MB/MF RL in gambling disorder. In the same way, future studies may examine the usefulness of pharmacological interventions blocking D2/D3 receptors (e.g., amisulpride) to augment the relative contribution of the MB learning strategy after a negative outcome. This should be done with careful consideration for other cognitive functions modulated by dopamine, such as risk taking 64 and incentive value 65.
It is worth noting the potential limitations of this study. First, it is possible that the PG group's behavior is in part attributable to inaccurate expectancies about future events (e.g., the gambler's fallacy or hot hand fallacy) 66. Put differently, an inappropriate internal model of the environment's transition structure could have been responsible for the lack of consideration for the rarity of transitions, potentially contributing to both the RT and choice effects. False beliefs about probabilities (e.g., that consecutive losses necessarily lead to a larger monetary gain, or that several wins in a row increase the probability of winning later) might lead to suboptimal, yet goal-directed, strategies and, without fully probing the beliefs participants hold while performing the task 26,67, this explanation cannot be entirely dismissed. It is therefore possible that decisions considered as habit-like actually result from goal-directed strategies. However, we failed to observe a "hot hand" effect (i.e., the expectation to win after a win) that would have caused faster choice RTs after rewarded trials in PG, in comparison to non-gamblers. Besides, the gambler's fallacy is more likely after longer runs of losses or wins 68.
Another potential limitation is that the two-step task does not incentivize participants to use MB control, but instead decouples winnings from the subjects' choice strategy so as to avoid these variables potentially confounding one another. Interestingly, a recent study reported that MB control can be reliably improved with the provision of larger incentives (e.g., higher stakes) in individuals with several psychiatric conditions 69. The observed boost in model-based control with larger incentives is thought to result from a cost-benefit analysis, that is, higher potential payoffs justify more effortful decision-making processes (i.e., more model-based control) 58,69,70. It is worth testing whether the PG deficit in MB RL can be ameliorated in this manner, since a higher sensation seeking trait is both a prominent feature in this population 71 and a factor associated with greater boosts in MB control in non-clinical participants (83). However, it should be noted that we offered participants 30 euros plus up to 10 euros depending on their net performance, which can be considered a strong incentive compared to other similar studies.
Finally, the influence of impaired MB learning in the pathogenesis of gambling addiction remains largely unknown. Unlike drug-taking behaviors that may cause profound disruption in learning systems 11, gambling behaviors offer room to study addiction without the confounding effects of neurotoxicity associated with acute and chronic use of chemical substances 10. Clearly, in the absence of a longitudinal research design, this question cannot be firmly decided. However, a recent preclinical study suggested that individual differences in model-free learning prior to drug use predicted methamphetamine self-administration 72.
To summarize, we found deficits in learning and decision making in problem gamblers, characterized by reduced MB action control after a negative outcome. This finding highlights the importance of decision deficits not directly attributable to the neurotoxic effects of chronic drug use.

Participants. Forty-nine individuals with gambling disorder, named problem gamblers (PG), who took part in games involving little skill (i.e., slot machines, video poker, dice and pull tabs), and 33 controls (CP) matched for age and educational level were recruited. All participants were recruited through advertisement and gave written informed consent to be part of the experiment. The experiment was approved by the C.H.U. Brugmann Ethics Committee (n° OM 026) and was performed according to the Declaration of Helsinki.
All participants underwent a semi-structured interview 73. All PG met the DSM-5 criteria 47 for gambling disorder (range: 3-9) and had a minimum score of 8 on the Canadian Problem Gambling Index (CPGI) 74 (range: 8-27). All PG were active gamblers, and none was receiving therapy or treatment. Healthy control subjects had a score of 0 on the CPGI. The exclusion criteria for all participants were the presence of psychotic or neurologic syndromes, a history of substance addiction and recent use of psychopharmacological substances likely to alter cognitive functioning.
Scientific Reports (2019) 9:19625 | https://doi.org/10.1038/s41598-019-56161-z
The participants' remuneration was set at 30€ and they were told that they could win up to 10€ more depending on their net performance in the two-step decision task (RL task).
Questionnaires, experimental tasks and procedure. At the end of the experiment, each participant performed the operation span (OSPAN) task 75 and filled out clinical questionnaires to estimate substance use, psychological problems and symptoms of psychopathology, current negative emotions, anxiety, depression, stress, impulsivity and craving for gambling. Alcohol use was estimated with the Alcohol Use Disorders Identification Test 76,77 and nicotine dependence severity with the Fagerström Test for Nicotine Dependence 78. Psychopathological symptoms were investigated using the total score of the Symptom Checklist-90-Revised (SCL-90-R) 79. Negative emotions, depression and anxiety were evaluated with the negative scale of the Positive and Negative Affect Schedule 80, the short version of the Beck Depression Inventory (BDI) 81 and the State-Trait Anxiety Inventory (STAI-YA and STAI-YB) 82, respectively. To measure chronic and current stress, the Social Readjustment Rating Scale (SRRS) 83 and visual analogue scales (range: 0-10) were administered. Several facets of impulsivity (i.e., negative urgency, positive urgency, lack of premeditation, lack of perseverance and sensation seeking) were evaluated with the short version of the UPPS Impulsive Behavior Scale 84.
The entire experimental procedure lasted between 1 h 30 min and 2 h and took place individually with two experienced and well-trained neuropsychologists in a quiet room. Upon their arrival, the participants signed an informed consent form and filled out a questionnaire about gambling behaviors (CPGI). Prior to the RL task, two visual analogue scales (VAS) (i.e., 'how much do you want to gamble right now?' and 'how much do you feel stressed right now?') were administered. Right after the task, a second series of VAS was given, followed by the remaining clinical questionnaires.
Two-step decision-making task. Participants performed 200 trials of the two-step decision-making task 43.
This task was divided into two stages (see Fig. 4A). At the beginning of the first stage, two fractal images were presented on a black screen, between which the participant had to choose. Each first-stage image led commonly (70%) to one of the two second-stage states and rarely (30%) to the other. During the second stage, two images were presented on a green or a blue screen (representing the second-stage 'state'), between which the participant had to choose. Each image led probabilistically to a reward or no reward, signaled by visual feedback showing a 10c coin or a 0 during the 1-second feedback interval. In order to ensure continual learning and exploration during the task, the reward probability of each second-stage image varied slowly during the task according to Gaussian random walks (SD = 0.025). Participants had 3 seconds to make each choice, and the inter-stage and inter-trial intervals both lasted 1 second.
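The task structure described above can be illustrated with a minimal simulation sketch. The 70%/30% transition probabilities, the 200 trials and the random-walk SD of 0.025 come from the text; the reflecting bounds on the reward probabilities, their starting values, the randomly choosing agent and all function names are assumptions introduced for illustration only and are not taken from the study's own task code.

```python
import random

COMMON_P = 0.70            # common-transition probability (from the text)
WALK_SD = 0.025            # SD of the Gaussian random walk (from the text)
P_MIN, P_MAX = 0.25, 0.75  # ASSUMED reflecting bounds; not stated in the text


def reflect(p, lo=P_MIN, hi=P_MAX):
    """Fold a drifting probability back inside [lo, hi]."""
    while p < lo or p > hi:
        p = 2 * lo - p if p < lo else 2 * hi - p
    return p


def simulate(n_trials=200, seed=0):
    """Simulate a hypothetical, randomly choosing agent on the two-step task."""
    rng = random.Random(seed)
    # reward_probs[state][image]; ASSUMED to start uniformly within the bounds
    reward_probs = [[rng.uniform(P_MIN, P_MAX) for _ in range(2)] for _ in range(2)]
    trials = []
    for _ in range(n_trials):
        choice1 = rng.randrange(2)                  # first-stage image (0 or 1)
        common = rng.random() < COMMON_P            # common or rare transition
        state = choice1 if common else 1 - choice1  # image i commonly leads to state i
        choice2 = rng.randrange(2)                  # second-stage image (0 or 1)
        p_reward = reward_probs[state][choice2]
        reward = int(rng.random() < p_reward)
        trials.append({"choice1": choice1, "common": common, "state": state,
                       "choice2": choice2, "p_reward": p_reward, "reward": reward})
        # All four second-stage reward probabilities drift after every trial
        for s in range(2):
            for i in range(2):
                reward_probs[s][i] = reflect(reward_probs[s][i] + rng.gauss(0.0, WALK_SD))
    return trials
```

Because the reward probabilities keep drifting, no second-stage image stays best for long, which is what keeps learning and exploration ongoing throughout the session.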
Prior to the task, participants were given extensive instructions about the task's structure 19. They were instructed that the first choice would preferentially lead towards a blue or a green screen, each one associated with different probabilities of winning, and that their choice at the second screen would depend on their choice on the first screen. It was stressed that transition probabilities between the first and the second stage would be constant while the probabilities of winning at the second stage could vary over time. Participants then completed a tutorial and had to provide correct responses to a quiz including three questions about the task's structure 19. In case of an incorrect response to any of them, the explanation phase took place again. They sat in front of a laptop with an AZERTY keyboard. The letter 'E' was assigned to the left image and the letter 'I' to the right image.
Several measures were considered: the outcome of each second-stage choice (reward or not), transition type (common or rare), the response times to rewarded or unrewarded trials on common or rare transitions, and the probabilities of making two consecutive identical first-stage choices according to the type of transition and reward (termed p(stay)). A pure model-free strategy predicts purely reinforcement-guided choices: a repetition of the previous trial's first-stage choice only when it was rewarded, and a shift when the previous trial was not rewarded. A pure MB strategy takes the task structure and transition type into consideration and predicts a repetition of the previous trial's first-stage choice only if it was rewarded following a common transition or not rewarded following a rare transition (see Fig. 4D).
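As an illustration of the p(stay) measure and of the two idealized strategies, the following sketch computes stay probabilities for the four combinations of previous outcome and previous transition type, and encodes the pure MF and pure MB predictions as simple rules. The trial layout (dictionaries with `choice1`, `common` and `reward` keys) and the function names are hypothetical conventions chosen for this example.

```python
from collections import defaultdict


def stay_probabilities(trials):
    """Return p(stay) keyed by (previous trial rewarded, previous transition common).

    `trials` is a list of dicts in temporal order with keys
    'choice1' (0/1), 'common' (bool) and 'reward' (0/1).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [number of stays, number of trials]
    for prev, cur in zip(trials, trials[1:]):
        key = (bool(prev["reward"]), bool(prev["common"]))
        counts[key][0] += int(cur["choice1"] == prev["choice1"])
        counts[key][1] += 1
    return {key: stays / total for key, (stays, total) in counts.items()}


def pure_mf_stays(prev):
    """Pure model-free rule: repeat the first-stage choice only after a reward."""
    return bool(prev["reward"])


def pure_mb_stays(prev):
    """Pure model-based rule: repeat after rewarded-common or unrewarded-rare trials."""
    return bool(prev["reward"]) == bool(prev["common"])
```

Real participants typically show a mixture of both patterns, so the four empirical stay probabilities fall somewhere between these two idealized profiles.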
Data analyses. All analyses were performed using IBM SPSS Statistics v25 and RStudio Version 1.1.456.
To ensure that participants' data reflected a sufficient level of engagement with the task, in the same way as a previous study 85, those who repeated previously rewarded second-step responses at a rate of less than 50%, those who did not answer before the deadline more than 20 times, and those who did not try every image in each stage were removed from the data analyses. This resulted in the removal of 4 subjects. Groups were compared on each clinical variable (e.g., depression, anxiety, impulsivity, stress) by using t-tests or non-parametric tests, where appropriate.
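The three engagement criteria can be sketched as a single filter. The thresholds (50% repetition rate, more than 20 missed responses) come from the text; everything else is an assumption made for illustration: the trial layout, the use of `None` to mark a missed response, and in particular the reading of "repeating a previously rewarded second-step response" as choosing the same image on the next visit to the same second-stage state.

```python
def is_engaged(trials, max_missed=20, min_repeat_rate=0.5):
    """Return True if a participant passes all three engagement criteria.

    `trials` is a list of dicts with keys 'choice1', 'state', 'choice2'
    (each 0/1, or None when the response deadline was missed) and 'reward' (0/1).
    """
    def answered(t):
        return t["choice1"] is not None and t["choice2"] is not None

    # Criterion 1: no more than `max_missed` trials without a timely response
    if sum(1 for t in trials if not answered(t)) > max_missed:
        return False
    valid = [t for t in trials if answered(t)]

    # Criterion 2: every image was tried at least once in each stage
    if {t["choice1"] for t in valid} != {0, 1}:
        return False
    if {(t["state"], t["choice2"]) for t in valid} != {(0, 0), (0, 1), (1, 0), (1, 1)}:
        return False

    # Criterion 3 (ASSUMED definition): after a rewarded second-stage choice,
    # the same image is chosen on the next visit to that state at least half the time
    last_visit = {}  # state -> (choice2, reward) on the most recent visit
    repeats = opportunities = 0
    for t in valid:
        prev = last_visit.get(t["state"])
        if prev is not None and prev[1] == 1:
            opportunities += 1
            repeats += int(t["choice2"] == prev[0])
        last_visit[t["state"]] = (t["choice2"], t["reward"])
    if opportunities and repeats / opportunities < min_repeat_rate:
        return False
    return True
```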
A mixed logistic regression was carried out to analyze the influence of group (PG, CP), previous transition type (common, rare) and previous outcome (reward, no reward) on the probability of maintaining the previous trial's first-step choice (stay, switch). As MB and MF learning predict distinct patterns of first-stage repetition as a function of the previous trial's events (reward and transition type), this analysis allowed for a quantitative evaluation of their contribution to trial-by-trial learning. A pure MF strategy renders the first-stage choice dependent only on the previous trial's outcome, independently of the previous trial's transition type, thus predicting only a main effect of the outcome. On the other hand, a pure MB strategy predicts an interaction between the outcome and the transition type 85. Secondly, in order to test our hypothesis that PG had a more pronounced MB impairment after unrewarded trials, we performed two more logistic regressions, separately examining trials following a reward and trials following the absence of reward.
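The logic of reading MF learning from the outcome main effect and MB learning from the outcome × transition interaction can be made concrete with a simplified, fixed-effects sketch for a single data set. With effect coding (+1/−1) and a saturated 2 × 2 logistic model, the coefficients reduce to contrasts of the four cell log-odds, so no iterative fitting is needed. This is an illustration only, not the mixed model reported here: it omits the group factor and the random effects, it assumes effect coding (which the text does not specify), and it requires every stay probability to lie strictly between 0 and 1.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def saturated_logistic_coefficients(p_stay):
    """Coefficients of logit p(stay) = b0 + b1*outcome + b2*transition + b3*outcome*transition.

    `p_stay` maps (rewarded, common) -> p(stay); outcome is coded +1 (reward) / -1
    (no reward) and transition +1 (common) / -1 (rare). These codings are assumptions.
    """
    rc = logit(p_stay[(True, True)])    # rewarded, common
    rr = logit(p_stay[(True, False)])   # rewarded, rare
    uc = logit(p_stay[(False, True)])   # unrewarded, common
    ur = logit(p_stay[(False, False)])  # unrewarded, rare
    return {
        "intercept": (rc + rr + uc + ur) / 4,
        "outcome": (rc + rr - uc - ur) / 4,      # MF signature
        "transition": (rc - rr + uc - ur) / 4,
        "interaction": (rc - rr - uc + ur) / 4,  # MB signature
    }
```

A profile in which staying depends only on reward yields a positive `outcome` term and a zero `interaction` term, whereas a profile in which staying follows rewarded-common and unrewarded-rare trials yields the reverse.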
To further assess decision strategies based on reaction times, a mixed ANOVA with the current trial's transition type (rare, common) as within-subject factor and group (PG, CP) as between-subject factor was performed on the second-stage response time. Indeed, the difference between second-stage RTs after common versus rare transitions reflects the level of involvement of MB control 48,49.
To examine the influence of clinical status other than gambling disorder on decisional strategy in the two-step task, each clinical variable that discriminated the two groups was added separately to the mixed logistic regression.
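The RT measure entering this analysis can be illustrated with a short sketch that computes, for one participant, the slowing of second-stage responses after rare relative to common transitions. The trial layout and the `rt2` key are hypothetical; the ANOVA itself (run in SPSS or R in this study) is not reproduced here.

```python
from statistics import mean


def rare_minus_common_rt(trials):
    """Mean second-stage RT after rare transitions minus mean RT after common ones.

    `trials` is a list of dicts with keys 'common' (bool) and 'rt2'
    (second-stage response time, in seconds). Larger positive values indicate
    more slowing after rare transitions.
    """
    rare = [t["rt2"] for t in trials if not t["common"]]
    common = [t["rt2"] for t in trials if t["common"]]
    return mean(rare) - mean(common)
```

Comparing this per-participant difference between groups corresponds to the transition × group interaction in the mixed ANOVA.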

Data availability
All data will be made available on the following lab website: http://psymed.ulb.be/.