Behavioral flexibility is a cognitive function that refers to the ability to dynamically adjust behavior to a changing environment1. A lack of behavioral flexibility results in rigid behaviors, echoing pathological conditions observed in OCD patients who are resistant to change. Thus, behavioral flexibility impairments have been proposed as one of the causes of compulsive behaviors2. However, there is no consensus on whether such a deficit exists in OCD, as inconsistencies were found between studies using reversal learning3,4, task switching5,6, or intra/extra-dimensional set shifting7,8 paradigms. Beyond methodological considerations such as small sample sizes9,10, the clinical heterogeneity of OCD patients may have contributed to these discrepant results11,12,13,14,15. In contrast, dysfunctional activations of OCD patients’ prefrontal regions, in particular the orbitofrontal cortex (OFC), during performance of flexibility tasks have been more consistently reported16,17. Similar neurobiological observations have recently been made in the Sapap3 knock-out mutant mice (Sapap3 KO), currently the predominant genetic model of compulsive-like behavior18. These genetically engineered mice lack the SAP90/PSD95-associated protein 3, a postsynaptic scaffolding protein mainly expressed in the striatum19. This mutation results in the expression of excessive grooming behaviors, which can be defined as compulsive-like given the associated neurophysiological impairments of the prefronto-striatal circuits18,20, including the OFC21, and the reduction of grooming behavior after chronic administration of fluoxetine, a first-line treatment for OCD patients18. Regarding behavioral flexibility, two recent studies22,23 have challenged this model in a spatial reversal learning task and found that Sapap3 KO mice showed impaired performance compared to controls after a reversal event, although the type of deficit differed, with increased perseveration found in one study23 but not the other22. 
Interestingly, both studies found that these deficits were not correlated with the severity of compulsive-like grooming, suggesting that compulsivity and flexibility dimensions may be distinctly affected in this model. Moreover, one of the studies identified a lack of flexibility only in a subgroup of Sapap3 KO mice23. This result highlights inherent model heterogeneity, which echoes the clinical heterogeneity observed in OCD patients.

To improve the comparison of human and animal results, similar experimental procedures across species are indispensable in order to ensure comparable task parameters and psychometric properties24,25, and, hence, comparable results26. In the case of reversal learning tasks, it has been demonstrated that the neurobiological processes may differ according to sensory modality, which limits the translational value of approaches using different stimulus modalities across species27. Hence, proper data transposition from animal models to humans and vice versa in tasks assessing behavioral flexibility is hindered28,29 by the fact that flexibility assessment in animal models mostly relies on spatial discrimination tasks22,23,30,31,32,33, while in humans visual discrimination tasks are most commonly applied.

Therefore, in order to study the involvement of behavioral flexibility in compulsive behaviors in both OCD patients and the Sapap3 KO mice, we have developed an innovative, high throughput behavioral setup for mice that allows us to reliably test individual subjects in a non-spatial visual reversal learning task through multiple reversal blocks, as it is commonly performed in human studies. We furthermore ensured the correct interpretation of our results by recruiting large samples of well characterized and selected subjects in both species, thereby enabling us to investigate intra-group variability in our analyses.


Compulsiveness is unrelated to behavioral flexibility as assessed by the reversal learning paradigm

In both species, we applied a similarly designed reversal learning task to assess behavioral flexibility (Fig. 1; see Supplementary Discussion for the underlying rationale). We observed that the performance profile after a reversal event was similar between compulsive subjects and their controls in the two species (BFInclusion < 1 for group factor, Fig. 2a and Supplementary Tables S1 and S2). The number of trials needed to reach reversal criterion (Fig. 2b, top) did not differ between compulsive and control groups, either for human subjects (BF10 = 0.64, d = 0.27 [0.04 0.77]) or for mice (BF10 = 0.7, d = 0.33 [−0.1 0.92]). Similarly, no significant group differences were found in the number of reversal errors (Fig. 2b, bottom), either in humans (BF10 = 1.5, d = 0.35 [0.07 0.88]) or in mice (BF10 = 0.44, d = 0.25 [−0.2 0.83]).

Fig. 1: Experimental design and apparatus of the reversal learning task.

a Illustration of the behavioral apparatus. Up to 8 operant conditioning chambers run in parallel, with mice living and working with minimal human intervention. Each operant chamber was equipped with capacitive touch-sensitive screens (left), a pellet and water dispenser (right), and LED lights (top). b Example of one WT mouse performance (smoothed over a 40-trial sliding window) across five reversals. c Design of the human (left) and the mouse (right) versions of the reversal learning task. On each trial, the subject had to make a choice between two different stimuli displayed on the screens. Depending on their choice, positive (for correct response) or negative (for incorrect response) feedback was provided. When the subject had learned the correct association, the reward contingencies were reversed without notice (see “Methods” section for details).

Fig. 2: Compulsivity is not related to behavioral flexibility.

a Changes in correct response probability around a reversal. Top: for humans with 10 trials around the reversal. Red line: OCD patients. Green line: healthy subjects. Bottom: for mice with 100 trials around the reversal. Red line: Sapap3 KO mice. Green line: WT mice. The data were smoothed using the Savitzky–Golay algorithm for both species. b No difference was found between groups for either humans (left, n = 40 per group) or mice (right, n = 26 per group), whether considering the number of trials needed to reach the reversal criterion (top) or the number of reversal errors (bottom). Triangle: group mean. Dot: individual mean. Ø: BF10 < 1. c Top: In OCD patients, the disease severity assessed by the Y-BOCS does not predict the number of trials needed to reach the reversal criterion. Dark line: linear fit. Gray area: confidence interval. Bottom: In Sapap3 KO mice, compulsive grooming severity assessed by the number of grooming bouts initiated (over a 10-min period) does not predict the number of trials needed to reach the reversal criterion. Dark line: linear fit. Gray area: confidence interval.

The comparison of other behavioral parameters, such as spontaneous strategy change (SSC) probability and SSC errors (see definition in “Methods” section), supports this lack of difference between groups for both species (Table 1).

Table 1 Humans and mice behavioral parameters in the reversal learning task.

In OCD patients, correlation analysis showed that disease severity and task performance were not related (Fig. 2c, top and Table 2). Likewise, we found no correlation in mice between grooming level and the main behavioral parameters (Fig. 2c, bottom and Table 2).

Table 2 Correlation scores between clinical and task parameters.

Distinct subgroups of OCD patients and Sapap3 KO mice exhibit a behavioral flexibility deficit

In OCD patients (n = 40), depression/anxiety levels and antidepressants did not influence task performance (Table 2 and see Supplementary Notes for more results relative to medication). When we assessed the effect of symptom subtypes (such as “checking”, “washing”, “hoarding”, etc…) on task performance, only severity of the “checking” subtype was positively correlated to an increased number of trials needed to reach the reversal criterion (Fig. 3a and Table 2). We thus conducted another analysis by separating a subgroup of twenty-one OCD “checkers” with predominantly checking symptoms from the other OCD patients. The three resulting distinct groups (n = 21 OCD “checkers”, n = 19 OCD “non-checkers” and n = 40 healthy subjects) were not different in terms of demographic characteristics (Table 3). In terms of clinical characteristics, the group of OCD “checkers” showed a higher rate of comorbid anxiety disorder compared to the two other groups (Table 3).

Fig. 3: Only a subgroup of OCD patients and Sapap3 KO mice needed a higher number of trials to reach the reversal criterion.

a Only the severity of the checking symptoms (measured by the OCI-R checking subscore) predicts the number of trials needed to reach the reversal criterion. The higher the checking symptoms severity, the higher the number of trials. n = 40 OCD patients. Dark line: linear fit. Gray area: confidence interval. b 21 OCD patients with predominant checking symptoms (“checkers” subgroup, red) were segregated from the others (19 “non-checkers” and 40 healthy controls subgroups, pink and green, respectively). Only OCD “checkers” patients were impaired in terms of number of trials needed to reach the reversal criterion. ncOCD: “non-checkers”. cOCD: “checkers”. Triangle: group mean. Dot: individual mean. c A two-step cluster analysis using four behavioral parameters (number of trials to reversal, reversal errors, SSC probability and SSC errors) found two distinct clusters within the Sapap3 KO mice and was confirmed by a stepwise discriminant analysis. The intersection point of the lines indicates the group’s centroid. d Only the “impaired” KO mice (n = 12, red) needed more trials to reach the reversal criterion compared to the other “unimpaired” KO (n = 14, pink) and WT mice (n = 26, green). uKO: “unimpaired” KO mice. iKO: “impaired” KO mice. Triangle: group mean. Dot: individual mean. Ø: BF10 < 1. *BF10 < 3. *BF10 ≥ 3. **BF10 ≥ 10. ***BF10 ≥ 30. ****BF10 ≥ 100.

Table 3 Demographic and clinical characteristics of the samples.

As with OCD patients, we sought to identify similar heterogeneity in the Sapap3 KO mice (n = 26). Indeed, some innovative studies have found evidence for inter-individual variability of cognitive traits in animal models34,35, even in Sapap3 KO mice23, pointing out the necessity to consider this variability in animal studies. Thus, we performed a two-step cluster analysis which identified two clusters within the Sapap3 KO mice (silhouette measure = 0.6, BIC2 clusters = 92.82; compared to BIC1 cluster = 96.13 and BIC3 clusters = 109.66). The resulting ∆BIC of −3.31 indicated positive evidence in favor of this clustering. The same procedure was applied to WT controls (n = 26) without the detection of separate clusters. Among the different variables used to perform the analysis favouring the two-cluster solution, the SSC probability was identified as the most important variable, followed by the reversal error proportion, the number of trials needed to reach reversal criterion and the SSC perseverative errors (importance values = 1, 0.79, 0.42, and 0.19, respectively). Three distinct groups resulted from this clustering procedure: the WT mice (n = 26); the “unimpaired” Sapap3 KO mice (n = 14), defined as overlapping with WT; and the “impaired” Sapap3 KO mice (n = 12), defined as distant from the WT (Fig. 3c). We confirmed our clustering analysis by performing a stepwise discriminant analysis: overall, 71.2% of mice were correctly labeled, with 83.3% of the “impaired” KO mice correctly classified (16.7% were classified as WT) and 50% of the “unimpaired” KO mice correctly classified (50% were classified as WT). These results were consistent with those of the two-step cluster analysis: the “unimpaired” Sapap3 KO mice cluster was closer to the WT mice cluster (Fig. 3c), with a moderate agreement between the two analyses (κ = 0.53, 95% CI: [0.32 0.73], p < .0005). 
The two Sapap3 KO subgroups, which resulted from the two-step cluster analysis and which were confirmed via a stepwise discriminant analysis, were similar in terms of weight and grooming level (Table 3), showed comparable locomotor activity and task engagement (Supplementary Fig. S1), and had no identified genealogical difference (Supplementary Fig. S2). Notably, we conducted the same clustering procedure on the human data, with results comparable to those observed in mice. Two clusters were found in OCD patients, with an impaired subgroup of 7 patients, 6 of them being checkers, and only one cluster was found for healthy controls (see Supplementary Notes for details). These results validated the relevance of using the clinical dimension of “checking” symptoms as a subgroup splitting factor.
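The model comparison behind the two-cluster solution can be retraced with simple arithmetic. The sketch below is illustrative only: it uses the three BIC values reported above, while the function names and the evidence labels (a Kass-Raftery-style reading of BIC differences) are assumptions introduced here and are not taken from the study's analysis pipeline.

```python
# BIC values reported in the text for the Sapap3 KO two-step cluster analysis.
bic_by_n_clusters = {1: 96.13, 2: 92.82, 3: 109.66}


def delta_bic(bic, k, reference=1):
    """BIC of the k-cluster solution minus BIC of the reference solution."""
    return bic[k] - bic[reference]


def evidence_label(delta):
    """Assumed Kass-Raftery-style label for the magnitude of a BIC difference."""
    magnitude = abs(delta)
    if magnitude < 2:
        return "weak"
    if magnitude < 6:
        return "positive"
    if magnitude < 10:
        return "strong"
    return "very strong"


# The preferred solution is the one with the lowest BIC.
best_k = min(bic_by_n_clusters, key=bic_by_n_clusters.get)
d21 = delta_bic(bic_by_n_clusters, 2)
print(best_k, round(d21, 2), evidence_label(d21))
```

Under these assumed thresholds, the two-cluster solution has the lowest BIC and its difference from the one-cluster solution falls in the "positive" band, matching the wording used above.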

To assess behavioral flexibility in these different subgroups of both species, we systematically compared their performance in terms of the number of trials to reach reversal criterion. We detected a difference in the number of trials to reach criterion between OCD “checkers”, OCD “non-checkers” and healthy controls (BF10 = 3.98, η2 = 0.11, Fig. 3b, Table 1). Indeed, a post-hoc analysis revealed that OCD “checkers” needed more trials than both OCD “non-checkers” (BF+0 = 4.66, d = 0.74 [0.08 1.25]) and healthy controls (BF+0 = 9.32, d = 0.67 [0.14 1.17]). We detected no difference between the latter two groups (BF10 = 0.28, d = 0.01 [−0.5 0.5]). No comorbid anxiety disorder or gender effect was found in humans (Supplementary Tables S3 and S4). Similarly, we found a difference in the number of trials to reach reversal criterion between “impaired” Sapap3 KO, “unimpaired” Sapap3 KO mice and WT controls (BF10 > 100, η2 = 0.35, Fig. 3d). Indeed, in the corresponding post-hoc analysis, “impaired” KO mice needed more trials than WTs (BF10 > 100, d = 1.37 [0.54 2.17]), and we detected no difference between “unimpaired” KO mice and WTs (BF10 = 0.43, d = 0.29 [−0.35 0.82]).

Reversal learning deficit is explained by higher response lability rather than perseveration

Considering that only checking symptoms were associated with a reversal learning impairment in OCD patients, we investigated which behavioral trait could explain this deficit. We first observed that the number of reversal errors did not correlate with severity of checking symptoms, suggesting that OCD “checkers” did not express greater perseverative behaviors than healthy subjects (Table 2). This was confirmed by the absence of a group effect on the number of reversal errors when comparing the OCD “checkers”, OCD “non-checkers” and healthy control subgroups (BF10 = 0.72, η2 = 0.06, Fig. 4a, left). In mice, a group effect on the proportion of reversal errors (BF10 > 100, η2 = 0.34, Fig. 4a, right) was shown, but we found that “impaired” Sapap3 KO mice made fewer reversal errors than WTs (BF10 = 53.84, d = 1.51 [0.35 1.93], Fig. 4a, right). These results suggested that behavioral flexibility deficits observed in subgroups of compulsive subjects were not explained by greater perseveration. On the contrary, we observed a positive correlation between the severity of “checking” symptoms and the probability of spontaneous strategy change (SSC), i.e., changing one’s response despite positive feedback (Table 2). These results suggested that OCD “checkers” had a high response lability, as identified through an elevated SSC probability. The subgroup analysis in both species supported this result, with differences in SSC probability observed between the three subgroups in both humans (BF10 = 1.19, η2 = 0.08, Fig. 4b, left) and mice (BF10 > 100, η2 = 0.41, Fig. 4b, right). OCD “checkers” had greater SSC probability than both healthy controls (BF+0 = 3.51, d = 0.53 [0.08 1.02]) and OCD “non-checkers” (BF+0 = 2.22, d = 0.59 [0.06 1.1]). The healthy controls and OCD “non-checker” groups did not differ from each other (BF10 = 0.29, d = 0.09 [−0.41 0.57]) and no comorbid anxiety disorder or gender effect was found (Supplementary Tables S5 and S6). 
The same results were obtained in mice, with “impaired” KO mice showing greater SSC probability than WTs (BF10 = 24.64, d = 1.22 [0.33 1.82]), and “unimpaired” KO mice showing lower SSC probability than WTs (BF10 = 5.24, d = 0.91 [0.12 1.45]). Along the same lines, WTs perseverated more after SSC, with more SSC errors than “impaired” KO mice (BF10 > 100, d = 1.61 [0.62 2.24]) (Table 1). Importantly, these differences in response lability (expressed here as an SSC probability) were observed only in a reversal context and not during the acquisition phase, both in OCD “checkers” and “impaired” KO mice (see Supplementary Notes).
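To make the response lability measure concrete, the sketch below computes a simplified SSC probability from a trial sequence, following the verbal description given above (a change of response despite positive feedback). It is a hypothetical illustration: the function name, the input format and the choice of denominator (all positively reinforced trials that have a following trial) are assumptions, and the exact definition in the “Methods” section may impose further conditions.

```python
def ssc_probability(choices, rewarded):
    """Fraction of positively reinforced trials followed by a change of response.

    choices  : sequence of chosen stimulus labels, one per trial
    rewarded : sequence of booleans, True when the trial received positive feedback
    """
    if len(choices) != len(rewarded):
        raise ValueError("choices and rewarded must have the same length")
    opportunities = 0
    switches = 0
    # The last trial has no following trial, so it cannot contribute.
    for t in range(len(choices) - 1):
        if rewarded[t]:
            opportunities += 1
            if choices[t + 1] != choices[t]:
                switches += 1
    return switches / opportunities if opportunities else float("nan")


# Toy sequence (hypothetical data): two switches after four rewarded trials.
example_choices = ["A", "A", "B", "A", "A", "B"]
example_rewarded = [True, True, False, True, True, False]
p_ssc = ssc_probability(example_choices, example_rewarded)
print(p_ssc)
```

In this toy sequence the subject abandons a just-rewarded response on two of four occasions, which is the kind of pattern the text describes as elevated response lability.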

Fig. 4: An excessive response lability underlies the reversal learning deficit.

a Deficits of behavioral flexibility found in both OCD “checkers” patients (n = 21) and “impaired” KO mice (n = 12) subgroups were not explained by a greater perseveration (in terms of reversal errors). b Instead, both OCD “checkers” and “impaired” KO mice showed an increased SSC probability compared to other groups (n = 19 “non-checkers”, 40 healthy controls, 14 “unimpaired” KO and 26 WT mice), suggesting a higher response lability. ncOCD: “non-checkers”. cOCD: “checkers”. uKO: “unimpaired” KO mice. iKO: “impaired” KO mice. Triangle: group mean. Dot: individual mean. Ø: BF10 < 1. *BF10 < 3. *BF10 ≥ 3. **BF10 ≥ 10. ***BF10 ≥ 30. ****BF10 ≥ 100.


This cross-species study assessed the role of behavioral flexibility in compulsive behaviors through a reversal learning task conducted in both humans and mice. We showed that in both species compulsive subjects do not form a homogeneous group. Taken as a whole, neither the human nor the rodent compulsive groups showed differences in task performance compared to their controls. Thus, the severity of compulsive behavior per se was not a predictor of performance in our reversal learning task. In contrast, when heterogeneity within groups was taken into account, we identified in both species a subgroup with a strong behavioral flexibility deficit in our task. Importantly, this deficit was independent of compulsive behavior severity but rather linked to checking symptoms in patients. In addition, we found in both species that, contrary to what we would expect, the deficit of behavioral flexibility observed in some subgroups was not underpinned by excessive perseverative behavior after reversal but rather by greater response lability. Taken together, our results from a cross-species perspective do not support a link between compulsion and behavioral flexibility. Instead, they suggest that another dimension, excessive response lability, found in subgroups of compulsive subjects, has an effect on behavioral flexibility.

Our study also emphasizes the importance of considering clinical subtypes within OCD patients, as encouraged by other recent studies36,37. The fact that we found a deficit only in OCD checking patients is in line with a recent meta-analysis demonstrating that the neuropsychological profile of checking patients is more disrupted than that of other OCD patients, with major impairment in planning/problem solving, response inhibition and set-shifting (and therefore in executive functioning overall)38. Similarly, the identification of only a subgroup of Sapap3 KO mice impaired in our reversal learning task echoes what was reported in a recent study by Manning and colleagues23, which also found that only a subgroup of these mice was impaired in a spatial reversal learning task, although no difference in grooming level was highlighted.

The excessive response lability observed in some subgroups for both species, particularly in humans, echoes the results of computational studies performed by Kanen and colleagues39, who reported reduced stimulus stickiness in OCD patients, and the results of Hauser and colleagues40, who showed a lower win-stay probability in OCD patients compared to control subjects, with a lower perseveration parameter in their model. However, unlike in our study, the different OCD subtypes were not taken into account, and therefore the influence of the “checking” dimension on their results cannot be excluded. This excessive response lability could also be seen as a specific form of perseveration, the subject having difficulties in suppressing the previous association long after the reversal. However, OCD patients also have decision-making41 and information-sampling42 impairments specific to situations of uncertainty. In that respect, this increased response lability could be induced by an increased level of uncertainty provoked by the reversal event. This assumption makes sense when considering the isolated subgroup of “checker” patients displaying a higher degree of uncertainty. In mice, we cannot conclude that the isolated subgroup of Sapap3 KO mice is analogous to the “checking” subtype of OCD. However, considering that it is not uncommon for a patient to present a hybrid compulsive symptomatology mixing both compulsive checking and washing43, one can imagine that the impaired subgroup of mice also presents a mix of compulsive grooming and checking behaviors. Obviously, in mice these checking behaviors are not directly observable, but it would be interesting to test whether uncertainty monitoring and checking behaviors are also affected in Sapap3 KO mice. Indeed, one could expect that an abnormal increase of uncertainty after reversal would provoke the excessive lability of their behavior (e.g., with mice over-checking whether the previously rewarded stimulus is still valid). 
Another dimension which could affect both compulsivity and flexibility is the overexpression of habitual behaviors. Some recent studies have favored this idea in OCD patients44 and Sapap3 KO mice45 but others dampened46 or rejected this hypothesis47. Thus, more evidence will be needed to fully understand the implication of habits in compulsive behaviors and flexibility. Finally, it cannot be ruled out that the deficit found is linked to an impairment that is not task-specific, such as an attentional impairment, whether primary38 or secondary to obsessive activity in patients (particularly for checking patients, as the reversal increases uncertainty and thus possibly the obsessive doubt)48. It would thus be important in the future to simultaneously assess attention, like other cognitive dimensions, to better characterize the underlying mechanisms of this observed behavioral inflexibility.

A potential limitation of our study that can be pointed out is the difference in terms of medication between the two species. Indeed, while our Sapap3 KO mice were free of any pharmacological treatment that could alter their performance, OCD patients were largely on serotoninergic treatment. It is thus logical to think that the results obtained may reflect the effect of the treatment and not their disorder. However, we were able to show the absence of influence of serotoninergic medication on performance in our task with no difference between medicated and unmedicated patients. Further, it has been shown that chronic administration of a selective serotonin reuptake inhibitor reduces perseveration and promotes a win-stay strategy in this type of task49, which is contrary to our results. We can therefore be confident of the cross-species validity of these findings.

Our cross-species results favor the heterogeneity of cognitive deficits observed in compulsive disorders and stress the importance of also considering this heterogeneity in animal models. Indeed, even if inbred mouse lines share an identical genetic background, this is not necessarily stable over time and may result in the emergence of new phenotypic traits due to genetic drift. However, it has been shown that the C57BL/6J strain is one of the strains least susceptible to this effect50. Furthermore, we could not identify any genealogical specificity for the impaired KO mice subgroup. Another hypothesis for the heterogeneity we observed in our animal model is an epigenetic and/or environmental origin. It has been shown, for example, that phenotypic variability can emerge from variations in epigenetic regulation35,51,52,53. Examples include studies on genetically homogeneous WT C57BL/6J mice showing inter-individual variability in the expression of flexible behavior underpinned by variability in serotonin levels within the OFC54. As we could not identify any subgroup in our WT mice based on the task performance, such inter-individual variability could only be a risk factor whose sole interaction with the Sapap3 KO mutation leads to an impairment.

The use of Bayesian statistics is another strength of our study, which allowed us to formally support the absence of prespecified differences, notably in terms of perseveration, but also to quantify the weight of evidence in favor of or against those between-group differences. This methodological strength also points out a limitation to the interpretation of our results, especially those obtained with humans. Indeed, the differences highlighted between the checking patients and the healthy subjects are all supported by a Bayes Factor lower than 10, reflecting low to substantial evidence that is far from decisive, with a moderate effect size. This speaks to the need to replicate these results in a larger sample than the one included in this study.
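As a reading aid for the Bayes factors reported throughout, the sketch below maps a BF10 value onto verbal evidence categories. The cut-offs (1, 3, 10, 30, 100) are those used in the figure legends; the verbal labels follow a commonly used Jeffreys-style scale and are an assumption here, as is the function name.

```python
def bf_evidence(bf10):
    """Return an assumed Jeffreys-style verbal label for a Bayes factor BF10."""
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be strictly positive")
    if bf10 < 1:
        return "favors the null"
    if bf10 < 3:
        return "anecdotal"
    if bf10 < 10:
        return "substantial"
    if bf10 < 30:
        return "strong"
    if bf10 < 100:
        return "very strong"
    return "decisive"


# Values taken from the human subgroup comparisons reported in the Results.
for value in (1.19, 3.51, 9.32):
    print(value, bf_evidence(value))
```

Read this way, the human subgroup comparisons stay in the anecdotal-to-substantial range, which is the limitation acknowledged in the paragraph above.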

In conclusion, we found that compulsive behavior is not necessarily associated with a deficit in behavioral flexibility. In contrast, this study proposes that a behavioral flexibility deficit, only observed in a subset of compulsive subjects, may result from excessive response lability rather than perseveration, both in humans and mice.


Participants and animal subjects

OCD patients were recruited through an online advertisement posted on a patient association’s website (AFTOC) and among a cohort of severe patients followed in the psychiatric department of Albert Chenevier Hospital. Healthy comparison subjects were recruited through an online advertisement posted on an information web site dedicated to cognitive research (RISC). Diagnoses and co-morbidity were established by an experienced clinician with the French version of the Mini International Neuropsychiatric Interview (MINI v555). Exclusion criteria were defined as follows: current major depressive episode, bipolar disorder, acute or chronic psychosis, substance abuse or dependency including alcohol, epilepsy, cerebral injury, or other neurological problems. To assess severity and clinical subtypes of obsessive-compulsive (OC) symptoms, the Yale-Brown Obsessive-Compulsive Scale (YBOCS56) was administered only to OCD patients, and the Obsessive Compulsive Inventory Revised (OCI-R57) was used to measure all participants’ OC characteristics. Forty patients were included in this study. They were diagnosed with OCD according to the DSM-V criteria and had a score greater than or equal to 16 on the YBOCS. Among those, fifteen patients displayed contamination/washing symptoms, thirteen aggressive/checking symptoms, eight predominant aggressive/checking symptoms associated with contamination/washing symptoms and four had predominantly obsessive thoughts, mainly religious/mental rituals. The mean age at onset of OCD symptoms was 15.23 (±5.881) years and the mean illness duration was 24.92 (±13.951) years. Depression, trait/state anxiety and impulsivity were assessed, respectively, with the short version of the Beck Depression Inventory (BDI58), Spielberger’s State-Trait Anxiety Inventory (STAI59) and the Barratt impulsivity scale (BIS-1060) in their French versions. 
Among the patients taking part in this study, twenty-eight were free of any psychiatric comorbidity, eleven had a comorbid anxiety disorder (essentially general and social anxiety disorder) and one had an eating disorder. Considering psychotropic medication, twenty-eight patients took an antidepressant drug alone or combined with antipsychotics or mood stabilizer and the remaining patients were medication-free. The current pharmacological treatment was converted to dose-equivalent fluoxetine for each patient61. Forty healthy control subjects, free of any current psychiatric or neurological disorder and subsequent medications, were matched individually according to age, sex, handedness, school education, as well as for IQ (estimated by the French National Adult Reading Test, fNART62). The protocol for human participants was approved by the Medical Ethical Review Committee of the Pitié-Salpêtrière Hospital (ID RCB n° 2012-A01460-43). All the participants gave their informed consent prior to the beginning of the study.

Fifty-two C57BL/6J male mice (26 Sapap3-null (KO) and 26 age-matched wildtype (WT) littermates), 6-7 months old, were used. The mice were born, weaned (at post-natal day 21) and raised in the animal facility of the Brain and Spine Institute. Genotypes were determined by PCR of mouse tail DNA, using primer F1 (ATTGGTAGGCAATACCAACAGG) and R1 (GCAAAGGCTCTTCATATTGTTGG) for the wildtype Sapap3 allele (147 base pairs), and F1 and R2 (CTTTGTGGTTCTAAGTACTGTGG; in neo cassette) for the mutant allele (222 base pairs). Before performing the task, they were housed in groups of 3-5 in ventilated cages with ad libitum access to water and food, at a temperature of 20–22 °C and 50–60% humidity, and were maintained under a 12 h light/dark cycle (lights on from 8 a.m. to 8 p.m.). Mice started the task when they were at least 6 months old to maximize the chance of observing the grooming phenotype without severe skin lesions18. During the behavioral assessment, which lasted approximately a month, they were single-housed in the experimental cages with ad libitum access to water under a 12 h light/dark cycle. The first 24 h in the experimental cage served as the habituation period. During this period, they had ad libitum access to food and were video-taped from the top in order to quantify the self-grooming behavior (see “Grooming quantification” section). After this initial 24 h period, mice no longer had ad libitum access to food, but a tablet was delivered each time they answered correctly during the behavioral task (see “Behavioral task” section). Their weight was monitored every day by the experimenter (less than 3 min handling), and they were supplemented with tablets if their weight fell below 80% of their initial mass. During the entire protocol, mice were under continuous remote video-monitoring. 
Each animal experiment was approved by the Ethics committee Darwin/N°05 (Ministère de l’Enseignement Supérieur et de la Recherche, France) and conducted in agreement with institutional guidelines, in compliance with national and European laws and policies (Project n° 00659.01).

The detailed characteristics of the samples for both species are summarized in Table 3.

Grooming quantification

The recorded videos were manually analyzed using Kinovea v0.8.15. The self-grooming measures (number of grooming bouts and proportion of time spent grooming) were extracted from a 10-min activity period (a sufficient duration to highlight differences in the self-grooming behavior63) which started at 8 p.m., when mice become more active. When a mouse was not active at 8 p.m., the time window was moved forward until the mouse woke up and left the nest. When a mouse was already engaged in a grooming behavior at 8 p.m., the time window was moved forward to start after the end of the ongoing grooming sequence. If a mouse was still engaged in a grooming bout at the end of the time window, the window was moved forward in order to include only complete grooming bouts. Self-grooming was defined as one or more of the elements of the syntactic grooming chain in a flexible, non-chained order: elliptical strokes, small strokes, bilateral strokes, flank licks, and tail and genital licks63,64,65. Consistent with previous studies, we counted grooming bouts independently when they were separated by more than 2 s63,66. In addition, we considered two grooming bouts as independent when qualitatively different behavior interrupted the grooming sequence (i.e., jumping, locomotion, rearing)63.
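The 2 s separation rule can be expressed as a simple interval-merging step. The following sketch is a hypothetical illustration and not the scoring procedure actually used (videos were scored manually): it assumes grooming episodes are available as (start, end) times in seconds, and it does not model the second rule, under which an interruption by a qualitatively different behavior also separates two bouts.

```python
def merge_grooming_bouts(episodes, max_gap_s=2.0):
    """Merge grooming episodes separated by at most max_gap_s seconds.

    episodes : iterable of (start_s, end_s) tuples scored from video
    Returns the list of merged bouts as (start_s, end_s) tuples.
    """
    bouts = []
    for start, end in sorted(episodes):
        if end < start:
            raise ValueError("an episode cannot end before it starts")
        # A gap of more than max_gap_s opens a new, independent bout.
        if bouts and start - bouts[-1][1] <= max_gap_s:
            bouts[-1] = (bouts[-1][0], max(bouts[-1][1], end))
        else:
            bouts.append((start, end))
    return bouts


# Toy scoring of a 10-min (600 s) window; the values are hypothetical.
scored = [(0, 5), (6, 9), (12, 15), (15.5, 20)]
bouts = merge_grooming_bouts(scored)
n_bouts = len(bouts)
proportion_grooming = sum(end - start for start, end in bouts) / 600
print(n_bouts, proportion_grooming)
```

Note that in this sketch pauses of 2 s or less inside a merged bout count toward grooming time; whether such pauses were included in the reported proportion of time spent grooming is not stated in the text.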

Mouse automatized experimental chamber

It has been shown that daily manipulation of animals in experimental procedures can increase stress and negatively impact behavioral results, including the assessment of behavioral flexibility67. To avoid this bias, especially in Sapap3 KO mice, which express an anxious phenotype, we designed an automated experimental chamber (Fig. 1a) in which mice were exposed to the task 24 h a day. This behavioral apparatus consisted of a modified ENV-007CTX experimental chamber from Med Associates (Vermont, USA) with interior dimensions of 30.5 × 24.1 × 29.2 cm. The grid floor of the chamber was covered with a stainless-steel tray to receive bedding. On the left wall, two 2.8” TFT capacitive touchscreens (#2090, Adafruit, New York, USA) were placed symmetrically above the bedding tray. Each touchscreen was controlled by an Arduino (Leonardo model, Adafruit) interfaced with the I/O module (DIG-716B, Med Associates). On the right wall, a centrally placed pellet dispenser (ENV-203-20, Med Associates) delivered 20 mg precision tablets (49.6% sucrose, 5TUL, Test Diet, Missouri, USA) into a pellet receptacle (ENV-303WX, Med Associates) equipped with an infrared head entry detector (ENV-303HDW, Med Associates). A water bottle (ENV-350RMX, Med Associates) was placed next to the pellet receptacle. On the ceiling, a micro camera (700TVL Super HAD CCD II with a 2.8 mm lens, Sony, Japan) associated with a red LED for night vision (5 mm, 55 cd) was fixed at the center, and an aversive light (6 W LED spot) was located vertically above the pellet receptacle and tilted toward the touchscreens. The apparatus was controlled by Med-PC IV software (Med Associates) running on a desktop computer (under the Windows 7 OS) equipped with the DIG-700P2-R2 PCI interface card (Med Associates). Each mouse lived in the experimental chamber for several weeks, with water provided ad libitum, and could self-initiate each trial to obtain food. 
Therefore, these in-house experimental chambers allow a more naturalistic assessment of the behavioral performance of the mice by respecting their nychthemeral rhythm and by avoiding prolonged and repetitive handling (less than 3 min per day for weighing only) as well as methodological constraints that could also affect the animals' behavior (such as food deprivation before the experiment)68,69,70. Moreover, as in humans, our device made it possible to repeat the measurement of interest by collecting thousands of trials per mouse (Fig. 1b; see Supplementary Methods for details). To control for any environmental influence, the mice underwent the task in pairs (a KO mouse and its WT littermate). Prior to the beginning of the task itself, the mice were acclimatized to the experimental chamber for 24 h, with a pellet (20 mg precision tablets containing 49.6% sucrose, 5TUL, Test Diet) delivered each time they poked their nose into the food receptacle. After this acclimatization phase, the master program automatically triggered two successive pre-training instrumental phases (see Supplementary Methods for details) during which the animal learned to activate the screens and respond to visual stimuli. Finally, once the animal reached the pre-defined completion criteria for these pre-training phases, the reversal learning task began (Supplementary Fig. S3).

Reversal learning paradigm

The human version of the reversal learning task (Fig. 1c, left) was administered in a computerized version adapted from Valerius et al.71 and coded in MatLab R2013b (MathWorks) using the Psychophysics Toolbox v3. The subjects sat in front of a 17” TFT monitor and a regular keypad. Two different abstract symbols from the Agathodaimon alphabet (white font on a black background) were randomly displayed on the left and right sides of the screen. Subjects had to choose one of these symbols by pressing either a left (“Q”) or a right (“M”) key according to the screen location of the stimulus. The subjects were instructed to respond as fast as possible, and the symbols remained on screen until their response. 750 ms after the response, feedback (a green or red smiley face displayed for 500 ms) indicated whether the answer was correct or not, followed by the win or loss of one point, respectively. Inter-trial intervals were randomly drawn between 750 ms and 1250 ms. After 6 to 15 (randomized) consecutive correct responses, a reversal occurred and subjects had to adapt their response by selecting the formerly wrong symbol as the correct one. As is classically done in human behavioral studies72 to make reversal events less obvious, probabilistic errors were interspersed so that there was a 20% chance of receiving misleading feedback. To limit runs of probabilistic errors, we allowed at most 3 consecutive probabilistic errors and at most 3 probabilistic errors within any 10-trial sliding window. Moreover, a probabilistic error never occurred on a reversal event or on the 3 following trials. There were 3 breaks (every 6 reversal blocks) of up to 5 min, and the pair of symbols was changed after each break. The task ended after the completion of 20 reversals. All participants were first familiarized with the task in a short practice run (the training ended when the reversal criterion was reached).
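The constraints on misleading feedback can be summarized in a minimal Python sketch. This is not the MatLab task code; the function names and the fixed reversal positions are hypothetical (in the task, reversals depend on the subject's performance), and the window rule is read as "at most 3 misleading trials in any 10 consecutive trials".

```python
import random
from typing import List, Set


def misleading_allowed(history: List[bool], trials_since_reversal: int,
                       max_run: int = 3, max_in_window: int = 3,
                       window: int = 10, protected: int = 3) -> bool:
    """Return True if the next trial may receive misleading feedback.

    history[i] is True if trial i received misleading feedback.
    trials_since_reversal is 0 on the reversal trial itself.
    """
    # Never on a reversal trial or on the 3 trials that follow it.
    if trials_since_reversal <= protected:
        return False
    # The candidate trial would close a 10-trial window with the last 9 trials.
    if sum(history[-(window - 1):]) >= max_in_window:
        return False
    # At most 3 in a row (redundant with the default window rule, but the
    # text states both constraints, so both are kept).
    if len(history) >= max_run and all(history[-max_run:]):
        return False
    return True


def simulate_schedule(n_trials: int, reversal_trials: Set[int],
                      p_misleading: float = 0.2, seed: int = 0) -> List[bool]:
    """Draw a misleading-feedback schedule for illustration purposes."""
    rng = random.Random(seed)
    history: List[bool] = []
    since_reversal = 10**9  # no reversal has happened yet
    for t in range(n_trials):
        if t in reversal_trials:
            since_reversal = 0
        flag = (misleading_allowed(history, since_reversal)
                and rng.random() < p_misleading)
        history.append(flag)
        since_reversal += 1
    return history
```

Because the constraints veto some draws, the realized proportion of misleading trials in such a schedule is somewhat below the nominal 20%.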

The mouse version of the task (Fig. 1c right) was coded in MEDState Notation (MSN) language and was the same as the human version, the main difference being that the feedback was deterministic (i.e., there were no probabilistic errors). As in the human task, the rewarded side was contingent on the stimuli presented on the screens to exclude any simple spatially-guided or even cue-guided strategy. When the mouse launched a trial by nose poking in the pellet receptacle, two distinct, equally luminescent visual patterns of vertical and horizontal bars (Fig. 1c) were presented in white on a black background on the left and right touchscreens with pseudo-randomized locations (within a 10-trial sliding window, the same pattern could not appear more than 3 times consecutively at the same location, nor more than 7 times in total). The equiluminescent stimulus pair was chosen among those recommended by Horner et al.73, i.e., grid and lines. Once a trial was initiated, the mouse had 60 s to respond before the screens turned off. If the mouse made a correct response, the two screens blinked and the mouse had 15 s to nose poke in the food receptacle for a reward. Otherwise, the aversive light was turned on for 5 s and the stimulus locations remained the same on subsequent trials until a correct response was made (corrective trials to avoid response lateralization). After completing a trial, the mouse could not launch another one within the next 5 s. When the mouse reached a criterion of 80% correct responses over the last 40 trials, a reversal occurred and the mouse had to choose the formerly wrong stimulus as the new correct response. The task ended after completion of 5 reversals, representing around 2000 trials performed over approximately 3 weeks (Fig. 1b and Supplementary Fig. S4). The first rewarding stimulus was counterbalanced between pairs.
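The mouse reversal criterion (80% correct over the last 40 trials) can be sketched in Python as a rolling window. This is an illustration, not the MSN task code: the class name is hypothetical, and two points are assumptions not stated in the text, namely that the window is emptied after each reversal and that every trial, corrective or not, enters the window.

```python
from collections import deque


class ReversalCriterion:
    """Rolling-window reversal criterion (illustrative sketch)."""

    def __init__(self, window: int = 40, min_correct: int = 32):
        # 32 correct out of 40 trials corresponds to the 80% criterion.
        self.window = window
        self.min_correct = min_correct
        self.outcomes = deque(maxlen=window)
        self.n_reversals = 0

    def update(self, correct: bool) -> bool:
        """Record one trial outcome; return True if a reversal is triggered."""
        self.outcomes.append(bool(correct))
        window_full = len(self.outcomes) == self.window
        if window_full and sum(self.outcomes) >= self.min_correct:
            # Assumption: start counting afresh after each reversal, otherwise
            # the old window would immediately satisfy the criterion again.
            self.outcomes.clear()
            self.n_reversals += 1
            return True
        return False
```

In use, `update` would be called once per completed trial, and the task would stop once `n_reversals` reaches 5.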

For both species, the main behavioral measures used to assess the subjects’ performance were the number of trials needed to reach the reversal criterion and the reversal errors, defined as the number of perseverative errors following a reversal event. For mice, this last measure was estimated as the proportion of errors in the so-called perseverative phase74,75, defined as the block of consecutive trials following a reversal event until the performance rate reached 40%. These blocks were determined according to a change point analysis adapted from Gallistel et al.76 (see “Change point analysis” section). Other parameters of interest, after the perseverative phase, were the probability of a spontaneous strategy change (SSC), i.e., switching to the unrewarded stimulus despite previous positive feedback, and the SSC errors, i.e., the number of consecutive errors after a spontaneous strategy change.

Change point analysis

To identify the perseverative phase in the mouse version of the task, a change point analysis was performed on the cumulative record of correct responses for each animal and for each reversal block76. This analysis detects change points, which mark significant variations in the slope of the cumulative record and therefore identify changes in performance. We coded a recursive algorithm based on MatLab functions provided by Gallistel and colleagues76 to search for putative change points in individual cumulative records of performance (i.e., the trial deviating maximally from a straight line drawn between the start of the record and the assessed point). A χ2 test was used to determine whether the frequencies of correct responses before and after the putative change point significantly differed. A change point was retained if its logit value (log of the odds against the null hypothesis that there is no change) reached or exceeded the user-defined criterion. We ran the algorithm on each reversal block for each animal, starting with the highest (and very conservative) logit value of 6 and, as suggested by Rountree–Harrison et al.77, decreasing it in steps of 0.1 until a change point was detected or the lowest acceptable logit value of 1.3 was reached (no change point detected in this case). A change point therefore marked a statistically significant distinction between the post-reversal perseverative phase (performance level below 40%) and the learning phase (performance level exceeding 40%)74,75.
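The logic of this search can be illustrated with a simplified Python sketch. It is not the recursive MatLab implementation used here: it only looks for the first change point in a block, all function names are hypothetical, the χ2 test is assumed to be a Pearson test on a 2 × 2 table without continuity correction, and the logit is assumed to be in base 10 (under which 1.3 corresponds roughly to p = 0.05).

```python
import math
from typing import List, Optional, Tuple


def chi2_logit(a: int, b: int, c: int, d: int) -> float:
    """log10 odds against H0 for the 2x2 table [[a, b], [c, d]].

    Rows: before / after the putative change point.
    Columns: correct / incorrect responses.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return -math.inf  # test undefined: no evidence for a change
    chi2 = n * (a * d - b * c) ** 2 / denom
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    if p <= 0:
        return math.inf
    if p >= 1:
        return -math.inf
    return math.log10((1 - p) / p)


def first_change_point(correct: List[int], criterion: float) -> Optional[int]:
    """Return k such that the change occurs after trial k (1-indexed), or None."""
    cum = [0]
    for c in correct:
        cum.append(cum[-1] + c)
    for n in range(2, len(correct) + 1):
        slope = cum[n] / n
        # Putative change point: trial deviating most from the straight line
        # drawn between the start of the record and trial n.
        k = max(range(1, n), key=lambda i: abs(cum[i] - slope * i))
        before_ok, after_ok = cum[k], cum[n] - cum[k]
        logit = chi2_logit(before_ok, k - before_ok,
                           after_ok, (n - k) - after_ok)
        if logit >= criterion:
            return k
    return None


def detect_with_descending_criterion(correct: List[int], start: float = 6.0,
                                     stop: float = 1.3, step: float = 0.1
                                     ) -> Optional[Tuple[int, float]]:
    """Lower the logit criterion from start to stop until a change is found."""
    n_steps = int(round((start - stop) / step))
    for i in range(n_steps + 1):
        criterion = round(start - i * step, 10)
        k = first_change_point(correct, criterion)
        if k is not None:
            return k, criterion
    return None
```

On an idealized block of 30 errors followed by 30 correct responses, the sketch places the change after trial 30 at the most conservative criterion of 6.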

Statistics and reproducibility

Bayesian statistics were used to overcome the multiple shortcomings of null hypothesis significance testing78. This approach allows us to assess not only the strength of the evidence against the null hypothesis but also the one in favor of it by computing the Bayes Factor79 (BF), a ratio that contrasts, given the data, the likelihood of the alternative hypothesis (H1) with the likelihood of the null hypothesis (H0), hence the subscript BF10. BF values have a natural and straightforward interpretation as indicative of “substantial” (3 < BF10 < 10), “strong” (10 < BF10 < 30), “very strong” (30 < BF10 < 100) and “decisive” (BF10 > 100) evidence in favor of H1 (and conversely for H0 for BF10 values below 1/3, 1/10, 1/30, and 1/100, respectively). All the analyses were performed in JASP v0.9.280. For group comparisons, two-tailed (or one-tailed when justified, indicated by BF±0) Jeffreys–Zellner–Siow (JZS) paired t-tests81 were carried out to analyze differences in continuous variables with an uninformed Cauchy prior (µ = 0, σ = 1/√2), and two-tailed Gunel–Dickey (GD) contingency tables tests82 for independent multinomial sampling were carried out to analyze differences in categorical variables with a default prior concentration of 1. For multiple-group comparisons, we used JZS ANOVA83 with an uninformed multivariate Cauchy prior (µ = 0, σ = 1/2) followed by post-hoc JZS t-tests and GD contingency tables tests for joint multinomial sampling. For all post-hoc analyses, we adjusted for multiplicity according to the Westfall approach84, with the prior probability that the null hypothesis holds across all comparisons fixed to 0.585. For the comparison of performance around reversal, a two-way mixed JZS ANOVA83 including the within-subject factor “trial” (10 trials around reversal for humans and 100 for mice) and the between-subject factor “group” (OCD/KO vs. healthy/WT) was performed with an uninformed multivariate Cauchy prior (µ = 0, σ = 1/2). 
When appropriate, effect sizes were reported, i.e., Cohen’s d with its 95% Credible Interval (95CI) for JZS t-tests and the η2 for JZS ANOVA.
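The evidence categories above can be written as a small Python helper. This is only a convenience sketch with hypothetical names; the label for values between 1/3 and 3 is not given in the text and is an assumption, as is the assignment of boundary values (e.g., exactly 3 or 10) to the weaker category.

```python
def bf10_label(bf10: float) -> str:
    """Map a Bayes factor BF10 to the evidence categories used in the text."""
    if bf10 <= 0:
        raise ValueError("BF10 must be strictly positive")
    favored = "H1" if bf10 >= 1 else "H0"
    # Express the evidence as a ratio >= 1 in favor of the supported hypothesis.
    strength = bf10 if bf10 >= 1 else 1 / bf10
    if strength > 100:
        label = "decisive"
    elif strength > 30:
        label = "very strong"
    elif strength > 10:
        label = "strong"
    elif strength > 3:
        label = "substantial"
    else:
        return "inconclusive"  # range not named in the text (assumption)
    return f"{label} evidence for {favored}"
```

For instance, a BF10 of 0.02 is equivalent to odds of 50 to 1 in favor of H0 and falls in the "very strong" category.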

To assess the inter-dependence between compulsive behavior severity and behavioral flexibility performance, we carried out correlation analyses between the severity score of the disorder and the behavioral parameters extracted from our task, as performed in previous studies in both species33,71. In humans, the different OCD clinical subtypes, measured by the dedicated OCI-R scale subscores, can have a distinct impact on behavioral flexibility15. Thus, we analyzed separately the relationship between the different OCD subtypes and behavioral flexibility. Likewise, we additionally explored the influence of depressive and anxious symptoms, as well as antidepressant dose on task parameters. These analyses relied on two-tailed JZS Pearson correlation tests86 with an uninformed stretched β prior (width 1). The correlation coefficient r is reported along with its 95CI. The categorical variables were expressed as percentages and the continuous variables were expressed as means ± standard deviation (all the behavioral parameters were averaged out over the 5 reversals). The acquisition phase, defined as the trials needed to reach the reversal criterion for the first time, is thought to measure a baseline capacity for learning the associations, and not a reversal learning deficit87. It was therefore analyzed separately. All values were rounded to two decimal places.

Cluster analysis

To search for a potential subgroup of WT or KO mice which might behave differently in our task, we performed a two-step cluster analysis88 using the four behavioral parameters extracted from our task (number of trials to reversal, reversal errors, SSC probability and SSC errors). This algorithm has the advantage of automatically selecting the number of clusters that minimizes the Bayesian Information Criterion (BIC), thereby avoiding any subjective and biased selection89. The difference in BIC values (noted ∆BIC) between the best solution and the alternative models is indicative of the strength of the evidence (0<|∆BIC|<2, weak evidence; 2<|∆BIC|<6, positive evidence; 6<|∆BIC|<10, strong evidence; and |∆BIC|>10, very strong evidence)89. Two-step clustering also offers an overall goodness-of-fit measure, the silhouette measure, which assesses the quality of cluster separation. A silhouette measure of less than 0.20 indicates a poor solution, a measure between 0.20 and 0.50 a fair solution, and a measure of more than 0.50 a good solution. Furthermore, the procedure indicates the relative importance of each variable in the determination of a specific cluster, with a value ranging from 0 (least important) to 1 (most important). Our clustering used log-likelihood as a measure of distance. It was followed by a stepwise discriminant analysis as a confirmatory procedure, which included all mice (KO and WT). This analysis used Wilks’ lambda for variable selection, and prior probabilities computed from group sizes along with the within-groups covariance matrix for classification. Cohen’s κ90 was computed to quantify the agreement between these two procedures (κ < 0.20 corresponding to a poor agreement; 0.21–0.40, a fair one; 0.41–0.60, a moderate one; 0.61–0.80, a good one; 0.81–1.00, a very good one)91. 
These analyses were performed in SPSS v25 (IBM) (see Supplementary Methods for more details).
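As an illustration of the agreement measure, Cohen’s κ between two sets of cluster assignments can be computed as in the Python sketch below. This is not the SPSS procedure; the function names are hypothetical, both label vectors are assumed to use the same cluster coding, and the small gaps between the published category bounds (e.g., between 0.20 and 0.21) are resolved by using the upper bound of each category as an inclusive threshold.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label assignments of the same subjects."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    if p_e == 1:
        return 1.0  # both procedures assigned everyone to the same single label
    return (p_o - p_e) / (1 - p_e)


def agreement_label(kappa: float) -> str:
    """Qualitative reading of kappa following the thresholds in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"
```

With ten subjects of whom nine receive the same label from both procedures, and marginal counts of 4/6 and 3/7, the observed agreement is 0.90, the chance agreement is 0.54, and κ = 0.36/0.46, i.e., about 0.78 (a good agreement).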

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.