Simulated visual hallucinations in virtual reality enhance cognitive flexibility

Psychedelic drugs are known to modulate cognitive flexibility, a central aspect of cognition that permits adaptation to changing environmental demands. Despite evidence suggesting phenomenological similarities between artificially induced and actual psychedelic altered perception, experimental evidence is still lacking on whether the former can also modulate cognitive flexibility. To address this, we measured participants' cognitive flexibility through behavioral tasks after exposure to virtual reality panoramic videos and their hallucinatory-like counterparts generated by the DeepDream algorithm. Results show that the estimated semantic network has a flexible structure when preceded by altered videos. Crucially, following the simulated psychedelic exposure, individuals also showed an attenuated contribution of the automatic process and chaotic dynamics underlying the decision process. This suggests that simulated altered perceptual phenomenology enhances cognitive flexibility, presumably due to a reorganization in the cognitive dynamics that facilitates the exploration of uncommon decision strategies and inhibits automated choices.


Results
Data were collected from 52 individuals in an equipped VR lab. Participants were exposed to a series of original videos (OR condition, Fig. 1a) and their DeepDream counterparts (DD condition, Fig. 1a) in VR. The order of conditions was counterbalanced across participants. Immediately after the video presentation in each condition, volunteers performed two behavioral tasks and a questionnaire on a computer screen, always in the same order (Fig. 1b). The first task was the AUT 39,49 (Fig. 1c), in which participants were asked to list as many unusual uses as they could think of for four cue words (e.g., "Newspaper"). The second task was a mouse-tracking version of the Stroop task 40,41 (Fig. 1d), requiring participants to click on the button representing the color of the target word while ignoring its meaning. After the tasks, the ASC 42 questionnaire was administered to measure specific dimensions of the participants' subjective experience. The ASC was used to probe the effectiveness of the DD videos in simulating visual hallucinations, following Suzuki et al. 36 . We performed a two-tailed paired permutation t-test (α = 0.05, 10,000 iterations) and computed Cohen's d as a measure of effect size to compare responses to the ASC items following the two conditions. Results showed a significant increase of the ratings in DD compared to OR (Fig. 1e; see also Table 1).

Semantic network structure. The AUT responses were pre-processed, resulting in 567 unique responses in total across the sample. McNemar's chi-squared test and the Phi measure of effect size (φ) were used to examine whether there was a difference in the proportion of unique responses between the conditions. In the OR condition, participants generated 339 of the total unique responses (196 of which were not given in DD), and in the DD condition, participants generated 371 of these responses (228 of which were not given in OR).
The proportion of unique responses in DD (65.5%) with respect to OR (59.8%) was not significantly different (χ2(1) = 2.266, p = 0.132, φ = 0.06). After this step, we constructed a group-level semantic network for each condition using the preprocessed AUT unique responses. To construct the semantic network, we first selected only the unique AUT responses that matched between conditions and were collected from at least two participants, which resulted in 63 responses (nodes). Then, we computed the cosine similarity, in a pairwise fashion, between the binary vectors associated with each selected unique response, representing which participants made that response. This resulted in an undirected weighted semantic network for each condition, with the unique responses as nodes and the cosine similarities as links. After filtering 50 the semantic networks, their organization was analyzed using the following network measures, commonly examined in semantic network research: Clustering Coefficient (CC) 51 , Average Shortest Path Length (ASPL), modularity index (Q) 52 , and Small-worldness measure (S) 53 . Results from the comparison of the full networks revealed qualitative (Fig. 2a) and quantitative (Fig. 2b,c) differences between the OR and DD networks' structures. The OR network appeared to be more spread out than the DD network (Fig. 2a). Conversely, the DD network showed a reduced distance between nodes, as reflected in the lower ASPL (Fig. 2b). The semantic network of the DD condition showed lower structural values (ASPL = 3.388, Q = 0.607) and a higher flexibility value (S = 4.678) compared to the network of the OR condition (ASPL = 4.052, Q = 0.614, S = 4.565). The clustering coefficient showed a small difference, with a higher value in the OR network (CC = 0.6906) than the DD network (CC = 0.6905).
We statistically examined the validity of our findings by applying two complementary approaches, the leave-one-node-out (LONO) and the leave-one-subject-out (LOSO) procedures. A two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations) was computed on each measure for comparing the conditions in both the LONO and LOSO procedures. Results showed that DD had significantly smaller distances between nodes, lower clustering, and higher small-worldness compared to OR, confirming the analyses on the full networks, with effect sizes ranging from moderate to very large (Fig. 2c). The modularity difference between conditions did not reach statistical significance. For statistics see also Table 1.
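The two-tailed paired-samples permutation t-test used throughout these analyses can be sketched as follows. This is a minimal illustration of the general sign-flipping technique (randomly flipping the sign of each paired difference to build a null distribution), not the authors' exact implementation; the function name and defaults are ours.

```python
import numpy as np

def paired_permutation_ttest(x, y, n_iter=10_000, seed=0):
    """Two-tailed paired-samples permutation test with Cohen's d.

    Builds a null distribution by randomly flipping the sign of each
    paired difference, then compares the observed mean difference
    against that distribution.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    observed = diff.mean()
    # Each row of `signs` is one random relabeling of the paired differences
    signs = rng.choice([-1.0, 1.0], size=(n_iter, diff.size))
    null_means = (signs * diff).mean(axis=1)
    p = float((np.abs(null_means) >= np.abs(observed)).mean())
    # Cohen's d for paired samples: mean difference / SD of differences
    d = float(observed / diff.std(ddof=1))
    return p, d
```

For two clearly separated conditions the test returns a small p-value together with a positive effect size, mirroring how the paper reports each comparison as a (p, d) pair.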
Network percolation. We also assessed the robustness of the semantic networks using a network percolation analysis approach 4 to probe the resiliency of the networks under targeted attacks. Here, the weighted networks previously constructed from the AUT data were used. In the percolation analysis, networks are "attacked" by removing links with weight strength below an increasing threshold, called the percolation step. In each percolation step, we measured the size of the Largest Connected Component (LCCS), which is the largest cluster of nodes connected only to each other. Once the percolation process reached its end, we computed the percolation integral (φ), which is the area under the curve representing the LCCS across the percolation steps. We applied this analysis to the full networks of both conditions, finding that the percolation integral of DD (φ = 22.15) was larger than that of OR (φ = 17.15), meaning that the OR network broke apart faster than the DD network, as illustrated in Fig. 3a. Figure 3c illustrates how the networks appeared throughout the percolation process; the difference in the LCCS between the conditions at different percolation steps can be appreciated, denoting the DD network's robustness compared to the OR network. To determine the statistical significance of our findings, we implemented three approaches: LOSO, LONO, and the link shuffling analysis (LS). For the LS analysis, we randomly exchanged pairs of links (~ 1700) in the network and computed φ, repeating this procedure for 500 iterations. A two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations) was computed on the percolation integral for comparing the conditions in the LONO, LOSO, and LS procedures. Overall, these analyses revealed similar results (Fig. 3b), namely that the average percolation integral of DD was significantly larger than that of OR (all p < 0.001), with very large effect sizes (d LONO = 7.01, d LOSO = 1.89, d LS = 5.08; for statistics see also Table 2).
Drift diffusion conflict modelling. Stroop data were pre-processed by removing outliers both in terms of reaction times (RT) and accuracy. After this step, we compared accuracy and RT data across Stroop conditions using a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations, Fig. 4a). We did not find any significant difference between OR and DD in the accuracy of congruent (p = 0.787, d = 0.10) and incongruent trials (p = 0.098, d = 0.20), nor in the RT of congruent (p = 0.749, d = 0.03) and incongruent trials (p = 0.642, d = 0.04). Then, we fitted the DCM to the RT and accuracy data for each participant and condition separately. We compared 4 parameters from the model fitting (Fig. 4b-d), namely the amplitude of the automatic process (α), the decay of the automatic process (τ), the drift of the controlled process (δ), and the decision boundary (β). Statistical significance was assessed with a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations). We found that α was significantly reduced in DD compared to OR (p = 0.012, d = 0.51), indicating that the contribution of the automatic process to the participants' decision-making was reduced in DD. We did not observe any difference in the decay of the automatic process (p = 0.986, d = 0.01), the drift of the controlled process (p = 0.282, d = 0.16), or the decision boundary (p = 0.507, d = 0.11) between OR and DD.
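As a rough illustration of how a conflict diffusion model of this kind generates RT and accuracy data, the sketch below simulates single trials with a constant controlled drift δ, a transient automatic pulse of amplitude α that peaks at time τ and then decays, and a symmetric decision boundary β. The pulse shape and every parameter value here are our own simplifying assumptions for illustration, not the fitted DCM of the paper.

```python
import numpy as np

def simulate_conflict_trial(delta, alpha, tau, beta, congruent,
                            dt=0.001, sigma=0.1, max_t=3.0, rng=None):
    """Euler simulation of one trial of a simplified conflict diffusion model.

    The controlled process drifts at constant rate `delta`; the automatic
    process adds a transient pulse of amplitude `alpha` peaking at `tau`
    seconds, pushing toward the correct (upper) boundary on congruent
    trials and away from it on incongruent trials.
    """
    rng = np.random.default_rng() if rng is None else rng
    sign = 1.0 if congruent else -1.0
    x, t = 0.0, 0.0
    while abs(x) < beta and t < max_t:
        t += dt
        pulse = alpha * (t / tau) * np.exp(1.0 - t / tau)  # peaks at t = tau
        x += (delta + sign * pulse) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return t, x >= beta  # RT and whether the correct boundary was reached
```

Simulating many trials with these toy parameters reproduces the qualitative Stroop signature (slower, slightly less accurate incongruent responses), which is the behavior the DCM parameters summarize.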

Mouse trajectory analysis.
We also analyzed the Stroop data in terms of mouse trajectories. We computed the area under the curve (AUC) between the trajectories and the optimal path, the permutation entropy (PE) and the number of deviations (D) along the x and y coordinates, the Euclidean distance (ED), and the velocity (V). Statistical significance was assessed with a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations). We found that trajectories were closer to the optimal path in the congruent trials with respect to the incongruent ones in both OR and DD conditions (Fig. 5a). The trajectories' AUC comparison was not significant in either congruent (p = 0.555, d = 0.04) or incongruent (p = 0.839, d = 0.01) trials (Fig. 5b). Moreover, we found that PE was significantly higher in DD compared to OR on the incongruent trials along the x coordinate (p = 0.045). Surprisingly, we observed that PE, ED, and D were generally higher in congruent trials with respect to incongruent trials in both conditions. Furthermore, we used Gaussian Mixture Models (GMM) to estimate macro-states of the trajectories in order to better characterize the decision process of participants. Model selection indicated that a GMM with 4 clusters was the best fitting model. We labeled these clusters as Initiation, Prediction, Evaluation, and Termination states. As shown in Fig. 5c, the Initiation and Termination states pertain to the starting and ending phases of the trajectories, respectively. The Prediction state follows the Initiation state and was considered the moment in which participants made their first guess about the correct outcome of the trial. After this, the Evaluation state is a phase in which participants could in principle change their first prediction and be attracted more towards other targets. We computed the transition matrices among these states between conditions, split between congruent and incongruent trials (Fig. 5d).
Broadly, as expected, we found it was more probable to remain in a given state than to switch, since these states can be conceived as mostly sequential stages of the decision process. Finally, we computed the Dwell Time (DT), defined as the average lifetime of a state (Fig. 5e). We found a significant difference in the Termination state between OR and DD in congruent trials (higher in OR, p = 0.019, d = 0.22) but not in incongruent trials (p = 0.187).
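The permutation entropy used in the trajectory analysis can be computed along the following lines; this is a minimal sketch of the standard ordinal-pattern definition (Bandt & Pompe), with function name and default parameters of our own choosing, since the exact settings are not reported above.

```python
import numpy as np
from math import factorial

def permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Permutation entropy of a 1-D signal.

    Counts the relative frequencies of ordinal patterns of length `order`
    and returns their Shannon entropy; higher values indicate a more
    irregular, less predictable trajectory.
    """
    x = np.asarray(signal, dtype=float)
    n = len(x) - (order - 1) * delay
    patterns = {}
    for i in range(n):
        window = x[i:i + order * delay:delay]
        pattern = tuple(np.argsort(window))  # ordinal pattern of the window
        patterns[pattern] = patterns.get(pattern, 0) + 1
    probs = np.array(list(patterns.values()), dtype=float) / n
    h = -np.sum(probs * np.log2(probs))
    if normalize:
        h /= np.log2(factorial(order))  # scale to [0, 1]
    return float(h)
```

A monotone trajectory contains a single ordinal pattern and yields an entropy of 0, while a noisy trajectory approaches 1, which is why higher PE is read above as greater tortuosity of the mouse path.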

Discussion
Despite scientific studies demonstrating a beneficial effect of psychedelics on CF 20-26 , the effect of artificially induced altered perception on CF has remained largely unexplored. In the present study, we applied the DeepDream algorithm to a series of panoramic videos of natural scenes with the intent to simulate visual hallucinations 36 . Then, we tested whether DeepDream enhanced CF as compared to regular perceptual phenomenology. To achieve this goal, we used a within-subject design in which participants were exposed to DD and OR (control) video sessions in VR. After the video presentation, CF was assessed by both the AUT 39,49 and a mouse-tracking version of the Stroop task 40,41 , while participants' phenomenological experience was measured through a brief version of the ASC questionnaire 42 . The ASC revealed a significant increase in DD over OR of perceptual ('patterns', 'space') and imaginative dimensions ('imagery', 'strange', 'vivid', 'muddle') as well as the overall intensity and mystical quality ('spirit') of the experience. Our results are largely in line with the findings reported by Suzuki et al. using the Hallucination Machine, as well as studies reporting alterations in participants' subjective experience after pharmacological administration of psilocybin 36 . Importantly, our findings corroborate the effectiveness of the DeepDream stimulation procedure on the modulation of distinct aspects of altered states of consciousness, especially visual hallucinations, while avoiding the extensive systemic effects produced by pharmacological interventions. Next, we applied a network science methodology in order to construct semantic networks from the AUT responses. Within this framework, we estimated and compared the properties of the semantic network of each condition 15 and performed a percolation analysis on these networks 4 .
Here, we assumed that a more small-world network should facilitate semantic search processes by connecting weakly related concepts, hence enhancing CF. We found that DeepDream exposure significantly influenced the network organization, leading to a reduction of the shortest path between nodes and an increase in S with respect to the OR condition. Consistent with our findings, a higher value of small-worldness in the semantic network has previously been associated with high creative ability 15,47,54 . Yet, in our results we found significantly higher connectivity in OR over DD, suggesting that the reduced ASPL in DD over OR was the driving effect of these results, augmenting the chances of reaching a wider number of semantic connections 55,56 . We interpret these findings as showing that the DD network had a more efficient and flexible structure as compared to the OR network, indicating a higher level of CF in the participants elicited by DD 14 . A similar effect was also found in previous studies 21,33,34 suggesting that LSD and related psychedelics increase the spread of semantic activation. Critically, the percolation analysis yielded consistent results, indicating that the conceptual network of the DD condition was significantly more robust to the percolation process, as exhibited by a higher percolation integral (i.e., the DD network breaks apart slower than the OR network). The semantic network outcomes are further consistent with current theories of semantic memory described as a dynamic system, able to change its organization in the short term 21,57-59 . Therefore, our findings may also provide support for process-based change in the conceptual networks. Furthermore, Stroop data were first analyzed in terms of accuracy and RT by means of an integrated computational framework, namely the drift diffusion model 43,44 . We implemented a specific version of this model, the DCM, tailored for conflict tasks 45 .
We found that, although the drift parameter δ of the controlled process was similar between conditions, DD had a significantly lower amplitude of the automatic process compared to OR, suggesting that automatic processes contributed less to the overall decision process in DD. Indeed, the inhibition of automatic responses is an important characteristic of CF 12,13,48 ; therefore, we interpreted these results as a confirmation of our previous network findings from a different perspective, i.e., inhibitory control. Contrary to our expectations, we did not observe an increment in the efficiency of the performance due to DD. This might be due to the fact that DD modulated not only CF but also other aspects of cognition that might have interfered with the performance on the Stroop task.
These preliminary interpretations were also corroborated by the analyses performed on the mouse trajectories, evidencing general differences between conditions in the strategies adopted by participants to reach the target. Firstly, we found that the trajectories in the incongruent trials were characterized by a higher level of tortuosity (higher permutation entropy and deviations) in DD compared to OR. Also, the GMM clustering analysis revealed a higher tendency to stay in the early stages of the decision process in DD with respect to OR in congruent trials (higher Dwell Time of DD in the Prediction state and of OR in the Termination state). These findings show how DD pervasively affected participants' cognitive abilities by perturbing their lower-level perceptual expectations, which ultimately affected higher-level cognitive processes.
Crucially, these results seem to corroborate at the behavioral level the findings from Greco et al. 37 , who observed generally increased entropic brain dynamics due to DeepDream exposure. Therefore, it seems reasonable to interpret these mouse trajectory findings in light of the entropic brain hypothesis 32,35 , according to which, during psychedelic experiences, the brain is supposed to operate at a state of criticality, where it has access to a larger repertoire of physical states and therefore exhibits more chaotic dynamics. Here, we speculate that the tortuosity of the trajectories, alongside the increased latency in resolving the uncertainty of the choice at the behavioral level, might be explained by higher entropic dynamics at the neural level. As stated above, these explanations could also account for the observation that DD did not improve participants' performance, since it is more difficult to reach a higher level of efficiency with such a chaotic regime in the brain dynamics.
Overall, our findings can also be interpreted in line with recent evidence suggesting that diversifying experience, loosely defined as highly unusual and unexpected events, can lead to an enhancement of CF 16,60 . Our contribution to this line of research was to provide a quantitative and parametric approach to experience diversification by employing the DeepDream stimulation procedure.
Some limitations to our study exist. First, we did not find significant differences in traditional measures of performance in both the AUT and Stroop tasks, namely fluency (i.e., number of generated ideas), accuracy, RT, and AUC. This could be due to either methodological issues (e.g., exposure time to stimuli) or the mere fact that DeepDream affects only the aspects of the participants' performance we found to be significant. Therefore, future studies should address this point by diversifying the methodological choices or focusing on analyses that shed light on these traditional measures. An interesting future direction to understand the effect of immersion in DeepDream distortions could be the extension to other CF and executive function measures. Moreover, although we chose to adopt state-of-the-art analytical methods, this choice comes with the necessary downside that only a few studies validate these methods, since they are so recent. We encourage future studies to compare our findings with both traditional and other advanced methods in order to generalize the results. Second, although the AUT is a valuable task to study creativity, it is usually administered as an open-ended task, whereas we required participants to only generate single-word responses. This requirement potentially constrains participants' responses in this task. Thus, future research is needed to replicate and extend our findings in more standard tasks used to assess semantic memory networks, such as free associations and semantic fluency tasks 56 . Third, we analyzed the semantic networks at the group level, aggregating across individuals and thus ignoring unique individual differences. Follow-up research should replicate our findings using the DeepDream manipulation by estimating individual-based semantic networks 61 .
Lastly, future studies are encouraged to augment the stimulus set by exploring different parameter settings of DeepDream, for instance by generating different videos that match, from low-level to high-level features, their original counterparts; this would clarify how and to what extent the low-level features modulate CF.
In conclusion, our findings provide evidence that simulated altered perceptual phenomenology enhances CF, presumably due to a reorganization in the cognitive dynamics that facilitates the exploration of uncommon decision strategies and inhibits the prevalence of automatic choices. We also showed how recent deep learning models can furnish cognitive science research with a new tool to investigate low- and high-level cognition. Our study illustrates the strength of applying DeepDream-induced altered perception to the study of cognitive processes, and motivates further investigation of similar techniques for fostering cognitive flexibility.

Methods
Participants. Fifty-two students from the University of Trento participated in the study. An additional 4 volunteers were excluded from the analysis because they did not complete the task. Participants were aged between 19 and 39 years (32 female, M = 23.25 years, SD = 4.32 years) and were native Italian speakers. None of the participants reported having problems with their sight (normal or corrected-to-normal vision). They had no history of neurological disorders and were not taking any neurological medications. We determined the sample size based on comparable experimental psychology studies that use "over 50 participants for a simple comparison of two within-participants conditions with 80% power" 62 . Prior to the experiment, all participants provided written informed consent and received €10 or course credits as compensation for their time. All methods were approved by the University of Trento, Human Research Ethics Committee (Protocol 2018-023). The whole procedure was realized in accordance with the Helsinki Declaration.
Procedure. Participants were welcomed in a dedicated VR lab, gave consent, and completed demographic information. The experiment consisted of two conditions (OR and DD), in which participants were exposed to a series of panoramic videos in VR, with a total duration of ~ 45 min. The order of conditions was counterbalanced across participants. After the head-mounted display (HMD, Oculus Rift) was fitted, participants, comfortably sitting in a chair, started the experiment by being exposed to either the DD or the OR video series. They were encouraged to freely explore the virtual environment by moving their head. The OR video series consisted of 6 panoramic high-definition naturalistic video clips (2732 × 1366 resolution, 20 fps) with a duration of 50 s each (Fig. 1a), presented one after the other with no delay in between, for a total duration of 5 min. All the OR videos represented naturalistic scenes, such as beaches or waterfalls, and there was a blind spot of approximately 33 degrees located at the bottom of the sphere due to the field of view of the camera. The DD video series was a modified version of the OR videos generated using the DeepDream algorithm 38 . Immediately after the exposure to the videos, participants performed the AUT, the Stroop task, and the ASC questionnaire (Fig. 1b). Those tasks were always administered in the same order via a computer screen, implemented in OpenSesame 63 .

DeepDream stimuli.
DeepDream is a computational procedure that alters images relying on a pre-trained deep convolutional neural network (CNN), a process also referred to as "algorithmic pareidolia" 37 . The algorithm starts by passing an input image I with width ( w ) and height ( h ) through the CNN up to a selected layer A l . The objective function L to maximize is the activation of A l . To achieve this, instead of optimizing the parameters of the network as in classic training, the algorithm alters the input image by adding the partial derivatives (gradients) of L computed with respect to the input image. This optimization is called gradient ascent because it leads to the maximization of L . Since DeepDream was conceived for static images, we followed Suzuki et al. 36 in adapting the algorithm to videos, using optical flow to stabilize the optimization process and reduce the variability of the generated frames. In this study, we selected a relatively high layer (inception_4d/pool) of the GoogleNet CNN 38,64 and set all the hyperparameters similarly to the Hallucination Machine 36 (octaves = 3, octave scale = 1.8, iterations = 16, jitter = 32, zoom = 1, step size = 1.5, flow threshold = 6, blending ratio for optical flow = 0.9, blending ratio for background = 0.1).
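The gradient-ascent loop above can be illustrated with a deliberately toy example. Real DeepDream backpropagates through a CNN to obtain the gradient of a layer's activation; here, as a stand-in assumption, the "layer activation" is the squared projection of the image onto a fixed feature template, whose gradient is available analytically. The point is only the mechanics: the gradient of the activation is repeatedly added to the input, amplifying whatever the "layer" already responds to.

```python
import numpy as np

def deepdream_step(image, feature, lr=0.1):
    """One gradient-ascent step of a toy DeepDream.

    The toy 'activation' is (sum(I * f))^2, so its gradient with respect
    to the image is 2 * (sum(I * f)) * f. The (normalized) gradient is
    added to the image, increasing the activation.
    """
    activation = float(np.sum(image * feature))
    grad = 2.0 * activation * feature  # analytic gradient of the toy activation
    return image + lr * grad / (np.abs(grad).max() + 1e-8)  # normalized ascent step

def dream(image, feature, steps=20, lr=0.1):
    """Iterate the ascent step, as DeepDream does per frame/octave."""
    for _ in range(steps):
        image = deepdream_step(image, feature)
    return image
```

In the real algorithm the same loop runs per octave over rescaled versions of the frame, with jitter and (for video) optical-flow blending between consecutive frames.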

The alternative uses task (AUT).
The flexibility of thought was first measured with the AUT 39,49 , a widely used divergent thinking task commonly employed for investigating CF 16 . In the AUT, participants were asked to list as many original uses as possible in response to four verbal prompts ("brick", "newspaper", "pencil", "shoe"). They had 2 min to respond to each verbal prompt in a text box, where the responses were constrained to a single or a compound word (e.g., "wrapping paper", Fig. 1c). Participants were instructed to generate single- or compound-word responses, instead of the standard open-ended version of the AUT. This allowed us to control for the open-ended nature of the task and led to more consistent and standardized responses. Each condition had two verbal prompts, counterbalanced across participants. As our interest was to study the effect of the DeepDream manipulation on CF, we collapsed across AUT items and examined the aggregated action-related space of participants' responses and how our manipulation affected this conceptual space.

Stroop task.
A mouse-tracking version of the Stroop task 41 was used to assess CF and the ability to inhibit cognitive interference 65 . The stimuli consisted of the words "red", "green", "yellow" and "blue" (in Italian) presented in four response boxes in the upper half of the screen (Fig. 1d). Every trial started after the participant clicked a start button in the lower center of the computer screen; 200 ms later, the target word stimulus appeared right on top of the start button. Participants were instructed to click on one of the four response boxes corresponding to the ink color of the target stimulus. In 50% of the trials, the color of the displayed word matched its meaning (congruent trials), whereas, in the remaining 50% of the trials, the color and meaning were different (incongruent trials). We also added the constraint that red and green written target words were always printed in either red or green, and blue and yellow always in either blue or yellow. In other words, the mapping between meaning and color print was not fully balanced across all target words, but only between the red-green and blue-yellow pairs. This allowed a fair comparison of the mouse trajectories, since the response boxes were placed symmetrically according to this constraint. The total number of trials was 64; congruent and incongruent trials were presented in randomized order. Mouse trajectories, as well as RT and accuracy, were recorded via the Mousetrap 66 plugin in OpenSesame with a sample rate of 100 Hz (OS: Windows 10, default mouse settings).

Altered states of consciousness questionnaire (ASC).
A short version of the ASC questionnaire 42 was administered to assess the subjective effects of the videos on certain dimensions of the subjective experience. Indeed, we measured only dimensions that had already been found to be significantly similar between the Hallucination Machine and the actual psychedelic experience 36 (Fig. 1e). In the ASC, participants were instructed to rate, on a continuous linear scale (0 = "no more than usual", 1 = "yes, much more than usual"), their experience with the video series as compared to normal waking consciousness.
Semantic network analysis. Traditionally, using the AUT, most research has employed scoring methods of CF based on human judgment 49 , measuring how flexible a participant's performance is.
Although this approach offers some degree of utility, it poses concerns related to the complexities of subjective judgment, raters' experience, and labor costs. Thus, we opted to quantify CF from the AUT data using two complementary network science approaches 4,15 . Following a recently developed 67 yet extensively applied framework 14 , we modeled the AUT responses as a network in which the nodes represent possible actions (i.e., unique responses) generated by participants in the sample to all of the different AUT cue words, and edges represent relations between two of them. An edge indicates the participants' tendency to generate a word "b" when a word "a" is also generated, allowing us to investigate overall group differences in the network depending on how frequently responses co-occurred across participants 67 .
The raw responses to the AUT, constrained to a single or a compound word, were preprocessed by excluding idiosyncratic answers and non-words and controlling for other possible confounds (i.e., spell-checked, converted plural words into singular). We used McNemar's chi-squared test to analyze whether there was a difference in proportion between the total number of unique responses and the number of unique responses given by the participants in each condition. In order to construct the semantic network, we structured the data into a binary N × M matrix for each condition, in which each column M represents a unique response given by the participants to the AUT verbal prompts, and each row N represents a single participant. In each cell, responses were encoded as 1 when participant N provided response M and 0 otherwise. From the binary matrices, we selected only the unique responses (i.e., the columns) generated by at least two participants in each condition and matched these unique responses between conditions. This allowed us to control for possible confounders, such as different nodes or edges between conditions 68 . Thus, we constructed a word-similarity matrix by computing the cosine similarity between all pairs of unique responses for each condition. The resulting matrix is an adjacency matrix of a weighted, fully connected network, with unique word responses as rows and columns, and cell values as the weights of the links between all pairs of words. To retain the most relevant information in the networks, we removed spurious associations (i.e., weak similarities) by filtering the adjacency matrices with the Triangulated Maximally Filtered Graph (TMFG) method 50 .
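The construction of the word-similarity matrix from the binary participant-by-response matrix can be sketched as follows; the function name is ours, and the TMFG filtering step is omitted.

```python
import numpy as np

def response_similarity_matrix(binary):
    """Pairwise cosine similarity between the columns of a binary
    participant-by-response matrix (1 = participant gave that response).

    Returns a symmetric weighted adjacency matrix with responses as
    nodes and cosine similarities as link weights (diagonal zeroed,
    i.e., no self-loops).
    """
    m = np.asarray(binary, dtype=float)
    norms = np.linalg.norm(m, axis=0)        # column (response) norms
    sim = (m.T @ m) / np.outer(norms, norms)  # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)
    return sim
```

For example, two responses each given by two participants with exactly one participant in common have a cosine similarity of 1/2, so responses that co-occur across more of the same participants are more strongly linked.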
The multifaceted aspects of the structure of the conceptual networks were quantified using the following topological quantifiers, after binarizing the networks: CC, ASPL 51 , Q 52 and S 53 . The CC is an index of how closely nodes in a network tend to cluster together and should be interpreted as a measure of connectivity. Thus, a higher CC implies better local organization and stronger connectivity within the network. The ASPL indicates the average number of steps along the shortest paths for all possible pairs of network nodes. A lower value of ASPL might improve the chances of reaching relatively remote nodes faster. The Q assesses how a network breaks down into sub-networks, quantifying the quality of a division of the network into communities, while the S can be considered an index of network flexibility. Indeed, high local connectivity (higher CC) and short global distances between nodes (lower ASPL) define a small-world network, which can be quantified as the ratio between CC and ASPL 51 .
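The CC, ASPL, and S measures described above can be computed on a binarized adjacency matrix along the following lines (modularity omitted for brevity). This is a self-contained sketch in plain NumPy rather than the NetworkToolbox/NetworkX pipeline the study used, and S is taken as the CC/ASPL ratio, following the definition in the text.

```python
import numpy as np
from collections import deque

def clustering_coefficient(adj):
    """Average clustering coefficient of a binary undirected network."""
    a = np.asarray(adj)
    coeffs = []
    for i in range(a.shape[0]):
        nbrs = np.flatnonzero(a[i])
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = a[np.ix_(nbrs, nbrs)].sum() / 2  # edges among node i's neighbours
        coeffs.append(2.0 * links / (k * (k - 1)))
    return float(np.mean(coeffs))

def average_shortest_path_length(adj):
    """ASPL via breadth-first search (assumes a connected network)."""
    a = np.asarray(adj)
    n = a.shape[0]
    total, pairs = 0, 0
    for s in range(n):
        dist = np.full(n, -1)
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.flatnonzero(a[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += dist[dist > 0].sum()
        pairs += int((dist > 0).sum())
    return float(total / pairs)

def small_worldness(adj):
    """Ratio of clustering to path length, per the text's definition of S."""
    return clustering_coefficient(adj) / average_shortest_path_length(adj)
```

On a complete graph all three quantities equal 1; on a ring, clustering drops to 0 while ASPL grows, which is the trade-off S summarizes.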
Statistical analysis was conducted by applying two complementary approaches, the LONO and LOSO procedures. In the LONO procedure, we iteratively computed the aforementioned network measures on the partial networks resulting from the exclusion of one node at each iteration, for every node. In the LOSO procedure, at each iteration we excluded one participant and repeated the pipeline for building the semantic networks and computing the network measures, for every participant. We used a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations) to investigate the statistical differences between conditions for each measure in both LONO and LOSO, with Cohen's d as a measure of effect size. These analyses were conducted in R using the NetworkToolbox and SemNet packages 69 . Inferential statistics and data visualization were implemented in Python using the NetworkX library 70 .
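The paired permutation t-test used throughout can be sketched as a sign-flipping test on the paired differences (a hypothetical minimal implementation, not the authors' code):

```python
import numpy as np

def paired_permutation_t(x, y, n_iter=10_000, seed=0):
    """Two-tailed paired-samples permutation test on the mean difference.

    The signs of the paired differences are randomly flipped at each
    iteration; the p-value is the fraction of permuted mean differences
    at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(x, float) - np.asarray(y, float)
    obs = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_iter, d.size))
    null = (signs * d).mean(axis=1)
    return float((np.abs(null) >= abs(obs)).mean())

def cohens_d_paired(x, y):
    """Cohen's d for paired samples: mean difference over its SD."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return d.mean() / d.std(ddof=1)
```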
Network percolation analysis estimates the robustness of complex networks under targeted attacks 71 . In this study, we implemented percolation analysis 4 using the weighted TMFG-filtered networks previously constructed from the AUT data. In percolation analysis, networks are "attacked" by removing links whose weight falls below an increasing threshold, called the percolation step 4 . The initial threshold was the smallest weight in the network, and the smallest difference between the sorted weights was used to determine the threshold resolution. At each percolation step, we measured the LCCS, the size of the largest connected component, defined as the largest cluster of nodes that are connected to each other. The percolation process was terminated when the number of nodes in the largest connected component dropped below 3.
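A minimal sketch of this procedure (with a hypothetical `percolation_curve` helper; for simplicity the thresholds here are the sorted unique edge weights rather than the paper's exact resolution rule):

```python
import numpy as np
import networkx as nx

def percolation_curve(G):
    """Remove links whose weight falls below an increasing threshold and
    track the size of the largest connected component (LCCS) per step.

    Stops once the largest component holds fewer than 3 nodes; returns
    the thresholds applied and the corresponding LCCS values.
    """
    weights = sorted({d["weight"] for _, _, d in G.edges(data=True)})
    steps, sizes = [], []
    for thr in weights:
        # Keep only links at or above the current threshold.
        H = G.edge_subgraph(
            [(u, v) for u, v, d in G.edges(data=True) if d["weight"] >= thr]
        )
        lccs = max((len(c) for c in nx.connected_components(H)), default=0)
        if lccs < 3:
            break
        steps.append(thr)
        sizes.append(lccs)
    return np.array(steps), np.array(sizes)
```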
Once the percolation process reached its end, we computed φ (the percolation integral), which is the area under the curve of the LCCS across the percolation steps; it is formally defined as the sum of all LCCS values weighted by their weight threshold 4 . This measure allowed us to estimate how fast the network breaks apart, an index of its robustness and structural flexibility. To determine the statistical significance of the percolation analysis results, the LONO, LOSO, and LS procedures were applied as complementary approaches. Similar to the procedures described in the semantic network analysis section, in the LONO and LOSO procedures we iteratively excluded one node or one participant, ran the percolation analysis on the resulting networks, and computed φ for each node or participant, respectively. Furthermore, we performed the LS analysis to control for the possibility that differences between networks stem from differences in the link weights. Here, for each network, we randomly selected two pairs of linked nodes and exchanged their links. To ensure that the majority of the links were exchanged, this process was repeated 10 times for every link in the networks (1750 shuffles for the OR network, 1730 shuffles for the DD network). This procedure was repeated for 500 iterations, computing φ on the link-shuffled network at each iteration. We then conducted a paired two-tailed permutation t-test (α = 0.05, 10,000 iterations) between the φ of the DD and OR conditions for each procedure.
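The percolation integral and the link-shuffling (LS) step might be sketched as follows (hypothetical helpers; the swap rule, rewiring two randomly chosen edges while carrying their weights along, is one plausible reading of the procedure):

```python
import numpy as np
import networkx as nx

def phi_integral(thresholds, lccs):
    """Percolation integral: sum of LCCS values weighted by their
    weight-threshold value, i.e. the discrete area under the curve."""
    return float(np.sum(np.asarray(lccs) * np.asarray(thresholds)))

def link_shuffle(G, n_swaps, seed=0):
    """Degree-preserving link shuffle: repeatedly pick two edges (a, b)
    and (c, d), rewire them to (a, d) and (c, b) carrying the weights
    along, skipping swaps that would create self-loops or duplicates."""
    rng = np.random.default_rng(seed)
    H = G.copy()
    edges = list(H.edges())
    done = 0
    while done < n_swaps:
        (a, b), (c, d) = (edges[i] for i in rng.choice(len(edges), 2, replace=False))
        if len({a, b, c, d}) < 4 or H.has_edge(a, d) or H.has_edge(c, b):
            continue  # invalid swap: resample
        w_ab, w_cd = H[a][b]["weight"], H[c][d]["weight"]
        H.remove_edges_from([(a, b), (c, d)])
        H.add_edge(a, d, weight=w_ab)
        H.add_edge(c, b, weight=w_cd)
        edges = list(H.edges())
        done += 1
    return H
```

Note that this swap preserves every node's degree and the multiset of link weights, so differences in φ after shuffling can be attributed to topology rather than to the weight distribution.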
Drift diffusion conflict modelling. Stroop data were preprocessed by removing outliers in terms of both RTs and accuracy. We excluded one participant from the analysis because of extremely poor performance in one condition (accuracy = 5%), resulting in a sample of 51 participants for subsequent analyses. RT outliers (8.76%) were removed whenever they exceeded ±3 MAD (181 ms) with respect to the median (774 ms) across all trials and participants. Accuracy and RT data across both conditions and across congruent and incongruent trials were compared using a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations). Then, we fitted the DCM to the RT and accuracy data of each participant and condition separately. The DCM is a computational model suited to conflict tasks such as the Stroop 45 , modeling participants' decision-making within the framework of drift-diffusion models 43 . In a drift-diffusion model, the decision process is usually modeled as a Wiener stochastic process X_t, as follows:

$$X_{t+\Delta t} = X_t + \mu(t) \cdot \Delta t + Z(t) \cdot \sigma \cdot \sqrt{\Delta t}, \tag{1}$$

where μ(t) is the drift of the diffusion process, Δt is the difference between two time points, σ is the standard deviation of the process, and Z(t) is a random variable following a standard Gaussian distribution N(0, 1). In this framework, a decision is made whenever the diffusion process reaches the upper or lower bound ±β. In the DCM, the decision process is modeled as the superimposition of a controlled ( C_t ) and an automatic ( A_t ) process, as follows:

$$X_t = C_t + A_t.$$

The average time-course of A_t is assumed to follow a rescaled Gamma density function with shape parameter θ > 1 and scale parameter τ, representing the time-course of the expected mean of A_t. Its amplitude, i.e. its maximum value, is referred to as α, which is positive in congruent trials and negative in incongruent trials.
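A single trial of such a diffusion process can be simulated with a simple Euler scheme (a sketch with illustrative parameter values; the drift is passed as a function of time so that a time-varying automatic component could be plugged in):

```python
import numpy as np

def simulate_ddm_trial(mu, sigma=4.0, beta=75.0, dt=1.0, max_t=3000, seed=None):
    """Euler simulation of X_{t+dt} = X_t + mu(t)*dt + Z*sigma*sqrt(dt),
    with Z ~ N(0, 1). `mu` is a function of time (the drift).

    Returns (response, decision time): response is +1 at the upper
    bound (+beta), -1 at the lower bound (-beta), 0 if no bound is
    reached within max_t.
    """
    rng = np.random.default_rng(seed)
    x, t = 0.0, 0.0
    while t < max_t:
        x += mu(t) * dt + rng.standard_normal() * sigma * np.sqrt(dt)
        t += dt
        if x >= beta:
            return +1, t
        if x <= -beta:
            return -1, t
    return 0, t
```

With a strong positive drift the process almost surely terminates at the upper bound, mirroring a fast correct response.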
The time-dependent drift μ_a of the automatic process is equal to the first derivative of the expected mean of A_t with respect to time t, which is:

$$\mu_a(t) = \alpha \cdot e^{-t/\tau} \cdot \left[ \frac{t \cdot e}{(\theta - 1) \cdot \tau} \right]^{\theta - 1} \cdot \left( \frac{\theta - 1}{t} - \frac{1}{\tau} \right).$$

For a more detailed description of the model, see Ulrich et al. 45 . The model was fitted to the RT and accuracy data of each participant, separately for the two conditions, coding the upper bound as the correct response and the lower bound as the incorrect response. The loss function was the root mean square error (RMSE) between the cumulative distribution function (CDF) of the RTs and the conditional accuracy function (CAF), i.e. the proportion of correct responses to targets for different percentiles of the RT distribution, of the data simulated from the DCM against the CDF and CAF of the single-subject data. The loss function was minimized using the Nelder-Mead optimizer, with a maximum of 200 iterations 72 . We estimated 7 parameters: the amplitude of the automatic process ( α ), the time-to-peak of the automatic process ( τ ), the drift of the controlled process ( δ ), the decision boundary ( β ), the mean and the standard deviation of the non-decisional component, and the shape parameter of the starting-point distribution. The starting point itself was kept fixed and the drift rate was constant across trials. We also fixed the standard deviation of the diffusion process to 4 and the shape parameter of the automatic process to 2. To explore plausible starting points for the optimization process, the DCM was fitted to each participant's data using 5000 parameter sets randomly generated from a uniform distribution, with 5000 simulated trials each (see Table 2 in Supplementary Materials for the minimum and maximum values). We then took the 15 best parameter sets from this initial search (lowest RMSE) and reran the DCM fit 3 times with 10,000 simulated trials, to avoid local minima 73 . After the process was completed, we kept the single best-fitting parameter set for each participant and condition.
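The CDF/CAF-based loss might look as follows (a simplified stand-in for the paper's exact loss function; the quantile probabilities and the number of CAF bins are illustrative assumptions):

```python
import numpy as np

def caf(rt, correct, n_bins=5):
    """Conditional accuracy function: proportion of correct responses
    within successive RT percentile bins."""
    order = np.argsort(rt)
    bins = np.array_split(np.asarray(correct)[order], n_bins)
    return np.array([b.mean() for b in bins])

def cdf_quantiles(rt, probs=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """RT distribution summarized at a few CDF quantiles."""
    return np.quantile(rt, probs)

def rmse_loss(sim_rt, sim_correct, obs_rt, obs_correct):
    """RMSE between simulated and observed CDF quantiles and CAF values;
    in this sketch the two parts are simply concatenated, not rescaled."""
    diff = np.concatenate([
        cdf_quantiles(sim_rt) - cdf_quantiles(obs_rt),
        caf(sim_rt, sim_correct) - caf(obs_rt, obs_correct),
    ])
    return float(np.sqrt(np.mean(diff ** 2)))
```

A loss of this shape would then be handed to a Nelder-Mead optimizer over the free DCM parameters.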
We analyzed only 4 of the estimated parameters ( α , τ , δ , β ), since these were the ones with a cognitive interpretation relevant to the aims of this study. Finally, we used a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations) to investigate the statistical differences between conditions for each selected parameter, with Cohen's d as a measure of effect size. These analyses were conducted in R using the DCMfun library 74 ; inferential statistics and data visualization were implemented in Python.
Stroop mouse trajectory analysis. Stroop data were also analyzed in terms of mouse trajectories. Trajectories were first selected according to the RT and accuracy exclusion criteria, including only correct trials. We extracted the actual motion from the raw trajectories by selecting only the data points in which the cursor actually moved along the x and y coordinates. Then, we time-normalized the trajectories using linear interpolation, so that all trajectories had the same number of data points (n = 101). Spatial alignment was performed so that all trajectories shared the same initial and ending points, according to the following equation:

$$c'_t = c_{start} + (c_t - c_{t_0}) \cdot \frac{c_{end} - c_{start}}{c_T - c_{t_0}},$$

where c is either the x or y coordinate time series, c_{t_0} and c_T are the first and last time points, and c_{start} and c_{end} are the desired initial and ending points. After these preprocessing steps, we computed the AUC between the preprocessed trajectories and the optimal path going from the start button to the target box. We also computed the PE 75 separately for the x and y coordinate time series, with an embedding dimension of 5 and a time delay of 1. Similarly to PE, we computed D as the number of changes in the direction of the cursor along the x and y coordinates. ED was computed as the Euclidean distance between consecutive pairs of coordinates along the whole trajectory, while V is the ED divided by the time difference between consecutive points. Moreover, we applied a GMM 76 , an unsupervised machine learning algorithm that finds clusters in the data, to estimate macro-states in the mouse trajectories, using the scikit-learn library 77 . We ran a series of GMMs in order to select the best number of clusters, using the Expectation-Maximization (EM) algorithm with a maximum of 1000 iterations and a tolerance criterion of 0.001.
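The time normalization and spatial alignment steps can be sketched as follows (a hypothetical helper; the start and end points and the example trajectory are illustrative):

```python
import numpy as np

def preprocess_trajectory(x, y, n_points=101,
                          start=(0.0, 0.0), end=(1.0, 1.0)):
    """Time-normalize a mouse trajectory to n_points samples by linear
    interpolation, then linearly rescale each coordinate so that every
    trajectory shares the same start and end points."""
    t_old = np.linspace(0, 1, len(x))
    t_new = np.linspace(0, 1, n_points)
    coords = []
    for c, c_start, c_end in ((x, start[0], end[0]), (y, start[1], end[1])):
        c = np.interp(t_new, t_old, c)
        # Affine map sending the first sample to c_start and the last to c_end.
        c = c_start + (c - c[0]) * (c_end - c_start) / (c[-1] - c[0])
        coords.append(c)
    return np.column_stack(coords)
```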
The number of clusters varied from 2 to 10, since each trajectory had 101 time points and we wanted, in principle, to be able to observe all the transitions between states 78 . Model selection was performed using the Akaike Information Criterion (AIC), with the constraint that all clusters had to be present in each participant's trajectories. The winning GMM was the one with 4 clusters (see Supplementary Materials, Fig. 1). Then, we computed the transition matrices for each condition by computing the transition probabilities between each pair of states. Finally, we computed the DT as the average lifetime of each state in a trajectory. Statistical significance was assessed with a two-tailed paired-samples permutation t-test (α = 0.05, 10,000 iterations), and Cohen's d was used as a measure of effect size.
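The GMM selection over 2-10 clusters and the transition-probability computation might be sketched with scikit-learn as follows (synthetic data; the paper's additional constraint that every cluster appears in each participant's trajectories is omitted):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D samples standing in for preprocessed trajectory points:
# four well-separated clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (0, 2, 4, 6)])

# Fit GMMs with 2-10 components via EM and pick the AIC-minimizing model.
models = {
    k: GaussianMixture(n_components=k, max_iter=1000, tol=1e-3,
                       random_state=0).fit(data)
    for k in range(2, 11)
}
best_k = min(models, key=lambda k: models[k].aic(data))
states = models[best_k].predict(data)

def transition_matrix(seq, n_states):
    """Empirical transition probabilities between consecutive states;
    rows with no outgoing transitions are left at zero."""
    T = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        T[a, b] += 1
    row = T.sum(axis=1, keepdims=True)
    return T / np.where(row == 0, 1, row)
```

The DT would then follow from the average run length of each state label along a trajectory.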

Data availability
The datasets generated and analysed during the current study are not publicly available since participants did not provide explicit written consent regarding the sharing of their data on public repositories, but are available from the corresponding author on reasonable request. The aggregated data displayed in the tables and figures are provided at the following link: https://osf.io/7d42a/.