A generalized reinforcement learning based deep neural network agent model for diverse cognitive constructs

Human cognition is characterized by a wide range of capabilities including goal-oriented selective attention, distractor suppression, decision making, response inhibition, and working memory. Much research has focused on studying these individual components of cognition in isolation, whereas in many translational applications for cognitive impairment, multiple cognitive functions are altered in a given individual. Hence it is important to study multiple cognitive abilities in the same subject or, in computational terms, to model them using a single model. To this end, we propose a unified, reinforcement learning-based agent model comprising systems for representation, memory, value computation, and exploration. We successfully modeled the aforementioned cognitive tasks and show how individual performance can be mapped to model meta-parameters. This model has the potential to serve as a proxy for cognitively impaired conditions and can be used as a clinical testbench on which therapeutic interventions are simulated before being delivered to human subjects.


S1. The Experimental Tasks
We followed the BrainE experimental paradigm for our simulation study and modeled various tasks used in that framework 1 . The three tasks modeled are: Go Green (to assess selective attention and response inhibition), Middle Fish (to assess distractor processing), and Lost Star (to assess working memory) (Figure S 1). Cognitive task details are elaborated in the sections below.

Experimental Paradigm and Tasks
We modeled three BrainE platform tasks. The Go Green task was used to study both selective attention and response inhibition. The Middle Fish task was used to study distractor processing, and the Lost Star task was used to assess working memory.

Go Green -Selective Attention
Selective attention is tested when the majority of the stimuli are distractors and only a few are targets. In this case, one needs to focus on the relevant target stimuli and respond as rapidly as possible. For evaluating selective attention, the Go Green task was modeled with a test set consisting of 33% green-colored rocket images and 67% rocket images of other colors. Five different colored rocket images were trained with appropriate reward processing based on action selection. In this task, when the image of a green-colored rocket is displayed on the screen, the action 'GO' should be selected, whereas for all other colored rocket images, the action 'NO GO' should be selected. During each test trial, a cue is presented for an interval of 500 ms (milliseconds), followed by the image of the colored rocket for 100 ms. There is a response window of 1 sec and a feedback window of 200 ms. There is an inter-trial interval (ITI) of 500 ms between each trial and the next.

Go Green -Response Inhibition
Response inhibition is tested when the majority of the stimuli are targets and only a few are distractors. In this case, one needs to respond to the stimuli most of the time while detecting the occasional distractors and inhibiting the response to them. For evaluating response inhibition, the Go Green task was modeled with a test set consisting of 67% green-colored rocket images and 33% rocket images of other colors. Five different colored rocket images were trained with appropriate reward processing based on action selection. As before, when the image of a green-colored rocket is displayed on the screen, the action 'GO' should be selected, whereas for all other colored rocket images, the action 'NO GO' should be selected. During each test trial, a cue is first presented for an interval of 500 ms, followed by the image of the colored rocket for 100 ms. There is a response window of 1 sec and a feedback window of 200 ms. Between each trial and the next, there is an ITI of 500 ms. A minimal sketch of the trial structure shared by the two Go Green conditions is given below.
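The following sketch illustrates the shared Go Green trial protocol; the time constants and target proportions come from the text, while the non-green color names and the trial generator itself are illustrative assumptions.

```python
# Hedged sketch of the Go Green trial protocol; timings (ms) follow the text.
import random

TIMING_MS = dict(cue=500, stimulus=100, response=1000, feedback=200, iti=500)
COLORS = ["green", "red", "blue", "yellow", "purple"]  # green = target

def make_trials(n, p_target):
    """p_target=0.33 probes selective attention; 0.67 probes inhibition."""
    trials = []
    for _ in range(n):
        if random.random() < p_target:
            trials.append("green")                    # correct action: 'GO'
        else:
            trials.append(random.choice(COLORS[1:]))  # correct: 'NO GO'
    return trials

attention_block = make_trials(100, p_target=0.33)
inhibition_block = make_trials(100, p_target=0.67)
```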

Middle Fish -Distractor Processing
For evaluating distractor processing, the Middle Fish task was used. In this task, four different images of fish were used to train the model with appropriate rewards. The image with a left-facing (right-facing) fish in the center was rewarded with 1 when the action 'LEFT' ('RIGHT') was selected. Each image consisted of 11 fish, one in the middle and ten others surrounding it. The decision is made based on the middle fish's orientation (left/right), irrespective of the orientation of the surrounding fish, which act as distractors. During each test trial, a cue is presented for 500 ms at the start, followed by the array of fish for 100 ms. The images are categorized as all left, middle left, all right, and middle right. All left and all right images have congruent distractors, while middle left and middle right images have incongruent distractors. To win the reward, for all left and middle left fish images, the action 'LEFT' should be taken, and for all right and middle right images, the action 'RIGHT' should be taken. Post-stimulus presentation, a response window of 1 sec, a feedback window of 200 ms, and an ITI of 500 ms are programmed between successive trials. The stimulus categories and reward rule are sketched below.
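The sketch below encodes the four stimulus categories and the reward rule; the category names follow the text, while the dictionary encoding is our assumption.

```python
# Illustrative encoding of the Middle Fish stimuli; only the middle fish's
# orientation determines the correct action.
STIMULI = {
    "all_left":     dict(middle="LEFT",  flankers="LEFT"),   # congruent
    "middle_left":  dict(middle="LEFT",  flankers="RIGHT"),  # incongruent
    "all_right":    dict(middle="RIGHT", flankers="RIGHT"),  # congruent
    "middle_right": dict(middle="RIGHT", flankers="LEFT"),   # incongruent
}

def reward(stimulus, action):
    """Reward 1 when the action matches the middle fish's orientation."""
    return 1.0 if action == STIMULI[stimulus]["middle"] else 0.0
```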

Lost Star -Working Memory
For evaluating working memory, the Lost Star task was used. In this task, a perceptual image needs to be stored in memory for a brief period, and on presentation of the probe image, an action needs to be taken based on whether the location of the star in the probe image matches that of one of the stars in the perceptual image. One of the 'YES' or 'NO' actions is selected based on the match. A cue is presented for 500 ms at the start of the trial, followed by the perceptual image consisting of four stars at random positions on the screen. This test image is presented for one second, followed by a waiting period of three seconds. Then a probe image consisting of a single star is presented for one second. After this, an action must be selected during the response window of 1 sec, followed by a feedback window of 200 ms. Between successive trials, there is an ITI of 500 ms. Eight levels of this task were carried out 1 depending on the number of stars in the test image. Of these, our computational study considered only one level (level 0.5: four stars in the perceptual image), as no comprehensive analysis was carried out with respect to variation in levels. A sketch of the trial structure is given below.
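The following sketch generates one Lost Star trial; the timing follows the text, while the 4x4 grid of candidate star locations is an illustrative assumption.

```python
# Hedged sketch of a Lost Star trial: four star locations must be retained
# over a 3 s delay and compared against a single-star probe.
import random

LOCATIONS = [(r, c) for r in range(4) for c in range(4)]  # assumed 4x4 grid

def lost_star_trial():
    stars = random.sample(LOCATIONS, 4)   # perceptual image, shown for 1 s
    # ... 3 s delay: the agent must hold `stars` in working memory ...
    if random.random() < 0.5:             # matching probe half the time
        probe, answer = random.choice(stars), "YES"
    else:
        probe = random.choice([l for l in LOCATIONS if l not in stars])
        answer = "NO"
    return stars, probe, answer
```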
We also consider a few additional tasks from the literature, described below.

N-Back -Working Memory
N-back is a continuous working memory task in which the subject has to respond with a "yes" when a target stimulus is encountered and with a "no" when a non-target stimulus is encountered. A representation of the N-back task used in our model is shown in Figure S 2. A target stimulus is one that was also presented N timesteps back 2 . The correct response is rewarded. During each test trial, a cue is first presented for an interval of 500 ms, followed by the image of the stimulus for 100 ms. There is a response window of 1 sec and a feedback window of 200 ms. Between each trial and the next, there is an ITI of 500 ms. The target rule is sketched below.
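A minimal statement of the n-back target rule, with an illustrative letter stream:

```python
# A stimulus is a target if it matches the stimulus presented n steps back.
def nback_targets(stream, n):
    return [i for i in range(n, len(stream)) if stream[i] == stream[i - n]]

print(nback_targets(list("ABABCACA"), 2))   # -> [2, 3, 6, 7]
```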

Figure S 2.
Overview of the A) N-Back tasks 2 and the B) 2x5 tasks 4 used to evaluate working memory and sequence processing properties.

Sequential 2x5 task
2x5 is a sequential movement task that requires learning and retention of sequences 4 . The input contains a series of five 4x4 grids with two cells highlighted. After each 4x4 grid is presented, the subject has to discover the desired sequence of two button presses. Each 4x4 input grid is termed a 'set,' and the sequence of five sets is called a 'hyper set.' When the subject gives an incorrect response, the process is terminated without proceeding to the next set (see the sketch below). We also implemented two tasks (T-maze and Grid world) with state transitions in a discrete action space. For both T-maze and Grid world, the agent starts from a particular start location and traverses through intermediate states until it reaches a goal state, as shown in Figure S 3. In the case of the T-maze task, comparatively complex decision-making is involved near the terminal states, as a wrong choice of action will take the agent further away from the target. In the Grid world task, we introduced walls in between, and the agent has to avoid bumping into them. In both tasks, the reward is given only when the agent reaches the terminal state.
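The sketch below generates one 2x5 'hyper set'; the generator and grid encoding are illustrative assumptions.

```python
# Illustrative 2x5 hyper-set generator: five 'sets' of two highlighted cells
# on a 4x4 grid; the correct press order within each set must be discovered.
import random

def make_hyperset():
    cells = [(r, c) for r in range(4) for c in range(4)]
    return [random.sample(cells, 2) for _ in range(5)]   # five sets

hyperset = make_hyperset()
# An incorrect press order terminates the attempt at the current hyper set
# without proceeding to the next set.
print(hyperset[0])
```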

S2. The Architecture
A detailed overview of the model architecture is given in the main manuscript. As discussed there, the model has five distinctly identifiable components.

S2.1 The Representational System (RS)
The input image is presented at the input layer of the RS module (Figure S 4). A convolutional operation is performed on the input image using 16 feature maps, each with a 3x3 kernel. The convolutional layers are followed by max-pooling layers that downsample the feature maps by 2x2.
After four such stages of convolution and max-pooling, the resultant outputs are mapped to a fully connected layer of size 64x1, which constitutes the output of the RS encoder module. At the decoder end, the 1x64 feature output is expanded through a fully connected layer, followed by unpooling and deconvolutional layers, at the end of which the original image is reconstructed.
Once the image is perfectly reconstructed, the feature vector output from the fully connected layer of the encoder part of RS is provided as the input to the memory system (MS).

Figure S 4.
Representational System (RS), with an input and output layer, four convolutional and pooling layers, and a fully connected layer on the encoder side. The decoder part consists of a fully connected layer and four unpooling and deconvolutional layers. When the input image is satisfactorily reconstructed at the output layer, the encoder output taken from the fully connected layer is passed as input to the Memory System (MS).
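A minimal sketch of the RS autoencoder in PyTorch: the 16 feature maps, 3x3 kernels, four conv/pool stages, and 64-unit bottleneck come from the text, while the 64x64 grayscale input size, activations, and decoder kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RepresentationalSystem(nn.Module):
    """Convolutional autoencoder; the 64-d bottleneck feeds the MS."""
    def __init__(self):
        super().__init__()
        enc, ch_in = [], 1
        for _ in range(4):                       # four conv + max-pool stages
            enc += [nn.Conv2d(ch_in, 16, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2)]             # 2x2 down-sampling
            ch_in = 16
        self.encoder = nn.Sequential(*enc)
        self.to_feature = nn.Linear(16 * 4 * 4, 64)    # 64x1 feature vector
        self.from_feature = nn.Linear(64, 16 * 4 * 4)
        dec = []
        for _ in range(3):                       # mirror the encoder stages
            dec += [nn.ConvTranspose2d(16, 16, 2, stride=2), nn.ReLU()]
        dec += [nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.to_feature(self.encoder(x).flatten(1))  # passed on to MS
        h = self.from_feature(z).view(-1, 16, 4, 4)
        return self.decoder(h), z

recon, feat = RepresentationalSystem()(torch.rand(8, 1, 64, 64))
```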

S2.1.1 Flip Flop
The neurons of the memory system (MS) are modeled using J-K flip-flops. The truth table of the J-K flip-flop is as follows: if J=K=0, the previous output is retained; if J=K=1, the output is toggled; if (J,K)=(0,1), the output is 0; and if (J,K)=(1,0), the output is 1.
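A minimal sketch of this truth table as a binary flip-flop update; how the model maps continuous neural inputs onto binary J/K signals is not shown here.

```python
def jk_flipflop(q, j, k):
    """One update of a binary J-K flip-flop with state q and inputs j, k."""
    if j == 0 and k == 0:
        return q          # hold the previous output
    if j == 1 and k == 1:
        return 1 - q      # toggle
    return j              # (0,1) -> reset to 0, (1,0) -> set to 1

state = 0
for j, k in [(1, 0), (0, 0), (1, 1), (0, 1)]:
    state = jk_flipflop(state, j, k)
    print(state)          # prints 1, 1, 0, 0
```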

S3. Backward Propagation
The weights between the various modules are updated using the backpropagation algorithm. The learning mechanism in this model is broadly divided into three parts, corresponding to the MS1→AS, MS1→VC, and RS→MS1/MS2 weight updates described below. The input to the memory system comes from the representational system (RS) block; MS1/MS2 are the sub-blocks of the Memory System (MS) block, which forward the data based on the modulatory signal from VC.

MS1→AS weight update (Q-learning)
The weight update between the MS1 and AS blocks is governed by Q-learning. The Q table is updated based on the expected and actual rewards. The expected reward is high (1) when, for a given state, the desired action is selected. For example, in the case of the Go Green task, the 'GO' action is desired when a green rocket is presented, and the 'NO GO' action is desired when other colored rockets are presented. The Action Selection (AS) block is modeled as a race model: based on the inputs and the D1→AS weights, one of the two possible actions is selected. Depending on the action, the actual reward will be either 1 or 0. Based on the expected and actual rewards, the weights are updated by backpropagation using Q-learning.
The temporal difference error $\delta$ is used to update the weights between the neurons of the D1 block and the AS block, as shown in Equations (7 to 9). $\delta$ is calculated using Equation 7:

$$\delta(t) = r(t) + \gamma \max_{a} Q_{t+1}(s, a) - Q_t(s, a) \qquad (7)$$

Here $r(t)$ is the reward obtained for selecting the particular action at that time instant, $Q_t(s, a)$ is the Q value obtained for a particular state-action pair, and $Q_{t+1}(s, a)$ represents the future rewards. A discount factor $\gamma$ is applied to the value that maximizes over all possible future actions. The gradient in weight between D1 and AS is found by multiplying this temporal difference error with the output of the D1 block, $y_{D1}$, and applying a learning rate $\eta$, as shown in Equation 8:

$$\Delta w_{D1 \to AS} = \eta \, \delta(t) \, y_{D1} \qquad (8)$$
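A minimal tabular sketch of this update under Equations 7 and 8; the learning rate, discount factor, and state/action encoding are illustrative assumptions.

```python
import numpy as np

eta, gamma = 0.1, 0.9                # assumed learning rate and discount
n_states, n_actions = 5, 2           # e.g., five rocket colors; GO / NO GO
W = np.zeros((n_states, n_actions))  # D1->AS weights acting as a Q table

def td_update(s, a, r, s_next):
    """delta = r + gamma * max_a' Q(s',a') - Q(s,a);  W[s,a] += eta*delta."""
    delta = r + gamma * np.max(W[s_next]) - W[s, a]
    W[s, a] += eta * delta
    return delta

# One Go Green trial: state 0 = green rocket, action 0 = 'GO', reward 1.
# With no action-dependent transitions, the next state is the next stimulus.
td_update(s=0, a=0, r=1.0, s_next=1)
```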

MS1 block → VC weight update
Cortico-striatal weights are updated using $\Delta w_{D1 \to VC}$, the gradient of the error $\varepsilon$ with respect to the weights between the D1 and VC blocks. The error $\varepsilon$ is defined as the difference between the reward obtained for the selected action and the value function, as mentioned in Equation 11 of the main manuscript. The computation of the value function is also shown there: the outputs of the D1 block are combined over connections with weights $w_{D1 \to VC}$ to obtain the value function (VF).
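A sketch of this update as a delta-rule step on the value weights; the learning rate and the 64-d D1 output dimension are assumptions on our part.

```python
import numpy as np

eta = 0.05                           # assumed learning rate
w_vc = np.zeros(64)                  # D1 -> VC weights (assumed 64-d input)

def value_update(d1_out, r):
    v = w_vc @ d1_out                # value function VF (main-text Eq. 11)
    err = r - v                      # reward-prediction error
    w_vc[:] += eta * err * d1_out    # gradient step on the squared error
    return err

value_update(np.random.rand(64), 1.0)
```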

RS → MS1/MS2 blocks weight update
The MS1/MS2 sub-blocks that comprise the MS are modeled using flip-flop neurons, which facilitate the delay and memory properties.

S3.1 Performance Assessment
The model performance is assessed in terms of accuracy, reaction time, speed, consistency, and efficiency. The section below shows the effect of parameter tuning on performance.
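A hedged sketch of how such metrics can be computed from trial outcomes; the exact composite formulas used in the study are not reproduced here, so the speed, consistency, and efficiency expressions below (inverse reaction time, low RT variability, accuracy times speed, following Section S7) are our assumptions.

```python
import numpy as np

def metrics(correct, rts):
    """correct: 0/1 trial outcomes; rts: reaction times in seconds."""
    accuracy = float(np.mean(correct))
    rt = float(np.mean(rts))
    speed = 1.0 / rt                         # inversely related to RT
    consistency = 1.0 - np.std(rts) / rt     # low RT variability -> high value
    efficiency = accuracy * speed            # composite accuracy-speed metric
    return dict(accuracy=accuracy, rt=rt, speed=speed,
                consistency=consistency, efficiency=efficiency)

print(metrics(np.array([1, 1, 0, 1]), np.array([0.45, 0.50, 0.62, 0.48])))
```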

S3.2 Selection of Learning Parameters and Model parameter tuning
In our current study, we examined the performance efficiency of our model with respect to variations in the lateral connectivity strength and the threshold parameter. The lateral connectivity strength in the explorer module and the threshold of the race model were the primary factors involved in learning. The explorer module is configured using a variant of Van der Pol oscillators known as the Liénard system of oscillators, which induces randomness into the circuit based on the input and the lateral and interconnectivity weights.
Two oscillatory subsystems, one excitatory and the other inhibitory, are connected back-to-back. Hence, we studied efficiency mainly with respect to these two parameters. We did not consider the interconnectivity weights here, to reduce complexity, because they have a comparatively smaller influence. The threshold parameter is set at the action selection block, which is configured using rate-coded neurons. The action selection block operates as a race model: whichever action neuron crosses the threshold first is selected as the winner. Apart from this, when we expand our model to introduce disease conditions, we will use the delta parameter, which is analogous to the dopamine component.
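A minimal sketch of a single Van der Pol (Liénard-type) oscillator of the explorer module, integrated with explicit Euler steps; the parameter values, input drive, and the omission of the excitatory-inhibitory coupling are illustrative simplifications.

```python
def vanderpol_step(x, y, mu=1.0, drive=0.0, dt=0.01):
    """One Euler step of x'' - mu*(1 - x^2)*x' + x = drive."""
    dx = y
    dy = mu * (1.0 - x * x) * y - x + drive
    return x + dt * dx, y + dt * dy

x, y, trace = 0.1, 0.0, []
for _ in range(5000):
    x, y = vanderpol_step(x, y)
    trace.append(x)        # quasi-periodic activity that drives exploration
```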

S4.1 Training Phase
As learning progresses, the value functions keep increasing and approach the value of 1, the maximum value that can be attained.

S4.1.1 N-Back and 2x5 Task Results
The model results for the N-back task are comparable to the experimental performance results 2 , while the 2x5 task shows the evaluation of a different type of task using the same modeling paradigm, with results comparable to the experimental results 3,4 . The performance of the N-back task was already discussed in the main manuscript. Figure S 9 shows the performance of the 2x5 task.
The representation of the N-back task simulated is shown in Figure S 2.

S4.2 Markov Decision Model for Experiments
The Markov decision model for the experiments is given below; the corresponding states, actions, and rewards are shown in Table 1. The working memory task involves a composite state space based on the initial stimulus and the subsequent matching stimulus (Figure S 11). However, here too, there are no state transitions depending on the action taken.
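A hedged sketch of the working memory task's composite state: the state pairs the retained star locations with the current probe, actions do not change the state, and the tuple encoding is our assumption.

```python
LOCATIONS = list(range(16))          # flattened star positions (assumed)
ACTIONS = ["YES", "NO"]

def reward(state, action):
    stars, probe = state             # composite state: (stored set, probe)
    match = probe in stars
    return 1.0 if (action == "YES") == match else 0.0

print(reward((frozenset({1, 5, 9, 12}), 5), "YES"))   # 1.0: probe matches
```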

S4.3 Stability and Convergence of Q Networks
In our current study, the three tasks were modeled as per the experimental paradigm considered. The Go Green, Middle Fish, and Lost Star tasks involved small discrete state spaces, and there were no action-based state transitions. For these tasks, we did not encounter any convergence or stability issues. However, with more complex networks involving large state spaces and action-dependent state transitions, we do observe convergence and stability issues.
We considered two tasks, T-maze and Grid world, that involve state transitions. To achieve faster convergence, we maintained a separate target Q-network, which was updated periodically every few epochs. For Q-networks involving large or continuous state spaces, stability issues arise in which the Q values never converge. In that case, the typical workaround is to maintain a target Q-network. The target Q-network lags behind the main Q-network; its parameters are not trained but are periodically copied from the main Q-network.
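A minimal sketch of this periodic target-network synchronization; the network sizes, state/action dimensions, and update period are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
target_net = copy.deepcopy(q_net)
for p in target_net.parameters():
    p.requires_grad = False          # the target network is never trained

SYNC_EVERY = 50                      # epochs between synchronizations
for epoch in range(200):
    # ... train q_net with TD targets computed from target_net ...
    if epoch % SYNC_EVERY == 0:      # lagged copy of the main Q-network
        target_net.load_state_dict(q_net.state_dict())
```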

S4.4 Model parameters and Average Performance
We selected the model parameters (threshold $t = 0.4$, $\epsilon = 0.05$), as the RMSE values were lowest with these parameters when the model performance was compared with the average performance of the experimental results. The RMSE values for the four tasks with model parameters $t = 0.4$ and $\epsilon = 0.05$ are given in Table 2 below.

S4.5 Model Parameter Estimation
We modeled the mapping between the experimental parameters (speed, consistency) and the model meta-parameters (threshold $t$, $\epsilon$) using a simple multilayer perceptron model (Figure S 14). The predicted and desired values of both the threshold and epsilon were closely matched.

Figure S 14.
Multi-layered perceptron model to predict the meta-parameters, with one input layer, two hidden layers, and one output layer. Speed and consistency are given as inputs to the model, and epsilon and threshold are predicted. H1,1 to H1,32 represent the neurons in the first hidden layer, and H2,1 to H2,32 represent the neurons in the second hidden layer.
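A sketch of this regressor in PyTorch matching the figure's layer widths; the activations, optimizer, loss, and training data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),     # H1,1 ... H1,32
    nn.Linear(32, 32), nn.ReLU(),    # H2,1 ... H2,32
    nn.Linear(32, 2),                # outputs: epsilon, threshold
)
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
x = torch.rand(64, 2)                # (speed, consistency) pairs (dummy)
y = torch.rand(64, 2)                # desired (epsilon, threshold) (dummy)
for _ in range(100):
    loss = nn.functional.mse_loss(mlp(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```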

S4.6 Correlation Between Tasks
The empirical performance results of the cognitive tasks indicate strong correlations between selective attention, response inhibition, and distractor processing cognitive functions. This is partly by design, as all of these tasks were set up as speeded stimulus-response tasks. The working memory task showed a very weak correlation with the performance of other tasks.
This likely relates to the different goal of the working memory task: identifying whether the probe matches a test template, rather than producing a particularly speeded response to the probe. Working memory processing also differs from the other core functions of selective attention, response inhibition, and distractor processing in terms of the distributed recruitment of brain activation 5-7 .

S5. RL Models of Higher-Order Learning in Humans
The multi-armed bandit is one of the RL tasks that can be used to evaluate higher-order learning in humans 8 . It involves problem solving and decision making. The task requires selecting the optimal action under a trade-off between acting on existing knowledge of the environment and exploring to update that knowledge, thereby increasing the chances of reward in the future. The task engages a variety of cognitive processes including memory, learning, attention, and executive functions, and has been linked to various disorders such as depression, addiction, and other neurological disorders. Similarly, the Iowa Gambling Task is another RL task used to assess impulsive decision making [9][10][11] .
These models can throw light on the underlying neural mechanisms responsible for the decision-making behavior of people with diverse cognitive profiles. However, such models are simplified representations of human cognition and may not capture all aspects of real-world decision-making. A minimal bandit sketch illustrating the explore-exploit trade-off is given below.
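A minimal epsilon-greedy multi-armed bandit; the arm reward probabilities and the epsilon value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_arms = [0.2, 0.5, 0.8]             # hidden reward probability of each arm
q = np.zeros(3)                      # running value estimate per arm
n = np.zeros(3)
eps = 0.1                            # exploration rate

for t in range(1000):
    if rng.random() < eps:
        a = int(rng.integers(3))     # explore: random arm
    else:
        a = int(np.argmax(q))        # exploit: current best estimate
    r = float(rng.random() < p_arms[a])
    n[a] += 1
    q[a] += (r - q[a]) / n[a]        # incremental mean update

print(q)                             # estimates approach [0.2, 0.5, 0.8]
```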
Various studies have shown that n-back task performance is closely related to intelligence, attention, and working memory 2,6,12-14 . The n-back tasks can be used to test various hypotheses related to working memory. For higher-order cognitive characteristics such as attention and inhibitory control, Go/No-Go tasks are quite useful. The task evaluates how accurately and quickly the subject responds to targets and suppresses responses to non-target stimuli. The T-maze task is another RL task involving decision making, in which the agent has to traverse multiple states and choose one of the actions, left or right, at the junction; the task requires the ability to store and retrieve information and take actions based on the rules. In the Grid world task, the agent has to navigate towards the goal while avoiding obstacles; it is a more complex task than the T-maze and includes aspects of attention, working memory, goal-oriented behavior, and navigation. The task set we have modeled, replicating the BrainE experimental tasks, covers cognitive aspects such as attention, distractor processing, inhibitory control, working memory, processing speed, and executive functions.

S6. Fragmented and Integrative approaches
Much of the research on cognition has followed a fragmented approach, where individual cognitive processes are studied in isolation. There is a long history of this kind of research in psychology and neuroscience, with different fields of study focusing on specific aspects of cognition such as attention, memory, perception, emotions, language, and reasoning [15][16][17][18][19] .
However, by focusing on individual cognitive functions, one can miss interrelated features, complex brain functions, and their diversity. It also becomes challenging to interpret and integrate findings across studies. The different characteristics of cognitive disorders, and the commonalities among them, have been summarized elsewhere 20 .
Our attempt is to integrate multiple aspects of cognition into a unified framework. We note that many simple cognitive paradigms are mostly studied in silos, both experimentally [21][22][23][24][25] and computationally [26][27][28][29][30][31][32] . A recent study by our group tested the efficiency of a battery of simple experimental cognitive tasks (on attention, distractibility, working memory, and emotion processing) and showed it is complex enough to robustly explain human behavior across the age and mental health spectrum 1,33 . One aim of our modeling effort is to simulate behavior across the age and mental health spectrum, and we start by modeling the tasks described in the BrainE battery. To summarize, the fragmented approach carries potential limitations such as lack of integration, oversimplification of complexity, overgeneralization, narrow focus, and difficulty in interpretation.

S7. Performance Metrics for cognitive assessment
Our modeling study simulated the experimental results obtained from 1 , which focus on the human ability to attend to stimuli relevant to immediate perceptual goals, suppress irrelevant information, and store and retrieve relevant information. In general, performance on these cognitive tasks is assessed by the ability to achieve the goal with a short reaction time and minimal errors. Several studies have used this kind of metric to analyze cognitive performance 2 . Reaction time variation and accuracy variation can be reliably estimated by introducing a composite metric, as discussed in previous works 15,34 . In this vein, we have included 'efficiency' as a performance metric. Consistency indicates the reliability of performance over different repetitions of the experiment; it is measured by the variance in speed, where speed is inversely correlated with reaction time.