Self reward design with fine-grained interpretability

The black-box nature of deep neural networks (DNNs) has brought the issues of transparency and fairness to attention. Deep Reinforcement Learning (Deep RL or DRL), which uses DNNs to learn its policies, value functions, etc., is thus subject to similar concerns. This paper proposes a way to circumvent these issues through the bottom-up design of neural networks with fine-grained interpretability, where each neuron or layer has its own meaning and utility corresponding to a humanly understandable concept. The framework introduced in this paper, called Self Reward Design (SRD), is inspired by Inverse Reward Design, and its interpretable design can (1) solve a problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human design, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed from standard NN components with few parameters. Furthermore, with our fish sale auction example, we demonstrate how SRD addresses situations where black-box models would not make sense, i.e. where humanly understandable, semantics-based decisions are required.


Introduction
Reinforcement Learning (RL) and Deep Neural Networks (DNNs) have recently been integrated into what is known as Deep Reinforcement Learning (DRL) to solve various problems with remarkable performance. DRL greatly improves the state of the art of control and, in the words of Sutton and Barto 1, learning from interaction. Among the well-known successes are (1) the Deep Q-Network 2, which enabled machines to play Atari games with incredible performance, and (2) AlphaGo, capable of playing the notoriously complex game of Go 3 at and beyond pro human level. Although DNNs have proven to possess great potential, they are black boxes that are difficult to interpret. To address this difficulty, various works have emerged, giving us a host of different approaches to eXplainable Artificial Intelligence (XAI); see surveys [4][5][6]. They have shed some light on the inner workings of a DNN, but there may still be large gaps to fill. Note that there is no guarantee that interpretability is even attainable, especially since context-dependent interpretability can be subjective.
In this paper, we propose Self Reward Design (SRD), a non-traditional RL solution that combines highly interpretable human-centric design with the power of DNNs. Our robot (or agent, used interchangeably) rewards itself through the purposeful design of its DNN architecture, enabling it to partially solve the problem without training. While the initial hand-designed solution might be sub-optimal, the use of trainable DNN modules allows it to be optimized. We show that deep-learning-style training might improve the performance of an SRD model or alter the system's dynamics in general.
This paper is arranged as follows. We start with clarifications. Then we briefly go through related works that inspired this paper. Next, our interpretable design and SRD optimization are demonstrated with a 1D toy example, RobotFish. We then extend our application to the Fish Sale Auction scenario, which we go through more extensively in the main text. This is followed by brief descriptions of how SRD can be used in a 2D robot in lavaland and in MuJoCo simulation (details in the appendix). We largely focus on the interpretability of design and on SRD training, although we also include concepts like unknown avoidance, imagination and continuous control. All code is available at https://github.com/ericotjo001/srd, where the link to our full data can be found.

This Paper Focuses Heavily on Interpretable Human Design
What exactly is this paper about? This paper comprises demonstrations of how some reinforcement learning (RL) problems can be solved in an interpretable manner through a self-reward mechanism. Our fish sale auction example also demonstrates an RL-like system that requires decision-making in RL style, but demands high human interpretability and is thus not fully compatible with standard RL optimization. Readers will be introduced to the design of different components tailored to different parts of the problems in a humanly understandable way.
Important note: significance and caveat. The paper has been reorganized to direct readers' focus to interpretable design, since reviewers tend to focus on standard RL practice instead of our main proposal, which is the interpretable design. This paper demonstrates the integration of a system that heavily uses human design augmented by NNs. Through this paper, we hope to encourage deep learning practitioners to develop transparent, highly interpretable NN-based reinforcement learning solutions, in contrast to standard DRL models with large black-box components. Our designs can be presented in a way that is meaningful down to the granular level. What we do not claim: we do NOT claim to achieve any extraordinary performance, although our systems are capable of solving the given problems.
But what is interpretability? While there may be many ways to talk about interpretability, interpretability in the context of this paper is fine-grained, i.e. we go all the way down to directly manipulating the weights and biases of DNN modules. DNN modules are usually optimized using gradient descent from random initialization, so the resulting weights are hard to interpret. In our SRD model, the meaning and purpose of each neural network component can be explicitly stated with respect to the environmental and model settings.
How do we compare our interpretability with existing explainable deep RL methods? Since we directly manipulate the weights and biases, our interpretability is at a very low level of abstraction, unlike post-hoc analysis, e.g. saliency 7, or semantically meaningful high-level specification such as reward decomposition 8. In other words, we aim to be the most transparent and interpretable system, allowing users to understand the model all the way down to its most basic units. Unfortunately, this means a numerical comparison of interpretability does not quite make sense.
Baseline. We believe comparing performance with other RL methods is difficult, since powerful DRL is likely to solve some problems very well w.r.t. some measure of accuracy. Furthermore, not only are such models often black boxes that do not work in a humanly comprehensible way, sometimes their reproducibility is not very straightforward 9. Most importantly, in the context of this paper, focusing on performance distracts readers from our focus on interpretability. If possible, we would like to compare levels of interpretability. However, quantitative comparison is tricky, and we are not aware of any meaningful way to quantify interpretability that can be compared with our proposed fine-grained model. So, what baseline should be used? The short answer is: there is no baseline to measure our type of interpretability. As previously mentioned, this is because our interpretability is fine-grained, as we directly manipulate weights and biases. In this sense, our proposed methods are already the most interpretable systems, since each basic unit has a specific, humanly understandable meaning. Furthermore, the setup of our auction experiments is not compatible with the concept of reward maximization used in standard RL, rendering comparisons of "quality" less viable. In any case, our 2D robot lavaland example achieves a high performance of approximately 90% accuracy, given 10% randomness to allow for exploration, which we believe is reasonable.
Related Works: From RL to Deep RL to SRD
RL with imperfect human design. An RL system can be set up by a human manually specifying the rewards. Unfortunately, human design can easily be imperfect, since the designers might not grasp the full extent of a complex problem. For RL agents, designers' manual specifications of rewards are fallible, subject to errors and problems such as the negative side effects of a misspecified reward 10 and reward hacking 11. Hadfield-Menell et al.'s inverse reward design (IRD) paper 12 addresses this problem directly: it allows a model to learn beyond what imperfect designers specify; also see the appendix regarding the Reward Design Problem (RDP). In this paper, the initialization of our models is generally also imperfect, although we use SRD to meaningfully optimize the parameters. Another example of our solution to the imperfect-designer problem is unknown avoidance, particularly w_unknown in the lavaland problem.
From RL to Deep RL to interpretable DRL. Not only is human design fallible, a good design may be difficult to create, especially for complex problems. In the introduction, we mentioned that DRL solves this by combining RL with the power of DNNs. However, DRL is a black box that is difficult to understand. Thus the study of explainable DRL emerged; RL papers that address explainability/interpretability problems have been compiled in survey papers 13,14. Saliency, a common XAI method, has been applied to visualize deep RL's mechanisms 7. Relational deep RL uses a relational module that not only improves the agent's performance on StarCraft II and Box-World, but also provides visualizations of the attention heads useful for interpretability 15. Other methods to improve interpretability include reward decomposition, in which each part of the decomposable reward is semantically meaningful 8; do refer to the survey papers for several other ingenious designs and investigations into the interpretability of deep RL.
In particular, explainable DRL models with manually specified, humanly understandable tasks have emerged. Programmatically Interpretable RL 16 (PIRL) is designed to find programmatic policies for semantically meaningful tasks, such as car acceleration or steering. Symbolic techniques can further be used for the verification of humanly interpretable logical statements. Automated search is performed with the help of an oracle, i.e. the interpretable model "imitates" the oracle, eventually resulting in an interpretable model with comparable performance. Their interpretable policies consist of clearly delineated logical statements with some linear combinations of operators and terms. However, the components compounded through automatic searches might yield convoluted policies, possibly resulting in some loss of interpretability. By contrast, our model achieves interpretability through the activation strengths of semantically meaningful neurons, which should maintain their semantics assuming no extreme modification is performed.
Multi-task RL 17 uses a hierarchical policy architecture so that its agent is able to select a sequence of humanly interpretable tasks, such as "get x" or "stack x", each with its own policy π_k. Each sub-policy can be separately trained on a sub-scenario and later integrated into the hierarchy. Each sub-policy itself might lose some interpretability if the sub-problem is difficult enough to require deep learning. By contrast, each of our neurons will maintain its meaning regardless of the complexity, although our neural network could likewise become too complex for difficult problems.
From interpretable DRL to SRD. Our model is called self reward design because our robot computes its own reward, similar to DRL's computation of Q-values. However, human design is necessary to put constraints on how self-rewarding is performed so that interpretability is maintained. In our SRD framework, the human designer has the responsibility of understanding the problems, dividing them into smaller chunks and then finding the relevant modules to plug into the design in a fully interpretable way (see our first example, in which we use a convolution layer to create the food location detector). We intend to take interpretable DRL to the extreme by advocating the use of very fine-grained, semantically meaningful components.
Other relevant concepts include, for example, self-supervised learning. DRL like the Value Prediction Network (VPN 18) is self-supervised. An exploration-based RL algorithm is applied, i.e. the model gathers data in real time for training and optimization on the go. Our model is similar in this aspect. Unlike VPN, however, our design avoids all the abstraction of DNNs, i.e. ours is interpretable. Our SRD training is also self-supervised in the sense that we do not require datasets with ground-truth labels. Instead, we induce semantic bias via interpretable components to achieve the correct solutions. The components possess trainable parameters, just like DRL, and we demonstrate that our models are thus also able to achieve high performance. Our pipeline includes several rollouts of possible future trajectories, similar to existing RL papers that use imagination components, with the following differences. Compared to uncertainty-driven optimization 19 towards a target value ŷ_i = r_i + γQ_{i+1} (heavily abbreviated), SRD (1) is similar in that the agent updates on every imaginary sample available, but (2) has a different, context-dependent loss computation.

Novelty and Contributions
We introduce a hybrid solution to reinforcement learning problems: SRD is a deliberately interpretable arrangement of trainable NN components. Each component leverages the tunability of parameters that has helped DNNs achieve remarkable performance. The major novelty of our framework is its full interpretability through the component-wise arrangement of our models. Our aim is to encourage the development of RL systems with both high interpretability and optimizability via the tunable parameters of NN components.
General framework with interpretable components for specialized contexts. In general, SRD uses a neural network for the robot/agent to choose its actions, plus a prefrontal cortex as the mechanism for self reward; this is somewhat comparable to the actor-critic setup. There is no further strict cookie-cutter requirement for SRD, since fine-grained interpretability (as defined previously) may need different components. Generally, however, the components only differ in terms of arrangement, i.e. existing, well-known components such as convolution layers are arranged in meaningful ways. In this paper, we present the following interpretable components: 1. Stimulus-specific neurons. With a combination of manually selected weights for convolutional kernels or fully connected layers plus activation functions (e.g. selective or threshold), each neuron is designed to respond to a very specific situation, thus greatly improving interpretability, e.g. the Food Location Detector in robot fish, eq. 1. Named neurons such as fh, ft in the robot fish example and PG, SZ in the fish sale auction example ensure that the role and meaning of these neurons are fully understandable.
2. The self-reward design (SRD). In a mammalian brain, the prefrontal cortex (PFC) manages internal decisions 20. Our models are equipped with interpretable PFC modules designed to provide interpretability, e.g. we explicitly see why the robot fish decides to move rather than eat. We demonstrate how different loss functions can be tailored to the different scenarios.
3. Other components, such as ABA and DeconvSeq for the lavaland problem, can be found in the appendix.
So, what is the benefit of our model? Interpretability and efficiency. We build our interpretable designs based on existing DNN modules, i.e. we leverage the tunable parameters that make Deep RL work. The aim is to achieve both interpretability and good performance without arbitrary specification of rewards (as in pre-DRL models). Furthermore, targeted design is efficient. Our SRD models only use standard DNN modules such as convolution (conv), deconvolution (deconv) and fully-connected (FC) layers with few trainable parameters (e.g. only 180 parameters in Robot2NN). With a proper choice of initial parameters, we can skip the long, arduous training and optimization processes that are usually required by DRL models to learn unseen concepts. We trade off the time spent on training algorithms with the time spent on human design, thus addressing what is known as sample inefficiency 21 (the need for a large dataset, hence long training time) in a human-centric way.
Figure 1. Robot Fish setting. The fish is half full/hungry, F = 0.5. Left: no food. Middle: food on env_3, thus the "food there" neuron lights up. Right: food on env_1, thus the "food here" neuron lights up.

Robot fish: 1D toy example
Problem setting. To broadly illustrate the idea, we start with a one-dimensional model, Fish1D, with a Fish Neural Network (FishNN) deliberately designed to survive the simple environment. Robot Fish1D has energy, represented by a neuron labelled F. Energy diminishes over time; if it reaches 0, the fish dies. The environment is env = [e_1, e_2, e_3], where e_i = 0.5 indicates there is food at position i and e_i = 0 indicates no food. The fish is always located in the first block of env, fig. 1(A). In this problem, the 'food here' scenario is env = [0.5, 0, 0], which means the food is near the fish. Similarly, the 'food there' scenario is env = [0, 0.5, 0] or env = [0, 0, 0.5], which means the food is somewhere ahead and visible. The 'no food' scenario is env = [0, 0, 0].
Fish1D's actions. (1) 'eat': recover energy F when there is food in the fish's current position. (2) 'move': movement to the right. In our implementation, 'move' causes env to be rolled left. If we treat the environment as an infinite roll tape and env as the fish's vision of the 3 immediately visible blocks, then food is available every 5 blocks.
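The tape dynamics above can be sketched in a few lines of plain Python (a minimal sketch; the helper names are ours, not from the reference implementation):

```python
TAPE = [0.5, 0.0, 0.0, 0.0, 0.0]  # one tape period: food every 5 blocks

def view(tape):
    # env: the fish's vision over the 3 immediately visible blocks
    return tape[:3]

def move(tape):
    # 'move' rolls the environment one block to the left
    return tape[1:] + tape[:1]

env = view(TAPE)  # [0.5, 0, 0], the 'food here' scenario
```

Starting from 'food here', one 'move' leaves the fish with no visible food, and the food only reappears in view after the tape has rolled far enough.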
How do we design an interpretable component of a neural network? First, we want the fish to be able to distinguish the 3 scenarios previously defined: food here, food there and no food. Suppose we want a neuron that strongly activates when there is food at the fish's position (name it the food here neuron, fh), another neuron that strongly activates when there is food somewhere ahead (name it the food there neuron, ft), and we want to represent the no-food scenario as 'neither fh nor ft responds'. How do we design a layer with two neurons with the above properties? We use a 1D convolution layer and a selective activation function σ_sa(x) = ε/(||x||² + ε), as follows.
The fh and ft neurons. Define the activation of the fh neuron as a_fh = σ_sa[conv_fh(env)], where conv_fh is a Conv1D with weight array w_fh = [1, 0, 0] and bias b_fh = −0.5. When there is food near the fish, we get y_fh = conv_fh(env) = [1, 0, 0] * [0.5, 0, 0] − 0.5 = 0, where * denotes the convolution operator, so a_fh = σ_sa(y_fh) = 1. This is a strong neuron activation because, by design, the maximum value of the selective activation function is 1. We are not done yet. Similar to a_fh above, define a_ft. The important task is to make sure that when 'there is food there but NOT HERE', a_ft activates strongly but a_fh does not. The weights are w_ft = [0, 1, 1], b_ft = −0.5. Together, they form the first layer, called the Food Location Detector (FLD). Interpretable FishNN. To construct the neural network responsible for the fish's actions (eat or move), we need one last step: connecting these neurons plus the fish's internal state (energy), altogether [a_fh, a_ft, F], to the action output vector [eat, move] ≡ [e, m] through an FC layer, as shown in fig. 2, blue dotted box. The FC weights are chosen meaningfully, e.g. 'eat when hungry and there is food', and to avoid scenarios like 'eat when there is no food'. This is interpretable through manual weight and bias setting.
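The FLD computation can be sketched with plain dot products in place of torch.nn.Conv1d (a hedged sketch: the bias sign follows the worked computation w·env − 0.5 above, and ε = 0.1 is our arbitrary choice, not a value from the paper):

```python
EPS = 0.1  # epsilon of the selective activation; exact value is our assumption

def sigma_sa(x):
    # selective activation sigma_sa(x) = eps / (|x|^2 + eps); peaks at 1 at x == 0
    return EPS / (x * x + EPS)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# weights and biases of the Food Location Detector, following the text
W_FH, B_FH = [1.0, 0.0, 0.0], -0.5
W_FT, B_FT = [0.0, 1.0, 1.0], -0.5

def fld(env):
    a_fh = sigma_sa(dot(W_FH, env) + B_FH)  # 'food here' neuron
    a_ft = sigma_sa(dot(W_FT, env) + B_FT)  # 'food there' neuron
    return a_fh, a_ft
```

On the three scenarios, fh saturates at 1 only for 'food here', ft only for 'food there', and neither reaches 1 when there is no food.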

Is FishNN's decision correct?
The prefrontal cortex (PFC) decides whether FishNN's decision is correct. The PFC is shown in fig. 2, green dotted box. The name 'PFC' is only borrowed from neuroscience to reflect our idea that this part of FishNN is associated with internal goals and decisions, similar to a real brain 20. How do we construct the PFC? First, define the threshold activation as τ(x) = Tanh(LeakyReLU(x)). Then the PFC is deliberately constructed in the same interpretable way FishNN is constructed, as follows.
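For reference, the threshold activation is easy to write down (plain Python sketch; the LeakyReLU slope of 0.01 is our assumption, matching pytorch's default):

```python
import math

NEG_SLOPE = 0.01  # LeakyReLU negative slope; pytorch default, our assumption

def leaky_relu(x):
    return x if x > 0 else NEG_SLOPE * x

def tau(x):
    # threshold activation from the text: tau(x) = Tanh(LeakyReLU(x))
    return math.tanh(leaky_relu(x))
```

Positive inputs are squashed towards 1 while negative inputs are strongly suppressed, which is what lets τ act as a soft threshold for the PFC neurons.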

Table 1. Weights and biases of Robot Fish's convolution layer (columns: output, conv weights, bias).

First, aggregate states from FishNN into a vector v_0 = [a_fh, a_ft, F, e, m]. This will be the input to the PFC. Then, v_0 is processed using conv_PFC followed by softmax and the threshold activation. The choice of weights and biases can be seen in table 1. With this design, we achieve meaningful activations v_1 = [e_1, m_1, ex] as before. For example, e_1 is activated when "there is food and the fish eats it when it is hungry", i.e. fh = 1, F < 1 and e is activated relative to m_1. The output of the PFC is a binary vector [True, False] = [T, F], obtained by passing v_1 through an FC layer, FC_PFC. In this implementation, we have designed the model such that the activation of any v_1 neuron is considered a True response; otherwise the response is considered False. This is how the PFC judges whether FishNN's action decision is correct.
Self reward optimization. As seen above, the fish robot has a FishNN that decides on an action to take and a PFC that determines whether the action is correct. Is this system already optimal? Yes, if we are only concerned with the fish's survival, since the fish will not die from hunger. However, it is not optimal with respect to average energy. We optimize the system through standard DNN backpropagation with the following self-reward loss (eq. 2): CEL applied to the PFC output [T, F] with ground truth argmax(z), where CEL is the Cross Entropy Loss and z = Σ_{i=1}^{mem} [T, F]_i is the decision accumulated over mem = 8 iterations to take past actions into account (pytorch notation is used). Since the ground truth argmax(z) is computed by the fish itself, we call this self-reward design.
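This loss can be sketched as follows (our plain-Python reconstruction of a cross-entropy against the self-generated label; the exact form of eq. 2 in the reference implementation may differ):

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def self_reward_loss(tf_history):
    # tf_history: the last mem = 8 PFC outputs [T, F]; z accumulates them
    z = [sum(step[k] for step in tf_history) for k in (0, 1)]
    target = 0 if z[0] >= z[1] else 1  # argmax(z): the self-made ground truth
    # cross-entropy of the latest [T, F] logits against that label
    return -log_softmax(tf_history[-1])[target]
```

A latest output that agrees with the accumulated decision yields a small loss; one that contradicts it is penalized more heavily, pushing the policy towards self-consistent behaviour.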
Results. Human design ensures survival, i.e. the problem is solved correctly. Initially, the fish robot decides to move rather than eat food when F ≈ 0.5, but after SRD training, it prefers to eat whenever food is available, as shown in fig. 3. A new equilibrium is attained: it does not affect the robot's survivability, but the fish robot will now survive with a higher average energy.

Fish Sale Auction
Here, we introduce a more complex scenario that justifies the use of SRD: the fish sale auction. In this scenario, multiple agents compete with each other in their bid to purchase a limited number of fish. A central server manages the dynamic bid by iteratively collecting each agent's decision to purchase, to hold (and demand a lower price) or to quit the auction. Generally, if the demand is high, the price is automatically marked up, and vice versa, until all items are sold out, every agent has successfully made a purchase or decided to quit the auction, or the termination condition is reached. In addition, the agents are allowed to fine-tune themselves during part of the bidding process, for example to bias themselves towards purchase when the price is low or when the demand is high.
Interpretability requirement. Here is probably the most important part. The participating agents are required to submit automated models that decide whether or not to purchase a fish based on its price and other parameters (e.g. its length and weight). The automated models have to be interpretable, ideally to prevent agents from submitting malicious models that sabotage the system, e.g. by artificially lowering the demand so that the price goes down. Interpretability is required because we want a human agent to act as a mediator, inspecting all the models submitted by the participants and rejecting potentially harmful models from entering the auction.
Deep RL models and other black-box models are not desirable in our auction model, since they are less amenable to meaningful evaluation. As we already know, deep RL models have become so powerful that they might be fine-tuned to exploit the dynamics of a system, and this will be difficult to detect due to their black-box nature. Furthermore, unlike MuJoCo and Atari games, there is no reward to maximize in this scenario, especially because the dynamics depend on many other agents' decisions. In other words, standard deep RL training may not be compatible with this auction model, since there is no true "correct" answer or maximum reward to speak of. Our SRD framework, on the other hand, advocates the design of meaningful self-reward; in our previous 1D robot fish example, the PFC is designed to be the interpretable cognitive faculty with which the agent decides the correctness of its own actions.
Remark. This scenario is completely arbitrary and is designed for concept illustration. The item being auctioned can be anything else, and, more importantly, the scenario does not necessarily have to be an auction. It could be a resource allocation problem in which each agent attempts to justify its need for more or fewer resources, or an advertising problem in which agents compete with each other to win greater exposure or air time.

The Server and Fish Sale Negotiator
Now we describe the auction system and a specific implementation. First, we assume that all agents follow the SRD framework and no malicious models are present, i.e. any undesirable models have been rejected after some screening process. In practice, screening is a process whereby human inspector(s) is/are tasked to read the agents' model specifications and manually admit only semantically sensible models. This inspection task is not modeled here; instead, a dummy screener is used to initiate interpretable SRD models by default; we will simulate a system with a failed screening process later for comparison.

Figure 4. The Fish Sale Negotiator. (C) Optim (noOptim) denotes a fish sale auction proceeding in which SRD optimization is (is not) performed. Each dot in the background corresponds to an actual purchase price at a given item supply, whose value in the plot is slightly perturbed to show multiple purchases at similar prices. Top: purchase price vs item supply. With SRD optimization, the inverse trend (red) is more pronounced, i.e. when there are far fewer fish available relative to the no. of patrons, the fish tend to be sold at higher prices. Bottom: purchase rate vs item supply, where purchase rate is the fraction of available fish sold. (D) Same as (C), but half the participants submit malicious models to the auction.

The server's state diagram is shown in fig. 4(A). Data initialization (not shown in the diagram) is as follows. The main item on sale, the fish, is encoded as a vector (p, l, w, g, st_1, st_2, st_3, f) ∈ X ⊆ R^8, where p = 5 denotes the price; l, w, g are continuous variables, respectively the length, weight and gill colour normalized to 1 (higher values are better); st_i are discrete variables corresponding to some unspecified sub-types that we simply encode as either −0.5 or 0.5 (technically binary variables); and f is the fraction of participants who voted to purchase in the previous iteration (assumed to be 0.5 at the start). The base parameter is encoded as (5, 1, 1, 1, 0.5, 0.5, 0.5, 0.5), and for each run, 16 variations are generated as perturbed versions of this vector (data augmentation).
The server then collects each agent's decision to purchase, hold or quit (corresponding to Ev in the diagram or the _patrons_evaluate function in the code) and redirects the process to one of the three following branches. If the demand at the specific price and parameters is high, i.e. N_buy > N_a^(k), where N_a^(k) denotes the remaining number of fish available at the k-th iteration, then the states are updated, including a price increment. If the demand is low, then the purchase transaction is made for those who voted to buy, the availability is reduced to N_a − N_buy, the price is lowered, and the states are correspondingly updated. If the number of purchases is equal to the number of available fish, the purchase is made and the process is terminated. The above constitutes one iteration of the bidding process, and this process is repeated up to a maximum of 64 times. In our experiments, we observe that the process terminates before this limit is reached, often with a few fish unsold.
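The three branches can be summarized in a schematic loop (plain Python; the decide function stands in for the collected agent votes, and the fixed price step is our invented placeholder, not the server's actual update rule):

```python
def run_auction(decide, n_available, price=5.0, max_iter=64, step=0.5):
    # decide(price) -> number of agents voting to buy at this price
    for _ in range(max_iter):
        n_buy = decide(price)
        if n_buy > n_available:          # demand high: mark the price up
            price += step
        elif n_buy == n_available:       # exact match: sell out and stop
            return 0, price
        else:                            # demand low: sell, then lower price
            n_available -= n_buy
            price = max(price - step, 0.0)
            if n_available == 0:
                return 0, price
    return n_available, price            # leftover fish, final price
```

For instance, with one steady buyer per round and three fish, the price drifts down until the last fish matches the last buyer.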
The Fish Sale Negotiator (FSN) is shown in fig. 4(B); this is the fully interpretable SRD model that makes the purchase decision. Each agent initializes one SRD model with fully interpretable weights, i.e. each agent adjusts its model's parameters based on its willingness to purchase. In our implementation, we add random perturbations to the parameters to simulate the variation among participants. The full details are available in the github repository, but here we describe how this particular model achieves fine-grained interpretability. FSN is a neural network with (1) an external sensor (ES) module that takes the fish price and parameters as input and computes x_es, and (2) a fully-connected (FC) layer that computes the decision y.
The external sensor (ES) module is a 1D convolution layer followed by threshold or selective activations. This module takes the input x ∈ X and outputs x_es ∈ R^4. The weights w_ES and biases b_ES of the convolution layer are chosen meaningfully as follows (note that its shape is (4, 1, 8) in pytorch notation). For example, w_ES[0, :, :] corresponds to the PG neuron, a neuron that lights up more strongly when the price is higher than baseline and, to a lesser extent, when the continuous variables length, weight and gill colour are lower than 1. Thus, w_ES[0, :, :] = (1/b, −2δ, −2δ, −2δ, 0, 0, 0, 0) + δ, where b = 5 is the baseline purchase price and δ = 0.001 is a relatively small number. The bias corresponding to PG is set to 0.
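Concretely, the PG row can be built like this (a plain-Python sketch of just this one kernel row; the helper names are ours):

```python
B_PRICE, DELTA = 5.0, 0.001  # baseline price b and small delta from the text

# PG row of the ES kernel: (1/b, -2d, -2d, -2d, 0, 0, 0, 0) + d, bias 0
W_PG = [v + DELTA for v in
        [1.0 / B_PRICE, -2 * DELTA, -2 * DELTA, -2 * DELTA, 0, 0, 0, 0]]
BIAS_PG = 0.0

def pg_response(x):
    # x = (p, l, w, g, st1, st2, st3, f)
    return sum(w * v for w, v in zip(W_PG, x)) + BIAS_PG

BASE = [5.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
```

Raising the price, or lowering the length/weight/gill-colour variables below 1, both increase the PG pre-activation, matching the stated semantics of the neuron.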
Implicit augmentation is performed by cloning D_ic copies of the input x ∈ X and stacking them into (p′, l′, ..., f′) ∈ R^{8×D_ic}, where p′ consists of D_ic copies of p (likewise for the other variables) and D_ic = 5 is called the implicit contrastive dimension (the number is arbitrarily chosen in this experiment). This is performed as a form of data augmentation in a spirit similar to contrastive learning [22][23][24], in which the cloned copies are randomly perturbed. The ES module has dilation = D_ic so that it outputs D_ic instances of each of the aforementioned neurons, which can later be averaged. The benefit of implicit augmentation is clearer in practical applications. Suppose the server resides in a remote location and each agent sends and receives its model repeatedly during fine-tuning. If data augmentation were performed on the server and sent to the agent, the network traffic load would be greater. Instead, with implicit augmentation, the process is more efficient, since less data is transferred back and forth.
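A sketch of the cloning step (plain Python; the interleaving scheme and noise scale are our assumptions about how a dilated convolution would consume the clones):

```python
import random

D_IC = 5  # implicit contrastive dimension

def implicit_augment(x, sigma=0.01, rng=random):
    # D_IC clones per feature, laid out feature-major so that a Conv1d with
    # dilation = D_IC reads one clone per kernel tap; the first clone stays
    # unperturbed, the rest get small noise (contrastive-style augmentation)
    out = []
    for v in x:
        out.append(v)
        out.extend(v + rng.gauss(0.0, sigma) for _ in range(D_IC - 1))
    return out

x = [5.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
x_aug = implicit_augment(x)  # length 8 * D_IC = 40
```

Because the augmentation happens inside the agent's model, only the original 8-vector needs to cross the network, which is the traffic saving described above.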
The fully-connected layer, FC, is designed similarly. Taking x_es as input, this layer computes the activations of neurons B, L, Q, respectively buy, hold (and lower the price) and quit. The decision is made greedily using torch.argmax. We can see that the layer is semantically clear: neuron PG contributes negatively to the buy decision, since buying at a higher price is less desirable. Neuron SZ contributes positively to neuron B, since a longer or heavier fish is more desirable; the others can be explained similarly. In our experiments, the bias values of the FC layer are initialized with some uniform random perturbation to simulate variation in the strength of the intent to purchase the fish.
Prefrontal Cortex (PFC) and SRD optimization. The PFC performs the self-reward mechanism, just like the 1D robot fish PFC. The neurons are semantically meaningful in the same fashion. The PGL neuron strongly activates the true (T) neuron when PG and L are strongly activated; this means that deciding to hold (L activated) when the price is high (PG activated) is considered a correct decision by this FSN model. The BC neuron corresponds to buying at a low price, a desirable decision (hence it activates T as well), while the FQ neuron corresponds to "false quitting", i.e. the decision to quit the auction when SZ, LSR and ST are activated (i.e. when the fish parameters are desirable) is considered a wrong decision, hence activating the F neuron. With this, SRD optimization can be performed by minimizing a loss like equation 2. In this particular experiment, the optimization settings are randomized from one agent to another, and the hyperparameters are arbitrarily chosen. The number of epochs is chosen uniformly between 0 and 2 (note that this means about one third of the participants do not perform SRD optimization), the batch size is chosen randomly between 4 and 15, and the learning rate of a standard Stochastic Gradient Descent optimizer is initialized uniformly at random, lr ∈ [10^−7, 10^−4]. Also, SRD optimization is allowed only during the first 4 iterations (chosen arbitrarily).
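The per-agent randomization described above amounts to sampling a small configuration (plain Python sketch; the function and dict key names are our own):

```python
import random

def sample_srd_config(rng=random):
    # per-agent randomized SRD optimization settings, following the text:
    # 0-2 epochs (roughly a third of agents skip SRD training entirely),
    # batch size 4-15, and lr drawn uniformly from [1e-7, 1e-4] for plain SGD
    return {
        "epochs": rng.randint(0, 2),
        "batch_size": rng.randint(4, 15),
        "lr": rng.uniform(1e-7, 1e-4),
    }
```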

Auction results
Each auction in this paper admits n = 64 participants given an item supply or rarity r. In our code, r is defined such that the number of fish available on auction corresponds to r × n. Each such auction is repeated 10 times, with each trial independent of the others. The results of each trial are then collected, and the mean values of purchase price and purchase rate are shown in fig. 4(C). In this specific implementation, the auction that allows SRD optimization shows a more pronounced price vs supply curve, as shown by the red curve at the top of fig. 4(C): lower supply results in a higher purchase price on average. Fig. 4(C) bottom shows that the fraction of successful buys varies nearly linearly as a function of supply.
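The auction bookkeeping above can be sketched as follows. The pricing and buying behaviour here are random stand-ins for the negotiating SRD agents; only the supply definition (r × n fish) and the averaging over 10 independent trials follow the text.

```python
import random
import statistics

N_PARTICIPANTS = 64  # n, as in the text

def run_auction_trials(r, n_trials=10, seed=0):
    """Run repeated independent auction trials at supply/rarity r and
    return the mean purchase price and mean purchase rate."""
    rng = random.Random(seed)
    n_fish = max(1, round(r * N_PARTICIPANTS))   # item supply = r * n
    all_prices, rates = [], []
    for _ in range(n_trials):                    # each trial is independent
        # Stand-in behaviour: each fish sells with some probability at
        # a placeholder price; real values come from the agents.
        prices = [10 + rng.random() for _ in range(n_fish) if rng.random() < 0.8]
        all_prices.extend(prices)
        rates.append(len(prices) / n_fish)       # fraction of available fish sold
    mean_price = statistics.mean(all_prices) if all_prices else 0.0
    return mean_price, statistics.mean(rates)

price, rate = run_auction_trials(r=0.25)
```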
We should remember that the well-behaved trends just observed are obtained from sensible automated models, in the sense that they make decisions with humanly understandable reasoning such as "buy when the price is low". This is possible thanks to the full interpretability afforded by our SRD framework. For comparison, we are interested in the case where interpretability is lacking and malicious actors sneak into the auction. Thus we repeat the experiments, except that half the agents always make hold decisions, intending to sabotage the system by forcefully lowering the price. The results are shown in fig. 4(D): the purchase price graph (top) is greatly compromised, which translates into a loss for the auction host. The purchase rate graph (bottom) appears slightly irregular at low item supply, and plateaus at higher supply since the malicious agents refuse to make any purchase; i.e. the auction fails to sell as many fish as it should have, again resulting in a loss. By admitting only sensible, interpretable models, the auction host can avoid such losses.

More scenarios
Generally, the SRD framework encourages developers to solve control problems with neural networks in the most transparent manner possible. Different problems may require different solutions, and there are possibly infinitely many solutions. This paper is intended as a non-exhaustive demonstration of how common components of black-box neural networks can be repurposed into a system that is not a black box. Here, we briefly describe two more scenarios and leave the details to the appendix: (1) the 2D robot lavaland and (2) the Multi-Joint dynamics with Contact (MuJoCo) simulator.

2D robot lavaland
Like Dylan's IRD paper, we use lavaland as a test bed. An agent/robot (marked with a blue x) traverses the lavaland by moving from one tile to the next towards the objective, which is marked as a yellow tile. With components such as the ABA and the selective activation function (see appendix), semantically meaningful neurons are activated as in our previous two examples. In this scenario, each neuron responds to a specific tile in the map. More specifically, brown tiles (dirt patches) are considered easy to traverse while green tiles (grassy patches) are considered harder to traverse. The robot is designed to prefer the easier path; thus each neuron responds more favourably towards brown tiles, as shown by the red patches of v_1 in fig. 5. The problem is considered solved if the target is reached within 36 steps. Furthermore, we demonstrate how unknown avoidance can be achieved. There are red tiles (lava) that are not only dangerous, but have also never been seen by the robot during the training process (thus unknown). The agent is trained only on green and brown tiles, but when it encounters the unknown red tiles, our interpretable design ensures that the agent avoids them: this is unknown avoidance. This is useful when we need the agent to err on the safer side. Once the human designer understands more about the unknown, a new model can be created using the SRD design principle that takes this unknown into account. The full experiments and results are available in the appendix.

MuJoCo with SRD
MuJoCo 25 is a well-known open-source physics engine for accurate simulation. It has been widely used to demonstrate the ability of RL models to solve control problems. Here, we briefly describe the use of the SRD framework to control the motion of a half-cheetah. All technical details are available in the appendix and our github. More importantly, in the appendix, we describe the design process step by step from the start until we arrive at the full design presented here.
To solve this problem in its simplest setting, an RL model is trained with the goal of making the agent (half-cheetah) learn how to run forward. Multiple degrees of freedom and coordinates of the agent's body parts are available as input and feedback in this control scenario, although only a subset of all possible states will enable the agent to perform the task correctly. More specifically, at each time step, the agent controls its actuators (the 6 joints of the half cheetah) based on its current pose (we use x and z position coordinates relative to the agent's torso), resulting in a small change in the agent's own pose. Over an interval of time, the accumulated changes result in the overall motion of the agent. The objective is for the agent to move its body parts in a coordinated fashion that produces forward motion. With SRD, the deliberate design of a neural network as shown in fig. 6(A) enables the cheetah to start running forward without training (or any optimization of reward), just like our previous examples. The self-reward mechanism is also similar: the PFC part of the neural network decides the correctness of the chosen action, and optimization can likewise be performed by minimizing a cross-entropy loss. Snapshots of the half cheetah are shown in fig. 6(B). Plots of the mean x displacement from the original position over time are shown in fig. 6(C) (top: without optimization; bottom: with SRD optimization). The average is computed over different trials initialized with small noises, and the shaded region indicates the variance from the mean values. We test different backswings, i.e. different magnitudes with which the rear thighs are swung. Some backswings result in slightly faster motion. More importantly, the structure of the neural network yields stable motion over the span of backswing magnitudes that we tested. Furthermore, we also tested an additional feature: the half cheetah is designed to respond to the instruction to stop moving (which we denote with inhibitor = 2). The result is shown in fig. 6(D), in which the instruction to stop moving is given at regular intervals. Finally, the effect of SRD optimization is not immediately clear. At lower backswing magnitudes, the optimization might have yielded a more stable configuration, which is more clearly visible in fig. 6(D) top (compared to bottom).

Limitation, Future Directions and Conclusion
Generalizability and scalability. An important limitation of this design is the possible difficulty of creating specific designs for very complex problems, for example computer vision problems with high-dimensional state spaces. Problems with a multitude of unknown variables might be difficult to factor into the system, or, if they are factored in, the designers' imperfect understanding of the variables may create a poor model. Future work can be aimed at tackling these problems. Regardless, we should mention that, ideally, a more complex problem can be solved if a definite list of tasks can be enumerated, the robot's state can be mapped to a task, and no state requires contradictory actions (or perhaps stochasticity can be introduced). Following the step-by-step approach of our SRD framework, each task can then be solved with a finite addition of neurons or layers. Further research is necessary to understand how noisy states can be properly mapped to an action.
An existing technical limitation is the lack of APIs that perform the exact operations needed to parallelize the imaginative portions of SRD optimization (see the lavaland example in the appendix). We have explored implicit contrastive learning for data augmentation, but more research is needed to understand the effect of different techniques. Also, a general formula seems to be preferred in the RL field, and this is not currently available in SRD. For now, a standard SRD framework consists of (1) an NN that decides the agent's actions based on its current state and (2) a PFC that facilitates the process of self-reward optimization with a true or false output. Further research is necessary.
Controlling Weights and Biases. A future study on regularizing weights may be interesting. While we have shown that our design provides good consistency between the interpretable weights before and after training, it is not surprising that the combination of small differences in weights can yield different final weights that still perform well. So far, it is not clear how different parameters affect the performance of a model, or whether there are ways to regularize the training of weight patterns towards something more interpretable and consistent.
To summarize, we have demonstrated interpretable neural network design with fine-grained interpretability. Human experts purposefully design models that solve specific problems based on their knowledge of those problems. SRD is also introduced as a way to improve performance on top of the designers' imperfections. With manual adjustment of weights and biases, designers are compelled to assign meaningful labels or names to specific parts, giving the system great readability. It is also efficient, since very few weights are needed compared to a traditional DNN.
Imagination components. Unlike an existing rollout 28, each of our SRD rollouts consists of a series of asymmetric binary choices a_1, a_2 chosen from {a ∈ A} so that v(a_1) ≥ v(a_2), where A is the set of actions and v is any generic function that gives each action a local value. The values are aggregated into a self-reward. Dreamer 29 solves the RL problem using only latent imagination, where many models (such as reward, action and value models) are specified as probability distributions. By contrast, SRD creates no specific sub-model. All values are just NN activations, and they are aggregated into impromptu, just-in-time scores, based on which plans can be greedily chosen.
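The asymmetric binary-choice rollout can be sketched as follows. Here v is any generic local-value function, and the plain-sum aggregation into a self-reward is an illustrative placeholder, not the paper's exact rule.

```python
def srd_rollout(action_pairs, v):
    """Greedy rollout over asymmetric binary choices: for each pair,
    keep the action with the higher local value v, and aggregate the
    chosen values into an impromptu self-reward."""
    plan, self_reward = [], 0.0
    for a1, a2 in action_pairs:
        if v(a1) < v(a2):          # enforce the asymmetry v(a1) >= v(a2)
            a1, a2 = a2, a1
        plan.append(a1)            # greedily keep the locally better action
        self_reward += v(a1)       # just-in-time aggregation into a score
    return plan, self_reward

plan, reward = srd_rollout([(1, 3), (5, 2)], v=lambda a: float(a))
```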

2D Robot in Lavaland
In the main text, we briefly discussed the 2D robot in the lavaland. We elaborate further here, starting with the interpretable components used in the 2D robot's design:
1. The ABA: approximate binary array. With selective activation and tile-specific values (colour), we create strong neuron activations that specifically correspond to tile colours. Their visual maps correspond directly to the relevant signal, preserving the ease of readability.
2. The DeconvSeq. The series of convolutional kernels is intended to provide a targeted response in conjunction with the ABA, e.g. in fig. 5, [v_1]_target gives a strong signal centred around the target. This is done by manually setting the center value of the weights higher than the rest (see fig. 7). The main selling point is their tunability: while each module has been given a specific purpose, e.g. detecting the target, its weights remain trainable. We empirically show that the main purpose of the kernels' weights is preserved (i.e. the center value is still the highest) after optimization.
The interpretable design of Robot2NN is shown in the main text, fig. 5. Tile-based modules in the receptors are designed to respond to different types of stimuli (grass, ground, lava etc.); as in our previous examples, weights and biases are manually selected. Deconvolutional layers are used in the robot's PFC to give the tiles scores from which the robot decides its subsequent action (red better, blue worse). More precisely, a stack of deconvolutional layers DS^n_t (defined below) is used to create a favourability gradient; the robot then chooses an action that generally moves it from blue to red regions. Before we proceed, we clarify some of our notation. ABA: approximate binary array, an array whose entries are expected to be ≈ 0 or 1. DS^n_t, or DeconvSeq, is a sequence of n deconvolutional layers for a tile type t. A deconvolutional layer, or deconv, is a regular DNN module; each deconv is followed by a Tanh activation for normalization and non-linearity. Normalization (to magnitude 1) ensures that DeconvSeq compares action choices in relative terms. Tile t denotes the name of a tile, e.g. grass, but it also denotes its [0, 1]-normalized RGB value, e.g. for grass, t = [0, 128, 0]/255, or an array of tile values (the meaning should be obvious from the context). τ_recog = 10^-4 is the recognition threshold. The designer needs to specify P = {p_t : t = target, grass, ...}. For now, P is the set of untrainable parameters, each p_t roughly acting as a factor for scaling the true reward. They affect the robot's final preferences for or against different tiles, although tunable parameters will accentuate or attenuate them accordingly. They encode the biases that designers input into the model in a simple, interpretable way. We also define the unknown avoidance parameter u_a ≥ 0.
Interpretable tile-based modules are designed to explicitly map the robot's response to each specific tile type. The Get_ABA() function computes the ABA for each tile type: w_t = σ_sa ∘ µ[(x_attn − t)^2], where µ[.] is the mean across the RGB channels. This is reminiscent of eq. 1; the difference is that each neuron responds to a tile type at each x_attn coordinate. Neuron activations are computed as w_η, where η = target, grass or dirt as shown in fig. 5; e.g. strong activation for grass detection occurs at [w_grass]_14.
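A sketch of Get_ABA() is given below, under the assumption that the selective activation σ_sa maps small squared distances to ≈ 1 and large ones to ≈ 0; the hypothetical form exp(-d/τ) is used here, and the paper's exact form (eq. 1) may differ.

```python
import torch

def sigma_sa(d, tau=1e-3):
    """Assumed selective activation: ~1 near zero distance, ~0 otherwise."""
    return torch.exp(-d / tau)

def get_aba(x_attn, tiles):
    """Compute an approximate binary array (ABA) per tile type.
    x_attn: (H, W, 3) map with values in [0, 1]; tiles: {name: RGB in [0, 1]}."""
    return {
        name: sigma_sa(((x_attn - torch.tensor(t)) ** 2).mean(dim=-1))
        for name, t in tiles.items()
    }

tiles = {"grass": [0.0, 128 / 255, 0.0], "dirt": [139 / 255, 69 / 255, 19 / 255]}
x = torch.zeros(4, 4, 3)
x[1, 2] = torch.tensor(tiles["grass"])  # place one grass tile
w = get_aba(x, tiles)                   # strong activation only at that tile
```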
Unknown avoidance. Like Dylan's IRD, we have a reliable mechanism for unknown avoidance: the Boolean array w_unknown = [(1 − Σ_t w_t) > τ_recog], treated as floating point numbers. From the formula, it can be seen that w_unknown aggregates the negation of the known activations. The unknown in our case is the lava tile that the imperfect human designer 'forgets' to account for.
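Implementing the formula above is straightforward: a cell that no known tile detector activates is flagged as unknown. The toy activations below are illustrative.

```python
import torch

TAU_RECOG = 1e-4  # recognition threshold from the text

def unknown_mask(w_known):
    """w_unknown = [(1 - sum_t w_t) > tau_recog], a Boolean array
    treated as floating point numbers."""
    total = torch.stack(list(w_known.values())).sum(dim=0)
    return ((1.0 - total) > TAU_RECOG).float()

# Toy known activations over a 2x2 map: two cells are recognized,
# the other two (e.g. lava) are not and get flagged as unknown.
w = {"grass": torch.tensor([[1.0, 0.0], [0.0, 0.0]]),
     "dirt":  torch.tensor([[0.0, 1.0], [0.0, 0.0]])}
mask = unknown_mask(w)
```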
Robot2NN weights and preserved interpretability. This model is very efficient because it consists of only 180 trainable parameters, shown fully in fig. 7(B,C). As expected of relatively simple problems, there is no need for the millions of parameters usually required to achieve high accuracy. High performance is attained (90% accuracy, given that 10% randomness is allowed). The weights of the target deconv appear to have been trained towards higher positive values (redder). The center value remains the most prominent for all, thus preserving our interpretability. Looking into individual variations, fig. 7(B) shows the weights of the Robot2NN model from project A expt 1, while fig. 7(C) shows those from project Compare A expt 1. The difference in the grass weights is apparent.
We have seen from fig. 7 that the trained models still have the general interpretable shape we initialized them with. While there is no theoretical proof, there may be an intuitive reason. Due to our interpretable design, the model starts off with a reasonable ability to solve the problem. This probably means the weights and biases already reside somewhere around one of the local minima in the high-dimensional parameter space. This local minimum is special in the sense that it is more interpretable, i.e. the weights have the recognizable shapes with which we initialized them. As a result, a short training run leads the model nearer to that local minimum, hence the overall shape of the model remains similar to the initial shape and stays interpretable.
(2) Why should no conclusion be drawn from this observation? It seems that project A, with a smaller p_target, results in relatively less preference for the grass tiles, which in turn leads to negative values (blue) in the deconv for grass tiles. By comparison, project Compare A seems to have no negative values in its grass tile deconv. Unfortunately, the weights are shown only for demonstration; no definite conclusion can be drawn. This is because other experiments similar to project A can also result in all-positive deconv weights with different patterns. They still yield high accuracy, so possible variations even within this small set of parameters can still produce similar performance. Other results are shown in the appendix. The Lava A project does not yield particularly distinct patterns. We see that even failure modes can yield weight profiles that look similar to those of non-failure modes. Further investigation may be necessary.

MuJoCo with SRD
We briefly described the application of the SRD framework to the half cheetah simulation on MuJoCo in the main text. Here, we go through the step-by-step process of arriving at the HalfCheetahSRD design (executed using --mode srd-model-design).
Stage 0: devtests. To design SRD properly, it is important to understand some fine details of the models and the platform used to simulate them. We perform an initial test to observe the agents' poses visually: run python mujoco_entry.py --mode devtests --testtype vary_control_strength --model half-cheetah. Here, we arbitrarily choose a specific setup that we keep consistent throughout the experiment. For example, the framerate is set to 15 Hz (which affects the time-step size) and interval = 25 (a setting for the SRD model's momentum update). These choices are arbitrary; for our current experiments, all we need is for the agent to be able to stabilize and perform the running motion properly, noting that different settings might yield different results.
At this point, we use some trial and error to find a way of controlling the actuators that successfully makes the half-cheetah run, without using any neural network yet. Indeed, we found that the following works: (1) apply the actuators with strengths [0, 0, 0, −2s, 0, 0] for 25 time steps, (2) followed by [−s, 0, 0, 0, 0, 0] for the next 25 time steps, and (3) repeat steps (1) and (2) cyclically. The rationale is simple: step (1) swings the front thigh forward and step (2) the back thigh. With these simple test movements, the agent is able to run forward. We will refer to this as the basis of the neural network we use in the SRD design.
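The open-loop cycle above can be sketched as a small function. The actuator ordering (back thigh/shin/foot, then front thigh/shin/foot) is assumed here from the text's description; see the repository for the exact convention.

```python
import numpy as np

def stage0_action(t, s=1.0, interval=25):
    """Stage-0 open-loop controller: alternate between swinging the
    front thigh (actuator index 3, strength -2s) and the back thigh
    (actuator index 0, strength -s) every `interval` time steps."""
    phase = (t // interval) % 2
    if phase == 0:
        return np.array([0, 0, 0, -2 * s, 0, 0], dtype=float)  # front thigh
    return np.array([-s, 0, 0, 0, 0, 0], dtype=float)          # back thigh
```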
Stage 1. In this stage, our goal is only to observe the position coordinates of the half cheetah. Different strengths of actuators are applied to the half cheetah, and it is then allowed to stabilize. A short video is available for readers to verify the half cheetah's pose visually; the final poses are shown in fig. 13. To proceed with the model design, run stage 1 with python mujoco_entry.py --mode srd-model-design --model half-cheetah --stage 1. We then save these positions in init.params, which we use later.
In fig. 14(A), numbers 0 to 6 denote the 7 parts of the half cheetah as defined in the XML format (known as the MJCF model in the official MuJoCo documentation). For example, 0 corresponds to its torso, 1 to the back thigh, etc. The x and z position coordinates of these body parts relative to the torso's coordinates are used as the input to the neural network that controls our SRD half-cheetah model, as in fig. 6(A). By observing the figure and the aforementioned short video, we verify that the agent is indeed initiated a short distance above the ground, after which it falls to the ground and stabilizes on its front and rear legs.
Stage 2. In this stage, we consider how to convert the input (x and z coordinates) to a set of meaningful neuron activations. We start by considering neurons that respond to the stable standing pose from the previous stage. To achieve this, we use StablePoseNeuron, a custom pytorch module with parameter p and a forward propagation method that takes an input x and outputs σ_sa((x − p)^2). In fig. 6(A), xS and zS are both StablePoseNeurons, while xS_inv = 1 − xS and zS_inv = 1 − zS. When the agent stays in a stable pose, xS and zS activate strongly while their corresponding inverses xS_inv, zS_inv do not, and vice versa. This is shown in fig. 14(B). In this scenario, the cheetah drops from a short height above the ground and stays in the equilibrium position until time step 250. From this point onward, the actuator of the agent's front thigh is activated, causing a forward swing of the front limb. This motion leads the agent away from its initial stable pose, and, as expected, the xS, zS neuron signals drop off (and their inverses activate strongly). The goal of our stage 2 design is thus achieved. Note: the command is the same as before, but with the argument --stage 2.
Stage 3.
In this stage, we connect the stable pose neurons and their inverses to the actuators. At this point, we have a neural network structure similar to the blue dotted box in fig. 6(A), except without the BS neuron. Recall that in stage 0 we swung the back and front thighs alternately. Our aim here is to approximately replicate that setup and then upgrade it with a neural network. Thus HalfCheetahSRD is born, in which the actuator parameters can be optimized in the SRD way as we have done before. More specifically, a fully connected layer connects the stable pose neurons to the actuators bt, bs, bf, ft, fs, ff, as shown in fig. 6(A). The simulation is then run as in the previous stage. We implement a momentum function that ensures the actuators apply their forces for 25 time steps before the next set of actuator values is computed from the new, updated pose.
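Stages 2 and 3 can be sketched together as below. σ_sa is assumed here to be the hypothetical exp(-d/τ), and the layer sizes and momentum rule are illustrative; the repository's exact code may differ.

```python
import torch
import torch.nn as nn

def sigma_sa(d, tau=1e-2):
    """Assumed selective activation: ~1 near zero distance, ~0 otherwise."""
    return torch.exp(-d / tau)

class StablePoseNeuron(nn.Module):
    """Activates strongly when the input pose x is near the stored pose p."""
    def __init__(self, p):
        super().__init__()
        self.p = nn.Parameter(torch.as_tensor(p, dtype=torch.float32))

    def forward(self, x):
        return sigma_sa((x - self.p) ** 2)

class HalfCheetahFirstLayer(nn.Module):
    """FC layer mapping stable-pose neurons (and their inverses) to the six
    actuators, with a momentum rule that holds each action for `interval` steps."""
    def __init__(self, stable_pose, n_act=6, interval=25):
        super().__init__()
        self.pose_neuron = StablePoseNeuron(stable_pose)   # e.g. from init.params
        self.fc = nn.Linear(2 * len(stable_pose), n_act)   # neurons + inverses
        self.interval = interval
        self._held, self._t = None, 0

    def forward(self, pose):
        s = self.pose_neuron(pose)
        feats = torch.cat([s, 1.0 - s])                    # xS/zS and their inverses
        if self._held is None or self._t % self.interval == 0:
            self._held = self.fc(feats)                    # recompute from new pose
        self._t += 1
        return self._held                                  # hold between updates

layer = HalfCheetahFirstLayer(stable_pose=[0.5, -0.1, 0.3, 0.0])
a0 = layer(torch.rand(4))
a1 = layer(torch.rand(4))  # identical: the action is held across the interval
```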
The results are shown in fig. 15(A). In essence, the plot of x coordinates shows that the agent moves forward successfully. The z coordinates drop from a height, as expected, and then oscillate regularly, indicating that the agent's body parts move in a regular cycle at a given level above the ground, i.e. the cheetah does not trip, fly away, etc.
Note: the command is the same as before, but with the argument --stage 3.
Stage 4. Stage 4 is similar to stage 3, except for the introduction of the BS neuron via a modification of the propagate_first_layer function. This neuron creates an additional variable used to vary the movement pose of the agent. More specifically, it allows the agent to vary how much its hind thigh swings throughout the motion. One such result, with backswing = 5, is shown in fig. 15(B).
The main experiment. The SRD is not complete without the PFC for SRD optimization. In this experiment, we keep the PFC simple, as shown in the green dotted box of fig. 6(A). As before, we want the PFC to decide the correctness of the agent's action. The iN neuron responds to the inhibitor, where the inhibitor takes the value of either 0 or 2. When inhibitor = 0,

Figure 3 .
Figure 3. (A) Plot of energy (F, blue) and food availability (red) for the untrained model. The SRD-trained model at its early stage looks almost identical. (B) Same as (A) but after 12000 SRD training iterations. (C) Average energy of the SRD-trained fish model.

Figure 4 .
Figure 4. (A) Fish sale server. Ev: evaluation process. U: the states update process. P: updating records of successful purchases. (B) The Fish Sale Negotiator. (C) Optim (noOptim) denotes a fish sale auction proceeding in which SRD optimization is (is not) performed. Each dot in the background corresponds to an actual purchase price at a given item supply, whose value in the plot is slightly perturbed to show multiple purchases at similar prices. Top: purchase price vs item supply. With SRD optimization, the inverse trend (red) is more pronounced, i.e. when there are far fewer fish available relative to the number of patrons, the fish tend to be sold at a higher price. Bottom: purchase rate vs item supply, where the purchase rate is the fraction of available fish sold. (D) Same as (C) but half the participants submit malicious models to the auction.

Figure 5 .
Figure 5. Robot2NN schematic. Receptors help split incoming signals for further processing. x_attn (the whole visible map) and w_self (the agent's position) are the direct inputs to the model. In v_1, v_Σ, red values are positive (desirable), white zero and blue negative (not desirable). The PFC (green dotted box) contains trainable series of parameters to make and adjust decisions. [w_grass]_14, [w_grass]_23 are examples of strong activations that are interpretable through the tile-based module.

Figure 6 .
Figure 6. (A) The HalfCheetahSRD, the neural network model used to move the MuJoCo half-cheetah. (B) The half cheetah starts at position x = −2. Red/orange curved arrows denote the swings of the front/rear thighs. (C) Mean x position over time for different magnitudes of backswing. (D) Same as (C), but with the inhibitor set to 2, allowing the agent to respond to the instruction to stop moving.

Figure 7 .
Figure 7. (A) Initial parameters; all deconvs in all DeconvSeqs are initialized to the same 3x3 weights with a center max value of 1 and 0.1 elsewhere. (B) Once trained, variations of the weights are observed for Project A expt 1 (all 180 parameters are shown). (C) Similar to (B) but for project Compare A.

Figure 11 .
Figure 11. Weights for failure modes. (A) Project Compare A expt 2. (B) Project Compare A expt 4. (A) still shows standard-looking weights.

Figure 13 .
Figure 13. Half cheetah's poses after the application of varying strengths of actuators (front and back thighs).
(A), numbers 0 to 6 denote the 7 parts of half cheetah as defined in the XML format (known as the MJCF model).

Figure 14 .
Figure 14. (A) x and z position coordinates of the agent (half cheetah) in stage 1. Numbers 0 to 6 denote the 7 parts of the half cheetah. (B) The strength of activation of the stable pose neurons xS, zS over time, and likewise of their inverses.