The neural coding framework for learning generative models

Neural generative models can be used to learn complex probability distributions from data, to sample from them, and to produce probability density estimates. We propose a computational framework for developing neural generative models inspired by the theory of predictive processing in the brain. According to predictive processing theory, the neurons in the brain form a hierarchy in which neurons in one level form expectations about sensory inputs from another level. These neurons update their local models based on differences between their expectations and the observed signals. In a similar way, artificial neurons in our generative models predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality. In this work, we show that the neural generative models learned within our framework perform well in practice across several benchmark datasets and metrics and either remain competitive with or significantly outperform other generative models with similar functionality (such as the variational auto-encoder).


(Table: NGC model naming scheme; columns were NGC Model Name, Error Type, Prec Wghts, Lat Wghts, α m Value, and Uses ∂ϕ.)

While this study explores four variant GNCN models that can be derived from the general NGC computational framework proposed here, many other architectures are possible that could handle other styles of problems. This flexibility is owed to the fact that our NGC framework supports asymmetry between the structure of the forward generative pathway and that of the error transmission pathway.
Notably, if an NGC model utilizes a non-hierarchical structure in its generative/prediction pathway (α m = 1), as in the case of one of this paper's main models, it receives the final suffix "-PDH" (for partially decomposable hierarchy). In the case of the GNCN-t2-LΣ-PDH, for further convenience, we abbreviate it in this paper to GNCN-PDH. Alternatively, if an NGC model contains a non-hierarchical error structure (given that the forward and error transmission pathways are not strictly required to be symmetric), it receives the final suffix "-PDEH" (for partially decomposable error hierarchy). Note that certain special cases of our framework have been given additional name designations to highlight their source studies, e.g., GNCN-t1 is also referred to as GNCN-t1/Rao [40] and GNCN-t1-Σ is also referred to as GNCN-t1-Σ/Friston [13]. Figure 1 depicts five other alternative neural circuit structures that would tackle, respectively: 1) clamped, multiple inputs mapped to generated/predicted outputs (as in the case of tasks such as direct classification); 2) multi-modal generative modeling (for example, crafting a circuit that jointly learns to synthesize an image and a discrete one-hot encoding of a word/character at time step t within a sequence); 3) a generative model where upper layers receive error messages from layers other than their immediately connected ones (a GNCN-PDEH), e.g., layer ℓ = 2 receives error messages from layers ℓ = 1 and ℓ = 0; 4) a label-aware generative model that forms a partially decomposable hierarchy in its forward generative structure (a GNCN-PDH driven by labels as input); and 5) one type of temporal/recurrent NGC, where the predictive outputs of each state region are further conditioned on their previous values, i.e., ẑ ℓ t would be a function of the states z ℓ+1 t and z ℓ t−1 (as indicated by the curled dash-dotted arrows), since the input takes on a temporal/ordered form (as in frames of a video), i.e., x t, where t marks a step in time.
If more complex transformations are needed to map a state layer's activity z ℓ to a set of nonlinear prediction values ẑ ℓ, for example using multiple self-attention heads as in the case of transformer networks [8], one could opt to leverage back-propagation to locally compute the weight adjustments to the attention-head synapses, utilizing NGC's online objective, total discrepancy, which is an approximate form of (variational) free energy [35].
Finally, in Table 1, we present a small nearest neighbor analysis of the samples generated by two NGC models - GNCN-t1-Σ (GNCN-t1-Σ/Friston) and GNCN-t2-LΣ - and backprop-based autoencoder models - a regularized autoencoder (RAE) and an adversarial autoencoder (GAN-AE). Specifically, we forced each model, after training on each database, to synthesize a pool of data samples equal in size to the original dataset, e.g., for MNIST, each model created 58,000 samples (since 2,000 were used for validation). Then, we randomly sampled one original data point from each class of each database and performed a nearest neighbor search (using the Euclidean distance function) to select, out of the synthesized patterns for each model, the single fantasy with the lowest distance measurement, i.e., the single top nearest neighbor. We present the visual results of this simple analysis in Table 1.

As mentioned in the introduction, some of the more notable criticisms of backprop include: 1) synapses that make up the forward information pathway need to be directly used in reverse to communicate teaching signals (the weight transport problem); 2) neurons need to be able to communicate their own activation function's first derivative; 3) neurons must wait for the neurons ahead of them to percolate their error signals back before adjusting their own synapses (the update-locking problem); 4) there is a distinct form of information propagation through a long, global error feedback pathway that only affects weights but does not (at least directly) affect the network's internal representations; and 5) the error signals have a one-to-one correspondence with neurons.
The properties above are inherent to backprop and do not conform to known biological feedback mechanisms underlying neural communication in the brain [7]. The brain, in contrast, is heavily recurrently connected [2,9], allowing for complementary pathways to form that would allow for percolation of error/mismatch information [28,44]. Biological neurons communicate binary spike signals, making it unlikely that they also sport specialized circuitry to communicate the derivative of a loss function with respect to their activities [18] (though some recent studies have suggested that real neurons might communicate with rate codes [27]). Furthermore, it is more commonly accepted that neurons in the brain learn "locally" [16,11], modulated globally by signals provided through neuromodulators such as dopamine, i.e., they operate with only immediately available information (such as their own activity and that of nearby neurons that they are connected to), making it unlikely that a global feedback pathway drives synaptic adjustments.
In addition, several of the problems above result in practical problems -the weight transport problem has been shown to create memory access pattern issues in hardware implementations [6] and the global feedback pathway itself is a key source behind the well-known exploding and vanishing gradient problems [38] in deep ANNs, yielding unstable/ineffective learning unless specific heuristics are employed.
In recent research, there has been increasing interest in the machine learning community in developing learning procedures that enable backprop-level learning while embodying elements of actual neuronal function. However, while the insights provided by each development have proven valuable, increasing the evidence that a backprop-free form of adaptation can be consistent with some aspects of real networks of neurons, many of these ideas only address one or a few of the issues described earlier. Random feedback alignment algorithms [3,25] address the weight transport problem and, to varying degrees, the update-locking problem [32,30,12], but fundamentally emulate backprop's differentiable global feedback pathway to create teaching signals (and pay a reduction in generalization ability the farther they deviate from backprop [4,12]). In addition, experimentally, the success of these approaches depends on how well the feedback weights are chosen a priori (instead of learned). Other procedures, like local representation alignment [36] and target propagation [20,24], which also resolve the weight transport problem and eschew the need for differentiable activations, fail to address the update-locking problem, since the various incarnations of these methods require a full forward pass to initiate inference. Procedures have been proposed to address the update-locking problem, such as the method of synthetic gradients [22], but these still require backprop to compute local gradients. It remains to be seen how these algorithms could be adapted to learn generative models. Other systems, such as those related to contrastive Hebbian learning (CHL) [31,34,43], are much more biologically plausible but often require symmetry between the forward and backward synaptic pathways, i.e., they fail to address weight transport. More importantly, CHL requires long settling phases in order to compute activities and teaching signals, resulting in long computational simulation times.
While some of these algorithms have shown success in ANN training [24,25,36], they focus on classification, which is purely supervised and arguably a simpler problem than generative modeling.
Boltzmann machines [1], which are generalizations of Hopfield networks [26,21] that incorporate latent variables, are a type of generative network that also contains lateral connections between neurons, much as the models in our NGC framework do. While training of the original model was slow, a simplification was later made to omit the lateral synapses, yielding a bipartite graphical model referred to as a harmonium, trained by contrastive divergence [17], a local contrastive Hebbian learning rule. However, while powerful, the harmonium could only synthesize reasonable-looking samples with many iterations of block Gibbs sampling, and the training algorithm suffered from mixing problems (leading to low sample diversity, among other issues) [10]. Another kind of generative model [19] can be trained with the wake-sleep algorithm (or the up-down algorithm in the case of deep belief networks [19]), where an inference (upward) network and a generative (downward) network are jointly trained to invert each other. Unfortunately, wake-sleep suffers from instability, struggling to produce good samples of data most of the time, owing to the difficulty both networks have in inverting each other under layer-wise distributional shift. Motivated by the deficiencies in models learned by contrastive divergence/wake-sleep, algorithms have been created for auto-encoder-based models [5], but most efforts today rely on backprop.

Supplementary Note 3: On Evaluating Mode Capture
Given that a common problem in training generative models is mode collapse, we measure some distributional properties of several of our baseline models and NGC models. Note that it is difficult to directly and automatically determine the labels of the samples produced by the unsupervised models investigated in this paper (aside from manual qualitative inspection), and developing better measurements for evaluating the mode-capturing ability of generative models in general remains an open research problem. Among the myriad of current experimental approaches proposed for measuring the degree of mode capture, we opted to implement and measure the number of statistically different bins (NBD) from [41], which has been argued to be a potentially useful metric for evaluating the degree of mode collapse that a given generative model might have experienced (values closer to 0 mean that the model has likely captured most of the modes of the data's underlying distribution). To complement this metric, we also measure the Kullback-Leibler divergence (KL-D) between a pool of samples generated by each model (equal to the size of the original dataset) and the original dataset samples - to be specific, we estimate the empirical mean and covariance matrices of each and calculate the closed-form multivariate Gaussian KL-D.
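As a sketch of the Gaussian KL-D measurement described above, the following illustrative code (function names are ours, not from the paper) fits a Gaussian to each sample pool and evaluates the closed-form multivariate Gaussian KL divergence:

```python
import numpy as np

def gaussian_kld(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence D_KL(N(mu0, cov0) || N(mu1, cov1))."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    term_trace = np.trace(cov1_inv @ cov0)
    term_quad = diff @ cov1_inv @ diff
    term_logdet = np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    return 0.5 * (term_trace + term_quad - d + term_logdet)

def empirical_gaussian(X):
    """Fit a mean vector and covariance matrix to a pool of samples (rows)."""
    return X.mean(axis=0), np.cov(X, rowvar=False)
```

One would fit `empirical_gaussian` to the synthesized pool and to the original dataset, then pass both pairs to `gaussian_kld`; identical distributions yield a divergence of zero.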
To further utilize the labels, we also present results using another approach proposed in [42]. Specifically, we train a well-regularized multilayer perceptron (MLP) classifier (as mentioned above) on the full original dataset and use it to automatically annotate the samples produced by a given generative model. Once the samples have been annotated, we compute the frequencies of each label class (and normalize these values to lie in the range [0, 1]) and plot these class probabilities in Figure 2 (we also plot the ground truth distribution for reference). Also note that, since even a powerful MLP classifier such as the one we trained incurs error (it naturally cannot reach 100% test accuracy), the frequency measurements should be taken with annotator error in mind.
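The class-frequency computation can be sketched as follows (a minimal illustration; the annotating classifier itself is omitted and the label array stands in for its predictions):

```python
import numpy as np

def class_frequencies(labels, num_classes):
    """Normalized frequency of each label class (values lie in [0, 1] and sum to 1)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum()

# stand-in for the MLP's annotations of 6 synthesized samples
annotations = np.array([0, 0, 1, 2, 2, 2])
freqs = class_frequencies(annotations, num_classes=3)
```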
Based on the results presented in Figure 2 and Table 2, we see that, while the predictive processing NGC models do not appear to suffer from any severe form of mode collapse, they do not capture the frequency distribution of the classes as well as the GAN-AE. With respect to the Gaussian KL-D and NBD metrics, the GNCN-PDH and GNCN-t1-Σ do well (often placing among the top scores across all datasets), although the GAN-AE and GVAE currently perform the best with respect to these mode measurements. Future work will entail uncovering the reason why the NGC models do not, at least upon first examination, match the class frequency of the data as well as the autoencoders.

Supplementary Note 4: On Model Complexity of Neural Generative Coding
On Setting Parameter Complexity

Model complexity was selected based on consideration of the hardware resources available and preliminary experimentation with the validation set of each dataset, i.e., each benchmark had a validation subset that was randomly sampled without replacement (per class) from the training set, as described in the paper. Backprop-based variational autoencoders are typically designed with low-dimensional latent spaces in mind, so we investigated different latent code sizes in the range of [5, 30] and found that 20 worked best. Then, we constrained the GNCNs that had lateral synapses to only have 20 neural columns (since each column roughly functions as a latent variable) in their topmost layers and explored values for the lower levels in the range of [100, 400], finding that 360 was a good choice (GNCN-t1 [40] and GNCN-t1-Σ [13] were set to use 360 neurons in the top layer). For the autoencoder models, the hidden layer sizes were set to be equal and, to ensure fair comparison, a coarse grid search was conducted over the range between 100 and the maximum number possible for equal-sized layers such that each model could have at most the same total number of parameters as the NGC models (meaning they could have fewer synapses if that yielded better performance on the validation set).
The optimal dimensionality of z ℓ was also tuned through preliminary experimentation using held-out validation data. We chose the linear rectifier activation function for the NGC models because we desired strictly positive activity values (which work well with the formulation of lateral inhibition we present in this work). For the GNCN-t1-Σ [13], we found that the linear rectifier worked best while the hyperbolic tangent worked best for GNCN-t1 [40] (for both of these models, the coefficient weighting the kurtotic prior was tuned on validation data).

On Model Run-time Complexity
Note that an NGC model requires multiple steps of processing to obtain its latent activities for a given input. Naturally, per sample, this means that all of the predictive processing models we have explored would be slower than the feedforward autoencoder models (which conduct inference with a single feedforward pass). In short, while an autoencoder (with L + 1 layers) would roughly require only 2 * L matrix multiplications (the most expensive operation in the neural systems we investigate), any NGC predictive processing model would require at least 2 * L * T multiplications. However, as observed in our data efficiency plots in the main paper, this more expensive per-sample cost is desirably offset by convergence with fewer samples in comparison to the backprop-based models. Furthermore, specialized hardware could take advantage of the NGC framework's inherent parallelism to speed up the process.
One key to reducing the inherent cost of an NGC model's iterative processing is designing alternative state update equations: whereas the ones explored in this paper embody a form of Euler integration, one could design higher-order integration steps, such as those based on the midpoint method or Runge-Kutta methods. Another solution could be to craft an amortized inference process, where another neural model (much akin to the encoder in a variational autoencoder) learns to infer the value of the state variables at the end of the expensive iterative inference process so as to ultimately reduce the number of processing steps required per sample. We leave investigation of these remedies for future work.
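To make the integration comparison concrete, here is a minimal sketch (on toy leaky-decay dynamics, not an actual NGC state update) contrasting an Euler step with a midpoint (second-order Runge-Kutta) step:

```python
import numpy as np

def euler_step(z, f, beta):
    """One Euler step of the state dynamics dz/dt = f(z)."""
    return z + beta * f(z)

def midpoint_step(z, f, beta):
    """Midpoint (second-order Runge-Kutta) step: evaluate f at a half-step estimate."""
    z_half = z + 0.5 * beta * f(z)
    return z + beta * f(z_half)

# toy leaky dynamics dz/dt = -gamma * z, exact solution z(t) = z(0) * exp(-gamma * t)
gamma, beta, steps = 1.0, 0.1, 50
f = lambda z: -gamma * z
z_euler = np.ones(1)
z_mid = np.ones(1)
for _ in range(steps):
    z_euler = euler_step(z_euler, f, beta)
    z_mid = midpoint_step(z_mid, f, beta)
exact = np.exp(-gamma * beta * steps)  # z(5) for z(0) = 1
```

For the same step size, the midpoint estimate tracks the exact decay more closely than the Euler estimate, which is the motivation for higher-order steps above.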

Supplementary Note 5: On the Omission of Activation Derivatives
Removing the activation derivative, as we did in our GNCN-t2-LΣ and GNCN-PDH models, could be argued to permit value fluctuations that lead to unstable dynamics. Nonetheless, experimentally, we did not observe any strong weight fluctuations in our simulations, and we believe that such fluctuations (much like the growing-weight problem inherent to many Hebbian update rules) are unlikely given that our model weight columns are constrained to have unit norms. Furthermore, the step size β is usually kept within the range of [0.02, 0.1], and the leak term −γz ℓ helps to smooth out the values and prevent the occurrence of large magnitudes (serving as a sort of L2 penalty over the latent state activities).
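The unit-norm column constraint mentioned above can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def constrain_columns(W, eps=1e-8):
    """Rescale each column of W to have unit Euclidean norm."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W / np.maximum(norms, eps)  # eps guards against all-zero columns

W = constrain_columns(np.random.default_rng(0).normal(size=(5, 4)))
```

Applying this after each weight update bounds the magnitude of any individual weight column, which is why runaway Hebbian-style growth is unlikely.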
In [37,36], it was argued that so long as the activation function was monotonically increasing (with a similar condition imposed for stochastic activation functions), then the learning process would be stable and the benefit that the point-wise derivative offered would be absorbed by the error synaptic weights introduced to carry error signals. However, note that an approximation for the derivative in the form of a prefactor (derivative) term could be designed/introduced to further safeguard against potential fluctuations (and this will be the subject of future work).

Supplementary Note 6: Generating the Lateral Competition Matrices
In the main paper, we introduced a lateral competition matrix V ℓ that directly affects the latent state z ℓ. It is constructed to contain the self-excitation weights (applied along the diagonal, via the identity matrix I, with strength α e) and the lateral inhibition weights (applied at off-diagonal slots, with strength α h), where the masking matrix M ℓ ∈ {0, 1} J ℓ ×J ℓ is set by the experimenter (placing ones in the slots where it is desired for neuron pairs to laterally inhibit one another). In this study, we set α e = 0.13 (the self-excitation strength) and α h = 0.125 (the lateral inhibition strength). Our mask matrix M ℓ, which emphasizes a type of group or neural-column form of competition, was generated by the following process: 1. create C = J ℓ /K matrices S k of shape J ℓ × K, initialized to zeros; 2. in each matrix S k, insert ones at all combinations of the coordinates c = {1, · · · , k, · · · , K} (column index) and r = {1 + K * (k − 1), · · · , k + K * (k − 1), · · · , K + K * (k − 1)} (row index); 3. concatenate the J ℓ /K matrices along the horizontal axis, i.e., M ℓ = < S 1 , S 2 , · · · , S C >.
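The mask construction above amounts to placing K × K blocks of ones along the diagonal of M ℓ, one block per neural column. The sketch below is our own illustrative code; in particular, the exact combination of I and M ℓ into V ℓ is one plausible reading of the text, not a verbatim reproduction of the paper's equation:

```python
import numpy as np

def competition_matrices(J, K, alpha_e=0.13, alpha_h=0.125):
    """Build the group-competition mask M (J x J, one K x K block of ones per
    neural column) plus a lateral matrix V with self-excitation on the diagonal
    and inhibition between neurons sharing a group."""
    assert J % K == 0
    C = J // K  # number of neural columns/groups
    M = np.kron(np.eye(C), np.ones((K, K)))  # ones where a neuron pair shares a group
    I = np.eye(J)
    V = alpha_e * I - alpha_h * (M * (1.0 - I))  # excite self, inhibit group mates
    return M, V

M, V = competition_matrices(J=6, K=3)
```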
Note that, for our proposed integration of the lateral synapses in the state neuron layers, we start out with a probabilistic model and then modify it by introducing sparsity driven by lateral synaptic weights (a neuroscience-inspired idea), which directly modify the values of the state neurons (serving as a sort of filter). While we cannot justify these weights within the probabilistic model, our experiments in the Results section of the main paper show that models employing them improve over those that do not, such as the GNCN-t1-Σ [13] and GNCN-t1 [40] (which both impose a Laplace distribution over their state neurons to encourage sparsity). Part of our future work will be to derive a probabilistic interpretation of our particular extensions to the NGC model.
With respect to backprop-based neural systems, one could also introduce stateful neurons with similar connectivity to our lateral and precision synapses above by introducing recurrence (as is done in recurrent neural networks). However, to update the weight parameters, one would have to resort to backprop through time (BPTT) and unroll the model over T steps in time. This would require creating a very deep computational graph and storing the activities and gradients at each time step before a final update to each synaptic weight matrix could be calculated, creating a very expensive memory footprint. An NGC model, in contrast, does not require unrolling and the large memory footprint associated with BPTT-trained recurrent networks.

Supplementary Note 7: Autoencoder Baseline Model Descriptions
To make learning the decoder function (NN) described in the main paper tractable, it is common practice in the deep learning literature to introduce a supporting function known as the encoder [23]. The encoder (NN e ), parameterized by a feedforward network, takes in the input stimulus x and maps it to z or to a distribution over z. Depending on the choice of encoder, one can recover one of the four main baselines we experimented with in this paper.
For all backprop-based baseline models in this paper, the decoder of each was regularized with an additional L2 penalty. Specifically, this meant that their data log likelihood objectives always took the form: ψ reg = ψ + Ω(Θ NN ), where Θ NN = {W L , · · · , W ℓ , · · · , W 1 } contains all of the weight matrix parameters of the decoder NN and Ω(Θ NN ) is the regularization function applied to the decoder, i.e., Ω(Θ NN ) = −λ Σ W ℓ ∈Θ NN ||W ℓ || 2 F , where ||W ℓ || F denotes the Frobenius norm of W ℓ. During training/optimization with gradient ascent, we do not constrain the column norms of any of the weight matrices for any of the baseline models (as we do for the GNCNs), as we found that doing so worsened their generalization ability.
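The decoder penalty can be sketched as follows (illustrative names; the sign is negative because ψ reg is a log likelihood that is maximized, so the penalty subtracts from it):

```python
import numpy as np

def l2_penalty(decoder_weights, lam):
    """Omega(Theta_NN) = -lambda * sum of squared Frobenius norms of the
    decoder's weight matrices."""
    return -lam * sum(np.sum(W ** 2) for W in decoder_weights)

weights = [np.ones((2, 2)), 2.0 * np.ones((1, 3))]  # toy decoder matrices
omega = l2_penalty(weights, lam=0.1)
```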
Furthermore, the number of total layers in the decoder for any model was set to be four -one output and one input layer with two hidden layers in between. The encoder was constrained to be the same -one input and one output layer with layers in between (in the case of the GVAE, GVAE-CV, and GAN-AE, the encoder's output is technically split into two blocks, as described later). The sizes of the hidden layers were set such that the total number of learnable model weights were approximately equal across all baselines and GNCNs (maximum was 1, 400, 000 synapses), which means that all models were forced to have the same parameter complexity to avoid any unfair advantages that might come from over-parameterization.
Regularized Auto-encoder (RAE): The encoder NN e is designed to be a feedforward network of L layers of neurons. Each layer is a nonlinear transformation of the one before it, where ẑ ℓ = ϕ ℓ (E ℓ · ẑ ℓ−1 ). As in the decoder, ϕ ℓ is an activation function and E ℓ is a set of tunable weights. In this paper, we chose ϕ ℓ to be the linear rectifier, i.e., ϕ ℓ (v) = max(0, v). The bottom layer activation was chosen to be the logistic link, i.e., ϕ 0 (z) = 1/(1 + exp(−z)).
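A minimal sketch of this encoder/decoder pair under the stated choices (linear rectifier hidden layers, logistic output; bias terms omitted for brevity, helper names are ours):

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def encode(x, E_mats):
    """Feedforward encoder: z_hat^l = relu(E^l @ z_hat^(l-1))."""
    z = x
    for E in E_mats:
        z = relu(E @ z)
    return z

def decode(z, W_mats):
    """Decoder with relu hidden layers and a logistic-sigmoid output layer."""
    h = z
    for W in W_mats[:-1]:
        h = relu(W @ h)
    return sigmoid(W_mats[-1] @ h)

# toy weights and a two-pixel "image"
z_code = encode(np.array([1.0, 2.0]), [np.array([[1.0, -1.0], [0.0, 1.0]])])
x_hat = decode(z_code, [np.eye(2), np.zeros((2, 2))])
```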
Note that in the RAE, the input to the decoder is now z = ẑ L , i.e., the noise sample vector is set equal to the top-most layer of neural activities of the encoder. The RAE optimizes the (regularized) data log likelihood described above, where each weight matrix E ℓ (of the encoder) and W ℓ (of the decoder) is adjusted by computing the relevant gradients ∂ψ/∂E ℓ and ∂ψ/∂W ℓ , respectively. The weight gradients are then used to update the model parameters via gradient ascent.
Gaussian Variational Auto-encoder (GVAE): Instead of using an encoder to produce only a single value for z, we could instead modify this network to produce the parameters of a distribution over z. If we assume that this distribution is a multivariate Gaussian with a mean µ z and a diagonal covariance σ 2 z = Σ z ⊙ I, we can then modify the RAE's encoder function to instead be: (µ z , σ 2 z ) = NN e (x). Specifically, the top-most layer of NN e (x) is split into two separate output layers as follows: µ z = E L µ · z L−1 and σ 2 z = exp(E L σ · z L−1 ) (this is also known as the variational autoencoder, or VAE [23]). E L µ is the tunable weight matrix for the mean and E L σ is the tunable weight matrix for the variance. The data log likelihood for the GVAE is: ψ = E q(z|x) [log p(x|z)] − D KL (q(z|x)||p(z)), where q(z|x) = N (µ z , σ 2 z ) (the Gaussian parameterized by the encoder NN e (x)) and p(z) = N (µ p , σ 2 p ) with µ p = 0 and σ 2 p = 1 (an assumed unit Gaussian prior over z). The second term in the above objective is the Kullback-Leibler divergence D KL between the distribution defined by the encoder and the assumed prior distribution. This term encourages the output distribution of the encoder to match the chosen prior, acting as a powerful probabilistic regularizer over the model's latent space. Note that this divergence term serves as a top-down pressure on the top-most layer of the encoder, while the gradients that flow back from the decoder (via the chain rule of calculus) act as a sort of bottom-up pressure. Note that, since NN e (x) defines a distribution, the input to the decoder is, unlike in the RAE, a sample of the encoder-controlled Gaussian, i.e., z = µ z + σ z ⊙ ϵ where ϵ ∼ N (0, 1) (the standard deviation σ z , not the variance, scales the noise). Gradients of the likelihood above are then taken with respect to all of the encoder and decoder parameters, including the new mean and variance encoder weights E L µ and E L σ , which are subsequently updated using gradient ascent.
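The reparameterized sampling step and the closed-form KL term against a unit Gaussian prior can be sketched as follows (illustrative code; parameterizing by the log-variance is our assumption, made for numerical convenience):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); the standard deviation
    (not the variance) scales the noise."""
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
```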
All the other activation functions of the GVAE are set to be the linear rectifier, except for the output function ϕ 0 of the decoder, which, like the RAE, is set to be the logistic sigmoid.

Constant-Variance Gaussian Variational Auto-encoder (GVAE-CV):
This model [14] is identical to the GVAE except that the variance parameters σ 2 z of the encoder are omitted and a fixed (non-learnable) value is chosen instead for the variance (meaning that the diagonal covariance is collapsed further to a single scalar). The exact value for this variance meta-parameter, for each benchmark, was chosen from the range [0.025, 1.0] by tuning performance to a held-out set of image samples.
Generative Adversarial Network Autoencoder (GAN-AE): This model, also referred to as an adversarial autoencoder [29], largely adheres to the architecture of the GVAE except that the second term, i.e., the Kullback-Leibler divergence term, in the data log likelihood is replaced with the adversarial objective normally used to train implicit density estimators such as the generative adversarial network (GAN) [15]. As a result, we integrate a third feedforward network, i.e., p r = NN d (z), into the generative model (this module is also referred to as the discriminator). The discriminator is tasked with distinguishing whether an input vector comes from the desired prior distribution p(z) (set to be a unit Gaussian, as in the GVAE) or from the distribution produced by the encoder network NN e (x). This task is posed as a binary classification problem, where a sample from the encoder z f ∼ N (µ z , σ 2 z ) is assigned the label c = 0 (fake sample) and a sample drawn from the prior z r ∼ N (0, 1) is assigned the label c = 1 (real sample). These fake and real samples are fed through the discriminator, which returns a scalar value for each, representing the probability p r = p(c = 1|z). This yields a modified data log likelihood objective in which the divergence term is replaced by the discriminator's classification objective. However, to update the weights of the GAN-AE, we do not compute partial derivatives of this objective directly. Instead, following the typical multi-step optimization of [15,29], upon presentation of a sample or mini-batch of samples, we compute gradients with respect to NN e (x), NN(z), and NN d (z) separately.
Specifically, if we group all of the encoder weights under Θ NN e , all of the decoder weights under Θ NN , and all of the discriminator weights under Θ NN d , then the gradients of the objective are computed in three separate but successive steps, yielding ∆ auto , ∆ gen , and ∆ disc . Each gradient calculation is followed by a separate gradient ascent update to its relevant target parameters, i.e., ∆ auto is used to update Θ NN e ∪ Θ NN , ∆ gen is used to update Θ NN e , and ∆ disc is used to update Θ NN e ∪ Θ NN d .
The number of synaptic weights associated with the discriminator was included in the model's total parameter count; the discriminator itself had two hidden layers of linear rectifier units. Again, as in the GVAE, the hidden layer functions ϕ ℓ of the encoder and decoder were chosen to be the linear rectifier and the output function ϕ 0 of the decoder was set to be the logistic sigmoid.

Supplementary Note 8: On the Feature Analysis of Neural Generative Coding
The analysis we conducted (in Table 3 of the main paper) on the GNCN-t2-LΣ's intermediate representations involved examining, using the MNIST dataset, the generative synaptic weights of each layer of a trained GNCN-t2-LΣ model. Specifically, when we viewed the weight vectors that conveyed predictions from state neurons in layer 1 to layer 0 (the output), we found that the features resembled rough strokes and digit components (of different orientations/translations). When we viewed the weight vectors relating neurons in layer 2 to layer 1 and layer 3 to layer 2, we found that they instead resembled neural selection "blueprints" or maps that seem to be used to select or trigger lower-level state neurons.
In Figure 6 of the main paper, we illustrate how these higher-level maps seem to interact with the low-level stroke visual features/dictionary. Based on our simple analysis, it appears that the GNCN-t2-LΣ learns a type of multi-level command structure, where neurons in one level learn to turn off and turn on neurons in the neighboring level below them, further scaling those they choose to activate by an intensity coefficient. When we reach layer 1, the state neurons chosen by the command structure of layers 2 and 3, as well as their final produced intensity coefficients (ranging over [0, ∞), given that every layer of the GNCN-t2-LΣ in this paper uses the linear rectifier activation function), work to produce a composition of low-level features that ultimately yields a complete object or digit. This means that the GNCN-t2-LΣ (as well as the other predictive processing models, like the GNCN-t1-Σ of [13] and the GNCN-t1 of [40], since they process sensory inputs in the same way as the GNCN-t2-LΣ) learns to compose and produce a weighted summation of low-level features akin to the results of sparse coding [33], driven by a complex, higher-level neural latent structure. In the "Output" column of Table 3 of the main paper, we empirically confirm this by summing up the feature vectors of the most highly-activated state neurons in layer 1 (multiplying each by its activation coefficient) - this simple superposition visually yields digits quite similar to the original ones presented to the GNCN-t2-LΣ model.
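The superposition check described above can be sketched as follows (a toy dictionary; function and variable names are ours):

```python
import numpy as np

def superpose_top_features(W1, z1, top_k):
    """Weighted sum of the generative weight (feature) vectors, i.e., the
    columns of W1, for the top-k most active layer-1 state neurons."""
    idx = np.argsort(z1)[-top_k:]  # indices of the strongest activities
    return sum(z1[i] * W1[:, i] for i in idx)

W1 = np.eye(4)  # toy dictionary: one-hot "strokes"
z1 = np.array([0.1, 2.0, 0.0, 1.5])
image = superpose_top_features(W1, z1, top_k=2)
```

With a learned stroke dictionary in place of the identity matrix, this weighted sum is what visually reassembles a digit from its most active features.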
While this simple feature analysis is promising, we remark that future work should involve developing methodology to map an NGC model's layerwise activities to those of actual brain activity (using fMRI data) or to a biological model such as HMAX, using methodology similar to that proposed in [39].