Main

As they develop, brain networks learn to achieve objectives, from simple functions such as autonomic regulation, to higher-order processes such as solving problems. Many stereotypical features of networks are downstream consequences of resolving challenges and trade-offs they face, across their lifetime1,2 and evolution3,4,5. One example is the optimization of functionality within resource constraints; all brain networks must overcome metabolic costs to grow and sustain the network in physical space, while simultaneously optimizing that network for information processing. This trade-off shapes all brains within and across species, meaning it could be why many brains converge on similar organizational solutions4. As such, the most basic features of both brain organization and network function—such as its sparse and small-world structure, functional modularity, and characteristic neuronal tuning curves—might arise because of this basic optimization problem.

Our understanding of how the brain’s structure and function interact largely comes from observing differences in brain structure, such as across individuals6 or following brain injury7, and then systematically linking these differences to brain function or behavioural outcomes. But how do these relationships between structure, function and behaviour emerge in the first place? To address this question, we need to be able to manipulate experimentally how neural networks form, as they learn to achieve behavioural objectives, to establish the causality of these relationships. Computational models allow us to do this8. They have shown how network modularity can arise through the spatial cost of growing a network9, how orthogonal population dynamics can arise purely through optimizing task performance10 and how predictive coding can arise through limiting a brain’s energy usage11. But we have yet to incorporate both the brain’s anatomy and the brain’s function into a single coherent model, allowing a network to dynamically trade off its different structural, functional and behavioural objectives in real time.

To achieve this, we introduce spatially embedded recurrent neural networks (seRNNs). An seRNN is optimized to solve a task, making decisions to achieve functional goals. However, as it learns to achieve these goals and to optimize its behavioural performance, its constituent neurons face the kind of resource constraints experienced within biological networks. Neurons must balance their finite resources to grow or prune connections, while the cost of a connection is proportional to its length in three-dimensional (3D) Euclidean space12,13,14,15,16. At the same time, the network attempts to optimize its intra-network communication to allow for efficient propagation of signals17,18,19,20,21. By allowing seRNNs to dynamically manage both their structural and functional objectives simultaneously, while they learn to behave, multiple simple and complex hallmarks of biological brains naturally emerge.

Results

Spatially embedded recurrent neural networks

Our first goal was to create a supervised optimization process that subjects recurrent neural networks (RNNs; ‘RNN modelling’ in Methods) to the constraints of biophysical space while they are optimized for task performance. An established way of influencing a network’s weight matrix while it is optimized for task performance is regularization (Fig. 1a). In regularization, instead of merely optimizing a network’s weights to maximize task performance, one adds an additional regularization term to the optimizer to minimize the strength of a network’s weights. This is related to regularized regression, such as L1 (LASSO) regression, where the sum of the absolute beta weights is minimized to improve a model’s out-of-sample prediction performance. We use the same idea to spatially embed an RNN. We start with fully connected RNNs and while they are trained to maximize task performance, we nudge them to minimize weights that are long in 3D space. To achieve this, we assign every unit in the RNN’s recurrent layer a location in 3D space (Fig. 1b) and regularize a weight more strongly if it belongs to two units that are far apart in Euclidean space. In this pruning process, we also want the network to optimize within-network communication, meaning a weight should be more readily pruned if it does not contribute strongly to the propagation of signals within the network. A standard measure of signal propagation in a (binary) network is communicability, reflecting the shortest routes between all pairs of nodes22 (Fig. 1c; see details in ‘Communicability’ in Supplementary Information). When adapted for a weighted network (weighted communicability19), the communicability value of a network is low when there are strong global core connections supporting short paths across the network while avoiding redundant peripheral connections to achieve sparsity (Fig. 1d). 
In Supplementary Information (‘Minimizing redundant connectivity by minimizing weighted communicability’), we provide information on how weighted communicability differentially optimizes peripheral and core connection strengths. By combining the spatial distance and weighted communicability terms in an RNN’s regularization while it learns to solve a task, we arrive at seRNNs (Fig. 1e). We provide a detailed walkthrough of the regularization function in ‘seRNN regularization function’ in Methods. While learning to solve a task, seRNNs are nudged to prefer short core weights over long peripheral weights.
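The structural loss described above can be sketched in a few lines. This is a minimal illustration, not the implementation given in Methods: it assumes that weighted communicability takes the normalized matrix-exponential form exp(S^(-1/2) |W| S^(-1/2)), with S the diagonal matrix of node strengths, and that the regularizer is the element-wise product of |W|, D and C summed over all connections. The function names, the truncated Taylor series used for the matrix exponential and the strength parameter `gamma` are hypothetical.

```python
import itertools
import numpy as np

def grid_coordinates(shape=(5, 5, 4)):
    """One 3D coordinate per unit on an even grid (5 * 5 * 4 = 100 units)."""
    return np.array(list(itertools.product(*(range(n) for n in shape))), dtype=float)

def euclidean_distances(coords):
    """Pairwise Euclidean distance matrix D between all units."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def matrix_exp(A, terms=30):
    """Truncated Taylor series for exp(A); adequate for small-norm matrices."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k
        result = result + term
    return result

def weighted_communicability(W):
    """Assumed form: exp(S^-1/2 |W| S^-1/2), with S the node strengths."""
    A = np.abs(W)
    strength = A.sum(axis=1)
    # Units without any connections get a zero scaling factor.
    inv_sqrt = np.where(strength > 0, 1.0 / np.sqrt(np.maximum(strength, 1e-12)), 0.0)
    return matrix_exp(inv_sqrt[:, None] * A * inv_sqrt[None, :])

def structural_loss(W, D, gamma=1.0):
    """Sum of |W| * D * C over all connections, scaled by gamma (hypothetical name)."""
    C = weighted_communicability(W)
    return float(gamma * np.sum(np.abs(W) * D * C))
```

In training, this term would be added to the task loss, so that a weight is penalized more when it is long in Euclidean space and when it carries high weighted communicability.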

Fig. 1: Task structure and seRNNs.

a, We use regularization to influence network structure during training to promote smaller network weights and hence a sparser connectome. b, Through regularization, we embed RNNs in Euclidean space by assigning units a location on an even 5 × 5 × 4 grid. We show a schematic of a six-node network in its space. c, We similarly embed RNNs in a topological space, guiding the pruning process towards efficient intra-network communication operationalized by a weighted communicability measure (see main text). The weighted communicability term is shown for the same network. d, When these constraints are placed within a joint regularization term, networks are incentivized to strengthen short connections, which are core to the network’s topological structure, and weaken long connections, which are peripheral. Networks are generally incentivized to weaken connections while optimizing task performance. e, In the main study, we trained 1,000 L1-regularized RNNs as a baseline. L1 networks optimize task performance while minimizing the strength of their absolute weights (W). The network receives task inputs from an eight-unit-wide fully connected feed-forward layer and represents its choice as one of four choice units in the output layer. We compare these with 1,000 seRNNs, which include both Euclidean and topological constraints in their regularization term, by multiplying the weight matrix (W) by its Euclidean distance (D) and weighted communicability (C). Elements of the resulting matrix are summed, forming the structural loss. We minimize the sum of the task loss and the structural loss. To the right, we show the evolution of W, D and C matrices over training. f, Networks solve a one-step inference task starting with a period of twenty steps where the goal is presented in one of four locations on a grid: top/bottom, left/right (depicted in light blue). Subsequently, there is a ten-step delay where the goal location must be memorized.
Then two choice options are provided for twenty steps. Using prior goal information, agents must choose the option closer to the goal. In this example, given left and right options, the correct decision is to select right.

To understand how this spatial embedding impacts a network’s structure and function, we set up 2,000 RNNs. Half of the networks were seRNNs trained with the new optimization process described above. The other half were regular RNNs regularized with a standard L1 regularizer minimizing the sum of the absolute weights, to arrive at a population of baseline networks that match seRNNs in overall connectivity strength. In both cases, the regularizer was applied to the hidden recurrent layer of the network and the regularization strength was systematically varied within each subgroup of networks to cover a wide spectrum of regularization strength that is matched across subgroups (Fig. 1e and ‘Regularization strength set-up and network selection’ in Methods). All networks had 100 units in their hidden layer and were trained for 10 epochs. All networks started strongly connected and learned through pruning weights in accordance with their regularization. We trained networks on a one-choice inference task that required networks to develop two fundamental cognitive functions of recurrent networks: remembering task information (‘goal’) and integrating it with new incoming information (‘choices’) (Fig. 1f and ‘Task paradigm’ in Methods).
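The timing of a single trial (twenty goal steps, a ten-step delay and twenty choice steps, fed through an eight-unit input layer) can be illustrated as follows. This is only a sketch of the trial timeline: the channel layout (channels 0 to 3 for the goal location, channels 4 to 7 for the choice options) and the function name `make_trial` are assumptions, and the actual input encoding is described in ‘Task paradigm’ in Methods.

```python
import numpy as np

GOAL_STEPS, DELAY_STEPS, CHOICE_STEPS = 20, 10, 20  # 50 steps in total
N_INPUTS = 8  # width of the input layer

def make_trial(goal, options):
    """Build one (50, 8) input sequence for a trial.

    Assumed (hypothetical) layout: channels 0-3 one-hot encode the goal
    location, channels 4-7 flag the choice options on offer.
    """
    steps = GOAL_STEPS + DELAY_STEPS + CHOICE_STEPS
    x = np.zeros((steps, N_INPUTS))
    # Goal period: only the goal channel is active.
    x[:GOAL_STEPS, goal] = 1.0
    # Delay period: all inputs stay at zero, so the goal must be memorized.
    choice_start = GOAL_STEPS + DELAY_STEPS
    # Choice period: the offered options are flagged until the end of the trial.
    for option in options:
        x[choice_start:, 4 + option] = 1.0
    return x
```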

When training the networks, we found that both types of network manage to learn the task with high accuracy (Fig. 2a). Focusing on networks that successfully solve the task (>90% task accuracy; n = 390 for seRNNs, n = 479 for L1s; see ‘Regularization strength set-up and network selection’ in Methods for discussion of network numbers), we first validated that our optimization process is working. By using L1 networks as a baseline, we observed that both groups decrease in their average connectivity strength (Fig. 2b) but that only seRNNs did so by pruning long-distance connections (Fig. 2c). This is commonly found in empirical brain networks across species and scales23. In addition, we validate that seRNNs successfully focus their pruning process on weights that are less important for the network’s communicative structure, as represented by lower weighted communicability (Fig. 2d). Figure 2e shows an example visualization of one seRNN.

Fig. 2: Validating the training of seRNNs.

a, The validation accuracy of all converging neural networks is shown across L1 RNNs (n = 479, blue, for all plots) and seRNNs (n = 390, pink, for all plots), showing that equivalent performance is achieved on the one-step inference task. For all plots, error bars correspond to two standard errors. b, At the same time, both groups of networks show a general trend of weakening the weights in their recurrent layer, showing that the overall regularization is working in both groups of networks. c, As a result of their unique regularization function, seRNNs have a negative correlation between weight and Euclidean distance over the course of epochs/training, but in L1 networks there is no relationship between weights and Euclidean distance. d, The regularization function of seRNNs also successfully influences the topology of networks to prefer topologically central weights over topologically peripheral weights, as shown by lower weighted communicability values. e, Left: an example of a representative seRNN network in the 3D space in which it was trained. The size of the nodes reflects their node strength. This network was taken from epoch 9 at a regularization of 0.08 and is the network used for visualizations for the rest of this paper. Middle: we show the negative relationship between the connection weights of seRNN versus the Euclidean distances of the connections. Pearson’s correlation coefficient is provided, with the corresponding P value (P = 7.03 × 10⁻⁷). No adjustments were required for multiple comparisons. Right: we show the weight matrix of this seRNN, showing how weights are patterned throughout the network.

Having shown that the new regularization function in seRNNs has the expected effects on the weight matrix of networks, we next tested which features result from the spatial embedding. Specifically, we tested whether seRNNs show features commonly observed in primate cerebral cortices, including structural motifs such as modularity24,25,26 and small-worldness27,28, before testing for functional clustering of units in space27,28. We then go beyond structural and functional organization and test whether spatial embedding forces networks to implement an energy-efficient mixed-selective code29,30. In short, we wanted to test whether established organization properties of complex brain networks arise when we impose local biophysical constraints.

Modular small-world networks emerge from constraints

We first investigated two key topological characteristics that are commonly found in empirical brain networks across spatial scales and proposed to facilitate brain function: modularity24,25,26 and small-worldness27,28. Modularity denotes dense intrinsic connectivity within a module but sparse, weak extrinsic connections between modules, whereas small-worldness indicates a short average path length between all node pairs, combined with high local clustering.
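To make the notion of modularity concrete, the following sketch computes the standard Newman modularity Q for a given assignment of nodes to modules. It assumes an undirected (symmetric) non-negative adjacency matrix, so a recurrent weight matrix would first need to be converted, for example by symmetrizing absolute weights; that preprocessing and the function name are assumptions, and the procedure used in this study is given in ‘Topological analysis’ in Methods.

```python
import numpy as np

def modularity_q(A, labels):
    """Newman modularity Q of an undirected graph for a fixed partition.

    Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)          # node strengths (degrees for binary graphs)
    two_m = A.sum()            # twice the total edge weight
    same_module = labels[:, None] == labels[None, :]
    expected = np.outer(k, k) / two_m
    return float(((A - expected) * same_module).sum() / two_m)
```

For two disconnected triangles labelled as separate modules, this gives Q = 0.5, whereas placing all nodes in a single module gives Q = 0.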

Computing modularity Q statistics and small-worldness (‘Topological analysis’ in Methods) shows that seRNNs consistently show both increased modularity (Fig. 3a) and small-worldness (Fig. 3b) relative to L1 networks over the course of training. Differences are smaller initially, but later in training, the effect size for differences in modularity is large (at epoch 9, modularity P = 2.24 × 10⁻⁸², Cohen’s d = 1.07; Fig. 3a, right) and that for small-worldness is moderate to large (P = 2.82 × 10⁻¹⁹, Cohen’s d = 0.59; Fig. 3b, right). seRNNs achieve modularity Q statistics within ranges commonly found in empirical human cortical networks31. Both L1 networks and seRNNs achieve the technical definition of small-worldness of >1 (ref. 32), but seRNNs show a higher value more consistent with empirical networks33. ‘Replication across architectures’ in Supplementary Information shows how the subparts of the regularization interact with the task optimization to shape these structural effects. It is important to note that within the population of seRNNs, we find varying degrees of modularity and small-worldness (Fig. 3a, right, and Fig. 3b, right). We will return to this variability in a later section.
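For reference, an effect size of this kind can be computed as sketched below. The pooled-standard-deviation form of Cohen’s d is assumed here, as the text does not state which variant was used, and the function name is illustrative.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```

Applied to the modularity values of the seRNN and L1 populations, a positive d would indicate higher modularity in the first group passed to the function.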

Fig. 3: seRNNs show a brain-like structural topology.

a, Left: a schematic illustration of the concept of modularity in networks. While both L1 (n = 479) and seRNN (n = 390) networks show increasing modularity over epochs/training, there is a consistently greater modularity in seRNNs compared with L1 networks. Error bars correspond to two standard errors. Right: we show very large (Cohen’s d = 1.07) statistical differences in modularity distributions for functioning (validation accuracy ≥90%) epoch 9 networks in L1 and seRNN networks. A two-sample t-test was taken to provide the P value. No adjustments were required for multiple comparisons. b, Left: a schematic illustration of the concept of small-worldness in networks. While both L1 (n = 479) and seRNN (n = 390) networks show a similar trajectory shape of small-worldness over epochs/training, there is a consistently greater small-worldness in seRNNs compared with L1 networks. Error bars correspond to two standard errors. Right: we show moderate-to-large (Cohen’s d = 0.59) statistical differences in small-worldness distributions for functioning epoch 9 networks in L1 and seRNN networks. A two-sample t-test was taken to provide the P value. No adjustments were required for multiple comparisons. c, For a range of generative network models (‘Generative network modelling’ in Methods), we present the model fit of the top performing simulations fit to seRNNs (n = 390). Note that the lower the model fit, the better the performance, as the model fit function is a measure of dissimilarity between the RNN and the generative simulation. The results show that homophily models achieve the best model fits. These findings are congruent with published data from adolescent whole-brain diffusion-MRI structural connectomes35 (middle right) and high-density functional neuronal networks at single-cell resolution15 (right). The boxplots present the minimum value (bottom), maximum value (top), median value (centre) and the interquartile range (bounded 25th and 75th percentile). 
A one-way ANOVA was taken to provide the first P value (P = 1.04 × 10⁻⁹¹), followed by a Tukey’s test for pairwise comparisons in which homophily models had a pairwise P value <10⁻³ for all comparisons.

To further validate the structural likeness of seRNNs to empirical neural connectivity, we used generative network models9,34,35,36. These models elucidate which topological wiring rules can accurately approximate observed neural graphs. Corroborating empirical macro- and microscopic data15,35, we find that homophily wiring rules—where neurons preferentially form connections to other neurons that are self-similar in their connectivity profiles—perform best in approximating the topology of seRNNs relative to all other wiring rules (Fig. 3c and additional detail in ‘Generative network modelling of RNNs’ in Supplementary Information).
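A homophily wiring rule of this kind can be sketched as follows. The example assumes a common form from the generative modelling literature, in which the probability of adding a connection between two unconnected nodes is proportional to D^eta × K^gamma, where D is their Euclidean distance and K is a matching index measuring the overlap of their neighbourhoods. The simplified Jaccard-style matching index, the parameter values and the function names are illustrative assumptions; the models fitted in this study are described in ‘Generative network modelling’ in Methods.

```python
import numpy as np

def matching_index(A):
    """Neighbourhood overlap between node pairs (simplified Jaccard-style form)."""
    B = np.asarray(A) != 0
    n = B.shape[0]
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False  # ignore the pair's own mutual connection
            shared = np.sum(B[i, mask] & B[j, mask])
            either = np.sum(B[i, mask] | B[j, mask])
            M[i, j] = M[j, i] = shared / either if either > 0 else 0.0
    return M

def wiring_probabilities(A, D, eta=-2.0, gamma=0.5, eps=1e-6):
    """Probability of adding each missing edge under a distance x homophily rule."""
    A = np.asarray(A)
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    K = matching_index(A)
    score = D[iu] ** eta * (K[iu] + eps) ** gamma
    score[A[iu] != 0] = 0.0  # existing connections cannot be added again
    P = np.zeros((n, n))
    P[iu] = score / score.sum()
    return P + P.T  # symmetric; the upper triangle sums to one
```

A generative simulation would repeatedly sample one edge from these probabilities, add it to the network and recompute the matching index until the target number of connections is reached.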

Functionally related units spatially organize in seRNNs

So far, we have explored how imposing biophysical constraints within seRNNs produces structures that mimic observed networks. However, this ignores the functional roles of neurons or their patterning within the network. We next examined this by exploring the configuration of functionally related neurons in 3D space (Fig. 4a). In brain networks, neurons sharing a tuning profile to a stimulus tend to spatially group37,38. This can be observed on fine-grained cortical surfaces with preferences for stimuli features39 (Fig. 4b) and in whole-brain functional connectivity forming modular network patterns40 (Fig. 4c). In addition, high-resolution recordings in rodents show how the brain keeps many codes localized but also distributes some across the network41. To test whether seRNNs recapitulate functional co-localization, we decoded how much variance of unit activity can be explained by the goal location or choice options, over the course of each trial (‘Decoding’ in Methods). In Fig. 4d, we show a visualization in a representative network and unit-specific preferences over the course of a single trial.

Fig. 4: Functional clustering and distribution of coding in space.

a, An example of a representative seRNN network. The colour of the nodes relates to the decoding preference of that neuron, where a preference for goal information is represented by green and choices information by brown. b, The spatial clustering of neuronal ensembles that are preferentially tuned for orientation versus colour in human prefrontal cortex. The Dorsal-Ventral (D-V) and Anterior-Posterior (A-P) axes are shown39. c, The macroscopic spatial organization of functional networks40. d, We show decoding of units for goal (green) versus choice (brown) information at different points in the trial, within the representative seRNN network. e, A schematic illustration of the spatial permutation test for determining whether the neurons are functionally clustered (top left) or distributed (top right) in space. For this permutation test, we compute the summed Euclidean distance between units with an observed preference for goal or choice information, respectively, weighted by the magnitude of their preference (termed cluster ∑ weighted Euclidean). This gives a statistic, for every network, corresponding to the weighted distance between units (that is, goal or choice units) in space. To determine whether this statistic was equivalent to chance, for each statistic we computed a null distribution of expected distances between goal and choice units, respectively, under the assumption that they are randomly located in space. This was calculated by taking 1,000 random samples of the same size as the number of empirical neurons with a preference for goal or choice information. The Pperm relates to where the statistic sits within this null distribution, where each network gets a Pperm for goal and choice information. The skew of the Pperm towards zero shows that the code of networks is more clustered than the null distribution whereas a skew towards one highlights a more distributed code. 
The Pperm values across RNNs are given for goal information (middle) and choice information (bottom) for seRNNs (pink) and L1 networks (blue). Goal information is shown to be clustered, as given by the positively skewed Pperm distributions. In contrast, choice information is shown to be distributed. No adjustments were required for multiple comparisons. Panel b reproduced with permission from ref. 39, under a Creative Commons licence CC BY 4.0. Panel c adapted with permission from ref. 40, Elsevier.

By taking the relative preference for goal versus choice for each unit, we tested whether the relative sensitivity to stimuli was concentrated in parts of the network. We used a spatial permutation test (‘Spatial permutation test’ in Methods) to test whether the Euclidean distance between highly ‘goal’ or ‘choice’ selective neurons was significantly less or more than would be expected by chance. A small Pperm value highlights that functionally similar neurons tend to be significantly clustered in space whereas a large Pperm corresponds to functionally similar neurons being distributed in space (Fig. 4e, top).
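The logic of this permutation test can be sketched as follows. The example assumes that pairwise distances are weighted by the product of the two units’ preference magnitudes and that the null distribution is built by drawing 1,000 random sets of unit locations of the same size; these details and the function names are assumptions, and the exact procedure is given in ‘Spatial permutation test’ in Methods.

```python
import itertools
import numpy as np

def grid_coordinates(shape=(5, 5, 4)):
    """One 3D coordinate per unit on an even grid."""
    return np.array(list(itertools.product(*(range(n) for n in shape))), dtype=float)

def weighted_pair_distance(D, idx, w):
    """Summed pairwise distance between selected units, weighted by preference."""
    sub = D[np.ix_(idx, idx)]
    pair_weight = np.outer(w, w)
    iu = np.triu_indices(len(idx), k=1)
    return float(np.sum(sub[iu] * pair_weight[iu]))

def spatial_permutation_p(D, idx, w, n_perm=1000, seed=0):
    """Fraction of random placements that are at least as compact as observed.

    Values near 0 indicate clustering; values near 1 indicate a distributed code.
    """
    rng = np.random.default_rng(seed)
    observed = weighted_pair_distance(D, idx, w)
    n_units = D.shape[0]
    null = np.array([
        weighted_pair_distance(D, rng.choice(n_units, size=len(idx), replace=False), w)
        for _ in range(n_perm)
    ])
    return float(np.mean(null <= observed))
```

Here `idx` holds the indices of units with a preference for one variable (goal or choice) and `w` the magnitudes of those preferences, so that each network yields one Pperm value per variable.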

We tested for functional co-localization across three time windows of the trial (the total duration of a trial was 50 steps; Fig. 1f): (1) early stage (goal presented, steps 15–20); (2) middle stage (choice options presented, steps 30–35) and (3) late stage (decision point, steps 45–50). At the early stage, when only goal information is presented, neurons code for only the goal information (widespread dark green nodes in Fig. 4d, left). In seRNNs, there is a slight positive skew in Pperm values, suggesting clustering of highly goal-coding neurons (Fig. 4e, middle left). Subsequently, in the middle stage, when choice options are first shown, goal information clusters within a concentrated area of space, leaving the choice information distributed (seen by clustering of green nodes and distribution of brown nodes in Fig. 4d, middle). This is highlighted by a large positive skew in Pperm values for the goal in seRNN networks (Fig. 4e, middle top) and correspondingly the opposite for choice information (Fig. 4e, middle bottom). In the late stage, the clustering of goal information in space dissipates such that by the time a decision must be made, the goal information has now spread out more but still retains some clustering (Fig. 4e, middle right). The choice code remains distributed (Fig. 4e, bottom right). This suggests that seRNNs use their highly modular structure to keep a connected core with goal information, which needs to be retained across the trial. They use spatially proximal units to form this core. The presented choices information is then represented by units outside this core and dynamically integrates with the information in the core as the decision point approaches. These findings are unique to seRNNs, as L1 Pperm values remain uniform, indicative of no functional organization. The control analysis in Supplementary Fig. 12 shows these findings hold true when variables are treated independently instead of relatively.

Mixed selectivity and energy-efficient coding

So far, we have shown that adding spatial constraints to a network gives rise to patterns of network connectivity that are highly reminiscent of observed biological networks. Nodes functionally co-localize and the spatial embedding causes differences in how they code task-relevant information. This selectivity profile has been widely studied. Studies show that neurons in task-positive brain regions tend to show a mixed selectivity profile, meaning that neurons do not only code for a single task variable but instead a mixture of them30,42,43,44. A mixed-selective code is assumed to allow networks to solve complex tasks by increasing the decodability of information from the network’s neurons29,45. There are many ways to quantify selectivity profiles46. One simple method is to calculate the correlation of explained variances of task variables across the population of neurons. These are expected to be uncorrelated, implying a neutrally mixed code where a neuron’s coding preference for one variable does not predict its code for another variable. In single-unit recordings, correlations can be close to zero or sometimes slightly positive47.
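The correlation-of-selectivities measure described above can be sketched as follows. The example assumes that each unit’s selectivity for a task variable is quantified as the fraction of its activity variance explained by that variable (between-condition over total sum of squares); this estimator and the function names are assumptions, and the decoding approach used in this study is given in ‘Decoding’ in Methods.

```python
import numpy as np

def explained_variance(activity, labels):
    """Per-unit fraction of variance explained by a categorical task variable.

    activity: (n_trials, n_units) array; labels: (n_trials,) condition labels.
    """
    activity = np.asarray(activity, dtype=float)
    labels = np.asarray(labels)
    grand_mean = activity.mean(axis=0)
    ss_total = ((activity - grand_mean) ** 2).sum(axis=0)
    ss_between = np.zeros(activity.shape[1])
    for value in np.unique(labels):
        group = activity[labels == value]
        ss_between += len(group) * (group.mean(axis=0) - grand_mean) ** 2
    return ss_between / np.maximum(ss_total, 1e-12)

def selectivity_correlation(activity, goal, choice):
    """Pearson correlation, across units, of goal and choice explained variance."""
    ev_goal = explained_variance(activity, goal)
    ev_choice = explained_variance(activity, choice)
    return float(np.corrcoef(ev_goal, ev_choice)[0, 1])
```

A value near zero indicates that a unit’s coding strength for the goal does not predict its coding strength for the choices, whereas a strongly negative value indicates that units specialize in one variable at the expense of the other.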

We looked at the correlation of selectivities of trained networks (epoch 9) for the goal and choices variables. At the time in the trial when networks make a choice, the median correlation is r = −0.057 for seRNN but r = −0.303 for L1, showing that L1 networks produce an anticorrelated code while seRNNs have a more mixed-selective code (Fig. 5a). It is possible that this effect is driven by the differential connectome structure of the two groups of networks. While a modular and separated network would not automatically mix codes across variables evenly, we find a well-mixed code in seRNNs. The additional highly communicative connections between modules of the small-worldness characteristic might help seRNNs to organize units in space while retaining a mixed code across the population. ‘Mixed selectivity’ in Supplementary Information shows how networks specifically show a mixed-selective code at the time when the decision is made. Like our structural results, we saw that there is variation across the population of networks (Fig. 5a), where some networks fall neatly on r = 0 and others might show correlated codes. The following section provides an analysis of this variance.

Fig. 5: Mixed selectivity and energetic efficiency.

a, A histogram of correlations of selectivities at the decision point (correlation between explained variance for goal and explained variance for choices) shows how the distribution of seRNNs is more centred around the expected value r = 0 than the L1 networks. Coloured lines mark the median of the distribution. The expected value corresponds to a fully mixed-selective code. b, seRNN (n = 390) networks spend less energy on unit activations than L1 (n = 479) networks, which are matched for mean weight strength in the recurrent layer. Error bars correspond to two standard errors.

The choice of a neuronal code in populations of neurons is strongly linked to the question of energy demand. As the firing of action potentials uses a substantial amount of energy48, a population of neurons should choose a code with a good trade-off of metabolic cost and information capacity29. To test our networks’ energy consumption, we calculated the mean activation of each unit in a network’s recurrent layer (epoch 9) during the period of information integration (after onset of choices). Then we tested for the difference between seRNNs and L1 networks, controlling for the effect of the average weight strength in the recurrent layer (Fig. 5b). Across most weight strengths, seRNNs showed significantly lower unit activations compared with L1 networks (P < 0.001, t(86,497) = 21.4, 95% confidence interval = [−0.271, −0.226]). Sustaining a mixed-selective code at the time of choice might help downstream integration units to decode information more easily, with fewer unit activations needed to communicate the correct choice. This effect disappears for networks with higher average weights, with weak regularization and hence weaker spatial embedding.
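One way to test for a group difference in unit activation while controlling for average weight strength is an ordinary least-squares model with a group indicator and a weight covariate, sketched below. The exact statistical model behind the reported test is not specified in this section, so the model form, the variable names and the function name are assumptions.

```python
import numpy as np

def group_effect(activation, is_sernn, mean_weight):
    """Estimate the seRNN-versus-L1 difference in activation, adjusted for weight.

    Fits activation = b0 + b1 * is_sernn + b2 * mean_weight and returns b1.
    """
    activation = np.asarray(activation, dtype=float)
    is_sernn = np.asarray(is_sernn, dtype=float)
    mean_weight = np.asarray(mean_weight, dtype=float)
    # Design matrix: intercept, group indicator, covariate.
    X = np.column_stack([np.ones_like(mean_weight), is_sernn, mean_weight])
    coef, *_ = np.linalg.lstsq(X, activation, rcond=None)
    return float(coef[1])
```

A negative coefficient corresponds to lower unit activations in seRNNs than in L1 networks at matched average weight strength.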

Constraints cause linked brain-like structure and function

So far, we have seen that seRNNs show a collection of features that are commonly observed in brains but have not previously been related. The caveat not addressed so far is that for any feature we observed in seRNNs, we also see strong variation across the population of networks (for example, Fig. 3a for modularity or Fig. 5a for mixed selectivity). This opens the possibility that these features do not arise in parallel in seRNNs but instead each feature could emerge in its unique subgroup of networks. This would be unlike biological brains, which exist in a critical sweet-spot area49 where all the features described in this paper are observed. In this section, we tested whether all seRNN features co-appear in a similar subset of trained networks, defined by a unique combination of training parameters.

To study the co-occurrence of brain features in seRNNs, we looked at the distribution of feature magnitude across the space of training parameters (regularization strength, number of training epochs passed). Figure 6a shows matrix plots for accuracy (left), total sum of weights (middle left), modularity (middle right) and small-worldness (right) across the entire spectrum of training epochs (x axis) and regularization strengths (y axis). As before, there is variation in the magnitude of features across the population of networks, but now we also see that this variation is structured. Brain-like topology emerges in a sweet-spot of low to medium strength regularization and during the later training epochs (pink box). The schematic in Fig. 6b highlights this space of sparse, highly accurate, modular small-world networks with an example network showing all properties (Fig. 6b, middle right). Above this space (that is, networks with less regularization, highlighted in orange) networks can solve the task and show small-worldness, but remain very dense and lack the modular organization found in empirical brain networks. Below this space (that is, networks with more regularization, highlighted in light blue) networks show extreme sparsity and modularity, but fail to functionally converge on the task and they lose their small-world topology.

Fig. 6: The seRNN parameter space converges on brain-like topology and function.

a, The white borders within the regularization-training parameter space delineate the conditions where seRNNs achieve robust accuracy (left), sparse connectivity (middle left), modular networks (middle right) and small-worldness (right). The pink box shows where all these findings can be found simultaneously. The colour of the matrix corresponds to the relative magnitude of the statistic compared with the maximum. b, This is further highlighted by a schematic representation, which shows the space of possible seRNNs. The pink box shows the overlap of all findings, where accurate, sparse, modular, small-world networks are generated, which we term as being at the optimal trade-off. Networks 1, 2 and 3 each represent example networks across the space. The nodes of the representative graph reflect the node’s strength, defined as the total sum of the node’s in- and out-connection weights. c, In this pink window, networks are sparse (top), prefer short connections (middle top), have correlations of variable selectivities centred around zero, consistent with mixed selectivity (middle bottom) and have equivalent explained variance for both the goal and the choice (bottom).

Next, we wanted to look at the same ‘sweet spot’ in terms of the network’s functional properties. As the decoding required us to focus this analysis on networks with high task performance (‘Decoding’ in Methods), we use networks with an accuracy >90% at epoch 9. Figure 6c shows the functional results across regularization strengths, highlighting the sweet spot of regularization from Fig. 6a with the pink box. In the first two plots from the top, we show two structural metrics (sparsity and short connection preference). We observed the same distribution when looking at the homophily generative wiring rule (Supplementary Fig. 11b). Looking at mixed selectivity (Fig. 6c, third from top), our analyses revealed that networks show a mixed-selective code at the decision point in the sweet-spot window identified before. Units here show a balanced code with information for both goal and choices (Fig. 6c, bottom), whereas very dense or sparse networks show a preference for either goal or choices information. As such, the density and related modular small-world structure influence the time horizon of information flowing through the network. Dense networks show greater focus on past information, which resonates with how functional networks reconfigure to support memory50. Supplementary Fig. 14 shows a correlation matrix of the pairwise relationships between the features studied here.

Our findings show that there is a critical parameter window in which both structural and functional brain features jointly emerge in seRNNs. Brains are often said to live in a unique but critical niche where all characteristics needed to support their function can exist in parallel51. seRNNs show the same preference for a critical parameter window but also give us the ability to study networks on their way to converging on brain-like characteristics in this critical window.

Discussion

Functioning brains have key organizational features endowing them with computational capacities to perform a broad range of cognitive operations efficiently and flexibly. These include sparse connectivity with a modular small-world structure25,27,52, generatable via homophilic wiring rules34,35,36, with spatially configured functional units that implement a mixed-selective code30,45 and minimize energy expenditure29,48. We argue that these complex hallmarks can be, at least in part, attributed to three forces impacting virtually any brain network: optimization of functional performance in a (task) environment, metabolic/structural costs of the network and signal communication within the network. In this work we have shown that seRNNs allow us to manipulate these optimization goals, demonstrating that seemingly unrelated neuroscientific findings can emerge in unison and appear to have a strong co-dependence. We believe that these findings also have an impact on how we think about the interlinked structural and functional optimization processes in the brain under economic constraints (‘Network economics in structural and functional neuroscience models’ in Supplementary Discussion). Our model provides an important tool to continue the work on jointly studying structure and function in neuroscience models53,54,55,56,57. In addition, our results are relevant for developments at the intersection of neuroscience and artificial intelligence (NeuroAI58) (‘Implications of seRNN findings on artificial intelligence’ in Supplementary Discussion).

There are many areas that we wish to improve on with future research. Principally, our models did not include a substantial amount of biological detail that, while inevitably critical for neuronal functioning, does not speak to the observations we aimed to recapitulate in the present study. Implementing such details, including molecular mechanisms guiding circuit development59 or heterogeneous spiking of neurons60, will probably provide insights into the trade-offs specific to biological brains. The addition of such details will help us expand the applicability of our models to explore the effect of developmental time courses61,62, functional brain specialization63 and how network variability may underpin individual differences64. Beyond these biological details, it will be important to see how different functional goals would have differential effects on structural optimization processes. The simple working memory task used here provides a first realistic cognitive challenge, but it will be interesting to consider seRNNs in continuous choice multi-task environments. Finally, it is unknown what effect the inclusion of biophysical constraints has on the randomness of network structure, although we speculate it would generate less-random network structures, compared with regular task-optimized networks.

The development of seRNNs allowed us to observe the impact of optimizing task control, structural cost and network communication in a model system that can dynamically trade off its structural and functional objectives. This suggests that providing artificial neural networks with a topophysical structure65,66 can enhance our ability to directly link computational models of neural structure and function. We believe that the modelling approach shown to work in seRNNs will speed up innovations in neuroscience by allowing us to systematically study the relationships between features that all have been individually discussed to be of high importance to the brain.

Methods

seRNN regularization function

In a canonical supervised RNN, all the network’s trainable parameters are optimized to minimize the difference between the predicted value and the correct value. To achieve this, we define a task loss function (L), which defines the prediction error to be minimized to optimize task performance. To produce a network that generalizes well to unseen data, we can add a regularization term. Regularization incentivizes networks to converge on sparse solutions and is commonly applied to neural networks in general67 and neuroscientific network models68,69. For a regularized network, the loss function becomes a combination of both the task loss and the regularization loss. One example of a commonly applied regularization is the L1 regularization, which is also used in LASSO regression70 and incentivizes the network to maximize task performance while concurrently minimizing the sum of all absolute weights in the neural network. If we want to regularize the recurrent weight matrix (W) with the dimensions m × m, where m is the number of units in the recurrent layer, the loss function would be:

$$L=L_{\mathrm{Task}}+{\mathrm{L1}}=L_{\mathrm{Task}}+\gamma \left|\left|W\,\right|\right|$$
(1)
$$\left|\left|W\,\right|\right|=\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j=1}^{m}{{\rm{|}}w}_{i,\,j}{\rm{|}}$$
(2)

An RNN with this loss function would learn to solve the task with a sparse weight matrix (𝑤𝑖,𝑗), where γ would determine the extent to which the network is forced to converge on a sparse solution. This parameter is called the regularization strength.
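
As a minimal illustration of equations (1) and (2), the L1 penalty can be sketched in NumPy. This is not the training code used in the study; the function name `l1_penalty`, the random weight matrix and the placeholder task loss are assumptions made purely for illustration.

```python
import numpy as np

def l1_penalty(W, gamma):
    """Equations (1)-(2): gamma times the sum of absolute recurrent weights."""
    return gamma * np.abs(W).sum()

# Hypothetical values, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 100))  # m x m recurrent weight matrix
task_loss = 0.35                            # placeholder for L_Task
total_loss = task_loss + l1_penalty(W, gamma=0.01)
```

Increasing `gamma` raises the cost of every non-zero weight, which is what pushes the optimizer towards sparser solutions.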

Unlike regular RNNs, real brain networks are embedded in a physical space12,13,14. To simulate the pressures caused by existing in a biophysical space, we manipulated the regularization term. We hypothesized that by incorporating constraints that appear common to any biological neural system, we could test whether these local constraints are sufficient to drive a network architecture that more closely resembles observed brain networks. Specifically, we included spatial constraints in two forms, Euclidean and network communication, that we argue are integral to any realistic neural network. To implement this, we first embed units within a 3D space, such that each unit has a corresponding x, y and z coordinate. Using these coordinates, we can generate a Euclidean distance matrix, D, that describes the physical distance between each pair of nodes (Fig. 1b). This allows us to minimize weights multiplied by their Euclidean distance (\(d_{i,j}\)), thereby incentivizing the network to minimize (costly) long-distance connections. The element-wise matrix multiplication is denoted with the Hadamard product \(\odot\). Adding this to our optimization term gives us:

$$L={L}_{{{\mathrm{Task}}}}+\gamma \left|\left|W\odot D\right|\right|$$
(3)
$$\left|\left|W\odot D\right|\right|=\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j=1}^{m}\left|{w}_{i,\,j}\right|\times \left|{d}_{i,\,j}\right|$$
(4)
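
The spatial term of equations (3) and (4) can be sketched as follows. The 5 × 5 × 4 lattice used to place 100 units is an assumption for illustration, as are all function names; only the penalty itself follows the equations above.

```python
import numpy as np

def grid_coordinates(nx=5, ny=5, nz=4):
    """Place nx*ny*nz units on a regular 3D lattice (grid shape is an assumption)."""
    axes = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
    return np.stack([a.ravel() for a in axes], axis=1).astype(float)

def euclidean_distance_matrix(coords):
    """Pairwise Euclidean distances d_ij between unit coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def spatial_penalty(W, D, gamma):
    """Equations (3)-(4): gamma * sum_ij |w_ij| * |d_ij|."""
    return gamma * (np.abs(W) * np.abs(D)).sum()

coords = grid_coordinates()
D = euclidean_distance_matrix(coords)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 100))  # hypothetical recurrent weights
penalty = spatial_penalty(W, D, gamma=0.01)
```

Because the diagonal of D is zero, self-connections carry no cost under this term as written in the sketch.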

The above formalization provides a spatial context for RNN training. In a next step, we want to follow the same approach to incentivize networks to preferably prune weights that are not strongly contributing to the within-network communication structure. We can impose this influence of communication via a weighted communicability term19,22, which computes the extent to which, under a particular network topology, any two nodes are likely to communicate both directly and indirectly over time (Fig. 1c). Now taking this topological communication into account, we get the following loss function:

$$L={L}_{{{\mathrm{Task}}}}+\gamma \left|\left|W\odot D\odot C\right|\right|$$
(5)
$$C={{\mathrm{e}}}^{{S}^{-\frac{1}{2}}\left|W\,\right|{S}^{-\frac{1}{2}}}$$
(6)

Supplementary Figs. 15 provide a walkthrough explanation of how this term works and expand on the logic of how constraining the network’s topology can serve as a prior for intra-network communication in sparse networks. Supplementary Fig. 6 specifically highlights the role that communicability has within the network optimization process. Note that in equation (6), S is a diagonal matrix with the degree of unit i (degi) on the diagonal (that is, the node strength), which simply acts as a normalization term preventing any one single edge having undue influence19. This is explained in Supplementary Figs. 4 and 5.

Importantly, as all terms (W, D, C) are element-wise multiplied within the regularization term, they are all minimized as part of the training process. Note, it is possible, in principle, to parameterize each part of the above equation to vary the extent to which each term influences network outcomes. However, in this work, we focus on establishing the role of all in tandem. Future work could look to establish models with greater parameterization to establish optimal relative magnitudes.
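
A minimal sketch of equations (5) and (6) is given below, using `scipy.linalg.expm` for the matrix exponential. It is not the study's implementation. Taking the node strength as the row sum of |W| and adding a small `eps` to avoid division by zero are both assumptions of this sketch, as are the function names.

```python
import numpy as np
from scipy.linalg import expm

def weighted_communicability(W, eps=1e-12):
    """Equation (6): C = exp(S^{-1/2} |W| S^{-1/2}), with S = diag(node strength)."""
    A = np.abs(W)
    strength = A.sum(axis=1)                  # assumption: row sums of |W|
    inv_sqrt = 1.0 / np.sqrt(strength + eps)  # eps guards against isolated units
    normalized = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return expm(normalized)

def sernn_penalty(W, D, gamma):
    """Equation (5): gamma * sum_ij |w_ij| * d_ij * c_ij."""
    C = weighted_communicability(W)
    return gamma * (np.abs(W) * D * C).sum()

# Hypothetical inputs: random unit positions and random weights.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(100, 3))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
W = rng.normal(scale=0.1, size=(100, 100))
penalty = sernn_penalty(W, D, gamma=0.01)
```

Since C is recomputed from the current W, the penalty changes as the weights change, so in a training loop it would need to be re-evaluated at every update.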

Task paradigm

The task that networks are presented with is a one-choice inference task requiring networks to remember and integrate information (Fig. 1f). On an abstract level, networks needed to first store a stimulus, integrate it with a second stimulus and make a predefined correct choice. More specifically, networks first observe stimulus A for 20 time steps, followed by a delay for 10 time steps, followed by stimulus B for 20 steps. Agents must then make one choice. This set-up can be interpreted as a one-step navigation task, where agents are presented with the goal location (stimulus A) followed by possible choice directions (stimulus B). The choice to be made is the one moving closer to the goal. Extended Data Table 1 outlines all possible trials and defines whether the given trial is included in the regular version of the task used in the main text.

All stimuli are one-hot encoded with a vector of eight binary digits. The first four define the goal locations and only one of the four digits would be set to one during the goal presentation. The second four binary digits each stand in for one allowed choice direction and two choice directions would be set to one during the choice options presentation. Gaussian noise with a standard deviation of 0.05 is added to all inputs.
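
The input sequence for a single trial can be sketched as below. The function name and arguments are hypothetical, and the sketch assumes that noise is added at every time step, including the delay. Which goal and choice-option combinations form valid trials is defined in Extended Data Table 1 and is not checked here.

```python
import numpy as np

def make_trial_input(goal, options, rng, noise_sd=0.05):
    """Build one (50, 8) input: 20 steps goal, 10 steps delay, 20 steps choice options.

    goal:    index 0-3 of the goal location (stimulus A).
    options: two indices 0-3 of the allowed choice directions (stimulus B).
    """
    x = np.zeros((50, 8))
    x[:20, goal] = 1.0                      # digits 0-3 encode the goal
    x[30:, [4 + o for o in options]] = 1.0  # digits 4-7 encode choice options
    return x + rng.normal(scale=noise_sd, size=x.shape)

rng = np.random.default_rng(0)
x = make_trial_input(goal=2, options=(0, 3), rng=rng)
```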

This task design is a simplified version of a multi-step maze navigation task we have recorded in macaques. A harder version of the task with an extended set of trials is equivalent to the first choice monkeys face in their version of the task. We use the full set of trials for a control calculation in Supplementary Fig. 8. After this first choice, the monkeys then continue the task with a further step to reach the goal and collect the reward. As the goal of this study was to establish the emerging features of seRNNs, here we focus just on the first choice and leave questions relating to the multi-step task to future investigations.

RNN modelling

All recurrent neural networks in this project have 100 units in the hidden layer and are defined by the same basic set of equations:

$${h}_{t}={{\mathrm{ReLU}}}\left({W}_{x}{x}_{t}+{W}_{h}{h}_{t-1}+{b}_{h}\right)$$
(7)
$${\widehat{y_{t}}}={\rm{\sigma }}\left({W}_{y}{h}_{t}+{b}_{y}\right)$$
(8)

Here \(x_t\) is the input vector at time t (1 × 8), \(W_x\) is the input layer weight matrix (8 × 100) (Xavier initialization), \(h_{t-1}\) is the activation of the hidden layer at time t − 1 (1 × 100) (zeros initialization), \(W_h\) is the hidden layer weight matrix (100 × 100) (orthogonal initialization), \(b_h\) is the bias of the hidden layer (1 × 100) (zeros initialization), \(h_t\) is the activation of the hidden layer at time t (1 × 100) (zeros initialization), \(W_y\) is the output layer weight matrix (100 × 8) (Xavier initialization), \(b_y\) is the bias of the network output (1 × 8) (zeros initialization), σ is the softmax activation function and \(\widehat{{y}_{t}}\) is the network output/prediction.

Networks differ in terms of which regularization was applied to their hidden layer and with which regularization strength. Networks are optimized to minimize a cross-entropy loss on task performance combined with the regularization penalty using the Adam optimizer (hyperparameters: learning rate 0.001, beta_1 0.9, beta_2 0.999, epsilon \(1\times {10}^{-7}\)) for 10 epochs. Note that the network’s choice is only read out once, at the very end of the trial. Each epoch consists of 5,120 problems, batched in blocks of 128 problems.
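
The forward pass of equations (7) and (8) can be sketched in NumPy as follows. This illustrates the equations and the stated initializations only; it does not include training. The sketch assumes the uniform variant of Xavier initialization and uses the row-vector convention implied by the stated dimensions (1 × 8 inputs, 8 × 100 input weights). All function names are hypothetical.

```python
import numpy as np

def xavier(shape, rng):
    """Xavier (Glorot) uniform initialization; the uniform variant is an assumption."""
    limit = np.sqrt(6.0 / (shape[0] + shape[1]))
    return rng.uniform(-limit, limit, size=shape)

def orthogonal(n, rng):
    """Orthogonal initialization via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(rng, n_in=8, n_hidden=100, n_out=8):
    """Return (W_x, W_h, b_h, W_y, b_y) with the initializations listed above."""
    return (xavier((n_in, n_hidden), rng), orthogonal(n_hidden, rng),
            np.zeros(n_hidden), xavier((n_hidden, n_out), rng), np.zeros(n_out))

def forward(x_seq, params):
    """Apply equation (7) at every step and equation (8) once, at the final step."""
    W_x, W_h, b_h, W_y, b_y = params
    h = np.zeros(W_h.shape[0])                         # h_0 = 0
    for x_t in x_seq:
        h = np.maximum(0.0, x_t @ W_x + h @ W_h + b_h)  # ReLU
    return softmax(h @ W_y + b_y)

rng = np.random.default_rng(0)
params = init_params(rng)
y_hat = forward(rng.normal(size=(50, 8)), params)      # hypothetical input
```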

Regularization strength set-up and network selection

The most critical parameter choice in our analyses is the regularization strength. As shown across analyses (for example, Fig. 6), the strength of the regularization has a major influence on all metrics analysed here. While the L1 regularization and the purely Euclidean regularization could be matched by average strength of regularization of the hidden layer, the communicability term of seRNNs makes this challenging due to it being dependent on the current state of the hidden layer and hence changing throughout training. To match the spectrum of regularization strengths in L1 and seRNNs, we used a functional approach. As performance in the task starts to break down as networks become too sparse to effectively remember past stimuli, we matched regularization strength using task performance before looking at any of the other structural or functional metrics. Specifically, we set the regularization spectrum on a linear scale and chose the boundary values so that task performance started to deteriorate half-way through the set of networks (so around the 500th network for the sets of 1,000 networks).

To make both groups comparable, we focus our analyses on networks that achieve >90% task accuracy. For the L1 networks, these were 47.9% of all trained networks and for seRNN networks 39%. Note that this difference in percentages is not meaningful per se and could be eliminated by matching the regularization spectra of both groups more closely. As we focus our analyses on highly functional networks with high task accuracy, matching the regularization spectra of both groups would not have influenced the results. The code repository has an overview file with regularization strengths chosen for different network types. We hope that future implementations of the seRNNs can provide a method for more precise numerical matching between regularization strengths.

Topological analysis

Graph theory network statistics were calculated using the Brain Connectivity Toolbox71, and the mathematical formalisms are provided below. All network statistics were calculated on the hidden RNN weight matrix, with every edge weight replaced by its absolute value. When the measure in question required a binary network (for example, small-worldness), a proportional threshold was applied, retaining the top 10% of these absolute connections.
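
The proportional threshold can be sketched as follows. This is a NumPy illustration rather than the Brain Connectivity Toolbox routine. Excluding self-connections and applying the fraction to off-diagonal entries are assumptions of this sketch, and with tied weights slightly more than the requested fraction may be kept.

```python
import numpy as np

def binarize_top_fraction(W, fraction=0.10):
    """Keep the strongest `fraction` of absolute off-diagonal weights as binary edges."""
    A = np.abs(np.asarray(W, dtype=float))
    np.fill_diagonal(A, 0.0)                 # assumption: ignore self-connections
    n_keep = int(round(fraction * (A.size - A.shape[0])))
    if n_keep == 0:
        return np.zeros(A.shape, dtype=int)
    cutoff = np.sort(A, axis=None)[-n_keep]  # weight of the n_keep-th strongest edge
    return ((A >= cutoff) & (A > 0)).astype(int)

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))              # hypothetical hidden weight matrix
B = binarize_top_fraction(W)
```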

Modularity

The modularity statistic, Q, quantifies the extent to which the network can be subdivided into clearly delineated groups:

$$Q=\frac{1}{l}\sum _{i,\,j\in N}\left({a}_{i,\,j}-\frac{{k}_{i}{k}_{j}}{l}\right){\delta }_{{m}_{i}{m}_{j}},$$
(9)

where l is the number of connections, N is the set of all nodes, \(a_{i,j}\) is the connection status between nodes i and j (\(a_{i,j}=1\) when i and j are connected and \(a_{i,j}=0\) otherwise), and \(k_i\) and \(k_j\) are the total numbers of connections (degrees) of nodes i and j. \(m_i\) is the module containing node i, and \({\delta }_{{m}_{i}{m}_{j}}=1\) if \(m_i=m_j\), and 0 otherwise. In this work, we tested the modularity using the default resolution parameter of 1.
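
Equation (9) can be evaluated for a given partition as sketched below. The sketch assumes a binary undirected adjacency matrix and takes l as the sum of all entries of the matrix, so that each undirected connection is counted twice; this convention is an assumption here. It scores a supplied set of module labels only and does not perform the community detection needed to find the partition that maximizes Q.

```python
import numpy as np

def modularity_q(A, modules):
    """Equation (9) for a binary undirected adjacency matrix and given module labels."""
    A = np.asarray(A, dtype=float)
    modules = np.asarray(modules)
    k = A.sum(axis=1)                            # degrees k_i
    l = A.sum()                                  # assumption: undirected edges counted twice
    same = modules[:, None] == modules[None, :]  # delta_{m_i m_j}
    return ((A - np.outer(k, k) / l) * same).sum() / l

# Two disconnected triangles, each assigned to its own module.
triangle = np.ones((3, 3)) - np.eye(3)
A = np.block([[triangle, np.zeros((3, 3))], [np.zeros((3, 3)), triangle]])
q = modularity_q(A, [0, 0, 0, 1, 1, 1])
```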

Small-worldness

Small-worldness refers to a graph property where most nodes are not neighbours of one another, but the neighbours of nodes are likely to be neighbours of each other. This means that most nodes can be reached from every other node in a small number of steps. It is given by:

$$\sigma =\frac{c/{c}_{{{\mathrm{rand}}}}}{l/{l}_{{{\mathrm{rand}}}}},$$
(10)

where c and \(c_{\mathrm{rand}}\) are the clustering coefficients, and l and \(l_{\mathrm{rand}}\) are the characteristic path lengths, of the tested network and of a random network with the same size and density as the tested network, respectively. Networks are generally considered small-world networks at σ > 1. In our work, we computed the random-network statistics as the mean across a distribution of 1,000 random networks. The characteristic path length, with n the number of nodes and \(d_{i,j}\) here denoting the shortest path length between nodes i and j (rather than the Euclidean distance of equation (4)), is given by:

$$l=\frac{1}{n}\sum _{i\in N}\frac{{\sum }_{j\in N,\,j\ne i}{d}_{i,\,j}}{n-1}$$
(11)
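
Equations (10) and (11) can be sketched in NumPy for a binary undirected network. This is an illustration, not the Brain Connectivity Toolbox code. Averaging only over finite path lengths when a network is disconnected, the way random reference networks are drawn, and the small default number of random networks are all assumptions of this sketch.

```python
import numpy as np

def clustering_coefficient(A):
    """Mean local clustering coefficient of a binary undirected graph."""
    k = A.sum(axis=1)
    triangles = np.diag(A @ A @ A) / 2.0
    possible = k * (k - 1) / 2.0
    local = np.divide(triangles, possible,
                      out=np.zeros_like(triangles), where=possible > 0)
    return local.mean()

def characteristic_path_length(A):
    """Equation (11): mean shortest path length between distinct node pairs."""
    n = A.shape[0]
    d = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(d, 0.0)
    for k in range(n):                            # Floyd-Warshall
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    finite = np.isfinite(d) & ~np.eye(n, dtype=bool)
    return d[finite].mean()                       # assumption: skip unreachable pairs

def random_graph_like(A, rng):
    """Random undirected graph with the same number of nodes and edges as A."""
    n = A.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    n_edges = int(A[rows, cols].sum())
    chosen = rng.choice(len(rows), size=n_edges, replace=False)
    R = np.zeros_like(A)
    R[rows[chosen], cols[chosen]] = 1
    return R + R.T

def small_worldness(A, n_random=100, seed=0):
    """Equation (10), with random-network statistics averaged over n_random graphs."""
    rng = np.random.default_rng(seed)
    rand = [random_graph_like(A, rng) for _ in range(n_random)]
    c_rand = np.mean([clustering_coefficient(R) for R in rand])
    l_rand = np.mean([characteristic_path_length(R) for R in rand])
    c, l = clustering_coefficient(A), characteristic_path_length(A)
    return (c / c_rand) / (l / l_rand)
```

For a connected network the path-length function reduces exactly to equation (11); the finite-only average matters only when some nodes cannot reach each other.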

Generative network modelling

We use a technique called generative network modelling to investigate whether the connectome of networks can be recreated by unsupervised wiring rules. The idea is to start from an empty network and probabilistically add connections based on simple wiring equations. The wiring equations are based on the topological structure of the existing network. We follow the approach outlined in refs. 15,35. We provide an overview of this approach in ‘Generative network modelling of RNNs’ in Supplementary Information.

Decoding

To analyse the internal function of our trained recurrent neural networks, we record the hidden state activity of every unit while the network solves a set of 640 trials. Each trial consists of 50 steps (as shown in Fig. 1e). For decoding, the activity is averaged in step windows of 5, so that there is a total of 10 time windows. In animal electrophysiology, researchers often look at the explained variance per task variable per unit. To allow for comparison of our networks with findings in the literature, we wanted to extract the same metric. Given the nature of our task, the variables used to predict unit activity (goal, choice options, correct choice) are highly correlated, so that the standard decoding with analysis of variance (ANOVA) would give biased results. Instead, we used a decoding algorithm based on L1 regression, as follows.

(1) Apply cross-validated L1 regression with k-fold cross validation (5 folds) to set alpha term with best cross-validation performance.
(2) Split the dataset via repeated k-fold (3 folds, 2 repeats).
(3) On each (train, test) dataset:
    (a) Train L1 regression with the pre-set alpha term.
    (b) Calculate explained variance in test dataset including all predictor variables.
    (c) Iteratively set all values of a given set of predictors (for example, all goal predictors) to 0 and recalculate the explained variance and calculate the drop of explained variance per predictor group.
    (d) Take mean of drop of explained variance for each group across splits of dataset.

This algorithm results in every unit in every network being assigned an explained variance number for every task variable. Note that the decoding cannot reliably work in networks that make too many errors, so that we functionally analyse only networks with a task performance of 90% or above.
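
The decoding steps listed above can be sketched for a single unit and time window with scikit-learn. This is an illustrative reading of the procedure, not the code used in the study. The layout of the design matrix and the grouping of its columns into task variables (`groups`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import RepeatedKFold

def decode_unit(X, y, groups, seed=0):
    """Mean drop in explained variance per predictor group for one unit.

    X:      (trials, predictors) design matrix.
    y:      (trials,) window-averaged activity of one unit.
    groups: dict mapping a task variable name to its column indices in X.
    """
    alpha = LassoCV(cv=5).fit(X, y).alpha_                        # step 1
    splitter = RepeatedKFold(n_splits=3, n_repeats=2, random_state=seed)  # step 2
    drops = {name: [] for name in groups}
    for train, test in splitter.split(X):                         # step 3
        model = Lasso(alpha=alpha).fit(X[train], y[train])        # step 3a
        full = explained_variance_score(y[test], model.predict(X[test]))  # step 3b
        for name, cols in groups.items():                         # step 3c
            X_zeroed = X[test].copy()
            X_zeroed[:, cols] = 0.0
            reduced = explained_variance_score(y[test], model.predict(X_zeroed))
            drops[name].append(full - reduced)
    return {name: float(np.mean(v)) for name, v in drops.items()}  # step 3d
```

Applying `decode_unit` to every unit, network and time window would yield one explained-variance value per task variable in each case.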

Spatial permutation test

To examine the spatial clustering of decoded task information of neuronal ensembles within the RNNs, we constructed a simple spatial permutation test as follows.

(1) Considering a single RNN hidden layer at a particular task time window (note that explained variances change over the course of the task), compute for each unit the relative preference for goal versus choice explained variance. This is calculated as the goal explained variance minus the choice explained variance.
(2) Between all n ‘goal’ units (that is, units with a positive difference in step 1), compute the Euclidean distance weighted by the decoding for goal information. This captures the spatial proximity between goal units weighted by the magnitude of their ‘goal’ information. Average this matrix to compute a summary statistic. This is the observed statistic.
(3) Repeat this procedure 1,000 times, but for a random set of n units taken from the 3D grid space. These 1,000 summary statistics constitute the null distribution.
(4) Compute a permuted P value (Pperm), which is simply the location at which the observed statistic (step 2) sits within the null distribution (step 3), normalized to the range [0, 1]. This value corresponds to how clustered or distributed the observed goal decoding information is in space relative to random chance. A small Pperm means that information is clustered more than chance and vice versa.
(5) Repeat steps 1–4 for all ‘choice’ units (that is, units with a negative difference in step 1).
(6) Repeat steps 1–5 for all desired time windows that have been decoded. In the current work, we calculated Pperm values for time window 3, time window 6 and time window 9 to reflect different aspects of the task over the sequence of the task.

The above steps were done for all functional RNNs (>90% accuracy) for L1 and seRNNs. We presented distributions of these Pperm values for goals and choices to highlight how goal and choices information is clustered, distributed or random at key points in the sequence of the task. To ensure that we did not bias our findings, we further computed a slight variation of the above statistical test, which allows us to assess the clustering of coding information independently (that is, without computing relative goal versus choice coding, as in step 1 above). As cluster size was now not determined by the direction of coding (as it was previously), we instead used the 50 units with the highest variance-explained values for a given variable. This was selected because this approximately mirrors the cluster sizes achieved in the primary functional clustering analysis. Mirroring the permutation testing approach, we calculated Pperm by ranking the mean Euclidean distance between these units (top 50% coding neurons) in a null distribution of Euclidean distance between 1,000 permuted samples of 50 units. This was done for goal and choice options (to assess replication). This test is advantageous in that it allows for testing variables independently, but disadvantageous in that it does not directly incorporate the coding magnitude into the test statistics. These findings are given in Supplementary Fig. 12.
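
The primary permutation test (steps 1–4) can be sketched as below. Several details are assumptions of this sketch: each pairwise distance is weighted by the product of the two units' preference magnitudes, the same weights are reused for the randomly drawn units, and Pperm is taken as the fraction of null statistics that fall below the observed one. The function names are hypothetical.

```python
import numpy as np

def weighted_mean_distance(coords, weights):
    """Mean pairwise Euclidean distance, scaled by the product of unit weights."""
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return (D * np.outer(weights, weights)).mean()

def spatial_permutation_p(coords, preference, n_perm=1000, seed=0):
    """P_perm for units with positive preference (goal minus choice explained variance).

    coords:     (units, 3) positions in the 3D grid.
    preference: (units,) relative preference from step 1.
    """
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(preference > 0)
    weights = preference[idx]
    observed = weighted_mean_distance(coords[idx], weights)      # step 2
    null = np.empty(n_perm)
    for p in range(n_perm):                                      # step 3
        random_idx = rng.choice(len(coords), size=len(idx), replace=False)
        null[p] = weighted_mean_distance(coords[random_idx], weights)
    return float((null < observed).mean())                       # step 4
```

Passing `-preference` gives the corresponding value for the ‘choice’ units (step 5), and calling the function once per decoded time window covers step 6.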

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.