## Abstract

Diversity conveys advantages in nature, yet homogeneous neurons typically comprise the layers of artificial neural networks. Here we construct neural networks from neurons that learn their own activation functions, quickly diversify, and subsequently outperform their homogeneous counterparts on image classification and nonlinear regression tasks. Sub-networks instantiate the neurons, which meta-learn especially efficient sets of nonlinear responses. Examples include conventional neural networks classifying digits and forecasting a van der Pol oscillator, and physics-informed Hamiltonian neural networks learning Hénon–Heiles stellar orbits and the swing of a video-recorded pendulum clock. Such *learned diversity* provides examples of dynamical systems selecting diversity over uniformity and elucidates the role of diversity in natural and artificial systems.

## Introduction

Diversity is a hallmark of many complex systems in physics^{1, 2} and in *physics beyond physics*^{3}, including microscopic cell populations^{4}, marine and terrestrial ecosystems^{5, 6}, financial markets^{7}, and social networks^{8,9,10}. In particular, mammalian brains contain billions of neurons with diverse cell types whose complex dynamical patterns are believed responsible for the rich range of cognition, affect, and behavior^{11,12,13,14}. But despite the widespread appreciation of diversity in neuroscience, researchers have just begun to explore the role of diversity and adaptability in artificial neural networks^{15,16,17}.

Inspired by nature, artificial neural networks are nonlinear systems that can be trained to learn, classify, and predict. Conventional artificial neural networks contain identical neurons in each network layer, even if the neurons vary from layer to layer. But uniform neuronal activation functions can reduce expressiveness and adaptability, limiting the neural network’s capacity to capture the rich diversity of computation and interaction observed in nature. Diversifying the activation functions can overcome such limitations, enabling the networks to be more expressive and better represent the complexity of natural systems. In this article, we propose a novel way to diversify a neural network by learning the neuron types *within* each layer. We flexibly realize the different neurons using sub-networks, or networks-within-the-network, which we train along with the overarching network. This *meta-learning*^{18} generates potent neuron activation function sets, suggestive of orthogonal spanning functions, that increase the expressiveness and accuracy of the network.

After discussing related work and our motivation, we describe how meta-learning diverse activation functions can generate better neural networks, as measured by difficult classification and nonlinear regression tasks. We show that learned diversity can enhance conventional neural networks as well as physics-informed neural networks, so the latter are doubly enhanced. To provide further insight into the advantages of diverse neuronal activations, we employ neuron participation ratios as a metric to elucidate the superior potential of these layers compared to their homogeneous counterparts. Additionally, we study the geometric nature of optimizing minima by examining the spectra of their Hessian matrices, shedding light on the underlying loss landscape of diversified neural networks. Finally, by examining the interplay between stochastic processes and diversified neural networks, we gain valuable insights about how the synergy between the inherent randomness of the optimization procedure and learned diversity results in more generalizable models. We end by discussing future work and the potential for *learned diversity* to enhance artificial neural networks, deep learning, and our appreciation of diversity itself.

## Related work

Researchers have recently begun to relax the rigid rules that have guided the development and use of artificial neural networks. Manessi and Rozza^{19} investigate learning combinations of known neuronal activation functions, and Agostinelli et al.^{20} learn piecewise linear activation functions for each neuron. Apicella et al.^{21} survey trainable activation functions. Lau and Lim^{22} review adaptive activation functions in deep neural networks. Jagtap, Kawaguchi, and Karniadakis^{23} and Haoxiang and Smys^{24} include scalable hyper-parameters in their activation functions to improve their networks, while Qian et al.^{25} linearly, nonlinearly, and hierarchically combine basic activation functions to optimize performance.

More radically, Gjorgjieva, Drion, and Marder^{13} investigate the computational implications of biophysical diversity and multiple timescales in neurons and synapses for circuit performance. Doty et al.^{15} show that *hand-crafted* heterogeneous cell types can improve the performance of deep neural networks. Xie, Liang, and Song^{26} demonstrate that diversity in synaptic weights leads to better generalization in neural networks. Mariet and Sra^{27} sample a diverse subset of neurons and merge them with the remaining ones via a re-weighting procedure. Siouda et al.^{28} use genetic algorithms to optimize the number, forms, and types of hidden neurons. Hospedales et al.^{18} survey the current meta-learning landscape. Lin, Chen, and Yan^{29} suggest nesting neural networks inside neural networks.

Decisively, Beniaguev, Segev, and London^{30} write, “We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already – on its own – deep”, which is what we achieve here with our neuronal sub-networks that meta-learn sets of diverse activation functions that can outperform the corresponding homogeneous neural networks.

## Motivation

Inspired by natural brains, feed-forward neural networks are nested nonlinear functions of linear combinations of activities

\[ \hat{y}(x) = \sigma \left( {\textbf{W}}_n \cdots \sigma \left( {\textbf{W}}_2 \, \sigma \left( {\textbf{W}}_1 x + b_1 \right) + b_2 \right) \cdots + b_n \right), \tag{1} \]

where the activation \(\sigma\) is typically a saturating or rectifying function, and training strengthens or weakens the weights and biases \({\textbf{W}}\) and *b* to minimize an objective function, often called a “cost” or “loss” (from financial optimization).

Motivated by the well-studied mammalian visual cortex, varying neuronal activation functions by layer is common. However, within each layer, the activations are typically identical, as in Fig. 1 (left). Neural networks are universal function approximators^{31, 32} and are often used to model hypersurfaces, either for classification or nonlinear regression. Varying the activations within a layer, as in Fig. 1 (middle), should therefore increase the expressiveness of the network by providing diverse spanning basis functions. Furthermore, replacing the activations by sub-networks, as in Fig. 1 (right), and training them for optimal results should increase the expressiveness even further. The training of the activation sub-networks can be on a different schedule than the training of the network, and the activations so obtained can be extracted from the sub-networks as interpolated functions and efficiently reused in other networks addressing different problems.
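The difference between a homogeneous layer and a layer with varied activations can be illustrated with a minimal sketch. It is not the implementation used in this work: the helper names `dense_layer` and `relu`, the choice of `tanh` and a ramp as the two neuron types, and the weights are all hypothetical and chosen only for illustration.

```python
import math

def relu(z):
    """Ramp (rectifying) activation."""
    return max(0.0, z)

def dense_layer(x, weights, biases, activations):
    """Neuron i outputs activations[i](w_i . x + b_i)."""
    out = []
    for w_row, b, act in zip(weights, biases, activations):
        pre = sum(w * xi for w, xi in zip(w_row, x)) + b
        out.append(act(pre))
    return out

# Hypothetical weights: neurons 0 and 2 (and 1 and 3) share pre-activations,
# so any difference in output comes from the activation alone.
weights = [[0.5, -1.0], [1.5, 0.25], [0.5, -1.0], [1.5, 0.25]]
biases = [0.0, 0.1, 0.0, 0.1]
x = [1.0, 2.0]

# Homogeneous layer: every neuron shares one activation.
homogeneous = [math.tanh] * 4
# Diverse layer: two neuron types distributed equally among the neurons.
mixed = [math.tanh, math.tanh, relu, relu]

h_homog = dense_layer(x, weights, biases, homogeneous)
h_mixed = dense_layer(x, weights, biases, mixed)
```

With identical weights, the mixed layer returns two distinct responses to each pre-activation, which is the sense in which within-layer diversity supplies additional basis functions.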

## Algorithm

To create a learned diversity neural network (LDNN), incorporate sub-networks initialized to simple activations (like identity, ramp, or sigmoid functions). Train the network with many input-output pairs. Quantify the difference between the actual and expected outputs with a loss function \({\mathscr {L}}\). In an inner loop, compute the gradient of the loss function with respect to the network’s weights and biases, and lower the loss by shifting its weights and biases down this gradient. In an outer loop, compute the gradient of the loss function with respect to the *sub-networks’* weights and biases^{33}, and further lower the loss by shifting the sub-networks’ weights and biases down *this* gradient, thereby evolving new activations. Repeat to minimize loss.

In the inner loop, the randomly shuffled inputs are the stochastic driver that buffets the network weights and biases \(\theta\) as they adjust to lower the loss. In the outer loop, the activation sub-network weights and biases \(\theta _s\) open extra dimensions or degrees of freedom to further lower the loss. Figure 2 provides an overview, and Algorithm 1 provides details.
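The two nested loops can be sketched on a toy regression problem. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: a hypothetical three-parameter function `activation` stands in for each activation sub-network, central finite differences (`num_grad`) stand in for automatic differentiation, full-batch gradient descent replaces shuffled mini-batches, and all names, data, and step sizes are invented for the example.

```python
import math

def activation(z, s):
    # Hypothetical 3-parameter stand-in for an activation sub-network:
    # s = [1, 0, 1] is the identity and s = [0, 1, 1] is tanh.
    return s[0] * z + s[1] * math.tanh(s[2] * z)

def predict(x, theta, theta_s):
    # One hidden layer; neuron i has (w, b, v) = theta[3i:3i+3] and uses
    # activation type i % 2, so two types are shared equally among neurons.
    total = 0.0
    for i in range(len(theta) // 3):
        w, b, v = theta[3 * i:3 * i + 3]
        k = i % 2
        total += v * activation(w * x + b, theta_s[3 * k:3 * k + 3])
    return total

def loss(theta, theta_s, data):
    return sum((predict(x, theta, theta_s) - y) ** 2 for x, y in data) / len(data)

def num_grad(f, params, eps=1e-5):
    # Central finite differences (an assumption replacing autodiff).
    grad = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        dn = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

def inner_loop(theta, theta_s, data, steps, lr):
    # Inner loop: shift the network weights and biases down the loss gradient.
    for _ in range(steps):
        g = num_grad(lambda t: loss(t, theta_s, data), theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def meta_train(theta, theta_s, data, outer_steps=10, inner_steps=3,
               lr=0.05, meta_lr=0.05):
    for _ in range(outer_steps):
        # Outer loop: differentiate the loss reached after the inner loop
        # with respect to the activation parameters theta_s.
        def outer(s):
            return loss(inner_loop(theta, s, data, inner_steps, lr), s, data)
        g_s = num_grad(outer, theta_s)
        theta_s = [s - meta_lr * g for s, g in zip(theta_s, g_s)]
        theta = inner_loop(theta, theta_s, data, inner_steps, lr)
    return theta, theta_s

data = [(x / 5.0, math.sin(x / 5.0)) for x in range(-10, 11)]  # toy target
theta0 = [0.5, 0.0, 0.3, -0.4, 0.1, 0.2, 0.3, -0.1, -0.2, 0.6, 0.0, 0.1]
theta_s0 = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]  # type 0: identity, type 1: tanh
theta, theta_s = meta_train(theta0, theta_s0, data)
initial_loss = loss(theta0, theta_s0, data)
final_loss = loss(theta, theta_s, data)
```

Because the outer objective calls `inner_loop`, its gradient accounts for how the activation parameters influence the subsequent weight updates, which is why the inner loop must be retained for the outer computation.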

## Results

### MNIST-1D

Here we implement^{34} learned diversity neural networks with one hidden layer of 100 neurons and a cross-entropy loss function to classify the MNIST-1D data set, a minimalist variation of the classic Modified National Institute of Standards and Technology digits^{35, 36}. Each neuron type in the hidden layer is further instantiated by a feed-forward neural network of 50 hidden units evolved from a base sinusoid. We obtain similar results for different numbers of layers, different numbers of neurons per layer, and different base functions.

Figure 3 summarizes meta-learning the activation functions of neurons in the hidden layer subject to the constraint of having two functions distributed equally among the neuronal population. Figure 3 (left) shows the construction of typical one-dimensional digits. Figure 3 (center) shows the evolution of the two activation functions, with time encoded as rainbow colors from violet to red. Figure 3 (right) shows box plots of validation accuracy for 50 fully connected neural networks composed of entirely \(N_1\) type neurons (yellow), entirely \(N_2\) type neurons (orange), and mixed type with \(N_1\) and \(N_2\) distributed equally across the hidden layer (red). With the same training, the mixed network outperforms either pure network on average. These results are robust with respect to network size, as summarized by Fig. 4.

### van der Pol

We obtain similar results for other tasks, such as nonlinear regression of the van der Pol oscillator^{37}, which includes a linear restoring force and a nonlinear viscosity modeled by the differential equation

\[ \ddot{x} = -x + \mu \left( 1 - x^2 \right) \dot{x}, \tag{2} \]

where the overdots indicate time derivatives. The van der Pol oscillator can model vacuum tubes and heartbeats and was generalized by FitzHugh^{38} and Nagumo^{39} to model spiking neurons. For viscosity parameter \(\mu =2.7\), we trained neural networks to forecast the phase space orbit of the oscillator, as summarized by Fig. 5. On average the learned diversity neural network outperforms either of its pure components as well as a homogeneous network of neurons with sinusoidal activations.
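Training data for such a forecasting task can be generated by integrating the oscillator numerically. The following is a minimal sketch, assuming the standard van der Pol form \(\ddot{x} = -x + \mu (1 - x^2) \dot{x}\) and a fourth-order Runge–Kutta integrator; the step size, initial condition, and function names are hypothetical choices, not those of our experiments.

```python
def van_der_pol(state, mu=2.7):
    """Time derivative of the phase-space state (x, v) with v = dx/dt."""
    x, v = state
    return (v, -x + mu * (1.0 - x * x) * v)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(state, dt=0.01, steps=5000):
    """Return the orbit as a list of (x, v) states, including the start."""
    orbit = [state]
    for _ in range(steps):
        state = rk4_step(van_der_pol, state, dt)
        orbit.append(state)
    return orbit

orbit = simulate((0.1, 0.0))

# Input-output pairs for regression: phase-space state -> its time derivative.
pairs = [(s, van_der_pol(s)) for s in orbit]
```

Starting near the unstable origin, the orbit spirals out to the limit cycle, so the later samples cover the relaxation oscillation that the network is asked to forecast.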

### Hénon–Heiles

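The paradigmatic Hénon–Heiles Hamiltonian^{40}

\[ H = \frac{1}{2} \left( p_x^2 + p_y^2 \right) + \frac{1}{2} \left( x^2 + y^2 \right) + x^2 y - \frac{1}{3} y^3 \tag{3} \]

can model a star moving in a galaxy of other stars according to the Hamiltonian flow

\[ \dot{q} = +\frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial q}, \tag{4} \]

where \(q = \{x,y\}\) and \(p = \{p_x, p_y\}\). Bounded motion is possible in a triangular region of position space. As orbital energy increases, circular symmetry degenerates to triangular symmetry, and integrable motion complexifies to chaotic motion.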
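A short numerical sketch makes the flow concrete. It assumes the standard Hénon–Heiles Hamiltonian \(H = (p_x^2 + p_y^2)/2 + (x^2 + y^2)/2 + x^2 y - y^3/3\) with unit coupling; the integrator, step size, and initial condition are hypothetical choices for illustration.

```python
def hamiltonian(x, y, px, py):
    """Henon-Heiles energy (standard form with unit coupling, assumed)."""
    return 0.5 * (px * px + py * py) + 0.5 * (x * x + y * y) + x * x * y - y ** 3 / 3.0

def flow(state):
    """Hamilton's equations: dq/dt = dH/dp and dp/dt = -dH/dq."""
    x, y, px, py = state
    return (px, py, -x - 2.0 * x * y, -y - x * x + y * y)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

# Hypothetical low-energy initial condition, well inside the bounded region.
state = (0.0, 0.1, 0.3, 0.0)
energy_start = hamiltonian(*state)
for _ in range(2000):
    state = rk4_step(flow, state, 0.01)
energy_drift = hamiltonian(*state) - energy_start

# Energy of the saddle point at (x, y) = (0, 1), above which orbits can escape.
escape_energy = hamiltonian(0.0, 1.0, 0.0, 0.0)
```

The energy is conserved along the flow up to integration error, and the saddle energy of \(1/6\) bounds the triangular region of bounded motion.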

Consequently, for this example, we meta-learn activation functions for both a conventional and a Hamiltonian neural network^{41,42,43,44,45,46}. Unlike conventional neural networks, which learn dynamical systems by taking in position and velocity and outputting their derivatives, a Hamiltonian neural network learns a dynamical system by taking in position and momentum and outputting a single energy-like variable, which it differentiates according to Hamilton’s recipe. Rather than learning the derivatives, it learns the Hamiltonian function, which is the *generator* of derivatives. This more powerful and efficient strategy is an excellent example of physics-informed machine learning.

More specifically, during training a conventional neural network (NN) maps positions and velocities \(\{q_t, \dot{q}_t\}\) to approximations of their time derivatives, and adjusts its internal parameters to minimize the mean-square-error or loss

\[ {\mathscr {L}}_{NN} = \left\| \dot{q}_t - \hat{\dot{q}}_t \right\|^2 + \left\| \ddot{q}_t - \hat{\ddot{q}}_t \right\|^2, \tag{5} \]

where hats denote the network outputs. The trained network can extrapolate a given initial condition via the Euler update \(\{q, \dot{q}\} \leftarrow \{q, \dot{q}\} + \{\dot{q}, \ddot{q}\} dt\). By contrast, during training a Hamiltonian neural network (HNN) maps positions and momenta \(\{q_t, p_t\}\) to the scalar Hamiltonian function *H*, uses reverse-mode automatic differentiation to find the Hamiltonian’s gradients, uses the gradients to approximate the position and momentum change rates, and adjusts its internal parameters to minimize the loss

\[ {\mathscr {L}}_{HNN} = \left\| \dot{q}_t - \frac{\partial H}{\partial p_t} \right\|^2 + \left\| \dot{p}_t + \frac{\partial H}{\partial q_t} \right\|^2 \tag{6} \]

and enforce Hamilton’s motion equations. The trained network can extrapolate a given initial condition via the Euler update \(\{q, p\} \leftarrow \{q, p\} + \{\dot{q}, \dot{p}\} dt\).
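The Hamiltonian recipe can be sketched for one degree of freedom. This is an illustration under stated assumptions rather than our network code: a plain Python function stands in for the network's scalar output *H*, central finite differences replace reverse-mode automatic differentiation, and the names `hamilton_rates`, `hnn_loss`, and `euler_forecast` are hypothetical.

```python
def hamilton_rates(H, q, p, eps=1e-6):
    """Approximate (dq/dt, dp/dt) = (dH/dp, -dH/dq) by central differences."""
    dH_dq = (H(q + eps, p) - H(q - eps, p)) / (2.0 * eps)
    dH_dp = (H(q, p + eps) - H(q, p - eps)) / (2.0 * eps)
    return dH_dp, -dH_dq

def hnn_loss(H, samples):
    """Mean squared mismatch between observed rates and those implied by H."""
    total = 0.0
    for q, p, q_dot, p_dot in samples:
        q_hat, p_hat = hamilton_rates(H, q, p)
        total += (q_dot - q_hat) ** 2 + (p_dot - p_hat) ** 2
    return total / len(samples)

def euler_forecast(H, q, p, dt, steps):
    """Extrapolate an initial condition with the Euler update on {q, p}."""
    path = [(q, p)]
    for _ in range(steps):
        q_dot, p_dot = hamilton_rates(H, q, p)
        q, p = q + q_dot * dt, p + p_dot * dt
        path.append((q, p))
    return path

def true_H(q, p):
    """Harmonic oscillator energy, the ground truth for the toy data."""
    return 0.5 * (q * q + p * p)

def wrong_H(q, p):
    """A deliberately mis-specified candidate energy function."""
    return 0.5 * q * q + p * p

# Samples (q, p, dq/dt, dp/dt) from the harmonic oscillator dq/dt = p, dp/dt = -q.
samples = [(1.0, 0.0, 0.0, -1.0), (0.0, 1.0, 1.0, 0.0), (0.6, 0.8, 0.8, -0.6)]
loss_true = hnn_loss(true_H, samples)
loss_wrong = hnn_loss(wrong_H, samples)
path = euler_forecast(true_H, 1.0, 0.0, 0.01, 100)
```

The loss vanishes only when the candidate function generates the observed rates, which is how minimizing it steers the learned scalar toward the true Hamiltonian (up to an additive constant).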

As summarized by Fig. 6, the mix of two neuron types outperforms any single neuron type on average for both conventional and Hamiltonian neural networks, but the Hamiltonian neural network is much better, and its mixed version is doubly enhanced. (The spread in the Hamiltonian validation losses is much smaller than the spread in the conventional validation losses, possibly because enforcing symplectic structure on the loss manifold for the Hamiltonian neural network is a regularization that facilitates more consistent optimization, while the unbounded loss of the conventional neural network suffers greater variance due to the wide range of stable and chaotic trajectories.)

### Pendulum clock from video

As a final real-world example, we video-recorded a wall-hanging pendulum clock, tracked the ends of its compound pendulum, and extracted its angles and angular velocities at equally spaced times^{46}. Engineered to be nearly Hamiltonian, the pendulum’s Graham escapement periodically interrupts the fall of its weight as gravity compensates for dissipation. We trained Hamiltonian neural networks to forecast its phase space orbit, as summarized by Fig. 7. Once again, meta-learning proves advantageous.

## Analysis

To understand how mixed activation functions outperform homogeneous neuronal populations, we estimate the change in the dimensionality of the network activations. Start by constructing a neuronal activity data matrix **X** with *N* rows corresponding to *N* neurons in the hidden layer and *M* columns representing inputs. Each matrix element \(\textbf{X}_{ij}\) represents the activity of the \(i^{th}\) neuron at the \(j^{th}\) input. Center the activity so \(\langle \textbf{X}\rangle = 0\). Construct the neural co-variance matrix \(\textbf{C} = M^{-1}{} \textbf{XX}^{T}\), which indicates how pairs of neurons vary with respect to each other, and compute the participation ratio

\[ {\mathscr {R}} = \frac{\left( \sum _{n=1}^{N} \lambda _n \right) ^2}{\sum _{n=1}^{N} \lambda _n^2} = \frac{\left( {{\,\textrm{Tr}\,}}\textbf{C} \right) ^2}{{{\,\textrm{Tr}\,}}\textbf{C}^2}, \tag{7} \]

where \(\lambda _n\) are the co-variance matrix eigenvalues. If all the variance is in one dimension, say \(\lambda _n = \delta _{n1}\), then \({\mathscr {R}} = 1\); if the variance is evenly distributed across all dimensions, so \(\lambda _n = \lambda _1\), then \({\mathscr {R}} = N\). Typically, \(1< {\mathscr {R}} < N\), and \({\mathscr {R}}\) corresponds to the number of dimensions needed to explain most of the variance^{47}. The normalized participation ratio is \(r = {\mathscr {R}} / N\).
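The two limiting cases can be checked with a minimal sketch. It assumes the usual definition of the participation ratio, \({\mathscr {R}} = (\sum _n \lambda _n)^2 / \sum _n \lambda _n^2\), evaluated through the trace identities \(\sum _n \lambda _n = {{\,\textrm{Tr}\,}}\textbf{C}\) and \(\sum _n \lambda _n^2 = {{\,\textrm{Tr}\,}}\textbf{C}^2\) so that no eigensolver is needed. It also assumes that centering means subtracting each neuron's mean over inputs; the function name and toy matrices are hypothetical.

```python
def participation_ratio(X):
    """X holds N rows (neurons), each a list of M activities (inputs)."""
    N, M = len(X), len(X[0])
    # Center each neuron's activity across inputs (assumed convention).
    centered = []
    for row in X:
        mean = sum(row) / M
        centered.append([value - mean for value in row])
    # Covariance matrix C = X X^T / M.
    C = [
        [sum(a * b for a, b in zip(ri, rj)) / M for rj in centered]
        for ri in centered
    ]
    trace_C = sum(C[i][i] for i in range(N))
    # Tr(C^2) equals the sum of squared eigenvalues for symmetric C.
    trace_C2 = sum(C[i][j] * C[j][i] for i in range(N) for j in range(N))
    return trace_C ** 2 / trace_C2

# All neurons follow one signal: the variance lies in a single dimension.
redundant = [[1, -1, 1, -1], [2, -2, 2, -2], [3, -3, 3, -3]]
# Mutually uncorrelated neurons with equal variance.
independent = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]

R_redundant = participation_ratio(redundant)
R_independent = participation_ratio(independent)
r_independent = R_independent / len(independent)  # normalized ratio
```

The redundant population gives \({\mathscr {R}} = 1\) and the independent population gives \({\mathscr {R}} = N = 3\), matching the two limits stated above.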

Figure 8 plots the joint probability densities \(\rho (A,r)\) for multiple realizations of the Fig. 3 MNIST-1D learned diversity neural network and homogeneous competitors. The mix of two neuron types has the best mean accuracy *A* and normalized participation ratio *r*, suggesting that more of its neurons are participating when the mix achieves the best MNIST-1D classification. In contrast, homogeneous networks of neurons with popular activation functions have lower accuracy and participation ratios, reflecting their poorer effectiveness.

To understand the impact of learned diversity on the geometric nature of loss-function minima, we compute the spectrum of the Hessian matrix \({\textbf{H}}=\nabla ^{2}{\mathscr {L}}\), which captures the curvature of the loss function. Since \({\textbf{H}}\) is a symmetric matrix, all its eigenvalues are real. A purely convex loss function would have a positive semi-definite Hessian everywhere. However, in practice, the loss function is almost always non-convex (with multiple local minima) due to the presence of hidden neuron permutation symmetries^{48}. Therefore, understanding how diversity helps training find deeper minima is crucial.

Previous work suggests that flatter minima generalize better to unseen data^{49, 50}. For the Fig. 3 neural network meta-learning two neuronal activation functions, we find that once training has converged, the resulting minimum from diverse neurons is flatter than the minima from homogeneous ones, as measured by both the trace \({{\,\textrm{Tr}\,}}{\textbf{H}}\) of the Hessian and the fraction *f* of its eigenvalues near zero: \({{\,\textrm{Tr}\,}}{\textbf{H}}_1> {{\,\textrm{Tr}\,}}{\textbf{H}}_2 > {{\,\textrm{Tr}\,}}{\textbf{H}}_{12}\) and \(f_1< f_2 < f_{12}\). If steep minima are harder for gradient descent to locate, then the flatter minima engineered and discovered by learned diversity neural networks imply enhanced optimization.

Stochastic processes can provide additional insights. Optimizing a neural network by randomly shuffling training data is like a noisy descent to a minimum in a potential landscape, as in Fig. 9. The landscape is the network’s cost or loss as a function of its weights and biases, and its shape depends on the neuron activation functions. The effective dynamics is that of an overdamped particle buffeted by noise sliding on a complicated potential with many local minima. The Langevin equation

\[ d\theta _t = -\nabla {\mathscr {L}}(\theta _t) \, dt + \sqrt{2 {\textbf{D}}(\theta _t)} \, d{\mathscr {W}}_t \tag{8} \]

with noise intensity \({\textbf{D}} = (\eta / B) {\mathscr {L}}(\theta ) {\textbf{H}}(\theta ^*)\) describes the evolution of the weights and biases \(\theta =\{W_{ij},b_i\}\) in a valley with local minimum \(\theta ^{*}\), where \(\eta\) is the learning rate and *B* is the training batch size^{51,52,53,54}. The drift term with *dt* includes minus the gradient of the loss function \({\mathscr {L}}\), and the Brownian motion noise term with \(d{\mathscr {W}}_t\) includes the learning rate \(\eta\). The noise aligns with the Hessian near a minimum, and the Eq. 8 Hessian dependence ensures that stochastic gradient descent escapes multiple sharp minima via directions corresponding to large Hessian eigenvalues and eventually converges to a flatter minimum.
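The picture of a noisy overdamped descent can be illustrated with an Euler–Maruyama simulation. This is a toy sketch under explicit assumptions: a one-dimensional double-well loss replaces the network loss, the noise intensity *D* is a constant scalar rather than the state-dependent matrix described above, and the convention \(d\theta = -\nabla {\mathscr {L}} \, dt + \sqrt{2D} \, d{\mathscr {W}}\) is assumed. The function names and parameter values are hypothetical.

```python
import math
import random

def grad_loss(theta):
    """Gradient of the toy double-well loss L = (theta^2 - 1)^2 / 4."""
    return theta ** 3 - theta

def langevin_descent(theta, D, dt=0.01, steps=2000, seed=0):
    """Euler-Maruyama integration of overdamped Langevin dynamics."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Brownian increment with variance 2 * D * dt (constant D assumed).
        noise = math.sqrt(2.0 * D * dt) * rng.gauss(0.0, 1.0)
        theta += -grad_loss(theta) * dt + noise
    return theta

# Without noise the particle slides deterministically into the nearest minimum.
theta_quiet = langevin_descent(0.5, D=0.0)
# With noise it is buffeted around the minima and can hop between valleys.
theta_noisy = langevin_descent(0.5, D=0.05)
```

Making *D* depend on the local curvature, as in the Hessian-aligned noise discussed above, would strengthen the buffeting in sharp valleys relative to flat ones; the constant-*D* sketch omits that effect.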

## Conclusions

Biomimetic engineering or biomimicry is design inspired by nature. Just as monoculture crops can be fragile, while diverse crops can be robust^{55}, heterogeneous neural networks can outperform homogeneous ones. Here, we highlight advantages of varying activation functions *within* each layer and learning the best variation by replacing activations by sub-networks.

Conceptually, learned diversity neural networks discover novel *sets* of activation functions, when most artificial neural networks use just one of a small number of conventional activations per layer. Practically, mixes of learned activations can outperform traditional activations – where even a \(1\%\) improvement can be significant – and the learned activations can be efficiently reused in diverse neural networks. Additionally, learned diversity can even improve already enhanced physics-informed neural networks like Hamiltonian neural networks^{43, 56}. Future work includes optimizing learned diversity by adjusting hyperparameters, applying learned diversity to a wider range of regression and classification problems, testing the diverse neural networks for robustness^{57}, investigating clustering of learned activations, and applying learned diversity to different neural network architectures, such as recurrent neural networks and reservoir computers^{58,59,60}.

Learned diversity offers neural networks sets of tailored basis functions, which enhance their expressiveness and adaptability and facilitate efficient function approximation. *When given the ability to learn their neuronal activation functions, neural networks discover heterogeneous arrangements of nonlinear neuronal activations that can outperform their homogeneous counterparts with the same training.* Our work provides specific examples of dynamical systems that spontaneously select diversity over uniformity, and thereby furthers our understanding of diversity and its role in strengthening natural and artificial systems.

## Methods

We implement our neural networks in the Python programming language using the PyTorch open source machine learning library. We also implement them in the Python library JAX^{61} using the JAX library Equinox^{62}. The code for the analysis and the network implementation can be found at our GitHub repository^{34}.

The number of training pairs is of order \(10^4\), and the number of training epochs is of order 10. Due to computational constraints, the number of inner iterations is much smaller than the number of outer iterations. Indeed, the learner-meta-learner structure of the meta-learning algorithm incurs significant computational costs, with a time complexity of \(O(N_O N_I |X|)\) for \(N_O\) outer iterations, \(N_I\) inner iterations, and \(|X|\) training pairs. The current implementation of the algorithm is constrained by the number of inner loops within the outer loops, since the inner loop is held in memory for the outer loop computation (such as the Algorithm 1 gradients \(\nabla _{\theta _{s}} {\mathscr {L}}_{t}\)) and optimization. In fact, this is one of the fundamental challenges of gradient-based meta-learning algorithms that currently limits the horizon of meta-optimization^{63}. However, the inefficiency of the algorithm plausibly results from activation meta-learning being under-explored and ripe for improvement.

We use the PyHessian library to compute Hessian-based statistics without the cost of generating the full Hessian matrix. The trace of the Hessian matrix is computed using Hutchinson’s method, exploiting the symmetric nature of the matrix^{64}. The empirical spectral density (ESD) of Hessian eigenvalues is computed through stochastic Lanczos quadrature (SLQ)^{65} within several successive approximation schemes. Details can be found in Yao et al.^{66}. At an implementation level, a classifier or forecaster using the learned activation(s) is trained in PyTorch and the model is saved. Using this saved model and test data, PyHessian can use PyTorch’s backward graph to compute the gradients needed to build the Hessian trace and ESD.
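The idea behind Hutchinson's method can be sketched without any library. The example below does not use or reproduce the PyHessian API: it assumes a toy quadratic loss with a known Hessian, replaces automatic differentiation with central finite differences of the gradient for the Hessian-vector product, and uses hypothetical function names.

```python
import random

# Toy quadratic loss L = theta^T A theta / 2, whose Hessian is A (trace 10).
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 5.0]]

def grad(theta):
    """Gradient of the toy loss, A @ theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """H v from a central difference of gradients; the full H is never built."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]

def hutchinson_trace(grad_fn, theta, samples=2000, seed=0):
    """Estimate Tr H as the average of v^T H v over random sign vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        v = [rng.choice((-1.0, 1.0)) for _ in theta]
        Hv = hessian_vector_product(grad_fn, theta, v)
        total += sum(vi * hi for vi, hi in zip(v, Hv))
    return total / samples

trace_estimate = hutchinson_trace(grad, [0.3, -0.2, 0.1])
trace_exact = sum(A[i][i] for i in range(3))
```

Each sample costs only one Hessian-vector product, so the estimate scales to networks whose full Hessian would be too large to store.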

The activation function is captured after meta-learning as the output of the learned activation networks on the interval \([-10,10]\) with 100 linearly spaced points. This output is then linearly interpolated between points and used as the activation function for the classifier at validation. Quadratic or cubic splines or symbolic regression can also be used. We need high order (\(>10\)) polynomials to fit the activation curves accurately so, while possible, we do not recommend polynomials as a reliable way to capture the features of the learned activation functions.
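The capture step can be sketched as follows. This is an illustration under assumptions: `math.tanh` stands in for a learned activation sub-network, inputs outside \([-10,10]\) are clamped to the end values (a choice not specified above), and the helper names are hypothetical.

```python
import bisect
import math

def sample_activation(f, lo=-10.0, hi=10.0, n=100):
    """Evaluate f at n linearly spaced points on [lo, hi]."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [f(x) for x in xs]

def interpolated(xs, ys):
    """Return a piecewise-linear activation through the sampled points."""
    def act(x):
        # Clamp outside the sampled interval (assumed behavior).
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, x) - 1
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])
    return act

# math.tanh is a stand-in for the output of a learned activation network.
xs, ys = sample_activation(math.tanh)
captured = interpolated(xs, ys)
error_at_probe = abs(captured(0.123) - math.tanh(0.123))
```

The captured function no longer needs the sub-network, so it can be reused as a fixed activation in other networks at the cost of a table lookup and one linear blend per evaluation.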

## Code availability

Our code is available at https://github.com/nonlinearartificialintelligencelab/diversityNN.

## References

1. Anderson, P. W. More is different. *Science* **177**, 393–396 (1972).
2. Bak, P., Tang, C. & Wiesenfeld, K. Self-organized criticality: An explanation of the \(1/f\) noise. *Phys. Rev. Lett.* **59**, 381–384 (1987).
3. Holovatch, Y., Kenna, R. & Thurner, S. Complex systems: Physics beyond physics. *Eur. J. Phys.* **38**, 023002 (2017).
4. Wichterle, H., Gifford, D. & Mazzoni, E. Mapping neuronal diversity one cell at a time. *Science* **341**, 726–727 (2013).
5. Tilman, D., Lehman, C. L. & Thomson, K. T. Plant diversity and ecosystem productivity: Theoretical considerations. *Proc. Natl. Acad. Sci.* **94**, 1857–1861 (1997).
6. Choudhary, A. *et al.* Weak-winner phase synchronization: A curious case of weak interactions. *Phys. Rev. Res.* **3**, 023144 (2021).
7. May, R., Levin, S. & Sugihara, G. Ecology for bankers. *Nature* **451**, 893 (2008).
8. Page, S. E. *Diversity and Complexity*, vol. 2 (Princeton University Press, 2010).
9. May, R. M. *Stability and complexity in model ecosystems* (Princeton University Press, 2019).
10. Sinha, S. & Sinha, S. Evidence of universality for the May–Wigner stability theorem for random networks with local dynamics. *Phys. Rev. E* **71**, 020902(R) (2005).
11. Marcus, G., Marblestone, A. & Dean, T. The atoms of neural computation. *Science* **346**, 551–552 (2014).
12. Thivierge, J.-P. Neural diversity creates a rich repertoire of brain activity. *Commun. Integr. Biol.* **1**, 188–189 (2008).
13. Gjorgjieva, J., Drion, G. & Marder, E. Computational implications of biophysical diversity and multiple timescales in neurons and synapses for circuit performance. *Curr. Opin. Neurobiol.* **37**, 44–52 (2016).
14. Tripathy, S. J., Padmanabhan, K., Gerkin, R. C. & Urban, N. N. Intermediate intrinsic diversity enhances neural population coding. *Proc. Natl. Acad. Sci.* **110**, 8248–8253 (2013).
15. Doty, B., Mihalas, S., Arkhipov, A. & Piet, A. Heterogeneous ‘cell types’ can improve performance of deep neural networks. bioRxiv. https://doi.org/10.1101/2021.06.21.449346 (2021).
16. Perez-Nieves, N., Leung, V. C. H., Dragotti, P. L. & Goodman, D. F. M. Neural heterogeneity promotes robust learning. *Nat. Commun.* **12**, 5791 (2021).
17. Han, C.-D., Glaz, B., Haile, M. & Lai, Y.-C. Adaptable Hamiltonian neural networks. *Phys. Rev. Res.* **3**, 023156 (2021).
18. Hospedales, T., Antoniou, A., Micaelli, P. & Storkey, A. Meta-learning in neural networks: A survey. arXiv:2004.05439 (2020).
19. Manessi, F. & Rozza, A. Learning combinations of activation functions. In *2018 24th International Conference on Pattern Recognition (ICPR)*, 61–66 (IEEE, 2018).
20. Agostinelli, F., Hoffman, M., Sadowski, P. & Baldi, P. Learning activation functions to improve deep neural networks. arXiv:1412.6830 (2014).
21. Apicella, A., Donnarumma, F., Isgrò, F. & Prevete, R. A survey on modern trainable activation functions. *Neural Netw.* **138**, 14–32 (2020).
22. Lau, M. M. & Hann Lim, K. Review of adaptive activation function in deep neural network. In *2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES)*, 686–690 (Sarawak, Malaysia, 2018). https://doi.org/10.1109/IECBES.2018.8626714.
23. Jagtap, A. D., Kawaguchi, K. & Karniadakis, G. E. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. *J. Comput. Phys.* **404**, 109136. https://doi.org/10.1016/j.jcp.2019.109136 (2020).
24. Haoxiang, D. W. & Smys, D. S. Overview of configuring adaptive activation functions for deep neural networks—A comparative study. *J. Ubiq. Comput. Commun. Technol.* **3**(1), 10–22. https://doi.org/10.36548/jucct.2021.1.002 (2021).
25. Qian, S., Liu, H., Liu, C., Wu, S. & Wong, H. S. Adaptive activation functions in convolutional neural networks. *Neurocomputing* **272**, 204–212. https://doi.org/10.1016/j.neucom.2017.06.070 (2018).
26. Xie, B., Liang, Y. & Song, L. Diversity leads to generalization in neural networks. arXiv:1611.03131 (2016).
27. Mariet, Z. & Sra, S. Diversity networks: Neural network compression using determinantal point processes. arXiv:1511.05077 (2015).
28. Siouda, R., Nemissi, M. & Seridi, H. Diverse activation functions based-hybrid RBF-ELM neural network for medical classification. *Evolutionary Intelligence* (2022).
29. Lin, M., Chen, Q. & Yan, S. Network in network. arXiv:1312.4400 (2014).
30. Beniaguev, D., Segev, I. & London, M. Single cortical neurons as deep artificial neural networks. *Neuron* **109**, 2727–2739 (2021).
31. Cybenko, G. Approximation by superpositions of a sigmoidal function. *Math. Control Signals Syst. (MCSS)* **2**, 303–314 (1989).
32. Hornik, K. Approximation capabilities of multilayer feedforward networks. *Neural Netw.* **4**, 251–257 (1991).
33. Maclaurin, D., Duvenaud, D. & Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv:1502.03492 (2015).
34. Our code is available at https://github.com/nonlinearartificialintelligencelab/diversityNN
35. Deng, L. The MNIST database of handwritten digit images for machine learning research. *IEEE Signal Process. Mag.* **29**, 141–142 (2012).
36. Greydanus, S. Scaling down deep learning. arXiv:2011.14439 (2020).
37. van der Pol, B. LXXXVIII. On “relaxation-oscillations”. *London Edinb. Dublin Philos. Magaz. J. Sci.* **2**, 978–992 (1926).
38. Fitzhugh, R. Impulses and physiological states in theoretical models of nerve membrane. *Biophys. J.* **1**, 445–466 (1961).
39. Nagumo, J., Arimoto, S. & Yoshizawa, S. An active pulse transmission line simulating nerve axon. *Proc. IRE* **50**, 2061–2070 (1962).
40. Hénon, M. & Heiles, C. The applicability of the third integral of motion: Some numerical experiments. *Astron. J.* **69**, 73. https://doi.org/10.1086/109234 (1964).
41. Greydanus, S., Dzamba, M. & Yosinski, J. Hamiltonian neural networks. arXiv:1906.01563 (2019).
42. Toth, P. *et al.* Hamiltonian generative networks. arXiv:1909.13789 (2019).
43. Choudhary, A. *et al.* Physics-enhanced neural networks learn order and chaos. *Phys. Rev. E* **101**, 062207 (2020).
44. Miller, S. T., Lindner, J. F., Choudhary, A., Sinha, S. & Ditto, W. L. Mastering high-dimensional dynamics with Hamiltonian neural networks. *Chaos, Solitons Fract. X* **5**, 100046 (2020).
45. Miller, S. T., Lindner, J. F., Choudhary, A., Sinha, S. & Ditto, W. L. Negotiating the separatrix with machine learning. *Nonlinear Theory Appl IEICE* **12**, 134–142. https://doi.org/10.1587/nolta.12.134 (2021).
46. Choudhary, A. *et al.* Forecasting Hamiltonian dynamics without canonical coordinates. *Nonlinear Dyn.* **103**, 1553–1562 (2021).
47. Gao, P. *et al.* A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv. https://doi.org/10.1101/214262 (2017).
48. Simsek, B. *et al.* Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In Meila, M. & Zhang, T. (eds.) *Proceedings of the 38th International Conference on Machine Learning, ICML*, vol. 139, 9722–9732 (2021).
49. Ghorbani, B., Krishnan, S. & Xiao, Y. An investigation into neural net optimization via hessian eigenvalue density. arXiv:1901.10159 (2019).
50. Sankar, A. R., Khasbage, Y., Vigneswaran, R. & Balasubramanian, V. N. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. arXiv:2012.03801 (2020).
51. Mori, T., Ziyin, L., Liu, K. & Ueda, M. Logarithmic landscape and power-law escape rate of SGD. arXiv:2105.09557 (2021).
52. Mandt, S., Hoffman, M. D. & Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. *J. Mach. Learn. Res.* **18**, 1–35 (2017).
53. Sirignano, J. & Spiliopoulos, K. Stochastic gradient descent in continuous time: A central limit theorem. *Stoch. Syst.* **10**, 124–151 (2020).
54. Chaudhari, P. *et al.* Entropy-SGD: Biasing gradient descent into wide valleys. *J. Stat. Mech: Theory Exp.* **2019**, 124018 (2019).
55. Wetzel, W. C., Kharouba, H. M., Robinson, M., Holyoak, M. & Karban, R. Variability in plant nutrients reduces insect herbivore performance. *Nature* **539**, 425–427 (2016).
56. Wu, T. & Tegmark, M. Toward an artificial intelligence physicist for unsupervised learning. *Phys. Rev. E* **100**, 033311 (2019).
57. Cheney, N., Schrimpf, M. & Kreiman, G. On the robustness of convolutional neural networks to internal architecture and weight perturbations. arXiv:1703.08245 (2017).
58. Pathak, J., Hunt, B., Girvan, M., Lu, Z. & Ott, E. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. *Phys. Rev. Lett.* **120**, 024102 (2018).
59. Rafayelyan, M., Dong, J., Tan, Y., Krzakala, F. & Gigan, S. Large-scale optical reservoir computing for spatiotemporal chaotic systems prediction. *Phys. Rev. X* **10**, 041037 (2020).
60. Govia, L., Ribeill, G., Rowlands, G., Krovi, H. & Ohki, T. Quantum reservoir computing with a single nonlinear oscillator. *Phys. Rev. Res.* **3**, 013077 (2021).
61. Bradbury, J. *et al.* JAX: composable transformations of Python+NumPy programs (2018).
62. Kidger, P. & Garcia, C. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. *Differentiable Programming workshop at Neural Information Processing Systems 2021* (2021).
63. Wu, Y., Ren, M., Liao, R. & Grosse, R. Understanding short-horizon bias in stochastic meta-optimization. arXiv:1803.02021 (2018).
64. Avron, H. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. *J. ACM* **58**, 8:1–8:34 (2011).
65. Ubaru, S., Chen, J. & Saad, Y. Fast estimation of \(\text{tr}\,(f({A}))\) via stochastic Lanczos quadrature. *SIAM J. Matrix Anal. Appl.* **38**, 1075–1099 (2017).
66. Yao, Z., Gholami, A., Keutzer, K. & Mahoney, M. PyHessian: Neural networks through the lens of the hessian. arXiv:1912.07145 (2020).

## Acknowledgements

This research was supported by O.N.R. Grant N00014-16-1-3066 and a gift from United Therapeutics. S.S. acknowledges support from the J.C. Bose National Fellowship (Grant No. JBR/2020/000004). W.L.D. thanks Kathleen Russell for the conceptualization of the original idea along with many subsequent discussions.

## Author information

### Contributions

A.C. designed and implemented our meta-learning code. A.R. trained and analyzed our neural networks and created our GitHub repository. J.F.L. led the writing and finalized the figures. S.S. elucidated the diversity mechanism. W.L.D. motivated and guided the research. All authors contributed to the final manuscript.

## Ethics declarations

### Competing Interests

The authors declare no competing interests.

## Additional information

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Choudhary, A., Radhakrishnan, A., Lindner, J.F. *et al.* Neuronal diversity can improve machine learning for physics and beyond.
*Sci Rep* **13**, 13962 (2023). https://doi.org/10.1038/s41598-023-40766-6

DOI: https://doi.org/10.1038/s41598-023-40766-6
