Neuronal diversity can improve machine learning for physics and beyond

Choudhary, Anshul; Radhakrishnan, Anil; Lindner, John F.; Sinha, Sudeshna; Ditto, William L.

doi:10.1038/s41598-023-40766-6

Published: 26 August 2023

Scientific Reports volume 13, Article number: 13962 (2023)

Abstract

Diversity conveys advantages in nature, yet homogeneous neurons typically comprise the layers of artificial neural networks. Here we construct neural networks from neurons that learn their own activation functions, quickly diversify, and subsequently outperform their homogeneous counterparts on image classification and nonlinear regression tasks. Sub-networks instantiate the neurons, which meta-learn especially efficient sets of nonlinear responses. Examples include conventional neural networks classifying digits and forecasting a van der Pol oscillator and physics-informed Hamiltonian neural networks learning Hénon–Heiles stellar orbits and the swing of a video recorded pendulum clock. Such learned diversity provides examples of dynamical systems selecting diversity over uniformity and elucidates the role of diversity in natural and artificial systems.

Introduction

Diversity is a hallmark of many complex systems in physics^{1, 2} and in physics beyond physics³, including microscopic cell populations⁴, marine and terrestrial ecosystems^{5, 6}, financial markets⁷, and social networks^{8, 9, 10}. In particular, mammalian brains contain billions of neurons with diverse cell types whose complex dynamical patterns are believed responsible for the rich range of cognition, affect, and behavior^{11, 12, 13, 14}. But despite the widespread appreciation of diversity in neuroscience, researchers have just begun to explore the role of diversity and adaptability in artificial neural networks^{15, 16, 17}.

Inspired by nature, artificial neural networks are nonlinear systems that can be trained to learn, classify, and predict. Conventional artificial neural networks contain identical neurons in each network layer, even if the neurons vary from layer to layer. But uniform neuronal activation functions can reduce expressiveness and adaptability, limiting the neural network’s capacity to capture the rich diversity of computation and interaction observed in nature. Diversifying the activation functions can overcome such limitations, enabling the networks to be more expressive and better represent the complexity of natural systems. In this article, we propose a novel way to diversify a neural network by learning the neuron types within each layer. We flexibly realize the different neurons using sub-networks, or networks-within-the-network, which we train along with the overarching network. This meta-learning¹⁸ generates potent neuron activation function sets, suggestive of orthogonal spanning functions, that increase the expressiveness and accuracy of the network.

After discussing related work and our motivation, we describe how meta-learning diverse activation functions can generate better neural networks, as measured by difficult classification and nonlinear regression tasks. We show that learned diversity can enhance conventional neural networks as well as physics-informed neural networks, so the latter are doubly enhanced. To provide further insight into the advantages of diverse neuronal activations, we employ neuron participation ratios as a metric to elucidate the superior potential of these layers compared to their homogeneous counterparts. Additionally, we study the geometric nature of optimizing minima by examining the spectra of their Hessian matrices, shedding light on the underlying loss landscape of diversified neural networks. Finally, by examining the interplay between stochastic processes and diversified neural networks, we gain valuable insights about how the synergy between the inherent randomness of the optimization procedure and learned diversity results in more generalizable models. We end by discussing future work and the potential for learned diversity to enhance artificial neural networks, deep learning, and our appreciation of diversity itself.

Related work

Researchers have recently begun to relax the rigid rules that have guided the development and use of artificial neural networks. Manessi and Rozza¹⁹ investigate learning combinations of known neuronal activation functions, and Agostinelli et al.²⁰ learn piecewise linear activation functions for each neuron. Apicella et al.²¹ survey trainable activation functions. Lau and Lim²² review adaptive activation functions in deep neural networks. Jagtap, Kawaguchi, and Karniadakis²³ and Haoxiang and Smys²⁴ include scalable hyper-parameters in their activation functions to improve their networks, while Qian et al.²⁵ linearly, nonlinearly, and hierarchically combine basic activation functions to optimize performance.

More radically, Gjorgjieva, Drion, and Marder¹³ investigate the computational implications of biophysical diversity and multiple timescales in neurons and synapses for circuit performance. Doty et al.¹⁵ show that hand-crafted heterogeneous cell types can improve the performance of deep neural networks. Xie, Liang, and Song²⁶ demonstrate that diversity in synaptic weights leads to better generalization in neural networks. Mariet and Sra²⁷ sample a diverse subset of neurons and merge them with the remaining ones via a re-weighting procedure. Siouda et al.²⁸ use genetic algorithms to optimize the number, forms, and types of hidden neurons. Hospedales et al.¹⁸ survey the current meta-learning landscape. Lin, Chen, and Yan²⁹ suggest nesting neural networks inside neural networks.

Decisively, Beniaguev, Segev, and London³⁰ write, “We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already – on its own – deep”, which is what we achieve here with our neuronal sub-networks that meta-learn sets of diverse activation functions that can outperform the corresponding homogeneous neural networks.

Motivation

Inspired by natural brains, feed-forward neural networks are nested nonlinear functions of linear combinations of activities

$$\begin{aligned} a^\prime {\mathop {=}\limits ^{\text {vec}}} \sigma (\textbf{W}a+b), \end{aligned}$$

(1)

where the activation $\sigma$ is typically a saturating or rectifying function, and training strengthens or weakens the weights and biases ${\textbf{W}}$ and b to minimize an objective function, often called a “cost” or “loss” (from financial optimization).

Motivated by the well-studied mammalian visual cortex, varying neuronal activation functions by layer is common. However, within each layer, the activations are typically identical, as in Fig. 1 (left). Neural networks are universal function approximators^{31, 32} and are often used to model hypersurfaces, either for classification or nonlinear regression. Varying the activations within a layer, as in Fig. 1 (middle), should therefore increase the expressiveness of the network by providing diverse spanning basis functions. Furthermore, replacing the activations by sub-networks, as in Fig. 1 (right), and training them for optimal results should increase the expressiveness even further. The training of the activation sub-networks can be on a different schedule than the training of the network, and the activations so obtained can be extracted from the sub-networks as interpolated functions and efficiently reused in other networks addressing different problems.
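
As a minimal sketch of Eq. (1) with activations varied within a layer, as in Fig. 1 (middle), the snippet below applies a different activation to alternating neurons. The helper name `mixed_layer`, the cyclic assignment of activation types, and the choice of `tanh` and a ramp are illustrative assumptions, not the configuration used for the results.

```python
import math

def relu(z):
    """Ramp (rectifying) activation."""
    return max(0.0, z)

def mixed_layer(W, b, a, activations):
    """Return a' with a'_i = sigma_i(sum_j W[i][j] * a[j] + b[i]).

    Neuron i uses activations[i % len(activations)], so two activation
    types are split evenly across an even-sized layer (an assumed layout).
    """
    out = []
    for i, (row, bias) in enumerate(zip(W, b)):
        z = sum(w * x for w, x in zip(row, a)) + bias
        out.append(activations[i % len(activations)](z))
    return out

# Hypothetical weights, biases, and input activities.
W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0], [0.0, 1.0]]
b = [0.0, 0.1, 0.0, -0.5]
a = [1.0, 2.0]

homogeneous = mixed_layer(W, b, a, [math.tanh])    # Fig. 1 (left)
mixed = mixed_layer(W, b, a, [math.tanh, relu])    # Fig. 1 (middle)
```

Passing a single activation recovers the homogeneous layer, so the same function covers both cases.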

Algorithm

To create a learned diversity neural network (LDNN), incorporate sub-networks initialized to simple activations (like identity, ramp, or sigmoid functions). Train the network with many input-output pairs. Quantify the difference between the actual and expected outputs with a loss function ${\mathscr {L}}$. In an inner loop, compute the gradient of the loss function with respect to the network’s weights and biases, and lower the loss by shifting its weights and biases down this gradient. In an outer loop, compute the gradient of the loss function with respect to the sub-networks’ weights and biases³³, and further lower the loss by shifting the sub-networks’ weights and biases down this gradient, thereby evolving new activations. Repeat to minimize loss.

In the inner loop, the randomly shuffled inputs are the stochastic driver that buffets the network weights and biases $\theta$ as they adjust to lower the loss. In the outer loop, the activation sub-network weights and biases $\theta _s$ open extra dimensions or degrees of freedom to further lower the loss. Figure 2 provides an overview, and Algorithm 1 provides details.
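
The two nested loops can be sketched on a toy regression problem. This is a sketch under stated assumptions rather than the authors' implementation: finite differences stand in for automatic differentiation, the sub-networks are randomly initialized instead of starting from a simple activation, full-batch gradients replace shuffled mini-batches, and all names (`sub_act`, `inner_loop`, `meta_train`), sizes, and learning rates are hypothetical.

```python
import math
import random

def sub_act(z, phi):
    """Activation sub-network, 1 -> K tanh units -> 1; phi = [u..., c..., v...]."""
    k = len(phi) // 3
    return sum(phi[2 * k + j] * math.tanh(phi[j] * z + phi[k + j]) for j in range(k))

def predict(x, theta, theta_s, n_types):
    """One hidden layer; theta = [w_in..., b_in..., w_out..., b_out]."""
    h = (len(theta) - 1) // 3
    size = len(theta_s) // n_types
    y = theta[-1]
    for i in range(h):
        t = i % n_types                          # neuron i uses activation type t
        phi = theta_s[t * size:(t + 1) * size]
        y += theta[2 * h + i] * sub_act(theta[i] * x + theta[h + i], phi)
    return y

def loss(theta, theta_s, data, n_types=2):
    """Mean squared error over the (x, y) pairs."""
    return sum((predict(x, theta, theta_s, n_types) - y) ** 2 for x, y in data) / len(data)

def num_grad(f, p, eps=1e-5):
    """Central finite-difference gradient (stand-in for automatic differentiation)."""
    grad = []
    for i in range(len(p)):
        up = p[:i] + [p[i] + eps] + p[i + 1:]
        dn = p[:i] + [p[i] - eps] + p[i + 1:]
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

def inner_loop(theta, theta_s, data, steps, lr):
    """Shift the network weights and biases theta down the loss gradient."""
    for _ in range(steps):
        g = num_grad(lambda t: loss(t, theta_s, data), theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def meta_train(theta, theta_s, data, outer_steps=3, inner_steps=2, lr=0.05, meta_lr=0.05):
    for _ in range(outer_steps):
        def outer(s):
            # Loss reached after the inner loop, viewed as a function of theta_s.
            return loss(inner_loop(theta, s, data, inner_steps, lr), s, data)
        g_s = num_grad(outer, theta_s, eps=1e-4)
        theta_s = [s - meta_lr * gi for s, gi in zip(theta_s, g_s)]   # evolve activations
        theta = inner_loop(theta, theta_s, data, inner_steps, lr)
    return theta, theta_s

rng = random.Random(0)
data = [(x / 4.0, math.sin(x / 4.0)) for x in range(-8, 9)]      # toy regression target
theta0 = [rng.uniform(-0.5, 0.5) for _ in range(3 * 4 + 1)]      # 4 hidden neurons
theta_s0 = [rng.uniform(-0.5, 0.5) for _ in range(2 * 3 * 2)]    # 2 types, K = 2 units
theta1, theta_s1 = meta_train(theta0, theta_s0, data)
print(loss(theta0, theta_s0, data), loss(theta1, theta_s1, data))
```

Because the outer gradient is taken through the whole inner loop, its cost grows with the number of inner steps, which is why the sketch keeps both loops very short.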

Results

MNIST-1D

Here we implement³⁴ learned diversity neural networks with one hidden layer of 100 neurons and a cross-entropy loss function to classify the MNIST-1D data set, a minimalist variation of the classic Modified National Institute of Standards and Technology digits^{35, 36}. Each neuron type in the hidden layer is further instantiated by a feed-forward neural network of 50 hidden units evolved from a base sinusoid. We obtain similar results for different numbers of layers, different numbers of neurons per layer, and different base functions.

Figure 3 summarizes meta-learning the activation functions of neurons in the hidden layer subject to the constraint of having two functions distributed equally among the neuronal population. Figure 3 (left) shows the construction of typical one-dimensional digits. Figure 3 (center) shows the evolution of the two activation functions, with time encoded as rainbow colors from violet to red. Figure 3 (right) shows box plots of the validation accuracy for 50 fully connected neural networks composed of entirely $N_1$ type neurons (yellow), entirely $N_2$ type neurons (orange), and mixed type with $N_1$ and $N_2$ distributed equally among the hidden layer (red). With the same training, the mixed network outperforms either pure network on average. These results are robust with respect to network size, as summarized by Fig. 4.

van der Pol

We obtain similar results for other tasks, such as nonlinear regression of the van der Pol oscillator³⁷, which includes a linear restoring force and a nonlinear viscosity modeled by the differential equation

$$\begin{aligned} \ddot{x} - \mu (1 - x^2)\dot{x} + x = 0, \end{aligned}$$

(2)

where the overdots indicate time derivatives. The van der Pol oscillator can model vacuum tubes and heartbeats and was generalized by FitzHugh³⁸ and Nagumo³⁹ to model spiky neurons. For viscosity parameter $\mu =2.7$, we trained neural networks to forecast the phase space orbit of the oscillator, as summarized by Fig. 5. On average the learned diversity neural network outperforms either of its pure components as well as a homogeneous network of neurons with sinusoidal activations.
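
To make Eq. (2) concrete, the sketch below rewrites it as a first-order system in $(x, v = \dot{x})$ and integrates it to produce a phase space orbit that a forecaster could be trained on. The fourth-order Runge-Kutta integrator, the step size, the initial condition, and the function names are assumptions for illustration; the text does not specify how the training orbits were generated.

```python
def van_der_pol(state, mu):
    """Right-hand side of Eq. (2) as a first-order system in (x, v = dx/dt)."""
    x, v = state
    return (v, mu * (1.0 - x * x) * v - x)

def rk4_step(f, state, dt, mu):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, mu)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), mu)
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), mu)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), mu)
    return tuple(s + dt * (a + 2 * b + 2 * c + d) / 6.0
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def orbit(state, mu, dt=0.01, steps=5000):
    """Return the phase space orbit [(x, v), ...] starting from `state`."""
    points = [state]
    for _ in range(steps):
        state = rk4_step(van_der_pol, state, dt, mu)
        points.append(state)
    return points

# Training pairs for a forecaster: each state paired with its time derivative.
points = orbit((1.0, 0.0), mu=2.7)
pairs = [(p, van_der_pol(p, 2.7)) for p in points]
```

Setting `mu=0.0` removes the nonlinear viscosity and reduces the system to a simple harmonic oscillator, which gives a quick sanity check on the integrator.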

Hénon–Heiles

The paradigmatic Hénon–Heiles Hamiltonian⁴⁰

$$\begin{aligned} H=\frac{1}{2}\left( p_{x}^{2}+p_{y}^{2}\right) + \frac{1}{2}\left( x^{2}+y^{2}\right) + \left( x^2 y-\frac{1}{3}y^3\right) \end{aligned}$$

(3)

can model a star moving in a galaxy of other stars according to the Hamiltonian flow

$$\begin{aligned} \left\{ \dot{q}, \dot{p} \right\} = \left\{ +\frac{\partial H}{\partial {p}}, -\frac{\partial H}{\partial {q}} \right\} , \end{aligned}$$

(4)

where $q = \{x,y\}$ and $p = \{p_x, p_y\}$. Bounded motion is possible in a triangular region of position space. As orbital energy increases, circular symmetry degenerates to triangular symmetry, and integrable motion complexifies to chaotic motion.
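
Equations (3) and (4) can be written out explicitly, since $\partial H/\partial x = x + 2xy$ and $\partial H/\partial y = y + x^2 - y^2$. The sketch below integrates the resulting flow and monitors the energy drift. The Runge-Kutta integrator, step size, and low-energy initial condition are hypothetical choices for illustration.

```python
def hamiltonian(x, y, px, py):
    """Henon-Heiles energy, Eq. (3)."""
    return (0.5 * (px * px + py * py) + 0.5 * (x * x + y * y)
            + x * x * y - y ** 3 / 3.0)

def flow(state):
    """Hamiltonian flow, Eq. (4): (dq/dt, dp/dt) = (+dH/dp, -dH/dq)."""
    x, y, px, py = state
    return (px, py, -(x + 2.0 * x * y), -(y + x * x - y * y))

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step along the flow."""
    def shift(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = flow(state)
    k2 = flow(shift(state, k1, 0.5 * dt))
    k3 = flow(shift(state, k2, 0.5 * dt))
    k4 = flow(shift(state, k3, dt))
    return tuple(s + dt * (a + 2 * b + 2 * c + d) / 6.0
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state = (0.0, 0.1, 0.3, 0.0)            # hypothetical low-energy initial condition
e0 = hamiltonian(*state)
for _ in range(2000):                   # integrate to t = 20
    state = rk4_step(state, 0.01)
drift = abs(hamiltonian(*state) - e0)   # stays tiny if the flow matches H
```

A small energy drift is a useful check that the hand-derived flow is consistent with the Hamiltonian.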

Consequently, for this example, we meta-learn activation functions for both a conventional and a Hamiltonian neural network^{41,42,43,44,45,46}. Unlike conventional neural networks, which learn dynamical systems by intaking position and velocity and outputting their derivatives, a Hamiltonian neural network learns a dynamical system by intaking position and momentum and outputting a single energy-like variable, which it differentiates according to Hamilton’s recipe. Rather than learning the derivatives, it learns the Hamiltonian function, which is the generator of derivatives. This more powerful and efficient strategy is an excellent example of physics-informed machine learning.

More specifically, during training a conventional neural network (NN) maps positions and velocities $\{q_t, \dot{q}_t\}$ to approximations of their time derivatives, and adjusts its internal parameters to minimize the mean-square-error or loss

$$\begin{aligned} {\mathscr {L}}_{\text {NN}} = \bigg \langle ({\dot{q}}_t-{\dot{q}})^2 + (\ddot{q}_t-\ddot{q})^2 \bigg \rangle _t. \end{aligned}$$

(5)

The trained network can extrapolate a given initial condition via the Euler update $\{q, \dot{q}\} \leftarrow \{q, \dot{q}\} + \{\dot{q}, \ddot{q}\} dt$. By contrast, during training a Hamiltonian neural network (HNN) maps positions and momenta $\{q_t, p_t\}$ to the scalar Hamiltonian function H, uses reverse-mode automatic differentiation to find the Hamiltonian’s gradients, uses the gradients to approximate the position and momentum change rates, and adjusts its internal parameters to minimize the loss

$$\begin{aligned} {\mathscr {L}}_{\text {HNN}} = \left\langle \left( \dot{q}_t - \frac{\partial H}{\partial p} \right) ^2 + \left( \dot{p}_t + \frac{\partial H}{\partial q} \right) ^2 \right\rangle _t \end{aligned}$$

(6)

and enforce Hamilton’s motion equations. The trained network can extrapolate a given initial condition via the Euler update $\{q, p\} \leftarrow \{q, p\} + \{\dot{q}, \dot{p}\} dt$.
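
The Hamiltonian loss of Eq. (6) and the Euler update can be sketched for any scalar function $H$ of the concatenated state $z = \{q, p\}$. As an assumption for illustration, central differences replace reverse-mode automatic differentiation, and a unit harmonic oscillator stands in for a trained network; the names `hamilton_rates`, `hnn_loss`, and `euler_update` are hypothetical.

```python
def gradient(H, z, eps=1e-5):
    """Central-difference gradient of the scalar H(z), with z = q + p concatenated."""
    g = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        dn = z[:i] + [z[i] - eps] + z[i + 1:]
        g.append((H(up) - H(dn)) / (2 * eps))
    return g

def hamilton_rates(H, z):
    """(dq/dt, dp/dt) = (+dH/dp, -dH/dq), returned as one list."""
    n = len(z) // 2
    g = gradient(H, z)
    return g[n:] + [-gi for gi in g[:n]]

def hnn_loss(H, samples):
    """Eq. (6): mean squared mismatch between observed and predicted rates."""
    total = 0.0
    for z, z_dot in samples:
        pred = hamilton_rates(H, z)
        total += sum((obs - est) ** 2 for obs, est in zip(z_dot, pred))
    return total / len(samples)

def euler_update(H, z, dt):
    """{q, p} <- {q, p} + {dq/dt, dp/dt} dt."""
    return [zi + dt * ri for zi, ri in zip(z, hamilton_rates(H, z))]

# Toy check with a unit harmonic oscillator, H = (q^2 + p^2) / 2.
def true_H(z):
    return 0.5 * (z[0] ** 2 + z[1] ** 2)

def wrong_H(z):
    return 0.5 * z[1] ** 2 + 0.25 * z[0] ** 4

samples = [([q, p], [p, -q]) for q in (-1.0, 0.5, 1.5) for p in (-0.5, 1.0)]
print(hnn_loss(true_H, samples), hnn_loss(wrong_H, samples))
```

The loss vanishes (up to rounding) only when the candidate function generates the observed rates, which is the sense in which minimizing Eq. (6) enforces Hamilton’s equations.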

As summarized by Fig. 6, the mix of two neuron types outperforms any single neuron type on average for both conventional and Hamiltonian neural networks, but the Hamiltonian neural network is much better, and its mixed version is doubly enhanced. (The spread in the Hamiltonian validation losses is much smaller than the spread in the conventional validation losses, possibly because enforcing symplectic structure on the loss manifold for the Hamiltonian neural network is a regularization that facilitates more consistent optimization, while the unbounded loss of the conventional neural network suffers greater variance due to the wide range of stable and chaotic trajectories.)

Pendulum clock from video

As a final real-world example, we video-recorded a wall-hanging pendulum clock, tracked the ends of its compound pendulum, and extracted its angles and angular velocities at equally spaced times⁴⁶. Engineered to be nearly Hamiltonian, the pendulum’s Graham escapement periodically interrupts the fall of its weight as gravity compensates dissipation. We trained Hamiltonian neural networks to forecast its phase space orbit, as summarized by Fig. 7. Once again, meta-learning proves advantageous.

Analysis

To understand how mixed activation functions outperform homogeneous neuronal populations, we estimate the change in the dimensionality of the network activations. Start by constructing a neuronal activity data matrix X with N rows corresponding to N neurons in the hidden layer and M columns representing inputs. Each matrix element $\textbf{X}_{ij}$ represents the activity of the $i^{th}$ neuron at the $j^{th}$ input. Center the activity so $\langle \textbf{X}\rangle = 0$. Construct the neural co-variance matrix $\textbf{C} = M^{-1}{} \textbf{XX}^{T}$, which indicates how pairs of neurons vary with respect to each other, and compute the participation ratio

$$\begin{aligned} {\mathscr {R}} = \frac{({\text {tr}}\textbf{C})^2}{{\text {tr}}{} \textbf{C}^2} = \frac{\left( \sum _{n=1}^N\lambda _n \right) ^2}{\sum _{n=1}^N \lambda _n^2}, \end{aligned}$$

(7)

where $\lambda _n$ are the co-variance matrix eigenvalues. If all the variance is in one dimension, say $\lambda _n = \delta _{n1}$, then ${\mathscr {R}} = 1$; if the variance is evenly distributed across all dimensions, so $\lambda _n = \lambda _1$, then ${\mathscr {R}} = N$. Typically, $1< {\mathscr {R}} < N$, and ${\mathscr {R}}$ corresponds to the number of dimensions needed to explain most of the variance⁴⁷. The normalized participation ratio $r = {\mathscr {R}} / N$.
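
The participation ratio of Eq. (7) can be computed directly from the traces, without an eigenvalue solver, because ${\text{tr}}\,\textbf{C}^2$ equals the sum of the squared entries of a symmetric matrix. In this sketch the function name and the toy activity matrices are hypothetical, and centering is taken to mean subtracting each neuron's mean activity over the inputs.

```python
def participation_ratio(X):
    """Eq. (7) for an N x M activity matrix X (rows: neurons, columns: inputs)."""
    n, m = len(X), len(X[0])
    # Center each neuron's activity across inputs (assumed meaning of <X> = 0).
    Xc = []
    for row in X:
        mean = sum(row) / m
        Xc.append([v - mean for v in row])
    # Covariance C = X X^T / M.
    C = [[sum(a * b for a, b in zip(Xc[i], Xc[j])) / m for j in range(n)]
         for i in range(n)]
    trace = sum(C[i][i] for i in range(n))
    # For symmetric C, tr(C^2) is the sum of its squared entries.
    trace_sq = sum(c * c for row in C for c in row)
    return trace ** 2 / trace_sq

# Two perfectly correlated neurons: all variance lies in one dimension.
correlated = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]]
# Two uncorrelated neurons with equal variance: variance is spread evenly.
independent = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]

R_low = participation_ratio(correlated)
R_high = participation_ratio(independent)
r_high = R_high / len(independent)      # normalized participation ratio
```

The two toy matrices reproduce the limiting cases described above, ${\mathscr {R}} = 1$ and ${\mathscr {R}} = N$.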

Figure 8 plots the joint probability densities $\rho (A,r)$ for multiple realizations of the Fig. 3 MNIST-1D learned diversity neural network and homogeneous competitors. The mix of two neuron types has the best mean accuracy A and normalized participation ratio r, suggesting that more of its neurons are participating when the mix achieves the best MNIST-1D classification. In contrast, homogeneous networks of neurons with popular activation functions have lower accuracy and participation ratios, reflecting their poorer effectiveness.

To understand the impact of learned diversity on the geometric nature of loss-function minima, we compute the spectrum of the Hessian matrix ${\textbf{H}}=\nabla ^{2}{\mathscr {L}}$, which captures the curvature of the loss function. Since ${\textbf{H}}$ is a symmetric matrix, all its eigenvalues are real. A purely convex loss function would have a positive semi-definite Hessian everywhere. However, in practice, the loss function is almost always non-convex (with multiple local minima) due to the presence of hidden neuron permutation symmetries⁴⁸. Therefore, understanding how diversity helps training find deeper minima is crucial.

Previous work suggests that flatter minima generalize better to unseen data^{49, 50}. For the Fig. 3 neural network meta-learning two neuronal activation functions, we find that once training has converged, the resulting minima from diverse neurons are flatter than those from homogeneous ones, as measured by both the trace ${{\,\textrm{Tr}\,}}{\textbf{H}}$ of the Hessian and the fraction f of its eigenvalues near zero: ${{\,\textrm{Tr}\,}}{\textbf{H}}_1> {{\,\textrm{Tr}\,}}{\textbf{H}}_2 > {{\,\textrm{Tr}\,}}{\textbf{H}}_{12}$ and $f_1< f_2 < f_{12}$. If steep minima are harder for gradient descent to locate, then the flatter minima engineered and discovered by learned diversity neural networks imply enhanced optimization.

Stochastic processes can provide additional insights. Optimizing a neural network by randomly shuffling training data is like a noisy descent to a minimum in a potential landscape, as in Fig. 9. The landscape is the network’s cost or loss as a function of its weights and biases, and its shape depends on the neuron activation functions. The effective dynamics is that of an overdamped particle buffeted by noise sliding on a complicated potential with many local minima. The Langevin equation

$$\begin{aligned} d\theta _{t} = - \nabla {\mathscr {L}}(\theta _{t})\, dt + \sqrt{2{\textbf{D}}} \cdot d{\mathscr {W}}_{t} \end{aligned}$$

(8)

with noise intensity ${\textbf{D}} = (\eta / B) {\mathscr {L}}(\theta ) {\textbf{H}}(\theta ^*)$ describes the evolution of the weights and biases $\theta =\{W_{ij},b_i\}$ in a valley with local minimum $\theta ^{*}$, where $\eta$ is the learning rate and B is the training batch size^{51,52,53,54}. The drift term with dt includes minus the gradient of the loss function ${\mathscr {L}}$, and the Brownian motion noise term with $d{\mathscr {W}}_t$ includes the learning rate $\eta$. The noise aligns with the Hessian near a minimum, and the Eq. 8 Hessian dependence ensures that stochastic gradient descent escapes multiple sharp minima via directions corresponding to large Hessian eigenvalues and eventually converges to a flatter minimum.
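
A one-dimensional Euler-Maruyama discretization illustrates the structure of Eq. (8). As a simplifying assumption, the sketch uses a constant scalar noise intensity `D` rather than the loss- and Hessian-dependent matrix defined above, and quadratic valleys with hypothetical curvatures, so it shows only how the drift and noise terms combine, not the escape from sharp minima.

```python
import math
import random

def langevin_step(theta, grad, D, dt, rng):
    """Euler-Maruyama step of Eq. (8) for a scalar parameter theta."""
    noise = rng.gauss(0.0, 1.0) * math.sqrt(2.0 * D * dt)
    return theta - grad(theta) * dt + noise

def simulate(grad, D, theta0=0.0, dt=0.01, steps=200_000, seed=0):
    """Return the sampled path of theta under drift -grad and noise intensity D."""
    rng = random.Random(seed)
    theta, path = theta0, []
    for _ in range(steps):
        theta = langevin_step(theta, grad, D, dt, rng)
        path.append(theta)
    return path

def variance(path):
    mean = sum(path) / len(path)
    return sum((t - mean) ** 2 for t in path) / len(path)

# Quadratic valleys L = h * theta^2 / 2 with curvature (Hessian) h.
D = 0.05
sharp = variance(simulate(lambda t: 4.0 * t, D))   # h = 4
flat = variance(simulate(lambda t: 0.5 * t, D))    # h = 0.5
print(sharp, flat)
```

For constant `D` the stationary variance in a quadratic valley is approximately `D / h`, so the same noise explores a flat valley more widely than a sharp one.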

Conclusions

Biomimetic engineering or biomimicry is design inspired by nature. Just as monoculture crops can be fragile, while diverse crops can be robust⁵⁵, heterogeneous neural networks can outperform homogeneous ones. Here, we highlight advantages of varying activation functions within each layer and learning the best variation by replacing activations by sub-networks.

Conceptually, learned diversity neural networks discover novel sets of activation functions, when most artificial neural networks use just one of a small number of conventional activations per layer. Practically, mixes of learned activations can outperform traditional activations – where even a $1\%$ improvement can be significant – and the learned activations can be efficiently reused in diverse neural networks. Additionally, learned diversity can even improve already enhanced physics-informed neural networks like Hamiltonian neural networks^{43, 56}. Future work includes optimizing learned diversity by adjusting hyperparameters, applying learned diversity to a wider range of regression and classification problems, testing the diverse neural networks for robustness⁵⁷, investigating clustering of learned activations, and applying learned diversity to different neural network architectures, such as recurrent neural networks and reservoir computers^{58,59,60}.

Learned diversity offers neural networks sets of tailored basis functions, which enhance their expressiveness and adaptability and facilitate efficient function approximation. When given the ability to learn their neuronal activation functions, neural networks discover heterogeneous arrangements of nonlinear neuronal activations that can outperform their homogeneous counterparts with the same training. Our work provides specific examples of dynamical systems that spontaneously select diversity over uniformity, and thereby furthers our understanding of diversity and its role in strengthening natural and artificial systems.

Methods

We implement our neural networks in the Python programming language using the PyTorch open source machine learning library. We also implement them in the Python library JAX⁶¹ using the JAX library Equinox⁶². The code for the analysis and the network implementation can be found at our GitHub repository³⁴.

The number of training pairs is of order $10^4$, and the number of training epochs is of order 10. Due to computational constraints, the number of inner iterations is much smaller than the number of outer iterations. Indeed, the learner-meta-learner structure of the meta-learning algorithm incurs significant computational costs with a time complexity of $O(N_O N_I |X|)$, where $N_O$ and $N_I$ are the numbers of outer and inner iterations and $|X|$ is the size of the training set. The current implementation of the algorithm is constrained by the number of inner loops within the outer loops since the inner loop is held in memory for the outer loop computation (such as the Algorithm 1 gradients $\nabla _{\theta _{s}} {\mathscr {L}}_{t}$) and optimization. In fact, this is one of the fundamental challenges of gradient-based meta-learning algorithms that currently limits the horizon of meta-optimization⁶³. However, the inefficiency of the algorithm plausibly results from activation meta-learning being under-explored and ripe for improvement.

The PyHessian library is used to compute Hessian-based statistics without the cost of generating the full Hessian matrix. The trace of the Hessian matrix is computed using Hutchinson’s method, exploiting the symmetric nature of the matrix⁶⁴. The Empirical Spectral Density (ESD) of Hessian eigenvalues is computed through Stochastic Lanczos Quadrature (SLQ)⁶⁵ within several successive approximation schemes. Details can be found in Yao et al.⁶⁶. At an implementation level, a classifier or forecaster using the learned activation(s) is trained in PyTorch and the model is saved. Using this saved model and test data, PyHessian can use PyTorch’s backward graph to compute the gradients needed to build the Hessian trace and ESD.
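
The idea behind Hutchinson’s method can be sketched without PyHessian or PyTorch: the trace is estimated as the average of $v^{T}\textbf{H}v$ over random $\pm 1$ vectors $v$, and each product $\textbf{H}v$ needs only gradient evaluations, never the full matrix. As assumptions for illustration, the Hessian-vector product below uses a central difference of the gradient instead of a backward graph, and the toy quadratic loss, sample count, and function names are hypothetical.

```python
import random

def hvp(grad, theta, v, eps=1e-4):
    """Hessian-vector product H v from two gradient calls (central difference)."""
    up = grad([t + eps * vi for t, vi in zip(theta, v)])
    dn = grad([t - eps * vi for t, vi in zip(theta, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(up, dn)]

def hutchinson_trace(grad, theta, samples=2000, seed=0):
    """Estimate tr(H) as the average of v^T H v over random +/-1 vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        v = [rng.choice((-1.0, 1.0)) for _ in theta]
        total += sum(vi * hi for vi, hi in zip(v, hvp(grad, theta, v)))
    return total / samples

# Toy loss L = theta^T A theta / 2 with gradient A theta, so H = A and tr(H) = 5.
A = [[2.0, 1.0], [1.0, 3.0]]

def grad(theta):
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

estimate = hutchinson_trace(grad, [0.3, -0.7])
print(estimate)
```

The estimate is stochastic, and its error shrinks with the square root of the number of probe vectors.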

The activation function is captured after meta-learning as the output of the learned activation networks on the interval $[-10,10]$ with 100 linearly spaced points. This output is then linearly interpolated between points and used as the activation function for the classifier at validation. Quadratic or cubic splines or symbolic regression can also be used. We need high-order ($>10$) polynomials to fit the activation curves accurately, so, while possible, we do not recommend polynomials as a reliable way to capture the features of the learned activation functions.
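
The capture step can be sketched as follows, with `math.tanh` standing in for a learned activation sub-network. The function names are hypothetical, and clamping inputs that fall outside $[-10,10]$ to the end values is an assumption, since the text does not say how such inputs are handled.

```python
import bisect
import math

def capture_activation(f, lo=-10.0, hi=10.0, n=100):
    """Sample f at n linearly spaced points on [lo, hi]."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [f(x) for x in xs]

def interpolated(xs, ys):
    """Return a piecewise-linear activation through the sampled points."""
    def act(z):
        z = min(max(z, xs[0]), xs[-1])       # clamp outside the interval (assumption)
        i = min(bisect.bisect_right(xs, z), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (z - x0) / (x1 - x0)
    return act

xs, ys = capture_activation(math.tanh)   # math.tanh stands in for a learned sub-network
act = interpolated(xs, ys)
print(act(0.37), math.tanh(0.37))
```

Once extracted this way, the activation is a plain function of one variable and no longer requires evaluating the sub-network.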

Code availability

Our code is available at https://github.com/nonlinearartificialintelligencelab/diversityNN.

References

Anderson, P. W. More is different. Science 177, 393–396 (1972).
Article ADS CAS PubMed Google Scholar
Bak, P., Tang, C. & Wiesenfeld, K. Self-organized criticality: An explanation of the $1/f$ noise. Phys. Rev. Lett. 59, 381–384 (1987).
Article ADS CAS PubMed Google Scholar
Holovatch, Y., Kenna, R. & Thurner, S. Complex systems: Physics beyond physics. Eur. J. Phys. 38, 023002 (2017).
Article Google Scholar
Wichterle, H., Gifford, D. & Mazzoni, E. Mapping neuronal diversity one cell at a time. Science 341, 726–727 (2013).
Article ADS CAS PubMed Google Scholar
Tilman, D., Lehman, C. L. & Thomson, K. T. Plant diversity and ecosystem productivity: Theoretical considerations. Proc. Natl. Acad. Sci. 94, 1857–1861 (1997).
Article ADS CAS PubMed PubMed Central Google Scholar
Choudhary, A. et al. Weak-winner phase synchronization: A curious case of weak interactions. Phys. Rev. Res. 3, 023144 (2021).
Article CAS Google Scholar
May, R., Levin, S. & Sugihara, G. Ecology for bankers. Nature 451, 893 (2008).
Article ADS CAS PubMed Google Scholar
Page, S. E. Diversity and Complexity, vol. 2 (Princeton University Press, 2010).
May, R. M. Stability and complexity in model ecosystems (Princeton University Press, 2019).
Sinha, S. & Sinha, S. Evidence of universality for the May–Wigner stability theorem for random networks with local dynamics. Phys. Rev. E 71, 020902(R) (2005).
Article ADS Google Scholar
Marcus, G., Marblestone, A. & Dean, T. The atoms of neural computation. Science 346, 551–552 (2014).
Article ADS CAS PubMed Google Scholar
Thivierge, J.-P. Neural diversity creates a rich repertoire of brain activity. Commun. Integr. Biol. 1, 188–189 (2008).
Article PubMed PubMed Central Google Scholar
Gjorgjieva, J., Drion, G. & Marder, E. Computational implications of biophysical diversity and multiple timescales in neurons and synapses for circuit performance. Curr. Opin. Neurobiol. 37, 44–52 (2016).
Article CAS PubMed PubMed Central Google Scholar
Tripathy, S. J., Padmanabhan, K., Gerkin, R. C. & Urban, N. N. Intermediate intrinsic diversity enhances neural population coding. Proc. Natl. Acad. Sci. 110, 8248–8253 (2013).
Article ADS CAS PubMed PubMed Central Google Scholar
Doty, B., Mihalas, S., Arkhipov, A. & Piet, A. Heterogeneous ‘cell types’ can improve performance of deep neural networks. bioRxiv. https://doi.org/10.1101/2021.06.21.449346 (2021).
Perez-Nieves, N., Leung, V. C. H., Dragotti, P. L. & Goodman, D. F. M. Neural heterogeneity promotes robust learning. Nat. Commun. 12, 5791 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Han, C.-D., Glaz, B., Haile, M. & Lai, Y.-C. Adaptable hamiltonian neural networks. Phys. Rev. Research 3, 023156 (2021).
Article ADS CAS Google Scholar
Hospedales, T., Antoniou, A., Micaelli, P. & Storkey, A. Meta-learning in neural networks: A survey. arXiv:2004.05439 (2020).
Manessi, F. & Rozza, A. Learning combinations of activation functions. In 2018 24th International Conference on Pattern Recognition (ICPR), 61–66 (IEEE, 2018).
Agostinelli, F., Hoffman, M., Sadowski, P. & Baldi, P. Learning activation functions to improve deep neural networks. arXiv:1412.6830 (2014).
Apicella, A., Donnarumma, F., Isgrò, F. & Prevete, R. A survey on modern trainable activation functions. Neural Netw. 138, 14–32 (2020).
Article MATH Google Scholar
Lau, M. M. & Hann Lim, K.Review of Adaptive Activation Function in Deep Neural Network, 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES). 686–690 (Sarawak, Malaysia, 2018). https://doi.org/10.1109/IECBES.2018.8626714.
Jagtap, A. D., Kawaguchi, K. & Karniadakis, G. E. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. J. Comput. Phys. 404, 109136. https://doi.org/10.1016/j.jcp.2019.109136 (2020).
Article MathSciNet MATH Google Scholar
Haoxiang, D. W. & Smys, D. S. Overview of configuring adaptive activation functions for deep neural networks—A comparative study. J. Ubiq. Comput. Commun. Technol. 3(1), 10–22. https://doi.org/10.36548/jucct.2021.1.002 (2021).
Article Google Scholar
Qian, S., Liu, H., Liu, C., Wu, S. & Wong, H. S. Adaptive activation functions in convolutional neural networks. Neurocomputing 272, 204–212. https://doi.org/10.1016/j.neucom.2017.06.070 (2018).
Article Google Scholar
Xie, B., Liang, Y. & Song, L. Diversity leads to generalization in neural networks. arXiv:1611.031311611 (2016).
Mariet, Z. & Sra, S. Diversity networks: Neural network compression using determinantal point processes. arXiv:1511.05077 (2015).
Siouda, R., Nemissi, M. & Seridi, H. Diverse activation functions based-hybrid RBF-ELM neural network for medical classification. Evolutionary Intelligence (2022).
Lin, M., Chen, Q. & Yan, S. Network in network. arXiv:1312.4400 (2014).
Beniaguev, D., Segev, I. & London, M. Single cortical neurons as deep artificial neural networks. Neuron 109, 2727–2739 (2021).
Article CAS PubMed Google Scholar
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. (MCSS) 2, 303–314 (1989).
Article MathSciNet MATH Google Scholar
Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257 (1991).
Article Google Scholar
Maclaurin, D., Duvenaud, D. & Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv:1502.03492 (2015).
Our code is available at https://github.com/nonlinearartificialintelligencelab/diversityNN
Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).
Article ADS Google Scholar
Greydanus, S. Scaling down deep learning. arXiv:1511.05077 (2020).
van der Pol Jun. D.Sc, B. Lxxxviii. on “relaxation-oscillations”. London Edinb. Dublin Philos. Magaz. J. Sci. 2, 978–992 (1926).
Fitzhugh, R. Impulses and physiological states in theoretical models of nerve membrane. Biophys. J . 1, 445–466 (1961).
Article CAS PubMed PubMed Central Google Scholar
Nagumo, J., Arimoto, S. & Yoshizawa, S. An active pulse transmission line simulating nerve axon. Proc. IRE 50, 2061–2070 (1962).
Article Google Scholar
Hénon, M. & Heiles, C. The applicability of the third integral of motion: Some numerical experiments. Astron. J. 69, 73. https://doi.org/10.1086/109234 (1964).
Article ADS MathSciNet Google Scholar
Greydanus, S., Dzamba, M. & Yosinski, J. Hamiltonian neural networks. arXiv:1906.01563 (2019).
Toth, P. et al. Hamiltonian generative networks. arXiv:1909.13789 (2019).
Choudhary, A. et al. Physics-enhanced neural networks learn order and chaos. Phys. Rev. E 101, 062207 (2020).
Article ADS CAS PubMed Google Scholar
Miller, S. T., Lindner, J. F., Choudhary, A., Sinha, S. & Ditto, W. L. Mastering high-dimensional dynamics with Hamiltonian neural networks. Chaos, Solitons Fract. X 5, 100046 (2020).
Article Google Scholar
Miller, S. T., Lindner, J. F., Choudhary, A., Sinha, S. & Ditto, W. L. Negotiating the separatrix with machine learning. Nonlinear Theory Appl IEICE 12, 134–142. https://doi.org/10.1587/nolta.12.134 (2021).
Article ADS Google Scholar
Choudhary, A. et al. Forecasting Hamiltonian dynamics without canonical coordinates. Nonlinear Dyn. 103, 1553–1562 (2021).
Article MATH Google Scholar
Gao, P. et al. A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv. https://doi.org/10.1101/214262 (2017).
Simsek, B. et al. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In Meila, M. & Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML, vol. 139, 9722–9732 (2021).
Ghorbani, B., Krishnan, S. & Xiao, Y. An investigation into neural net optimization via hessian eigenvalue density. arXiv:1901.10159 (2019).
Sankar, A. R., Khasbage, Y., Vigneswaran, R. & Balasubramanian, V. N. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. arXiv:2012.03801 (2020).
Mori, T., Ziyin, L., Liu, K. & Ueda, M. Logarithmic landscape and power-law escape rate of SGD. arXiv:2105.09557 (2021).
Mandt, S., Hoffman, M. D. & Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18, 1–35 (2017).
Sirignano, J. & Spiliopoulos, K. Stochastic gradient descent in continuous time: A central limit theorem. Stoch. Syst. 10, 124–151 (2020).
Chaudhari, P. et al. Entropy-SGD: Biasing gradient descent into wide valleys. J. Stat. Mech: Theory Exp. 2019, 124018 (2019).
Wetzel, W. C., Kharouba, H. M., Robinson, M., Holyoak, M. & Karban, R. Variability in plant nutrients reduces insect herbivore performance. Nature 539, 425–427 (2016).
Wu, T. & Tegmark, M. Toward an artificial intelligence physicist for unsupervised learning. Phys. Rev. E 100, 033311 (2019).
Cheney, N., Schrimpf, M. & Kreiman, G. On the robustness of convolutional neural networks to internal architecture and weight perturbations. arXiv preprint arXiv:1703.08245 (2017).
Pathak, J., Hunt, B., Girvan, M., Lu, Z. & Ott, E. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett. 120, 024102 (2018).
Rafayelyan, M., Dong, J., Tan, Y., Krzakala, F. & Gigan, S. Large-scale optical reservoir computing for spatiotemporal chaotic systems prediction. Phys. Rev. X 10, 041037 (2020).
Govia, L., Ribeill, G., Rowlands, G., Krovi, H. & Ohki, T. Quantum reservoir computing with a single nonlinear oscillator. Phys. Rev. Res. 3, 013077 (2021).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs (2018).
Kidger, P. & Garcia, C. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021 (2021).
Wu, Y., Ren, M., Liao, R. & Grosse, R. Understanding short-horizon bias in stochastic meta-optimization. arXiv:1803.02021 (2018).
Avron, H. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM 58, 8:1–8:34 (2011).
Ubaru, S., Chen, J. & Saad, Y. Fast estimation of $\operatorname{tr}(f(A))$ via stochastic Lanczos quadrature. SIAM J. Matrix Anal. Appl. 38, 1075–1099 (2017).
Yao, Z., Gholami, A., Keutzer, K. & Mahoney, M. PyHessian: Neural networks through the lens of the hessian. arXiv:1912.07145 (2020).

Acknowledgements

This research was supported by O.N.R. Grant N00014-16-1-3066 and a gift from United Therapeutics. S.S. acknowledges support from the J.C. Bose National Fellowship (Grant No. JBR/2020/000004). W.L.D. thanks Kathleen Russell for the conceptualization of the original idea along with many subsequent discussions.

Author information

Authors and Affiliations

Nonlinear Artificial Intelligence Laboratory, Physics Department, North Carolina State University, Raleigh, NC, 27607, USA
Anshul Choudhary, Anil Radhakrishnan, John F. Lindner & William L. Ditto
The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
Anshul Choudhary
Physics Department, The College of Wooster, Wooster, OH, 44691, USA
John F. Lindner
Indian Institute of Science Education and Research Mohali, Knowledge City, SAS Nagar, Sector 81, Manauli, Punjab, 140 306, India
Sudeshna Sinha

Authors

Anshul Choudhary
Anil Radhakrishnan
John F. Lindner
Sudeshna Sinha
William L. Ditto

Contributions

A.C. designed and implemented our meta-learning code. A.R. trained and analyzed our neural networks and created our GitHub repository. J.F.L. led the writing and finalized the figures. S.S. elucidated the diversity mechanism. W.L.D. motivated and guided the research. All authors contributed to the final manuscript.

Corresponding authors

Correspondence to Anshul Choudhary, John F. Lindner or William L. Ditto.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Choudhary, A., Radhakrishnan, A., Lindner, J.F. et al. Neuronal diversity can improve machine learning for physics and beyond. Sci Rep 13, 13962 (2023). https://doi.org/10.1038/s41598-023-40766-6

Received: 12 April 2023
Accepted: 16 August 2023
Published: 26 August 2023
DOI: https://doi.org/10.1038/s41598-023-40766-6
