Neuronal diversity can improve machine learning for physics and beyond

Diversity conveys advantages in nature, yet homogeneous neurons typically comprise the layers of artificial neural networks. Here we construct neural networks from neurons that learn their own activation functions, quickly diversify, and subsequently outperform their homogeneous counterparts on image classification and nonlinear regression tasks. Sub-networks instantiate the neurons, which meta-learn especially efficient sets of nonlinear responses. Examples include conventional neural networks classifying digits and forecasting a van der Pol oscillator, and physics-informed Hamiltonian neural networks learning Hénon–Heiles stellar orbits and the swing of a video-recorded pendulum clock. Such learned diversity provides examples of dynamical systems selecting diversity over uniformity and elucidates the role of diversity in natural and artificial systems.


Introduction
Diversity is a hallmark of many complex systems in physics [1,2] and in physics beyond physics [3], including microscopic cell populations [4], marine and terrestrial ecosystems [5,6], financial markets [7], and social networks [8][9][10]. In particular, mammalian brains contain billions of neurons with diverse cell types whose complex dynamical patterns are believed to be responsible for the rich range of cognition, affect, and behavior [11][12][13][14]. But despite the widespread appreciation of diversity in neuroscience, researchers have only begun to explore the role of diversity and adaptability in artificial neural networks [15][16][17].
Inspired by nature, artificial neural networks are nonlinear systems that can be trained to learn, classify, and predict. Conventional artificial neural networks contain identical neurons in each network layer, even if the neurons vary from layer to layer. But uniform neuronal activation functions can reduce expressiveness and adaptability, limiting the neural network's capacity to capture the rich diversity of computation and interaction observed in nature. Diversifying the activation functions can overcome such limitations, enabling the networks to be more expressive and to better represent the complexity of natural systems. In this article, we propose a novel way to diversify a neural network by learning the neuron types within each layer. We flexibly realize the different neurons using sub-networks, or networks-within-the-network, which we train along with the overarching network. This meta-learning [18] generates potent neuron activation function sets, suggestive of orthogonal spanning functions, that increase the expressiveness and accuracy of the network.
After discussing related work and our motivation, we describe how meta-learning diverse activation functions can generate better neural networks, as measured by difficult classification and nonlinear regression tasks. We show that learned diversity can enhance conventional neural networks as well as physics-informed neural networks, so the latter are doubly enhanced. To provide further insight into the advantages of diverse neuronal activations, we employ neuron participation ratios as a metric to elucidate the superior potential of these layers compared to their homogeneous counterparts. Additionally, we study the geometry of the minima found by optimization by examining the spectra of their Hessian matrices, shedding light on the underlying loss landscape of diversified neural networks. Finally, by examining the interplay between stochastic processes and diversified neural networks, we show how the synergy between the inherent randomness of the optimization procedure and learned diversity results in more generalizable models. We end by discussing future work and the potential for learned diversity to enhance artificial neural networks, deep learning, and our appreciation of diversity itself.

Related Work
Researchers have recently begun to relax the rigid rules that have guided the development and use of artificial neural networks. Manessi and Rozza [19] investigate learning combinations of known neuronal activation functions, and Agostinelli et al. [20] learn piecewise linear activation functions for each neuron. Apicella et al. [21] survey trainable activation functions. Lau and Lim [22] review adaptive activation functions in deep neural networks. Jagtap, Kawaguchi, and Karniadakis [23] and Haoxiang and Smys [24] include scalable hyper-parameters in their activation functions to improve their networks, while Qian et al. [25] linearly, nonlinearly, and hierarchically combine basic activation functions to optimize performance.
More radically, Gjorgjieva, Drion, and Marder [13] investigate the computational implications of biophysical diversity and multiple timescales in neurons and synapses for circuit performance. Doty et al. [15] show that hand-crafted heterogeneous cell types can improve the performance of deep neural networks. Xie, Liang, and Song [26] demonstrate that diversity in synaptic weights leads to better generalization in neural networks. Mariet and Sra [27] sample a diverse subset of neurons and merge them with the remaining ones via a re-weighting procedure. Siouda et al. [28] use genetic algorithms to optimize the number, forms, and types of hidden neurons. Hospedales et al. [18] survey the current meta-learning landscape. Lin, Chen, and Yan [29] suggest nesting neural networks inside neural networks.
Decisively, Beniaguev, Segev, and London [30] write, "We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already - on its own - deep", which is what we achieve here with our neuronal sub-networks that meta-learn sets of diverse activation functions that can outperform the corresponding homogeneous neural networks.

Motivation
Inspired by natural brains, feed-forward neural networks are nested nonlinear functions of linear combinations of activities

$$\hat{y} = f(x) = \cdots\,\sigma\big(W_2\,\sigma(W_1 x + b_1) + b_2\big)\cdots,$$

where the activation σ is typically a saturating or rectifying function, and training strengthens or weakens the weights and biases W and b to minimize an objective function, often called a "cost" or "loss" (from financial optimization). Motivated by the well-studied mammalian visual cortex, varying neuronal activation functions by layer is common. However, within each layer, the activations are typically identical, as in Fig. 1 (left). Neural networks are universal function approximators [31,32] and are often used to model hypersurfaces, either for classification or nonlinear regression. Varying the activations within a layer, as in Fig. 1 (middle), should therefore increase the expressiveness of the network by providing diverse spanning basis functions. Furthermore, replacing the activations by sub-networks, as in Fig. 1 (right), and training them for optimal results should increase the expressiveness even further. The training of the activation sub-networks can be on a different schedule than the training of the network, and the activations so obtained can be extracted from the sub-networks as interpolated functions and efficiently reused in other networks addressing different problems.
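To make the step from Fig. 1 (left) to Fig. 1 (middle) concrete, the following minimal sketch evaluates one hidden layer whose neurons are split evenly among a list of activation functions. It is an illustration only, not the implementation behind the reported results; the helper name `mixed_layer`, the layer sizes, and the choice of tanh and sine are assumptions made for the example.

```python
import numpy as np

def mixed_layer(x, W, b, activations):
    """Hidden layer whose neurons are split evenly among the given activations."""
    a = x @ W + b                               # pre-activations, (batch, neurons)
    groups = np.array_split(np.arange(a.shape[1]), len(activations))
    h = np.empty_like(a)
    for idx, sigma in zip(groups, activations):
        h[:, idx] = sigma(a[:, idx])            # each group applies its own sigma
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                     # 5 inputs with 3 features each
W = rng.normal(size=(3, 4))                     # weights into 4 hidden neurons
b = np.zeros(4)

h_homogeneous = mixed_layer(x, W, b, [np.tanh])          # as in Fig. 1 (left)
h_diverse = mixed_layer(x, W, b, [np.tanh, np.sin])      # as in Fig. 1 (middle)
```

Passing a single activation recovers the conventional homogeneous layer, so the same routine covers both cases and only the list of activations changes.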

Algorithm
To create a learned diversity neural network (LDNN), incorporate sub-networks initialized to simple activations (like identity, ramp, or sigmoid functions). Train the network with many input-output pairs. Quantify the difference between the actual and expected outputs with a loss function L. In an inner loop, compute the gradient of the loss function with respect to the network's weights and biases, and lower the loss by shifting its weights and biases down this gradient. In an outer loop, compute the gradient of the loss function with respect to the sub-networks' weights and biases [33], and further lower the loss by shifting the sub-networks' weights and biases down this gradient, thereby evolving new activations. Repeat to minimize loss.
In the inner loop, the randomly shuffled inputs are the stochastic driver that buffets the network weights and biases θ as they adjust to lower the loss. In the outer loop, the activation sub-network weights and biases θ_s open extra dimensions or degrees of freedom to further lower the loss. Figure 2 provides an overview, and Algorithm 1 provides details.
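The nested loops described above can be sketched on a toy problem. The example below is a hedged illustration rather than Algorithm 1 itself: the network sizes, the regression target sin(x), the learning rates, and the iteration counts are all hypothetical, the sub-networks start from random weights instead of a simple base activation, and central-difference gradients stand in for automatic differentiation so that the sketch needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HIDDEN, N_TYPES, N_SUB = 4, 2, 3          # hypothetical toy sizes

def activation(a, p):
    """One neuron type: a 1 -> N_SUB -> 1 tanh sub-network applied elementwise."""
    w1, b1, w2, b2 = p[:N_SUB], p[N_SUB:2 * N_SUB], p[2 * N_SUB:3 * N_SUB], p[-1]
    return np.tanh(a[:, None] * w1 + b1) @ w2 + b2

def forward(x, theta, theta_s):
    """Scalar-input, scalar-output network with one hidden layer."""
    W1, b1 = theta[:N_HIDDEN], theta[N_HIDDEN:2 * N_HIDDEN]
    W2, b2 = theta[2 * N_HIDDEN:3 * N_HIDDEN], theta[-1]
    a = x[:, None] * W1 + b1                 # pre-activations, (batch, N_HIDDEN)
    types = theta_s.reshape(N_TYPES, -1)     # one parameter row per neuron type
    h = np.stack([activation(a[:, n], types[n % N_TYPES])
                  for n in range(N_HIDDEN)], axis=1)
    return h @ W2 + b2

def loss(theta, theta_s, x, y):
    return np.mean((forward(x, theta, theta_s) - y) ** 2)

def num_grad(f, p, eps=1e-5):
    """Central-difference gradient; stands in for automatic differentiation."""
    g = np.zeros_like(p)
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

x = np.linspace(-2.0, 2.0, 32)
y = np.sin(x)                                # hypothetical regression target
theta = rng.normal(scale=0.5, size=3 * N_HIDDEN + 1)
theta_s = rng.normal(scale=0.5, size=N_TYPES * (3 * N_SUB + 1))
initial_loss = loss(theta, theta_s, x, y)

R_I, R_O, N_I, N_O = 0.05, 0.05, 5, 40       # learning rates and iteration counts
for _ in range(N_O):
    for _ in range(N_I):                     # inner loop: update the network
        for batch in np.array_split(rng.permutation(x.size), 2):
            xb, yb = x[batch], y[batch]
            theta = theta - R_I * num_grad(lambda p: loss(p, theta_s, xb, yb), theta)
    # outer loop: update the activation sub-networks
    theta_s = theta_s - R_O * num_grad(lambda p: loss(theta, p, x, y), theta_s)
final_loss = loss(theta, theta_s, x, y)
```

Keeping θ and θ_s in separate arrays makes it easy to put them on different schedules, which is the point of the two loops: the shuffled mini-batches drive θ in the inner loop, while θ_s moves only once per outer iteration.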

MNIST-1D
Here we implement [34] learned diversity neural networks with one hidden layer of 100 neurons and a cross-entropy loss function to classify the MNIST-1D data set, a minimalist variation of the classic Modified National Institute of Standards and Technology digits [35,36]. Each neuron type in the hidden layer is itself instantiated by a feed-forward neural network of 50 hidden units evolved from a base sinusoid. We obtain similar results for different numbers of layers, different numbers of neurons per layer, and different base functions.
Figure 3 summarizes meta-learning the activation functions of neurons in the hidden layer subject to the constraint of having two functions distributed equally among the neuronal population. Figure 3 (left) shows the construction of typical one-dimensional digits. Figure 3 (center) shows the evolution of the two activation functions, with time encoded as rainbow colors from violet to red. Figure 3 (right) shows box plots of validation accuracy for 50 fully connected neural networks composed entirely of N_1 type neurons (yellow), entirely of N_2 type neurons (orange), and of a mix with N_1 and N_2 distributed equally among the hidden layer (red). With the same training, the mixed network outperforms either pure network on average. These results are robust with respect to network size, as summarized by Fig. 4.

van der Pol
We obtain similar results for other tasks, such as nonlinear regression of the van der Pol oscillator [37], which includes a linear restoring force and a nonlinear viscosity modeled by the differential equation

$$\ddot{x} = -x + \mu\,(1 - x^2)\,\dot{x},$$

where the overdots indicate time derivatives. The van der Pol oscillator can model vacuum tubes and heartbeats and was generalized by FitzHugh [38] and Nagumo [39] to model spiky neurons. For viscosity parameter µ = 2.7, we trained neural networks to forecast the phase space orbit of the oscillator, as summarized by Fig. 5. On average the learned diversity neural network outperforms either of its pure components as well as a homogeneous network of neurons with sinusoidal activations.
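For readers who want to generate a comparable phase space orbit, the following sketch integrates the standard van der Pol equation, with linear restoring force −x and nonlinear viscosity µ(1 − x²)ẋ, at µ = 2.7. The fourth-order Runge-Kutta integrator, the step size, the duration, and the initial condition are assumptions for illustration; the source does not state how its training orbits were produced.

```python
MU = 2.7  # viscosity parameter quoted in the text

def van_der_pol(state, mu=MU):
    """Rates (dx/dt, dv/dt) for the oscillator with position x and velocity v."""
    x, v = state
    return (v, mu * (1.0 - x * x) * v - x)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def orbit(state=(0.1, 0.0), dt=0.01, steps=6000):
    """Phase space points (x, v) sampled at equally spaced times."""
    points = [state]
    for _ in range(steps):
        state = rk4_step(van_der_pol, state, dt)
        points.append(state)
    return points

points = orbit()
# Largest |x| over the last 20 time units, after the transient has decayed.
late_amplitude = max(abs(x) for x, _ in points[-2000:])
```

Starting near the origin, the orbit spirals outward and settles onto the limit cycle, so consecutive pairs of points can serve as input-output examples for a forecasting network.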

Hénon-Heiles
The paradigmatic Hénon-Heiles Hamiltonian [40] can model a star moving in a galaxy of other stars according to the Hamiltonian flow

$$H = \tfrac{1}{2}\left(p_x^2 + p_y^2\right) + \tfrac{1}{2}\left(x^2 + y^2\right) + x^2 y - \tfrac{1}{3} y^3, \qquad \dot{q} = +\frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial q},$$

where q = {x, y} and p = {p_x, p_y}. Bounded motion is possible in a triangular region of position space. As orbital energy increases, circular symmetry degenerates to triangular symmetry, and integrable motion complexifies to chaotic motion. Consequently, for this example, we meta-learn activation functions for both a conventional and a Hamiltonian neural network [41][42][43][44][45][46]. Unlike conventional neural networks, which learn dynamical systems by taking in position and velocity and outputting their derivatives, a Hamiltonian neural network learns a dynamical system by taking in position and momentum and outputting a single energy-like variable, which it differentiates according to Hamilton's recipe. Rather than learning the derivatives, it learns the Hamiltonian function, which is the generator of derivatives. This more powerful and efficient strategy is an excellent example of physics-informed machine learning.
More specifically, during training a conventional neural network (NN) maps positions and velocities $\{q_t, \dot{q}_t\}$ to approximations $\{\hat{\dot{q}}_t, \hat{\ddot{q}}_t\}$ of their time derivatives, and adjusts its internal parameters to minimize the mean-square-error loss

$$L_{NN} = \lVert \dot{q}_t - \hat{\dot{q}}_t \rVert^2 + \lVert \ddot{q}_t - \hat{\ddot{q}}_t \rVert^2.$$

The trained network can extrapolate a given initial condition via the Euler update $\{q, \dot{q}\} \leftarrow \{q, \dot{q}\} + \{\dot{q}, \ddot{q}\}\,dt$. By contrast, during training a Hamiltonian neural network (HNN) maps positions and momenta $\{q_t, p_t\}$ to the scalar Hamiltonian function H, uses reverse-mode automatic differentiation to find the Hamiltonian's gradients, uses the gradients to approximate the position and momentum change rates, and adjusts its internal parameters to minimize the loss

$$L_{HNN} = \left\lVert \dot{q}_t - \frac{\partial H}{\partial p_t} \right\rVert^2 + \left\lVert \dot{p}_t + \frac{\partial H}{\partial q_t} \right\rVert^2$$

and enforce Hamilton's motion equations. The trained network can extrapolate a given initial condition via the Euler update $\{q, p\} \leftarrow \{q, p\} + \{\dot{q}, \dot{p}\}\,dt$. As summarized by Fig. 6, the mix of 2 neuron types outperforms any single neuron type on average for both conventional and Hamiltonian neural networks, but the Hamiltonian neural network is much better, and its mixed version is doubly enhanced. (The spread in the Hamiltonian validation losses is much smaller than the spread in the conventional validation losses, possibly because enforcing symplectic structure on the loss manifold for the Hamiltonian neural network is a regularization that facilitates more consistent optimization, while the unbounded loss of the conventional neural network suffers greater variance due to the wide range of stable and chaotic trajectories.)
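Hamilton's recipe can be illustrated without a neural network. In the sketch below, a scalar function H plays the role of the network output, central differences stand in for reverse-mode automatic differentiation, and the loss compares the derived rates with the true ones. The standard Hénon-Heiles Hamiltonian is assumed; the function names, the sample state, and the deliberately wrong `harmonic` candidate are hypothetical.

```python
def hamiltonian(x, y, px, py):
    """Hénon-Heiles energy: kinetic, harmonic, and cubic coupling terms."""
    return 0.5 * (px**2 + py**2) + 0.5 * (x**2 + y**2) + x**2 * y - y**3 / 3.0

def analytic_flow(state):
    """Hamilton's equations worked out by hand for the Hamiltonian above."""
    x, y, px, py = state
    return (px, py, -x - 2.0 * x * y, -y - x**2 + y**2)

def hamilton_flow(H, state, eps=1e-6):
    """Rates (dx/dt, dy/dt, dpx/dt, dpy/dt) from central differences of a scalar H."""
    def dH(i):
        up, down = list(state), list(state)
        up[i] += eps
        down[i] -= eps
        return (H(*up) - H(*down)) / (2.0 * eps)
    return (dH(2), dH(3), -dH(0), -dH(1))    # dq/dt = +dH/dp, dp/dt = -dH/dq

def hnn_loss(H, state, true_rates):
    """Mean-square error between rates derived from H and the true rates."""
    rates = hamilton_flow(H, state)
    return sum((r - t) ** 2 for r, t in zip(rates, true_rates)) / len(rates)

def euler_step(H, state, dt):
    """Euler update {q, p} <- {q, p} + {dq/dt, dp/dt} dt."""
    return tuple(s + r * dt for s, r in zip(state, hamilton_flow(H, state)))

def harmonic(x, y, px, py):
    """A deliberately wrong candidate Hamiltonian without the cubic terms."""
    return 0.5 * (px**2 + py**2) + 0.5 * (x**2 + y**2)

state = (0.1, -0.2, 0.3, 0.05)               # sample (x, y, px, py)
true_rates = analytic_flow(state)
loss_exact = hnn_loss(hamiltonian, state, true_rates)   # essentially zero
loss_wrong = hnn_loss(harmonic, state, true_rates)      # penalizes the missing terms
next_state = euler_step(hamiltonian, state, 1e-3)
```

Because the rates are always derived from a single scalar, any candidate H yields a flow that obeys Hamilton's equations by construction; training only has to find the right scalar.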

Pendulum Clock from Video
As a final real-world example, we video-recorded a wall-hanging pendulum clock, tracked the ends of its compound pendulum, and extracted its angles and angular velocities at equally spaced times [46]. Engineered to be nearly Hamiltonian, the pendulum's Graham escapement periodically interrupts the fall of its weight as gravity compensates dissipation. We trained Hamiltonian neural networks to forecast its phase space orbit, as summarized by Fig. 7. Once again, meta-learning proves advantageous.

Analysis
To understand how mixed activation functions outperform homogeneous neuronal populations, we estimate the change in the dimensionality of the network activations. Start by constructing a neuronal activity data matrix X with N rows corresponding to N neurons in the hidden layer and M columns representing inputs. Each matrix element X_ij represents the activity of the i-th neuron at the j-th input. Center the activity so ⟨X⟩ = 0. Construct the neural covariance matrix $C = M^{-1} X X^T$, which indicates how pairs of neurons vary with respect to each other, and compute the participation ratio

$$R = \frac{(\operatorname{tr} C)^2}{\operatorname{tr} C^2} = \frac{\left(\sum_{n=1}^{N} \lambda_n\right)^2}{\sum_{n=1}^{N} \lambda_n^2},$$

where λ_n are the covariance matrix eigenvalues. If all the variance is in one dimension, say λ_n = δ_n1, then R = 1; if the variance is evenly distributed across all dimensions, so λ_n = λ_1, then R = N. Typically, 1 < R < N, and R corresponds to the number of dimensions needed to explain most of the variance [47]. The normalized participation ratio is r = R/N. Figure 8 plots the joint probability densities ρ(A, r) for multiple realizations of the Fig. 3 MNIST-1D learned diversity neural network and homogeneous competitors. The mix of two neuron types has the best mean accuracy A and normalized participation ratio r, suggesting that more of its neurons are participating when the mix achieves the best MNIST-1D classification. In contrast, homogeneous networks of neurons with popular activation functions have lower accuracy and participation ratios, reflecting their poorer effectiveness.
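The participation ratio is straightforward to compute from an activity matrix. The sketch below follows the construction described above, using the common definition R = (Σλ_n)²/Σλ_n², and checks the two limiting cases; the function name and the synthetic activity matrices are assumptions for illustration.

```python
import numpy as np

def participation_ratio(X):
    """Participation ratio R of an N x M activity matrix (neurons x inputs)."""
    X = X - X.mean(axis=1, keepdims=True)     # center each neuron's activity
    C = X @ X.T / X.shape[1]                  # N x N covariance matrix
    lam = np.linalg.eigvalsh(C)               # real eigenvalues of symmetric C
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)

# Limit 1: every neuron copies one signal, so all variance lies in one dimension.
signal = rng.normal(size=50)
X_copy = np.outer([1.0, 2.0, 3.0], signal)
R_copy = participation_ratio(X_copy)          # close to 1

# Limit 2: four zero-mean, mutually orthogonal, equal-variance neurons.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)             # 8 x 8 Hadamard matrix
X_even = H8[1:5]                              # skip the constant first row
R_even = participation_ratio(X_even)          # close to N = 4

r_even = R_even / X_even.shape[0]             # normalized ratio r = R / N
```

Dividing by N gives the normalized ratio r used on the horizontal axis of Fig. 8, which allows layers of different sizes to be compared.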
To understand the impact of learned diversity on the geometric nature of loss-function minima, we compute the spectrum of the Hessian matrix H = ∇²L, which captures the curvature of the loss function. Since H is a symmetric matrix, all its eigenvalues are real. A purely convex loss function would have a positive semi-definite Hessian everywhere. However, in practice, the loss function is almost always non-convex (with multiple local minima) due to the presence of hidden neuron permutation symmetries [48]. Therefore, understanding how diversity helps training find deeper minima is crucial.
Previous work suggests that flatter minima generalize better to unseen data [49,50]. For the Fig. 3 neural network meta-learning two neuronal activation functions, we find that once training has converged, the resulting minima from diverse neurons are flatter than those from homogeneous ones, as measured by both the trace Tr H of the Hessian and the fraction f of its eigenvalues near zero: Tr H_1 > Tr H_2 > Tr H_12 and f_1 < f_2 < f_12, where the subscripts 1, 2, and 12 label the networks of type-1 neurons, type-2 neurons, and their mix. If steep minima are harder for gradient descent to locate, then the flatter minima engineered and discovered by learned diversity neural networks imply enhanced optimization.
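Both flatness measures can be illustrated on toy quadratic losses with known curvature. In the sketch below, the finite-difference Hessian, the threshold `tol` that defines "near zero", and the two example losses are assumptions; for networks of realistic size one would typically use automatic differentiation or Hessian-vector products instead.

```python
import numpy as np

def numerical_hessian(loss, theta, eps=1e-3):
    """Symmetric central-difference estimate of the Hessian of loss at theta."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = H[j, i] = (loss(theta + ei + ej) - loss(theta + ei - ej)
                                 - loss(theta - ei + ej) + loss(theta - ei - ej)
                                 ) / (4.0 * eps**2)
    return H

def flatness(loss, theta_star, tol=1e-3):
    """Return (Tr H, fraction f of Hessian eigenvalues with magnitude below tol)."""
    H = numerical_hessian(loss, theta_star)
    lam = np.linalg.eigvalsh(H)               # real because H is symmetric
    return np.trace(H), np.mean(np.abs(lam) < tol)

A_SHARP = np.diag([10.0, 5.0, 1.0])           # curved in every direction
A_FLAT = np.diag([1.0, 0.0, 0.0])             # flat in two of three directions

def sharp_loss(theta):
    return 0.5 * theta @ A_SHARP @ theta

def flat_loss(theta):
    return 0.5 * theta @ A_FLAT @ theta

theta_star = np.zeros(3)                      # shared minimum of both toy losses
trace_sharp, f_sharp = flatness(sharp_loss, theta_star)
trace_flat, f_flat = flatness(flat_loss, theta_star)
```

The flatter toy minimum has the smaller trace and the larger near-zero fraction, which is the same ordering reported above for the mixed network relative to its pure components.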
Stochastic processes can provide additional insights. Optimizing a neural network by randomly shuffling training data is like a noisy descent to a minimum in a potential landscape, as in Fig. 9. The landscape is the network's cost or loss as a function of its weights and biases, and its shape depends on the neuron activation functions. The effective dynamics is that of an overdamped particle buffeted by noise sliding on a complicated potential with many local minima. The Langevin equation with noise intensity D = (η/B) L(θ) H(θ*) describes the evolution of the weights and biases θ = {W_ij, b_i} in a valley with local minimum θ*, where η is the learning rate and B is the training batch size [51][52][53][54]. The drift term with dt includes minus the gradient of the loss function L, and the Brownian motion noise term with dW_t includes the learning rate η. The noise aligns with the Hessian near a minimum, and the Eq. 8 Hessian dependence ensures that stochastic gradient descent escapes multiple sharp minima via directions corresponding to large Hessian eigenvalues and eventually converges to a flatter minimum.
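The noisy-descent picture can be sketched with a one-dimensional overdamped Langevin equation integrated by the Euler-Maruyama method. This is a simplified stand-in, not the state-dependent dynamics described above: the double-well potential V(θ) = θ⁴/4 − θ²/2, the constant scalar noise intensity D, and the √(2D) normalization of the noise term are assumptions chosen for illustration.

```python
import math
import random

def grad_V(theta):
    """Gradient of the toy double-well potential V = theta**4 / 4 - theta**2 / 2."""
    return theta**3 - theta

def langevin_path(theta0, D, dt=0.01, steps=2000, seed=0):
    """Euler-Maruyama integration of d(theta) = -V'(theta) dt + sqrt(2 D) dW."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        noise = math.sqrt(2.0 * D * dt) * rng.gauss(0.0, 1.0)
        theta += -grad_V(theta) * dt + noise
    return theta

# Without noise the state simply slides into the nearest minimum at theta = 1.
noiseless = langevin_path(0.5, D=0.0)

# With noise, runs starting on the barrier top at theta = 0 can settle into
# either well, depending on the noise realization selected by the seed.
endpoints = [langevin_path(0.0, D=0.05, seed=s) for s in range(20)]
```

In this toy, D is a fixed number; in the description above the noise intensity depends on the loss and on the Hessian, which is what biases the search toward flatter minima.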

Conclusions
Biomimetic engineering or biomimicry is design inspired by nature. Just as monoculture crops can be fragile, while diverse crops can be robust [55], heterogeneous neural networks can outperform homogeneous ones. Here, we highlight advantages of varying activation functions within each layer and learning the best variation by replacing activations by sub-networks. Conceptually, learned diversity neural networks discover novel sets of activation functions, whereas most artificial neural networks use just one of a small number of conventional activations per layer. Practically, mixes of learned activations can outperform traditional activations (where even a 1% improvement can be significant), and the learned activations can be efficiently reused in diverse neural networks. Additionally, learned diversity can even improve already enhanced physics-informed neural networks like Hamiltonian neural networks [43,56]. Future work includes optimizing learned diversity by adjusting hyperparameters, applying learned diversity to a wider range of regression and classification problems, testing the diverse neural networks for robustness [57], investigating clustering of learned activations, and applying learned diversity to different neural network architectures, such as recurrent neural networks and reservoir computers [58][59][60].
Learned diversity offers neural networks sets of tailored basis functions, which enhance their expressiveness and adaptability and facilitate efficient function approximation. When given the ability to learn their neuronal activation functions, neural networks discover heterogeneous arrangements of nonlinear neuronal activations that can outperform their homogeneous counterparts with the same training. Our work provides specific examples of dynamical systems that spontaneously select diversity over uniformity, and thereby furthers our understanding of diversity and its role in strengthening natural and artificial systems.

Figure 1. Progression from conventional artificial neural network to diverse neural network to learned diverse neural network. Line thicknesses represent weights W, circle thicknesses represent biases b, and sketches inside circles represent activation functions σ.

Figure 2. Schematic stochastic gradient descent meta-learning nested loops. Neural-network weights and biases θ adjust to lower losses L(θ, θ_s) during an inner loop, while periodically the sub-network weights θ_s open extra dimensions and themselves adjust to allow even lower losses during an outer loop. Rainbow colors code time t.

Algorithm 1. Meta-learning activation functions σ_n(·) as sub-networks of network f(·), where x ∈ X are training inputs, y ∈ Y are training outputs, ŷ are network outputs, R are learning rates, L are losses, N are numbers of iterations, N_T is the number of neuron types, and θ = {W, b} are weights and biases. Subscript I indicates the inner loop, which updates the network, and subscript O indicates the outer loop, which updates the sub-networks. Network weights and biases update N_I |X| times in the inner loop, while sub-network weights and biases update N_O times in the outer loop.

Figure 3. Meta-learning 2 activations for MNIST-1D classification. Left: Example MNIST-1D digit construction, rotated 90° to emphasize the one-dimensionality of the digits. Center: Activation functions σ_n(a) evolve from a base sinusoid, with violet-to-red rainbow colors encoding time t. Right: Box and whisker plots summarize the distribution (including median, quartiles, and extent) of validation accuracy A for fully connected neural networks of 100 ReLU neurons (blue), type-1 neurons (yellow), type-2 neurons (orange), and a mix of type-1 and type-2 neurons (red). The mix of 2 neuron types outperforms any single neuron type on average.

Figure 4. Neural network MNIST-1D classification accuracy as a function of network size. Box plots summarize the accuracy distribution (including median, quartiles, extent, and outliers) for 100 initializations. Learning rate is optimized to avoid over-fitting but is the same for all network sizes. Activation functions evolved from zero (the null function), with similar results evolved from sine. Mixed networks of 2 neuron types outperform pure networks on average for all sizes and outperform both single learned activations and traditional activations.

Figure 5. Meta-learning 2 activations for nonlinear regression or forecasting of the van der Pol oscillator. Left: Typical orbit is attracted to a limit cycle, where rainbow colors code time t. Center: Activation functions σ_n(a) evolve from a base sinusoid. Right: Box plots summarize the distribution of neural network mean-square-error validation loss L, starting from 50 random initializations of weights and biases, for fully connected neural networks of sine neurons (blue), type-1 neurons (yellow), type-2 neurons (orange), and a mix of type-1 and type-2 neurons (red). The mix of 2 neuron types outperforms any single neuron type on average.

Figure 6. Meta-learning 2 activations for nonlinear regression or forecasting of Hénon-Heiles orbits. Top: Regular and chaotic, low- and high-energy Hénon-Heiles orbits, where rainbow colors code time. Bottom Left: Conventional and Hamiltonian neural networks learn activation functions from base sinusoids. Bottom Right: Box plots summarize distributions of mean-square-error validation losses L, starting from 50 random initializations of weights and biases, for fully connected neural networks. Hamiltonian neural networks greatly outperform conventional neural networks, and heterogeneous neuron types consistently outperform their homogeneous components on average.

Figure 7. Meta-learning 2 activations for forecasting a real pendulum clock engineered to be almost Hamiltonian. Left: Falling weight (not shown) drives a wall-hanging pendulum clock. Center: State space flow from video data is nearly elliptical. Right: Box plots summarize the distribution of neural network mean-square-error validation loss L, starting from 50 random initializations of weights and biases, for fully connected neural networks of sine neurons (blue), type-1 neurons (yellow), type-2 neurons (orange), and a mix of type-1 and type-2 neurons (red). Meta-learning diversity is a winning strategy.

Figure 8. Probability densities ρ(A, r) versus accuracy A and normalized participation ratio r = R/N for multiple realizations of the Fig. 3 MNIST-1D heterogeneous network and three homogeneous networks with popular activation functions: hyperbolic tangent, Rectified Linear Unit f(x) = max(0, x), and sine. Increased participation accompanies increased accuracy, with the diverse network maximizing both.

Figure 9. Noisy descent. Rainbow colors code time t as the state point wanders to different local minima of the potential landscape V from the same initial conditions under multiple realizations of the same noise.