Abstract
Machine learning influences numerous aspects of modern society, empowers new technologies, from AlphaGo to ChatGPT, and increasingly materializes in consumer products such as smartphones and self-driving cars. Despite the vital role and broad applications of artificial neural networks, we lack systematic approaches, such as network science, to understand their underlying mechanism. The difficulty is rooted in many possible model configurations, each with different hyperparameters and weighted architectures determined by noisy data. We bridge the gap by developing a mathematical framework that maps the neural network’s performance to the network characters of the line graph governed by the edge dynamics of stochastic gradient descent differential equations. This framework enables us to derive a neural capacitance metric to universally capture a model’s generalization capability on a downstream task and predict model performance using only early training results. The numerical results on 17 pretrained ImageNet models across five benchmark datasets and one NAS benchmark indicate that our neural capacitance metric is a powerful indicator for model selection based only on early training results and is more efficient than state-of-the-art methods.
Introduction
Deep neural networks (DNNs) have emerged as a crucial component of artificial intelligence (AI) and have successful applications in various domains, including computer vision, natural language processing, speech recognition, robotics, and more^{1,2,3,4}. Despite these remarkable achievements, neural networks are often criticized as black boxes and remain challenging to comprehend due to their nonlinear and complex nature^{5}. A growing body of research is developing more interpretable DNN architectures, such as those based on attention mechanisms or interpretable features^{6,7,8}. Nevertheless, neural network training is complex and affected by various factors such as noisy training data, neural architecture, loss function, and optimization algorithms, so uncovering the black box of DNNs remains a critical challenge^{9,10}.
The training process is an iterative update of the synaptic connection weights^{11,12}. The straightforward way is to model the process as a discrete dynamical system, which provides a theoretical foundation for analyzing convergence rates and generalization error bounds^{13,14,15,16}. However, existing approaches have primarily focused on the macroscopic and collective behavior of neurons in neural networks^{17,18,19}, without explicitly examining the individual interactions between trainable weights or synaptic connections and their coevolution during training.
Transfer learning is a widely used and effective technique in deep learning that leverages pretrained models to solve numerous complex problems. One application is the large language model ChatGPT, which is well-versed in using transfer learning for question answering^{20,21}. However, selecting the optimal pretrained model for a given task remains challenging because thoroughly training each candidate is computationally expensive and time-consuming, creating an urgent need for an efficient predictive measure based only on early training results.
A comprehensive understanding of neural dynamics is the critical step to addressing these challenges, ultimately leading to optimal neural network design. We fill the gap by adopting a microscopic perspective to investigate the edge dynamics of synaptic connections induced by stochastic gradient descent (SGD)^{11} through differential equations. The proposed approach forms an associated network of edges and models neural network training as a networked dynamical system over these edges. However, solving the nonlinear networked edge dynamics poses significant computational challenges, given the millions of weights in convolutional neural networks, such as MobileNet^{22} (16 million weights) and VGG16^{1} (528 million weights). To overcome this limitation, we use the network reduction approach (GBB reduction) proposed by Gao et al. to decouple the neural network system, which enables us to map the neural network’s performance to its network characters^{23,24}. Our analysis advances several critical problems in AI, such as learning curve prediction, model selection, and zero-shot learning. Specifically, our universal approach significantly improves the relative ranking prediction of pretrained models by 9.1% to 65.3% using early training statistics from as few as five epochs. These findings demonstrate the effectiveness of our framework in finding the best predictive model and have significant implications for neural network architecture design and search in various applications.
Results
Map from a neural network to an associated graph of edges
The critical step is to map an artificial neural network to a networked dynamical system so that we can use the corresponding approaches to analyze it. We built a mapping scheme ϕ: G_{A} ↦ G_{B}, from a neural network G_{A} to an associated graph G_{B}. The topology of the edges (synaptic connections) follows a well-defined line graph proposed by Nepusz and Vicsek^{25}, and nodes of G_{B} are edges of G_{A}. More precisely, each node in G_{B} is associated with a trainable parameter in G_{A}. For an MLP, each edge has a trainable weight, and the edge set of G_{A} is also the synaptic connection of G_{B}. For a CNN, this one-to-one mapping from neurons on layer ℓ to layer ℓ + 1 is replaced by a one-to-many mapping because of weight sharing, e.g., a parameter in a convolutional filter is repeatedly used in forward propagation and associated with multiple pairs of neurons from the two neighboring layers. Since the error gradients flow in a reversed direction, we reverse the corresponding links of the proposed line graph for G_{B}. Specifically, given any pair of nodes in G_{B}, if they share an associated intersection neuron in FP propagation routes, a link with a reversed direction will be created for them. Fig. 1a demonstrates the mapping of an example MLP. We have the topology of G_{B} in place, but the weights of links in G_{B} are not yet specified. To make up for these missing components, we reveal the interactions of synaptic connections from SGD, quantify the interaction strengths and then define the weights of links in G_{B} accordingly (see Methods section for detailed derivation).
Figure 1b shows how to use our approach to predict the performance of a pretrained neural network model based on transfer learning. The output layer of each pretrained model is replaced with a three-layer neural capacitance probe (NCP) unit with (1) a dense layer of size 256 and (2) a dense layer of size 128. Each of these dense layers follows (3) a batch normalization^{26}, and (4) is followed by a dropout layer with a dropout probability of 0.4. Before fine-tuning, we initialize the NCP unit using Kaiming Normal initialization^{27}. See Supplementary Note 3 for details about the three layers in NCP.
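The mapping for an MLP can be illustrated with a minimal sketch. It assumes a fully connected network described only by its layer sizes; the function names `mlp_edges` and `reversed_line_graph` are hypothetical and not taken from the authors' code. Each synaptic connection of G_{A} becomes a node of G_{B}, and a link is drawn from a downstream connection to an upstream one whenever the two share an intersection neuron.

```python
from itertools import product


def mlp_edges(layer_sizes):
    """List every synaptic connection of an MLP as (layer, source, target)."""
    edges = []
    for layer in range(len(layer_sizes) - 1):
        pairs = product(range(layer_sizes[layer]), range(layer_sizes[layer + 1]))
        for src, dst in pairs:
            edges.append((layer, src, dst))
    return edges


def reversed_line_graph(layer_sizes):
    """Build G_B: nodes are edges of G_A, links follow the error-gradient flow."""
    nodes = mlp_edges(layer_sizes)
    links = []
    for up in nodes:
        for down in nodes:
            # `up` ends at the neuron where `down` starts: a shared intersection neuron.
            if down[0] == up[0] + 1 and down[1] == up[2]:
                links.append((down, up))  # reversed with respect to forward propagation
    return nodes, links


# Illustrative MLP with 2 inputs, 3 hidden neurons and 1 output.
nodes, links = reversed_line_graph([2, 3, 1])
print(len(nodes), len(links))  # prints "9 6"
```

The link weights are deliberately left out here, since they are defined later from the SGD dynamics.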
Neural network model selection with the neural capacitance β_{eff}(t)
We evaluate 17 pretrained ImageNet models implemented in Keras^{28}, including AlexNet, VGGs (VGG16 & 19), ResNets (ResNet50, 50V2, 101, 101V2, 152, 152V2), DenseNets (DenseNet121, 169, 201), MobileNets (MobileNet & MobileNetV2), Inceptions (InceptionV3 & InceptionResNetV2) and Xception, to measure the performance of our approach. Furthermore, we used four benchmark datasets, CIFAR10, CIFAR100, SVHN, Fashion MNIST of size 32 × 32 × 3, and one Kaggle challenge dataset, Birds of size 224 × 224 × 3, and split the original train/test. Also, 15K original training samples are set aside to validate our approach on each dataset. We set a batch size of 64 and a learning rate of 0.001, fine-tuning each modified pretrained model for T = 50 epochs. As shown in Algorithm 1, the NCP does not involve fine-tuning and is merely used to calculate the neural capacitance β_{eff}(t), which varies as the number of epochs t changes. To keep the notation succinct, we use β_{eff} to represent β_{eff}(t). According to Theorem 1 (see Methods section on the property of the neural capacitance), when the model converges, β_{eff} → 0. Indirectly, the model’s predictability can be determined by the relation between the training β_{eff} and the validation accuracy I. Since both β_{eff} and I are available during fine-tuning, we collect a set of data points of these two in the early phase as the observations and fit a regularized linear model I = h(β_{eff}; θ) with Bayesian ridge regression^{29}, where θ are the associated coefficients. The estimated predictor I = h(β_{eff}; θ^{*}) predicts the final accuracy of a model by setting β_{eff} = 0, i.e., I^{*} = h(0; θ^{*}); see Fig. 1c and an example in row 3 of Fig. 2. One can either retain or remove the NCP and fine-tune the selected model to fully train the best model.
To control the randomness, we repeat the fine-tuning experiments 20 times for each model and analyze the average result. As shown in Fig. 2, the pretrained models have converged after fine-tuning on CIFAR10. For each model, we collect the validation accuracy (blue stars in row 1) and β_{eff} on the training set (green squares in row 2) during the early stage of fine-tuning as the observations (e.g., green squares in row 3 marked by the green box for five epochs), then use these observations to predict the test accuracy unseen before the fine-tuning terminates. The blue lines are estimated h( ⋅ ; θ), the true test accuracy at T and the predicted accuracy are marked as red triangles and blue stars, respectively. Both the estimates and predictions are accurate. For better illustration, learning curves are visualized on a log scale.
The relative rank of these candidates is more important than their exact values of predicted accuracy in model selection. Thus, we choose Spearman’s rank correlation coefficient ρ to evaluate and compare different approaches. We calculate ρ over the ground truth test accuracy at epoch T and all pretrained models’ predicted accuracy I^{*}. In Fig. 3a, we report the ground truth and predicted accuracy for each model on CIFAR10, as well as the overall ranking performance measured by ρ. It indicates that β_{eff}-based ranking is reliable with ρ > 0.9. We also report the complete results on all five datasets in Fig. 4. The numerical results indicate that the approach is general for different datasets.
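The extrapolation step I^{*} = h(0; θ^{*}) can be sketched as follows. This is a simplified stand-in, not the authors' pipeline: it swaps Bayesian ridge regression for an ordinary ridge fit with a fixed penalty `lam` on the slope, and the observations are made-up numbers chosen only to illustrate the idea. All function and variable names are assumptions.

```python
def fit_ridge_line(beta, acc, lam=1e-3):
    """Fit acc ~ intercept + slope * beta with an l2 penalty on the slope only."""
    n = len(beta)
    mean_b = sum(beta) / n
    mean_a = sum(acc) / n
    sxy = sum((b - mean_b) * (a - mean_a) for b, a in zip(beta, acc))
    sxx = sum((b - mean_b) ** 2 for b in beta)
    slope = sxy / (sxx + lam)              # closed-form ridge solution for one feature
    intercept = mean_a - slope * mean_b    # the intercept is left unpenalized
    return intercept, slope


def predict_final_accuracy(beta, acc, lam=1e-3):
    """Evaluate the fitted line at beta_eff = 0, i.e. I* = h(0; theta*)."""
    intercept, _ = fit_ridge_line(beta, acc, lam)
    return intercept


# Hypothetical early-epoch observations (five epochs), not data from the paper.
beta_obs = [0.40, 0.30, 0.22, 0.15, 0.10]
acc_obs = [0.70, 0.75, 0.79, 0.825, 0.85]
predicted = predict_final_accuracy(beta_obs, acc_obs)
```

Because the prediction is just the intercept of the fitted line, the estimate depends on how well the early (β_{eff}, I) pairs follow a linear trend.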
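For reference, Spearman’s ρ is the Pearson correlation of the rank-transformed values. The sketch below is a plain standard-library implementation with average ranks for ties; the helper names are assumptions, and in practice a library routine would normally be used instead.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(true_acc, predicted_acc):
    """Rank correlation between ground-truth and predicted accuracies."""
    return pearson(average_ranks(true_acc), average_ranks(predicted_acc))
```

A value of ρ close to 1 means the predicted accuracies order the candidate models the same way as their true test accuracies, even if the predicted values themselves are off.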
The estimation quality of h determines how well the relation between I and β_{eff} is captured. Besides the regression method, the starting epoch t_{0} of the observations also plays a role in the estimation. As shown in Fig. 3b, we evaluate the impact of t_{0} on ρ of our approach. As expected, when fixing the length of learning curves, a higher t_{0} usually produces a better ρ. Since our ultimate goal is to predict with the early observations, t_{0} should also be constrained to a small value. To make the comparisons fair, we view t_{0} as a hyperparameter, and select it according to the Bayesian information criterion (BIC)^{30}, as shown in row 3 of Fig. 2.
Impact of size of training set
It is important to understand scalability and the performance sensitivity to training set sizes. Thus, we further split the CIFAR10, which has 50K original training and 10K testing samples, into 35K for training and 15K for validation. In studying the dynamics of neural network training, it is essential to understand how varying the training size influences the effectiveness of our approach. We select the first {10,15,20,25,30}K of the original 50K samples as the reduced-size training set and the last 10K samples as the validation set to fine-tune the pretrained models for 50 epochs. As shown in Fig. 3c, a training set as small as 25K achieves performance similar to that obtained with all 35K training samples. This has an important implication for efficient neural network training, because the size of the required training set can be significantly reduced (around 30% in our experiment) while maintaining similar model ranking performance. Note that the true test accuracy used in computing ρ is the test accuracy of the model trained on the 35K training samples, and it is shared by all five cases {10,15,20,25,30}K in our analysis.
Comparison with other approaches
For comparison analysis, we considered two families of predictors: learning curve (LC) based predictors, and transferability measures (TMs) as the baselines. (i) LC predictors. Chandrashekaran and Lane^{31} treated the current LC as an affine transformation of previous LCs. They built an ensemble of transformations employing previous LCs and the first few epochs of the current LC to predict the final accuracy of the current LC. Baker et al.^{32} proposed an SVM-based LC predictor using features extracted from previous LCs, including the architecture information such as the number of layers, parameters, and training techniques such as learning rate and learning rate decay. A separate SVM is used to predict the accuracy of an LC at a particular epoch. Domhan et al.^{33} trained an ensemble of parametric functions that observe the first few epochs of an LC and extrapolate it. Klein et al.^{34} devised a Bayesian neural network to model the functions that Domhan formulated to capture the structure of the LCs more effectively. Wistuba and Pedapat^{35} trained a transfer learning-based predictor on LCs generated from other datasets. It is a neural network-based predictor that leverages architecture and dataset embedding to capture the similarities between the architectures of various models and also the other datasets that it was trained on. (ii) Transferability measures. As an alternative estimation of the final performance of neural network models, some transferability measures (TMs) have been developed^{36,37,38,39,40,41,42,43,44,45,46,47}, and many of them are training-free metrics for assessing the performance of neural networks. Notably, our approach has access to some observations collected from early training, and therefore our prediction mechanism is more similar to the learning curve prediction than those TM-based approaches that are designed as a surrogate of the transferability without fine-tuning or retraining.
In addition to LC-based predictors, we compared our method with training-free NAS methods. The result is shown in the Supplementary Note 8. Direct comparison on the prediction performance (indicated by the ranking correlation) is not desirable since training-free NAS methods do not require training while our proposed method requires training of the model to compute β_{eff}.
We select several LC predictors, namely two heuristic rules, the last-seen value (LSV)^{48} and the best-seen value (BSV), together with BGRN^{32} and CL^{31}, as well as three representative TMs: NCE^{36}, LEEP^{37} and LogME^{38} as the baselines. As shown in Table 1 and Supplementary Fig. S1, using a few observations, e.g., only 5 epochs, our approach can achieve from 9.1% up to 65.3% relative improvements over the best baseline on CIFAR10, SVHN, Fashion MNIST, and Birds. On CIFAR100, NCE achieves marginally better performance than ours with 10 observations. Moreover, since each pretrained model produces one learning curve per run, we also report our ranking performance and the baselines based on learning curves collected in individual runs (Supplementary Fig. S2).
Running time analysis
Our approach is efficient, especially for large and deep neural networks. Unlike the training task, which involves a full FP and BP, i.e., T_{train} = T_{FP} + T_{BP}, computing β_{eff} only requires computing the adjacency matrix P according to Eq.(7) on the NCP unit, \({T}_{{\beta }_{{\rm{eff}}}}={T}_{{\rm{NCP}}}\). Although the computation is complicated, the NCP is lightweight. The computing cost per epoch is comparable to the training time per epoch (see Supplementary Fig. S3). Let \({T}_{{\beta }_{{\rm{eff}}}}=c\times {T}_{{\rm{train}}}\). If c > 1, \({T}_{{\beta }_{{\rm{eff}}}}\) is higher than T_{train}, and vice versa. Considering the required epochs, our approach needs k observations and takes \({T}_{{\rm{ours}}}=k\times {T}_{{\beta }_{{\rm{eff}}}}\). Obtaining the ground-truth final accuracy by running K epochs takes T_{full} = K × T_{train}. If T_{full} > T_{ours}, our β_{eff}-based prediction is cheaper than “just training longer”. This condition reads \(K\times {T}_{{\rm{train}}}-k\times {T}_{{\beta }_{{\rm{eff}}}}=(K-c\times k)\times {T}_{{\rm{train}}} \, > \, 0\), saving us K − c × k training epochs.
We perform a running time analysis of the two tasks with 4 × NVIDIA Tesla V100 SXM2 32GB, and visualize the related times in Supplementary Fig. S3. On average \(c={T}_{{\beta }_{{{{{{{\rm{eff}}}}}}}}}/{T}_{{{{{{{\rm{train}}}}}}}} \, \approx \, 1.3\), computing β_{eff} takes 1.3 times of the training per epoch. But the efforts are paying off, as we can predict the final accuracy by observing only k = 10 of K = 100 full training epochs, T_{ours} is only 13% of T_{full}. When the observations are used for LC prediction, the heuristics directly take one observation (last or best) as the predicted value, so they are mostly computationally cheap but have suboptimal model ranking performances. BGRN and CL require more computational time because both need training a predictor with a set of full learning curves from other models. Our approach also estimates a predictor but does not need any external LCs. Next, we assume that each model only observes k = 5 epochs and conduct a running time analysis of these approaches over LC prediction, including estimating a predictor. As shown in Supplementary Table S1, our approach applies Bayesian ridge regression to efficiently estimate the predictor I = h(β_{eff}; θ), taking comparable time as BGRN, significantly less than CL. Nevertheless, it performs best in model ranking. In contrast, the most expensive CL, does not perform well, sometimes even worse.
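The cost comparison above reduces to simple arithmetic, sketched here with the values quoted in the text (c ≈ 1.3, k = 10, K = 100). The function names are assumptions introduced only for this illustration.

```python
def cost_ratio(c, k, K):
    """T_ours / T_full = (k * c * T_train) / (K * T_train)."""
    return c * k / K


def epochs_saved(c, k, K):
    """Training-epoch equivalents saved: K - c * k."""
    return K - c * k


ratio = cost_ratio(c=1.3, k=10, K=100)    # about 0.13, i.e. 13% of full training
saved = epochs_saved(c=1.3, k=10, K=100)  # about 87 training epochs
```

The prediction stays cheaper than full training as long as c × k < K.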
Discussion
In Network Science, a fundamental objective is to comprehend the functioning of a network based on its structure with broad applications in many fields. This work attempts to advance our understanding of the functioning of artificial neural networks through a grasp of complex networks. Some prior works have explored the neural network SGD training dynamics, regarding the global convergence^{49}, system identification^{50,51}, as well as deep neural network generalization^{52}. For example, Goldt et al.^{53} formulated the SGD dynamics of over-parameterized two-layer neural networks with a set of differential equations. Furthermore, some exciting phenomena^{54} emerge during the early phase of neural network training: trainable sparse subnetworks emerge^{55}, and gradient descent moves into a small subspace^{56}. Moreover, there exists a critical effective connection between layers^{57}. Inspired by the insights gained from studying the neural network training dynamics through a networked dynamical systems lens, we developed a theoretically sound framework for improving neural network model selection.
Our work presents a novel perspective of neural network model selection by directly exploring the dynamical evolution of synaptic connections during neural network training. Our framework reformulates SGD-based neural network training dynamics as an edge dynamics \({{\mathcal{B}}}\) to capture the mutual interaction and dependency of synaptic connections. Accordingly, a networked system is built by converting a neural network G_{A} to a line graph G_{B} with the governing dynamics \({{\mathcal{B}}}\), which induces a definition of the link weights in G_{B}. Moreover, a topological property β_{eff} of G_{B} is developed and shown to be an effective metric in predicting the ranking of a set of pretrained models based on early training results.
There are several important directions that we intend to explore in the future, including: (i) Simplify the adjacency matrix P to capture the dependency and mutual interaction between synaptic connections, e.g., approximate gradients using local information^{58}, (ii) extend the proposed framework to more neural architecture search (NAS) benchmarks^{59,60,61,62} to select the best subnetwork, and (iii) design an efficient algorithm to optimize neural network architectures directly.
Methods
Dimension reduction of networked systems
Real-world complex systems, such as plant-pollinator interactions^{63} and the spread of COVID-19^{64}, are commonly modeled using networks^{65,66}. Consider a network G = (V, E) with nodes V and edges E. Let n = ∣V∣ be the number of nodes in the network. The interactions between nodes can be formulated as a set of differential equations
\[{\dot{x}}_{i}=f({x}_{i})+\sum_{j\in V}{P}_{ij}\,g({x}_{i},{x}_{j}),\quad \forall i\in V, \qquad (1)\]
where x_{i} is the state of node i in the system. For instance, in an ecological network, x_{i} could represent the abundance of a particular species of plant, while in an epidemic network, it could represent the infection rate of a person. The adjacency matrix P encodes the interaction strength between nodes, where P_{ij} is the entry in row i and column j. The functions f( ⋅ ) and g( ⋅ , ⋅ ) capture the internal and external impacts on node i, respectively. Typically, these functions are nonlinear. Let x = (x_{1}, x_{2}, …, x_{n}). For a small network, given an initial state, one can run a forward simulation for an equilibrium state x^{*}, such that \({\dot{x}}_{i}^{*}=f({x}_{i}^{*})+{\sum }_{j\in V}{P}_{ij}g({x}_{i}^{*},{x}_{j}^{*})=0\).
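For a small network, the forward simulation mentioned above can be sketched with an explicit Euler scheme. The choices of f, g, P, step size, and initial state below are illustrative assumptions, not taken from the paper; they were picked so that the iteration settles at an equilibrium.

```python
import math


def rate(P, f, g, x):
    """Right-hand side of the dynamics: f(x_i) + sum_j P_ij * g(x_i, x_j)."""
    n = len(x)
    return [f(x[i]) + sum(P[i][j] * g(x[i], x[j]) for j in range(n)) for i in range(n)]


def simulate_to_equilibrium(P, f, g, x0, dt=0.01, steps=5000):
    """Integrate the coupled equations with explicit Euler steps."""
    x = list(x0)
    for _ in range(steps):
        dx = rate(P, f, g, x)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


def residual(P, f, g, x):
    """Largest |dx_i/dt|; it is near zero at an equilibrium state x*."""
    return max(abs(value) for value in rate(P, f, g, x))


# Hypothetical two-node system: linear self-dynamics and saturating coupling.
P = [[0.0, 0.5], [0.5, 0.0]]
f = lambda x: 1.0 - x
g = lambda xi, xj: math.tanh(xj)
x_star = simulate_to_equilibrium(P, f, g, x0=[0.0, 2.0])
```

The cost of each step grows with the number of nonzero entries of P, which is why this brute-force approach does not scale to systems with millions of states.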
However, when the size of the system goes up to millions or even billions, solving the coupled differential equations becomes a major challenge. The problem can be efficiently addressed by employing a mean-field technique^{23,24}, where a linear operator \({{\mathcal{L}}}_{P}(\cdot )\) is introduced to decouple the system. Specifically, \({{\mathcal{L}}}_{P}\) depends on the adjacency matrix P and is defined as \({{\mathcal{L}}}_{P}({\boldsymbol{z}})=\frac{{{\boldsymbol{1}}}^{T}P{\boldsymbol{z}}}{{{\boldsymbol{1}}}^{T}P{\boldsymbol{1}}}\), where \({\boldsymbol{z}}\in {{\mathcal{R}}}^{n}\). Let δ_{in} = P1 and δ_{out} = 1^{T}P be the in- and out-degrees of nodes. For a weighted G, the degrees are weighted as well. Applying \({{\mathcal{L}}}_{P}(\cdot )\) to δ_{in} gives
\[{\beta }_{{\rm{eff}}}={{\mathcal{L}}}_{P}({{\boldsymbol{\delta }}}_{{\rm{in}}})=\frac{{{\boldsymbol{1}}}^{T}P{{\boldsymbol{\delta }}}_{{\rm{in}}}}{{{\boldsymbol{1}}}^{T}{{\boldsymbol{\delta }}}_{{\rm{in}}}}=\frac{{{\boldsymbol{\delta }}}_{{\rm{out}}}{{\boldsymbol{\delta }}}_{{\rm{in}}}}{{{\boldsymbol{1}}}^{T}{{\boldsymbol{\delta }}}_{{\rm{in}}}}, \qquad (2)\]
which proves to be a powerful metric to measure the resilience of networks, and has been applied to make reliable inferences from incomplete networks^{67,68}. We use it to measure the predictive ability of a neural network, whose training in essence is a dynamical system. For an overview of the related technique, see Supplementary Note 6.
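Given an adjacency matrix P, the metric follows directly from the definitions of \({{\mathcal{L}}}_{P}\), δ_{in} and δ_{out}. The sketch below uses a dense list-of-lists matrix for readability and a hypothetical `eps` argument for the small stabilizing constant in the denominator; a practical implementation for a large, sparse P would use sparse linear algebra instead.

```python
def beta_eff(P, eps=0.0):
    """Compute L_P(delta_in) = (1^T P delta_in) / (1^T P 1) for a square matrix P."""
    n = len(P)
    delta_in = [sum(P[i][j] for j in range(n)) for i in range(n)]   # P 1: row sums
    delta_out = [sum(P[i][j] for i in range(n)) for j in range(n)]  # 1^T P: column sums
    numerator = sum(o * d for o, d in zip(delta_out, delta_in))     # 1^T P delta_in
    denominator = sum(delta_in)                                     # 1^T P 1
    return numerator / (denominator + eps)


# Tiny weighted example: delta_in = [1, 2], delta_out = [2, 1].
example = beta_eff([[0.0, 1.0], [2.0, 0.0]])  # (2*1 + 1*2) / 3 = 4/3
```

For an unweighted regular graph in which every node has degree d, the same formula returns d, which is one way to read β_{eff} as an effective interaction strength.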
Neural network training is a dynamical system
Conventionally, training a neural network is a nonlinear optimization problem. Because of the hierarchical structure of neural networks, the training procedure is implemented by two alternate procedures: forward-propagation (FP) and backpropagation (BP), as described in Fig. 1a. During FP, data goes through the input layer, hidden layers, up to the output layer, which produces the predictions of the input data. The differences between the outputs and the labels of the input data are used to define an objective function \({{\mathcal{C}}}\), a.k.a. the training error function. BP proceeds to minimize \({{\mathcal{C}}}\), in the reverse direction of FP, by propagating the error from the output layer down to the input layer. The trainable weights of synaptic connections are updated accordingly.
Let G_{A} be a neural network, w be the flattened weight vector of G_{A}, and z be the activation values. As a whole, the training of a neural network G_{A} can be described with two coupled dynamics: \({{{{{{\mathcal{A}}}}}}}\) on G_{A}, and \({{{{{{\mathcal{B}}}}}}}\) on G_{B}, where nodes in G_{A} are neurons, and nodes in G_{B} are the synaptic connections. The coupling relation arises from the strong interdependency between z and w: the states z (activation values or activation gradients) of G_{A} are the parameters of \({{{{{{\mathcal{B}}}}}}}\), and the states w of G_{B} are the trainable parameters of G_{A}. If we put the whole training process in the context of networked systems, \({{{{{{\mathcal{A}}}}}}}\) denotes a node dynamics because the states of nodes evolve during FP, and \({{{{{{\mathcal{B}}}}}}}\) expresses an edge dynamics because of the updates of edge weights during BP^{13,69,70}. Mathematically, we formulate the node and edge dynamics based on the gradients of \({{{{{{\mathcal{C}}}}}}}\):
where t denotes the training step. Let \({a}_{i}^{(\ell )}\) be the pre-activation of node i on layer ℓ, and σ_{ℓ}( ⋅ ) be the activation function of layer ℓ. Usually, the output activation function is a softmax. The hierarchical structure of G_{A} exerts some constraints over z for neighboring layers, i.e., \({z}_{i}^{(\ell )}={\sigma }_{\ell }({a}_{i}^{(\ell )}),1\le i\le {n}_{\ell },\forall 1\le \ell < L\) and \({z}_{k}^{(L)}=\exp \{{a}_{k}^{(L)}\}/{\sum }_{j}\exp \{{a}_{j}^{(L)}\},1\le k\le {n}_{L}\), where n_{ℓ} is the total number of neurons on layer ℓ, and G_{A} has L + 1 layers. It also presents a dependency between z and w, e.g., when G_{A} is an MLP without bias, \({a}_{i}^{(\ell )}={{\boldsymbol{w}}}_{i}^{(\ell )T}{{\boldsymbol{z}}}^{(\ell -1)}\), which builds a connection from G_{A} to G_{B}. Clearly, given w, the activation z that satisfies all these constraints is also a fixed point of \({{\mathcal{A}}}\). Meanwhile, an equilibrium state of \({{\mathcal{B}}}\) provides a set of optimal weights for G_{A}.
The metric β_{eff} is a universal metric to characterize different types of networks, including biological neural networks^{71}. Because of the generality of β_{eff}, we analyze how it looks on artificial neural networks, which are designed to mimic the biological counterparts for general intelligence. Therefore, we set up an analog system for the trainable weights. To this end, we build a line graph for the trainable weights, and reformulate the training dynamics in the same form as the general dynamics (Eq. (1)). The reformulated dynamics reveals a simple yet powerful property regarding β_{eff}, which is utilized to predict the final accuracy of G_{A} with a few observations during the early phase of the training.
Quantify the interaction strengths of edges
In SGD, each time a batch of samples is chosen to update w, i.e., \({\boldsymbol{w}}\leftarrow {\boldsymbol{w}}-\alpha {\nabla }_{{\boldsymbol{w}}}{{\mathcal{C}}}\), where α > 0 is the learning rate. When desired conditions are met, training is terminated. Let \({{\boldsymbol{\delta }}}^{(\ell )}={[\partial {{\mathcal{C}}}/\partial {z}_{1}^{(\ell )},\cdots,\partial {{\mathcal{C}}}/\partial {z}_{{n}_{\ell }}^{(\ell )}]}^{T}\in {{\mathcal{R}}}^{{n}_{\ell }}\) (in some literature δ^{(ℓ)} is defined as gradients with respect to a^{(ℓ)}, which does not affect our analysis) be the activation gradients, and \({{\boldsymbol{\sigma }}}_{\ell }^{\prime}={[{\sigma }_{\ell,1}^{\prime},\cdots,{\sigma }_{\ell,{n}_{\ell }}^{\prime}]}^{T}\in {{\mathcal{R}}}^{{n}_{\ell }}\) be the derivatives of activation function σ for layer ℓ, with \({\sigma }_{\ell,k}^{\prime}={\sigma }_{\ell }^{\prime}({a}_{k}^{(\ell )}),1\le k\le {n}_{\ell },1\le \ell \le L\). To understand how the weights W^{(ℓ)} affect each other, we explicitly expand δ^{(ℓ)} and have \({{\boldsymbol{\delta }}}^{(\ell )}={W}^{(\ell+1)T}\big({W}^{(\ell+2)T}\big(\cdots {W}^{(L-1)T}\big({W}^{(L)T}({{\boldsymbol{z}}}^{(L)}-{\boldsymbol{y}})\odot {{\boldsymbol{\sigma }}}_{L-1}^{\prime}\big)\cdots \odot {{\boldsymbol{\sigma }}}_{\ell+2}^{\prime}\big)\odot {{\boldsymbol{\sigma }}}_{\ell+1}^{\prime}\big)\), where ⊙ is the Hadamard product. We find that W^{(ℓ)} is associated with all accessible parameters on downstream layers, and the recursive relation defines a high-order hyper-network interaction^{72} between any W^{(ℓ)} and the other parameters.
With the fact that x ⊙ y = Λ(y)x, where Λ(y) is a diagonal matrix with the entries of y on the diagonal, we have \({{\boldsymbol{\delta }}}^{(\ell )}={W}^{(\ell+1)T}\Lambda ({{\boldsymbol{\sigma }}}_{\ell+1}^{\prime}){{\boldsymbol{\delta }}}^{(\ell+1)}={W}^{(\ell+1)T}\Lambda ({{\boldsymbol{\sigma }}}_{\ell+1}^{\prime}){W}^{(\ell+2)T}\Lambda ({{\boldsymbol{\sigma }}}_{\ell+2}^{\prime})\cdots {W}^{(L-1)T}\Lambda ({{\boldsymbol{\sigma }}}_{L-1}^{\prime}){W}^{(L)T}({{\boldsymbol{z}}}^{(L)}-{\boldsymbol{y}})\). For a ReLU σ_{ℓ}( ⋅ ), \({{\boldsymbol{\sigma }}}_{\ell }^{\prime}\) is binary depending on the sign of the input pre-activation values a^{(ℓ)} of layer ℓ. If \({a}_{i}^{(\ell )}\le 0\), then \({\sigma }_{\ell }^{\prime}({a}_{i}^{(\ell )})=0\), blocking a BP propagation route of the prediction deviations z^{(L)} − y and giving rise to vanishing gradients.
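The recursion \({{\boldsymbol{\delta }}}^{(\ell )}={W}^{(\ell+1)T}\Lambda ({{\boldsymbol{\sigma }}}_{\ell+1}^{\prime}){{\boldsymbol{\delta }}}^{(\ell+1)}\) can be traced on a toy ReLU MLP without bias. The sketch assumes the output error z^{(L)} − y is already available and stores each W^{(ℓ)} as a list of rows of shape n_{ℓ} × n_{ℓ−1}; the helper names and the numbers are illustrative only.

```python
def matvec_transposed(W, v):
    """Return W^T v for a matrix W stored as a list of rows."""
    return [sum(W[r][c] * v[r] for r in range(len(W))) for c in range(len(W[0]))]


def relu_prime(a):
    """ReLU derivative: 1 for positive pre-activations, 0 otherwise."""
    return [1.0 if value > 0 else 0.0 for value in a]


def activation_gradients(weights, pre_acts, output_error):
    """Back-propagate delta for a ReLU MLP.

    weights[l - 1] holds W^(l) for l = 1..L, pre_acts[l - 1] holds a^(l)
    for l = 1..L-1, and output_error is z^(L) - y.
    """
    L = len(weights)
    deltas = {L - 1: matvec_transposed(weights[L - 1], output_error)}
    for layer in range(L - 2, 0, -1):
        gate = relu_prime(pre_acts[layer])                        # sigma'_{layer+1}
        gated = [s * d for s, d in zip(gate, deltas[layer + 1])]  # Lambda(sigma') delta
        deltas[layer] = matvec_transposed(weights[layer], gated)  # W^(layer+1)T (...)
    return deltas


# Toy network with L = 3; the second unit of layer 2 is inactive (a <= 0).
W1 = [[1.0], [1.0]]
W2 = [[1.0, 2.0], [3.0, 4.0]]
W3 = [[1.0, 0.0], [0.0, 1.0]]
deltas = activation_gradients([W1, W2, W3], [[0.3, 0.2], [1.0, -1.0]], [0.5, -0.5])
```

In this example the inactive unit zeroes its entry of δ^{(2)} before it reaches layer 1, which is the blocked propagation route described above.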
We intend to build direct interactions between synaptic connections. This can be done by identifying which units provide direct physical interactions to a given unit and appear on the right-hand side of its differential equation \({{\mathcal{B}}}\) in Eq.(3), and how much such interactions come into play. There are multiple routes to build up a direct interaction between any pair of network weights from different layers, as presented by the product terms in δ^{(ℓ)}. However, the coupled interaction makes it an impossible task, which is well-known as a credit assignment problem^{51,73}. We propose a remedy. The impacts of all the other units on W^{(ℓ)} are approximated by direct, local impacts from W^{(ℓ+1)}, and the others’ contribution as a whole is encoded in the activation gradient δ^{(ℓ+1)}. Moreover, we have the weight gradient (Supplementary Note 1)
\[{{\boldsymbol{\nabla }}}_{{W}^{(\ell )}}=\left[\left({W}^{(\ell+1)T}\left({{\boldsymbol{\delta }}}^{(\ell+1)}\odot {{\boldsymbol{\sigma }}}_{\ell+1}^{\prime}\right)\right)\odot {{\boldsymbol{\sigma }}}_{\ell }^{\prime}\right]{{\boldsymbol{z}}}^{(\ell -1)T},\]
which shows the dependency of W^{(ℓ)} on W^{(ℓ+1)}, and itself can be viewed as an explicit description of the dynamical system \({{{{{{\mathcal{B}}}}}}}\) in Eq.(3). Put it in terms of a differential equation, we have
Because of the mutual dependency of the weights and the activation values, it is hard to make an exact decomposition of the impacts of different parameters on W^{(ℓ)}. But, in the gradient \({{{{{{{\boldsymbol{\nabla }}}}}}}}_{{W}^{(\ell )}}\), W^{(ℓ+1)} presents as an explicit term and contributes the direct impact on W^{(ℓ)}. To capture such direct impact and derive the adjacency matrix P for G_{B}, we apply Taylor expansion on \({{{{{{{\boldsymbol{\nabla }}}}}}}}_{{W}^{(\ell )}}\) and have
which defines the interaction strength between each pair of weights from layer ℓ + 1 to layer ℓ. For a detailed derivation of P on MLP and general neural networks, see Supplementary Notes 2 and 3. Let w = (w_{1}, w_{2}, …) be a flattened vector of all trainable weights of G_{A}. Given a pair of weights w_{i} and w_{j}, one from layer ℓ_{1} and the other from layer ℓ_{2}, the entry P_{ij} is defined according to Eq.(7) if ℓ_{2} = ℓ_{1} + 1; otherwise P_{ij} = 0. Considering the scale of the trainable parameters in G_{A}, P is very sparse. Let W^{(ℓ+1)*} be the equilibrium states (Supplementary Note 3). The training dynamics Eq.(6) is then reformulated into the form of Eq.(1), which gives the edge dynamics \({{\mathcal{B}}}\) for G_{B}:
\[{\dot{w}}_{i}=f({w}_{i})+\sum_{j}{P}_{ij}\,g({w}_{i},{w}_{j}),\]
with \(f({w}_{i})=F({w}_{i}^{*})\) and \(g({w}_{i},{w}_{j})={w}_{j}-{w}_{j}^{*}\). The value of weights at an equilibrium state \(\{{w}_{j}^{*}\}\) is unknown, but it is a constant and does not affect the computing of β_{eff}.
Property of the neural capacitance
According to Eq.(7), we have the weighted adjacency matrix P of G_{B} in place. The matrix P encodes rich information of the network, such as the topology, the weights, the gradients, and the training labels indirectly. Now we quantify the total impact that a trainable parameter (or synaptic connection) receives from itself and the others, corresponding to the weighted indegrees δ_{in} = P1. Applying \({{{{{{{\mathcal{L}}}}}}}}_{P}(\cdot )\) to δ_{in}, we get a “counterpart” metric \({\beta }_{{{{{{{\rm{eff}}}}}}}}={{{{{{{\mathcal{L}}}}}}}}_{P}({{{{{{{\boldsymbol{\delta }}}}}}}}_{{{{{{{\rm{in}}}}}}}})\) to measure the predictive ability of a neural network G_{A}, as the resilience metric (Eq. (2)) does to a general network G. If G_{A} is an MLP, we can explicitly write the entries of P and β_{eff}. For details of how to derive P and β_{eff} of an MLP, see Supplementary Note 2. Moreover, we prove in Theorem 1 below that as G_{A} converges, \({{{{{{{\boldsymbol{\nabla }}}}}}}}_{W}^{(\ell )}\) vanishes, and β_{eff} approaches zero (see Supplementary Note 4).
Theorem 1
Let ReLU be the activation function of G_{A}. When G_{A} converges, β_{eff} = 0.
Note that a small value is added to the denominator of Eq. (2) to avoid a possible 0/0.
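As an illustration, the computation of β_{eff} from P can be sketched in a few lines of NumPy. Eq. (2) is not reproduced in this section, so the sketch assumes that the operator takes the form \(\mathcal{L}_{P}(\boldsymbol{x})=\mathbf{1}^{\top}P\boldsymbol{x}/\mathbf{1}^{\top}P\mathbf{1}\) used in the resilience framework of Gao et al. The function name `beta_eff`, the value of `eps`, and the toy matrix are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def beta_eff(P, eps=1e-12):
    """Sketch of beta_eff = L_P(delta_in), with delta_in = P @ 1.

    ASSUMPTION: L_P(x) = (1^T P x) / (1^T P 1), following the resilience
    framework; Eq. (2) itself is not shown in this section.
    eps is the small value added to the denominator to avoid 0/0.
    """
    P = np.asarray(P, dtype=float)
    ones = np.ones(P.shape[0])
    delta_in = P @ ones                      # weighted in-degrees (row sums)
    numerator = ones @ (P @ delta_in)        # 1^T P delta_in
    denominator = ones @ (P @ ones) + eps    # 1^T P 1, regularized
    return numerator / denominator


# Toy 2 x 2 interaction matrix (made-up values, for illustration only).
P_toy = np.array([[0.0, 1.0],
                  [2.0, 0.0]])
value = beta_eff(P_toy)

# An all-zero P (vanishing interactions) yields 0 rather than NaN,
# thanks to the eps term in the denominator.
value_at_zero = beta_eff(np.zeros((3, 3)))
```

For a real network P is very sparse, so a sparse matrix representation would be the natural choice; the dense array here only keeps the example short.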
Algorithm 1
Implement NCP and compute β_{eff}(t)
Input: Pretrained source model \(\mathcal{F}_{s}=\{\mathcal{F}_{s}^{(1)},\mathcal{F}_{s}^{(2)}\}\) with bottom \(\mathcal{F}_{s}^{(1)}\) and output layer \(\mathcal{F}_{s}^{(2)}\), target dataset D_{t}, maximum epoch T
1: Remove \(\mathcal{F}_{s}^{(2)}\) from \(\mathcal{F}_{s}\) and add on top of \(\mathcal{F}_{s}^{(1)}\) an NCP unit \(\mathcal{U}\) with multiple layers (Fig. 1b)
2: Randomly initialize and freeze \(\mathcal{U}\)
3: Train target model \(\mathcal{F}_{t}=\{\mathcal{F}_{s}^{(1)},\mathcal{U}\}\) by fine-tuning \(\mathcal{F}_{s}^{(1)}\) on D_{t} for T epochs
4: Obtain P from \(\mathcal{U}\) according to Eq. (7)
5: Compute β_{eff} with P according to Eq. (2)
For an MLP G_{A}, it is possible to derive an analytical form of β_{eff}. However, the derivation becomes extremely complicated for a deep neural network with multiple convolutional layers. To realize β_{eff} for deep neural networks of any form, we take advantage of the automatic differentiation implemented in TensorFlow^{74}. Even so, given the number of parameters, it remains computationally prohibitive to calculate β_{eff} for the entire G_{A}.
Therefore, we seek to derive a surrogate from a part of G_{A}. Specifically, we insert a neural capacitance probe (NCP) unit, i.e., we put additional layers on top of the beheaded G_{A} (excluding the original output layer), and estimate the predictive ability of the entire G_{A} using β_{eff} of the NCP unit. We therefore call β_{eff} a neural capacitance.
Bayesian ridge regression
Ridge regression introduces an ℓ_{2}-regularization to linear regression, and solves the problem
\[\min_{\boldsymbol{\theta}}\ \|\boldsymbol{y}-X\boldsymbol{\theta}\|_{2}^{2}+\lambda\|\boldsymbol{\theta}\|_{2}^{2},\qquad (9)\]
where \(X\in \mathbb{R}^{n\times d}\), \(\boldsymbol{y}\in \mathbb{R}^{n}\), \(\boldsymbol{\theta}\in \mathbb{R}^{d}\) is the associated vector of coefficients, and the hyperparameter λ > 0 controls the impact of the penalty term \(\|\boldsymbol{\theta}\|_{2}^{2}\). Bayesian ridge regression introduces uninformative priors over the hyperparameters and estimates a probabilistic model of the problem in Eq. (9). Usually, the ordinary least squares method posits the conditional distribution of y to be Gaussian, i.e., \(p(\boldsymbol{y}\mid X,\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{y}\mid X\boldsymbol{\theta},\sigma^{2}I_{n})\), where σ > 0 is a hyperparameter to be tuned and I_{n} is the n × n identity matrix. Moreover, we assume a spherical Gaussian prior on θ, i.e., \(p(\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{\theta}\mid \mathbf{0},\tau^{2}I_{d})\), where τ > 0 is another hyperparameter to be estimated from the data at hand and I_{d} is the d × d identity matrix. According to Bayes’ theorem, p(θ∣X, y) ∝ p(θ)p(y∣X, θ), and the estimates of the model are made by maximizing the posterior distribution p(θ∣X, y), i.e., \(\arg\max_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}\mid X,\boldsymbol{y})=\arg\max_{\boldsymbol{\theta}}\left[\log \mathcal{N}(\boldsymbol{y}\mid X\boldsymbol{\theta},\sigma^{2}I_{n})+\log \mathcal{N}(\boldsymbol{\theta}\mid \mathbf{0},\tau^{2}I_{d})\right]\), which is a maximum-a-posteriori (MAP) estimation of the ridge regression when λ = σ^{2}/τ^{2}. All of θ, λ, and τ are estimated jointly during the model fitting, and \(\sigma=\tau \sqrt{\lambda}\).
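The correspondence λ = σ^{2}/τ^{2} can be checked with a short worked step. The following is a sketch that uses only the Gaussian likelihood and prior stated above, and collects every term that does not depend on θ into a constant:

\[
\begin{aligned}
-\log p(\boldsymbol{\theta}\mid X,\boldsymbol{y})
&=-\log \mathcal{N}(\boldsymbol{y}\mid X\boldsymbol{\theta},\sigma^{2}I)-\log \mathcal{N}(\boldsymbol{\theta}\mid \mathbf{0},\tau^{2}I_{d})+\text{const}\\
&=\frac{1}{2\sigma^{2}}\|\boldsymbol{y}-X\boldsymbol{\theta}\|_{2}^{2}+\frac{1}{2\tau^{2}}\|\boldsymbol{\theta}\|_{2}^{2}+\text{const}.
\end{aligned}
\]

Multiplying by the positive factor 2σ^{2} does not change the minimizer and gives \(\|\boldsymbol{y}-X\boldsymbol{\theta}\|_{2}^{2}+(\sigma^{2}/\tau^{2})\|\boldsymbol{\theta}\|_{2}^{2}\), which is the ridge objective with λ = σ^{2}/τ^{2}.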
Based on the approach proposed by Tipping^{29} and MacKay^{75} to update the parameters λ and τ, we estimate I = h(β_{eff}; θ) with scikit-learn^{76}. We can summarize the application of Bayesian ridge regression to our framework as follows:

Inputs: {(β_{eff,k}, I_{k})∣k = 1, 2, …, K} is a set of observations, where β_{eff,k} is the proposed metric calculated from the training set, I_{k} is the corresponding validation accuracy, and K is the total number of observations collected from the early stage of the model training.

Output: I − h(β_{eff}; θ) = 0, where θ denotes the fitted parameters of the Bayesian ridge regression.

Prediction: I^{*} = h(0; θ) as per Theorem 1.
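The three steps above can be sketched with scikit-learn's `BayesianRidge`, which is a linear model with default uninformative Gamma hyperpriors. This is a minimal illustration, not the released implementation. The observation values are synthetic and the variable names are hypothetical. A linear form for h is assumed.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Inputs: K early-training observations (beta_eff_k, I_k).
# SYNTHETIC values, invented purely for illustration.
beta_eff_obs = np.array([0.90, 0.70, 0.55, 0.40, 0.30])
val_acc_obs = np.array([0.52, 0.61, 0.68, 0.74, 0.79])

# Output: fit I = h(beta_eff; theta), assuming h is linear.
model = BayesianRidge()
model.fit(beta_eff_obs.reshape(-1, 1), val_acc_obs)

# Prediction: evaluate h at beta_eff = 0, the value that
# Theorem 1 associates with a converged network.
pred_mean, pred_std = model.predict(np.array([[0.0]]), return_std=True)
predicted_final_acc = float(pred_mean[0])
predicted_uncertainty = float(pred_std[0])
```

The `return_std=True` option also returns the standard deviation of the predictive distribution, which gives a rough sense of how far the extrapolation to β_{eff} = 0 can be trusted.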
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Data from this study are publicly available: (1) pretrained ImageNet models in Keras^{28}, (2) benchmark datasets CIFAR10, CIFAR100, SVHN, and Fashion MNIST from Keras, and (3) the Kaggle challenge dataset Birds: https://www.kaggle.com/gpiosenka/100birdspecie.
Code availability
Code is publicly available at https://codeocean.com/capsule/6480460/tree/v1.
Change history
08 August 2024
In this article the hyperlink provided for the capsule in the Code Availability section was incorrect. The original article has been corrected.
References
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Int. Conf. Learning Representation 1, 1–14 (2014).
Weiss, K., Khoshgoftaar, T. M. & Wang, D. A survey of transfer learning. J. Big Data 3, 1–40 (2016).
Jia, Y. et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Adv. Neural Info. Processing Syst. 31, 1–11 (2018).
Guo, X. et al. Deep transfer learning enables lesion tracing of circulating tumor cells. Nat. Commun. 13, 7687 (2022).
Alain, G. & Bengio, Y. Understanding intermediate layers using linear classifier probes. Int. Conf. Learn. Representation 1, 1–4 (2016).
Mnih, V., Heess, N., Graves, A. et al. Recurrent models of visual attention. Adv. Neural Info. Process. Syst. 27, 1–9 (2014).
Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. Int. Conf. Learn. Representations 1, 1–15 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I 13, 818–833 (Springer, 2014).
Wang, H. et al. Deep active learning by leveraging training dynamics. Adv. Neural Info. Processing Syst. 35, 25171–25184 (2022).
Bottou, L. Stochastic gradient descent tricks. In Neural networks: Tricks of the Trade, 421–436 (Springer, 2012).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Mei, S., Montanari, A. & Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115, E7665–E7671 (2018).
Chang, B., Chen, M., Haber, E. & Chi, H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations (2018).
Dogra, A. S. & Redman, W. Optimizing neural networks via Koopman operator theory. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, 2087–2097 (Curran Associates, Inc., 2020).
Feng, Y. & Tu, Y. Phases of learning dynamics in artificial neural networks: in the absence or presence of mislabeled data. Machine Learn.: Sci. Technol. 2, 1–11 (2021).
Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79, 2554–2558 (1982).
Deng, Z. & Zhang, Y. Collective behavior of a small-world recurrent neural system with scale-free distribution. IEEE Trans. Neural Netw. 18, 1364–1375 (2007).
Bau, D. et al. Understanding the role of individual units in a deep neural network. Proc. Natl. Acad. Sci. 117, 30071–30078 (2020).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Info. Processing Syst. 33, 1877–1901 (2020).
Howard, A. G. et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR 1, 1–9 (2017).
Gao, J., Barzel, B. & Barabási, A.-L. Universal resilience patterns in complex networks. Nature 530, 307–312 (2016).
Zhang, H., Wang, Q., Zhang, W., Havlin, S. & Gao, J. Estimating comparable distances to tipping points across mutualistic systems by scaled recovery rates. Nat. Ecol. Evol. 6, 1524–1536 (2022).
Nepusz, T. & Vicsek, T. Controlling edge dynamics in complex networks. Nature Physics 8, 568–573 (2012).
Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456 (PMLR, 2015).
He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034 (2015).
Ketkar, N. Introduction to Keras. In Deep learning with Python, 97–111 (Springer, 2017).
Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. J. Machine Learn. Res. 1, 211–244 (2001).
Friedman, J. et al. The elements of statistical learning, vol. 1 (Springer series in statistics New York, 2001).
Chandrashekaran, A. & Lane, I. R. Speeding up hyperparameter optimization by extrapolation of learning curves using previous builds. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 477–492 (Springer, 2017).
Baker, B., Gupta, O., Raskar, R. & Naik, N. Accelerating neural architecture search using performance prediction. International Conference on Learning Representations 1, 1–19 (2017).
Domhan, T., Springenberg, J. T. & Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence (2015).
Klein, A., Falkner, S., Bartels, S., Hennig, P. & Hutter, F. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, 528–536 (PMLR, 2017).
Wistuba, M. & Pedapati, T. Learning to rank learning curves. In International Conference on Machine Learning, 10303–10312 (PMLR, 2020).
Tran, A. T., Nguyen, C. V. & Hassner, T. Transferability and hardness of supervised classification tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1395–1405 (2019).
Nguyen, C., Hassner, T., Seeger, M. & Archambeau, C. LEEP: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, 7294–7305 (PMLR, 2020).
You, K., Liu, Y., Wang, J. & Long, M. LogME: Practical assessment of pretrained models for transfer learning. In International Conference on Machine Learning, 12133–12143 (PMLR, 2021).
Bolya, D., Mittapalli, R. & Hoffman, J. Scalable diverse model selection for accessible transfer learning. Adv. Neural Info. Processing Syst. 34, 1–12 (2021).
Deshpande, A. et al. A linearized framework and a new benchmark for model selection for finetuning. Computer Vision and Pattern Recognition 1, 1–14 (2021).
Lin, M. et al. Zen-NAS: A zero-shot NAS for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 347–356 (2021).
Mellor, J., Turner, J., Storkey, A. & Crowley, E. J. Neural architecture search without training. In International Conference on Machine Learning, 7588–7598 (PMLR, 2021).
Tanaka, H., Kunin, D., Yamins, D. L. & Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. Adv. Neural Info. Processing Syst. 33, 6377–6389 (2020).
Chen, W., Huang, W., Gong, X., Hanin, B. & Wang, Z. Deep architecture connectivity matters for its convergence: A fine-grained analysis. Adv. Neural Info. Processing Syst. 35, 35298–35312 (2022).
Zhang, Z. & Jia, Z. Gradsign: model performance inference with theoretical insights. In International Conference on Learning Representations (ICLR, 2021).
Li, G., Yang, Y., Bhardwaj, K. & Marculescu, R. ZiCo: Zero-shot NAS via inverse coefficient of variation on gradients. In International Conference on Learning Representations (ICLR, 2023).
Patil, S. M. & Dovrolis, C. Phew: Constructing sparse networks that learn fast and generalize well without training data. In International Conference on Machine Learning, 8432–8442 (PMLR, 2021).
Klein, A., Falkner, S., Springenberg, J. T. & Hutter, F. Learning curve prediction with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings (OpenReview.net, 2017).
Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, 3404–3413 (PMLR, 2017).
Haykin, S. Neural Networks and Learning Machines (Pearson Education India, 2010).
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 1–12 (2020).
Bhardwaj, K., Li, G. & Marculescu, R. How does topology influence gradient propagation and model performance of deep networks with DenseNet-type skip connections? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13498–13507 (2021).
Goldt, S., Advani, M., Saxe, A. M., Krzakala, F. & Zdeborová, L. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019).
Frankle, J., Schwab, D. J. & Morcos, A. S. The early phase of neural network training. Int. Conf. Learning Representations 1, 1–20 (2020).
Frankle, J., Dziugaite, G. K., Roy, D. M. & Carbin, M. Stabilizing the lottery ticket hypothesis. Comput Vision Pattern Recogn 1, 1–19 (2019).
Gur-Ari, G., Roberts, D. A. & Dyer, E. Gradient descent happens in a tiny subspace. Int. Conf. Learning Representations 1, 1–19 (2018).
Achille, A., Rovere, M. & Soatto, S. Critical learning periods in deep networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019 (OpenReview.net, 2019).
Jaderberg, M. et al. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, 1627–1635 (PMLR, 2017).
Ying, C. et al. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning, 7105–7114 (PMLR, 2019).
Dong, X., Liu, L., Musial, K. & Gabrys, B. NATS-Bench: Benchmarking NAS algorithms for architecture topology and size. IEEE Transac. Pattern Anal. Machine Intelligence 7, 3634–3646 (2021).
Zela, A., Siems, J. & Hutter, F. NAS-Bench-1Shot1: benchmarking and dissecting one-shot neural architecture search. In International Conference on Learning Representations 1–12 (ICLR, 2020).
Li, C. et al. HW-NAS-Bench: hardware-aware neural architecture search benchmark. In International Conference on Learning Representations 1–14 (ICLR, 2021).
Waser, N. M. & Ollerton, J. Plant-pollinator interactions: from specialization to generalization (University of Chicago Press, 2006).
Thurner, S., Klimek, P. & Hanel, R. A network-based explanation of why most COVID-19 infection curves are linear. Proc. Natl. Acad. Sci. 117, 22684–22689 (2020).
Mitchell, M. Complex systems: Network thinking. Artificial Intelligence 170, 1194–1212 (2006).
Barabási, A.-L. & Pósfai, M. Network Science (Cambridge University Press, 2016).
Jiang, C., Gao, J. & Magdon-Ismail, M. True nonlinear dynamics from incomplete networks. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 131–138 (2020).
Jiang, C., Gao, J. & Magdon-Ismail, M. Inferring degrees from incomplete networks and nonlinear dynamics. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 3307–3313 (2020).
Poggio, T., Banburski, A. & Liao, Q. Theoretical issues in deep networks. Proc. Natl. Acad. Sci. 117, 30039–30045 (2020).
Poggio, T., Liao, Q. & Banburski, A. Complexity control by gradient descent in deep networks. Nat. Commun. 11, 1–5 (2020).
Shu, P. et al. The resilience and vulnerability of human brain networks across the lifespan. IEEE Trans. Neural Syst. Rehab. Eng. 29, 1756–1765 (2021).
Casadiego, J., Nitzan, M., Hallerberg, S. & Timme, M. Model-free inference of direct network interactions from nonlinear collective dynamics. Nat. Commun. 8, 1–10 (2017).
Whittington, J. C. & Bogacz, R. Theories of error backpropagation in the brain. Trends Cogn. Sci. 23, 235–250 (2019).
Abadi, M. et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), 265–283 (2016).
MacKay, D. J. Bayesian interpolation. Neural Comput. 4, 415–447 (1992).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Machine Learning Res. 12, 2825–2830 (2011).
Acknowledgements
We acknowledge the support of the USA National Science Foundation under grants #2047488 and #2312501, and the Rensselaer-IBM AI Research Collaboration.
Author information
Authors and Affiliations
Contributions
C.J. and Z.H. designed experiments, conducted experiments, collected and analyzed data. T.P. conducted experiments and reported performance for baseline models. P.Y.C. and Y.S. provided valuable insights and expertise in deep learning models. J.G. supervised the project and was the lead writer of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Yuandong Tian, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Jiang, C., Huang, Z., Pedapati, T. et al. Network properties determine neural network performance. Nat Commun 15, 5718 (2024). https://doi.org/10.1038/s41467024480698
DOI: https://doi.org/10.1038/s41467024480698