Introduction

Deep neural networks (DNNs) have emerged as a crucial component of artificial intelligence (AI) and have been applied successfully in various domains, including computer vision, natural language processing, speech recognition, and robotics1,2,3,4. Despite these remarkable achievements, neural networks are often criticized as black boxes and remain challenging to comprehend due to their nonlinear and complex nature5. A growing body of research is developing more interpretable DNN architectures, such as those based on attention mechanisms or interpretable features6,7,8. Nevertheless, neural network training is complex and affected by many factors, such as noisy training data, neural architecture, loss function, and optimization algorithm, so uncovering the black box of DNNs remains a critical challenge9,10.

The training process is an iterative update of the synaptic connection weights11,12. A straightforward way to analyze it is to model the process as a discrete dynamical system, which provides a theoretical foundation for analyzing convergence rates and generalization error bounds13,14,15,16. However, existing approaches have primarily focused on the macroscopic and collective behavior of neurons in neural networks17,18,19, without explicitly examining the individual interactions between trainable weights, or synaptic connections, and their co-evolution during training.

Transfer learning is a widely used and effective technique in deep learning that leverages pre-trained models to solve numerous complex problems; one prominent application is the large language model ChatGPT, which relies on transfer learning for question answering20,21. However, selecting the optimal pre-trained model for a given task remains challenging, because thoroughly training each candidate is computationally expensive and time-consuming, prompting an urgent need for an efficient predictive measure based only on early training results.

A comprehensive understanding of neural dynamics is a critical step toward addressing these challenges, ultimately leading to optimal neural network design. We fill this gap by adopting a microscopic perspective to investigate the edge dynamics of synaptic connections induced by stochastic gradient descent (SGD)11 through differential equations. The proposed approach forms an associated network of edges and models neural network training as a networked dynamical system over these edges. However, solving the nonlinear networked edge dynamics poses significant computational challenges, given the millions of weights in convolutional neural networks such as MobileNet22 (16 MB of weights) and VGG161 (528 MB of weights). To overcome this limitation, we use the network reduction approach (GBB reduction) proposed by Gao et al. to decouple the neural network system, which enables us to map the neural network’s performance to its network characteristics23,24. Our analysis advances several critical problems in AI, such as learning curve prediction, model selection, and zero-shot learning. Specifically, our universal approach improves the relative ranking prediction of pre-trained models by 9.1% to 65.3% using early training statistics from as few as five epochs. These findings demonstrate the effectiveness of our framework in finding the best predictive model and have significant implications for neural network architecture design and search in various applications.

Results

Map from a neural network to an associated graph of edges

The critical step is to map an artificial neural network to a networked dynamical system so that we can use the corresponding approaches to analyze it. We build a mapping scheme ϕ: GA → GB from a neural network GA to an associated graph GB. The topology of the edges (synaptic connections) follows a well-defined line graph proposed by Nepusz and Vicsek25, and the nodes of GB are the edges of GA. More precisely, each node in GB is associated with a trainable parameter in GA. For an MLP, each edge has a trainable weight, so the edge set of GA is exactly the node set of GB. For a CNN, this one-to-one mapping from neurons on layer ℓ to layer ℓ + 1 is replaced by a one-to-many mapping because of weight sharing, e.g., a parameter in a convolutional filter is used repeatedly in forward propagation and associated with multiple pairs of neurons from the two neighboring layers. Since the error gradients flow in the reversed direction, we reverse the corresponding links of the proposed line graph for GB. Specifically, given any pair of nodes in GB, if they share an associated intersection neuron on a forward-propagation (FP) route, a link with the reversed direction is created between them. Fig. 1a demonstrates the mapping for an example MLP. With the topology of GB in place, the weights of the links in GB are not yet specified. To supply these missing components, we reveal the interactions of synaptic connections from SGD, quantify the interaction strengths, and then define the weights of the links in GB accordingly (see the Methods section for a detailed derivation).
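To make the construction concrete, the following sketch builds GB for a small bias-free MLP with networkx. The layer sizes and the brute-force pairing loop are purely illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

# A minimal sketch of phi: G_A -> G_B for a bias-free MLP. Edges of G_A become
# nodes of G_B, and links are reversed to follow the error-gradient flow.
def mlp_to_line_graph(layer_sizes):
    G_B = nx.DiGraph()
    # A node of G_B is an edge (l, i, j) of G_A:
    # neuron i on layer l -> neuron j on layer l + 1
    edges = [(l, i, j)
             for l in range(len(layer_sizes) - 1)
             for i in range(layer_sizes[l])
             for j in range(layer_sizes[l + 1])]
    G_B.add_nodes_from(edges)
    for (l, i, j) in edges:
        for (m, p, q) in edges:
            # The two edges share the intersection neuron j on an FP route;
            # the link is reversed (downstream edge -> upstream edge).
            if m == l + 1 and p == j:
                G_B.add_edge((m, p, q), (l, i, j))
    return G_B

G_B = mlp_to_line_graph([3, 4, 2])   # toy layer sizes, chosen arbitrarily
print(G_B.number_of_nodes(), G_B.number_of_edges())  # 20 nodes, 24 links
```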

Fig. 1: Illustration of our framework.

a An example multilayer perceptron (MLP) GA is mapped to a directed line graph GB, which is governed by an edge dynamics \(\mathcal{B}\). Each node (dichromatic square) of GB is associated with a synaptic connection linking two neurons (in different colors) from different layers of GA. b A diagram of transfer learning from the source domain (left stack) to a target domain (right stack). The pre-trained model is modified by adding additional layers, i.e., installing a neural capacitance probe (NCP) unit, on top of the bottom layers. The NCP is frozen with a set of randomly initialized weights, and only the bottom layers are fine-tuned. c Observed partial learning curves (green line segments) of validation accuracy over the early-stage training epochs and the corresponding neural capacitance metric βeff during fine-tuning. The predicted final accuracy at βeff → 0 (red dot) is used to select the best one from a set of models. The metric βeff relies on GB’s weighted adjacency matrix P, which is itself derived from the reformulation of the training dynamics. To predict the performance, a lightweight βeff of the NCP is used instead of the heavyweight one over the entire network on the right stack of (b).

Figure 1b shows how to use our approach to predict the performance of a pre-trained neural network model based on transfer learning. The output layer of each pre-trained model is replaced with a three-layer neural capacitance probe (NCP) unit consisting of a dense layer of size 256 and a dense layer of size 128, each combined with batch normalization26 and a dropout layer with a dropout probability of 0.4. Before fine-tuning, we initialize the NCP unit using Kaiming Normal initialization27. See Supplementary Note 3 for details about the three layers in the NCP.
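As a concrete illustration, a minimal Keras sketch of this modification might look as follows. The ReLU activations, the softmax output layer, and the freezing logic are our assumptions; the layer sizes, batch normalization, dropout rate, and Kaiming Normal (`he_normal`) initialization follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def add_ncp(bottom, num_classes):
    """Replace the output layer of a pre-trained `bottom` model with an NCP unit."""
    n_bottom = len(bottom.layers)
    x = bottom.output
    for width in (256, 128):
        x = layers.Dense(width, activation="relu",
                         kernel_initializer="he_normal")(x)  # Kaiming Normal init
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.4)(x)
    out = layers.Dense(num_classes, activation="softmax",
                       kernel_initializer="he_normal")(x)
    model = tf.keras.Model(bottom.input, out)
    for layer in model.layers[n_bottom:]:  # freeze the NCP; only the bottom is fine-tuned
        layer.trainable = False
    return model

# Example: graft the NCP onto a Keras ImageNet backbone
base = tf.keras.applications.MobileNet(include_top=False, pooling="avg",
                                       input_shape=(224, 224, 3))
model = add_ncp(base, num_classes=10)
```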

Neural network model selection with the neural capacitance βeff(t)

We evaluate 17 pre-trained ImageNet models implemented in Keras28, including AlexNet, VGGs (VGG16 & 19), ResNets (ResNet50, 50V2, 101, 101V2, 152, 152V2), DenseNets (DenseNet121, 169, 201), MobileNets (MobileNet & MobileNetV2), Inceptions (InceptionV3 & InceptionResNetV2) and Xception, to measure the performance of our approach. Furthermore, we use four benchmark datasets, CIFAR10, CIFAR100, SVHN, and Fashion MNIST of size 32 × 32 × 3, and one Kaggle challenge dataset, Birds of size 224 × 224 × 3, following the original train/test splits. On each dataset, 15K original training samples are set aside to validate our approach. We set a batch size of 64 and a learning rate of 0.001, fine-tuning each modified pre-trained model for T = 50 epochs. As shown in Algorithm 1, the NCP is not fine-tuned and is merely used to calculate the neural capacitance βeff(t), which varies with the epoch t. To keep the notation succinct, we write βeff for βeff(t). According to Theorem 1 (see the Methods section on the property of the neural capacitance), βeff → 0 as the model converges. Indirectly, the model’s predictability can be determined by the relation between the training βeff and the validation accuracy I. Since both βeff and I are available during fine-tuning, we collect a set of data points of these two in the early phase as observations and fit a regularized linear model I = h(βeff; θ) with Bayesian ridge regression29, where θ are the associated coefficients. The estimated predictor I = h(βeff; θ*) predicts the final accuracy of a model by setting βeff = 0, i.e., I* = h(0; θ*); see Fig. 1c and an example in row 3 of Fig. 2. One can then fully train the best model by fine-tuning it with the NCP either retained or removed.
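A minimal sketch of this fit-and-extrapolate step with scikit-learn follows; the model names and observation values are placeholders rather than results from the paper.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def predict_final_accuracy(betas, accs):
    h = BayesianRidge()                          # regularized linear model h( ; theta)
    h.fit(np.asarray(betas).reshape(-1, 1), accs)
    # Theorem 1: beta_eff -> 0 at convergence, so extrapolate to beta_eff = 0
    return h.predict([[0.0]])[0]                 # I* = h(0; theta*)

# Early-epoch observations per candidate model (illustrative numbers)
observations = {
    "ResNet50":  ([0.61, 0.44, 0.30, 0.21, 0.15], [0.71, 0.76, 0.80, 0.82, 0.84]),
    "MobileNet": ([0.58, 0.47, 0.39, 0.33, 0.28], [0.66, 0.70, 0.73, 0.75, 0.76]),
}
ranking = sorted(observations, reverse=True,
                 key=lambda m: predict_final_accuracy(*observations[m]))
print(ranking)  # candidates ordered by predicted final accuracy I*
```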

Fig. 2: Learning curves and βeff of five representative pre-trained models.

The first row shows the accuracy as a function of the epoch t, and the second row shows βeff as a function of the epoch t. A regularized linear model h(  ; θ) (blue curve in row 3) is estimated with Bayesian ridge regression using a few observations of βeff on the training set and of the validation accuracy I during early fine-tuning. The starting epoch t0 of the observations affects the fit of h and is automatically determined according to BIC; the true test accuracy at epoch 50 is predicted as I* = h(0; θ*).

To control for randomness, we repeat the fine-tuning experiments 20 times for each model and analyze the average result. As shown in Fig. 2, the pre-trained models converge after fine-tuning on CIFAR10. For each model, we collect the validation accuracy (blue stars in row 1) and βeff on the training set (green squares in row 2) during the early stage of fine-tuning as observations (e.g., green squares in row 3 marked by the green box for five epochs), then use these observations to predict the test accuracy, which remains unseen until fine-tuning terminates. The blue lines are the estimated h(  ; θ); the true test accuracy at T and the predicted accuracy are marked as red triangles and blue stars, respectively. Both the estimates and the predictions are accurate. For better illustration, the learning curves are visualized on a log scale.

In model selection, the relative rank of the candidates matters more than the exact values of their predicted accuracy. Thus, we choose Spearman’s rank correlation coefficient ρ to evaluate and compare different approaches. We calculate ρ between the ground-truth test accuracy at epoch T and the predicted accuracy I* across all pre-trained models. In Fig. 3a, we report the ground truth and the predicted accuracy for each model on CIFAR10, as well as the overall ranking performance measured by ρ. It indicates that the βeff-based ranking is reliable, with ρ > 0.9. We also report the complete results on all five datasets in Fig. 4. The numerical results indicate that the approach generalizes across datasets.
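For reference, ρ can be computed directly with SciPy; the accuracy values below are placeholders.

```python
from scipy.stats import spearmanr

true_acc = [0.868, 0.930, 0.912, 0.905, 0.941]   # ground truth at epoch T
pred_acc = [0.871, 0.925, 0.917, 0.899, 0.945]   # predicted I* = h(0; theta*)
rho, _ = spearmanr(true_acc, pred_acc)           # rank correlation of the two lists
print(f"Spearman rho = {rho:.2f}")
```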

Fig. 3: The sensitivity analysis of the neural capacitance’s predictive capability.

a Our βeff-based prediction of the validation accuracy versus the true test accuracy at epoch 50 for seven representative pre-trained models. Each shape is associated with one type of pre-trained model; distinct models of the same type are marked in different colors. Because the accuracy of AlexNet is much lower than the others, we exclude it for better visualization. Its predicted accuracy is 0.871 and its true test accuracy is 0.868; if it is included, ρ = 0.93 > 0.92. b Impact of the starting epoch t0 of the observations and (c) the number of training samples on the ranking performance of our βeff-based approach.

Fig. 4: The validation accuracy prediction of pre-trained models on all five datasets.

The validation accuracy predicted from βeff is strongly correlated with the true test accuracy of these models after fine-tuning for T = 50 epochs. Spearman’s rank correlation ρ is used to quantify the performance in model selection. Each shape is associated with one type of pre-trained model; distinct models of the same type are marked in different colors. Note that AlexNet is included in computing each ρ.

The estimation quality of h determines how well the relation between I and βeff is captured. Besides the regression method, the starting epoch t0 of the observations also plays a role in the estimation. As shown in Fig. 3b, we evaluate the impact of t0 on the ρ of our approach. As expected, when the length of the learning curves is fixed, a higher t0 usually produces a better ρ. Since our ultimate goal is to predict from early observations, t0 should also be constrained to a small value. To make the comparisons fair, we treat t0 as a hyper-parameter and select it according to the Bayesian information criterion (BIC)30, as shown in row 3 of Fig. 2.
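A sketch of this BIC-based choice of t0 is given below, assuming Gaussian residuals and a two-parameter linear fit; this is one standard form of the criterion, and the paper's exact variant may differ.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def bic_for_t0(betas, accs, t0):
    """BIC of the linear fit h using observations from epoch t0 onward."""
    X = np.asarray(betas[t0:]).reshape(-1, 1)
    y = np.asarray(accs[t0:])
    n, d = len(y), 2                             # d = slope + intercept
    h = BayesianRidge().fit(X, y)
    rss = np.sum((y - h.predict(X)) ** 2)        # residual sum of squares
    return n * np.log(rss / n + 1e-12) + d * np.log(n)

def select_t0(betas, accs, candidates=range(3)):  # candidate range is illustrative
    return min(candidates, key=lambda t0: bic_for_t0(betas, accs, t0))
```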

Impact of size of training set

It is important to understand the scalability of our approach and its sensitivity to the training set size. CIFAR10 has 50K original training and 10K testing samples; we further split the training data into 35K for training and 15K for validation. To study how the training size influences the effectiveness of our approach, we select the first {10, 15, 20, 25, 30}K of the original 50K samples as reduced-size training sets and the last 10K samples as the validation set, fine-tuning the pre-trained models for 50 epochs. As shown in Fig. 3c, a training set as small as 25K achieves performance similar to that using all 35K training samples. This has an important implication for efficient neural network training, because the size of the required training set can be significantly reduced (around 30% in our experiment) while maintaining similar model-ranking performance. Note that the true test accuracy used in computing ρ is that of the model trained on the 35K training samples and is shared by all five cases {10, 15, 20, 25, 30}K in our analysis.

Comparison with other approaches

For the comparison analysis, we consider two families of baselines: learning curve (LC) based predictors and transferability measures (TMs). (i) LC predictors. Chandrashekaran and Lane31 treated the current LC as an affine transformation of previous LCs; they built an ensemble of transformations from previous LCs and the first few epochs of the current LC to predict its final accuracy. Baker et al.32 proposed an SVM-based LC predictor using features extracted from previous LCs, including architecture information such as the number of layers and parameters, and training settings such as the learning rate and learning rate decay; a separate SVM is used to predict the accuracy of an LC at a particular epoch. Domhan et al.33 trained an ensemble of parametric functions that observe the first few epochs of an LC and extrapolate it. Klein et al.34 devised a Bayesian neural network to model the functions Domhan formulated, capturing the structure of the LCs more effectively. Wistuba and Pedapati35 trained a transfer learning-based predictor on LCs generated from other datasets; it is a neural network-based predictor that leverages architecture and dataset embeddings to capture similarities between the architectures of various models and the datasets it was trained on. (ii) Transferability measures. As alternative estimates of the final performance of neural network models, various TMs have been developed36,37,38,39,40,41,42,43,44,45,46,47, many of which are training-free metrics for assessing the performance of neural networks. Notably, our approach has access to observations collected from early training, so our prediction mechanism is closer to LC prediction than to TM-based approaches, which are designed as surrogates of transferability without fine-tuning or re-training. In addition to LC-based predictors, we compare our method with training-free NAS methods; the results are shown in Supplementary Note 8. A direct comparison of prediction performance (indicated by the ranking correlation) is not appropriate, since training-free NAS methods require no training, whereas our method requires training the model to compute βeff.

We select several LC predictors as baselines, including two heuristic rules, the last-seen value (LSV)48 and the best-seen value (BSV), as well as BGRN32 and CL31, together with three representative TMs: NCE36, LEEP37, and LogME38. As shown in Table 1 and Supplementary Fig. S1, using only a few observations, e.g., 5 epochs, our approach achieves relative improvements of 9.1% up to 65.3% over the best baseline on CIFAR10, SVHN, Fashion MNIST, and Birds. On CIFAR100, NCE achieves marginally better performance than ours with 10 observations. Moreover, since each pre-trained model produces one learning curve per run, we also report our ranking performance and that of the baselines based on learning curves collected in individual runs (Supplementary Fig. S2).

Table 1 A comparison between ours and the baselines in model ranking

Running time analysis

Our approach is efficient, especially for large and deep neural networks. Unlike the training task, which involves a full FP and BP, i.e., Ttrain = TFP + TBP, computing βeff only requires computing the adjacency matrix P according to Eq. (7) on the NCP unit, so \(T_{\beta_{\rm eff}} = T_{\rm NCP}\). Although the computation is involved, the NCP is lightweight, and the computing cost per epoch is comparable to the training time per epoch (see Supplementary Fig. S3). Let \(T_{\beta_{\rm eff}} = c\times T_{\rm train}\); if c > 1, computing βeff per epoch costs more than training, and vice versa. In terms of epochs, our approach needs k observations and takes \(T_{\rm ours} = k\times T_{\beta_{\rm eff}}\), whereas obtaining the ground-truth final accuracy by running K epochs takes \(T_{\rm full} = K\times T_{\rm train}\). If Tfull > Tours, our βeff-based prediction is cheaper than “just training longer": \(K\times T_{\rm train} - k\times T_{\beta_{\rm eff}} = (K - c\times k)\times T_{\rm train} > 0\), saving us the equivalent of K − c × k training epochs.
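As a concrete instance with the measurements reported below (c ≈ 1.3, k = 10 observed epochs, K = 100 full epochs), the break-even condition k < K/c is comfortably met:

$$K - c\times k = 100 - 1.3\times 10 = 87 \ \text{saved epochs}, \qquad \frac{T_{\rm ours}}{T_{\rm full}} = \frac{c\times k}{K} = \frac{13}{100} = 13\%.$$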

We perform a running time analysis of the two tasks with 4 × NVIDIA Tesla V100 SXM2 32GB and visualize the related times in Supplementary Fig. S3. On average, \(c = T_{\beta_{\rm eff}}/T_{\rm train} \approx 1.3\), i.e., computing βeff takes 1.3 times the per-epoch training time. The effort pays off: we can predict the final accuracy by observing only k = 10 of K = 100 full training epochs, so Tours is only 13% of Tfull. When the observations are used for LC prediction, the heuristics directly take one observation (last or best) as the predicted value, so they are computationally cheap but yield sub-optimal model-ranking performance. BGRN and CL require more computational time because both need to train a predictor on a set of full learning curves from other models. Our approach also estimates a predictor but does not need any external LCs. Next, assuming each model observes only k = 5 epochs, we conduct a running time analysis of these approaches for LC prediction, including estimating a predictor. As shown in Supplementary Table S1, our approach applies Bayesian ridge regression to efficiently estimate the predictor I = h(βeff; θ), taking time comparable to BGRN and significantly less than CL; nevertheless, it performs best in model ranking. In contrast, CL, the most expensive approach, does not perform well, sometimes performing even worse than the cheap heuristics.

Discussion

In network science, a fundamental objective is to comprehend the functioning of a network based on its structure, with broad applications in many fields. This work attempts to advance our understanding of the functioning of artificial neural networks through the lens of complex networks. Recently, several works have explored the SGD training dynamics of neural networks, addressing global convergence49, system identification50,51, and deep neural network generalization52. For example, Goldt et al.53 formulated the SGD dynamics of over-parameterized two-layer neural networks with a set of differential equations. Furthermore, some exciting phenomena54 emerge during the early phase of neural network training, such as the emergence of trainable sparse sub-networks55 and the confinement of gradient descent to a small subspace56; moreover, a critical effective connection exists between layers57. Inspired by the insights gained from studying neural network training dynamics through a networked dynamical systems lens, we develop a theoretically sound framework for improving neural network model selection.

Our work presents a novel perspective on neural network model selection by directly exploring the dynamical evolution of synaptic connections during neural network training. Our framework reformulates SGD-based neural network training dynamics as an edge dynamics \(\mathcal{B}\) to capture the mutual interaction and dependency of synaptic connections. Accordingly, a networked system is built by converting a neural network GA to a line graph GB governed by \(\mathcal{B}\), which induces a definition of the link weights in GB. Moreover, a topological property βeff of GB is developed and shown to be an effective metric for predicting the ranking of a set of pre-trained models based on early training results.

There are several important directions that we intend to explore in the future, including: (i) simplifying the adjacency matrix P to capture the dependency and mutual interaction between synaptic connections, e.g., by approximating gradients using local information58; (ii) extending the proposed framework to more neural architecture search (NAS) benchmarks59,60,61,62 to select the best subnetwork; and (iii) designing an efficient algorithm to optimize neural network architectures directly.

Methods

Dimension reduction of networked systems

Real-world complex systems, such as plant-pollinator interactions63 and the spread of COVID-1964, are commonly modeled using networks65,66. Consider a network G = (V, E) with nodes V and edges E. Let n = |V| be the number of nodes in the network; the interactions between nodes can be formulated as a set of differential equations

$$\dot{x}_i = f(x_i) + \sum_{j\in V} P_{ij}\, g(x_i, x_j), \quad \forall i\in V,$$
(1)

where xi is the state of node i in the system. For instance, in an ecological network, xi could represent the abundance of a particular plant species, while in an epidemic network, it could represent the infection rate of a person. The adjacency matrix P encodes the interaction strength between nodes, where Pij is the entry in row i and column j. The functions f(  ) and g(  ,  ) capture the internal and external impacts on node i, respectively; typically, these functions are nonlinear. Let x = (x1, x2, …, xn). For a small network, given an initial state, one can run a forward simulation to obtain an equilibrium state x*, such that \(\dot{x}_i^* = f(x_i^*) + \sum_{j\in V} P_{ij}\, g(x_i^*, x_j^*) = 0\).
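For illustration, a forward simulation of Eq. (1) for a small toy system might look as follows; the choices of f, g, and P are ours, not a system from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

n = 5
rng = np.random.default_rng(1)
P = rng.uniform(0, 1, (n, n))            # toy adjacency / interaction strengths

f = lambda x: -x                         # internal (self) dynamics
g = lambda xi, xj: xj / (1.0 + xj)       # saturating pairwise interaction

def rhs(t, x):
    # dx_i/dt = f(x_i) + sum_j P_ij g(x_i, x_j), Eq. (1)
    return f(x) + np.array([P[i] @ g(x[i], x) for i in range(n)])

sol = solve_ivp(rhs, (0.0, 50.0), rng.uniform(0, 1, n))
x_star = sol.y[:, -1]                    # approximate equilibrium state x*
print("max residual:", np.abs(rhs(0.0, x_star)).max())
```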

However, when the size of the system grows to millions or even billions of nodes, solving the coupled differential equations becomes a significant challenge. The problem can be efficiently addressed by employing a mean-field technique23,24, where a linear operator \(\mathcal{L}_P(\cdot)\) is introduced to decouple the system. Specifically, \(\mathcal{L}_P\) depends on the adjacency matrix P and is defined as \(\mathcal{L}_P(\boldsymbol{z}) = \frac{\boldsymbol{1}^T P\boldsymbol{z}}{\boldsymbol{1}^T P\boldsymbol{1}}\), where \(\boldsymbol{z}\in\mathcal{R}^n\). Let δin = P1 and δout = 1TP be the in- and out-degrees of nodes; for a weighted G, the degrees are weighted as well. Applying \(\mathcal{L}_P(\cdot)\) to δin gives

$$\beta_{\rm eff} = \mathcal{L}_P(\boldsymbol{\delta}_{\rm in}) = \frac{\boldsymbol{1}^T P\,\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}} = \frac{\boldsymbol{\delta}_{\rm out}^T\,\boldsymbol{\delta}_{\rm in}}{\boldsymbol{1}^T\boldsymbol{\delta}_{\rm in}},$$
(2)

which proves to be a powerful metric to measure the resilience of networks, and has been applied to make reliable inferences from incomplete networks67,68. We use it to measure the predictive ability of a neural network, whose training in essence is a dynamical system. For an overview of the related technique, see Supplementary Note 6.
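In matrix form, Eq. (2) is only a few lines of NumPy; the small eps in the denominator anticipates the 0/0 guard mentioned in the Methods section.

```python
import numpy as np

def beta_eff(P, eps=1e-12):
    """Resilience metric of Eq. (2) for a weighted adjacency matrix P."""
    one = np.ones(P.shape[0])
    delta_in = P @ one                    # weighted in-degrees, P 1
    delta_out = one @ P                   # weighted out-degrees, 1^T P
    # beta_eff = delta_out^T delta_in / (1^T delta_in)
    return delta_out @ delta_in / (delta_in.sum() + eps)
```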

Neural network training is a dynamical system

Conventionally, training a neural network is a nonlinear optimization problem. Because of the hierarchical structure of neural networks, the training procedure is implemented by two alternating procedures: forward-propagation (FP) and back-propagation (BP), as described in Fig. 1a. During FP, data goes through the input layer and hidden layers up to the output layer, which produces the predictions for the input data. The differences between the outputs and the labels of the input data define an objective function \(\mathcal{C}\), a.k.a. the training error function. BP proceeds to minimize \(\mathcal{C}\) in the reverse direction of FP, propagating the error from the output layer down to the input layer; the trainable weights of the synaptic connections are updated accordingly.

Let GA be a neural network, w the flattened weight vector of GA, and z the activation values. As a whole, the training of a neural network GA can be described by two coupled dynamics: \(\mathcal{A}\) on GA and \(\mathcal{B}\) on GB, where nodes in GA are neurons and nodes in GB are synaptic connections. The coupling arises from the strong inter-dependency between z and w: the states z (activation values or activation gradients) of GA are the parameters of \(\mathcal{B}\), and the states w of GB are the trainable parameters of GA. In the context of networked systems, \(\mathcal{A}\) denotes a node dynamics, because the states of nodes evolve during FP, and \(\mathcal{B}\) expresses an edge dynamics, because the edge weights are updated during BP13,69,70. Mathematically, we formulate the node and edge dynamics based on the gradients of \(\mathcal{C}\):

$$(\mathcal{A})\quad d\boldsymbol{z}/dt \,\approx\, h_{\mathcal{A}}(\boldsymbol{z}, t; \boldsymbol{w}) = -\nabla_{\boldsymbol{z}}\,\mathcal{C}(\boldsymbol{z}(t)),$$
(3)
$$(\mathcal{B})\quad d\boldsymbol{w}/dt \,\approx\, h_{\mathcal{B}}(\boldsymbol{w}, t; \boldsymbol{z}) = -\nabla_{\boldsymbol{w}}\,\mathcal{C}(\boldsymbol{w}(t)),$$
(4)

where t denotes the training step. Let \(a_i^{(\ell)}\) be the pre-activation of node i on layer ℓ, and σℓ(  ) be the activation function of layer ℓ. Usually, the output activation function is a softmax. The hierarchical structure of GA exerts constraints on z for neighboring layers, i.e., \(z_i^{(\ell)} = \sigma_\ell(a_i^{(\ell)}),\, 1\le i\le n_\ell,\, \forall 1\le \ell < L\) and \(z_k^{(L)} = \exp\{a_k^{(L)}\}/\sum_j \exp\{a_j^{(L)}\},\, 1\le k\le n_L\), where nℓ is the total number of neurons on layer ℓ, and GA has L + 1 layers. It also presents a dependency between z and w, e.g., when GA is an MLP without bias, \(a_i^{(\ell)} = \boldsymbol{w}_i^{(\ell)T}\boldsymbol{z}^{(\ell-1)}\), which builds a connection from GA to GB. Clearly, given w, an activation z satisfying all these constraints is also a fixed point of \(\mathcal{A}\); meanwhile, an equilibrium state of \(\mathcal{B}\) provides a set of optimal weights for GA.

The metric βeff universally characterizes different types of networks, including biological neural networks71. Because of this generality, we analyze how βeff behaves on artificial neural networks, which are designed to mimic their biological counterparts for general intelligence. We therefore set up an analogous system for the trainable weights: to this end, we build a line graph over the trainable weights and reformulate the training dynamics in the same form as the general dynamics (Eq. (1)). The reformulated dynamics reveals a simple yet powerful property of βeff, which is utilized to predict the final accuracy of GA from a few observations during the early phase of training.

Quantify the interaction strengths of edges

In SGD, each time a batch of samples is chosen to update w, i.e., \(\boldsymbol{w}\leftarrow \boldsymbol{w} - \alpha\nabla_{\boldsymbol{w}}\mathcal{C}\), where α > 0 is the learning rate; when the desired conditions are met, training is terminated. Let \(\boldsymbol{\delta}^{(\ell)} = [\partial\mathcal{C}/\partial z_1^{(\ell)}, \cdots, \partial\mathcal{C}/\partial z_{n_\ell}^{(\ell)}]^T \in \mathcal{R}^{n_\ell}\) (in some literature δ(ℓ) is defined as gradients with respect to a(ℓ), which does not affect our analysis) be the activation gradients, and \(\boldsymbol{\sigma}_\ell' = [\sigma_{\ell,1}', \cdots, \sigma_{\ell,n_\ell}']^T \in \mathcal{R}^{n_\ell}\) be the derivatives of the activation function σℓ for layer ℓ, with \(\sigma_{\ell,k}' = \sigma_\ell'(a_k^{(\ell)}),\, 1\le k\le n_\ell,\, 1\le\ell\le L\). To understand how the weights W(ℓ) affect each other, we explicitly expand δ(ℓ) and have \(\boldsymbol{\delta}^{(\ell)} = W^{(\ell+1)T}(\boldsymbol{\sigma}_{\ell+1}' \odot W^{(\ell+2)T}(\boldsymbol{\sigma}_{\ell+2}' \odot \cdots W^{(L-1)T}(\boldsymbol{\sigma}_{L-1}' \odot W^{(L)T}(\boldsymbol{z}^{(L)} - \boldsymbol{y}))\cdots))\), where ⊙ is the Hadamard product. We find that W(ℓ) is associated with all accessible parameters on downstream layers, and the recursive relation defines a high-order hyper-network interaction72 between any W(ℓ) and the other parameters. Using the fact that x ⊙ y = Λ(y)x, where Λ(y) is a diagonal matrix with the entries of y on the diagonal, we have \(\boldsymbol{\delta}^{(\ell)} = W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}')\boldsymbol{\delta}^{(\ell+1)} = W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}')W^{(\ell+2)T}\Lambda(\boldsymbol{\sigma}_{\ell+2}')\cdots W^{(L-1)T}\Lambda(\boldsymbol{\sigma}_{L-1}')W^{(L)T}(\boldsymbol{z}^{(L)} - \boldsymbol{y})\). For a ReLU σℓ(  ), \(\boldsymbol{\sigma}_\ell'\) is binary, depending on the sign of the input pre-activation values a(ℓ) of layer ℓ. If \(a_i^{(\ell)}\le 0\), then \(\sigma_\ell'(a_i^{(\ell)}) = 0\), blocking a BP propagation route of the prediction deviations z(L) − y and giving rise to vanishing gradients.
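The recursion for δ(ℓ) translates directly into code. The sketch below assumes a bias-free ReLU MLP trained with softmax cross-entropy, so the terminal deviation z(L) − y serves as the innermost factor; the list indexing (`Ws[l-1]` holds W(ℓ)) is our convention.

```python
import numpy as np

def activation_gradients(Ws, pre_acts, z_out, y):
    """delta(l) = W(l+1)^T Lambda(sigma'_{l+1}) delta(l+1), applied from the output down.

    Ws:        [W(1), ..., W(L)], W(l) of shape (n_l, n_{l-1})
    pre_acts:  [a(1), ..., a(L)], pre-activations from a forward pass
    z_out, y:  softmax outputs z(L) and one-hot labels
    """
    L = len(Ws)
    deltas = {L - 1: Ws[L - 1].T @ (z_out - y)}  # delta(L-1) = W(L)^T (z(L) - y)
    for l in range(L - 2, 0, -1):
        relu_grad = (pre_acts[l] > 0).astype(float)        # binary sigma'_{l+1}
        deltas[l] = Ws[l].T @ (relu_grad * deltas[l + 1])  # one recursion step
    return deltas
```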

We intend to build direct interactions between synaptic connections. This can be done by identifying which units provide direct physical interactions to a given unit, i.e., appear on the right-hand side of its differential equation \(\mathcal{B}\) in Eq. (4), and quantifying how much such interactions come into play. There are multiple routes that build up a direct interaction between any pair of network weights from different layers, as shown by the product terms in δ(ℓ). However, the coupled interactions make an exact attribution impossible, a difficulty well known as the credit assignment problem51,73. We propose a remedy: the impact of all other units on W(ℓ) is approximated by the direct, local impact from W(ℓ+1), and the contribution of the others as a whole is encoded in the activation gradient δ(ℓ+1). Moreover, we have the weight gradient (Supplementary Note 1)

$$\nabla_{W^{(\ell)}} = \Lambda(\boldsymbol{\sigma}_\ell')\,\boldsymbol{\delta}^{(\ell)}\,\boldsymbol{z}^{(\ell-1)T} = \Lambda(\boldsymbol{\sigma}_\ell')\,W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}')\,\boldsymbol{\delta}^{(\ell+1)}\,\boldsymbol{z}^{(\ell-1)T},$$
(5)

which shows the dependency of W(ℓ) on W(ℓ+1) and can itself be viewed as an explicit description of the dynamical system \(\mathcal{B}\) in Eq. (4). Written as a differential equation, we have

$$\frac{dW^{(\ell)}}{dt} = -\Lambda(\boldsymbol{\sigma}_\ell')\,W^{(\ell+1)T}\Lambda(\boldsymbol{\sigma}_{\ell+1}')\,\boldsymbol{\delta}^{(\ell+1)}\,\boldsymbol{z}^{(\ell-1)T} \,\triangleq\, F(W^{(\ell+1)}).$$
(6)

Because of the mutual dependency of the weights and the activation values, it is hard to make an exact decomposition of the impacts of different parameters on W(ℓ). However, in the gradient \(\nabla_{W^{(\ell)}}\), W(ℓ+1) appears as an explicit term and contributes the direct impact on W(ℓ). To capture this direct impact and derive the adjacency matrix P for GB, we apply a Taylor expansion to \(\nabla_{W^{(\ell)}}\) and obtain

$$P^{(\ell,\ell+1)} = \partial^2\mathcal{C}/\partial W^{(\ell)}\partial W^{(\ell+1)},$$
(7)

which defines the interaction strength between each pair of weights from layer ℓ + 1 to layer ℓ. For a detailed derivation of P on MLPs and general neural networks, see Supplementary Notes 2 and 3. Let w = (w1, w2, …) be a flattened vector of all trainable weights of GA. Given a pair of weights wi and wj, one from layer ℓ1 and the other from layer ℓ2, the entry Pij is defined according to Eq. (7) if ℓ2 = ℓ1 + 1, and Pij = 0 otherwise. Considering the scale of the trainable parameters in GA, P is very sparse. Let W(ℓ+1)* be the equilibrium states (Supplementary Note 3); the training dynamics Eq. (6) is then reformulated into the form of Eq. (1), giving the edge dynamics \(\mathcal{B}\) for GB:

$$\dot{w}_i = f(w_i) + \sum_{j} P_{ij}\, g(w_i, w_j),$$
(8)

with \(f(w_i) = F(w_i^*)\) and \(g(w_i, w_j) = w_j - w_j^*\). The values of the weights at an equilibrium state, \(\{w_j^*\}\), are unknown, but they are constants and do not affect the computation of βeff.

Property of the neural capacitance

According to Eq. (7), we have the weighted adjacency matrix P of GB in place. The matrix P indirectly encodes rich information about the network, such as the topology, the weights, the gradients, and the training labels. Now we quantify the total impact that a trainable parameter (or synaptic connection) receives from itself and the others, corresponding to the weighted in-degrees δin = P1. Applying \(\mathcal{L}_P(\cdot)\) to δin, we get a “counterpart” metric \(\beta_{\rm eff} = \mathcal{L}_P(\boldsymbol{\delta}_{\rm in})\) that measures the predictive ability of a neural network GA, just as the resilience metric (Eq. (2)) does for a general network G. If GA is an MLP, we can write the entries of P and βeff explicitly; for details of how to derive P and βeff for an MLP, see Supplementary Note 2. Moreover, we prove in Theorem 1 below that as GA converges, \(\nabla_{W^{(\ell)}}\) vanishes and βeff approaches zero (see Supplementary Note 4).

Theorem 1

Let ReLU be the activation function of GA. When GA converges, βeff = 0.

Note that a small value is added to the denominator of Eq. (2) to avoid a possible 0/0.

Algorithm 1

Implement NCP and compute βeff(t)

Input: Pre-trained source model \(\mathcal{F}_s = \{\mathcal{F}_s^{(1)}, \mathcal{F}_s^{(2)}\}\) with bottom layers \(\mathcal{F}_s^{(1)}\) and output layer \(\mathcal{F}_s^{(2)}\), target dataset Dt, maximum epoch T

1: Remove \(\mathcal{F}_s^{(2)}\) from \(\mathcal{F}_s\) and add an NCP unit \(\mathcal{U}\) with multiple layers on top of \(\mathcal{F}_s^{(1)}\) (Fig. 1b)

2: Randomly initialize and freeze \(\mathcal{U}\)

3: Train the target model \(\mathcal{F}_t = \{\mathcal{F}_s^{(1)}, \mathcal{U}\}\) by fine-tuning \(\mathcal{F}_s^{(1)}\) on Dt for T epochs

4: Obtain P from \(\mathcal{U}\) according to Eq. (7)

5: Compute βeff from P according to Eq. (2)

For an MLP GA, it is possible to derive an analytical form of βeff; however, it becomes extremely complicated for a deep neural network with multiple convolutional layers. To realize βeff for deep neural networks of any form, we take advantage of the automatic differentiation implemented in TensorFlow74. Considering the number of parameters, it is still computationally prohibitive to calculate βeff for the entire GA.

Therefore, we derive a surrogate from part of GA. Specifically, we insert a neural capacitance probe (NCP) unit, i.e., we put additional layers on top of the truncated GA (with its original output layer removed), and estimate the predictive ability of the entire GA using the βeff of the NCP unit. Hence, we call βeff a neural capacitance.
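A sketch of steps 4 and 5 of Algorithm 1 with TensorFlow's automatic differentiation is given below for a single pair of consecutive NCP weight matrices. The nested `GradientTape`s and the assembly of the cross-Hessian block into a square P follow our reading of Eqs. (7) and (2); details such as taking absolute interaction strengths are assumptions.

```python
import tensorflow as tf

def neural_capacitance(model, w_l, w_next, x, y):
    """beta_eff from P = d2C/dW(l)dW(l+1) (Eq. (7)), reduced via Eq. (2)."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = loss_fn(y, model(x, training=False))
        g = inner.gradient(loss, w_l)              # dC/dW(l)
    block = outer.jacobian(g, w_next)              # d2C/dW(l)dW(l+1)
    m, n = int(tf.size(w_l)), int(tf.size(w_next))
    block = tf.abs(tf.reshape(block, (m, n)))      # assumed: interaction magnitudes
    # Square adjacency matrix of G_B over the flattened weights [W(l); W(l+1)];
    # all other entries are zero by the consecutive-layer rule of Eq. (7).
    P = tf.concat([tf.concat([tf.zeros((m, m)), block], axis=1),
                   tf.zeros((n, m + n))], axis=0)
    one = tf.ones(m + n)
    delta_in = tf.linalg.matvec(P, one)                     # P 1
    delta_out = tf.linalg.matvec(P, one, transpose_a=True)  # 1^T P
    return (tf.tensordot(delta_out, delta_in, 1)
            / (tf.reduce_sum(delta_in) + 1e-12))            # Eq. (2) with 0/0 guard
```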

Bayesian ridge regression

Ridge regression introduces an ℓ2-regularization to linear regression and solves the problem

$$\arg\min_{\boldsymbol{\theta}}\; (\boldsymbol{y} - X\boldsymbol{\theta})^T(\boldsymbol{y} - X\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_2^2,$$
(9)

where \(X\in\mathcal{R}^{n\times d}\), \(\boldsymbol{y}\in\mathcal{R}^n\), \(\boldsymbol{\theta}\in\mathcal{R}^d\) is the associated set of coefficients, and the hyper-parameter λ > 0 controls the impact of the penalty term \(\|\boldsymbol{\theta}\|_2^2\). Bayesian ridge regression introduces uninformative priors over the hyper-parameters and estimates a probabilistic model of the problem in Eq. (9). The ordinary least squares method posits the conditional distribution of y to be Gaussian, i.e., \(p(\boldsymbol{y}|X, \boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{y}|X\boldsymbol{\theta}, \sigma^2 I_n)\), where σ > 0 is a hyper-parameter to be tuned and In is the n × n identity matrix. Moreover, we assume a spherical Gaussian prior on θ, i.e., \(p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \tau^2 I_d)\), where τ > 0 is another hyper-parameter estimated from the data at hand and Id is the d × d identity matrix. According to Bayes’ theorem, p(θ|X, y) ∝ p(θ)p(y|X, θ), and the model estimates are made by maximizing the posterior distribution p(θ|X, y), i.e., \(\arg\max_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}|X, \boldsymbol{y}) = \arg\max_{\boldsymbol{\theta}}\log\mathcal{N}(\boldsymbol{y}|X\boldsymbol{\theta}, \sigma^2 I_n) + \log\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \tau^2 I_d)\), which is a maximum-a-posteriori (MAP) estimation equivalent to ridge regression when λ = σ2/τ2. All of θ, λ, and τ are estimated jointly during model fitting, with \(\sigma = \tau\sqrt{\lambda}\). Based on the approach proposed by Tipping29 and MacKay75 to update the parameters λ and τ, we estimate I = h(βeff; θ) with scikit-learn76. We summarize the application of Bayesian ridge regression to our framework as follows (a minimal code sketch is given after the list):

  • Inputs: \(\{(\beta_{{\rm eff},k}, I_k)\}_{k=1}^{K}\) is a set of observations, where βeff,k is the proposed metric calculated from the training set, Ik is the validation accuracy, and K is the total number of observations collected from the early stage of model training.

  • Output: the fitted relation I = h(βeff; θ), where θ are the fitted parameters of the Bayesian ridge regression.

  • Prediction: I* = h(0; θ*) as per Theorem 1.
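The MAP view above can be checked numerically in a few lines; the observations are synthetic placeholders, and σ2 and τ2 are fixed by hand here rather than estimated jointly as scikit-learn does.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.linspace(0.8, 0.1, 8)                  # early-epoch beta_eff readings (toy)
acc = 0.9 - 0.25 * beta + 0.01 * rng.standard_normal(8)

X = np.column_stack([np.ones_like(beta), beta])  # design matrix [1, beta_eff]
sigma2, tau2 = 1e-4, 1e-2                        # assumed noise / prior variances
lam = sigma2 / tau2                              # lambda = sigma^2 / tau^2

# MAP / ridge closed form: theta = (X^T X + lambda I)^(-1) X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ acc)
print("I* =", theta[0])                          # h(0; theta): intercept at beta_eff = 0
```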

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.