Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks

A theoretical understanding of generalization remains an open problem for many machine learning models, including deep networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. Here, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also describes certain infinitely overparameterized neural networks. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel and data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with simple functions, characterize whether a kernel is compatible with a learning task, and show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks.


Introduction
Learning machines aim to find statistical patterns in data that generalize to previously unseen samples [1]. How well they perform in doing so depends on factors such as the size and the nature of the training data set, the complexity of the learning task, and the inductive bias of the learning machine. Identifying precisely how these factors contribute to the generalization performance has been a theoretical challenge. In particular, a definitive theory should be able to predict generalization performance on real data. Existing theories fall short of this goal, often providing impractical bounds and inaccurate estimates [2,3].
The need for a new theory of generalization is exacerbated by recent developments in deep learning [4]. Experience in the field suggests that larger models perform better [5,6,7], encouraging the training of larger and larger networks, with state-of-the-art architectures reaching hundreds of billions of parameters [7]. These networks work in an overparameterized regime [3,5] with many more parameters than training samples, and are so expressive that they can even fit random noise [2]. Yet, they generalize well, contradicting the conventional wisdom from classical statistical learning theory [1,3,8], according to which overparameterization should lead to overfitting and worse generalization. It must be that overparameterized networks have inductive biases that suit the learning task. Therefore, it is crucial for a theory of generalization to elucidate such biases.
While addressing the full complexity of deep learning is as of now beyond the reach of theoretical study, a tractable, yet practically relevant limit was established by recent work pointing to a correspondence between training deep networks and performing regression with various rotation invariant kernels. In the limit where the width of a network is taken to infinity (the network is thus overparameterized), neural network training with a certain random initialization scheme can be described by ridgeless kernel regression with the Neural Network Gaussian Process kernel (NNGPK) if only the last layer is trained [9,10,11,12], or the Neural Tangent Kernel (NTK) if all the layers are trained [13]. Consequently, studying the inductive biases of kernels arising from the infinite-width limit should give insight into the success of overparameterized neural networks. Indeed, key generalization phenomena in deep learning also occur in kernel methods, and it has been argued that understanding generalization in kernel methods is necessary for understanding generalization in deep learning [14].
Motivated by these connections to deep networks and also by its wide use, in this paper, we present a theory of generalization in kernel regression [15,16,17,18,19]. Our theory is generally applicable to any kernel and contains the infinite-width limit of deep networks as a special case. Most importantly, our theory is applicable to real datasets.
We describe typical generalization performance of kernel regression shedding light onto practical uses of the algorithm, in contrast to the worst case bounds of statistical learning theory [8,18,20,21,22]. In the past, statistical mechanics provided a useful theoretical framework for such typical-case analyses for various algorithms [23,24,25,26,27,28,29,30,31,32]. Here, using the replica method of statistical mechanics [33], we derive an analytical expression for the typical generalization error of kernel regression as a function of 1) the number of training samples, 2) the eigenvalues and eigenfunctions of the kernel, which define the inductive bias of kernel regression, and 3) the alignment of the target function with the kernel's eigenfunctions, which provides a notion of how compatible the kernel is for the task. We test our theory on various real datasets and kernels. Our analytical generalization error predictions fit experiments remarkably well.
Our theory sheds light on various generalization phenomena. We elucidate a strong inductive bias: as the size of the training set grows, kernel regression fits successively higher spectral modes of the target function, where the spectrum is defined by solving an eigenfunction problem [34,35,19,36]. Consequently, our theory can predict which kernels or neural architectures are well suited to a given task by studying the alignment of top kernel eigenfunctions with the target function for the task. Target functions that place most power in the top kernel eigenfunctions can be estimated accurately at small sample sizes, leading to good generalization. Finally, when the data labels are noisy or the target function has components not expressible by the kernel, we observe that generalization error can exhibit non-monotonic behavior as a function of the number of samples, contrary to the common intuition that more data should lead to smaller error. This non-monotonic behavior is reminiscent of the recently described "double-descent" phenomenon [3,5,37,38], where generalization error is non-monotonic as a function of model complexity in many modern machine learning models. We show that the non-monotonicity can be mitigated by increasing the implicit or explicit regularization.
To understand these phenomena better, we present a detailed analytical study of the application of our theory to rotation invariant kernels, motivated by their wide use and relevance for deep learning. Besides the NNGPK and NTK, this class includes many other popular kernels such as the Gaussian, exponential, and Matérn kernels [39,40]. When the data-generating distribution is also spherically symmetric, our theory is amenable to further analytical treatment. Our analyses provide a mechanistic understanding of the inductive bias of kernel regression and the possible non-monotonic behavior of learning curves.

Generalization Error of Kernel Regression from Statistical Mechanics
Kernel regression is a supervised learning problem where one estimates a function from a number of observations. For our setup, let $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^P$ be a sample of $P$ observations drawn from a probability distribution on $\mathcal{X}\times\mathbb{R}$, with $\mathcal{X}\subseteq\mathbb{R}^D$. The inputs $x^\mu$ are drawn from a distribution $p(x)$, and the labels $y^\mu$ are assumed to be generated by a noisy target, $y^\mu = \bar f(x^\mu) + \epsilon^\mu$, where $\bar f$ is square integrable with respect to $p(x)$, and $\epsilon^\mu$ represents zero-mean additive noise with covariance $\langle\epsilon^\mu\epsilon^\nu\rangle = \delta_{\mu\nu}\sigma^2$. The kernel regression problem is
$$f^* = \underset{f\in\mathcal{H}}{\arg\min}\ \sum_{\mu=1}^P \big(f(x^\mu) - y^\mu\big)^2 + \lambda\,\langle f, f\rangle_{\mathcal{H}},$$
where $\lambda$ is the "ridge" parameter, $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) uniquely determined by its reproducing kernel $K(x,x')$ and the input distribution $p(x)$ [41], and $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is the RKHS inner product. The Hilbert norm penalty controls the complexity of $f$. The $\lambda\to 0$ limit is referred to as the kernel interpolation limit, where the dataset is exactly fit: $f^* = \arg\min_{f\in\mathcal{H}}\langle f, f\rangle_{\mathcal{H}}$ subject to $f(x^\mu) = y^\mu$ for $\mu = 1,\dots,P$. We emphasize that in our setting the target function does not have to be in the RKHS. Our goal is to calculate the generalization error, i.e. the mean squared error between the estimator $f^*$ and the ground-truth target $\bar f(x)$, averaged over the data distribution and datasets:
$$E_g = \big\langle (f^*(x) - \bar f(x))^2 \big\rangle_{x\sim p(x),\,\mathcal{D}}.$$
$E_g$ measures, on average, how well the learned function agrees with the target on previously unseen (and seen) data sampled from the same distribution. This problem can be analyzed using the replica method from the statistical physics of disordered systems [33], treating the training set as a quenched disorder. Our calculation is outlined in Methods and further detailed in the Supplementary Information. Here we present our main results.
Our results rely on the Mercer decomposition of the kernel, $K(x,x') = \sum_{\rho=1}^N \eta_\rho\,\phi_\rho(x)\phi_\rho(x')$, in terms of orthogonal eigenfunctions $\{\phi_\rho\}$, which form a complete basis for the RKHS, and eigenvalues $\{\eta_\rho\}$; the number of modes $N$ is typically infinite. For ease of presentation, we assume that all eigenvalues are strictly greater than zero. In Supplementary Notes 1 and 2, we fully address the case with zero eigenvalues. Working with the orthogonal basis set $\psi_\rho(x)\equiv\sqrt{\eta_\rho}\,\phi_\rho(x)$, also called a feature map, we introduce coefficients $\{\bar w_\rho\}$ and $\{w^*_\rho\}$ that represent the target and learned functions respectively: $\bar f(x) = \sum_\rho \bar w_\rho\psi_\rho(x)$ and $f^*(x) = \sum_\rho w^*_\rho\psi_\rho(x)$.
With this setup, we calculate the generalization error of kernel regression for any kernel and data distribution to be (Methods and Supplementary Note 2):
$$E_g = \frac{1}{1-\gamma}\left(\kappa^2\sum_\rho \frac{\eta_\rho\,\bar w_\rho^2}{(\kappa + P\eta_\rho)^2} + \sigma^2\gamma\right), \qquad \kappa = \lambda + \sum_\rho\frac{\kappa\,\eta_\rho}{\kappa + P\eta_\rho}, \qquad \gamma = \sum_\rho\frac{P\eta_\rho^2}{(\kappa + P\eta_\rho)^2}, \tag{4}$$
where $\kappa$ is determined self-consistently from its implicit equation. We note that the generalization error is the sum of a $\sigma$-independent term and a $\sigma$-dependent term, the latter of which fully captures the effect of noise on generalization error. Formally, this equation describes the typical behavior of kernel regression in a thermodynamic limit that involves taking $P$ to infinity. In this limit, variations in kernel regression's performance due to the differences in how the training set is formed, which is assumed to be a stochastic process, become negligible. The precise nature of the limit depends on the kernel and the data distribution. In this work, we consider two different analytically solvable cases and identify natural scalings of $N$ and $D$ with $P$, which in turn govern how the kernel eigenvalues $\eta_\rho$ scale inversely with $P$. We further give the infinite-$P$ limits of Eq. (4) explicitly for these cases. In practice, however, we find that our generalization error formula describes average learning curves very well for finite $P$, even as low as a few samples. We observe that the variance in learning curves due to stochastic sampling of the training set is significant for low $P$, but decays with increasing $P$ as expected.
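For concreteness, Eq. (4) can be evaluated numerically once the eigenvalues, target weights, and noise level are specified: solve the implicit equation for $\kappa$ (for example by fixed-point iteration) and assemble the two terms. The sketch below is our own illustration, not the paper's code; the fixed-point scheme and function names are choices made here for the example.

```python
import numpy as np

def solve_kappa(eigvals, P, lam, n_iter=10000, tol=1e-12):
    """Solve kappa = lam + sum_rho kappa*eta_rho/(kappa + P*eta_rho) by fixed-point iteration."""
    kappa = lam + eigvals.sum()          # start above the fixed point
    for _ in range(n_iter):
        new = lam + np.sum(kappa * eigvals / (kappa + P * eigvals))
        if abs(new - kappa) < tol:
            break
        kappa = new
    return kappa

def theory_Eg(eigvals, w2, P, lam=0.0, sigma2=0.0):
    """Evaluate the generalization error formula for a spectrum eigvals and target powers w2."""
    kappa = solve_kappa(eigvals, P, lam)
    gamma = np.sum(P * eigvals**2 / (kappa + P * eigvals) ** 2)
    signal = kappa**2 * np.sum(eigvals * w2 / (kappa + P * eigvals) ** 2)
    return (signal + sigma2 * gamma) / (1.0 - gamma)

# Example: a power-law spectrum with flat target coefficients
eta = 1.0 / np.arange(1, 201) ** 2
w2 = np.ones_like(eta)
print([round(theory_Eg(eta, w2, P, lam=1e-3, sigma2=0.1), 4) for P in (10, 50, 200)])
```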
We will demonstrate various generalization phenomena that arise from Eq. (4) through simulations and analytical study. One immediate observation is the spectral bias: faster rates of convergence of the error along eigenfunctions corresponding to higher eigenvalues in the noise-free ($\sigma^2 = 0$) limit. The generalization error can be decomposed into a sum of modal errors, $E_g = \sum_\rho \eta_\rho\bar w_\rho^2 E_\rho$, where each normalized mode error $E_\rho$ represents the contribution to the error due to the estimation of the coefficient for eigenfunction $\psi_\rho$ (Methods). The normalized mode errors are ordered according to their eigenvalues (Methods), which implies that modes $\rho$ with large eigenvalues $\eta_\rho$ are learned more rapidly as $P$ increases than modes with small eigenvalues. An important implication of this result is that target functions acting on the same data distribution with higher cumulative power distributions $C(\rho)$, defined as the proportion of target power in the first $\rho$ modes, for all $\rho\ge 1$, will have lower generalization error normalized by total target power, $E_g(P)/E_g(0)$, for all $P$ (Methods). Therefore, $C(\rho)$ provides a measure of the compatibility between the kernel and the target, which we name task-model alignment.
We further note that the target function enters the normalized generalization error only through the combinations $C(\rho) - C(\rho-1) = \eta_\rho\bar w_\rho^2 / \sum_{\rho'}\eta_{\rho'}\bar w_{\rho'}^2$. Hence, the kernel eigenvalues, the cumulative power distribution, and the noise parameter completely specify the normalized generalization error: spectral bias, task-model alignment, and noise explain generalization in kernel regression.
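As a concrete illustration of the alignment measure, the snippet below computes $C(\rho)$ for two hypothetical targets sharing the same spectrum; the example spectra and function names are our own.

```python
import numpy as np

def cumulative_power(eigvals, w2):
    """C(rho): fraction of the total target power sum_rho eta_rho*w_rho^2 in the top rho modes."""
    order = np.argsort(eigvals)[::-1]        # sort modes by decreasing eigenvalue
    power = eigvals[order] * w2[order]
    return np.cumsum(power) / power.sum()

eta = 1.0 / np.arange(1, 101) ** 2
w2_aligned = np.exp(-np.arange(100) / 5.0)   # power concentrated in the leading modes
w2_spread = np.ones(100)                     # power spread over all modes
print(cumulative_power(eta, w2_aligned)[:5])
print(cumulative_power(eta, w2_spread)[:5])
```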
Generalization error can exhibit non-monotonicity, which can be understood through the bias-variance decomposition [38,42,43]. We found that the average estimator is given by $\langle f^*(x;P)\rangle_{\mathcal D} = \sum_\rho \frac{P\eta_\rho}{P\eta_\rho + \kappa}\,\bar w_\rho\,\psi_\rho(x)$, which monotonically approaches the target function as $P$ increases, giving rise to a monotonically decreasing bias (Supplementary Note 2). However, the variance term, arising from the variance of the estimator over possible sampled datasets $\mathcal D$, is potentially non-monotonic as the dataset size increases. Therefore, the total generalization error can exhibit local maxima.

Applications to Real Datasets
Next, we evaluate our theory on realistic datasets and show that it predicts kernel regression learning curves with remarkable accuracy. We further elucidate various heuristic generalization principles.
To apply our theory, we numerically solve the eigenvalue problem Eq. (3) on the dataset (Methods) and obtain the necessary eigenvalues and eigenfunctions. When solved on a finite dataset, Eq. (3) is an uncentered kernel PCA problem (Methods). We use these eigenfunctions (or eigenvectors for finite data) to express our target function, and the resulting coefficients and kernel eigenvalues to evaluate the generalization error.

Fig. 1: Effect of task-model alignment on the generalization of kernel regression. a, b. Projections of digits from MNIST along the top two (uncentered) kernel principal components of the 2-layer NTK for 0s vs. 1s and 8s vs. 9s, respectively. c. Learning curves for both tasks. The theoretical learning curves (Eq. (4), dashed lines) show strong agreement with experiment (dots). d. The kernel eigenspectra for the respective datasets. e. The cumulative power distributions C(ρ). Error bars show the standard deviation over 50 trials.
In our first experiment, we test our theory using a 2-layer NTK [10,13] on two different tasks: discriminating between 0s and 1s, and between 8s and 9s from the MNIST dataset [44]. We formulate each of these tasks as a kernel regression problem by considering a vector target function which takes in digits and outputs one-hot labels. Our kernel regression theory can be applied separately to each element of the target function vector (Methods), and a generalization error can be calculated by adding the error due to each vector component.
We can visualize the complexity of the two tasks by plotting the projection of the data along the top two kernel principal components (Fig. 1a,b). The projection for 0-1 digits appears highly separable compared to 8-9s, and thus simpler to learn to discriminate. Indeed, the generalization error for the 0-1 discrimination task falls more rapidly than the error for the 8-9 task (Fig. 1c). Our theory is in remarkable agreement with experiments. Why is 0-1 discrimination easier for this kernel? Fig. 1d shows that the eigenvalues of the NTK evaluated on the data are very similar for both datasets. To quantify the compatibility of the kernel with the tasks, we measure the cumulative power distribution C(ρ). Even though in this case the data distributions are different, C(ρ) is still informative. Fig. 1e illustrates that C(ρ) rises more rapidly for the easier 0-1 task and more slowly for the 8-9 task, providing a heuristic explanation of why the latter requires a greater number of samples to learn.

Fig. 2: Gaussian RBF kernel regression on MNIST and CIFAR-10 datasets. The kernel is $K(x,x') = e^{-\frac{1}{2D\omega^2}\|x-x'\|^2}$ with kernel bandwidth ω = 0.1, ridge parameter λ = 0.01, and D the size of the images. a. Generalization error $E_g(P)$ when $\sigma^2 = 0$: solid lines are theory (Eq. (4)), dots are experiments. b. Kernel eigenvalues and c. cumulative powers C(ρ) for MNIST and CIFAR-10. d, e. Generalization error when $\sigma^2 = 0.5$ with its bias-variance decomposition for the MNIST and CIFAR-10 datasets, respectively. Solid lines are theory, markers are experiments. Error bars represent standard deviation over 160 trials. Bias and variance are obtained by calculating the mean and variance of the estimator over 150 trials, respectively.
We next test our theory for the Gaussian RBF kernel on the MNIST [44] and CIFAR-10 [45] datasets. Fig. 2a shows excellent agreement between our theory and experiments for both. Fig. 2b shows that the eigenvalues of the Gaussian RBF kernel evaluated on data are similar for MNIST and CIFAR-10. The cumulative powers C(ρ) (Fig. 2c), however, are very different. Placing more power in the first few modes makes learning faster. When the labels have nonzero noise $\sigma^2 > 0$ (Fig. 2d,e), generalization error is non-monotonic with a peak, a feature that has been named "double-descent" [3,37]. By decomposing $E_g$ into the bias and the variance of the estimator, we see that the non-monotonicity is caused solely by the variance (Fig. 2d,e). Similar observations about variance were made in different contexts before [38,42,46].
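For concreteness, the following sketch shows the type of kernel regression experiment that the dots in Fig. 2 represent: sample P training points, solve the ridge-regularized kernel problem, and measure the squared error on held-out data. The synthetic data and helper names are our own stand-ins, not the paper's experiment code.

```python
import numpy as np

def rbf_kernel(X1, X2, omega=0.1):
    """Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 D omega^2))."""
    D = X1.shape[1]
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * D * omega**2))

def kernel_regression_error(X, y, P, lam=0.01, sigma2=0.0, seed=0):
    """Train kernel ridge regression on P random samples and estimate E_g on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[:P], idx[P:]
    y_tr = y[tr] + np.sqrt(sigma2) * rng.standard_normal(y[tr].shape)  # noisy training labels
    K_tr = rbf_kernel(X[tr], X[tr])
    alpha = np.linalg.solve(K_tr + lam * np.eye(P), y_tr)              # (K + lam I)^-1 y
    f_te = rbf_kernel(X[te], X[tr]) @ alpha
    return np.mean((f_te - y[te]) ** 2)                                # error against clean target

# Toy usage with synthetic data standing in for image features
X = np.random.default_rng(1).standard_normal((500, 20))
y = np.tanh(X[:, 0])   # an arbitrary target function
print([kernel_regression_error(X, y, P) for P in (20, 100, 400)])
```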
These experiments and the discussion in the previous section illustrate the three main heuristics about how the dataset, kernel, target function, and noise interact to produce generalization in kernel regression:

1. Spectral Bias: Kernel eigenfunctions $\phi_\rho$ with large eigenvalues $\eta_\rho$ can be estimated with kernel regression using a smaller number of samples.
2. Task-Model Alignment: Target functions with most of their power in top kernel eigenfunctions can be estimated efficiently and are compatible with the chosen kernel. We introduce the cumulative power distribution, C(ρ), defined in Eq. (6), as a measure of this alignment.
3. Non-monotonicity: Generalization error may be non-monotonic with dataset size in the presence of noise (as in Fig. 2), or when the target function is not expressible by the kernel (not in the RKHS).
We provide a discussion of and examples for the latter kind in Supplementary Note 3 and 4. We show that modes of the target function corresponding to zero eigenvalues of the kernel act effectively as noise on the learning problem.
To explore these phenomena further and understand their causes, we study several simplified models where the kernel eigenvalue problem and generalization error equations can be solved analytically.

Double-Descent Phase Transition in a Band-Limited RKHS
An explicitly solvable and instructive case is the white band-limited RKHS with N equal nonzero eigenvalues, a special case of which is linear regression. Later, we will observe that the mathematical description of rotation invariant kernels on isotropic distributions reduces to this simple model in each learning stage.
In this model, the kernel eigenvalues are equal, $\eta_\rho = \frac{1}{N}$, for a finite number of modes $\rho = 1,\dots,N$ and truncate thereafter: $\eta_\rho = 0$ for $\rho > N$. Similarly, the target power $\bar w_\rho^2$ truncates after $N$ modes and satisfies the normalization condition $\sum_{\rho=1}^N \bar w_\rho^2 = N$. In Supplementary Note 3, we relax these constraints and discuss their implications. Linear regression (or the linear perceptron) with isotropic data is a special case when $D = N$, $\phi_\rho(x) = x_\rho$, and $\langle x_\rho x_{\rho'}\rangle_{x\sim p(x)} = \delta_{\rho\rho'}$ [25].
We study this model in the thermodynamic limit. We find that the natural scaling is to take $P\to\infty$ and $N\to\infty$ with $\alpha = P/N \sim O(1)$ and $D\sim O(1)$ (or $D = N\sim O(P)$ in the linear regression case), leading to the generalization error
$$E_g(\alpha) = \frac{\kappa_\lambda(\alpha)^2 + \sigma^2\alpha}{\big(\kappa_\lambda(\alpha) + \alpha\big)^2 - \alpha}, \qquad \kappa_\lambda(\alpha) = \frac{1}{2}\Big[(1+\lambda-\alpha) + \sqrt{(1+\lambda-\alpha)^2 + 4\lambda\alpha}\Big]. \tag{7}$$
Note that the result is independent of the teacher weights as long as they are properly normalized. The function $\kappa_\lambda(\alpha)$ appears in many contexts relevant to random matrix theory, as it is related to the resolvent, or Stieltjes transform, of a random Wishart matrix [47,48] (Supplementary Note 3). This simple model shows interesting behavior, elucidating the role of regularization and under- vs. over-parameterization in learning machines. First we consider the interpolation limit ($\lambda = 0$, Fig. 3a), where $\kappa_0(\alpha) = \max\{0, 1-\alpha\}$ and the generalization error simplifies to
$$E_g(\alpha) = \begin{cases} (1-\alpha) + \dfrac{\sigma^2\alpha}{1-\alpha}, & \alpha < 1,\\[1ex] \dfrac{\sigma^2}{\alpha - 1}, & \alpha > 1.\end{cases}$$
There is a first-order phase transition at $\alpha_c = 1$, when the number of samples $P$ is equal to the number of non-zero modes $N$ and therefore to the number of parameters, $\{w_\rho\}$, that define the target function. The phase transition is signaled by the non-analytic behavior of $E_g$ and verifiable by computing the first derivative of the free energy (Supplementary Note 3). When $\sigma = 0$, $E_g$ falls linearly with more data, and at the critical point the generalization error goes to zero. With noise present, the behavior at the critical point changes drastically, and there is a singular peak in the generalization error due to the noise term (Fig. 3a). At this point the kernel machine is (over-)fitting exactly all data points, including noise. Then, as the number of samples increases beyond the number of parameters ($\alpha > 1$), the machine is able to average over noise and the generalization error falls with asymptotic behavior $E_g\sim\sigma^2/\alpha$. Our results are consistent with those previously obtained for the linear perceptron with a noisy target [25,49], which is a special case of kernel regression with a white band-limited spectrum.
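The ridgeless learning curve above is simple enough to evaluate directly; the helper below (our own sketch of the expression as written here) makes the divergence near α = 1 easy to inspect.

```python
import numpy as np

def white_ridgeless_Eg(alpha, sigma2):
    """Ridgeless (lambda -> 0) learning curve of the white band-limited model:
    E_g = (1-a) + sigma2*a/(1-a) for a < 1, and sigma2/(a-1) for a > 1."""
    alpha = np.asarray(alpha, dtype=float)
    under = (1 - alpha) + sigma2 * alpha / (1 - alpha)
    over = sigma2 / (alpha - 1)
    return np.where(alpha < 1, under, over)

alphas = np.array([0.25, 0.5, 0.9, 1.1, 2.0, 4.0])
print(white_ridgeless_Eg(alphas, sigma2=0.1))   # note the blow-up as alpha approaches 1
```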
When $\lambda > 0$ and $\sigma = 0$, $E_g$ decreases monotonically with $\alpha$ and is asymptotic to $E_g\sim\lambda^2/\alpha^2$ (Fig. 3b). A sharp change at $\alpha = 1$ is visible for small $\lambda$, reminiscent of the phase transition at $\lambda = 0$. When $\sigma > 0$ is sufficiently large compared to $\lambda$, non-monotonicity is again present, giving a maximum generalization error at $\alpha\approx 1+\lambda$ (Fig. 3c), with an asymptotic fall $E_g\sim\sigma^2/\alpha$. We find that $E_g(\alpha)$ is non-monotonic when the noise level in the target satisfies $\sigma^2 > g(\lambda)$ (Eq. (8); Fig. 3d and Supplementary Note 3), where
$$g(\lambda) = 3\lambda\left(3\lambda + 2 - 2\sqrt{1+\lambda}\,\sqrt{9\lambda+1}\,\cos\theta(\lambda)\right), \qquad \theta(\lambda) = \frac{1}{3}\left(\pi + \tan^{-1}[\,\cdots\,]\right),$$
with the full expression for $\theta(\lambda)$ given in Supplementary Note 3. Although there is no strict phase transition (in the sense of a non-analytic free energy) except at $\lambda = 0$, Eq. (8) defines a phase boundary separating the monotonic and non-monotonic learning curve regions for a given regularization parameter and noise. For a given $\lambda$, double-descent occurs for sufficiently high $\sigma^2$. In the non-monotonic region, there is a single local maximum when $\sigma^2 > 2\lambda + 1$; otherwise there is a local minimum followed by a local maximum (we refer only to this kind of peak as the double-descent peak).

Fig. 3: Learning curves and double-descent phase diagram for kernels with white band-limited spectra. We simulated $N = 800$ dimensional uncorrelated Gaussian features $\phi(x) = x\sim\mathcal{N}(0, I)$ and estimated a linear function $\bar f(x) = \beta\cdot x$ with $\|\beta\|^2 = N$. Error bars describe the standard deviation over 15 trials. Solid lines are theory (Eq. (7)), dots are experiments. a. When $\lambda = 0$ and $\sigma^2 = 0$, $E_g$ decreases linearly with $\alpha$, and when $\sigma^2 > 0$ it diverges as $\alpha\to 1$. b. When $\sigma^2 = 0$, explicit regularization $\lambda$ always leads to a slower decay of $E_g$. c. For nonzero noise $\sigma^2 > 0$, there is an optimal regularization $\lambda^* = \sigma^2$ which gives the best generalization performance. d. Double-descent phase diagram where the colored squares correspond to the curves with the same color in panel c. The optimal regularization ($\lambda^* = \sigma^2$) curve is shown as a yellow dashed line, which does not intersect the double-descent region above the curve defined by $g(\lambda)$ (Eq. (8)).
Based on this explicit formula, one could choose a large enough $\lambda$ to mitigate the peak and avoid overfitting for a given noise level (Fig. 3d). However, a larger $\lambda$ may imply slower learning (see Fig. 3b and Supplementary Note 3), requiring more training samples to achieve the same generalization error. By setting the derivative $\partial E_g/\partial\lambda$ to zero, we find that $\lambda^* = \sigma^2$ (yellow dashed line in Fig. 3d) is the optimal choice of ridge parameter, minimizing $E_g(\alpha)$ for a given $\sigma^2$ at all $\alpha$ (Fig. 3c). For $\lambda > \lambda^*$ the noise-free error term increases from its optimum, whereas $\lambda < \lambda^*$ gives a larger noise term. Our result agrees with a similar observation regarding the existence of an optimal ridge parameter in linear regression [46].
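A quick numerical check of the optimal-ridge statement can be done with the finite-λ learning curve written above (Eq. (7)); the grid search and helper below are our own illustration.

```python
import numpy as np

def white_Eg(alpha, lam, sigma2):
    """Finite-lambda learning curve of the white band-limited model, as written above:
    kappa solves kappa^2 + kappa*(alpha - 1 - lam) - lam*alpha = 0 (positive root)."""
    b = alpha - 1 - lam
    kappa = (-b + np.sqrt(b**2 + 4 * lam * alpha)) / 2
    return (kappa**2 + sigma2 * alpha) / ((kappa + alpha) ** 2 - alpha)

sigma2, alpha = 0.3, 2.0
lams = np.linspace(1e-3, 1.0, 1000)
print(lams[np.argmin(white_Eg(alpha, lams, sigma2))])   # close to sigma2, the optimal ridge
```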
Further insight into the phase transition can be gained by looking at the bias and the variance of the estimator [38,42,43]. The average estimator learned by kernel regression linearly approaches the target function as $\alpha\to 1$ (Supplementary Note 2): $\langle f^*(x)\rangle_{\mathcal D} = \min\{\alpha, 1\}\,\bar f(x)$ (Fig. 4a), which indicates that the bias ($B$) and variance ($V$) contributions to the generalization error have the forms $B = \max\{0, 1-\alpha\}^2$ and $V = E_g - B$. In the absence of noise, $\sigma = 0$, the variance is initially low at small $\alpha$, reaches its maximum at $\alpha = 1/2$, and then decreases as $\alpha\to 1$ as the learned function concentrates around $\bar f$ (Fig. 4b). When there is noise, the phase transition at $\alpha = 1$ arises from the divergence in the variance $V$ of the learned estimator (Fig. 4c).

Fig. 4 (partial caption): b. As in Fig. 3, when $\lambda = 0$ and $\sigma^2 = 0$, the bias is a monotonically decreasing function of $\alpha$, while the variance has a peak at $\alpha = 1/2$ yet does not diverge. c. When $\lambda = 0$ and $\sigma^2 = 0.2$, we observe that the divergence of $E_g$ is only due to the diverging variance of the estimator. In b. and c., solid lines are theory, dots are experiments. Error bars represent the standard deviation over 15 trials.

Multiple Learning Episodes and Descents: Rotation Invariant Kernels and Measures
Next, we study kernel regression on high-dimensional spheres, focusing on rotation invariant kernels, which satisfy $K(\mathbf{O}x, \mathbf{O}x') = K(x, x')$ for any orthogonal matrix $\mathbf{O}$. This broad class of kernels includes the widely used radial basis function kernels $K(x,x') = K(\|x-x'\|)$ (Gaussian, Laplace, Matérn, rational quadratic, thin plate splines, etc.) and dot product kernels $K(x,x') = K(x\cdot x')$ (polynomial kernels, NNGPK and NTK) [10,39,40].
When the data distribution is spherically isotropic, $p(x) = p(\|x\|)$, we can separate the Mercer eigenfunctions of rotation invariant kernels into radial and angular parts. The radial parts depend on the radial distances $\|x\|$ of the data points, whereas the angular parts admit a decomposition in terms of spherical harmonics of the unit vectors, $Y_{km}(\hat x)$, where $k$ is the degree and $m$ is the order of the harmonic [50]. A review of the basic properties of spherical harmonics is provided in Supplementary Note 6. Utilizing this spherical symmetry, we obtain a Mercer decomposition whose eigenfunctions are indexed by a radial index $z$, a degree $k$, and an order $m$. Since the eigenvalues are independent of the spherical harmonic order $m$, the minimal degeneracy of the RKHS spectrum is the number of degree-$k$ harmonics, which in the limit $D\to\infty$ scales as $\sim D^k/k!$ [51,52]. However, the degeneracy can be even larger if there are different $(z, k)$ indices with the same eigenvalue. For notational convenience, we denote degenerate eigenvalues as $\eta_K$ ($K\in\mathbb{Z}^+$) and the corresponding eigenfunctions as $\phi_{K,\rho}$, where $\rho\in\mathbb{Z}^+$ runs over the degenerate indices. After finding the eigenvalues of a kernel in the basis $\phi_{K,\rho}$, one can evaluate Eq. (4) to predict the generalization error of the kernel machine.
We focus on the case where the degeneracy of the $K$-th distinct eigenvalue, $N(D, K)$, scales with the input dimension as $\sim O(D^K)$ [34]. Examples include the widely used Gaussian kernel and dot product kernels such as the NTK, which we discuss below and in Supplementary Note 4.
This scaling of the degeneracy allows us to consider multiple $P, D\to\infty$ limits, leading to different learning stages. We consider a separate limit for each degenerate eigenvalue $L$ while keeping $\alpha\equiv P/N(D, L)$ finite. With this setting, we evaluate Eq. (4) with the definitions $\bar\eta_K\equiv N(D,K)\,\eta_K$ and $\bar w_K^2\equiv\frac{1}{N(D,K)}\sum_\rho \bar w_{K,\rho}^2$ to obtain the generalization error $E_g^{(L)}(\alpha)$ in learning stage $L$ (Eq. (9)). Several observations can be made: 1. In learning stage $L$, the generalization error due to all target modes with $K < L$ has already decayed to zero. As $\alpha\to\infty$, the $K = L$ modes of the target function are learned, leaving the $K > L$ modes. This illustrates an inductive bias towards learning target function modes corresponding to higher kernel eigenvalues.
2. $E_g^{(L)}(\alpha)$ reduces, up to a constant $\bar\eta_L\bar w_L^2$, to the generalization error of the band-limited case, Eq. (7), with the identification of an effective noise parameter, $\tilde\sigma_L$, and an effective ridge parameter, $\tilde\lambda_L$. Inspection of $\tilde\sigma_L$ reveals that target modes with $K > L$ (i.e. $E_g^{(L)}(\infty)$) act as noise in the current stage. Inspection of $\tilde\lambda_L$ reveals that kernel eigenvalues with $K > L$ act as a regularizer in the current stage. The role of the number of eigenvalues in the white band-limited case, $N$, is played here by the degeneracy $N(D, L)$. 3. Asymptotically, the first term in $E_g^{(L)}(\alpha)$ decreases monotonically as $\alpha^{-2}$, while the second term shows non-monotonic behavior with a maximum at $\alpha = 1 + \tilde\lambda_L$. Similar to the white band-limited case, the generalization error diverges due to a variance explosion at $\alpha = 1 + \tilde\lambda_L$ when $\tilde\lambda_L = 0$ (possible for a band-limited spectrum), implying again a first-order phase transition. Non-monotonicity caused by the noise term implies a possible peak in the generalization error in each learning stage. A phase diagram can be drawn, where the phase boundaries are again defined by Eq. (8) evaluated with the effective ridge and noise parameters (Fig. 5a). 4. Similar to the white band-limited case, optimal regularization occurs when $\tilde\lambda_L = \tilde\sigma_L^2$, minimizing $E_g^{(L)}(\alpha)$ for a given $\tilde\sigma_L$ at all $\alpha$. This result extends the previous findings on linear regression [46] to the large class of rotation invariant kernels. 5. When all stages are considered, it is possible to observe learning curves with multiple descents, with at most one peak per stage. The presence and size of the descent peak depend on the level of noise in the data and the effective regularization, as shown in Eq. (8) and Eq. (9). Similar observations of multiple peaks in the learning curves were made in [36] in the context of ridgeless regression on polynomial kernels.
As an example of the effect of the kernel spectrum on double-descent, consider a power-law spectrum $\bar\eta_K = K^{-s}$. In the ridgeless $\lambda = 0$ case, the effective regularization is $\tilde\lambda_L = \sum_{K>L}\bar\eta_K/\bar\eta_L = L^s\,\zeta(s, L+1)$, where $\zeta(s, a)$ is the Hurwitz zeta function, and faster decaying spectra (higher $s$, smaller $\tilde\lambda_L$) are more prone to double-descent than slower ones (Fig. 5a). Furthermore, we also observe that higher modes (higher $L$, higher $\tilde\lambda_L$) are more immune to overfitting, signaled by non-monotonicity, than the lower modes.
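The closed form above can be evaluated with SciPy's Hurwitz zeta function; the brief check below compares it with a brute-force sum (illustrative code, our own).

```python
import numpy as np
from scipy.special import zeta

def effective_ridge_powerlaw(L, s):
    """Effective regularization for a power-law spectrum eta_K = K^{-s}:
    lambda_L = sum_{K>L} eta_K / eta_L = L^s * zeta(s, L+1)  (Hurwitz zeta)."""
    return L**s * zeta(s, L + 1)

def effective_ridge_numeric(L, s, K_max=100000):
    # brute-force check of the closed form above
    K = np.arange(L + 1, K_max + 1, dtype=float)
    return np.sum(K**-s) / float(L) ** -s

for L in (1, 2, 4):
    print(L, effective_ridge_powerlaw(L, s=2.0), effective_ridge_numeric(L, s=2.0))
```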
We apply our theory to Gaussian RBF regression on synthetic data in Fig. 5. Fig. 5b demonstrates perfect agreement between theory and experiment for the Gaussian RBF kernel on synthetic data when no label noise is present. The vertical dashed lines mark the locations where $P = N(D, 1)$ and $P = N(D, 2)$, respectively. Fig. 5c shows the regression experiment with the parameters $(\tilde\sigma_1^2, \tilde\lambda_1)$ indicated by the colored squares on the phase diagram (Fig. 5a). When the parameters are chosen on the yellow dashed line in Fig. 5a, corresponding to the optimal regularization for fixed effective noise, the lowest generalization error is achieved in the first learning episode without a double-descent. Finally, Fig. 5d shows the theory and experiment curves with the parameters $(\tilde\sigma_1^2, \tilde\lambda_1)$ shown by the colored circles in Fig. 5a. As expected, for fixed effective regularization, increasing noise hurts generalization. For further experiments see Supplementary Note 4.

Dot Product Kernels, NTK and Wide Neural Networks
Our theory allows the study of the generalization error of trained, wide feedforward neural networks by exploiting a correspondence with kernel regression. When the weights in each layer are initialized from a Gaussian distribution $\mathcal{N}(0, \sigma_W^2)$ and the sizes of the hidden layers tend to infinity, the function $f(x, \theta)$ learned by training the network parameters $\theta$ with gradient descent on a squared loss to zero training error is equivalent to the function obtained from ridgeless ($\lambda = 0$) kernel regression with the NTK [13]. For fully connected neural networks, the NTK is a dot product kernel [13,34]. For such kernels and spherically symmetric data distributions $p(x) = p(\|x\|)$, the kernel eigenfunctions do not have a radial part, and consequently the eigenvalues are free of a $z$-index. Therefore, the $k$-th eigenvalue has the degeneracy of the degree-$k$ spherical harmonics, $O_D(D^k)$ (with $K, L\to k, l$ and $\rho\to m$) [34], allowing recourse to the same scaling we used to analyze rotation invariant kernels in the previous section. The learning curves for infinitely wide neural networks will thus have the same form as Eq. (9), evaluated with the NTK eigenvalues and with $\lambda = 0$.
In Fig. 6a, we compare the prediction of our theoretical expression for $E_g$, Eq. (4), to NTK regression and neural network training. The match to NTK regression is excellent. We can describe neural network training up to a certain $P$, after which the correspondence to NTK regression breaks down due to the network's finite width. For large $P$, the neural network operates in the under-parameterized regime, where the network initialization variance due to the finite number of parameters starts contributing to the generalization error [3,38,54,42]. A detailed discussion of these topics is provided in Supplementary Note 4.

Fig. 5 (partial caption): Larger regularization suppresses the descent peaks, which occur at $P^*\sim N(D, L)$, shown by the vertical dashed lines. c. Varying $\tilde\lambda_L$ with fixed $\tilde\sigma_L$. d. Vice versa. For fixed noise, we observe an optimal $\tilde\lambda_1$ up to $P/N(D, 1)\sim 10$, after which the next learning stage starts. Error bars indicate standard deviation over 300 trials for b and 100 trials for c, d.
Neural networks are thought to generalize well because of implicit regularization [2,51,52]. This can be addressed with our formalism. For spherical data, we see that the implicit regularization of a neural network for each mode $l$ is given by $\tilde\lambda_l = \sum_{k>l}\bar\eta_k/\bar\eta_l$. As an example, we calculate the spectrum of the NTK for rectifier activations, and observe that the spectrum whitens with increasing depth [52], corresponding to larger $\tilde\lambda_l$ and therefore more regularization for small learning stages $l$ (Fig. 6b). The trend for small-degree harmonics $l$ is especially relevant since, as we have shown, approximately $D^l$ samples are required to learn degree-$l$ harmonics. In this small-$l$ regime, we see that deep networks exhibit much higher effective regularization compared to shallow ones due to the slower decay of their eigenvalues.

Fig. 6 (partial caption): a. The target is a degree-$k$ Gegenbauer polynomial (see Supplementary Notes 5 and 6); here we picked $k = 1$ (linear target). Solid lines are the theoretically predicted learning curves (Eq. (4)), dots represent NTK regression, and × represents $E_g$ after neural network training. The correspondence between NN training and NTK regression breaks down at large sample sizes $P$, since the network operates in the under-parameterized regime and finite-size effects dominate $E_g$. Error bars represent the standard deviation of 15 averages for kernel regression and 5 averages for neural network experiments. b. Dependence of $\tilde\lambda_l$ on mode $l$ for NTKs with various numbers of layers. The weight and bias variances for the neural network are chosen to be $\sigma_W^2 = 1$ and $\sigma_b^2 = 0$, respectively.

Discussion
We studied generalization in kernel regression using statistical mechanics and the replica method [33]. We derived an analytical expression for the generalization error, Eq. (4), valid for any kernel and any dataset. We showed that our expression explains generalization on real datasets, and provided a detailed analysis of its application to band-limited kernels with white spectra and to the widely used class of rotation invariant kernels [39,40] operating on spherical data. For the latter case, we defined an effective regularization and an effective noise which govern the generalization behavior, including the non-monotonicity of learning curves. It will be interesting to see if analogues of these concepts can be defined for real datasets. Our results are directly applicable to the infinite-width limit of neural networks that admit a kernel description, including feedforward, convolutional and recurrent neural networks. We also note a closely related recent study [19], which we discuss further in the Supplementary Discussion, that utilizes random matrix theory to study generalization in kernel regression. One goal of our present work is to provide a framework that incorporates structural information about the data distribution into a realistic prediction of generalization performance that holds for real data and any kernel. Indeed, a recent study suggested that structure in data allows kernel methods to outperform pessimistic generalization expectations based on the high ambient dimension [65]. In a different setting, the authors of [66] calculate the test error of a random Fourier features model using random matrix theory techniques without strong assumptions on the data distribution and obtain excellent agreement on real datasets. Overall, our results demonstrate how data and the inductive biases of a model interact to shape generalization behavior, and in particular the importance of the compatibility of a learning task with the model for sample-efficient learning. Our findings elucidate three heuristic principles for generalization in kernel regression.
First is the spectral bias. The eigendecomposition of the kernel provides a natural ordering of the functions that are easiest to estimate. Decomposing the generalization error into modal errors, we found that errors in spectral modes with large eigenvalues decrease more rapidly with increasing sample size than modes with small eigenvalues, as also observed in [34], illustrating a preference to fit certain functions over others. Our findings are consistent with other experimental and analytical results, which derive error bounds on the test risk to elucidate the spectral or frequency bias of the NTK and NNGP [67,68,69,70].
Consequently, how a given task decomposes in the eigenbasis, a heuristic that we name task-model alignment, determines the number of samples required to achieve good performance: tasks with most of their power in top eigenmodes can be learned in a sample-efficient manner. We introduced the cumulative power distribution as a metric for task-model alignment and proved that target functions with higher cumulative power distributions will have lower normalized generalization error for all P under the same kernel and data distribution. A related notion of kernel compatibility with the target was defined in [71,72], which we discuss in detail in the Supplementary Discussion.
The third phenomenon we explore is how non-monotonicity can appear in the learning curves when either the labels are noisy, or the target function has modes that are not expressible by the kernel. Non-monotonicity is caused by the variance term in the bias-variance decomposition of the generalization error. In the analytically tractable models we considered, this is related to a phase transition, appearing in separate learning stages for the rotation invariant kernels. Non-monotonicity can be mitigated with explicit or implicit regularization [38,42,73]. We showed the existence of an optimal regularizer, independent of sample size, for our theoretical settings. When applied to linear regression, our optimal regularizer matches that previously given by [46].
Non-monotonicity in generalization error has gathered a lot of interest recently. Many studies pointed to the absence of overfitting in overparameterized machine learning models, signaled by a peak and a subsequent descent in generalization error as the model complexity, or the number of parameters, increases and the model transitions from an underparameterized to an overparameterized (interpolating) regime [3,5,35,54,38,42,43,74,75,66]. Multiple peaks are also possible in this context [76]. Our work provides an explanation for the lack of overfitting in overparameterized models by elucidating strong inductive biases of kernel regression, valid even in the interpolation limit, which includes infinitely overparameterized limits of neural networks. Sample-wise non-monotonicity has also been observed previously in many models [5,25,24,54,73], including ones that show multiple peaks [46,43,36,77]. A closely related study obtained an upper bound for the test risk in ridgeless regression which shows non-monotonic behavior with increasing sample size whenever P ∼ O(D^L), consistent with our results on rotation invariant kernels and isotropic data.
An interesting comparison can be made between the multiple peaks we observed in our analytically solvable models and the multiple peaks observed in random features models [43,76]. In these models, one of the peaks (termed "nonlinear" in [43]) happens when the number of samples reaches the number of features, and thus the number of parameters of the model, crossing the interpolation threshold. While the peak we observed in the white band-limited case with nonlinear features also happens at the interpolation threshold (P = N), the mechanisms causing double descent are different. In random features models, this peak is due to variance in the initialization of the random feature vectors. In our example, such variance does not exist; the peak is due to overfitting the noisy labels and disappears when there is no noise. The peaks observed in the rotation invariant case have a more subtle connection. In each learning stage, peaks occur when the number of samples reaches the degeneracy of the eigenfunctions in that stage. While kernel regression is non-parametric, one can think of this again as crossing an interpolation threshold defined by the dimensionality of the feature space in the large-D limit. As in the white band-limited case, these peaks are due to overfitting noise.
While our theory is remarkably successful in its predictions, it has limitations. First, the theory requires an eigendecomposition of the kernel on the full dataset, which is computationally costly. Second, its applicability to state-of-the-art neural networks is limited by the kernel regime of networks, which does not capture many interesting and useful deep learning phenomena [78,62]. Third, our theory uses a Gaussian approximation [79] and a replica symmetric ansatz [33]. While these assumptions were validated by the remarkable fit to experiments, it will be interesting to see if their relaxations reveal new insights.

Statistical Mechanics Formulation
With the setting described in the main text, the kernel regression problem reduces to the minimization of an energy function $H(w)$, given in Eq. (11), which combines the squared training error with the RKHS norm penalty. The quantity of interest is the generalization error in Eq. (2), which in matrix notation is
$$E_g(\mathcal{D}) = (w^* - \bar w)^\top\,\Lambda\,(w^* - \bar w),$$
where $\Lambda_{\rho\gamma} = \eta_\rho\delta_{\rho\gamma}$ is a diagonal matrix with entries given by the RKHS eigenvalues $\eta_\rho$. In order to calculate the generalization error, we introduce a Gibbs distribution $p_G(w)\equiv\frac{1}{Z}e^{-\beta H(w)}$ with the energy function in Eq. (11). In the $\beta\to\infty$ limit, this Gibbs measure is dominated by the solution to the kernel regression problem. We utilize this fact to calculate the generalization error for kernel regression. This can be done by introducing a source term with strength $J$ to the partition function, where the free energy $\beta F\equiv -\ln Z(J)$ is the relevant quantity to compute the generalization error for a given dataset, $E_g(\mathcal{D})$. In Supplementary Note 2, we introduce other source terms to calculate the training error, the average estimator, and its variance. The free energy depends on the sampled dataset $\mathcal{D}$, which can be thought of as a quenched disorder of the system. Experience from the physics of disordered systems suggests that the free energy concentrates around its mean (is self-averaging) for large $P$ [33]. Therefore, we calculate the typical behavior of the system by averaging the free energy over all possible datasets. All calculations are detailed in Supplementary Notes 1 and 2; here we provide a summary. To perform averages over the quenched disorder, we resort to the replica trick [80], using $\langle\log Z(J)\rangle_{\mathcal{D}} = \lim_{n\to 0}\frac{1}{n}\log\langle Z(J)^n\rangle_{\mathcal{D}}$. A key step is a Gaussian approximation to the average over the dataset in feature space [81], which exploits the orthogonality of the feature vectors with respect to the input distribution $p(x)$. These averages are expressed in terms of order parameters defining the mean and the covariance of the Gaussian. The calculation proceeds by a replica symmetric ansatz [33], evaluating the saddle point equations and taking the $\beta\to\infty$ limit.

Modal errors
The generalization error can be written as a sum of modal errors arising from the estimation of the coefficient of eigenfunction $\psi_\rho$, $E_g = \sum_\rho \eta_\rho\bar w_\rho^2 E_\rho$, where
$$E_\rho = \frac{1}{1-\gamma}\,\frac{\kappa^2}{(\kappa + P\eta_\rho)^2}.$$
We now provide a proof that this mode error expression implies that the decay rates of the mode errors have the same ordering as the kernel eigenvalues when $\sigma = 0$. Assume $\eta_\rho > \eta_{\rho'}$, and note that $\kappa'(P) = -\frac{\kappa\gamma}{P(1-\gamma)} < 0$ since $\kappa, \gamma, P > 0$ and $\gamma < 1$. It follows that
$$\frac{d}{dP}\log E_\rho < \frac{d}{dP}\log E_{\rho'}.$$
Let $u_{\rho,\rho'}(P) = \log\big(E_\rho(P)/E_{\rho'}(P)\big)$. The above demonstrates that $\frac{d}{dP}u_{\rho,\rho'}(P) < 0$ for all $P$. Since $u_{\rho,\rho'}(0) = 0$, this implies that $u_{\rho,\rho'}(P) < 0$, or equivalently $E_\rho < E_{\rho'}$, for all $P$. This result indicates that the mode errors have the opposite ordering of the eigenvalues, summarizing the phenomenon of spectral bias for kernel regression: the generalization error falls faster for the eigenmodes with larger eigenvalues. If the target function has norm $T = \langle\bar f(x)^2\rangle = \sum_\rho \eta_\rho\bar w_\rho^2$, then the generalization error is a convex combination of the quantities $T E_\rho(P)$, with weights $\eta_\rho\bar w_\rho^2/T$. The quantities $T E_\rho(P)$ maintain the same ordering as the normalized mode errors $E_\rho$ for all $P$, and we see that re-allocations of target function power that strictly increase the cumulative power distribution $C(\rho) = \frac{1}{T}\sum_{\rho'\le\rho}\eta_{\rho'}\bar w_{\rho'}^2$ must cause a reduction in generalization error. We emphasize that, for a fixed set of kernel eigenvalues, strictly higher $C(\rho)$ yields better generalization, but we provide a caveat: for a fixed target function, comparison of different kernels requires knowledge of both the change in the spectrum $\eta_\rho$ and the changes in the $C(\rho)$ curve.
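A small numerical check of this ordering, using the modal-error expression above, is given below; the code is our own sketch (the fixed-point solver repeats the one shown after Eq. (4)).

```python
import numpy as np

def solve_kappa(eigvals, P, lam, n_iter=5000):
    # same fixed-point iteration as in the sketch following Eq. (4)
    kappa = lam + eigvals.sum()
    for _ in range(n_iter):
        kappa = lam + np.sum(kappa * eigvals / (kappa + P * eigvals))
    return kappa

def mode_errors(eigvals, P, lam=1e-3):
    """Normalized mode errors E_rho = kappa^2 / ((kappa + P*eta_rho)^2 (1 - gamma))."""
    kappa = solve_kappa(eigvals, P, lam)
    gamma = np.sum(P * eigvals**2 / (kappa + P * eigvals) ** 2)
    return kappa**2 / ((kappa + P * eigvals) ** 2 * (1.0 - gamma))

eta = 1.0 / np.arange(1, 51) ** 2        # eigenvalues in decreasing order
for P in (5, 20, 100):
    E = mode_errors(eta, P)
    assert np.all(np.diff(E) >= 0)       # smaller eigenvalues -> larger mode errors
    print(P, np.round(E[:3], 4))
```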

Diagonalizing the kernel on real datasets
Calculation of $E_g$ requires two inputs: the kernel eigenvalues $\eta_\rho$ and the teacher weights $\bar w_\rho$, both calculated using the underlying data distribution. For a dataset with finitely many samples, we assume a discrete uniform distribution over the data, $p(x) = \frac{1}{M}\sum_{i=1}^M\delta(x - x_i)$, with $M$ being the size of the whole dataset (train+test). Then, the corresponding eigenvalue problem reads
$$\frac{1}{M}\sum_{j=1}^M K(x_i, x_j)\,\phi_k(x_j) = \eta_k\,\phi_k(x_i).$$
Note that both data indices and eigen-indices take values $i, k = 1,\dots,M$. For a target function with multiple classes, $f(x):\mathbb{R}^D\to\mathbb{R}^C$, we denote the one-hot encoded labels $Y = [y_1,\dots,y_C]\in\mathbb{R}^{M\times C}$ and obtain the teacher weights for each class as $\bar w_c = \frac{1}{M}\Lambda^{-1/2}\Phi^\top y_c$. For solving kernel regression, each of the $C$ one-hot output channels can be treated as an individual target function $f_{t,c}(x)$ with $f_{t,c}(x^\mu) = y^\mu_c$ for the one-hot labels $y^\mu_c$. The weights $\bar w_c$ obtained above allow the expansion of each teacher channel in the kernel's eigenbasis. The total generalization error for the entire task is simply $E_g = \sum_{c=1}^C\big\langle (f^*_c(x) - f_{t,c}(x))^2\big\rangle_{x,\mathcal{D}}$, where $f^*_c$ is the kernel regression solution for output channel $c$. Note that, with the choice of one-hot labels, the total target power is normalized to 1. After computing learning curves for each channel $c$, which requires plugging $\{(\eta_k, \bar w^2_{c,k})\}_{k=1}^M$ into our theory, the learning curves for each channel are simply summed to arrive at the final generalization error.
In other cases, when we do not possess a priori knowledge about $p(x)$, the underlying data distribution, solving the kernel eigenvalue problem in Eq. (3) appears intractable. However, when we are provided with a large number of samples from the dataset, we may approximate the kernel eigenvalue problem by using a Monte-Carlo estimate of the data density, i.e. $p(x)\approx\frac{1}{M}\sum_{i=1}^M\delta(x - x_i)$.
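In practice, the finite-dataset procedure above amounts to an uncentered kernel PCA of the full Gram matrix. Below is a minimal sketch that extracts the eigenvalues $\eta_k$ and squared target weights needed by the theory, following the normalization conventions stated above; the helper names and toy data are our own.

```python
import numpy as np

def kernel_eigendecomposition(K_full, y):
    """Diagonalize a kernel on a finite dataset of size M (uncentered kernel PCA).
    Solves (1/M) sum_j K(x_i, x_j) phi_k(x_j) = eta_k phi_k(x_i) and returns the
    eigenvalues eta_k and squared target weights w_k^2 used by the theory."""
    M = K_full.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K_full / M)            # symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    phi = eigvecs * np.sqrt(M)                               # so that (1/M) sum_i phi_k(x_i)^2 = 1
    # Target coefficients in the kernel eigenbasis (one column per output channel)
    w = (phi.T @ y) / M / np.sqrt(np.clip(eigvals, 1e-12, None))[:, None]
    return eigvals, w**2

# Toy usage: Gaussian RBF Gram matrix on random data, one-hot labels for 2 classes
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_full = np.exp(-sq / (2 * X.shape[1] * 0.1**2))
y = np.eye(2)[(X[:, 0] > 0).astype(int)]
eta, w2 = kernel_eigendecomposition(K_full, y)
```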

Data Availability
All data generated and/or analyzed in this study, along with the analysis code, are available in the Github repository [82], https://github.com/Pehlevan-Group/kernel-generalization.

Code Availability
All code used to perform numerical experiments, analyze data and generate figures is available in the Github repository [82], https://github.com/Pehlevan-Group/kernel-generalization.

Supplementary Note 1 -Problem Setup
A reproducing kernel Hilbert space (RKHS) [1] $\mathcal{H}$ living on $\mathcal{X}\subset\mathbb{R}^D$ is a subset of the square integrable functions $L_2(\mathcal{X}, p)$ for a measure $p$, equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and a kernel satisfying the reproducing property $f(x) = \langle f, K(\cdot, x)\rangle_{\mathcal{H}}$, with $K(\cdot, x)$ itself being an element of $\mathcal{H}$. We assume that the data are drawn from a finite measure with density $p(x)$ and that the kernel has finite trace on this measure, $\int K(x, x)\,p(x)\,dx < \infty$. We introduce the integral operator $T_K: L_2(\mathcal{X}, p)\to L_2(\mathcal{X}, p)$, a linear map from functions to functions, $(T_K f)(x) = \int K(x, x')f(x')p(x')dx'$. Mercer's theorem allows the diagonalization of the kernel in terms of the eigenfunctions of $T_K$ [2,3]. This set of eigenfunctions forms an orthonormal basis with respect to $L_2(\mathcal{X}, p)$. The resulting Mercer decomposition of the kernel is
$$K(x, x') = \sum_\rho \eta_\rho\,\phi_\rho(x)\,\phi_\rho(x').$$
We refer to $\{\eta_\rho\}$ as the spectrum of the RKHS, or the kernel eigenvalues. The number of nonzero kernel eigenvalues defines the dimension of the RKHS, which is infinite in the general setting. By the orthonormality of the functions $\phi_\rho(x)$ on $p(x)$, we can verify that the eigenfunction property $\int K(x, x')\phi_\rho(x')p(x')dx' = \eta_\rho\phi_\rho(x)$ is satisfied. As a consequence of the basis property of $\{\phi_\rho\}_{\rho=0}^\infty$, a square integrable function $f(x)\in L_2(\mathcal{X}, p)$ can be expanded as $f(x) = \sum_\rho a_\rho\phi_\rho(x)$. By the reproducing property of the kernel, we find the identity $\langle\phi_\rho, \phi_\gamma\rangle_{\mathcal{H}} = \frac{\delta_{\rho\gamma}}{\eta_\rho}$. Therefore the RKHS norm of a function $f$ is $\|f\|_{\mathcal{H}}^2 = \sum_\rho \frac{a_\rho^2}{\eta_\rho}$. The training set is $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^P$; here, we explicitly denote the dataset dependence $\mathcal{D}$. Labels $y^\mu$ are generated from a noisy target function, $y^\mu = \bar f(x^\mu) + \epsilon^\mu$. We define the null-space indices of the kernel as $\mathcal{N} = \{\rho\,|\,\eta_\rho = 0\}$ and its complement as $\mathcal{N}^\perp$.
Since the learned function is always in the RKHS (the energy $H$ includes a penalty on $\|f\|_{\mathcal{H}}^2$), it cannot place any power in the null modes indexed by $\mathcal{N}$ and therefore can only optimize over linear combinations of the eigenfunctions indexed by $\mathcal{N}^\perp$. We introduce rescaled features $\psi_\rho = \sqrt{\eta_\rho}\,\phi_\rho$ for all $\rho\in\mathcal{N}^\perp$. The square integrable target function admits the decomposition $\bar f(x) = \bar w\cdot\Psi(x) + \bar a\cdot\Phi(x)$, where we have introduced vectors $\bar w$, $\bar a$, $\Phi$, $\Psi$ of the appropriate dimensions, with $\Phi$ collecting the null-space eigenfunctions. We stress that the assumption of a square integrable target $\bar f$ is identical to the condition that the target possess finite variance under the data density $p(x)$, a very mild assumption which gives our theory significant generality. The kernel regression task reduces to the minimization, over the learned weights $w_\rho$, of an energy function $H(w;\mathcal{D})$ consisting of the squared training error (in which $\bar a$, the coefficients of the teacher for the null modes indexed by $\mathcal{N}$, enters through the labels) plus the ridge penalty $\lambda\|w\|^2$. The generalization and training errors are defined as $E_g(\mathcal{D}) = (w - \bar w)^\top\Lambda\,(w - \bar w) + \|\bar a\|^2$ and the average squared error on the training set, respectively, where we introduced the diagonal matrix of the spectrum $\Lambda_{\rho\gamma}\equiv\eta_\rho\delta_{\rho\gamma}$. The last term in $E_g(\mathcal{D})$, $\|\bar a\|^2$, represents the inability of the kernel to place any power in the modes indexed by $\mathcal{N}$ and is thus an irreducible estimation error, present even in the $P\to\infty$ limit, which we denote as $E_g(\infty)$. Our main goal is to average $E_g(\mathcal{D})$ and $E_{tr}(\mathcal{D})$ over all possible realizations of $\mathcal{D}$ of fixed size $P$.

Supplementary Note 2 -Replica Calculation
We perform a replica calculation to find the expected values of the training error, the estimator and its RKHS norm, as well as the generalization error. To set up our statistical mechanics problem, we first introduce the partition function (Supplementary Equation 12)
$$Z = \int dw\; e^{-\beta H(w;\mathcal{D}) + \beta\,\xi\cdot w + \frac{\beta}{2}\,\chi^\top w\,w^\top\chi},$$
which concentrates around the weights $w^*$ of the kernel regression solution in the $\beta\to\infty$ limit. The last two terms in the exponent of Supplementary Equation 12 are used to calculate the expected estimator $\langle f(x)\rangle$ and its correlation function by differentiating $\log Z$ with respect to the source fields, where the primed quantities are defined as $\xi' = \Lambda^{1/2}\xi$ and $\chi' = \Lambda^{1/2}\chi$. Here $\langle\cdot\rangle_w$ refers to the ensemble average over all possible $w$'s with probability weights given by $e^{-\beta H(w;\mathcal{D})}$. We will eventually consider the $\beta\to\infty$ limit to obtain the quantities above corresponding to the kernel regression solution.
In order to perform the training set average $\langle O(\mathcal{D})\rangle_{\mathcal{D}}$ for the quantities above, we must average $\log Z$ over all possible training samples and noises. Resorting to the replica trick, averaging $\log Z$ reduces to averaging the $n$-times replicated partition function $Z^n$, where we shifted $w^a\to w^a + \bar w$ and defined $\bar W = (\bar w - \xi') - (\chi'\cdot\bar w)\,\chi'$ for notational simplicity. Now we can average over the quenched disorder introduced by the training samples and noise before integrating out the thermal degrees of freedom.

Averaging over Quenched Disorder
The quantity of interest is the disorder average of the replicated, dataset-dependent part of $Z^n$. Rather than integrating over $x$, we integrate over $q^a = w^a\cdot\Psi(x) + \bar a\cdot\Phi(x) + \epsilon$, which is itself a random variable with a mean and covariance across replicas, where $\Sigma = (\|\bar a\|^2 + \sigma^2)\,\mathbf{1}\mathbf{1}^\top$ is the covariance matrix of the noise across replicas. The computation of this mean and covariance relies on the orthogonality of the eigenfunctions. Notice that the target modes in the null space of the kernel act as noise on the training labels, quantifying the effect of out-of-RKHS target functions. From here on, we take $\|\bar a\|^2 = 0$ for simplicity. At the end of the calculation it can be recovered by shifting $\sigma^2\to\sigma^2 + \|\bar a\|^2$ and $E_g\to E_g + \|\bar a\|^2$. Note that the noise-free part of the diagonal elements represents the generalization error in a single replica, i.e. $C_{aa} = E_g^a + \sigma^2$, while the off-diagonal elements give the overlap of the weights across different replicas. In the limit $\beta\to\infty$, we expect these two quantities to be equal, as the optimal weights averaged over training samples across different replicas will be the same due to the convexity of the problem.
First, we omit the zero mode µ by considering only kernels with zero constant terms (such that η 0 = 0), however it is straightforward to include it in the rest of the calculation and we give the contribution from zero mode in our final expression. Next, by observing that q a is a summation of many uncorrelated random variables ( ψ ρ (x)ψ ρ (x) x∼p(x) = η ρ δ ρρ ) and a Gaussian noise, we approximate the probability distribution of q a by a multivariate Gaussian with its mean and covariance given by Supplementary Equation 18: This approximation is further validated with the excellent match of our theory to simulations. Then the average over quenched disorder reduces to: where we defined β K ≡ β(1 − K) for notational convenience. Combining everything together, the dataset averaged replicated partition function becomes: Using the definitions Supplementary Equation 18, we insert conjugate variables through the following identity: Here, integral overĈ runs over the imaginary axis and we explicitly scaled it by P . Then defining: we obtain: Therefore, we only need to evaluate the integral in G S . First we would like to express the ordered sum a≥b w a I − JP Λ − χχ I ab − 2P β ΛĈ ab w b as an unordered sum over a, b. Note that Hence, we obtain: where we defined: In order to evaluate the Gaussian integral, we will cast the function and target weights into an nM dimensional vector: Furthermore, we introduce the nM × nM matrix X as: Finally we denote the integration measure as Dw = a,ρ dw a ρ . With these definitions, G S becomes: Hence, we turned the integral in G S to a simple Gaussian integral. The result is: Now the integral in Supplementary Equation 24 can be evaluated using the method of steepest descent.
In Supplementary Equation 24, we see that all the terms in the exponent are $O(n)$. Furthermore, we will use $P$ as the saddle point parameter going to infinity with a proper scaling. Therefore, defining the function $S$ of the order parameters (Supplementary Equation 30), we obtain the saddle point form of the replicated partition function. The reader may question the dependence of various quantities in $S$ on $P$, since we are taking a $P\to\infty$ limit. This is because we want to keep our treatment general. Depending on the kernel and data distribution, there are other quantities here that can scale with $P$. Specific examples will be given.

Replica Symmetry and Saddle Point Equations
In order to proceed with the saddle point integration, we further assume replica symmetry relying on the convexity of the problem: Therefore, we have C = (C 0 − C)I + C11 andĈ = (Ĉ 0 −Ĉ)I +Ĉ11 . In this case, the matrix X has the form: where: It is straightforward to calculate the inverse of this matrix using Sherman-Morrison-Woodbury where we defined, for shorthand notation. Hence, we get: We also need to calculate the determinant of this matrix which can be done by using Gaussian elimination method to bring it into a block-triangular form. The result is: Taylor expanding the last term using det(I + nC) = 1 + n Tr C + O(n 2 ), we obtain: log det X =n log det G + n Tr X 1 G −1 + O(n 2 ) = n log det G − n PĈ β Tr ΛG −1 + O(n 2 ). (37) Next, using the matrix determinant lemma det A + uv T = det(A)(1 + v T A −1 u), we obtain: where we recovered β K = β(1 − K). Finally, we need to simplify a≥bĈ ab (C ab − Σ ab ) under the replica symmetry up to leading order in n: Therefore, under replica symmetry, the function S given in Supplementary Equation 30 simplifies to: where we recall that G = I − P (2Ĉ 0 −Ĉ)+JβP β Λ − χχ . The saddle point equations of S with respect to C 0 and C are simple: The equation ∂S/∂Ĉ = 0 yields: and the equation ∂S/∂Ĉ 0 = 0 yields: Two commonly appearing forms are: Plugging second equation to the expression for G, we get: hence we obtain the following implicit equation: In terms of κ, final saddle point equations reduce to: Here * indicates the quantities that give the saddle point. Finally, solving for C * in the last equation, we obtain: Having obtained the saddle points, we can evaluate the saddle point integral. In the limit P → ∞, the dominant contribution is: Taking the n → 0 limit and plugging in the saddle point solutions to the expression Supplementary  Equation 40, we obtain the free energy log Z = −P S to be:

Generalization Error
Now we can calculate E_g = lim_{β→∞} (2/(βP)) ∂/∂J log Z |_{J,K,ξ,χ=0}. Recall that κ is itself a function of J. Explicit calculation and the β → ∞ limit yield: where we defined γ = Σ_ρ P η_ρ²/(κ + P η_ρ)². In terms of these quantities, the averaged generalization error becomes: For completeness, we also account for the case where the target function f̄(x) = w̄ · Ψ(x) + Σ_{ρ∈N} a_ρ φ_ρ(x) has components outside the RKHS (Supplementary Note 1). The last term accounts for the contribution to E_g from the zero mode when η_0 ≠ 0. Note that at this point we have already taken the P → ∞ limit; P still appears in these expressions to allow for different scaling limits of the kernel eigenvalues. This expression generalizes Eq. (4) in the main text to cases where the target function has modes not expressible by the kernel. These modes contribute an irreducible component to the generalization error, and also act as noise on the expressible modes. Based on this expression, we can define an effective noise σ̃² = σ² + ā², which can lead to non-monotonic learning curves even in the absence of label noise (σ² = 0). An example of this phenomenon is provided in Supplementary Figure 2.
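To make the structure of these expressions concrete, the following is a minimal numerical sketch (ours, not the released code). It assumes the in-RKHS form of the learning curve, i.e. Eq. (4) of the main text plus a label-noise term σ²γ/(1 − γ), with κ solved by fixed-point iteration of the implicit equation κ = λ + Σ_ρ κη_ρ/(κ + Pη_ρ); all function names are illustrative.

    import numpy as np

    def solve_kappa(eta, P, lam, tol=1e-12, max_iter=10_000):
        """Fixed-point iteration for kappa = lam + sum_rho kappa*eta_rho/(kappa + P*eta_rho)."""
        kappa = lam + eta.sum()  # safe starting point (the P = 0 value)
        for _ in range(max_iter):
            new = lam + np.sum(kappa * eta / (kappa + P * eta))
            if abs(new - kappa) < tol * max(new, 1e-30):
                return new
            kappa = new
        return kappa

    def generalization_error(eta, wbar2, P, lam, sigma2=0.0):
        """Assumed form: E_g = (1/(1-gamma)) * sum_rho eta*wbar^2*kappa^2/(kappa+P*eta)^2
                               + sigma^2 * gamma/(1-gamma)."""
        kappa = solve_kappa(eta, P, lam)
        gamma = np.sum(P * eta**2 / (kappa + P * eta)**2)
        mode_errors = eta * wbar2 * kappa**2 / (kappa + P * eta)**2 / (1.0 - gamma)
        return mode_errors.sum() + sigma2 * gamma / (1.0 - gamma), kappa, gamma

    # Example: power-law spectrum with a task-aligned target
    rho = np.arange(1, 1001)
    eta = rho**-2.0 / np.sum(rho**-2.0)
    wbar2 = np.ones_like(eta)
    for P in [10, 100, 1000]:
        Eg, kappa, gamma = generalization_error(eta, wbar2, P, lam=1e-3, sigma2=0.1)
        print(f"P={P:5d}  kappa={kappa:.4e}  gamma={gamma:.3f}  Eg={Eg:.4f}")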

Training Error
Next, we calculate the expected training error, defined as: (54) First, we calculate: A cumbersome calculation yields: where E_g is defined above in Supplementary Equation 53. Using the summation forms of κ, γ and δ, one can show that: Plugging this back in, we obtain: Hence, the average-case training error becomes: This is consistent with the expectation that in the β → ∞ limit the kernel machine fits all training samples perfectly when λ = 0. We also note that the same relation between training and generalization errors was obtained in other works [4,5] using random matrix theory.

Expected Estimator and the Correlation Function
Finally, we calculate the RKHS weights of the expected function and its variance: where the derivatives are with respect to ξ'_α = ξ_α/√η_α and χ'_α = χ_α/√η_α, respectively. Taking derivatives with respect to each entry of ξ', we obtain:

Supplementary Figure 1: Kernel regression with the Gaussian RBF K(x, x') = exp(−||x − x'||²/(2Dω²)) on MNIST, with kernel bandwidth ω = 0.1, ridge parameter λ = 0.01 and noise variance σ² = 0, 0.5. Theoretical predictions for E_g (solid blue line) and E_tr (dashed orange line) are in excellent agreement with experiment (blue dots and orange triangles, respectively). Error bars represent the standard deviation over 30 trials. a. When no label noise is present, E_tr is always smaller than E_g. b. When the noise is non-zero, E_tr may exceed E_g due to the explicit presence of label noise in the loss function.
Hence, the average estimator has the following form: which approaches the target function as P → ∞. Note that the learned function can only express components that lie in the RKHS; if the target function has out-of-RKHS components, those will never be learned. A related result was given in [6], which used a similar technique to calculate the expected estimator. With our notation, their perturbative result for uniform spherical data and dot-product kernels reads (Eq. (24) in [6]): where λ is the ridge parameter and . . . represents higher-order corrections in their perturbative setting. We obtain the same result if we expand our expression for ⟨f*(x; P)⟩_D in a power series in ε ≡ κ − λ, assuming ε is small, and use κ − λ = κ Σ_ρ η_ρ/(P η_ρ + κ): Hence, our result includes corrections not captured in [6]. Finally, we calculate the correlation function of the estimator. Given the partition function Z = ∫ dw e^{−βH(w;D) + βξ·w + (β/2) χ^⊤ w w^⊤ χ}, notice that the variance: vanishes as β → ∞, since there is a unique solution. However, there is variance in the estimator due to averaging over different training sets, given by: and it remains finite as β → ∞. The first term, the eigenfunction expansion coefficients of the correlation function of the estimator ⟨f(x) f(x')⟩, can be calculated by taking two derivatives of log Z with respect to χ'. To simplify the calculation, we first redefine G ≡ [I + (P/κ) Λ − Λ^{1/2} χ' χ'^⊤ Λ^{1/2}]^{-1} by setting J = K = 0, and introduce the notation ∂_α ≡ ∂/∂χ'_α for brevity. First, we calculate the derivatives of κ: Hence, we find that: This greatly simplifies the second derivative of κ: where γ = Σ_ρ P η_ρ²/(P η_ρ + κ)² as defined before. Next, we calculate the variance of the expected function: Now we can calculate the coefficients of the covariance of the estimator: Hence, the covariance of the estimator is: Note that the generalization error can be decomposed as: From the calculation above, we find: where the first line is the contribution to the generalization error due to the estimator variance. Hence, the generalization error is: which is what we obtained before. In terms of the variance of the estimator, E_g is also equal to: This is the bias-variance decomposition of the generalization error in our setting, where the bias term is monotonically decreasing while the variance term is solely responsible for any non-monotonicity appearing in the generalization error.

Supplementary Note 3 - White Band-limited RKHS Spectrum
As a simple but illuminating example, we consider a kernel with a band-limited spectrum: η_ρ = 0 for ρ > N. For simplicity, we study the case where the spectrum is white, η_ρ = 1/N for all ρ = 1, ..., N, and study this system in the large-N, large-P limit with α = P/N ∼ O(1). We normalize the target power in the first N modes, Σ_{ρ=1}^N w̄_ρ² = N, and allow the target function to have coefficients a_ρ for all ρ > N outside the RKHS. At the saddle point, the implicit equation for κ can be solved explicitly in the P, N → ∞ limit with α = P/N ∼ O(1).
The generalization error, Supplementary Equation 53, becomes: where E_g(∞) = Σ_{ρ>N} a_ρ² is the asymptotic value of the generalization error and σ̃² = σ² + E_g(∞) is the effective noise. The first term is the noiseless contribution to E_g, while the second term is due only to the noise in the target, coming both from the explicit noise σ² and from the variance of modes in the null space of the kernel, E_g(∞). In the following, we simply denote σ̃² by σ², since it is the only noise relevant to the generalization error formula. The generalization error falls asymptotically faster in the absence of noise: Furthermore, explicit calculation reveals that the noiseless term monotonically decreases with α, while the noise term has a maximum at α = 1 + λ, whose value is: In the presence of noise, the generalization error diverges when λ → 0, while a finite λ smooths out the learning curve. In machine learning, this non-monotonic behavior of the generalization error is called "double descent", and it signals overfitting of the noise in the data [7,8,9]. The diverging generalization error further implies a first-order phase transition at α = 1 + λ = 1. This can be seen by examining the first derivative of the free energy, Supplementary Equation 50, in the β → ∞ limit: where the approximation is valid for λ ≪ 1. We observe that, in the absence of noise, there is no phase transition, while in the noisy case there is a sharp discontinuity and divergence when λ = 0. Although there is no phase transition in the strict sense of a non-analytic free energy except for the case λ = 0, we refer to the presence or absence of non-monotonicity as separate phases of the kernel machine.
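As a concrete illustration of this special case, the sketch below (our own, restated in our conventions) uses the explicit solution of the quadratic κ = λ + κ/(κ + α), namely κ = ½[(1 + λ − α) + √((1 + λ − α)² + 4λα)], together with γ = α/(κ + α)², and evaluates E_g(α) for a unit-power in-RKHS target with effective noise σ̃², showing the rise and fall of the learning curve around α ≈ 1 + λ.

    import numpy as np

    def kappa_white(alpha, lam):
        """Explicit root of kappa = lam + kappa/(kappa + alpha) (white band-limited spectrum)."""
        b = 1.0 + lam - alpha
        return 0.5 * (b + np.sqrt(b**2 + 4.0 * lam * alpha))

    def Eg_white(alpha, lam, sigma2_eff, Eg_inf=0.0):
        """E_g(alpha) for unit in-RKHS target power; sigma2_eff = sigma^2 + E_g(inf)."""
        kappa = kappa_white(alpha, lam)
        gamma = alpha / (kappa + alpha)**2
        noiseless = kappa**2 / ((kappa + alpha)**2 * (1.0 - gamma))
        noise = sigma2_eff * gamma / (1.0 - gamma)
        return Eg_inf + noiseless + noise

    lam, sigma2 = 0.05, 0.5
    for alpha in [0.25, 0.5, 1.0, 1.0 + lam, 1.5, 2.0, 4.0]:
        print(f"alpha={alpha:4.2f}  Eg={Eg_white(alpha, lam, sigma2):.4f}")
    # The printed values rise to a peak near alpha ~ 1 + lam and then decay: the double-descent peak.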
We would like to understand which combinations of (λ, σ²) lead to non-monotonicity in the generalization error. One can obtain the exact phase boundary for non-monotonicity by studying the zeros of ∂E_g/∂α, given by: Explicit calculation yields: where g(λ) is: In the non-monotonic region, we further observe that the curve σ²_critical = 2λ + 1 for λ < 1 separates a region with a single local extremum from one with two local extrema. Above this curve there is a single local maximum corresponding to a peak, while below it there is a local minimum followed by a local maximum (double descent). As a side note, we emphasize that double descent may occur when target functions have out-of-RKHS components, even without label corruption. In Supplementary Figure 2, the target function is chosen such that E_g(∞) = 0.2 in Supplementary Equation 79. Since this produces a non-zero effective noise, we observe double descent even in the absence of label noise.
Supplementary Figure 2: Zero-noise target function with components outside of the RKHS. We simulated the linear kernel K(x, x') = Σ_{ρ=1}^N η_ρ x_ρ x'_ρ, where x ∼ N(0, I) are N = 960 dimensional uncorrelated Gaussian inputs, and η_ρ = 0 for ρ > 800, η_ρ = 1 otherwise. The target is a linear function f̄(x) = β · x with β ∼ N(0, (1/N) I). This experiment demonstrates that out-of-RKHS components generate an irreducible error E_g(∞) (here E_g(∞) = 0.2) and act as effective noise, producing a double-descent peak. The ridge parameter is chosen to be λ = 0.01. Error bars represent the standard deviation over 50 trials.
Although a large λ regularizes the learning and avoids an overfitting peak, a too-large λ also slows down learning, as can be seen from the large-λ asymptotics of Supplementary Equation 82: To find an optimal choice of the ridge parameter, we study the first derivative of E_g with respect to λ and find that there is an optimal λ for a given noise level σ², independent of α: This simple relation holds for all α and indicates that the optimal choice of regularization leads to a monotonic learning curve, as expected (see Fig. 3 in the main text). Note that the error due to the noise term is decreasing in λ, while the noise-independent term is increasing in λ. Finally, in Supplementary Figure 3 we numerically plot the value of α at which the learning-curve peak occurs as a function of noise for varying λ. We observe that, for large noise levels, the location of the peak approaches α = 1 + λ.
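A quick numerical check of this optimality condition (a sketch under our conventions for the white band-limited learning curve, not the released code): sweeping λ at fixed α and σ² and locating the minimizer of E_g, which lands at λ ≈ σ² for every α tested.

    import numpy as np

    def Eg_white(alpha, lam, sigma2):
        b = 1.0 + lam - alpha
        kappa = 0.5 * (b + np.sqrt(b**2 + 4.0 * lam * alpha))
        gamma = alpha / (kappa + alpha)**2
        return (kappa**2 / (kappa + alpha)**2 + sigma2 * gamma) / (1.0 - gamma)

    sigma2 = 0.3
    lams = np.linspace(1e-4, 2.0, 20_000)
    for alpha in [0.5, 1.0, 2.0, 5.0]:
        errs = Eg_white(alpha, lams, sigma2)
        print(f"alpha={alpha:3.1f}  argmin over lambda ~ {lams[np.argmin(errs)]:.3f}  (sigma^2 = {sigma2})")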

Connection to Random Matrices
The quantities of interest in the generalization error formula for the white band-limited spectrum are related to the spectral distribution of random Wishart matrices. In particular, let X ∈ R^{N×P} be a random Gaussian matrix with entries X_ij ∼ N(0, 1); then the quantity κ_λ(α) can be directly related to the Stieltjes transform (resolvent) of the matrix (1/N) X X^⊤ + λI. The relationship is: where α = P/N in the P, N → ∞ limit [10,11]. If one wanted to compute the average-case (over design matrices X) generalization error for linear regression, this random matrix arises naturally from the solution of the ridge-regularized least-squares problem.
If the target values are produced by a teacher y = (1/√N) X^⊤ w̄, then: This matrix, and its average over the design X, has also been studied intensively in previous work on the generalization error of kernel regression [12,13,14]. Differentiation of this averaged matrix gives the desired result in the N → ∞ limit: Since the distribution of each x_µ is isotropic, all covariance eigenvalues are equal and each component w̄_ρ is learned at an identical rate. This expression is the same as our expression for the white band-limited case, Supplementary Equation 79. Next, we apply our findings to rotation invariant kernels and find that the generalization error decomposes into different learning stages, each of which is described by the same formula we derived here for this special setting.
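The equivalence with isotropic linear regression can also be verified directly by simulation. The sketch below (ours) draws Gaussian inputs x ∼ N(0, I_N), a random teacher w̄, fits ridge regression on P noisy samples with features x/√N (so that all kernel eigenvalues equal 1/N), and compares the test error with the white band-limited theory curve at α = P/N; the specific sizes and ridge value are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def kappa_white(alpha, lam):
        b = 1.0 + lam - alpha
        return 0.5 * (b + np.sqrt(b**2 + 4.0 * lam * alpha))

    def Eg_theory(alpha, lam, sigma2):
        kappa = kappa_white(alpha, lam)
        gamma = alpha / (kappa + alpha)**2
        return (kappa**2 / (kappa + alpha)**2 + sigma2 * gamma) / (1.0 - gamma)

    N, lam, sigma2, trials = 400, 0.1, 0.25, 20
    for alpha in [0.5, 1.0, 2.0, 4.0]:
        P = int(alpha * N)
        errs = []
        for _ in range(trials):
            wbar = rng.standard_normal(N)                  # teacher weights
            A = rng.standard_normal((P, N)) / np.sqrt(N)   # design matrix, rows x_mu / sqrt(N)
            y = A @ wbar + np.sqrt(sigma2) * rng.standard_normal(P)
            w_hat = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)
            errs.append(np.sum((w_hat - wbar) ** 2) / N)   # E_x[(f - fbar)^2]
        print(f"alpha={alpha:3.1f}  simulation={np.mean(errs):.4f}  theory={Eg_theory(alpha, lam, sigma2):.4f}")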

Supplementary Note 4 - Rotation Invariant Kernels
Here, we consider a widely used class of kernels that are left invariant under rotations of the inputs: K(Ox, Ox') = K(x, x'). We start by decomposing rotation invariant kernels into their spherical and radial parts: Lemma 1. Let F_r be the set of functions that are invariant under all rotations that leave the vector r ∈ S^{D−1} unchanged; that is, for all f ∈ F_r and all orthogonal matrices O ∈ R^{D×D} with Or = r, f(Ox) = f(x). Any function f ∈ F_r admits a decomposition f(x) = Σ_k a_k(||x||) Q_k(r · x/||x||), where the Q_k are Gegenbauer polynomials. Proof. For f to be invariant under the set of rotations that leave the vector r invariant, the restriction of f to spherical shells of radius ||x|| = R must also be invariant under these rotations. For fixed radius R, the set of all such rotation-invariant functions lies in span{Q_k(r · /|| · ||)}, since the Gegenbauer polynomials are complete with respect to the measure of inner products on S^{D−1}. Repeating this decomposition for each restriction radius ||x|| gives the radius-dependent coefficients a_k(||x||).
Using this lemma, we obtain the following decomposition for rotation invariant kernels (K(Ox, Ox') = K(x, x')), by first considering the rotations O that leave x unchanged and then the rotations O that leave x' unchanged: To calculate the eigenspectrum, we insert an ansatz of the form φ_{zkm}(x) = R_{z,k}(||x||) Y_{km}(x) into the eigenvalue problem, which gives a collection of radial eigenvalue problems (one for each degree k of spherical harmonics): ∫_0^∞ d||x|| p(||x||) g_k(||x||, ||x'||) R_{z,k}(||x||) = η_{z,k} R_{z,k}(||x'||).
For each k, we solve the integral eigenvalue problem for a set of functions {R_{z,k}(||x||)}_z that are orthonormal with respect to p(||x||). After solving these radial eigenvalue problems, we obtain the following Mercer decomposition of the kernel: where η_{z,k} are the eigenvalues of this decomposition, R_{z,k}(||x||) denotes the radial dependence, and Y_{km} are hyper-spherical harmonics in D dimensions. The eigenvalues η_{z,k} are the same for every m for each (z, k) mode, so each of them is at least N(D, k)-fold degenerate. In the most general case, the generalization error for a rotation invariant kernel is given by Supplementary Equation 53 with the ρ summation replaced by a sum over (z, k, m). As noted above, the eigenvalues η_{z,k,m} = η_{z,k} are the same for m = 1, ..., N(D, k). For a given k, the η_{z,k} for different z may differ, but they share the same degeneracy N(D, k). In this case, we choose the scaling α = P/N(D, l) ∼ O_D(1) and take the limit P, D → ∞. We find that the resulting generalization error decouples over different k modes for a general target function f̄(x) = Σ_{z,k,m} w̄_{z,k,m} φ_{z,k,m}(x): where we defined the following O(1) quantities: The first term corresponds to learning the mode-l features, while the second term corresponds to the higher modes. Note that γ(α = 0) = γ(α = ∞) = 0, meaning that the modes k > l are not being learned during learning stage l. The last term is the noise contribution to E_g. Furthermore, the self-consistent equation for κ simplifies to a polynomial equation of degree #(z) + 1 instead of degree #(z) + #(l) + 1, where #(z) and #(l) denote the total number of z and l modes, respectively. We thus find that, for generic rotation invariant kernels in the D → ∞ limit, eigenvalues with different degeneracies N(D, k) decouple into different learning stages. However, kernels with further symmetries, such as translational invariance, can have eigenvalues with larger degeneracies. To take this case into account, we introduce the following notation: η_K denotes the degenerate eigenvalues, indexed by an integer K that potentially represents different combinations of (z, k), and φ_{K,ρ} denotes the corresponding eigenfunctions, where ρ collectively denotes the degenerate indices. In this case, the degeneracy of each mode K is denoted by N(D, K), which can be larger than the degeneracy of the spherical harmonics. Considering the case where there is only a single eigenvalue for integer mode K with degeneracy N(D, K), the self-consistent equation for κ for learning stage L in Supplementary Equation 96 becomes a quadratic equation, and we obtain the following solution: where κ̃ is κ scaled by η̃_K, and η̃_K = N(D, K) η_K. In this case, we choose the scaling P = α N(D, K). This formula is the same as in the white band-limited example, except for a more complicated effective regularization λ̃_L. Therefore, each learning stage behaves in the same way as the white band-limited case, and in the presence of noise we may observe multiple descents, one associated with each learning stage. As in the white band-limited case, κ̃ is a monotonically decreasing function of α. The effective regularization λ̃_L controls the decay rate of κ̃ and is completely fixed by the kernel eigenspectrum and the explicit ridge parameter. For larger λ̃_L the decay of κ̃(α) is slower, while for λ̃_L = 0 the decay is fastest. In fact, for the special case λ̃_L = 0 the decay rate is discontinuous and the second derivative of κ̃ diverges at α = 1 + λ̃_L = 1.
With these definitions, γ becomes: As in the white band-limited example, the function γ has a maximum at α = 1 + λ̃_L; as λ̃_L → 0, its maximum approaches γ → 1, while for large λ̃_L its maximum falls like 1/(4λ̃_L). Therefore, in certain cases we expect local maxima or divergences in the generalization error due to the factor 1/(1 − γ), and for larger λ̃_L we expect the effect of the peaks to decrease, λ̃_L acting as an effective regularization [15]. Substituting these definitions into Supplementary Equation 53, we obtain the generalization error for rotation invariant kernels as: where E_K is the asymptotic value of the generalization error and the superscript (L) indicates that we are considering the scaling P = N(D, L)α. The particular form in which we present E_g^{(L)} is useful for studying the α dependence of the generalization error across different modes L, since the right-hand side depends functionally only on α and λ̃_L, which is completely fixed by the full RKHS spectrum. Asymptotically, the first term decreases monotonically as 1/α², while the second term has a maximum at α = 1 + λ̃_L with magnitude: at which the generalization error may display a peak with an increasing number of training samples. Therefore, we conclude that the "double descent" behavior can only arise due to noise in the target, consistent with the observations of [16]. We also observe that the effective noise σ̃²_L includes the errors from the higher modes, which thus act like noise in the generalization error. Note that the effective noise can depend on the degeneracy scale N(D, L) due to the weight factor in its denominator.
From the particular form of the generalization error in Supplementary Equation 96, we observe that there is a trade-off between the noiseless and the noisy terms, and it is not obvious for which combinations of σ̃²_L and λ̃_L the generalization error has a local maximum. As in the white band-limited case, we obtain a phase diagram by identifying where on the (λ̃_L, σ̃²_L) plane the first derivative of Supplementary Equation 96 vanishes: In the region above this curve, where non-monotonicity occurs, we further observe that the curve σ̃²_L = 2λ̃_L + 1 for λ̃_L < 1 separates a region with a single local extremum from one with two local extrema.
Here, similar to the white band-limited case, we find an optimal λ̃*_L = σ̃²_L for each learning stage L, which achieves the minimum generalization error for all α.
This analysis allows us to understand the dependence of non-monotonic learning behavior on the kernel spectrum by studying λ̃_L. Let us first consider the case η̃_L ∼ O(s^{−L}) for some s > 1, which is relevant for the Gaussian kernel example. In the ridgeless (λ = 0) limit, λ̃_L is given by: Here, λ̃_L is the same for all L. We observe that as the spectrum decays faster, the generalization error may feature larger peaks, since the regularization λ̃_L → 0. Therefore, faster decaying spectra are more likely to cause non-monotonic learning curves than slower decaying ones. Another example is η̃_L ∼ O(L^{−s}), which is more relevant for studying neural networks. In this case, we have: where ζ(s, L) is the Hurwitz zeta function, defined as: where in the last step we performed the change of variables x → tx. To understand the spectrum dependence of λ̃_L, we approximate ζ(s, L) ≈ L^{−s+1}/(s − 1) for large L. Then λ̃_L simplifies to: As with the exponential spectrum, the regularization again falls as the spectrum decays faster. Furthermore, we see that the regularization λ̃_L increases at least linearly with L (linearly for λ = 0, and as a power law for λ > 0), meaning that non-monotonicity becomes less visible for higher modes. One can also think of the quantity λ̃_L as an "effective ridge parameter" that regularizes the higher-order modes, preventing them from fitting random noise and thus keeping the estimator smoother. This can be viewed as an example of implicit regularization in learning machines, where more complicated features (higher modes) are implicitly chosen not to be learned, since their learning also slows down with sample complexity as λ̃_L grows. This property of the power-law spectrum keeps the learned function smoother; it can also be interpreted as explicitly regularizing the learning with a mode-dependent ridge parameter λ = L/(s − 1).
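The mode dependence of the effective regularization is easy to evaluate numerically. The sketch below (ours; it assumes the ridgeless definition λ̃_L = Σ_{K>L} η̃_K / η̃_L, consistent with the asymptotics quoted above) computes λ̃_L for a power-law spectrum η̃_K ∼ K^{−s} using the Hurwitz zeta function and compares it with the large-L approximation L/(s − 1).

    import numpy as np
    from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

    def effective_ridge(L, s, lam=0.0):
        """Assumed definition: lambda_tilde_L = (lam + sum_{K>L} K^-s) / L^-s for eta_K ~ K^-s."""
        tail = zeta(s, L + 1)        # sum_{K = L+1}^inf K^-s
        return (lam + tail) * L**s

    s = 3.0
    for L in [1, 2, 5, 10, 50, 200]:
        exact = effective_ridge(L, s)
        approx = L / (s - 1.0)
        print(f"L={L:4d}  lambda_tilde={exact:8.3f}  L/(s-1)={approx:8.3f}")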
Next, we consider concrete examples of the theory.

Gaussian Kernel
As a popular example, we study the Gaussian kernel, which additionally possesses translational symmetry. Let p(x) = N(0, r² I) be the data distribution on the input space R^D and K(x, x') = exp(−||x − x'||²/(2Dω²)) be the Gaussian kernel. For this density and kernel, the eigenfunctions and eigenvalues can be computed exactly [1]:

Supplementary Figure 4: Kernel eigenspectra for the Gaussian RBF on a Gaussian measure in the D → ∞ limit. a. Larger-bandwidth spectra decay more rapidly with increasing K. b. The optimal bandwidth ω* as a function of the effective noise σ̃². Small-bandwidth kernels are preferred for late learning stages (large K) and large effective noise σ̃². For small σ̃², the optimal bandwidth satisfies ω*² ∝ σ̃^{−2}, as predicted by the approximation obtained in the r ≪ ω limit.
The optimal bandwidth is obtained by setting σ̃²_K = λ̃_K. Under the assumption that the kernel bandwidth is large, ω² ≫ r², we find that λ̃_K ∼ r²/((1 + K)ω²), so that the optimal bandwidth for learning stage K and noise level σ̃² satisfies ω*² ≈ r²/((1 + K)σ̃²). The effective eigenvalues η̃_K for each learning stage K and different bandwidths are shown in Supplementary Figure 4a. In the infinite-dimension limit D → ∞, the effective regularization is controlled by the bandwidth of the kernel, resulting in an optimum ω*, which is plotted in Supplementary Figure 4b. Supplementary Figure 5 shows kernel regression on a target function of the form: where K is the Gaussian kernel with variance ω² and the α_i are drawn from a Bernoulli distribution. Generating P noisy labels from this function, we perform kernel regression and calculate the generalization error on randomly generated test data. We repeat this process many times to obtain the training- and target-dataset averaged generalization error (see Supplementary Note 5 for simulation details). The kernel regression experiment matches the theory prediction almost perfectly, as can be seen in Supplementary Figure 5.

Next, we consider kernel ridgeless regression (λ = 0) with a kernel having a power-law spectrum η̃_k = k^{−s}. In this case, the effective regularization λ̃_l increases with the mode l, since λ̃_l ≈ l/(s − 1). With a procedure similar to the Gaussian RBF example, we set the target weights w̄_k² = η_k = η̃_k/N(D, k). A kernel regression experiment and the corresponding prediction are shown in Supplementary Figure 6a. Note that we use the finite-P version of the generalization error to produce this plot, meaning that the theory is perfectly predictive even without taking the infinite-P limit. In each panel, the label noise variance is chosen relative to g(λ̃_k) for the learning stage k = 1, 2, 3. When σ̃²_k < g(λ̃_k), the generalization error falls monotonically at learning stage k, and when σ̃²_k > g(λ̃_k) we observe double descent. Supplementary Figure 6b, on the other hand, demonstrates how the infinite-P and infinite-D limit of E_g, Eq. (9) in the main text, predicts the regression experiments at each learning stage (corresponding to different panels) when P is finite. We observe that a later learning stage starts before the previous learning stage saturates to its asymptotic value.

The relevance of power-law spectra and dot-product kernels to deep neural networks comes from the correspondence between infinitely wide neural networks and ridgeless kernel regression [17]. Consider a neural network with L hidden layers, with n_ℓ = n units in each of these layers for ℓ = 1, . . . , L, and n_0 = D being the input dimension. We initialize the weights in each layer randomly.

Supplementary Figure 7: a. Empirically, the spectrum η̃_l = η_l N(D, l) becomes white as more layers are added. b. Furthermore, we confirm that the spectrum η̃_l is independent of the input dimension for large D.

Dot-Product Kernels and Neural Tangent Kernel
With this initialization, the network function is: where σ is a non-linearity; we consider only the Rectified Linear Unit (ReLU). Training the network parameters θ with gradient flow on a squared loss to zero training error yields the same function as ridgeless kernel regression with the Neural Tangent Kernel (NTK) [17,18,19]. This kernel can be obtained heuristically by linearizing the neural network function f(x, θ) around its initial set of parameters θ_0: f(x, θ) ≈ f(x, θ_0) + ∇_θ f(x, θ_0) · (θ − θ_0). Optimizing the mean squared regression error over θ is then equivalent to solving a linear regression problem in θ, where the feature Gram matrix is formed from the initial parameter gradients: K_{NTK,ij} = ∇_θ f(x_i, θ_0) · ∇_θ f(x_j, θ_0). In the infinite-width limit, this quantity converges to its average over all possible initializations θ_0, giving rise to the deterministic NTK [17]. As an example, the exact form of the NTK for the ReLU non-linearity and zero bias is given by: where θ = cos^{-1}(x · x'/(||x|| ||x'||)) and f(θ) = (1/π)[sin(θ) + (π − θ) cos(θ)]. By projecting this function onto the Gegenbauer polynomials, we can obtain the spectrum of the NTK for any number of layers [12]. We empirically observe that the eigenvalues obey a power law for large modes, as seen in Supplementary Figure 7.
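For reference, the sketch below (ours, with normalization conventions that may differ from the text by constant factors) evaluates a one-hidden-layer ReLU NTK on the unit sphere, K_NTK(cos θ) = cos θ · (π − θ)/(2π) + [sin θ + (π − θ) cos θ]/(2π), and estimates its leading eigenvalues by diagonalizing the Gram matrix of points sampled uniformly from S^{D−1} (a Monte Carlo approximation of the Mercer spectrum).

    import numpy as np

    def ntk_relu_1hl(cos_theta):
        """One-hidden-layer ReLU NTK on the unit sphere (up to overall normalization)."""
        theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
        sigma1 = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)   # NNGP term
        sigma1_dot = (np.pi - theta) / (2.0 * np.pi)                                 # derivative term
        return cos_theta * sigma1_dot + sigma1

    # Monte Carlo estimate of the kernel spectrum on S^{D-1}
    rng = np.random.default_rng(0)
    D, M = 10, 2000
    X = rng.standard_normal((M, D))
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # uniform points on the sphere
    K = ntk_relu_1hl(X @ X.T)
    eigs = np.sort(np.linalg.eigvalsh(K))[::-1] / M        # approximate Mercer eigenvalues
    print("leading eigenvalues:", np.round(eigs[:8], 4))
    # Blocks of (approximately) degenerate eigenvalues with multiplicities N(D, k) are visible,
    # followed by a fast-decaying tail.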
Having obtained the kernel and its spectrum, we perform kernel regression with the exact infinite-width NTK and train the corresponding finite-width neural network. In Supplementary Figure 8a, we show the results for fitting a pure-mode target function f̄(x) = a_k Q_k^{(D−1)}(β · x), which has vanishing weights except for a single mode k; β is randomly generated. We find that our theory describes NTK regression perfectly, while the neural network experiments show deviations from the theory at large P, possibly due to finite-size effects. Indeed, increasing the width leads to a better match between the experimental neural network risk and our theory, as shown in Supplementary Figure 8b.

Supplementary Figure 8: a. The target function is the same as in Fig. 6 in the main text and is explained in detail in Supplementary Note 5. Solid lines are the theoretically predicted learning curves, dots represent NTK regression, and × represents E_g after neural network training. b. Generalization error for a 2-layer NN with a varying number of hidden units. We observe that increasing the width brings the learning curve closer to the NTK regression theory (dashed lines). Error bars represent the standard deviation over 15 trials for kernel regression and 5 trials for the neural network experiments.
We further observe a discrepancy between NTK regression and neural network training at low P, where the generalization error is systematically biased towards higher values than its kernel regression counterpart. While this effect remains to be fully understood and requires further study of the correspondence between finite-width neural networks and NTK regression, we empirically observe that neural network ensembling [20,16,9] removes this discrepancy (Supplementary Figure 9b). Neural network ensembling is known to reduce the variance due to random parameter initializations [16], and hence it also prevents the high-P mismatch we observe in Fig. 6 in the main text and Supplementary Figure 9a.

Approximate Scaling of Learning Curves
A coarse approximation to our derived learning curves can be deduced, provided that the kernel eigenspectrum obeys a certain property. The theoretical mode-error equations reveal that the mode errors decay at different rates, set by the kernel eigenvalues {η_ρ}. We exploit this fact to identify the number of eigenmodes that have been estimated once P samples are taken. This is equivalent to identifying the mode ρ*(P) which satisfies κ(P) ≈ P η_{ρ*(P)}. (114)

Supplementary Figure 9: a. 2-layer NTK regression and the corresponding neural network training with 4000 hidden units for D = 25 and varying noise levels. The target function is the same as in Fig. 6 in the main text and is explained in detail in Supplementary Note 5. Solid lines are the theoretically predicted learning curves, dots represent NTK regression, and × represents E_g after neural network training. Notice the mismatch between regression and the neural network experiment in the low-P regime. b. The same experiment with 10-fold ensembling. We observe that neural network ensembling reduces the mismatch in both the low- and high-P regimes. Error bars represent the standard deviation over 15 trials for kernel regression and 5 trials for the neural network experiments.
If κ ≫ P η_ρ, then mode ρ is not yet being learned, while P η_ρ ≫ κ implies that mode ρ has already been estimated accurately. The self-consistency condition for κ in the λ → 0 limit is: Using the assumption that κ(P) ≈ P η_{ρ*(P)}, we find: The solution ρ* ∼ P is self-consistent provided that the effective regularization grows sublinearly: This condition is not guaranteed to be satisfied, but if it holds, then the scaling ρ*(P) ∼ P is valid and we can approximate the remaining error by the power in all modes ρ > ρ* ≈ P, which indicates that, with P samples, the expected generalization error can be approximated by a tail sum of the power in the target function. Examples of spectral decay rates that satisfy the sufficient condition above include power laws η_ρ ∼ ρ^{−b} and exponential decays η_ρ ∼ u^ρ, where b and u are constants. In Supplementary Figure 10, we compare this coarse-grained theory to the MNIST experiments reported in Fig. 1e in the main text.

Supplementary Figure 10: a. Approximate learning curves for the easy and hard tasks in Fig. 1 in the main text, computed using tail sums. b. The effective regularization is only weakly sublinear in ρ.
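The tail-sum heuristic is easy to evaluate directly. The sketch below (ours) compares the full theory curve, in the form used earlier, with the approximation E_g(P) ≈ Σ_{ρ>P} η_ρ w̄_ρ² for a power-law spectrum with an aligned target; the specific exponent and mode count are illustrative.

    import numpy as np

    def theory_Eg(eta, wbar2, P, lam=0.0):
        kappa = lam + eta.sum()
        for _ in range(2000):                              # fixed-point iteration for kappa
            kappa = lam + np.sum(kappa * eta / (kappa + P * eta))
        gamma = np.sum(P * eta**2 / (kappa + P * eta)**2)
        return np.sum(eta * wbar2 * kappa**2 / (kappa + P * eta)**2) / (1.0 - gamma)

    rho = np.arange(1, 20001)
    eta = rho**-1.5                                        # power-law kernel spectrum
    wbar2 = np.ones_like(eta)                              # aligned target, unit weights
    for P in [10, 50, 200, 1000, 5000]:
        tail = np.sum(eta[P:] * wbar2[P:])                 # power in modes rho > P
        print(f"P={P:5d}  full theory={theory_Eg(eta, wbar2, P):.4e}  tail sum={tail:.4e}")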
We consider target functions of the form f̄(x) = Σ_k a_k Q_k^{(α)}(ω_k · x), where ω_k is a random projection vector on S^{D−1} and the a_k are coefficients. Using the identity [21], we can expand this target function in terms of the features ψ_{km}(x) = √η_k Y_{km}(x): Therefore, we obtain: where we used Q_k^{(α)}(1) = (α/(k + α)) N(D, k) [21]. Having obtained the weights, we can find the generalization error using Supplementary Equation 53: where κ = λ + κ Σ_k N(D, k) η_k/(κ + P η_k) is solved self-consistently and γ = Σ_k N(D, k) P η_k²/(κ + P η_k)². Hence this formula allows us to construct a wide range of target functions and to calculate the theoretical generalization error for regression tasks on them. In our experiments, we consider two types of target functions, as detailed below.

Pure Target Functions
We consider target functions of the form f̄(x) = a_k Q_k^{(α)}(ω · x), where k is an integer. These functions contain a single mode and are simpler to evaluate numerically. Throughout our experiments, we picked a_k = η_k (k + α)/α so that w̄_k² = η_k. Hence the generalization error is simply: For the synthetic neural network experiments in Fig. 6 in the main text and Supplementary Figure 8, we use this type of target function with k = 1 (hence a linear target function). The code referenced in the Methods section can be used to reproduce the results in this paper, as well as to experiment with different target functions, including mixed target functions.

Target Function with Non-zero Weights and Synthetic Kernel Regression Experiments
Generating a target function with all weights non-zero using the method described above may be computationally expensive, because sampling many high-dimensional spherical harmonics is not efficient. Instead, we devise the following method: • We generate the target function using the representer theorem. We choose P target examples {x̄_µ} (different from the training set) on the sphere and observe that: We note that the w̄_ρ are random variables. To calculate their statistics, suppose we draw ᾱ_µ for each example i.i.d. from a distribution with mean 0 and variance 1/P. Then, averaging over many {x̄_µ} with large P, the mean and variance of w̄_ρ are: For large P, ⟨w̄_ρ²⟩ concentrates around η_ρ, which we use in our theoretical calculations.
We also allow corruption of the samples by Gaussian noise: where the noise for each sample satisfies ⟨ε_µ ε_ν⟩ = σ² δ_µν.
• To solve the kernel regression problem, we again use the representer theorem. Given P training samples {x_µ}, the solution is of the form: Plugging this into the kernel regression problem, with samples y_µ = f̄(x_µ) + ε_µ generated by the target, we obtain the coefficients: • Once we have these coefficients α, we can express the total generalization error as a sum of mode-wise errors: Recognizing that ρ indexes the pairs (k, m) of the spherical harmonics, we can simplify the mode error to a simple matrix expression: We use this expression to compute the experimental mode errors.
The theoretical generalization error can then be obtained by simply replacing w̄_ρ² with its average η_ρ in Supplementary Equation 53.
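As a minimal, self-contained illustration of this pipeline (our sketch, not the released code), the snippet below generates a random target in the representer form f̄(x) = Σ_µ ᾱ_µ K(x, x̄_µ), adds label noise, solves kernel ridge regression via the representer theorem, α = (K + λI)^{-1} y, and estimates the generalization error on held-out points. The dot-product kernel, its coefficients, and the ridge convention (λ multiplying the identity) are illustrative assumptions and may differ from the paper's setup by constant factors.

    import numpy as np

    rng = np.random.default_rng(1)

    def sphere(n, D):
        x = rng.standard_normal((n, D))
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    def dot_product_kernel(X, Y, coeffs=(0.0, 0.5, 0.3, 0.2)):
        """Simple dot-product kernel K = sum_k c_k (x.y)^k; zero constant term (eta_0 = 0)."""
        G = X @ Y.T
        return sum(c * G**k for k, c in enumerate(coeffs))

    D, P_target, P_train, P_test, lam, sigma = 20, 300, 200, 2000, 1e-3, 0.3

    Xbar = sphere(P_target, D)                              # anchor points defining the target
    abar = rng.standard_normal(P_target) / np.sqrt(P_target)
    fbar = lambda X: dot_product_kernel(X, Xbar) @ abar     # target via the representer form

    Xtr, Xte = sphere(P_train, D), sphere(P_test, D)
    y = fbar(Xtr) + sigma * rng.standard_normal(P_train)    # noisy training labels

    K = dot_product_kernel(Xtr, Xtr)
    alpha = np.linalg.solve(K + lam * np.eye(P_train), y)   # representer-theorem coefficients
    f_hat = dot_product_kernel(Xte, Xtr) @ alpha

    print("estimated generalization error:", np.mean((f_hat - fbar(Xte)) ** 2))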

Details of Neural Network Experiments
To perform the neural network experiments, we use the Neural Tangents package [23]. The label-generating target functions are picked using the method for pure target functions (Supplementary Note 5). After picking P labelled training points, we generate a two-layer neural network with ReLU activation using the NTK initialization [17], with weight variance σ²_W = 1 and bias variance σ²_b = 0. We use hidden-layer widths varying between 4,000 and 50,000. The network is trained with the ADAM optimizer with learning rate 0.008. We average over 5 trials, sampling different data points in each trial.
The corresponding NTK regression experiments are performed with zero regularization (ridgeless), and the theory curves are calculated using the numerically computed NTK spectrum, with the target weights chosen equal to the spectrum.
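For completeness, a minimal sketch of this setup using the Neural Tangents package (illustrative only; the sample sizes, jitter, and helper names are ours, and the full training loop for the finite-width networks is in the released code):

    import numpy as np
    from neural_tangents import stax

    # Two-layer ReLU architecture in the NTK parameterization (W_std = 1, b_std = 0);
    # the finite width (4000) only affects init_fn/apply_fn, not the infinite-width kernel_fn.
    init_fn, apply_fn, kernel_fn = stax.serial(
        stax.Dense(4000, W_std=1.0, b_std=0.0),
        stax.Relu(),
        stax.Dense(1, W_std=1.0, b_std=0.0),
    )

    D, P, P_test = 25, 128, 2000
    rng = np.random.default_rng(0)
    def sphere(n):
        x = rng.standard_normal((n, D))
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    X_tr, X_te = sphere(P), sphere(P_test)
    beta = sphere(1)[0]                      # random direction for a k = 1 (linear) pure-mode target
    y_tr = (X_tr @ beta)[:, None]
    y_te = (X_te @ beta)[:, None]

    # Ridgeless regression with the exact infinite-width NTK
    K_tt = np.array(kernel_fn(X_tr, X_tr, 'ntk'))
    K_et = np.array(kernel_fn(X_te, X_tr, 'ntk'))
    alpha = np.linalg.solve(K_tt + 1e-10 * np.eye(P), y_tr)   # tiny jitter for numerical stability
    E_g = np.mean((K_et @ alpha - y_te) ** 2)
    print(f"NTK regression generalization error at P={P}: {E_g:.5f}")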

White Band-limited Kernel Experiments
The simplest kernel for experimentally probing the white band-limited setting is the linear kernel K(x, x') = x · x', where x ∼ N(0, I). In this setting, the Mercer eigenfunctions are exactly the individual input vector elements, φ_ρ(x) = x_ρ for ρ = 1, ..., N. We used this kernel and a linear target function f_t(x) = β · x with β ∼ N(0, I) in the Fig. 3 experiments in the main text. The number of features, and hence the input dimension, is N = 500.
The one-dimensional plot of the averaged estimator for uniform data on S¹ in Fig. 4 in the main text used the following kernel, which defines an RKHS of dimension 2N with corresponding features ψ_k^cos(x) = √2 cos(kx) and ψ_k^sin(x) = √2 sin(kx). In this experiment, the target function was chosen to be the (centered) von Mises function f̄(x) = e^{4(cos(x)−1)} − I_0(4), where I_0 is the modified Bessel function of the first kind of order 0. The RKHS weights ā_k were calculated by projecting the target onto the features ψ_k^cos(x) and ψ_k^sin(x). To ensure that the target is in the RKHS, we use a band-limited version of f̄. Training data are generated from this target function by randomly choosing x ∈ [−π, π], and we then perform kernel regression to obtain the estimator's RKHS weights. Finally, we evaluate the estimator on the interval [−π, π] to compare the learned function to the target.
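A compact sketch of this one-dimensional experiment (ours; the RKHS spectrum, band limit, and the centering constant e^{−4}I_0(4) are illustrative assumptions) builds a translation-invariant kernel from the cosine/sine features, projects the centered von Mises target onto those features to obtain its band-limited version, and runs kernel regression on random points in [−π, π].

    import numpy as np
    from scipy.special import i0   # modified Bessel function of the first kind, order 0

    rng = np.random.default_rng(2)
    N_modes, lam, P = 30, 1e-4, 40
    eta = 1.0 / np.arange(1, N_modes + 1) ** 2           # assumed RKHS spectrum (illustrative)

    def kernel(x, xp):
        k = np.arange(1, N_modes + 1)
        return 2.0 * np.cos(np.subtract.outer(x, xp)[..., None] * k) @ eta

    # Centered von Mises target and its band-limited projection onto the Fourier features
    f_target = lambda x: np.exp(4.0 * (np.cos(x) - 1.0)) - np.exp(-4.0) * i0(4.0)
    grid = np.linspace(-np.pi, np.pi, 4001)
    ks = np.arange(1, N_modes + 1)
    a_cos = np.array([np.mean(f_target(grid) * np.sqrt(2) * np.cos(k * grid)) for k in ks])
    f_band = lambda x: np.sqrt(2) * np.cos(np.outer(x, ks)) @ a_cos   # sine coefficients vanish by symmetry

    x_tr = rng.uniform(-np.pi, np.pi, P)
    alpha = np.linalg.solve(kernel(x_tr, x_tr) + lam * np.eye(P), f_band(x_tr))
    x_te = np.linspace(-np.pi, np.pi, 500)
    f_hat = kernel(x_te, x_tr) @ alpha
    print("test MSE against the band-limited target:", np.mean((f_hat - f_band(x_te)) ** 2))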

Supplementary Note 6 - Notes on Spherical Harmonics
Here we collect some useful results on spherical harmonics; details can be found in [21]. We are interested in finding a basis for the function space on S^{D−1} ⊂ R^D. Let P_k^D be the space of homogeneous polynomials of degree k in D variables. Its dimension is dim P_k^D = C(D + k − 1, k), where C(n, r) denotes the binomial coefficient. Spherical harmonics Y_km are homogeneous (Y_km(tx) = t^k Y_km(x)) and harmonic (∇² Y_km(x) = 0) polynomials, restricted to S^{D−1}. They are orthonormal with respect to the uniform measure on the sphere. The number of degree-k spherical harmonics in dimension D is N(D, k) = C(D + k − 1, k) − C(D + k − 3, k − 2), where the second term is absent for k < 2. For large dimension D → ∞, this number of degree-k harmonics grows like N(D, k) ≈ D^k/k!. The Gegenbauer polynomial of degree k, Q_k^{(α)}, is related to all of the degree-k spherical harmonics through the addition theorem, where α = (D − 2)/2. We often refer to Q_k^{(α)}(x · y) as Q_k^{(D−1)}(x · y) in the main text to emphasize that it is a polynomial defined on S^{D−1}.
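These counting formulas are straightforward to evaluate; the short sketch below (ours) computes dim P_k^D, the degeneracy N(D, k), and checks the large-D growth D^k/k!.

    from math import comb, factorial

    def dim_homogeneous(D, k):
        """Dimension of the space of homogeneous polynomials of degree k in D variables."""
        return comb(D + k - 1, k)

    def N_harmonics(D, k):
        """Number of degree-k spherical harmonics on S^{D-1}."""
        if k < 2:
            return comb(D + k - 1, k)   # 1 for k = 0, D for k = 1
        return comb(D + k - 1, k) - comb(D + k - 3, k - 2)

    D = 100
    for k in range(5):
        print(f"k={k}:  dim P_k^D={dim_homogeneous(D, k):>10}  "
              f"N(D,k)={N_harmonics(D, k):>10}  D^k/k!={D**k // factorial(k):>10}")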

Relation to Kernel Alignment Risk Estimator (KARE)
A study closely related to our theory of generalization in kernel methods utilizes random matrix theory to arrive at a related, but different, generalization prediction [4]. Their Kernel Alignment Risk Estimator (KARE) was shown to accurately predict generalization on the Higgs and MNIST datasets. The signal capture threshold (SCT) that arises in the KARE theory is equivalent to the quantity we denote κ, which can be interpreted as the resolvent of a generalized Wishart matrix (Supplementary Note 3). The mode-independent prefactor 1/(1 − γ) in our theory, which also plays an important role in the KARE theory, can be obtained by differentiating κ with respect to the ridge, ∂_λ κ. Despite these similarities, our theory is not equivalent to KARE, which is most easily seen in the λ → 0 limit: there, KARE scales with P in a manner that does not depend on the kernel or task spectra, whereas our learning curves still depend on η_ρ and w̄_ρ² even in the ridgeless limit. Our theory works well even in the λ = 0 simulations, for example in Fig. 3a and Fig. 4a in the main text.
KARE may appear easier to implement than the theory we present here, since its only operations involve taking inverses, traces, and quadratic forms of kernel Gram matrices, whereas our theory involves a full eigendecomposition. In terms of time complexity, however, both methods scale as O(M³) for a sampled dataset of size M.
The generalization of regularized linear regression with isotropic Gaussian features was analyzed with the replica method in [28] and with diagrammatic methods in [29,30,31]. These works predicted learning curves equivalent to a special case of our kernel regression theory: the white band-limited spectrum. In this special case, we provide a phase diagram showing how effective noise and regularization can alter the non-monotonic behavior of the learning curve. We further show the universality of this learning curve and phase diagram in any setting where the kernel admits a truncated Mercer decomposition with all eigenvalues equal. We explicitly illustrate such an equivalence in the spherically symmetric data setting. In the L-th learning stage (P ≈ D^L with D → ∞), any kernel essentially reduces to a white band-limited kernel with N(D, L) equal-variance modes (spherical harmonics of degree L). We show that the higher-degree components of the target act as effective noise while the higher-degree kernel eigenvalues act as effective regularization, allowing the phase diagram from the white band-limited setting to carry over to this interesting case.
Generalization error for Gaussian process regression, where the target function is a random field with covariance kernel K_t, was analyzed in the average case by Peter Sollich [32,14]. Using the Sherman–Morrison inverse formula, he derived a continuous approximation to the learning curves from a partial differential equation for κ. In his work, he introduced a variety of approximations to learning curves based on this PDE approach. The learning curves obtained in the present work with the replica method agree with his lower continuous approximation in the kernel regression limit. Our theory goes beyond this work both in formalism and in results: our framework allows the computation of other relevant observables (training error, average estimator, bias, and variance) by simply including additional sources in our partition function. We also provide a detailed analysis of the generalization error expression to elucidate important heuristics for generalization, and we further show its applicability to real datasets.
The effects of over-parameterization in random feature models have recently been investigated with the replica method [33,16] and with random matrix theory tools [8,34]. In the random features model, the last-layer weights of a randomly initialized neural network are trained. In this setting, the learning curves depend on the input dimension D, the number of random features N (network width), and the number of data points P. Typically, the authors consider fitting noisy linear target functions with these networks. In this model, two types of overfitting peaks can occur: one when P ≈ D and one when P ≈ N [35]. The relation between these peaks and the ones we see is discussed in the main text.
The equivalence between training infinite-width neural networks and kernel methods allowed Cohen et al. to study the generalization of wide neural networks with a Gaussian field theory [6]. They obtained a different expression for the theoretical generalization error using a Poisson averaging technique, which allowed them to expand their theoretical prediction in powers of 1/P at finite λ. We show in Supplementary Equation 64 that their expression for ⟨f*(x; P)⟩_D can be obtained from ours by a series expansion in a certain limit, showing that our result contains corrections not captured in their theory. Further, to the perturbative order they consider, they find that ⟨f*(x; P)²⟩_D = ⟨f*(x; P)⟩_D², thereby missing the variance of the estimator, which plays a crucial role in our theory, especially for non-monotonicity. Furthermore, their expressions break down in the ridgeless limit, and they resort to a renormalization approach to predict learning curves; our expression remains valid in this limit.
Our recent conference publication on kernel regression and the infinite-width limit of neural networks [12] laid the groundwork for the present investigation. However, the current paper significantly expands on it and brings many new insights. First, the effect of label noise on generalization and the multiple-descent phase diagram were not explored in our previous work; we provide an account of them here. Second, we provide comparisons with CIFAR-10, not present in the previous work. Third, we introduce and analyze the white band-limited RKHS model. Fourth, we emphasize task-model alignment as a heuristic and provide a metric for it. Fifth, the field theory formalism presented in this paper is more general than the technique of [12], which only used the replica method to calculate the average of an inverse matrix. The flexibility of the new theory allowed us to compute other observables, including the training error, the expected estimator, and the covariance of the estimated coefficients over different datasets.