Understanding quantum machine learning also requires rethinking generalization

Quantum machine learning models have shown successful generalization performance even when trained with few data. In this work, through systematic randomization experiments, we show that traditional approaches to understanding generalization fail to explain the behavior of such quantum models. Our experiments reveal that state-of-the-art quantum neural networks accurately fit random states and random labeling of training data. This ability to memorize random data defies current notions of small generalization error, problematizing approaches that build on complexity measures such as the VC dimension, the Rademacher complexity, and all their uniform relatives. We complement our empirical results with a theoretical construction showing that quantum neural networks can fit arbitrary labels to quantum states, hinting at their memorization ability. Our results do not preclude the possibility of good generalization with few training data but rather rule out any possible guarantees based only on the properties of the model family. These findings expose a fundamental challenge in the conventional understanding of generalization in quantum machine learning and highlight the need for a paradigm shift in the study of quantum models for machine learning tasks.

The importance of notions of generalization for PQCs mirrors the development in classical machine learning: Vapnik's contributions [58] laid the groundwork for the formal study of statistical learning systems. This methodology was considered standard in classical machine learning theory until roughly the last decade. However, the mindset put forth in that line of work has been disrupted by seminal work [59] demonstrating that the conventional understanding of generalization is unable to explain the great success of large-scale deep convolutional neural networks. These networks, which display orders of magnitude more trainable parameters than the dimensions of the images they process, defied conventional wisdom concerning generalization.
Employing clever randomization tests derived from nonparametric statistics [60], the authors of Ref. [59] exposed cracks in the foundations of Vapnik's theory and its successors [61], at least when applied to specific, state-of-the-art, large networks. Established complexity measures, such as the well-known VC dimension or Rademacher complexity [62], among others, were inadequate in explaining the generalization behavior of large classical neural networks. Their findings, in the form of numerical experiments, directly challenge many of the well-established uniform generalization bounds for learning models, such as those derived in, e.g., Refs. [63-65]. Uniform generalization bounds apply uniformly to all hypotheses across an entire function family. Consequently, they fail to distinguish between hypotheses with good out-of-sample performance and those which completely overfit the training data. Moreover, uniform generalization bounds are oblivious to the difference between real-world data and randomly corrupted patterns. This inherent uniformity is what grants long reach to the randomization tests: exposing a single instance of poor generalization is sufficient to reduce the statements of mathematical theorems to mere trivially loose bounds.
This state of affairs has important consequences for the emergent field of QML, as we explore here. Notably, current studies of generalization in quantum machine learning models have focused exclusively on uniform variants. Consequently, our present comprehension remains akin to the classical machine learning canon before the advent of Ref. [59]. This observation raises a natural question as to whether the same randomization tests would yield analogous outcomes when applied to quantum models. In classical machine learning, it is widely acknowledged that the scale of deep neural networks plays a crucial role in generalization. Analogously, it is widely accepted that current QML models are considerably distant from that size scale. In this context, one would not anticipate similarities between current QML models and high-achieving classical learning models.
In this article, we provide empirical, long-reaching evidence of unexpected behavior in the field of generalization, with quite arresting conclusions. In fact, we are in the position to challenge notions of generalization, building on similar randomization tests as those used in Ref. [59]. As it turns out, they already yield surprising results when applied to near-term QML models employing quantum states as inputs. Our empirical findings, also in the form of numerical experiments, reveal that uniform generalization bounds may not be the right approach for current-scale QML. To corroborate this body of numerical work with a rigorous underpinning, we show how QML models can assign arbitrary labels to quantum states. Specifically, we show that PQCs are able to perfectly fit training sets of polynomial size in the number of qubits. By revealing this ability to memorize random data, our results rule out the good generalization guarantees with few training data from uniform bounds [53,55]. To clarify, our experiments do not study the generalization capacity of state-of-the-art QML. Instead, we expose the limitation of uniform generalization bounds when applied to these models. While QML models have demonstrated good generalization performance in some settings [20,46,53,55,66-68], our contributions do not explain why or how they achieve it. We highlight that the reasons behind their successful generalization remain elusive.

A. Statistical learning theory background
We begin by briefly introducing the necessary terminology for discussing our findings in the framework of supervised learning. We denote X as the input domain and Y as the set of possible labels. We assume there is an unknown but fixed distribution D(X × Y) from which the data originate. Let F represent the family of functions that map X to Y. The expected risk functional R then quantifies the predictive accuracy of a given function f for data sampled according to D.
The training set, denoted as S, comprises N samples drawn from D. The empirical risk R_S(f) then evaluates the performance of a function f on the restricted set S. The difference between R(f) and R_S(f) is referred to as the generalization gap, defined as

gen(f) := R(f) − R_S(f).  (1)

The dependence of gen(f) on S is implied, as evident from the context. Similarly, the dependence of R(f), R_S(f), and gen(f) on D is also implicit. We employ C(F) to represent any complexity measure of a function family, such as the VC dimension, the Rademacher complexity, or others [62]. It is important to note that these measures are properties of the whole function family F, and not of single functions f ∈ F.
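For concreteness, given a generic per-example loss function ℓ, these two risk functionals take the standard statistical-learning form spelled out below; this is a textbook reconstruction stated for an unspecified loss (the concrete loss used in our experiments is discussed alongside Eq. (5) later on):

```latex
% Standard definitions, stated for a generic per-example loss \ell
% (an assumption; any bounded loss fits the discussion in the main text).
\begin{align*}
  R(f)   &= \mathbb{E}_{(x, y) \sim \mathcal{D}}\!\left[ \ell\big( f(x), y \big) \right], \\
  R_S(f) &= \frac{1}{N} \sum_{i=1}^{N} \ell\big( f(x_i), y_i \big),
  \qquad S = \{ (x_i, y_i) \}_{i=1}^{N} \sim \mathcal{D}^N .
\end{align*}
```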

B. Numerical results
Our goal is to improve our understanding of PQCs as learning models. In particular, we tread in the domain of generalization and its interplay with the ability to memorize random data. The main idea of our work builds on the theory of randomization tests from nonparametric statistics [60]. Fig. 1 contains a visualization of our framework.
Initially, we train QNNs on quantum states whose labels have been randomized and compare the training accuracy achieved by the same learning model when trained on the true labels. Our results reveal that, in many cases, the models learn to classify the training data perfectly, regardless of whether the labels have been randomized. By altering the training data in this way, we reach our first finding:

Observation 1 (Fitting random labels). Existing QML models can accurately fit random labels to quantum states.
Next, we randomize only a fraction of the labels. We observe a steady increase in the generalization error as the label noise rises. This suggests that QNNs are capable of extracting the residual signal in the data while simultaneously fitting the noisy portion using brute-force memorization.

Observation 2 (Fitting partially corrupted labels). Existing QML models can accurately fit partially corrupted labels to quantum states.
In addition to randomizing the labels, we also explore the effects of randomizing the input quantum states themselves and conclude:

Observation 3 (Fitting random quantum states). Existing QML models can accurately fit labels to random quantum states.
These randomization experiments result in a remarkably large generalization gap after training, without changing the circuit structure, the number of parameters, the number of training examples, or the learning algorithm. As highlighted in Ref. [59] for classical learning models, these straightforward experiments have far-reaching implications:
1. Quantum neural networks already show memorization capability for quantum data.
2. The trainability of a model remains largely unaffected by the absence of correlation between input states and labels.
3. Randomizing the labels does not change any properties of the learning task other than the data itself.
In the following, we present our experimental design and the formal interpretation of our results. Even though it would seem that our results contradict established theorems, we elucidate how and why we can prove that uniform generalization bounds are vacuous for currently tested models.

Quantum phase recognition and randomization tests
Here, we show the numerical results of our randomization tests, focusing on a candidate architecture and a well-established classification problem: the quantum convolutional neural network (QCNN) [66] and the classification of quantum phases of matter.
Classifying quantum phases of matter accurately is a relevant task for the study of condensed-matter physics [69,70]. Moreover, due to its significance, it frequently appears as a benchmark problem in the literature [69,71]. In our experiments, we consider the generalized cluster Hamiltonian

H = Σ_{i=1}^{n} ( Z_i − j_1 X_i X_{i+1} − j_2 X_{i−1} Z_i X_{i+1} ),  (2)

where n is the number of qubits, X_i and Z_i are Pauli operators acting on the i-th qubit, and j_1 and j_2 are coupling strengths. Specifically, we classify states according to which one of four symmetry-protected topological phases they display. As demonstrated in Ref. [72], and depicted in Fig. 2, the ground-state phase diagram comprises the phases: (I) symmetry-protected topological, (II) ferromagnetic, (III) antiferromagnetic, and (IV) trivial.
The learning task we undertake involves identifying the correct quantum phase given the ground state of the generalized cluster Hamiltonian for some choice of (j_1, j_2). We generate a training set S = {(|ψ_i⟩, y_i)}_{i=1}^{N} by sampling coupling coefficients uniformly at random in the domain j_1, j_2 ∈ [−4, 4], with N being the number of training data points, |ψ_i⟩ representing the ground state vectors of H corresponding to the sampled (j_1, j_2), and y_i denoting the corresponding phase label among the aforementioned phases. In particular, labels are length-two bit strings y_i ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}.
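As an illustration, a minimal sketch of this data-generation step for small n could look as follows, using exact diagonalization with periodic boundaries (an assumption; the larger experiments in this work use DMRG-based methods instead). The helper `phase_label`, which would encode the phase diagram of Ref. [72], is hypothetical and not reproduced here:

```python
import numpy as np
from functools import reduce

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def pauli_string(ops, n):
    """Tensor product acting with ops[i] on qubit i, identity elsewhere."""
    return reduce(np.kron, [ops.get(i, I2) for i in range(n)])

def cluster_hamiltonian(n, j1, j2):
    """Generalized cluster Hamiltonian of Eq. (2), periodic boundaries assumed."""
    H = np.zeros((2**n, 2**n), dtype=complex)
    for i in range(n):
        H += pauli_string({i: Z}, n)
        H -= j1 * pauli_string({i: X, (i + 1) % n: X}, n)
        H -= j2 * pauli_string({(i - 1) % n: X, i: Z, (i + 1) % n: X}, n)
    return H

def sample_training_set(n, N, rng):
    """Draw N pairs (ground state, phase label) with j1, j2 ~ U([-4, 4])."""
    data = []
    for _ in range(N):
        j1, j2 = rng.uniform(-4, 4, size=2)
        _, vecs = np.linalg.eigh(cluster_hamiltonian(n, j1, j2))
        psi = vecs[:, 0]                         # ground state vector
        data.append((psi, phase_label(j1, j2)))  # hypothetical labeling helper
    return data
```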
We employ the QCNN architecture presented in Ref. [66] to address the classification problem. By adapting classical convolutional neural networks to a quantum setting, QCNNs are particularly well-suited for tasks involving spatial and temporal patterns, which makes this architecture a natural choice for phase classification problems. A unique feature of the QCNN architecture is the interleaving of convolutional and pooling layers. Convolutional layers consist of translation-invariant parameterized unitaries applied to neighboring qubits, functioning as filters between feature maps across different layers of the QCNN. Following the convolutional layer, pooling layers are introduced to reduce the dimensionality of the quantum state while retaining the relevant features of the data. This is achieved by measuring a subset of qubits and applying translationally invariant parameterized single-qubit unitaries based on the corresponding measurement outcomes. The operation of a QCNN can be interpreted as a quantum channel C_ϑ specified by parameters ϑ, mapping an input state ρ_in into an output state ρ_out, represented as

ρ_out = C_ϑ(ρ_in).  (3)

Subsequently, the expectation value of a task-oriented Hermitian operator is measured, utilizing the resulting ρ_out.
Our implementation follows that presented in Ref. [53]. The QCNN maps an input state vector |ψ⟩, consisting of n qubits, into a 2-qubit output state. For the labeling function given the output state, we use the probabilities of the outcome of each bit string when the state is measured in the computational basis (p_00, p_01, p_10, p_11). In particular, we predict the label ŷ according to the measurement outcome with the lowest probability,

ŷ = argmin_{b ∈ {0,1}^2} p_b.  (4)

For each experiment repetition, we generate data from the corresponding distribution D. For training, we use a loss function ℓ defined in terms of these outcome probabilities. Thus, given a training set S ∼ D^N, we minimize the empirical risk

R_S(f) = (1/N) Σ_{i=1}^{N} ℓ(f(|ψ_i⟩), y_i).  (5)

We consider three ways of altering the original data distribution D_0 from which data are sampled, namely: (a) data wherein true labels are replaced by random labels, D_1; (b) randomization of only a fraction r ∈ [0, 1] of the data, mixing real and corrupted labels in the same distribution, D_r; and (c) replacing the input quantum states with random states, D_st, instead of randomizing the labels. In each of these randomization experiments, the generalization gap and the risk functionals are defined according to the relevant distribution D ∈ {D_1, D_r, D_st}. In all cases, the correlations between states and labels are gradually lost, which means we can control how much signal there is to be learned. In experiments where data-label correlations have vanished entirely, learning is impossible. One could expect the impossibility of learning to manifest itself during the training process, e.g., through lack of convergence. We observe that training the QCNN model on random data results in almost perfect classification performance on the training set. At face value, this means the QCNN is able to memorize noise.
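A sketch of this labeling convention, assuming the four outcome probabilities of the two output qubits have already been estimated from measurements; the loss below is one natural choice consistent with the argmin prediction rule of Eq. (4), stated as an assumption rather than the exact loss used in the experiments:

```python
import numpy as np

LABELS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def predict(probs):
    """Prediction rule of Eq. (4): probs = (p00, p01, p10, p11) -> the
    two-bit outcome with the *lowest* measured probability."""
    return LABELS[int(np.argmin(probs))]

def loss(probs, y):
    """Penalize the probability assigned to the true label, so that training
    drives the correct outcome towards being the least likely one."""
    return probs[LABELS.index(y)]
```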
In the following experiments, we approximate the expected risk R with an empirical risk R_T using a large test set T. This test set is sampled independently from the same distribution as the training set S. In particular, the test set contains 1000 points for all the experiments, T ∼ D^1000.
Additionally, we report our results using the probability of error, which is further elucidated below. Consequently, we employ the term "error" instead of "risk". Henceforth, we refer to test accuracy and test error as accurate proxies for the true accuracy and expected risk, respectively. All our experiments follow a three-step process:
1. Create a training set S ∼ D^N and a test set T ∼ D^1000.
2. Find a function f that approximately minimizes the empirical risk of Eq. (5).
3. Evaluate f on the test set T to estimate the generalization gap gen_T(f) = R_T(f) − R_S(f).
For ease of notation, we shall employ gen(f) instead of gen_T(f) while discussing the generalization gap without reiterating its empirical nature.

a. Random labels: We start our randomization tests by drawing data from D_1, wherein the true labels have been replaced by random labels sampled uniformly from {(0, 0), (0, 1), (1, 0), (1, 1)}. In order to sample from D_1, a labeled pair can be obtained from the original data distribution (|ψ⟩, y) ∼ D_0, after which the label y can be randomly replaced. In this experiment, we have employed QCNNs with varying numbers of qubits n ∈ {8, 16, 32}. For each qubit number, we have generated training sets with different sizes N ∈ {5, 8, 10, 14, 20} for both random and real labels. The models were trained individually for each (n, N) combination.
In Fig. 3 (a), we illustrate the results obtained when fitting random and real labels, as well as random states (discussed later). Each data point in the figure represents the average generalization gap achieved for a fixed training set size N for the different qubit numbers n. We observe a large gap for the random labels, close to 0.75, which should be seen as effectively maximal: perfect training accuracy and the same test accuracy as random guessing would yield. This finding suggests that the QCNN can be adjusted to fit the random labels in the training set, despite the labels bearing no correlation to the input states. As the training set sizes increase, since the capacity of the QCNN is fixed, achieving perfect classification accuracy for the entire training set becomes increasingly challenging. Consequently, the generalization gap diminishes. It is worth noting that a decrease in training accuracy is also observed for the true labeling of data [53].
b. Corrupted labels: Next to the randomization of labels, we further investigate the QCNN fitting behavior when data come with varying levels of label corruption D_r, ranging from no labels being altered (r = 0) to all of them being corrupted (r = 1). The experiments consider different numbers of training points N ∈ {4, 6, 8} and varying numbers of qubits n ∈ {8, 10, 12}. For each combination of (n, N), we start the experiments with no randomized labels (r = 0). Then, we gradually increase the ratio of randomized labels until all labels are altered, that is, r ∈ {0, 1/N, 2/N, ..., 1}. Fig. 3 (b) shows the test error after convergence. In all repetitions, this experiment reaches 100% training accuracy. We observe a steady increase in the test error as the noise level intensifies. This suggests that QCNNs are capable of extracting the remaining signal in the data while simultaneously fitting the noise by brute force. As the label corruption approaches 1, the test error converges to 75%, corresponding to the performance of random guessing.
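For concreteness, a minimal sketch of how a fraction r of the training labels can be randomized; the uniform relabeling used here (which may occasionally re-draw the true label) is one natural reading of the protocol:

```python
import random

LABELS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def corrupt_labels(data, r, rng=random):
    """Return a copy of data = [(state, label), ...] in which a fraction r of
    the labels is replaced by labels drawn uniformly at random.
    r = 0 reproduces the original data; r = 1 yields fully random labels."""
    n_corrupt = round(r * len(data))
    idx = rng.sample(range(len(data)), n_corrupt)
    corrupted = list(data)
    for i in idx:
        state, _ = corrupted[i]
        corrupted[i] = (state, rng.choice(LABELS))
    return corrupted
```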
The inset in Fig. 3 (b) focuses on the experiments conducted with N = 6 training points. In particular, we examine the relationship between the learning speed and the ratio of random labels. The plot shows an average over five experiment repetitions. Remarkably, each individual run exhibits a consistent pattern: the training error initially remains high, but it converges quickly once the decrease starts. This behavior was also reported for classical neural networks [59]. The precise moment at which the training error begins to decrease seems to be heavily dependent on the random initialization of the parameters. However, it also relates to the signal-to-noise ratio r in the training data. Notably, we observe a long and stable plateau for the intermediate cases r = 1/3 and r = 2/3, roughly halfway between the starting training error and zero. This plateau represents an average between those runs where the rapid decrease has not yet started and those where the convergence has already been achieved, leading to significant variance. Interestingly, in the complete absence of correlation between states and labels (r = 1), the QCNN, on average, perfectly fits the training data even slightly faster than for the real labels (r = 0).

c. Random states: In this scenario, we introduce randomness to the input ground state vectors rather than to the labels. Our goal is to introduce a certain degree of randomization into the quantum states while preserving some inherent structure in the problem. To achieve this, we define the data distribution D_st for the random quantum states in a specific manner instead of just drawing pure random states uniformly.
To sample data from D_st, we first draw a pair from the original distribution (|ψ⟩, y) ∼ D_0, and then we apply the following transformation to the state vector |ψ⟩: we compute the mean μ_ψ and variance σ_ψ of its amplitudes and then sample new amplitudes randomly from a Gaussian distribution N(μ_ψ, σ_ψ). After the new amplitudes are obtained, we normalize them. The random state experiments were performed with varying numbers of qubits n ∈ {8, 10, 12} and training set sizes N ∈ {5, 8, 10, 14, 20}.
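A minimal sketch of this state-randomization step, assuming real amplitudes as produced by diagonalizing the real-valued Hamiltonian (for complex amplitudes, one would resample real and imaginary parts analogously):

```python
import numpy as np

def randomize_state(psi, rng):
    """Replace the amplitudes of |psi> by i.i.d. Gaussian samples whose mean
    and variance match those of the original amplitudes, then renormalize.
    Only global statistics of |psi> survive this transformation."""
    mu, var = psi.mean(), psi.var()
    new = rng.normal(mu, np.sqrt(var), size=psi.shape)
    return new / np.linalg.norm(new)

# Example usage: phi = randomize_state(psi, np.random.default_rng(0))
```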
In Fig. 3 (a), we show the results for fitting random input states, together with the random and real label experiment outcomes. The empirical generalization gaps achieved by the QCNN for random states exhibit a similar shape to those obtained for random labels. Indeed, a slight difference in the relative occurrences of each of the four classes leads to improved performance by biased random guessing. We observe that the QCNN can perfectly fit the training set for few data, and then the generalization gap decreases, analogously to the scenario with random labels.
The case of random states presents an intriguing aspect. The QCNN architecture was initially designed to unveil and exploit local correlations in input quantum states [66]. However, our randomization protocol in this experiment removes precisely the local information, leaving only global information from the original data, such as the mean and the variance of the amplitudes. This was not the case in the random labels experiment, where the input ground states remained unaltered while only the labels were modified. The ability of the QCNN to memorize random data thus appears unaffected, despite its structure being tailored to exploit local information.

Implications
Our findings indicate that novel approaches are required in studying the capabilities of quantum neural networks. Here, we elucidate how our experimental results fit the statistical learning theoretic framework. The main goal of machine learning is to find the expected risk minimizer f_opt associated with a given learning task,

f_opt = argmin_{f ∈ F} R(f).  (6)

However, given the unknown nature of the complete data distribution D, the evaluation of R becomes infeasible. Consequently, we must resort to its unbiased estimator, the empirical risk R_S. We let an optimization algorithm obtain f*, an approximate empirical risk minimizer,

f* ≈ argmin_{f ∈ F} R_S(f).  (7)

Nonetheless, although R_S(f) is an unbiased estimator for R(f), it remains uncertain whether the empirical risk minimizer f* will yield a low expected risk R(f*). The generalization gap gen(f) then comes in as the critical quantity of interest, quantifying the difference between the performance on the training set, R_S(f), and the expected performance on the entire domain, R(f).
In the literature, extensive efforts have been invested in providing robust guarantees on the magnitude of the generalization gap of QML models through so-called generalization bounds [44-51, 53, 55, 56, 62]. These theorems assert that, under reasonable assumptions, the generalization gap of a given model can be upper bounded by a quantity that can depend on various parameters. These include properties of the function family, the optimization algorithm used, or the data distribution. The derivation of a generalization bound for a learning model typically involves rigorous mathematical calculations and often considers restricted scenarios. Many results in the literature fit the following template:

Generic uniform generalization bound. Let F be a hypothesis class, and let D be any data-generating distribution. Let R be a risk functional associated to D, and R_S its empirical version, for a given set of N labeled data: S ∼ D^N. Let C(F) be a complexity measure of F. Then, for any function f ∈ F, the generalization gap gen(f) can be upper bounded, with high probability, by

gen(f) ≤ g_unif(F),

where usually g_unif(F) ∈ O(poly(C(F), 1/N)) is given explicitly. We make the dependence of g_unif on N implicit for clarity. The high probability is taken with respect to repeated sampling from D of sets S of size N.
We refer to these as uniform generalization bounds by virtue of them being equal for all elements f in the class F. Also, these bounds apply irrespective of the probability distribution D. The usefulness of uniform generalization bounds lies in their ability to provide performance guarantees for a model before undertaking any computationally expensive training. Thus, it becomes of interest to identify ranges of values for C(F) and N that result in a diminishing or entirely vanishing generalization gap (such as the limit N → ∞). These bounds usually deal with asymptotic regimes; thus, it is sometimes unclear how tight their statements are for practical scenarios.
In cases where the risk functional is itself bounded, we can further refine the bound. For example, if we take R_e to be the probability of error,

R_e(f) = Pr_{(x,y)∼D}[f(x) ≠ y],

we can immediately say that, for any f, there is a trivial upper bound on the generalization gap, gen(f) ≤ 1. Thus, the generalization bound could be rewritten as

gen(f) ≤ min{g_unif(F), 1}.  (8)

This additional threshold renders the actual value of g_unif(F) of considerable significance. We now have the necessary tools to discuss the results of our experiments properly. Randomizing the data simply involves changing the data-generating distribution, e.g., from the original D_0 to a randomized D ∈ {D_1, D_r, D_st}. As we have just remarked, the r.h.s. of Eq. (8) does not change for different distributions, implying that the same upper bound on the generalization gap applies both to data coming from D_0 and to corrupted data from D. If data from D are such that inputs and labels are uncorrelated, then no hypothesis can be better than random guessing in expectation. This results in the expected risk value being close to its maximum. For instance, in the case of the probability of error and a classification task with M classes, if each input is assigned a class uniformly at random, then it must hold that

R_e(f) = 1 − 1/M

for any hypothesis f, indicating that the expected risk must always be large.
A large expected risk for a particular hypothesis does not generally imply a large generalization gap gen(f) ≈ R_e(f). For instance, if a learning model is unable to fit a corrupted training set S, such that R^e_S(f) ≈ R_e(f), then one would have a small generalization gap gen(f) ≈ 0. Conversely, for the generalization gap of f to be large, gen(f) ≈ 1 − 1/M, the learning algorithm must find a function that can actually fit S, with R^e_S(f) ≈ 0. Yet, even in this last scenario, the uniform generalization bound still applies.
Let us denote by N the size of the largest training set S for which we found a function f_r able to fit the random data, R^e_S(f_r) ≈ 0 (which leads to a large generalization gap gen(f_r) ≈ 1 − 1/M). Since the uniform generalization bound applies to all functions in the class f ∈ F, we have found

g_unif(F) ≥ gen(f_r) ≈ 1 − 1/M

as an empirical lower bound to the generalization bound. This reveals that the generalization bound is vacuous for training sets of size up to N. Noteworthy is also that, beyond N, there is a regime where the generalization bound remains impractically large. The strength of our results resides in the fact that we did not need to specify a complexity measure C(F). Our empirical findings apply to every uniform generalization bound, irrespective of its derivation. This gives strong evidence of the need for a perspective shift in the study of generalization in quantum machine learning.

C. Analytical results
In the previous section, we provided evidence that QNNs can accurately fit random labels. Our empirical findings are restricted to the number of qubits and training samples we tested. While these limitations seem restrictive, they are actually the relevant regimes of interest, considering the empirical evidence. In this section, we study the memorization capability of QML models of arbitrary size in terms of finite sample expressivity.
Finite sample expressivity refers to the ability of a function family to memorize arbitrary data. In general, expressivity is the ability of a hypothesis class to approximate functions in the entire domain X. Conversely, finite sample expressivity studies the ability to approximate functions on fixed-size subsets of X. Although finite sample expressivity is a weaker notion of expressivity, it can be seen as a stronger alternative to the pseudo-dimension of a hypothesis family [44,62].
The importance of finite sample expressivity lies in the fact that machine learning tasks always deal with finite training sets. Suppose a given model is found to be able to realize any possible labeling of an available training set. Then one would reasonably not expect the model to learn meaningful insights from the training data. It is plausible that some form of learning may still occur, albeit without a clear understanding of the underlying mechanisms. However, under such circumstances, uniform generalization bounds would inevitably become trivial.

Theorem 1 (Finite sample expressivity of quantum circuits). Let ρ_1, ..., ρ_N be unknown quantum states on n ∈ N qubits, with N ∈ O(poly(n)), and let W be the Gram matrix with entries W_{ij} = tr(ρ_i ρ_j). If W is well-conditioned, then, for any real numbers y_1, ..., y_N ∈ R, we can construct a quantum circuit M_y of poly(n) depth whose measured expectation value on input ρ_i equals y_i for each i ∈ [N].

The proof is given in Appendix A. Theorem 1 gives us a constructive approach to, given a finite set of quantum states and real labels, find a quantum circuit that produces each of the labels as the expectation value for each of the input states. This should give an intuition for why QML models seem capable of learning random labels and random quantum states. Nevertheless, as stated, the theorem falls short of applying specifically to PQCs. The construction we propose requires query access to the set of input states every time the circuit is executed. We estimate the values tr(ρ_i ρ_j) employing the SWAP test. The circuit that realizes the SWAP test bears little relation to usual QML ansätze. Ideally, if possible, one should impose a familiar PQC structure and drop the need to use the input states.
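As a reminder of the primitive used here, a minimal statevector simulation of the SWAP test for two pure states; the ancilla acceptance probability encodes the overlap via P(0) = (1 + |⟨ψ|φ⟩|²)/2:

```python
import numpy as np

def swap_test_p0(psi, phi):
    """Probability of measuring the ancilla in |0> in a SWAP test between
    pure states |psi> and |phi>: built from H, controlled-SWAP, H."""
    d = len(psi)
    state = np.kron(np.array([1.0, 0.0]), np.kron(psi, phi))  # |0>|psi>|phi>
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    state = np.kron(H, np.eye(d * d)) @ state        # Hadamard on the ancilla
    cswap = np.eye(2 * d * d)                        # controlled-SWAP
    for i in range(d):
        for j in range(i + 1, d):
            a = d * d + i * d + j                    # ancilla=1, registers |i>|j>
            b = d * d + j * d + i                    # ancilla=1, registers |j>|i>
            cswap[[a, b]] = cswap[[b, a]]
    state = cswap @ state
    state = np.kron(H, np.eye(d * d)) @ state        # second Hadamard
    return float(np.sum(np.abs(state[: d * d]) ** 2))

# Overlap estimate for pure states: |<psi|phi>|^2 = 2 * swap_test_p0(psi, phi) - 1
```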
Next, we propose an alternative, more restricted version of the same statement, keeping QML in mind as the desired application. For it, we need a notion of distinguishability of quantum states.
Definition 1 (Distinguishability condition). We say n-qubit quantum states ρ_1, ..., ρ_N fulfill the distinguishability condition if we can find intermediate states ρ_i → ρ̃_i, based on some generic quantum state approximation protocol, such that they fulfill the following:
1. For each i ∈ [N], ρ̃_i is efficiently preparable with a PQC.
2. The matrix of inner products Ŵ, with entries Ŵ_{ij} = tr(ρ_i ρ̃_j), is well-conditioned.
Notable examples of approximation protocols are those inspired by classical shadows [73] or tensor networks [74]. For instance, similarly to classical shadows, one could draw unitaries from an approximate poly(n)-design using a brickwork ansatz with poly(n)-many layers of i.i.d. Haar random 2-local gates. For a given quantum state ρ, one produces several pairs (U, b), where U is the randomly drawn unitary and b is the bitstring outcome after performing a computational basis measurement of UρU†, and one refers to each individual pair as a snapshot. Notice that this approach does not follow exactly the traditional classical shadows protocol. Our end goal is to prepare the approximation as a PQC, rather than utilizing it for classical simulation purposes. In particular, we do not employ the inverse measurement channel, since that would break complete positivity and thus the corresponding approximation would not be a quantum state. For each snapshot, one can efficiently prepare the corresponding quantum state U†|b⟩⟨b|U by undoing the unitary that was drawn after preparing the corresponding computational basis state vector |b⟩. Given a collection of snapshots {(U_1, b_1), ..., (U_M, b_M)}, an approximation protocol would consist of preparing the mixed state

ρ̃ = (1/M) Σ_{m=1}^{M} U_m† |b_m⟩⟨b_m| U_m.

Since each b_m is prepared with at most n Pauli-X gates and each U_m is a brickwork PQC architecture, this approximation protocol fulfills the restriction of efficient preparation from Definition 1. Whether or not this or any other generic approximation protocol is accurate enough for a specific choice of quantum states is discussed in Section IV B. There, we present Algorithm 1 together with its correctness statement as Theorem 3. Given the input states ρ_1, ..., ρ_N, Algorithm 1 moreover allows combining several quantum state approximation protocols in order to produce a well-conditioned matrix of inner products Ŵ.
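A minimal numerical sketch of this snapshot-based approximation, using global Haar-random unitaries in place of the brickwork ansatz purely to keep the example short (an assumption):

```python
import numpy as np
from scipy.stats import unitary_group

def snapshot_approximation(rho, num_snapshots, rng):
    """Approximate rho by the mixed state (1/M) sum_m U_m^dag |b_m><b_m| U_m,
    where each snapshot (U_m, b_m) comes from measuring U_m rho U_m^dag in
    the computational basis."""
    d = rho.shape[0]
    approx = np.zeros_like(rho)
    for _ in range(num_snapshots):
        U = unitary_group.rvs(d, random_state=rng)
        probs = np.clip(np.real(np.diag(U @ rho @ U.conj().T)), 0, None)
        b = rng.choice(d, p=probs / probs.sum())     # sampled bitstring outcome
        basis_state = np.zeros(d)
        basis_state[b] = 1.0
        vec = U.conj().T @ basis_state               # prepare U^dag |b>
        approx += np.outer(vec, vec.conj())
    return approx / num_snapshots
```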
Theorem 2 (Finite sample expressivity of PQCs). Let ρ_1, ..., ρ_N be unknown quantum states on n ∈ N qubits, with N ∈ O(poly(n)), fulfilling the distinguishability condition of Definition 1. Then, we can construct a PQC M(ϑ) of poly(n) depth such that, for any vector of real labels y = (y_1, ..., y_N) ∈ R^N, we can efficiently find a specification of the parameters ϑ_y whose measured expectation value on input ρ_i equals y_i for each i ∈ [N].

The proof is given in Appendix B. With Theorem 2, we understand that PQCs can produce any labeling of arbitrary sets of quantum states, provided they fulfill our distinguishability condition.
Notice that Definition 1 is needed for the correctness of Theorem 2. We require knowledge of an efficient classical description of the quantum states for two main reasons. On the one hand, PQCs are the object of our study; hence, we need to prepare the approximation efficiently as a PQC. On the other hand, the distinguishability condition is also enough to prevent us from running into computational-complexity bottlenecks, like those arising from the distributed inner product estimation results in Ref. [75].

III. DISCUSSION
We next discuss the implications of our results and suggest research avenues to explore in the future. We have shown that quantum neural networks (QNNs) can fit random data, including randomized labels or quantum states. We provided a detailed explanation of how to place our findings in a statistical learning theory context. We do not claim that uniform generalization bounds are wrong or that any prior results are false. Instead, we show that the statements of theorems that fit our generic uniform template must be vacuous for the regimes where the models are able to fit a large fraction of random data.
Our numerical results suggest that we must reach further than uniform generalization bounds to fully understand quantum machine learning (QML) models. In particular, experiments like ours immediately problematize approaches based on complexity measures like the VC dimension, the Rademacher complexity, and all their uniform relatives. To the best of our knowledge, all generalization bounds derived for QML so far are of the uniform kind. Therefore, our findings highlight the need for a perspective shift in generalization for QML. In the future, it will be interesting to conduct causation experiments on QNNs using non-uniform generalization measures. Promising candidates for good generalization measures in QML include the time to convergence of the training procedure, the geometric sharpness of the minimum the algorithm converged to, and the robustness against noise in the data [76].
We selected one of the most promising QML architectures for our experiments, known as the quantum convolutional neural network (QCNN). We considered the task of classifying quantum phases of matter, which is a state-of-the-art application. The structure of the QCNN, with its equivariant and pooling layers, results in an ansatz with restricted expressivity. Its core features, including intermediate measurements, parameter-sharing, and logarithmic depth, contribute to higher bias and lower variance. This means the QCNN should display better generalization behavior than, for example, the usual hardware-efficient ansätze [77]. Most complexity measures are monotonic functions of the expressivity of the function family, and uniform generalization bounds are monotonic functions of a complexity measure. Therefore, our demonstration that uniform generalization bounds applied to the QCNN family are trivially loose immediately implies that the same bounds applied to less restricted models must also be vacuous. In this sense, our results for QCNNs carry over to the entirety of unrestricted QML ansätze. Overall, our study adds to the evidence supporting the need for a proper understanding of symmetries and equivariance in QML [55, 78-80].
In addition to our numerical experiments, we have analytically shown that polynomially-sized QNNs are able to fit arbitrary labelings of data sets. This seems to contradict claims that few training data are provably sufficient to guarantee good generalization in QML, as raised, e.g., in Ref. [53]. Our analytical and numerical results do not preclude the possibility of good generalization with few training data but rather indicate we cannot guarantee it with arguments based on uniform generalization bounds. The reasons why successful generalization might occur have yet to be discovered.
We have brought the randomization tests of Ref. [59] to the quantum level, relating them to the task of quantum phase recognition as a representative example of state-of-the-art QML. Upon first glance, the training set sizes employed in our randomization experiments may be relatively small compared to the classical learning tasks investigated in Ref. [59]. However, it is essential to consider both studies within their respective contexts. In Ref. [59], the considered learning models were regarded as the best in terms of generalization for the common benchmark tasks. As previously mentioned, good generalization performance has been reported in QML, particularly for classifying quantum phases of matter using a QCNN architecture. At present, this combination of model and task is also among the leading approaches concerning generalization within the QML literature. The range of sizes for which we demonstrated memorization behavior aligns with the size regime for which good generalization performance was achieved. It is important to note that while the actual size scales in the classical case are orders of magnitude larger than those presented here, both studies focus on the optimal approaches available at the time.
Despite the parallelism between our work and Ref. [59], it is essential to be aware of the underlying differences between both studies. The notion of overparameterization plays a critical role in classical machine learning. Only with the onset of models containing far more trainable parameters than input dimensions did the traditional understanding of generalization start to dwindle. In contrast, although the number of parameters in the considered architectures is larger than the size of the training sets, it exhibits a logarithmic scaling with the number of qubits, while the number of dimensions of the quantum states scales exponentially. Hence, it is inappropriate to categorize the models we have investigated as large in the same way as the classical models in Ref. [59]. This observation reveals a promising research direction: not only must we rethink our approach to studying generalization in QML, but we must also recognize that the mechanisms leading to successful generalization in QML may differ entirely from those in classical machine learning. On a higher level, this work exemplifies the necessity of establishing connections between the literature on classical machine learning and the evolving field of quantum machine learning.

A. Numerical methods
This section provides a comprehensive description of our numerical experiments, including the computation techniques employed for the random and real label implementations, as well as the random state and partially-corrupted label implementations.
Random and real label implementations. The test and training ground state vectors |ψ_i⟩ of the cluster Hamiltonian in Eq. (2) have been obtained variationally over matrix product states, in the spirit of the density matrix renormalization group ansatz [81], through the software package Quimb [82]. We have utilized the matrix product state backend from TensorCircuit [83] to simulate the quantum circuits. In particular, a bond dimension of χ = 40 was employed for the simulations of 16- and 32-qubit QCNNs. We find that further increasing the bond dimension does not lead to any noticeable changes in our results.
Random state and partially-corrupted label implementations. In this scenario, the test and training ground state vectors |ψ_i⟩ were obtained by directly diagonalizing the Hamiltonian. Note that our QCNN comprised a smaller number of qubits for these examples, namely, n ∈ {8, 10, 12}. The simulation of quantum circuits was performed using Qibo [84], a software framework that allows for fast simulation of quantum circuits.
For all implementations, the training parameters were initialized randomly. The optimization method employed to update the parameters of the QCNN during training is the CMA-ES [85], a stochastic, derivative-free optimization strategy. The code generated under the current study is also available in Ref. [86].
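As an illustration, a minimal sketch of such a training loop with the cma package; the objective below is a stand-in quadratic so that the sketch runs, whereas the real objective would simulate the QCNN and evaluate the empirical risk of Eq. (5), and both the parameter count and the initial step size sigma0 are assumptions:

```python
import cma
import numpy as np

def qcnn_empirical_risk(params):
    """Hypothetical objective mapping circuit parameters to the empirical
    risk of Eq. (5); a stand-in quadratic replaces the circuit simulation."""
    return float(np.sum(np.asarray(params) ** 2))

num_params = 42                                    # illustrative parameter count
x0 = np.random.uniform(-np.pi, np.pi, num_params)  # random initialization
es = cma.CMAEvolutionStrategy(x0, 0.5)             # sigma0 = 0.5, an assumption
while not es.stop():
    solutions = es.ask()                           # sample candidate parameters
    es.tell(solutions, [qcnn_empirical_risk(x) for x in solutions])
print(es.result.xbest)                             # best parameters found
```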

B. Analytical methods
Here, we shed light on the practicalities of Definition 1, a requirement for our central Theorem 2. Algorithm 1 allows for several approximation protocols to be combined to increase the chances of fulfilling the assumptions of Definition 1. Indeed, we can allow the auxiliary states ρ̃_1, ..., ρ̃_N to be linear combinations of several approximation states while staying within the mindset of Definition 1. Then, we can cast the problem of finding an optimal weighting for the linear combination as a linear optimization problem with a positive semi-definite constraint.
With Theorem 3, we can assess the distinguishability condition of Definition 1 for specific states ρ_1, ..., ρ_N and specific approximation protocols. Theorem 3 also considers the case where different approximation protocols are combined, which does not contradict the requirements of Theorem 2.
Proof. The inequality ‖Ŵ(α; σ)‖ ≤ N follows from Gershgorin's circle theorem [87], given that all entries of Ŵ are bounded between [0, 1]. In particular, the largest singular value of the matrix Ŵ reaches the value N when all entries are 1. Each entry Ŵ_{ij}(α; σ) is a linear constraint on α and Ŵ, for i, j ∈ [N], while in matrix ordering κI ≤ Ŵ ≤ N I is a positive semi-definite constraint. Here, Ŵ ≤ N I is equivalent to ‖Ŵ‖ ≤ N, while κI ≤ Ŵ means that the smallest singular value of Ŵ is lower bounded by κ, which is equivalent to ‖Ŵ^{-1}‖ ≤ 1/κ for an invertible Ŵ(α; σ). The test of whether such a Ŵ is well-conditioned hence takes the form of a semi-definite feasibility problem [88]. One can additionally minimize suitable objective functions, again as linear or convex quadratic, and hence semi-definite, problems. Overall, the problem can be cast as a semi-definite program that can be solved with low-order polynomial effort using interior point methods. Duality theory readily provides a rigorous certificate for the solution [88].
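A minimal sketch of this feasibility test using CVXPY, assuming the inner-product matrices w[k], with entries tr(ρ_i σ_j) for each available approximation protocol k, have already been estimated; the normalization constraint on the weights and all variable names are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

def well_conditioned_weights(w, kappa):
    """Semi-definite feasibility test: find weights alpha >= 0, summing to
    one, such that kappa * I <= W_hat(alpha) <= N * I for the combined
    matrix W_hat(alpha) = sum_k alpha_k * w[k]. Returns alpha or None."""
    m, N = len(w), w[0].shape[0]
    alpha = cp.Variable(m, nonneg=True)
    W_hat = sum(alpha[k] * w[k] for k in range(m))
    W_sym = (W_hat + W_hat.T) / 2          # symmetrize for the PSD constraints
    constraints = [
        cp.sum(alpha) == 1,
        W_sym >> kappa * np.eye(N),        # smallest singular value >= kappa
        W_sym << N * np.eye(N),            # largest singular value <= N
    ]
    problem = cp.Problem(cp.Minimize(0), constraints)  # pure feasibility
    problem.solve()
    return alpha.value if problem.status == cp.OPTIMAL else None
```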
We propose using Algorithm 1 to construct the optimal auxiliary states ρ̃_1, ..., ρ̃_N, given the unknown input states ρ_1, ..., ρ_N and a collection of available approximation protocols A_1, ..., A_m. The algorithm produces an output of either 0 in cases where no combination of the approximation states satisfies the distinguishability condition, or it provides the weights α necessary to construct the auxiliary states as a sum of approximation states. In Theorem 3, we prove the correctness of the algorithm.

Algorithm 1 Convex optimization state approximation
Require:
  ρ = (ρ_1, ..., ρ_N)  — quantum states
  A = (A_1, ..., A_m)  — state approximation algorithms
  κ  — condition number
Ensure: α such that Ŵ is well-conditioned if possible, 0 otherwise.

We refer to the proof of Theorem 2, in Appendix B, for an explanation of how to construct the intermediate states ρ̃_i as a linear combination of auxiliary states σ_i without giving up the PQC framework.

CODE AND DATA AVAILABILITY
The code and data generated during the current study are available in Ref. [86].

ACKNOWLEDGMENTS
The authors would like to thank Matthias C. Caro, Vedran Dunjko, Johannes Jakob Meyer, and Ryan Sweke for useful comments on an earlier version of this manuscript, and Christian Bertoni, José Carrasco, and Sofiene Jerbi for insightful discussions. The authors also acknowledge the BMBF (MUNIQC-Atoms, Hybrid), the BMWK (EniQmA, PlanQK), the QuantERA (HQCC), the Quantum Flagship (PasQuans2), the MATH+ Cluster of Excellence, the DFG (CRC 183, B01), and the Einstein Foundation (Einstein Research Unit on Quantum Devices) for financial support.

Figure 1. Visualization of our framework. (a) In the empirical experiments, a distribution of labeled quantum data D undergoes a randomization process, leading to a corrupted data distribution D̃. The training and a test set are drawn independently from each distribution. Then, the training sets are fed into an optimization algorithm, which is employed to identify the best fit for each data set individually from a family of parameterized quantum circuits F_Q. This process generates two hypotheses: one for the original data, f_original, and another for the corrupted data, f_corrupted. We empirically find that the labeling functions can perfectly fit the training data, leading to small training errors. In parallel, f_original achieves a small test error, indicating good learning performance, quantified by a small generalization gap gen(f_original) = small. On the contrary, the randomization process causes f_corrupted to achieve a large test error, which in turn results in a large generalization gap gen(f_corrupted) = large. (b) Regarding uniform generalization bounds, it is worth noting that this corner of the QML literature assigns the same upper bound g_unif to the entire function family without considering the specific characteristics of each individual function. Finally, we combine two significant findings: (1) we have identified a hypothesis with a large empirical generalization gap, and (2) the uniform generalization bounds impose identical upper bounds on all hypotheses. Consequently, we conclude that any uniform generalization bound derived from the literature must be regarded as "large", indicating that all such bounds are loose for that training data size. The notion of a loose generalization bound does not exclude the possibility of achieving good generalization; rather, it fails to explain or predict such successful behavior.

Figure 2. The ground-state phase diagram of the Hamiltonian of Eq. (2).

Figure 3. Randomization tests. (a) Generalization gap as a function of the training set size achieved by the quantum convolutional neural network (QCNN) architecture. The QCNN is trained on real data, random label data, and random state data. The horizontal dashed line should be thought of as the largest generalization gap attainable, characterized by zero training error and test error equal to random guessing (0.75 due to the task having four possible classes). The shaded area corresponds to the standard deviation across different experiment repetitions. For the real data and random labels, we employed 8, 16, and 32 qubits, while for the random states, we employed 8, 10, and 12 qubits. We observe that both random labels and random states exhibit a similar trend in the generalization gap, with a slight discrepancy in height due to the different relative frequencies of the four classes under the respective randomization protocols. In both cases, the test accuracy fails to surpass that of random guessing. Notably, the largest generalization gap occurs in the random labels experiments when using a training set of up to size N = 10, highlighting the memorization capacity of this particular QCNN. The training with uncorrupted data yields behavior in accordance with previous results [53]. (b) Test error as a function of the ratio of label corruption after training the QCNN on training sets of size N ∈ {4, 6, 8} and n = 8. The plot illustrates the interpolation between uncorrupted data (r = 0) and random labels (r = 1). As the label corruption approaches 1, the test accuracy drops to levels of random guessing. The dependence between the test error and label corruption reveals the ability of the QCNN to extract remaining signal despite the noise in the initial training set. The inset focuses on the case N = 6. It conveys the optimization speed for four different levels of corruption, namely, 0, 2, 4, and 6 out of 6 labels being corrupted, and provides insights into the average convergence time. The shaded area denotes the variance over five experiment repetitions with independently initialized QCNN parameters. Surprisingly, on average, fitting completely random noise takes less time than fitting unperturbed data. This phenomenon emphasizes that QCNNs can accurately memorize random data.