Quantum machine learning beyond kernel methods

Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.


I. INTRODUCTION
In the current Noisy Intermediate-Scale Quantum (NISQ) era [1], a few methods have been proposed to construct useful quantum algorithms that are compatible with mild hardware restrictions [2,3]. Most of these methods involve the specification of a quantum circuit Ansatz, optimized in a classical fashion to solve specific computational tasks. Next to variational quantum eigensolvers in chemistry [4] and variants of the quantum approximate optimization algorithm [5], machine learning approaches based on such parametrized quantum circuits [6] stand as some of the most promising practical applications to yield quantum advantages.
In essence, a supervised machine learning problem often reduces to the task of fitting a parametrized function, also referred to as the machine learning model, to a set of previously labeled points, called a training set. Interestingly, many problems in physics and beyond, from the classification of phases of matter [7] to predicting the folding structures of proteins [8], can be phrased as such machine learning tasks. In the domain of quantum machine learning [9,10], an emerging approach for this type of problem is to use parametrized quantum circuits to define a hypothesis class of functions [11-16]. The hope is for these parametrized models to offer representational power beyond what is possible with classical models, including the highly successful deep neural networks. And indeed, we have substantial evidence of such a quantum learning advantage for artificial problems [16-21], but the next frontier is to show that quantum models can be advantageous in solving real-world problems as well. Yet, it is still unclear which of these models we should preferably use in practical applications. To bring quantum machine learning models forward, we first need a deeper understanding of their learning performance guarantees and the actual resource requirements they entail. This is precisely where the main contribution of our work lies. In this paper, we analyze the relations between the different quantum models currently proposed in the literature, and uncover clear indications of which models to use in practice, in light of experimentally relevant restrictions such as the number of qubits and quantum circuit evaluations needed to learn.

FIG. 1. b) The quantum kernel associated to these quantum feature states. The expectation value of the projection P_0 = |0⟩⟨0| corresponds to the inner product between ρ(x) and ρ(x′). An implicit quantum model is defined by a linear combination of such inner products, for x an input point and x′ training data points. c) A data re-uploading model, interlaying data encoding and variational unitaries before a final measurement.
Previous works have made strides in this direction by exploiting a connection between some quantum models and kernel methods from classical machine learning [22]. Many quantum models indeed operate by encoding data in a high-dimensional Hilbert space and using solely inner products evaluated in this feature space to model properties of the data. This is also how kernel methods work. Building on this similarity, the authors of [23,24] noted that a given quantum encoding can be used to define two types of models (see Fig. 1): (a) explicit quantum models, where an encoded data point is measured according to a variational observable that specifies its label, or (b) implicit kernel models, where weighted inner products of encoded data points are used to assign labels instead. In the quantum machine learning literature, much emphasis has been placed on implicit models [20, 25-31], in part due to a fundamental result known as the representer theorem [22]. This result shows that implicit models can always achieve a smaller labeling error than explicit models, when evaluated on the same training set. Seemingly, this suggests that implicit models are systematically more advantageous than their explicit counterparts in solving machine learning tasks [25]. This idea also inspired a line of research where, in order to evaluate the existence of quantum advantages, classical models were only compared to quantum kernel methods. This restricted comparison led to the conclusion that classical models could be competitive with (or outperform) quantum models, even in tailored quantum problems [20].
In recent times, there has also been progress on so-called data re-uploading models [32], which have demonstrated their importance in designing expressive models, both analytically [33] and empirically [15,16,32], and in proving that (even single-qubit) parametrized quantum circuits are universal function approximators [34,35].
Through their alternation of data-encoding and variational unitaries, data re-uploading models can be seen as a generalization of explicit models. However, this generalization also breaks the correspondence to implicit models, as a given data point x no longer corresponds to a fixed encoded point ρ(x). Hence, these observations suggest that data re-uploading models are strictly more general than explicit models and that they are incompatible with the kernel-model paradigm. Until now, it remained an open question whether some advantage could be gained from data re-uploading models, in light of the guarantees of kernel methods.
In this work, we introduce a unifying framework for explicit, implicit and data re-uploading quantum models (see Fig. 2). We show that all function families stemming from these can be formulated as linear models in suitably-defined quantum feature spaces. This allows us to systematically compare explicit and data re-uploading models to their kernel formulations. We find that, while kernel models are guaranteed to achieve a lower training error, this improvement can come at the cost of a poor generalization performance outside the training set. Our results indicate that the advantages of quantum machine learning may lie beyond kernel methods, more specifically in explicit and data re-uploading models. To corroborate this theory, we quantify the resource requirements of these different quantum models in terms of the number of qubits and data points needed to learn. We show the existence of a regression task with exponential separations between each pair of quantum models, demonstrating the practical advantages of explicit models over implicit models, and of data re-uploading models over explicit models. From an experimental perspective, these separations shed light on the resource efficiency of different quantum models, which is of crucial importance for near-term applications in quantum machine learning.

II. A UNIFYING FRAMEWORK FOR QUANTUM LEARNING MODELS
In this section, we start by reviewing the notion of linear quantum models and explain how explicit and implicit models are by definition linear models in quantum feature spaces. We then present data re-uploading models and show how, despite being defined as a generalization of explicit models, they can also be realized by linear models in larger Hilbert spaces.

A. Linear quantum models
Let us first understand how explicit and implicit quantum models can both be described as linear quantum models [25,36]. To define both of these models, we first consider a feature encoding unitary U_φ : X → F that maps input vectors x ∈ X, e.g., images in R^d, to n-qubit quantum feature states ρ(x) = U_φ(x) |0⟩⟨0|^⊗n U_φ(x)†. A linear function in the quantum feature space F is defined by the expectation values

f(x) = Tr[ρ(x) O],   (1)

for some Hermitian observable O ∈ F. Indeed, one can see from Eq. (1) that f(x) is the Hilbert-Schmidt inner product between the Hermitian matrices ρ(x) and O, which is by definition a linear function of the form ⟨φ(x), w⟩_F, for φ(x) = ρ(x) and w = O. In a regression task, these real-valued expectation values are used directly to define a labeling function, while in a classification task, they are post-processed to produce discrete labels (using, for instance, a sign function). Explicit and implicit models differ in the way they define the family of observables {O} they each consider.
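To make this linear structure concrete, the following minimal numerical sketch uses a toy single-qubit encoding (the choice U_φ(x) = R_z(x)H and the observable O = X are purely illustrative, not taken from the paper's experiments) and checks that the expectation value Tr[ρ(x)O] coincides with the Hilbert-Schmidt inner product ⟨ρ(x), O⟩_F:

```python
import numpy as np

def rz(theta):
    """Single-qubit Z rotation."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def feature_state(x):
    """Toy 1-qubit feature encoding: rho(x) = U(x)|0><0|U(x)^dag, U(x) = Rz(x) H."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    psi = rz(x) @ H @ np.array([[1], [0]], dtype=complex)
    return psi @ psi.conj().T

# Fixed Hermitian observable (Pauli X here)
O = np.array([[0, 1], [1, 0]], dtype=complex)

x = 0.7
rho = feature_state(x)
f = np.trace(rho @ O).real            # f(x) = Tr[rho(x) O]
hs = np.sum(rho.conj() * O).real      # Hilbert-Schmidt inner product <rho(x), O>_F
assert np.isclose(f, hs)
# For this particular toy encoding, f(x) = cos(x)
assert np.isclose(f, np.cos(x))
```

The last assertion is specific to this illustrative encoding; the point is only that any such expectation value is a linear function of ρ(x).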

Explicit models
An explicit quantum model [23,24] using the feature encoding U_φ(x) is defined by a variational family of unitaries V(θ) and a fixed observable O, such that the expectation values

f_θ(x) = Tr[ρ(x) O_θ],   (2)

for O_θ = V(θ)† O V(θ), specify its labeling function. Restricting the family of variational observables {O_θ}_θ is equivalent to restricting the vectors w accessible to the linear quantum model f(x) = ⟨φ(x), w⟩_F, w ∈ F, associated to the encoding ρ(x).

Implicit models
Implicit quantum models [23,24] are constructed from the quantum feature states ρ(x) in a different way. Their definition depends directly on the data points {x^(1), ..., x^(M)} in a given training set D, as they take the form of a linear combination

f_α(x) = Σ_{m=1}^{M} α_m k(x, x^(m)),   (3)

for k(x, x′) = Tr[ρ(x) ρ(x′)] the kernel function associated to the feature encoding U_φ(x). By linearity of the trace, however, we can express any such implicit model as a linear model in F, defined by the observable

O_α = Σ_{m=1}^{M} α_m ρ(x^(m)).   (4)

Therefore, both explicit and implicit quantum models belong to the general family of linear models in the quantum feature space F.
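The equivalence between the kernel expansion of Eq. (3) and the linear observable of Eq. (4) can be checked numerically. The sketch below reuses a toy single-qubit encoding (illustrative only; the weights α are arbitrary):

```python
import numpy as np

def feature_state(x):
    """Toy 1-qubit encoding: |psi(x)> = Rz(x) H |0>, returned as a density matrix."""
    psi = np.array([np.exp(-1j * x / 2), np.exp(1j * x / 2)]) / np.sqrt(2)
    return np.outer(psi, psi.conj())

def kernel(x, xp):
    """Quantum kernel k(x, x') = Tr[rho(x) rho(x')]."""
    return np.trace(feature_state(x) @ feature_state(xp)).real

# Training points and (arbitrary) weights alpha
X_train = [0.1, 0.5, 2.0]
alpha = [0.3, -1.2, 0.8]

x = 0.7
# Implicit model, Eq. (3): weighted sum of kernel evaluations
f_implicit = sum(a * kernel(x, xm) for a, xm in zip(alpha, X_train))

# Equivalent linear model, Eq. (4): observable O_alpha = sum_m alpha_m rho(x^(m))
O_alpha = sum(a * feature_state(xm) for a, xm in zip(alpha, X_train))
f_linear = np.trace(feature_state(x) @ O_alpha).real

assert np.isclose(f_implicit, f_linear)
```

By linearity of the trace, the two evaluations agree for any choice of α and any input x.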

B. Linear realizations of data re-uploading models
Data re-uploading models [32], on the other hand, do not naturally fit this formulation. These models generalize explicit models by increasing the number of encoding layers U_ℓ(x), 1 ≤ ℓ ≤ L (which can all be distinct), and interlaying them with variational unitaries V_ℓ(θ). This results in expectation-value functions of the form

f_θ(x) = Tr[ρ_θ(x) O_θ],   (5)

for a variational encoding ρ_θ(x) = U(x, θ) |0⟩⟨0| U(x, θ)†, with U(x, θ) = U_L(x) V_{L−1}(θ) U_{L−1}(x) · · · V_1(θ) U_1(x), and a variational observable O_θ = V_L(θ)† O V_L(θ). Given that the unitaries U_ℓ(x) and V_ℓ(θ) do not commute in general, one cannot straightforwardly gather all trainable gates in a final variational observable O_θ ∈ F so as to obtain a linear model f̃_θ(x) = ⟨φ(x), O_θ⟩_F with a fixed quantum feature encoding φ(x). Our first contribution is to show that, by augmenting the dimension of the Hilbert space F (i.e., considering circuits that act on a larger number of qubits), one can construct such explicit linear realizations f̃_θ of data re-uploading models. That is, given a family of data re-uploading models {f_θ}_θ, we construct an equivalent family of explicit models {f̃_θ}_θ that represents all functions in the original family, along with an efficient procedure to map the former models to the latter.

Approximate mapping
Before getting to the main result of this section (Theorem 1), we first present an illustrative construction to convey intuition on how mappings from data re-uploading to explicit models can be realized. This construction, depicted in Fig. 3, leads to approximate mappings, meaning that these only guarantee |f̃_θ(x) − f_θ(x)| ≤ δ, ∀x, θ, for some (adjustable) error of approximation δ. More precisely, for D the number of encoding gates used by the data re-uploading model and ‖O‖_∞ the spectral norm of its observable, the explicit model uses O(D log(D‖O‖_∞/δ)) additional qubits and gates (see Proposition 1 in Appendix B for the precise statement).
The general idea behind this construction is to encode the input data x in ancilla qubits, to finite precision, which can then be used repeatedly to approximate data-encoding gates using data-independent unitaries. More precisely, all data components x_i ∈ R of an input vector x = (x_1, ..., x_d) are encoded as bit-strings |x̃_i⟩ = |b_0 b_1 ... b_{p−1}⟩ ∈ {0, 1}^p, to some precision ε = 2^{−p} (e.g., using R_x(b_j) rotations on |0⟩ states). Now, using p fixed rotations, e.g., of the form R_z(2^{−j}), controlled by the bits |b_j⟩ and acting on n "working" qubits, one can encode every x_i in arbitrary (multi-qubit) rotations e^{−i x_i H}, e.g., R_z(x_i), arbitrarily many times. Given that all these fixed rotations are data-independent, the feature encoding of any such circuit hence reduces to the encoding of the classical bit-strings x̃_i, prior to all variational operations. By preserving the variational unitaries appearing in a data re-uploading circuit and replacing its encoding gates with such controlled rotations, we can then approximate any data re-uploading model of the form of Eq. (5). The approximation error δ of this mapping originates from the finite precision ε of encoding x, which results in an imperfect implementation of the encoding gates in the original circuit. But as ε → 0, we also have δ → 0, and the scaling of ε (or the number of ancillas dp) as a function of δ is detailed in Appendix B.

FIG. 4. An exact mapping from a data re-uploading model to an equivalent explicit model, using gate teleportation. The details of this mapping, as well as its more elaborate form (using nested gate teleportation), can be found in Appendix B.
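The precision argument above can be checked directly in simulation. The sketch below (with illustrative fixed-point conventions of our own, for x ∈ [0, 1)) composes the data-independent rotations R_z(2^{−j}) according to the bits of x̃_i, and verifies that the resulting gate is ε-close in spectral norm to the exact R_z(x_i), with ε = 2^{−p}:

```python
import numpy as np

def rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def bits(x, p):
    """Fixed-point expansion of x in [0, 1): x ~ sum_j b_j 2^-(j+1), precision 2^-p."""
    return [int(np.floor(x * 2 ** (j + 1))) % 2 for j in range(p)]

def approx_rz(x, p):
    """Compose the data-independent rotations Rz(2^-(j+1)), applied only when b_j = 1.

    This mimics the controlled fixed rotations of the construction: the data enters
    only through the classical bits b_j.
    """
    U = np.eye(2, dtype=complex)
    for j, b in enumerate(bits(x, p)):
        if b:
            U = rz(2 ** -(j + 1)) @ U
    return U

x, p = 0.3141, 10
err = np.linalg.norm(approx_rz(x, p) - rz(x), 2)
# Operator error is bounded by the encoding precision eps = 2^-p
assert err <= 2 ** -p
```

As p grows, the composed gate converges to the exact encoding gate, mirroring the ε → 0, δ → 0 limit in the text.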

Exact mapping
We now move to our main construction, resulting in exact mappings between data re-uploading and explicit models, i.e., mappings that achieve δ = 0 with finite resources. We rely here on a similar idea to our previous construction, in which we encode the input data on ancilla qubits and later use data-independent operations to implement the encoding gates on the working qubits. The difference here is that we use gate teleportation techniques, a form of measurement-based quantum computation [37], to directly implement the encoding gates on ancillary qubits and teleport them back (via entangled measurements) onto the working qubits when needed (see Fig. 4).
Theorem 1. Given an arbitrary data re-uploading model f_θ(x) = Tr[ρ_θ(x) O_θ] as specified by Eq. (5), there exists a mapping that produces an equivalent explicit model f̃_θ(x) = Tr[ρ′(x) O′_θ] as specified by Eq. (2), such that

f̃_θ(x) = f_θ(x), ∀x ∈ X, ∀θ,

and ‖O′_θ‖_∞ ≤ (1 + δ′) ‖O_θ‖_∞, for an arbitrary renormalization parameter δ′ > 0. For D the number of encoding gates used by the data re-uploading model, the equivalent explicit model uses O(D log(D/δ′)) additional qubits and gates.
As we detail in Appendix B, gate teleportation cannot succeed with unit probability without gate-dependent (and hence data-dependent) corrections conditioned on the measurement outcomes of the ancillas. But since we only care about equality in expectation values (Tr[ρ_θ(x) O_θ] and Tr[ρ′(x) O′_θ]), we can simply discard these measurement outcomes in the observable O′_θ (i.e., project on the correction-free measurement outcomes). In general, this leads to an observable with a spectral norm exponentially larger than originally, and hence a model that is exponentially harder to evaluate to the same precision. Using a nested gate-teleportation scheme (see Appendix B) with repeated applications of the encoding gates, we can however efficiently make this norm overhead arbitrarily small.
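A minimal simulation of a single teleportation step (for a one-qubit R_z encoding gate; a simplified stand-in for the general construction of Appendix B) illustrates both the exactness of the post-selected branch and the origin of the norm overhead: projecting on the correction-free outcome succeeds with probability 1/2, which must be compensated in the observable:

```python
import numpy as np

def rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

# Arbitrary normalized "working" state and encoding angle x
psi = np.array([0.6, 0.8j])
x = 1.234

# Ancilla prepared as Rz(x)|+>: the only data-dependent step
ancilla = rz(x) @ (np.array([1, 1]) / np.sqrt(2))

# Joint state, then CNOT (control = working qubit, target = ancilla)
state = np.kron(psi, ancilla)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
state = CNOT @ state

# Project the ancilla onto |0> (the correction-free outcome) and renormalize
out = state.reshape(2, 2)[:, 0]
prob = np.linalg.norm(out) ** 2
out = out / np.linalg.norm(out)

target = rz(x) @ psi
assert np.isclose(prob, 0.5)        # post-selection succeeds with probability 1/2
assert np.allclose(out, target)     # the kept branch carries exactly Rz(x)|psi>
```

Discarding the "wrong" outcome instead of correcting it is what keeps the remaining circuit data-independent, at the price of rescaling expectation values by the success probability.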
As our findings indicate, mappings from data re-uploading to explicit models are not unique, and seem to always incur the use of additional qubits. In Sec. III C we prove that this is indeed the case, and that any mapping from an arbitrary data re-uploading model with D encoding gates to an equivalent explicit model must use Ω(D) additional qubits in general. This makes our gate-teleportation mapping essentially optimal (i.e., up to logarithmic factors) in this extra cost.
To summarize, in this section, we demonstrated that linear quantum models can describe not only explicit and implicit models, but also data re-uploading circuits. More specifically, we showed that any hypothesis class of data re-uploading models can be mapped to an equivalent class of explicit models, that is, linear models with a restricted family of observables. In Appendix C, we extend this result and show that explicit models can also approximate any computable (classical or quantum) hypothesis class.

III. OUTPERFORMING KERNEL METHODS WITH EXPLICIT AND DATA RE-UPLOADING MODELS
From the standpoint of relating quantum models to each other, we have shown that the framework of linear quantum models allows us to unify all standard models based on parametrized quantum circuits. While these findings are interesting from a theoretical perspective, they do not reveal how these models compare in practice. In particular, we would like to understand the advantages of using one model rather than another to solve a given learning task. In this section, we address this question from several perspectives. First, we revisit the comparison between explicit and implicit models and clarify the implications of the representer theorem on the performance guarantees of these models. Then, we derive lower bounds for all three quantum models studied in this work in terms of their resource requirements, and show the existence of exponential separations between each pair of models. Finally, we discuss the implications of these results on the search for a quantum advantage in machine learning.

A. Classical background and the representer theorem
Interestingly, a piece of functional analysis from learning theory gives us a way of characterizing any family of linear quantum models [25]. Namely, the so-called reproducing kernel Hilbert space, or RKHS [22], is the Hilbert space H spanned by all functions of the form f(x) = ⟨φ(x), w⟩_F, for all w ∈ F. It includes any explicit and implicit model defined by the quantum feature states φ(x) = ρ(x). From this point of view, a relaxation of any learning task using implicit or explicit models as a hypothesis family consists in finding the function in the RKHS H that has optimal learning performance. For the supervised learning task of modeling a target function g(x) using a training set {(x^(1), g(x^(1))), ..., (x^(M), g(x^(M)))}, this learning performance is usually measured in terms of a training loss of the form, e.g.,

L̂(f) = (1/M) Σ_{m=1}^{M} (f(x^(m)) − g(x^(m)))².

The true figure of merit of this problem, however, is in minimizing the expected loss L(f), defined similarly as a probability-weighted average over the entire data space X. For this reason, a so-called regularization term λ‖f‖²_H is usually added to the training loss, resulting in the regularized training loss L_λ(f), to incentivize the model not to overfit on the training data. Here, λ ≥ 0 is a hyperparameter that controls the strength of this regularization.
Learning theory also allows us to characterize the linear models in H that are optimal with respect to the regularized training loss L_λ(f), for any λ ≥ 0. Specifically, the representer theorem [22] states that the model f_opt ∈ H minimizing L_λ(f) is always a kernel model of the form of Eq. (3) (see Appendix A for a formal statement). A direct corollary of this result is that implicit quantum models are guaranteed to achieve a lower (or equal) regularized training loss than any explicit quantum model using the same feature encoding [25]. Moreover, the optimal weights α_m of this model can be computed efficiently using O(M²) evaluations of inner products on a quantum computer (that is, by estimating the expectation value in Fig. 1.b for all pairs of training points¹) and with classical post-processing in time O(M³) using, e.g., ridge regression or support vector machines [22].
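As a sketch of this training procedure, the snippet below runs kernel ridge regression with a classical stand-in kernel (a Gaussian kernel in place of a quantum overlap, purely for illustration): building the Gram matrix costs O(M²) kernel evaluations, and the linear solve costs O(M³) classical time:

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, xp):
    """Stand-in kernel; on hardware this would be the overlap Tr[rho(x) rho(x')]."""
    return np.exp(-np.sum((x - xp) ** 2))

# Toy training set: M points of dimension d with smooth labels
M, d = 20, 3
X = rng.normal(size=(M, d))
y = np.sin(X @ np.ones(d))

# O(M^2) kernel evaluations to build the Gram matrix
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

# Ridge regression: solve (K + lambda I) alpha = y, classical O(M^3)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(M), y)

def f(x):
    """Implicit model f(x) = sum_m alpha_m k(x, x^(m)), Eq. (3)."""
    return sum(a * kernel(x, xm) for a, xm in zip(alpha, X))

train_mse = np.mean([(f(xi) - yi) ** 2 for xi, yi in zip(X, y)])
assert train_mse < 0.1  # near-perfect fit on the training set
```

The same pipeline applies with a quantum kernel: only the `kernel` function changes, with each of the M(M+1)/2 distinct entries estimated on a quantum computer.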
This result may be construed to suggest that, in our study of quantum machine learning models, we only need to worry about implicit models, where the only real question to ask is what feature encoding circuit we use to compute a kernel function, and all machine learning is otherwise classical. In the next subsections, we show however the value of explicit and data re-uploading approaches in terms of generalization performance and resource requirements.

B. Explicit can outperform implicit models
We turn our attention back to the explicit models resulting from our approximate mappings (see Fig. 3). Note that the kernel function associated to their bit-string encodings |x̃⟩ is

k(x, x′) = Tr[ρ(x) ρ(x′)] = δ_{x̃,x̃′},   (9)

that is, the Kronecker delta function of the bit-strings x̃ and x̃′. Let us emphasize that, for an appropriate precision ε of encoding input vectors x, the family of explicit models resulting from our construction includes good approximations of virtually any parametrized quantum circuit model acting on n qubits. Yet, all of these result in the same kernel function of Eq. (9). This is a rather surprising result, for two reasons. First, this kernel is classically computable, which, in light of the representer theorem, seems to suggest that a simple classical model of the form of Eq. (3) can outperform any explicit quantum model stemming from our construction, and hence any quantum model in the limit ε → 0. Second, this implicit model always takes the form

f_{α,D}(x) = Σ_{m=1}^{M} α_m δ_{x̃,x̃^(m)},   (10)

which is a model that overfits the training data and fails to generalize to unseen data points, as, for ε → 0 and any choice of α, f_{α,D}(x) = 0 for any x outside the training set. As we detail in Appendix B, similar observations can be made for the kernels resulting from our gate-teleportation construction. These last remarks force us to rethink our interpretation of the representer theorem. When restricting our attention to the regularized training loss, implicit models do indeed lead to better training performance due to their increased expressivity.² But, as our construction shows, this expressivity can dramatically harm the generalization performance of the learning model, despite the use of regularization during training. Hence, restricting the set of observables accessible to a linear quantum model (or, equivalently, restricting the accessible manifold of the RKHS) can potentially provide a substantial learning advantage.

¹ For this work, we ignore the required precision of the estimations. We note, however, that these can require exponentially many measurements in the number of qubits, both for explicit [38] and implicit [27] models.
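The failure mode of the delta kernel is easy to reproduce numerically. In the sketch below (with arbitrary toy bit-strings and labels), the Gram matrix is the identity, so the fitted model of Eq. (10) memorizes the training labels exactly while outputting 0 on every unseen input:

```python
import numpy as np

def delta_kernel(x, xp):
    """Kernel of Eq. (9): Kronecker delta of the (bit-string-encoded) inputs."""
    return float(np.array_equal(x, xp))

# Distinct training inputs with labels
X = [np.array(b) for b in [[0, 1], [1, 0], [1, 1]]]
y = np.array([0.3, -0.7, 1.0])

# Gram matrix is the identity, so the optimal weights are simply alpha = y:
# the model is a lookup table over the training set
K = np.array([[delta_kernel(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K, y)

def f(x):
    """Implicit model of Eq. (10)."""
    return sum(a * delta_kernel(x, xm) for a, xm in zip(alpha, X))

# Perfect training fit ...
assert all(np.isclose(f(xm), ym) for xm, ym in zip(X, y))
# ... but the model outputs 0 on any unseen input: no generalization
assert f(np.array([0, 0])) == 0.0
```

Regularization only shrinks the memorized values toward 0; it cannot make this model produce nonzero predictions outside the training set.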

C. Rigorous learning separations between all quantum models
Motivated by the previous illustrative example, we analyze more rigorously the advantages of explicit and data re-uploading models over implicit models. For this, we take a similar approach to recent works in classical machine learning which showed that neural networks can efficiently solve some learning tasks that linear or kernel methods cannot [39,40]. In our case, we quantify the efficiency of a quantum model in solving a learning task by the number of qubits and the size of the training set it requires to achieve a non-trivial expected loss. To obtain scaling separations, we consider a learning task specified by an arbitrary input dimension d ∈ N and express the resource requirements of the different quantum models as a function of d.
Similarly to Ref. [39], the learning task we focus on is that of learning parity functions (see Fig. 5). These functions take as input a d-dimensional binary input x ∈ {−1, 1}^d and return the parity (i.e., the product) of a certain subset A ⊂ {1, ..., d} of the components of x. The interesting property of these functions is that, for any two choices of A, the resulting parity functions are orthogonal in the Hilbert space H of functions from {−1, 1}^d to R. Hence, since the number of possible choices for A grows combinatorially with d, the subspace of H that these functions span also grows combinatorially with d (this can be made into a 2^d scaling by restricting the choices of A). On the other hand, a linear model (explicit or implicit) also covers a restricted subspace (or manifold) of H. The dimension of this subspace is upper bounded by 2^{2n} for a quantum linear model acting on n qubits, and by M for an implicit model using M training samples (see Appendix G for detailed explanations). Hence, by essentially comparing these dimensions (2^d versus 2^{2n} and M) [40], we can derive our lower bounds for explicit and implicit models. As for data re-uploading models, they do not suffer from these dimensionality arguments. The different components of x can be processed sequentially by the model, such that a single-qubit data re-uploading quantum circuit can represent (and learn) any parity function.
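Both ingredients of this argument, the orthogonality of parity functions and their sequential computability, can be verified in a few lines (d = 6 and the particular subsets A are arbitrary choices for illustration; indices are 0-based in the code):

```python
import itertools
import numpy as np

d = 6
# All 2^d inputs of {-1, 1}^d, one per row
inputs = np.array(list(itertools.product([-1, 1], repeat=d)))

def parity(A):
    """g_A(x) = prod_{i in A} x_i, as a vector over all 2^d inputs."""
    return np.prod(inputs[:, list(A)], axis=1)

# Any two distinct parity functions are orthogonal in the function space H,
# and each has squared norm 2^d
g1, g2 = parity({0, 2}), parity({1, 2, 4})
assert g1 @ g2 == 0
assert g1 @ g1 == 2 ** d

# Sequential processing: the product over A can be accumulated one component
# at a time, which is why a data re-uploading circuit with d encoding gates
# (one per component) suffices to represent any g_A
x = inputs[17]
acc = 1
for i in {1, 2, 4}:
    acc *= x[i]
assert acc == parity({1, 2, 4})[17]
```

The orthogonality is what forces any linear model to span a combinatorially large subspace, while the sequential accumulation is exactly what a fixed encoding ρ(x) cannot do and a data re-uploading circuit can.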
We summarize our results in the following theorem, and refer to Appendix G for a more detailed exposition.
Theorem 2. There exists a regression task specified by an input dimension d ∈ N, a function family {g_A : {−1, 1}^d → {−1, 1}}_A, and associated input distributions D_A, such that, to achieve an average mean-squared error lower than that of a trivial model: (i) any linear quantum model (explicit or implicit) requires Ω(d) qubits, (ii) any implicit quantum model additionally requires Ω(2^d) data samples, while (iii) a data re-uploading model acting on a single qubit and using d encoding gates can be trained to achieve a perfect expected error with probability 1 − δ, using M = O(log(d/δ)) data samples.
A direct corollary of this result is a lower bound on the number of additional qubits that a universal mapping from any data re-uploading model to equivalent explicit models must use:

Corollary 1. Any universal mapping that takes as input an arbitrary data re-uploading model f_θ with D encoding gates and maps it to an equivalent explicit model f̃_θ must produce models acting on Ω(D) additional qubits for worst-case inputs.
Comparing this lower bound to the scaling of our gate-teleportation mapping (Theorem 1), we find that the latter is optimal up to logarithmic factors.

D. Quantum advantage beyond kernel methods
A major challenge in quantum machine learning is showing that the quantum methods discussed in this work can achieve a learning advantage over (standard) classical methods. While some approaches to this problem focus on constructing learning tasks with separations based on complexity-theoretic assumptions [17,19], other works try to assess empirically the type of learning problems where quantum models show an advantage over standard classical models [11,20]. In this line of research, Huang et al. [20] propose looking into learning tasks where the target functions are themselves generated by (explicit) quantum models. Following similar observations to those made in Sec. III A about the learning guarantees of kernel methods, the authors also choose to assess the presence of quantum advantages by comparing the learning performance of standard classical models only to that of implicit quantum models (from the same family as the target explicit models). This restricted comparison led to the conclusion that, with the help of training data, classical machine learning models could be as powerful as quantum machine learning models, even in these tailored learning tasks.
Having discussed the limitations of kernel methods in the previous subsections, we revisit this type of numerical experiment, where we additionally evaluate the performance of explicit models on these tasks. Similarly to Huang et al. [20], we consider a regression task with input data from the fashion-MNIST dataset [41], composed of 28x28-pixel images of clothing items. Using principal component analysis, we first reduce the dimension of these images to obtain n-dimensional vectors, for 2 ≤ n ≤ 12. We then label the images using an explicit model acting on n qubits. For this, we use the feature encoding proposed by Havlíček et al. [23], which is conjectured to lead to classically intractable kernels, followed by a hardware-efficient variational unitary [4]. The expectation value of a Pauli Z observable on the first qubit then produces the data labels.³ On this newly defined learning task, we test the performance of explicit models from the same function family as the explicit models generating the (training and test) data, and compare it to that of implicit models using the same feature encoding (hence from the same extended family of linear models), as well as a list of standard classical machine learning algorithms that are hyperparametrized for the task (see Appendix E). The results of this experiment are presented in Fig. 6.
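For reference, the PCA preprocessing step can be sketched with a plain SVD (random arrays stand in for the flattened fashion-MNIST images, which would be loaded from the dataset in the actual experiment):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for 200 flattened 28x28 images (784-dimensional vectors)
images = rng.normal(size=(200, 28 * 28))

def pca_reduce(X, n):
    """Project the rows of X onto their top-n principal components."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n].T

n = 7
Z = pca_reduce(images, n)
assert Z.shape == (200, n)

# Sanity check: the projected components are decorrelated
cov = Z.T @ Z / (len(Z) - 1)
off_diag = cov - np.diag(np.diag(cov))
assert np.allclose(off_diag, 0, atol=1e-8)
```

The n-dimensional vectors Z would then be fed to the n-qubit feature encoding to generate the labels.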
The training losses we observe are consistent with our previous findings: the implicit models systematically achieve a lower training loss than their explicit counterparts (for an unregularized loss⁴ notably, the implicit models achieve a training loss of 0). With respect to the testing loss on the other hand, which is representative of the expected loss, we see a clear separation starting from n = 7 qubits, where the classical models start having a competitive performance with the implicit models, while the explicit models clearly outperform them both. This goes to show that the existence of a quantum advantage should not be assessed only by comparing classical models to quantum kernel methods, as explicit (or data re-uploading) models can also conceal a substantially better learning performance.

IV. CONCLUSION
In this work, we present a unifying framework for quantum machine learning models by expressing them as linear models in quantum feature spaces. In particular, we show how data re-uploading circuits can be represented exactly by explicit linear models in larger feature spaces. While this unifying formulation as linear models may suggest that all quantum machine learning models should be treated as kernel methods, we illustrate the advantages of variational quantum methods for machine learning. Going beyond the advantages in training performance guaranteed by the representer theorem, we first show how a systematic "kernelization" of linear quantum models can be harmful in terms of their generalization performance. Further, we analyze the resource requirements (the number of qubits and data samples used) of these models, and show the existence of exponential separations between data re-uploading, linear, and kernel quantum models in solving certain learning tasks.
One take-away message from our results is that training loss, even when regularized, is a misleading figure of merit. Generalization performance, which is measured on unseen as well as seen data, is in fact the important quantity to care about in (quantum) machine learning. Written outside of this context, these two sentences will seem obvious to individuals well-versed in learning theory. However, it is crucial to recall this fact when evaluating the consequences of the representer theorem. This theorem only discusses regularized training loss, and thus, despite its guarantees on the training loss of quantum kernel methods, it allows explicit models to have an exponential learning advantage in the number of data samples they use to achieve a good generalization performance.
From the limitations of quantum kernel methods highlighted by these results, we revisit a discussion on the power of quantum learning models relative to classical models in machine learning tasks with quantum-generated data. In a learning task similar to that of Huang et al. [20], we show that, while standard classical models can be competitive with quantum kernel methods even in these "quantum-tailored" problems, variational quantum models can exhibit a significant learning advantage. These results give us a more comprehensive view of the quantum machine learning landscape and broaden our perspective on the type of models to use in order to achieve a practical learning advantage in the NISQ regime.

V. DISCUSSION
In this paper, we focus on the theoretical foundations of quantum machine learning models and how expressivity impacts generalization performance. But a major practical consideration is also that of the trainability of these models. In fact, we know of obstacles in trainability for both explicit and implicit models. Explicit models can suffer from barren plateaus in their loss landscapes [38,42], which manifest in gradients that vanish exponentially in the number of qubits used, while implicit models can suffer from exponentially vanishing kernel values [27,43]. While these phenomena can happen under different conditions, they both mean that an exponential number of circuit evaluations can be needed to train and make use of these models. Therefore, aside from the considerations made in this work, emphasis should also be placed on avoiding these obstacles to make good use of quantum machine learning models in practice.
The learning task we consider to show the existence of exponential learning separations between the different quantum models is based on parity functions, which do not constitute a concept class of practical interest in machine learning. We note, however, that our lower-bound results can also be extended to other learning tasks with concept classes of large dimension (i.e., composed of many orthogonal functions). Quantum kernel methods will necessarily need a number of data points that scales linearly with this dimension, while, as we showcased in our results, the flexibility of data re-uploading circuits, as well as the restricted expressivity of explicit models, can lead to substantial savings in resources. It remains an interesting research direction to explore how and when these models can be tailored to a machine learning task at hand, e.g., through useful inductive biases (i.e., assumptions on the nature of the target functions) in their design.
Appendix A: Representer theorem

In this appendix, we give a formal statement of the representer theorem from learning theory.
Theorem A.1 (Representer theorem [22]). Let g : X → Y be a target function with input and output domains X and Y, D = {(x^(1), g(x^(1))), . . ., (x^(M), g(x^(M)))} a training set of size M, and k : X × X → R a kernel function with a corresponding reproducing kernel Hilbert space (RKHS) H. For any strictly monotonically increasing regularization function h : [0, ∞) → R and any training loss L : (X × Y)^M × Y^M → R ∪ {∞}, we have that any minimizer of the regularized training loss from the RKHS H,

f_opt ∈ argmin_{f ∈ H} L(D, (f(x^(1)), . . ., f(x^(M)))) + h(‖f‖_H),

admits a representation of the form

f_opt(x) = Σ_{m=1}^{M} α_m k(x, x^(m)),

where α_m ∈ R for all 1 ≤ m ≤ M.

A common choice for the regularization function is simply h(‖f‖_H) = λ‖f‖²_H, where λ ≥ 0 is a hyperparameter adjusting the strength of the regularization.
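As an illustration, the representer theorem is what makes kernel ridge regression tractable: for the squared loss and h(‖f‖_H) = λ‖f‖²_H, the optimal α in f_opt(x) = Σ_m α_m k(x, x^(m)) solves a linear system. The sketch below uses a classical RBF kernel as a stand-in for any (quantum) kernel; all names are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gram matrix k(x, x') = exp(-gamma * ||x - x'||^2), a stand-in kernel
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-3, gamma=1.0):
    # For squared loss and h(||f||_H) = lam * ||f||_H^2, the representer
    # theorem reduces training to alpha = (K + lam * I)^(-1) y.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
    # f_opt(x) = sum_m alpha_m k(x, x^(m))
    return rbf_kernel(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1]
alpha = kernel_ridge_fit(X, y)
train_pred = kernel_ridge_predict(X, alpha, X)
print(np.mean((train_pred - y) ** 2))  # small training MSE
```

Note that this only certifies a low *training* loss; as discussed above, it says nothing by itself about generalization.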
Appendix B: Mappings from data re-uploading to explicit models

In this section, we detail possible mappings from data re-uploading models to explicit models, and prove Proposition 1 and Theorem 1 from Sec. II B.
In our analysis, we restrict our attention to encoding gates of the form e^{−ih(x)H_n/2}, for H_n an arbitrary Pauli string acting on n qubits, e.g., H_3 = X ⊗ Z ⊗ I, and h : R^d → R an arbitrary function mapping real-valued input vectors x to rotation angles h(x). Using known techniques (see Sec. 4.7.3 in [46]), we can show that, in order to implement any such gate e^{−ih(x)H_n/2} exactly, one only needs to perform a Pauli-Z rotation e^{−ih(x)Z/2} on one of these n qubits, along with O(n) operations that are independent of x (and can therefore be absorbed into the surrounding variational unitaries). Therefore, in our mappings, we only need to focus on encoding gates of the form R_z(h(x)) = e^{−ih(x)Z/2}.

Approximate bit-string mapping
We start by analyzing the resource requirements (in terms of number of additional qubits and gates) of our approximate bit-string mapping (Proposition 1).
Note that, in our construction (see Fig. 3), the ancilla qubits are always prepared in computational basis states and only act as classical controls throughout the circuit.

Hence, the operation obtained by tracing out these ancillas is equivalent to the unitary V_1(θ)Ũ_1(x̃_1) . . . V_D(θ)Ũ_D(x̃_D), where the data-dependent rotations are only implemented to angle-precision ε = 2^{−p}. In the following, we relate this precision ε (or, equivalently, the number of ancilla qubits Dp) to the approximation error δ ≥ |f̃_θ(x) − f_θ(x)| of our mapping. Call U = V_1(θ)U_1(x) . . . V_D(θ)U_D(x) the data re-uploading unitary and V = V_1(θ)Ũ_1(x̃_1) . . . V_D(θ)Ũ_D(x̃_D) its approximation. We first relate the error δ to the distance measure ‖|∆⟩‖ for |∆⟩ = (U − V)|ψ⟩:

|f̃_θ(x) − f_θ(x)| = |Tr[O_θ (V|ψ⟩⟨ψ|V† − U|ψ⟩⟨ψ|U†)]| ≤ 2‖O_θ‖_∞ ‖|∆⟩‖ ≤ 2‖O_θ‖_∞ ‖U − V‖_∞,

by using the triangle and Cauchy-Schwarz inequalities to derive the first two inequalities, and the definition of the spectral norm ‖O‖_∞ and of the distance ‖U − V‖_∞ to derive the last inequality. Note that, for U and V obtained as sequences of unitary gates, the distance ‖U − V‖_∞ is linear in the pairwise distances between the gates in these sequences (see Sec. 4.5.3 of [46]); therefore, to obtain |f̃_θ(x) − f_θ(x)| ≤ δ, it is sufficient to enforce

‖U_j − Ũ_j‖_∞ ≤ δ/(2D‖O_θ‖_∞)     (B3)

for each and every encoding gate in the circuit. Since we assumed that the encoding gates of the circuit take the form U_j = R_z(x_j), we can bound ‖U_j − Ũ_j‖_∞ as a function of the precision ε of encoding x_j as

‖R_z(x_j) − R_z(x̃_j)‖_∞ = |1 − e^{i(x_j − x̃_j)/2}| ≤ |x_j − x̃_j|/2 ≤ ε/2.     (B4)

From Eqs. (B3) and (B4), we then get that a precision ε ≤ δ/(D‖O_θ‖_∞) suffices. The number of additional qubits in the circuit is then Dp, which is also the number of additional gates (R_x data-encoding rotations and controlled data-independent rotations).
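The bound above (quantizing each rotation angle to p bits changes the model by at most of order D‖O‖_∞ ε, with ε = 2^{−p}) can be sanity-checked numerically. In this hedged sketch, the random single-qubit variational gates and the Pauli-Z observable are illustrative choices, not the paper's construction:

```python
import numpy as np

def rz(theta):
    # R_z(theta) = exp(-i * theta * Z / 2)
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def model(angles, V_list, O, psi):
    # f = <psi| U^dag O U |psi> with U = V_D Rz(x_D) ... V_1 Rz(x_1)
    U = np.eye(2, dtype=complex)
    for theta, V in zip(angles, V_list):
        U = V @ rz(theta) @ U
    phi = U @ psi
    return float(np.real(np.conj(phi) @ O @ phi))

rng = np.random.default_rng(1)
D, p = 8, 10
eps = 2.0 ** (-p)
angles = rng.uniform(0, 2 * np.pi, D)
quantized = np.round(angles / eps) * eps  # angle precision eps = 2^-p
# random data-independent single-qubit unitaries (QR of random matrices)
V_list = [np.linalg.qr(rng.normal(size=(2, 2))
                       + 1j * rng.normal(size=(2, 2)))[0] for _ in range(D)]
O = np.diag([1.0, -1.0]).astype(complex)  # Pauli Z, ||O||_inf = 1
psi = np.array([1, 0], dtype=complex)
delta = abs(model(angles, V_list, O, psi) - model(quantized, V_list, O, psi))
bound = D * 1.0 * eps  # D * ||O||_inf * eps, as in the analysis above
print(delta, bound)    # delta stays below the bound
```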

Mappings based on gate teleportation
We now move to our gate-teleportation mappings (Theorem 1). Again, we restrict our attention to teleporting R_z(x) gates. This gate teleportation can easily be implemented using the gadget depicted in Fig. 7. It is easy to check that, for an arbitrary input qubit |ψ⟩ = α|0⟩ + β|1⟩, the state generated by this gadget before the computational basis measurement (and correction) is:

(1/√2) (R_z(x)|ψ⟩ ⊗ |0⟩ + R_z(−x)|ψ⟩ ⊗ |1⟩),     (B5)

up to a global phase, which results in the correct outcome |ψ′⟩ = R_z(x)|ψ⟩ for a |0⟩ measurement, and a state that can be corrected (up to a global phase) via an R_z(2x) rotation in the case of a |1⟩ measurement.
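A minimal NumPy check of this gadget, assuming the standard construction (resource state R_z(x)|+⟩ on the ancilla, a CNOT controlled by the data qubit, then a measurement of the ancilla); the exact circuit of Fig. 7 may differ in detail:

```python
import numpy as np

def rz(x):
    return np.diag([np.exp(-1j * x / 2), np.exp(1j * x / 2)])

def teleport_rz(psi, x):
    # Resource state Rz(x)|+> on the ancilla; CNOT (data qubit = control);
    # return the two unnormalized branches for ancilla outcomes |0> and |1>.
    plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
    state = np.kron(psi, rz(x) @ plus)
    cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                     [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
    state = (cnot @ state).reshape(2, 2)  # indices: (data, ancilla)
    return state[:, 0], state[:, 1]

def fidelity(a, b):
    # overlap of normalized states (insensitive to global phase)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return abs(np.vdot(a, b)) ** 2

rng = np.random.default_rng(2)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
psi /= np.linalg.norm(psi)
x = 0.7
out0, out1 = teleport_rz(psi, x)
target = rz(x) @ psi
print(fidelity(out0, target))              # |0> branch: Rz(x)|psi> directly
print(fidelity(rz(2 * x) @ out1, target))  # |1> branch: fixed by Rz(2x)
```

Both branches carry squared norm 1/2, matching the factor 1/√2 in Eq. (B5).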
Putting aside the corrections required by this gadget, we note the interesting property that, when it is used to simulate every encoding gate in the data re-uploading circuit, the gadget moves all data-dependent parts of the circuit onto additional ancilla qubits, essentially turning it into an explicit model. However, this gadget still requires data-dependent corrections (of the form R_z(2h(x))) in the case of |1⟩ measurement outcomes, which happen with probability 1/2 for each gate teleportation. To get around this problem, we simply replace the computational basis measurement in the gadget by a projection P_0 = |0⟩⟨0| on the |0⟩ state. While these projections cannot be implemented deterministically in practice, the resulting model is still a valid explicit model, in the sense that including the projections P_0^{⊗D} in the observable O_θ, for D uses of our gadget, still leads to a valid observable O′_θ = O_θ ⊗ P_0^{⊗D}. However, given that each of these projections does not account for the re-normalization of the resulting quantum state (i.e., the factor 1/√2 in Eq. (B5)), this means that, in order to enforce Tr[ρ′(x)O′_θ] = Tr[ρ_θ(x)O_θ], ∀x, θ, we need to multiply O′_θ by a factor of 2^{D/2}. This implies that the evaluation of the resulting explicit model is exponentially harder than that of the original data re-uploading model in the number of encoding gates D. As we show next, this factor can however be made arbitrarily close to 1 by allowing each encoding gate/angle to be used more than once in the feature encoding. To achieve this, we transform our previous gadget so as to implement its data-dependent corrections using gate teleportation again. A single such nested use of gate teleportation now has probability 1 − 1/4 = 3/4 of succeeding without corrections, as opposed to the probability 1/2 of the previous gadget. For N nested uses (see Fig. 8), the success probability is then boosted to 1 − 2^{−N}, which can be made arbitrarily close to 1. If we use this nested gadget for all D encoding gates in the circuit, the probability that all of them are implemented successfully without corrections is then p = (1 − 2^{−N})^D. This probability p can be made larger than 1 − δ′, for any δ′ > 0, by choosing N = ⌈log₂(D/δ′)⌉. This also leads to a normalization factor p^{−1} ≤ (1 − δ′)^{−1}, which can be made arbitrarily close to 1. Note as well that this normalization factor is always known exactly, such that we can guarantee Tr[ρ′(x)O′_θ] = Tr[ρ(x, θ)O_θ], ∀x, θ.
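The choice of nesting depth N can be checked arithmetically: since (1 − 2^{−N})^D ≥ 1 − D·2^{−N}, taking N = ⌈log₂(D/δ′)⌉ is sufficient. A short sketch:

```python
import math

def nesting_depth(D, delta):
    # N such that the probability (1 - 2^-N)^D of needing no correction
    # on any of the D teleported gates is at least 1 - delta
    return math.ceil(math.log2(D / delta))

for D in (10, 100, 1000):
    delta = 0.01
    N = nesting_depth(D, delta)
    p_success = (1 - 2.0 ** (-N)) ** D
    print(D, N, p_success >= 1 - delta)  # the guarantee holds in each case
```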

Kernels resulting from our mappings
In the main text, we showed how our illustrative mapping based on bit-string encodings of x results in trivial Kronecker-delta kernel functions and implicit models with very poor generalization performance. In this subsection, we derive a similar result for our gate-teleportation mappings.
We first note that these mappings lead to feature encodings of the form

|φ(x)⟩ = ⊗_{i=1}^{D} ⊗_{j=1}^{N} R_z(h_{i,j}(x))|+⟩,

for D encoding gates with encoding angles h_i(x), and using N nested gate teleportations for each of these gates (the angles h_{i,j}(x) being derived from the h_i(x)). While less generic than the feature states resulting from our binary encodings, these still generate kernels that are again classically simulatable and that satisfy k(x, x′) → δ_{x,x′} for ND → ∞.

FIG. 9. A programmable quantum processor. A state |ψ_P⟩ is fed to the processor as a program instructing the processor to implement a unitary map U on another input state. The no-programming theorem states that programmable processors C capable of implementing any unitary map U cannot exist.
Moreover, in the case where the angles h_i(x) are linear functions of the components x_i of x, we can directly apply Theorem 1 of [27] to show the following. For a number of encoding gates D and a number of nested gate teleportations N large enough (i.e., ND larger than some d_0), and for a dataset that is at most polynomially large in ND, no function can be learned using the implicit model resulting from this kernel. Note that, for this theorem to be applicable, we also need to assume non-degenerate data distributions µ (i.e., that do not have support on single data points) that are separable over the components x_i of x, i.e., µ = ⊗_i µ_i, such that the mean embeddings ρ_{µ_i} = ∫ ρ_i(x) µ_i(dx) for each component x_i are all mixed.
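The Kronecker-delta behavior is easy to see concretely: each single-qubit factor contributes |⟨+|R_z(a)†R_z(b)|+⟩|² = cos²((b − a)/2) to the kernel, so the full kernel is a product of ND such factors and concentrates exponentially fast. A short sketch (the random angles stand in for the teleportation angles derived from the h_i(x)):

```python
import numpy as np

def product_state_kernel(angles_x, angles_xp):
    # k(x, x') = prod_j |<+| Rz(a_j)^dag Rz(b_j) |+>|^2
    #          = prod_j cos((b_j - a_j) / 2)^2
    return float(np.prod(np.cos((angles_xp - angles_x) / 2.0) ** 2))

rng = np.random.default_rng(3)
D = 5
ks = {}
for N in (1, 10, 100):
    # N*D single-qubit resource states per input
    a = rng.uniform(-1, 1, N * D)
    b = rng.uniform(-1, 1, N * D)
    ks[N] = product_state_kernel(a, b)
print(ks)                          # decays rapidly as N*D grows
print(product_state_kernel(a, a))  # k(x, x) = 1 for any N
```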

Link to no-programming
It may seem to the informed reader as though our mappings from data re-uploading to explicit models violate the so-called no-programming theorem from quantum information theory. In this section, we briefly outline this theorem and explain why our mappings do not violate it.
A programmable quantum processor is defined as a CPTP map C : H_S ⊗ H_P → H_S, where H_S and H_P denote the system and program Hilbert spaces. The purpose of such a quantum processor is to implement unitary maps U : H_S → H_S, where the information about U is fed to the processor solely by a program state |ψ_P⟩ (see Fig. 9). The no-programming theorem [48,49] rules out the existence of perfect universal quantum processors, in the sense that there cannot exist a processor C that implements infinitely many different unitary maps U deterministically using finite-dimensional program states |ψ_P⟩.
The explicit models resulting from our mappings have properties that are reminiscent of quantum processors. In Fig. 3 for instance, the bit-string encodings |x̃⟩ are used to implement data-encoding unitaries in an otherwise data-independent circuit. Thus, one may interpret these as the program states |ψ(x)⟩ of a quantum processor C given by the rest of the circuit. The same goes for the quantum states R_z(x_2)|+⟩, R_z(x_3)|+⟩ in Fig. 4.
In light of the no-programming theorem, it is quite remarkable that these explicit models can "program" a continuous set of unitaries {U(x, θ) = V(θ)U(x)}_{x∈R^d} (and particularly the data-encoding unitaries U(x)). Note however that, in the case of our bit-string mappings, these unitaries are only implemented approximately, and that, in our gate-teleportation mappings, they are only implemented probabilistically. Our gate-teleportation mappings are only exact from the point of view of models, i.e., of expectation values of observables, which are not covered by no-programming. The approximation errors δ > 0 and the normalization factors (1 − δ′)^{−1} ≠ 1 that we obtain in our mappings are indeed symptomatic of our inability to program data re-uploading unitaries both exactly and deterministically. On the other hand, our results show that, contrary to unitary maps, expectation values can be "programmed" exactly.

FIG. 10. A CPTP feature map based on bit-string encodings. Using data-independent controlled rotations, we can implement an approximation of any unitary feature encoding U_φ(x) by further tracing out the bit-string registers.

…tion g_θ(x) to some error ε. Indeed, when g_θ is computed via a parametrized quantum circuit, we can use a similar construction to that depicted in Fig. 3 in the main text. When g_θ is instead computed classically (e.g., using a neural network), we can either simulate this computation with a quantum circuit (see Sec. 3.2.5 of [46]), or simply include it in the observable O_θ as a post-processing of a computational basis measurement of |x̃⟩ and |θ̃⟩.
The explicit models constructed in this proof may seem quite contrived and unnatural in the feature encodings and variational processing they use. Nonetheless, these constructions showcase how parametrized rotations, natural building blocks both for encoding input data and as variational gates, allow explicit quantum models to be universal function approximators.

Appendix D: Beyond unitary feature encodings
So far, in our definition of linear quantum models, we only considered unitary feature encodings, i.e., feature states of the form ρ(x) = U_φ(x)|0⟩⟨0|U_φ(x)† for a certain unitary map U_φ(x). In this section, we make the case that more general feature encodings, namely encodings generated by arbitrary completely positive trace-preserving (CPTP) maps, can lead to more interesting kernels k(x, x′) = Tr[ρ(x)ρ(x′)]. This observation is in line with recent findings on quantum kernels derived from non-unitary feature encodings [20,27].
We illustrate this point by focusing on the bit-string feature encoding U_φ(x)|0⟩^{⊗(n+dp)} = |0⟩^{⊗n} ⊗_{i=1}^{d} |x̃_i⟩ that we presented in the main text. We start by noting that augmenting this feature encoding with an arbitrary unitary V always leads to the same kernel function:

k(x, x′) = Tr[Vρ(x)V† Vρ(x′)V†] = Tr[ρ(x)ρ(x′)],

given that V†V = I. If we however allow for a non-unitary operation such as tracing out part of the quantum system (which is allowed by CPTP maps), we can use this bit-string encoding to construct kernels k(x, x′) that approximate virtually any quantum kernel resulting from a unitary feature encoding on n qubits. To see this, suppose for instance that we want to approximate the quantum kernel proposed by Havlíček et al. [23] (resulting from the unitary feature encoding of Eq. (E1)). In this case, we can, similarly to our construction in Fig. 3, use data-independent rotations, controlled by the bit-string registers and acting on the n working qubits, to simulate the data-dependent gates of Eq. (E1). Then, by tracing out the bit-string register, we effectively obtain an (arbitrarily good) approximation of the original feature encoding on the working qubits. This CPTP feature encoding is depicted in Fig. 10.
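The invariance of the kernel under a fixed unitary V can be checked directly; a quick NumPy sketch with random pure states (dimensions and states are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 8

def random_state():
    # random pure-state density matrix |v><v|
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    v /= np.linalg.norm(v)
    return np.outer(v, v.conj())

rho, rho_p = random_state(), random_state()
# random unitary via QR decomposition of a complex Gaussian matrix
V = np.linalg.qr(rng.normal(size=(dim, dim))
                 + 1j * rng.normal(size=(dim, dim)))[0]
k_plain = np.trace(rho @ rho_p).real
k_rotated = np.trace(V @ rho @ V.conj().T @ V @ rho_p @ V.conj().T).real
print(k_plain, k_rotated)  # identical, since V^dag V = I
```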
Appendix E: Details of the numerical simulations

In this section, we provide more details on the numerical simulations presented in the main text. We first describe how the training and testing data of the learning task are generated, then specify the quantum and classical models that we trained on this task.

Dataset generation a. Generating data points
We generate our training and testing data by pre-processing the fashion MNIST dataset [41]. All 28 × 28-pixel grayscale images in the dataset are first subject to a dimensionality reduction via principal component analysis (PCA), where only their n principal components are preserved, for 2 ≤ n ≤ 12. This PCA gives rise to data vectors x ∈ R^n that are further normalized component-wise, so that each component is centered around 0 and has a standard deviation of 1. To create a training set, we sample M = 1000 of these vectors without replacement. A validation set and a test set, of size 100 each, are sampled similarly from the pre-processed fashion MNIST testing data.
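This pre-processing step can be sketched as follows with scikit-learn; random vectors stand in for the fashion MNIST images here, so only the shapes and the normalization are meaningful:

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(images, n_components):
    # images: (num_samples, 784) grayscale vectors; keep the n principal
    # components, then standardize each component to mean 0 and std 1
    X = PCA(n_components=n_components).fit_transform(images)
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(5)
fake_images = rng.uniform(0, 1, size=(200, 784))  # stand-in for fashion MNIST
X = preprocess(fake_images, n_components=4)
print(X.shape)  # (200, 4), each column centered and of unit std
```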

b. Generating labels
The labels g(x) of the data points x in the training, validation and test sets are generated using the explicit models depicted in Fig. 11, for a number of qubits n equal to the dimension of x, and uniformly random parameters θ ∈ [0, 2π]^{3nL}. The feature encoding U_φ(x) takes the form proposed by Havlíček et al. [23], given in Eq. (E1): layers of Hadamards followed by diagonal phase gates whose angles are given by the components x_i and products x_i x_{i′}. As for the variational unitaries V(θ), these are composed of L layers of single-qubit rotations R(θ_{i,j}) on each of the qubits, interlaid with CZ = |1⟩⟨1| ⊗ Z gates between nearest neighbours in the circuit. We choose the number of layers L as a function of the number of qubits n in the circuit, such that the number of parameters (3nL) is approximately 90 at all system sizes (from n = 2 to 12, we have L = 15, 10, …). Finally, the labels of the data points are specified by the expectation values of a Z_1 observable on the resulting circuits, rescaled by a factor w_{D,θ} that sets the standard deviation of these labels to 1 over the training set.

FIG. 11. The explicit model used in our numerical simulations. We use the feature encoding proposed by Havlíček et al. [23] (see Eq. (E1)), followed by a hardware-efficient variational circuit, where arbitrary single-qubit rotations R(θ_{i,j}) = R_x(θ_{i,j,0})R_y(θ_{i,j,1})R_z(θ_{i,j,2}) on each qubit are interlaid with nearest-neighbour CZ gates, for L layers (here L = 2). Finally, the expectation value of a Z_1 observable (with a re-normalization) assigns labels to input data x ∈ R^n.

c. Evaluating performance
We evaluate the training loss of a hypothesis function f using the mean squared error on the training data. The test loss (indicative of the expected loss) is evaluated similarly on the test data (of size 100).

Quantum machine learning models
In our simulations, we compare the performance of two types of quantum machine learning models.
First, we consider explicit models from the same variational family as those used to label the data (i.e., as depicted in Fig. 11), but initialized with different variational parameters θ ∈ [0, 2π]^{3nL}, now sampled according to independent normal distributions N(0, 0.05). As opposed to the generating functions of Eq. (E3), we replace the observable weight w_{D,θ} by a free parameter w, initialized to 1 and trained alongside the variational parameters θ. We train the explicit models for 500 steps of gradient descent on the training loss of Eq. (E4). For this, we use an ADAM optimizer [51] with a learning rate α_θ = 0.01 for the variational parameters θ and a learning rate α_w = 0.1 for the observable weight w.
Second, we also consider implicit models that rely on the same feature encoding U_φ(x) (Eq. (E1)) as the explicit models. That is, these take the form f_{α,D}(x) = Tr[ρ(x)O_{α,D}] for the same feature states ρ(x), and an observable O_{α,D} given by Eq. (4) in the main text. We train their parameters α using the KernelRidge regression package of scikit-learn [52]. In the numerical simulations of Fig. 6, we use an unregularized training loss, i.e., that of Eq. (E4). The learning performance of the implicit models trained with regularization is presented in Appendix F.
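This training setup can be sketched with scikit-learn's precomputed-kernel interface. In the sketch below, a classical surrogate Gram matrix stands in for the quantum kernel Tr[ρ(x)ρ(x′)] (which would be estimated on hardware), and a tiny ridge term keeps the linear solve well-conditioned while remaining close to the unregularized loss:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def gram(X1, X2):
    # stand-in for the quantum kernel k(x, x') = Tr[rho(x) rho(x')]
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

rng = np.random.default_rng(6)
X_train = rng.uniform(-1, 1, size=(100, 3))
y_train = np.cos(X_train.sum(axis=1))
X_test = rng.uniform(-1, 1, size=(20, 3))

# kernel="precomputed": fit on the train Gram matrix, predict from the
# test-vs-train Gram matrix; alpha -> 0 approaches the unregularized loss
model = KernelRidge(alpha=1e-6, kernel="precomputed")
model.fit(gram(X_train, X_train), y_train)
preds = model.predict(gram(X_test, X_train))
print(np.mean((preds - np.cos(X_test.sum(axis=1))) ** 2))  # test MSE
```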

Classical machine learning models
We additionally compare the performance of our quantum machine learning models to a list of classical models, identical to that of Huang et al. [20]. For completeness, we list these models here, along with the hyperparameters they were trained with. All of these models were trained using the default specifications of scikit-learn [52], unless stated otherwise.
- Neural network: we use the MLPRegressor package with a maximum number of learning steps max_iter = 500.
- Linear kernel method: we select the best performance between the SVR and KernelRidge packages (both using the linear kernel).
- Gaussian kernel method: we select the best performance between the SVR and KernelRidge packages (both using the RBF kernel).
- Random forest: we use the RandomForestRegressor package.
- Gradient boosting: we use the GradientBoostingRegressor package.
- AdaBoost: we use the AdaBoostRegressor package.
At each system size, we keep the learning performance of the model with the lowest validation loss and plot its test loss.

As summarized below, explicit linear models are sample efficient but require Ω(d) qubits to achieve a non-trivial expected loss, while implicit models require both Ω(d) qubits and Ω(2^d) training samples to achieve this.
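The classical-baseline selection procedure (train all listed models, keep the one with the lowest validation loss, report its test loss) can be sketched as follows, on synthetic stand-in data and with scikit-learn defaults beyond the hyperparameters listed above:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2]  # stand-in labels
X_tr, X_val, X_te = X[:200], X[200:250], X[250:]
y_tr, y_val, y_te = y[:200], y[200:250], y[250:]

models = {
    "mlp": MLPRegressor(max_iter=500),
    "svr_rbf": SVR(kernel="rbf"),
    "krr_linear": KernelRidge(kernel="linear"),
    "forest": RandomForestRegressor(),
    "grad_boost": GradientBoostingRegressor(),
    "adaboost": AdaBoostRegressor(),
}
val_loss = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    val_loss[name] = np.mean((m.predict(X_val) - y_val) ** 2)
best = min(val_loss, key=val_loss.get)  # lowest validation loss wins
test_loss = np.mean((models[best].predict(X_te) - y_te) ** 2)
print(best, test_loss)
```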

Learning parity functions
We consider the same learning task as Daniely & Malach [39], that is, learning k-sparse parity functions. These functions have discrete input and output spaces, X = {−1, +1}^d and Y = {−1, +1}, respectively; d ∈ N specifies the dimension of the inputs x ∈ X, and an additional parameter 0 ≤ k ≤ d specifies the family of so-called k-sparse parity functions:

g_A(x) = Π_{i∈A} x_i,  for A ⊂ [d], |A| = k.

That is, for a given subset A of the input components [d], the function g_A returns the parity ±1 of these components for any input x ∈ X. A can be any subset of [d] of size k, which gives us a family of functions {g_A}_{A⊂[d], |A|=k} that we take to be our concept class (for a certain k specified later). These functions have the interesting property that they are all linearly independent (despite potentially forming an exponentially large family), which is the essential property we will use to derive our separation results.
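The linear independence claim is easy to verify numerically: under the uniform distribution, the k-sparse parities are in fact pairwise orthonormal. A small sketch for d = 6, k = 3:

```python
import numpy as np
from itertools import combinations, product

d, k = 6, 3
X = np.array(list(product((-1, 1), repeat=d)))  # all 2^d inputs
# g_A(x) = prod_{i in A} x_i for every size-k subset A of [d]
G = np.array([X[:, list(A)].prod(axis=1)
              for A in combinations(range(d), k)])
inner = G @ G.T / len(X)  # <g_A, g_B> = E_x[g_A(x) g_B(x)], uniform x
print(np.allclose(inner, np.eye(len(G))))  # True: orthonormal family
```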
Daniely & Malach [39] show that, for an appropriate choice of input distribution and loss function, k-sparse parity functions cannot be approximated by any polynomial-size linear model (for k ≤ d/16), while a depth-2 neural network with hidden layers polynomially large in k can learn them (almost) perfectly. The size of the linear model is defined here as the dimension of its feature space F multiplied by the norm of its weight vector ‖w‖_F.
In our results, we rely instead on a powerful theorem by Hsu [40] to derive simpler separation results in terms of the dimension of linear models, rather than their size. This theorem, stated in the next subsection, makes it natural to consider an input distribution D_X that is simply the uniform distribution over the data space X, and the mean-squared error L_A(f) = E_{x∼D_X}[(f(x) − g_A(x))²] as the expected loss for which we establish our bounds.

Learning performance of quantum models
This section is organized as follows: we first describe our lower-bound results for linear quantum models, which are naturally derived for learning parities w.r.t. the uniform input distribution D_X. We then move to our upper-bound results for data re-uploading models, which lead us to consider a different input distribution D_A in order to achieve the largest separation with linear models. Together, these bounds give us our main theorem, Theorem G.6.

a. Lower bounds for linear models
As mentioned above, our separation results derive from dimensionality arguments on the family of functions that can be represented by linear models. To start, let us consider the Hilbert space H = L²(D_X) of real-valued functions that are square-summable with respect to the probability space (X, 2^X, D_X). The inner product associated to this Hilbert space is ⟨f, g⟩_H = E_{x∼D_X}[f(x)g(x)]. For any k ∈ {0, . . ., d}, the k-sparse parity functions belong to H, and they are moreover orthogonal functions of this Hilbert space. As for the functions generated by a linear model f(x) = ⟨φ(x), w⟩_F (for a fixed feature encoding φ), these are also functions of the Hilbert space H. More specifically, they are contained in a finite-dimensional subspace W = span{⟨φ(·), w_j⟩_F}_{w_j} of H, for {w_j}_j a basis of F (with respect to the inner product ⟨·, ·⟩_F). To establish our bounds, all we need to do now is: a) relate the dimension dim(W) of this subspace to the expected loss of the linear model, and b) upper bound dim(W) in terms of the number of qubits or data samples accessible to this model.
Theorem G.1 (Hsu [40]). Let ϕ_1, . . ., ϕ_N be orthogonal functions in a Hilbert space H such that ‖ϕ_i‖²_H = 1 for all i = 1, . . ., N, and let W be a finite-dimensional subspace of H. Then, for

ε = (1/N) Σ_{k=1}^{N} min_{f∈W} ‖f − ϕ_k‖²_H,

we have dim(W) ≥ N(1 − ε). By definition here, ‖f − ϕ_k‖²_H = E_{x∼D_X}[(f(x) − ϕ_k(x))²] is the expected loss of f on the concept ϕ_k, so this theorem gives us a combinatorial lower bound on dim(W) when the linear model achieves a non-trivial average expected loss ε < 1 (the trivial loss ε = 1 being otherwise obtained for f(x) = 0, ∀x ∈ X).
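This dimension bound is easy to probe numerically: projecting the (orthonormalized) parity functions onto a random low-dimensional subspace W, the average best-approximation error always respects ε ≥ 1 − dim(W)/N. A small sketch (the random subspace is an illustrative choice):

```python
import numpy as np
from itertools import combinations, product

d, k = 6, 3
X = np.array(list(product((-1, 1), repeat=d)))
parities = np.array([X[:, list(A)].prod(axis=1)
                     for A in combinations(range(d), k)], dtype=float)
phis = parities / np.sqrt(2 ** d)  # orthonormal vectors in L2(uniform)
N = len(phis)

rng = np.random.default_rng(8)
dim_W = 5
W = np.linalg.qr(rng.normal(size=(2 ** d, dim_W)))[0]  # random subspace
P = W @ W.T                                            # projector onto W
# best approximation of phi_k within W is its projection P phi_k
avg_err = np.mean([np.linalg.norm(phi - P @ phi) ** 2 for phi in phis])
print(avg_err, 1 - dim_W / N)  # Theorem G.1: avg_err >= 1 - dim(W)/N
```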
We now move to point b). Note here that, in order to upper bound dim(W), all we need to do is upper bound the number of linearly independent functions in span{⟨φ(·), w_j⟩_F}_{w_j} = W, or, equivalently, the number of basis vectors w_j of F. This can easily be done when we know the number of qubits on which the linear model acts. Indeed, for an n-qubit model, F is the space of 2^n × 2^n Hermitian operators, and hence dim(W) ≤ 2^{2n}. This leads us to our first lemma.

Lemma G.2. There exists a regression task specified by an input dimension d ∈ N, a function family {g_A}_{A⊂[d], |A|=d/2}, and an input distribution D_X, such that, to achieve an average mean-squared error ε < 1, any linear quantum model needs to act on at least d/4 + (1/2)log₂(1 − ε) qubits.

Proof. From Theorem G.1, we have dim(W) ≥ 2^{d/2}(1 − ε), and from our previous observation, dim(W) ≤ 2^{2n}.
In the case of implicit linear models, note that we can bound dim(W) even more tightly. This is because the weight vector w ∈ F, or equivalently the observable of the implicit quantum model, is expressed as a linear combination of embedded data samples φ(x^(i)) = ρ(x^(i)). Therefore, the number of linearly independent functions in span{⟨φ(·), w_j⟩_F}_{w_j} = W is upper bounded by the number of data points M in the training set of the implicit model. This gives us dim(W) ≤ min(2^{2n}, M) and the following lemma:

Lemma G.3. There exists a regression task specified by an input dimension d ∈ N, a function family {g_A}_{A⊂[d], |A|=d/2}, and an input distribution D_X, such that, to achieve an average mean-squared error ε < 1, any implicit quantum model needs a training set of size M ≥ 2^{d/2}(1 − ε).

Note that implicit quantum models suffer from both lower bounds, in Lemmas G.2 and G.3: they require both Ω(d) qubits and Ω(2^d) data samples.

b. Upper bound for data re-uploading models
To establish our learning separation, we would like to show that data re-uploading circuits can efficiently represent parity functions. We show this constructively, by designing a single-qubit data re-uploading model that can compute any (k-sparse) parity function. This model is depicted in Fig. 14 and consists solely of R_z(x̃_i) = Z^{(x_i−1)/2} encoding gates, parametrized R_y rotations, and a final Pauli-X measurement. To understand how such a circuit can compute parity functions, consider the parity to be encoded in the qubit being either in a |+⟩ or a |−⟩ state. The R_z(x̃_i) rotations then flip the |±⟩ state whenever x_i = −1, and preserve it otherwise. As for the R_y(±π/2) rotations, these essentially act as Hadamard gates by transforming a |±⟩ state into a |1⟩/|0⟩ state and back, which allows us to "hide" this state from the action of an R_z(x̃_i) gate. We parametrize these gates so as to hide the |±⟩ qubit from R_z(x̃_i) whenever θ_i = 0, and let it act whenever θ_i = π/2. This leads us to the following lemma:

Lemma G.4. For the same learning task considered in Lemmas G.2 and G.3, there exists a data re-uploading model acting on a single qubit, and with depth 2d + 1, that achieves a perfect mean-squared error L_A(f) = 0.

Proof. For a given A ⊂ [d], take, in the circuit of Fig. 14, θ_i = π/2 for i ∈ A and θ_i = 0 otherwise.

Note that our claim on data re-uploading models so far only has to do with representability, and not actual learning from data. We are yet to prove that a similar learning performance can be achieved from a training set of size polynomial in d and a polynomial-time learning algorithm. For the uniform data distribution D_X we considered so far, this is known to be possible using O(d) data samples and by solving the resulting linear system of equations in d variables [54]. However, this distribution does not provide us with the best possible separation in terms of data samples, which is why we consider instead the mixture of data distributions introduced by Daniely & Malach [39]. Originally intended to get around the hardness of learning parities with gradient-based algorithms [55], this distribution significantly reduces the data requirements of the data re-uploading (and explicit linear) models to O(log(d)), while preserving the Ω(2^d) lower bound for implicit models.
For every A ⊂ [d], call D_A^{(1)} = D_X the uniform distribution over X, and D_A^{(2)} the distribution where all components in [d] \ A are drawn uniformly at random, while, independently, the components in A are all +1 with probability 1/2 and all −1 otherwise. The data distribution D_A that we consider samples x ∼ D_A^{(1)} with probability 1/2 and x ∼ D_A^{(2)} with probability 1/2. For k = |A| an odd number⁷, this distribution is particularly interesting as, when x ∼ D_A^{(2)}, x_i = g_A(x) for all i ∈ A, which statistically "reveals" A, while D_A^{(1)} still preserves our previous hardness-of-generalization results. This allows us to prove the following lemma.
Lemma G.5. For the data distribution D_A defined above, there exists a learning algorithm using M = 32 log(2d/δ) data samples and dM evaluations of the circuit in Fig. 14 that returns, for any A ⊂ [d] of odd size, parameters θ for which this circuit computes g_A exactly, with probability at least 1 − δ.
Proof. We analyze the following learning algorithm: given a training set of size M, evaluate, for all i ∈ [d], the empirical loss L(f_i) of the data re-uploading function f_i obtained with the parameters θ_i = π/2, θ_j = 0 for j ≠ i. Return θ_i = π/2 when L(f_i) ≤ 1.5 and θ_i = 0 otherwise, for all i ∈ [d].
Call X_i = (x_i − g_A(x))² the random variable obtained by sampling x from the data distribution D_A, for a given A. Note that, by construction, E_{D_A}[X_i] = 1 for i ∈ A and E_{D_A}[X_i] = 2 for i ∉ A. Given that the computed losses L(f_i) are empirical estimates of E_{D_A}[X_i], all we need in order to identify A is to guarantee with high probability that we can distinguish whether E_{D_A}[X_i] = 1 or 2, for all i ∈ [d]. We achieve this guarantee using the union bound and Hoeffding's inequality (X_i ∈ [0, 4]):

P[∃ i ∈ [d] : |L(f_i) − E_{D_A}[X_i]| ≥ 1/2] ≤ 2d e^{−M/32},

and to upper bound this failure probability by δ, we need M ≥ 32 log(2d/δ).
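This identification procedure is simple to simulate end to end. A hedged sketch, using M = 500 samples, comfortably above the 32 log(2d/δ) ≈ 192 required by the lemma for d = 10, δ = 0.05:

```python
import numpy as np

def sample_DA(A, d, M, rng):
    # w.p. 1/2 uniform over {-1,+1}^d (D_A^(1)); w.p. 1/2 the components
    # in A are all +1 or all -1 together, the rest uniform (D_A^(2))
    X = rng.choice([-1, 1], size=(M, d))
    revealed = rng.random(M) < 0.5
    signs = rng.choice([-1, 1], size=M)
    for m in np.flatnonzero(revealed):
        X[m, list(A)] = signs[m]
    return X

def learn_A(X, y):
    # keep i whenever the empirical loss of f_i(x) = x_i is below 1.5
    # (E[X_i] = 1 for i in A, 2 otherwise, for odd |A| > 1)
    losses = ((X - y[:, None]) ** 2).mean(axis=0)
    return {i for i, L in enumerate(losses) if L < 1.5}

rng = np.random.default_rng(9)
d, A, delta = 10, (1, 4, 7), 0.05  # |A| odd
M = 500  # comfortably above the lemma's 32*log(2d/delta) ~ 192
X = sample_DA(A, d, M, rng)
y = np.prod(X[:, list(A)], axis=1)  # labels g_A(x)
print(learn_A(X, y) == set(A))      # the subset A is recovered
```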
We leave as an open question whether an optimization procedure with similar learning guarantees, but based on gradient descent, also exists.

⁷ In Lemma G.2, when d/2 is an even number, we can take k =

c. Main theorem
To conclude our results, we are left to show that lower bounds similar to those in Lemmas G.2 and G.3 also hold for the data distribution D_A. Intuitively, this problem is easy to solve: given the inability of linear models to represent good approximations of parity functions with respect to the uniform data distribution, it should be clear that these still have a bad generalization performance with respect to D_A.

Theorem G.6. There exists a regression task specified by an input dimension d ∈ N, a function family {g_A}_A, and an input distribution D_A, such that, to achieve an average mean-squared error ε < 1/2: (i) any linear quantum model needs to act on at least d/4 + (1/2)log₂(1 − 2ε) qubits, (ii) any implicit quantum model additionally requires a training set of size M ≥ 2^{d/2}(1 − 2ε), while (iii) there exists a data re-uploading model acting on a single qubit that achieves a perfect expected loss and can be trained to do so from M = 32 log(2d/δ) samples, with probability 1 − δ.

Proof. We relate ε to ε_A^{(1)} (defined similarly to ε, but with respect to D_A^{(1)}): since D_A samples from D_A^{(1)} with probability 1/2, we have ε = (ε_A^{(1)} + ε_A^{(2)})/2 ≥ ε_A^{(1)}/2, since ε_A^{(2)} ≥ 0. From Lemma G.2, we have that, to ensure ε_A^{(1)} ≤ 2ε, we need at least d/4 + (1/2)log₂(1 − 2ε) qubits, which proves (i). (ii) follows similarly from Lemma G.3. Point (iii) corresponds to Lemmas G.4 and G.5.

Tight bounds on linear realizations of data re-uploading models
A direct corollary of Lemmas G.2 and G.4 is a lower bound on the number of additional qubits required to map any data re-uploading model to an equivalent (explicit) linear model. Indeed, since the data re-uploading model of Fig. 14 can represent any parity function exactly for any input dimension d ∈ N, while a linear model using a number of qubits sublinear in d can only achieve poor approximations on average, we can easily prove the following theorem (Corollary 1 in the main text).
Theorem G.7. Any procedure that takes as input an arbitrary data re-uploading model f_θ with d encoding gates and returns an equivalent explicit model f̃_θ (i.e., a universal mapping) must produce models acting on Ω(d) additional qubits for worst-case inputs.
Proof. By contradiction: were there a universal mapping using only O(d^{1−α}) additional qubits, for some α > 0 (i.e., sublinearly many in d), applying it to the circuit of Fig. 14 would result in a linear model acting on O(d^{1−α}) qubits with a perfect performance in representing parity functions, which contradicts Lemma G.2. Note that our gate-teleportation mapping has an overhead of O(d log(d/δ′)) qubits, where δ′ is a controllable parameter. This theorem hence proves that our mapping is essentially optimal with respect to this overhead.

The case of classification
So far, in our separation results, we have only considered a regression loss (the mean-squared error), despite the parity functions having a discrete output. It is an intriguing question whether similar separation results can be obtained in the case of classification, i.e., for a binary classification loss E_{x∼D_X}[|sign(f(x)) − g_A(x)|], for instance.
It is rather straightforward to show an Ω(d) qubit lower bound for linear classifiers that achieve exact learning (i.e., a loss of 0). We can consider here the concept class of all k-sparse parity functions for k ∈ {0, . . ., d}, such that it contains all possible labelings X → {−1, 1}. Therefore, a model that can represent all these functions exactly needs, by definition, a VC dimension of at least 2^d. But we know that, for linear quantum classifiers acting on n qubits, this VC dimension is upper bounded by 2^{2n} + 1 [36]. The lower bound then trivially follows.
Making this lower bound robust (i.e., allowing a loss ε ≥ 0) is, however, a harder task. By noting that an Ω(d) lower bound on the feature-space dimension of a linear classifier implies an Ω(log(d)) lower bound on the number of qubits of a quantum linear classifier (since dim(F) = 2^(2n)), we can adapt a result from Kamath et al. [56] to show the following: to achieve an average classification error ε on 1-sparse parity functions (i.e., f_i(x) = x_i), a linear quantum classifier needs to act on Ω(log[d(1 − H(ε))]) qubits, where H(ε) is the binary entropy function. However, according to the same authors, establishing a stronger lower bound (e.g., a number of qubits poly-logarithmic in d, or equivalently, a feature-space dimension super-polynomial in d) for a similar task would constitute a major frontier in complexity theory. Such a result would provide a function that requires a depth-2 threshold circuit of super-polynomial size to be computed, while the best known lower bounds on the size of depth-2 threshold circuits are only polynomial.

FIG. 1. The quantum machine learning models studied in this work. (a) An explicit quantum model, where the label of a data point x is specified by the expectation value of a variational measurement on its associated quantum feature state ρ(x). (b) The quantum kernel associated to these quantum feature states. The expectation value of the projector P_0 = |0⟩⟨0| corresponds to the inner product between ρ(x) and ρ(x′). An implicit quantum model is defined by a linear combination of such inner products, for x an input point and x′ training data points. (c) A data re-uploading model, interleaving data-encoding and variational unitaries before a final measurement.

FIG. 2. The model families in quantum machine learning. (a) While data re-uploading models are by definition a generalization of linear quantum models, our exact mappings (see Sec. II B) demonstrate that any polynomial-size data re-uploading model can be realized by a polynomial-size explicit linear model. (b) Kernelizing an explicit model corresponds to turning its observable into a linear combination of feature states ρ(x), for x in a dataset D. The representer theorem (see Sec. III A) guarantees that, for any dataset D, the implicit model f*_{α,D} minimizing the training loss associated to D outperforms any explicit minimizer f*_θ from the same reproducing kernel Hilbert space (RKHS) with respect to this same training loss. However, depending on the feature encoding ρ(·) and the data distribution, a restricted dataset D may cause the implicit minimizer f*_{α,D} to severely overfit the dataset and have dramatically worse generalization performance than f*_θ (see Secs. III B and III D).

FIG. 5. Learning separations. We describe a learning task based on parity functions acting on d-bit input vectors x ∈ {−1, 1}^d, for d ∈ N. This task allows us to separate all three quantum models studied in this work in terms of their resource requirements, as a function of d (see Theorem 2).

FIG. 6. Regression performance of explicit, implicit, and classical models on a "quantum-tailored" learning task. For all system sizes, each model has access to a training set of M = 1000 pre-processed and re-labeled fashion-MNIST images. The testing loss is computed on a test set of size 100. Shaded regions indicate the standard deviation over 10 labeling functions. The training errors of implicit models are close to 0 for all system sizes.

FIG. 12. Best performance of implicit models for different regularization strengths.

FIG. 13. Regression performance of explicit models from the same variational family as the models generating the data labels (1) and from a different variational family (2).
[X_i] = 0 for i ∈ A, while E_{D_A^(1)}[X_i] = E_{D_A^(2)}[X_i] = 2 for i ∉ A. Therefore E_{D_A}[X_i] = 1 for i ∈ A, and E_{D_A}[X_i] = 2 otherwise.

We make this intuition rigorous in the following theorem (restatement of Theorem 2 in the main text).
Theorem G.6. There exists a regression task specified by an input dimension d ∈ N, a function family {g_A : {−1, 1}^d → {−1, 1}}_{A ⊂ [d], |A| = d/2}, and associated input distributions D_A, such that, to achieve an average mean-squared error E_A[inf_{f∈W} ||f − g_A||^2_{D_A}] = ε < 1/2, (i) any linear quantum model needs to act on Ω(d) qubits, (ii) any implicit quantum model additionally requires M ≥ 2^(d/2) (1 − 2ε) data samples, while (iii) a data re-uploading model acting on a single qubit can be trained to achieve a perfect expected loss with probability 1 − δ, using M = 32 log(2d/δ) data samples.
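The gap between requirements (ii) and (iii) can be made concrete with a quick numerical comparison. The following sketch uses illustrative values for ε and δ (our choice, not taken from the text) and assumes the logarithm in M = 32 log(2d/δ) is natural:

```python
import numpy as np

# Compare the sample-complexity scalings of Theorem G.6:
#  - implicit (kernel) models:   M >= 2^(d/2) * (1 - 2*eps)
#  - single-qubit re-uploading:  M  = 32 * log(2*d / delta)
eps, delta = 0.1, 0.01  # illustrative target error and failure probability

for d in (10, 20, 40, 80):
    m_implicit = 2 ** (d / 2) * (1 - 2 * eps)
    m_reupload = 32 * np.log(2 * d / delta)
    print(f"d = {d:3d}: implicit M >= {m_implicit:.2e}, "
          f"re-uploading M = {m_reupload:.1f}")
```

Already at moderate d, the exponential lower bound for implicit models dwarfs the logarithmic requirement of the data re-uploading model, which is the quantitative content of the separation.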