Abstract
Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data reuploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data reuploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
Introduction
In the current noisy intermediate-scale quantum (NISQ) era^{1}, a few methods have been proposed to construct useful quantum algorithms that are compatible with mild hardware restrictions^{2,3}. Most of these methods involve the specification of a quantum circuit Ansatz, optimized in a classical fashion to solve specific computational tasks. Next to variational quantum eigensolvers in chemistry^{4} and variants of the quantum approximate optimization algorithm^{5}, machine learning approaches based on such parametrized quantum circuits^{6} stand as some of the most promising practical applications to yield quantum advantages.
In essence, a supervised machine learning problem often reduces to the task of fitting a parametrized function—also referred to as the machine learning model—to a set of previously labeled points, called a training set. Interestingly, many problems in physics and beyond, from the classification of phases of matter^{7} to predicting the folding structures of proteins^{8}, can be phrased as such machine learning tasks. In the domain of quantum machine learning^{9,10}, an emerging approach for this type of problem is to use parametrized quantum circuits to define a hypothesis class of functions^{11,12,13,14,15,16}. The hope is for these parametrized models to offer representational power beyond what is possible with classical models, including highly successful deep neural networks. And indeed, we have substantial evidence of such a quantum learning advantage for artificial problems^{16,17,18,19,20,21}, but the next frontier is to show that quantum models can be advantageous in solving real-world problems as well. Yet, it is still unclear which of these models we should preferably use in practical applications. To bring quantum machine learning models forward, we first need a deeper understanding of their learning performance guarantees and the actual resource requirements they entail.
Previous works have made strides in this direction by exploiting a connection between some quantum models and kernel methods from classical machine learning^{22}. Many quantum models indeed operate by encoding data in a high-dimensional Hilbert space and using solely inner products evaluated in this feature space to model the properties of the data. This is also how kernel methods work. Building on this similarity, the authors of refs. ^{23,24} noted that a given quantum encoding can be used to define two types of models (see Fig. 1): (a) explicit quantum models, where an encoded data point is measured according to a variational observable that specifies its label, or (b) implicit kernel models, where weighted inner products of encoded data points are used to assign labels instead. In the quantum machine learning literature, much emphasis has been placed on implicit models^{20,25,26,27,28,29,30,31}, in part due to a fundamental result known as the representer theorem^{22}. This result shows that implicit models can always achieve a smaller labeling error than explicit models, when evaluated on the same training set. Seemingly, this suggests that implicit models are systematically more advantageous than their explicit counterparts in solving machine learning tasks^{25}. This idea also inspired a line of research where, in order to evaluate the existence of quantum advantages, classical models were only compared to quantum kernel methods. This restricted comparison led to the conclusion that classical models could be competitive with (or outperform) quantum models, even in tailored quantum problems^{20}.
In recent times, there has also been progress in so-called data reuploading models^{32}, which have demonstrated their importance in designing expressive models, both analytically^{33} and empirically^{15,16,32}, and in proving that (even single-qubit) parametrized quantum circuits are universal function approximators^{34,35}. Through their alternation of data-encoding and variational unitaries, data reuploading models can be seen as a generalization of explicit models. However, this generalization also breaks the correspondence to implicit models, as a given data point x no longer corresponds to a fixed encoded point ρ(x). Hence, these observations suggest that data reuploading models are strictly more general than explicit models and that they are incompatible with the kernel-model paradigm. Until now, it remained an open question whether some advantage could be gained from data reuploading models, in light of the guarantees of kernel methods.
In this work, we introduce a unifying framework for explicit, implicit and data reuploading quantum models (see Fig. 2). We show that all function families stemming from these can be formulated as linear models in suitably defined quantum feature spaces. This allows us to systematically compare explicit and data reuploading models to their kernel formulations. We find that, while kernel models are guaranteed to achieve a lower training error, this improvement can come at the cost of a poor generalization performance outside the training set. Our results indicate that the advantages of quantum machine learning may lie beyond kernel methods, more specifically in explicit and data reuploading models. To corroborate this theory, we quantify the resource requirements of these different quantum models in terms of the number of qubits and data points needed to learn. We show the existence of a regression task with exponential separations between each pair of quantum models, demonstrating the practical advantages of explicit models over implicit models, and of data reuploading models over explicit models. From an experimental perspective, these separations shed light on the resource efficiency of different quantum models, which is of crucial importance for nearterm applications in quantum machine learning.
Results
A unifying framework for quantum learning models
We start by reviewing the notion of linear quantum models and explain how explicit and implicit models are by definition linear models in quantum feature spaces. We then present data reuploading models and show how, despite being defined as a generalization of explicit models, they can also be realized by linear models in larger Hilbert spaces.
Linear quantum models
Let us first understand how explicit and implicit quantum models can both be described as linear quantum models^{25,36}. To define both of these models, we first consider a feature encoding unitary \(U_{\phi}:{\mathcal{X}}\to{\mathcal{F}}\) that maps input vectors \({\boldsymbol{x}}\in{\mathcal{X}}\), e.g., images in \({{\mathbb{R}}}^{d}\), to n-qubit quantum states \(\rho({\boldsymbol{x}})=U_{\phi}({\boldsymbol{x}})\left|{\mathbf{0}}\right\rangle\left\langle{\mathbf{0}}\right|U_{\phi}^{\dagger}({\boldsymbol{x}})\) in the Hilbert space \({\mathcal{F}}\) of 2^{n} × 2^{n} Hermitian operators.
A linear function in the quantum feature space \({\mathcal{F}}\) is defined by the expectation values

$$f({\boldsymbol{x}})=\mathrm{Tr}[\rho({\boldsymbol{x}})\,O]\qquad(1)$$

for some Hermitian observable \(O\in{\mathcal{F}}\). Indeed, one can see from Eq. (1) that f(x) is the Hilbert–Schmidt inner product between the Hermitian matrices ρ(x) and O, which is by definition a linear function of the form \({\langle\phi({\boldsymbol{x}}),w\rangle}_{{\mathcal{F}}}\), for ϕ(x) = ρ(x) and w = O. In a regression task, these real-valued expectation values are used directly to define a labeling function, while in a classification task, they are post-processed to produce discrete labels (using, for instance, a sign function).
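As a minimal numerical sketch (using a toy single-qubit R_y encoding of our own choosing, not one from the paper), the identity between Tr[ρ(x)O] and the Hilbert–Schmidt inner product can be checked directly:

```python
import numpy as np

def feature_state(x):
    """Encode a scalar x as a single-qubit pure state rho(x) = U(x)|0><0|U(x)^dag,
    with an R_y(x) rotation as a toy feature-encoding unitary (illustrative choice)."""
    psi = np.array([np.cos(x / 2), np.sin(x / 2)])  # R_y(x)|0>
    return np.outer(psi, psi)

def linear_model(x, O):
    """f(x) = Tr[rho(x) O]: a linear model in the feature space of Hermitian matrices."""
    return np.trace(feature_state(x) @ O).real

Z = np.diag([1.0, -1.0])  # observable O = Pauli-Z
x = 0.7
# Same value as the Hilbert-Schmidt inner product <phi(x), O>_F = sum_ij rho_ij O_ij:
hs = np.sum(feature_state(x) * Z).real
assert np.isclose(linear_model(x, Z), hs)
assert np.isclose(linear_model(x, Z), np.cos(x))  # <Z> after R_y(x) is cos(x)
```
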
Explicit and implicit models differ in the way they define the family of observables {O} they each consider.
An explicit quantum model^{23,24} using the feature encoding U_{ϕ}(x) is defined by a variational family of unitaries V(θ) and a fixed observable O, such that the expectation values

$$f_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=\mathrm{Tr}[\rho({\boldsymbol{x}})\,O_{{\boldsymbol{\theta}}}],\qquad(2)$$

for O_{θ} = V(θ)^{†}OV(θ), specify its labeling function. Restricting the family of variational observables \({\{O_{{\boldsymbol{\theta}}}\}}_{{\boldsymbol{\theta}}}\) is equivalent to restricting the vectors w accessible to the linear quantum model \(f({\boldsymbol{x}})={\langle\phi({\boldsymbol{x}}),w\rangle}_{{\mathcal{F}}}\), \(w\in{\mathcal{F}}\), associated with the encoding ρ(x).
Implicit quantum models^{23,24} are constructed from the quantum feature states ρ(x) in a different way. Their definition depends directly on the data points {x^{(1)}, …, x^{(M)}} in a given training set \({\mathcal{D}}\), as they take the form of a linear combination

$$f_{{\boldsymbol{\alpha}},{\mathcal{D}}}({\boldsymbol{x}})=\mathop{\sum}\limits_{m=1}^{M}\alpha_{m}\,k({\boldsymbol{x}},{\boldsymbol{x}}^{(m)}),\qquad(3)$$

for \(k({\boldsymbol{x}},{\boldsymbol{x}}^{(m)})={\langle\phi({\boldsymbol{x}}),\phi({\boldsymbol{x}}^{(m)})\rangle}_{{\mathcal{F}}}=\mathrm{Tr}[\rho({\boldsymbol{x}})\rho({\boldsymbol{x}}^{(m)})]\) the kernel function associated with the feature encoding U_{ϕ}(x). By linearity of the trace, however, we can express any such implicit model as a linear model in \({\mathcal{F}}\), defined by the observable

$$O_{{\boldsymbol{\alpha}},{\mathcal{D}}}=\mathop{\sum}\limits_{m=1}^{M}\alpha_{m}\,\rho({\boldsymbol{x}}^{(m)}).\qquad(4)$$
Therefore, both explicit and implicit quantum models belong to the general family of linear models in the quantum feature space \({{{{{{{\mathcal{F}}}}}}}}\).
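This equivalence between the kernel form and the linear form can be verified numerically; here is a small sketch with a toy single-qubit encoding and arbitrary weights α (both of our own choosing, for illustration only):

```python
import numpy as np

def feature_state(x):
    """Toy single-qubit feature encoding rho(x) (an R_y(x) rotation of |0>)."""
    psi = np.array([np.cos(x / 2), np.sin(x / 2)])
    return np.outer(psi, psi)

xs = np.array([0.3, 1.1, 2.0])       # toy "training points"
alphas = np.array([0.5, -1.0, 2.0])  # arbitrary kernel weights

def implicit_model_kernel(x):
    """f(x) = sum_m alpha_m k(x, x_m), with k(x, x') = Tr[rho(x) rho(x')]."""
    return sum(a * np.trace(feature_state(x) @ feature_state(xm)).real
               for a, xm in zip(alphas, xs))

def implicit_model_linear(x):
    """The same model written as a linear model Tr[rho(x) O],
    with the observable O = sum_m alpha_m rho(x_m)."""
    O = sum(a * feature_state(xm) for a, xm in zip(alphas, xs))
    return np.trace(feature_state(x) @ O).real

# By linearity of the trace, the two forms agree for every input:
for x in [0.0, 0.9, 2.5]:
    assert np.isclose(implicit_model_kernel(x), implicit_model_linear(x))
```
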
Linear realizations of data reuploading models
Data reuploading models^{32}, on the other hand, do not naturally fit this formulation. These models generalize explicit models by increasing the number of encoding layers U_{ℓ}(x), 1 ≤ ℓ ≤ L (which can all be distinct), and interlaying them with variational unitaries V_{ℓ}(θ). This results in expectation-value functions of the form

$$f_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=\mathrm{Tr}[\rho_{{\boldsymbol{\theta}}}({\boldsymbol{x}})\,O_{{\boldsymbol{\theta}}}]\qquad(5)$$

for a variational encoding \(\rho_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=U({\boldsymbol{x}},{\boldsymbol{\theta}})\left|{\mathbf{0}}\right\rangle\left\langle{\mathbf{0}}\right|U^{\dagger}({\boldsymbol{x}},{\boldsymbol{\theta}})\), where \(U({\boldsymbol{x}},{\boldsymbol{\theta}})=U_{L}({\boldsymbol{x}})\mathop{\prod}\nolimits_{\ell=1}^{L-1}V_{\ell}({\boldsymbol{\theta}})U_{\ell}({\boldsymbol{x}})\), and a variational observable O_{θ} = V_{L}(θ)^{†}OV_{L}(θ). Given that the unitaries U_{ℓ}(x) and \(V_{\ell^{\prime}}({\boldsymbol{\theta}})\) do not commute in general, one cannot straightforwardly gather all trainable gates into a final variational observable \(O_{{\boldsymbol{\theta}}}^{\prime}\in{\mathcal{F}}\) so as to obtain a linear model \({\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})={\langle\phi({\boldsymbol{x}}),O_{{\boldsymbol{\theta}}}^{\prime}\rangle}_{{\mathcal{F}}}\) with a fixed quantum feature encoding ϕ(x). Our first contribution is to show that, by augmenting the dimension of the Hilbert space \({\mathcal{F}}\) (i.e., considering circuits that act on a larger number of qubits), one can construct such explicit linear realizations \({\tilde{f}}_{{\boldsymbol{\theta}}}\) of data reuploading models.
That is, given a family of data reuploading models \({\{f_{{\boldsymbol{\theta}}}(\cdot)=\mathrm{Tr}[\rho_{{\boldsymbol{\theta}}}(\cdot)O_{{\boldsymbol{\theta}}}]\}}_{{\boldsymbol{\theta}}}\), we can construct an equivalent family of explicit models \({\{{\tilde{f}}_{{\boldsymbol{\theta}}}(\cdot)=\mathrm{Tr}[\rho^{\prime}(\cdot)O_{{\boldsymbol{\theta}}}^{\prime}]\}}_{{\boldsymbol{\theta}}}\) that represents all functions in the original family, along with an efficient procedure to map the former models to the latter.
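As a concrete toy illustration of an Eq. (5)-style model (a single-qubit R_y encoding of our own choosing, not a construction from the paper), the following sketch alternates data-encoding and variational rotations; re-uploading the same x several times enlarges the frequency content of the model beyond what one encoding layer can produce:

```python
import numpy as np

def ry(a):
    """Single-qubit R_y(a) rotation."""
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def data_reuploading_model(x, thetas):
    """f_theta(x) = <0| U(x,theta)^dag Z U(x,theta) |0>, where
    U(x,theta) alternates encoding gates U_l(x) = R_y(x) with
    variational gates V_l(theta) = R_y(theta_l): the same x is re-encoded
    between variational layers (len(thetas) + 1 encodings in total)."""
    Z = np.diag([1.0, -1.0])
    psi = np.array([1.0, 0.0])
    psi = ry(x) @ psi                  # U_1(x)
    for th in thetas:                  # V_l(theta), then re-upload x
        psi = ry(x) @ ry(th) @ psi
    return psi @ Z @ psi

# With trivial variational layers and 3 encodings, the model is cos(3x):
# a frequency that a single R_y(x) encoding (which yields cos(x)) cannot reach.
f = lambda x: data_reuploading_model(x, thetas=[0.0, 0.0])
assert np.isclose(f(0.5), np.cos(3 * 0.5))
```
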
Before getting to the main result of this section (Theorem 1), we first present an illustrative construction to convey intuition on how mappings from data reuploading to explicit models can be realized. This construction, depicted in Fig. 3, leads to approximate mappings, meaning that these only guarantee \(|{\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})-f_{{\boldsymbol{\theta}}}({\boldsymbol{x}})|\le\delta\), ∀ x, θ, for some (adjustable) error of approximation δ. More precisely, we have:
Proposition 1 Given an arbitrary data reuploading model f_{θ}(x) = Tr[ρ_{θ}(x)O_{θ}] as specified by Eq. (5), and an approximation error δ > 0, there exists a mapping that produces an explicit model \({\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=\mathrm{Tr}[\rho^{\prime}({\boldsymbol{x}})O_{{\boldsymbol{\theta}}}^{\prime}]\) as specified by Eq. (2), such that

$$|{\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})-f_{{\boldsymbol{\theta}}}({\boldsymbol{x}})|\le\delta,\quad\forall{\boldsymbol{x}},{\boldsymbol{\theta}}.$$

For D the number of encoding gates used by the data reuploading model and \({\left\Vert O\right\Vert}_{\infty}\) the spectral norm of its observable, the explicit model uses \({\mathcal{O}}(D\log(D{\left\Vert O\right\Vert}_{\infty}{\delta}^{-1}))\) additional qubits and gates.
The general idea behind this construction is to encode the input data x in ancilla qubits, to finite precision, which can then be used repeatedly to approximate data-encoding gates using data-independent unitaries. More precisely, all data components \(x_{i}\in{\mathbb{R}}\) of an input vector x = (x_{1}, …, x_{d}) are encoded as bitstrings \(\left|{\widetilde{{\boldsymbol{x}}}}_{i}\right\rangle=\left|b_{0}b_{1}\ldots b_{p-1}\right\rangle\in{\{0,1\}}^{p}\), to some precision ε = 2^{−p} (e.g., using R_{x}(πb_{j}) rotations on \(\left|0\right\rangle\) states). Now, using p fixed rotations, e.g., of the form R_{z}(2^{−j}), controlled by the bits \(\left|b_{j}\right\rangle\) and acting on n “working” qubits, one can encode every x_{i} in arbitrary (multi-qubit) rotations \(e^{{\rm{i}}x_{i}H}\), e.g., R_{z}(x_{i}), arbitrarily many times. Given that all these fixed rotations are data-independent, the feature encoding of any such circuit hence reduces to the encoding of the classical bitstrings \({\widetilde{{\boldsymbol{x}}}}_{i}\), prior to all variational operations. By preserving the variational unitaries appearing in a data reuploading circuit and replacing its encoding gates with such controlled rotations, we can then approximate any data reuploading model of the form of Eq. (5). The approximation error δ of this mapping originates from the finite precision ε of encoding x, which results in an imperfect implementation of the encoding gates in the original circuit. But as ε → 0, we also have δ → 0, and the scaling of ε (or the number of ancillas d⋅p) as a function of δ is detailed in Supplementary Section 2.
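A minimal numerical sketch of the underlying arithmetic (classical simulation only; the controlled-rotation circuit itself is not reproduced here, and the bit convention is our own): truncating x to p bits and composing fixed, data-independent rotations R_z(2^{−(j+1)}) for the set bits recovers R_z(x) up to the truncation error ε = 2^{−p}:

```python
import numpy as np

def rz(a):
    """Single-qubit R_z(a) rotation."""
    return np.array([[np.exp(-1j * a / 2), 0], [0, np.exp(1j * a / 2)]])

def bits(x, p):
    """Binary digits b_j of x in [0, 1), so that x ~ sum_j b_j 2^{-(j+1)}."""
    return [(int(x * 2 ** (j + 1)) % 2) for j in range(p)]

def rz_from_bits(x, p):
    """Compose the fixed rotations R_z(2^{-(j+1)}) only for the set bits of x,
    as the bit-controlled rotations in the construction would do."""
    U = np.eye(2, dtype=complex)
    for j, b in enumerate(bits(x, p)):
        if b:
            U = rz(2.0 ** -(j + 1)) @ U
    return U

x, p = 0.8125, 4               # 0.8125 = 0.1101 in binary: exactly representable
assert np.allclose(rz_from_bits(x, p), rz(x))

x, p = 0.3, 16                 # generic x: error bounded by the precision 2^{-p}
assert np.linalg.norm(rz_from_bits(x, p) - rz(x)) < 2.0 ** -p
```
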
We now move to our main construction, resulting in exact mappings between data reuploading and explicit models, i.e., that achieve δ = 0 with finite resources. We rely here on a similar idea to our previous construction, in which we encode the input data on ancilla qubits and later use dataindependent operations to implement the encoding gates on the working qubits. The difference here is that we use gateteleportation techniques, a form of measurementbased quantum computation^{37}, to directly implement the encoding gates on ancillary qubits and teleport them back (via entangled measurements) onto the working qubits when needed (see Fig. 4).
Theorem 1 Given an arbitrary data reuploading model f_{θ}(x) = Tr[ρ_{θ}(x)O_{θ}] as specified by Eq. (5), there exists a mapping that produces an equivalent explicit model \({\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=\mathrm{Tr}[\rho^{\prime}({\boldsymbol{x}})O_{{\boldsymbol{\theta}}}^{\prime}]\) as specified by Eq. (2), such that

$${\tilde{f}}_{{\boldsymbol{\theta}}}({\boldsymbol{x}})=f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}),\quad\forall{\boldsymbol{x}},{\boldsymbol{\theta}},$$

and \({\left\Vert O_{{\boldsymbol{\theta}}}^{\prime}\right\Vert}_{\infty}^{2}\le{(1-\delta^{\prime})}^{-1}{\left\Vert O_{{\boldsymbol{\theta}}}\right\Vert}_{\infty}^{2}\), for an arbitrary renormalization parameter \(\delta^{\prime}>0\). For D the number of encoding gates used by the data reuploading model, the equivalent explicit model uses \({\mathcal{O}}(D\log(D/\delta^{\prime}))\) additional qubits and gates.
As we detail in Supplementary Section 2, gate teleportation cannot succeed with unit probability without gate-dependent (and hence data-dependent) corrections conditioned on the measurement outcomes of the ancillas. But since we only care about equality of expectation values (Tr[ρ_{θ}(x)O_{θ}] and \(\mathrm{Tr}[\rho^{\prime}({\boldsymbol{x}})O_{{\boldsymbol{\theta}}}^{\prime}]\)), we can simply discard these measurement outcomes in the observable \(O_{{\boldsymbol{\theta}}}^{\prime}\) (i.e., project on the correction-free measurement outcomes). In general, this leads to an observable with a spectral norm \({\left\Vert O_{{\boldsymbol{\theta}}}^{\prime}\right\Vert}_{\infty}^{2}={2}^{D}{\left\Vert O_{{\boldsymbol{\theta}}}\right\Vert}_{\infty}^{2}\) exponentially larger than originally, and hence a model that is exponentially harder to evaluate to the same precision. Using a nested gate-teleportation scheme (see Supplementary Section 2) with repeated applications of the encoding gates, we can however efficiently make this norm overhead arbitrarily small.
As our findings indicate, mappings from data reuploading to explicit models are not unique, and seem to always incur the use of additional qubits. When discussing our learning separation results (see Corollary 1 below), we prove that this is indeed the case, and that any mapping from an arbitrary data reuploading model with D encoding gates to an equivalent explicit model must use Ω(D) additional qubits in general. This makes our gate-teleportation mapping essentially optimal (i.e., up to logarithmic factors) in this extra cost.
To summarize, in this section, we demonstrated that linear quantum models can describe not only explicit and implicit models, but also data reuploading circuits. More specifically, we showed that any hypothesis class of data reuploading models can be mapped to an equivalent class of explicit models, that is, linear models with a restricted family of observables. In Supplementary Section 3, we extend this result and show that explicit models can also approximate any computable (classical or quantum) hypothesis class.
Outperforming kernel methods with explicit and data reuploading models
From the standpoint of relating quantum models to each other, we have shown that the framework of linear quantum models allows us to unify all standard models based on parametrized quantum circuits. While these findings are interesting from a theoretical perspective, they do not reveal how these models compare in practice. In particular, we would like to understand the advantages of using one model rather than another to solve a given learning task. In this section, we address this question from several perspectives. First, we revisit the comparison between explicit and implicit models and clarify the implications of the representer theorem on the performance guarantees of these models. Then, we derive lower bounds for all three quantum models studied in this work in terms of their resource requirements, and show the existence of exponential separations between each pair of models. Finally, we discuss the implications of these results on the search for a quantum advantage in machine learning.
Classical background and the representer theorem
Interestingly, a piece of functional analysis from learning theory gives us a way of characterizing any family of linear quantum models^{25}. Namely, the so-called reproducing kernel Hilbert space, or RKHS^{22}, is the Hilbert space \({\mathcal{H}}\) spanned by all functions of the form \(f({\boldsymbol{x}})={\langle\phi({\boldsymbol{x}}),w\rangle}_{{\mathcal{F}}}\), for all \(w\in{\mathcal{F}}\). It includes all explicit and implicit models defined by the quantum feature states ϕ(x) = ρ(x). From this point of view, a relaxation of any learning task using implicit or explicit models as a hypothesis family consists in finding the function in the RKHS \({\mathcal{H}}\) that has optimal learning performance. For the supervised learning task of modeling a target function g(x) using a training set \(\{({\boldsymbol{x}}^{(1)},g({\boldsymbol{x}}^{(1)})),\ldots,({\boldsymbol{x}}^{(M)},g({\boldsymbol{x}}^{(M)}))\}\), this learning performance is usually measured in terms of a training loss of the form, e.g.,

$$\widehat{{\mathcal{L}}}(f)=\frac{1}{M}\mathop{\sum}\limits_{m=1}^{M}{\left(f({\boldsymbol{x}}^{(m)})-g({\boldsymbol{x}}^{(m)})\right)}^{2}.$$
The true figure of merit of this problem, however, is in minimizing the expected loss \({\mathcal{L}}(f)\), defined similarly as a probability-weighted average over the entire data space \({\mathcal{X}}\). For this reason, a so-called regularization term \(\lambda{\left\Vert f\right\Vert}_{{\mathcal{H}}}^{2}=\lambda{\left\Vert O\right\Vert}_{{\mathcal{F}}}^{2}\) is often added to the training loss, \({\widehat{{\mathcal{L}}}}_{\lambda}(f)=\widehat{{\mathcal{L}}}(f)+\lambda{\left\Vert O\right\Vert}_{{\mathcal{F}}}^{2}\), to incentivize the model not to overfit on the training data. Here, λ ≥ 0 is a hyperparameter that controls the strength of this regularization.
Learning theory also allows us to characterize the linear models in \({\mathcal{H}}\) that are optimal with respect to the regularized training loss \({\widehat{{\mathcal{L}}}}_{\lambda}(f)\), for any λ ≥ 0. Specifically, the representer theorem^{22} states that the model \(f_{{\rm{opt}}}\in{\mathcal{H}}\) minimizing \({\widehat{{\mathcal{L}}}}_{\lambda}(f)\) is always a kernel model of the form of Eq. (3) (see Supplementary Section 1 for a formal statement). A direct corollary of this result is that implicit quantum models are guaranteed to achieve a lower (or equal) regularized training loss than any explicit quantum model using the same feature encoding^{25}. Moreover, the optimal weights α_{m} of this model can be computed efficiently using \({\mathcal{O}}({M}^{2})\) evaluations of inner products on a quantum computer (that is, by estimating the expectation value in Fig. 1b for all pairs of training points) and with classical post-processing in time \({\mathcal{O}}({M}^{3})\) using, e.g., ridge regression or support vector machines^{22}. In this work, we ignore the precision required in estimating the quantum kernel. We note, however, that these estimations can require exponentially many measurements in the number of qubits, both for explicit^{38} and implicit^{27} models.
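For concreteness, here is a sketch of this classical post-processing step, with a toy single-qubit fidelity kernel standing in for the quantum estimations (the encoding and the tolerance are our own illustrative choices):

```python
import numpy as np

def kernel(x1, x2):
    """k(x, x') = |<psi(x)|psi(x')>|^2 = cos^2((x - x')/2) for a toy R_y(x)|0>
    encoding; on hardware each entry would be one expectation-value estimation."""
    return np.cos((x1 - x2) / 2) ** 2

def fit_ridge(xs, ys, lam):
    """Representer theorem in action: the optimal model is
    f(x) = sum_m alpha_m k(x, x_m), with alpha = (K + lam*I)^{-1} y
    computed classically in O(M^3) (here via a dense linear solve)."""
    K = kernel(xs[:, None], xs[None, :])          # M x M Gram matrix
    return np.linalg.solve(K + lam * np.eye(len(xs)), ys)

rng = np.random.default_rng(0)
xs = rng.uniform(0, np.pi, 8)                     # M = 8 training inputs
ys = np.sin(xs)                                   # target values in the RKHS
alpha = fit_ridge(xs, ys, lam=1e-6)
preds = kernel(xs[:, None], xs[None, :]) @ alpha
assert np.max(np.abs(preds - ys)) < 1e-3          # near-zero training error at small lambda
```
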
This result may be construed to suggest that, in our study of quantum machine learning models, we only need to worry about implicit models: the only real question would be which feature encoding circuit to use to compute a kernel function, and all the machine learning would otherwise be classical. In the next subsections, however, we show the value of explicit and data reuploading approaches in terms of generalization performance and resource requirements.
Explicit can outperform implicit models
We turn our attention back to the explicit models resulting from our approximate mappings (see Fig. 3). Note that the kernel function associated with their bitstring encodings \(\left|\psi({\boldsymbol{x}})\right\rangle={\left|0\right\rangle}^{\otimes n}\left|\tilde{{\boldsymbol{x}}}\right\rangle\), \(\rho({\boldsymbol{x}})=\left|\psi({\boldsymbol{x}})\right\rangle\left\langle\psi({\boldsymbol{x}})\right|\), is trivially

$$k({\boldsymbol{x}},{\boldsymbol{x}}^{\prime})=\mathrm{Tr}[\rho({\boldsymbol{x}})\rho({\boldsymbol{x}}^{\prime})]={\delta}_{\tilde{{\boldsymbol{x}}},{\tilde{{\boldsymbol{x}}}}^{\prime}},\qquad(9)$$

that is, the Kronecker delta function of the bitstrings \(\tilde{{\boldsymbol{x}}}\) and \({\tilde{{\boldsymbol{x}}}}^{\prime}\). Let us emphasize that, for an appropriate precision ε of encoding input vectors x, the family of explicit models resulting from our construction includes good approximations of virtually any parametrized quantum circuit model acting on n qubits. Yet, all of these result in the same kernel function of Eq. (9). This is a rather surprising result, for two reasons. First, this kernel is classically computable, which, in light of the representer theorem, seems to suggest that a simple classical model of the form of Eq. (3) can outperform any explicit quantum model stemming from our construction, and hence any quantum model in the limit ε → 0. Second, this implicit model always takes the form

$$f_{{\boldsymbol{\alpha}},{\mathcal{D}}}({\boldsymbol{x}})=\mathop{\sum}\limits_{m=1}^{M}\alpha_{m}\,{\delta}_{\tilde{{\boldsymbol{x}}},{\tilde{{\boldsymbol{x}}}}^{(m)}},\qquad(10)$$
which is a model that overfits the training data and fails to generalize to unseen data points: for ε → 0 and any choice of α, \(f_{{\boldsymbol{\alpha}},{\mathcal{D}}}({\boldsymbol{x}})=0\) for any x outside the training set. As we detail in Supplementary Section 2, similar observations can be made for the kernels resulting from our gate-teleportation construction.
These last remarks force us to rethink our interpretation of the representer theorem. When restricting our attention to the regularized training loss, implicit models do indeed lead to better training performance due to their increased expressivity. For example, on a classification task with labels g(x) = ±1, the kernel model of Eq. (10) is optimal with respect to any regularized training loss for α_{m} = g(x^{(m)}) ∀ m, such that \(\widehat{{\mathcal{L}}}(f)=0\) and \({\left\Vert f\right\Vert}_{{\mathcal{H}}}^{2}=M\). But, as our construction shows, this expressivity can dramatically harm the generalization performance of the learning model, despite the use of regularization during training. Hence, restricting the set of observables accessible to a linear quantum model (or, equivalently, restricting the accessible manifold of the RKHS) can potentially provide a substantial learning advantage.
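The overfitting mechanism can be made concrete with a toy simulation of the Kronecker-delta kernel model (the bitstrings and labels are arbitrary, our own illustration):

```python
import numpy as np

def delta_kernel(x1, x2):
    """The kernel of Eq. (9): k(x, x') = 1 iff the encoded bitstrings match."""
    return float(np.array_equal(x1, x2))

rng = np.random.default_rng(1)
ints = rng.choice(256, size=20, replace=False)           # 20 distinct 8-bit inputs
X_train = np.array([[(i >> j) & 1 for j in range(8)] for i in ints])
y_train = rng.choice([-1.0, 1.0], size=20)               # arbitrary +-1 labels

# Kernel model of Eq. (10) with the optimal weights alpha_m = g(x_m):
predict = lambda x: sum(y * delta_kernel(x, xm) for y, xm in zip(y_train, X_train))

# Zero training error (it memorizes every training point perfectly)...
assert all(predict(xm) == ym for xm, ym in zip(X_train, y_train))

# ...but the output is identically 0 on any unseen input: no generalization.
missing = next(i for i in range(256) if i not in set(ints))
x_new = np.array([(missing >> j) & 1 for j in range(8)])
assert predict(x_new) == 0.0
```
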
Rigorous learning separations between all quantum models
Motivated by the previous illustrative example, we analyze more rigorously the advantages of explicit and data reuploading models over implicit models. For this, we take a similar approach to recent works in classical machine learning which showed that neural networks can efficiently solve some learning tasks that linear or kernel methods cannot^{39,40}. In our case, we quantify the efficiency of a quantum model in solving a learning task by the number of qubits and the size of the training set it requires to achieve a nontrivial expected loss. To obtain scaling separations, we consider a learning task specified by an arbitrary input dimension \(d\in {\mathbb{N}}\) and express the resource requirements of the different quantum models as a function of d.
Similarly to ref. ^{39}, the learning task we focus on is that of learning parity functions (see Fig. 5). These functions take as input a d-dimensional binary vector x ∈ {−1, 1}^{d} and return the parity (i.e., the product) of a certain subset A ⊂ {1, …, d} of the components of x. The interesting property of these functions is that, for any two choices of A, the resulting parity functions are orthogonal in the Hilbert space \({\mathcal{H}}\) of functions from {−1, 1}^{d} to \({\mathbb{R}}\). Hence, since the number of possible choices for A grows combinatorially with d, the subspace of \({\mathcal{H}}\) that these functions span also grows combinatorially with d (this can be turned into a 2^{d} scaling by restricting the choices of A). On the other hand, a linear model (explicit or implicit) also covers a restricted subspace (or manifold) of \({\mathcal{H}}\). The dimension of this subspace is upper bounded by 2^{2n} for a quantum linear model acting on n qubits, and by M for an implicit model using M training samples (see Supplementary Section 7 for detailed explanations). Hence, by essentially comparing these dimensions (2^{d} versus 2^{2n} and M)^{40}, we can derive our lower bounds for explicit and implicit models. Data reuploading models, on the other hand, do not suffer from these dimensionality arguments: the different components of x can be processed sequentially by the model, such that a single-qubit data reuploading circuit can represent (and learn) any parity function.
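A sketch of such a single-qubit parity circuit (the specific encoding gates R_x(π(1 − x_i)/2) are our own illustrative choice, not necessarily the construction used in Supplementary Section 7):

```python
import numpy as np

def rx(a):
    """Single-qubit R_x(a) rotation."""
    return np.array([[np.cos(a / 2), -1j * np.sin(a / 2)],
                     [-1j * np.sin(a / 2), np.cos(a / 2)]])

def parity_reuploading_model(x, A):
    """Single-qubit data reuploading circuit for parity: each component
    x_i = +-1 with i in A is encoded sequentially by R_x(pi*(1 - x_i)/2),
    i.e. an X flip exactly when x_i = -1. The final <Z> expectation value
    then equals the product of the encoded components."""
    psi = np.array([1.0, 0.0], dtype=complex)
    for i in A:
        psi = rx(np.pi * (1 - x[i]) / 2) @ psi
    Z = np.diag([1.0, -1.0])
    return (psi.conj() @ Z @ psi).real

d, A = 6, [0, 2, 5]
for x in [np.array(v) for v in ([1] * 6, [-1] * 6, [1, 1, -1, -1, 1, -1])]:
    assert np.isclose(parity_reuploading_model(x, A), np.prod(x[A]))
```

A linear model would need a feature space whose dimension scales with the number of candidate parities, whereas here the d components are simply consumed one encoding gate at a time.
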
We summarize our results in the following theorem, and refer to Supplementary Section 7 for a more detailed exposition.
Theorem 2 There exists a regression task specified by an input dimension \(d\in{\mathbb{N}}\), a function family \({\{g_{A}:{\{-1,1\}}^{d}\to\{-1,1\}\}}_{A}\), and associated input distributions \({{\mathcal{D}}}_{A}\), such that, to achieve an average mean-squared error

$${{\mathbb{E}}}_{A}\,{{\mathbb{E}}}_{{\boldsymbol{x}}\sim{{\mathcal{D}}}_{A}}\left[{\left(f({\boldsymbol{x}})-g_{A}({\boldsymbol{x}})\right)}^{2}\right]\le\varepsilon,$$

(i) any linear quantum model needs to act on

$$n\ge{{\Omega}}(d+\log(1-2\varepsilon))$$

qubits,

(ii) any implicit quantum model additionally requires

$$M\ge{{\Omega}}({2}^{d}(1-2\varepsilon))$$

data samples, while

(iii) a data reuploading model acting on a single qubit and using d encoding gates can be trained to achieve zero expected error with probability 1 − δ, using \(M={\mathcal{O}}(\log(\frac{d}{\delta}))\) data samples.
A direct corollary of this result is a lower bound on the number of additional qubits that a universal mapping from any data reuploading model to equivalent explicit models must use:
Corollary 1 Any universal mapping that takes as input an arbitrary data reuploading model f_{θ} with D encoding gates and maps it to an equivalent explicit model \({\widetilde{f}}_{{\boldsymbol{\theta}}}\) must produce models acting on Ω(D) additional qubits for worst-case inputs.
Comparing this lower bound to the scaling of our gateteleportation mapping (Theorem 1), we find that it is optimal up to logarithmic factors.
Quantum advantage beyond kernel methods
A major challenge in quantum machine learning is showing that the quantum methods discussed in this work can achieve a learning advantage over (standard) classical methods. While some approaches to this problem focus on constructing learning tasks with separations based on complexity-theoretic assumptions^{17,19}, other works try to assess empirically the type of learning problems where quantum models show an advantage over standard classical models^{11,20}. In this line of research, Huang et al.^{20} propose looking into learning tasks where the target functions are themselves generated by (explicit) quantum models. Following observations similar to those made above about the learning performance guarantees of kernel methods, the authors also choose to assess the presence of quantum advantages by comparing the learning performance of standard classical models only to that of implicit quantum models (from the same family as the target explicit models). This restricted comparison led to the conclusion that, with the help of training data, classical machine learning models could be as powerful as quantum machine learning models, even in these tailored learning tasks.
Having discussed the limitations of kernel methods in the previous subsections, we revisit this type of numerical experiment and additionally evaluate the performance of explicit models on these tasks.
Similarly to Huang et al.^{20}, we consider a regression task with input data from the fashion-MNIST dataset^{41}, composed of 28 × 28-pixel images of clothing items. Using principal component analysis, we first reduce the dimension of these images to obtain n-dimensional vectors, for 2 ≤ n ≤ 12. We then label the images using an explicit model acting on n qubits. For this, we use the feature encoding proposed by Havlíček et al.^{23}, which is conjectured to lead to classically intractable kernels, followed by a hardware-efficient variational unitary^{4}. The expectation value of a Pauli Z observable on the first qubit then produces the data labels. Note that we additionally normalize the labels so as to obtain a standard deviation of 1 for all system sizes. On this newly defined learning task, we test the performance of explicit models from the same function family as the explicit models generating the (training and test) data, and compare it to that of implicit models using the same feature encoding (hence from the same extended family of linear models), as well as a list of standard classical machine learning algorithms whose hyperparameters are optimized for the task (see Supplementary Section 5). The results of this experiment are presented in Fig. 6.
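The preprocessing pipeline described above can be sketched in a few lines of NumPy. The random images and the simple labeling function below are stand-ins (the real labels come from the n-qubit explicit circuit), used only to show the PCA reduction and label normalization steps.

```python
import numpy as np

rng = np.random.default_rng(42)
images = rng.random((200, 784))   # stand-in for flattened 28 x 28 fashion-MNIST images
n = 4                             # target dimension; 2 <= n <= 12 in the experiment

# Principal component analysis via SVD on the centered data
centered = images - images.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:n].T     # n-dimensional feature vectors

# Hypothetical labeling function playing the role of the explicit quantum model
def label(x):
    return np.cos(x).prod()       # placeholder, not the paper's circuit

labels = np.array([label(x) for x in reduced])
labels = labels / labels.std()    # normalize labels to unit standard deviation
```

The normalization makes training and test losses comparable across system sizes n.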
The training losses we observe are consistent with our previous findings: the implicit models systematically achieve a lower training loss than their explicit counterparts. Notably, for an unregularized loss, the implicit models achieve a training loss of 0, and, as noted in Supplementary Section 6, adding regularization to the training loss of the implicit model does not affect the separation we observe here. With respect to the testing loss, on the other hand, which is representative of the expected loss, we see a clear separation starting from n = 7 qubits: the classical models start having a performance competitive with the implicit models, while the explicit models clearly outperform them both. This shows that the existence of a quantum advantage should not be assessed only by comparing classical models to quantum kernel methods, as explicit (or data re-uploading) models can conceal a substantially better learning performance.
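The zero training loss of the unregularized implicit models is a generic property of kernel methods: when the kernel matrix is invertible, the optimal linear combination of training-point feature vectors interpolates the training labels exactly. A toy NumPy illustration, with a Gaussian kernel standing in for the quantum fidelity kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # toy training inputs
y = rng.normal(size=20)        # toy training labels

# Gaussian kernel as a stand-in for the quantum fidelity kernel
def kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[kernel(a, b) for b in X] for a in X])

# Unregularized kernel regression: solve K @ alpha = y exactly
alpha = np.linalg.solve(K, y)

preds = K @ alpha
train_mse = np.mean((preds - y) ** 2)  # interpolation: essentially zero
```

Exact interpolation of the training data says nothing about the expected loss, which is precisely the gap the testing-loss comparison exposes.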
Discussion
In this work, we present a unifying framework for quantum machine learning models by expressing them as linear models in quantum feature spaces. In particular, we show how data re-uploading circuits can be represented exactly by explicit linear models in larger feature spaces. While this unifying formulation as linear models may suggest that all quantum machine learning models should be treated as kernel methods, we illustrate the advantages of variational quantum methods for machine learning. Going beyond the advantages in training performance guaranteed by the representer theorem, we first show how a systematic “kernelization” of linear quantum models can be harmful to their generalization performance. Furthermore, we analyze the resource requirements of these models (the number of qubits and data samples they use), and show the existence of exponential separations between data re-uploading, linear, and kernel quantum models in solving certain learning tasks.
One takeaway message from our results is that training loss, even when regularized, is a misleading figure of merit. Generalization performance, which is measured on unseen as well as seen data, is the quantity that actually matters in (quantum) machine learning. Out of context, these two sentences will seem obvious to anyone well-versed in learning theory. It is crucial, however, to keep this fact in mind when evaluating the consequences of the representer theorem. This theorem only concerns regularized training loss, and thus, despite its guarantees on the training loss of quantum kernel methods, it leaves room for explicit models to have an exponential learning advantage in the number of data samples they need to achieve good generalization performance.
Given the limitations of quantum kernel methods highlighted by these results, we revisit the discussion on the power of quantum learning models relative to classical models in machine learning tasks with quantum-generated data. In a learning task similar to that of Huang et al.^{20}, we show that, while standard classical models can be competitive with quantum kernel methods even in these “quantum-tailored” problems, variational quantum models can exhibit a significant learning advantage. These results give us a more comprehensive view of the quantum machine learning landscape and broaden our perspective on the type of models to use in order to achieve a practical learning advantage in the NISQ regime.
In this paper, we focus on the theoretical foundations of quantum machine learning models and on how expressivity impacts generalization performance. A major practical consideration, however, is the trainability of these models. In fact, obstacles to trainability are known for both explicit and implicit models. Explicit models can suffer from barren plateaus in their loss landscapes^{38,42}, which manifest as gradients that vanish exponentially in the number of qubits used, while implicit models can suffer from exponentially vanishing kernel values^{27,43}. While these phenomena arise under different conditions, both imply that an exponential number of circuit evaluations can be needed to train and make use of these models. Therefore, aside from the considerations made in this work, emphasis should also be placed on avoiding these obstacles in order to make good use of quantum machine learning models in practice.
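The vanishing-kernel-value phenomenon can be illustrated numerically. The sketch below (our own toy model, not a construction from the paper) estimates the average fidelity-kernel value |⟨ψ|ϕ⟩|² between approximately Haar-random states, which concentrates around 1/2^n:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_state(n_qubits):
    """Approximately Haar-random pure state on n_qubits qubits."""
    v = rng.normal(size=2 ** n_qubits) + 1j * rng.normal(size=2 ** n_qubits)
    return v / np.linalg.norm(v)

# Average fidelity-kernel value between unrelated states shrinks as 1/2^n,
# so exponentially many measurement shots are needed to resolve it from 0.
means = {
    n: np.mean([abs(np.vdot(random_state(n), random_state(n))) ** 2
                for _ in range(200)])
    for n in (2, 6, 10)
}
```

For feature maps that scramble inputs this strongly, estimating each kernel entry to fixed additive precision costs a number of circuit evaluations growing with 2^n.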
The learning task we use to show the existence of exponential learning separations between the different quantum models is based on parity functions, which do not constitute a concept class of practical interest in machine learning. We note, however, that our lower-bound results can also be extended to other learning tasks with concept classes of large dimension (i.e., composed of many orthogonal functions). Quantum kernel methods necessarily need a number of data points that scales linearly with this dimension, while, as we showcase in our results, the flexibility of data re-uploading circuits, as well as the restricted expressivity of explicit models, can lead to substantial savings in resources. It remains an interesting research direction to explore how and when these models can be tailored to the machine learning task at hand, e.g., through useful inductive biases (i.e., assumptions on the nature of the target functions) in their design.
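The dimension count behind this argument is easy to verify. A small NumPy check (our own illustration, not the paper's construction) confirming that the n-bit parity class contains 2^n mutually orthogonal functions under the uniform input distribution:

```python
import numpy as np
from itertools import product

n = 4
inputs = np.array(list(product([0, 1], repeat=n)))  # all 2^n bit strings

# One parity function chi_S(x) = (-1)^{sum_{i in S} x_i} per subset S of [n]
def parity(S, X):
    return (-1.0) ** X[:, list(S)].sum(axis=1)

subsets = [tuple(i for i in range(n) if (m >> i) & 1) for m in range(2 ** n)]
F = np.stack([parity(S, inputs) for S in subsets])  # (2^n functions) x (2^n inputs)

# Gram matrix under the uniform distribution: the identity, i.e., the
# concept class spans 2^n orthogonal directions in function space
G = (F @ F.T) / 2 ** n
```

Since kernel methods need a number of data points linear in this dimension, the sample complexity for the full parity class grows as 2^n.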
Data availability
The data that support the plots within this paper are available at https://github.com/sjerbi/QML-beyond-kernel^{44}. Source Data are provided with this paper.
Code availability
The code used to run the numerical simulations, implemented using TensorFlow Quantum^{45}, is available at https://github.com/sjerbi/QML-beyond-kernel^{44}.
References
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Cerezo, M. et al. Variational quantum algorithms. Nat. Rev. Phys. 3, 625 (2021).
Bharti, K. et al. Noisy intermediate-scale quantum (NISQ) algorithms. Rev. Mod. Phys. 94, 015004 (2022).
Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 1 (2014).
Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
Benedetti, M., Lloyd, E., Sack, S. & Fiorentini, M. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol. 4, 043001 (2019).
Carrasquilla, J. & Melko, R. G. Machine learning phases of matter. Nat. Phys. 13, 431 (2017).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583 (2021).
Biamonte, J. et al. Quantum machine learning. Nature 549, 195 (2017).
Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
Schuld, M., Bocharov, A., Svore, K. M. & Wiebe, N. Circuit-centric quantum classifiers. Phys. Rev. A 101, 032308 (2020).
Farhi, E. & Neven, H. Classification with quantum neural networks on near term processors. Preprint at https://arxiv.org/abs/1802.06002 (2018).
Liu, J.-G. & Wang, L. Differentiable learning of quantum circuit Born machines. Phys. Rev. A 98, 062324 (2018).
Zhu, D. et al. Training of quantum circuits on a hybrid quantum computer. Sci. Adv. 5, eaaw9918 (2019).
Skolik, A., Jerbi, S. & Dunjko, V. Quantum agents in the gym: a variational quantum algorithm for deep Q-learning. Quantum 6, 720 (2022).
Jerbi, S., Gyurik, C., Marshall, S., Briegel, H. & Dunjko, V. Parametrized quantum policies for reinforcement learning. Adv. Neural Inf. Process. Syst. 34, 28362–28375. https://proceedings.neurips.cc/paper/2021/hash/eec96a7f788e88184c0e713456026f3f-Abstract.html (2021).
Liu, Y., Arunachalam, S. & Temme, K. A rigorous and robust quantum speed-up in supervised machine learning. Nat. Phys. 17, 1–5. https://doi.org/10.1038/s41567-021-01287-z (2021).
Du, Y., Hsieh, M.-H., Liu, T. & Tao, D. Expressive power of parametrized quantum circuits. Phys. Rev. Res. 2, 033125 (2020).
Sweke, R., Seifert, J.-P., Hangleiter, D. & Eisert, J. On the quantum versus classical learnability of discrete distributions. Quantum 5, 417 (2021).
Huang, H.-Y. et al. Power of data in quantum machine learning. Nat. Commun. 12, 1 (2021).
Huang, H.-Y., Kueng, R. & Preskill, J. Information-theoretic bounds on quantum advantage in machine learning. Phys. Rev. Lett. 126, 190505 (2021).
Schölkopf, B. et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209 (2019).
Schuld, M. & Killoran, N. Quantum machine learning in feature Hilbert spaces. Phys. Rev. Lett. 122, 040504 (2019).
Schuld, M. Supervised quantum machine learning models are kernel methods. Preprint at https://arxiv.org/abs/2101.11020 (2021).
Lloyd, S., Schuld, M., Ijaz, A., Izaac, J. & Killoran, N. Quantum embeddings for machine learning. Preprint at https://arxiv.org/abs/2001.03622 (2020).
Kübler, J. M., Buchholz, S. & Schölkopf, B. The inductive bias of quantum kernels. Adv. Neural Inf. Process. Syst. 34, 12661–12673. https://proceedings.neurips.cc/paper/2021/hash/69adc1e107f7f7d035d7baf04342e1ca-Abstract.html (2021).
Peters, E. et al. Machine learning of high dimensional data on a noisy quantum processor. npj Quantum Inf. 7, 161 (2021).
Haug, T., Self, C. N. & Kim, M. Quantum machine learning of large datasets using randomized measurements. Mach. Learn.: Sci. Technol. 4, 015005 (2023).
Bartkiewicz, K. et al. Experimental kernel-based quantum machine learning in finite feature space. Sci. Rep. 10, 1 (2020).
Kusumoto, T., Mitarai, K., Fujii, K., Kitagawa, M. & Negoro, M. Experimental quantum kernel trick with nuclear spins in a solid. npj Quantum Inf. 7, 1 (2021).
Pérez-Salinas, A., Cervera-Lierta, A., Gil-Fuster, E. & Latorre, J. I. Data re-uploading for a universal quantum classifier. Quantum 4, 226 (2020).
Schuld, M., Sweke, R. & Meyer, J. J. Effect of data encoding on the expressive power of variational quantum machine-learning models. Phys. Rev. A 103, 032430 (2021).
Pérez-Salinas, A., López-Núñez, D., García-Sáez, A., Forn-Díaz, P. & Latorre, J. I. One qubit as a universal approximant. Phys. Rev. A 104, 012405 (2021).
Goto, T., Tran, Q. H. & Nakajima, K. Universal approximation property of quantum machine learning models in quantum-enhanced feature spaces. Phys. Rev. Lett. 127, 090506 (2021).
Gyurik, C. & Dunjko, V. Structural risk minimization for quantum linear classifiers. Quantum 7, 893 (2023).
Briegel, H. J., Browne, D. E., Dür, W., Raussendorf, R. & Van den Nest, M. Measurement-based quantum computation. Nat. Phys. 5, 19 (2009).
McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 1 (2018).
Daniely, A. & Malach, E. Learning parities with neural networks. Adv. Neural Inf. Process. Syst. 33, 20356–20365. https://proceedings.neurips.cc/paper/2020/hash/eaae5e04a259d09af85c108fe4d7dd0c-Abstract.html (2020).
Hsu, D. Dimension lower bounds for linear approaches to function approximation. Daniel Hsu’s homepage. https://www.cs.columbia.edu/~djhsu/papers/dimension-argument.pdf (2021).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
Cerezo, M., Sone, A., Volkoff, T., Cincio, L. & Coles, P. J. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat. Commun. 12, 1 (2021).
Thanasilp, S., Wang, S., Cerezo, M. & Holmes, Z. Exponential concentration and untrainability in quantum kernel methods. Preprint at https://arxiv.org/abs/2208.11060 (2022).
Jerbi, S. sjerbi/QML-beyond-kernel. Publication release. https://doi.org/10.5281/zenodo.7529787 (2023).
Broughton, M. et al. TensorFlow Quantum: a software framework for quantum machine learning. Preprint at https://arxiv.org/abs/2003.02989 (2020).
Acknowledgements
The authors would like to thank Isaac D. Smith, Casper Gyurik, Matthias C. Caro, Elies Gil-Fuster, Ryan Sweke, and Maria Schuld for helpful discussions and comments, as well as Hsin-Yuan Huang for clarifications on their numerical simulations^{20}. S.J., L.J.F., H.P.N. and H.J.B. acknowledge support from the Austrian Science Fund (FWF) through the projects DK-ALM: W1259-N27 and SFB BeyondC F7102. S.J. also acknowledges the Austrian Academy of Sciences as a recipient of the DOC Fellowship. H.J.B. also acknowledges support by the European Research Council (ERC) under Project No. 101055129. H.J.B. was also supported by the Volkswagen Foundation (Az: 97721). This work was in part supported by the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium program (project number 024.003.037). V.D. acknowledges the support of the project NEASQC funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement No 951821). V.D. also acknowledges support through an unrestricted gift from Google Quantum AI.
Author information
Contributions
The project was conceived by S.J., V.D., L.J.F., H.P.N., and H.J.B. The theoretical aspects of this work were developed by S.J., L.J.F., H.P.N., J.M.K., and V.D. The numerical experiments were conducted by S.J. All authors contributed to technical discussions and writing of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Jerbi, S., Fiderer, L. J., Poulsen Nautrup, H. et al. Quantum machine learning beyond kernel methods. Nat. Commun. 14, 517 (2023). https://doi.org/10.1038/s41467-023-36159-y