Ansatz-Independent Variational Quantum Classifiers and the Price of Ansatz

The paradigm of variational quantum classifiers (VQCs) encodes classical information as quantum states, followed by quantum processing and then measurements to generate classical predictions. VQCs are promising candidates for efficient utilization of noisy intermediate-scale quantum (NISQ) devices: classifiers involving M-dimensional datasets can be implemented with only ⌈log₂ M⌉ qubits by using amplitude encoding. A general framework for designing and training VQCs, however, is lacking. An encouraging specific embodiment of VQCs, quantum circuit learning (QCL), utilizes an ansatz: a circuit with a predetermined geometry and parametrized gates expressing a time-evolution unitary operator; training involves learning the gate parameters through a gradient-descent algorithm in which the gradients themselves can be efficiently estimated by the quantum circuit. The representational power of QCL, however, depends strongly on the choice of the ansatz, as it limits the range of unitary operators that the VQC can search over. Equally importantly, the landscape of the optimization problem may have challenging properties, such as barren plateaus, and the associated gradient-descent algorithm may not find good local minima. Thus, it is critically important to estimate (i) the price of ansatz, that is, the gap between the performance of QCL and the performance of ansatz-independent VQCs, and (ii) the price of using quantum circuits as classical classifiers, that is, the performance gap between VQCs and equivalent classical classifiers. This paper develops a computational framework to address both these open problems. First, it shows that VQCs, including QCL, fit inside the well-known kernel method.
Next it introduces a framework for efficiently designing ansatz-independent VQCs, which we call the unitary kernel method (UKM). The UKM framework enables one to estimate the first known computationally determined bounds on both the price of ansatz and the price of using quantum circuits as classifiers: numerical results with datasets of various dimensions, ranging from 4 to 256, show that the ansatz-induced gap can vary between 10 and 20%, while the VQC-induced gap (between the VQC and the kernel method) can vary between 10 and 16%. To further understand the role of ansatz in VQCs, we also propose a method for decomposing a given unitary operator into a quantum circuit, which we call the variational circuit realization (VCR): given any parameterized circuit block (as, for example, used in QCL), it finds the optimal parameters and the number of layers of the circuit block required to approximate any target unitary operator with a given precision.

An ansatz-independent upper bound on the performance of QCL is of great interest, since the performance of QCL itself depends heavily on both the ansatz and the optimization method.
In this paper, we first discuss the correspondence between a VQC and the well-known kernel method 8,9. Then we propose an ansatz-independent VQC, which we call the unitary kernel method (UKM). Using the UKM, we present ansatz-independent upper bounds on the performance of QCL for a wide range of classification tasks, i.e., the price paid by any chosen ansatz, as well as by the use of the gradient-descent algorithm for learning the parameters of that ansatz. Next, we construct QCL-type circuits that could implement the unitary operator computed by the UKM. Since the UKM computes an ansatz-independent unitary evolution operator (hence, one computable by a quantum circuit), it provides a tighter bound on QCL than that obtained by the classical kernel method. It also provides an estimate of the gap between a VQC and a classical kernel-method classifier.
To effectively use the unitary operator obtained by the UKM, we propose a unitary decomposition method to create a circuit geometry, which we call the variational circuit realization (VCR). By combining the UKM and the VCR, we can efficiently construct a circuit geometry that works well for classification problems.
In the rest of the paper we also use the term quantum advantage to capture both (i) any potential gain in performance over classical ML algorithms and (ii) any potential gain in hardware efficiency. For example, in the case of amplitude encoding, the number of qubits used is logarithmic in the dimension of the data. Thus, even if a classical ML algorithm performs better, VQCs can still have an advantage: when quantum devices become cheap and well developed, this could lead to practical methods for implementing high-dimensional classification problems. Figure 1 presents a schematic of a general VQC, introduces and compares QCL and the UKM, and explains the VCR.

Variational quantum classifier
We first introduce an analytical formalism for a VQC. Suppose that we are given an n-qubit system and a classical dataset D := {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^M is a feature vector and y_i ∈ {1, −1} is the corresponding label for i = 1, 2, . . . , N. In this paper, we consider amplitude encoding 7; thus we fix n = ⌈log₂ M⌉. One can embed x_i into a higher-dimensional vector φ(x_i) ∈ R^L with L = O(M^c) and then use the rest of the framework; the number of qubits n will still be O(log M), thus retaining any potential quantum advantage. Here, O(·) is the big-O notation and c is a certain constant. In this paper, however, we stick to amplitude encoding. Let us consider making a prediction on y_i by the following function:

f(x_i; Û, θ_b) := Σ_{j=1}^Q ξ_j ⟨Ô_j⟩_{x_i,Û} + θ_b,   (1)

where

⟨Ô_j⟩_{x_i,Û} := ⟨ψ_out(x_i; Û)|Ô_j|ψ_out(x_i; Û)⟩.   (2)

The input state is obtained by amplitude encoding,

|ψ_in(x_i)⟩ := Ŝ(x_i)|init⟩ = (1/‖x_i‖₂) Σ_{j=1}^{2^n} x_{i,j}|j⟩,   (3)

where n is the number of qubits and x_{i,j} is the j-th element of x_i. We denote, by Ŝ(x_i), the unitary operator that maps |init⟩ to |ψ_in(x_i)⟩. Although the coefficients {ξ_j}_{j=1}^Q can also be learned and optimized, the convention in Refs. 6,7 is to treat them as fixed parameters; θ_b is a bias term to be estimated.

Figure 1. Schematic of the algorithms discussed in this paper: (a) a general form of a hybrid quantum-classical classifier, which we refer to as a VQC, (b) QCL, (c) the UKM, and (d) the VCR. (a) In the architecture of a VQC, the initial state is |init⟩ := |0⟩^⊗n. We first encode a given classical vector x_i: |ψ_in(x_i)⟩ := Ŝ(x_i)|init⟩. One can embed x_i into a higher-dimensional vector φ(x_i) ∈ R^L with L = O(M^c) and then use the rest of the framework; the number of qubits n will still be O(log M), thus retaining any potential quantum advantage. Second, we apply Û: |ψ_out(x_i; Û)⟩ := Û|ψ_in(x_i)⟩. Third, we perform measurements with respect to {Ô_j}_{j=1,2,...,Q}. Finally, we make a prediction on the label of x_i by using the outputs of the measurements. (b) In QCL, we assume a circuit geometry parameterized by θ for Û: Û_c(θ). In most cases, a circuit used for QCL is composed of single- and two-qubit operators and has a layered structure; a typical example is shown. (c) In the UKM, we directly optimize Û. (d) In the VCR, we decompose a unitary operator into a quantum circuit by assuming a layered structure. For a circuit realization, a simpler circuit is preferable; so, we explicitly denote the number of layers L.
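As a concrete illustration of the encode-evolve-measure pipeline, the prediction of Eq. (1) with Q = 1 can be sketched in a few lines of NumPy. The observable, the input vector, and the identity "circuit" below are illustrative choices, not those used in the paper:

```python
import numpy as np

def amplitude_encode(x):
    """Normalize a classical M-vector into a 2^n-dimensional state vector
    (zero-padded when M is not a power of two), n = ceil(log2 M)."""
    n = int(np.ceil(np.log2(len(x))))
    psi = np.zeros(2 ** n, dtype=complex)
    psi[: len(x)] = x
    return psi / np.linalg.norm(psi)

def vqc_predict(x, U, O, theta_b=0.0, xi=1.0):
    """Eq.-(1)-style prediction with Q = 1: xi * <psi_out|O|psi_out> + theta_b."""
    psi_out = U @ amplitude_encode(x)
    return float(xi * np.real(psi_out.conj() @ O @ psi_out) + theta_b)

# Example: 4-dimensional data -> n = 2 qubits; observable sigma_z on the first qubit.
sz = np.diag([1.0, -1.0])
O = np.kron(sz, np.eye(2))
x = np.array([0.3, 0.1, 0.4, 0.2])
U = np.eye(4)  # identity "circuit", purely for illustration
print(vqc_predict(x, U, O))  # ~ -1/3 for this toy input
```

The predicted label is then sign(f); QCL and the UKM differ only in how the unitary `U` is obtained.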
In a VQC, we estimate Û and θ_b in Eq. (1) by imposing the unitarity constraint on Û:

(Û*, θ_b*) = argmin_{Û, θ_b} J(Û, θ_b)  subject to  Û†Û = 1̂,   (4)

where

J(Û, θ_b) := Σ_{i=1}^N ℓ(y_i, f(x_i; Û, θ_b)),   (5)

ℓ(·, ·) is a loss function, such as the mean-squared-error function or the hinge function 8,9, and 1̂ is the identity operator on the n-qubit Hilbert space. As explained later, in QCL we consider a parameterized unitary operator and optimize its parameters, while in the UKM we directly optimize the unitary operator.

Correspondence between a VQC and the kernel method
In the conventional kernel method 8,9,11, a function φ(·): R^P → R^G is used to map any input data point z_i ∈ R^P to φ(z_i) ∈ R^G, and then a linear function is used to make a prediction on y_i:

f(z_i; v) := Σ_{k=1}^G v_k φ_k(z_i),   (6)

where φ_k(z_i) is the k-th element of φ(z_i) and v := [v_1, v_2, . . . , v_G]^⊺ is a real vector. For example, in a commonly used degree-2 polynomial kernel function, the products of all pairs of the coordinates of z_i are used to generate a higher-dimensional embedding, along with a constant term. That is, G = P² + 1, φ_{k+P(l−1)}(z_i) = z_{i,k} · z_{i,l} for k, l = 1, 2, . . . , P, and finally φ_{P²+1} = 1. With this choice of kernel function, Eq. (6) can be written as

f(z_i; v) = Σ_{k,l=1}^P v_{k+P(l−1)} z_{i,k} z_{i,l} + v_{P²+1}.   (7)

Once an embedding has been defined, we minimize the following function to determine v:

J_cost(v) := Σ_{i=1}^N ℓ(y_i, f(z_i; v)) + (λ/2)‖v‖².   (8)

We show next how the VQC problem in Eq. (4) can be mapped to the above kernel form, i.e., any solution obtained by a VQC is a constrained solution of a corresponding kernel-based classifier. Thus, the performance of a suitably defined kernel method, without any constraints on {v_k}_k, will always provide an upper bound on the performance of a VQC, including classifiers based on QCL. In the case of VQCs, we have P = 2^n. Introducing ψ_in,l(x_i) := ⟨l|ψ_in(x_i)⟩, O_{j,(k,l)} := ⟨k|Ô_j|l⟩, and u_{k,l} := ⟨k|Û|l⟩ for k, l = 1, 2, . . . , 2^n, the expectation ⟨Ô_j⟩_{x_i,Û} introduced in Eq. (2) can be rewritten as

⟨Ô_j⟩_{x_i,Û} = Σ_{k,l=1}^{2^n} ψ*_in,k(x_i) (u_k^H Ô_j u_l) ψ_in,l(x_i),   (10)

where u_k := [u_{1,k}, u_{2,k}, . . . , u_{2^n,k}]^⊺ is the k-th column of Û ((·)^H is the Hermitian conjugate), and unitarity implies

u_k^H u_l = δ_{k,l}.   (11)

By using Eqs. (10) and (11), the VQC prediction function in Eq. (1) can be written as

f(x_i; Û, θ_b) = Σ_{j=1}^Q ξ_j Σ_{k,l=1}^{2^n} ψ*_in,k(x_i) (u_k^H Ô_j u_l) ψ_in,l(x_i) + θ_b.   (12)

Now, if we compare the VQC prediction function in (12) to the kernel-method prediction function in (6), we get a direct correspondence, where a VQC is reduced to a constrained version of the kernel method; thus, the kernel method provides an upper bound on the performance of VQCs.
Formally, the following choice of φ_m(·) and v_m in (6) is required [the kernel method is discussed in Sect. S-VI of the SM and the relationship between a VQC and the kernel method is discussed in Sect. S-VII of the SM in detail]: z_i = ψ_in(x_i) for each data point i, and, for k, l = 1, 2, . . . , 2^n,

φ_{k+P(l−1)}(z_i) = ψ*_in,k(x_i) ψ_in,l(x_i),  φ_{P²+1}(z_i) = 1,

v_{k+P(l−1)} = Σ_{j=1}^Q ξ_j u_k^H Ô_j u_l,  v_{P²+1} = θ_b.

Furthermore, we have P = 2^n and G = 2^{2n} + 1. Another advantage of establishing this relationship is that it helps us benchmark how well a VQC optimization method performs: since the kernel method is an upper bound, if a VQC attains performance very close to that of the kernel method, then it is performing at its highest capacity.
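To make the correspondence concrete, the following small numerical check (illustrative values, not from the paper) verifies that a VQC expectation value is exactly a linear function of the degree-2 products of the state amplitudes, i.e. a constrained instance of Eq. (6):

```python
import numpy as np

def poly2_features(z):
    """Degree-2 polynomial embedding from the text: all pairwise products
    z_k * z_l plus a constant 1, so G = P^2 + 1."""
    return np.concatenate([np.outer(z, z).ravel(), [1.0]])

def kernel_predict(z, v):
    """Linear prediction f(z) = sum_k v_k * phi_k(z), as in Eq. (6)."""
    return float(poly2_features(z) @ v)

# The VQC expectation <psi|O|psi> is linear in the P^2 products psi_k * psi_l,
# with coefficients fixed by O and U. Here U = identity, no bias, real psi.
psi = np.array([0.6, 0.8])
O = np.diag([1.0, -1.0])
v = np.concatenate([O.ravel(), [0.0]])  # v_{k+P(l-1)} = O_{k,l}, v_{P^2+1} = 0
print(kernel_predict(psi, v), float(psi @ O @ psi))  # the two values coincide
```

The kernel method is free to choose any real `v`, whereas the VQC must realize `v` through a unitary; this is precisely why the kernel method upper-bounds VQC performance.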

Quantum circuit learning
Here, we review QCL, proposed in Refs. 6,7, from the viewpoint of a VQC. In QCL, we assume a parameterized unitary operator Û_c(θ) [both Û_c(θ) and Û_c(θ; L) are used to denote a unitary operator realized by a quantum circuit; we use Û_c(θ; L) when we want to explicitly emphasize the number of layers L] as Û and optimize θ [refer to Sect. S-V B of the SM for the details of quantum circuits]. We then compute |ψ_out(x_i; θ)⟩ := Û_c(θ)|ψ_in(x_i)⟩ and make a prediction on x_i by

f(x_i; θ, θ_b) := Σ_{j=1}^Q ξ_j ⟨Ô_j⟩_{x_i,θ} + θ_b,   (17)

where ⟨Ô_j⟩_{x_i,θ} := ⟨ψ_out(x_i; θ)|Ô_j|ψ_out(x_i; θ)⟩. Similarly to Eq. (1), {ξ_j}_{j=1}^Q are fixed parameters and θ_b is a bias term to be estimated. The second step of QCL is to update θ and θ_b by

(θ*, θ_b*) = argmin_{θ, θ_b} J(θ, θ_b),   (18)

where

J(θ, θ_b) := Σ_{i=1}^N ℓ(y_i, f(x_i; θ, θ_b))   (19)

and ℓ(·, ·) is a loss function [for details, refer to Sect. S-V of the SM]. For this purpose, the Nelder-Mead method 12 and other, more sophisticated numerical methods 13,14 are often used.
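A minimal single-qubit sketch of the QCL training loop is given below. The parameter-shift gradient rule and the squared-error loss are standard ingredients; the one-parameter R_y ansatz, the learning rate, and the target label are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

def ry(theta):
    """Single-qubit R_y rotation as a toy one-parameter ansatz U_c(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expval(theta, psi_in, O):
    """<psi_out|O|psi_out> for the toy circuit."""
    psi = ry(theta) @ psi_in
    return float(psi @ O @ psi)

def parameter_shift_grad(theta, psi_in, O):
    """d<O>/dtheta via the parameter-shift rule (exact for rotation gates),
    i.e. the gradient is itself estimated from two circuit evaluations."""
    return 0.5 * (expval(theta + np.pi / 2, psi_in, O)
                  - expval(theta - np.pi / 2, psi_in, O))

# Gradient descent on the squared error l_SE(y, f) = |y - f|^2 / 2.
psi_in = np.array([1.0, 0.0])  # |0>
O = np.diag([1.0, -1.0])       # sigma_z
y, theta, eta = -1.0, 0.1, 0.5
for _ in range(200):
    f = expval(theta, psi_in, O)
    theta -= eta * (f - y) * parameter_shift_grad(theta, psi_in, O)
print(expval(theta, psi_in, O))  # approaches the target label -1
```

In real QCL the parameter vector θ is high-dimensional and the landscape can exhibit barren plateaus; this toy problem converges easily precisely because it has a single parameter.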
As mentioned above, QCL assumes a parameterized unitary operator Û_c(θ); thus, its performance depends heavily on the circuit geometry of Û_c(θ). An assumed circuit geometry is also called an ansatz, so QCL is an ansatz-dependent VQC. This fact strongly motivates us to devise an ansatz-independent VQC: the UKM. Furthermore, Ref. 15 pointed out the difficulty of learning the parameters of quantum circuits, which the authors call the barren plateau problem. Hence, a VQC that is free of the barren plateau problem is also of interest.

Unitary kernel method
We here describe the UKM, which is one of the main algorithms in this paper. In the UKM, we directly minimize Eq. (5). To this end, we employ the unitary version of the method of splitting orthogonality constraints (SOC) 16. Hereafter, we denote by X an operator obtained via the method of SOC. We introduce auxiliary variables P and D and iterate update equations for X, P, and D until convergence. Furthermore, we denote X, P, D, and θ_b at the k-th iteration by X_k, P_k, D_k, and θ_b,k, respectively. At the first step of the k-th iteration, we compute X_k and θ_b,k by

(X_k, θ_b,k) = argmin_{X, θ_b} [Σ_{i=1}^N ℓ(y_i, f(x_i; X, θ_b)) + (r/2)‖X − P_{k−1} + D_{k−1}‖²_F],   (20)

where ‖·‖_F is the Frobenius norm, r > 0 is a penalty coefficient, and

f(x_i; X, θ_b) := Σ_{j=1}^Q ξ_j ⟨ψ_in(x_i)|X†Ô_jX|ψ_in(x_i)⟩ + θ_b.   (21)

To solve Eq. (20), we optimize the real and imaginary parts of X_k independently [see Sect. S-IX B of the SM for details]. Next, we compute P_k by

P_k = K_{1,k} K†_{2,k},   (22)

where K_{1,k} and K†_{2,k} are unitary operators that satisfy the singular value decomposition K_{1,k} Σ̂_k K†_{2,k} = X_k + D_{k−1} and Σ̂_k is a diagonal operator. At the end of the k-th iteration, we compute

D_k = D_{k−1} + X_k − P_k.   (23)

We repeat the above updates, Eqs. (20), (22), and (23), until convergence. We call this method the UKM; it is summarized in Algorithm 1 [for the details of the UKM, refer to Sect. S-IX A of the SM].
It is clear from the formulation of the method of SOC that X does not strictly satisfy the unitarity constraint; instead, P and the OU of X do. Thus, using the optimal value of X obtained from the UKM leads to a classical classifier (it cannot be implemented by a quantum circuit), which will in general have higher performance than the unitary operators given by P and the OU of X that approximate X. We therefore compute the success rates for the training and test datasets using all three versions, X, P, and the OU of X [OU is explained in Sect. S-IX A of the SM], of which only P and the OU of X correspond to VQCs.
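The SOC iteration of Eqs. (20), (22), and (23) can be sketched as follows. For clarity, this toy version fits a known unitary target with a quadratic loss, which makes the X-update closed-form; the actual UKM instead minimizes the classification loss in Eq. (20) numerically, so the target matrix and loss here are illustrative assumptions:

```python
import numpy as np

def nearest_unitary(A):
    """Project A onto the unitary group via its SVD (the P-update, Eq. (22))."""
    K1, _, K2h = np.linalg.svd(A)
    return K1 @ K2h

def soc_unitary_fit(X0, target, r=0.01, iters=50):
    """Toy SOC loop: minimize ||X - target||_F^2 subject to X unitary."""
    X, P, D = X0.copy(), nearest_unitary(X0), np.zeros_like(X0)
    for _ in range(iters):
        # X-update: closed form of argmin ||X - target||^2 + (r/2)||X - P + D||^2
        X = (2 * target + r * (P - D)) / (2 + r)
        P = nearest_unitary(X + D)   # P-update, Eq. (22)
        D = D + X - P                # D-update, Eq. (23)
    return P

rng = np.random.default_rng(0)
T = nearest_unitary(rng.standard_normal((4, 4)))  # a unitary (orthogonal) target
P = soc_unitary_fit(rng.standard_normal((4, 4)), T)
print(np.linalg.norm(P.conj().T @ P - np.eye(4)))  # ~0: P is exactly unitary
print(np.linalg.norm(P - T))  # small: P recovered the target
```

Note that, as in the text, the intermediate X is not unitary; only the SVD-projected P (and the OU of X) satisfy the constraint exactly.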

Variational circuit realization
Several studies address decomposing unitary operators into quantum circuits [17-19], including Knill's decomposition and the quantum Shannon decomposition (QSD). In these methods, however, the number of CNOT gates scales quadratically in M.
Here we propose an alternative method: the assumed circuit comprises L layers of a parameterized subcircuit with parameterized gates and a fixed circuit geometry, similar to the ansatz used in QCL. We then solve for the minimum number of layers L such that the optimized circuit approximates the given unitary operator with a specified precision δ. We refer to this circuit methodology as the VCR. A schematic of the VCR is shown in Fig. 1d. Let Û and Û_c(θ; L) be a target unitary operator and a unitary operator realized by a quantum circuit that is parametrized by θ and has L layers, respectively. Typically, the target unitary operator is obtained by the UKM discussed above. Furthermore, we define the global-phase unitary operator Λ̂_{2^n}(λ) := e^{−iλ} 1̂_{2^n}. When Û and Û_{c+p}(θ, λ; L) := Λ̂_{2^n}(λ) Û_c(θ; L) are identical, we have

Û = Λ̂_{2^n}(λ) Û_c(θ; L).   (24)

Then, we can estimate θ and λ, for any p > 0, by

(θ*, λ*) = argmin_{θ, λ} J_VCR(θ, λ; L),   (25)

where

J_VCR(θ, λ; L) := ‖Û − Û_{c+p}(θ, λ; L)‖_p^p.   (26)

In a circuit realization, the complexity of a circuit is of great interest. In this paper, we assume a layered structure for a quantum circuit. Thus, given an error threshold δ, it is convenient to define

L_δ := min{L | J_VCR(θ*, λ*; L) < δ},   (27)

that is, the smallest number of layers for which the optimized circuit approximates the target within δ.
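As a minimal sketch of the VCR optimization in Eqs. (25)-(26), the following fits a layered single-qubit circuit plus a global phase to a target unitary with BFGS. The Z-Y-Z layer, the random-restart strategy, and the Hadamard target are illustrative assumptions; the paper uses the CNOT-based circuit on more qubits:

```python
import numpy as np
from scipy.optimize import minimize

def layer(p):
    """One layer: R_z(a) R_y(b) R_z(c), a generic single-qubit rotation block
    standing in for the paper's multi-qubit layers."""
    a, b, c = p
    rz = lambda t: np.diag([np.exp(-0.5j * t), np.exp(0.5j * t)])
    ry = lambda t: np.array([[np.cos(t / 2), -np.sin(t / 2)],
                             [np.sin(t / 2),  np.cos(t / 2)]])
    return rz(a) @ ry(b) @ rz(c)

def circuit(theta, L):
    """U_c(theta; L): product of L layers."""
    U = np.eye(2, dtype=complex)
    for l in range(L):
        U = layer(theta[3 * l: 3 * l + 3]) @ U
    return U

def vcr_cost(params, target, L):
    """|| U - e^{-i*lam} U_c(theta; L) ||_F^2, i.e. Eq. (26) with p = 2."""
    lam, theta = params[0], params[1:]
    return np.linalg.norm(target - np.exp(-1j * lam) * circuit(theta, L)) ** 2

def vcr_fit(target, L, restarts=5, seed=0):
    """BFGS with random restarts; returns the best cost found for this L."""
    rng = np.random.default_rng(seed)
    runs = [minimize(vcr_cost, rng.uniform(-np.pi, np.pi, 1 + 3 * L),
                     args=(target, L), method="BFGS") for _ in range(restarts)]
    return min(r.fun for r in runs)

# Target: the Hadamard gate; one Z-Y-Z layer plus a global phase suffices,
# so L_delta = 1 for any delta above numerical precision.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
print(vcr_fit(H, L=1))
```

Computing L_δ then amounts to increasing L until `vcr_fit` drops below δ, which mirrors the layer sweep reported for the CNOT-based circuit later in the paper.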

Numerical simulation
We first show the numerical results of QCL and the UKM for the cancer dataset (0 or 1) [the cancer dataset in the UCI repository 20 has two labels: (0) 'B' and (1) 'M'. In the cancer dataset (0 or 1), we consider the classification problem between the 0 label and the 1 label. Furthermore, we relabel 0 with −1 to align the labels with the eigenvalues of σ_z. For the numerical results for other datasets, refer to Sect. S-XI of the SM]. The results for multiple datasets with different dimensions M are presented in Table 1. The UKM can be programmed to yield both real and complex unitary matrices; hence, we consider the performance for both cases separately (see the appendix and the SM). For QCL, we consider four types of quantum circuits: the CNOT-based circuit, the CRot-based circuit, the 1-dimensional (1d) Heisenberg circuit, and the fully connected (FC) Heisenberg circuit [the definitions of these circuits are given in Sect. S-V B of the SM], and run 300 iterations. To accelerate QCL, we utilize the stochastic gradient descent method 9. In both cases, we use the squared-error function ℓ_SE(a, b) := (1/2)|a − b|² for ℓ(·, ·) in Eqs. (5) and (19), and set Q = 1 and ξ_1 = 1 in Eqs. (1) and (17). Furthermore, we consider two cases: with and without the bias term in Eqs. (1) and (17). Note that we use the optimize function provided in the SciPy package 21 for the implementation of the UKM and the Pennylane package 22 for QCL. Because of the nature of SOC, the performance of the solutions often oscillates; thus, we run the UKM for a certain number of iterations and choose the solution with the best performance. We summarize the results of 5-fold cross-validation (CV) with 5 different random seeds of QCL and the UKM in Tables 2 and 3, respectively.
For each method, we select the best model for the training dataset over iterations to compute the performance.
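The evaluation protocol (5-fold CV repeated over 5 random seeds) can be sketched generically as below; `fit` and `score` are placeholders for any of the classifiers compared in the tables, and the majority-vote example is purely illustrative:

```python
import numpy as np

def cv_success_rates(fit, score, X, y, n_folds=5, seeds=(0, 1, 2, 3, 4)):
    """n-fold CV repeated over several shuffling seeds, as in Tables 2-4.
    Returns the mean and standard deviation of the per-fold success rates."""
    rates = []
    for seed in seeds:
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = fit(X[train], y[train])
            rates.append(score(model, X[test], y[test]))
    return float(np.mean(rates)), float(np.std(rates))

# Example with a trivial majority-vote "classifier".
X = np.zeros((100, 1))
y = np.array([1] * 70 + [-1] * 30)
fit = lambda Xtr, ytr: 1 if (ytr == 1).sum() >= (ytr == -1).sum() else -1
score = lambda m, Xte, yte: float(np.mean(yte == m))
print(cv_success_rates(fit, score, X, y)[0])  # 0.7: the majority-class rate
```

In the paper, `fit` is QCL, the UKM, or Ridge classification, and `score` is the success rate on the held-out fold.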
In Fig. 2, we plot the data shown in Tables 2 and 3.

Table 1. Results of 5-fold CV with 5 different random seeds of the UKM (X, P, and OU of X), QCL, and the kernel method for all the datasets. The numbers of data points N and dimensions M of the datasets are shown, as is the number of qubits n required for amplitude encoding; note that n = ⌈log₂ M⌉. The performance cells are of the format "training performance/test performance". We choose the model that shows the best test performance for each algorithm. For the UKM, we consider the complex and real cases with and without the bias term and set r = 0.010. For QCL, we consider the CNOT-based, CRot-based, 1d-Heisenberg, and FC-Heisenberg circuits with and without the bias term for the iris, cancer, sonar, and wine datasets, and the CNOT-based and CRot-based circuits with and without the bias term for the semeion and MNIST256 datasets; the number of layers L is set to 5. For φ(·) in the kernel method, we consider linear and quadratic functions with and without the bias term for λ = 10⁻², 10⁻¹, 1. The values of the best VQC for each dataset are printed in bold.

Table 2. Results of 5-fold CV with 5 different random seeds for the cancer dataset (0 or 1). The number of layers L is 5 and the number of iterations is 300. We consider four types of circuits with and without the bias term: the CNOT-based circuit, the CRot-based circuit, the 1d Heisenberg circuit, and the FC Heisenberg circuit.

As shown in Fig. 3, increasing the number of layers L does not lead to better performance, and can in fact decrease the performance of QCL. As shown in Fig. 2, the performance of the UKM is better than that of QCL in several numerical setups. Given our analytical result that the kernel method is a superset of VQCs, we next present the performance of the kernel method [in particular, we use Ridge classification as the kernel method.
In Ridge classification, we use the squared-error function ℓ_SE(·, ·) for ℓ(·, ·) in J_cost(v), together with a regularization term on ‖v‖. For the details of the kernel method, see Sect. S-VI of the SM; Ridge classification is described in Sect. S-VI B of the SM, and Refs. 8,9 are also helpful.] We set λ = 10⁻¹, which is the coefficient of the regularization term, and consider linear and quadratic functions for φ(·) with and without normalization. The norm of the vector of each data point is not unity; normalization means that we normalize the vector of each data point before performing classification. The purpose is to see the effect of the normalization incorporated into amplitude encoding, even though the original classification setup does not include this step. Note that we use the scikit-learn package 23 for the kernel method. We summarize the results of 5-fold CV with 5 different random seeds of the kernel method in Table 4.
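Ridge regression with the squared-error loss has a simple closed form, which the following toy sketch combines with the degree-2 embedding from the kernel-method section (the synthetic data and labels are illustrative, not the paper's datasets):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution v = (Phi^T Phi + lam I)^{-1} Phi^T y,
    minimizing sum_i (y_i - v . phi(z_i))^2 + lam ||v||^2."""
    G = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(G), Phi.T @ y)

# Toy binary problem whose decision rule is quadratic in the coordinates.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 2))
y = np.sign(Z[:, 0] * Z[:, 1])  # labels in {1, -1}
Phi = np.array([np.concatenate([np.outer(z, z).ravel(), [1.0]]) for z in Z])
v = ridge_fit(Phi, y, lam=1e-1)
acc = np.mean(np.sign(Phi @ v) == y)
print(acc)  # high: the quadratic embedding separates this data
```

Because `v` is unconstrained here, this classifier plays the role of the unconstrained upper bound against which the unitarity-constrained UKM and QCL are compared.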
For some λ, the performance of the kernel method is better than that of QCL and the UKM, as expected. Next, we explore the dependence of the performance of QCL on the number of layers L; the result is shown in Fig. 3. One would naturally expect that increasing the number of layers L leads to better performance. In general, a circuit with L + 1 layers can clearly do at least as well as a circuit with L layers: pick the same parameters for the first L layers, and choose parameters that make the last layer an identity operator. But Fig. 3 shows that this is not the case: rather, the test performance gets worse when we increase the number of layers L. This variability is potentially related to the structure of the cost-function landscape: as the number of parameters is increased by adding an extra layer, there are potentially more local minima, or the landscape develops what has been referred to as a "barren plateau" in Ref. 15.

Table 3. Results of 5-fold cross-validation (CV) with 5 different random seeds for the cancer dataset (0 or 1). We show the performance obtained by X, P, and OU of X. We consider real and complex matrices for the initial input, with and without the bias term. We set r = 0.010 and K = 30, and repeat the CG iteration for Eq. (20) 10 times in each step of the method of SOC. Due to the inherent formulation of the method of SOC, X does not strictly satisfy the unitarity condition; P and OU of X strictly satisfy the unitarity condition, yielding VQCs. The overall higher performance of X can be attributed to it being a classical classifier, a special case of the kernel method. Note, however, that the classifiers created by the UKM without the bias term yield better performance than the best classifiers created by QCL, as shown in Table 2.
We also see the performance dependence of the UKM on r, which is the coefficient of the second term in the right-hand side of Eq. (20). The result is shown in Fig. 4.
For small r, X in the UKM deviates from a unitary matrix and the performance gets better; for large r, X becomes closer to a unitary matrix but the performance gets worse. Thus, we should choose an appropriate value of r.

Table 4. Results of 5-fold CV with 5 different random seeds of the kernel method for the cancer dataset (0 or 1). We set λ = 10⁻¹. For φ(·), we use linear and quadratic functions with and without normalization.

In Fig. 5, we show the dependence of the performance of the kernel method on λ, the coefficient of the regularization term.
As with r in the UKM, we also need to choose an appropriate λ to realize good performance.
In Table 1, we summarize the performance of QCL, the UKM, and the kernel method for all the datasets investigated in this study. We choose the model that shows the best test performance for each algorithm. For the UKM, we consider the complex and real cases with and without the bias term and set r = 0.010. For QCL, we consider the CNOT-based, CRot-based, 1d-Heisenberg, and FC-Heisenberg circuits with and without the bias term for the iris, cancer, sonar, and wine datasets, and the CNOT-based and CRot-based circuits with and without the bias term for the semeion and MNIST256 datasets; the number of layers L is set to 5. For φ(·) in the kernel method, we consider linear and quadratic functions with and without the bias term for λ = 10⁻², 10⁻¹, 1. The numerical results support the claim that the UKM lies between the kernel method and QCL. We also show detailed numerical results for all the datasets in the SM [in Sect. S-XI of the SM, the numerical results for other datasets are shown]; the results in the SM are consistent with this paper. Finally, we note the difference between the squared-error and hinge functions. In this paper, we have used the squared-error function; we show the results for the hinge function in the SM. The results are qualitatively the same as those obtained with the squared-error function, and the statements about the relative performance of QCL and the UKM do not change.
We then show numerical simulations of the VCR. Let Û_c(θ; L) be the unitary operator realized by a quantum circuit that is parametrized by θ and has L layers. For Û_c(θ; L), we use the CNOT-based circuit, and we use the BFGS method 14 to solve Eq. (25). Note that we use the optimize function provided in the SciPy package 21 for the implementation of the VCR. Here, we consider the cancer dataset (0 or 1) and minimize Eq. (26) with p = 2. As the target unitary operator, we use the unitary operator that gives a success rate of 0.9194 for the training dataset and 0.9131 for the test dataset. In Fig. 6, we show the values of the cost function in the right-hand side of Eq. (26) for different numbers of layers L. In Table 5, we summarize the performance of the input unitary operator, QCL, and the circuit geometries computed by the VCR. Figure 6 and Table 5 show that Û_c(θ; L) gives fairly high performance. Furthermore, we have L_0.001 = 80, where the definition of L_δ is given in Eq. (27); that is, 80 layers suffice to approximate the given unitary operator in the case of the CNOT-based circuit.
Note that the optimization problems arising in VQCs involve well-known cost functions used in machine learning, with the added constraint of unitarity. Thus, in general, the UKM optimization problem is nonconvex and there is no rigorous proof that the UKM will achieve the best possible solution; the same is the case for the ansatz-dependent QCL. We can, however, make the following observations: (i) clearly, the optimal performance of the UKM is an upper bound on the optimal performance of QCL, since, by optimizing directly over unitary operators, the UKM searches over all possible ansätze; (ii) given any QCL solution, the UKM can almost always show better performance (in the worst case, the same performance) by initializing it with the QCL solution; (iii) even with random initializations, as shown numerically in this paper, the expected performance of the UKM can be better than the expected performance of QCL. Thus, the UKM framework enables one to estimate the first known computationally determined bounds on the price of ansatz.
We also emphasize that, although the UKM is initialized randomly, the numerical simulations show that it works stably.

Table 5. Performance of the VCR for the cancer dataset (0 or 1). We show the success rates for the training and test datasets and the value of the cost function for the VCR. The input to the VCR is P created by the UKM under the condition of real matrices without the bias term, with r = 0.010. For reference, the last three rows show the results of 5-fold CV. The table shows that, at around 50 layers, combining the UKM with the VCR yields better performance than QCL.

Discussions
As shown in this paper, the performance of QCL is bounded from above by that of the UKM, which in turn is bounded from above by kernel-method-based classical classifiers. One of the primary contributing factors is the difference in the degrees of freedom of QCL and the UKM. In the UKM, we have O(M²) parameters to estimate; on the other hand, the number of parameters in QCL is O(L ln M). This difference implies that a circuit ansatz introduces a strong bias in QCL and may restrict its performance considerably. Thus, with the UKM, we can explore the ultimate power of QCL; at least for the case of a small number of qubits n, the numerical results in this paper show that this ultimate power is limited (see Table 1): the performance of the UKM can be up to 10-20% higher than that of QCL. As noted earlier, we can also explore the potential limitations of QCL from the viewpoint of optimization. Figure 3 illustrates the difficulty of optimizing the parameters in QCL. The success rates in Fig. 3 should be smoother and monotonically increasing: clearly, a circuit with L layers should perform at least as well as a circuit with L − 1 layers, but it seems that QCL can easily get stuck in local minima. This phenomenon may stem from the barren plateau problem 15. On the other hand, the performance of the UKM is very high and close to that of the kernel method in Fig. 4; thus, we can say that the UKM does not suffer from a similar optimization problem. This also implies that finding a proper ansatz with which the QCL paradigm attains the same performance as the UKM is a computationally challenging problem: even if an ansatz has the representational capability to yield optimal results, the QCL optimization algorithm might not find the optimal gate parameters. We now turn our attention to the numerical results of the VCR. Recall that M and L are the dimension of the data points and the number of layers in an ansatz adopted in QCL, respectively.
Note also that we use amplitude encoding in this paper. Circuits in QCL then have ⌈log₂ M⌉ qubits and O(L ln M) gates. The number of parameters to estimate is of the same order, since we use the three-dimensional rotation gate as the parametrized gate. The UKM uses the same number of qubits, ⌈log₂ M⌉, so it retains the qubit efficiency, but it optimizes over O(M²) parameters. Moreover, circuits obtained by combining the UKM and the VCR are still of complexity O(L ln M), except that now L is not a constant, as it is in QCL. For the datasets used in this paper, the VCR yields much more compact circuits than traditional methods for obtaining circuits for unitary operators, such as the QSD, where the number of gates is O(M²). Thus, the VCR also yields better performance than these traditional methods.
We also show, using the VCR, that we can realize the unitary operator obtained by the UKM with the same ansatz used in QCL. Furthermore, the combination of the UKM and the VCR leads, in some cases, to better performance and a circuit with fewer gates or layers than QCL; see also the section on the numerical simulation of the VCR in the SM [see Sect. S-XII of the SM, where we show the numerical results of the VCR on two additional datasets; the results are consistent]. In other cases, we obtain larger circuits (i.e., larger L) but with better performance. If a dataset has very high dimension, i.e., M is very large, the computational time and circuit size might be very large, O(M²); but we still have the ⌈log₂ M⌉ advantage in the number of qubits n. QCL, however, also has two major potential problems when M is very large. First, the dataset size has to be very large due to the curse of dimensionality as M increases, so the training time and convergence complexity will be a problem no matter what the parameter size is. Second, there is no guarantee that a kernel function with O(L ln M) parameters will do well, especially for small L: the performance for small L and large M could be poor, and there is no theoretical proof that, for large M, QCL will do well with small L. Both QCL and the UKM use the same number of qubits, ⌈log₂ M⌉; so, in terms of intermediate-scale quantum computers, both have the same advantage. Moreover, the computation of the VCR is O(M²), so it is feasible for any reasonable dimension M. In particular, we believe the UKM can be used to derive VQC implementations on NISQ devices comprising up to 20 qubits (i.e., M ≈ 10⁶, or million-dimensional datasets) using enough classical computing resources. Thus, in addition to its application in deriving bounds and understanding the role of ansatz in quantum algorithms, the UKM can even complement QCL in the short term and help design optimal VQCs for NISQ devices.
www.nature.com/scientificreports/

In this paper, we focused on amplitude encoding. Recently, the relationship between QCL and the kernel method was discussed from the viewpoint of encoding in Ref. 11; more specifically, the basis encoding, the angle encoding, coherent-state encoding, and other encodings were investigated in addition to amplitude encoding. We also note that amplitude encoding provides a logarithmic compression: the number of qubits needed is logarithmic in the dimension of the data, whereas most other encoding schemes require a number of qubits proportional to the dimension (with proportionality constant 1/2, for example). Given that VQCs are a sub-class of classical kernel methods, that is, there is no performance gain over classical ML algorithms, the only potential quantum advantage lies in the logarithmic compression in qubits. Hence, from a practical perspective, amplitude encoding is the most interesting case to study. From a research perspective, however, it would be interesting to investigate the performance of VQCs for these other encodings via the UKM, and to compare the relative performance of QCL and the UKM for these encodings as well.
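The logarithmic compression of amplitude encoding can be sketched as follows. This shows only the classical preprocessing (pad to a power-of-two dimension and normalize); preparing the corresponding state on hardware is a separate, nontrivial step, and the function name is our own illustrative choice.

```python
import numpy as np
from math import ceil, log2

def amplitude_encode(x):
    """Map an M-dimensional real vector to the amplitudes of an
    n = ceil(log2 M)-qubit state: zero-pad to dimension 2^n and
    normalize so the amplitudes form a valid quantum state."""
    x = np.asarray(x, dtype=float)
    n = ceil(log2(len(x)))
    padded = np.zeros(2 ** n)
    padded[: len(x)] = x
    return padded / np.linalg.norm(padded), n

amps, n = amplitude_encode([3.0, 4.0, 0.0])  # M = 3 -> n = 2 qubits
# amps = [0.6, 0.8, 0.0, 0.0], unit norm
```

By contrast, angle- or basis-style encodings typically consume a number of qubits proportional to M, which is why amplitude encoding is singled out above.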
We next mention some recent literature related to QML. Refs. 24,25 study VQAs that change the form of the circuit adaptively. Such an incremental search over the ansatz space may lead to better performance than vanilla VQAs. However, their performance should still be limited compared to the UKM, because the UKM directly solves an optimization problem with a weaker constraint: by optimizing over unitary operators, it efficiently searches over all ansätze. We therefore expect the UKM to provide better bounds, although the two approaches are not compared in this study. In Ref. 26, the authors argue that classical machine learning (ML) with data rivals quantum ML. This point is very important because it implies the difficulty of demonstrating a quantum advantage of quantum ML. In our manuscript, we showed that the expressive power of QCL is no greater than that of the vanilla kernel method, even when we remove the assumption of a specific quantum-circuit form; i.e., VQC performance is bounded above by a kernel method. We believe that our manuscript and Ref. 26 use different approaches to reach the same conclusion: QCL may not be as promising as researchers had originally expected. Finally, we mention the possible applicability of the UKM to other problems. The QAOA and the VQE deal with optimization problems and, as in QCL, assume certain underlying circuit geometries. By using the UKM, we expect to be able to clarify the power of the QAOA and the VQE in an ansatz-independent manner. Furthermore, VQAs have been proposed for a number of problems: the general stochastic simulation of mixed states 27, time-evolution simulation with a non-Hermitian Hamiltonian, linear-algebra problems, and open quantum system dynamics 28, stochastic differential equations 29, quantum Fisher information 30, the simulation of nonequilibrium steady states 31, and molecular simulation 24.
We believe that the UKM is also applicable to this class of problems and may clarify the hidden power of VQAs.

Concluding remarks
In this paper, we first discussed the mathematical relationship between VQCs, which are a superset of QCL, and the kernel method. This relationship implies that VQCs, including QCL, are a subset of the classical kernel method and cannot outperform it.
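The subset relation can be illustrated with a minimal sketch, under the simplifying assumption of amplitude encoding of real, power-of-two-dimensional vectors. A VQC's prediction is a function of overlaps between encoded states, so a classical kernel machine equipped with the fidelity kernel below can represent the same decision functions; the function names here are our own.

```python
import numpy as np

def state(x):
    """Amplitude-encoded pure state for a real feature vector
    (assumes len(x) is a power of two and x is nonzero)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def fidelity_kernel(x1, x2):
    """k(x, x') = |<psi(x)|psi(x')>|^2, the squared overlap between
    the two encoded states; for real states, a squared dot product."""
    return float(np.dot(state(x1), state(x2)) ** 2)

# The Gram matrix over a dataset feeds a classical kernel method
# (e.g., an SVM), whose performance upper-bounds the VQC's.
X = [[1.0, 0.0], [1.0, 1.0]]
K = [[fidelity_kernel(a, b) for b in X] for a in X]
# K[0][1] = |cos(45 deg)|^2 = 0.5
```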
We then proposed the UKM for classification problems. Mathematically, the UKM lies between the kernel method and QCL, and it is therefore expected to provide an upper bound on the performance of QCL. Through extensive numerical simulations, we showed that the UKM outperforms QCL, as expected. We also proposed the VCR to find a circuit geometry that realizes a given unitary operator. By combining the UKM and the VCR, we showed that we can find a circuit geometry with high classification performance.
In future work, we plan to explore the performance of VQCs for other methods of encoding the underlying classical data. For example, one straightforward extension would be to embed the feature vector x_i ∈ R^M into a higher-dimensional vector φ(x_i) ∈ R^L with L = O(M^c) and then use the rest of the framework; the number of qubits n would still be O(log M), thus retaining any potential quantum advantage. Such extensions can increase the power of both VQCs and QCL.
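A minimal sketch of this proposed extension, assuming a monomial feature map of our own choosing as the embedding φ: the L = O(M^c) features are amplitude-encoded, so the qubit count ⌈log₂ L⌉ remains O(log M).

```python
import numpy as np
from math import ceil, log2
from itertools import combinations_with_replacement

def poly_feature_map(x, degree=2):
    """Embed x in R^M into phi(x) in R^L via all monomials of degree
    1..degree, giving L = O(M^degree) features; amplitude-encoding
    phi(x) then needs ceil(log2 L) = O(log M) qubits."""
    x = np.asarray(x, dtype=float)
    feats = [np.prod(combo)
             for d in range(1, degree + 1)
             for combo in combinations_with_replacement(x, d)]
    phi = np.array(feats)
    n = ceil(log2(len(phi)))
    return phi, n

phi, n = poly_feature_map([1.0, 2.0], degree=2)
# monomials x1, x2, x1^2, x1*x2, x2^2 -> [1, 2, 1, 2, 4]; n = 3 qubits
```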

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.