Introduction

Advances in technologies of quantum hardware lead to intensive research on finding practical applications on noisy intermediate-scale quantum (NISQ) devices1. Variational quantum algorithms (VQAs)2,3,4 are a class of promising candidates of quantum algorithms that are implementable on the NISQ devices. The VQAs can be used for a variety of computational tasks including quantum chemistry calculations5,6,7,8, combinatorial optimization9,10,11, and training of machine-learning models12,13,14,15. These tasks are achieved by minimizing task-specific cost functions usually defined as a sum of expectation values of observables. The optimization of minimizing the cost function is performed through updating parameters of a parameterized quantum circuit using a classical optimizer in a feedback loop. In particular, VQAs employ a quantum device to prepare quantum states that the parameterized quantum-circuit outputs. We perform a shot of quantum measurement on each output state to extract classical information, which is useful for estimating the expectation values of the cost function. The measurement outcomes are fed to the classical optimizer, with which we improve the circuit parameters so as to minimize the cost function iteratively.

But problematically, if we try to estimate the expectation values with high precision in the VQAs, we usually need an excessive number of measurement shots until minimizing the cost function16,17. In practice, a user of a quantum computer often needs to access a distant server of a quantum computer to query measurement shots, while the classical optimizer can be performed locally by the user at a negligible cost compared to the cost of using the quantum computer in terms of time and money18; in this setting, the number of measurement shots crucially dominates the cost of VQAs, which we aim to minimize here. In previous research, problems of reducing computational resources in VQAs have often been tackled by estimating an expectation-value efficiently19,20,21,22 and reducing the number of iterations until convergence23,24,25,26,27,28,29,30. By contrast, to overcome a dominant obstacle in the above setting of VQAs, we here study the problem of reducing the overall cost of measurement shots in the optimization, that is, how we can optimize the circuit parameters at as little cost of the total number of measurement shots as possible. A difficulty of this problem stems from the nature of quantum mechanics: it is costly to extract expectation values as classical information from quantum states, yet the optimization would be hard without the assistance of classical information obtained from measurements on the quantum states. We stress that the problem here is not the estimation of the expectation values themselves; rather, a fundamental question that we ask is how efficiently we can use classical information of the measurement outcomes to optimize the circuit parameters without extracting the expectation values with high precision.

In this work, we address this problem by establishing a framework for the classical optimizer that combines two different optimization approaches, namely, stochastic gradient descent (SGD) and Bayesian optimization (BO). SGD is a standard algorithm in machine learning for training models, using an estimator of gradient at each optimization step rather than the exact value of the gradient31,32. Among a variety of existing optimizers proposed for VQAs23,24,25,26,27,28,29,30,33,34,35,36, gradient-based optimizers have been studied intensively, motivated by the fact that the use of gradient information improves convergence37. Recently, SGD for VQAs has been investigated as a class of gradient-based optimizers33. The SGD for VQAs often uses a fixed small number of measurement shots to estimate the gradient, which may successfully avoid measuring expectation values with high precision. However, SGD has major shortcomings that may make the algorithm inefficient. First, in exchange for the low cost of each iteration, SGD may need a larger number of iterations until convergence than optimization algorithms using the exact gradient; second, SGD requires careful control of the step size of updating the parameters in each iteration, which may crucially affect the efficiency of the algorithm, but an appropriate choice of the step size is often difficult. On the other hand, BO is another common algorithm for optimization of a black-box function without necessarily using its gradient, which is especially suitable for optimizing imprecise and expensive-to-evaluate functions38,39. The BO has many successful applications such as computer vision, robotics, and experimental designs40,41,42,43. Owing to its robustness against noise in the imprecise evaluation of the functions38,39, BO may also be useful for the optimization in VQAs44,45. However, it is known that BO becomes intractable in high-dimensional settings (typically beyond about 10 dimensions)46, and the number of parameters to be optimized in VQAs is usually too large to apply the BO directly.

To retain advantages of SGD and BO in VQAs while compensating for their shortcomings, we here construct the alternative framework for the optimization of parameterized circuits, stochastic gradient line Bayesian optimization (SGLBO), as illustrated in Fig. 1. The key idea of SGLBO is that we estimate an appropriate direction of updating the circuit parameters based on SGD, and also utilize BO to estimate the optimal step size in a 1D direction of the estimated gradient in each iteration. This idea aims at simultaneously resolving the problems of the step size in the SGD and of the infeasibility of high-dimensional optimization with the BO. To enhance the performance further, we combine the SGLBO with two noise-reducing techniques: adaptive shot strategy and suffix averaging. The adaptive shot strategy is a technique for dynamically determining the number of measurement shots to be used for the estimation of the gradient34,47,48,49,50,51,52,53. We here develop an adaptive shot strategy suitable for SGLBO, based on a technique of the norm test48,49,51. The norm test combined with SGD is known to provide faster convergence49,51, and in the case of SGLBO, the norm test reduces not only the number of iterations but also the overall number of measurement shots. On the other hand, suffix averaging is a technique for achieving noise reduction. Instead of directly using the point of the final iteration in the optimization as an estimate of the minimizer of the cost function, the suffix averaging technique uses the average over a latter part of the sequence of points obtained from the iterations54,55,56. We utilize this technique to reduce the statistical noise in estimating the gradient and the optimal step size in SGLBO, and also reduce the effect of the hardware noise of the quantum device.
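The suffix averaging described above can be sketched in a few lines of Python. This is only an illustration, assuming the common definition in which the output is the mean of the last ⌈αT⌉ points of the sequence; the function name `suffix_average` and the list-of-lists representation of the points are hypothetical, and the precise indexing used by SGLBO is fixed by its own definition of suffix averaging and may differ.

```python
import math


def suffix_average(points, alpha):
    """Average of the last ceil(alpha * T) points of a sequence of T parameter vectors.

    Assumption: the standard alpha-suffix average; the indexing convention is illustrative.
    """
    if not points or not 0.0 < alpha <= 1.0:
        raise ValueError("need a non-empty sequence and 0 < alpha <= 1")
    count = math.ceil(alpha * len(points))
    suffix = points[-count:]  # the latter part of the sequence
    dim = len(suffix[0])
    return [sum(p[d] for p in suffix) / count for d in range(dim)]


if __name__ == "__main__":
    visited = [[0.0, 4.0], [1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
    print(suffix_average(visited, 0.5))  # mean of the last two points
```

Averaging only the latter part discards the early iterates, which are still far from the minimizer, while the noise in the later iterates partially cancels.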

Fig. 1: An illustration of two iterations in SGLBO for minimizing a 2D cost function.

a The figure represents the updating procedure of SGLBO on the landscape of the cost function. In particular, in the first iteration, at an initial point 1, we estimate a direction of the gradient of the cost function based on SGD and perform BO on the 1D subspace in this direction to estimate the optimal step size. b Then, we reach point 2 from the point 1 by moving in the estimated direction by the estimated optimal step size. c Next, at point 2, we perform the same procedure of estimating the gradient based on the SGD and estimating the optimal step size by the BO on the line of the 1D subspace, to move from point 2 to point 3. We iterate these procedures until SGLBO converges or consumes a preset number of measurement shots. After all these iterations, SGLBO returns a suffix average over the points visited in the iterations as an output.

To show the significance of the SGLBO, we numerically demonstrate that the SGLBO can find an estimate of the minimizer of the cost function with a significantly smaller number of overall measurement shots than other state-of-the-art optimizers23,34,57, in representative tasks for the VQAs, i.e., variational quantum eigensolver5 and variational quantum compiling58. Thus, the reduction of the number of iterations achieved by finding the optimal step size by BO indeed contributes to the overall reduction of the number of measurement shots. We also find that the SGLBO outperforms the state-of-the-art optimizers not only in terms of the number of measurement shots but also in the accuracy of estimating the minimum of the cost functions used in the simulation. Remarkably, we find that even under a moderate amount of hardware noise, the SGLBO can estimate the minimum in a task with almost the same accuracy as in noiseless cases, whereas the other state-of-the-art optimizers cannot in the same task. These results indicate that the SGLBO is a promising approach to reduce the number of measurement shots in the VQAs, and also to make the VQAs more feasible under unavoidable hardware noise in near-term quantum devices. Note that a combination of SGD and BO has been previously studied only in a specific machine-learning setting59, but its applicability and advantage for other tasks such as VQAs have been unknown; by contrast, our crucial contribution is to formulate SGLBO as an efficient and noise-robust framework for the task of optimizing parameterized quantum circuits and further develop the techniques of adaptive shot strategy and suffix averaging to demonstrate its advantage in this optimization task.

Consequently, the SGLBO establishes an alternative approach for efficient quantum-circuit optimizers, progressing beyond the existing state-of-the-art optimizers23,34,57; in particular, the novelty of SGLBO is to integrate two different optimization approaches, SGD and BO, to eliminate their shortcomings and take their advantages. Augmented with the further techniques of adaptive shot strategy and suffix averaging, the SGLBO is shown to have a significant advantage in the reduction of the cost of the number of measurement shots and also in the robustness against hardware noise, compared to the state-of-the-art optimizers for VQAs. These results open a way to practical algorithm designs for more efficient quantum-circuit optimization in terms of the overall cost of measurement shots, by avoiding both the precise estimation of expectation values and the many iterations of updating circuit parameters; at the same time, the approach developed for the SGLBO provides a fundamental insight into how VQAs can use classical information extracted from quantum states beyond estimating expectation values.

In the rest of this section, we describe the problem setting of optimization tasks in VQAs and review SGD and BO.

VQAs2,3,4 are a class of algorithms that use a parameterized quantum circuit U(θ) to minimize a task-specific cost function f(θ). The vector \({{{\boldsymbol{\theta }}}}={[{\theta }_{1},\cdots ,{\theta }_{D}]}^{\top }\in {{\mathbb{R}}}^{D}\) of D arguments of f is used as the circuit parameters of U(θ). The cost function f(θ) in VQAs is conventionally defined as an expectation-value of an observable O on n qubits, with respect to a quantum state output by the parameterized circuit, i.e.,

$$f({{{\boldsymbol{\theta }}}})={{{\rm{Tr}}}}[OU({{{\boldsymbol{\theta }}}}){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{\boldsymbol{\theta }}}})],$$
(1)

where \(\left|0\right\rangle \) is a standard-basis state used for initialization of each qubit, \(U({{{\boldsymbol{\theta }}}}){\left|0\right\rangle }^{\otimes n}\) is the output state of the n-qubit parameterized circuit, and \({U}^{{\dagger} }({{{\boldsymbol{\theta }}}})\) is the Hermitian conjugate of U(θ). The observable O is expanded as a sum of n-qubit tensor products of Pauli operators

$$O=\mathop{\sum}\limits_{k}{c}_{k}{P}_{k},$$
(2)

where ck for each k is a real coefficient of the kth term, and Pk is a tensor product of n single-qubit Pauli operators \({P}_{k}{ = \bigotimes }_{l = 1}^{n}{P}_{k,l}\) with \({P}_{k,l}\in \{I,X,Y,Z\}\) being a Pauli (or identity) operator acting on the lth qubit. Here, the identity operator is denoted by \(I:=\left|0\right\rangle \left\langle 0\right|+\left|1\right\rangle \left\langle 1\right|\), and Pauli operators acting on a single qubit are \(X:=\left|0\right\rangle \left\langle 1\right|+\left|1\right\rangle \left\langle 0\right|\), \(Y:=-{{{\rm{i}}}}\left|0\right\rangle \left\langle 1\right|+{{{\rm{i}}}}\left|1\right\rangle \left\langle 0\right|\), and \(Z:=\left|0\right\rangle \left\langle 0\right|-\left|1\right\rangle \left\langle 1\right|\). In a usual setting of VQAs, U(θ) is composed of non-parametric gates such as CNOT gates, and parametric gates in the form of

$$U({\theta }_{i})=\exp (-{{{\rm{i}}}}{P}_{i}{\theta }_{i}),$$
(3)

where Pi is also a tensor product of n single-qubit Pauli operators in the same way as Pk in Eq. (2). For example, Fig. 2 shows a representative choice of parameterized circuits used for VQAs4. Note that the parameter space of the circuit in Fig. 2 is a D-dimensional hypercube \({{{\boldsymbol{\theta }}}}\in {[-\pi ,\pi ]}^{D}\), i.e., a bounded subset of \({{\mathbb{R}}}^{D}\), on which a uniform probability distribution is well defined.

Fig. 2: An example of a parameterized quantum circuit used as an ansatz in VQAs.

The circuit parameters \({{{\boldsymbol{\theta }}}}={[{\theta }_{1},\cdots ,{\theta }_{D}]}^{\top }\in {{\mathbb{R}}}^{D}\) with D = 2n(r + 1) elements are individually allocated as each rotation angle of Pauli rotation gates \({R}_{X}({\theta }_{i}):={e}^{-{{{\rm{i}}}}{\theta }_{i}X}\) and \({R}_{Z}({\theta }_{j}):={e}^{-{{{\rm{i}}}}{\theta }_{j}Z}\). The part of the circuit surrounded by the braces is repeated r times, where the repeated parts may have different parameters.

The task in the VQAs is to obtain an estimate of the minimum of the cost function

$$\mathop{\min }\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}f({{{\boldsymbol{\theta }}}}).$$
(4)

The minimizer is denoted by

$${{{{\boldsymbol{\theta }}}}}^{* }=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}f({{{\boldsymbol{\theta }}}}).$$
(5)

Note that the cost function f(θ), in general, can be non-convex, and it can be computationally hard in general to obtain the exact solution of the optimization problem in VQAs60. By contrast, this paper aims to provide a heuristic optimizer that approximately solves this optimization problem with a small number of measurement shots. In experiments using a quantum device, we can evaluate the cost function from the sum of the expectation values \({{{\rm{Tr}}}}[{P}_{k}U({{{\boldsymbol{\theta }}}}){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{\boldsymbol{\theta }}}})]\) for all k, each of which can be estimated by independently repeating the preparation of \(U({{{\boldsymbol{\theta }}}}){\left|0\right\rangle }^{\otimes n}\) by the parameterized circuit and the measurement of this state in the eigenbasis of the Pauli operator Pk. For each k, let \({\overline{P}}_{k}\in {\mathbb{R}}\) be a sample mean obtained from these measurements for Pk, and due to Eq. (2), we estimate f(θ) by

$$f({{{\boldsymbol{\theta }}}})\approx \mathop{\sum}\limits_{k}{c}_{k}{\overline{P}}_{k}.$$
(6)

Each of these measurements is called a measurement shot. In this way, we evaluate f using a finite number of measurement shots; in this setting, we are only allowed imprecise queries to the cost function due to statistical errors with the finite number of measurement shots. Based on the central limit theorem61, we may model each imprecise query to f(θ) as

$$y=f({{{\boldsymbol{\theta }}}})+\epsilon ,$$
(7)

where y is an observed value, and \(\epsilon \sim {{{\mathcal{N}}}}(0,{\sigma }^{2})\) is independent and identically distributed (IID) Gaussian noise. From Hoeffding’s inequality62, to estimate f(θ) within an error ϵ with high probability, as many as \(O(1/{\epsilon }^{2})\) measurement shots may be required. In practice, it is prohibitively costly (i.e., an excessive number of measurement shots are needed) to evaluate a well-approximated value of the cost function (as well as its gradient), which leads to significant overhead in performing VQAs16,17.
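As a rough illustration of Eqs. (6) and (7), the following Python sketch simulates the estimation of a single-qubit cost function from a finite number of measurement shots. The toy cost f(θ) = cos θ (the expectation value of Z after a rotation exp(−iθX/2) applied to |0⟩), the function names, and the shot counts are assumptions made for illustration and are not taken from the experiments in this paper.

```python
import math
import random


def exact_cost(theta):
    """Toy cost (assumption): <Z> on exp(-i*theta*X/2)|0>, which equals cos(theta)."""
    return math.cos(theta)


def estimate_cost(theta, shots, rng):
    """Sample mean of `shots` single-shot outcomes, each equal to +1 or -1."""
    p_plus = (1.0 + exact_cost(theta)) / 2.0  # probability of observing +1
    total = sum(1 if rng.random() < p_plus else -1 for _ in range(shots))
    return total / shots


def rms_error(theta, shots, repeats, rng):
    """Root-mean-square deviation of the shot-based estimate from the exact cost."""
    squared = [(estimate_cost(theta, shots, rng) - exact_cost(theta)) ** 2
               for _ in range(repeats)]
    return math.sqrt(sum(squared) / repeats)


if __name__ == "__main__":
    rng = random.Random(0)
    for shots in (100, 400, 1600):
        print(shots, rms_error(1.0, shots, 200, rng))
```

In this toy model the error is expected to shrink roughly as the inverse square root of the number of shots, so halving the error costs about four times as many shots, in line with the quadratic scaling in 1/ϵ stated above.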

SGD aims to optimize a function f(θ) using an unbiased estimate of the gradient of f to update the parameters θ iteratively toward the optimal point with high probability.

In the optimization of circuit parameters for VQAs, we may need to evaluate the gradient of the cost function f(θ). For f(θ) defined with parametric gates in the form of (3), we can utilize a parameter-shift rule63,64 to calculate partial derivatives of the cost function from cost-function values at shifted circuit parameters, i.e.,

$$\frac{\partial f({{{\boldsymbol{\theta }}}})}{\partial {\theta }_{i}}=\frac{f({{{\boldsymbol{\theta }}}}+\frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})-f({{{\boldsymbol{\theta }}}}-\frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})}{2}.$$
(8)

Here θi is a circuit parameter allocated to the rotation angle of the ith Pauli rotation gate \(U({\theta }_{i})=\exp (-{{{\rm{i}}}}{P}_{i}{\theta }_{i})\), and ei represents a unit vector along the coordinate of θi. Note that to obtain all the elements of the gradient of f(θ), we may need to evaluate each partial derivative independently.
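The parameter-shift rule of Eq. (8) can be checked numerically on a small example. The sketch below assumes the common convention in which a rotation by angle θ about a Pauli operator P is exp(−iθP/2), so that the cost is a sinusoid of period 2π in each parameter and the π/2 shift with the prefactor 1/2 is exact; under a different normalization of the rotation angle, the shift and the prefactor rescale accordingly. The two-parameter single-qubit circuit, the observable Z + 0.5X, and all function names are hypothetical choices for illustration.

```python
import cmath
import math


def cost(theta):
    """Toy cost (assumption): <Z + 0.5 X> on RZ(t2) RX(t1)|0>, with R_P(t) = exp(-i t P / 2)."""
    t1, t2 = theta
    a0 = math.cos(t1 / 2) * cmath.exp(-0.5j * t2)        # amplitude of |0>
    a1 = -1j * math.sin(t1 / 2) * cmath.exp(0.5j * t2)   # amplitude of |1>
    exp_z = abs(a0) ** 2 - abs(a1) ** 2
    exp_x = 2.0 * (a0.conjugate() * a1).real
    return exp_z + 0.5 * exp_x


def parameter_shift_gradient(f, theta):
    """Gradient of f from cost values at parameters shifted by +/- pi/2, as in Eq. (8)."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += math.pi / 2
        minus[i] -= math.pi / 2
        grad.append((f(plus) - f(minus)) / 2.0)
    return grad


def analytic_gradient(theta):
    """Closed-form gradient of cos(t1) + 0.5 sin(t1) sin(t2), used only for comparison."""
    t1, t2 = theta
    return [-math.sin(t1) + 0.5 * math.cos(t1) * math.sin(t2),
            0.5 * math.sin(t1) * math.cos(t2)]


if __name__ == "__main__":
    point = [0.7, -1.3]
    print(parameter_shift_gradient(cost, point))
    print(analytic_gradient(point))
```

Each partial derivative requires two evaluations of the cost function, so a full gradient in D parameters requires 2D evaluations.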

However, as discussed above, we cannot exactly calculate the cost function and its gradient with a finite number of measurement shots, and the precise estimation of the gradient is costly in VQAs. In this setting, a standard method for solving Eq. (4) is stochastic gradient descent (SGD)31,33, which updates the current point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\) at iteration t according to

$${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{(t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}),$$
(9)

where η(t) is the step size, and \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}):={({\hat{g}}_{1}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}),\ldots ,{\hat{g}}_{D}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}))}^{\top }\) is an unbiased estimator of the gradient \(\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\), i.e., \({\mathbb{E}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})]=\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\). Here \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\) is estimated with a finite number of measurement shots, i.e., with a shot size

$${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}={({s}_{1}^{(t)},\ldots ,{s}_{D}^{(t)})}^{\top }.$$
(10)

The estimate of each partial derivative is individually computed as

$${\hat{g}}_{i}^{(t)}({{{\boldsymbol{\theta }}}})=\frac{1}{{s}_{i}^{(t)}}\mathop{\sum }\limits_{{\mathsf{m}}=1}^{{s}_{i}^{(t)}}{g}_{i}^{{\mathsf{m}}}({{{\boldsymbol{\theta }}}}),$$
(11)
$${g}_{i}^{{\mathsf{m}}}({{{\boldsymbol{\theta }}}})=({O}_{+}^{{\mathsf{m}}}-{O}_{-}^{{\mathsf{m}}})/2,$$
(12)

where \({O}_{\pm }^{{\mathsf{m}}}\) is a single-shot estimator of \(f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})\). Each single-shot estimator of \(f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})\) is constructed according to Eq. (6) by substituting θ with \({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i}\), and the number of measurement shots used for estimating the kth term \({c}_{k}{\overline{P}}_{k}\) in Eq. (6) is denoted by \({s}_{i,k}^{(t)}\), which satisfies \({\sum }_{k}{s}_{i,k}^{(t)}={s}_{i}^{(t)}\). Given the shot size \({{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}\), each \({s}_{i,k}^{(t)}\) is probabilistically determined using a multinomial distribution in such a way that the probability pk of measuring the kth term should be proportional to the magnitude of the weight ck, i.e., \({p}_{k}\propto | {c}_{k}| \) and \({\sum }_{k}{p}_{k}=1\)22; that is, it should hold that \({\mathbb{E}}[{s}_{i,k}^{(t)}]={p}_{k}{s}_{i}^{(t)}\) for each k and i. Since the gradient is estimated from two values \(f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})\) of the cost function, the number of measurement shots used for obtaining \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\) is

$$\mathop{\sum }\limits_{i=1}^{D}2{s}_{i}^{(t)}=2{s}_{{{{\rm{grad}}}}}^{(t)},$$
(13)

where we write \({s}_{{{{\rm{grad}}}}}^{(t)}:=| | {{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}| {| }_{1}\).
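The random allocation of the shots for one partial derivative among the Pauli terms can be sketched as follows. The sketch draws the counts for the K terms from a multinomial distribution with probabilities proportional to the magnitudes of the coefficients (the absolute value is taken here so that the probabilities are non-negative); the function name `allocate_shots` and the example coefficients are hypothetical.

```python
import random


def allocate_shots(coeffs, shots, rng):
    """Draw per-term shot counts from a multinomial with p_k proportional to |c_k|."""
    weights = [abs(c) for c in coeffs]
    # Each of the `shots` draws picks one term index k with probability p_k.
    draws = rng.choices(range(len(coeffs)), weights=weights, k=shots)
    counts = [0] * len(coeffs)
    for k in draws:
        counts[k] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    coeffs = [0.5, -1.5, 2.0]  # illustrative coefficients c_k
    print(allocate_shots(coeffs, 1000, rng))
```

By construction the counts sum to the requested number of shots, and the expected count of the kth term is pk times that number.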

The estimator \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}({{{\boldsymbol{\theta }}}})\) in VQAs is unbiased for all \({{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}\), which is a preferable property to achieve convergence of SGD33. In addition, to guarantee convergence of SGD, we may require the step size to vanish as the estimated points approach a minimizer. In this case, the SGD achieves the optimization to accuracy ϵ within \(O(1/{\epsilon }^{4})\) iterations in general for non-convex functions32, such as typical cost functions in VQAs. However, in practice, a user needs to designate a specific decay rate of the step size to achieve good performance, and tuning this decay rate can be difficult.
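A minimal sketch of SGD with a shot-based gradient estimator, in the spirit of Eqs. (9), (11), and (12), is given below. It is not the implementation used in this paper: the one-parameter toy cost f(θ) = cos θ, the decay schedule η0/√(t + 1), the fixed number of shots per iteration, and all function names are assumptions chosen for illustration.

```python
import math
import random


def single_shot(theta, rng):
    """One measurement outcome (+1 or -1) whose mean is the toy cost cos(theta)."""
    return 1 if rng.random() < (1.0 + math.cos(theta)) / 2.0 else -1


def gradient_estimate(theta, shots, rng):
    """Unbiased estimate of df/dtheta from `shots` pairs of shifted single-shot outcomes."""
    total = 0.0
    for _ in range(shots):
        o_plus = single_shot(theta + math.pi / 2, rng)
        o_minus = single_shot(theta - math.pi / 2, rng)
        total += (o_plus - o_minus) / 2.0
    return total / shots


def sgd(theta0, shots, iterations, eta0, rng):
    """SGD with the decaying step size eta0 / sqrt(t + 1); returns (theta, shots used)."""
    theta = theta0
    used = 0
    for t in range(iterations):
        g = gradient_estimate(theta, shots, rng)
        used += 2 * shots  # two shifted circuits per gradient sample
        theta -= eta0 / math.sqrt(t + 1) * g
    return theta, used


if __name__ == "__main__":
    rng = random.Random(0)
    theta, used = sgd(theta0=1.0, shots=10, iterations=300, eta0=0.5, rng=rng)
    print(theta, math.cos(theta), used)
```

Even in this toy setting, the outcome depends on the hand-picked values of η0 and the decay schedule, which illustrates the tuning burden discussed above.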

BO is a gradient-free framework for optimization of an unknown function f(θ)38,39. BO can be employed to optimize an expensive-to-evaluate cost function in settings where only noisy observations of the function are possible, and we try to seek a minimizer of f(θ) with as small a number of noisy observations as possible. One of the features of BO is to utilize an easy-to-compute surrogate model that approximates the unknown cost function based on observed data65,66,67. A popular surrogate model for BO is a Gaussian process (GP)68. A GP is a collection of random variables such that every finite subset of random variables obeys a multivariate normal distribution. In the BO, we put a GP prior over the true function f(θ) as \(f({{{\boldsymbol{\theta }}}}) \sim {{{\mathcal{GP}}}}(\mu ({{{\boldsymbol{\theta }}}}),k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} ))\), where \(\mu ({{{\boldsymbol{\theta }}}})={\mathbb{E}}(f({{{\boldsymbol{\theta }}}}))\) is a mean function and \(k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} )\) is a covariance kernel function. In practice, if one has no prior knowledge about the mean of the function μ(θ) that one tries to fit, μ(θ) can be set to 0. A common choice of the kernel function is the Gaussian kernel

$$k({{{\boldsymbol{\theta}}}},{{{\boldsymbol{\theta}}}}^{\prime} )={\tau }^{2}{\rm{exp}} \left(\frac{-\Vert{{{\boldsymbol{\theta}}}}-{{{\boldsymbol{\theta }}}}^{\prime} \Vert^{2}}{2{l}^{2}}\right),$$
(14)

where \({\tau }^{2}\) is called the signal variance, which determines the typical magnitude of the deviations of the function from its mean, and l is called the length-scale, which determines the distance over which the values of the function become nearly uncorrelated68. For other conventional kernel functions, e.g., a Matérn kernel, see ref. 68.

Here we consider a situation where we have a set of N noisy observations of the cost function \({{{{\mathcal{D}}}}}_{1:N}={\{({{{{\boldsymbol{\theta }}}}}^{(i)},{y}^{(i)})\}}_{i = 1}^{N}\) at points θ(1), …, θ(N), where each y(i) = f(θ(i)) + ϵ suffers from the IID Gaussian noise \(\epsilon \sim {{{\mathcal{N}}}}(0,{\sigma }^{2})\). Assuming that these observations are given according to GP, we calculate a GP posterior conditioned on these observations, which is governed by hyperparameters, namely, the signal variance \({\tau }^{2}\), the length-scale l, and the variance of Gaussian noise \({\sigma }^{2}\). These hyperparameters can be estimated by means of maximizing a log marginal likelihood68. Then, if we observe the cost function f at a new point \({{{{\boldsymbol{\theta }}}}}_{* }\), the value to be observed will obey a GP posterior expressed as

$$\begin{array}{ll}{f}_{*}| {{{{\boldsymbol{\theta }}}}}_{* },{{{{\mathcal{D}}}}}_{1:N}\sim {{{\mathcal{N}}}}({{{{\boldsymbol{k}}}}}_{* }^{\top }{[K+{\sigma }^{2}I]}^{-1}{{{\boldsymbol{y}}}},{k}_{* * }-{{{{\boldsymbol{k}}}}}_{* }^{\top }{[K+{\sigma }^{2}I]}^{-1}{{{{\boldsymbol{k}}}}}_{* }),\end{array}$$
(15)

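where \({f}_{* }=f({{{{\boldsymbol{\theta }}}}}_{* })\), \({{{\boldsymbol{y}}}}={[{y}^{(1)},\cdots ,{y}^{(N)}]}^{\top }\) is the vector of observed values, \({{{{\boldsymbol{k}}}}}_{* }={[k({{{{\boldsymbol{\theta }}}}}_{* },{{{{\boldsymbol{\theta }}}}}^{(1)}),\cdots ,k({{{{\boldsymbol{\theta }}}}}_{* },{{{{\boldsymbol{\theta }}}}}^{(N)})]}^{\top }\), \({k}_{* * }=k({{{{\boldsymbol{\theta }}}}}_{* },{{{{\boldsymbol{\theta }}}}}_{* })\), and K is the covariance matrix \({[k({{{{\boldsymbol{\theta }}}}}^{(i)},{{{{\boldsymbol{\theta }}}}}^{(j)})]}_{i,j = 1}^{N}\)68.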
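As a quick check of the posterior in Eq. (15), consider the special case of a single observation, N = 1, with zero prior mean and the Gaussian kernel of Eq. (14). This worked example is added for illustration and uses only the definitions above. In this case, the covariance matrix and the prior variance at the new point both reduce to the signal variance, and the posterior becomes

$$f_{*} \mid \boldsymbol{\theta}_{*}, \mathcal{D}_{1:1} \sim \mathcal{N}\left(\frac{k_{*}\, y^{(1)}}{\tau^{2}+\sigma^{2}},\; \tau^{2}-\frac{k_{*}^{2}}{\tau^{2}+\sigma^{2}}\right), \qquad k_{*}=\tau^{2}\exp\left(-\frac{\Vert \boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{(1)}\Vert^{2}}{2l^{2}}\right).$$

At the observed point itself, where the kernel value equals the signal variance, the mean is \(\tau^{2}y^{(1)}/(\tau^{2}+\sigma^{2})\) and the variance is \(\tau^{2}\sigma^{2}/(\tau^{2}+\sigma^{2})\), so a small noise variance pins the posterior to the observation. Far from the observed point, where the kernel value vanishes, the posterior reverts to the prior \(\mathcal{N}(0,\tau^{2})\).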

In BO, we construct an acquisition function φ(θ) from the posterior in Eq. (15) and determine the next query point according to

$${{{{\boldsymbol{\theta }}}}}^{(N+1)}=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}\varphi ({{{\boldsymbol{\theta }}}}).$$
(16)

Several ways of constructing the acquisition function have been proposed, such as Thompson sampling69, upper confidence bound70, and expected improvement71. In particular, Thompson sampling estimates values of f at a given set of points by sampling according to the multivariate normal distributions obtained from Eq. (15), and uses these sampled values as the values of the acquisition function at these points. Then, we take the minimum among the values of the acquisition function for the set of points and perform the next query to f at the minimum point in the set as shown in Eq. (16). The minimization of φ(θ) is performed by using efficient optimization heuristics72,73. BO proceeds by querying the cost function at the minimizer of φ(θ) and iteratively updating the GP posterior according to Eq. (15) until a fixed number of queries to the cost function have been performed38.
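The BO loop described above can be sketched for a one-dimensional toy problem as follows. Several simplifications are assumptions made to keep the sketch short and dependency-free: the hyperparameters are fixed rather than fitted by maximizing the marginal likelihood, the acquisition function is a lower confidence bound (the minimization counterpart of the upper confidence bound) rather than Thompson sampling, the acquisition function is minimized over a fixed grid, and the returned point is the grid minimizer of the posterior mean. The toy objective and all function names are hypothetical.

```python
import math
import random


def kernel(a, b, tau2, length):
    """Gaussian kernel of Eq. (14) for scalar inputs."""
    return tau2 * math.exp(-((a - b) ** 2) / (2.0 * length ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def posterior(x_star, xs, ys, tau2, length, sigma2):
    """Posterior mean and variance of Eq. (15) at x_star, for zero prior mean."""
    n = len(xs)
    gram = [[kernel(xs[i], xs[j], tau2, length) + (sigma2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_star = [kernel(x_star, x, tau2, length) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(gram, ys)))
    var = tau2 - sum(k * b for k, b in zip(k_star, solve(gram, k_star)))
    return mean, max(var, 0.0)  # clip tiny negative values caused by rounding


def bayes_opt_1d(noisy_f, lo, hi, n_init, n_eval, rng,
                 tau2=1.0, length=0.5, sigma2=1e-4, beta=2.0, grid_size=101):
    """BO on [lo, hi] with n_init >= 2 equally spaced initial queries and n_eval further queries."""
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    xs = [lo + (hi - lo) * i / (n_init - 1) for i in range(n_init)]
    ys = [noisy_f(x, rng) for x in xs]
    for _ in range(n_eval):
        def acquisition(x):
            mean, var = posterior(x, xs, ys, tau2, length, sigma2)
            return mean - beta * math.sqrt(var)  # lower confidence bound
        x_next = min(grid, key=acquisition)
        xs.append(x_next)
        ys.append(noisy_f(x_next, rng))
    return min(grid, key=lambda x: posterior(x, xs, ys, tau2, length, sigma2)[0])


def noisy_quadratic(x, rng):
    """Toy objective (assumption): (x - 0.3)^2 plus Gaussian noise of standard deviation 0.01."""
    return (x - 0.3) ** 2 + rng.gauss(0.0, 0.01)


if __name__ == "__main__":
    print(bayes_opt_1d(noisy_quadratic, -1.0, 1.0, 3, 10, random.Random(0)))
```

In one dimension a grid suffices for minimizing the acquisition function; in D dimensions the number of grid points grows exponentially with D, which is one way to see why direct use of BO becomes costly for many parameters.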

This framework of BO has been shown to reduce the required number of queries to the cost function in achieving the minimization compared to other global optimization algorithms38. The performance of BO itself is governed by the ability to find the minimizer of φ(θ), which, like the cost function, is non-convex in general. Thus, it is important to design the acquisition function suitably so that the computational cost is relatively low and optimization heuristics are tractable46,74,75,76. However, if the acquisition function is defined in a high-dimensional parameter space that typically appears in VQAs, it is excessively costly to use the BO.

Results

In the following, we present the description of SGLBO and introduce the adaptive shot strategy and suffix averaging. Moreover, numerical experiments are provided to demonstrate the advantage of SGLBO compared to other state-of-the-art optimizers for VQAs.

Algorithm 1 Stochastic gradient line Bayesian optimization (SGLBO)

Require: Cost function f(θ) with D parameters in Eq. (1), the initial shot size \({{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(0)}\) for evaluating the gradient in Eq. (10), a kernel \(k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} )\) and an acquisition function φ(θ) used for GP in Eqs. (15) and (16), the initial point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}\) to be updated according to Eq. (17), the bound \({\eta }_{\max }\) of the 1D subspace \({{{{\mathcal{L}}}}}^{(t)}\) to perform the BO in Eq. (19), the initial number \({s}_{{{{\rm{cost}}}}}^{(0)}\) of measurement shots for evaluating the cost function in BO in Eq. (20), the number N = Ninit + Neval of queries used for the BO in Eq. (21), the total number stot of measurement shots for the stopping condition (23), the precision κ in estimating the gradient according to Eq. (30), the description of the lower bound \({G}_{{{{\rm{grad}}}}}^{(t)}\) of the shot size in Eq. (30), the description of the lower bound \({G}_{{{{\rm{cost}}}}}^{(t)}\) of the number of measurement shots in estimating the cost function in Eq. (31), a parameter α for suffix averaging in Eq. (32).

1: initialize:

2:  \(t\leftarrow 0,\,{s}_{{{{\rm{temp}}}}}^{(t)}\leftarrow 0\)

3: while \({s}_{{{{\rm{temp}}}}}^{(t)} \,<\, {s}_{{{{\rm{tot}}}}}\) do ▷ Iterate until the stopping condition (23) is satisfied.

4:  \({\hat{{{{\boldsymbol{g}}}}}}^{(t)},{S}^{(t)}\leftarrow \) Estimate the gradient \({\mathbb{E}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]=\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\) using \(2\times {{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}\) measurement shots according to Eq. (11), and calculate its empirical variance S(t) in Eq. (30).

5:  \({{{{\mathcal{L}}}}}^{(t)}\leftarrow \) Take the 1D subspace \({{{{\mathcal{L}}}}}^{(t)}\) depending on \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)},{\hat{{{{\boldsymbol{g}}}}}}^{(t)},{\eta }_{\max }\) according to Eq. (19).

6:  \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}\leftarrow \) Determine \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}\) by the BO on \({{{{\mathcal{L}}}}}^{(t)}\) with \(k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} ),\varphi ({{{\boldsymbol{\theta }}}}),{N}_{{{{\rm{init}}}}},{N}_{{{{\rm{eval}}}}}\) as described in the main text below Eq. (21).

7:  \({{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t+1)}\leftarrow \) Determine the shot size for estimating the gradient, from \(\kappa ,{\hat{{{{\boldsymbol{g}}}}}}^{(t)},{S}^{(t)},D,{G}_{{{{\rm{grad}}}}}^{(t)}\) according to Eq. (30).

8:  \({s}_{{{{\rm{cost}}}}}^{(t+1)}\leftarrow \) Determine the number of measurement shots for estimating the cost function in the BO, from \({{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t+1)},{G}_{{{{\rm{cost}}}}}^{(t)}\) according to Eq. (31).

9:  \({s}_{{{{\rm{temp}}}}}^{(t+1)}\leftarrow {s}_{{{{\rm{temp}}}}}^{(t)}+2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}\) due to Eq. (22).

10:   t ← t + 1

11: endwhile

12: T ← t

13: return \({\overline{{{{\boldsymbol{\theta }}}}}}_{\alpha ,T}\leftarrow \) Take the suffix average according to Eq. (32).

Description of algorithm

We present a framework for the optimizer of parameterized quantum circuits in the VQAs, stochastic gradient line Bayesian optimization (SGLBO). The idea behind SGLBO is to estimate the direction of the gradient based on SGD and further to utilize BO to estimate the optimal step size within the one-dimensional subspace of parameters in this direction. This allows us to avoid the difficulty of choosing an appropriate step size in SGD, and also to achieve a feasible use of BO by limiting the domain to apply the BO to the one-dimensional space. In addition, we introduce two noise-reduction techniques, adaptive shot strategy and suffix averaging, to improve the speed and the accuracy of minimizing the cost function. Adaptive shot strategy and suffix averaging are crucial and characteristic components for the feasibility of SGLBO and will be explained in the “Adaptive shot strategy” and “Suffix averaging for SGLBO” sections. Below, we will present the procedure of SGLBO (see also Algorithm 1).

The SGLBO achieves the minimization of the cost function by iteratively updating the points to estimate the minimizer of the cost function. Let T denote the total number of iterations in the SGLBO. For each iteration t = 0, 1, …, T − 1, let \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\) denote the point obtained in the (t + 1)th iteration of the SGLBO, which is an estimator of the circuit parameters that minimize the cost function, and the initial point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}\) represents an initial guess of the minimizer. Note that we here take \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}\) uniformly at random, but in case a better initial guess of the minimizer than the uniformly random point is available, \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}\) could be chosen as the better guess77,78. Similarly to the SGD, the SGLBO computes an unbiased estimator \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\) of the gradient of the cost function at the point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\), using \(2{s}_{{{{\rm{grad}}}}}^{(t)}\) measurement shots due to Eq. (13). The shot size \({{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}\) is determined in each iteration t based on adaptive shot strategy, which will be explained in the following section. Using \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\), the SGLBO updates the point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\) to the next point according to an update rule described by

$${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\hat{\eta }}^{* (t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)},$$
(17)

where \({\hat{\eta }}^{* (t)}\) is an estimator of the optimal step size. The optimal step size \({\eta }^{* (t)}\) is defined through

$${{{{\boldsymbol{\theta }}}}}^{* (t)}:=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{{{\mathcal{L}}}}}^{(t)}}f({{{\boldsymbol{\theta }}}})={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{* (t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)},$$
(18)

where \({{{{\mathcal{L}}}}}^{(t)}\) is the one-dimensional subspace for applying the BO, i.e.,

$${{{{\mathcal{L}}}}}^{(t)}:=\{{\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{(t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {\eta }^{(t)}\in [-{\eta }_{\max },{\eta }_{\max }]\},$$
(19)

and \({\eta }_{\max } \,>\, 0\) is a constant hyperparameter that bounds the one-dimensional subspace and will be specified in “Example of choice of hyperparameters and implementation” section. We remark that we choose \({\eta }_{\max }\) as a constant independent of D so that the BO remains feasible even for large D. The parameter region of a circuit with D parameters can be a D-dimensional hypercube, e.g., \({{{\boldsymbol{\theta }}}}\in {[-\pi ,\pi ]}^{D}\) for the circuit in Fig. 2, and thus, to cross the whole parameter region by \({{{{\mathcal{L}}}}}^{(t)}\), one may be tempted to choose \({\eta }_{\max }\) as the length of the diagonal of this D-dimensional hypercube, i.e., \({\eta }_{\max }\approx \sqrt{D}\); however, for the feasibility of the BO, it is essential to keep \({\eta }_{\max }\) constant. Our approach can be considered an improvement over the SGD with a constant step size \({\eta }_{\max }\), where we use the BO to estimate the optimal step size \({\hat{\eta }}^{* (t)}\) instead of using the fixed step size \({\eta }_{\max }\).

To obtain an estimate of the optimal step size \({\hat{\eta }}^{* (t)}\) in Eq. (17), we perform the procedure of BO on \({{{{\mathcal{L}}}}}^{(t)}\) by using a fixed number of measurement shots

$${s}_{{{{\rm{cost}}}}}^{(t)}$$
(20)

per query to the cost function, and querying these noisy observations of the cost function N times in total with

$$N={N}_{{{{\rm{init}}}}}+{N}_{{{{\rm{eval}}}}},$$
(21)

where Ninit is the number of points used for initial evaluation for BO, and Neval is the number of points evaluated during the BO in each step in addition to Ninit. This procedure determines \({\hat{\eta }}^{* (t)}\) in such a way that \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}\) in Eq. (17) should be given by θ(N+1) in Eq. (16). We will specify Ninit and Neval in “Example of choice of hyperparameters and implementation” section. In the BO, we use Ninit points for the initial queries, which we take at equal intervals in the 1D subspace \({{{{\mathcal{L}}}}}^{(t)}\). Using the observed points, the BO iterates a cycle according to Eqs. (15) and (16) to decide an additional point to evaluate per cycle. Repeating Neval cycles, we have Neval points in addition to the Ninit initial points, where the nth cycle for n ∈ {1, …, Neval} uses (Ninit + n − 1) points to decide the (Ninit + n)th point. These N points are used for the update according to Eq. (17), i.e., the calculation of \({\hat{\eta }}^{* (t)}\).
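The geometry of the line search can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `line_point` and `initial_step_sizes` are hypothetical, and parameters are represented as plain Python lists.

```python
def line_point(theta, g, eta):
    """Return theta - eta * g, a point on the line L^(t) of Eq. (19)."""
    return [th - eta * gi for th, gi in zip(theta, g)]


def initial_step_sizes(eta_max, n_init):
    """Return n_init step sizes at equal intervals in [-eta_max, eta_max]."""
    if n_init < 2:
        raise ValueError("need at least two points to span the interval")
    width = 2.0 * eta_max / (n_init - 1)
    return [-eta_max + k * width for k in range(n_init)]


# Initial query points on the line for a two-parameter toy example.
theta_t = [0.0, 1.0]
g_t = [1.0, -2.0]
queries = [line_point(theta_t, g_t, eta) for eta in initial_step_sizes(1.0, 5)]
```

With an odd `n_init`, the equally spaced grid contains the step size 0, i.e., the current point itself.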

In this way, the SGLBO updates the point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\) according to Eq. (17) until we consume a preset total number of measurement shots stot, which we initially designate. In particular, in the (t + 1)th iteration for each t = 0, …, T − 1, we use \(2{s}_{{{{\rm{grad}}}}}^{(t)}\) measurement shots for estimating the gradient according to Eq. (13), and also use \({s}_{{{{\rm{cost}}}}}^{(t)}\) measurement shots for each of the N queries to the cost function in the BO; that is, the number of measurement shots that we use in the (t + 1)th iteration is

$$2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}.$$
(22)

In the SGLBO, if the total number of measurement shots used in the iterations reaches or exceeds the preset bound stot, i.e.,

$$\mathop{\sum }\limits_{t=0}^{T-1}\left[2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}\right]\geqq {s}_{{{{\rm{tot}}}}},$$
(23)

then we stop the iterations. Note that T is the minimum number of iterations satisfying Eq. (23), and it is determined while running the SGLBO, depending on stot. We could also stop the iterations once the cost function has converged, but we here use the stopping condition based on stot for simplicity of presentation. We remark that it would be too costly in VQAs to check the convergence of the values of the cost function \(f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\) itself, which we avoid here; instead, it would be possible, e.g., to use another stopping condition by checking the convergence of the sequence of parameters \({({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t = 0,\ldots ,T-1}\).
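The shot accounting of Eqs. (22) and (23) can be sketched in a few lines. The helper names and the use of callables for the per-iteration shot sizes are assumptions made for illustration; in the actual algorithm these sizes come from the adaptive shot strategy.

```python
from itertools import count


def shots_in_iteration(s_grad, s_cost, n_queries):
    # Eq. (22): 2 * s_grad shots for the gradient, n_queries * s_cost for the BO.
    return 2 * s_grad + n_queries * s_cost


def iterations_until_budget(s_tot, s_grad_of_t, s_cost_of_t, n_queries):
    """Return (T, shots used), with T the smallest count satisfying Eq. (23)."""
    used = 0
    for t in count():
        spent = shots_in_iteration(s_grad_of_t(t), s_cost_of_t(t), n_queries)
        if spent <= 0:
            raise ValueError("each iteration must consume at least one shot")
        used += spent
        if used >= s_tot:
            return t + 1, used


# Hypothetical constant shot sizes: 50 shots per iteration.
T, used = iterations_until_budget(100, lambda t: 10, lambda t: 3, n_queries=10)
```

Because the check happens after each full iteration, the shots actually consumed can overshoot stot by at most one iteration's worth.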

Finally, after the last iteration, the optimizer calculates a suffix average55 of the points \({({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t = 0,\ldots ,T-1}\), i.e., an average of a subset of the points in a latter part of the iterations, which we will explain in “Suffix averaging for SGLBO” section. This suffix average is output as the estimate of the minimizer of the cost function.

The procedure of the SGLBO may require an additional cost of measurement shots for the BO compared to the SGD without using the BO, but this cost is negligible as explained in the following. To estimate the optimal step size by the BO, we may use an extra number of measurement shots to query the cost function, in addition to the gradient estimation based on the SGD. For simplicity, suppose that the shot size (10) and the number of measurement shots to evaluate the cost function in the BO are given by a constant s, i.e., \({s}_{i}^{(t)}=s\) (i ∈ {1, …, D}) and \({s}_{{{{\rm{cost}}}}}^{(t)}=s\). Then, due to Eq. (22), the number of measurement shots to be used in each iteration of the SGLBO is (2D + N)s. In this case, the cost of estimating the optimal step size is the same as the cost of the gradient estimation for a parameterized quantum circuit with N/2 additional parameters. This cost can be negligibly low as the number of circuit parameters D gets large, and hence, we can indeed gain the benefit of estimating the optimal step size by the BO.
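To make this estimate concrete, the following small sketch computes the per-iteration shot count (2D + N)s and the fraction of it spent on the BO. The function names and the values of D are hypothetical; N = 10 corresponds to the choice Ninit = Neval = 5 used later in this paper.

```python
def shots_per_iteration(num_params, n_queries, s):
    # (2D + N) * s under the simplifying assumption of a constant shot size s.
    return (2 * num_params + n_queries) * s


def bo_overhead_fraction(num_params, n_queries):
    # Share of the per-iteration shots spent on the BO queries: N / (2D + N).
    return n_queries / (2 * num_params + n_queries)


# The overhead shrinks as the (hypothetical) number of parameters D grows.
for num_params in (5, 45, 495):
    print(num_params, bo_overhead_fraction(num_params, n_queries=10))
```

For N = 10, the BO share drops from one half at D = 5 to one percent at D = 495.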

The foundation for why SGLBO can efficiently find a candidate of the minimum point, i.e., a stationary point, can be explained as follows. The constant step-size SGD with averaging converges to a stationary point even in a non-convex setting79. The SGLBO is designed to converge faster than this constant step-size SGD with averaging since we use the BO to find a step size that further reduces the value of the cost function compared to taking the deterministic constant step size. In particular, in each step t ∈ {0, …, T − 1}, BO aims to find the minimum point along a 1D subspace; that is, the cost function \(f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\) is reduced to \(f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)})\) satisfying \(f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)})\leqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})\) with high probability, in the case where BO is performed with sufficiently good precision. In this case, as the iterations proceed, SGLBO improves the cost function according to \(f({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)})\geqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(1)})\geqq \cdots \geqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(T-1)})\), which does not necessarily hold in SGD but should hold in the SGLBO with high probability, leading to an improvement compared to the mere use of the SGD. We remark that the optimization problems in VQAs are non-convex, and hence, a tight analysis of the convergence speed would be challenging in general. Some previous research such as refs. 33,37 performs convergence analyses of optimizers for VQAs with assumptions on convexity or strong convexity, but the performance for non-convex problems that typically appear in VQAs is unknown. In contrast, the above explanation of convergence does not require the convexity assumptions. However, to bound the speed of convergence of SGLBO, further assumptions may be needed since non-convex optimization problems are hard to solve by nature. We leave the tight analysis of the convergence speed of the SGLBO under an appropriate assumption for the setting of VQAs for further research; instead, we will use numerical simulation to show the fast convergence speed of the SGLBO in our numerical experiments.

Adaptive shot strategy

The number of measurement shots used for estimating values and gradients of the cost function is one of the crucial parameters in stochastic optimization algorithms. In such algorithms, we may have a trade-off between efficiency and accuracy. In particular, at the beginning of optimization, we can use an imprecise gradient estimated with few measurement shots to roughly move to points around the minimizer. On the other hand, at the end of optimization, the gradients with less noise are needed to further decrease the value of the cost function. This observation motivates us to establish a strategy to gradually increase the shot size (10) used for estimating the gradient in the SGLBO as the optimization proceeds.

Such adaptive shot strategies have been well studied in the field of machine learning47,48,49,50,51,52,53, and one of them has been applied also in the context of VQAs34,35. However, the formula for estimating the next number of measurement shots given in refs. 34,35 depends on the step size and becomes invalid when the step size exceeds a certain range. Problematically, the step size in the SGLBO often exceeds the range. Thus, our algorithm utilizes a different approach, the norm test48,49,51, which determines the number of measurement shots to maintain a constant signal-to-noise ratio of the estimate of the gradient.

In the norm test, we want to decide the shot size based on a condition that the estimated vector \(-{\hat{{{{\boldsymbol{g}}}}}}^{(t)}\) should be appropriately in a descent direction51, which ideally would be

$${\delta }^{(t)}:=| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}-{{{\boldsymbol{\nabla }}}}f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})| | \leqq \kappa | | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | ,$$
(24)

with a parameter κ satisfying 0 ≤ κ < 1. Intuitively, as the optimization proceeds, the norm \(| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | \) of the gradient becomes small, and the condition (24) requires that the estimate \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\) of the gradient should become precise as \(| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | \) gets small. However, the exact evaluation of δ(t) would be prohibitively costly in VQAs. Thus, we square both sides of the above inequality and then replace the left-hand side with its expectation, i.e., \({\mathbb{E}}[{({\delta }^{(t)})}^{2}]={{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]\), where \({{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]\) is the variance of \({\hat{{{{\boldsymbol{g}}}}}}^{(t)}\). The exact value of this variance is still difficult to calculate, and hence, we make the approximation using a sample variance80, i.e.,

$${{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]\simeq \frac{{{{\rm{Tr}}}}({{{\Sigma }}}^{(t)})}{{s}_{{{{\rm{grad}}}}}^{(t)}}$$
(25)

where \({{{\Sigma }}}_{ij}^{(t)}:={\mathbb{E}}[({g}_{i}^{(t)}-\nabla {f}_{i}^{(t)})({g}_{j}^{(t)}-\nabla {f}_{j}^{(t)})]\). Instead of Eq. (24), the norm test could check

$$\frac{{{{\rm{Tr}}}}({{{\Sigma }}}^{(t)})}{{s}_{{{{\rm{grad}}}}}^{(t)}}\leqq {\kappa }^{2}| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}.$$
(26)

To adapt the condition (26) to the setting of VQAs, we consider the freedom of choosing the number of measurement shots for estimating each partial derivative of the cost function in Eq. (8). Since each partial derivative is estimated independently, Eq. (26) can be written as

$$\mathop{\sum}\limits_{i}\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}}{{s}_{i}^{(t)}}\leqq {\kappa }^{2}| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2},$$
(27)

where \({\sigma }_{i}^{(t)}:=\sqrt{{{{\rm{Var}}}}[{g}_{i}^{(t)}]}\). Now we impose a constraint on the number of measurement shots so that each estimate of the partial derivative should have an equal variance, i.e., \({({\sigma }_{i}^{(t)})}^{2}/{s}_{i}^{(t)}={({\sigma }_{j}^{(t)})}^{2}/{s}_{j}^{(t)}\) for i ≠ j. Then, we obtain a lower bound of \({s}_{i}^{(t)}\) for each i, i.e.,

$${s}_{i}^{(t)}\geqq \frac{1}{{\kappa }^{2}}\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}D}{| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}}.$$
(28)
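The step from Eq. (27) to Eq. (28) can be spelled out as follows, where c denotes the common per-component variance, a symbol introduced here only for illustration. If \({({\sigma }_{i}^{(t)})}^{2}/{s}_{i}^{(t)}=c\) for all i, then the sum in Eq. (27) equals Dc, and hence

$$Dc\leqq {\kappa }^{2}| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}\quad \Longrightarrow \quad {s}_{i}^{(t)}=\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}}{c}\geqq \frac{1}{{\kappa }^{2}}\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}D}{| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}}.$$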

In practice, the true variance \({({\sigma }_{i}^{(t)})}^{2}\) is still too costly to evaluate, and thus, we replace it with the empirical variance \({({S}_{i}^{(t)})}^{2}\), which is accessible. Consequently, we forecast the number of measurement shots so that it should satisfy

$${s}_{i}^{(t+1)}\geqq \frac{1}{{\kappa }^{2}}\frac{{\left({S}_{i}^{(t)}\right)}^{2}D}{| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}},$$
(29)

which we use to estimate the gradient in the next iteration. Since the SGLBO is intended to be applied to highly noisy cases, to avoid the cases where \({s}_{i}^{(t+1)}\) is too small to estimate the gradient appropriately, we here set a lower bound \({G}_{{{{\rm{grad}}}}}^{(t)}\) on the shot size and decide the next shot size according to

$$s_{i}^{(t+1)} =\max\left\{ \left\lceil\frac{1}{\kappa^2} \frac{\left(S_{i}^{(t)}\right)^2D}{||\hat{{\boldsymbol{g}}}^{(t)}||^2}\right\rceil,G_{\mathrm{grad}}^{(t)}\right\},$$
(30)

where \(\lceil \cdot \rceil \) is the ceiling function. The choice of \({G}_{{{{\rm{grad}}}}}^{(t)}\) will be specified in “Example of choice of hyperparameters and implementation” section.
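As a minimal sketch of the update in Eq. (30), the next shot sizes can be computed from the empirical variances and the current gradient estimate. The function name and argument layout are hypothetical, and the lower bound is assumed here to be an integer.

```python
import math


def next_shot_sizes(sample_variances, grad_estimate, kappa, lower_bound):
    """Eq. (30): per-parameter shot sizes for the next iteration.

    sample_variances[i] is the empirical variance (S_i)^2 of the i-th
    partial-derivative estimate; grad_estimate is the current gradient estimate.
    """
    num_params = len(grad_estimate)
    norm_sq = sum(g * g for g in grad_estimate)
    if norm_sq == 0.0:
        raise ValueError("gradient estimate has zero norm")
    return [
        max(math.ceil(var * num_params / (kappa ** 2 * norm_sq)), lower_bound)
        for var in sample_variances
    ]


# Toy numbers: ||g||^2 = 25, kappa^2 = 0.25, D = 2.
shots = next_shot_sizes([1.0, 4.0], [3.0, 4.0], kappa=0.5, lower_bound=1)
```

As the gradient norm shrinks near a minimizer, the ratio grows and the rule requests more shots, which is the intended behavior of the norm test.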

Using the shot size specified by Eq. (30), we also decide the number of measurement shots used for observing values of the cost function in the BO according to

$${s}_{{{{\rm{cost}}}}}^{(t+1)}=\max \left\{\frac{1}{D}\mathop{\sum }\limits_{i=1}^{D}{s}_{i}^{(t)},{G}_{{\rm{cost}}}^{(t)}\right\},$$
(31)

where \({G}_{{{{\rm{cost}}}}}^{(t)} \,>\, 0\) is a constant for avoiding the cases where \({s}_{{{{\rm{cost}}}}}^{(t+1)}\) becomes too small to estimate the optimal step size appropriately. The choice of \({G}_{{{{\rm{cost}}}}}^{(t)}\) will also be specified in “Example of choice of hyperparameters and implementation” section.
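The rule in Eq. (31) can be sketched as below. The function name is hypothetical, and the raw value of the maximum is returned; rounding to an integer number of shots is not specified in the text and is left to the caller.

```python
def cost_shot_size(grad_shot_sizes, g_cost):
    # Eq. (31): the larger of the mean gradient shot size and the bound G_cost.
    mean_shots = sum(grad_shot_sizes) / len(grad_shot_sizes)
    return max(mean_shots, g_cost)


s_cost = cost_shot_size([2, 4, 6], g_cost=3)
```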

Suffix averaging for SGLBO

In VQAs, one could use a point obtained from the final iteration as the result of the optimization. However, in SGLBO, we use BO to estimate the optimal step size in Eq. (18), and due to statistical error in the estimation, we suffer from the influence of the error between the estimate of the optimal step size obtained from the BO and the true optimal step size. Moreover, hardware noise also prevents steady update of the points, especially when we use near-term noisy quantum devices. Such errors or noises may lead to an oscillation of the points in the final part of the iterations around the minimizer. To suppress such oscillation, we take a suffix average of these points in the final part of the iterations, rather than using the single point of the final iteration itself.

Given the sequence of points obtained from T iterations \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)},\ldots ,{\hat{{{{\boldsymbol{\theta }}}}}}^{(T-1)}\), the α-suffix average is defined as the average of the last αT points55

$${\overline{{{{\boldsymbol{\theta }}}}}}_{\alpha ,T}=\frac{1}{\alpha T}\mathop{\sum }\limits_{t=(1-\alpha )T}^{T-1}{\hat{{{{\boldsymbol{\theta }}}}}}^{(t)},$$
(32)

where α ∈ (0, 1] is some constant, and α and T are taken here in such a way that αT should be an integer. During the optimization, we store the sequence of the points \({({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t}\) in memory. At the end of optimization, we calculate the suffix average of these points according to the above formula and output the suffix average as the result of the SGLBO.
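A minimal sketch of the α-suffix average, i.e., the equal-weight average of the last αT stored points, is given below. The function name is hypothetical, and points are represented as plain Python lists.

```python
def suffix_average(points, alpha):
    """Average the last alpha * T points of the stored sequence."""
    total = len(points)
    keep = round(alpha * total)
    if keep < 1 or abs(keep - alpha * total) > 1e-9:
        raise ValueError("alpha * T must be a positive integer")
    tail = points[total - keep:]
    dim = len(tail[0])
    return [sum(p[i] for p in tail) / keep for i in range(dim)]


history = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = suffix_average(history, alpha=0.5)
```

Setting `alpha=1.0` averages all stored points, while a small α keeps only the final part of the trajectory.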

Importantly, to achieve the goal of suppressing the effect of noise at the points in the final part of the iterations, the suffix averaging here uses an equal weight in averaging out the noise in this part. To achieve this suppression with small overhead, the parameter α should be chosen appropriately, in such a way that the last αT points should be kept in a reasonably small fraction among all T points yet still large enough to suppress the noise effectively. We note that, instead of using the equal weight, averaging with a decaying sequence of weights would also work56, which may have a merit in a case where one does not have enough memory to store all points and wants to average the points on the fly. Detailed comparison of suffix-averaging techniques using different sequences of weights in VQAs is left for future work.

The suffix averaging can accelerate the convergence of SGD in some cases; for example, for optimization of a strongly convex function, i.e., a function that is (roughly speaking) more convex than a quadratic function, the error of the point in the Tth iteration decreases at the speed of \(O(\log (T)/T)\) with high probability, but the error of the suffix average of the points in the latter half of the T iterations reduces to O(1/T), achieving the optimal speed55. In the case of VQAs, f may not be strongly convex. However, even in the SGLBO, we can suppress the oscillation around the minimizer in practice by taking the suffix average, which contributes to improving the results of the optimization.

Example of choice of hyperparameters and implementation

We show an example of the choice of hyperparameters in Algorithm 1. These hyperparameters will be used in numerical experiments. In the numerical experiments, we also consider the cases with and without hardware noise, referring to them as the noisy case and the noiseless case, respectively.

For estimating the gradient in the SGLBO, we take the initial shot size as

$${s}_{i}^{(0)}=2 \quad {{{\rm{for}}}}\,{{{\rm{all}}}} \quad i,$$
(33)

and initialize \({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}\) by sampling from the uniform probability distribution. We set the lower bound \({G}_{{{{\rm{grad}}}}}^{(t)}\) on the shot size to the average shot size over the last 10 iterations; i.e., for t + 1 ≥ 10, according to Eq. (30), we take

$$\begin{array}{ll} &G_{\mathrm{grad}}^{(t)}=\frac{1}{10D}\sum\limits_{i^\prime=1}^{D}\sum\limits_{t^\prime=1}^{10}s_{i^\prime}^{(t-10+t^\prime)}, \\ &{\rm{i.e.,}}\, s_{i}^{(t+1)}= \max\left\{\left\lceil\frac{1}{\kappa^2} \frac{\left(S_{i}^{(t)}\right)^2D}{||\hat{{\boldsymbol{g}}}^{(t)}||^2}\right\rceil,\frac{1}{10D}\sum\limits_{i^\prime=1}^{D}\sum\limits_{t^\prime=1}^{10}s_{i^\prime}^{(t-10+t^\prime)}\right\}, \end{array}$$
(34)

and \({G}_{{{{\rm{grad}}}}}^{(t)}=1\) otherwise. We set κ = 0.99 in Eq. (30).
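The rolling lower bound of Eq. (34) can be sketched as follows. The function name and the layout of `shot_history` (one list of per-parameter shot sizes per completed iteration, oldest first) are assumptions made for illustration.

```python
def grad_lower_bound(shot_history, window=10):
    """Eq. (34): mean shot size over the last `window` iterations.

    Returns 1 while fewer than `window` iterations have been recorded.
    """
    if len(shot_history) < window:
        return 1
    recent = shot_history[-window:]
    num_params = len(recent[0])
    return sum(sum(row) for row in recent) / (window * num_params)


# Two early iterations with large shot sizes fall outside the window.
history = [[100, 100]] * 2 + [[2, 4]] * 10
bound = grad_lower_bound(history)
```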

In the BO that is used as a subroutine in the SGLBO, we use the Gaussian kernel in Eq. (14) with \({\tau }^{2}=0.2\) and l = 0.7 as initial values. Before performing the GP regression to estimate values of a cost function, we optimize the hyperparameters, i.e., \({\tau }^{2}\), l, and the variance of Gaussian noise \({\sigma }^{2}\), by maximizing the marginal likelihood of the hyperparameters. To avoid overfitting, we restrict the parameter region of these hyperparameters; in our numerical experiments, we set the parameter region as \(1{0}^{-3}\leqq {\tau }^{2}\leqq 5\), \(1{0}^{-3}\leqq l\leqq 1\), and \(1{0}^{-5}\leqq {\sigma }^{2}\leqq 5\). In addition, we perform this hyperparameter optimization 10 times from uniformly random starting points and take the best parameters to ensure that the hyperparameters do not correspond to a poor local optimum. As the acquisition function used in the BO, we choose Thompson sampling68,69. After performing the BO, we set the estimated optimal step size as the minimum point of the predictive mean of a GP posterior conditioned on N observed data points.
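The final step, choosing the step size as the minimizer of the GP predictive mean, can be illustrated with a small self-contained sketch. It is not the GPy-based implementation used in this work: the kernel form \({\tau }^{2}\exp (-{(x-{x}^{{\prime} })}^{2}/2{l}^{2})\), the subtraction of the mean cost, the fixed hyperparameters, and the grid search are all assumptions, and both the hyperparameter optimization and Thompson sampling are omitted.

```python
import math


def gaussian_kernel(x1, x2, tau2, length):
    # Assumed kernel form: tau2 * exp(-(x1 - x2)^2 / (2 * length^2)).
    return tau2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length ** 2))


def solve_linear(a, b):
    # Gaussian elimination with partial pivoting for the small N x N system.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def estimate_step_size(etas, costs, eta_max, tau2=0.2, length=0.7,
                       noise_var=1e-4, grid=401):
    """Return the grid point in [-eta_max, eta_max] minimizing the GP mean."""
    mean_cost = sum(costs) / len(costs)
    centered = [c - mean_cost for c in costs]
    n = len(etas)
    gram = [[gaussian_kernel(etas[i], etas[j], tau2, length)
             + (noise_var if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    weights = solve_linear(gram, centered)

    def posterior_mean(eta):
        return mean_cost + sum(w * gaussian_kernel(eta, e, tau2, length)
                               for w, e in zip(weights, etas))

    candidates = [-eta_max + 2.0 * eta_max * i / (grid - 1) for i in range(grid)]
    return min(candidates, key=posterior_mean)


# Toy data: a noiseless quadratic with its minimum at eta = 0.3.
etas = [-1.0 + 0.25 * i for i in range(9)]
costs = [(e - 0.3) ** 2 for e in etas]
eta_hat = estimate_step_size(etas, costs, eta_max=1.0)
```

In this toy example the estimated step size should land close to the true minimizer 0.3; with noisy observations, the noise variance would need to be fitted rather than fixed.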

For the BO, we set Ninit = 5 and Neval = 5. The Ninit points of the initial evaluation are randomly chosen according to the uniform probability distribution over the 1D subspace \({{{{\mathcal{L}}}}}^{(t)}\) in Eq. (19) with

$${\eta }^{(t)}\in [-{\eta }_{\max },{\eta }_{\max }],\,{\eta }_{\max }=\min \left\{\frac{\beta }{| | H| | },\pi \right\},$$
(35)

where \(\Vert H\Vert \) is the operator norm of H, and β > 0 is a constant that we set depending on the problem later in “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section. Note that one of the initial evaluation points must be taken as η(t) = 0, i.e., the current point \({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}\), for the stability of the BO. The number of measurement shots used for evaluating each point in the BO is given by Eq. (31) with

$$\begin{array}{l}{G}_{\rm{cost}}^{(t)}=\frac{\Vert H\Vert^{2}}{{\epsilon }^{2}}\,{{{\rm{for}}\,{\rm{all}}}}\,t,\\ {\rm{i}}.{\rm{e}}.,\,{s}_{{\rm{cost}}}^{(t)}={\rm{max}} \left\{\frac{1}{D}\mathop{\sum }\limits_{i=1}^{D} {s}_{i}^{(t)}, \frac{\Vert H\Vert^{2}}{{\epsilon }^{2}}\right\},\end{array}$$
(36)

where ϵ = 0.1. Given the outcomes of these measurements, we perform GP regression using GPy81.
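The two hyperparameter formulas in Eqs. (35) and (36) can be evaluated as in the following sketch. The function names are hypothetical, and the operator norm is assumed to be supplied by the caller (the numerical values used here are placeholders, not values from this work).

```python
import math


def line_search_bound(beta, h_norm):
    # Eq. (35): eta_max = min(beta / ||H||, pi).
    return min(beta / h_norm, math.pi)


def cost_shot_lower_bound(h_norm, epsilon=0.1):
    # Eq. (36): G_cost = ||H||^2 / epsilon^2.
    return h_norm ** 2 / epsilon ** 2


eta_max = line_search_bound(beta=3.0, h_norm=10.0)
g_cost = cost_shot_lower_bound(h_norm=2.0)
```

The cap at π reflects that each circuit parameter is an angle, so moving by more than π along a single coordinate is not needed.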

For the suffix averaging, we set α = 0.1 in Eq. (32).

Numerical experiments

In the following, we numerically demonstrate the advantages of the SGLBO in comparison with state-of-the-art optimizers for VQAs. The optimizers to be compared with the SGLBO are summarized in “Optimizers for VQAs and their implementations” section. In particular, we investigate two situations: (1) when the size of a system scales up in “Advantage of SGLBO for various system sizes” section, and (2) when hardware noise and connectivity between qubits on hardware are taken into account in “Robustness against hardware noise in SGLBO” section. To this end, we simulate the performance of the optimizers in tasks of variational quantum eigensolver (VQE)5 for (1) and variational quantum compilation (VQC)58 for (2). Furthermore, we demonstrate in “Merits of noise-reducing techniques for general optimizers” section that the techniques of suffix averaging and adaptive shot strategy used in the SGLBO can also improve performance and noise robustness of a general class of optimizers, not only the SGLBO.

Optimizers for VQAs and their implementations

To compare the SGLBO with other existing optimizers, we consider the following three state-of-the-art optimizers: adaptive moment estimation (Adam)57, individual coupled adaptive number of shots (iCANS)34, and Nakanishi-Fujii-Todo method (NFT)23. Adam is a variant of SGD; although a number of different strategies for choosing step size in SGD have been proposed, Adam chooses the step size adaptively based on the accumulated information of estimates of the gradient used in previous iterations. The choice of step size in Adam is known to work well for many applications in the field of machine learning, but for VQAs, the required number of measurement shots for the optimization with Adam has still been prohibitively large34. We use Adam as a representative choice of a straightforward application of SGD to VQAs. The iCANS is also a variant of stochastic gradient optimizers in which the number of measurement shots at each iteration is chosen frugally based on the first and second moments of the gradient to improve performance in VQAs. While both of these optimizers are gradient-based optimizers, NFT is a sequential optimization method along an axis of the parameters using function fitting rather than the gradient.

For iCANS, we in particular use iCANS134, and for Adam, we used the same values of the hyperparameters as ref. 34. In terms of the initial number of measurement shots used in iCANS, which is not mentioned in ref. 34, we set \({s}_{i}^{(0)}=2\) for all i in our numerical experiments. Here we note that for iCANS1, the step size ηt is changed depending on the tasks of VQAs as specified in “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section, following ref. 34. In addition, we used \({s}_{i}^{(t)}=1000\) shots for each evaluation of the cost function in Eq. (8) in Adam and \({s}_{{{{\rm{cost}}}}}^{(t)}=1000\) shots for each evaluation of the cost function to fit the function in NFT. Note that the values of the hyperparameters for which the optimizer works well are selected manually or by referring to the values of previous studies, and we did not perform an exhaustive hyperparameter search since such a search is computationally too costly to perform. After all, it may be infeasible to run such a hyperparameter search when we apply these optimizers to practical problems.

In these numerical experiments, we simulate quantum circuits by using Pennylane82. In “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section, the values of the cost function appearing in the figures are evaluated at the point of the final iterate in \({({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t}\) (and the suffix averaged point in the SGLBO) by a noiseless simulator, where both the statistical noise and the hardware noise are ignored; in “Merits of noise-reducing techniques for general optimizers” section, these values are evaluated at the suffix averaged point by the noiseless simulator. For each optimizer, we repeated the overall optimization procedures fifteen times from uniformly random initial points, where each run from an initial point is repeated twice, and took the average over all the thirty runs. In the figures, we display the logarithm of the average as a thick line and each run as a thin line, using log-linear plots.

Advantage of SGLBO for various system sizes

In this section, we investigate the performance of SGLBO as we scale up the system size. We evaluate the performance of the optimizers in terms of the total number of measurement shots used during the optimization. In each iteration, we calculate the difference per site between the cost-function value at the current point of each optimizer and the minimum value of the cost function. In particular, we here consider a VQE task5 for a 1D transverse field Ising model under open boundary conditions. The VQE is an algorithm to calculate the ground state energy of a given Hamiltonian, where the cost function is defined as the expectation-value of the Hamiltonian. The Hamiltonian here is given by

$$H=-J\left(\mathop{\sum }\limits_{j=1}^{n-1}{Z}_{j}{Z}_{j+1}+g\mathop{\sum }\limits_{j=1}^{n}{X}_{j}\right)$$
(37)

where Zj and Xj are the Pauli Z and X matrices, respectively, at the jth site on a 1D chain of qubits, J represents the energy scale, and g is the relative strength of the external field compared to the nearest-neighbor couplings83. We choose J = 1.0 and g = 1.5. We use the ansatz circuit in Fig. 2 with r = 4 repetitions for n = 4, 8, 12 qubits. These sizes of the circuits are chosen based on the feasibility of classical simulation. We remark that we do not change the depth of the ansatz circuits in this setting and change only the system size, so that the gradient does not vanish exponentially for the large system size84; that is, it is expected that the problem of the barren plateau, which potentially makes the optimization infeasible84,85,86, is avoided in our setting. In this problem, for the SGLBO, we restrict the region for the line search \({{{{\mathcal{L}}}}}^{(t)}\) by β = 3, and for the iCANS, we set the step size ηt = 1/∥H∥, following ref. 34.

The result of the numerical simulation is shown in Fig. 3. Significantly, we find that the SGLBO outperforms the other optimizers23,34,57 in all the cases of n = 4, 8, 12 qubits, in terms of both the speed of convergence and the accuracy of estimating the minimum of the cost function. Thus, these advantages of the SGLBO are obtained not only for the relatively small system size n = 4 but also for the larger system sizes n = 8, 12. While NFT and Adam hit the limit of accuracy of the minimization in the early stage of the optimization, SGLBO and iCANS continue to improve the cost function even at the end of the optimization, which shows the advantage of deciding the number of measurement shots adaptively for each iteration in these algorithms. Moreover, owing to the use of the BO for estimating the optimal step size in each iteration, the SGLBO enjoys faster convergence with fewer measurement shots overall. The additional cost of measurement shots in the BO in Eq. (22) turns out to be negligible even on the small scale n = 4, as well as on the larger scales discussed in “Description of algorithm” section. Consequently, for the VQE tasks in Fig. 3, the SGLBO optimizes parameterized quantum circuits at a significantly faster convergence speed in terms of the number of measurement shots, and with better accuracy in minimizing the cost function, than the other state-of-the-art optimizers.

Fig. 3: Comparison of optimizers in terms of the performance on the VQE tasks.

We optimize the cost function of the VQE for the 1D transverse field Ising model with n = 4, 8, 12 qubits in the noiseless case. In all plots, the x-axis represents the total number of measurement shots used during the optimization, and the y-axis represents the difference ΔE per site between the true value of the cost function (i.e., not evaluated with finite measurement shots) at each iteration and the minimum value of the cost function under the ansatz described in Fig. 2 with r = 4. For each optimizer, the thin lines represent each run repeated twice from fifteen different initial points, and the thick line represents the average of these thirty runs. Significantly, the SGLBO outperforms the other state-of-the-art optimizers in terms of both the convergence speed and the achievable accuracy for a broad range n = 4, 8, 12 of the number of qubits.

Robustness against hardware noise in SGLBO

Next, we investigate the noise robustness of SGLBO. We consider VQC58 with a fixed input state. The task of VQC is to find parameters of a parameterized circuit so that the unitary implemented by the circuit should act as equivalently as possible to a given target unitary when acting on a given input state. Following ref. 58, we define the cost function as

$$f({{{\boldsymbol{\theta }}}})=1-\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}{G}_{0}^{(j)},$$
(38)

where

$$\begin{array}{l}{G}_{0}^{(j)} ={{{\rm{Tr}}}}[(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\bar{j}}){U}^{{\dagger} }({{{\boldsymbol{\theta }}}})U({{{{\boldsymbol{\theta }}}}}^{* }){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}^{* })U({{{\boldsymbol{\theta }}}})].\end{array}$$
(39)

Here \({{\mathbb{1}}}_{\bar{j}}\) is an identity operator acting on all qubits except the jth qubit, \({G}_{0}^{(j)}\) is the probability of getting the outcome 0 on the jth qubit, θ is a vector of circuit parameters to be optimized, and θ* is a target vector of circuit parameters that are chosen here as \({{{{\boldsymbol{\theta }}}}}^{* }={(0,\ldots ,0)}^{\top }\in {{\mathbb{R}}}^{D}\). The target unitary is U(θ*), and the input state is \({(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}\). The ansatz circuit U(θ) used here is the one in Fig. 2 with n = 4 and r = 6. In this case, the ansatz circuit can reach the optimal point at θ = θ* to output \({(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}\), where the value of the cost function is exactly zero at the optimal point, and the y-axis shows the difference between the true optimal value (i.e., zero) and the value at the estimated optimal point. We note that this cost function is defined by local observables, so the gradient does not vanish in the shallow ansatz circuit used in this VQC task58,85. In VQC, we demonstrate the performance of the optimizers in both noiseless and noisy cases. To simulate noise in the noisy case, we used information about the gate-operation and readout errors and the connectivity of IBM’s Bogota processor87,88. A detailed explanation of the parameters of the noise model is given in the Supplementary Information. We set β = 6 to limit the region \({{{{\mathcal{L}}}}}^{(t)}\) for SGLBO and choose the step size ηt = 0.1 for iCANS, following ref. 34.
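Since \({G}_{0}^{(j)}\) is the probability of observing 0 on the jth qubit, the cost in Eq. (38) can be estimated from a finite number of measurement shots as sketched below. The function name and the representation of outcomes as strings of '0' and '1' (character j for qubit j) are assumptions made for illustration.

```python
def local_cost_from_samples(bitstrings):
    """Estimate f = 1 - (1/n) * sum_j G_0^(j) from measured bitstrings."""
    shots = len(bitstrings)
    n = len(bitstrings[0])
    zero_fractions = [
        sum(1 for b in bitstrings if b[j] == "0") / shots for j in range(n)
    ]
    return 1.0 - sum(zero_fractions) / n


# Four shots on two qubits: qubit 0 reads 0 in 3/4 of the shots, qubit 1 in 2/4.
estimate = local_cost_from_samples(["00", "01", "00", "11"])
```

If every shot returns the all-zero string, the estimate is exactly zero, which matches the optimal value of the cost function at θ = θ*.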

The result of the numerical simulation is presented in Fig. 4. In the noiseless case, the SGLBO works better than the other state-of-the-art optimizers, which is consistent with the result of the VQE in Fig. 3. More remarkably, even in the presence of the moderate amount of hardware noise described above, the SGLBO achieves almost the same accuracy in minimizing the cost function as in the noiseless case, while the other optimizers converge to worse cost-function values. This result indicates a remarkable noise resilience of the SGLBO, owing to the use of the BO together with the technique of suffix averaging. In the SGLBO, the estimates of the minimizer of the cost function may be affected by hardware noise, and even if we use the BO, which is relatively robust against noise, these estimates may oscillate around the minimizer. However, the suffix averaging of these estimates makes it possible to obtain a point that is even nearer to the minimizer. In addition, the cost function in VQC has the preferable property that its minimizer is not susceptible to shifting caused by hardware noise89, and this property also contributes to the noise resilience in this case; that is, in other VQA tasks without this property, the same accuracy as in the noiseless case would be hard to achieve in the noisy case. This result shows that the SGLBO can be more tolerant to hardware noise than the other state-of-the-art optimizers, which is crucial for the feasibility of performing VQAs on NISQ devices.
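The effect of suffix averaging on oscillating estimates can be sketched in a few lines: instead of returning the last iterate, one returns the average of the final portion of the iterates. In the Python sketch below, the fraction alpha of iterates that is kept and the function name are assumptions made for illustration; the value used in the experiments is not restated in this section.

```python
import math


def suffix_average(iterates, alpha=0.5):
    """Average the last ceil(alpha * T) of T parameter vectors.

    iterates: list of equal-length lists of floats (theta_1, ..., theta_T).
    alpha:    fraction of the final iterates to average (assumed value).
    """
    if not iterates:
        raise ValueError("iterates must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    k = max(1, math.ceil(alpha * len(iterates)))
    tail = iterates[-k:]
    dim = len(tail[0])
    return [sum(theta[d] for theta in tail) / k for d in range(dim)]


# Iterates that approach the minimizer (0, 0) and then oscillate around it.
history = [[1.0, 1.0], [0.5, -0.5], [0.1, -0.1], [-0.1, 0.1]]
print(suffix_average(history, alpha=0.5))
```

In this toy example the last two iterates lie on opposite sides of the minimizer, so their average is closer to it than either iterate alone.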

Fig. 4: Comparison of optimizers in terms of the performance on VQC tasks.

We optimize the cost function of the VQC task (a) without hardware noise and (b) with hardware noise for the ansatz circuit in Fig. 2 with n = 4 qubits and r = 6 repetitions. In both plots, the x-axis represents the total number of measurement shots used during the optimization, and the y-axis represents the cost-function value. For each optimizer, the thin lines represent individual runs, two from each of fifteen different initial points, and the thick line represents the average of these thirty runs. Remarkably, even under the moderate amount of noise explained in the main text, the SGLBO can achieve almost the same accuracy as in the noiseless case, whereas the achievable accuracy of the other state-of-the-art optimizers becomes worse in the noisy case.

Merits of noise-reducing techniques for general optimizers

We also show here that the techniques of suffix averaging and the adaptive shot strategy that we use in the SGLBO are advantageous for improving the performance and noise robustness of the other state-of-the-art optimizers as well, not only of the SGLBO.

In particular, we consider the same VQC task as in the “Robustness against hardware noise in SGLBO” section, and we first apply the suffix averaging technique to all the optimizers, i.e., iCANS, Adam, and NFT as well as SGLBO. The result of the numerical simulation is shown in Fig. 5. In both the noiseless and noisy cases, the technique of suffix averaging can significantly improve the accuracy of the state-of-the-art optimizers, especially NFT and Adam, compared to the cases without suffix averaging in Fig. 4. For iCANS, suffix averaging may not be as effective as for NFT and Adam, but iCANS with suffix averaging still achieves an accuracy comparable to that without suffix averaging. This result shows that the technique of suffix averaging that we apply in the SGLBO can indeed be useful as a general technique for improving a wide class of optimizers, not only the SGLBO itself. At the same time, our numerical simulation shows that even if we improve the other optimizers by suffix averaging, the SGLBO still outperforms these optimizers.

Fig. 5: Comparison of optimizers with the suffix averaging technique (SA), in the performance on the same VQC tasks as Fig. 4.

The suffix averaging technique is not applied to iCANS, NFT, and Adam in Fig. 4 but is applied to all the optimizers in this figure. The x- and y-axes are the same as in Fig. 4. For each optimizer, the thin lines represent individual runs, two from each of fifteen different initial points, and the thick line represents the average of these thirty runs. In both the noiseless (a) and noisy (b) cases, the technique of suffix averaging can significantly improve the accuracy of state-of-the-art optimizers, especially NFT and Adam, while the SGLBO still outperforms the others. This shows that the suffix averaging technique developed here is not only a technique for improving the SGLBO in particular but can be broadly applicable in designing efficient optimizers for VQAs.

Next, we apply the adaptive shot strategy to Adam. Note that our adaptive shot strategy cannot be applied directly to NFT since NFT does not use gradients; also, iCANS uses its own variant of adaptive shot strategies, and hence our technique based on the norm test cannot be combined with iCANS either without changing its own strategy. Following the setting of SGLBO with Eq. (33), we set \({s}_{i}^{(0)}=2\) for all i when we combine the adaptive shot strategy with Adam in these experiments. The results of the numerical experiments are shown in Fig. 6. In both the noiseless and noisy cases, the adaptive shot strategy improves the performance of the original Adam. This indicates that the adaptive shot strategy based on the norm test is effectively applicable to gradient-based optimizers and can improve their performance. In Fig. 6, we also demonstrate the combination of suffix averaging and the adaptive shot strategy with Adam. In the noiseless case, Adam with the adaptive shot strategy has not yet hit the floor in the minimization and is still improving its accuracy, so taking the suffix average worsens the accuracy, in contrast to the case where averaging cancels out the noise around the optimal point. On the other hand, in the noisy case, the accuracy is improved. This result further confirms the effectiveness of the suffix averaging technique against hardware noise. The SGLBO still outperforms the other optimizers combined with these techniques.
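The idea behind a shot allocation based on a norm test can be sketched as follows. The sketch uses one common form of the norm test from the adaptive-sampling literature: the variance of the averaged gradient estimate should not exceed κ² times the squared norm of the estimate, and otherwise the number of shots is increased until the condition would hold. This is a simplified illustration with a single shot count shared by all gradient components; it is not claimed to reproduce Eq. (33), and the names kappa, max_shots, and next_shot_count are hypothetical.

```python
import math


def next_shot_count(grad_est, single_shot_var, shots, kappa=1.0, max_shots=10_000):
    """Return the shot count suggested by a simple norm test (assumed form).

    grad_est:        current estimate of each gradient component.
    single_shot_var: sample variance of a single-shot estimate, per component.
    shots:           number of shots used for the current estimate.
    kappa:           tolerance of the test (hypothetical hyperparameter).
    max_shots:       cap on the returned shot count (hypothetical).
    """
    if shots < 1 or kappa <= 0.0:
        raise ValueError("shots must be >= 1 and kappa must be > 0")
    grad_norm_sq = sum(g * g for g in grad_est)
    total_var = sum(single_shot_var)
    # Norm test: variance of the averaged estimate vs. squared gradient norm.
    if total_var / shots <= kappa ** 2 * grad_norm_sq:
        return shots  # the estimate is accurate enough; keep the budget
    if grad_norm_sq == 0.0:
        return max_shots  # no signal at all; fall back to the cap
    needed = math.ceil(total_var / (kappa ** 2 * grad_norm_sq))
    return min(max(needed, shots), max_shots)


# Starting from 2 shots, a noisy estimate fails the test and the budget grows.
print(next_shot_count([0.5, 0.0], [1.0, 1.0], shots=2))
```

A rule of this kind spends few shots while the gradient is large compared with its statistical error and increases the budget only when the estimate becomes unreliable, for example near a minimum.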

Fig. 6: Comparison of Adam with the suffix averaging technique (SA) and/or the adaptive shot strategy (ASS), in terms of the performance on the same VQC task as Fig. 4.

The x- and y-axes are the same as in Fig. 4. For each optimizer, the thin lines represent individual runs, two from each of fifteen different initial points, and the thick line represents the average of these thirty runs. In both the noiseless (a) and noisy (b) cases, the adaptive shot strategy improves the performance of the original Adam, but the SGLBO outperforms the others. This shows that the adaptive shot strategy is also useful in improving the accuracy of Adam, rather than being a technique specific to the SGLBO.

In this way, the techniques that we develop for the SGLBO are also applicable broadly beyond the SGLBO itself, establishing a foundation for designing further efficient optimizers for VQAs in future research. At the same time, these results show that SGLBO is an effective combination of all the techniques, i.e., SGD, BO, the suffix averaging, and the adaptive shot strategy, to outperform the state-of-the-art optimizers.

Discussion

In this work, we have developed an efficient framework, stochastic gradient line Bayesian optimization (SGLBO), for optimizing parameterized quantum circuits in variational quantum algorithms (VQAs). The core idea of the SGLBO is to estimate the direction of the gradient based on stochastic gradient descent (SGD), and to use Bayesian optimization (BO) for estimating the optimal step size in this direction. The BO used for estimating the optimal step size in the SGLBO contributes to minimizing the cost function faster and more accurately, owing to the robustness of the BO against noise. To achieve the optimization with fewer measurement shots, we also formulated an adaptive measurement-shot strategy based on the norm test to estimate the direction of the gradient efficiently. In addition, to suppress the effect of statistical error and hardware noise, we introduced the suffix averaging technique. The SGLBO with these techniques can reduce the number of measurement shots needed for optimizing the parameterized circuits, and also improve the accuracy in minimizing the cost function in the VQAs.

To compare the performance of the SGLBO with other state-of-the-art optimizers, we numerically investigated two situations: (1) when the system size increases and (2) when hardware noise is present. For various system sizes, we find that the SGLBO significantly reduces the number of measurement shots required for achieving a desired accuracy in minimizing cost functions, and reaches an even better accuracy in minimizing the cost functions than other state-of-the-art optimizers, as shown in Fig. 3. Furthermore, we have shown that, even in the presence of a moderate amount of hardware noise, the SGLBO can achieve almost the same accuracy as in the noiseless case, whereas the accuracy of the other state-of-the-art optimizers becomes worse, in the task shown in Fig. 4. To suppress the noise, the suffix averaging technique as well as the use of the BO is crucial, and it turns out that the suffix averaging and the adaptive shot strategy developed for the SGLBO can also improve the accuracy and the noise robustness of other existing optimizers, as demonstrated in Figs. 5 and 6.

Consequently, by integrating two different optimization approaches, SGD and BO, our results on the SGLBO open an alternative way to drastically reduce the cost of measurement shots in the optimization of parameterized quantum circuits, and also to make VQAs more feasible under unavoidable hardware noise in near-term quantum devices. The techniques introduced here are versatile for problems with various system sizes, effective even in the presence of noise, and widely applicable to a variety of algorithms for optimizing parameterized quantum circuits in the setting of VQAs, as demonstrated above. At the same time, the approach developed for the SGLBO provides a fundamental insight into how VQAs can use classical information extracted from quantum states, progressing beyond estimating expectation values. Moreover, the idea of the SGLBO provides a general framework for optimizing noisy functions in the field of machine learning (ML), not specific to VQAs. Thus, our results are expected to be of interest not only to users of noisy intermediate-scale quantum (NISQ) devices but also to much broader communities of quantum information, such as those working on ML-assisted calibration of quantum devices in experiments, quantum tomography using an ansatz, and quantum metrology.

These results point toward various directions of future research. One possible direction is to investigate the difference in performance when the 1D subspace for the BO, currently taken along the gradient descent direction (Eq. (19)), is chosen along another direction, such as those of natural gradient descent28,30,90,91, negative curvature descent92, and conjugate gradient93. The development of a more efficient method for determining appropriate hyperparameter values in the SGLBO is also important for improving the accuracy. In our work, we have empirically found that the SGLBO with suffix averaging performs well in practice even if hardware noise is considered, but further research is needed to clarify which classes of hardware noise the suffix averaging can tolerate and how many iterations are needed to achieve performance comparable to the noiseless case. It would also be interesting to provide a theoretical guarantee on the performance of the SGLBO under appropriate assumptions, especially in the setting of non-convex optimization; after all, both empirical and theoretical studies are crucial for harnessing the potential for near-term applications of VQAs. Finally, since the SGLBO provides a way to avoid the cost of precise estimation of expectation values in optimizing parameterized circuits for VQAs, it is even more advantageous to pursue applications of VQAs that do not require estimating the expectation values at any point in the entire algorithm, i.e., even after the optimization; for example, state-of-the-art quantum algorithms for quantum machine learning avoid the expectation-value estimation by solving sampling problems so that the speedup is not canceled out94,95,96, and further research is needed to clarify how we can similarly avoid the expectation-value estimation in quantum machine learning with VQAs.