Stochastic gradient line Bayesian optimization for efficient noise-robust optimization of parameterized quantum circuits

Tamiya, Shiro; Yamasaki, Hayata

doi:10.1038/s41534-022-00592-6

Published: 27 July 2022

npj Quantum Information volume 8, Article number: 90 (2022)

Abstract

Optimizing parameterized quantum circuits is a key routine in using near-term quantum devices. However, the existing algorithms for such optimization require an excessive number of quantum-measurement shots for estimating expectation values of observables and repeating many iterations, whose cost has been a critical obstacle for practical use. We develop an efficient alternative optimization algorithm, stochastic gradient line Bayesian optimization (SGLBO), to address this problem. SGLBO reduces the measurement-shot cost by estimating an appropriate direction of updating circuit parameters based on stochastic gradient descent (SGD) and further utilizing Bayesian optimization (BO) to estimate the optimal step size for each iteration in SGD. In addition, we formulate an adaptive measurement-shot strategy and introduce a technique of suffix averaging to reduce the effect of statistical and hardware noise. Our numerical simulation demonstrates that the SGLBO augmented with these techniques can drastically reduce the measurement-shot cost, improve the accuracy, and make the optimization noise-robust.

Introduction

Advances in quantum hardware technology have led to intensive research on finding practical applications of noisy intermediate-scale quantum (NISQ) devices¹. Variational quantum algorithms (VQAs)^2,3,4 are a promising class of quantum algorithms that are implementable on NISQ devices. VQAs can be used for a variety of computational tasks including quantum chemistry calculations^5,6,7,8, combinatorial optimization^9,10,11, and training of machine-learning models^12,13,14,15. These tasks are achieved by minimizing task-specific cost functions usually defined as a sum of expectation values of observables. The cost function is minimized by updating the parameters of a parameterized quantum circuit using a classical optimizer in a feedback loop. In particular, VQAs employ a quantum device to prepare the quantum states output by the parameterized quantum circuit. We perform a shot of quantum measurement on each output state to extract classical information, which is used for estimating the expectation values in the cost function. The measurement outcomes are fed to the classical optimizer, with which we iteratively improve the circuit parameters so as to minimize the cost function.

Problematically, however, if we try to estimate the expectation values with high precision in VQAs, we usually need an excessive number of measurement shots before the cost function is minimized^16,17. In practice, a user of a quantum computer often needs to access a distant server of a quantum computer to query measurement shots, while the classical optimizer can be run locally by the user at a negligible cost compared to the cost of using the quantum computer in terms of time and money¹⁸; in this setting, the number of measurement shots dominates the cost of VQAs, which we aim to minimize here. In previous research, problems of reducing computational resources in VQAs have often been tackled by estimating expectation values efficiently^19,20,21,22 and reducing the number of iterations until convergence^{23,24,25,26,27,28,29,30}. By contrast, to overcome a dominant obstacle in the above setting of VQAs, we here study the problem of reducing the overall cost of measurement shots in the optimization, that is, how we can optimize the circuit parameters with as few measurement shots in total as possible. A difficulty of this problem stems from the nature of quantum mechanics: it is costly to extract expectation values as classical information from quantum states, yet the optimization would be hard without the assistance of classical information obtained from measurements on the quantum states. We stress that the problem here is not the estimation of the expectation values themselves; rather, a fundamental question that we ask is how efficiently we can use classical information of the measurement outcomes to optimize the circuit parameters without extracting the expectation values with high precision.

In this work, we address this problem by establishing a framework for the classical optimizer that combines two different optimization approaches, namely, stochastic gradient descent (SGD) and Bayesian optimization (BO). SGD is a standard algorithm in machine learning for training models, using an estimator of the gradient at each optimization step rather than the exact value of the gradient^31,32. Among a variety of existing optimizers proposed for VQAs^{23,24,25,26,27,28,29,30,33,34,35,36}, gradient-based optimizers have been studied intensively, motivated by the fact that the use of gradient information improves convergence³⁷. Recently, SGD for VQAs has been investigated as a class of gradient-based optimizers³³. The SGD for VQAs often uses a fixed small number of measurement shots to estimate the gradient, which may successfully avoid measuring expectation values with high precision. However, SGD has major shortcomings that may make the algorithm inefficient. First, in exchange for the low cost of each iteration, SGD may need a larger number of iterations until convergence than optimization algorithms using the exact gradient; second, SGD requires careful control of the step size of updating the parameters in each iteration, which may crucially affect the efficiency of the algorithm, but an appropriate choice of the step size is often difficult. On the other hand, BO is another common algorithm for optimization of a black-box function without necessarily using its gradient, which is especially suitable for optimizing imprecise and expensive-to-evaluate functions^38,39. BO has many successful applications such as computer vision, robotics, and experimental designs^40,41,42,43. Owing to its robustness against noise in the imprecise evaluation of the functions^38,39, BO may also be useful for the optimization in VQAs^44,45. However, it is known that BO becomes intractable in high-dimensional settings (typically ≥ 10 dimensions)⁴⁶, and the number of parameters to be optimized in VQAs is usually too large to apply BO directly.

To retain advantages of SGD and BO in VQAs while compensating for their shortcomings, we here construct the alternative framework for the optimization of parameterized circuits, stochastic gradient line Bayesian optimization (SGLBO), as illustrated in Fig. 1. The key idea of SGLBO is that we estimate an appropriate direction of updating the circuit parameters based on SGD, and also utilize BO to estimate the optimal step size in a 1D direction of the estimated gradient in each iteration. This idea aims at simultaneously resolving the problems of the step size in the SGD and of the infeasibility of high-dimensional optimization with the BO. To enhance the performance further, we combine the SGLBO with two noise-reducing techniques: adaptive shot strategy and suffix averaging. The adaptive shot strategy is a technique for dynamically determining the number of measurement shots to be used for the estimation of the gradient^{34,47,48,49,50,51,52,53}. We here develop an adaptive shot strategy suitable for SGLBO, based on a technique of the norm test^48,49,51. The norm test combined with SGD is known to provide faster convergence^49,51, and in the case of SGLBO, the norm test reduces not only the number of iterations but also the overall number of measurement shots. On the other hand, suffix averaging is a technique for achieving noise reduction. Instead of directly using the point of the final iteration in the optimization as an estimate of the minimizer of the cost function, the suffix averaging technique uses the average over a latter part of the sequence of points obtained from the iterations^54,55,56. We utilize this technique to reduce the statistical noise in estimating the gradient and the optimal step size in SGLBO, and also reduce the effect of the hardware noise of the quantum device.

**Fig. 1: An illustration of two iterations in SGLBO for minimizing a 2D cost function.**

To show the significance of the SGLBO, we numerically demonstrate that the SGLBO can find an estimate of the minimizer of the cost function with a significantly smaller number of overall measurement shots than other state-of-the-art optimizers^23,34,57, in representative tasks for the VQAs, i.e., variational quantum eigensolver⁵ and variational quantum compiling⁵⁸. Thus, the reduction of the number of iterations achieved by finding the optimal step size by BO indeed contributes to the overall reduction of the number of measurement shots. We also find that the SGLBO outperforms the state-of-the-art optimizers not only in terms of the number of measurement shots but also in the accuracy of estimating the minimum of the cost functions used in the simulation. Remarkably, we find that even under a moderate amount of hardware noise, the SGLBO can estimate the minimum in a task with almost the same accuracy as in noiseless cases, whereas the other state-of-the-art optimizers cannot in the same task. These results indicate that the SGLBO is a promising approach to reduce the number of measurement shots in the VQAs, and also to make the VQAs more feasible under unavoidable hardware noise in near-term quantum devices. Note that the combination of SGD and BO has been previously studied only in a specific machine-learning setting⁵⁹, but its applicability and advantage for other tasks such as VQAs have been unknown; by contrast, our crucial contribution is to formulate SGLBO as an efficient and noise-robust framework for the task of optimizing parameterized quantum circuits and further develop the techniques of adaptive shot strategy and suffix averaging to demonstrate its advantage in this optimization task.

Consequently, the SGLBO establishes an alternative approach for efficient quantum-circuit optimizers, progressing beyond the existing state-of-the-art optimizers^23,34,57; in particular, the novelty of SGLBO is to integrate two different optimization approaches, SGD and BO, to eliminate their shortcomings and take their advantages. Augmented with the further techniques of adaptive shot strategy and suffix averaging, the SGLBO is shown to have a significant advantage in the reduction of the cost of the number of measurement shots and also in the robustness against hardware noise, compared to the state-of-the-art optimizers for VQAs. These results open a way to practical algorithm designs for more efficient quantum-circuit optimization in terms of the overall cost of measurement shots, by avoiding both the precise estimation of expectation values and the many iterations of updating circuit parameters; at the same time, the approach developed for the SGLBO provides a fundamental insight into how VQAs can use classical information extracted from quantum states beyond estimating expectation values.

In the rest of this section, we describe the problem setting of optimization tasks in VQAs and review SGD and BO.

VQAs^2,3,4 are a class of algorithms that use a parameterized quantum circuit U(θ) to minimize a task-specific cost function f(θ). The vector ${{{\boldsymbol{\theta }}}}={[{\theta }_{1},\cdots ,{\theta }_{D}]}^{\top }\in {{\mathbb{R}}}^{D}$ of D arguments of f is used as the circuit parameters of U(θ). The cost function f(θ) in VQAs is conventionally defined as an expectation-value of an observable O on n qubits, with respect to a quantum state output by the parameterized circuit, i.e.,

$$f({{{\boldsymbol{\theta }}}})={{{\rm{Tr}}}}[OU({{{\boldsymbol{\theta }}}}){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{\boldsymbol{\theta }}}})],$$

(1)

where $\left|0\right\rangle $ is a standard-basis state used for initialization of each qubit, $U({{{\boldsymbol{\theta }}}}){\left|0\right\rangle }^{\otimes n}$ is the output state of the n-qubit parameterized circuit, and U^†(θ) is the Hermitian conjugate of U(θ). The observable O is expanded as a sum of n-qubit tensor products of Pauli operators

$$O=\mathop{\sum}\limits_{k}{c}_{k}{P}_{k},$$

(2)

where c_k for each k is a real coefficient of the kth term, and P_k is a tensor product of n single-qubit Pauli operators ${P}_{k}{ = \bigotimes }_{l = 1}^{n}{P}_{k,l}$ with P_{k,l} ∈ {I, X, Y, Z} being a Pauli (or identity) operator acting on the lth qubit. Here, the identity operator is denoted by $I:=\left|0\right\rangle \left\langle 0\right|+\left|1\right\rangle \left\langle 1\right|$, and Pauli operators acting on a single qubit are $X:=\left|0\right\rangle \left\langle 1\right|+\left|1\right\rangle \left\langle 0\right|$, $Y:=-{{{\rm{i}}}}\left|0\right\rangle \left\langle 1\right|+{{{\rm{i}}}}\left|1\right\rangle \left\langle 0\right|$, and $Z:=\left|0\right\rangle \left\langle 0\right|-\left|1\right\rangle \left\langle 1\right|$. In a usual setting of VQAs, U(θ) is composed of non-parametric gates such as CNOT gates, and parametric gates in the form of

$$U({\theta }_{i})=\exp (-{{{\rm{i}}}}{P}_{i}{\theta }_{i}),$$

(3)

where P_i is also a tensor product of n single-qubit Pauli operators in the same way as P_k in Eq. (2). For example, Fig. 2 shows a representative choice of parameterized circuits used for VQAs⁴. Note that the parameter space of the circuit in Fig. 2 is a D-dimensional hypercube θ ∈ [−π, π]^D, i.e., a bounded subspace of ${{\mathbb{R}}}^{D}$, on which a uniform probability distribution is well defined.
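As a concrete illustration of Eqs. (1)–(3), the following minimal sketch evaluates the cost function exactly by state-vector simulation for a small circuit built only from Pauli rotations. The helper names (`pauli_string`, `rotation`, `cost`), the two-qubit observable, and the gate list are hypothetical choices made for this example; non-parametric gates such as CNOT are omitted for brevity.

```python
import numpy as np

# Single-qubit identity and Pauli matrices in the standard basis.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULI = {"I": I2, "X": X, "Y": Y, "Z": Z}

def pauli_string(label):
    """Tensor product of single-qubit operators, e.g. 'ZI' -> Z (x) I."""
    op = np.array([[1.0 + 0j]])
    for ch in label:
        op = np.kron(op, PAULI[ch])
    return op

def rotation(label, theta):
    """exp(-i * theta * P) as in Eq. (3), using the fact that P @ P = identity."""
    P = pauli_string(label)
    return np.cos(theta) * np.eye(P.shape[0]) - 1j * np.sin(theta) * P

def cost(theta, terms, gates):
    """f(theta) = Tr[O U |0..0><0..0| U^dagger] = <psi|O|psi>, Eq. (1)."""
    n = len(gates[0])
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0                                   # the state |0...0>
    for label, angle in zip(gates, theta):
        psi = rotation(label, angle) @ psi         # apply the gates in order
    O = sum(c * pauli_string(lbl) for lbl, c in terms.items())  # Eq. (2)
    return float(np.real(np.vdot(psi, O @ psi)))

# Hypothetical example: O = 0.5 ZI + 0.5 IZ, one X rotation on each qubit.
terms = {"ZI": 0.5, "IZ": 0.5}
gates = ["XI", "IX"]
f_start = cost([0.0, 0.0], terms, gates)
f_flipped = cost([np.pi / 2, np.pi / 2], terms, gates)
```

Exact evaluation of this kind is only possible in simulation; on a quantum device the same quantity has to be estimated from measurement shots.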

**Fig. 2: An example of a parameterized quantum circuit used as an ansatz in VQAs.**

The task in the VQAs is to obtain an estimate of the minimum of the cost function

$$\mathop{\min }\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}f({{{\boldsymbol{\theta }}}}).$$

(4)

The minimizer is denoted by

$${{{{\boldsymbol{\theta }}}}}^{* }=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}f({{{\boldsymbol{\theta }}}}).$$

(5)

Note that the cost function f(θ), in general, can be non-convex, and it can be computationally hard in general to obtain the exact solution of the optimization problem in VQAs⁶⁰. By contrast, this paper aims to provide a heuristic optimizer that approximately solves this optimization problem with a small number of measurement shots. In experiments using a quantum device, we can evaluate the cost function from the sum of the expectation values ${{{\rm{Tr}}}}[{P}_{k}U({{{\boldsymbol{\theta }}}}){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{\boldsymbol{\theta }}}})]$ for all k, each of which can be estimated by independently repeating the preparation of $U({{{\boldsymbol{\theta }}}}){\left|0\right\rangle }^{\otimes n}$ by the parameterized circuit and the measurement of this state in the eigenbasis of the Pauli operator P_k. For each k, let ${\overline{P}}_{k}\in {\mathbb{R}}$ be a sample mean obtained from these measurements for P_k, and due to Eq. (2), we estimate f(θ) by

$$f({{{\boldsymbol{\theta }}}})\approx \mathop{\sum}\limits_{k}{c}_{k}{\overline{P}}_{k}.$$

(6)

Each of these measurements is called a measurement shot. In this way, we evaluate f using a finite number of measurement shots; in this setting, we are only allowed imprecise queries to the cost function due to statistical errors with the finite number of measurement shots. Based on the central limit theorem⁶¹, we may model each imprecise query to f(θ) as

$$y=f({{{\boldsymbol{\theta }}}})+\epsilon ,$$

(7)

where y is an observed value, and $\epsilon \sim {{{\mathcal{N}}}}(0,{\sigma }^{2})$ is independent and identically distributed (IID) Gaussian noise. From Hoeffding’s inequality⁶², to estimate f(θ) within an error ϵ with high probability, as large as O(1/ϵ²) measurement shots may be required. In practice, it is prohibitively costly (i.e., an excessive number of measurement shots are needed) to evaluate a well-approximated value of the cost function (as well as its gradient), which leads to significant overhead in performing VQAs^16,17.
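The shot-noise model of Eqs. (6) and (7) can be illustrated with a small simulation. The sketch below assumes that the exact expectation values of the Pauli terms are known (which is of course not the case on hardware) and draws ±1 outcomes from them; the function names, coefficients, and expectation values are hypothetical.

```python
import numpy as np

def estimate_pauli_sum(coeffs, expectations, shots_per_term, rng):
    """Estimate sum_k c_k <P_k> from simulated +/-1 outcomes, as in Eq. (6)."""
    estimate = 0.0
    for c, e in zip(coeffs, expectations):
        # A Pauli measurement returns +1 with probability (1 + <P_k>) / 2.
        p_plus = (1.0 + e) / 2.0
        outcomes = np.where(rng.random(shots_per_term) < p_plus, 1.0, -1.0)
        estimate += c * outcomes.mean()            # c_k times the sample mean
    return estimate

def empirical_std(coeffs, expectations, shots_per_term, repeats, rng):
    """Spread of the estimator over independent repetitions."""
    samples = [estimate_pauli_sum(coeffs, expectations, shots_per_term, rng)
               for _ in range(repeats)]
    return float(np.std(samples))

rng = np.random.default_rng(0)
coeffs = [0.5, -0.3]
expectations = [0.2, -0.6]                         # hypothetical values of <P_k>
f_exact = sum(c * e for c, e in zip(coeffs, expectations))

std_small = empirical_std(coeffs, expectations, 100, 300, rng)
std_large = empirical_std(coeffs, expectations, 10_000, 300, rng)
ratio = std_small / std_large
```

Because the statistical error scales as the inverse square root of the number of shots, `ratio` should come out near 10 for a 100-fold increase in shots, up to sampling fluctuations. This is the O(1/ϵ²) scaling that makes high-precision evaluation expensive.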

SGD aims to optimize a function f(θ) using an unbiased estimate of the gradient of f to update the parameters θ iteratively toward the optimal point with high probability.

In the optimization of circuit parameters for VQAs, we may need to evaluate the gradient of the cost function f(θ). For f(θ) defined with parametric gates in the form of (3), we can utilize a parameter-shift rule^63,64 to calculate partial derivatives of the cost function from cost-function values at shifted circuit parameters, i.e.,

$$\frac{\partial f({{{\boldsymbol{\theta }}}})}{\partial {\theta }_{i}}=\frac{f({{{\boldsymbol{\theta }}}}+\frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})-f({{{\boldsymbol{\theta }}}}-\frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})}{2}.$$

(8)

Here θ_i is a circuit parameter allocated to the rotation angle of the ith Pauli rotation gate $U({\theta }_{i})=\exp (-{{{\rm{i}}}}{P}_{i}{\theta }_{i})$, and e_i represents a unit vector along the coordinate of θ_i. Note that to obtain all the elements of the gradient of f(θ), we may need to evaluate each partial derivative independently.
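A minimal sketch of the parameter-shift rule in Eq. (8) is given below. It treats the cost function as a noiseless black box and assumes that f is sinusoidal with period 2π in each parameter, which corresponds to rotation gates written in the convention exp(−iθP/2); the toy cost function and the function names are hypothetical.

```python
import numpy as np

def parameter_shift_gradient(f, theta):
    """Gradient of f from cost values at shifted parameters, following Eq. (8)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2                       # (pi / 2) * e_i
        grad[i] = (f(theta + shift) - f(theta - shift)) / 2.0
    return grad

def toy_cost(theta):
    # Assumed toy model: <Z (x) Z> after RY(theta_0) (x) RY(theta_1) on |00>,
    # with RY(t) = exp(-i t Y / 2), which gives cos(theta_0) * cos(theta_1).
    return np.cos(theta[0]) * np.cos(theta[1])

theta = np.array([0.4, 1.1])
grad_shift = parameter_shift_gradient(toy_cost, theta)
grad_exact = np.array([-np.sin(theta[0]) * np.cos(theta[1]),
                       -np.cos(theta[0]) * np.sin(theta[1])])
```

Unlike a finite-difference formula, the shift here is not small, and under the stated assumption the two-point difference reproduces the derivative exactly rather than approximately. Each of the D partial derivatives needs its own pair of cost evaluations.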

However, as discussed above, we cannot exactly calculate the cost function and its gradient with a finite number of measurement shots, and the precise estimation of the gradient is costly in VQAs. In this setting, a standard method for solving Eq. (4) is stochastic gradient descent (SGD)^31,33, which updates the current point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$ at iteration t according to

$${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{(t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}),$$

(9)

where η^(t) is the step size, and ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}):={({\hat{g}}_{1}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}),\ldots ,{\hat{g}}_{D}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}))}^{\top }$ is an unbiased estimator of the gradient $\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$, i.e., ${\mathbb{E}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})]=\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$. Here ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ is estimated with a finite number of measurement shots, i.e., with a shot size

$${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}={({s}_{1}^{(t)},\ldots ,{s}_{D}^{(t)})}^{\top }.$$

(10)

The estimate of each partial derivative is individually computed as

$${\hat{g}}_{i}^{(t)}({{{\boldsymbol{\theta }}}})=\frac{1}{{s}_{i}^{(t)}}\mathop{\sum }\limits_{{\mathsf{m}}=1}^{{s}_{i}^{(t)}}{g}_{i}^{{\mathsf{m}}}({{{\boldsymbol{\theta }}}}),$$

(11)

$${g}_{i}^{{\mathsf{m}}}({{{\boldsymbol{\theta }}}})=({O}_{+}^{{\mathsf{m}}}-{O}_{-}^{{\mathsf{m}}})/2,$$

(12)

where ${O}_{\pm }^{{\mathsf{m}}}$ is a single-shot estimator of $f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})$. Each single-shot estimator of $f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})$ is constructed according to Eq. (6) by substituting θ with ${{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i}$, and the number of measurement shots used for estimating the kth term ${c}_{k}{\overline{P}}_{k}$ in Eq. (6) is denoted by ${s}_{i,k}^{(t)}$, which satisfies ${\sum }_{k}{s}_{i,k}^{(t)}={s}_{i}^{(t)}$. Given the shot size ${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}$, each ${s}_{i,k}^{(t)}$ is probabilistically determined using a multinomial distribution in such a way that the probability p_k of measuring the kth term should be proportional to the weight ∣c_k∣, i.e., p_k ∝ ∣c_k∣ and ∑_kp_k = 1²²; that is, it should hold that ${\mathbb{E}}[{s}_{i,k}^{(t)}]={p}_{k}{s}_{i}^{(t)}$ for each k and i. Since the gradient is estimated from two values $f({{{\boldsymbol{\theta }}}}\pm \frac{\pi }{2}{{{{\boldsymbol{e}}}}}_{i})$ of the cost function, the number of measurement shots used for obtaining ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ is

$$\mathop{\sum }\limits_{i=1}^{D}2{s}_{i}^{(t)}=2{s}_{{{{\rm{grad}}}}}^{(t)},$$

(13)

where we write ${s}_{{{{\rm{grad}}}}}^{(t)}:=| | {{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}| {| }_{1}$.
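The gradient estimator of Eqs. (10)–(13) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the expectation values of the Pauli terms are supplied as known functions of θ, and each single-shot estimate of the cost is built by the weighted-sampling construction sign(c_k)·‖c‖₁·(outcome), which is one way to obtain an unbiased estimator when the shots are allocated multinomially with p_k ∝ |c_k|. All names are hypothetical.

```python
import numpy as np

def single_shot_costs(theta, coeffs, expectations, shots, rng):
    """Draw `shots` unbiased single-shot estimates of f(theta) = sum_k c_k <P_k>."""
    coeffs = np.asarray(coeffs, dtype=float)
    weight = np.abs(coeffs).sum()
    p = np.abs(coeffs) / weight                    # p_k proportional to |c_k|
    counts = rng.multinomial(shots, p)             # s_{i,k}, summing to `shots`
    values = []
    for k, s_k in enumerate(counts):
        p_plus = (1.0 + expectations[k](theta)) / 2.0
        outcomes = np.where(rng.random(s_k) < p_plus, 1.0, -1.0)
        values.append(np.sign(coeffs[k]) * weight * outcomes)
    return np.concatenate(values)

def estimate_gradient(theta, coeffs, expectations, shot_sizes, rng):
    """Unbiased gradient estimate and the number of shots it consumed."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros(theta.size)
    for i, s_i in enumerate(shot_sizes):
        shift = np.zeros(theta.size)
        shift[i] = np.pi / 2
        plus = single_shot_costs(theta + shift, coeffs, expectations, s_i, rng)
        minus = single_shot_costs(theta - shift, coeffs, expectations, s_i, rng)
        grad[i] = (plus.mean() - minus.mean()) / 2.0   # Eqs. (11) and (12)
    total_shots = 2 * int(np.sum(shot_sizes))          # Eq. (13)
    return grad, total_shots

# Hypothetical toy model: f(theta) = 0.7 cos(theta_0) - 0.3 cos(theta_1).
coeffs = [0.7, -0.3]
expectations = [lambda th: np.cos(th[0]), lambda th: np.cos(th[1])]
rng = np.random.default_rng(0)
theta = np.array([0.4, 1.1])
grad_hat, used = estimate_gradient(theta, coeffs, expectations, [20_000, 20_000], rng)
grad_exact = np.array([-0.7 * np.sin(theta[0]), 0.3 * np.sin(theta[1])])
```

With a small shot size the same function returns a noisy but still unbiased direction, which is the regime SGD is designed for.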

The estimator ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}({{{\boldsymbol{\theta }}}})$ in VQAs is unbiased for all ${{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}$, which is a preferable property to achieve convergence of SGD³³. In addition, to guarantee convergence of SGD, we may require the step size to vanish as the estimated points approach a minimizer. In this case, the SGD achieves the optimization to accuracy ϵ within O(1/ϵ⁴) iterations in general for non-convex functions³², such as typical cost functions in VQAs. However, in practice, a user needs to designate a specific decay rate of step size to achieve good performance, whose optimization can be difficult.

BO is a gradient-free framework for optimization of an unknown function f(θ)^38,39. BO can be employed to optimize an expensive-to-evaluate cost function in settings where only noisy observations of the function are possible, and we try to seek a minimizer of f(θ) with as small a number of noisy observations as possible. One of the features of BO is to utilize an easy-to-compute surrogate model that approximates the unknown cost function based on observed data^65,66,67. A popular surrogate model for BO is a Gaussian process (GP)⁶⁸. A GP is a collection of random variables such that every finite subset of the random variables obeys a multivariate normal distribution. In BO, we put a GP prior over the true function f(θ) as $f({{{\boldsymbol{\theta }}}}) \sim {{{\mathcal{GP}}}}(\mu ({{{\boldsymbol{\theta }}}}),k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} ))$, where $\mu ({{{\boldsymbol{\theta }}}})={\mathbb{E}}(f({{{\boldsymbol{\theta }}}}))$ is a mean function, and $k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} )$ is a covariance kernel function. In practice, if one has no prior knowledge about the mean μ(θ) of the function that one tries to fit, μ(θ) can be set to 0. A common choice of the kernel function is a Gaussian kernel

$$k({{{\boldsymbol{\theta}}}},{{{\boldsymbol{\theta}}}}^{\prime} )={\tau }^{2}{\rm{exp}} \left(\frac{-\Vert{{{\boldsymbol{\theta}}}}-{{{\boldsymbol{\theta }}}}^{\prime} \Vert^{2}}{2{l}^{2}}\right),$$

(14)

where τ² is called the signal variance that determines the average of the differences from the mean of the function, and l is called the length-scale that determines the length required for the values of the function to be uncorrelated⁶⁸. For other conventional kernel functions, e.g., a Matérn kernel, see ref. ⁶⁸.

Here we consider a situation where we have a set of N noisy observations of the cost function ${{{{\mathcal{D}}}}}_{1:N}={\{({{{{\boldsymbol{\theta }}}}}^{(i)},{y}^{(i)})\}}_{i = 1}^{N}$ at points θ⁽¹⁾, …, θ^(N), where each y⁽ⁱ⁾ = f(θ⁽ⁱ⁾) + ϵ suffers from the IID Gaussian noise $\epsilon \sim {{{\mathcal{N}}}}(0,{\sigma }^{2})$. Assuming that these observations are given according to GP, we calculate a GP posterior conditioned on these estimations, which is governed by hyperparameters, namely, the signal variance τ², the length-scale l, and the variance of Gaussian noise σ². These hyperparameters can be estimated by means of maximizing a log marginal likelihood⁶⁸. Then, if we observe the cost function f at a new point θ_*, the value to be observed will obey a GP posterior expressed as

$${f}_{*}| {{{{\boldsymbol{\theta }}}}}_{* },{{{{\mathcal{D}}}}}_{1:N}\sim {{{\mathcal{N}}}}({{{{\boldsymbol{k}}}}}_{* }^{\top }{[K+{\sigma }^{2}I]}^{-1}{{{\boldsymbol{y}}}},{k}_{* * }-{{{{\boldsymbol{k}}}}}_{* }^{\top }{[K+{\sigma }^{2}I]}^{-1}{{{{\boldsymbol{k}}}}}_{* }),$$

(15)

where f_* = f(θ_*), y = [y⁽¹⁾, …, y^(N)]^⊤ is the vector of observed values, ${{{{\boldsymbol{k}}}}}_{* }={[k({{{{\boldsymbol{\theta }}}}}_{* },{{{{\boldsymbol{\theta }}}}}^{(1)}),\cdots ,k({{{{\boldsymbol{\theta }}}}}_{* },{{{{\boldsymbol{\theta }}}}}^{(N)})]}^{\top }$, k_** = k(θ_*, θ_*), and K is the covariance matrix ${[k({{{{\boldsymbol{\theta }}}}}^{(i)},{{{{\boldsymbol{\theta }}}}}^{(j)})]}_{i,j = 1}^{N}$⁶⁸.
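The posterior in Eq. (15) with the Gaussian kernel of Eq. (14) takes only a few lines of linear algebra. The sketch below keeps the hyperparameters τ², l, and σ² fixed at hypothetical values instead of fitting them by maximizing the log marginal likelihood; the function names are likewise illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, tau2=1.0, length=1.0):
    """Eq. (14) for two sets of points with shapes (N, D) and (M, D)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return tau2 * np.exp(-sq_dist / (2.0 * length ** 2))

def gp_posterior(train_x, train_y, test_x, tau2=1.0, length=1.0, sigma2=1e-2):
    """Posterior mean and covariance of f at test_x, following Eq. (15)."""
    K = gaussian_kernel(train_x, train_x, tau2, length)
    K_star = gaussian_kernel(train_x, test_x, tau2, length)  # columns are k_*
    K_ss = gaussian_kernel(test_x, test_x, tau2, length)     # diagonal is k_**
    A = K + sigma2 * np.eye(len(train_x))                    # K + sigma^2 I
    mean = K_star.T @ np.linalg.solve(A, train_y)
    cov = K_ss - K_star.T @ np.linalg.solve(A, K_star)
    return mean, cov

# Three nearly noiseless observations of a 1D function.
train_x = np.array([[-1.0], [0.0], [1.0]])
train_y = np.array([1.0, 0.0, 1.0])
test_x = np.array([[0.0], [10.0]])
mean, cov = gp_posterior(train_x, train_y, test_x, sigma2=1e-8)
```

At a point that has already been observed, the posterior mean reproduces the observation and the variance collapses toward zero. Far from all data (here at 10.0), the posterior falls back to the zero-mean prior with variance τ².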

In BO, we construct an acquisition function φ(θ) from the posterior in Eq. (15) and determine the next query point according to

$${{{{\boldsymbol{\theta }}}}}^{(N+1)}=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{D}}\varphi ({{{\boldsymbol{\theta }}}}).$$

(16)

Several ways of constructing the acquisition function have been proposed, such as Thompson sampling⁶⁹, upper confidence bound⁷⁰, and expected improvement⁷¹. In particular, Thompson sampling estimates values of f at a given set of points by sampling according to the multivariate normal distributions obtained from Eq. (15), and uses these sampled values as the values of the acquisition function at these points. Then, we take the minimum among the values of the acquisition function for the set of points and perform the next query to f at the minimum point in the set as shown in Eq. (16). The minimization of φ(θ) is performed by using efficient optimization heuristics^72,73. BO proceeds by querying the cost function at the minimizer of φ(θ) and iteratively updating the GP posterior according to Eq. (15) until a fixed number of queries to the cost function have been performed³⁸.
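The loop described above can be sketched for a one-dimensional problem with Thompson sampling on a fixed grid of candidate points. Several choices here are assumptions made for illustration: the kernel hyperparameters are fixed rather than fitted, the candidate set is a uniform grid, a small jitter is added to the covariance for numerical stability, and the final estimate is taken as the grid point with the lowest posterior mean. The names `posterior` and `thompson_bo` are hypothetical.

```python
import numpy as np

def kernel(a, b, tau2=1.0, length=0.5):
    """Gaussian kernel of Eq. (14) for 1D inputs."""
    return tau2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length ** 2))

def posterior(xs, ys, grid, sigma2):
    """GP posterior mean and covariance on the grid, following Eq. (15)."""
    A = kernel(xs, xs) + sigma2 * np.eye(len(xs))
    K_star = kernel(xs, grid)
    mean = K_star.T @ np.linalg.solve(A, ys)
    cov = kernel(grid, grid) - K_star.T @ np.linalg.solve(A, K_star)
    return mean, cov

def thompson_bo(noisy_f, grid, n_init, n_eval, sigma2, rng):
    """Run n_init random queries followed by n_eval Thompson-sampling queries."""
    xs = rng.choice(grid, size=n_init, replace=False)
    ys = np.array([noisy_f(x) for x in xs])
    for _ in range(n_eval):
        mean, cov = posterior(xs, ys, grid, sigma2)
        # Symmetrize and add jitter so that sampling is numerically stable.
        cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(grid))
        sample = rng.multivariate_normal(mean, cov)  # acquisition values
        x_next = grid[np.argmin(sample)]             # next query point, Eq. (16)
        xs = np.append(xs, x_next)
        ys = np.append(ys, noisy_f(x_next))
    mean, _ = posterior(xs, ys, grid, sigma2)
    return grid[np.argmin(mean)], xs, ys

rng = np.random.default_rng(1)
grid = np.linspace(-1.0, 1.0, 41)
noisy_f = lambda x: (x - 0.3) ** 2 - 1.0 + 0.02 * rng.normal()  # toy objective
x_best, xs, ys = thompson_bo(noisy_f, grid, n_init=4, n_eval=20, sigma2=4e-4, rng=rng)
```

On a one-dimensional grid the minimization of the acquisition function reduces to an `argmin` over sampled values, which is why restricting BO to a line keeps it cheap.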

This framework of BO has been shown to reduce the number of queries to the cost function required for the minimization compared to other global optimization algorithms³⁸. The performance of BO itself is governed by the ability to find the minimizer of φ(θ), which, like the cost function, is non-convex. Thus, it is important to design the acquisition function suitably so that the computational cost is relatively low and optimization heuristics are tractable^46,74,75,76. However, if the acquisition function is defined in a high-dimensional parameter space, as typically appears in VQAs, it is excessively costly to use BO.

Results

In the following, we present the description of SGLBO and introduce the adaptive shot strategy and suffix averaging. Moreover, numerical experiments are provided to demonstrate the advantage of SGLBO compared to other state-of-the-art optimizers for VQAs.

Algorithm 1 Stochastic gradient line Bayesian optimization (SGLBO)

Require: Cost function f(θ) with D parameters in Eq. (1), the initial shot size ${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(0)}$ for evaluating the gradient in Eq. (10), a kernel $k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} )$ and an acquisition function φ(θ) used for GP in Eqs. (15) and (16), the initial point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}$ to be updated according to Eq. (17), the bound ${\eta }_{\max }$ of the 1D subspace ${{{{\mathcal{L}}}}}^{(t)}$ to perform the BO in Eq. (19), the initial number ${s}_{{{{\rm{cost}}}}}^{(0)}$ of measurement shots for evaluating the cost function in BO in Eq. (20), the number N = N_init + N_eval of queries used for the BO in Eq. (21), the total number s_tot of measurement shots for the stopping condition (23), the precision κ in estimating the gradient according to Eq. (30), the description of the lower bound ${G}_{{{{\rm{grad}}}}}^{(t)}$ of the shot size in Eq. (30), the description of the lower bound ${G}_{{{{\rm{cost}}}}}^{(t)}$ of the number of measurement shots in estimating the cost function in Eq. (31), a parameter α for suffix averaging in Eq. (32).

1: initialize:

2: $t\leftarrow 0,\,{s}_{{{{\rm{temp}}}}}^{(t)}\leftarrow 0$

3: while ${s}_{{{{\rm{temp}}}}}^{(t)} \,<\, {s}_{{{{\rm{tot}}}}}$ ⊳ Iterate until the stopping condition (23) is satisfied.

4: ${\hat{{{{\boldsymbol{g}}}}}}^{(t)},{S}^{(t)}\leftarrow $ Estimate the gradient ${\mathbb{E}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]=\nabla f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$ using $2\times {{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}$ measurement shots according to Eq. (11), and calculate its empirical variance S^(t) in Eq. (30).

5: ${{{{\mathcal{L}}}}}^{(t)}\leftarrow $ Take the 1D subspace ${{{{\mathcal{L}}}}}^{(t)}$ depending on ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)},{\hat{{{{\boldsymbol{g}}}}}}^{(t)},{\eta }_{\max }$ according to Eq. (19).

6: ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}\leftarrow $ Determine ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}$ by the BO on ${{{{\mathcal{L}}}}}^{(t)}$ with $k({{{\boldsymbol{\theta }}}},{{{\boldsymbol{\theta }}}}^{\prime} ),\varphi ({{{\boldsymbol{\theta }}}}),{N}_{{{{\rm{init}}}}},{N}_{{{{\rm{eval}}}}}$ as described in the main text below Eq. (21).

7: ${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t+1)}\leftarrow $ Determine the shot size for estimating the gradient, from $\kappa ,{\hat{{{{\boldsymbol{g}}}}}}^{(t)},{S}^{(t)},D,{G}_{{{{\rm{grad}}}}}^{(t)}$ according to Eq. (30).

8: ${s}_{{{{\rm{cost}}}}}^{(t+1)}\leftarrow $ Determine the number of measurement shots for estimating the cost function in the BO, from ${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t+1)},{G}_{{{{\rm{cost}}}}}^{(t)}$ according to Eq. (31).

9: ${s}_{{{{\rm{temp}}}}}^{(t+1)}\leftarrow {s}_{{{{\rm{temp}}}}}^{(t)}+2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}$ due to Eq. (22).

10: t ← t + 1

11: end while

12: T ← t

13: return ${\overline{{{{\boldsymbol{\theta }}}}}}_{\alpha ,T}\leftarrow $ Take the suffix average according to Eq. (32).

Description of algorithm

We present a framework for the optimizer of parameterized quantum circuits in the VQAs, stochastic gradient line Bayesian optimization (SGLBO). The idea behind SGLBO is to estimate the direction of the gradient based on SGD and further to utilize BO to estimate the optimal step size within the one-dimensional subspace of parameters in this direction. This allows us to avoid the difficulty of choosing an appropriate step size in SGD, and also to achieve a feasible use of BO by limiting the domain on which BO is applied to the one-dimensional space. In addition, we introduce two noise-reduction techniques, adaptive shot strategy and suffix averaging, to improve the speed and the accuracy of minimizing the cost function. Adaptive shot strategy and suffix averaging are crucial and characteristic components for the feasibility of SGLBO and will be explained in the “Adaptive shot strategy” and “Suffix averaging for SGLBO” sections. Below, we will present the procedure of SGLBO (see also Algorithm 1).

The SGLBO achieves the minimization of the cost function by iteratively updating the points to estimate the minimizer of the cost function. Let T denote the total number of iterations in the SGLBO. For each iteration t = 0, 1, …, T − 1, let ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$ denote the point held at the start of the (t + 1)th iteration of the SGLBO, which is an estimator of the circuit parameters that minimize the cost function; the initial point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}$ represents an initial guess of the minimizer. Note that we here take ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}$ uniformly at random, but if a better initial guess of the minimizer is available, ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}$ could be chosen as that guess^77,78. Similarly to the SGD, the SGLBO computes an unbiased estimator ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ of the gradient of the cost function at the point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$, using $2{s}_{{{{\rm{grad}}}}}^{(t)}$ measurement shots due to Eq. (13). The shot size ${{{{\boldsymbol{s}}}}}_{{{{\rm{grad}}}}}^{(t)}$ is determined in each iteration t based on adaptive shot strategy, which will be explained in the following section. Using ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$, the SGLBO updates the point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$ to the next point according to an update rule described by

$${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\hat{\eta }}^{* (t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)},$$

(17)

where ${\hat{\eta }}^{* (t)}$ is an estimator of the optimal step size. The optimal step size η^*(t) is defined as

$${{{{\boldsymbol{\theta }}}}}^{* (t)}:=\mathop{{{{\rm{arg\,min}}}}}\limits_{{{{\boldsymbol{\theta }}}}\in {{{{\mathcal{L}}}}}^{(t)}}f({{{\boldsymbol{\theta }}}})={\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{* (t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)},$$

(18)

where ${{{{\mathcal{L}}}}}^{(t)}$ is the one-dimensional subspace for applying the BO, i.e.,

$${{{{\mathcal{L}}}}}^{(t)}:=\{{\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}-{\eta }^{(t)}{\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {\eta }^{(t)}\in [-{\eta }_{\max },{\eta }_{\max }]\},$$

(19)

and ${\eta }_{\max } \,>\, 0$ is a constant hyperparameter to bound the one-dimensional subspace that will be specified in “Example of choice of hyperparameters and implementation” section. We remark that we choose ${\eta }_{\max }$ as a constant independent of D so that the BO should be feasible even in the case of large D. A parameter region of D parameters of a circuit can be a D-dimensional hypercube, e.g., θ ∈ [−π, π]^D for the circuit in Fig. 2, and thus, to cross the whole parameter region by ${{{{\mathcal{L}}}}}^{(t)}$, one may be tempted to choose ${\eta }_{\max }$ as the length of the diagonal of this D-dimensional hypercube, i.e., ${\eta }_{\max }\approx \sqrt{D}$; however, for the feasibility of the BO, it is indeed essential to keep ${\eta }_{\max }$ constant. Our approach can be considered an improvement over the SGD with a constant step size ${\eta }_{\max }$, where we use the BO to estimate the optimal step size ${\hat{\eta }}^{* (t)}$ instead of using the fixed step size ${\eta }_{\max }$.
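As a minimal sketch of Eqs. (17) and (19), the following Python snippet maps a step size to a point on the line ${{{{\mathcal{L}}}}}^{(t)}$. The function name `line_point` and the numerical values are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def line_point(theta, g_hat, eta, eta_max):
    """Point on the line L^(t) of Eq. (19) for a step size eta in [-eta_max, eta_max]."""
    if abs(eta) > eta_max:
        raise ValueError("eta must lie in [-eta_max, eta_max]")
    return np.asarray(theta, dtype=float) - eta * np.asarray(g_hat, dtype=float)

theta = np.array([0.5, -1.0, 2.0])   # current point (illustrative values)
g_hat = np.array([1.0, 0.0, -2.0])   # gradient estimate (illustrative values)

# Update of Eq. (17) with a hypothetical estimated step size of 0.25.
theta_next = line_point(theta, g_hat, eta=0.25, eta_max=1.0)
```

Negative step sizes are allowed by Eq. (19), so the search can also move along $+{\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ when the noisy gradient estimate points the wrong way.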

To obtain an estimate of the optimal step size ${\hat{\eta }}^{* (t)}$ in Eq. (17), we perform the procedure of BO on ${{{{\mathcal{L}}}}}^{(t)}$ by using a fixed number of measurement shots

$${s}_{{{{\rm{cost}}}}}^{(t)}$$

(20)

per query to the cost function, and querying these noisy observations of the cost function N times in total with

$$N={N}_{{{{\rm{init}}}}}+{N}_{{{{\rm{eval}}}}},$$

(21)

where N_init is the number of points used for initial evaluation for BO, and N_eval is the number of points evaluated during the BO in each step in addition to N_init. This procedure determines ${\hat{\eta }}^{* (t)}$ in such a way that ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)}$ in Eq. (17) should be given by θ^(N+1) in Eq. (16). We will specify N_init and N_eval in “Example of choice of hyperparameters and implementation” section. In the BO, we use N_init points for the initial queries, which we take at equal intervals in the 1D subspace ${{{{\mathcal{L}}}}}^{(t)}$. Using the observed points, the BO iterates a cycle according to Eqs. (15) and (16) to decide an additional point to evaluate per cycle. Repeating N_eval cycles, we have N_eval points in addition to the N_init initial points, where the nth cycle for n ∈ {1, …, N_eval} uses (N_init + n − 1) points to decide the (N_init + n)th point. These N points are used for the update according to Eq. (17), i.e., the calculation of ${\hat{\eta }}^{* (t)}$.
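The line search described above can be sketched in plain NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the kernel hyperparameters are fixed rather than fitted, the step size is restricted to a finite grid, the observations are centered before regression, and `toy_cost` is a hypothetical stand-in for the noisy cost function along the line.

```python
import numpy as np

def rbf(a, b, tau2=0.2, ell=0.7):
    """Gaussian kernel between two sets of 1D points."""
    d = a[:, None] - b[None, :]
    return tau2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, sigma2=1e-2):
    """GP posterior mean and covariance on x_grid given noisy observations."""
    K = rbf(x_obs, x_obs) + sigma2 * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(x_grid, x_grid) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

def line_bo(noisy_cost, eta_max, n_init=5, n_eval=5, n_grid=201, seed=0):
    """Estimate the minimizing step size on [-eta_max, eta_max] by Thompson sampling."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-eta_max, eta_max, n_grid)
    x = np.linspace(-eta_max, eta_max, n_init)   # equally spaced initial queries
    y = np.array([noisy_cost(e) for e in x])
    for _ in range(n_eval):
        mean, cov = gp_posterior(x, y - y.mean(), grid)
        # Draw one function from the posterior via an eigendecomposition,
        # clipping tiny negative eigenvalues caused by round-off.
        w, V = np.linalg.eigh(0.5 * (cov + cov.T))
        sample = mean + V @ (np.sqrt(np.clip(w, 0.0, None)) * rng.standard_normal(n_grid))
        eta_new = grid[np.argmin(sample)]        # query the minimizer of the draw
        x = np.append(x, eta_new)
        y = np.append(y, noisy_cost(eta_new))
    mean, _ = gp_posterior(x, y - y.mean(), grid)
    return grid[np.argmin(mean)], len(x)         # minimizer of the predictive mean

toy_rng = np.random.default_rng(1)
toy_cost = lambda eta: (eta - 0.3) ** 2 + 0.01 * toy_rng.standard_normal()
eta_hat, n_queries = line_bo(toy_cost, eta_max=1.0)
```

With an odd `n_init`, the equally spaced initial queries include the step size 0, i.e., the current point itself.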

In this way, the SGLBO updates the point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$ according to Eq. (17) until we consume a preset total number of measurement shots s_tot, which we initially designate. In particular, in the (t + 1)th iteration for each t = 0, …, T − 1, we use $2{s}_{{{{\rm{grad}}}}}^{(t)}$ measurement shots for estimating the gradient according to Eq. (13), and also use ${s}_{{{{\rm{cost}}}}}^{(t)}$ measurement shots for each of the N queries to the cost function in the BO; that is, the number of measurement shots that we use in the (t + 1)th iteration is

$$2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}.$$

(22)

In the SGLBO, if the total number of measurement shots used in the iterations exceeds the preset bound s_tot, i.e.,

$$\mathop{\sum }\limits_{t=0}^{T-1}\left[2{s}_{{{{\rm{grad}}}}}^{(t)}+N{s}_{{{{\rm{cost}}}}}^{(t)}\right]\geqq {s}_{{{{\rm{tot}}}}},$$

(23)

then we stop the iterations. Note that T is given by the minimum number of iterations satisfying Eq. (23), determined while running the SGLBO depending on s_tot. We could also stop the iterations if we achieve the convergence of the cost function, while we here use the stopping condition based on s_tot for simplicity of presentation. We remark that it would be too costly in VQAs to check the convergence of the values of the cost function $f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$ itself, which we avoid here; instead, it would be possible, e.g., to use another stopping condition by checking the convergence of the sequence of parameters ${({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t = 0,\ldots ,T-1}$.
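The shot accounting of Eqs. (22) and (23) can be sketched as below. The function name and the schedule are hypothetical; the sketch also assumes that $s_{\rm grad}^{(t)}$ is read as the total gradient shot size summed over all components.

```python
def run_until_budget(shots_per_iteration, s_tot):
    """Return (T, used): the smallest T whose cumulative shots reach s_tot, as in Eq. (23)."""
    used = 0
    for T, (s_grad, s_cost, N) in enumerate(shots_per_iteration, start=1):
        used += 2 * s_grad + N * s_cost   # shots consumed in one iteration, Eq. (22)
        if used >= s_tot:
            return T, used
    return None, used  # the supplied iterations never reach the budget

# Hypothetical constant schedule: s_grad = 10, s_cost = 5, N = 10 queries per iteration.
schedule = [(10, 5, 10)] * 100
T, used = run_until_budget(schedule, s_tot=200)
```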

Finally, after the last iteration, the optimizer calculates a suffix average⁵⁵ of the points ${({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t = 0,\ldots ,T-1}$, i.e., an average of a subset of the points in a latter part of the iterations, which we will explain in “Suffix averaging for SGLBO” section. This suffix average is output as the estimate of the minimizer of the cost function.

The procedure of the SGLBO may require an additional cost of measurement shots for the BO compared to the SGD without using the BO, but this cost is negligible as explained in the following. To estimate the optimal step size by the BO, we may use an extra number of measurement shots to query the cost function, in addition to the gradient estimation based on the SGD. For simplicity, suppose that the shot size (10) and the number of measurement shots to evaluate the cost function in the BO are given by a constant s, i.e., ${s}_{i}^{(t)}=s$ (i ∈ {1, …, D}) and ${s}_{{{{\rm{cost}}}}}^{(t)}=s$. Then, due to Eq. (22), the number of measurement shots to be used in each iteration of the SGLBO is (2D + N)s. In this case, the cost of estimating the optimal step size is the same as the cost of the gradient estimation for a parameterized quantum circuit with N/2 additional parameters. This cost can be negligibly low as the number of circuit parameters D gets large, and hence, we can indeed gain the benefit of estimating the optimal step size by the BO.
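To make the overhead argument concrete, the following sketch evaluates the per-iteration count (2D + N)s and the fraction of shots spent on the BO queries. The function names and the chosen values of D and s are illustrative assumptions; N = 10 corresponds to N_init = N_eval = 5.

```python
def shots_per_iteration(D, N, s):
    """Total shots per SGLBO iteration when every shot size equals s: (2D + N) s."""
    return (2 * D + N) * s

def bo_overhead_fraction(D, N):
    """Fraction of the per-iteration shots spent on the BO queries: N / (2D + N)."""
    return N / (2 * D + N)

# The BO share shrinks as the number of circuit parameters D grows.
overhead = {D: bo_overhead_fraction(D, N=10) for D in (10, 100, 1000)}
```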

The foundation for why SGLBO can efficiently find a candidate of the minimum point, i.e., a stationary point, can be explained as follows. The constant step-size SGD with averaging converges to a stationary point even in a non-convex setting⁷⁹. The SGLBO is designed to converge faster than this constant step-size SGD with averaging since we use the BO to find a step size that further reduces the value of the cost function compared to taking the deterministic constant step size. In particular, in each step t ∈ {0, …, T − 1}, BO aims to find the minimum point along a 1D subspace; that is, the cost function $f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$ is reduced to $f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)})$ satisfying $f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t+1)})\leqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})$ with high probability, in the case where BO is performed with sufficiently good precision. In this case, as the iterations proceed, SGLBO improves the cost function according to $f({\hat{{{{\boldsymbol{\theta }}}}}}^{(0)})\geqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(1)})\geqq \cdots \geqq f({\hat{{{{\boldsymbol{\theta }}}}}}^{(T-1)})$, which does not necessarily hold in SGD but should hold in the SGLBO with high probability, leading to an improvement compared to the mere use of the SGD. We remark that the optimization problems in VQAs are non-convex, and hence, a tight analysis of the convergence speed would be challenging in general. Some previous research such as refs. ^33,37 performs convergence analyses of optimizers for VQAs with assumptions on convexity or strong convexity, but the performance for non-convex problems that typically appear in VQAs is unknown. In contrast, the above explanation of convergence does not require the convexity assumptions. However, to bound the speed of convergence of SGLBO, further assumptions may be needed since non-convex optimization problems are hard to solve by nature. We leave the tight analysis of the convergence speed of the SGLBO under an appropriate assumption for the setting of VQAs for further research; instead, we will use numerical simulation to show the fast convergence speed of the SGLBO in our numerical experiments.

Adaptive shot strategy

The number of measurement shots used for estimating values and gradients of the cost function is one of the crucial parameters in stochastic optimization algorithms. In such algorithms, we may have a trade-off between efficiency and accuracy. In particular, at the beginning of optimization, we can use an imprecise gradient estimated with few measurement shots to roughly move to points around the minimizer. On the other hand, at the end of optimization, the gradients with less noise are needed to further decrease the value of the cost function. This observation motivates us to establish a strategy to gradually increase the shot size (10) used for estimating the gradient in the SGLBO as the optimization proceeds.

Such adaptive shot strategies have been well studied in the field of machine learning^{47,48,49,50,51,52,53}, and one of them has been applied also in the context of VQAs^34,35. However, the formula for estimating the next number of measurement shots given in refs. ^34,35 depends on the step size and becomes invalid when the step size exceeds a certain range. Problematically, the step size in the SGLBO often exceeds the range. Thus, our algorithm utilizes a different approach, the norm test^48,49,51, which determines the number of measurement shots to maintain a constant signal-to-noise ratio of the estimate of the gradient.

In the norm test, we want to decide the shot size based on a condition that the estimated vector $-{\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ should be appropriately in a descent direction⁵¹, which ideally would be

$${\delta }^{(t)}:=| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}-{{{\boldsymbol{\nabla }}}}f({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})| | \leqq \kappa | | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | ,$$

(24)

with a parameter κ satisfying 0 ≦ κ < 1. Intuitively, as the optimization proceeds, the norm $| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | $ of the gradient becomes small, and the condition (24) requires that the estimate ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$ of the gradient should become precise as $| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| | $ gets small. However, the exact evaluation of δ^(t) would be prohibitively costly in VQAs. Thus, we square both sides of the above inequality and then replace the left hand side with its expectation, i.e., ${\mathbb{E}}[{({\delta }^{(t)})}^{2}]={{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]$, where ${{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]$ is the variance of ${\hat{{{{\boldsymbol{g}}}}}}^{(t)}$. The exact value of this variance is still difficult to calculate, and hence, we make the approximation using a sample variance⁸⁰, i.e.,

$${{{\rm{Var}}}}[{\hat{{{{\boldsymbol{g}}}}}}^{(t)}]\simeq \frac{{{{\rm{Tr}}}}({{{\Sigma }}}^{(t)})}{{s}_{{{{\rm{grad}}}}}^{(t)}}$$

(25)

where ${{{\Sigma }}}_{ij}^{(t)}:={\mathbb{E}}[({g}_{i}^{(t)}-\nabla {f}_{i}^{(t)})({g}_{j}^{(t)}-\nabla {f}_{j}^{(t)})]$. Instead of Eq. (24), the norm test could check

$$\frac{{{{\rm{Tr}}}}({{{\Sigma }}}^{(t)})}{{s}_{{{{\rm{grad}}}}}^{(t)}}\leqq {\kappa }^{2}| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}.$$

(26)

To adapt the condition (26) to the setting of VQAs, we consider the freedom of choosing the number of measurement shots for estimating each partial derivative of the cost function in Eq. (8). Since each partial derivative is estimated independently, Eq. (26) can be written as,

$$\mathop{\sum}\limits_{i}\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}}{{s}_{i}^{(t)}}\leqq {\kappa }^{2}| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2},$$

(27)

where ${\sigma }_{i}^{(t)}:=\sqrt{{{{\rm{Var}}}}[{g}_{i}^{(t)}]}$. Now we impose a constraint on the number of measurement shots so that each estimate of the partial derivative should have an equal variance, i.e., ${({\sigma }_{i}^{(t)})}^{2}/{s}_{i}^{(t)}={({\sigma }_{j}^{(t)})}^{2}/{s}_{j}^{(t)}$ for i ≠ j. Then, we obtain a lower bound of ${s}_{i}^{(t)}$ for each i, i.e.,

$${s}_{i}^{(t)}\geqq \frac{1}{{\kappa }^{2}}\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}D}{| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}}.$$

(28)
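For completeness, the step from Eq. (27) to Eq. (28) can be spelled out as follows, where $c^{(t)}$ denotes the common value of the per-component variance (a symbol introduced here only for this derivation). Under the equal-variance constraint, the sum in Eq. (27) consists of D identical terms, so that

$$\frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}}{{s}_{i}^{(t)}}={c}^{(t)}\ \ {\rm{for}}\,{\rm{all}}\ i\quad \Longrightarrow \quad D\,{c}^{(t)}\leqq {\kappa }^{2}| | {\hat{{\boldsymbol{g}}}}^{(t)}| {| }^{2}\quad \Longrightarrow \quad \frac{{\left({\sigma }_{i}^{(t)}\right)}^{2}}{{s}_{i}^{(t)}}\leqq \frac{{\kappa }^{2}| | {\hat{{\boldsymbol{g}}}}^{(t)}| {| }^{2}}{D},$$

and solving the last inequality for ${s}_{i}^{(t)}$ gives the lower bound in Eq. (28).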

In practice, the true variance ${({\sigma }_{i}^{(t)})}^{2}$ is still too costly to evaluate, and thus, we replace it with the empirical variance ${({S}_{i}^{(t)})}^{2}$, which is accessible. Consequently, we forecast the number of measurement shots so that it should satisfy

$${s}_{i}^{(t+1)}\geqq \frac{1}{{\kappa }^{2}}\frac{{\left({S}_{i}^{(t)}\right)}^{2}D}{| | {\hat{{{{\boldsymbol{g}}}}}}^{(t)}| {| }^{2}},$$

(29)

which we use to estimate the gradient in the next iteration. Since the SGLBO is intended to be applied to highly noisy cases, to avoid the cases where ${s}_{i}^{(t+1)}$ is too small to estimate the gradient appropriately, we here set a lower bound ${G}_{{{{\rm{grad}}}}}^{(t)}$ on the shot size and decide the next shot size according to

$$s_{i}^{(t+1)} =\max\left\{ \left\lceil\frac{1}{\kappa^2} \frac{\left(S_{i}^{(t)}\right)^2D}{||\hat{{\boldsymbol{g}}}^{(t)}||^2}\right\rceil,G_{\mathrm{grad}}^{(t)}\right\},$$

(30)

where ⌈ ⋯ ⌉ is the ceiling function. The choice of ${G}_{{{{\rm{grad}}}}}^{(t)}$ will be specified in “Example of choice of hyperparameters and implementation” section.
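The update of Eq. (30) can be sketched in a few lines of Python. The function name `next_shot_sizes` and the input values are illustrative assumptions; the sketch does not guard against a vanishing gradient estimate.

```python
import math
import numpy as np

def next_shot_sizes(S, g_hat, kappa, G_grad):
    """Eq. (30): per-component shot sizes for the next iteration."""
    S = np.asarray(S, dtype=float)           # empirical standard deviations S_i^(t)
    g_norm2 = float(np.dot(g_hat, g_hat))    # squared norm of the gradient estimate
    D = len(S)
    raw = (S ** 2) * D / (kappa ** 2 * g_norm2)
    # Round up and enforce the lower bound G_grad on every component.
    return [max(math.ceil(r), G_grad) for r in raw]

shots = next_shot_sizes(S=[1.0, 0.5], g_hat=[0.3, 0.4], kappa=0.99, G_grad=4)
```

In this example the first component needs more shots than the lower bound, while the second is raised to the lower bound.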

Using the shot size specified by Eq. (30), we also decide the number of measurement shots used for observing values of the cost function in the BO according to

$${s}_{{{{\rm{cost}}}}}^{(t+1)}=\max \left\{\frac{1}{D}\mathop{\sum }\limits_{i=1}^{D}{s}_{i}^{(t)},{G}_{{\rm{cost}}}^{(t)}\right\},$$

(31)

where ${G}_{{{{\rm{cost}}}}}^{(t)} \,>\, 0$ is a constant for avoiding the cases where ${s}_{{{{\rm{cost}}}}}^{(t+1)}$ becomes too small to estimate the optimal step size appropriately. The choice of ${G}_{{{{\rm{cost}}}}}^{(t)}$ will also be specified in “Example of choice of hyperparameters and implementation” section.
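A minimal sketch of Eq. (31) follows. The function name is hypothetical, and the value is returned without rounding, as in the equation; an implementation would presumably round it up to an integer number of shots.

```python
def next_cost_shots(s_grad_components, G_cost):
    """Eq. (31): mean of the per-component gradient shot sizes, floored by G_cost."""
    mean_shots = sum(s_grad_components) / len(s_grad_components)
    return max(mean_shots, G_cost)

s_cost = next_cost_shots([9, 4], G_cost=5)
```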

Suffix averaging for SGLBO

In VQAs, one could use a point obtained from the final iteration as the result of the optimization. However, in SGLBO, we use BO to estimate the optimal step size in Eq. (18), and due to statistical error in the estimation, we suffer from the influence of the error between the estimate of the optimal step size obtained from the BO and the true optimal step size. Moreover, hardware noise also prevents steady update of the points, especially when we use near-term noisy quantum devices. Such errors or noises may lead to an oscillation of the points in the final part of the iterations around the minimizer. To suppress such oscillation, we take a suffix average of these points in the final part of the iterations, rather than using the single point of the final iteration itself.

Given the sequence of points obtained from T iterations ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)},\ldots ,{\hat{{{{\boldsymbol{\theta }}}}}}^{(T-1)}$, the α-suffix average is defined as the average of the last αT points⁵⁵

$${\overline{{{{\boldsymbol{\theta }}}}}}_{\alpha ,T}=\frac{1}{\alpha T}\mathop{\sum }\limits_{t=(1-\alpha )T}^{T-1}{\hat{{{{\boldsymbol{\theta }}}}}}^{(t)},$$

(32)

where α ∈ (0, 1] is some constant, and α and T are taken here in such a way that αT should be an integer. During the optimization, we store the sequence of the points ${({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t}$ in memory. At the end of optimization, we calculate the suffix average of these points according to the above formula and output the suffix average as the result of the SGLBO.
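The α-suffix average can be sketched as follows, following the verbal definition as the equal-weight average of the last αT stored points. The function name and the toy sequence are illustrative assumptions.

```python
import numpy as np

def suffix_average(points, alpha):
    """Equal-weight average of the last alpha*T points of a sequence of T points."""
    points = np.asarray(points, dtype=float)
    T = len(points)
    k = round(alpha * T)
    if k < 1 or abs(k - alpha * T) > 1e-9:
        raise ValueError("alpha * T must be a positive integer")
    return points[T - k:].mean(axis=0)

# Toy sequence of T = 10 two-dimensional points; alpha = 0.2 keeps the last 2.
trajectory = [[t, 2 * t] for t in range(10)]
theta_bar = suffix_average(trajectory, alpha=0.2)
```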

Importantly, to achieve the goal of suppressing the effect of noise at the points in the final part of the iterations, the suffix averaging here uses an equal weight in averaging out the noise in this part. To achieve this suppression with small overhead, the parameter α should be chosen appropriately, in such a way that the last αT points should be kept in a reasonably small fraction among all T points yet still large enough to suppress the noise effectively. We note that, instead of using the equal weight, averaging with a decaying sequence of weights would also work⁵⁶, which may have a merit in a case where one does not have enough memory to store all points and wants to average the points on the fly. Detailed comparison of suffix-averaging techniques using different sequences of weights in VQAs is left for future work.

The suffix averaging can accelerate the convergence of SGD in some cases; for example, for optimization of a strongly convex function, i.e., a function that is (roughly speaking) more convex than a quadratic function, the error of the point in the Tth iteration decreases at the speed of $O(\log (T)/T)$ with high probability, but the error of the suffix average of the points in the latter half of the T iterations reduces to O(1/T), achieving the optimal speed⁵⁵. In the case of VQAs, f may not be strongly convex. However, even in the SGLBO, we can suppress the oscillation around the minimizer in practice by taking the suffix average, which contributes to improving the results of the optimization.

Example of choice of hyperparameters and implementation

We show an example of the choice of hyperparameters in Algorithm 1. These hyperparameters will be used in numerical experiments. In the numerical experiments, we also consider the cases with and without hardware noise, referring to them as the noisy case and the noiseless case, respectively.

For estimating the gradient in the SGLBO, we take the initial shot size as

$${s}_{i}^{(0)}=2 \quad {{{\rm{for}}}}\,{{{\rm{all}}}} \quad i,$$

(33)

and initialize ${\hat{{{{\boldsymbol{\theta }}}}}}^{(0)}$ by sampling from the uniform probability distribution. We set the lower bound ${G}_{{{{\rm{grad}}}}}^{(t)}$ on the shot size by an average shot size in the last 10 iterations; i.e., for t + 1 ≧ 10, according to Eq. (30), we take

$$\begin{array}{ll} &G_{\mathrm{grad}}^{(t)}=\frac{1}{10D}\sum\limits_{i^\prime=1}^{D}\sum\limits_{t^\prime=1}^{10}s_{i^\prime}^{(t-10+t^\prime)}, \\ &{\rm{i.e.,}}\, s_{i}^{(t+1)}= \max\left\{\left\lceil\frac{1}{\kappa^2} \frac{\left(S_{i}^{(t)}\right)^2D}{||\hat{{\boldsymbol{g}}}^{(t)}||^2}\right\rceil,\frac{1}{10D}\sum\limits_{i^\prime=1}^{D}\sum\limits_{t^\prime=1}^{10}s_{i^\prime}^{(t-10+t^\prime)}\right\}, \end{array}$$

(34)

and ${G}_{{{{\rm{grad}}}}}^{(t)}=1$ for t ≦ 10. We set κ = 0.99 in Eq. (30).
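The moving-average lower bound of Eq. (34) can be sketched as below. The function name is hypothetical, and the sketch assumes that the bound equals 1 until a full window of 10 iterations is available (the reading t + 1 ≧ 10 for the averaged case).

```python
def grad_lower_bound(shot_history, window=10):
    """Lower bound G_grad^(t) from the history [s^(0), ..., s^(t)] of per-component shot sizes."""
    if len(shot_history) < window:
        return 1  # early iterations: no full window yet
    recent = shot_history[-window:]          # shot sizes of the last `window` iterations
    D = len(recent[0])
    return sum(sum(s) for s in recent) / (window * D)

# Toy history for D = 2 parameters over 12 iterations.
history = [[t, t + 2] for t in range(12)]
G_grad = grad_lower_bound(history)
```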

In the BO that is used as a subroutine in the SGLBO, we use the Gaussian kernel in Eq. (14) with τ² = 0.2, l = 0.7 as initial values. Before performing the GP regression to estimate values of a cost function, we optimize the hyperparameters, i.e., τ², l, and the variance of Gaussian noise σ², by maximizing the marginal likelihood of the hyperparameters. To avoid overfitting, we restrict the parameter region of these hyperparameters; in our numerical experiments, we set the parameter region as 10⁻³ ≦ τ² ≦ 5, 10⁻³ ≦ l ≦ 1, and 10⁻⁵ ≦ σ² ≦ 5. In addition, we perform this hyperparameter optimization 10 times from uniformly random starting points and take the best parameters to ensure that the hyperparameters are not a poor local optimum. As the acquisition function used in the BO, we choose Thompson sampling^68,69. After performing the BO, we set the estimated optimal step size as the minimum point of the predictive mean of a GP posterior conditioned on N observed data points.

For the BO, we set N_init = 5 and N_eval = 5. The N_init points of the initial evaluation are randomly chosen according to the uniform probability distribution over the 1D subspace ${{{{\mathcal{L}}}}}^{(t)}$ in Eq. (19) with

$${\eta }^{(t)}\in [-{\eta }_{\max },{\eta }_{\max }],\,{\eta }_{\max }=\min \left\{\frac{\beta }{| | H| | },\pi \right\},$$

(35)

where ∣∣H∣∣ is the operator norm, and β > 0 is a constant that we set depending on the problem later in “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section. Note that one of the initial evaluation points must be taken as η^(t) = 0, i.e., the current point ${\hat{{{{\boldsymbol{\theta }}}}}}^{(t)}$, for the stability of the BO. The number of measurement shots used for evaluating each point in the BO is given by Eq. (31) with

$$\begin{array}{l}{G}_{\rm{cost}}^{(t)}=\frac{\Vert H\Vert^{2}}{{\epsilon }^{2}}\,{{{\rm{for}}\,{\rm{all}}}}\,t,\\ {\rm{i}}.{\rm{e}}.,\,{s}_{{\rm{cost}}}^{(t)}={\rm{max}} \left\{\frac{1}{D}\mathop{\sum }\limits_{i=1}^{D} {s}_{i}^{(t)}, \frac{\Vert H\Vert^{2}}{{\epsilon }^{2}}\right\},\end{array}$$

(36)

where ϵ = 0.1. Given the outcomes of these measurements, we perform GP regression using GPy⁸¹.
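The two hyperparameter rules in Eqs. (35) and (36) amount to the following small helpers. The function names are illustrative, and the operator norm `h_norm` is assumed to be supplied by the caller.

```python
import math

def eta_max(beta, h_norm):
    """Eq. (35): half-width of the step-size interval, capped at pi."""
    return min(beta / h_norm, math.pi)

def cost_shot_floor(h_norm, eps=0.1):
    """Eq. (36): constant lower bound G_cost = ||H||^2 / eps^2 on the cost shot size."""
    return h_norm ** 2 / eps ** 2

half_width = eta_max(beta=3.0, h_norm=6.0)
floor_shots = cost_shot_floor(h_norm=2.0)
```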

For the suffix averaging, we set α = 0.1 in Eq. (32).

Numerical experiments

In the following, we numerically demonstrate the advantages of the SGLBO in comparison with state-of-the-art optimizers for VQAs. The optimizers to be compared with the SGLBO are summarized in “Optimizers for VQAs and their implementations” section. In particular, we investigate two situations: (1) when the size of a system scales up in “Advantage of SGLBO for various system sizes” section, and (2) when hardware noise and connectivity between qubits on hardware are taken into account in “Robustness against hardware noise in SGLBO” section. To this end, we simulate the performance of the optimizers in tasks of variational quantum eigensolver (VQE)⁵ for (1) and variational quantum compilation (VQC)⁵⁸ for (2). Furthermore, we demonstrate in “Merits of noise-reducing techniques for general optimizers” section that the techniques of suffix averaging and adaptive shot strategy used in the SGLBO can also improve performance and noise robustness of a general class of optimizers, not only the SGLBO.

Optimizers for VQAs and their implementations

To compare the SGLBO with other existing optimizers, we consider the following three state-of-the-art optimizers: adaptive moment estimation (Adam)⁵⁷, individual coupled adaptive number of shots (iCANS)³⁴, and Nakanishi-Fujii-Todo method (NFT)²³. Adam is a variant of SGD; although a number of different strategies for choosing step size in SGD have been proposed, Adam chooses the step size adaptively based on the accumulated information of estimates of the gradient used in previous iterations. The choice of step size in Adam is known to work well for many applications in the field of machine learning, but for VQAs, the required number of measurement shots for the optimization with Adam has been still prohibitively large³⁴. We use Adam as a representative choice of a straightforward application of SGD to VQAs. The iCANS is also a variant of stochastic gradient optimizers in which the number of measurement shots at each iteration is chosen frugally based on the first and second moment of the gradient to improve performance in VQAs. While both of these optimizers are gradient-based optimizers, NFT is a sequential optimization method along an axis of the parameters using function fitting rather than the gradient.

For iCANS, we in particular use iCANS1³⁴, and for Adam, we used the same values of the hyperparameters as ref. ³⁴. In terms of the initial number of measurement shots used in iCANS, which is not mentioned in ref. ³⁴, we set ${s}_{i}^{(0)}=2$ for all i in our numerical experiments. Here we note that for iCANS1, the step size η_t is changed depending on the tasks of VQAs as specified in “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section, following ref. ³⁴. In addition, we used ${s}_{i}^{(t)}=1000$ shots for each evaluation of the cost function in Eq. (8) in Adam and ${s}_{{{{\rm{cost}}}}}^{(t)}=1000$ shots for each evaluation of the cost function to fit the function in NFT. Note that the values of the hyperparameters for which the optimizer works well are selected manually or by referring to the values of previous studies, and we did not perform an exhaustive hyperparameter search since such a search is computationally too costly to perform. After all, it may be infeasible to run such a hyperparameter search when we apply these optimizers to practical problems.

In these numerical experiments, we simulate quantum circuits by using Pennylane⁸². In “Advantage of SGLBO for various system sizes” section and “Robustness against hardware noise in SGLBO” section, the values of the cost function appearing in the figures are evaluated at the point of the final iterate in ${({\hat{{{{\boldsymbol{\theta }}}}}}^{(t)})}_{t}$ (and the suffix averaged point in the SGLBO) by a noiseless simulator, where both the statistical noise and the hardware noise are ignored; in “Merits of noise-reducing techniques for general optimizers” section, these values are evaluated at the suffix averaged point by the noiseless simulator. For each optimizer, we repeated the overall optimization procedures fifteen times from uniformly random initial points, where each run from an initial point is repeated twice, and took the average over all the thirty runs. In the figures, we display the logarithm of the average as a thick line and each run as a thin line, using log-linear plots.

Advantage of SGLBO for various system sizes

In this section, we investigate the performance of SGLBO as we scale up the system size. We evaluate the performance of the optimizers in terms of the total number of measurement shots used during the optimization. In each iteration, we calculate the difference per site between the cost-function value at the current point of each optimizer and the minimum value of the cost function. In particular, we here consider a VQE task⁵ for a 1D transverse field Ising model under open boundary conditions. The VQE is an algorithm to calculate the ground state energy of a given Hamiltonian, where the cost function is defined as the expectation-value of the Hamiltonian. The Hamiltonian here is given by

$$H=-J\left(\mathop{\sum }\limits_{j=1}^{n-1}{Z}_{j}{Z}_{j+1}+g\mathop{\sum }\limits_{j=1}^{n}{X}_{j}\right)$$

(37)

where Z_j and X_j are the Pauli Z and X matrices, respectively, at the jth site on a 1D chain of qubits, J represents the energy scale, and g is the relative strength of the external field compared to the nearest-neighbor couplings⁸³. We choose J = 1.0 and g = 1.5. We use the ansatz circuit in Fig. 2 with r = 4 repetitions for n = 4, 8, 12 qubits. These sizes of the circuits are chosen based on the feasibility of classical simulation. We remark that we do not change the depth of the ansatz circuits in this setting and change only the system size, so that the gradient does not vanish exponentially for the large system size⁸⁴; that is, it is expected that the problem of the barren plateau, which potentially makes the optimization infeasible^84,85,86, is avoided in our setting. In this problem, for the SGLBO, we restrict the region for the line search ${{{{\mathcal{L}}}}}_{i}$ by β = 3, and for the iCANS, we set the step size η_t = 1/∣∣H∣∣, following ref. ³⁴.
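For small n, the Hamiltonian of Eq. (37) can be built as a dense matrix to obtain the reference ground-state energy and the operator norm ∣∣H∣∣. The sketch below is an illustration with hypothetical helper names, not the simulation code used in the experiments.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def op_on(n, ops):
    """Tensor product over n sites; ops maps a site index to its single-qubit matrix."""
    return reduce(np.kron, [ops.get(j, I2) for j in range(n)])

def tfim_hamiltonian(n, J=1.0, g=1.5):
    """Dense matrix of Eq. (37) with open boundary conditions (small n only)."""
    zz = sum(op_on(n, {j: Z, j + 1: Z}) for j in range(n - 1))
    x = sum(op_on(n, {j: X}) for j in range(n))
    return -J * (zz + g * x)

H = tfim_hamiltonian(4)
energies = np.linalg.eigvalsh(H)                      # eigenvalues in ascending order
ground_energy = energies[0]                           # minimum of the VQE cost function
op_norm = max(abs(energies[0]), abs(energies[-1]))    # operator (spectral) norm of H
```

The matrix dimension is 2^n, so this dense construction is only practical for the small system sizes considered here.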

The result of the numerical simulation is shown in Fig. 3. Significantly, we discover that the SGLBO outperforms the other optimizers^23,34,57 in all the cases of n = 4, 8, 12 qubits, in terms of both the speed of convergence and the accuracy of estimating the minimum of the cost function. Thus, these advantages of the SGLBO can be obtained not only for the relatively small system size n = 4 but more broadly for the larger system sizes n = 8, 12. While NFT and Adam hit the limit of accuracy of the minimization in the early stage of the optimization, SGLBO and iCANS continue to improve the cost function even at the end of the optimization, which shows the advantage of deciding the number of measurement shots adaptively for each iteration in these algorithms. Moreover, owing to using the BO for estimating the optimal step size in each iteration, the SGLBO enjoys faster convergence with fewer overall measurement shots. The additional cost of measurement shots in the BO in Eq. (22) turns out to be negligible even on a small scale n = 4, as well as the larger scales discussed in “Description of algorithm” section. Consequently, for the VQE tasks in Fig. 3, the SGLBO achieves the optimization of parameterized quantum circuits at a significantly faster convergence speed in terms of the number of measurement shots, and with better accuracy in minimizing the cost function than the other state-of-the-art optimizers.

**Fig. 3: Comparison of optimizers in terms of the performance on the VQE tasks.**

Robustness against hardware noise in SGLBO

Next, we investigate the noise robustness of SGLBO. We consider VQC⁵⁸ with a fixed input state. The task of VQC is to find parameters of a parameterized circuit so that the unitary implemented by the circuit should act as equivalently as possible to a given target unitary when acting on a given input state. Following ref. ⁵⁸, we define the cost function as

$$f({{{\boldsymbol{\theta }}}})=1-\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}{G}_{0}^{(j)},$$

(38)

where

$$\begin{array}{l}{G}_{0}^{(j)} ={{{\rm{Tr}}}}[(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\bar{j}}){U}^{{\dagger} }({{{\boldsymbol{\theta }}}})U({{{{\boldsymbol{\theta }}}}}^{* }){(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}{U}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}^{* })U({{{\boldsymbol{\theta }}}})].\end{array}$$

(39)

Here ${{\mathbb{1}}}_{\bar{j}}$ is an identity operator acting on all qubits except the jth qubit, ${G}_{0}^{(j)}$ is the probability of getting the outcome 0 on the jth qubit, θ is a vector of circuit parameters to be optimized, and θ^* is a target vector of circuit parameters that are chosen here as ${{{{\boldsymbol{\theta }}}}}^{* }={(0,\ldots ,0)}^{\top }\in {{\mathbb{R}}}^{D}$. The target unitary is U(θ^*), and the input state is ${(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}$. The ansatz circuit U(θ) used here is the one in Fig. 2 with n = 4 and r = 6. In this case, the ansatz circuit can reach the optimal point at θ = θ^* to output ${(\left|0\right\rangle \left\langle 0\right|)}^{\otimes n}$, where the value of the cost function is exactly zero at the optimal point, and y-axis shows the difference between the true optimal value (i.e., zero) and the value at the estimated optimal point. We note that this cost function is defined by local observables, so the gradient does not vanish in the shallow ansatz circuit used in this VQC task^58,85. In VQC, we demonstrate the performance of the optimizers in both noiseless and noisy cases. To simulate noise in the noisy case, we used information about the gate-operation and readout errors and the connectivity of IBM’s Bogota processor^87,88. The detailed explanation on the parameters of the noise model is in Supplementary Information. We set β = 6 to limit the region ${{{{\mathcal{L}}}}}_{i}$ for SGLBO and choose the step size η_t = 0.1 for iCANS, following ref. ³⁴.
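Given computational-basis measurement outcomes of the state prepared by applying U(θ^*) and then U^†(θ) to the all-zero input, the cost function of Eq. (38) can be estimated as sketched below. The function name and the sample data are hypothetical.

```python
import numpy as np

def vqc_cost_from_samples(bitstrings):
    """Estimate Eq. (38) from measured bitstrings (rows: shots, columns: qubits)."""
    bits = np.asarray(bitstrings)
    p_zero = (bits == 0).mean(axis=0)   # estimate of G_0^(j) for each qubit j
    return 1.0 - p_zero.mean()          # one minus the average over the n qubits

# Four hypothetical shots on n = 4 qubits.
samples = [[0, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 0],
           [1, 1, 0, 0]]
cost = vqc_cost_from_samples(samples)
```

Because each term depends on a single qubit, all n local probabilities are estimated from the same set of shots.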

The result of the numerical simulation is presented in Fig. 4. In the noiseless case, the SGLBO works better than the other state-of-the-art optimizers, which is consistent with the result of the VQE in Fig. 3. More remarkably, even in the presence of a moderate amount of hardware noise described above, the SGLBO can achieve almost the same accuracy in minimizing the cost function as that in the noiseless case, while the other optimizers converge to worse cost-function values. This result indicates a remarkable noise resilience of the SGLBO, owing to using the BO and also the technique of suffix averaging. In the SGLBO, the estimates of the minimizer of the cost function may be affected by hardware noise, and even if we use the BO that is relatively robust against the noise, these estimates may oscillate around the minimizer. However, the suffix averaging of these estimates makes it possible to obtain a point that is even nearer to the minimizer. In addition, the cost function in VQC has a preferable property that the minimizer is not susceptible to shifting caused by hardware noise⁸⁹, and this property also contributes to the noise resilience in this case; that is, in other tasks for the VQAs without this property, the same accuracy as noiseless cases would be hard to achieve in noisy cases. This result shows that the SGLBO can be more tolerant to hardware noise than the other state-of-the-art optimizers, which is crucial for the feasibility of performing VQAs on NISQ devices.

**Fig. 4: Comparison of optimizers in terms of the performance on VQC tasks.**

Merits of noise-reducing techniques for general optimizers

We also show here that the suffix averaging technique and the adaptive shot strategy that we use in the SGLBO are advantageous for improving the performance and noise robustness of the other state-of-the-art optimizers as well, not only of the SGLBO.

In particular, we consider the same VQC task as in the “Robustness against hardware noise in SGLBO” section, and we first apply the suffix averaging technique to all the optimizers, i.e., iCANS, Adam, and NFT as well as the SGLBO. The result of the numerical simulation is shown in Fig. 5. In both the noiseless and noisy cases, suffix averaging significantly improves the accuracy of the state-of-the-art optimizers, especially NFT and Adam, compared to the cases without suffix averaging in Fig. 4. For iCANS, suffix averaging may not be as effective as for NFT and Adam, but it still achieves an accuracy comparable to that without suffix averaging. This result shows that the suffix averaging technique that we apply in the SGLBO is indeed useful as a general technique for improving a wide class of optimizers, not only the SGLBO itself. At the same time, our numerical simulation shows that even if we improve the other optimizers by suffix averaging, the SGLBO still outperforms them.
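
Since suffix averaging only post-processes the sequence of parameter estimates, it can be attached to any optimizer. A minimal sketch follows, assuming that the final estimate is the mean of the last fraction `alpha` of the iterates; the function name and the default `alpha=0.5` are illustrative assumptions, not the setting used in the experiments.

```python
import numpy as np

def suffix_average(iterates, alpha=0.5):
    """Average the last ceil(alpha * T) of T parameter vectors.

    `iterates` has shape (T, D): one parameter vector per iteration.
    `alpha` (assumed value) is the fraction of trailing iterates to keep.
    """
    iterates = np.asarray(iterates, dtype=float)
    total = len(iterates)
    keep = max(1, int(np.ceil(alpha * total)))
    return iterates[total - keep:].mean(axis=0)

# Iterates that oscillate around the minimizer at the origin.
history = [[0.1, -0.1], [-0.1, 0.1]] * 4
print(suffix_average(history))  # the oscillation averages out to zero
```

The early iterates, which are typically far from the minimizer, are discarded, so the average is taken only over the regime where the estimates fluctuate around the optimum.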

**Fig. 5: Comparison of optimizers with the suffix averaging technique (SA), in the performance on the same VQC tasks as Fig. 4.**

Next, we apply the adaptive shot strategy to Adam. Note that our adaptive shot strategy cannot be applied directly to NFT since NFT does not use gradients; also, iCANS uses its own variant of an adaptive shot strategy, and hence our technique based on the norm test cannot be combined with iCANS without changing that strategy. Following the setting of the SGLBO with Eq. (33), we set ${s}_{i}^{(0)}=2$ for all i when we combine the adaptive shot strategy with Adam in these experiments. The results of the numerical experiments are shown in Fig. 6. In both the noiseless and noisy cases, the adaptive shot strategy improves the performance of the original Adam. This indicates that the adaptive shot strategy based on the norm test is applicable to gradient-based optimizers and can improve their performance. In Fig. 6, we also demonstrate the combination of suffix averaging and the adaptive shot strategy with Adam. In the noiseless case, since Adam with the adaptive shot strategy has not yet hit the floor in the minimization and is still improving its accuracy, taking the suffix average worsens the accuracy, in contrast to the case where averaging cancels out noise around the optimal point. In the noisy case, on the other hand, the accuracy is improved. This result further confirms the effectiveness of the suffix averaging technique against hardware noise. The SGLBO still outperforms the other optimizers combined with these techniques.
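
To make the idea of a norm test concrete, the sketch below uses the generic form from the adaptive-sampling literature: the shot count is kept when the estimated variance of the gradient estimate is small relative to its squared norm and is increased otherwise. This is not the rule of Eq. (33), which is not reproduced here; the function name, the parameter `kappa`, and the handling of a zero gradient are assumptions for illustration.

```python
import math
import numpy as np

def norm_test_shots(grad_samples, kappa=1.0):
    """Return a shot count suggested by a generic norm test.

    `grad_samples` has shape (s, D): s single-shot gradient estimates.
    The test checks  sum_d Var_d / s <= kappa^2 * ||mean gradient||^2.
    `kappa` is a hypothetical tolerance parameter.
    """
    g = np.asarray(grad_samples, dtype=float)
    shots = g.shape[0]
    mean = g.mean(axis=0)
    var_sum = g.var(axis=0, ddof=1).sum()  # unbiased per-component variance
    sq_norm = float(mean @ mean)
    if sq_norm == 0.0:
        return shots  # assumption: keep the count if the direction is undefined
    if var_sum / shots <= kappa**2 * sq_norm:
        return shots  # test passed: current shots suffice
    # Smallest shot count for which the test would pass with these statistics.
    return math.ceil(var_sum / (kappa**2 * sq_norm))

samples = [[2.0, 0.0], [0.0, 0.0]]
print(norm_test_shots(samples, kappa=1.0))  # 2
print(norm_test_shots(samples, kappa=0.5))  # 8
```

A smaller `kappa` demands a more accurate gradient direction and therefore more shots, which is why the shot count tends to grow as the gradient norm shrinks near the minimizer.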

**Fig. 6: Comparison of Adam with the suffix averaging technique (SA) and/or the adaptive shot strategy (ASS), in terms of the performance on the same VQC task as Fig. 4.**

In this way, the techniques that we develop for the SGLBO are also applicable broadly beyond the SGLBO itself, establishing a foundation for designing further efficient optimizers for VQAs in future research. At the same time, these results show that SGLBO is an effective combination of all the techniques, i.e., SGD, BO, the suffix averaging, and the adaptive shot strategy, to outperform the state-of-the-art optimizers.

Discussion

In this work, we have developed an efficient framework, stochastic gradient line Bayesian optimization (SGLBO), for optimizing parameterized quantum circuits in variational quantum algorithms (VQAs). The core idea of the SGLBO is to estimate the direction of the gradient based on stochastic gradient descent (SGD), and to use Bayesian optimization (BO) for estimating the optimal step size in this direction. The BO used for estimating the optimal step size contributes to minimizing the cost function faster and more accurately, owing to the robustness of the BO against noise. To carry out the optimization with fewer measurement shots, we also formulated an adaptive measurement-shot strategy based on the norm test to estimate the direction of the gradient efficiently. In addition, to suppress the effect of statistical error and hardware noise, we introduced the suffix averaging technique. The SGLBO with these techniques can reduce the number of measurement shots needed for optimizing the parameterized circuits, and also improve the accuracy in minimizing the cost function in the VQAs.
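
The core idea, a gradient direction combined with a one-dimensional BO over the step size, can be sketched as follows. This is an illustrative toy only: the RBF kernel, the lower-confidence-bound acquisition, the grid search, and all hyperparameter values are assumptions and should not be read as the choices made in this work.

```python
import numpy as np

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between 1D arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-2):
    """Posterior mean and standard deviation of a unit-variance GP."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    mu0 = y_obs.mean()  # constant prior mean set to the data mean
    mean = mu0 + Ks.T @ np.linalg.solve(K, y_obs - mu0)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def line_bo_step(cost, theta, grad, eta_max=1.0, n_evals=8, rng=None):
    """One update: choose the step size along -grad by 1D BO, then move."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(0.0, eta_max, 101)
    etas = list(rng.uniform(0.0, eta_max, size=2))  # two random initial points
    ys = [cost(theta - e * grad) for e in etas]
    for _ in range(n_evals - 2):
        mean, std = gp_posterior(np.array(etas), np.array(ys), grid)
        eta_next = grid[np.argmin(mean - 2.0 * std)]  # lower confidence bound
        etas.append(eta_next)
        ys.append(cost(theta - eta_next * grad))
    mean, _ = gp_posterior(np.array(etas), np.array(ys), grid)
    best_eta = grid[np.argmin(mean)]  # trust the surrogate, not a single noisy value
    return theta - best_eta * grad

# Toy usage with a noiseless quadratic standing in for the cost function.
cost = lambda th: float(th @ th)
theta = np.array([1.0, -1.0])
theta = line_bo_step(cost, theta, grad=2 * theta, rng=np.random.default_rng(0))
```

Selecting the step size from the posterior mean rather than from the best single observation is what lets a surrogate-based line search average over noisy cost evaluations.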

To compare the performance of the SGLBO with other state-of-the-art optimizers, we numerically investigated two situations: (1) when the system size increases and (2) when hardware noise is present. For various system sizes, we find that the SGLBO significantly reduces the number of measurement shots required for achieving a desired accuracy in minimizing cost functions, and reaches an even better accuracy than other state-of-the-art optimizers, as shown in Fig. 3. Furthermore, we have shown that, even in the presence of a moderate amount of hardware noise, the SGLBO can achieve almost the same accuracy as in the noiseless case, whereas the accuracy of the other state-of-the-art optimizers degrades, in the task shown in Fig. 4. To suppress the noise, the suffix averaging technique as well as the use of the BO is crucial, and it turns out that the suffix averaging and the adaptive shot strategy developed for the SGLBO can also improve the accuracy and the noise robustness of other existing optimizers, as demonstrated in Figs. 5 and 6.

Consequently, by integrating two different optimization approaches, SGD and BO, the SGLBO opens an alternative way to drastically reduce the cost of measurement shots in the optimization of parameterized quantum circuits, and also to make VQAs more feasible under unavoidable hardware noise in near-term quantum devices. The techniques introduced here are versatile for problems with various system sizes, effective even in the presence of noise, and widely applicable to a variety of algorithms for optimizing parameterized quantum circuits in the setting of VQAs, as demonstrated above. At the same time, the approach developed for the SGLBO provides a fundamental insight into how VQAs can use classical information extracted from quantum states, progressing beyond estimating expectation values. Moreover, the idea of the SGLBO provides a general framework for optimizing noisy functions in the field of machine learning (ML), not only for VQAs. Thus, our results are expected to be of interest not only to users of noisy intermediate-scale quantum (NISQ) devices but also to much broader communities of quantum information, such as those working on ML-assisted calibration of quantum devices in experiments, quantum tomography using an ansatz, and quantum metrology.

These results point toward various directions of future research. One possible direction is to investigate how the performance changes when the 1D subspace for the BO, currently taken along the gradient-descent direction (Eq. (19)), is instead chosen along another direction, such as those of natural gradient descent^28,30,90,91, negative curvature descent⁹², or the conjugate gradient method⁹³. The development of a more efficient method for determining appropriate hyperparameter values in the SGLBO is also important for improving the accuracy. In our work, we have empirically found that the SGLBO with suffix averaging performs well in practice even if hardware noise is considered, but further research is needed to clarify which classes of hardware noise the suffix averaging can tolerate, and how many iterations are needed to achieve performance comparable to the noiseless case. It would also be interesting to provide a theoretical guarantee on the performance of the SGLBO under appropriate assumptions, especially in the setting of non-convex optimization; after all, both empirical and theoretical studies are crucial for harnessing the potential for near-term applications of VQAs. Finally, since the SGLBO avoids the cost of precise estimation of expectation values in optimizing parameterized circuits for VQAs, it is even more advantageous to pursue applications of VQAs that do not require estimating the expectation values at any stage of the algorithm, i.e., even after the optimization; for example, state-of-the-art quantum algorithms for quantum machine learning avoid the expectation-value estimation by solving sampling problems so that the speedup is not canceled out^94,95,96, and further research is needed to clarify how we can similarly avoid the expectation-value estimation in quantum machine learning with VQAs.

Data availability

Data for the plots supporting the results in this work can be obtained from the corresponding author upon reasonable request.

Code availability

Computer codes to perform the numerical experiments in this work are available from the corresponding author upon reasonable request.

References

Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Cerezo, M. et al. Variational quantum algorithms. Nat. Rev. Phys. 3, 625–644 (2021).
Endo, S., Cai, Z., Benjamin, S. C. & Yuan, X. Hybrid quantum-classical algorithms and quantum error mitigation. J. Phys. Soc. Jpn. 90, 032001 (2021).
Bharti, K. et al. Noisy intermediate-scale quantum algorithms. Rev. Mod. Phys. 94, 015004 (2022).
Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 4213 (2014).
Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242–246 (2017).
McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).
McArdle, S., Endo, S., Aspuru-Guzik, A., Benjamin, S. C. & Yuan, X. Quantum computational chemistry. Rev. Mod. Phys. 92, 015003 (2020).
Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
Zhou, L., Wang, S.-T., Choi, S., Pichler, H. & Lukin, M. D. Quantum approximate optimization algorithm: performance, mechanism, and implementation on near-term devices. Phys. Rev. X 10, 021067 (2020).
Harrigan, M. P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nat. Phys. 17, 332–336 (2021).
Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).
Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quantum Sci. Technol. 2, 045001 (2017).
Benedetti, M., Garcia-Pintos, D., Nam, Y. & Perdomo-Ortiz, A. A generative modeling approach for benchmarking and training shallow quantum circuits. NPJ Quant. Inf. 5, 45 (2018).
Schuld, M. & Killoran, N. Quantum machine learning in feature hilbert spaces. Phys. Rev. Lett. 122, 040504 (2019).
Wecker, D., Hastings, M. B. & Troyer, M. Progress towards practical quantum variational algorithms. Phys. Rev. A 92, 042303 (2015).
Gonthier, J. F. et al. Identifying challenges towards practical quantum advantage through resource estimation: the measurement roadblock in the variational quantum eigensolver. Preprint at https://arxiv.org/abs/2012.04001 (2020).
Sung, K. J. et al. Using models to improve optimizers for variational quantum algorithms. Quantum Sci. Technol. 5, 044008 (2020).
Huggins, W. J. et al. Efficient and noise resilient measurements for quantum chemistry on near-term quantum computers. NPJ Quant. Inf. 7, 23 (2021).
Huang, H.-Y., Kueng, R. & Preskill, J. Predicting many properties of a quantum system from very few measurements. Nat. Phys. 16, 1050–1057 (2020).
Huang, H.-Y., Kueng, R. & Preskill, J. Efficient estimation of pauli observables by derandomization. Phys. Rev. Lett. 127, 030503 (2021).
Arrasmith, A., Cincio, L., Somma, R. D. & Coles, P. J. Operator sampling for shot-frugal optimization in variational algorithms. Preprint at https://arxiv.org/abs/2004.06252 (2020).
Nakanishi, K. M., Fujii, K. & Todo, S. Sequential minimal optimization for quantum-classical hybrid algorithms. Phys. Rev. Res. 2, 043158 (2020).
Wilson, M. et al. Optimizing quantum heuristics with meta-learning. Quantum Mach. Intell. 3, 13 (2021).
Koczor, B. & Benjamin, S. C. Quantum analytic descent. Phys. Rev. Res. 4, 023017 (2022).
Ostaszewski, M., Grant, E. & Benedetti, M. Structure optimization for parameterized quantum circuits. Quantum 5, 391 (2021).
Cervera-Lierta, A., Kottmann, J. S. & Aspuru-Guzik, A. Meta-variational quantum eigensolver: Learning energy profiles of parameterized hamiltonians for quantum simulation. PRX Quantum 2, 020329 (2021).
Stokes, J., Izaac, J., Killoran, N. & Carleo, G. Quantum natural gradient. Quantum 4, 269 (2020).
Self, C. N. et al. Variational quantum algorithm with information sharing. NPJ Quant. Inf. 7, 116 (2021).
Haug, T. & Kim, M. S. Optimal training of variational quantum algorithms without barren plateaus. Preprint at https://arxiv.org/abs/2104.14543 (2021).
Robbins, H. & Monro, S. A stochastic approximation method. Ann. Math. Stat 22, 400–407 (1951).
Bottou, L., Curtis, F. E. & Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018).
Sweke, R. et al. Stochastic gradient descent for hybrid quantum-classical optimization. Quantum 4, 314 (2020).
Kübler, J. M., Arrasmith, A., Cincio, L. & Coles, P. J. An adaptive optimizer for measurement-frugal variational algorithms. Quantum 4, 263 (2020).
Gu, A., Lowe, A., Dub, P. A., Coles, P. J. & Arrasmith, A. Adaptive shot allocation for fast convergence in variational quantum algorithms. Preprint at https://arxiv.org/abs/2108.10434 (2021).
Lavrijsen, W., Tudor, A., Muller, J., Iancu, C. & de Jong, W. Classical optimizers for noisy intermediate-scale quantum devices. In 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), 267–277 (IEEE, 2020).
Harrow, A. W. & Napp, J. C. Low-depth gradient measurements can improve convergence in variational hybrid quantum-classical algorithms. Phys. Rev. Lett. 126, 140502 (2021).
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. Taking the human out of the loop: a review of bayesian optimization. Proc. IEEE 104, 148–175 (2016).
Snoek, J., Larochelle, H. & Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems, Vol. 25 (NIPS, 2012).
Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, 115–123 (PMLR, 2013).
Martinez-Cantin, R., Freitas, N., Brochu, E., Castellanos, J. & Doucet, A. A bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton. Robots 27, 93–103 (2009).
Lizotte, D. J., Wang, T., Bowling, M. H. & Schuurmans, D. Automatic gait optimization with gaussian process regression. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 944–949 (Morgan Kaufmann Publishers Inc., 2007).
Azimi, J. et al. Myopic policies for budgeted optimization with constrained experiments. In Proceedings of the National Conference on Artificial Intelligence (AAAI, 2010).
Otterbach, J. S. et al. Unsupervised machine learning on a hybrid quantum computer. Preprint at https://arxiv.org/abs/1712.05771 (2017).
Zhu, D. et al. Training of quantum circuits on a hybrid quantum computer. Sci. Adv. 5, eaaw9918 (2019).
Kandasamy, K., Schneider, J. & Poczos, B. High dimensional bayesian optimisation and bandits via additive models. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, 295–304 (PMLR, 2015).
Friedlander, M. P. & Schmidt, M. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34, A1380–A1405 (2012).
Bollapragada, R., Byrd, R. & Nocedal, J. Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28, 3312–3343 (2017).
Byrd, R., Chin, G., Nocedal, J. & Wu, Y. Sample size selection in optimization methods for machine learning. Math. Program. 134, 127–155 (2012).
Pasupathy, R., Glynn, P., Ghosh, S. & Hashemi, F. On sampling rates in simulation-based recursions. SIAM J. Optim. 28, 45–73 (2018).
De, S., Yadav, A., Jacobs, D. & Goldstein, T. Automated Inference with Adaptive Batches. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Vol. 54, 1504–1513 (PMLR, 2017).
Balles, L., Romero, J. & Hennig, P. Coupling adaptive batch sizes with learning rates. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI, 675–684 (Curran Associates, Inc., 2017).
Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.-J. & Tang, P. T. P. A progressive batching l-BFGS method for machine learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, Vol. 80, 620–629 (2018).
Rakhlin, A., Shamir, O. & Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 1571–1578 (Omnipress, 2012).
Harvey, N. J. A., Liaw, C., Plan, Y. & Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, (eds Beygelzimer, A. & Hsu, D.) 1579–1613 (2019).
Shamir, O. & Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, 71–79 (JMLR.org, 2013).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), (2015).
Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).
Mahsereci, M. & Hennig, P. Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 4262–4320 (2017).
Bittel, L. & Kliesch, M. Training variational quantum algorithms is np-hard. Phys. Rev. Lett. 127, 120502 (2021).
Kwak, S. & Kim, J. Central limit theorem: the cornerstone of modern statistics. Korean J. Anesthesiol. 70, 144 (2017).
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963).
Mitarai, K., Negoro, M., Kitagawa, M. & Fujii, K. Quantum circuit learning. Phys. Rev. A 98, 032309 (2018).
Schuld, M., Bergholm, V., Gogolin, C., Izaac, J. & Killoran, N. Evaluating analytic gradients on quantum hardware. Phys. Rev. A 99, 032331 (2019).
Bodin, E. et al. Modulating surrogates for bayesian optimization. In ICML 2020: 37th International Conference on Machine Learning, Vol. 1, 970–979 (PMLR, 2020).
Springenberg, J. T., Klein, A., Falkner, S. & Hutter, F. Bayesian optimization with robust bayesian neural networks. In Advances in Neural Information Processing Systems, Vol. 29, 4134–4142 (2016).
Snoek, J. et al. Scalable bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, 2171–2180 (JMLR, 2015).
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (The MIT Press, 2005).
Basu, K. & Ghosh, S. Adaptive rate of convergence of thompson sampling for gaussian process optimization. Preprint at https://arxiv.org/abs/1705.06808 (2020).
Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. W. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory 58, 3250–3265 (2012).
Jones, D. R. A taxonomy of global optimization methods based on response surfaces. J. Glob. Optim. 21, 345–383 (2001).
Spall, J. An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Tech. Dig. 19, 482–492 (1998).
Jones, D. R. Direct global optimization algorithm, 431–440 (Springer, 2001).
Rolland, P. T. Y., Scarlett, J., Bogunovic, I. & Cevher, V. High dimensional bayesian optimization via additive models with overlapping groups. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, 298–307 (PMLR, 2018).
Djolonga, J., Krause, A. & Cevher, V. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, Vol. 26, 1025–1033 (NIPS, 2013).
Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R. & Krause, A. Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Vol. 97, 3429–3438 (PMLR, 2019).
Grant, E., Wossnig, L., Ostaszewski, M. & Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum 3, 214 (2019).
Mitarai, K., Suzuki, Y., Mizukami, W., Nakagawa, Y. O. & Fujii, K. Quadratic clifford expansion for efficient benchmarking and initialization of variational quantum algorithms. Phys. Rev. Res. 4, 033012 (2022).
Yu, L., Balasubramanian, K., Volgushev, S. & Erdogdu, M. A. An analysis of constant step size sgd in the non-convex regime: Asymptotic normality and bias. In Advances in Neural Information Processing Systems, Vol. 34, 4234–4248 (NeurIPS, 2021).
Freund, J. E. Mathematical Statistics with Applications, 8th edn. (Pearson, 2014).
GPy. GPy: Gaussian processes framework in python. https://github.com/SheffieldML/GPy (2021).
Bergholm, V. et al. Pennylane: Automatic differentiation of hybrid quantum-classical computations. Preprint at https://arxiv.org/abs/1811.04968 (2020).
Pfeuty, P. The one-dimensional ising model with a transverse field. Ann. Phys. 57, 79–90 (1970).
McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).
Cerezo, M., Sone, A., Volkoff, T., Cincio, L. & Coles, P. J. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat. Commun. 12, 1791 (2021).
Ortiz Marrero, C., Kieferová, M. & Wiebe, N. Entanglement-induced barren plateaus. PRX Quantum 2, 040316 (2021).
IBM Quantum Experience. https://quantum-computing.ibm.com/ (2021).
IBM Quantum Backends. https://github.com/Qiskit/qiskit-terra/tree/main/qiskit/test/mock/backends (2021).
Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational quantum compiling. New J. Phys. 22, 043006 (2020).
Wierichs, D., Gogolin, C. & Kastoryano, M. Avoiding local minima in variational quantum eigensolvers with the natural gradient optimizer. Phys. Rev. Res. 2, 043246 (2020).
van Straaten, B. & Koczor, B. Measurement cost of metric-aware variational quantum algorithms. PRX Quantum 2, 030324 (2021).
Liu, M., Li, Z., Wang, X., Yi, J. & Yang, T. Adaptive negative curvature descent with applications in non-convex optimization. In Advances in Neural Information Processing Systems, Vol. 31, 4854–4863 (NIPS, 2018).
Fletcher, R. & Reeves, C. M. Function minimization by conjugate gradients. Comput. J. 7, 149–154 (1964).
Yamasaki, H., Subramanian, S., Sonoda, S. & Koashi, M. Learning with optimized random features: exponential speedup by quantum machine learning without sparsity and low-rank assumptions. In Advances in Neural Information Processing Systems, Vol. 33, 13674–13687 (NeurIPS, 2020).
Yamasaki, H. & Sonoda, S. Exponential error convergence in data classification with optimized random features: Acceleration by quantum machine learning. Preprint at https://arxiv.org/abs/2106.09028 (2021).
Kerenidis, I. & Prakash, A. Quantum Recommendation Systems. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Vol. 67, 49:1–49:21 (ACM, 2017).

Acknowledgements

This work was supported by JST [Moonshot R&D][Grant Number JPMJMS2061], JSPS Overseas Research Fellowships, and JST PRESTO Grant Number JPMJPR201A.

Author information

Authors and Affiliations

Department of Applied Physics, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
Shiro Tamiya
Institute for Quantum Optics and Quantum Information—IQOQI Vienna, Austrian Academy of Sciences, Boltzmanngasse 3, 1090, Vienna, Austria
Hayata Yamasaki
Atominstitut, Technische Universität Wien, Stadionallee 2, 1020, Vienna, Austria
Hayata Yamasaki

Contributions

S.T. and H.Y. contributed to the initial conception of the ideas, to the working out of details, and to the writing and editing of the manuscript.

Corresponding authors

Correspondence to Shiro Tamiya or Hayata Yamasaki.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information — Stochastic Gradient Line Bayesian Optimization for Efficient Noise-Robust Optimization of Parameterized Quantum Circuits

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Tamiya, S., Yamasaki, H. Stochastic gradient line Bayesian optimization for efficient noise-robust optimization of parameterized quantum circuits. npj Quantum Inf 8, 90 (2022). https://doi.org/10.1038/s41534-022-00592-6

Received: 17 December 2021
Accepted: 23 June 2022
Published: 27 July 2022
DOI: https://doi.org/10.1038/s41534-022-00592-6
