Cost function dependent barren plateaus in shallow parametrized quantum circuits

Cerezo, M.; Sone, Akira; Volkoff, Tyler; Cincio, Lukasz; Coles, Patrick J.

doi:10.1038/s41467-021-21728-w

Article
Open access
Published: 19 March 2021

M. Cerezo ORCID: orcid.org/0000-0002-2757-3170^1,2,
Akira Sone^1,2,
Tyler Volkoff¹,
Lukasz Cincio¹ &
…
Patrick J. Coles¹

Nature Communications volume 12, Article number: 1791 (2021)

Abstract

Variational quantum algorithms (VQAs) optimize the parameters θ of a parametrized quantum circuit V(θ) to minimize a cost function C. While VQAs may enable practical applications of noisy quantum computers, they are nevertheless heuristic methods with unproven scaling. Here, we rigorously prove two results, assuming V(θ) is an alternating layered ansatz composed of blocks forming local 2-designs. Our first result states that defining C in terms of global observables leads to exponentially vanishing gradients (i.e., barren plateaus) even when V(θ) is shallow. Hence, several VQAs in the literature must revise their proposed costs. On the other hand, our second result states that defining C with local observables leads to at worst a polynomially vanishing gradient, so long as the depth of V(θ) is ${\mathcal{O}}(\mathrm{log}\,n)$. Our results establish a connection between locality and trainability. We illustrate these ideas with large-scale simulations, up to 100 qubits, of a quantum autoencoder implementation.

Introduction

One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications¹. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.

The leading strategy to make use of these devices is variational quantum algorithms (VQAs)². VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are three-fold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.

There are very few rigorous scaling results for VQAs (with the exception of one-layer approximate optimization^3,4,5). Ideally, in order to reduce the gate overhead that arises when implementing on quantum hardware, one would like to employ a hardware-efficient ansatz⁶ for V(θ). As recent large-scale implementations for chemistry⁷ and optimization⁸ applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients⁹. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.

The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑_αc_ασ_α, with H expanded as a sum of local Pauli operators¹⁰. In VQE, the cost function is obviously the energy $C=\left\langle \psi | H| \psi \right\rangle$ of the trial state $\left|\psi \right\rangle$. However, VQAs have been proposed for other applications, like quantum data compression¹¹, quantum error correction¹², quantum metrology¹³, quantum compiling^14,15,16,17, quantum state diagonalization^18,19, quantum simulation^20,21,22,23, fidelity estimation²⁴, unsampling²⁵, consistent histories²⁶, and linear systems^27,28,29. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.

We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as ${\mathcal{O}}(\mathrm{poly}\,(n))$. In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling³⁰, and this scaling can also apply to derivative-free optimization³¹. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term “trainable” for polynomially vanishing gradients.

In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. ^{11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29}, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state $\left|{\psi }_{0}\right\rangle$. A natural cost function is the square of the trace distance D_T between $\left|{\psi }_{0}\right\rangle$ and $\left|\psi \right\rangle =V{({\boldsymbol{\theta }})}^{\dagger }\left|{\boldsymbol{0}}\right\rangle$, given by ${C}_{{\rm{G}}}={D}_{\text{T}}{(\left|{\psi }_{0}\right\rangle ,\left|\psi \right\rangle )}^{2}$, which is equivalent to

$${C}_{{\rm{G}}}={\rm{Tr}}[{O}_{{\rm{G}}}V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \ \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$

(1)

with ${O}_{{\rm{G}}}={\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$. Note that $\sqrt{{C}_{{\rm{G}}}}\ge | \left\langle \psi | M| \psi \right\rangle -\left\langle {\psi }_{0}| M| {\psi }_{0}\right\rangle |$ has a nice operational meaning as a bound on the expectation value difference for a POVM element M.
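The equivalence between the squared trace distance and (1) can be checked in a few steps. The following is only a sketch; it assumes the standard identity for the trace distance between two pure states and uses $\left|\psi \right\rangle =V{({\boldsymbol{\theta }})}^{\dagger }\left|{\boldsymbol{0}}\right\rangle$ as defined above.

$${D}_{\text{T}}{(\left|{\psi }_{0}\right\rangle ,\left|\psi \right\rangle )}^{2}=1-| \left\langle {\psi }_{0}| \psi \right\rangle {| }^{2}=1-| \left\langle {\boldsymbol{0}}\right|V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle {| }^{2}={\rm{Tr}}[({\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|)V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \ \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }]\ .$$

The last step uses ${\rm{Tr}}[V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \ \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }]=1$, so the right-hand side is exactly C_G with ${O}_{{\rm{G}}}={\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$.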

However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., $\left|\psi \right\rangle$ and $\left|{\psi }_{0}\right\rangle$). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. ^{11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29}. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.

Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.⁹, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies^32,33 or clever PQC ansatzes^4,34,35. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.

While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group^14,18, and more recently in ref. ²⁹. This motivated the proposal of local cost functions^{14,16,18,22,25,26,27}, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning.

Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth ${\mathcal{O}}(\mathrm{log}\,n)$. Figure 1 summarizes our two main results.

Finally, we illustrate our main results for an important example: quantum autoencoders¹¹. Our large-scale numerics show that the global cost function proposed in ref. ¹¹ has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.

Results

Warm-up example

To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state being $\left|{\boldsymbol{0}}\right\rangle$. We assume a tensor-product ansatz of the form $V({\boldsymbol{\theta }}){ = \bigotimes }_{j = 1}^{n}{e}^{-i{\theta }^{j}{\sigma }_{x}^{(j)}/2}$, with the goal of finding the angles θ^j such that $V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle =\left|{\boldsymbol{0}}\right\rangle$. Employing the global cost of (1) results in ${C}_{{\rm{G}}}=1-\mathop{\prod }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}$. The barren plateau can be detected via the variance of its gradient: ${\rm{Var}}[\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}]=\frac{1}{8}{(\frac{3}{8})}^{n-1}$, which is exponentially vanishing in n. Since the mean value is $\left\langle \frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right\rangle =0$, the gradient concentrates exponentially around zero.

On the other hand, consider a local cost function:

$${C}_{{\rm{L}}}={\rm{Tr}}\left[{O}_{{\rm{L}}}V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|V{({\boldsymbol{\theta }})}^{\dagger }\right],$$

(2)

$${\,\text{with}\,}\quad {O}_{{\rm{L}}}={\mathbb{1}}-{\frac {1}{n}}\mathop{\sum }\limits_{j=1}^{n}\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\ ,$$

(3)

where ${{\mathbb{1}}}_{\overline{j}}$ is the identity on all qubits except qubit j. Note that C_L vanishes under the same conditions as C_G (refs. ^{14,16}), i.e., C_L = 0 ⇔ C_G = 0. We find ${C}_{{\rm{L}}}=1-\frac{1}{n}\mathop{\sum }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}$, and the variance of its gradient is ${\rm{Var}}[\frac{\partial {C}_{{\rm{L}}}}{\partial {\theta }^{j}}]=\frac{1}{8{n}^{2}}$, which vanishes polynomially with n and hence exhibits no barren plateau. Figure 2 depicts the cost landscapes of C_G and C_L for two values of n and shows that the barren plateau can be avoided here via a local cost function.
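The two closed-form variances of the warm-up example can be checked numerically. The sketch below is illustrative only: the function names, sample count, and seed are our own choices, and the partial derivatives are obtained by differentiating the expressions for C_G and C_L given above, with angles drawn uniformly from [−π, π].

```python
import numpy as np


def grad_global(theta, j):
    """dC_G/dtheta_j for C_G = 1 - prod_k cos^2(theta_k / 2)."""
    others = np.delete(theta, j, axis=-1)
    return 0.5 * np.sin(theta[..., j]) * np.prod(np.cos(others / 2) ** 2, axis=-1)


def grad_local(theta, j):
    """dC_L/dtheta_j for C_L = 1 - (1/n) sum_k cos^2(theta_k / 2)."""
    n = theta.shape[-1]
    return np.sin(theta[..., j]) / (2 * n)


def sample_variances(n, samples=200_000, seed=0):
    """Monte Carlo estimate of Var[dC_G/dtheta_0] and Var[dC_L/dtheta_0]."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi, np.pi, size=(samples, n))
    return grad_global(theta, 0).var(), grad_local(theta, 0).var()


if __name__ == "__main__":
    for n in (2, 4, 8):
        var_g, var_l = sample_variances(n)
        exact_g = (1 / 8) * (3 / 8) ** (n - 1)
        exact_l = 1 / (8 * n ** 2)
        print(n, var_g, exact_g, var_l, exact_l)
```

For larger n the Monte Carlo estimate of the global variance becomes noisy, because the quantity being estimated is itself exponentially small; this is the same sampling problem that makes the global cost hard to train.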

Moreover, this example allows us to delve deeper into the cost landscape to see a phenomenon that we refer to as a narrow gorge. While a barren plateau is associated with a flat landscape, a narrow gorge refers to the steepness of the valley that contains the global minimum. This phenomenon is illustrated in Fig. 2, where each dot corresponds to cost values obtained from randomly selected parameters θ. For C_G we see that very few dots fall inside the narrow gorge, while for C_L the narrow gorge is not present. Note that the narrow gorge makes it harder to train C_G since the learning rate of descent-based optimization algorithms must be exponentially small in order not to overstep the narrow gorge. The following proposition (proved in Supplementary Note 2) formalizes the narrow gorge for C_G and its absence for C_L by characterizing the dependence on n of the probability that C ⩽ δ. This probability is associated with the parameter space volume that leads to C ⩽ δ.

Proposition 1

Let θ^j be uniformly distributed on [−π, π] ∀j. For any δ ∈ (0, 1), the probability that C_G ≤ δ satisfies

$$\Pr \{{C}_{{\rm{G}}}\le \delta \}\le {(1-\delta )}^{-1}{\left(\frac{1}{2}\right)}^{n}.$$

(4)

For any $\delta \in [\frac{1}{2},1]$, the probability that C_L ≤ δ satisfies

$$\Pr \{{C}_{{\rm{L}}}\le \delta \}\ge \frac{{(2\delta -1)}^{2}}{\frac{1}{2n}+{(2\delta -1)}^{2}}\mathop{\longrightarrow }\limits_{n\to \infty }1\ .$$

(5)
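One way to see where the factor ${(1/2)}^{n}$ in (4) comes from is Markov's inequality applied to the product of cosines. This is only a sketch under the stated uniform distribution with independent angles; the proof in Supplementary Note 2 may proceed differently.

$$\Pr \{{C}_{{\rm{G}}}\le \delta \}=\Pr \left\{\mathop{\prod }\limits_{j=1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}\ge 1-\delta \right\}\le \frac{1}{1-\delta }\mathop{\prod }\limits_{j=1}^{n}\frac{1}{2\pi }\int\nolimits_{-\pi }^{\pi }{\cos }^{2}\frac{{\theta }^{j}}{2}\,d{\theta }^{j}={(1-\delta )}^{-1}{\left(\frac{1}{2}\right)}^{n}\ .$$

Each single-angle average equals 1/2, so the expected value of the product halves with every added qubit, which is what makes the region with C_G ≤ δ exponentially small.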

General framework

For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows

$$C={\rm{Tr}}\left[OV({\boldsymbol{\theta }})\rho {V}^{\dagger }({\boldsymbol{\theta }})\right]\ ,$$

(6)

where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where $\rho =\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$, which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA.

It is typical to express O as a linear combination of the form $O={c}_{0}{\mathbb{1}}+\mathop{\sum }\nolimits_{i = 1}^{N}{c}_{i}{O}_{i}$. Here O_i ≠ ${\mathbb{1}}$, ${c}_{i}\in {\mathbb{R}}$, and we assume that at least one c_i ≠ 0. Note that C_G and C_L in (1) and (2) fall under this framework. In our main results below, we will consider two different choices of O that respectively capture our general notions of global and local cost functions and also generalize the aforementioned C_G and C_L.

As shown in Fig. 3a, V(θ) consists of L layers of m-qubit unitaries W_kl(θ_kl), or blocks, acting on alternating groups of m neighboring qubits. We refer to this as an Alternating Layered Ansatz. We remark that the Alternating Layered Ansatz will be a hardware-efficient ansatz so long as the gates that compose each block are taken from a set of gates native to a specific device. As depicted in Fig. 3c, the one dimensional Alternating Layered Ansatz can be readily implemented in devices with one-dimensional connectivity, as well as in devices with two-dimensional connectivity (such as that of IBM’s³⁶ and Google’s³⁷ quantum devices). That is, with both one- and two-dimensional hardware connectivity one can group qubits to form an Alternating Layered Ansatz as in Fig. 3a.

The index l = 1, …, L in W_kl(θ_kl) indicates the layer that contains the block, while k = 1, …, ξ indicates the qubits it acts upon. We assume n is a multiple of m, with n = mξ, and that m does not scale with n. As depicted in Fig. 3a, we define S_k as the m-qubit subsystem on which W_kL acts, and we define ${\mathcal{S}}=\{{S}_{k}\}$ as the set of all such subsystems. Let us now consider a block W_kl(θ_kl) in the lth layer of the ansatz. For simplicity we henceforth use W to refer to a given W_kl(θ_kl). As shown in the Methods section, given a θ^ν ∈ θ_kl that parametrizes a rotation ${e}^{-i{\theta }^{\nu }{\sigma }_{\nu }/2}$ (with σ_ν a Pauli operator) inside a given block W, one can always express

$$\frac{\partial W}{\partial {\theta }^{\nu }}:={\partial }_{\nu }W=\frac{-i}{2}{W}_{\text{A}}{\sigma }_{\nu }{W}_{\text{B}},$$

(7)

where W_A and W_B contain all remaining gates in W, and are properly defined in the Methods section.
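Equation (7) can be checked numerically on a toy block. The sketch below assumes a single-qubit block of the form W = A e^{−iθσ_x/2} B with fixed random unitaries A and B, and takes W_A = A and W_B = e^{−iθσ_x/2} B; this particular split, the helper names, and the single-qubit setting are illustrative assumptions, and the precise definitions of W_A and W_B are those of the Methods section.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)


def rotation(theta, sigma):
    """exp(-i * theta * sigma / 2) for an operator sigma with sigma @ sigma = identity."""
    identity = np.eye(sigma.shape[0], dtype=complex)
    return np.cos(theta / 2) * identity - 1j * np.sin(theta / 2) * sigma


def random_unitary(dim, rng):
    """A random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q


def block(theta, left, right, sigma=SIGMA_X):
    """Toy block W(theta) = left @ exp(-i theta sigma / 2) @ right."""
    return left @ rotation(theta, sigma) @ right


def analytic_derivative(theta, left, right, sigma=SIGMA_X):
    """Right-hand side of Eq. (7) with the assumed split W_A = left, W_B = rotation @ right."""
    w_a = left
    w_b = rotation(theta, sigma) @ right
    return -0.5j * w_a @ sigma @ w_b


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a, b = random_unitary(2, rng), random_unitary(2, rng)
    theta, h = 0.7, 1e-6
    finite_diff = (block(theta + h, a, b) - block(theta - h, a, b)) / (2 * h)
    print(np.max(np.abs(finite_diff - analytic_derivative(theta, a, b))))
```

Because σ_x commutes with its own rotation, the rotation could equally be absorbed into W_A; the derivative is the same either way.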

The contribution to the gradient ∇C from a parameter θ^ν in the block W is given by the partial derivative ∂_νC. While the value of ∂_νC depends on the specific parameters θ, it is useful to compute ${\langle {\partial }_{\nu }C\rangle }_{V}$, i.e., the average gradient over all possible unitaries V(θ) within the ansatz. Such an average may not be representative near the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V(θ). In the Methods Section we explicitly show how to compute averages of the form 〈…〉_V, and in the Supplementary Note 3 we provide a proof for the following Proposition.

Proposition 2

The average of the partial derivative of any cost function of the form (6) with respect to a parameter θ^ν in a block W of the ansatz in Fig. 3 is

$${\langle {\partial }_{\nu }C\rangle }_{V}=0\ ,$$

(8)

provided that either W_A or W_B of (7) form a 1-design.

Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure up to the first t moments³⁸. The Methods section provides a formal definition of a t-design.
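As a concrete instance of a 1-design, the four single-qubit Pauli operators reproduce the first moment of the Haar measure: averaging U A U† over them gives Tr(A)·𝟙/2 for any 2×2 matrix A. The sketch below checks this standard fact numerically; it is an illustration chosen by us, not the ensemble used in the paper, and the function names are hypothetical.

```python
import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def pauli_twirl(a):
    """Average of U a U^dagger over the four single-qubit Pauli operators."""
    return sum(p @ a @ p.conj().T for p in PAULIS) / len(PAULIS)


def haar_first_moment(a):
    """First-moment Haar average of U a U^dagger, i.e. Tr(a) * identity / d."""
    d = a.shape[0]
    return np.trace(a) * np.eye(d, dtype=complex) / d


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    print(np.allclose(pauli_twirl(a), haar_first_moment(a)))
```

The Pauli operators alone do not match the second Haar moment, so they form a 1-design but not a 2-design; the main theorems below require the stronger 2-design assumption.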

Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:

$${\rm{Var}}[{\partial }_{\nu }C]={\left\langle {\left({\partial }_{\nu }C\right)}^{2}\right\rangle }_{V}\ ,$$

(9)

where we used the fact that ${\langle {\partial }_{\nu }C\rangle }_{V}=0$. The magnitude of Var[∂_νC] quantifies how much the partial derivative concentrates around zero, and hence small values in (9) imply that the slope of the landscape will typically be insufficient to provide a cost-minimizing direction. Specifically, from Chebyshev’s inequality, Var[∂_νC] bounds the probability that the cost-function partial derivative deviates from its mean value (of zero) as $\Pr \left(| {\partial }_{\nu }C| \ge c\right)\le {\rm{Var}}[{\partial }_{\nu }C]/{c}^{2}$ for all c > 0.
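To make the Chebyshev bound concrete, the sketch below evaluates it for the two warm-up variances quoted earlier, (1/8)(3/8)^{n−1} for the global cost and 1/(8n²) for the local cost. The function names and the chosen threshold c are illustrative assumptions; the bound is capped at 1 because a probability cannot exceed it.

```python
def warmup_variance_global(n):
    """Var[dC_G/dtheta_j] for the warm-up tensor-product example."""
    return (1 / 8) * (3 / 8) ** (n - 1)


def warmup_variance_local(n):
    """Var[dC_L/dtheta_j] for the warm-up tensor-product example."""
    return 1 / (8 * n ** 2)


def chebyshev_bound(variance, c):
    """Upper bound on Pr(|dC| >= c) for a zero-mean partial derivative, capped at 1."""
    if c <= 0:
        raise ValueError("c must be positive")
    return min(1.0, variance / c ** 2)


if __name__ == "__main__":
    c = 1e-2
    for n in (5, 10, 20, 40):
        print(
            n,
            chebyshev_bound(warmup_variance_global(n), c),
            chebyshev_bound(warmup_variance_local(n), c),
        )
```

For the global cost the bound on observing a partial derivative of size at least c shrinks exponentially with n, whereas for the local cost the bound stays uninformative (equal to 1) until n is of order 1/c, which leaves room for gradients of that size.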

Main results

Here we present our main theorems and corollaries, with the proofs sketched in the Methods and detailed in the Supplementary Information. In addition, in the Methods section we provide some intuition behind our main results by analyzing a generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap between the warm-up example and our main theorems and also showcases the tools used to derive our main result.

The following theorem provides an upper bound on the variance of the partial derivative of a global cost function which can be expressed as the expectation value of an operator of the form

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i1}\otimes {\widehat{O}}_{i2}\otimes \cdots \otimes {\widehat{O}}_{i\xi }\ .$$

(10)

Specifically, we consider two cases of interest: (i) When N = 1 and each ${\widehat{O}}_{1k}$ is a non-trivial projector (${\widehat{O}}_{1k}^{2}={\widehat{O}}_{1k}\ne {\mathbb{1}}$) of rank r_k acting on subsystem S_k, or (ii) When N is arbitrary and ${\widehat{O}}_{ik}$ is traceless with ${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\le {2}^{m}$ (for example, when ${\widehat{O}}_{ik}{ = \bigotimes }_{j = 1}^{m}{\sigma }_{j}^{\mu }$ is a tensor product of Pauli operators ${\sigma }_{j}^{\mu }\in \{{{\mathbb{1}}}_{j},{\sigma }_{j}^{x},{\sigma }_{j}^{y},{\sigma }_{j}^{z}\}$, with at least one ${\sigma }_{j}^{\mu }\,\ne\, {\mathbb{1}}$). Note that case (i) includes C_G of (1) as a special case.

Theorem 1

Consider a trainable parameter θ^ν in a block W of the ansatz in Fig. 3. Let Var[∂_νC] be the variance of the partial derivative of a global cost function C (with O given by (10)) with respect to θ^ν. If W_A, W_B of (7), and each block in V(θ) form a local 2-design, then Var[∂_νC] is upper bounded by

$${\rm{Var}}[{\partial }_{\nu }C]\;\leqslant\; {F}_{n}(L,l)\ .$$

(11)

(i) For N = 1 and when each ${\widehat{O}}_{1k}$ is a non-trivial projector, then defining $R=\mathop{\prod }\nolimits_{k = 1}^{\xi }{r}_{k}^{2}$, we have

$${F}_{n}(L,l)=\frac{{2}^{2m+(2m-1)(L-l)}}{({2}^{2m}-1)\cdot {3}^{\frac{n}{m}}\cdot {2}^{(2-\frac{3}{m})n}}{c}_{1}^{2}R\ .$$

(12)

(ii) For arbitrary N and when each ${\widehat{O}}_{ik}$ satisfies ${\rm{Tr}}[{\widehat{O}}_{ik}]=0$ and ${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}$, then

$${F}_{n}(L,l)=\frac{{2}^{2m(L-l+1)+1}}{{3}^{\frac{2n}{m}}\cdot {2}^{\left(3-\frac{4}{m}\right)n}}\mathop{\sum }\limits_{i,j=1}^{N}{c}_{i}{c}_{j}\ .$$

(13)
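The bounds (12) and (13) are easy to tabulate. The sketch below evaluates them in log₂ form to avoid overflow and underflow at large n; the function names and argument conventions are our own, and the code simply transcribes the two expressions as written above, using the identity Σ_{i,j} c_i c_j = (Σ_i c_i)².

```python
import math


def log2_f_projector(n, m, L, l, c1_sq_r=1.0):
    """log2 of F_n(L, l) in Eq. (12); c1_sq_r stands for the product c_1^2 * R."""
    return (
        2 * m
        + (2 * m - 1) * (L - l)
        - math.log2(2 ** (2 * m) - 1)
        - (n / m) * math.log2(3)
        - (2 - 3 / m) * n
        + math.log2(c1_sq_r)
    )


def log2_f_traceless(n, m, L, l, coeffs):
    """log2 of F_n(L, l) in Eq. (13); coeffs is the list of c_i."""
    total = sum(coeffs) ** 2  # sum over i, j of c_i * c_j
    if total == 0:
        return -math.inf
    return (
        2 * m * (L - l + 1)
        + 1
        - (2 * n / m) * math.log2(3)
        - (3 - 4 / m) * n
        + math.log2(total)
    )


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, log2_f_projector(n, m=2, L=2, l=1), log2_f_traceless(n, m=2, L=2, l=1, coeffs=[1.0]))
```

With m, L, l, and the coefficients held fixed, both logarithms decrease linearly in n, which is the exponential decay of the upper bound on the variance.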

From Theorem 1 we derive the following corollary.

Corollary 1

Consider the function F_n(L, l).

(i) Let N = 1 and let each ${\widehat{O}}_{1k}$ be a non-trivial projector, as in case (i) of Theorem 1. If ${c}_{1}^{2}R\in {\mathcal{O}}({2}^{n})$ and if the number of layers $L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$, then

$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left({2}^{-\left(1-\frac{1}{m}{\mathrm{log}\,}_{2}3\right)n}\right)\ ,$$

(14)

which implies that Var[∂_νC] is exponentially vanishing in n if m ⩾ 2.

(ii) Let N be arbitrary, and let each ${\widehat{O}}_{ik}$ satisfy ${\rm{Tr}}[{\widehat{O}}_{ik}]=0$ and ${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}$, as in case (ii) of Theorem 1. If $N\in {\mathcal{O}}({2}^{n})$, ${c}_{i}\in {\mathcal{O}}(1)$, and if the number of layers $L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$, then

$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left(\frac{1}{{2}^{\left(1-\frac{1}{m}\right)n}}\right)\ ,$$

(15)

which implies that Var[∂_νC] is exponentially vanishing in n if m ⩾ 2.
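The condition m ⩾ 2 in both parts follows from requiring a strictly positive decay rate in the exponents of (14) and (15). As a short check:

$$1-\frac{1}{m}{\mathrm{log}\,}_{2}3\,>\,0\iff m\,>\,{\mathrm{log}\,}_{2}3\approx 1.585\ ,\qquad 1-\frac{1}{m}\,>\,0\iff m\,>\,1\ .$$

Since m is an integer, both rates are positive exactly when m ⩾ 2; for m = 1 neither bound decays with n.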

Let us now make several important remarks. First, note that part (i) of Corollary 1 includes as a particular example the cost function C_G of (1). Second, part (ii) of this corollary also includes as particular examples operators with $N\in {\mathcal{O}}(1)$, as well as $N\in {\mathcal{O}}(\mathrm{poly}\,(n))$. Finally, we remark that F_n(L, l) becomes trivial when the number of layers L is Ω(poly(n)), however, as we discuss below, we can still find that Var[∂_νC_G] vanishes exponentially in this case.

Our second main theorem shows that barren plateaus can be avoided for shallow circuits by employing local cost functions. Here we consider m-local cost functions where each ${\widehat{O}}_{i}$ acts nontrivially on at most m qubits and (on these qubits) can be expressed as ${\widehat{O}}_{i}={\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }$:

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\ ,$$

(16)

where ${\widehat{O}}_{i}^{{\mu }_{i}}$ are operators acting on m/2 qubits which can be written as a tensor product of Pauli operators. Here, we assume the summation in Eq. (16) includes two possible cases as schematically shown in Fig. 3b: First, when ${\widehat{O}}_{i}^{{\mu }_{i}}$ (${\widehat{O}}_{i}^{\mu ^{\prime} }$) acts on the first (last) m/2 qubits of a given S_k, and second, when ${\widehat{O}}_{i}^{{\mu }_{i}}$ (${\widehat{O}}_{i}^{\mu ^{\prime} }$) acts on the last (first) m/2 qubits of a given S_k (S_k+1). This type of cost function includes any ultralocal cost function (i.e., where the ${\widehat{O}}_{i}$ are one-body) as in (2), and also VQE Hamiltonians with up to m/2 neighbor interactions. Then, the following theorem holds.

Theorem 2

Consider a trainable parameter θ^ν in a block W of the ansatz in Fig. 3. Let Var[∂_νC] be the variance of the partial derivative of an m-local cost function C (with O given by (16)) with respect to θ^ν. If W_A, W_B of (7), and each block in V(θ) form a local 2-design, then Var[∂_νC] is lower bounded by

$${G}_{n}(L,l)\;\leqslant\; {\rm{Var}}[{\partial }_{\nu }C]\ ,$$

(17)

with

$${G}_{n}(L,l)= \frac{{2}^{m(l+1)-1}}{{({2}^{2m}-1)}^{2}{({2}^{m}+1)}^{L+l}}\\ \times \mathop{\sum}\limits_{i\in {i}_{{\mathcal{L}}}}\mathop{\sum}\limits _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\ ,$$

(18)

where ${i}_{{\mathcal{L}}}$ is the set of i indices whose associated operators ${\widehat{O}}_{i}$ act on qubits in the forward light-cone ${\mathcal{L}}$ of W, and ${k}_{{{\mathcal{L}}}_{\text{B}}}$ is the set of k indices whose associated subsystems S_k are in the backward light-cone ${{\mathcal{L}}}_{\text{B}}$ of W. Here we defined the function $\epsilon (M)={D}_{\text{HS}}\left(M,{\rm{Tr}}(M){\mathbb{1}}/{d}_{M}\right)$ where D_HS is the Hilbert–Schmidt distance and d_M is the dimension of the matrix M. In addition, ${\rho }_{k,k^{\prime} }$ is the partial trace of the input state ρ down to the subsystems ${S}_{k}{S}_{k+1}...{S}_{k^{\prime} }$.
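The function ε(M) is straightforward to evaluate for small matrices. The sketch below assumes the convention D_HS(A, B) = Tr[(A − B)†(A − B)], i.e., the squared Hilbert–Schmidt norm of the difference; if the Methods section uses a different normalization (for example a square root), the values change accordingly. The function names are our own.

```python
import numpy as np


def hilbert_schmidt_distance(a, b):
    """Tr[(a - b)^dagger (a - b)]; the squared-norm convention is an assumption."""
    diff = a - b
    return float(np.real(np.trace(diff.conj().T @ diff)))


def epsilon(mat):
    """Distance between mat and Tr(mat) * identity / d, with d the dimension of mat."""
    d = mat.shape[0]
    return hilbert_schmidt_distance(mat, np.trace(mat) * np.eye(d) / d)


if __name__ == "__main__":
    pauli_z = np.diag([1.0, -1.0])
    pure_zero = np.diag([1.0, 0.0])
    maximally_mixed = np.eye(2) / 2
    print(epsilon(pauli_z), epsilon(pure_zero), epsilon(maximally_mixed))
```

Under this convention ε vanishes for any multiple of the identity, including the maximally mixed state, so the corresponding terms drop out of the sum in (18).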

Let us make a few remarks. First, note that the $\epsilon ({\widehat{O}}_{i})$ in the lower bound indicates that training V(θ) is easier when ${\widehat{O}}_{i}$ is far from the identity. Second, the presence of $\epsilon ({\rho }_{k,k^{\prime} })$ in G_n(L, l) implies that we have no guarantee on the trainability of a parameter θ^ν in W if ρ is maximally mixed on the qubits in the backwards light-cone.

From Theorem 2 we derive the following corollary for m-local cost functions, which guarantees the trainability of the ansatz for shallow circuits.

Corollary 2

Consider the function G_n(L, l). Let O be an operator of the form (16), as in Theorem 2. If at least one term ${c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})$ in the sum in (18) vanishes no faster than Ω(1/poly(n)), and if the number of layers L is ${\mathcal{O}}(\mathrm{log}\,(n))$, then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{\mathrm{poly}\,(n)}\right)\ .$$

(19)

On the other hand, if at least one term ${c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})$ in the sum in (18) vanishes no faster than ${{\Omega }}\left(1/{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}\right)$, and if the number of layers is ${\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$, then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}}\right)\ .$$

(20)

Hence, when L is ${\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$ there is a transition region where the lower bound vanishes faster than polynomially, but slower than exponentially.
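The depth dependence of the lower bound comes entirely from the prefactor in (18). The sketch below evaluates that prefactor and pairs it with a logarithmic depth schedule; the function names and the specific schedule L = ⌈log₂ n⌉ are illustrative assumptions, and the sum over the ε terms is left out.

```python
import math


def g_prefactor(m, L, l):
    """Prefactor 2^(m(l+1)-1) / ((2^(2m) - 1)^2 * (2^m + 1)^(L + l)) from Eq. (18)."""
    return 2 ** (m * (l + 1) - 1) / ((2 ** (2 * m) - 1) ** 2 * (2 ** m + 1) ** (L + l))


def log_depth(n):
    """Hypothetical depth schedule L = ceil(log2(n)), with a minimum of one layer."""
    return max(1, math.ceil(math.log2(n)))


if __name__ == "__main__":
    m, l = 2, 1
    for n in (4, 16, 64, 256):
        L = log_depth(n)
        print(n, L, g_prefactor(m, L, l))
```

Each extra layer multiplies the prefactor by 1/(2^m + 1), so with L = log₂ n the depth-dependent factor becomes n^{−log₂(2^m + 1)}, a polynomial in 1/n, consistent with (19).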

We finally justify the assumption of each block being a local 2-design from the fact that shallow circuit depths lead to such local 2-designs. Namely, it has been shown that one-dimensional 2-designs have efficient quantum circuit descriptions, requiring ${\mathcal{O}}({m}^{2})$ gates to be exactly implemented³⁸, or ${\mathcal{O}}(m)$ to be approximately implemented^39,40. Hence, an L-layered ansatz in which each block forms a 2-design can be exactly implemented with a depth $D\in {\mathcal{O}}({m}^{2}L)$, and approximately implemented with $D\in {\mathcal{O}}(mL)$. For the case of two-dimensional connectivity, it has been shown that approximate 2-designs require a circuit depth of ${\mathcal{O}}(\sqrt{m})$ to be implemented⁴⁰. Therefore, in this case the depth of the layered ansatz is $D\in {\mathcal{O}}(\sqrt{m}L)$. The latter shows that increasing the dimensionality of the circuit reduces the circuit depth needed to make each block a 2-design.

Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is ${\mathcal{O}}(n)$⁴⁰. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of ref. ⁹ for both local and global cost functions.

Numerical simulations

As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders^{11,41,42,43,44}. In particular, the pioneering VQA proposed in ref. ¹¹ has received significant literature attention, due to its importance to quantum machine learning and quantum data compression. Let us briefly explain the algorithm in ref. ¹¹.

Consider a bipartite quantum system AB composed of n_A and n_B qubits, respectively, and let $\{{p}_{\mu },|{\psi }_{\mu }\rangle \}$ be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state $|{\psi }_{\mu }\rangle$ with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ).

To quantify the degree of data compression, ref. ¹¹ proposed a cost function of the form:

$$C_{\rm{G}}^{\prime} =1-{\rm{Tr}}[\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|{\rho }_{\,\text{B}}^{\text{out}\,}]$$

(21)

$$={\rm{Tr}}[O_{\rm{G}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$

(22)

where ${\rho }_{\,\text{AB}}^{\text{in}\,}={\sum }_{\mu }{p}_{\mu }|{\psi }_{\mu }\rangle \ \langle {\psi }_{\mu }|$ is the ensemble-average input state, ${\rho }_{\,\text{B}}^{\text{out}\,}={\sum }_{\mu }{p}_{\mu }{{\rm{Tr}}}_{\text{A}}[|\psi ^{\prime} \rangle \ \langle \psi ^{\prime} |]$ is the ensemble-average trash state, and $\left|\psi ^{\prime} \right\rangle =V({\boldsymbol{\theta }})|{\psi }_{\mu }\rangle$. Equation (22) makes it clear that $C_{\rm{G}}^{\prime}$ has the form in (6), and $O_{\rm{G}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-{{\mathbb{1}}}_{\text{A}}\otimes \left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$ is a global observable of the form in (10). Hence, according to Corollary 1, $C_{\rm{G}}^{\prime}$ exhibits a barren plateau for large n_B. (Specifically, Corollary 1 applies in this context when n_A < n_B). As a result, large-scale data compression, where one is interested in discarding large numbers of qubits, will not be possible with $C_{\rm{G}}^{\prime}$.

To address this issue, we propose the following local cost function

$$C_{\rm{L}}^{\prime} =1-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\limits_{j=1}^{{n}_{\text{B}}}{\rm{Tr}}\left[\left(\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\right){\rho }_{\,\text{B}}^{\text{out}\,}\right]$$

(23)

$$={\rm{Tr}}[O_{\rm{L}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,\qquad\quad$$

(24)

where $O_{\rm{L}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\nolimits_{j = 1}^{{n}_{\text{B}}}{{\mathbb{1}}}_{\text{A}}\otimes \left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}$, and ${{\mathbb{1}}}_{\overline{j}}$ is the identity on all qubits in B except the jth qubit. As shown in the Supplementary Note 9, $C_{\rm{L}}^{\prime}$ satisfies $C_{\rm{L}}^{\prime} \;\leqslant\; C_{\rm{G}}^{\prime} \;\leqslant\; {n}_{\text{B}}C_{\rm{L}}^{\prime}$, which implies that $C_{\rm{L}}^{\prime}$ is faithful (vanishing under the same conditions as $C_{\rm{G}}^{\prime}$). Furthermore, note that $O_{\rm{L}}^{\prime}$ has the form in (16). Hence Corollary 2 implies that $C_{\rm{L}}^{\prime}$ does not exhibit a barren plateau for shallow ansatzes.
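Both autoencoder costs depend only on the computational-basis statistics of the trash register, since the observables involved are diagonal. The sketch below computes them from a probability vector over the 2^{n_B} bitstrings of B (entry x is the probability of bitstring x, with qubit j read from bit j of x) and checks the inequality quoted above on a random distribution. The function names and the bit-ordering convention are our own assumptions.

```python
import numpy as np


def global_cost(probs):
    """C'_G = 1 - probability that every trash qubit is measured in |0>."""
    return 1.0 - float(probs[0])


def local_cost(probs):
    """C'_L = 1 - average over trash qubits j of the probability that qubit j is in |0>."""
    n_b = len(probs).bit_length() - 1
    index = np.arange(len(probs))
    zero_probs = [probs[((index >> j) & 1) == 0].sum() for j in range(n_b)]
    return 1.0 - float(np.mean(zero_probs))


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n_b = 4
    probs = rng.dirichlet(np.ones(2 ** n_b))
    c_g, c_l = global_cost(probs), local_cost(probs)
    print(c_l, c_g, n_b * c_l)
```

The inequality has a simple reading in this picture: C'_G is the probability that at least one trash qubit is found in |1⟩, while n_B·C'_L is the sum of the single-qubit probabilities of |1⟩, so the upper bound is a union bound and the lower bound holds because each single-qubit event implies the global one.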

Here we simulate the autoencoder algorithm to solve a simple problem where n_A = 1, and where the input state ensemble $\{{p}_{\mu }, |{\psi }_{\mu } \rangle \}$ is given by

$$\left|{\psi }_{1}\right\rangle ={\left|0\right\rangle }_{\text{A}}\otimes {\left|0,0,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{1}=2/3\ ,$$

(25)

$$\left|{\psi }_{2}\right\rangle ={\left|1\right\rangle }_{\text{A}}\otimes {\left|1,1,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{2}=1/3\ .$$

(26)

In order to analyze the cost-function-dependent barren plateau phenomenon, the dimension of subsystem B is gradually increased as n_B = 10, 15, …, 100.

Numerical results

In our heuristics, the gate sequence V(θ) is given by two layers of the ansatz in Fig. 4, so that the number of gates and parameters in V(θ) increases linearly with n_B. Note that this ansatz is a simplified version of the ansatz in Fig. 3, as we can only generate unitaries with real coefficients. All parameters in V(θ) were randomly initialized and, as detailed in the Methods section, we employ a gradient-free training algorithm that gradually increases the number of shots per cost-function evaluation.

**Fig. 4: Alternating Layered Ansatz for V(θ) employed in our numerical simulations.**

Analysis of the n-dependence. Figure 5 shows representative results of our numerical implementations of the quantum autoencoder in ref. ¹¹ obtained by training V(θ) with the global and local cost functions respectively given by (22) and (23). Specifically, while we train with finite sampling, in the figures we show the exact cost-function values versus the number of iterations. Here, the top (bottom) axis corresponds to the number of iterations performed while training with $C_{\rm{G}}^{\prime}$ ($C_{\rm{L}}^{\prime}$). For n_B = 10 and 15, Fig. 5 shows that we are able to train V(θ) for both cost functions. For n_B = 20, the global cost function initially presents a plateau in which the optimizing algorithm is not able to determine a minimizing direction. However, as the number of shots per function evaluation increases, one can eventually minimize $C_{\rm{G}}^{\prime}$. Such a result indicates the presence of a barren plateau where the gradient takes small values which can only be detected when the number of shots becomes sufficiently large. In this particular example, one is able to start training at around 140 iterations.

**Fig. 5: Cost versus number of iterations for the quantum autoencoder problem defined by Eqs. (25)–(26).**

When n_B > 20 we are unable to train the global cost function, while always being able to train our proposed local cost function. Note that the number of iterations is different for $C_{\rm{G}}^{\prime}$ and $C_{\rm{L}}^{\prime}$, as for the global cost function case we reach the maximum number of shots in fewer iterations. These results indicate that the global cost function of (22) exhibits a barren plateau where the gradient of the cost function vanishes exponentially with the number of qubits, and which arises even for constant depth ansatzes. We remark that in principle one can always find a minimizing direction when training $C_{\rm{G}}^{\prime}$, although this would require a number of shots that increases exponentially with n_B. Moreover, one can see in Fig. 5 that randomly initializing the parameters always leads to $C_{\rm{G}}^{\prime} \approx 1$ due to the narrow gorge phenomenon (see Proposition 1), i.e., where the probability of being near the global minimum vanishes exponentially with n_B.

On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train $C_{\rm{L}}^{\prime}$ for all considered values of n_B. Moreover, as seen in Fig. 5, $C_{\rm{L}}^{\prime}$ can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation).

Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the dependence on L when training $C_{\rm{L}}^{\prime}$ for the autoencoder example with n_A = 1 and n_B = 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in L, although we remark that this behavior can saturate for very large L⁹.

**Fig. 6: Local cost $C_{\rm{L}}^{\prime}$ versus number of iterations for the quantum autoencoder problem in Eqs. (25–26) with n_B = 10.**

In summary, even though the ansatz employed in our heuristics is beyond the scope of our theorems, we still find cost-function-dependent barren plateaus, indicating that this phenomenon might be more general and go beyond our analytical results.

Discussion

While scaling results have been obtained for classical neural networks⁴⁵, very few such results exist for the trainability of parametrized quantum circuits, and more generally for quantum neural networks. Hence, rigorous scaling results are urgently needed for VQAs, which many researchers believe will provide the path to quantum advantage with near-term quantum computers. One of the few such results is the barren plateau theorem of ref. ⁹, which holds for VQAs with deep, hardware-efficient ansatzes.

In this work, we proved that the barren plateau phenomenon extends to VQAs with randomly initialized shallow Alternating Layered Ansatzes. The key to extending this phenomenon to shallow circuits was to consider the locality of the operator O that defines the cost function C. Theorem 1 presented a universal upper bound on the variance of the gradient for global cost functions, i.e., when O is a global operator. Corollary 1 stated the asymptotic scaling of this upper bound for shallow ansatzes as being exponentially decaying in n, indicating a barren plateau. Conversely, Theorem 2 presented a universal lower bound on the variance of the gradient for local cost functions, i.e., when O is a sum of local operators. Corollary 2 notes that for shallow ansatzes this lower bound decays polynomially in n. Taken together, these two results show that barren plateaus are cost-function-dependent, and they establish a connection between locality and trainability.

In the context of chemistry or materials science, our present work can inform researchers about which transformation to use when mapping a fermionic Hamiltonian to a spin Hamiltonian⁴⁶, i.e., Jordan–Wigner versus Bravyi–Kitaev⁴⁷. Namely, the Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost function. This fact was recently numerically confirmed⁴⁸.

Moreover, the fact that Corollary 2 is valid for arbitrary input quantum states may be useful when constructing variational ansatzes. For example, one could propose a growing ansatz method where one appends $\mathrm{log}\,(n)$ layers of the hardware-efficient ansatz to a previously trained (hence fixed) circuit. This could then lead to a layer-by-layer training strategy where the previously trained circuit can correspond to multiple layers of the same hardware-efficient ansatz.

We remark that our definition of a global operator (local operator) is one that is both non-local (local) and many body (few body). Therefore, the barren plateau phenomenon could be due to the many-bodiness of the operator rather than the non-locality of the operator; we leave the resolution of this question to future work. On the other hand, our Theorem 1 rules out the possibility that barren plateaus could be due to cardinality, i.e., the number of terms in O when decomposed as a sum of Pauli products⁴⁹. Namely, case (ii) of this theorem implies barren plateaus for O of essentially arbitrary cardinality, and hence cardinality is not the key variable at work here.

We illustrated these ideas for two example VQAs. In Fig. 2, we considered a simple state-preparation example, which allowed us to delve deeper into the cost landscape and uncover another phenomenon that we called a narrow gorge, stated precisely in Proposition 1. In Fig. 5, we studied the more important example of quantum autoencoders, which have generated significant interest in the quantum machine learning community. Our numerics showed the effects of barren plateaus: for more than 20 qubits we were unable to minimize the global cost function introduced in ref. ¹¹. To address this, we introduced a local cost function for quantum autoencoders, which we were able to minimize for system sizes of up to 100 qubits.

There are several directions in which our results could be generalized in future work. Naturally, we hope to extend the narrow gorge phenomenon in Proposition 1 to more general VQAs. In addition, we hope in the future to unify Theorems 1 and 2 into a single result that bounds the variance as a function of a parameter that quantifies the locality of O. This would further solidify the connection between locality and trainability. Moreover, our numerics suggest that our theorems (which are stated for exact 2-designs) might be extendable in some form to ansatzes composed of simpler blocks, like approximate 2-designs³⁹.

We emphasize that while our theorems are stated for a hardware-efficient ansatz and for costs that are of the form (6), it remains an interesting open question as to whether other ansatzes, cost functions, and architectures exhibit scaling behavior similar to that stated in our theorems. For instance, we have recently shown⁵⁰ that our results can be extended to a more general type of Quantum Neural Network called dissipative quantum neural networks⁵¹. Another potential example of interest could be the unitary-coupled cluster (UCC) ansatz in chemistry⁵², which is intended for use in the ${\mathcal{O}}(\mathrm{poly}\,(n))$ depth regime³⁴. Therefore it is important to study the key mathematical features of an ansatz that might allow one to go from trainability for ${\mathcal{O}}(\mathrm{log}\,n)$ depth (which we guarantee here for local cost functions) to trainability for ${\mathcal{O}}(\mathrm{poly}\,n)$ depth.

Finally, we remark that some strategies have been developed to mitigate the effects of barren plateaus^32,33,53,54. While these methods are promising and have been shown to work in certain cases, they are still heuristic methods with no provable guarantees that they can work in generic scenarios. Hence, we believe that more work needs to be done to better understand how to prevent, avoid, or mitigate the effects of barren plateaus.

Methods

In this section, we provide additional details for the results in the main text, as well as a sketch of the proofs for our main theorems. We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. More detailed proofs of our theorems are given in the Supplementary Information.

Variance of the cost function partial derivative

Let us first discuss the formulas we employed to compute Var[∂_νC]. We note that, without loss of generality, any block W_kl(θ_kl) in the Alternating Layered Ansatz can be written as a product of ζ_kl independent gates from a gate alphabet ${\mathcal{A}}=\{{G}_{\mu }(\theta )\}$ as

$${W}_{kl}({{\boldsymbol{\theta }}}_{kl})={G}_{{\zeta }_{kl}}({\theta }_{kl}^{{\zeta }_{kl}})\ldots {G}_{\nu }({\theta }_{kl}^{\nu })\ldots {G}_{1}({\theta }_{kl}^{1})\ ,$$

(27)

where each ${\theta }_{kl}^{\nu }$ is a continuous parameter. Here, ${G}_{\nu }({\theta }_{kl}^{\nu })={R}_{\nu }({\theta }_{kl}^{\nu }){Q}_{\nu }$ where Q_ν is an unparametrized gate and ${R}_{\nu }({\theta }_{kl}^{\nu })={e}^{-i{\theta }_{kl}^{\nu }{\sigma }_{\nu }/2}$ with σ_ν a Pauli operator. Note that W_kL denotes a block in the last layer of V(θ).
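The gate structure $G_\nu = R_\nu Q_\nu$ can be made concrete with a small numerical sketch. It relies only on the fact that a Pauli operator squares to the identity, which gives the rotation in closed form. The choice of a Pauli-X generator and a Hadamard for the unparametrized gate is purely illustrative, as are the function names.

```python
import numpy as np

PAULI_X = np.array([[0, 1], [1, 0]], dtype=complex)
HADAMARD = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # illustrative Q

def pauli_rotation(theta, sigma):
    # For a Pauli operator sigma @ sigma = identity, so
    # exp(-i*theta*sigma/2) = cos(theta/2) * I - i*sin(theta/2) * sigma.
    dim = sigma.shape[0]
    return np.cos(theta / 2) * np.eye(dim) - 1j * np.sin(theta / 2) * sigma

def gate(theta, sigma, q):
    # G(theta) = R(theta) Q, with Q a fixed (unparametrized) unitary
    return pauli_rotation(theta, sigma) @ q

theta = 0.7
g = gate(theta, PAULI_X, HADAMARD)

# The parameter enters only through R, so dG/dtheta = -(i/2) * sigma * G.
eps = 1e-6
numeric = (gate(theta + eps, PAULI_X, HADAMARD)
           - gate(theta - eps, PAULI_X, HADAMARD)) / (2 * eps)
analytic = -0.5j * PAULI_X @ g
print(np.allclose(numeric, analytic, atol=1e-7))
```

The derivative identity checked at the end is what produces the factor $i/2$ and the commutator with $\sigma_\nu$ in the partial derivative of the cost.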

For the proofs of our results, it is helpful to conceptually break up the ansatz as follows. Consider a block W_kl(θ_kl) in the lth layer of the ansatz. For simplicity, we henceforth use W to refer to a given W_kl(θ_kl). Let S_w denote the m-qubit subsystem that contains the qubits W acts on, and let ${S}_{\overline{w}}$ be the (n − m)-qubit subsystem on which W acts trivially. Similarly, let ${{\mathcal{H}}}_{w}$ and ${{\mathcal{H}}}_{\overline{w}}$ denote the Hilbert spaces associated with S_w and ${S}_{\overline{w}}$, respectively. Then, as shown in Fig. 3a, V(θ) can be expressed as

$$V({\boldsymbol{\theta }})={V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes W){V}_{{\rm{L}}}\ .$$

(28)

Here, ${{\mathbb{1}}}_{\overline{w}}$ is the identity on ${{\mathcal{H}}}_{\overline{w}}$, and V_R contains the gates in the (forward) light-cone ${\mathcal{L}}$ of W, i.e., all gates with at least one input qubit causally connected to the output qubits of W. The latter allows us to define ${S}_{{\mathcal{L}}}$ as the subsystem of all qubits in ${\mathcal{L}}$.

Let us here recall that the Alternating Layered Ansatz can be implemented with either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We remark that the following results are valid for both cases as the light-cone structure will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. Hence, there is no need to refer to the connectivity dimension in what follows.

Let us now assume that θ^ν is a parameter inside a given block W. Then, from (6), (27), and (28) we obtain

$${\partial }_{\nu }C= \frac{i}{2}{\rm{Tr}}\left[({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{B}}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{B}\,}^{\dagger })\right.\\ \left.\times [{{\mathbb{1}}}_{\overline{w}}\otimes {\sigma }_{\nu },({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{A}\,}^{\dagger }){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{A}})]\right]\ ,$$

(29)

with

$${W}_{\text{B}}=\mathop{\prod }\limits_{\mu =1}^{\nu -1}{G}_{\mu }({\theta }^{\mu })\ ,\quad \,{\text{and}}\,\quad {W}_{\text{A}}=\mathop{\prod }\limits_{\mu =\nu }^{\zeta }{G}_{\mu }({\theta }^{\mu })\ .$$

(30)
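The commutator structure of (29) can be checked against finite differences in a toy setting. The sketch below is not the full ansatz: it assumes a single Pauli rotation sandwiched between two fixed unitaries (`u_before` and `u_after`, hypothetical stand-ins for the gates preceding and following the rotation), a two-qubit system, and a cost of the form Tr[O V ρ V†].

```python
import numpy as np

def random_unitary(dim, rng):
    # Any unitary suffices for this check; QR of a complex Gaussian matrix gives one.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

def rotation(theta, sigma):
    return np.cos(theta / 2) * np.eye(sigma.shape[0]) - 1j * np.sin(theta / 2) * sigma

def cost(theta, rho, obs, u_before, u_after, sigma):
    v = u_after @ rotation(theta, sigma) @ u_before
    return float(np.real(np.trace(obs @ v @ rho @ v.conj().T)))

def grad_commutator(theta, rho, obs, u_before, u_after, sigma):
    w_a = u_after @ rotation(theta, sigma)       # gates from the rotation onward
    rho_b = u_before @ rho @ u_before.conj().T   # state just before the rotation
    a = w_a.conj().T @ obs @ w_a                 # observable pulled back to that point
    return float(np.real(0.5j * np.trace(rho_b @ (sigma @ a - a @ sigma))))

rng = np.random.default_rng(5)
dim = 4
sigma = np.kron(np.diag([1.0, -1.0]), np.eye(2)).astype(complex)  # Z on the first qubit
rho = np.zeros((dim, dim), dtype=complex)
rho[0, 0] = 1.0                                  # input state |00><00|
obs = np.eye(dim, dtype=complex) - rho           # a projector-type observable
u_before, u_after = random_unitary(dim, rng), random_unitary(dim, rng)

theta, eps = 0.3, 1e-6
finite_diff = (cost(theta + eps, rho, obs, u_before, u_after, sigma)
               - cost(theta - eps, rho, obs, u_before, u_after, sigma)) / (2 * eps)
exact = grad_commutator(theta, rho, obs, u_before, u_after, sigma)
print(abs(finite_diff - exact) < 1e-6)
```

The trace of a density matrix against a commutator of Hermitian operators is purely imaginary, so the prefactor $i/2$ makes the derivative real, as it must be.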

Finally, from (29) we can derive a general formula for the variance:

$${\rm{Var}}[{\partial }_{\nu }C]=\frac{{2}^{m-1}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{({2}^{2m}-1)}^{2}}\mathop{\sum} _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }}{\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}{\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}$$

(31)

which holds if W_A and W_B form independent 2-designs. Here, the summation runs over all bitstrings p, q, ${\boldsymbol{p}}^{\prime}$, ${\boldsymbol{q}}^{\prime}$ of length ${2}^{n-m}$. In addition, we defined

$${{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }={\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]}{{2}^{m}}\ ,$$

(32)

$${{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }={\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}]{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]}{{2}^{m}}\ ,$$

(33)

where ${{\rm{Tr}}}_{\overline{w}}$ indicates the trace over subsystem ${S}_{\overline{w}}$, and Ω_qp and Ψ_pq are operators on ${{\mathcal{H}}}_{w}$ defined as

$${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}\right]\ ,$$

(34)

$${{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }\right]\ .$$

(35)

We derive Eq. (31) in the Supplementary Note 4.

Computing averages over V

Here we introduce the main tools employed to compute quantities of the form 〈…〉_V. These tools are used throughout the proofs of our main results.

Let us first remark that if the blocks in V(θ) are independent, then any average over V can be computed by averaging over the individual blocks, i.e., ${\langle \ldots \rangle }_{V}={\langle \ldots \rangle }_{{W}_{11},\ldots ,{W}_{kl},\ldots }={\langle \ldots \rangle }_{{V}_{{\rm{L}}},W,{V}_{\text{R}}}$. For simplicity let us first consider the expectation value over a single block W in the ansatz. In principle 〈…〉_W can be approximated by varying the parameters in W and sampling over the resulting 2^m × 2^m unitaries. However, if W forms a t-design, this procedure can be simplified as it is known that sampling over its distribution yields the same properties as sampling random unitaries from the unitary group with respect to the unique normalized Haar measure.

Explicitly, the Haar measure is a uniquely defined left and right-invariant measure over the unitary group dμ(W), such that for any unitary matrix A ∈ U(2^m) and for any function f(W) we have

$${\int}_{U({2}^{m})}d\mu (W)f(W) =\int d\mu (W)f(AW)\\ =\int d\mu (W)f(WA)\ ,$$

(36)

where the integration domain is assumed to be U(2^m) throughout this work. Consider a finite set ${\{{W}_{y}\}}_{y\in Y}$ (of size ∣Y∣) of unitaries W_y, and let P_(t, t)(W) be an arbitrary polynomial of degree at most t in the matrix elements of W and at most t in those of W^†. Then, this finite set is a t-design if³⁸

$${\langle {P}_{(t,t)}(W)\rangle }_{w} =\frac{1}{| Y| }\cdot \mathop{\sum}\limits _{y\in Y}{P}_{(t,t)}({W}_{y})\\ =\int d\mu (W){P}_{(t,t)}(W)\ .$$

(37)
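As a concrete instance of the t-design condition (37) with t = 1, the four single-qubit Pauli matrices are known to form a unitary 1-design: averaging $W X W^{\dagger}$ over them reproduces the Haar average ${\rm{Tr}}[X]\,{\mathbb{1}}/2$, which follows from the first-moment formula. The sketch below checks this for a random matrix. It is an illustration only, since the theorems in this work require 2-designs.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]

def twirl(op, unitaries):
    # Finite-set average (1/|Y|) * sum_y W_y op W_y^dagger, which is a polynomial
    # of degree 1 in the entries of W and degree 1 in those of W^dagger.
    return sum(w @ op @ w.conj().T for w in unitaries) / len(unitaries)

rng = np.random.default_rng(1)
op = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))  # arbitrary test operator

averaged = twirl(op, PAULIS)
haar_value = np.trace(op) / 2 * I2   # Haar average of W op W^dagger in dimension 2
print(np.allclose(averaged, haar_value))
```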

From the general form of C in Eq. (6) we can see the cost function is a polynomial of degree at most 2 in the matrix elements of each block W_kl in V(θ), and at most 2 in those of ${({W}_{kl})}^{\dagger }$. Then, if a given block W forms a 2-design, one can employ the following elementwise formula of the Weingarten calculus^55,56 to explicitly evaluate averages over W up to the second moment:

$$\begin{array}{ccc}\int d\mu (W){w}_{{\boldsymbol{i}}{\boldsymbol{j}}}{w}_{{\boldsymbol{i}}^{\prime} {\boldsymbol{j}}^{\prime} }^{* }&=&\frac{{\delta }_{{\boldsymbol{i}}{\boldsymbol{i}}^{\prime} }{\delta }_{{\boldsymbol{j}}{\boldsymbol{j}}^{\prime} }}{{2}^{m}}\\ \int d\mu (W){w}_{{{\boldsymbol{i}}}_{1}{{\boldsymbol{j}}}_{1}}{w}_{{{\boldsymbol{i}}}_{2}{{\boldsymbol{j}}}_{2}}{w}_{{\boldsymbol{i}}_1^{\prime} {\boldsymbol{j}}_1^{\prime} }^{* }{w}_{{\boldsymbol{i}}_2^{\prime} {\boldsymbol{j}}_2^{\prime} }^{* }&=&\frac{1}{{2}^{2m}-1}\left({{{\Delta }}}_{1}-\frac{{{{\Delta }}}_{2}}{{2}^{m}}\right)\end{array}$$

(38)

where w_ij are the matrix elements of W, and

$${{{\Delta }}}_{1} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }\ ,\\ {{{\Delta }}}_{2} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }\ .$$

(39)
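The moment formulas (38) and (39) can be spot-checked by Monte Carlo sampling. Setting all indices equal in (38) gives $\langle |w_{00}|^{2}\rangle = 1/2^{m}$ and, since then $\Delta_1 = \Delta_2 = 2$, $\langle |w_{00}|^{4}\rangle = 2/[2^{m}(2^{m}+1)]$. The sketch below draws Haar-random unitaries using a standard QR-based construction. The sampler, function names, seed, and sample size are assumptions made for illustration.

```python
import numpy as np

def haar_unitary(dim, rng):
    # QR of a complex Gaussian matrix; rescaling each column by the phase of the
    # corresponding diagonal entry of R makes the result Haar distributed.
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))

def estimate_moments(m, samples, rng):
    # Sample averages of |w_00|^2 and |w_00|^4 over m-qubit Haar-random unitaries
    dim = 2 ** m
    w00 = np.array([haar_unitary(dim, rng)[0, 0] for _ in range(samples)])
    p = np.abs(w00) ** 2
    return float(p.mean()), float((p ** 2).mean())

m = 2
dim = 2 ** m
first, second = estimate_moments(m, 20000, np.random.default_rng(7))
exact_first = 1 / dim                  # first line of (38) with all indices equal
exact_second = 2 / (dim * (dim + 1))   # second line of (38) with all indices equal
print(first, exact_first)
print(second, exact_second)
```

The sampled values agree with the exact ones only up to statistical error, which shrinks as the inverse square root of the number of samples.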

Intuition behind the main results

The goal of this section is to provide some intuition for our main results. Specifically, we show here how the scaling of the cost function variance can be related to the number of blocks we have to integrate to compute ${\langle \cdots \rangle }_{{V}_{\text{R}},{V}_{{\rm{L}}}}$, the locality of the cost functions, and the number of layers in the ansatz.

First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order $1/{2}^{2m}$. Hence, we see that the more blocks one integrates over, the worse the scaling can be.

We now generalize the warm-up example. Let V(θ) be a single layer of the alternating ansatz of Fig. 3, i.e., V(θ) is a tensor product of m-qubit blocks W_k: = W_k1, with k = 1, …, ξ (and with ξ = n/m), and let θ^ν be a parameter in the block ${W}_{k^{\prime} }$. In the Supplementary Note 5 we generalize this scenario to the case when the input state is not $\left|{\boldsymbol{0}}\right\rangle$, but instead is an arbitrary state ρ.

From (31), the variance of the partial derivative of the global cost function in (1) can be expressed as

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{G}}}]=\upsilon \ \prod _{k\ne k^{\prime} }\ {\left\langle {\rm{Tr}}{\left[\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{k}}$$

(40)

where $\upsilon =\frac{{({2}^{m}-1)}^{2}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{2}^{2m}{({2}^{m+1}-1)}^{2}}$. From (40) we see that computing the variance requires integrating over ξ − 1 blocks. Then, since each integration leads to a coefficient of order $1/{2}^{2m}$, the variance will scale as ${\mathcal{O}}\left(1/{({2}^{2m})}^{\xi -1}\right)={\mathcal{O}}(1/{2}^{2n})$. Hence, the scaling of the variance gets worse for each block we integrate (such that the block acts on qubits we are measuring).
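The exponential decay implied by (40) can be tabulated exactly. Each factor in the product is the Haar average of $|\langle 0|W_k|0\rangle|^{4}$, which by (38) with all indices equal is $2/[2^{m}(2^{m}+1)]$. The sketch below assumes every block forms a 2-design and omits the prefactor υ. The function names are illustrative.

```python
def block_average(m):
    # Haar average of |<0|W|0>|^4 for an m-qubit block: 2 / (d * (d + 1)), d = 2^m.
    d = 2 ** m
    return 2.0 / (d * (d + 1))

def global_variance_factor(n, m):
    # Product over the xi - 1 = n/m - 1 blocks other than the one containing theta^nu.
    # The prefactor upsilon is deliberately left out.
    xi = n // m
    return block_average(m) ** (xi - 1)

for n in (4, 8, 16, 32):
    print(n, global_variance_factor(n, m=2))
```

For m = 2 each additional block multiplies the variance by 1/10, so doubling n roughly squares the suppression factor, in line with the $1/2^{2n}$ scaling stated above.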

On the other hand, for a local cost let us consider a single term in (3) where $j\in {S}_{\tilde{k}}$, so that

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{L}}}]\ \propto \ \frac{\upsilon }{{n}^{2}}\ {\left\langle {\rm{Tr}}{\left[(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}){W}_{\tilde{k}}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{\tilde{k}}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{\tilde{k}}}\ .$$

(41)

Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance scales as ${\mathcal{O}}(1/{n}^{2})$, where we remark that the scaling is essentially given by the prefactor 1/n² in (3).

Let us now briefly provide some intuition as to why the scaling of local cost gradients becomes exponentially vanishing with the number of layers as in Theorem 2. Consider the case when V(θ) contains L layers of the ansatz in Fig. 3. Moreover, as shown in Fig. 7, let W be in the first layer, and let O_i act on the m topmost qubits of ${\mathcal{L}}$. As schematically depicted in Fig. 7, we now have to integrate over L − 1 blocks. Then, as proved in the Supplementary Note 5, integrating over a block leads to a coefficient ${2}^{m/2}/({2}^{m}+1)$. Hence, after integrating L − 1 times, we obtain a coefficient ${2}^{m(L-1)/2}/{({2}^{m}+1)}^{L-1}$ which vanishes no faster than ${{\Omega }}\left(1/\mathrm{poly}\,(n)\right)$ if $mL\in {\mathcal{O}}(\mathrm{log}\,(n))$.
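To see why a logarithmic depth keeps this coefficient polynomial, note that $2^{m}+1\le 2^{m+1}$, so each integration contributes at least $2^{-(m/2+1)}$. With $mL=\log_2 n$ the accumulated coefficient is then at least $n^{-(1/2+1/m)}$. The sketch below evaluates both quantities. The specific choice $mL=\log_2 n$ and the function names are assumptions made for illustration.

```python
def local_coefficient(m, num_layers):
    # (2^(m/2) / (2^m + 1))^(L - 1): factor accumulated after integrating L - 1 blocks
    return (2 ** (m / 2) / (2 ** m + 1)) ** (num_layers - 1)

def polynomial_floor(n, m):
    # Lower bound n^-(1/2 + 1/m), valid when m * L = log2(n)
    return n ** -(0.5 + 1.0 / m)

m = 2
for num_layers in (1, 2, 3, 4, 5):
    n = 2 ** (m * num_layers)   # chosen so that m * L = log2(n) exactly
    print(n, local_coefficient(m, num_layers), polynomial_floor(n, m))
```

In every row the coefficient stays above the polynomial floor, whereas at fixed n it would decay exponentially if L were increased without bound.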

**Fig. 7: The block W is in the first layer of V(θ), and the operator O_i acts on the topmost m qubits in the forward light-cone ${\mathcal{L}}$ of W.**

As we discuss below, for more general scenarios the computation of Var[∂_νC] becomes more complex.

Sketch of the proof of the main theorems

Here we present a sketch of the proof of Theorems 1 and 2. We refer the reader to the Supplementary Information for a detailed version of the proofs.

As mentioned in the previous subsection, if each block in V(θ) forms a local 2-design, then we can explicitly calculate expectation values 〈…〉_W via (38). Hence, to compute ${\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}$ and ${\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}$ in (31), one needs to algorithmically integrate over each block using the Weingarten calculus. In order to make such a computation tractable, we employ the tensor network representation of quantum circuits.

For the sake of clarity, we recall that any two-qubit gate can be expressed as $U={\sum }_{ijkl}{U}_{ijkl}\left|ij\right\rangle \left\langle kl\right|$, where U_ijkl is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the ansatz can be considered as a ${2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}$ tensor. As schematically shown in Fig. 8a, one can use the circuit description of ${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$ and Ψ_pq to derive the tensor network representation of terms such as ${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$. Here, ${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$ is obtained from (34) by simply replacing O with O_i.
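The index bookkeeping behind this tensor picture can be illustrated with NumPy. In the sketch below a 4 × 4 matrix is reshaped into a 2 × 2 × 2 × 2 tensor with entries $U_{ijkl}=\langle ij|U|kl\rangle$, and both applying a gate to a state and composing two gates become index contractions. Random matrices are used since only the indexing matters here, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
v = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
psi = rng.normal(size=4) + 1j * rng.normal(size=4)

u_tensor = u.reshape(2, 2, 2, 2)   # u_tensor[i, j, k, l] = <ij|U|kl>
v_tensor = v.reshape(2, 2, 2, 2)
psi_tensor = psi.reshape(2, 2)     # psi_tensor[k, l] = <kl|psi>

# Applying U to a two-qubit state: contract the input indices (k, l).
out_tensor = np.einsum("ijkl,kl->ij", u_tensor, psi_tensor)
out_matrix = (u @ psi).reshape(2, 2)

# Composing two gates: contract the output pair of U with the input pair of V.
vu_tensor = np.einsum("ijab,abkl->ijkl", v_tensor, u_tensor)

print(np.allclose(out_tensor, out_matrix))
print(np.allclose(vu_tensor.reshape(4, 4), v @ u))
```

A trace such as ${\rm{Tr}}[\Omega \Omega^{\prime}]$ corresponds to closing the remaining open indices of such a network in a loop.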

**Fig. 8: Tensor-network representations of the terms relevant to Var[∂_νC].**

In Fig. 8b we depict an example where we employ the tensor network representation of ${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$ to compute the average of ${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$, and ${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$. As expected, after each integration one obtains a sum of four new tensor networks according to Eq. (38).

Proof of Theorem 2

Let us first consider an m-local cost function C where O is given by (16), and where ${\widehat{O}}_{i}$ acts nontrivially in a given subsystem S_k of ${\mathcal{S}}$. In particular, when ${\widehat{O}}_{i}$ is of this form the proof is simplified, although the more general proof is presented in the Supplementary Note 6. If ${S}_{k}\not\subset {S}_{{\mathcal{L}}}$ we find ${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\propto {{\mathbb{1}}}_{w}$, and hence

$${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]}{{2}^{m}}=0\ .$$

(42)

The latter implies that we only have to consider the operators ${\widehat{O}}_{i}$ which act on qubits inside of the forward light-cone ${\mathcal{L}}$ of W.

Then, as shown in the Supplementary Information

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]}{{2}^{m}}\right\rangle }_{{V}_{\text{R}}}\propto \epsilon ({\widehat{O}}_{i})\ .$$

(43)

Here we remark that the proportionality factor contains terms of the form ${\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}$ (where ${S}_{\overline{w}}^{+}\cup {S}_{\overline{w}}^{-}={S}_{\overline{w}}$), which arise from the different tensor contractions of ${P}_{{\boldsymbol{p}}{\boldsymbol{q}}}=\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|$ in Fig. 8c. It is then straightforward to show that

$$\mathop{\sum}\limits _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }} {\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}{\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}\\ ={\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}\ ,$$

(44)

where we define ${\tilde{\rho }}^{-}$ as the reduced state of $\tilde{\rho }={V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }$ in the Hilbert space associated with the subsystem ${S}_{w}\cup {S}_{\overline{w}}^{-}$. Here we recall that D_HS is the Hilbert–Schmidt distance ${D}_{\text{HS}}\left(\rho ,\sigma \right)={\rm{Tr}}[{(\rho -\sigma )}^{2}]$.
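For reference, the Hilbert–Schmidt distance is straightforward to evaluate numerically. The sketch below implements ${\rm{Tr}}[(\rho-\sigma)^{2}]$ and the special case of the distance to the maximally mixed state, which reduces to ${\rm{Tr}}[\rho^{2}]-1/d$ in dimension d. The function names and the single-qubit test states are illustrative.

```python
import numpy as np

def hilbert_schmidt_distance(rho, sigma):
    # D_HS(rho, sigma) = Tr[(rho - sigma)^2]
    diff = rho - sigma
    return float(np.real(np.trace(diff @ diff)))

def distance_to_maximally_mixed(rho):
    # Equals Tr[rho^2] - 1/d, i.e. the purity of rho minus its minimal value
    dim = rho.shape[0]
    return hilbert_schmidt_distance(rho, np.eye(dim) / dim)

zero = np.array([[1, 0], [0, 0]], dtype=complex)    # pure state |0><0|
pure = distance_to_maximally_mixed(zero)            # 1 - 1/2
mixed = distance_to_maximally_mixed(np.eye(2) / 2)  # vanishes
print(pure, mixed)
```

This makes explicit why the quantity vanishes exactly when the reduced state carries no information, which is the situation in which the gradient signal is lost.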

By employing properties of D_HS one can show (see Supplementary Note 6)

$${D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\ge \frac{{D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)}{{2}^{m(L-l+2)/2}}\ ,$$

(45)

where ${\tilde{\rho }}_{w}={{\rm{Tr}}}_{{S}_{\overline{w}}^{-}}[{\tilde{\rho }}^{-}]$. We can then leverage the tensor network representation of quantum circuits to algorithmically integrate over each block in V_L and compute ${\langle {D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\rangle }_{{V}_{{\rm{L}}}}$. One finds

$${\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}=\mathop{\sum} _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{t}_{k,k^{\prime} }\epsilon ({\rho }_{k,k^{\prime} })\ ,$$

(46)

with ${t}_{k,k^{\prime} }\geqslant \frac{{2}^{ml}}{{({2}^{m}+1)}^{2l}}\ \forall k,k^{\prime}$, and $\epsilon ({\rho }_{k,k^{\prime} })$ defined in Theorem 2. Combining these results leads to Theorem 2. Moreover, as detailed in the Supplementary Information, Theorem 2 is also valid when O is of the form (16).

Proof of Theorem 1

Let us now provide a sketch of the proof of Theorem 1, case (i). Here we denote for simplicity ${\widehat{O}}_{k}:={\widehat{O}}_{1k}$. We leave the proof of case (ii) for the Supplementary Note 7. In this case there are now operators O_i which act outside of the forward light-cone ${\mathcal{L}}$ of W. Hence, it is convenient to include in V_R not only all the gates in ${\mathcal{L}}$ but also all the blocks in the final layer of V(θ) (i.e., all blocks W_kL, with k = 1, …, ξ). We can define ${S}_{\overline{{\mathcal{L}}}}$ as the complement of ${S}_{{\mathcal{L}}}$, i.e., as the subsystem of all qubits which are not in ${\mathcal{L}}$ (with associated Hilbert space ${{\mathcal{H}}}_{\overline{{\mathcal{L}}}}$). Then, we have ${V}_{\text{R}}={V}_{{\mathcal{L}}}\otimes {V}_{\overline{{\mathcal{L}}}}$ and $\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|=\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}\otimes \left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}$, where we define ${V}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k\in {k}_{\overline{{\mathcal{L}}}}}{W}_{kL}$, $\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}:={\bigotimes }_{k\in {k}_{{\mathcal{L}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k}$, and $\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k^{\prime} }$. Here, we define ${k}_{{\mathcal{L}}}:=\{k:{S}_{k}\subseteq {S}_{{\mathcal{L}}}\}$ and ${k}_{\overline{{\mathcal{L}}}}:=\{k:{S}_{k}\subseteq {S}_{\overline{{\mathcal{L}}}}\}$, which are the sets of indices whose associated qubits are inside and outside ${\mathcal{L}}$, respectively.
We also write $O={c}_{0}{\mathbb{1}}+{c}_{1}{\hat{O}}_{{\mathcal{L}}}\otimes {\hat{O}}_{\overline{{\mathcal{L}}}}$, where we define ${\hat{O}}_{{\mathcal{L}}}:{ = \bigotimes }_{k\in {k}_{{\mathcal{L}}}}{\widehat{O}}_{k}$ and ${\hat{O}}_{\overline{{\mathcal{L}}}}:{ = \bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}{\widehat{O}}_{k^{\prime} }$.

Using the fact that the blocks in V(θ) are independent, we can now compute ${\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}={\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\overline{{\mathcal{L}}}},{V}_{{\mathcal{L}}}}$. Then, from the definition of Ω_pq in Eq. (34), one can express

$$\begin{aligned}{\left\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\right\rangle }_{{V}_{\text{R}}}= &\, {\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}\\ &\times \left(\prod\limits_{k\in {k}_{\overline{{\mathcal{L}}}}}{\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\right)\ ,\end{aligned}$$

(47)

with

$$\begin{aligned}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}} &={{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\mathcal{L}}}^{\dagger }{O}^{{\mathcal{L}}}{V}_{{\mathcal{L}}}\right]\ ,\\ {{{\Omega }}}_{k} &={\rm{Tr}}\left[\left|{\boldsymbol{p}}\right\rangle {\left\langle {\boldsymbol{q}}\right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]{\rm{Tr}}\left[\left|{\boldsymbol{p}}^{\prime} \right\rangle {\left\langle {\boldsymbol{q}}^{\prime} \right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]\ ,\end{aligned}$$

and where ${{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}$ indicates the partial trace over the Hilbert space associated with the qubits in ${S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}$. As detailed in the Supplementary Information, we can use Eq. (38) to show that

$${\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\leqslant \frac{{r}_{k}^{2}\left({\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}+{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{k}}}\right)}{{2}^{2m}-1}\ .$$

(48)
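The Kronecker-delta structure of Eq. (48) can be checked numerically. The sketch below is an illustration only, not the authors' code. It estimates a Haar average of the same form as the average of Ω_k by Monte Carlo sampling and compares it with the standard second-moment identity for a traceless operator O on a d-dimensional space, $\int dW\,{\rm{Tr}}[A{W}^{\dagger }OW]\,{\rm{Tr}}[B{W}^{\dagger }OW]=\frac{{\rm{Tr}}[{O}^{2}]}{{d}^{2}-1}\left({\rm{Tr}}[AB]-\frac{{\rm{Tr}}[A]{\rm{Tr}}[B]}{d}\right)$. For A = |p⟩⟨q| and B = |p′⟩⟨q′| the two traces reduce to the delta products appearing in Eq. (48). The assumptions here are that W is drawn exactly from the Haar measure (a 2-design gives the same second moment), that the block acts on a single qubit (m = 1), and that O is the Pauli Z operator. All function names are hypothetical.

```python
import numpy as np

def haar_unitary(d, rng):
    """Sample a d x d Haar-random unitary via QR of a complex Gaussian matrix."""
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    diag = np.diag(r)
    # Fix the phase ambiguity of QR so that the distribution is exactly Haar.
    return q * (diag / np.abs(diag))

def second_moment_exact(A, B, O):
    """Closed-form Haar average of Tr[A W^dag O W] Tr[B W^dag O W] for traceless O."""
    d = O.shape[0]
    weight = np.trace(O @ O) / (d**2 - 1)
    return (weight * (np.trace(A @ B) - np.trace(A) * np.trace(B) / d)).real

def second_moment_mc(A, B, O, samples, rng):
    """Monte Carlo estimate of the same average (real part only)."""
    d = O.shape[0]
    total = 0.0
    for _ in range(samples):
        W = haar_unitary(d, rng)
        H = W.conj().T @ O @ W  # Heisenberg-evolved operator W^dag O W
        total += (np.trace(A @ H) * np.trace(B @ H)).real
    return total / samples

def ketbra(p, q, d=2):
    """Matrix |p><q| in the computational basis."""
    M = np.zeros((d, d), dtype=complex)
    M[p, q] = 1.0
    return M

# Illustrative choice: one qubit (d = 2), O = Pauli Z, p = 0, q = 1, p' = 1, q' = 0.
Z = np.diag([1.0, -1.0]).astype(complex)
rng = np.random.default_rng(0)
A, B = ketbra(0, 1), ketbra(1, 0)
exact = second_moment_exact(A, B, Z)            # Tr[Z^2] / (d^2 - 1) = 2/3
estimate = second_moment_mc(A, B, Z, 5000, rng)
```

With these index choices only the cross term δ_(p,q′) δ_(p′,q) survives, so the average is Tr[O²]/(d² − 1) = 2/3, and the sampled estimate should agree up to statistical error.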

On the other hand, as shown in Supplementary Note 7 (and as schematically depicted in Fig. 8c), when computing the expectation value ${\langle \ldots \rangle }_{{V}_{{\mathcal{L}}}}$ in Eq. (47), one obtains

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}=\mathop{\sum} _{\tau }{t}_{\tau }^{ij}{{\Delta }}{O}_{\tau }^{{\mathcal{L}}}{\delta }_{\tau }\ ,$$

(49)

where we defined ${\delta }_{\tau }={\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\tau }}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\tau }}}$, ${t}_{\tau }\in {\mathbb{R}}$, ${S}_{\tau }\cup {S}_{\overline{\tau }}={S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}$ (with ${S}_{\tau }\,\ne\, {{\emptyset}}$), and

$$\begin{aligned}{{\Delta }}{O}_{\tau }^{{\mathcal{L}}}= &\,{{\rm{Tr}}}_{{x}_{\tau }{y}_{\tau }}\left[{{\rm{Tr}}}_{{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{z}_{\tau }}[{O}_{j}]\right]\\ &-\frac{{{\rm{Tr}}}_{{x}_{\tau }}\left[{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}[{O}_{j}]\right]}{{2}^{m}}\ .\end{aligned}$$

(50)

Here we use the notation ${{\rm{Tr}}}_{{x}_{\tau }}$ to indicate the trace over the Hilbert space associated with subsystem ${S}_{{x}_{\tau }}$, such that ${S}_{{x}_{\tau }}\cup {S}_{{y}_{\tau }}\cup {S}_{{z}_{\tau }}={S}_{{\mathcal{L}}}$. As shown in Supplementary Note 7, combining the deltas in Eqs. (48) and (49) with ${\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}$ leads to Hilbert–Schmidt distances between two quantum states as in Eq. (44). One can then use the bounds ${D}_{\text{HS}}\left({\rho }_{1},{\rho }_{2}\right)\le 2$, ${{\Delta }}{O}_{\tau }^{{\mathcal{L}}}\le {\prod }_{k\in {k}_{{\mathcal{L}}}}{r}_{k}^{2}$, and $\sum_{\tau }{t}_{\tau }\le 2$, along with some additional simple algebra explained in the Supplementary Information, to obtain the upper bound in Theorem 1.

Ansatz and optimization method

Here we describe the gradient-free optimization method used in our heuristics. First, we note that all the parameters in the ansatz are randomly initialized. Then, at each iteration, one solves the following sub-space search problem: $\mathop{\min }\limits_{{\boldsymbol{s}}\in {{\mathbb{R}}}^{d}}C({\boldsymbol{\theta }}+{\boldsymbol{A}}\cdot {\boldsymbol{s}})$, where A is a randomly generated isometry, and s = (s₁, …, s_d) is a vector of coefficients to be optimized over. We used d = 10 in our simulations. Moreover, the training algorithm gradually increases the number of shots per cost-function evaluation. Initially, C is evaluated with 10 shots, and once the optimization reaches a plateau, the number of shots is increased by a factor of 3/2. This process is repeated until a termination condition on the value of C is achieved, or until we reach the maximum value of 10⁵ shots per function evaluation. While this is a simple variable-shot approach, we remark that a more advanced variable-shot optimizer can be found in ref. ⁵⁷.
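The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the simulations. The cost is a toy function with values in [0, 1] estimated from Bernoulli samples, the inner minimization over s is done by a simple random search (the text does not specify the inner solver), and the plateau test, the target value, and all function names are hypothetical. Only d = 10, the initial 10 shots, the 3/2 growth factor, and the 10⁵ shot cap are taken from the text.

```python
import numpy as np

def random_isometry(n_params, d, rng):
    """Random n_params x d matrix A with orthonormal columns (A^T A = identity)."""
    q, _ = np.linalg.qr(rng.normal(size=(n_params, d)))
    return q

def exact_cost(theta):
    """Toy stand-in for C(theta), with values in [0, 1] (an assumption)."""
    return float(np.mean(np.sin(theta / 2) ** 2))

def noisy_cost(theta, shots, rng):
    """Estimate the toy cost from `shots` Bernoulli samples."""
    return rng.binomial(shots, exact_cost(theta)) / shots

def next_shots(shots, max_shots=10**5):
    """Increase the shot count by a factor of 3/2, capped at max_shots."""
    return min(max_shots, int(np.ceil(1.5 * shots)))

def subspace_step(theta, cost, d, rng, trials=20, scale=0.3):
    """Approximately minimize C(theta + A s) over s in R^d by random search."""
    A = random_isometry(theta.size, d, rng)
    best_theta, best_val = theta, cost(theta)
    for _ in range(trials):
        candidate = theta + A @ (scale * rng.normal(size=d))
        val = cost(candidate)
        if val < best_val:
            best_theta, best_val = candidate, val
    return best_theta, best_val

def train(n_params=20, d=10, shots=10, max_shots=10**5, target=0.01,
          max_iters=300, patience=3, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi, np.pi, size=n_params)  # random initialization
    best_seen, stall, shot_history = np.inf, 0, [shots]
    for _ in range(max_iters):
        theta, val = subspace_step(
            theta, lambda t: noisy_cost(t, shots, rng), d, rng)
        if val <= target:            # termination condition on the cost value
            break                    # (noisy at low shot counts)
        if val < best_seen:
            best_seen, stall = val, 0
        else:
            stall += 1
        if stall >= patience:        # plateau reached at this precision
            if shots >= max_shots:
                break
            shots = next_shots(shots, max_shots)
            shot_history.append(shots)
            best_seen, stall = np.inf, 0
    return theta, shot_history
```

Calling `train()` returns the final parameters together with the sequence of shot counts used, which starts at 10 and never exceeds the cap. Because early comparisons rely on very few shots, both the acceptance of candidates and the stopping test are noisy at the start and only become reliable as the shot count grows.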

Finally, let us remark that while we employ a sub-space search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., number of shots) that grows exponentially with n. The latter is due to the fact that an exponentially vanishing gradient implies that on average the cost function landscape will essentially be flat, with the slope of the order of ${\mathcal{O}}(1/{2}^{n})$. Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost minimization direction with gradient-based, or even with black-box optimizers such as the Nelder–Mead method^58,59,60,61.
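The precision argument above can be made concrete with a back-of-envelope estimate. The sketch below assumes that the statistical error of a cost estimate from N shots scales as 1/√N (which holds for a bounded cost up to a constant factor) and that the cost change to be resolved is of order 1/2ⁿ. The function names and the one-sigma resolution criterion are illustrative assumptions.

```python
import math

def shots_to_resolve(n_qubits, sigmas=1.0):
    """Rough shot count N such that 1/sqrt(N) <= 2**(-n_qubits) / sigmas."""
    delta = 2.0 ** (-n_qubits)       # cost change of order 1/2^n
    return math.ceil((sigmas / delta) ** 2)

def largest_resolvable_n(max_shots):
    """Largest n whose required shot count fits within max_shots."""
    n = 0
    while shots_to_resolve(n + 1) <= max_shots:
        n += 1
    return n

print(shots_to_resolve(10))          # 4**10 = 1048576
print(largest_resolvable_n(10**5))   # 8, since 4**8 = 65536 <= 10**5 < 4**9
```

Under these assumptions the required number of shots grows as 4ⁿ, so the 10⁵-shot cap used here would resolve such slopes only up to about n = 8 qubits.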

Data availability

Data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

References

1. Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
2. McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).
3. Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
4. Hadfield, S. et al. From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12, 34 (2019).
5. Hastings, M. B. Classical and quantum bounded depth approximation algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019).
6. Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242 (2017).
7. Arute, F. et al. Hartree-Fock on a superconducting qubit quantum computer. Science 369, 1084–1089 (2020).
8. Harrigan, M. P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nat. Phys. 1–5 (2021).
9. McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).
10. Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 4213 (2014).
11. Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quant. Sci. Technol. 2, 045001 (2017).
12. Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).
13. Koczor, B., Endo, S., Jones, T., Matsuzaki, Y. & Benjamin, S. C. Variational-state quantum metrology. New J. Phys. 22, 083038 (2020).
14. Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).
15. Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at https://arxiv.org/abs/1811.03147 (2019).
16. Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational quantum compiling. New J. Phys. 22, 043006 (2020).
17. Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018).
18. LaRose, R., Tikku, A., O’Neel-Judy, É., Cincio, L. & Coles, P. J. Variational quantum state diagonalization. npj Quant. Inf. 5, 1–10 (2018).
19. Bravo-Prieto, C., García-Martín, D. & Latorre, J. Quantum singular value decomposer. Phys. Rev. A 101, 062310 (2020).
20. Li, Y. & Benjamin, S. C. Efficient variational quantum simulator incorporating active error minimization. Phys. Rev. X 7, 021050 (2017).
21. Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019).
22. Cirstoiu, C. et al. Variational fast forwarding for quantum simulation beyond the coherence time. npj Quant. Inf. 6, 1–10 (2020).
23. Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient quantum dynamics using symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 (2019).
24. Cerezo, M., Poremba, A., Cincio, L. & Coles, P. J. Variational quantum fidelity estimation. Quantum 4, 248 (2020).
25. Carolan, J. et al. Variational quantum unsampling on a quantum photonic processor. Nat. Phys. 16, 322–327 (2020).
26. Arrasmith, A., Cincio, L., Sornborger, A. T., Zurek, W. H. & Coles, P. J. Variational consistent histories as a hybrid algorithm for quantum foundations. Nat. Commun. 10, 3438 (2019).
27. Bravo-Prieto, C. et al. Variational quantum linear solver. Preprint at https://arxiv.org/abs/1909.05820 (2019).
28. Xu, X. et al. Variational algorithms for linear algebra. Preprint at https://arxiv.org/abs/1909.03898 (2019).
29. Huang, H.-Y., Bharti, K. & Rebentrost, P. Near-term quantum algorithms for linear systems of equations. Preprint at https://arxiv.org/abs/1909.07344 (2019).
30. Cerezo, M. & Coles, P. J. Impact of barren plateaus on the Hessian and higher order derivatives. Preprint at https://arxiv.org/abs/2008.07454 (2020).
31. Arrasmith, A., Cerezo, M., Czarnik, P., Cincio, L. & Coles, P. J. Effect of barren plateaus on gradient-free optimization. Preprint at https://arxiv.org/abs/2011.12245 (2020).
32. Grant, E., Wossnig, L., Ostaszewski, M. & Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum 3, 214 (2019).
33. Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019).
34. Lee, J., Huggins, W. J., Head-Gordon, M. & Whaley, K. B. Generalized unitary coupled cluster wave functions for quantum computation. J. Chem. Theory Comput. 15, 311–324 (2018).
35. Verdon, G. et al. Quantum graph neural networks. Preprint at https://arxiv.org/abs/1909.12264 (2019).
36. IBM Q: Quantum devices and simulators. https://www.research.ibm.com/ibm-q/technology/devices/.
37. Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
38. Dankert, C., Cleve, R., Emerson, J. & Livine, E. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A 80, 012304 (2009).
39. Brandao, F. G. S. L., Harrow, A. W. & Horodecki, M. Local random quantum circuits are approximate polynomial-designs. Commun. Math. Phys. 346, 397–434 (2016).
40. Harrow, A. & Mehraban, S. Approximate unitary t-designs by short random quantum circuits using nearest-neighbor and long-range gates. Preprint at https://arxiv.org/abs/1809.06957 (2018).
41. Wan, K. H., Dahlsten, O., Kristjánsson, H., Gardner, R. & Kim, M. S. Quantum generalisation of feedforward neural networks. npj Quant. Inf. 3, 36 (2017).
42. Lamata, L. et al. Quantum autoencoders via quantum adders with genetic algorithms. Quant. Sci. Technol. 4, 014007 (2018).
43. Pepper, A., Tischler, N. & Pryde, G. J. Experimental realization of a quantum autoencoder: the compression of qutrits via machine learning. Phys. Rev. Lett. 122, 060501 (2019).
44. Verdon, G., Pye, J. & Broughton, M. A universal training algorithm for quantum deep learning. Preprint at https://arxiv.org/abs/1806.09729 (2018).
45. Pennington, J. & Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2798–2806 (JMLR.org, 2017). http://proceedings.mlr.press/v70/pennington17a.html.
46. Tranter, A., Love, P. J., Mintert, F. & Coveney, P. V. A comparison of the Bravyi–Kitaev and Jordan–Wigner transformations for the quantum simulation of quantum chemistry. J. Chem. Theory Comput. 14, 5617–5630 (2018).
47. Bravyi, S. B. & Kitaev, A. Y. Fermionic quantum computation. Ann. Phys. 298, 210–226 (2002).
48. Uvarov, A., Biamonte, J. D. & Yudin, D. Variational quantum eigensolver for frustrated quantum systems. Phys. Rev. B 102, 075104 (2020).
49. Biamonte, J. Universal variational quantum computation. Preprint at https://arxiv.org/abs/1903.04500 (2019).
50. Sharma, K., Cerezo, M., Cincio, L. & Coles, P. J. Trainability of dissipative perceptron-based quantum neural networks. Preprint at https://arxiv.org/abs/2005.12458 (2020).
51. Beer, K. et al. Training deep quantum neural networks. Nat. Commun. 11, 1–6 (2020).
52. Bartlett, R. J. & Musiał, M. Coupled-cluster theory in quantum chemistry. Rev. Modern Phys. 79, 291 (2007).
53. Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. (2021). https://iopscience.iop.org/article/10.1088/2058-9565/abd891.
54. Skolik, A. et al. Layerwise learning for quantum neural networks. Quantum Machine Intelligence 3, 1–11 (2021).
55. Collins, B. & Śniady, P. Integration with respect to the Haar measure on unitary, orthogonal and symplectic group. Commun. Math. Phys. 264, 773–795 (2006).
56. Puchała, Z. & Miszczak, J. A. Symbolic integration with respect to the Haar measure on the unitary groups. Bull. Pol. Acad. Sci. Tech. Sci. 65, 21–27 (2017).
57. Kübler, J. M., Arrasmith, A., Cincio, L. & Coles, P. J. An adaptive optimizer for measurement-frugal variational algorithms. Quantum 4, 263 (2020).
58. Nelder, J. A. & Mead, R. A simplex method for function minimization. Comput. J. 7, 308–313 (1965).
59. Paley, R. E. A. C. & Zygmund, A. A note on analytic functions in the unit circle. Math. Proc. Camb. Phil. Soc. 28, 266 (1932).
60. Fukuda, M., König, R. & Nechita, I. RTNI–a symbolic integrator for Haar-random tensor networks. J. Phys. A 52, 425303 (2019).
61. Nielsen, M. A. & Chuang, I. L. Quantum computation and quantum information: 10th Anniversary Edition, 10th ed. (Cambridge University Press, New York, NY, USA, 2011).

Acknowledgements

We thank Jacob Biamonte, Elizabeth Crosson, Burak Sahinoglu, Rolando Somma, Guillaume Verdon, and Kunal Sharma for helpful conversations. All authors were supported by the Laboratory Directed Research and Development (LDRD) program of Los Alamos National Laboratory (LANL) under project numbers 20180628ECR (for M.C.), 20190065DR (for A.S., L.C., and P.J.C.), and 20200677PRD1 (for T.V.). M.C. and A.S. were also supported by the Center for Nonlinear Studies at LANL. P.J.C. acknowledges initial support from the LANL ASC Beyond Moore’s Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.

Author information

Authors and Affiliations

Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA
M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio & Patrick J. Coles
Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM, USA
M. Cerezo & Akira Sone

Contributions

The project was conceived by M.C., L.C., and P.J.C. The manuscript was written by M.C., A.S., T.V., L.C., and P.J.C. T.V. proved Proposition 1. M.C. and A.S. proved Proposition 2 and Theorems 1–2. M.C., A.S., T.V., and P.J.C. proved Corollaries 1–2. M.C., A.S., T.V., L.C., and P.J.C. analyzed the quantum autoencoder. For the numerical results, T.V. performed the simulation in Fig. 2, and L.C. performed the simulation in Fig. 5.

Corresponding authors

Correspondence to M. Cerezo or Patrick J. Coles.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cerezo, M., Sone, A., Volkoff, T. et al. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat Commun 12, 1791 (2021). https://doi.org/10.1038/s41467-021-21728-w

Received: 05 February 2020
Accepted: 05 February 2021
Published: 19 March 2021
DOI: https://doi.org/10.1038/s41467-021-21728-w
