# Cost function dependent barren plateaus in shallow parametrized quantum circuits

## Abstract

Variational quantum algorithms (VQAs) optimize the parameters θ of a parametrized quantum circuit V(θ) to minimize a cost function C. While VQAs may enable practical applications of noisy quantum computers, they are nevertheless heuristic methods with unproven scaling. Here, we rigorously prove two results, assuming V(θ) is an alternating layered ansatz composed of blocks forming local 2-designs. Our first result states that defining C in terms of global observables leads to exponentially vanishing gradients (i.e., barren plateaus) even when V(θ) is shallow. Hence, several VQAs in the literature must revise their proposed costs. On the other hand, our second result states that defining C with local observables leads to at worst a polynomially vanishing gradient, so long as the depth of V(θ) is $${\mathcal{O}}(\mathrm{log}\,n)$$. Our results establish a connection between locality and trainability. We illustrate these ideas with large-scale simulations, up to 100 qubits, of a quantum autoencoder implementation.

## Introduction

One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications1. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.

The leading strategy to make use of these devices is variational quantum algorithms (VQAs)2. VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are three-fold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.

There are very few rigorous scaling results for VQAs (with the exception of one-layer approximate optimization3,4,5). Ideally, in order to reduce the gate overhead that arises when implementing on quantum hardware, one would like to employ a hardware-efficient ansatz6 for V(θ). As recent large-scale implementations for chemistry7 and optimization8 applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients9. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.

The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑αcασα, with H expanded as a sum of local Pauli operators10. In VQE, the cost function is obviously the energy $$C=\left\langle \psi | H| \psi \right\rangle$$ of the trial state $$\left|\psi \right\rangle$$. However, VQAs have been proposed for other applications, like quantum data compression11, quantum error correction12, quantum metrology13, quantum compiling14,15,16,17, quantum state diagonalization18,19, quantum simulation20,21,22,23, fidelity estimation24, unsampling25, consistent histories26, and linear systems27,28,29. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.

We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as $${\mathcal{O}}(\mathrm{poly}\,(n))$$. In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling30, and this scaling can also apply to derivative-free optimization31. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term “trainable” for polynomially vanishing gradients.

In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state $$\left|{\psi }_{0}\right\rangle$$. A natural cost function is the square of the trace distance DT between $$\left|{\psi }_{0}\right\rangle$$ and $$\left|\psi \right\rangle =V{({\boldsymbol{\theta }})}^{\dagger }\left|{\boldsymbol{0}}\right\rangle$$, given by $${C}_{{\rm{G}}}={D}_{\text{T}}{(\left|{\psi }_{0}\right\rangle ,\left|\psi \right\rangle )}^{2}$$, which is equivalent to

$${C}_{{\rm{G}}}={\rm{Tr}}[{O}_{{\rm{G}}}V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \ \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$
(1)

with $${O}_{{\rm{G}}}={\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$$. Note that $$\sqrt{{C}_{{\rm{G}}}}\ge | \left\langle \psi | M| \psi \right\rangle -\left\langle {\psi }_{0}| M| {\psi }_{0}\right\rangle |$$ has a nice operational meaning as a bound on the expectation value difference for a POVM element M.

However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., $$\left|\psi \right\rangle$$ and $$\left|{\psi }_{0}\right\rangle$$). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.

Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.9, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies32,33 or clever PQC ansatzes4,34,35. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.

While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group14,18, and more recently in ref. 29. This motivated the proposal of local cost functions14,16,18,22,25,26,27, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning.

Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth $${\mathcal{O}}(\mathrm{log}\,n)$$. Figure 1 summarizes our two main results.

Finally, we illustrate our main results for an important example: quantum autoencoders11. Our large-scale numerics show that the global cost function proposed in ref. 11 has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.

## Results

### Warm-up example

To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state being $$\left|{\boldsymbol{0}}\right\rangle$$. We assume a tensor-product ansatz of the form $$V({\boldsymbol{\theta }}){ = \bigotimes }_{j = 1}^{n}{e}^{-i{\theta }^{j}{\sigma }_{x}^{(j)}/2}$$, with the goal of finding the angles θj such that $$V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle =\left|{\boldsymbol{0}}\right\rangle$$. Employing the global cost of (1) results in $${C}_{{\rm{G}}}=1-\mathop{\prod }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}$$. The barren plateau can be detected via the variance of its gradient: $${\rm{Var}}[\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}]=\frac{1}{8}{(\frac{3}{8})}^{n-1}$$, which is exponentially vanishing in n. Since the mean value is $$\left\langle \frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right\rangle =0$$, the gradient concentrates exponentially around zero.
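As a sketch of where this variance comes from, assume (as in Proposition 1 below) that the angles θj are independent and uniformly distributed on [−π, π]. Differentiating the product form of CG and using the independence of the angles gives

$$\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}=\frac{1}{2}\sin {\theta }^{j}\mathop{\prod }\limits_{k\ne j}{\cos }^{2}\frac{{\theta }^{k}}{2}\ ,\qquad {\rm{Var}}\left[\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right]=\frac{1}{4}\left\langle {\sin }^{2}{\theta }^{j}\right\rangle \mathop{\prod }\limits_{k\ne j}\left\langle {\cos }^{4}\frac{{\theta }^{k}}{2}\right\rangle =\frac{1}{4}\cdot \frac{1}{2}\cdot {\left(\frac{3}{8}\right)}^{n-1}\ ,$$

where $$\langle {\sin }^{2}\theta \rangle =\frac{1}{2}$$ and $$\langle {\cos }^{4}\frac{\theta }{2}\rangle =\frac{3}{8}$$ are elementary averages over one period. Each additional qubit therefore multiplies the variance by 3/8, which is the origin of the exponential decay.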

On the other hand, consider a local cost function:

$${C}_{{\rm{L}}}={\rm{Tr}}\left[{O}_{{\rm{L}}}V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|V{({\boldsymbol{\theta }})}^{\dagger }\right],$$
(2)
$${\,\text{with}\,}\quad {O}_{{\rm{L}}}={\mathbb{1}}-{\frac {1}{n}}\mathop{\sum }\limits_{j=1}^{n}\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\ ,$$
(3)

where $${{\mathbb{1}}}_{\overline{j}}$$ is the identity on all qubits except qubit j. Note that CL vanishes under the same conditions as CG14,16, i.e., CL = 0 ⇔ CG = 0. We find $${C}_{{\rm{L}}}=1-\frac{1}{n}\mathop{\sum }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}$$, and the variance of its gradient is $${\rm{Var}}[\frac{\partial {C}_{{\rm{L}}}}{\partial {\theta }^{j}}]=\frac{1}{8{n}^{2}}$$, which vanishes polynomially with n and hence exhibits no barren plateau. Figure 2 depicts the cost landscapes of CG and CL for two values of n and shows that the barren plateau can be avoided here via a local cost function.
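The two closed-form variances of the warm-up example can be checked numerically. The following is a minimal sketch, assuming independent angles that are uniform on [−π, π]; the function names are hypothetical and introduced only for illustration. It averages the squared partial derivatives on an equispaced grid, which is exact (up to rounding) for the low-degree trigonometric polynomials involved.

```python
import numpy as np

def uniform_average(f, num_points=64):
    """Average f(theta) over theta uniform on [-pi, pi) using an equispaced grid.

    For low-degree trigonometric polynomials this rule is exact up to rounding.
    """
    theta = -np.pi + 2.0 * np.pi * np.arange(num_points) / num_points
    return float(np.mean(f(theta)))

def grad_variance_global(n):
    """Var[dC_G/dtheta^j] for C_G = 1 - prod_j cos^2(theta^j / 2)."""
    # dC_G/dtheta^j = (1/2) sin(theta^j) * prod_{k != j} cos^2(theta^k / 2);
    # the mean is zero and the angles are independent, so the second moment factorizes.
    own = uniform_average(lambda t: (0.5 * np.sin(t)) ** 2)
    other = uniform_average(lambda t: np.cos(t / 2.0) ** 4)
    return own * other ** (n - 1)

def grad_variance_local(n):
    """Var[dC_L/dtheta^j] for C_L = 1 - (1/n) sum_j cos^2(theta^j / 2)."""
    # dC_L/dtheta^j = sin(theta^j) / (2 n), which also has zero mean.
    return uniform_average(lambda t: (np.sin(t) / (2.0 * n)) ** 2)

for n in (4, 16, 64):
    print(n, grad_variance_global(n), grad_variance_local(n))
```

Comparing the two printed columns as n grows makes the contrast concrete: the global variance shrinks by a constant factor per added qubit, while the local one shrinks only as 1/n².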

Moreover, this example allows us to delve deeper into the cost landscape to see a phenomenon that we refer to as a narrow gorge. While a barren plateau is associated with a flat landscape, a narrow gorge refers to the steepness of the valley that contains the global minimum. This phenomenon is illustrated in Fig. 2, where each dot corresponds to cost values obtained from randomly selected parameters θ. For CG we see that very few dots fall inside the narrow gorge, while for CL the narrow gorge is not present. Note that the narrow gorge makes it harder to train CG since the learning rate of descent-based optimization algorithms must be exponentially small in order not to overstep the narrow gorge. The following proposition (proved in the Supplementary Note 2) formalizes the narrow gorge for CG and its absence for CL by characterizing the dependence on n of the probability that C ≤ δ. This probability is associated with the parameter space volume that leads to C ≤ δ.

### Proposition 1

Let θj be uniformly distributed on [−π, π] ∀j. For any δ ∈ (0, 1), the probability that CG ≤ δ satisfies

$$\Pr \{{C}_{{\rm{G}}}\le \delta \}\le {(1-\delta )}^{-1}{\left(\frac{1}{2}\right)}^{n}.$$
(4)

For any $$\delta \in [\frac{1}{2},1]$$, the probability that CL ≤ δ satisfies

$$\Pr \{{C}_{{\rm{L}}}\le \delta \}\ge \frac{{(2\delta -1)}^{2}}{\frac{1}{2n}+{(2\delta -1)}^{2}}\mathop{\longrightarrow }\limits_{n\to \infty }1\ .$$
(5)
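Both bounds of Proposition 1 can be compared against a Monte Carlo estimate. The sketch below is illustrative only: the sample size, the seed, the value of δ, and all function names are assumptions chosen for the example, and the closed forms of CG and CL are those of the warm-up example.

```python
import numpy as np

def sample_costs(n, num_samples, rng):
    """Draw theta^j uniformly on [-pi, pi] and return samples of (C_G, C_L)."""
    theta = rng.uniform(-np.pi, np.pi, size=(num_samples, n))
    overlaps = np.cos(theta / 2.0) ** 2
    c_global = 1.0 - np.prod(overlaps, axis=1)
    c_local = 1.0 - np.mean(overlaps, axis=1)
    return c_global, c_local

def global_upper_bound(n, delta):
    """Right-hand side of Eq. (4)."""
    return 0.5 ** n / (1.0 - delta)

def local_lower_bound(n, delta):
    """Right-hand side of Eq. (5), stated for delta in [1/2, 1]."""
    gap = (2.0 * delta - 1.0) ** 2
    return gap / (1.0 / (2.0 * n) + gap)

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
delta = 0.6
for n in (2, 4, 8, 16):
    c_global, c_local = sample_costs(n, 100_000, rng)
    # Columns: n, estimated Pr{C_G <= delta}, bound (4),
    #          estimated Pr{C_L <= delta}, bound (5).
    print(n,
          np.mean(c_global <= delta), global_upper_bound(n, delta),
          np.mean(c_local <= delta), local_lower_bound(n, delta))
```

The estimated probability for CG should stay below the upper bound and shrink rapidly with n (the narrow gorge), whereas the estimate for CL should stay above a lower bound that approaches 1.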

### General framework

For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows

$$C={\rm{Tr}}\left[OV({\boldsymbol{\theta }})\rho {V}^{\dagger }({\boldsymbol{\theta }})\right]\ ,$$
(6)

where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where $$\rho =\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$$, which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA.

It is typical to express O as a linear combination of the form $$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\nolimits_{i = 1}^{N}{c}_{i}{O}_{i}$$. Here Oi ≠ $${\mathbb{1}}$$, $${c}_{i}\in {\mathbb{R}}$$, and we assume that at least one ci ≠ 0. Note that CG and CL in (1) and (2) fall under this framework. In our main results below, we will consider two different choices of O that respectively capture our general notions of global and local cost functions and also generalize the aforementioned CG and CL.

As shown in Fig. 3a, V(θ) consists of L layers of m-qubit unitaries Wkl(θkl), or blocks, acting on alternating groups of m neighboring qubits. We refer to this as an Alternating Layered Ansatz. We remark that the Alternating Layered Ansatz will be a hardware-efficient ansatz so long as the gates that compose each block are taken from a set of gates native to a specific device. As depicted in Fig. 3c, the one dimensional Alternating Layered Ansatz can be readily implemented in devices with one-dimensional connectivity, as well as in devices with two-dimensional connectivity (such as that of IBM’s36 and Google’s37 quantum devices). That is, with both one- and two-dimensional hardware connectivity one can group qubits to form an Alternating Layered Ansatz as in Fig. 3a.

The index l = 1, …, L in Wkl(θkl) indicates the layer that contains the block, while k = 1, …, ξ indicates the qubits it acts upon. We assume n is a multiple of m, with n = mξ, and that m does not scale with n. As depicted in Fig. 3a, we define Sk as the m-qubit subsystem on which WkL acts, and we define $${\mathcal{S}}=\{{S}_{k}\}$$ as the set of all such subsystems. Let us now consider a block Wkl(θkl) in the lth layer of the ansatz. For simplicity we henceforth use W to refer to a given Wkl(θkl). As shown in the Methods section, given a θν ∈ θkl that parametrizes a rotation $${e}^{-i{\theta }^{\nu }{\sigma }_{\nu }/2}$$ (with σν a Pauli operator) inside a given block W, one can always express

$$\frac{\partial W}{\partial {\theta }^{\nu }}:={\partial }_{\nu }W=\frac{-i}{2}{W}_{\text{A}}{\sigma }_{\nu }{W}_{\text{B}},$$
(7)

where WA and WB contain all remaining gates in W, and are properly defined in the Methods section.

The contribution to the gradient ∇C from a parameter θν in the block W is given by the partial derivative ∂νC. While the value of ∂νC depends on the specific parameters θ, it is useful to compute $${\langle {\partial }_{\nu }C\rangle }_{V}$$, i.e., the average gradient over all possible unitaries V(θ) within the ansatz. Such an average may not be representative near the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V(θ). In the Methods Section we explicitly show how to compute averages of the form 〈…〉V, and in the Supplementary Note 3 we provide a proof for the following Proposition.

### Proposition 2

The average of the partial derivative of any cost function of the form (6) with respect to a parameter θν in a block W of the ansatz in Fig. 3 is

$${\langle {\partial }_{\nu }C\rangle }_{V}=0\ ,$$
(8)

provided that either WA or WB of (7) form a 1-design.

Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure up to the first t moments38. The Methods section provides a formal definition of a t-design.
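To make the notion of a 1-design concrete, the sketch below uses a standard textbook example that is not taken from this paper: the uniform ensemble over the four single-qubit Pauli operators. It checks that averaging P A P† over this ensemble reproduces the first-moment Haar average Tr(A)𝟙/d for a random matrix A. The function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = (I2, X, Y, Z)

def pauli_twirl(a):
    """Average P a P^dagger over the four single-qubit Pauli operators."""
    return sum(p @ a @ p.conj().T for p in PAULIS) / len(PAULIS)

def haar_first_moment(a):
    """First-moment Haar average of U a U^dagger, i.e. Tr(a) * identity / d."""
    d = a.shape[0]
    return np.trace(a) * np.eye(d, dtype=complex) / d

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
print(np.allclose(pauli_twirl(a), haar_first_moment(a)))
```

The check only concerns the first moment; matching the second moment, as required for the 2-design assumption used in the theorems below, is a stronger condition.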

Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:

$${\rm{Var}}[{\partial }_{\nu }C]={\left\langle {\left({\partial }_{\nu }C\right)}^{2}\right\rangle }_{V}\ ,$$
(9)

where we used the fact that $${\langle {\partial }_{\nu }C\rangle }_{V}=0$$. The magnitude of Var[∂νC] quantifies how much the partial derivative concentrates around zero, and hence small values in (9) imply that the slope of the landscape will typically be insufficient to provide a cost-minimizing direction. Specifically, from Chebyshev’s inequality, Var[∂νC] bounds the probability that the cost-function partial derivative deviates from its mean value (of zero) as $$\Pr \left(| {\partial }_{\nu }C| \ge c\right)\le {\rm{Var}}[{\partial }_{\nu }C]/{c}^{2}$$ for all c > 0.
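As a worked illustration of this bound, take the warm-up variances quoted earlier with n = 20 and a threshold c = 10−2 (both values are chosen here only for the sake of the example). For the global cost,

$$\Pr \left(\left|\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right|\ge 1{0}^{-2}\right)\le \frac{\frac{1}{8}{\left(\frac{3}{8}\right)}^{19}}{1{0}^{-4}}\approx 1{0}^{-5}\ ,$$

so a randomly initialized partial derivative almost never exceeds the threshold. For the local cost the same computation gives $$\frac{1}{8\cdot 2{0}^{2}}/1{0}^{-4}\approx 3.1$$, which exceeds 1, so Chebyshev's inequality places no restriction on how often the threshold is exceeded.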

### Main results

Here we present our main theorems and corollaries, with the proofs sketched in the Methods and detailed in the Supplementary Information. In addition, in the Methods section we provide some intuition behind our main results by analyzing a generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap between the warm-up example and our main theorems and also showcases the tools used to derive our main result.

The following theorem provides an upper bound on the variance of the partial derivative of a global cost function which can be expressed as the expectation value of an operator of the form

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i1}\otimes {\widehat{O}}_{i2}\otimes \cdots \otimes {\widehat{O}}_{i\xi }\ .$$
(10)

Specifically, we consider two cases of interest: (i) When N = 1 and each $${\widehat{O}}_{1k}$$ is a non-trivial projector ($${\widehat{O}}_{1k}^{2}={\widehat{O}}_{1k}\ne {\mathbb{1}}$$) of rank rk acting on subsystem Sk, or (ii) When N is arbitrary and $${\widehat{O}}_{ik}$$ is traceless with $${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\le {2}^{m}$$ (for example, when $${\widehat{O}}_{ik}{ = \bigotimes }_{j = 1}^{m}{\sigma }_{j}^{\mu }$$ is a tensor product of Pauli operators $${\sigma }_{j}^{\mu }\in \{{{\mathbb{1}}}_{j},{\sigma }_{j}^{x},{\sigma }_{j}^{y},{\sigma }_{j}^{z}\}$$, with at least one $${\sigma }_{j}^{\mu }\,\ne\, {\mathbb{1}}$$). Note that case (i) includes CG of (1) as a special case.

### Theorem 1

Consider a trainable parameter θν in a block W of the ansatz in Fig. 3. Let Var[∂νC] be the variance of the partial derivative of a global cost function C (with O given by (10)) with respect to θν. If WA, WB of (7), and each block in V(θ) form a local 2-design, then Var[∂νC] is upper bounded by

$${\rm{Var}}[{\partial }_{\nu }C]\;\leqslant\; {F}_{n}(L,l)\ .$$
(11)

(i) For N = 1 and when each $${\widehat{O}}_{1k}$$ is a non-trivial projector, then defining $$R=\mathop{\prod }\nolimits_{k = 1}^{\xi }{r}_{k}^{2}$$, we have

$${F}_{n}(L,l)=\frac{{2}^{2m+(2m-1)(L-l)}}{({2}^{2m}-1)\cdot {3}^{\frac{n}{m}}\cdot {2}^{(2-\frac{3}{m})n}}{c}_{1}^{2}R\ .$$
(12)

(ii) For arbitrary N and when each $${\widehat{O}}_{ik}$$ satisfies $${\rm{Tr}}[{\widehat{O}}_{ik}]=0$$ and $${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}$$, then

$${F}_{n}(L,l)=\frac{{2}^{2m(L-l+1)+1}}{{3}^{\frac{2n}{m}}\cdot {2}^{\left(3-\frac{4}{m}\right)n}}\mathop{\sum }\limits_{i,j=1}^{N}{c}_{i}{c}_{j}\ .$$
(13)

From Theorem 1 we derive the following corollary.

### Corollary 1

Consider the function Fn(L, l).

(i) Let N = 1 and let each $${\widehat{O}}_{1k}$$ be a non-trivial projector, as in case (i) of Theorem 1. If $${c}_{1}^{2}R\in {\mathcal{O}}({2}^{n})$$ and if the number of layers $$L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$$, then

$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left({2}^{-\left(1-\frac{1}{m}{\mathrm{log}\,}_{2}3\right)n}\right)\ ,$$
(14)

which implies that Var[∂νC] is exponentially vanishing in n if m ≥ 2.

(ii) Let N be arbitrary, and let each $${\widehat{O}}_{ik}$$ satisfy $${\rm{Tr}}[{\widehat{O}}_{ik}]=0$$ and $${\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}$$, as in case (ii) of Theorem 1. If $$N\in {\mathcal{O}}({2}^{n})$$, $${c}_{i}\in {\mathcal{O}}(1)$$, and if the number of layers $$L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$$, then

$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left(\frac{1}{{2}^{\left(1-\frac{1}{m}\right)n}}\right)\ ,$$
(15)

which implies that Var[∂νC] is exponentially vanishing in n if m ≥ 2.

Let us now make several important remarks. First, note that part (i) of Corollary 1 includes as a particular example the cost function CG of (1). Second, part (ii) of this corollary also includes as particular examples operators with $$N\in {\mathcal{O}}(1)$$, as well as $$N\in {\mathcal{O}}(\mathrm{poly}\,(n))$$. Finally, we remark that Fn(L, l) becomes trivial when the number of layers L is Ω(poly(n)). However, as we discuss below, we can still find that Var[∂νCG] vanishes exponentially in this case.
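To get a feel for how quickly the upper bound of Theorem 1 shrinks, one can evaluate Eqs. (12) and (13) directly. The sketch below works with log2 of Fn(L, l) to avoid floating-point underflow at large n. The function names, the default coefficient values, and the choices m = 2, L = 4, l = 1 are assumptions made only for illustration.

```python
import math

def log2_f_projector(n, m, L, l, c1_sq_R=1.0):
    """log2 of F_n(L, l) in Eq. (12), case (i) of Theorem 1.

    c1_sq_R stands for the product c_1^2 * R and must be positive.
    """
    return (2 * m + (2 * m - 1) * (L - l)
            - math.log2(2 ** (2 * m) - 1)
            - (n / m) * math.log2(3)
            - (2 - 3 / m) * n
            + math.log2(c1_sq_R))

def log2_f_traceless(n, m, L, l, coeff_sum=1.0):
    """log2 of F_n(L, l) in Eq. (13), case (ii) of Theorem 1.

    coeff_sum stands for sum_{i,j} c_i c_j and must be positive here.
    """
    return (2 * m * (L - l + 1) + 1
            - (2 * n / m) * math.log2(3)
            - (3 - 4 / m) * n
            + math.log2(coeff_sum))

m, L, l = 2, 4, 1  # illustrative values, not taken from the paper
for n in (8, 16, 32, 64):
    print(n, log2_f_projector(n, m, L, l), log2_f_traceless(n, m, L, l))
```

At fixed L and l, both logarithms decrease linearly in n, which is the exponential decay of the bound stated in Corollary 1.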

Our second main theorem shows that barren plateaus can be avoided for shallow circuits by employing local cost functions. Here we consider m-local cost functions where each $${\widehat{O}}_{i}$$ acts nontrivially on at most m qubits and (on these qubits) can be expressed as $${\widehat{O}}_{i}={\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }$$:

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\ ,$$
(16)

where $${\widehat{O}}_{i}^{{\mu }_{i}}$$ are operators acting on m/2 qubits which can be written as a tensor product of Pauli operators. Here, we assume the summation in Eq. (16) includes two possible cases as schematically shown in Fig. 3b: First, when $${\widehat{O}}_{i}^{{\mu }_{i}}$$ ($${\widehat{O}}_{i}^{\mu ^{\prime} }$$) acts on the first (last) m/2 qubits of a given Sk, and second, when $${\widehat{O}}_{i}^{{\mu }_{i}}$$ ($${\widehat{O}}_{i}^{\mu ^{\prime} }$$) acts on the last (first) m/2 qubits of a given Sk (Sk+1). This type of cost function includes any ultralocal cost function (i.e., where the $${\widehat{O}}_{i}$$ are one-body) as in (2), and also VQE Hamiltonians with up to m/2 neighbor interactions. Then, the following theorem holds.

### Theorem 2

Consider a trainable parameter θν in a block W of the ansatz in Fig. 3. Let Var[∂νC] be the variance of the partial derivative of an m-local cost function C (with O given by (16)) with respect to θν. If WA, WB of (7), and each block in V(θ) form a local 2-design, then Var[∂νC] is lower bounded by

$${G}_{n}(L,l)\;\leqslant\; {\rm{Var}}[{\partial }_{\nu }C]\ ,$$
(17)

with

$${G}_{n}(L,l)= \frac{{2}^{m(l+1)-1}}{{({2}^{2m}-1)}^{2}{({2}^{m}+1)}^{L+l}}\\ \times \mathop{\sum}\limits_{i\in {i}_{{\mathcal{L}}}}\mathop{\sum}\limits _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\ ,$$
(18)

where $${i}_{{\mathcal{L}}}$$ is the set of i indices whose associated operators $${\widehat{O}}_{i}$$ act on qubits in the forward light-cone $${\mathcal{L}}$$ of W, and $${k}_{{{\mathcal{L}}}_{\text{B}}}$$ is the set of k indices whose associated subsystems Sk are in the backward light-cone $${{\mathcal{L}}}_{\text{B}}$$ of W. Here we defined the function $$\epsilon (M)={D}_{\text{HS}}\left(M,{\rm{Tr}}(M){\mathbb{1}}/{d}_{M}\right)$$ where DHS is the Hilbert–Schmidt distance and dM is the dimension of the matrix M. In addition, $${\rho }_{k,k^{\prime} }$$ is the partial trace of the input state ρ down to the subsystems $${S}_{k}{S}_{k+1}...{S}_{k^{\prime} }$$.

Let us make a few remarks. First, note that the $$\epsilon ({\widehat{O}}_{i})$$ in the lower bound indicates that training V(θ) is easier when $${\widehat{O}}_{i}$$ is far from the identity. Second, the presence of $$\epsilon ({\rho }_{k,k^{\prime} })$$ in Gn(L, l) implies that we have no guarantee on the trainability of a parameter θν in W if ρ is maximally mixed on the qubits in the backwards light-cone.

From Theorem 2 we derive the following corollary for m-local cost functions, which guarantees the trainability of the ansatz for shallow circuits.

### Corollary 2

Consider the function Gn(L, l). Let O be an operator of the form (16), as in Theorem 2. If at least one term $${c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})$$ in the sum in (18) vanishes no faster than Ω(1/poly(n)), and if the number of layers L is $${\mathcal{O}}(\mathrm{log}\,(n))$$, then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{\mathrm{poly}\,(n)}\right)\ .$$
(19)

On the other hand, if at least one term $${c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})$$ in the sum in (18) vanishes no faster than $${{\Omega }}\left(1/{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}\right)$$, and if the number of layers is $${\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$$, then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}}\right)\ .$$
(20)

Hence, when L is $${\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))$$ there is a transition region where the lower bound vanishes faster than polynomially, but slower than exponentially.
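The role of the depth in Corollary 2 can be seen from the prefactor of Gn(L, l) in Eq. (18) alone. The sketch below evaluates log2 of that prefactor, leaving out the sum over the ε terms, for a logarithmic depth L = log2 n. The function name and the values m = 2, l = 1 are assumptions made for illustration.

```python
import math

def log2_gn_prefactor(m, L, l):
    """log2 of 2^{m(l+1)-1} / ((2^{2m}-1)^2 (2^m+1)^{L+l}) from Eq. (18)."""
    return (m * (l + 1) - 1
            - 2 * math.log2(2 ** (2 * m) - 1)
            - (L + l) * math.log2(2 ** m + 1))

m, l = 2, 1  # illustrative block size and layer index
for n in (16, 64, 256, 1024):
    L = int(math.log2(n))  # logarithmic depth
    # Last column: the L-dependent part, log2 of n^{-log2(2^m + 1)}.
    print(n, L, log2_gn_prefactor(m, L, l), -math.log2(2 ** m + 1) * math.log2(n))
```

Since (2^m + 1)^(−L) equals n^(−log2(2^m + 1)) when L = log2 n, the prefactor decays only polynomially in n at logarithmic depth, whereas at fixed n it decays exponentially in L.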

We finally justify the assumption of each block being a local 2-design from the fact that shallow circuit depths lead to such local 2-designs. Namely, it has been shown that one-dimensional 2-designs have efficient quantum circuit descriptions, requiring $${\mathcal{O}}({m}^{2})$$ gates to be exactly implemented38, or $${\mathcal{O}}(m)$$ to be approximately implemented39,40. Hence, an L-layered ansatz in which each block forms a 2-design can be exactly implemented with a depth $$D\in {\mathcal{O}}({m}^{2}L)$$, and approximately implemented with $$D\in {\mathcal{O}}(mL)$$. For the case of two-dimensional connectivity, it has been shown that approximate 2-designs require a circuit depth of $${\mathcal{O}}(\sqrt{m})$$ to be implemented40. Therefore, in this case the depth of the layered ansatz is $$D\in {\mathcal{O}}(\sqrt{m}L)$$. The latter shows that increasing the dimensionality of the circuit reduces the circuit depth needed to make each block a 2-design.

Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is $${\mathcal{O}}(n)$$40. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of ref. 9 for both local and global cost functions.

### Numerical simulations

As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders11,41,42,43,44. In particular, the pioneering VQA proposed in ref. 11 has received significant literature attention, due to its importance to quantum machine learning and quantum data compression. Let us briefly explain the algorithm in ref. 11.

Consider a bipartite quantum system AB composed of nA and nB qubits, respectively, and let $$\{{p}_{\mu },|{\psi }_{\mu }\rangle \}$$ be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state $$|{\psi }_{\mu }\rangle$$ with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ).

To quantify the degree of data compression, ref. 11 proposed a cost function of the form:

$$C_{\rm{G}}^{\prime} =1-{\rm{Tr}}[\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|{\rho }_{\,\text{B}}^{\text{out}\,}]$$
(21)
$$={\rm{Tr}}[O_{\rm{G}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$
(22)

where $${\rho }_{\,\text{AB}}^{\text{in}\,}={\sum }_{\mu }{p}_{\mu }|{\psi }_{\mu }\rangle \ \langle {\psi }_{\mu }|$$ is the ensemble-average input state, $${\rho }_{\,\text{B}}^{\text{out}\,}={\sum }_{\mu }{p}_{\mu }{{\rm{Tr}}}_{\text{A}}[|{\psi }_{\mu }^{\prime} \rangle \ \langle {\psi }_{\mu }^{\prime} |]$$ is the ensemble-average trash state, and $$|{\psi }_{\mu }^{\prime} \rangle =V({\boldsymbol{\theta }})|{\psi }_{\mu }\rangle$$. Equation (22) makes it clear that $$C_{\rm{G}}^{\prime}$$ has the form in (6), and $$O_{\rm{G}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-{{\mathbb{1}}}_{\text{A}}\otimes \left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|$$ is a global observable of the form in (10). Hence, according to Corollary 1, $$C_{\rm{G}}^{\prime}$$ exhibits a barren plateau for large nB. (Specifically, Corollary 1 applies in this context when nA < nB). As a result, large-scale data compression, where one is interested in discarding large numbers of qubits, will not be possible with $$C_{\rm{G}}^{\prime}$$.

To address this issue, we propose the following local cost function

$$C_{\rm{L}}^{\prime} =1-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\limits_{j=1}^{{n}_{\text{B}}}{\rm{Tr}}\left[\left(\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\right){\rho }_{\,\text{B}}^{\text{out}\,}\right]$$
(23)
$$={\rm{Tr}}[O_{\rm{L}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,\qquad\quad$$
(24)

where $$O_{\rm{L}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\nolimits_{j = 1}^{{n}_{\text{B}}}{{\mathbb{1}}}_{\text{A}}\otimes \left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}$$, and $${{\mathbb{1}}}_{\overline{j}}$$ is the identity on all qubits in B except the jth qubit. As shown in the Supplementary Note 9, $$C_{\rm{L}}^{\prime}$$ satisfies $$C_{\rm{L}}^{\prime} \;\leqslant\; C_{\rm{G}}^{\prime} \;\leqslant\; {n}_{\text{B}}C_{\rm{L}}^{\prime}$$, which implies that $$C_{\rm{L}}^{\prime}$$ is faithful (vanishing under the same conditions as $$C_{\rm{G}}^{\prime}$$). Furthermore, note that $$O_{\rm{L}}^{\prime}$$ has the form in (16). Hence Corollary 2 implies that $$C_{\rm{L}}^{\prime}$$ does not exhibit a barren plateau for shallow ansatzes.
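Both autoencoder costs depend only on the computational-basis measurement statistics of the trash subsystem B, so they can be evaluated from a probability vector over nB-bit outcomes. The sketch below is a minimal illustration under that observation: the function names are hypothetical, and the random distribution merely stands in for the diagonal of the trash state. It also checks the faithfulness inequality between the two costs.

```python
import numpy as np

def global_cost(probs):
    """C'_G = 1 - probability that every trash qubit is measured in |0>."""
    return 1.0 - probs[0]

def local_cost(probs, n_b):
    """C'_L = 1 - average over trash qubits j of Pr(qubit j is measured in |0>)."""
    outcomes = np.arange(probs.size)
    # Bit j of the outcome index encodes the measurement result of trash qubit j.
    zero_probs = [probs[((outcomes >> j) & 1) == 0].sum() for j in range(n_b)]
    return 1.0 - float(np.mean(zero_probs))

n_b = 6
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
probs = rng.dirichlet(np.ones(2 ** n_b))  # random distribution over bitstrings
c_g, c_l = global_cost(probs), local_cost(probs, n_b)
print(c_l <= c_g <= n_b * c_l)
```

The lower inequality holds because the all-zeros outcome is never more likely than any single qubit reading zero, and the upper one is a union bound over the nB trash qubits.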

Here we simulate the autoencoder algorithm to solve a simple problem where nA = 1, and where the input state ensemble $$\{{p}_{\mu }, |{\psi }_{\mu } \rangle \}$$ is given by

$$\left|{\psi }_{1}\right\rangle ={\left|0\right\rangle }_{\text{A}}\otimes {\left|0,0,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{1}=2/3\ ,$$
(25)
$$\left|{\psi }_{2}\right\rangle ={\left|1\right\rangle }_{\text{A}}\otimes {\left|1,1,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{2}=1/3\ .$$
(26)

In order to analyze the cost-function-dependent barren plateau phenomenon, the dimension of subsystem B is gradually increased as nB = 10, 15, …, 100.
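As a simple reference point (a worked example added for illustration, not a reported result), consider the untrained circuit V(θ) = 𝟙. The trash state is then $$\frac{2}{3}\left|0,0,0,\ldots ,0\right\rangle \left\langle 0,0,0,\ldots ,0\right|+\frac{1}{3}\left|1,1,0,\ldots ,0\right\rangle \left\langle 1,1,0,\ldots ,0\right|$$, and the two costs evaluate to

$$C_{\rm{G}}^{\prime} =1-\frac{2}{3}=\frac{1}{3}\ ,\qquad C_{\rm{L}}^{\prime} =1-\frac{1}{{n}_{\text{B}}}\left[({n}_{\text{B}}-2)+2\cdot \frac{2}{3}\right]=\frac{2}{3{n}_{\text{B}}}\ ,$$

since only the first two trash qubits can be found in $$\left|1\right\rangle$$, each with probability 1/3. These values satisfy $$C_{\rm{L}}^{\prime} \;\leqslant\; C_{\rm{G}}^{\prime} \;\leqslant\; {n}_{\text{B}}C_{\rm{L}}^{\prime}$$ for every nB ≥ 2.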

### Numerical results

In our heuristics, the gate sequence V(θ) is given by two layers of the ansatz in Fig. 4, so that the number of gates and parameters in V(θ) increases linearly with nB. Note that this ansatz is a simplified version of the ansatz in Fig. 3, as we can only generate unitaries with real coefficients. All parameters in V(θ) were randomly initialized and as detailed in the Methods section, we employ a gradient-free training algorithm that gradually increases the number of shots per cost-function evaluation.

Analysis of the n-dependence. Figure 5 shows representative results of our numerical implementations of the quantum autoencoder in ref. 11 obtained by training V(θ) with the global and local cost functions respectively given by (22) and (23). Specifically, while we train with finite sampling, in the figures we show the exact cost-function values versus the number of iterations. Here, the top (bottom) axis corresponds to the number of iterations performed while training with $$C_{\rm{G}}^{\prime}$$ ($$C_{\rm{L}}^{\prime}$$). For nB = 10 and 15, Fig. 5 shows that we are able to train V(θ) for both cost functions. For nB = 20, the global cost function initially presents a plateau in which the optimizing algorithm is not able to determine a minimizing direction. However, as the number of shots per function evaluation increases, one can eventually minimize $$C_{\rm{G}}^{\prime}$$. Such a result indicates the presence of a barren plateau where the gradient takes small values which can only be detected when the number of shots becomes sufficiently large. In this particular example, one is able to start training at around 140 iterations.

When nB > 20 we are unable to train the global cost function, while always being able to train our proposed local cost function. Note that the number of iterations is different for $$C_{\rm{G}}^{\prime}$$ and $$C_{\rm{L}}^{\prime}$$, as for the global cost function case we reach the maximum number of shots in fewer iterations. These results indicate that the global cost function of (22) exhibits a barren plateau where the gradient of the cost function vanishes exponentially with the number of qubits, and which arises even for constant depth ansatzes. We remark that in principle one can always find a minimizing direction when training $$C_{\rm{G}}^{\prime}$$, although this would require a number of shots that increases exponentially with nB. Moreover, one can see in Fig. 5 that randomly initializing the parameters always leads to $$C_{\rm{G}}^{\prime} \approx 1$$ due to the narrow gorge phenomenon (see Proposition 1), i.e., where the probability of being near the global minimum vanishes exponentially with nB.

On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train $$C_{\rm{L}}^{\prime}$$ for all considered values of nB. Moreover, as seen in Fig. 5, $$C_{\rm{L}}^{\prime}$$ can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation).

Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the dependence on L when training $$C_{\rm{L}}^{\prime}$$ for the autoencoder example with nA = 1 and nB = 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in L, although we remark that this behavior can saturate for very large L9.

In summary, even though the ansatz employed in our heuristics is beyond the scope of our theorems, we still find cost-function-dependent barren plateaus, indicating that this phenomenon might be more general and go beyond our analytical results.

## Discussion

While scaling results have been obtained for classical neural networks45, very few such results exist for the trainability of parametrized quantum circuits, and more generally for quantum neural networks. Hence, rigorous scaling results are urgently needed for VQAs, which many researchers believe will provide the path to quantum advantage with near-term quantum computers. One of the few such results is the barren plateau theorem of ref. 9, which holds for VQAs with deep, hardware-efficient ansatzes.

In this work, we proved that the barren plateau phenomenon extends to VQAs with randomly initialized shallow Alternating Layered Ansatzes. The key to extending this phenomenon to shallow circuits was to consider the locality of the operator O that defines the cost function C. Theorem 1 presented a universal upper bound on the variance of the gradient for global cost functions, i.e., when O is a global operator. Corollary 1 stated the asymptotic scaling of this upper bound for shallow ansatzes as being exponentially decaying in n, indicating a barren plateau. Conversely, Theorem 2 presented a universal lower bound on the variance of the gradient for local cost functions, i.e., when O is a sum of local operators. Corollary 2 notes that for shallow ansatzes this lower bound decays polynomially in n. Taken together, these two results show that barren plateaus are cost-function-dependent, and they establish a connection between locality and trainability.

In the context of chemistry or materials science, our present work can inform researchers about which transformation to use when mapping a fermionic Hamiltonian to a spin Hamiltonian46, i.e., Jordan–Wigner versus Bravyi–Kitaev47. Namely, the Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost function. This fact was recently numerically confirmed48.

Moreover, the fact that Corollary 2 is valid for arbitrary input quantum states may be useful when constructing variational ansatzes. For example, one could propose a growing ansatz method where one appends $$\mathrm{log}\,(n)$$ layers of the hardware-efficient ansatz to a previously trained (hence fixed) circuit. This could then lead to a layer-by-layer training strategy where the previously trained circuit can correspond to multiple layers of the same hardware-efficient ansatz.

We remark that our definition of a global operator (local operator) is one that is both non-local (local) and many body (few body). Therefore, the barren plateau phenomenon could be due to the many-bodiness of the operator rather than the non-locality of the operator; we leave the resolution of this question to future work. On the other hand, our Theorem 1 rules out the possibility that barren plateaus could be due to cardinality, i.e., the number of terms in O when decomposed as a sum of Pauli products49. Namely, case (ii) of this theorem implies barren plateaus for O of essentially arbitrary cardinality, and hence cardinality is not the key variable at work here.

We illustrated these ideas for two example VQAs. In Fig. 2, we considered a simple state-preparation example, which allowed us to delve deeper into the cost landscape and uncover another phenomenon that we called a narrow gorge, stated precisely in Proposition 1. In Fig. 5, we studied the more important example of quantum autoencoders, which have generated significant interest in the quantum machine learning community. Our numerics showed the effects of barren plateaus: for more than 20 qubits we were unable to minimize the global cost function introduced in ref. 11. To address this, we introduced a local cost function for quantum autoencoders, which we were able to minimize for system sizes of up to 100 qubits.

There are several directions in which our results could be generalized in future work. Naturally, we hope to extend the narrow gorge phenomenon in Proposition 1 to more general VQAs. In addition, we hope in the future to unify our Theorems 1 and 2 into a single result that bounds the variance as a function of a parameter that quantifies the locality of O. This would further solidify the connection between locality and trainability. Moreover, our numerics suggest that our theorems (which are stated for exact 2-designs) might be extendable in some form to ansatzes composed of simpler blocks, like approximate 2-designs39.

We emphasize that while our theorems are stated for a hardware-efficient ansatz and for costs that are of the form (6), it remains an interesting open question as to whether other ansatzes, cost functions, and architectures exhibit similar scaling behavior as that stated in our theorems. For instance, we have recently shown50 that our results can be extended to a more general type of Quantum Neural Network called dissipative quantum neural networks51. Another potential example of interest could be the unitary-coupled cluster (UCC) ansatz in chemistry52, which is intended for use in the $${\mathcal{O}}(\mathrm{poly}\,(n))$$ depth regime34. Therefore, it is important to study the key mathematical features of an ansatz that might allow one to go from trainability for $${\mathcal{O}}(\mathrm{log}\,n)$$ depth (which we guarantee here for local cost functions) to trainability for $${\mathcal{O}}(\mathrm{poly}\,n)$$ depth.

Finally, we remark that some strategies have been developed to mitigate the effects of barren plateaus32,33,53,54. While these methods are promising and have been shown to work in certain cases, they are still heuristic methods with no provable guarantees that they can work in generic scenarios. Hence, we believe that more work needs to be done to better understand how to prevent, avoid, or mitigate the effects of barren plateaus.

## Methods

In this section, we provide additional details for the results in the main text, as well as a sketch of the proofs for our main theorems. We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. More detailed proofs of our theorems are given in the Supplementary Information.

### Variance of the cost function partial derivative

We first discuss the formulas we employed to compute Var[∂νC]. Without loss of generality, any block Wkl(θkl) in the Alternating Layered Ansatz can be written as a product of ζkl independent gates from a gate alphabet $${\mathcal{A}}=\{{G}_{\mu }(\theta )\}$$ as

$${W}_{kl}({{\boldsymbol{\theta }}}_{kl})={G}_{{\zeta }_{kl}}({\theta }_{kl}^{{\zeta }_{kl}})\ldots {G}_{\nu }({\theta }_{kl}^{\nu })\ldots {G}_{1}({\theta }_{kl}^{1})\ ,$$
(27)

where each $${\theta }_{kl}^{\nu }$$ is a continuous parameter. Here, $${G}_{\nu }({\theta }_{kl}^{\nu })={R}_{\nu }({\theta }_{kl}^{\nu }){Q}_{\nu }$$ where Qν is an unparametrized gate and $${R}_{\nu }({\theta }_{kl}^{\nu })={e}^{-i{\theta }_{kl}^{\nu }{\sigma }_{\nu }/2}$$ with σν a Pauli operator. Note that WkL denotes a block in the last layer of V(θ).
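The gate structure above can be illustrated with a minimal numerical sketch. The choices of Q (a Hadamard gate), of σ (Pauli Y), and of the angle are hypothetical and serve only as an example; the sketch checks that G(θ) = R(θ)Q is unitary and that its derivative with respect to θ equals −(i/2)σG(θ), which is the property that gives rise to Pauli commutators in the partial derivatives of the cost.

```python
import numpy as np

# Single-qubit Pauli operators.
PAULIS = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def rotation(theta, sigma):
    """R(theta) = exp(-i theta sigma / 2), valid because sigma @ sigma = identity."""
    identity = np.eye(sigma.shape[0], dtype=complex)
    return np.cos(theta / 2) * identity - 1j * np.sin(theta / 2) * sigma


def gate(theta, sigma, q):
    """G(theta) = R(theta) Q, with Q a fixed (unparametrized) unitary."""
    return rotation(theta, sigma) @ q


# Hypothetical example values: Q is a Hadamard gate, sigma is Pauli Y.
hadamard = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
theta, eps = 0.7, 1e-6
g = gate(theta, PAULIS["Y"], hadamard)

# Central finite difference of G with respect to theta.
dg_numeric = (gate(theta + eps, PAULIS["Y"], hadamard)
              - gate(theta - eps, PAULIS["Y"], hadamard)) / (2 * eps)
dg_exact = -0.5j * PAULIS["Y"] @ g

is_unitary = np.allclose(g.conj().T @ g, np.eye(2))
derivative_matches = np.allclose(dg_numeric, dg_exact, atol=1e-8)
print(is_unitary, derivative_matches)
```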

For the proofs of our results, it is helpful to conceptually break up the ansatz as follows. Consider a block Wkl(θkl) in the lth layer of the ansatz. For simplicity, we henceforth use W to refer to a given Wkl(θkl). Let Sw denote the m-qubit subsystem that contains the qubits W acts on, and let $${S}_{\overline{w}}$$ be the (n − m)-qubit subsystem on which W acts trivially. Similarly, let $${{\mathcal{H}}}_{w}$$ and $${{\mathcal{H}}}_{\overline{w}}$$ denote the Hilbert spaces associated with Sw and $${S}_{\overline{w}}$$, respectively. Then, as shown in Fig. 3a, V(θ) can be expressed as

$$V({\boldsymbol{\theta }})={V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes W){V}_{{\rm{L}}}\ .$$
(28)

Here, $${{\mathbb{1}}}_{\overline{w}}$$ is the identity on $${{\mathcal{H}}}_{\overline{w}}$$, and VR contains the gates in the (forward) light-cone $${\mathcal{L}}$$ of W, i.e., all gates with at least one input qubit causally connected to the output qubits of W. The latter allows us to define $${S}_{{\mathcal{L}}}$$ as the subsystem of all qubits in $${\mathcal{L}}$$.
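To make the light-cone construction concrete, the following sketch computes the qubits in the forward light-cone of a block. It assumes a 1D brickwork layout with open boundaries in which consecutive layers are offset by m/2 qubits; this layout, the 0-based layer and block indices, and the function names are assumptions made for illustration only.

```python
def layer_blocks(layer, n, m):
    """Qubit sets of the full m-qubit blocks in one layer (layers are 0-indexed).

    Assumed layout: even layers start at qubit 0, odd layers are shifted by
    m // 2, and only blocks that fit entirely inside the n qubits are kept.
    """
    offset = 0 if layer % 2 == 0 else m // 2
    return [set(range(start, start + m)) for start in range(offset, n - m + 1, m)]


def forward_light_cone(n, m, num_layers, layer, block_index):
    """Qubits causally connected to the outputs of the chosen block."""
    cone = set(layer_blocks(layer, n, m)[block_index])
    for later in range(layer + 1, num_layers):
        for qubits in layer_blocks(later, n, m):
            if qubits & cone:  # this block touches the cone, so it joins it
                cone |= qubits
    return cone


# Light-cone size for a block in the first layer as the depth grows.
for depth in (1, 2, 3, 4):
    cone = forward_light_cone(n=16, m=2, num_layers=depth, layer=0, block_index=3)
    print(depth, sorted(cone))
```

Under this assumed layout, the light-cone of a block away from the boundaries gains m qubits per additional layer, so its size depends on the depth rather than on n.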

Let us here recall that the Alternating Layered Ansatz can be implemented with either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We remark that the following results are valid for both cases as the light-cone structure will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. Hence, there is no need to refer to the connectivity dimension in what follows.

Assuming now that θν is a parameter inside a given block W, we obtain from (6), (27), and (28)

$${\partial }_{\nu }C= \frac{i}{2}{\rm{Tr}}\left[({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{B}}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{B}\,}^{\dagger })\right.\\ \left.\times [{{\mathbb{1}}}_{\overline{w}}\otimes {\sigma }_{\nu },({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{A}\,}^{\dagger }){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{A}})]\right]\ ,$$
(29)

with

$${W}_{\text{B}}=\mathop{\prod }\limits_{\mu =1}^{\nu -1}{G}_{\mu }({\theta }^{\mu })\ ,\quad \,{\text{and}}\,\quad {W}_{\text{A}}=\mathop{\prod }\limits_{\mu =\nu }^{\zeta }{G}_{\mu }({\theta }^{\mu })\ .$$
(30)

Finally, from (29) we can derive a general formula for the variance:

$${\rm{Var}}[{\partial }_{\nu }C]=\frac{{2}^{m-1}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{({2}^{2m}-1)}^{2}}\mathop{\sum} _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }}{\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}{\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}$$
(31)

which holds if WA and WB form independent 2-designs. Here, the summation runs over all bitstrings p, q, $${\boldsymbol{p}}^{\prime}$$, $${\boldsymbol{q}}^{\prime}$$ of length n − m, i.e., over the $${2}^{n-m}$$ computational basis states of $${{\mathcal{H}}}_{\overline{w}}$$. In addition, we defined

$${{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }={\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]}{{2}^{m}}\ ,$$
(32)
$${{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }={\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}]{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]}{{2}^{m}}\ ,$$
(33)

where $${{\rm{Tr}}}_{\overline{w}}$$ indicates the trace over subsystem $${S}_{\overline{w}}$$, and Ωqp and Ψpq are operators on $${{\mathcal{H}}}_{w}$$ defined as

$${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}\right]\ ,$$
(34)
$${{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }\right]\ .$$
(35)

We derive Eq. (31) in the Supplementary Note 4.

### Computing averages over V

Here we introduce the main tools employed to compute quantities of the form 〈…〉V. These tools are used throughout the proofs of our main results.

Let us first remark that if the blocks in V(θ) are independent, then any average over V can be computed by averaging over the individual blocks, i.e., $${\langle \ldots \rangle }_{V}={\langle \ldots \rangle }_{{W}_{11},\ldots ,{W}_{kl},\ldots }={\langle \ldots \rangle }_{{V}_{{\rm{L}}},W,{V}_{\text{R}}}$$. For simplicity let us first consider the expectation value over a single block W in the ansatz. In principle 〈…〉W can be approximated by varying the parameters in W and sampling over the resulting $${2}^{m}\times {2}^{m}$$ unitaries. However, if W forms a t-design, this procedure can be simplified as it is known that sampling over its distribution yields the same properties as sampling random unitaries from the unitary group with respect to the unique normalized Haar measure.

Explicitly, the Haar measure is a uniquely defined left- and right-invariant measure dμ(W) over the unitary group, such that for any unitary matrix $$A\in U({2}^{m})$$ and for any function f(W) we have

$${\int}_{U({2}^{m})}d\mu (W)f(W) =\int d\mu (W)f(AW)\\ =\int d\mu (W)f(WA)\ ,$$
(36)

where the integration domain is assumed to be $$U({2}^{m})$$ throughout this work. Consider a finite set $${\{{W}_{y}\}}_{y\in Y}$$ (of size ∣Y∣) of unitaries Wy, and let $${P}_{(t,t)}(W)$$ be an arbitrary polynomial of degree at most t in the matrix elements of W and at most t in those of $${W}^{\dagger }$$. Then, this finite set is a t-design if38

$${\langle {P}_{(t,t)}(W)\rangle }_{w} =\frac{1}{| Y| }\cdot \mathop{\sum}\limits _{y\in Y}{P}_{(t,t)}({W}_{y})\\ =\int d\mu (W){P}_{(t,t)}(W)\ .$$
(37)
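As a small illustration of Eq. (37), the sketch below checks the t = 1 case for a single qubit using the Pauli set {1, X, Y, Z}, which is known to form a 1-design (but not a 2-design). For t = 1 the Haar average of W A W† equals Tr[A] 𝟙/2, so the uniform average over the four Paulis should reproduce this value for any matrix A. The test matrix and the helper name `twirl` are hypothetical.

```python
import numpy as np

identity = np.eye(2, dtype=complex)
pauli_x = np.array([[0, 1], [1, 0]], dtype=complex)
pauli_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
pauli_z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [identity, pauli_x, pauli_y, pauli_z]


def twirl(a, unitaries):
    """Uniform average of W a W^dagger over a finite set of unitaries."""
    return sum(w @ a @ w.conj().T for w in unitaries) / len(unitaries)


# Arbitrary (hypothetical) 2 x 2 complex test matrix.
rng = np.random.default_rng(0)
a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))

finite_average = twirl(a, paulis)
haar_average = np.trace(a) / 2 * identity  # Haar average of W a W^dagger, d = 2
print(np.allclose(finite_average, haar_average))
```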

From the general form of C in Eq. (6) we can see the cost function is a polynomial of degree at most 2 in the matrix elements of each block Wkl in V(θ), and at most 2 in those of $${({W}_{kl})}^{\dagger }$$. Then, if a given block W forms a 2-design, one can employ the following elementwise formula of the Weingarten calculus55,56 to explicitly evaluate averages over W up to the second moment:

$$\begin{array}{ccc}\int d\mu (W){w}_{{\boldsymbol{i}}{\boldsymbol{j}}}{w}_{{\boldsymbol{i}}^{\prime} {\boldsymbol{j}}^{\prime} }^{* }&=&\frac{{\delta }_{{\boldsymbol{i}}{\boldsymbol{i}}^{\prime} }{\delta }_{{\boldsymbol{j}}{\boldsymbol{j}}^{\prime} }}{{2}^{m}}\\ \int d\mu (W){w}_{{{\boldsymbol{i}}}_{1}{{\boldsymbol{j}}}_{1}}{w}_{{{\boldsymbol{i}}}_{2}{{\boldsymbol{j}}}_{2}}{w}_{{\boldsymbol{i}}_1^{\prime} {\boldsymbol{j}}_1^{\prime} }^{* }{w}_{{\boldsymbol{i}}_2^{\prime} {\boldsymbol{j}}_2^{\prime} }^{* }&=&\frac{1}{{2}^{2m}-1}\left({{{\Delta }}}_{1}-\frac{{{{\Delta }}}_{2}}{{2}^{m}}\right)\end{array}$$
(38)

where wij are the matrix elements of W, and

$${{{\Delta }}}_{1} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }\ ,\\ {{{\Delta }}}_{2} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }\ .$$
(39)
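The second-moment formula can be checked numerically. The sketch below evaluates the right-hand side of the second line of Eq. (38), with Δ1 and Δ2 as in Eq. (39), and compares it with a Monte Carlo estimate of the Haar average of |w00|⁴. Haar-random unitaries are drawn with the standard QR-based construction; the sample size, the seed, and the function names are illustrative assumptions.

```python
import numpy as np


def haar_unitary(d, rng):
    """Haar-random d x d unitary via QR of a complex Gaussian matrix."""
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # multiply column k by the phase of r[k, k]


def kron_delta(a, b):
    return 1.0 if a == b else 0.0


def second_moment(d, i1, j1, i2, j2, i1p, j1p, i2p, j2p):
    """Haar average of w_{i1 j1} w_{i2 j2} w*_{i1' j1'} w*_{i2' j2'}, d = 2**m."""
    kd = kron_delta
    delta_1 = (kd(i1, i1p) * kd(i2, i2p) * kd(j1, j1p) * kd(j2, j2p)
               + kd(i1, i2p) * kd(i2, i1p) * kd(j1, j2p) * kd(j2, j1p))
    delta_2 = (kd(i1, i1p) * kd(i2, i2p) * kd(j1, j2p) * kd(j2, j1p)
               + kd(i1, i2p) * kd(i2, i1p) * kd(j1, j1p) * kd(j2, j2p))
    return (delta_1 - delta_2 / d) / (d**2 - 1)


d, samples = 2, 20000
rng = np.random.default_rng(1)

# Monte Carlo estimate of the Haar average of |w_00|^4.
estimate = np.mean([abs(haar_unitary(d, rng)[0, 0]) ** 4 for _ in range(samples)])
exact = second_moment(d, 0, 0, 0, 0, 0, 0, 0, 0)
print(exact, estimate)
```

For all indices equal, the formula reduces to 2/(d(d + 1)) with d = 2^m, which the sampled estimate should approach as the number of samples grows.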

### Intuition behind the main results

The goal of this section is to provide some intuition for our main results. Specifically, we show here how the scaling of the cost function variance can be related to the number of blocks we have to integrate to compute $${\langle \cdots \rangle }_{{V}_{\text{R}},{V}_{{\rm{L}}}}$$, the locality of the cost functions, and with the number of layers in the ansatz.

First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order $$1/{2}^{2m}$$. Hence, we see that the more blocks one integrates over, the worse the scaling can be.

We now generalize the warm-up example. Let V(θ) be a single layer of the alternating ansatz of Fig. 3, i.e., V(θ) is a tensor product of m-qubit blocks Wk := Wk1, with k = 1, …, ξ (and with ξ = n/m), so that θν is in the block $${W}_{k^{\prime} }$$. In the Supplementary Note 5 we generalize this scenario to the case where the input state is not $$\left|{\boldsymbol{0}}\right\rangle$$, but instead is an arbitrary state ρ.

From (31), the variance of the partial derivative of the global cost function in (1) can be expressed as

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{G}}}]=\upsilon \ \prod _{k\ne k^{\prime} }\ {\left\langle {\rm{Tr}}{\left[\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{k}}$$
(40)

where $$\upsilon =\frac{{({2}^{m}-1)}^{2}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{2}^{2m}{({2}^{m+1}-1)}^{2}}$$. From (40), computing the variance requires integrating over ξ − 1 blocks. Since each integration leads to a coefficient of order $$1/{2}^{2m}$$, the variance scales as $${\mathcal{O}}\left(1/{({2}^{2m})}^{\xi -1}\right)={\mathcal{O}}(1/{2}^{2n})$$. Hence, the scaling of the variance gets worse for each block we integrate over (such that the block acts on qubits we are measuring).
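The exponential decay can be made explicit with a short calculation. Each factor in the product of Eq. (40) is the Haar average of |⟨0|Wk|0⟩|⁴, which the second-moment Weingarten formula evaluates to 2/(2^m(2^m + 1)) when Wk is an exact 2-design. The sketch below, with hypothetical function names and with m = 2 chosen only as an example, multiplies this factor over the ξ − 1 blocks with k ≠ k′ and omits the prefactor υ.

```python
from fractions import Fraction


def global_block_average(m):
    """Haar average of Tr[|0><0| W |0><0| W^dagger]^2 = |<0|W|0>|^4 for an m-qubit block."""
    d = 2**m
    return Fraction(2, d * (d + 1))


def global_product(n, m):
    """Product over the xi - 1 blocks with k != k' in Eq. (40), where xi = n / m."""
    xi = n // m
    return global_block_average(m) ** (xi - 1)


# Example with m = 2: every additional block multiplies the product by 1/10.
for n in (4, 8, 16, 32):
    print(n, float(global_product(n, m=2)))
```

Since 2/(2^m(2^m + 1)) is of order 1/2^{2m}, the product over ξ − 1 blocks decays exponentially in n for fixed m.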

On the other hand, for a local cost let us consider a single term in (3) where $$j\in {S}_{\tilde{k}}$$, so that

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{L}}}]\ \propto \ \frac{\upsilon }{{n}^{2}}\ {\left\langle {\rm{Tr}}{\left[(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}){W}_{\tilde{k}}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{\tilde{k}}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{\tilde{k}}}\ .$$
(41)

Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance scales as $${\mathcal{O}}(1/{n}^{2})$$, where we remark that the scaling is essentially given by the prefactor $$1/{n}^{2}$$ in (3).
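The single-block average displayed in Eq. (41) can also be evaluated in closed form. For a Haar-distributed m-qubit block W, the trace equals the sum of |w_{i0}|² over the 2^m/2 basis states i in which qubit j is in |0⟩, so its square averages to a combination of the moments ⟨|w_{i0}|⁴⟩ = 2/(d(d + 1)) and ⟨|w_{i0}|²|w_{i′0}|²⟩ = 1/(d(d + 1)) for i ≠ i′, with d = 2^m. The sketch below is a minimal evaluation of this average only (not of the full variance), and the function name is hypothetical.

```python
from fractions import Fraction


def local_block_average(m):
    """Haar average of Tr[(|0><0|_j x 1) W |0><0|^{x m} W^dagger]^2 for an m-qubit block."""
    d = 2**m
    k = d // 2                          # basis states with qubit j in |0>
    same = Fraction(2, d * (d + 1))     # <|w_i0|^4>
    cross = Fraction(1, d * (d + 1))    # <|w_i0|^2 |w_i'0|^2> for i != i'
    return k * same + k * (k - 1) * cross


# The average equals (d + 2) / (4 (d + 1)) and does not depend on n.
for m in (1, 2, 3, 4, 5):
    print(m, local_block_average(m), float(local_block_average(m)))
```

Under these assumptions the average stays above 1/4 for every m, so the only n-dependence left in Eq. (41) is the prefactor 1/n².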

Let us now briefly provide some intuition as to why the scaling of local cost gradients becomes exponentially vanishing with the number of layers as in Theorem 2. Consider the case when V(θ) contains L layers of the ansatz in Fig. 3. Moreover, as shown in Fig. 7, let W be in the first layer, and let Oi act on the m topmost qubits of $${\mathcal{L}}$$. As schematically depicted in Fig. 7, we now have to integrate over L − 1 blocks. Then, as proved in the Supplementary Note 5, integrating over a block leads to a coefficient $${2}^{m/2}/({2}^{m}+1)$$. Hence, after integrating L − 1 times, we obtain a coefficient $${2}^{m(L-1)/2}/{({2}^{m}+1)}^{L-1}$$ which vanishes no faster than $${{\Omega }}\left(1/\mathrm{poly}\,(n)\right)$$ if $$mL\in {\mathcal{O}}(\mathrm{log}\,(n))$$.
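As a worked example of this scaling, the sketch below evaluates the coefficient for m = 2 with the hypothetical depth choice L = log₂(n), and reports the exponent a for which the coefficient equals n^a. The function name and the chosen values of n are assumptions for illustration.

```python
import math


def layer_coefficient(m, num_layers):
    """(2^(m/2) / (2^m + 1))^(L - 1): factor from integrating over L - 1 blocks."""
    return (2 ** (m / 2) / (2**m + 1)) ** (num_layers - 1)


m = 2
for n in (16, 64, 256, 1024):
    depth = int(math.log2(n))                  # hypothetical choice L = log2(n)
    coeff = layer_coefficient(m, depth)
    exponent = math.log(coeff) / math.log(n)   # coeff = n ** exponent
    print(n, depth, coeff, round(exponent, 3))
```

For m = 2 each block contributes a factor 2/5, so with L = log₂(n) the coefficient is bounded below by n^{log₂(2/5)}, i.e., it decays polynomially rather than exponentially in n.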

As we discuss below, for more general scenarios the computation of Var[∂νC] becomes more complex.

### Sketch of the proof of the main theorems

Here we present a sketch of the proof of Theorems 1 and 2. We refer the reader to the Supplementary Information for a detailed version of the proofs.

As mentioned in the previous subsection, if each block in V(θ) forms a local 2-design, then we can explicitly calculate expectation values 〈…〉W via (38). Hence, to compute $${\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}$$ and $${\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}$$ in (31), one needs to algorithmically integrate over each block using the Weingarten calculus. In order to make such a computation tractable, we employ the tensor network representation of quantum circuits.

For the sake of clarity, we recall that any two-qubit gate can be expressed as $$U={\sum }_{ijkl}{U}_{ijkl}\left|ij\right\rangle \left\langle kl\right|$$, where Uijkl is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the ansatz can be considered as a $${2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}$$ tensor. As schematically shown in Fig. 8a, one can use the circuit description of $${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$$ and Ψpq to derive the tensor network representation of terms such as $${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$$. Here, $${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$$ is obtained from (34) by simply replacing O with Oi.

In Fig. 8b we depict an example where we employ the tensor network representation of $${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}$$ to compute the average of $${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$$, and $${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]$$. As expected, after each integration one obtains a sum of four new tensor networks according to Eq. (38).

### Proof of Theorem 2

Let us first consider an m-local cost function C where O is given by (16), and where $${\widehat{O}}_{i}$$ acts nontrivially in a given subsystem Sk of $${\mathcal{S}}$$. In particular, when $${\widehat{O}}_{i}$$ is of this form the proof is simplified, although the more general proof is presented in the Supplementary Note 6. If $${S}_{k}\not\subset {S}_{{\mathcal{L}}}$$ we find $${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\propto {{\mathbb{1}}}_{w}$$, and hence

$${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]}{{2}^{m}}=0\ .$$
(42)

The latter implies that we only have to consider the operators $${\widehat{O}}_{i}$$ which act on qubits inside of the forward light-cone $${\mathcal{L}}$$ of W.

Then, as shown in the Supplementary Information,

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]}{{2}^{m}}\right\rangle }_{{V}_{\text{R}}}\propto \epsilon ({\widehat{O}}_{i})\ .$$
(43)

Here we remark that the proportionality factor contains terms of the form $${\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}$$ (where $${S}_{\overline{w}}^{+}\cup {S}_{\overline{w}}^{-}={S}_{\overline{w}}$$), which arise from the different tensor contractions of $${P}_{{\boldsymbol{p}}{\boldsymbol{q}}}=\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|$$ in Fig. 8c. It is then straightforward to show that

$$\mathop{\sum}\limits _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }} {\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}{\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}\\ ={\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}\ ,$$
(44)

where we define $${\tilde{\rho }}^{-}$$ as the reduced state of $$\tilde{\rho }={V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }$$ on the Hilbert space associated with the subsystem $${S}_{w}\cup {S}_{\overline{w}}^{-}$$. Here we recall that DHS is the Hilbert–Schmidt distance $${D}_{\text{HS}}\left(\rho ,\sigma \right)={\rm{Tr}}[{(\rho -\sigma )}^{2}]$$.
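For reference, the Hilbert–Schmidt distance to the maximally mixed state on m qubits reduces to the purity minus 1/2^m, since Tr[(ρ − 𝟙/2^m)²] = Tr[ρ²] − 1/2^m for any normalized state ρ. The sketch below checks this identity numerically; the random test state and the helper names are hypothetical and serve only as an example.

```python
import numpy as np


def hs_distance(rho, sigma):
    """Hilbert-Schmidt distance D_HS(rho, sigma) = Tr[(rho - sigma)^2]."""
    diff = rho - sigma
    return float(np.real(np.trace(diff @ diff)))


def random_density_matrix(d, rng):
    """Hypothetical helper: random mixed state rho = A A^dagger / Tr[A A^dagger]."""
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = a @ a.conj().T
    return rho / np.trace(rho)


m = 2
d = 2**m
rng = np.random.default_rng(7)
rho_w = random_density_matrix(d, rng)
maximally_mixed = np.eye(d) / d

distance = hs_distance(rho_w, maximally_mixed)
purity = float(np.real(np.trace(rho_w @ rho_w)))
print(np.isclose(distance, purity - 1 / d))
```

In particular, the distance vanishes only for the maximally mixed state and reaches its maximum 1 − 1/2^m for pure states.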

By employing properties of DHS one can show (see Supplementary Note 6)

$${D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\ge \frac{{D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)}{{2}^{m(L-l+2)/2}}\ ,$$
(45)

where $${\tilde{\rho }}_{w}={{\rm{Tr}}}_{{S}_{\overline{w}}^{-}}[{\tilde{\rho }}^{-}]$$. We can then leverage the tensor network representation of quantum circuits to algorithmically integrate over each block in VL and compute $${\langle {D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\rangle }_{{V}_{{\rm{L}}}}$$. One finds

$${\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}=\mathop{\sum} _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{t}_{k,k^{\prime} }\epsilon ({\rho }_{k,k^{\prime} })\ ,$$
(46)

with $${t}_{k,k^{\prime} }\geqslant \frac{{2}^{ml}}{{({2}^{m}+1)}^{2l}}$$ $$\forall k,k^{\prime}$$, and $$\epsilon ({\rho }_{k,k^{\prime} })$$ defined in Theorem 2. Combining these results leads to Theorem 2. Moreover, as detailed in the Supplementary Information, Theorem 2 is also valid when O is of the form (16).

### Proof of Theorem 1

Let us now provide a sketch of the proof of Theorem 1, case (i). Here we denote for simplicity $${\widehat{O}}_{k}:={\widehat{O}}_{1k}$$. We leave the proof of case (ii) for the Supplementary Note 7. In case (i), there are operators Oi which act outside of the forward light-cone $${\mathcal{L}}$$ of W. Hence, it is convenient to include in VR not only all the gates in $${\mathcal{L}}$$ but also all the blocks in the final layer of V(θ) (i.e., all blocks WkL, with k = 1, …, ξ). We can define $${S}_{\overline{{\mathcal{L}}}}$$ as the complement of $${S}_{{\mathcal{L}}}$$, i.e., as the subsystem of all qubits which are not in $${\mathcal{L}}$$ (with associated Hilbert space $${{\mathcal{H}}}_{\overline{{\mathcal{L}}}}$$). Then, we have $${V}_{\text{R}}={V}_{{\mathcal{L}}}\otimes {V}_{\overline{{\mathcal{L}}}}$$ and $$\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|=\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}\otimes \left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}$$, where we define $${V}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k\in {k}_{\overline{{\mathcal{L}}}}}{W}_{kL}$$, $$\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}:={\bigotimes }_{k\in {k}_{{\mathcal{L}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k}$$, and $$\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k^{\prime} }$$. Here, we define $${k}_{{\mathcal{L}}}:=\{k:{S}_{k}\subseteq {S}_{{\mathcal{L}}}\}$$ and $${k}_{\overline{{\mathcal{L}}}}:=\{k:{S}_{k}\subseteq {S}_{\overline{{\mathcal{L}}}}\}$$, which are the sets of indices whose associated qubits are inside and outside $${\mathcal{L}}$$, respectively. We also write $$O={c}_{0}{\mathbb{1}}+{c}_{1}{\hat{O}}_{{\mathcal{L}}}\otimes {\hat{O}}_{\overline{{\mathcal{L}}}}$$, where we define $${\hat{O}}_{{\mathcal{L}}}:={\bigotimes }_{k\in {k}_{{\mathcal{L}}}}{\widehat{O}}_{k}$$ and $${\hat{O}}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}{\widehat{O}}_{k^{\prime} }$$.

Using the fact that the blocks in V(θ) are independent we can now compute $${\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}={\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\overline{{\mathcal{L}}}},{V}_{{\mathcal{L}}}}$$. Then, from the definition of Ωqp in Eq. (34), one can always express

$${\left\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\right\rangle }_{{V}_{\text{R}}}= \, {\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}\\ \times \left(\prod\limits_{k\in {k}_{\overline{{\mathcal{L}}}}}{\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\right)\ ,$$
(47)

with

$${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}} ={{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\mathcal{L}}}^{\dagger }{O}^{{\mathcal{L}}}{V}_{{\mathcal{L}}}\right]\\ {{{\Omega }}}_{k} ={\rm{Tr}}\left[\left|{\boldsymbol{p}}\right\rangle {\left\langle {\boldsymbol{q}}\right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]{\rm{Tr}}\left[\left|{\boldsymbol{p}}^{\prime} \right\rangle {\left\langle {\boldsymbol{q}}^{\prime} \right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]$$

and where $${{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}$$ indicates the partial trace over the Hilbert space associated with the qubits in $${S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}$$. As detailed in the Supplementary Information, we can use Eq. (38) to show that

$${\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\leqslant \frac{{r}_{k}^{2}\left({\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}+{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{k}}}\right)}{{2}^{2m}-1}\ .$$
(48)

On the other hand, as shown in Supplementary Note 7 (and as schematically depicted in Fig. 8c), when computing the expectation value $${\langle \ldots \rangle }_{{V}_{{\mathcal{L}}}}$$ in Eq. (47), one obtains

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}=\mathop{\sum} _{\tau }{t}_{\tau }^{ij}{{\Delta }}{O}_{\tau }^{{\mathcal{L}}}{\delta }_{\tau }\ ,$$
(49)

where we defined $${\delta }_{\tau }={\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\tau }}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\tau }}}$$, $${t}_{\tau }\in {\mathbb{R}}$$, $${S}_{\tau }\cup {S}_{\overline{\tau }}={S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}$$ (with $${S}_{\tau }\,\ne\, {{\emptyset}}$$), and

$${{\Delta }}{O}_{\tau }^{{\mathcal{L}}}= {{\rm{Tr}}}_{{x}_{\tau }{y}_{\tau }}\left[{{\rm{Tr}}}_{{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{z}_{\tau }}[{O}_{j}]\right]\\ -\frac{{{\rm{Tr}}}_{{x}_{\tau }}\left[{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}[{O}_{j}]\right]}{{2}^{m}}\ .$$
(50)

Here we use the notation $${{\rm{Tr}}}_{{x}_{\tau }}$$ to indicate the trace over the Hilbert space associated with subsystem $${S}_{{x}_{\tau }}$$, such that $${S}_{{x}_{\tau }}\cup {S}_{{y}_{\tau }}\cup {S}_{{z}_{\tau }}={S}_{{\mathcal{L}}}$$. As shown in Supplementary Note 7, combining the deltas in Eqs. (48) and (49) with $${\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}$$ leads to Hilbert–Schmidt distances between two quantum states as in Eq. (44). One can then use the bounds $${D}_{\text{HS}}\left({\rho }_{1},{\rho }_{2}\right)\le 2$$, $${{\Delta }}{O}_{\tau }^{{\mathcal{L}}}\le {\prod }_{k\in {k}_{{\mathcal{L}}}}{r}_{k}^{2}$$, and $${\sum }_{\tau }{t}_{\tau }\le 2$$, along with some additional simple algebra explained in the Supplementary Information, to obtain the upper bound in Theorem 1.
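As a brief illustration of the first of these bounds, suppose the Hilbert–Schmidt distance takes the form $${D}_{\text{HS}}({\rho }_{1},{\rho }_{2})={\rm{Tr}}[{({\rho }_{1}-{\rho }_{2})}^{2}]$$. This definition is not restated in this section, so it is an assumption of the sketch. Expanding the square then gives

$${D}_{\text{HS}}({\rho }_{1},{\rho }_{2})={\rm{Tr}}[{\rho }_{1}^{2}]+{\rm{Tr}}[{\rho }_{2}^{2}]-2\,{\rm{Tr}}[{\rho }_{1}{\rho }_{2}]\le 1+1-0=2\ ,$$

since the purity of any quantum state is at most 1 and $${\rm{Tr}}[{\rho }_{1}{\rho }_{2}]\ge 0$$ for positive semidefinite operators. Under this definition the bound is saturated by two orthogonal pure states.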

### Ansatz and optimization method

Here we describe the gradient-free optimization method used in our heuristics. First, we note that all the parameters in the ansatz are randomly initialized. Then, at each iteration, one solves the following sub-space search problem: $$\mathop{\min }\limits_{{\boldsymbol{s}}\in {{\mathbb{R}}}^{d}}C({\boldsymbol{\theta }}+{\boldsymbol{A}}\cdot {\boldsymbol{s}})$$, where $${\boldsymbol{A}}$$ is a randomly generated isometry, and $${\boldsymbol{s}}=({s}_{1},\ldots ,{s}_{d})$$ is a vector of coefficients to be optimized over. We used d = 10 in our simulations. Moreover, the training algorithm gradually increases the number of shots per cost-function evaluation. Initially, C is evaluated with 10 shots, and once the optimization reaches a plateau, the number of shots is increased by a factor of 3/2. This process is repeated until a termination condition on the value of C is achieved, or until we reach the maximum value of $$10^{5}$$ shots per function evaluation. While this is a simple variable-shot approach, we remark that a more advanced variable-shot optimizer can be found in ref. 57.
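The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the simulations: the toy cost `noisy_cost`, the inner random-search solver, the plateau test, the termination threshold `target`, and the rounding of the shot count to an integer are all hypothetical choices that the text does not specify. Only d = 10, the initial 10 shots, the factor 3/2, and the cap of 10^5 shots are taken from the description.

```python
import math
import random


def random_isometry(n_params, d, rng):
    """Return d orthonormal vectors of length n_params (the columns of A)."""
    if d > n_params:
        raise ValueError("need d <= n_params")
    cols = []
    while len(cols) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
        for u in cols:  # modified Gram-Schmidt against earlier columns
            dot = sum(a * b for a, b in zip(u, v))
            v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        if norm > 1e-12:
            cols.append([b / norm for b in v])
    return cols


def noisy_cost(theta, shots, rng):
    """Hypothetical stand-in for C: toy landscape plus ~1/sqrt(shots) noise."""
    exact = sum(math.sin(t) ** 2 for t in theta) / len(theta)
    return exact + rng.gauss(0.0, 1.0) / math.sqrt(shots)


def next_shots(shots, max_shots):
    """Increase the shot count by a factor 3/2, rounded up and capped."""
    return min(max_shots, math.ceil(1.5 * shots))


def subspace_step(theta, d, shots, rng, trials=20, step=0.3):
    """Approximately minimize C(theta + A s) over s by random search (assumed solver)."""
    A = random_isometry(len(theta), d, rng)
    best_theta, best_cost = theta, noisy_cost(theta, shots, rng)
    for _ in range(trials):
        s = [rng.gauss(0.0, step) for _ in range(d)]
        cand = [t + sum(A[j][i] * s[j] for j in range(d))
                for i, t in enumerate(theta)]
        c = noisy_cost(cand, shots, rng)
        if c < best_cost:
            best_theta, best_cost = cand, c
    return best_theta, best_cost


def train(n_params=20, d=10, iters=200, max_shots=10**5, target=0.02, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_params)]
    shots, history = 10, []
    for _ in range(iters):
        theta, cost = subspace_step(theta, d, shots, rng)
        history.append(cost)
        if cost < target:  # termination condition on the (noisy) cost value
            break
        # Assumed plateau test: no improvement over the last 5 iterations.
        if len(history) > 5 and min(history[-5:]) >= min(history[:-5]):
            if shots >= max_shots:
                break
            shots = next_shots(shots, max_shots)
    return theta, shots, history
```

Calling `train()` returns the final parameters, the final number of shots, and the list of noisy cost values. Because early iterations use very few shots, the recorded cost values are noisy, which is the regime a variable-shot schedule is meant to exploit.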

Finally, let us remark that while we employ a sub-space search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., number of shots) that grows exponentially with n. The latter is due to the fact that an exponentially vanishing gradient implies that on average the cost function landscape will essentially be flat, with the slope of the order of $${\mathcal{O}}(1/{2}^{n})$$. Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost minimization direction with gradient-based, or even with black-box optimizers such as the Nelder–Mead method58,59,60,61.
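To make the scaling in the preceding paragraph concrete, the following sketch estimates how many shots are needed to resolve a cost change of order 1/2^n. It assumes the standard shot-noise scaling, in which the statistical error of an N-shot estimate is about σ/√N. The function name, the per-shot standard deviation `sigma`, and the `signal_to_noise` factor are illustrative assumptions, and the constant hidden in the $${\mathcal{O}}(1/{2}^{n})$$ slope is set to 1.

```python
import math


def shots_to_resolve(n_qubits, sigma=1.0, signal_to_noise=1.0):
    """Smallest N with sigma / sqrt(N) <= 2**(-n_qubits) / signal_to_noise."""
    slope = 2.0 ** (-n_qubits)  # assumed size of the cost change to detect
    return math.ceil((signal_to_noise * sigma / slope) ** 2)


if __name__ == "__main__":
    for n in (4, 8, 9, 20):
        print(n, shots_to_resolve(n))
```

Under these assumptions the required number of shots grows as 4^n: 65,536 shots for n = 8 and 262,144 for n = 9. A budget of 10^5 shots per evaluation would therefore already be insufficient at n = 9 in this simplified picture.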

## Data availability

Data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

## References

1. Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).

2. McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).

3. Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).

4. Hadfield, S. et al. From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12, 34 (2019).

5. Hastings, M. B. Classical and quantum bounded depth approximation algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019).

6. Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242 (2017).

7. Arute, F. et al. Hartree–Fock on a superconducting qubit quantum computer. Science 369, 1084–1089 (2020).

8. Harrigan, M. P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nat. Phys. 1–5 (2021).

9. McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).

10. Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 4213 (2014).

11. Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quant. Sci. Technol. 2, 045001 (2017).

12. Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).

13. Koczor, B., Endo, S., Jones, T., Matsuzaki, Y. & Benjamin, S. C. Variational-state quantum metrology. New J. Phys. 22, 083038 (2020).

14. Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).

15. Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at https://arxiv.org/abs/1811.03147 (2019).

16. Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational quantum compiling. New J. Phys. 22, 043006 (2020).

17. Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018).

18. LaRose, R., Tikku, A., O’Neel-Judy, É., Cincio, L. & Coles, P. J. Variational quantum state diagonalization. npj Quant. Inf. 5, 1–10 (2018).

19. Bravo-Prieto, C., García-Martín, D. & Latorre, J. Quantum singular value decomposer. Phys. Rev. A 101, 062310 (2020).

20. Li, Y. & Benjamin, S. C. Efficient variational quantum simulator incorporating active error minimization. Phys. Rev. X 7, 021050 (2017).

21. Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019).

22. Cirstoiu, C. et al. Variational fast forwarding for quantum simulation beyond the coherence time. npj Quant. Inf. 6, 1–10 (2020).

23. Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient quantum dynamics using symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 (2019).

24. Cerezo, M., Poremba, A., Cincio, L. & Coles, P. J. Variational quantum fidelity estimation. Quantum 4, 248 (2020).

25. Carolan, J. et al. Variational quantum unsampling on a quantum photonic processor. Nat. Phys. 16, 322–327 (2020).

26. Arrasmith, A., Cincio, L., Sornborger, A. T., Zurek, W. H. & Coles, P. J. Variational consistent histories as a hybrid algorithm for quantum foundations. Nat. Commun. 10, 3438 (2019).

27. Bravo-Prieto, C. et al. Variational quantum linear solver. Preprint at https://arxiv.org/abs/1909.05820 (2019).

28. Xu, X. et al. Variational algorithms for linear algebra. Preprint at https://arxiv.org/abs/1909.03898 (2019).

29. Huang, H.-Y., Bharti, K. & Rebentrost, P. Near-term quantum algorithms for linear systems of equations. Preprint at https://arxiv.org/abs/1909.07344 (2019).

30. Cerezo, M. & Coles, P. J. Impact of barren plateaus on the Hessian and higher order derivatives. Preprint at https://arxiv.org/abs/2008.07454 (2020).

31. Arrasmith, A., Cerezo, M., Czarnik, P., Cincio, L. & Coles, P. J. Effect of barren plateaus on gradient-free optimization. Preprint at https://arxiv.org/abs/2011.12245 (2020).

32. Grant, E., Wossnig, L., Ostaszewski, M. & Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum 3, 214 (2019).

33. Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019).

34. Lee, J., Huggins, W. J., Head-Gordon, M. & Whaley, K. B. Generalized unitary coupled cluster wave functions for quantum computation. J. Chem. Theory Comput. 15, 311–324 (2018).

35. Verdon, G. et al. Quantum graph neural networks. Preprint at https://arxiv.org/abs/1909.12264 (2019).

36. IBM Q: Quantum devices and simulators. https://www.research.ibm.com/ibm-q/technology/devices/.

37. Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).

38. Dankert, C., Cleve, R., Emerson, J. & Livine, E. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A 80, 012304 (2009).

39. Brandao, F. G. S. L., Harrow, A. W. & Horodecki, M. Local random quantum circuits are approximate polynomial-designs. Commun. Math. Phys. 346, 397–434 (2016).

40. Harrow, A. & Mehraban, S. Approximate unitary t-designs by short random quantum circuits using nearest-neighbor and long-range gates. Preprint at https://arxiv.org/abs/1809.06957 (2018).

41. Wan, K. H., Dahlsten, O., Kristjánsson, H., Gardner, R. & Kim, M. S. Quantum generalisation of feedforward neural networks. npj Quant. Inf. 3, 36 (2017).

42. Lamata, L. et al. Quantum autoencoders via quantum adders with genetic algorithms. Quant. Sci. Technol. 4, 014007 (2018).

43. Pepper, A., Tischler, N. & Pryde, G. J. Experimental realization of a quantum autoencoder: the compression of qutrits via machine learning. Phys. Rev. Lett. 122, 060501 (2019).

44. Verdon, G., Pye, J. & Broughton, M. A universal training algorithm for quantum deep learning. Preprint at https://arxiv.org/abs/1806.09729 (2018).

45. Pennington, J. & Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 2798–2806, http://proceedings.mlr.press/v70/pennington17a.html.

46. Tranter, A., Love, P. J., Mintert, F. & Coveney, P. V. A comparison of the Bravyi–Kitaev and Jordan–Wigner transformations for the quantum simulation of quantum chemistry. J. Chem. Theory Comput. 14, 5617–5630 (2018).

47. Bravyi, S. B. & Kitaev, A. Y. Fermionic quantum computation. Ann. Phys. 298, 210–226 (2002).

48. Uvarov, A., Biamonte, J. D. & Yudin, D. Variational quantum eigensolver for frustrated quantum systems. Phys. Rev. B 102, 075104 (2020).

49. Biamonte, J. Universal variational quantum computation. Preprint at https://arxiv.org/abs/1903.04500 (2019).

50. Sharma, K., Cerezo, M., Cincio, L. & Coles, P. J. Trainability of dissipative perceptron-based quantum neural networks. Preprint at https://arxiv.org/abs/2005.12458 (2020).

51. Beer, K. et al. Training deep quantum neural networks. Nat. Commun. 11, 1–6 (2020).

52. Bartlett, R. J. & Musiał, M. Coupled-cluster theory in quantum chemistry. Rev. Modern Phys. 79, 291 (2007).

53. Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. (2021). https://iopscience.iop.org/article/10.1088/2058-9565/abd891.

54. Skolik, A. et al. Layerwise learning for quantum neural networks. Quantum Mach. Intell. 3, 1–11 (2021).

55. Collins, B. & Śniady, P. Integration with respect to the Haar measure on unitary, orthogonal and symplectic group. Commun. Math. Phys. 264, 773–795 (2006).

56. Puchała, Z. & Miszczak, J. A. Symbolic integration with respect to the Haar measure on the unitary groups. Bull. Pol. Acad. Sci. Tech. Sci. 65, 21–27 (2017).

57. Kübler, J. M., Arrasmith, A., Cincio, L. & Coles, P. J. An adaptive optimizer for measurement-frugal variational algorithms. Quantum 4, 263 (2020).

58. Nelder, J. A. & Mead, R. A simplex method for function minimization. Comput. J. 7, 308–313 (1965).

59. Paley, R. E. A. C. & Zygmund, A. A note on analytic functions in the unit circle. Math. Proc. Camb. Phil. Soc. 28, 266 (1932).

60. Fukuda, M., König, R. & Nechita, I. RTNI–a symbolic integrator for Haar-random tensor networks. J. Phys. A 52, 425303 (2019).

61. Nielsen, M. A. & Chuang, I. L. Quantum computation and quantum information: 10th Anniversary Edition, 10th ed. (Cambridge University Press, New York, NY, USA, 2011).

## Acknowledgements

We thank Jacob Biamonte, Elizabeth Crosson, Burak Sahinoglu, Rolando Somma, Guillaume Verdon, and Kunal Sharma for helpful conversations. All authors were supported by the Laboratory Directed Research and Development (LDRD) program of Los Alamos National Laboratory (LANL) under project numbers 20180628ECR (for M.C.), 20190065DR (for A.S., L.C., and P.J.C.), and 20200677PRD1 (for T.V.). M.C. and A.S. were also supported by the Center for Nonlinear Studies at LANL. P.J.C. acknowledges initial support from the LANL ASC Beyond Moore’s Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.

## Author information

### Contributions

The project was conceived by M.C., L.C., and P.J.C. The manuscript was written by M.C., A.S., T.V., L.C., and P.J.C. T.V. proved Proposition 1. M.C. and A.S. proved Proposition 2 and Theorems 1–2. M.C., A.S., T.V., and P.J.C. proved Corollaries 1–2. M.C., A.S., T.V., L.C., and P.J.C. analyzed the quantum autoencoder. For the numerical results, T.V. performed the simulation in Fig. 2, and L.C. performed the simulation in Fig. 5.

### Corresponding authors

Correspondence to M. Cerezo or Patrick J. Coles.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Cerezo, M., Sone, A., Volkoff, T. et al. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat Commun 12, 1791 (2021). https://doi.org/10.1038/s41467-021-21728-w

• DOI: https://doi.org/10.1038/s41467-021-21728-w
