One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications1. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.

The leading strategy to make use of these devices is variational quantum algorithms (VQAs)2. VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are three-fold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.

There are very few rigorous scaling results for VQAs (with exception of one-layer approximate optimization3,4,5). Ideally, in order to reduce gate overhead that arises when implementing on quantum hardware one would like to employ a hardware-efficient ansatz6 for V(θ). As recent large-scale implementations for chemistry7 and optimization8 applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients9. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.

The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑αcασα, with H expanded as a sum of local Pauli operators10. In VQE, the cost function is obviously the energy \(C=\left\langle \psi | H| \psi \right\rangle\) of the trial state \(\left|\psi \right\rangle\). However, VQAs have been proposed for other applications, like quantum data compression11, quantum error correction12, quantum metrology13, quantum compiling14,15,16,17, quantum state diagonalization18,19, quantum simulation20,21,22,23, fidelity estimation24, unsampling25, consistent histories26, and linear systems27,28,29. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.

We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as \({\mathcal{O}}(\mathrm{poly}\,(n))\). In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling30, and this scaling can also apply to derivative-free optimization31. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term “trainable” for polynomially vanishing gradients.

In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state \(\left|{\psi }_{0}\right\rangle\). A natural cost function is the square of the trace distance DT between \(\left|{\psi }_{0}\right\rangle\) and \(\left|\psi \right\rangle =V{({\boldsymbol{\theta }})}^{\dagger }\left|{\boldsymbol{0}}\right\rangle\), given by \({C}_{{\rm{G}}}={D}_{\text{T}}{(\left|{\psi }_{0}\right\rangle ,\left|\psi \right\rangle )}^{2}\), which is equivalent to

$${C}_{{\rm{G}}}={\rm{Tr}}[{O}_{{\rm{G}}}V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \ \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$

with \({O}_{{\rm{G}}}={\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|\). Note that \(\sqrt{{C}_{{\rm{G}}}}\ge | \left\langle \psi | M| \psi \right\rangle -\left\langle {\psi }_{0}| M| {\psi }_{0}\right\rangle |\) has a nice operational meaning as a bound on the expectation value difference for a POVM element M.

However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., \(\left|\psi \right\rangle\) and \(\left|{\psi }_{0}\right\rangle\)). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.

Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.9, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies32,33 or clever PQCs ansatzes4,34,35. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.

While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group14,18, and more recently in29. This motivated the proposal of local cost functions14,16,18,22,25,26,27, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning.

Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth \({\mathcal{O}}(\mathrm{log}\,n)\). Figure 1 summarizes our two main results.

Fig. 1: Summary of our main results.
figure 1

McClean et al.9 proved that a barren plateau can occur when the depth D of a hardware-efficient ansatz is \(D\in {\mathcal{O}}(\mathrm{poly}\,(n))\). Here we extend these results by providing bounds for the variance of the gradient of global and local cost functions as a function of D. In particular, we find that the barren plateau phenomenon is cost-function dependent. a For global cost functions (e.g., Eq. (1)), the landscape will exhibit a barren plateau essentially for all depths D. b For local cost functions (e.g., Eq. (2)), the gradient vanishes at worst polynomially and hence is trainable when \(D\in {\mathcal{O}}(\mathrm{log}\,(n))\), while barren plateaus occur for \(D\in {\mathcal{O}}(\mathrm{poly}\,(n))\), and between these two regions the gradient transitions from polynomial to exponential decay.

Finally, we illustrate our main results for an important example: quantum autoencoders11. Our large-scale numerics show that the global cost function proposed in11 has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.


Warm-up example

To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state being \(\left|{\boldsymbol{0}}\right\rangle\). We assume a tensor-product ansatz of the form \(V({\boldsymbol{\theta }}){ = \bigotimes }_{j = 1}^{n}{e}^{-i{\theta }^{j}{\sigma }_{x}^{(j)}/2}\), with the goal of finding the angles θj such that \(V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle =\left|{\boldsymbol{0}}\right\rangle\). Employing the global cost of (1) results in \({C}_{{\rm{G}}}=1-\mathop{\prod }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}\). The barren plateau can be detected via the variance of its gradient: \({\rm{Var}}[\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}]=\frac{1}{8}{(\frac{3}{8})}^{n-1}\), which is exponentially vanishing in n. Since the mean value is \(\left\langle \frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right\rangle =0\), the gradient concentrates exponentially around zero.

On the other hand, consider a local cost function:

$${C}_{{\rm{L}}}={\rm{Tr}}\left[{O}_{{\rm{L}}}V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|V{({\boldsymbol{\theta }})}^{\dagger }\right],$$
$${\,\text{with}\,}\quad {O}_{{\rm{L}}}={\mathbb{1}}-{\frac {1}{n}}\mathop{\sum }\limits_{j=1}^{n}\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\ ,$$

where \({{\mathbb{1}}}_{\overline{j}}\) is the identity on all qubits except qubit j. Note that CL vanishes under the same conditions as CG14,16, CL = 0 CG = 0. We find \({C}_{{\rm{L}}}=1-\frac{1}{n}\mathop{\sum }\nolimits_{j = 1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}\), and the variance of its gradient is \({\rm{Var}}[\frac{\partial {C}_{{\rm{L}}}}{\partial {\theta }^{j}}]=\frac{1}{8{n}^{2}}\), which vanishes polynomially with n and hence exhibits no barren plateau. Figure 2 depicts the cost landscapes of CG and CL for two values of n and shows that the barren plateau can be avoided here via a local cost function.

Fig. 2: Cost function landscapes.
figure 2

a Two-dimensional cross-section through the landscape of \({C}_{{\rm{G}}}=1-\mathop{\prod }\nolimits_{j = 1}^{n}{\cos }^{2}({\theta }^{j}/2)\) for n = 4 (blue) and n = 24 (orange). b The same cross-section through the landscape of \({C}_{{\rm{L}}}=1-\frac{1}{n}\mathop{\sum }\nolimits_{j = 1}^{n}{\cos }^{2}({\theta }^{j}/2)\) is independent of n. In both cases, 200 Haar distributed points are shown, with very few (most) of these points lying in the valley containing the global minimum of CG (CL).

Moreover, this example allows us to delve deeper into the cost landscape to see a phenomenon that we refer to as a narrow gorge. While a barren plateau is associated with a flat landscape, a narrow gorge refers to the steepness of the valley that contains the global minimum. This phenomenon is illustrated in Fig. 2, where each dot corresponds to cost values obtained from randomly selected parameters θ. For CG we see that very few dots fall inside the narrow gorge, while for CL the narrow gorge is not present. Note that the narrow gorge makes it harder to train CG since the learning rate of descent-based optimization algorithms must be exponentially small in order not to overstep the narrow gorge. The following proposition (proved in the Supplementary Note 2) formalizes the narrow gorge for CG and its absence for CL by characterizing the dependence on n of the probability Cδ. This probability is associated with the parameter space volume that leads to Cδ.

Proposition 1

Let θj be uniformly distributed on [−π, π] j. For any δ (0, 1), the probability that CG ≤ δ satisfies

$$\Pr \{{C}_{{\rm{G}}}\le \delta \}\le {(1-\delta )}^{-1}{\left(\frac{1}{2}\right)}^{n}.$$

For any \(\delta \in [\frac{1}{2},1]\), the probability that CL ≤ δ satisfies

$$\Pr \{{C}_{{\rm{L}}}\le \delta \}\ge \frac{{(2\delta -1)}^{2}}{\frac{1}{2n}+{(2\delta -1)}^{2}}\mathop{\longrightarrow }\limits_{n\to \infty }1\ .$$

General framework

For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows

$$C={\rm{Tr}}\left[OV({\boldsymbol{\theta }})\rho {V}^{\dagger }({\boldsymbol{\theta }})\right]\ ,$$

where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where \(\rho =\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|\), which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA.

It is typical to express O as a linear combination of the form \(O={c}_{0}{\mathbb{1}}+\mathop{\sum }\nolimits_{i = 1}^{N}{c}_{i}{O}_{i}\). Here Oi ≠ \({\mathbb{1}}\), \({c}_{i}\in {\mathbb{R}}\), and we assume that at least one ci ≠ 0. Note that CG and CL in (1) and (2) fall under this framework. In our main results below, we will consider two different choices of O that respectively capture our general notions of global and local cost functions and also generalize the aforementioned CG and CL.

As shown in Fig. 3a, V(θ) consists of L layers of m-qubit unitaries Wkl(θkl), or blocks, acting on alternating groups of m neighboring qubits. We refer to this as an Alternating Layered Ansatz. We remark that the Alternating Layered Ansatz will be a hardware-efficient ansatz so long as the gates that compose each block are taken from a set of gates native to a specific device. As depicted in Fig. 3c, the one dimensional Alternating Layered Ansatz can be readily implemented in devices with one-dimensional connectivity, as well as in devices with two-dimensional connectivity (such as that of IBM’s36 and Google’s37 quantum devices). That is, with both one- and two-dimensional hardware connectivity one can group qubits to form an Alternating Layered Ansatz as in Fig. 3a.

Fig. 3: Alternating Layered Ansatz.
figure 3

a Each block Wkl acts on m qubits and is parametrized via (27). As shown, we define Sk as the m-qubit subsystem on which WkL acts, where L is the last layer of V(θ). Given some block W, it is useful for our proofs (outlined in the Methods) to write \(V({\boldsymbol{\theta }})={V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes W){V}_{{\rm{L}}}\), where VR contains all gates in the forward light-cone \({\mathcal{L}}\) of W. The forward light-cone \({\mathcal{L}}\) is defined as all gates with at least one input qubit causally connected to the output qubits of W. We define \(\overline{{\mathcal{L}}}\) as the compliment of \({\mathcal{L}}\), Sw as the m-qubit subsystem on which W acts, and \({S}_{\overline{w}}\) as the n − m qubit subsystem on which W acts trivially. b The operator Oi acts nontrivially only in subsystem \({S}_{k-1}\in {\mathcal{S}}\), while \({O}_{i^{\prime} }\) acts nontrivially on the first m/2 qubits of Sk+1, and on the second m/2 qubits of Sk. c Depiction of the Alternating Layered Ansatz with one- and two-dimensional connectivity. Each circle represents a physical qubit.

The index l = 1, …, L in Wkl(θkl) indicates the layer that contains the block, while k = 1, …, ξ indicates the qubits it acts upon. We assume n is a multiple of m, with n = mξ, and that m does not scale with n. As depicted in Fig. 3a, we define Sk as the m-qubit subsystem on which WkL acts, and we define \({\mathcal{S}}=\{{S}_{k}\}\) as the set of all such subsystems. Let us now consider a block Wkl(θkl) in the lth layer of the ansatz. For simplicity we henceforth use W to refer to a given Wkl(θkl). As shown in the Methods section, given a θνθkl that parametrizes a rotation \({e}^{-i{\theta }^{\nu }{\sigma }_{\nu }/2}\) (with σν a Pauli operator) inside a given block W, one can always express

$$\frac{\partial W}{\partial {\theta }^{\nu }}:={\partial }_{\nu }W=\frac{-i}{2}{W}_{\text{A}}{\sigma }_{\nu }{W}_{\text{B}},$$

where WA and WB contain all remaining gates in W, and are properly defined in the Methods section.

The contribution to the gradient C from a parameter θν in the block W is given by the partial derivative ∂νC. While the value of ∂νC depends on the specific parameters θ, it is useful to compute \({\langle {\partial }_{\nu }C\rangle }_{V}\), i.e., the average gradient over all possible unitaries V(θ) within the ansatz. Such an average may not be representative near the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V(θ). In the Methods Section we explicitly show how to compute averages of the form 〈…〉V, and in the Supplementary Note 3 we provide a proof for the following Proposition.

Proposition 2

The average of the partial derivative of any cost function of the form (6) with respect to a parameter θν in a block W of the ansatz in Fig. 3 is

$${\langle {\partial }_{\nu }C\rangle }_{V}=0\ ,$$

provided that either WA or WB of (7) form a 1-design.

Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure up to the first t moments38. The Methods section provides a formal definition of a t-design.

Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:

$${\rm{Var}}[{\partial }_{\nu }C]={\left\langle {\left({\partial }_{\nu }C\right)}^{2}\right\rangle }_{V}\ ,$$

where we used the fact that \({\langle {\partial }_{\nu }C\rangle }_{V}=0\). The magnitude of Var[∂νC] quantifies how much the partial derivative concentrates around zero, and hence small values in (9) imply that the slope of the landscape will typically be insufficient to provide a cost-minimizing direction. Specifically, from Chebyshev’s inequality, Var[∂νC] bounds the probability that the cost-function partial derivative deviates from its mean value (of zero) as \(\Pr \left(| {\partial }_{\nu }C| \ge c\right)\le {\rm{Var}}[{\partial }_{\nu }C]/{c}^{2}\) for all c > 0.

Main results

Here we present our main theorems and corollaries, with the proofs sketched in the Methods and detailed in the Supplementary Information. In addition, in the Methods section we provide some intuition behind our main results by analyzing a generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap between the warm-up example and our main theorems and also showcases the tools used to derive our main result.

The following theorem provides an upper bound on the variance of the partial derivative of a global cost function which can be expressed as the expectation value of an operator of the form

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i1}\otimes {\widehat{O}}_{i2}\otimes \cdots \otimes {\widehat{O}}_{i\xi }\ .$$

Specifically, we consider two cases of interest: (i) When N = 1 and each \({\widehat{O}}_{1k}\) is a non-trivial projector (\({\widehat{O}}_{1k}^{2}={\widehat{O}}_{1k}\ne {\mathbb{1}}\)) of rank rk acting on subsystem Sk, or (ii) When N is arbitrary and \({\widehat{O}}_{ik}\) is traceless with \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\le {2}^{m}\) (for example, when \({\widehat{O}}_{ik}{ = \bigotimes }_{j = 1}^{m}{\sigma }_{j}^{\mu }\) is a tensor product of Pauli operators \({\sigma }_{j}^{\mu }\in \{{{\mathbb{1}}}_{j},{\sigma }_{j}^{x},{\sigma }_{j}^{y},{\sigma }_{j}^{z}\}\), with at least one \({\sigma }_{j}^{\mu }\,\ne\, {\mathbb{1}}\)). Note that case (i) includes CG of (1) as a special case.

Theorem 1

Consider a trainable parameter θν in a block W of the ansatz in Fig. 3. Let Var[∂νC] be the variance of the partial derivative of a global cost function C (with O given by (10)) with respect to θν. If WA, WB of (7), and each block in V(θ) form a local 2-design, then Var[∂νC] is upper bounded by

$${\rm{Var}}[{\partial }_{\nu }C]\;\leqslant\; {F}_{n}(L,l)\ .$$
  1. (i)

    For N = 1 and when each \({\widehat{O}}_{1k}\) is a non-trivial projector, then defining \(R=\mathop{\prod }\nolimits_{k = 1}^{\xi }{r}_{k}^{2}\), we have

    $${F}_{n}(L,l)=\frac{{2}^{2m+(2m-1)(L-l)}}{({2}^{2m}-1)\cdot {3}^{\frac{n}{m}}\cdot {2}^{(2-\frac{3}{m})n}}{c}_{1}^{2}R\ .$$
  2. (ii)

    For arbitrary N and when each \({\widehat{O}}_{ik}\) satisfies \({\rm{Tr}}[{\widehat{O}}_{ik}]=0\) and \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}\), then

    $${F}_{n}(L,l)=\frac{{2}^{2m(L-l+1)+1}}{{3}^{\frac{2n}{m}}\cdot {2}^{\left(3-\frac{4}{m}\right)n}}\mathop{\sum }\limits_{i,j=1}^{N}{c}_{i}{c}_{j}\ .$$

From Theorem 1 we derive the following corollary.

Corollary 1

Consider the function Fn(L, l).

  1. (i)

    Let N = 1 and let each \({\widehat{O}}_{1k}\) be a non-trivial projector, as in case (i) of Theorem 1. If \({c}_{1}^{2}R\in {\mathcal{O}}({2}^{n})\) and if the number of layers \(L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then

    $${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left({2}^{-\left(1-\frac{1}{m}{\mathrm{log}\,}_{2}3\right)n}\right)\ ,$$

    which implies that Var[∂νC] is exponentially vanishing in n if m 2.

  2. (ii)

    Let N be arbitrary, and let each \({\widehat{O}}_{ik}\) satisfy \({\rm{Tr}}[{\widehat{O}}_{ik}]=0\) and \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}\), as in case (ii) of Theorem 1. If \(N\in {\mathcal{O}}({2}^{n})\), \({c}_{i}\in {\mathcal{O}}(1)\), and if the number of layers \(L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then

    $${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left(\frac{1}{{2}^{\left(1-\frac{1}{m}\right)n}}\right)\ ,$$

    which implies that Var[∂νC] is exponentially vanishing in n if m 2.

Let us now make several important remarks. First, note that part (i) of Corollary 1 includes as a particular example the cost function CG of (1). Second, part (ii) of this corollary also includes as particular examples operators with \(N\in {\mathcal{O}}(1)\), as well as \(N\in {\mathcal{O}}(\mathrm{poly}\,(n))\). Finally, we remark that Fn(L, l) becomes trivial when the number of layers L is Ω(poly(n)), however, as we discuss below, we can still find that Var[∂νCG] vanishes exponentially in this case.

Our second main theorem shows that barren plateaus can be avoided for shallow circuits by employing local cost functions. Here we consider m-local cost functions where each \({\widehat{O}}_{i}\) acts nontrivially on at most m qubits and (on these qubits) can be expressed as \({\widehat{O}}_{i}={\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\):

$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\ ,$$

where \({\widehat{O}}_{i}^{{\mu }_{i}}\) are operators acting on m/2 qubits which can be written as a tensor product of Pauli operators. Here, we assume the summation in Eq. (16) includes two possible cases as schematically shown in Fig. 3b: First, when \({\widehat{O}}_{i}^{{\mu }_{i}}\) (\({\widehat{O}}_{i}^{\mu ^{\prime} }\)) acts on the first (last) m/2 qubits of a given Sk, and second, when \({\widehat{O}}_{i}^{{\mu }_{i}}\) (\({\widehat{O}}_{i}^{\mu ^{\prime} }\)) acts on the last (first) m/2 qubits of a given Sk (Sk+1). This type of cost function includes any ultralocal cost function (i.e., where the \({\widehat{O}}_{i}\) are one-body) as in (2), and also VQE Hamiltonians with up to m/2 neighbor interactions. Then, the following theorem holds.

Theorem 2

Consider a trainable parameter θν in a block W of the ansatz in Fig. 3. Let Var[∂νC] be the variance of the partial derivative of an m-local cost function C (with O given by (16)) with respect to θν. WA, WB of (7), and each block in V(θ) form a local 2-design, then Var[∂νC] is lower bounded by

$${G}_{n}(L,l)\;\leqslant\; {\rm{Var}}[{\partial }_{\nu }C]\ ,$$


$${G}_{n}(L,l)= \frac{{2}^{m(l+1)-1}}{{({2}^{2m}-1)}^{2}{({2}^{m}+1)}^{L+l}}\\ \times \mathop{\sum}\limits_{i\in {i}_{{\mathcal{L}}}}\mathop{\sum}\limits _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\ ,$$

where \({i}_{{\mathcal{L}}}\) is the set of i indices whose associated operators \({\widehat{O}}_{i}\) act on qubits in the forward light-cone \({\mathcal{L}}\) of W, and \({k}_{{{\mathcal{L}}}_{\text{B}}}\) is the set of k indices whose associated subsystems Sk are in the backward light-cone \({{\mathcal{L}}}_{\text{B}}\) of W. Here we defined the function \(\epsilon (M)={D}_{\text{HS}}\left(M,{\rm{Tr}}(M){\mathbb{1}}/{d}_{M}\right)\) where DHS is the Hilbert–Schmidt distance and dM is the dimension of the matrix M. In addition, \({\rho }_{k,k^{\prime} }\) is the partial trace of the input state ρ down to the subsystems \({S}_{k}{S}_{k+1}...{S}_{k^{\prime} }\).

Let us make a few remarks. First, note that the \(\epsilon ({\widehat{O}}_{i})\) in the lower bound indicates that training V(θ) is easier when \({\widehat{O}}_{i}\) is far from the identity. Second, the presence of \(\epsilon ({\rho }_{k,k^{\prime} })\) in Gn(L, l) implies that we have no guarantee on the trainability of a parameter θν in W if ρ is maximally mixed on the qubits in the backwards light-cone.

From Theorem 2 we derive the following corollary for m-local cost functions, which guarantees the trainability of the ansatz for shallow circuits.

Corollary 2

Consider the function Fn(L, l). Let O be an operator of the form (16), as in Theorem 2. If at least one term \({c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\) in the sum in (18) vanishes no faster than Ω(1/poly(n)), and if the number of layers L is \({\mathcal{O}}(\mathrm{log}\,(n))\), then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{\mathrm{poly}\,(n)}\right)\ .$$

On the other hand, if at least one term \({c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\) in the sum in (18) vanishes no faster than \({{\Omega }}\left(1/{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}\right)\), and if the number of layers is \({\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then

$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}}\right)\ .$$

Hence, when L is \({\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\) there is a transition region where the lower bound vanishes faster than polynomially, but slower than exponentially.

We finally justify the assumption of each block being a local 2-design from the fact that shallow circuit depths lead to such local 2-designs. Namely, it has been shown that one-dimensional 2-designs have efficient quantum circuit descriptions, requiring \({\mathcal{O}}({m}^{2})\) gates to be exactly implemented38, or \({\mathcal{O}}(m)\) to be approximately implemented39,40. Hence, an L-layered ansatz in which each block forms a 2-design can be exactly implemented with a depth \(D\in {\mathcal{O}}({m}^{2}L)\), and approximately implemented with \(D\in {\mathcal{O}}(mL)\). For the case of two-dimensional connectivity, it has been shown that approximate 2-designs require a circuit depth of \({\mathcal{O}}(\sqrt{m})\) to be implemented40. Therefore, in this case the depth of the layered ansatz is \(D\in {\mathcal{O}}(\sqrt{m}L)\). The latter shows that increasing the dimensionality of the circuit reduces the circuit depth needed to make each block a 2-design.

Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is \({\mathcal{O}}(n)\)40. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of9 for both local and global cost functions.

Numerical simulations

As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders11,41,42,43,44. In particular, the pioneering VQA proposed in ref. 11 has received significant literature attention, due to its importance to quantum machine learning and quantum data compression. Let us briefly explain the algorithm in ref. 11.

Consider a bipartite quantum system AB composed of nA and nB qubits, respectively, and let \(\{{p}_{\mu },|{\psi }_{\mu }\rangle \}\) be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state \(|{\psi }_{\mu }\rangle\) with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ).

To quantify the degree of data compression, ref. 11 proposed a cost function of the form:

$$C_{\rm{G}}^{\prime} =1-{\rm{Tr}}[\left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|{\rho }_{\,\text{B}}^{\text{out}\,}]$$
$$={\rm{Tr}}[O_{\rm{G}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,$$

where \({\rho }_{\,\text{AB}}^{\text{in}\,}={\sum }_{\mu }{p}_{\mu }|{\psi }_{\mu }\rangle \ \langle {\psi }_{\mu }|\) is the ensemble-average input state, \({\rho }_{\,\text{B}}^{\text{out}\,}={\sum }_{\mu }{p}_{\mu }{{\rm{Tr}}}_{\text{A}}[|\psi ^{\prime} \rangle \ \langle \psi ^{\prime} |]\) is the ensemble-average trash state, and \(\left|\psi ^{\prime} \right\rangle =V({\boldsymbol{\theta }})|{\psi }_{\mu }\rangle\). Equation (22) makes it clear that \(C_{\rm{G}}^{\prime}\) has the form in (6), and \(O_{\rm{G}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-{{\mathbb{1}}}_{\text{A}}\otimes \left|{\boldsymbol{0}}\right\rangle \ \left\langle {\boldsymbol{0}}\right|\) is a global observable of the form in (10). Hence, according to Corollary 1, \(C_{\rm{G}}^{\prime}\) exhibits a barren plateau for large nB. (Specifically, Corollary 1 applies in this context when nA < nB). As a result, large-scale data compression, where one is interested in discarding large numbers of qubits, will not be possible with \(C_{\rm{G}}^{\prime}\).

To address this issue, we propose the following local cost function

$$C_{\rm{L}}^{\prime} =1-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\limits_{j=1}^{{n}_{\text{B}}}{\rm{Tr}}\left[\left(\left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\right){\rho }_{\,\text{B}}^{\text{out}\,}\right]$$
$$={\rm{Tr}}[O_{\rm{L}}^{\prime} V({\boldsymbol{\theta }}){\rho }_{\,\text{AB}}^{\text{in}\,}V{({\boldsymbol{\theta }})}^{\dagger }]\ ,\qquad\quad$$

where \(O_{\rm{L}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\nolimits_{j = 1}^{{n}_{\text{B}}}{{\mathbb{1}}}_{\text{A}}\otimes \left|0\right\rangle \ {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\), and \({{\mathbb{1}}}_{\overline{j}}\) is the identity on all qubits in B except the jth qubit. As shown in the Supplementary Note 9, \(C_{\rm{L}}^{\prime}\) satisfies \(C_{\rm{L}}^{\prime} \;\leqslant\; C_{\rm{G}}^{\prime} \;\leqslant\; {n}_{\text{B}}C_{\rm{L}}^{\prime}\), which implies that \(C_{\rm{L}}^{\prime}\) is faithful (vanishing under the same conditions as \(C_{\rm{G}}^{\prime}\)). Furthermore, note that \(O_{\rm{L}}^{\prime}\) has the form in (16). Hence Corollary 2 implies that \(C_{\rm{L}}^{\prime}\) does not exhibit a barren plateau for shallow ansatzes.

Here we simulate the autoencoder algorithm to solve a simple problem where nA = 1, and where the input state ensemble \(\{{p}_{\mu }, |{\psi }_{\mu } \rangle \}\) is given by

$$\left|{\psi }_{1}\right\rangle ={\left|0\right\rangle }_{\text{A}}\otimes {\left|0,0,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{1}=2/3\ ,$$
$$\left|{\psi }_{2}\right\rangle ={\left|1\right\rangle }_{\text{A}}\otimes {\left|1,1,0,\ldots ,0\right\rangle }_{\text{B}}\ ,\quad \,{\text{with}}\,\quad {p}_{2}=1/3\ .$$

In order to analyze the cost-function-dependent barren plateau phenomenon, the dimension of subsystem B is gradually increased as nB = 10, 15, …, 100.

Numerical results

In our heuristics, the gate sequence V(θ) is given by two layers of the ansatz in Fig. 4, so that the number of gates and parameters in V(θ) increases linearly with nB. Note that this ansatz is a simplified version of the ansatz in Fig. 3, as we can only generate unitaries with real coefficients. All parameters in V(θ) were randomly initialized and as detailed in the Methods section, we employ a gradient-free training algorithm that gradually increases the number of shots per cost-function evaluation.

Fig. 4: Alternating Layered Ansatz for V(θ) employed in our numerical simulations.
figure 4

Each layer is composed of control-Z gates acting on alternating pairs of neighboring qubits which are preceded and followed by single qubit rotations around the y-axis, \({R}_{y}({\theta }_{i})={e}^{-i{\theta }_{i}{\sigma }_{y}/2}\). Shown is the case of two layers, nA = 1, and nB = 10 qubits. The number of variational parameters and gates scales linearly with nB: for the case shown there are 71 gates and 51 parameters. While each block in this ansatz will not form an exact local 2-design, and hence does not fall under our theorems, one can still obtain a cost-function-dependent barren plateau.

Analysis of the n-dependence. Figure 5 shows representative results of our numerical implementations of the quantum autoencoder in ref. 11 obtained by training V(θ) with the global and local cost functions respectively given by (22) and (23). Specifically, while we train with finite sampling, in the figures we show the exact cost-function values versus the number of iterations. Here, the top (bottom) axis corresponds to the number of iterations performed while training with \(C_{\rm{G}}^{\prime}\) (\(C_{\rm{L}}^{\prime}\)). For nB = 10 and 15, Fig. 5 shows that we are able to train V(θ) for both cost functions. For nB = 20, the global cost function initially presents a plateau in which the optimizing algorithm is not able to determine a minimizing direction. However, as the number of shots per function evaluation increases, one can eventually minimize \(C_{\rm{G}}^{\prime}\). Such result indicates the presence of a barren plateau where the gradient takes small values which can only be detected when the number of shots becomes sufficiently large. In this particular example, one is able to start training at around 140 iterations.

Fig. 5: Cost versus number of iterations for the quantum autoencoder problem defined by Eqs. (25)–(26).
figure 5

In all cases we employed two layers of the ansatz shown in Fig. 4, and we set nA = 1, while increasing nB = 10, 15, …, 100. The top (bottom) axis corresponds to the global cost function \(C_{\rm{G}}^{\prime}\) of Eq. (22) (local cost function \(C_{\rm{L}}^{\prime}\) of (23)). As can be seen, \(C_{\rm{G}}^{\prime}\) can be trained up to nB = 20 qubits, while \(C_{\rm{L}}^{\prime}\) trained in all cases. These results indicate that global cost function presents a barren plateau even for a shallow depth ansatz, and this can be avoided by employing a local cost function.

When nB > 20 we are unable to train the global cost function, while always being able to train our proposed local cost function. Note that the number of iterations is different for \(C_{\rm{G}}^{\prime}\) and \(C_{\rm{L}}^{\prime}\), as for the global cost function case we reach the maximum number of shots in fewer iterations. These results indicate that the global cost function of (22) exhibits a barren plateau where the gradient of the cost function vanishes exponentially with the number of qubits, and which arises even for constant depth ansatzes. We remark that in principle one can always find a minimizing direction when training \(C_{\rm{G}}^{\prime}\), although this would require a number of shots that increases exponentially with nB. Moreover, one can see in Fig. 5 that randomly initializing the parameters always leads to \(C_{\rm{G}}^{\prime} \approx 1\) due to the narrow gorge phenomenon (see Proposition 1), i.e., where the probability of being near the global minimum vanishes exponentially with nB.

On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train \(C_{\rm{L}}^{\prime}\) for all considered values of nB. Moreover, as seen in Fig. 5, \(C_{\rm{L}}^{\prime}\) can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation).

Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the dependence on L when training \(C_{\rm{L}}^{\prime}\) for the autoencoder example with nA = 1 and nB = 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in L, although we remark that this behavior can saturate for very large L9.

Fig. 6: Local cost \(C_{\rm{L}}^{\prime}\) versus number of iterations for the quantum autoencoder problem in Eqs. (2526) with nB = 10.
figure 6

Each curve corresponds to a different number of layers L in the ansatz of Fig. 4 with L = 2,…, 20. Curves were averaged over 9 instances of the autoencoder. As the number of layers increases, the optimization becomes harder. Inset: Number of shots needed to reach cost values of \(C_{\rm{L}}^{\prime} =0.02\) and \(C_{\rm{L}}^{\prime} =0.05\) versus number of layers L. As L increases the number of shots needed to reach the indicated cost values appears to increase exponentially.

In summary, even though the ansatz employed in our heuristics is beyond the scope of our theorems, we still find cost-function-dependent barren plateaus, indicating that the cost-function dependent barren plateau phenomenon might be more general and go beyond our analytical results.


While scaling results have been obtained for classical neural networks45, very few such results exist for the trainability of parametrized quantum circuits, and more generally for quantum neural networks. Hence, rigorous scaling results are urgently needed for VQAs, which many researchers believe will provide the path to quantum advantage with near-term quantum computers. One of the few such results is the barren plateau theorem of ref. 9, which holds for VQAs with deep, hardware-efficient ansatzes.

In this work, we proved that the barren plateau phenomenon extends to VQAs with randomly initialized shallow Alternating Layered Ansatzes. The key to extending this phenomenon to shallow circuits was to consider the locality of the operator O that defines the cost function C. Theorem 1 presented a universal upper bound on the variance of the gradient for global cost functions, i.e., when O is a global operator. Corollary 1 stated the asymptotic scaling of this upper bound for shallow ansatzes as being exponentially decaying in n, indicating a barren plateau. Conversely, Theorem 2 presented a universal lower bound on the variance of the gradient for local cost functions, i.e., when O is a sum of local operators. Corollary 2 notes that for shallow ansatzes this lower bound decays polynomially in n. Taken together, these two results show that barren plateaus are cost-function-dependent, and they establish a connection between locality and trainability.

In the context of chemistry or materials science, our present work can inform researchers about which transformation to use when mapping a fermionic Hamiltonian to a spin Hamiltonian46, i.e., Jordan-Wigner versus Bravyi–Kitaev47. Namely, the Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost function. This fact was recently numerically confirmed48.

Moreover, the fact that Corollary 2 is valid for arbitrary input quantum states may be useful when constructing variational ansatzes. For example, one could propose a growing ansatz method where one appends \(\mathrm{log}\,(n)\) layers of the hardware-efficient ansatz to a previously trained (hence fixed) circuit. This could then lead to a layer-by-layer training strategy where the previously trained circuit can correspond to multiple layers of the same hardware-efficient ansatz.

We remark that our definition of a global operator (local operator) is one that is both non-local (local) and many body (few body). Therefore, the barren plateau phenomenon could be due to the many-bodiness of the operator rather than the non-locality of the operator; we leave the resolution of this question to future work. On the other hand, our Theorem 1 rules out the possibility that barren plateaus could be due to cardinality, i.e., the number of terms in O when decomposed as a sum of Pauli products49. Namely, case (ii) of this theorem implies barren plateaus for O of essentially arbitrary cardinality, and hence cardinality is not the key variable at work here.

We illustrated these ideas for two examples VQAs. In Fig. 2, we considered a simple state-preparation example, which allowed us to delve deeper into the cost landscape and uncover another phenomenon that we called a narrow gorge, stated precisely in Proposition 1. In Fig. 5, we studied the more important example of quantum autoencoders, which have generated significant interest in the quantum machine learning community. Our numerics showed the effects of barren plateaus: for more than 20 qubits we were unable to minimize the global cost function introduced in11. To address this, we introduced a local cost function for quantum autoencoders, which we were able to minimize for system sizes of up to 100 qubits.

There are several directions in which our results could be generalized in future work. Naturally, we hope to extend the narrow gorge phenomenon in Proposition 1 to more general VQAs. In addition, we hope in the future to unify our theorems 1 and 2 into a single result that bounds the variance as a function of a parameter that quantifies the locality of O. This would further solidify the connection between locality and trainability. Moreover, our numerics suggest that our theorems (which are stated for exact 2-designs) might be extendable in some form to ansatzes composed of simpler blocks, like approximate 2-designs39.

We emphasize that while our theorems are stated for a hardware-efficient ansatz and for costs that are of the form (6), it remains an interesting open question as to whether other ansatzes, cost function, and architectures exhibit similar scaling behavior as that stated in our theorems. For instance, we have recently shown50 that our results can be extended to a more general type of Quantum Neural Network called dissipative quantum neural networks51. Another potential example of interest could be the unitary-coupled cluster (UCC) ansatz in chemistry52, which is intended for use in the \({\mathcal{O}}(\mathrm{poly}\,(n))\) depth regime34. Therefore it is important to study the key mathematical features of an ansatz that might allow one to go from trainability for \({\mathcal{O}}(\mathrm{log}\,n)\) depth (which we guarantee here for local cost functions) to trainability for \({\mathcal{O}}(\mathrm{poly}\,n)\) depth.

Finally, we remark that some strategies have been developed to mitigate the effects of barren plateaus32,33,53,54. While these methods are promising and have been shown to work in certain cases, they are still heuristic methods with no provable guarantees that they can work in generic scenarios. Hence, we believe that more work needs to be done to better understand how to prevent, avoid, or mitigate the effects of barren plateaus.


In this section, we provide additional details for the results in the main text, as well as a sketch of the proofs for our main theorems. We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. More detailed proofs of our theorems are given in the Supplementary Information.

Variance of the cost function partial derivative

Let us first discuss the formulas we employed to compute Var[∂νC]. Let us first note that without loss of generality, any block Wkl(θkl) in the Alternating Layered Ansatz can be written as a product of ζkl independent gates from a gate alphabet \({\mathcal{A}}=\{{G}_{\mu }(\theta )\}\) as

$${W}_{kl}({{\boldsymbol{\theta }}}_{kl})={G}_{{\zeta }_{kl}}({\theta }_{kl}^{{\zeta }_{kl}})\ldots {G}_{\nu }({\theta }_{kl}^{\nu })\ldots {G}_{1}({\theta }_{kl}^{1})\ ,$$

where each \({\theta }_{kl}^{\nu }\) is a continuous parameter. Here, \({G}_{\nu }({\theta }_{kl}^{\nu })={R}_{\nu }({\theta }_{kl}^{\nu }){Q}_{\nu }\) where Qν is an unparametrized gate and \({R}_{\nu }({\theta }_{kl}^{\nu })={e}^{-i{\theta }_{kl}^{\nu }{\sigma }_{\nu }/2}\) with σν a Pauli operator. Note that WkL denotes a block in the last layer of V(θ).

For the proofs of our results, it is helpful to conceptually break up the ansatz as follows. Consider a block Wkl(θkl) in the lth layer of the ansatz. For simplicity, we henceforth use W to refer to a given Wkl(θkl). Let Sw denote the m-qubit subsystem that contains the qubits W acts on, and let \({S}_{\overline{w}}\) be the (n − m) subsystem on which W acts trivially. Similarly, let \({{\mathcal{H}}}_{w}\) and \({{\mathcal{H}}}_{\overline{w}}\) denote the Hilbert spaces associated with Sw and \({S}_{\overline{w}}\), respectively. Then, as shown in Fig. 3a, V(θ) can be expressed as

$$V({\boldsymbol{\theta }})={V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes W){V}_{{\rm{L}}}\ .$$

Here, \({{\mathbb{1}}}_{\overline{w}}\) is the identity on \({{\mathcal{H}}}_{\overline{w}}\), and VR contains the gates in the (forward) light-cone \({\mathcal{L}}\) of W, i.e., all gates with at least one input qubit causally connected to the output qubits of W. The latter allows us to define \({S}_{{\mathcal{L}}}\) as the subsystem of all qubits in \({\mathcal{L}}\).

Let us here recall that the Alternating Layered Ansatz can be implemented with either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We remark that the following results are valid for both cases as the light-cone structure will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. Hence, there is no need to refer to the connectivity dimension in what follows.

Let us now assume that θν is a parameter inside a given block W, we obtain from (6), (27), and (28)

$${\partial }_{\nu }C= \frac{i}{2}{\rm{Tr}}\left[({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{B}}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{B}\,}^{\dagger })\right.\\ \left.\times [{{\mathbb{1}}}_{\overline{w}}\otimes {\sigma }_{\nu },({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\,\text{A}\,}^{\dagger }){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}({{\mathbb{1}}}_{\overline{w}}\otimes {W}_{\text{A}})]\right]\ ,$$


$${W}_{\text{B}}=\mathop{\prod }\limits_{\mu =1}^{\nu -1}{G}_{\mu }({\theta }^{\mu })\ ,\quad \,{\text{and}}\,\quad {W}_{\text{A}}=\mathop{\prod }\limits_{\mu =\nu }^{\zeta }{G}_{\mu }({\theta }^{\mu })\ .$$

Finally, from (29) we can derive a general formula for the variance:

$${\rm{Var}}[{\partial }_{\nu }C]=\frac{{2}^{m-1}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{({2}^{2m}-1)}^{2}}\mathop{\sum} _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }}{\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}{\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}$$

which holds if WA and WB form independent 2-designs. Here, the summation runs over all bitstrings p, q, \({\boldsymbol{p}}^{\prime}\), \({\boldsymbol{q}}^{\prime}\) of length 2nm. In addition, we defined

$${{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }={\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }]}{{2}^{m}}\ ,$$
$${{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }={\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]-\frac{{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}]{\rm{Tr}}[{{{\Psi }}}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }]}{{2}^{m}}\ ,$$

where \({{\rm{Tr}}}_{\overline{w}}\) indicates the trace over subsystem \({S}_{\overline{w}}\), and Ωqp and Ψqp are operators on \({{\mathcal{H}}}_{w}\) defined as

$${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{\,\text{R}}^{\dagger }O{V}_{\text{R}}\right]\ ,$$
$${{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}={{\rm{Tr}}}_{\overline{w}}\left[(\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }\right]\ .$$

We derive Eq. (31) in the Supplementary Note 4.

Computing averages over V

Here we introduce the main tools employed to compute quantities of the form 〈…〉V. These tools are used throughout the proofs of our main results.

Let us first remark that if the blocks in V(θ) are independent, then any average over V can be computed by averaging over the individual blocks, i.e., \({\langle \ldots \rangle }_{V}={\langle \ldots \rangle }_{{W}_{11},\ldots ,{W}_{kl},\ldots }={\langle \ldots \rangle }_{{V}_{{\rm{L}}},W,{V}_{\text{R}}}\). For simplicity let us first consider the expectation value over a single block W in the ansatz. In principle 〈…〉W can be approximated by varying the parameters in W and sampling over the resulting 2m × 2m unitaries. However, if W forms a t-design, this procedure can be simplified as it is known that sampling over its distribution yields the same properties as sampling random unitaries from the unitary group with respect to the unique normalized Haar measure.

Explicitly, the Haar measure is a uniquely defined left and right-invariant measure over the unitary group dμ(W), such that for any unitary matrix AU(2m) and for any function f(W) we have

$${\int}_{U({2}^{m})}d\mu (W)f(W) =\int d\mu (W)f(AW)\\ =\int d\mu (W)f(WA)\ ,$$

where the integration domain is assumed to be U(2m) throughout this work. Consider a finite set \({\{{W}_{y}\}}_{y\in Y}\) (of size Y) of unitaries Wy, and let P(t, t)(W) be an arbitrary polynomial of degree at most t in the matrix elements of W and at most t in those of W. Then, this finite set is a t-design if38

$${\langle {P}_{(t,t)}(W)\rangle }_{w} =\frac{1}{| Y| }\cdot \mathop{\sum}\limits _{y\in Y}{P}_{(t,t)}({W}_{y})\\ =\int d\mu (W){P}_{(t,t)}(W)\ .$$

From the general form of C in Eq. (6) we can see the cost function is a polynomial of degree at most 2 in the matrix elements of each block Wkl in V(θ), and at most 2 in those of \({({W}_{kl})}^{\dagger }\). Then, if a given block W forms a 2-design, one can employ the following elementwise formula of the Weingarten calculus55,56 to explicitly evaluate averages over W up to the second moment:

$$\begin{array}{ccc}\int d\mu (W){w}_{{\boldsymbol{i}}{\boldsymbol{j}}}{w}_{{\boldsymbol{i}}^{\prime} {\boldsymbol{j}}^{\prime} }^{* }&=&\frac{{\delta }_{{\boldsymbol{i}}{\boldsymbol{i}}^{\prime} }{\delta }_{{\boldsymbol{j}}{\boldsymbol{j}}^{\prime} }}{{2}^{m}}\\ \int d\mu (W){w}_{{{\boldsymbol{i}}}_{1}{{\boldsymbol{j}}}_{1}}{w}_{{{\boldsymbol{i}}}_{2}{{\boldsymbol{j}}}_{2}}{w}_{{\boldsymbol{i}}_1^{\prime} {\boldsymbol{j}}_2^{\prime} }^{* }{w}_{{\boldsymbol{i}}_2^{\prime} {\boldsymbol{j}}_2^{\prime} }^{* }&=&\frac{1}{{2}^{2m}-1}\left({{{\Delta }}}_{1}-\frac{{{{\Delta }}}_{2}}{{2}^{m}}\right)\end{array}$$

where wij are the matrix elements of W, and

$${{{\Delta }}}_{1} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }\ ,\\ {{{\Delta }}}_{2} ={\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_2^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_1^{\prime} }+{\delta }_{{{\boldsymbol{i}}}_{1}{\boldsymbol{i}}_2^{\prime} }{\delta }_{{{\boldsymbol{i}}}_{2}{\boldsymbol{i}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{1}{\boldsymbol{j}}_1^{\prime} }{\delta }_{{{\boldsymbol{j}}}_{2}{\boldsymbol{j}}_2^{\prime} }\ .$$

Intuition behind the main results

The goal of this section is to provide some intuition for our main results. Specifically, we show here how the scaling of the cost function variance can be related to the number of blocks we have to integrate to compute \({\langle \cdots \rangle }_{{V}_{\text{R}},{V}_{{\rm{L}}}}\), the locality of the cost functions, and with the number of layers in the ansatz.

First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order 1/22m. Hence, we see that the more blocks one integrates over, the worse the scaling can be.

We now generalize the warm-up example. Let V(θ) be a single layer of the alternating ansatz of Fig. 3, i.e., V(θ) is a tensor product of m-qubit blocks Wk: = Wk1, with k = 1, …, ξ (and with ξ = n/m), so that θν is in the block \({W}_{k^{\prime} }\). In the Supplementary Note 5 we generalize this scenario to the when the input state is not \(\left|{\boldsymbol{0}}\right\rangle\), but instead is an arbitrary state ρ.

From (31), the partial derivative of the global cost function in (1) can be expressed as

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{G}}}]=\upsilon \ \prod _{k\ne k^{\prime} }\ {\left\langle {\rm{Tr}}{\left[\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{k}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{k}}$$

where \(\upsilon =\frac{{({2}^{m}-1)}^{2}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{2}^{2m}{({2}^{m+1}-1)}^{2}}\). From (40) we have that in order to compute (40) one needs to integrate over ξ − 1 blocks. Then, since each integration leads to a coefficient 1/22m the variance will scale as \({\mathcal{O}}{\left(\right.1/({2}^{2m})}^{\xi -1}={\mathcal{O}}(1/{2}^{2n})\). Hence, the scaling of the variance gets worse for each block we integrate (such that the block acts on qubits we are measuring).

On the other hand, for a local cost let us consider a single term in (3) where \(j\in {S}_{\tilde{k}}\), so that

$${\rm{Var}}[{\partial }_{\nu }{C}_{{\rm{L}}}]\ \propto \ \frac{\upsilon }{{n}^{2}}\ {\left\langle {\rm{Tr}}{\left[(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}){W}_{\tilde{k}}\left|0\right\rangle {\left\langle 0\right|}^{\otimes m}{W}_{\tilde{k}}^{\dagger }\right]}^{2}\right\rangle }_{{W}_{\tilde{k}}}\ .$$

Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance scales as \({\mathcal{O}}(1/{n}^{2})\), where we remark that the scaling is essentially given by the prefactor 1/n2 in (3).

Let us now briefly provide some intuition as to why the scaling of local cost gradients becomes exponentially vanishing with the number of layers as in Theorem 2. Consider the case when V(θ) contains L layers of the ansatz in Fig. 3. Moreover, as shown in Fig. 7, let W be in the first layer, and let Oi act on the m topmost qubits of \({\mathcal{L}}\). As schematically depicted in Fig. 7, we now have to integrating over L − 1 blocks. Then, as proved in the Supplementary Note 5, integrating over a block leads to a coefficient 2m/2/(2m + 1). Hence, after integrating L − 1 times, we obtain a coefficient \({2}^{m(L-1)/2}/{({2}^{m}+1)}^{L-1}\) which vanishes no faster than \({{\Omega }}\left(1/\mathrm{poly}\,(n)\right)\) if \(mL\in {\mathcal{O}}(\mathrm{log}\,(n))\).

Fig. 7: The block W is in the first layer of V(θ), and the operator Oi acts on the topmost m qubits in the forward light-cone \({\mathcal{L}}\) of W.
figure 7

Dashed thick lines indicate the backward light-cone of Oi. All but L − 1 blocks simplify to identity in Ωqp of Eq. (34).

As we discuss below, for more general scenarios the computation of Var[∂νC] becomes more complex.

Sketch of the proof of the main theorems

Here we present a sketch of the proof of Theorems 1 and 2. We refer the reader to the Supplementary Information for a detailed version of the proofs.

As mentioned in the previous subsection, if each block in V(θ) forms a local 2-design, then we can explicitly calculate expectation values 〈…〉W via (38). Hence, to compute \({\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{\text{R}}}\), and \({\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}\) in (31), one needs to algorithmically integrate over each block using the Weingarten calculus. In order to make such computation tractable, we employ the tensor network representation of quantum circuits.

For the sake of clarity, we recall that any two-qubit gate can be expressed as \(U={\sum }_{ijkl}{U}_{ijkl}\left|ij\right\rangle \left\langle kl\right|\), where Uijkl is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the ansatz can be considered as a \({2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\) tensor. As schematically shown in Fig. 8a, one can use the circuit description of \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) and Ψpq to derive the tensor network representation of terms such as \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\). Here, \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) is obtained from (34) by simply replacing O with Oi.

Fig. 8: Tensor-network representations of the terms relevant to Var[∂νC].
figure 8

a Representation of \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) of Eq. (34) (left), where the superscript indicates that O is replaced by Oi. In this illustration, we show the case of n = 2m qubits, and we denote \({P}_{{\boldsymbol{p}}{\boldsymbol{q}}}=\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\). We also show the representation of \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\) (middle) and of \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\) (right). b By means of the Weingarten calculus we can algorithmically integrate over each block in the ansatz. After each integration one obtains four new tensors according to Eq. (38). Here we show the tensor obtained after the integrations \(\int d\mu (W){\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\) and \(\int d\mu (W){\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\), which are needed to compute \({\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}\) as in Eq. (31). c As shown in the Supplementary Note 1, the result of the integration is a sum of the form (49), where the deltas over p, q, \({\boldsymbol{p}}^{\prime}\), and \({\boldsymbol{q}}^{\prime}\) arise from the contractions between Ppq and \({P}_{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\).

In Fig. 8b we depict an example where we employ the tensor network representation of \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) to compute the average of \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\), and \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\). As expected, after each integration one obtains a sum of four new tensor networks according to Eq. (38).

Proof of Theorem 2

Let us first consider an m-local cost function C where O is given by (16), and where \({\widehat{O}}_{i}\) acts nontrivially in a given subsystem Sk of \({\mathcal{S}}\). In particular, when \({\widehat{O}}_{i}\) is of this form the proof is simplified, although the more general proof is presented in the Supplementary Note 6. If \({S}_{k}\not\subset {S}_{{\mathcal{L}}}\) we find \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\propto {{\mathbb{1}}}_{w}\), and hence

$${\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]}{{2}^{m}}=0\ .$$

The latter implies that we only have to consider the operators \({\widehat{O}}_{i}\) which act on qubits inside of the forward light-cone \({\mathcal{L}}\) of W.

Then, as shown in the Supplementary Information

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{i}]}{{2}^{m}}\right\rangle }_{{V}_{\text{R}}}\propto \epsilon ({\widehat{O}}_{i})\ .$$

Here we remark that the proportionality factor contains terms of the form \({\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}\) (where \({S}_{\overline{w}}^{+}\cup {S}_{\overline{w}}^{-}={S}_{\overline{w}}\)), which arises from the different tensor contractions of \({P}_{{\boldsymbol{p}}{\boldsymbol{q}}}=\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\) in Fig. 8c. It is then straightforward to show that

$$\mathop{\sum}\limits _{{{\boldsymbol{p}}{\boldsymbol{q}}}\atop {{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }} {\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}{\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}\\ ={\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}\ ,$$

where we define \({\tilde{\rho }}^{-}\) as the reduced states of \(\tilde{\rho }={V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }\) in the Hilbert spaces associated with subsystems \({S}_{w}\cup {S}_{\overline{w}}^{-}\). Here we recall that DHS is the Hilbert–Schmidt distance \({D}_{HS}\left(\rho ,\sigma \right)={\rm{Tr}}[{(\rho -\sigma )}^{2}]\).

By employing properties of DHS one can show (see Supplementary Note 6)

$${D}_{\text{HS}}\left({\tilde{\rho }}^{-},{{\rm{Tr}}}_{w}[{\tilde{\rho }}^{-}]\otimes \frac{{\mathbb{1}}}{{2}^{m}}\right)\ge \frac{{D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)}{{2}^{m(L-l+2)/2}}\ ,$$

where \({\tilde{\rho }}_{w}={{\rm{Tr}}}_{{S}_{\overline{w}}^{-}}[{\tilde{\rho }}^{-}]\). We can then leverage the tensor network representation of quantum circuits to algorithmically integrate over each block in VL and compute \({\langle {D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\rangle }_{{V}_{{\rm{L}}}}\). One finds

$${\left\langle {D}_{\text{HS}}\left({\tilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\right\rangle }_{{V}_{{\rm{L}}}}=\mathop{\sum} _{{(k,k^{\prime} )\in {k}_{{{\mathcal{L}}}_{\text{B}}}}\atop {k^{\prime} \geqslant k}}{t}_{k,k^{\prime} }\epsilon ({\rho }_{k,k^{\prime} })\ ,$$

with \({t}_{k,k^{\prime} }\geqslant \frac{{2}^{ml}}{{({2}^{m}+1)}^{2l}}\)\(\forall k,k^{\prime}\), and \(\epsilon ({\rho }_{k,k^{\prime} })\) defined in Theorem 2. Combining these results leads to Theorem 2. Moreover, as detailed in the Supplementary information, Theorem 2 is also valid when O is of the form (16).

Proof of Theorem 1

Let us now provide a sketch of the proof of Theorem 1, case (i). Here we denote for simplicity \({\widehat{O}}_{k}:={\widehat{O}}_{1k}\). We leave the proof of case (ii) for the Supplementary Note 7. In this case there are now operators Oi which act outside of the forward light-cone \({\mathcal{L}}\) of W. Hence, it is convenient to include in VR not only all the gates in \({\mathcal{L}}\) but also all the blocks in the final layer of V(θ) (i.e., all blocks WkL, with k = 1, …ξ). We can define \({S}_{\overline{{\mathcal{L}}}}\) as the compliment of \({S}_{{\mathcal{L}}}\), i.e., as the subsystem of all qubits which are not in \({\mathcal{L}}\) (with associated Hilbert-space \({{\mathcal{H}}}_{\overline{{\mathcal{L}}}}\)). Then, we have \({V}_{\text{R}}={V}_{{\mathcal{L}}}\otimes {V}_{\overline{{\mathcal{L}}}}\) and \(\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|=\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}\otimes \left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}\), where we define \({V}_{\overline{{\mathcal{L}}}}: = {\bigotimes }_{k\in {k}_{\overline{{\mathcal{L}}}}}{W}_{kL}\), \(\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}:{ = \bigotimes }_{k\in {k}_{{\mathcal{L}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k}\), and \(\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}:{ = \bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k^{\prime} }\). Here, we define \({k}_{{\mathcal{L}}}:=\{k:{S}_{k}\subseteq {S}_{{\mathcal{L}}}\}\) and \({k}_{\overline{{\mathcal{L}}}}:=\{k:{S}_{k}\subseteq {S}_{\overline{{\mathcal{L}}}}\}\), which are the set of indices whose associated qubits are inside and outside \({\mathcal{L}}\), respectively. We also write \(O={c}_{0}{\mathbb{1}}+{c}_{1}{\hat{O}}_{{\mathcal{L}}}\otimes {\hat{O}}_{\overline{{\mathcal{L}}}}\), where we define \({\hat{O}}_{{\mathcal{L}}}:{ = \bigotimes }_{k\in {k}_{{\mathcal{L}}}}{\widehat{O}}_{k}\) and \({\hat{O}}_{\overline{{\mathcal{L}}}}:{ = \bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}{\widehat{O}}_{k^{\prime} }\).

Using the fact that the blocks in V(θ) are independent we can now compute \({\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}={\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\overline{{\mathcal{L}}}},{V}_{{\mathcal{L}}}}\). Then, from the definition of Ωpq in Eq. (34) and the fact that one can always express

$${\left\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\right\rangle }_{{V}_{\text{R}}}= \, {\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}\\ \times \left(\prod\limits_{k\in {k}_{\overline{{\mathcal{L}}}}}{\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\right)\ ,$$


$${{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}} ={{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}\left[(\left|{\boldsymbol{p}}\right\rangle \left\langle {\boldsymbol{q}}\right|\otimes {{\mathbb{1}}}_{w}){V}_{{\mathcal{L}}}{O}^{{\mathcal{L}}}{V}_{{\mathcal{L}}}\right]\\ {{{\Omega }}}_{k} ={\rm{Tr}}\left[\left|{\boldsymbol{p}}\right\rangle {\left\langle {\boldsymbol{q}}\right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]{\rm{Tr}}\left[\left|{\boldsymbol{p}}^{\prime} \right\rangle {\left\langle {\boldsymbol{q}}^{\prime} \right|}_{k}{W}_{kL}^{\dagger }{\widehat{O}}_{k}{W}_{kL}\right]$$

and where \({{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}\) indicates the partial trace over the Hilbert-space associated with the qubits in \({S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}\). As detailed in the Supplementary Information we can use Eq. (38) to show that

$${\left\langle {{{\Omega }}}_{k}\right\rangle }_{{W}_{kL}}\leqslant \frac{{r}_{k}^{2}\left({\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}+{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{k}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{k}}}\right)}{{2}^{2m}-1}\ .$$

On the other hand, as shown in the Supplementary Note 7 (and as schematically depicted in Fig. 8c), when computing the expectation value \({\langle \ldots \rangle }_{{V}_{{\mathcal{L}}}}\) in (47), one obtains

$${\left\langle {\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]-\frac{{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\mathcal{L}}}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{{\mathcal{L}}}]}{{2}^{m}}\right\rangle }_{{V}_{{\mathcal{L}}}}=\mathop{\sum} _{\tau }{t}_{\tau }^{ij}{{\Delta }}{O}_{\tau }^{{\mathcal{L}}}{\delta }_{\tau }\ ,$$

where we defined \({\delta }_{\tau }={\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\tau }}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\tau }}}\), \({t}_{\tau }\in {\mathbb{R}}\), \({S}_{\tau }\cup {S}_{\overline{\tau }}={S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}\) (with \({S}_{\tau }\,\ne\, {{\emptyset}}\)), and

$${{\Delta }}{O}_{\tau }^{{\mathcal{L}}}= {{\rm{Tr}}}_{{x}_{\tau }{y}_{\tau }}\left[{{\rm{Tr}}}_{{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{z}_{\tau }}[{O}_{j}]\right]\\ -\frac{{{\rm{Tr}}}_{{x}_{\tau }}\left[{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}\left[{O}_{i}\right]{{\rm{Tr}}}_{{y}_{\tau }{z}_{\tau }}[{O}_{j}]\right]}{{2}^{m}}\ .$$

Here we use the notation \({{\rm{Tr}}}_{{x}_{\tau }}\) to indicate the trace over the Hilbert space associated with subsystem \({S}_{{x}_{\tau }}\), such that \({S}_{{x}_{\tau }}\cup {S}_{{y}_{\tau }}\cup {S}_{{z}_{\tau }}={S}_{{\mathcal{L}}}\). As shown in the Supplementary Note 7, combining the deltas in Eqs. (48), and (49) with \({\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}\) leads to Hilbert–Schmidt distances between two quantum states as in (44). One can then use the following bounds \({D}_{\text{HS}}\left({\rho }_{1},{\rho }_{2}\right)\le 2\), \({{\Delta }}{O}_{\tau }^{{\mathcal{L}}}\le {\prod }_{k\in {k}_{{\mathcal{L}}}}{r}_{k}^{2}\), and ∑τtτ ≤ 2, along with some additional simple algebra explained in the Supplementary Information to obtain the upper bound in Theorem 1.

Ansatz and optimization method

Here we describe the gradient-free optimization method used in our heuristics. First, we note that all the parameters in the ansatz are randomly initialized. Then, at each iteration, one solves the following sub-space search problem: \(\mathop{\min }\limits_{{\boldsymbol{s}}\in {{\mathbb{R}}}^{d}}C({\boldsymbol{\theta }}+{\boldsymbol{A}}\cdot {\boldsymbol{s}})\), where A is a randomly generated isometry, and s = (s1, …, sd) is a vector of coefficients to be optimized over. We used d = 10 in our simulations. Moreover, the training algorithm gradually increases the number of shots per cost-function evaluation. Initially, C is evaluated with 10 shots, and once the optimization reaches a plateau, the number of shots is increased by a factor of 3/2. This process is repeated until a termination condition on the value of C is achieved, or until we reach the maximum value of 105 shots per function evaluation. While this is a simple variable-shot approach, we remark that a more advanced variable-shot optimizer can be found in ref. 57.

Finally, let us remark that while we employ a sub-space search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., number of shots) that grows exponentially with n. The latter is due to the fact that an exponentially vanishing gradient implies that on average the cost function landscape will essentially be flat, with the slope of the order of \({\mathcal{O}}(1/{2}^{n})\). Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost minimization direction with gradient-based, or even with black-box optimizers such as the Nelder–Mead method58,59,60,61.