Abstract
Variational quantum algorithms (VQAs) optimize the parameters θ of a parametrized quantum circuit V(θ) to minimize a cost function C. While VQAs may enable practical applications of noisy quantum computers, they are nevertheless heuristic methods with unproven scaling. Here, we rigorously prove two results, assuming V(θ) is an alternating layered ansatz composed of blocks forming local 2-designs. Our first result states that defining C in terms of global observables leads to exponentially vanishing gradients (i.e., barren plateaus) even when V(θ) is shallow. Hence, several VQAs in the literature must revise their proposed costs. On the other hand, our second result states that defining C with local observables leads to at worst a polynomially vanishing gradient, so long as the depth of V(θ) is \({\mathcal{O}}(\mathrm{log}\,n)\). Our results establish a connection between locality and trainability. We illustrate these ideas with large-scale simulations, up to 100 qubits, of a quantum autoencoder implementation.
Introduction
One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications^{1}. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.
The leading strategy to make use of these devices is variational quantum algorithms (VQAs)^{2}. VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are threefold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.
There are very few rigorous scaling results for VQAs (with the exception of one-layer approximate optimization^{3,4,5}). Ideally, in order to reduce the gate overhead that arises when implementing on quantum hardware, one would like to employ a hardware-efficient ansatz^{6} for V(θ). As recent large-scale implementations for chemistry^{7} and optimization^{8} applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients^{9}. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.
The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑_{α}c_{α}σ_{α}, with H expanded as a sum of local Pauli operators^{10}. In VQE, the cost function is obviously the energy \(C=\left\langle \psi \right|H\left|\psi \right\rangle\) of the trial state \(\left|\psi \right\rangle\). However, VQAs have been proposed for other applications, like quantum data compression^{11}, quantum error correction^{12}, quantum metrology^{13}, quantum compiling^{14,15,16,17}, quantum state diagonalization^{18,19}, quantum simulation^{20,21,22,23}, fidelity estimation^{24}, unsampling^{25}, consistent histories^{26}, and linear systems^{27,28,29}. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.
We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as \({\mathcal{O}}(\mathrm{poly}\,(n))\). In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling^{30}, and this scaling can also apply to derivative-free optimization^{31}. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term “trainable” for polynomially vanishing gradients.
In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. ^{11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29}, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state \(\left|{\psi }_{0}\right\rangle\). A natural cost function is the square of the trace distance D_{T} between \(\left|{\psi }_{0}\right\rangle\) and \(\left|\psi \right\rangle =V{({\boldsymbol{\theta }})}^{\dagger }\left|{\boldsymbol{0}}\right\rangle\), given by \({C}_{{\rm{G}}}={D}_{\text{T}}{(\left|{\psi }_{0}\right\rangle ,\left|\psi \right\rangle )}^{2}\), which is equivalent to
$${C}_{{\rm{G}}}={\rm{Tr}}\left[{O}_{{\rm{G}}}V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }\right]\ ,$$(1)
with \({O}_{{\rm{G}}}={\mathbb{1}}-\left|{\boldsymbol{0}}\right\rangle \left\langle {\boldsymbol{0}}\right|\). Note that \(\sqrt{{C}_{{\rm{G}}}}\ge |\left\langle \psi \right|M\left|\psi \right\rangle -\left\langle {\psi }_{0}\right|M\left|{\psi }_{0}\right\rangle |\) has a nice operational meaning as a bound on the expectation value difference for a POVM element M.
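The equivalence between the squared trace distance and the expectation value of \({O}_{{\rm{G}}}\) can be checked numerically on a small system. The following is a minimal sketch, not part of the original analysis: the helper names (`random_state`, `random_unitary`, `trace_distance`, `global_cost`) are hypothetical, and the random unitary is drawn from a plain QR decomposition purely for illustration.

```python
import numpy as np

def random_state(n_qubits, rng):
    """Return a normalized random complex state vector on n_qubits qubits."""
    dim = 2 ** n_qubits
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def random_unitary(dim, rng):
    """Return some unitary matrix via QR (illustrative, not exactly Haar)."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

def trace_distance(psi, phi):
    """Trace distance (1/2)||rho - sigma||_1 between two pure states."""
    diff = np.outer(psi, psi.conj()) - np.outer(phi, phi.conj())
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(diff)))

def global_cost(V, psi0):
    """C_G = Tr[O_G V |psi0><psi0| V^dagger] with O_G = 1 - |0...0><0...0|."""
    dim = len(psi0)
    O_G = np.eye(dim)
    O_G[0, 0] -= 1.0  # subtract the projector onto |0...0>
    out = V @ psi0
    return float(np.real(out.conj() @ O_G @ out))

rng = np.random.default_rng(0)
n = 3
psi0 = random_state(n, rng)
V = random_unitary(2 ** n, rng)

zero = np.zeros(2 ** n, dtype=complex)
zero[0] = 1.0
psi = V.conj().T @ zero  # |psi> = V^dagger |0>

cost_from_distance = trace_distance(psi0, psi) ** 2
cost_from_trace = global_cost(V, psi0)
print(cost_from_distance, cost_from_trace)
```

Both quantities reduce to one minus the squared overlap between the two pure states, so the two printed values should agree up to rounding.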
However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., \(\left|\psi \right\rangle\) and \(\left|{\psi }_{0}\right\rangle\)). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. ^{11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29}. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.
Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.^{9}, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies^{32,33} or clever PQC ansatzes^{4,34,35}. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.
While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group^{14,18}, and more recently in ref. ^{29}. This motivated the proposal of local cost functions^{14,16,18,22,25,26,27}, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning.
Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth \({\mathcal{O}}(\mathrm{log}\,n)\). Figure 1 summarizes our two main results.
Finally, we illustrate our main results for an important example: quantum autoencoders^{11}. Our large-scale numerics show that the global cost function proposed in ref. ^{11} has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.
Results
Warm-up example
To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state being \(\left|{\boldsymbol{0}}\right\rangle\). We assume a tensor-product ansatz of the form \(V({\boldsymbol{\theta }})={\bigotimes }_{j=1}^{n}{e}^{-i{\theta }^{j}{\sigma }_{x}^{(j)}/2}\), with the goal of finding the angles θ^{j} such that \(V({\boldsymbol{\theta }})\left|{\boldsymbol{0}}\right\rangle =\left|{\boldsymbol{0}}\right\rangle\). Employing the global cost of (1) results in \({C}_{{\rm{G}}}=1-\mathop{\prod }\nolimits_{j=1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}\). The barren plateau can be detected via the variance of its gradient: \({\rm{Var}}[\frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}]=\frac{1}{8}{(\frac{3}{8})}^{n-1}\), which is exponentially vanishing in n. Since the mean value is \(\left\langle \frac{\partial {C}_{{\rm{G}}}}{\partial {\theta }^{j}}\right\rangle =0\), the gradient concentrates exponentially around zero.
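On the other hand, consider a local cost function:
$${C}_{{\rm{L}}}={\rm{Tr}}\left[{O}_{{\rm{L}}}V({\boldsymbol{\theta }})\left|{\psi }_{0}\right\rangle \left\langle {\psi }_{0}\right|V{({\boldsymbol{\theta }})}^{\dagger }\right]\ ,$$(2)
$${O}_{{\rm{L}}}={\mathbb{1}}-\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\ ,$$(3)
where \({{\mathbb{1}}}_{\overline{j}}\) is the identity on all qubits except qubit j. Note that C_{L} vanishes under the same conditions as C_{G}^{14,16}, C_{L} = 0 ⇔ C_{G} = 0. We find \({C}_{{\rm{L}}}=1-\frac{1}{n}\mathop{\sum }\nolimits_{j=1}^{n}{\cos }^{2}\frac{{\theta }^{j}}{2}\), and the variance of its gradient is \({\rm{Var}}[\frac{\partial {C}_{{\rm{L}}}}{\partial {\theta }^{j}}]=\frac{1}{8{n}^{2}}\), which vanishes polynomially with n and hence exhibits no barren plateau. Figure 2 depicts the cost landscapes of C_{G} and C_{L} for two values of n and shows that the barren plateau can be avoided here via a local cost function.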
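The two variances quoted for the warm-up example can be reproduced from the closed-form costs. The sketch below is illustrative only: it assumes the partial derivatives \(\frac{1}{2}\sin {\theta }^{j}\prod_{k\ne j}{\cos }^{2}\frac{{\theta }^{k}}{2}\) for the global cost and \(\frac{1}{2n}\sin {\theta }^{j}\) for the local cost (obtained by differentiating the expressions above), and the function names are hypothetical.

```python
import numpy as np

def uniform_average(f, num_points=64):
    """Average f(theta) over theta uniform on [-pi, pi) with an equally spaced grid.

    For the low-degree trigonometric polynomials used here, this grid
    average is exact up to floating-point rounding.
    """
    theta = np.linspace(-np.pi, np.pi, num_points, endpoint=False)
    return float(np.mean(f(theta)))

def var_grad_global(n):
    """Variance of dC_G/dtheta^j for the tensor-product ansatz on n qubits."""
    # Factor from the differentiated qubit: E[(sin(theta)/2)^2].
    own = uniform_average(lambda t: (0.5 * np.sin(t)) ** 2)
    # Factor from each of the other n - 1 qubits: E[cos^4(theta/2)].
    other = uniform_average(lambda t: np.cos(t / 2) ** 4)
    return own * other ** (n - 1)

def var_grad_local(n):
    """Variance of dC_L/dtheta^j, where dC_L/dtheta^j = sin(theta^j) / (2n)."""
    return uniform_average(lambda t: (0.5 * np.sin(t) / n) ** 2)

for n in (4, 16, 64):
    print(n, var_grad_global(n), var_grad_local(n))
```

Since both derivatives have zero mean, the second moment equals the variance; the global value shrinks by a factor 3/8 per added qubit, while the local value only shrinks as 1/n².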
Moreover, this example allows us to delve deeper into the cost landscape to see a phenomenon that we refer to as a narrow gorge. While a barren plateau is associated with a flat landscape, a narrow gorge refers to the steepness of the valley that contains the global minimum. This phenomenon is illustrated in Fig. 2, where each dot corresponds to cost values obtained from randomly selected parameters θ. For C_{G} we see that very few dots fall inside the narrow gorge, while for C_{L} the narrow gorge is not present. Note that the narrow gorge makes it harder to train C_{G} since the learning rate of descent-based optimization algorithms must be exponentially small in order not to overstep the narrow gorge. The following proposition (proved in the Supplementary Note 2) formalizes the narrow gorge for C_{G} and its absence for C_{L} by characterizing the dependence on n of the probability that C ⩽ δ. This probability is associated with the parameter space volume that leads to C ⩽ δ.
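The narrow gorge can also be probed by direct sampling. As a rough sketch (the function name `gorge_fraction`, the sample size, and the seed are arbitrary choices made for illustration), one can draw the angles uniformly at random and record how often each warm-up cost falls below a threshold δ:

```python
import numpy as np

def gorge_fraction(n, delta, num_samples=20000, seed=0):
    """Estimate Pr(C_G <= delta) and Pr(C_L <= delta) for the warm-up example.

    Angles are sampled uniformly on [-pi, pi]; the costs use the closed
    forms C_G = 1 - prod_j cos^2(theta_j/2) and
    C_L = 1 - (1/n) sum_j cos^2(theta_j/2).
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi, np.pi, size=(num_samples, n))
    c2 = np.cos(theta / 2) ** 2
    cost_global = 1.0 - np.prod(c2, axis=1)
    cost_local = 1.0 - np.mean(c2, axis=1)
    frac_global = float(np.mean(cost_global <= delta))
    frac_local = float(np.mean(cost_local <= delta))
    return frac_global, frac_local

for n in (2, 5, 10, 20):
    print(n, gorge_fraction(n, delta=0.6))
```

For growing n, one would expect the first fraction to collapse towards zero while the second stays of order one, which is the sampling counterpart of the scatter plots described for Fig. 2.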
Proposition 1
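Let θ^{j} be uniformly distributed on [−π, π] ∀j. For any δ ∈ (0, 1), the probability that C_{G} ≤ δ satisfies
$$\Pr \{{C}_{{\rm{G}}}\leqslant \delta \}\leqslant {(1-\delta )}^{-1}{\left(\frac{1}{2}\right)}^{n}\ .$$(4)
For any \(\delta \in [\frac{1}{2},1]\), the probability that C_{L} ≤ δ satisfies
$$\Pr \{{C}_{{\rm{L}}}\leqslant \delta \}\geqslant \frac{{(2\delta -1)}^{2}}{\frac{1}{2n}+{(2\delta -1)}^{2}}\mathop{\longrightarrow }\limits_{n\to \infty }1\ .$$(5)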
General framework
For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows
$$C={\rm{Tr}}\left[OV({\boldsymbol{\theta }})\rho {V}^{\dagger }({\boldsymbol{\theta }})\right]\ ,$$(6)
where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where \(\rho =\left|{\boldsymbol{0}}\right\rangle \left\langle {\boldsymbol{0}}\right|\), which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA.
It is typical to express O as a linear combination of the form \(O={c}_{0}{\mathbb{1}}+\mathop{\sum }\nolimits_{i = 1}^{N}{c}_{i}{O}_{i}\). Here O_{i} ≠ \({\mathbb{1}}\), \({c}_{i}\in {\mathbb{R}}\), and we assume that at least one c_{i} ≠ 0. Note that C_{G} and C_{L} in (1) and (2) fall under this framework. In our main results below, we will consider two different choices of O that respectively capture our general notions of global and local cost functions and also generalize the aforementioned C_{G} and C_{L}.
As shown in Fig. 3a, V(θ) consists of L layers of m-qubit unitaries W_{kl}(θ_{kl}), or blocks, acting on alternating groups of m neighboring qubits. We refer to this as an Alternating Layered Ansatz. We remark that the Alternating Layered Ansatz will be a hardware-efficient ansatz so long as the gates that compose each block are taken from a set of gates native to a specific device. As depicted in Fig. 3c, the one-dimensional Alternating Layered Ansatz can be readily implemented in devices with one-dimensional connectivity, as well as in devices with two-dimensional connectivity (such as that of IBM’s^{36} and Google’s^{37} quantum devices). That is, with both one- and two-dimensional hardware connectivity one can group qubits to form an Alternating Layered Ansatz as in Fig. 3a.
The index l = 1, …, L in W_{kl}(θ_{kl}) indicates the layer that contains the block, while k = 1, …, ξ indicates the qubits it acts upon. We assume n is a multiple of m, with n = mξ, and that m does not scale with n. As depicted in Fig. 3a, we define S_{k} as the m-qubit subsystem on which W_{kL} acts, and we define \({\mathcal{S}}=\{{S}_{k}\}\) as the set of all such subsystems. Let us now consider a block W_{kl}(θ_{kl}) in the l-th layer of the ansatz. For simplicity we henceforth use W to refer to a given W_{kl}(θ_{kl}). As shown in the Methods section, given a θ^{ν} ∈ θ_{kl} that parametrizes a rotation \({e}^{-i{\theta }^{\nu }{\sigma }_{\nu }/2}\) (with σ_{ν} a Pauli operator) inside a given block W, one can always express
where W_{A} and W_{B} contain all remaining gates in W, and are properly defined in the Methods section.
The contribution to the gradient ∇C from a parameter θ^{ν} in the block W is given by the partial derivative ∂_{ν}C. While the value of ∂_{ν}C depends on the specific parameters θ, it is useful to compute \({\langle {\partial }_{\nu }C\rangle }_{V}\), i.e., the average gradient over all possible unitaries V(θ) within the ansatz. Such an average may not be representative near the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V(θ). In the Methods Section we explicitly show how to compute averages of the form 〈…〉_{V}, and in the Supplementary Note 3 we provide a proof for the following Proposition.
Proposition 2
The average of the partial derivative of any cost function of the form (6) with respect to a parameter θ^{ν} in a block W of the ansatz in Fig. 3 is
$${\langle {\partial }_{\nu }C\rangle }_{V}=0\ ,$$(8)
provided that either W_{A} or W_{B} of (7) form a 1-design.
Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure up to the first t moments^{38}. The Methods section provides a formal definition of a t-design.
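A small concrete instance may help fix the idea. For a single qubit, the first Haar moment of any matrix A is \({\rm{Tr}}(A){\mathbb{1}}/2\), and the uniform ensemble over the four Pauli operators is a standard example of a 1-design, so averaging \(PA{P}^{\dagger }\) over it reproduces that moment. The sketch below checks this numerically; it is an illustration chosen here rather than a construction taken from the text, and `pauli_twirl` is a hypothetical helper name.

```python
import numpy as np

# The four single-qubit Pauli operators (including the identity).
PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_twirl(A):
    """Average P A P^dagger over the uniform ensemble of Pauli operators."""
    return sum(P @ A @ P.conj().T for P in PAULIS) / len(PAULIS)

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))

twirled = pauli_twirl(A)
# First Haar moment for dimension d = 2: Tr(A) * identity / d.
haar_first_moment = np.trace(A) / 2 * np.eye(2)
print(np.allclose(twirled, haar_first_moment))
```

The same ensemble does not reproduce second moments, which is why the theorems below require the stronger 2-design property.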
Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:
$${\rm{Var}}[{\partial }_{\nu }C]={\left\langle {({\partial }_{\nu }C)}^{2}\right\rangle }_{V}\ ,$$(9)
where we used the fact that \({\langle {\partial }_{\nu }C\rangle }_{V}=0\). The magnitude of Var[∂_{ν}C] quantifies how much the partial derivative concentrates around zero, and hence small values in (9) imply that the slope of the landscape will typically be insufficient to provide a cost-minimizing direction. Specifically, from Chebyshev’s inequality, Var[∂_{ν}C] bounds the probability that the cost-function partial derivative deviates from its mean value (of zero) as \(\Pr \left(|{\partial }_{\nu }C|\ge c\right)\le {\rm{Var}}[{\partial }_{\nu }C]/{c}^{2}\) for all c > 0.
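To get a feel for what the Chebyshev bound implies, one can plug in the variances of the warm-up example, \(\frac{1}{8}{(\frac{3}{8})}^{n-1}\) for the global cost and \(\frac{1}{8{n}^{2}}\) for the local cost. The following sketch uses hypothetical function names and an arbitrary threshold c = 0.1:

```python
def chebyshev_bound(variance, c):
    """Upper bound on Pr(|dC| >= c) for a zero-mean partial derivative."""
    if c <= 0:
        raise ValueError("c must be positive")
    # A probability bound above 1 carries no information, so clip it.
    return min(1.0, variance / c ** 2)

def warmup_variance_global(n):
    """Var[dC_G/dtheta^j] for the warm-up tensor-product ansatz."""
    return (1 / 8) * (3 / 8) ** (n - 1)

def warmup_variance_local(n):
    """Var[dC_L/dtheta^j] for the warm-up tensor-product ansatz."""
    return 1 / (8 * n ** 2)

c = 0.1
for n in (10, 20, 40):
    print(
        n,
        chebyshev_bound(warmup_variance_global(n), c),
        chebyshev_bound(warmup_variance_local(n), c),
    )
```

The bound for the global cost drops exponentially with n, meaning that a derivative of magnitude 0.1 or larger becomes exponentially unlikely. For the local cost the bound decays only polynomially; being an upper bound, it does not by itself guarantee large gradients, which is why a lower bound on the variance is needed to establish trainability.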
Main results
Here we present our main theorems and corollaries, with the proofs sketched in the Methods and detailed in the Supplementary Information. In addition, in the Methods section we provide some intuition behind our main results by analyzing a generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap between the warm-up example and our main theorems and also showcases the tools used to derive our main result.
The following theorem provides an upper bound on the variance of the partial derivative of a global cost function which can be expressed as the expectation value of an operator of the form
$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i1}\otimes {\widehat{O}}_{i2}\otimes \cdots \otimes {\widehat{O}}_{i\xi }\ .$$(10)
Specifically, we consider two cases of interest: (i) When N = 1 and each \({\widehat{O}}_{1k}\) is a nontrivial projector (\({\widehat{O}}_{1k}^{2}={\widehat{O}}_{1k}\ne {\mathbb{1}}\)) of rank r_{k} acting on subsystem S_{k}, or (ii) When N is arbitrary and \({\widehat{O}}_{ik}\) is traceless with \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\le {2}^{m}\) (for example, when \({\widehat{O}}_{ik}{ = \bigotimes }_{j = 1}^{m}{\sigma }_{j}^{\mu }\) is a tensor product of Pauli operators \({\sigma }_{j}^{\mu }\in \{{{\mathbb{1}}}_{j},{\sigma }_{j}^{x},{\sigma }_{j}^{y},{\sigma }_{j}^{z}\}\), with at least one \({\sigma }_{j}^{\mu }\,\ne\, {\mathbb{1}}\)). Note that case (i) includes C_{G} of (1) as a special case.
Theorem 1
Consider a trainable parameter θ^{ν} in a block W of the ansatz in Fig. 3. Let Var[∂_{ν}C] be the variance of the partial derivative of a global cost function C (with O given by (10)) with respect to θ^{ν}. If W_{A}, W_{B} of (7), and each block in V(θ) form a local 2-design, then Var[∂_{ν}C] is upper bounded by
$${\rm{Var}}[{\partial }_{\nu }C]\;\leqslant\; {F}_{n}(L,l)\ .$$(11)
(i) For N = 1 and when each \({\widehat{O}}_{1k}\) is a non-trivial projector, then defining \(R=\mathop{\prod }\nolimits_{k=1}^{\xi }{r}_{k}^{2}\), we have
$${F}_{n}(L,l)=\frac{{2}^{2m+(2m-1)(L-l)}}{({2}^{2m}-1)\cdot {3}^{\frac{n}{m}}\cdot {2}^{(2-\frac{3}{m})n}}{c}_{1}^{2}R\ .$$(12)
(ii) For arbitrary N and when each \({\widehat{O}}_{ik}\) satisfies \({\rm{Tr}}[{\widehat{O}}_{ik}]=0\) and \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}\), then
$${F}_{n}(L,l)=\frac{{2}^{2m(L-l+1)+1}}{{3}^{\frac{2n}{m}}\cdot {2}^{\left(3-\frac{4}{m}\right)n}}\mathop{\sum }\limits_{i,j=1}^{N}{c}_{i}{c}_{j}\ .$$(13)
From Theorem 1 we derive the following corollary.
Corollary 1
Consider the function F_{n}(L, l).
(i) Let N = 1 and let each \({\widehat{O}}_{1k}\) be a non-trivial projector, as in case (i) of Theorem 1. If \({c}_{1}^{2}R\in {\mathcal{O}}({2}^{n})\) and if the number of layers \(L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then
$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left({2}^{-\left(1-\frac{1}{m}{\mathrm{log}\,}_{2}3\right)n}\right)\ ,$$(14)
which implies that Var[∂_{ν}C] is exponentially vanishing in n if m ⩾ 2.
(ii) Let N be arbitrary, and let each \({\widehat{O}}_{ik}\) satisfy \({\rm{Tr}}[{\widehat{O}}_{ik}]=0\) and \({\rm{Tr}}[{\widehat{O}}_{ik}^{2}]\;\leqslant\; {2}^{m}\), as in case (ii) of Theorem 1. If \(N\in {\mathcal{O}}({2}^{n})\), \({c}_{i}\in {\mathcal{O}}(1)\), and if the number of layers \(L\in {\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then
$${F}_{n}\left(L,l\right)\in {\mathcal{O}}\left(\frac{1}{{2}^{\left(1-\frac{1}{m}\right)n}}\right)\ ,$$(15)
which implies that Var[∂_{ν}C] is exponentially vanishing in n if m ⩾ 2.
Let us now make several important remarks. First, note that part (i) of Corollary 1 includes as a particular example the cost function C_{G} of (1). Second, part (ii) of this corollary also includes as particular examples operators with \(N\in {\mathcal{O}}(1)\), as well as \(N\in {\mathcal{O}}(\mathrm{poly}\,(n))\). Finally, we remark that F_{n}(L, l) becomes trivial when the number of layers L is Ω(poly(n)); however, as we discuss below, we can still find that Var[∂_{ν}C_{G}] vanishes exponentially in this case.
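The scaling of the upper bound can be made tangible by evaluating F_{n}(L, l) on a logarithmic scale. The sketch below is a hedged illustration: it assumes the exponents of (12) and (13) read \(2m+(2m-1)(L-l)\), \((2-\frac{3}{m})n\), \(2m(L-l+1)+1\) and \((3-\frac{4}{m})n\), which is a reading of the typeset formulas rather than something re-derived here, and the function names are hypothetical. Working with log₂ avoids numerical underflow for large n.

```python
import math

def log2_F_projector(n, m, L, l, c1_sq_R=1.0):
    """log2 of F_n(L, l) for case (i), under the assumed reading of Eq. (12).

    c1_sq_R stands for c_1^2 * R; for C_G of (1), the projectors have
    rank one and c_1 = -1, so c1_sq_R = 1.
    """
    numerator = 2 * m + (2 * m - 1) * (L - l)
    denominator = (
        math.log2(2 ** (2 * m) - 1)
        + (n / m) * math.log2(3)
        + (2 - 3 / m) * n
    )
    return numerator - denominator + math.log2(c1_sq_R)

def log2_F_traceless(n, m, L, l, sum_cc=1.0):
    """log2 of F_n(L, l) for case (ii), under the assumed reading of Eq. (13).

    sum_cc stands for the double sum over c_i * c_j and must be positive.
    """
    numerator = 2 * m * (L - l + 1) + 1
    denominator = (2 * n / m) * math.log2(3) + (3 - 4 / m) * n
    return numerator - denominator + math.log2(sum_cc)

m, L, l = 2, 2, 1
for n in (8, 16, 32, 64):
    print(n, log2_F_projector(n, m, L, l), log2_F_traceless(n, m, L, l))
```

For fixed m, L, and l, both logarithms decrease linearly in n, i.e., the bound itself decays exponentially, while each additional layer between l and L only adds a constant to the logarithm.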
Our second main theorem shows that barren plateaus can be avoided for shallow circuits by employing local cost functions. Here we consider m-local cost functions where each \({\widehat{O}}_{i}\) acts non-trivially on at most m qubits and (on these qubits) can be expressed as \({\widehat{O}}_{i}={\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\):
$$O={c}_{0}{\mathbb{1}}+\mathop{\sum }\limits_{i=1}^{N}{c}_{i}{\widehat{O}}_{i}^{{\mu }_{i}}\otimes {\widehat{O}}_{i}^{\mu ^{\prime} }\ ,$$(16)
where \({\widehat{O}}_{i}^{{\mu }_{i}}\) are operators acting on m/2 qubits which can be written as a tensor product of Pauli operators. Here, we assume the summation in Eq. (16) includes two possible cases as schematically shown in Fig. 3b: First, when \({\widehat{O}}_{i}^{{\mu }_{i}}\) (\({\widehat{O}}_{i}^{\mu ^{\prime} }\)) acts on the first (last) m/2 qubits of a given S_{k}, and second, when \({\widehat{O}}_{i}^{{\mu }_{i}}\) (\({\widehat{O}}_{i}^{\mu ^{\prime} }\)) acts on the last (first) m/2 qubits of a given S_{k} (S_{k+1}). This type of cost function includes any ultralocal cost function (i.e., where the \({\widehat{O}}_{i}\) are one-body) as in (2), and also VQE Hamiltonians with up to m/2 neighbor interactions. Then, the following theorem holds.
Theorem 2
Consider a trainable parameter θ^{ν} in a block W of the ansatz in Fig. 3. Let Var[∂_{ν}C] be the variance of the partial derivative of an m-local cost function C (with O given by (16)) with respect to θ^{ν}. If W_{A}, W_{B} of (7), and each block in V(θ) form a local 2-design, then Var[∂_{ν}C] is lower bounded by
$${G}_{n}(L,l)\;\leqslant\; {\rm{Var}}[{\partial }_{\nu }C]\ ,$$(17)
with
where \({i}_{{\mathcal{L}}}\) is the set of i indices whose associated operators \({\widehat{O}}_{i}\) act on qubits in the forward light-cone \({\mathcal{L}}\) of W, and \({k}_{{{\mathcal{L}}}_{\text{B}}}\) is the set of k indices whose associated subsystems S_{k} are in the backward light-cone \({{\mathcal{L}}}_{\text{B}}\) of W. Here we defined the function \(\epsilon (M)={D}_{\text{HS}}\left(M,{\rm{Tr}}(M){\mathbb{1}}/{d}_{M}\right)\) where D_{HS} is the Hilbert–Schmidt distance and d_{M} is the dimension of the matrix M. In addition, \({\rho }_{k,k^{\prime} }\) is the partial trace of the input state ρ down to the subsystems \({S}_{k}{S}_{k+1}...{S}_{k^{\prime} }\).
Let us make a few remarks. First, note that the \(\epsilon ({\widehat{O}}_{i})\) in the lower bound indicates that training V(θ) is easier when \({\widehat{O}}_{i}\) is far from the identity. Second, the presence of \(\epsilon ({\rho }_{k,k^{\prime} })\) in G_{n}(L, l) implies that we have no guarantee on the trainability of a parameter θ^{ν} in W if ρ is maximally mixed on the qubits in the backward light-cone.
From Theorem 2 we derive the following corollary for m-local cost functions, which guarantees the trainability of the ansatz for shallow circuits.
Corollary 2
Consider the function G_{n}(L, l). Let O be an operator of the form (16), as in Theorem 2. If at least one term \({c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\) in the sum in (18) vanishes no faster than Ω(1/poly(n)), and if the number of layers L is \({\mathcal{O}}(\mathrm{log}\,(n))\), then
$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{\mathrm{poly}\,(n)}\right)\ .$$(19)
On the other hand, if at least one term \({c}_{i}^{2}\epsilon ({\rho }_{k,k^{\prime} })\epsilon ({\widehat{O}}_{i})\) in the sum in (18) vanishes no faster than \({{\Omega }}\left(1/{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}\right)\), and if the number of layers is \({\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\), then
$${G}_{n}(L,l)\in {{\Omega }}\left(\frac{1}{{2}^{\mathrm{poly}\,(\mathrm{log}\,(n))}}\right)\ .$$(20)
Hence, when L is \({\mathcal{O}}(\mathrm{poly}\,(\mathrm{log}\,(n)))\) there is a transition region where the lower bound vanishes faster than polynomially, but slower than exponentially.
We finally justify the assumption of each block being a local 2-design from the fact that shallow circuit depths lead to such local 2-designs. Namely, it has been shown that one-dimensional 2-designs have efficient quantum circuit descriptions, requiring \({\mathcal{O}}({m}^{2})\) gates to be exactly implemented^{38}, or \({\mathcal{O}}(m)\) to be approximately implemented^{39,40}. Hence, an L-layered ansatz in which each block forms a 2-design can be exactly implemented with a depth \(D\in {\mathcal{O}}({m}^{2}L)\), and approximately implemented with \(D\in {\mathcal{O}}(mL)\). For the case of two-dimensional connectivity, it has been shown that approximate 2-designs require a circuit depth of \({\mathcal{O}}(\sqrt{m})\) to be implemented^{40}. Therefore, in this case the depth of the layered ansatz is \(D\in {\mathcal{O}}(\sqrt{m}L)\). The latter shows that increasing the dimensionality of the circuit reduces the circuit depth needed to make each block a 2-design.
Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is \({\mathcal{O}}(n)\)^{40}. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of ref. ^{9} for both local and global cost functions.
Numerical simulations
As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders^{11,41,42,43,44}. In particular, the pioneering VQA proposed in ref. ^{11} has received significant literature attention, due to its importance to quantum machine learning and quantum data compression. Let us briefly explain the algorithm in ref. ^{11}.
Consider a bipartite quantum system AB composed of n_{A} and n_{B} qubits, respectively, and let \(\{{p}_{\mu },|{\psi }_{\mu }\rangle \}\) be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state \(|{\psi }_{\mu }\rangle\) with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ).
To quantify the degree of data compression, ref. ^{11} proposed a cost function of the form:
$$C_{\rm{G}}^{\prime}=1-{\rm{Tr}}\left[\left|{\boldsymbol{0}}\right\rangle \left\langle {\boldsymbol{0}}\right|{\rho }_{\text{B}}^{\text{out}}\right]$$(21)
$$={\rm{Tr}}\left[O_{\rm{G}}^{\prime}V({\boldsymbol{\theta }}){\rho }_{\text{AB}}^{\text{in}}V{({\boldsymbol{\theta }})}^{\dagger }\right]\ ,$$(22)
where \({\rho }_{\text{AB}}^{\text{in}}={\sum }_{\mu }{p}_{\mu }|{\psi }_{\mu }\rangle \langle {\psi }_{\mu }|\) is the ensemble-average input state, \({\rho }_{\text{B}}^{\text{out}}={\sum }_{\mu }{p}_{\mu }{{\rm{Tr}}}_{\text{A}}[|{\psi }_{\mu }^{\prime}\rangle \langle {\psi }_{\mu }^{\prime}|]\) is the ensemble-average trash state, and \(|{\psi }_{\mu }^{\prime}\rangle =V({\boldsymbol{\theta }})|{\psi }_{\mu }\rangle\). Equation (22) makes it clear that \(C_{\rm{G}}^{\prime}\) has the form in (6), and \(O_{\rm{G}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-{{\mathbb{1}}}_{\text{A}}\otimes \left|{\boldsymbol{0}}\right\rangle \left\langle {\boldsymbol{0}}\right|\) is a global observable of the form in (10). Hence, according to Corollary 1, \(C_{\rm{G}}^{\prime}\) exhibits a barren plateau for large n_{B}. (Specifically, Corollary 1 applies in this context when n_{A} < n_{B}.) As a result, large-scale data compression, where one is interested in discarding large numbers of qubits, will not be possible with \(C_{\rm{G}}^{\prime}\).
To address this issue, we propose the following local cost function
$$C_{\rm{L}}^{\prime}=1-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\limits_{j=1}^{{n}_{\text{B}}}{\rm{Tr}}\left[\left(\left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\right){\rho }_{\text{B}}^{\text{out}}\right]={\rm{Tr}}\left[O_{\rm{L}}^{\prime}V({\boldsymbol{\theta }}){\rho }_{\text{AB}}^{\text{in}}V{({\boldsymbol{\theta }})}^{\dagger }\right]\ ,$$(23)
where \(O_{\rm{L}}^{\prime} ={{\mathbb{1}}}_{\text{AB}}-\frac{1}{{n}_{\text{B}}}\mathop{\sum }\nolimits_{j=1}^{{n}_{\text{B}}}{{\mathbb{1}}}_{\text{A}}\otimes \left|0\right\rangle {\left\langle 0\right|}_{j}\otimes {{\mathbb{1}}}_{\overline{j}}\), and \({{\mathbb{1}}}_{\overline{j}}\) is the identity on all qubits in B except the j-th qubit. As shown in the Supplementary Note 9, \(C_{\rm{L}}^{\prime}\) satisfies \(C_{\rm{L}}^{\prime} \;\leqslant\; C_{\rm{G}}^{\prime} \;\leqslant\; {n}_{\text{B}}C_{\rm{L}}^{\prime}\), which implies that \(C_{\rm{L}}^{\prime}\) is faithful (vanishing under the same conditions as \(C_{\rm{G}}^{\prime}\)). Furthermore, note that \(O_{\rm{L}}^{\prime}\) has the form in (16). Hence Corollary 2 implies that \(C_{\rm{L}}^{\prime}\) does not exhibit a barren plateau for shallow ansatzes.
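The sandwich relation between the two autoencoder costs can be checked on a small example. Because both observables are diagonal in the computational basis, the global cost is one minus the probability of finding all trash qubits in 0, and the local cost is one minus the average probability of finding each trash qubit in 0. The sketch below rests on several assumptions made only for illustration: a single random pure state stands in for the output \(V({\boldsymbol{\theta }})|{\psi }_{\mu }\rangle\), the A qubits are ordered before the B qubits, and the function names are hypothetical.

```python
import numpy as np

def trash_zero_probabilities(state, n_a, n_b):
    """Return (Pr[all B qubits are 0], array of Pr[B qubit j is 0]).

    `state` is a pure state vector on n_a + n_b qubits, with the A
    qubits first (an ordering convention assumed for this sketch).
    """
    probs = np.abs(state.reshape([2] * (n_a + n_b))) ** 2
    # Marginalize over A to get the outcome distribution on B.
    p_b = probs.sum(axis=tuple(range(n_a)))
    p_all_zero = float(p_b[(0,) * n_b])
    p_zero_each = np.array([
        p_b.sum(axis=tuple(k for k in range(n_b) if k != j))[0]
        for j in range(n_b)
    ])
    return p_all_zero, p_zero_each

def autoencoder_costs(state, n_a, n_b):
    """Return (global cost, local cost) for a single output state."""
    p_all_zero, p_zero_each = trash_zero_probabilities(state, n_a, n_b)
    return 1.0 - p_all_zero, 1.0 - float(np.mean(p_zero_each))

rng = np.random.default_rng(2)
n_a, n_b = 1, 4
dim = 2 ** (n_a + n_b)
state = rng.normal(size=dim) + 1j * rng.normal(size=dim)
state /= np.linalg.norm(state)

cost_global, cost_local = autoencoder_costs(state, n_a, n_b)
print(cost_local, cost_global, n_b * cost_local)
```

The three printed numbers should appear in non-decreasing order. The lower inequality holds because "not all trash qubits are 0" is at least as likely as "qubit j is not 0" for every j, and the upper one is a union bound over the trash qubits.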
Here we simulate the autoencoder algorithm to solve a simple problem where n_{A} = 1, and where the input state ensemble \(\{{p}_{\mu },|{\psi }_{\mu }\rangle \}\) is given by
In order to analyze the cost-function-dependent barren plateau phenomenon, the dimension of subsystem B is gradually increased as n_{B} = 10, 15, …, 100.
Numerical results
In our heuristics, the gate sequence V(θ) is given by two layers of the ansatz in Fig. 4, so that the number of gates and parameters in V(θ) increases linearly with n_{B}. Note that this ansatz is a simplified version of the ansatz in Fig. 3, as we can only generate unitaries with real coefficients. All parameters in V(θ) were randomly initialized and, as detailed in the Methods section, we employ a gradient-free training algorithm that gradually increases the number of shots per cost-function evaluation.
Analysis of the n-dependence. Figure 5 shows representative results of our numerical implementations of the quantum autoencoder in ref. ^{11} obtained by training V(θ) with the global and local cost functions respectively given by (22) and (23). Specifically, while we train with finite sampling, in the figures we show the exact cost-function values versus the number of iterations. Here, the top (bottom) axis corresponds to the number of iterations performed while training with \(C_{\rm{G}}^{\prime}\) (\(C_{\rm{L}}^{\prime}\)). For n_{B} = 10 and 15, Fig. 5 shows that we are able to train V(θ) for both cost functions. For n_{B} = 20, the global cost function initially presents a plateau in which the optimizing algorithm is not able to determine a minimizing direction. However, as the number of shots per function evaluation increases, one can eventually minimize \(C_{\rm{G}}^{\prime}\). Such a result indicates the presence of a barren plateau where the gradient takes small values which can only be detected when the number of shots becomes sufficiently large. In this particular example, one is able to start training at around 140 iterations.
When n_{B} > 20 we are unable to train the global cost function, while always being able to train our proposed local cost function. Note that the number of iterations is different for \(C_{\rm{G}}^{\prime}\) and \(C_{\rm{L}}^{\prime}\), as for the global cost function case we reach the maximum number of shots in fewer iterations. These results indicate that the global cost function of (22) exhibits a barren plateau where the gradient of the cost function vanishes exponentially with the number of qubits, and which arises even for constant-depth ansatzes. We remark that in principle one can always find a minimizing direction when training \(C_{\rm{G}}^{\prime}\), although this would require a number of shots that increases exponentially with n_{B}. Moreover, one can see in Fig. 5 that randomly initializing the parameters always leads to \(C_{\rm{G}}^{\prime} \approx 1\) due to the narrow gorge phenomenon (see Proposition 1), i.e., where the probability of being near the global minimum vanishes exponentially with n_{B}.
On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train \(C_{\rm{L}}^{\prime}\) for all considered values of n_{B}. Moreover, as seen in Fig. 5, \(C_{\rm{L}}^{\prime}\) can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation).
Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the dependence on L when training \(C_{\rm{L}}^{\prime}\) for the autoencoder example with n_{A} = 1 and n_{B} = 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in L, although we remark that this behavior can saturate for very large L^{9}.
In summary, even though the ansatz employed in our heuristics is beyond the scope of our theorems, we still find cost-function-dependent barren plateaus, indicating that the cost-function-dependent barren plateau phenomenon might be more general and go beyond our analytical results.
Discussion
While scaling results have been obtained for classical neural networks^{45}, very few such results exist for the trainability of parametrized quantum circuits, and more generally for quantum neural networks. Hence, rigorous scaling results are urgently needed for VQAs, which many researchers believe will provide the path to quantum advantage with near-term quantum computers. One of the few such results is the barren plateau theorem of ref. ^{9}, which holds for VQAs with deep, hardware-efficient ansatzes.
In this work, we proved that the barren plateau phenomenon extends to VQAs with randomly initialized shallow Alternating Layered Ansatzes. The key to extending this phenomenon to shallow circuits was to consider the locality of the operator O that defines the cost function C. Theorem 1 presented a universal upper bound on the variance of the gradient for global cost functions, i.e., when O is a global operator. Corollary 1 stated the asymptotic scaling of this upper bound for shallow ansatzes as being exponentially decaying in n, indicating a barren plateau. Conversely, Theorem 2 presented a universal lower bound on the variance of the gradient for local cost functions, i.e., when O is a sum of local operators. Corollary 2 noted that for shallow ansatzes this lower bound decays polynomially in n. Taken together, these two results show that barren plateaus are cost-function-dependent, and they establish a connection between locality and trainability.
In the context of chemistry or materials science, our present work can inform researchers about which transformation to use when mapping a fermionic Hamiltonian to a spin Hamiltonian^{46}, i.e., Jordan–Wigner versus Bravyi–Kitaev^{47}. Namely, the Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost function. This fact was recently numerically confirmed^{48}.
Moreover, the fact that Corollary 2 is valid for arbitrary input quantum states may be useful when constructing variational ansatzes. For example, one could propose a growing ansatz method where one appends \(\mathrm{log}\,(n)\) layers of the hardware-efficient ansatz to a previously trained (hence fixed) circuit. This could then lead to a layer-by-layer training strategy where the previously trained circuit can correspond to multiple layers of the same hardware-efficient ansatz.
We remark that our definition of a global operator (local operator) is one that is both nonlocal (local) and many-body (few-body). Therefore, the barren plateau phenomenon could be due to the many-bodiness of the operator rather than the nonlocality of the operator; we leave the resolution of this question to future work. On the other hand, our Theorem 1 rules out the possibility that barren plateaus could be due to cardinality, i.e., the number of terms in O when decomposed as a sum of Pauli products^{49}. Namely, case (ii) of this theorem implies barren plateaus for O of essentially arbitrary cardinality, and hence cardinality is not the key variable at work here.
We illustrated these ideas for two example VQAs. In Fig. 2, we considered a simple state-preparation example, which allowed us to delve deeper into the cost landscape and uncover another phenomenon that we called a narrow gorge, stated precisely in Proposition 1. In Fig. 5, we studied the more important example of quantum autoencoders, which have generated significant interest in the quantum machine learning community. Our numerics showed the effects of barren plateaus: for more than 20 qubits we were unable to minimize the global cost function introduced in ref. ^{11}. To address this, we introduced a local cost function for quantum autoencoders, which we were able to minimize for system sizes of up to 100 qubits.
There are several directions in which our results could be generalized in future work. Naturally, we hope to extend the narrow gorge phenomenon in Proposition 1 to more general VQAs. In addition, we hope in the future to unify our Theorems 1 and 2 into a single result that bounds the variance as a function of a parameter that quantifies the locality of O. This would further solidify the connection between locality and trainability. Moreover, our numerics suggest that our theorems (which are stated for exact 2-designs) might be extendable in some form to ansatzes composed of simpler blocks, like approximate 2-designs^{39}.
We emphasize that while our theorems are stated for a hardware-efficient ansatz and for costs that are of the form (6), it remains an interesting open question as to whether other ansatzes, cost functions, and architectures exhibit scaling behavior similar to that stated in our theorems. For instance, we have recently shown^{50} that our results can be extended to a more general type of Quantum Neural Network called dissipative quantum neural networks^{51}. Another potential example of interest could be the unitary coupled cluster (UCC) ansatz in chemistry^{52}, which is intended for use in the \({\mathcal{O}}(\mathrm{poly}\,(n))\) depth regime^{34}. Therefore, it is important to study the key mathematical features of an ansatz that might allow one to go from trainability for \({\mathcal{O}}(\mathrm{log}\,n)\) depth (which we guarantee here for local cost functions) to trainability for \({\mathcal{O}}(\mathrm{poly}\,n)\) depth.
Finally, we remark that some strategies have been developed to mitigate the effects of barren plateaus^{32,33,53,54}. While these methods are promising and have been shown to work in certain cases, they are still heuristic methods with no provable guarantees that they can work in generic scenarios. Hence, we believe that more work needs to be done to better understand how to prevent, avoid, or mitigate the effects of barren plateaus.
Methods
In this section, we provide additional details for the results in the main text, as well as a sketch of the proofs for our main theorems. We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. More detailed proofs of our theorems are given in the Supplementary Information.
Variance of the cost function partial derivative
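Let us first discuss the formulas we employed to compute Var[∂_{ν}C]. Without loss of generality, any block W_{kl}(θ_{kl}) in the Alternating Layered Ansatz can be written as a product of ζ_{kl} independent gates from a gate alphabet \({\mathcal{A}}=\{{G}_{\mu }(\theta )\}\) as
$${W}_{kl}({{\boldsymbol{\theta }}}_{kl})={G}_{{\zeta }_{kl}}({\theta }_{kl}^{{\zeta }_{kl}})\cdots {G}_{\nu }({\theta }_{kl}^{\nu })\cdots {G}_{1}({\theta }_{kl}^{1}),\tag{27}$$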
where each \({\theta }_{kl}^{\nu }\) is a continuous parameter. Here, \({G}_{\nu }({\theta }_{kl}^{\nu })={R}_{\nu }({\theta }_{kl}^{\nu }){Q}_{\nu }\) where Q_{ν} is an unparametrized gate and \({R}_{\nu }({\theta }_{kl}^{\nu })={e}^{i{\theta }_{kl}^{\nu }{\sigma }_{\nu }/2}\) with σ_{ν} a Pauli operator. Note that W_{kL} denotes a block in the last layer of V(θ).
For the proofs of our results, it is helpful to conceptually break up the ansatz as follows. Consider a block W_{kl}(θ_{kl}) in the l-th layer of the ansatz. For simplicity, we henceforth use W to refer to a given W_{kl}(θ_{kl}). Let S_{w} denote the m-qubit subsystem that contains the qubits W acts on, and let \({S}_{\overline{w}}\) be the (n − m)-qubit subsystem on which W acts trivially. Similarly, let \({{\mathcal{H}}}_{w}\) and \({{\mathcal{H}}}_{\overline{w}}\) denote the Hilbert spaces associated with S_{w} and \({S}_{\overline{w}}\), respectively. Then, as shown in Fig. 3a, V(θ) can be expressed as
$$V({\boldsymbol{\theta }})={V}_{{\rm{R}}}({{\mathbb{1}}}_{\overline{w}}\otimes W){V}_{{\rm{L}}}.\tag{28}$$
Here, \({{\mathbb{1}}}_{\overline{w}}\) is the identity on \({{\mathcal{H}}}_{\overline{w}}\), and V_{R} contains the gates in the (forward) light-cone \({\mathcal{L}}\) of W, i.e., all gates with at least one input qubit causally connected to the output qubits of W. The latter allows us to define \({S}_{{\mathcal{L}}}\) as the subsystem of all qubits in \({\mathcal{L}}\).
Let us here recall that the Alternating Layered Ansatz can be implemented with either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We remark that the following results are valid for both cases as the light-cone structure will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. Hence, there is no need to refer to the connectivity dimension in what follows.
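As a concrete illustration of the light-cone bookkeeping, the following minimal Python sketch enumerates the qubits in \({S}_{{\mathcal{L}}}\) for a 1D layout. The brick-wall geometry it assumes (m-qubit blocks, shifted by m/2 in every other layer, truncated at open boundaries) and the function names `block_qubits` and `forward_lightcone` are illustrative assumptions rather than notation from the main text.

```python
def block_qubits(q, layer, n, m):
    """Qubits of the block that contains qubit q in the given layer.

    Assumed (hypothetical) layout: even layers tile qubits 0..n-1 with
    m-qubit blocks starting at qubit 0; odd layers are shifted by m // 2,
    and blocks are truncated at the open boundaries.
    """
    offset = (m // 2) if layer % 2 else 0
    start = ((q - offset) // m) * m + offset
    return set(range(max(start, 0), min(start + m, n)))


def forward_lightcone(q, layer, n, m, num_layers):
    """Qubits reached from the block containing q in `layer` through all later layers."""
    support = block_qubits(q, layer, n, m)
    for later in range(layer + 1, num_layers):
        grown = set()
        for qubit in support:
            # Every later block that touches the current support joins the light-cone.
            grown |= block_qubits(qubit, later, n, m)
        support = grown
    return support


if __name__ == "__main__":
    cone = forward_lightcone(q=2, layer=0, n=8, m=2, num_layers=3)
    print(sorted(cone))
```

Under these assumptions the support grows by at most m/2 qubits on each side per layer, so a block in the last layer has a light-cone equal to its own m qubits.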
Assuming now that θ^{ν} is a parameter inside a given block W, we obtain from (6), (27), and (28)
with
Finally, from (29) we can derive a general formula for the variance:
which holds if W_{A} and W_{B} form independent 2-designs. Here, the summation runs over all bitstrings p, q, \({\boldsymbol{p}}^{\prime}\), \({\boldsymbol{q}}^{\prime}\) of length n − m (i.e., over 2^{n−m} values each). In addition, we defined
where \({{\rm{Tr}}}_{\overline{w}}\) indicates the trace over subsystem \({S}_{\overline{w}}\), and Ω_{qp} and Ψ_{qp} are operators on \({{\mathcal{H}}}_{w}\) defined as
We derive Eq. (31) in Supplementary Note 4.
Computing averages over V
Here we introduce the main tools employed to compute quantities of the form 〈…〉_{V}. These tools are used throughout the proofs of our main results.
Let us first remark that if the blocks in V(θ) are independent, then any average over V can be computed by averaging over the individual blocks, i.e., \({\langle \ldots \rangle }_{V}={\langle \ldots \rangle }_{{W}_{11},\ldots ,{W}_{kl},\ldots }={\langle \ldots \rangle }_{{V}_{{\rm{L}}},W,{V}_{\text{R}}}\). For simplicity let us first consider the expectation value over a single block W in the ansatz. In principle 〈…〉_{W} can be approximated by varying the parameters in W and sampling over the resulting 2^{m} × 2^{m} unitaries. However, if W forms a t-design, this procedure can be simplified as it is known that sampling over its distribution yields the same properties as sampling random unitaries from the unitary group with respect to the unique normalized Haar measure.
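To make the design property concrete, the sketch below checks the simplest case, t = 1, on a single qubit: averaging W ρ W^{†} over the four Pauli matrices reproduces the Haar average Tr[ρ] 𝟙/2. The Pauli set is a standard example of a 1-design; the helper names and the test state are illustrative choices, and the blocks considered in this work are assumed to satisfy the stronger 2-design condition, which this small check does not address.

```python
# Single-qubit Pauli matrices as nested lists (a standard 1-design).
PAULIS = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def pauli_twirl(rho):
    """Uniform average of P rho P^dagger over the four Pauli matrices."""
    averaged = [[0j, 0j], [0j, 0j]]
    for p in PAULIS.values():
        term = matmul(matmul(p, rho), dagger(p))
        for i in range(2):
            for j in range(2):
                averaged[i][j] += term[i][j] / 4
    return averaged


if __name__ == "__main__":
    # Hypothetical test state with unit trace and nonzero coherences.
    rho = [[0.75, 0.25 - 0.1j], [0.25 + 0.1j, 0.25]]
    print(pauli_twirl(rho))
```

The output is the maximally mixed state: the coherences cancel and the populations are equalized, exactly as the first-moment Haar integral predicts.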
Explicitly, the Haar measure is a uniquely defined left- and right-invariant measure dμ(W) over the unitary group, such that for any unitary matrix A ∈ U(2^{m}) and for any function f(W) we have
$$\int d\mu (W)\,f(W)=\int d\mu (W)\,f(AW)=\int d\mu (W)\,f(WA),\tag{36}$$
where the integration domain is assumed to be U(2^{m}) throughout this work. Consider a finite set \({\{{W}_{y}\}}_{y\in Y}\) (of size ∣Y∣) of unitaries W_{y}, and let P_{(t, t)}(W) be an arbitrary polynomial of degree at most t in the matrix elements of W and at most t in those of W^{†}. Then, this finite set is a t-design if^{38}
$$\frac{1}{| Y| }\sum _{y\in Y}{P}_{(t,t)}({W}_{y})=\int d\mu (W)\,{P}_{(t,t)}(W).\tag{37}$$
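From the general form of C in Eq. (6) we can see that the cost function is a polynomial of degree at most 2 in the matrix elements of each block W_{kl} in V(θ), and at most 2 in those of \({({W}_{kl})}^{\dagger }\). Then, if a given block W forms a 2-design, one can employ the following element-wise formulas of the Weingarten calculus^{55,56} to explicitly evaluate averages over W up to the second moment:
$$\begin{aligned}\int d\mu (W)\,{w}_{{i}_{1}{j}_{1}}{w}_{{i}_{1}^{\prime}{j}_{1}^{\prime}}^{* }&=\frac{{\delta }_{{i}_{1}{i}_{1}^{\prime}}{\delta }_{{j}_{1}{j}_{1}^{\prime}}}{{2}^{m}},\\ \int d\mu (W)\,{w}_{{i}_{1}{j}_{1}}{w}_{{i}_{2}{j}_{2}}{w}_{{i}_{1}^{\prime}{j}_{1}^{\prime}}^{* }{w}_{{i}_{2}^{\prime}{j}_{2}^{\prime}}^{* }&=\frac{1}{{2}^{2m}-1}\left({{\Delta }}_{1}-\frac{{{\Delta }}_{2}}{{2}^{m}}\right),\end{aligned}\tag{38}$$
where w_{ij} are the matrix elements of W, and
$$\begin{aligned}{{\Delta }}_{1}&={\delta }_{{i}_{1}{i}_{1}^{\prime}}{\delta }_{{i}_{2}{i}_{2}^{\prime}}{\delta }_{{j}_{1}{j}_{1}^{\prime}}{\delta }_{{j}_{2}{j}_{2}^{\prime}}+{\delta }_{{i}_{1}{i}_{2}^{\prime}}{\delta }_{{i}_{2}{i}_{1}^{\prime}}{\delta }_{{j}_{1}{j}_{2}^{\prime}}{\delta }_{{j}_{2}{j}_{1}^{\prime}},\\ {{\Delta }}_{2}&={\delta }_{{i}_{1}{i}_{1}^{\prime}}{\delta }_{{i}_{2}{i}_{2}^{\prime}}{\delta }_{{j}_{1}{j}_{2}^{\prime}}{\delta }_{{j}_{2}{j}_{1}^{\prime}}+{\delta }_{{i}_{1}{i}_{2}^{\prime}}{\delta }_{{i}_{2}{i}_{1}^{\prime}}{\delta }_{{j}_{1}{j}_{1}^{\prime}}{\delta }_{{j}_{2}{j}_{2}^{\prime}}.\end{aligned}\tag{39}$$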
Intuition behind the main results
The goal of this section is to provide some intuition for our main results. Specifically, we show here how the scaling of the cost function variance can be related to the number of blocks we have to integrate to compute \({\langle \cdots \rangle }_{{V}_{\text{R}},{V}_{{\rm{L}}}}\), the locality of the cost function, and the number of layers in the ansatz.
First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order 1/2^{2m}. Hence, we see that the more blocks one integrates over, the worse the scaling can be.
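The 1/2^{2m} scaling can be checked numerically for a single matrix element. The following is a minimal Monte Carlo sketch, assuming only that a column of a Haar-random unitary of dimension d = 2^{m} is a uniformly random unit vector (obtained here by normalizing a complex Gaussian vector). The standard Haar moments for one entry w are ⟨|w|^{2}⟩ = 1/d and ⟨|w|^{4}⟩ = 2/(d(d + 1)), the latter being of order 1/2^{2m}. The function names, sample size, and seed are illustrative.

```python
import random


def haar_entry_moments(m, samples=20000, seed=7):
    """Estimate <|w|^2> and <|w|^4> for one entry w of a Haar-random 2^m x 2^m unitary."""
    rng = random.Random(seed)
    d = 2 ** m
    second, fourth = 0.0, 0.0
    for _ in range(samples):
        # Squared moduli of d independent complex Gaussian amplitudes.
        weights = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(d)]
        p = weights[0] / sum(weights)  # |w|^2 for the first entry of the column
        second += p
        fourth += p * p
    return second / samples, fourth / samples


def exact_moments(m):
    """Closed-form Haar values: 1/d and 2/(d(d+1)) with d = 2^m."""
    d = 2 ** m
    return 1 / d, 2 / (d * (d + 1))


if __name__ == "__main__":
    for m in (1, 2, 3):
        print(m, haar_entry_moments(m), exact_moments(m))
```

The estimates fluctuate with the seed and sample size, so the comparison should be read as agreement within statistical error rather than as an exact identity.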
We now generalize the warm-up example. Let V(θ) be a single layer of the alternating ansatz of Fig. 3, i.e., V(θ) is a tensor product of m-qubit blocks W_{k} := W_{k1}, with k = 1, …, ξ (and with ξ = n/m), so that θ^{ν} is in the block \({W}_{k^{\prime} }\). In Supplementary Note 5 we generalize this scenario to the case where the input state is not \(\left|{\boldsymbol{0}}\right\rangle\), but is instead an arbitrary state ρ.
From (31), the partial derivative of the global cost function in (1) can be expressed as
where \(\upsilon =\frac{{({2}^{m}-1)}^{2}{\rm{Tr}}[{\sigma }_{\nu }^{2}]}{{2}^{2m}{({2}^{m+1}-1)}^{2}}\). From (40) we see that one needs to integrate over ξ − 1 blocks. Then, since each integration leads to a coefficient 1/2^{2m}, the variance will scale as \({\mathcal{O}}\left(1/{({2}^{2m})}^{\xi -1}\right)={\mathcal{O}}(1/{2}^{2n})\) for fixed m. Hence, the scaling of the variance gets worse for each block we integrate (such that the block acts on qubits we are measuring).
On the other hand, for a local cost let us consider a single term in (3) where \(j\in {S}_{\tilde{k}}\), so that
Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance scales as \({\mathcal{O}}(1/{n}^{2})\), where we remark that the scaling is essentially given by the prefactor 1/n^{2} in (3).
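The contrast between the two scalings can be tabulated directly. The sketch below is an order-of-magnitude comparison only: it evaluates the factor (1/2^{2m})^{ξ−1} accumulated by the global cost and the 1/n^{2} prefactor of the local cost, and ignores all other prefactors. The function names and the chosen values of n and m are illustrative assumptions.

```python
from fractions import Fraction


def global_scaling(n, m):
    """(1/2^(2m))^(xi-1) with xi = n/m blocks: one factor per integrated block."""
    if n % m:
        raise ValueError("n must be a multiple of m")
    xi = n // m
    return Fraction(1, 2 ** (2 * m)) ** (xi - 1)


def local_scaling(n):
    """The 1/n^2 prefactor that sets the scale for the local cost."""
    return Fraction(1, n ** 2)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, float(global_scaling(n, m=2)), float(local_scaling(n)))
```

Since (2^{2m})^{ξ−1} = 2^{2n−2m}, the global factor equals 2^{2m}/2^{2n}, which is exponentially small in n for fixed m, whereas the local factor decays only polynomially.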
Let us now briefly provide some intuition as to why the scaling of local cost gradients becomes exponentially vanishing with the number of layers as in Theorem 2. Consider the case when V(θ) contains L layers of the ansatz in Fig. 3. Moreover, as shown in Fig. 7, let W be in the first layer, and let O_{i} act on the m topmost qubits of \({\mathcal{L}}\). As schematically depicted in Fig. 7, we now have to integrate over L − 1 blocks. Then, as proved in Supplementary Note 5, integrating over a block leads to a coefficient 2^{m/2}/(2^{m} + 1). Hence, after integrating L − 1 times, we obtain a coefficient \({2}^{m(L-1)/2}/{({2}^{m}+1)}^{L-1}\), which vanishes no faster than \({{\Omega }}\left(1/\mathrm{poly}\,(n)\right)\) if \(mL\in {\mathcal{O}}(\mathrm{log}\,(n))\).
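The layer dependence can be made explicit with a short calculation. Since 2^{m} + 1 ≤ 2^{m+1}, each per-block factor 2^{m/2}/(2^{m} + 1) is at least 2^{−(m/2+1)}, so for fixed m and L = log_{2} n the accumulated coefficient is at least 2^{m/2+1} n^{−(m/2+1)}, i.e., polynomially small. The sketch below evaluates the coefficient and this elementary lower bound; the function names are illustrative.

```python
def per_block_coefficient(m):
    """Coefficient 2^(m/2) / (2^m + 1) picked up when integrating over one block."""
    return 2 ** (m / 2) / (2 ** m + 1)


def layer_coefficient(m, num_layers):
    """Coefficient after integrating over num_layers - 1 blocks."""
    return per_block_coefficient(m) ** (num_layers - 1)


def lower_bound(m, num_layers):
    """Elementary bound using 2^m + 1 <= 2^(m+1) for each factor."""
    return 2.0 ** (-(m / 2 + 1) * (num_layers - 1))


if __name__ == "__main__":
    m = 2
    for num_layers in (2, 3, 4, 5, 6):
        n = 2 ** num_layers  # qubit count for which num_layers = log2(n)
        print(n, layer_coefficient(m, num_layers), lower_bound(m, num_layers))
```

For m = 2 the lower bound reads 4/n^{2} when L = log_{2} n, which illustrates the polynomial decay in the logarithmic-depth regime.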
As we discuss below, for more general scenarios the computation of Var[∂_{ν}C] becomes more complex.
Sketch of the proof of the main theorems
Here we present a sketch of the proof of Theorems 1 and 2. We refer the reader to the Supplementary Information for a detailed version of the proofs.
As mentioned in the previous subsection, if each block in V(θ) forms a local 2-design, then we can explicitly calculate expectation values 〈…〉_{W} via (38). Hence, to compute \({\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{\text{R}}}\), and \({\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\rangle }_{{V}_{{\rm{L}}}}\) in (31), one needs to algorithmically integrate over each block using the Weingarten calculus. In order to make such a computation tractable, we employ the tensor network representation of quantum circuits.
For the sake of clarity, we recall that any two-qubit gate can be expressed as \(U={\sum }_{ijkl}{U}_{ijkl}\left|ij\right\rangle \left\langle kl\right|\), where U_{ijkl} is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the ansatz can be considered as a \({2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\ \times {2}^{\frac{m}{2}}\) tensor. As schematically shown in Fig. 8a, one can use the circuit description of \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) and Ψ_{pq} to derive the tensor network representation of terms such as \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\). Here, \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) is obtained from (34) by simply replacing O with O_{i}.
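As a small illustration of this index convention, the sketch below rewrites a 4 × 4 two-qubit gate as a 2 × 2 × 2 × 2 tensor with entries U_{ijkl} = ⟨ij|U|kl⟩. It assumes the first qubit is the most significant bit of the matrix index; the function name and the choice of a CNOT gate as the example are illustrative.

```python
def gate_to_tensor(u):
    """Reshape a 4x4 two-qubit gate into a tensor with U[i][j][k][l] = <ij|U|kl>.

    Assumes the first qubit labels the most significant bit of the row/column index.
    """
    return [[[[u[2 * i + j][2 * k + l] for l in range(2)] for k in range(2)]
             for j in range(2)] for i in range(2)]


# CNOT with the first qubit as control: |10> <-> |11>.
CNOT = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    tensor = gate_to_tensor(CNOT)
    # Nonzero entries correspond to the allowed transitions |kl> -> |ij>.
    for i in range(2):
        for j in range(2):
            for k in range(2):
                for l in range(2):
                    if tensor[i][j][k][l]:
                        print(f"|{k}{l}> -> |{i}{j}>")
```

Contracting such tensors along shared indices is what the tensor network diagrams of Fig. 8 represent graphically.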
In Fig. 8b we depict an example where we employ the tensor network representation of \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\) to compute the average of \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\), and \({\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}]{\rm{Tr}}[{{{\Omega }}}_{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }^{j}]\). As expected, after each integration one obtains a sum of four new tensor networks according to Eq. (38).
Proof of Theorem 2
Let us first consider an m-local cost function C where O is given by (16), and where \({\widehat{O}}_{i}\) acts nontrivially in a given subsystem S_{k} of \({\mathcal{S}}\). In particular, when \({\widehat{O}}_{i}\) is of this form the proof is simplified, although the more general proof is presented in the Supplementary Note 6. If \({S}_{k}\not\subset {S}_{{\mathcal{L}}}\) we find \({{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{i}\propto {{\mathbb{1}}}_{w}\), and hence
The latter implies that we only have to consider the operators \({\widehat{O}}_{i}\) which act on qubits inside of the forward light-cone \({\mathcal{L}}\) of W.
Then, as shown in the Supplementary Information
Here we remark that the proportionality factor contains terms of the form \({\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{+}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{w}}^{-}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\overline{w}}^{-}}}\) (where \({S}_{\overline{w}}^{+}\cup {S}_{\overline{w}}^{-}={S}_{\overline{w}}\)), which arise from the different tensor contractions of \({P}_{{\boldsymbol{p}}{\boldsymbol{q}}}=\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|\) in Fig. 8c. It is then straightforward to show that
where we define \({\tilde{\rho }}^{-}\) as the reduced states of \(\tilde{\rho }={V}_{{\rm{L}}}\rho {V}_{{\rm{L}}}^{\dagger }\) in the Hilbert spaces associated with subsystems \({S}_{w}\cup {S}_{\overline{w}}^{-}\). Here we recall that D_{HS} is the Hilbert–Schmidt distance \({D}_{HS}\left(\rho ,\sigma \right)={\rm{Tr}}[{(\rho -\sigma )}^{2}]\).
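For reference, the Hilbert–Schmidt distance D_{HS}(ρ, σ) = Tr[(ρ − σ)^{2}] is straightforward to evaluate for small matrices. The sketch below is a minimal implementation for matrices given as nested lists; the function name and the single-qubit example states are illustrative.

```python
def hilbert_schmidt_distance(rho, sigma):
    """Return Tr[(rho - sigma)^2] for two square matrices given as nested lists."""
    dim = len(rho)
    diff = [[rho[i][j] - sigma[i][j] for j in range(dim)] for i in range(dim)]
    # Tr[D^2] = sum_ij D_ij D_ji; the result is real for Hermitian D.
    total = sum(diff[i][j] * diff[j][i] for i in range(dim) for j in range(dim))
    return complex(total).real


if __name__ == "__main__":
    zero = [[1, 0], [0, 0]]           # |0><0|
    one = [[0, 0], [0, 1]]            # |1><1|
    mixed = [[0.5, 0], [0, 0.5]]      # maximally mixed state
    print(hilbert_schmidt_distance(zero, one))    # orthogonal pure states
    print(hilbert_schmidt_distance(zero, mixed))  # pure state vs. maximally mixed
```

Two orthogonal pure states attain the maximal value 2, while the distance between a pure single-qubit state and the maximally mixed state is 1/2.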
By employing properties of D_{HS} one can show (see Supplementary Note 6)
where \({\tilde{\rho }}_{w}={{\rm{Tr}}}_{{S}_{\overline{w}}^{-}}[{\tilde{\rho }}^{-}]\). We can then leverage the tensor network representation of quantum circuits to algorithmically integrate over each block in V_{L} and compute \({\langle {D}_{\text{HS}}\left({\widetilde{\rho }}_{w},\frac{{\mathbb{1}}}{{2}^{m}}\right)\rangle }_{{V}_{{\rm{L}}}}\). One finds
with \({t}_{k,k^{\prime} }\geqslant \frac{{2}^{ml}}{{({2}^{m}+1)}^{2l}}\) \(\forall k,k^{\prime}\), and \(\epsilon ({\rho }_{k,k^{\prime} })\) defined in Theorem 2. Combining these results leads to Theorem 2. Moreover, as detailed in the Supplementary Information, Theorem 2 is also valid when O is of the form (16).
Proof of Theorem 1
Let us now provide a sketch of the proof of Theorem 1, case (i). Here we denote for simplicity \({\widehat{O}}_{k}:={\widehat{O}}_{1k}\). We leave the proof of case (ii) to Supplementary Note 7. In this case there are now operators O_{i} which act outside of the forward light-cone \({\mathcal{L}}\) of W. Hence, it is convenient to include in V_{R} not only all the gates in \({\mathcal{L}}\) but also all the blocks in the final layer of V(θ) (i.e., all blocks W_{kL}, with k = 1, …, ξ). We can define \({S}_{\overline{{\mathcal{L}}}}\) as the complement of \({S}_{{\mathcal{L}}}\), i.e., as the subsystem of all qubits which are not in \({\mathcal{L}}\) (with associated Hilbert space \({{\mathcal{H}}}_{\overline{{\mathcal{L}}}}\)). Then, we have \({V}_{\text{R}}={V}_{{\mathcal{L}}}\otimes {V}_{\overline{{\mathcal{L}}}}\) and \(\left|{\boldsymbol{q}}\right\rangle \left\langle {\boldsymbol{p}}\right|=\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}\otimes \left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}\), where we define \({V}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k\in {k}_{\overline{{\mathcal{L}}}}}{W}_{kL}\), \(\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{{\mathcal{L}}}:={\bigotimes }_{k\in {k}_{{\mathcal{L}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k}\), and \(\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}\left|{\boldsymbol{q}}\right\rangle {\left\langle {\boldsymbol{p}}\right|}_{k^{\prime} }\). Here, we define \({k}_{{\mathcal{L}}}:=\{k:{S}_{k}\subseteq {S}_{{\mathcal{L}}}\}\) and \({k}_{\overline{{\mathcal{L}}}}:=\{k:{S}_{k}\subseteq {S}_{\overline{{\mathcal{L}}}}\}\), which are the sets of indices whose associated qubits are inside and outside \({\mathcal{L}}\), respectively.
We also write \(O={c}_{0}{\mathbb{1}}+{c}_{1}{\hat{O}}_{{\mathcal{L}}}\otimes {\hat{O}}_{\overline{{\mathcal{L}}}}\), where we define \({\hat{O}}_{{\mathcal{L}}}:={\bigotimes }_{k\in {k}_{{\mathcal{L}}}}{\widehat{O}}_{k}\) and \({\hat{O}}_{\overline{{\mathcal{L}}}}:={\bigotimes }_{k^{\prime} \in {k}_{\overline{{\mathcal{L}}}}}{\widehat{O}}_{k^{\prime} }\).
Using the fact that the blocks in V(θ) are independent we can now compute \({\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\text{R}}}={\langle {{\Delta }}{{{\Omega }}}_{{\boldsymbol{q}}{\boldsymbol{p}}}^{{\boldsymbol{q}}^{\prime} {\boldsymbol{p}}^{\prime} }\rangle }_{{V}_{\overline{{\mathcal{L}}}},{V}_{{\mathcal{L}}}}\). Then, from the definition of Ω_{pq} in Eq. (34) and the fact that one can always express
with
and where \({{\rm{Tr}}}_{{\mathcal{L}}\cap \overline{w}}\) indicates the partial trace over the Hilbert space associated with the qubits in \({S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}\). As detailed in the Supplementary Information we can use Eq. (38) to show that
On the other hand, as shown in the Supplementary Note 7 (and as schematically depicted in Fig. 8c), when computing the expectation value \({\langle \ldots \rangle }_{{V}_{{\mathcal{L}}}}\) in (47), one obtains
where we defined \({\delta }_{\tau }={\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}})}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}}^{\prime} )}_{{S}_{\overline{\tau }}}}{\delta }_{{({\boldsymbol{p}},{\boldsymbol{q}}^{\prime} )}_{{S}_{\tau }}}{\delta }_{{({\boldsymbol{p}}^{\prime} ,{\boldsymbol{q}})}_{{S}_{\tau }}}\), \({t}_{\tau }\in {\mathbb{R}}\), \({S}_{\tau }\cup {S}_{\overline{\tau }}={S}_{{\mathcal{L}}}\cap {S}_{\overline{w}}\) (with \({S}_{\tau }\,\ne\, {{\emptyset}}\)), and
Here we use the notation \({{\rm{Tr}}}_{{x}_{\tau }}\) to indicate the trace over the Hilbert space associated with subsystem \({S}_{{x}_{\tau }}\), such that \({S}_{{x}_{\tau }}\cup {S}_{{y}_{\tau }}\cup {S}_{{z}_{\tau }}={S}_{{\mathcal{L}}}\). As shown in the Supplementary Note 7, combining the deltas in Eqs. (48) and (49) with \({\left\langle {{\Delta }}{{{\Psi }}}_{{\boldsymbol{p}}{\boldsymbol{q}}}^{{\boldsymbol{p}}^{\prime} {\boldsymbol{q}}^{\prime} }\right\rangle }_{{V}_{{\rm{L}}}}\) leads to Hilbert–Schmidt distances between two quantum states as in (44). One can then use the bounds \({D}_{\text{HS}}\left({\rho }_{1},{\rho }_{2}\right)\le 2\), \({{\Delta }}{O}_{\tau }^{{\mathcal{L}}}\le {\prod }_{k\in {k}_{{\mathcal{L}}}}{r}_{k}^{2}\), and ∑_{τ}t_{τ} ≤ 2, along with some additional simple algebra explained in the Supplementary Information, to obtain the upper bound in Theorem 1.
Ansatz and optimization method
Here we describe the gradient-free optimization method used in our heuristics. First, we note that all the parameters in the ansatz are randomly initialized. Then, at each iteration, one solves the following subspace search problem: \(\mathop{\min }\limits_{{\boldsymbol{s}}\in {{\mathbb{R}}}^{d}}C({\boldsymbol{\theta }}+{\boldsymbol{A}}\cdot {\boldsymbol{s}})\), where A is a randomly generated isometry, and s = (s_{1}, …, s_{d}) is a vector of coefficients to be optimized over. We used d = 10 in our simulations. Moreover, the training algorithm gradually increases the number of shots per cost-function evaluation. Initially, C is evaluated with 10 shots, and once the optimization reaches a plateau, the number of shots is increased by a factor of 3/2. This process is repeated until a termination condition on the value of C is achieved, or until we reach the maximum value of 10^{5} shots per function evaluation. While this is a simple variable-shot approach, we remark that a more advanced variable-shot optimizer can be found in ref. ^{57}.
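The structure of this procedure can be sketched in a few lines of Python. The sketch is not the implementation used for our numerics: the inner minimization over s is replaced by a simple random local search, the cost is a hypothetical noiseless toy function (shot noise is not simulated), and all function names and default values other than d = 10, the initial 10 shots, the factor 3/2, and the cap of 10^{5} shots are assumptions made for illustration.

```python
import math
import random


def random_isometry(n_params, d, rng):
    """Return d orthonormal vectors in R^n_params (the columns of A) via Gram-Schmidt."""
    columns = []
    while len(columns) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
        for u in columns:
            overlap = sum(a * b for a, b in zip(u, v))
            v = [b - overlap * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:
            columns.append([x / norm for x in v])
    return columns


def shift(theta, columns, s):
    """Compute theta + A.s for the isometry whose columns are given."""
    return [t + sum(c[i] * si for c, si in zip(columns, s))
            for i, t in enumerate(theta)]


def subspace_step(cost, theta, d, rng, trials=200, width=0.5):
    """One iteration: draw A, then search over s (random local search as a stand-in)."""
    columns = random_isometry(len(theta), d, rng)
    best_s, best_value = [0.0] * d, cost(theta)
    for _ in range(trials):
        candidate = [b + rng.gauss(0.0, width) for b in best_s]
        value = cost(shift(theta, columns, candidate))
        if value < best_value:
            best_s, best_value = candidate, value
    return shift(theta, columns, best_s), best_value


def shot_schedule(first=10, growth=1.5, maximum=10 ** 5):
    """Shots per evaluation after each plateau, capped at the maximum."""
    shots, schedule = float(first), []
    while shots < maximum:
        schedule.append(round(shots))
        shots *= growth
    schedule.append(maximum)
    return schedule


def toy_cost(theta):
    """Hypothetical noiseless stand-in for C, used only to make the sketch runnable."""
    return 1.0 - sum(math.cos(t / 2) ** 2 for t in theta) / len(theta)


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [rng.uniform(-math.pi, math.pi) for _ in range(20)]
    for _ in range(30):
        theta, value = subspace_step(toy_cost, theta, d=10, rng=rng)
    print(value, shot_schedule()[:5], len(shot_schedule()))
```

With these defaults the shot count can be increased 23 times before the cap of 10^{5} is reached, since 10 × (3/2)^{22} is still below 10^{5}.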
Finally, let us remark that while we employ a subspace search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., number of shots) that grows exponentially with n. The latter is due to the fact that an exponentially vanishing gradient implies that on average the cost function landscape will essentially be flat, with the slope of the order of \({\mathcal{O}}(1/{2}^{n})\). Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost minimization direction with gradient-based, or even with black-box optimizers such as the Nelder–Mead method^{58,59,60,61}.
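To give a rough sense of the numbers involved, the sketch below assumes the standard shot-noise scaling, in which the statistical error of an estimate from N shots is σ/√N with a per-shot standard deviation σ of order one. Resolving a cost difference ε then requires N of order (σ/ε)^{2}, so that ε ~ 1/2^{n} corresponds to N ~ 4^{n}. The function name and the value σ = 1 are illustrative, and prefactors are ignored.

```python
import math


def shots_to_resolve(epsilon, sigma=1.0):
    """Smallest N such that the standard error sigma / sqrt(N) is at most epsilon."""
    return math.ceil((sigma / epsilon) ** 2)


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        print(n, shots_to_resolve(2.0 ** -n))
```

This is an order-of-magnitude estimate under the stated assumptions and is not a substitute for the variance bounds of Theorems 1 and 2.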
Data availability
Data generated and analyzed during the current study are available from the corresponding author upon reasonable request.
References
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).
Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
Hadfield, S. et al. From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12, 34 (2019).
Hastings, M. B. Classical and quantum bounded depth approximation algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019).
Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242 (2017).
Arute, F. et al. Hartree-Fock on a superconducting qubit quantum computer. Science 369, 1084–1089 (2020).
Harrigan, M. P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nat. Phys. 1–5 (2021).
McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).
Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 4213 (2014).
Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quant. Sci. Technol. 2, 045001 (2017).
Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).
Koczor, B., Endo, S., Jones, T., Matsuzaki, Y. & Benjamin, S. C. Variational-state quantum metrology. New J. Phys. 22, 083038 (2020).
Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).
Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at https://arxiv.org/abs/1811.03147 (2019).
Sharma, K., Khatri, S., Cerezo, M. & Coles, P. J. Noise resilience of variational quantum compiling. New J. Phys. 22, 043006 (2020).
Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018).
LaRose, R., Tikku, A., O’Neel-Judy, É., Cincio, L. & Coles, P. J. Variational quantum state diagonalization. npj Quant. Inf. 5, 1–10 (2018).
Bravo-Prieto, C., García-Martín, D. & Latorre, J. Quantum singular value decomposer. Phys. Rev. A 101, 062310 (2020).
Li, Y. & Benjamin, S. C. Efficient variational quantum simulator incorporating active error minimization. Phys. Rev. X 7, 021050 (2017).
Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019).
Cirstoiu, C. et al. Variational fast forwarding for quantum simulation beyond the coherence time. npj Quant. Inf. 6, 1–10 (2020).
Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient quantum dynamics using symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 (2019).
Cerezo, M., Poremba, A., Cincio, L. & Coles, P. J. Variational quantum fidelity estimation. Quantum 4, 248 (2020).
Carolan, J. et al. Variational quantum unsampling on a quantum photonic processor. Nat. Phys. 16, 322–327 (2020).
Arrasmith, A., Cincio, L., Sornborger, A. T., Zurek, W. H. & Coles, P. J. Variational consistent histories as a hybrid algorithm for quantum foundations. Nat. Commun. 10, 3438 (2019).
Bravo-Prieto, C. et al. Variational quantum linear solver. Preprint at https://arxiv.org/abs/1909.05820 (2019).
Xu, X. et al. Variational algorithms for linear algebra. Preprint at https://arxiv.org/abs/1909.03898 (2019).
Huang, H.-Y., Bharti, K. & Rebentrost, P. Near-term quantum algorithms for linear systems of equations. Preprint at https://arxiv.org/abs/1909.07344 (2019).
Cerezo, M. & Coles, P. J. Impact of barren plateaus on the Hessian and higher order derivatives. Preprint at https://arxiv.org/abs/2008.07454 (2020).
Arrasmith, A., Cerezo, M., Czarnik, P., Cincio, L. & Coles, P. J. Effect of barren plateaus on gradient-free optimization. Preprint at https://arxiv.org/abs/2011.12245 (2020).
Grant, E., Wossnig, L., Ostaszewski, M. & Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum 3, 214 (2019).
Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019).
Lee, J., Huggins, W. J., Head-Gordon, M. & Whaley, K. B. Generalized unitary coupled cluster wave functions for quantum computation. J. Chem. Theory Comput. 15, 311–324 (2018).
Verdon, G. et al. Quantum graph neural networks. Preprint at https://arxiv.org/abs/1909.12264 (2019).
IBM Q: Quantum devices and simulators. https://www.research.ibm.com/ibmq/technology/devices/.
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
Dankert, C., Cleve, R., Emerson, J. & Livine, E. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A 80, 012304 (2009).
Brandao, F. G. S. L., Harrow, A. W. & Horodecki, M. Local random quantum circuits are approximate polynomial-designs. Commun. Math. Phys. 346, 397–434 (2016).
Harrow, A. & Mehraban, S. Approximate unitary t-designs by short random quantum circuits using nearest-neighbor and long-range gates. Preprint at https://arxiv.org/abs/1809.06957 (2018).
Wan, K. H., Dahlsten, O., Kristjánsson, H., Gardner, R. & Kim, M. S. Quantum generalisation of feedforward neural networks. npj Quant. Inf. 3, 36 (2017).
Lamata, L. et al. Quantum autoencoders via quantum adders with genetic algorithms. Quant. Sci. Technol. 4, 014007 (2018).
Pepper, A., Tischler, N. & Pryde, G. J. Experimental realization of a quantum autoencoder: the compression of qutrits via machine learning. Phys. Rev. Lett. 122, 060501 (2019).
Verdon, G., Pye, J. & Broughton, M. A universal training algorithm for quantum deep learning. Preprint at https://arxiv.org/abs/1806.09729 (2018).
Pennington, J. & Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (JMLR.org, 2017) pp. 2798–2806, http://proceedings.mlr.press/v70/pennington17a.html.
Tranter, A., Love, P. J., Mintert, F. & Coveney, P. V. A comparison of the Bravyi–Kitaev and Jordan–Wigner transformations for the quantum simulation of quantum chemistry. J. Chem. Theory Comput. 14, 5617–5630 (2018).
Bravyi, S. B. & Kitaev, A. Y. Fermionic quantum computation. Ann. Phys. 298, 210–226 (2002).
Uvarov, A., Biamonte, J. D. & Yudin, D. Variational quantum eigensolver for frustrated quantum systems. Phys. Rev. B 102, 075104 (2020).
Biamonte, J. Universal variational quantum computation. Preprint at https://arxiv.org/abs/1903.04500 (2019).
Sharma, K., Cerezo, M., Cincio, L. & Coles, P. J. Trainability of dissipative perceptron-based quantum neural networks. Preprint at https://arxiv.org/abs/2005.12458 (2020).
Beer, K. et al. Training deep quantum neural networks. Nat. Commun. 11, 1–6 (2020).
Bartlett, R. J. & Musiał, M. Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79, 291 (2007).
Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. (2021). https://iopscience.iop.org/article/10.1088/2058-9565/abd891.
Skolik, A. et al. Layerwise learning for quantum neural networks. Quantum Machine Intelligence 3, 1–11 (2021).
Collins, B. & Śniady, P. Integration with respect to the Haar measure on unitary, orthogonal and symplectic group. Commun. Math. Phys. 264, 773–795 (2006).
Puchała, Z. & Miszczak, J. A. Symbolic integration with respect to the Haar measure on the unitary groups. Bull. Pol. Acad. Sci. Tech. Sci. 65, 21–27 (2017).
Kübler, J. M., Arrasmith, A., Cincio, L. & Coles, P. J. An adaptive optimizer for measurement-frugal variational algorithms. Quantum 4, 263 (2020).
Nelder, J. A. & Mead, R. A simplex method for function minimization. Comput. J. 7, 308–313 (1965).
Paley, R. E. A. C. & Zygmund, A. A note on analytic functions in the unit circle. Math. Proc. Camb. Phil. Soc. 28, 266 (1932).
Fukuda, M., König, R. & Nechita, I. RTNI–a symbolic integrator for Haar-random tensor networks. J. Phys. A 52, 425303 (2019).
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information: 10th Anniversary Edition, 10th ed. (Cambridge University Press, New York, NY, USA, 2011).
Acknowledgements
We thank Jacob Biamonte, Elizabeth Crosson, Burak Sahinoglu, Rolando Somma, Guillaume Verdon, and Kunal Sharma for helpful conversations. All authors were supported by the Laboratory Directed Research and Development (LDRD) program of Los Alamos National Laboratory (LANL) under project numbers 20180628ECR (for M.C.), 20190065DR (for A.S., L.C., and P.J.C.), and 20200677PRD1 (for T.V.). M.C. and A.S. were also supported by the Center for Nonlinear Studies at LANL. P.J.C. acknowledges initial support from the LANL ASC Beyond Moore’s Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.
Author information
Contributions
The project was conceived by M.C., L.C., and P.J.C. The manuscript was written by M.C., A.S., T.V., L.C., and P.J.C. T.V. proved Proposition 1. M.C. and A.S. proved Proposition 2 and Theorems 1–2. M.C., A.S., T.V., and P.J.C. proved Corollaries 1–2. M.C., A.S., T.V., L.C., and P.J.C. analyzed the quantum autoencoder. For the numerical results, T.V. performed the simulation in Fig. 2, and L.C. performed the simulation in Fig. 5.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cerezo, M., Sone, A., Volkoff, T. et al. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat Commun 12, 1791 (2021). https://doi.org/10.1038/s41467-021-21728-w
DOI: https://doi.org/10.1038/s41467-021-21728-w