Barren plateaus in quantum neural network training landscapes

McClean, Jarrod R.; Boixo, Sergio; Smelyanskiy, Vadim N.; Babbush, Ryan; Neven, Hartmut

doi:10.1038/s41467-018-07090-4

Article
Open access
Published: 16 November 2018

Jarrod R. McClean¹,
Sergio Boixo ORCID: orcid.org/0000-0002-1090-7584¹,
Vadim N. Smelyanskiy¹,
Ryan Babbush¹ &
…
Hartmut Neven¹

Nature Communications volume 9, Article number: 4812 (2018)

Abstract

Many experimental proposals for noisy intermediate scale quantum devices involve training a parameterized quantum circuit with a classical optimization loop. Such hybrid quantum-classical algorithms are popular for applications in quantum simulation, optimization, and machine learning. Due to their simplicity and hardware efficiency, random circuits are often proposed as initial guesses for exploring the space of quantum states. We show that the exponential dimension of Hilbert space and the gradient estimation complexity make this choice unsuitable for hybrid quantum-classical algorithms run on more than a few qubits. Specifically, we show that for a wide class of reasonable parameterized quantum circuits, the probability that the gradient along any reasonable direction is non-zero to some fixed precision is exponentially small as a function of the number of qubits. We argue that this is related to the 2-design characteristic of random circuits, and that solutions to this problem must be studied.

Introduction

Rapid developments in quantum hardware have motivated advances in algorithms to run in the so-called noisy intermediate-scale quantum (NISQ) regime¹. Many of the most promising application-oriented approaches are hybrid quantum–classical algorithms that rely on optimization of a parameterized quantum circuit^{2,3,4,5,6,7,8}. The resilience of these approaches to certain types of errors and high flexibility with respect to coherence time and gate requirements make them especially attractive for NISQ implementations^{3,9,10,11}.

The first implementation of such algorithms was developed in the context of quantum simulation with the variational quantum eigensolver^{2,3}. This algorithm has been successfully demonstrated on a number of experimental setups with extensions to excited states and other forms of incoherent error mitigation^{2,9,12,13,14,15,16}. Since then, the quantum approximate optimization algorithm was developed in a similar context to address hard optimization problems^{5,17,18,19}. This algorithm has also been demonstrated on quantum devices²⁰. These approaches have even been extended to both quantum machine learning and error correction^{6,7,20,21,22,23}.

While the precise formulation of these methods and their domains of applicability differ considerably, they typically rely on the optimization of a parameterized unitary circuit with respect to an objective function, usually a simple sum of Pauli operators or the fidelity with respect to some state. This framework is reminiscent of the methodology of classical neural networks^{23,24}. As with any non-linear optimization, the choice of both the parameterization and the initial state is important. In quantum simulation, there is often a choice inspired by physical domain knowledge^{3,17,25,26,27,28,29}. However, in all domains of applicability, there have been implementations that utilize parameterized random circuits of varying depth^{7,13,21,23,30}. Within quantum simulation, that approach has been referred to as a “hardware efficient ansatz”¹³. This is in contrast to previous proposals, such as the variational quantum eigensolver^{2,3,9}, which used parameterized structured circuits inspired by the problem at hand, such as unitary coupled cluster.

When little structure is known about the problem, or constraints of the existing quantum hardware prevent utilizing that structure, choosing a random implementable circuit seems to provide an unbiased choice. One might also expect, based on recent experimental designs for “quantum supremacy”, that random quantum circuits are a powerful tool for such a task³¹. Moreover, despite concerns about gradient-based methods in classical deep neural networks^{32,33,34}, these methods are successful²⁴, even with random initialization^{33,35}. However, in the quantum case one must remember that the cost of estimating even a single gradient component scales as O(1/ε^α) for some small power α³⁶, as opposed to classical implementations where the same is achieved in O(log(1/ε)) time. Here ε is the desired accuracy in the gradient, which is inevitably tied to its magnitude.

We will present results related to random quantum circuits in the context of the exponential dimension of Hilbert space and gradient-based hybrid quantum–classical algorithms. A cartoon depiction of this is given in Fig. 1. We show that for a large class of random circuits, the average value of the gradient of the objective function is zero, and the probability that any given instance of such a random circuit deviates from this average value by a small constant ε is exponentially small in the number of qubits. This can be understood in the geometric context of concentration of measure^{37,38,39} for high-dimensional spaces. When the measure of the space concentrates in this way, the value of any reasonably smooth function lies close to its average with probability exponentially close to one, a fact made formal by Levy’s lemma⁴⁰. In our context, this means that the gradient is zero over vast reaches of quantum space. The region where the gradient is zero does not correspond to local minima of interest, but rather to an exponentially large plateau of states that have exponentially small deviations in the objective value from the average of the totally mixed state. We argue that the depth of circuits that achieve these undesirable properties is modest, requiring only O(n^{1/d}) depth circuits on a d-dimensional array, and numerically evaluate the constant factors one expects to encounter for small instances of this kind. While our results highlight the importance of avoiding random initialization in parametric circuit approaches, they do not discount the value of random quantum circuits in other applications such as information security or demonstrations of quantum supremacy. We close with an outlook on how this result should shape strategies in ansatz design for scaling to larger experiments.

Results

Gradient concentration in random circuits

We will discuss random parameterized quantum circuits (RPQCs)

$$U({\boldsymbol{\theta }}) = U(\theta _1,...,\theta _L) = \mathop {\prod}\limits_{l = 1}^L U_l(\theta _l)W_l,$$

(1)

where U_l(θ_l) = exp(−iθ_lV_l), V_l is a Hermitian operator, and W_l is a generic unitary operator that does not depend on any angle θ_l. Circuits of this form are a natural choice due to a straightforward evaluation of the gradient with respect to most objective functions and have been introduced in a number of contexts already^{26,41}. Consider an objective function E(θ) expressed as the expectation value over some Hermitian operator H,

$$E({\boldsymbol{\theta }}) = \langle 0|U({\boldsymbol{\theta }})^\dagger HU({\boldsymbol{\theta }})|0\rangle .$$

(2)

When the RPQCs are parameterized in this way, the gradient of the objective function takes a simple form:

$$\partial _kE \equiv \frac{{\partial E({\boldsymbol{\theta }})}}{{\partial \theta _k}} = i\left\langle {0\left| {U_ - ^\dagger \left[ {V_k,U_ + ^\dagger HU_ + } \right]U_ - } \right|0} \right\rangle,$$

(3)

where we introduce the notations $U_ - \equiv \mathop {\prod}\nolimits_{l = 1}^{k - 1} U_l(\theta _l)W_l$, $U_ + \equiv \mathop {\prod}\nolimits_{l = k}^L U_l(\theta _l)W_l$, and henceforth drop the subscript k from V_k → V for ease of exposition. Finally, we will define our RPQCs U(θ) to have the property that for any gradient direction ∂_kE defined above, the circuit implementing U(θ) is sufficiently random such that either U₋, U₊, or both match the Haar distribution up to the second moment, and the circuits U₋ and U₊ are independent.
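The commutator form of Eq. (3) can be checked numerically on a toy example. The sketch below is illustrative only and is not the authors' code: it assumes a three-qubit circuit with a single rotated gate exp(−iθV), V = X on the first qubit, H = Z₁Z₂, and two arbitrary fixed unitaries (`U_minus`, `U_rest`, both hypothetical names) standing in for the rest of the circuit. The commutator is evaluated on the state immediately after the rotated gate and can be compared with a finite difference of `energy`.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron_all(ops):
    """Tensor product of a list of single-qubit operators."""
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

def random_unitary(dim, rng):
    # QR of a complex Gaussian matrix; any fixed unitary works for this check.
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(a)
    return q

rng = np.random.default_rng(0)
n = 3
dim = 2 ** n
V = kron_all([X, I2, I2])            # generator of the rotated gate (V^2 = I)
H = kron_all([Z, Z, I2])             # objective operator Z_1 Z_2
U_minus = random_unitary(dim, rng)   # circuit before the rotated gate
U_rest = random_unitary(dim, rng)    # circuit after the rotated gate
zero = np.zeros(dim, dtype=complex)
zero[0] = 1.0

def rotation(theta):
    # exp(-i theta V), valid because V squares to the identity
    return np.cos(theta) * np.eye(dim) - 1j * np.sin(theta) * V

def energy(theta):
    psi = U_rest @ rotation(theta) @ U_minus @ zero
    return float(np.real(np.vdot(psi, H @ psi)))

def commutator_gradient(theta):
    # i <phi| [V, U_rest^dagger H U_rest] |phi>, phi = state after the rotated gate
    phi = rotation(theta) @ U_minus @ zero
    h_plus = U_rest.conj().T @ H @ U_rest
    comm = V @ h_plus - h_plus @ V
    return float(np.real(1j * np.vdot(phi, comm @ phi)))
```

Comparing `commutator_gradient(theta)` with `(energy(theta + h) - energy(theta - h)) / (2 * h)` for a small step `h` gives agreement up to finite-difference error.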

Our results make use of properties of the Haar measure on the unitary group dμ_Haar(U) ≡ dμ(U), which is the unique left- and right-invariant measure such that

$${\int}_{U(N)} {d\mu (U)f(U)} = {\int} d\mu (U)f(VU) = {\int} {d\mu (U)f(UV)},$$

(4)

for any f(U) and V∈U(N), where the integration domain will be implied to be U(N) when not explicitly listed. While this property is valuable for proofs, quantum circuits that exactly achieve this invariance generically require exponential resources. This motivates the concept of unitary t-designs^{42,43,44}, which satisfy the above properties for restricted classes of f(U), often requiring only modest polynomial resources. Suppose {p_i, V_i} is an ensemble of unitary operators, with unitary V_i being sampled with probability p_i. The ensemble {p_i, V_i} is a t-design if

$$\mathop {\sum}\limits_i p_iV_i^{ \otimes t}\rho (V_i^\dagger )^{ \otimes t} = {\int} d\mu (U)U^{ \otimes t}\rho (U^\dagger )^{ \otimes t}.$$

(5)

This definition is equivalent to the property that if f(U) is a polynomial of at most degree t in the matrix elements of U and at most degree t in the matrix elements of U^*, then averaging over the t-design {p_i, V_i} will yield the same result as averaging over the unitary group with respect to the Haar measure.
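As a small illustration of Eq. (5) for t = 1, the sketch below uses the standard fact that the n-qubit Pauli strings, sampled uniformly, form a unitary 1-design. The function name `pauli_twirl` and the random test state are our own choices for illustration. The Haar first-moment average of ρ is Tr(ρ)/N times the identity, and the uniform Pauli average reproduces it exactly. The Pauli strings do not form a 2-design, so this check says nothing about second moments.

```python
import itertools
import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_twirl(op, n):
    """Uniform average of P op P^dagger over all 4**n Pauli strings P."""
    dim = 2 ** n
    total = np.zeros((dim, dim), dtype=complex)
    for combo in itertools.product(PAULIS, repeat=n):
        p = combo[0]
        for factor in combo[1:]:
            p = np.kron(p, factor)
        total += p @ op @ p.conj().T
    return total / 4 ** n

rng = np.random.default_rng(1)
n = 2
dim = 2 ** n

# Random density matrix (positive semidefinite, unit trace) as a test input.
a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
rho = a @ a.conj().T
rho = rho / np.trace(rho)

twirled = pauli_twirl(rho, n)
haar_value = np.trace(rho) / dim * np.eye(dim)   # Haar first-moment average
print(np.allclose(twirled, haar_value))
```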

The average value of the gradient is a concept that requires additional specification because, for a given point, the gradient can only be defined in terms of the circuit that led to that point. We will use a practical definition that leads to the value we are interested in, namely

$$\langle \partial _kE\rangle = {\int} dUp(U)\partial _k\langle 0|U({\boldsymbol{\theta }})^\dagger HU({\boldsymbol{\theta }})|0\rangle,$$

(6)

where p(U) is the probability distribution function of U. A review on the properties of products of independent random matrices can be found in ref.⁴⁵. The assumptions of independence and at least one of U₋ or U₊ forming a 1-design in our RPQCs imply that 〈∂_kE〉 = 0, as shown in the Methods.

Levy’s lemma informs our intuition about the expected variance of this quantity through simple geometric arguments. In particular, Haar random unitaries on n qubits will output states uniformly in the D = 2ⁿ − 1 dimensional hypersphere. The derivative with respect to the parameters θ is Lipschitz continuous with some parameter η that depends on the operator H. Levy’s lemma then implies that the variance of measurements will decrease exponentially in the number of qubits. This intuition may be made more precise through explicit calculation of the variance, which is done in more detail in the Methods. The result to first order is

$${\mathrm{Var}}\left[ {\partial _kE} \right] \approx \left\{ {\begin{array}{*{20}{c}} { - \frac{{{\mathrm{Tr}}\left( {\rho ^2} \right)}}{{\left( {2^{2n} - 1} \right)}}{\mathrm{Tr}}\left\langle {\left[ {V,u^\dagger Hu} \right]^2} \right\rangle _{U_ + }} \\ { - \frac{{{\mathrm{Tr}}\left( {H^2} \right)}}{{\left( {2^{2n} - 1} \right)}}{\mathrm{Tr}}\left\langle {\left[ {V,u\rho u^\dagger } \right]^2} \right\rangle _{U_ - }} \\ {\frac{1}{{2^{\left( {3n - 1} \right)}}}\,{\mathrm{Tr}}\left( {H^2} \right){\mathrm{Tr}}\left( {\rho ^2} \right){\mathrm{Tr}}\left( {V^2} \right)} \end{array}} \right.$$

(7)

where the notation $\langle f(u)\rangle _{U_x}$ indicates the average with u drawn from p(U_x), and the first case corresponds to U₋ being a 2-design and not U₊, the second to U₊ being a 2-design but not U₋, and the third to both U₊ and U₋ being 2-designs. We emphasize the fact that this variance depends at most on polynomials of degree 2 in U and polynomials of degree 2 in U^*. Whereas a unitary 2-design will exhibit the correct variance^43,46, a unitary 1-design will exhibit the correct average value, but not necessarily the variance. As a result, if a circuit is of sufficient depth that for any ∂_kE, either U₋ or U₊ forms a 2-design, then with high probability one will produce an ansatz state on a barren plateau of the quantum landscape, with no interesting search directions in sight.

From these results, it is clear that only one of U₊ or U₋ needs to be sufficiently random to poison the gradient for the remainder of the circuit. For example, while it is somewhat unintuitive, even the first element of a circuit, k = 1, will have a vanishing gradient due to the circuit following it, U₊. Additionally, we see that there is no detailed dependence on the structure of the V_k, other than the rate at which they help randomize the circuit, which determines the depth at which one expects to find an approximate 2-design.
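As a rough worked instance of the third case of Eq. (7), suppose (these are illustrative assumptions, not part of the derivation) that H is a single Pauli string on n qubits, ρ is a pure state, and V = P/2 for a single-qubit Pauli P, so that exp(−iθV) is a standard rotation gate. Then H² = I and V² = I/4, which gives

$$\mathrm{Tr}(H^2) = 2^n,\quad \mathrm{Tr}(\rho^2) = 1,\quad \mathrm{Tr}(V^2) = 2^{n-2} \quad\Rightarrow\quad \mathrm{Var}[\partial_kE] \approx \frac{2^n \cdot 1 \cdot 2^{n-2}}{2^{3n-1}} = 2^{-(n+1)}.$$

Under these assumptions the typical gradient magnitude, the square root of the variance, is about 2^{−5.5} ≈ 0.022 for n = 10 and about 2^{−10.5} ≈ 6.9 × 10⁻⁴ for n = 20.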

Numerical simulations

The previous section shows that for reasonable classes of RPQCs at a sufficient number of qubits and depth, one will end up on a barren plateau. Here we verify this result for even modest depth one-dimensional (1D) random circuits with numerical simulations. This helps to clarify the rate of concentration for realistic circuits and shows the transition as the circuit grows in length from a single layer to a circuit demonstrating statistics analogous to a 2-design.

The circuits and objective functions used in our numerical experiments begin with a layer of R_Y(π/4) = exp(−iπ/8 Y) gates to prevent X, Y, or Z from being an especially preferential direction with respect to gradients. Then, the circuit proceeds by a number of layers. Each layer consists of a parallel application of single qubit rotations to all qubits, given by R_P(θ) where P∈{X, Y, Z} is chosen with uniform probability and θ∈[0, 2π) is also chosen uniformly. This layer is followed by a layer of 1D nearest neighbor controlled phase gates, as in Fig. 2. Thus, the number of angles is the number of qubits times the number of layers.

The objective operator H is chosen to be a single Pauli ZZ operator acting on the first and second qubits, H = Z₁Z₂. The gradient is evaluated with respect to the first parameter, θ_1,1. This simple choice helps to extract the exponential scaling. As complex objectives can be written as sums of these operators, the results for large objectives can be inferred from these numbers. Moreover, it is clear that for any polynomial sum of these operators, the exponential decay of the signal in the gradient will not be circumvented.
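A minimal statevector sketch of this setup is given below. It is our own illustration and not the code used to produce the figures: the function names, the sample counts, and the use of the parameter-shift identity to obtain the gradient are assumptions. For a gate exp(−iθP/2) with P a Pauli operator, the parameter-shift identity ∂E/∂θ = [E(θ + π/2) − E(θ − π/2)]/2 is exact, which avoids choosing a finite-difference step.

```python
import numpy as np

PAULI = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def rot(axis, theta):
    # R_P(theta) = exp(-i theta P / 2)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * PAULI[axis]

def apply_1q(state, gate, q):
    # state has shape (2,)*n; contract the gate with axis q, then restore order
    state = np.tensordot(gate, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q)

def apply_cz(state, q1, q2):
    # controlled phase: flip the sign where both qubits are |1>
    state = state.copy()
    idx = [slice(None)] * state.ndim
    idx[q1] = 1
    idx[q2] = 1
    state[tuple(idx)] *= -1
    return state

def energy(n, axes, thetas):
    """<Z_1 Z_2> after the circuit; axes and thetas have shape (layers, n)."""
    state = np.zeros((2,) * n, dtype=complex)
    state[(0,) * n] = 1.0
    for q in range(n):                      # initial R_Y(pi/4) layer
        state = apply_1q(state, rot("Y", np.pi / 4), q)
    for layer_axes, layer_thetas in zip(axes, thetas):
        for q in range(n):                  # random single-qubit rotations
            state = apply_1q(state, rot(layer_axes[q], layer_thetas[q]), q)
        for q in range(n - 1):              # 1D nearest-neighbor CZ ladder
            state = apply_cz(state, q, q + 1)
    probs = np.abs(state) ** 2
    # Z_1 Z_2 is +1 when the first two bits agree and -1 otherwise
    sign = np.array([[1.0, -1.0], [-1.0, 1.0]]).reshape((2, 2) + (1,) * (n - 2))
    return float(np.sum(probs * sign))

def gradient_first_param(n, axes, thetas):
    # parameter-shift rule applied to theta_{1,1}
    plus, minus = thetas.copy(), thetas.copy()
    plus[0, 0] += np.pi / 2
    minus[0, 0] -= np.pi / 2
    return 0.5 * (energy(n, axes, plus) - energy(n, axes, minus))

def gradient_variance(n, layers, samples, seed=0):
    """Sample variance of the gradient over random circuits (n >= 2)."""
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(samples):
        axes = rng.choice(list("XYZ"), size=(layers, n))
        thetas = rng.uniform(0, 2 * np.pi, size=(layers, n))
        grads.append(gradient_first_param(n, axes, thetas))
    return float(np.var(grads))
```

Calling `gradient_variance(n, 10 * n, samples)` for a few small even values of n gives a quick, noisy view of how the variance changes with qubit number. The memory cost grows as 2ⁿ, so this sketch is only practical for small n.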

From Fig. 3 we see that for a single 2-local Pauli term, both the expected value of the gradient and its spread decay exponentially as a function of the number of qubits even when the number of layers is a modest linear function. Empirically for our linear connectivity, we see that the required number of layers is about 10n where n is the number of qubits, following the expected scaling of O(n^{1/d}) where d is the dimension of the connectivity. For empirical reference, the expected gate depth in a chemistry ansatz such as unitary coupled cluster is at least O(n³), meaning that if the initial parameters were randomized, this effect could be expected on fewer than 10 orbitals, a truly small problem in chemical terms. We also observe in Fig. 4 that as the number of layers increases, there is a transition to a 2-design where the variance converges. This leads to a distinct plateau as the circuit length increases, where the height of the plateau is determined by the number of qubits. An additional example with an objective function defined by projection on a target state is provided as Supplementary Figures 1 and 2, showing the rapid decay of variance and similar plateaus as a function of circuit length. These results substantiate our conclusion that gradients in modest-sized random circuits tend to vanish without additional mitigating steps.

Contrast with gradients in classical deep networks

Finally, we contrast our results with the vanishing (and exploding) gradient problem of classical deep neural networks^{32,33,34,47}. At least two key differences are present in the quantum case: (i) the different scaling of the vanishing gradient and (ii) the complexity of computing expected values.

The gradient in a classical deep neural network can vanish exponentially in the number of layers^{32,33}, while in a quantum circuit the gradient may vanish exponentially in the number of qubits, as shown above. In the classical case, the gradient for a weight in a neuron depends on the sum of all the paths connecting that neuron to the output, and when the weights are initialized with random values the paths have random signs, which cancel the signal³². The number of paths is exponential in the number of layers. In the quantum case, the number of paths is exponential in the number of gates, and the paths also have random signs³¹. The gradient saturates to an exponential in the number of qubits because the output state is normalized.

The estimation of the gradient for each training batch for a classical neural network is limited by machine precision and scales with O(log(1/ε)). Even if the gradient is small, as long as it is consistent enough between batches, the method may eventually succeed. On a quantum device, the cost of estimating the gradient scales as O(1/ε^α)³⁶. For any number of measurements much lower than 1/||g||^α, where ||g|| is the norm of the gradient, a gradient-based optimization will result in a random walk. By concentration of measure, a random walk will have exponentially small probability of exiting the barren plateau. As a result, gradient descent without some additional strategy cannot circumvent this challenge on a quantum device in polynomial time.
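To make the measurement-cost argument concrete, the sketch below tabulates a rough shot count of order 1/||g||^α. Everything numeric here is an assumption for illustration: α = 2 corresponds to plain repeated sampling, the prefactor is set to one, and the gradient magnitude is taken to fall as 2^{−n/2}, a toy exponential scaling rather than a measured value.

```python
import math

def shots_to_resolve(gradient_norm, alpha=2.0, constant=1.0):
    """Rough number of measurements needed to resolve a gradient of this size.

    Assumes cost ~ constant / gradient_norm**alpha; alpha and constant are
    illustrative choices, not values fixed by the text.
    """
    return math.ceil(constant / gradient_norm ** alpha)

def illustrative_gradient_norm(n_qubits):
    # Assumed toy scaling: gradient magnitude shrinks as 2**(-n/2).
    return 2.0 ** (-n_qubits / 2)

for n in (10, 20, 30, 40):
    shots = shots_to_resolve(illustrative_gradient_norm(n))
    print(f"n = {n:2d} qubits: about {shots:.3e} shots per gradient component")
```

With these assumptions the shot count doubles with every added qubit, which is the sense in which a fixed polynomial measurement budget leaves the optimizer performing a random walk.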

Discussion

We have seen both analytically and numerically that for a wide class of random quantum circuits, the expected values of observables concentrate to their averages over Hilbert space and gradients concentrate to zero. This represents an interesting statement about the geometry of quantum circuits and landscapes related to hybrid quantum–classical algorithms. More practically, it means that randomly initialized circuits of sufficient depth will find relatively little utility in hybrid quantum–classical algorithms.

Historically, vanishing gradients may have played a role in the early winter of deep neural networks^{32,34,47}. However, multiple techniques have been proposed to mitigate this problem^{24,35,48,49}, and the amount of training data and computational power available has grown substantially. One approach to avoid these landscapes in the quantum setting is to use structured initial guesses, such as those adopted in quantum simulation. Another possibility is to use pre-training segment by segment, which was an early success in the classical setting^{48,50}. These or other alternatives must be studied if these ansatze are to be successful beyond a few qubits.

Methods

We explicitly show that the expectation value of the gradient is 0 and that under our assumptions the variance decays exponentially in the number of qubits. By our definition of RPQCs, we have that for any specified direction ∂_kE, U₋ and U₊ are independently distributed and either U₋ or U₊ matches the Haar distribution up to at least the second moment (it is a 2-design). The assumption of independence is equivalent to

$$\begin{array}{*{20}{l}} {p(U)} \hfill & = \hfill & {{\int} dU_ + p(U_ + ){\int} dU_ - p(U_ - )} \hfill \\ {} \hfill & {} \hfill & { \times \delta (U_ + U_ - - U),} \hfill \end{array}$$

(8)

which allows us to rewrite the expression as

$$\langle \partial _kE\rangle = i{\int} dU_ - p(U_ - ){\mathrm{Tr}}\left\{ {\rho _ - \times {\int} dU_ + p(U_ + )\left[ {V,U_ + ^\dagger HU_ + } \right]} \right\}.$$

(9)

We will utilize explicit integration over the unitary group with respect to the Haar measure, which up to the first moment can be expressed as⁵¹

$${\int} d\mu (U)\,U_{ij}U_{km}^\dagger = {\int} d\mu (U)U_{ij}U_{mk}^ \ast = \frac{{\delta _{im}\delta _{jk}}}{N},$$

(10)

where N is the dimension of the space, typically 2ⁿ for n qubits. Using this expression, one may readily verify that

$$M = {\int} d\mu (U)UOU^\dagger = \frac{{{\mathrm{Tr}}O}}{N}I,$$

(11)

which we use in the following. Now, making use of the assumption that either U₊ or U₋ matches the Haar measure up to the first moment (it is a 1-design), we first examine the case where U₋ is at least a 1-design and find that

$$\begin{array}{*{20}{l}} {\langle \partial _kE\rangle } \hfill & = \hfill & {i{\int} d\mu \left( {U_ - } \right){\mathrm{Tr}}\left\{ {\rho _ - \times \left[ {V,{\int} dU_ + p(U_ + )U_ + ^\dagger HU_ + } \right]} \right\}} \hfill \\ {} \hfill & = \hfill & {\frac{i}{N}{\mathrm{Tr}}\left\{ {\left[ {V,{\int} dU_ + p(U_ + )U_ + ^\dagger HU_ + } \right]} \right\}} \hfill \\ {} \hfill & = \hfill & 0 \hfill \end{array},$$

(12)

where we have defined $\rho _ - = U_ - |0\rangle \langle 0|U_ - ^\dagger$ and used the fact that the trace of a commutator of trace class operators is zero. In the second case, where we assume U₊ is at least a 1-design,

$$\begin{array}{*{20}{l}} {\langle \partial _kE\rangle } \hfill & = \hfill & {i{\int} dU_ - p(U_ - ){\mathrm{Tr}}\left\{ {\rho _ - {\int} d\mu \left( {U_ + } \right)\left[ {V,U_ + ^\dagger HU_ + } \right]} \right\}} \hfill \\ {} \hfill & = \hfill & {i\frac{{{\mathrm{Tr}}H}}{N}{\int} dU_ - p(U_ - ){\mathrm{Tr}}\left\{ {\rho _ - \left[ {V,I} \right]} \right\}} \hfill \\ {} \hfill & = \hfill & {0.} \hfill \end{array}$$

(13)

An advantage of the explicit polynomial formulas is that they allow an analytic calculation of the variance as well, which allows precise specification of the coefficient in Levy’s lemma. In cases where the integrals depend on up to two powers of elements of U and U^*, one may make use of the elementwise formula⁵¹

$$\begin{array}{*{20}{l}} {{\int} d\mu (U)U_{i_1j_1}U_{i_2j_2}U_{i_1^\prime j_1^\prime }^ \ast U_{i_2^\prime j_2^\prime }^ \ast } \hfill & = \hfill & {\frac{{\delta _{i_1i_1^\prime }\delta _{i_2i_2^\prime }\delta _{j_1j_1^\prime }\delta _{j_2j_2^\prime } + \delta _{i_1i_2^\prime }\delta _{i_2i_1^\prime }\delta _{j_1j_2^\prime }\delta _{j_2j_1^\prime }}}{{N^2 - 1}}} \hfill \\ {} \hfill & {} \hfill & { - \frac{{\delta _{i_1i_1^\prime }\delta _{i_2i_2^\prime }\delta _{j_1j_2^\prime }\delta _{j_2j_1^\prime } + \delta _{i_1i_2^\prime }\delta _{i_2i_1^\prime }\delta _{j_1j_1^\prime }\delta _{j_2j_2^\prime }}}{{N(N^2 - 1)}}} \hfill \end{array}.$$

(14)

The variance of the gradient is defined by

$${\mathrm{Var}}[\partial _kE] = \langle (\partial _kE)^2\rangle,$$

(15)

as we have seen above that 〈∂_kE〉 = 0. Through use of the above formula for integration up to the second moment of the Haar distribution, one may evaluate this expression in 3 separate cases. For simplicity and relevance, we evaluate them in the asymptotic case including only the dominant contribution as determined by the inverse dimension.

In the case where U₋ is a 2-design but not U₊,

$${\mathrm{Var}}[\partial _kE] \approx \frac{{2{\mathrm{Tr}}(\rho ^2)}}{{N^2 - 1}}{\mathrm{Tr}}\left\langle {H_u^2V^2 - (H_uV)^2} \right\rangle _{U_ + } = - \frac{{{\mathrm{Tr}}(\rho ^2)}}{{2^{2n} - 1}}{\mathrm{Tr}}\left\langle {\left[ {V,H_u} \right]^2} \right\rangle _{U_ + },$$

(16)

where $H_u = u^\dagger Hu$ and we have defined the notation $\langle f(u)\rangle _{U_x}$ to mean the average over u sampled from p(U_x). In the case where U₊ is a 2-design but not U₋,

$${\mathrm{Var}}[\partial _kE] \approx \frac{{2{\mathrm{Tr}}(H^2)}}{{N^2 - 1}}{\mathrm{Tr}}\left\langle {\rho _u^2V^2 - (\rho _uV)^2} \right\rangle _{U_ - } = - \frac{{{\mathrm{Tr}}(H^2)}}{{2^{2n} - 1}}{\mathrm{Tr}}\left\langle {\left[ {V,\rho _u} \right]^2} \right\rangle _{U_ - },$$

(17)

where $\rho _u = u\rho u^\dagger$. Finally, in the case where both U₊ and U₋ are 2-designs,

$${\mathrm{Var}}[\partial _kE] \approx \frac{1}{{2^{(3n - 1)}}}{\mathrm{Tr}}\left( {H^2} \right){\mathrm{Tr}}\left( {\rho ^2} \right){\mathrm{Tr}}\left( {V^2} \right).$$

(18)

In all cases, the exponential decay of the gradient as a function of the number of qubits is evident.

Data availability

Data used to generate the above figures are available upon request from the authors.

References

Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Article Google Scholar
Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 1 (2014).
Article Google Scholar
McClean, J. R., Romero, J., Babbush, R. & Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. 18, 023023 (2016).
Article ADS Google Scholar
Yung, M.-H. et al. From transistor to trappedion computers for quantum chemistry. Sci. Rep. 4, 9 (2014).
Google Scholar
Farhi, E. Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).
Cao, Y., Giacomo Guerreschi, G. & Aspuru-Guzik, A. Quantum neuron: an elementary building block for machine learning on quantum computers. Preprint at https://arxiv.org/abs/1711.11240 (2017).
Hempel, C. et al. Quantum chemistry calculations on a trappedion quantum simulator. Phys. Rev. X 8, 031022 (2018).
Google Scholar
O’alley, P. J. J. et al. Scalable quantum simulation of molecular energies. Phys. Rev. X 6, 31007 (2016).
Google Scholar
McClean, J. R., Schwartz, M. E., Carter, J. & de Jong, W. A. Hybrid quantum-classical hierarchy for mitigation of decoherence and determination of excited states. Phys. Rev. A. 95, 42308 (2017).
Article ADS Google Scholar
Wecker, D., Hastings, M. B. & Troyer, M. Progress towards practical quantum variational algorithms. Phys. Rev. A 92, 42303 (2015).
Article ADS Google Scholar
Shen, Y. et al. Quantum implementation of unitary coupled cluster for simulating molecular electronic structure. Phys. Rev. A 95, 020501(R) (2017).
Article ADS Google Scholar
Kandala, A. et al. Hardware-efficient quantum optimizer for small molecules and quantum magnets. Nature 549, 242 (2017).
Article ADS CAS Google Scholar
Colless, J. I. et al. Computation of molecular spectra on a quantum processor with an error-resilient algorithm. Phys. Rev. X 8, 011021 (2018).
Google Scholar
Santagati, R. et al. Witnessing eigenstates for quantum simulation of hamiltonian spectra. Sci. Adv. 4, 1 (2018).
Article Google Scholar
Dumitrescu, E. F. et al. Cloud quantum computing of an atomic nucleus. Phys. Rev. Lett. 120, 210501 (2018).
Article ADS CAS Google Scholar
Wecker, D., Hastings, M. B. & Troyer, M. Training a quantum optimizer. Phys. Rev. A 94, 022309 (2016).
Article ADS Google Scholar
Wang, Z., Hadfield, S., Jiang, Z. & Rieffel, E. G. Quantum approximate optimization algorithm for maxcut: a fermionic view. Phys. Rev. A 97, 022304 (2018).
Article ADS Google Scholar
Moll, N. et al. Quantum optimization using variational algorithms on nearterm quantum devices. Quantum Sci. Technol. 3, 030503 (2018).
Article ADS Google Scholar
Otterbach, J. S. et al. Unsupervised machine learning on a hybrid quantum computer. Preprint at https://arxiv.org/abs/1712.05771v1 (2017).
Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quantum Sci. Technol. 2, 045001 (2017).
Article ADS Google Scholar
Biamonte, J. et al. Quantum machine learning. Nature 549, 195 (2017).
Article ADS CAS Google Scholar
Farhi,E. & Neven, H. Classification with quantum neural networks on near term processors. Preprint at https://arxiv.org/abs/1802.06002 (2018).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).
Article ADS CAS Google Scholar
McClean, J. R., Babbush, R., Love, P. J. & Aspuru-Guzik, A. Exploiting locality in quantum computation for quantum chemistry. J. Phys. Chem. Lett. 5, 4368 (2014).
Article CAS Google Scholar
Romero, J. et al. Strategies for quantum computing molecular energies using the unitary coupled cluster ansatz. Quant. Sci. Technol. 4, 1 (2018).
Article ADS Google Scholar
Babbush, R. et al. Low-depth quantum simulation of materials. Phys. Rev. X 8, 011044 (2018).
Google Scholar
Rubin, N. C., Babbush, R. & McClean, J. Application of fermionic marginal constraints to hybrid quantum algorithms. New J. Phys. 20, 053020 (2018).
Article ADS Google Scholar
Kivlichan, I. D. et al. Quantum simulation of electronic structure with linear depth and connectivity. Phys. Rev. Lett. 120, 110501 (2018).
Article ADS MathSciNet Google Scholar
Farhi, E., Goldstone, J., Gutmann, S. & Neven, H. Quantum algorithms for fixed qubit architectures. Preprint at http://arxiv.org/abs/1703.06199 (2017).
Boixo, S. et al. Characterizing quantum supremacy in nearterm devices. Nat. Phys. 14, 595 (2018).
Article CAS Google Scholar
Bradley, D. M. Learning in Modular Systems (Carnegie Mellon University, Pittsburgh, 2010).
Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR. AISTATS. (eds.) Yee Whye Teh, Mike Titterington. 249–256 (2010)
Shalev-Shwartz, S., Shamir, O., & Shammah, S. Failures of gradient-based deep learning. Preprint at https://arxiv.org/abs/1703.07950 (2017).
Ioffe S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, PMLR. ICML. (eds.) Francis Bach, David Blei. 448–456 (2015)
Knill, E., Ortiz, G. & Somma, R. D. Optimal quantum measurements of expectation values of observables. Phys. Rev. A 75, 012328 (2007).
Article ADS Google Scholar
Popescu, S., Short, A. J. & Winter, A. Entanglement and the foundations of statistical mechanics. Nat. Phys. 2, 754 (2006).
Article CAS Google Scholar
Bremner, M. J., Mora, C. & Winter, A. Are random pure states useful for quantum computation? Phys. Rev. Lett. 102, 190502 (2009).
Article ADS MathSciNet Google Scholar
Gross, D., Flammia, S. T. & Eisert, J. Most quantum states are too entangled to be useful as computational resources. Phys. Rev. Lett. 102, 190501 (2009).
Article ADS MathSciNet CAS Google Scholar
Ledoux, M. The Concentration of Measure Phenomenon (American Mathematical Society, Providence, 2005).
Guerreschi, G. G. & Smelyanskiy, M. Practical optimization for hybrid quantum-classical algorithms. Preprint at https://arxiv.org/abs/1701.01450 (2017).
Renes, J. M., Blume-Kohout, R., Scott, A. J. & Caves, C. M. Symmetric informationally complete quantum measurements. J. Math. Phys. 45, 2171 (2004).
Article ADS MathSciNet Google Scholar
Dankert, C., Cleve, R., Emerson, J. & Livine, E. Exact and approximate unitary 2-designs and their application to fidelity estimation. Phys. Rev. A 80, 012304 (2009).
Article ADS Google Scholar
Harrow, A. W. & Low, R. A. Random quantum circuits are approximate 2-designs. Commun. Math. Phys. 291, 257 (2009).
Article ADS MathSciNet Google Scholar
Ipsen, J. R. Products of independent Gaussian random matrices. Preprint at https://arxiv.org/abs/1510.06128 (2015).
Roberts, D. A. & Yoshida, B. Chaos and complexity by design. J. High. Energy Phys. 2017, 121 (2017).
Article MathSciNet Google Scholar
Hochreiter, S., Bengio, Y., Frasconi, P. & Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, Chapter 14, (eds.) S. C. Kremer and J. F. Kolen. (IEEE Press Piscataway, NJ 2001) https://www.amazon.com/Field-Guide-Dynamical-Recurrent-Networks/dp/0780353692
Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527 (2006).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (Eds.) Raman, B., Kumar, S., Roy, P.P., Sen, D., Las Vegas, NV, United States, 770–778 (2016).
Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (NIPS'06). (eds.) B. Schölkopf, J. C. Platt, T. Hoffman. 153–160 (MIT Press, Canada 2007).
Puchała, Z. & Miszczak, J. A. Symbolic integration with respect to the Haar measure on the unitary groups. Bull. Pol. Acad. Sci. Tech. Sci. 65, 21 (2017).

Acknowledgements

The authors thank Craig Gidney for helpful comments on the manuscript.

Author information

Authors and Affiliations

Google Inc., 340 Main Street, Venice, CA, 90291, USA
Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush & Hartmut Neven

Contributions

J.R.M., S.B., V.N.S., R.B., and H.N. contributed to the formulation of ideas, calculations, and writing of the manuscript.

Corresponding authors

Correspondence to Jarrod R. McClean, Sergio Boixo or Vadim N. Smelyanskiy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

McClean, J.R., Boixo, S., Smelyanskiy, V.N. et al. Barren plateaus in quantum neural network training landscapes. Nat Commun 9, 4812 (2018). https://doi.org/10.1038/s41467-018-07090-4

Received: 23 May 2018
Accepted: 09 October 2018
Published: 16 November 2018
DOI: https://doi.org/10.1038/s41467-018-07090-4

This article is cited by

Understanding quantum machine learning also requires rethinking generalization
- Elies Gil-Fuster
- Jens Eisert
- Carlos Bravo-Prieto
Nature Communications (2024)
Analyzing variational quantum landscapes with information content
- Adrián Pérez-Salinas
- Hao Wang
- Xavier Bonet-Monroig
npj Quantum Information (2024)
Theoretical guarantees for permutation-equivariant quantum neural networks
- Louis Schatzki
- Martín Larocca
- M. Cerezo
npj Quantum Information (2024)
Quantum approximate optimization via learning-based adaptive optimization
- Lixue Cheng
- Yu-Qin Chen
- Shengyu Zhang
Communications Physics (2024)
Enhancing detection of topological order by local error correction
- Iris Cong
- Nishad Maskara
- Mikhail D. Lukin
Nature Communications (2024)
