## Introduction

The past decade has seen a fast development of quantum technologies and the achievement of an unprecedented level of control in quantum hardware1, clearing the way for demonstrations of quantum computing applications for practical uses. However, near-term applications face some of the limitations intrinsic to the current generation of quantum computers, often referred to as Noisy Intermediate-Scale Quantum (NISQ) hardware2. In this regime, a limited qubit count and absence of quantum error correction constrain the kind of applications that can be successfully realized. Despite these limitations, hybrid classical-quantum algorithms3,4,5,6 have been identified as the ideal candidates to assess the first possible advantage of quantum computing in practical applications7,8,9,10.

The Quantum Approximate Optimization Algorithm (QAOA)5 is a notable example of variational quantum algorithm with prospects of quantum speedup on near-term devices. Devised to take advantage of quantum effects to solve combinatorial optimization problems, it has been extensively theoretically characterized11,12,13,14,15,16, and also experimentally realized on state-of-the-art NISQ hardware17. While the general presence of quantum advantage in quantum optimization algorithms remains an open question18,19,20,21, QAOA has gained popularity as a quantum hardware benchmark22,23,24,25. As its desired output is essentially a classical state, the question arises whether a specialized classical algorithm can efficiently simulate it26, at least near the variational optimum. In this paper, we use a variational parametrization of the many-qubit state based on Neural Network Quantum States (NQS)27 and extend the method of ref. 28 to simulate QAOA. This approach trades the need for exact brute force exponentially scaling classical simulation with an approximate, yet accurate, classical variational description of the quantum circuit. In turn, we obtain an heuristic classical method that can significantly expand the possibilities to simulate NISQ-era quantum optimization algorithms. We successfully simulate the Max-Cut QAOA circuit5,11,17 for 54 qubits at depth p = 4 and use the method to perform a variational parameter sweep on a 1D cut of the parameter space. The method is contrasted with state-of-the-art classical simulations based on low-rank Clifford group decompositions26, whose complexity is exponential in the number of non-Clifford gates as well as tensor-based approaches29. Instead, limitations of the approach are discussed in terms of the QAOA parameter space and its relation to different initializations of the stochastic optimization method used in this work.

## Results

### The Quantum Approximate Optimization Algorithm

The Quantum Approximate Optimization Algorithm (QAOA) is a variational quantum algorithm for approximately solving discrete combinatorial optimization problems. Since its inception in the seminal work of Farhi, Goldstone, and Gutmann5,12, QAOA has been applied to Maximum Cut (Max-Cut) problems. With competing classical algorithms30 offering exact performance bounds for all graphs, an open question remains—can QAOA perform better by increasing the number of free parameters?

In this work, we study a quadratic cost function31,32 associated with a Max-Cut problem. If we consider a graph G = (V, E) with edges E and vertices V, the Max-Cut of the graph G is defined by the following operator:

$${\mathcal{C}}=\mathop{\sum}\limits_{i,j\in E}{w}_{ij}{Z}_{i}{Z}_{j}\!,$$
(1)

where wij are the edge weights and Zi are Pauli operators. The classical bitstring $${\mathcal{B}}$$ that minimizes $$\left\langle {\mathcal{B}}\right|{\mathcal{C}}\left|{\mathcal{B}}\right\rangle$$ is the graph partition with the maximum cut. QAOA approximates such a quantum state through a quantum circuit of predefined depth p:

$$\left|{\boldsymbol{\gamma }},{\boldsymbol{\beta }}\right\rangle ={U}_{B}({\beta }_{p}){U}_{C}({\gamma }_{p})\cdots {U}_{B}({\beta }_{1}){U}_{C}({\gamma }_{1})\left|+\right\rangle \!,$$
(2)

where $$\left|+\right\rangle$$ is a symmetric superposition of all computational basis states: $$\left|+\right\rangle ={H}^{\otimes N}{\left|0\right\rangle }^{\otimes N}$$ for N qubits. The set of 2p real numbers γi and βi for i = 1…p define the variational parameters to be optimized over by an external classical optimizer. The unitary gates defining the parametrized quantum circuit read $${U}_{B}(\beta )={\prod }_{i\in V}{e}^{-i\beta {X}_{i}}$$ and $${U}_{C}(\gamma )={e}^{-i\gamma {\mathcal{C}}}$$.

Optimal variational parameters γ and β are then found through an outer-loop classical optimizer of the following quantum expectation value:

$$C({\boldsymbol{\gamma }},{\boldsymbol{\beta }})=\left\langle {\boldsymbol{\gamma }},{\boldsymbol{\beta }}\right|{\mathcal{C}}\left|{\boldsymbol{\gamma }},{\boldsymbol{\beta }}\right\rangle$$
(3)

It is known that, for QAOA cost operators of the general form $${\mathcal{C}}={\sum }_{k}{{\mathcal{C}}}_{k}({Z}_{1},\ldots ,{Z}_{N})$$, the optimal value asymptotically converges to the minimum value:

$$\mathop{\mathrm{lim}}\limits_{p\to \infty }{C}_{p}=\mathop{\min }\limits_{{\mathcal{B}}}\left\langle {\mathcal{B}}\right|{\mathcal{C}}\left|{\mathcal{B}}\right\rangle$$
(4)

where Cp is the optimal cost value at QAOA depth p and $${\mathcal{B}}$$ are classical bit strings. With modern simulations and implementations still being restricted to lower p-values, it is unclear how large p has to get in practice before QAOA becomes comparable with its classical competition.

In this work we consider 3-regular graphs with all weights wij set to unity at QAOA depths of p = 1, 2, 4.

### Classical variational simulation

Consider a quantum system consisting of N qubits. The Hilbert space is spanned by the computational basis $$\{\left|{\mathcal{B}}\right\rangle :{\mathcal{B}}\in {\{0,1\}}^{N}\}$$ of classical bit strings $${\mathcal{B}}=({B}_{1},\ldots ,{B}_{N})$$. A general state can be expanded in this basis as $$\left|\psi \right\rangle ={\sum }_{{\mathcal{B}}}\psi ({\mathcal{B}})\left|{\mathcal{B}}\right\rangle$$. The convention $${Z}_{i}\left|{\mathcal{B}}\right\rangle ={(-1)}^{{B}_{i}}\left|{\mathcal{B}}\right\rangle$$ is adopted. In order to perform approximate classical simulations of the QAOA quantum circuit, we use a neural-network representation of the many-body wavefunction $$\psi ({\mathcal{B}})$$ associated with this system, and specifically adopt a shallow network of the Restricted Boltzmann Machine (RBM) type33,34,35:

$$\begin{array}{lll}&&\psi ({\mathcal{B}})\approx {\psi }_{\theta }({\mathcal{B}})\equiv \exp \left(\mathop{\sum }\limits_{j=1}^{N}{a}_{j}{B}_{j}\right) \cdot \mathop{\prod }\limits_{k=1}^{{N}_{\text{h}}}\left[1+\exp \left({b}_{k}+\mathop{\sum }\limits_{j=1}^{N}{W}_{jk}{B}_{j}\right)\right].\end{array}$$
(5)

The RBM provides a classical variational representation of the quantum state27,36. It is parametrized by a set of complex parameters θ = {a, b, W}—visible biases a = (a1, …, aN), hidden biases $${\bf{b}}=({b}_{1},\ldots ,{b}_{{N}_{\text{h}}})$$ and weights W = (Wj,k: j = 1…N, k = 1…Nh). The complex-valued ansatz given in Eq. (5) is, in general, not normalized.

We note that the N-qubit $$\left|+\right\rangle$$ state required for initializing QAOA can always be exactly implemented by setting all variational parameters to 0. That choice ensures that the wavefunction ansatz given in Eq. (5) is constant across all computational basis states, as required. The advantage of using the ansatz given in Eq. (5) as an N-qubit state is that a subset of one- and two-qubit gates can be exactly implemented as mappings between different sets of variational parameters $$\theta \mapsto \theta ^{\prime}$$. In general, such mapping corresponding to an abstract gate $${\mathcal{G}}$$ is found as the solution of the following nonlinear equation:

$$\langle {\mathcal{B}}| {\psi }_{\theta ^{\prime} }\rangle =C\left\langle {\mathcal{B}}\right|{\mathcal{G}}\left|{\psi }_{\theta }\right\rangle,$$
(6)

for all bit strings $${\mathcal{B}}$$ and any constant C, if a solution exists. For example, consider the Pauli Z gate acting on qubit i. In that case, Eq. (6) reads $${e}^{{a}_{i}^{\prime}{B}_{i}}=C{(-1)}^{{B}_{i}}{e}^{{a}_{i}{B}_{i}}$$ after trivial simplification. The solution is $${a}_{i}^{\prime}={a}_{i}+i\pi$$ for C = 1, with all other parameters remaining unchanged. In addition, one can exactly implement a subset of two-qubit gates by introducing an additional hidden unit coupled only to the two qubits in question. Labeling the new unit by c, we can implement the RZZ gate relevant for QAOA. The gate is given as $$RZZ(\phi )={e}^{-i\phi {Z}_{i}{Z}_{j}}\propto \,\text{diag}\,(1,{e}^{i\phi },{e}^{i\phi },1)$$ up to a global phase. The replacement rules read:

$$\begin{array}{lll}&&{W}_{ic}=-2{\mathcal{A}}(\phi ), \quad {W}_{jc}=2{\mathcal{A}}(\phi )\\ &&{a}_{i}\to {a}_{i}+{\mathcal{A}}(\phi ),\quad {a}_{j}\to {a}_{j}-{\mathcal{A}}(\phi ),\end{array}$$
(7)

where $${\mathcal{A}}(\phi )$$ = Arccosh $$\left({e}^{i\phi }\right)$$ and C = 2. Derivations of replacement rules for these and other common one and two-qubit gates can be found in Sec. Methods.

Not all gates can be applied through solving Eq. (6). Most notably, gates that form superpositions belong in this category, including $${U}_{B}(\beta )={\prod }_{i}{e}^{-i\beta {X}_{i}}$$ required for running QAOA. This happens simply because a linear combination of two or more RBMs cannot be exactly represented by a single new RBM through a simple variational parameter change. To simulate those gates, we employ a variational stochastic optimization scheme.

We take $${\mathcal{D}}(\phi ,\psi )=1-F(\phi ,\psi )$$ as a measure of distance between two arbitrary quantum states $$\left|\phi \right\rangle$$ and $$\left|\psi \right\rangle$$, where F(ϕ, ψ) is the usual quantum fidelity:

$$F(\phi ,\psi )=\frac{| \langle \phi | \psi \rangle {| }^{2}}{\langle \phi | \phi \rangle \langle \psi | \psi \rangle}.$$
(8)

In order to find variational parameters θ, which approximate a target state $$\left|\phi \right\rangle$$ well ($$\left|{\psi }_{\theta }\right\rangle \approx \left|\phi \right\rangle$$, up to a normalization constant), we minimize $${\mathcal{D}}({\psi }_{\theta },\phi )$$ using a gradient-based optimizer. In this work we use the Stochastic Reconfiguration (SR)37,38,39 algorithm to achieve that goal.

For larger p, extra hidden units introduced when applying UC(γ) at each layer can result in a large number of associated parameters to optimize over that are not strictly required for accurate output state approximations. So to keep the parameter count in check, we insert a model compression step, which halves the number of hidden units immediately after applying UC doubles it. Specifically we create an RBM with fewer hidden units and fit it to the output distribution of the larger RBM (output of UC). Exact circuit placement of compression steps are shown on Fig. 1 and details are provided in Methods. As a result of the compression step, we are able to keep the number of hidden units in our RBM ansatz constant, explicitly controlling the variational parameter count.

### Simulation results for 20 qubits

In this section we present our simulation results for Max-Cut QAOA on random regular graphs of order N40,41,42. In addition, we discuss model limitations and its relation to current state-of-the-art simulations.

QAOA angles γ, β are required as an input of our RBM-based simulator. At p = 1, we base our parameter choices on the position of global optimum that can be computed exactly (see Supplementary Note 1). For p > 1, we resort to direct numerical evaluation of the cost function as given in Eq. (1) from either the complete state vector of the system (number of qubits permitting) or from importance-sampling the output state as represented by a RBM. For all p, we find the optimal angles using Adam43 with either exact gradients or their finite-difference approximations.

We begin by studying the performance of our approach on a 20-qubit system corresponding to the Max-Cut problem on a 3-regular graph of order N = 20. In that case, access to exact numerical wavefunctions is not yet severely restricted by the number of qubits. That makes it a suitable test-case. The results can be found in Fig. 2.

In Fig. 2, we present the cost function for several values of QAOA angles, as computed by the RBM-based simulator. Each panel shows cost functions from one typical random 3-regular graph instance. We observe that cost landscapes, optimal angles and algorithm performance do not change appreciably between different random graph instances. We can see that our approach reproduces variations in the cost landscape associated with different choices of QAOA angles at both p = 1 and p = 2. At p = 1, an exact formula (see Supplementary Note 1) is available for comparison of cost function values. We report that, at optimal angles, the overall final fidelity (overlap squared) is consistently above 94% for all random graph instances we simulate.

In addition to cost function values, we also benchmark our RBM-based approach by computing fidelities between our variational states and exact simulations. In Fig. 3 we show the dependence of fidelity on the number of qubits and circuit depth p. While, in general, it is hard to analytically predict the behavior of these fidelities, we nonetheless remark that with relatively small NQS we can already achieve fidelities in excess of 92% for all system sizes considered for exact benchmarks.

### Simulation results for 54 qubits

Our approach can be readily extended to system sizes that are not easily amenable to exact classical simulation. To show this, in Fig. 4 we show the case of N = 54 qubits. This number of qubits corresponds, for example, to what implemented by Google’s Sycamore processor, while our approach shares no other implementation details with that specific platform. For the system of N = 54 qubits, we closely reproduce the exact error curve (see Supplementary Note 1) at p = 1, implementing 81 RZZ (eiγZZ) gates exactly and 54 RX (eiβX) gates approximately, using the described optimization method. We also perform simulations at p = 2 and p = 4 and obtain corresponding approximate QAOA cost function values.

At p = 4, we exactly implement 324 RZZ gates and approximately implement 216 RX gates. This circuit size and depth is such that there is no available experimental or numerically exact result to compare against. The accuracy of our approach can nonetheless be quantified using intermediate variational fidelity estimates. These fidelities are exactly the cost functions (see Sec. Methods) we optimize, separately for each qubit. In Fig. 4 (panel b) we show the optimal variational fidelities (see Eq. (8)) found when approximating the action of RX gates with the RBM wavefunction. At optimal γ4 (minimum of p = 4 curve at Fig. 4, panel a), the lowest variational fidelity reached was above 98%, for a typical random graph instance shown at Fig. 4. As noted earlier, exact final states of 54-qubit systems are intractable so we are unable to report or estimate the full many-qubit fidelity benchmark results.

We remark that the stochastic optimization performance is sensitive to choices of QAOA angles away from optimum (see Fig. 4 right). In general, we report that the fidelity between the RBM state (Eq. (5)) and the exact N-qubit state (Eq. (2)) decreases as one departs from optimal by changing γ and β.

For larger values of QAOA angles, the associated optimization procedure is more difficult to perform, resulting in a lower fidelity (see the dark patch in Fig. 4, panel b). We find that optimal angles were always small enough not to be in the low-performance region. Therefore, this model is less accurate when studying QAOA states away from the variational optimum. However, even in regions with lowest fidelities, RBM-based QAOA states are able to approximate cost well, as can be seen in Figs. 2 and 4.

As an additional hint to the high quality of the variational approximation, we capture the QAOA approximation of the actual combinatorial optimum. A tight upper bound on that optimum was calculated to be Copt = −69 for 54 qubits by directly optimizing an RBM to represent the ground state of the cost operator defined in Eq. (3).

### Comparison with other methods

In modern sum-over-Cliffords/Metropolis simulators, computational complexity grows exponentially with the number of non-Clifford gates. With the RZZ gate being a non-Clifford operation, even our 20-qubit toy example, exactly implementing 60 RZZ gates at p = 2, is approaching the limit of what those simulators can do26. In addition, that limit is greatly exceeded by the larger, 54-qubit system we study next, implementing 162 RZZ gates. State-of-the-art tensor-based approaches29 have been used to simulate larger circuits but are ineffective in the case of nonplanar graphs.

Another very important tensor-based method is the Matrix Product State (MPS) variational representation of the many-qubit state. This is is a low-entanglement representation of quantum states, whose accuracy is controlled by the so-called bond dimension. Routinely adopted to simulate ground states of one-dimensional systems with high accuracy44,45,46, extensions of this approach to simulate challenging circuits have also been recently put forward47. In Fig. 5, our approach is compared with an MPS ansatz. We establish that for small systems, MPS provides reliable results with relatively small bond dimensions. For larger systems, however, our approach significantly outperforms MPS-based circuit simulation methods both in terms of memory requirements (fewer parameters) and overall runtime. This is to be expected in terms of entanglement capacity of MPS wavefunctions, that are not specifically optimized to handle non-one-dimensional interaction graphs, as in this specific case at hand.

For a more direct comparison, we estimate the MPS bond dimension required for reaching RBM performance at p = 2 and 54 qubits to be ~104 (see Fig. 5), amounting to ~1010 complex parameters (≈160 GB of storage) while our RBM approach uses ≈4500 parameters (≈70 kB of storage). In addition, we expect the MPS number of parameters to grow with depth p because of additional entanglement, while RBM sizes heuristically scale weakly (constant in our simulations) with p and can be controlled mid-simulation using our compression step. It should be noted that the output MPS bond dimension depends on the specific implementation of the MPS simulator, namely, qubit ordering and the number of “swap” gates applied to correct for the nonplanar nature of the underlying graph, and that a more efficient implementations might be found. However, determining the optimal implementation is itself a difficult problem and, given the entanglement of a generic circuit we simulate, it would likely produce a model with orders of magnitude more parameters than a RBM-based approach.

## Discussion

In this work, we introduce a classical variational method for simulating QAOA, a hybrid quantum-classical approach for solving combinatorial optimizations with prospects of quantum speedup on near-term devices. We employ a self-contained approximate simulator based on NQS methods borrowed from many-body quantum physics, departing from the traditional exact simulations of this class of quantum circuits.

We successfully explore previously unreachable regions in the QAOA parameter space, owing to good performance of our method near optimal QAOA angles. Model limitations are discussed in terms of lower fidelities in quantum state reproduction away from said optimum. Because of such different area of applicability and relative low computational cost, the method is introduced as complementary to established numerical methods of classical simulation of quantum circuits.

Classical variational simulations of quantum algorithms provide a natural way to both benchmark and understand the limitations of near-future quantum hardware. On the algorithmic side, our approach can help answer a fundamentally open question in the field, namely whether QAOA can outperform classical optimization algorithms or quantum-inspired classical algorithms based on artificial neural networks48,49,50.

## Methods

### Exact application of one-qubit Pauli gates

As mentioned in the main text, some one-qubit gates gates can be applied exactly to the RBM ansatz given in Eq. (5). Here we discuss the specific case of Pauli gates. Parameter replacement rules we use to directly apply one-qubit gates can be obtained by solving Eq. (6) given in the main text. Consider for example the Pauli Xi or NOTi gate acting on qubit i. It can be applied by satisfying the following system of equations:

$$\begin{array}{lll}&&{\mathrm{ln}}\,C+{a}_{i}^{\prime}{B}_{i}=(1-{B}_{i}){a}_{i}\\ &&{b}_{k}^{\prime}+{B}_{i}{W}_{ik}^{\prime}={b}_{k}+(1-{B}_{i}){W}_{ik}.\end{array}$$
(9)

for Bi = 0, 1. The solution is:

$$\begin{array}{lll}&&{\mathrm{ln}}\,C={a}_{i};\quad {a}_{i}^{\prime}=-{a}_{i};\\ &&{b}_{k}^{\prime}={b}_{k}+{W}_{ik};\quad {W}_{ik}^{\prime}=-{W}_{ik},\end{array}$$
(10)

with all other parameters remaining unchanged.

A similar solution can be found for the Pauli Y gate:

$$\begin{array}{lll}&&{\mathrm{ln}}\,C={a}_{i}+\frac{i\pi }{2};\quad {a}_{i}^{\prime}=-{a}_{i}+i\pi ;\quad \\ &&{b}_{k}^{\prime}={b}_{k}+{W}_{ik} ; \quad {W}_{ik}^{\prime}=-{W}_{ik} ,\end{array}$$
(11)

with all other parameters remaining unchanged as well.

For the Pauli Z gate, as described in the main text, one needs to solve $${e}^{{a}_{i}^{\prime}{B}_{i}}={(-1)}^{{B}_{i}}{e}^{{a}_{i}{B}_{i}}$$. The solution is simply

$${a}_{i}^{\prime}={a}_{i}+i\pi.$$
(12)

More generally, it is possible to apply exactly an arbitrary Z rotation gate, as given in matrix form as:

$$RZ(\varphi )={e}^{-i\frac{\varphi }{2}Z}\propto \left(\begin{array}{ll}1&0\\ 0&{e}^{i\varphi }\end{array}\right)$$
(13)

where the proportionality is up to a global phase factor. Similar to the Pauli Zi gate, this gate can be implemented on qubit i by solving $${e}^{{a}_{i}^{\prime}{B}_{i}}={e}^{i\varphi {B}_{i}}{e}^{{a}_{i}{B}_{i}}$$. The solution is simply:

$${a}_{i}^{\prime}={a}_{i}+i\varphi,$$
(14)

with all other parameters besides ai remaining unchanged. This expression reduces to the Pauli Z gate replacement rules for φ = π as required.

### Exact application of two-qubit gates

We apply two-qubit gates between qubits k andlby adding an additional hidden unit (labeled by c) to the RBM before solving Eq. (6) from the main text. The extra hidden unit couples only to qubits in question, leaving all previously existing parameters unchanged. In that special case, the equation reduces to

$${e}^{{{\Delta }}{a}_{k}{B}_{k}+{{\Delta }}{a}_{l}{B}_{l}}\left(1+{e}^{{W}_{kc}{B}_{k}+{W}_{lc}{B}_{l}}\right){\psi }_{\theta }({\mathcal{B}})=C\left\langle {\mathcal{B}}\right|{\mathcal{G}}\left|{\psi }_{\theta }\right\rangle.$$
(15)

An important two-qubit gate we can apply exactly are ZZ rotations. The gate RZZ is key for being able to implement the first step in the QAOA algorithm. The definition is:

$$RZZ(\varphi )={e}^{-i\frac{\varphi }{2}Z\otimes Z}\propto \left(\begin{array}{*{20}{l}}1&0&0&0\\ 0&{e}^{i\varphi }&0&0\\ 0&0&{e}^{i\varphi }&0\\ 0&0&0&1\end{array}\right)\ ,$$
(16)

where the proportionality factor is again a global phase. The related matrix element for a RZZkl gate between qubits k and l is $$\left\langle {B}_{k}^{\prime}{B}_{l}^{\prime}\right|RZ{Z}_{kl}(\varphi )\left|{B}_{k}{B}_{l}\right\rangle ={e}^{i\varphi {B}_{k}\oplus {B}_{l}}$$ where  stands for the classical exclusive or (XOR) operation. Then, one solution to Eq. (15) reads:

$$\begin{array}{lll}&&{W}_{ic}=-2{\mathcal{A}}(\varphi );\quad {W}_{jc}=2{\mathcal{A}}(\varphi )\\ &&{a}_{i}^{\prime}={a}_{i}+{\mathcal{A}}(\varphi );\quad {a}_{j}^{\prime}={a}_{j}-{\mathcal{A}}(\varphi ),\end{array}$$
(17)

where $${\mathcal{A}}(\varphi )$$ = Arccosh $$\left({e}^{i\varphi }\right)$$ and C = 2.

### Approximate gate application

Here we provide model details and show how to approximately apply quantum gates that cannot be implemented through methods described in sec. Exact application of one-qubit Pauli gates. In this work we use the Stochastic Reconfiguration (SR)37 algorithm to approximately apply quantum gates to the RBM ansatz. To that end, we write the “infidelity” between our RBM ansatz and the target state ϕ, $${\mathcal{D}}({\psi }_{\theta },\phi )=1-F({\psi }_{\theta },\phi )$$, as an expectation value of an effective hamiltonian operator $${H}_{\,\text{eff}\,}^{\phi }$$:

$${\mathcal{D}}({\psi }_{\theta },\phi )=\frac{\left\langle {\psi }_{\theta }\right|{H}_{\,\text{eff}\,}^{\phi }\left|{\psi }_{\theta }\right\rangle }{\langle {\psi }_{\theta }| {\psi }_{\theta }\rangle }\ \to \ {H}_{\,\text{eff}\,}^{\phi }={\mathbb{1}}-\frac{\left|\phi \right\rangle \left\langle \phi \right|}{\langle \phi | \phi \rangle }$$
(18)

We call the hermitian operator given in Eq. (18) a “hamiltonian” only because the target quantum state $$\left|\psi \right\rangle$$ is encoded into it as the eigenstate corresponding to the smallest eigenvalue. Our optimization scheme focuses on finding small parameter updates Δk that locally approximate the action of the imaginary time evolution operator associated with $${H}_{\,\text{eff}\,}^{\phi }$$, thus filtering out the target state:

$$\left|{\psi }_{\theta +{{\Delta }}}\right\rangle =C\ {e}^{-\eta H}\left|{\psi }_{\theta }\right\rangle,$$
(19)

where C is an arbitrary constant included because our variational states (Eq. (5), main text) are not normalized. Choosing both η and Δ to be small, one can expand both sides to linear order in those variables and solve the resulting linear system for all components of Δ, after eliminating C first. After some simplification, one arrives at the following parameter at each loop iteration (indexed with t):

$${\theta }_{k}^{(t+1)}={\theta }_{k}^{(t)}-\eta \mathop{\sum}\limits_{l}{S}_{kl}^{-1}\ \frac{\partial {\mathcal{D}}}{\partial {\theta }_{l}^{* }},$$
(20)

where stochastic estimations of gradients of the cost function $${\mathcal{D}}({\psi }_{\theta },\phi )$$ can be obtained through samples from ψθ2 at each loop iteration through:

$$\frac{\partial {\mathcal{D}}}{\partial {\theta }_{k}^{* }}={\left\langle {{\mathcal{O}}}_{k}^{\dagger }{H}_{\text{eff}}^{\phi }\right\rangle }_{{\psi }_{\theta }}-{\left\langle {{\mathcal{O}}}_{k}^{\dagger }\right\rangle }_{{\psi }_{\theta }}{\left\langle {H}_{\text{eff}}^{\phi }\right\rangle }_{{\psi }_{\theta }}.$$
(21)

Here, $${{\mathcal{O}}}_{k}$$ is defined as a diagonal operator in the computational basis such that $$\left\langle {\mathcal{B}}\right|^{\prime} {{\mathcal{O}}}_{k}\left|{\mathcal{B}}\right\rangle =\frac{\partial {\mathrm{ln}}\,{\psi }_{\theta }}{\partial {\theta }_{k}}\ {\delta }_{{\mathcal{B}}^{\prime} {\mathcal{B}}}$$. Averages over ψ are commonly defined as $${\langle \cdot \rangle }_{\psi }\equiv \left\langle \psi \right|\cdot \left|\psi \right\rangle /\langle \psi | \psi \rangle$$. Furthermore, the S-matrix appearing in Eq. (20) reads:

$${S}_{kl}={\left\langle {{\mathcal{O}}}_{k}^{\dagger }{{\mathcal{O}}}_{l}\right\rangle }_{{\psi }_{\theta }}-{\left\langle {{\mathcal{O}}}_{k}^{\dagger }\right\rangle }_{{\psi }_{\theta }}\left\langle \right.{{\mathcal{O}}}_{l}{\rangle }_{{\psi }_{\theta }},$$
(22)

and corresponds to the Quantum Geometric Tensor or Quantum Fisher Information (also see ref. 51 for a detailed description and connection with the natural gradient method in classical machine learning52).

Exact computations of averages over N-qubit states ψθ and ϕ at each optimization step range from impractical to intractable, even for moderate N. Therefore, we evaluate those averages by importance-sampling the probability distributions associated with the variational ansatz ψθ2 and the target state ϕ2 at each optimization step t. All of the above expectation values are evaluated using Markov Chain Monte Carlo (MCMC)38,39 sampling with basic single-spin flip local updates. An overview of the sampling method can be found in ref. 53. In order to use those techniques, we rewrite Eq. (21) as:

$$\frac{\partial {\mathcal{D}}}{\partial {\theta }_{l}^{* }}={\left\langle \frac{\phi }{{\psi }_{\theta }}\right\rangle }_{{\psi }_{\theta }}{\left\langle \frac{{\psi }_{\theta }}{\phi }\right\rangle }_{\phi }\left[{\left\langle {{\mathcal{O}}}_{k}^{* }\right\rangle }_{{\psi }_{\theta }}-\frac{{\left\langle \frac{\phi }{{\psi }_{\theta }}{{\mathcal{O}}}_{k}^{* }\right\rangle }_{{\psi }_{\theta }}}{{\left\langle \frac{\phi }{{\psi }_{\theta }}\right\rangle }_{{\psi }_{\theta }}}\right].$$
(23)

In our experiments with less than 20 qubits, we take 8000 MCMC samples from four independent chains (totaling 32,000 samples) for gradient evaluation. Between each two recorded samples, we take N MCMC steps (for N qubits). For the 54-qubit experiment, we take 2000 MCMC samples four independent chains because of increased computational difficulty of sampling. The entire Eq. (23) is manifestly invariant to rescaling of ψθ and ϕ, removing the need to ever compute normalization constants. We remark that the prefactor in Eq. (23) is identically equal to the fidelity given in Eq. (8) in the main text.

$$F(\psi ,\phi )=\frac{| \langle \phi | \psi \rangle {| }^{2}}{\langle \phi | \phi \rangle \langle \psi | \psi \rangle }={\left\langle \frac{\phi }{\psi }\right\rangle }_{\psi }{\left\langle \frac{\psi }{\phi }\right\rangle }_{\phi }\ ,$$
(24)

allowing us to keep track of cost function values during optimization with no additional computational cost.

The second step consists of multiplying the variational derivative with the inverse of the S-matrix (Eq. (22)) corresponding to a stochastic estimation of a metric tensor on the hermitian parameter manifold. Thereby, the usual gradient is transformed into the natural gradient on that manifold. However, the S-matrix is stochastically estimated and it can happen that it is singular. To regularize it, we replace S with S + $$\epsilon {\mathbb{1}}$$, ensuring that the resulting linear system has a unique solution. We choose ϵ = 10−3 throughout. The optimization procedure is summarized in Supplementary Note 2.

In order to keep the number of hidden units reasonable, we employ a compression step at each QAOA layer (after the first). Immediately after applying the UC(γk) gate in layer k to the RBM ψθ (and thereby introducing unwanted parameters), we go through the following steps:

1. (1)

Construct a new RBM $${\widetilde{\psi }}_{\theta }$$.

2. (2)

Initialize $${\widetilde{\psi }}_{\theta }$$ to exactly represent the state $${U}_{C}\left(\frac{1}{k}{\sum }_{j\le k}{\gamma }_{j}\right)\left|+\right\rangle$$. Doing this introduces half the number hidden units that are already present in ψθ.

3. (3)

Stochastically optimize $${\widetilde{\psi }}_{\theta }$$ to approximate ψθ (using algorithm in Supplementary Note 2) with ϕ → ψθ and $$\psi \to {\widetilde{\psi }}_{\theta }$$.

In essence, we use the optimization algorithm with the “larger” ψθ as the target state ϕ. The optimization results in a new RBM state with fewer hidden units that closely approximates the old RBM with fidelity > 0.98 in all our tests. We then proceed to simulate the rest of the QAOA circuit and apply the same compression procedure again when the number of parameters increases again. The exact schedule of applying this procedure in the context of different QAOA layers can be seen on Fig. 1.

We choose the initial state for the optimization as an exactly reproducible RBM state that has non-zero overlap with the target (larger) RBM. In principle, any other such state would work, but we heuristically find this one to be a reliable choice across all p-values studied. Alternatively, one can just initialize $$\widetilde{{\psi }_{\theta }}$$ to $${U}_{C}\left(\gamma ^{\prime} \right)\left|+\right\rangle$$ with $$\gamma ^{\prime} ={\text{argmax}}_{\gamma }\ F\left({\psi }_{\theta },{U}_{C}(\gamma )\left|+\right\rangle \right)$$, using an efficient 1D optimizer to solve for $$\gamma ^{\prime}$$ before starting to optimize the full RBM.