Purification-based quantum error mitigation of pair-correlated electron simulations

An important measure of the development of quantum computing platforms has been the simulation of increasingly complex physical systems. Until fault-tolerant quantum computing arrives, robust error-mitigation strategies are necessary to sustain this growth. Here, we validate recently introduced error-mitigation strategies that exploit the expectation that the ideal output of a quantum algorithm would be a pure state. We consider the task of simulating electron systems in the seniority-zero subspace, in which all electrons are paired with their opposite-spin partner. This affords a computational stepping stone to a fully correlated model. We compare the performance of error mitigation strategies based on doubling quantum resources in time or in space, on up to 20 qubits of a superconducting qubit quantum processor. We observe a reduction of error by one to two orders of magnitude below less sophisticated techniques such as postselection. We study how the gain from error mitigation scales with system size and observe a polynomial suppression of error with increased resources. Extrapolation of our results indicates that substantial hardware improvements will be required for classically intractable variational chemistry simulations. It is hoped that simulations of molecules and materials will provide a near-term application of quantum computers. A study of the performance of error mitigation highlights the obstacles to scaling up these calculations to practically useful sizes.

The electronic structure problem can be expressed in models of varying complexity and realism. Quantum simulations of chemistry within the Hartree-Fock (mean-field) approximation were implemented for system sizes up to 12 qubits in [27], which retains the record for the largest VQE calculation of a chemical ground state on quantum hardware. As a next step, one can consider working in the seniority-zero (S0) subspace of the full Hilbert space, which assumes all electrons come in spin-up and spin-down pairs [15,33-37]. This has the advantage of projecting a local fermionic problem onto a local qubit problem [15]. The S0 ground state is not a priori classically efficiently simulable [15] (though good approximate methods are known to exist for many problems [38-40]). This makes it a good stepping stone beyond Hartree-Fock towards the full electronic structure problem.
Recent quantum experiments have relied on error mitigation techniques [41], which do not scale in the manner of error correction [1,42] but promise to substantially shrink experimental errors. Popular methods are based on postselection [43,44], rescaling [45-48], purification [27,49-51] and probabilistic cancellation [45,52]. Various schemes and combinations of error mitigation techniques have been implemented in practice [24,26-30,47,53]. However, many of these methods do not promise to remove bias to the level of accuracy needed for useful simulation of chemistry, or remain untested beyond few-qubit experiments. Shifting from non-interacting fermions to correlated electronic structure, one loses two error mitigation advantages that were crucial to the success of [27]: efficient density matrix purification via McWeeny iteration [54], and low-cost gradient estimation.
In this work, we mitigate errors accumulated during the preparation of electronic ground states in the seniority-zero space, comparing three different error mitigation techniques (postselection, echo verification, and virtual distillation) on up to 20 qubits of a superconducting quantum processor. Using either echo verification or a new combination of postselection and virtual distillation, we are able to reproduce the ground state energy and order parameter for an N = 10 qubit simulation of the Richardson-Gaudin (RG), or pairing, model (the quintessential model of superconductivity), improving over the unmitigated estimates by 1-2 orders of magnitude. This demonstrates an improvement over classical pair-coupled-cluster doubles and over the non-interacting BCS theory, neither of which is qualitatively correct over the entire range of coupling values considered. Echo verification was further able to significantly improve over postselected VQE for 6- and 10-qubit simulations of the ring opening of cyclobutene. While the stringent error requirement (< 0.05 Hartree) to differentiate between mean field and the exact solution could only be achieved for the 6-qubit case, this still represents the largest VQE simulation of electronic structure for chemistry to date.
Finally, we considered the scaling of our simulation of the RG model, using data from simulations at N = 4, 6, 8, 10. We observe a clear difference in the asymptotic scaling of the mean absolute error in energy and order parameter when echo verification or virtual distillation are applied. From this data, we are able to estimate the minimum requirements for a beyond-classical VQE simulation of similar form: a 25× decrease in hardware error rates (from those observed in this work), a limit of O(N) depth for future variational ansatzes, and the need to pre-optimize ansatzes classically without intermediate calls to a device. Even if all of these requirements are met, reaching the high level of accuracy required for the electronic structure problem will pose a serious challenge, as chemical accuracy is around 60× below our mean error for the 10-qubit cyclobutene problem.

A. Simulating the seniority-zero subspace
The seniority of a Slater determinant is the number of unpaired electrons; thus, the seniority-zero (S0) sector of Hilbert space for an N-electron system in M orbitals is the space of (M choose N/2) determinants leaving no electrons unpaired, given a particular pairing of the spin-orbitals. Seniority is not a global symmetry of the electronic structure Hamiltonian and is basis dependent; it has been used as a way to classify determinant subspaces to generate better approximations for solving the Schrödinger equation [34-37] and as a starting point for modeling strong correlations from electron-pair states [33].
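The counting above can be sketched in a few lines; the sizes below are illustrative and simply evaluate the binomial dimension of the S0 sector.

```python
from math import comb

def s0_dimension(M, N):
    """Dimension of the seniority-zero sector: choose which N/2 of the
    M spatial orbitals host an electron pair."""
    assert N % 2 == 0, "seniority zero requires an even electron count"
    return comb(M, N // 2)

# Illustrative sizes: the half-filled 10-orbital, 10-electron case.
print(s0_dimension(10, 10))   # C(10, 5) = 252
print(s0_dimension(6, 6))     # C(6, 3) = 20
```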
Supported on the S0 subspace there exists a set of operators satisfying an su(2) algebra, constructed from pairs of fermion ladder operators and the spatial-orbital number operator [55],

P†_p = a†_{p↑} a†_{p↓},  N_p = Σ_σ a†_{pσ} a_{pσ},
[P_p, P†_q] = (1 − N_p) δ_{p,q},  [N_p, P_q] = −2 P_q δ_{p,q},   (1)

where p, q are spatial-orbital indices and σ ∈ {↑, ↓} is a spin index. These operators form a basis for Hamiltonians projected into the S0 subspace. The equivalence to an su(2) algebra means seniority-zero models resemble Heisenberg spin-1/2 models, which are easily expressed in terms of Pauli operators.
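In the S0 subspace each spatial orbital carries a pair that is either present or absent, so the pair operators act as hard-core boson (qubit) operators: P_p → |0⟩⟨1|, P†_p → |1⟩⟨0|, and N_p → 1 − Z_p (eigenvalues 0 and 2). A minimal numerical check of the commutation relations in Eq. (1) under this standard mapping:

```python
import numpy as np

# Qubit images of the seniority-zero pair operators (hard-core boson picture):
# P_p -> |0><1|, P_p^dag -> |1><0|, N_p -> 1 - Z_p (eigenvalues 0 and 2).
P = np.array([[0, 1], [0, 0]], dtype=complex)   # pair annihilation |0><1|
Pd = P.conj().T                                  # pair creation |1><0|
Z = np.diag([1.0, -1.0]).astype(complex)
I = np.eye(2, dtype=complex)
N = I - Z

comm = lambda a, b: a @ b - b @ a

# [P_p, P_q^dag] = (1 - N_p) delta_pq: on a single orbital this equals Z_p.
assert np.allclose(comm(P, Pd), I - N)
# [N_p, P_q] = -2 P_q delta_pq
assert np.allclose(comm(N, P), -2 * P)
# Operators on different orbitals commute: embed on two qubits.
assert np.allclose(comm(np.kron(P, I), np.kron(I, Pd)), np.zeros((4, 4)))
print("su(2) pair-operator relations verified")
```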
In this work we focus on two Hamiltonians to validate purification-based error mitigation strategies. The first is the Richardson-Gaudin (RG), or pairing, model (Eq. (2)), which describes a small superconducting grain when g < 0 [56-59], but with a g-dependent Debye frequency [57]. The second is the electronic structure Hamiltonian (H_elec) projected into the S0 subspace (Eq. (3)). The resulting all-to-all connected Heisenberg spin Hamiltonian is, in general, not known to be classically solvable, but good approximate methods exist. This is especially true for the RG model, which is often integrable [39], well approximated by the density-matrix renormalization group [38] and pair coupled cluster techniques in the repulsive regime, and solvable by quantum Monte Carlo in the attractive regime (where it has no sign problem). Pair coupled cluster theory is also known to work well for the electronic structure problem in the S0 subspace [36,40,60,61], while full configuration interaction quantum Monte Carlo shows a reduced sign problem [62].
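As an illustration, the RG model can be built and exactly diagonalized for a small instance in the qubit picture. The sign and normalization conventions below follow the standard pairing-model form and are an assumption, not necessarily the exact convention of Eq. (2):

```python
import numpy as np
from functools import reduce

def embed(op, site, n):
    """Place a single-qubit operator on `site` of an n-qubit register."""
    I2 = np.eye(2)
    return reduce(np.kron, [op if k == site else I2 for k in range(n)])

P = np.array([[0.0, 1.0], [0.0, 0.0]])   # pair annihilation, |0><1|
Z = np.diag([1.0, -1.0])
Nq = np.eye(2) - Z                        # pair-number operator (eigenvalues 0, 2)

def rg_hamiltonian(n, g, eps=None):
    """Assumed textbook form: H = sum_p eps_p N_p + g sum_{p != q} P_p^dag P_q."""
    eps = np.arange(1.0, n + 1) if eps is None else eps
    H = sum(eps[p] * embed(Nq, p, n) for p in range(n))
    for p in range(n):
        for q in range(n):
            if p != q:
                H += g * embed(P.T, p, n) @ embed(P, q, n)
    return H

E = np.linalg.eigvalsh(rg_hamiltonian(4, -0.5))
print("lowest eigenvalues:", E[:3])
```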
As such, although we have strong evidence that the quantum circuits used in this text are not classically simulable (App. B 1), we do not believe directly scaling S0 simulations represents the easiest path to a quantum advantage in chemistry; it is instead a stepping stone between a mean-field solution and the full electronic structure problem.

B. The unitary pair coupled cluster ansatz and energy estimation
In this work we use a Trotterized unitary pair-coupled-cluster doubles (UpCCD) ansatz [15] compiled onto a set of qubits in a ladder geometry with nearest-neighbor coupling. The ladder ansatz (instead of a generic ring) allows us to efficiently measure terms in the Hamiltonian corresponding to qubits that are not physically adjacent after encoding, with a minimal number of swap operations. When mapped from fermions to qubits, the UpCCD ansatz takes the form of alternating layers of Givens-swap gates (Eq. (4)), where each GS_ij(θ) corresponds to the product of a Givens rotation gate on a pair labeled by qubits i and j, followed by a swap operation [15]. The GS gate implements a coherent partial pair excitation (by the angle θ), followed by a pair swap. Given a number of layers N_L in Eq. (4) and a total number of qubits N, there are N_L · N/2 free parameters in the ansatz. To minimize the time qubits are idle, we order the spatial orbitals such that the Fermi vacuum is |0101...01⟩ (e.g. the restricted Hartree-Fock state), corresponding to an interleaved list of occupied and virtual orbital labels in ascending energy order. The Hamiltonian qubit ordering is then chosen such that when all θ = 0, the Hartree-Fock state of each model is returned. The alternating swap-gate arrangement allows us to couple each occupied pair with each unoccupied pair once in depth N/2 (see App. B 3). Thus, in this work we set N_L = N/2 for all systems. Each GS(θ) gate is compiled into a product of three controlled-Z (CZ) gates interleaved with tunable single-qubit microwave gates (Fig. 1 (top); see App. B 2).
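A minimal matrix-level sketch of the GS gate (phase conventions are an assumption; the hardware gate may differ by single-qubit phases): a number-preserving Givens rotation on the {|01⟩, |10⟩} subspace followed by a swap.

```python
import numpy as np

def givens(theta):
    """Number-preserving Givens rotation on the {|01>, |10>} subspace
    (phase conventions are assumed, not the exact hardware ones)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0,  0, 1]], dtype=float)

SWAP = np.eye(4)[[0, 2, 1, 3]]   # permutation matrix |ab> -> |ba>

def gs(theta):
    """Givens-swap gate: a partial pair excitation followed by a pair swap."""
    return SWAP @ givens(theta)

# GS(0) reduces to a bare swap, as used later for gate calibration.
assert np.allclose(gs(0.0), SWAP)
# The gate conserves excitation number: it commutes with Z x I + I x Z.
Z, I = np.diag([1.0, -1.0]), np.eye(2)
Ntot = np.kron(Z, I) + np.kron(I, Z)
assert np.allclose(gs(0.7) @ Ntot, Ntot @ gs(0.7))
print("GS gate checks passed")
```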
To perform energy estimation on our two S0 models, expectation values with respect to nearest-neighbor and non-nearest-neighbor qubit pairs are required. The expectation value ⟨X_i X_j + Y_i Y_j⟩ is estimated by performing a number-preserving diagonalization [19,27], mapping the expectation value to the difference of ⟨Z_i⟩ and ⟨Z_j⟩. The ladder geometry allows us to measure all non-nearest-neighbor pairs across the rungs of the ladder in a similar fashion, at the additional cost of at most one swap operation. The full measurement protocol is detailed in Appendix B 3. All-to-all coupling is achieved in N circuits, bringing the total number of different circuits needed to measure the Hamiltonian's expectation value to N + 1. Strategies with fewer circuits exist; however, they do not allow for postselection on particle number.

C. Echo verification and virtual distillation
Echo verification (EV), introduced in [51], estimates ⟨ψ|O|ψ⟩ by preparing a superposition of the target state |ψ⟩ with a reference state |φ⟩ satisfying O|φ⟩ = e^{iφ}|φ⟩, and measuring the off-diagonal matrix element ⟨Ψ|O|φ⟩⟨ψ|Ψ⟩ (Eq. (8)). The expectation value ⟨ψ|O|ψ⟩ can be recovered from Eq. (8), as the other terms are known.
The largest effect of noise on the system is to dampen ⟨Ψ|φ⟩⟨ψ|Ψ⟩ → F ⟨Ψ|φ⟩⟨ψ|Ψ⟩, where F is the circuit fidelity [51]. We can estimate F independently by removing O from the circuit, which yields a Loschmidt echo of the preparation unitary [64]. This is achieved in practice by removing a virtual Z rotation (see Fig. 1, bottom), making the estimated Loschmidt fidelity an accurate estimate of F. Further EV implementation details can be found in App. B 4.
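The damping-by-F behaviour can be illustrated in a toy model. Under global depolarizing noise, the maximally mixed part carries no |φ⟩⟨ψ| coherence, so the EV signal shrinks by exactly F while the Loschmidt echo estimates F itself (all states and the observable below are synthetic):

```python
import numpy as np
rng = np.random.default_rng(0)

d = 2**4
# Toy stand-in for the EV circuit: target |psi>, orthogonal vacuum |phi>,
# and their equal superposition |Psi>.
phi = np.zeros(d, dtype=complex); phi[0] = 1.0
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi -= phi * (phi.conj() @ psi)          # enforce <phi|psi> = 0
psi /= np.linalg.norm(psi)
Psi = (psi + phi) / np.sqrt(2)

def matrix_element(rho):
    """The EV signal <psi|rho|phi> = Tr[rho |phi><psi|]."""
    return psi.conj() @ rho @ phi

ideal = matrix_element(np.outer(Psi, Psi.conj()))   # = 1/2 here

# Global depolarizing noise with fidelity F damps the signal by exactly F.
F = 0.3
rho_noisy = F * np.outer(Psi, Psi.conj()) + (1 - F) * np.eye(d) / d
assert np.allclose(matrix_element(rho_noisy), F * ideal)

# The Loschmidt echo (same circuit with O removed) recovers F up to a
# 1/d correction from the mixed part.
echo = np.real(Psi.conj() @ rho_noisy @ Psi)
print(F, echo)
```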
Virtual distillation (VD) [49,50] is an error mitigation technique that uses collective measurements on k copies of a state ρ to estimate expectation values with respect to ρ^k / Tr[ρ^k]. VD schemes are based on the observation that the cyclic shift operator S^(k) is easily diagonalized, and can therefore be measured; for k = 2 this yields

Tr[O ρ^2] = Tr[S^(2) (O ⊗ I) (ρ ⊗ ρ)].

S^(2) can be simultaneously diagonalized with the observable when O = Z_i by a GS(π/4) rotation between pairs of identified qubits on the two registers. For two N/2 × 2 ladders on a square lattice geometry, this requires one round of swap gates to shift identified qubits next to each other. Operators O ≠ Z_i are measured by first rotating them to Z_i (see Sec. I B) and then following the above procedure. The virtual distillation circuit is only 6 two-qubit gates deeper than postselected VQE.
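The k = 2 identity underlying VD is easy to verify numerically with a random mixed state and observable:

```python
import numpy as np
rng = np.random.default_rng(1)

d = 4
# Random mixed state rho and Hermitian observable O.
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real
B = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
O = B + B.conj().T

# Cyclic shift (swap) operator on the doubled register: S|a>|b> = |b>|a>.
S = np.zeros((d * d, d * d))
for a in range(d):
    for b in range(d):
        S[b * d + a, a * d + b] = 1.0

lhs = np.trace(S @ np.kron(O, np.eye(d)) @ np.kron(rho, rho))
rhs = np.trace(O @ rho @ rho)
assert np.allclose(lhs, rhs)
print("Tr[S (O x I) rho x rho] = Tr[O rho^2] verified")
```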
As the GS(π/4) gate is number-conserving, VD can be combined with postselection: the global excitation number Σ_j (Z_j ⊗ I + I ⊗ Z_j) is a good symmetry. This requires that the state prior to measurement also conserve particle number. This is true when estimating ⟨X_i X_j + Y_i Y_j⟩, but not when estimating ⟨Z_i Z_j⟩: when mapping Z_i Z_j → Z_i one can only preserve the parity of the total number of excitations. In the main text of this work, we present results showing VD with postselection only (PS-VD). We compare VD with and without postselection in App. F 1.

II. THE RICHARDSON-GAUDIN MODEL
We use our UpCCD ansatz to prepare approximate ground states of the RG model on 10 sites at half filling across a range of coupling strengths, using parameters optimized in noiseless simulations. We achieve half filling by adding a chemical potential to the single-particle energies in Eq. (2). In Fig. 2 (top left), we estimate the prepared states' energy with and without error mitigation techniques (see caption), and compare it to exact diagonalization in the S0 subspace, also known as doubly occupied configuration interaction (DOCI), and to classical pair-coupled-cluster doubles (pCCD) and BCS solutions. We see that using EV or PS-VD we are able to reproduce the entire energy curve to high accuracy, which neither pCCD nor the non-interacting BCS theory can achieve. The total error in the result is the sum of the UpCCD model error and the experimental error. To disambiguate the effects of UpCCD model error, in Fig. 2 (top right) we plot the error between our experimental data and the UpCCD ground state energy. Postselection consistently mitigated around half of the error present in the raw ansatz. By contrast, EV demonstrates an average 85-fold and maximum 460-fold error reduction. PS-VD achieves similar performance, with an average 60-fold and maximum 140-fold improvement. The residual error following EV or PS-VD drifts notably, with fluctuations between points larger than the error bars; we attribute this to device drift.
The RG Hamiltonian has a well-known phase transition in the attractive regime (g ≤ 0) in the thermodynamic limit, which appears in the BCS state at finite N, but is not present in the true ground state due to finite-size effects [56-58]. This presents an opportunity for a variational quantum simulation to determine qualitative features of a quantum Hamiltonian beyond non-interacting physics. The traditional order parameter for the BCS state, Δ_BCS = (1/N) Σ_j ⟨a_{j↑} a_{j↓}⟩, is zero on the RG Hamiltonian ground state due to number conservation. However, one can confirm that Δ = (1/N) Σ_{j,σ} √(⟨n_{jσ}²⟩ − ⟨n_{jσ}⟩²) satisfies Δ = Δ_BCS for the BCS ground state of the Hamiltonian, giving a many-body order parameter [57]. In Fig. 2 (bottom left), we plot experimental estimates of Δ across the range of g values considered. In the absence of error mitigation, though the order parameter dips around g = 0, the true cusp is not reproduced. Both EV and PS-VD clearly improve over the BCS approximation for g > 0.5, with EV in particular able to reproduce the cusp at g = 0. The performance of error mitigation is demonstrated by plotting the error in Δ relative to the noise-free UpCCD value in Fig. 2 (bottom right).

We see all experimental estimates have a slight peak in error at g = 0. This can be attributed to Δ being highly sensitive to error at this point (∂Δ/∂⟨n_{jσ}⟩ → ∞). Furthermore, the maximally mixed state has Δ = 1, so decoherence has a larger effect when targeting Δ ≪ 1. This contrasts with the error in the energy (Fig. 2 (top right)), which has a slight dip near g = 0. We attribute this to the increased contribution from ⟨X_i X_j + Y_i Y_j⟩ terms to the energy when g is far from 0, as Δ is independent of these expectation values. The improvement from EV and PS-VD in estimating the order parameter is slightly less than that in estimating the energy, with a mean (maximum) 32-fold (56-fold) improvement from EV, and an 18-fold (51-fold) improvement from PS-VD. We attribute this to the increased sensitivity of Δ to noise at g = 0, and to the high performance of the raw results at g ≪ 0 (where depolarizing noise has little effect, as Δ ∼ 1).

FIG. 3. The conrotatory cyclobutene ring-opening pathway simulated in the seniority-zero subspace, comparing postselected VQE and echo verification (EV) on an optimized unitary pair-coupled-cluster ansatz. From left to right, the reaction path corresponds to the ring-opening reaction. For the ten-orbital case the unitary pair-coupled-cluster ansatz (evaluated in simulation) has less than 1.8 × 10^-4 energy difference from exact diagonalization in the seniority-zero space. The blue curves correspond to the exact diagonalization of the seniority-zero active-space Hamiltonian spanning 10 orbitals (lighter, broad blue line) and 6 orbitals (darker, narrow blue line). The red curve is the restricted Hartree-Fock (RHF) mean-field energy. Green points (darker green for 6 qubits, lighter green for 10 qubits) are the echo-verified experimental data, while yellow points (darker yellow for 6 qubits, lighter yellow for 10 qubits) are the postselected VQE energies. The 10-qubit VQE data is plotted on a discontinuous and different scale to preserve the visual scale of the reaction energy along the reaction coordinate.

III. CYCLOBUTENE RING OPENING
We further validated scalable error mitigation protocols by simulating the conrotatory ring-opening pathway of cyclobutene in active spaces of six orbitals and six electrons, and of ten orbitals and ten electrons, corresponding to six- and ten-qubit simulations of the Hamiltonian in Eq. (3). The mechanism of this ring opening is described by the Woodward-Hoffmann rules for pericyclic ring openings, corresponding to the in-phase combination of the two carbon 2p orbitals when brought together to form the four-membered carbon ring.
The geometries along the reaction path are determined from a nudged elastic band calculation using density functional theory (B3LYP) to evaluate forces. From the final structures, a minimal basis set (STO-3G) is used to generate the active-space Hamiltonians, which are projected into the seniority-zero sector. The Woodward-Hoffmann rules are a form of molecular orbital theory, and thus we expect this reaction to be qualitatively described within mean-field theory. This is verified numerically for our seniority-zero model, where the largest CI coefficient has an average value of 0.974(9) for six orbitals and 0.973(9) for ten orbitals, indicating a single-reference system. As such, our unitary pair-coupled-cluster doubles ansatz targets the dynamic correlation corrections to the mean field.
The average postselected-VQE absolute error is 0.058 ± 0.006 and 0.395 ± 0.023 Hartree for the six- and ten-orbital systems, respectively. The average echo-verified absolute error is 0.011 ± 0.005 and 0.064 ± 0.035 Hartree, respectively, a 5.51-fold and 6.12-fold improvement over the postselected-VQE average error. Compared to the raw VQE data, we find a 55.1-fold and 38.4-fold mean error reduction for the six- and ten-orbital systems, respectively. While there is notable improvement in the energy across the reaction pathway for the ten-orbital system, the magnitude of the errors is larger than the 0.037 Hartree energy difference between cyclobutene and 1,3-butadiene. Furthermore, a visual inspection of Figure 3 indicates high parallelity errors in the ten-orbital system. Given that the error bars on echo verification are smaller than the parallelity error (point scatter), we attribute the main source of error to device drift.

IV. OUTLOOK
We have observed the echo verification and virtual distillation error mitigation protocols suppress errors by 1-2 orders of magnitude on a range of quantum simulation problems using up to 20 superconducting qubits. We now consider the requirements for scaling these experiments to the classically intractable regime.
In Fig. 4 (top) we plot the number of experiments (shots) used in this work to simulate the RG Hamiltonian at g = −0.9 (where pCCD does not describe the system well), and compare this to theoretical estimates targeting the same model to within a sampling noise of 0.1 a.u., using the experimental fidelities observed for 10 qubits (fidelities taken from Fig. 4 (bottom right)). The 50× gap between theory and experiment for 10-qubit EV can be attributed mostly to extra circuits used to cancel out a background magnetic field (see App. B 4). The gap for our VD experiment is roughly 3× by comparison. Assuming the ability to freely weight our shot distribution, we estimate that a 50-qubit experiment (as a proxy lower bound for a beyond-classical quantum computation) using VD or EV would require 10^8 or 10^9 shots, respectively. This is executable on current hardware in a wall-clock time (see App. D 1) of > 1 hour or > 10 hours, respectively. Including the difference between experiment and theory at 10 qubits raises the cost of EV to 5 × 10^10 shots, which would require multiple days. These numbers do not include the multiplicative cost of variational optimization (see App. E). Furthermore, the accuracy requirements for electronic structure simulations may be more stringent than the 0.1 a.u. considered here. Methods to pre-optimize variational ansatzes classically, and applications of VQE to problems simpler than electronic structure, may thus be necessary for beyond-classical VQE experiments.
Device coherence presents an additional scaling challenge. To maintain circuit fidelity F over an O(N)-depth, fully parallel circuit as N scales from 10 to 50 requires all error rates to drop, and coherence times to increase, by roughly a factor of 25 (proportional to N^2). As any reduction in F incurs an O(poly(F^-1)) sampling cost [49,51], as F scales exponentially in the error rate, and as F ∼ 10% for PS-VD (Fig. 4 (bottom right)), we see little room for negotiation on this 25× lower bound. Achieving a 25× decrease in error rate would require a 50-qubit device with XEB error rates on all two-qubit gates ≤ 3 × 10^-4. This analysis moreover excludes ansatzes with depth O(N^2) or higher, or with significantly larger constant factors (in our case, the circuit depth of the bare VQE is 3N/2). For instance, successfully implementing a 50-qubit VQE with ansatz depth 3N^2/2 with EV or VD would require error rates to drop ∼ 1000×.
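The 25× figure follows from simple arithmetic, assuming a fully parallel depth-O(N) circuit (so the gate count grows as N^2) and a circuit fidelity F ≈ (1 − p)^gates for per-gate error p. The starting error rate below is illustrative, not a measured value:

```python
def required_error_rate(p_now, n_now, n_target):
    """Per-gate error rate keeping F fixed when gate count scales as N^2:
    (1 - p')^(n_target^2) = (1 - p_now)^(n_now^2)."""
    return 1 - (1 - p_now) ** (n_now**2 / n_target**2)

p10 = 7.5e-3                       # illustrative two-qubit error at N = 10
p50 = required_error_rate(p10, 10, 50)
print(p50, p10 / p50)              # ~3e-4 per gate, a ~25x reduction
```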
On a more positive note, in Fig. 4 (bottom left) we plot the absolute error in the energy estimates, averaged across all points in our RG model experiment. This error scales sublinearly in N after applying EV or VD (a clear asymptotic difference to raw or postselected VQE), which suggests that the 25× decrease in error rate required to keep sampling costs constant may yield significantly higher-precision results. A similar gap between EV/VD and VQE/PS-VQE for estimating the order parameter can be observed in Fig. 4 (bottom middle); the discrepancy in absolute scaling can be attributed mostly to the energy scaling as O(N^2), while Δ does not scale with N. This observation runs contrary to the observations in Fig. 3, where shifting from 6 to 10 qubits increased the mean error by a factor of 10. Investigating the mean error in estimating Pauli operators (App. F 2) suggests that the true scaling lies somewhere between these values. If the energy error scales linearly or better with the error rate per qubit (which is expected from simulations in Ref.
[51]), and scales less than quadratically in N, our requirement to scale error rates as O(N^-2) to preserve the circuit fidelity F will yield a drop in absolute energy error as a function of N. Thus, pinning down this scaling of experimental error with system size and error rates is a key area for future work.

The GS gates were calibrated by running two experiments in series. First, a single GS(0) = swap gate was implemented between qubits i and j (with virtual gates inserted); the qubits were prepared in the state |0+⟩ and measured in the ZX or ZY basis, or prepared in the state |+0⟩ and measured in the XZ or YZ basis. Sweeping β_i^(j) and β_j^(i) gave four datasets that could be fitted to extract optimal phase offsets. The resulting gate was then benchmarked by estimating ⟨XI⟩ and ⟨YI⟩ on the state [GS(0)]^{2k} |0+⟩, and ⟨IX⟩ and ⟨IY⟩ on [GS(0)]^{2k} |+0⟩, and fitting these to an oscillatory decay curve. Under this benchmark, the initial calibration typically reduced the accumulated phase per CZ to less than 30 milliradians. This benchmark was further used to calibrate, by sweeping β_i + β_j on pairs i, j acted on by the same GS gate, to remove the remaining oscillations. We find in practice that a cubic fit to 11 datasets is a robust way to perform a final estimate of β_i + β_j, with the residual phase less than 5 milliradians when calibration was successful. If the estimated fidelity of the resulting GS gate underperformed (> 1.5% error per CZ gate), qubit or coupler frequencies were reoptimized before recalibrating. Calibration was performed in parallel on sets of CZ gates that were run in parallel during an experiment, to mimic the local environment and compensate for two-qubit gate crosstalk.

Here we substantiate the claim that the UpCCD circuits realized on hardware in this work are in general not efficiently classically simulable. We do so by constructing a universal quantum gate set on a reduced Hilbert space (dual-rail encoding) with an O(1) depth overhead. This construction
shows that any nearest-neighbor, depth-O(N) circuit on a line of qubits can be mapped to a depth-O(N) UpCCD ansatz (and circuits with arbitrary connectivity to a depth-O(N^2) UpCCD ansatz), when allowing for the omission of gates (as the identity is not a GS gate). For this to hold it is pivotal that the GS gate family includes the swap gate and is thus not a matchgate family.
To demonstrate a universal gate set we use a dual-rail encoding of one logical qubit into two physical qubits (onto which the GS gates will act). We use a tilde to denote logical states and operations, and set |0̃⟩ = |01⟩ and |1̃⟩ = |10⟩. It is then straightforward to verify by direct computation that a GS gate acting on two physical qubits belonging to the same logical qubit can be used to realize logical Hadamard, Pauli, and Pauli-rotation gates. Logical two-qubit entangling gates can be realized by acting with GS gates on qubits belonging to two different logical qubits; the cnot gate, for instance, can be constructed from an appropriate sequence of GS gates.

A key advance in this work was the development of a mapping of our UpCCD ansatz to a 2D grid with local connectivity, such that a) the entire ansatz can be implemented in a depth of N/2 GS gates, and b) all ⟨X_i X_j + Y_i Y_j⟩ terms can be estimated using only N unique mappings. In this section, we explain this mapping in more detail and prove both a) and b) true.
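The dual-rail claim can be checked numerically: a GS gate acting within one logical qubit preserves the single-excitation code space {|01⟩, |10⟩} and acts there as an orthogonal 2×2 rotation (combined with a logical bit flip from the swap). Phase conventions in the GS matrix below are assumed, as before:

```python
import numpy as np

def gs(theta):
    """Givens-swap gate (assumed phase conventions)."""
    c, s = np.cos(theta), np.sin(theta)
    G = np.array([[1, 0, 0, 0],
                  [0, c, -s, 0],
                  [0, s,  c, 0],
                  [0, 0,  0, 1]], dtype=float)
    SWAP = np.eye(4)[[0, 2, 1, 3]]
    return SWAP @ G

# Dual-rail logical states: |0~> = |01>, |1~> = |10> (computational-basis
# indices 1 and 2).
zero_L, one_L = 1, 2

U = gs(0.4)
block = U[np.ix_([zero_L, one_L], [zero_L, one_L])]
# The restriction to the code space is orthogonal (a logical unitary)...
assert np.allclose(block @ block.T, np.eye(2))
# ...and there is no leakage out of the code space.
assert np.allclose(U[np.ix_([0, 3], [zero_L, one_L])], 0)
print(block)
```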
Let us first expand the discussion of the implementation of the UpCCD ansatz. The standard UpCCD ansatz, when mapped to qubits in the S0 approximation, can be Trotterized to a product of coherent pair excitations exp(θ_pq X_p Y_q − h.c.). As mentioned in the main text, the effect of a single GS gate (Eq. (7)) in the fermionic picture is to implement a single coherent pair excitation between the spatial orbitals assigned to qubits i and j, and then to swap the orbitals. This means a given orbital is not assigned to a fixed qubit throughout the experiment. For our implementation of the UpCCD ansatz (see e.g. Fig. 1, bottom), GS gates are applied in layers: between qubits i, j = 2n, 2n + 1 during an even-numbered layer, and between qubits i, j = 2n + 1, (2n + 2)%N during an odd-numbered layer (for n = 0, ..., N/2 − 1). We claim that, at half filling, any initial assignment of occupied orbitals to odd-numbered qubits and unoccupied orbitals to even-numbered qubits will cause N/2 such layers to implement a Trotterized form of Eq. (B26). To see this, note that during an even-numbered layer, orbitals sitting on even-numbered (odd-numbered) qubits shift to the right (left) around the ring of qubits, and vice versa during an odd-numbered layer. This in turn implies that the empty orbitals, initially assigned to even-numbered qubits, will only ever move to the right around this ring as the ansatz proceeds, as they will be assigned to odd-numbered qubits on odd-numbered rounds. Likewise, the filled orbitals will only ever move to the left around this ring. As an occupied orbital must cross (and thus interact with) every unoccupied orbital before it encounters the same one twice, the first N/2 layers of our UpCCD ansatz yield precisely the (N/2)^2 coherent pair excitations between each unoccupied and each occupied orbital, as required. (This is demonstrated for 10 qubits in Fig.
5 (top).) Let us now consider the measurement of non-local ⟨X_i X_j + Y_i Y_j⟩ terms as performed in this work. As mentioned in the main text, these are diagonalized on a pair of orbitals i, j via a GS(π/4) rotation, which necessitates the orbitals being on neighbouring qubits. The GS(π/4) rotation maps the operators Z_i, Z_j → D^+_ij, D^-_ij, where we define

D^±_ij = (Z_i + Z_j)/2 ± (X_i X_j + Y_i Y_j)/2.   (B27)

As we implement our ansatz on a 2 × N/2 grid, qubit i is not only connected to qubit (i ± 1)%N; the cross-links in the grid additionally connect qubit i to qubit (N − i − 1)%N. Moreover, note that our ansatz remains unchanged if we perform the cyclic permutation i → (i + k)%N; that is, we shift our orbital assignments, and all ansatz gates and parameters, around the ring of qubits. (Note that this technically swaps the definition of even and odd layers when k is odd: after this permutation the first layer of GS gates will be between qubits i, j = 2n + 1, (2n + 2)%N.) Following this permutation, cross-links will connect the orbital that was on qubit i to that which was on qubit [N − (i + 2k) − 1]%N. One can confirm that by running over k = 0, . .
., N/2 − 1 we find N/2 cyclic permutations, such that each occupied orbital is coupled to each unoccupied orbital by a cross-link for exactly one permutation. With just the cyclic permutation operation and no additional swap gates, this gives N/2 unique circuits that allow for measurement of all D^±_ij where i is occupied and j unoccupied. To obtain the circuits that couple occupied orbitals to occupied orbitals (and unoccupied to unoccupied), we require an additional layer of swap gates. (This is unavoidable given our initial ansatz ordering: all occupied qubits are connected by an even number of couplings, so a further direct coupling would yield an odd-order cycle, which does not exist on a square lattice.) After each permutation k above, we perform swaps between qubits l + (k%2), l + (k%2) + 1 for 0 ≤ l < N/4. One can confirm that this yields N/2 circuits such that a coupling between any pair of occupied orbitals can be achieved in one such circuit. (In this second set of circuits, some qubits are not coupled, and were not used to estimate expectation values in this experiment.) This can be confirmed by visual inspection of Fig. 5 (bottom). Code that implements this scheduling has also been uploaded to ReCirq [66].
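The crossing argument above (each occupied orbital meets each unoccupied orbital exactly once in N/2 brickwork layers) can be checked by tracking orbital positions through the swap network. This is a sketch of the argument, not the hardware schedule itself:

```python
def coverage(N):
    """Track orbital positions through N/2 brickwork GS layers and record
    which (occupied, unoccupied) orbital pairs interact. Occupied orbitals
    start on odd-numbered qubits, unoccupied on even-numbered qubits."""
    pos = list(range(N))            # orbital label currently at each qubit
    met = set()
    for layer in range(N // 2):
        pairs = ([(2 * n, 2 * n + 1) for n in range(N // 2)] if layer % 2 == 0
                 else [(2 * n + 1, (2 * n + 2) % N) for n in range(N // 2)])
        for i, j in pairs:
            a, b = pos[i], pos[j]
            if a % 2 != b % 2:      # opposite-parity = occupied/unoccupied
                met.add((min(a, b), max(a, b)))
            pos[i], pos[j] = pos[j], pos[i]   # the GS gate swaps the orbitals
    return met

N = 10
met = coverage(N)
expected = {(a, b) for a in range(N) for b in range(a + 1, N) if a % 2 != b % 2}
print(len(met), len(expected))   # (N/2)^2 = 25 pairs each
assert met == expected
```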

Scheduling of EV circuits
In this section we outline the additional experimental details required to implement EV in this work.

FIG. 5. (middle) After executing the UpCCD ansatz we have not used the cross-couplers (red squares), allowing us to virtually rotate the entire grid during compilation for the purposes of measurement. This virtual rotation allows all occupied and unoccupied orbitals to be coupled at some step, without any increase in circuit depth. (bottom) The virtual rotation cannot, however, bring two occupied or two unoccupied orbitals to nearest-neighbour qubits so that they may be coupled. To achieve this, we require an extra round of swaps (red dashed boxes). One can confirm that across the middle and bottom layers, all pairs of qubits are coupled in at least one configuration.

We
implemented control-free EV for O = Z_i, Z_i Z_j, or D^+_ij (Eq. (B27)), using a vacuum reference state |φ⟩ = |00...⟩. We prepare the superposition (1/√2)(|ψ⟩ + |φ⟩) by acting the UpCCD ansatz circuit (Eq. (4)) on the cat state (1/√2)(|0000...⟩ + |0101...⟩) (see Fig. 1 (middle, 'Echo verification') and Fig. 1 (bottom)). Then, all operators O were implemented by compiling them to a virtual Z rotation on a single qubit. Finally, to measure ⟨Ψ|φ⟩⟨ψ|Ψ⟩, we inverted the UpCCD circuit and the cat state preparation. This maps the desired matrix element |φ⟩⟨ψ| → |0⟩⟨1| ⊗ |00...⟩⟨00...|, allowing its measurement via a single-qubit rotation on a single 'measurement' qubit and readout of all qubits in the computational basis. (Reading out all qubits is essential, as we record an estimate of 0 for the measurement of ⟨Ψ|φ⟩⟨ψ|Ψ⟩ unless all qubits other than the measurement qubit read out 0.) Our implementation of O as a virtual Z rotation allows us to replace O → O_α = cos(α) + i sin(α)O to remove our susceptibility to a uniform background magnetic field e^{ih Σ_j Z_j}. Such a field transforms Eq. (8) to (1/2)⟨ψ|O_α|ψ⟩ e^{i(φ+hN/2)}; fitting this at three values of α allows us to simultaneously estimate h, the fidelity F, and ⟨O⟩. (This is preferable to independent calibration, as h fluctuates with a 1/f spectrum.) Some further experimental optimization was made for the EV circuit that was not available for the VQE or VD circuits. Many of the gates in the final EV circuit [Fig.
1, bottom] cancel to the identity, as the second half of the circuit is the inversion of the first half. We identify and prune these to increase the overall circuit fidelity, and insert echo pulses into the resulting empty space. We further compile an echo pulse for the entire second half of the EV circuit [this can be done as XX ∘ GS(θ) ∘ XX = GS(θ)]. To unbias readout, we measure the single measurement qubit in the ±X and ±Y bases. This, combined with the additional circuits to remove a background magnetic field, raises the total number of circuits to 6N² + 6N (12 circuits for each of the N(N+1)/2 measured operators) to estimate the expectation value of our chemistry (RG) Hamiltonian. Shots were distributed across these circuits following the term weight in the Hamiltonian, with some additional restrictions imposed by classical readout hardware (see App. D 3).
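The recovery of the fidelity, the background phase, and ⟨O⟩ from measurements at several values of α can be sketched as follows. This is a minimal illustration, not the experiment's code: we assume the signal model S(α) = ½F(cos α + i sin α⟨O⟩)e^{iθ}, with θ collecting the background-field phase φ + hN/2, and the function name is our own.

```python
import numpy as np

def fit_ev_signal(alphas, signals):
    """Recover F, theta and <O> from echo-verification signals.

    Assumed model: S(a) = 0.5 * F * (cos(a) + 1j*sin(a)*<O>) * exp(1j*theta).
    The model is linear in u = 0.5*F*exp(1j*theta) and v = u*<O>, so three
    (or more) alpha points determine it by complex least squares.
    """
    A = np.stack([np.cos(alphas), 1j * np.sin(alphas)], axis=1)
    (u, v), *_ = np.linalg.lstsq(A, signals, rcond=None)
    return 2 * abs(u), np.angle(u), (v / u).real

# check on synthetic data with F = 0.8, theta = 0.3, <O> = 0.6
alphas = np.array([0.0, np.pi / 3, 2 * np.pi / 3])
signals = 0.5 * 0.8 * (np.cos(alphas) + 1j * np.sin(alphas) * 0.6) * np.exp(0.3j)
F, theta, O = fit_ev_signal(alphas, signals)
```

Given a known reference phase φ, the background field strength h then follows from θ = φ + hN/2.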
Appendix C: Data processing

Optimal linear combinations of non-independent expectation values
Once we have used our variational ansatz to prepare an approximation ρ(θ) ∼ |ψ(θ)⟩⟨ψ(θ)| to the ground state of our target problem, it remains to measure the quantum device to estimate the energy (or other properties of the state). As our devices are heavily coherence limited, rather than attempting to perform this estimation in a single shot, we write our Hamiltonian as a sum of simpler terms, H = Σ_i c_i Q_i, (C1) and estimate the expectation value of each such term, yielding the energy estimate E = Σ_i c_i Trace[Q_i ρ]. (C2) (Note that we are not placing any restrictions on the Q_i at this point.) The method by which we estimate Trace[Q_i ρ] will depend on which error-mitigation methods are being implemented. However, all schemes will return a set of estimates of Trace[Q_i ρ] with a covariance matrix Σ_{i,j}. In this experiment, it turns out that our choice of {Q_i} will not be linearly independent. The reason for this is postselection: we desire our choice of {Q_i} to allow us to measure S_z = Σ_i Z_i in each experiment. To achieve this, we measure the operators Z_j, Z_jZ_k, and D±_jk (Eq. (B27)), but D⁺_jk + D⁻_jk = Z_j + Z_k. This leaves us with a degree of freedom in our choice of c_i that we may optimize over once a dataset is taken.
In order to choose the c_i, we perform a constrained Lagrangian minimization. Our target cost function is the variance of the estimate in Eq. (C2), subject to Eq. (C1) as a constraint. Let us fix some linearly independent basis of operators {P_j} (e.g. Pauli operators), and write H = Σ_j h_j P_j and Q_i = Σ_j q_{i,j} P_j. (As the P_j form a basis, we have no freedom in our choice of h_j or q_{i,j}.) Our constraints then take the form Σ_i c_i q_{i,j} = h_j. (C5) These can be imposed via Lagrange multipliers λ_j, yielding the Lagrangian L = Σ_{i,i′} c_i Σ_{i,i′} c_{i′} − Σ_j λ_j (Σ_i c_i q_{i,j} − h_j). Differentiating with respect to c_i and setting the result equal to 0 yields (using the fact that Σ_{i,j} is a positive matrix) 2 Σ_{i′} Σ_{i,i′} c_{i′} = Σ_j λ_j q_{i,j}. Recognising this as a vector equation, and as the matrix Σ is invertible, imposing the constraint yields c = Σ⁻¹Q(QᵀΣ⁻¹Q)⁻¹h, where Q is the matrix with entries q_{i,j}.

Many circuits can be sent to the device in a single batch. However, during variational optimization, depending on the optimizer used, at least one call to the device per epoch is needed, because the quantum circuits to be executed next depend on the step taken by the optimizer, which in turn depends on the measurement results of the previous batch. This needs to be taken into account when comparing the performance of different optimizers. The time per shot was found to be 1 × 10⁻⁵ s, essentially independent of the circuit depth (probably because it is dominated by readout, reset, and time needed by the control electronics). The number of shots for a circuit can currently only be set for a batch as a whole, which in practice limits our ability to distribute shots in a way that would minimize the variance of the resulting estimator. The total wall-clock time then combines these per-shot and per-batch contributions.
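The closed form c = Σ⁻¹Q(QᵀΣ⁻¹Q)⁻¹h that this minimization yields can be sketched numerically. This is a minimal illustration with hypothetical operators: here Q₃ = Z₁ + Z₂ plays the role of the redundant D⁺ + D⁻ combination.

```python
import numpy as np

def optimal_coefficients(Q, h, Sigma):
    """Minimise Var[E] = c^T Sigma c subject to the constraint Q^T c = h.

    Q[i, j] = q_{i,j} expands measured operator Q_i in a Pauli basis P_j,
    h[j] are the Hamiltonian coefficients, Sigma is the covariance matrix
    of the Tr[Q_i rho] estimates.  The Lagrange conditions give
    c = Sigma^{-1} Q (Q^T Sigma^{-1} Q)^{-1} h.
    """
    SiQ = np.linalg.solve(Sigma, Q)        # Sigma^{-1} Q
    lam = np.linalg.solve(Q.T @ SiQ, h)    # (Q^T Sigma^{-1} Q)^{-1} h
    return SiQ @ lam

# Redundant set {Z1, Z2, Z1+Z2} expanded in the basis {Z1, Z2}, H = Z1 + Z2.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h = np.array([1.0, 1.0])
Sigma = np.diag([1.0, 1.0, 0.5])           # the joint estimate is less noisy
c = optimal_coefficients(Q, h, Sigma)      # -> [0.2, 0.2, 0.8]
```

The optimum leans on the lower-variance joint operator while still satisfying the constraint exactly.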

Hamiltonian decomposition schemes
In this section we define the different options chosen to decompose a Hamiltonian into terms Q_i for measurement. We are free to group the measurement of several Q_i together, as long as they commute and an appropriate diagonalization circuit can be found. For our numerics, we study the following measurement strategies:

• Termwise - we take individual Pauli operators and measure each Q_i using an independent measurement circuit.
• XX + Y Y - we first measure all Q_i ∈ {Z_j, Z_jZ_k} in a single-shot measurement. Next, we measure Q_i ∈ {X_jX_k + Y_jY_k}, grouping disjoint pairs (j, k) together into N sets following the scheduling outlined in Sec. B 3. This mostly matches the scheme used for the Raw VQE, PS-VQE, and PS-VD expectation-value estimates in Fig. 2, and the scheme used for the PS-VQE estimates in Fig. 3.
• XX + Y Y + IZ + ZI - we group each pair (i, j) into the operators D⁺_ij (Eq. (B27)), and measure each term separately. This matches the measurement scheme used for EV. As the operators chosen are Hermitian and unitary, the EV variance is equivalent to the standard tomography variance [63]. By choosing only D⁺_ij, and not adding D⁻_ij to our set of measurements, the set {Q_i} becomes linearly independent, and so the relative coefficients c_i (Eq. (C2)) are fixed.

Some notable differences occur between the numerical estimates made here and those implemented on hardware. (Ultimately, the number of shots in the experiment was chosen to be low enough that the experimental error mostly dominated, rather than being optimized based on preliminary calculations.) In the experiment, additional estimates of Z_j + Z_k were extracted alongside each measurement of X_jX_k + Y_jY_k and combined using the techniques outlined in App. C 1. Each circuit was repeated with and without an additional π pulse on all qubits to unbias readout noise. In the experiment, the shot distribution was chosen to be uniform for VQE (40,000 per circuit) and VD (100,000 per circuit). (One can observe in Fig. 4 that an excess of shots was taken in both cases.) For EV, shots were distributed according to the relative weight of Q_i in the Hamiltonian. However, this was made more complicated by a technical requirement that all shots executed in a single batch must have the same number of repetitions. To accommodate this, the number of shots used to estimate a single Q_i was rounded up or down to an integer multiple of a fixed base (40,000). Furthermore, as mentioned in App. B 4, when performing EV to estimate each Q_i, 12 unique circuits were run to cancel out a background magnetic field and depolarize readout noise.
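The grouping of disjoint pairs (j, k) into parallel measurement rounds can be illustrated with the standard round-robin ('circle') schedule, which covers all N(N−1)/2 pairs in N−1 rounds of N/2 disjoint pairs each. This is a sketch of the idea only; the hardware schedule of App. B 3 additionally respects the 2 × N/2 grid connectivity.

```python
def round_robin(n):
    """All-pairs schedule for n items (n even): n-1 rounds of disjoint pairs."""
    items = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(items[i], items[n - 1 - i]) for i in range(n // 2)])
        items.insert(1, items.pop())   # rotate every item except items[0]
    return rounds

rounds = round_robin(8)   # 7 rounds, each pairing all 8 qubits disjointly
```

Each round can then be measured in a single batch of circuits, since its pairs act on disjoint qubits.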

Shot distribution
Once the decomposition of the Hamiltonian is decided, the variance of the resulting energy estimator further depends on how the available shot budget is distributed over the Q_i (or over the jointly measurable groups of Q_i). In principle, estimates of the (co-)variances, obtained from a small number of shots, from previous measurements at nearby VQE parameter values according to (C13), or in adaptive schemes, can be used to distribute shots in an asymptotically optimal way to reduce the variance. In practice, one is limited by the overhead of calling the device and by limitations in setting the shot count for individual circuits inside a batch (see Section D 1).
In our estimates we assume the ability to allocate shots per distinct circuit (and not only on a per-batch basis), but only consider the two non-adaptive shot distribution schemes that need no input from the quantum computer:

• Uniform distribution - distribute the total shot budget per expectation value uniformly over the term groups, i.e., spend the same number of shots on each group of Q_i that are measured jointly according to the decomposition scheme.
• According to term group weights - the total shot budget is distributed proportional to the coefficients |c_i| in Eq. (C2); in the case that multiple Q_i are measured jointly, their weights are combined. This is optimal in the case that the variances of all Q_i are equal and the covariances vanish [17]. (Note that as the three measurement strategies measure only linearly independent operators, the |c_i| can be fixed prior to measurement.) The orange and green data points for the XX + Y Y + IZ + ZI scheme with and without EV coincide.
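The weight-proportional distribution, combined with the hardware batching constraint (shot counts as multiples of a fixed base) mentioned in App. B 4, might be sketched as follows. This is illustrative only; the rounding policy is our assumption.

```python
import numpy as np

def distribute_shots(weights, total_shots, base=40_000):
    """Split a shot budget proportional to |c_i|, rounded to multiples of
    `base` (circuits in a batch must share a repetition count)."""
    w = np.abs(np.asarray(weights, dtype=float))
    ideal = total_shots * w / w.sum()
    # round to the nearest multiple of `base`, but never below one batch
    return np.maximum(base, base * np.round(ideal / base)).astype(int)

shots = distribute_shots([1.0, 2.0, 1.0], total_shots=160_000)
```

Terms with larger |c_i| contribute more to the energy variance, so they receive proportionally more shots.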

Wall-clock time extrapolation to large system sizes
Using the protocol described in the previous sections, we are able to estimate the cost of measuring the RG model at g = −0.9 to within 0.1 a.u. as we enlarge the system size, while optimizing our shot distribution (App. D 3) for the wall-clock time (App. D 1). In Fig. 6, we plot the cost in terms of the number of shots (top) and in wall-clock time for superconducting hardware (bottom). Estimates are performed for N = 6, 8, 10, 12; given the good fit to a line on a log-log plot, and that a power-law scaling is expected, we are able to extrapolate this to larger system sizes. This data was combined with fidelity estimates for a 10-qubit system (Fig. 4[bottom-right]) to yield the curves in Fig. 4[top].
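The log-log extrapolation step can be sketched in a few lines (a minimal illustration with synthetic data, not the experimental cost numbers):

```python
import numpy as np

def extrapolate_powerlaw(Ns, costs, N_target):
    """Fit cost = a * N^b on a log-log scale and extrapolate to N_target."""
    b, log_a = np.polyfit(np.log(Ns), np.log(costs), 1)
    return float(np.exp(log_a) * N_target ** b)

Ns = np.array([6, 8, 10, 12])
costs = 3.0 * Ns ** 2.5                      # synthetic data, known exponent
est = extrapolate_powerlaw(Ns, costs, 20)    # recovers 3 * 20**2.5
```

A degree-1 fit in log-log coordinates is exactly a power-law fit; the slope b is the extrapolation exponent quoted in Fig. 4.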
Appendix E: Variational optimization of a 6-qubit experiment

The Conjugate Model Gradient Descent optimizer
The large number of evaluations needed for ansatz parameter optimization on quantum hardware is a major impediment to keeping the cost of variational algorithms (in terms of wall-clock time) manageable. To mitigate the overhead incurred by sending jobs to and receiving results from a device (see App. D 1), it is beneficial if the optimizer can request a batch of cost-function evaluations at once before making a step. In [68], surrogate-model-based optimizers were found to have good performance under this constraint. In that approach, a quadratic model function is fitted to present and past expectation-value estimates in the vicinity of the current ansatz parameter vector. Then, after making a step, a batch of circuits corresponding to points in the vicinity of the new parameter vector is evaluated, and the stepping procedure repeated. In this appendix we develop a natural extension of this procedure by combining it with the conjugate descent algorithm.
The Conjugate Model Gradient Descent optimizer is a surrogate-model-based optimization algorithm, with the additional improvement that the gradient extrapolated from fitting a quadratic model to the cost function is used in the framework of conjugate gradient descent to make a step in parameter space. The conjugate gradient method was developed by Hestenes and Stiefel [69]. For the special case of quadratic cost functions over an n-dimensional space, conjugate gradient methods can be proven to converge in n iterations [70]. In practice, the conjugate gradient method is found to work well for cost functions that are locally approximately quadratic. As this is an assumption one needs to make anyway when using quadratic surrogate models, it is natural to combine conjugate gradient descent with quadratic-model-based optimizers. In conjugate gradient descent the steps are taken in the direction of the so-called conjugate gradient s_n = g_n + β_n s_{n−1}, where g_n is the gradient (in our case, that of the quadratic model fitted to the samples around the current position) and β_n is a scaling factor whose form for the n-th iteration is given in [71], with s_0 = g_0 fixed for the first step of the algorithm.
In Alg. 1, we present our Conjugate Model Gradient Descent algorithm. We assume in this algorithm access to a (noisy) oracle for the target function f to be optimized. In practice, this is given by a call to a quantum device with a target error, which must be made small enough to enable convergence. The key steps of Alg. 1 include:

Let δ ← δ/(m + 1)^ξ
7: Sample k points uniformly at random from the δ-neighborhood of x to generate a set S
8: for each x′ in S ∪ {x} do …
Fit f̃(x′) = x′ᵀAx′ + b·x′ + c ∼ y to the points (x′, y) in L using least-squares linear regression
18: Let gm be the gradient of f̃ at x (i.e. gm = b); add gm to the list G
30: Add cgm to the list H
31: Let m ← m + 1
33: end for
34: return x
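A compact reconstruction of the optimizer might look as follows. This is a sketch under stated assumptions: we use the Polak-Ribiere+ rule for β_n and a γ/(m + 1 + A)^α learning-rate decay, which match the listed hyperparameters but are our reading of the fragmentary listing, not the authors' exact code.

```python
import numpy as np

def cmgd(f, x0, *, gamma=0.5, delta=0.25, n_iter=25, k=None,
         alpha=0.6, A=5.0, xi=0.1, seed=None):
    """Conjugate Model Gradient Descent (sketch).

    Each iteration: sample k points in a shrinking delta-ball around x,
    fit a quadratic surrogate f~(q) = q^T M q + b.q + c in local
    coordinates q = x' - x, read off the surrogate gradient g = b, and
    step along a conjugate direction s = g + beta * s_prev.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    if k is None:
        k = (d + 1) * (d + 2) // 2        # unknowns in a quadratic model
    s = g_prev = None
    for m in range(n_iter):
        dm = delta / (m + 1) ** xi                        # sample-radius decay
        pts = np.vstack([x + dm * rng.uniform(-1, 1, (k, d)), x])
        ys = np.array([f(p) for p in pts])
        # least-squares fit of the quadratic surrogate in local coordinates
        iu = np.triu_indices(d)
        feats = np.array([np.concatenate([(q[:, None] * q[None, :])[iu],
                                          q, [1.0]]) for q in pts - x])
        coef, *_ = np.linalg.lstsq(feats, ys, rcond=None)
        g = coef[len(iu[0]):len(iu[0]) + d]               # surrogate gradient at x
        if g_prev is None:
            s = g
        else:
            # Polak-Ribiere+ scaling factor (assumed form of beta_n)
            beta = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev + 1e-12))
            s = g + beta * s
        g_prev = g
        x = x - gamma / (m + 1 + A) ** alpha * s          # learning-rate decay
    return x

# noiseless quadratic oracle: the surrogate fit is exact, so cmgd converges
x_opt = cmgd(lambda p: np.sum((p - 1.0) ** 2), np.zeros(2), seed=0)
```

With a noisy oracle, k and the sample radius trade off model accuracy against shot cost, which is the regime discussed in the hyperparameter-tuning section below.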

Hyperparameter tuning
For all experiments shown in the main text, parameters were obtained by a noiseless simulation using L-BFGS-B, starting from θ = 0. We find that in the absence of experimental noise, gradient-based optimizers such as L-BFGS-B converge well to an optimal solution. However, this is not the case in the presence of experimental or sampling noise, which can make many estimators unstable. This has led previous efforts towards stochastic [28] or model-based [27] optimizers, including the Model Gradient Descent optimizer upon which we based our Conjugate Model Gradient Descent algorithm in the previous section.
The hyperparameters used in the Conjugate Model Gradient Descent algorithm were either chosen through an intuitive approach due to their physical meaning (such as the sample radius, set to match the parameters being perturbed by a random variable from [−0.25, 0.25], and the maximum number of iterations, set to stay within reasonable wall-time), or via grid search for the rest of the hyperparameters, ensuring they are good enough to work for the variety of cases examined while avoiding overfitting. Explicitly, the hyperparameters used are included in Table I. The hyperparameters we chose work equally well for Model Gradient Descent as for our Conjugate variant. We found that the conjugate method can sometimes speed up convergence or help prevent getting stuck in local minima. We illustrate this with two examples. In Fig. 7 we show how Conjugate Model Gradient Descent is able to speed up in case the initial learning rate was chosen too small. In Fig. 8 we show an example of hyperparameters for which Conjugate Model Gradient Descent is able to converge to a lower minimum than plain Model Gradient Descent; in this case the number of samples used to construct the quadratic model is higher, namely k = 4(N + 1)(N + 2), allowing the performance difference of the two optimizers to become more pronounced.

Experimental results
We test the hyperparameter-tuned Conjugate Model Gradient Descent algorithm on the problem of optimizing the UpCCD ansatz for the ground state of the RG Hamiltonian at a coupling g = −0.9. In order to demonstrate convergence, we perturb our ansatz parameters from values optimized in a noiseless classical simulation (using the L-BFGS-B optimizer as described above) by a random variable drawn from [−0.25, 0.25]. From this point, Conjugate Model Gradient Descent converged in 9 iterations, with each iteration requiring 46 calls to the cost function (414 calls total). We observe that the optimizer successfully finds an energy below the initial point (by 0.07 a.u.), which demonstrates the well-known ability of VQE to mitigate coherent noise [9,10]. This improvement is reflected in the results mitigated with echo verification as well, which demonstrated a 0.04 a.u. reduction in error. However, this is a relatively small reduction compared to the overall error observed. We note that this is a significantly higher error than that observed in Fig. 12, which we attribute to poor calibration on the day. If the absolute gain in energy were replicated in Fig. 12, it would account for ∼ 40% of the overall error; the relative gain in error in this case, however, was only ∼ 10%.
Appendix F: Additional results and analysis

Virtual distillation with and without postselection
In this appendix, we show the effect that postselection has on virtual distillation. In Fig. 10, we repeat the plot of Fig. 2, but with data from VD without postselection on top. (Other lines are removed to make the plot clearer.) We see that by itself, VD [light blue curve] outperforms postselection in terms of energy estimation across most points of the energy curve, but it is not variationally bound. This is because, without postselection, VD returns unphysical estimates, such as expectation values lying outside [−1, 1]. (This is a consequence of the two copies of the state not necessarily being subject to the same noise.) As a consequence, our estimate of the order parameter is complex and non-physical, and therefore not plotted. This issue can be somewhat rectified by bounding the estimates of Z_i and D±_ij to lie in [−1, 1]. Doing so (Fig. 2, purple curve) allows an estimate of the order parameter to be made; however, it is clearly qualitatively incorrect. Moreover, although the energies are shifted up, some are still not variationally bounded, and those that are often overshoot the energy, making the absolute error worse. (The variational bound could be restored by enforcing positivity conditions on the set of generated estimates [67].) In summary, the combination of postselection and virtual distillation is seen here to have a greater effect than the sum of its parts for the energy error.
We attribute the gain from postselection in virtual distillation to two sources. Firstly, postselection removes final readout noise, which VD does not naturally correct. (As the estimation of Tr[ρ²] involves a highly correlated measurement, it cannot be easily corrected by most readout-correction schemes, which are designed to estimate local Pauli expectation values.) Secondly, though we are in principle only postselecting on the sum of the number of excitations in ρ⁽¹⁾ and ρ⁽²⁾ being equal to N, when combined with virtual distillation this projects out noise such as T1 decay entirely, and suppresses any particle-non-conserving noise to second order. In short, this is because breaking number conservation separately on ρ⁽¹⁾ and ρ⁽²⁾ yields states which do not overlap (as they must have different particle number), and these states cancel out when taking the product ρ⁽¹⁾ρ⁽²⁾. Let us study this in more detail. Consider a postselected estimate of Tr[ρ²O] using VD for O = Z_j (as described in the main text). After postselecting on Σ_j (1 ⊗ Z_j + Z_j ⊗ 1) = N, we can write our global state as

ρ₂ = Σ_{p,q,r,s} c_{p,q,r,s} |p⟩⟨q| ⊗ |r⟩⟨s|,  (F1)

where p, q, r, s index the basis states of both systems. (Following projection, ρ₂ will no longer be a product state.) Let us write n_p = ⟨p| Σ_j Z_j |p⟩, etc., for the number of excitations in these basis states (i.e.
the Hamming weight of the index), so that our projection requires c_{p,q,r,s} = 0 unless n_p + n_r = n_q + n_s = N. (This assumes WLOG that we are at half-filling, where ideally n_p = n_r = n_q = n_s = N/2.) Then, as O preserves particle number for O = Z_j, the only contribution to

Tr[ρ₂ S⁽²⁾ (Z_j ⊗ I + I ⊗ Z_j)] = Σ_{p,q,r,s} c_{p,q,r,s} δ_{s,p} δ_{r,q} (⟨s|Z_j|s⟩ + ⟨r|Z_j|r⟩),  (F2)

comes from the terms c_{p,q,q,p}. (The same is true for our estimate of Trace[ρ₂S⁽²⁾].) This implies that a non-zero contribution to Trace[ρ₂S⁽²⁾] comes only from matrix elements |p⟩⟨q| ⊗ |q⟩⟨p| where n_p = N/2 + δ, n_q = N/2 − δ. When δ = 0, this matrix element is in the number-conserving sector. When δ ≠ 0, it corresponds to a product of coherent superpositions between the N/2 + δ and N/2 − δ sectors on both copies. This implies that noise channels such as the T1 channel will be completely mitigated, as such channels do not generate these off-diagonal elements. Coherent noise may cause some of these off-diagonal elements to appear (consider, e.g., a single-qubit X rotation on the state (|01⟩ + |10⟩)/√2); however, this must happen on both copies to contribute, which suppresses it to second order. (This is in contrast to coherent noise that preserves number, to which we can be first-order sensitive.) The above analysis has assumed that the postselection is perfect; readout noise would complicate matters.
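This cancellation can be checked numerically in a minimal two-qubit example (our own illustration, not the experiment's data): a T1-decayed component |00⟩ mixed into |ψ⟩ = (|01⟩ + |10⟩)/√2 biases plain VD, while postselecting the two copies on total excitation number removes it exactly.

```python
import numpy as np
from functools import reduce

kron = lambda *ops: reduce(np.kron, ops)
I2, Z = np.eye(2), np.diag([1.0, -1.0])

# single copy: ideal state |psi> = (|01>+|10>)/sqrt(2), T1-decayed |00>
psi = np.zeros(4); psi[[1, 2]] = 1 / np.sqrt(2)
dec = np.zeros(4); dec[0] = 1.0
p = 0.8
rho = p * np.outer(psi, psi) + (1 - p) * np.outer(dec, dec)

# two-copy operators (4 qubits, dim 16)
S = np.zeros((16, 16))                     # SWAP of the two 2-qubit registers
for a in range(4):
    for b in range(4):
        S[b * 4 + a, a * 4 + b] = 1.0
O = 0.5 * (kron(Z, I2, I2, I2) + kron(I2, I2, Z, I2))   # symmetrised Z_0
# projector onto n1 + n2 = N excitations across both copies (N = 2)
P = np.diag([1.0 if bin(i).count("1") == 2 else 0.0 for i in range(16)])

rr = np.kron(rho, rho)
vd = np.trace(S @ rr @ O).real / np.trace(S @ rr).real       # plain VD
rr_ps = P @ rr @ P                                            # postselected
vd_ps = np.trace(S @ rr_ps @ O).real / np.trace(S @ rr_ps).real

# Plain VD is biased away from <psi|Z_0|psi> = 0; postselected VD is exact,
# since the |00> component cannot satisfy the joint number constraint.
```

The decayed copy has one fewer excitation, so every cross term involving it is annihilated by the projector, exactly as in the analysis above.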

Distribution of errors in Pauli expectation value estimation
In Fig. 11, we plot a histogram of the error in estimating the expectation values of Pauli operators, taking data from Fig. 2, Fig. 12, and Fig. 3. We observe that the mean error in both cases for cyclobutene is slightly worse than for the RG model, but only by a small percentage. (This justifies our claim in Sec. IV that a 0.1 a.u. error in the RG Hamiltonian is approximately 1−2 orders of magnitude larger than chemical accuracy for a 10-qubit system.) As the difference between the systems here is minimal (changing only the value of some virtual Z rotations), we can only attribute this difference to the performance of the device whilst taking these datasets. Going from 6 to 10 qubits increases the mean error in all experiments by a factor of 1.5−2×. As the Hamiltonian for both the RG model and cyclobutene has a number of terms scaling as O(N²), assuming that the errors follow a roughly Gaussian distribution predicts that the energy error scales as O(N) times the error per individual operator. This predicts an increase in energy error of 2.5−3.3×, which lies in between the observations of Fig. 3 and Fig. 4. This suggests that the RG model energy estimation at large N may have benefited from some benevolent cancellation of noise, and likewise the cyclobutene energy estimation at small N.

Smaller studies of the RG Hamiltonian
In Fig. 12, we present experimental simulations of the ground state of the RG Hamiltonian for 4, 6, and 8 qubits. The method used to produce these figures is identical to that used in the production of Fig. 2, save for the number of qubits and shots used. Aggregated data from these figures was used to generate the scaling plots of Fig. 4.

FIG. 1. The UpCCD ansatz and its compilation to a 2D superconducting transmon grid. (top) Decomposition of the gates used in this experiment to CZ and single-qubit gates. See supplemental material for details. (second from top, left) 2 × 5 grid with couplers in a square lattice geometry, showing couplers used during the ansatz (ring couplers, purple), and those used only during measurement (cross-couplers, red). (second from top, right) 2+1D circuit cartoon of a combined ansatz and measurement on a 2 × 5 transmon qubit array. (third from top) Cartoon of error-mitigation techniques used in this experiment. Different circuit pieces are described in the legend. (bottom) An example 8-qubit echo verification circuit to measure the expectation value of (X₁X₇ + Y₁Y₇ + Z₁ + Z₇)/2. Shaded gates at the top and bottom of the qubit array wrap around the 2 × 4 ring.

FIG. 4. Scaling the simulation of the RG model to larger qubit counts. (top) Number of shots required for convergence at g = −0.9. Dots give numbers chosen for the experiment; crosses and pluses give simulated estimations using two types of term grouping (see App. D 3) using observed fidelities of a 10-qubit experiment. (bottom-left) Experimental energy error (vs the UpCCD ground state), averaged over all points studied of the RG model. Error bars show sample standard deviation, and lines a power-law fit (exponent shown) as a guide to the eye. (bottom-middle) Experimental error in the order parameter (vs the UpCCD ground state), averaged over all points studied of the RG model. Error bars and lines as in bottom-left. (bottom-right) Different fidelity metrics for postselected VQE, EV, VD, and Loschmidt echo (see legend), averaged over all points studied of the RG model.
Author contributions: T.E.O. calibrated the device and ran the experiments. V.E.E., G.A., and C.G. designed the UpCCD ansatz. V.E.E., G.A., and T.E.O. designed the scheduling of the ansatz onto a 2 × N grid. F.G. and C.G. designed the conjugate model gradient descent algorithm and preoptimized ansatz parameters for the ansatz. N.C.R. wrote the pair coupled cluster code, pre-optimized ansatz parameters, and performed the classical chemistry calculations for the cyclobutene model. T.E.O., W.J.H., S.P., K.K., and R.B. designed and optimized the error mitigation and EV measurement strategies. O.O. and C.G. developed the BQP-completeness proof for the UpCCD ansatz. T.E.O., N.C.R., C.G., F.G., V.E.E., and R.B. wrote the paper. T.E.O., C.G., R.B., and N.C.R. led and co-ordinated the project. All authors contributed to revising the manuscript and writing the Supplementary Information. All authors contributed to the experimental and theoretical infrastructure to enable the experiment.
Appendix B: Further details of the UpCCD ansatz 1. BQP-completeness of nearest neighbor Givens-swap circuits

FIG. 5. Detail of the scheduling of pair-excitation interactions during the UpCCD ansatz, and measurement scheduling. (top) The physical rotations during the execution of swap layers of the UpCCD ansatz, showing that all occupied (even index) and all unoccupied (odd index) orbitals are coupled (purple squares) at some point during the ansatz. (middle) After executing the UpCCD ansatz we have not used the cross-couplers (red squares), allowing us to virtually rotate the entire grid during compilation for the purposes of measurement. This virtual rotation allows all occupied and unoccupied orbitals to be coupled at some step, without any increase in circuit depth. (bottom) The virtual rotation cannot, however, bring two occupied or two unoccupied orbitals to nearest-neighbour qubits so that they may be coupled. To achieve this, we require an extra round of swaps (red dashed boxes). One can confirm that across the middle and bottom layers, all pairs of qubits are coupled in at least one configuration.

12: for each tuple (x′, y′) in L do
13: if |x′ − x| < δ then
14: Add (x′, y′) to L′

FIG. 7. The distance of the energy from the value calculated in the noiseless regime by the L-BFGS-B optimizer, plotted against the wall-time required to reach that resolution. The system displayed is an RG Hamiltonian for 6 qubits with a coupling constant of g = −0.9; the data consist of 10 runs, with the lines being the average of those runs, for the Conjugate Model Gradient Descent (CMG) and the Model Gradient Descent (MG) of [68].

FIG. 8. Comparison of the two variants of the Model Gradient Descent optimizer. In this case the number of samples used to construct the quadratic model is higher, namely k = 4(N + 1)(N + 2), allowing the performance difference of the two optimizers to become more pronounced: Conjugate Model Gradient Descent manages to consistently converge at better parameters, while Model Gradient Descent gets stuck in a local minimum. Results are averaged over 10 runs for each optimizer, with 25 maximum iterations.

FIG. 10. Comparison of virtual distillation with postselection (yellow), without postselection (light blue), and without postselection but while bounding expectation values of Z_j and D±_ij in [−1, 1] (purple), for a 10-qubit simulation of ground states of the RG Hamiltonian across a range of g values. Raw VQE (blue), postselected VQE (red), and postselected VD [PS-VD] data taken from Fig. 2. VQE error bars obtained by error propagation (1 standard deviation, see App. C 2); VD error bars obtained by bootstrapping (1 standard deviation).
FIG. 11. Histogram of the expectation-value error in the Pauli operators P = Z_j, P = Z_jZ_k, or P = D±_ij (Eq. (B27)), across all data taken for the experiment mentioned. The mean error across the dataset is given.
Algorithm 1 Conjugate Model Gradient Descent. Input: initial point x₀, learning rate γ, sample radius δ, maximum iterations n, number of evaluations per iteration k, rate decay exponent α, stability constant A, sample radius decay exponent ξ, tolerance ε, oracle for function f.

TABLE I. Hyperparameters for the Conjugate Model Gradient Descent optimizer during the experiment, where N is the number of parameters in the optimization. These were also used in the later comparisons with Model Gradient Descent, to gauge performance on an equal footing.
Ultimately, as the cost of optimization was already significant for this 6-qubit example, and as the number of calls for Conjugate Model Gradient Descent scales as O(#parameters) ∼ O(N²), we did not see it as practical to continue this line of research in this work. With current qubit counts and shot repetition rates, the optimization of variational algorithms of meaningful size remains a major obstacle.