Introduction

Near-term quantum computers hold great promise but also pose great challenges. Low qubit counts place constraints on problem sizes that can be implemented. Decoherence and gate infidelity place constraints on the circuit depth that can be implemented. These constraints are captured in the (now widely used) term noisy intermediate-scale quantum (NISQ)1.

To address the circuit depth constraint, variational quantum algorithms (VQAs) have been proposed for many applications2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27. VQAs employ a quantum-classical optimization loop to train the parameters θ of a quantum circuit V(θ). Leveraging classical optimizers allows the quantum circuit depth to remain shallow, which limits the accumulation of noise and makes VQAs well suited to NISQ devices.

A particularly important application of NISQ computers will be extracting the spectra, i.e., the eigenvalues and eigenvectors, of very large matrices. Indeed, the most famous VQA, known as the variational quantum eigensolver (VQE), aims to variationally determine the energies and state-preparation circuits for the ground state and low-lying excited states of a given Hamiltonian, i.e., a Hermitian matrix. VQE promises to revolutionize the field of quantum chemistry28,29, and perhaps even nuclear30 and condensed matter31,32 physics.

If one instead considers a positive-semidefinite matrix, then extracting the spectrum has direct application as a machine-learning primitive known as principal component analysis (PCA). Along these lines, Lloyd et al.33 introduced a quantum algorithm called quantum PCA (qPCA) to deterministically extract the spectrum of an n-qubit density matrix ρ. qPCA employs quantum phase estimation and density matrix exponentiation as subroutines and hence requires a large number of quantum gates and copies of ρ. In an effort to reduce circuit depth in the NISQ era, LaRose et al.6 developed a VQA for this application called variational quantum state diagonalization (VQSD). VQSD requires two copies of ρ, hence 2n qubits, and trains the parameters θ of a gate sequence V(θ) so that \(\tilde{\rho }=V({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\) is approximately diagonal. A different variational approach, called quantum singular value decomposition (QSVD), was introduced by Bravo-Prieto et al.27. QSVD takes a purification \(\left|\psi \right\rangle\) of ρ as its input and hence requires however many qubits it takes to purify ρ (possibly 2n qubits).

In this work, we introduce a variational algorithm for PCA that only requires a single copy of ρ and hence only n qubits per iteration of the algorithm. Our approach, called the variational quantum state eigensolver (VQSE), exploits the mathematical connection between diagonalization and majorization. Namely, it is well known that the eigenvalues of a density matrix ρ majorize the diagonal elements in any basis. Hence, by choosing a cost function C that is a Schur concave function of the diagonal elements of ρ, one can ensure that the cost function is minimized when ρ is diagonalized. Specifically, we write the cost as \(C={{{\rm{Tr}}}}(\tilde{\rho }H)\), where H is some Hamiltonian with a non-degenerate spectrum, which ensures the Schur concavity property. Note that evaluating C simply involves measuring the expectation value of H on \(\tilde{\rho }\), and hence one can see why only n qubits are required.

To learn the optimal θ parameters, we introduce a new training approach, not previously used in other VQAs. Specifically, we employ a time-dependent Hamiltonian H that we adapt based on information gained from measurements performed throughout the optimization. The aim of this adaptive approach is: (1) to mitigate barren plateaus in training landscapes, and (2) to get out of local minima. With our numerics, we find that using an adaptive Hamiltonian is better than simply fixing the Hamiltonian throughout the optimization. We further provide a rigorous analysis of the measurement shot requirements of VQSE, where we show that the probability that the relative error induced by statistical sampling exceeds a given threshold is smaller than δ, provided one measures with a number of shots that scales only as \({{\Omega }}(\log (1/\delta )/{\lambda }_{m}^{2})\), with λm being the smallest eigenvalue one wishes to estimate.

Finally, we illustrate two important applications of VQSE with our numerical implementations. First, we use VQSE for error mitigation of the W-state preparation circuit. Namely, by projecting the state onto the eigenvector with the largest eigenvalue, we re-purify the state, mitigating the effects of incoherent errors. Second, we use VQSE to perform entanglement spectroscopy (which is essentially PCA on the reduced state of a bipartition) on the ground state of an XY-model spin chain. This allows us to identify quantum critical points in this model.

Results

Theoretical basis of VQSE

Consider an n-qubit quantum state ρ with (unknown) spectral decomposition \(\rho ={\sum }_{k}{\lambda }_{k}\left|{\lambda }_{k}\right\rangle \,\left\langle {\lambda }_{k}\right|\), such that the eigenvalues are ordered in decreasing order (i.e., λk ⩾ λk+1 for k = 1, …, rank(ρ), while λk = 0 for k > rank(ρ)). The goal of VQSE is to estimate the m-largest eigenvalues of ρ, where m ≪ 2n, and furthermore to return a gate sequence V(θ) that approximately prepares their associated eigenvectors from standard basis elements.

At first sight, this looks like a matrix diagonalization problem. Indeed, this is the perspective taken in the literature, e.g., by the VQSD algorithm6 which employs a cost function that quantifies how far \(\tilde{\rho }=V({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\) is from a diagonal matrix. However, our VQSE algorithm takes a conceptually different approach, focusing on majorization instead of diagonalization.

We write the VQSE cost function as an energy, or the expectation value of a Hamiltonian:

$$C({{{\boldsymbol{\theta }}}})\equiv \left\langle H\right\rangle ={{{\rm{Tr}}}}\left[HV({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\right]\,.$$
(1)

Here, H is a simple n-qubit Hamiltonian that is diagonal in the standard basis and whose eigenenergies and associated eigenstates are known and respectively given by {Ek} and \(\{\left|{{{{\boldsymbol{e}}}}}_{k}\right\rangle \}\) (where \({{{{\boldsymbol{e}}}}}_{k}={e}_{k}^{1}\cdot \ldots \cdot {e}_{k}^{n}\) for k = 1, …, 2n are bitstrings of length n). Moreover, we henceforth assume that the eigenenergies are non-negative and ordered in increasing order, i.e., Ek ⩽ Ek+1. We have

$$C({{{\boldsymbol{\theta }}}})=\mathop{\sum }\limits_{k=1}^{{2}^{n}}{E}_{k}{p}_{k}={{{\boldsymbol{E}}}}\cdot {{{\boldsymbol{p}}}}\,,\quad {p}_{k}=\langle {{{{\boldsymbol{e}}}}}_{k}| \tilde{\rho }| {{{{\boldsymbol{e}}}}}_{k}\rangle \,,$$
(2)

where we defined the vectors E = (E1, E2, …) and p = (p1, p2, …). Similarly, let us define the vector of eigenvalues of ρ as λ = (λ1, λ2, …). Then, since the eigenvalues of a positive semidefinite matrix majorize its diagonal elements, λ ≻ p, and since the dot product with an increasingly ordered vector is a Schur concave function34,35, we have

$$C({{{\boldsymbol{\theta }}}})={{{\boldsymbol{E}}}}\cdot {{{\boldsymbol{p}}}}\,\geqslant\, {{{\boldsymbol{E}}}}\cdot {{{\boldsymbol{\lambda }}}}=\mathop{\sum}\limits_{k}{E}_{k}{\lambda }_{k}\,,$$
(3)

where we have used the fact that ρ and \(\tilde{\rho }\) have the same eigenvalues. Hence, one can see that C(θ) is minimized when V(θ) maps the eigenbasis of ρ to the eigenbasis of H, with appropriate ordering. Since the latter is chosen to be the standard basis, this corresponds to diagonalizing ρ. Thus, even though it may not be obvious at first sight, minimizing C(θ) corresponds to diagonalizing ρ.
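
For concreteness, the following minimal numpy sketch (an illustration of ours, not part of the algorithm itself) verifies the inequality in Eq. (3) for a random state and a random unitary; the increasingly ordered energy vector E is an arbitrary non-degenerate choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 2**3

# Random density matrix rho = A A^dagger / Tr[A A^dagger]
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real

# Random "ansatz" unitary V via QR decomposition
V, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))

E = np.arange(d, dtype=float)                  # increasing, non-degenerate
p = np.real(np.diag(V @ rho @ V.conj().T))     # diagonal of rho-tilde
lam = np.sort(np.linalg.eigvalsh(rho))[::-1]   # eigenvalues, decreasing

assert E @ p >= E @ lam - 1e-12                # Eq. (3): C = E.p >= E.lambda
```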

The VQSE algorithm

Figure 1 shows a schematic diagram of the VQSE algorithm. The three inputs to VQSE are: (1) an n-qubit quantum state ρ, (2) an integer m, and (3) a parameterized gate sequence or ansatz V(θ). The outputs of VQSE are: (1) estimates \({\{{\tilde{\lambda }}_{i}\}}_{i = 1}^{m}\) of the m-largest eigenvalues \({\{{\lambda }_{i}\}}_{i = 1}^{m}\) of ρ, and (2) a gate sequence V(θopt) that prepares approximate versions \({\{|{\tilde{\lambda }}_{i}\rangle \}}_{i = 1}^{m}\) of the associated m eigenvectors \({\{\left|{\lambda }_{i}\right\rangle \}}_{i = 1}^{m}\). While in principle m can be as large as 2n, we assume that one is interested in a number m of eigenvalues and eigenvectors that grows at worst as \({{{\mathcal{O}}}}(\,{{\mbox{poly}}}\,(n))\).

Fig. 1: Schematic diagram of VQSE.
figure 1

VQSE takes as inputs an n-qubit state ρ, an integer m, and a parametrized unitary V(θ). It then outputs estimates of the m-largest eigenvalues of ρ, and their associated eigenvectors. The first step of the algorithm is a hybrid quantum-classical optimization loop to train the parameters θ, and minimize the cost function defined in (4) as the expectation value of a Hamiltonian H(t) over the state \(\tilde{\rho }=V({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\). To facilitate this optimization, we adaptively update H(t) using information obtained via measurements on \(\tilde{\rho }\). When this optimization terminates, at which point we say θ = θopt, one reads off the eigenvalues. Namely, by preparing V(θopt)ρV(θopt) and measuring in the standard basis, one obtains bitstrings z whose associated frequencies are estimates of the eigenvalues of ρ. Finally, one prepares the estimated eigenvectors by preparing the states \(\left|{{{\boldsymbol{z}}}}\right\rangle\) and acting on them with V(θopt).

After taking in the inputs, VQSE enters a hybrid quantum-classical optimization loop to train the parameters θ in the ansatz V(θ). This loop employs a quantum computer to evaluate the VQSE cost function, denoted

$$C(t,{{{\boldsymbol{\theta }}}})\equiv \left\langle H(t)\right\rangle ={{{\rm{Tr}}}}\left[H(t)\widetilde{\rho }\right]\,,\quad \widetilde{\rho }=V({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\,.$$
(4)

Here, H(t) is a Hamiltonian that could, in general, depend on the time t, where t ∈ [0, 1] is a parameter that indicates the optimization loop run-time such that the loop starts at t = 0 and ends at t = 1. For all t, we assume that H(t) can be efficiently measured on a quantum computer and that it is diagonal in the standard basis, with its lowest m eigenenergies being non-degenerate and non-negative. We further elaborate on how to choose H(t) in section ‘Cost functions’. Note that the quantum circuit to evaluate the cost C(t, θ), as depicted in Fig. 1, simply involves applying V(θ) to the state ρ and then measuring the Hamiltonian H(t).

The quantum computer then feeds the value of the cost (or the gradient of the cost for gradient-based optimization) to a classical computer, which adjusts the parameters θ for the next round of the loop. The ultimate goal is to find the global minimum of the cost landscape at t = 1, i.e., to solve the problem:

$${{{{\boldsymbol{\theta }}}}}_{{{\mbox{opt}}}}\equiv \arg \mathop{\min }\limits_{{{{\boldsymbol{\theta }}}}}C(1,{{{\boldsymbol{\theta }}}})\,.$$
(5)

In reality, one will need to impose some termination condition on the optimization loop and hence the final parameters obtained (which we still denote as θopt) will only approximately satisfy Eq. (5). Nevertheless, we provide a verification procedure below in section ‘Verification of solution quality’ that allows one to quantify the quality of the solution even when (5) is not exactly satisfied.

As shown in Fig. 1, the next step of VQSE is the eigenvalue readout. From the parameters θopt one can estimate the eigenvalues of ρ by acting with the gate sequence V(θopt) and then measuring in the standard basis \(\{\left|{{{{\boldsymbol{z}}}}}_{k}\right\rangle \}\). Let Pr(zk) be the probability of the zk outcome. Then by taking the m largest of these probabilities we define \({{{\mathcal{L}}}}\equiv {\{{\widetilde{\lambda }}_{i}\}}_{i = 1}^{m}\) as the ordered set of estimates of the m-largest eigenvalues of ρ, and we define \({{{\mathcal{Z}}}}\) as the set of bitstrings \({\{{{{{\boldsymbol{z}}}}}_{i}\}}_{i = 1}^{m}\) associated with the elements of \({{{\mathcal{L}}}}\):

$${\widetilde{\lambda }}_{i}=\Pr ({{{{\boldsymbol{z}}}}}_{i})=\langle {{{{\boldsymbol{z}}}}}_{i}| \widetilde{\rho }| {{{{\boldsymbol{z}}}}}_{i}\rangle \,,\quad \,{{\mbox{such that}}}\,\quad {\widetilde{\lambda }}_{i}\,\geqslant\, {\widetilde{\lambda }}_{i+1}\,.$$
(6)

Note that \({\widetilde{\lambda }}_{i}\) in (6) correspond to diagonal elements of \(\widetilde{\rho }\) in the standard basis, and not to its eigenvalues.

In practice, when estimating the eigenvalues one measures \(\tilde{\rho }\) in the standard basis a finite number of times Nruns. Hence, if a bitstring \({{{{\boldsymbol{z}}}}}_{i}\in {{{\mathcal{Z}}}}\) has frequency fi for Nruns total runs, then we can estimate \({\widetilde{\lambda }}_{i}\) as

$${\widetilde{\lambda }}_{i}^{\,{{\mathrm{est}}}}=\frac{{f}_{i}}{{N}_{{{\mathrm{runs}}}}}\,.$$
(7)
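
A minimal sketch of this readout step, assuming numpy; here the vector `probs` is a hypothetical stand-in for the exact standard-basis outcome distribution of \(\tilde{\rho }\), which on hardware would come from the device itself:

```python
import numpy as np

def eigenvalue_estimates(probs, m, n_runs, seed=1):
    """Sample N_runs standard-basis outcomes (probs must sum to 1) and
    return the m largest frequencies, i.e., the estimates of Eq. (7),
    together with the bitstring indices forming the set Z."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(rng.choice(len(probs), size=n_runs, p=probs),
                         minlength=len(probs))
    order = np.argsort(counts)[::-1][:m]    # m most frequent bitstrings
    return counts[order] / n_runs, order    # lambda_i^est = f_i / N_runs
```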

One can think of this as a Bernoulli trial. Let Λi be a random variable that takes value 1 if we get outcome zi (with probability \({\widetilde{\lambda }}_{i}\)), and takes value 0 otherwise (with probability \(1-{\widetilde{\lambda }}_{i}\)). After repeating the experiment Nruns times we are interested in bounding the probability that the relative error \({\varepsilon }_{i}\equiv | {\widetilde{\lambda }}_{i}^{\,{{\mathrm{est}}}\,}-{\widetilde{\lambda }}_{i}| /{\widetilde{\lambda }}_{i}\) is larger than a certain value c > 0. From Hoeffding’s inequality, we find

$$\Pr ({\varepsilon }_{i}\,\geqslant\, c)\,\leqslant\, {{\rm{e}}}^{-2{N}_{{{\mathrm{runs}}}}{c}^{2}{\widetilde{\lambda }}_{i}^{2}}\,,\quad \forall c \,>\, 0\,.$$
(8)

For fixed Nruns, Eq. (8) shows that the smaller the inferred eigenvalue \({\widetilde{\lambda }}_{i}\), the larger the probability of having a given relative error. Equation (8) also implies that increasing Nruns reduces the probability of large relative errors. Hence, we can always choose Nruns such that the probability of error is smaller than a given δ for all m eigenvalues via

$$\forall i\in [1,m],\,\,\,\Pr ({\varepsilon }_{i}\,\geqslant\, c)\,\leqslant\, \delta \,\to \,{N}_{{{\mathrm{runs}}}}\,\geqslant\, \frac{\log (1/\delta )}{2{c}^{2}{\lambda }_{m}^{2}}\,,$$
(9)

where λm is the smallest eigenvalue of interest. Analogously, from (9) we have that all eigenvalues larger than \(\sqrt{\frac{\log (1/\delta )}{2{c}^{2}{N}_{{{\mathrm{runs}}}}}}\) have a probability of error smaller than δ.
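
The required shot count is a direct transcription of the bound in Eq. (9); a numpy sketch with illustrative numbers:

```python
import numpy as np

def shots_required(lam_m, c, delta):
    """Smallest N_runs satisfying Eq. (9): target relative error c,
    failure probability delta, smallest eigenvalue of interest lam_m."""
    return int(np.ceil(np.log(1.0 / delta) / (2.0 * c**2 * lam_m**2)))

print(shots_required(lam_m=0.05, c=0.1, delta=0.01))  # 92104
```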

The last step of VQSE is to prepare the inferred eigenvectors of ρ. Given a bitstring \({{{{\boldsymbol{z}}}}}_{i}\in {{{\mathcal{Z}}}}\), one can prepare the associated inferred eigenvector by taking the state \(\left|{{{\boldsymbol{0}}}}\right\rangle ={\left|0\right\rangle }^{\otimes n}\), acting on it with the gate \({X}^{{z}_{1}^{i}}\otimes {X}^{{z}_{2}^{i}}\otimes \ldots \otimes {X}^{{z}_{n}^{i}}\), and then applying the gate sequence \(V{({{{{\boldsymbol{\theta }}}}}_{{{\mathrm{opt}}}})}^{{\dagger} }\):

$$|{\widetilde{\lambda }}_{i}\rangle ={V}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}_{{{\mathrm{opt}}}})\left|{{{{\boldsymbol{z}}}}}_{i}\right\rangle \,,\quad \left|{{{{\boldsymbol{z}}}}}_{i}\right\rangle ={X}^{{z}_{1}^{i}}\otimes \ldots \otimes {X}^{{z}_{n}^{i}}\left|{{{\mathbf{0}}}}\right\rangle \,.$$
(10)

Note that while the inferred eigenvalues can be stored classically, the eigenvectors are prepared on a quantum computer, and hence one needs to perform measurements to extract information about these eigenvectors.
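
A state-vector sketch of Eq. (10), assuming numpy and that V(θopt) is available as a 2n × 2n matrix; this is a matrix-level illustration, not a gate-level circuit:

```python
import numpy as np

def inferred_eigenvector(V_opt, z_bits):
    """Eq. (10): flip qubits of |0...0> according to the bitstring z_i,
    then apply V(theta_opt)^dagger. z_bits is a list of 0/1, with z_1
    taken as the most significant bit."""
    idx = int("".join(map(str, z_bits)), 2)   # X^{z_1} x ... x X^{z_n}|0> = |z>
    ket = np.zeros(V_opt.shape[0], dtype=complex)
    ket[idx] = 1.0
    return V_opt.conj().T @ ket               # |lambda_i~>
```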

Cost functions

Consider the Hamiltonian H(t) that defines the VQSE cost function in (4). Recall that we choose H(t) so that: (1) it is diagonal in the standard basis, (2) its lowest m eigenvalues are non-negative and non-degenerate, and (3) it can be efficiently measured on a quantum computer. Let us now discuss possible choices for H(t).

Fixed Hamiltonians

When the Hamiltonian is fixed (i.e., time-independent), we write H(t) ≡ H, and C(t, θ) ≡ C(θ). In this case, a simple, intuitive cost function is given by

$${C}_{G}({{{\boldsymbol{\theta }}}})={{{\rm{Tr}}}}[{H}_{G}\widetilde{\rho }],\quad {H}_{G}={\mathbb{1}}-\mathop{\sum }\limits_{i=1}^{m}{q}_{i}\left|{{{{\boldsymbol{e}}}}}_{i}\right\rangle \,\left\langle {{{{\boldsymbol{e}}}}}_{i}\right|\,,$$
(11)

with qi > 0 (such that qi > qi+1), and where the \(\left|{{{{\boldsymbol{e}}}}}_{i}\right\rangle\) are orthogonal states in the standard basis. The spectrum of HG is composed of m non-degenerate eigenenergies, and a (2n − m)-fold degenerate eigenenergy.
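
As an illustration, assuming numpy and taking the \(\left|{{{{\boldsymbol{e}}}}}_{i}\right\rangle\) to be the first m standard-basis states (our own choice; any m distinct bitstrings work), the diagonal of HG can be built as:

```python
import numpy as np

def H_global_diag(n, q):
    """Diagonal of H_G in Eq. (11), with |e_i> the first m basis states."""
    diag = np.ones(2**n)          # the identity term
    diag[:len(q)] -= q            # subtract q_i on the chosen |e_i><e_i|
    return diag

q = np.array([0.5, 0.4, 0.3])    # q_1 > q_2 > ... > q_m > 0
print(np.sort(H_global_diag(3, q)))  # m non-degenerate levels, then 1's
```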

On the one hand, this large degeneracy makes it easier to find a global minimum, as the solution space is large. That is, denoting as Vopt an optimal unitary that minimizes (11), there is a large set of such optimal unitaries \({{{{\mathcal{S}}}}}_{{{\mbox{opt}}}}=\{{V}_{{{\mbox{opt}}}}\}\), which are not simply related by global phases. This is due to the fact that one is only interested in the m rows and the m columns of V(θ) that diagonalize \(\widetilde{\rho }\) in the subspace spanned by \({\{\left|{{{{\boldsymbol{e}}}}}_{i}\right\rangle \}}_{i = 1}^{m}\). Specifically, any optimal unitary must satisfy \(\langle {{{{\boldsymbol{z}}}}}_{i}| {V}_{{{\mbox{opt}}}}| {\lambda }_{i}\rangle =\langle {\lambda }_{i}| {V}_{{{\mbox{opt}}}}^{{\dagger} }| {{{{\boldsymbol{z}}}}}_{i}\rangle ={\delta }_{{{{{\boldsymbol{z}}}}}_{i}{{{{\boldsymbol{e}}}}}_{i}}\) for i = 1, …, m (and with \({{{{\boldsymbol{z}}}}}_{i}\in {{{\mathcal{Z}}}}\)), while the (2n − m) × (2n − m) unitary principal submatrix of Vopt with matrix elements \(\langle {{{{\boldsymbol{z}}}}}_{i}| {V}_{{{\mbox{opt}}}}| {{{{\boldsymbol{z}}}}}_{i^{\prime} }\rangle\), where \({{{{\boldsymbol{z}}}}}_{i},{{{{\boldsymbol{z}}}}}_{i^{\prime} }\,\notin\, {{{\mathcal{Z}}}}\), remains completely arbitrary.

On the other hand, it has been shown that when employing hardware-efficient ansatzes36 for V(θ), global cost functions like CG(θ) are untrainable for large problem sizes as they exhibit exponentially vanishing gradients (i.e., barren plateaus37) even when the ansatz is short depth38. Such barren plateaus can be avoided by employing a different type of cost function known as a local cost38,39, where C is defined such that one compares states or operators with respect to each individual qubit rather than comparing them in a global sense.

One can construct a local cost where the Hamiltonian is a weighted sum of local z-Pauli operators:

$${C}_{L}\equiv \langle {H}_{L}\rangle \,,\quad {H}_{L}={\mathbb{1}}-\mathop{\sum }\limits_{j=1}^{n}{r}_{j}{Z}_{j}\,,$$
(12)

where \({r}_{j}\in {\mathbb{R}}\) and Zj is the z-Pauli operator acting on qubit j. Care must be taken when choosing the coefficients \({\{{r}_{j}\}}_{j = 1}^{n}\) to ensure that the lowest m eigenenergies of HL are non-degenerate. For instance, when targeting the largest eigenvalue of ρ (m = 1), the simple choice rj = 1, ∀j, achieves this goal. On the other hand, if one is interested in m = n + 1 eigenvalues, then one can choose rj = r1 + (j − 1)δ with r1 ≫ δ, which will ensure that the m-lowest energy levels, {E1, E1 + r1, E1 + r1 + δ, …, E1 + r1 + (m − 2)δ}, are non-degenerate. Henceforth, we will assume that one has chosen \({\{{r}_{j}\}}_{j = 1}^{n}\) such that the m-lowest energy levels are non-degenerate.
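
As a concrete check of this coefficient choice, the following numpy sketch (with illustrative values r1 = 1 and δ = 0.1) builds the diagonal of HL and verifies that the m = n + 1 lowest levels are non-degenerate:

```python
import numpy as np

def H_local_diag(n, r):
    """Diagonal of H_L = 1 - sum_j r_j Z_j of Eq. (12) in the standard
    basis, where Z_j has eigenvalue (-1)^{z_j} on bitstring z."""
    diag = np.ones(2**n)
    for j in range(n):
        z_j = (np.arange(2**n) >> (n - 1 - j)) & 1   # bit j of each state
        diag -= r[j] * (1 - 2 * z_j)                  # Z_j eigenvalue
    return diag

n = 4
r = 1.0 + 0.1 * np.arange(n)          # r_j = r_1 + (j-1)delta, r_1 >> delta
E = np.sort(H_local_diag(n, r))
assert len(np.unique(np.round(E[:n + 1], 12))) == n + 1  # lowest m distinct
```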

While fixed local cost functions do not exhibit barren plateaus for shallow depth, they still have several trainability issues. First, having less degeneracy in HL leads to a more difficult optimization problem. Since degeneracy allows for additional freedom in the solution space, non-degeneracy constrains the possible solutions. Therefore, there is a tradeoff between engineering non-degeneracy (which allows one to distinguish more eigenvalues of ρ) versus keeping degeneracy (which allows for more solutions). Second, we expect both CL and CG to have a high density of local minima, especially for large m. This is because there will be partial solutions to the problem where one correctly assigns some eigenvalues of ρ to the right energy levels of the Hamiltonian, while incorrectly assigning other eigenvalues. This local minima issue is what motivates the following adaptive approach.

Adaptive Hamiltonian

Let us now introduce an approach to adaptively update the VQSE Hamiltonian (and hence the cost function) based on information obtained via measurements during the optimization loop. This method allows us to mitigate the issues discussed in the previous section that arise for cost functions with fixed local or global Hamiltonians. Namely, the adaptive cost function addresses the following three problems: (1) barren plateaus at shallow depth38, (2) a high density of local minima, and (3) the smaller solution space arising from non-degeneracies.

Consider a time-dependent Hamiltonian of the form

$$H(t)\equiv (1-f(t)){H}_{L}+f(t){H}_{G}(t)\,,$$
(13)

where f(t) is a real-valued function such that f(0) = 0, f(1) = 1, and HL is a local Hamiltonian as in (12). We recall here that t ∈ [0, 1] is a parameter that indicates the optimization loop run time. Moreover, we define the time-dependent global Hamiltonian

$${H}_{G}(t)\equiv {\mathbb{1}}-\mathop{\sum }\limits_{i=1}^{m}{q}_{i}\left|{{{{\boldsymbol{z}}}}}_{i}(t)\right\rangle \left\langle {{{{\boldsymbol{z}}}}}_{i}(t)\right|\,,$$
(14)

where the coefficients qi are real and positive, and chosen in the same way as in (11). In addition, the states \(\left|{{{{\boldsymbol{z}}}}}_{i}(t)\right\rangle\) are adaptively chosen throughout the optimization loop by preparing \(\widetilde{\rho }\), measuring in the standard basis to obtain the sets \({{{\mathcal{L}}}}\) and \({{{\mathcal{Z}}}}\), and updating HG(t) so that \({{{{\boldsymbol{z}}}}}_{i}(t)\in {{{\mathcal{Z}}}}\).
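
A minimal sketch of this update, assuming numpy, that the Hamiltonians are stored as diagonals (as in the sketches above), and that `z_top` holds the m measured bitstring indices in \({{{\mathcal{Z}}}}\); the schedule f(t) = t2 is an illustrative slowly growing choice:

```python
import numpy as np

def f(t):
    return t**2          # illustrative schedule with f(0) = 0, f(1) = 1

def H_adaptive_diag(t, HL_diag, z_top, q):
    """Diagonal of H(t) in Eqs. (13)-(14). z_top: the m distinct bitstring
    indices currently in Z; q: the m positive, decreasing weights q_i."""
    HG_diag = np.ones(len(HL_diag))
    HG_diag[np.asarray(z_top)] -= q                 # Eq. (14)
    return (1.0 - f(t)) * HL_diag + f(t) * HG_diag  # Eq. (13)
```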

As schematically shown in Fig. 2a, in order to mitigate the barren plateau phenomenon it is important to choose a function f(t) that grows slowly with t. Hence, for small t, H(t) ~ HL and the cost function will be trainable as it will not present a barren plateau. Then, as t increases, one can deal with the issue of local minima by updating HG(t). As depicted in the insets of Fig. 2a, adaptively changing HG(t) transforms local minima in the cost landscape into global minima. By the end of the algorithm, we have H(1) = HG(1), and as shown in panel (b) of Fig. 2, the spectrum of H becomes highly degenerate and the dimension of the solution space increases. In section ‘Algorithm for the adaptive cost function’ of the “Methods”, we present an algorithm to illustrate how one can update H(t).

Fig. 2: Adaptive cost function.
figure 2

a Schematic representation of the function f(t) and the cost landscape of C(t, θ) versus t. We choose f(t) as a slowly growing function of t. When the optimization starts at t = 0, the cost function does not exhibit a barren plateau as the Hamiltonian is local, H(0) = HL. As t increases, H(t) becomes a linear combination of HL and a global Hamiltonian HG(t) which is adaptively updated using information gained from measurements on V(θ)ρV†(θ). As shown in the insets, this procedure allows local minima to become global minima. Finally, when the algorithm ends at t = 1, the Hamiltonian is global, H(1) = HG(1). b Schematic representation of the eigenenergies of H(t) versus t. For small t the Hamiltonian is local and hence its spectrum contains non-degeneracies that reduce the space of solutions. At t = 1, H(t) becomes a global Hamiltonian and the spectrum has m non-degenerate levels and a (2n − m)-fold degenerate level.

We remark that ref. 40 proposed a method called adiabatically assisted VQE (AAVQE), which dynamically updates the VQE cost function by interpolating from a simple Hamiltonian to the non-trivial problem Hamiltonian. Note that the goals of AAVQE and our adaptive training method are different. Furthermore, in our method one adaptively updates the cost function based on information obtained through measurements, while AAVQE does not use information gained during the optimization.

Operational meaning of the cost function

Here we discuss the operational meaning of the VQSE cost function, showing that small cost values imply small eigenvalue and eigenvector errors. Let \({\{|{\widetilde{\lambda }}_{i}\rangle \}}_{i = 1}^{m}\) be the set of inferred eigenvectors associated with the \({\widetilde{\lambda }}_{i}\) in \({{{\mathcal{L}}}}\), and let \(\left|{\delta }_{i}\right\rangle =\rho |{\widetilde{\lambda }}_{i}\rangle -{\widetilde{\lambda }}_{i}|{\widetilde{\lambda }}_{i}\rangle\). We then define eigenvalue and eigenvector errors as follows:

$${\varepsilon }_{\lambda }\equiv \mathop{\sum }\limits_{i=1}^{m}{({\lambda }_{i}-{\widetilde{\lambda }}_{i})}^{2},\quad {\varepsilon }_{v}\equiv \mathop{\sum }\limits_{i=1}^{m}\langle {\delta }_{i}| {\delta }_{i}\rangle .$$
(15)

Here ⟨δi∣δi⟩ quantifies the component of \(\rho |{\widetilde{\lambda }}_{i}\rangle\) that is orthogonal to \(|{\widetilde{\lambda }}_{i}\rangle\), which follows from the identity \(\left|{\delta }_{i}\right\rangle =({\mathbb{1}}-|{\widetilde{\lambda }}_{i}\rangle \langle {\widetilde{\lambda }}_{i}|)\rho |{\widetilde{\lambda }}_{i}\rangle\).

Then by using the Cauchy–Schwarz inequality, majorization conditions, and Schur convexity, we establish the following upper bound on the eigenvalue and eigenvector errors (see section ‘Operational meaning of the cost function’ in the “Methods” for more details):

$${\varepsilon }_{\lambda },{\varepsilon }_{v}\,\leqslant\, {{{\rm{Tr}}}}[{\rho }^{2}]-\frac{{({E}_{m+1}-C({{{\boldsymbol{\theta }}}}))}^{2}}{\mathop{\sum }\nolimits_{i=1}^{m}{({E}_{m+1}-{E}_{i})}^{2}}\,,$$
(16)

where E1, …, Em+1 are the (m + 1)-smallest eigenenergies of H, and where for simplicity we have omitted the t dependence. Thus Eq. (16) provides an operational meaning to our cost function, as small values of the cost function lead to small eigenvalue and eigenvector errors.
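
Evaluating this bound requires only the measured cost, the purity, and the known low-lying spectrum of H; a one-line transcription in numpy (a sketch of ours):

```python
import numpy as np

def cost_error_bound(purity, E, C):
    """Right-hand side of Eq. (16). purity: Tr[rho^2]; E: the (m+1)
    smallest eigenenergies (E_1, ..., E_{m+1}) of H; C: the cost value."""
    m = len(E) - 1
    return purity - (E[m] - C)**2 / np.sum((E[m] - E[:m])**2)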

Verification of solution quality

Let us show how to verify the results obtained from the VQSE algorithm. We remark that this verification step is optional, particularly because it requires 2n qubits, whereas the rest of VQSE only requires n qubits.

In section ‘Verification of solution quality’ of “Methods”, we prove the following useful bound on eigenvalue and eigenvector error:

$${\varepsilon }_{\lambda },{\varepsilon }_{v}\,\leqslant\, {{{\rm{Tr}}}}[{\rho }^{2}]-\left(\mathop{\sum }\limits_{i=1}^{\widehat{m}}{\widetilde{\lambda }}_{i}^{2}+\frac{{(1-\mathop{\sum }\nolimits_{i = 1}^{\widehat{m}}{\widetilde{\lambda }}_{i})}^{2}}{{2}^{n}-\widehat{m}}\right)\,,$$
(17)

where one can take \(\widehat{m}\) as any integer between m and 2n. One can efficiently estimate the right-hand side of (17) as follows. Given two copies of ρ, \({{{\rm{Tr}}}}[{\rho }^{2}]\) can be estimated by a depth-two quantum circuit with classical post-processing that scales linearly with n41. Moreover, since \({{{\rm{Tr}}}}[{\rho }^{2}]\) is independent of V(θ), one only needs to compute it once (outside of the optimization loop). Estimating the \({\widetilde{\lambda }}_{i}\) for \(i=1,...,\widehat{m}\) essentially comes for free as part of the eigenvalue readout step of VQSE, where we note that taking \(\widehat{m} \,>\, m\) simply involves keeping track of the frequencies of more bitstrings (more than the m-largest) during this readout step. Finally, we remark that while Eq. (16) can also be used for verification, in section ‘Verification of solution quality’ we show that (17) provides a tighter bound, particularly as one increases \(\widehat{m}\).
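
A sketch of this verification estimate, assuming numpy, with `purity` standing for the separately measured \({{{\rm{Tr}}}}[{\rho }^{2}]\) and `lam_est` for the \(\widehat{m}\) largest readout estimates:

```python
import numpy as np

def verification_bound(purity, lam_est, n):
    """Right-hand side of Eq. (17) for an n-qubit state, given the
    m-hat largest estimated eigenvalues from the readout step."""
    lam_est = np.asarray(lam_est)
    m_hat = len(lam_est)
    tail = (1.0 - np.sum(lam_est))**2 / (2**n - m_hat)
    return purity - (np.sum(lam_est**2) + tail)
```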

Ansatz

While there are many possible choices for the ansatz V(θ), here we are restricted to state-agnostic ansatzes, which do not require any a priori information about ρ. One such ansatz is the Layered Hardware Efficient Ansatz36 shown in Fig. 3a. Here, V(θ) consists of a fixed number L of layers of two-qubit gates Bμ(θμ) acting on alternating pairs of neighboring qubits. Figure 3b illustrates possible choices for Bμ(θμ). Note that with this structure, the number of parameters in θ grows linearly with n and L.

Fig. 3: Ansatz diagram.
figure 3

a Layered hardware-efficient ansatz for V(θ). A single layer of the ansatz is composed of two-qubit gates Bμ(θμ) acting on neighboring qubits. Shown is the case of two layers. b While there are many choices for each block Bμ(θμ), in our numerics we employed two different parameterizations. Top: Each gate is composed of a controlled-Z gate preceded and followed by single-qubit rotations about the y-axis \({R}_{y}(\theta )={\rm{e}}^{{\rm{i}}\theta {\sigma }_{y}}\). Bottom: Each gate is composed of a CNOT gate preceded and followed by a single-qubit rotation \(G({\theta }_{1},{\theta }_{2},{\theta }_{3})={\rm{e}}^{{\rm{i}}{\theta }_{3}{\sigma }_{z}/2}{\rm{e}}^{{\rm{i}}{\theta }_{2}{\sigma }_{y}/2}{\rm{e}}^{{\rm{i}}{\theta }_{1}{\sigma }_{z}/2}\). The number of parameters in θ increases linearly with the number of layers and the number of qubits n.
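
For concreteness, a state-vector sketch of this ansatz, assuming numpy and the Ry/CZ block of Fig. 3(b, top); we use the common convention Ry(θ) = e−iθσy/2 (which differs from the caption's convention only by a rescaling of the angle), and the brick pattern and parameter layout are illustrative choices of ours:

```python
import numpy as np

def Ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

CZ = np.diag([1.0, 1.0, 1.0, -1.0]).astype(complex)

def block(a):
    """B(theta) of Fig. 3b (top): Ry pair, CZ, then another Ry pair."""
    return np.kron(Ry(a[2]), Ry(a[3])) @ CZ @ np.kron(Ry(a[0]), Ry(a[1]))

def embed(B, q, n):
    """Act with two-qubit gate B on qubits (q, q+1) of an n-qubit register."""
    return np.kron(np.kron(np.eye(2**q), B), np.eye(2**(n - q - 2)))

def ansatz(n, theta, layers):
    """V(theta): alternating bricks of blocks, 4 angles per block."""
    V, k = np.eye(2**n, dtype=complex), 0
    for l in range(layers):
        for q in range(l % 2, n - 1, 2):     # even/odd pairs alternate
            V = embed(block(theta[k:k + 4]), q, n) @ V
            k += 4
    return V

theta = np.random.default_rng(2).uniform(0, 2 * np.pi, size=20)
V = ansatz(4, theta, layers=3)               # 5 blocks -> 20 angles
```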

Let us remark that the Layered Hardware Efficient Ansatz can lead to trainability issues as the system size increases37,38. Hence, different strategies have been proposed to mitigate such difficulties, such as learning to initialize parameters42, layer-by-layer training43, and correlating the parameters44. In addition, these methods can be combined with more sophisticated ansatzes, such as a variable-structure ansatz6,41 where the structure of the ansatz is not fixed, and where the gate placement becomes an optimizable hyper-parameter. This variable-structure approach has already been shown to improve performance in the context of extracting the eigensystem of a quantum state6.

Finally, since VQSE optimization corresponds to an energy minimization problem, a natural ansatz that can also be used to mitigate trainability issues is the quantum alternating operator ansatz (QAOA)3,45. Specifically, one could employ H(t) as the problem Hamiltonian in the QAOA and use a standard mixing Hamiltonian. While we do not employ this ansatz in our heuristics, it is nevertheless of interest for future work.

Optimization

Regarding the optimization of the parameters θ, while gradient-free methods are an option46,47, there has been recent evidence that gradient-based methods can perform better48,49,50,51. Moreover, as shown in refs. 52,53, for cost functions like (1), gradients can be analytically determined (see section ‘Gradient of the cost function’ in the “Methods” for an explicit derivation of the gradient formula). Therefore, in our heuristics, we employ gradient-based optimization.

Numerical implementations

Here we present the numerical results obtained from implementing VQSE. We first employ VQSE to estimate the spectrum of quantum states of different dimensions and compare the performance of cost functions based on the global, local, and adaptive Hamiltonians discussed in section ‘Cost functions’. Then we use VQSE for error mitigation of the W-state preparation circuit. Finally, we implement VQSE for entanglement spectroscopy on the ground-state of an XY-spin chain, which allows us to detect the presence of quantum critical points.

VQSE for quantum principal component analysis

Figure 4 presents the results of implementing VQSE to estimate the six largest eigenvalues (m = 6) of quantum states with n = 6, 8, and 10 qubits. In all cases, we have rank(ρ) = 16, as the states were prepared by randomly entangling the system qubits with four ancillary qubits, which were later traced out. Moreover, we chose ρ to be real and not sparse in the standard basis.

Fig. 4: Relative and absolute error versus the number of iterations.
figure 4

We implemented VQSE for states of: a n = 6, b n = 8, and c n = 10 qubits. In all cases, the ansatz for V(θ) was given by three layers of the Layered Hardware Efficient Ansatz of Fig. 3(b, top). Each curve represents the absolute or relative error (denoted Abs error or Rel error, respectively) of (18) obtained by training V(θ) when employing an adaptive, fixed-local, or fixed-global Hamiltonian. The number of iterations was 330 for (a) and 360 for (b) and (c). For the adaptive runs, we employed Algorithm 1, with the Hamiltonian being updated every 30 iterations. In each case, the adaptive approach performs the best as it achieves the smallest errors.

In our heuristics, we used the layered hardware efficient ansatz of Fig. 3(b, top), and we employed the fixed-local, fixed-global, and adaptive cost functions of section ‘Cost functions’. The termination condition was stated in terms of the maximum number of iterations in the optimization loop. Hardware noise and finite sampling were not included in these heuristics. (The next subsection shows heuristics with noise.) For the fixed local cost function, we chose the \({\{{r}_{j}\}}_{j = 1}^{n}\) in (12) so that the first six energy eigenvalues of HL were non-degenerate. Moreover, we defined the fixed global Hamiltonian such that the first six energy levels (i.e., associated eigenvectors and spectral gaps) coincided with those of HL. Finally, the adaptive Hamiltonian was constructed according to the procedure described above, and more specifically, in Algorithm 1 in the “Methods” section.

Since for these examples, we can calculate the exact eigenvalues λi, we compute and plot the following quantities which we use as figures of merit for the performance of the VQSE algorithm:

$${\varepsilon }_{\lambda }\equiv \mathop{\sum }\limits_{i=1}^{6}{({\lambda }_{i}-{\widetilde{\lambda }}_{i})}^{2}\,,\quad {\varepsilon }_{r}\equiv \mathop{\sum }\limits_{i=1}^{6}{({\lambda }_{i}-{\widetilde{\lambda }}_{i})}^{2}/{\lambda }_{i}^{2}\,.$$
(18)

Here ελ and εr respectively quantify absolute error and relative error in estimating the exact eigenvalues. We remark that these two quantities provide different information: The absolute error is biased towards the error in estimating the large eigenvalues of ρ, while on the other hand, the relative error is more sensitive to errors in estimating the small eigenvalues of ρ.

Figure 4 plots the relative and absolute errors versus the number of iterations (with the total number of iterations fixed). While we performed many runs, these plots show only the run that achieved the lowest absolute error. For all system sizes considered (n = 6, 8, and 10), VQSE achieves smaller relative and absolute errors when employing the adaptive Hamiltonian approach than when using a fixed Hamiltonian. For n = 6, the errors obtained by adaptively updating H(t) are two orders of magnitude smaller than those obtained with fixed Hamiltonians, while for n = 8 they are one order of magnitude smaller. As shown in Fig. 4c for n = 10, the adaptive Hamiltonian approach achieves errors of order ~10−5 (relative) and ~10−7 (absolute), again outperforming the fixed Hamiltonian approaches. Finally, we remark that we can use Eq. (9) to determine the number of shots needed to guarantee that, with probability larger than 99%, the relative error induced by finite sampling is smaller than 0.001. Namely, we find that one needs more than 50.2K shots, which is well within the order of magnitude of shots regularly used.

It is natural to ask whether the runs shown in Fig. 4 are representative of the algorithm performance. To provide an analysis of the average VQSE performance, we plot in Fig. 5 the runs-per-success versus 1/ελ for each of the aforementioned examples. Here, runs-per-success is defined as the total number of runs divided by the number of runs with an absolute error smaller than a target ελ.

Fig. 5: Runs-per-success versus inverse absolute error 1/ελ.
figure 5

We implemented VQSE for the states of: a n = 6, b n = 8, and c n = 10 qubits corresponding to Fig. 4. The insets depict the same data in the small 1/ελ regime. Runs-per-success is defined as the total number of runs divided by the number of runs with an absolute error smaller than a target ελ. For all three cases, we can see that as 1/ελ increases, the adaptive Hamiltonian has the lowest number of runs-per-success, and hence the best performance. In all cases, the x-axis is plotted on a log scale.

From all three panels in Fig. 5 we see that for large 1/ελ, the adaptive Hamiltonian always has the best performance as it requires fewer runs-per-success to achieve smaller errors. It is also interesting to note from the insets of Fig. 5 that there is a regime where the run time depends linearly on \(\log (1/{\varepsilon }_{\lambda })\) when employing an adaptive approach. This suggests that VQSE may perform quite efficiently for large ελ. However, the linear dependence breaks down for small ελ, where the number of runs-per-success seems to grow exponentially with 1/ελ. Despite such growth, for up to 8 qubits we only need 100 repetitions to achieve an error of order 10−6. We leave for future work a more detailed study of the dependence of runs-per-success on small errors. Finally, the results presented in Figs. 4 and 5 suggest that, for a sufficiently large number of iterations, the adaptive Hamiltonian approach outperforms the fixed Hamiltonian approaches as it requires the fewest iterations to converge to very small values of ελ.

Error mitigation

Here we discuss an important application of the VQSE algorithm: error mitigation. Quantum state preparation circuits (gate sequences U which prepare a target state \(\left|\psi \right\rangle\)) are used as subroutines in many quantum algorithms. However, since current quantum computers are noisy, all state preparation circuits produce mixed states ρ. If the incoherent noise is weak enough, we can expect that the eigenvector associated with the largest eigenvalue of ρ is close to \(\left|\psi \right\rangle\). Here we show that VQSE can be implemented to re-purify ρ and estimate \(\left|\psi \right\rangle\). Naturally, when running the VQSE eigenvector preparation circuit, noise will also produce a mixed state σ. However, if the depth of V(θ) is shorter than the depth of U, one can obtain a higher fidelity between σ and \(\left|\psi \right\rangle\) than between ρ and \(\left|\psi \right\rangle\). In this case one can mitigate errors by replacing the state preparation circuit with the VQSE eigenvector preparation circuit.

Let us now consider the three-qubit W-state preparation circuit from ref. 54, section 2.2 (see also ref. 55). By employing a noisy quantum computer simulator with the noise profile of IBM’s Melbourne processor56, we find that the fidelity between ρ and the exact W state \(\left|\psi \right\rangle\) is \(F(\rho ,\left|\psi \right\rangle )\approx 0.785\). We then train 10 instances of VQSE with two layers of the ansatz in Fig. 3(b, bottom) and with a termination condition of 50 iterations. Moreover, we employ the adaptive Hamiltonian, where we update H(t) every 10 iterations according to Algorithm 1. Figure 6 shows the average cost function value and the average fidelity between \(\left|\psi \right\rangle\) and the state σ obtained by running the VQSE eigenvector preparation circuit. As the number of iterations increases, the cost value tends to decrease, showing that we are able to train in the presence of noise. Moreover, we also see that \(F(\sigma ,\left|\psi \right\rangle )\) increases and saturates at a value larger than \(F(\rho ,\left|\psi \right\rangle )\), namely at 0.853, hence showing that we are in fact mitigating the effect of noise. This can be explained by the fact that we reduced the circuit depth, as our ansatz contains two CNOTs, while the textbook circuit contains three CNOTs.

Fig. 6: Cost function value and fidelity versus number of iterations.
figure 6

We implement VQSE for error mitigation of the three-qubit W-state preparation circuit. The input state ρ corresponds to the mixed state obtained by running the W-state preparation circuit on a noisy simulator. The dashed line corresponds to the fidelity \(F(\rho ,\left|\psi \right\rangle )\) between ρ and the exact W state \(\left|\psi \right\rangle\). For each iteration step, we compute the fidelity \(F(\sigma ,\left|\psi \right\rangle )\), where the mixed state σ is obtained by running the VQSE eigenvector preparation circuit on the noisy simulator. Curves depict the average of 10 instances of the algorithm. As the number of iterations increases, the cost function value decreases, which implies that we are able to train V(θ) in the presence of noise. After a few iterations of the VQSE optimization loop, we find \(F(\sigma ,\left|\psi \right\rangle ) \,>\, F(\rho ,\left|\psi \right\rangle )\).

Entanglement spectroscopy

We now discuss the possibility of employing VQSE to compute the entanglement spectrum of a state ρ which is obtained as the reduced state of a bipartite quantum system \(\left|{\psi }_{AB}\right\rangle\), i.e., \(\rho ={{{{\rm{Tr}}}}}_{B}\left|{\psi }_{AB}\right\rangle \,\left\langle {\psi }_{AB}\right|\). Let d denote the dimension of ρ. The entanglement spectrum57 refers to the collection \({\{{\lambda }_{k}\}}_{k = 1}^{d}\) of eigenvalues of ρ, and as discussed in ref. 58, entanglement spectroscopy is a useful tool to analyze states \(\left|{\psi }_{AB}\right\rangle\) prepared by simulating many-body systems on a quantum computer. Specifically, the entanglement spectrum is useful to study the bipartite entanglement, as it contains more universal signatures than the von Neumann entropy alone57, and it can detect the presence of quantum critical points59,60.

Let us now consider an N = 8 spin-1/2 cyclic chain interacting through uniform nearest-neighbor XY Heisenberg couplings in the presence of a non-transverse magnetic field. The Hamiltonian of the system is

$$H=-\mathop{\sum}\limits_{j}({h}_{x}{S}_{j}^{x}+{h}_{z}{S}_{j}^{z}+{J}_{x}{S}_{j}^{x}{S}_{j+1}^{x}+{J}_{y}{S}_{j}^{y}{S}_{j+1}^{y})\,,$$
(19)

where j labels the site in the chain, \({S}_{j}^{\mu }\) the spin operator (with μ = x, y, z), Jμ the coupling strength, and hμ the magnetic fields. Here, Jμ > 0 leads to ferromagnetic (FM) coupling, while Jμ < 0 leads to antiferromagnetic (AFM) coupling. As shown in refs. 60,61, for specific values of the fields hμ (known as factorizing fields) the Hamiltonian in (19) presents quantum critical points known as “factorization” points. At the non-transverse factorizing field, the ground state of H becomes a separable non-degenerate state, such that one eigenvalue of the reduced state ρ is exactly equal to one, while the rest are exactly zero.
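
For reference, at this system size the exact entanglement spectrum can be computed classically; below is a numpy sketch with illustrative (not the paper's) field and coupling values:

```python
import numpy as np

# Spin-1/2 operators S^mu = sigma^mu / 2
sx = np.array([[0, 1], [1, 0]], dtype=complex) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]], dtype=complex) / 2

def op_at(op, site, N):
    """Embed a single-site operator at `site` in an N-spin chain."""
    out = np.array([[1.0 + 0j]])
    for s in range(N):
        out = np.kron(out, op if s == site else np.eye(2))
    return out

def xy_hamiltonian(N, hx, hz, Jx, Jy):
    """The cyclic XY Hamiltonian of Eq. (19)."""
    H = np.zeros((2**N, 2**N), dtype=complex)
    for j in range(N):
        k = (j + 1) % N                      # cyclic boundary conditions
        H -= hx * op_at(sx, j, N) + hz * op_at(sz, j, N)
        H -= Jx * op_at(sx, j, N) @ op_at(sx, k, N)
        H -= Jy * op_at(sy, j, N) @ op_at(sy, k, N)
    return H

N = 8
H = xy_hamiltonian(N, hx=0.5, hz=0.5, Jx=1.0, Jy=0.5)   # illustrative values
gs = np.linalg.eigh(H)[1][:, 0]                          # ground state
psi = gs.reshape(2**4, 2**4)              # split: 4 neighboring spins | rest
rho = psi @ psi.conj().T                  # Tr_B |psi><psi|
print(np.sort(np.linalg.eigvalsh(rho))[::-1][:3])        # exact lambda_1..3
```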

In Fig. 7a, c, we show results of implementing VQSE with an adaptive Hamiltonian to compute the three largest eigenvalues of the state ρ defined as the reduced state of 4 neighboring spins obtained from the ground state of (19). For simplicity, we have parametrized the fields as \(({h}_{z},{h}_{x})=h(\cos (\gamma ),\sin (\gamma ))\) with γ fixed. Specifically, in Fig. 7a, c we plot the estimated eigenvalues versus the field magnitude h for a system with FM and AFM couplings, respectively. Dashed lines indicate the exact eigenvalues. For each field value, we run 8 instances of VQSE, and even for such a small number of runs the estimated eigenvalues give good approximations, with relative errors generally of order ~10−2.

Fig. 7: Exact and estimated eigenvalues versus field value, for the VQSE entanglement spectroscopy implementations.
figure 7

The input state ρ is given as the reduced state of 4 neighboring qubits from the ground state of (19). Top and bottom rows correspond to ferromagnetic and antiferromagnetic couplings, respectively. Dashed curves represent the exact three largest eigenvalues of ρ, while plot markers indicate the VQSE estimated eigenvalues. In a and c we see that VQSE can accurately estimate the eigenvalues. In b and d we plot 1 − λ1, and the y axis is on a log scale. Here we see that the quantum critical factorization points are detected at h/Jx ≈ 0.76 and h/Jx ≈ 1.43 in (b) and (d), respectively, since at those points we have \({\widetilde{\lambda }}_{1}\approx 1\), and \({\widetilde{\lambda }}_{2},{\widetilde{\lambda }}_{3}\approx 0\).

In Fig. 7b, d, we show the same data as in (a) and (c), but with the y axis on a logarithmic scale and with 1 − λ1 plotted instead of the largest eigenvalue λ1. For the FM (AFM) case, there is a factorization point at h/Jx ≈ 0.76 (h/Jx ≈ 1.43). As depicted in these panels, around the critical points we correctly find \({\widetilde{\lambda }}_{1}\approx 1\) and \({\widetilde{\lambda }}_{2},{\widetilde{\lambda }}_{3}\approx 0\). These results show that VQSE can detect quantum critical factorization points.

Discussion

In the NISQ era, every qubit and every gate counts. Wasteful usage of qubits or gates will ultimately limit the problem size that an algorithm can solve. In this work, we presented an algorithm for extracting the eigensystem of a quantum state ρ that is as frugal as we could imagine, with respect to qubit count.

We introduced VQSE, which estimates the m-largest eigenvalues and associated eigenvectors of ρ using only a single copy of ρ, and hence only n qubits per iteration of the algorithm. VQSE exploits the mathematical connection between diagonalization and majorization to define an efficiently computable cost function as the expectation value of a Hamiltonian. We derived an operational meaning of this cost function as a bound on the eigensystem error. Furthermore, we introduced a training method that adaptively updates the VQSE cost function based on the information gained from measurements performed throughout the optimization. This was aimed at addressing both barren plateaus and local minima in the cost landscape.

We have numerically implemented VQSE for several applications. We showed that VQSE can be employed for PCA by implementing the VQSE algorithm on states of n = 6, 8, and 10 qubits to estimate the six largest eigenvalues. Our numerical results (Figs. 4 and 5) indicate that our adaptive cost function approach leads to smaller errors than the ones obtained by training a fixed cost function. We also showed (Fig. 7) that one can detect quantum critical points by performing entanglement spectroscopy with the eigenvalues obtained via VQSE. Finally, we employed VQSE to mitigate errors that occur during the W-state preparation circuit. This involved running VQSE on a noisy simulator to re-purify the state, i.e., find the circuit that prepares the eigenvector with the largest eigenvalue. We found (Fig. 6) that the re-purified state obtained by VQSE improved the fidelity with the target W state, and hence reduced the effects of noise.

Comparison to literature

Since VQSE only requires n qubits, it is as qubit frugal as it can possibly be when compared to other algorithms for the same task, such as quantum PCA (qPCA)33, VQSD6, and quantum state singular value decomposition (QSVD)27. The quantum phase estimation and density matrix exponentiation primitives in qPCA make it difficult to implement in the near term62, and this is supported by an attempted implementation in ref. 6 that resulted in poor performance. On the other hand, VQSD and QSVD are variational algorithms and hence have the possibility of lower-depth requirements. But they still need to employ a larger number of qubits than VQSE. Specifically, VQSD needs to perform the so-called Diagonalized Inner Product Test6, which requires two copies of ρ, i.e., twice as many qubits as VQSE. In addition, it is also worth noting that VQSD is vulnerable to noise, since any asymmetry between the noise acting on each copy of ρ will affect the result of the algorithm. Finally, in QSVD, one needs to either compute or have access to a purification \(\left|\psi \right\rangle\) of ρ. Hence QSVD requires a number of qubits between n and 2n. Moreover, we expect that noise will be a bigger issue for QSVD than for VQSE, since in practice the assumption that one has a pure state in QSVD can often be violated due to incoherent noise during state preparation.

Quantum-inspired classical algorithms63 for PCA can perform well in practice, provided that the matrix has a very large dimension, low rank, and low condition number64. We note that VQSE does not have such limitations, except that VQSE yields results with high accuracy for low-rank states. It is also paramount to recall that recent results have shown that an exponential advantage is still possible for PCA65,66, even for near-term algorithms, as the quantum-inspired classical algorithms are artificially given too much power via access to quantum state amplitudes. Thus, in view of these recent results, VQSE can be useful in the quest for achieving a quantum speedup with quantum PCA, particularly for the analysis of quantum data. For the case of classical data, the success of VQSE for PCA relies on the efficiency of preparing a quantum state corresponding to the covariance matrix of the classical data67. In addition, VQSE has applications not only for PCA but also for other tasks such as entanglement spectroscopy and error mitigation on NISQ devices. For error mitigation, we leave for future work combining our approach with other error mitigation techniques such as Virtual Distillation68,69, which also seeks to re-purify noisy quantum states.

Future directions

Due to the rapid rise of VQE2, much research has gone into how to prepare ground and excited states on NISQ devices. However, more research is needed on how to characterize these states, once prepared. This is where VQSE comes in, as VQSE can extract the entanglement spectra of these states and hence characterize important properties like topological order57. Hence it is worth exploring in the future the idea of pairing up the VQE and VQSE algorithms, where VQSE is implemented immediately after VQE.

Furthermore, VQSE has immediate application for estimating the fidelity of two quantum states with reduced resource requirements. This is because an algorithm was previously introduced8 to estimate fidelity by using state diagonalization as a subroutine, and hence VQSE can provide a more efficient version of this subroutine.

A crucial technical idea in this work was our adaptively-updated cost function, which improved optimization performance. It is worth investigating whether this adaptive method can improve the performance of other VQAs2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27.

Another direction to explore is whether VQSE exhibits noise resilience17. We suspect this to be true given the similar structure of VQSE and the variational quantum compiling algorithms investigated in ref. 17.

This is important as we are proposing that VQSE will be a useful tool for error mitigation. Namely, we envision that VQSE could be used as a subroutine to improve the accuracy of several quantum algorithms. For example, one could use VQSE to re-purify the noisy quantum state obtained as the outcome of the VQE algorithm. Alternatively, one could periodically perform VQSE whilst running a dynamical quantum simulation on a NISQ device, which would re-purify the state as it is evolving in time. This could allow one to simulate long-time dynamics, i.e., times significantly beyond the coherence time of a NISQ device.

Methods

Operational meaning of the cost function

In this section, we provide a derivation for Eq. (16). First, we rewrite the eigenvalue error in Eq. (15) as follows:

$$\begin{array}{r}{\varepsilon }_{\lambda }={{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}+{{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}-2{{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\,,\end{array}$$
(20)

where λm ≡ (λ1, …, λm) and \({{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\equiv ({\widetilde{\lambda }}_{1},\ldots ,{\widetilde{\lambda }}_{m})\). Since the eigenvalues of a positive semidefinite operator majorize its diagonal elements, we have that \({{{{\boldsymbol{\lambda }}}}}^{m}\succ {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\). Moreover, from the Schur convexity property of the dot product with an ordered vector, it follows that \({{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\,\geqslant\, {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\), which further implies the following inequality:

$${\varepsilon }_{\lambda }\,\leqslant\, {{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}-{{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\,.$$
(21)

Similarly, from Eq. (15) we get

$$\begin{array}{r}{\varepsilon }_{v}\,\leqslant\, {{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}-{{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\,,\end{array}$$
(22)

where we again used the fact that the eigenvalues of a positive semidefinite operator majorize its diagonal elements, and hence \({{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}\,\geqslant\, \mathop{\sum }\nolimits_{i=1}^{m}\langle {\widetilde{\lambda }}_{i}| {\rho }^{2}| {\widetilde{\lambda }}_{i}\rangle\).

We recall from Eq. (3) that the VQSE cost function can be expressed as \(C=\mathop{\sum }\nolimits_{i = 1}^{d}{E}_{i}{p}_{i}\), where we omit the θ and t dependence of C. Therefore, the following chain of inequalities holds:

$$\begin{array}{ll}C\,\geqslant\, \mathop{\sum }\limits_{i=1}^{m}{E}_{i}{p}_{i}+{E}_{m+1}\mathop{\sum}\limits_{i > m}{p}_{i}\\ \quad=\,{E}_{m+1}-\left(\mathop{\sum }\limits_{i=1}^{m}{p}_{i}({E}_{m+1}-{E}_{i})\right)\\ \quad\geqslant\, {E}_{m+1}-\sqrt{\left(\mathop{\sum }\limits_{i=1}^{m}{p}_{i}^{2}\right)\left(\mathop{\sum }\limits_{i=1}^{m}{({E}_{m+1}-{E}_{i})}^{2}\right)},\end{array}$$
(23)

where d = 2n. The first inequality follows from the fact that Ei ⩾ Em+1, ∀i ⩾ m + 1, and \({\sum }_{i \,{ > }\,m}{p}_{i}=1-\mathop{\sum }\nolimits_{i = 1}^{m}{p}_{i}\). The second inequality follows from the Cauchy–Schwarz inequality for the dot product of two vectors, u · v ⩽ ∥u∥ ∥v∥. By combining Eq. (23) with the fact that \(\mathop{\sum }\nolimits_{i = 1}^{m}{\widetilde{\lambda }}_{i}^{2}\geqslant \mathop{\sum }\nolimits_{i = 1}^{m}{p}_{i}^{2}\) (since \({\widetilde{\lambda }}_{i}\in {{{\mathcal{L}}}}\) are the largest diagonal elements of \(\widetilde{\rho }\)), we find that

$$\sqrt{\mathop{\sum }\limits_{i=1}^{m}{\tilde{\lambda }}_{i}^{2}}\,\geqslant\, \frac{{E}_{m+1}-C}{\sqrt{\mathop{\sum }\nolimits_{i = 1}^{m}{({E}_{m+1}-{E}_{i})}^{2}}}\,.$$
(24)

Using the fact that \({{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}\,\leqslant\, {{{\boldsymbol{\lambda }}}}\cdot {{{\boldsymbol{\lambda }}}}={{{\rm{Tr}}}}[{\rho }^{2}]\), we obtain the following inequality from (24):

$${{{{\boldsymbol{\lambda }}}}}^{m}\cdot {{{{\boldsymbol{\lambda }}}}}^{m}-{{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\cdot {{{{\widetilde{\boldsymbol{\lambda }}}}}}^{m}\,\leqslant\, {{{\rm{Tr}}}}[{\rho }^{2}]-\frac{{({E}_{m+1}-C)}^{2}}{\mathop{\sum }\nolimits_{i = 1}^{m}{({E}_{m+1}-{E}_{i})}^{2}}\,.$$

Combining this with (21) and (22) leads to (16).

Verification of solution quality

Here we provide a proof of Eq. (17), and we show that this bound is tighter than the bound in (16). From the definition of the eigenvalue and eigenvector error in (15), it is straightforward to see that \({\varepsilon }_{\lambda }\,\leqslant\, \mathop{\sum }\nolimits_{i = 1}^{d}{({\lambda }_{i}-{\widetilde{\lambda }}_{i})}^{2}\), and \({\varepsilon }_{v}\,\leqslant\, \mathop{\sum }\nolimits_{i = 1}^{d}\langle {\delta }_{i}| {\delta }_{i}\rangle\), where d = 2n. By following a procedure similar to the one employed in deriving (22), we find

$${\varepsilon }_{\lambda }\,\leqslant\, {{{\boldsymbol{\lambda }}}}\cdot {{{\boldsymbol{\lambda }}}}-{{{\widetilde{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}\,,$$
(25)

where we recall that λ and \({{{\widetilde{\boldsymbol{\lambda }}}}}\) denote d-dimensional vectors of ordered exact and estimated eigenvalues of ρ, respectively. Moreover, from \(\left|{\delta }_{i}\right\rangle =({\mathbb{1}}-|{\widetilde{\lambda }}_{i}\rangle \langle {\widetilde{\lambda }}_{i}|)\rho |{\widetilde{\lambda }}_{i}\rangle\), it is straightforward to get

$${\varepsilon }_{v}\,\leqslant\, \mathop{\sum }\limits_{i=1}^{d}\langle {\widetilde{\lambda }}_{i}| {\rho }^{2}| {\widetilde{\lambda }}_{i}\rangle -{{{\widetilde{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}={{{\boldsymbol{\lambda }}}}\cdot {{{\boldsymbol{\lambda }}}}-{{{\widetilde{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}\,,$$
(26)

where we used the fact that \(\mathop{\sum }\nolimits_{i = 1}^{d}\langle {\widetilde{\lambda }}_{i}| {\rho }^{2}| {\widetilde{\lambda }}_{i}\rangle ={{{\rm{Tr}}}}[{\rho }^{2}]\), which follows from the invariance of trace under a basis transformation.
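This basis-independence step is easy to verify directly; below is a minimal numerical check, with a random ρ and a random unitary standing in for the trained V(θ) (illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real
V, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))

# Columns of V-dagger play the role of the inferred eigenvectors |lambda_tilde_i>.
basis = V.conj().T
total = sum((basis[:, i].conj() @ rho @ rho @ basis[:, i]).real for i in range(d))
assert np.isclose(total, np.trace(rho @ rho).real)   # trace is basis independent
```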

Let \({{{\widehat{\boldsymbol{\lambda }}}}}=({\widetilde{\lambda }}_{1},\ldots ,{\widetilde{\lambda }}_{\widehat{m}},\frac{1-\mathop{\sum }\nolimits_{i = 1}^{\widehat{m}}{\widetilde{\lambda }}_{i}}{{2}^{n}-\widehat{m}},\ldots ,\frac{1-\mathop{\sum }\nolimits_{i = 1}^{\widehat{m}}{\widetilde{\lambda }}_{i}}{{2}^{n}-\widehat{m}})\), with \(\widehat{m} \,>\, m\), be a vector majorized by \({{{\widetilde{\boldsymbol{\lambda }}}}}\), i.e., \({{{\widetilde{\boldsymbol{\lambda }}}}}\succ {{{\widehat{\boldsymbol{\lambda }}}}}\). Since the dot product with an ordered vector is a Schur convex function, we have \({{{\widehat{\boldsymbol{\lambda }}}}}\cdot {{{\widehat{\boldsymbol{\lambda }}}}}\,\leqslant\, {{{\widehat{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}\,\leqslant\, {{{\widetilde{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}\), which further implies the following inequality:

$${{{\boldsymbol{\lambda }}}}\cdot {{{\boldsymbol{\lambda }}}}-{{{\widetilde{\boldsymbol{\lambda }}}}}\cdot {{{\widetilde{\boldsymbol{\lambda }}}}}\,\leqslant\, {{{\boldsymbol{\lambda }}}}\cdot {{{\boldsymbol{\lambda }}}}-\left(\mathop{\sum }\limits_{i=1}^{\widehat{m}}{\widetilde{\lambda }}_{i}^{2}+\frac{{(1-\mathop{\sum }\nolimits_{i = 1}^{\widehat{m}}{\widetilde{\lambda }}_{i})}^{2}}{{2}^{n}-\widehat{m}}\right).$$
(27)

This inequality can be combined with (25) and (26) to obtain the bound in (17).

We now show that (17) is tighter than (16). Specifically, we prove that the negative term on the right-hand side of (17) is larger than the one in (16). Consider the following chain of inequalities:

$$\left(\mathop{\sum }\limits_{i=1}^{\widehat{m}}{\widetilde{\lambda }}_{i}^{2}+\frac{{(1-\mathop{\sum }\nolimits_{i = 1}^{\widehat{m}}{\widetilde{\lambda }}_{i})}^{2}}{{2}^{n}-\widehat{m}}\right)\geqslant \mathop{\sum }\limits_{i=1}^{m}{\widetilde{\lambda }}_{i}^{2}\geqslant \frac{{({E}_{m+1}-C)}^{2}}{\mathop{\sum }\nolimits_{i=1}^{m}{({E}_{m+1}-{E}_{i})}^{2}}\,,$$

where the first inequality uses \(\widehat{m} \,>\, m\), and the last inequality follows from squaring (24); note that near the minimum of the cost one has \(C\,\leqslant\, {E}_{m+1}\), so the right-hand side of (24) is non-negative and squaring preserves the inequality.
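The ordering just established can also be checked numerically. In the sketch below (the same illustrative random setup as before), the final comparison uses \(\max ({E}_{m+1}-C,0)\), since (24) yields a nontrivial squared bound only when \(C\,\leqslant\, {E}_{m+1}\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, m_hat = 3, 2, 4
d = 2 ** n

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real
V, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))

p = np.sort(np.diag(V @ rho @ V.conj().T).real)[::-1]   # lambda-tilde, descending
E = np.sort(rng.normal(size=d))                         # E_1 < ... < E_d
C = np.dot(E, p)   # small energies paired with large populations, as after training

# Negative term of (17) versus the negative term of (16).
term_17 = np.sum(p[:m_hat] ** 2) + (1 - np.sum(p[:m_hat])) ** 2 / (d - m_hat)
term_16 = max(E[m] - C, 0.0) ** 2 / np.sum((E[m] - E[:m]) ** 2)
assert term_17 >= np.sum(p[:m] ** 2) >= term_16 - 1e-12
```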

Gradient of the cost function

Here we show that the partial derivative of (4) with respect to an angle θν is given by

$$\begin{array}{l}\displaystyle{\frac{\partial C(t,{{{\boldsymbol{\theta }}}})}{\partial {\theta }_{\nu }}}=\frac{1}{2}\left({{{\rm{Tr}}}}\left[H(t)V({{{{\boldsymbol{\theta }}}}}_{+})\rho {V}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}_{+})\right]\right.\\ \left.\qquad\quad-{{{\rm{Tr}}}}\left[H(t)V({{{{\boldsymbol{\theta }}}}}_{-})\rho {V}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}_{-})\right]\right)\,.\end{array}$$
(28)

Writing θ = (θ1, …, θν, …), the shifted parameter vectors are simply θ± = (θ1, …, θν ± π/2, …); hence the gradient is efficiently accessible by shifting a single parameter in θ and measuring the expectation value 〈H(t)〉.
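For illustration, here is a minimal single-qubit NumPy sketch of this parameter-shift rule; the rotation axis, input state, and observable are arbitrary choices, not part of the algorithm. The shifted-cost difference matches a numerical derivative:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def V(theta):
    # Single-qubit rotation exp(i*theta*Y/2), following the paper's convention.
    return np.cos(theta / 2) * I2 + 1j * np.sin(theta / 2) * Y

rho = np.array([[0.8, 0.1], [0.1, 0.2]], dtype=complex)   # a valid one-qubit state
H = Z                                                     # toy Hamiltonian

def cost(theta):
    Vt = V(theta)
    return np.trace(H @ Vt @ rho @ Vt.conj().T).real

theta = 0.7
shift = 0.5 * (cost(theta + np.pi / 2) - cost(theta - np.pi / 2))   # Eq. (28)
fd = (cost(theta + 1e-6) - cost(theta - 1e-6)) / 2e-6               # central difference
assert abs(shift - fd) < 1e-6
```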

Let us consider the layered hardware-efficient ansatz of Fig. 3a. Here V(θ) consists of a fixed number L of layers of two-qubit gates Bμ(θμ) acting on alternating pairs of neighboring qubits. Moreover, Bμ(θμ) can always be expressed as a product of ημ gates from a given alphabet \({{{\mathcal{A}}}}=\{{U}_{k}({\theta }_{k})\}\) as

$${B}_{\mu }({{{{\boldsymbol{\theta }}}}}_{\mu })={U}_{{\eta }_{\mu }}({\theta }_{\mu }^{{\eta }_{\mu }})\ldots {U}_{\nu }({\theta }_{\mu }^{\nu })\ldots {U}_{1}({\theta }_{\mu }^{1})\,.$$
(29)

Here the \({\theta }_{\mu }^{\nu }\) are continuous parameters, and we can always write, without loss of generality, Uk(θ) = Rk(θ)Tk, where \({R}_{k}(\theta )={\rm{e}}^{{\rm{i}}\theta {\sigma }_{k}/2}\) is a single-qubit rotation and Tk is an unparametrized gate.

We can then compute \({\partial }_{\nu }{B}_{\mu }({{{{\boldsymbol{\theta }}}}}_{\mu })\equiv \partial {B}_{\mu }({{{{\boldsymbol{\theta }}}}}_{\mu })/\partial {\theta }_{\mu }^{\nu }\) as

$${\partial }_{\nu }{B}_{\mu }({{{{\boldsymbol{\theta }}}}}_{\mu })=\frac{{\rm{i}}}{2}{U}_{{\eta }_{\mu }}({\theta }_{\mu }^{{\eta }_{\mu }})\ldots {\sigma }_{\nu }{U}_{\nu }({\theta }_{\mu }^{\nu })\ldots {U}_{1}({\theta }_{\mu }^{{\eta }_{1}})\,.$$
(30)

Then, without loss of generality let us write V(θ) = VL(θL)Bμ(θμ)VR(θR), where VL(θL), and VR(θR) contain all gates in V(θ) except for Bμ(θμ). By noting that ∂νV(θ) = VL(θL)∂νBμ(θμ)VR(θR), we have

$$\begin{array}{ll}{\partial }_{\nu }C={{{\rm{Tr}}}}\left[H{V}_{L}{\partial }_{\nu }{B}_{\mu }{V}_{R}\rho {V}_{R}^{{\dagger} }{B}_{\mu }^{{\dagger} }{V}_{L}^{{\dagger} }\right]\\ \qquad+\,{{{\rm{Tr}}}}\left[H{V}_{L}{B}_{\mu }{V}_{R}\rho {V}_{R}^{{\dagger} }{\partial }_{\nu }{B}_{\mu }^{{\dagger} }{V}_{L}^{{\dagger} }\right]\,,\end{array}$$

where we omitted the parameter dependence for simplicity. Then, from Eq. (30) and using the following identity (which is valid for any matrix A)

$${\rm{i}}[{\sigma }_{\nu },A]={R}_{\nu }\left(\frac{\pi }{2}\right)A{R}_{\nu }^{{\dagger} }\left(\frac{\pi }{2}\right)-{R}_{\nu }\left(-\frac{\pi }{2}\right)A{R}_{\nu }^{{\dagger} }\left(-\frac{\pi }{2}\right)\,,$$
(31)

where \({R}_{k}(\theta )={\rm{e}}^{{\rm{i}}\theta {\sigma }_{k}/2}\), we obtain

$$\begin{array}{l}\displaystyle{\frac{\partial C(t,{{{\boldsymbol{\theta }}}})}{\partial {\theta }_{\nu }}}=\frac{1}{2}\left({{{\rm{Tr}}}}\left[H(t)V({{{{\boldsymbol{\theta }}}}}_{+})\rho {V}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}_{+})\right]\right.\\ \left.\quad\qquad\qquad\;-\,{{{\rm{Tr}}}}\left[H(t)V({{{{\boldsymbol{\theta }}}}}_{-})\rho {V}^{{\dagger} }({{{{\boldsymbol{\theta }}}}}_{-})\right]\right)\,.\end{array}$$
(32)
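Identity (31) is easy to verify numerically; below is a minimal check with σν = Z and a random 2 × 2 matrix A (illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def R(theta):
    # R(theta) = exp(i*theta*Z/2), matching the convention below Eq. (29).
    return np.cos(theta / 2) * np.eye(2) + 1j * np.sin(theta / 2) * Z

A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))

lhs = 1j * (Z @ A - A @ Z)                                 # i[sigma, A]
rhs = (R(np.pi / 2) @ A @ R(np.pi / 2).conj().T
       - R(-np.pi / 2) @ A @ R(-np.pi / 2).conj().T)
assert np.allclose(lhs, rhs)
```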

Algorithm for the adaptive cost function

Algorithm 1 shows a simple adaptive strategy that illustrates how one can update H(t); specifically, we consider the case when f(t) is a stepwise function. We define the VQSE optimization-loop termination condition in terms of the maximum number of allowed iterations Nmax, and we define an updating parameter s (with Nmax/s an integer) such that HG(t) is updated every s steps. Finally, we use the term optimizer, denoted opt, for a function that takes as input a set of parameters θ and a cost function C(t, θ) (or the gradient of the cost, for gradient-based optimization) and returns an updated set of parameters that attempts to solve the minimization problem of Eq. (5).

Algorithm 1

Adaptive cost function with stepwise schedule f(t)

Input: state ρ; trainable unitary V(θ); integer m; timestep \(\delta {{{\rm{t}}}}=1/{{{{\rm{N}}}}}_{\max }\); adapting stepsize ts = 1/s; local time-independent Hamiltonian HL; a set of constant parameters \({\left\{{q}_{i}\right\}}_{i = 1}^{m}\); classical optimizer opt.

Output: parameters θopt which minimize the cost function, i.e., \({{{{\boldsymbol{\theta }}}}}_{{{{\rm{opt}}}}}=\arg {\min }_{\theta }\,C({{{\boldsymbol{\theta }}}}).\)

Init: randomly choose a set of initial parameters θ; H(t) ← HL; t ← δt

1: while t < 1 do
2:   if t is divisible by ts then
3:     measure \(V({{{\boldsymbol{\theta }}}})\rho {V}^{{\dagger} }({{{\boldsymbol{\theta }}}})\) in the standard basis and define the sets \({{{\mathcal{L}}}}\) and \({{{\mathcal{Z}}}}\)
4:     \({H}_{G}(t)\leftarrow {\mathbb{1}}-\mathop{\sum }\nolimits_{i = 1}^{m}{q}_{i}\left|{{{{\boldsymbol{z}}}}}_{i}\right\rangle \left\langle {{{{\boldsymbol{z}}}}}_{i}\right|\) with \({{{{\boldsymbol{z}}}}}_{i}\in {{{\mathcal{Z}}}}\)
5:     H(t) ← (1 − t)HL + tHG(t)
6:   run opt with C and θ as input, and \({{{{\boldsymbol{\theta }}}}}_{\min }\) as output
7:   \({{{\boldsymbol{\theta }}}}\leftarrow {{{{\boldsymbol{\theta }}}}}_{\min }\)
8:   t ← t + δt
9:   if t = 1 then
10:    θopt ← θ
     Return: θopt
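For concreteness, below is a schematic NumPy implementation of Algorithm 1. It is a sketch, not a prescription: exact density-matrix simulation stands in for quantum measurements, a naive random-search routine stands in for opt, and the ansatz, HL, and the weights {qi} are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2, 2
d = 2 ** n
N_max, s = 100, 5               # max iterations; H_G is adapted s times in total
dt = 1.0 / N_max                # timestep delta-t

# Random 2-qubit density matrix (illustrative input state).
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real

# Illustrative trainable unitary: a product of Givens rotations (real orthogonal).
n_params = d * (d - 1) // 2

def ansatz(theta):
    V = np.eye(d, dtype=complex)
    k = 0
    for i in range(d):
        for j in range(i + 1, d):
            G = np.eye(d, dtype=complex)
            c, sn = np.cos(theta[k]), np.sin(theta[k])
            G[i, i], G[j, j], G[i, j], G[j, i] = c, c, -sn, sn
            V = G @ V
            k += 1
    return V

def cost(theta, H):
    V = ansatz(theta)
    return np.trace(H @ V @ rho @ V.conj().T).real

H_L = np.diag(np.arange(d, dtype=float))   # stand-in local Hamiltonian
q = 1.0 + 0.1 * np.arange(m, 0, -1)        # non-degenerate weights q_i
theta = rng.uniform(0, 2 * np.pi, n_params)
H = H_L.copy()

t, step = dt, 0
while t < 1:
    step += 1
    if step % (N_max // s) == 1:            # discrete stand-in for "t divisible by t_s"
        pops = np.diag(ansatz(theta) @ rho @ ansatz(theta).conj().T).real
        Z_set = np.argsort(pops)[::-1][:m]  # bitstrings with the largest populations
        H_G = np.eye(d)
        for qi, zi in zip(q, Z_set):
            H_G[zi, zi] -= qi               # H_G = 1 - sum_i q_i |z_i><z_i|
        H = (1 - t) * H_L + t * H_G         # H(t) interpolation
    # Toy stand-in for `opt`: accept a random perturbation if it lowers the cost.
    trial = theta + rng.normal(scale=0.05, size=n_params)
    if cost(trial, H) < cost(theta, H):
        theta = trial
    t += dt

theta_opt = theta
print("final cost:", cost(theta_opt, H))
```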