Variational Quantum State Diagonalization

Variational hybrid quantum-classical algorithms are promising candidates for near-term implementation on quantum computers. In these algorithms, a quantum computer evaluates the cost of a gate sequence (with exponential speedup over classical cost evaluation), and a classical computer uses this information to adjust the parameters of the gate sequence. Here we present such an algorithm for quantum state diagonalization. State diagonalization has applications in condensed matter physics (e.g., entanglement spectroscopy) as well as in machine learning (e.g., principal component analysis). For a quantum state $\rho$ and gate sequence $U$, our cost function quantifies how far $ U\rho U^{\dagger}$ is from being diagonal. We introduce novel short-depth quantum circuits to quantify our cost. Minimizing this cost returns a gate sequence that approximately diagonalizes $\rho$. One can then read out approximations of the largest eigenvalues, and the associated eigenvectors, of $\rho$. As a proof-of-principle, we implement our algorithm on Rigetti's quantum computer to diagonalize one-qubit states and on a simulator to find the entanglement spectrum of the Heisenberg model ground state.


I. INTRODUCTION
The future applications of quantum computers, assuming that large-scale, fault-tolerant versions will eventually be realized, are manifold. From a mathematical perspective, applications include number theory [1], linear algebra [2][3][4], differential equations [5,6], and optimization [7]. From a physical perspective, applications include electronic structure determination [8,9] for molecules and materials and real-time simulation of quantum dynamical processes [10] such as protein folding and photoexcitation events. Naturally, some of these applications are more long-term than others. Factoring and solving linear systems of equations are typically viewed as longer term applications due to their high resource requirements. On the other hand, approximate optimization and determining electronic structure may be nearer term applications, and could even serve as demonstrations of quantum supremacy in the near future [11,12].
A major aspect of quantum algorithms research is to make applications of interest more near term by reducing the quantum resource requirements including qubit count, circuit depth, numbers of gates, and numbers of measurements. A powerful strategy for this purpose is algorithm hybridization, where a fully quantum algorithm is turned into a hybrid quantum-classical algorithm [13]. The benefit of hybridization is two-fold, both reducing the resources (hence allowing implementation on smaller hardware) as well as increasing accuracy (by outsourcing calculations to "error-free" classical computers).
Variational hybrid algorithms are a class of quantum-classical algorithms that involve minimizing a cost function that depends on the parameters of a quantum gate sequence. Cost evaluation occurs on the quantum computer, with exponential speedup over classical cost evaluation, and the classical computer uses this cost information to adjust the parameters of the gate sequence. Variational hybrid algorithms have been proposed for Hamiltonian ground state and excited state preparation [14][15][16], approximate optimization [7], error correction [17], quantum data compression [18], quantum simulation [19], and quantum compiling [20]. A key feature of such algorithms is their near-term relevance, since only the subroutine of cost evaluation occurs on the quantum computer, while the optimization procedure is entirely classical, and hence standard classical optimization tools can be employed.
In this work, we consider the application of diagonalizing quantum states. In condensed matter physics, diagonalizing states is useful for identifying properties of topological quantum phases, a field known as entanglement spectroscopy [21]. In data science and machine learning, diagonalizing the covariance matrix (which could be encoded in a quantum state [2,22]) is frequently employed for principal component analysis (PCA). PCA identifies features that capture the largest variance in one's data and hence allows for dimensionality reduction [23].
Classical methods for diagonalization typically scale polynomially in the matrix dimensions [24]. Similarly, the number of measurements required for quantum state tomography (a general method for fully characterizing a quantum state) scales polynomially in the dimension. Interestingly, Lloyd et al. proposed a quantum algorithm for diagonalizing quantum states that can potentially perform exponentially faster than these methods [2]. Namely, their algorithm, called quantum principal component analysis (qPCA), gives an exponential speedup for low-rank matrices. qPCA employs quantum phase estimation combined with density matrix exponentiation. These subroutines require a significant number of qubits and gates, making qPCA difficult to implement in the near term, despite its long-term promise.
Here, we propose a variational hybrid algorithm for quantum state diagonalization. For a given state ρ, our algorithm is composed of three steps: (1) Train the parameters α of a gate sequence U_p(α) such that ρ̃ = U_p(α_opt) ρ U_p(α_opt)† is approximately diagonal, (2) Read out the largest eigenvalues of ρ by measuring in the eigenbasis (i.e., by measuring ρ̃ in the standard basis), and (3) Prepare the eigenvectors associated with the largest eigenvalues. We call this the variational quantum state diagonalization (VQSD) algorithm. VQSD is a near-term algorithm with the same practical benefits as other variational hybrid algorithms. Employing a layered ansatz for U_p(α) (where p is the number of layers) allows one to obtain a hierarchy of approximations for the eigenvalues and eigenvectors. We therefore think of VQSD as an approximate diagonalization algorithm. We carefully choose our cost function C to have the following properties: (1) C is faithful (i.e., it vanishes if and only if ρ̃ is diagonal), (2) C is efficiently computable on a quantum computer, (3) C has an operational meaning such that it upper bounds the eigenvalue error (the discrepancy between the inferred and the exact eigenvalues), and (4) C scales well for training purposes in the sense that its gradient does not vanish exponentially in the number of qubits. The precise definition of C is given in Sec. II A and involves a difference of purities for different states. To compute C, we introduce novel short-depth quantum circuits that likely have applications outside the context of VQSD.
To illustrate our method, we implement VQSD on Rigetti's 8-qubit quantum computer. We successfully diagonalize one-qubit pure states using this quantum computer. To highlight future applications (when larger quantum computers are made available), we implement VQSD on a simulator to perform entanglement spectroscopy on the ground state of the one-dimensional Heisenberg model composed of 8 spins.
Our paper is organized as follows. Section II outlines the VQSD algorithm and presents its implementation. In Sec. III, we discuss the complexity of the VQSD algorithm for particular example states, we give a comparison to the qPCA algorithm, and we elaborate on future applications. Section IV presents our methods for quantifying diagonalization and for optimizing our cost function.

II. RESULTS
A. The VQSD Algorithm

1. Overall Structure

Figure 1 shows the structure of the VQSD algorithm. The goal of VQSD is to take, as its input, an n-qubit density matrix ρ in quantum form and then output approximations of the m largest eigenvalues and their associated eigenvectors. Here, m will typically be much less than 2^n, the total number of eigenvectors, although the user is free to increase m with increased algorithmic complexity (discussed below). The outputted eigenvalues will be in classical form, i.e., will be stored on a classical computer. In contrast, the outputted eigenvectors will be in quantum form, i.e., will be prepared on a quantum computer. This is necessary because the eigenvectors would have 2^n entries if they were stored on a classical computer, which is intractable for large n. Nevertheless, one can characterize important aspects of these eigenvectors with a polynomial number of measurements on the quantum computer.
Similar to classical eigensolvers, the VQSD algorithm is an approximate or iterative diagonalization algorithm. Classical eigenvalue algorithms are necessarily iterative, not exact [25]. Iterative algorithms are useful in that they allow for a trade-off between run-time and accuracy. Higher degrees of accuracy can be achieved at the cost of more iterations (equivalently, longer run-time), or short run-time can be achieved at the cost of lower accuracy. This flexibility is desirable in that it allows the user of the algorithm to dictate the quality of the solutions found.
The iterative feature of VQSD arises via a layered ansatz for the diagonalizing unitary. This idea similarly appears in other variational hybrid algorithms, such as the Quantum Approximate Optimization Algorithm [26]. Specifically, VQSD diagonalizes ρ by variationally updating a parameterized unitary U_p(α) such that

$\tilde{\rho}_p(\alpha) = U_p(\alpha)\,\rho\,U_p(\alpha)^{\dagger}$    (1)

is (approximately) diagonal at the optimal value α_opt. (For brevity we often write ρ̃ for ρ̃_p(α).) We assume a layered ansatz of the form

$U_p(\alpha) = L_1(\alpha_1)\,L_2(\alpha_2)\cdots L_p(\alpha_p).$    (2)

Here, p is a hyperparameter that sets the number of layers L_i(α_i), and each α_i is a set of optimization parameters that corresponds to internal gate angles within the layer. The parameter α in (1) refers to the collection of all α_i for i = 1, ..., p.

Once α_opt is found, one can run a particular quantum circuit (shown in Fig. 1(c) and discussed below) N_readout times to approximately determine the eigenvalues of ρ. The precision (i.e., the number of significant digits) of each eigenvalue increases with N_readout and with the eigenvalue's magnitude. Hence for small N_readout only the largest eigenvalues of ρ will be precisely characterized, so there is a connection between N_readout and how many eigenvalues, m, are determined.

The hyperparameter p is a refinement parameter, meaning that the accuracy of the eigenvalues typically increases as p increases. We formalize this argument as follows. Let C denote our cost function, defined below in (9), which we are trying to minimize. In general, the cost C will be non-increasing (i.e., will either decrease or stay constant) in p. One can ensure that this is true by taking the optimal parameters learned for p layers as the starting point for the optimization of p + 1 layers. Next, we argue that C is closely connected to the accuracy of the eigenvalues. Specifically, it gives an upper bound on the eigenvalue error. So while C is not directly a measure of eigenvalue error, one obtains an increasingly tighter upper bound on the eigenvalue error as C decreases (equivalently, as p increases).

[Figure 1 caption (continued): Here, p is a hyperparameter that dictates the quality of solution found. This optimal unitary is sent to the eigenvalue readout circuit (c) to obtain bitstrings z, the frequencies of which provide estimates of the eigenvalues of ρ. Along with the optimal unitary Up(αopt), these bitstrings are sent to the eigenvector preparation circuit to prepare the eigenstates of ρ on a quantum computer. Both the eigenvalues and eigenvectors are the outputs (d) of the VQSD algorithm.]

We introduce the following quantity to quantify eigenvalue error, i.e., the discrepancy between the inferred eigenvalues λ̃ and the actual eigenvalues λ:

$\Delta(\tilde{\lambda}, \lambda) = \sum_{i=1}^{d} (\lambda_i - \tilde{\lambda}_i)^2,$    (3)

where d = 2^n. Here the vectors λ̃ = (λ̃_1, ..., λ̃_d) and λ = (λ_1, ..., λ_d) are both ordered in decreasing order. As proven in Sec. IV A, our cost function upper bounds the eigenvalue error up to a proportionality factor q:

$\Delta(\tilde{\lambda}, \lambda) \leq C/q.$    (4)

Because C is non-increasing in p, the upper bound in (4) is non-increasing in p and goes to zero if C goes to zero. The various steps in the VQSD algorithm are shown schematically in Fig. 1. There are essentially three main steps: (1) an optimization loop that minimizes the cost C via back-and-forth communication between a classical and quantum computer, where the former adjusts α and the latter computes C for U_p(α), (2) a readout procedure for approximations of the m largest eigenvalues, which involves running a quantum circuit and then classically analyzing the statistics, and (3) a preparation procedure to prepare approximations of the eigenvectors associated with the m largest eigenvalues. In the following subsections, we elaborate on each of these procedures.
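The relationship between the cost and the eigenvalue error can be checked numerically. The following is a minimal numpy sketch (not the paper's implementation): it computes the cost C_1 as a difference of purities and the eigenvalue error Δ, and verifies the bound Δ ≤ C_1 on random states and unitaries. All function names here are illustrative choices.

```python
import numpy as np

def random_density_matrix(n_qubits, rank, rng):
    """Random rank-limited density matrix via a Wishart-like construction."""
    d = 2 ** n_qubits
    G = rng.normal(size=(d, rank)) + 1j * rng.normal(size=(d, rank))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def random_unitary(d, rng):
    """Haar-like random unitary from the QR decomposition of a Ginibre matrix."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def c1_cost(rho, U):
    """C1 = Tr(rho^2) - Tr(Z(rho_tilde)^2), with Z the global dephasing channel."""
    rho_t = U @ rho @ U.conj().T
    diag = np.real(np.diag(rho_t))
    return np.real(np.trace(rho @ rho)) - np.sum(diag ** 2)

def eigenvalue_error(rho, U):
    """Delta = sum_i (lambda_i - lambda_tilde_i)^2, both sorted decreasing."""
    lam = np.sort(np.linalg.eigvalsh(rho))[::-1]
    lam_t = np.sort(np.real(np.diag(U @ rho @ U.conj().T)))[::-1]
    return np.sum((lam - lam_t) ** 2)

rng = np.random.default_rng(0)
for _ in range(100):
    rho = random_density_matrix(3, rank=2, rng=rng)
    U = random_unitary(8, rng)
    # The bound Delta <= C1 (the q = 1 case of Eq. (4)) holds for every trial.
    assert eigenvalue_error(rho, U) <= c1_cost(rho, U) + 1e-12
```

Note that when U exactly diagonalizes ρ, both quantities vanish, consistent with the faithfulness of the cost.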

Parameter Optimization Loop
Naturally, there are many ways to parameterize U p (α). Ideally one would like the number of parameters to grow polynomially in both n and p, so that the search space is not exponentially large. Figure 2 presents an example ansatz that satisfies this condition. Each layer L i is broken down into layers of two-body gates that can be performed in parallel. These two-body gates can be further broken down into parameterized one-body gates, for example, with the construction in Ref. [27].
For a given ansatz, such as the one in Fig. 2, parameter optimization involves evaluating the cost C on a quantum computer for an initial choice of parameters and then modifying the parameters on a classical computer in an iterative feedback loop. The goal is to find

$\alpha_{\mathrm{opt}} = \arg\min_{\alpha} C(U_p(\alpha)).$    (5)

The classical optimization routine used for updating the parameters can involve either gradient-free or gradient-based methods. In Sec. IV B, we explore this further and discuss our optimization methods. In Eq. (5), C(U_p(α)) quantifies how far the state ρ̃_p(α) is from being diagonal. There are many ways to define such a cost function, and in fact there is an entire field of research on coherence measures that has introduced various such quantities [28]. We aim for a cost that is efficiently computable on a quantum computer, and hence we consider a cost that can be expressed in terms of purities. (It is well known that a quantum computer can find the purity Tr(σ²) of an n-qubit state σ with complexity scaling only linearly in n, an exponential speedup over classical computation [29,30].) Two such cost functions, whose individual merits we discuss in Sec. IV A, are

$C_1(U_p(\alpha)) = \mathrm{Tr}(\rho^2) - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2),$    (6)

$C_2(U_p(\alpha)) = \mathrm{Tr}(\rho^2) - \frac{1}{n}\sum_{j=1}^{n} \mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2).$    (7)
Here, Z and Z_j are quantum channels that dephase (i.e., destroy the off-diagonal elements) in the global standard basis and in the local standard basis on qubit j, respectively. Importantly, the two functions vanish under the same conditions:

$C_1(U_p(\alpha)) = 0 \iff C_2(U_p(\alpha)) = 0.$    (8)

So the global minima of C_1 and C_2 coincide and correspond precisely to unitaries U_p(α) that diagonalize ρ (i.e., unitaries such that ρ̃ is diagonal). As elaborated in Sec. IV A, C_1 has a useful operational meaning as a bound on eigenvalue error, namely $C_1 \geq \Delta(\tilde{\lambda}, \lambda)$. However, its landscape tends to be insensitive to changes in U_p(α) for large n. In contrast, we are not aware of a clear operational meaning for C_2. However, the landscape for C_2 is more sensitive to changes in U_p(α), making it useful for training U_p(α) when n is large. Due to these contrasting merits of C_1 and C_2, we define our overall cost function C as a weighted average of these two functions:

$C(U_p(\alpha)) = q\, C_1(U_p(\alpha)) + (1 - q)\, C_2(U_p(\alpha)),$    (9)

where q ∈ (0, 1] is a free parameter that allows one to tailor the VQSD method to the scale of one's problem. For small n, one can set q ≈ 1 since the landscape for C_1 is not too flat for small n, and, as noted above, C_1 is an operationally relevant quantity. For large n, one can set q to be small since the landscape for C_2 will provide the gradient needed to train U_p(α).
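On a classical simulator, the dephasing channels and the purity-difference costs have direct matrix representations. The following numpy sketch (an illustration, not the quantum circuits used in the paper) implements the global and local dephasing channels and evaluates C_1 and C_2 as differences of purities; function names are our own choices.

```python
import numpy as np

def dephase_global(rho):
    """Z: keep only the diagonal in the global standard basis."""
    return np.diag(np.diag(rho))

def dephase_local(rho, j, n):
    """Z_j: dephase qubit j only, via the projectors |0><0| and |1><1| on qubit j."""
    out = np.zeros_like(rho)
    for b in (0, 1):
        P = np.array([[1 - b, 0], [0, b]], dtype=complex)
        op = np.eye(1, dtype=complex)
        for k in range(n):
            op = np.kron(op, P if k == j else np.eye(2, dtype=complex))
        out += op @ rho @ op
    return out

def purity(rho):
    return np.real(np.trace(rho @ rho))

def costs(rho_tilde, rho, n):
    """C1 and C2 as differences of purities (the forms quoted in the text)."""
    c1 = purity(rho) - purity(dephase_global(rho_tilde))
    c2 = purity(rho) - sum(purity(dephase_local(rho_tilde, j, n))
                           for j in range(n)) / n
    return c1, c2
```

As a sanity check, both costs vanish for any diagonal ρ̃, and for the one-qubit state ρ̃ = ρ = |+⟩⟨+| one finds C_1 = 1 − 1/2 = 1/2.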
Computing C amounts to evaluating the purities of various quantum states on a quantum computer and then doing some simple classical post-processing that scales linearly in n. This can be seen from Eqs. (6) and (7). The first term, Tr(ρ²), in C_1 and C_2 is independent of U_p(α). Hence, Tr(ρ²) can be evaluated outside of the optimization loop in Fig. 1 using the Destructive Swap Test (see Sec. IV A for the circuit diagram). Inside the loop, we only need to compute Tr(Z(ρ̃)²) and Tr(Z_j(ρ̃)²) for all j. Each of these terms is computed by first preparing two copies of ρ̃ and then implementing quantum circuits whose depths are constant in n. For example, the circuit for computing Tr(Z(ρ̃)²) is shown in Fig. 1(b), and surprisingly it has a depth of only one gate. We call it the Diagonalized Inner Product (DIP) Test. The circuit for computing Tr(Z_j(ρ̃)²) is similar, and we call it the Partially Diagonalized Inner Product (PDIP) Test. We elaborate on both of these circuits in Sec. IV A.

Eigenvalue Readout
After finding the optimal diagonalizing unitary U_p(α_opt), one can use it to read out approximations of the eigenvalues of ρ. Figure 1(c) shows the circuit for this readout. One prepares a single copy of ρ and then acts with U_p(α_opt) to prepare ρ̃_p(α_opt). Measuring in the standard basis {|z⟩}, where z = z_1 z_2 ... z_n is a bitstring of length n, gives a set of probabilities {λ̃_z} with

$\tilde{\lambda}_z = \langle z|\,\tilde{\rho}_p(\alpha_{\mathrm{opt}})\,|z\rangle.$    (10)

We take the λ̃_z as the inferred eigenvalues of ρ. We emphasize that the λ̃_z are the diagonal elements, not the eigenvalues, of ρ̃_p(α_opt). Note that the squared error between the inferred eigenvalues and the corresponding true eigenvalues is bounded by the final cost that one obtained in the parameter optimization loop. That is,

$\sum_z (\lambda_z - \tilde{\lambda}_z)^2 \leq C/q.$    (11)

Each run of the circuit in Fig. 1(c) generates a bitstring z corresponding to the measurement outcomes. If one obtains z with frequency f_z for N_readout total runs, then

$\tilde{\lambda}_z^{\mathrm{est}} = f_z / N_{\mathrm{readout}}$    (12)

gives an estimate for λ̃_z. The statistical deviation of λ̃_z^est from λ̃_z goes with 1/√N_readout. The relative error ε_z (i.e., the ratio of the statistical error on λ̃_z^est to the value of λ̃_z^est) then goes as

$\epsilon_z = \frac{1}{\sqrt{N_{\mathrm{readout}}\,\tilde{\lambda}_z^{\mathrm{est}}}} = \frac{1}{\sqrt{f_z}}.$    (13)

This implies that events z with higher frequency f_z have lower relative error. In other words, the larger the inferred eigenvalue λ̃_z, the lower the relative error, and hence the more precisely it is determined from the experiment. When running VQSD, one can pre-decide on the desired values of N_readout and a threshold for the relative error, denoted ε_max. This error threshold ε_max will then determine m, i.e., how many of the largest eigenvalues get precisely characterized. So m is a function of N_readout, ε_max, and the set of inferred eigenvalues {λ̃_z}. Precisely, we take m = |λ̃^est| as the cardinality of the following set:

$\tilde{\lambda}^{\mathrm{est}} = \{\tilde{\lambda}_z^{\mathrm{est}} : \epsilon_z \leq \epsilon_{\mathrm{max}}\},$    (14)

which is the set of inferred eigenvalues that were estimated with the desired precision.
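The readout statistics can be illustrated classically by sampling bitstrings from the diagonal of ρ̃. The sketch below (our own toy model, with an assumed spectrum and assumed values of N_readout and ε_max) shows how the relative-error threshold selects m, the number of precisely characterized eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed diagonal of rho_tilde (the inferred eigenvalues): a low-entropy spectrum.
lam = np.array([0.70, 0.20, 0.06, 0.04])
n_readout = 100_000

# Each run of the readout circuit yields one bitstring z, drawn with prob lam[z].
samples = rng.choice(len(lam), size=n_readout, p=lam)
freqs = np.bincount(samples, minlength=len(lam))
lam_est = freqs / n_readout           # estimate: lam_est_z = f_z / N_readout

# Relative error of each estimate scales as 1/sqrt(f_z).
eps = 1.0 / np.sqrt(np.maximum(freqs, 1))

# Keep only eigenvalues resolved below a chosen threshold eps_max.
eps_max = 0.01
kept = lam_est[eps <= eps_max]
m = len(kept)
```

With these numbers, only the two largest eigenvalues (0.70 and 0.20) are sampled often enough to meet the 1% relative-error threshold, so m = 2; the small eigenvalues would require a larger N_readout.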

Eigenvector Preparation
The final step of VQSD is to prepare the eigenvectors associated with the m largest eigenvalues, i.e., the eigenvalues in the set in Eq. (14). Let Z = {z : λ̃_z^est ∈ λ̃^est} be the set of bitstrings z associated with the eigenvalues in λ̃^est. (Note that these bitstrings are obtained directly from the measurement outcomes of the circuit in Fig. 1(c), i.e., the outcomes become the bitstring z.) For each z ∈ Z, one can prepare the following state, which we take as the inferred eigenvector associated with our estimate of the inferred eigenvalue λ̃_z^est:

$|\tilde{v}_z\rangle = U_p(\alpha_{\mathrm{opt}})^{\dagger}\, |z\rangle$    (15)
$= U_p(\alpha_{\mathrm{opt}})^{\dagger}\, (X^{z_1} \otimes \cdots \otimes X^{z_n})\, |0\rangle^{\otimes n}.$    (16)

The circuit for preparing this state is shown in Fig. 1. As noted in (16), one first prepares |z⟩ by acting with X operators raised to the appropriate powers, and then one acts with U_p(α_opt)† to rotate from the standard basis to the inferred eigenbasis. Naturally, the inferred eigenvector |ṽ_z⟩ will be an approximation of the true eigenvector |v_z⟩ of ρ. Unlike the case of eigenvalue error, we do not have a simple bound that relates eigenvector error (i.e., how far |ṽ_z⟩ is from |v_z⟩) to our cost function. Nevertheless, it is clear that the eigenvector error must vanish as our cost function goes to zero.
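In the idealized case where U_p(α_opt) exactly diagonalizes ρ, the prepared state U†|z⟩ is an exact eigenvector. The following numpy sketch (a toy check with an exactly known diagonalizer, not the variational procedure) verifies this for a random low-rank state.

```python
import numpy as np

# Toy example: rho with a known eigenbasis; U_opt exactly diagonalizes it.
rng = np.random.default_rng(3)
G = rng.normal(size=(4, 2)) + 1j * rng.normal(size=(4, 2))
rho = G @ G.conj().T
rho /= np.trace(rho).real

lam, V = np.linalg.eigh(rho)          # columns of V are eigenvectors of rho
U_opt = V.conj().T                    # U_opt rho U_opt^dag is diagonal

def inferred_eigenvector(U_opt, z, n):
    """|v_z> = U_opt^dag X^z |0...0> = U_opt^dag |z>."""
    ket = np.zeros(2 ** n, dtype=complex)
    ket[int(z, 2)] = 1.0              # X^z |0...0> = |z>
    return U_opt.conj().T @ ket

v = inferred_eigenvector(U_opt, "11", n=2)
lam_v = np.real(v.conj() @ rho @ v)   # Rayleigh quotient recovers the eigenvalue
assert np.allclose(rho @ v, lam_v * v)
```

When the diagonalization is only approximate, the same Rayleigh quotient ⟨ṽ_z|ρ|ṽ_z⟩ still equals the inferred eigenvalue λ̃_z.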
Once they are in quantum form, each inferred eigenvector |ṽ_z⟩ can be characterized by measuring expectation values of interest. That is, important physical features such as energy or entanglement (e.g., entanglement witnesses) are associated with some Hermitian observable M, and one can evaluate the expectation value ⟨ṽ_z|M|ṽ_z⟩ to learn about these features.

B. Implementations
Here we present our implementations of VQSD, first for a one-qubit state on a cloud quantum computer to show that it is amenable to currently available hardware. Then, to illustrate the scaling to larger, more interesting problems, we implement VQSD on a simulator for the 8-spin ground state of the Heisenberg model.

One-Qubit State
We now discuss the results of applying VQSD to the one-qubit plus state ρ = |+⟩⟨+| on the 8Q-Agave quantum computer provided by Rigetti. Because the problem size is small (n = 1), we set q = 1 in the cost function (9). Since ρ is a pure state, Tr(ρ²) = 1 and the cost function is

$C = C_1 = 1 - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2).$    (17)

For U_p(α), we take p = 1, for which the layered ansatz becomes an arbitrary single qubit rotation. The results of VQSD for this state are shown in Fig. 3. In Fig. 3(a), the solid curve shows the cost versus the number of iterations in the parameter optimization loop, and the dashed curves show the inferred eigenvalues of ρ at each iteration. As can be seen, the cost decreases to a small value near zero and the eigenvalue estimates simultaneously converge to the correct values of zero and one. Hence, VQSD successfully diagonalized this state. Figure 3(b) shows the landscape of the optimization problem on Rigetti's 8Q-Agave quantum computer, Rigetti's noisy simulator, and a noiseless simulator. Here, we varied the angle α in the diagonalizing unitary U(α) = R_x(π/2)R_z(α) and computed the cost at each value of this angle. The landscape on the quantum computer has local minima near the optimal angles α = π/2, 3π/2 but the cost is not zero. This explains why we obtain the correct eigenvalues even though the cost is nonzero in Fig. 3(a). The nonzero cost can be due to a combination of decoherence, gate infidelity, and measurement error. As shown in Fig. 3(b), the 8Q-Agave quantum computer retuned during our data collection, and after this retuning, the landscape of the quantum computer matched that of the noisy simulator significantly better.
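The noiseless cost landscape for this one-qubit example can be reproduced in a few lines. The following numpy sketch (an illustration, assuming the same ansatz U(α) = R_x(π/2)R_z(α) described above) evaluates C(α) = 1 − Tr(Z(ρ̃)²) for ρ = |+⟩⟨+| and confirms that the minima sit at α = π/2 and 3π/2.

```python
import numpy as np

def rz(a):
    """Rotation about z: diag(e^{-ia/2}, e^{ia/2})."""
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def rx(a):
    """Rotation about x."""
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

plus = np.array([1, 1]) / np.sqrt(2)
rho = np.outer(plus, plus.conj())      # rho = |+><+|

def cost(alpha):
    """C = 1 - Tr(Z(rho_tilde)^2) for the pure state rho (so Tr(rho^2) = 1)."""
    U = rx(np.pi / 2) @ rz(alpha)
    diag = np.real(np.diag(U @ rho @ U.conj().T))
    return 1.0 - np.sum(diag ** 2)

alphas = np.linspace(0, 2 * np.pi, 201)
landscape = [cost(a) for a in alphas]
```

Analytically this landscape is C(α) = cos²(α)/2, which vanishes exactly at the two optimal angles and peaks at C = 1/2.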

Heisenberg Model Ground State
While current noise levels of quantum hardware limit our implementations of VQSD to small problem sizes, we can explore larger problem sizes on a simulator. An important application of VQSD is to study the entanglement in condensed matter systems, and we highlight this application in the following example.

[Figure 3 caption: The cost landscape on a noiseless simulator, Rigetti's noisy simulator, and Rigetti's quantum computer. Error bars show the standard deviation (due to finite sampling) of multiple runs. The local minima occur roughly at the theoretically predicted values of π/2 and 3π/2. During data collection for this plot, the 8Q-Agave quantum computer retuned, after which its cost landscape closely matched that of the noisy simulator.]
Let us consider the ground state of the one-dimensional Heisenberg model, whose Hamiltonian is

$H = \sum_{j=1}^{n} \vec{S}^{(j)} \cdot \vec{S}^{(j+1)} = \sum_{j=1}^{n} \left( S_x^{(j)} S_x^{(j+1)} + S_y^{(j)} S_y^{(j+1)} + S_z^{(j)} S_z^{(j+1)} \right),$

with periodic boundary conditions, $\vec{S}^{(n+1)} = \vec{S}^{(1)}$. Performing entanglement spectroscopy on the ground state |ψ⟩_AB involves diagonalizing the reduced state ρ = Tr_B(|ψ⟩⟨ψ|_AB). Here we consider a total of 8 spins (n = 8). We take A to be a subset of 4 nearest-neighbor spins, and B is the complement of A.
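The exact entanglement spectrum that VQSD targets in this example can be computed by brute force for 8 spins. The following numpy sketch (an exact-diagonalization reference calculation, not the variational algorithm) builds the Heisenberg Hamiltonian, finds the ground state, and extracts the spectrum of the 4-spin reduced state.

```python
import numpy as np
from functools import reduce

def kron_all(ops):
    return reduce(np.kron, ops)

# Spin-1/2 operators (hbar = 1)
I2 = np.eye(2)
Sx = np.array([[0, 1], [1, 0]]) / 2
Sy = np.array([[0, -1j], [1j, 0]]) / 2
Sz = np.array([[1, 0], [0, -1]]) / 2

n = 8  # total spins
H = np.zeros((2 ** n, 2 ** n), dtype=complex)
for j in range(n):
    k = (j + 1) % n  # periodic boundary conditions
    for S in (Sx, Sy, Sz):
        ops = [I2] * n
        ops[j], ops[k] = S, S
        H += kron_all(ops)

# Ground state of the 1D Heisenberg ring
evals, evecs = np.linalg.eigh(H)
psi = evecs[:, 0]

# Reduced state on A = 4 nearest-neighbor spins (the first 4 in this ordering):
# rho_A = Tr_B |psi><psi|, obtained by reshaping the state vector.
na = 4
psi_mat = psi.reshape(2 ** na, 2 ** (n - na))
rho_A = psi_mat @ psi_mat.conj().T

# Entanglement spectrum: eigenvalues of rho_A in decreasing order
spectrum = np.sort(np.linalg.eigvalsh(rho_A).real)[::-1]
```

These exact eigenvalues play the role of the "theoretical values" against which the VQSD results are compared.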
The results of applying VQSD to the 4-spin reduced state ρ via a simulator are shown in Fig. 4. Panel (a) plots the inferred eigenvalues versus the number of layers p in our ansatz (see Fig. 2). One can see that the inferred eigenvalues converge to their theoretical values as p increases. Panel (b) plots the inferred eigenvalues resolved by their associated quantum numbers (z-component of total spin). This shows that our VQSD implementation is getting roughly the correct values for both the eigenvalues and their quantum numbers, particularly for large eigenvalues. Resolving not only the eigenvalues but also their quantum numbers is important for entanglement spectroscopy [21], and clearly VQSD can do this.

A. General Complexity Remarks
We emphasize that VQSD is meant for states ρ that have either low rank or possibly high rank but low entropy H(ρ) = −Tr(ρ log ρ). This is because the eigenvalue readout step of VQSD would be exponentially complex for states with high entropy. In other words, for high entropy states, if one efficiently implemented the eigenvalue readout step (with N readout polynomial in n), then very few eigenvalues would get characterized with the desired precision.
In what follows we discuss some simple examples of states to which one might apply VQSD. There are several aspects of complexity to keep in mind when considering these examples, including: (C1) The gate complexity of the unitary that diagonalizes ρ. (It is worth remarking that approximate diagonalization might be achieved with a less complex unitary than exact diagonalization.) (C2) The complexity of searching through the search space to find the diagonalizing unitary.
(C3) The statistical complexity associated with reading out the eigenvalues.

B. Example States
In the simplest case, suppose ρ = |ψ_1⟩⟨ψ_1| ⊗ · · · ⊗ |ψ_n⟩⟨ψ_n| is a tensor product of pure states. This state can be diagonalized by a depth-one circuit U = U_1 ⊗ · · · ⊗ U_n composed of n one-qubit gates (all done in parallel). Each U_j diagonalizes the associated |ψ_j⟩⟨ψ_j| state. Searching for this unitary within our ansatz can be done by setting p = 1, i.e., with a single layer L_1 shown in Fig. 2. A single layer is enough to find the unitary that exactly diagonalizes ρ in this case. Hence, for this example, both complexities (C1) and (C2) are efficient. Finally, note that the eigenvalue readout, (C3), is efficient because there is only one non-zero eigenvalue. Hence, λ̃_z^est ≈ 1 and ε_z ≈ 1/√N_readout for this eigenvalue. This implies that N_readout can be chosen to be constant, independent of n, in order to accurately characterize this eigenvalue.
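This depth-one diagonalization is easy to exhibit explicitly. The sketch below (our own illustration) builds a random product state and diagonalizes it with a tensor product of one-qubit unitaries, each chosen to rotate its local state |ψ_j⟩ to |0⟩; the helper `local_diagonalizer` is an assumed construction, not taken from the paper.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(5)

# Random one-qubit pure states |psi_j>
n = 3
kets = []
for _ in range(n):
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    kets.append(v / np.linalg.norm(v))

rho = reduce(np.kron, [np.outer(k, k.conj()) for k in kets])

def local_diagonalizer(ket):
    """One-qubit unitary whose first row is <psi_j|, so U_j |psi_j> = |0>."""
    a, b = ket
    return np.array([[a.conj(), b.conj()], [-b, a]])

# Depth-one diagonalizer: U = U_1 x ... x U_n
U = reduce(np.kron, [local_diagonalizer(k) for k in kets])
rho_t = U @ rho @ U.conj().T
off_diag = rho_t - np.diag(np.diag(rho_t))
```

The rotated state ρ̃ is exactly |0...0⟩⟨0...0|, with all off-diagonal elements zero, consistent with the claim that a single ansatz layer suffices here.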
A generalization of product states are classically correlated states, which have the form

$\rho = \sum_{z} p_z\, |b_{z_1}^{(1)}\rangle\langle b_{z_1}^{(1)}| \otimes \cdots \otimes |b_{z_n}^{(n)}\rangle\langle b_{z_n}^{(n)}|,$

where $\{|b_0^{(j)}\rangle, |b_1^{(j)}\rangle\}$ forms an orthonormal basis for qubit j. Like product states, classically correlated states can be diagonalized with a depth-one circuit composed of one-body unitaries. Hence (C1) and (C2) are efficient for such states. However, the complexity of eigenvalue readout depends on the {p_z} distribution; if it has high entropy then eigenvalue readout can scale exponentially.
Finally, we consider pure states of the form ρ = |ψ ψ|. For such states, eigenvalue readout (C3) is efficient because N readout can be chosen to be independent of n, as we noted earlier for the example of pure product states.
Next we argue that the gate complexity of the diagonalizing unitary, (C1), is efficient. The argument is simply that VQSD takes the state ρ as its input, and ρ must have been prepared on a quantum computer. Let V be the unitary that was used to prepare |ψ⟩ = V|0⟩ on the quantum computer. For large n, V must have been efficient to implement, otherwise the state |ψ⟩ could not have been prepared. Note that V†, which is constructed from V by reversing the order of the gates and adjointing each gate, can be used to diagonalize ρ. Because V is efficiently implementable, V† is also efficiently implementable. Hence, ρ can be efficiently diagonalized. A subtlety is that one must compile V† into one's ansatz, such as the ansatz in Fig. 2. Fortunately, the overhead needed to compile V† into our ansatz grows (at worst) only linearly in n. An explicit construction for compiling V† into our ansatz is as follows. Any one-qubit gate directly translates without overhead into our ansatz, while any two-qubit gate can be compiled by using a linear number of swap gates to make the qubits of interest nearest neighbors, then performing the desired two-qubit gate, and finally using a linear number of swap gates to move the qubits back to their original positions.
Let us now consider the complexity (C2) of searching for U . Since there are a linear number of parameters in each layer, and p needs only to grow polynomially in n, then the search space dimensionality grows only polynomially in n. But this does not guarantee that we can efficiently minimize the cost function, since the landscape is non-convex. In general, search complexity for problems such as this remains an open problem. Hence, we cannot make a general statement about (C2) for pure states.

C. Comparison to Literature
Diagonalizing quantum states with classical methods would require exponentially large memory to store the density matrix, and the matrix operations needed for diagonalization would be exponentially costly. VQSD avoids both of these scaling issues.
Another quantum algorithm that extracts the eigenvalues and eigenvectors of a quantum state is qPCA [2]. Similar to VQSD, qPCA has the potential for exponential speedup over classical diagonalization for particular classes of quantum states. Like VQSD, the speedup in qPCA is contingent on ρ being a low-entropy state.
We performed a simple implementation of qPCA to get a sense for how it compares to VQSD (see Appendix for details). In particular, just like we did for Fig. 3, we considered the one-qubit plus state ρ = |+⟩⟨+|. We implemented qPCA for this state on Rigetti's noisy simulator (whose noise is meant to mimic that of their 8Q-Agave quantum computer). The circuit that we implemented applied one controlled-exponential-swap gate (in order to approximately exponentiate ρ, as discussed in [2]). With this circuit we inferred the two eigenvalues of ρ to be approximately 0.8 and 0.2. Hence, for this simple example, it appears that qPCA gave eigenvalues that were slightly off from the true values of 1 and 0, while VQSD was able to obtain the correct eigenvalues, as discussed in Fig. 3.

D. Future Applications
Finally we discuss various applications of VQSD. As noted in Ref. [2], one application of quantum state diagonalization is benchmarking of quantum noise processes, i.e., quantum process tomography. Here one prepares the Choi state by sending half of a maximally entangled state through the process of interest. One can apply VQSD to the resulting Choi state to learn about the noise process, which may be particularly useful for benchmarking near-term quantum computers.
A special case of VQSD is variational state preparation. That is, if one applies VQSD to a pure state ρ = |ψ⟩⟨ψ|, then one can learn the unitary U(α) that maps |ψ⟩ to a standard basis state. Inverting this unitary allows one to map a standard basis state (and hence the state |0⟩^⊗n) to the state |ψ⟩, which is known as state preparation. Hence, if one is given |ψ⟩ in quantum form, then VQSD can potentially find a short-depth circuit that approximately prepares |ψ⟩. (We will explore the special case of variational state preparation in future work.)

In machine learning, PCA is a common subroutine in supervised and unsupervised learning algorithms and also has many direct applications. PCA inputs a data matrix X and finds a new basis such that the variance is maximal along the new basis vectors. One can show that this amounts to finding the eigenvectors of the covariance matrix E[XX^T] with the largest eigenvalues, where E denotes expectation value. Thus PCA is reduced to diagonalizing a positive-semidefinite matrix, E[XX^T]. Hence VQSD can perform this task provided one has access to QRAM [22] to prepare the covariance matrix as a quantum state. PCA can be used to reduce the dimension of X as well as to filter out noise in data. It is a subroutine in multidimensional scaling (a machine learning algorithm that reconstructs an approximate "map" of vectors x_i given only pairwise distances ||x_i − x_j||²_2 between them) as well as the isomap algorithm [31] in manifold learning. By generalizing the Euclidean inner product to a positive semi-definite kernel, nonlinear (kernel) PCA can be used on data that is not linearly separable.
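The classical task that VQSD would replace here is straightforward to state in code. The following numpy sketch (a purely classical reference, with synthetic data of our own choosing) performs PCA by diagonalizing the covariance matrix and projecting onto the top principal components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data matrix X: 500 samples of a 4-dimensional feature vector,
# with most variance concentrated along two directions (low-rank structure).
basis, _ = np.linalg.qr(rng.normal(size=(4, 4)))
scales = np.array([3.0, 1.5, 0.1, 0.05])
X = rng.normal(size=(500, 4)) * scales @ basis.T

# PCA = diagonalizing the (positive semi-definite) covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]       # sort eigenpairs in decreasing order
evals, evecs = evals[order], evecs[:, order]

# Dimensionality reduction: project onto the top-2 principal components.
X_reduced = Xc @ evecs[:, :2]
```

The low-entropy spectrum of the covariance matrix in this toy example mirrors the low-rank regime where VQSD (with QRAM-prepared covariance states) would be most effective.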
Perhaps the most important near-term application of VQSD is to study condensed matter physics. In particular, we propose that one can apply the variational quantum eigensolver [14] to prepare the ground state of a many-body system, and then one can follow this with the VQSD algorithm to characterize the entanglement in this state. Ultimately this approach could elucidate key properties of condensed matter phases. In particular, VQSD allows for entanglement spectroscopy, which has direct application to the identification of topological order [32]. Extracting both the eigenvalues and eigenvectors is useful for entanglement spectroscopy [32], and we illustrated this capability of VQSD in Fig. 4.

A. Diagonalization Test Circuits
Here we elaborate on the cost functions C 1 and C 2 and present short-depth quantum circuits to compute them.

$C_1$ and the DIP Test
The function $C_1$ defined in (6) can be rewritten in terms of the Hilbert-Schmidt distance as
$C_1 = \min_{\sigma \in \mathcal{D}} D_{HS}(\tilde{\rho}, \sigma) = D_{HS}(\tilde{\rho}, Z(\tilde{\rho})) = \sum_{z \neq z'} |\tilde{\rho}_{z,z'}|^2$.
In other words, $C_1$ is (1) the minimum distance between $\tilde{\rho}$ and the set of diagonal states $\mathcal{D}$, (2) the distance from $\tilde{\rho}$ to $Z(\tilde{\rho})$, and (3) the sum of the absolute squares of the off-diagonal elements of $\tilde{\rho}$.
Since the eigenvalues of a density matrix majorize its diagonal elements, $\vec{\lambda} \succ \vec{\tilde{\lambda}}$, and the dot product with an ordered vector is a Schur-convex function, we have $\vec{\lambda} \cdot \vec{\tilde{\lambda}} \geq \vec{\tilde{\lambda}} \cdot \vec{\tilde{\lambda}}$.
Hence from (25) and (26) we obtain the bound
$\sum_z (\lambda_z - \tilde{\lambda}_z)^2 \leq C_1$,
i.e., $C_1$ upper-bounds the eigenvalue error.

For computational purposes, we use the difference-of-purities interpretation of $C_1$ given in (6). The $\mathrm{Tr}(\rho^2)$ term is independent of $U_p(\alpha)$. Hence it only needs to be evaluated once, outside of the parameter optimization loop. It can be computed via the expectation value of the swap operator $S$ on two copies of $\rho$, using the identity
$\mathrm{Tr}(\rho^2) = \mathrm{Tr}((\rho \otimes \rho)\, S)$.
This expectation value is found with a depth-two quantum circuit that essentially corresponds to a Bell-basis measurement, with classical post-processing that scales linearly in the number of qubits [33,34]. This is shown in Fig. 5(a). We call this procedure the Destructive Swap Test, since it is like the Swap Test, but the measurement occurs on the original systems instead of on an ancilla. Similarly, the $\mathrm{Tr}(Z(\tilde{\rho})^2)$ term could be evaluated by first dephasing $\tilde{\rho}$ and then performing the Destructive Swap Test, which would involve a depth-three quantum circuit with linear classical post-processing. This approach was noted in Ref. [35]. However, there exists a simpler circuit, which we call the Diagonalized Inner Product (DIP) Test. The DIP Test involves a depth-one quantum circuit with no classical post-processing. An abstract version of this circuit is shown in Fig. 5(b) for two states $\sigma$ and $\tau$. For our application we will set $\sigma = \tau = \tilde{\rho}$, although we discuss the general case below.
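The swap-operator identity behind the Destructive Swap Test can be checked numerically. The following sketch, with a hypothetical random mixed state, verifies that $\mathrm{Tr}(\rho^2) = \mathrm{Tr}((\rho\otimes\rho)S)$:

```python
import numpy as np

def random_density_matrix(dim, rank, seed=0):
    # Hypothetical helper: a random rank-`rank` density matrix.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(dim, rank)) + 1j * rng.normal(size=(dim, rank))
    rho = a @ a.conj().T
    return rho / np.trace(rho)

def swap_operator(dim):
    # S |i>|j> = |j>|i> on the tensor product of two dim-level systems.
    s = np.zeros((dim * dim, dim * dim))
    for i in range(dim):
        for j in range(dim):
            s[j * dim + i, i * dim + j] = 1.0
    return s

rho = random_density_matrix(4, rank=2)
purity_direct = np.trace(rho @ rho).real
purity_swap = np.trace(np.kron(rho, rho) @ swap_operator(4)).real
```

On a quantum computer the same quantity is obtained from Bell-basis measurement statistics on the two copies, rather than from matrix algebra.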
Let $\sigma$ and $\tau$ be states on the $n$-qubit systems $A$ and $B$, respectively, and let $\omega_{AB} = \sigma \otimes \tau$ denote the initial state. The action of the CNOTs in Fig. 5(b) gives the state
$\omega'_{AB} = \sum_{z,z'} \tau_{z,z'}\, X^{z} \sigma X^{z'} \otimes |z\rangle\langle z'|$,
where the notation $X^{z}$ means $X^{z_1} \otimes X^{z_2} \otimes \cdots \otimes X^{z_n}$. Partially tracing over the $B$ system gives
$\omega'_{A} = \sum_{z} \tau_{z,z}\, X^{z} \sigma X^{z}$,
where $\tau_{z,z} = \langle z|\tau|z\rangle$. The probability of the all-zeros outcome is then
$\langle \vec{0}|\omega'_{A}|\vec{0}\rangle = \sum_{z} \sigma_{z,z}\, \tau_{z,z}$,
which follows because $X^{z}|\vec{0}\rangle = |z\rangle$. Hence the probability of the all-zeros outcome is precisely the diagonalized inner product. Note that in the special case where $\sigma = \tau = \tilde{\rho}$, we obtain the sum of the squares of the diagonal elements, $\sum_z \tilde{\rho}_{z,z}^2 = \mathrm{Tr}(Z(\tilde{\rho})^2)$. In summary, $C_1$ is efficiently computed by using the Destructive Swap Test for the $\mathrm{Tr}(\rho^2)$ term and the DIP Test for the $\mathrm{Tr}(Z(\tilde{\rho})^2)$ term.
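The one-qubit case of this calculation can be simulated directly. In the sketch below (our own ordering conventions, with $B$ as the control register), the all-zeros probability on $A$ reproduces $\sum_z \sigma_{z,z}\tau_{z,z}$:

```python
import numpy as np

def random_density_matrix(dim, seed):
    # Hypothetical helper: a random full-rank density matrix.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho)

sigma = random_density_matrix(2, seed=1)   # state on system A (target)
tau = random_density_matrix(2, seed=2)     # state on system B (control)

# CNOT with B as control and A as target, in the |a, b> product basis.
cnot = np.array([[1, 0, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0]], dtype=complex)

omega = np.kron(sigma, tau)                # initial state sigma (x) tau
omega_after = cnot @ omega @ cnot.conj().T

# Probability of measuring A in |0> (B is traced out implicitly).
proj0_on_A = np.kron(np.diag([1.0, 0.0]), np.eye(2))
p_all_zeros = np.trace(proj0_on_A @ omega_after).real

# The DIP Test prediction: the diagonalized inner product.
dip = sum((sigma[z, z] * tau[z, z]).real for z in range(2))
```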

$C_2$ and the PDIP Test
Like $C_1$, $C_2$ can also be rewritten in terms of the Hilbert-Schmidt distance. Namely, $C_2$ is the average distance of $\tilde{\rho}$ to each locally-dephased state $Z_j(\tilde{\rho})$:
$C_2 = \frac{1}{n}\sum_{j=1}^{n} D_{HS}(\tilde{\rho}, Z_j(\tilde{\rho}))$,
where $Z_j(\cdot) = \sum_{z} (|z\rangle\langle z|_j \otimes 1\!\!1_{k \neq j})(\cdot)(|z\rangle\langle z|_j \otimes 1\!\!1_{k \neq j})$. Naturally, one would expect that $C_2 \leq C_1$, since $\tilde{\rho}$ should be closer to each locally-dephased state than to the fully-dephased state. Indeed this is true, and it can be seen from the purity inequality
$\mathrm{Tr}(Z_j(\tilde{\rho})^2) \geq \mathrm{Tr}(Z(\tilde{\rho})^2)$,
i.e., full dephasing reduces the purity at least as much as local dephasing does. However, $C_1$ and $C_2$ vanish under precisely the same conditions, as noted in Eq. (8). One can see this as follows. The condition $\tilde{\rho} = Z_j(\tilde{\rho})$ for all $j$ corresponds to $C_2 = 0$. This implies that $\tilde{\rho} = (Z_1 \otimes \cdots \otimes Z_n)(\tilde{\rho}) = Z(\tilde{\rho})$, which follows by repeatedly acting with the local dephasing channels. Thus, $C_2$ is faithful, vanishing if and only if $\tilde{\rho}$ is diagonal.

Writing $C_2$ in terms of purities, as in (7), shows how it can be computed on a quantum computer. As in the case of $C_1$, the first term in (7) is computed with the Destructive Swap Test. For the second term in (7), each purity $\mathrm{Tr}(Z_j(\tilde{\rho})^2)$ could also be evaluated with the Destructive Swap Test, by first locally dephasing the appropriate qubit. However, we present a slightly improved circuit to compute these purities that we call the Partially Diagonalized Inner Product (PDIP) Test. The PDIP Test circuit is shown in Fig. 5(c) for the general case of feeding in two distinct states $\sigma$ and $\tau$, with the goal of computing the inner product between $Z_{\vec{j}}(\sigma)$ and $Z_{\vec{j}}(\tau)$. For generality we let $l$, with $0 \leq l \leq n$, denote the number of qubits being locally dephased for this computation. If $l > 0$, we define $\vec{j} = (j_1, \ldots, j_l)$ as a vector of indices that indicates which qubits are being locally dephased.
The PDIP Test is a hybrid of the Destructive Swap Test and the DIP Test, corresponding to the former when $l = 0$ and to the latter when $l = n$; hence, it generalizes both. Namely, the PDIP Test performs the DIP Test circuit on the qubits appearing in $\vec{j}$ and performs the Destructive Swap Test circuit on the qubits not appearing in $\vec{j}$. Let $\overline{j}$ denote the latter qubits, and let $\sigma$ and $\tau$, respectively, be states on the $n$-qubit systems $A = A_{\vec{j}} A_{\overline{j}}$ and $B = B_{\vec{j}} B_{\overline{j}}$. The initial state $\omega_{AB} = \sigma \otimes \tau$ evolves, under the action of the CNOTs associated with the DIP Test and the subsequent tracing over the control systems, to
$\omega' = \sum_{z} (X^{z} \sigma X^{z}) \otimes \mathrm{Tr}_{B_{\vec{j}}}[(|z\rangle\langle z| \otimes 1\!\!1)\, \tau]$,
where $X^{z}$ and $|z\rangle\langle z|$ act non-trivially only on the $\vec{j}$ subsystems of $A$ and $B$, respectively. Measuring system $A_{\vec{j}}$ and obtaining the all-zeros outcome would leave systems $A_{\overline{j}} B_{\overline{j}}$ in the (unnormalized) conditional state
$\sum_{z} \sigma_{z,\vec{j}} \otimes \tau_{z,\vec{j}}$, (35)
where $\sigma_{z,\vec{j}} := \mathrm{Tr}_{A_{\vec{j}}}((|z\rangle\langle z| \otimes 1\!\!1)\sigma)$ and $\tau_{z,\vec{j}} := \mathrm{Tr}_{B_{\vec{j}}}((|z\rangle\langle z| \otimes 1\!\!1)\tau)$. Finally, computing the expectation value of the swap operator (via the Destructive Swap Test) on the state in (35) gives
$\sum_{z} \mathrm{Tr}(\sigma_{z,\vec{j}}\, \tau_{z,\vec{j}}) = \mathrm{Tr}(Z_{\vec{j}}(\sigma)\, Z_{\vec{j}}(\tau))$. (36)
The last equality can be verified by noting that $Z_{\vec{j}}(\sigma) = \sum_{z} |z\rangle\langle z| \otimes \sigma_{z,\vec{j}}$ and $Z_{\vec{j}}(\tau) = \sum_{z} |z\rangle\langle z| \otimes \tau_{z,\vec{j}}$. Specializing (36) to $\sigma = \tau = \tilde{\rho}$ gives the quantity $\mathrm{Tr}(Z_{\vec{j}}(\tilde{\rho})^2)$.
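The purity formulas for $C_1$ and $C_2$, and the ordering $0 \le C_2 \le C_1$ discussed above, admit a quick classical sanity check. The two-qubit state and the dephasing helper below are hypothetical constructions for the example:

```python
import numpy as np

def random_density_matrix(dim, seed):
    # Hypothetical helper: a random full-rank density matrix.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho)

def dephase(rho, qubits, n):
    # Zero out matrix elements whose row/column indices differ on any of
    # the listed qubits, i.e., apply Z_j for each j in `qubits`.
    out = rho.copy()
    dim = 2 ** n
    for r in range(dim):
        for c in range(dim):
            if any(((r >> q) & 1) != ((c >> q) & 1) for q in qubits):
                out[r, c] = 0.0
    return out

n = 2
rho_tilde = random_density_matrix(2 ** n, seed=3)
purity = np.trace(rho_tilde @ rho_tilde).real

z_full = dephase(rho_tilde, range(n), n)
c1 = purity - np.trace(z_full @ z_full).real        # C1 from Eq. (6)

z_purities = [np.trace(dephase(rho_tilde, [j], n) @
                       dephase(rho_tilde, [j], n)).real for j in range(n)]
c2 = purity - np.mean(z_purities)                   # C2 from Eq. (7)
```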

$C_1$ versus $C_2$
Here we discuss the contrasting merits of the functions $C_1$ and $C_2$, hence motivating our cost definition in (9).
As noted previously, $C_1$ has an operational meaning as an upper bound on the eigenvalue error, whereas $C_2$ does not share this interpretation. In addition, the circuit for computing $C_1$ is more efficient than that for $C_2$. The circuit for computing the second term in $C_1$, shown in Fig. 5(b), has a gate depth of one, with $n$ CNOT gates, $n$ measurements, and no classical post-processing. The circuit for computing the second term in $C_2$, shown in Fig. 5(c), has a gate depth of two, with $n$ CNOT gates, $n-1$ Hadamard gates, $2n-1$ measurements, and classical post-processing whose complexity scales linearly in $n$. So in every aspect (gate depth, gate count, number of measurements, and classical post-processing), the circuit for computing $C_1$ is less complex than that for $C_2$, implying that $C_1$ can be computed with greater accuracy than $C_2$ on a noisy quantum computer.
On the other hand, consider how the landscapes of $C_1$ and $C_2$ scale with $n$. As a simple example, suppose $\rho = |0\rangle\langle 0| \otimes \cdots \otimes |0\rangle\langle 0|$ is a tensor product of $n$ copies of the $|0\rangle\langle 0|$ state. Suppose one takes a single-parameter ansatz for $U$, such that $U(\theta) = R_X(\theta) \otimes \cdots \otimes R_X(\theta)$, where $R_X(\theta)$ is a rotation about the $X$-axis of the Bloch sphere by angle $\theta$. For this example,
$C_1 = 1 - x(\theta)^n$,
where $x(\theta) = \mathrm{Tr}(Z(R_X(\theta)|0\rangle\langle 0|R_X(\theta)^{\dagger})^2) = (1 + \cos^2\theta)/2$. If $\theta$ is not an integer multiple of $\pi$, then $x(\theta) < 1$, and $x(\theta)^n$ is exponentially suppressed for large $n$. In other words, for large $n$, the landscape of $x(\theta)^n$ becomes similar to that of a delta function: it is zero for all $\theta$ except multiples of $\pi$. Hence, for large $n$, it becomes difficult to train the unitary $U(\theta)$ because the gradient vanishes for most $\theta$. This is just an illustrative example, but the issue is general. Generally speaking, for large $n$, the function $C_1$ has a sharp gradient near its global minima, and its gradient vanishes when one is far away from these minima. Ultimately this limits $C_1$'s utility as a training function for large $n$.
In contrast, the function $C_2$ does not suffer from this issue. For the example in the previous paragraph, we have
$C_2 = 1 - x(\theta)$,
which is independent of $n$. So for this example the gradient of $C_2$ does not vanish as $n$ increases, and hence $C_2$ can be used to train the parameter $\theta$. More generally, the landscape of $C_2$ is less barren than that of $C_1$ for large $n$. We can argue this, in particular, for states $\rho$ that have low rank or low entropy. The second term in (7), which is the term that provides the variability with $\alpha$, does not vanish even for large $n$, since (as shown in Appendix D)
$\frac{1}{n}\sum_{j=1}^{n} \mathrm{Tr}(Z_j(\tilde{\rho})^2) \geq \frac{1}{2}\,\mathrm{Tr}(\rho^2) \geq \frac{1}{2}\, 2^{-H(\rho)} \geq \frac{1}{2r}$.
Here, $H(\rho) = -\mathrm{Tr}(\rho \log_2 \rho)$ is the von Neumann entropy of $\rho$, and $r$ is the rank of $\rho$. So as long as $\rho$ has low entropy or low rank, the second term in $C_2$ will not vanish. Note that a similar bound does not exist for the second term in $C_1$, which does tend to vanish for large $n$.
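The two landscape formulas for the product-state example can be compared numerically. In the sketch below (the angle and system sizes are arbitrary choices), the finite-difference gradient of $C_1 = 1 - x(\theta)^n$ collapses with growing $n$ while $C_2 = 1 - x(\theta)$ is unaffected:

```python
import numpy as np

def x_of_theta(t):
    # Single-qubit dephased purity x(theta) = (1 + cos^2 theta) / 2.
    return (1 + np.cos(t) ** 2) / 2

def c1(theta, n):
    return 1 - x_of_theta(theta) ** n

def c2(theta, n):
    return 1 - x_of_theta(theta)      # independent of n

def grad_c1(theta, n, h=1e-6):
    # Central finite-difference derivative of C1 with respect to theta.
    return (c1(theta + h, n) - c1(theta - h, n)) / (2 * h)

theta = np.pi / 3                     # a generic angle (not a multiple of pi)
grad_small_n = grad_c1(theta, 1)
grad_large_n = grad_c1(theta, 50)     # exponentially suppressed in n
```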
Due to these issues, we proposed for our cost $C$ in (9) a weighted average of $C_1$ and $C_2$. This allows one to weight $C_1$ more heavily for small $n$ and $C_2$ more heavily for large $n$.

B. Optimization Methods
Finding $\alpha_{\rm opt}$ in (5) is a major component of VQSD. While many works have benchmarked classical optimization algorithms (e.g., Ref. [36]), work on the particular case of optimization for variational hybrid algorithms [37] is limited and deserves further attention [38]. Both gradient-based and gradient-free methods are possible, but gradient-based methods may not work as well with noisy data. Additionally, Ref. [39] notes that the gradients of a large class of circuit ansätze vanish when the number of parameters becomes large. These and other issues (e.g., sensitivity to initial conditions, number of function evaluations) should be considered when choosing an optimization method.
In our preliminary numerical analyses (see Appendix B), we found that the Powell optimization algorithm [40] performed the best on both quantum-computer and simulator implementations of VQSD. This derivative-free algorithm uses a bi-directional search along each parameter via Brent's method. Our studies showed that Powell's method performed the best in terms of convergence, sensitivity to initial conditions, and number of correct solutions found. The implementation of Powell's algorithm used in this paper can be found in the open-source Python package SciPy Optimize [41].

ACKNOWLEDGMENTS

We thank Rigetti for providing access to their quantum computer. The views expressed in this paper are those of the authors and do not reflect those of Rigetti. RL, EO, and AT acknowledge support from the U.S. Department of Energy through a quantum computing program sponsored by the LANL.

Appendix A: Implementation Details

Here we provide further details on our implementations of VQSD in Sec. II B.
First, we discuss our implementation on a quantum computer (data shown in Fig. 3). Figure 6 displays the circuit used for this implementation, which is logically divided into three sections. First, we prepare two copies of the plus state $\rho = |+\rangle\langle+| = H|0\rangle\langle 0|H$ by applying a Hadamard gate $H$ to each qubit. Next, we implement one layer of a unitary ansatz, namely $U(\theta) = R_z(\theta)R_x(\pi/2)$. This ansatz was chosen because each gate can be natively implemented on Rigetti's quantum computer. To simplify the search space, we restricted to one parameter instead of a universal one-qubit unitary. Last, we implement the DIP Test circuit, described in Fig. 5, which here consists of only one CNOT gate and one measurement.
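A noiseless classical emulation of this one-qubit experiment is straightforward. In the sketch below we read $U(\theta) = R_z(\theta)R_x(\pi/2)$ in circuit order ($R_z(\theta)$ applied first, so the matrix product is $R_x(\pi/2)R_z(\theta)$), and an exhaustive parameter scan stands in for the Powell optimizer used in the experiment:

```python
import numpy as np

def rx(t):
    return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                     [-1j * np.sin(t / 2), np.cos(t / 2)]])

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

# The experiment prepares two copies of rho = |+><+|; classically we
# only need the density matrix itself.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus.conj())

def cost_c1(theta):
    # C1 = Tr(rho^2) - Tr(Z(rho_tilde)^2), with Tr(rho^2) = 1 (pure state).
    u = rx(np.pi / 2) @ rz(theta)          # Rz(theta) first, then Rx(pi/2)
    rho_tilde = u @ rho @ u.conj().T
    return 1.0 - sum(abs(rho_tilde[z, z]) ** 2 for z in range(2))

thetas = np.linspace(0, 2 * np.pi, 721)    # coarse scan over the parameter
theta_opt = min(thetas, key=cost_c1)

# Read off the inferred eigenvalues from the diagonal of the rotated state.
u = rx(np.pi / 2) @ rz(theta_opt)
rho_diag = u @ rho @ u.conj().T
eigenvalue_estimates = sorted((rho_diag[0, 0].real, rho_diag[1, 1].real),
                              reverse=True)
```

The scan finds a diagonalizing angle (e.g., $\theta = \pi/2$) at which the cost vanishes and the eigenvalue estimates approach $\{1, 0\}$, matching the noiseless expectation for a pure state.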
For the parameter optimization loop, we used the Powell algorithm mentioned in Sec. IV B. This algorithm found the minimum cost in fewer than ten objective-function evaluations on average. Each objective-function evaluation (i.e., call to the quantum computer) sampled from 10,000 runs of the circuit in Fig. 6. As can be seen in Fig. 3(b), 10,000 runs were sufficient to estimate the cost function (9) accurately and with small variance. Because the problem size was small, we took $q = 1$ in (9), which provided adequate variability in the cost landscape.
Because of the noise levels in current quantum computers, we limited VQSD implementations on quantum hardware to only one-qubit states. Noise affects the computation in multiple areas. For example, in state preparation, qubit-specific errors can cause the two copies of ρ to actually be different states. Subsequent gate errors (notably two-qubit gates), decoherence, and measurement errors prevent the cost from reaching zero even though the optimal value of θ is obtained.
Next, we discuss our VQSD implementation on a simulator (data shown in Fig. 4). For this implementation we again chose $q = 1$ in our cost function. Because of the larger problem size (diagonalizing a 4-qubit state), we employed multiple layers in our ansatz, up to $p = 5$. The simulator directly calculated the measurement probability distribution in the DIP Test, as opposed to determining the desired probability via sampling. This allowed us to use a gradient-based method to optimize our cost function, reducing the overall runtime of the optimization. Hence, our simulator implementation for the Heisenberg model demonstrated a future application of VQSD while alleviating the optimization bottleneck that is present for all variational quantum algorithms on large problem sizes, an area that needs further research [38]. We explore optimization methods further in Appendix B.

FIG. 7. Optimization tests on six-qubit product states in the VQSD algorithm. Each plot shows a different optimization algorithm (described in the main text), and the curves on each plot show optimization attempts with different (random) initial conditions. Cost refers to the $C_1$ cost function ($q = 1$ in (9)), and each iteration is defined by a decrease in the cost function. As can be seen, the Powell algorithm is the most robust to initial conditions and provides the largest number of solved problem instances.

Appendix B: Comparison of Optimization Methods
As emphasized previously, numerical optimization plays a key role in all variational hybrid algorithms, and further research on optimization methods is needed. In VQSD, the accuracy of the inferred eigenvalues is closely tied to the performance of the optimization algorithm used in the parameter optimization loop. This issue becomes increasingly important at larger problem sizes (large $n$), where the number of parameters in the diagonalizing unitary becomes large.
Here, we compare the performance of six different optimization algorithms when used inside the parameter optimization loop of VQSD. These include Powell's algorithm [40], Constrained Optimization BY Linear Approximation (COBYLA) [42], Bound Optimization BY Quadratic Approximation (BOBYQA) [43], Nelder-Mead [44], Broyden-Fletcher-Goldfarb-Shanno (BFGS) [45], and conjugate gradient (CG) [45]. As mentioned in the main text, Powell's algorithm is a derivative-free optimizer that uses a bi-directional search along each parameter. The COBYLA and BOBYQA algorithms are both trust region or restricted-step methods, which approximate the objective function by a model function. The region where this model function is a good approximation is known as the trust region, and at each step the optimizer attempts to expand the trust region. The Nelder-Mead algorithm is a simplex method useful for derivative-free optimization with smooth objective functions. Lastly, the BFGS and CG algorithms are both gradient-based. The BFGS method is a quasi-Newton method that uses first derivatives only, and the CG method uses a nonlinear conjugate gradient descent. The implementations used in our study can be found in the open-source Python package SciPy Optimize [41] and in Ref. [46].
For this study, we take the input state $\rho$ to be a six-qubit pure product state, prepared from $|0\rangle^{\otimes 6}$ by a state-preparation unitary composed of single-qubit rotations whose angles $(\alpha_z)$ are randomly chosen. Using each algorithm, we attempt to minimize the cost by adjusting 36 parameters in one layer of the unitary ansatz in Fig. 2. For fairness of comparison, only the objective function and the initial starting point were input to each algorithm; i.e., no special options such as constraints, bounds, or other information were provided. The results of this study are shown in Fig. 7 and Table I. Figure 7 shows cost versus iteration for each of the six algorithms. Here, we define one iteration as a call to the objective function in which the cost decreases. In particular, the number of iterations is different from the number of cost-function evaluations (see Table I), which is not set a priori but rather determined by the optimizer. Plotting the cost at every function evaluation would essentially produce a noisy curve, since the optimizer tries many values of the parameters; instead, we only plot the cost at each parameter update for which the cost decreases. The number of iterations, the number of function evaluations, and the overall runtime are all important features of an optimizer.
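The distinction between raw function evaluations and cost-decreasing "iterations" can be seen with a toy objective. The sketch below (a hypothetical quadratic stand-in for the VQSD cost) counts both while running SciPy's implementation of Powell's method:

```python
import numpy as np
from scipy.optimize import minimize

# Counters: total objective evaluations, and "iterations" in the sense
# used above (evaluations at which the best-so-far cost decreases).
n_evals = [0]
n_iterations = [0]
best_cost = [np.inf]

def cost(x):
    # Toy quadratic stand-in for the VQSD cost, minimized at x = 0.5.
    n_evals[0] += 1
    c = float(np.sum((x - 0.5) ** 2))
    if c < best_cost[0]:
        best_cost[0] = c
        n_iterations[0] += 1
    return c

result = minimize(cost, x0=np.zeros(4), method='Powell')
```

For a derivative-free method like Powell's, the evaluation count typically far exceeds the iteration count, since most trial points during the line searches do not improve on the best cost so far.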
In this study, as well as in others, we found that the Powell optimization algorithm provides the best performance in terms of the lowest minimum cost achieved, sensitivity to initial conditions, and fraction of correct solutions found. The trust-region algorithms COBYLA and BOBYQA were the next best methods. In particular, although the Powell algorithm consistently obtained lower minimum costs, the COBYLA method ran thirteen times faster on average (see Table I). Indeed, both trust-region methods provided the shortest runtimes. The gradient-based methods BFGS and CG had comparable run-times but were unable to find any minima. Similarly, the Nelder-Mead simplex algorithm was unable to find any minima; this method also had the longest average run-time of all algorithms tested.

TABLE I. Run-times of the optimization algorithms compared in Fig. 7, relative to COBYLA, which took the least time of all algorithms; for example, BOBYQA took 2.32 times as long to run on average as COBYLA. Absolute run-times depend on a variety of factors and on computer performance; for reference, the COBYLA algorithm takes approximately one minute for this problem on a laptop computer. The number of cost-function evaluations used (related to run-time but also dependent on the method used by the optimizer) is shown in the second row.

This preliminary analysis suggests that the Powell algorithm is the best method for VQSD. For other variational quantum algorithms, this may not necessarily be the case. In particular, we emphasize that the optimization landscape is determined by both the unitary ansatz and the cost-function definition, which may vary drastically between algorithms. While we found that gradient-based methods did not perform well for VQSD, they may work well for other applications. Additionally, optimizers that we have not considered here may provide better performance. We leave these questions to future work.

Appendix C: Quantum Principal Component Analysis (qPCA)

1. Review of qPCA

Let $\sigma$ and $\rho^{\otimes k}$ be states on systems $A$ and $B = B_1 \cdots B_k$, respectively. Then one performs the transformation
$\sigma \to \mathrm{Tr}_B[W(\sigma \otimes \rho^{\otimes k})W^{\dagger}]$, (C1)
where
$W = U_{AB_k} \cdots U_{AB_1}$, and $U_{JK} = e^{-iS_{JK}\Delta t}$. (C2)
The resulting reduced state is approximately
$V(t)\,\sigma\,V(t)^{\dagger}$, with $V(t) = e^{-i\rho t}$, (C3)
where $t = k\Delta t$. Finally, by turning each $U_{JK}$ in (C2) into a controlled operation,
$|0\rangle\langle 0| \otimes 1\!\!1 + |1\rangle\langle 1| \otimes U_{JK}$, (C4)
and hence making $W$ controlled, one can construct an approximation of the controlled-$V(t)$ gate, $C\text{-}V(t)$. If one chooses the input state for quantum phase estimation to be $\rho = \sum_z \lambda_z |v_z\rangle\langle v_z|$ itself, then the final state becomes
$\sum_z \lambda_z |v_z\rangle\langle v_z| \otimes |\tilde{\lambda}_z\rangle\langle\tilde{\lambda}_z|$, (C5)
where $\tilde{\lambda}_z$ is a binary representation of an estimate of the corresponding eigenvalue $\lambda_z$. One can then sample from the state in (C5) to characterize the eigenvalues and eigenvectors. The approximation of $V(t)$ in (C3) can be achieved with accuracy $\epsilon$ provided that one uses $O(t^2 \epsilon^{-1})$ copies of $\rho$. The time $t_{\max}$ needed for quantum phase estimation to achieve accuracy $\epsilon$ is $t_{\max} = O(\epsilon^{-1})$. Hence, with qPCA, the eigenvalues and eigenvectors can be obtained with accuracy $\epsilon$ provided that one uses $O(\epsilon^{-3})$ copies of $\rho$.
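The core primitive of qPCA, density-matrix exponentiation via the swap operator, can be checked to first order in $\Delta t$ with a short classical sketch (one-qubit states and a hypothetical time step):

```python
import numpy as np

def expm_hermitian(h, t):
    # e^{-i h t} for a Hermitian matrix h, via eigendecomposition.
    vals, vecs = np.linalg.eigh(h)
    return vecs @ np.diag(np.exp(-1j * vals * t)) @ vecs.conj().T

def ptrace_second(m):
    # Partial trace over the second qubit of a two-qubit operator.
    return np.einsum('ijkj->ik', m.reshape(2, 2, 2, 2))

def random_density_matrix(seed):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    rho = a @ a.conj().T
    return rho / np.trace(rho)

# Swap operator S on two qubits: S|a>|b> = |b>|a>.
S = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=complex)

sigma = random_density_matrix(seed=0)
rho = random_density_matrix(seed=1)
dt = 0.01

# One step of qPCA: Tr_B[ e^{-i S dt} (sigma (x) rho) e^{+i S dt} ].
u = expm_hermitian(S, dt)
stepped = ptrace_second(u @ np.kron(sigma, rho) @ u.conj().T)

# Target evolution e^{-i rho dt} sigma e^{+i rho dt}; agreement is O(dt^2).
v = expm_hermitian(rho, dt)
target = v @ sigma @ v.conj().T
error = np.max(np.abs(stepped - target))
```

Repeating this step $k$ times with $\Delta t = t/k$ builds up the total evolution $V(t)$, which is why the copy count in (C1)-(C3) grows with the desired accuracy.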
2. Our Implementation of qPCA

Figure 8 shows our strategy for implementing qPCA on an arbitrary one-qubit state $\rho$. The circuit shown corresponds to the quantum phase estimation algorithm with one bit of precision (i.e., one ancilla qubit). A Hadamard gate is applied to the ancilla qubit, which then acts as the control system for the $C\text{-}V(t)$ gate, and finally the ancilla is measured in the $x$-basis. The $C\text{-}V(t)$ gate is approximated (as discussed above) with $k$ applications of the controlled-exponential-swap gate.
To implement qPCA, the controlled-exponential-swap gate in (C4) must be compiled into one- and two-body gates. For this purpose, we used the machine-learning approach from Ref. [34] to obtain a short-depth gate sequence for controlled-exponential-swap. The gate sequence we obtained is shown in Fig. 8 and involves 7 CNOTs and 8 one-qubit gates. Most of the one-qubit gates are $z$-rotations and hence are error-free (implemented via a clock change), including the following gates:
$u_1 = u_5 = u_7 = R_z(-(\pi + \Delta t)/2)$, (C6)
$u_3 = R_z((\pi - \Delta t)/2)$, (C7)
$u_4 = R_z(\Delta t/2)$, (C8)
$u_8 = R_z(\pi/2)$. (C9)
The remaining one-qubit gates, which are not $z$-rotations, are
$u_2 = u_6 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ e^{-i(\pi-\Delta t)/2} & e^{i(\pi+\Delta t)/2} \end{pmatrix}$. (C10)

FIG. 8. Circuit for our qPCA implementation. Here, the eigenvalues of a one-qubit pure state $\rho$ are estimated to a single digit of precision. We use $k$ copies of $\rho$ to approximate $C\text{-}V(t)$ by applying the controlled-exponential-swap operator $k$ times for a time period $\Delta t = t/k$. The bottom panel shows our compilation of the controlled-exponential-swap gate into one- and two-qubit gates.

We implemented the circuit in Fig. 8 using both Rigetti's noiseless simulator, known as the Quantum Virtual Machine (QVM), and their noisy QVM, which utilizes a noise model of their 8Q-Agave chip. Because the latter is meant to mimic the noise in the 8Q-Agave chip, our qPCA results on the noisy QVM can be compared to our VQSD results on the 8Q-Agave chip in Fig. 3. (We remark that lack of availability prevented us from implementing qPCA on the actual 8Q-Agave chip.) For our implementation, we chose the one-qubit plus state, $\rho = |+\rangle\langle+|$. Implementations were carried out using both one and two controlled-exponential-swap gates, corresponding to $k = 1$ and $k = 2$. The time $t$ for which the unitary $e^{-i\rho t}$ was applied was gradually increased. Figure 9 shows the raw data, i.e., the largest inferred eigenvalue versus $t$. In each case, small values of $t$ gave more accurate eigenvalues. In the noiseless case, the eigenvalues of $\rho = |+\rangle\langle+|$ were correctly estimated to be $\approx \{1, 0\}$ already for $k = 1$, and consequently also for $k = 2$. In the noisy case, the eigenvalues were estimated to be $\approx \{0.8, 0.2\}$ for $k = 1$ and $\approx \{0.7, 0.3\}$ for $k = 2$, where we have taken the values for small $t$. Table II summarizes the different cases.
Already for the case of $k = 1$, the required resources of qPCA (3 qubits and 7 CNOT gates) for estimating the eigenvalues of an arbitrary pure one-qubit state $\rho$ are higher than those of the DIP Test (2 qubits and 1 CNOT gate) for the same task. Consequently, the DIP Test yields more accurate results, as can be observed by comparing Fig. 3 to Fig. 9. Increasing the number of copies to $k = 2$ only decreases the accuracy of the estimation, since the $C\text{-}V(t)$ gate is already well approximated for short application times $t$ when $k = 1$. Thus, increasing the number of copies offers no improvement in the noiseless case, and instead leads to poorer estimation performance in the noisy case, as seen for $k = 2$ (see Fig. 9 and Table II), due to the doubled number of required CNOT gates relative to $k = 1$.