Introduction

Like any other processor, the behavior of a quantum information processor must be characterized, verified, and certified. Quantum state tomography (QST) is one of the main tools for that purpose.1 Yet, it is generally an inefficient procedure, since the number of parameters that specify quantum states grows exponentially with the number of sub-systems. This inefficiency has two practical manifestations: (i) without any prior information, a vast number of data points needs to be collected;1 (ii) once the data is gathered, a numerical procedure should be executed on an exponentially high-dimensional space, in order to infer the quantum state that is most consistent with the observations. Thus, to perform QST on steadily growing quantum processors,2,3 we must introduce novel and efficient techniques for its completion.

Recent advances4,5,6 simplify QST by including the premise that, often, our aim is to coherently manipulate pure quantum states (i.e., states that can be equivalently described with rank-1, positive semi-definite (PSD) density matrices). The use of such prior information is the modus operandi toward making QST more manageable, with respect to the amount of data required.

Compressed sensing (CS)7 – and its extension to low-rank approximation8– has been applied to QST6,9,10 within this context. Particularly, Gross et al.6 prove that convex programming guarantees robust estimation of pure n-qubit states from much less information than common approaches require, with overwhelming probability.

These advances, however, leave open the question of how efficiently one can estimate exponentially large quantum states from a limited set of observations. Since convex programming is amenable to provable performance guarantees, typical QST protocols rely on convex programs.4,6,9 Nevertheless, their weakness remains their high computational and storage complexity. In particular, due to the PSD nature of density matrices, a key step in convex programs is the repeated application of Hermitian eigensolvers. Such solvers include the well-established family of Lanczos methods,11,12,13 the Jacobi-Davidson SVD type of methods,14 as well as preconditioned hybrid schemes,15 among others. Since most convex programs require a full eigenvalue decomposition at least once per iteration, eigensolvers contribute a \({\cal O}((2^n)^3)\) computational complexity, where n is the number of qubits of the quantum system. The recurrent application of such eigensolvers makes convex programs impractical, even for quantum systems with a relatively small number n of qubits.6,16

Ergo, to improve the efficiency of QST, and of CS QST in particular, we need to complement it with numerical algorithms that can handle large search spaces using a limited amount of data, while retaining rigorous performance guarantees. This is the purpose of this work. Inspired by recent advances on finding the global minimum in non-convex problems,17,18,19,20,21,22,23,24 we propose the application of alternating gradient descent to CS QST, operating directly on the assumed low-rank structure of the density matrix. The algorithm – named Projected Factored Gradient Descent (ProjFGD) – shows significant improvements in QST problems (both in accuracy and efficiency), compared with state-of-the-art approaches; our numerical experiments justify this behavior.

More crucially, we prove that, despite being a non-convex program, under mild conditions the algorithm is guaranteed to converge to the global minimum of the QST problem. In general, finding the global minimum of a non-convex problem is hard. Our approach, however, assumes certain regularity conditions – which are satisfied in practice by common CS-inspired protocols4,6,9 – and a good initialization, which we make explicit in the text; both lead to a fast and provable estimation of the state of the system, even with a limited amount of data.

Results

QST setup

We begin by describing the problem of QST. We focus here on QST of a low-rank n-qubit state, \(\rho_\star\), from measured expectation values of n-qubit Pauli observables \(\left\{ {P_i} \right\}_{i = 1}^m\). We denote by \(y\, \in \,{\Bbb R}^m\) the measurement vector with elements \(y_i = \frac{{2^n}}{{\sqrt m }}{\mathrm{Tr}}\left( {P_i \cdot \rho _ \star } \right) + e_i,\,i = 1, \ldots ,m\), for some measurement error ei. The normalization \(\frac{{2^n}}{{\sqrt m }}\) is chosen to follow the results of Liu.25 For brevity, we denote by \({\cal M}:{\Bbb C}^{2^n \times 2^n} \to {\Bbb R}^m\) the linear “sensing” map, such that \(\left( {{\cal M}\left( \rho \right)} \right)_i = \frac{{2^n}}{{\sqrt m }}{\mathrm{Tr}}\left( {P_i \cdot \rho } \right)\), for \(i = 1, \ldots ,m\).
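As a concrete illustration of this setup, the following minimal sketch (our addition, not from the paper) samples random n-qubit Pauli observables and forms the measurement vector y for a pure state; all names (random_pauli, sensing_map, etc.) are our own, and the small values of n, m, and σ are chosen purely for illustration.

```python
import numpy as np

# Single-qubit Paulis: identity, sigma_x, sigma_y, sigma_z.
PAULIS = [np.eye(2, dtype=complex),
          np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def random_pauli(n, rng):
    """Sample an n-qubit Pauli observable P = s_1 x ... x s_n."""
    P = np.array([[1.0 + 0j]])
    for _ in range(n):
        P = np.kron(P, PAULIS[rng.integers(4)])
    return P

def sensing_map(paulis, rho, m, n):
    """(M(rho))_i = (2^n / sqrt(m)) * Tr(P_i rho)."""
    scale = 2**n / np.sqrt(m)
    return np.array([scale * np.real(np.trace(P @ rho)) for P in paulis])

rng = np.random.default_rng(0)
n, m, sigma = 3, 40, 0.05                      # toy sizes for illustration
paulis = [random_pauli(n, rng) for _ in range(m)]
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
rho_star = np.outer(psi, psi.conj()) / np.linalg.norm(psi)**2  # pure state
y = sensing_map(paulis, rho_star, m, n) + sigma * rng.normal(size=m)
```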

An n-qubit Pauli observable is given by \(P = \otimes _{j = 1}^ns_j\), where \(s_j\, \in \,\left\{ {1,\sigma _x,\sigma _y,\sigma _z} \right\}\). There are \(4^n\) such observables in total. In general, one needs the expectation values of all \(4^n\) Pauli observables to uniquely reconstruct \(\rho_\star\). However, since by our assumption \(\rho_\star\) is a low-rank quantum state, we can apply the CS result,6,25 which guarantees a robust estimation, with high probability, from the measured expectation values of just \(m = {\cal O}\left( {r2^nn^6} \right)\) randomly chosen Pauli observables, where \(r \ll 2^n\) is the rank of \(\rho_\star\).

The key property used to achieve this is the restricted isometry property:25

Definition 1 (Restricted Isometry Property (RIP) for Pauli measurements)

Let \({\cal M}:{\Bbb C}^{2^n \times 2^n} \to {\Bbb R}^m\) be a linear map, such that \(\left( {{\cal M}\left( \rho \right)} \right)_i = \frac{{2^n}}{{\sqrt m }}{\mathrm{Tr}}\left( {P_i \cdot \rho } \right)\), for \(i = 1, \ldots ,m\). Then, with high probability over the choice of \(m = \frac{c}{{\delta _r^2}} \cdot \left( {r2^nn^6} \right)\) Pauli observables Pi, where c > 0 is an absolute constant, \({\cal M}\) satisfies the r-RIP with constant δr, 0 ≤ δr < 1; i.e.,

$$\left( {1 - \delta _r} \right)\left\Vert \rho \right\Vert_F^2 \le \left\Vert {{\cal M}\left( \rho \right)} \right\Vert_2^2 \le \left( {1 + \delta _r} \right)\left\Vert \rho \right\Vert_F^2,$$

where \(\left\Vert \cdot \right\Vert_F\) denotes the Frobenius norm, holds \(\forall \rho \in {\Bbb C}^{2^n \times 2^n}\) such that \({\mathrm{rank}}\left( \rho \right) \le r\).

An accurate estimation of \(\rho_\star\) is obtained by solving, essentially, a convex optimization problem over the set of quantum states,9 consistent with the measured data. Among the various problem formulations for QST, two convex-program examples are the trace-minimization program, typically studied in the context of CS QST:

$$\begin{array}{ll} \mathop {{\mathrm{minimize}}}\limits_{\rho \in {\Bbb C}^{2^n \times 2^n}} & {\mathrm{Tr}}\left( \rho \right) \\ {\mathrm{subject}}\,{\mathrm{to}} & \rho \succcurlyeq 0, \\ & \left\Vert {y - {\cal M}\left( \rho \right)} \right\Vert_2 \le \varepsilon , \end{array}$$
(1)

and the least-squares program,

$$\begin{array}{ll} \mathop {{\mathrm{minimize}}}\limits_{\rho \in {\Bbb C}^{2^n \times 2^n}} & \frac{1}{2} \cdot \left\Vert {y - {\cal M}\left( \rho \right)} \right\Vert_2^2 \\ {\mathrm{subject}}\,{\mathrm{to}} & \rho \succcurlyeq 0, \\ & {\mathrm{Tr}}\left( \rho \right) \le 1, \end{array}$$
(2)

which is closely related to the (negative) log-likelihood minimization under a Gaussian noise assumption. The constraint \(\rho \succcurlyeq 0\) captures the positive semi-definiteness assumption, \(\left\Vert {\, \cdot \,} \right\Vert_2\) is the vector Euclidean \(\ell _2\)-norm, and ε > 0 is a parameter related to the error level in the model. Key in both programs is the combination of the PSD constraint and the trace term: combined, they constitute the tightest convex relaxation of the low-rank, PSD structure of the unknown \(\rho_\star\); see also Recht et al.26 The constraint \({\mathrm{Tr}}\left( \rho \right) = 1\) is relaxed to \({\mathrm{Tr}}\left( \rho \right) \le 1\) in Eq. (2) to allow more robustness to noise, following Kalev et al.9 The solutions of these programs should be normalized to unit trace in order to represent quantum states. We note that if \({\cal M}\) corresponds to a positive-operator valued measure (POVM), or includes the identity operator, then the explicit trace constraint is redundant.
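For reference, the following hedged sketch sets up the trace-minimization program of Eq. (1) with the cvxpy Python package (our illustration; the paper's experiments use the Matlab wrapper CVX instead). It reuses paulis, y, m, and n from the sensing sketch above; the value of eps is arbitrary.

```python
import cvxpy as cp
import numpy as np

d = 2**n
rho = cp.Variable((d, d), hermitian=True)       # optimization variable
scale = 2**n / np.sqrt(m)
# (M(rho))_i as cvxpy expressions, stacked into a length-m vector.
M_rho = cp.hstack([scale * cp.real(cp.trace(P @ rho)) for P in paulis])
eps = 0.1                                       # error-level parameter
prob = cp.Problem(cp.Minimize(cp.real(cp.trace(rho))),
                  [rho >> 0, cp.norm(y - M_rho, 2) <= eps])
prob.solve(solver=cp.SCS)
rho_hat = rho.value / np.real(np.trace(rho.value))  # normalize to unit trace
```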

As discussed in the introduction, the problem with convex programs, such as Eqs. (1) and (2), is their inefficiency in high-dimensional systems: most practical solvers for Eqs. (1) and (2) are iterative, and handling the PSD constraint adds an immense complexity overhead per iteration, especially when n is large.

In this work, we propose to use non-convex programming for QST of low-rank density matrices; we show in practice that it leads to higher efficiency than typical convex programs. We achieve this by restricting the optimization to the intrinsic non-convex structure of rank-r PSD matrices. This allows us to “describe” a \(2^n \times 2^n\) PSD matrix with only \({\cal O}{(2^nr)}\) space, as opposed to the \({\cal O}((2^n)^2)\) ambient space. Even more substantially, our program has theoretical guarantees of global convergence, similar to those of convex programming, while maintaining faster performance. These properties make our scheme ideal for complementing the CS methodology for QST in practice.

Projected factored gradient descent algorithm

Optimization criterion recast

At its basis, the Projected Factored Gradient Descent (ProjFGD) algorithm transforms convex programs, such as those in Eqs. (1)–(2), by enforcing the factorization of a d × d PSD matrix ρ as \(\rho = AA^\dagger\), where \(d = 2^n\). This factorization, popularized by Burer and Monteiro27 for solving semi-definite convex programming instances, naturally encodes the PSD constraint, removing the expensive eigen-decomposition projection step. For concreteness, we focus here on the convex program in Eq. (2). To encode the trace constraint, ProjFGD enforces an additional constraint on A: the requirement Tr(ρ) ≤ 1 translates to the convex constraint \(\left\Vert A \right\Vert_F^2 \le 1\), where \(\left\Vert {\, \cdot \,} \right\Vert_F\) is the Frobenius norm. The above recasts the program in Eq. (2) as a non-convex program:

$$\begin{array}{ll} \mathop {{\mathrm{minimize}}}\limits_{A \in {\Bbb C}^{d \times r}} & f\left( {AA^\dagger } \right): = {\textstyle{1 \over 2}} \cdot \left\Vert {y - {\cal M}\left( {AA^\dagger } \right)} \right\Vert_2^2 \\ {\mathrm{subject}}\,{\mathrm{to}} & \left\Vert A \right\Vert_F^2 \le 1. \end{array}$$
(3)

Given \({\mathrm{rank}}(\rho_\star) = r\), the programs in Eqs. (2) and (3) are equivalent, in the sense that the optimal value of Eq. (2) is identical to that of Eq. (3) through the relation \(\rho = AA^\dagger\); however, the program in Eq. (3) might have additional local solutions. Further, while the constraint set is convex, the objective is no longer convex, due to the bilinear transformation of the parameter space \(\rho = AA^\dagger\). Such criteria have been studied recently in machine learning and signal processing applications.17,18,19,20,21,22,23,24 Here, the added twist is the inclusion of further matrix-norm constraints, which makes the formulation suitable for tasks such as QST; as we show in Supplementary information Section A, this addition complicates the algorithmic analysis.

The ProjFGD algorithm and its guarantees

At heart, ProjFGD is a projected gradient descent algorithm over the variable A; i.e.,

$$A_{t + 1} = {\Pi}_{\cal C}\left( {A_t - \eta \nabla f\left( {A_tA_t^\dagger } \right) \cdot A_t} \right),$$

where \({\Pi}_{\cal C}\left( B \right)\) denotes the projection of a matrix \(B \in {\Bbb C}^{d \times r}\) onto the set \({\cal C} = \left\{{A:A \in {\Bbb C}^{d \times r},\left\Vert A \right\Vert_F^2 \le 1} \right\}\), and \(\nabla f( \cdot ):{\Bbb C}^{d \times d} \to {\Bbb C}^{d \times d}\) denotes the gradient of the function f. Specific details of the ProjFGD algorithm, along with a pseudocode implementation, are provided in the Methods Section and in Supplementary information Sections A and B. Here, we focus on the theoretical guarantees of ProjFGD. In summary, our theory dictates a specific constant step size, η, that guarantees convergence to the global minimum, assuming a satisfactory initial point ρ0 is provided.
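In code, one iteration is a gradient step on the factor A followed by the projection \({\Pi}_{\cal C}\); a minimal sketch with our own names, reusing the Pauli list and measurement vector from the setup sketch above:

```python
import numpy as np

def project_C(B):
    """Projection onto C = {A : ||A||_F^2 <= 1}: rescale if outside the ball."""
    nrm = np.linalg.norm(B, 'fro')
    return B if nrm <= 1.0 else B / nrm

def grad_f(rho, paulis, y, m, n):
    """grad f(rho) = -2 M*(y - M(rho)) for the least-squares objective."""
    scale = 2**n / np.sqrt(m)
    residual = y - np.array([scale * np.real(np.trace(P @ rho))
                             for P in paulis])
    return -2 * scale * sum(r * P for r, P in zip(residual, paulis))

def projfgd_step(A, eta, paulis, y, m, n):
    """One iterate: A_{t+1} = Pi_C(A_t - eta * grad f(A_t A_t^dagger) A_t)."""
    rho = A @ A.conj().T
    return project_C(A - eta * grad_f(rho, paulis, y, m, n) @ A)
```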

An important issue in optimizing Eq. (3) over the factored space is the existence of non-unique factorizations of a given ρ: if \(\rho = AA^\dagger\), then for any unitary matrix \(R \in {\Bbb C}^{r \times r}\) such that \(RR^\dagger = I\), we also have \(\rho = \widehat A\widehat A^\dagger\), where \(\widehat A = AR\). Since we are interested in obtaining a low-rank solution in the original space, we need a notion of distance to ρ over the factors. We use the following unitary-invariant distance metric:

Definition 2

Let matrices \(A, A_ \star \in {\Bbb C}^{d \times r}\). Define:

$${\mathrm{DIST}}\left( {A,A_ \star } \right): = \mathop {{min}}\limits_{R:R\, \in \,{\cal U}} \left\Vert {A - A_ \star R} \right\Vert_F,$$

where \({\cal U}\) is the set of r × r unitary matrices.
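The minimization over R in Definition 2 is an orthogonal Procrustes problem, whose closed-form solution comes from an SVD; a short sketch of this standard fact (our addition):

```python
import numpy as np

def dist(A, A_star):
    """DIST(A, A_star): the optimal unitary is R = U V^dagger, where
    U S V^dagger is the SVD of A_star^dagger A (orthogonal Procrustes)."""
    U, _, Vh = np.linalg.svd(A_star.conj().T @ A)
    R = U @ Vh
    return np.linalg.norm(A - A_star @ R, 'fro')
```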

Let us first describe the local convergence rate guarantees of ProjFGD.

Theorem 3 (Local convergence rate for QST)

Let \(\rho_\star\) be a rank-r density matrix of an n-qubit system, with a (non-unique) factorization \(\rho _ \star = A_ \star A_ \star ^\dagger\), for \(A_\star\in {\Bbb C}^{2^n \times r}\). Let \(y \in {\Bbb R}^m\) be the measurement vector of \(m = {\cal O}\left( {rn^62^n} \right)\) random n-qubit Pauli observables, and \({\cal M}\) be the corresponding sensing map, such that \(y_i = \left( {{\cal M}\left( {\rho _ \star } \right)} \right)_i + e_i,\,\forall i = 1, \ldots ,m\). Let the step size η in ProjFGD satisfy:

$$\eta \le {\textstyle{1 \over {128\left( {\widehat L\sigma _1(\rho _0) + \sigma _1(\nabla f(\rho _0))} \right)}}},$$
(4)

where \(\sigma _1(\rho )\) denotes the leading singular value of ρ. Here, \(\widehat L\, \in \,\left( {1,2} \right)\) and \(\rho _0 = A_0A_0^\dagger\) is the initial point such that:

$${\mathrm{DIST}}\left( {A_0,A_ \star } \right) \le \gamma ^\prime \sigma _r\left( {A_ \star } \right),$$

for \(\gamma ^\prime : = c \cdot {\textstyle{{\left( {1 - \delta _{4r}} \right)} \over {\left( {1 + \delta _{4r}} \right)}}} \cdot {\textstyle{{\sigma _r\left( {\rho _ \star } \right)} \over {\sigma _1\left( {\rho _ \star } \right)}}},\,c \le {\textstyle{1 \over {200}}}\), where \(\delta _{4r}\) is the RIP constant. Let At be the estimate of ProjFGD at the t-th iteration; then, the new estimate At+1 satisfies

$${\mathrm{DIST}}\left( {A_{t + 1},A_ \star } \right)^2 \le \alpha \cdot {\mathrm{DIST}}\left( {A_t,A_ \star } \right)^2,$$
(5)

where \(\alpha : = 1 - \frac{{\left( {1 - \delta _{4r}} \right) \cdot \sigma _r\left( {\rho _ \star } \right)}}{{550\left( {\left( {1 + \delta _{4r}} \right)\sigma _1\left( {\rho _ \star } \right) + \left\Vert e \right\Vert_2} \right)}} < 1\). Further, \(A_{t + 1}\) satisfies \({\mathrm{DIST}}\left( {A_{t + 1},A_ \star } \right) \le \gamma ^\prime \sigma _r\left( {A_ \star } \right)\), \(\forall t\).

The proof of Theorem 3 is provided in the Supplementary information Section A. The definitions of L and \(\widehat L\) can be found in the Methods Section; for our discussion, they can be assumed constants. The above theorem provides a local convergence guarantee: given an initialization point \(\rho _0 = A_0A_0^\dagger\) close enough to the optimal solution – in particular, where \({\mathrm{DIST}}\left( {A_0,A_ \star } \right) \le \gamma ^\prime \sigma _r\left( {A_ \star } \right)\) is satisfied – our algorithm converges locally with linear rate. In order to obtain \({\mathrm{DIST}}\left( {A_T,A_ \star } \right)^2 \le \varepsilon\), ProjFGD requires \(T = {\cal O}\left( {{\mathrm{log}}{\textstyle{{\gamma ^\prime \cdot \sigma _r\left( {A_ \star } \right)} \over \varepsilon }}} \right)\) iterations. We conjecture that this further translates into linear convergence in the infidelity metric, \(1 - {\mathrm{Tr}}\left( {\sqrt {\sqrt {\rho _T} \rho _ \star \sqrt {\rho _T} } } \right)^2\).

So far, we assumed that \(\rho _0\) is provided such that \({\mathrm{DIST}}\left( {A_0,A_ \star } \right) \le \gamma ^\prime \sigma _r\left( {A_ \star } \right)\). The next lemma proposes an initialization procedure that achieves this guarantee (under assumptions) and turns the above local guarantees into convergence to the global minimum.

Lemma 4

Let A0 be such that \(\rho _0 = A_0A_0^\dagger = {\Pi}_{{\cal C}^\prime }\left( {{\textstyle{{ - 1} \over L}} \cdot \nabla f\left( 0 \right)} \right)\), where \({\Pi}_{{\cal C}^\prime }( \cdot )\) is the projection onto the set of PSD matrices ρ that satisfy \({\mathrm{Tr}}\left( \rho \right) \le 1\), and \(\nabla f\left( 0 \right)\) denotes the gradient of f evaluated at the all-zeros matrix. Consider problem (3), where \({\cal M}\) satisfies the RIP for some constant \(\delta _{4r}\, \in \,\left( {0,1} \right)\). Further, assume the optimal point \(\rho _ \star\) satisfies \({\mathrm{rank}}\left( {\rho _ \star } \right) = r\). Then, A0 satisfies:

$${\mathrm{DIST}}\left( {A_0,A_ \star } \right) \le \gamma ^\prime \cdot \sigma _r\left( {A_ \star } \right),$$

where \(\gamma ^\prime = \sqrt {{\textstyle{{1 - {\textstyle{{1 - \delta _{4r}} \over {1 + \delta _{4r}}}}} \over {2(\sqrt 2 - 1)}}}} \cdot \tau \left( {\rho _ \star } \right) \cdot \sqrt {srank\left( {\rho _ \star } \right)}\) and \(srank\left( \rho \right) = {\textstyle{{\left\Vert \rho \right\Vert_F} \over {\sigma _1\left( \rho \right)}}}\).

The proof is provided in Supplementary information Section B. This initialization introduces further restrictions on the condition number of \(\rho _ \star\), \(\tau \left( {\rho _ \star } \right) = \frac{{\sigma _1\left( {\rho _ \star } \right)}}{{\sigma _r\left( {\rho _ \star } \right)}}\), and on the condition number of the objective function, which is proportional to \(\frac{{1 + \delta _{4r}}}{{1 - \delta _{4r}}}\). The initialization assumptions in Theorem 3 are satisfied by Lemma 4 if \({\cal M}\) satisfies the RIP with a constant \(\delta _{4r}\) fulfilling the following condition:

$$\frac{{1 + \delta _{4r}}}{{1 - \delta _{4r}}} \cdot \sqrt {1 - {\textstyle{{1 - \delta _{4r}} \over {1 + \delta _{4r}}}}} \le {\textstyle{{\sqrt {2\left( {\sqrt 2 - 1} \right)} } \over {200}}} \cdot \frac{1}{{\sqrt r \cdot \tau ^2\left( {\rho _ \star } \right)}}.$$
(6)

In the special case of a pure state (r = 1), where \(\tau (\rho _ \star ) = 1\) and \({\mathrm{srank}}\left( {\rho _ \star } \right) = 1\), the condition simplifies to \(\delta _{4r} \lesssim 10^{ - 5}\). While these conditions are hard to check a priori, Pauli observables satisfy them, with high probability, as n increases, according to the results of Liu.25
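The \(\delta _{4r} \lesssim 10^{-5}\) threshold can be verified numerically by evaluating the left- and right-hand sides of Eq. (6) for r = 1; a quick check of this kind (our illustration):

```python
import numpy as np

def lhs(delta):
    """Left-hand side of Eq. (6)."""
    return (1 + delta) / (1 - delta) * np.sqrt(1 - (1 - delta) / (1 + delta))

# Right-hand side of Eq. (6) for r = 1, tau(rho_star) = 1.
rhs = np.sqrt(2 * (np.sqrt(2) - 1)) / 200       # ~4.55e-3

for delta in [1e-4, 1e-5, 1e-6]:
    print(f"delta_4r = {delta:.0e}: satisfied = {lhs(delta) <= rhs}")
# delta_4r = 1e-04: satisfied = False
# delta_4r = 1e-05: satisfied = True
# delta_4r = 1e-06: satisfied = True
```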

In summary, we have shown that, with a proper initialization and a constant step size, the ProjFGD algorithm converges to the global minimum, if the sensing map satisfies the RIP with a small constant, according to Eq. (6). This condition is satisfied, with high probability, by a measurement of \({\cal O}\left( {rn^62^n} \right)\) random Pauli observables.

We note that the conditions for global convergence are sufficient but not necessary. As we shall see in the experiments below, we obtain convergence to the global minimum (or to a point very close to it) with milder conditions, such as random initialization. Moreover, recent advances in machine learning22 have shown that, under RIP, random initialization guarantees global convergence of a variant of our algorithm, where we exclude the trace constraint in Eq. (2). This is the case where \({\cal M}\) corresponds to a POVM, or includes the identity operator.

Numerical experiments

Our experiments follow the discussion above. We find that both our initialization and random initialization work well in practice; this behavior was observed consistently across all the experiments we conducted. In these cases, the method returns the solution of the corresponding convex program, while being orders of magnitude faster than state-of-the-art optimization programs.

In all the experiments, the error is reported in the Frobenius metric, \(\left\Vert {\widehat \rho - \rho _ \star } \right\Vert_F/\left\Vert {\rho _ \star } \right\Vert_F\), where \(\widehat \rho\) is the estimation of the true state \(\rho _ \star\). Note that for a pure state \(\rho\), \(\left\Vert \rho \right\Vert_F = 1\). For some experiments we also report the infidelity metric \(1 - {\mathrm{Tr}}\left( {\sqrt {\sqrt {\rho _ \star } \widehat \rho \sqrt {\rho _ \star } } } \right)^2\). We model the additive noise in our experiments, \(e \in {\Bbb R}^m\), according to a circularly-symmetric normal distribution with variance σ for each measurement, \(e\sim {\cal C}{\cal N}\left( {0,\sigma \cdot I} \right)\).
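For completeness, the two reported metrics can be computed as follows (a minimal sketch, assuming scipy is available; rho_hat and rho_star are density matrices as numpy arrays):

```python
import numpy as np
from scipy.linalg import sqrtm

def frobenius_error(rho_hat, rho_star):
    """Relative error ||rho_hat - rho_star||_F / ||rho_star||_F."""
    return (np.linalg.norm(rho_hat - rho_star, 'fro')
            / np.linalg.norm(rho_star, 'fro'))

def infidelity(rho_hat, rho_star):
    """1 - Tr(sqrt(sqrt(rho_star) rho_hat sqrt(rho_star)))^2."""
    s = sqrtm(rho_star)
    return 1.0 - np.real(np.trace(sqrtm(s @ rho_hat @ s)))**2
```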

Comparison of ProjFGD with second-order methods

As a first set of experiments, we compare the efficiency of ProjFGD with second-order cone convex programs. State-of-the-art solvers in this class are the SeDuMi and SDPT3 methods; for their use, we rely on the off-the-shelf Matlab wrapper CVX.28 In our experiments, we observed that SDPT3 was faster, and we select it for our comparison. The setting is as described above, with additive noise of variance σ, i.e., \(e \sim {\cal C}{\cal N}\left( {0,\sigma \cdot I} \right)\). We consider both convex formulations Eqs. (1)–(2) and compare them to the ProjFGD estimator with r = 1; in the figures, we use the notation CVX 1 and CVX 2 for simplicity.

We consider two cases: (i) n = 7, and (ii) n = 13. Table 1 shows median values over ten independent experimental realizations for \(m = \frac{7}{3}rd\,{\mathrm{log}}\,d\); this choice of m was made so that all algorithms return a solution close to the optimum \(\rho _ \star\). Empirically, we have observed that ProjFGD succeeds even for \(m = {\cal O}\left( {rd} \right)\). We consider both noiseless (σ = 0) and noisy (σ = 0.05) settings.

Table 1 All values are median values over ten independent Monte Carlo iterations. “N/A” indicates that the corresponding algorithms did not return a solution within the selected wall-time T. We set T = 86400 s (1 day)

Figures 1 and 2 illustrate how the second-order convex schemes scale compared with our first-order non-convex scheme. In Fig. 1, we observe that, while for ProjFGD more observations lead to faster convergence,29 the same does not hold for the second-order cone programs. In Fig. 2, it is evident that the convex solvers do not scale easily beyond n = 7, whereas our method handles cases up to n = 13 within reasonable time. We note that, as n increases, a significant fraction of the time in our algorithm is spent forming the Pauli observables Pi; i.e., assuming that the application of the Pi's takes the same amount of time as in the CVX solvers, ProjFGD requires much less additional computation per iteration, compared with CVX 1 and CVX 2.

Fig. 1 Dimension fixed to \(d = 2^7\), with \({\mathrm{rank}}(\rho_\star) = 1\); noiseless setting. Numbers within the figure are the errors in Frobenius norm achieved (median values)

Fig. 2 Number of data points set to \(m = \frac{7}{3}rd\,{\mathrm{log}}\,d\); rank of the optimum point set to \({\mathrm{rank}}(\rho_\star) = 1\); noiseless setting

Comparison of ProjFGD with first-order methods

We compare our method with more efficient first-order methods, both convex (AccUniPDGrad30) and non-convex (SparseApproxSDP31 and RSVP32); we briefly describe these methods in the Discussion Section.

We consider two settings: \(\rho _ \star\) is (i) a pure state (i.e., \({\mathrm{rank}}\left( {\rho _ \star } \right) = 1\)) and (ii) a nearly low-rank state. In the latter case, we construct \(\rho _ \star = \rho _{ \star ,r} + \zeta\), where \(\rho _{ \star ,r}\) is a rank-deficient PSD matrix satisfying \({\mathrm{rank}}\left( {\rho _{ \star ,r}} \right) = r\), and \(\zeta \, \in \,{\Bbb C}^{d \times d}\) is a full-rank PSD noise term with a fast-decaying eigen-spectrum, significantly smaller than the leading eigenvalues of \(\rho _{ \star ,r}\). In other words, \(\rho _ \star\) is well-approximated by \(\rho _{ \star ,r}\). For all cases, the noise is such that \(\left\Vert e \right\Vert = 10^{ - 3}\). The number of data points m satisfies \(m = C_{{\mathrm{sam}}} \cdot rd\), for various values of \(C_{{\mathrm{sam}}} > 0\).
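One way to synthesize such a nearly low-rank state is sketched below (our construction; the geometric decay profile is a hypothetical choice, not the paper's exact recipe):

```python
import numpy as np

def nearly_low_rank_state(d, r, decay=0.05, rng=None):
    """Rank-r PSD part plus a full-rank PSD tail with geometrically
    decaying eigenvalues, normalized to unit trace."""
    if rng is None:
        rng = np.random.default_rng()
    B = rng.normal(size=(d, r)) + 1j * rng.normal(size=(d, r))
    rho_r = B @ B.conj().T                              # rank-r part
    Q, _ = np.linalg.qr(rng.normal(size=(d, d))
                        + 1j * rng.normal(size=(d, d)))
    tail = Q @ np.diag(decay ** np.arange(1, d + 1)) @ Q.conj().T
    rho = rho_r + tail
    return rho / np.real(np.trace(rho))
```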

Table 2 contains recovery-error and execution-time results for the case n = 13 (d = 8192); in this case, we solve a \(d^2 = 67{,}108{,}864\)-dimensional problem. The RSVP and SparseApproxSDP algorithms were excluded from this comparison, due to excessive execution time. Supplementary information Section C provides extensive results, where similar performance is observed for other values of \(d = 2^n\) and Csam.

Table 2 Median results for reconstruction and efficiency, for n = 13 qubits and Csam = 3

Table 3 considers the more general case where \(\rho _ \star\) is nearly low-rank: i.e., it can be well-approximated by a rank-r density matrix \(\rho _{ \star ,r}\) with r = 20. In this case, n = 12 (d = 4096) and m = 245,760 for Csam = 3. As the rank r in the model increases, algorithms that utilize an SVD routine spend more CPU time on singular value/vector calculations. The same applies to matrix-matrix multiplications; in the latter case, however, the complexity scales more mildly than that of the SVD calculations. For completeness, Supplementary information Section C provides results illustrating the effect of random initialization: similar to the above, ProjFGD shows competitive behavior, finding a better solution faster, irrespective of the initialization point.

Table 3 Median results for reconstruction and efficiency

Overall, ProjFGD shows a substantial improvement in performance, compared with state-of-the-art algorithms; we emphasize that projected gradient descent schemes, such as that of Becker et al.,32 are also efficient in small- to medium-sized problems, due to their fast convergence rate. Further, convex approaches might show better sampling-complexity performance (i.e., as Csam decreases). Nevertheless, in the time such methods need for small- to medium-sized problems, our method performs accurate maximum-likelihood estimation for considerably larger systems. Due to space restrictions, we refer the reader to Supplementary information Section C.

Discussion

In this work, we propose a non-convex algorithm, dubbed ProjFGD, for estimating a highly-pure quantum state in a high-dimensional Hilbert space from a relatively small number of data points. We showed empirically that ProjFGD is orders of magnitude faster than state-of-the-art convex and non-convex programs, such as those of Yurtsever et al.,30 Hazan,31 and Becker et al.32 More importantly, we prove that, under proper initialization and step size, ProjFGD is guaranteed to converge to the global minimum of the problem, thus ensuring a provable tomography procedure; see Theorem 3 and Lemma 4.

Our techniques and proofs can be applied to scenarios beyond the ones considered in this work. We conjecture that our results apply to other “sensing” settings that are informationally complete for low-rank states; see, e.g., Baldwin et al.4 The results presented here are independent of the noise model and could be applied to non-Gaussian noise models, such as those stemming from finite counting statistics. Lastly, while here we focus on state tomography, it would be interesting to explore similar techniques for the problem of process tomography.

Related work

In order to place our work in the literature, we focus on several efficient methods for QST; for a broader set of citations that go beyond QST, see Park et al.21

The use of non-convex algorithms in QST is not new, and dates to before the introduction of the CS protocol in QST settings.6 Even the use of the reparameterization \(\rho = AA^\dagger\) is not new; see refs. 33,34,35,36 Despite their success, there are no theoretical results on the non-convex nature of the transformed objective (e.g., the presence of spurious local minima), except for the work of Goncalves et al.37 In that work, the authors consider the informationally complete case, where the number of measurements is of the order \({\cal O}\left( {d^2} \right)\) and, therefore, there is a unique solution to Eqs. (1)–(2), without requiring the RIP. The authors characterize the local vs. the global behavior of the objective under the factorization \(\rho = AA^\dagger\), and discuss how existing methods fail due to improper stopping criteria or due to the lack of algorithmic convergence results. Their work highlights the lack of rigorous convergence results for algorithms used in QST.

Shang et al.38 propose a hybrid algorithm that (i) starts with a conjugate-gradient (CG) algorithm in the A space, in order to obtain an initial rapid descent, and (ii) switches over to accelerated first-order methods in the original ρ space, provided one can determine the switchover point cheaply. Under the multinomial maximum-likelihood objective, in the initial CG phase, the Hessian of the objective (a \(d^2 \times d^2\) matrix) is computed per iteration, along with its eigenvalue decomposition. Such an operation is costly even for moderate values of d, and heuristics are proposed for its completion. From a theoretical perspective, Shang et al.38 provide no convergence or convergence-rate guarantees.

Goncalves et al.39 study the QST problem in the original parameter space and propose a projected gradient descent algorithm. The proposed algorithm applies to both convex and non-convex objectives, and only convergence to stationary points can be expected. Bolduc et al.40 extend the work of Goncalves et al.39 with two first-order variants using momentum motions, similar to the techniques proposed by Polyak and Nesterov for faster convergence in convex optimization.41 The above algorithms operate in the informationally complete case; similar ideas in the informationally incomplete case can be found in refs. 32,42

Very recently, Riofrio et al.42 presented an experimental implementation of CS tomography of an n = 7 qubit system, where only 127 Pauli basis measurements are available. To achieve recovery in practice, the authors proposed a computationally efficient estimator, based on the factorization \(\rho = AA^\dagger\). The resulting method resembles our gradient descent method on the factors A. However, the authors focus only on the experimental efficiency of the method, and provide no specific results on the optimization efficiency of the algorithm, on its theoretical guarantees, or on how its components (such as initialization and step size) affect its performance (e.g., the step size is set to a sufficiently small constant). See also Schwemmer et al.10 for a six-qubit implementation.

One of the first provable algorithmic solutions to the QST problem was through convex approximations:26 these include nuclear-norm minimization approaches,6 as well as proximal variants, such as the following:

$$\mathop {{\mathrm{minimize}}}\limits_{\rho \succcurlyeq 0} \quad \left\Vert {y - {\cal M}\left( \rho \right)} \right\Vert_2^2 + \lambda \,{\mathrm{Tr}}\left( \rho \right).$$
(7)

See Gross et al.6 for the theoretical analysis. Within this context, we mention the work of Yurtsever et al.:30 there, the AccUniPDGrad algorithm is proposed – a universal primal-dual convex framework with sharp operators, in lieu of proximal low-rank operators – where QST is considered as an application. AccUniPDGrad combines the flexibility of proximal primal-dual methods with the computational advantages of conditional gradient methods.

Hazan31 presents the SparseApproxSDP algorithm, which solves the QST problem in Eq. (2), when the objective is a generic gradient-Lipschitz smooth function, by updating a putative low-rank solution with rank-1 refinements coming from the gradient. This way, SparseApproxSDP avoids computationally expensive operations per iteration, such as full eigen-decompositions. In theory, SparseApproxSDP achieves a sublinear \({\cal O}\left( {\frac{1}{\varepsilon }} \right)\) convergence rate. However, depending on ε, SparseApproxSDP might not return a low-rank solution.

Finally, Becker et al.32 propose Randomized Singular Value Projection (RSVP), a projected gradient descent algorithm for QST, which merges gradient calculations with truncated eigen-decompositions, via randomized approximations for computational efficiency.

Future directions

We conclude with a short list of interesting future research directions. Our immediate goal is the application of ProjFGD in real-world scenarios; this could be accomplished by utilizing IBM quantum computers.3 This would complement the results found in Riofrio et al.43 for a different quantum system.

Beyond its use as a point estimator, the maximum-likelihood estimator is used as a basis for inference around the point estimate, via confidence intervals44 and credible regions.45 However, there is still no rigorous analysis for the case when the factorization \(\rho = AA^\dagger\) is used.

The work in refs. 38,40 considers accelerated gradient descent methods for QST in the original parameter space ρ. It remains an open question how our approach could exploit such techniques, along with rigorous approximation and convergence guarantees. Further, distributed/parallel implementations, like that of Hou et al.,46 remain widely open for our approach, in order to further accelerate the execution of the algorithm. Research along these directions is very interesting and is left for future work.

Finally, we identify two practical observations from our experiments that need further theoretical justification. First, we saw numerically that random initialization works well in our settings; a careful theoretical treatment of this case is an open problem. Second, while we observed that ProjFGD outperforms convex solvers, it is an open question to understand its behavior in the setting where r = d.

Methods

A more detailed discussion of ProjFGD follows. The pseudocode is provided in Algorithm 1; an actual implementation is in Supplementary information Section C.

[Algorithm 1: ProjFGD]
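For readers without access to the algorithm box, the following condensed sketch captures the whole procedure (our code, not the paper's release; it reuses grad_f, project_C, and projfgd_step from the earlier sketches, and init_A0 from the initialization sketch at the end of this section):

```python
import numpy as np

def projfgd(y, paulis, n, r, iters=500, L_hat=1.5):
    """Sketch of ProjFGD: rank-r initialization, then projected
    factored gradient descent with the constant step size of Eq. (4)."""
    m = len(paulis)
    A = init_A0(y, paulis, m, n, r, L_hat)      # see initialization sketch
    rho0 = A @ A.conj().T
    # np.linalg.norm(., 2) is the spectral norm, i.e., sigma_1(.).
    eta = 1.0 / (128 * (L_hat * np.linalg.norm(rho0, 2)
                        + np.linalg.norm(grad_f(rho0, paulis, y, m, n), 2)))
    for _ in range(iters):
        A = projfgd_step(A, eta, paulis, y, m, n)
    rho = A @ A.conj().T
    return rho / np.real(np.trace(rho))         # normalize to unit trace
```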

Denote \(g\left( A \right) = {\textstyle{1 \over 2}} \cdot \left\Vert {y - {\cal M}(AA^\dagger )} \right\Vert_2^2\) and \(f\left( \rho \right) = {\textstyle{1 \over 2}} \cdot \left\Vert {y - {\cal M}(\rho )} \right\Vert_2^2\). Due to the symmetry of f, i.e., \(f\left( \rho \right) = f\left( {\rho ^\dagger } \right)\), the gradient of g(A) w.r.t. the variable A is given by

$$\nabla g\left( A \right) = \left( {\nabla f\left( \rho \right) + \nabla f\left( \rho \right)^\dagger } \right) \cdot A = 2\nabla f\left( \rho \right) \cdot A,$$

where \(\nabla f\left( \rho \right) = - 2{\cal M}^ \ast \left( {y - {\cal M}\left( \rho \right)} \right)\), and \({\cal M}^ \ast\) is the adjoint operator for \({\cal M}\). For the Pauli measurements case we consider in this paper, the adjoint operator for an input vector \(b\, \in \,{\Bbb R}^m\) is \({\cal M}^ \ast \left( b \right) = {\textstyle{{2^n} \over {\sqrt m }}}\mathop {\sum}\nolimits_{i = 1}^m b_iP_i\).
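A quick numeric sanity check of the adjoint relation (our addition, reusing sensing_map, paulis, rho_star, m, n, and rng from the setup sketch): \(\langle {\cal M}(\rho), b\rangle\) should equal \({\mathrm{Tr}}(\rho \,{\cal M}^\ast(b))\) for Hermitian ρ.

```python
import numpy as np

def adjoint(b, paulis, m, n):
    """M*(b) = (2^n / sqrt(m)) * sum_i b_i P_i."""
    return (2**n / np.sqrt(m)) * sum(bi * P for bi, P in zip(b, paulis))

b = rng.normal(size=m)
lhs = sensing_map(paulis, rho_star, m, n) @ b            # <M(rho), b>
rhs = np.real(np.trace(rho_star @ adjoint(b, paulis, m, n)))
np.testing.assert_allclose(lhs, rhs, rtol=1e-10)         # adjoint check
```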

The prior knowledge \({\mathrm{rank}}\left( {\rho _ \star } \right) \le r_ \star\) is imposed by setting \(A\, \in \,{\Bbb C}^{d \times r_ \star }\). In real experiments, the state \(\rho _ \star\) could be full rank, but often it is highly pure, with only a few dominant eigenvalues.43 In this case, \(\rho _ \star\) is well-approximated by a low-rank matrix of rank r, which can be much smaller than \(r_ \star\). In the ProjFGD protocol, we set \(A\, \in \,{\Bbb C}^{d \times r}\). In this form, A contains far fewer variables to maintain and optimize than a d × d PSD matrix, and thus its iterates are easier to update and store.

The per-iteration complexity of ProjFGD is dominated by the application of the linear map \({\cal M}\) and by matrix–matrix multiplications. While both eigenvalue decomposition and matrix multiplication have \({\cal O}\left( {\left( {2^n} \right)^2r} \right)\) complexity, the latter is at least two orders of magnitude faster on dense matrices.21

Due to the bilinear structure in Eq. (3), it is not clear whether the factorization \(\rho = AA^\dagger\) introduces spurious local minima, i.e., minima that do not exist in Eqs. (1)–(2) but are “created” by the factorization. This necessitates careful initialization to obtain the global minimum.

The initial point \(\rho _0\) is set as \(\rho _0: = 1/\hat L \cdot {\Pi}_{{\cal C}^\prime }\left( { - \nabla f\left( 0 \right)} \right) = 2/\hat L \cdot {\Pi}_{{\cal C}^\prime }\left( {{\cal M}^ \ast \left( y \right)} \right)\), where \({\Pi}_{{\cal C}^\prime }( \cdot )\) denotes the projection onto the set of PSD matrices ρ that satisfy \({\mathrm{Tr}}\left( \rho \right) \le 1\). Here, \(\hat L\) represents an approximation of L, where L is such that for all rank-r matrices \(\rho ,\zeta\):

$$\left\Vert {\nabla f\left( \rho \right) - \nabla f\left( \zeta \right)} \right\Vert_F \le L \cdot \left\Vert {\rho - \zeta } \right\Vert_F.$$
(8)

(This also means that f is restricted gradient Lipschitz continuous with parameter L; we refer the reader to the Supplementary information Sections A and B for more information.) In practice, we set \(\hat L \in \left( {1,2} \right)\).

This is the only place where an eigenvalue-type calculation is required. The projection \({\Pi}_{{\cal C}^\prime }( \cdot )\) is given in ref. 39. In practice, we can simply use a standard projection onto the set of PSD matrices, \(\rho _0: = 2/\hat L \cdot {\Pi}_ + \left( {{\cal M}^ \ast \left( y \right)} \right)\); our numerical experiments show that it is sufficient, and it can be implemented by any off-the-shelf eigenvalue solver. In that case, the algorithm generates \(A_0\, \in \,{\Bbb C}^{d \times r}\) by truncating the computed eigen-decomposition, followed by a projection onto the convex set \({\cal C}\).
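A minimal sketch of this simpler initialization (our code; project_C is defined in the iteration sketch earlier in the text):

```python
import numpy as np

def init_A0(y, paulis, m, n, r, L_hat=1.5):
    """rho_0 = (2 / L_hat) * Pi_+(M*(y)), truncated to rank r; the
    factor A_0 is then projected onto C = {A : ||A||_F^2 <= 1}."""
    M_adj_y = (2**n / np.sqrt(m)) * sum(b * P for b, P in zip(y, paulis))
    w, V = np.linalg.eigh((2.0 / L_hat) * M_adj_y)
    w = np.maximum(w, 0.0)                      # PSD projection Pi_+
    idx = np.argsort(w)[::-1][:r]               # keep top-r eigenpairs
    A0 = V[:, idx] * np.sqrt(w[idx])            # rho_0 ~ A0 A0^dagger
    return project_C(A0)
```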

Data availability

The empirical results were obtained via synthetic experiments; the algorithm’s implementation is available in the supplementary material.