Hamiltonian simulation with optimal sample complexity

We investigate the sample complexity of Hamiltonian simulation: how many copies of an unknown quantum state are required to simulate a Hamiltonian encoded by the density matrix of that state? We show that the procedure proposed by Lloyd, Mohseni, and Rebentrost [Nat. Phys., 10(9):631–633, 2014] is optimal for this task. We further extend their method to the case of multiple input states, showing how to simulate any Hermitian polynomial of the states provided. As applications, we derive optimal algorithms for commutator simulation and orthogonality testing, and we give a protocol for creating a coherent superposition of pure states, when given sample access to those states. We also show that this sample-based Hamiltonian simulation can be used as the basis of a universal model of quantum computation that requires only partial swap operations and simple single-qubit states. One of the hallmarks of quantum computation is the storage and extraction of information within quantum systems. Recently, Lloyd, Mohseni and Rebentrost created a protocol to treat multiple identical copies of a quantum state as “quantum software”, specifying a quantum program to be run on any other state. They use this approach to do principal component analysis of the software state. Here, we expand on their results, providing protocols for running more-complex quantum programs specified by several different states. Our protocols can be used to analyze the relationship between different states (for example, deciding whether states are orthogonal) and to create new states (such as coherent linear combinations of two states). We also outline the optimality of Lloyd et al.’s original protocol, as well as our new protocols.


Introduction
One of the most anticipated applications of quantum computation is Hamiltonian simulation.In fact, this was Feynman's main motivation for suggesting the creation of quantum computers [Fey82].Feynman's intuition may soon pay off, as the simulation of Hamiltonians for quantum chemistry looks to be implementable on small to moderately sized quantum computers [HWBT15, WBC + 14].In addition, Hamiltonian simulation has implications for more general computational problems, including adiabatic algorithms [FGGS00], quantum walk algorithms [CCD + 03], and algorithms for systems of linear equations [HHL09].
Much work has been done on the time and query complexity of Hamiltonian simulation when given a classical description or black box description of the Hamiltonian.Lloyd provided the first formal results on simulation, considering Hamiltonians that consist of sums of non-commuting terms [Llo96].Other lines of research have focused on simulating sparse Hamiltonians, with a long sequence of work recently culminating in an optimal algorithm [LC16] (see [BCK15] for a more complete history of work in this field).
In this work, we approach the problem of Hamiltonian simulation from a slightly different perspective.Rather than given a classical description or black-box access to a Hamiltonian H, we consider the problem of simulating H when given many copies of a quantum state ρ that encodes the Hamiltonian to be simulated.In particular, we assume that for some constant c ∈ R such that H + c1 is positive semidefinite.In that case, ρ itself is positive semidefinite and Tr ρ = 1, so ρ is a valid density matrix.Note that so the Hamiltonian dynamics of H and ρ are equivalent up to an overall phase and time scaling.Moreover, since the Hamiltonian H in Eq. ( 2) can be arbitrary, any unitary can in fact be expressed as e −iρt for an appropriately chosen state ρ and time t.This modified version of the original Hamiltonian simulation problem is what we call samplebased Hamiltonian simulation: given one copy of an unknown state σ and n copies of an unknown state ρ, implement the following map: where t is the desired evolution time.We also allow for some error in the final state-we denote by δ the trace distance [NC10] between the state that is output by the protocol and the ideal state e −iρt σe iρt .This problem was first considered in [LMR14], where the authors give a simple protocol, which we call the LMR protocol (after the authors' initials), for approximately implementing the unitary e −iρt using many copies of ρ.Their protocol is based on a partial swap operation that can also be considered as a finite-dimensional analogue of a beam-splitter [ADO16].An interesting feature of the LMR protocol is that it is agnostic with regard to ρ.In the spirit of [Pre99,GC99], this suggests interpreting ρ as a "quantum software state".
The main motivation for sample-based Hamiltonian simulation in [LMR14] is to perform principal component analysis of ρ.They do this by performing phase estimation on the unitary e −iρ .We note that it is a nontrivial fact that the controlled-e −iρ operation can in fact be implemented with the LMR protocol; see Appendix C (this fact does not seem to have been explicitly discussed in previous work).In addition, we note in Appendix D that a slightly more careful analysis gives a polynomial improvement in sample complexity over the complexity given in [LMR14] for performing phase estimation.The LMR protocol has applications to many problems in machine learning, e.g.[LMR14,Wan14,RML14,CD15].
In this paper, we investigate the optimal scaling of sample-based Hamiltonian simulation.That is, we ask the following question: given t and δ, what is the minimum n (number of copies of ρ) necessary to implement e −iρt on an unknown state σ to trace distance at most δ?We call this the sample complexity of Hamiltonian simulation.
It is interesting to consider alternative strategies, other than LMR, for sample-based Hamiltonian simulation.While the LMR protocol acts with each copy of ρ sequentially, perhaps one could achieve better performance by acting with a global operation [Ozo15]?For example, recent near-optimal tomographic protocols have relied on performing global operations (like the Schur transform) on many copies of the unknown state [HHJ + 15, OW15].Along those lines, perhaps one could do better than LMR by applying tomographic protocols to get an estimate ρ of ρ from the n copies of ρ, and then evolve according to e −i ρt .

Notation
We use H to denote a finite-dimensional Hilbert space, and D(H) to represent the set of positive semi-definite operators with trace 1 in H (i.e. the set of valid quantum states).
The trace distance between ρ, σ ∈ D(H) is given by where A 1 := Tr( √ AA † ).The trace distance between ρ and σ gives the maximum difference in probability of any measurement on the two states [NC10].For two quantum channels E 1 and E 2 that act on D(H), their trace norm distance is defined as The diamond norm distance is defined as where For unitary channels U 1 and U 2 corresponding to conjugation by unitary matrices U 1 and U 2 , we will sometimes write , by Tr B ρ AB we mean taking the partial trace of the second system.More generally, given a state ρ on multiple subsystems, by Tr i ρ we mean taking the partial trace of the i th subsystem of ρ.
We define the single qubit state |+ := (|0 + |1 )/ √ 2. We refer to i[A, B] := i(AB − BA) as the commutator and {A, B} := AB + BA as the anticommutator of operators A and B. We will use X, Y, and Z to denote the single-qubit Pauli operators.We use 1 A to mean the identity matrix acting on subsystem A, but if clear from context, we will drop the subscript.

LMR Protocol versus State Tomography
Lloyd, Mohseni, and Rebentrost [LMR14] gave a simple method for approximating the transformation in Eq. (3).Importantly, their procedure is independent of σ and ρ, and the number of copies of ρ required does not depend on the dimension of the two states.We state their result in a slightly more general form, where σ has two registers and e −iρt is applied only to one of them.

Theorem 1 ([LMR14]
).Let ρ ∈ D(H A ) and σ ∈ D(H A ⊗ H B ) be two unknown quantum states and t ∈ R (can be either positive or negative).Then there exists a quantum algorithm that transforms as long as the number of copies of ρ is n = O(t 2 /δ).In other words, this quantum algorithm implements the unitary e −iρt up to error δ in diamond norm, using O(t 2 /δ) copies of ρ.
Proof.We will give a sketch of the proof; for the full proof see Appendix B. For simplicity we assume ρ and σ have the same dimension, i.e.H B is one-dimensional.Then by the Hadamard Lemma (see Appendix A), the target state is We note that for very small evolution times ∆, we have the following direct calculation: were S is the swap operator between the two registers.If we take ∆ = δ/t and repeat this procedure O(t 2 /δ) times, we end up implementing the operator e −iρt up to error Thus the LMR protocol uses O(t 2 /δ) copies of ρ to implement the unitary e −iρt up to error δ in trace norm.To obtain the result for diamond norm, we can simply replace ρ by ρ A ⊗ 1 B and σ by σ AB .See Appendix B for a more detailed proof.Remark 2. While not noted explicitly in [LMR14], it turns out that the LMR protocol can be implemented efficiently, i.e. using O(log D • t 2 /δ) single-qubit and Fredkin (controlled-swap) gates, where D = dim(H A ).To see this, we recall that in the proof of Theorem 1, the only potentially expensive operation is the partial swap This operation is a linear combination of the two unitaries 1 and S, the latter swapping two registers of dimension D and thus being implementable using O(log D) gates.By the LCU (linear combination of unitaries) algorithm (see e.g.[BCK15], or [Kot14, Theorem 2.4]), e −iS∆ can be implemented using a constant number of uses of controlled-S, elementary single-qubit rotations, and a single-qubit unitary A satisfying Hence the LMR protocol can be implemented with O(log D • t2 /δ) single-qubit and Fredkin gates. 1e note that Marvian and Lloyd independently give an alternative efficient implementation for the LMR protocol [ML16,Appendix C]; their algorithm has an extra multiplicative factor of log(t 2 /δ) in the runtime.
Remark 3. We also note that the LMR protocol can be modified to implement the controlled-e −iρt operation, which will be important if one wants to implement phase estimation on e −iρt .This fact does not seem to have been addressed in previous work; see Appendix C.
An alternative method to LMR for the sample-based Hamiltonian simulation problem would be to perform tomography on the copies of ρ to get an estimate ρ of ρ, and then implement e −i ρt .Let ζ = ρ − ρ and suppose ζ 1 = .We first show that simulating with ρ instead of ρ results in a diamond norm distance at most t.That is, To show this, we recall the Lie product formula [Var84] e For any integer l ≥ 1 and unitaries U and V, using the triangle inequality we have that where the equality comes from the unitary invariance of the diamond norm.Applying this argument inductively, U l − V l ≤ l U − V .Hence e −iρt/l e −iζt/l l − e −iρt ≤ l e −iρt/l e −iζt/l − e −iρt/l (17) = l e −iζt/l − 1 (18) where the last line follows using a Taylor series expansion.Finally, using Eq. ( 14) and taking l → ∞ in Eq. ( 19), we obtain Eq. (13).Moreover, there exists ρ and ρ for which Eq. ( 13) is essentially tight.To see this, note that if ρ and ρ commute, we have This means that if we want to simulate e −iρt to error δ in diamond norm, in general we need an estimate ρ such that ρ − ρ 1 = O(δ/t).
In Theorem 1 of [HHJ + 15], they prove that to acquire an estimate of a rank-r state ρ ∈ D(d) that differs from the true ρ by at most in trace distance with probability at least 1 − η requires n copies of ρ, where with C a function of only η.The scaling in can be slightly improved.Fixing η, d and r, Eq. ( 23) scales in as Ω(1/( 2 log(1/ ))).If one could acquire such an estimate of ρ, one could violate the Helstrom bound [Hel76] that scales as Ω(1/ 2 ).Therefore, we can combine the two bounds to get Back to our problem of Hamiltonian simulation, we want = δ/t to obtain a simulation correct to accuracy δ.Setting = δ/t, we find that the number of samples needed to obtain a tomographic estimate to the desired accuracy is On the other hand, using LMR to simulate e −iρt to accuracy δ uses n copies of ρ, where Since LMR does not have any dependence on d or r, we immediately see that for large d or r, it does significantly better.Furthermore, even fixing d and r, we see that LMR provides a square root improvement in sample complexity over tomography in terms of δ.

LMR Protocol is Optimal
Our strategy for proving the optimality of the LMR protocol will be as follows.We first give a lower bound on the sample complexity of distinguishing two specific states.Next, we assume we have a protocol that simulates e −iρt to trace norm (not diamond norm, see discussion below) δ using f (t, δ) samples of ρ for some function f .Then we show that using such a protocol, one can distinguish these two states.However, if f = o(t 2 /δ), we would violate our lower bound on state discrimination.
For this bound, we will consider states of the form for some x ∈ [0, 1].
Proof.If we have n samples of ρ, and we want to determine whether ρ is ρ(x) or ρ(x + ), then the maximum probability that we correctly identify ρ is at most [Hol73,Hel76] 1 Now ρ(x) and ρ(x + ) commute, so the trace distance in Eq. (28) becomes the total variation distance between the eigenvalues of the two states.Since ρ(x) is a rank-2 state, this variation distance is the same as the variation distance between two binomial distributions with n trials and probabilities x and x + respectively.Then if n < C η / 2 , for a sufficiently small constant C η that depends on η, as long as < η, the total variation distance between these two binomial distributions is less than 1/3 (from many sources, e.g.[AJ06]).
We now show the main result of this section: the sample complexity of the LMR protocol is optimal.
The upper bound holds by the LMR protocol, Theorem 1, so we will only prove the lower bound.The fact that the trace norm lower bounds the diamond norm makes a tight lower bound in terms of the trace norm a stronger result than if we had used the diamond norm.

Pure State Discrimination and the Optimality of LMR for Pure States
In the previous section, we saw that the sample complexity of the LMR protocol cannot in general be improved.However, the specific case of state discrimination that we used in the proof involved mixed states.One is left to wonder whether simulating exp(−i|ψ ψ|t) for a pure state |ψ might possibly be more efficient.This relates to a practically relevant question; as we will see in Section 7, the LMR protocol and certain pure states as resources create a universal model for quantum computation.
In this section, however, we show that LMR is also optimal for pure states, at least in the δ error parameter.What about the t parameter?We argue that we cannot expect to prove a meaningful lower bound on the t dependence in pure state LMR.The reason is that, given any state ρ and promised that exp(−iρt) is periodic (i.e.exp(−iρt 1 ) = exp(−iρt 2 ) for any t 2 = t 1 + kT for integer k and real number T), we can always simulate the Hamiltonian ρ for an equivalent time t ∈ [0, T) instead.Notice first that LMR gives an algorithm for this simulation that takes a finite number of samples for any time t ∈ [0, T) and fixed δ.Since we have such an upper bound, any lower bound correct in the δ scaling but not necessarily in the time scaling will differ by at most a constant from this upper bound.Such a lower bound is not meaningful with respect to any asymptotic scaling.Most relevant to our immediate purpose, this argument applies to pure states: knowing a state is pure, we immediately know its period, namely 2π.
To prove that the LMR protocol is optimal for pure states, we first show that pure state discrimination reduces to a problem we call sample-based Grover's search.Then, with the help of known bounds on state discrimination, we prove a lower bound on the efficiency of sample-based Grover's search.Finally, we show that LMR can be used to implement sample-based Grover's search, and therefore find that LMR is optimal in terms of the precision δ.

Metrological View of Grover's Search
While Grover's search [Gro96] is a well-known quantum mechanical task, it is not often stated in its form as a decision problem, and very rarely [DDM15] as a metrological decision problem, where the inputs are unitaries and the output depends on a property that those unitaries either possess or do not possess.This guise is useful for our purposes, however, because the LMR protocol, Eq. (3), allows us to turn metrology problems on states into metrology problems on quantum operations.
In the metrological view, Grover's search, or perhaps more precisely amplitude amplification [BHMT02], is the following problem of parameter estimation.Let T be a subspace of C 2 q .We call T the target subspace.Let U T be a unitary acting on q + 1 qubits such that In this problem, and in the following variations, we will assume access to U T and U † T are free.For an q-qubit unitary V, define Then in Grover's search, the task is to decide whether λ ≥ w (for w > 0) or λ = 0, while using V and V † as few times as possible.In words, if we call |s = V|0 ⊗q the start state, we would like to determine whether the start state has substantial probability mass in the target subspace or none, promised one is the case.If we solve this problem using Grover's search and count the number of uses of V and V † required to succeed with probability 1 − , we get the standard complexity Θ(log(1/ )/ √ w) [BBHT98,BCdWZ99].One simple modification of metrological Grover's search is to replace the circuit description of the start state preparation operator V with copies of the start state |s instead.The problem is now to determine whether λ = |(1 ⊗ 1|)U T (|s ⊗ |0 )| 2 is at least w > 0 or equal to zero, promised one is the case, given copies of |s and unlimited access to U T and U † T .We call this sample-based Grover's search.But how many copies of |s are needed?We will see in Section 4.2 that the answer is Θ(log(1/ )/w), and so we find we have lost the square-root advantage of Grover's search.
A second variant of metrological Grover's search is to replace both V and U T with quantum states.In this form, the problem becomes: given copies of q-qubit states |s and |t , determine whether λ = | s|t | 2 is at least w > 0 or equal to zero, promised one is the case.We call this variant orthogonality testing.The number of copies of |s and |t needed will also turn out to be Θ(log(1/ )/w); see Section 6.
Orthogonality testing is similar to equality testing, the problem of deciding whether | s|t | 2 = 1 or | s|t | 2 < 1 − w for some w > 0, for which there is already an optimal (up to log factors) algorithm [BCWdW01].

The LMR Protocol is Optimal for Pure States
To show LMR is optimal for pure states, we begin by showing a lower bound on sample-based Grover's search.Then, we show that sample-based Grover's search can be implemented optimally by LMR.Lemma 6. Sample-based Grover's search with success probability 1 − uses Θ (log(1/ )/w) copies of |s .
Proof.We will first prove the lower bound.Consider the pure state discrimination problem of Helstrom [Hel76].You are given a quantum state |ψ (of arbitrary dimension) which is either |φ 1 or |φ 2 , each with probability 1/2.You are provided with classical descriptions of the two states |φ 1 and |φ 2 , and asked to decide whether |ψ = |φ 1 or |ψ = |φ 2 .Over all possible measurements one could perform on |ψ , what is the minimum failure rate that can be achieved?
Helstrom's bound [Hel76] states that for any discrimination procedure, A special case of the bound corresponds to discrimination given n copies of |ψ instead of one.
Then, the problem is to discriminate between |φ 1 ⊗n and |φ 2 ⊗n , and Helstrom's bound can be rearranged to give The next step in proving the lower bound is to show that pure state discrimination can be done with sample-based Grover's search as described in Section 4.1.A similar reduction to state discrimination and Helstrom's bound is the key step in [BCWdW01] for proving a lower bound on equality testing.Now, recall that sample-based Grover's search requires copies of a q-qubit input state |s and a unitary U T that defines a target space T , as in Eq. (36).We set |s to the mystery state |ψ .To choose U T , we note that for some w ∈ [0, 1] we can write |φ are known because we have classical descriptions of |φ 1 and |φ 2 .We can therefore choose U T , as defined by Eq. (36), to be a |φ ⊥ 1 -tester by letting T = span{|φ ⊥ 1 }.Now, provided enough copies of |s = |ψ , a sample-based Grover's algorithm with this choice of U T will be able to distinguish between |ψ = |φ 1 (the λ = 0 case) and |ψ = |φ 2 (the λ ≥ w case).Since it solves the pure state discrimination problem for states |φ 1 and |φ 2 , using Eq. ( 40) with | φ 1 |φ 2 | 2 = 1 − w gives us the desired bound.Now we prove the upper bound.Given an oracle marking the target space, we should, by just applying U T to |s |0 and measuring the ancilla bit, on average observe at least one positive event after O(1/w) trials if |s does have some overlap ≥ w with the target space.To boost the probability of success from a constant to requires only a factor of log(1/ ) more attempts.This is, of course, the pure state discrimination analogue of a classical randomized algorithm for unstructured search.
While the classical search algorithm described in the proof is an obvious optimal procedure in sampling complexity, we can also solve sample-based Grover's search with LMR.The ultimate reason to do this is not to give a useful algorithm for sample-based search, but rather to show that LMR is optimal in the number of copies of a pure state ρ that it uses to simulate e −iρt .
Theorem 7. The number of copies of an unknown pure state ρ required for any algorithm to simulate e −iρt to trace norm δ is Ω(1/δ).
Proof.For ρ = |s s| a pure state we have Setting t = π this is e −iπ|s s| = 1 − 2|s s| = R |s s| , the reflection about the start state |s that plays a crucial role in Grover's algorithm [Gro96].
Let us implement sample-based Grover's search using this observation.Since we have unlimited access to U T and U † T , we can implement the reflection about the target space as R T = U † T (1 ⊗ Z)U T without any sampling.Grover's search finds a state |t 0 in the target space T = {|t : times, where the initial probability mass of |s inside the target space is λ.Next, we apply U T to the resulting state and a |0 ancilla and then measure the ancilla to determine whether a state |t 0 within T has been found.This process as stated requires knowing λ to determine the number of times G is to be applied.
But notice, for the sample-based Grover search problem we are promised only that λ = 0 or λ ≥ w, not that we know λ exactly.However, this is not a problem.It has been shown, using either exponentially increasing sequences of Grover iterates [BBHT98] or fixed-point quantum search [YLC14], that performing O(log(1/ )/ √ w) iterates suffices to distinguish the two cases with success probability 1 − .That is, in the λ = 0 case, no target state |t 0 is found, while in the λ ≥ w case such a state is found with probability 1 − .In the remainder of the proof, we will take to be a constant.

Generalized LMR for Simulation of Hermitian Polynomials
The sample-based Hamiltonian simulation of Eq. (3) can be further generalized.Instead of evolution of σ by a single state ρ, the target Hamiltonian H could be encoded by some combination of multiple states ρ 1 , ρ 2 , . . ., ρ K .For example, we might want to implement the map where H = f (ρ 1 , ρ 2 , . . ., ρ K ) is some Hermitian multinomial function of the input states.We will treat this problem fully in this section.
One key tool will be the following lemma, which lets us simulate a Hamiltonian given by the difference of two subnormalized states: where ρ + , ρ − are unknown subnormalized states with Tr ρ + + Tr ρ − = 1.Using n samples of ρ , a quantum algorithm can transform σ AB into σAB such that Proof.We will apply a modification of the LMR protocol-instead of using partial swaps, we will use the following unitary operator: where and S is the usual swap operator.The first qubit in e −iS ∆ essentially controls whether the partial swap on the remaining two systems is applied in the forwards or backwards direction in time.Applying e −iS ∆ to ρ ⊗ σ, we obtain Tracing out the second register (the register originally containing ρ ± ), we obtain the state Now tracing out the first qubit, we obtain the state where we've used Tr ρ + + Tr ρ − = 1 and H = ρ + − ρ − in the first line.Thus with one copy of ρ we can simulate the operation e −iH∆ up to error O(∆ 2 ); by choosing ∆ = δ/t and repeating this for t 2 /δ times, we obtain a simulation of e −iHt up to error O(δ), using O(t 2 /δ) copies of ρ .

Simulating Linear Combinations
In the simplest case where H = ∑ K j=1 c j ρ j is a linear combination of the ρ j , we show: Theorem 9. Let ρ 1 , . . . ,ρ K ∈ D(H A ) and σ AB ∈ D(H A ⊗ H B ) be unknown quantum states, and let c 1 , . . . ,c K ∈ R. Using n samples from the states {ρ 1 , . . ., ρ K }, a quantum algorithm can transform σ AB into σAB such that if n = O(c 2 t 2 /δ) where c := ∑ K j=1 |c j |.Moreover, on average, the number of copies of ρ j consumed is n j = O(|c j |ct 2 /δ). Proof.Define Note that ρ is a valid density matrix, and can be created by sampling the state |0 0| ⊗ ρ j with probability c j /c if c j > 0, and otherwise sampling the state |1 1| ⊗ ρ j with probability −c j /c if c j < 0. (This works for the same reason a mixed state is indistinguishable from the corresponding probabilistic distribution of pure states.)By Lemma 8, since ρ is of the form we can simulate e −iHt = e −i(H/c)(ct) to error δ using O(c 2 t 2 /δ) copies of ρ .On average the state ρ j is sampled We now show that Theorem 9 is tight.
Theorem 10.Let {c 1 , . . ., c K } be a set of K real numbers.Then there exist ρ 1 , . . ., ρ K such that to simulate H = ∑ K j=1 c j ρ j for time t and to error δ in trace norm requires Ω(c 2 t 2 /δ) copies of states in {ρ 1 , . . ., ρ K }, where c := ∑ j |c j |, as long as δ and δ/(ct) are smaller than some constants.
Proof.We first consider the case that ρ j = ρ and c j ≥ 0 for all j.Then we can use Theorem 9 to simulate H = ∑ j c j ρ = cρ for time t to accuracy δ using O(c 2 t 2 /δ) samples of ρ.Comparing with our lower bound in Theorem 5, we find this is optimal.
If we have c j 's such that some c j < 0, without loss of generality, assume that ∑ j:c j >0 c j ≥ ∑ j:c j <0 |c j |.Then if c j > 0, set ρ j = ρ, and if c j < 0, set ρ j equal to the maximally mixed state.This gives where c is some real number.The term proportional to the identity can be dropped (since it only gives an irrelevant phase factor), and ∑ j:c j >0 c j ≥ c/2 by assumption, so simulating H for time t to accuracy δ requires Ω(c 2 t 2 /δ) samples of ρ.

Simulating the Commutator and Anticommutator
In this section, we will show how to simulate a Hamiltonian that is the commutator of two states or the anticommutator H a = {ρ 1 , ρ 2 }; or some linear combination of the two.As we will show in the next section, simulating the commutator H c is useful for orthogonality testing and for adding two unknown pure states.
One approach to simulating H c would be to use the expression [Var84, Corollary 2.12.5] and apply Theorem 1 sequentially for each term in the product.Unfortunately, this leads to an error of O(t), and is incorrect even at the lowest order.We now present an alternate approach using Lemma 8.
Using n samples each of ρ 1 and ρ 2 , a quantum algorithm can transform σ AB into σAB such that Note that choosing φ = 0 we recover the anticommutator Hamiltonian H a /2, and choosing φ = π/2 we recover the commutator Hamiltonian H c /2.
Proof.For simplicity we only consider the case when ρ 1 , ρ 2 , σ ∈ D(H A ); the general case can be straightforwardly tackled as in Appendix B. Our strategy will be to produce a state of the form and then apply Lemma 8. We will use the circuit in Fig. 1 to produce such a state.
Figure 1: The gadget to create a state ρ .Here the controlled-cross-cross gate is a controlled-swap, and the waste bin indicates the partial trace.The H-gate is a single-qubit Hadamard gate (not to be confused with the Hamiltonian) and the measurement is in the Z-basis.
We now analyze Fig. 1.There are a number of dotted lines cutting the figure, and we will write down the state (as a density matrix) at each.First, at I, after applying a controlled swap, After discarding the last register we get Finally, a Hadamard operation to the first qubit gives Now the measurement operator on the first qubit in Fig. 1 denotes dephasing in the standard basis.Namely, we measure in the {|0 0|, |1 1|} basis, and if outcome |0 0| is obtained, replace the qubit state with |0 0|, and if outcome |1 1| is obtained, replace it with |1 1|.Thus at IV, after performing this measurement, we have Notice that ρ IV is a state of the form ρ = |0 0| ⊗ ρ + + |1 1| ⊗ ρ − , where ρ + and ρ − are subnormalized states with Tr ρ + + Tr ρ − = 1, and Applying Lemma 8, we can simulate H using the claimed resources.
It is easy to see that the simulation from Theorem 11 of the anticommutator H a = {ρ 1 , ρ 2 } has optimal scaling in t and δ, because in the qubit case, we can always choose ρ 2 = 1/2 so that H a = ρ 1 and we can apply the lower bound from Theorem 5.It is a little less trivial to show that our simulation of H c = i[ρ 1 , ρ 2 ] is optimal, but we show now that it is.The proof proceeds along similar lines as the optimality proofs in Theorem 5 and Theorem 10.
Theorem 12.To simulate H = i[ρ 1 , ρ 2 ] for time t and to trace norm error δ requires Ω(t 2 /δ) copies each of the states ρ 1 and ρ 2 , as long as δ and δ/t are smaller than some constants.
Proof.First, consider the two states ρ A = ρ(1/2) = 1/2 and ρ B = ρ(1/2 + ), where ρ(x) is from Eq. ( 27) and 0 < ≤ 1/2.By using the commutator simulation, we will identify an unknown state ρ 1 as either ρ A or ρ B with probability 2/3, a task for which Lemma 4 gives a lower bound of C −2 on the sample complexity, for some constant C. Let Simply perform exp([ρ 1 , ρ 2 ]π/(2 )) on |0 through commutator simulation and measure in the Z-basis.The outcome will be |1 if and only if ρ 1 = ρ 2 .The remaining part of the proof, extending to any sufficiently large t and small δ, proceeds exactly as in Theorem 5. Notice that symmetry of the commutator requires that any lower bound proved on the number of copies of ρ 1 also applies to the number of copies of ρ 2 .
While the above proof uses mixed states, it is possible to prove commutator simuation is optimal for pure states as well, by using the lower bound on orthogonality testing we will provide in Section 6.2.

Simulating Hermitian Polynomials in the Input States
It is not hard to show that any Hamiltonian written as a sum of nested commutators (with factors of i) and anticommutators can be expanded into a Hermitian multinomial.In fact, the converse is true too, as we sketch in Appendix E. This motivates us to extend the ideas of Theorem 11 to simulate any Hermitian multinomial in the states ρ 1 , . . ., ρ K , given sample access to these states.
Using n samples from the states {ρ 1 , . . ., ρ K }, a quantum algorithm can transform σ AB into σAB such that if n = O(Lc 2 t 2 /δ) where c := ∑ r∈R |c r | and L := max r∈R |r| is the multinomial degree of H.Moreover, on average, the number of copies of ρ j consumed is n j = O κ j c 2 t 2 /δ where κ j = ∑ r∈R v j (r)|c r |/c, and v j (r) = |{s : r s = j}|.
Proof.We first consider a term H r with r = (1, 2, . . ., k), for some k such that 2 ≤ k ≤ K. (More general r will follow easily from this special case.)Let S k be the cyclic permutation of k copies of H A that acts as follows: S k |j 1 , j 2 , . . ., j k = |j k , j 1 , . . ., j k−1 .In other words, Consider the circuit in Fig. 2. The output of Fig. 2 (we will not go through the details of the calculation as they are very similar to the analysis in Theorem 11) is of the form ρ Figure 2: The gadget to create ρ (r) .Here S k is the permutation of k registers given in Eq. ( 67), and the waste bins indicate the partial trace.The H-gate is a single-qubit Hadamard gate and measurement is in the Z-basis.In [EAO + 02] they use the same circuit, but use the measurement outcomes to perform spectrum estimation.
When we chose ab * = e iφ r /2, we find To apply this analysis to any other r with |r| = k, one can simply supply the appropriate input states ρ j in Fig. 2. Now without loss of generality we can assume c r ≥ 0 for all r, since the sign can be absorbed into the phase φ r .Therefore by sampling from r ∈ R with probability c r /c and creating ρ (r) , we obtain the state By Lemma 8, we can therefore simulate the Hamiltonian for the desired time and precision using O(c 2 t 2 /δ) copies of ρ .Since each copy of ρ requires a sample of a state ρ (r) , and each of these states requires at most L = max r∈R |r| copies of states in {ρ 1 , . . ., ρ K }, we obtain the stated total sample complexity.
To calculate the average number of uses of ρ j , we note that ρ j is used v j (r) times to create the state ρ (r) , and to create the state ρ , the state ρ (r) is chosen with probability |c j |/c.Thus ρ j is used on average κ j = ∑ r∈R v j (r)|c r |/c times to create a single ρ .Then since O(c 2 t 2 /δ) copies of ρ are used in the simulation, we obtain the stated complexity.

Applications of Commutator Simulation
One might wonder if commutator simulation is useful for any quantum information processing task.We describe how commutator simulation can be used to coherently add two pure states, i.e. producing a state proportional to |ψ 1 + |ψ 2 .We also show that commutator simulation can be used to perform orthogonality testing.Recall from Section 4.1 that this is the problem of determining whether two pure states have overlap at least w or are orthogonal.

Coherent state addition
We first give a protocol for coherent state addition: given many copies of unknown pure states |ψ 1 and |ψ 2 , the task is to obtain a state of the form for some a, b ∈ R. Note that the target state is sensitive to the global phases of the two input states-in particular, the relative phase between |ψ 1 and |ψ 2 -which have no physical meaning.
To make the task well-defined, we instead demand the target state to be of the form for some a, b ∈ R, which is unique (up to a global phase) even when the global phases of the two input states have not been specified.Note that we can always recover Eq. ( 72) from Eq. ( 73) by fixing the global phases of the two input states appropriately (i.e.such that ψ 2 |ψ 1 ≥ 0).
Theorem 14.Let |ψ 1 and |ψ 2 be unknown pure states of the same dimension.Promised that the angle between the two states is to trace distance δ using O( χ 2 δ sin 2 2∆ ) copies of |ψ 1 and |ψ 2 , where e iϕ := ψ 2 |ψ 1 /| ψ 2 |ψ 1 | is an unimportant phase factor that can be ignored by appropriately adjusting the global phases of the two states.
Remark 15.Our proof is based on commutator simulation and effectively implements a rotation in the two-dimensional subspace spanned by |ψ 1 and |ψ 2 .Indeed, note from Eq. (74) that |ψ(0) = |ψ 1 and |ψ(∆) = e iϕ |ψ 2 , while intermediate values of χ produce states that interpolate between these two.As a consequence, the target state in Eq. ( 73) has real coefficients a and b.One can also achieve complex coefficients using a more sophisticated Hamiltonian that includes terms proportional to |ψ 1 ψ 1 | and |ψ 2 ψ 2 |, but we do not consider this case here for the sake of simplicity.
Remark 16.If one does not care about the relative phase e iϕ , one can always exchange the two states and replace χ by ∆ − χ, which would improve the complexity by a constant factor when χ > ∆/2.Remark 17.Our protocol requires a very large number of samples when the states |ψ 1 and |ψ 2 have either very small or very large overlap (i.e. in cases when sin 2 2∆ is very small).This is because we use commutator simulation to effectively implement a rotation in the two-dimensional subspace spanned by |ψ 1 and |ψ 2 , and in the special cases when |ψ 1 ⊥ |ψ 2 or |ψ 1 = e iϕ |ψ 2 the commutator vanishes and hence our protocol fails (in the second case the task is trivial though).
Proof.Using ψ 2 |ψ 1 = e iϕ cos ∆, we can write for some unit vector |ψ ⊥ 1 such that ψ 1 |ψ ⊥ 1 = 0. Then the commutator of two non-orthogonal pure states acts as a Hamiltonian that induces a rotation in the two-dimensional subspace spanned by the states.In particular, where Y |ψ ,|ψ ⊥ acts as the Pauli Y matrix in the two-dimensional subspace spanned by orthonormal states |ψ and |ψ ⊥ .If Y is the 2 × 2 Pauli matrix then e iχY = cos χ 1 which is the desired state |ψ(χ) (we substituted |ψ ⊥ 1 = (cos ∆|ψ 1 − e iϕ |ψ 2 )/ sin ∆ from Eq. (75) to get the last line).To prepare this state, we can apply exp iχY |ψ 1 ,|ψ ⊥ 1 to |ψ 1 using the commutator simulation algorithm: we evolve with According to Theorem 11, this requires O(t 2 /δ) copies of each state, so the total sample complexity is O( χ 2 δ sin 2 2∆ ).Interestingly, by choosing χ = ∆/2 in Eq. (74) it is possible to coherently add two states, i.e. create a state proportional to |ψ 1 + |ψ 2 (we are ignoring the relative phase between the two states).However, to determine ∆ one needs to estimate the inner product between the two states, which can be done by running phase estimation on the commutator.

Orthogonality Testing
We now give a method for testing the orthogonality of two unknown pure states.
Theorem 18.Let |ψ 1 and |ψ 2 be unknown pure states of the same dimension.Promised that either Proof.For the upper bound, let ∆ := arccos| ψ 1 |ψ 2 |.From Eq. (79), we see that if the states are non-orthogonal, commutator simulation generates a rotation; whereas if the states are orthogonal, i.e. ∆ = π/2, commutator simulation performs only the identity.However, we have to be careful, because identical states correspond to ∆ = 0 which also results in a trivial commutator.Therefore we consider the modified states | ψ 1 := |ψ 1 |0 and | ψ 2 := |ψ 2 |+ , which can never have overlap greater than 1/2.(We do this by appending the states |0 and |+ to the sampled states.)If we let 2 and the evolution with the commutator of | ψ 1 and | ψ 2 for time t = 1 generates the unitary Phase estimation on U to precision Ω(1/ √ w) suffices to distinguish between λ = 0 and λ ≥ w, and thus solves orthogonality testing.Similarly to Appendix D, phase estimation with constant probability of success requires O(1/ √ w) applications of U, each implemented to error O( √ w); this uses O(1/( √ w) 2 ) = O(1/w) samples.To succeed with probability 1 − we can repeat O(log(1/ )) times, giving a total sample complexity of O(log(1/ )/w). 2or the lower bound, first notice that sample-based Grover search (see Section 4.2) reduces to orthogonality testing in the following way.Since in sample-based Grover search we do not count uses of U, we may therefore perform tomography on U to learn, to arbitrary accuracy, a complete orthogonal basis {|t 1 , |t 2 , . . ., |t k } for the target space T. Now perform orthogonality testing between |s and each of the |t j to determine whether |s has overlap λ j = | s|t j | 2 at least w/k.If the total probability mass of |s inside the target space is at least w, then this must be true for some |t j .Treating factors of k as constant, this implies Ω(log(1/ )/w) copies of |s (and |t j by symmetry) must be required for orthogonality testing with success probability 1 − , so as not to break the sample-based Grover search lower bound of Lemma 6.

Universality of LMR
In many solid state implementations of quantum computers such as quantum dots [LD98], donorpairs [Kan98], and electron spins [VYW + 00], the Heisenberg exchange is the natural coupling interaction between qubits.Up to an overall scaling, the Heisenberg interaction is the same as the swap interaction used in the LMR protocol.
The Heisenberg interaction between qubits i and j is given by where X i , Y i , and Z i are the Pauli matrices acting on qubit i.In these solid state systems, the Heisenberg interaction can be turned on and off for different pairs of qubits for any desired length of time.
The operations induced by the Heisenberg interactions in these systems are fast and reliable.While it is beneficial to create computing models that take advantage of this Heisenberg exchange interaction, the Heisenberg interaction is not universal for spin-1/2 systems [BBC + 95].Several schemes have overcome this limitation by using encoded logical qubits and decoherence-freesubsystems [DBK + 00, Lev02].
In this section, we show how to use the LMR protocol to design a universal model for quantum computation that does not use encoded qubits, but which requires only the Heisenberg interaction, as well as the ability to prepare the states |0 and |+ on a single qubit.Our scheme thus requires n + 1 physical qubits to perform computations on n qubits, in contrast to encoded schemes, of which the simplest require 2 or 3 times the number of physical qubits [DBK + 00, Lev02].Furthermore, there has been much research in the field of quantum dots on how to quickly and reliably prepare a fixed qubit state, e.g.[CV10, FPMU03, HVvB + 04, RSL00].These schemes could be applied to produce the single qubit states needed for our protocol.
We consider a connectivity graph of the qubits as in Fig. 3 (different connectivity graphs lead to different scalings depending on which costs you would like to optimize).We assume exchange interactions can be applied between connected qubits in the form of unitaries exp(−itH ij ) for arbitrary t.The qubit q * is where the states |0 and |+ are prepared.q * q 0 q 1 q 2 q 3 q 4 q 5 q 6 Figure 3: Connectivity graph for qubits in our model.Each circle represents a qubit.Qubits connected by a solid line can have the Heisenberg interaction applied between them.The qubit q * can be prepared in the state |0 or |+ .
Recall that arbitrary single qubit gates combined with any entangling two-qubit gate is sufficient for universal quantum computation [BDD + 02].Since we do not have encoded qubits, the exchange interaction itself immediately gives us an entangling gate.Now for universal quantum computation we need to show how to perform arbitrary single qubit gates.
Let X φ denote the unitary operation cos(φ/2)1 and let Z θ denote the unitary operation cos(φ/2)1 Then any single qubit rotation can be written as X φ Z θ X ξ for angles φ, θ, and ξ.Therefore, it is sufficient to show how to perform X and Z rotations.If qubit i needs to have a single qubit gate performed on it, using the Heisenberg interaction, we use swap gates to move that qubit to position 0. We now show how to perform a Z φ and X θ on the qubit in position 0. Using LMR, given n copies of the state |0 input at qubit q * , using only partial swap operations on qubits q 0 and q * , (i.e.applying the Heisenberg interaction between qubits q 0 and q * ) we can apply the unitary exp(−iφ|0 0|) = Z φ (87) to accuracy O n −1 .Likewise, using the LMR protocol, given n copies of the state |+ , using only partial swap interactions between qubits q 0 and q * , we can apply the unitary exp(−iθ|+ +|) = X θ (88) to accuracy O n −1 .
To apply an arbitrary single qubit rotation to accuracy , we need O( −1 ) resource states |0 and |+ (this construction is reminiscent of ideas in [MM08]).Suppose that over the course of an algorithm, one must apply M single qubit gates and M CNOT gates.We note that to apply a CNOT gate requires a constant number of single qubits gates as well as a constant number of partial swap gates [BDD + 02].Then to bound the error over the course of the algorithm, we require accuracy of O((M + M ) −1 ) for each single qubit gate.Therefore, we require O((M + M ) 2 ) resource states |0 and |+ in total.Additionally, using the connectivity graph of Fig. 3, to move qubits into proximity with one another to perform any single or two qubit gate requires O(N) swap operations operations, where N is the number of qubits.This results in a total number of operations that scales as O(N(M + M ) 2 ).
We note that the states |0 and |+ need not be prepared perfectly for our protocol to work.For example, if we have slightly depolarized versions of these states, we would simply need to increase the number of rounds in the LMR protocol by a constant factor.In fact, two arbitrary states (other than |0 and |+ ) could be used, as long as they are not diagonal in the same basis, and as long as the states themselves are well characterized.
Our model produces a polynomial (in particular squared) blow-up in the number of operations, which still allows for universal quantum computation.However, with such a model, it would be impossible to obtain a speed-up for problems such as Grover's search.We hope it is a useful model for systems where the Heisenberg exchange is a natural operation.It may even be useful in non-solid state systems such as cold, trapped atoms, where it was shown that partial swaps could be implemented using Rydberg interactions or through coupling to a cavity [PZS + 16].

Outlook
We have shown that the LMR protocol is optimal for the problem of simulating unknown Hamiltonians encoded as quantum states.Moreover, the protocol and its generalizations also turn out to be optimal for a variety of other tasks, such as discriminating between pure states and Hamiltonian evolution under the commutators of unknown states.We hope that this study will motivate the discovery of other possible applications of this versatile protocol.
We have not shown the optimality of our protocol for simulating the evolution by the multinomials in Eq. ( 42).It would be interesting to investigate whether it is optimal, or whether better algorithms can be found.
Another interesting aspect is the role of ancilla qubits in our protocol.While the original LMR protocol for Hamiltonian simulation is based on partial swaps and hence does not require ancilla qubits, the use of ancillas seems to be essential in our more general simulation protocol (see Fig. 2).We wonder whether the use of ancillas is necessary in our protocol, or for example, whether it can be implemented using the continuous permutations introduced in [Ozo15].These continuous permutations generalize the partial swap operation and do not require ancillas.
Another possible direction is to investigate distributed versions of our protocols in the context of multiparty communication.[HL11] consider a protocol for simulating distributed unitaries over multiple remote parties using shared entanglement and a limited amount of quantum communication, and the techniques they use are reminiscent to those of the LMR protocol.It would be interesting to investigate connections of [HL11] with the protocols in our work.
Finally, the LMR protocol can be seen as allowing the encoding of the operation e −iρt into multiple copies of a quantum state ρ.As discussed in Section 2, having access to O(t 2 /δ) copies of ρ allows a user to perform the operation e −iρt , but may be insufficient for the user to determine what ρ is through tomography.It is an intriguing question whether other quantum operations could be encoded into states in this way, so that a user could perform the quantum operation but learn little else about what operation is being performed.This could be seen as a form of quantum copy-protection [Aar09].See [ML16] for some progress in this direction, and [AF16] for negative results when the encoding is required to be a circuit and not a state.was visiting the University of Maryland and MIT, so he would like to thank both institutions for their hospitality.
The state after the first iteration of the above procedure can be explicitly written as where the partial trace can be computed using graphical notation [WBC15], and the last line was obtained using the Taylor expansion at = 0. Note that the difference in trace distance between the ideal state Eq. ( 94) and our first approximation Eq. ( 97) is AB denotes the state in Eq. ( 97)), we get the following recursion from Eq. (97): By evaluating this recursively, the final state can be expressed as for any m ∈ {0, . . ., n}.In particular, for m = n we get σ Choosing = t/n in Eq. ( 103) and comparing this with the desired final state at time t (given by Eq. (94), with t instead of ), we see that Thus, if we desire a final trace distance accuracy of δ and want to implement the Hamiltonian ρ for a time t, the LMR protocol uses n = O(t 2 /δ) copies of ρ.

C Controlled Density Matrix Exponentiation
As we have seen in Theorem 1, given an input state σ and O(t 2 /δ) copies of another state ρ, the LMR protocol allows us to obtain the output state e −iρt σe iρt .In many applications (see e.g.[LMR14,Wan14]) we would like to perform phase estimation on the operator e −iρt , which requires the ability to apply the controlled-e −iρt operation.However it is not immediately obvious to see that the controlled-e −iρt operation can be performed with the LMR protocol: since the LMR protocol involves the discarding (tracing out) of quantum registers, it could very well lose any coherence between the |0 and |1 components of the control qubit.Nevertheless, we show this is not the case: Theorem 20.Given O(t 2 /δ) copies of an unknown quantum state ρ ∈ D(H A ), the controlled-e −iρt operation, |0 0| ⊗ 1 A + |1 1| ⊗ e −iρt , can be performed up to error δ in diamond norm.
Proof.We will give two ways of deriving this result.The first simplest method is to realize that and so to simulate the controlled-e −iρt operator, we can simply use the |1 1| ⊗ ρ as the input states to the LMR protocol instead.Alternatively, the naive method of replacing the partial swaps in the LMR protocol by controlled versions also works.To see this, let us consider starting from the initial state Σ = (a|0 + b|1 )(a * 0| + b * |1 ) ⊗ σ ⊗ ρ, applying the controlled-partial swap |0 0| ⊗ 1 + |1 1| ⊗ e −iS∆ .The result is where we've used the shorthand s ≡ sin ∆ and c ≡ cos ∆.
where Tr 3 Σ refers to the initial state with the third register (ρ) traced out.This shows that we can use one copy of ρ to implement the controlled-e −iρ∆ operation up to error O(∆ 2 ).Therefore similar to the discussion in the proof of Theorem 1, by choosing ∆ = δ/t and repeating this procedure O(t 2 /δ) times, we implement the controlled-e −iρt operator up to error O(δ).

D Better Phase Estimation
We note that principal component analysis can be performed using fewer samples than what is claimed in [LMR14].
Corollary 22. Kitaev's phase estimation on the unitary U = e −iρ can be performed to precision and constant failure probability, using O(1/ 2 ) samples.
In [LMR14], they state that this phase estimation requires O(1/ 3 ) samples, so this is a polynomial improvement.
Proof.Notice that to estimate an eigenvalue of U to precision using standard Kitaev's phase estimation requires O(1/ ) uses of controlled-U.Then as long as the simulation of controlled-U does not change the resulting state by trace distance more than O( ), the total error in trace distance of the final state will be O(1).Using the LMR protocol, we can simulate e −iρ to precision O( ) a total of O(1/ ) times, giving a sample complexity of O(1/ 2 ).

E Equivalence of Hermitian polynomials and the Jordan-Lie algebra
In this appendix we sketch that any Hamiltonian that is a Hermitian multinomial in ρ 1 , ρ 2 , . . ., ρ k can be created from sums of nested commutators (multiplied by i) and anticommutators of density matrices (i.e. is in the Jordan-Lie algebra [Emc84] generated by the states), and vice versa.
First, we begin by noticing that for z ∈ C, . . .
So we can write all monomials as sums of nested commutators (but no i) and anticommutators.Now we just want to show that a monomial plus its Hermitian conjugate can be written as a sum of nested commutators (with i) and anticommutators.This is possible by noticing that the Hermitian conjugate of an expression of nested commutators and anticommutators (e.g.With this fact we can treat all Hermitian polynomials.We replace each monomial by a nested expression of commutators and anticommuatators, and then group together terms with the same parity c of commutators.The terms with even c will contribute to the real part of z while the terms with odd c will contribute to the imaginary part of z (we also need to introduce extra minus signs in front of nested commutators when c is equal to 2 or 3 modulo 4).For instance, we can rewrite the degree-3 monomial plus its Hermitian conjugate as follows: [[{[A, B], C}, D], E]) of Hermitian matrices (e.g.A, B, C, D, E) is equal to that same expression with a (−1) c sign, where c is the number of commutators (alternatively, the number of "[" symbols) in the expression (e.g.[[{[A, B], C}, D], E] † = (−1) 3 [[{[A, B], C}, D], E]).
Now since O(1/ √ w) applications of R |s s| are required, and we would like the entire algorithm to succeed with constant error, we need each application of R |s s| to have at most √ w error.So if n δ copies of ρ are required to simulate R |s s| to trace norm accuracy δ, then n √ w • 1/√w copies are required in total for the entire algorithm.Notice n √ w • 1/ √ w = Ω(1/w) by the lower bound in Lemma 6.Thus n δ = Ω(1/δ) as advertised.