Fragmented imaginary-time evolution for early-stage quantum signal processors

Simulating quantum imaginary-time evolution (QITE) is a significant promise of quantum computation. However, the known algorithms are either probabilistic (repeat until success) with unpractically small success probabilities or coherent (quantum amplitude amplification) with circuit depths and ancillary-qubit numbers unrealistically large in the mid-term. Our main contribution is a new generation of deterministic, high-precision QITE algorithms that are significantly more amenable experimentally. A surprisingly simple idea is behind them: partitioning the evolution into a sequence of fragments that are run probabilistically. It causes a considerable reduction in wasted circuit depth every time a run fails. Remarkably, the resulting overall runtime is asymptotically better than in coherent approaches, and the hardware requirements are even milder than in probabilistic ones. Our findings are especially relevant for the early fault-tolerance stages of quantum hardware.

Quantum Gibbs states can be approximated by quantum Metropolis Markov-chains [8,9] or by variational circuits trained to minimise the free energy [16], e.g.However, the former involve deep and complex circuits, whereas the latter are highly limited by the variational Ansatz.In turn, heuristic QITE algorithms for ground-state optimisations exist [1][2][3][4][5][32][33][34].There, one simulates pure-state QITE with a unitary circuit that depends on the input state, the Hamitonian, and β.For small-β steps, one can determine the circuits by measurements on the input state at each step and classical post-processing.One possibility is to optimise a variational circuit on the measured data [1], but this is again limited by the expressivity of the Ansatz.Another possibility is to invert a linear system generated from the measurements [2][3][4][5], but the size of such system (as well as the number of measurements required) is exponential in the number of qubits, unless restrictive locality assumptions are made.
The most general, guaranteed-precision QITE algorithms are based on unitary circuits followed by ancillary-qubit post-selection [6,11,14,15,18].These circuits -to which we refer as QITE primitives -are efficient in β as well as in the target precision.However, due to the intrinsically probabilistic post-selection, they must be applied multiple timesby what we refer to as master QITE algorithms -to obtain a deterministic output.Repeat-until-success master algorithms apply the primitive in parallel (i.e. in independent probabilistic runs), thereby not inducing any increase in circuit depth.However their overall complexity is inversely proportional to the post-selection probability.Instead, coherent master al- of q gates, with [q] := {1, . . .q}, is applied and then the ancillas are measured.Each gate makes one query to the Hamiltonian oracle (not shown).The specific choice of gates in the sequence is determined by a powerful formalism for operator-function design, known as quantum signal processing, which allows us to synthesise the QITE propagator F β (H) := e −β(H−λ min ) efficiently.More precisely, conditioned on detecting |0 on the ancillas, the desired state is output up to controllable error.We refer to the circuit generating U F β (H) as a QITE primitive.We develop two QITE primitives, one for each oracle model.Both primitives feature excellent scalings of q with β and the error.b) Master QITE algorithms: The post-selection probability -given approximately by F β (H)|Ψ 2 -can decrease with β very fast.Hence, for high β, probabilistic approaches based on repeat-until-success fail for the vast majority of trials, thereby producing an enormous waste in runtime.In turn, coherent approaches based on quantum amplitude amplification provide a close-to-quadratic runtime speed-up, but at the expense of enormous circuit depths, among other disadvantages.In contrast, we introduce a master algorithm that concatenates r QITE fragments of inverse temperatures {∆β l } l∈[r] , with l∈[r] ∆β l = β and β l < β for all l ∈ [r].Each fragment is successively run probabilistically and has both a success probability significantly higher and a query complexity significantly lower than that of the entire evolution run at once.Indeed, the ancillary measurements in b) detect potential errors in the operator synthesis after each fragment, instead of just once at the end in a).This ends up yielding an enormous saving in overall runtime (even beating coherent approaches for high β) while at the same time preserving all the practical advantages of probabilistic approaches for experimental implementations.
Here, we introduce two efficient QITE primitives as well as a practical master QITE algorithm (see Fig. 1); and prove a universal lower bound for the complexity of QITE primitives.The first primitive is designed for Hamiltonians given in the well-known block-encoding oracle model, whereas the second one for a simplified model of real-time evolution oracles involving a single time.Both primitives feature excellent query complexity (number of oracle calls) and ancillary-qubit overhead.In fact, for the first primitive the complexity is sub-additive in β and log(ε −1 ), with ε the tolerated error.This scaling saturates our universal bound when β log(ε −1 ).Hence Primitive 1 is optimal in that regime, which, interestingly, turns out crucial for our master algorithm.In contrast, Primitive 2's complexity is multiplicative in β and log(ε −1 ), but it requires a single ancilla throughout and its oracle significantly fewer gates.This is appealing for intermediate-scale quantum hardware.In turn, our master QITE algorithm breaks the evolution into small-β fragments and runs each fragment's primitive probabilistically.Surprisingly, this yields an overall runtime competitive with -and, in the relevant regime of high β, even better than -that of coherent approaches while, at the same time, preserving all the advantages of probabilistic ones for experimental feasibility.
More technically, our primitives are based on the quantum signal processing (QSP) framework [14,15,36,37], to which we also make two contributions relevant on their own.The first one is a generic bound on the approximation error of Hermitian-operator functions by their truncated Chebyshev series, for any analytical real function.We apply this to the Jacobi-Anger expansion of the exponential function [38], on which we base our first primitive.This expansion has been used to obtain optimal-scaling algorithms for real-time evolution (RTE) [15,36,37].Moreover, it has also been used for partition-function estimation [18].However, there, the different terms in the expansion are synthesized separately and their coefficients are then used to classically combine the measurement results.Our first primitive directly synthesizes the whole Jacobi-Anger expansion of the QITE propagator.Our second contribution (see also [39]) is a Fourier approximation of analytical real functions of Hermitian operators for RTE oracles, on which we base our second primitive.This is a novel QSP variant superior to previous Fourierbased approaches [14] in that it requires a single ancilla, instead of multiple ones.This variant does not require qubitization [37].In turn, the universal runtime bound we prove can be seen as an imaginary-time counterpart of the no fastforwarding theorem for RTE [40][41][42].
Finally, the complexity of our master algorithm depends on the fragmentation schedule, i.e. number r of fragments and their relative sizes.On one hand, for Primitive 1, we rigorously prove that, from a critical inverse temperature β c = O(2 N/2 N ) on, the runtime is lower than that with coherent QITE.This is shown by explicitly constructing schedules with only r = 2 fragments that do the job, remarkably.On the other hand, that fragmented QITE outperforms coherent QITE is also observed for both primitives through extensive numerical evidence.More precisely, we study the overall runtime as a function of β and ε, up to N = 15 qubits, and for numerically-optimized schedules.These experiments involve random instances of Hamiltonians encoding four computationally hard classes of problems: Ising models associated to the i) MaxCut and ii) weighted Max-Cut problems [20][21][22]; iii) restricted quantum Boltzmann machines (transverse-field Ising models) [30,31]; and iv) a quantum generalization (fully-connected Heisenberg models) of the Sherrington-Kirkpatrick model [43,44] for spin glasses.We see a clear trend whereby, from β c = O 2 N/2 on, fragmentation outperforms coherent QITE for both primitives, for an optimal number of fragments r 6.The obtained values for β c imply that our algorithm outperforms coherent QITE in the computationally hardest range of β, particularly relevant for Hamiltonians with an exponentially small spectral gap [45,46].Moreover, impressively, such advantages are attained at no cost in circuit depth or number of ancillas, which are identical to those of probabilistic QITE.
The paper is organised as follows.In Sec.II we introduce basic definitions and notation.In Sec.III we present our main results: In Sec.III A, we state two theorems giving the complexities of the primitives.In Sec.III B, we prove a universal lower-bound on the complexity of generic QITE primitives.In Sec.III C, we present our master algorithm and prove its correctness and complexity.In Sec.III D, we present exhaustive numerical results showing the advantages of fragmentation over quantum amplitude amplification for quantum Gibbs-state sampling.Finally, in Sec.IV, we give our concluding remarks and outlook.The technical details of our primitives and theorem proofs are left for the Methods, in Secs.V and VI, where we also review the basics of QSP.

II. PRELIMINARIES
We consider an N -qubit system S, of Hilbert space H S .QITE with respect to a Hamiltonian H on H S and over an imaginary time −i β is represented by the non-unitary operator e −βH .This can be simulated via post-selection with a unitary operator U that encodes e −βH in one of its matrix blocks [6,11,14,15,18].We denote by H A the Hilbert space of an ancillary register A, by H SA := H S ⊗ H A the joint Hilbert space of S and A, and by A the spectral norm of an operator A. The following formalizes the encoding.Definition 1. (Block encodings).For sub-normalization 0 ≤ α ≤ 1 and tolerated error ε > 0, a unitary operator U A on H SA is an (ε, α)-block-encoding of a linear operator A on H S if α A − 0| U A |0 ≤ ε, for some |0 ∈ H A .For ε = 0 and (ε, α) = (0, 1) we use the short-hand terms perfect α-block-encoding and perfect block-encoding, respectively.E.g., if U A is a perfect α-block-encoding of A, measuring |0 ∈ H A on U A |Ψ |0 ∈ H SA , for any |Ψ ∈ H S , leaves S in the state A|Ψ A|Ψ .The probability of that outcome is α 2 A|Ψ 2 .Note that, since U A = 1, a perfect α-blockencoding is possible only if α A ≤ 1.Hence, α allows one to encode matrices even if their norm is greater than 1.Typically, however, one wishes α as high as possible, to avoid unnecessary reductions in post-selection probability.
Our algorithms admit two types of oracle as input.The first one is based on perfect block-encodings of H and therefore requires H ≤ 1.If H > 1, however, the required normalisation can be enforced by a simple spectrum rescaling.More precisely, for λ − and λ + arbitrary lower and upper bounds, respectively, to the minimal and maximal eigenvalues of H, λ min and λ max , the rescaled Hamiltonian H := H− λ1 ∆λ fulfils H ≤ 1 by construction, with the short-hand notation λ := λ++λ− 2 and ∆λ := λ+−λ− 2 .Then, by correspondingly rescaling the inverse temperature as β := ∆λ β, one obtains the propagator e −β H , which induces the same physical transformation as e −βH .Hence, from now on, without loss of generality we assume throughout that H ≤ 1, i.e. that −1 ≤ λ min ≤ λ max ≤ 1.
We are now in a good position to define our first oracle, O 1 , which is the basis of our first primitive, P 1 .We denote by A 1 the entire ancillary register needed for P 1 and by A O1 ⊂ A 1 the specific ancillary qubits required to implement O 1 .
Definition 2. (Block-encoding Hamiltonian oracles).We refer as a block-encoding oracle for a Hamiltonian H on H S to a controlled unitary operator , where 1 is the identity operator on H S , {|0 , |1 } a computational basis for the control qubit, and U H a perfect block encoding of H.This is a powerful oracle paradigm used both in QITE [11,14,15,18] and real-time evolution [15,36,37,47].It encompasses, e.g., Hamiltonians given by linear combinations of unitaries, d-sparse Hamiltonians (i.e. with at most d non-null matrix entries per row), and Hamiltonians given by states [37].Its complexity depends on H, but highly efficient implementations are known.E.g., for H a linear combination of m unitaries, each one requiring at most c two-qubit gates, O 1 can be implemented with |A O1 | = O(log 2 m) ancillary qubits and gate complexity (i.e. total number of twoqubit gates) g O1 = O m(c + log 2 m) [37,47].
The second oracle model that we consider encodes H through the real-time unitary evolution it generates.

Definition 3. (Real-time evolution Hamiltonian oracle). We refer as a real-time evolution oracle for a Hamiltonian
This is a simplified version of the models of [6,14], e.g.There, controlled real-time evolutions at multiple times are required, thus involving multiple ancillas.In contrast, O 2 involves a single real time, so the ancillary register A O2 consists of |A O2 | = 1 single qubit (the control).In fact, we show below that no other ancilla is needed for our second primitive, P 2 , i.e.A 2 = A O2 .This is advantageous for near-term implementations.There, one may for instance apply product formulae [48,49] to implement O 2 with gate complexities g O2 that, for intermediate-scale systems, can be considerably smaller than for O 1 (see App. A).Furthermore, this oracle is also relevant to hybrid analogue-digital platforms, for which QSP schemes have already been studied [50].
QITE algorithms based on post-selection rely on a unitary quantum circuit to simulate a block encoding of the QITE propagator.We refer to such circuits as QITE primitives.
Note that P is Hamiltonian agnostic, i.e. it admits any H provided it is properly encoded in the corresponding oracle.The factor e −βλmin implies that F β (H) = 1, thus maximizing the post-selection probability.However, if λ min is unknown, one can replace it by a suitable lower bound λ − ≥ −1.This introduces only a constant sub-normalisation.In turn, the query complexity is the gold-standard figure of merit for efficiency of oracle-based algorithms.It quantifies the runtime of P relative to that of an oracle query.In fact, P is time-efficient if its query complexity and gate complexity per query g P are both in O poly(N, β, 1/ε , α) .
Importantly, normalisation causes the post-selection probability p Ψ (β, ε , α) of P (on an input state |Ψ ) to propagate onto the error ε in the output state, making the latter in general greater than ε .The exact dependance of In turn, the primitives must be incorporated into master procedures to attain deterministic QITE.We call these master QITE algorithms.
up to trace-distance error ε with unit probability.Its overall query complexity Q(β, ε) is the sum over the query complexities of each P β ,ε ,α applied.
Until now, two variants of master QITE algorithms had been reported, probabilistic and coherent.The former leverage repeat-until-success: perform O(1/p Ψ (β, ε , α)) trials of P β,ε ,α (on independent systems) until getting the desired output.In contrast, the latter are based on quantum amplitude amplification [35].There, P β,ε ,α is incorporated into a unitary amplification engine that is sequentially applied (on the same system) O 1/p Ψ (β, ε , α) times.Conveniently, 1/p Ψ (β, ε , α) is well-approximated by 1/p Ψ (β, α) for ε 1 (see App. C).Hence, the overall complexity of both variants is given by the unified expression where κ = prob or coh for probabilistic or coherent schemes, respectively, Since p Ψ (β, α) can decrease with N exponentially, the quadratic advantage in 1/p Ψ (β, α) of coherent approaches is highly significant.However, while probabilistic schemes make no assumption on the input-state source, coherent ones require a unitary oracle that prepares |Ψ from a reference state.Most importantly, coherent algorithms have a circuit depth O( 1/p Ψ (β, α)) times greater than in probabilistic ones and require O(N ) extra ancillas.This makes coherent schemes impractical for intermediate-scale quantum devices.

III. RESULTS
We first discuss the primitives, then the universal complexity lower bound, and the master algorithm at last.

A. Quantum imaginary-time evolution primitives
We introduce two QITE primitives.Both of them possess the basic structure shown in Fig. 1 a), where a sequence of gates {V k } ∈[q] , with [q] := {1, 2, . . .q}, generates an approximate block-encoding of the QITE propagator F β (H).The approximation consists of truncating an expansion of the exponential function at finite order q.The gates are determined by such expansion using quantum signal processing [14,15,36,37] (see Sec. V).Conceptually, the two primitives differ in the kind of expansion and the type of oracle.Their circuit descriptions are given in the Methods.
Primitive 1 is the circuit output by Algorithm 2, in Sec.V B, on inputs f (λ) = e −β(λ−λmin) , with β ≥ 0, ε ≥ 0, a block-encoding oracle O 1 , and its inverse O † 1 .Recall that |A O1 | is the ancillary-register size and g O1 is the gate complexity of O 1 .In Sec.V B 1, we prove the following.
Theorem 1. (QITE primitive for block-encoding oracles).Circuit P 1 constructed by Alg. 2 for f (λ) = e −β(λ−λmin) is a (β, ε , 1)-QITE-primitive with query complexity |A 1 | = |A O1 | + 1 total ancillary qubits, and gate complexity per query The highlights of Eq. ( 2) are its sub-logarithmic dependence on ε and its sub-additivity in β and ln(1/ε ).We note that a QITE primitive from [15] works for the same oracle model and has complexity upper-bounded by O 2 max[e 2 β, ln(2/ε )] ln(4/ε ) .Even though that is asymptotically better in β than the bound in Eq. ( 2), it underperforms it for all β 8 ln(4/ε ).On the other hand, the complexity obtained in Ref. [18] lacks the denominator in the second term of Eq. ( 2), also making the former worse than the latter for low β or high ln(1/ε ).While both the scalings of Refs.[15,18] tend to O ln(1/ε ) for β → 0, Eq. ( 2) tends to zero.In fact, in Sec.III B we show that the scaling in Eq. ( 2) becomes optimal as β decreases relative to ln(1/ε ).We stress that the latter regime is crucial for the master algorithm of Sec.III C, whose first fragments require precisely low inverse temperatures and high precisions.In turn, in the opposite regime of high β, preliminary numerical observations [51] suggest that the asymptotic scaling of the exact value of q 1 could actually be as good as q 1 (β, ε ) = O β ln(1/ε ) , i.e. similar to that from [15].
Primitive 1 is based on the Jacobi-Anger expansion [38].This immediately gives a truncated Chebyshev-polynomial series [52,53] for F β (H), which can be synthesized with quantum signal processing (see Sec. V).In Refs.[15,18,36,37,42], the truncation error by this expansion was bounded [42] using properties of the Bessel functions [54].In contrast, here, we use a generic upper bound (Lemma 9, in Sec.V B) for truncated Chebyshev series of arbitrary Hermitianoperator functions.This gives the same bound for F β (H) as [42] but holds for any analytical real function, and is hence useful for operator-function design in general.
Primitive 2 is the circuit output by Algorithm 3, in Sec.
, and its inverse O † 2 , with γ ≥ 0 a parameter that will determine the subnormalization.In Sec.V C 1, we prove the following.
In contrast to Eq. ( 2), the relation between β and ln(1/ε ) in Eq. ( 3) is multiplicative.However, in return, P 2 requires |A 2 | = 1 ancillary qubit throughout, remarkably.This is a drastic reduction with respect to not only block-encoded oracles but also other oracles based on real-time evolution [6,14].In fact, |A 2 | = 1 is the minimum possible, because, since F β (H) is non-unitary, at least 1 ancilla is needed to block-encode it.Moreover, the scaling of g P2 is optimal too, as it adds only a small, constant number of gates per query to the intrinsic gate complexity g O2 of the oracle.These features make P 2 specially well-suited for near-term devices.
Our algorithms support any λ min ∈ [−1, 1].For P 2 , this is reflected by the sub-normalization factor e −β(1+λmin) , which decreases as λ min departs from −1.In turn, the other factor, e −γ , arises from the Gibbs phenomenon of Fourier series.The theorem holds for all γ ≥ 0, allowing one to trade success probability for query complexity.However, in App.D, we obtain the optimal γ minimising Eq. ( 1), for both probabilistic (κ =prob) and coherent (κ =coh) algorithms.Conveniently, for ε 1, this is well-approximated by Importantly, rather than a peculiarity of P 2 , the favourable scalings of |A 2 | and g P2 are generic features of the type of operator-function design behind it: An optimised Fourierapproximation algorithm for arbitrary analytical real functions of Hermitian operators (see Sec. V C).This is a novel quantum-signal-processing variant interesting on its own.Since it is based based on real-time evolution oracles, it requires no qubitization [37].This explains the scaling of g P2 .In addition, in contrast to other Fourier-based approaches, ours involves a single real time instead of an error-dependent number of them [14].Here, the different frequency terms in the Fourier approximation are generated through the action of O 2 multiple times, instead of using ancillas to make a linear combination of several unitaries, one unitary for each frequency term.This explains the scaling of |A 2 |.
Finally, Theos. 1 and 2 can be straightforwardly extended to the realistic case of approximate oracles: In App.A, we show (for generic analytical operator functions) that it suffices to take the oracle error (deviation from an ideal oracle) as ε O = O(ε /q) to keep the primitive's error in O(ε ).

B. Cooling-speed limits for oracle-based QITE algorithms
The most challenging applications of QITE involve small post-selection probabilities, decreasing exponentially in N in the worst cases.In an effort to reduce the overall complexity [see Eq. ( 1)], this has fueled a long race [6,7,11,12,14,15] to improve q(β, ε ), going from the seminal O β poly(1/ε ) of [6] to the recent O 2 max[e 2 β, ln(2/ε )] ln(4/ε ) of [15] or the additive scaling of Eq. (2).However, to our knowledge, no runtime limit for QITE simulations has been established.This contrasts with RTE, where fundamental runtime lower bounds are given by the "no-fast-forwarding theorem" [40][41][42].These are saturated by optimal RTE algorithms [15,36,37].Here we derive an analogous bound for imaginary time, which we call cooling-speed limit in allusion to the use of QITE to cool systems down to their ground state.
More precisely, in Sec.VI, we prove a universal efficiency limit for QITE primitives based on block-encoded oracles.This is convenient as it directly applies to our primitive with lowest query complexity, i.e.P 1 .
Theorem 3 (Imaginary-time no-fast-forwarding theorem).Let β > 0 and 0 < ε < α/2.Then, any (β, ε , α)-QITEprimitive querying block-encoding Hamiltonian oracles has query complexity at least q min (β, ε , α) ≥ q, where q ∈ R >0 is the unique solution to the equation Even though the bound is only given implicitly, interesting conclusions can readily be drawn.First, for any fixed β, the left-hand-side of Eq. ( 5) decreases monotonically with q (therefore the uniqueness of the solution).Second, for any fixed ε and α, q grows monotonically with β.Third, and most important, Eq. ( 5) is approximated by β 8 q 2q = 2 ε /α for β q, as a Taylor expansion shows.The latter equation has a known explicit solution [15], which, for α = 1, is given by Eq. (2).Hence, for β/q → 0, Eq. (2) tends to the optimal scaling.Note that β q is equivalent to the first term Eq. (2) being much smaller than its second term, which in turn implies that ε should be exponentially small in β.Thus, P 1 is close to optimal for small inverse temperatures or high precisions.Interestingly, this is the regime at which the first fragments of our master algorithm operate, as we see next.

C. Fragmented master QITE algorithm
Our master algorithm relies on the basic identity F β (H) = r l=1 F ∆β l (H) to partition the evolution into r ∈ N fragments of inverse temperatures S r := {∆β l > 0} l∈[r] , such that l∈[r] ∆β l = β.We refer to S r as the fragmentation schedule.For each l, the algorithm repeats until success a (∆β l , ε l , α l )-QITE-primitive P ∆β l ,ε l ,α l on the output state |Ψ l−1 of the (l − 1)-th step, with ε l given in Eq (6).That is, if the post-selection succeeds, its output state |Ψ l is input into the (l + 1)-th fragment.Else, the algorithm starts all over from the first fragment on |Ψ 0 := |Ψ , until |Ψ l−1 is prepared and the l-th fragment can be run again.Alternatively, the measurement on A after each fragment can be seen as monitoring that the correct block of U F ∆β l (H) is applied on each |Ψ l−1 , in contrast to the single error detection after U F β (H) in the probabilistic master algorithm (see Fig. 1).Note that the total number of trials (i.e.preparations of |Ψ ) coincides with the number of repetitions of the first fragment.The following pseudocode summarizes all the algorithm: for all l ∈ [r], Algorithm 1 is a β, O(ε) -master-QITEalgorithm for H on |Ψ with average query complexity where is the average number of times that P ∆β l ,ε l ,α l is run, with β 0 := 0, β l := l k=1 ∆β k for all l ∈ [r], and p Ψ (β l−1 ) := F β l−1 (H)|Ψ 2 for all l ∈ [r+1].The overall complexity of the probabilistic master algorithm is dominated by the area of the yellow rectangle.In contrast, the corresponding complexity of the fragmented algorithm (here, for the exemplary case of r = 3 fragments) is dominated by the area of the blue-shaded rectangles.Up to logarithmic corrections in the precision, the cumulative width of the blue-shaded rectangles coincides with the width of the yellow one, of order β.In contrast, while the height of the yellow rectangle is of order 1/pΨ(β), the height of the blue-shaded ones decreases from order 1/pΨ(β) till order pΨ(βr−1)/pΨ(β), making the blue-shaded area smaller than the yellow one.For high-enough β, the reduction can be so strong that the complexity of the fragmented algorithm can reach even that of the coherent algorithm (not shown).This intuition is rigorously proven for Primitive 1 (in Theorem 5) and numerically verified to exhaustion for both Primitives 1 and 2 (in Sec.III D).
We note that, for Primitive 1, the average total number of trials coincides with that of the probabilistic algorithm: n 1 = 1/p Ψ (β) =: n prob .This is important because the probabilistic algorithm consumes q 1 (β, ε ) queries per trial, successful or not.In contrast, the fragmented one consumes per trial q 1 (∆β 1 , ε 1 ) queries, plus q 1 (∆β 2 , ε 2 ) queries only if the first post-selection succeeds, plus q 1 (∆β 3 , ε 3 ) queries only if the second one succeeds too, and so on.Hence, the total waste in queries is lower with fragmentation.See Fig. 2. The strength of the reduction depends on how fast p Ψ (β l ) (and so n l ) decreases with l; but, in any case, it gets more drastic as β increases.That is, the largest reductions are expected at the hardest regime of p Ψ (β) 1.To maximize the effect, one wishes q 1 (∆β l , ε l ) to decrease with l as fast as possible.Note that Eq. ( 6) implies ε l < ε l+1 , which plays against the latter wish.However, fortunately, q 1 (∆β l , ε l ) grows approximately linearly in ∆β l but sub-logarithmically in 1/ε l .Hence, for sufficiently high β, one can make q 1 (∆β l , ε l ) arbitrarily smaller than q 1 (∆β l+1 , ε l+1 ) by choosing ∆β l sufficiently smaller than ∆β l+1 .
Based on these heuristics, we next prove for Primitive 1 that Alg. 1 can not only outperform the probabilistic algorithm but also -for sufficiently high β -even the coherent one, surprisingly.The proof is constructive: we devise suitable schedules that give the desired advantage for fragmentation.Remarkably, it is enough to consider only r = 2 fragments.The result is valid for any |Ψ and H, under only mild assumptions on the success probability p Ψ as a function of β.We denote the inverse function of p Ψ by p − The proof is given in App.G.The schedules constructed there have the sole purpose of proving the existence of β c in general and are therefore not necessarily optimal for each specific H and |Ψ .For instance, in App.H we study Gibbsstate sampling (i.e. for the maximally-mixed state as input, with o = 2 −N/2 ) for H describing non-interacting particles, where a closed-form expression for p Ψ (β) can be obtained.For this simple case, the theorem yields β c = O 2 N/2 N .However, in Sec.III D we numerically optimize the schedules and obtain β c = O 2 N/2 for hard-to-simulate, interacting systems.The proof in App.G exploits the additive dependence of q 1 on β and the logarithmic term in Eq. (2).Its extension to the multiplicative case of q 2 is left for future work.Nevertheless, here, we do consistently observe an advantage of fragmented QITE over coherent one for P 2 .More precisely, in Sec.III D, we numerically find that also for P 2 does fragmentation outperform coherent-QITE at Gibbsstate sampling, with β c scaling with N as in P 1 but with a somewhat larger pre-factor (which is expectable, as α l < 1 gives an exponential dependance of n l on r that worsens the performance).Either way, that fragmentation can outperform quantum amplitude amplification at all is remarkable, since the latter requires circuits O( 1/p Ψ (β)) times deeper and O(N ) more ancillas than the former.Importantly, our findings would have little practical relevance if β c was unphysically high.Fortunately, β c = O 2 N/2 is in an intermediate regime useful for important applications: E.g., Ground-state cooling (or, more generally, Gibbs-state sampling at low temperatures) requires β scaling inversely proportionally to the spectral gap, which can be exponentially small in N even for relatively simple Hamiltonians such as transverse-field Ising models [45,46].Finally, as already mentioned, P 1 is particularly well-suited for fragmentation.On the one hand, it displays α l = 1 for all l ∈ [r].On the other hand, and most importantly, q 1 becomes optimal as β decreases relative to ln(1/ε ).This is convenient to minimize Eq. ( 7), because the first fragments (specially the first one) operate precisely at low ∆β l and ε l , close to that optimality regime.The latter is verified both analytically for the non-interacting case of App.H and numerically for the examples of Sec.III D in App.I, where we consistently observe that β 1 is typically only a tinny fraction of ln(1/ε 1 ).Colloquially speaking, the widths of the first blueshaded rectangles in Fig. 2 can be reduced more with P 1 than with other primitives.

Quantum RBMs
Quantum spin glasses Weighted MaxCut < l a t e x i t s h a 1 _ b a s e 6 4 = " h s q z A 6 A 3 + A 3 E u y s J j J + U + K 9 P B g w

Query complexity
Query depth Figure 3. Runtimes and circuit depths of quantum Gibbs-state samplers running on Primitive 1 versus inverse temperature.Red corresponds to the probabilistic master QITE algorithm, green to the coherent one, blue to the fragmented one with uniform schedule Sr for the best r, and orange to the fragmented one with a schedule Sr,a as in Eq. ( 8) for the best r and a (see also Fig. 5).The three classes of Hamiltonians (expressions in upper panels and lattice geometries in lower ones) correspond to the weighted MaxCut problem, the quantum RBM, and a quantum version of the Sherrington-Kirkpatrick spin glass model.Solid curves represent the means over 1000 random instances from each class, whereas (the thicknesses of) shaded curves are the corresponding standard deviations.Also, note the logarithmic vertical scale.The examples shown correspond to N = 12 qubits and a tolerated error of ε = 10 −3 , but qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 .Besides, qualitatively similar behaviors are observed for Primitive 2 (see App. J).Upper panels: average overall query complexity.Both fragmented algorithms comfortably outperform the probabilistic one already at small β.In addition, fragmentation with non-uniform schedule outperforms even coherent QITE at a critical inverse temperature βc.Moreover, fragmentation with uniform schedule is in the same order of magnitude as coherent QITE at β = βc and eventually also outperforms it (not shown) at larger β (between 5500 and 10000 depending on the Hamiltonian class).The dependance of βc with N is shown in Fig. 4. Lower panels: average query depth.Defined as the maximum number of queries per circuit run (i.e., not taking into account independent trials), the query depth quantifies the circuit depth (relative to the depth per query) required by the last run (the successful one).As expected, coherent QITE lies orders of magnitude above probabilistic QITE; but, surprisingly, fragmented QITE with non-uniform and uniform schedules are respectively almost identical to and of the same order of the latter.
that each fragment's query complexity is integer.
For N up to 15 qubits, we draw 1000 random H's within each class.For fair comparison, we re-scale all H's so that λ min = −1 and λ max = 1.For each of them, we calculate the complexities for β between 0 and 10000 and ε = 0.1, 0.01, or 0.001.Partition functions are evaluated by exact diagonalization of H. Evaluating Eq. ( 7) requires in addition a choice of schedule.We propose for a > 1, so that β l = l r a β/2 for all l ∈ [r].This guarantees that ∆β 1 < ∆β 2 . . .< ∆β r and allows us to control the strength of the inequalities by varying a.For each problem instance (N , H, and β), we sweep r and, for each value of r, we find the optimal a through the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm until minimizing Q Sr,a (β/2, ε) [60].
The overall complexities and circuit depths obtained (together with those for uniform schedules, i.e. with fixed a = 1) are shown in Fig. 3 for P 1 ; and the scalings with N of β c in Fig. 4. Similar scalings for the critical inverse temper-Figure 4. Critical inverse temperatures for P1 versus number of qubits.The error and Hamiltonian classes are the same as in Fig. 3, except for MaxCut, defined as weighted MaxCut but with random Ji,j ∈ {0, 1} for all (i, j).Blue dots represent the means over 1000 instances from each class, whereas dashed orange curves their fits over the Ansatz βc(N ) = A 2 η N + B, with A, B, η ∈ R. The fit results, together with their root-mean-square deviations (RMSDs), are shown in the insets.Similar scalings with N are observed for P2 (see App. J).In all cases, βc = O 2 N/2 is satisfied.ature are obtained for P 2 but with somewhat higher constant pre-factors (see App. J), which is expectable due to the nonunit sub-normalization factors α k in n l .Summarizing, our numerical experiments support the following observation.Observation 6. (Gibbs-state sampling with fragmented QITE).Let the primitives be of fixed type, either P 1 or P 2 .Then, for every H and ε > 0 studied, there exists Moreover, the maximal circuit depth required by fragmentation is asymptotically the same as that of probabilistic QITE.
Apart from the notable fact that fragmentation outperforms coherent QITE for both primitives, it is also remarkable that, long before Q Sr reaches Q coh , at β c , Q Sr is already much smaller than Q prob .Crucially, these advantages of fragmented QITE come at no cost in circuit depth, since the query depth of fragmentation, r l=1 q(∆β l , ε l , α l ), is observed to almost coincide with that of repeat until success, q(β/2, ε , α), specially for high β.Note that the latter needs not be the case: strictly speaking, neither q 1 nor q 2 are addi- J).The tolerated error is ε = 10 −3 .Qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 ; and the same holds for the other Hamiltonian classes in Fig. 4. The upper panel shows the optimal number of fragments r for uniform schedules Sr,1.The central and lower panels respectively show the optimal r and a for non-uniform schedules Sr,a.The dashed and dotted curves in the upper and lower panels respectively represent fits over the Ansätze r(β) = A β η and a(β) = A β η , with A and η ∈ R. The fit results are shown in the insets.Remarkably, for non-uniform schedules, the observed scaling for r is constant not only with β but also with N .
tive in ∆β l due to the non-linear dependance of ε l on ∆β l .
Of course, the optimal schedules as functions of β are a priori unknown.Nevertheless, the trends we observe for the schedule proposals in Eq. ( 8) are so compelling that they provide a sound basis for educated guesses in general: Observation 7. (Optimal schedules).For uniform and nonuniform schedules S r,1 and S r,a , given by Eq. (8), the overall complexity for P 1 is respectively minimised by r = O(β 1/2 ) and (r, a) = O(1), O(β 1/3 ) (see Fig. 5); whereas for P 2 by r = 2 and (r, a) = O(1), O(β 1/6 ) (see App. J).
As expected from the exponential dependence on r in Eq. ( 6), a slow growth of r with β is observed for each N to minimize Q Sr (β/2, ε).This is indeed seen for P 1 with uniform schedules (Fig. 5, upper panel).On the other hand, for P 2 with uniform schedules, r = 2 is observed (App.J) to minimize Q Sr but the resulting complexity does not reach Q coh over the scanned domain (0 ≤ β ≤ 10000).However, for both P 1 (Fig. 5, central panel) and P 2 (App.J) with non-uniform schedules (where fragmentation does outperform amplitude amplification), the observed scaling of r is constant with both β and N , remarkably.In turn, that a grows with β implies that each ∆β l decreases relative to ∆β l+1 as β grows.This is consistent with the intuition from Sec. III C that each ∆β l should be smaller than ∆β l+1 .In addition, see also App.I, where we consistently observe that, for the obtained optimal schedules, ∆β 1 is only a tinny fraction (around 0.1% to 2%) of 8 ln(4/ε 1 ).In fact, for both primitives, inserting the obtained a(β) into Eq.( 8), one sees that all ∆β l 's (except the last one, ∆β r ) also decrease in absolute terms as β grows.Yet, that a grows slowly with β guarantees that the ∆β l 's do not decrease too much.More precisely, comparing with Eq. ( 6), we see that ∆β l > ε l for all l ∈ [r].This is an important sanity check, because if ∆β l < ε l , the identity operator would readily provide an (ε l , 1)-block-encoding of F ∆β l (H), hence rendering the obtained scaling for a(β) meaningless.

IV. CONCLUDING DISCUSSION
We have presented two QITE primitives and a master QITE algorithm.The first primitive is designed for block-encoding Hamiltonian oracles and has query complexity (number of oracle calls) sub-additive in the inversetemperature β and ln(ε −1 ), with ε the error.This scaling is better than all previously-known bounds [11,15] for β 8 ln(4 ε −1 ) and becomes provably optimal for β ln(ε −1 ).Optimality is proven by showing saturation of a universal cooling-speed limit that is an imaginary-time counterpart of the celebrated no fast-forwarding theorem for real-time simulations [40][41][42].It is an open question what the optimal scaling is away from the saturation regime.Coincidentally, the first steps of our master algorithm operate precisely in that regime.On the other hand, the second primitive is designed for a simplified model of real-time evolution oracles involving a single time.Its complexity is multiplicative in β and ln(ε −1 ), but it requires a single ancillary qubit throughout and its oracle is experimentally-friendlier than in previous QITE primitives.Interestingly, preliminary numerical analysis [51] suggests that the asymptotic scaling with β of both primitives' complexities could actually be significantly better than in the analytical bounds above, for P 1 even reaching levels as good as q 1 (β, ε ) = O β ln(1/ε ) .
Our primitives are based on two technical contributions to quantum signal processing (QSP) [14,15,36,37] relevant on their own.The first one is a bound on the approximation error of Hermitian-operator functions by their truncated Chebyshev series, for any analytical real function.The second one is a novel, Fourier-based QSP variant for real-time evolution oracles superior to previous ones [14] in that it requires a single real time (and therefore a single ancilla), instead of multiple ones.Moreover, it is also experimentally friendly in that it requires no qubitization [37].
Primitive technicalities aside, the main conceptual contribution of this work is the master QITE algorithm, which is conceptually simple, yet surprisingly powerful.It is based on breaking the evolution into small-β fragments.This gives a large reduction in wasted queries and circuit depth, yielding an overall runtime competitive with (and for high β even better than) that of coherent approaches based on quantum amplitude amplification (QAA).This is remarkable since the latter requires in general N extra ancillary qubits and circuits O 1/ p Ψ (β, α) times deeper than the former.To put this in perspective, it is illustrative to compare with quantum amplitude estimation (QAE).In its standard form, QAE has similar hardware requirements as QAA [35].However, recently, interesting algorithms have appeared [61,62] that perform partial QAE with circuit depths that can interpolate between the probabilistic and coherent cases.In contrast, here, we beat full QAA using circuit depths for most runs much lower than in the bare probabilistic approach.
That fragmented QITE outperforms coherent QITE is proven rigorously for Primitive 1 and also supported by exhaustive numerical evidence for both primitives.Namely, our numerical experiments address random instances of Ising, transverse-field Ising, and Heisenberg-like Hamiltonians encoding computationally hard problems relevant for combinatorial optimisations, generative machine learning, and statistical physics, e.g.We emphasize that our analysis of is based on the analytical upper bounds on the query complexity we obtained, instead of the complexities themselves.The corresponding analysis for the actual (numerically obtained) query complexities requires re-optimizing the fragmentation schedules.Preliminary observations [51] in that direction are again promising, indicating that the actual overall complexities may be orders of magnitude lower than in Fig. 3, e.g.In any case, qualitatively similar interplays between fragmentation and QAA are expected even for other types of primitives (beyond QITE) whose complexity and post-selection probability have similar scalings.All these exciting prospects are being explored for future work.
Our findings open a new research direction towards midterm high-precision quantum algorithms.In particular, the presented primitives, cooling-speed limit, QSP methods, and master algorithm constitute a powerful toolbox for quantum signal processors specially relevant for the transition from NISQ to early prototypes of fault-tolerant hardware.
V. METHODS: QUANTUM SIGNAL PROCESSING Quantum signal processing (QSP) is a powerful method to obtain an ε -approximate block encoding of an operator function f (H) := λ f (λ)|λ λ|, where {|λ ∈ H S } are the eigenvectors and {λ} the eigenvalues of a Hamiltonian H, from queries to an oracle for H [36].We note that QSP can also be extended to non-Hermitian operators [15], but here we restrict to the Hermitian case for simplicity.We present two QSP methods for general functions one for each oracle model in Defs. 2 and 3. Our QITE primitives are then obtained by particularizing these methods to the case f (H) = F β (H), with F β (H) = e −β(H−λmin) .

A. Real-variable function design with single-qubit rotations
We start by reviewing how to approximate functions of one real variable with single-qubit pulses.

Single-qubit QSP method 1
Consider the single qubit rotation R 1 (θ, φ) := e iθX e iφZ , where X and Z are the first and third Pauli matrices, respectively, and φ ∈ [0, 2π].The angle θ ∈ [−π, π] is the signal to be processed and the rotation e iθX is called the iterate.One can show [63] that, given q ∈ N even and a sequence of angles where B and D are polynomials in cos θ with complex coefficients determined by Φ 1 .
For target real polynomials B(cos θ) and D(cos θ), we wish to find Φ 1 that generates B(cos θ) and D(cos θ) with B(cos θ) and D(cos θ) as either their real or imaginary parts, respectively.This can be done iff they satisfy (App.K) for all θ, and have the form with b k ∈ R and d k ∈ R. Alternatively, Eq. ( 11) can also be expressed in terms of Chebyshev polynomials of first T k (cos θ) := cos(kθ) and second U k (cos θ) := sin ((k + 1)θ) / sin θ kinds.This can be used to obtain either Chebyshev or Fourier series of target operator functions.If the target expansion satisfies Eqs.(10) and (11), the angles Φ 1 can be computed classically in time O (poly(q)) [63-66].

Single-qubit QSP method 2
This method is inspired by a construction in Ref. [67] and shown in detail in a companying paper [39].The fundamental gate is R 2 (x, ω, ζ, η, ϕ, κ) = e i ζ+η 2 Z e −iϕY e i ζ−η 2 Z e iωxZ e −iκY , which has five adjustable parameters {ω, ξ} ∈ R 5 , where ξ := {ζ, η, ϕ, κ}.Here, x ∈ R will play the role of the signal and e iωxZ that of the iterate.In Ref. [67], it was observed that the gate sequence 4(q+1) , can encode certain finite Fourier series into its matrix components.In [39], not only it is formally proven that for any target series a unitary operator can be built with it as one of its matrix elements but also we provide an explicit, efficient recipe for finding the adequate choice of pulses Φ 2 .This is the content of the following lemma.

B. Operator-function design from block-encoded oracles
Here, we synthesize an (ε ,1)-block-encoding of f (H) from queries to an oracle for H as in Def. 2. The algorithm can be seen as a variant of the single-ancilla method from Ref. [37] with slightly different pulses.The basic idea is to design a circuit, P 1 , that generates a perfect blockencoding V Φ1 of a target Chebyshev expansion fq (H) := q/2 k=0 b k T k (H) that ε tr -approximates f (H), for some 0 ≤ ε tr ≤ ε .This can be done by adjusting Φ 1 as in Sec.V A 1. Note that the achievability condition (10) requires that fq (H) ≤ 1, but we only guarantee fq (H) ≤ 1 + ε tr .
However, this can be easily accounted for introducing an inoffensive sub-normalization α = (1 + ε tr ) −1 (see, e.g., Lemma 14 in Ref. [37]), which we neglect here throughout.Choosing fq as the truncated Chebyshev series of f with truncation error ε tr ≤ ε , we obtain the desired blockencoding of f (H).For f analytical, the error fulfills [53] with f (q/2+1) the (q/2 + 1)-th derivative of f .This allows one to obtain the truncation order q/2 and Chebyshev coefficients b := {b k } 0≤k≤q/2 (see App. L).Then, from b, one can calculate the required Φ 1 (see App. M).Next, we explicitly show how to generate V Φ1 .Using the short-hand notation To exploit the single-qubit formalism from Sec. V A 1, one needs an iterate that acts as an SU (2) rotation within each H λ .In general, O 1 itself is not appropriate for this due to leakage out of H λ by repeated applications of O 1 .However, there is a simple oracle transformation -qubitization -that maps O 1 into another blockencoding O 1 of the same H but with the desired property [37].The transformed oracle reads (see App. N) with θ λ : = cos −1 (λ) and Although the qubit resemblance could be considered in a direct analogy to QSP for a single qubit, it leads to a more strict class of achievable functions than if we resort to one additional qubit (single-ancilla QSP) [37].This extra ancilla controls the action of the oracle O 1 through the iterate on H SA , where |± are the eigenstates of the Pauli operator X for the QSP qubit ancilla.Throughout this section, 1 is the identity operator on H SA O 1 and M denotes the single-qubit Hadamard gate.Let us define the operators V φ = V 0 1 ⊗ e iφZ and Vφ = V † 0 1 ⊗ e iφZ for a given phase φ ∈ [0, 2π], which play the role of R(θ, φ) of the previous sub-section with θ λ playing the role of θ for each λ.These operators can be phase iterated to generate on H SA , with ancilla pre-and post-processing unitaries The following pseudocode gives the entire procedure.The correctness and complexity of Algorithm 2 are addressed by the following lemma, proven in App.O .Lemma 9. Let f , ε , O 1 , and O † 1 be as in the input of Alg. 2.Then, for any q s.t.ε tr in Eq. ( 12) is no greater than ε , there exists Φ 1 ∈ R q+1 such that V Φ1 in Eq. ( 15) is a (ε ,1)-blockencoding of f (H).The circuit P 1 generating V Φ1 requires a single-qubit ancilla, q queries to O 1 and O † 1 each, and g P1 = O (g O1 + |A O1 |) gates per query, with g O1 the gate complexity of O 1 .Furthermore, the classical runtime (calculations of b and Φ 1 ) is within complexity O poly(q/2) .Some final comments about the input function are in place.The restriction of f being analytical is needed to determine the truncation order through Eq. ( 12).In fact, to evaluate the RHS of the equation exactly, one needs in general closedform expression for f .However, if the required truncation order is given in advance, the corresponding Chebyshev coefficients can be obtained from q/2 + 1 evaluations of f in specific points (the nodes of the Chebyshev polynomials).In that case, a closed-form expression for f is not required and a classical oracle for evaluating it suffices.Moreover, it is important to note that a satisfactory Chebyshev approximation is guaranteed to exist for all bounded and continuous functions [52].If the Chebyshev expansion is given, then step 1 of Algorithm 2 can obviously be skipped and f is not required at all.We further note that Alg. 2 can also For P1 and P2, these angles are chosen such that fq is a highprecision Chebyshev and Fourier approximation of f , respectively.be applied even to non-continuous functions over restricted domains without the discontinuities.This is for instance the case of the inverse function, which can be well-approximated over the sub-domain [−1, −δ] ∪ [δ, 1] by a pseudo-inverse polynomial of δ-dependent degree [68].
1. QITE-primitive from a block-encoding oracle QITE primitive 1 corresponds to the output of Alg. 2 for f = F β = e −β(λ−λmin) .The Chebyshev coefficients can be readily obtained from the Jacobi-Anger expansion [38] e where I k (β) is a modified Bessel function.The proof of Theorem 1 thus follows straightforwardly from Lemma 9.

C. Operator function design from real-time evolution oracles
Here, we synthesize an (ε , α)-block-encoding of f (H) from an oracle for H as in Def. 3. We proceed as in Sec.V B, but with a circuit P 2 generating a perfect blockencoding V Φ2 of a target Fourier expansion gq (H) := q/2 m=−q/2 c m e imHt that ε tr -approximates α f (H), for some ε tr ≤ ε , α ≤ 1, and a suitable t > 0. This is done by adjusting Φ 2 according to Lemma 8.The function gq is a Fourier approximation of an intermediary function g such that g(λ, t) := g(x λ ) = α f (λ), for t chosen so that x λ = λ t is in the interval of convergence of gq to g for all λ ∈ [λ min , λ max ].The reason for this intermediary step here is to circumvent the well-known Gibbs phenomenon, by virtue of which convergence of a Fourier expansion cannot in general be guaranteed at the boundaries.In turn, the sub-normalization factor α arises because our gq converges to g only for |x λ | < π/2, whereas Lemma 8 requires that |g q (x λ )| ≤ 1 for all |x λ | ≤ π.This forces one to subnormalize the expansion so as to guarantee normalization over the entire domain.(As in Sec.V B, the inoffensive subnormalization factor (1 + ε tr ) −1 is neglected.) More precisely, we employ (see Ref. [39]) a construction from Ref. [14] that, given 0 < δ ≤ π/2 and a power series that εtr 4 -approximates g, gives c := {c m } |m|≤q/2 such that gq ε tr -approximates g for all For f analytical, one can obtain the power series of g from a truncated Taylor series of f using that g(x λ ) = αf (λ).The truncation order L can be obtained from the remainder: In turn, the conditions (This renders gq periodic in x λ with period 2π.)In addition, in Ref. [39] the sub-normalization constant α is bounded in terms of the obtained t and Taylor coefficients a := {a l } 0≤l≤L of f .It suffices to take α such that Note that L and α are inter-dependent.One way to determine them is to increase L and iteratively adapt α until Eqs.( 19) and ( 20) are both satisfied.Alternatively, if the expansion converges sufficiently fast (e.g., if lim l→∞ | a l+1 a l | < 1 − 2δ π ), one can simply substitute L in Eq. ( 20) by ∞.This is indeed the case with QITE primitives (see Sec. V C 1). There, the substitution introduces a slight increase of unnecessary subnormalization but makes the resulting α independent of L, thus simplifying the analysis.Then, from the obtained c, one can finally calculate the required Φ 2 [39].
Next, we explicitly show how to generate V Φ2 .The iterate is now taken simply as the oracle itself: readily acts as an SU (2) rotation on each 2-dimensional subspace span{|λ |0 , |λ |1 }.This relaxes the need for qubitization.The basic QSP blocks for the unitary operator

(21a) and
with ξ k := {ζ k , η k , ϕ k , κ k }.V ξ k and Vξ k play a similar role to R 2 (x, ω k , ξ k ) in Sec.V A 2 (with x λ inside O 2 playing the role of x there for each λ).Here we take and W out = 1.The circuit is depicted in Figs.6.a) and 6.c).
The following pseudocode presents the entire procedure.The correctness and complexity of Alg. 3 are addressed by the following lemma, proved in Ref. [39].

VI. METHODS: MINIMUM QUERY COMPLEXITY OF QITE PRIMITIVES BASED ON BLOCK-ENCODING ORACLES
Our proof strategy for Theorem 3 is analogous to that of the no-fast-forwarding theorem for real-time evolutions [40][41][42].That is, it is based on a reduction to QITE of the task of determining the parity par(x) := with ⊕ the bit-wise sum, of an unknown N -bit string x := x 0 x 1 • • • x N −1 from a parity oracle U x for x; together with known fundamental complexity bounds for the latter task [69, 70].More precisely, our proof relies on three facts: i) No algorithm can find par(x) from U x with fewer than a known number of queries to it [69, 70]; ii) a QITE primitive querying an oracle for an appropriate Hamiltonian H x gives an algorithm to find par(x); and iii) a block-encoding oracle for H x can be synthesized from one call to U x .The three facts are established in the following lemmas.
The first lemma, proven in [69], lower-bounds the complexity of any quantum circuit able to obtain par(x) from queries to U x .For our purposes, it can be stated as follows.
Lemma 11.Let C be a quantum circuit composed of xindependent gates and q times the x-dependent unitary with {|j } j∈[N +1] an orthogonal basis, such that, acting on an x-independent input state and upon measurement on an x-independent basis, outputs par(x) with probability greater than 1/2 for all x ∈ {0, 1} N .Then, q ≥ N/2 .
The second lemma, proven in App.P, reduces parity finding to QITE and is the key technical contribution of this section.
Lemma 12. Let β > 0, ε > 0, α ∈ (0, 1], and x an N -bit string such that Then, there exists H x , with H x ≤ 1, such that a (β, ε , α)-QITE-primitive with calls to a block-encoding oracle for H x , acting on an x-independent input state and upon measurement on an x-independent basis, outputs par(x) with probability greater than 1/2 for all x ∈ {0, 1} N .
Finally, the missing link between Lemmas 11 and 12 is a sub-routine to query H x given queries to U x in Eq. ( 23).This is provided by the following lemma, proven in App.Q. Lemma 13.A block-encoding oracle for H x can be generated from a single query to x and O(N ) x-independent gates, for all x ∈ {0, 1} N .(See Fig. 15 for circuit.)Proof of Theorem 3.With these three Lemmas, the proof of Theo. 3 is straightforward.First, note that the left-hand side of Eq. ( 24) decreases monotonically in N .Hence, for any fixed (β, ε , α), the largest N ∈ N that satisfies Eq. ( 24) is N = 2q , with q ∈ R defined by Eq. ( 5).This, together with Lemmas 12 and 13, implies that any (β, ε , α)-QITEprimitive synthesized from queries to the parity oracle U x provides a quantum circuit to determine the parity of any string x of length 2q .Then, by virtue of Lemma 11, the query complexity q min (β, ε , α) ∈ N of the primitive cannot be smaller than q ≥ N/2 = 2q 2 .Note that this number is the nearest integer to q. Ergo, q min (β, ε , α) ≥ q.
Finally, a comment on why Theo. 3 does not hold for QITE primitives based on real-time evolution (RTE) oracles is useful at this point.The reason is that, by virtue of the RTE no-fast-forwarding theorem [40][41][42] Both algorithms for operator-function design we present assume access to ideal oracles as given by Def. 2 and Def. 3.This is not the case in an experimental scenario where only approximate oracles are available.This leads to a total error in the algorithm that is proportional to the number of approximate oracle calls.The following lemma deals with that error for generic operator functions.Lemma 14. (Primitives from imperfect oracles) Let a circuit with q queries to an oracle O for H generate an (ε, α)-block- Proof.For q ε O 1, straightforward matrix multiplication shows that the total error is ||V f (H) − Ṽf(H) || = O(q ε O ).This implies ||f (H) − 0| Ṽf(H) |0 || ≤ ε P + q ε O by virtue of the triangle inequality.This concludes the proof.
As an exemplary application of Lemma 14, we calculate the gate complexity g Õ2 required to implement an approximate real-time evolution oracle Õ2 for QITE primitive P 2 by Trotter-Suzuki-like simulation methods.To attain a total error ε , we choose ε = ε /2 and ε O = ε /(2q).For real time t, the gate complexity of state-of-the-art algorithms [48,49] based on first-order product formulae is O(t 2 /ε O ).Substituting for ε O , and then for t and q with the real time and query complexity given in Theorem 2, we obtain the oracle gate complexity g Õ2 = O π 2 β γ ln(8/ε ) 2 ε (1+γ/β) .This scaling is exponentially worse in 1/ε than what we would get if we simulated O 2 with more sophisticated real-time evolution algorithms [15,36,37,42,47].However, those algorithms assume more powerful oracles, which require significantly more ancillary qubits.The fact that P 2 can be synthesised with only 1 ancillary qubit throughout and with no qubitization makes it appealing for near-term implementations on intermediate-sized quantum hardware.Since α F β (H) is not unitary, the (operator-norm) error ε in its approximation by an (α, ε )-block-encoding gets amplified at the post-selection due to state renormalization.The following allows us to control the error in the output state.
Lemma 15. (Error propagation from block encoding to output state) Let p Ψ (β, α) ≤ 1 be the post-selection probability of an (α, ε )-block-encoding and, so, for some ∈ R, with Next, we Taylor-expand the output state in terms of ε.To that end, note first that So, the output state-vector error is upper-bounded as  β, α) , after which the desired output is obtained with probability close to 1. Hence, the coherent master algorithm displays a significantly lower overall query complexity than the probabilistic one.However, in return, the former requires much larger circuit depth than the latter.
With the l 2 -norm distance between the two state vectors, we can upper-bound the trace distance between their corresponding rank-1 density operators using well-known inequalities.First, note that the l 2 -norm distance between two arbitrary normalised vectors |φ and |ψ can be expressed as Finally, since α 2 F β (H)|Ψ is in general unknown, one has no a priori knowledge of k opt .However, fortunately, this can be accounted for with successive attempts with k randomly chosen within a range of values that grows exponentially with the number of attempts (see [35,Theorem 3]).Remarkably, the resulting average number of applications of the primitive remains within O(1/ p Ψ (β, α)).In this appendix, we prove that Eq. ( 4) gives approximately the optimal value for the subnormalization of P 2 , such that the overall query complexity given by Eq. ( 1) is minimized.Notice that, by diminishing the value of γ we increase the success probability of P 2 , decresing the number of times it needs to be realized.At the same time, it leads to an increment on the query complexity of P 2 given by Eq. ( 3).Therefore, the optimal subnormalization is a tradeoff between these two contributions.
From Eqs. ( 1) and (3), given β, ε, and the master algorithm type κ, we get that the optimal γ minimizes with g = 1 introduced for convenience.Here we have used given by Eq. ( 4).In order to prove that that expression is a good approximation for the actual optimal value γ which is much smaller than 1 provided that ε 1, since γ κ (β) < 1 and, therefore, ∆γ < 1.

Appendix E: Proof of Theorem 4
Let us first prove a convenient auxiliary lemma.Ideally, the input and output states of the l-th fragment of should be |Ψ , respectively.However, the actual operator that each P ∆β l ,ε l ,α l perfectly block-encodes is for some E l on H S with ||E l || =: ε l .Hence, in analogy to Eq. (B1b), the actual input and output states are for some "error states" |Ξ l−1 and |Ξ l , respectively.(Note that |Ξ 0 = 0.) The following lemma controls the output-state error in terms of the errors in the input state and block encoding.
Lemma 16. (Error propagation from input state and block encoding to output state) For all l ∈ [r], let ε l−1 := |Ξ l−1 and ε l be the input-state and block-encoding errors, respectively; and ε l the tolerated output-state error.If Proof.Similar to the proof of Lemma 15 but where also the input state is approximate.
Note that the proof of Lemma 15 (and therefore also that of Lemma 16) makes no use of the specific form of F β (or F ∆β l ).Consequently, both lemmas hold for (α, ε )-blockencodings of any operator function, not just the QITE propagator.We are now in a good position to prove Theorem 4.
Proof of Theorem 4. We begin by proving soundness.The proof consists of showing that Eqs.(6) imply that Lemma 16 holds for all l ∈ [r] and gives ε r := |Ξ r = O(ε).To see this, apply the lemma from l = 1 till l = r, with ε 0 = 0, and iteratively use the property We next prove complexity.To this end, we must show the validity of Eq. (7).By Def. 5, the overall query complexity of Algorithm 1 is the sum over the query complexities of each primitive applied.Each primitive P ∆β l ,ε l ,α l has complexity q(∆β l , ε l , α l ).Hence, we must show that the average number of times n l that each P ∆β l ,ε l ,α l is run is given by Then, use the latter together with Eq.
(E2) to get n l = pΨ(β l−1 ) Interestingly, we note that the only specific detail about F β that the proof of Theorem 4 uses is the fact that F β (H) = r l=1 F ∆β l (H) (this is required for Eq.(E2) to hold).Consequently, Theorem 4 applies not only to fragmented QITE but actually also to the fragmentation of any operator function in terms of suitable factors.

Appendix F: Validity of the query complexity bounds of QITE primitives for low beta
As discussed in Secs.III C and III D, the first steps of fragmented QITE involve small inverse temperatures, often with ∆β 1 < 1.However, Eqs. ( 2) and (3) are in principle only asymptotic upper bounds for the actual query complexities.Hence, it is licit to question how valid they are to access the complexity of the first steps of Alg. 1.In this appendix, we discuss the validity and tightness for all β > 0 of as exact expressions for the query complexities of primitives 1 and 2, respectively.These formulas are the ones used in our numeric experiments.First of all, notice that although these expressions are continuous functions, the actual query complexities used by Algs. 2 and 3 take only even integer values.Thus, we take the value of the query for each round of fragmentation as 2 qk (β, ε )/2 , k = 1, 2. In particular, for any value of β such that 0 < q1 (β, ε ) ≤ 2 or 0 < q2 (β, ε ) ≤ 2 at least two queries to the corresponding oracle is used.
For P 2 there is actually no concern because the reasoning of Sec.V C that led to Eq. ( 3) is valid for any beta.This is because q 2 = 2 q2 (β, ε ) is the exact expression furnished by Lemma 37 in Ref. [14], used to prove Lemma 10, for αF β (λ).Therefore, although this formula is not strictly tight, because it comes from successive approximations [14], it is exact and valid for any β > 0 in the sense that making exactly q2 (β, ε ) queries to the oracle ensures that the error is below the tolerated.
P 1 , on the other hand, requires a further analysis since, according to Eq. ( 2), q 1 (β, ε ) is known to be equal to q1 (β, ε ) only in big-O notation and, particularly, for large β.Next, we show that q1 (β, ε ) is, in fact, an over-estimation of the actual q 1 (β, ε ) needed to guarantee error ε .
We notice that the first inequality in Eq. ( 17) gives a tighter upper bound for the query complexity.Therefore, to attain a target error ε , it is enough that q 1 = 2 q1 (β, ε )/2 queries yields to a truncation error which satisfies As we explain next, Fig. 8 shows that it is, in fact, what happens.In Fig. 8a) we show the interval of β's for which the value q1 (β, ε ) is between q 1 = 18 and q 1 = 20.For any β inside this interval, the number of queries made in P 1 will be q 1 = 20.However, in Fig. 8b) when we look at the upper bound for the truncation error with 20 queries, ε , in the same interval, we see that it is much smaller than the tolerated error ε .Consequently, the truncation error ε tr , which is the actual error of P 1 , is also below ε .This is also observed in Figs.8c) and 8d), where the interval of β's for which the minimum number of q 1 = 2 queries is made and the corresponding ε (q1=2) tr in that interval are shown, respectively.We tested for other values of ε ranging from 10 −1 to 10 −10 and the same behavior is observed.This shows that q1 (β, ε ) is valid as the query complexity of P 1 even for small values of β.Moreover, it actually overestimates the minimum query complexity for a given target error ε .(q 1 =20) up (β) from Eq. (F3) for q1 = 20 queries in the same interval of β, which is below the tolerated error ε .Parts (c) and (d) show the same thing for the minimum number of queries q1 = 2, showing that ε Appendix G: Fragmented QITE outperforms coherent QITE Here we analytically show, for primitive P 1 , that there exists an inverse temperature β c above which fragmented QITE outperforms coherent QITE (based on quantum amplitude amplification) in terms of overall query complexity.Before the proof, it is useful to recall that the success probability (of any QITE primitive) is given by p for all β ≥ 0.Moreover, ] is a monotonically decreasing function.Therefore, it has an inverse function, which we denote as p −1 Ψ .
Proof of Theorem 5.For simplicity, we omit here the big-O notation in Eq. ( 2) and take the complexity of P 1 directly as . This is also consistent with the tightness analysis of App.F. We need to show that there exists a schedule 1) and (7), this holds if where ε l is given by Eq. ( 6).It is useful to introduce the upper bound q1 (β, ε ) := eβ 2 +ln 1 ε .(Note that q1 (β, ε ) > q 1 (β, ε ) for all β > 0.) Then, Eq. (G2) holds if Next, we break this inequality into two inequalities, one for ∆β 1 and another for ∆β 2 .Then, we show that, under the theorem's assumptions, both inequalities can be satisfied.More precisely, Eq. ( G3) is satisfied if the following two inequalities are simultaneously satisfied: and we construct an S 2 that fulfills this.First, substituting for q1 and q 1 , we find that each fragment must satisfy For consistency, the right-hand side of Eq. (G5) should be positive for each l, the most critical case being that of l = 1, when the first term is more positive while the second term is more negative as compared with the case l = 2.So it suffices to enforce positivity of the right-hand side of Eq. (G5) for β ≥ β c only for l = 1.One can directly see that this is already satisfied by plugging the expression for β c into Eq.(G5) and using Eq.(G1).Because β = ∆β 1 + ∆β 2 , we are free to choose the size of only one fragment, say ∆β 1 ≡ β 1 .Eq. (G5) imposes an upper on β 1 for l = 1 and a lower bound for l = 2.We first find a β 1 that satisfies the lower bound and then show that, for β ≥ β c , the upper bound is automatically satisfied.Using Eq. ( 6) and ∆β 2 = β − β 1 , we re-write Eq. (G5) for l = 2 as Sufficient to satisfy this inequality is that each term on the right-hand side is positive.Since ln e+2 ln(1/ε )/eβ > 1, the third term is always smaller than the second one.Therefore, requiring that ensures that both the second and the third terms are positive.This condition is in turn satisfied if p Ψ (β 1 ) ≤ √ pΨ(β) , which is equivalent to demanding Notice that Eq. (G7) also ensures that p Ψ (β 1 ) ≤ 1/2 for β ≥ β c , due to p Ψ (β) ≤ 1/4 by theorem assumption.This makes the last term in Eq. (G6) also positive.Now, because of Eq. (G1), sufficient to satisfy Eq. ( G8) is to demand that This is our final lower bound on β 1 .Its fulfillment guarantees Eq. (G8) and, therefore, also Eq. (G6).On the other hand, for l = 1, Eq. (G5) can be re-written as where we have used Eq. ( 6) again and p Ψ (β 0 ) = p Ψ (0) = 1.
For β 1 satisfying Eq. (G9), Eq. (G10) is satisfied if which, in turn, by virtue of Eq. (G1), is satisfied if (G12) The RHS of Eq. (G12) still depends on β.To remove this dependence, we use a β-independent upper bound for the logarithmic term.First, notice that, for Eq.(G12) to hold, β > and, since p −1 Ψ is a monotonically decreasing function, ).That latter is the desired β-independent upper bound.Here, we note that the theorem assumption o < 1/2.2 ensures that p −1 and demanding that β > β c is sufficient to satisfy Eq. (G12).This is our final lower bound on β.
In conclusion, the schedule S 2 = {β 1 , β − β 1 }, with β 1 in the range determined by Eqs.(G9) and (G10), and β ≥ β c , satisfies Eq. (G4) for both l = 1 and l = 2.In particular, in the theorem statement, β 1 is taken precisely as the RHS of Eq. (G9).This proves the last remaining implication.Let us consider the simple case of a non-interacting qubit Hamiltonian given by with b j ≥ 0 and Clearly, the analysis for this case covers also any other Hamiltonian unitarily-equivalent to Eq. (H1).We focus on the task of Gibbs-state sampling (maximally mixed state as input).Hence, o = 2 −N/2 and the success probability for P 1 is given by p Ψ (β) = e −2 β N j=1 cosh(2 β b j ).For simplicity, we take b j = 1/N for all j, obtaining p Ψ (β) = e −2 β cosh(2 β/N ) N .
(H2) This is simple enough to obtain a closed-form expression for its inverse function and -so -explicit expressions for the fragmentation schedule of Lemma 5, which we do next.By virtue of Lemma 5, 2) 1/N .This, using that, for N ≥ 3, ln As clear from Eqs. (H3) and (H4), the first fragment is exponentially shorter in N than the second one.Moreover, we show next that, the first fragment satisfies β 1 < 8 ln(4/ε 1 ).This implies that Primitive 1 performs better (in query complexity) than the one from Ref. [15] at the fragmentation scheme from Lemma 5, as discussed after Theor. 1.In fact, it also implies that Primitive 1's performance is close to optimal (see discussion after Theor.3) at that first fragment.From Eq. ( 6), for r = 2, we have where, in the last equation, we have used Eq.(H3) and the fact that β 1 = ∆β 1 .This finishes the proof.
Appendix I: Close-to-optimality of Primitive 1 for interacting Hamiltonians Here, we study the ratio between ∆β 1 = β 1 and 8 ln(4/ε 1 ) for fragmented quantum-Gibbs-state sampling with P 1 , for the generic Hamiltonians studied in Sec.III D and the optimal schedules S r,a used for the central panel of Fig. 5. Recall that β 1 ≈ 8 ln(4/ε 1 ) is the point at which the complexity upper bound in Eq. ( 2) starts outperforming the complexity upper bound derived in [15] for the powerful QITE primitive obtained there, as mentioned after Theo. 1.The results are displayed in Fig. 9, showing the histograms of β 1 /8 ln(4/ε 1 ) for the random instances considered for the three classes of Hamiltonian and for different N .This clearly  shows that β 1 8 ln(4/ε 1 ), supporting our claim that, for the first fragment, the fragmented master algorithm operates deep into the optimality regime of Primitive 1. Finally, we performed the same analysis for ∆β 2 (not shown) and consistently observed that ∆β 2 is smaller than 8 ln(4/ε 2 ) too.That is, also the second fragment operates close to the optimality regime of P 1 .
The last part of the Lemma, presented before in Eq. (11), can be obtained from condition (i) using the properties of the Chebyshev polynomials T k (cos θ) = cos(kθ) and U k (cos θ) = sin ((k + 1)θ) / sin θ.Condition (ii) is equivalent to Eq. (10) in the main text.Given that the desired polynomials are achievable by QSP, the set of rotation angles Φ 1 can be computed classically in time O (poly(q)) [63][64][65][66].

Appendix L: Chebyshev expansions
In this appendix we briefly approach polynomial approximations to continuous functions using Chebyshev polynomials.
It is known that [52], whenever a function f is continuous and bounded on the interval [−1, 1], it is endowed with a convergent Chebyshev series such that with coefficients The truncated version of (L1) up to order q is close to the optimal polynomial approximation of degree q for f (λ) in the interval [−1, 1] [52].The following Lemma (Theorem 2.1 from [53]) gives the maximal error ε of this approximation for a given degree q.
Another way of obtaining a nearly optimal polynomial approximation of order q is to construct it from Chebyshev polynomials interpolation, which can be done in polynomial time by solving a linear system [53].The latter construction is based on conveniently choosing a list of points λ 0 , • • • , λ q , typically related to the zeros or extreme points of T q+1 (λ), and finding the linear combination of all degree-less-than-q Chebyshev polynomials which interpolates f (λ 0 ), • • • , f (λ q ).The error formula of Eq. (L4) is not exclusive of the Chebyshev truncated series.It is also valid for the special interpolating polynomials just described and for the optimal polynomial approximation, the difference being the point ξ where the (n + 1)-th derivative must be evaluated.

Quantum RBMs
Quantum spin glasses Weighted MaxCut < l a t e x i t s h a 1 _ b a s e 6 4 = " T i s J 4 5 d a u J C N r g 0 R s R r j k e u r l c 4 C. F. NU.F. U.      .Runtimes and circuit depths of quantum Gibbs-state samplers running on Primitive 2 versus inverse temperature.The color code is the same as in Fig. 3: red corresponds to probabilistic, green to coherent, blue to fragmented with uniform schedule Sr for the best r, and orange to fragmented with a schedule Sr,a as in Eq. ( 8) for the best r and a (see also Fig. 12).The Hamiltonian classes are also the same as in Fig. 3.For simplicity, here we use just a subset of 100 random instances from each class out of the 1000 used for Fig. 3.The examples correspond to N = 12 qubits and an error ε = 10 −3 , but qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 .Upper panels: average overall query complexity.As mentioned, also for P2 does fragmentation with non-uniform schedule outperform coherent QITE, except the critical inverse temperature βc is now higher than in Fig. 3.The dependance of βc with N is shown and discussed in Fig. 11.Lower panels: average query depth.As with P1, also for P2 does coherent QITE lie orders of magnitude above probabilistic QITE; while fragmented QITE is almost identical to the latter.
Appendix M: A recipe for pulses -QSP method 1 This appendix is based on the proof of Theorem 3 of Ref. [15].We restrict ourselves to the case of even q, as this is what we use throughout our operator-function design algorithm.Henceforth, we write x = cos θ.Here we only show how to get the QSP pulse sequence Φ given the complex polynomials B(x) and D(x), such that Eq. ( 9) is attained.We refer the reader to Theorem 5 of Ref. [15] for a proof of existence of B and D given the real polynomials B(x) and D(x) satisfying Lemma 17.The proof of the referred theorem is also an algorithm to calculate B and D whose classical computational runtime scales as O(poly(q)).

Quantum RBMs
Quantum spin glasses

Weighted MaxCut MaxCut
< l a t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0    k t J v s L F Q x N Y 6 3 2 H n N / g T T h 6 F J h 6 4 c D j n X u 6 9 J 4 g 5 0 8 Z 1 v 5 y V 1 b X 1 j c 3 M V n Z 7 Z 3 d v P 3 d w W N N R o i h U a c Q j 1 Q i I B s 4 k V A 0 z H B q x A i I C D v W g f z v x 6 w N Q m k X y 3 g x j 8 A X p S h Y y S o y V / F Y A h r T T l h K Y j t q 5 v F t w p 8 D L x J u T f O l k X P l + P B 2 X 2 7 n P V i e i i Q B p K C d a N z 0 3 N n 5 K l G G U w y j b S j T E h P Z J F 5 q W S i J A + + n 0 6 B E + t 0 o H h 5 G y J Q 2 e q r 8 n U i K 0 H o r A d g p i e n r R m 4 j / e c 3 E h N d + y m S c G J B 0 t i h M O D Y R n i S A O 0 w B N X x o C a G K 2 V s x 7 R F F q L E 5 Z W 0 I 3 u L L y 6 R W L H i X h W L F p n G D Z s i g Y 3 S G L p C H r l A J 3 a E y q i K K H t A T e k G v z s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > c Figure 11.Critical inverse temperatures for P2 versus of qubits.The tolerated error and Hamiltonian classes are the same as in Fig. 4. Blue dots are the means over 100 instances from each class, whereas dashed orange curves their fits over the same Ansatz as in Fig. 4. The fit results are shown in the insets.The scalings of βc with N are similar to those for P1 but with somewhat higher pre-factors and additive constants.The fact that βc's are higher than for P1 comes from the non-unit factors α k inside n l in Eq. (7).
Because the pair of polynomials satisfies condition iii), they always define a unitary operator d l/2−1 makes the higher order terms in both B(x) and D(x) vanish.This is a valid choice for φ 1 because, if l > 0, condition iii) implies that the polynomial |B(x)| 2 + (1 − x 2 )|D(x)| 2 is constant, and thus each of its coefficients accompanying a greater-than-zero power of x must vanish.In particular, it with |0 in H A .Moreover, the pulse sequence can be obtained classically in time O(poly(q)).
Proof of Lemma 19.Recalling that the identity operator does not produce any leakage of states out of the initial subspace they belong to, the identity operator in H SA O 1 can be effectively represented on the subspace λ H λ as 1 = λ (|0 λ+ 0 λ+ | + |0 λ− 0 λ− |).Using this representation and Eq.(N1), it is straightforward to write the operator V 0 of Eq. ( 14) as where X is the first Pauli operator for an extra single-qubit ancilla we will denote by A e .From V 0 one can also obtain a convenient representation for V φ = V 0 1 ⊗ e iφZ in λ H λ ⊗ H Ae , it reads with the qubit rotations acting on A e previously defined as R 1 ± θ λ 2 , φ = e ±i θ λ 2 X e iφZ .If the operator defined as Vφ = V † 0 1 SA O ⊗ e iφ Z is applied in the sequence, it removes the undesired phase factors e ±i θ λ 2 .A direct consequence of Eq. (O3) is that the operator V Φ1 defined in Eq. ( 15) can be represented in λ H λ ⊗ H Ae as where R 1 ± θ λ 2 , Φ 1 is given in Eq. (9).In order to obtain functions with vanishing imaginary part, one can initialize and post-select the ancillas in the state |0 ∈ H A , which corresponds in H S to applying the operator where B cos θ λ 2 is the the first matrix element of R 1 θ λ 2 , Φ 1 .Because of the halved angle, the achievable functions can be written, according to Lem. 17 We are now able to prove Lemma 9, which is a direct consequence of Lemmas 19 and 18, and of the qubitization construction.
Proof of Lemma 9.If the QSP sequence Φ 1 is taken as to reproduce the coefficients of the finite Chebyshev expansion of a continuous target function f (see App. L) with f (λ) = f∞ (λ), then Eq. (O1) means that V Φ1 is a perfect block-encoding of fq (H) a (1,ε )-blockencoding of f (H), with ε given by the truncation error max λ∈[−1,1] |f (λ) − fq (λ)|.fq (λ) can be the truncated Chebyshev expansion of f (λ) or an interporlating series.f is assumed to be an analytical function and, therefore, ε is related to the series order q/2 according to Lemma 18.The achievability condition (10) is not a strong limitation as f is bounded and we can always redefine f (λ) = f (λ) fmax , with f max = max −1≤λ≤1 f (λ).Note that, even though |B(λ)| ≤ 1 + ε , it can be rescaled as to satisfy Eq. ( 10) [15].
Each one of the q operators V φ k or Vφ k in V Φ1 calls the qubitized oracle once which means calling the oracle O 1 twice.Thus 2q queries are necessary in total.Moreover the number of gates per query is basicaly given by the number of gates used for qubitization times a constant factor due to the controlled action of O 1 .The control ancilla contributes with a constant small number of gates, such that the total number of gates per query is O(g O1 + |A O1 |).

Figure 1 .
Figure 1.High-level schematics of our algorithms.a) QITE primitives: A system register S carries the input state |Ψ , whereas an ancillary register A is initialised in a computational-basis state |0 .A unitary transformation U F β (H) , composed of a sequence {V k } ∈[q]of q gates, with [q] := {1, . . .q}, is applied and then the ancillas are measured.Each gate makes one query to the Hamiltonian oracle (not shown).The specific choice of gates in the sequence is determined by a powerful formalism for operator-function design, known as quantum signal processing, which allows us to synthesise the QITE propagator F β (H) := e −β(H−λ min ) efficiently.More precisely, conditioned on detecting |0 on the ancillas, the desired state

Figure 2 .
Figure 2. Intuition behind the complexity reduction by fragmentation.The overall complexity of the probabilistic master algorithm is dominated by the area of the yellow rectangle.In contrast, the corresponding complexity of the fragmented algorithm (here, for the exemplary case of r = 3 fragments) is dominated by the area of the blue-shaded rectangles.Up to logarithmic corrections in the precision, the cumulative width of the blue-shaded rectangles coincides with the width of the yellow one, of order β.In contrast, while the height of the yellow rectangle is of order 1/pΨ(β), the height of the blue-shaded ones decreases from order 1/pΨ(β) till order pΨ(βr−1)/pΨ(β), making the blue-shaded area smaller than the yellow one.For high-enough β, the reduction can be so strong that the complexity of the fragmented algorithm can reach even that of the coherent algorithm (not shown).This intuition is rigorously proven for Primitive 1 (in Theorem 5) and numerically verified to exhaustion for both Primitives 1 and 2 (in Sec.III D).
t e x i t s h a 1 _ b a s e 6 4 = " t Y L R e F + e H 5 9 f k n b

Figure 5 .
Figure 5. Optimal fragmentation schedules for Primitive 1 versus inverse temperature.System sizes are N = 5 (blue), N = 10 (orange), and N = 15 (green).Solid curves represent the means over 1000 random weighted-MaxCut Hamiltonians, whereas (the thicknesses of) shaded curves are the standard deviations (see App.J).The tolerated error is ε = 10 −3 .Qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 ; and the same holds for the other Hamiltonian classes in Fig.4.The upper panel shows the optimal number of fragments r for uniform schedules Sr,1.The central and lower panels respectively show the optimal r and a for non-uniform schedules Sr,a.The dashed and dotted curves in the upper and lower panels respectively represent fits over the Ansätze r(β) = A β η and a(β) = A β η , with A and η ∈ R. The fit results are shown in the insets.Remarkably, for non-uniform schedules, the observed scaling for r is constant not only with β but also with N .
respectively, with M the single-qubit Hadamard matrix.The resulting circuit, P 1 , is depicted in Figs.6.a) and 6.b).

Figure 6 .
Figure 6.QSP primitives for generic operator function design.a) Both circuits P1 from Alg. 2 and P2 from Alg. 3 have the same structure.If the ancillas are initialised and post-selected in |0 A, the circuit prepares the system state fq (H)|Ψ || fq (H)|Ψ || , which ε-approximates the target output f (H)|Ψ f (H)|Ψ .The details specific to P1 and P2 are respectively shown in panels b) and c).Win and Wout are fixed ancillary unitaries, and M is a single-qubit Hadamard gate.The basic blocks V k in panel a) represent the gates V φ k in b) and V ξ k in c).Each V φ k involves one query to the qubitized oracle O 1 , which in turn requires one query to O1 and one to its inverse O † 1 (see Fig. 13).Whereas each V ξ k involves one query to the oracle O2.Vk is defined as V k but with O † 1 substituting O 1 or O † 2 substituting O2.Hence, the query complexities of P1 and P2 are respectively 2q and q.The approximating function fq is determined by the angles Φ1 = φ1, • • • , φq+1 or Φ2 = {ξ 0 , • • • , ξ q } in the rotations Rz(2φ k ) = e iφ k Z or Ry(2κ k ) = e iκ k Y and Rzyz(ξ k ) = Rz(ζ k + η k ) Ry(2ϕ k ) Rz(ζ k − η k ), with ξ k = {ζ k , η k , ϕ k , κ k }.For P1 and P2, these angles are chosen such that fq is a highprecision Chebyshev and Fourier approximation of f , respectively.

1 .
QITE-primitive from a real-time evolution oracle QITE primitive 2 is the output of Alg. 3 for f = F β .The proof of Theo. 2 thus follows straight from Lemma 10.
Appendix B: Post-selection probability propagation onto output state error

the error in l 2 -Figure 7 .
Figure 7. Probabilistic versus coherent master QITE algorithms.a) The probabilistic approach repeatedly applies the unitary U F β (H) generated by the primitive (on independent preparations of |0 ∈ HSA) until the post-selection is successful, i.e. the measurement on the ancillas returns |0 as outcome.This takes on average O 1/pΨ(β, α) repetitions.No a priori knowledge of the input state |Ψ is required (it can be fully generic, even mixed).b) The coherent approach is based on quantum amplitude amplification.It operates only on pure input states.So, if the input state is mixed, an extra ancillary register Apur of |Apur| = |S| = N qubits is required to purify it.This is the case in quantum Gibbs state sampling, where |Ψ is a purification of the maximally mixed state on HS .The primitive is repeated applied sequentially (on the same preparation of |Ψ |0 ), interleaved with reflection operators R |Φ 0 and R |0 around the states |Φ0 := U F β (H) |Ψ |0 ∈ HSA and |0 ∈ HA, respectively.In practice, this requires full a priori knowledge of |Ψ .The total number of repetitions of the primitive is O 1/ pΨ(β, α) , after which the desired output is obtained with probability close to 1. Hence, the coherent master algorithm displays a significantly lower overall query complexity than the probabilistic one.However, in return, the former requires much larger circuit depth than the latter.
Appendix D: Optimal sub-normalisation for QITE Primitive 2

Figure 9 .
Figure 9. Histograms of β1/8 ln(4/ε 1 ) for the optimized schedules Sr,a from Eq. (8).The upper panels correspond to the quantum spin glass Hamiltonians with N = 14 (left) and N = 15 (right) qubits, the central ones to the quatum RBM with N = 4 (left) and N = 5 (right), and the lower ones to the weighted Max-Cut with N = 12 (left) and N = 13 (right).Similar behaviours are observed for other values of N and ε.For the upper and central panels we took ε = 0.001, whereas for the lower ones ε = 0.01.For each Hamiltonian instance H, we took several different values of β; and each pair (H, β) constitutes a different event in the histogram.For the panels where N ≤ 12, the histograms are built from 81 uniformly chosen values of β from 0 to 2000 for each of the 1000 Hamiltonian instances in each class, giving a total of 81000 events.Whereas for those where N > 12, they are built from 21 uniformly chosen values of β from 0 to 10000 and 100 Hamiltonian instances from each class, giving a total of 2100 events.The results show that β1 is typically 1000 (and at worst 20) times smaller than 8 ln(4/ε 1 ).This confirms that the first fragment operates deep into the regime of optimality of P1 even for uniformly sampled β.If, instead of uniform values of β between 0 and the high values mentioned above, we choose β values close to the corresponding critical inverse temperature, then β1/8 ln(4/ε 1 ) concentrates even more at the first column close to zero.

Lemma 17 .
Given two polynomials B(x), D(x) : [−1, 1] → R, and q ∈ N even, there exists Φ 1 = (φ 1 , • • • , φ q+1 ) ∈ R q+1 such that B(cos θ) = Re[B(cos θ)] (or B(cos θ) = Im[B(cos θ)]) and D(cos θ) = Re[D(cos θ)] (or D(cos θ) = Im[D(cos θ)]) for all θ ∈ [−π, π], with B and D as in Eq. (9), if and only if B and D satisfy: H s 3 G S U w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " T i s J 4 5 d a u J C N r g 0 R s R r j k e u r l c 4= " > A A A B 7 H i c b V B N S 8 N A E N 3 4 W e t X 1 a M i w S J 4 K o m C e i x 6 8 d i C a Q t t K J v t p F 2 6 2 Y T d i V B C j 5 6 9 e F D E q 7 + h v 8 O b v 8 E / 4 f b j o K 0 P B h 7 v z T A z L 0 g E 1 + g 4 X9 b S 8 s r q 2 n p u I 7 + 5 t b 2 z W 9 j b r + k H s 3 G S U w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " T i s J 4 5 d a u J C N r g 0 R s R r j k e u r l c 4= " > A A A B 7 H i c b V B N S 8 N A E N 3 4 W e t X 1 a M i w S J 4 K o m C e i x 6 8 d i C a Q t t K J v t p F 2 6 2 Y T d i V B C j 5 6 9 e F D E q 7 + h v 8 O b v 8 E / 4 f b j o K 0 P B h 7 v z T A z L 0 g E 1 + g 4 X9 b S 8 s r q 2 n p u I 7 + 5 t b 2 z W 9 j b r + k

Figure 10
Figure 10.Runtimes and circuit depths of quantum Gibbs-state samplers running on Primitive 2 versus inverse temperature.The color code is the same as in Fig.3: red corresponds to probabilistic, green to coherent, blue to fragmented with uniform schedule Sr for the best r, and orange to fragmented with a schedule Sr,a as in Eq. (8) for the best r and a (see also Fig.12).The Hamiltonian classes are also the same as in Fig.3.For simplicity, here we use just a subset of 100 random instances from each class out of the 1000 used for Fig.3.The examples correspond to N = 12 qubits and an error ε = 10 −3 , but qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 .Upper panels: average overall query complexity.As mentioned, also for P2 does fragmentation with non-uniform schedule outperform coherent QITE, except the critical inverse temperature βc is now higher than in Fig.3.The dependance of βc with N is shown and discussed in Fig.11.Lower panels: average query depth.As with P1, also for P2 does coherent QITE lie orders of magnitude above probabilistic QITE; while fragmented QITE is almost identical to the latter.
z s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0 z s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0 z s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0 W L F p n G D Z s i g Y 3 S G L p C H r l A J 3 a E y q i K K H t A T e k G v z s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0A a h D W Y O y r T 6 s K K 8 R H Y = " > A A A B 9 H i c b V C 7 S g N B F J 3 1 G e M r a q n I Y B C s w m 4 K t Q z a W C Z g H p A s Y X Z y N x k y M 7 v O z A b C k t J v s L F Q x N Y 6 3 2 H n N / g T T h 6 F J h 6 4 c D j n X u 6 9 J 4 g 5 0 8 Z 1 v 5 y V 1 b X 1 j c 3 M V n Z 7 Z 3 d v P 3 d w W N N R o i h U a c Q j 1 Q i I B s 4 k V A 0 z H B q x A i I C D v W g f z v x 6 w N Q m k X y 3 g x j 8 A X p S h Y y S o y V / F Y A h r T T l h K Y j t q 5 v F t w p 8 D L x J u T f O l k X P l + P B 2 X 2 7 n P V i e i i Q B p K C d a N z 0 3 N n 5 K l G G U w y j b S j T E h P Z J F 5 q W S i J A + + n 0 6 B E + t 0 o H h 5 G y J Q 2 e q r 8 n U i K 0 H o r A d g p i e n r R m 4 j / e c 3 E h N d + y m S c G J B 0 t i h M O D Y R n i S A O 0 w B N X x o C a G K 2 V s x 7 R F F q L E 5 Z W 0 I 3 u L L y 6 R W L H i X h W L F p n G D Z s i g Y 3 S G L p C H r l A J 3 a E y q i K K H t A T e k G vz s B 5 d t 6 c 9 1 n r i j O f O U J / 4 H z 8 A K G e l b c = < / l a t e x i t > c < l a t e x i t s h a 1 _ b a s e 6 4 = " e z o m E 6 r P 0 A a h D W Y O y r T 6 s K K 8 R H Y = " > A A A B 9 H i c b V C 7 S g N B F J 3 1 G e M r a q n I Y B C s w m 4 K t Q z a W C Z g H p A s Y X Z y N x k y M 7 v O z A b C

Figure 12 . 1 dl/ 2 − 2 .
Figure 12.Optimal fragmentation schedules for P2 versus inverse temperature.System sizes and color code are the same as in Fig.5: N = 5 (blue), N = 12 (orange), and N = 15 (green).Solid curves represent the means over 100 random weighted-MaxCut Hamiltonians, whereas (the thicknesses of) shaded curves are the standard deviations.The error is ε = 10 −3 .Qualitatively identical behaviors are observed for all N between 2 and 15 as well as for ε = 10 −2 and ε = 10 −1 ; and the same holds for the other Hamiltonian classes.For uniform schedules Sr,1 (not shown in the figure), a constant r = 2 is observed to minimize QS r but the resulting complexity does not reach Q coh over the domain scanned in Fig.10.The upper and lower panels respectively show the optimal r and a for non-uniform schedules Sr,a.The dotted curves in the lower panel represent fits over the Ansatz a(β) = A β η , with A, η ∈ R. The fit results are shown in the inset.As with P1, also here is r constant not only with β but also with N , remarkably.

2 ≤ 1 .
The only real polynomial of interest, out of the four composing the QSP operator R 1θ λ 2 , Φ 1 , is Re B cos θ λ 2, the other three can be determined as to keep achievability.Provided that Re B cos θ λ 2 has the form in Eq. (O6), the only remaining condition is that Re B cos θ λ With the choice D(λ) = 0 and B(λ) = fq (λ), condition (10) is satisfied, assuring the existence of such Φ 1 .
r , and QITE primitives {P ∆β l ,ε l ,α l } l∈[r] querying an oracle for H (for ∆β l ∈ S r , ε l given in Eq (6), and α l > 0).output: F β (H)|Ψ /||F β (H)|Ψ || up to trace-distance error O(ε).apply the circuit P ∆β l ,ε l ,α l on S and A; 1 Set l = 1, initialize S in the input state |Ψ and A in the reference state |0 ; 2 while l ≤ r do 3 1 Ψ .For simplicity, we state the theorem explicitly for the restricted case of H non-degenerate, with a unique ground state |λ min of overlap o 2 := | λ min |Ψ | 2 with |Ψ .However, it can be straightforwardly generalized to the degenerate case by redefining o 2 as the overlap with the lowest-energy subspace.Let |λ min ∈ H S be the unique ground state of H and |Ψ ∈ H S such that 0 < o ≤ 1/2.2.Define the critical inverse temperature β c = 2 [39]in α, L, and a s.t.ε tr ≤ ε ; 2 obtain q s.t.Eq.(18) holds; 3 calculate the Fourier coefficients c; 4 calculate the rotation angles Φ 2[39]; 5 begin construction of P 2 : δ ∈ (0, π/2), oracle O 2 for H at a time t = π/2 − δ, and its inverse O † 2 .output: unitary quantum circuit P 2 . 1 6 apply W in from Eq. (22) on A; , a single call to an RTE oracle suffices to find the parity with probability greater than 1/2.Hence, it is the query complexity RTE oracles themselves what is lower-bounded by the parity considerations above, but not that of RTE-based QITE primitives.It is an open question whether similar bounds can be obtained for RTE-based QITE primitives by other arguments.