Practical Hamiltonian learning with unitary dynamics and Gibbs states

We study the problem of learning the parameters for the Hamiltonian of a quantum many-body system, given limited access to the system. In this work, we build upon recent approaches to Hamiltonian learning via derivative estimation. We propose a protocol that improves the scaling dependence of prior works, particularly with respect to parameters relating to the structure of the Hamiltonian (e.g., its locality k). Furthermore, by deriving exact bounds on the performance of our protocol, we are able to provide a precise numerical prescription for theoretically optimal settings of hyperparameters in our learning protocol, such as the maximum evolution time (when learning with unitary dynamics) or minimum temperature (when learning with Gibbs states). Thanks to these improvements, our protocol has practical scaling for large problems: we demonstrate this with a numerical simulation of our protocol on an 80-qubit system.


I. INTRODUCTION
An important task for learning about many-body quantum systems is to learn the associated Hamiltonian operator efficiently (i.e., without requiring resources that scale exponentially in system size). There are multiple physical motivations for this problem. First, for an isolated many-body system, all the information about its future state is contained in the initial state and the Hamiltonian H, since the Hamiltonian generates the time evolution operator e^{-iHt}. Second, for a system in thermal contact with an environment at inverse temperature β, the thermal equilibrium state is of the form e^{-βH}/Tr(e^{-βH}), and hence is also determined by the Hamiltonian. Third, the spectral response of a system, such as its absorption and emission spectra, is determined by its Hamiltonian.
The Hamiltonian learning problem is a highly relevant task across many domains. In condensed matter physics, we can experimentally verify our models of quantum materials by comparing theoretical predictions about their effective interactions with the interactions inferred by Hamiltonian learning (Burgarth and Ajoy 2017, Wang et al. 2017, Kwon et al. 2020, Wang et al. 2020). This verification is also applicable to quantum device engineering. With the expanding capabilities of quantum computers, it is increasingly important to be able to certify their behavior (Carrasco et al. 2021). While benchmarking protocols can give coarse-grained information about a particular quantum device, knowing its Hamiltonian can be significantly more powerful, allowing us to design improved devices (Boulant et al. 2003, Innocenti et al. 2020, Ben Av et al. 2020) or better understand the physical origin of failure modes (Shulman et al. 2014, Sheldon et al. 2016, Sundaresan et al. 2020).
In this work, we will treat the system under study as a black box with an unknown Hamiltonian H, and our goal will be to efficiently infer H with access to only a limited number of inputs to, and outputs from, the black box. Importantly, we assume that we can only interact with the system classically (see Figure 1). The key defining characteristic of 'classical' interaction is that we prohibit any quantum channel between the system under study (whose Hamiltonian we are trying to learn) and any other quantum processing unit; that is, we rule out setups used in other approaches that assume we can interact with the system under study via another trusted quantum simulator (e.g., Wiebe et al. 2014a,b, Verdon et al. 2019). Two examples of what we call 'classically limited' interactions are making measurements on time-evolved states (Figure 1a) or on Gibbs states (Figure 1b). For the former, we initialize the system in some known (mixed) state ρ_0 and evolve it forward in time by t, resulting in the state

ρ(t) = e^{-iHt} ρ_0 e^{iHt}. (1)

For the latter, we assume we have access to a system in thermal equilibrium at a temperature β^{-1}; that is, we have access to the Gibbs state

ρ(β) = exp(-βH) / Tr(exp(-βH)). (2)

FIG. 1. Two models of classical interaction with a black box quantum system. (a) Time evolution: we can control three quantities: ρ_0, t, and O. We assume we can evolve the input state ρ_0 forward in time; after a time t, we make a measurement of the observable O. We view the system as a set of oracles indexed by the state preparation and measurement parameters (ρ_0 and O in the time evolution case, and O in the Gibbs state case). These oracles take some input t or β, and we use their output to characterize the Hamiltonian.
In these two models of interaction, we assume we can control the parameters t and β, respectively. Finally, we assume that we can measure some observable O of the final states ρ(t) and ρ(β). However, we do not demand full control over ρ_0 and O, since arbitrary quantum states are hard to prepare (Plesch and Brukner 2011). We only require that ρ_0 and O be tensor products over single sites. More precisely, we assume that we can prepare any ρ_0 ∈ {I/2, (I + σ^x)/2, (I + σ^y)/2, (I + σ^z)/2}^{⊗n} and measure any observable O ∈ {I, σ^x, σ^y, σ^z}^{⊗n}. Our focus on these two methods of interacting with the system is in part motivated by the classical analogue of the quantum Hamiltonian learning problem. This classical analogue is well studied in the machine learning community, with two primary approaches: learning using system dynamics (Brunton et al. 2016, Pan et al. 2016, Trischler and D'Eleuterio 2016, Lusch et al. 2018, Course et al. 2021) or learning from samples of the Gibbs distribution (Abbeel et al. 2006, Santhanam and Wainwright 2012, Bresler et al. 2013, Lokhov et al. 2018).
Using these two models of black-box interaction, we propose a method for Hamiltonian learning that relies on a simple intuition. For a given choice of state preparation and measurement (SPAM) settings, we can define a function f_SPAM as the expectation value of an observable (specified by the SPAM settings) on the final state ρ(t) or ρ(β):

f_SPAM(x) = Tr(O ρ(t = x)) for unitary evolution, or Tr(O ρ(β = x)) for Gibbs states. (3)

We will show that for the appropriate choice of SPAM parameters, f_SPAM(x) can be viewed as a black box function in x, whose Taylor expansion can be connected in a straightforward manner to the coefficients of the Hamiltonian. More specifically, for each SPAM setting, the first-order coefficient f′_SPAM(x = 0) will yield one of the Hamiltonian's coefficients.
In this work, we will formalize this idea into a protocol that is efficient in theory, and demonstrate its usefulness in practice.

II. PRELIMINARIES
Before describing the contributions of this work, we give a formal definition of the Hamiltonian learning problem and define the class of Hamiltonians to which we restrict our attention.
Definition 1 (Hamiltonian learning problem). Fix a Hamiltonian on an n-qubit system that has an expansion in the Pauli basis,

H = Σ_{m=1}^{r} θ_m P_m,

where each P_m ∈ {I, σ^x, σ^y, σ^z}^{⊗n} is a Pauli operator and the θ_m ∈ R are the Hamiltonian coefficients. We will denote the vector of Hamiltonian coefficients Θ = (θ_1, …, θ_r)^T. We assume the Hamiltonian is traceless (i.e., no P_m = I^{⊗n}), and that we know the structure of the Hamiltonian (i.e., which Paulis P_m are present in the expansion), but that the coefficients θ_m are unknown. The Hamiltonian learning problem is to infer all of the coefficients θ_m up to an additive error ε · max_m |θ_m|, with success probability at least 1 − δ.
In this work, we restrict our attention to a broad class of Hamiltonians that we call sparsely interacting Hamiltonians.A sparsely interacting Hamiltonian is defined below.
Definition 2 (Sparsely interacting Hamiltonian). The interaction graph (called the "dual" interaction graph in Haah et al. (2021)) of a Hamiltonian is a graph G consisting of a set of vertices V and edges E. Each vertex represents one Pauli operator P_i in the Hamiltonian, and there is an edge between two vertices if the supports of their corresponding Pauli operators overlap. The support of a Pauli, supp(P), is the set of sites that P acts nontrivially on. We also define the degree of the Hamiltonian, D, to be the maximum degree of any node in the interaction graph:

D = max_{v ∈ V} deg(v).

A Hamiltonian is sparsely interacting if D = O(1) (that is, D does not depend on system size). Notably, this class of Hamiltonians includes k-local Hamiltonians, as the locality constraint implies that the number of terms overlapping with any given Pauli term is a function of k alone.
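As a concrete illustration of Definition 2, the degree D can be computed directly from the Pauli supports. Below is a minimal sketch; the string encoding of Pauli terms (`"IZZI…"`) is our own illustrative convention, not notation from this paper.

```python
def support(pauli_string):
    """Set of sites a Pauli string acts on nontrivially ('I' = identity)."""
    return {i for i, p in enumerate(pauli_string) if p != "I"}

def interaction_degree(terms):
    """Max degree D of the interaction graph (Definition 2): vertices are
    Pauli terms, edges join terms whose supports overlap."""
    supports = [support(t) for t in terms]
    degree = [0] * len(terms)
    for a in range(len(terms)):
        for b in range(a + 1, len(terms)):
            if supports[a] & supports[b]:
                degree[a] += 1
                degree[b] += 1
    return max(degree)

# 9-qubit TFIM of Example 2.1: Z_i Z_{i+1} couplings plus X_i fields.
n = 9
zz_terms = ["I" * i + "ZZ" + "I" * (n - i - 2) for i in range(n - 1)]
x_terms = ["I" * i + "X" + "I" * (n - i - 1) for i in range(n)]
D = interaction_degree(zz_terms + x_terms)
```

Each bulk ZZ term overlaps two neighboring ZZ terms and two X fields, giving D = 4, consistent with the value quoted for the 1-dimensional TFIM in Section III.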
Example 2.1. Below, we show a sample interaction graph for a 9-qubit transverse field Ising model (TFIM), whose Hamiltonian is

H = Σ_{i=1}^{8} J_i σ^z_i σ^z_{i+1} + Σ_{i=1}^{9} B_i σ^x_i.

The TFIM will serve as a prototypical example for the rest of this work.

III. PRIOR WORK AND OUR CONTRIBUTION
The assumption of a sparsely interacting (or k-local) Hamiltonian can often simplify the learning problem. For instance, in an early work (da Silva et al. 2011), it was shown that systems with local Hamiltonians can be efficiently characterized without the expensive requirements of full state tomography. However, this method was only applicable in certain cases, and was found to be prohibitively expensive in general. Later approaches demonstrated that machine learning could be applied successfully on small systems (Hentschel and Sanders 2010, 2011, Sergeevich et al. 2011, Granade et al. 2012), but these methods did not come with rigorous performance guarantees or scaling results that would allow them to be applied to larger systems. There have also been a number of proposals (Qi and Ranard 2019, Bairey et al. 2019, Evans et al. 2019) to learn the coefficients of the Hamiltonian by solving a system of linear equations, where the coefficient matrix is determined by local measurement outcomes. However, the performance of these approaches is determined by the spectral gap of this coefficient matrix, which is not yet well characterized.
A recent series of works (Haah et al. 2021, Anshu et al. 2021, Sbahi et al. 2022) focuses on Hamiltonian learning with Gibbs states. Notably, Haah et al. (2021) found asymptotically optimal (with respect to error and failure probability) methods for Hamiltonian learning from high-temperature Gibbs states. They propose an algorithm that requires a number of copies of the Gibbs state (at temperature β^{-1}) that scales with β^{-2} and polynomially in D (see Definition 2). However, they impose the high-temperature constraint β^{-1} ≥ 25e^6 (D + 1)^{10}. Although this still results in polynomial scaling with D, it is not feasible in practice, as this constraint on β results in a query complexity with a prefactor that scales with D^{21}. For one of the simplest non-trivial cases where we might apply Hamiltonian learning, the 1-dimensional transverse field Ising model (which has D = 4), the required number of copies of the Gibbs state has a prefactor D β^{-2} ≥ 10^{22}. Despite this, the techniques developed in that work are useful, particularly the central role of D in the calculations.
Finally, there have also been proposals to learn the Hamiltonian from short-time unitary dynamics (Hangleiter et al. 2021, Yu et al. 2022, França et al. 2022). Yu et al. (2022) consider the case where the Hamiltonian structure is not known a priori, and develop a protocol that is robust to circuit noise and SPAM errors and avoids the exponential query requirements of full tomography. However, their protocol has relatively strong requirements on how to interact with the Hamiltonian: it requires the periodic insertion of gates between different time evolutions. Furthermore, their analysis does not include the effect of the evolution time t_0 on their algorithm's performance. França et al. (2022) developed a protocol with scaling similar to that of Haah et al. (2021). Notably, it can incorporate Lindbladian terms, and it is compatible with algebraically decaying interactions. However, the protocol does not take advantage of prior knowledge about the Pauli terms present in the Hamiltonian, which can result in a measurement parallelization overhead that scales with 16^k (where k is the locality of the Hamiltonian).
In Section IV, we will describe an algorithm that addresses some of the shortcomings of previous works. Our protocol has a query complexity that scales like that achieved by Haah et al. (2021), except our dependence on the parameter D will be O(D^4) rather than O(D^{21}). Furthermore, we parallelize our measurements in a way that avoids the O(16^k) scaling of França et al. (2022), and instead requires a parallelization overhead that scales with O(D^2). This offers an improvement whenever we have prior information about the Pauli terms present in the Hamiltonian, so that D < 4^k; indeed, a more typical assumption is D ∼ O(poly(k)). In summary, we achieve an efficient query complexity and classical processing time complexity for Hamiltonian learning using unitary dynamics. Similar to França et al. (2022), this can be generalized, via careful selection of initial states and measurements, to learn the Lindbladian (when expanded in the Pauli basis) of open quantum systems undergoing Markovian dynamics. The query and classical processing time complexities using Gibbs states are only worse by factors of D and D^2, respectively. Finally, in Section V, we will discuss heuristic optimizations of our algorithm that can improve performance in practice. We will then show numerical results on an 80-qubit transverse field Ising model that indicate the feasibility of our algorithm in practice.
IV. METHODS

A. Outline of our approach
We outline our basic approach below.The following can be considered a generalization of the ideas in França et al. (2022).This generalization allows us to learn using Gibbs states as well as unitary dynamics, and is also able to reduce measurement parallelization overhead.
1. Connect the oracle and the Hamiltonian parameters. Write the Taylor expansion of Equation (3),

f_SPAM(x) = Σ_{k=0}^{∞} c_k x^k,

and determine the connection between the first-order coefficient c_1 and the Hamiltonian. For both unitary dynamics and Gibbs states, this connection can be made extremely straightforward by appropriately setting the SPAM parameters.
2. Bound higher order derivatives. Establish a bound for the higher order coefficients |c_k| in terms of the structure parameter D. This bound is similar in spirit to the shadow norm in shadow tomography (Huang et al. 2020). The scaling we find for |c_k| differs depending on whether we are using unitary dynamics or Gibbs states (just as the oracles used differ). Furthermore, similarly to Huang et al. (2020), this bound determines the performance of the rest of our algorithm; in this work, we find such a bound in Equation (12).

3. Estimate the first-order coefficient. Fit a degree-L polynomial to noisy evaluations of f_SPAM on an interval [0, A], and read off c_1. Observe that if |c_L| grows no faster than a factorial, as is the case in Equation (12), the modeling error decreases (at least) as a power law in L for suitably chosen A. However, our overall error scaling is nowhere near as good as this due to the presence of noise when evaluating f_SPAM (increasing L results in an increase in the variance of our estimator for c_1). The modeling error (bias) must be carefully traded against the effects of noise (variance).
4. Apply simultaneous measurements. If possible, devise a scheme to execute the previous step using simultaneous measurements to estimate several parameters at once. When the SPAM settings use measurements whose locality does not exceed that of the Hamiltonian (as is the case in this work), this is possible. This enables a query complexity that is sublinear in the number of Hamiltonian coefficients r.
In Section IV B, we first establish an elementary procedure for estimating the first-order derivative f′(0) given access only to noisy estimates of f. Then, in Section IV C, we apply this procedure to Hamiltonian learning with unitary dynamics and Gibbs states.
B. Inferring the First-Order Commutator

For a system evolving under a Hamiltonian H from an initial state given by some density matrix ρ_0, the expectation value of any operator P can be written as

⟨P(t)⟩ = Tr(P e^{-iHt} ρ_0 e^{iHt}) = Σ_{k=0}^{∞} (t^k / k!) Tr((i[H, ·])^k(P) ρ_0), (13)

where (i[H, ·])^k(P) denotes the k-fold nested commutator i[H, i[H, …, i[H, P]]]. This equality simply uses the Heisenberg expansion of the time-evolved operator P(t).
In this section, we define a critical subroutine of our Hamiltonian learning algorithm that infers the expectation Tr((i[H, P])ρ_0) by measuring time-evolved expectation values. The main idea behind our algorithm is that Tr((i[H, P])ρ_0) is the time derivative of the expectation Tr(P e^{-iHt} ρ_0 e^{iHt}). More specifically, the Heisenberg expansion in Equation (13) lets us model ⟨P(t)⟩ as a univariate power series in time,

⟨P(t)⟩ = Σ_{k=0}^{∞} c_k t^k, with c_k = (1/k!) Tr((i[H, ·])^k(P) ρ_0).

If we were able to access ⟨P(t)⟩ exactly, the most effective way to find c_1 would be to simply differentiate ⟨P(t)⟩ via finite differences with very small ∆t (i.e., c_1 ≈ (⟨P(∆t)⟩ − ⟨P(0)⟩)/∆t). Since our measurements of ⟨P(∆t)⟩ are subject to shot noise, the variance of this estimator scales with O((∆t)^{-2}), preventing us from using arbitrarily small ∆t. However, as ∆t grows, the bias of the finite difference estimator grows. Algorithm 1 is a generalization of finite differencing, and uses Chebyshev regression (see Appendix A) to estimate c_1. This algorithm takes as input a maximum evolution time A and a cutoff degree L for the Chebyshev polynomial. The finite cutoff degree induces biases in the recovered polynomial coefficients; however, we will demonstrate that this bias is suppressed much more effectively than for the finite-difference estimator, as these errors scale as a power law with exponent L.
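The identity c_1 = Tr(i[H, P]ρ_0) underlying this subroutine can be checked numerically on a small system. The sketch below uses a toy two-qubit Hamiltonian and SPAM settings of our own choosing (they are illustrative, not the paper's), comparing the exact commutator expectation against a small-∆t finite difference.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)

H = 0.8 * np.kron(Z, Z) + 0.3 * np.kron(X, I2)   # toy 2-qubit Hamiltonian
P = np.kron(Y, I2)                                # single-qubit observable
rho0 = np.kron((I2 + Z) / 2, I2 / 2)              # product initial state

def expval(t):
    """Tr(P e^{-iHt} rho0 e^{iHt}), via eigendecomposition of Hermitian H."""
    w, V = np.linalg.eigh(H)
    U = V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T
    return (P @ U @ rho0 @ U.conj().T).trace().real

# c1 = Tr(i[H, P] rho0); for these choices it evaluates to -0.6.
c1_exact = (1j * (H @ P - P @ H) @ rho0).trace().real
dt = 1e-5
c1_fd = (expval(dt) - expval(0.0)) / dt           # finite-difference estimate
```

In an experiment, `expval` would be replaced by shot-noise-limited averages, which is exactly why Algorithm 1 replaces the naive finite difference with Chebyshev regression.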
As mentioned in the beginning of this section, this error bound depends on a bound for the derivatives of ⟨P(t)⟩. Since ρ(t) is a density matrix, a simple application of the Von Neumann trace inequality (Mirsky 1975) shows that |Tr((i[H, ·])^m(P) ρ(t))| ≤ ‖(i[H, ·])^m(P)‖ (where ‖·‖ denotes the spectral norm). We can bound the spectral norms of iterated commutators with the Hamiltonian as follows.

Theorem 1 (Iterated commutator norm bound). When P is a single-qubit observable, the spectral norm of the m-fold iterated commutator ‖[H, [H, …, [H, P]]]‖ is bounded by an explicit function of m, D, and ‖Θ‖_∞ (given in Appendix B), where the ∞ norm denotes ‖Θ‖_∞ = max_i |θ_i|. For commuting Hamiltonians (i.e., every term P_i in the Hamiltonian commutes with every other term), the bound tightens further. Proof. See Appendix B.
Definition 3 (Typical scales). The form of the above bound is indicative of two typical scales. From it, we take τ to define a typical time scale for our Hamiltonian, and γ to define a typical scale for our Hamiltonian coefficients.
Algorithm 1 Estimating the first derivative Tr((i[H, P])ρ_0)
1: for ℓ ← 1, …, L do (Construct the dataset D, Definition 4)
2:   y_ℓ ← estimate of Tr(P e^{-iH t_ℓ} ρ_0 e^{iH t_ℓ}) (Average N measurement outcomes of P)
3: for m ← 1, …, L − 1 do (Estimate the Chebyshev coefficients, Theorem 5)
4:   ĉ_m ← Chebyshev coefficient estimate
5: return ĉ_1

Definition 4 (Dataset). We construct a dataset with L evaluations of the expectation ⟨P(t)⟩, evaluated at the roots of the Lth Chebyshev polynomial. Our dataset comprises L points,

D = {(t_i, y_i)}_{i=1}^{L}, with t_i = A(1 + z_i)/2,

where z_i are the roots of the Lth Chebyshev polynomial; the map t_i = A(1 + z_i)/2 ensures that the evolution time is nonnegative and never exceeds A.
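Algorithm 1 can be sketched end to end with numpy's Chebyshev utilities. This is a simplified illustration: the black-box expectation (a sine curve), the shot-noise model, and the settings A = 0.5, L = 4 are all placeholder choices of ours, not the tuned values the paper prescribes.

```python
import numpy as np

def estimate_c1(f_mean, A, L):
    """Algorithm 1 sketch: estimate f'(0) from evaluations of f on [0, A].
    The L evaluation times are the roots of the Lth Chebyshev polynomial
    mapped from [-1, 1] to [0, A] (Definition 4)."""
    z = np.cos(np.pi * (2 * np.arange(1, L + 1) - 1) / (2 * L))  # roots of T_L
    t = A * (1 + z) / 2
    y = np.array([f_mean(ti) for ti in t])
    coeffs = np.polynomial.chebyshev.chebfit(2 * t / A - 1, y, L - 1)
    cheb = np.polynomial.chebyshev.Chebyshev(coeffs, domain=[0, A])
    return cheb.deriv()(0.0)  # the c1 estimate

# Toy black box with <P(t)> = sin(0.7 t), so the true c1 is 0.7.
rng = np.random.default_rng(0)

def noisy_mean(t, shots=200_000):
    p_plus = (1 + np.sin(0.7 * t)) / 2           # Born probability of outcome +1
    outcomes = np.where(rng.random(shots) < p_plus, 1.0, -1.0)
    return outcomes.mean()                       # shot-noise-limited estimate

c1_clean = estimate_c1(lambda t: np.sin(0.7 * t), A=0.5, L=4)
c1_hat = estimate_c1(noisy_mean, A=0.5, L=4)
```

With noiseless evaluations the degree-(L−1) fit recovers c_1 to high accuracy; with shot noise the estimate fluctuates, which is the bias-variance trade-off that the choices of A and L must balance.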
The following theorem shows that for the appropriate choice of evolution time A and Chebyshev degree L, the error scaling is close to being noise-limited.
Theorem 2 (Query complexity for one coefficient). Fix some failure probability δ and an error ε. Assume that we have access to an unbiased (single-shot) estimator of ⟨P(t)⟩ with variance σ² ≤ 1. Then there is some choice of A ∼ τ and L ∼ log(ε^{-1}) such that, with a query complexity scaling as O(polylog(1/ε) ε^{-2}) (see Appendix C for the precise statement), we can construct an estimator ĉ_1 such that |ĉ_1 − c_1|/γ ≤ ε, except with failure probability at most δ.
Proof.See Appendix C.

C. Recovering Hamiltonian Coefficients
With an efficient algorithm for accurately estimating first-order commutator expectations Tr(i[H, P]ρ_0), it is possible to construct an algorithm that infers the coefficients of H from these expectations. The idea is to carefully choose ρ_0 and P so that Tr(i[H, P]ρ_0) corresponds to one parameter at a time.
First, we introduce the notation that ρ_0^{(X)} and P^{(X)} will denote the portion of a density matrix or Pauli matrix (respectively) restricted to the qubits in X, and X̄ will denote the set of all qubits not in X.
Lemma 1 (Term selection). Let P be some Pauli operator such that there exists some i ∈ {1, …, r} where supp(P) ⊆ supp(P_i), and let X = supp(P_i). In words, Y is a neighborhood around X that contains the support of all Paulis that intersect with X, and Z is the set of all qubits that are not in X ∪ Y. The state ρ_0 is defined such that for all qubits in Y it is the maximally mixed state; for the qubits in X, ρ_0 is defined such that Tr(i[P_i, P] ρ_0^{(X)})/2 = 1; and for all other qubits, ρ_0 can be anything. Then measuring Tr(i[H, P]ρ_0) recovers the single coefficient θ_i (see Appendix D for the explicit statement). Proof. See Appendix D.
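Lemma 1's mechanism can be verified on a two-qubit example. The specific choices below (P_i = Z⊗Z, P = X on the first qubit, and the particular ρ_0^{(X)}) are our own illustration, not taken from the paper; the second computation shows how a maximally mixed "moat" qubit zeroes out any commutator that acts nontrivially on it.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)

Pi = np.kron(Z, Z)               # Hamiltonian term whose coefficient we target
P = np.kron(X, I2)               # single-qubit observable with [Pi, P] != 0
comm = 1j * (Pi @ P - P @ Pi)    # i[Pi, P] = -2 (Y tensor Z)

# On X = supp(Pi), choose rho0 so the normalized expectation equals 1.
rho0_X = np.kron((I2 - Y) / 2, (I2 + Z) / 2)
val = (comm @ rho0_X).trace().real / 2           # term-selection normalization

# Replace the second qubit with the maximally mixed state: since Y (tensor) Z
# is traceless on that qubit, the signal is completely suppressed.
rho0_moat = np.kron((I2 - Y) / 2, I2 / 2)
val_moat = (comm @ rho0_moat).trace().real / 2
```

Here `val` is exactly 1 and `val_moat` is exactly 0, which is the "moat" suppression the lemma relies on.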
This defines a simple algorithm for Hamiltonian learning. For simplicity, for any Pauli P_i, we will simply set the observable P to be a single-qubit Pauli acting on one site in X such that [P_i, P] ≠ 0.
Algorithm 2 Naive Hamiltonian learning

However, the runtime of this algorithm is Ω(r), since the estimation procedure must be called once for each term in the Hamiltonian. We propose an improvement of this algorithm wherein we estimate Tr(P e^{-iHt} ρ_0 e^{iHt}) for many different choices of P simultaneously. We aim to set ρ_0 in such a way that we can extract coefficients for many terms simultaneously. Yet, rather than using shadow tomography (as done in França et al. (2022)), which can result in O(16^k) scaling, we carefully take advantage of our knowledge of the Hamiltonian structure to obtain a smaller parallelization overhead. The way forward relies on the fact that in Lemma 1, ρ_0^{(Z)} can be anything. Similarly to Haah et al. (2021), we partition the terms of our Hamiltonian into groups of terms that can each be inferred simultaneously. This partition is based on a graph coloring, which we define below.
Definition 5 (Squared graph). Let the square of the interaction graph, G², be the graph with the same vertex set as G in which any two distinct vertices are connected if their distance in G is at most 2. Our algorithm will rely on a graph coloring of G². The essential idea is that for Paulis of the same color, there is always a "moat" separating them. This moat will then be filled with maximally mixed states, which completely suppresses the influence of terms that we are not interested in. A partitioning of the Hamiltonian terms via some C-coloring of G² makes it natural to rewrite the Hamiltonian using a double sum,

H = Σ_{i=1}^{C} Σ_{P_j ∈ V_i} θ_j P_j,

where V_i is the set of all Paulis with the same color C_i. For instance, see Example 8.1 for a coloring of the squared interaction graph for a 9-qubit TFIM.
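A valid coloring of G² is easy to compute greedily; a greedy pass uses at most (max degree of G²) + 1 colors, consistent with the D²-type color counts discussed in Theorem 3. The sketch below is illustrative (the 4-term chain adjacency is a toy example of ours, and the paper's Definition 8 may use a different coloring construction).

```python
def square_graph(adj):
    """Adjacency of G^2: vertices within distance 2 in G become adjacent."""
    adj2 = {v: set(ns) for v, ns in adj.items()}
    for v, ns in adj.items():
        for u in ns:
            adj2[v] |= adj[u] - {v}
    return adj2

def greedy_coloring(adj):
    """Greedy coloring: give each vertex the smallest color unused by neighbors."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Interaction graph of a 4-term chain: term i overlaps only terms i - 1, i + 1.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
colors = greedy_coloring(square_graph(adj))
partition = {}
for v, c in colors.items():
    partition.setdefault(c, []).append(v)    # the color classes V_i
```

Terms 0 and 3 are at distance 3 in G, so they share a color and can be inferred simultaneously: their moat keeps them independent.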
Lemma 2 (Simultaneous inference for a partition). Let V_i be one part in a coloring of G². The coefficient of each Pauli in V_i can be inferred up to an error ε‖Θ‖_∞, with the failure probability for each individual coefficient being at most δ (so the overall failure probability is upper bounded by δ|V_i|). The required query complexity is given in Appendix D. Proof. See Appendix D.
Theorem 3 (Hamiltonian learning with unitary dynamics). Fix a sparsely interacting Hamiltonian H that has r terms in its Pauli expansion with coefficients Θ. For the appropriate choice of Chebyshev degree L and evolution time A, Algorithm 3 solves the quantum Hamiltonian learning problem (with an additive error ε‖Θ‖_∞ and failure probability at most δ); the query complexity and classical processing time complexity are given in Appendix D. Proof. We find a coloring of the squared interaction graph; there is an efficient way to do this with at most D² colors (see Definition 8). We then apply Lemma 2 to each of the resulting partitions. For the detailed proof, see Appendix D.
Algorithm 3 Hamiltonian learning with unitary dynamics
1: for each color class V_i do
2:   for j ← 1, …, |V_i| do (Define the observables and states, Lemma 2)
3:     P_j ← a single-qubit Pauli such that [P̃_j, P_j] ≠ 0, where P̃_j is the jth Pauli term in V_i
4:   for each qubit q outside the supports of the terms in V_i do (Fill the moats)
5:     set ρ_0 on qubit q to the maximally mixed state
6:   for k ← 1, …, K do
7:     for ℓ ← 1, …, L do (Construct the dataset, Definition 4)
8:       M ← N simultaneous measurements of Tr(P_j e^{-iHt_ℓ} ρ_0 e^{iHt_ℓ}) for j ∈ {1, …, |V_i|}
9:   for m ← 1, …, L − 1 do (Estimate the Chebyshev coefficients, Theorem 5)

In a different setup, we may instead be given access to copies of a Gibbs state at a temperature β^{-1}. If we measure an observable P_i, the expectation will be ⟨P_i⟩_β = Tr(P_i ρ(β)). In what follows, we apply the analysis of Haah et al. (2021) to formulate ⟨P_i⟩_β as a polynomial in β, in accordance with the framework in Equation (3). We will show that we can learn the coefficients of the Hamiltonian from the first-order term in this polynomial, thereby mapping the problem of Hamiltonian learning from Gibbs states onto Hamiltonian learning with unitary dynamics.
Theorem 4 (Hamiltonian learning with Gibbs states). The Hamiltonian learning problem (with an additive error ε‖Θ‖_∞ and failure probability at most δ) can be solved using a number of copies of the Gibbs state, and with a classical time complexity, given in Appendix E. Proof. The protocol is a near mirror image of the Hamiltonian learning protocol using unitary dynamics. For the full proof, see Appendix E.

V. NUMERICAL RESULTS AND DISCUSSION
In this section, we demonstrate numerically the performance of Algorithm 3 on a simulated Hamiltonian learning problem using unitary dynamics. We apply Algorithm 3 to learning a transverse field Ising model,

H = Σ_i J_i σ^z_i σ^z_{i+1} + Σ_i B_i σ^x_i,

where J_i, B_i ∼ Unif(−1, 1). To simulate the dynamics of this Hamiltonian, we use the time-evolving block decimation method (Zwolak and Vidal 2004, Paeckel et al. 2019, White and Feiguin 2004, Daley et al. 2004, Vidal 2004).
There are two hyperparameters in Algorithm 3 that are crucial for determining the performance of the Hamiltonian learning routine: the maximum evolution time A and the Chebyshev degree L. Setting these parameters is a delicate balance between noise-induced error and modelling error. If A is too low or L is too high, the variance in the dataset will dominate the error; on the other hand, if A is too high or L is too low, the modelling error will dominate. It is generally desirable to set these two parameters such that the modelling and noise errors are comparable. A possible method for setting A and L is to optimize the error bounds (see Figure 3). Numerically, these optimal values behave as anticipated in Theorem 2: the optimal L* scales with O(log(ε^{-1})) and A/τ ∼ 1, which leads to shot requirements scaling with O(polylog(1/ε) ε^{-2}).
In Figure 4, we show the error in the recovered Hamiltonian parameters corresponding to a target error of ε = 0.021. As expected, the theoretical prediction for the noise error is close to perfect. However, the modelling error is overestimated by nearly four orders of magnitude. This is reasonable, and originates in the upper bound for the norm of the iterated commutator found in Theorem 1, since the counting bound in Theorem 6 is unlikely to be saturated. This miscalculated modelling error has important consequences for the algorithm, since it results in a poorly specified evolution time τ: our bounds in Theorem 1 are too loose, hence our calculated optimal evolution time is too small, resulting in a higher noise error than necessary. For the case of arbitrary Hamiltonians, note that although we were free to set A and L in the proof of Theorem 2, we indeed find the optimal L = O(log(ε^{-1})), A = O(τ), and N = O(polylog(1/ε) ε^{-2}). We find similar scaling for the case of the commuting Hamiltonian in every variable except A/τ, which also turns out to scale as O(log(ε^{-1})): this is because the modeling error term is suppressed by a factorial, and scales like (A/τ)^L / L!. Despite this, the overall query complexity is only better than the general case by a constant factor. We propose a number of ways to remedy the loose bounds, beginning with a heuristic attempt to tighten the bound in Theorem 1. The reason for the first two optimizations is nonobvious, and is described in detail in Appendix F.

Optimization 1. When calculating the optimal evolution time A and polynomial degree L, we replace D with an effective degree D_eff (see Appendix F).

Optimization 2. A further improvement in algorithm performance can be gained by optimizing the number of queries allocated to each evaluation of ⟨P(t)⟩: for a fixed number of total queries N_max, we allocate the number of queries used when estimating ⟨P(t_ℓ)⟩ nonuniformly across the evaluation times (see Appendix F).
Optimization 3. Since we know the polynomial at t = 0 to have a value of 0 (our SPAM parameters are set such that this is always true), we constrain our solution for the Chebyshev coefficients such that the constant term in the polynomial is 0. That is, rather than fitting a generic polynomial Σ_{ℓ=0} c_ℓ t^ℓ to the data, we fit Σ_{ℓ=1} c_ℓ t^ℓ. This further reduces the variance of our solution, mitigating the noise-induced error.
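Optimization 3 amounts to a least-squares fit in a basis with no constant column. A minimal numpy sketch in the monomial basis (the data below is synthetic and purely illustrative; the actual protocol fits in the Chebyshev basis):

```python
import numpy as np

def fit_no_constant(t, y, L):
    """Least-squares fit of y ~ sum_{l=1..L} c_l t^l, with the constant term
    pinned to 0 (the SPAM settings guarantee f(0) = 0)."""
    V = np.vander(t, L + 1, increasing=True)[:, 1:]   # columns t, t^2, ..., t^L
    c, *_ = np.linalg.lstsq(V, y, rcond=None)
    return c                                          # c[0] is the c_1 estimate

rng = np.random.default_rng(1)
t = np.linspace(0.05, 0.5, 8)
y = 0.7 * t - 0.02 * t**3 + rng.normal(0, 1e-4, t.size)  # toy data with f(0) = 0
c = fit_no_constant(t, y, 3)
```

Dropping the constant column removes one fitted degree of freedom, which is what lowers the variance of the recovered c_1.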
We apply Optimizations 1 to 3 (the heuristic calculation of the effective degree D_eff, improved shot allocation, and solution constraints) to the TFIM model. The error distributions shown in Figure 5 are the result of running the Hamiltonian learning algorithm on 40 random instances of the 80-qubit TFIM problem. Since we can already calculate the improvement realized by Optimization 1 (a decrease in error by a factor D_eff/D), we show only the effects of Optimizations 2 and 3. Empirically, we realize significant improvements by applying these optimizations: a decrease in error by a factor ∼1.7, enabling us to use 3 times fewer shots.
(Figure caption; beginning truncated.) On the right, we show the difference between the theoretical error upper bound and the empirical errors from numerical simulations (note the log-log scale for both plots). The violin plots show the distribution of maximum absolute errors from 100 random initializations of the TFIM (with coefficients sampled uniformly between −1 and 1). The distributions show the [1%, 99%] interval as a narrow line, the [16%, 84%] interval as a wider line, and the median marked in white. The violin plots are offset by a small amount for visualization purposes, but each cluster of four violin plots used the same number of queries, marked by the dotted grey lines. Recall that N and K are defined in Theorem 2, L is the Chebyshev degree, and χ(G²) is the chromatic number of the squared interaction graph (Definition 8). We set the failure probability to δ = 15%.
As shown on the right of Figure 5, there is a significant gap (by a factor ∼10) between the theoretical error upper bound (calculated using the effective degree) and the empirical error. We propose two possible reasons for this gap. First, despite the heuristic improvements made in Optimization 1, there is still a significant gap between the theoretical and empirical modelling error (see Figure 10). This adds a discrepancy by a factor ∼2 between the theoretical and empirical errors, since the optimal configurations always have a theoretical error composed roughly of half modeling error and half noise error. Second, the settings for N and K used in Theorem 2 to guarantee P(maximum absolute error > ε) < δ are overestimates of the true requirements.

VI. CONCLUSIONS
In this work, we have discussed the quantum Hamiltonian learning problem. We introduced a unifying model for Hamiltonian learning using both unitary dynamics and Gibbs states. By subsuming these two approaches into the same model, we were able to describe an abstract routine for learning the Hamiltonian of a quantum many-body system given limited access to the system. This routine was based on fixing certain SPAM parameters, then viewing the system as a function f of a single variable (in this work, we consider this variable to be either the time t or the inverse temperature β). We argued that the coefficients in the Taylor expansion of f (particularly, the first-order coefficient) could be connected with the Hamiltonian parameters in a straightforward manner, then showed that the relevant coefficient could be inferred both accurately and efficiently from noisy evaluations of f. Finally, we concluded by describing how our protocol could achieve better-than-linear query complexity in r (the number of Hamiltonian parameters) by using SPAM configurations amenable to simultaneous measurements.
This culminated in our main result, wherein we proposed an algorithm that achieves an almost noise-limited (∼ polylog(ε^{-1}) ε^{-2}) query complexity, similar to that of Haah et al. (2021) and França et al. (2022). However, our work represents an advance for several reasons. In comparison to Haah et al. (2021), we significantly reduce the dependence on the parameter D, from D^{21} to D^4. In comparison to França et al. (2022), we generalize their protocol to learning with Gibbs states and improve their measurement parallelization overhead from O(16^k) to O(D^2); this can be a significant improvement when we have prior knowledge of the Pauli terms in the Hamiltonian (i.e., D < 4^k). Furthermore, by deriving explicit bounds on the performance of our algorithm, we were able to provide precise numerical prescriptions for theoretically optimal hyperparameters such as the maximum evolution time and the Chebyshev degree. We concluded by proposing a number of heuristic improvements to our algorithm, and argued that they are reasonable to apply in general. This combination of improvements makes our algorithm applicable in practice for non-trivial Hamiltonians, which we demonstrated by applying our protocol to a simulated, large (80-qubit) problem.
Although we have demonstrated a successful application of our learning algorithm on a simulated problem, this simulation did not include possible detrimental experimental effects. Our algorithm makes minimal SPAM requirements (requiring only single-qubit measurements and simple product states), but further investigation is needed to determine the effects of SPAM errors on performance. We also leave for future work a study of how this protocol can be improved by making stronger assumptions on either the Hamiltonian or the suite of interactions available to us. For instance, we already showed a constant (but significant) drop in the number of measurements required for learning a commuting Hamiltonian with unitary dynamics. We expect a similar effect for Hamiltonian learning with Gibbs states. Furthermore, if we assume we can interact with our system using a trusted quantum simulator of our own, a variety of approaches become possible. Among these is Hamiltonian learning with Loschmidt echoes, as done in Wiebe et al. (2014a). Rigorous performance bounds have not yet been found for this approach, but we speculate that a similar application of our techniques may yield improved performance; however, we leave this for future work.

Appendix A: Chebyshev Regression
To fit a black-box function f where we can query f(x) for x within some window [A, B], we are often interested in approximating f with a polynomial of degree L. This is known as polynomial interpolation. When we have complete freedom in choosing the location of the points x ∈ [A, B], a popular method is Chebyshev interpolation. There are a number of favorable properties associated with this method. Importantly, it achieves close to the optimal approximation error on [A, B]. That is, if f̂ is the degree-L polynomial resulting from Chebyshev interpolation, max_{x∈[A,B]} |f(x) − f̂(x)| is close to the minimal possible value among all polynomials of degree L (Mason and Handscomb 2003, Press 2007).
Chebyshev interpolation takes its name from its extensive use of a class of polynomials known as Chebyshev polynomials. This class of polynomials is unique in that for z ∈ [−1, 1], they can be written:

T_n(z) = cos(n arccos z).

Although this is not in polynomial form, by using the identity

cos(nθ) = Σ_{r=0, 2r≤n} (−1)^r C(n, 2r) cos^{n−2r}(θ) sin^{2r}(θ), (A2)

we find that T_n(z) is indeed a polynomial in z = cos θ (after writing sin²θ = 1 − cos²θ). This makes clear that the degree of T_n is n. The first seven Chebyshev polynomials are T_0 = 1, T_1 = z, T_2 = 2z² − 1, T_3 = 4z³ − 3z, T_4 = 8z⁴ − 8z² + 1, T_5 = 16z⁵ − 20z³ + 5z, and T_6 = 32z⁶ − 48z⁴ + 18z² − 1. Below, we list a number of useful properties of Chebyshev polynomials.
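The defining identity T_n(cos θ) = cos(nθ) can be checked numerically against the standard three-term recurrence; this is a minimal illustrative sketch (the function name `chebyshev_T` is ours, not the paper's):

```python
import numpy as np

def chebyshev_T(n, z):
    """Evaluate T_n(z) via the recurrence T_{n+1}(z) = 2 z T_n(z) - T_{n-1}(z)."""
    z = np.asarray(z, dtype=float)
    if n == 0:
        return np.ones_like(z)
    prev, curr = np.ones_like(z), z
    for _ in range(n - 1):
        prev, curr = curr, 2 * z * curr - prev
    return curr

# Check T_n(cos θ) = cos(nθ) on a grid, for the first seven polynomials.
theta = np.linspace(0.0, np.pi, 101)
for n in range(7):
    assert np.allclose(chebyshev_T(n, np.cos(theta)), np.cos(n * theta))
```

The recurrence follows from the cosine addition formula, which is why it reproduces cos(nθ) exactly.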
Lemma 3. The roots of the nth Chebyshev polynomial are z_k = cos((2k − 1)π/(2n)), for k = 1, . . ., n.

Proof. This follows from a simple substitution: T_n(z_k) = cos(n arccos z_k) = cos((2k − 1)π/2) = 0.
Lemma 4 (Discrete orthogonality). If {z_k | k = 1, . . ., n} are the roots of T_n, then for any i, j < n:

Σ_{k=1}^n T_i(z_k) T_j(z_k) = 0 if i ≠ j, n if i = j = 0, and n/2 if i = j ≠ 0.

In the general case, if either i > n or j > n, writing z_k = cos θ_k with θ_k = (2k − 1)π/(2n), the sum ½ Σ_k [cos((i + j)θ_k) + cos((i − j)θ_k)] is non-vanishing only when i + j ≡ 0 (mod 2n) or i − j ≡ 0 (mod 2n).

Proof. The case of i = j = 0 is trivial, since T_0 = 1. So, below, we will assume either i ≠ 0 or j ≠ 0.
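Lemmas 3 and 4 can both be verified numerically at a fixed degree; the snippet below is an illustrative check, not part of any proof:

```python
import numpy as np

n = 8
k = np.arange(1, n + 1)
roots = np.cos((2 * k - 1) * np.pi / (2 * n))   # roots of T_n (Lemma 3)

def T(m, z):
    """T_m(z) = cos(m arccos z) on [-1, 1]."""
    return np.cos(m * np.arccos(np.clip(z, -1.0, 1.0)))

# Lemma 3: T_n vanishes at the claimed points.
assert np.allclose(T(n, roots), 0.0, atol=1e-9)

# Lemma 4: discrete orthogonality over the roots of T_n, for i, j < n.
G = np.array([[np.sum(T(i, roots) * T(j, roots)) for j in range(n)]
              for i in range(n)])
expected = np.diag([float(n)] + [n / 2] * (n - 1))
assert np.allclose(G, expected, atol=1e-9)
```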
With these preliminaries, we now describe Chebyshev interpolation. After fixing a degree L − 1 for the interpolating polynomial, we evaluate f at the roots of T_L(z). This results in a dataset D = {(z_i, y_i) | i = 1, . . ., L}, where z_i are the roots of T_L(z) and y_i = f(z_i). The functional form with which we interpolate {y_i | i = 1, . . ., L} is simply a linear combination of the first L Chebyshev polynomials:

f̂(z; {b_i}) = Σ_{j=0}^{L−1} b_j T_j(z).

Theorem 5 (Chebyshev interpolation). The coefficients {b_i} such that f̂(z; {b_i}) perfectly fits our dataset are:

b_0 = (1/L) Σ_{i=1}^L y_i, and b_j = (2/L) Σ_{i=1}^L y_i T_j(z_i) for j ≥ 1.

Proof. We minimize the squared error between f̂ and the dataset. Then, since any set of L points can be perfectly fitted by a degree-(L − 1) polynomial, the resulting coefficients will be the interpolation solution. We now use Lemma 4. If ℓ = 0, we obtain b_0 = (1/L) Σ_i y_i. This simply says that the constant term in the fitted polynomial is the average over {y_i}, an intuitive result. For ℓ ≠ 0, we obtain b_ℓ = (2/L) Σ_i y_i T_ℓ(z_i).

Definition 6 (Types of tuples). We introduce some nomenclature that abbreviates much of our later proofs. First, let Θ be a tuple of the Hamiltonian parameters (θ_1, . . ., θ_r). Also, we introduce multi-index sets α ∈ (Z_{≥0})^r, which are tuples of nonnegative integers (α_1, α_2, . . ., α_r). We define |α| = Σ_i α_i and Θ^α = Π_i θ_i^{α_i}. We call |α| the size of α. Finally, we define "term tuples" S ∈ {1, . . ., r}^m of size m to be ordered tuples (s_1, . . ., s_m). Each entry in S is meant to point at a term in the Hamiltonian.
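The coefficient formulas of Theorem 5 can be checked by fitting an arbitrary smooth function at the Chebyshev nodes and confirming a perfect fit; this is an illustrative sketch only:

```python
import numpy as np

L = 10
i = np.arange(1, L + 1)
z = np.cos((2 * i - 1) * np.pi / (2 * L))     # roots of T_L
f = lambda x: np.exp(x) * np.sin(3 * x)       # arbitrary smooth test function
y = f(z)

def T(m, x):
    return np.cos(m * np.arccos(np.clip(x, -1.0, 1.0)))

# Coefficients from Theorem 5: b_0 is the sample mean,
# b_j = (2/L) sum_i y_i T_j(z_i) for j >= 1.
b = np.array([np.mean(y)] + [2 / L * np.sum(y * T(j, z)) for j in range(1, L)])

fit = lambda x: sum(b[j] * T(j, x) for j in range(L))
assert np.allclose(fit(z), y)                 # perfect fit at the nodes
```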
There is a correspondence between size-m multi-index sets and term tuples. A term tuple S with size m maps to α as follows: (α(S))_i = |{j | S_j = i, j = 1, . . ., m}|. This mapping essentially counts the number of occurrences of i in S. Graphically, this is represented in Figure 6. This mapping is surjective, in the sense that any multi-index set α with size m can be written as α(S) for some S with size m. However, it is not injective, since multiple term tuples can map to the same multi-index set.

Lemma 5. The iterated commutator [H^m P] can be expanded as:

[H^m P] = Σ_{S ∈ {1,...,r}^m} Θ^{α(S)} [P_{s_1}, [P_{s_2}, . . ., [P_{s_m}, P]]]. (B2)

We emphasize that Equation (B2) is nothing more than a polynomial expansion, in the Hamiltonian coefficients, of the iterated commutator.
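The map S ↦ α(S) is simple occurrence counting, which a few lines of code make concrete (names here are illustrative):

```python
from collections import Counter

def alpha(S, r):
    """Map a term tuple S in {1,...,r}^m to its multi-index set α(S),
    where (α(S))_i counts the occurrences of i in S."""
    counts = Counter(S)
    return tuple(counts.get(i, 0) for i in range(1, r + 1))

S = (2, 1, 2, 4)            # a term tuple of size m = 4 over r = 4 terms
a = alpha(S, 4)
assert a == (1, 2, 0, 1)
assert sum(a) == len(S)     # |α(S)| equals the size of S
```

Non-injectivity is visible here too: the reordered tuple (1, 2, 4, 2) maps to the same α.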
Proof. We inductively arrive at the formula by expanding H = Σ_s θ_s P_s in each level of the commutator and collecting coefficients, which yields Equation (B2).
Without any structural assumptions about the Hamiltonian H, the form of this expansion is not useful. However, by applying our assumption that the Hamiltonian is sparsely interacting, we can find useful bounds using the above expansion.
Definition 7 (Support tree). Let P be any Pauli operator such that there is some P_i where supp P ⊆ supp P_i (for the remainder of this work, wherever we write P, it will always satisfy this assumption). For this Pauli, we can define the support tree induced by P, which we call T_P (see Figure 7). This tree is defined as follows: the root of T_P is P, and the children of any node P_i are the Pauli terms P_j whose support intersects supp P_i.
Example 7.1. The following is a support tree induced by P = σ_x^(2) on the 9-qubit TFIM shown in Example 2.1.

FIG. 7.
The support tree T_P for the TFIM system shown in Figure 2, induced by some Pauli P that acts on qubit 2 (e.g., P = σ_x^(2)). For illustration purposes, we truncate the tree at a depth of 2. Each node has at most D + 1 children. The structure of this tree is entirely dependent on the interaction graph (see Definition 2).
Theorem 6 (Counting labeled subtrees). Let T be an infinite rooted d-regular tree (i.e., every node has d children). There are exactly

a_m = Π_{j=0}^{m−1} (1 + j(d − 1))

labelled rooted subtrees in T of size m, with labels {1, . . ., m}, such that the label of any given node is less than that of all of its descendants.
Proof. Let a_m be the number of labelled rooted subtrees of size m that satisfy the requirements in the above theorem.
If we neglect the labeling, so that any two subtrees are identical if they are structurally identical, then the count satisfies the following recurrence:

a_m = Σ_{k_1+···+k_d = m−1} a_{k_1} a_{k_2} · · · a_{k_d}.

Each summand represents a configuration in which the root node has a tree of size k_1 on its first child, size k_2 on its second child, and so on. We now need to adjust this recurrence to account for labeling. For some configuration (k_1, k_2, . . ., k_d), observe that we can combine the labelings as follows. There are m slots that need to be labeled. The first must be the root. Then, we can place the labels from the first subtree in any of C(m−1, k_1) positions (while maintaining relative ordering within the subtree). Next, we can place the labels from the second subtree in any of C(m−1−k_1, k_2) positions, and so on. In total, there are C(m−1, k_1) C(m−1−k_1, k_2) · · · ways of placing the labels. This is simply the multinomial coefficient (m−1 choose k_1, . . ., k_d). Therefore, the correct recurrence relation is:

a_m = Σ_{k_1+···+k_d = m−1} (m−1 choose k_1, . . ., k_d) a_{k_1} · · · a_{k_d}.

This can be written in a more convenient form if we define b_m ≡ a_m/m!, so that we have:

m b_m = Σ_{k_1+···+k_d = m−1} b_{k_1} · · · b_{k_d}.

To obtain a closed form for b_m, we define the generating function B(x) = Σ_{m≥0} b_m x^m, in terms of which the recurrence reads B′(x) = B(x)^d with B(0) = 1. This is a first-order equation that can be solved in closed form, giving B(x) = (1 − (d − 1)x)^{−1/(d−1)}, whose coefficients yield the product formula stated in Theorem 6.
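The closed-form count can be sanity-checked against the multinomial recurrence via the generating-function relation m·b_m = [x^{m−1}] B(x)^d; the code below is an illustrative verification, with all names ours:

```python
from math import factorial, prod

def count_via_recurrence(d, M):
    """a_m from the recurrence, using b_m = a_m/m! and m b_m = [x^{m-1}] B(x)^d."""
    b = [1.0]
    for m in range(1, M + 1):
        # Coefficient of x^{m-1} in B(x)^d, truncating B to the known b_0..b_{m-1}.
        poly = [1.0]
        for _ in range(d):
            new = [0.0] * m
            for i, c in enumerate(poly):
                for j in range(m - i):
                    new[i + j] += c * b[j]
            poly = new
        b.append(poly[m - 1] / m)
    return [round(b[m] * factorial(m)) for m in range(M + 1)]

def closed_form(d, m):
    """a_m = prod_{j=0}^{m-1} (1 + j(d-1))."""
    return prod(1 + j * (d - 1) for j in range(m))

d = 3
counts = count_via_recurrence(d, 7)
assert all(counts[m] == closed_form(d, m) for m in range(8))
```

For d = 3 this gives a_1, a_2, a_3 = 1, 3, 15, which matches a direct enumeration of small increasing-labeled subtrees.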
Corollary 6.1. There are at most D^m (m + 1)! non-vanishing tuples of size m.
Proof. We argue that non-vanishing tuples are in one-to-one correspondence with the subtrees of size m + 1 described in the previous theorem. We do this by construction.
Algorithm 4 Finding all possible non-vanishing tuples of size m
9: Remove v from F
10: Add the children of v to F
11: R.extend(NonvanishingTuples(S, F, m − 1))
12: return R
13: S_0 ← empty list of size m + 1
14: S_0[m + 1] ← P
15: F_0 ← the children of P in the support tree T_P
16: {S} ← NonvanishingTuples(S_0, F_0, m)

In the above, S represents a given non-vanishing tuple, and F represents a frontier of Paulis whose support intersects with the support of at least one Pauli in S. This algorithm builds non-vanishing tuples from the inside out, first selecting all P_{s_m} such that supp P_{s_m} ∩ supp P ≠ ∅, then finding P_{s_{m−1}} such that supp P_{s_{m−1}} ∩ (supp P_{s_m} ∪ supp P) ≠ ∅, and so on. Now, we observe that this algorithm is constructing, structurally speaking, rooted subtrees of size m + 1 in T_P (where the +1 is due to the root P). The recursion takes the current subtree, searches its 'frontier' (i.e., the set of nodes that are connected to at least one node in the current subtree) and adds one of these frontier nodes to the current subtree, repeating until the subtree has size m + 1. However, the algorithm does not only output structures: there is an ordering to each structure depending on the order in which each node is visited. A given subtree may be counted several times, with different orderings. This ordering corresponds to a labeling of the nodes in the subtree with {1, . . ., m}, with the labels representing the order in which the nodes are visited. The constraint on the labeling is that no descendant of a node can be visited before its parent; this simply means the label of any node must be strictly less than that of its descendants. Therefore, Algorithm 4 counts all labelled subtrees of size m + 1 (with the root P fixed and the remaining m nodes labelled) with the property that the label of any node is less than that of its descendants. Since T_P is at most (D + 1)-regular, by Theorem 6, there are at most D^m (m + 1)! non-vanishing tuples.
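The frontier-based recursion described above can be sketched compactly in Python. This is our simplified reading of Algorithm 4, not the paper's exact listing; `children(v)` stands in for the support-tree child relation, and the toy adjacency is hypothetical:

```python
def nonvanishing_tuples(children, root, m):
    """Enumerate candidate non-vanishing term tuples of size m.

    Builds tuples inside-out: the first term chosen fills slot s_m,
    mirroring the nesting [P_{s_1}, ..., [P_{s_m}, P]]."""
    results = []

    def recurse(partial, frontier):
        if len(partial) == m:
            results.append(tuple(reversed(partial)))
            return
        for v in list(frontier):
            rest = list(frontier)
            rest.remove(v)                       # remove one occurrence of v
            recurse(partial + [v], rest + list(children(v)))

    recurse([], list(children(root)))
    return results

# Toy support tree: the probe P overlaps only term "A"; terms "A" and "B"
# overlap each other and themselves (a term always overlaps its own support).
adj = {"P": ["A"], "A": ["A", "B"], "B": ["A", "B"]}
tuples_m2 = nonvanishing_tuples(lambda v: adj[v], "P", 2)
assert set(tuples_m2) == {("A", "A"), ("B", "A")}
assert len(tuples_m2) <= 2**2 * 6               # within the D^m (m+1)! bound
```

As Remark 6.1 notes for the real algorithm, this enumeration is a superset: some returned tuples may still vanish once the commutators are evaluated.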
Remark 6.1. It is not guaranteed that every tuple returned by Algorithm 4 will be non-vanishing, since it is possible, for example, that P_{s_1} commutes with [P_{s_2}, . . ., [P_{s_m}, P]]. However, it is guaranteed that every non-vanishing tuple is contained in the set {S} found by Algorithm 4.

Remark 6.2. Corollary 6.1 is similar in approach to Haah et al. (2021). However, they use the known result that the number of structurally unique subtrees with size m of a d-regular tree is (1/((d − 1)m + 1)) C(dm, m) (Knuth 1969). This scales with ∼ (e · d)^m, and if we want to count all non-vanishing tuples, to allow for relabelling, we are forced to introduce a factor m!. This gives a bound on the number of non-vanishing tuples ∼ (e · d)^m m!. By being more careful with relabeling (since not all labelings are permitted), the counting in Theorem 6 is better by a factor O(e^m/m).
Theorem 7. The spectral norm of the iterated commutator is bounded by

‖[H^m P]‖ ≤ D^m (m + 1)! (2‖Θ‖_∞)^m, (B9)

where the ∞ norm denotes ‖Θ‖_∞ = max_i |θ_i|.
Proof. We refer to the expression for the iterated commutator in Equation (B2). Applying the triangle inequality, bounding each coefficient by ‖Θ‖_∞^m, bounding each nested commutator of Paulis in norm by 2^m, and counting at most D^m (m + 1)! non-vanishing tuples (Corollary 6.1) gives the result.

Theorem 8. For commuting Hamiltonians (i.e., every term P_i in the Hamiltonian commutes with every other term), when P is a single-qubit observable,

‖[H^m P]‖ ≤ (2(D + 1)‖Θ‖_∞)^m (B11)

for all m.
Proof. We separate H = H_1 + H_2, where H_1 is composed of all terms in the Hamiltonian whose support overlaps with the support of P. We then inductively show Equation (B11) by proving the stronger statement

[H^m P] = [H_1^m P].

Assuming the induction hypothesis, we have:

[H^{m+1} P] = [H, [H_1^m P]] = [H_1, [H_1^m P]] + [H_2, [H_1^m P]].

By commutativity of the Hamiltonian, [H_1, H_2] = 0, so we can freely rearrange H_2 past every factor of H_1. By definition of H_2, supp H_2 ∩ supp P = ∅, so they commute. Hence [H_2, [H_1^m P]] = 0, and [H^{m+1} P] = [H_1^{m+1} P]. Since H_1 contains at most D + 1 terms, by applying the triangle inequality for the matrix norm, we have ‖[H_1^m P]‖ ≤ (2‖H_1‖)^m ≤ (2(D + 1)‖Θ‖_∞)^m. Notably, Equation (B11) is smaller than the general bound in Equation (B9) by a factor ∼ m!. This can be attributed to the fact that when H is commuting, the support of [H^m P] never grows beyond a ring around P; more precisely, supp [H^m P] ⊆ supp H_1 for all m.
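Our reading of the inductive claim, [H^m P] = [H_1^m P] for commuting H, can be verified numerically on a toy commuting Hamiltonian (all names and the 3-qubit model below are illustrative, not from the paper):

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def kron(*ops):
    return reduce(np.kron, ops)

def comm(A, B):
    return A @ B - B @ A

def iterated(H, P, m):
    """m-fold nested commutator [H, [H, ..., [H, P]]]."""
    out = P
    for _ in range(m):
        out = comm(H, out)
    return out

# Commuting 3-qubit Hamiltonian: every term is diagonal in the Z basis.
terms = [kron(Z, Z, I2), kron(I2, Z, Z), 0.5 * kron(Z, I2, I2)]
H = sum(terms)
P = kron(X, I2, I2)               # single-qubit probe on qubit 0
H1 = terms[0] + terms[2]          # terms whose support overlaps supp P

for m in range(1, 5):
    assert np.allclose(iterated(H, P, m), iterated(H1, P, m))
    # Norm control: ||[H_1^m P]|| <= (2 ||H_1||)^m, with ||H_1|| = 1.5 here.
    assert np.linalg.norm(iterated(H, P, m), 2) <= 3.0**m + 1e-9
```

The distant ZZ term (H_2 here) drops out of every level of the commutator, exactly as the proof argues.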
Theorem 9 (Noise-induced error). The total estimation error E[(ĉ_1 − c_1)²] decomposes into a bias (modeling) term and a variance (noise) term. Since γ defines a typical scale for the Hamiltonian coefficients, E[(ĉ_1 − c_1)²]/γ² can be interpreted as a relative error.
Proof. We use the bias-variance identity E[(ĉ_1 − c_1)²] = (c_1 − E[ĉ_1])² + V[ĉ_1], where V[ĉ_1] is the variance in the estimator ĉ_1 due to randomness in the dataset.
Applying Equation (A10), swapping the order of the two sums, and using the fact that each of the y_ℓ are statistically independent, the variance reduces to a sum of products of Chebyshev polynomials evaluated at the nodes. By the discrete orthogonality conditions of Lemma 4, this sum is non-vanishing only when the two Chebyshev indices coincide. Next, we evaluate the bias (c_1 − E[ĉ_1])². Since E[ĉ_1] corresponds to Chebyshev interpolation with no noise, we make use of a theorem (Howell 1991, Equation 4.2) concerning the derivative error bounds for Chebyshev interpolation.
This theorem says that if f̂ is the degree-(L − 1) Chebyshev interpolation of some function f, then the derivative error |f′(0) − f̂′(0)| is controlled by ‖f^(L)‖_∞ together with the node polynomial ω_1(t) ≡ Π_{ℓ=2}^L (t − z_ℓ), the unique monic polynomial with roots at z_2, . . ., z_L. Applying Theorem 1 to bound ‖f^(L)‖_∞, and using the fact that cos x < 1 − x²/2.3 for |x| < 1 to bound the product defining ω_1, we obtain the total error as stated.

Remark 9.1. By viewing the estimator ĉ_1 as a generalized finite-difference estimator f′(0) ≈ (f(ε) − f(0))/ε, we argue that the terms in this expression (with the exception of those in bold) are fundamental:

• The inverse dependence on the evolution time A corresponds to the dependence on 1/(Δt)² in the finite-difference estimator.
• The σ² dependence originates in the noisiness of the measurements we take.
On the other hand, the terms in bold are not fundamental. More precisely, they originate in the fact that we can only evolve our system forward in time. If it were possible to evolve backward in time, we would have access to the central difference estimator f′(0) ≈ (f(ε) − f(−ε))/(2ε). This would improve our estimator by reducing the terms in bold.
• Currently, the (L + 1)(A/τ)^L dependence measures the modeling error. This is analogous to the finite-difference estimator f′(x) ≈ (f(x + ε) − f(x))/ε having an error that scales with O(ε f″(ξ)/2!). However, a central difference method has an error that scales with O(ε² f‴(ξ)/3!). Therefore, roughly speaking, we would expect the modeling error to improve by at least a factor A/(τ(L + 1)) if we had access to backwards time evolution.
• The L⁴ dependence changes to an L² dependence. This is because, by placing t = 0 in the center of the Chebyshev roots {z_i} (rather than at the extreme end z = −1), the expression for ĉ_1 changes its form from ∼ Σ_m b_m m² to ∼ Σ_m b_m m. Plugging this into the derivation for the variance error, we get scaling with L².
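The forward- versus central-difference comparison in the remark above is easy to see numerically; this is a generic toy illustration, not the paper's estimator:

```python
import numpy as np

f = np.exp                     # test function with f'(0) = 1
eps = 1e-3

forward = (f(eps) - f(0.0)) / eps            # one-sided (forward-time only)
central = (f(eps) - f(-eps)) / (2 * eps)     # needs "backward" evaluations

err_fwd = abs(forward - 1.0)   # ~ eps * f''(xi)/2!  (first-order bias)
err_cen = abs(central - 1.0)   # ~ eps^2 * f'''(xi)/3!  (second-order bias)

assert err_fwd < eps           # within the first-order bound here
assert err_cen < err_fwd / 100 # central difference is dramatically better
```

For ε = 10⁻³ the one-sided bias is about 5 × 10⁻⁴ while the central bias is about 2 × 10⁻⁷, mirroring the claim that backward evolution would suppress the bolded terms.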
Theorem 10 (Query complexity for one coefficient). Fix some failure probability δ and an error ε. Assume that we have access to an unbiased (single-shot) estimator of ⟨P(t)⟩ with variance σ² ≤ 1. Then there is some choice of A ∼ 1/γ and L ∼ log ε⁻¹ such that ĉ_1 achieves error ε with failure probability at most δ using N = O(log(1/δ) polylog(ε⁻¹) ε⁻²) queries.

Proof. It suffices to show that for all j such that j ≠ i, Tr(i[P_j, P]ρ_0) = 0. This is trivially true in the case where supp P_j ∩ supp P = ∅, since the commutator vanishes. There are two remaining cases.
Case 1. P_j acts nontrivially on some set of qubits in Y. Note that P^(Y) = I. Then [P_j, P] = [P_j^(X), P^(X)] ⊗ P_j^(Y) ⊗ I^(Z).

Case 2. P_j acts trivially on all qubits in Y. Then, we have i[P_j, P] = i[P_j^(X), P^(X)] ⊗ I^(Y) ⊗ I^(Z).
We now observe that in both cases the trace against ρ_0 vanishes.

Lemma 8 (Simultaneous inference for a partition). Let V_i be a partition in a coloring of G². The coefficient for each Pauli in V_i can be inferred up to an error ε‖Θ‖_∞, with the failure probability for each individual coefficient being at most δ (so the overall failure probability is upper bounded by δ|V_i|). This can be done with the query complexity in Equation (D6).

Proof. For each P_{i,j} ∈ V_i, let P̄_{i,j} be a single-qubit Pauli such that [P_{i,j}, P̄_{i,j}] ≠ 0. Let M be the "moat" surrounding supp V_i. Now, for any P_{i,j}, there is a straightforward relabeling of qubits that maps ρ_0 onto the required structure. By doing this for each of the χ(G²) ≤ D² + 1 partitions and applying a union bound on the failure probability, we see we can recover each coefficient up to an additive error ε‖Θ‖_∞ with failure probability at most δ using N queries, which has a complexity given by Equation (D12). The classical time complexity takes a similar form to the query complexity, except it replaces a factor of D² (corresponding to χ(G²)) with a factor r. This is because we need to process N · L · K measurement results for each of the r coefficients in the Hamiltonian, whereas for the query complexity, we make N · L · K measurements for each of the χ(G²) partitions. Since the classical time to color the graph is just O(D²), the overall classical complexity is still dominated by N · L · K · r, giving Equation (D13).
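The coloring step behind Lemma 8 (color G² so that same-colored terms are far apart, then measure each color class simultaneously) can be sketched with a greedy coloring on a toy path graph; all names and the example graph below are illustrative:

```python
def square(adj):
    """Square of a graph: connect vertices at distance <= 2."""
    return {v: (set(adj[v]) | {w for u in adj[v] for w in adj[u]}) - {v}
            for v in adj}

def greedy_coloring(adj):
    """Greedy proper coloring; uses at most (max degree + 1) colors."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in used)
    return colors

# Path interaction graph P1 - P2 - P3 - P4 - P5 (maximum degree D = 2).
G = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
G2 = square(G)
colors = greedy_coloring(G2)

# Proper coloring of G^2: same-colored terms are at distance >= 3 in G,
# so their moats do not overlap and they can be measured simultaneously.
assert all(colors[u] != colors[v] for v in G2 for u in G2[v])
assert len(set(colors.values())) <= 2**2 + 1   # chi(G^2) <= D^2 + 1
```

On this path the greedy order finds 3 colors, comfortably inside the D² + 1 bound used in the proof.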
Lemma 11 (Temperature derivative bound). The mth derivative of ⟨P_i⟩_β evaluated at β = 0 is bounded in absolute value by a quantity scaling as m! (O(‖Θ‖_∞ D²))^m, consistent with the redefinition of γ in Theorem 12.

Proof. This follows directly from Lemma 3.7 and Proposition 3.8 of Haah et al. (2021); we first apply Proposition 3.8 to bound the derivative. With unitary dynamics, we already demonstrated that the derivative bound drops by a factor m! if the Hamiltonian is commuting. We expect a similar decrease here, but we leave a proof of this for future work.
Theorem 12 (Hamiltonian learning with Gibbs states). The Hamiltonian learning problem (with an additive error ε‖Θ‖_∞ and failure probability at most δ) can be solved using the query complexity in Equation (E7).

Proof. The protocol is a near mirror image of the Hamiltonian learning protocol using unitary dynamics. We aim to infer the first derivative of the polynomial ⟨P_i⟩_β, so we apply our EstimateDerivative protocol from Algorithm 1. The error scaling of this protocol is slightly different, since f^(L) changes when using ⟨P_i⟩_β as the polynomial rather than ⟨P_i(t)⟩. Repeating the analysis from Equation (C4), we find that this amounts to redefining γ = O(‖Θ‖_∞ D²) (see Definition 3). This can be plugged into our analysis for Theorem 2 to find that we can infer any individual coefficient up to an error ε‖Θ‖_∞ with failure probability ≤ δ using

O(log(1/δ) polylog(D/ε) (ε/D²)⁻²) (E8)

copies of the Gibbs state. This then carries into our main result, Theorem 3. We color our interaction graph such that terms of the same color (i.e., partition) do not have overlapping support. Then, the coefficients for each partition can be inferred simultaneously, because the observables in each partition can be measured simultaneously. Since there are at most D + 1 partitions, the overall query complexity of our algorithm is given by Equation (E7).
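The EstimateDerivative idea invoked above (sample the expectation at Chebyshev nodes, average shots, fit in the Chebyshev basis, and differentiate at the origin) can be sketched generically. This is a simplified stand-in for Algorithm 1, with all function names and parameter choices illustrative rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_derivative(f, A, L, N, sigma):
    """Sketch: estimate f'(0) from N noisy shots of f at each of the L
    Chebyshev nodes mapped into [0, A], via a degree-(L-1) Chebyshev fit."""
    k = np.arange(1, L + 1)
    z = np.cos((2 * k - 1) * np.pi / (2 * L))   # roots of T_L on [-1, 1]
    t = (z + 1) * A / 2                          # mapped to [0, A]
    y = np.array([np.mean(f(ti) + sigma * rng.standard_normal(N)) for ti in t])
    fit = np.polynomial.Chebyshev.fit(t, y, deg=L - 1, domain=[0, A])
    return fit.deriv()(0.0)

# Toy target with known derivative f'(0) = 2 (exactly degree <= L-1,
# so only the noise term contributes to the error).
est = estimate_derivative(lambda t: 2 * t + t**2, A=0.5, L=4, N=10_000, sigma=0.02)
assert abs(est - 2.0) < 0.1
```

Shrinking A trades modeling error against noise amplification in the derivative, which is exactly the hyperparameter trade-off the paper optimizes.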
Appendix F: Heuristic Optimizations

In the following, we provide a theoretical justification for Optimizations 1 and 2.
Optimization 1. The support tree T_P is often not a truly regular tree (i.e., not all nodes have exactly D + 1 children).
We expect this to be reflected in the scaling of the number of non-vanishing tuples. Corollary 6.1 says that the number of non-vanishing tuples of size m is at most D^m (m + 1)!. The factor D^m resembles the number of nodes at a depth m in a D-regular tree. Therefore, it is reasonable to suppose that we can replace D^m with a factor that more closely reflects the number of nodes at a depth m in the support tree T_P, which is not a truly regular tree. We assume T_P can be modeled as a branching process (Athreya and Ney 1972). A branching process is a stochastic process that models reproduction over m generations. We begin with a population size of X_0 = 1, and at each time step, each member of the population produces a random number N of offspring, where N is a positive discrete random variable, and the previous generation dies out. It is a standard result that the expected population size at a time t is E[N]^t (Athreya and Ney 1972). Applied to our case, we see that a reasonable definition for N is to let it be chosen uniformly at random from the set of degrees in the interaction graph: N ∈_R {deg(v) | v ∈ V} (where ∈_R denotes random selection). Furthermore, recall Remark 6.1: not every subtree of the support tree T_P is non-vanishing. For any fixed Pauli, the probability that it commutes with some random Pauli is one-half. Therefore, as we are constructing a non-vanishing tuple (using the subtree analogy from Corollary 6.1), we expect that adding any node has a one-half probability of commuting with the current subtree. Therefore, we define the average degree to be

D̄ = (1/2) E[N] = (1/(2|V|)) Σ_{v∈V} deg(v).

The factor 1/2 comes from the fact that for a given subtree, we expect one out of every two of the children that we add to commute with the existing operator defined by the subtree. With this, we now replace D everywhere with D̄.
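For the TFIM chain, the effective degree D̄ = (1/2) E[N] can be computed directly from the term supports; the function below is an illustrative reconstruction of that calculation (the supports model is ours, matching a chain of ZZ edge terms and X site terms):

```python
import numpy as np

def tfim_degrees(n):
    """Degrees in the interaction graph of an n-qubit TFIM chain.

    Terms are ZZ on each edge (support {i, i+1}) and X on each site
    (support {i}); two terms are adjacent iff their supports intersect."""
    supports = [{i, i + 1} for i in range(n - 1)] + [{i} for i in range(n)]
    return [
        sum(1 for j, b in enumerate(supports) if j != k and a & b)
        for k, a in enumerate(supports)
    ]

deg = tfim_degrees(200)
effective = 0.5 * float(np.mean(deg))
# Half the mean degree approaches 3/2 as n grows, the value quoted for D̄.
assert abs(effective - 1.5) < 0.02
```

Bulk ZZ terms have degree 4 (matching D = 4 in Figure 2) and bulk X terms degree 2, so the mean degree tends to 3 and D̄ tends to 3/2.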
Gibbs states: we can control two quantities, β and O. We assume we have access to the Gibbs state at a temperature β⁻¹, and then measure the observable O.
FIG. 2. Interaction graph G for a 9-qubit transverse-field Ising model. The degree of this Hamiltonian is D = 4 since, for instance, P2 is connected to 4 other Pauli terms.
14: for j ← 1, . . ., |V_i| do    (Estimate the first commutator (Algorithm 1) for each Pauli in V_i)
15: y_ℓ ← estimate of Tr(P_j e^{−iHt_ℓ} ρ_0 e^{iHt_ℓ}) by averaging over M, for ℓ = 1, . . ., L

FIG. 3. Settings for A and L as a function of the desired error ε. These settings are found by minimizing the upper bound on N · L in Equation (C8), plugging in the two types of derivative bounds found in Theorem 1 for arbitrary and commuting Hamiltonians. For the case of arbitrary Hamiltonians, note that although we were free to set A and L in the proof of Theorem 2, we indeed find the optimal L = O(log ε⁻¹), A = O(τ), and N = O(polylog(1/ε) ε⁻²). We find similar scaling for the case of the commuting Hamiltonian in every variable except A/τ, which we also find to scale as O(log ε⁻¹): this is because the modeling error term is suppressed by a factorial, and scales like (A/τ)^L/L!. Despite this, the overall query complexity is only better than the general case by a constant factor.
FIG. 4. The empirical modeling and noise error of the Hamiltonian learning protocol using the optimal A and L for ε = 0.021, as prescribed in Figure 3. The modeling errors are calculated with a noise-free dataset, and the noise errors are calculated from a single noisy dataset. The dashed line indicates the maximum theoretical modeling error on the left, and the predicted variance due to noise on the right.

FIG. 5. On the left, we show the maximum absolute error across all 159 coefficients of the 80-qubit TFIM model, plotted against the total number of queries N · L · K · χ(G²); on the right, we show the difference between the theoretical error upper bound and the empirical errors from numerical simulations (note the log-log scale for both plots). The violin plots show the distribution of maximum absolute errors from 100 random initializations of the TFIM (with coefficients sampled uniformly between −1 and 1). The distributions show the [1%, 99%] interval as a narrow line, the [16%, 84%] interval as a wider line, and the median marked in white. The violin plots are offset by a small amount for visualization purposes, but each cluster of four violin plots used the same number of queries, marked by the dotted grey lines. Recall that N and K are defined in Theorem 2, L is the Chebyshev degree, and χ(G²) is the chromatic number of the squared interaction graph (Definition 8). We set the failure probability to δ = 15%.
FIG. 9. Bounds for the iterated commutator norm ‖[H^m P]‖. We randomly generate 10000 TFIM Hamiltonians (sampling the interaction strengths from Unif(−1, 1)) for system sizes n = 2, . . ., 9, and calculate the norm of [H^m P] for every single-qubit Pauli P. The points marked "simulation" in the figure are the maximum ‖[H^m P]‖ over all 10000 randomly generated Hamiltonians, all single-qubit Paulis, and all system sizes. The rigorous bound is that found in Corollary 6.1, D^m (m + 1)!, and the heuristic bound is D̄^m (m + 1)!. The average degree D̄ was set to 3/2, which is the effective degree as n → ∞.