Introduction

An increasingly relevant task for the study of many-body quantum systems is to learn the associated Hamiltonian operator efficiently (i.e., without requiring resources that scale exponentially in system size). In condensed matter physics, we can experimentally verify our models of quantum materials by comparing theoretical predictions about their effective interactions with the interactions inferred by Hamiltonian learning1,2,3,4. This verification is also applicable for quantum device engineering. With the expanding capabilities of quantum computers, it is increasingly important to be able to certify their behavior5, and while benchmarking protocols can give coarse-grained information about a particular quantum device, knowing its Hamiltonian can be significantly more powerful, allowing us to design improved devices6,7,8 or better understand the physical origin of failure modes9,10,11.

Several promising approaches have been proposed for the Hamiltonian learning problem. An early work12 demonstrated that systems with local Hamiltonians can be efficiently characterized without requiring full state tomography, which is costly for a given accuracy in the trace norm. However, this method was limited in its applicability and found to be prohibitively expensive in general. Subsequent approaches13,14,15,16 successfully employed machine learning on small systems. Nonetheless, these methods lack rigorous performance guarantees or scaling results, and their performance on larger systems has not been explored beyond limited numerical studies. Additionally, several proposals17,18,19 suggested learning the coefficients of the Hamiltonian by solving a system of linear equations, with the coefficient matrix determined by local measurement outcomes. However, the performance of these approaches depends on the spectral gap of the coefficient matrix, which remains poorly characterized. Recent works20,21 have achieved asymptotically optimal sample complexities, albeit with large constant prefactors that render them impractical in real-world scenarios.

In this work, we propose a protocol for Hamiltonian learning that aims to address these shortcomings. Our protocol is motivated by a major application of Hamiltonian learning, which is the characterization of near-term quantum computers. To accommodate this application, our protocol is designed to make relatively weak assumptions about the nature of the system. Specifically, we assume:

  • The Hamiltonian we are interested in learning is sparsely interacting (these are generalizations of k-local Hamiltonians; see Definition 2).

  • Our interaction with the system is limited to the ‘prepare-and-measure’ model – that is, we do not require the ability to interact with the system under study via another trusted quantum simulator (e.g., refs. 22,23,24) or to make interventions (other than measurement) after initializing the system25. Two examples of this prepare-and-measure setup are making measurements on time-evolved states or on Gibbs states (Fig. 1). In these two settings, we assume that we can control the evolution time and the temperature, respectively.

  • We can prepare fully separable states and make Pauli measurements.

We note that the practicality of these assumptions depends on the experimental platform. Indeed, there are other approaches that impose even more stringent assumptions, such as the restriction of only being able to prepare a single fixed initial state26, or the ability to make measurements on only a single site27. In our work, we do not impose such restrictive assumptions, as they do not align with the application we focus on, namely the characterization of near-term quantum computers. In this context, it is natural to assume that we can prepare arbitrary product states and perform Pauli measurements on arbitrary sites. A further advantage of our protocol is that it is easily parallelizable. In short, we will describe a Hamiltonian learning protocol that requires only \({{{{{{{\mathcal{O}}}}}}}}\left({\epsilon }^{-2}{{{{{{{\rm{polylog}}}}}}}}(n/\epsilon )\right)\) samples to recover every parameter of a sparsely interacting n-qubit Hamiltonian up to an error ϵ. We will conclude by providing a concrete prescription for optimal configurations of the protocol when used in practice, and demonstrate its performance with numerical examples.

Results

In this work, we will treat the system under study as a black box system with an unknown Hamiltonian H, and our goal will be to efficiently infer H with access to only a limited number of inputs to, and outputs from the black box. Importantly, we use the ‘prepare-and-measure model’ of interaction with our system (see Fig. 1). This model of interaction prohibits any quantum channel between the system under study (whose Hamiltonian we are trying to learn) and some other quantum processing unit. Furthermore, after initializing the system in some state, it prohibits any interaction with the system other than making measurements. Two typical examples of this are Hamiltonian learning using unitary dynamics and Gibbs states. For the former, we initialize the system in some known state ρ0, and evolve it forward in time by t, resulting in the state:

$$\rho (t)={e}^{-iHt}{\rho }_{0}{e}^{iHt}.$$
(1)

For the latter, we assume we have access to a system in thermal equilibrium at a temperature β⁻¹. That is, we have access to the Gibbs state

$$\rho (\beta )=\frac{\exp (-\beta H)}{{{{{{{{\rm{Tr}}}}}}}}(\exp (-\beta H))}.$$
(2)

We assume that we can control the parameters t and β, respectively. Finally, we assume that we can measure some observable P of the final states ρ(t) and ρ(β). However, we do not insist on arbitrary control over ρ0 and P; we only consider the case where ρ0 is fully separable and P is a local Pauli operator.
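As a concrete (toy) illustration of this prepare-and-measure access, the sketch below evaluates both expectation values exactly for a two-qubit system; the Hamiltonian, observable, and initial state here are assumptions chosen purely for illustration, not taken from the protocol itself:

```python
import numpy as np
from scipy.linalg import expm

# Single-qubit Paulis.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Toy two-qubit Hamiltonian H = Z⊗Z + X⊗I (coefficients assumed for illustration).
H = np.kron(Z, Z) + np.kron(X, I2)

# SPAM setting: observable P = Z on qubit 0; fully separable initial state
# |0><0| ⊗ I/2 for the unitary-dynamics oracle.
P = np.kron(Z, I2)
rho0 = np.kron(np.diag([1.0, 0.0]).astype(complex), I2 / 2)

def f_unitary(t):
    """Tr(P rho(t)) with rho(t) = e^{-iHt} rho0 e^{iHt}, as in Eq. (1)."""
    U = expm(-1j * H * t)
    return np.trace(P @ U @ rho0 @ U.conj().T).real

def f_gibbs(beta):
    """Tr(P rho(beta)) with rho(beta) = e^{-beta H}/Tr(e^{-beta H}), as in Eq. (2)."""
    G = expm(-beta * H)
    return np.trace(P @ G).real / np.trace(G).real
```

At t = 0 the state is ρ0, so f_unitary(0) = Tr(Pρ0) = 1 for this choice; at β = 0 the Gibbs state is maximally mixed, so f_gibbs(0) = 0 for any traceless P.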

Fig. 1: Classical interaction with quantum systems.
figure 1

The 'prepare and measure' model for interacting with a quantum system. We view the system as a set of oracles indexed by the state preparation and measurement parameters ρ0, P in the time evolution case, and P in the Gibbs state case. These oracles take some input t or β, and we use their output to characterize the Hamiltonian. In (a), we show our model for time evolution, wherein we control three quantities: ρ0, t, and P. We assume we can evolve the input state ρ0 forward in time, and after a time t, we make a measurement of the observable P. In (b), we show our model for learning from Gibbs states, wherein we control two quantities β and P. We assume we have access to the Gibbs state at temperature β⁻¹, and then measure the observable P.

Using these two interaction models, we propose a method for Hamiltonian learning that relies on a simple intuition. For some particular state preparation and measurement (SPAM) setting (consisting of a prescription for the observable P and, in the case of unitary dynamics, the initial state ρ0), which we write as \({{{{{{{\mathcal{S}}}}}}}}\), we can define a function \({f}_{{{{{{{{\mathcal{S}}}}}}}}}\) as the expectation value of P on the state ρ(t) or ρ(β), respectively:

$${f}_{{{{{{{{\mathcal{S}}}}}}}}}(x)=\left\{\begin{array}{ll}{{{{{{{\rm{Tr}}}}}}}}(P\rho (t=x))\quad &{{{{{{{\rm{for}}}}}}}}\,{{{{{{{\rm{unitary}}}}}}}}\,{{{{{{{\rm{evolution}}}}}}}}\\ {{{{{{{\rm{Tr}}}}}}}}(P\rho (\beta=x))\quad &{{{{{{{\rm{for}}}}}}}}\,{{{{{{{\rm{Gibbs}}}}}}}}\,{{{{{{{\rm{states}}}}}}}}.\hfill\end{array}\right.$$
(3)

We will show that, for the appropriate choice of SPAM parameters, \({f}_{{{{{{{{\mathcal{S}}}}}}}}}(x)\) can be viewed as a black-box function of x. Using this framework, we describe our basic approach below. For concreteness, we will consider learning with unitary evolution (the analysis for Gibbs states follows similarly in Supplementary Note 5). First, to assist the reader, we provide a glossary of notation (Table 1) to serve as a reference.

Table 1 Glossary of Notations

Preliminaries

To set the stage, we first give a formal definition of the Hamiltonian learning problem and define a sparsely interacting Hamiltonian.

Definition 1

(Hamiltonian learning problem). Fix a Hamiltonian on an n-qubit system that has an expansion in the Pauli basis:

$$H=\mathop{\sum }\limits_{i=1}^{r}{\theta }_{i}{P}_{i},$$
(4)

where each \({P}_{i}\in {\left\{I,{\sigma }_{x},{\sigma }_{y},{\sigma }_{z}\right\}}^{\otimes n}\) is a Pauli operator and \({{\Theta }}={\left[{\theta }_{1},\ldots,{\theta }_{r}\right]}^{T}\in {{\mathbb{R}}}^{r}\) are the Hamiltonian coefficients. We assume the Hamiltonian is traceless (i.e., no Pi is the identity I⊗n), and that we know the structure of the Hamiltonian (i.e., which Paulis Pi are present in the expansion), but that the coefficients θi are unknown. The Hamiltonian learning problem is to infer all of the coefficients θi up to an additive error \(\epsilon \cdot \mathop{\max }\nolimits_{i}\left|{\theta }_{i}\right|\) with success probability at least 1 − δ. We will assume two types of data access, which define different variants of the Hamiltonian learning problem.

  1. Unitary evolution: We can prepare the system in some initial product state ρ0 and evolve for a specifiable duration of time t. We can then make a measurement of some local Pauli observable on this time-evolved state.

  2. Gibbs states: We can prepare the system in a Gibbs state at some specifiable temperature. We can then make a measurement of some local Pauli observable on this Gibbs state.
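As a concrete representation of Eq. (4), a minimal sketch can map Pauli strings to coefficients and assemble the dense Hamiltonian matrix; the three-qubit terms below are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from functools import reduce

PAULI = {"I": np.eye(2, dtype=complex),
         "X": np.array([[0, 1], [1, 0]], dtype=complex),
         "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
         "Z": np.array([[1, 0], [0, -1]], dtype=complex)}

def pauli_matrix(label):
    """Dense matrix for a Pauli string such as 'ZZI'."""
    return reduce(np.kron, [PAULI[c] for c in label])

def hamiltonian(terms):
    """H = sum_i theta_i P_i from a dict {Pauli string: theta_i}, as in Eq. (4)."""
    n = len(next(iter(terms)))
    H = np.zeros((2**n, 2**n), dtype=complex)
    for label, theta in terms.items():
        H += theta * pauli_matrix(label)
    return H

# A 3-qubit example with an assumed (known) structure; the coefficients here
# are placeholders for the unknowns the learning problem targets. H is
# traceless because no term is the identity string.
H = hamiltonian({"ZZI": 0.7, "IZZ": -0.3, "XII": 0.5})
```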

Definition 2

(Sparsely interacting Hamiltonian). The interaction graph \({{{{{{{\mathcal{G}}}}}}}}\) (called the “dual” interaction graph by Haah et al.21) of a Hamiltonian consists of a set of vertices V and edges E.

$$V=\left\{{P}_{i}| i=1,\ldots,r\right\}\,,$$
(5)
$$E=\left\{\left({P}_{i},{P}_{j}\right)| \left({{{{{{{\rm{supp}}}}}}}}\left({P}_{i}\right)\cap {{{{{{{\rm{supp}}}}}}}}\left({P}_{j}\right)\ne \varnothing \right)\wedge \left(i \, \ne \, j\right)\right\}\,.$$
(6)

Each vertex represents one Pauli operator Pi in the Hamiltonian, and there is an edge between two vertices if the supports of their corresponding Pauli operators overlap. The support of a Pauli, supp(P), is the set of sites that P acts nontrivially on. We also define the degree \({{{{{{{\mathscr{D}}}}}}}}\) of the Hamiltonian to be the maximum degree of any node in the interaction graph:

$${{{{{\mathscr{D}}}}}}=\mathop{\max}\limits_{v {\in} {V}} {\deg}\!(v).$$
(7)

A Hamiltonian is sparsely interacting if \({{{{{{{\mathscr{D}}}}}}}}={{{{{{{\mathcal{O}}}}}}}}\left(1\right)\) (that is, \({{{{{{{\mathscr{D}}}}}}}}\) does not depend on system size). Notably, this class of Hamiltonians includes geometrically k-local Hamiltonians, as this locality constraint implies that the number of terms overlapping with any Pauli term is a function of k alone.

Example 2.1

In Fig. 2, we show a sample interaction graph for a 9-qubit transverse field Ising model (TFIM), whose Hamiltonian is

$$H=\mathop{\sum }\limits_{i=1}^{8}{\sigma }_{z}^{(i)}{\sigma }_{z}^{(i+1)}+\mathop{\sum }\limits_{i=1}^{9}{\sigma }_{x}^{(i)}.$$
(8)

The TFIM will serve as a prototypical example for the rest of this work.
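The interaction graph of Definition 2 is cheap to construct classically. The sketch below builds it for the TFIM of Eq. (8) and recovers the degree \({{{{{{{\mathscr{D}}}}}}}}=4\) of the 9-qubit instance:

```python
def tfim_terms(n):
    """Pauli strings of the n-qubit TFIM of Eq. (8): ZZ couplings and X fields."""
    zz = ["I" * i + "ZZ" + "I" * (n - i - 2) for i in range(n - 1)]
    x = ["I" * i + "X" + "I" * (n - i - 1) for i in range(n)]
    return zz + x

def support(p):
    """Sites on which a Pauli string acts nontrivially."""
    return {i for i, c in enumerate(p) if c != "I"}

def degree(terms):
    """Maximum degree D of the interaction graph of Definition 2."""
    deg = [0] * len(terms)
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if support(terms[i]) & support(terms[j]):
                deg[i] += 1
                deg[j] += 1
    return max(deg)

D = degree(tfim_terms(9))   # the 9-qubit TFIM of Fig. 2 has D = 4
```

A bulk ZZ term overlaps two neighboring ZZ terms and two X terms, giving the maximum degree of 4, independent of n.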

Fig. 2: Interaction graph for a transverse field Ising model.
figure 2

Interaction graph \({{{{{{{\mathcal{G}}}}}}}}\) for a 9-qubit transverse field Ising model. The degree of this Hamiltonian is \({{{{{{{\mathscr{D}}}}}}}}=4\), since for instance P2 is connected to 4 other Pauli terms.

Collecting necessary data

Writing the Taylor expansion of Eq. (3) as \({f}_{{{{{{{{\mathcal{S}}}}}}}}}(t)=\mathop{\sum }\nolimits_{m=0}^{\infty }{c}_{m}\frac{{t}^{m}}{m!}\), our protocol will focus on extracting Hamiltonian parameters using only the first order coefficient of the Taylor expansion, c1. To infer this coefficient, we will need to collect data that allows us to estimate \({f}_{{{{{{{{\mathcal{S}}}}}}}}}(t)\). The amount and nature of this data will depend on the higher order derivatives cm. More specifically, together with the desired accuracy of the learning protocol ϵ, a bound on the magnitude \(\left|{c}_{m}\right|\) will determine the required accuracy for our estimate of \({f}_{{{{{{{{\mathcal{S}}}}}}}}}(t)\), the number of different points at which we evaluate the function, and the specific times at which we evaluate it. The scaling we find for \(\left|{c}_{m}\right|\) varies depending on whether we are using unitary dynamics or Gibbs states, and also depends on the assumptions we make about the Hamiltonian (i.e., the structure parameter \({{{{{{{\mathscr{D}}}}}}}}\), and whether the Hamiltonian is commuting). This bound is a crucial determining factor for the rest of our algorithm. In this work, we find

$$\left|{c}_{m}\right|\sim \left\{\begin{array}{ll}{{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{{\mathscr{D}}}}}}}}}^{m}m!\right)\quad &{{{{{{{\rm{for}}}}}}}}\,{{{{{{{\rm{sparsely}}}}}}}}\,{{{{{{{\rm{interacting}}}}}}}}\,{{{{{{{\rm{Hamiltonians}}}}}}}}\,{{{{{{{\rm{using}}}}}}}}\,{{{{{{{\rm{unitary}}}}}}}}\,{{{{{{{\rm{dynamics}}}}}}}}\\ {{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{{\mathscr{D}}}}}}}}}^{m}\right)\quad &{{{{{{{\rm{for}}}}}}}}\,{{{{{{{\rm{commuting}}}}}}}}\,{{{{{{{\rm{Hamiltonians}}}}}}}}\,{{{{{{{\rm{using}}}}}}}}\,{{{{{{{\rm{unitary}}}}}}}}\,{{{{{{{\rm{dynamics}}}}}}}}\hfill\\ {{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{{\mathscr{D}}}}}}}}}^{2m}m!\right)\quad &{{{{{{{\rm{for}}}}}}}}\,{{{{{{{\rm{sparsely}}}}}}}}\,{{{{{{{\rm{interacting}}}}}}}}\,{{{{{{{\rm{Hamiltonians}}}}}}}}\,{{{{{{{\rm{with}}}}}}}}\,{{{{{{{\rm{Gibbs}}}}}}}}\,{{{{{{{\rm{states.}}}}}}}}\hfill\end{array}\right.$$
(9)

Importantly, due to the structure of the Hamiltonian (i.e., it is sparsely interacting), \(\left|{c}_{m}\right|\) does not depend on the size of the system. This enables our protocol to achieve a sample complexity that scales only polylogarithmically in n.

Having characterized the higher order derivatives, we return to c1: as mentioned above, this is the only derivative we are interested in. This is because, with the appropriate SPAM configuration, the first order Taylor coefficient c1 will correspond to exactly one Hamiltonian parameter. More precisely, by expanding H in the Pauli basis, we find that there is always at least one pair (P, ρ0) such that \({c}_{1}={{{{{{{\rm{Tr}}}}}}}}(i[H,P]{\rho }_{0})\) corresponds exactly to one of the Hamiltonian coefficients θm. However, this approach only allows us to extract one Hamiltonian parameter at a time. It turns out that if we are careful, we can learn entire sets of parameters at once by applying simultaneous measurements. These sets of parameters can be chosen with an efficient classical analysis of the Hamiltonian’s interaction graph: the key idea is that if two Pauli terms in the Hamiltonian are far enough apart, they have no effect on each other (to first order in time). After these sets are chosen, we can use a single fixed state ρ0, and a set of commuting observables \(\left\{{P}_{i}\right\}\) such that each \({{{{{{{\rm{Tr}}}}}}}}\left(i\left[H,{P}_{i}\right]{\rho }_{0}\right)\) extracts one Hamiltonian parameter, and all the observables Pi can be measured simultaneously. Furthermore, the observables Pi can be chosen to be single qubit Paulis and the initial state ρ0 will be a fully separable state. The reduced state for each site will be either the maximally mixed state I/2 or an eigenstate of X, Y, or Z; the full state ρ0 is a tensor product of these single qubit states. These states are easily prepared from \({\left|0\right\rangle }^{\otimes n}\) by applying a constant number of single qubit gates. This simultaneous measurement technique allows us to learn all the Hamiltonian parameters with a sample complexity that is only logarithmic in the number of parameters.
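The grouping step above can be sketched as a greedy coloring of the interaction graph. Here we use the illustrative rule that two terms conflict if they are within two hops of each other; this radius-2 exclusion is an assumption standing in for the paper's actual first-order criterion:

```python
def tfim_terms(n):
    """Pauli strings of the n-qubit TFIM: ZZ couplings and X fields."""
    zz = ["I" * i + "ZZ" + "I" * (n - i - 2) for i in range(n - 1)]
    x = ["I" * i + "X" + "I" * (n - i - 1) for i in range(n)]
    return zz + x

def adjacency(terms):
    """Adjacency sets of the interaction graph (Definition 2)."""
    supp = [{k for k, c in enumerate(p) if c != "I"} for p in terms]
    adj = [set() for _ in terms]
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if supp[i] & supp[j]:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def greedy_groups(terms, radius=2):
    """Greedily color terms so that no two terms within `radius` hops of each
    other share a group; each group is then a candidate set of parameters to
    learn with a single simultaneous-measurement setting."""
    adj = adjacency(terms)

    def ball(i):
        seen, frontier = {i}, {i}
        for _ in range(radius):
            frontier = {k for j in frontier for k in adj[j]} - seen
            seen |= frontier
        return seen - {i}

    color = {}
    for i in range(len(terms)):
        used = {color[j] for j in ball(i) if j in color}
        color[i] = min(c for c in range(len(terms) + 1) if c not in used)
    groups = {}
    for i, c in color.items():
        groups.setdefault(c, []).append(i)
    return list(groups.values())

groups = greedy_groups(tfim_terms(9))
# With degree D = O(1), the greedy bound gives O(D^2) groups, independent of n.
```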

After the SPAM parameters have been determined, we then evaluate \({f}_{{{{{{{{\mathcal{S}}}}}}}}}\) to collect a dataset that will subsequently allow us to infer c1. This dataset collection is the only part of our protocol that requires interaction with the system under study. For Hamiltonian learning with unitary dynamics, this involves initializing the system in a product state ρ0, evolving for some time t1, then measuring the set of observables Pi. This is repeated L times for different (predetermined) evolution times t1, t2, …, tL ∈ [0, A] up to some maximum time A.

Classical postprocessing

Having constructed our dataset, our Hamiltonian learning protocol can be summarized as follows. For each Hamiltonian parameter θi, we fit the corresponding data in our dataset with a degree L − 1 polynomial in t. The first derivative of this fitted polynomial at t = 0 serves as an estimate for the parameter θi. The following is an informal sketch of our algorithm.

By using a form of polynomial regression known as Chebyshev regression (which simply consists of choosing t judiciously), we can guarantee that c1 can be estimated with a bias \({{{{{{{\mathcal{O}}}}}}}}\left(\frac{{A}^{L}\left|{c}_{L}\right|}{L!}\right)\). If \(\left|{c}_{L}\right|\) grows no faster than a factorial, as is the case in Eq. (9), the bias decreases (at least) as a power law in L for suitably chosen A. However, our overall error scaling cannot achieve this bound due to the presence of noise when evaluating \({f}_{{{{{{{{\mathcal{S}}}}}}}}}\), as increasing L will result in an increase in the variance of our estimator for c1. The modeling error (bias) must be carefully traded against the effects of noise (variance). By appropriately balancing these two, we show that we are able to achieve almost shot noise-limited performance. This is made precise by the following theorem.

Theorem 1

(Hamiltonian learning with unitary dynamics). For the appropriate choice of Chebyshev degree \(L \sim {{{{{{{\mathcal{O}}}}}}}}\left(\log {\epsilon }^{-1}\right)\) and evolution time \(A \sim {{{{{{{\mathcal{O}}}}}}}}\left(1\right)\), the algorithm shown in Box 1 solves the Hamiltonian learning problem with sample complexity

$${{{{{{{\mathcal{O}}}}}}}}\left(\frac{{{{{{{{{\mathscr{D}}}}}}}}}^{4}\log (r/\delta )\,{{{{{{{\rm{polylog}}}}}}}}({{{{{{{\mathscr{D}}}}}}}}/\epsilon )}{{\epsilon }^{2}}\right),$$
(10)

and classical processing time complexity

$${{{{{{{\mathcal{O}}}}}}}}\left(\frac{{{{{{{{{\mathscr{D}}}}}}}}}^{2}r\log (r/\delta )\,{{{{{{{\rm{polylog}}}}}}}}({{{{{{{\mathscr{D}}}}}}}}/\epsilon )}{{\epsilon }^{2}}\right).$$
(11)

Proof

See Supplementary Note 4.

Similar to the results of França et al.28, this can be generalized, via careful selection of initial states and measurements, to learn the Lindbladian (when expanded in the Pauli basis) of open quantum systems undergoing Markovian dynamics. The sample and classical processing time complexities using Gibbs states are only worse by factors of \({{{{{{{\mathscr{D}}}}}}}}\) and \({{{{{{{{\mathscr{D}}}}}}}}}^{2}\), respectively.
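The estimation step at the heart of this result can be sketched as Chebyshev regression of (possibly noisy) expectation values followed by differentiation at t = 0. The test function below is an assumption standing in for the measured \({f}_{{{{{{{{\mathcal{S}}}}}}}}}(t)\):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)

def estimate_c1(f, A, L, sigma=0.0):
    """Estimate c1 = f'(0) by fitting a degree-(L-1) polynomial to f at the
    L Chebyshev nodes mapped to [0, A] (Chebyshev regression)."""
    k = np.arange(L)
    x = np.cos((2 * k + 1) * np.pi / (2 * L))      # Chebyshev nodes on (-1, 1)
    t = (A / 2) * (1 + x)                          # nodes mapped to (0, A)
    y = np.array([f(ti) for ti in t]) + sigma * rng.standard_normal(L)
    coef = C.chebfit(x, y, L - 1)                  # fit in the x variable
    dcoef = C.chebder(coef)
    return (2 / A) * C.chebval(-1.0, dcoef)        # chain rule: t = 0 <-> x = -1

# Noise-free sanity check on an assumed test function f(t) = sin(3t), f'(0) = 3.
c1_hat = estimate_c1(lambda t: np.sin(3 * t), A=0.5, L=6)
```

Setting sigma > 0 models shot noise on each evaluation, exposing the bias-variance trade-off in A and L discussed above.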

Numerical simulations

In Theorem 1, we have established the theoretical sample and processing time complexities of our Hamiltonian learning protocol, indicating its effectiveness under certain settings of the Chebyshev degree L and evolution time A. However, to provide practical guidance, we now delve into the optimal configurations of our algorithm for real-world applications. This includes prescribing specific values for L and A based on numerical considerations. Additionally, we present compelling numerical results obtained from an 80-qubit transverse field Ising model (TFIM), providing empirical evidence that further supports the utility of our protocol. Our aim will be to learn the following TFIM Hamiltonian:

$$H=\mathop{\sum }\limits_{i=1}^{n-1}{J}_{i}{\sigma }_{z}^{(i)}\otimes {\sigma }_{z}^{(i+1)}+\mathop{\sum }\limits_{i=1}^{n}{B}_{i}{\sigma }_{x}^{(i)},$$
(12)

where Ji, Bi~Unif(−1, 1). We choose the TFIM for its broad range of applications29, including its relevance for quantum computing platforms such as Rydberg atom arrays30. The dynamics of this Hamiltonian are simulated with the time evolution block-decimation method31,32,33,34,35.
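At small sizes, the TEBD simulation can be replaced by exact dense evolution. The sketch below draws a random 5-qubit instance of Eq. (12) and checks, for an illustrative SPAM choice (a Y = +1 eigenstate on qubit 0, maximally mixed elsewhere, measuring σz on qubit 0), that the slope of the expectation value at t = 0 recovers a field coefficient; the relation c1 = 2B0 is derived for this particular choice and is not a prescription from the paper:

```python
import numpy as np
from functools import reduce
from scipy.linalg import expm

rng = np.random.default_rng(7)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(op, site, n, op2=None):
    """Tensor `op` onto `site` (and `op2` onto site+1 if given), identities elsewhere."""
    ops = [I2] * n
    ops[site] = op
    if op2 is not None:
        ops[site + 1] = op2
    return reduce(np.kron, ops)

# Random TFIM of Eq. (12) on n = 5 qubits; exact dense evolution stands in
# for the TEBD simulation used at n = 80.
n = 5
J = rng.uniform(-1, 1, n - 1)
B = rng.uniform(-1, 1, n)
H = sum(J[i] * embed(Z, i, n, Z) for i in range(n - 1)) \
  + sum(B[i] * embed(X, i, n) for i in range(n))

# Illustrative SPAM choice: P = sigma_z on qubit 0, rho0 = |+i><+i| ⊗ (I/2)^4.
# Since [Z0 Z1, Z0] = 0 and i[X0, Z0] = 2 Y0, the first Taylor coefficient is
# c1 = Tr(i[H, P] rho0) = 2 B0 <Y0> = 2 B0.
P = embed(Z, 0, n)
plus_i = np.array([[0.5, -0.5j], [0.5j, 0.5]])
rho0 = reduce(np.kron, [plus_i] + [I2 / 2] * (n - 1))

def f(t):
    U = expm(-1j * H * t)
    return np.trace(P @ U @ rho0 @ U.conj().T).real

dt = 1e-4
central_diff = (f(dt) - f(-dt)) / (2 * dt)   # numerical estimate of c1
```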

Our protocol has two hyperparameters that determine its performance: the maximum evolution time A and the fitting polynomial degree L. Setting these parameters is a delicate balance between noise-induced and modeling errors. If A is too low or L is too high, the variance in the dataset will dominate the error; on the other hand, if A is too high or L is too low, the modeling error will dominate. It is generally desirable to set these two parameters such that the modeling and noise errors are comparable. However, in some settings, it may be desirable to let the dataset variance grow somewhat larger than the modeling error, since the noise error can be quantified exactly in terms of the variances \({\sigma }_{\ell }^{2}\), which can be obtained by a bootstrap estimate from the dataset (see Supplementary Note 3); there is no analogous method for quantifying the modeling error. One possible method for setting A and L is to optimize the error bounds (see Fig. 3). Numerically, these optimal values behave as anticipated in our theoretical analysis (Theorem 1): the optimal L* scales as \({{{{{{{\mathcal{O}}}}}}}}\left(\log {\epsilon }^{-1}\right)\), and A~1. This leads to a sampling complexity that scales as \({{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{\rm{polylog}}}}}}}}(1/\epsilon ){\epsilon }^{-2}\right)\).

Fig. 3: Optimal hyperparameter settings.
figure 3

Settings for N, L, and A as a function of the desired error ϵ. These settings are found based on minimizing the upper bound on NL (in practice, L can only take integer values, so the values shown would be rounded to the nearest integer). For the case of arbitrary Hamiltonians, we observe \(L \sim {{{{{{{\mathcal{O}}}}}}}}\left(\log {\epsilon }^{-1}\right)\), \(A \sim {{{{{{{\mathcal{O}}}}}}}}\left(1\right)\), and \(N \sim {{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{\rm{polylog}}}}}}}}(1/\epsilon ){\epsilon }^{-2}\right)\). We find similar scaling for the case of the commuting Hamiltonian in every variable except A, which also scales as \({{{{{{{\mathcal{O}}}}}}}}\left(\log {\epsilon }^{-1}\right)\). Despite this, the overall sample complexity is only better than the general case by a constant factor.

In Fig. 4, we show the error in the recovered Hamiltonian parameters corresponding to a target error of ϵ = 0.01. As expected, the theoretical prediction for the noise error is nearly exact. The modeling error, however, is overestimated by nearly four orders of magnitude. This overestimate has important consequences for the algorithm, since it results in a poorly specified evolution time A. We propose a number of remedies for this in Supplementary Note 6; the improvements enabled by these techniques are shown in Fig. 5. As demonstrated by the figure, we are able to recover all 159 Hamiltonian parameters up to an error of 10% using just ~10⁶ samples.

Fig. 4: Empirical error of the Hamiltonian learning protocol.
figure 4

The empirical modeling and noise error of the Hamiltonian learning protocol using the optimal A and L for ϵ = 0.01 as prescribed in Fig. 3. The modeling errors are calculated with a noise-free dataset, and the noise errors are calculated from a single noisy dataset. The dashed line indicates the maximum theoretical modeling error on the left and indicates the predicted variance due to noise on the right.

Fig. 5: Empirical error distribution.
figure 5

On the left, we show the maximum absolute error across all 159 coefficients of the 80-qubit TFIM model, plotted against the total number of samples used by the learning protocol, and on the right, we show the quotient of the theoretical error upper bound and the empirical errors from numerical simulations (note the log-log scale for both plots). The violin plots show the distribution of maximum absolute errors from 100 random initializations of the TFIM (with coefficients sampled uniformly between −1 and 1). The distributions show the [1%, 99%] interval in a narrow line, a [16%, 84%] interval in a wider line, and the median marked in white. The violin plots are offset by a small amount for visualization purposes, but each cluster of four violin plots used the same number of queries marked by the dotted gray lines. We set the failure probability to δ = 15%.

Discussion

In this work, we have discussed the quantum Hamiltonian learning problem. We introduced a unifying model for Hamiltonian learning using both unitary dynamics and Gibbs states. By subsuming these two approaches into the same model, we were able to describe an abstract routine for learning the Hamiltonian of a quantum many-body system given limited access to the system. This routine was based on fixing certain SPAM parameters, then viewing the system as a function f of a single variable. In this work, we consider this variable to be either time t (in which case f represents the time-evolved expectation value of a Pauli observable) or inverse temperature β (in which case f represents the thermal expectation value of a Pauli observable). We argued that for the appropriate choice of SPAM parameters, the derivatives of f – particularly \({f}^{{\prime} }(t=0)\) – would correspond exactly to particular coefficients in the Hamiltonian. We then showed that \({f}^{{\prime} }(t=0)\) could be inferred both accurately and efficiently from noisy evaluations of f. Finally, we concluded by describing how our protocol could achieve better than linear sample complexity in r (the number of Hamiltonian parameters) by using SPAM configurations amenable to simultaneous measurements.

This culminated in our main result, wherein we proposed an algorithm that achieves an almost noise-limited (\(\sim \frac{{{{{{{{\rm{polylog}}}}}}}}({\epsilon }^{-1})}{{\epsilon }^{2}}\)) sample complexity, similar to that of Haah et al.21 and França et al.28. However, our work represents an advance for several reasons. In comparison to Haah et al.21, we significantly reduce the sample complexity dependence on the parameter \({{{{{{{\mathscr{D}}}}}}}}\) from \({{{{{{{{\mathscr{D}}}}}}}}}^{21}\) to \({{{{{{{{\mathscr{D}}}}}}}}}^{4}\). In comparison to França et al.28, whose approach covers only Hamiltonian learning from unitary dynamics, our protocol generalizes to Gibbs states. Furthermore, our approach offers an additional advantage: unlike ref. 28, which requires a geometrically local Hamiltonian, our protocol operates efficiently with a “sparsely interacting” Hamiltonian, which is a considerably weaker assumption, as it eliminates the need for geometric locality. Moreover, we reduce the measurement parallelization overhead from \({{{{{{{\mathcal{O}}}}}}}}({16}^{k})\) (assuming a geometrically k-local Hamiltonian) to \({{{{{{{\mathcal{O}}}}}}}}\left({{{{{{{{\mathscr{D}}}}}}}}}^{2}\right)\), a substantial improvement. This is especially relevant in practical applications, where we can often rule out a priori the presence of certain terms in our Hamiltonian from physical constraints or symmetry considerations. That is, oftentimes we have \({{{{{{{\mathscr{D}}}}}}}}\ll {4}^{k}\); in these settings, our protocol can provide a significant advantage. Furthermore, by deriving explicit bounds on the performance of our algorithm, we were able to provide precise numerical prescriptions for theoretically optimal hyperparameters such as the maximum evolution time and Chebyshev degree. We concluded by proposing a number of heuristic improvements to our algorithm, and argued that they are reasonable to apply in general.
This combination of improvements makes significant steps towards achieving a practically useful protocol that can be applied experimentally, as indicated by the demonstration of our protocol on a large (80-qubit) simulated problem.

Although we have demonstrated a successful application of our learning algorithm on a simulated problem, this simulation did not include possible detrimental experimental effects. With respect to SPAM errors, our algorithm makes minimal SPAM requirements (requiring only single qubit measurements and simple product states). To first order, the effect of SPAM errors will only be in the measurement of the first order commutator \({{{{{{{\rm{Tr}}}}}}}}(i[H,P]{\rho }_{0})\). For instance, if our initial state is subject to decoherence, this will result in a systematic underestimate of the Hamiltonian parameters. A natural direction for future investigation is therefore how this protocol can be made robust to SPAM errors. Another consideration is the potential discrepancy between the Hamiltonian ansatz used by the learning algorithm and the actual underlying Hamiltonian governing the physical system. In realistic scenarios, the system Hamiltonian may deviate from the assumed form due to various factors such as unaccounted interactions, noise, or experimental limitations. To first order, unaccounted-for terms do not affect the performance guarantees of our algorithm except through their effect on \({{{{{{{\mathscr{D}}}}}}}}\). However, as noted previously, a good estimate of \({{{{{{{\mathscr{D}}}}}}}}\) is a strong determining factor in the practical performance of our protocol; further investigation is needed to understand the extent to which model mismatches adversely affect performance in practice. We also leave to future work a study of how this protocol can be improved by making stronger assumptions on either the Hamiltonian or the suite of interactions available to us. For instance, we already showed a constant (but significant) drop in the number of measurements required for learning a commuting Hamiltonian with unitary dynamics. We expect a similar effect for Hamiltonian learning with Gibbs states.
Furthermore, if we assume we can interact with our system using a trusted quantum simulator of our own, a variety of approaches become possible. Among these is Hamiltonian learning with Loschmidt echoes, as done in Wiebe et al.22. Rigorous performance bounds have not yet been found for this approach, but we speculate that a similar application of our techniques may yield improved performance; we leave this for future work.

Methods

In this section, we will describe our derivative estimation protocol, and show that this allows us to make guarantees on the error. First, we establish an elementary procedure for estimating the first order derivative \({f}^{{\prime} }(0)\) given access only to noisy estimates of f. We then apply this procedure to Hamiltonian learning with unitary dynamics and Gibbs states.

Inferring the first-order commutator

For a system evolving under a Hamiltonian H and an initial state given by some density matrix ρ0, the expectation value of any operator P can be written as:

$$\left\langle P\left(t\right)\right\rangle={{{{{{{\rm{Tr}}}}}}}}\left(P{\rho }_{0}(t)\right)={{{{{{{\rm{Tr}}}}}}}}\left(P{e}^{-iHt}{\rho }_{0}{e}^{iHt}\right)=\mathop{\sum }\limits_{m=0}^{\infty }\frac{{(it)}^{m}}{m!}{{{{{{{\rm{Tr}}}}}}}}\left(\left[{H}^{m}P\right]{\rho }_{0}\right),$$
(13)
$${{{{{{\rm{where}}}}}}}\,\left[H^{m} P\right]=\underbrace{ \left[H,\left[H,\ldots,\left[{H}\right.\right.\right.}_{{m\, {{{{{{\rm{times}}}}}}}}},\left.\left.\left.P\right]\ldots\right]\right] \,{{{{{\rm{with}}}}}}\, \left[H^0 P\right]=P.$$
(14)

This equality simply uses the Heisenberg expansion of the time-evolved operator P(t).

In this section, we define a critical subroutine of our Hamiltonian learning algorithm that infers the expectation \({{{{{{{\rm{Tr}}}}}}}}(\left(i\left[H,P\right]\right){\rho }_{0})\), for P being a local Pauli operator, by measuring time-evolved expectation values. The main idea behind our algorithm is that \({{{{{{{\rm{Tr}}}}}}}}\left(\left(i[H,P]\right){\rho }_{0}\right)\) is the time derivative of the expectation \({{{{{{{\rm{Tr}}}}}}}}(P{e}^{-iHt}{\rho }_{0}{e}^{iHt})\). More specifically, the Heisenberg expansion in Eq. (13) expresses the time-evolved expectation of an observable as

$$\langle P(t)\rangle=\sum\limits_{m=0}^{\infty }\frac{{i}^{m}}{m!}\mathrm{Tr}\left(\left[{H}^{m}P\right]{\rho }_{0}\right){t}^{m}.$$
(15)

Therefore \(\langle P(t)\rangle\) can be modeled as a univariate power series in time, \(\sum _{m=0}^{\infty }{c}_{m}{t}^{m}\), with coefficients

$${c}_{m}=\frac{{i}^{m}}{m!}\mathrm{Tr}\left(\left[{H}^{m}P\right]{\rho }_{0}\right).$$
(16)

If we were able to access \(\langle P(t)\rangle\) exactly, the most effective way to find c1 would be to simply differentiate \(\langle P(t)\rangle\) via finite differences with very small Δt (i.e., \({c}_{1}\approx \frac{\langle P({{\Delta }}t)\rangle -\langle P(0)\rangle }{{{\Delta }}t}\)). However, since our measurements of \(\langle P({{\Delta }}t)\rangle\) are subject to shot noise, the variance of this estimator scales as \(\mathcal{O}({({{\Delta }}t)}^{-2})\), preventing us from using arbitrarily small Δt; conversely, as Δt grows, the bias of the finite-difference estimator grows. The algorithm in Box 2 is a generalization of finite differencing that uses Chebyshev regression (see Supplementary Note 1) to estimate c1. It takes as input a maximum evolution time A and a cutoff degree L for the Chebyshev polynomial. The finite cutoff degree induces a bias in the recovered polynomial coefficients; however, we will demonstrate that this bias is suppressed much more effectively than for the finite-difference estimator, as these errors scale as a power law with power L. As mentioned at the beginning of this section, this error bound depends on a bound for the derivative \(|\frac{{\mathrm{d}}^{L}\langle P(t)\rangle }{{\mathrm{d}t}^{L}}|=|\mathrm{Tr}([{H}^{L}P]\rho (t))|\). Since ρ(t) is a density matrix, a simple application of Hölder's inequality shows that \(\left|\mathrm{Tr}(\left[{H}^{L}P\right]\rho (t))\right|\le \left|\left[{H}^{L}P\right]\right|\) (where \(\left|\cdot \right|\) denotes the spectral norm). We can bound spectral norms of iterated commutators with the Hamiltonian as follows:

Definition 3

(Typical scales). We define a typical time scale

$$\tau=\frac{1}{2{\mathscr{D}}{\left|{{\Theta }}\right|}_{\infty }}$$
(17)

of our Hamiltonian. The appearance of \({\left|{{\Theta }}\right|}_{\infty }\) in this scale is unsurprising; scaling all the coefficients up by some constant factor will shorten the time scale of the dynamics by the same factor. The structure parameter \({\mathscr{D}}\) appears because, all else being equal, we expect observables of a highly connected Hamiltonian to change faster than those of a weakly connected one. Indeed, in Supplementary Note 2, we show that the norm of the mth iterated commutator between H and P is upper bounded by a quantity scaling roughly as \({\tau }^{-m}\).

Definition 4

(Dataset). Assume we are given the following hyperparameters:

  • L, the number of distinct times at which we evaluate \(\langle P(t)\rangle\);

  • A, the maximum time at which we evaluate \(\langle P(t)\rangle\); and

  • N, the number of samples used to estimate a single evaluation of \(\langle P(t)\rangle\).

We construct the dataset \({\mathcal{D}}\) by evaluating \(\langle P(t)\rangle\) at the roots \(\{{z}_{i}\,|\,i=1,\ldots,L\}\) of the Lth Chebyshev polynomial (see Supplementary Note 1 for a review of Chebyshev polynomials). Our dataset comprises L points:

$${\mathcal{D}}=\left\{({t}_{1},{y}_{1}),({t}_{2},{y}_{2}),\ldots,({t}_{L},{y}_{L})\right\},\;\mathrm{where}\\ {t}_{i}=\frac{A}{2}(1+{z}_{i}),\\ {y}_{i}\sim {Y}_{i},$$
(18)

where Yi is an N-sample mean estimator of \(\langle P({t}_{i})\rangle\). That is, it satisfies \({\mathbb{E}}[{Y}_{i}]=\langle P({t}_{i})\rangle\) and \(\mathrm{var}[{Y}_{i}]={\sigma }_{i}^{2}/N\), where \({\sigma }_{i}^{2}\) is the variance of a single measurement of \(\langle P({t}_{i})\rangle\). The mapping \({t}_{i}=\frac{A}{2}(1+{z}_{i})\) ensures that the evolution time is nonnegative and never exceeds A.
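As an illustration, the dataset of Definition 4 can be assembled as follows. This is a minimal sketch in which the quantum device is replaced by a made-up expectation ⟨P(t)⟩ = sin(1.3 t), and each Yi is the mean of N simulated ±1 Pauli measurement outcomes; all names and parameter values here are illustrative, not part of the protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def chebyshev_times(L, A):
    """Roots z_i of the L-th Chebyshev polynomial, mapped to t_i = A/2 (1 + z_i)."""
    z = np.cos((2 * np.arange(1, L + 1) - 1) * np.pi / (2 * L))
    return 0.5 * A * (1 + z)

def mean_estimator(expectation, N):
    """N-sample mean of a +/-1-valued Pauli measurement with the given expectation."""
    p_plus = (1 + expectation) / 2
    return rng.choice([1.0, -1.0], size=N, p=[p_plus, 1 - p_plus]).mean()

# Made-up stand-in for the device; in the protocol this is <P(t)> itself
true_expectation = lambda t: np.sin(1.3 * t)

L, A, N = 6, 0.5, 1000
dataset = [(t, mean_estimator(true_expectation(t), N)) for t in chebyshev_times(L, A)]
```

By construction each yi lies in [−1, 1] and has variance (1 − ⟨P(ti)⟩²)/N ≤ 1/N, matching the σi²/N assumption above.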

Having collected the dataset, it is simple to infer the first derivative c1.

The following theorem shows that for the appropriate choice of evolution time A and Chebyshev degree L, the error of the estimator \({\tilde{c}}_{1}\) in Box 2 is close to being noise-limited.

Theorem 2

(Sample complexity for one coefficient). Fix some maximum failure probability δ and error ϵ. Assume that we have access to an unbiased (single-shot) estimator of \(\langle P(t)\rangle\) with variance \({\sigma }^{2}\le 1\). Furthermore, assume \(\left|P\right|\le 1\). Then there is some choice of maximum evolution time A ~ τ and Chebyshev degree \(L \sim \log {\epsilon }^{-1}\) such that with

$$N=\mathcal{O}\left(\log (1/\delta )\,\mathrm{polylog}(1/\epsilon )\,{\epsilon }^{-2}\right)$$
(19)

sample complexity, we can construct an estimator \({\tilde{c}}_{1}\) such that \(\left|{c}_{1}-{\tilde{c}}_{1}\right|\le \epsilon \cdot {\mathscr{D}}\), except with failure probability at most δ.

Proof

See Supplementary Note 3. □
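To illustrate the behavior guaranteed by Theorem 2, here is a minimal numerical sketch of the Chebyshev-regression estimator in the spirit of Box 2 (not an exact reproduction of it). We fit noisy evaluations of a made-up expectation ⟨P(t)⟩ = sin(ωt), whose true derivative at t = 0 is c1 = ω, at the Chebyshev nodes on [0, A]; ω and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

omega = 1.3                    # made-up: <P(t)> = sin(omega t), so c1 = omega
A, L, N = 0.5, 4, 1_000_000    # max evolution time, Chebyshev degree, shots per point

# Evaluate <P(t)> at the roots of the L-th Chebyshev polynomial mapped to [0, A]
z = np.cos((2 * np.arange(1, L + 1) - 1) * np.pi / (2 * L))
t = 0.5 * A * (1 + z)
y = np.sin(omega * t) + rng.normal(0.0, 1.0 / np.sqrt(N), size=L)  # shot noise

# Chebyshev regression: fit a degree-(L-1) series on [0, A], differentiate at t = 0
fit = np.polynomial.Chebyshev.fit(t, y, deg=L - 1, domain=[0.0, A])
c1_estimate = fit.deriv()(0.0)
print(abs(c1_estimate - omega))   # residual dominated by shot noise, not bias
```

Unlike a finite-difference estimate, the evolution times here stay of order A, so the shot-noise amplification is bounded while the truncation bias is suppressed with the Chebyshev degree.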

Recovering Hamiltonian coefficients

With an efficient procedure for accurately estimating first-order commutators \(\mathrm{Tr}(i[H,P]{\rho }_{0})\) in hand, we can construct an algorithm that infers the coefficients of H from these commutators. The idea is to carefully choose ρ0 and P so that \(\mathrm{Tr}(i[H,P]{\rho }_{0})\) isolates one parameter at a time.

First, we introduce the notation that \({\rho }_{0}^{({\mathcal{X}})}\) and \({P}^{({\mathcal{X}})}\) denote the reduced state and Pauli matrix (respectively) restricted to the qubits in \({\mathcal{X}}\), and \({{\mathcal{X}}}^{{\prime} }\) denotes the set of all qubits not in \({\mathcal{X}}\).

Lemma 1

(Term selection) Let P be some Pauli operator such that there exists some \(i\in \left\{1,\ldots,r\right\}\) where \(\mathrm{supp}\,P\subseteq \mathrm{supp}\,{P}_{i}\) and \(\frac{i[{P}_{i},P]}{2} \, \ne \, 0\). Let

$${\mathcal{X}}=\mathrm{supp}\,{P}_{i},$$
(20)
$${\mathcal{Y}}=\left(\bigcup \left\{\mathrm{supp}\,{P}_{j}\,|\,\mathrm{supp}\,{P}_{j}\cap {\mathcal{X}} \, \ne \, \varnothing \right\}\right)\setminus {\mathcal{X}},$$
(21)
$${\mathcal{Z}}={({\mathcal{X}}\cup {\mathcal{Y}})}^{{\prime} },$$
(22)
$${\rho }_{0}={\left(\frac{{\mathbb{I}}+i[{P}_{i},P]/2}{{2}^{\left|{\mathcal{X}}\right|}}\right)}^{({\mathcal{X}})}\otimes {\left(\frac{{\mathbb{I}}}{{2}^{\left|{\mathcal{Y}}\right|}}\right)}^{({\mathcal{Y}})}\otimes {\rho }_{0}^{({\mathcal{Z}})}.$$
(23)

In words, \({\mathcal{Y}}\) is a neighborhood around \({\mathcal{X}}\) containing the support of every Pauli term that intersects \({\mathcal{X}}\), and \({\mathcal{Z}}\) is the set of all qubits outside \({\mathcal{X}}\cup {\mathcal{Y}}\). The state ρ0 is maximally mixed on the qubits in \({\mathcal{Y}}\); on the qubits in \({\mathcal{X}}\), it is chosen such that \(\mathrm{Tr}(i[{P}_{i},P]{\rho }_{0}^{({\mathcal{X}})}/2)=1\); and on all remaining qubits, ρ0 can be anything. Then:

$$\mathrm{Tr}\left(i[H,P]{\rho }_{0}\right)={\theta }_{i}.$$
(24)

Proof

See Supplementary Note 4. □
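The term-selection mechanism of Lemma 1 can be checked numerically on a toy example. The sketch below builds a made-up 3-qubit transverse-field Ising Hamiltonian (all coefficients illustrative), forms the state of Eq. (23) for the target term Z0Z1 with probe P = X0, and recovers θ0; since factor-of-2 conventions for the Pauli commutator vary, we divide out the normalization Tr(i[Pi, P]ρ0) explicitly.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

def kron(*ops):
    out = np.eye(1, dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out

# Made-up 3-qubit TFIM: coefficients theta and their Pauli terms
theta = [0.8, 0.5, 0.3, 0.7, 0.2]
paulis = [kron(Z, Z, I2), kron(I2, Z, Z),
          kron(X, I2, I2), kron(I2, X, I2), kron(I2, I2, X)]
H = sum(c * P for c, P in zip(theta, paulis))

Pi = paulis[0]           # target term Z0 Z1, coefficient theta_0 = 0.8
P = kron(X, I2, I2)      # single-qubit probe on one site of supp(Pi)

# Lemma 1 state: (I + i[Pi, P]/2) / 2^|X| on X = {0, 1}, maximally mixed on Y = {2}
M = 1j * (Pi @ P - P @ Pi) / 2
rho0 = (np.eye(8) + M) / 8

val = np.trace(1j * (H @ P - P @ H) @ rho0).real     # only the Pi term survives
norm = np.trace(1j * (Pi @ P - P @ Pi) @ rho0).real  # commutator normalization
print(val / norm)   # ≈ 0.8 = theta_0
```

All other terms of H either commute with P or have zero trace against the maximally mixed "moat", so only θ0 contributes to the expectation.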

This defines a simple algorithm for Hamiltonian learning. For simplicity, for any Pauli Pi, we set the observable P to be a single-qubit Pauli acting on one site in \({\mathcal{X}}\) such that [Pi, P] ≠ 0 (see Box 3).

However, the runtime of this algorithm is Ω(r), since the procedure must be called once for each term in the Hamiltonian. We propose an improvement wherein we estimate \(\mathrm{Tr}(P{e}^{-iHt}{\rho }_{0}{e}^{iHt})\) for many different choices of P simultaneously, setting ρ0 so that we can extract the coefficients of many terms at once. Rather than using shadow tomography (as done in França et al.28), which can result in \(\mathcal{O}\left({16}^{k}\right)\) scaling, we carefully exploit our knowledge of the Hamiltonian's structure to obtain a smaller parallelization overhead. The way forward relies on the fact that, in Lemma 1, \({\rho }_{0}^{({\mathcal{Z}})}\) can be anything. Similarly to Haah et al.21, we partition the terms of our Hamiltonian into groups that can each be inferred simultaneously. This partition is based on a graph coloring; for details, see Supplementary Note 4.

Definition 5

(Squared graph). Let the square of the interaction graph, \({{\mathcal{G}}}^{2}\), be the graph with the same vertex set as \({\mathcal{G}}\), in which any two vertices are connected if their distance in \({\mathcal{G}}\) is at most 2. In words, the edges of \({{\mathcal{G}}}^{2}\) are

$$\left\{(i,k)\,|\,\exists j\,\left(\mathrm{supp}\,{P}_{i}\cap \mathrm{supp}\,{P}_{j}\ne \varnothing \right)\wedge \left(\mathrm{supp}\,{P}_{j}\cap \mathrm{supp}\,{P}_{k}\ne \varnothing \right)\wedge (i\ne k)\right\}.$$
(25)

Our algorithm will rely on a graph coloring of \({{\mathcal{G}}}^{2}\). The essential idea is that between Paulis of the same color there is always a “moat” separating them. This moat is filled with maximally mixed states, which completely suppresses the influence of the terms we are not interested in. A partitioning of the Hamiltonian terms via some C-coloring of \({{\mathcal{G}}}^{2}\) makes it natural to rewrite the Hamiltonian in double-sum notation:

$$H=\sum\limits_{i=1}^{C}\sum\limits_{j=1}^{\left|{{\bf{V}}}_{i}\right|}{\theta }_{i,j}{P}_{i,j},$$
(26)

where Vi is the set of all Pauli terms assigned color i. For instance, see Supplementary Fig. 3 for a coloring of the squared interaction graph for a 9-qubit TFIM.

Lemma 2

(Simultaneous inference for a partition) Let Vi be a color class in a coloring of \({{\mathcal{G}}}^{2}\). The coefficient of each Pauli in Vi can be inferred up to an error \(\epsilon {\left|{{\Theta }}\right|}_{\infty }\), with failure probability at most δ for each individual coefficient (so the overall failure probability is upper bounded by \(\delta \cdot \left|{{\bf{V}}}_{i}\right|\)). This can be done with sample complexity

$$\mathcal{O}\left({{\mathscr{D}}}^{2}\log (1/\delta )\,\mathrm{polylog}({\mathscr{D}}/\epsilon )\,{\epsilon }^{-2}\right).$$
(27)

Proof

See Supplementary Note 4. □

Theorem 3

(Hamiltonian learning with unitary dynamics). Fix a sparsely interacting Hamiltonian H that has r terms in its Pauli expansion with coefficients Θ. For the appropriate choice of Chebyshev degree L and evolution time A, the algorithm in Box 3 and Box 4 solves the quantum Hamiltonian learning problem (with an additive error \(\epsilon {\left|{{\Theta }}\right|}_{\infty }\) and failure probability at most δ) with sample complexity

$$\mathcal{O}\left(\frac{{{\mathscr{D}}}^{4}\log (r/\delta )\,\mathrm{polylog}({\mathscr{D}}/\epsilon )}{{\epsilon }^{2}}\right),$$
(28)

and classical processing time complexity

$$\mathcal{O}\left(\frac{{{\mathscr{D}}}^{2}r\log (r/\delta )\,\mathrm{polylog}({\mathscr{D}}/\epsilon )}{{\epsilon }^{2}}\right).$$
(29)

Proof

We partition our Hamiltonian terms into sets that can be simultaneously inferred. There are at most \({{\mathscr{D}}}^{2}\) such sets (for a proof, see Supplementary Note 4); moreover, this partitioning can be found with a classical greedy algorithm with runtime \(\mathcal{O}\left({{\mathscr{D}}}^{2}\right)\)36. We then apply Lemma 2 to each of these sets. For the detailed proof, see Supplementary Note 4. □

In a different setup, we may instead be given access to copies of a Gibbs state at inverse temperature β. If we measure an observable Pi, its expectation will be

$${\langle {P}_{i}\rangle }_{\beta }=\frac{\mathrm{Tr}\left({P}_{i}\exp (-\beta H)\right)}{\mathrm{Tr}\left(\exp (-\beta H)\right)}.$$
(30)

In what follows, we apply the analysis of Haah et al.21 to formulate \({\langle {P}_{i}\rangle }_{\beta }\) as a polynomial in β, in accordance with the framework of Eq. (3). We will show that the coefficients of the Hamiltonian can be learned from the first-order term of this polynomial, thereby mapping the problem of Hamiltonian learning from Gibbs states onto Hamiltonian learning with unitary dynamics.
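As a sanity check of this reduction, consider a made-up two-qubit Hamiltonian: by Pauli orthogonality, Tr(Pi) = 0 and Tr(PiH)/2^n = θi, so the derivative of ⟨Pi⟩β at β = 0 equals −θi. The sketch below reuses the Chebyshev fit of the previous subsection, now in β; all parameter values are illustrative.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
I2 = np.eye(2, dtype=complex)

# Made-up two-qubit Hamiltonian; we recover theta_0 = 0.8 from <Z0 Z1>_beta
H = 0.8 * np.kron(Z, Z) + 0.3 * np.kron(X, I2)
Pi = np.kron(Z, Z)

def gibbs_expectation(beta):
    """<Pi>_beta computed exactly via the eigendecomposition of H."""
    w, v = np.linalg.eigh(H)
    rho = (v * np.exp(-beta * w)) @ v.conj().T     # unnormalized exp(-beta H)
    return (np.trace(Pi @ rho) / np.trace(rho)).real

# Chebyshev fit in beta on [0, A]; the first-order coefficient gives -theta_0
A, L = 0.2, 5
z = np.cos((2 * np.arange(1, L + 1) - 1) * np.pi / (2 * L))
betas = 0.5 * A * (1 + z)
vals = [gibbs_expectation(b) for b in betas]
fit = np.polynomial.Chebyshev.fit(betas, vals, deg=L - 1, domain=[0.0, A])
theta_estimate = -fit.deriv()(0.0)
print(theta_estimate)   # ≈ 0.8
```

Here β plays the role that t played for unitary dynamics, which is precisely the mapping used in Theorem 4.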

Theorem 4

(Hamiltonian learning with Gibbs states). The Hamiltonian learning problem (with an additive error \(\epsilon {\left|{{\Theta }}\right|}_{\infty }\) and failure probability at most δ) can be solved using

$$\mathcal{O}\left(\frac{{{\mathscr{D}}}^{5}\log (r/\delta )\,\mathrm{polylog}({\mathscr{D}}/\epsilon )}{{\epsilon }^{2}}\right)$$
(31)

copies of the Gibbs state. This can be achieved with a time complexity

$$\mathcal{O}\left(\frac{{{\mathscr{D}}}^{4}r\log (1/\delta )\,\mathrm{polylog}({\mathscr{D}}/\epsilon )}{{\epsilon }^{2}}\right).$$
(32)

Proof

The protocol is a near mirror image of the Hamiltonian learning protocol using unitary dynamics. For the full proof, see Supplementary Note 5. □