Introduction

Finding the ground state of a quantum many-body system is a fundamental problem with far-reaching consequences for physics, materials science, and chemistry. Many powerful methods1,2,3,4,5,6,7 have been proposed, but classical computers still struggle to solve many general classes of the ground state problem. To extend the reach of classical computers, classical machine learning (ML) methods have recently been adapted to study this and related problems both empirically and theoretically8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35. A recent work36 proposes a polynomial-time classical ML algorithm that can efficiently predict ground state properties of gapped geometrically local Hamiltonians, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. The same work36 further shows that, under a widely accepted conjecture, no polynomial-time classical algorithm that does not learn from data can achieve the same performance guarantee. However, although the ML algorithm given in36 uses a polynomial amount of training data and computational time, the polynomial scaling \({{{{{{{\mathcal{O}}}}}}}}({n}^{c})\) has a very large degree c. Here, \(f(x)={{{{{{{\mathcal{O}}}}}}}}(g(x))\) denotes that f(x) is asymptotically upper bounded by g(x) up to constant factors in the limit n → ∞. Moreover, when the prediction error ϵ is small, the amount of training data grows exponentially in 1/ϵ, indicating that a very small prediction error cannot be achieved efficiently.

In this work, we present an improved ML algorithm for predicting ground state properties. We consider an m-dimensional vector \(x\in {[-1,1]}^{m}\) that parameterizes an n-qubit gapped geometrically local Hamiltonian given as

$$H(x)=\mathop{\sum}\limits_{j}{h}_{j}({\overrightarrow{x}}_{j}),$$
(1)

where x is the concatenation of constant-dimensional vectors \({\overrightarrow{x}}_{1},\ldots,{\overrightarrow{x}}_{L}\) parameterizing the few-body interactions \({h}_{j}({\overrightarrow{x}}_{j})\). Let ρ(x) be the ground state of H(x) and O be a sum of geometrically local observables with \(\parallel O{\parallel }_{\infty }\le 1\). We assume that the geometry of the n-qubit system is known, but we do not know how \({h}_{j}({\overrightarrow{x}}_{j})\) is parameterized or what the observable O is. The goal is to learn a function h*(x) that approximates the ground state property \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\) from a classical dataset,

$$\left({x}_{\ell },{y}_{\ell }\right),\quad \forall \ell=1,\ldots,N,$$
(2)

where \({y}_{\ell }\,\approx\, {{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\) records the ground state property for \({x}_{\ell }\in {[-1,1]}^{m}\) sampled from an arbitrary unknown distribution \({{{{{{{\mathcal{D}}}}}}}}\). Here, \({y}_{\ell }\,\approx \,{{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\) means that \({y}_{\ell }\) has additive error at most ϵ. If \({y}_{\ell }\,=\,{{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\), the rigorous guarantees improve.

The setting considered in this work is very similar to that in36, but we assume the geometry of the n-qubit system to be known, which is necessary to overcome the sample complexity lower bound of \(N={n}^{\Omega (1/\epsilon )}\) given in36. Here, f(x) = Ω(g(x)) denotes that f(x) is asymptotically lower bounded by g(x) up to constant factors. One may compare the setting to that of finding ground states using adiabatic quantum computation37,38,39,40,41,42,43,44. To find the ground state property \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\) of H(x), this class of quantum algorithms requires the ground state ρ0 of another Hamiltonian H0 stored in quantum memory, explicit knowledge of a gapped path connecting H0 and H(x), and an explicit description of O. In contrast, here we focus on ML algorithms that are entirely classical, have no access to quantum state data, and have no knowledge about the Hamiltonian H(x), the observable O, or the gapped paths between H(x) and other Hamiltonians.

The proposed ML algorithm uses a nonlinear feature map \(x\mapsto \phi (x)\) with a geometric inductive bias built into the mapping. At a high level, the high-dimensional vector ϕ(x) contains nonlinear functions for each geometrically local subset of coordinates in the m-dimensional vector x. Here, the geometry over coordinates of the vector x is defined using the geometry of the n-qubit system. The ML algorithm learns a function h*(x) = w*⋅ϕ(x) by training an \({\ell }_{1}\)-regularized regression (LASSO)45,46,47 in the feature space. An overview of the ML algorithm is shown in Fig. 1. We prove that, given ϵ = Θ(1) (the notation f(x) = Θ(g(x)) denotes that \(f(x)={{{{{{{\mathcal{O}}}}}}}}(g(x))\) and f(x) = Ω(g(x)) both hold, i.e., f(x) is asymptotically equal to g(x) up to constant factors), the improved ML algorithm can use a dataset of size

$$N={{{{{{{\mathcal{O}}}}}}}}\left(\log \left(n\right)\right),$$
(3)

to learn a function h*(x) with an average prediction error of at most ϵ,

$$\mathop{{\mathbb{E}}}\limits_{x \sim {{{{{{{\mathcal{D}}}}}}}}}{\left\vert {h}^{*}(x)-{{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\right\vert }^{2}\le \epsilon,$$
(4)

with high success probability.

Fig. 1: Overview of the proposed machine learning algorithm.

Given a vector \(x\in {[-1,1]}^{m}\) that parameterizes a quantum many-body Hamiltonian H(x), the algorithm uses a geometric structure to create a high-dimensional vector \(\phi (x)\in {{\mathbb{R}}}^{{m}_{\phi }}\). The ML algorithm then predicts properties or a representation of the ground state ρ(x) of Hamiltonian H(x) using the mϕ-dimensional vector ϕ(x).

The sample complexity \(N={{{{{{{\mathcal{O}}}}}}}}\left(\log \left(n\right)\right)\) of the proposed ML algorithm improves substantially over the sample complexity of \(N={{{{{{{\mathcal{O}}}}}}}}({n}^{c})\) in the previously best-known classical ML algorithm36, where c is a very large constant. The computational time of both the improved ML algorithm and the ML algorithm in36 is \({{{{{{{\mathcal{O}}}}}}}}(nN)\). Hence, the logarithmic sample complexity N immediately implies a nearly linear computational time. In addition to the reduced sample complexity and computational time, the proposed ML algorithm works for any distribution over x, while the best previously known algorithm36 works only for the uniform distribution over [−1, 1]m. Furthermore, when we consider the scaling with the prediction error ϵ, the best known classical ML algorithm in36 has a sample complexity of \(N={n}^{{{{{{{{\mathcal{O}}}}}}}}(1/\epsilon )}\), which is exponential in 1/ϵ. In contrast, the improved ML algorithm has a sample complexity of \(N=\log (n){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )}\), which is quasi-polynomial in 1/ϵ.

We also discuss a generalization of the proposed ML algorithm to predicting ground state representations when trained on classical shadow representations48,49,50,51,52. In this setting, the proposed ML algorithm yields the same reduction in sample and time complexity compared to36 for predicting ground state representations.

Results

The central component of the improved ML algorithm is the geometric inductive bias built into our feature mapping \(x\in {[-1,1]}^{m}\mapsto \phi (x)\in {{\mathbb{R}}}^{{m}_{\phi }}\). To describe the ML algorithm, we first need to present some definitions relating to this geometric structure.

Definitions of the geometric inductive bias

We consider n qubits arranged at locations, or sites, in a d-dimensional space, e.g., a spin chain (d = 1), a square lattice (d = 2), or a cubic lattice (d = 3). This geometry is characterized by the distance \({d}_{{{{{{{{\rm{qubit}}}}}}}}}(i,{i}^{{\prime} })\) between any two qubits i and \({i}^{{\prime} }\). Using the distance dqubit between qubits, we can define the geometry of local observables. Given any two observables OA, OB on the n-qubit system, we define the distance dobs(OA, OB) between the two observables as the minimum distance between the qubits that OA and OB act on. We also say an observable is geometrically local if it acts nontrivially only on nearby qubits under the distance metric dqubit. We then define S(geo) as the set of all geometrically local Pauli observables, i.e., geometrically local observables that belong to the set \({\{I,X,Y,Z\}}^{\otimes n}\). The size of S(geo) is \({{{{{{{\mathcal{O}}}}}}}}(n)\), linear in the total number of qubits.
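As a concrete illustration of these definitions, the following minimal Python sketch (not code from this work) builds dqubit, dobs, and an enumeration of S(geo) for a 1D spin chain; the locality cutoff `radius` and the chain length are hypothetical toy choices standing in for the implicit constants in "geometrically local".

```python
# Illustrative sketch (not the authors' code) of the geometric definitions for
# a 1D chain of n qubits.  `radius` is a hypothetical locality cutoff: a Pauli
# string counts as geometrically local if its support fits within a window of
# this radius around some site.
from itertools import product

n = 4          # number of qubits (toy value)
radius = 1     # hypothetical locality cutoff

def d_qubit(i, j):
    """Distance between qubits i and j on a 1D chain."""
    return abs(i - j)

def d_obs(support_A, support_B):
    """Distance between two observables, defined as the minimum distance
    between the qubits they act on."""
    return min(d_qubit(i, j) for i in support_A for j in support_B)

def geometrically_local_paulis():
    """Enumerate S^(geo): non-identity Pauli strings supported near a single
    site; each element is a tuple of (qubit, Pauli letter) pairs."""
    S_geo = set()
    for site in range(n):
        window = [q for q in range(n) if d_qubit(q, site) <= radius]
        for letters in product("IXYZ", repeat=len(window)):
            pauli = tuple((q, p) for q, p in zip(window, letters) if p != "I")
            if pauli:                      # skip the identity string
                S_geo.add(pauli)
    return sorted(S_geo)

print(len(geometrically_local_paulis()))   # grows linearly with n for fixed radius
```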

With these basic definitions in place, we now define a few more geometric objects. The first object is the set of coordinates in the m-dimensional vector x that are close to a geometrically local Pauli observable P. This is formally given by,

$${I}_{P}\triangleq \left\{c\in \{1,\ldots,m\}:{d}_{{{{{{{{\rm{obs}}}}}}}}}({h}_{j(c)},P)\le {\delta }_{1}\right\},$$
(5)

where hj(c) is the few-body interaction term in the n-qubit Hamiltonian H(x) whose parameters \({\overrightarrow{x}}_{j(c)}\) include the variable \({x}_{c}\in [-1,1]\), and δ1 is an efficiently computable hyperparameter that is determined later. Each variable xc in the m-dimensional vector x corresponds to exactly one interaction term \({h}_{j(c)}={h}_{j(c)}({\overrightarrow{x}}_{j(c)})\), where the parameter vector \({\overrightarrow{x}}_{j(c)}\) contains the variable xc. Intuitively, IP is the set of coordinates that have the strongest influence on the function \({{{{{{{\rm{Tr}}}}}}}}(P\rho (x))\).

The second geometric object is a discrete lattice over the space [−1, 1]m associated to each subset IP of coordinates. For any geometrically local Pauli observable P ∈ S(geo), we define XP to contain all vectors x that take on value 0 for coordinates outside IP and take on a set of discrete values for coordinates inside IP. Formally, this is given by

$${X}_{P}\triangleq \left.\left\{\begin{array}{l}x\in {[-1,1]}^{m}:\,{{\mbox{if}}}\,\,c \, \notin \, {I}_{P},\,\,{x}_{c}\,=\,0\quad \hfill \\ \,{{\mbox{if}}}\,\,c\in {I}_{P},\,\,{x}_{c}\in \left\{0,\pm {\delta }_{2},\pm 2{\delta }_{2},\ldots,\pm 1\right\}\quad \end{array}\right.\right\},$$
(6)

where δ2 is an efficiently computable hyperparameter to be determined later. The definition of XP is meant to enumerate all sufficiently different vectors for coordinates in the subset \({I}_{P}\subseteq \{1,\ldots,m\}\).

Now given a geometrically local Pauli observable P and a vector x in the discrete lattice \({X}_{P}\subset {[-1,1]}^{m}\), the third object is a set Tx,P of vectors in [−1, 1]m that are close to x for coordinates in IP. This is formally defined as,

$${T}_{x,P}\triangleq \left\{{x}^{{\prime} }\in {[-1,1]}^{m}:-\frac{{\delta }_{2}}{2} \, < \, {x}_{c}-{x}_{c}^{{\prime} }\le \frac{{\delta }_{2}}{2},\forall c\in {I}_{P}\right\}.$$
(7)

The set Tx,P is defined as a thickened affine subspace close to the vector x for coordinates in IP. If a vector \({x}^{{\prime} }\) is in Tx,P, then \({x}^{{\prime} }\) is close to x for all coordinates in IP, but \({x}^{{\prime} }\) may be far away from x for coordinates outside of IP. Examples of these definitions are given in Supplementary Figs. 1 and 2.
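To make Eqs. (5)–(7) concrete, the following toy Python sketch constructs IP, XP, and a membership test for \({T}_{x,P}\), assuming a 1D chain in which coordinate c of x parameterizes a nearest-neighbor term acting on qubits (c, c + 1); the values of δ1 and δ2 are placeholders rather than the choices used in the proofs.

```python
# Toy sketch of the geometric objects in Eqs. (5)-(7).  Assumptions: a 1D chain,
# one coordinate of x per nearest-neighbor interaction term, and placeholder
# hyperparameter values delta1, delta2.
from itertools import product
import numpy as np

n = 4                      # qubits (matching the toy chain above)
m = n - 1                  # one parameter per nearest-neighbor term (assumption)
delta1, delta2 = 1, 0.5    # placeholder hyperparameters

def d_term_to_pauli(c, support_P):
    """Distance between the interaction term h_{j(c)} (on qubits c, c+1)
    and a Pauli observable P with the given support."""
    return min(abs(i - j) for i in (c, c + 1) for j in support_P)

def I_P(support_P):
    """Eq. (5): coordinates whose interaction term lies within delta1 of P."""
    return [c for c in range(m) if d_term_to_pauli(c, support_P) <= delta1]

def X_P(support_P):
    """Eq. (6): lattice vectors with grid values on I_P and zeros elsewhere."""
    grid = np.arange(-1.0, 1.0 + 1e-9, delta2)   # {0, ±delta2, ..., ±1}
    coords = I_P(support_P)
    lattice = []
    for values in product(grid, repeat=len(coords)):
        x = np.zeros(m)
        x[coords] = values
        lattice.append(x)
    return lattice

def in_T_xP(x_test, x_lattice, support_P):
    """Eq. (7): check whether x_test lies in the thickened affine subspace
    T_{x,P} attached to the lattice point x_lattice."""
    coords = I_P(support_P)
    diff = x_lattice[coords] - x_test[coords]
    return bool(np.all((-delta2 / 2 < diff) & (diff <= delta2 / 2)))
```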

Feature mapping and ML model

We can now define the feature map ϕ taking an m-dimensional vector x to an mϕ-dimensional vector ϕ(x) using the thickened affine subspaces \({T}_{{x}^{{\prime} },P}\) for every geometrically local Pauli observable P ∈ S(geo) and every vector \({x}^{{\prime} }\) in the discrete lattice XP. The dimension of the vector ϕ(x) is given by \({m}_{\phi }={\sum }_{P\in {S}^{{{{{{{{\rm{(geo)}}}}}}}}}}| {X}_{P}|\). Each coordinate of the vector ϕ(x) is indexed by \({x}^{{\prime} }\in {X}_{P}\) and P ∈ S(geo) with

$$\phi {(x)}_{{x}^{{\prime} },P}\triangleq {\mathbb{1}}\left[x\in {T}_{{x}^{{\prime} },P}\right],$$
(8)

which is the indicator function checking if x belongs to the thickened affine subspace. Recall that this means each coordinate of the mϕ-dimensional vector ϕ(x) checks if x is close to a point \({x}^{{\prime} }\) on a discrete lattice XP for the subset IP of coordinates close to a geometrically local Pauli observable P.
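Continuing the toy sketches above, the feature vector of Eq. (8) is simply the concatenation of these indicator functions over all P ∈ S(geo) and all lattice points \({x}^{{\prime} }\in {X}_{P}\):

```python
# Continuing the toy sketches above: assemble phi(x) of Eq. (8) by concatenating
# the indicator features over all geometrically local Paulis P and all lattice
# points x' in X_P.
def feature_map(x, S_geo):
    """Map x in [-1,1]^m to the binary feature vector phi(x)."""
    features = []
    for pauli in S_geo:                        # P ranges over S^(geo)
        support_P = [q for q, _ in pauli]      # qubits that P acts on
        for x_prime in X_P(support_P):         # x' ranges over the lattice X_P
            features.append(1.0 if in_T_xP(x, x_prime, support_P) else 0.0)
    return np.array(features)

# Example usage (requires the definitions from the two sketches above).
x_example = np.random.default_rng(0).uniform(-1, 1, size=m)
phi_x = feature_map(x_example, geometrically_local_paulis())
```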

The classical ML model we consider is an \({\ell }_{1}\)-regularized regression (LASSO) over the ϕ(x) space. More precisely, given an efficiently computable hyperparameter B > 0, the classical ML model finds an mϕ-dimensional vector w* from the following optimization problem,

$$\mathop{\min }\limits_{\begin{array}{c}{{{{{{{\bf{w}}}}}}}}\in {{\mathbb{R}}}^{{m}_{\phi }}\\ \parallel {{{{{{{\bf{w}}}}}}}}{\parallel }_{1}\le B\end{array}}\,\frac{1}{N}\mathop{\sum }\limits_{\ell=1}^{N}{\left\vert {{{{{{{\bf{w}}}}}}}}\cdot \phi ({x}_{\ell })-{y}_{\ell }\right\vert }^{2},$$
(9)

where \({\{({x}_{\ell },{y}_{\ell })\}}_{\ell=1}^{N}\) is the training data. Here, \({x}_{\ell }\in {[-1,1]}^{m}\) is an m-dimensional vector that parameterizes a Hamiltonian \(H({x}_{\ell })\) and \({y}_{\ell }\) approximates \({{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\). The learned function is given by h*(x) = w*⋅ϕ(x). The optimization does not have to be solved exactly: we only need to find a w* whose objective value is at most \({{{{{{{\mathcal{O}}}}}}}}(\epsilon )\) larger than the minimum. There is an extensive literature53,54,55,56,57,58,59 on improving the computational time of this optimization problem. The best known classical algorithm58 has a computational time scaling linearly in mϕ/ϵ2 up to a log factor, while the best known quantum algorithm59 has a computational time scaling linearly in \(\sqrt{{m}_{\phi }}/{\epsilon }^{2}\) up to a log factor.
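As a minimal sketch of how Eq. (9) might be solved in practice, one can use an off-the-shelf LASSO solver. Note that scikit-learn's `Lasso` implements the penalized (Lagrangian) form with a regularization strength `alpha`, rather than the explicit constraint ∥w∥1 ≤ B written above, so `alpha` plays the role of the hyperparameter B and would be tuned by cross-validation.

```python
# A minimal sketch of the l1-regularized regression in Eq. (9).  sklearn's
# Lasso minimizes (1/(2N)) * ||Phi w - y||^2 + alpha * ||w||_1, the Lagrangian
# counterpart of the norm-constrained problem in the text.
import numpy as np
from sklearn.linear_model import Lasso

def train_ml_model(Phi_train, y_train, alpha=0.01):
    """Phi_train: (N, m_phi) matrix whose rows are phi(x_l); y_train: labels y_l."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    model.fit(Phi_train, y_train)
    return model                    # model.coef_ is the learned weight vector w*

def predict(model, phi_x):
    """Evaluate h*(x) = w* . phi(x)."""
    return float(model.coef_ @ phi_x)
```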

Rigorous guarantee

The classical ML algorithm given above yields the following sample and computational complexity. This theorem improves substantially upon the result in36, which requires \(N={n}^{{{{{{{{\mathcal{O}}}}}}}}(1/\epsilon )}\). The proof idea is given in Section “Methods”, and the detailed proof is given in Supplementary Sections 1, 2, 3. Using the proof techniques presented in this work, one can show that the sample complexity \(N=\log (n/\delta ){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )}\) also applies to any sum of few-body observables O = ∑jOj with \(\parallel {\sum }_{j}{O}_{j}{\parallel }_{\infty }\le 1\), even if the operators {Oj} are not geometrically local.

Theorem 1

(Sample and computational complexity). Given \(n,\,\delta \, > \, 0,\,\frac{1}{e} \, > \,\epsilon \, > \, 0\) and a training data set \({\{{x}_{\ell },{y}_{\ell }\}}_{\ell=1}^{N}\) of size

$$N=\log (n/\delta ){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )},$$
(10)

where \({x}_{\ell }\) is sampled from an unknown distribution \({{{{{{{\mathcal{D}}}}}}}}\) and \(| {y}_{\ell }-{{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))| \le \epsilon\) for any observable O with eigenvalues between −1 and 1 that can be written as a sum of geometrically local observables. With a proper choice of the efficiently computable hyperparameters δ1, δ2, and B, the learned function h*(x) = w*⋅ϕ(x) satisfies

$$\mathop{{\mathbb{E}}}\limits_{x \sim {{{{{{{\mathcal{D}}}}}}}}}{\left\vert {h}^{*}(x)-{{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\right\vert }^{2}\le \epsilon$$
(11)

with probability at least 1 − δ. The training and prediction time of the classical ML model are bounded by \({{{{{{{\mathcal{O}}}}}}}}(nN)=n\log (n/\delta ){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )}\).

The output \({y}_{\ell }\) in the training data can be obtained by measuring the observable O on the ground state \(\rho ({x}_{\ell })\) multiple times and averaging the outcomes. Alternatively, we can use the classical shadow formalism48,49,50,51,52,60 that performs randomized Pauli measurements on \(\rho ({x}_{\ell })\) to predict \({{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\) for a wide range of observables O. We can also combine Theorem 1 and the classical shadow formalism to use our ML algorithm to predict ground state representations, as seen in the following corollary. This allows one to predict ground state properties \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\) for a large number of observables O rather than just a single one. We present the proof of Corollary 1 in Supplementary Section 3B.

Corollary 1

Given \(n,\,\delta\, > \, 0,\,\frac{1}{e} \, > \, \epsilon \, > \, 0\) and a training data set \({\{{x}_{\ell },{\sigma }_{T}(\rho ({x}_{\ell }))\}}_{\ell=1}^{N}\) of size

$$N=\log (n/\delta ){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )},$$
(12)

where \({x}_{\ell }\) is sampled from an unknown distribution \({{{{{{{\mathcal{D}}}}}}}}\) and \({\sigma }_{T}(\rho ({x}_{\ell }))\) is the classical shadow representation of the ground state \(\rho ({x}_{\ell })\) using T randomized Pauli measurements. For \(T=\tilde{{{{{{{{\mathcal{O}}}}}}}}}(\log (n)/{\epsilon }^{2})\), the proposed ML algorithm can learn a ground state representation \({\hat{\rho }}_{N,T}(x)\) that achieves

$$\mathop{{\mathbb{E}}}\limits_{x \sim {{{{{{{\mathcal{D}}}}}}}}}| {{{{{{{\rm{Tr}}}}}}}}(O{\hat{\rho }}_{N,T}(x))-{{{{{{{\rm{Tr}}}}}}}}(O\rho (x)){| }^{2}\le \epsilon$$
(13)

for any observable O with eigenvalues between −1 and 1 that can be written as a sum of geometrically local observables, with probability at least 1 − δ.
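For intuition on the classical shadow data \({\sigma }_{T}(\rho ({x}_{\ell }))\) appearing above, the following sketch shows the standard single-snapshot estimator for a geometrically local Pauli observable under randomized Pauli measurements; the measurement-record format is an assumption made for illustration.

```python
# A minimal sketch (the measurement-record format is an assumption) of the
# standard classical-shadow estimator for a Pauli string P under randomized
# single-qubit Pauli measurements.  A snapshot (bases, bits) records, for each
# qubit, the random basis in {"X","Y","Z"} and the outcome bit in {0,1}.
# A snapshot contributes 3^{|supp(P)|} times the product of outcome signs when
# its bases match P on the support of P, and 0 otherwise.
import numpy as np

def estimate_pauli(pauli, snapshots):
    """pauli: dict {qubit: "X"|"Y"|"Z"}; snapshots: list of (bases, bits)."""
    total = 0.0
    for bases, bits in snapshots:
        if all(bases[q] == p for q, p in pauli.items()):
            signs = np.prod([1 - 2 * bits[q] for q in pauli])
            total += 3 ** len(pauli) * signs
        # snapshots whose bases do not match the support of P contribute 0
    return total / len(snapshots)

# Tr(O rho) for O = sum_P alpha_P P is then estimated by combining these
# per-Pauli estimates linearly with the coefficients alpha_P.
```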

We can also show that the problem of estimating ground state properties for the class of parameterized Hamiltonians \(H(x)={\sum }_{j}{h}_{j}({\overrightarrow{x}}_{j})\) considered in this work is hard for non-ML algorithms that cannot learn from data, assuming the widely believed conjecture that NP-complete problems cannot be solved in randomized polynomial time. This is a manifestation of the computational power of data studied in61. The proof of Proposition 1 in36 constructs a parameterized Hamiltonian H(x) that belongs to the family of parameterized Hamiltonians considered in this work and hence establishes the following.

Proposition 1

(A variant of Proposition 1 in36). Consider a randomized polynomial-time classical algorithm \({{{{{{{\mathcal{A}}}}}}}}\) that does not learn from data. Suppose for any smooth family of gapped 2D Hamiltonians \(H(x)={\sum }_{j}{h}_{j}({\overrightarrow{x}}_{j})\) and any single-qubit observable \(O,{{{{{{{\mathcal{A}}}}}}}}\) can compute ground state properties \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\) up to a constant error averaged over \(x\in {[-1,1]}^{m}\) uniformly. Then, NP-complete problems can be solved in randomized polynomial time.

This proposition states that even under the restricted settings of considering only 2D Hamiltonians and single-qubit observables, predicting ground state properties is a hard problem for non-ML algorithms. When one considers higher-dimensional Hamiltonians and multi-qubit observables, the problem only becomes harder because one can embed low-dimensional Hamiltonians in higher-dimensional spaces.

Numerical experiments

We present numerical experiments to assess the performance of the classical ML algorithm in practice. The results illustrate the improvement of the algorithm presented in this work compared to those considered in36, the mild dependence of the sample complexity on the system size n, and the inherent geometry exploited by the ML models. We consider the classical ML models previously described, utilizing a random Fourier feature map62. While the indicator function feature map was a useful tool to obtain our rigorous guarantees, random Fourier features are more robust and commonly used in practice. Moreover, we still expect our rigorous guarantees to hold with this change because Fourier features can approximate any function, which is the central property of the indicator functions used in our proofs. Furthermore, we determine the optimal hyperparameters using cross-validation to minimize the root-mean-square error (RMSE) and then evaluate the performance of the chosen ML model using a test set. The models and hyperparameters are further detailed in Supplementary Section 4.
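For reference, a generic random Fourier feature map62 looks as follows; the exact construction used in our experiments (including how the geometric structure enters) is specified in Supplementary Section 4, so this snippet should be read as a standard illustration rather than the precise map employed here.

```python
# A generic random Fourier feature map (Rahimi-Recht style) approximating an
# RBF kernel; feature inner products approximate exp(-gamma * ||x - x'||^2).
# This is a standard illustration, not the exact map used in the experiments.
import numpy as np

def random_fourier_features(X, num_features=500, gamma=1.0, seed=0):
    """X: (N, m) array of parameter vectors; returns an (N, num_features) array."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(m, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```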

For these experiments, we consider the two-dimensional antiferromagnetic random Heisenberg model consisting of 4 × 5 = 20 to 9 × 5 = 45 spins as considered in previous work36. In this setting, the spins are placed on sites in a 2D lattice. The Hamiltonian is

$$H=\mathop{\sum}\limits_{\langle ij\rangle }{J}_{ij}({X}_{i}{X}_{j}+{Y}_{i}{Y}_{j}+{Z}_{i}{Z}_{j}),$$
(14)

where the summation ranges over all pairs 〈ij〉 of neighboring sites on the lattice and the couplings {Jij} are sampled uniformly from the interval [0, 2]. Here, the vector x is a list of all couplings Jij so that the dimension of the parameter space is m = O(n), where n is the system size. The nonnegative interval [0, 2] corresponds to antiferromagnetic interactions. To minimize the Heisenberg interaction terms, nearby qubits have to form singlet states. While the square lattice is bipartite and lacks the standard geometric frustration, the presence of disorder makes the ground state calculation more challenging as neighboring qubits will compete in the formation of singlets due to the monogamy of entanglement63.
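A short sketch of drawing one random instance of Eq. (14); the lattice orientation and open boundary conditions are assumptions of the sketch.

```python
# Sketch of one random instance of the 2D antiferromagnetic Heisenberg model
# in Eq. (14): couplings J_ij ~ Uniform[0, 2] on the edges of a rows x cols
# square lattice (open boundaries assumed).  The parameter vector x is the
# list of all couplings, so m equals the number of lattice edges, m = O(n).
import numpy as np

def random_heisenberg_instance(rows=5, cols=4, seed=0):
    rng = np.random.default_rng(seed)
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))   # horizontal bond
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))   # vertical bond
    couplings = rng.uniform(0.0, 2.0, size=len(edges))
    return edges, couplings    # x = couplings
```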

We trained a classical ML model using randomly chosen values of the parameter vector x = {Jij}. For each parameter vector of random couplings sampled uniformly from [0, 2], we approximated the ground state using the same method as in36, namely with the density-matrix renormalization group (DMRG)64 based on matrix product states (MPS)65. The classical ML model was trained on a data set \({\{{x}_{\ell },{\sigma }_{T}(\rho ({x}_{\ell }))\}}_{\ell=1}^{N}\) with N randomly chosen vectors x, where each x corresponds to a classical representation σT(ρ(x)) created from T randomized Pauli measurements48. For a given training set size N, we conduct 4-fold cross validation on the N data points to select the best hyperparameters, train a model with the best hyperparameters on the N data points, and test the performance on a test set of size N. Further details are discussed in Supplementary Section 4.
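A schematic of the 4-fold cross-validation step, reusing the random Fourier features sketched above; the hyperparameter grid shown is a placeholder, and the grids actually used are listed in Supplementary Section 4.

```python
# A minimal sketch of 4-fold cross-validation for hyperparameter selection
# (placeholder grid; the actual grids are in Supplementary Section 4).
# Features come from random_fourier_features defined above; the model is the
# Lasso regression of Eq. (9) in penalized form.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

def select_and_train(X_train, y_train):
    """X_train: (N, m) parameter vectors; y_train: estimates of Tr(O rho(x))."""
    Phi = random_fourier_features(X_train, num_features=500, gamma=1.0)
    search = GridSearchCV(
        Lasso(fit_intercept=False, max_iter=10000),
        param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1]},   # placeholder grid
        cv=4,                                             # 4-fold cross-validation
        scoring="neg_root_mean_squared_error",
    )
    search.fit(Phi, y_train)
    return search.best_estimator_, search.best_params_
```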

The ML algorithm predicted the classical representation of the ground state for a new vector x. These predicted classical representations were used to estimate two-body correlation functions, i.e., the expectation value of

$${C}_{ij}=\frac{1}{3}({X}_{i}{X}_{j}+{Y}_{i}{Y}_{j}+{Z}_{i}{Z}_{j}),$$
(15)

for each pair of qubits 〈ij〉 on the lattice. Here, we are using the combination of our ML algorithm with the classical shadow formalism as described in Corollary 1, leveraging this more powerful technique to predict a large number of ground state properties.

In Fig. 2A, we can clearly see that the ML algorithm proposed in this work consistently outperforms the ML models implemented in36, which include the rigorous polynomial-time learning algorithm based on the Dirichlet kernel proposed in36, Gaussian kernel regression66,67, and infinite-width neural networks68,69. Figure 2A (Left) and (Center) show that as the number T of measurements per data point or the training set size N increases, the prediction performance of the proposed ML algorithm improves faster than that of the other ML algorithms. This observation reflects the improvement in the sample complexity dependence on the prediction error ϵ: the sample complexity in36 depends exponentially on 1/ϵ, whereas Theorem 1 establishes a quasi-polynomial dependence on 1/ϵ. From Fig. 2A (Right), we can see that the ML algorithms do not yield a substantially worse prediction error as the system size n increases. This observation matches the \(\log (n)\) sample complexity in Theorem 1, but not the poly(n) sample complexity proven in36. These improvements are also relevant when comparing the ML predictions to actual correlation function values. Figure 3 in36 illustrates that for the average prediction error achieved in that work, the predictions by the ML algorithm match the simulated values closely. In this work, significantly less training data is needed to achieve the same prediction error and, hence, the same agreement with the simulated values.

Fig. 2: Predicting ground state properties in 2D antiferromagnetic random Heisenberg models.

a Prediction error. Each point indicates the root-mean-square error for predicting the correlation function in the ground state (averaged over Heisenberg model instances and each pair of neighboring spins). We present log-log plots for the scaling of prediction error ϵ with T and N: the slope corresponds to the exponent of the polynomial function ϵ(T), ϵ(N). The shaded regions show the standard deviation over different spin pairs. b Visualization. We plot how much each coupling Jij contributes to the prediction of the correlation function over different pairs of qubits in the trained ML model. Thicker and darker edges correspond to higher contributions. We see that the ML model learns to utilize the local geometric structure.

An important step for establishing the improved sample complexity in Theorem 1 is that a property on a local region R of the quantum system only depends on parameters in the neighborhood of region R. In Fig. 2B, we visualize what the trained ML model focuses on when predicting the correlation function over a pair of qubits. A thicker and darker edge is considered to be more important by the trained ML model. Each edge of the 2D lattice corresponds to a coupling Jij. For each edge, we sum the absolute values of the coefficients in the ML model that correspond to a feature that depends on the coupling Jij. We can see that the ML model learns to focus only on the neighborhood of a local region R when predicting the ground state property.

Discussion

The classical ML algorithm and the advantage over non-ML algorithms as proven in36 illustrate the potential of using ML algorithms to solve challenging quantum many-body problems. However, the classical ML model given in36 requires a large amount of training data. Although the need for a large dataset is a common trait in contemporary ML algorithms70,71,72, one would have to perform an equally large number of physical experiments to obtain such data. This makes the advantage of ML over non-ML algorithms challenging to realize in practice. The sample complexity \(N={{{{{{{\mathcal{O}}}}}}}}(\log n)\) of the ML algorithm proposed here illustrates that this advantage could potentially be realized after training with data from a small number of physical experiments. The existence of a theoretically backed ML algorithm with a \(\log (n)\) sample complexity raises the hope of designing good ML algorithms to address practical problems in quantum physics, chemistry, and materials science by learning from the relatively small amount of data that we can gather from real-world experiments.

Despite the progress in this work, many questions remain to be answered. Recently, powerful machine learning models such as graph neural networks have been used to empirically demonstrate a favorable sample complexity when leveraging the local structure of Hamiltonians in the 2D random Heisenberg model29,30. Is it possible to obtain rigorous theoretical guarantees for the sample complexity of neural-network-based ML algorithms for predicting ground state properties? An alternative direction is to notice that the current results have an exponential scaling in the inverse of the spectral gap. Is this exponential scaling a fundamental feature of the problem? Or do there exist more efficient ML models that can efficiently predict ground state properties for gapless Hamiltonians?

We have focused on the task of predicting local observables in the ground state, but many other physical properties are also of high interest. Can ML models predict low-energy excited state properties? Could we achieve a sample complexity of \(N={{{{{{{\mathcal{O}}}}}}}}(\log n)\) for predicting any observable O? Another important question is whether there is a provable quantum advantage in predicting ground state properties. Could we design quantum ML algorithms that can predict ground state properties by learning from far fewer experiments than any classical ML algorithm? Perhaps this could be shown by combining ideas from adiabatic quantum computation37,38,39,40,41,42,43,44 and recent techniques for proving quantum advantages in learning from experiments73,74,75,76,77. It remains to be seen if quantum computers could provide an unconditional super-polynomial advantage over classical computers in predicting ground state properties.

Methods

We describe the key ideas behind the proof of Theorem 1. The proof is separated into three parts. The first part in Supplementary Section 1 describes the existence of a simple functional form that approximates the ground state property \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\). The second part in Supplementary Section 2 gives a new bound for the \({\ell }_{1}\)-norm of the Pauli coefficients of the observable O when written in the Pauli basis. The third part in Supplementary Section 3 combines the first two parts, using standard tools from learning theory to establish the sample complexity corresponding to the prediction error bound given in Theorem 1. In the following, we discuss these three parts in detail.

Simple form for ground state property

Using the spectral flow formalism78,79,80, we first show that the ground state property can be approximated by a sum of local functions. First, we write O in the Pauli basis as \(O={\sum }_{P\in {\{I,X,Y,Z\}}^{\otimes n}}{\alpha }_{P}P\). Then, we show that for every geometrically local Pauli observable P, we can construct a function fP(x) that depends only on coordinates in the subset IP of coordinates that parameterizes interaction terms hj near the Pauli observable P. The function fP(x) is given by

$${f}_{P}(x)={\alpha }_{P}{{{{{{{\rm{Tr}}}}}}}}(P\rho ({\chi }_{P}(x))),$$
(16)

where \({\chi }_{P}(x)\in {[-1,1]}^{m}\) is defined as χP(x)c = xc for coordinates \(c\in {I}_{P}\) and χP(x)c = 0 for coordinates \(c\,\notin\, {I}_{P}\). The sum of these local functions fP can be used to approximate the ground state property,

$${{{{{{{\rm{Tr}}}}}}}}(O\rho (x)) \, \approx \, \mathop{\sum}\limits_{P\in {S}^{{{{{{{{\rm{(geo)}}}}}}}}}}{f}_{P}(x).$$
(17)

The approximation only incurs an \({{{{{{{\mathcal{O}}}}}}}}(\epsilon )\) error if we consider \({\delta }_{1}=\Theta ({\log }^{2}(1/\epsilon ))\) in the definition of IP. The key point is that correlations decay exponentially with distance in the ground state of a gapped local Hamiltonian; therefore, the properties of the ground state in a localized region are not sensitive to the details of the Hamiltonian at points far from that localized region. Furthermore, the local function fP is smooth. The smoothness property allows us to approximate each local function fP by a simple discretization,

$${f}_{P}(x)\approx \mathop{\sum}\limits_{{x}^{{\prime} }\in {X}_{P}}{f}_{P}({x}^{{\prime} }){\mathbb{1}}\left[x\in {T}_{{x}^{{\prime} },P}\right].$$
(18)

One could also use other approximations for this step, such as Fourier approximation or polynomial approximation. In fact, we apply a Fourier approximation instead in the numerical experiments, as discussed in Supplementary Section 4. For simplicity of the proof, we consider a discretization-based approximation with δ2 = Θ(1/ϵ) in the definition of \({T}_{{x}^{{\prime} },P}\) to incur at most an \({{{{{{{\mathcal{O}}}}}}}}(\epsilon )\) error. The point is that, for a sufficiently smooth function fP(x) that depends only on coordinates in IP and a sufficiently fine lattice over the coordinates in IP, replacing x by the nearest lattice point (based only on coordinates in IP) causes only a small error. Using the definition of the feature map ϕ(x) in Eq. (8), we have

$${{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\,\approx\, \mathop{\sum}\limits_{P\in {S}^{{{{{{{{\rm{(geo)}}}}}}}}}}\mathop{\sum}\limits_{{x}^{{\prime} }\in {X}_{P}}{f}_{P}({x}^{{\prime} })\phi {(x)}_{{x}^{{\prime} },P}={{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\cdot \phi (x),$$
(19)

where \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) is an mϕ-dimensional vector indexed by \({x}^{{\prime} }\in {X}_{P}\) and P ∈ S(geo) given by \({{{{{{{{\bf{w}}}}}}}}}_{{x}^{{\prime} },P}^{{\prime} }={f}_{P}({x}^{{\prime} })\). The approximation is accurate if we consider \({\delta }_{1}=\Theta ({\log }^{2}(1/\epsilon ))\) and δ2 = Θ(1/ϵ). Thus, we can see that the ML algorithm with the proposed feature mapping indeed has the capacity to approximately represent the target function \({{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\). As a result, we have the following lemma.

Lemma 1

(Training error bound). The function given by \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\cdot \phi (x)\) achieves a small training error:

$$\frac{1}{N}\mathop{\sum }\limits_{\ell=1}^{N}{\left\vert {{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\cdot \phi ({x}_{\ell })-{y}_{\ell }\right\vert }^{2}\le 0.53\epsilon .$$
(20)

This lemma follows from the two facts that \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\cdot \phi (x)\,\approx \,{{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\) and \({{{{{{{\rm{Tr}}}}}}}}(O\rho ({x}_{\ell }))\,\approx \,{y}_{\ell }\).

Norm inequality for observables

The efficiency of an \({\ell }_{1}\)-regularized regression depends greatly on the \({\ell }_{1}\)-norm of the vector \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\). Moreover, the \({\ell }_{1}\)-norm of \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) is closely related to the observable O = ∑jOj given as a sum of geometrically local observables with \(\parallel O{\parallel }_{\infty }\le 1\). In particular, again writing O in the Pauli basis as \(O={\sum }_{Q\in {\{I,X,Y,Z\}}^{\otimes n}}{\alpha }_{Q}Q\), the \({\ell }_{1}\)-norm \(\parallel {{{{{{{{\bf{w}}}}}}}}}^{{\prime} }{\parallel }_{1}\) is closely related to \({\sum }_{Q}\left\vert {\alpha }_{Q}\right\vert,\) which we refer to as the Pauli 1-norm of the observable O. While it is well known that

$$\mathop{\sum}\limits_{Q}{\left\vert {\alpha }_{Q}\right\vert }^{2}={{{{{{{\rm{Tr}}}}}}}}({O}^{2})/{2}^{n}\le \parallel O{\parallel }_{\infty }^{2},$$
(21)

there do not seem to be many known results characterizing \({\sum }_{Q}\left\vert {\alpha }_{Q}\right\vert\). To understand the Pauli 1-norm, we prove the following theorem.

Theorem 2

(Pauli 1-norm bound). Let \(O={\sum }_{Q\in {\{I,X,Y,Z\}}^{\otimes n}}{\alpha }_{Q}Q\) be an observable that can be written as a sum of geometrically local observables. We have,

$$\mathop{\sum}\limits_{Q}| {\alpha }_{Q}| \le C\parallel O{\parallel }_{\infty },$$
(22)

for some constant C.

A series of related norm inequalities are also established in81. However, the techniques used in this work differ significantly from those in81.

Prediction error bound for the ML algorithm

Using the construction of the local function \({f}_{P}({x}_{c},c\in {I}_{P})\) given in Eq. (16) and the vector \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) defined in Eq. (19), we can show that

$$\parallel {{{{{{{{\bf{w}}}}}}}}}^{{\prime} }{\parallel }_{1}\le \mathop{\max }\limits_{P\in {S}^{{{{{{{{\rm{(geo)}}}}}}}}}}\left\vert {X}_{P}\right\vert \left(\mathop{\sum}\limits_{Q}\left\vert {\alpha }_{Q}\right\vert \right)\le {\left(1+\frac{2}{{\delta }_{2}}\right)}^{{{{{{{{\rm{poly}}}}}}}}({\delta }_{1})}\left(\mathop{\sum}\limits_{Q}\left\vert {\alpha }_{Q}\right\vert \right).$$
(23)

The second inequality follows by bounding the size of our discrete subset XP and noticing that \(| {I}_{P}|={{{{{{{\rm{poly}}}}}}}}({\delta }_{1})\). The norm inequality in Theorem 2 then implies

$$\parallel {{{{{{{{\bf{w}}}}}}}}}^{{\prime} }{\parallel }_{1}\le C\parallel O{\parallel }_{\infty }{\left(1+\frac{2}{{\delta }_{2}}\right)}^{{{{{{{{\rm{poly}}}}}}}}({\delta }_{1})}\le {2}^{{{{{{{{\rm{poly}}}}}}}}\log (1/\epsilon )},$$
(24)

because \(\parallel O{\parallel }_{\infty }\le 1\) and \({\delta }_{1}=\Theta ({\log }^{2}(1/\epsilon )),{\delta }_{2}=\Theta (1/\epsilon )\). This shows that there exists a vector \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) that has a bounded \({\ell }_{1}\)-norm and achieves a small training error. The existence of \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) guarantees that the vector w* found by the optimization problem with the hyperparameter \(B\ge \parallel {{{{{{{{\bf{w}}}}}}}}}^{{\prime} }{\parallel }_{1}\) will yield an even smaller training error. Using the norm bound on \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\), we can choose the hyperparameter B to be \(B={2}^{{{{{{{{\rm{poly}}}}}}}}\log (1/\epsilon )}\). Using standard learning theory46,47, we can thus obtain

$$\mathop{{\mathbb{E}}}\limits_{x \sim {{{{{{{\mathcal{D}}}}}}}}}{\left\vert {h}^{*}(x)-{{{{{{{\rm{Tr}}}}}}}}(O\rho (x))\right\vert }^{2}\le \frac{1}{N}\mathop{\sum }\limits_{\ell=1}^{N}{\left\vert {{{{{{{{\bf{w}}}}}}}}}^{*}\cdot \phi ({x}_{\ell })-{y}_{\ell }\right\vert }^{2}+{{{{{{{\mathcal{O}}}}}}}}\left(B\sqrt{\frac{\log ({m}_{\phi }/\delta )}{N}}\right)$$
(25)

with probability at least 1 − δ. The first term is the training error for w*, which is smaller than the training error of 0.53ϵ for \({{{{{{{{\bf{w}}}}}}}}}^{{\prime} }\) from Lemma 1. Thus, the first term is bounded by 0.53ϵ. The second term is determined by B and mϕ, where we know that \({m}_{\phi }\le | {S}^{{{{{{{{\rm{(geo)}}}}}}}}}| {(1+\frac{2}{{\delta }_{2}})}^{{{{{{{{\rm{poly}}}}}}}}({\delta }_{1})}\) and \(| {S}^{{{{{{{{\rm{(geo)}}}}}}}}}|={{{{{{{\mathcal{O}}}}}}}}(n)\). Hence, with a training data size of

$$N={{{{{{{\mathcal{O}}}}}}}}\left(\log (n/\delta ){2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )}\right),$$
(26)

we can achieve a prediction error of ϵ with probability at least 1 − δ for any distribution \({{{{{{{\mathcal{D}}}}}}}}\) over [−1, 1]m.
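To make this last step explicit (suppressing the constant hidden in the \({{{{{{{\mathcal{O}}}}}}}}(\cdot )\) term of Eq. (25)), requiring the second term to be at most 0.47ϵ, so that the two terms sum to at most ϵ, amounts to

$$B\sqrt{\frac{\log ({m}_{\phi }/\delta )}{N}}\le 0.47\epsilon \iff N\ge \frac{{B}^{2}\log ({m}_{\phi }/\delta )}{{(0.47\epsilon )}^{2}},$$

and substituting \(B={2}^{{{{{{{{\rm{poly}}}}}}}}\log (1/\epsilon )}\) together with \(\log {m}_{\phi }={{{{{{{\mathcal{O}}}}}}}}(\log n+{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon ))\) yields the training data size stated in Eq. (26); the 1/ϵ2 factor is absorbed into \({2}^{{{{{{{{\rm{polylog}}}}}}}}(1/\epsilon )}\).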