Improved machine learning algorithm for predicting ground state properties

Finding the ground state of a quantum many-body system is a fundamental problem in quantum physics. In this work, we give a classical machine learning (ML) algorithm for predicting ground state properties with an inductive bias encoding geometric locality. The proposed ML model can efficiently predict ground state properties of an n-qubit gapped local Hamiltonian after learning from only O(log(n)) data about other Hamiltonians in the same quantum phase of matter. This improves substantially upon previous results that require O(n^c) data for a large constant c. Furthermore, the training and prediction time of the proposed ML model scale as O(n log n) in the number of qubits n. Numerical experiments on physical systems with up to 45 qubits confirm the favorable scaling in predicting ground state properties using a small training dataset.


I. INTRODUCTION
Finding the ground state of a quantum many-body system is a fundamental problem with far-reaching consequences for physics, materials science, and chemistry. Many powerful methods [1][2][3][4][5][6][7] have been proposed, but classical computers still struggle to solve many general classes of the ground state problem. To extend the reach of classical computers, classical machine learning (ML) methods have recently been adapted to study this problem. A recent work [29] proposes a polynomial-time classical ML algorithm that can efficiently predict ground state properties of gapped geometrically local Hamiltonians, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. Furthermore, [29] shows that under a widely accepted conjecture, no polynomial-time classical algorithm can achieve the same performance guarantee. However, although the ML algorithm given in [29] uses a polynomial amount of training data and computational time, the polynomial scaling O(n^c) has a very large degree c. Moreover, when the prediction error ε is small, the amount of training data grows exponentially in 1/ε, indicating that a very small prediction error cannot be achieved efficiently.
In this work, we present an improved ML algorithm for predicting ground state properties. We consider an m-dimensional vector x ∈ [−1, 1]^m that parameterizes an n-qubit gapped geometrically local Hamiltonian given as H(x) = ∑_{j=1}^{L} h_j(x⃗_j), where x is the concatenation of constant-dimensional vectors x⃗_1, . . ., x⃗_L parameterizing the few-body interactions h_j(x⃗_j). Let ρ(x) be the ground state of H(x) and O be a sum of geometrically local observables with ‖O‖∞ ≤ 1. We assume that the geometry of the n-qubit system is known, but we do not know how h_j(x⃗_j) is parameterized or what the observable O is. The goal is to learn a function h*(x) that approximates the ground state property Tr(Oρ(x)) from a classical dataset {(x_ℓ, y_ℓ)}_{ℓ=1}^{N}, where y_ℓ ≈ Tr(Oρ(x_ℓ)) records the ground state property for x_ℓ ∈ [−1, 1]^m sampled from an arbitrary unknown distribution D.
The setting considered in this work is very similar to that in [29], but we assume the geometry of the n-qubit system to be known, which is necessary to overcome the sample complexity lower bound of N = n^{Ω(1/ε)} given in [29]. One may compare the setting to that of finding ground states using adiabatic quantum computation [30][31][32][33][34][35][36][37]. To find the ground state property Tr(Oρ(x)) of H(x), this class of quantum algorithms requires the ground state ρ_0 of another Hamiltonian H_0 stored in quantum memory, explicit knowledge of a gapped path connecting H_0 and H(x), and an explicit description of O. In contrast, here we focus on ML algorithms that are entirely classical, have no access to quantum state data, and have no knowledge about the Hamiltonian H(x), the observable O, or the gapped paths between H(x) and other Hamiltonians.
The proposed ML algorithm uses a nonlinear feature map x ↦ φ(x) with a geometric inductive bias built into the mapping. At a high level, given an m-dimensional vector x that parameterizes a quantum many-body Hamiltonian H(x), the algorithm uses the geometric structure to create a high-dimensional vector φ(x) ∈ R^{m_φ}. The ML algorithm then predicts properties or a representation of the ground state ρ(x) of the Hamiltonian H(x) using the m_φ-dimensional vector φ(x). The feature vector φ(x) is built from indicator functions defined for each geometrically local subset of coordinates in the m-dimensional vector x. Here, the geometry over coordinates of the vector x is defined using the geometry of the n-qubit system. The ML algorithm learns a function h*(x) = w* • φ(x) by training an ℓ1-regularized regression (LASSO) [38][39][40] in the feature space. We prove that, for ε = Θ(1), the improved ML algorithm can achieve a small constant prediction error using a dataset of size N = O(log(n)) with high success probability. The sample complexity N = O(log(n)) of the proposed ML algorithm improves substantially over the sample complexity N = O(n^c) of the previously best-known classical ML algorithm [29], where c is a very large constant. The computational time of both the improved ML algorithm and the ML algorithm in [29] is O(nN). Hence, the logarithmic sample complexity N immediately implies a nearly linear computational time. In addition to the reduced sample complexity and computational time, the proposed ML algorithm works for any distribution over x, while the best previously known algorithm [29] works only for the uniform distribution over [−1, 1]^m. Furthermore, when we consider the scaling with the prediction error ε, the best known classical ML algorithm in [29] has a sample complexity of N = n^{O(1/ε)}, which is exponential in 1/ε. In contrast, the improved ML algorithm has a sample complexity of N = O(log(n)) 2^{polylog(1/ε)}, which is quasi-polynomial in 1/ε. In combination with the classical shadow formalism [41][42][43][44][45], the proposed ML algorithm also yields the same reduction in sample and time complexity compared to [29] for predicting ground state representations.

II. ML ALGORITHM AND RIGOROUS GUARANTEE
The central component of the improved ML algorithm is the geometric inductive bias built into our feature mapping x ∈ [−1, 1]^m ↦ φ(x) ∈ R^{m_φ}. To describe the ML algorithm, we first need to present some definitions relating to this geometric structure.

A. Definitions
We consider n qubits arranged at locations, or sites, in a d-dimensional space, e.g., a spin chain (d = 1), a square lattice (d = 2), or a cubic lattice (d = 3). This geometry is characterized by the distance d_qubit(q, q′) between any two qubits q and q′. Using the distance d_qubit between qubits, we can define the geometry of local observables. Given any two observables O_A, O_B on the n-qubit system, we define the distance d_obs(O_A, O_B) between the two observables as the minimum distance between the qubits that O_A and O_B act on. We also say an observable is geometrically local if it acts nontrivially only on nearby qubits under the distance metric d_qubit. We then define S(geo) as the set of all geometrically local Pauli observables, i.e., geometrically local observables that belong to the set {I, X, Y, Z}^{⊗n}. The size of S(geo) is O(n), linear in the total number of qubits.
With these basic definitions in place, we now define a few more geometric objects. The first object is the set of coordinates in the m-dimensional vector x that are close to a geometrically local Pauli observable P. This is formally given by

I_P ≜ { c ∈ {1, . . ., m} : d_obs(h_c(x), P) ≤ δ_1 },   (II.1)

where h_c(x) is the few-body interaction term in the n-qubit Hamiltonian H(x) that is parameterized by the variable x_c ∈ [−1, 1], and δ_1 is an efficiently computable hyperparameter that is determined later. Note that, by definition, each variable x_c parameterizes one of the interaction terms h_c(x). Intuitively, I_P is the set of coordinates that have the strongest influence on the function Tr(Pρ(x)).
The second geometric object is a discrete lattice over the space [−1, 1]^m associated to each subset I_P of coordinates. For any geometrically local Pauli observable P ∈ S(geo), we define X_P to contain all vectors x that take on value 0 for coordinates outside I_P and take on a set of discrete values, spaced δ_2 apart, for coordinates inside I_P. Formally, this is given by

X_P ≜ { x ∈ [−1, 1]^m : x_c = 0 if c ∉ I_P, and x_c ∈ {0, ±δ_2, ±2δ_2, . . .} if c ∈ I_P },   (II.2)

where δ_2 is an efficiently computable hyperparameter to be determined later. The definition of X_P is meant to enumerate all sufficiently different vectors for coordinates in the subset I_P ⊆ {1, . . ., m}. Now given a geometrically local Pauli observable P and a vector x in the discrete lattice X_P ⊆ [−1, 1]^m, the third object is a set T_{x,P} of vectors in [−1, 1]^m that are close to x for coordinates in I_P. This is formally defined as

T_{x,P} ≜ { x′ ∈ [−1, 1]^m : |x′_c − x_c| ≤ δ_2 for all c ∈ I_P }.   (II.3)

The set T_{x,P} is defined as a thickened affine subspace close to the vector x for coordinates in I_P. If a vector x′ is in T_{x,P}, then x′ is close to x for all coordinates in I_P, but x′ may be far away from x for coordinates outside of I_P.
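To make these three objects concrete, the following minimal Python sketch (our own, not the authors' code) instantiates them under an assumed 1D-chain geometry in which the interaction term h_c acts on qubits (c, c+1); the names I_P, X_P, and in_T, and the placement of h_c, are illustrative choices.

```python
import itertools
import numpy as np

def I_P(P_qubits, m, delta1):
    """Coordinates c whose interaction term h_c lies within distance
    delta1 of the Pauli observable P (cf. Eq. (II.1))."""
    return [c for c in range(m)
            if min(abs((c + 0.5) - q) for q in P_qubits) <= delta1]

def X_P(coords, m, delta2):
    """Discrete lattice (cf. Eq. (II.2)): zero outside I_P, multiples
    of delta2 in [-1, 1] on the coordinates in I_P."""
    grid = np.arange(-1.0, 1.0 + 1e-9, delta2)
    for values in itertools.product(grid, repeat=len(coords)):
        x = np.zeros(m)
        x[coords] = values
        yield x

def in_T(x, x_center, coords, delta2):
    """Membership in the thickened affine subspace T_{x_center,P}
    (cf. Eq. (II.3)): x must be delta2-close to x_center on I_P."""
    return bool(np.all(np.abs(x[coords] - x_center[coords]) <= delta2))
```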

B. Feature mapping and ML model
We can now define the feature map φ taking an m-dimensional vector x to an m_φ-dimensional vector φ(x) using the thickened affine subspaces T_{x′,P}, for every geometrically local Pauli observable P ∈ S(geo) and every vector x′ in the discrete lattice X_P. The dimension of the vector φ(x) is given by m_φ = ∑_{P∈S(geo)} |X_P|. Each coordinate of the vector φ(x) is indexed by x′ ∈ X_P and P ∈ S(geo), with

φ(x)_{x′,P} = 1[x ∈ T_{x′,P}],

which is the indicator function checking if x belongs to the thickened affine subspace. Recall that this means each coordinate of the m_φ-dimensional vector φ(x) checks if x is close to a point x′ on the discrete lattice X_P for the subset I_P of coordinates close to a geometrically local Pauli observable P. The classical ML model we consider is an ℓ1-regularized regression (LASSO) [38][39][40] over the φ(x) space. More precisely, given an efficiently computable hyperparameter B > 0, the classical ML model finds an m_φ-dimensional vector w* from the following optimization problem,

min_{w ∈ R^{m_φ}, ‖w‖_1 ≤ B}  (1/N) ∑_{ℓ=1}^{N} |w • φ(x_ℓ) − y_ℓ|²,

where {(x_ℓ, y_ℓ)}_{ℓ=1}^{N} is the training data. Here, x_ℓ ∈ [−1, 1]^m is an m-dimensional vector that parameterizes a Hamiltonian H(x_ℓ) and y_ℓ approximates Tr(Oρ(x_ℓ)). The learned function is given by h*(x) = w* • φ(x). The optimization does not have to be solved exactly. We only need to find a w* whose function value is at most O(ε) larger than the minimum function value. There is an extensive literature [46][47][48][49][50][51][52] improving the computational time for the above optimization problem. The best known classical algorithm [51] has a computational time scaling linearly in m_φ/ε² up to a log factor, while the best known quantum algorithm [52] has a computational time scaling linearly in √(m_φ)/ε² up to a log factor.
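As an illustration, the following sketch (our own) assembles the indicator features and trains the model with scikit-learn's penalized Lasso as a stand-in for the norm-constrained program above; the enumeration of X_P is exponential in |I_P| and is meant only to mirror the definitions, not to be efficient. It reuses I_P, X_P, and in_T from the previous sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def feature_map(x, paulis, m, delta1, delta2):
    """phi(x): one 0/1 indicator per pair (lattice point x', Pauli P)."""
    feats = []
    for P_qubits in paulis:          # each P given by the qubits it acts on
        coords = I_P(P_qubits, m, delta1)
        for x_center in X_P(coords, m, delta2):
            feats.append(float(in_T(x, x_center, coords, delta2)))
    return np.array(feats)

def train(xs, ys, paulis, m, delta1, delta2, alpha=0.01):
    """l1-regularized regression over the feature space; the penalty
    strength alpha plays the role of the constraint level B."""
    Phi = np.stack([feature_map(x, paulis, m, delta1, delta2) for x in xs])
    return Lasso(alpha=alpha, fit_intercept=False).fit(Phi, ys)
    # learned function: h*(x) = model.predict([feature_map(x, ...)])
```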

C. Rigorous guarantee
The classical ML algorithm given above yields the following sample and computational complexity. This theorem improves substantially upon the result in [29], which requires N = n^{O(1/ε)}. The proof idea is given in Section III, and the detailed proof is given in Appendices A, B, C. Using the proof techniques presented in this work, one can show that the sample complexity N = O(log(n/δ)) 2^{polylog(1/ε)} also applies to any sum of few-body observables O = ∑_j O_j with ∑_j ‖O_j‖∞ ≤ 1, even if the operators {O_j} are not geometrically local.
Theorem 1 (Sample and computational complexity). Given ε, δ > 0, a training dataset {(x_ℓ, y_ℓ)}_{ℓ=1}^{N} of size N = O(log(n/δ)) 2^{polylog(1/ε)}, where each x_ℓ is sampled from the (arbitrary, unknown) distribution D and each y_ℓ approximates Tr(Oρ(x_ℓ)) to a small additive error, suffices to learn a function h*(x) such that E_{x∼D} |h*(x) − Tr(Oρ(x))|² ≤ ε with probability at least 1 − δ. The training and prediction time of the ML model scale as O(nN).

The output y_ℓ in the training data can be obtained by measuring Tr(Oρ(x_ℓ)) for the same observable O multiple times and averaging the outcomes. Alternatively, we can use the classical shadow formalism [41][42][43][44][45][53], which performs randomized Pauli measurements on ρ(x_ℓ) to predict Tr(Oρ(x_ℓ)) for a wide range of observables O. Theorem 1 and the classical shadow formalism together yield the following corollary for predicting ground state representations. We present the proof of Corollary 1 in Appendix C 2.
Corollary 1. Given ε, δ > 0 and a training dataset {(x_ℓ, σ_T(ρ(x_ℓ)))}_{ℓ=1}^{N} of classical shadow representations of size N = O(log(n/δ)) 2^{polylog(1/ε)}, the ML algorithm can learn a representation of the ground state such that, for any observable O with eigenvalues between −1 and 1 that can be written as a sum of geometrically local observables, the predicted property approximates Tr(Oρ(x)) with average squared error at most ε, with probability at least 1 − δ.
We can also show that the problem of estimating ground state properties for the class of parameterized Hamiltonians H(x) = ∑_j h_j(x⃗_j) considered in this work is hard for non-ML algorithms that cannot learn from data. This is a manifestation of the computational power of data studied in [54]. The proof of Proposition 1 in [29] constructs a parameterized Hamiltonian H(x) that belongs to the family of parameterized Hamiltonians considered in this work and hence establishes the following.
Proposition 1 (A variant of Proposition 1 in [29]). Consider a randomized polynomial-time classical algorithm A that does not learn from data. Suppose for any smooth family of gapped 2D Hamiltonians H(x) = ∑_j h_j(x⃗_j) and any single-qubit observable O, A can compute ground state properties Tr(Oρ(x)) up to a constant error averaged uniformly over x ∈ [−1, 1]^m. Then, NP-complete problems can be solved in randomized polynomial time.

III. PROOF IDEAS
We describe the key ideas behind the proof of Theorem 1. The proof is separated into three parts. The first part, in Appendix A, establishes the existence of a simple functional form that approximates the ground state property Tr(Oρ(x)). The second part, in Appendix B, gives a new bound on the ℓ1-norm of the Pauli coefficients of the observable O when written in the Pauli basis. The third part, in Appendix C, combines the first two parts, using standard tools from learning theory to establish the sample complexity corresponding to the prediction error bound given in Theorem 1. In the following, we discuss these three parts in detail.

A. Simple form for ground state property
Using the spectral flow formalism [55][56][57], we first show that the ground state property can be approximated by a sum of local functions. First, we write O in the Pauli basis as O = ∑_{P∈{I,X,Y,Z}^{⊗n}} α_P P. Then, we show that for every geometrically local Pauli observable P, we can construct a function f_P(x) that depends only on the coordinates in the subset I_P, i.e., the coordinates that parameterize interaction terms h_c near the Pauli observable P. The function f_P(x) is given by f_P(x) = α_P Tr(Pρ(χ_P(x))), where χ_P(x) sets all coordinates of x outside I_P to zero, and the resulting sum ∑_{P∈S(geo)} f_P(x) can be written as w′ • φ(x), where w′ is an m_φ-dimensional vector indexed by x′ ∈ X_P and P ∈ S(geo), given by w′_{x′,P} = f_P(x′). The approximation is accurate if we consider δ_1 = Θ(log²(1/ε)) and a lattice spacing δ_2 scaling inverse-polynomially in 1/ε. Thus, we can see that the ML algorithm with the proposed feature mapping indeed has the capacity to approximately represent the target function Tr(Oρ(x)). As a result, we have the following lemma.

B. Norm inequality for observables
The efficiency of an ℓ1-regularized regression depends greatly on the ℓ1-norm of the vector w′. Moreover, the ℓ1-norm of w′ is closely related to the observable O = ∑_j O_j given as a sum of geometrically local observables with ‖O‖∞ ≤ 1. In particular, again writing O in the Pauli basis as O = ∑_{P∈{I,X,Y,Z}^{⊗n}} α_P P, the ℓ1-norm ‖w′‖_1 is closely related to ∑_P |α_P|, which we refer to as the Pauli 1-norm of the observable O. While norm inequalities for observables are well studied, there do not seem to be many known results characterizing ∑_P |α_P|. To understand the Pauli 1-norm, we prove the following theorem.
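For intuition, the following small numpy example (ours, not from the paper) computes the Pauli 1-norm of a two-qubit observable by explicit Pauli decomposition and compares it to the spectral norm; for O = (X⊗X + Z⊗I)/2 one finds ∑_P |α_P| = 1 while ‖O‖∞ = 1/√2.

```python
import itertools
import numpy as np

PAULIS = {
    "I": np.eye(2), "X": np.array([[0, 1], [1, 0]]),
    "Y": np.array([[0, -1j], [1j, 0]]), "Z": np.diag([1.0, -1.0]),
}

def pauli_1_norm(O, n=2):
    """Sum of |alpha_P| over the Pauli decomposition O = sum_P alpha_P P."""
    total = 0.0
    for labels in itertools.product("IXYZ", repeat=n):
        P = PAULIS[labels[0]]
        for l in labels[1:]:
            P = np.kron(P, PAULIS[l])
        alpha = np.trace(P.conj().T @ O).real / 2**n  # Pauli coefficient
        total += abs(alpha)
    return total

# Example: a geometrically local observable O = (X1 X2 + Z1)/2.
O = 0.5 * (np.kron(PAULIS["X"], PAULIS["X"]) + np.kron(PAULIS["Z"], PAULIS["I"]))
print(pauli_1_norm(O), np.linalg.norm(O, ord=2))  # Pauli 1-norm vs spectral norm
```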
A series of related norm inequalities are also established in [58]. However, the techniques used in this work differ significantly from those in [58].

C. Prediction error bound for the ML algorithm
Using the construction of the local function f_P(x), which depends only on the coordinates x_c for c ∈ I_P, given in Eq. (III.1) and the vector w′ defined in Eq. (III.4), we can show that w′ achieves a small training error. The second inequality follows by bounding the size of our discrete subset X_P and noticing that |I_P| = poly(δ_1). The norm inequality in Theorem 2 then implies

‖w′‖_1 ≤ (1/δ_2)^{poly(δ_1)} ≤ 2^{polylog(1/ε)},   (III.9)

because ‖O‖∞ ≤ 1, δ_1 = Θ(log²(1/ε)), and 1/δ_2 = poly(1/ε). This shows that there exists a vector w′ that has a bounded ℓ1-norm and achieves a small training error. The existence of w′ guarantees that the vector w* found by the optimization problem with the hyperparameter B ≥ ‖w′‖_1 will yield an even smaller training error. Using the norm bound on w′, we can choose the hyperparameter B to be B = 2^{polylog(1/ε)}. Using standard learning theory [39, 40], we can thus obtain the prediction error bound and the sample complexity stated in Theorem 1.
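For reference, the learning-theory step can be summarized as follows; this is our paraphrase of the standard generalization bound for ℓ1-constrained linear regression (the role played by Theorem 6 in Appendix C), not the paper's exact statement:

```latex
\mathbb{E}_{x\sim\mathcal{D}} \bigl|h^*(x) - \operatorname{Tr}(O\rho(x))\bigr|^2
\;\le\;
\underbrace{\frac{1}{N}\sum_{\ell=1}^{N} \bigl|w^*\cdot\phi(x_\ell) - y_\ell\bigr|^2}_{\text{training error}}
\;+\;
\mathcal{O}\!\left(B\sqrt{\frac{\log(m_\phi/\delta)}{N}}\right)
\quad \text{with probability at least } 1-\delta.
```

Setting the second term to O(ε) with B = 2^{polylog(1/ε)} and m_φ = O(n) · 2^{polylog(1/ε)} yields the advertised N = O(log(n/δ)) · 2^{polylog(1/ε)}.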

IV. NUMERICAL EXPERIMENTS
In this section, we present numerical experiments to assess the performance of the classical ML algorithm in practice. The results illustrate the improvement of the algorithm presented in this work over those considered in [29], the mild dependence of the sample complexity on the system size n, and the inherent geometry exploited by the ML models. We consider the classical ML models described in Section II B, utilizing a random Fourier feature map [59]. While the indicator function feature map was a useful tool to obtain our rigorous guarantees, random Fourier features are more robust and commonly used in practice. Furthermore, we determine the optimal hyperparameters using cross-validation to minimize the root-mean-square error (RMSE) and then evaluate the performance of the chosen ML model using a test set. The models and hyperparameters are further detailed in Appendix D.
For these experiments, we consider the two-dimensional antiferromagnetic random Heisenberg model consisting of 4 × 5 = 20 to 9 × 5 = 45 spins. In this setting, the spins are placed on sites in a 2D lattice. The Hamiltonian is

H = ∑_{⟨ij⟩} J_{ij} (X_i X_j + Y_i Y_j + Z_i Z_j),

where the summation ranges over all pairs ⟨ij⟩ of neighboring sites on the lattice and the couplings {J_{ij}} are sampled uniformly from the interval [0, 2]. Here, the vector x is a list of all couplings J_{ij}, so that the dimension of the parameter space is m = O(n), where n is the system size.
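A minimal sketch (our own) of how an instance of this model, and the corresponding parameter vector x, can be generated; the site indexing and edge ordering are illustrative assumptions:

```python
import numpy as np

def heisenberg_instance(nx, ny, seed=0):
    """One random instance: the lattice edges and the coupling vector x,
    with J_ij sampled uniformly from [0, 2]. The edge list doubles as the
    geometry consumed by the ML model's feature map."""
    site = lambda i, j: i * ny + j
    edges = []
    for i in range(nx):
        for j in range(ny):
            if i + 1 < nx:
                edges.append((site(i, j), site(i + 1, j)))
            if j + 1 < ny:
                edges.append((site(i, j), site(i, j + 1)))
    x = np.random.default_rng(seed).uniform(0.0, 2.0, size=len(edges))
    return edges, x  # m = len(edges) = O(n) couplings
```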
We trained a classical ML model using randomly chosen values of the parameter vector x = {J_{ij}}. For each parameter vector of random couplings sampled uniformly from [0, 2], we approximated the ground state using the same method as in [29], namely the density-matrix renormalization group (DMRG) [60] based on matrix product states (MPS) [61]. The classical ML model was trained on a dataset {(x_ℓ, σ_T(ρ(x_ℓ)))}_{ℓ=1}^{N} with N randomly chosen vectors x_ℓ, where each σ_T(ρ(x_ℓ)) is a classical representation of the ground state created from T randomized Pauli measurements [41]. The ML algorithm predicted the classical representation of the ground state for a new vector x. These predicted classical representations were used to estimate two-body correlation functions, i.e., the expectation value of C_{ij} = (X_i X_j + Y_i Y_j + Z_i Z_j)/3, for each pair of qubits ⟨ij⟩ on the lattice.
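The correlation functions can be read off from the randomized Pauli measurement data with the standard classical-shadow estimator [41]; the snapshot format below (a per-qubit measurement basis in {X, Y, Z} and a ±1 outcome) is our assumption for illustration.

```python
import numpy as np

def estimate_pauli(snapshots, support, labels):
    """Classical-shadow estimate of Tr(P rho) for a Pauli string P acting
    as `labels` on the qubits in `support`. Each snapshot is a pair
    (bases, outcomes) with bases[q] in {"X","Y","Z"}, outcomes[q] in {+1,-1}."""
    estimates = []
    for bases, outcomes in snapshots:
        if all(bases[q] == l for q, l in zip(support, labels)):
            # unbiased estimator: 3^k times the product of matching outcomes
            estimates.append(3 ** len(support) *
                             np.prod([outcomes[q] for q in support]))
        else:
            estimates.append(0.0)
    return np.mean(estimates)

def correlation(snapshots, i, j):
    """C_ij = Tr((X_i X_j + Y_i Y_j + Z_i Z_j) rho) / 3."""
    return np.mean([estimate_pauli(snapshots, (i, j), (w, w)) for w in "XYZ"])
```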
In Figure 2A, we can clearly see that the ML algorithm proposed in this work consistently outperforms the ML models implemented in [29], which include the rigorous polynomial-time learning algorithm based on the Dirichlet kernel proposed in [29], Gaussian kernel regression [62, 63], and infinite-width neural networks [64, 65]. Figure 2A (Left) and 2A (Center) show that as the number T of measurements per data point or the training set size N increases, the prediction performance of the proposed ML algorithm improves faster than that of the other ML algorithms. This observation reflects the improvement in the sample complexity dependence on the prediction error ε. The sample complexity in [29] depends exponentially on 1/ε, but Theorem 1 establishes a quasi-polynomial dependence on 1/ε. From Figure 2A (Right), we can see that the ML algorithms do not yield a substantially worse prediction error as the system size n increases. This observation matches the O(log(n)) sample complexity in Theorem 1, but not the poly(n) sample complexity proven in [29].
An important step in establishing the improved sample complexity in Theorem 1 is that a property on a local region R of the quantum system depends only on parameters in the neighborhood of the region R.
In Figure 2B, we visualize where the trained ML model focuses when predicting the correlation function over a pair of qubits. A thicker and darker edge is considered more important by the trained ML model. Each edge of the 2D lattice corresponds to a coupling J_{ij}. For each edge, we sum the absolute values of the coefficients in the ML model that correspond to a feature that depends on the coupling J_{ij}. We can see that the ML model learns to focus only on the neighborhood of a local region R when predicting the ground state property.
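The edge-importance scores in Figure 2B can be computed in a few lines; here `feature_coords[k]`, listing the coordinates of x that feature k depends on, is an assumed bookkeeping structure carried over from the feature construction.

```python
import numpy as np

def edge_importance(w_star, feature_coords, m):
    """For each coupling J_ij (one lattice edge, one coordinate of x), sum
    |w*_k| over all learned coefficients k whose feature depends on it."""
    importance = np.zeros(m)
    for k, coords in enumerate(feature_coords):
        for c in coords:
            importance[c] += abs(w_star[k])
    return importance  # one weight per edge; plot as edge thickness/darkness
```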

V. OUTLOOK
The classical ML algorithm and the advantage over non-ML algorithms proven in [29] illustrate the potential of using ML algorithms to solve challenging quantum many-body problems. However, the classical ML model given in [29] requires a large amount of training data. Although the need for a large dataset is a common trait of contemporary ML algorithms [66][67][68], one would have to perform an equally large number of physical experiments to obtain such data. This makes the advantage of ML over non-ML algorithms challenging to realize in practice. The sample complexity N = O(log n) of the ML algorithm proposed here shows that this advantage could potentially be realized after training with data from a small number of physical experiments. The existence of a theoretically backed ML algorithm with O(log n) sample complexity raises the hope of designing good ML algorithms that address practical problems in quantum physics, chemistry, and materials science by learning from the relatively small amount of data that we can gather from real-world experiments.
Despite the progress in this work, many questions remain to be answered. Recently, powerful machine learning models such as graph neural networks have been shown empirically to achieve a favorable sample complexity by leveraging the local structure of Hamiltonians in the 2D random Heisenberg model [69, 70]. Is it possible to obtain rigorous theoretical guarantees for the sample complexity of neural-network-based ML algorithms for predicting ground state properties? An alternative direction is to notice that the current results scale exponentially in the inverse of the spectral gap. Is the exponential scaling fundamental to this problem? Or do there exist more efficient ML models that can efficiently predict ground state properties for gapless Hamiltonians?
We have focused on the task of predicting local observables in the ground state, but many other physical properties are also of high interest. Can ML models predict low-energy excited state properties? Could we achieve a sample complexity of N = O(log n) for predicting any observable O? Another important question is whether there is a provable quantum advantage in predicting ground state properties. Could we design quantum ML algorithms that predict ground state properties by learning from far fewer experiments than any classical ML algorithm? Perhaps this could be shown by combining ideas from adiabatic quantum computation [30][31][32][33][34][35][36][37] and recent techniques for proving quantum advantages in learning from experiments [71][72][73][74][75]. It remains to be seen whether quantum computers could provide an unconditional super-polynomial advantage over classical computers in predicting ground state properties.

APPENDICES
These appendices provide detailed proofs of the statements in the main text. We discuss our main contribution: that Tr(Oρ) can be approximated by a machine learning model given training data scaling logarithmically in the system size, where O is an unknown observable and ρ is the ground state of a Hamiltonian. The proof of this result has three main parts. The first two parts yield important results necessary for the design of the ML algorithm and its sample complexity.
We recommend that readers start with Appendix A, which derives a simpler form for the ground state property Tr(Oρ(x)) that we wish to predict. In Appendix B, we give a norm inequality characterizing the Pauli coefficients of any observable that can be written as a sum of geometrically local observables. The norm inequality reveals a structure of the ground state property Tr(Oρ(x)) that we can use to design an ML algorithm that uses very little training data. In Appendix C, we present our ML algorithm and prove its sample complexity using standard tools in ML theory, including known guarantees on the performance of the LASSO (least absolute shrinkage and selection operator) algorithm. Finally, in Appendix D, we describe numerical experiments performed to assess the performance of the algorithm in practice.

Appendix A: Simple form for ground state property
This section is dedicated to deriving a simpler form for the ground state property Tr(Oρ(x)) as a function of x. We consider the assumptions (a)-(d) from Appendix F.5 of [29], with (b) and (d) adjusted for our setting, which we reproduce here for convenience: (a) Physical system: We consider n finite-dimensional quantum systems that are arranged at locations, or sites, in a d-dimensional space, e.g., a spin chain (d = 1), a square lattice (d = 2), or a cubic lattice (d = 3). Unless specified otherwise, our big-O, Ω, Θ notation is with respect to the thermodynamic limit n → ∞.

(b) Hamiltonian: We consider a smooth family of Hamiltonians H(x) = ∑_{j=1}^{L} h_j(x⃗_j), where each few-body term h_j acts on O(1) sites and satisfies ‖∂h_j/∂û‖∞ ≤ 1 for any unit direction û in parameter space.
(c) Ground-state subspace: We consider the ground state ρ(x) of the Hamiltonian H(x) to be defined as ρ(x) = lim_{β→∞} e^{−βH(x)}/Tr(e^{−βH(x)}). This is equivalent to a uniform mixture over the eigenspace of H(x) with the minimum eigenvalue.
(d) Observable: O can be written as a sum of few-body observables O = ∑_j O_j, where each O_j acts only on an O(1) number of sites. Hence, we can also write O = ∑_{P∈S(geo)} α_P P, where P ∈ {I, X, Y, Z}^{⊗n} and S(geo) is the set of geometrically local Pauli observables (defined more precisely in Def. 6). The results in this section hold for any O of the above form. However, we focus on O given as a sum of geometrically local observables ∑_j O_j, where each O_j acts only on an O(1) number of sites in a ball of O(1) radius.
Under these assumptions, we can prove that Tr(Oρ(x)) can be approximated by a sum of weighted indicator functions, where the weights satisfy an ℓ1-norm bound. A precise statement of this result is found in Appendix A 3. We first show that Tr(Oρ(x)) can be approximated by a sum of smooth local functions in Appendix A 1. Then, we prove that this sum of smooth local functions can be approximated by simple functions in Appendix A 2. Finally, we put everything together in Appendix A 3. Several technical lemmas for bounding integrals are needed throughout these proofs, which are compiled in Appendix A 4.

Approximation by a sum of smooth functions
The key intermediate step is to approximate Tr(Oρ(x)) by a sum of smooth local functions. The proof relies on the spectral flow formalism [55] and Lieb-Robinson bounds [77].
First, we review the necessary tools from spectral flow [55][56][57]. Let the spectral gap of H(x) be lower bounded by a constant γ over [−1, 1]^m. Then, the directional derivative of the associated ground state in the direction defined by the parameter unit vector û is given by Eq. (A.1), where the spectral flow operator D_û(x) is given by Eq. (A.2). Here, the filter function W_γ(t) appearing in Eq. (A.2) is defined by Eq. (A.3), where α is chosen to be the largest real solution of Eq. (A.4). Next, we review the Lieb-Robinson bounds [77, 78]. Let the distance d_obs(O_1, O_2) between any two operators O_1, O_2 be defined as the minimum distance between all pairs of sites acted on by O_1 and O_2, respectively, in the d-dimensional space. Formally, this is defined as Eq. (A.5), where dom(O) contains the qubits that the observable O acts on and d_qubit(q, q′) is the distance between two qubits q and q′. Furthermore, notice that for any operator acting on a single site, a ball of radius r around it contains O(r^d) local terms in d-dimensional space, where the local terms are the interactions h_j of the Hamiltonian H = ∑_{j=1}^{L} h_j. This implies the existence of a Lieb-Robinson bound [78, 79] such that for any two operators O_1, O_2 and any t ∈ R, the commutator norm decays exponentially outside the light cone (Eq. (A.6)), where b_lr, c_lr, v_lr = Θ(1) are constants. Having reviewed these tools, before stating our result formally, we need to define the hyperparameter δ_1 that we use throughout the proof (Definition 1), in which we write v ≜ v_lr/2 for convenience, where v_lr is the constant from the Lieb-Robinson bound in Eq. (A.6). Here, c_max = max(c_1, c_2, c_3), where c_1, c_2, c_3 are constants defined in Lemmas 6, 7, 8. Also, we define c_4 as a constant such that the tail bound of Eq. (A.8) holds for all arguments above c_4, and similarly c_5 as a constant such that the analogous bound of Eq. (A.9) holds for all arguments above c_5. Moreover, a constant threshold is defined such that, for all sufficiently large arguments x, 35 log²(x) < x − d. Finally, α is chosen to be the largest real solution of Eq. (A.4). The existence of c_4, c_5 is guaranteed by noting that the corresponding left-hand sides become at most 2 as the argument goes to infinity. Similarly, the existence of α is guaranteed by considering the limit of large arguments. Using the quantity δ_1 defined above, we also define the parameters "close to" a given Pauli term P. Definition 2. Given δ_1 from Definition 1 and an observable O = ∑_{P∈S(geo)} α_P P, for each Pauli term P ∈ S(geo), we define I_P as in Eq. (II.1). Now, we are ready to present the precise statement that the ground state property Tr(Oρ(x)) can be approximated by a sum of smooth local functions. First, we consider the simpler case where our observable O = α_P P is a single Pauli term, which generalizes easily to the general case via the triangle inequality.
Lemma 2 (Approximation using smooth local functions; simple case). Consider a class of local Hamiltonians {H(x) : x ∈ [−1, 1]^m} satisfying assumptions (a)-(c), and an observable O = α_P P, where P acts on at most O(1) qubits. Then, there exists a constant C > 0 such that for any 1/e > ε > 0, |α_P Tr(Pρ(x)) − f_P(x)| ≤ C|α_P|ε, where f_P(x) ≜ α_P Tr(Pρ(χ_P(x))) is a smooth function that depends only on the parameters x_c with c in the set I_P of coordinates given in Definition 2. The function f_P(x) is smooth in the sense that its gradient is bounded in norm by C′|α_P| for some constant C′ > 0.

Corollary 2 (Approximation using smooth local functions; general case). By the triangle inequality, the analogous approximation holds for f(x) = ∑_{P∈S(geo)} f_P(x), with f_P(x) given in Lemma 2.

We illustrate the intuition for Lemma 2 in Figure 3. The proof of Lemma 2 requires several steps. The main idea is that the function f_P(x) is simply α_P Tr(Pρ(χ_P(x))), where χ_P(x)_c = x_c for c ∈ I_P and χ_P(x)_c = 0 for coordinates c ∉ I_P. Thus, we need to show that changing coordinates outside of I_P does not change α_P Tr(Pρ(x)) by much. First, we change one coordinate outside of I_P at a time and show that the directional derivative of α_P Tr(Pρ(x)) in the direction changing this coordinate is bounded. Next, we use this to prove that |α_P Tr(Pρ(x)) − α_P Tr(Pρ(x′))| is bounded, where x and x′ differ in this one coordinate. Finally, we show that the difference is bounded for the case where x and x′ differ in all coordinates outside of I_P, which concludes the proof of Lemma 2. We separate these results into lemmas. Throughout the proofs of these lemmas, we also need several technical lemmas for showing the existence of certain constants and for bounding integrals, whose proofs we relegate to Appendix A 4. In the rest of this section, and in Appendix A 4, we use the shorthands t* ≜ Δ(j, P)/(2v_lr) and Δ(j, P) ≜ d_obs(h_j(x), P) for convenience.
Lemma 3 (Change one coordinate; directional derivative). Consider a class of local Hamiltonians {H(x) : x ∈ [−1, 1]^m} satisfying assumptions (a)-(c), and an observable O = α_P P, where P acts on at most O(1) qubits. Suppose that x, x′ ∈ [−1, 1]^m differ in only one coordinate, say the coordinate c* such that c* ∉ I_P, and only one term h_j depends on x_{c*}. Let û be the unit vector in the direction that moves from x to x′ along the c*-th coordinate. Then, there exist constants C_1, C_2 such that the directional derivative of α_P Tr(Pρ(x)) along û is bounded accordingly.

Proof. For the direction û, we can write the directional derivative of ρ(x) in two ways. First, we have the standard definition; then, from spectral flow, we also have Eq. (A.1). When evaluated on the observable O = α_P P, this establishes a correspondence between the two expressions. Expanding D_û(x) according to Eq. (A.2) and applying the triangle inequality, we obtain a sum of integrals. Here, since x_{c*} only affects h_j for one j, and û is the direction in which only the coordinate c* changes, the derivative ∂h_{j′}/∂û vanishes for all j′ ≠ j. Thus, we are left with three integrals. Notice that the first integral corresponds to the case where we are outside of the light cone, i.e., Δ(j, P) > 2v_lr|t|, while the other two integrals correspond to the case where we are inside of the light cone. First, we bound the first integral using the Lieb-Robinson bound. Applying Eq. (A.6) to the commutator norm, and using assumption (b) that ‖∂h_j/∂û‖∞ ≤ 1 together with |dom(h_j)| ≤ c_h for a constant c_h, we can plug this into the integral. In the resulting estimate, we use the fact that sup_t |W_γ(t)| = 1/2 and substitute back t* = Δ(j, P)/(2v_lr).
We can also bound the other integrals using the commutator norm bound, where we again use assumption (b) that ‖∂h_j/∂û‖∞ ≤ 1. To bound the resulting integral, we use the definition of W_γ(t) in Eq. (A.3). Note that by our definition of t*, we have t* > α, so we only need to consider this case in the upper bound on W_γ(t). This is because of our choice of δ_1, and here we consider Δ(j, P) > δ_1. Hence, we can bound the integral: in the inequality, we use the definition of W_γ(t), and in the equality, we use the substitution s = γt. We can bound this integral using Lemma 9, setting the exponent parameters to 2/7 and 4. We have chosen t* and δ_1 such that all of the assumptions of Lemma 9 are satisfied. In particular, from Eq. (A.30), we see that the argument y = γt* satisfies y > max(5900, α, 7(d + 11)) ≥ 5900. Furthermore, we have y/log²(y) > 2d + 2, because if y ≥ 5900, then it is clear that y/log²(y) > 10. Now, applying Lemma 9, we bound the integral; the last integral can be bounded in exactly the same way. Plugging these bounds into Eq. (A.24) yields a bound decaying in Δ(j, P)/log²(Δ(j, P)), where in the second line we defined the constants C_1, C_2. Thus, we have proven that if we change only one coordinate outside of I_P, then the directional derivative changing this coordinate is small. This is exactly the claim of the lemma.
An immediate consequence of this is that we can integrate the directional derivative to obtain a bound on the distance between Tr(Pρ(x)) and Tr(Pρ(x′)).
where in the last line, we used the correspondence from Eq. (A.18). Now, the integrand is exactly the quantity we bounded in Lemma 3, which decays in Δ(j, P)/log²(Δ(j, P)), and in the last line we can bound this integral because x and x′ differ in only the one coordinate under consideration. To treat the general case, we write the difference as a telescoping sum over the coordinates outside I_P, where x_i and x′_i ∈ [−1, 1]^m denote parameter vectors that differ only in the i-th coordinate. Each of the terms in the summand can be bounded using Lemma 4. It remains to integrate this to obtain our desired bound. Distributing, we can split this integral into four terms, which we bound individually.
First, we bound the two elementary integrals, where in the last equality we use integration by parts. For the other two integrals, we use Lemma 11 to obtain a bound with some constant C, and similarly, for the last integral, by Lemma 12, we obtain a bound with some constant C′. Putting everything together, combining constants, and simplifying, we obtain the desired estimate. To obtain the final bound, we can use our choice of δ_1 to write this bound in terms of ε, where the last inequality follows from our choice of δ_1 in Definition 1 and c_1 in Lemma 6. To complete the proof, recall that f_P(x) = α_P Tr(Pρ(χ_P(x))), where χ_P is defined in Eq. (A.13). The function f_P depends only on the parameters in I_P by definition. By the previous analysis, since χ_P(x) and x differ only in the coordinates outside of the set I_P, the function f_P(x) must be close to α_P Tr(Pρ(x)) in absolute value, as required. Moreover, Tr(Pρ(x)) is smooth by Lemma 4 in [29], in the sense that its gradient is bounded by some constant C′ > 0. Then, because f_P is defined as α_P Tr(Pρ(χ_P(x))), f_P is smooth as claimed.

Simplification using discretization
Now, we want to show that the sum of smooth local functions f(x) = ∑_{P∈S(geo)} f_P(x) from Corollary 2 can be approximated by simple functions, i.e., linear combinations of indicator functions. In order to do so, we discretize our parameter space and map each x ∈ [−1, 1]^m to some x′ with discrete values. Our simple function is then f evaluated on this discretized x′. To state this more precisely, we first require some definitions. An illustrative example of how each set is defined is given in Figure 4. For each geometrically local Pauli observable P, where I_P is defined in Definition 2 and C′ is as in Lemma 2, define the discretized parameter space X_P with lattice spacing δ_2, as in Eq. (II.2). Moreover, for each x ∈ X_P, define the thickened affine subspace T_{x,P} of vectors close to x for coordinates in I_P, as in Eq. (II.3). With these definitions, the simple function g that approximates f is defined by

g(x) = ∑_{P∈S(geo)} ∑_{x′∈X_P} f_P(x′) 1[x ∈ T_{x′,P}].   (A.60)

In what follows, we prove that g indeed approximates f well. As in Appendix A 1, we first consider the simpler case where our observable O = α_P P is a single Pauli term, which generalizes easily to the general case via the triangle inequality.
Lemma 5 (Approximation using simple functions; simple case). Let ε > 0. Given this ε in Definition 3, for any x, |f_P(x) − g_P(x)| ≤ |α_P|ε, where f_P is as in Lemma 2 and g_P is the single-term version of the function defined in Eq. (A.60).
Corollary 3 (Approximation using simple functions; general case). Let ε > 0. Given this ε in Definition 3, |f(x) − g(x)| is correspondingly small.

Proof. We split the difference into two terms: in the first term, only the coordinates not in I_P change, while in the second term, only coordinates in I_P change. To bound the first term, we can use Lemma 2 with ε set to ε/(2C), where C is the constant defined in Lemma 2. For the second term in Eq. (A.65), we use the fact that x′ and x are separated by at most δ_2 for coordinates in I_P, together with the smoothness condition on f_P from Lemma 2. The key step is that we can write this difference as the integral of the directional derivative of f_P along the line from x_in to x′_in. In particular, we can parameterize this line by x_in(t) = x_in + t(x′_in − x_in). Notice that at t = 0, this is equal to x_in, while at t = 1, this is equal to x′_in. Thus, suppressing the x_out parameters in our notation, we carry out the estimate: in the third line, we use the chain rule; in the fifth line, we use the Cauchy-Schwarz inequality; in the sixth line, we use the smoothness condition from Lemma 2 to bound the ℓ2-norm of the gradient; in the seventh line, we use the fact that ‖a‖_2 ≤ √s ‖a‖∞, where s is the number of elements of a; in the eighth line, we use the definition of T_{x,P}; and in the last line, we use our choice of δ_2. Combining this bound with Eq. (A.66) and plugging into Eq. (A.65), we obtain the claim, as required.

Simple form for ground state property
We can combine the results of the previous two sections to obtain the final result, giving a simpler form for the ground state property Tr(Oρ(x)). The proof of this statement is straightforward given the previous results. This concludes the proof.

Technical lemmas for finding constants and bounding integrals
In this section, we state and prove several technical lemmas for showing the existence of certain constants and for bounding integrals of specific forms needed throughout Appendix A. Throughout this section, we use the shorthand t* ≜ Δ(j, P)/(2v_lr) introduced above. First, we show the existence of the constants used in Definition 1. Lemma 6. Given v_lr, γ > 0 and d ≥ 1, there exists a constant c_1 large enough such that for all 1/e > ε′ > 0, the required inequality holds for all arguments above the stated threshold; such a constant c_1 can be given explicitly. To prove this, we need a bound on the upper incomplete Gamma function: Lemma 10 (Proposition 2.7 in [80]). Take any real k ≥ 0. Then, for all real y > k, the upper incomplete Gamma function Γ(k, y) admits the stated upper bound.
Proof of Lemma 9. Define the function w(y) = y/log²(y) on the domain y ∈ (1, ∞), where it is well-defined and differentiable. Moreover, it is always positive because log²(y) ≥ log(y) > 0 for y > 1. Also, consider the derivative w′(y); again, this is well-defined because log³(y) > 0 for y > 1. Furthermore, we see that if y ≥ e², then w′(y) > 0. Thus, for y ≥ e², w(y) is monotone increasing. Ultimately, our goal is to bound the integral by using the substitution u = w(y), du = w′(y) dy. Substituting in for y, we use the inverse y = w^{−1}(u), and for the differential dy, we use dy = du/w′(y). We want to get this into the form of the upper incomplete Gamma function. Thus, we want to find bounds on w′(y) and w^{−1}(u) in terms of u (and constants). Since we define w^{−1}(u) as the inverse of w(y), we know its defining relation. We notice that if u ≥ 28, then the inverse can be bounded, and we can bound it further because of Eq. (A.8). Now, by our choice of c_1 and Lemma 7, the claim follows; taking the overall constant to be 4c_1, we arrive at our claim.
Lemma 12. Let δ_1 and v be as in Definition 1. Then, there exists a constant C′ such that the analogous integral bound holds.

Proof. The proof is the same as that of Lemma 11 after replacing y^{10} by y^{d+10}. Moreover, in the final steps, instead of using Eq. (A.8) and Lemma 7, we use Eq. (A.9) and Lemma 8, respectively.

In cases when the range δ = O(1) is unimportant, we simply say that O is geometrically local.
We can now properly state the norm inequality relating the Pauli 1-norm to the spectral norm.
Theorem 4 (Detailed restatement of Theorem 2). Given an observable O = ∑_P α_P P that can be written as a sum of geometrically local observables with range δ in a finite d-dimensional space, we have ∑_P |α_P| ≤ C ‖O‖∞ for a constant C depending only on δ and d. If we additionally require that ‖O‖∞ = O(1), we have the following corollary.
Corollary 4. Given an observable O = ∑_P α_P P with ‖O‖∞ = O(1) that can be written as a sum of geometrically local observables in a finite d-dimensional space with δ = O(1), we have ∑_P |α_P| = O(1).

In order to establish the above norm inequality, we consider an explicit algorithm for constructing a state ρ satisfying ∑_P |α_P| ≤ C Tr(Oρ). In this way, bounding Tr(Oρ) above by ‖O‖∞ gives the desired inequality. We briefly discuss the idea of the algorithm. First, we consider the set of all geometrically local blocks over the n qubits. Then, we consider all Pauli observables P with nonzero α_P and the qubits that P acts on. For each block, if the qubits that P acts on are all inside that block, we put P inside this block. If there are multiple such blocks, we choose an arbitrary one to put P in, so that each Pauli observable P is in exactly one block. After that, we separate all blocks into a few disjoint layers of blocks. Each layer contains many blocks that are sufficiently far from one another, and each block contains some Pauli observables. We select the layer that has the largest ∑_P |α_P|, where this sum is over all Pauli observables inside that layer. To construct the state ρ, we let ρ be the maximally mixed state on the qubits outside of the selected layer. For each block in the selected layer, we choose ρ to be a state that maximizes the sum of the Pauli terms in the block. With a careful analysis, the constructed state ρ satisfies the desired norm inequality.

Facts and lemmas
Before proving Theorem 4, we give a few definitions, facts, and lemmas. Definition 6 (Geometrically local Pauli observables). Throughout the appendix, we consider S(geo) to be the set of all geometrically local Pauli observables with a constant range δ = O(1).
The following fact can be easily shown by considering the Pauli decomposition of each geometrically local observable in the sum. Fact 1. Any observable O that can be written as a sum of geometrically local observables can also be written as a sum of geometrically local Pauli observables. Thus, we can write O = ∑_P α_P P, where α_P = 0 for all P ∉ S(geo).
Proof. First, we can easily show that ρ has unit trace. Let P_a = ⨂_{q=1}^{n} P_{a,q} for all a = 1, . . ., K, where P_{a,q} ∈ {I, X, Y, Z}. Then, Tr(ρ) = 1, where the last equality follows because the trace of a nonidentity Pauli matrix is 0, and we assume that P_a ≠ I^{⊗n} so that the P_{a,q} are not all identity. To show that ρ is positive semidefinite, it suffices to prove that the eigenvalues of (±P_1 ± · · · ± P_K)/K are between −1 and 1. Then, when this is summed with the identity matrix, which has eigenvalue +1, the eigenvalues are nonnegative. We see this using the spectral norm, which concludes our proof. Now, we want to define an operation that is useful throughout the proof. In Definition 7, the subscript notation is used to be consistent with the more standard notation of P_q to denote a Pauli acting on qubit q.

Proof of Theorem 4
The key idea is to upper bound ∑_P |α_P| by a constant times Tr(Oρ) for some test state ρ. We construct such a ρ with a form similar to that seen in Lemma 13. Then, because ρ is positive semidefinite and has unit trace by Lemma 13, Tr(Oρ) ≤ ‖O‖∞. Putting everything together, we have the claimed inequality, as required. Thus, it suffices to consider the intermediate step of finding a quantum state ρ such that ∑_P |α_P| ≤ C Tr(Oρ). To this end, we consider dividing our space of all Pauli observables into different sets and focusing on one set, which educates our choice of ρ.
Consider some Pauli observable P ∈ S(geo), where S(geo) is the set of all geometrically local Pauli observables. Since P is geometrically local, by Definition 5, there exist constants b_k for k = 1, . . ., d that serve as the maximum range of qubits that a Pauli observable covers in the k-th dimension. We want to divide our d-dimensional space into blocks of b_k qubits in each dimension. These blocks of qubits are indexed by ⃗v = (v_1, . . ., v_d) and shifted by ⃗s = (s_1, . . ., s_d). We construct these blocks for v_k = 1, . . ., ⌊(⌊ᵈ√n⌋ − s_k + b_k)/(2b_k)⌋ and s_k = 0, . . ., 2b_k − 1 for k = 1, . . ., d. Here, we are dividing the d-dimensional space into blocks of b = ∏_{k=1}^{d} b_k qubits, where each block is indexed by ⃗v and is separated from the next by b_k qubits in the k-th dimension. We refer to this gap between the blocks as the buffer, and denote the buffer region by B′_{⃗s}. [Figure 5 caption: In both cases, the idea is to divide the qubits (blue circles) in d-dimensional space into blocks (light blue boxes) and consider the quantity we wish to bound within these blocks. All qubits not highlighted are in the buffer region. The first column depicts the unshifted blocks, i.e., ⃗s = 0. The second column displays an example of shifted blocks (dashed boxes). The last column considers Pauli terms (dark blue circles) acting on the circled qubits and indicates whether they are contained in the set defined in Eq. (B.12).]
where the union is over all possible vectors ⃗v such that each v_k ranges over the values above. This separation using the buffer region ensures that no Pauli term can act on qubits in two blocks at once, which we use later. Moreover, we consider possible shifts of these blocks by s_k qubits in each of the d dimensions. Notice that there are only 2b_k possible shifts in each dimension until the blocks align with the original positioning of another block. Consider the related set consisting of the Pauli terms that act only on qubits in a given block, where we order the index pairs (⃗v′, ⃗s′) ≤ (⃗v, ⃗s) using the standard lexicographical order: (⃗v′, ⃗s′) ≤ (⃗v, ⃗s) if and only if ⃗v′ < ⃗v, or ⃗v′ = ⃗v and ⃗s′ ≤ ⃗s, with vectors themselves compared entrywise in lexicographical order. Thus, we create these sets S_{(⃗v,⃗s)} sequentially according to this ordering, removing previous sets so that each S_{(⃗v,⃗s)} is disjoint from the other sets S_{(⃗v′,⃗s′)}. Now, taking a union over all ⃗v, we can consider the Pauli terms acting on these blocks together. The resulting sets S_{⃗s} then differ only in the shift of s_k qubits in each dimension.
Figure 5 illustrates all these definitions. We now consider ∑_P |α_P|, where the α_P are the coefficients in O = ∑_P α_P P. We want to pick the set S_{⃗s} for which this sum is largest, i.e., the shift ⃗s* maximizing ∑_{P∈S_{⃗s}} |α_P|.
We now focus on the set S_{⃗s*}. To justify this choice, we can think of the sets shifted by ⃗s as breaking the sum ∑_P |α_P| into different disjoint sums. This is a result of our earlier choice for each S_{⃗s} to contain a disjoint collection of Pauli terms. Then, the maximum over all shifts ⃗s of ∑_{P∈S_{⃗s}} |α_P| is at least the average over all shifts; in other words, ∑_{P∈S_{⃗s*}} |α_P| is at least a constant fraction of ∑_P |α_P|, where recall that b = ∏_{k=1}^{d} b_k. Relating back to our original goal, it remains to find a test state ρ such that the Pauli terms in S_{⃗s*} contribute proportionally to Tr(Oρ). Once we have this, we can conclude the proof of our claim, where the first inequality follows because ρ is positive semidefinite and has unit trace by Lemma 13. In what follows, we aim to define this ρ based on the set S_{⃗s*} and show that this inequality holds.
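In symbols, the averaging step reads as follows (our paraphrase, writing N_s for the total number of shifts):

```latex
\sum_{P \in S_{\vec{s}^{\,*}}} |\alpha_P|
  \;=\; \max_{\vec{s}} \sum_{P \in S_{\vec{s}}} |\alpha_P|
  \;\ge\; \frac{1}{N_s} \sum_{\vec{s}} \sum_{P \in S_{\vec{s}}} |\alpha_P|
  \;\ge\; \frac{1}{N_s} \sum_{P \in S^{(\mathrm{geo})}} |\alpha_P|,
\qquad
N_s \;=\; \prod_{k=1}^{d} 2 b_k \;=\; 2^d b ,
```

where the last inequality uses that every geometrically local Pauli term P is contained in S_{⃗s} for at least one shift ⃗s.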
The idea is to have ρ be the maximally mixed state on the qubits in the buffer region B′_{⃗s*} and a state of the form in Lemma 13 on the qubits in ⋃_{⃗v} B_{(⃗v,⃗s*)}. In this way, when we take Tr(Oρ), any Pauli terms not in S_{⃗s*} contribute 0, while Pauli terms P in S_{⃗s*} contribute a constant times |α_P|. Explicitly, we define ρ as a tensor product over blocks, where 1{α_P<0} is 1 when α_P < 0 and 0 otherwise, and the tensor product is again over all possible vectors ⃗v such that each entry v_k ranges over the values above. Here, we use the notation from Definition 7 to denote quantum operations restricted to their action on a given set of qubits. By Lemma 13, each ρ_{⃗v} is a proper quantum state that is positive semidefinite and has unit trace; hence ρ is a quantum state. Now, we want to calculate Tr(Oρ). Recall that O = ∑_P α_P P. Taking the trace, there are four cases that can occur regarding dom(P), the set of qubits on which P acts nontrivially:
1. P acts nontrivially on at least one qubit in the buffer region B′_{⃗s*}.
2. P acts nontrivially on qubits in more than one block.
3. P acts nontrivially only on qubits in a single block, but P is not in the set S_{⃗s*}.
4. P acts nontrivially only on qubits in a single block and P is in the set S_{⃗s*}, i.e., dom(P) ⊆ B_{(⃗v_P, ⃗s*)} for some ⃗v_P and P ∈ S_{(⃗v_P, ⃗s*)}.
We compute Tr(Pρ) for each of these cases. We note that Tr(ρ_{⃗v}) = 1 in all these calculations. For Case 1, it suffices to consider the case where P acts on only one qubit in the buffer region, i.e., there exists a qubit ℓ* ∈ B′_{⃗s*} such that ℓ* ∈ dom(P) and dom(P) ∖ {ℓ*} ⊆ B_{(⃗v_P, ⃗s*)}. Taking the trace of the corresponding state, since the trace of I/2 is 1, the result vanishes because the trace of a nonidentity Pauli string is 0. Here, the restriction of P to the block B_{(⃗v_P, ⃗s*)} differs from the restriction of every Q ∈ S_{(⃗v_P, ⃗s*)}, because Q acts nontrivially only on qubits in B_{(⃗v_P, ⃗s*)}, while P acts nontrivially on ℓ* ∉ B_{(⃗v_P, ⃗s*)}. Thus, Case 1 contributes 0 to Tr(Oρ). Next, we consider Case 2, in which P acts nontrivially on qubits in more than one block, i.e., dom(P) ⊆ B_{(⃗v_{P,1}, ⃗s*)} ∪ B_{(⃗v_{P,2}, ⃗s*)}. This case is in fact not possible by construction, because the buffer region between B_{(⃗v_{P,1}, ⃗s*)} and B_{(⃗v_{P,2}, ⃗s*)} has size b_k in each of the d dimensions. Recall that b_k is the largest distance between two qubits that any P acts on in the k-th dimension. Thus, it is not possible for P to span the buffer region, so this case cannot occur, and it trivially contributes 0 to Tr(Oρ). Now, we consider Case 3. From the previous two cases, we see that P can only act nontrivially on qubits in a single block B_{(⃗v_P, ⃗s*)} to contribute to Tr(Oρ). However, by construction of the sets S_{(⃗v,⃗s)}, in order to make them disjoint, it is possible that P ∉ S_{(⃗v_P, ⃗s*)} despite acting on the correct block of qubits. Taking the trace, this also contributes 0 to Tr(Oρ), again because the relevant restrictions of the Pauli strings differ. Finally, we consider Case 4. From the previous cases, we see that the only remaining possibility is that P acts nontrivially on qubits in a single block B_{(⃗v_P, ⃗s*)} and is also contained in the set S_{(⃗v_P, ⃗s*)}; computing the trace gives a contribution proportional to |α_P|.

Turning to the ML algorithm, recall that C′ is defined in Lemma 2. From this, we can define the discretized parameter space X_P, which contains parameter vectors that are 0 outside of I_P and take on discrete values inside of I_P. Furthermore, for each discretized vector x′ ∈ X_P, let T_{x′,P} be the set of vectors close to x′ for coordinates in I_P. Finally, we define an additional hyperparameter B > 0. With these definitions in place, we can discuss the ML algorithm. At a high level, the algorithm first maps the parameter space into a high-dimensional feature space. Then, the ML algorithm learns a linear function in this feature space using ℓ1-regularized regression.
In particular, the feature map φ maps x ↦ φ(x), where x ∈ [−1, 1]^m is an m-dimensional vector while φ(x) ∈ R^{m_φ} is an m_φ-dimensional vector with m_φ = ∑_{P∈S(geo)} |X_P|. Here, S(geo) denotes the set of all geometrically local Pauli observables as in Def. 6. Each coordinate of φ(x) is indexed by x′ ∈ X_P, P ∈ S(geo) and is defined as the indicator φ(x)_{x′,P} = 1[x ∈ T_{x′,P}]. The hypothesis class for our proposed ML algorithm consists of linear functions in this feature space, i.e., functions of the form h(x) = w • φ(x). The classical ML model learns such a function using ℓ1-regularized regression (LASSO) [38][39][40] over the feature space. Namely, given the hyperparameter B > 0 defined above, we utilize LASSO to find an m_φ-dimensional vector w* from the optimization problem minimizing the training error (1/N) ∑_{ℓ=1}^{N} |w • φ(x_ℓ) − y_ℓ|² subject to ‖w‖_1 ≤ B, where y_ℓ approximates Tr(Oρ(x_ℓ)). We denote the learned function by h*(x) = w* • φ(x). Importantly, this learned function does not need to achieve the minimum training error. In the following, we consider the vector w* to yield a training error that is larger than the minimum training error by at most ε_3/2.

Rigorous guarantee
Given these definitions and the ML algorithm, we prove the following theorem. The theorem stated in the main text corresponds to ε_1 = 0.2ε, ε_2 = ε, and ε_3 = 0.4ε. In the ML problem formulated in Sec. II B and Appendix C 1, the training data {(x_ℓ, y_ℓ)}_{ℓ=1}^{N} corresponds to a fixed and unknown observable O. However, we may be interested in training an ML model that can predict Tr(Oρ(x)) for a wide range of observables O. In this setting, one could consider a classical dataset {(x_ℓ, σ_T(ρ(x_ℓ)))}_{ℓ=1}^{N} generated by performing classical shadow tomography [41][42][43][44][45] on the ground state ρ(x_ℓ) for each ℓ = 1, . . ., N. This is achieved by repeatedly performing T randomized Pauli measurements on each state ρ(x_ℓ). Using the classical shadow dataset, we can obtain the following corollary for predicting ground state representations, where x_ℓ is sampled from an unknown distribution D and σ_T(ρ(x_ℓ)) is the classical shadow representation of the ground state ρ(x_ℓ) using T randomized Pauli measurements. For T = O(log(nN/δ)/ε²), the shadow estimates approximate Tr(Pρ(x_ℓ)) for all ℓ = 1, . . ., N and P ∈ S(geo) with probability at least 1 − (δ/2). For each P ∈ S(geo), we consider h*_P(x) to be the function produced by Theorem 5. From Theorem 5, the corresponding prediction error bound holds for all P ∈ S(geo) with probability at least 1 − (δ/2), conditioned on the event given in Eq. (C.17).
Using the union bound to combine the two events considered in Eq. (C.17) and Theorem 5, both hold simultaneously with probability at least 1 − δ. This concludes the proof of the corollary.

ℓ1-Norm bound on coefficients of linear hypothesis
We now justify our choice of the hyperparameter B > 0 such that ‖w‖_1 ≤ B. In Appendix A, we constructed a function that approximates the ground state property. Explicitly, this function is g(x) = w′ • φ(x), where the vector of coefficients w′, indexed by x′ ∈ X_P and P ∈ S(geo), is defined as w′_{x′,P} = f_P(x′). Thus, we see that the ML model, which learns functions of this form, has the capacity to approximate the target ground state property Tr(Oρ(x)). The actual function we learn, h*(x) = w* • φ(x), could differ significantly from g(x) = w′ • φ(x) because w′ is unknown. Nevertheless, we can utilize an upper bound ‖w′‖_1 ≤ B to restrict the hypothesis set of the ML algorithm to functions of the form h(x) = w • φ(x) with ‖w‖_1 ≤ B. Thus, we find an upper bound on ‖w′‖_1 in the following lemma.
Lemma 14 (ℓ1-norm bound). Let w′ be the vector of coefficients defined in Eq. (C.26). Then, we have the bound on ‖w′‖_1 given in Eq. (C.34). Plugging this into our ℓ1-norm bound from Eq. (C.29c), we obtain a bound in which the second inequality follows from Corollary 4, taking C as the corresponding constant. We can simplify this expression further by using that δ_1 = c_max log²(2/ε_1) for sufficiently small ε_1, according to Eq. (C.1).
which is the promised scaling. We can bound the training error in the following lemma.
Lemma 15 (Detailed restatement of Lemma 1). The function g(x) = w′ • φ(x) achieves a small training error, where the training error is defined in Definition 8.
Proof. This lemma follows directly from Theorem 3. Let ℓ* be the index of the training data point achieving the largest deviation. Then, the training error of g can be bounded above term by term. Therefore, the minimum training error min_w R(w) must be at most R(w′). Together with the optimization slack of ε_3/2 allowed for w*, the training error of h* is bounded as claimed, where the last inequality follows from Lemma 15.

Prediction error bound
With this bound on the training error, it remains to find a bound on the prediction error of our hypothesis function. To this end, we use a standard result from machine learning theory about the prediction error of ℓ1-norm-constrained linear hypotheses trained with the LASSO algorithm [38][39][40].
We can use this theorem to prove the prediction error bound in Theorem 5.
Proof of prediction error in Theorem 5. We utilize Theorem 6 as well as our established lemmas. First, we demonstrate that the conditions of the theorem are satisfied in our setting. Here, we view h in Theorem 6 as a function of the higher-dimensional feature vector φ(x) rather than of the m-dimensional vector x ∈ [−1, 1]^m, so that h is a linear hypothesis. In this perspective, our input space is the feature space {0, 1}^{m_φ} ⊆ R^{m_φ}, as the indicator functions we are evaluating only take 0-1 values. In our case, the dimension m_φ is given by Eq. (C.55b). Together, the training data size N given above guarantees that R(h*) ≤ (ε_1 + ε_2)² + ε_3 with probability at least 1 − δ.

Computational time for training and prediction
Finally, we determine the computational time required for the ML algorithm's training and prediction. To this end, we utilize standard results about the training time of the LASSO algorithm [51].
Proof of computational time in Theorem 5. The training time is dominated by the time required for ℓ1-regularized regression (LASSO) over the feature space defined by the feature map φ. It is well known that, to obtain a training error at most ε_3/2 larger than the optimal function value, the LASSO algorithm on the feature space can be executed in time scaling linearly in m_φ/ε_3² up to logarithmic factors [51], where m_φ is the dimension of the feature space; by Eq. (C.55b), m_φ = O(n) · 2^{polylog(1/ε)}, which yields the claimed running time.

Appendix D: Details of numerical experiments

For the numerical experiments, we consider the two-dimensional antiferromagnetic random Heisenberg model with Hamiltonian

H = ∑_{⟨ij⟩} J_{ij} (X_i X_j + Y_i Y_j + Z_i Z_j),   (D.1)

where the summation ranges over all pairs ⟨ij⟩ of neighboring sites on the lattice and the couplings {J_{ij}} are sampled uniformly from the interval [0, 2]. Here, the parameter vector x is a list of all couplings J_{ij}, so that the dimension of the parameter space is m = O(n), where n is the system size. We are interested in predicting ground state properties, which in this case are the two-body correlation functions for each pair of qubits on the lattice. In particular, this correlation function is the expectation value of

C_{ij} = (1/3)(X_i X_j + Y_i Y_j + Z_i Z_j),   (D.2)

for each pair of qubits ⟨ij⟩.
We generated training and testing data for this model using the same method as [29]. For completeness, we briefly discuss it here. For each parameter vector of random couplings sampled uniformly from [0, 2], we approximated the ground state using the density-matrix renormalization group (DMRG) [60] based on matrix product states (MPS) [61]. We start from a random initial MPS with bond dimension 10 and variationally optimize it using a singular value decomposition cutoff of 10⁻⁸. We terminate the DMRG runs when the change in energy is less than 10⁻⁴. After DMRG converges, we perform randomized Pauli measurements by locally rotating into the corresponding Pauli bases and sampling the rotated state [81]. In this work, we utilize two different datasets: one that is the same as in [29] and another that is generated in the same way but contains more data points.
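The estimation of C_ij from the randomized Pauli measurements can be carried out with the standard classical-shadow estimator for low-weight Pauli observables [81]; a minimal sketch (our own illustrative code, with made-up array conventions) is:

```python
import numpy as np

def estimate_two_qubit_pauli(bases, outcomes, pauli, i, j):
    """
    Classical-shadow estimate of <P_i P_j> for a single Pauli P in {'X','Y','Z'}.

    bases:    (T, n) array of per-shot measurement bases, entries in {'X','Y','Z'}
    outcomes: (T, n) array of +/-1 measurement outcomes
    The single-shot estimator is 3^2 * s_i * s_j when both qubits were measured
    in basis P, and 0 otherwise; averaging over shots gives an unbiased estimate.
    """
    hit = (bases[:, i] == pauli) & (bases[:, j] == pauli)
    single_shot = np.where(hit, 9.0 * outcomes[:, i] * outcomes[:, j], 0.0)
    return single_shot.mean()

def estimate_correlation(bases, outcomes, i, j):
    """C_ij = (1/3)(<X_i X_j> + <Y_i Y_j> + <Z_i Z_j>), as in Eq. (D.2)."""
    return np.mean([estimate_two_qubit_pauli(bases, outcomes, P, i, j)
                    for P in ("X", "Y", "Z")])
```

Since each two-qubit Pauli is measured compatibly in a 1/9 fraction of shots, the variance of the estimate decays as O(1/T) with the shadow size T.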
We consider classical machine learning models given by first performing a feature mapping φ on the input vector x and then running ℓ₁-regularized regression (LASSO) over the feature space φ(x), as described in Appendix C 1. However, while the indicator-function feature map was a useful tool for obtaining our rigorous guarantees, it is often impractical to discretize a high-dimensional parameter space in this way. Thus, we instead utilize random Fourier features [59]; one can think of this as a single layer of a randomly initialized neural network. Explicitly, this feature map is

φ(x) = (cos(γ ω_s · x), sin(γ ω_s · x))_{s=1}^{R},   (D.3)

where R > 0 and γ > 0 are tunable hyperparameters and the ω_s are vectors of the same length as x sampled from a multivariate standard normal distribution. Here, for each vector x, φ(x) is a 2R-dimensional vector, so the hyperparameter R determines the length of the feature vector. We consider a set of different hyperparameters: R ∈ {5, 10, 20, 40}, (D.4) γ ∈ {0.4, 0.5, 0.6, 0.65, 0.7, 0.75}. (D.5) Using this feature map, the ML algorithm is implemented as follows. First, we decompose x into several vectors corresponding to local regions around each local term of the Hamiltonian; this is analogous to the discretization of the parameter space using X_{δ₁}. Explicitly, the decomposition is performed as follows. Recall that in the 2D antiferromagnetic Heisenberg model, qubits are placed on the sites of a 2D lattice, so each local term can be viewed as an edge between neighboring sites. We construct a local region around this edge by including all edges within an ℓ₁-distance δ₁; this is analogous to Eq. (II.1). Now, for each vector resulting from the decomposition of x, we apply the feature map φ and concatenate all resulting vectors together to obtain φ(x). Finally, we run the LASSO algorithm using scikit-learn, a Python package [82]. Here, LASSO optimizes the objective function

(1/(2N)) ‖y − Φw‖₂² + α ‖w‖₁,   (D.6)

where N is the amount of training data, y is the vector of training labels {y_ℓ}_{ℓ=1}^N, Φ is the matrix whose rows are the feature vectors of the training inputs {x_ℓ}_{ℓ=1}^N, w is the vector of coefficients we want to learn, and α > 0 is a regularization parameter. We consider a set of different regularization parameters: α ∈ {2⁻⁸, 2⁻⁷, 2⁻⁶, 2⁻⁵}. (D.7) We thus obtain several different classical ML models, corresponding to the choices of the hyperparameters R, γ, and α, and perform model selection to determine the optimal choice of these hyperparameters.
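A compact end-to-end sketch of this pipeline could look as follows (assuming our reconstruction of Eqs. (D.3) and (D.6); the local-region decomposition is simplified to fixed groups of couplings, and the labels are synthetic stand-ins for the measured y_ℓ):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

def random_fourier_features(x_local, omegas, gamma):
    """Map one local coupling vector to [cos(gamma w.x), sin(gamma w.x)]_{s=1..R}."""
    proj = gamma * (omegas @ x_local)            # shape (R,)
    return np.concatenate([np.cos(proj), np.sin(proj)])

def feature_map(x, regions, omegas_per_region, gamma):
    """Concatenate RFF features of each local region of the parameter vector x."""
    return np.concatenate([
        random_fourier_features(x[reg], omegas, gamma)
        for reg, omegas in zip(regions, omegas_per_region)
    ])

# Toy setup: m couplings, local regions of neighboring couplings (illustrative).
m, R, gamma, alpha = 24, 10, 0.6, 2**-6
regions = [np.arange(i, min(i + 4, m)) for i in range(0, m, 4)]
omegas_per_region = [rng.standard_normal((R, len(reg))) for reg in regions]

# Toy training data: x_l ~ Uniform[0,2]^m with synthetic labels y_l.
N = 50
X_train = rng.uniform(0.0, 2.0, size=(N, m))
y_train = np.sin(X_train[:, 0]) * np.cos(X_train[:, 1])  # stand-in for Tr(O rho(x))

Phi = np.array([feature_map(x, regions, omegas_per_region, gamma) for x in X_train])

# scikit-learn's Lasso minimizes (1/(2N))||y - Phi w||_2^2 + alpha ||w||_1,
# matching Eq. (D.6).
model = Lasso(alpha=alpha, max_iter=100_000)
model.fit(Phi, y_train)

x_new = rng.uniform(0.0, 2.0, size=m)
prediction = model.predict(feature_map(x_new, regions, omegas_per_region, gamma)[None, :])
print(prediction)
```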
To this end, we consider a collection of different values of the parameter vector x = {J_ij}, roughly 100 in total across different system sizes. From these data points, we randomly choose half as training data and the remaining half as test data. For each ground state property we want to predict, we choose one value each of R, γ, and α such that the root-mean-square error is minimized under 4-fold cross-validation, which is also implemented using scikit-learn. Finally, we test the performance of the ML model with the chosen hyperparameters on the test data.
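The model-selection step might look like the following sketch (using scikit-learn's cross_val_score with 4 folds; the grid values are those of Eqs. (D.4), (D.5), and (D.7), while build_features is a hypothetical helper that assembles the feature matrix for a given (R, γ)):

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

def select_hyperparameters(X_train, y_train, build_features):
    """Pick (R, gamma, alpha) minimizing 4-fold cross-validated RMSE."""
    best, best_rmse = None, np.inf
    for R, gamma, alpha in product([5, 10, 20, 40],
                                   [0.4, 0.5, 0.6, 0.65, 0.7, 0.75],
                                   [2**-8, 2**-7, 2**-6, 2**-5]):
        Phi = build_features(X_train, R, gamma)  # feature matrix for this (R, gamma)
        scores = cross_val_score(Lasso(alpha=alpha, max_iter=100_000),
                                 Phi, y_train, cv=4,
                                 scoring="neg_root_mean_squared_error")
        rmse = -scores.mean()
        if rmse < best_rmse:
            best, best_rmse = (R, gamma, alpha), rmse
    return best, best_rmse
```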
For each vector x in the test set, we predict the correlation functions for all pairs of qubits ⟨ij⟩. Hence, the prediction error is averaged over all of the test data and all pairs of qubits, i.e., roughly 1500 to 3500 predictions. Despite the test set containing only around 50 parameter vectors, the prediction errors reported in the plots are statistically sound given the large total number of predictions. The standard deviation of the exact correlation functions in the data varies slightly across different system sizes; when the standard deviation is smaller, the prediction error will also be smaller. To compare the difficulty of predicting the correlation functions across different system sizes, we normalize the standard deviation to the average standard deviation of 0.191. We also include experiments where we vary the training data size N or the classical shadow size T, i.e., the number of randomized Pauli measurements used to approximate the ground state. For a fixed training data size equal to half of the total data, we vary the classical shadow size with values T ∈ {50, 100, 250, 500, 1000}. Similarly, for a fixed shadow size of T = 500, we vary the training data size as a fraction ν ∈ {0.1, 0.3, 0.5, 0.7, 0.9} of the total data. The numerical results of these experiments are summarized in Figure 2.

Figure 1: Overview of the proposed machine learning algorithm. Given a vector x ∈ [−1, 1]^m that parameterizes a quantum many-body Hamiltonian H(x), the algorithm uses the geometric structure of the system to create a high-dimensional feature vector φ(x) ∈ R^{m_φ}. The ML algorithm then predicts properties or a representation of the ground state ρ(x) of the Hamiltonian H(x) using the m_φ-dimensional vector φ(x).

Figure 2: Predicting ground state properties in 2D antiferromagnetic random Heisenberg models. (A) Prediction error. Each point indicates the root-mean-square error for predicting the correlation function in the ground state (averaged over Heisenberg model instances and over each pair of neighboring spins). The left panel fixes the training set size N = 50 and the system size n = 9 × 5 = 45. The center panel fixes the shadow size T = 500 and n = 45. The right panel fixes N = 50 and T = 500. The shaded regions show the standard deviation over different spin pairs. (B) Visualization. We plot how much each coupling J_ij contributes to the prediction of the correlation function over different pairs of qubits in the trained ML model. Thicker and darker edges correspond to higher contributions. We see that the ML model learns to utilize the local geometric structure.

Figure 3: Intuition behind Lemma 2. The qubits (blue circles) are arranged in a two-dimensional lattice with local Hamiltonian terms (light gray shading) acting between all pairs of neighboring qubits. A Pauli term P acts on a subset of these qubits, indicated by the light blue region. The dark blue circle represents a neighborhood around the region on which P acts. The idea of Lemma 2 is that, when changing the parameters x, only the parameters x⃗_j whose terms h_j(x⃗_j) lie within the neighborhood around the region on which P acts should significantly change Tr(Pρ(x)). Hence, Tr(Pρ(x)) depends (approximately) only on those parameters. It is implicit in the figure that h_j depends on x⃗_j for all j. Hence, in the depicted example, Tr(Pρ(x)) depends only on the vectors x⃗₁₄, x⃗₁₉, x⃗₂₀, x⃗₂₅.

Figure 5: Intuition behind the proof construction of Theorem 4 for the cases of one dimension (a) and two dimensions (b). In both cases, the idea is to divide our qubits (blue circles) in d-dimensional space into blocks (light blue boxes) and to consider the quantity we wish to bound within these blocks. Note that all qubits not highlighted are in the buffer region. The first column of the figure depicts the unshifted blocks, i.e., shift vector s⃗ = 0. The second column displays an example of shifted blocks (dashed boxes). Finally, the last column considers Pauli terms (dark blue circles) acting on the circled qubits and indicates whether they are contained in the set defined in Eq. (B.12).

4. Training error bound

Using the results in Appendix A 3, we can derive a bound on the training error of h(x) = w′ · φ(x) discussed in the previous section. The existence of w′ then guarantees that the function h*(x) = w* · φ(x), found by performing optimization to minimize the training error, will also yield a training error close to zero. To prove this rigorously, we first give a precise definition of the training error.

Definition 8 (Training error). Given a function h(x) and a training dataset {(x_ℓ, y_ℓ)}_{ℓ=1}^N, the training error is defined as R(h) = (1/N) ∑_{ℓ=1}^{N} |h(x_ℓ) − y_ℓ|².
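As a direct numerical companion to Definition 8 (a sketch with our own variable names, where Phi stacks the feature vectors φ(x_ℓ) as rows):

```python
import numpy as np

def training_error(w, Phi, y):
    """R(h) = (1/N) * sum_l |h(x_l) - y_l|^2 for the linear hypothesis h = w . phi."""
    residuals = Phi @ w - y
    return float(np.mean(np.abs(residuals) ** 2))
```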