Out-of-distribution generalization for learning quantum dynamics

Generalization bounds are a critical tool to assess the training data requirements of Quantum Machine Learning (QML). Recent work has established guarantees for in-distribution generalization of quantum neural networks (QNNs), where training and testing data are drawn from the same data distribution. However, there are currently no results on out-of-distribution generalization in QML, where we require a trained model to perform well even on data drawn from a different distribution from the training distribution. Here, we prove out-of-distribution generalization for the task of learning an unknown unitary. In particular, we show that one can learn the action of a unitary on entangled states having trained only on product states. Since product states can be prepared using only single-qubit gates, this advances the prospects of learning quantum dynamics on near-term quantum hardware, and further opens up new methods for both the classical and quantum compilation of quantum circuits.


I. INTRODUCTION
In Quantum Machine Learning (QML) a quantum neural network (QNN) is trained using classical or quantum data, with the goal of learning how to make accurate predictions on unseen data [1][2][3]. This ability to extrapolate from training data to unseen data is known as generalization. There is much excitement currently about the potential of such QML methods to outperform classical methods for a range of learning tasks [4][5][6][7][8][9][10][11]. However, to achieve this, it is critical that the training data required for successful generalization can be produced efficiently.
While recent work has established a number of fundamental bounds on the amount of training data required for successful generalization in QML [11][12][13][14][15][16][17][18][19][20][21][22][23][24], less attention has been paid so far to the type of training data required for generalization. In particular, prior work has established guarantees for the in-distribution generalization of QML models, where training and testing data are assumed to be drawn independently from the same data distribution. However, in practice one may only have access to a limited type of training data, and yet be interested in making accurate predictions for a wider class of inputs. This is particularly an issue in the Noisy Intermediate-Scale Quantum (NISQ) era [25], when deep quantum circuits cannot be reliably executed, effectively limiting the quantum training data states that can be prepared.

* The first two authors contributed equally to this work.
In this article, we study out-of-distribution generalization in QML. That is, we investigate generalization performance when the testing and training distributions do not coincide. Specifically, we consider the task of learning unitary dynamics, which is a fundamental primitive for a range of QML algorithms. At its simplest, the target unitary could be the unknown dynamics of an experimental quantum system. For this case, which has close links with quantum sensing [26] and Hamiltonian learning [27][28][29], the aim is essentially to learn a digitalization of an analog quantum process. This could be performed using a 'standard' quantum computer or a simpler experimental system with perhaps a limited gate set, as sketched in Fig. 1a) and b) respectively. Alternatively, the target unitary could take the form of a known gate sequence that one seeks to compile into a shorter depth circuit or a particular structured form [30][31][32][33]. The compilation could be performed either on a quantum computer, see Fig. 1c), or entirely classically, see Fig. 1d). Such a subroutine can be used to reduce the resources required to implement larger scale quantum algorithms including those for dynamical simulations [34][35][36][37].
Here we prove out-of-distribution generalization for unitary learning with a broad class of training and testing distributions. Specifically, we show that the average prediction errors over any two locally scrambled [38,39] ensembles of states are perfectly correlated up to a small constant factor. This is captured by our main theorem, Theorem 1.

FIG. 1. a) Quantum dynamics learning of an experimental process using a quantum computer. b) Quantum dynamics learning with a more specialized experimental system with potentially a limited gate set. c) and d) Quantum compilation of a known unitary on a quantum computer and classical computer, respectively.

By combining this observation with in-distribution generalization guarantees, it follows that if the training and testing distributions are both locally scrambled (but potentially otherwise different), then out-of-distribution generalization is always possible. In particular, we show that a QNN trained on quantum data capturing the action of an efficiently implementable target unitary on a polynomial number of random product states generalizes to test data composed of fully random states. That is, rather intriguingly, we show that one can learn the action of such a unitary on a broad spread of highly entangled states having only studied its action on a limited number of product states.
We numerically illustrate these analytical results by showing that the short-time evolution of a Heisenberg spin chain can be well learned using only product state training data. Namely, we find that the out-of-distribution generalization error nearly perfectly correlates with the in-distribution generalization error and the training cost. In particular, in our numerical experiments, the testing performances achieved by the QML model on Haar-random states and on random product states differ only by a small constant factor, as predicted analytically. We further perform noisy simulations that demonstrate how the noise accumulated when preparing highly entangled states can prohibit training. In contrast, noisy training on product states, which can be prepared using only single-qubit gates, remains feasible. Additionally, in Appendix C 2 we numerically validate our generalization guarantees in a task of learning so-called fast scrambler unitaries [40]. Thus our results bring the possibility of using QML to learn unitary processes closer to near-term realization.
Our results further suggest a new quantum-inspired classical approach to unitary compilation. Namely, our results imply that a low-entangling unitary can be compiled using only low-entangled training states. Such circuits can be readily simulated using classical tensor network methods, and hence this compilation can be performed classically.

A. Framework
In this work we consider the QML task of learning an unknown n-qubit unitary U ∈ U((C²)^⊗n). The goal is to use training states to optimize the classical parameters α of V(α), an n-qubit unitary QNN (or classical representation of a QNN), such that, for the optimized parameters α_opt, V(α_opt) well predicts the action of U on previously unseen test states.
To formalize this notion of learning, we employ the framework of statistical learning theory [41,42]. The prediction performance of the trained QNN V(α_opt) can be quantified in terms of the average distance between the output state predicted by V(α_opt) and the true output state determined by U. The average is taken over input states from a testing ensemble, which represents the ensemble of states on which one wants to be able to predict the action of the target unitary. More precisely, the goal is to minimize the expected risk

R_P(α) = E_{|Ψ⟩∼P} [ (1/4) ‖ U|Ψ⟩⟨Ψ|U† − V(α)|Ψ⟩⟨Ψ|V(α)† ‖₁² ] ,   (1)

where the testing distribution P is a probability distribution over (pure) n-qubit states |Ψ⟩ and the factor of 1/4 ensures 0 ≤ R_P(α) ≤ 1. A learner will not have access to the full testing ensemble P and so cannot evaluate the risk in Eq. (1). Instead, it is typically assumed that the learner has access to a training data set

D_Q(N) = { ( |Ψ_j⟩ , U|Ψ_j⟩ ) }_{j=1}^N   (2)

consisting of input-output pairs of pure n-qubit states, where the N input states are drawn independently from a training distribution Q. Equipped with such training data, the learner may evaluate the training cost

C_{D_Q(N)}(α) = (1/(4N)) Σ_{j=1}^N ‖ U|Ψ_j⟩⟨Ψ_j|U† − V(α)|Ψ_j⟩⟨Ψ_j|V(α)† ‖₁² .   (3)

We note that this cost can be rewritten in terms of the average fidelity as

C_{D_Q(N)}(α) = 1 − (1/N) Σ_{j=1}^N |⟨Ψ_j| V(α)† U |Ψ_j⟩|² ,   (4)

and thus can be efficiently computed using a Loschmidt echo [14] or swap test circuit [43,44]. The hope is that by training the parameters α of the QNN to minimize the training cost C_{D_Q(N)}(α) one will also achieve small risk R_P(α). However, whether such a strategy is successful crucially depends on whether the training cost C_{D_Q(N)}(α) is indeed a good proxy for the expected risk R_P(α). This is exactly the question of generalization: Does good performance on the training data imply good performance on (previously unseen) testing data?
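For small systems, the fidelity form of the training cost can be evaluated exactly by direct linear algebra. The following NumPy sketch (our own conventions; the helper names are ours, and the cost expression follows our reading of Eq. (4)) evaluates the cost for a random target unitary on product training states:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_state(dim, rng):
    """Haar-random pure state: a normalized complex Gaussian vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def product_state(n, rng):
    """Tensor product of n Haar-random single-qubit states."""
    psi = np.array([1.0 + 0j])
    for _ in range(n):
        psi = np.kron(psi, haar_state(2, rng))
    return psi

def training_cost(U, V, inputs):
    """Fidelity form of the cost: 1 - (1/N) sum_j |<psi_j| V^dag U |psi_j>|^2."""
    fids = [abs(psi.conj() @ (V.conj().T @ (U @ psi))) ** 2 for psi in inputs]
    return 1.0 - np.mean(fids)

n = 3
d = 2 ** n
# Haar-random target unitary via QR of a complex Gaussian matrix.
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
Q, R = np.linalg.qr(A)
U = Q * (np.diag(R) / np.abs(np.diag(R)))  # phase fix for the Haar measure

states = [product_state(n, rng) for _ in range(10)]
# A perfect model (V = U) has exactly zero cost; any V gives a cost in [0, 1].
```

Here the Loschmidt-echo circuit of the main text is replaced by exact state-vector arithmetic, which is only feasible classically for small n.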
In statistical learning theory, answers to this question are given in terms of generalization bounds. These are bounds on the generalization error, which is typically taken to be the difference between expected risk and training cost, i.e.,

gen_{P,D_Q(N)}(α_opt) := R_P(α_opt) − C_{D_Q(N)}(α_opt) .   (5)

Usually, such bounds are proved under an i.i.d. assumption on training and testing. That is, they are based on the assumptions (a) that the training examples are drawn independently from a training distribution Q and (b) that the training and testing distributions coincide, Q = P. In this case, we speak of in-distribution generalization.
In this paper, we consider out-of-distribution generalization, where we drop assumption (b) by allowing Q ≠ P. Borrowing classical machine learning terminology, one can also regard this as a scenario of dataset shift [45], or more specifically covariate shift [46,47], which is often addressed using transfer learning techniques [48,49]. We formulate our results for a broad class of ensembles called locally scrambled ensembles. In loose terms, locally scrambled ensembles of states can be thought of as ensembles of states that are at least locally random [50]. More formally, they are defined as follows.
Definition 1 (Locally scrambled ensembles). An ensemble of n-qubit unitaries is called locally scrambled if it is invariant under pre-processing by tensor products of arbitrary local unitaries. That is, a unitary ensemble U_LS is locally scrambled iff for U ∼ U_LS and for any fixed U_1, …, U_n ∈ U(C²), also U(⊗_{i=1}^n U_i) ∼ U_LS [51]. Accordingly, an ensemble S_LS of n-qubit quantum states is locally scrambled if it is of the form S_LS = U_LS |0⟩^⊗n for some locally scrambled unitary ensemble U_LS [52]. We denote the classes of locally scrambled ensembles of unitaries and states as U_LS and S_LS, respectively.
In fact, our results hold for a slightly broader class of ensembles, where we only require that the ensemble agrees with a locally scrambled one up to and including its (complex) second moments. That is, more informally, the averages over the ensemble agree with those of a locally scrambled ensemble for all functions of U that contain at most two copies of U. We denote these broader classes of unitary and state ensembles, which we formally define in Appendix A, as U_LS^(2) and S_LS^(2), respectively. In our results, we suppose that both the testing and training ensembles are such ensembles, i.e., P ∈ S_LS^(2) and Q ∈ S_LS^(2). However, as S_LS^(2) captures a variety of different possible ensembles, P and Q can be ensembles containing very different sorts of states. In particular, as detailed further in Appendix A, the following are important examples of ensembles in S_LS^(2):

• S_{Haar_1^⊗n} – Products of Haar-random single-qubit states.

• S_{Stab_1^⊗n} – Products of random single-qubit stabilizer states.

• S_{Haar_k^⊗n/k} – Products of Haar-random k-qubit states.

• S_{RandCirc}^{A_k} – The output states of random quantum circuits. (Here A_k denotes the k-local n-qubit quantum circuit architecture from which the random circuit is constructed.)

These examples highlight that the class of locally scrambled ensembles includes both ensembles that consist solely of product states and ensembles composed mostly of highly entangled states. We can use this to our advantage to construct more efficient machine learning strategies.
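The contrast between the first and last example ensembles can be made concrete numerically. The sketch below (NumPy; all helper names are ours, and the brickwork circuit stands in for one possible choice of architecture A_k) samples from a product-Haar ensemble and from a shallow random-circuit ensemble, and compares the purity of the half-chain reduced state, which is 1 for product states and well below 1 for typical random-circuit outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def haar_unitary(d, rng):
    """Haar-random unitary via QR with phase correction."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def product_haar_state(n, rng):
    """Sample from the product ensemble: tensor product of Haar 1-qubit states."""
    psi = np.array([1.0 + 0j])
    for _ in range(n):
        v = rng.normal(size=2) + 1j * rng.normal(size=2)
        psi = np.kron(psi, v / np.linalg.norm(v))
    return psi

def random_circuit_state(n, layers, rng):
    """Brickwork of Haar 2-qubit gates acting on |0...0>."""
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    for layer in range(layers):
        for k in range(layer % 2, n - 1, 2):
            g = haar_unitary(4, rng)
            op = np.kron(np.kron(np.eye(2 ** k), g), np.eye(2 ** (n - k - 2)))
            psi = op @ psi
    return psi

def half_chain_purity(psi, n):
    """Tr[rho_A^2] for the reduced state of the first n//2 qubits."""
    M = psi.reshape(2 ** (n // 2), -1)
    rho = M @ M.conj().T
    return float(np.real(np.trace(rho @ rho)))

n = 4
p_prod = np.mean([half_chain_purity(product_haar_state(n, rng), n) for _ in range(50)])
p_circ = np.mean([half_chain_purity(random_circuit_state(n, 4, rng), n) for _ in range(50)])
# p_prod is exactly 1 (no entanglement); p_circ is substantially smaller.
```

Both ensembles are (up to second moments) locally scrambled, yet one requires only single-qubit gates to prepare while the other produces highly entangled states.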
Typically the learner will be interested in learning the action of a unitary on a wide class of input states, including both entangled and unentangled states. For example, they might be interested in learning the action of a unitary on all states that can be efficiently prepared on a quantum computer using a polynomial-depth hardware-efficient layered ansatz. Thus in general the expected risk should be evaluated over distributions such as S_{Haar_n}, S_{2-design} or S_{RandCirc}^{A_k} (for k ≥ 2), which cover a large proportion of the total Hilbert space.
In classical machine learning one often thinks of the training data as given. However, in the context of learning or compiling quantum unitary dynamics (as sketched in Fig. 1), one in practice needs either to prepare the training states on a quantum computer or in an experimental setup, or to be able to efficiently simulate them classically. Thus, it is desirable to train on states that can be prepared using simple circuits, i.e., those that are short-depth, low-entangling or require only simple gates. This is especially important in the NISQ era due to noise-induced barren plateaus [53] or other noise-related issues [54]. Therefore, as random stabilizer states and random product states can be prepared using only a single layer of single-qubit gates, it makes practical sense to train using the ensembles S_{Haar_1^⊗n} or S_{Stab_1^⊗n}.
In this manner, the class of ensembles that are locally scrambled up to the second moment, S_LS^(2), splits naturally into simple ensembles suited to training and more complex ensembles suited to testing.

B. Analytical Results

In this section, we present our analytical results lifting in-distribution generalization to out-of-distribution generalization when using a QNN to learn an unknown unitary from quantum data. For the formal proofs see Appendix B.

Equivalence of Locally Scrambled Risks
We first show a close connection between the risks for unitary learning arising from any locally scrambled ensembles. More precisely, we show that they can be upper and lower bounded in terms of the expected risk over the Haar distribution in our main technical result:

Lemma 1. For any Q ∈ S_LS^(2) and any parameter setting α,

(1/2) R_{S_{Haar_n}}(α) ≤ R_Q(α) ≤ ((d+1)/d) R_{S_{Haar_n}}(α) ,   (6)

where d = 2^n is the dimension of the target unitary U being learned.
This result establishes that learning over any locally scrambled distribution is effectively equivalent (up to a constant multiplicative factor) to learning over the uniform distribution over the entire Hilbert space. We note that the factor of 1/2 in the lower bound emerges from the structure of our proof, and for typical cases we expect the relation between the costs to be tighter still. We explore this numerically in Appendix C 1 for the special case of training on random product states, i.e., Q = S_{Haar_1^⊗n}. A direct consequence of Lemma 1 is that the risks arising from any two locally scrambled ensembles are related as follows.
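The two-sided relation of Lemma 1 can be probed in a small Monte Carlo experiment. The sketch below (NumPy; our conventions, and we take the bounds to have the form R_Haar/2 ≤ R_Q ≤ (1 + 1/d)·R_Haar as an assumption of this sketch) computes the Haar risk exactly from the average-fidelity formula and estimates the product-state risk by sampling:

```python
import numpy as np

rng = np.random.default_rng(2)

def haar_unitary(d, rng):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def product_state(n, rng):
    psi = np.array([1.0 + 0j])
    for _ in range(n):
        v = rng.normal(size=2) + 1j * rng.normal(size=2)
        psi = np.kron(psi, v / np.linalg.norm(v))
    return psi

n, d = 2, 4
U, V = haar_unitary(d, rng), haar_unitary(d, rng)
W = U.conj().T @ V

# Exact Haar risk from the average gate fidelity:
# R_Haar = 1 - (d + |Tr(U^dag V)|^2) / (d^2 + d).
R_haar = 1 - (d + abs(np.trace(W)) ** 2) / (d ** 2 + d)

# Monte Carlo estimate of R_Q for Q = products of Haar single-qubit states.
fids = []
for _ in range(20000):
    psi = product_state(n, rng)
    fids.append(abs(psi.conj() @ (W @ psi)) ** 2)
R_Q = 1 - np.mean(fids)
# Up to sampling error, R_haar / 2 <= R_Q <= (1 + 1/d) * R_haar.
```

The small slack in the checks below accounts for Monte Carlo error; the analytical statement itself is an exact inequality.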
Theorem 1 (Equivalence of locally scrambled ensembles for comparing unitaries). Let P ∈ S_LS^(2) and Q ∈ S_LS^(2). Then, for any parameter setting α,

(d/(2(d+1))) R_Q(α) ≤ R_P(α) ≤ (2(d+1)/d) R_Q(α) .   (7)

Theorem 1 establishes an equivalence (up to a constant multiplicative factor) between all locally scrambled testing distributions for the task of learning an unknown unitary on average. In particular, even simple locally scrambled ensembles, such as tensor products of Haar-random single-qubit states or of random single-qubit stabilizer states, are for this purpose effectively equivalent to seemingly more complex locally scrambled ensembles. The latter include the output states of random quantum circuits or, indeed, globally Haar-random states.

Out-of-Distribution Generalization for QNNs Trained on Locally Scrambled States

Theorem 1 gives rise to a general template for lifting in-distribution generalization bounds for QNNs to out-of-distribution generalization guarantees in unitary learning. This is captured by the following corollary:

Corollary 1 (Locally scrambled out-of-distribution generalization from in-distribution generalization). Let P ∈ S_LS^(2) and Q ∈ S_LS^(2). Let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN that is trained using training data D_Q(N) containing N input-output pairs, with inputs drawn from the ensemble Q. Then, for any parameter setting α,

R_P(α) ≤ (2(d+1)/d) ( C_{D_Q(N)}(α) + gen_{Q,D_Q(N)}(α) ) .   (8)

Thus, when training using training data D_Q(N), the out-of-distribution risk R_P(α_opt) of the optimized parameters α_opt after training is controlled in terms of the optimized training cost C_{D_Q(N)}(α_opt) and the in-distribution generalization error gen_{Q,D_Q(N)}(α_opt). We can now bound the in-distribution generalization error using already known QML in-distribution generalization bounds [11-23] (or, indeed, any such bounds that are derived in the future). We point out that our results up to this point do not require any assumptions on the QNN architecture underlying V(α), except for overall unitarity. As a concrete example of guarantees that can be obtained this way, we combine Corollary 1 with an in-distribution generalization bound established in [20] to prove:

Corollary 2 (Locally scrambled out-of-distribution generalization for QNNs). Let P ∈ S_LS^(2) and Q ∈ S_LS^(2). Let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN with T parameterized local gates. When trained with the cost C_{D_Q(N)} using training data D_Q(N), the out-of-distribution risk w.r.t. P of the parameter setting α_opt after training satisfies

R_P(α_opt) ≤ (2(d+1)/d) C_{D_Q(N)}(α_opt) + O( √( T log T / N ) )   (9)

with high probability over the choice of training data of size N according to Q.
The out-of-distribution generalization guarantee of Corollary 2 is particularly interesting if the training data is drawn from a distribution composed only of products of single-qubit Haar-random or random stabilizer states, i.e., Q = S_{Haar_1^⊗n} or Q = S_{Stab_1^⊗n}, but the testing data is drawn from more complex distributions such as the Haar ensemble or the outputs of random circuits, i.e., P = S_{Haar_n} or P = S_{RandCirc}^{A_k}. In this case, Corollary 2 implies that efficiently implementable unitaries can be learned using a small number of simple unentangled training states. More precisely, if U can be approximated via a QNN with poly(n) trainable local gates, then only poly(n) unique product training states suffice to learn the action of U on the Haar distribution, i.e., across the entire Hilbert space.
To understand why out-of-distribution generalization is possible, recall that any n-qubit state can be expanded in the basis of Pauli observables P ∈ {I, X, Y, Z}^⊗n, and each Pauli observable P can in turn be written as a linear combination of product states |s⟩⟨s| = ⊗_{i=1}^n |s_i⟩⟨s_i|, where s_i ∈ {0, 1, +, −, y+, y−}. These two facts imply that for any state |φ⟩⟨φ| there exist coefficients α_s such that |φ⟩⟨φ| = Σ_s α_s |s⟩⟨s|. Hence, if we know U|s⟩⟨s|U† exactly for all 6^n product states |s⟩⟨s|, then we can determine U|φ⟩⟨φ|U† for any state |φ⟩⟨φ| by linearity. However, this requires an exponential number of product states in the training data. In our prior work [20], we showed that one only needs poly(n) training product states to approximately know U|s⟩⟨s|U† for most of the 6^n product states, assuming U is efficiently implementable. The key insight in this work is that one can predict U|φ⟩⟨φ|U† as long as the coefficients α_s in |φ⟩⟨φ| = Σ_s α_s |s⟩⟨s| are sufficiently random and spread out across the 6^n product states. We make this condition precise by defining locally scrambled ensembles and proving that the action of U on a state sampled from any such ensemble can be predicted. In Appendix B 3, we further discuss the role that linearity plays in our results.
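The two linear-algebra facts invoked here are easy to verify directly on a single qubit. The sketch below (NumPy; the least-squares decomposition is our own illustration) checks that X = |+⟩⟨+| − |−⟩⟨−|, and that a random pure state decomposes over the six stabilizer-state projectors:

```python
import numpy as np

rng = np.random.default_rng(3)

# The six single-qubit stabilizer states |0>, |1>, |+>, |->, |y+>, |y->.
s2 = 1 / np.sqrt(2)
kets = [
    np.array([1, 0], dtype=complex),
    np.array([0, 1], dtype=complex),
    s2 * np.array([1, 1], dtype=complex),
    s2 * np.array([1, -1], dtype=complex),
    s2 * np.array([1, 1j], dtype=complex),
    s2 * np.array([1, -1j], dtype=complex),
]
projs = [np.outer(k, k.conj()) for k in kets]

# Fact 1: Paulis are combinations of product-state projectors,
# e.g. X = |+><+| - |-><-|.
X = np.array([[0, 1], [1, 0]], dtype=complex)
x_ok = np.allclose(X, projs[2] - projs[3])

# Fact 2: a random pure state |phi><phi| is a linear combination of the six
# projectors; solve the vectorized least-squares problem for alpha_s.
v = rng.normal(size=2) + 1j * rng.normal(size=2)
phi = np.outer(v, v.conj()) / (v.conj() @ v)
A = np.stack([p.reshape(-1) for p in projs], axis=1)   # 4 x 6 design matrix
alpha, *_ = np.linalg.lstsq(A, phi.reshape(-1), rcond=None)
recon = (A @ alpha).reshape(2, 2)
# recon reproduces |phi><phi| exactly: the six projectors span all 2x2 matrices.
```

For n qubits the same argument runs over tensor products, giving the 6^n product states |s⟩⟨s| of the main text.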
We can immediately extend our out-of-distribution results to local variants of the cost. Such local costs are essential to avoid cost-function-dependent barren plateaus [55] when training a shallow QNN. As a concrete example, when taking S_{Haar_1^⊗n} as the training ensemble, we can consider the local training cost

C_L^{Prod,N}(α) = (1/N) Σ_{j=1}^N ⟨Ψ^(j)| U† V(α) H_L^(j) V(α)† U |Ψ^(j)⟩ ,   (10)

where |Ψ^(j)⟩ = ⊗_{k=1}^n |ψ_k^(j)⟩ for all j and we have introduced the local measurement operator H_L^(j) = 1 − (1/n) Σ_{k=1}^n |ψ_k^(j)⟩⟨ψ_k^(j)|_k ⊗ 1_k̄. This local cost is faithful to its global variant for product state training in the sense that it vanishes under the same conditions [30], but crucially, in contrast to the global case, may be trainable [55]. In Appendix B 2, we prove a version of Corollary 2 for training on the local cost from Eq. (10). Specifically, we find:

Corollary 3 (Locally scrambled out-of-distribution generalization for QNNs via a local cost). Let P ∈ S_LS^(2) and let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN with T parameterized local gates. When trained with the cost C_L^{Prod,N}, the out-of-distribution risk w.r.t. P of the parameter setting α_opt after training satisfies

R_P(α_opt) ≤ (2(d+1)/d) n C_L^{Prod,N}(α_opt) + O( √( T log T / N ) )   (11)

with high probability over the choice of training data of size N.
Clearly, analogous local variants of the training cost can be defined whenever the respective ensemble has a tensor product structure (such as S_{Stab_1^⊗n}). However, if the training data is highly entangled, constructing such local costs in this manner is not possible. Thus, this is another important consequence of our results: The ability to train solely on product state inputs makes it straightforward to generate the local costs that are necessary for efficient training.
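The relation between a local and a global cost term for a single product training state can be illustrated numerically. The sketch below (NumPy; the specific form of the local operator H_L follows our reading of the local Loschmidt-echo construction and should be treated as an assumption) checks the standard sandwich C_local ≤ C_global ≤ n·C_local, which underlies the faithfulness claim:

```python
import numpy as np

rng = np.random.default_rng(4)

def haar_unitary(d, rng):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

n, d = 3, 8
U, V = haar_unitary(d, rng), haar_unitary(d, rng)

# A single product training state |Psi> = |psi_1> x |psi_2> x |psi_3>.
factors = []
Psi = np.array([1.0 + 0j])
for _ in range(n):
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    f = v / np.linalg.norm(v)
    factors.append(f)
    Psi = np.kron(Psi, f)

chi = V.conj().T @ (U @ Psi)  # Loschmidt-echo state V^dag U |Psi>

# Global cost term: 1 - |<Psi|chi>|^2.
C_global = 1 - abs(Psi.conj() @ chi) ** 2

# Local operator H_L = I - (1/n) sum_k |psi_k><psi_k|_k (x) I_rest.
H_L = np.eye(d, dtype=complex)
for k, f in enumerate(factors):
    P = np.outer(f, f.conj())
    H_L -= np.kron(np.kron(np.eye(2 ** k), P), np.eye(2 ** (n - k - 1))) / n
C_local = float(np.real(chi.conj() @ (H_L @ chi)))

# Faithfulness: both costs vanish for a perfect model (V = U on |Psi>),
# and C_local <= C_global <= n * C_local always holds for product inputs.
C_local_perfect = float(np.real(Psi.conj() @ (H_L @ Psi)))
```

The sandwich holds because the global projector |Ψ⟩⟨Ψ| is the product of the commuting local projectors, so the global infidelity dominates the average local infidelity but is at most n times it.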
The results presented thus far concern the number of unique training states required for generalization, but in practice multiple copies of each training state will be needed for successful training. As O(1/ε²) shots are required to evaluate a cost to precision ε, and since for gradient-based training methods one needs to evaluate the partial derivative of the cost with respect to each of the T trainable parameters, one would expect to need on the order of O(T M_opt/ε²) copies of each of the N input states and output states to reduce the cost to ε. Here M_opt is the number of optimization steps. Classical shadow tomography [56][57][58] provides a way towards a copy complexity bound that is independent of the number of optimization steps. Namely, exploiting covering number bounds for the space of pure output states of polynomial-size quantum circuits (compare [4, 20]), polynomial-size classical shadows can be used to perform tomography among such states. In the case of an efficiently implementable target unitary U and a QNN V(α) that both admit a circuit representation with T ∈ O(poly(n)) local gates, O(T log(T/ε)/ε²) ⊆ Õ(poly(n)/ε²) copies of each of the input states |Ψ^(j)⟩ and output states |Φ^(j)⟩ suffice to approximately evaluate the cost (both the global and local variants) and its partial derivatives arbitrarily often.

C. Numerical Results
Here we provide numerical evidence to support our analytical results showing that out-of-distribution generalization is possible for the learning of quantum dynamics. We focus on the task of learning the parameters of an unknown target Hamiltonian by studying the evolution of product states under it.
For concreteness, we suppose that the target Hamiltonian is of the form

H(p, q, r) = Σ_{k=1}^{n−1} ( Z_k Z_{k+1} + p_k X_k X_{k+1} ) + Σ_{k=1}^{n} ( q_k X_k + r_k Z_k ) ,   (12)

with the specific parameter setting (p*, q*, r*) given by p*_k = sin(πk/(2n)) for 1 ≤ k ≤ n − 1, and q*_k = sin(πk/n), r*_k = cos(πk/n) for 1 ≤ k ≤ n. The learning is performed by comparing the exact evolution under e^{−iH(p*,q*,r*)t} to a Trotterized ansatz. Specifically, we use an L-layered ansatz V_L(p, q, r) := (U_Δt(p, q, r))^L, where U_Δt is a second-order Trotterization of e^{−iH(p,q,r)Δt}. That is,

U_Δt(p, q, r) = e^{−iH_A(r)Δt/2} e^{−iH_B(p,q)Δt} e^{−iH_A(r)Δt/2} ,   (13)

where the Hamiltonians H_A(r) := Σ_{k=1}^{n−1} Z_k Z_{k+1} + Σ_{k=1}^{n} r_k Z_k and H_B(p, q) := Σ_{k=1}^{n−1} p_k X_k X_{k+1} + Σ_{k=1}^{n} q_k X_k contain only commuting terms and so can be readily exponentiated.
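This setup can be reproduced for small systems by direct matrix exponentiation. The sketch below (NumPy; helper names are ours, and the ordering of the H_A/H_B factors in the symmetric Trotter step reflects our reading of the ansatz) builds H(p*, q*, r*) for n = 4 and compares the L = 2 Trotterized evolution at t = 0.1 against the exact one:

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0]).astype(complex)
X = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)

def op(single, k, n):
    """Embed a single-qubit operator at site k of an n-qubit chain."""
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, single if i == k else I2)
    return out

def build(n, p, q, r):
    """H_A(r) = sum ZZ + sum r_k Z_k ; H_B(p,q) = sum p_k XX + sum q_k X_k."""
    HA = sum(op(Z, k, n) @ op(Z, k + 1, n) for k in range(n - 1))
    HA = HA + sum(r[k] * op(Z, k, n) for k in range(n))
    HB = sum(p[k] * op(X, k, n) @ op(X, k + 1, n) for k in range(n - 1))
    HB = HB + sum(q[k] * op(X, k, n) for k in range(n))
    return HA, HB

def expmih(H, t):
    """exp(-i H t) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * w * t)) @ V.conj().T

n = 4
k = np.arange(1, n + 1)
p = np.sin(np.pi * k[: n - 1] / (2 * n))  # p*_k = sin(pi k / 2n), k = 1..n-1
q = np.sin(np.pi * k / n)                 # q*_k = sin(pi k / n)
r = np.cos(np.pi * k / n)                 # r*_k = cos(pi k / n)

HA, HB = build(n, p, q, r)
t, L = 0.1, 2
dt = t / L
# Second-order Trotter step: e^{-i HA dt/2} e^{-i HB dt} e^{-i HA dt/2}.
step = expmih(HA, dt / 2) @ expmih(HB, dt) @ expmih(HA, dt / 2)
VL = np.linalg.matrix_power(step, L)
exact = expmih(HA + HB, t)
err = np.linalg.norm(VL - exact, 2)
# For this short evolution the Trotter error O(t^3 / L^2) is already tiny.
```

At the ideal parameters the residual "err" is the Trotter error floor; the learning task is to recover (p*, q*, r*) by driving the cost towards this floor.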
We attempt to learn the vectors p*, q*, and r* by comparing e^{−iH(p*,q*,r*)t}|ψ_j⟩ and V_L(p, q, r)|ψ_j⟩ over N random product states |ψ_j⟩. To do so, we use the training data D_Q(N) with Q = S_{Haar_1^⊗n} and the cost function given in Eq. (4). The learning is performed classically for n = 4, …, 12 and L = 2, …, 5, and we take the total evolution time to be t = 0.1. For all values of n we train on two product states, i.e., N = 2. We repeated the optimization 5 times in each case and kept the best run. While the small training data size N = 2 was sufficient for the model considered here, in Appendix C 2 we present a more involved unitary learning setting that requires larger values of N.

Fig. 3 plots the in-distribution risk and out-of-distribution risk as a function of the final optimized cost function values C_{D_Q(2)}(α_opt) with Q = S_{Haar_1^⊗n}. Here the in-distribution risk is the average prediction error over random product states, i.e., R_{S_{Haar_1^⊗n}}, and for the out-of-distribution testing we chose to compute the risk over the global Haar distribution, i.e., R_{S_{Haar_n}}. These risks can be evaluated analytically using Lemma B.3 and Eqs. (B1), (B2). The linear correlation between the cost function and both R_{S_{Haar_1^⊗n}} and R_{S_{Haar_n}} demonstrates that both in-distribution and out-of-distribution generalization have been successfully achieved.
Next, we perform noisy simulations to assess the performance of learning the parameters of the Hamiltonian in Eq. (12) in two situations: (i) the training is performed on random product states and (ii) the training data is prepared with deep quantum circuits. We expect that the presence of noise will have a different impact depending on the amount of noise that is accumulated during the preparation of the training states.
Our simulations used a realistic noise model based on gate-set tomography of the IBM Ourense superconducting qubit device [59], but with the experimentally obtained error rates reduced by a factor of 20 to make the difference in training more pronounced. The training set is constructed from just two states (either product states or states prepared with a linear-depth hardware-efficient circuit). The optimizer is a version of the gradient-free Nelder-Mead method [60]. The cost function in Eq. (4) is computed with an increasing number of shots, starting with 10 shots per cost function evaluation. That number is increased by 50% once the optimizer detects a lack of progress within a specified number of iterations. This optimization procedure is sensitive to the flatness of the cost function landscape: the flatter the landscape, the more shots are needed to resolve it and find a minimizing direction.

Fig. 4 shows the results of the training procedure performed on an n = 6 qubit system. Here, we train the L = 2 ansatz for V_L(p, q, r) and consider total evolution time t = 0.1. The optimization is repeated 20 times, each time starting from a different random initial point (p_0, q_0, r_0). Red (blue) lines indicate the risk obtained for product (deep circuit) training states as a function of the total number of shots.
Training with product states is successful: once the number of shots per cost function evaluation is large enough (total shots above 10³), the optimizer detects the downhill direction and the in-distribution risk gradually decreases, eventually reaching 10⁻³. The out-of-distribution risk closely follows the in-distribution risk, demonstrating that generalization can be achieved with product training states under realistic noise and a finite shot budget. In contrast, the training set built with deep circuits fails to produce successful training in all 20 optimization runs. Even in the limit of a very large number of shots, both the in-distribution and out-of-distribution risks remain large. This proof-of-principle numerical experiment shows that our out-of-distribution generalization guarantees can make training and learning feasible in noisier scenarios than otherwise viable.

III. DISCUSSION
Our work establishes that, for learning unitaries, QNNs trained on quantum data enjoy out-of-distribution generalization between some physically relevant distributions if the training data size is roughly the number of trainable gates. The class of locally scrambled distributions for which our results hold falls naturally into sub-classes of training ensembles and testing ensembles, characterized by their practicality and generality, respectively. The simplest training ensemble in this context is the ensemble of products of stabilizer states. Our results show that training on this easy-to-prepare and easy-to-classically-simulate ensemble generalizes to the uniform Haar ensemble of states, as well as to practically motivated ensembles such as the outputs of random circuits. Thus, somewhat surprisingly, we have shown that the action of quantum unitaries can be predicted on a wide class of highly entangled states, having only observed their action on relatively few unentangled states.
These results have implications for the practicality of learning quantum dynamics. We are particularly intrigued by the possibility of using quantum hardware or experimental systems to characterize the unknown dynamics of quantum experimental systems. This could be done by coherently interacting a quantum system with a quantum computer, or alternatively could be conducted in a more conventional experimental setup. We stress that, for the latter, the experimental setup may not be equipped with a complete gate set, and so our proof that learning can be done using only products of random single-qubit states, which require only simple single-qubit gates to prepare, is particularly important.
We are also interested in the potential of these results to ease the classical compilation of local short-time evolutions into shorter depth circuits [30] and circuits of a particular desired structure [34,35]. Since low-entangling unitaries and product states may be classically simulated using tensor network methods, our results show that the compilation of such unitaries may be performed entirely classically. This could be used to develop more effective methods for dynamical simulation or to learn more efficient pulse sequences for noise resilient gate implementations.
An immediate extension of our results would be to investigate whether our proof techniques can be used to more efficiently evaluate Haar integrals, or more generally to relate averages over different locally scrambled ensembles in other settings. For example, one might explore whether they could be used in a DQC1 (Deterministic Quantum Computation with 1 clean qubit) setting where one inputs a maximally mixed state [61]. Alternatively, one might investigate whether they could be used to bound the frame potential of an ensemble, an important quantity for evaluating the randomness of a distribution that has links with quantifying chaotic behavior [62].
In this paper we have focused on the learning of quantum dynamics, in particular the learning of unitaries, using locally scrambled distributions. Given recent progress on different quantum channel learning questions [63][64][65][66][67][68][69][70][71][72][73], it is natural to ask whether out-of-distribution generalization is possible for other QML tasks such as learning quantum channels [74] or, more generally, for performing classification tasks such as classifying phases of matter [20, 75,76]. It would further be valuable to investigate whether out-of-distribution generalization is viable for other classes of distributions. Such results, if obtainable, would again have important implications for the practicality of QML on near term hardware and restricted experimental settings.
Our approach to out-of-distribution generalization does not rely on specific learning algorithms, nor transfer learning techniques, as is often the case in the classical literature [45][46][47][48][49]. Rather, we establish generalization guarantees that apply to a specific QML task (learning quantum dynamics) with data coming from a specific class of distributions (locally scrambled ensembles). That is, we show that in this context, out-of-distribution generalization is essentially automatic. In the classical ML literature, a similar-in-spirit focus on properties of the class of distributions of interest can for example be seen in the concepts of invariance [77,78] and variation [79] of features, but the nature of these properties is still quite different from the ones that we consider. Nevertheless, we hope that combining such perspectives from classical ML theory with physics-informed choices of distributions, as in our case, will lead to a better understanding of outof-distribution generalization.

IV. METHODS
In this section, we give an overview of the proof strategy leading to our central analytical result, Lemma 1. At a high level, the proof boils down to rewriting R_{S_Haar_n}(α) and R_Q(α), with Q locally scrambled, into forms that can be compared via known and newly derived inequalities.
First, we recast the Haar risk R_{S_Haar_n}(α) as an average over Pauli products and upper bound it by a risk over local stabilizer states. To do so, we rewrite the Haar risk by recalling the relationship between the (Haar) average gate fidelity of two unitaries U and V and their Hilbert-Schmidt inner product [80]. Next, we use the Pauli basis expansion of the swap operator, SWAP = (1/d) Σ_P P ⊗ P, to write the Haar risk as an average over Pauli operators, as shown explicitly in Lemma B.1. This average over Pauli observables can then be upper bounded by an average over products of stabilizer states by introducing a spectral decomposition, as detailed in Lemma B.2 and Corollary B.1. Finally, by the 2-design property of random single-qubit stabilizer states, we can rewrite this upper bound in terms of a local Haar average, Eq. (17). The latter can then be related to R_Q(α) because Q is locally scrambled, which leads to the first inequality in Lemma 1. Here, the choice to bound specifically by Haar₁^{⊗n} hints at our final result that a unitary can be learnt with respect to the Haar average from product state training data.
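As a quick numerical sanity check of these two ingredients (illustrative only; this is not the code used for the paper's numerics), the Pauli expansion of the swap operator and the resulting Pauli-twirl identity E_P[Tr(P W P W†)] = |Tr(W)|²/d can be verified directly for small systems:

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]

# Pauli expansion of the swap operator on one qubit: SWAP = (1/2) sum_P P (x) P
swap = np.zeros((4, 4), dtype=complex)
swap[0, 0] = swap[3, 3] = swap[1, 2] = swap[2, 1] = 1
assert np.allclose(sum(np.kron(P, P) for P in PAULIS) / 2, swap)

def haar_unitary(d, rng):
    # Haar-random unitary via QR decomposition of a complex Ginibre matrix
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

rng = np.random.default_rng(0)
n, d = 2, 4
W = haar_unitary(d, rng)
# Pauli-twirl identity behind Lemma B.1: E_P Tr(P W P W^dag) = |Tr W|^2 / d
avg = np.mean([np.trace(np.kron(P1, P2) @ W @ np.kron(P1, P2) @ W.conj().T)
               for P1, P2 in itertools.product(PAULIS, repeat=2)])
assert np.isclose(avg.real, abs(np.trace(W)) ** 2 / d)
```

Both checks are exact (finite sums over Pauli strings), so no sampling tolerance is needed.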
Second, we recast the generic locally scrambled risk R_Q as a sum of locally scrambled expectation values over different partitions of the system. Specifically, using a well-known expression for the complex second moment of the single-qubit Haar measure (see, e.g., Eq. (2.26) in [62]), we obtain an expression for R_Q(α) in terms of an average over unitaries Ũ drawn from the locally scrambled unitary ensemble generating Q, i.e., with Q = Ũ|0⟩^{⊗n}. See Lemma B.3 for more details. From here, we use matrix-analytic inequalities to lower bound the Frobenius norm of a partial trace of a matrix in terms of the absolute value of the trace of the original matrix. Plugging this lower bound into the explicit expression for R_Q(α) translates exactly into the second inequality in Lemma 1.

DATA AVAILABILITY
The data generated and analyzed during the current study are available from the authors upon request.

CODE AVAILABILITY
Further implementation details are available from the authors upon request.

Appendix A: The Learning Framework

Before beginning our main discussion, we introduce some notation that will be used throughout the Supplementary Material. We consider systems consisting of n qubits and thus work with the complex Hilbert space (C^2)^⊗n of dimension d = 2^n. For any d, B(C^d) denotes the set of bounded linear operators on C^d, which we implicitly identify with the set of d × d matrices by fixing a basis whenever convenient. Also, we denote by U(C^d) the set of unitary operators on C^d; the sets B((C^2)^⊗n) and U((C^2)^⊗n) are defined analogously. Finally, we use standard bra-ket notation for pure quantum states.
We consider the task of learning an unknown n-qubit unitary U ∈ U((C^2)^⊗n) from pairs of input and output states using a quantum neural network (QNN). For our purposes, we think of a QNN as a (possibly variable-structure) k-local quantum circuit on n qubits that contains tunable gates, where k is an n-independent constant. Mathematically, we describe such a QNN by a parameterized n-qubit unitary V(α) with classical parameters α, where the parameterization arises from the QNN structure. The parameter vector α can consist of both continuous parameters (which parameterize the trainable gates, e.g., as rotation angles) and discrete parameters (which encode freedom in the chosen quantum circuit structure, e.g., the number of trainable gates). The input to the quantum learning procedure is a training data set of the form

D_Q(N) = {(|Ψ^(j)⟩, |Φ^(j)⟩)}_{j=1}^N ,    (A1)

where the |Ψ^(j)⟩ ∈ (C^2)^⊗n are pure n-qubit input states drawn i.i.d. from a probability distribution Q and |Φ^(j)⟩ = U|Ψ^(j)⟩ are the corresponding output states. The goal is to train the classical parameters α in the QNN V(α) such that the QNN V(α_opt) with the optimized parameters α_opt predicts the output states of U well on average when the input states are drawn from a testing probability distribution P over pure n-qubit states. That is, the optimized parameters α_opt should be such that the expected testing risk

R_P(U, V(α_opt)) = E_{|Ψ⟩∼P}[ (1/4) ‖ U|Ψ⟩⟨Ψ|U† − V(α_opt)|Ψ⟩⟨Ψ|V(α_opt)† ‖₁² ]    (A3)

is small. A learner who does not know the testing distribution P and the target unitary U cannot evaluate the expected testing risk from Eq. (A3). Instead, given a training data set as in Eq. (A1), the learner may try to evaluate and optimize the training cost

C_{D_Q(N)}(U, V(α)) = (1/N) Σ_{j=1}^N (1/4) ‖ |Φ^(j)⟩⟨Φ^(j)| − V(α)|Ψ^(j)⟩⟨Ψ^(j)|V(α)† ‖₁² = 1 − (1/N) Σ_{j=1}^N |⟨Φ^(j)|V(α)|Ψ^(j)⟩|² .    (A5)

Here, we have rewritten the trace norm distance between two pure states in terms of their fidelity to obtain an expression for the training cost that can be evaluated on a quantum computer with a swap test [43,44].
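The fidelity form of the training cost in Eq. (A5) takes only a few lines of linear algebra to render classically. The following sketch (a classical simulation, not the swap-test circuit itself; all helper names are ours) checks that the cost vanishes when the hypothesis reproduces the target:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def haar_state(d, rng):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def training_cost(pairs, V):
    # C_D(U, V) = 1 - (1/N) sum_j |<Phi_j| V |Psi_j>|^2: the fidelity form
    # of Eq. (A5) that a swap test would estimate on hardware
    return 1 - np.mean([abs(np.vdot(phi, V @ psi)) ** 2 for psi, phi in pairs])

# Toy target unitary U and training pairs (|Psi_j>, U|Psi_j>)
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
U, _ = np.linalg.qr(A)
pairs = []
for _ in range(10):
    psi = haar_state(d, rng)
    pairs.append((psi, U @ psi))

cost_perfect = training_cost(pairs, U)          # hypothesis V = U
cost_identity = training_cost(pairs, np.eye(d))  # a poor hypothesis
assert np.isclose(cost_perfect, 0)
assert cost_identity > 0.1
```

On hardware, each fidelity term would be estimated statistically from swap-test outcomes rather than computed exactly.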
Optimizing the training cost from Eq. (A5), however, is not automatically a promising avenue towards achieving a small expected testing risk from Eq. (A3). Such a promise can only be fulfilled when a good performance on the available training data, i.e., a small value of C_{D_Q(N)}(U, V(α_opt)), also leads to a good average performance on previously unseen data points, i.e., to a small value of R_P(U, V(α_opt)). This ability to generalize from training data to unseen data is of central importance to the viability of (quantum) machine learning. In particular, the generalization behavior often has a determining influence on the amount of training data that a (quantum) machine learning model requires.
For the case of Q = P, when training and testing data are drawn i.i.d. from the same distribution, such questions can be studied in the standard framework of in-distribution generalization (sometimes also known as weak generalization).
Here, however, we focus on the case Q ≠ P, when the QNN is trained on a distribution different from the testing distribution. This scenario is variously known as out-of-distribution generalization or strong generalization. More precisely, we will consider training and testing states coming from (different) locally scrambled ensembles [38, 39].
Definition A.1 (Locally scrambled ensembles - Restatement of Definition 1). An ensemble U of n-qubit unitaries is called locally scrambled if it is invariant under preprocessing by tensor products of arbitrary local unitaries. That is, if U ∼ U, then for any fixed U_1, …, U_n ∈ U(C^2) also U(⊗_{i=1}^n U_i) ∼ U. Accordingly, an ensemble S of n-qubit quantum states is called locally scrambled if it is of the form S = U|0⟩^{⊗n} for some locally scrambled ensemble U of n-qubit unitaries.
We use U LS to denote the class of all locally scrambled unitary ensembles and S LS to denote the class of all locally scrambled state ensembles. (Here, we suppress the number of qubits in the notation in favor of improved readability.) As discussed in more detail below, examples of locally scrambled unitary ensembles include Haar-random n-qubit unitaries, tensor products of Haar-random single-qubit unitaries followed by some fixed n-qubit unitary, and unitaries implemented by random quantum circuits of some fixed depth, among others.
In fact, our results for locally scrambled ensembles below immediately extend to a slightly broader class of ensembles:

Definition A.2. An ensemble U of n-qubit unitaries is called locally scrambled up to (and including) complex second moments if there exists a locally scrambled ensemble Ũ of n-qubit unitaries such that the complex first and second moments of U agree with those of Ũ. Accordingly, an ensemble S of n-qubit quantum states is called locally scrambled up to (and including) complex second moments if it is of the form S = U|0⟩^{⊗n} for an ensemble U of n-qubit unitaries that is locally scrambled up to (and including) complex second moments. We use U^(2)_LS and S^(2)_LS to denote the corresponding classes of unitary and state ensembles, respectively.

Example A.7 (Output states of random quantum circuits, S^{A_k}_RandCirc). Let A_k be a k-local n-qubit quantum circuit architecture (in which every qubit is acted on non-trivially), with k ≤ n. Let U_{A_k} denote the ensemble of n-qubit unitaries obtained by drawing every k-qubit unitary in A_k at random from the k-qubit Haar measure. Then, by right-invariance of the k-qubit Haar measure, U_{A_k} ∈ U_LS. Accordingly, the ensemble S^{A_k}_RandCirc := U_{A_k}|0⟩^{⊗n} of output states of a random quantum circuit with architecture A_k satisfies S^{A_k}_RandCirc ∈ S_LS.
We begin our analysis by comparing the testing risks obtained from Eq. (A3) for different locally scrambled testing distributions. In this subsection, we show that all such testing risks are equivalent in the sense that they differ by at most a constant factor. We prove this equivalence by showing that all locally scrambled risks R_P(U, V) are tightly related to the Hilbert-Schmidt inner product between U and V. To formalize this discussion, we first introduce a cost arising naturally from that inner product:

Definition B.1 (Hilbert-Schmidt test cost). The Hilbert-Schmidt test (HST) cost between two n-qubit unitaries U ∈ U((C^2)^⊗n) and V ∈ U((C^2)^⊗n) is defined as C_HST(U, V) = 1 − |Tr(U†V)|²/d², where d = 2^n.

At this point, we note that, as shown in Refs. [30,80], we can view the HST cost as an expected testing risk as in Eq. (A3) via

C_HST(U, V) = ((d+1)/d) R_{S_Haar_n}(U, V),    (B2)

where d = 2^n. (Here, the notation R_{S_Haar_n}(U, V) indicates an expected testing risk w.r.t. the Haar ensemble from Example A.4.) In the main text, for conciseness of presentation, we have used the right-hand side of Eq. (B2) instead of its left-hand side.

Our first result is an expression for the squared absolute value of the Hilbert-Schmidt inner product between two matrices (which in particular gives an expression for the HST cost between two unitaries) in terms of an average over n-qubit Paulis:

Lemma B.1. Let n ∈ N and write d = 2^n. Then, for any A, B ∈ B((C^2)^⊗n), |Tr(A†B)|² = d · E_{P∼{1,X,Y,Z}^{⊗n}}[ Tr(P A†B P B†A) ]. In particular, with the shorthand W = U†V, we can express the HST cost between two n-qubit unitaries U ∈ U((C^2)^⊗n) and V ∈ U((C^2)^⊗n) as

C_HST(U, V) = 1 − (1/d) E_{P∼{1,X,Y,Z}^{⊗n}}[ Tr(P W† P W) ].

Here, the first step of the proof is due to our shorthand, the second uses |z|² = z z̄ for z ∈ C, and the third uses the Pauli basis expansion of the swap operator.

Next, we present a technical lemma which we later use to control the Pauli average in the expression for the HST cost between two unitaries:

Lemma B.2. Let n ∈ N and write d = 2^n. Let W ∈ U((C^2)^⊗n), let P ∈ {1, X, Y, Z}^{⊗n}, and let |s⟩ ∈ {|0⟩, |1⟩, |+⟩, |−⟩, |y+⟩, |y−⟩}^{⊗n} be an eigenvector of P. Then,

0 ≤ 1 − ⟨s|P|s⟩ · ⟨s|W†PW|s⟩ ≤ 2 (1 − |⟨s|W|s⟩|²).    (B12)

Proof. Let |s_1⟩ = |s⟩, |s_2⟩, . . .
, |s_d⟩ be an orthonormal basis consisting of eigenvectors of P. Plugging the spectral decomposition P = Σ_{i=1}^d ⟨s_i|P|s_i⟩ · |s_i⟩⟨s_i| into ⟨s|W†PW|s⟩ and using that the eigenvalues of P lie in {−1, 1}, we get

1 − ⟨s|P|s⟩ · ⟨s|W†PW|s⟩ = 1 − Σ_{i=1}^d ⟨s|P|s⟩⟨s_i|P|s_i⟩ · p_i , where p_i := |⟨s_i|W|s⟩|² .    (B15)

With this notation, we have 0 ≤ p_i ≤ 1 for all 1 ≤ i ≤ d, where the upper bound holds by Cauchy-Schwarz and unitarity of W, and (p_1, …, p_d) is a probability vector of length d. Therefore, since each ⟨s|P|s⟩⟨s_i|P|s_i⟩ ∈ {−1, 1}, we conclude from Eq. (B15) that 1 − ⟨s|P|s⟩ · ⟨s|W†PW|s⟩ ≥ 1 − Σ_i p_i = 0. Moreover, since |s_1⟩ = |s⟩ implies ⟨s|P|s⟩⟨s_1|P|s_1⟩ = 1, we obtain

1 − ⟨s|P|s⟩ · ⟨s|W†PW|s⟩ ≤ 1 − p_1 + Σ_{i=2}^d p_i = 2(1 − p_1) = 2(1 − |⟨s|W|s⟩|²),

as claimed.
We emphasize that this proof, in contrast to those of Lemma B.1 (and also Lemma B.3 below), explicitly uses the unitarity of the matrix W . This is also why we assume unitarity of U and V in Lemma B.4 below, since we will again consider W = U † V .
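Eq. (B2), i.e., C_HST(U, V) = ((d+1)/d) R_{S_Haar_n}(U, V), can be probed by Monte Carlo over Haar-random states. The following illustrative numpy sketch (our own helper names, not the paper's code) does so for n = 3:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
d = 2 ** n

# Stand-in for W = U^dag V: a random unitary from a QR decomposition
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
W, _ = np.linalg.qr(A)
c_hst = 1 - abs(np.trace(W)) ** 2 / d ** 2

# Monte Carlo estimate of the Haar risk R = 1 - E_psi |<psi|W|psi>|^2
m = 200_000
S = rng.normal(size=(m, d)) + 1j * rng.normal(size=(m, d))
S /= np.linalg.norm(S, axis=1, keepdims=True)   # Haar-random pure states
overlaps = np.einsum('sd,de,se->s', S.conj(), W, S)
r_haar = 1 - np.mean(np.abs(overlaps) ** 2)

# Eq. (B2): C_HST = ((d+1)/d) * R_Haar, up to sampling error
assert abs((d + 1) / d * r_haar - c_hst) < 5e-3
```

The tolerance accounts for Monte Carlo fluctuations; with 2 × 10^5 samples the statistical error is well below it.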
Lemmas B.1 and B.2 allow us to prove the following upper bound on the HST cost in terms of an average over tensor products of Haar-random single-qubit states:

Corollary B.1. Let n ∈ N and write d = 2^n. Let U, V, Ũ ∈ U((C^2)^⊗n). Then,

C_HST(U, V) ≤ 2 ( 1 − E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ] ),

where we again use the shorthand W = U†V.
Proof. We begin with the expression for the HST cost in terms of a Pauli average derived in Lemma B.1. For any fixed Ũ ∈ U((C^2)^⊗n), since |Tr(Ũ†WŨ)| = |Tr(W)|, the definition of C_HST(U, V) and Lemma B.1 imply

C_HST(U, V) = 1 − (1/d) E_{P∼{1,X,Y,Z}^{⊗n}}[ Tr(P (Ũ†WŨ)† P (Ũ†WŨ)) ].

Now, we can consider a spectral decomposition of P ∈ {1, X, Y, Z}^{⊗n}, which, since we are dealing with Pauli strings, can be taken over tensor products of single-qubit stabilizer states. Plugging this spectral decomposition into the above expression to evaluate the trace, we obtain an average over pairs of Pauli strings and their eigenvectors. Here, we denote by E_{P∼{1,X,Y,Z}^{⊗n}} the expectation over uniformly random Pauli strings of length n, and E_{|s⟩∼{|0⟩,|1⟩,|+⟩,|−⟩,|y+⟩,|y−⟩}^{⊗n} : ⟨s|P|s⟩≠0} denotes the expectation over uniformly random tensor products of single-qubit stabilizer states which have non-vanishing overlap with P; equivalently, the latter is the expectation over uniformly random eigenvectors of P. Similarly, E_{|s⟩∼{|0⟩,|1⟩,|+⟩,|−⟩,|y+⟩,|y−⟩}^{⊗n}} denotes the expectation over uniformly random tensor products of single-qubit stabilizer states, and E_{P∼{1,X,Y,Z}^{⊗n} : ⟨s|P|s⟩≠0} denotes the expectation over uniformly random Pauli strings of length n that have non-vanishing overlap with |s⟩; equivalently, the latter is the expectation over uniformly random Pauli strings that have |s⟩ as an eigenvector. Note that the expectation values involved here are w.r.t. uniform distributions over finite sets, which in particular justifies exchanging the order of the two expectations (since this is a mere reordering of a finite sum). Plugging in the upper bound of Lemma B.2, applied for the unitary Ũ†WŨ, we further obtain

C_HST(U, V) ≤ 2 ( 1 − E_{|s⟩∼{|0⟩,|1⟩,|+⟩,|−⟩,|y+⟩,|y−⟩}^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ] ) = 2 ( 1 − E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ] ),

where the last equality uses that single-qubit stabilizer states form a 2-design (compare, e.g., [85], or see [86-88] for a stronger statement).
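The 2-design property of the six single-qubit stabilizer states invoked in the last step can be confirmed exactly: their second moment matches the single-qubit Haar second moment (1 ⊗ 1 + SWAP)/6. A small check (illustrative, with our own variable names):

```python
import numpy as np

s2 = 1 / np.sqrt(2)
stab_states = [
    np.array([1, 0], dtype=complex),          # |0>
    np.array([0, 1], dtype=complex),          # |1>
    np.array([s2, s2], dtype=complex),        # |+>
    np.array([s2, -s2], dtype=complex),       # |->
    np.array([s2, 1j * s2], dtype=complex),   # |y+>
    np.array([s2, -1j * s2], dtype=complex),  # |y->
]

swap = np.zeros((4, 4), dtype=complex)
swap[0, 0] = swap[3, 3] = swap[1, 2] = swap[2, 1] = 1

# Second moment of the uniform stabilizer ensemble equals the single-qubit
# Haar second moment (I (x) I + SWAP) / 6: the 2-design property
moment = sum(np.kron(np.outer(s, s.conj()), np.outer(s, s.conj()))
             for s in stab_states) / 6
assert np.allclose(moment, (np.eye(4) + swap) / 6)
```

Since both sides are exact finite expressions, the equality holds to machine precision.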
To facilitate the comparison between the HST cost and a locally scrambled risk, we next show how to rewrite a general locally scrambled risk:

Lemma B.3. Let n ∈ N. Let P be a locally scrambled ensemble of n-qubit quantum states, with U_test the corresponding locally scrambled unitary ensemble. Then, for any n-qubit unitaries U and V, using the shorthand W = U†V,

R_P(U, V) = 1 − (1/6^n) Σ_{A⊆{1,…,n}} E_{Ũ∼U_test}[ ‖ Tr_{A^c}(Ũ†WŨ) ‖_F² ],

where Tr_{A^c} denotes the partial trace over all systems with index not in the set A and ‖·‖_F denotes the Frobenius norm (which is the norm induced by the Hilbert-Schmidt inner product).
Proof. Throughout the proof, we use the shorthand W = U†V. We begin by noticing that, since U_test is locally scrambled, for any fixed U_1, …, U_n ∈ U(C^2), we have

R_P(U, V) = 1 − E_{Ũ∼U_test}[ |⟨0|^{⊗n} (⊗_i U_i)† Ũ†WŨ (⊗_i U_i) |0⟩^{⊗n}|² ].

If we now take an expectation over tensor products of Haar-random single-qubit unitaries ⊗_{i=1}^n U_i ∼ Haar₁^{⊗n}, this yields

R_P(U, V) = 1 − E_{Ũ∼U_test} E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ],

where we used that we can exchange the order of the expectation values by Tonelli's theorem, since the integrand is non-negative, and then slightly abused notation by using Haar₁^{⊗n} to also denote the probability distribution describing a tensor product of Haar-random single-qubit states.
Next, we recall the following Haar identity (see, e.g., Eq. (2.26) in [62]):

E_{u∼Haar₁}[ (u|0⟩⟨0|u†) ⊗ (u|0⟩⟨0|u†) ] = (1 ⊗ 1 + SWAP)/6,

where SWAP is the swap operator between two qubits and 1 denotes the identity matrix on a single-qubit system. Using this identity and the trace equality Tr[SWAP (A ⊗ B)] = Tr[AB], we can rewrite, for any fixed Ũ ∈ U((C^2)^⊗n), abbreviating M = Ũ†WŨ,

E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|M|s⟩|² ] = Tr[ (M ⊗ M†) · (1/6^n) Σ_{A⊆{1,…,n}} SWAP_{A,A} ].

Here, SWAP_{A,A} = ⊗_{i∈A} SWAP_{i,i}, and SWAP_{i,i} denotes the single-qubit swap operator that acts on the i-th qubits of the first and second n-qubit tensor factors, respectively. Next, we use that the swap operator is its own inverse and the definition of the partial trace to continue the computation: for each A ⊆ {1,…,n},

Tr[ (M ⊗ M†) SWAP_{A,A} ] = Tr_{A,A}[ Tr_{A^c,A^c}(M ⊗ M†) SWAP_{A,A} ] = Tr[ Tr_{A^c}(M) Tr_{A^c}(M†) ] = ‖ Tr_{A^c}(M) ‖_F² .

In the first step, we have used Tr_{A^c,A^c} to denote the partial trace over the A^c-subsystems of both the first and the second n-qubit tensor factors. Combining the last two displays, we obtain

E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|M|s⟩|² ] = (1/6^n) Σ_{A⊆{1,…,n}} ‖ Tr_{A^c}(M) ‖_F² .

Plugging this observation into Eq. (B36) finishes the proof.
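The resulting identity, E_{|s⟩∼Haar₁^{⊗n}}|⟨s|M|s⟩|² = (1/6^n) Σ_A ‖Tr_{A^c}(M)‖_F², can be verified by Monte Carlo for n = 2 (an illustrative sketch with our own helper names):

```python
import numpy as np

rng = np.random.default_rng(3)

def haar_qubit(rng):
    # Haar-random single-qubit unitary via QR of a Ginibre matrix
    G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    Q, R = np.linalg.qr(G)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

# Random 2-qubit unitary M standing in for U_test^dag W U_test
G = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
M, _ = np.linalg.qr(G)
Mr = M.reshape(2, 2, 2, 2)
fro2 = lambda Xm: np.sum(np.abs(Xm) ** 2)

# Right-hand side: (1/6^n) sum over subsets A of ||Tr_{A^c} M||_F^2, n = 2
rhs = (abs(np.trace(M)) ** 2                # A = {}
       + fro2(np.einsum('abcb->ac', Mr))    # A = {1}: trace out qubit 2
       + fro2(np.einsum('abad->bd', Mr))    # A = {2}: trace out qubit 1
       + fro2(M)) / 36                      # A = {1, 2}

# Left-hand side by Monte Carlo over products of Haar single-qubit unitaries
vals = []
for _ in range(100_000):
    u = np.kron(haar_qubit(rng), haar_qubit(rng))
    psi = u[:, 0]                           # the state u |00>
    vals.append(abs(np.vdot(psi, M @ psi)) ** 2)
assert abs(np.mean(vals) - rhs) < 0.01
```

The tolerance covers the sampling error of the left-hand side; the right-hand side is exact.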
As an aside, note that the proof of Lemma B.3 does not use the unitarity of U and V. That is, Lemma B.3, just like Lemma B.1, is valid for arbitrary U, V ∈ B(C^d). However, Lemma B.2, Corollary B.1, and the results from here on make use of the unitarity assumption.

Now, we are ready to combine the tools developed so far to establish the main technical result of this subsection. We show that all locally scrambled risks are equivalent to the HST cost:

Lemma B.4 (Restatement of Lemma 1). Let n ∈ N and write d = 2^n. Let P be a locally scrambled ensemble of n-qubit quantum states, with U_test the corresponding locally scrambled unitary ensemble. Then, for any n-qubit unitaries U ∈ U((C^2)^⊗n) and V ∈ U((C^2)^⊗n),

(1/2) C_HST(U, V) ≤ R_P(U, V) ≤ C_HST(U, V).

Proof. Throughout the proof, we use the shorthand W = U†V. We start with the proof of the first inequality, C_HST(U, V) ≤ 2 R_P(U, V). To this end, we apply Corollary B.1 to see that, for any Ũ ∈ U((C^2)^⊗n),

C_HST(U, V) ≤ 2 ( 1 − E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ] ).

Taking an expectation over Ũ ∼ U_test, we obtain

C_HST(U, V) ≤ 2 ( 1 − E_{Ũ∼U_test} E_{|s⟩∼Haar₁^{⊗n}}[ |⟨s|Ũ†WŨ|s⟩|² ] ) = 2 R_P(U, V),

where the last equality uses that U_test is locally scrambled and was already derived previously in Eq. (B35). This finishes the proof of the first inequality.
We now turn our attention to the second inequality, R_P(U, V) ≤ C_HST(U, V). To prove this inequality, we rely on the expression for R_P(U, V) derived in Lemma B.3 and lower bound each Frobenius norm term appearing there. So, let A ⊆ {1, …, n} be a subset of cardinality |A| = k. Then, for any fixed Ũ ∈ U(C^d), applying the Cauchy-Schwarz bound ‖X‖_F² ≥ |Tr(X)|²/2^k to the 2^k × 2^k matrix X = Tr_{A^c}(Ũ†WŨ), and noting that Tr[Tr_{A^c}(Ũ†WŨ)] = Tr(W), we get

‖ Tr_{A^c}(Ũ†WŨ) ‖_F² ≥ |Tr(W)|² / 2^k .

Summing over all subsets A ⊆ {1, …, n},

(1/6^n) Σ_{A⊆{1,…,n}} ‖ Tr_{A^c}(Ũ†WŨ) ‖_F² ≥ (|Tr(W)|²/6^n) Σ_{k=0}^n (n choose k) 2^{−k} = (|Tr(W)|²/6^n) (3/2)^n = |Tr(W)|²/d² ,

and therefore, by Lemma B.3, R_P(U, V) ≤ 1 − |Tr(W)|²/d² = C_HST(U, V). This is the second inequality that we set out to prove.
As an immediate consequence of Lemma B.4, since all locally scrambled risks are equivalent to the HST cost, we also see that all locally scrambled risks are equivalent to each other up to a constant multiplicative factor. That is, the following holds.
Theorem B.1 (Equivalence of locally scrambled ensembles for comparing unitaries - Restatement of Theorem 1). Let P and Q be two locally scrambled ensembles of n-qubit quantum states. Then, for any n-qubit unitaries U and V,

(1/2) R_Q(U, V) ≤ R_P(U, V) ≤ 2 R_Q(U, V).

Thus, for the purposes of comparing unitaries, all locally scrambled ensembles are in effect equivalent. In the next subsection, we combine this insight with known in-distribution generalization bounds for learning unitaries via QNNs to establish out-of-distribution generalization guarantees, if both the training and the testing distribution are locally scrambled. In particular, we will use Theorem B.1 to show that, if we train on input states coming from a "simple" locally scrambled ensemble, such as random product states (Example A.1), and have good in-distribution generalization there, then we generalize to any other "more complicated" locally scrambled ensemble, such as fully Haar-random (and thus highly entangled) states or output states of random circuits (Examples A.4 and A.7).
Remark B.1. By definition, the expected risk in Eq. (A3) depends only on the complex second moment of the testing distribution P. Therefore, we can directly extend Theorem B.1 beyond locally scrambled ensembles, i.e., elements of S_LS, to ensembles whose complex second moments agree with those of some locally scrambled ensemble, i.e., to elements of S^(2)_LS. As a concrete example (discussed also in Example A.2): single-qubit stabilizer states form a 2-design and tensor products of Haar-random single-qubit states are locally scrambled. Thus, the testing risk obtained by taking tensor products of random single-qubit stabilizer states for P in Eq. (A3) is also equivalent to any locally scrambled risk up to factors of 2.

Out-of-Distribution Generalization for QNNs Trained on Locally Scrambled States
In this subsection, we use our results of Subsection B 1 to strengthen in-distribution generalization bounds for learning unitaries via QNNs to out-of-distribution generalization bounds for the same task, where we allow for arbitrary locally scrambled training and testing distributions. The following corollary, where we denote the in-distribution generalization error by gen_{Q,D_Q(N)}(α) = R_Q(U, V(α)) − C_{D_Q(N)}(U, V(α)), serves as a general template for such a strengthening:

Corollary B.2 (Out-of-distribution generalization from in-distribution generalization in unitary learning - Restatement of Corollary 1). Let n ∈ N and write d = 2^n. Let Q and P be two locally scrambled ensembles of n-qubit quantum states. Let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN. For any parameter setting α, we have

R_P(U, V(α)) ≤ 2 C_{D_Q(N)}(U, V(α)) + 2 gen_{Q,D_Q(N)}(α).

Proof. As both Q and P are locally scrambled ensembles of n-qubit quantum states, Theorem B.1 yields R_P(U, V(α)) ≤ 2 R_Q(U, V(α)). Rewriting R_Q(U, V(α)) = C_{D_Q(N)}(U, V(α)) + gen_{Q,D_Q(N)}(α), which holds by the definition of the in-distribution generalization error, gives the statement of the corollary.
Corollary B.2 has the following implication for out-of-distribution generalization after training: When training the QNN V (α) with the cost C D Q (N ) using training data D Q (N ), the out-of-distribution testing risk R P (U, V (α opt )) of the parameter setting α opt after training is controlled in terms of the training cost and the in-distribution generalization error. Here, the only assumption on the training and testing distributions is that they are both locally scrambled.
To demonstrate the usefulness of Corollary B.2, we next show the concrete form it takes when combined with the QNN generalization guarantees of [20]:

Corollary B.3 (Locally scrambled out-of-distribution generalization for QNNs - Restatement of Corollary 2). Let n ∈ N and write d = 2^n. Let δ ∈ (0, 1). Let Q and P be two locally scrambled ensembles of n-qubit quantum states. Let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN with T parameterized local gates. When trained with the cost C_{D_Q(N)} using training data D_Q(N), the out-of-distribution testing risk w.r.t. P of the parameter setting α_opt after training satisfies

R_P(U, V(α_opt)) ≤ 2 C_{D_Q(N)}(U, V(α_opt)) + O( √(T log(T)/N) ) + O( √(log(1/δ)/N) )

with probability ≥ 1 − δ over the choice of training data of size N according to Q.

Proof. According to [20, Theorem 11], with probability ≥ 1 − δ over the choice of training data, the in-distribution generalization error in this setting is bounded as

gen_{Q,D_Q(N)}(α_opt) ≤ O( √(T log(T)/N) ) + O( √(log(1/δ)/N) ).

Plugging this in-distribution generalization error bound into Corollary B.2 yields the claim.
Corollary B.3 has the following implication for training data requirements: To ensure that, with high probability, the expected risk does not exceed twice the training cost by more than a specified accuracy ε, it suffices to have training data of size N ∼ T log(T)/ε². This sufficient training data size scales only slightly superlinearly in the number of trainable gates in the QNN.
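Treating the scaling N ∼ T log(T)/ε² as a heuristic (the constant prefactor is not fixed by the bound; we set it to 1 purely for illustration), one can tabulate sufficient data sizes:

```python
import math

def sufficient_training_size(T, eps, c=1.0):
    # Heuristic reading of N ~ T log(T) / eps^2; the constant c is NOT
    # specified by the bound and is chosen here only for illustration.
    return math.ceil(c * T * math.log(T) / eps ** 2)

n100 = sufficient_training_size(100, 0.1)
n200 = sufficient_training_size(200, 0.1)
# Doubling the gate count roughly doubles (up to the log factor) the
# sufficient amount of training data: slightly superlinear in T
assert n100 < n200 < 3 * n100
```

This is only an order-of-magnitude guide; the actual constants depend on the proof in [20].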
For the special case of the locally scrambled training ensemble being tensor products of Haar-random single-qubit states, with corresponding product state training data given as

D_{Haar₁^{⊗n}}(N) = {(|ψ^(j)⟩, U|ψ^(j)⟩)}_{j=1}^N , |ψ^(j)⟩ = ⊗_{i=1}^n |ψ_i^(j)⟩ ,    (B71)

where the |ψ_i^(j)⟩ are i.i.d. Haar-random single-qubit states, Corollary B.3 specializes as follows:

Corollary B.4 (Out-of-distribution generalization for QNNs trained on random product states). Let n ∈ N and write d = 2^n. Let δ ∈ (0, 1). Let P be a locally scrambled ensemble of n-qubit quantum states. Let U be an unknown n-qubit unitary. Let V(α) be an n-qubit unitary QNN with T parameterized local gates. When trained with the cost C_{D_{Haar₁^{⊗n}}(N)} using training data D_{Haar₁^{⊗n}}(N), the out-of-distribution risk w.r.t. P of the parameter setting α_opt after training satisfies

R_P(U, V(α_opt)) ≤ 2 C_{D_{Haar₁^{⊗n}}(N)}(U, V(α_opt)) + O( √(T log(T)/N) ) + O( √(log(1/δ)/N) )

with probability ≥ 1 − δ over the choice of training data of size N according to S_{Haar₁^{⊗n}}.
Notice that, following Remark B.1, we can simplify the training data even further and still achieve the same performance. Namely, if we train on tensor products of random single-qubit stabilizer states (instead of tensor products of Haar-random single-qubit states), then exactly the same out-of-distribution generalization bound as in Corollary B.4 holds.
Remark B.2. We can extend our results for out-of-distribution generalization when training on tensor products of Haar-random states to local variants of our risks and costs. Such local costs are essential to avoid cost-function-dependent barren plateaus [55] when training a shallow QNN, thereby facilitating optimization. As a concrete example, when taking S_{Haar₁^{⊗n}} from Example A.1 as the testing ensemble, we can consider a local expected testing risk and, for a training data set as in Eq. (B71), a local training cost, obtained by replacing the global state fidelities with averages of single-qubit fidelities of the reduced output states. Clearly, analogous local variants of expected risk and training cost can be defined whenever the respective ensemble has a tensor product structure. Among the examples presented in the main text, both S_{Haar₁^{⊗n}} from Example A.1 and S_{Stab₁^{⊗n}} from Example A.2 have that form. However, if the training data is highly entangled, constructing such local costs in this manner is not possible. Thus, the out-of-distribution generalization enjoyed by product state training data, as established by our proof, has the further important consequence of making such local costs available.
According to [30, Appendix C], the local and global HST costs satisfy C_LHST(U, V) ≤ C_HST(U, V) ≤ n · C_LHST(U, V). We can combine this with Theorem B.1 (via Lemma B.4) to obtain: If P is any locally scrambled ensemble of n-qubit quantum states, then for any n-qubit unitaries U and V, the risk R_P(U, V) is controlled by the local cost up to a factor linear in n, namely (1/2) C_LHST(U, V) ≤ R_P(U, V) ≤ n · C_LHST(U, V).

Remarks on the Role of Linearity
As outlined in the discussion between Corollary 2 and Corollary 3, linearity is important in enabling our out-of-distribution generalization. Intuitively, as long as the training states span the space on which one wishes to learn the action of the target unitary, it ought to be possible to train on those states and by linearity extrapolate to the entire space. However, this line of argument alone is insufficient to explain out-of-distribution generalization. The random ensembles of states also have to be "well-behaved" to ensure good generalization from a manageable number of training states.
One way of highlighting this subtlety is to note that even an exponential number of computational basis states cannot be used to learn an unknown unitary using a cost formulated in terms of the 1-norm distance between the guess output and true output (or, equivalently, the fidelity between the guess and true outputs). Namely, computational basis states do not allow one to learn relative phases. This can be illustrated by the following concrete example: Suppose the unknown unitary is the single-qubit unitary U = e^{−iϕZ} for some ϕ ∈ [0, 2π). That is, we consider

U = e^{−iϕZ} = diag(e^{−iϕ}, e^{iϕ}).

The action of U on the computational basis states is thus given by U|0⟩ = e^{−iϕ}|0⟩ and U|1⟩ = e^{iϕ}|1⟩, so all the relevant information lies in the relative phase between the two output states. As our notions of risk are (as is physically reasonable) independent of global phases in the output states, the unitary V = 1₂ achieves a perfect training cost on the training data set D = {(|0⟩, U|0⟩), (|1⟩, U|1⟩)}, namely

C_D(U, V) = 1 − (1/2)( |⟨0|U†V|0⟩|² + |⟨1|U†V|1⟩|² ) = 0,

and therefore, since the training data set in this case consists of exactly the two single-qubit computational basis states, a perfect expected testing risk over randomly drawn computational basis states. However, V clearly fails to capture the relative phase between U|0⟩ and U|1⟩. In particular, if we consider the testing risk over uniformly random states in the X-basis, we see that

R_X(U, V) = 1 − (1/2)( |⟨+|U†V|+⟩|² + |⟨−|U†V|−⟩|² ) = 1 − cos²(ϕ) = sin²(ϕ),

which is strictly bigger than zero whenever ϕ is not an integer multiple of π. Also, using Eq. (B2), we obtain

R_{S_Haar_1}(U, V) = (d/(d+1)) C_HST(U, V) = (2/3) sin²(ϕ),

which is likewise strictly bigger than zero whenever ϕ is not an integer multiple of π. Thus, despite perfect training error, perfect in-distribution generalization error, and perfect in-distribution testing error, the out-of-distribution generalization and testing errors can be non-zero. This single-qubit example shows that no analogue of Theorem 1 can hold without additional assumptions on shared properties between the training and testing ensembles (such as both being locally scrambled).
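The counterexample can be reproduced numerically in a few lines (ϕ = π/3 is chosen arbitrarily; the conclusion holds for any ϕ that is not a multiple of π):

```python
import numpy as np

phi = np.pi / 3
U = np.diag([np.exp(-1j * phi), np.exp(1j * phi)])   # U = e^{-i phi Z}
V = np.eye(2)                                        # hypothesis: identity

fid = lambda psi: abs(np.vdot(U @ psi, V @ psi)) ** 2  # |<psi|V^dag U|psi>|^2

# Perfect on the computational basis (global phases are unobservable) ...
e0, e1 = np.eye(2)
assert np.isclose(fid(e0), 1) and np.isclose(fid(e1), 1)

# ... but the X-basis risk 1 - cos^2(phi) = sin^2(phi) is nonzero,
plus = np.array([1, 1]) / np.sqrt(2)
minus = np.array([1, -1]) / np.sqrt(2)
risk_x = 1 - (fid(plus) + fid(minus)) / 2
assert np.isclose(risk_x, np.sin(phi) ** 2)

# as is the HST cost 1 - |Tr(U^dag V)|^2 / d^2 = sin^2(phi)
c_hst = 1 - abs(np.trace(U.conj().T @ V)) ** 2 / 4
assert np.isclose(c_hst, np.sin(phi) ** 2)
```

So the identity hypothesis passes every computational-basis test yet fails out of distribution, exactly as argued above.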
In particular, a good training and testing performance on randomly drawn computational basis states does not imply a good testing performance over (for example) random X-basis states or Haar-random states. While this counterexample to out-of-distribution generalization from training on computational basis states is specific to our (physically motivated) choice of cost function, the above argument still emphasizes that linearity alone does not trivially imply out-of-distribution generalization.
It is further worth stressing that an argument based on linearity places no guarantees on how many training states are required or sufficient for convergence. The argument that 'as long as the training states span the space on which you wish to learn the action of the target unitary, it ought to be possible to train on those states and by linearity extrapolate to the entire space' crucially only applies if you train on an exponentially large training ensemble. In general, how many states are required or sufficient to ensure good generalization will depend on the types of states in the training ensemble. In our work, we combine Theorem 1 with the recent in-distribution generalization bounds of [20] and thereby show that the worst-case training data requirements for random product states cannot be significantly worse than those for fully random states.
Appendix C: Additional Numerical Results

Numerical Test of Lemma 1
We numerically probe the validity of Lemma 1 for the NISQ-friendly scenario Q = S_{Haar₁^{⊗n}}, i.e., the ensemble of Haar-random product states. In this case, the relation between the average Haar-random product state cost and the general n-qubit Haar-random state cost according to Lemma 1 reads

((d+1)/d) R_{S_Haar_n}(U, V(α)) ≤ 2 R_{S_{Haar₁^{⊗n}}}(U, V(α)),    (C1)

where α contains all parameters that define the QNN. To make contact with NISQ applications, we probe the validity of Ineq. (C1) by sampling W = V†U(α) from random low-depth quantum circuits. Given W, we can evaluate ((d+1)/d) R_{S_Haar_n}(U, V(α)) from Eq. (B2) and R_{S_{Haar₁^{⊗n}}}(U, V(α)) from Lemma B.3 with straightforward matrix operations.
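For small n, both sides can be computed in closed form: 1 − |Tr W|²/d² for the rescaled Haar risk and the partial-trace expression of Lemma B.3 for the product-state risk. The inequality can thus be spot-checked over many random unitaries (illustrative sketch for n = 2, with our own helper names):

```python
import numpy as np

rng = np.random.default_rng(4)
fro2 = lambda Xm: np.sum(np.abs(Xm) ** 2)

def both_sides(W):
    # n = 2 qubits, d = 4.  Left side: C_HST = ((d+1)/d) R_Haar.
    d = 4
    c_hst = 1 - abs(np.trace(W)) ** 2 / d ** 2
    # Right side: product-state risk via the partial-trace formula (Lemma B.3)
    Wr = W.reshape(2, 2, 2, 2)
    r_prod = 1 - (abs(np.trace(W)) ** 2
                  + fro2(np.einsum('abcb->ac', Wr))
                  + fro2(np.einsum('abad->bd', Wr))
                  + fro2(W)) / 36
    return c_hst, r_prod

for _ in range(500):
    G = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    W, _ = np.linalg.qr(G)
    c_hst, r_prod = both_sides(W)
    # Ineq. (C1): rescaled Haar risk at most twice the product-state risk
    assert c_hst <= 2 * r_prod + 1e-12
```

In line with the figure discussed below, the bound is typically far from saturated for generic random unitaries.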
To ensure that we can quickly sample from a large range of cost values without the need for optimization, we implement an ansatz of the form W(α) = ∏_{k=1}^{l} ∏_i G_{ik}(α_{ik}). Here, each G_{ik} is an arbitrary 2-qubit gate, and the inner product over i represents a hardware-efficient tiling of these 2-qubit gates. That is, we apply ⌊n/2⌋ 2-qubit gates from even qubits to odd qubits (i.e., between (0,1), (2,3), etc.) in parallel and then ⌊(n−1)/2⌋ 2-qubit gates from odd to even (i.e., between (1,2), (3,4), etc.) in parallel. The outer product means that we apply l such layers. So far, we have just described a familiar hardware-efficient tiling of arbitrary 2-qubit gates used in many variational quantum algorithms [34,35,55], but there is one crucial difference between our implementation and the standard one. Rather than use the minimal 15 single-qubit gate and 3 CNOT gate decomposition of G [89,90] (also known as the KAK decomposition [91]), we use a slightly larger 21 single-qubit gate and 4 CNOT gate decomposition. Though our choice has more parameters than necessary, it is defined so that G(0) = 1, which is not true for the KAK decomposition. This has the desirable property that W(α = 0) = 1. Of course, all risks/costs comparing U and V are defined so that when W = V†U = 1, they vanish (i.e., R = 0 and C = 0 for any sensible risk R and cost C). By writing α = rθ we emphasize that, regardless of the choice of θ, setting r = 0 samples the point (0, 0) where both risks vanish. By defining the ansatz in this way, we can easily sample a large range of cost values. In particular, for fixed l and n, we randomly sample (θ_{ik})_p ∼ N(0, 2π) for all i, k, p, generating a random initial parameter vector θ^(0) for the entire ansatz from which we then compute the starting risks (R_{S_{Haar₁^{⊗n}}}(U, V(θ^(0))), ((d+1)/d) R_{S_Haar_n}(U, V(θ^(0)))).
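The essential structural point, a brickwork ansatz whose 2-qubit gates satisfy G(0) = 1 so that rescaling α = rθ interpolates to the identity, can be mimicked as follows. We stress that the gate parameterization below (exponentials of a small generator set) is a hypothetical stand-in and not the paper's 21-gate decomposition; it merely shares the G(0) = 1 property:

```python
import numpy as np

# Stand-in gate set: G(theta) = exp(-i sum_p theta_p H_p) with Hermitian
# generators H_p, so that G(0) = identity.  NOT the paper's decomposition.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)
GENS = [np.kron(X, I2), np.kron(I2, X), np.kron(Z, Z)]

def expm_herm(H):
    # exp(-i H) for Hermitian H via eigendecomposition
    w, v = np.linalg.eigh(H)
    return (v * np.exp(-1j * w)) @ v.conj().T

def two_qubit_gate(theta):
    return expm_herm(sum(t * g for t, g in zip(theta, GENS)))

def brickwork(n, l, theta):
    # theta[k][i]: angles of the i-th gate in layer k (even pairs, then odd)
    W = np.eye(2 ** n, dtype=complex)
    for k in range(l):
        starts = list(range(0, n - 1, 2)) + list(range(1, n - 1, 2))
        for idx, s in enumerate(starts):
            G = two_qubit_gate(theta[k][idx])
            full = np.kron(np.kron(np.eye(2 ** s), G), np.eye(2 ** (n - s - 2)))
            W = full @ W
    return W

rng = np.random.default_rng(5)
n, l = 4, 2
n_gates = (n // 2) + ((n - 1) // 2)
theta0 = rng.normal(0, 2 * np.pi, size=(l, n_gates, len(GENS)))
# Rescaling alpha = r * theta: r = 0 recovers the identity circuit
assert np.allclose(brickwork(n, l, 0.0 * theta0), np.eye(2 ** n))
assert not np.allclose(brickwork(n, l, 1.0 * theta0), np.eye(2 ** n))
```

Any generator set with G(0) = 1 reproduces the sampling trick; the particular choice only affects which gates are reachable.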
Note that this is the same initialization procedure as used in many variational quantum algorithms (VQAs) [34,35,55], and for a deep enough circuit with a random choice of θ^(0), any risk or cost will be approximately maximized (i.e., R ≈ 1 and C ≈ 1). If we were truly running a VQA to learn U(θ) given a known unitary V, we would need to iteratively estimate R_{S_{Haar₁^{⊗n}}} on a quantum computer and update our best guess for θ* classically. Instead, we exploit the form of our toy ansatz directly: we simply re-scale each angle by r, i.e., θ^(0) → rθ^(0) for different values of r. By construction, the risk is minimized when r = 0. Thus, by sampling values of r ∈ [0, 1], we can explore the empirical cost relationship (R_{S_{Haar₁^{⊗n}}}(U, V(r·θ^(0))), ((d+1)/d) R_{S_Haar_n}(U, V(r·θ^(0)))) between the two extremes. In Fig. C1, we show this empirical relationship for n = 2, …, 10 qubits with ansatz depths l = 1, 2, 3, for 20 random initialization vectors θ^(i) and 100 values of r for each random sample. All sampled points satisfy Ineq. (C1), as expected. In fact, the upper bound is often quite loose, and it appears that a tighter relationship might hold even for large values of R_{S_{Haar₁^{⊗n}}}(U, V(α)).

Out-of-Distribution Generalization for Learning Fast Scramblers
Task and setup: Here, we consider the task of learning a so-called fast scrambler [40] with an ansatz V(α) of a similar form. An n-qubit fast scrambler unitary U composed of t time steps is defined as U = ∏_{j=1}^{t} U_II^j U_I^j, where U_I^j = ⊗_{k=1}^n u_{j,k} is a product of independent Haar-random single-qubit unitaries and U_II^j is a fixed entangling unitary whose interaction strength is set by a real parameter g. The ansatz V(α) for learning the scrambler U has the same structure as U, with the fixed single-qubit gates replaced by parametrized ones. That is, the ansatz takes the form V(α) = ∏_{j=1}^{t} U_II^j V_I(α_j), where V_I(α_j) = ⊗_{k=1}^n v_{j,k}(α_{j,k}) with parametrized one-qubit gates v_{j,k}(α_{j,k}). The number of time steps t controls the difficulty of the optimization problem, since the entanglement introduced by U quickly grows with t. The parameter g in Eq. (C4) controls how quickly the learning difficulty grows with t. We work with g = 1, but consider several values of t.
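A hedged sketch of such a circuit follows: the alternating structure and unitarity can be checked as below, where we assume an all-to-all Ising entangler exp(−i(g/√n) Σ_{k<l} Z_k Z_l) for U_II (a standard fast-scrambler choice; the paper's exact U_II is given in Eq. (C4) and may differ):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(6)

def haar_qubit(rng):
    # Haar-random single-qubit unitary via QR of a Ginibre matrix
    G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    Q, R = np.linalg.qr(G)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def zz_entangler(n, g):
    # ASSUMED U_II: exp(-i (g / sqrt(n)) sum_{k<l} Z_k Z_l); an all-to-all
    # Ising coupling, not necessarily identical to the paper's Eq. (C4)
    z = np.array([1.0, -1.0])
    zdiags = [np.kron(np.kron(np.ones(2 ** k), z), np.ones(2 ** (n - k - 1)))
              for k in range(n)]
    diag = sum(zdiags[k] * zdiags[l] for k, l in combinations(range(n), 2))
    return np.diag(np.exp(-1j * g / np.sqrt(n) * diag))

def fast_scrambler(n, t, g, rng):
    U = np.eye(2 ** n, dtype=complex)
    for _ in range(t):
        U_I = haar_qubit(rng)
        for _ in range(n - 1):
            U_I = np.kron(U_I, haar_qubit(rng))   # U_I^j: product layer
        U = zz_entangler(n, g) @ U_I @ U          # one time step
    return U

U = fast_scrambler(n=3, t=4, g=1.0, rng=rng)
assert np.allclose(U @ U.conj().T, np.eye(8))     # the circuit is unitary
```

The corresponding ansatz would replace each haar_qubit call with a parametrized single-qubit gate while keeping the entangling layers fixed.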
The learning is performed as described in the main text: we first build a training set and then optimize a corresponding cost function. First, we generate a training set of size $N$ of the form $D_Q(N) = \{(|\psi_j\rangle, U|\psi_j\rangle)\}_{j=1}^{N}$, where the input states $|\psi_j\rangle$ are random product states. Second, we optimize the parameters $\alpha$ with respect to the cost function $C_{D_Q(N)}(\alpha)$ introduced in Eq. (3). Optimized parameters $\alpha_{\mathrm{opt}}$ are found by (approximately) solving the optimization problem $\alpha_{\mathrm{opt}} \in \operatorname{argmin}_{\alpha} C_{D_Q(N)}(\alpha)$, stated in Eq. (C6). We measure the learning quality with the (out-of-distribution) risk $R_{S_{\mathrm{Haar}_n}}(\alpha)$; see Eq. (1) and also Eq. (B2). We are interested in the generalization error $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}}) - C_{D_Q(N)}(\alpha_{\mathrm{opt}})$ as a function of various parameters of the learning problem.
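The training-set construction can be sketched as follows. This is an illustration only: the cost is taken to be an average state infidelity over the training pairs, a plausible stand-in for the cost of Eq. (3) whose exact form is defined in the main text, and the target unitary is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_product_state(n):
    # tensor product of n Haar-random single-qubit states
    s = np.array([1.0 + 0j])
    for _ in range(n):
        v = rng.normal(size=2) + 1j * rng.normal(size=2)
        s = np.kron(s, v / np.linalg.norm(v))
    return s

def training_set(U, N, n):
    # D_Q(N) = {(|psi_j>, U|psi_j>)}_{j=1}^N with product-state inputs
    states = [random_product_state(n) for _ in range(N)]
    return [(p, U @ p) for p in states]

def cost(V, data):
    # assumed fidelity-based cost: average infidelity of V|psi_j> vs U|psi_j>
    return float(np.mean([1 - abs(np.vdot(V @ p, Up)) ** 2 for p, Up in data]))

n = 3
U = np.eye(2 ** n, dtype=complex)  # placeholder target for illustration
data = training_set(U, 5, n)
print(cost(U, data))  # a perfect model attains zero cost
```

Minimizing this cost over the ansatz parameters $\alpha$ (e.g., by gradient descent) yields $\alpha_{\mathrm{opt}}$, whose out-of-distribution risk is then estimated on Haar-random inputs.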
Results: We learn an 8-qubit fast-scrambler unitary $U$ with $t = 3, \ldots, 10$ and training data set sizes $N = 1, \ldots, 15$. The learning is performed by repeating the optimization in Eq. (C6) 1000 times. Each optimization learns a different randomly generated $U$, works with a different randomly drawn training set, and starts from different random initial parameters $\alpha_0$. The optimization is performed with a variant of gradient descent in which the single-qubit unitaries $v_{j,k}(\beta)$ are parametrized by three rotation angles, $v_{j,k}(\beta) = e^{-iZ\beta_1} e^{-iX\beta_2} e^{-iZ\beta_3}$. Figure C2 summarizes our results. Panel (a) shows the testing risk $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}})$ as a function of the training cost $C_{D_Q(N)}(\alpha_{\mathrm{opt}})$, calculated for 1000 independently obtained values of $\alpha_{\mathrm{opt}}$. The data was obtained for $t = 5$. Blue (red) dots represent optimizations performed with training data size $N = 3$ ($N = 15$). We observe that a small training data size of $N = 3$ may lead to a situation in which the optimization has already reached appreciable training cost values ($C_{D_Q(N)}(\alpha_{\mathrm{opt}}) \approx 0.5$) while the testing risk is still at its maximal value ($R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}}) \approx 1$). This generalization issue is resolved by adding more points to the training data set. Indeed, when training on $N = 15$ data points, the obtained data does not display a concentration around $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}}) \approx 1$. We also observe that larger data sets result in an increased likelihood of almost perfect learning (that is, an optimization that achieves $C_{D_Q(N)}(\alpha_{\mathrm{opt}}) \approx R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}}) \approx 0$). Only 7% of the optimization runs with $N = 3$ achieved almost perfect learning, while 12.5% of the runs with $N = 15$ reached that goal. The plot also shows that a larger training set leads to better generalization: achieving a given cost value $C_{D_Q(N)}(\alpha_{\mathrm{opt}})$ with a bigger training set results in a smaller risk $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}})$. We observe this behavior for every optimization that we have performed.
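The ZXZ Euler-angle parametrization of the single-qubit gates used in the optimization can be sketched directly (a minimal numpy construction; the gradient-descent loop itself is omitted). Since $Z$ and $X$ square to the identity, $e^{-iP\beta} = \cos(\beta)\,I - i\sin(\beta)\,P$ for $P \in \{Z, X\}$:

```python
import numpy as np

Z = np.diag([1.0, -1.0]).astype(complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

def expm_pauli(P, angle):
    # e^{-i P angle} for a Pauli P (using P^2 = I): cos(angle) I - i sin(angle) P
    return np.cos(angle) * np.eye(2) - 1j * np.sin(angle) * P

def v(beta):
    # ZXZ Euler-angle parametrization: v(beta) = e^{-iZ b1} e^{-iX b2} e^{-iZ b3}
    b1, b2, b3 = beta
    return expm_pauli(Z, b1) @ expm_pauli(X, b2) @ expm_pauli(Z, b3)

g = v([0.3, 1.1, -0.7])
print(np.round(g, 3))
```

Any single-qubit unitary can be reached (up to a global phase) with such a ZXZ decomposition, which is why three angles per gate suffice for the ansatz.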
Panel (b) corroborates these findings further. It shows the generalization error $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}}) - C_{D_Q(N)}(\alpha_{\mathrm{opt}})$, averaged over all 1000 optimizations, as a function of the training data size $N$ for several values of $t$. We see that the average generalization error obtained for $t = 6, \ldots, 10$ is almost identical. The likely reason is that, for these values of $t$ and the training data sizes $N$ used in our experiment, only very few optimization runs managed to lower the cost function enough to achieve a risk $R_{S_{\mathrm{Haar}_n}}(\alpha_{\mathrm{opt}})$ smaller than its maximal value. This optimization issue might be addressed by more refined minimization techniques. While the results in this setting (large $t$ and insufficiently large $N$) are not useful from a learning point of view, our theoretical generalization bounds still hold. As the data suggests, the scaling is better than the worst-case scenario covered by the theoretical analysis.
Panel (c) avoids interpretational complications caused by optimization issues by averaging only over those minimization runs that achieved $C_{D_Q(N)}(\alpha_{\mathrm{opt}}) < 0.5$. We see a scaling behavior similar to what we observed when taking the entire data set into account. Panels (b) and (c) show that the generalization error decreases faster than the theoretical upper bound, which scales as $\sim N^{-1/2}$ with the training data size $N$ and is shown by the black solid line. As the learning difficulty (measured by $t$) increases, the rate at which the generalization error decreases with $N$ seems to slowly approach the theoretical bound.