Theoretical Guarantees for Permutation-Equivariant Quantum Neural Networks

Despite the great promise of quantum machine learning models, there are several challenges one must overcome before unlocking their full potential. For instance, models based on quantum neural networks (QNNs) can suffer from excessive local minima and barren plateaus in their training landscapes. Recently, the nascent field of geometric quantum machine learning (GQML) has emerged as a potential solution to some of those issues. The key insight of GQML is that one should design architectures, such as equivariant QNNs, encoding the symmetries of the problem at hand. Here, we focus on problems with permutation symmetry (i.e., the group of symmetry $S_n$), and show how to build $S_n$-equivariant QNNs. We provide an analytical study of their performance, proving that they do not suffer from barren plateaus, quickly reach overparametrization, and generalize well from small amounts of data. To verify our results, we perform numerical simulations for a graph state classification task. Our work provides the first theoretical guarantees for equivariant QNNs, thus indicating the extreme power and potential of GQML.


INTRODUCTION
Symmetry studies and formalizes the invariance of objects under some set of operations. A wealth of theory has gone into describing symmetries as mathematical entities through the concept of groups and representations. While the analysis of symmetries in nature has greatly improved our understanding of the laws of physics, the study of symmetries in data has only recently gained momentum within the framework of learning theory. In the past few years, classical machine learning practitioners realized that models tend to perform better when constrained to respect the underlying symmetries of the data. This has led to the blossoming field of geometric deep learning [1][2][3][4][5], where symmetries are incorporated as geometric priors into the learning architectures, improving trainability and generalization performance [6][7][8][9][10][11][12][13].
The tremendous success of geometric deep learning has recently inspired researchers to import these ideas to the realm of quantum machine learning (QML) [14][15][16]. QML is a new and exciting field at the intersection of classical machine learning and quantum computing. By running routines in quantum hardware, and thus exploiting the exponentially large dimension of the Hilbert space, the hope is that QML algorithms can outperform their classical counterparts when learning from data [17].
The infusion of ideas from geometric deep learning to QML has been termed "geometric quantum machine learning" (GQML) [18][19][20][21][22][23][24]. GQML leverages the machinery of group and representation theory [25] to build quantum architectures that encode symmetry information about the problem at hand. For instance, when the model is parametrized through a quantum neural network (QNN) [16][26][27][28], GQML indicates that the layers of the QNN should be equivariant under the action of the symmetry group associated to the dataset. That is, applying a symmetry transformation on the input to the QNN layers should be the same as applying it to its output.
One of the main goals of GQML is to create architectures that solve, or at least significantly mitigate, some of the known issues of standard symmetry non-preserving QML models [16]. For instance, it has been shown that the optimization landscapes of generic QNNs can exhibit a large number of local minima [29][30][31][32], or be prone to the barren plateau phenomenon [33][34][35][36][37][38][39][40][41][42][43][44][45] whereby the loss function gradients vanish exponentially with the problem size. Crucially, it is known that barren plateaus and excessive local minima are connected to the expressibility [30,32,37,43,46] of the QNN, so that problem-agnostic architectures are more likely to exhibit trainability issues. In this sense, it is expected that following the GQML program of baking symmetry directly into the algorithm will lead to models with sharp inductive biases that suitably limit their expressibility and search space.
Our first contribution is to provide guidelines to build unitary $S_n$-equivariant QNNs. We then derive rigorous theoretical guarantees for these architectures in terms of their trainability and generalization capabilities. Specifically, we prove that $S_n$-equivariant QNNs do not lead to barren plateaus, can be overparametrized with polynomially deep circuits, and generalize well with only a polynomial number of training points. We also identify problems (i.e., datasets) for which the model is trainable, as well as datasets leading to untrainability. All these appealing properties are also demonstrated in numerical simulations of a graph classification task. Our empirical results verify our theoretical ones, and even show that the performance of $S_n$-equivariant QNNs can, in practice, be better than that guaranteed by our theorems.

Preliminaries
While the formalism of GQML can be readily applied to a wide range of tasks with $S_n$ symmetry, here we will focus on supervised learning problems. We note, however, that our results can be readily extended to more general scenarios such as unsupervised learning [72,73], reinforcement learning [74,75], generative modeling [76][77][78][79], or to the more task-oriented computational paradigm of variational quantum algorithms [63,80].
Generally, a supervised quantum machine learning task can be phrased in terms of a data space $\mathcal{R}$ (a set of quantum states on some Hilbert space $\mathcal{H}$) and a real-valued label space $\mathcal{Y}$. We will assume $\mathcal{H}$ to be a tensor product of $n$ two-dimensional subsystems (qubits) and thus of dimension $d = 2^n$. We are given repeated access to a training dataset $\mathcal{S} = \{(\rho_i, y_i)\}_{i=1}^{M}$, where $\rho_i$ is sampled from $\mathcal{R}$ according to some probability distribution $P$, and where $y_i \in \mathcal{Y}$. We further assume that the labels are assigned by some underlying (but unknown) function $f: \mathcal{R} \to \mathcal{Y}$, that is, $y_i = f(\rho_i)$. We make no assumptions regarding the origins of $\rho_i$, meaning that these can correspond to classical data embedded in quantum states [81,82], or to quantum data obtained from some quantum mechanical process [60,61,83].
Figure 1. GQML embeds geometric priors into a QML model. Incorporating prior knowledge through $S_n$-equivariance heavily restricts the search space of the model. We show that such inductive biases lead to models that do not exhibit barren plateaus, can be efficiently overparametrized, and require small amounts of data to generalize well.
The goal is to produce a parametrized function $h_\theta: \mathcal{R} \to \mathcal{Y}$ closely modeling the outputs of the unknown target $f$, where $\theta$ are trainable parameters. That is, we want $h_\theta$ to accurately predict labels for the data in the training set $\mathcal{S}$ (low training error), as well as to predict the labels for new and previously unseen states (small generalization error). We will focus on QML models that are parametrized through a QNN, a unitary channel $\mathcal{U}_\theta: \mathcal{B}(\mathcal{H}) \to \mathcal{B}(\mathcal{H})$ such that $\mathcal{U}_\theta(\rho) = U(\theta)\rho U(\theta)^\dagger$. Here, $\mathcal{B}(\mathcal{H})$ denotes the space of bounded linear operators in $\mathcal{H}$. Throughout this work we will restrict to $L$-layered QNNs $\mathcal{U}_\theta = \mathcal{U}^L_{\theta_L} \circ \cdots \circ \mathcal{U}^1_{\theta_1}$, where
$$\mathcal{U}^l_{\theta_l}(\rho) = e^{-i\theta_l H_l}\,\rho\,e^{i\theta_l H_l}, \tag{1}$$
for some Hermitian generators $\{H_l\}$, so that $U(\theta) = \prod_{l=1}^{L} e^{-i\theta_l H_l}$. Moreover, we consider models that depend on a loss function of the form
$$\ell_\theta(\rho_i) = \mathrm{Tr}[\mathcal{U}_\theta(\rho_i)\,O], \tag{2}$$
where $O$ is a Hermitian observable. We quantify the training error via the so-called empirical loss, which is defined as
$$\hat{\mathcal{L}}(\theta) = \sum_{i=1}^{M} c_i\,\ell_\theta(\rho_i), \tag{3}$$
with real coefficients $c_i$. The model is trained by solving the optimization task $\arg\min_\theta \hat{\mathcal{L}}(\theta)$ [63]. Once a desired convergence in the optimization is achieved, the optimal parameters, along with the loss function $\ell_\theta$, are used to predict labels. For the case of binary classification, where $\mathcal{Y} = \{+1, -1\}$, one can choose $c_i := -y_i/M$. Then, if the measurement operator is normalized such that $\ell_\theta(\rho_i) \in [-1, 1]$, this corresponds to the hinge loss, a standard (but not the only relevant [84]) loss function in machine learning.
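The layered model and the empirical loss described above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the implementation used for the results reported here: the two-qubit generators, the observable, the two computational-basis input states and all function names are hypothetical choices made only to keep the example small, and the weights follow the choice $c_i = -y_i/M$.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def exp_minus_i(theta, H):
    """exp(-i * theta * H) for Hermitian H, via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * theta * w)) @ V.conj().T

def qnn_unitary(thetas, generators):
    """U(theta) as a product of layers; layer 1 acts on the state first."""
    U = np.eye(generators[0].shape[0], dtype=complex)
    for theta, H in zip(thetas, generators):
        U = exp_minus_i(theta, H) @ U
    return U

def loss(rho, U, O):
    """Single-state loss Tr[U rho U^dagger O]."""
    return float(np.real(np.trace(U @ rho @ U.conj().T @ O)))

def empirical_loss(rhos, labels, U, O):
    """Empirical loss with weights c_i = -y_i / M."""
    M = len(rhos)
    return sum(-y / M * loss(rho, U, O) for rho, y in zip(rhos, labels))

# Toy two-qubit example (illustrative choices, not the paper's dataset).
H_x = (np.kron(X, I2) + np.kron(I2, X)) / 2   # collective X generator
H_zz = np.kron(Z, Z)                          # ZZ generator
O = (np.kron(Z, I2) + np.kron(I2, Z)) / 2     # spectrum lies in [-1, 1]

ket00 = np.zeros(4, dtype=complex)
ket00[0] = 1
ket11 = np.zeros(4, dtype=complex)
ket11[3] = 1
rhos = [np.outer(ket00, ket00.conj()), np.outer(ket11, ket11.conj())]
labels = [+1, -1]

U0 = qnn_unitary([0.0, 0.0], [H_x, H_zz])     # identity circuit
U1 = qnn_unitary([0.3, 1.1], [H_x, H_zz])     # arbitrary parameters
loss_at_zero = empirical_loss(rhos, labels, U0, O)
loss_at_theta = empirical_loss(rhos, labels, U1, O)
```

With all parameters set to zero the circuit is the identity, so the two states are already perfectly separated by $O$ and the empirical loss takes its minimal value of $-1$; any other parameter choice gives a value in $[-1, 1]$.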
We further remark that while Eq. (3) approximates the error of the learned model, the true loss is defined as
$$\mathcal{L}(\theta) = \mathbb{E}_{\rho \sim P}\left[c(y)\,\ell_\theta(\rho)\right]. \tag{4}$$
Here we have denoted the weights as $c(y)$ to make their dependency on the labels $y$ explicit. The difference between the true loss and the empirical one, known as the generalization error, is given by
$$\mathrm{gen}(\theta) = \left|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)\right|. \tag{5}$$
We now turn to GQML, where the first step is identifying the underlying symmetries of the dataset, as this allows us to create suitable inductive biases for $h_\theta$. In particular, many problems of interest exhibit so-called label symmetry, i.e., the function $f$ produces labels that remain invariant under a set of operations on the inputs. Concretely, one can verify that such a set of operations forms a group [18], which leads to the following definition.
Definition 1 (Label symmetries and $G$-invariance). Given a compact group $G$ and some unitary representation $R$ acting on quantum states $\rho$, we say $f$ has a label symmetry if it is $G$-invariant, i.e., if
$$f\big(R(g)\,\rho\,R(g)^\dagger\big) = f(\rho), \quad \forall g \in G. \tag{6}$$
Here, we recall that a representation is a mapping of a group into the space of invertible linear operators on some vector space (in this case the space of quantum states) that preserves the structure of the group [25]. Also, we note that some problems may have functions $f$ whose outputs change (rather than being invariant) in a way entirely determined by the action of $G$ on their inputs. While still captured by general GQML theory, these do not pertain to Definition 1 and are not discussed further.
Label invariance captures the scenario where the relevant information in $\rho$ is unchanged under the action of $G$. Evidently, when searching for models $h_\theta$ that accurately predict outputs of $f$, it is natural to restrict our search to the space of models that respect the label symmetries of $f$. In this context, the theory of GQML provides a constructive approach to create $G$-invariant models, resting on the concept of equivariance [23].
Figure 2. Each layer of the QNN is obtained by exponentiation of a generator from the set $\mathcal{G}$ in Eq. (9). Here we show a circuit with $L = 3$ layers acting on $n = 4$ qubits. Single-qubit blocks indicate a rotation about the $x$ or $y$ axis, while two-qubit blocks denote entangling gates generated by a $ZZ$ interaction. All colored gates between dashed horizontal lines share the same trainable parameter $\theta_l$.
Definition 2 (Equivariance). We say that an observable $O$ is $G$-equivariant iff for all elements $g \in G$, $[O, R(g)] = 0$. We say that a layer $\mathcal{U}^l_{\theta_l}$ of a QNN is $G$-equivariant iff it is generated by a $G$-equivariant Hermitian operator.
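By the previous definition, $G$-equivariant layers are maps that commute with the action of the group:
$$\mathcal{U}^l_{\theta_l}\big(R(g)\,\rho\,R(g)^\dagger\big) = R(g)\,\mathcal{U}^l_{\theta_l}(\rho)\,R(g)^\dagger, \quad \forall g \in G. \tag{7}$$
Definition 2 can be naturally extended to QNNs.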
Definition 3 (Equivariant QNN). We say that an $L$-layered QNN is $G$-equivariant iff each of its layers is $G$-equivariant.
Altogether, equivariant QNNs and measurement operators provide a recipe to design invariant models, i.e., models that respect the label symmetries. Akin to their classical machine learning counterparts [1][2][3][4][5], such GQML models consist of a composition of many equivariant operations (realized by the $L$ layers of the equivariant QNN) and an invariant one (realized by the measurement of the equivariant observable) [23]. Furthermore, model invariance extends to the loss function itself, as captured by the following Lemma.
Lemma 1 (Invariance from equivariance). A loss function of the form in Eq. (2) is $G$-invariant if it is composed of a $G$-equivariant QNN and measurement.
Proofs of this Lemma and of the following Lemmas and Theorems are presented in Supplementary Methods 2 and 3.

$S_n$-Equivariant QNNs and measurements
In the previous section we have described how to build generic $G$-invariant models. We now specialize to the case where $G$ is the symmetric group $S_n$, and where $R$ is the qubit-defining representation of $S_n$, i.e., the one permuting qubits, which for any $\pi \in S_n$ acts as
$$R(\pi)\bigotimes_{i=1}^{n}|\psi_i\rangle = \bigotimes_{i=1}^{n}|\psi_{\pi^{-1}(i)}\rangle. \tag{8}$$
Following Definitions 2 and 3, the first step towards building $S_n$-equivariant QNNs is defining $S_n$-equivariant generators for each layer. In the Methods section we describe how such operators can be obtained, but here we will restrict our attention to the following set of generators
$$\mathcal{G} = \left\{\frac{1}{n}\sum_{j=1}^{n} X_j,\ \frac{1}{n}\sum_{j=1}^{n} Y_j,\ \frac{2}{n(n-1)}\sum_{k<j} Z_j Z_k\right\}. \tag{9}$$
Note that there is some freedom in the choice of generators. Any two sums over two distinct single-qubit Pauli operators (the first two generators) plus a sum over pairs of the remaining Pauli operator (the third generator) suffice, and we choose the above set without loss of generality. In Fig. 2 we show an example of an $L = 3$ layered $S_n$-equivariant QNN acting on $n = 4$ qubits. While the single-qubit rotations generated by $\mathcal{G}$ are readily achievable in most quantum computing platforms, the collective $ZZ$ interactions are best suited to architectures allowing for reconfigurable connectivity [85][86][87] or platforms that implement mediated all-to-all interactions [88,89]. In fact, such interactions are referred to as one-axis twisting [90] in the context of spin squeezing [91] and form the basis of many quantum sensing protocols.
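In addition, we will consider observables of the following form
$$\mathcal{M} = \left\{\frac{1}{n}\sum_{j=1}^{n}\chi_j,\ \frac{2}{n(n-1)}\sum_{k<j}\chi_j\chi_k,\ \prod_{j=1}^{n}\chi_j\right\}, \tag{10}$$
where $\chi$ is a (fixed) Pauli matrix. It is straightforward to see that any $H_l \in \mathcal{G}$ and $O \in \mathcal{M}$ will commute with $R(\pi)$ for any $\pi \in S_n$. We note that one could certainly consider other observables as well.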
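The commutation property stated above, and the loss invariance of Lemma 1, can be checked by brute force on a few qubits. The sketch below is illustrative only: it assumes collective generators normalized as $\frac{1}{n}\sum_j X_j$, $\frac{1}{n}\sum_j Y_j$ and $\frac{2}{n(n-1)}\sum_{k<j} Z_j Z_k$, picks $\frac{1}{n}\sum_j Z_j$ as the observable, and uses hypothetical helper names (`embed`, `perm_matrix`, and so on).

```python
import itertools
import numpy as np

PAULI = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def embed(n, ops):
    """Tensor product over n qubits; ops maps qubit index -> 2x2 matrix."""
    out = np.eye(1, dtype=complex)
    for q in range(n):
        out = np.kron(out, ops.get(q, np.eye(2, dtype=complex)))
    return out

def one_body(n, p):
    return sum(embed(n, {j: PAULI[p]}) for j in range(n)) / n

def two_body(n, p):
    pairs = itertools.combinations(range(n), 2)
    total = sum(embed(n, {j: PAULI[p], k: PAULI[p]}) for j, k in pairs)
    return total * 2 / (n * (n - 1))

def perm_matrix(perm, n):
    """Qubit permutation R(pi): the qubit at position i moves to perm[i]."""
    d = 2 ** n
    R = np.zeros((d, d))
    for x in range(d):
        bits = [(x >> (n - 1 - q)) & 1 for q in range(n)]
        new_bits = [0] * n
        for i, b in enumerate(bits):
            new_bits[perm[i]] = b
        y = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
        R[y, x] = 1.0
    return R

def exp_minus_i(theta, H):
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * theta * w)) @ V.conj().T

n = 3
generators = [one_body(n, "X"), one_body(n, "Y"), two_body(n, "Z")]
O = one_body(n, "Z")
perms = [perm_matrix(p, n) for p in itertools.permutations(range(n))]

# Largest commutator norm between any generator/observable and any R(pi).
max_comm = max(np.linalg.norm(A @ R - R @ A)
               for A in generators + [O] for R in perms)

# Loss invariance: a permuted input state gives the same loss value.
rng = np.random.default_rng(0)
psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())
U = np.eye(2 ** n, dtype=complex)
for theta, H in zip(rng.uniform(0, 2 * np.pi, size=6), generators * 2):
    U = exp_minus_i(theta, H) @ U

def loss(state):
    return float(np.real(np.trace(U @ state @ U.conj().T @ O)))

max_dev = max(abs(loss(R @ rho @ R.T) - loss(rho)) for R in perms)
```

Both `max_comm` and `max_dev` vanish up to floating-point error, whereas a non-symmetric operator such as a Pauli $X$ acting on a single qubit fails to commute with a transposition involving that qubit.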
We now leverage tools from representation theory to understand and unravel the underlying structure of $S_n$-equivariant QNNs and measurement operators. This will allow us to derive, in the next section, theoretical guarantees for these GQML models.
One of the most notable results from representation theory is that a given finite-dimensional representation of a group decomposes into an orthogonal direct sum of fundamental building blocks known as irreducible representations (irreps). As further explained in the Methods, the qubit-defining representation takes, under some appropriate global change of basis (which we denote with $\cong$), the block-diagonal form
$$R(\pi) \cong \bigoplus_{\lambda}\bigoplus_{\mu=1}^{d_\lambda} r_\lambda(\pi) = \bigoplus_{\lambda} r_\lambda(\pi)\otimes\mathbb{1}_{d_\lambda}. \tag{11}$$
Here $\lambda$ labels the irreps of $S_n$ and $r_\lambda$ is the corresponding irrep itself, which appears $d_\lambda$ times. The collection of these repeated irreps is called an isotypic component. Crucially, the only irreps appearing in $R$ correspond to two-row Young diagrams (see Methods) and can be parametrized by a single non-negative integer $m$, as $\lambda \equiv \lambda(m) = (n-m, m)$, where $m = 0, 1, \ldots, \lfloor n/2 \rfloor$. It can be shown that
$$d_\lambda = n - 2m + 1, \qquad m_\lambda = \binom{n}{m}\frac{n-2m+1}{n-m+1}, \tag{12}$$
where again $d_\lambda$ is the number of times the irrep appears and $m_\lambda$ is the dimension of the irrep itself. Note that every $d_\lambda$ is in $O(n)$, whereas some $m_\lambda$ can grow exponentially with the number of qubits. For instance, if $n$ is even and $m = n/2$, one finds that $m_\lambda \in \Omega(2^n/n^2)$. We finally note that Eq. (11) implies $\sum_\lambda m_\lambda d_\lambda = 2^n$.
Given the block-diagonal structure of $R$, $S_n$-equivariant unitaries and measurements must necessarily take the form
$$U(\theta) \cong \bigoplus_{\lambda}\mathbb{1}_{m_\lambda}\otimes U_\lambda(\theta), \qquad O \cong \bigoplus_{\lambda}\mathbb{1}_{m_\lambda}\otimes O_\lambda. \tag{13}$$
That is, both $U(\theta)$ and $O$ decompose into a direct sum of $d_\lambda$-dimensional blocks repeated $m_\lambda$ times (with $m_\lambda$ called the multiplicity) on each isotypic component $\lambda$. This decomposition is illustrated in Fig. 3.
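As a quick sanity check of the block sizes discussed above, the following sketch tabulates the two dimensions attached to each two-row diagram $\lambda = (n-m, m)$. It assumes the standard Schur-Weyl expressions $d_\lambda = n - 2m + 1$ and $m_\lambda = \binom{n}{m}\frac{n-2m+1}{n-m+1}$; the function name `irrep_dims` is hypothetical.

```python
from math import comb

def irrep_dims(n):
    """Return [(m, d_lambda, m_lambda)] for lambda = (n - m, m)."""
    out = []
    for m in range(n // 2 + 1):
        d = n - 2 * m + 1                      # block size of U(theta)
        mult = comb(n, m) * d // (n - m + 1)   # dimension of the S_n irrep
        out.append((m, d, mult))
    return out

def total_dimension(n):
    """Sum of d_lambda * m_lambda, which should equal 2^n."""
    return sum(d * mult for _, d, mult in irrep_dims(n))

table_n5 = irrep_dims(5)
```

For $n = 5$ this gives blocks of size 6, 4 and 2 repeated 1, 4 and 5 times, respectively, which add up to the full dimension $2^5 = 32$.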
Let us highlight several crucial implications of the block-diagonal structure arising from $S_n$-equivariance. First and foremost, we note that, under the action of an $S_n$-equivariant QNN, the Hilbert space decomposes as
$$\mathcal{H} \cong \bigoplus_{\lambda}\bigoplus_{\nu=1}^{m_\lambda}\mathcal{H}^\nu_\lambda, \tag{14}$$
where each $\mathcal{H}^\nu_\lambda$ denotes a $d_\lambda$-dimensional invariant subspace. Moreover, one can also see that when the QNN acts on an input quantum state as $\mathcal{U}_\theta(\rho) = U(\theta)\rho U(\theta)^\dagger$, it can only access the information in $\rho$ which is contained in the invariant subspaces $\mathcal{H}^\nu_\lambda$ (see also [23]). This means that to solve the learning task, we require two ingredients: i) the data must encode the relevant information required for classification into these subspaces [23,25], and ii) the QNN must be able to accurately process the information within each $\mathcal{H}^\nu_\lambda$. As discussed in the Methods, we can guarantee that the second condition will not be an issue, as the set of generators in Eq. (9) is universal within each invariant subspace, i.e., the QNN can map any state in $\mathcal{H}^\nu_\lambda$ to any other state in $\mathcal{H}^\nu_\lambda$ (see also Ref. [92]).
Figure 3. Representation theory and $S_n$-equivariance. Using tools from representation theory we find that the $S_n$-equivariant QNN $U(\theta)$ and the representation of the group elements $R(\pi)$, for any $\pi \in S_n$, admit an irrep block decomposition as in Eq. (13) and Eq. (11), respectively. The irreps can be labeled with a single parameter $\lambda = (n-m, m)$ where $m = 0, 1, \ldots, \lfloor n/2 \rfloor$. For a system of $n = 5$ qubits, we show in a) the block-diagonal decomposition for $U(\theta)$ and in b) the decomposition of $R(\pi)$ as a representation of $S_5$. The dashed boxes denote the isotypic components labeled by $\lambda$. c) As $n$ increases, $U(\theta)$ has a block-diagonal decomposition which contains polynomially large blocks repeated a (potentially) exponential number of times. In contrast, the block decomposition of $R(\pi)$ (for any $\pi \in S_n$) contains blocks that can be exponentially large but that are only repeated a polynomial number of times.
A second fundamental implication of Eq. (13) is that the manifold of equivariant unitaries is of low dimension. We make this explicit in the following lemma.
Lemma 2 (Dimension of $S_n$-equivariant unitaries). The submanifold of $S_n$-equivariant unitaries is of dimension equal to the tetrahedral number $\mathrm{Te}_{n+1} = \binom{n+3}{3}$ (see Fig. 4), and therefore on the order of $\Theta(n^3)$.
Crucially, Lemma 2 shows that the equivariance constraint limits the degrees of freedom in the QNN (and concomitantly in any observable) from $4^n$ to only polynomially many.
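The count in Lemma 2 can be cross-checked in two independent ways on small systems. The sketch below is an illustration rather than a proof: it relies on the standard fact that, for a permutation representation, the space of matrices commuting with every $R(\pi)$ has one basis element per orbit of $S_n$ acting on pairs of computational-basis strings, and it compares this orbit count with $\sum_\lambda d_\lambda^2$ (assuming $d_\lambda = n - 2m + 1$) and with $\binom{n+3}{3}$. Function names are hypothetical.

```python
from itertools import product
from math import comb

def commutant_dimension(n):
    """Count S_n orbits on pairs (x, y) of n-bit strings.

    Two pairs lie in the same orbit iff they contain the same multiset of
    columns (x_i, y_i), so the sorted column tuple is a canonical label.
    """
    orbits = set()
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=n):
            orbits.add(tuple(sorted(zip(x, y))))
    return len(orbits)

def sum_of_squared_blocks(n):
    """Sum over m of (n - 2m + 1)^2, the total size of the distinct blocks."""
    return sum((n - 2 * m + 1) ** 2 for m in range(n // 2 + 1))

def tetrahedral(k):
    """k-th tetrahedral number, Te_k = C(k + 2, 3)."""
    return comb(k + 2, 3)

dims = {n: commutant_dimension(n) for n in range(1, 6)}
```

For $n = 1, \ldots, 5$ all three quantities agree (2, 10, 20, 35, 56), compared with $4^n$ for unconstrained $n$-qubit operators.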

Absence of barren plateaus in $S_n$-equivariant QNNs
Barren plateaus have been recognized as one of the main challenges to overcome in order to guarantee the success of QML models using QNNs [16]. When a model exhibits a barren plateau, the loss landscape becomes, on average, exponentially flat and featureless as the problem size increases [33][34][35][36][37][38][39][40][41][42][43][44][45]. This severely impedes its trainability, as one needs to spend an exponentially large amount of resources to correctly estimate a loss-minimizing direction. Issues of barren plateaus arise primarily due to the structure of the models (including the choice of QNN, the input state and the observables) employed [33][34][35][36][37][38][39][40][41][42][43][45] but can also be caused solely by effects of noise [44]. In the rest of this section, we will only be concerned with the former type of barren plateaus, which is the most studied.
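When studying barren plateaus, one typically analyzes the variance of the empirical loss function partial derivatives, $\partial_\mu \hat{\mathcal{L}}(\theta) = \partial\hat{\mathcal{L}}(\theta)/\partial\theta_\mu$, where $\theta_\mu \in \theta$. We say that there is a barren plateau in the $\theta_\mu$ direction if $\mathrm{Var}_\theta[\partial_\mu \hat{\mathcal{L}}(\theta)]$ vanishes exponentially with $n$, i.e., if it is in $O(1/b^n)$ for some $b > 1$.
Before stating our main results, we introduce a bit of notation. Let us define $Q^\nu_\lambda$ to be the operator that maps vectors from $\mathcal{H}$ to $\mathcal{H}^\nu_\lambda$, such that $(Q^\nu_\lambda)^\dagger Q^\nu_\lambda$ realizes a projection onto $\mathcal{H}^\nu_\lambda$ (see Supplementary Methods 4 for additional details). Given a matrix $B \in \mathbb{C}^{d \times d}$, we will denote its restriction to $\mathcal{H}^\nu_\lambda$ as
$$B^\nu_\lambda = Q^\nu_\lambda\,B\,(Q^\nu_\lambda)^\dagger, \tag{15}$$
with $B^\nu_\lambda \in \mathbb{C}^{d_\lambda \times d_\lambda}$. We remark that the restriction of $S_n$-equivariant generators is independent of the multiplicity index $\nu$ (see Eq. (13)). On the other hand, the restrictions of non-equivariant operators (such as the input states $\rho_i$) are not independent of $\nu$, meaning that the set composed of all the restrictions $\rho^\nu_\lambda$ contains an exponentially large amount of non-redundant information that the QNN can act on (see also [23]).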
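Denoting the weighted average of the input states as $\sigma = \sum_{i=1}^{M} c_i \rho_i$, we find:
Theorem 1 (Variance of partial derivatives). Let $\mathcal{U}_\theta$ be an $S_n$-equivariant QNN, with generators in $\mathcal{G}$, and $O$ an $S_n$-equivariant measurement operator from $\mathcal{M}$. Consider an empirical loss $\hat{\mathcal{L}}(\theta)$ as in Eq. (3). Assuming a circuit depth $L$ such that the QNN forms independent 2-designs on each isotypic block, we have $\langle\partial_\mu \hat{\mathcal{L}}(\theta)\rangle_\theta = 0$, and
$$\mathrm{Var}_\theta[\partial_\mu \hat{\mathcal{L}}(\theta)] = \sum_{\lambda}\frac{2 d_\lambda}{(d_\lambda^2-1)^2}\,\Delta(H_{\mu,\lambda})\,\Delta(O_\lambda)\,\Delta\Big(\sum_{\nu}\sigma^\nu_\lambda\Big). \tag{16}$$
Here, $\Delta(B) = \mathrm{Tr}[B^2] - \mathrm{Tr}[B]^2/\dim(B)$, and $H_{\mu,\lambda}$ and $O_\lambda$ denote the restrictions of $H_\mu$ and $O$ to the isotypic component $\lambda$.
In the Methods we present a sketch of the proof for Theorem 1, as well as its underlying assumptions.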
We remark that while we have derived Theorem 1 for $S_n$-equivariant QNNs and measurement operators, given some general finite-dimensional compact group $G$, the form of Eq. (16) is valid provided that one uses a $G$-equivariant QNN that is universal within each invariant subspace. In this case, the summation over $\lambda$ will run over the irreps of the representation of $G$.
Let us now analyze each term in Eq. (16) to identify potential sources of untrainability. First, let us consider the prefactors $\frac{2d_\lambda}{(d_\lambda^2-1)^2}$. From Eq. (12) we can readily see that $\frac{2d_\lambda}{(d_\lambda^2-1)^2} \in \Omega(1/n^3)$ for any $\lambda$. Next, it is convenient to separate the two remaining potential sources of barren plateaus into two categories: i) those that are QNN or measurement dependent, $\Delta(H_{\mu,\lambda})$ and $\Delta(O_\lambda)$, and ii) those that are dataset-dependent, $\Delta(\sum_\nu \sigma^\nu_\lambda)$. This identification commonly appears when analyzing the absence of barren plateaus (see Refs. [34,42,43,106,107]) and allows one to study how the architecture and dataset individually affect the trainability. In what follows, we will say that some architecture does not induce barren plateaus if the terms that are QNN or measurement dependent are not exponentially vanishing.
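The scaling of the architecture-dependent factors can be explored numerically. The sketch below is a hedged illustration under explicit assumptions: block sizes $d_\lambda = n - 2m + 1$, a one-body generator normalized as $\frac{1}{n}\sum_j X_j$, and the standard identification of its restriction to a $d$-dimensional block with $\frac{2}{n}J_x$, where $J_x$ is the spin-$(d-1)/2$ operator. Under these assumptions elementary angular-momentum algebra gives $\Delta = (d^3 - d)/(3n^2)$, which the code verifies; this closed form is derived here for illustration and is not quoted from Theorem 2. Function names are hypothetical.

```python
import numpy as np

def delta(B):
    """Delta(B) = Tr[B^2] - Tr[B]^2 / dim(B)."""
    B = np.asarray(B)
    return float(np.real(np.trace(B @ B) - np.trace(B) ** 2 / B.shape[0]))

def spin_x(d):
    """J_x for the spin-(d-1)/2 irrep, in the J_z basis m = j, ..., -j."""
    j = (d - 1) / 2
    m = j - np.arange(d)
    # Matrix elements <m+1| J_+ |m> = sqrt(j(j+1) - m(m+1)).
    off = np.sqrt(j * (j + 1) - m[1:] * (m[1:] + 1))
    J_plus = np.diag(off, k=1)
    return (J_plus + J_plus.T) / 2

def prefactor(d):
    """2 d / (d^2 - 1)^2, defined for block sizes d >= 2."""
    return 2 * d / (d ** 2 - 1) ** 2

n = 6
block_sizes = [n - 2 * m + 1 for m in range(n // 2 + 1) if n - 2 * m + 1 >= 2]
deltas = {d: delta(2 / n * spin_x(d)) for d in block_sizes}
closed_form = {d: (d ** 3 - d) / (3 * n ** 2) for d in block_sizes}

# The smallest prefactor occurs for the largest block, d = n + 1.
scaled_prefactor = [prefactor(k + 1) * k ** 3 for k in range(1, 200)]
```

The rescaled prefactor `prefactor(n + 1) * n**3` stays bounded away from zero as $n$ grows, consistent with the $\Omega(1/n^3)$ statement, and the numerically computed `deltas` match `closed_form` for every block with $d_\lambda \geq 2$.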
Using tools from representation theory we can obtain the following exact expressions for $S_n$-equivariant operators.
In Supplementary Methods 6 we also derive formulas for the case of A being k-body operators.
Let us review the implications of Theorem 2. First, note that all elements of our gate-set $\mathcal{G}$ and measurement-set $\mathcal{M}$ are of the form in Theorem 2, and therefore belong in $\Omega(d_\lambda)$. This follows from the fact that the binomial coefficient $\binom{n+a}{b}$ scales as a polynomial of degree $b$ in $n$. Since $d_\lambda$ itself is in $\Theta(n)$ (see Eq. (12)), for all $\lambda$ and $\mu$ the terms $\Delta(H_{\mu,\lambda})$ and $\Delta(O_\lambda)$ are not exponentially vanishing. Hence, combining this result with Theorem 1 allows us to argue that $S_n$-equivariant QNNs do not induce barren plateaus.
Corollary 1. Under the same assumptions as Theorem 1, it follows that, if $\Delta(\sum_\nu \sigma^\nu_\lambda) \in \Omega(1/\mathrm{poly}(n))$ for some $\lambda$, then $\mathrm{Var}_\theta[\partial_\mu \hat{\mathcal{L}}(\theta)] \in \Omega(1/\mathrm{poly}(n))$, i.e., the loss function does not exhibit a barren plateau.
We note that a crucial requirement for Corollary 1 to hold is that $\Delta(\sum_\nu \sigma^\nu_\lambda)$ needs to be, at most, polynomially vanishing. In the Trainable States section, we identify cases of datasets leading to trainability but also to untrainability. Finally, we note that as discussed in Supplementary Methods 9, Corollary 1 is sufficient to guarantee that the loss function does not exhibit the narrow gorge phenomenon, whereby the minima of the loss occupy an exponentially small volume of parameter space [108]. In other words, we show that the absence of barren plateaus implies both the absence of narrow gorges and the anti-concentration of the loss function.

Efficient overparametrization
Absence of barren plateaus is a necessary, but not sufficient, condition for trainability, as there could be other issues compromising the parameter optimization. In particular, it has been shown that quantum landscapes can exhibit a large number of local minima [29][30][31]. As such, here we consider a different aspect of the trainability of $S_n$-equivariant QNNs: their ability to converge to global minima. For this purpose, we find it convenient to recall the concept of overparametrization.
Overparametrization denotes a regime in machine learning where models have a capacity much larger than that necessary to represent the distribution of the training data; for example, this occurs when the number of parameters is greater than the number of training points. Models operating in the overparametrized regime have seen tremendous success in classical deep learning, as they closely fit the training data but still generalize well when presented with new data instances [109][110][111][112]. Recently, Ref. [32] studied overparametrization in the context of QML models. A clear phase transition in the trainability of under- and overparametrized QNNs was evidenced: below some critical number of parameters (underparametrized) the optimizer greatly struggles to minimize the loss function, whereas beyond that number of parameters (overparametrized) it converges exponentially fast to solutions (see Methods for further details).
Given the desirable features of overparametrization, it is important to estimate how many parameters are needed to achieve this regime. Here, we can derive the following theorem.
Theorem 3. Let $\mathcal{U}_\theta$ be an $S_n$-equivariant QNN with generators in $\mathcal{G}$. Then, $\mathcal{U}_\theta$ can be overparametrized with $O(n^3)$ parameters.
Theorem 3 guarantees that $S_n$-equivariant QNNs only require a polynomial number of parameters to reach overparametrization.

Generalization from few data points
Thus far we have seen that $S_n$-equivariant QNNs can be efficiently trained, as they exhibit no barren plateaus and can be overparametrized. However, in QML we are not only interested in achieving a small training error, we also aim at low generalization error [26,61][113][114][115][116].
Computing the generalization error in Eq. (5) is usually not possible, as the probability distribution $P$ over which the data is sampled is generally unknown. However, one can still derive bounds for $\mathrm{gen}(\theta)$ which guarantee a certain performance when the model sees new data. Here, we obtain an upper bound for the generalization error via covering numbers (see Methods) [61,117], and prove that the following theorem holds.
Theorem 4. Consider a QML problem with loss function as described in Eq. (4). Suppose that an $n$-qubit $S_n$-equivariant QNN $U(\theta)$ is trained on $M$ samples to obtain some trained parameters $\theta^*$. Then the following inequality holds with probability at least $1 - \delta$:
$$\mathrm{gen}(\theta^*) \leq O\left(\sqrt{\frac{\mathrm{Te}_{n+1}}{M}} + \sqrt{\frac{\log(1/\delta)}{M}}\right). \tag{20}$$
The crucial implication of Theorem 4 is that we can guarantee $\mathrm{gen}(\theta^*) \leq \epsilon$ with high probability, if $M \in O\left(\frac{\mathrm{Te}_{n+1} + \log(1/\delta)}{\epsilon^2}\right)$. For fixed $\delta$ and $\epsilon$, this implies $M \in O(n^3)$, i.e., we only need a polynomial number of training points. Also note that this result shows that minimizing the empirical loss closely minimizes the true loss with high probability. Say that $\hat{\mathcal{L}}^* = \inf_\theta \hat{\mathcal{L}}(\theta)$ is the minimal empirical loss and $\mathcal{L}^* = \inf_\theta \mathcal{L}(\theta)$ the minimal true loss. Then, with $M \in O\left(\frac{\mathrm{Te}_{n+1} + \log(1/\delta)}{\epsilon^2}\right)$ training points, the true loss attained by the parameters minimizing the empirical loss is within $O(\epsilon)$ of $\mathcal{L}^*$ with high probability.
Lastly, we remark that Theorem 4 can be readily adapted to other GQML models. As shown in Methods, this theorem stems from the fact that the equivariant unitary submanifold, in its block-diagonal form in Eq. (13), can be covered [117] by $\varepsilon$-balls in a block-wise manner. In Supplementary Methods 8, we also show that the VC dimension [118] of equivariant QNNs (and also more general parameterized channels) can be upper bounded by the dimension of the commutant of the symmetry group, a fact which could be of independent interest.
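To get a feeling for the sample-size scaling discussed above, one can tabulate $(\mathrm{Te}_{n+1} + \log(1/\delta))/\epsilon^2$ for a few system sizes. The sketch below is purely illustrative: the big-$O$ statement hides an unspecified constant, which is exposed here as a hypothetical `const` argument set to 1, so the returned numbers indicate scaling only and are not actual sample requirements.

```python
from math import ceil, comb, log

def tetrahedral(k):
    """k-th tetrahedral number, Te_k = C(k + 2, 3)."""
    return comb(k + 2, 3)

def samples_needed(n, eps, delta, const=1.0):
    """const * (Te_{n+1} + log(1/delta)) / eps^2, rounded up.

    `const` stands in for the constant hidden by the big-O notation.
    """
    return ceil(const * (tetrahedral(n + 1) + log(1 / delta)) / eps ** 2)

# Scaling with the number of qubits at fixed accuracy and confidence.
table = {n: samples_needed(n, eps=0.5, delta=0.1) for n in (4, 8, 16, 32)}
```

Doubling $n$ multiplies the estimate by roughly eight at large $n$, reflecting the cubic growth of $\mathrm{Te}_{n+1}$, in contrast with the exponential growth of the Hilbert-space dimension.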

Trainable States
As discussed above, $S_n$-equivariant QNNs and measurement operators cannot induce barren plateaus. Thus, the trainability of the model hinges on the behavior of $\Delta(\sum_\nu \sigma^\nu_\lambda)$. We note that this dataset-dependent trainability is not unique to $S_n$-equivariant QNNs, but is rather present in all results establishing the absence of barren plateaus (see Refs. [34,42,43,106,107,119]), as there always exist datasets for which an otherwise trainable model can be rendered untrainable.
Table I. Analytical method indicates that we can exactly compute the scaling of $\Delta(\sum_\nu \sigma^\nu_\lambda)$, whereas numerical one means that we evaluate it numerically. The analytical proofs and details of the simulations can be found in Supplementary Methods 7. We note that these results are obtained by computing the loss with a single data instance (i.e., for $M = 1$ in Eq. (3)).
To understand the conditions that lead to an exponentially vanishing $\Delta(\sum_\nu \sigma^\nu_\lambda)$, we note that for a Hermitian operator $B$, we have $\Delta(B) = D_{HS}\left(B, \frac{\mathrm{Tr}[B]}{\dim(B)}\mathbb{1}\right)$, where $D_{HS}(A, B) = \mathrm{Tr}[(A-B)^2]$ is the Hilbert-Schmidt distance. Alternatively, we can interpret $\Delta(B)$ as the variance of the eigenvalues of $B$. From here, we can see that one will obtain trainability if, for at least one $\lambda$, the operator $\sum_\nu \sigma^\nu_\lambda$ is not exponentially close to a multiple of the identity. In Table I we present examples of states for which $\Delta(\sum_\nu \sigma^\nu_\lambda)$ vanishes polynomially, leading to a trainable model, but also cases where the input state leads to exponentially vanishing $\Delta(\sum_\nu \sigma^\nu_\lambda)$ and thus to a barren plateau. While we leave the details of how each type of input state is generated for the Methods section, we note that the results in Table I demonstrate the critical role that the input states play in determining the trainability of a model (this will be further elucidated in numerical results below). Such insight is particularly important as one can create adversarial datasets yielding barren plateaus (see Supplementary Methods 10). Moreover, it indicates that care must be taken when encoding classical data into quantum states as the embedding scheme can induce trainability issues [42,119].
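The two readings of $\Delta$ mentioned above can be compared directly. The following sketch assumes $\Delta(B) = \mathrm{Tr}[B^2] - \mathrm{Tr}[B]^2/\dim(B)$ and the Hilbert-Schmidt distance $D_{HS}(A, B) = \mathrm{Tr}[(A-B)^\dagger(A-B)]$; with these conventions $\Delta(B)$ equals $\dim(B)$ times the (population) variance of the eigenvalues of $B$. The random test matrix and the function names are hypothetical.

```python
import numpy as np

def hs_distance(A, B):
    """Hilbert-Schmidt distance Tr[(A - B)^dagger (A - B)]."""
    D = A - B
    return float(np.real(np.trace(D.conj().T @ D)))

def delta(B):
    """Distance between B and the multiple of the identity with equal trace."""
    d = B.shape[0]
    return hs_distance(B, np.trace(B) / d * np.eye(d))

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
B = (A + A.conj().T) / 2                      # random Hermitian matrix
eigs = np.linalg.eigvalsh(B)

delta_direct = delta(B)
delta_from_spectrum = B.shape[0] * np.var(eigs)

# Two extreme cases on a 4-dimensional block.
delta_mixed = delta(np.eye(4) / 4)            # maximally mixed: no signal
pure = np.zeros((4, 4))
pure[0, 0] = 1.0
delta_pure = delta(pure)                      # pure state: 1 - 1/4
```

A block proportional to the identity gives $\Delta = 0$ and therefore contributes nothing to the gradient variance, while a pure state supported on the block gives the value $1 - 1/d$.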

Numerical results
Here, we consider the task of distinguishing connected from disconnected graph states, which are prepared as follows. First, we generate $n$-node random graphs from the Erdős-Rényi distribution [120], with an edge probability of 40%. The ensuing graphs are binned into two categories: connected and disconnected. We then embed these graphs into quantum graph states via the canonical scheme of [121,122] (see Methods section). We highlight that such an encoding preserves symmetries in the input data, in the sense that a permutation of the underlying graph yields a permutation of the qubits constituting its graph state (i.e., of the form of Eq. (8)). This allows us to create a dataset where half of the states encode connected graphs (label $y_i = +1$), and the other half encode disconnected graphs (label $y_i = -1$). To analyze the data, we use an $S_n$-equivariant QNN with generators in Eq. (9) (see also Fig. 2), and measure an $S_n$-equivariant observable. In the following, we characterize the trainability and generalization properties of $S_n$-equivariant QNNs for this classification task, but we note that further aspects of the problem are discussed in the Supplementary Note. These include analyzing the effect of the graph encoding scheme on the trainability, the irrep contributions to the gradient variance, and comparing $S_n$-equivariant QNNs against problem-agnostic ones. In particular, the latter shows that for the present graph classification task, problem-agnostic models are hard to train and tend to greatly overfit the data, i.e., they have large generalization errors despite performing well on the training data.
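The data-generation pipeline described above can be sketched as follows. This is a minimal illustration under assumptions rather than the code used for the reported simulations: the embedding is taken to be the standard graph-state construction (a controlled-$Z$ gate for every edge, applied to $|+\rangle^{\otimes n}$), the sampled dataset is not rebalanced to have equal class sizes, and all function names are hypothetical.

```python
import itertools
import numpy as np

def random_graph(n, p, rng):
    """Erdos-Renyi graph: each of the n(n-1)/2 edges is kept with prob. p."""
    return [(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p]

def is_connected(n, edges):
    """Depth-first search from node 0."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def graph_state(n, edges):
    """Amplitudes of prod_{(i,j)} CZ_ij |+>^n; qubit 0 is most significant."""
    idx = np.arange(2 ** n)
    bits = (idx[:, None] >> (n - 1 - np.arange(n))) & 1   # bits[x, q]
    sign = np.ones(2 ** n)
    for i, j in edges:
        sign *= 1 - 2 * (bits[:, i] & bits[:, j])
    return sign / np.sqrt(2 ** n)

def permute_qubits(psi, perm, n):
    """Move the qubit at position i to position perm[i]."""
    tensor = psi.reshape([2] * n)
    return np.transpose(tensor, np.argsort(perm)).reshape(-1)

def make_dataset(n, num_graphs, p=0.4, seed=0):
    """List of (state vector, label), label +1 if connected and -1 otherwise."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(num_graphs):
        edges = random_graph(n, p, rng)
        label = +1 if is_connected(n, edges) else -1
        data.append((graph_state(n, edges), label))
    return data

dataset = make_dataset(n=4, num_graphs=20)
```

Relabeling the nodes of a graph by a permutation and then building its graph state gives the same vector as building the state first and applying `permute_qubits`, which is the symmetry-preservation property highlighted in the text.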

Numerics on barren plateaus
In Fig. 5(a) we show the variance of the loss function partial derivatives for a parameter $\theta_\mu$ in the middle of the QNN. Each point is evaluated for a total of 50 random input states, and with 20 random sets of parameters $\theta$ per input. We can see that when the variance is evaluated for states randomly drawn from the whole dataset (with an equal number of connected and disconnected graphs), then $\mathrm{Var}_\theta[\partial_\mu \hat{\mathcal{L}}]$ only decreases polynomially with the system size (as evidenced by the curved line in the log-linear scale), meaning that the model does not exhibit a barren plateau. We note that, as shown in Fig. 5(a), when the input to the QNN is a disconnected graph state, the variance vanishes polynomially, whereas if we input a connected graph state it vanishes exponentially. This illustrates a key fact of QML: when trained over a dataset, the data from different classes can contribute very differently to the model's trainability (see [18] for a discussion on how this result enables new forms of classification).

Numerics on overparametrization
Following the results in Ref. [32], let us analyze the overparametrization phenomenon by studying the rank of the quantum Fisher information matrix (QFIM) [123,124], denoted $F(\theta)$ and whose entries are given by
$$[F(\theta)]_{jk} = 4\,\mathrm{Re}\big[\langle\partial_j\psi(\theta)|\partial_k\psi(\theta)\rangle - \langle\partial_j\psi(\theta)|\psi(\theta)\rangle\langle\psi(\theta)|\partial_k\psi(\theta)\rangle\big],$$
where $|\psi(\theta)\rangle = U(\theta)|\psi\rangle$ for a pure input state $|\psi\rangle$ and $\partial_j = \partial/\partial\theta_j$. The rank of the QFIM quantifies the number of potentially accessible directions in state space. In this sense, the model is overparametrized if the QFIM rank is saturated, i.e., if adding more parameters (or layers) to the QNN does not further increase the QFIM rank. When this occurs, one can access all possible directions in state space and efficiently reach the solution manifold [32,125,126]. On the other hand, the model is underparametrized if the QFIM rank is not maximal. In this case, there exist inaccessible directions in state space, leading to false local minima, that is, local minima that are not actual minima of the loss function.
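The rank-saturation diagnostic can be illustrated on a two-qubit toy model. The sketch below assumes the standard pure-state QFIM, $F_{jk} = 4\,\mathrm{Re}[\langle\partial_j\psi|\partial_k\psi\rangle - \langle\partial_j\psi|\psi\rangle\langle\psi|\partial_k\psi\rangle]$, layers cycling through collective $X$, $Y$ and $ZZ$ generators, and the product input $|00\rangle$ instead of a graph state; these are illustrative choices and the function names are hypothetical.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Two-qubit permutation-symmetric generators (assumed normalization).
GENS = [(np.kron(X, I2) + np.kron(I2, X)) / 2,
        (np.kron(Y, I2) + np.kron(I2, Y)) / 2,
        np.kron(Z, Z)]

def exp_minus_i(theta, H):
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * theta * w)) @ V.conj().T

def qfim(thetas, gens, psi0):
    """Pure-state QFIM using exact derivatives of the layered circuit."""
    layers = [exp_minus_i(t, H) for t, H in zip(thetas, gens)]
    psi = psi0
    for U in layers:
        psi = U @ psi
    grads = []
    for j in range(len(layers)):
        v = psi0
        for l, U in enumerate(layers):
            v = U @ v
            if l == j:
                v = -1j * gens[l] @ v      # derivative of layer j
        grads.append(v)
    G = np.array(grads)                    # row j holds |d_j psi>
    overlap = G.conj() @ G.T               # <d_j psi | d_k psi>
    proj = G.conj() @ psi                  # <d_j psi | psi>
    return 4 * np.real(overlap - np.outer(proj, proj.conj()))

def qfim_rank(num_layers, rng, psi0):
    gens = [GENS[l % 3] for l in range(num_layers)]
    thetas = rng.uniform(0, 2 * np.pi, size=num_layers)
    return int(np.linalg.matrix_rank(qfim(thetas, gens, psi0), tol=1e-8))

rng = np.random.default_rng(0)
psi0 = np.array([1, 0, 0, 0], dtype=complex)   # |00>
ranks = [qfim_rank(L, rng, psi0) for L in range(1, 13)]
```

Because $|00\rangle$ lives in the three-dimensional symmetric block and the circuit never leaves it, the rank can never exceed $2 \cdot 3 - 2 = 4$ no matter how many layers are added, which is the saturation behavior that defines the overparametrized regime.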
In Fig. 5(b, left panel) we report representative results of the QFIM rank versus the number of layers $L$ for problems with even numbers $n \in [4, 16]$ of qubits. These results correspond to random connected graphs and random values of $\theta$. Here we can see that, for a given $n$, as the number of layers increases, the rank of the QFIM also increases until it reaches a saturation point. Once this critical number of layers (denoted as $L_{\rm ovp}$) is reached, the model is considered to be overparametrized [32]. In Fig. 5(b, middle panel) we plot the scaling of $L_{\rm ovp}$ (for 10 random connected or disconnected graphs per system size) versus $n$, as well as the tetrahedral numbers $\mathrm{Te}_{n+1}$. As can be seen, in all cases, the overparametrization onset occurs for a number of layers $L_{\rm ovp} < \mathrm{Te}_{n+1}$, indicating efficient overparametrization.
To appreciate the practical effects of overparametrization, we report in Fig. 5(b, right panel) the optimization performance of $S_n$-equivariant QNNs as a function of the number $L$ of layers employed. All the optimizations are performed using the hinge loss function, with the L-BFGS-B optimization algorithm [127]. The system sizes are $n \in [4,16]$ qubits, and correspond to the graphs that were studied in the left panel and highlighted in the middle one. The relative loss error reported indicates how close an optimized QNN is to the best achievable model. Explicitly, it is defined as $|\mathcal{L}_L - \mathcal{L}_{\min}|/|\mathcal{L}_{\min}|$, where $\mathcal{L}_L$ is the loss achieved after optimization of a QNN with a given $L$, and where $\mathcal{L}_{\min}$ is the minimum loss achieved for any of the values of $L$ considered, i.e., $\mathcal{L}_{\min} = \min_L \mathcal{L}_L$ (we systematically verify that for sufficiently large $L$ all optimizations reliably converge to this same loss $\mathcal{L}_{\min}$). For every value of $n$ studied, we see that for a small number of layers the optimizer struggles to significantly minimize the loss. However, as $L$ increases, there exists a computational phase transition whereby the optimizer is able to easily identify optimal parameters and reach much smaller loss values. Notably, this computational phase transition occurs slightly before $L_{\rm ovp}$ (indicated by a dashed vertical line), meaning that even before the QFIM rank saturates, the model has sufficient directions to efficiently reach the solution manifold. Overall, we see that for a number of layers growing at most polynomially with $n$, one can ensure convergence to a solution of the model.

Numerics on generalization error
In Fig. 5(c) we study the generalization error of an overparametrized S n -equivariant QNN (with L = Te n+1 ) for different training dataset sizes M and with respect to test sets of size M test = 2×Te n+1 that are independently drawn from the training ones.Generalization errors are evaluated for random QNNs parameters θ and we report the 90-th percentile of the errors obtained, i.e., for δ = 90% in Eq. (20).In the plot, we show the normalized generalization error g(θ) = gen(θ) . We stress that such normalization can only increase the generalization errors obtained, and is only used in order to compare generalization errors across different values of n without artifacts resulting from loss concentration effects as the system sizes grow.As seen in Fig. 5(c), when the size of the training set is constant, the generalization error is also approximately constant across problem sizes.However, when the training set size scales with n, the generalization error decreases with n, with this even occurring for M = n.Notably, if M = Te n+1 ∈ Θ(n 3 ), we can see that the generalization error significantly decreases with problem size.That is, for this problem, we found generalization errors to be better than the scaling of the bounds derived in Eq. (20).

DISCUSSION
GQML has recently been proposed as a framework for systematically creating models with sharp geometric priors arising from the symmetries of the task at hand [18][19][20][21][22]. Despite its great promise, this nascent field has only seen heuristic success, as no true performance guarantees have been proved for its models. In this work we provide the first theoretical guarantees for GQML models aimed at problems with permutation invariance. Our first contribution is the introduction of the $S_n$-equivariant QNN architecture. Using tools from representation theory, we rigorously find that these QNNs present salient features such as absence of barren plateaus (and narrow gorges), generalization from very few data points, and a capability of being efficiently overparametrized. All these favorable properties can be viewed as direct consequences of the inductive biases embedded in the model, which greatly limit its expressibility [37,46,128]. Namely, these $S_n$-equivariant QNNs act only on the (polynomially large) multiplicity spaces of the qubit-defining representation of $S_n$. To complete our analysis, we performed numerical simulations for a graph classification task and heuristically found that the model's performance is even better than that predicted by our theoretical results.
Taken together, our results provide the first rigorous guarantees for equivariant QNNs, and demonstrate that GQML may be a powerful tool in the QML repertoire.We highlight that while we focus on problems with S n symmetry, many of our proof techniques hold for general finite-dimensional compact groups.Hence, we hope that the representation-theory-based techniques used here can serve as blueprints to analyze the performance of other models.We envision that in the near future, GQML models with provable guarantees will be widely spread among the QML literature.
Finally, we note that while our results were derived in the absence of noise, it would be interesting to account for hardware imperfections.Clearly, the presence of noise would change our analysis, and most likely weaken our trainability guarantees.As such, while we can guarantee that S n -equivariant QNNs will be useful on fault-tolerant quantum devices, we do not abandon hope that they can be used in the near-term era provided that noise levels are small enough.
Note added: In light of the recent preprint [129], we have added a detailed discussion in the Supplementary Note regarding the possibility of classically simulating $S_n$-equivariant QNNs. As we argue there, for most relevant cases in QML, the algorithm in [129] is not fully classical, as it requires access to a quantum computer to obtain a "classical description" of the input data. Moreover, even if one is given such a "classical description", the ensuing algorithm that replaces the use of a QNN scales extremely poorly with the number of qubits. Taken together, these results indicate that if one has access to a quantum computer, it is not entirely obvious whether one should use it to obtain a classical description of the data followed by expensive post-processing, or if one should run the QNN on the quantum device and exploit its favorable properties like efficient overparametrization and absence of barren plateaus. We will save such a comparison for future work.
Now we will briefly compare $S_n$-equivariant QNNs to other barren-plateau-avoiding architectures.
First, let us consider the shallow hardware efficient ansatz (HEA) [34,130] and the quantum convolutional neural network (QCNN) [60,106]. While our goal is not to provide a comprehensive description of these models, we recall the three key properties leading to their trainability: locality of the gates, shallowness of the circuit, and locality of the measurement operator. Both the HEA and QCNN are composed of parametrized gates acting in a brick-like fashion on alternating pairs of neighboring qubits (local gates), and are composed of only a few (logarithmically many) layers of such gates (shallowness of the circuit). The combination of these two factors leads to a low scrambling power and greatly limited expressibility of the QNN. Then, the final ingredient for their trainability is measuring a local operator (i.e., an operator acting non-trivially on a small number of qubits). While this assumption is guaranteed for QCNNs (due to their feature-space reduction property), the HEA can be shown to be untrainable for global measurements (i.e., operators acting non-trivially on all qubits). Here we can already see that $S_n$-equivariant QNNs do not share the properties leading to trainability in HEAs and QCNNs. To begin, we can see from the set of generators $\mathcal{G}$ in Eq. (9) that the $S_n$-equivariant architecture allows for all long-range interactions in each layer, breaking the locality of gates assumption. Moreover, and in stark contrast to HEAs, one can train the $S_n$-equivariant QNN even when measuring global observables (for instance, we allow for $O = \prod_{j=1}^n X_j$ in Eq. (10)). Finally, we remark that HEAs and QCNNs cannot be efficiently overparametrized, as they require an exponentially large number of parameters to reach overparametrization [43]. On the other hand, according to Theorem 3 the $S_n$-equivariant QNN can be overparametrized with polynomially many layers.
Next, let us consider the transverse field Ising model Hamiltonian variational ansatz (TFIM-HVA) [43,45]. The mechanism leading to absence of barren plateaus in this architecture is more closely related to that of the $S_n$-equivariant model, although there are still some crucial differences. On the one hand, it can be shown that the TFIM-HVA has an extremely limited expressibility, having only a maximum number of free parameters in $O(n^2)$, and being able to reach overparametrization with polynomially many layers. While this is similar to the case of $S_n$-equivariant architectures (see Lemma 2 and Theorem 3), the block diagonal structure of the TFIM-HVA is fundamentally different from that arising from $S_n$-equivariance: the TFIM-HVA unitary has four exponentially large blocks repeated a single time each, while $S_n$-equivariant unitaries have polynomially small blocks repeated exponentially many times. This subtle, albeit important, distinction makes it such that $S_n$-equivariant QNNs enjoy generalization guarantees (from Theorem 4) which are not directly applicable to TFIM-HVA architectures.
The previous discussion shows that $S_n$-equivariant QNNs stand out amid the other trainable architectures, exhibiting many favorable properties that other models only partially enjoy.
Lastly, we now consider future directions and possible extensions of our work.We recall that Definition 3 requires every layer of the QNN to be equivariant.This is evidently not general, as one could have several consecutive layers which are not individually equivariant, but compose to an equivariant unitary for certain θ [18,131].While in this manuscript we do not consider this scenario, it is worth exploring how less strict equivariance conditions affect the performance and the trainability guarantees here derived.Second, we note that as indicated in this work, the block diagonal structure of the S n -equivariant QNN restricts the information in the input data that the model can access.This could lead to conditions where the model cannot solve the learning task as it cannot 'see' the relevant information in the input states.Such issue can be in principle solved by allowing the model to simultaneously act on multiple copies of the data, and even to change the representation of S n throughout the circuit [23].We also leave this exploration for future work.
Another potentially interesting research direction would be equivariant embeddings and re-uploading of classical data. For the purposes of this work, we make no assumptions about the source or form of the data, such as whether it is quantum or classical. However, when analyzing classical data on a quantum computer, embeddings become important. We give one such example, which we call a "fixed Hamming-weight encoding". Another example is the standard encoding of a graph into a graph state, which we considered in our numerics. This list is far from exhaustive and more sophisticated methods exist, including trainable encodings [54]. Similarly, we have not studied how our results change in the presence of data re-uploading [132]. We know that if the data is re-uploaded via equivariant generators (e.g., if the data re-uploading unitary takes the form $V(x) = \prod_l e^{-i x_l H_l}$, with $H_l$ being $S_n$-equivariant), then our theoretical guarantees do not change. This follows from the fact that the DLA of the circuit will remain the same, and hence our results follow. We leave the study of more general encoding and re-uploading schemes for future work.

METHODS
This section provides an overview of the different tools used in the main text.Here we also present a sketch of the proof of our main results.Full details can be found in the Supplementary Methods.

Building Sn-equivariant operators
Here we briefly describe how to build $S_n$-equivariant operators that can be used as generators of the QNN, or as measurement operators. In particular, we will focus on the so-called twirling method [19,23]. Take a unitary representation $R$ of a discrete group $G$ over a vector space $V$. Then the twirl operator is the linear map $T_G: GL(V) \to GL(V)$, defined as
$$T_G(A) = \frac{1}{|G|}\sum_{g\in G} R(g)\, A\, R(g)^\dagger.$$
It can be readily verified that the twirling of any operator $A$ yields a $G$-equivariant operator, i.e., we have
$$[T_G(A), R(g)] = 0 \quad \forall g \in G.$$
The previous allows us to obtain a $G$-equivariant operator from any operator $A \in GL(V)$. For instance, let us consider the case of $G = S_n$, $R$ the qubit-defining representation, and $A = X_1$. Then, we have
$$T_G(X_1) = \frac{1}{n!}\sum_{\pi\in S_n} R(\pi)\, X_1\, R(\pi)^\dagger = \frac{1}{n}\sum_{j=1}^n X_j,$$
and the same operator is obtained by twirling $X_j$ for any $1 \leqslant j \leqslant n$. Note that twirling over $S_n$ cannot change the locality of an operator. That is, twirling a $k$-body operator leads to a sum of $k$-body operators.
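The twirl over $S_n$ can be checked numerically for a few qubits. The sketch below is an illustrative brute-force construction (the helper names `perm_matrix`, `pauli_on` and `twirl` are ours, and the cost grows as $n!\,4^n$, so it is only meant for very small $n$). It twirls a single-qubit Pauli $X$ over the qubit-permuting representation and compares the result to the average of $X_j$ over all qubits.

```python
import itertools
import numpy as np

def perm_matrix(perm, n):
    """Qubit-permuting representation R(pi): qubit q is moved to position perm[q]."""
    dim = 2 ** n
    R = np.zeros((dim, dim))
    for idx in range(dim):
        bits = [(idx >> (n - 1 - q)) & 1 for q in range(n)]
        new_bits = [0] * n
        for q in range(n):
            new_bits[perm[q]] = bits[q]
        new_idx = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
        R[new_idx, idx] = 1.0
    return R

def pauli_on(P, j, n):
    """Single-qubit operator P acting on qubit j of an n-qubit register."""
    out = np.array([[1.0]])
    for q in range(n):
        out = np.kron(out, P if q == j else np.eye(2))
    return out

def twirl(A, n):
    """Average of R(pi) A R(pi)^dagger over all pi in S_n."""
    perms = list(itertools.permutations(range(n)))
    total = np.zeros_like(A, dtype=float)
    for p in perms:
        R = perm_matrix(p, n)
        total += R @ A @ R.T   # R is a real permutation matrix, so R^dagger = R.T
    return total / len(perms)

n = 3
X = np.array([[0.0, 1.0], [1.0, 0.0]])
twirled = twirl(pauli_on(X, 0, n), n)
average_X = sum(pauli_on(X, j, n) for j in range(n)) / n
```

Here `twirled` agrees with `average_X`, and it commutes with every `perm_matrix(p, n)`, which is the equivariance condition stated above.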

Representation theory of Sn
In this section we review a few basic notions from representation theory. For a more thorough treatment we refer the reader to Refs. [133][134][135][136], and more specifically to the tutorial in Ref. [25], which provides an introduction to representation theory from the perspective of QML. We recall that we are interested in the qubit-defining representation of $S_n$, i.e., the one permuting qubits
$$R(\pi)\bigotimes_{i=1}^n |\psi_i\rangle = \bigotimes_{i=1}^n |\psi_{\pi^{-1}(i)}\rangle.$$
As mentioned in the main text, representations break down into fundamental building blocks called irreducible representations (irreps).
Definition 4 (Irrep decomposition). Given some unitary representation $R$ of a compact group $G$, there exists a basis under which it takes a block diagonal form
$$R(g) \cong \bigoplus_\lambda \bigoplus_{\mu=1}^{m_{r_\lambda}} r_\lambda(g) = \bigoplus_\lambda r_\lambda(g) \otimes \mathbb{1}_{m_{r_\lambda}}, \qquad (22)$$
with $r_\lambda(g)$ irreps of $G$ appearing $m_{r_\lambda}$ times.
The irreps of the symmetric group are commonly labeled by the set of partitions of the integer $n$. A partition of a positive integer $n \in \mathbb{N}$ is a non-increasing sequence of positive integers $\lambda = (\lambda_1, \cdots, \lambda_k)$ satisfying $\sum_i \lambda_i = n$. Partitions are typically visualized using Young diagrams, a set of empty, left-justified boxes arranged in rows such that there are $\lambda_i$ boxes in the $i$-th row. For instance, the integer $n = 3$ can split into
$$(3), \qquad (2,1), \qquad (1,1,1). \qquad (23)$$
We note that in the case of the qubit-defining representation, the only $\lambda$ appearing in Eq. (22) have at most two rows (e.g., would not include the last partition in Eq. (23)).
The dimension of an $S_n$ irrep $r_\lambda$ can be computed from the hook length formula
$$\dim(r_\lambda) = \frac{n!}{\prod_{b\in\lambda} h_\lambda(b)}, \qquad (24)$$
where each $h_\lambda(b)$ is the hook length for box $b$ in $\lambda$, which is the total number of boxes in a 'hook' (or 'L' shape) composed of box $b$ and every box beneath it (in the same column) and to its right (in the same row).
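As a small worked example of the hook length formula, the following sketch computes irrep dimensions for an arbitrary partition. The function name is ours, and the partition is assumed to be a non-empty, non-increasing tuple of positive integers.

```python
from math import factorial

def hook_length_dim(partition):
    """Dimension of the S_n irrep labeled by `partition` (rows of the Young diagram)."""
    n = sum(partition)
    # Column lengths of the diagram, i.e., the conjugate partition.
    columns = [sum(1 for row in partition if row > c) for c in range(partition[0])]
    product = 1
    for i, row in enumerate(partition):
        for j in range(row):
            arm = row - j - 1          # boxes to the right, in the same row
            leg = columns[j] - i - 1   # boxes beneath, in the same column
            product *= arm + leg + 1   # hook length of box (i, j)
    return factorial(n) // product

dims_n3 = {lam: hook_length_dim(lam) for lam in [(3,), (2, 1), (1, 1, 1)]}
```

For $n = 3$ this gives dimensions 1, 2 and 1, whose squares sum to $3! = 6$ as expected for the irreps of a finite group.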
Given the block-diagonal structure of $R$ in Eq. (22), one can see that a general $G$-equivariant operator has to be of the form
$$A \cong \bigoplus_\lambda \mathbb{1}_{\dim(r_\lambda)} \otimes A_\lambda,$$
where $A_\lambda$ are $m_{r_\lambda}$-dimensional matrices repeated $\dim(r_\lambda)$ times. In general, the number of times an irrep appears in an arbitrary representation $R$ (i.e., $m_{r_\lambda}$ in Eq. (22)) can be determined through character theory. Instead, in our case, we will take a shortcut and exploit one of the most remarkable results in representation theory, called the Schur-Weyl duality [137].
Consider the representation $Q$ of the unitary group $U(2)$ acting on $\mathcal{H} = (\mathbb{C}^2)^{\otimes n}$ through the $n$-fold tensor product $Q(W \in U(2)) = W^{\otimes n}$. Evidently, according to Eq. (22), $Q$ will also have an isotypic decomposition
$$Q(W) \cong \bigoplus_s q_s(W) \otimes \mathbb{1}_{m_{q_s}},$$
where $s$ labels the different (spin) irreps of $U(2)$. The Schur-Weyl duality states that the matrix algebras $\mathbb{C}[R]$ and $\mathbb{C}[Q]$ mutually centralize each other, meaning that $\mathbb{C}[R]$ is the space of $U(2)$-equivariant linear operators, and similarly $\mathbb{C}[Q]$ is the space of $S_n$-equivariant ones. As a consequence of this duality, $\mathcal{H}$ can be decomposed as
$$\mathcal{H} \cong \bigoplus_\lambda V_\lambda \otimes W_\lambda,$$
where $\lambda$ simultaneously labels irrep spaces $V_\lambda$ and $W_\lambda$ for $S_n$ and $U(2)$, respectively. That is, $\mathcal{H}$ supports a simultaneous action of $S_n$ and $U(2)$, where the irreps of each appear exactly once and are correlated: each of the two-row Young diagrams $\lambda = (n-m, m)$ labeling the irreps in $R$ can be associated unequivocally with a spin label $s(\lambda) = (n-2m)/2$ for an irrep of $U(2)$. Moreover, since under the joint action of $S_n \times U(2)$ the multiplicities are one, one can assert that the irrep $q_\lambda$ of $U(2)$ appears $\dim(r_\lambda)$ times in $Q$, and conversely, the irrep $r_\lambda$ of $S_n$ appears $\dim(q_\lambda)$ times in $R$. Using the well-known dimension of spin irreps $\dim(q_s) = 2s+1$, we can derive an expression for the multiplicity of $S_n$ irreps
$$m_{r_\lambda} = \dim(q_{s(\lambda)}) = 2s(\lambda) + 1 = n - 2m + 1.$$
Also, it is straightforward to adapt the formula in Eq. (24) to two-row diagrams $\lambda = (n-m, m)$
$$\dim(r_\lambda) = \frac{n!\,(n-2m+1)}{(n-m+1)!\; m!}.$$
We finally note that, since we are ultimately interested in $S_n$-equivariant operators, in the main text we have defined $d_\lambda \equiv m_{r_\lambda}$ and $m_\lambda \equiv \dim(r_\lambda)$. That is, the dimension and multiplicity of an irrep in the main text are those of the irreps of $U(2)$.
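A quick consistency check of the Schur-Weyl bookkeeping: for two-row diagrams $\lambda = (n-m, m)$, the block dimensions $d_\lambda = n - 2m + 1$ and the multiplicities $m_\lambda$ given by the two-row hook length formula must together account for the full Hilbert space, $\sum_\lambda d_\lambda m_\lambda = 2^n$. The sketch below (function name ours) tabulates both quantities.

```python
from math import factorial

def two_row_irreps(n):
    """List (partition, d_lambda, m_lambda) for all two-row diagrams (n - m, m)."""
    table = []
    for m in range(n // 2 + 1):
        d_lam = n - 2 * m + 1   # dimension of the spin-(n - 2m)/2 irrep of U(2)
        m_lam = factorial(n) * (n - 2 * m + 1) // (factorial(n - m + 1) * factorial(m))
        table.append(((n - m, m), d_lam, m_lam))
    return table

table_n4 = two_row_irreps(4)
total_dim_n4 = sum(d * mult for _, d, mult in table_n4)
```

For $n = 4$ the table reads $d_\lambda = 5, 3, 1$ with $m_\lambda = 1, 3, 2$, and `total_dim_n4` equals $2^4 = 16$.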

Universality, expressibility and dynamical Lie algebra
In the main text we have argued that the set of generators in Eq. ( 9) is universal within each invariant subspace.Here we will formalize this statement.
First, let us recall that we say that a parametrized unitary is universal if it can generate any unitary (up to a global phase) in the space over which it acts.One can quantify the capacity of being able to create different unitaries through the so-called measures of expressibility [37,43,46,128].Here we will focus on the notion of potential expressibility of a given QNN, which is formalized via the dynamical Lie algebra of the architecture [138].
Definition 5 (Dynamical Lie algebra).Given a set of generators G defining a QNN, its dynamical Lie algebra g is the span of the Lie closure ⟨•⟩ Lie of G.That is, g = span R ⟨G⟩ Lie , where ⟨G⟩ Lie is defined as the set of all the nested commutators generated by the elements of G.
In particular, the dynamical Lie algebra (DLA) fully characterizes the group of unitaries that can be ultimately expressed by the circuit: for any unitary U realized by a QNN with generators in G there exists an anti-hermitian operator η ∈ g = ⟨G⟩ Lie such that U = e η .Evidently, g ⊆ u(d), that is, it is a subalgebra of the space of antihermitian operators.When g is su(d) or u(d) we say that the QNN is controllable or universal since for any pair of states |ψ⟩ and |ϕ⟩, there exists a unitary U = e η with η ∈ g such that |⟨ϕ|U |ψ⟩| 2 = 1.
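For small systems, the dimension of a DLA can be estimated numerically by repeatedly taking commutators until no new linearly independent element appears. The sketch below is a generic brute-force routine under our own naming and tolerance choices (it is not taken from the references above): it keeps a real-orthonormal basis of the vectorized anti-Hermitian operators and grows it with nested commutators.

```python
import numpy as np

def dla_dimension(generators, tol=1e-9):
    """Real dimension of the Lie closure of {i H : H in generators} (H Hermitian, nonzero)."""
    basis = []   # real-orthonormal vectors representing the current span
    ops = []     # the corresponding (normalized) operators

    def try_add(A):
        A = A / np.linalg.norm(A)
        v = np.concatenate([A.real.ravel(), A.imag.ravel()])
        for b in basis:                      # Gram-Schmidt over the reals
            v = v - (b @ v) * b
        if np.linalg.norm(v) > tol:
            basis.append(v / np.linalg.norm(v))
            ops.append(A)
            return True
        return False

    for H in generators:
        try_add(1j * np.asarray(H, dtype=complex))

    frontier = list(ops)
    while frontier:
        new = []
        for A in frontier:
            for B in list(ops):
                C = A @ B - B @ A
                if np.linalg.norm(C) > tol and try_add(C):
                    new.append(ops[-1])
        frontier = new
    return len(basis)

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

dim_su2 = dla_dimension([X, Y])                               # generates su(2)
dim_abelian = dla_dimension([np.kron(Z, I2), np.kron(I2, Z)])  # commuting generators
```

The single-qubit generators $\{X, Y\}$ give the three-dimensional algebra $\mathfrak{su}(2)$, so that circuit is controllable, whereas two commuting generators span only a two-dimensional abelian algebra.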
In the framework of GQML one designs symmetry-respecting QNNs by using group-equivariant generators. This implies that the corresponding DLA is constrained and necessarily takes the form
$$\mathfrak{g} \cong \bigoplus_\lambda \mathbb{1}_{m_\lambda} \otimes \mathfrak{g}_\lambda,$$
where $\mathfrak{g}_\lambda \subseteq \mathfrak{u}(d_\lambda)$. For this scenario, we provide a notion of controllability restricted to each of the invariant subspaces: we say that a QNN is subspace-controllable in the isotypic component $\lambda$ if $\mathfrak{g}_\lambda$ is $\mathfrak{su}(d_\lambda)$ or $\mathfrak{u}(d_\lambda)$. This means that the QNN can map between any pair of states in every $\mathcal{H}^\nu_\lambda$. Notably, the following result follows from Refs. [92,139].
As shown below, this result will be crucial for the proof of Theorem 1.

Proof of absence of barren plateaus
Here we sketch our proof of Theorem 1. Our goal is to calculate $\mathrm{Var}_\theta[\partial_\mu \mathcal{L}(\theta)]$. In general, we will have to deal with integrals of the form
$$\int d\theta\, \mathcal{D}_\theta(\theta)\, f(\theta),$$
where $f$ is some parametrized function (for example the cost function or its partial derivatives) and $\mathcal{D}_\theta: [0, 2\pi]^M \to [0, 1]$ is some distribution over parameter space (typically the uniform distribution). The first step is to transform the integration over parameter space to an integration over the resulting QNN unitary distribution $\mathcal{D}$. Since $\mathcal{D}$ is known to converge (given enough depth) to $\epsilon$-approximate 2-designs over the Lie group $e^{\mathfrak{g}}$ [43,140], assuming $f$ is a polynomial of degree $\leqslant 2$ in the entries of $U$ (as is the case of interest), we can replace the integration over $\mathcal{D}$ with an integration over the Haar measure over $e^{\mathfrak{g}}$. In general, $\mathfrak{g}$ is a reductive Lie algebra consisting of multiple orthogonal ideals $\mathfrak{g} = \bigoplus_\lambda \mathfrak{g}_\lambda$, where $\mathfrak{g}_\lambda$ is either simple or abelian, and the Lie group $e^{\mathfrak{g}}$ is the product group $\prod_\lambda e^{\mathfrak{g}_\lambda}$. It can be shown (see Supplementary Methods 4) that the Haar measure over such a product group is the product of the Haar measures over the normal subgroups $e^{\mathfrak{g}_\lambda}$. Finally, the ansatz with generators in Eq. (9) has a DLA $\mathfrak{g}$ that is subspace controllable, meaning that each simple $\mathfrak{g}_\lambda$ is either $\mathfrak{su}(d_\lambda)$ or $\mathfrak{u}(d_\lambda)$ [92,139]. Summarizing, we have
$$\int d\theta\, \mathcal{D}_\theta(\theta)\, f(\theta) \;\longrightarrow\; \int_{e^{\mathfrak{g}}} d\mu(U)\, f(U) = \int \prod_\lambda d\mu(U_\lambda)\, f(U). \qquad (31)$$
The main advantage of Eq. (31) is that we can use tools from Weingarten calculus to perform symbolic integration over the Haar measure of unitary groups [141]. Explicitly, we care for the variance of
$$\partial_\mu \ell_\theta(\rho) = i\,\mathrm{Tr}\big[U_B\, \rho\, U_B^\dagger\, [H_\mu, U_A^\dagger O U_A]\big],$$
where $U_B$ and $U_A$ denote the unitary circuits before and after the parametrized gate we are differentiating. Assuming that the depth $L$ of the QNN is enough to guarantee that both $U_A$ and $U_B$ form independent 2-designs on $e^{\mathfrak{g}}$, we can use Weingarten calculus to evaluate the resulting integrals, and obtain Eq. (16) in Theorem 1. The details of this calculation are presented in Supplementary Methods 4.
While the previous, along with the results in Theorem 2, allows one to prove by direct construction that $S_n$-equivariant QNNs do not lead to barren plateaus, we here provide further intuition for this result in terms of the expressibility reduction induced by the equivariance inductive biases. As shown in Ref. [37], QNNs that are too expressible exhibit exponentially vanishing gradients, whereas models whose expressibility is restricted can exhibit large gradients. Hence, we can expect the result in Corollary 1 to be a direct consequence of the reduced expressibility of the model. We can further formalize this statement using the results of Ref. [43]. Therein, it was found that there exists a link between the presence or absence of barren plateaus and the dimension of the DLA. In particular, the authors conjecture, and prove for several examples (see also Ref. [142] for an independent verification of the conjecture), that deep QNNs have gradients that scale inversely with the size of the DLA, that is, $\mathrm{Var}_\theta[\partial_\mu \mathcal{L}(\theta)] \sim 1/\mathrm{poly}(\dim(\mathfrak{g}))$. For the case of $S_n$-equivariant QNNs we know from Lemma 3 that $\dim(\mathfrak{g}) \in \Theta(n^3)$, thus indicating that the variance should only vanish polynomially with $n$ (for an appropriate dataset). We note that this conjecture was recently proven [140,143].

Intuition behind the overparametrization phenomenon
Recently, Ref. [32] studied the overparametrization of QNNs from the perspective of a complexity phase transition in the loss landscape.In the underparametrized regime, we experience rough loss landscapes, which in turn can be traced back to a lack of control in parametrized state space.When the number of parameters is below the number of directions in state space, the parameter update can only access a subset of those potential directions.This constraint can be shown to introduce false local minima, that is, local minima that are not actual minima of the loss function (as a function of state space) but instead artifacts of a poor parametrization.Instead, upon introduction of more parameters the parametrized state starts accessing these previously unavailable directions, and false minima disappear as we transition into the overparametrized regime.Because in the overparametrized regime the number of parameters is greater than the number of ever accessible directions, solutions in the control landscape are degenerate and form multidimensional submanifolds, allowing the optimizer to reach them more easily [125,126].
The main contribution in Ref. [32] is the realization that, under standard assumptions, one needs one parameter per potentially accessible direction in state space, and that the latter can be formalized as the dimension of the orbit of the initial state under the Lie group e g resulting from the exponential of the DLA g.In particular, this means that exponential DLA architectures require an exponential number of parameters to be overparametrized, whereas polynomial DLA architectures only need a polynomial number of them.
With these definitions, the proof of Theorem 3 is immediate.Since the ansatz is subspace controllable (Lemma 3), the dimension of the DLA is equal to the dimension of the commutant, which is Θ(n 3 ) (Lemma 2).
To finish, we note that the definition of overparametrization employed here (in terms of saturating the number of available directions) might differ from some definitions of overparametrization in the classical neural network community. Namely, in classical machine learning some researchers have studied overparametrization through the lens of generalization [109,144-147], while others have investigated the effect of overparametrization on the training process. In particular, it has been proposed that the onset of overparametrization can be detected using metrics such as parameter redundancy, which is captured by the rank of the classical Fisher information matrix [148][149][150]. It is precisely this notion of overparametrization that Ref. [32] ported to the quantum setting, and the one used in the present work.

Generalization
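We consider the QML setting in this paper, where the empirical loss function is of the form
$$\mathcal{L}(\theta) = \sum_{i=1}^M c_i\, \mathrm{Tr}\big[U(\theta)\, \rho_i\, U(\theta)^\dagger O\big].$$
We assume that the operator norm of $O$ is bounded by a constant and also that $|c_i| \leqslant 1/M$. We follow closely the covering number-based generalization bound in [61]. First recall that a set $V$ is $\varepsilon$-covered by a subset $K \subseteq V$ with respect to a distance metric $d$ if $\forall x \in V$, $\exists y \in K$ such that $d(x, y) \leqslant \varepsilon$. The $\varepsilon$-covering number (w.r.t. metric $d$) of $V$, denoted as $\mathcal{N}(V, d, \varepsilon)$, is the cardinality of the smallest such subset [117]. The following theorem bounds the $\varepsilon$-covering number of $S_n$-equivariant QNNs.
Theorem 5. The $\varepsilon$-covering number of the set $V_n$ of $n$-qubit unitary $S_n$-equivariant QNNs w.r.t. the operator norm $\|\cdot\|$ can be bounded as
$$\mathcal{N}(V_n, \|\cdot\|, \varepsilon) \leqslant \left(\frac{6}{\varepsilon}\right)^{2\,\mathrm{Te}_{n+1}}.$$
Proof. Recall that an $S_n$-EQNN $U$ can be block-diagonalized as $U \cong \bigoplus_\lambda \mathbb{1}_{m_\lambda} \otimes U_\lambda$, where each $U_\lambda$ must be unitary for $U$ to be unitary. Let $\mathbb{U}(d_\lambda)$ denote the set of all unitaries of dimension $d_\lambda$. Following Lemma 6 in [61] and Section 4.2 in [151] we can bound the $\varepsilon$-covering number of $\mathbb{U}(d_\lambda)$ as follows
$$\mathcal{N}(\mathbb{U}(d_\lambda), \|\cdot\|, \varepsilon) \leqslant \left(\frac{6}{\varepsilon}\right)^{2 d_\lambda^2}.$$
Next, we construct an $\varepsilon$-covering subset of the $S_n$-equivariant unitary set, $V_n$, from the $\varepsilon$-covering subsets, $K_\lambda$, of the blocks $\lambda$. Indeed, given any $U \cong \bigoplus_\lambda \mathbb{1}_{m_\lambda} \otimes U_\lambda \in V_n$, we can pick for each $\lambda$ an element $\widetilde{U}_\lambda \in K_\lambda$ with $\|U_\lambda - \widetilde{U}_\lambda\| \leqslant \varepsilon$, so that $\widetilde{U} \cong \bigoplus_\lambda \mathbb{1}_{m_\lambda} \otimes \widetilde{U}_\lambda$ satisfies $\|U - \widetilde{U}\| = \max_\lambda \|U_\lambda - \widetilde{U}_\lambda\| \leqslant \varepsilon$. Therefore, there exists an $\varepsilon$-covering net of $V_n$ of size
$$\prod_\lambda |K_\lambda| \leqslant \prod_\lambda \left(\frac{6}{\varepsilon}\right)^{2 d_\lambda^2} = \left(\frac{6}{\varepsilon}\right)^{2\sum_\lambda d_\lambda^2} = \left(\frac{6}{\varepsilon}\right)^{2\,\mathrm{Te}_{n+1}},$$
concluding the proof.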
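The key step of the covering argument is that the operator norm of a block-diagonal matrix equals the largest operator norm among its blocks, so covering each block to precision $\varepsilon$ covers the whole equivariant unitary. The sketch below checks this numerically for the $n = 4$ block structure ($d_\lambda = 5, 3, 1$ with $m_\lambda = 1, 3, 2$) and evaluates the logarithm of the bound $(6/\varepsilon)^{2\mathrm{Te}_{n+1}}$. All helper names, the random perturbation and the seed are illustrative assumptions.

```python
import numpy as np

def random_unitary(d, rng):
    """A random d x d unitary from the QR decomposition of a Gaussian matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, _ = np.linalg.qr(z)
    return q

def block_diagonal(blocks, mults):
    """Direct sum over lambda of (identity of size m_lambda) tensor (block lambda)."""
    mats = [np.kron(np.eye(m), B) for B, m in zip(blocks, mults)]
    dim = sum(M.shape[0] for M in mats)
    out = np.zeros((dim, dim), dtype=complex)
    pos = 0
    for M in mats:
        k = M.shape[0]
        out[pos:pos + k, pos:pos + k] = M
        pos += k
    return out

def log_covering_bound(n, eps):
    """Natural log of (6 / eps) ** (2 * Te_{n+1}), with Te_{n+1} = C(n + 3, 3)."""
    te = (n + 3) * (n + 2) * (n + 1) // 6
    return 2 * te * np.log(6.0 / eps)

rng = np.random.default_rng(0)
dims, mults = [5, 3, 1], [1, 3, 2]                  # n = 4 qubits
blocks = [random_unitary(d, rng) for d in dims]
# Perturbed blocks play the role of nearby covering-net elements.
nearby = [B + 0.01 * rng.normal(size=B.shape) for B in blocks]

full_distance = np.linalg.norm(block_diagonal(blocks, mults)
                               - block_diagonal(nearby, mults), 2)
max_block_distance = max(np.linalg.norm(B - C, 2) for B, C in zip(blocks, nearby))
```

Up to floating-point error, `full_distance` equals `max_block_distance`, independently of the multiplicities, which is why only $\sum_\lambda d_\lambda^2$ (and not the exponentially large $m_\lambda$) enters the exponent of the bound.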
Having established this bound on the ε-covering numbers of S n -EQNN, we apply a known result from [61] (with some extra care) to obtain Theorem 4.
Proof of Theorem 4. We assume knowledge of Theorem 6 in [61]. In step two of the proof, where the authors use the chaining argument [152] to bound the generalization error, notice that the covering number $\mathcal{N}_j$ in their Eq. (64) is replaced by $(6/\varepsilon)^{2\mathrm{Te}_{n+1}}$ in our case. In other words, there is no architecture-dependence (the number of gates $T$ in their case) inside the logarithm in the resulting Eq. (65). Applying this change to the rest of their proof leads to our claimed generalization bound.
We note that in the previous derivation, we have used knowledge of the isotypic decomposition of the $S_n$-equivariant QNN, which allows us to obtain a specialized generalization error bound that does not follow from a direct application of the results in [61].

Trainable and untrainable states
Here we describe how the states in Table I are obtained. The "symmetric states" are obtained from the symmetric subspace [153], i.e., the set of states $\{|\psi\rangle \in \mathcal{H} \,|\, R(\pi)|\psi\rangle = |\psi\rangle, \forall \pi \in S_n\}$. The so-called "fixed Hamming-weight encoded" states correspond to states representing classical data: given an array of real values $\{x_i\}$ such that $\sum_i x_i^2 = 1$, each $x_i$ is encoded as the weight of a unique bitstring $z$ of Hamming weight $k$, where $k$ is some fixed constant. That is, we prepare the state $|x\rangle = \sum_{z\ \mathrm{s.t.}\ w(z)=k} x_z |z\rangle$, where we are now indexing $x_i$ with a bitstring $z$. "Local Haar random" states are obtained by preparing the state $|0\rangle^{\otimes n}$ and applying a Haar random single-qubit unitary to each qubit. "Global Haar random" states are obtained by preparing the state $|0\rangle^{\otimes n}$ and applying a random $n$-qubit unitary sampled from the Haar measure over $\mathbb{U}(d)$. The "fixed and linear depth random circuit" states correspond to the states obtained by preparing the state $|0\rangle^{\otimes n}$ and respectively applying a constant-depth or linear-depth layered hardware-efficient quantum circuit [34,130] with random parameters. For the "graph states", we use a canonical encoding to embed a graph into a quantum state [121,122]. Specifically, to create a graph state, one starts with the state $|+\rangle^{\otimes n}$ and applies a controlled-Z gate for every edge in the graph. We consider 3-regular and $n/2$-regular graphs, as well as random graphs generated according to the Erdős-Rényi model [120].
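Two of these state families are simple enough to write down directly as state vectors. The following sketch is a plain NumPy illustration (function names, qubit ordering and the lexicographic ordering of bitstrings are our own conventions): it builds a graph state by applying the controlled-Z phases to $|+\rangle^{\otimes n}$, and a fixed Hamming-weight encoded state from a normalized real array.

```python
from itertools import combinations
import numpy as np

def graph_state(n, edges):
    """|+>^n followed by a controlled-Z on every edge (a, b); qubit 0 is the leftmost bit."""
    dim = 2 ** n
    psi = np.full(dim, 1.0 / np.sqrt(dim))
    for a, b in edges:
        for idx in range(dim):
            bit_a = (idx >> (n - 1 - a)) & 1
            bit_b = (idx >> (n - 1 - b)) & 1
            if bit_a and bit_b:
                psi[idx] *= -1.0   # CZ adds a phase -1 when both qubits are in |1>
    return psi

def fixed_hamming_weight_state(x, n, k):
    """Encode amplitudes x on the bitstrings of Hamming weight k (assumed ordering)."""
    x = np.asarray(x, dtype=float)
    supports = list(combinations(range(n), k))   # positions of the k ones
    if len(x) != len(supports):
        raise ValueError("need exactly C(n, k) amplitudes")
    if not np.isclose(np.sum(x ** 2), 1.0):
        raise ValueError("amplitudes must be normalized")
    psi = np.zeros(2 ** n)
    for amp, qubits in zip(x, supports):
        idx = sum(1 << (n - 1 - q) for q in qubits)
        psi[idx] = amp
    return psi

bell_like = graph_state(2, [(0, 1)])                       # single-edge graph
encoded = fixed_hamming_weight_state([0.6, 0.8], n=2, k=1)
```

For the single-edge graph on two qubits, `bell_like` is $(|00\rangle + |01\rangle + |10\rangle - |11\rangle)/2$, while `encoded` places the amplitudes 0.6 and 0.8 on $|10\rangle$ and $|01\rangle$, respectively.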

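Lemma 1 (Invariance from equivariance). A loss function of the form $\ell_\theta(\rho) = \mathrm{Tr}[U(\theta)\, \rho\, U(\theta)^\dagger O]$ is $G$-invariant if it is composed of a $G$-equivariant QNN and a $G$-equivariant measurement.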
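Proof. For any $g \in G$ we have
$$\begin{aligned} \ell_\theta\big(R(g)\rho R(g)^\dagger\big) &= \mathrm{Tr}\big[U(\theta) R(g)\, \rho\, R(g)^\dagger U(\theta)^\dagger O\big] \\ &= \mathrm{Tr}\big[R(g) U(\theta)\, \rho\, U(\theta)^\dagger R(g)^\dagger O\big] \\ &= \mathrm{Tr}\big[U(\theta)\, \rho\, U(\theta)^\dagger R(g)^\dagger O R(g)\big] \\ &= \mathrm{Tr}\big[U(\theta)\, \rho\, U(\theta)^\dagger O\big] = \ell_\theta(\rho), \end{aligned}$$
where in the second line we have used the equivariance of the QNN, in the third line cyclicity of the trace, and in the fourth the equivariance of the measurement.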
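The invariance can be verified numerically in the smallest non-trivial case, $n = 2$, where $S_2$ is generated by the SWAP gate. In the sketch below, the generators $X_1 + X_2$ and $Z_1 Z_2$, the measurement $X_1 X_2$, the angles and the random input state are all illustrative choices of ours; what matters is only that each of them commutes with SWAP.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def expm_herm(H, theta):
    """exp(-i * theta * H) for Hermitian H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * theta * w)) @ V.conj().T

def loss(rho, U, O):
    """Tr[U rho U^dagger O], real for Hermitian O and a valid state rho."""
    return np.real(np.trace(U @ rho @ U.conj().T @ O))

# Permutation-equivariant generators and measurement (each commutes with SWAP).
H_x = np.kron(X, I2) + np.kron(I2, X)
H_zz = np.kron(Z, Z)
O = np.kron(X, X)
U = expm_herm(H_zz, 0.4) @ expm_herm(H_x, 0.9)

# A random (non-symmetric) pure input state and its permuted version.
rng = np.random.default_rng(1)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi = psi / np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())
rho_permuted = SWAP @ rho @ SWAP.conj().T

loss_original = loss(rho, U, O)
loss_permuted = loss(rho_permuted, U, O)
```

The two loss values coincide even though the input state itself is not permutation symmetric, as the lemma requires.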

III. SUPPLEMENTARY METHODS 3: PROOF OF LEMMA 2
Here, we provide a proof for Lemma 2, which characterizes the number of free parameters in S n -equivariant unitaries.
Lemma 2 (Dimension of $S_n$-equivariant unitaries). The submanifold of $S_n$-equivariant unitaries is of dimension equal to the Tetrahedral numbers $\mathrm{Te}_{n+1} = \binom{n+3}{3}$ (see Fig. 4 in the main text), and therefore on the order of $\Theta(n^3)$.
Proof. As shown in Eq. (13) of the main text, any $S_n$-equivariant unitary $U(\theta)$ is fully characterized by its $d_\lambda$-dimensional isotypic components $U_\lambda(\theta)$, each having $d_\lambda^2$ free real-valued parameters. Thus, the unitary $U(\theta)$ has $\sum_\lambda d_\lambda^2$ free parameters, which can be shown (by induction) to be equal to the Tetrahedral numbers $\mathrm{Te}_{n+1}$.
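The counting in this proof is easy to confirm numerically. Assuming the block dimensions $d_\lambda = n - 2m + 1$ for $m = 0, \dots, \lfloor n/2 \rfloor$ (from the Schur-Weyl decomposition reviewed in the Methods), the sketch below compares $\sum_\lambda d_\lambda^2$ with $\binom{n+3}{3}$; the function names are ours.

```python
from math import comb

def num_free_parameters(n):
    """Sum of d_lambda^2 over the two-row diagrams (n - m, m), with d_lambda = n - 2m + 1."""
    return sum((n - 2 * m + 1) ** 2 for m in range(n // 2 + 1))

def tetrahedral(k):
    """k-th Tetrahedral number, Te_k = C(k + 2, 3)."""
    return comb(k + 2, 3)

# Pairs (sum of squares, Te_{n+1}) for a range of qubit numbers.
counts = [(num_free_parameters(n), tetrahedral(n + 1)) for n in range(1, 21)]
```

For example, $n = 4$ gives $5^2 + 3^2 + 1^2 = 35 = \binom{7}{3}$, and the two entries of every pair in `counts` agree.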

IV. SUPPLEMENTARY METHODS 4: PROOF OF THEOREM 1
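Here, we provide a proof for Theorem 1, yielding an exact expression for the variance of the partial derivatives of the loss function.
Theorem 1. Let $U_\theta$ be an $S_n$-equivariant QNN, with generators in $\mathcal{G}$. Let $O$ be an $S_n$-equivariant measurement operator from $\mathcal{M}$, and consider an empirical loss function $\mathcal{L}(\theta)$ with the form in Eq. (6), for some given training set $\mathcal{S}$. Assuming a circuit depth $L$ such that the QNN forms independent 2-designs on each isotypic block, we have $\langle \partial_\mu \mathcal{L}(\theta)\rangle_\theta = 0$, and $\mathrm{Var}_\theta[\partial_\mu \mathcal{L}(\theta)]$ is given by Eq. (16) of the main text. Here, $\Delta(B) = \mathrm{Tr}[B^2] - \mathrm{Tr}[B]^2/\dim(B)$.
Proof. First, let us recall from the main text that under the action of an $S_n$-equivariant QNN, the Hilbert space decomposes as
$$\mathcal{H} \cong \bigoplus_\lambda \bigoplus_{\nu=1}^{m_\lambda} \mathcal{H}^\nu_\lambda,$$
where we have $\dim(\mathcal{H}^\nu_\lambda) = d_\lambda$. Note that, for a given $\lambda$, all the subspaces $\mathcal{H}^\nu_\lambda$ have the same dimension irrespective of $\nu$. Then, let us define as $Q^\nu_\lambda$ the $d_\lambda \times d$ matrix that results from horizontally stacking the basis elements of $\mathcal{H}^\nu_\lambda$, such that $Q^\nu_\lambda$ maps vectors from $\mathcal{H}$ to $\mathcal{H}^\nu_\lambda$. These matrices satisfy the following property
$$Q^\nu_\lambda \big(Q^{\nu'}_{\lambda'}\big)^\dagger = \delta_{\lambda\lambda'}\,\delta_{\nu\nu'}\, \mathbb{1}_{d_\lambda},$$
allowing us to define the projector $P^\nu_\lambda$ onto each subspace as
$$P^\nu_\lambda = \big(Q^\nu_\lambda\big)^\dagger Q^\nu_\lambda.$$
In what follows we will denote as
$$A^\nu_\lambda = Q^\nu_\lambda\, A\, \big(Q^\nu_\lambda\big)^\dagger$$
the operator obtained by reducing $A$ onto the subspace labeled by $\lambda$ and $\nu$. In particular, we note that if $A = A^\dagger$ then $(A^\nu_\lambda)^\dagger = A^\nu_\lambda$ (the converse is not necessarily true). Moreover, if $A$ is positive semi-definite, then so is $A^\nu_\lambda$. This can be seen from the fact that, given a $d_\lambda$-dimensional vector $|x\rangle$, we have $\langle x|A^\nu_\lambda|x\rangle = \langle \tilde{x}|A|\tilde{x}\rangle$, where $|\tilde{x}\rangle$ is a $d$-dimensional vector in which $x$ is padded with zeros.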
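The quantity $\Delta(B) = \mathrm{Tr}[B^2] - \mathrm{Tr}[B]^2/\dim(B)$ appearing in Theorem 1 measures how far a Hermitian $B$ is from a multiple of the identity: it equals the squared Hilbert-Schmidt distance between $B$ and $(\mathrm{Tr}[B]/\dim(B))\,\mathbb{1}$. The sketch below (function names ours, Hermitian input assumed) evaluates $\Delta$ and checks this identity on a random Hermitian matrix.

```python
import numpy as np

def delta(B):
    """Delta(B) = Tr[B^2] - Tr[B]^2 / dim(B), returned as a real number (B Hermitian)."""
    d = B.shape[0]
    return np.real(np.trace(B @ B) - np.trace(B) ** 2 / d)

def hs_distance_to_identity(B):
    """Squared Hilbert-Schmidt distance between B and (Tr[B] / d) * identity."""
    d = B.shape[0]
    diff = B - (np.trace(B) / d) * np.eye(d)
    return np.real(np.trace(diff.conj().T @ diff))

Z = np.array([[1.0, 0.0], [0.0, -1.0]])

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
herm = (M + M.conj().T) / 2    # random Hermitian test matrix

delta_identity = delta(np.eye(3))
delta_z = delta(Z)
```

In particular $\Delta(\mathbb{1}) = 0$, so an operator that acts trivially within a block contributes nothing to the variance, while $\Delta(Z) = 2$ for a single-qubit Pauli.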
Here we also recall that both the QNN and measurement operator are block diagonal, such that or alternatively, such that We can now explicitly evaluate the loss function partial derivative with respect to the parameter θ µ in the µ-th layer.First, we recall that the loss function is with the S n -equivariant quantum neural network unitary being For convenience, let us also introduce the following notation Here we have omitted the explicit dependence on θ for simplicity of notation.An explicit calculation yields Using Eqs.(15) we can see that Here, λ = (λ 1 , λ 2 , λ 3 , λ 4 , λ 5 , λ 6 ), ν = (ν 1 , ν 2 , ν 3 , ν 4 , ν 5 , ν 6 ), and we have used in the second line the fact that the projectors P ν λ are orthogonal.Thus, we can write where we have defined (21) shows that the loss function can be expressed as a summation of contributions from the blocks associated to the different invariant subspaces.While both U (θ) and O are block diagonal in this bases, the same is not necessarily true for ρ i .That is, while the states ρ i may not be block diagonal in the irreps of U(2), the loss function one only takes into account their projection into the irrep subspaces.
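For a dataset, we evaluate our model with the training error $\mathcal{L}(\theta)$. Via linearity of expectation,
$$\langle \partial_\mu \mathcal{L}(\theta)\rangle_\theta = \sum_{i=1}^M c_i\, \langle \partial_\mu \ell_\theta(\rho_i)\rangle_\theta.$$
Ultimately, we are interested in computing the variance of the loss function partial derivative, which is defined as
$$\mathrm{Var}_\theta[\partial_\mu \mathcal{L}(\theta)] = \big\langle (\partial_\mu \mathcal{L}(\theta))^2 \big\rangle_\theta - \big\langle \partial_\mu \mathcal{L}(\theta) \big\rangle_\theta^2, \qquad (23)$$
where $\langle \cdot \rangle_\theta$ indicates the average over the set of parameters $\theta$ in the $S_n$-equivariant quantum neural network.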
As discussed in the main text, we can always replace the integral over parameter space by an integral over a distribution of unitaries, which is known to converge to a 2-design in the Lie group $e^{\mathfrak{g}}$. Under the assumption that $f$ is a polynomial of degree $\leq 2$ in the matrix elements of $U$, this in turn allows us to replace the previous average with an integration over the Haar measure of this Lie group. We then use the fact that this Lie group is in general a product group (in our case all the simple groups are unitary or special unitary) and that its Haar measure is the product of the Haar measures of each of the simple normal subgroups. We have As mentioned in the main text, integration over the Haar measure of the unitary group can be done via Weingarten calculus.
Let us begin with the second term in Eq. (23).That is, we want to compute M i=1 c i ⟨∂ µ ℓ θ (ρ i )⟩ θ .Explicitly, we have for the mean of the single-state expectation value ℓ θ (ρ i ) where ] .Using Eq. ( 24) on both where in the second equation we have used the trivial integration over all λ ′ ̸ = λ.Using Eq. ( 3), we can evaluate the terms where we have used that the trace of a commutator is always zero.From the previous we have shown that the first moment of the cost function partial derivative is zero, i.e., we have proved that and clearly, in turn ⟨∂ µ L(θ)⟩ θ = 0.
Next, let us evaluate the variance via the second moment dµ(UA,α) where we have defined ] and α and α ′ label irreps.Let us focus on a single summand, which we denote by S ν,ν ′ ,i,j λ,λ ′ .Two cases arise: either where after taking the trivial integrations we are left with four integrals.An explicit integration of U B (either λ or λ ′ ) via Eq.( 5) leads to where in the last equality we have used Eq. ( 29).On the other hand, when λ = λ ′ , we have and in the end we are left with only two integrals.Integration over U B leads to where we define the operator for any two d × d matrices A and B. Note that A = B leads to † the Hilbert-Schmidt distance.We will also use the notation ∆(A) = ∆(A, A).Thus, we can write Finally, using Eq. ( 4) leads to Combining Eqs. ( 31), ( 40) and ( 39) leads to From Eq. ( 41) we can see that the variance of the loss function contains a term for each irrep.Let us analyze each of those terms more closely.First, we recall that meaning that within each irrep, the variance contains a term that quantifies how close the reduced measurement operator O and the reduced generator H k,λ is to the (normalized) d λ × d λ identity 1 1 λ .As expected if O of H µ are trivial when projected into the λ irrep, then its respective contribution to the variance will be insignificant.We now note that the summation over i and j can be pushed into ∆(ρ ν i,λ , ρ ν ′ j,λ ).Specifically, = ∆( = ∆( This motivates us to define σ = M i=1 c i ρ i , and we can now write the variance as Recall that ∆( ⩾ 0 and thus the variance is non-negative as expected.To evaluate the presence of barren plateaus, it remains to find ∆(H λ ) and ∆(O λ ).Now that we have connected the eigenvalues of A in terms of Hamming weights w(z) (see Lemma 4), we can proceed to (ii): Investigate which of them are compatible with a given irrep subspace H ν λ .
Proof of Theorem 2. We prove the theorem by using the connection between the irreps of U(2), Pauli operators, and spin systems. The operator $A$ must take the block-diagonal form $A \cong \bigoplus_\lambda \mathbb{1}_{m_\lambda} \otimes A_\lambda$, where $A_\lambda$ acts on the irrep of U(2) indexed by $\lambda$. Notably, the irreps of U(2) are spaces of fixed total spin [3][4][5][6]. That is, for $\lambda = (n-m, m)$, the total spin of the states in the subspace is fixed at $\frac{n-2m}{2}$. Thus, the local spin components must lie in the range $\{-\frac{n-2m}{2}, -\frac{n-2m}{2}+1, \ldots, \frac{n-2m}{2}\}$. As the Hamming weight corresponds to the number of spin-down sites, for an irrep $\lambda$ the weights are restricted to the range $w \in \{m, \ldots, n-m\}$. We can then use Eq. (48) to compute the eigenvalues and from there $\Delta(A_\lambda)$.
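The recipe in this proof (one eigenvalue per allowed Hamming weight, then $\Delta$ from the spectrum) can be sketched numerically. The snippet below is only an illustration under stated assumptions: it takes $\chi = Z$, uses the unnormalized sum over all $k$-subsets of qubits, and relies on the standard binary Krawtchouk polynomial for the eigenvalue at Hamming weight $w$. The function names are hypothetical and not taken from the paper.

```python
from fractions import Fraction
from math import comb


def krawtchouk(k, w, n):
    """Binary Krawtchouk polynomial K_k(w; n).

    Equals the eigenvalue of sum_{|S|=k} prod_{j in S} Z_j on a bitstring
    of Hamming weight w (each subset contributes (-1)^{#ones inside it}).
    """
    return sum((-1) ** j * comb(w, j) * comb(n - w, k - j) for j in range(k + 1))


def delta_from_eigenvalues(eigs):
    """Delta(B) = Tr[B^2] - Tr[B]^2 / dim(B), computed from the spectrum of B."""
    d = len(eigs)
    return sum(Fraction(e) ** 2 for e in eigs) - Fraction(sum(eigs)) ** 2 / d


def delta_block(n, m, k):
    """Delta(A_lambda) for the k-body operator on the block lambda = (n - m, m).

    Assumption (from the proof above): the block carries exactly one
    eigenvalue for each Hamming weight w in {m, ..., n - m}.
    """
    eigs = [krawtchouk(k, w, n) for w in range(m, n - m + 1)]
    return delta_from_eigenvalues(eigs)


if __name__ == "__main__":
    n = 4
    for m in range(n // 2 + 1):
        print(m, delta_block(n, m, 1), delta_block(n, m, 2), delta_block(n, m, n))
```

For $k = 1$ the eigenvalues are $n - 2w$, and the result matches $d(d^2-1)/3$ with $d = n - 2m + 1$. Blocks with $d = 1$ always give zero, since a single eigenvalue has no spread.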

2
. Summing over these values yields (2) λ Thus, the variance of the eigenvalues is λ ) = 8 3 Note that for $m > \lfloor n/2 \rfloor - 1$ this evaluates to 0.

Case 3. In this case $A^{(n)} = \bigotimes_{i=1}^{n} Z_i$, and from Eq. (48) we have $e(z) = (-1)^{w(z)}$. Thus, (A Thus, Tr A

VI. SUPPLEMENTARY METHODS 6: GENERALIZATION OF THEOREM 2

We here generalize the result in Theorem 2 to the case of arbitrary $k$-local Pauli strings. Consider the operator $A^{(k)} = \sum_{\{j_1,\ldots,j_k\}} \bigotimes_{i=1}^{k} \chi_{j_i}$. Again, $\chi$ is arbitrary, thus we assume $\chi = Z$. The eigenvalue corresponding to a bitstring $|z\rangle$ is $e(z) = K_k(w(z); n)$, where $K_k(x; n)$ are the binary Krawtchouk polynomials. Then we have that

VII. SUPPLEMENTARY METHODS 7: TRAINABILITY OF STATES

We here consider certain families of states and argue for, or against, their trainability under an $S_n$-equivariant architecture.
As discussed in the main text, Theorem 1 asserts that there can be several sources of untrainability: those related to the QNN and measurement, and those related to the dataset. Since Corollary 1 of the main text states that the former source cannot lead to exponentially vanishing gradients, the trainability of the model hinges on the behavior of the latter, dataset-related source. We note that this dataset-dependent trainability is not unique to $S_n$-equivariant QNNs, but is rather present in all absence-of-barren-plateaus results (see Refs. [2][7][8][9][10]), as there always exist datasets for which an otherwise trainable model can be rendered untrainable. In this section we briefly discuss the trainability of $S_n$-equivariant models for different types of datasets.
First, we recall from Theorem 1 that the dataset will induce a barren plateau if, for every $\lambda$, $\Delta\big(\sum_{\nu=1}^{m_\lambda} \sigma^\nu_\lambda\big)$ is exponentially small. Here, $\sigma = \sum_{i=1}^{M} c_i \rho_i$ and $\sigma^\nu_\lambda$ is the reduction of $\sigma$ to $\mathcal{H}^\nu_\lambda$. This highlights a clear connection between the underlying block structure and the trainability condition for a given dataset: because $\sigma$ is Hermitian, we can interpret $\Delta\big(\sum_{\nu=1}^{m_\lambda} \sigma^\nu_\lambda\big)$ as measuring the variance of the eigenvalues of the multiplicity-averaged reduced operators $\sigma^\nu_\lambda$, and to guarantee trainability we need at least one irrep $\lambda$ for which this eigenvalue variance is not exponentially vanishing.
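The eigenvalue-variance reading of $\Delta$ can be made explicit in a short derivation. As a sketch, let $b_1, \ldots, b_d$ denote the eigenvalues of a Hermitian $d \times d$ operator $B$ and $\bar b = \mathrm{Tr}[B]/d$ their mean (both symbols are introduced here only for illustration):

```latex
\begin{align}
\sum_{i=1}^{d} (b_i - \bar b)^2
  &= \sum_{i=1}^{d} b_i^2 - 2\bar b \sum_{i=1}^{d} b_i + d\,\bar b^{\,2} \\
  &= \mathrm{Tr}[B^2] - \frac{\mathrm{Tr}[B]^2}{d} = \Delta(B).
\end{align}
```

Hence $\Delta(B)$ is $d$ times the population variance of the spectrum of $B$, and it vanishes exactly when $B$ is proportional to the identity on the block.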
In the following, we will distinguish between training single states and training datasets. The former is a natural framework for variational quantum algorithms [11,12]. In this case, we need only show that $\Delta\big(\sum_\nu \sigma^\nu_\lambda\big) \in \Omega(1/\mathrm{poly}(n))$ for some $\lambda \in \hat{S}_n$. For the latter, trainability also depends on the weights $c_i$. Let us clarify this with an example.
Example 1. Consider the set $\{|\psi\rangle \in \mathcal{H} \mid R(\pi)|\psi\rangle = |\psi\rangle, \ \forall \pi \in S_n\}$. This is commonly known as the symmetric subspace [13] and corresponds to the irrep $\lambda = (n, 0)$, where $d_{(n,0)} = n+1$ and $m_{(n,0)} = 1$. Now, suppose our dataset is composed of $n+1$ pure states forming an orthonormal basis for this subspace, which we label as 1) and these states are trainable. However, this may not be the case when we train on the entire dataset. Assume that we are performing binary classification, for which $c_i := -\frac{1}{M} y_i$. If all inputs have the same label, then $\sigma = \pm\frac{1}{n+1} P_{\mathrm{Sym}}$, where $P_{\mathrm{Sym}}$ is the projector onto the symmetric subspace, and thus $\sigma_{(n,0)} = \frac{1}{n+1} \mathbb{1}_{(n,0)}$, implying $\Delta(\sigma_{(n,0)}) = 0$ (this follows from the fact that $P_{\mathrm{Sym}}$ is the identity matrix in $\mathcal{H}^\nu_\lambda$, and hence its eigenvalues have zero variance). Now instead consider $c_1 = -1$ and $c_i = 1$ for $i > 1$ (up to the overall normalization $1/M$). In this case, $\Delta(\sigma_{(n,0)}) = \frac{4n}{(n+1)^3}$ and there will not be barren plateaus. Note that the true average, $c_i = \frac{1}{N}$, is a sort of worst case, as each state is treated the same with no dependence on $y_i$. Regardless, in Sec. X D we present numerics showing that the true average of certain families of states is trainable.
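The numbers in Example 1 can be reproduced with a few lines of exact arithmetic. This is a minimal sketch under two assumptions that are ours, not the paper's: the dataset states are taken as the basis in which $\sigma_{(n,0)}$ is diagonal, and the weights follow $c_i = -y_i/M$ with $M = n+1$ (this normalization is what yields the value $4n/(n+1)^3$). The helper names are hypothetical.

```python
from fractions import Fraction


def delta(diag):
    """Delta(B) = Tr[B^2] - Tr[B]^2 / dim(B) for a diagonal operator B."""
    d = len(diag)
    return sum(x * x for x in diag) - sum(diag) ** 2 / Fraction(d)


def sigma_sym_block(labels):
    """Diagonal of sigma = sum_i c_i rho_i on the symmetric block.

    Assumes the rho_i are an orthonormal basis of the block and
    c_i = -y_i / M, so sigma is diagonal with entries c_i.
    """
    M = len(labels)
    return [Fraction(-y, M) for y in labels]


if __name__ == "__main__":
    n = 4
    same_labels = [1] * (n + 1)            # every state carries the same label
    one_flipped = [1] + [-1] * n           # a single state with the other label
    single_state = [Fraction(1)] + [Fraction(0)] * n  # one basis state alone
    print(delta(sigma_sym_block(same_labels)))   # zero spread: untrainable
    print(delta(sigma_sym_block(one_flipped)))   # compare with 4n/(n+1)^3
    print(delta(single_state))                   # n/(n+1) for a single basis state
```

The overall sign of the labels does not matter, since $\Delta$ is quadratic in $\sigma$.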
While the previous shows that $\sigma$ may not be trainable even when each $\rho_i$ is, we can instead say that if each $\rho_i$ is not trainable, then neither is $\sigma$. We formalize this with the following proposition:

Proposition 1. Given an ensemble of states $\{\rho_i\}$ and their corresponding $\sigma = \sum_i c_i \rho_i$, then

Proof. This result is easily proven by recalling that ∆(A) = D HS (A, The utility of this result is that if ∆( ) is small for an ensemble of states $\{\rho_i\}$, then it will still be small for the weighted average $\sigma$.
A. Analytical results for trainability of single states

Fixed Hamming-weight encoding
Building on the intuition that states with an at most polynomially vanishing component in the symmetric subspace can have a $\Delta(\rho_{(n,0)})$ that is at most polynomially vanishing (see Example 1), let us devise an encoding scheme with trainability guarantees: Given some array of real values $\{x_i\}$, perhaps representing some high-dimensional vector, such that $\sum_i x_i^2 = 1$, encode each $x_i$ as the amplitude of a unique bitstring $z$ of Hamming weight $k$, where $k$ is some fixed constant. That is, prepare the quantum state $|x\rangle = \sum_{z:\, w(z) = k} x_z |z\rangle$, where $w(z)$ is the Hamming weight and we are now indexing $\{x_i\}$ with $z$. To analyze the trainability of this encoding we consider the projection onto the symmetric subspace. From the analysis above we know that we need only show that $\mathrm{Tr}[P_{\mathrm{sym}} |x\rangle\langle x|] \in \Omega(1/\mathrm{poly}(n))$. Recalling that $P_{\mathrm{sym}} = \frac{1}{n!} \sum_{g \in S_n} R(g)$, we can evaluate this trace directly. First, we note that where again $k = w(z)$. Thus, and Note that since $\sum_i x_i^2 = 1$, we have Here we consider two cases. First, that where $x_i \geq 0$, $\forall i$. For example, in encoding an image, the pixel values are all non-negative. In this case clearly $\langle x|P_{\mathrm{sym}}|x\rangle \geq \frac{\max(k!,\,(n-k)!)}{n!}$. As a second scenario, consider drawing data It is not unreasonable to assume that $x_i$ is symmetric in distribution around 0. That is, $P(x_i \leq -a) = P(x_i \geq a)$. If we also assume that only the magnitudes of the data are correlated, and not the signs, then $\mathbb{E}[x_i x_j] = 0$ (for $i \neq j$). In this case, $\mathbb{E}[\langle x|P_{\mathrm{sym}}|x\rangle] = \frac{\max(k!,\,(n-k)!)}{n!}$. Without loss of generality take $k \leq \frac{n}{2}$. Then, For a fixed $k$, the expected component in the symmetric subspace $\Delta(\rho_{(n,0)})$ is then inverse-polynomial.
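The overlap with the symmetric subspace can be checked by brute force for small $n$. The sketch below assumes the encoded state has real amplitudes $x_z$ supported on weight-$k$ bitstrings and averages $\langle x|R(g)|x\rangle$ over all $n!$ permutations. It compares the result with the closed form $(\sum_z x_z)^2 / \binom{n}{k}$, which follows from a standard counting argument (exactly $k!(n-k)!$ permutations map one weight-$k$ string to another) and is stated here as our own cross-check rather than as the paper's expression. All function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, factorial, isclose, sqrt


def weight_k_bitstrings(n, k):
    """All n-bit tuples with exactly k ones."""
    return [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), k)]


def sym_overlap_bruteforce(x, n):
    """<x|P_sym|x> with P_sym = (1/n!) sum_g R(g).

    x maps bitstring tuples to real amplitudes; R(g) permutes qubit positions.
    """
    total = 0.0
    for perm in permutations(range(n)):
        for z, amp in x.items():
            gz = tuple(z[perm[i]] for i in range(n))
            total += amp * x.get(gz, 0.0)
    return total / factorial(n)


def sym_overlap_closed_form(x, n, k):
    """Counting-argument value (sum_z x_z)^2 / C(n, k) for weight-k support."""
    return sum(x.values()) ** 2 / comb(n, k)


if __name__ == "__main__":
    n, k = 4, 2
    strings = weight_k_bitstrings(n, k)
    raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # non-negative toy data
    norm = sqrt(sum(r * r for r in raw))
    x = {z: r / norm for z, r in zip(strings, raw)}
    print(sym_overlap_bruteforce(x, n), sym_overlap_closed_form(x, n, k))
```

For non-negative amplitudes the closed form is at least $1/\binom{n}{k}$, which is never smaller than the bound quoted in the text. Amplitudes with cancelling signs can drive the overlap to zero.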

Global Haar random state
While the previous family of input states was trainable, we now provide an example leading to untrainability. Consider sending a single global Haar random pure state $\rho$ through the $S_n$-equivariant QNN. That is, states of the form $\rho = V|\psi\rangle\langle\psi|V^\dagger$, where $V$ is a Haar-random unitary and $|\psi\rangle$ is an arbitrary pure state.
Consider the expectation value: To continue, recall that We can then rewrite the first expectation value above as where we have defined We employ the following identity (Eq.( 4)) } to be the orthogonal sets of vectors in the full Hilbert space corresponding to the subspaces H ν λ and H ν ′ λ .Then Using Eq. ( 79) we then see that To evaluate the second expectation value we first rewrite the expression as We now employ the following identity (Eq.( 5)) This allows us to evaluate the expectation value Combining the expectation values derived above yields We turn back to E[∆( which is now easy to evaluate.

E[∆(
Recall that, ) and, on average, Haar random states will not be trainable.Of course, this is not surprising as it is known that Haar random states are highly entangled and make poor computational resources [14].

VIII. SUPPLEMENTARY METHODS 8: VC DIMENSION BOUNDS (FOR GENERAL SYMMETRIES)
Besides the covering number bound in the main text, here we show a bound on the VC dimension of equivariant QNNs. We begin by giving a general framework for GQML. Consider a quantum neural network $\mathcal{N}_\theta : \mathcal{B}(\mathcal{H}) \to \mathcal{B}(\mathcal{K})$, where $\mathcal{H}$ is the input Hilbert space and $\mathcal{K}$ the output. Note that this need not be a unitary mapping and can be a general quantum channel. Further, assume $\mathcal{N}$ is equivariant to a compact group $G$ with representation $R_1(g)$ on $\mathcal{H}$ and $R_2(g)$ on $\mathcal{K}$. That is, With this neural network we construct classifiers of the form where $O$ is some Hermitian operator such that

Proof. First, note that $c$ can be absorbed into $O$ via setting $O' := O - c\mathbb{1}$. As $\mathbb{1}$ commutes with everything, $O'$ is then still in $\mathrm{comm}(R_2)$. We can then consider classifiers of the form $\mathrm{Tr}[O\,\mathcal{N}(\rho;\theta)] \geq 0$ without loss of generality.
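Recall that the twirling operation for a compact group is given by $\mathcal{T}_G[X] = \int_G d\mu(g)\, R(g) X R(g)^\dagger$, where $\mu$ is the Haar measure. In addition, $\mathcal{T}_G$ is known to be a projector onto $\mathrm{comm}(R)$ [15]. Note that here we twirl with $R_2(g)$. Clearly, $\mathcal{T}_G[O] = O$ for the chosen measurement operator. We require one last property of twirling: it is self-adjoint. That is, $\mathrm{Tr}[\mathcal{T}_G[A]\,B] = \mathrm{Tr}[A\,\mathcal{T}_G[B]]$, so that $\mathrm{Tr}[O\,\mathcal{N}_\theta(\rho)] = \mathrm{Tr}[\mathcal{T}_G[O]\,\mathcal{N}_\theta(\rho)] = \mathrm{Tr}[O\,\mathcal{T}_G[\mathcal{N}_\theta(\rho)]]$, where $\mathcal{T}_G[\mathcal{N}_\theta(\rho)]$ is some operator in $\mathrm{comm}(R_2)$. Thus, any classifier of the form $\mathrm{Tr}[O\,\mathcal{N}(\rho;\theta)] \geq 0$ is a linear classifier in $\mathrm{comm}(R_2)$. The VC dimension of all linear classifiers is the dimension of the vector space [16]. As the space of classifiers obtained by varying $\theta$ is a subset of all linear classifiers, the VC dimension of $\mathcal{F}_\theta$ is less than or equal to the dimension of the commutant of $R_2(g)$.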
For the architecture considered in this paper, $G = S_n$ and $R_1(g) = R_2(g)$. From the isotypic decomposition, the dimension of the commutant is the sum of the squares of the multiplicities. The multiplicity space of the irrep $\lambda = (n-m, m)$ is an irrep of U(2) of dimension $n - 2m + 1$. Summing over all irreps, we see that the total dimension of the commutant is $\sum_{m=0}^{\lfloor n/2 \rfloor} (n - 2m + 1)^2$, proving Theorem 7.
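The identification of this sum with a tetrahedral number is easy to confirm numerically. The following sketch uses the convention $\mathrm{Te}_j = j(j+1)(j+2)/6$ and sums over $m = 0, \ldots, \lfloor n/2 \rfloor$, the range of two-row partitions $\lambda = (n-m, m)$. The function names are illustrative only.

```python
def commutant_dimension(n):
    """Sum of squared block dimensions (n - 2m + 1)^2 over m = 0..floor(n/2)."""
    return sum((n - 2 * m + 1) ** 2 for m in range(n // 2 + 1))


def tetrahedral(j):
    """j-th tetrahedral number Te_j = j (j + 1) (j + 2) / 6."""
    return j * (j + 1) * (j + 2) // 6


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, commutant_dimension(n), tetrahedral(n + 1))
```

For $n = 7$ both expressions give 120, which matches the parameter count quoted for the 7-qubit classification experiment.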

IX. SUPPLEMENTARY METHODS 9: ABSENCE OF BARREN PLATEAUS IMPLIES ABSENCE OF NARROW GORGES
In this section we show that the loss function does not exhibit a barren plateau and, more importantly, that it cannot have a narrow gorge.
First, let us recall the definition of a narrow gorge from Ref. [17]. Let the loss function be defined as We say that $\ell_\theta(\rho)$ exhibits a narrow gorge if the following two conditions are met:
1. There exist points in the landscape which are lower than the mean cost value by at least a quantity $\Delta(n)$ with $\Delta(n) \in \Omega(1/\mathrm{poly}(n))$.

A. Gradient scalings for different families of input states
We numerically probe the scaling of the gradients for different families of random initial states. All the results reported in Fig. 1 are obtained for a generator $H_\mu = \frac{1}{n}\sum_{j=1}^{n} X_j$ in the middle of the $S_n$ circuits with $L = 3n$ layers. Similar scalings are obtained for the other generators (Eq. 9 in the main text) but are not displayed here.
In Fig. 1(a), we study graph states [18,19] which, we recall, are obtained by applying a controlled-Z gate, $CZ = |0\rangle\langle 0| \otimes \mathbb{1} + |1\rangle\langle 1| \otimes Z$, for each edge $(a, b)$ belonging to an underlying graph, onto an initial state $|+\rangle^{\otimes n}$. The underlying random graphs considered are either $k$-regular graphs (with $k = 3$ or $k = n/2$), or they follow an Erdős-Rényi distribution (with the probability of any edge being included taken to be $p = 30\%$ or $50\%$). As can be seen, the gradients are found to decrease exponentially for the graphs drawn from an Erdős-Rényi distribution, but only polynomially for the $k$-regular graphs.
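The graph-state construction can be sketched directly in terms of amplitudes. Since every CZ gate is diagonal in the computational basis and $|+\rangle^{\otimes n}$ is the uniform superposition, the amplitude on a bitstring $z$ is $2^{-n/2}(-1)^{\sum_{(a,b) \in E} z_a z_b}$. The function name and the edge-list format below are illustrative assumptions and are not taken from the numerics reported here.

```python
from itertools import product
from math import sqrt


def graph_state_amplitudes(n, edges):
    """Amplitudes of the graph state prod_{(a,b) in edges} CZ_{ab} |+>^n.

    Returns a dict mapping each n-bit tuple z to its real amplitude.
    Each edge whose two endpoints are both 1 in z contributes a factor -1.
    """
    norm = 1 / sqrt(2 ** n)
    amps = {}
    for z in product((0, 1), repeat=n):
        parity = sum(z[a] * z[b] for a, b in edges) % 2
        amps[z] = -norm if parity else norm
    return amps


if __name__ == "__main__":
    # Triangle graph on three qubits
    for z, amp in graph_state_amplitudes(3, [(0, 1), (1, 2), (0, 2)]).items():
        print(z, amp)
```

Relabeling the vertices of the graph permutes the qubits of the resulting state, which is why graph properties such as connectivity are natural targets for an $S_n$-equivariant model.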
Gradients for local Haar random input states and for states produced by applying a random circuit to a fiducial state $|0\rangle^{\otimes n}$ are reported in Fig. 1(b). For the random circuit preparation, we use a Hardware Efficient Ansatz (HEA) [20] composed of layers of Y rotations applied to each qubit with random angles, followed by controlled-NOT gates acting on pairs of adjacent qubits. Such circuits of HEA layers are taken to have either constant depth ($L = 15$ or $30$) or a number of layers scaling linearly with the system size ($L = n$, $2n$ or $3n$). Only in the case of a linear number of layers are the gradients found to vanish exponentially.

B. Assessment of the analytical variances
In Fig. 2 we compare the analytical expression of the gradient variances (as provided in Eq. (46)) to variances estimated numerically. Both are evaluated for generalized graph states, where edges are now encoded by means of a controlled-
To evaluate Eq. (46), we explicitly construct a basis that block diagonalizes $S_n$ in the form of the isotypic decomposition (Eq. 11 in the main text). This so-called Schur basis can be constructed recursively (for instance, see Refs. [3,21]). In turn, this allows us to evaluate, for any operator $B$ (including the input state $\sigma$, the measurement operators $O$ and the generators $H_\mu$), its restriction $B^\nu_\lambda$ as defined in Eq. (13) and as required to compute Eq. (46). As shown in Fig. 2, for all the different scenarios studied, the variances yielded by Eq. (46) closely match the variances estimated numerically. Furthermore, in addition to the case with $\phi = \pi$ already discussed, it can be seen that there exist choices of $\phi$ that result in non-exponentially vanishing gradient variances.

C. Contributions of the different irreps for 3-regular graph states
Given this ability to evaluate the different terms entering Eq. ( 46), we can further analyze the contributions of the different irreps to the total variance, for any given initial state.The contribution of a given irrep is defined as the summand indexed by λ in Eq. (46), such that the sum of all irreps contributions equals the overall variance.
In Fig. 2, we report these contributions for each of the irreps involved, labeled as $\lambda = (n-m, m)$ (given in the legend). These data are obtained for graph states based on 3-regular graphs and a number of nodes $n \in [4, 16]$. While the analytical results for trainable states in Sec. VII A were based on the symmetric component of the input state vanishing only polynomially, here we can see that the situation is different. Explicitly, despite the exponential vanishing of the contribution of $\lambda = (n, 0)$ (the symmetric subspace in the top right sub-figure), the overall gradient variance only decays

To further substantiate our claims on the efficacy of $S_n$-equivariant QNNs compared to non-equivariant models, we consider the problem of classifying connected vs disconnected graphs encoded as graph states ($\phi = \pi$). While this problem can be solved readily with classical methods, it still serves as an instructive example, as connectivity is clearly invariant under relabeling. The number of non-isomorphic graphs is exponential, but we train with only a polynomial number of random graphs to illustrate the small data requirements of $S_n$-equivariant QNNs. To reach overparametrization, we train models with $\Theta(n^3)$ layers. Our EQNN model follows the structure in Fig. 2 of the main text and ends with a measurement of $O = Z^{\otimes n}$. Then, the prediction of the classifier is given by $h_\theta(\rho) = \mathrm{sgn}(\mathrm{Tr}[\mathcal{U}_\theta(\rho) Z^{\otimes n}])$.
As a comparison, we also train standard hardware-efficient ansatzes [20] with the same number of parameters. Here the layers are composed of local unitaries (on single qubits) followed by ladders of CNOTs on adjacent parties. At the end, $X_1 + X_2$ is measured, and the classification is again given by the sign of the expectation value.
In Fig. 5 we give an example for n = 7.First, note that the EQNN converges in less than 100 epochs while the standard QNN requires ∼ 250.Most importantly, the training and testing accuracies of the EQNN closely match.The standard QNN, on the other hand, achieves similar training accuracy, but severely overfits and performs poorly in testing.Within 1000 epochs the QNN fails to achieve high training accuracy, unlike the EQNN.

SUPPLEMENTARY NOTE
Soon after our work was completed, the preprint [22] was posted on the arXiv. Therein, the authors claim that the loss function of $S_n$-equivariant QNNs can be classically simulated. In this section we address these claims. In particular, we aim at making two important points:
• For most relevant cases in QML, the relevant algorithm in [22] is not fully classical, as it requires access to a quantum computer to obtain a "classical description" of the input data.
• Even if one is given such a "classical description", the ensuing algorithm that replaces the use of a quantum neural network scales extremely poorly with the number of qubits, meaning that it is not clear whether it is truly favorable to replace the quantum circuit by classical post-processing.
We begin by recalling that in Ref. [22], the authors argue that for symmetries restrictive enough (such as the permutation group $S_n$) some computational tasks become classically tractable provided certain conditions are met. These conditions are defined as "given certain classical descriptions of the input or the results of efficiently obtainable quantum measurements". These are indeed crucial assumptions, which we argue in the following do not hold in many tasks of practical importance, including quantum machine learning (QML) problems as envisioned in our manuscript. We stress that we do not claim that no interesting problem can be addressed classically, but rather that this is not the case for most QML problems. For instance, we have no doubt that evaluating eigenenergies (and efficiently describing their eigenstates) of $S_n$-invariant Hamiltonians can indeed be performed classically, as evidenced by the theorems of Ref. [22].
Let us thus henceforth focus on QML applications. In particular, we will discuss Corollary 12 and Theorem 8 of Ref. [22], since these directly address our QML setup (as evidenced by a reference to our manuscript provided just before the corollary). Here, the authors argue that one can evaluate the loss function $\ell_\theta(\rho_i)$ provided one is given "efficiently obtainable quantum measurements" on the quantum state $\rho_i$. In practice, this materializes as being able to prepare $\rho_i$ and to implement on a quantum computer the quantum Schur transform [23], which is denoted in Ref. [22] as $V_{STO}$. For completeness, we recall that the quantum Schur transform (QST) is a quantum primitive that promises to greatly accelerate several tasks provided access to quantum computing resources (e.g., see a discussion of the many known applications in Ch. 6 of Ref. [23] or the introduction of Ref. [24]). The QST is deemed efficiently implementable on a quantum device in the sense that it requires a polynomial number of operations and a logarithmic number of additional ancillas (see, e.g., Refs. [21][24][25][26] for different possible implementations). While the previous is theoretically true, to our knowledge the QST has not been demonstrated on current devices due to its high implementation requirements.
Given that the simulability of a generic $S_n$-invariant QML pipeline, based on Corollary 12 in Ref. [22], effectively requires accessing a quantum state (i.e., the input data) in the first place, coherent processing (i.e., performing a QST) and additional non-trivial measurements (i.e., classical shadows, as mentioned when introducing Theorem 8 in Ref. [22]), we feel comfortable classifying the algorithm of Ref. [22] as truly non-classical (and also non-near-term).
Next, let us discuss the second scenario envisioned in Ref. [22], which assumes that one has been "given certain classical description" of the input states. Already, this assumption is unrealistic and does not hold in generic QML problems, whereby input states can be accessed rather than being "described" in a preferable basis. Still, one could question the case when the input data is classical and encoded in a quantum state. In such a case, the learner effectively has some description of the input state. Yet, with the exception of the most basic forms of encoding, such a description is not readily transferable into a form suitable for the classical simulations of Ref. [22]. To exemplify such a situation, the authors address in Ref. [22] the case where the input state is a predefined computational basis state. Indeed, it is known that the description of such a state can be efficiently mapped to a description in the preferable basis. This can readily be extended to superpositions of polynomially many computational basis states. However, already the graph state encoding that was used as an example in our numerical experiments would escape such a description. Hence, even in this classical input data scenario, we believe that for the majority of encodings (except the simplest ones) obtaining an efficient classical description of the data without a quantum computer is not a viable option.
Lastly, let us assume that one is given adequate measurements or access to an appropriate classical description of the input state. Then, while it is true that the loss $\ell_\theta(\rho_i)$ can be computed classically, this incurs very large computational costs. Explicitly, as mentioned by the authors of Ref. [22] (Appendix F), the classical computation involves routines scaling as $O(n^{10})$ and $O(n^{15})$. While polynomial in the system size $n$, such complexity may well prohibit any reasonable attempt at scaling past $n = 20$ qubits. This discussion shows that if one has access to a quantum computer (as required by the algorithm in Ref. [22]), it is not entirely obvious that one should use it to obtain a classical description of the data (followed by expensive post-processing), rather than using the device to run an $S_n$-equivariant quantum neural network with trainability guarantees. One notes that the run-time of our model scales like $n^6$ ($O(n^3)$ layers, each composed of $O(n^2)$ operations). Then, our results indicate that using a quantum computer with a run-time scaling as $n^6$ ($\ll n^{15}$) leads to overparametrized models with no barren plateaus; and thus there is hope of achieving a quantum advantage in spite of having such large symmetries.
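To give a feeling for the gap between the two scalings, the following sketch evaluates $n^{15}$ and $n^{6}$ at a given system size. It deliberately ignores all constant factors and lower-order terms hidden by the $O(\cdot)$ notation, so the numbers are only indicative. The function name and its default exponents are illustrative.

```python
def scaling_gap(n, classical_exponent=15, quantum_exponent=6):
    """Return (n**classical_exponent, n**quantum_exponent, their ratio).

    Constant prefactors are ignored; this only compares the leading powers.
    """
    classical = n ** classical_exponent
    quantum = n ** quantum_exponent
    return classical, quantum, classical // quantum


if __name__ == "__main__":
    for n in (10, 20, 50):
        classical, quantum, ratio = scaling_gap(n)
        print(f"n={n}: n^15={classical:.3e}, n^6={quantum:.3e}, ratio={ratio:.3e}")
```

At $n = 20$ the leading powers already differ by a factor of $20^9$, on the order of $10^{11}$.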

Figure 1. GQML embeds geometric priors into a QML model. Incorporating prior knowledge through $S_n$-equivariance heavily restricts the search space of the model. We show that such inductive biases lead to models that do not exhibit barren plateaus, can be efficiently overparametrized, and require small amounts of data to generalize well.

Figure 2. Quantum circuit for an $S_n$-equivariant QNN. Each layer of the QNN is obtained by exponentiation of a generator from the set $\mathcal{G}$ in Eq. (9). Here we show a circuit with $L = 3$ layers acting on $n = 4$ qubits. Single-qubit blocks indicate a rotation about the x or y axis, while two-qubit blocks denote entangling gates generated by a ZZ interaction. All colored gates between dashed horizontal lines share the same trainable parameter $\theta_l$.

Figure 4. Tetrahedral numbers. a) The tetrahedral numbers $\mathrm{Te}_n$ are obtained by counting how many spheres can be stacked in the configuration of a tetrahedron (triangular base pyramid) of height $n$. b) One can also compute $\mathrm{Te}_n$ as the sum of consecutive triangular numbers, which count how many objects (e.g., spheres) can be arranged in an equilateral triangle.

Figure 5. Task of distinguishing connected from disconnected graphs with an $S_n$-equivariant QNN. a) Variance of the loss function partial derivatives versus the number of qubits $n$ (in log-linear scale). The square blue line depicts the variance for inputs of the QNN drawn from a dataset composed of connected and disconnected graph states. To visualize how the data with different labels contributes to this variance, we also plot in green crosses (orange circles) the variances when the QNN is only fed connected (disconnected) graph states. b) In the left panel, we show representative results for the rank of the QFIM (defined in the main text) versus the number of layers $L$ for different numbers of qubits $n$. The critical number of layers at which this rank saturates, denoted $L_{\mathrm{ovp}}$ (vertical dashed lines), corresponds to the onset of overparametrization. In the middle panel, we report the scaling of $L_{\mathrm{ovp}}$ versus the number of qubits (log-linear scale). For each problem size, we present results for 10 random input graph states and, as a comparison, also report the tetrahedral numbers $\mathrm{Te}_{n+1}$ (solid line). In the right panel, we report the relative loss error of optimized QNNs at a given number of layers $L$ (in log-linear scale). These are obtained for different system sizes, with the dashed vertical lines indicating the corresponding values of $L_{\mathrm{ovp}}$. c) Normalized generalization error versus number of qubits $n$ (in log-linear scale) for different training dataset sizes $M$. Here, we consider an overparametrized QNN with $L = \mathrm{Te}_{n+1}$.

Theorem 7. Let $U_\theta$ be an $S_n$-equivariant QNN with generators in $\mathcal{G}$, and $O$ an $S_n$-equivariant measurement from $\mathcal{M}$. The VC dimension of classifiers of the form $\mathrm{Tr}[U(\theta)\rho U(\theta)^\dagger O] \geq b$ is less than or equal to $\sum_\lambda d_\lambda^2 = \mathrm{Te}_{n+1} \in \Theta(n^3)$.

Figure 1. Scaling of the gradients of $S_n$-equivariant circuits. Left: the input states are graph states whose underlying graphs are drawn from different distributions (in legend). Right: local Haar random states and states prepared with a random circuit, taken to be a HEA circuit (described further in the main text) with a number $L$ of layers either constant or scaling linearly with the system size (given in legend).

Figure 2. Predicted vs actual gradients. Comparison between gradients estimated numerically (Est.) and our analytical expressions (Pred.) in Eq. (46). All the data reported corresponds to generalized graph states (described in the main text) for 3-regular graphs with number of nodes $n \in [4, 16]$. Results are displayed for varied encoding angles $\phi$ (columns) and for each of the generators $H_\mu$ (rows) appearing in $\mathcal{G}$ (Eq. 9 of the main text).

Figure 4. Variance of the gradients for different sizes N of the dataset. We report variances obtained for gradients of the empirical loss in Eq. (6) for datasets of $M = 50$ states with uniform weighting $c_i = 1/M$ (square markers) and compare them to the single-state case, with $M = 1$ (circle markers). Additionally, we report variances for the dataset scenario under the assumption of i.i.d. gradients (dashed line, discussed further in the main text). Variances are obtained for three distributions of input states (colors in legend), including local Haar random states and graph states based on 3-regular graphs or drawn from an Erdős-Rényi distribution. Results are reported for two gate generators, namely $H_\mu = \sum_{k<j} Z_j Z_k$ in (a) and $H_\mu = \sum_{j=1}^{n} X_j$ in (b).

Figure 5. Training and testing performance for a permutation-equivariant QNN versus a standard QNN learning to classify connected vs disconnected graphs of 7 nodes. Each epoch (x axis) is a single step of optimization that takes gradients over two batches of the data. Both models have 120 free parameters, a count such that the equivariant model reaches overparametrization. Models were trained with a set of 14 random connected and 14 disconnected graphs. Testing was conducted with 6 new random connected and 6 disconnected graphs. For both architectures, the best model out of 15 random initial settings was selected. Note that the equivariant QNN converges within 100 epochs with small generalization error, while the standard model requires several hundred and heavily overfits the data.

Table I. Input pure states and their effect on the trainability of $S_n$-equivariant QNNs. Trainable means that ∆