Introduction

The study of symmetry formalizes the invariance of objects under some set of operations. A wealth of theory has gone into describing symmetries as mathematical entities through the concept of groups and representations. While the analysis of symmetries in nature has greatly improved our understanding of the laws of physics, the study of symmetries in data has only recently gained momentum within the framework of learning theory. In the past few years, classical machine learning practitioners realized that models tend to perform better when constrained to respect the underlying symmetries of the data. This has led to the blossoming field of geometric deep learning1,2,3,4,5, where symmetries are incorporated as geometric priors into the learning architectures, improving trainability and generalization performance6,7,8,9,10,11,12,13.

The tremendous success of geometric deep learning has recently inspired researchers to import these ideas to the realm of quantum machine learning (QML)14,15,16. QML is a new and exciting field at the intersection of classical machine learning and quantum computing. By running routines on quantum hardware, and thus exploiting the exponentially large dimension of the Hilbert space, the hope is that QML algorithms can outperform their classical counterparts when learning from data17.

The infusion of ideas from geometric deep learning to QML has been termed ‘geometric quantum machine learning’ (GQML)18,19,20,21,22,23,24. GQML leverages the machinery of group and representation theory25 to build quantum architectures that encode symmetry information about the problem at hand. For instance, when the model is parametrized through a quantum neural network (QNN)16,26,27,28, GQML indicates that the layers of the QNN should be equivariant under the action of the symmetry group associated to the dataset. That is, applying a symmetry transformation on the input to the QNN layers should be the same as applying it to its output.

One of the main goals of GQML is to create architectures that solve, or at least significantly mitigate, some of the known issues of standard symmetry non-preserving QML models16. For instance, it has been shown that the optimization landscapes of generic QNNs can exhibit a large number of local minima29,30,31,32, or be prone to the barren plateau phenomenon33,34,35,36,37,38,39,40,41,42,43,44,45 whereby the loss function gradients vanish exponentially with the problem size. Crucially, it is known that barren plateaus and excessive local minima are connected to the expressibility30,32,37,43,46 of the QNN, so that problem-agnostic architectures are more likely to exhibit trainability issues. In this sense, it is expected that following the GQML program of baking symmetry directly into the algorithm will lead to models with sharp inductive biases that suitably limit their expressibility and search space.

In this work, we leverage the GQML toolbox to create models that are permutation invariant, i.e., models whose outputs remain invariant under the action of the symmetric group Sn (see Fig. 1). We focus on this particular symmetry as learning problems with permutation symmetries abound. Examples include learning over sets of elements47,48, modeling relations between pairs (graphs)49,50,51,52,53,54 or multiplets (hypergraphs) of entities55,56,57, problems defined on grids (such as condensed matter systems)58,59,60,61, molecular systems62,63,64, evaluating genuine multipartite entanglement65,66,67,68, or working with distributed quantum sensors69,70,71.

Fig. 1: GQML embeds geometric priors into a QML model.

Incorporating prior knowledge through Sn-equivariance heavily restricts the search space of the model. We show that such inductive biases lead to models that do not exhibit barren plateaus, can be efficiently overparametrized, and require small amounts of data to generalize well.

Our first contribution is to provide guidelines to build unitary Sn-equivariant QNNs. We then derive rigorous theoretical guarantees for these architectures in terms of their trainability and generalization capabilities. Specifically, we prove that Sn-equivariant QNNs do not lead to barren plateaus, can be overparametrized with polynomially deep circuits, and generalize well with only a polynomial number of training points. We also identify datasets for which the model is trainable, as well as datasets leading to untrainability. All these appealing properties are also demonstrated in numerical simulations of a graph classification task. Our empirical results verify our theoretical ones, and even show that the performance of Sn-equivariant QNNs can, in practice, be better than that guaranteed by our theorems.

Results

Preliminaries

While the formalism of GQML can be readily applied to a wide range of tasks with Sn symmetry, here we will focus on supervised learning problems. We note, however, that our results can be readily extended to more general scenarios such as unsupervised learning72,73, reinforcement learning74,75, generative modeling76,77,78,79, or to the more task-oriented computational paradigm of variational quantum algorithms63,80.

Generally, a supervised quantum machine learning task can be phrased in terms of a data space \({{{\mathcal{R}}}}\) (a set of quantum states on some Hilbert space \({{{\mathcal{H}}}}\)) and a real-valued label space \({{{\mathcal{Y}}}}\). We will assume \({{{\mathcal{H}}}}\) to be a tensor product of n two-dimensional subsystems (qubits) and thus of dimension \(d={2}^{n}\). We are given repeated access to a training dataset \({{{\mathcal{S}}}}={\{({\rho }_{i},{y}_{i})\}}_{i = 1}^{M}\), where ρi is sampled from \({{{\mathcal{R}}}}\) according to some probability distribution P, and where \({y}_{i}\in {{{\mathcal{Y}}}}\). We further assume that the labels are assigned by some underlying (but unknown) function \(f:{{{\mathcal{R}}}}\mapsto {{{\mathcal{Y}}}}\), that is, yi = f(ρi). We make no assumptions regarding the origins of ρi, meaning that these can correspond to classical data embedded in quantum states81,82, or to quantum data obtained from some quantum mechanical process60,61,83.

The goal is to produce a parametrized function \({h}_{{{{\boldsymbol{\theta }}}}}:{{{\mathcal{R}}}}\mapsto {{{\mathcal{Y}}}}\) closely modeling the outputs of the unknown target f, where θ are trainable parameters. That is, we want hθ to accurately predict labels for the data in the training set \({{{\mathcal{S}}}}\) (low training error), as well as to predict the labels for new and previously unseen states (small generalization error). We will focus on QML models that are parametrized through a QNN, a unitary channel \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}:{{{\mathcal{B}}}}({{{\mathcal{H}}}})\to {{{\mathcal{B}}}}({{{\mathcal{H}}}})\) such that \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}(\rho )=U({{{\boldsymbol{\theta }}}})\rho U{({{{\boldsymbol{\theta }}}})}^{{\dagger} }\). Here, \({{{\mathcal{B}}}}({{{\mathcal{H}}}})\) denotes the space of bounded linear operators in \({{{\mathcal{H}}}}\). Throughout this work we will restrict to L-layered QNNs

$${{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}={{{{\mathcal{U}}}}}_{{\theta }_{L}}^{L}\circ \cdots \circ {{{{\mathcal{U}}}}}_{{\theta }_{1}}^{1},\,\,\,\,{{\mbox{where}}}\,\,\,\,\,{{{{\mathcal{U}}}}}_{{\theta }_{l}}^{l}(\rho )={{{{\rm{e}}}}}^{-{{{\rm{i}}}}{\theta }_{l}{H}_{l}}\rho {{{{\rm{e}}}}}^{{{{\rm{i}}}}{\theta }_{l}{H}_{l}},$$
(1)

for some Hermitian generators {Hl}, so that \(U({{{\boldsymbol{\theta }}}})=\mathop{\prod }\nolimits_{l = 1}^{L}{{{{\rm{e}}}}}^{-{{{\rm{i}}}}{\theta }_{l}{H}_{l}}\). Moreover, we consider models that depend on a loss function of the form

$${\ell }_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})={{{\rm{Tr}}}}[{{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})O],$$
(2)

where O is a Hermitian observable. We quantify the training error via the so-called empirical loss, which is defined as

$$\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})=\mathop{\sum }\limits_{i=1}^{M}{c}_{i}{\ell }_{{{{\boldsymbol{\theta }}}}}({\rho }_{i}).$$
(3)

The model is trained by solving the optimization task \(\arg {\min }_{{{{\boldsymbol{\theta }}}}}\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})\)63. Once a desired convergence in the optimization is achieved, the optimal parameters, along with the loss function \({\ell }_{{{{\boldsymbol{\theta }}}}}\), are used to predict labels. For the case of binary classification, where \({{{\mathcal{Y}}}}=\{+1,-1\}\), one can choose \({c}_{i}:= -\frac{{y}_{i}}{M}\). Then, if the measurement operator is normalized such that \({\ell }_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})\in [-1,1]\), this corresponds to the hinge loss, a standard (but not the only relevant84) loss function in machine learning.
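
As a simple illustration, the short Python snippet below (our own minimal sketch, not the implementation used for the numerics of this work) evaluates the empirical loss of Eq. (3) with the hinge-type weights ci = −yi/M, given a list of expectation values ℓθ(ρi) and their labels yi.

```python
import numpy as np

def empirical_loss(expvals, labels):
    """Eq. (3) with hinge-type weights c_i = -y_i / M."""
    expvals, labels = np.asarray(expvals), np.asarray(labels)
    c = -labels / len(labels)
    return float(np.sum(c * expvals))

print(empirical_loss([0.9, -0.8, 0.7], [+1, -1, +1]))   # close to -1: predictions align with labels
print(empirical_loss([-0.9, 0.8, -0.7], [+1, -1, +1]))  # close to +1: predictions anti-align with labels
```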

We further remark that while Eq. (3) approximates the error of the learned model, the true loss is defined as

$${{{\mathcal{L}}}}({{{\boldsymbol{\theta }}}})={{\mathbb{E}}}_{\rho \sim P}[c(y){\ell }_{{{{\boldsymbol{\theta }}}}}(\rho )].$$
(4)

Here, we have denoted the weights as c(y) to make their dependency on the labels y explicit. The difference between the true loss and the empirical one, known as the generalization error, is given by

$${{{\rm{gen}}}}({{{\boldsymbol{\theta }}}})=| {{{\mathcal{L}}}}({{{\boldsymbol{\theta }}}})-\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})| .$$
(5)

We now turn to GQML, where the first step is identifying the underlying symmetries of the dataset, as this allows us to create suitable inductive biases for hθ. In particular, many problems of interest exhibit so-called label symmetry, i.e., the function f produces labels that remain invariant under a set of operations on the inputs. Concretely, one can verify that such a set of operations forms a group18, which leads to the following definition.

Definition 1

(Label symmetries and G-invariance). Given a compact group G and some unitary representation R acting on quantum states ρ, we say f has a label symmetry if it is G-invariant, i.e., if

$$f(R(g)\rho R{(g)}^{{\dagger} })=f(\rho ),\,\forall g\in G.$$
(6)

Here, we recall that a representation is a mapping of a group into the space of invertible linear operators on some vector space (in this case the space of quantum states) that preserves the structure of the group25. Also, we note that some problems may have functions f whose outputs change (rather than being invariant) in a way entirely determined by the action of G on their inputs. While still captured by general GQML theory, these do not pertain to Definition 1 and are not discussed further. Label invariance captures the scenario where the relevant information in ρ is unchanged under the action of G.

Evidently, when searching for models hθ that accurately predict outputs of f, it is natural to restrict our search to the space of models that respect the label symmetries of f. In this context, the theory of GQML provides a constructive approach to create G-invariant models, resting on the concept of equivariance23.

Definition 2

(Equivariance). We say that an observable O is G-equivariant iff for all elements g ∈ G, [O, R(g)] = 0. We say that a layer \({{{{\mathcal{U}}}}}_{{\theta }_{l}}^{l}\) of a QNN is G-equivariant iff it is generated by a G-equivariant Hermitian operator.

By the previous definition, G-equivariant layers are maps that commute with the action of the group

$${{{{\mathcal{U}}}}}_{{\theta }_{l}}^{l}(R(g)\rho R{(g)}^{{\dagger} })=R(g){{{{\mathcal{U}}}}}_{{\theta }_{l}}^{l}(\rho )R{(g)}^{{\dagger} }.$$
(7)

Definition 2 can be naturally extended to QNNs.

Definition 3

(Equivariant QNN). We say that an L-layered QNN is G-equivariant iff each of its layers is G-equivariant.

Altogether, equivariant QNNs and measurement operators provide a recipe to design invariant models, i.e., models that respect the label symmetries. Akin to their classical machine learning counterparts1,2,3,4,5, such GQML models consist of a composition of many equivariant operations (realized by the L layers of the equivariant QNN) and an invariant one (realized by the measurement of the equivariant observable)23. Furthermore, model invariance extends to the loss function itself, as captured by the following Lemma.

Lemma 1

(Invariance from equivariance). A loss function of the form in Eq. (2) is G-invariant if it is composed of a G-equivariant QNN and a G-equivariant measurement.

A proof of this Lemma along with that of the following Lemmas and Theorems are presented in Supplementary Methods 2 and 3.

Sn-equivariant QNNs and measurements

In the previous section we have described how to build generic G-invariant models. We now specialize to the case where G is the symmetric group Sn, and where R is the qubit-defining representation of Sn, i.e., the one permuting qubits, which for any π ∈ Sn acts as

$$R(\pi )\mathop{\bigotimes }\limits_{i=1}^{n}\left\vert {\psi }_{i}\right\rangle =\mathop{\bigotimes }\limits_{i=1}^{n}\left\vert {\psi }_{{\pi }^{-1}(i)}\right\rangle .$$
(8)

Following Definitions 2 and 3, the first step towards building Sn-equivariant QNNs is defining Sn-equivariant generators for each layer. In the Methods section we describe how such operators can be obtained, but here we will restrict our attention to the following set of generators

$${{{\mathcal{G}}}}=\left\{\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}{X}_{j},\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}{Y}_{j},\frac{2}{n(n-1)}\mathop{\sum}\limits_{k < j}{Z}_{j}{Z}_{k}\right\}.$$
(9)

Note that there is some freedom in the choice of generators. Any two sums over distinct single-qubit Pauli operators (the first two generators), plus a sum over pairs of the remaining Pauli operator (the third generator), suffice, and we choose the above set without loss of generality. In Fig. 2 we show an example of an L = 3 layered Sn-equivariant QNN acting on n = 4 qubits. While the single-qubit rotations generated by \({{{\mathcal{G}}}}\) are readily achievable in most quantum computing platforms, the collective ZZ interactions are best suited to architectures allowing for reconfigurable connectivity85,86,87 or platforms that implement mediated all-to-all interactions88,89. In fact, such interactions are referred to as one-axis twisting90 in the context of spin squeezing91 and form the basis of many quantum sensing protocols.
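
For concreteness, the following self-contained numpy sketch (an illustration written for this presentation, not the code behind our numerics) builds the three generators of Eq. (9) as dense matrices for a small n and verifies their Sn-equivariance, i.e., that each commutes with every qubit-permutation matrix R(π).

```python
import itertools
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def op_on(pauli, site, n):
    """Operator acting with `pauli` on qubit `site` and identity elsewhere."""
    out = np.array([[1.0 + 0j]])
    for j in range(n):
        out = np.kron(out, pauli if j == site else I2)
    return out

def generators(n):
    """The three Sn-equivariant generators of Eq. (9)."""
    hx = sum(op_on(X, j, n) for j in range(n)) / n
    hy = sum(op_on(Y, j, n) for j in range(n)) / n
    hzz = sum(op_on(Z, j, n) @ op_on(Z, k, n)
              for j in range(n) for k in range(j)) * 2 / (n * (n - 1))
    return [hx, hy, hzz]

def permutation_rep(pi, n):
    """Matrix of R(pi): output qubit pi(j) carries the state of input qubit j."""
    dim = 2 ** n
    R = np.zeros((dim, dim))
    for idx in range(dim):
        bits = [(idx >> (n - 1 - j)) & 1 for j in range(n)]
        new_bits = [0] * n
        for j in range(n):
            new_bits[pi[j]] = bits[j]
        new_idx = sum(b << (n - 1 - j) for j, b in enumerate(new_bits))
        R[new_idx, idx] = 1
    return R

n = 3
for H in generators(n):
    for pi in itertools.permutations(range(n)):
        R = permutation_rep(list(pi), n)
        assert np.allclose(H @ R, R @ H)  # Sn-equivariance: [H, R(pi)] = 0
print("All generators of Eq. (9) commute with every qubit permutation for n =", n)
```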

Fig. 2: Quantum circuit for an Sn-equivariant QNN.

Each layer of the QNN is obtained by exponentiation of a generator from the set \({{{\mathcal{G}}}}\) in Eq. (9). Here we show a circuit with L = 3 layers acting on n = 4 qubits. Single-qubit blocks indicate a rotation about the x or y axis, while two-qubit blocks denote entangling gates generated by a ZZ interaction. All colored gates between dashed horizontal lines share the same trainable parameter θl.

In addition, we will consider observables of the following form

$${{{\mathcal{M}}}}=\left\{\frac{1}{n}\mathop{\sum }\limits_{j=1}^{n}{\chi }_{j},\frac{2}{n(n-1)}\mathop{\sum }\limits_{k < j;j=1}^{n}{\chi }_{j}{\chi }_{k},\mathop{\prod }\limits_{j=1}^{n}{\chi }_{j}\right\},$$
(10)

where χ is a (fixed) Pauli matrix. It is straightforward to see that any \({H}_{l}\in {{{\mathcal{G}}}}\) and \(O\in {{{\mathcal{M}}}}\) will commute with R(π) for any π ∈ Sn. We note that one could certainly consider other observables as well.

We now leverage tools from representation theory to understand and unravel the underlying structure of Sn-equivariant QNNs and measurement operators. This will allow us to derive, in the next section, theoretical guarantees for these GQML models.

One of the most notable results from representation theory is that a given finite-dimensional representation of a group decomposes into an orthogonal direct sum of fundamental building blocks known as irreducible representations (irreps). As further explained in the Methods, the qubit-defining representation takes, under some appropriate global change of basis (which we denote with ≅), the block-diagonal form

$$R(\pi \in {S}_{n})\cong \mathop{\bigoplus}\limits_{\lambda }\mathop{\bigoplus }\limits_{\mu =1}^{{d}_{\lambda }}{r}_{\lambda }(\pi )=\mathop{\bigoplus}\limits_{\lambda }{r}_{\lambda }(\pi )\otimes {{\mathbb{1}}}_{{d}_{\lambda }}.$$
(11)

Here λ labels the irreps of Sn and rλ is the corresponding irrep itself, which appears dλ times. The collection of these repeated irreps is called an isotypic component. Crucially, the only irreps appearing in R correspond to two-row Young diagrams (see Methods) and can be parametrized by a single non-negative integer m, as λ ≡ λ(m) = (n − m, m), where \(m=0,1,\ldots ,\lfloor \frac{n}{2}\rfloor\). It can be shown that

$$\begin{array}{rcl}{d}_{\lambda }&=&n-2m+1,\quad \,{{\mbox{and}}}\,\\ {m}_{\lambda }&=&\frac{n!(n-2m+1)!}{(n-m+1)!m!(n-2m)!}\end{array}$$
(12)

where again dλ is the number of times the irrep appears and mλ is the dimension of the irrep itself. Note that every dλ is in \({{{\mathcal{O}}}}(n)\), whereas some mλ can grow exponentially with the number of qubits. For instance, if n is even and m = n/2, one finds that \({m}_{\lambda }\in \Omega ({4}^{n/2}/{n}^{2})\). We finally note that Eq. (11) implies \({\sum }_{\lambda }{m}_{\lambda }{d}_{\lambda }={2}^{n}\).
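
The following short script (a sanity check we include for illustration) evaluates dλ and mλ from Eq. (12) and confirms the counting \({\sum }_{\lambda }{m}_{\lambda }{d}_{\lambda }={2}^{n}\) for small system sizes.

```python
from math import comb

def d_lam(n, m):
    """Multiplicity d_lambda = n - 2m + 1 of the irrep lambda = (n - m, m)."""
    return n - 2 * m + 1

def m_lam(n, m):
    """Dimension m_lambda of the Sn irrep lambda = (n - m, m)."""
    return comb(n, m) - (comb(n, m - 1) if m >= 1 else 0)

for n in range(1, 13):
    total = sum(m_lam(n, m) * d_lam(n, m) for m in range(n // 2 + 1))
    assert total == 2 ** n
print("sum_lambda m_lambda * d_lambda = 2^n verified for n = 1, ..., 12")
```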

Given the block-diagonal structure of R, Sn-equivariant unitaries and measurements must necessarily take the form

$$U({{{\boldsymbol{\theta }}}})\cong \mathop{\bigoplus}\limits_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {U}_{\lambda }({{{\boldsymbol{\theta }}}}),\quad \,{{\mbox{and}}}\,\quad O\cong \mathop{\bigoplus}\limits_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {O}_{\lambda }.$$
(13)

That is, both U(θ) and O decompose into a direct sum of dλ-dimensional blocks, each repeated mλ times (so that, relative to Eq. (11), mλ now plays the role of a multiplicity). This decomposition is illustrated in Fig. 3.

Fig. 3: Representation theory and Sn-equivariance.

Using tools from representation theory we find that the Sn-equivariant QNN U(θ) and the representation of the group elements R(π) (for any π ∈ Sn) admit an irrep block decomposition as in Eq. (13) and Eq. (11), respectively. The irreps can be labeled with a single parameter λ = (n − m, m) where \(m=0,1,\ldots ,\lfloor \frac{n}{2}\rfloor\). For a system of n = 5 qubits, we show in a) the block diagonal decomposition for U(θ) and in b) the decomposition of R(π) as a representation of S5. The dashed boxes denote the isotypic components labeled by λ. c As n increases, U(θ) has a block diagonal decomposition which contains polynomially large blocks repeated a (potentially) exponential number of times. In contrast, the block decomposition of R(π) (for any π ∈ Sn) contains blocks that can be exponentially large but that are only repeated a polynomial number of times.

Let us highlight several crucial implications of the block diagonal structure arising from Sn-equivariance. First and foremost, we note that, under the action of an Sn-equivariant QNN, the Hilbert space decomposes as

$${{{\mathcal{H}}}}\cong \mathop{\bigoplus}\limits_{\lambda }\mathop{\bigoplus }\limits_{\nu =1}^{{m}_{\lambda }}{{{{\mathcal{H}}}}}_{\lambda }^{\nu },$$
(14)

where each \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) denotes a dλ-dimensional invariant subspace. Moreover, one can also see that when the QNN acts on an input quantum state as \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}(\rho )=U({{{\boldsymbol{\theta }}}})\rho U{({{{\boldsymbol{\theta }}}})}^{{\dagger} }\), it can only access the information in ρ which is contained in the invariant subspaces \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) (see also ref. 23). This means that to solve the learning task, we require two ingredients: i) the data must encode the relevant information required for classification into these subspaces23,25, and ii) the QNN must be able to accurately process the information within each \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\). As discussed in the Methods, we can guarantee that the second condition will not be an issue, as the set of generators in Eq. (9) is universal within each invariant subspace, i.e., the QNN can map any state in \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) to any other state in \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) (see also ref. 92).

A second fundamental implication of Eq. (13) is that the manifold of equivariant unitaries is of low dimension. We make this explicit in the following lemma.

Fig. 4: Tetrahedral numbers.

a The Tetrahedral numbers Ten are obtained by counting how many spheres can be stacked in the configuration of a tetrahedron (triangular base pyramid) of height n. b One can also compute Ten as the sum of consecutive triangular numbers, which count how many objects (e.g., spheres) can be arranged in an equilateral triangle.

Lemma 2

(Dimension of Sn-equivariant unitaries). The submanifold of Sn-equivariant unitaries has dimension equal to the Tetrahedral number \({{{{\rm{Te}}}}}_{n+1}=\left(\genfrac{}{}{0.0pt}{}{n+3}{3}\right)\) (see Fig. 4), and is therefore of order \(\Theta ({n}^{3})\).

Crucially, Lemma 2 shows that the equivariance constraint limits the degrees of freedom in the QNN (and concomitantly in any observable) from \({4}^{n}\) to only polynomially many.
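
A quick numerical consistency check of Lemma 2 (our own sketch): reading off the block structure of Eq. (13), each isotypic component contributes the \({d}_{\lambda }^{2}\) real parameters of a dλ-dimensional unitary block, and summing these indeed reproduces \({{{{\rm{Te}}}}}_{n+1}=\left(\genfrac{}{}{0.0pt}{}{n+3}{3}\right)\).

```python
from math import comb

for n in range(1, 21):
    # Each block U_lambda in Eq. (13) is a d_lambda-dimensional unitary: d_lambda^2 real parameters.
    dim_equivariant = sum((n - 2 * m + 1) ** 2 for m in range(n // 2 + 1))
    assert dim_equivariant == comb(n + 3, 3)   # Te_{n+1}, as stated in Lemma 2
print("sum_lambda d_lambda^2 equals binom(n + 3, 3) for n = 1, ..., 20")
```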

Absence of barren plateaus in Sn-equivariant QNNs

Barren plateaus have been recognized as one of the main challenges to overcome in order to guarantee the success of QML models using QNNs16. When a model exhibits a barren plateau, the loss landscape becomes, on average, exponentially flat and featureless as the problem size increases33,34,35,36,37,38,39,40,41,42,43,44,45. This severely impedes its trainability, as one needs to spend an exponentially large amount of resources to correctly estimate a loss-minimizing direction. Issues of barren plateaus arise primarily due to the structure of the models (including the choice of QNN, the input state and the observables) employed33,34,35,36,37,38,39,40,41,42,43,45, but can also be caused solely by the effects of noise44. In the rest of this section, we will only be concerned with the former type of barren plateau, which is the most studied.

Recently, a great deal of effort has been devoted to creating strategies capable of mitigating the effect of barren plateaus78,93,94,95,96,97,98,99,100,101,102,103,104,105. While these are promising and have shown moderate success, the ‘holy grail’ is identifying architectures which are immune to barren plateaus altogether, and thus enjoy trainability guarantees. Examples of such architectures are shallow hardware efficient ansatzes34, quantum convolutional neural networks106, or the transverse field Ising model Hamiltonian variational ansatz43,45. Here, we prove that another architecture can be added to this list: Sn-equivariant QNNs.

When studying barren plateaus, one typically analyzes the variance of the empirical loss function partial derivatives, \({\partial }_{\mu }\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})=\partial \hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})/\partial {\theta }_{\mu }\), where \({\theta }_{\mu }\in {{{\boldsymbol{\theta }}}}\). We say that there is a barren plateau in the θμ direction if \({{\mathbb{E}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]=0\) and \({{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]\) is exponentially vanishing.

Before stating our main results, we introduce a bit of notation. Let us define \({Q}_{\lambda }^{\nu }\) to be the operator that maps vectors from \({{{\mathcal{H}}}}\) to \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\), such that \({({Q}_{\lambda }^{\nu })}^{{\dagger} }{Q}_{\lambda }^{\nu }\) realizes a projection onto \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) (see Supplementary Methods 4 for additional details). Given a matrix \(B\in {{\mathbb{C}}}^{d\times d}\), we will denote its restriction to \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\) as

$${B}_{\lambda }^{\nu }={Q}_{\lambda }^{\nu }B{({Q}_{\lambda }^{\nu })}^{{\dagger} },$$
(15)

with \({B}_{\lambda }^{\nu }\in {{\mathbb{C}}}^{{d}_{\lambda }\times {d}_{\lambda }}\). We remark that the restriction of Sn-equivariant generators is independent of the ν multiplicity index (see Eq. (13)). On the other hand, the restrictions of non-equivariant operators (such as the input states ρi) are not independent of ν, meaning that the set composed of all the restrictions \({\rho }_{\lambda }^{\nu }\) contains an exponentially large amount of non-redundant information that the QNN can act on (see also ref. 23).

Denoting the weighted average of the input states as \(\sigma =\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{\rho }_{i}\), we find:

Theorem 1

(Variance of partial derivatives). Let \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}\) be an Sn-equivariant QNN, with generators in \({{{\mathcal{G}}}}\), and O an Sn-equivariant measurement operator from \({{{\mathcal{M}}}}\). Consider an empirical loss \(\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})\) as in Eq. (3). Assuming a circuit depth L such that the QNN forms independent 2-designs on each isotypic block, we have \({\langle {\partial }_{\mu }\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})\rangle }_{{{{\boldsymbol{\theta }}}}}=0\), and

$${{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]=\mathop{\sum}\limits_{\lambda }\frac{2{d}_{\lambda }}{{({d}_{\lambda }^{2}-1)}^{2}}\Delta ({H}_{\mu ,\lambda })\Delta ({O}_{\lambda })\Delta \left(\mathop{\sum }\limits_{\nu =1}^{{m}_{\lambda }}{\sigma }_{\lambda }^{\nu }\right).$$
(16)

Here, \(\Delta (B)={{{\rm{Tr}}}}[{B}^{2}]-\frac{{{{\rm{Tr}}}}{[B]}^{2}}{\dim (B)}\).

In the “Methods”, we present a sketch of the proof for Theorem 1, as well as its underlying assumptions.
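
For reference, the quantity Δ(B) entering Theorem 1 is straightforward to evaluate numerically; the minimal sketch below (ours, for illustration only) computes it and shows that it vanishes for multiples of the identity, the situation associated with untrainability in the discussion that follows.

```python
import numpy as np

def delta(B):
    """Delta(B) = Tr[B^2] - Tr[B]^2 / dim(B); equal to dim(B) times the eigenvalue variance."""
    d = B.shape[0]
    return (np.trace(B @ B) - np.trace(B) ** 2 / d).real

Z = np.diag([1.0, -1.0])
print(delta(np.eye(4)))               # 0.0: a multiple of the identity has no eigenvalue spread
print(delta(np.kron(Z, np.eye(2))))   # 4.0: a traceless operator with eigenvalues +1 and -1
```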

We remark that while we have derived Theorem 1 for Sn-equivariant QNNs and measurement operators, given some general finite-dimensional compact group G, the form of Eq. (16) remains valid provided that one uses a G-equivariant QNN that is universal within each invariant subspace. In this case, the summation over λ will run over the irreps of the representation of G.

Let us now analyze each term in Eq. (16) to identify potential sources of untrainability. First, let us consider the prefactors \(\frac{2{d}_{\lambda }}{{({d}_{\lambda }^{2}-1)}^{2}}\). From Eq. (12) we can readily see that \(\frac{2{d}_{\lambda }}{{({d}_{\lambda }^{2}-1)}^{2}}\in \Omega (\frac{1}{{n}^{3}})\) for any λ. Next, it is convenient to separate the two remaining potential sources of barren plateaus into two categories: i) those that are QNN or measurement dependent, \(\Delta ({H}_{\mu {,}_{\lambda }})\) and Δ(Oλ), and ii) those that are dataset-dependent, \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\). This identification commonly appears when analyzing the absence of barren plateaus (see refs. 34,42,43,106,107) and allows one to study how the architecture and dataset individually affect the trainability. In what follows, we will say that some architecture does not induce barren plateaus if the terms that are QNN or measurement dependent are not exponentially vanishing.

Using tools from representation theory we can obtain the following exact expressions for Sn-equivariant operators.

Theorem 2

Let A be an Sn-equivariant operator. Then

$$\left\{\begin{array}{l}\,{{\mbox{If}}}\,\,A=\mathop{\sum }\limits_{j=1}^{n}{\chi }_{j},\quad \,{{\mbox{then}}}\,\,\Delta ({A}_{\lambda })=2\left(\begin{array}{c}{d}_{\lambda }+1\\ 3\end{array}\right),\quad \\ \,{{\mbox{If}}}\,\,A=\mathop{\sum}\limits_{k < j}{\chi }_{j}{\chi }_{k},\quad \,{{\mbox{then}}}\,\,\Delta ({A}_{\lambda })=\frac{8}{3}\left(\begin{array}{c}{d}_{\lambda }+2\\ 5\end{array}\right),\quad \\ \,{{\mbox{If}}}\,\,A=\mathop{\prod }\limits_{j=1}^{n}{\chi }_{j},\quad \,{{\mbox{then}}}\,\,\Delta ({A}_{\lambda })=\frac{{d}_{\lambda }^{2}-1+n\,{{{\rm{mod}}}}\,2}{{d}_{\lambda }},\quad \end{array}\right.$$
(17)

where χ ∈ {X, Y, Z}.

In Supplementary Methods 6, we also derive formulas for the case of A being k-body operators.

Let us review the implications of Theorem 2. First, note that all elements of our gate-set \({{{\mathcal{G}}}}\) and measurement-set \({{{\mathcal{M}}}}\) are of the form in Theorem 2, and therefore the associated Δ(Aλ) are in Ω(dλ). This follows from the fact that the binomial coefficient \(\left(\genfrac{}{}{0.0pt}{}{n+a}{b}\right)\) scales as a polynomial of degree b in n. Since dλ itself is in Θ(n) (see Eq. (12)), for all λ and μ

$$\Delta ({O}_{\lambda })\,\,{{\mbox{and}}}\,\,\Delta ({H}_{\mu ,\lambda })\in \Omega (n).$$
(18)

Hence, combining this result with Theorem 1 allows us to argue that Sn-equivariant QNNs do not induce barren plateaus.
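
To make the scalings of Theorem 2 concrete, the short script below (an arithmetic illustration, not part of the proofs) evaluates the three closed-form expressions for the unnormalized operators A at the values dλ = n − 2m + 1 occurring for a fixed n, showing how Δ(Aλ) grows with dλ.

```python
from math import comb

def delta_single_body(d):        # A = sum_j chi_j
    return 2 * comb(d + 1, 3)

def delta_two_body(d):           # A = sum_{k<j} chi_j chi_k
    return 8 * comb(d + 2, 5) / 3

def delta_global(d, n):          # A = prod_j chi_j
    return (d ** 2 - 1 + (n % 2)) / d

n = 10
for m in range(n // 2 + 1):
    d = n - 2 * m + 1            # d_lambda for lambda = (n - m, m)
    print(d, delta_single_body(d), delta_two_body(d), round(delta_global(d, n), 2))
```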

Corollary 1

Under the same assumptions as Theorem 1, it follows that, if \(\Delta (\mathop{\sum }\nolimits_{\nu = 1}^{{m}_{\lambda }}{\sigma }_{\lambda }^{\nu })\in \Omega \left(1/{{{\rm{poly}}}}(n)\right)\), then the empirical loss function satisfies

$${{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\hat{{{{\mathcal{L}}}}}]\in \Omega \left(\frac{1}{{{{\rm{poly}}}}(n)}\right).$$
(19)

We note that a crucial requirement for Corollary 1 to hold is that \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\) needs to be, at most, polynomially vanishing. In the “Trainable states” section below, we identify datasets leading to trainability but also to untrainability. Finally, we note that, as discussed in Supplementary Methods 9, Corollary 1 is sufficient to guarantee that the loss function does not exhibit the narrow gorge phenomenon, whereby the minima of the loss occupy an exponentially small volume of parameter space108. In other words, we show that absence of barren plateaus implies absence of narrow gorges and loss function anti-concentration.

Efficient overparametrization

Absence of barren plateaus is a necessary, but not sufficient, condition for trainability, as there could be other issues compromising the parameter optimization. In particular, it has been shown that quantum landscapes can exhibit a large number of local minima29,30,31. As such, here we consider a different aspect of the trainability of Sn-equivariant QNNs: their ability to converge to global minima. For this purpose, we find it convenient to recall the concept of overparametrization.

Overparametrization denotes a regime in machine learning where models have a capacity much larger than that necessary to represent the distribution of the training data; for example, when the number of parameters is greater than the number of training points. Models operating in the overparametrized regime have seen tremendous success in classical deep learning, as they closely fit the training data but still generalize well when presented with new data instances109,110,111,112. Recently, ref. 32 studied overparametrization in the context of QML models. A clear phase transition in the trainability of under- and overparametrized QNNs was evidenced: below some critical number of parameters (underparametrized regime) the optimizer greatly struggles to minimize the loss function, whereas beyond that number of parameters (overparametrized regime) it converges exponentially fast to a solution (see Methods for further details).

Given the desirable features of overparametrization, it is important to estimate how many parameters are needed to achieve this regime. Here, we can derive the following theorem.

Theorem 3

Let \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}\) be an Sn-equivariant QNN with generators in \({{{\mathcal{G}}}}\). Then, \({{{{\mathcal{U}}}}}_{{{{\boldsymbol{\theta }}}}}\) can be overparametrized with \({{{\mathcal{O}}}}({n}^{3})\) parameters.

Theorem 3 guarantees that Sn-equivariant QNNs only require a polynomial number of parameters to reach overparametrization.

Generalization from few data points

Thus far, we have seen that Sn-equivariant QNNs can be efficiently trained, as they exhibit no barren plateaus and can be overparametrized. However, in QML we are not only interested in achieving a small training error, but also aim at a low generalization error26,61,113,114,115,116.

Computing the true loss in Eq. (4) is usually not possible, as the probability distribution P from which the data is sampled is generally unknown. However, one can still derive bounds for gen(θ) which guarantee a certain performance when the model sees new data. Here, we obtain an upper bound for the generalization error via the covering numbers (see Methods)61,117, and prove that the following theorem holds.

Theorem 4

Consider a QML problem with loss function as described in Eq. (4). Suppose that an n-qubit Sn-equivariant QNN \({{{\mathcal{U}}}}({{{\boldsymbol{\theta }}}})\) is trained on M samples to obtain some trained parameters θ*. Then the following inequality holds with probability at least 1 − δ

$${{{\rm{gen}}}}({{{{\boldsymbol{\theta }}}}}^{* })\,\leqslant\,{{{\mathcal{O}}}}\left(\sqrt{\frac{{{{{\rm{Te}}}}}_{n+1}}{M}}+\sqrt{\frac{\log (1/\delta )}{M}}\right).$$
(20)

The crucial implication of Theorem 4 is that we can guarantee \({{{\rm{gen}}}}({{{{\boldsymbol{\theta }}}}}^{* })\leqslant \epsilon\) with high probability if \(M\in {{{\mathcal{O}}}}\left(\frac{{{{{\rm{Te}}}}}_{n+1}+\log (1/\delta )}{{\epsilon }^{2}}\right)\). For fixed δ and ϵ, this implies \(M\in {{{\mathcal{O}}}}({n}^{3})\), i.e., we only need a polynomial number of training points. Also note that this result shows that minimizing the empirical loss closely minimizes the true loss with high probability. Say that \({\hat{{{{\mathcal{L}}}}}}^{* }=\mathop{\inf }\limits_{{{{\boldsymbol{\theta }}}}}\hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})\) is the minimal empirical loss and \({{{{\mathcal{L}}}}}^{* }=\mathop{\inf }\limits_{{{{\boldsymbol{\theta }}}}}{{{\mathcal{L}}}}({{{\boldsymbol{\theta }}}})\) the minimal true loss. Then, with \(M\in {{{\mathcal{O}}}}\left(\frac{{{{{\rm{Te}}}}}_{n+1}+\log (1/\delta )}{{\epsilon }^{2}}\right)\) training data points, the inequality \(| {\hat{{{{\mathcal{L}}}}}}^{* }-{{{{\mathcal{L}}}}}^{* }| \leqslant \epsilon\) holds with probability at least 1 − δ.
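
As a back-of-the-envelope illustration of this sample-complexity reading of Theorem 4, the helper below (our own sketch; the constant `const` is a hypothetical stand-in for the factor hidden in the big-O) estimates the number of training states suggested by the bound.

```python
from math import comb, log, ceil

def tetrahedral(n):
    """Te_{n+1} = binom(n + 3, 3), the dimension bound of Lemma 2."""
    return comb(n + 3, 3)

def training_points(n, eps=0.1, delta=0.05, const=1.0):
    """M ~ const * (Te_{n+1} + log(1/delta)) / eps^2; `const` is a hypothetical stand-in."""
    return ceil(const * (tetrahedral(n) + log(1 / delta)) / eps ** 2)

for n in (4, 8, 16, 32):
    print(n, tetrahedral(n), training_points(n))
```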

Lastly, we remark that Theorem 4 can be readily adapted to other GQML models. As shown in Methods, this theorem stems from the fact that the equivariant unitary submanifold, in its block-diagonal form in Eq. (13), can be covered117 by ε-balls in a block-wise manner. In Supplementary Methods 8, we also show that the VC dimension118 of equivariant QNNs (and also more general parameterized channels) can be upper bounded by the dimension of the commutant of the symmetry group, a fact which could be of independent interest.

Trainable states

As discussed in the previous section, Sn-equivariant QNNs and measurement operators cannot induce barren plateaus. Thus, the trainability of the model hinges on the behavior of \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\). We note that this dataset-dependent trainability is not unique to Sn-equivariant QNNs, but is rather present in all existing absence-of-barren-plateau results (see refs. 34,42,43,106,107,119), as there always exist datasets for which an otherwise trainable model can be rendered untrainable.

To understand the conditions that lead to an exponential vanishing of \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\), we note that for a Hermitian operator B, we have \(\Delta (B)={D}_{{{{\rm{HS}}}}}\left(B,\frac{{{{\rm{Tr}}}}[B]}{\dim (B)}{\mathbb{1}}\right)\), where \({D}_{{{{\rm{HS}}}}}(A,B)=\parallel A-B{\parallel }_{2}^{2}\) is the (squared) Hilbert-Schmidt distance. Alternatively, we can interpret Δ(B) as dim(B) times the variance of the eigenvalues of B. From here, we can see that one will obtain trainability if, for at least one λ, \({\sum }_{\nu }{\sigma }_{\lambda }^{\nu }\) is not exponentially close to a multiple of the identity.

In Table 1, we present examples of states for which \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\) vanishes polynomially, leading to a trainable model, but also cases where the input state leads to exponentially vanishing \(\Delta ({\sum }_{\nu }{\sigma }_{\lambda }^{\nu })\) and thus to a barren plateau. While we leave the details of how each type of input state is generated for the Methods section, we note that the results in Table 1 demonstrate the critical role that the input states play in determining the trainability of a model (this will be further elucidated in numerical results below). Such insight is particularly important as one can create adversarial datasets yielding barren plateaus (see Supplementary Methods 10). Moreover, it indicates that care must be taken when encoding classical data into quantum states as the embedding scheme can induce trainability issues42,119.

Table 1 Input pure states and their effect on the trainability of Sn-equivariant QNNs.

Numerical results

Here, we consider the task of classifying connected graph states from disconnected graph states, which are prepared as follows. First, we generate n-node random graphs from the Erdős-Rényi distribution120, with an edge probability of 40%. The ensuing graphs are binned into two categories: connected and disconnected. We then embed these graphs into quantum graph states via the canonical scheme of refs. 121,122 (see Methods section). We highlight that such an encoding preserves symmetries in the input data, in the sense that a permutation of the underlying graph yields a permutation of the qubits constituting its graph state (i.e., of the form Eq. (8)). This allows us to create a dataset where half of the states encode connected graphs (label yi = + 1), and the other half encode disconnected graphs (label yi = − 1). To analyze the data, we use an Sn-equivariant QNN with generators in Eq. (9) (see also Fig. 2), and measure the operator \(O=\frac{2}{n(n-1)}\mathop{\sum }\nolimits_{k < j;j = 1}^{n}{X}_{j}{X}_{k}\).
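
A minimal sketch of this data-generation pipeline is given below. It assumes the networkx package for sampling Erdős-Rényi graphs and uses the standard CZ-based graph-state construction of refs. 121,122; it is an illustration rather than the exact code used for our simulations.

```python
import networkx as nx
import numpy as np

def graph_state(G):
    """|G> = prod_{(j,k) in E} CZ_{jk} |+>^{tensor n} for a graph G on n nodes."""
    n = G.number_of_nodes()
    psi = np.ones(2 ** n) / np.sqrt(2 ** n)          # |+>^{tensor n}
    for (j, k) in G.edges():
        for idx in range(2 ** n):
            bj = (idx >> (n - 1 - j)) & 1
            bk = (idx >> (n - 1 - k)) & 1
            if bj and bk:                            # CZ applies a -1 phase on |11>
                psi[idx] *= -1
    return psi

def sample_dataset(n, n_samples, p=0.4, seed=0):
    """Erdos-Renyi graphs with edge probability p, labeled by connectivity."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_samples):
        G = nx.erdos_renyi_graph(n, p, seed=int(rng.integers(1 << 30)))
        label = +1 if nx.is_connected(G) else -1
        data.append((graph_state(G), label))
    return data

dataset = sample_dataset(n=4, n_samples=10)
print([y for _, y in dataset])
```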

In the following, we characterize the trainability and generalization properties of Sn-equivariant QNNs for this classification task, but we note that further aspects of the problem are discussed in the Supplementary Note. These include analyzing the effect of the graph encoding scheme on the trainability, the irrep contributions to the gradient variance, and comparing Sn-equivariant QNNs against problem-agnostic ones. In particular, the latter comparison shows that for the present graph classification task, problem-agnostic models are hard to train and tend to greatly overfit the data, i.e., they have large generalization errors despite performing well on the training data.

Numerics on barren plateaus

In Fig. 5a we show the variance of the loss function partial derivatives for a parameter θμ in the middle of the QNN. Each point is evaluated for a total of 50 random input states, and with 20 random sets of parameters θ per input. We can see that when the variance is evaluated for states randomly drawn from the whole dataset (with an equal number of connected and disconnected graphs), then \({{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\hat{{{{\mathcal{L}}}}}]\) only decreases polynomially with the system size (as evidenced by the curved line in the log-linear scale), meaning that the model does not exhibit a barren plateau. We note that, as shown in Fig. 5a, when the input to the QNN is a disconnected graph state, the variance vanishes polynomially, whereas if we input a connected graph state it vanishes exponentially. This illustrates a key fact of QML: when trained over a dataset, the data from different classes can contribute very differently to the model’s trainability (see ref. 18 for a discussion on how this result enables new forms of classification).

Fig. 5: Task of distinguishing connected from disconnected graphs with an Sn-equivariant QNN.

a Variance of the loss function partial derivatives versus the number of qubits n (in log-linear scale). The blue squares depict the variance for inputs of the QNN drawn from a dataset composed of connected and disconnected graph states. To visualize how the data with different labels contribute to this variance, we also plot with green crosses (orange circles) the variances when the QNN is only fed connected (disconnected) graph states. b In the left panel, we show representative results for the rank of the QFIM (defined in the main text) versus the number of layers L for different numbers of qubits n. The critical number of layers at which this rank saturates, denoted Lovp (vertical dashed lines), corresponds to the onset of overparametrization. In the middle panel, we report the scaling of Lovp versus the number of qubits (log-linear scale). For each problem size, we present results for ten random input graph states and, as a comparison, also report the Tetrahedral numbers Ten+1 (solid line). In the right panel, we report the relative loss error of optimized QNNs at a given number of layers L (in log-linear scale). These are obtained for different system sizes, with the dashed vertical lines indicating the corresponding values of Lovp. c Normalized generalization error versus number of qubits n (in log-linear scale) for different training dataset sizes M. Here, we consider an overparametrized QNN with L = Ten+1.

Numerics on overparametrization

Following the results in ref. 32, let us analyze the overparametrization phenomenon by studying the rank of the quantum Fisher information matrix (QFIM)123,124, denoted F(θ) and whose entries are given by

$${[F({{{\boldsymbol{\theta }}}})]}_{jk}=4{{{\rm{Re}}}}[\left\langle {\partial }_{j}\psi ({{{\boldsymbol{\theta }}}})\vert {\partial }_{k}\psi ({{{\boldsymbol{\theta }}}})\right\rangle -\left\langle {\partial }_{j}\psi ({{{\boldsymbol{\theta }}}})\vert \psi ({{{\boldsymbol{\theta }}}})\right\rangle \left\langle \psi ({{{\boldsymbol{\theta }}}})\vert {\partial }_{k}\psi ({{{\boldsymbol{\theta }}}})\right\rangle ],$$

with \(\left\vert \psi ({{{\boldsymbol{\theta }}}})\right\rangle =U({{{\boldsymbol{\theta }}}})\left\vert \psi \right\rangle\), and \(\left\vert {\partial }_{i}\psi ({{{\boldsymbol{\theta }}}})\right\rangle =\partial \left\vert \psi ({{{\boldsymbol{\theta }}}})\right\rangle /\partial {\theta }_{i}={\partial }_{i}\left\vert \psi ({{{\boldsymbol{\theta }}}})\right\rangle\) for \({\theta }_{i}\in {{{\boldsymbol{\theta }}}}\). The rank of the QFIM quantifies the number of potentially accessible directions in state space. In this sense, the model is overparametrized if the QFIM rank is saturated, i.e., if adding more parameters (or layers) to the QNN does not further increase the QFIM rank. When this occurs, one can access all possible directions in state space and efficiently reach the solution manifold32,125,126. On the other hand, the model is underparametrized if the QFIM rank is not maximal. In this case, there exist inaccessible directions in state space, leading to false local minima, that is, local minima that are not actual minima of the loss function.
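
A generic way to estimate the QFIM for any parametrized state is via finite differences of \(\left\vert \psi ({{{\boldsymbol{\theta }}}})\right\rangle\); the sketch below (ours, for illustration; the toy single-qubit state is a hypothetical example) implements the entries defined above and reports the QFIM rank, the quantity whose saturation with L signals the overparametrization onset.

```python
import numpy as np

def qfim(psi_of_theta, theta, eps=1e-6):
    """F_jk = 4 Re[<d_j psi|d_k psi> - <d_j psi|psi><psi|d_k psi>] via central differences."""
    theta = np.asarray(theta, dtype=float)
    psi = psi_of_theta(theta)
    grads = []
    for j in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[j] = eps
        grads.append((psi_of_theta(theta + shift) - psi_of_theta(theta - shift)) / (2 * eps))
    F = np.zeros((len(theta), len(theta)))
    for j, gj in enumerate(grads):
        for k, gk in enumerate(grads):
            F[j, k] = 4 * np.real(np.vdot(gj, gk) - np.vdot(gj, psi) * np.vdot(psi, gk))
    return F

def toy_state(theta):
    """Hypothetical toy example: |psi(theta)> = RY(t1) RX(t0) |0> on a single qubit."""
    t0, t1 = theta
    rx = np.array([[np.cos(t0 / 2), -1j * np.sin(t0 / 2)],
                   [-1j * np.sin(t0 / 2), np.cos(t0 / 2)]])
    ry = np.array([[np.cos(t1 / 2), -np.sin(t1 / 2)],
                   [np.sin(t1 / 2), np.cos(t1 / 2)]])
    return ry @ rx @ np.array([1.0, 0.0])

F = qfim(toy_state, [0.3, 0.7])
print(np.linalg.matrix_rank(F, tol=1e-6))   # 2: both directions in state space are accessible
```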

In Fig. 5(b, left panel) we report representative results of the QFIM rank versus the number of layers L for problems with even numbers of qubits n ∈ [4, 16]. These results correspond to random connected graphs and random values of θ. Here we can see that, for a given n, as the number of layers increases, the rank of the QFIM also increases until it reaches a saturation point. Once this critical number of layers (denoted as Lovp) is reached, the model is considered to be overparametrized32. In Fig. 5(b, middle panel) we plot the scaling of Lovp (for 10 random connected or disconnected graphs per system size) versus n, as well as the Tetrahedral numbers Ten+1. As can be seen, in all cases the overparametrization onset occurs for a number of layers Lovp < Ten+1, indicating efficient overparametrization.

To appreciate the practical effects of overparametrization, we report in Fig. 5(b, right panel) the optimization performance of Sn-equivariant QNNs as a function of the number L of layers employed. All the optimizations are performed using the hinge loss function, with the L-BFGS-B optimization algorithm127. The system sizes are n ∈ [4, 16] qubits, and correspond to the graphs that were studied in the left panel and highlighted in the middle one. The relative loss error reported indicates how close an optimized QNN is to the best achievable model. Explicitly, it is defined as \(| {\hat{{{{\mathcal{L}}}}}}_{L}-{\hat{{{{\mathcal{L}}}}}}_{{{\mbox{min}}}}| /| {\hat{{{{\mathcal{L}}}}}}_{{{\mbox{min}}}}|\), where \({\hat{{{{\mathcal{L}}}}}}_{L}\) is the loss achieved after optimization of a QNN with a given L, and where \({\hat{{{{\mathcal{L}}}}}}_{{{\mbox{min}}}}\) is the minimum loss achieved over all the values of L considered, i.e., \({\hat{{{{\mathcal{L}}}}}}_{{{\mbox{min}}}}={\min }_{L}{\hat{{{{\mathcal{L}}}}}}_{L}\) (we systematically verify that for sufficiently large L all optimizations reliably converge to this same loss \({\hat{{{{\mathcal{L}}}}}}_{{{\mbox{min}}}}\)). For every value of n studied, we see that for a small number of layers the optimizer struggles to significantly minimize the loss. However, as L increases, there exists a computational phase transition whereby the optimizer is able to easily identify optimal parameters and reach much smaller loss values. Notably, such computational phase transition occurs slightly before Lovp (indicated by a dashed vertical line), meaning that even before the QFIM rank saturates, the model has sufficiently many accessible directions to efficiently reach the solution manifold. Overall, we see that for a number of layers growing at most polynomially with n, one can ensure convergence of the model to a solution.

Numerics on generalization error

In Fig. 5c, we study the generalization error of an overparametrized Sn-equivariant QNN (with L = Ten+1) for different training dataset sizes M and with respect to test sets of size Mtest = 2 × Ten+1 that are drawn independently of the training ones. Generalization errors are evaluated for random QNN parameters θ and we report the 90-th percentile of the errors obtained, i.e., for 1 − δ = 90% in Eq. (20). In the plot, we show the normalized generalization error \(g({{{\boldsymbol{\theta }}}})={{{\rm{gen}}}}({{{\boldsymbol{\theta }}}})/{{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}},\rho }^{1/2}[\ell ({{{\boldsymbol{\theta }}}},\rho )]\). We stress that such normalization can only increase the generalization errors obtained, and is only used in order to compare generalization errors across different values of n without artifacts resulting from loss concentration effects as the system sizes grow. As seen in Fig. 5(c), when the size of the training set is constant, the generalization error is also approximately constant across problem sizes. However, when the training set size scales with n, the generalization error decreases with n, with this even occurring for M = n. Notably, if \(M={{{{\rm{Te}}}}}_{n+1}\in \Theta ({n}^{3})\), we can see that the generalization error significantly decreases with problem size. That is, for this problem, we found generalization errors to be better than the scaling of the bounds derived in Eq. (20).

Discussion

GQML has recently been proposed as a framework for systematically creating models with sharp geometric priors arising from the symmetries of the task at hand18,19,20,21,22. Despite its great promise, this nascent field has only seen heuristic success, as no true performance guarantees have been proved for its models. In this work we provide the first theoretical guarantees for GQML models aimed at problems with permutation invariance. Our first contribution is the introduction of the Sn-equivariant QNN architecture. Using tools from representation theory, we rigorously find that these QNNs present salient features such as absence of barren plateaus (and narrow gorges), generalization from very few data points, and a capability of being efficiently overparametrized. All these favorable properties can be viewed as direct consequences of the inductive biases embedded in the model, which greatly limit its expressibility37,46,128. Namely, these Sn-equivariant QNNs act only on the (polynomially large) multiplicity spaces of the qubit-defining representation of Sn. To complete our analysis, we performed numerical simulations for a graph classification task and heuristically found that the model’s performance is even better than that predicted by our theoretical results.

Taken together, our results provide the first rigorous guarantees for equivariant QNNs, and demonstrate that GQML may be a powerful tool in the QML repertoire. We highlight that while we focus on problems with Sn symmetry, many of our proof techniques hold for general finite-dimensional compact groups. Hence, we hope that the representation-theory-based techniques used here can serve as blueprints to analyze the performance of other models. We envision that in the near future, GQML models with provable guarantees will be widespread in the QML literature.

Finally, we note that while our results were derived in the absence of noise, it would be interesting to account for hardware imperfections. Clearly, the presence of noise would change our analysis, and most likely weaken our trainability guarantees. As such, while we can guarantee that Sn-equivariant QNNs will be useful on fault-tolerant quantum devices, we do not abandon hope that they can be used in the near-term era provided that noise levels are small enough.

Note added: In light of the recent preprint ref. 129, we have added a detailed discussion in the Supplementary Note regarding the possibility of classically simulating Sn-equivariant QNNs. As we argue there, for most relevant cases in QML, the algorithm in ref. 129 is not fully classical, as it requires access to a quantum computer to obtain a “classical description” of the input data. Moreover, even if one is given such a “classical description”, the ensuing algorithm that replaces the use of a QNN scales extremely poorly with the number of qubits. Taken together, these results indicate that if one has access to a quantum computer, it is not entirely obvious whether one should use it to obtain a classical description of the data followed by expensive post-processing, or whether one should run the QNN on the quantum device and exploit its favorable properties such as efficient overparametrization and absence of barren plateaus. We leave such a comparison for future work.

Now, we will briefly compare Sn-equivariant QNNs to other barren-plateau-avoiding architectures.

First, let us consider the shallow hardware efficient ansatz (HEA)34,130 and the quantum convolutional neural network (QCNN)60,106. While our goal is not to provide a comprehensive description of these models, we recall the three key properties leading to their trainability: locality of the gates, shallowness of the circuit, and locality of the measurement operator. Both the HEA and QCNN are composed of parametrized gates acting in a brick-like fashion on alternating pairs of neighboring qubits (local gates), and contain only a few (logarithmically many) layers of such gates (shallowness of the circuit). The combination of these two factors leads to a low scrambling power and a greatly limited expressibility of the QNN. The final ingredient for their trainability is measuring a local operator (i.e., an operator acting non-trivially on a small number of qubits). While this assumption is guaranteed for QCNNs (due to their feature-space reduction property), the HEA can be shown to be untrainable for global measurements (i.e., operators acting non-trivially on all qubits). Here we can already see that Sn-equivariant QNNs do not share the properties leading to trainability in HEAs and QCNNs. To begin, we can see from the set of generators \({{{\mathcal{G}}}}\) in Eq. (9) that the Sn-equivariant architecture allows for all long-range interactions in each layer, breaking the locality-of-gates assumption. Moreover, and in stark contrast to HEAs, one can train the Sn-equivariant QNN even when measuring global observables (for instance, we allow for \(O=\mathop{\prod }\nolimits_{j = 1}^{n}{X}_{j}\) in Eq. (10)). Finally, we remark that HEAs and QCNNs cannot be efficiently overparametrized, as they require an exponentially large number of parameters to reach overparametrization43. On the other hand, according to Theorem 3 the Sn-equivariant QNN can be overparametrized with polynomially many layers.

Next, let us consider the transverse field Ising model Hamiltonian variational ansatz (TFIM-HVA)43,45. The mechanism leading to absence of barren plateaus in this architecture is more closely related to that of the Sn-equivariant model, although there are still some crucial differences. On the one hand, it can be shown that the TFIM-HVA has an extremely limited expressibility, having only a maximum number of free parameters in \({{{\mathcal{O}}}}({n}^{2})\), and being able to reach overparametrization with polynomially many layers. While this is similar to the case of Sn-equivariant architectures (see Lemma 2 and Theorem 3), the block-diagonal structure of the TFIM-HVA is fundamentally different from that arising from Sn-equivariance: the TFIM-HVA unitary has four exponentially large blocks repeated a single time each, while Sn-equivariant unitaries have polynomially sized blocks repeated exponentially many times. This subtle, albeit important, distinction makes it such that Sn-equivariant QNNs enjoy generalization guarantees (from Theorem 4) which are not directly applicable to TFIM-HVA architectures.

The previous discussion shows that Sn-equivariant QNNs stand out amid the other trainable architectures, exhibiting many favorable properties that other models only partially enjoy.

Lastly, we now consider future directions and possible extensions of our work. We recall that Definition 3 requires every layer of the QNN to be equivariant. This is evidently not general, as one could have several consecutive layers which are not individually equivariant, but which compose to an equivariant unitary for certain θ18,131. While in this manuscript we do not consider this scenario, it is worth exploring how less strict equivariance conditions affect the performance and the trainability guarantees derived here. Second, we note that, as indicated in this work, the block-diagonal structure of the Sn-equivariant QNN restricts the information in the input data that the model can access. This could lead to situations where the model cannot solve the learning task as it cannot ‘see’ the relevant information in the input states. Such an issue can in principle be solved by allowing the model to simultaneously act on multiple copies of the data, and even to change the representation of Sn throughout the circuit23. We also leave this exploration for future work.

Another potentially interesting research direction would be equivariant embeddings and re-uploading of classical data. For the purposes of this work, we make no assumptions about the source or form of the data, such as whether it is quantum or classical. However, when analyzing classical data on a quantum computer, embeddings become important. We give one such example, which we call a ‘fixed Hamming-weight encoding’. Another example is the standard encoding of a graph into a graph state, which we considered in our numerics. This is far from exhaustive and more sophisticated methods exist, including trainable encodings54. Similarly, we have not studied how our results change in the presence of data re-uploading132. We know that if the data is re-uploaded via equivariant generators (e.g., if the data re-uploading unitary takes the form \(V({{{\boldsymbol{x}}}})={\prod }_{{l}^{{\prime} }}{{{{\rm{e}}}}}^{-{{{\rm{i}}}}{x}_{l}{H}_{l}}\), with Hl being Sn-equivariant), then our theoretical guarantees do not change, since the dynamical Lie algebra (DLA) of the circuit remains the same. We leave the study of more general encoding and re-uploading schemes for future work.

Methods

This section provides an overview of the different tools used in the main text. Here we also present a sketch of the proof of our main results. Full details can be found in the Supplementary Methods.

Building Sn-equivariant operators

Here, we briefly describe how to build Sn-equivariant operators that can be used as generators of the QNN, or as measurement operators. In particular, we will focus on the so-called twirling method19,23. Take a unitary representation R of a discrete group G over a vector space V. Then the twirl operator is the linear map \({{{{\mathcal{T}}}}}_{G}:GL(V)\to GL(V)\), defined as

$${{{{\mathcal{T}}}}}_{G}(A)=\frac{1}{| G| }\mathop{\sum}\limits_{g\in G}R(g)AR{(g)}^{{\dagger} }.$$
(21)

It can be readily verified that the twirling of any operator A yields a G-equivariant operator, i.e., we have \([{{{{\mathcal{T}}}}}_{G}(A),R(g)]=0\) for any g ∈ G.

The previous allows us to obtain a G-equivariant operator from any operator A ∈ GL(V). For instance, let us consider the case of G = Sn, with R the qubit-defining representation and A = X1. Then, we have \({{{{\mathcal{T}}}}}_{G}({X}_{1})=\frac{1}{n!}{\sum }_{\pi \in {S}_{n}}R(\pi ){X}_{1}R{(\pi )}^{{\dagger} }=\frac{1}{n}\mathop{\sum }\nolimits_{i = 1}^{n}{X}_{i}={{{{\mathcal{T}}}}}_{G}({X}_{j})\) for any 1 ⩽ j ⩽ n. Note that twirling over Sn cannot change the locality of an operator. That is, twirling a k-body operator leads to a sum of k-body operators.
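
The smallest non-trivial instance of this twirling example can be checked directly; the snippet below (our own illustration) averages X1 over S2 = {id, SWAP} and confirms that the result is (X1 + X2)/2, which commutes with SWAP.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

X1 = np.kron(X, I2)
X2 = np.kron(I2, X)

twirl = (X1 + SWAP @ X1 @ SWAP.conj().T) / 2      # (1/|G|) sum_g R(g) X_1 R(g)^dagger
assert np.allclose(twirl, (X1 + X2) / 2)          # equals (1/n) sum_j X_j for n = 2
assert np.allclose(twirl @ SWAP, SWAP @ twirl)    # the twirled operator is S2-equivariant
print("T_G(X_1) = (X_1 + X_2)/2 verified for n = 2")
```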

Representation theory of Sn

In this section we review a few basic notions from representation theory. For a more thorough treatment we refer the reader to refs. 133,134,135,136, and more specifically to the tutorial in ref. 25 which provides an introduction to representation theory from the perspective of QML. We recall that we are interested in the qubit-defining representation of Sn, i.e., the one permuting qubits

$$R(\pi \in {S}_{n})\mathop{\bigotimes }\limits_{i=1}^{n}\left\vert {\psi }_{i}\right\rangle =\mathop{\bigotimes }\limits_{i=1}^{n}\left\vert {\psi }_{{\pi }^{-1}(i)}\right\rangle .$$

As mentioned in the main text, representations break down into fundamental building blocks called irreducible representations (irreps).

Definition 4

(Irrep decomposition). Given some unitary representation R of a compact group G, there exists a basis under which it takes a block diagonal form

$$R(g\in G)\cong \mathop{\bigoplus}\limits_{\lambda }\mathop{\bigoplus }\limits_{\mu =1}^{{m}_{{r}_{\lambda }}}{r}_{\lambda }(g)=\mathop{\bigoplus}\limits_{\lambda }{r}_{\lambda }(g)\otimes {{\mathbb{1}}}_{{m}_{{r}_{\lambda }}},$$
(22)

with rλ(g) the irreps of G, each appearing \({m}_{{r}_{\lambda }}\) times.

The irreps of the symmetric group are commonly labeled by the set of partitions of the integer n. A partition of a positive integer \(n\in {\mathbb{N}}\) is a non-increasing sequence of positive integers λ = (λ1, …, λk) satisfying ∑iλi = n. Partitions are typically visualized using Young diagrams, a set of empty, left-justified boxes arranged in rows such that there are λi boxes in the i-th row. For instance, the integer n = 3 admits the partitions

$$\lambda =(3),\qquad \lambda =(2,1),\qquad \lambda =(1,1,1).$$
(23)

We note that in the case of the qubit-defining representation, the only λ appearing in Eq. (22) are those with at most two rows (e.g., the last partition in Eq. (23) does not appear).

The dimension of an Sn irrep rλ can be computed from the hook length formula

$$\dim ({r}_{\lambda })=\frac{n!}{{\prod }_{b\in \lambda }{h}_{\lambda }(b)},$$
(24)

where each hλ(b) is the hook length of box b in λ, i.e., the total number of boxes in the ‘hook’ (or ‘L’ shape) composed of box b itself, every box beneath it (in the same column), and every box to its right (in the same row).
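
For concreteness, here is a short Python sketch of Eq. (24) (purely illustrative and of our own making): it computes the hook lengths of an arbitrary partition and returns the corresponding irrep dimension.

```python
from math import factorial

def hook_lengths(partition):
    """Hook lengths of all boxes in the Young diagram of `partition` (a non-increasing tuple)."""
    col_heights = [sum(1 for row in partition if row > c) for c in range(partition[0])]
    hooks = []
    for i, row_len in enumerate(partition):
        for j in range(row_len):
            arm = row_len - j - 1             # boxes to the right in the same row
            leg = col_heights[j] - i - 1      # boxes below in the same column
            hooks.append(arm + leg + 1)       # +1 for the box itself
    return hooks

def irrep_dim(partition):
    """Dimension of the S_n irrep labeled by `partition`, via the hook length formula of Eq. (24)."""
    prod = 1
    for h in hook_lengths(partition):
        prod *= h
    return factorial(sum(partition)) // prod

# The three partitions of n = 3 (Eq. (23)) give the trivial, standard and sign irreps
print([irrep_dim(p) for p in [(3,), (2, 1), (1, 1, 1)]])   # [1, 2, 1]
```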

Given the block-diagonal structure of R in Eq. (22), one can see that a general G-equivariant operator has to be of the form

$$A\cong \mathop{\bigoplus}\limits_{\lambda }{{\mathbb{1}}}_{\dim ({r}_{\lambda })}\otimes {A}_{\lambda },$$
(25)

where Aλ are \({m}_{{r}_{\lambda }}\)-dimensional matrices repeated \(\dim ({r}_{\lambda })\) times. In general, the number of times an irrep appears in an arbitrary representation R (i.e., \({m}_{{r}_{\lambda }}\) in Eq. (22)) can be determined through character theory. In our case, however, we will take a shortcut and exploit one of the most remarkable results in representation theory, the Schur-Weyl duality137.

Consider the representation Q of the unitary group \({\mathbb{U}}(2)\) acting on \({{{\mathcal{H}}}}={({{\mathbb{C}}}^{2})}^{\otimes n}\) through the n-fold tensor product \(Q(W\in {\mathbb{U}}(2))={W}^{\otimes n}\). Evidently, according to Eq. (22), Q will also have an isotypic decomposition

$$Q(W\in {\mathbb{U}}(2))=\mathop{\bigoplus}\limits_{s}{{\mathbb{1}}}_{{m}_{{q}_{s}}}\otimes {q}_{s}(W),$$
(26)

where s labels the different (spin) irreps of \({\mathbb{U}}(2)\). The Schur-Weyl duality states that the matrix algebras \({\mathbb{C}}[R]\) and \({\mathbb{C}}[Q]\) mutually centralize each other, meaning that \({\mathbb{C}}[R]\) is the space of \({\mathbb{U}}(2)\)-equivariant linear operators, and similarly \({\mathbb{C}}[Q]\) is the space of Sn-equivariant ones. As a consequence of this duality, \({{{\mathcal{H}}}}\) can be decomposed as \({{{\mathcal{H}}}}{\cong \bigoplus }_{\lambda }{V}_{\lambda }\otimes {W}_{\lambda }\), where λ simultaneously labels irrep spaces Vλ and Wλ for Sn and \({\mathbb{U}}(2)\), respectively. That is, \({{{\mathcal{H}}}}\) supports a simultaneous action of Sn and \({\mathbb{U}}(2)\), where the irreps of each appear exactly once and are correlated: each of the two-row Young diagrams λ = (n − m, m) labeling the irreps in R can be associated unequivocally with a spin label s(λ) of a \({\mathbb{U}}(2)\) irrep appearing in Q

$$s(\lambda )=\frac{{\lambda }_{1}-{\lambda }_{2}}{2}=\frac{n-2m}{2}.$$
(27)

Moreover, since under the joint action of \({S}_{n}\times {\mathbb{U}}(2)\) the multiplicities are one, one can assert that the irrep qλ of \({\mathbb{U}}(2)\) appears \(\dim ({r}_{\lambda })\)-times in Q, and conversely, the irrep rλ of Sn appears \(\dim ({q}_{\lambda })\)-times in R. Using the well-known dimension of spin irreps \(\dim ({q}_{s})=2s+1\), we can derive an expression for the multiplicity of Sn irreps

$${m}_{{r}_{\lambda }}=\dim ({q}_{s(\lambda )})=2s(\lambda )+1=n-2m+1.$$
(28)

Also, it is straightforward to adapt the formula in Eq. (24) to two-row diagrams λ = (n − m, m)

$$\dim ({r}_{\lambda })=\frac{n!(n-2m+1)!}{(n-m+1)!m!(n-2m)!}.$$
(29)

We finally note that, since we are ultimately interested in Sn-equivariant operators, in the main text we have defined \({d}_{\lambda }\equiv {m}_{{r}_{\lambda }}\) and \({m}_{\lambda }\equiv \dim ({r}_{\lambda })\). That is, the dimension and multiplicity of an irrep in the main text refer to those of the \({\mathbb{U}}(2)\) irreps.
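
To make the bookkeeping concrete, the following sketch (our own, directly evaluating Eqs. (28) and (29)) lists, for every two-row diagram λ = (n − m, m), the pair (dλ, mλ) in the main-text convention, and checks that the blocks tile the full Hilbert space, \({\sum }_{\lambda }{d}_{\lambda }{m}_{\lambda }={2}^{n}\).

```python
from math import factorial

def blocks(n):
    """(d_lambda, m_lambda) for every two-row diagram lambda = (n - m, m), with the main-text conventions."""
    out = []
    for m in range(n // 2 + 1):
        d_lam = n - 2 * m + 1                                    # Eq. (28): dimension of the U(2) irrep
        m_lam = (factorial(n) * factorial(n - 2 * m + 1)
                 // (factorial(n - m + 1) * factorial(m) * factorial(n - 2 * m)))  # Eq. (29)
        out.append((d_lam, m_lam))
    return out

n = 6
print(blocks(n))                                   # [(7, 1), (5, 5), (3, 9), (1, 5)]
print(sum(d * m for d, m in blocks(n)) == 2 ** n)  # True: the blocks account for all 2^n dimensions
```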

Universality, expressibility, and dynamical Lie algebra

In the main text we have argued that the set of generators in Eq. (9) is universal within each invariant subspace. Here we will formalize this statement.

First, let us recall that a parametrized unitary is said to be universal if it can generate any unitary (up to a global phase) in the space over which it acts. One can quantify the capacity to create different unitaries through so-called measures of expressibility37,43,46,128. Here we will focus on the notion of potential expressibility of a given QNN, which is formalized via the dynamical Lie algebra of the architecture138.

Definition 5

(Dynamical Lie algebra). Given a set of generators \({{{\mathcal{G}}}}\) defining a QNN, its dynamical Lie algebra \({\mathfrak{g}}\) is the span of the Lie closure \({\langle {{{\mathcal{G}}}}\rangle }_{{{{\rm{Lie}}}}}\) of \({{{\mathcal{G}}}}\). That is, \({\mathfrak{g}}={{{{\rm{span}}}}}_{{\mathbb{R}}}{\langle {{{\mathcal{G}}}}\rangle }_{{{{\rm{Lie}}}}}\), where \({\langle {{{\mathcal{G}}}}\rangle }_{{{{\rm{Lie}}}}}\) is defined as the set of all the nested commutators generated by the elements of \({{{\mathcal{G}}}}\).

In particular, the dynamical Lie algebra (DLA) fully characterizes the group of unitaries that can ultimately be expressed by the circuit: for any unitary U realized by a QNN with generators in \({{{\mathcal{G}}}}\) there exists an anti-hermitian operator \(\eta \in {\mathfrak{g}}\) such that U = eη. Evidently, \({\mathfrak{g}}\subseteq {\mathfrak{u}}(d)\), that is, it is a subalgebra of the space of anti-hermitian operators. When \({\mathfrak{g}}\) is \({\mathfrak{su}}(d)\) or \({\mathfrak{u}}(d)\) we say that the QNN is controllable or universal, since for any pair of states \(\left\vert \psi \right\rangle\) and \(\left\vert \phi \right\rangle\) there exists a unitary U = eη with \(\eta \in {\mathfrak{g}}\) such that \(| \langle \phi | U| \psi \rangle {| }^{2}=1\).
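
As a purely numerical illustration of Definition 5 (a sketch with generators and helper names of our own choosing, which may differ from the exact set in Eq. (9)), the code below computes \(\dim ({\mathfrak{g}})\) by adding nested commutators until the span stops growing. Since the chosen generators are Sn-equivariant, the resulting DLA is contained in the commutant, so its dimension is upper bounded by the Θ(n3) dimension of the commutant (Lemma 2).

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(ops, n):
    """Tensor product with ops[i] on qubit i and identity elsewhere."""
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, ops.get(i, I2))
    return out

def sn_equivariant_generators(n):
    """Anti-Hermitian versions of the twirled operators sum_i X_i, sum_i Z_i and sum_{i<j} Z_i Z_j."""
    H1 = sum(embed({i: X}, n) for i in range(n))
    H2 = sum(embed({i: Z}, n) for i in range(n))
    H3 = sum(embed({i: Z, j: Z}, n) for i in range(n) for j in range(i + 1, n))
    return [1j * H for H in (H1, H2, H3)]

def dla_dimension(generators, tol=1e-7):
    """dim_R of the Lie closure: keep adding (normalized) nested commutators until no new direction appears."""
    onb, elements = [], []   # orthonormal real vectors spanning the algebra found so far, and their matrices

    def try_add(A):
        nrm = np.linalg.norm(A)
        if nrm < tol:
            return False
        A = A / nrm
        v = np.concatenate([A.real.ravel(), A.imag.ravel()])
        for b in onb:
            v = v - np.dot(b, v) * b
        if np.linalg.norm(v) > tol:
            onb.append(v / np.linalg.norm(v))
            elements.append(A)
            return True
        return False

    for G in generators:
        try_add(G)
    frontier = list(elements)
    while frontier:
        new = []
        for A in frontier:
            for B in list(elements):
                if try_add(A @ B - B @ A):
                    new.append(elements[-1])
        frontier = new
    return len(elements)

for n in (2, 3, 4):
    print(n, dla_dimension(sn_equivariant_generators(n)))   # grows only polynomially with n
```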

In the framework of GQML one designs symmetry-respecting QNNs by using group-equivariant generators. This implies that the corresponding DLA is constrained and necessarily takes the form

$${\mathfrak{g}}=\mathop{\bigoplus}\limits_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {{\mathfrak{g}}}_{\lambda },$$
(30)

where \({{\mathfrak{g}}}_{\lambda }\subseteq {\mathfrak{u}}({d}_{\lambda })\). For this scenario, we provide a notion of controllability restricted to each of the invariant subspaces: We say that a QNN is subspace-controllable in the isotypic component λ if \({{\mathfrak{g}}}_{\lambda }\) is \({\mathfrak{su}}({d}_{\lambda })\) or \({\mathfrak{u}}({d}_{\lambda })\). This means that the QNN can map between any pair of states in every \({{{{\mathcal{H}}}}}_{\lambda }^{\nu }\). Notably, the following result follows from refs. 92,139.

Lemma 3

(Subspace controllability). The set of Sn-equivariant generators in Eq. (9) is subspace-controllable in every λ.

As shown below, this result will be crucial for the proof of Theorem 1.

Proof of absence of barren plateaus

Here we sketch our proof of Theorem 1. Our goal is to calculate \({{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]={{\mathbb{E}}}_{{{{\boldsymbol{\theta }}}}}[{({\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}}))}^{2}]-{{\mathbb{E}}}_{{{{\boldsymbol{\theta }}}}}{[{\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]}^{2}\). In general, we will have to deal with integrals of the form \({\int}_{{{{{\mathcal{D}}}}}_{{{{\boldsymbol{\theta }}}}}}f(U({{{\boldsymbol{\theta }}}}))\), where f is some parametrized function (for example, the loss function or its partial derivatives) and \({{{{\mathcal{D}}}}}_{{{{\boldsymbol{\theta }}}}}:{[0,2\pi ]}^{M}\to [0,1]\) is some distribution over parameter space, typically the uniform distribution. The first step is to transform the integration over parameter space into an integration over the resulting QNN unitary distribution \({{{\mathcal{D}}}}\). Since \({{{\mathcal{D}}}}\) is known to converge (given enough depth) to an ϵ-approximate 2-design over the Lie group \({{{{\rm{e}}}}}^{{\mathfrak{g}}}\)43,140, and assuming f is a polynomial of degree at most 2 in the entries of U (as is the case of interest), we can replace the integration over \({{{\mathcal{D}}}}\) with an integration over the Haar measure on \({e}^{{\mathfrak{g}}}\). In general, \({\mathfrak{g}}\) is a reductive Lie algebra consisting of multiple orthogonal ideals \({\mathfrak{g}}{ = \bigoplus }_{\lambda }{{\mathfrak{g}}}_{\lambda }\), where each \({{\mathfrak{g}}}_{\lambda }\) is either simple or abelian, and the Lie group \({{{{\rm{e}}}}}^{{\mathfrak{g}}}\) is the product group \({\times }_{\lambda }{{{{\rm{e}}}}}^{{{\mathfrak{g}}}_{\lambda }}\). It can be shown (see Supplementary Methods 4) that the Haar measure over such a product group is the product of the Haar measures over the normal subgroups \({{{{\rm{e}}}}}^{{{\mathfrak{g}}}_{\lambda }}\). Finally, the ansatz with generators in Eq. (9) has a DLA \({\mathfrak{g}}\) that is subspace controllable, meaning that each \({{\mathfrak{g}}}_{\lambda }\) is either \({\mathfrak{su}}({d}_{\lambda })\) or \({\mathfrak{u}}({d}_{\lambda })\)92,139. Summarizing, we have

$$\begin{array}{lll}{\int}_{{{{{\mathcal{D}}}}}_{{{{\boldsymbol{\theta }}}}}}d{{{\boldsymbol{\theta }}}}f(U({{{\boldsymbol{\theta }}}}))\,=\,{\int}_{{{{\mathcal{D}}}}}dUf(U)\\ \qquad\qquad\qquad\,\to \,{\int}_{{e}^{{\mathfrak{g}}}}d\mu (U)f(U)\\ \qquad\qquad\qquad\,=\,\mathop{\prod}\limits_{\lambda }{\int}_{{\mathbb{U}}({d}_{\lambda })}d{\mu }_{\lambda }({U}_{\lambda })f(\{{U}_{\lambda }\}).\end{array}$$
(31)

The main advantage of Eq. (31) is that we can use tools from Weingarten calculus to perform symbolic integration over the Haar measure of unitary groups141. Explicitly, we are interested in the variance of \({\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{\partial }_{\mu }{\ell }_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})\), where

$${\partial }_{\mu }{\ell }_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})={{{\rm{i}}}}{{{\rm{Tr}}}}[{U}_{B}{\rho }_{i}{U}_{B}^{{\dagger} }[{H}_{\mu },{U}_{A}^{{\dagger} }O{U}_{A}]],$$

where UB and UA denote the unitary circuits before and after the parametrized gate we are differentiating. Assuming that the depth L of the QNN is large enough to guarantee that both UA and UB form independent 2-designs on \({{{{\rm{e}}}}}^{{\mathfrak{g}}}\), we can use Weingarten calculus to evaluate the terms in \({{\mathbb{E}}}_{{{{\boldsymbol{\theta }}}}}[{({\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}}))}^{2}]-{{\mathbb{E}}}_{{{{\boldsymbol{\theta }}}}}{[{\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]}^{2}\), and obtain Eq. (16) in Theorem 1. The details of this calculation are presented in Supplementary Methods 4.

While the above, along with the results in Theorem 2, allows us to prove by direct construction that Sn-equivariant QNNs do not lead to barren plateaus, here we provide further intuition for this result in terms of the expressibility reduction induced by the equivariance inductive biases. As shown in ref. 37, QNNs that are too expressible exhibit exponentially vanishing gradients, whereas models whose expressibility is restricted can exhibit large gradients. Hence, we can expect the result in Corollary 1 to be a direct consequence of the reduced expressibility of the model. We can further formalize this statement using the results of ref. 43. Therein, it was found that there exists a link between the presence or absence of barren plateaus and the dimension of the DLA. In particular, the authors conjecture, and prove for several examples (see also ref. 142 for an independent verification of the conjecture), that deep QNNs have gradients that scale inversely with the size of the DLA, that is, \({{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[\partial \hat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})] \sim \frac{1}{{{{\rm{poly}}}}(\dim ({\mathfrak{g}}))}\). For the case of Sn-equivariant QNNs we know from Lemmas 2 and 3 that \(\dim ({\mathfrak{g}})\in \Theta ({n}^{3})\), thus indicating that the variance should only vanish polynomially with n (for an appropriate dataset). We note that this conjecture was recently proven140,143.
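
This scaling can also be probed numerically. The following sketch is entirely ours: the generators are a simple Sn-equivariant choice that may differ from Eq. (9), and the input state and measurement are just one convenient option. It estimates \({{{{\rm{Var}}}}}_{{{{\boldsymbol{\theta }}}}}[{\partial }_{\mu }\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})]\) by sampling random parameters and taking a finite-difference derivative; for a symmetric-subspace input one expects at most a polynomial decay with n (cf. Theorem 1 and Table 1).

```python
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(ops, n):
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, ops.get(i, I2))
    return out

def sn_equivariant_generators(n):
    """Twirled one- and two-body generators (a simple S_n-equivariant choice)."""
    return [sum(embed({i: X}, n) for i in range(n)) / n,
            sum(embed({i: Z}, n) for i in range(n)) / n,
            sum(embed({i: Z, j: Z}, n) for i in range(n) for j in range(i + 1, n)) / n]

def loss(theta, gens, rho, O, layers):
    """Tr[U(theta) rho U(theta)^dagger O] for a layered product of equivariant rotations."""
    U = np.eye(rho.shape[0], dtype=complex)
    t = iter(theta)
    for _ in range(layers):
        for G in gens:
            U = expm(-1j * next(t) * G) @ U
    return float(np.real(np.trace(U @ rho @ U.conj().T @ O)))

def grad_variance(n, layers=10, samples=100, eps=1e-4, seed=0):
    """Sample-based estimate of Var_theta[d loss / d theta_1] at uniformly random parameters."""
    rng = np.random.default_rng(seed)
    gens = sn_equivariant_generators(n)
    rho = np.zeros((2 ** n, 2 ** n), dtype=complex)
    rho[0, 0] = 1.0                                       # |0...0><0...0|, a symmetric-subspace state
    O = sum(embed({i: Z}, n) for i in range(n)) / n       # equivariant measurement
    grads = []
    for _ in range(samples):
        theta = rng.uniform(0, 2 * np.pi, size=layers * len(gens))
        shift = np.zeros_like(theta)
        shift[0] = eps
        grads.append((loss(theta + shift, gens, rho, O, layers)
                      - loss(theta - shift, gens, rho, O, layers)) / (2 * eps))
    return np.var(grads)

for n in range(2, 7):
    print(n, grad_variance(n))
```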

Intuition behind the overparametrization phenomenon

Recently, ref. 32 studied the overparametrization of QNNs from the perspective of a complexity phase transition in the loss landscape. In the underparametrized regime, one encounters rough loss landscapes, which in turn can be traced back to a lack of control in parametrized state space. When the number of parameters is below the number of directions in state space, parameter updates can only access a subset of those potential directions. This constraint can be shown to introduce false local minima, that is, local minima that are not actual minima of the loss function (as a function of state space) but instead artifacts of a poor parametrization. Conversely, upon introduction of more parameters the parametrized state starts accessing these previously unavailable directions, and false minima disappear as we transition into the overparametrized regime. Because in the overparametrized regime the number of parameters is greater than the number of accessible directions, solutions in the control landscape are degenerate and form multidimensional submanifolds, allowing the optimizer to reach them more easily125,126.

The main contribution in ref. 32 is the realization that, under standard assumptions, one needs one parameter per potentially accessible direction in state space, and that the latter can be formalized as the dimension of the orbit of the initial state under the Lie group \({e}^{{\mathfrak{g}}}\) resulting from the exponential of the DLA \({\mathfrak{g}}\). In particular, this means that exponential DLA architectures require an exponential number of parameters to be overparametrized, whereas polynomial DLA architectures only need a polynomial number of them.

With these definitions, the proof of Theorem 3 is immediate. Since the ansatz is subspace controllable (Lemma 3), the dimension of the DLA is equal to the dimension of the commutant, which is Θ(n3) (Lemma 2).

To finish, we note that the definition of overparametrization employed here (in terms of saturating the number of available directions) might differ from some definitions of overparametrization in the classical neural network community. Namely, in classical machine learning some researchers have studied overparametrization through the lens of generalization109,144,145,146,147, while others have investigated its effect on the training process. In particular, it has been proposed that the onset of overparametrization can be detected using metrics such as parameter redundancy, which is captured by the rank of the classical Fisher information matrix148,149,150. It is precisely this notion of overparametrization that ref. 32 ported to the quantum setting, and the one used in the present work.
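
As a sketch of this diagnostic (our own minimal implementation in the spirit of ref. 32, reusing the simple Sn-equivariant generators from the sketches above and the quantum Fisher information matrix of the output state), one can watch the rank of the Fisher matrix saturate as layers are added: the saturation value estimates the number of accessible directions, and the number of parameters at which it is reached marks the onset of overparametrization.

```python
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(ops, n):
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, ops.get(i, I2))
    return out

def sn_equivariant_generators(n):
    return [sum(embed({i: X}, n) for i in range(n)) / n,
            sum(embed({i: Z}, n) for i in range(n)) / n,
            sum(embed({i: Z, j: Z}, n) for i in range(n) for j in range(i + 1, n)) / n]

def qfim_rank(n, layers, tol=1e-8, seed=1):
    """Rank of the quantum Fisher information matrix of U(theta)|0...0> at random theta."""
    rng = np.random.default_rng(seed)
    gate_gens = sn_equivariant_generators(n) * layers        # generator of each gate, in circuit order
    theta = rng.uniform(0, 2 * np.pi, size=len(gate_gens))
    gates = [expm(-1j * t * G) for t, G in zip(theta, gate_gens)]

    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    states_before = []                                       # state right before each gate
    for V in gates:
        states_before.append(psi)
        psi = V @ psi                                        # psi ends up as the output state

    derivs = []
    for k, G in enumerate(gate_gens):                        # exact d|psi>/d theta_k
        d = -1j * G @ (gates[k] @ states_before[k])
        for V in gates[k + 1:]:
            d = V @ d
        derivs.append(d)

    F = np.zeros((len(theta), len(theta)))
    for j in range(len(theta)):
        for k in range(len(theta)):
            F[j, k] = 4 * np.real(np.vdot(derivs[j], derivs[k])
                                  - np.vdot(derivs[j], psi) * np.vdot(psi, derivs[k]))
    return np.linalg.matrix_rank(F, tol=tol)

for layers in (1, 2, 4, 8, 16):
    print(layers, 3 * layers, qfim_rank(n=4, layers=layers))   # (layers, #parameters, Fisher rank)
```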

Generalization

We consider the QML setting of this paper, where the empirical loss function is of the form \(\widehat{{{{\mathcal{L}}}}}({{{\boldsymbol{\theta }}}})=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{{{\rm{Tr}}}}[{U}_{{{{\boldsymbol{\theta }}}}}({\rho }_{i})O]\). We assume that the operator norm of O is bounded by a constant and that \({c}_{i}\leqslant 1/M\). We follow closely the covering number-based generalization bound in ref. 61. First recall that a set V is ε-covered by a subset \(K\subseteq V\) with respect to a distance metric d if for every \(x\in V\) there exists \(y\in K\) such that \(d(x,y)\leqslant \varepsilon\). The ε-covering number (w.r.t. metric d) of V, denoted as \({{{\mathcal{N}}}}(V,d,\varepsilon )\), is the cardinality of the smallest such subset117. The following theorem bounds the ε-covering number of Sn-equivariant QNNs.

Theorem 5

The ε-covering number of the set \({{{{\mathcal{V}}}}}_{n}\) of n-qubit unitary Sn-equivariant QNNs w.r.t. the operator norm can be bounded as \({{{\mathcal{N}}}}({{{{\mathcal{V}}}}}_{n},\parallel \cdot \parallel ,\varepsilon )\leqslant {\left(\frac{6}{\varepsilon }\right)}^{2{{{{\rm{Te}}}}}_{n+1}}\).

Proof

Recall that an Sn-EQNN U can be block-diagonalized as \(U{\cong \bigoplus }_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {U}_{\lambda }\), where each Uλ must itself be unitary for U to be unitary. Let \({\mathbb{U}}({d}_{\lambda })\) denote the set of all unitaries of dimension dλ. Following Lemma 6 in ref. 61 and Section 4.2 in ref. 151 we can bound the ε-covering number of \({\mathbb{U}}({d}_{\lambda })\) as follows

$${{{\mathcal{N}}}}({\mathbb{U}}({d}_{\lambda }),\parallel \cdot \parallel ,\varepsilon )\leqslant {\left(\frac{6}{\varepsilon }\right)}^{2{d}_{\lambda }^{2}}.$$
(32)

Next, we construct an ε-covering subset of the Sn-equivariant unitary set, \({{{{\mathcal{V}}}}}_{n}\), from the ε-covering subsets, Kλ, of the blocks λ. Indeed, given any \(U\,{\cong \bigoplus }_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {U}_{\lambda }\), we can identify unitaries \({\tilde{U}}_{\lambda }\) from Kλ such that \(\parallel {U}_{\lambda }-{\tilde{U}}_{\lambda }\parallel \leqslant \varepsilon ,\forall \lambda\). The unitary \(\tilde{U}\,{\cong \bigoplus }_{\lambda }{{\mathbb{1}}}_{{m}_{\lambda }}\otimes {\tilde{U}}_{\lambda }\) then satisfies

$$\parallel U-\tilde{U}\parallel \leqslant \mathop{\max }\limits_{\lambda }\parallel {U}_{\lambda }-{\tilde{U}}_{\lambda }\parallel \leqslant \varepsilon .$$
(33)

Therefore, there exists an ε-covering net of \({{{{\mathcal{V}}}}}_{n}\) of size \({\prod }_{\lambda }{\left(\frac{6}{\varepsilon }\right)}^{2{d}_{\lambda }^{2}}={\left(\frac{6}{\varepsilon }\right)}^{2{{{{\rm{Te}}}}}_{n+1}}\), concluding the proof. □
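
As a rough illustration of the scaling, consider n = 3: the two-row partitions λ = (3, 0) and λ = (2, 1) carry dλ = 4 and dλ = 2, so \({\sum }_{\lambda }{d}_{\lambda }^{2}=20={{{{\rm{Te}}}}}_{4}\) and the bound reads \({{{\mathcal{N}}}}({{{{\mathcal{V}}}}}_{3},\parallel \cdot \parallel ,\varepsilon )\leqslant {(6/\varepsilon )}^{40}\). For comparison, applying Eq. (32) to the full unitary group \({\mathbb{U}}({2}^{3})\) yields the much larger exponent \(2\cdot {4}^{3}=128\).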

Having established this bound on the ε-covering numbers of Sn-EQNN, we apply a known result from ref. 61 (with some extra care) to obtain Theorem 4.

Proof

(Proof of Theorem 4). We assume knowledge of Theorem 6 in ref. 61. In step two of the proof where the authors use the chaining argument152 to bound the generalization error, notice that the covering number \({{{{\mathcal{N}}}}}_{j}\) in their Eq. (64) is replaced by \({\left(\frac{6}{\varepsilon }\right)}^{2{{{{\rm{Te}}}}}_{n+1}}\) in our case. In other words, there is no architecture-dependence (the number of gates T in their case) inside the logarithm in the resulting Eq. (65). Applying this change to the rest of their proof leads to our claimed generalization bound. □

We note that in the previous derivation, we have used knowledge of the isotypic decomposition of the Sn-equivariant QNN, which allows us to obtain a specialized generalization error bound that does not follow from a direct application of the results in ref. 61.

Trainable and untrainable states

Here, we describe how the states in Table 1 are obtained. The “symmetric states” are obtained from the symmetric subspace153, i.e., the set of states \(\{\left\vert \psi \right\rangle \in {{{\mathcal{H}}}}\,| \,R(\pi )\left\vert \psi \right\rangle =\left\vert \psi \right\rangle ,\,\forall \pi \in {S}_{n}\}\). The so-called “fixed Hamming-weight encoded” states correspond to states representing classical data: Given an array of real values {xi} such that \({\sum }_{i}{x}_{i}^{2}=1\), each xi is encoded as the amplitude of a unique bitstring z of Hamming weight k, where k is some fixed constant. That is, we prepare the state \(\left\vert {{{\bf{x}}}}\right\rangle ={\sum }_{{{{\bf{z}}}}\,{{\mbox{s.t.}}}\,w({{{\bf{z}}}}) = k}{x}_{{{{\bf{z}}}}}\left\vert {{{\bf{z}}}}\right\rangle\), where we are now indexing xi with a bitstring z. “Local Haar random” states are obtained by preparing the state \({\left\vert 0\right\rangle }^{\otimes n}\) and applying a Haar random single-qubit unitary to each qubit. “Global Haar random” states are obtained by preparing the state \({\left\vert 0\right\rangle }^{\otimes n}\) and applying a random n-qubit unitary sampled from the Haar measure over \({\mathbb{U}}(d)\). The “fixed and linear depth random circuit” states are obtained by preparing the state \({\left\vert 0\right\rangle }^{\otimes n}\) and applying, respectively, a constant-depth or a linear-depth layered hardware-efficient quantum circuit34,130 with random parameters. For the “graph states”, we use a canonical encoding to embed a graph into a quantum state121,122. Specifically, to create a graph state, one starts with the state \({\left\vert +\right\rangle }^{\otimes n}\) and applies a controlled-Z gate for every edge in the graph. We consider 3-regular and n/2-regular graphs, as well as random graphs generated according to the Erdős-Rényi model120.
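
For illustration, the following sketch (ours; function names and the bit-ordering convention are arbitrary choices) prepares a fixed Hamming-weight encoded state and a graph state as dense state vectors.

```python
import numpy as np
from itertools import combinations

def hamming_weight_encoding(x, n, k):
    """Encode the normalized real vector x into the amplitudes of the Hamming-weight-k bitstrings of n qubits."""
    strings = [sum(1 << (n - 1 - q) for q in subset) for subset in combinations(range(n), k)]
    assert len(x) == len(strings) and np.isclose(np.linalg.norm(x), 1.0)
    psi = np.zeros(2 ** n, dtype=complex)
    psi[strings] = x
    return psi

def graph_state(n, edges):
    """Start from |+>^n and apply a controlled-Z gate for every edge of the graph."""
    psi = np.full(2 ** n, 1 / np.sqrt(2 ** n), dtype=complex)
    for (a, b) in edges:
        for idx in range(2 ** n):
            if (idx >> (n - 1 - a)) & 1 and (idx >> (n - 1 - b)) & 1:
                psi[idx] *= -1            # CZ flips the sign when both endpoint qubits are |1>
    return psi

x = np.ones(6) / np.sqrt(6)                          # 6 = C(4, 2) bitstrings of weight 2 on 4 qubits
psi_hw = hamming_weight_encoding(x, n=4, k=2)
psi_ring = graph_state(4, [(i, (i + 1) % 4) for i in range(4)])   # graph state of a 4-cycle
print(np.isclose(np.linalg.norm(psi_hw), 1.0), np.isclose(np.linalg.norm(psi_ring), 1.0))
```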