The advent of quantum computing devices opens up possibilities for exploiting quantum machine learning (QML)1,2,3 to improve the efficiency of classical machine learning algorithms in many scientific domains like drug discovery4 and efficient solar conversion5. Although the exploitation of quantum computing devices to carry out QML is still in its early exploratory stages, the rapid development of quantum hardware has motivated advances in quantum neural networks (QNNs) that run on noisy intermediate-scale quantum (NISQ) devices6,7,8,9, where not enough qubits can be spared for quantum error correction and the imperfect qubits have to be employed directly at the physical layer10,11,12. As a compromise, a QNN has been proposed that employs a quantum-classical hybrid model relying on the optimization of a variational quantum circuit (VQC)13,14. The resilience of the VQC to certain types of quantum noise errors, together with its high flexibility concerning coherence time and gate requirements, makes the VQC suitable for many promising applications on NISQ devices15,16,17,18,19,20,21,22.

Although many empirical studies of VQC for quantum machine learning have been reported, its theoretical understanding requires further investigation in terms of representation and generalization powers, particularly when a non-linear operator is employed for dimensionality reduction. This work introduces a tensor-train network (TTN) on top of the VQC model to implement a TTN-VQC. The TTN is a non-linear operator that maps high-dimensional features into low-dimensional ones; the resulting low-dimensional features then go through the VQC. Compared with a hybrid model in which dimensionality reduction is performed by a classical neural network (NN)23, the TTN can be realized by utilizing universal quantum circuits18,24,25, so an end-to-end quantum neural network can be set up.

In this work, we discuss the theoretical performance of TTN-VQC in the context of functional regression. Functional regression refers to building a vector-to-vector operator such that the regression output approximates a target operator. In more detail, given a Q-dimensional input vector space \({{\mathbb{R}}}^{Q}\) and a measurable U-dimensional output vector space \({{\mathbb{R}}}^{U}\), TTN-VQC-based vector-to-vector regression aims to find a TTN-VQC operator \(f:{{\mathbb{R}}}^{Q}\to {{\mathbb{R}}}^{U}\) such that the outputs of f approximate those of a desired target operator.

In particular, this work concentrates on the error performance analysis for TTN-VQC-based functional regression by leveraging the error decomposition technique26 to factorize an expected loss over the TTN-VQC operator into the sum of the approximation error, estimation error, and training error. We separately upper bound each error component by harnessing statistical machine learning theory. More specifically, we define \({{\mathbb{F}}}_{{{{\rm{TV}}}}}\) as the TTN-VQC hypothesis space, which represents a collection of TTN-VQC operators. Then, given a data distribution \({{{\mathcal{D}}}}\), assuming a smooth target function \({h}_{{{{\mathcal{D}}}}}^{* }\) and a set of N training data drawn independently and identically distributed (i.i.d.) from \({{{\mathcal{D}}}}\), for a loss function ℓ and an optimal TTN-VQC operator \({f}_{{{{\mathcal{D}}}}}^{* }\in {{\mathbb{F}}}_{{{{\rm{TV}}}}}\), an expected loss is defined as:

$${{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* }):={{\mathbb{E}}}_{{{{\bf{x}}}} \sim {{{\mathcal{D}}}}}\left[\ell ({h}_{{{{\mathcal{D}}}}}^{* }({{{\bf{x}}}}),{f}_{{{{\mathcal{D}}}}}^{* }({{{\bf{x}}}}))\right],$$

which is minimized in practice through the empirical loss:

$${{{{\mathcal{L}}}}}_{S}({f}_{{{{\mathcal{D}}}}}^{* }):=\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\ell ({h}_{{{{\mathcal{D}}}}}^{* }({{{{\bf{x}}}}}_{n}),{f}_{{{{\mathcal{D}}}}}^{* }({{{{\bf{x}}}}}_{n})).$$

Since the mean absolute error (MAE)27 is 1-Lipschitz continuous28, the loss function is set as the MAE. Furthermore, we define \({f}_{{{{\mathcal{D}}}}}^{* }\), \({f}_{S}^{* }\) and \({\bar{f}}_{S}\) as the optimal TTN-VQC operator, the empirically optimal operator, and the operator returned by the training algorithm, respectively. Then, as shown in Fig. 1, the error decomposition technique26 factorizes the expected loss \({{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({\bar{f}}_{S})\) into three error components as:

$$\begin{array}{rcl}{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({\bar{f}}_{S})&=&\underbrace{{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })}_{\begin{array}{c}Approximation\;Error\end{array}}+\underbrace{{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{S}^{* })-{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })}_{\begin{array}{c}Estimation\;Error\end{array}}+\underbrace{{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({\bar{f}}_{S})-{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{S}^{* })}_{\begin{array}{c}Training\;Error\end{array}}\\ &\le &{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })+2\mathop{\sup }\limits_{f\in {{\mathbb{F}}}_{{{{\rm{TV}}}}}}| {{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}(f)-{{{{\mathcal{L}}}}}_{S}(f)| +{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({\bar{f}}_{S})-{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{S}^{* })\\ &\le &{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })+2{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})+\nu ,\end{array}$$

where \({{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })\) is associated with the approximation error, \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})\) is the empirical Rademacher complexity29 over the family \({{\mathbb{F}}}_{{{{\rm{TV}}}}}\), and ν refers to the training error that results from the optimization bias of gradient-based algorithms. The Rademacher complexity \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})\) measures the model complexity and is particularly suitable for regression problems26. In this work, our theoretical results concentrate on upper-bounding each error component, and our empirical results corroborate the theoretical analysis.
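As a concrete illustration of the empirical loss above, the following minimal Python sketch (not from the paper; the target function, hypothesis, and samples are hypothetical scalar stand-ins) computes \({{{{\mathcal{L}}}}}_{S}\) with the MAE:

```python
# A minimal sketch of the empirical MAE loss L_S(f): the sample mean of
# |h*(x_n) - f(x_n)| over N training points, here for scalar outputs.

def empirical_mae_loss(h_star, f, samples):
    """Empirical loss L_S(f) = (1/N) * sum_n |h*(x_n) - f(x_n)|."""
    n = len(samples)
    return sum(abs(h_star(x) - f(x)) for x in samples) / n

# Toy target and candidate operator; both are illustrative stand-ins.
h_star = lambda x: 0.5 * x          # smooth target h*_D
f_hat = lambda x: 0.4 * x + 0.1     # candidate hypothesis
samples = [0.0, 0.5, 1.0, 1.5, 2.0]

loss = empirical_mae_loss(h_star, f_hat, samples)
```

By the law of large numbers, this sample mean converges to the expected loss \({{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}\) as N grows, which is what makes the empirical minimization a proxy for the expected one.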

Fig. 1: An illustration of error decomposition technique.
figure 1

\({h}_{{{{\mathcal{D}}}}}^{* }\) is a smooth target function in a family of all functions \({{{{\mathcal{Y}}}}}^{{{{\mathcal{X}}}}}\) over a data distribution \({{{\mathcal{D}}}}\); \({{\mathbb{F}}}_{{{{\rm{TV}}}}}\) denotes the family of TTN-VQC operators as shown in the dashed square; \({f}_{{{{\mathcal{D}}}}}^{* }\) represents the optimal hypothesis from the space of TTN-VQC operators over the distribution \({{{\mathcal{D}}}}\); \({f}_{S}^{* }\) denotes the best empirical hypothesis over the set of training samples S; \({\bar{f}}_{S}\) is the returned hypothesis based on the training dataset S.

Our derived theoretical results in this work and the significance of TTN-VQC-based functional regression are summarized as follows:

  • Representation power: our upper bound on the approximation error is derived as \(\frac{\Theta (1)}{\sqrt{U}}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right)\), where U and M separately denote the number of qubits and the number of quantum measurements. The result suggests that the expressive capability of TTN-VQC is mainly determined by the number of qubits, and the quality of the expressiveness is also affected by the number of quantum measurements. Larger U and M imply that more algorithmic qubits and a longer coherence time are required to ensure stronger representation power of TTN-VQC. Furthermore, since more qubits are more likely to result in the barren plateau problem during VQC training, introducing the Polyak-Lojasiewicz (PL) condition is significant for handling this problem.

  • Generalization power: we derive an upper bound on the estimation error in terms of the empirical Rademacher complexity \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})\), which is further upper bounded by the constant \(\frac{2P}{\sqrt{N}}(\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{\Lambda }_{k}^{2}}+{\Lambda }^{{\prime} })\). Here, P, N, and K separately denote the input power, the amount of training data, and the order of the multi-dimensional tensor; Λk and Λ′ refer to the upper bounds on the Frobenius norms of the TTN and VQC parameters, respectively. The result suggests that, given the training data and model structure, additive noise corresponds to a larger value of P, which enlarges the upper bound and thus implies a weaker generalization capability.

  • Optimization bias: the PL condition is employed to initialize the TTN-VQC parameters so that the training error converges exponentially to a small loss value. The barren plateau problem is a serious issue in the training of quantum neural networks30: for a randomly initialized QNN architecture in particular, the variance of the gradients vanishes exponentially as the number of qubits increases. In this work, we claim that a model setting based on the PL condition can be beneficial to TTN-VQC training.

In addition, our empirical results on functional regression are designed to corroborate the corresponding theoretical results on representation and generalization powers, as well as the analysis of the optimization performance.

The related work comprises theoretical and technical aspects. On the theoretical side, Du et al.31 analyze the learnability of quantum neural networks with parameterized quantum circuits and a gradient-based classical optimizer. A theoretical comparison between this work and Du et al.31 is shown in Table 1; our theoretical results mainly follow the error decomposition method26,32. More specifically, in this work, we factorize an expected loss based on the MAE over a TTN-VQC operator into three error components: approximation error, estimation error, and training error. We separately derive upper bounds on each error component, and the results are summarized in Table 1.

Table 1 A comparison of learning theory for VQC between this work and Du et al.31.

Besides, the techniques of this work rely on the TTN and VQC models. The TTN, also known as the matrix product state (MPS)33, was first put forth by Alexander et al.34 in machine learning applications. Chen et al.25 employ the MPS to extract low-dimensional features for the VQC. Although this work likewise leverages the TTN for dimensionality reduction, we rebuild the TTN as a parallel neural network architecture, where a sigmoid activation function is separately imposed upon each neural network. We choose the TTN for dimensionality reduction because, although the hybrid quantum-classical model may require more resources when simulated on a classical computer, it can be implemented on actual quantum hardware, whereas purely classical models cannot. Moreover, since VQC models have been widely used in quantum machine learning35,36,37, we follow the standard VQC pipeline such that our theoretical results apply to the general VQC model.



Before we delve into the detailed architecture of the TTN-VQC, we first introduce the basic components of TTN and VQC, which have been previously proposed and widely used in quantum machine learning.

As shown in Fig. 2, we first introduce a VQC which is composed of three components: (a) Tensor Product Encoding (TPE); (b) Parametric Quantum Circuit (PQC); (c) Measurement.

Fig. 2: An illustration of three components in the VQC model.
figure 2

The TPE employs a series of \({R}_{Y}(\frac{\pi }{2}{x}_{i})\) gates to transform classical data into quantum states. The PQC is composed of CNOT gates and single-qubit rotation gates RX, RY, RZ with free model parameters α, β, and γ. The CNOT gates impose quantum entanglement among qubits, and the gates RX, RY, and RZ are adjustable during the training stage. The PQC model in the green dashed square is repeatedly replicated to build a deeper model. The measurement converts the quantum states \(\vert {z}_{1}\rangle , \vert z_{2}\rangle ,...,\vert {z}_{U}\rangle\) into the corresponding expectation values \(\langle {\sigma }_{z}^{(1)}\rangle ,\langle {\sigma }_{z}^{(2)}\rangle ,...,\langle {\sigma }_{z}^{(U)}\rangle\). The outputs \(\langle {\sigma }_{z}^{(1)}\rangle\), \(\langle {\sigma }_{z}^{(2)}\rangle\), ..., \(\langle {\sigma }_{z}^{(U)}\rangle\) are connected to a loss function, and gradient descent algorithms can be used to update the VQC parameters.

The TPE model was proposed in ref. 38 and aims at converting a classical vector x into a quantum state \(\left\vert {{{\bf{x}}}}\right\rangle\) by adopting a one-to-one mapping:

$$\left\vert {{{\bf{x}}}}\right\rangle =\left({\otimes }_{i = 1}^{U}{R}_{Y}\left(\frac{\pi }{2}{x}_{i}\right)\right){\left\vert 0\right\rangle }^{\otimes U}=\left[\begin{array}{c}\cos \left(\frac{\pi }{2}{x}_{1}\right)\\ \sin \left(\frac{\pi }{2}{x}_{1}\right)\end{array}\right]\otimes \left[\begin{array}{c}\cos \left(\frac{\pi }{2}{x}_{2}\right)\\ \sin \left(\frac{\pi }{2}{x}_{2}\right)\end{array}\right]\otimes \cdots \otimes \left[\begin{array}{c}\cos \left(\frac{\pi }{2}{x}_{U}\right)\\ \sin \left(\frac{\pi }{2}{x}_{U}\right)\end{array}\right],$$

where each xi is strictly restricted to the domain [0, 1] such that the conversion between x and \(\left\vert {{{\bf{x}}}}\right\rangle\) is a reversible one-to-one mapping.
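The tensor product in the equation above can be sketched in a few lines of Python (an illustrative sketch, not the paper's code): each feature contributes a 2-amplitude factor \([\cos (\frac{\pi }{2}{x}_{i}),\sin (\frac{\pi }{2}{x}_{i})]\), and the full state is their Kronecker product.

```python
import math

def ry_encode(x):
    """Amplitudes [cos(pi/2 * x), sin(pi/2 * x)] of the single-qubit factor."""
    return [math.cos(math.pi / 2 * x), math.sin(math.pi / 2 * x)]

def kron(a, b):
    """Kronecker product of two amplitude vectors."""
    return [ai * bj for ai in a for bj in b]

def tpe(x_vec):
    """Tensor Product Encoding |x> = (tensor_i R_Y(pi/2 x_i)) |0>^(tensor U)."""
    state = [1.0]
    for xi in x_vec:
        state = kron(state, ry_encode(xi))
    return state

state = tpe([0.2, 0.7])          # a 2-qubit state with 2^2 = 4 amplitudes
norm = sum(a * a for a in state)  # stays 1, since each factor is normalized
```

Because cos² + sin² = 1 for every factor, the encoded state is automatically normalized for any input in [0, 1].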

The PQC framework consists of U quantum channels corresponding to the U qubits currently accessible on NISQ devices. Here, the controlled-NOT (CNOT) gates realize quantum entanglement, and the single-qubit rotation gates RX, RY, and RZ compose the PQC model with free model parameters α = {α1, α2, . . . , αU}, β = {β1, β2, . . . , βU} and γ = {γ1, γ2, . . . , γU}. The PQC model corresponds to a linear operator \({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\) that transforms the quantum input state \(\left\vert {{{\bf{x}}}}\right\rangle\) into the output state \(\left\vert {{{\bf{z}}}}\right\rangle\). The PQC model in the green dashed square is repeatedly replicated to compose a deeper architecture.

The measurement framework outputs the expectation values of the Pauli-Z operators, namely \(\langle {\sigma }_{z}^{(1)}\rangle\), \(\langle {\sigma }_{z}^{(2)}\rangle\), ..., \(\langle {\sigma }_{z}^{(U)}\rangle\), which result in the output vector \({{{\bf{z}}}}={[\langle {\sigma }_{z}^{(1)}\rangle ,\langle {\sigma }_{z}^{(2)}\rangle ,...,\langle {\sigma }_{z}^{(U)}\rangle ]}^{T}\). The expectation vector z is classical data and is fed into the operation of functional regression.
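The whole TPE → PQC → measurement pipeline can be simulated classically for a toy 2-qubit case. The sketch below is an assumption-laden simplification, not the paper's implementation: the PQC layer uses only RY rotations (the paper uses RX, RY, RZ) so that all amplitudes stay real, and the rotation convention matches the paper's TPE, i.e. ry(φ)|0⟩ = [cos φ, sin φ]ᵀ.

```python
import numpy as np

def ry(phi):
    # Rotation convention matching the TPE: ry(phi) @ [1, 0] = [cos(phi), sin(phi)].
    return np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])

# CNOT (qubit 1 controls qubit 2) and the single-qubit Pauli-Z observable.
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)

def vqc_forward(x, betas):
    """TPE encoding of x, one CNOT entangler, per-qubit RY(beta) rotations,
    then Pauli-Z expectations <sigma_z^(i)> for a 2-qubit circuit."""
    state = np.kron(ry(np.pi / 2 * x[0]) @ [1.0, 0.0],
                    ry(np.pi / 2 * x[1]) @ [1.0, 0.0])   # TPE
    state = CNOT @ state                                 # entanglement
    state = np.kron(ry(betas[0]), ry(betas[1])) @ state  # trainable layer
    z0 = state @ np.kron(Z, I2) @ state                  # <sigma_z> on qubit 1
    z1 = state @ np.kron(I2, Z) @ state                  # <sigma_z> on qubit 2
    return [z0, z1]

z = vqc_forward([0.3, 0.8], [0.1, -0.2])
```

Since every gate is unitary and Pauli-Z has eigenvalues ±1, each output component is guaranteed to lie in [−1, 1], which is why z can be consumed as a bounded classical feature vector.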

Then, we briefly introduce the formulation of the TTN. A TTN refers to a tensor network aligned in a 1-dimensional array and is generated by repeatedly applying singular value decomposition (SVD)39 to a many-body wave function24. To utilize the TTN for dimensionality reduction, we first define the tensor-train decomposition (TTD) for a 1-dimensional vector and a tensor-train representation for a 2-dim matrix. More specifically, given a vector \({{{\bf{x}}}}\in {{\mathbb{R}}}^{D}\) where \(D=\mathop{\prod }\nolimits_{k = 1}^{K}{D}_{k}\), we reshape the vector x into a K-order tensor \({{{\mathcal{X}}}}\in {{\mathbb{R}}}^{{D}_{1}\times {D}_{2}\times \cdot \cdot \cdot \times {D}_{K}}\). Then, given a set of tensor-train ranks (TT-ranks) {R1, R2, . . . , RK+1} (R1 and RK+1 are set to 1), each element of \({{{\mathcal{X}}}}\) can be represented by multiplying K matrices \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) based on the TT-format as:

$${{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}={{{{\mathcal{X}}}}}_{{d}_{1}}^{[1]}{{{{\mathcal{X}}}}}_{{d}_{2}}^{[2]}\cdots {{{{\mathcal{X}}}}}_{{d}_{K}}^{[K]}=\mathop{\prod }\limits_{k=1}^{K}{{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]},$$

where the matrices \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\in {{\mathbb{R}}}^{{R}_{k}\times {R}_{k+1}}\), \({d}_{k}\in [{D}_{k}]\). The ranks R1 and RK+1 are set to 1 to ensure that the term \(\mathop{\prod }\nolimits_{k = 1}^{K}{{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) is a scalar.
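The TT-format above can be checked numerically. The sketch below (dimensions and ranks are illustrative choices, not from the paper) builds random TT-cores for a K = 3 tensor, realizes every element as the product of matrices \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\), and reshapes the result back into a vector of length D; in practice the cores would instead be obtained from x by repeated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = [2, 3, 2], [1, 2, 2, 1]   # dims D_k and TT-ranks R_1..R_{K+1} (R_1 = R_4 = 1)
cores = [rng.standard_normal((R[k], D[k], R[k + 1])) for k in range(3)]

# Realize the full K-order tensor: each element is the matrix product
# X^[1]_{d1} X^[2]_{d2} X^[3]_{d3}, a 1x1 matrix because R_1 = R_4 = 1.
full = np.empty(D)
for d1 in range(D[0]):
    for d2 in range(D[1]):
        for d3 in range(D[2]):
            m = cores[0][:, d1, :] @ cores[1][:, d2, :] @ cores[2][:, d3, :]
            full[d1, d2, d3] = m[0, 0]

x = full.reshape(-1)   # the corresponding vector in R^D with D = 2*3*2 = 12
```

The boundary ranks R1 = RK+1 = 1 are exactly what makes each chained product collapse to a 1×1 matrix, i.e. a scalar entry of the tensor.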

Next, we consider the TTD for a 2-dim matrix. A feed-forward neural network with U output neurons has the form:

$${{{\bf{y}}}}(u)=\mathop{\sum }\limits_{d=1}^{D}{{{\bf{W}}}}(d,u)\cdot {{{\bf{x}}}}(d),\forall u\in [U].$$

If we assume that \(U=\mathop{\prod }\nolimits_{k = 1}^{K}{U}_{k}\), then we can reshape the matrix W into a K-order double-indexed tensor \({{{\mathcal{W}}}}\), which can be factorized into the TT-format as:

$${{{{\mathcal{W}}}}}_{({d}_{1},{u}_{1}),({d}_{2},{u}_{2}),...,({d}_{K},{u}_{K})}={{{{\mathcal{W}}}}}_{{d}_{1},{u}_{1}}^{[1]}{{{{\mathcal{W}}}}}_{{d}_{2},{u}_{2}}^{[2]}\cdots {{{{\mathcal{W}}}}}_{{d}_{K},{u}_{K}}^{[K]},$$

where \({{{{\mathcal{W}}}}}^{[k]}\in {{\mathbb{R}}}^{{R}_{k}\times {D}_{k}\times {U}_{k}\times {R}_{k+1}}\) is a 4-order core tensor, and each element \({{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\in {{\mathbb{R}}}^{{R}_{k}\times {R}_{k+1}}\) is a matrix. Then, we can reshape the input vector x and the output one y into two tensors of the same order: \({{{\mathcal{X}}}}\in {{\mathbb{R}}}^{{D}_{1}\times {D}_{2}\times \cdots \times {D}_{K}}\), \({{{\mathcal{Y}}}}\in {{\mathbb{R}}}^{{U}_{1}\times {U}_{2}\times \cdots \times {U}_{K}}\), and we build the mapping function between the input tensor \({{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}\) and the output one \({{{{\mathcal{Y}}}}}_{{u}_{1},{u}_{2},...,{u}_{K}}\) as:

$${{{{\mathcal{Y}}}}}_{{u}_{1},{u}_{2},...,{u}_{K}}=\mathop{\sum }\limits_{{d}_{1}=1}^{{D}_{1}}\mathop{\sum }\limits_{{d}_{2}=1}^{{D}_{2}}\cdots \mathop{\sum }\limits_{{d}_{K}=1}^{{D}_{K}}{{{{\mathcal{W}}}}}_{({d}_{1},{u}_{1}),({d}_{2},{u}_{2}),...,({d}_{K},{u}_{K})}{{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}.$$

Then, by employing the TTD for the K-order tensor elements \({{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}\) and \({{{{\mathcal{W}}}}}_{({d}_{1},{u}_{1}),({d}_{2},{u}_{2}),...,({d}_{K},{u}_{K})}\) separately defined in Eqs. (5) and (7), we obtain

$$\begin{array}{rcl}{{{{\mathcal{Y}}}}}_{{u}_{1},{u}_{2},...,{u}_{K}}&=&\mathop{\sum }\limits_{{d}_{1}=1}^{{D}_{1}}\mathop{\sum }\limits_{{d}_{2}=1}^{{D}_{2}}\cdots \mathop{\sum }\limits_{{d}_{K}=1}^{{D}_{K}}{{{{\mathcal{W}}}}}_{({d}_{1},{u}_{1}),({d}_{2},{u}_{2}),...,({d}_{K},{u}_{K})}{{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}\\ &=&\mathop{\sum }\limits_{{d}_{1}=1}^{{D}_{1}}\mathop{\sum }\limits_{{d}_{2}=1}^{{D}_{2}}\cdots \mathop{\sum }\limits_{{d}_{K}=1}^{{D}_{K}}\mathop{\prod }\limits_{k=1}^{K}{{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\odot \mathop{\prod }\limits_{k=1}^{K}{{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\\ &=&\mathop{\prod }\limits_{k=1}^{K}\mathop{\sum }\limits_{{d}_{k}=1}^{{D}_{k}}{{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\odot {{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\\ &=&\mathop{\prod }\limits_{k=1}^{K}{{{{\mathcal{Y}}}}}_{{u}_{k}}^{[k]},\end{array}$$

where \({{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\odot {{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) refers to an element-wise multiplication of the two matrices, and \(\mathop{\sum }\nolimits_{{d}_{k} = 1}^{{D}_{k}}{{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\odot {{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) results in a matrix \({{{{\mathcal{Y}}}}}_{{u}_{k}}^{[k]}\) in \({{\mathbb{R}}}^{{R}_{k}\times {R}_{k+1}}\). The ranks R1 = RK+1 = 1 ensure that \(\mathop{\prod }\nolimits_{k = 1}^{K}{{{{\mathcal{Y}}}}}_{{u}_{k}}^{[k]}\) is a scalar. Based on the framework of TTN, two requirements need to be met: (a) given an input vector \({{{\bf{x}}}}\in {{\mathbb{R}}}^{D}\), we need \(D=\mathop{\prod }\nolimits_{k = 1}^{K}{D}_{k}\), \({d}_{k}\in [{D}_{k}]\) and R1 = RK+1 = 1; (b) given the output vector in \({{\mathbb{R}}}^{U}\), we have \(U=\mathop{\prod }\nolimits_{k = 1}^{K}{U}_{k}\), where \({u}_{k}\in [{U}_{k}]\), and R1 = RK+1 = 1. In particular, in this work the output dimension U corresponds to the number of qubits.
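The tensorized mapping above can be sanity-checked numerically: factorizing W into TT-cores and contracting against the reshaped input must reproduce the plain dense product \({{{\bf{y}}}}(u)=\mathop{\sum }\nolimits_{d = 1}^{D}{{{\bf{W}}}}(d,u){{{\bf{x}}}}(d)\). The sketch below (a K = 2 toy case with illustrative dimensions and ranks, not the paper's code) verifies exactly that.

```python
import numpy as np

rng = np.random.default_rng(1)
D, U, R = [2, 3], [2, 2], [1, 2, 1]   # illustrative dims D_k, U_k and TT-ranks (K = 2)
Wc = [rng.standard_normal((R[k], D[k], U[k], R[k + 1])) for k in range(2)]

# Full double-indexed tensor W_{(d1,u1),(d2,u2)} = W^[1]_{d1,u1} W^[2]_{d2,u2};
# the boundary ranks (size-1 axes a, c) are summed out by einsum.
Wfull = np.einsum('adub,bevc->duev', Wc[0], Wc[1])     # shape (D1, U1, D2, U2)

x = rng.standard_normal(6)                             # D = D1 * D2 = 6
X = x.reshape(D)                                       # input tensor X_{d1,d2}

# Tensorized contraction: Y_{u1,u2} = sum_{d1,d2} W_{(d1,u1),(d2,u2)} X_{d1,d2}
Y = np.einsum('duev,de->uv', Wfull, X)

# Equivalent flat matrix-vector product y(u) = sum_d W(d,u) x(d)
Wmat = Wfull.transpose(0, 2, 1, 3).reshape(6, 4)       # rows (d1,d2), cols (u1,u2)
y = Wmat.T @ x
```

Since Y flattened in (u1, u2) order coincides with y, the TT-parameterized layer computes the same linear map as the dense weight matrix while storing only the small 4-order cores.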

Theoretical results

This section first exhibits the architecture of TTN-VQC, and then we analyze the upper bounds on the representation and generalization powers and the optimization performance.

The TTN-VQC pipeline is shown in Fig. 3, where (a) denotes the framework of the TTN, (b) is associated with the VQC model, and (c) represents the operation of functional regression. The VQC model is based on the standard architecture shown in Fig. 2, and the TTN is designed according to the framework in Section “Preliminaries”. To introduce non-linearity into the TTN model, a sigmoid activation function Sigm(⋅) is applied to each \({{{{\mathcal{Y}}}}}_{k}({j}_{k})\) such that

$$\hat{{{{\mathcal{Y}}}}}({j}_{1},{j}_{2},...,{j}_{K})=\mathop{\prod }\limits_{k=1}^{K}Sigm\left({{{{\mathcal{Y}}}}}_{k}\left({j}_{k}\right)\right),$$

which introduces non-linearity into the TTN features and corresponds to a parallel neural network structure.
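For a rank-1, K = 2 toy case (values and dimensions are hypothetical), the branch-wise sigmoid and multiplicative combination in the equation above reduce to a few lines of Python:

```python
import math

def sigm(t):
    """Sigmoid activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Illustrative branch outputs Y_1(u_1) and Y_2(u_2); hypothetical values.
Y1 = [0.3, -1.2]
Y2 = [2.0, 0.0, -0.7]

# Y_hat(u1, u2) = Sigm(Y_1(u1)) * Sigm(Y_2(u2)), per the equation above.
Y_hat = [[sigm(a) * sigm(b) for b in Y2] for a in Y1]
flat = [v for row in Y_hat for v in row]
```

Because every sigmoid factor lies in (0, 1), so does each product, which is the property the TPE in the next stage relies on.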

Fig. 3: An illustration of the TTN-VQC architecture.
figure 3

\({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}\) and \({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\) represent the TTN and VQC operators with trainable parameters θttn and θvqc, respectively. \({{{{\mathcal{T}}}}}_{{{{\boldsymbol{y}}}}}\) refers to a reversible classical-to-quantum mapping. The VQC model in the green dashed square can be repeatedly replicated to generate a deep parametric model. The framework of functional regression outputs loss values and evaluates gradients of the loss function to update the model parameters θvqc and θttn. \({{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\) refers to a fixed regression matrix.

The parallel DNN structure is illustrated in Fig. 4, where a K-order tensor \({{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}\) is first decomposed into 2-dim matrices \({{{{\mathcal{X}}}}}_{{d}_{1}}^{[1]}\), \({{{{\mathcal{X}}}}}_{{d}_{2}}^{[2]}\), ..., \({{{{\mathcal{X}}}}}_{{d}_{K}}^{[K]}\) and each \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) goes through \({{{{\mathcal{W}}}}}_{{d}_{k},{u}_{k}}^{[k]}\). The resulting \({{{{\mathcal{Y}}}}}_{{u}_{1}}^{[1]}\), \({{{{\mathcal{Y}}}}}_{{u}_{2}}^{[2]}\), ..., \({{{{\mathcal{Y}}}}}_{{u}_{K}}^{[K]}\) are non-linearly activated by applying the sigmoid activation function before multiplying them together into a K-order tensor \({\hat{{{{\mathcal{Y}}}}}}_{{u}_{1},{u}_{2},...,{u}_{K}}\). By iterating over \({u}_{k}\in [{U}_{k}]\) and fixing the other indices u1, u2, . . . , uk−1, uk+1, . . . , uK, we separately collect a vector associated with the kth order of \({{{\mathcal{Y}}}}\).

Fig. 4: Reformulating the TTN model in a parallel structure.
figure 4

Each element of the input K-order tensor \({{{{\mathcal{X}}}}}_{{d}_{1},{d}_{2},...,{d}_{K}}\) is factorized into K matrices \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) by utilizing the TTD. Each \({{{{\mathcal{X}}}}}_{{d}_{k}}^{[k]}\) goes through the TTN associated with model parameters \({{{{\mathcal{W}}}}}^{[k]}\). The sigmoid function is imposed upon the output \({{{{\mathcal{Y}}}}}_{{u}_{k}}^{[k]}\), and all \({{{{\mathcal{Y}}}}}_{{u}_{k}}^{[k]}\) are multiplied together to form the output \({\hat{{{{\mathcal{Y}}}}}}_{{u}_{1},{u}_{2},...,{u}_{K}}\).

More significantly, the non-linearity introduced by the sigmoid function sets up a parallel DNN structure for the TTN and helps to build a one-to-one mapping in the TPE framework, because the sigmoid function compresses the functional values into the interval (0, 1). Proposition 1 states that the sigmoid activation function ensures a one-to-one mapping from the classical data to the quantum state.

Proposition 1

The sigmoid activation function applied to the TTN ensures that the TPE acts as a linear unitary operator \(\left\vert {{{\bf{y}}}}\right\rangle ={{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}({\left\vert 0\right\rangle }^{\otimes U})\) such that a quantum state \(\left\vert {{{\bf{y}}}}\right\rangle\) can be generated from a classical vector y. Conversely, the classical vector y can be exactly recovered from the operator \({{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\).

Proposition 1 can be justified based on Eq. (4), where \(\cos (\frac{\pi }{2}{x}_{i})\) and \(\sin (\frac{\pi }{2}{x}_{i})\) are reversible one-to-one functions because each \({x}_{i}\in (0,1)\). Then, we can recover the original classical vector y given the quantum state \(\left\vert {{{\bf{y}}}}\right\rangle\).
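The reversibility argument can be demonstrated per qubit (an illustrative sketch; the inputs are hypothetical sigmoid outputs in (0, 1)): the amplitude pair \((\cos (\frac{\pi }{2}{y}_{i}),\sin (\frac{\pi }{2}{y}_{i}))\) determines yi uniquely, and atan2 inverts it.

```python
import math

def encode(y):
    """Per-qubit amplitudes (cos(pi/2 * y_i), sin(pi/2 * y_i)) of the TPE."""
    return [(math.cos(math.pi / 2 * yi), math.sin(math.pi / 2 * yi)) for yi in y]

def decode(amps):
    """Invert the encoding: atan2 recovers pi/2 * y_i, since for y_i in (0, 1)
    both cos and sin are positive and the angle lies in (0, pi/2)."""
    return [math.atan2(s, c) / (math.pi / 2) for (c, s) in amps]

y = [0.15, 0.5, 0.93]          # hypothetical sigmoid-compressed TTN outputs
recovered = decode(encode(y))
```

The restriction to (0, 1) is what rules out ambiguity: outside that interval the cosine/sine pair would no longer pin down a unique angle in the first quadrant.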

The VQC outputs a classical vector \({{{\bf{z}}}}={[\langle {\sigma }_{z}^{(1)}\rangle ,\langle {\sigma }_{z}^{(2)}\rangle ,...,\langle {\sigma }_{z}^{(U)}\rangle ]}^{T}\), and z is then connected to the framework of functional regression, where a fixed linear regression operator \({{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\) further transforms z into the output vector. The MAE is taken as the loss, and the gradients of the loss function are used to update the parameters of both the VQC and TTN models.

To analyze the representation power, Theorem 1 gives an upper bound on the approximation error. The bound relies on the theoretical analysis of the inherent parallel structure of the TTN model and the universal approximation theory for neural networks40,41,42. Theorem 1 suggests that the representation power of the linear operator \({{{\mathcal{M}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\circ {{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\) is strengthened by applying the non-linear operator \({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}})\).

Theorem 1

Given a smooth target function \({h}_{{{{\mathcal{D}}}}}^{* }:{{\mathbb{R}}}^{Q}\to {{\mathbb{R}}}^{U}\) and a classical data x, there exists a TTN-VQC \(g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})={{{\mathcal{M}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\circ {{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}})\), we obtain

$${{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })=\parallel {h}_{{{{\mathcal{D}}}}}^{* }({{{\bf{x}}}})-{{{{\mathcal{T}}}}}_{lr}\left({\mathbb{E}}\left[g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\right]\right){\parallel }_{1}\le \frac{\Theta (1)}{\sqrt{U}}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right),$$

where U and M separately refer to the number of qubits and the number of quantum measurements, and \({\mathbb{E}}[g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})]\) represents the expectation value of the output measurement.

The upper bound in Eq. (11) implies that the number of qubits U and the number of measurements M jointly decide the representation power of TTN-VQC, and larger values of U and M are expected to lower the upper bound. However, a larger value of U requires an advanced quantum computer with more logical qubits, and more qubits are likely to degrade the optimization performance because of the barren plateau problem. To strike a balance between a large number of qubits and a low optimization bias, the PL condition is introduced to initialize the TTN-VQC model.

Moreover, as for the analysis of the generalization power, Theorem 2 gives an upper bound on the estimation error. The upper bound is derived from the empirical Rademacher complexity \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})\), which is defined as:

$${\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}}):={{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{f\in {{\mathbb{F}}}_{{{{\rm{TV}}}}}}\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}f({{{{\bf{x}}}}}_{n})\right],$$

where S = {x1, x2, . . . , xN} denotes the N samples, and ϵ = {ϵ1, ϵ2, . . . , ϵN} refers to a set of N Rademacher random variables taking the values 1 and −1 with equal probability. The empirical Rademacher complexity measures how well the functional family \({{\mathbb{F}}}_{{{{\rm{TV}}}}}\) correlates with the random noise ϵ on the dataset S, and it describes the richness of the family \({{\mathbb{F}}}_{{{{\rm{TV}}}}}\): a richer family can generate more functions f that correlate better with the random noise on average.
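The definition above can be approximated by Monte Carlo for a family simple enough that the supremum has a closed form. The sketch below (an illustration, not from the paper) uses the scalar linear family {x ↦ wx : |w| ≤ Λ}, for which the supremum equals (Λ/N)|Σn ϵn xn|, and compares the estimate against a PΛ/√N-style bound with P = 1:

```python
import random
import math

random.seed(0)
LAM, N, TRIALS = 2.0, 50, 2000
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]   # inputs with |x_n| <= P = 1

def sup_correlation(eps):
    """sup over |w| <= LAM of (1/N) * sum_n eps_n * w * x_n, in closed form."""
    return (LAM / N) * abs(sum(e * x for e, x in zip(eps, xs)))

# Average over random Rademacher sign vectors (each eps_n is +1 or -1 w.p. 1/2).
estimate = sum(
    sup_correlation([random.choice([-1, 1]) for _ in range(N)])
    for _ in range(TRIALS)
) / TRIALS

bound = LAM * 1.0 / math.sqrt(N)   # the P * Lambda / sqrt(N)-style bound, P = 1
```

The estimate sits well below the bound, and both shrink at the 1/√N rate, mirroring how more training data tightens the estimation-error bound in Theorem 2.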

Theorem 2

Based on the TTN-VQC setup in Theorem 1, the estimation error is upper bounded by the empirical Rademacher complexity \(2{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})\), which is

$$\begin{array}{l}2{\hat{\mathcal{R}}}_{S}\left({{\mathbb{F}}}_{\rm{TV}}\right)\le 2{\hat{\mathcal{R}}}_{S}\left({\mathbb{F}}_{\rm{TTN}}\right)+2{\hat{\mathcal{R}}}_{S}\left({{\mathbb{F}}}_{\rm{VQC}}\right)\le \frac{2P}{\sqrt{N}}\sqrt{\mathop{\sum }\limits_{k=1}^{K}{\Lambda }_{k}^{2}}+\frac{2P{\Lambda }^{{\prime} }}{\sqrt{N}}\\ \qquad {s.t.},\Vert {{{{\bf{x}}}}}_{n}{\Vert}_{2}\le P,\forall n\in [N],\\ \quad\quad \Vert{{{\bf{W}}}}\left({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta}}}}}_{{{{\rm{vqc}}}}}}\right){\Vert}_{F}\le {\Lambda }^{{\prime} },\Vert {{{{\mathcal{W}}}}}^{[k]}\left({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}\right){\Vert}_{F}\le {\Lambda }_{k},k\in [K],\end{array}$$

where \({{\mathbb{F}}}_{{{{\rm{TTN}}}}}\) and \({{\mathbb{F}}}_{{{{\rm{VQC}}}}}\) separately denote the families of TTN and VQC operators; P, \({\Lambda }^{{\prime} }\) and Λk are constants; \({{{\bf{W}}}}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}})\) refers to a matrix associated with the operator \({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\), and \({{{{\mathcal{W}}}}}^{[k]}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}})\) corresponds to a 4-order core tensor of the TTN; \(\parallel {{{\bf{W}}}}{\parallel }_{F}\) and \(\parallel {{{{\mathcal{W}}}}}^{[k]}{\parallel }_{F}\) represent the Frobenius norms of a matrix and a tensor, respectively.

The upper bound on the estimation error in Eq. (13) shows that, given an input x and an initialized TTN-VQC model, a sufficiently large amount of training data N is needed to lower the bound. On the other hand, a noise perturbation of power Pnoise imposed upon the input corresponds to a larger total power P = Pin + Pnoise, which enlarges the upper bound on the estimation error and accordingly weakens the generalization power.

The optimization error of the VQC is associated with the barren plateau problem30, which stems from optimizing a non-convex objective function whose gradients may vanish almost everywhere during training. To alleviate this problem, we introduce an initialization strategy based on the Polyak-Lojasiewicz (PL) condition43,44,45. More specifically, given the set of model parameters θ = {θttn, θvqc} for TTN-VQC, if the empirical loss function \({{{{\mathcal{L}}}}}_{S}\) satisfies the μ-PL condition, the L2-norm of the first-order gradient \(\nabla {{{{\mathcal{L}}}}}_{S}\) with respect to θ satisfies the following inequality:

$$\frac{1}{2}\parallel \nabla {{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}}){\parallel }_{2}^{2}\ge \mu {{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}}).$$

Theorem 3

If a 1-Lipschitz loss function \({{{\mathcal{L}}}}\) over the set of TTN-VQC parameters θ satisfies the PL condition, the gradient descent algorithm with a learning rate of 1 leads to an exponential convergence rate. More specifically, at epoch T, we have

$${{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{T})\le \exp \left(-\mu T\right){{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{0}),$$

where θ0 and θT separately denote the parameters at the initial stage and the epoch T. Furthermore, given a radius \(r=2\sqrt{2{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{0})}{\mu }^{-1}\) for a closed ball B(θ0, r), there exists a global minimum hypothesis θ*B(θ0, r) such that the optimization error becomes sufficiently small.
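The exponential rate in Theorem 3 can be observed on a toy loss that satisfies the μ-PL condition exactly (a sketch, not the TTN-VQC loss): L(θ) = ½μθ² gives ½|∇L(θ)|² = ½μ²θ² = μ·L(θ), i.e. μ-PL with equality.

```python
import math

mu = 0.1
loss = lambda t: 0.5 * mu * t * t   # satisfies the mu-PL condition with equality
grad = lambda t: mu * t

theta, T = 5.0, 50
loss0 = loss(theta)
for _ in range(T):
    theta -= 1.0 * grad(theta)      # gradient descent with learning rate 1

final = loss(theta)                 # equals (1 - mu)^(2T) * loss0 here
bound = math.exp(-mu * T) * loss0   # Theorem 3's exponential bound
```

For this loss, gradient descent contracts θ by the factor (1 − μ) per step, so the loss decays as (1 − μ)^{2T}, which stays below the exp(−μT) envelope of Theorem 3.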

Furthermore, we show a necessary condition in Proposition 2 for a TTN-VQC operator \(f\in {{\mathbb{F}}}_{{{{\rm{TV}}}}}\) to satisfy the μ-PL setup of \({{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}})\), which is related to the tangent kernel of the operator f.

Proposition 2

For a TTN-VQC operator \(f\in {{\mathbb{F}}}_{{{{\rm{TV}}}}}\), we define the tangent kernel \({{{{\mathcal{K}}}}}_{f}\) as \(\nabla_{{{{\boldsymbol{\theta }}}}} f({{{\boldsymbol{\theta }}}})\,\nabla_{{{{\boldsymbol{\theta }}}}} f{({{{\boldsymbol{\theta }}}})}^{T}\). If a 1-Lipschitz loss function \({{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}})\) satisfies the μ-PL condition, then the smallest eigenvalue \({\lambda }_{\min }({{{{\mathcal{K}}}}}_{f})\) of \({{{{\mathcal{K}}}}}_{f}\) satisfies the condition:

$${\lambda }_{\min }\left({{{{\mathcal{K}}}}}_{f}\right)\ge \mu .$$

Theorem 3 suggests that the μ-PL condition ensures an exponential convergence rate for the TTN-VQC, so the training loss can be driven arbitrarily close to 0. Proposition 2 provides a way to check whether the μ-PL condition is met by computing the tangent kernel. Our theorems suggest that a TTN-VQC model meeting the PL condition can better cope with the problem of Barren Plateaus, but we cannot guarantee that a model with a low optimization bias necessarily meets the PL condition. In other words, the PL condition is one possible route to ensuring that the VQC overcomes the optimization issue.
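The link between the tangent kernel and the PL condition can be made concrete with a hypothetical linear model (our illustration, not the TTN-VQC itself): for f(θ) = Mθ under squared-error loss, the tangent kernel is exactly M Mᵀ, and the loss is μ-PL for any μ up to its smallest eigenvalue:

```python
import numpy as np

# For f(theta) = M @ theta with loss L = 0.5*||f(theta) - y||^2, the
# Jacobian is M, so the tangent kernel is K = M M^T, and
# 0.5*||grad L||^2 = 0.5*(f-y)^T K (f-y) >= lambda_min(K) * L,
# i.e. the loss is mu-PL whenever mu <= lambda_min(K).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 6))            # Jacobian of the linear model
K = M @ M.T                            # tangent kernel
lam_min = np.linalg.eigvalsh(K).min()

y = rng.normal(size=4)
ratios = []
for _ in range(50):
    theta = rng.normal(size=6)
    r = M @ theta - y
    loss = 0.5 * r @ r
    grad = M.T @ r
    ratios.append(0.5 * grad @ grad / loss)
print(min(ratios) >= lam_min - 1e-9)  # True: mu-PL with mu = lambda_min(K)
```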

Under the setup of the μ-PL condition, the derived upper bounds on the individual error components can be combined into an aggregated upper bound:

$$\begin{array}{ll}{{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({\bar{f}}_{S})\le {{{{\mathcal{L}}}}}_{{{{\mathcal{D}}}}}({f}_{{{{\mathcal{D}}}}}^{* })+2{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TV}}}}})+\nu \\ \le \frac{\Theta (1)}{\sqrt{U}}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right)+\frac{2P}{\sqrt{N}}\sqrt{\mathop{\sum }\limits_{k=1}^{K}{\Lambda }_{k}^{2}}+\frac{2P{\Lambda }^{{\prime} }}{\sqrt{N}}\\ \quad\,\,s.t.,\quad \parallel {{{{\bf{x}}}}}_{n}{\parallel }_{2}\le P,n\in [N],\\ \parallel {{{\bf{W}}}}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}){\parallel }_{F}\le {\Lambda }^{{\prime} },\parallel {{{{\mathcal{W}}}}}^{[k]}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}){\parallel }_{F}\le {\Lambda }_{k},k\in [K].\end{array}$$

The aggregated upper bound in Eq. (17) shows that, under the μ-PL condition, the training error ϵ can be reduced to nearly 0, so the expected loss is mainly determined by the upper bounds on the approximation and estimation errors.

Empirical results

To corroborate our theoretical analysis of the TTN-VQC, our experiments comprise two groups: (1) to evaluate the representation power, the training and test datasets are set in the same clean environment; (2) to assess the generalization power of TTN-VQC, the test data are mixed with additive Gaussian and Laplacian noises, with SNR levels set to 8 dB and 12 dB, respectively. Our baseline system is a linear PCA-VQC model that employs principal component analysis (PCA)46, a standard method that reduces data dimensionality by applying a linear transformation in an unsupervised manner. Our experiments compare the performance of the TTN-VQC and PCA-VQC models, and particularly aim at verifying the following points:

  1. The TTN-VQC can lead to better performance than PCA-VQC in both matched and unmatched environmental settings.

  2. Increasing the number of qubits can improve the representation power of TTN-VQC.

  3. Exponential convergence rates demonstrate that our configurations of the TTN-VQC satisfy the μ-PL condition.

We evaluate the performance of TTN-VQC on the standard MNIST dataset47, which targets the classification of handwritten digits over 10 classes and provides 60,000 training examples and 10,000 test examples. In our experiments, we randomly sample 10,000 training examples and 2000 test examples. Both training and test data are corrupted with noise at different SNR levels, and the generated noisy data are taken as the input to the quantum-based models. During training, the target of the models is set to the clean data, and the model-enhanced data are expected to be as close as possible to the target. We measure model performance in the test stage by calculating the L1-norm loss between the enhanced data and the target.
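The noise-mixing step above can be sketched as follows; this is our own minimal illustration (the scaling rule is the standard SNR definition, not code from the paper), using a random vector as a stand-in for a flattened MNIST image:

```python
import numpy as np

def mix_at_snr(x, snr_db, rng, laplace=False):
    """Corrupt signal x with additive noise scaled to a target SNR (dB)."""
    noise = (rng.laplace(size=x.shape) if laplace
             else rng.normal(size=x.shape))
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(p_sig / p_noise_target) -> rescale the noise
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + scale * noise

rng = np.random.default_rng(0)
x = rng.random(784)                     # stand-in for a flattened 28x28 image
noisy = mix_at_snr(x, snr_db=8.0, rng=rng)
snr = 10 * np.log10(np.mean(x ** 2) / np.mean((noisy - x) ** 2))
mae = np.mean(np.abs(noisy - x))        # L1-style loss against the clean target
print(round(snr, 6))  # 8.0 (by construction)
```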

As the experimental baseline, a hybrid PCA-VQC model is built, where PCA serves as a simple feature extractor followed by the VQC; the PCA-VQC thus represents a linear model in contrast to the non-linear TTN-VQC. We include 4 PQC blocks in the VQC employed in the experiments. As for the experiments of TTN-VQC, each image is reshaped into a 3-order 7 × 16 × 7 tensor. Given a set of ranks R = {1, 3, 3, 1}, we set 3 trainable tensors as: \({{{{\mathcal{W}}}}}_{1}\in {{\mathbb{R}}}^{1\times 7\times {U}_{1}\times 3}\), \({{{{\mathcal{W}}}}}_{2}\in {{\mathbb{R}}}^{3\times 16\times {U}_{2}\times 3}\), and \({{{{\mathcal{W}}}}}_{3}\in {{\mathbb{R}}}^{3\times 7\times {U}_{3}\times 1}\), where \(U=\mathop{\prod }\nolimits_{k = 1}^{3}{U}_{k}\) is associated with the number of qubits. In particular, we separately assess models with 8 and 12 qubits, whose parameters (U1, U2, U3) are set to (2, 2, 2) and (2, 3, 2), respectively. Stochastic gradient descent (SGD)48 with the Adam optimizer49 is utilized in the training process, with a mini-batch size of 50 and a learning rate of 1. The 1-Lipschitz continuous loss function based on MAE is taken to meet the PL condition.

To corroborate Theorem 1 on the representation power of TTN-VQC, both training and test data are mixed with Gaussian noise at a 15 dB SNR level, and we compare TTN-VQC with PCA-VQC on the resulting noisy settings. Figure 5 demonstrates the related empirical results, where TTN-VQC_8Qubit and TTN-VQC_12Qubit represent the TTN-VQC models with 8 and 12 qubits, and PCA-VQC_8Qubit and PCA-VQC_12Qubit denote the PCA-VQC models with 8 and 12 qubits, respectively. Our experiments show that the TTN-VQC significantly outperforms the PCA-VQC counterparts in terms of lower training and test loss values. Moreover, our results also suggest that more qubits improve the empirical performance of both TTN-VQC and PCA-VQC models. Table 2 presents the final results on the test dataset. The TTN-VQC_12Qubit model has more parameters than the TTN-VQC_8Qubit model (0.636 Mb vs. 0.452 Mb), but it attains better empirical performance in terms of a lower MAE score (0.0156 vs. 0.0597) on the test dataset.

Fig. 5: Empirical results of the vector-to-vector regression on the MNIST dataset to evaluate the representation power of TTN-VQC.
figure 5

a MAE loss values on the training data. b MAE loss values on the test data. TTN-VQC_8Qubit and TTN-VQC_12Qubit represent the TTN-VQC models with 8 and 12 qubits, respectively; PCA-VQC_8Qubit and PCA-VQC_12Qubit separately denote the PCA-VQC models with 8 and 12 qubits.

Table 2 Empirical results of TTN-VQC and PCA-VQC models on the test dataset.

To assess the generalization power of TTN-VQC, the test data are mixed with additive Gaussian and Laplacian noises at 8 dB and 12 dB SNR levels. Based on the well-trained TTN-VQC and PCA-VQC models with eight qubits, we further evaluate their performance on these noisy test data. According to the upper bound on the generalization power in Theorem 2, a noisier setting corresponds to a larger Pnoise and hence a larger total power P = Pin + Pnoise. Thus, we corroborate our theorem by evaluating the empirical performance under different noisy conditions. Meanwhile, to highlight the advantage of non-linearity for TTN-VQC, we also compare the experimental results of both TTN-VQC and PCA-VQC.

On the one hand, Fig. 6 suggests that the TTN-VQC models significantly outperform the PCA-VQC counterparts in the two noisy settings, and Table 3 shows that the TTN-VQC models achieve much lower MAE scores than the PCA-VQC ones under all noisy environments. On the other hand, we observe that the performance of the TTN-VQC models degrades (higher MAE scores) under the more adverse Gaussian and Laplacian noisy settings, which corresponds to our theoretical analysis.

Fig. 6: Empirical results of the vector-to-vector regression on the MNIST dataset to evaluate the generalization power of TTN-VQC and PCA-VQC with 8 qubits.
figure 6

a MAE loss values on the training data. b MAE loss values on the test data. There are two noisy settings on the test dataset to evaluate the performance of the TTN-VQC and PCA-VQC models: Gauss-8dB and Gauss-12dB separately denote the Gaussian noisy conditions of 8 dB and 12 dB SNR levels; Laplace-8dB and Laplace-12dB refer to the Laplacian noisy settings of 8dB and 12dB SNR levels, respectively.

Table 3 Empirical results of TTN-VQC and PCA-VQC models on the test dataset with either Gaussian or Laplacian noise with 8 dB or 12 dB SNR levels.

Moreover, our derived upper bound on the estimation error is also associated with the amount of training data. To test the effect of the training data size on the generalization capability, the amount of training data is gradually increased from a subset to the whole set. In Table 4, we observe that a larger amount of training data leads to lower MAE scores, which corresponds to better generalization power.

Table 4 Empirical results of TTN-VQC on datasets of different sizes.


This work focuses on the theoretical error performance analysis for VQC-based functional regression, particularly when the TTN is employed for dimensionality reduction. Our theoretical results provide upper bounds on the representation and generalization powers of TTN-VQC. They suggest that the approximation error is inversely proportional to the square root of the number of qubits, which means that increasing the number of qubits leads to better representation power of TTN-VQC. The estimation error of TTN-VQC is related to its generalization power and is upper bounded via the empirical Rademacher complexity. The optimization error can be lowered to a small value by leveraging the PL condition to realize an exponential convergence rate under the SGD algorithm. To the best of our knowledge, no prior work has delivered such a complete error characterization.

Our experiments of vector-to-vector regression on the MNIST dataset are designed to corroborate the theoretical results. We first compare the representation power of the TTN-VQC models with that of the PCA-VQC counterparts. We observe that more qubits and the non-linearity of TTN-VQC improve the empirical performance, which matches our theoretical analysis. Further, we assess the generalization power of TTN-VQC under different noisy inputs, and we demonstrate that more mismatched and noisier inputs worsen the generalization power. Besides, the non-linear TTN-VQC models outperform the linear PCA-VQC models in terms of both representation and generalization powers, implying that the non-linearity of TTN-VQC contributes greatly to the improvement of VQC performance.

We also note that the TTN-VQC models attain exponential convergence rates, and the optimization error is eventually reduced to 0 in the training process, which corresponds to the PL condition in our theoretical analysis. Moreover, the empirical results on the test dataset consistently exhibit a decreasing trend. These results imply that the model setup of TTN-VQC meets the PL condition and can thus handle the problem of Barren Plateaus. Our future work will discuss how to initialize the VQC model based on the PL condition to minimize the optimization bias.

Furthermore, although our theoretical results are built upon the Lipschitz loss functions utilized for the regression problem, the theoretical contributions can be generalized to classification tasks, where loss functions such as the hinge loss and cross-entropy satisfy a data-dependent Lipschitz continuity whose Lipschitz constant varies across datasets.


This section provides detailed proofs of our theoretical results. We first present the upper bound on the representation power, and then derive the upper bound on the generalization power. The analysis of the optimization performance is conducted based on the PL condition.

Proof for Theorem 1

The derivation of Theorem 1 is mainly based on the classical universal approximation theorem40,41,42 and the parallel structure of TTN. We first denote by gm(x; θvqc, θttn) the m-th measurement of the TTN-VQC operator g(x; θvqc, θttn), and \(\mathop{\sum }\nolimits_{m = 1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\) is defined as:

$$\begin{array}{rcl}\mathop{\sum }\limits_{m=1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})&=&\mathop{\sum }\limits_{m=1}^{M}{{{{\mathcal{M}}}}}_{m}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\circ {{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}})\\ &=&{{{{\mathcal{M}}}}}^{{\prime} }\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\circ {{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}})\\ &=&{{{{\mathcal{M}}}}}^{{\prime} }\circ {{{\mathcal{H}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}}),\end{array}$$

where the operator \({{{\mathcal{H}}}}={{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}}\circ {{{{\mathcal{T}}}}}_{{{{\bf{y}}}}}\) corresponds to a unitary matrix, \({{{{\mathcal{M}}}}}_{m}\) denotes the m-th measurement, and \({{{{\mathcal{M}}}}}^{{\prime} }=\mathop{\sum }\nolimits_{m = 1}^{M}{{{{\mathcal{M}}}}}_{m}\). Moreover, \({{{{\mathcal{H}}}}}^{-1}\) is the inverse of the linear unitary operator \({{{\mathcal{H}}}}\), and gm refers to the function after the quantum measurement. Next, we can further derive that

$$\begin{array}{ll}&\Big\Vert \hat{f}({{{\bf{x}}}})-{{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left({\mathbb{E}}\left[g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\right]\right)\Big\Vert_{1}\\ &\le \Big\Vert \hat{f}({{{\bf{x}}}})-{{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left(\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\right)\Big\Vert_{1}\\ &\quad\quad+\Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left(\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\right)-{{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left({\mathbb{E}}[g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})]\right)\Big\Vert_{1}\quad \,{{\mbox{(Triangle Ineq.)}}}\,\\ &=\Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left({{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}^{-1}(\hat{f}({{{\bf{x}}}}))-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})\right)\Big\Vert_{1}\\&\quad\quad+\Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left(\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{g}_{m}({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})-{\mathbb{E}}[g({{{\bf{x}}}};{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}})]\right)\Big\Vert_{1}\\ &\le \Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\left({{{{\mathcal{M}}}}}^{{\prime} }\circ {{{\mathcal{H}}}}\circ {{{{\mathcal{H}}}}}^{-1}\circ {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}^{-1}(\hat{f}({{{\bf{x}}}}))-{{{{\mathcal{M}}}}}^{{\prime} }\circ {{{\mathcal{H}}}}\circ {{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}}({{{\bf{x}}}})\right)\Big\Vert_{1}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right)\cdot \Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}(1)\Big\Vert_{1}\quad \,{{\mbox{(Central Limit Theorem)}}}\,\\ &\le \mathop{\prod }\limits_{k=1}^{K}\frac{1}{\sqrt{{U}_{k}}}\cdot \Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}\circ {{{{\mathcal{M}}}}}^{{\prime} }\circ {{{\mathcal{H}}}}(1)\Big\Vert_{1}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right)\cdot \Big\Vert {{{{\mathcal{T}}}}}_{{{{\rm{lr}}}}}(1)\Big\Vert_{1}\quad (\,{{\mbox{Universal Approx.}}}\,)\\ &=\frac{\Theta (1)}{\sqrt{U}}+{{{\mathcal{O}}}}\left(\frac{1}{\sqrt{M}}\right)\quad \left(\text{c.f.}\ \mathop{\prod }\limits_{k=1}^{K}{U}_{k}=U\right).\end{array}$$

Proof for Theorem 2

Based on Eq. (9) and Fig. 4, the kth channel is equivalent to a feed-forward layer of a neural network with the sigmoid function. More specifically, the input \({{{{\mathcal{X}}}}}^{[k]}\in {{\mathbb{R}}}^{{R}_{k}\times {D}_{k}\times {R}_{k+1}}\) is reshaped into a high-dimensional vector \({\bar{{{{\bf{x}}}}}}^{[k]}\in {{\mathbb{R}}}^{{R}_{k}{R}_{k+1}{D}_{k}}\), which further goes through the feed-forward layer with the weight matrix \({\bar{{{{\bf{W}}}}}}^{[k]}\in {{\mathbb{R}}}^{{U}_{k}\times {R}_{k}{R}_{k+1}{D}_{k}}\). After the operation of the sigmoid function, we have an output vector \({{{{\bf{y}}}}}^{[k]}\in {{\mathbb{R}}}^{{U}_{k}}\).
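This channel-to-dense-layer equivalence can be sketched directly; the shapes below follow the k = 2 core of the experiments (R_2 = R_3 = 3, D_2 = 16, U_2 = 3), while the random inputs and weights are our own stand-ins:

```python
import numpy as np

# The kth TTN channel, acting on X^[k] in R^{R_k x D_k x R_{k+1}},
# behaves like a single dense layer followed by a sigmoid: reshape the
# input tensor to a vector, multiply by a U_k x (R_k * D_k * R_{k+1})
# weight matrix, and apply the sigmoid elementwise.
Rk, Dk, Rk1, Uk = 3, 16, 3, 3
rng = np.random.default_rng(0)
X_k = rng.normal(size=(Rk, Dk, Rk1))            # channel input tensor
x_bar = X_k.reshape(-1)                         # vector in R^{R_k * D_k * R_{k+1}}
W_bar = rng.normal(size=(Uk, Rk * Dk * Rk1))    # dense weight matrix
y_k = 1 / (1 + np.exp(-(W_bar @ x_bar)))        # sigmoid output in R^{U_k}
print(y_k.shape)  # (3,)
```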

As for the upper bound on the estimation error of the TTN-VQC model, we separately upper bound the terms for the TTN and VQC families by leveraging the empirical Rademacher complexity. Moreover, we define \({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]}\) as the functional family for the kth channel associated with Fig. 7, with empirical Rademacher complexity \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]})\). Thus, based on the Rademacher identities, we obtain \({\hat{{{{\mathcal{R}}}}}}_{S}\left({{\mathbb{F}}}_{{{{\rm{TTN}}}}}\right)\le \mathop{\sum }\nolimits_{k = 1}^{K}{\hat{{{{\mathcal{R}}}}}}_{S}\left({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]}\right)\).

$$\begin{array}{lll}{\hat{{{{\mathcal{R}}}}}}_{S}\left({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]}\right)&=&\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda }\mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}\mathop{\sum }\limits_{u=1}^{U}\sigma \left({\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right)\right]\\ &=&\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda }\mathop{\sum }\limits_{u=1}^{U}\mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}\sigma \left({\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right)\right]\\ &=&\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda ,u\in [1,U]}\left\vert \mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}\sigma \left({\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right)\right\vert \right]\\ &=&\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda ,u\in [1,U]}\mathop{\sup }\limits_{s\in \{-1,+1\}}s\mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}\sigma \left({\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right)\right].\end{array}$$
Fig. 7: The kth channel of TTN is equivalent to a feed-forward layer of neural network with the sigmoid function.
figure 7

The input \({\bar{{{{\bf{x}}}}}}^{[k]}\) is derived from the reshape of \({{{{\mathcal{X}}}}}^{[k]}\) and goes through the feed-forward neural network with the weight matrix \({\bar{{{{\bf{W}}}}}}^{[k]}\) and sigmoid function. The output \({{{{\bf{y}}}}}^{[k]}\) corresponds to the array for the kth order of \(\hat{{{{\mathcal{Y}}}}}\).

Furthermore, we upper bound \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]})\) by utilizing Talagrand's contraction inequality50, obtaining

$$\begin{array}{lll}{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]})&\le &\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda ,u\in [1,U]}\mathop{\sup }\limits_{s\in \{-1,+1\}}s\mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}{\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right]\\ &=&\frac{1}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\mathop{\sup }\limits_{| | {\bar{{{{\bf{w}}}}}}_{u}^{[k]}| {| }_{2}\le \Lambda ,u\in [1,U]}\left\vert {\bar{{{{\bf{w}}}}}}_{u}^{[k]}\cdot \mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}{\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right\vert \right]\\ &=&\frac{{\Lambda }_{k}}{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[{\left\vert \left\vert \mathop{\sum }\limits_{n = 1}^{N}{\epsilon }_{n}{\bar{{{{\bf{x}}}}}}_{n}^{[k]}\right\vert \right\vert }_{2}\right]\\ &\le &\frac{{\Lambda }_{k}}{N}\sqrt{{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}\left[\parallel \mathop{\sum }\limits_{n=1}^{N}{\epsilon }_{n}{\bar{{{{\bf{x}}}}}}_{n}^{[k]}{\parallel }_{2}^{2}\right]}\quad ({\mbox{Jensen'}} {\mbox{s}} \,{\mbox{inequality}})\\ &=&\frac{{\Lambda }_{k}}{N}\sqrt{\mathop{\sum }\limits_{i,j=1}^{N}{{\mathbb{E}}}_{{{{\boldsymbol{\epsilon }}}}}[{\epsilon }_{i}{\epsilon }_{j}]\left({\bar{{{{\bf{x}}}}}}_{i}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{j}^{[k]}\right)}\\ &=&\frac{{\Lambda }_{k}}{N}\sqrt{\mathop{\sum }\limits_{i,j=1}^{N}{1}_{i = j}\left({\bar{{{{\bf{x}}}}}}_{i}^{[k]}\cdot {\bar{{{{\bf{x}}}}}}_{j}^{[k]}\right)}\\ &=&\frac{{\Lambda }_{k}}{N}\sqrt{\mathop{\sum }\limits_{n=1}^{N}\parallel {\bar{{{{\bf{x}}}}}}_{n}^{[k]}{\parallel }_{2}^{2}}\quad \left(| | {\bar{{{{\bf{x}}}}}}_{n}^{[k]}| {| }_{2}^{2}\le {P}_{k}^{2}\right)\\ &\le &\frac{{\Lambda }_{k}{P}_{k}}{\sqrt{N}},\end{array}$$

where we assume \(| | {\bar{{{{\bf{x}}}}}}_{n}^{[k]}| {| }_{2}\le {P}_{k}\) and accordingly \(\sqrt{\mathop{\sum }\nolimits_{n = 1}^{N}| | {\bar{{{{\bf{x}}}}}}_{n}^{[k]}| {| }_{2}^{2}}\le \sqrt{N}{P}_{k}\).

Finally, we utilize the Cauchy–Schwarz inequality and obtain the result that

$${\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}})\le \mathop{\sum }\limits_{k=1}^{K}{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}}^{[k]})=\mathop{\sum }\limits_{k=1}^{K}\frac{{\Lambda }_{k}{P}_{k}}{\sqrt{N}}\le \frac{\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{\Lambda }_{k}^{2}}\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{P}_{k}^{2}}}{\sqrt{N}},$$

where \(P=\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{P}_{k}^{2}}\) and \(\parallel {{{{\bf{x}}}}}_{n}{\parallel }_{2}=\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}\parallel {\bar{{{{\bf{x}}}}}}_{n}^{[k]}{\parallel }_{2}^{2}}\le \sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{P}_{k}^{2}}=P\). Hence, we attain the inequality as follows:

$$\begin{array}{l}{\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{TTN}}}}})\le \frac{P\sqrt{\mathop{\sum }\nolimits_{k = 1}^{K}{\Lambda }_{k}^{2}}}{\sqrt{N}},\\ s.t.,| | {{{{\bf{x}}}}}_{n}| {| }_{2}\le P,n\in [N],| | {{{{\mathcal{W}}}}}^{[k]}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ttn}}}}}})| {| }_{F}\le {\Lambda }_{k},k\in [K].\end{array}$$

Similarly, we can obtain \({\hat{{{{\mathcal{R}}}}}}_{S}({{\mathbb{F}}}_{{{{\rm{VQC}}}}})\le \frac{P{\Lambda }^{{\prime} }}{\sqrt{N}}\) under the constraint \(| | {{{\bf{W}}}}({{{{\mathcal{T}}}}}_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{vqc}}}}}})| {| }_{F}\le {\Lambda }^{{\prime} }\). This completes the proof of Theorem 2.
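The final bound can be probed by Monte Carlo. The sketch below (our own illustration, with unit-norm samples as an assumption) estimates the empirical Rademacher complexity of a norm-bounded linear class, for which the supremum has a closed form, and checks it against ΛP/√N:

```python
import numpy as np

# For the class {x -> w.x : ||w||_2 <= Lam}, the Rademacher supremum is
# sup_w sum_n eps_n * (w . x_n) = Lam * ||sum_n eps_n x_n||_2, so the
# empirical complexity can be estimated by averaging over eps draws and
# compared with the bound Lam * P / sqrt(N) when ||x_n||_2 <= P.
rng = np.random.default_rng(0)
N, d, Lam = 200, 10, 2.0
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_n||_2 = P = 1
P = 1.0

eps = rng.choice([-1.0, 1.0], size=(5000, N))   # Rademacher sign draws
sups = Lam * np.linalg.norm(eps @ X, axis=1)    # closed-form supremum
rad_hat = sups.mean() / N
print(rad_hat <= Lam * P / np.sqrt(N))  # True
```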

Proof for Theorem 3

Assume the gradient descent algorithm runs within the closed ball B(θ0, R) with R = 2μ−1, and the loss function \({{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}})\) has the following properties: (1) \({{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}})\) is μ-PL; (2) \({{{{\mathcal{L}}}}}_{S}({{{\boldsymbol{\theta }}}})\) is 1-Lipschitz; (3) the norm of the Hessian H is bounded by 1.

Then, we need to prove the following properties: (a) there exists a global minimum θ* ∈ B(θ0, R); (b) the gradient descent algorithm converges at an exponential rate: \({{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t+1})\le {(1-\eta \mu )}^{t+1}{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{0})\). By applying the Taylor expansion, we obtain

$$\begin{array}{rcl}&&{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t+1})\\ &=&{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})+{({{{{\boldsymbol{\theta }}}}}_{t+1}-{{{{\boldsymbol{\theta }}}}}_{t})}^{T}\nabla f({{{{\boldsymbol{\theta }}}}}_{t})+\frac{1}{2}{({{{{\boldsymbol{\theta }}}}}_{t+1}-{{{{\boldsymbol{\theta }}}}}_{t})}^{T}H({{{{\boldsymbol{\theta }}}}}^{{\prime} })({{{{\boldsymbol{\theta }}}}}_{t+1}-{{{{\boldsymbol{\theta }}}}}_{t})\\ &=&{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})+(-\eta )\nabla {{{{\mathcal{L}}}}}_{S}{({{{{\boldsymbol{\theta }}}}}_{t})}^{T}\nabla {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})+\frac{1}{2}(-\eta )\nabla {{{{\mathcal{L}}}}}_{S}{({{{{\boldsymbol{\theta }}}}}_{t})}^{T}H({{{{\boldsymbol{\theta }}}}}^{{\prime} })(-\eta )\nabla {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})\\ &=&{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})-\eta \parallel \nabla {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t}){\parallel }_{2}^{2}+\frac{{\eta }^{2}}{2}\nabla {{{{\mathcal{L}}}}}_{S}{({{{{\boldsymbol{\theta }}}}}_{t})}^{T}H({{{{\boldsymbol{\theta }}}}}^{{\prime} })\nabla {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})\\ &\le &{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})-\eta (1-\frac{\eta }{2})\parallel \nabla {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t}){\parallel }_{2}^{2}\quad \quad \,{{\mbox{(by Assumption 3)}}}\,\\ &\le &{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})-\eta (2-\eta )\mu {{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})\quad \quad ({{{\rm{by}}}}\,\mu -\,{{\mbox{PL Assumption}}}\,)\\ &=&\left(1-2\eta \mu +{\eta }^{2}\mu \right){{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{t})\\ &\le &{\left(1-2\eta \mu +{\eta }^{2}\mu \right)}^{t+1}{{{{\mathcal{L}}}}}_{S}({{{{\boldsymbol{\theta }}}}}_{0}).\end{array}$$

Next, we show that θt does not leave the ball B. Based on the Hessian bound in assumption (3), we have \({{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t})-{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t+1})\ge \frac{\eta }{2}\parallel \nabla {{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t}){\parallel }_{2}^{2}\), which for η = 1 leads to \(\parallel \nabla {{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t}){\parallel }_{2}\le \sqrt{2\left({{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t})-{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{t+1})\right)}\). Then, we further derive that

$$\begin{array}{rcl}\parallel {{{{\boldsymbol{\theta }}}}}_{t+1}-{{{{\boldsymbol{\theta }}}}}_{0}{\parallel }_{2}&=&\eta \parallel \mathop{\sum }\limits_{\tau =0}^{t}\nabla {{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{\tau }){\parallel }_{2}\\ &\le &\eta \mathop{\sum }\limits_{\tau =0}^{t}\parallel \nabla {{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{\tau }){\parallel }_{2}\\ &\le &\eta \mathop{\sum }\limits_{\tau =0}^{t}\sqrt{2\left({{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{\tau })-{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{\tau +1})\right)}\quad \,{{\mbox{(by Continuity)}}}\,\\ &\le &\eta \mathop{\sum }\limits_{\tau =0}^{t}\sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{\tau })}\\ &\le &\eta \sqrt{2}\mathop{\sum }\limits_{\tau =0}^{t}\sqrt{{(1-2\eta \mu +{\eta }^{2}\mu )}^{\tau }{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}\quad \,{{\mbox{(by Geometric Convergence)}}}\,\\ &=&\eta \sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}\mathop{\sum }\limits_{\tau =0}^{t}{(1-2\eta \mu +{\eta }^{2}\mu )}^{\tau /2}\\ &\le &\frac{\eta \sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}}{1-\sqrt{1-2\eta \mu +{\eta }^{2}\mu }}\\ &=&\frac{\sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}(1+\sqrt{1-2\eta \mu +{\eta }^{2}\mu })}{\mu (2-\eta )}\\ &\le &\frac{2\sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}}{\mu }\quad (\,{{\mbox{by setting}}}\ \eta =1).\end{array}$$

The inequality \(\parallel {{{{\boldsymbol{\theta }}}}}_{t+1}-{{{{\boldsymbol{\theta }}}}}_{0}{\parallel }_{2}\le 2\sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}{\mu }^{-1}\) shows that the gradient descent algorithm keeps the updated point within a ball of radius \(2\sqrt{2{{{\mathcal{L}}}}({{{{\boldsymbol{\theta }}}}}_{0})}{\mu }^{-1}\); a larger μ implies a faster convergence rate within a smaller ball.
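The ball-containment conclusion can be checked numerically on a toy μ-PL quadratic of our own choosing (with Hessian norm at most 1, matching the assumptions of the proof):

```python
import numpy as np

# Verify that eta = 1 gradient descent never leaves the ball
# B(theta_0, r) with r = 2*sqrt(2*L(theta_0))/mu, as in the closing
# step of the proof.
A = np.diag([0.2, 0.6, 1.0])           # Hessian norm <= 1; mu = 0.2
mu = 0.2
loss = lambda th: 0.5 * th @ A @ th

theta0 = np.array([1.5, -0.5, 2.0])
r = 2 * np.sqrt(2 * loss(theta0)) / mu
theta = theta0.copy()
inside = True
for _ in range(100):
    theta = theta - A @ theta          # eta = 1 gradient step
    inside &= np.linalg.norm(theta - theta0) <= r
print(inside)  # True
```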