Quantum neural network cost function concentration dependency on the parametrization expressivity

Although we are currently in the era of noisy intermediate-scale quantum devices, several studies are being conducted with the aim of bringing machine learning to the quantum domain. Currently, quantum variational circuits are one of the main strategies used to build such models. However, despite their widespread use, we still do not know what the minimum resources needed to create a quantum machine learning model are. In this article, we analyze how the expressivity of the parametrization affects the cost function. We show analytically that the more expressive the parametrization is, the more the cost function will tend to concentrate around a value that depends both on the chosen observable and on the number of qubits used. To this end, we first obtain a relationship between the expressivity of the parametrization and the mean value of the cost function. Afterwards, we relate the expressivity of the parametrization to the variance of the cost function. Finally, we present numerical simulation results that confirm our theoretical predictions. To the best of our knowledge, this is the first time that these two important aspects of quantum neural networks are explicitly connected.


I. INTRODUCTION
In recent years, interest in quantum computing has greatly increased due to its possible applications in solving problems such as the simulation of quantum systems [1], the development of new drugs [2], and the solution of systems of linear equations [3]. Quantum machine learning, an interdisciplinary area between machine learning and quantum computing, is another application that should benefit from the computational power of these devices. In this context, several models have already been proposed, such as the Quantum Multilayer Perceptron [4], Quantum Convolutional Neural Networks [5], the Quantum Kernel Method [6], and Quantum-Classical Hybrid Neural Networks [7][8][9][10]. However, in the era of noisy intermediate-scale quantum (NISQ) devices, variational quantum algorithms (VQAs) [11] are the main strategy used to build such models.
Variational quantum algorithms are models that use a classical optimizer to minimize a cost function by adjusting the parameters of a parametrization U. Several optimization strategies have already been proposed [12][13][14][15], although this remains an open area of study. In fact, despite the widespread use of VQAs, our understanding of them is limited and some problems still need to be solved, such as vanishing gradients [16][17][18][19][20][21][22], methods to mitigate the barren plateaus issue [23][24][25][26][27], how to build a parametrization U [29,30], and how to correct errors [32].
In this article, we analyze how the expressivity of the parametrization U affects the cost function. We will show that the more expressive the parametrization U is, the more the average value of the cost function will concentrate around a fixed value. In addition, we will show that the probability of the cost function deviating from its average also depends on the expressivity of the quantum circuit.
The remainder of this article is organized as follows. In Section II, we give a short introduction to VQAs. In Section III, we discuss how expressivity can be quantified and what it means. In Section IV, we present our main results in the form of two theorems. In Theorem 1, we obtain a relationship between the concentration of the cost function and the expressivity of the parametrization. In Theorem 2, we bound the probability of the cost function deviating from its average value by a function of the expressivity of the quantum circuit. Then, in Section V, we present numerical simulation results that confirm our theoretical predictions. Finally, Section VI presents our conclusions.

II. VARIATIONAL QUANTUM ALGORITHMS
Variational quantum algorithms are models in which a classical optimizer is used to minimize a cost function, usually written as the average value of an observable O,

C(θ) = ⟨ψ| U(θ)† O U(θ) |ψ⟩,  (1)

where |ψ⟩ := V(x)|0⟩. To do so, the optimizer updates the parameters θ of the parametrization U. In Fig. 1, one can see a schematic representation of how a VQA works. In the first part, Fig. 1 A, a quantum circuit runs on a quantum computer. In general, this circuit is divided into three parts. First, a parametrization V is used to encode data into a quantum state; in quantum machine learning, this parametrization brings our data, such as samples from the MNIST dataset [28], into a quantum state. Next comes the parametrization U, which depends on the parameters θ that we must optimize. Finally, measurements are performed and used to compute the cost function. In the second part, Fig. 1 B, a classical computer performs the task of optimizing the parameters of the parametrization; in general, the gradient of the cost function is used for this task.

Figure 1: Illustration of how a variational quantum algorithm works. These models have two parts. A) A quantum circuit that runs on the quantum computer. B) A classical computer that optimizes the parameters, in general using the gradient and the cost function.
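To make Eq. (1) concrete, here is a minimal numpy sketch (not the authors' code; the single-qubit ansatz U(θ) = R_Y(θ), the observable O = |0⟩⟨0|, and the encoded state |0⟩ are illustrative choices of ours) that evaluates the cost function as the expectation value of O:

```python
import numpy as np

# Pauli-Y matrix
Y = np.array([[0, -1j], [1j, 0]])

def ry(theta):
    # R_Y(theta) = exp(-i * theta * Y / 2)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * Y

def cost(theta, O, psi):
    # C(theta) = <psi| U(theta)^dagger O U(theta) |psi>, Eq. (1), with U = R_Y
    phi = ry(theta) @ psi
    return float(np.real(np.conj(phi) @ O @ phi))

O = np.array([[1, 0], [0, 0]], dtype=complex)   # observable |0><0|
psi = np.array([1, 0], dtype=complex)           # encoded state V(x)|0> taken as |0>
print(cost(0.0, O, psi))   # 1.0: at theta = 0 the state stays |0>
```

A classical optimizer would then update θ to minimize `cost`, which is what the loop of Fig. 1 B represents.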
In this article, the parametrization will be given by

U(θ) = ∏_{l=1}^{L} U_l(θ_l) W_l,  (2)

where L is the number of layers, U_l is a layer that depends on the parameters θ, and W_l is a layer that does not depend on the parameters θ. The construction of parametrizations is still an open area of study and, due to the complexity involved, some works have proposed automating this process [29,31]. Furthermore, for problems such as quantum machine learning, where a parametrization V is used to encode data into a quantum state, the choice of V is also extremely important [33], and several possible encoding schemes have been proposed [34].
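A sketch of a parametrization with the layered structure of Eq. (2) for two qubits, taking U_l as a product of single-qubit R_Y rotations and W_l as a fixed CNOT (both are illustrative assumptions of ours, not the specific circuits of this article):

```python
import numpy as np

Y = np.array([[0, -1j], [1j, 0]])
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def ry(theta):
    # R_Y(theta) = exp(-i * theta * Y / 2)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * Y

def ansatz(thetas):
    # Layered parametrization in the spirit of Eq. (2):
    # each layer applies U_l = RY x RY (parametrized) then W_l = CNOT (fixed).
    U = np.eye(4, dtype=complex)
    for t1, t2 in thetas:           # one (t1, t2) parameter pair per layer
        U = CNOT @ np.kron(ry(t1), ry(t2)) @ U
    return U

U = ansatz([(0.3, 1.2), (2.1, 0.7)])            # L = 2 layers
print(np.allclose(U.conj().T @ U, np.eye(4)))   # True: U is unitary
```

Deeper circuits are obtained simply by passing more parameter pairs, which is how the number of layers L is varied in the simulations of Section V.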

III. EXPRESSIVITY
Following Ref. [35], here we define expressivity as the ability of a quantum circuit to generate (pure) states that are well representative of the Hilbert space. In the case of a single qubit, this comes down to the quantum circuit's ability to explore the Bloch sphere. To quantify the expressivity of a quantum circuit, we compare the uniform distribution of unitaries obtained from the set U with the maximally expressive (Haar) uniform distribution over the unitary group U(d). For this, we use the following superoperator [20]:

A_U^{(t)}(·) := ∫_{U(d)} dμ(V) V^{⊗t} (·) (V†)^{⊗t} − ∫_U dU U^{⊗t} (·) (U†)^{⊗t},  (3)

where dμ(V) is a volume element of the Haar measure and dU is a volume element corresponding to the uniform distribution over U. The uniform distribution over U is obtained by fixing the parametrization U, where for each parameter vector θ we obtain a unitary U(θ). Thus, given the set of parameters {θ_1, θ_2, ..., θ_m}, we obtain the corresponding set of unitary operators

U = {U(θ_1), U(θ_2), ..., U(θ_m)}.  (4)
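For t = 1 the Haar integral in Eq. (3) can be evaluated in closed form, ∫ dμ(V) V ρ V† = Tr[ρ] I/d, so ||A_U^{(1)}(ρ)||_2 can be estimated by replacing the second integral with a sample mean over unitaries drawn from the ansatz. A numpy sketch (the helper name `a1_norm` is ours):

```python
import numpy as np

def a1_norm(unitaries, rho):
    # Estimate ||A_U^{(1)}(rho)||_2 from Eq. (3). The t = 1 Haar term is exact:
    # int dmu(V) V rho V^dag = Tr[rho] I / d. The ensemble term is a sample mean.
    d = rho.shape[0]
    ens = sum(U @ rho @ U.conj().T for U in unitaries) / len(unitaries)
    A = np.trace(rho) * np.eye(d) / d - ens
    # Matrix 2-norm: ||A||_2 = sqrt(Tr[A^dag A])
    return float(np.sqrt(np.real(np.trace(A.conj().T @ A))))

# Least expressive case: the "ensemble" contains only the identity.
rho = np.array([[1, 0], [0, 0]], dtype=complex)    # rho = |0><0|
print(a1_norm([np.eye(2, dtype=complex)], rho))    # = ||I/2 - rho||_2 ~ 0.7071
```

A highly expressive ansatz drives this quantity toward zero, while a rigid ansatz (as in the single-unitary example above) leaves it large.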

IV. MAIN THEOREMS
In this section, we present our main results. First, we obtain a relationship between the average value of the cost function, Eq. (1), and the expressivity of the parametrization U, Eq. (2). Afterwards, we obtain a relationship between the variance of the cost function and the expressivity of the parametrization. To do so, we start by writing the average of the cost function as

⟨C⟩_U = ∫_U dU Tr[U ρ U† O],  (5)

where ρ := |ψ⟩⟨ψ|. Therefore, using Eq. (3) in Eq. (5), we obtain the following relationship between the mean of the cost function and the expressivity of the parametrization, Theorem 1.
Theorem 1 (Concentration of the cost function). Let the cost function be defined as in Eq. (1), with observable O, parametrization U, Eq. (2), and encoded quantum state ρ := |ψ⟩⟨ψ|. Then it follows that

|⟨C⟩_U − Tr[O]/d| ≤ ||A_U^{(1)}(ρ)||_2 ||O||_2.  (6)

The proof of this theorem is presented in Appendix A. Above we used the matrix 2-norm, ||A||_2^2 = Tr(A†A). For any operator X, from Eq. (3), we have that the smaller ||A_U^{(t)}(X)||_2 is, with A_U^{(t)} ≡ A, the more expressive the parametrization will be. Therefore, Theorem 1 implies that the greater the expressivity of the parametrization U, the more the average of the cost function will tend to the value Tr[O]/d.
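Theorem 1 can be checked numerically in the maximally expressive limit (a sketch with our own choices of O, ρ, and d, not the article's simulations): sampling Haar-random unitaries, the sampled mean of the cost should sit near Tr[O]/d.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_unitary(d):
    # Haar-random unitary via QR of a complex Ginibre matrix, with the
    # standard phase correction so the distribution is uniform (Haar).
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

d = 4                                      # n = 2 qubits
O = np.diag([1.0, 0.0, 0.0, 0.0])          # O = |0><0|, so Tr[O]/d = 0.25
psi = np.zeros(d, dtype=complex)
psi[0] = 1.0                               # rho = |0><0|
costs = []
for _ in range(2000):
    phi = haar_unitary(d) @ psi
    costs.append(float(np.real(np.conj(phi) @ O @ phi)))
print(np.mean(costs))                      # concentrates near Tr[O]/d = 0.25
```

For a Haar ensemble ||A_U^{(1)}(ρ)||_2 = 0, so the bound in Eq. (6) pins the mean to Tr[O]/d, which is what the sample average shows.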
Although Theorem 1 implies a tendency of the mean value of the cost function toward a fixed value, when executing the VQA the cost function may deviate from its mean. To quantify this deviation we use the Chebyshev inequality,

Pr(|C − ⟨C⟩_U| ≥ δ) ≤ Var(C)/δ²,  (7)

which bounds the probability of the cost function deviating from its mean value. Next, we present Theorem 2, relating the variance of the cost function to the expressivity of the parametrization.
Theorem 2. Let us consider the cost function defined in Eq. (1) and the parametrization U defined in Eq. (2). Then the variance of the cost function can be upper-bounded by a function of the expressivity of the parametrization, Eqs. (8)-(10). Here d = 2^n, where n is the number of qubits.
The proof of this theorem is presented in Appendix B.
Since the variance is a positive real number, we can use Theorem 2 to analyze the probability that the cost function deviates from its mean, Eq. (7). Therefore, from Theorem 2, we see that, once the observable O and the size of the system (that is, the number of qubits) are fixed, the probability of the cost function deviating from its mean decreases as the expressivity increases. Furthermore, it also follows from Theorem 1 that, for maximally expressive parametrizations, i.e., for ||A_U^{(t)}(X)||_2 = 0, the cost function will be stuck at the fixed value Tr[O]/d.
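The role of the variance in Eq. (7) can be illustrated numerically (again a sketch with our own choices, here in the maximally expressive Haar limit): sample the cost over random unitaries and compare the empirical deviation probability with the Chebyshev bound Var(C)/δ².

```python
import numpy as np

rng = np.random.default_rng(1)

def haar_unitary(d):
    # Haar-random unitary via QR of a complex Ginibre matrix.
    z = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

d = 4
psi = np.zeros(d, dtype=complex)
psi[0] = 1.0
# C = |<0|U|0>|^2, i.e., O = rho = |0><0|
costs = np.array([abs((haar_unitary(d) @ psi)[0]) ** 2 for _ in range(5000)])
delta = 0.2
cheb = costs.var() / delta ** 2                              # bound of Eq. (7)
emp = float(np.mean(np.abs(costs - costs.mean()) >= delta))  # empirical rate
print(emp <= cheb)   # True: the empirical deviation rate respects the bound
```

As the variance shrinks with growing expressivity (Theorem 2), the right-hand side of Eq. (7) shrinks with it, forcing the cost samples to cluster ever more tightly around Tr[O]/d.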

V. SIMULATION RESULTS
In this section we will present some numerical simulation results.
For this, we use twelve different parametrizations, which we call Model 1, Model 2, ..., Model 12. See Appendix C for the corresponding quantum circuits. As we saw in Eq. (2), the parametrization is obtained from the product of L layers U_l, where each layer U_l can be distinct from the others; that is, the gates and their sequence in one layer may differ from those in another, although in general they are the same. For the results shown here, the layers U_l are identical, the only difference being the parameters θ used in each layer.
For these results we define each U_l as in Eq. (11), where the index l indicates the layer and the index i the qubit. Also, we use R_Y(θ_{i,l}) = e^{−i θ_{i,l} Y/2} in all models. Our first goal is to verify numerically Eq. (6) of Theorem 1. For this, we performed an initial set of simulations, Figs. 2, 3, and 4, where we fixed the number of qubits and varied the number of layers L. For the results of Figs. 2, 3, and 4, we used four, five, and six qubits, respectively. Furthermore, for these simulations we consider the particular case O = |0⟩⟨0| and ρ = |0⟩⟨0|.
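A small-scale stand-in for this experiment (our own two-qubit toy model with R_Y layers and CNOT entanglers as W_l, not any of the twelve Models of Appendix C): averaging the cost C = |⟨00|U(θ)|00⟩|² over random parameter draws for increasing L, the sample means sit near Tr[O]/d = 1/d.

```python
import numpy as np

rng = np.random.default_rng(42)
Y = np.array([[0, -1j], [1j, 0]])
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def ry(theta):
    # R_Y(theta) = exp(-i * theta * Y / 2)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * Y

def model(thetas):
    # L layers of U_l = RY(theta_{1,l}) x RY(theta_{2,l}) followed by W_l = CNOT
    U = np.eye(4, dtype=complex)
    for t1, t2 in thetas:
        U = CNOT @ np.kron(ry(t1), ry(t2)) @ U
    return U

means = {}
for L in (1, 2, 4):
    costs = [abs(model(rng.uniform(0, 2 * np.pi, size=(L, 2)))[0, 0]) ** 2
             for _ in range(2000)]         # C = |<00|U|00>|^2, O = rho = |0><0|
    means[L] = float(np.mean(costs))
print(means)   # each sample mean lies near Tr[O]/d = 0.25
```

The article's simulations do the same kind of sweep at larger scale: 5000 unitaries per point, four to six qubits, and twelve different layer structures.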
Initially, we calculate the value of ||A(ρ)||_2 analytically, Eqs. (12) and (13), following Ref. [20]; alternatively, from Ref. [35], we obtain Eq. (14). To calculate ||A(ρ)||_2, we generated 5000 pairs of state vectors. Although this is a large number of state vectors, it is still a small sample of the entire Hilbert space, so the value we obtained for μ(ρ) is an approximation. As a consequence, in some simulations we obtained a complex value for ||A(ρ)||_2, Eq. (12); whenever this occurred, we restarted the simulation. Furthermore, we also used 5000 unitaries to average the cost function.
Figs. 2, 3, and 4 show the behavior of the right-hand side of Eq. (6), related to the expressivity, and of the average cost function term, the left-hand side of Eq. (6). For these figures, quantum circuits with four, five, and six qubits were used, respectively.
In Figs. 5, 6, and 7, we show the behavior of the numerically calculated variance, Var s, the left-hand side of Eq. (8), and of the theoretical bound, Var t, the right-hand side of Eq. (8), where we again used four, five, and six qubits, respectively. Also, we again used 5000 unitaries to compute the averages.

VI. CONCLUSION
In this article, we analyzed how the expressivity of the parametrization affects the cost function. As we observed, the concentration of the average value of the cost function has an upper bound that depends on the expressivity of the parametrization: the more expressive the parametrization is, the more the average of the cost function will concentrate around the fixed value Tr[O]/d (Theorem 1). Furthermore, the probability of the cost function deviating from its mean also depends on the expressivity of the parametrization (Theorem 2).
A possible implication of these results concerns the training of VQAs with highly expressive parametrizations. The more expressive the parametrization is, the more the average value of the cost function concentrates around Tr[O]/d, and the probability of the cost function deviating from this average also decreases; in the limiting case where ||A_U^{(t)}(ρ)||_2 = 0, the cost function will be stuck at the value Tr[O]/d. This result agrees with the one obtained in Ref. [20], where it was shown that the vanishing-gradient phenomenon is related to parametrizations having high expressivity.
Another possible implication of our results concerns quantum machine learning models. In Ref. [37], the authors noted a correlation between expressivity and accuracy: in general, the greater the expressivity, the greater the accuracy. To quantify this correlation, they used Pearson's correlation coefficient. However, our results imply that training highly expressive parametrized quantum machine learning models is difficult not only because they suffer more from the vanishing-gradient problem, as indicated in Ref. [20], but also because the cost function itself becomes stuck in a region close to the value Tr[O]/d.

Figure 2 :
Figure 2: Behavior of the right-hand side of Eq. (6), the quantum expressivity (expr), and of the average cost function term (med), the left-hand side of Eq. (6), as the number of layers L is increased. Four qubits were used for obtaining all these plots.

Figure 3 :
Figure 3: Behavior of the right-hand side of Eq. (6), the quantum expressivity (expr), and of the average cost function term (med), the left-hand side of Eq. (6), as the number of layers L is increased. Five qubits were used for obtaining all these plots.

Figure 4 :
Figure 4: Behavior of the right-hand side of Eq. (6), the quantum expressivity (expr), and of the average cost function term (med), the left-hand side of Eq. (6), as the number of layers L is increased. Six qubits were used for obtaining all these plots.

Figure 5 :
Figure 5: Behavior of the numerically calculated cost function variance, Var s, the left-hand side of Eq. (8), and of the expressivity-related term, Var t, the right-hand side of Eq. (8), as the number of layers L is increased. Four qubits were used for obtaining all these plots.