Hierarchical quantum classifiers

Quantum circuits with hierarchical structure have been used to perform binary classification of classical data encoded in a quantum state. We demonstrate that more expressive circuits in the same family achieve better accuracy and can be used to classify highly entangled quantum states, for which there is no known efficient classical method. We compare performance for several different parameterizations on two classical machine learning datasets, Iris and MNIST, and on a synthetic dataset of quantum states. Finally, we demonstrate that performance is robust to noise and deploy an Iris dataset classifier on the ibmqx4 quantum computer.


Introduction
Neural networks offer state-of-the-art performance in a wide range of machine learning tasks including computer vision, natural language processing, generative modelling and reinforcement learning [1]. The hierarchical structure of deep neural networks can allow them to match the expressiveness of shallower models with exponentially fewer parameters [2,3,4]. In recent years, there has been much interest in translating the success of neural networks to the quantum computing context [5]. Despite this, there are many open questions regarding the advantages that quantum computation can bring to machine learning [6,7]. Do quantum algorithms offer a clear speedup over classical approaches for inference and training? Are quantum machine learning algorithms robust to noise? What are the best quantum circuit layouts to carry out machine learning tasks? An exciting way to explore these questions is through experimentation on available quantum hardware, and simulation on classical hardware.
Tensor networks are a method for representing an intractable high rank tensor as a decomposition of tractable lower rank tensors connected by contraction. They are widely used in many-body physics for the simulation of strongly correlated quantum systems, and can be used to represent both quantum states and quantum circuits [8,9,10,11]. Tensor networks with hierarchical structure exhibit many similarities with neural networks and in some cases have been shown to be equivalent [12,13]. Given that tensor networks can be used to represent both neural networks and quantum circuits, they are a natural choice for exploring the intersection of both fields. In this work, we consider the supervised machine learning tasks of classifying classical and quantum data on a quantum computer using hierarchical quantum circuits.
Effectively encoding classical data for use in quantum algorithms is fraught with subtle issues. To perform classification on a quantum computer the input data must be encoded in a quantum state. Two ways in which this can be achieved are by encoding the data in the amplitudes of individual qubits in a fully separable state (qubit encoding), or in the amplitudes of an entangled state (amplitude encoding). To avoid the non-trivial requirement of preparing arbitrary superpositions, we use qubit encoding to encode classical data.
Quantum data, such as the output of a quantum circuit or of a quantum sensor, may already be in a highly entangled state. For this reason we assume that quantum data may be entangled. Once the data encoding has been done, the classifier consists of a series of unitary operations applied to the initial quantum state. Then, a measurement is carried out on a target qubit. In practice, multiple runs are required to approximate the expectation of the measurement outcome, and the most frequent outcome is taken as the predicted class. More runs increase the classifier confidence.
In addition to the pipeline just described, we need to specify the layout of the hierarchical circuit, and the algorithm for learning its parameters. The circuits we use here are tree-like and can be parameterized with a simple gate-set that is compatible with currently available quantum computers. The first of these circuits is known as a Tree Tensor Network (TTN) [9]. We then consider a more complex circuit layout known as the Multi-Scale Entanglement Renormalization Ansatz (MERA) [10]. MERAs are similar to TTNs, but make use of additional unitary transformations to effectively capture a broader range of quantum correlations. Both one-dimensional (1D) and two-dimensional (2D) versions of TTN and MERA circuits have been proposed in the literature [14,15].
In the 1D case, TTN and MERA circuits can be evaluated efficiently using classical techniques when the input data is encoded using qubit encoding. Evaluating such circuits on amplitude encoded data is likely to be classically intractable. In 2D, the TTN circuit is efficiently simulatable when using qubit encoding, whereas the 2D MERA circuit is not. Because we cannot simulate large 2D MERA circuits, we restrict our experiments to the 1D case. In all experiments we find that 1D MERA outperforms 1D TTN, suggesting that 2D MERA could in principle outperform 2D TTN. Such a hypothesis should be tested with future experiments as suitably large near-term quantum computers become available. Classifiers that possess a 2D structure would be the natural choice for 2D data such as natural images.
Optimizing the circuits can be accomplished by stochastic gradient descent. In the case of efficiently simulatable networks, it makes sense to use the analytic gradient. For circuits that cannot be efficiently simulated it is possible to use a quantum computer and estimate gradients numerically. Moreover, a hybrid approach that involves a classical pre-training step to initialize some of the gates has been previously proposed [16]. We empirically validate this approach by initializing a 1D MERA with a pre-trained 1D TTN. Such pre-training reduces the average number of training steps needed until convergence on a model with comparable accuracy, a benefit for implementations on near-term quantum computers.
We demonstrate our techniques using TTNs and MERAs and compare performance for a number of parameterizations. The first of these uses only single-qubit rotations and fixed CNOT gates. The second uses more general two-qubit gates. The third uses three-qubit gates, where the additional ancilla qubits allow for non-linear operations. Both real and complex parameterizations are compared.
We test the ability of each classifier to predict binary labels on two canonical machine learning datasets, Iris [17] and MNIST handwritten digits [18], and on synthetic quantum datasets. We also use the IBM Quantum Experience [19] to test robustness to depolarizing noise, and to deploy the model on the ibmqx4 quantum computer.
The structure of the article is the following. In Sec. 2 we compare and contrast related work with ours. In Sec. 3 we describe the hierarchical quantum classifiers. Section 4 describes a number of experiments along with numerical results. Section 5 presents our conclusions and directions for future work.

Related work
Combining the success of deep neural networks and other machine learning methods with the power of quantum computation is a tantalizing prospect. Much work to date has focused on modifying classical machine learning algorithms to incorporate quantum linear algebra subroutines, thus inheriting their speed-ups. One such subroutine is the quantum algorithm for solving linear systems, also known as HHL [20]. The algorithm is exponentially faster than the best known classical alternative, although this comes with some caveats [7]. Quantum classifiers that use HHL include the quantum support vector machine [21] and kernel least squares [22]. Whilst promising, these algorithms also inherit the limitations of HHL, in particular the requirement that classical data be efficiently prepared in amplitude encoding. Another quantum subroutine that can be readily embedded in a quantum classification model is Grover's algorithm which, for example, has been used to improve both the computational and statistical complexity of the perceptron model [23].
While the above proposals assume availability of universal quantum computers, much of the recent literature has been focusing on algorithms for noisy intermediate-scale quantum technologies [24]. These consist of hybrid quantum-classical algorithms where the quantum computer is used to execute ansatz circuits and to measure observables of interest. A classical optimization routine is used to adjust the ansatz circuit in order to minimize a cost function. Originally proposed for quantum chemistry and combinatorial optimization, these approaches have been recently investigated for supervised [25] and unsupervised [26,27] machine learning. The underlying ansatz circuits are often inspired by the structure of classical neural networks, but without explicit reference to tensor networks.
References [28] and [16] propose training tree-like tensor networks to be classifiers. In particular, Ref. [28] demonstrates that TTNs can be used to classify images of handwritten digits and to encode classes of images in quantum many-body states. The framework proposed in Ref. [16] examines the role of training TTNs to be classifiers in a quantum computing context and provides numerical evidence that TTNs can be used to perform supervised and unsupervised machine learning with the support of a quantum computer. Our work extends these ideas in the following respects:
• More complex networks, such as MERA, are studied, and their superiority over simpler networks is demonstrated;
• We demonstrate that our networks can be used to classify quantum mechanical data, in addition to classical data;
• Networks constructed from simple two-qubit gates that can be natively implemented in available hardware are compared against networks made of more complex gates that would require compilation;
• It is demonstrated that simple models can be pre-trained and used to initialize more complex ones, a step that could be crucial to enable deployment of resource-heavy algorithms on near-term quantum devices;
• Deployment of a trained tensor network on a real quantum device (ibmqx4) is demonstrated.
Data encoding & quantum circuits for classification

Data encoding
Classification consists of assigning a category to an observation. In machine learning, an inference model is trained to minimize the classification error on a finite set of data, also known as the training set. The actual performance of the classifier, the generalization error, is then estimated on a set of data points not used for training, also known as the test set. The functional form of the inference model is often critical to the success of the classifier. State-of-the-art models for high-dimensional data sets with complex structure are typically hierarchical or compositional [1]. These ideas can be translated to the paradigm of quantum computation using the framework of tensor networks. Before describing the tensor network architectures used in this work, namely TTN and MERA, it is important to first clarify which data sets are used in this paper to gauge the performance of these networks, and how they are prepared.
Let us first consider the case of classical data. A classical data set for binary classification is a set $\mathcal{D} = \{(\mathbf{x}^d, y^d)\}_{d=1}^{D}$, where $\mathbf{x}^d \in \mathbb{R}^N$ are N-dimensional input vectors and $y^d \in \{0, 1\}$ are the corresponding class labels. Classifying classical data on a quantum computer requires that the input vectors be encoded in a quantum state. There are a variety of ways to accomplish this and different algorithms require different encoding methods. The most efficient approach in terms of space is to encode classical data in the amplitudes of a superposition, that is, using N qubits to encode a $2^N$-dimensional data vector. However, there is no known efficient method that can prepare arbitrary superpositions. In other words, the efficiency and subsequent speedups are genuine only if the desired state can be prepared in time polynomial in N [7]. For this reason we do not use this encoding for classical data.
A simpler method is to encode each element of a classical data vector in the amplitude of a single qubit. This type of encoding requires N qubits to encode an N-dimensional data vector and, therefore, is less efficient in terms of space. However, the state preparation is clearly efficient in terms of time as it only requires single-qubit rotations. We opt for this type of encoding. In particular, we first re-scale the data vectors element-wise to lie in $[0, \pi/2]$. Then, we encode each vector element in a qubit using the following scheme [29]:

$$\psi^d_n = \cos(x^d_n)\,|0\rangle + \sin(x^d_n)\,|1\rangle. \qquad (1)$$

The full data vector is then written as $\psi^d = \bigotimes_{n=1}^{N} \psi^d_n$, and is ready to be used in a quantum algorithm.
Let us now consider the case of quantum data. A quantum data set for binary classification is a set $\mathcal{D} = \{(\psi^d, y^d)\}_{d=1}^{D}$, where $\psi^d \in \mathbb{C}^{2^N}$ are $2^N$-dimensional input vectors of unit length, and $y^d \in \{0, 1\}$ are the corresponding classes. In contrast to classical data, quantum data, such as the output of a quantum circuit or a quantum sensor, may already be in superposition. That is, the quantum states are used as-is, and there is no relevant cost for the preparation.
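As a rough illustration of the qubit encoding described above, the following NumPy sketch rescales each feature to [0, π/2] and builds the product state whose n-th factor is cos(x_n)|0⟩ + sin(x_n)|1⟩. The helper names `rescale` and `qubit_encode` and the toy data matrix are illustrative assumptions, not part of the original work.

```python
import numpy as np

def rescale(X):
    """Rescale each feature (column) of X linearly to [0, pi/2]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span * (np.pi / 2)

def qubit_encode(x):
    """Product state of N qubits; qubit n is cos(x_n)|0> + sin(x_n)|1>."""
    state = np.array([1.0])
    for angle in x:
        state = np.kron(state, np.array([np.cos(angle), np.sin(angle)]))
    return state

# Toy data: 3 examples with 4 made-up features each (not real measurements).
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 3.0, 2.0, 5.0],
              [2.0, 2.0, 4.0, 4.0]])
angles = rescale(X)
psi = qubit_encode(angles[0])  # 2**4 = 16 amplitudes
```

Note that the state vector is only materialized here for simulation; on hardware the same state would be prepared with one single-qubit rotation per feature.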

Circuit architecture
We now discuss the quantum circuit architectures for classification. The first circuit architecture is inspired by tree tensor networks, specifically binary trees. The TTN circuit begins by applying a set of two-qubit nearest-neighbour unitaries to the input. We then discard one of the qubits output from each unitary, halving the number of qubits in the next layer of the circuit. In the following layer we again apply two-qubit unitaries to the remaining qubits before discarding half of them. This process is repeated until only one qubit remains. The network in full consists of measuring a single-qubit expectation value on this remaining qubit,

$$M_\theta(\psi^d) = \langle \psi^d |\, \hat{U}_{QC}^\dagger(\{U_i\})\, \hat{M}\, \hat{U}_{QC}(\{U_i\})\, | \psi^d \rangle, \qquad (2)$$

where $\hat{U}_{QC}(\{U_i\})$ is the quantum circuit made up of unitaries $U_i(\theta_i)$, $\theta = \{\theta_i\}$ is the set of parameters which define the unitaries, and $\hat{M}$ is the single-qubit operator whose expectation we are calculating. A circuit diagram of an 8-qubit TTN is shown in Fig. 1 (a). The solid lines encompass the circuit, while the dashed lines represent its conjugate transpose.
The MERA network is closely related to the TTN. All of the unitaries that make up a tree network are maintained, with an additional layer of two-qubit unitaries added before each layer of the TTN. These additional unitaries, $\{D_i\}$, each operate on one qubit of neighbouring unitaries in the upcoming TTN layer. In a conventional MERA network, the addition of these unitaries allows quantum correlations on a particular length scale to be captured at the same layer of the network [10]. A circuit diagram of an 8-qubit MERA is shown in Fig. 1.
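The TTN evaluation described above can be sketched with a small state-vector simulation. Discarding a qubit is equivalent to never acting on it again, so the sketch keeps all 8 qubits and simply applies later gates only to the retained ones. The specific wiring in `LAYERS`, the choice of Pauli Z as the measured operator, the 0-based index 5 for the measured qubit, and all function names are assumptions made for illustration; the source only fixes the tree structure and the use of seven two-qubit unitaries.

```python
import numpy as np

def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix column phases

def apply_two_qubit(state, U, a, b, n):
    """Apply the 4x4 unitary U to qubits a and b of an n-qubit state vector."""
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, [a, b], [0, 1]).reshape(4, -1)
    psi = (U @ psi).reshape([2] * n)
    return np.moveaxis(psi, [0, 1], [a, b]).reshape(-1)

def expectation_z(state, q, n):
    """Expectation value of Pauli Z on qubit q."""
    probs = np.abs(state.reshape([2] * n)) ** 2
    p = probs.sum(axis=tuple(i for i in range(n) if i != q))
    return float(p[0] - p[1])

# Assumed wiring: each layer halves the number of active qubits.
LAYERS = [[(0, 1), (2, 3), (4, 5), (6, 7)],  # keep qubits 1, 2, 5, 6
          [(1, 2), (5, 6)],                  # keep qubits 2, 5
          [(2, 5)]]                          # measure qubit 5

def ttn_expectation(state, unitaries, measured=5, n=8):
    """Run the tree circuit and return <Z> on the remaining qubit."""
    gates = iter(unitaries)
    for layer in LAYERS:
        for a, b in layer:
            state = apply_two_qubit(state, next(gates), a, b, n)
    return expectation_z(state, measured, n)

rng = np.random.default_rng(0)
unitaries = [random_unitary(4, rng) for _ in range(7)]
state = np.array([1.0])
for angle in rng.uniform(0, np.pi / 2, size=8):  # qubit-encoded product input
    state = np.kron(state, [np.cos(angle), np.sin(angle)])
value = ttn_expectation(state, unitaries)
```

A MERA could be simulated in the same way by inserting extra two-qubit gates before each tree layer, at the cost of a less local causal structure.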

Unitary parameterization
We have explored a number of different ways to parameterize the unitaries used in these circuits. Because some of the input data is purely real, we tested the effect of restricting the unitaries to be real too. That is, we chose unitaries such that $U_i \in SO(\cdot) \subset SU(\cdot)$. We also consider general, complex-valued unitaries $U_i \in SU(\cdot)$. As has been observed in the context of the time-dependent variational principle applied to tensor networks, the use of complex weights often prevents optimization from getting stuck in local minima [30,31].
We also explored a number of other methods for parameterizing the unitaries; Fig. 2 illustrates three such parameterizations. In Fig. 2 (a), the unitary block is composed of two arbitrary single-qubit rotations and a CNOT$_{ij}$ gate, where i and j are the control and target qubit, respectively. Note that in some cases the direction of the CNOT$_{ij}$ may be reversed in order to respect the causal structure. For example, in our 8-qubit implementations we reverse the control and target qubits for blocks $U_2$, $U_4$ and $U_6$ lying in the lower part of the circuit. In the case of the restriction to SO(4) the single-qubit rotations are simply Y-rotations. In Fig. 2 (b), the unitary block consists of an arbitrary two-qubit gate. It is interesting to explore this much more general setting in simulations, although a practical implementation of such a unitary may be costly. That is, the two-qubit unitary needs to be compiled into low-level hardware-dependent gates.
Finally, Fig. 2 (c) shows a three-qubit gate involving an ancilla qubit. By tracing out the ancilla qubit we can effectively implement a rich class of non-linear functions, e.g. step functions [32], closely resembling the operations of classical neural networks. Again, in practice a significant overhead is expected due to compilation.
The measurement $\hat{M}$ is performed on a specific qubit and consists of a simple Pauli measurement in a chosen direction. This can be implemented in practice by an additional single-qubit rotation followed by the projective measurement onto $|0\rangle\langle 0|$. This is sufficient for a binary classification task; by computing and thresholding the expectation value of $\hat{M}$, TTN and MERA classify the input $\psi^d$ into one of the two classes. In our example in Fig. 1, the measurement is performed on qubit number six.
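To make the simplest parameterization concrete, the sketch below builds the block of Fig. 2 (a) in its real form: two Y-rotations followed by a CNOT, with an optional reversed CNOT. It also shows one way to threshold a measured expectation value into a binary label. The function names, the gate ordering (rotations first, then CNOT), and the 0.5 threshold are assumptions for illustration; the source does not fix them.

```python
import numpy as np

def ry(theta):
    """Real single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# Basis ordering |q0 q1>; index = 2*q0 + q1.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)           # control q0, target q1
CNOT_REVERSED = np.array([[1, 0, 0, 0],
                          [0, 0, 0, 1],
                          [0, 0, 1, 0],
                          [0, 1, 0, 0]], dtype=float)  # control q1, target q0

def simple_block(theta_a, theta_b, reverse=False):
    """Two Y-rotations followed by a CNOT (assumed ordering)."""
    rotations = np.kron(ry(theta_a), ry(theta_b))
    return (CNOT_REVERSED if reverse else CNOT) @ rotations

def predict(expectation, threshold=0.5):
    """Assumed rule: label 1 if the measured expectation exceeds the threshold."""
    return int(expectation > threshold)

block = simple_block(0.3, 1.1)
```

Each such block has only two real parameters, which is why an 8-qubit TTN built from it is cheap to train and to run on hardware with native CNOT gates.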

Learning process and complexity
We now discuss the learning process. In principle, the circuit's parameters would be adjusted to maximize the number of correct predictions on a training set, but this procedure may be intractable. Alternatively one can minimize a simpler cost function, and there exist several cost functions motivated by information theoretic arguments. We chose to minimize the quadratic cost function

$$J(\theta) = \frac{1}{D} \sum_{d=1}^{D} \left( M_\theta(\psi^d) - y^d \right)^2, \qquad (3)$$

where $M_\theta(\psi^d)$ is the expectation value measured at the output of the circuit, $\psi^d$ are inputs, $y^d$ are targets, D is the number of training data points, and $\theta$ groups all the adjustable parameters of the circuit as described above. Although there exist several approaches to carry out this optimization, artificial neural networks are commonly optimized by stochastic gradient descent algorithms. At each iteration t, we estimate the gradient $\nabla J^{(t)}$ and choose a learning rate $\eta^{(t)}$. Parameters are then updated via a rule of the kind $\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta^{(t)} \nabla J^{(t)}$. Such an algorithm is stochastic because at each iteration the gradient is estimated on a small batch rather than on the full training set. Besides speeding up the calculation, this noisy gradient may help in escaping from local minima. Much literature and experimentation has been dedicated to improving stochastic gradient descent algorithms. In this work, we employ a variant called Adaptive Moment Estimation (Adam) [33].
The cost function is a function of the measurement outcome of the circuit being trained. In order to obtain these measurement outcomes, the circuit itself must be evaluated. In Table 1 we summarize the complexity of obtaining the measurement outcomes at the end of the different types of circuits in this paper. The complexity stated is in terms of the number of multiplications required to perform the task. The complexities in the two-dimensional cases are stated for a grid of N × N qudits. The complexities stated for the two-dimensional networks use the network architecture introduced in Refs. [34] and [10].
In the case of efficiently contractable networks we can compute the exact gradient using off-the-shelf automatic differentiation software (e.g., TensorFlow [35]). This applies to many one-dimensional networks including TTNs and MERA. For networks that cannot be efficiently contracted, a finite-difference method or an approximation to the true gradient must be used [25]. These strategies are subject to additional noise from finite-sampling error and from the intrinsic noise of near-term quantum devices. We begin exploring the impact of the latter with simulations in Sec. 4.3. Note that all of the circuits we train in this paper can be evaluated efficiently on quantum hardware. The complexities in Table 1 are for N χ-dimensional qudits in one dimension and N × N χ-dimensional qudits in two dimensions.
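The learning loop can be illustrated on a deliberately tiny problem. The sketch below minimizes the quadratic cost with mini-batch gradient descent, estimating the gradient by central finite differences as one might when only circuit evaluations are available. It uses plain stochastic gradient descent rather than Adam, and the one-parameter `model` (the probability of measuring 0 after a Y-rotation of a qubit-encoded input) is a stand-in assumed for illustration, not one of the circuits studied here.

```python
import numpy as np

def model(theta, x):
    """Toy stand-in for M_theta: P(0) after RY(theta[0]) on cos(x)|0> + sin(x)|1>."""
    return np.cos(x + theta[0] / 2) ** 2

def cost(theta, xs, ys):
    """Quadratic cost: mean squared difference between outputs and targets."""
    return float(np.mean((model(theta, xs) - ys) ** 2))

def fd_gradient(theta, xs, ys, eps=1e-4):
    """Central finite-difference estimate of the gradient of the cost."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        shift = np.zeros_like(theta)
        shift[k] = eps
        grad[k] = (cost(theta + shift, xs, ys) - cost(theta - shift, xs, ys)) / (2 * eps)
    return grad

def sgd(theta, xs, ys, lr=0.5, steps=200, batch=4, seed=0):
    """Mini-batch gradient descent: theta <- theta - lr * grad."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(xs.size, size=batch, replace=False)
        theta = theta - lr * fd_gradient(theta, xs[idx], ys[idx])
    return theta

xs = np.linspace(0, np.pi / 2, 16)   # encoded inputs in [0, pi/2]
ys = (xs < np.pi / 4).astype(float)  # toy labels: 1 for small angles
theta0 = np.array([1.0])
theta = sgd(theta0, xs, ys)
```

On a quantum device each call to `cost` would itself be a noisy estimate from a finite number of shots, so the finite-difference step `eps` would need to be much larger than the value used in this noiseless sketch.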

Iris dataset
In this experiment we tested the ability of a TTN to classify varieties of Iris. The Iris dataset [17] consists of 150 examples in total of three varieties of Iris flowers. Each example of Iris is described by four real-valued attributes $x_1, \dots, x_4$. We encoded the four attributes into four qubits using Eq. (1). We then parameterized unitaries using the simple gate shown in Fig. 2 (a), and restricted the single-qubit rotations to be real (i.e., Y-rotations). To allow for binary classification, three binary datasets were extracted from the original set. In each subset, each class comprised 1/2 of the examples. For each class, 1/3 of the examples were held out as a test set and used to compute the accuracy. Mean accuracy and one standard deviation computed on five random initializations are given in Table 2. As shown, TTN performed extremely well in all cases.

Classifying handwritten digits
In this experiment we tested the ability of TTN and MERA classifiers on a number of handwritten digit recognition tasks and compared the performance of different parameterizations. MNIST [18] is a canonical dataset consisting of 70,000 labelled gray-scale images of handwritten digits from 0 to 9. From this dataset we generated four binary classification tasks. In the first we kept only images containing 0 or 1, and for the second task, only 2 or 7. For the third task we re-labelled all images as even or odd. For the final task we divided the images into those that were greater than 4 or not. MNIST images are 28 × 28 pixels. To allow for simulation using 8 qubits, we performed principal component analysis on the images for each task and kept only the 8 components with highest variance. Finally, we used Eq. (1) to encode the data. Of the 70,000 examples, 55,000 were used for training, 5,000 for validation and 10,000 for testing. Training was performed using the Adam optimizer [33] with batches of 20 examples. Validation and test set accuracy were recorded every 10 training batches, and training was stopped when validation set accuracy did not increase for 30 consecutive tests. Figure 3 shows typical learning curves for train and test datasets.
Mean accuracy and one standard deviation computed on five random initializations are given in Table 3. The 'Classifier' column describes if the circuit was a TTN, MERA or hybrid, that is, a MERA pre-trained with TTN. The 'Unitaries' column describes if the circuit was parameterized using a simple, general or ancilla gate set as described by Fig. 2. The 'Rotations' column specifies the type of rotation used, either real, SO(4), or complex, SU(4).
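The preprocessing pipeline for the image data (principal component analysis down to 8 components, followed by rescaling so that Eq. (1) can be applied) might look roughly like the sketch below. It uses an SVD-based PCA written directly in NumPy and random pixel values as a stand-in for real images; the function name `pca_project` and the stand-in data are assumptions for illustration only.

```python
import numpy as np

def pca_project(X, k=8):
    """Project the rows of X onto the k principal components of highest variance."""
    Xc = X - X.mean(axis=0)                        # centre each pixel
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # singular values are sorted

rng = np.random.default_rng(0)
images = rng.random((200, 784))  # random stand-in for flattened 28x28 images
Z = pca_project(images, k=8)

# Rescale each retained component to [0, pi/2] before qubit encoding.
lo, hi = Z.min(axis=0), Z.max(axis=0)
angles = (Z - lo) / (hi - lo) * (np.pi / 2)
```

In a real experiment the projection and the rescaling bounds would presumably be fitted on the training split only and then reused for validation and test data.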
Some remarks are in order. First, we note that the restriction to simple unitaries led to significantly lower accuracy than when using general unitaries. Second, complex rotations improved the accuracy of the classifiers in all tasks except for task '0 or 1' where accuracy was already > 99.5% with real rotations. It is notable that this is the case despite the input data being real-valued. Third, the MERA classifiers achieved higher accuracy than TTN classifiers in all cases, demonstrating the power of the additional unitaries. All networks besides those using the simple gate-set outperformed a logistic regression benchmark.
Finally, the hybrid classifier achieved accuracy comparable to that of the standard MERA. On average, hybrid classifiers required 2.452 times more training steps until convergence than standard MERA. However, the number of post-training steps required was only 0.825 times the number of training steps of standard MERA. This indicates that classical pre-training may lead to a reduction in the number of training steps carried out on the quantum computer, a potential advantage in the near-term.

Quantum data
We now consider the problem of classifying quantum data, that is, quantum states generated by different physical processes. A physical process can be simulated by a quantum circuit. By setting up two different quantum circuit layouts, we can generate synthetic classification tasks. Let us first define the building block for our quantum circuit layouts.
Our building block consists of single-qubit rotations $U_i$ for all qubits $i \in \{0, \dots, N\}$, followed by all the possible CNOT$_{ij}$ gates where i and j are control and target qubits, respectively, and $i < j$. The angles of the single-qubit rotations are the only parameters of our building block. By stacking several of these building blocks, we can generate deeper and more complex circuit layouts.
Table 3: Binary classification accuracy on the MNIST dataset.
In particular, we chose to identify the class with the number of building blocks in the stack (e.g., class 5 consists of 5 building blocks). Now, for each class, we can generate a quantum state by randomizing all the single-qubit gates, and then executing the circuit on the initial state $|0\rangle$. This is repeated many times in order to generate a dataset. As discussed in Sec. 3, we assume that each quantum state in the dataset can be directly fed into the quantum computer where the classifier is executed, hence not requiring any pre-processing. The task of the classifier is to determine which of two circuit layouts a state was generated from.
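The construction of the synthetic quantum dataset can be sketched as a state-vector simulation: each building block applies a random single-qubit rotation to every qubit and then a CNOT for every ordered pair with control index smaller than target index. The Euler-angle form of the random rotation, the uniform angle distribution, and all function names are assumptions made for illustration; the source only states that the single-qubit gates are randomized.

```python
import numpy as np

def apply_single(state, U, q, n):
    """Apply the 2x2 unitary U to qubit q of an n-qubit state vector."""
    psi = np.moveaxis(state.reshape([2] * n), q, 0).reshape(2, -1)
    psi = (U @ psi).reshape([2] * n)
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cnot(state, control, target, n):
    """Flip the target qubit on the control = 1 subspace."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[control] = 1
    axis = target if target < control else target - 1  # control axis is dropped
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=axis).copy()
    return psi.reshape(-1)

def random_rotation(rng):
    """Random single-qubit gate RZ(a) RY(b) RZ(c) with uniform angles (assumed)."""
    a, b, c = rng.uniform(0, 2 * np.pi, size=3)
    rz = lambda t: np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])
    ry = np.array([[np.cos(b / 2), -np.sin(b / 2)],
                   [np.sin(b / 2), np.cos(b / 2)]])
    return rz(a) @ ry @ rz(c)

def generate_state(n_blocks, n=8, rng=None):
    """State produced by stacking n_blocks building blocks on |0...0>."""
    rng = np.random.default_rng() if rng is None else rng
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for _ in range(n_blocks):
        for q in range(n):
            state = apply_single(state, random_rotation(rng), q, n)
        for i in range(n):
            for j in range(i + 1, n):
                state = apply_cnot(state, i, j, n)
    return state

sample = generate_state(n_blocks=5, rng=np.random.default_rng(0))  # one "class 5" state
```

Calling `generate_state` repeatedly with different block counts would yield labelled examples for tasks such as "1 or 10".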
Here, we work with circuits of N = 8 qubits. We generated datasets of D = 5,000 quantum states for each of the classes $y \in \{1, 2, 3, 5, 10\}$. To make sure that the synthetic classification task was well defined, we first looked for a strategy that could classify correctly most of the time. For each state, we computed the maximum bipartite entanglement entropy, $\max_A S(\rho_A) = \max_B S(\rho_B)$, over all possible partitions A, B of the 8 qubits. Figure 4 shows histograms of this quantity for three classification tasks. By inspecting the overlap of the distributions we can find an optimal threshold that would classify states correctly most of the time. This shows that the classification task is meaningful. We would like to stress that this is an intractable strategy. Its only purpose is to demonstrate that, in principle, there is a feature of the state that correlates with the class. The hope is that a hierarchical quantum classifier can find equally successful strategies in a tractable way.
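The maximum bipartite entanglement entropy used as a reference feature can be computed for small systems from the Schmidt decomposition, as in the following sketch. The entropy is reported in bits, which is an assumed convention, and the function names are illustrative. The loop over all subsets makes the exponential cost of this strategy explicit.

```python
import numpy as np
from itertools import combinations

def entanglement_entropy(state, subsystem, n):
    """Von Neumann entropy (in bits) of the reduced state on `subsystem`."""
    rest = [q for q in range(n) if q not in subsystem]
    psi = np.transpose(state.reshape([2] * n), list(subsystem) + rest)
    psi = psi.reshape(2 ** len(subsystem), -1)
    p = np.linalg.svd(psi, compute_uv=False) ** 2  # squared Schmidt coefficients
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

def max_bipartite_entropy(state, n):
    """Maximum entropy over all bipartitions; the loop grows exponentially in n."""
    best = 0.0
    for size in range(1, n // 2 + 1):  # S(A) = S(B), so half the sizes suffice
        for A in combinations(range(n), size):
            best = max(best, entanglement_entropy(state, A, n))
    return best

# Sanity checks on 4 qubits: a GHZ state and a product state.
ghz = np.zeros(16)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)
product = np.zeros(16)
product[0] = 1.0
ghz_entropy = max_bipartite_entropy(ghz, 4)
product_entropy = max_bipartite_entropy(product, 4)
```

For 8 qubits this requires 162 singular value decompositions per state, which is feasible in simulation but relies on full knowledge of the state vector.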
The classifier used for this task was a TTN like the one shown in Fig. 1. We considered two parameterizations; the first uses general gates such as those shown in Fig. 2 (b). The second uses arbitrary three-qubit gates where one of the qubits is an ancilla initialised in the state $|0\rangle$, as illustrated in Fig. 2 (c). The data described above was divided into training, validation and test sets. Each of these sets was balanced, that is, it had an equal number of states from each class. Training was performed for 3,000 iterations with batches of 40 states. Validation and test set accuracy was recorded every 50 iterations. Finally, the test set accuracy was recorded for the model with the highest validation set accuracy. Table 4 reports mean classification accuracy and one standard deviation computed on five random initializations. Results for the TTN with general two-qubit gates are no better than random class assignment in all tasks, indicating the need for a more expressive model. Indeed, when using gates augmented by an ancilla qubit, TTN was able to classify quantum states with some accuracy, suggesting that ancilla qubits may play a key role. The classification accuracy is higher for the '1 or 10' task; this is somewhat expected as the overlap of classes 1 and 10 shown in Fig. 4 (a) is less than that of the other tasks shown in Figs. 4 (b) and (c).
Finally, as a proof of principle, we verified the performance of a classical logistic regression model. We fed the vector of amplitudes to the model and trained with off-the-shelf software. The test accuracy was close to 50%, that is, no better than random. We stress that this approach is not feasible in practice, since merely providing the input in classical form would require full tomography of the quantum dataset.

Characterising the effect of noise on classification performance
Many machine learning models, including neural networks, are highly robust against the negative effects of noise, and many kinds of noise can help with convergence and even generalization [36,37]. In this experiment we tested the effect of depolarizing noise on the quantum classifier by simulating a depolarizing channel, a completely positive map $\Delta_\lambda$, parameterized by $\lambda$, that takes a $2^N$-dimensional state $\rho$ to a linear combination of $\rho$ and the maximally mixed state:

$$\Delta_\lambda(\rho) = (1-\lambda)\,\rho + \frac{\lambda}{2^N}\, I, \qquad (4)$$

where I is the identity. This depolarizing channel was applied to the system following the application of each unitary gate in the circuit. We used the TTN classifier for classes 1 and 2 of the Iris dataset (see Sec. 2). To test the effect of noise, we simulated the circuit using the IBM Quantum Experience and included depolarizing noise on single-qubit rotations and CNOT gates. The noise was increased from 0 to 0.5 in 0.01 increments. 401 computational basis measurements were performed on the simulated target qubit. As before, the predicted class was the most frequent measurement outcome. 200 trials were conducted for each noise level. Fig. 5 shows that the classifier accuracy decreases as the noise increases, but the mean accuracy across 200 trials remains above 95% for depolarizing noise up to λ = 0.07.
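The action of a depolarizing channel on a density matrix is simple to write down. The sketch below assumes the standard form in which the state is mixed with the maximally mixed state with weight λ, and shows how the probability of measuring 0 on a single qubit is pulled towards 1/2. The function name and the example values are illustrative assumptions.

```python
import numpy as np

def depolarize(rho, lam):
    """Mix rho with the maximally mixed state: (1 - lam) * rho + lam * I / d."""
    d = rho.shape[0]
    return (1 - lam) * rho + lam * np.eye(d) / d

# Single qubit prepared in |0>: P(0) drops from 1 to 1 - lam / 2.
rho0 = np.array([[1.0, 0.0],
                 [0.0, 0.0]])
noisy = depolarize(rho0, 0.07)
p_zero = noisy[0, 0]
```

Because the channel in the experiment is applied after every gate, the effective noise on the measured qubit compounds with circuit depth, which is one reason shallow tree circuits are attractive on noisy hardware.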

Deployment on a quantum computer
In this experiment we deployed the Iris classifier for classes 1 and 2 (see Sec. 2) on the ibmqx4 quantum computer available in the IBM Quantum Experience. As shown in Fig. 6, this TTN classifier has three CNOT gates and seven rotations in the Y direction. A test set of 34 unseen examples was used to determine accuracy. For each example, the circuit was run 400 times, and the samples were used to compute the most likely class. The circuit correctly classified 100% of the test set, and achieved a cost function value of 0.0811 (Eq. (3)).
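The per-example prediction rule, taking the most frequent outcome over repeated runs, can be sketched as follows. Here the device is replaced by a Bernoulli sampler with a given probability of reading out 1, which is an assumption standing in for the real hardware; the function name and the tie-breaking rule (ties go to class 0) are also illustrative.

```python
import numpy as np

def predict_from_shots(p_one, shots=400, rng=None):
    """Sample `shots` readouts of the target qubit and return (label, frequency of 1)."""
    rng = np.random.default_rng() if rng is None else rng
    ones = rng.binomial(shots, p_one)          # number of runs that returned 1
    return int(ones > shots / 2), ones / shots

label, freq = predict_from_shots(0.9, shots=400, rng=np.random.default_rng(0))
```

Using an odd number of shots, as in the noise simulations with 401 measurements, avoids ties altogether.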

Conclusion & Future Work
In this report we have demonstrated that hierarchical quantum circuits can be used to classify classical and quantum data. Circuits based on the Multi-Scale Entanglement Renormalization Ansatz (MERA) outperform simpler tree-like circuits known as Tree Tensor Networks (TTN). These circuits can be parameterized with a simple gate set that can be easily implemented on existing quantum computers. A trained model is shown to be resistant to depolarizing noise and is successfully deployed on the ibmqx4 quantum computer.
Both MERA and TTN are naturally extendable to larger inputs. In 1D each additional layer doubles the dimensionality of the input. It is less clear how to increase or decrease the modelling power of a circuit without changing the dimension of the input. In classical neural networks this is achieved by increasing the depth and breadth of the network. One possibility for accomplishing this with quantum hierarchical classifiers is to use χ-level quantum systems (qudits) for some suitable χ > 2 as the unit of computation, rather than qubits (χ = 2). Ref. [28] demonstrates that model expressiveness in tensor network classifiers can be increased by increasing the input and internal bond dimensions. This is equivalent to performing computation using qudits. Data can be encoded in qudits using a generalization of qubit encoding described in Ref. [29]. Whilst it is possible to simulate qudits with qubits, there are practical considerations that can make this challenging [38].
Currently it is unclear what network architecture is ideal for a classification task; a thorough examination of the role entanglement plays in classification circuits may help illuminate this. Consider the case of a TTN circuit applied to a product state input. In this circuit the measurement qubit interacts with each other qubit in the circuit at most once and therefore its entanglement with the rest of the circuit will increase as unitaries are applied. If the measurement qubit is highly entangled with the rest of the network it will struggle to minimize the cost function Eq. (3), but clearly it is necessary to introduce some entanglement into the network for correlations between input qubits to be shared. Such a trade-off may limit the effectiveness of TTN circuits, especially as they are scaled to larger inputs.
Constraining machine learning models using regularization can help them to generalize better to unseen data. Indeed, parameters with large magnitude are a characteristic of overfitting. The unitary constraint of quantum circuits naturally prevents parameters from becoming large, and it is likely acting as a strong regularizer. Additional regularization methods from the machine learning literature will become important in future quantum machine learning work. For example, the addition of noise during training of classical neural networks can also have a regularizing effect [36] and help the model to learn invariant representations [37]. In our study, we did not simulate circuit noise during the training phase, but we did show high resistance to depolarizing noise during the prediction step.
Much of the success of convolutional neural networks comes from their ability to learn layers of translation invariant representations using a shared set of weights. Translation invariance can be enforced in TTN and MERA by restricting the unitaries within each layer to be the same. Similarly, scale invariance can be enforced by restricting the unitaries between different layers of the circuit to be the same. The role of weight sharing in hierarchical quantum classifiers is a question for future research.
In this report we have identified two cases where classical simulation is thought to be exponentially more costly than execution on a quantum computer. The first of these, which we do not test, is when the hierarchical quantum classifier cannot be classically simulated even when the input is a product state, 2D MERA circuits being one such example. The second case is when the input data consists of entangled quantum states. Here, an entirely classical approach may require expensive tomography and become intractable as the system size grows. While there are many existing methods for classifying 2D classical data, developing methods for classifying quantum data is a promising research direction.