The Born supremacy: quantum advantage and training of an Ising Born machine

The search for an application of near-term quantum devices is widespread. Quantum machine learning is touted as a potential utilisation of such devices, particularly those out of reach of the simulation capabilities of classical computers. In this work, we study such an application in generative modelling, focussing on a class of quantum circuits known as Born machines. Specifically, we define a subset of this class based on Ising Hamiltonians and show that the circuits encountered during gradient-based training cannot be efficiently sampled from classically up to multiplicative error in the worst case. Our gradient-based training methods use cost functions known as the Sinkhorn divergence and the Stein discrepancy, which have not previously been used in the gradient-based training of quantum circuits, and we also introduce quantum kernels to generative modelling. We show that these methods outperform the previous standard method, which used maximum mean discrepancy (MMD) as a cost function, and achieve this with minimal overhead. Finally, we discuss the ability of the model to learn hard distributions and provide formal definitions for ‘quantum learning supremacy’. We also exemplify the work of this paper by using generative modelling to perform quantum circuit compilation.


Introduction
As fault intolerant quantum devices, with ∼ 80 − 100 qubits, begin to be built, we near the dawn of the Noisy Intermediate Scale Quantum (NISQ) [1] technology era. These devices will not be able to perform many of the most famous algorithms thought to demonstrate exponential speedups over classical algorithms [2,3]. However, they could provide an efficient solution to other problems which cannot be solved in polynomial time by purely classical means, given the limited resources of these near term devices. Showing this to be true is referred to as a demonstration of Quantum Computational Supremacy 1 .
Many of the aforementioned proof of principle problems utilise the measurement process inherent in quantum computation by generating samples from a quantum distribution. Typically, the distribution is sampled by applying a sequence of quantum gates and measurements to some initial state. If this sequence consists of quantum gates drawn from a 'universal' set, we can sample from any quantum distribution. A more restricted scenario is one where we are not permitted to utilise the full suite of universal gates. The motivation here is to generate circuits that are simpler to implement experimentally on NISQ devices. Some of these simpler circuits are thought still to be usable in demonstrations of quantum supremacy, with popular examples of sub-universal models including the process underpinning BosonSampling, [4], and Instantaneous Quantum Polynomial Time (IQP) Computations [5].
We will make an attempt to connect these sampling hardness results to their use in Quantum Machine Learning (QML), in more detail than has been previously attempted, [6]. In this spirit we utilise a learning model called the Born Machine [7,8] to perform Quantum Circuit Learning (QCL) [9,10] with NISQ devices. The core principle of the Born Machine is its ability to produce statistics which rely on the fundamental randomness of quantum mechanics, according to Born's Measurement Rule for a state |ψ and measurement outcome x: x ∼ P (x) = | x|ψ | 2 (1) By utilising quantum mechanics in machine learning we hope to be able to develop algorithms to be applied in the classical domain in areas such as accelerating recommendation systems or support vector machines [3,[11][12][13]. One may also find new applications for QML algorithms, which exist purely in the quantum domain. One example, which we will propose in Section 7.1, is in the mimicking of target quantum circuits by 'learning a circuit description' using samples from its output distribution. This could be seen as an attempt to automatically compile the model circuit onto the target circuit in such a way that the outputs from the circuits are indistinguishable to a classical observer.
This contrasts with previous attempts at quantum compilation which involve directly adapting one circuit to another. We believe out method is a new way to approach the problem of compilation on quantum hardware.
It is also hoped that quantum models would achieve 'better' performance on certain datasets, than any purely classical one. This is motivated by the supremacy arguments mentioned above, and will provide a central theme to this work. A physical demonstration of such a task would provide a definitive separation between quantum and classical machine learning algorithms in practice. This is a more challenging task than simply demonstrating quantum supremacy by itself, or even the verification of quantum supremacy, [14][15][16][17] but Section 3: The Ising Born Machine is defined and related to previous work. Our first contribution, to illustrate how the underlying circuit in the model is hard to classically simulate up to a suitable notion of error, is discussed.
Section 4: Kernel methods are introduced, along with both 'classical' and 'quantum' kernels. We describe methods to train Born Machine models, and introduce our contributions to this area; two new cost functions leading to differentiable training of the model. These are the Stein Discrepancy and the Sinkhorn Divergence, which we argue to be better than current approaches.
Section 5: Many ingredients in the training algorithm are shown to be hard for a purely classical system to perform, leveraging the hardness of the underlying quantum circuit, and relating back to Section 3.
Section 6: Numerical results are presented to illustrate these new training techniques.
Section 7: Two novel applications for the Ising Born Machine are discussed. The first is automatic compilation of quantum circuits, using purely classical data, and the second is in learning hard distributions, providing a more methodological approach to fulfilling the definitions introduced in Section 2. In depth analysis of these final applications is left to future work.

Preliminaries
We introduce some terminology and related work necessary for the reading of this paper.

Learning and Modelling
Machine learning broadly encapsulates the aspiration to be able to use algorithms to perform a task without explicit instruction, but deducing solutions using patterns or inference. Two common tasks for which ML techniques are useful are discriminative tasks, such as classification, and generative modelling. The former usually falls under the umbrella of 'Supervised Learning', in which an algorithm is trained using labelled data. The latter is typically used in 'Unsupervised Learning', in which no labels are provided, and the algorithm learns the relationships in the data by itself. The use case in this work will be parameterised generative modelling, which consists of three key components: Target/Data: A phenomenon from which we can extract sample observations.
Model: A parameterised structure which represents a characterisation of the target.
Training: A process of updating the model parameters, based on observations of the target, by sampling from the model and the target. This is achieved by evaluating some cost or notion of 'closeness' and calling some optimiser to compute updates.
The goal of the training is to ensure that samples from the model match the behaviour one would expect from the target. The particular model we shall introduce will perform a relatively newly defined paradigm known as Quantum Circuit Learning [30]. The general methodology for training in QCL can be broken down as follows: Compute Phase: A parameterised quantum circuit is applied to an initial configuration of qubits, typically the |0 ⊗n state, resulting in a parameterised final state.
Compare Phase: Some information about this state is extracted, be it a series of samples via measurements in the generative case, as we explore here, or some expectation value of an observable in the classification case [9].
Modify Phase: Based on this information, the parameters of the circuit are updated by a classical optimiser to minimise some cost function or 'error'. In gradient based methods, this update will depend on the gradient of the cost function.
This process is repeated multiple times, until there is some convergence or for a specified number of updates, in order to refine the model.

Classical Simulation of Quantum Computations
The central question behind Quantum Computational Supremacy is whether or not it is possible to design a classical algorithm which could produce a probability distribution q(x), which is close to a given quantum output distribution p(x). This notion of reproducing a quantum distribution can be formalised as classical simulation, of which there are two types. For our purposes, the more relevant notion is instead that of weak simulation, which better captures the process of sampling.
Definition 2.1 (Strong and Weak Classical Simulation). [16,31] A uniformly generated quantum circuit, C, from a family of circuits, with input size n, is weakly simulatable if, given a classical description of the circuit, a classical algorithm can produce samples, x, from the output distribution, p(x), in poly(n) time. On the other hand, a strong simulator of the family would be able to compute the output probabilities, p(x), and also all the marginal distributions over any arbitrary subset of the outputs. Both of these notions apply to some notion of error, .
As mentioned in [16], strong simulation 3 is a harder task than weak simulation, and it is this weak simulatability which we want to rule out as being classically hard. The specific instances of problems which are classically hard is captured by worst case and average case hardness. Informally, worst case implies there is at least one instance of the problem which is hard to simulate. This worst case hardness holds for IQP / QAOA circuits, [16,32], which we will illustrate in Section 2.3. A stronger notion is that of average case hardness, which has been proven for Random Circuit Sampling, [15], and BosonSampling, [4], but is only conjectured to hold for IQP circuits for example.
One could ask "What if we do not care about getting samples from the exact distribution, and instead an approximation is good enough?". Exact in this case refers to the outcome probabilities of the simulator being identical to those outputted by the quantum device; q(z) = p(z)∀z or = 0. This is a very important and relevant question to ask when discussing quantum supremacy since experimental noise means it could be that even quantum computers cannot produce the exact dynamics that they are supposed to, according to the theory. Worse still, noise typically results in decoherence and the destruction of entanglement and interference in quantum circuit, so in the presence of noise the resulting output distribution could become classically simulatable.
We wish to have strong theoretical guarantees that experiments which claim to demonstrate supremacy, even in the presence of reasonable noise, do in fact behave as expected. Since we are dealing fundamentally with probability distributions, there are many notions of error one could choose. This question is the backbone of this work, and is extremely relevant since it provides many variants of the problem that one could work with. One of the simplest examples of which is multiplicative error.

Definition 2.2 (Multiplicative Error).
A circuit family is weakly simulatable within multiplicative (relative) error, if there exists a classical probabilistic algorithm, Q, which produces samples, z, according to the distribution, q(z), in time which is polynomial in the input size, such that it differs from the ideal quantum distribution, p(z), by a multiplicative constant, c > 1: As noted in [33], it would be desirable to have a quantum sampler which could achieve the bound, (2), but this is not believed to be an experimentally reachable goal 4 . That is why much effort has been put in trying to find systems for which supremacy could be provably demonstrated according to the variational distance error condition, (3), which is easier to achieve on near term quantum devices.

Definition 2.3 (Total Variation (TV) Error).
A circuit family is weakly simulable within variation distance error, , if there exists a classical probabilistic algorithm, Q, which produces samples, x, according to the distribution, q(x), in polynomial time, such that it differs from the ideal quantum distribution, p(x) in total variation distance, : Intuitively, multiplicative error sampling is 'harder' since it must hold for all samples, i.e. the classical algorithm, Q, must capture all the fine features of the target distribution, p. In contrast, variation distance error indicates that the distributions only have to be similar 'overall'.

Quantum Circuit Classes
The circuit classes we introduce is strongly related to two classes, which have both had their relationship to Quantum Supremacy studied extensively [16,32,[34][35][36]. Both are 'subuniversal', in the sense that they are not powerful enough to directly simulate arbitrary quantum computations, but are believed to achieve something outside of the classically tractable regime. They are derived from an Ising-type Hamiltonian, and differ only in the final 'measurement' gate applied, which is a rotation gate applied immediately preceding a measurement.
Initially, a Hadamard basis preparation is performed. This is followed by a unitary evolution by m operators, each acting on the qubits in the set S j and described by: This is followed by the measurement unitary U f which is built from single qubit gates acting on each qubit.
Where X k , Y k , Z k are the canonical Pauli operators, [37], acting on qubit k. The final circuit is the following: Sampling from distributions produced by these circuits is performed by computational basis measurements of all qubits. We now show how the specific choices of parameters in (6) retrieve the circuit classes mentioned above.

Instantaneous Quantum Polynomial Time Computations (IQP)
The first example of a sub-universal class is that of Instantaneous Quantum Polynomial Time Computations (IQP) circuits [5]. IQP circuits have exactly the form of (6), but with the parameters of the measurement unitary set in the following way: This results in a final Hadamard gate applied to every qubit, since: Using only gates diagonal in the Pauli-X basis, and thus which commute, make IQP instantaneous but mean it is not able to a achieve the full power of universal quantum computation. However, it is still believed to be hard to classically simulate [16]: [16]). If the output probability distributions generated by uniform families of IQP circuits could be weakly classically simulated then the polynomial hierarchy (PH) would collapse to its third level.
A collapse of PH is thought to be unlikely at any level, giving us confidence in the hardness of IQP. In some sense, such a collapse to a certain level would be a generalisation of P = NP, which would correspond to a full collapse to the zeroth level. Theorem 2.1 and similar results in [38] are remarkable in their demonstration that quantum computers which are very much weaker than a universal BQP machine are still very difficult to classically simulate. In fact supremacy results of IQP also exist in the case of the more realistic variation distance error.
Theorem 2.2 (informal from [34]). Assume either one of two conjectures, relating to the hardness of Ising partition function and the gap of degree 3 polynomials, and the stability of the PH, it is hard to classically sample from the output probability distribution of any IQP circuit in polynomial time, up to a total variation error of = 1 384 .

Quantum Approximate Optimisation Algorithm (QAOA)
The second well known class which can be recovered from (6) is the shallowest depth version of the Quantum Approximate Optimisation Algorithm (QAOA) [36]. The QAOA is an algorithm to approximately prepare a desired quantum state, which encodes the solution to some problem that can be extracted by measuring the final state. The canonical example is MaxCut [36], which is an example of a constraint satisfaction problem. The QAOA is defined in terms of a 'cost' Hamiltonian, H z , and a 'mixer' Hamiltonian, H x (borrowing the terminology of [39]). The mixer Hamiltonian is assumed to be one which has an easily prepared ground state (typically a product state), for example: The goal of the QAOA is to produce a ground (or thermal) state of the 'cost' Hamiltonian, H z , which encodes some problem solution. This cost Hamiltonian can be exactly the exponent of the unitary in (4), where for each j, S j ⊆ [n]: In the most general form, a QAOA circuit consists of applying the unitaries (9) and (10) in an alternating fashion. A depth p − QAOA has p layers of these same gate sets acting in an alternating fashion, i.e. it produces a state: The 2p parameters, {γ,β} = {γ 1 , . . . , γ p , β 1 , . . . , β p } are optimised to produce the required state, which is assumed to be difficult to prepare directly. We are interested in the shallowest depth version of the algorithm, which produces states of the form: Since the mixer Hamiltonian in (9) is 1-local (each term acts on only a single qubit), the evolution by the unitary U QAOA f = e −iβHx can be decomposed into a tensor product of single qubit unitaries corresponding to rotations around the Pauli-X axis.
The γ parameters in (12) can be absorbed into the Hamiltonian parameters α j , and we allow β to be different for each qubit. Therefore, it can be seen that this corresponds to the following setting of the parameters in (5).
QAOA is interesting for our purposes because of the following supremacy result.
Theorem 2.3 (informal from [32]). Given an arbitrary QAOA circuit C of the form (6) with U QAOA f as in (13) the probability distribution over measurement outcomes is: If we have a poly-time randomised classical algorithm that takes as input a description of C and outputs a string x with probability q (x) satisfying the multiplicative error bound 5 : then PH collapses.

Supremacy of Quantum Learning
Here we provide, to the best of our knowledge, the first formalisation of what we call 'Quantum Learning Supremacy', specifically for distribution learning. We model our definitions around those provided in [40], which pertain to the theory of classical distribution learnability. Intuitively, a generative quantum machine learning algorithm can be said to have demonstrated 'Quantum Learning Supremacy', if it is possible to efficiently learn a representation of a distribution which for which there does not exist a classical learning algorithm achieving the same end. More specifically, the quantum device has the ability to produce samples according to a distribution that is close in total variation to some distribution, using a polynomial number of samples from that distribution. However, there should be no classical algorithm which could achieve this.
We now formalise this intuition. First we must understand the inputs and outputs to learning algorithm. The inputs are samples from the distribution to be learnt, either classical bitstrings, or which could be quantum states encoding a superposition of such bitstring states, i.e. qsamples [41]. A generator can be interpreted as a routine that simulates sampling from the distribution. As in [40], we will assume only discrete distribution classes, D n , over binary vectors of length n. Definition 2.4 (Generator [40]). A class of distributions, D n has efficient Generators, GEN D , if for every distribution D ∈ D n , GEN D produces samples in {0, 1} n according to the exact distribution D, using polynomial resources. The generator may take a string of uniformly random bits, of size polynomial in n, r(n), as input.
The reader will notice that this definition allows, for example, for the Generator to be either a classical circuit, or a quantum circuit, with polynomially many gates. Further, in the definition of a classical Generator [40] a string of uniformly random bits is taken as input, and then transformed into the randomness of D. However, a quantum Generator would be able to produce its own randomness and so no such input is necessary. In this case the algorithm could ignore the input string r(n).
While we are predominately interested in efficient learning with a Generator, one can also define a similar Evaluator : Definition 2.5 (Evaluator [40]). A class of distributions, D n has efficient Evaluators, EVAL D , if for every distribution D ∈ D n , EVAL D produces the weight of an input y in {0, 1} n under the exact distribution D, i.e. the probability of y according to D. The Evaluator is efficient if it uses polynomial resources.
The distinction between EVAL and GEN is important and interesting in this case since the output probabilities of even IQP circuits are # P-Hard to compute, [16] and also hard to sample from by classical means, yet the distributions they produce can be sampled from efficiently by a quantum computer. This draws parallels to examples in [40] where certain classes of distributions are shown not to be learnable efficiently with an Evaluator, but they are learnable with a Generator. We also wish to highlight the connections to the definitions of strong and weak simulators of quantum circuits, Definition 2.1 to reinforce the similarity between Supremacy and Learning. An Evaluator for a quantum circuit would be a strong simulator of it, and a Generator would be a weak simulator. However, we keep these definitions separate in order to connect the hardness and learnability ideas explicitly.
For our purposes, the following definitions of learnable will be used. In contrast to [40], who was concerned with defining a 'good' generator to be one which achieves closeness relative to the Kullback-Leibler (KL) divergence, we wish to expand this to general cost functions, d. This is due to the range of cost functions we have access to and our wish to connect to the quantum circuit hardness results mentioned above, which typically strive for closeness in (TV). A similar notion of an -good Evaluator could be defined. Definition 2.7 ((d, , C)-Learnable). For a metric d, > 0, and complexity class C, a class of distributions D n is called (d, , C)-learnable (with a Generator) if there exists an algorithm A ∈ C, called a learning algorithm for D n , which given 0 < δ < 1 as input, and given access to GEN D for any distribution D ∈ D n , outputs GEN D , a (d, )-Generator for D, with high probability: A should run in time poly(1/ , 1/δ, n).
In Definition 2.7, may, for example, be a function of the inputs to the learning algorithm. We may also wish to require a learnability definition which holds for all > 0. This definition would, however, be too strong for our purposes. In order to claim Quantum Learning Supremacy of a Learning Algorithm, we only need to achieve closeness up to a fixed TV distance. This will be discussed in more detail in Section 7.2. An illustration of the procedure can be seen in Figure 2. Finally, we define what it would mean for a quantum algorithm to be superior to any classical algorithm for the problem of distribution learning: Figure 2: Illustration of a learning procedure using a Generator.
Definition 2.8 (Quantum Learning Supremacy). An algorithm A ∈ BQP is said to have demonstrated the supremacy of quantum learning over classical learning if there exists a class of distributions D n for which there exists d, such that D n is (d, , BQP)-Learnable, but D n is not (d, , BPP)-Learnable.
As mentioned above, a typical choice of d would be d = TV, but one could imagine weaker definitions by using weaker cost functions, which will be discussed further in Section 7.2. One may also be more restrictive and look for a demonstration of Learning superiority by a class which was efficiently IQP-Learnable, but not BPP-Learnable. This case may be more challenging to prove theoretically, but may be more amenable for the near term, precisely the original motivation for Quantum Supremacy, and, indeed, implies Definition 2.8.

Ising Born Machine
The Born Machine (BM) [7,8] is the natural utilisation of the measurement postulate of quantum mechanics in generative modelling and applied to QCL. In particular, the Born rule gives this generative model both its name and its sampling process, as detailed in Section 3.1.
The Born Machine definition originated from tensor network approaches to define generative algorithms and the connection between physical systems and machine learning problems [42][43][44][45][46][47][48]. Since then, other works have given variants of the original definition [25], adversarial training approaches for the model [49], and adaptions to the continuous variable regime [50].
It is likely, since the statistics which the Born machine produces are generated by the fundamental randomness of quantum mechanics, that there should be no classical analogue to the model. In this regard, there is hope that a model could be defined which provably cannot be simulated by classical means, and by extension, could outperform any classical model for certain learning problems, as discussed in Section 7.2.
To accommodate our requirement for the model to be implementable on NISQ devices we will restrict the Born machine, which in general could be implemented using any quantum circuit, to an Ising version. In particular, the model will be a parameterised circuit of the form discussed in Section 2.3.

Definitions
The Model The Ising Born Machine Model is the state produced by the circuit discussed in Section 2.3, where in this case we have fixed S to be the set of all S j ⊆ [n] such that |S j | ≤ 2, i.e the computation consists of gates acting on either one or two qubits. The state obtained by this circuit is the following. (17) This restriction to two qubit gates suffices for the hardness proofs we discuss in Section 3.2.
The goal of the training will be to alter the parameters α j so that, upon measuring the state, it produces samples according to the target distribution. This will be done using QCL discussed in Section 2.1.
This approach can be compared to [9] where the authors use QCL in a classification algorithm. Typical approaches so far [8,30] require making the circuit depth as large as possible and introducing extra parameters through single qubit gates. Clearly, this approach would lead to better approximations to the data since more parameters typically leads to more accurate fits.
However, the approach we will use is somewhat different. We are interested in choosing a circuit class which is as shallow as possible, but which is sufficiently complex to be hard to simulate classically. For this purpose, we choose a model which encapsulates the subuniversal circuit classes mentioned in Section 2.3. Notice, we also choose the final gate to be U f , defined by (5), rather than the more standard R z R x R z decomposition found in other Born Machine works, [25,45]. Both are effectively equivalent; they can both generate any arbitrary single qubit gate (up to a phase). However, we chose our construction to make the hardness connection more transparent.
With the restrictions set discussed above, the term in the exponent of the diagonal unitary (18) can be written as an Ising Model Hamiltonian, [31,34]: The parameters α = {J ij , b k } can be viewed as the coupling and local magnetic fields respectively. The evolved state is then: Where θ = {Γ, ∆, Σ, α} refers to all parameters of the IBM circuit. To implement this circuit on NISQ hardware it is necessary to decompose the unitary, U z (α) into single and two qubit gates. This is straightforward to do since all the terms in U z (α) mutually commute. It is possible to find such a decomposition into two qubit CZ(α) gates, defined by the unitary matrix, (21), and single qubit rotations around the Pauli-Z axis.  Using the decomposition from [51], the relationship between a CZ ij (α) gate between qubit i and j, and an Ising interaction, exp(iαZ i ⊗ Z j ) is as follows: Therefore, U z (α) can be expanded as follows: Hence the specific circuit used will be given by (24): Previous approaches, [8,25] do not add parameters to the entangling gates, (CNOT gates in those cases), which has an advantage from an experimental point of view, but immediately increases the circuit depth. In contrast, by parameterising the entangling gates U z (J ij ) = exp(iJ ij Z ⊗ Z) also, we can get up to n 2 more parameters 'for free'. Of course, this may reduce the effectiveness of the learning algorithm in a physical implementation, since in the latter approach there is only a need to create some entanglement, whereas we require creating specific entanglement, i.e. a precise implementation of the parameters, J ij . However, we will leave a more thorough treatment of this trade-off to future work.
Sampling As mentioned above, the probability distribution of measuring a quantum state, |ψ in the computational basis, with outcome string, x, is given by Born's Rule, (1), which defines the core operating principle of a Born Machine. The resulting output probability distribution can be written as follows: Training We provide novel training methods for the model, which could outperform those given in the literature [8,10,25]. This out-performance will be measured by the closeness of the model to the data relative to total variation. As we shall see, this will involve a trade-off between the classical and quantum computing power required to train the model, which will be specific to the training algorithm. Previous methods advocate increasing the quantum part (by increasing the circuit depth), whereas we attempt to provide means to reduce the circuit depth, by leveraging more classical computation power to squeeze as much out of the given circuit as possible. This type of approach would make training quantum models more applicable to near term devices, and corresponds to leaning more heavily on the 'Classical' part of Figure 3. We adopt a 'plug-and-play' mentality in this work, depending on the compute resources available, whether quantum or classical, we expand the realm of choices one can choose from. Our contributions in this regard can be summarised as follows, corresponding to the various parts of the Figure 3.
(1) We define a new sub-universal circuit class, the IBM(θ) class.
(2) We introduce new cost functions, namely L B .
(3) We derive the corresponding gradients for these costs.
(4) We introduce the option to have quantum kernels in generative modelling.
The training process of the IBM is a hybrid approach, containing one quantum 'generation' phase, and one classical 'update' phase. It proceeds as in Figure 3 and as follows: 1. Compute Phase: Instantiate circuit parameters, θ = {J ij , b k , Γ k , ∆ k , Σ k } at random, and either in full generality or adapted to the specific hardware, [52].
2. Compare Phase: Apply circuit (17) to initial state, and measure all qubits in computational basis to generate one sample, z. Repeat N times to generate a set of z = {z 1 , . . . , z N } samples. This is illustrated by the left of Figure 3. Compute some loss function, L B with respect to data samples. This is the 'classical' part of the algorithm in Figure 3.
3. Modify Phase: Compute gradient of L B and update parameters θ using a classical optimiser according to update rule, θ ← θ − η∂ θ L B . Repeat, indicated by leftward arrow in Figure 3, until convergence of L B to a minimum.
This training process, and cost functions, L B , will be detailed in Section 4.

Multiplicative Error Hardness of Ising Born Machine Circuits
In this section, we demonstrate how the core of the IBM, the underlying circuit, should be hard to sample from efficiently by purely classical means, up to multiplicative error, by pulling together the relevant hardness results from previous works in this area. We will prove the following.
Theorem 3.1. If the output probability distributions generated by uniform families of IBM(θ) circuits could be weakly classically simulated to within multiplicative error 1 ≤ c ≤ √ 2 then PostBPP = PP (and hence the polynomial hierarchy collapses to the third level), where ∀k, i, j,: and for each of the following instances: As discussed in Section 2 the following choices of Γ, ∆, Σ give the more well known circuit classes: These are simply worst case hardness results and they tell us nothing about the specific circuit which would be hard, only that there exists one, generated by these intermediate gates, which is hard to simulate classically up to multiplicative error. See Appendix A for a proof of Theorem (3.1), and a concrete separation of our work from previous works.

Total Variation Distance Hardness of Ising Born Machine Circuits
We can improve the hardness of the model by incorporating a stronger result about the circuit class, IQP. The IBM model must be initialised to some setting of the circuit parameters to begin the training process. One such possible initialisation is to randomly assign J ij , b k to some subset with uniform probability. The random initialisation is typical in Machine Learning, but it is not the only way one could initialise the parameters. In this case however, we will do so in order to use the IQP result of [34].
Firstly, define the Partition Function, Z, associated with the Ising Hamiltonian, (18): Now, an output amplitude of an IQP circuit can be written as this partition function: Now the more formal result of Theorem 2.2 states the following: 34]). Assume it is # P-hard to approximate |Z| 2 up to a relative error 1/4 + o(1) for 1/24 fraction of instances over the choice of the weights and biases, J ij , b k . If it is possible to classically sample from the output probability distribution of any IQP circuit in polynomial time, up to an additive error of 1/384 in total variation distance, then there is a BPP NP algorithm to solve any problem in P # P , and hence the polynomial hierarchy collapses to the third level) This holds if the parameters are chosen uniformly at random from {J ij , b k } ∈ {0, π 8 , . . . , 7π 8 }, which corresponds to randomly choosing a circuit from the IQP circuit class according to some measure over the unitary group. We refer to [34] for the proof of the above. This is done by introducing a conjecture (included in Theorem (3.2)), which claims that on average (i.e. over a 1 24 fraction of instances) the Ising partition function is hard to compute up to a multiplicative error. Again, this holds only in the worst case, and translates between an average case conjecture in multiplicative error into a worst case result in total variation error. Using this, if we were to choose the setting of parameters such that they correspond to an IQP circuit, 2 } , and initialise J ij , b k to the values above, then we shall start with an IBM in a regime which is also hard to simulate classically up to a variation distance error in the worst case. Now, a random initialisation of parameterised quantum circuits has been shown to lead to 'barren plateaus', [53], in which the gradient with respect to some quantum circuits becomes exponentially small, and so one would need exponential resources to estimate it. This indeed could be an issue for training such IBM circuits as the number of qubits scales, but this question is currently under investigation, with some potential solutions found for circuit initialisation [54]. However, in the following sections, we will assume a random (but fixed) initialisation for simplicity as we are solely interested in the performance of various cost functions. Furthermore the barren plateau issue should not be a major issue for the small system sizes we consider here.

Kernel Methods
Before diving into the cost functions used to train the IBM, we will need to refer to kernel methods, which are a workhorse of machine learning. Kernel methods are a popular technique, for example in dimensionality reduction, and are beginning to be brought into the quantum realm [55,56]. A kernel is a symmetric function, κ : X × X → R, which is positive definite, meaning that the matrix it induces, called a Gram matrix, is positive semi-definite.
In models like support vector machines (SVM), the idea is to to embed the underlying sample space, X , into a Hilbert Space using a non-linear map, φ : X → H called a feature map. A kernel is simply defined by the inner product of this feature map at two different points in a Reproducing Kernel Hilbert Space (RKHS). Intuitively, the map should be designed to make the points more easily comparable in the RKHS. The choice of the kernel function (i.e. the choice of the RKHS) may allow different properties to be compared. For a comprehensive review of techniques in kernel embeddings, see [57].
Every feature map has a corresponding kernel, given by the inner product on the Hilbert Space: Theorem 4.1 (Feature Map Kernel). Let φ : X → H be a feature map. The inner product of two inputs mapped to a feature space defines a kernel via: The full proof that this feature map gives rise to a kernel, a positive semi-definite and symmetric function, can be found in [55] for quantum feature maps, φ(x).
The Hilbert Space is 'reproducing' because the inner product of the kernel at a point in the sample space, with a function on the Hilbert space, in some sense evaluates or reproduces the function at that point: We can also define the mean embedding, which will become relevant in Section 4.3: A canonical example of a kernel is the Gaussian Kernel, defined by: where || · || 2 2 is the Euclidean distance, or squared 2 norm. The parameter, σ, is known as a bandwidth, and determines the scale at which the samples are compared. It is also possible to use a Gaussian mixture; a sum of Gaussians with different bandwidth parameters, as in [8], which will be our benchmark with which to compare.
As mentioned above, we wish to use quantum kernels, i.e. those arises as a result of an state overlap in a Quantum RKHS. This notion was first presented by [55]. To the best of our knowledge, we are the first to investigate the possibility of introducing quantum kernels to generative modelling, as we shall do. The form of the kernel assumed by [55] is the one induced by the inner product on a quantum (i.e. complex) Hilbert space: Given that we are trying to exploit some advantage by using quantum computers, this definition is natural. However, it should be noted that the above definition of a kernel defines it to be a mapping from the sample space to C, i.e. the kernel is the inner product of two wave functions which encode two samples and is therefore a transition amplitude. As mentioned in [55], it would be desirable to find kernels which are hard to compute classically, in order to gain some quantum advantage. The first potential candidate for such a situation was derived by [56]. As such, to remain consistent with the notation of [56], the kernel is defined as the real transition probability instead. As mentioned in [55] it is possible to define a kernel this way by taking the modulus squared of (37), and defining the RKHS as follows: Theorem 4.2 (Quantum RKHS). let Φ : X → H be a feature map over an input set X , giving rise to a real kernel: κ(x, y) = | Φ(x)|Φ(y) | 2 . The corresponding RKHS is therefore: The final ingredient is the discussion of the computability of this kernel function, since the evaluation of the cost function introduced in Section 4.2 and training process requires this to be done many times.
In [55], the authors investigate kernels which are classically efficiently computable in order to facilitate testing of their classification algorithm, for example they only study socalled Gaussian states, a key ingredient in continuous variable quantum computing, [23], which are known to be classically simulable. However, to gain a quantum advantage it would be desirable to use a kernel which is (conjectured to be) not classically computable efficiently. For this reason, we use the kernel proposed in [56], which is defined by a feature map constructed by the following quantum circuit: so the resulting kernel is: Of particular interest, and to add further motivation to why this particular type of circuit is chosen, is its relationship to the Ising model. This can be seen through the choice of the unitary encoding operators, U Φ(x) : This is exactly the same form as the diagonal unitary in the IBM, (18), with the parameters, θ replaced by the feature map with sample x, φ S (x). However, this is not an IQP circuit due to the extra final layer of these diagonal gates 6 in (39). It should be noted that this is experimentally favourable to work alongside the original IBM circuit since the same setup (i.e. layout of entanglement gates, and single qubits rotations) is required for both circuit types, albeit with different parameters. Just like in the Ising scenario, only single and two-qubit are operations are required: The arguments of the gates, (42), used to encode the samples is given by: The choice of this kernel is motivated by [56], which conjectures that this overlap, (40), will be classically hard to estimate up to polynomially small error, whereas the kernel can be computed using a quantum device, up to an additive sampling error. A rough bound for the number of measurements required to compute the full matrix is also given by [56]. The only difference is that in that case, the Gram matrix was computed using only samples from the same source. In our example, this need not be the case. We shall see that it in general, it requires |B| samples from one source (the IBM for example), and |D| samples from another (the data). The resulting Gram matrix will then be of dimension |B| × |D|. However, to simplify the expressions, we can assume that we have the same number of samples from each: |B| = |D| = T . Therefore, directly from [56], to compute a single kernel entry to precision˜ = O(R −1/2 ) requires R = O(˜ −2 ) measurement shots. Then an approximation,K, of the Gram matrix for the kernel, K x,y = κ(x, y), which has T (T − 1) non-trivial entries, that is -close in operator norm ||K −K|| ≤ can be determined using

Cost Functions for Ising Born Machine
To train the Born Machine in general, it is necessary to have some 'cost function' to compare instantaneously between our model distribution, given by p θ (x), and the data distribution which we are trying to learn, represented by π(y), using the notation of [8]. This should be distinguished from the kernel methods in the previous section, a kernel can only compare between individual points, x, y. However, we need to also be able to distinguish between the distributions which generate those points.
Furthermore, we will consider only gradient based methods to train the model, for example stochastic gradient descent. Of course, it is possible to train these models using gradient free approaches [10], but these tend to perform poorly in large parameter spaces. Gradient descent involves the following updates to the parameters of the model at each 'epoch', d, where one epoch refers to one iteration over all parameters: where L B is the cost function in question. η is the learning rate, and controls the speed of the descent. The first distribution measure which could be used is the Kullback-Leibler (KL) Divergence. The cost function associated with the KL divergence is given by: Unfortunately, it is not possible to use the KL divergence for training the Born Machine, as noted by [8]. This is because the gradient of the KL cost, L KL , cannot be estimated efficiently for these types of QCL models: The notation p ± θ will be explained in the next section. As noted in [8,16], computing the required probabilities p θ is # P-hard, and so (46) cannot be computed efficiently. Furthermore, the KL Divergence is an example of a so-called f -divergence [58], and such measures are notoriously hard to estimate in practice. A potential solution to this is to use an alternative family of discrepancies, called Integral Probability Metrics (IPMs). This approach was first adopted by [8] in the form of the Maximum Mean Discrepancy (MMD) to train a Quantum Circuit Born Machine (QCBM), and we will expand on it in this work.
General IPM's are defined by the following equation: Where P, Q are defined over a measurable space, M. The class of functions which are chosen, F, defines the specific metric.
The MMD is one of the simplest such example of an IPM, since it is relatively easy to compute and more details on this point will be given in Section 7. The MMD was first used in [59] as a hypothesis test, to test if two distributions are identical. For a more thorough treatment of the properties of this metric, see [60,61].
The MMD can be extracted from (47) by restricting the class of functions, F, to be a unit ball in a Hilbert space: The second IPM which will be relevant here is the total variation (TV) was also shown in [58], to be the only cost function which is both an IPM, and an f -divergence. The TV can be obtained by taking the function space as follows: Where ||φ|| ∞ = sup{|φ(x)|, x ∈ M}. Also, when working with a discrete 7 space, M = X , the total variation is given by: This measure is particularly important, since it occurs in the definition of additive error simulation hardness, Definition (2.3). It will also be the benchmark we shall use for numerical experiments in Section 6. Finally, the third useful IPM is the Kantorovich Metric, defined by: where || · || L is the Lipschitz semi-norm, defined by: where d is a metric on X . It turns out due to the Kantorovich-Rubinstein theorem [62], this is also equivalent to the Wasserstein Metric and related to the notion of Optimal Transport, which shall be discussed in more detail in Section 4.5.

MMD Training of Ising Born Machine
Due to the restriction of the function space for the MMD, it is possible to use the kernel trick discussed in Section 4.1 and equate the class of functions to a feature map in a RKHS, φ ∈ F, where F = {φ : ||φ|| H ≤ 1}. In this case the feature maps embed samples into the unit ball in the Hilbert Space, where || · || H is the norm in the space, H. It can be shown that the MMD is exactly the difference in mean embeddings (35) between the two distributions, [63,64]: Using this, it is possible to define a cost function associated to this metric, exactly the one used in [8,25] for their versions of the Born Machine: where κ is the MMD kernel. A requirement for the MMD to be useful (as a hypothesis test) is that the kernel used to define it must be characteristic or universal, [65,66]. This is one property which allows it to be a valid metric on the space of probability distributions, with γ MMD (P, Q) = 0 ⇐⇒ P ≡ Q. This enables the MMD to be used in hypothesis testing to determine if the null hypotheses (P = Q) holds or not. The Gaussian kernel (36) is indeed one which is characteristic [65], and some effort has been made to find conditions under which a kernel is characteristic [60].
In the case where the support of the distributions is discrete, as is the case here, the condition on the kernel being characteristic is strict positive definiteness [58]. Strict positive definiteness of the kernel is defined as the resulting Gram matrix being positive definite 8 . Now, to set the scene, we revisit the work of [8], in which the MMD is used to train the Born Machine. In [58], it is shown that the MMD can be estimated, given i.i.d. samples from two distributions, x = (x 1 , . . . , x N ) ∼ P , y = (y 1 , . . . , y M ) ∼ Q. This is done by simply replacing the expectation values in (55) with their empirical values.
As shown in [58], the above (56) demonstrates both consistency and lack of bias, and furthermore it converges fast in probability to the true MMD: This quadratic convergence rate is highly desirable, since it does not depend on the dimension of the space from which the samples are drawn. This will be discussed further in Section 4.5.1. There is one point that must be checked in order to carry on with this. The derivation of the MMD loss estimator requires that the kernel be a bounded and measurable function. This is the case for Gaussian kernels, but we must check it holds for the quantum kernel, (40). Happily, this is the case: the overlap between two states is bounded above by 1, and the sample space consists of binary strings.
Now, to compute the derivative of this cost function, we follow the method of [8]. This approach will apply to all cost functions we employ in this work. In that work, the quantum gates with trainable parameters, η, are of the form U (η) = exp(−i η 2 Σ), where Σ can be a series of operators satisfying Σ 2 = 1.
It is known [8,30] that the gradient of an observable of the circuit, B, with respect to a parameter, η, is given by: where η ± are parameter shifted versions of the original circuits. As noted in [8], taking the observable to be a measurement in the computational basis, B = M z = |z z|, we arrive at the following form of the gradient of the probabilities: The factor of −1/2 difference with this gradient from that of [8] is due to the slightly different parameterisation we choose, our gates are instead of the form: exp(iηΣ) : As mentioned above, this gradient requires computing the parameter shifted versions of the original circuit, p ± θ k . This notation indicates that the kth parameter in the original circuit has been shifted by a factor of ± π 2 ; p ± θ k = p θ k ±π/2 . This formula is actually valid for the parameter in any unitary which has at most two distinct eigenvalues [67].
Using this, the derivative of this loss function with respect to a given parameter, θ k , is given by: where we have P, Q samples,â = {a 1 , . . . , a P },b = {b 1 , . . . , b Q } drawn from the parameter shifted circuits, p − θ k (a), p + θ k (b) respectively and N, M samples,x = {x 1 , . . . , x N },ŷ = {y 1 , . . . , y M } drawn from the the original Born circuit and the data distribution respectively, p θ (x), π(y). The parameter shifted circuits are of the form (63): where the notation indicates U l:m = U l U l−1 . . . U m+1 U m , and each gate k carries one of the Ising parameters, α k or one of the set of measurement unitaries, {Γ k , ∆ k , Σ k }, since these also could be trained.

Stein Discrepancy Training of Ising Born Machine
While the MMD seems to work as an alternative cost function to the KL Divergence to train a Born Machine, [8,25] and it has many advantages such as being in a simple form, and efficiently and accurately computable using finite samples, it is known to be a relatively weak measure of discrepancy between distributions. This 'weakness' will be discussed further in Section 7. As such, we propose the first alternative cost function for training the IBM, called the Stein Discrepancy (SD). This discrepancy has become popular for goodness-offit tests, i.e. testing whether a given set of samples come from a particular distribution or not, as opposed to the MMD, which is typically used for kernel two sample tests. The SD was originally proposed in [68], which also provided as a method to compute it using linear programs, since in general the computation involves high dimensional integrals, similarly to the MMD. A simpler form was derived by [69], which introduces a kernelised version, allowing it to be computed in closed form. The Discrepancy is derived from Stein's Identity which is given by (in the case where the sample space is one-dimensional, x ∈ X ⊆ R): where s q (x) = ∇ x log(q(x)) is the Stein Score function of the distribution q, and A q is a so-called Stein operator of q. The functions, φ, which obey the above Identity, (64), are said to be in the Stein class of the distribution q. From Stein's Identity, (64), one can define a discrepancy between two distributions, p, q, by the following optimisation problem, [70]: Note the second term in (65) is zero by (64) if and only if p ≡ q, in which case D s (p||q) = 0 by (64). Exactly as with the MMD, the power of the above discrepancy, (65) will depend on the choice of the function space, F, and by choosing it to be a RKHS, a kernelised form which is computable in closed form can be obtained. Also, this form of (65) is very reminiscent of that of the Integral Probability Metrics, (47). However, to make the SD applicable to this case, we must make a key alteration. The above, (64) is only defined for smooth probability densities, p, q, which are supported on continuous domains, e.g. R, due to the gradient term, ∇ x , in (64). In the case of the Born Machine, the support of the density is discrete (it is composed of binary strings), so the standard gradient is not defined on the sample space. Fortunately, this has been addressed by [70], which adapted the kernelised SD to the discrete domain, by introducing a discrete gradient 'shift' operator. Without loss of generality, from here on we will assume we are dealing with d-dimensional sample vectors, x ∈ X d ⊆ R d . First of all, we shall need some definitions introduced by [70]: . For a set X of finite cardinality, a cyclic permutation ¬ : X → X is a bijective function such that for some ordering x [1] , x [2] , . . . , Definition 4.2 (Partial Difference Operator and Difference Score Function). Given a cyclic permutation ¬ on X , for any vector, x = (x 1 , . . . , x d ) T ∈ X d . For any function f : X d → R, denote the (partial) difference operator as: Define the (difference) score function for a positive probability mass function, p(x) > 0∀x as: Furthermore, [70] also defines the inverse permutation by ¬ : , and the inverse shift operator by: However, this is slightly too general for our purposes, as the sample space for a single qubit is binary: X = {0, 1}. In this case, the forward and reverse permutations are identical, so ∆ ≡ ∆ * . The discrete versions of the Stein Identity, and the kernelised Stein discrepancy are also derived in [70], and are in an identical form to the continuous case. There is however, one interesting difference. In the continuous case, the class of functions, φ ∈ F, which obey Stein's Identity (i.e. those which are in the Stein class of the operator A p ) are only those for which φ(x)p(x) vanishes at the boundary of the set, ∂X [69]. Interestingly, this restriction is not necessary in the discrete case. Due to the definition of the discrete shift operator, it turns out that all functions φ : X d → R are in the Stein class of the discrete Stein operator. This allows us the freedom to use the quantum kernel, κ Q , (40), in the SD. This is possible by making a small adaption to the proof of Theorem 2 in [70] so that the discrete Stein Identity also holds for all complex valued functions also, due to the complex nature of the feature map in (38). For any function φ : X d → C, and a probability mass function p on X d : The proof is given in Appendix C. Furthermore, as in [70], the functions can also be extended into complex valued vector functions, with an analogue of the above result. In this case, for Φ : X d → C m , the discrete Stein Identity is: shifting the i th element of the j th function value. Now we can reproduce Theorem (7) from [70]: Theorem 4.4 (Theorem 7 in [70]). The Discrete Kernelised SD is given by: where κ q is the Stein kernel: As long as the Gram matrix for the kernel, K ij = κ(x i , y j ) is positive definite, the kernel, κ, will be strictly positive definite. Hence, if a given kernel on a discrete space gives rise to a positive definite Gram matrix, this kernel induces a valid discrepancy measure [70]. Note that as mentioned above in Section 4.3, this criterion is exactly the same as that which makes the MMD a valid discrepancy measure in the discrete case, [58].
While the SD is in a similar form to the MMD, there is one key difference. Namely, the score function, s q (y) in the worst case, requires computing the probabilities of the distribution we are trying to learn, q(y). If we let p θ (x) be the output from the IBM, and again, π(x) is the data distribution over binary strings, we have the Stein Cost function for the IBM given by: Now, to employ the SD loss function in gradient descent, we will need its gradient with respect to a given parameter, θ k , which is given by: The gradient derivation is identical to that of the MMD and given in Appendix D. The major difference between (76), and the gradient of the MMD (61) is the different kernel which is used, κ π vs. simply κ. This clearly makes the Stein Discrepancy more challenging to compute, but it also potentially gives it extra power as a distribution comparison mechanism. This fact is numerically reinforced in Section 6. Also, in contrast to the MMD, the SD in (74) is asymmetric since the π weighted kernel only depends on the distribution π.
The SD is computed between the instantaneous model distribution (p θ ) and the distribution to be learned, (π) by drawing N = |B| samples from the IBM, x ∼ p θ (x), and for each pair of samples, (x, y) the π-weighted kernel, κ π (x, y) is computed.
In doing so for a single pair, we must compute the original kernel, κ as a subroutine. If the quantum kernel of (40) is chosen, this requires O −2 N 4 9 [56], measurements of the quantum circuit, as discussed in Section 4.1.
However, adding to the computational burden, we also must compute the following 'shifted' versions of the kernel for each pair of samples, (x, y), in both parameters: which are required in (75). Let K be a polynomial, where it takes O(K) time to compute the entire original kernel matrix: where T (n) is a polynomial in the size of the input n (the number of qubits) describing the running time of the quantum circuit required to compute the kernel for one pair of samples, (x, y). We now examine the efficiency of computing the shifted terms in (74) using the quantum kernel of (40).
For example: , y), . . . , κ(x, y)) T − (κ (¬ 1 x, y), . . . , κ(¬ n x, y)) T We assume κ(x, y) has been computed for a single pair of samples in time O( −2 N 2 ×T (n)), and we need to compute κ(¬ i x, y) for i = {1, . . . , n}. Therefore, computing the shifted kernel operator in a single parameter takes O( −2 N 2 × T (n) × (n + 1)). The same holds for the kernel gradient with respect to the second argument, ∆ y κ(x, y). For ∆ x,y κ(x, y), the process is slightly more involved because: Each individual term in the respective sums requires the same complexity, i.e. O( −2 N 2 × T (n)) so the term tr∆ x,y κ(x, y) overall requires O( −2 N 2 × T (n) × (3n + 1)). Therefore, each term in κ π can be computed efficiently using the quantum kernel, (40). That is, with the exception of the score function s π for the Data distribution. If we are given oracle access to the probabilities, π(y), then there is no issue and SD will be computable. Unfortunately, in any practical application this will not be the case. To deal with such a case, in Section B, we give two approaches to approximate the Score function via samples from π. We call these methods the 'Identity', and 'Spectral' methods for convenience. We will only use the Spectral method in training the IBM in the numerical results in Section 6, since the former method does not give an immediate out-of-sample method to compute the score, as discussed further in Section B. Notice that even with the hurdle (difficulty in compute the score), the SD is still more suitable than the KL divergence to train these models, since the latter requires computing the circuit probabilities, p θ (x), as mentioned in Section 4, which is in general intractable, and so could not be done for any dataset. In contrast, the Stein Discrepancy does not require the circuit probabilities, only the data probabilities, which may make it ameanable for QCL using some datasets.

Sinkhorn Training of Ising Born Machine
The final cost function we shall consider is the so-called Sinkhorn Divergence (SH). This is a relatively new way to compare probability distributions, [71][72][73]. As discussed above, the Stein Discrepancy still is not totally suitable for our purposes. While it seems to be a stronger cost function than the MMD, it may have exponential complexity required to compute the probabilities required in the score function. It seems we have overshot the mark. Now, can we find some middle ground? In other words, does there exist a cost function we can choose, which is still 'stronger' than the MMD, but easier to compute than the SD.
As motivated by our introduction of the SD, a stronger distance would presumably provide better results than the MMD for generative modelling. Firstly, we may consider the Wasserstein or Kantorovich Metric, (52). The Wasserstein Metric happens to be a special case of so-called optimal transport. The solution of the 'optimal transport' problem gives the optimal way to move, or transport, probability mass from one distribution to another, and hence gives a means of determining distribution similarity. This problem of optimal transport has been widely studied, and has a rich history, so we refer to [74] for a thorough treatment.
The Optimal Transport Distance is given by: where P, Q are the marginal distributions of U , i.e. U(P, Q) is the space of joint distributions over X × Y such that x U (x, y) = Q(y), y U (x, y) = P (x), in the discrete case. c(x, y) is the 'cost' of transporting an individual 'point', x, to another point y. It somewhat plays a similar role to the kernel function in the MMD. If we take the optimal transport 'cost', to be a metric on the sample space, X × Y, i.e. c(x, y) = d(x, y) we get the Wasserstein metric, which turns out to be equivalent to the Kantorovich Metric, revealing the connection to the IPMs discussed in Section 4.2: The Wasserstein Metric has become very popular in classical ML literature, for example in the definition of an Generative Adversarial Network (WGAN) [75] 10 .
However, it does suffer from a severe drawback; In d dimensions, its sample complexity scales as O(1/N 1/d ), and furthermore it is difficult to compute. As a remedy to this, regularisation was introduced in order to smooth the problem, and ease computation, which is a standard technique in ML to prevent overfitting. Now, as noted in [78], (one) regularised version of optimal transport introduces an entropy term, in the form of the KL Divergence as follows: It turns out that the regularised version allows a cost function to be defined, which interpolates between the Wasserstein, and the MMD. As such, it allows one to take advantage of the small sample complexity, and therefore ease of computability of the MMD, but the desirable properties of the Wasserstein Distance. The relative entropy term serves to determine how distant the coupling distribution, π, is from a product distribution P ⊗ Q, and effectively smooths the problem, such that it becomes more efficiently solvable. Based on this, [72,73] define the Sinkhorn Divergence which we can appropriate for the IBM: As shown by [73], as → 0, (84) becomes the unregularised Optimal Transport distance: OT c (p θ , π), and as → ∞, (84) becomes the MMD with a kernel given by the negative of the Sinkhorn cost, κ(x, y) = −c(x, y): The extra terms in (84) relative to (80) ensure the Sinkhorn Divergence is unbiased, since OT c (P, P ) = 0 in general. Now, similarly to the previous two cost functions, we can derive gradients of the Sinkhorn Divergence, with respect to the given parameter, θ k . According to [73], each term in (84) can be written as follows: f and g are the so-called optimal Sinkhorn potentials, arising from a primal-dual formulation of optimal transport. These are computed using the Sinkhorn Algorithm, which gives the divergence its name, [79]. These vectors will be initialised at f 0 (x) = 0 = g 0 (x), and iterated in tandem according to (88) for a fixed number of 'Sinkhorn iterations' until convergence. The number of iterations required will depend on the value of , and the specifics of the problem. Typically, smaller values of epsilon will require more iterations, since this is bringing the problem closer to unregularised optimal transport, which is more challenging to compute. For further discussions on regularised optimal transport and its dual formulation, see [73,80]. Now, following [73], the SH can be written as: s, t are the 'autocorrelation' dual potentials, arising from the terms OT c (p θ , p θ ), OT c (π, π) in (84).
The reader should note that in (86), the dependence on the data distribution, π, is hidden in the dual vector, f . Following [73], we can see how by discretising the situation based on N, M samples from p θ , π respectively;x = {x 1 , . . . , x N } ∼ p θ (x),ŷ = {y 1 , . . . , y M } ∼ π(y). With this, the optimal dual vectors, f, g are given by: C(x, y) is the so-called optimal transport cost matrix derived from the cost function applied to all samples, C ij (x i , y j ) = c(x i , y j ) and LSE N k=1 (V k ) = log N k=1 exp(V k ) is a logsum-exp reduction for a vector V, used to give a smooth approximation to the true dual potentials. The autocorrelation potential, s, is given by: t(y i ) can be derived similarly by replacing p θ → π in (89) above. However, the autocorrelation dual can be found using a well-conditioned fixed point update [73], and convergence to the optimal potentials can be observed with much fewer Sinkhorn iterations, l: Of course, now that we have discussed how to compute the Sinkhorn Divergence, it is time to examine its gradient with respect to the circuit parameters, as usual. The gradient of L SH with respect to a single probability of the observed samples, p θ (x i ) is given by [73]: However, this only applies to the samples which have been used to compute f, s in the first place. If one encounters a sample from p θ ± k (which we shall in the gradient), x s , which one has not seen in the original samples from p θ , one has no value for the corresponding vectors at this point f (x s ), s(x s ). Fortunately, as shown in [73], the gradient does extend smoothly to this point (and all points in the sample space) and in general is given by: Where g (0) , s (0) , are the optimal vectors which solve the original optimal transport problem, (88,89) at convergence, given the samples,x,ŷ from p θ , π respectively. Given this, the gradient of the Sinkhorn Divergence with respect to the parameters, θ k is given by: Therefore, one can compute the gradient by drawing samples from the distributions,x ∼ p θ ± k , and computing the vector ϕ(x) , for each sample, x ∈x, using the vectors, g (0) , s (0) already computed during the evaluation of SH at each epoch.

Sample Complexity of Sinkhorn Divergence
It is of critical interest to our purpose to investigate the sample complexity of the Sinkhorn Divergence, and in doing so mention the sample complexity of Wasserstein, and the MMD. This will be necessary in our attempts to use them for a quantum model. We shall see that due to the results of [81], we will be able to define a somewhat 'optimal' cost function for the Born Machine, in order to squeeze as many favourable metric properties from the Wasserstein Metric, but still leave the resulting model efficiently computable. This will be determined by the parameter in (84).
As mentioned above, the sample complexity of the MMD scales as O(1/ √ N ), and it is known that the Wasserstein Distance scales as O(1/N d ) [82] for a distribution supported on a subset of R d . The former indicates that the MMD can be computed to an accuracy with O( −2 ) samples regardless of the underlying space, and the latter means the Wasserstein requires O( −d ) which is exponential in the size of the space. Since we are dealing with binary vectors of length n, the support of our distributions (the IBM and the data distribution) will be X = {0, 1} n ⊆ R n11 . These two cases correspond to the extreme regularisation values of → ∞ and → 0 respectively. However, the motivation for using the Sinkhorn Divergence is to leverage the favourable sample complexity of the MMD, with the stronger Wasserstein Distance. This -regularisation hyperparameter allows us to choose a cost function which optimally suits our needs. By the results of [73], we can be guaranteed that no matter which value is chosen, the SH will be suitable as a cost function to train the IBM, and in fact any generative model. This is because it is zero for any distributions which are identical, and strictly positive otherwise: It also metrizes convergence in law, effectively meaning it can be estimated using N samples, and will converge to its true value in the limit of large samples: Now, from [81], we have the following two results. The first is the mean difference between the true Sinkhorn, L SH (P, Q) and its estimator derived from the empirical distributions, L SH (P N ,Q N ) is given by: Theorem 4.5 (Theorem 3 from [81]). Consider the Sinkhorn Divergence between two distributions, P , and Q on two bounded subsets, X , Y of R d , with a C ∞ , L−Lipshitz cost c. One has: where κ = 2L|X | + ||c|| ∞ and constants only depend on |X |, |Y|, c and ||c l || ∞ for l = 0, . . . , d/2 .
The second is the following concentration result: Corollary 4.5.1 (Corollary 1 from [81]). With probability at least 1 − δ, κ S is the Matern or the Sobolev kernel, associated to the Sobolev Space, H s (R d ), which is a RKHS for s > d/2, but we will not go into further detail here.
The more exact expression for (4.5) is given by: 11 A general quantum distribution can be supported on this entire space. Now, for our particular case, we wish to choose the Hamming distance as a metric on the Hamming hypercube. However, due to the smoothness requirement of the above theorems, c ∈ C ∞ , this would not hold in the discrete case we are dealing with. However, we can take a broader view to simply use the 2 , or Euclidean distance, and embed the Hamming hypercube in a larger space. This is possible because the Hamming distance is exactly the squared 2 cost 12 , but restricted to binary vectors: In this scenario, formally, we are dealing with the general hypercube in R d , but where the probability masses are strictly concentrated on the vertices of the hypercube. Now, we can compute directly some of the constants in the above, Theorem (4.5) and Corollary (4.5.1). Taking X to be the unit hypercube in R d , and taking the Sinkhorn cost to be the 2 cost, which is Lipschitz continuous, we can compute the following: A rough upper bound for the Lipschitz constant, L, can be obtained as follows. For a function, f : A → B, the Lipschitz constant is the smallest value L such that: If we take d A to be the sum metric on the product space, X × Y, and f to be the 2 norm squared, we get: For two points, (x 1 , y 1 ), (x 2 , y 2 ) ∈ X × Y, and the cost c : X × Y → R, c = || · || 2 2 . Now, We want to find an upper bound for the left hand side of (107), assuming that x 1 = x 2 and y 1 = y 2 13 . Now, applying the trick that we have embedded the Hamming hypercube in the general hypercube, we can assume x i 1,2 , y i 1,2 ∈ {0, 1}∀i. To derive such a bound, we can bound both the numerator and the denominator of the LHS of (107) independently. We find the denominator is as small as possible, when only one element of x 1,2 or y 1,2 is equal to one, and all the rest zero. The numerator is as large as possible when one of x 1,2 or y 1,2 is the all-one vector. In this case, the LHS is upper bounded by d, so we can choose L = d, which is a constant for a fixed d.

So, (103) becomes:
The constants in O 1 + 1 d/2 , depend on |X |, |Y|, d, and ||c (k) || ∞ which are at most linear in d. Similarly the concentration bound is: ). Now, clearly, due to the asymptotic behaviour of the Sinkhorn Divergence, we would like to choose sufficiently large in order to remove as much dependence on the dimension, d, as possible. This is because, in our case, the dimension of the space is equivalent to the number of qubits, d = n, and hence to derive a favourable sample complexity, we would hope for the dependence on d to be polynomial in the number of qubits. By examining, (112), we could see that a good choice might be = O(d 3/2 ) = O(n 3/2 ). In this case, we get: with probability 1 − δ. It is likely in practice however, that a much smaller value of could be chosen, without blowing up the sample complexity. This is evidenced by numerical results in [72,73,81].

Hardness of Classically Simulating the Ising Born Machine
In this section, we revisit the hardness results from Section 3.2, which dealt only with the underlying circuit class of the Born Machine itself for a variety of parameters. We are now in a position to claim that all of the individual ingredients in the training algorithm should be equally as hard as the original circuit. Of course, by its very nature, the algorithm is a hybrid one, hence classical optimisation is clearly crucial to facilitate training. Since the computation of the cost functions rely on computing expectation values of the outputs of quantum circuits, we might not expect a provable advantage, since a classical algorithm could likely compute the probabilities to a comparable accuracy as a quantum circuit to do so, and then compute an estimator for the cost functions, given the # P-Hardness of the probabilities. However, we can leverage the sampling hardness results to claim the individual ingredients should not be possible by purely classical means efficiently, including the updated circuits found by gradient descent, and the parameter shifted gradient circuits themselves, (63).    Generate samples from the shifted distributions, Compute gradient of L B using all samples,

Algorithm 1 IBM(θ) Training Algorithm
end for 22: end for Collecting the results required in Theorem (3.1) is necessary, since we want to make the strongest arguments for the intractability of classically simulating the Ising Born Machine. If we were not careful, the parameter updates could lead us into a regime which was classically simulable. Now, the initialisation of the parameters, θ (0) , will lead to a circuit class which is hard to sample from classically in the worst case, up to variation distance error, if U f is chosen such that the IBM is exactly an IQP circuit, otherwise it will only be (provably) hard up to multiplicative error. These samples can then be used to compute the cost function, L B . Now computing the gradients, ∂L ∂θ k , for all choices of cost function, B = {MMD, SD, SH} requires running parameter shifted circuits, p − θ k , p + θ k . If the original circuit is hard to sample from, these shifted circuits will also be hard to sample from. This is due to the fact that they only differ by a single parameter shift, which can be reabsorbed into the original parameters such that the resulting circuit also resides in the hard circuit classes. As an example, using an IQP circuit, if the parameter to be updated is J ij , then that parameter should have been initialised according to Theorem (A.2): Therefore: for integers, d, l 2 2ν±1 A relabelling of l ± 4d → k, 1 4 (2ν ± 1) → µ, where µ is irrational if ν is, and k is still an integer, gives that the parameter shifted circuits should have exactly the same structure as (114), and hence should be just as hard.
The above argument illustrates how the circuits required to compute the gradient for a given parameter, J ij , should be as hard as the original circuit. However, the training algorithm requires update the parameters of the IBM circuit, over a number of epochs, in order to provide better fits to the data. As detailed above, this is done using the following update rule: We can ensure that the system remains in a class which has worst case hardness to simulate as long as we start with an initial configuration of the parameters that demonstrates supremacy, and update them such that this remains the case. For example, if we choose the initial parameters, θ (0) = 2πν, where ν is irrational, then an update will be of the form θ (d+1) = θ (d) + µ. If we choose µ = 2πα, then the new parameters at epoch, d + 1, will be θ (d+1) = 2πβ, β = ν + α 14 Since ν was irrational originally, β will also be irrational, and therefore the new configuration should be hard to simulate classically. Although, clearly we cannot allow updates like α = −ν(+δ), where δ is rational since this will result in the parameter going to 0(δ) and could make the model classically simulatable. As an example of this, choose the following circuit parameters for the IBM({J ij , b k }, −Γ k , ∆ k = 0, Σ k = 0), which is a p = 1 QAOA circuit, and then set Γ k = 0. In this case, we create a uniform superposition from the initial Hadamards, 1/ √ 2 n z |z , and then apply gates in the computational basis. This will only add phases to each contribution e iθ |z j . If we have no final 'measurement' gate (i.e. the parameter is zero), this is equivalent to measuring the state in the computational basis at the end. Since the phases do not have the ability to interfere with each other, they will have no effect on the measurement probabilities, and the final distribution will be simply the uniform distribution over all n-bit strings, z. This can clearly be classically simulated by a sequence of coin flips. Another example is if we were to allow the entangling parameters, J ij → 0. This would lead to a circuit composed only of single qubit gates, which again is not classically hard.
There is, of course, a caveat to this argument. The hardness results in Theorem (3.1) are only valid in the worst case. This means that there exists some instance of the problem generated by the set of parameters which cannot be classically simulated efficiently. More specifically, while with each instance of 'hard' parameter values, we may end up in a parameter landscape which admits is a worst case hardness result, there is obviously no 14 α will be the learning rate multiplied by the gradient, η ∂L ∂θ (d) l . guarantee that the particular instance implemented in the IBM is in fact the hard one, and we do not claim to have found such a thing. Furthermore, there is also no guarantee that we go from the 'hard' circuit generated from one set of parameters, to the hard circuit using the next set of parameters (given by the gradient update). Nevertheless, we believe this is a step forward in the methodology to design Quantum Machine Learning algorithms which demonstrate some quantum supremacy over their classical counterparts.
Further adding to this hardness argument is the required computation of the intermediate quantum-hard kernel (40), should that be the one which is chosen. This computation must be done for all pairs of samples required in the loss, L B and its gradient, and as such, increases the conjectured hardness of classically simulating the training algorithm. This is due to the argument of [56], which conjectures that this kernel should be hard to compute for any classical algorithm, up to additive error.

Numerical Results
In this section, we present numerical results demonstrating the ability of Algorithm 1 to learn a particular data distribution. We implement the algorithm with the three cost functions of Section 4, namely the MMD, Stein Discrepancy and Sinkhorn Divergence. The results are produced using Rigetti's Forest platform [83]. Both the simulator, or Quantum Virtual Machine (QVM), and sub-lattices on the Quantum Processing Unit (QPU), the Rigetti Aspen 16Q chip, are used. This gives us access to perfect and realistic implementations respectively. Our implementation can be found at [84], which uses some code from [85] to compute the Sinkhorn Divergence and its gradient.
The data used is a collection of samples from the distribution of (117). This is the same simple toy example which was used to train a Quantum Boltzmann Machine, [86], and the Quantum Approximate Boltzmann Machine (QABoM) [39].
The use of this distribution is in contrast to other recent works on the Born Machine [6,8,10,25] which use the more canonical Bars and Stripes dataset [87] to train.
To generate this data, T 'Bernoulli' hidden modes are randomly chosen. Each mode (k) is randomly chosen to be a binary string, s k = [s k 1 , . . . , s k n ], and a given sample, y, is produced with a probability which depends on the Hamming distance, d (k,y) H , between each mode string, k, and that sample. In all of the following, the Adam [88] optimiser was applied, using the hyperparameters suggested in that paper, i.e. β 1 = 0.9, β 2 = 0.999, = 1 × 10 −8 , and initial learning rate, η init . This was chosen since it was found to be more robust to sampling noise [8]. Also, in all of the simulated results below, we assume a fully connected QVM for simplicity, as illustrated in Figure 6a and Figure 7a.
We use the total variation distance, TV, to compare training methods. We chose TV as an objective measure, as opposed to, say the KL divergence as chosen in [52], for a couple of reasons. Firstly, the classical hardness results mentioned in this work all aim to prove Quantum Supremacy with respect to this measure. Secondly, TV plays a key role in our discussion of learning hard distributions, covered in Section 7.
We aim for two properties that our cost functions should exhibit in order to claim they have outperformed the MMD training method: • Speed of Converge: Both cost functions, SD and SH should achieve equal or lower TV than the MMD in a shorter time period (even accounting for various learning rates).
• Accuracy: Since the cost functions we employ are in some sense 'stronger' than MMD, we want to see them achieve a smaller TV than is possible with the MMD in an equal or quicker time.
It should be noted that the TV appears to behave strangely during training, often initially increasing. The reason for this is that it is not be minimised directly, only through auxiliary cost functions. As such, we would not expect to see a monotonically decreasing curve. It is also clear that it would not be possible to compute TV in general as the number of qubits scales, but it is possible to do so to benchmark the small examples we use here.
For all of examples here, we use a QAOA structured IBM, with the final measurement angle fixed for all qubits; Γ k = π/4∀k for simplicity. As such, only the Ising weights and biases, {J ij , b k } were trained. We also did not initialise or restrict the parameter updates to be in the 'hard' regime discussed here, but allowed them to vary arbitrarily according to gradient descent. This is because, as far as the numerical results are concerned, we were only interested in testing the performance of our novel cost functions.
Firstly, in Figure 4 and Figure 5, we compare the use of the quantum kernel of (40) against the classical Gaussian one of (36) which was used in [8]. These results indicate that the Quantum kernel (40) is both suitable for training the IBM, and seems to converge faster when used in the MMD with all other parameters being identical. This is indicated in Figure 4a and Figure 5a. However, the computation of this kernel is likely to not be feasible for near term quantum devices, but does illustrate the potential of quantum kernels first explored in [55,56]. Figure 4 and Figure 5 illustrate the difference in training between the Quantum and Gaussian kernels, for two and four qubits respectively. The data was taken during 5 independent runs in each case, with averages and errors computed over the runs. This behaviour is apparent even for a range of learning rates, which we show. In the two qubit case, there is not a major difference between the two, but the discrepancy becomes more apparent as the number of qubits increases. This could be an indicator of the quantum kernel's extra expressive power and ability to capture finer details of the distribution, as in the four qubit case it can actually achieve a lower TV than the Gaussian kernel, used in [8]. Of course, this quantum kernel may not outperform all other kernels, or even different choices of the bandwith parameters in (36), but it is encouraging that such results were observed on a randomly selected dataset. Whether or not this performance is due to some fundamental 'quantum' reason, or simply that the kernel is more complicated, requires further study. The latter seems unlikely since the dataset is completely classical.
Next, Figure 6 and Figure 7 illustrate the differences between using the MMD cost function and either the Sinkhorn Divergence or the Stein Discrepancy to learn the same dataset. We found that with the exception of the two qubit case, both the Stein Discrepancy and the Sinkhorn Divergence are able to learn with a higher speed of convergence, and accuracy as shown in Figure 6b Figure 7b. It should be noted that the results for the Sinkhorn training were highly dependent on the value of the regularisation, as expected. Also, simply because the Sinkhorn achieved better results on one particular data set, does not imply that it would do so over most. However, the fact that the distribution in question was randomly generated, does look encouraging and supports this claim.
In the case of the Stein Discrepancy, we run the training algorithm for 125 epochs. The reason for this is twofold. Firstly, when training using the 'Exact' Stein Score (i.e. with       oracle access to the data probabilities π) the convergence is so fast that the model would tend to jump out of its convergence point minimum after a certain period of time. Hence we employed a form of early stopping, as indicated in Figure 6b. Secondly, for the case of the 'Spectral' Score method, (Figure 6b and Figure 7b) we used much less samples than with the previous methods as can be observed. This is due to the (classical) computational cost of computing the Score using this method in our naive implementation. We also took the best run of the Stein training algorithm, rather than an average in the other cases. This was due to the low sample numbers, which lead to a high deviation in the training path taken, so we removed it in order to not obscure the other results. For computing the Spectral Score, we used 3 and 6 Nyström eigenvectors for 3 and 4 qubits respectively, see Appendix B.
Finally, one may comment on the fact that we allowed the Stein Discrepancy to use the exact probabilities of the data, π, and this constitutes and unfair advantage against the MMD. In fact, the high number of samples we used ensured that the approximate data distributions that the MMD received was in fact very close to the exact data, and we found no major improvement for training with the MMD by allowing oracle data access, i.e. the exact probabilities, π, reinforcing the fundamental weakness of the MMD that is discussed above.
Finally, we perform experiments on the 16 Qubit QPU of Rigetti, Aspen, as seen in Figure 8 and Figure 9, to determine the performance of the training in practice. We used two sublattices for 3 and 4 qubits respectively, the Aspen-4-3Q-A and Aspen-4-4Q-A, and their respective qvm versions. As before we ran 5 independent runs of the training procedure from the same initial condition, and took averages over the run. We restricted to the native connectivity of the chip, as seen in Figure 8e, and used the available qubits in the Aspen-4-3Q-A chip, (10,11,17), with no direct connection between qubits 11 and 17, as illustrated. Taking this into account, we did not enforce a fully connected topology since this would have resulted in the compiler implementing SWAP gates. A similar restriction was enforced for the 4 qubit version in Figure 7a. This is one reason why the models trained reasonably well on the hardware, with the fact that the two qubit gates in our IBM are native to the Rigetti chip (they implement CZ gates as the entangling links) also providing an advantage. Figure 8a and Figure 9a compare training using the MMD with a Gaussian kernel and the Sinkhorn Divergence, benchmarked relative to TV as with the previous numerical results. As expected, the QPU noise leads to a higher variance between training runs, but the large number of samples taken (500) still permists the model to learn quickly. In both examples, it can been observed that the Sinkhorn cost function enabled faster and more accurate training, which becomes more evident as the number of qubits increases. It was also necessary to use quite an aggressive learning rate for four qubits while training with the MMD over SH, 0.15vs.0.1, in order to get the model to train at all within the given timeframe. Finally, it was also observed that the absolute time for the training to complete (over 100 epochs) was actually less when training was performed (with 4 qubits) on the actual QPU itself, rather than when performed using the QVM exclusively. Such a phenomenon is obviously expected in general, but it was surprising to see it for only 4 qubits.

Applications of the IBM
We propose two applications of the IBM. The first has more practical applicability, while the second is of more theoretical interest.

Automatic Compilation of Quantum Circuits with Classical Data
The first application is the use of the IBM to compile quantum circuits using classical data from a target circuit. This is likely a process which could not be simulated efficiently classically, particularly given the arguments presented in this paper. By its very nature, a Born machine provides a means of automatic compilation of quantum circuits, it finds an optimal configuration of the circuit parameters, θ, which reproduces a given probability distribution that could arise from a quantum circuit. This is similar in flavour to the recent work, [29] and [28], which provides a closely related idea using a parameterised quantum circuit, V (θ), as an ansatz, and updating the parameters to reproduce the behaviour of a given target unitary, U . The associated cost function minimises some cost between the two unitaries, but this cost function involves embedding the two unitaries in a larger circuit, which would be expensive in terms of quantum resources. The approach we suggest is different to the standard approaches in quantum circuit compilation 15 , in which the problem is to compile one (known) unitary directly into another. In the near term it is unlikely we would be able to embed unitaries into larger circuits in order to compute these cost functions, as in [28]. In particular, this straightforward approach becomes unfeasible as the size and complexity of circuits grows, so we should increase the tools we have available to tackle the problem. A more (at least experimentally) realistic task would be to try and compile based on classical measurement results from the target circuit. Of course, in this approach, there is no guarantee that neither the target unitaries, nor its parameters, would be identical in terms of their operation, only that the output distributions would be somewhat close, given the ansatz training circuit we have chosen. Nevertheless, the ability to compile quantum circuits using classical data may be a useful toolkit to have, particularly for benchmarking quantum device relative to one another. For example, this could be useful in a case where one does not have access to the direct evolution being implemented, but only measurement statistics from it, perhaps in the simulation of molecules.
Furthermore, this mindset could potentially be incorporated into quantum chip design, to extract the qubit layout/connectivity which gives a maximal amount of quantumness, given experimental constraints.
To illustrate this in a simple example, we apply the same techniques as above to train a circuit with a random initialisation of parameters. We begin with a p = 1 QAOA circuit, with a fixed final parameter in the measurement unitary, Γ k = π/4∀k, and train the parameters, {J QAOA ij , b QAOA k } exactly as before to represent the output distribution of an IQP circuit (where the measurement unitary is simply a Hadamard) with parameters, {J IQP ij , b IQP k }. Notice this is effectively the same as one IBM attempting to learn to represent another: While this may seem trivial, since the two parameterised circuits seem the same initially as the layout of the parameterised sections will be the same, they will actually give rise to different distributions due to the interference provided by the final measurement unitary, which is different in both cases. We illustrate this in Figures (10 11) for two and three qubits. We assumed a fully connected graph, which is possible to access using the Rigetti QVM simulator but it would not be possible to directly access such connectivity on the QPU. We leave a more thorough analysis of this, similar to the work of [52], to future work. Figure 10a and Figure 11a illustrate the circuit parameters which are achieved by Sinkhorn training. For ease of viewing, the cirucuit parameters are multiplied by a factor of 10. It is interesting to notice that, as discussed above, the final parameters of the QAOA circuit are not the same as the IQP target, however the resulting distributions only deviate by 1.5 × 10 2 in total variation, in the three qubit example in Figure 11c. It is likely that increasing the depth of the ansatz circuit would achieve better fits.

Learning Hard Distributions
Now, we hope to motivate why it could be possible to combine ideas in the previous Section 7.1 with the definitions in Section 2.4 so that the IBM could learn 'hard ' distributions. This concept is tightly related to the theory of PAC (Probably Approximately Correct) Learning which has been introduced to the quantum realm, [89], and in fact the theory of distribution learning was inspired by the PAC framework [40]. A major question in this field is whether or not qsamples (i.e. classical data samples presented in superposition to a learning algorithm) improve the learnability of certain concept classes. In some cases this is the case, but it is not generally true, [90]. In any case, it is a very interesting avenue to explore.
As intuition, and a motivating example, if a particular quantum chip is developed in the near future, which has provably demonstrated a quantum advantage in the sampling problems discussed above. Given a series of samples from such a quantum device, we could test to see if that distribution could be learnt by a new quantum device, which could not be represented by any classical device 16 . If the new device could, for example learn an approximation which was -close to the 'hard' distribution in TV, then the new device must also have the capability to do something non-classical. This line of thinking could potentially lead to a method to determine new classically-hard circuit classes. A major proof for an advantage using a quantum machine learning algorithm on a near term device, would be the ability of the IBM to learn a hard distribution efficiently. This is related to the Definition 2.8 in Section 2.4 in the following way. Let's suppose the metric we choose is the total variation distance, d(p θ , p H ) = TV(p θ , p H ) between our learning model (the IBM, p θ ) and p H which is a distribution which is 'hard' for a classical device to sample from, for example an IQP distribution.
From the definition, 'Quantum Learning Supremacy' would be achieved if a quantum device (BQP) was able to output a generator, GEN p θ (a parameterised quantum circuit), which was -close to the true distribution, p H , with probability 1 − δ: Now we know that the Born Machine has the capacity to represent that hard data distribution, [6] but can it actually infer enough about it in order to represent it itself?       An obvious approach to this problem would be to use the (TV) itself as a cost function, and minimise it with respect to the hard distribution. Now, if p H is produced by a quantum supremacy circuit class, then by the hardness arguments, there can exist no randomised poly-time classical algorithm, Q, that can produce samples according to a distribution, q such that the following holds: Where δ is ideally a constant but will depend on the hard distribution p H , for example, with IQP, δ = 1/384 [34]. Now, if it is possible to learn a representation of p H using the IBM, then we can claim that no classical algorithm could exist to simulate the IBM either to the required precision in TV. More concretely, if we are able to train the IBM to achieve closeness in a variation distance to a constant, , of the hard distribution, p H then: so: Using the triangle inequality, and the last line follows if it possible to get within of p θ using the algorithm Q, and so this is a contradiction to the implausibility of (121). This idea is very closely connected to the verification of these supremacy circuits and resulting distributions. The verification problem for a supremacy experiment is to determine if the physical quantum device is actually producing samples from the correct or ideal distribution, be it a BosonSampling experiment, or a Random/IQP circuit for example. As discussed above, the desired supremacy criterion, would be such that no classical algorithm could reach the distribution in question up to a constant total variation distance. Therefore, we must check whether or not the physical quantum device is actually achieving this! This is of crucial importance with the approach of the quantum supremacy era, since benchmarking a quantum device is almost as critical as building it in the first place. Potential solutions currently include including some bias in the circuit which could be checked, [5,17], making assumptions about the device using cross-entropy benchmarking, [15,91] or by computing certain probabilities of the circuit [14].
However, actually training the Born Machine using the TV is not a practical thing to do; we can see from (122) that to do so would require computing the output probabilities of the IBM, and the hard data distribution, which is # P-Hard in general.
Therefore, we could suggest using another metric as a cost function, ideally one which upper bounds the TV. If such a metric could be found, which is sample-efficient to compute, one could minimise it in a learning algorithm, which would directly minimise the TV through the upper bound.
Unfortunately, the MMD is not ideal for this purpose, as it actually lower bounds the total variation distance, [58]: In the cases of the two kernels we use in this work, C = 1: So the supremum is equal to one, for all x. Unfortunately, a lower bound on the TV is not sufficient, and as a result minimising the MMD between the distributions says nothing necessarily about the value of the TV between them. This is also the reason why we opted for stronger cost functions than the MMD in this work. An alternative choice would be the Optimal Transport Metric or Wasserstein Distance (81). In fact, TV is in fact a form of Optimal Transport with a trivial cost, [92].
From [92], we get the following inequalities for a discrete space, X 17 : where diam(X ) = max{d(x, y), x, y ∈ X }, d min = min x =y d(x, y), and d(x, y) is the ground metric used in the Wasserstein Distance, (81). If, for instance, we were to choose d(x, y) to be the 2 metric on the hypercube as discussed in Section 4.5.1 between the binary vectors, x, y, i.e. the Hamming distance, then we get that d min = 1, diam(X ) = √ n, and so: We might hope therefore that by using the OT 2 0 distance as a cost function to train the IBM, (which can be estimated using samples), we would be able to minimise it sufficiently to fall within the Total Variation threshold for Quantum Supremacy. Unfortunately, as noted in Section 4.5, the sample complexity of the Wasserstein distance is still exponential in the number of qubits, and hence we would not expect efficient training using this metric either.
Worse still, is the recent work of [93], in which it is proven the quantum supremacy 'candidates' that exist in the literature, for example, IQP, Random Circuit Sampling, BosonSampling, all require exponentially many samples to certify that the distribution arising from the quantum device is in fact close to the required one. More specifically, they show that no efficient 'testing' algorithm exists for these circuit classes which can pass a so-called Total Variation Identity Test. This problem involves being able to determine if the output distribution is correct (the sampled distribution is identical to the ideal one) or at least away in TV without making any assumptions about the device. This result potentially implies that learning these distributions efficiently is also not possible 18 , due to the relationship between learning, and property testing of distributions, [94,95].
However, hope is not lost. As conjectured in [93] we could hope for the existence of some distributions which can be efficiently tested (and also learned) but which still admit a sampling hardness result. It is with these hypothetical distributions for which we could hope to prove some classical-quantum separation in learnability to satisfy Definition 2.8.
Secondly, the result only applies to the classical certification of these distributions. In our case, we have a quantum model, and one could hope for the possibility that a quantum learning model would be able to infer more information from the samples than is possible using a classical model alone, given that these quantum models do have the capacity to represent hard distributions.
Finally on this note, we recall a last result from [81]: Theorem 7.1 (Theorem 1. from [81]). Let P and Q be probability distributions on X , Y subsets of R d , such that |X | = |Y| ≤ D and assume that c is L-Lipschitz w.r.t. x, y. Then: Again, for our specific case, The log term will be positive as long as ≤ √ de 2 , in which case the Sinkhorn Divergence will give an upper bound for the Wasserstein distance, and hence the TV through (127). Of course, for this low value of , it would take exponentially many samples to approximate the Sinkhorn Divergence to make this bound, as evidenced by [81], and Section 4.5.1 above, since there is a gap of a factor of d between the two values for . Note if we took the cost, c, to be the 2 metric directly in Section 4.5.1, the gap would be a factor of √ d. It would be of interest to study this kind of relationship, hopefully this line of thinking would be able to define more optimal algorithms, for both verification of quantum devices using these cost functions, and the efficient learning of hard quantum distributions.

Conclusion
In conclusion, we have studied a specific generative quantum machine learning model, which we call the Ising Born Machine. We claim that most parts of the learning algorithm could not be simulated by purely classical means in the worst case, up to multiplicative error. We introduced two new cost functions, the Stein Discrepancy and the Sinkhorn Divergence, and adapted them to train the Ising Born Machine. We study numerically the potential advantage of using quantum kernels over classical kernels, and the advantage of both the Stein Discrepancy and the Sinkhorn Divergence over the MMD to achieve lower error in total variation. This was validated through numerical experiments run on Rigetti's QVM simulator and their Aspen chip. We also test an application in quantum circuit compilation using classical data, and show numerically how this can be achieved. We discuss the possibility of using this model to learn certain hard distributions, for example those exhibiting Quantum Computational Supremacy, and provided a definition for a generative QML algorithm to demonstrate 'Quantum Learning Supremacy'. In this regard, we discuss how one could potentially achieve this by leveraging certain upper bounds on total variation relating to the integral probability metrics which we use as training cost functions.
Future work directions would be, for example, determining how limited connectivity and noise affect the performance of the training algorithm. We also will run experiments on different datasets, for example the Bars-and-Stripes. A second interesting direction would be to find example of distributions which could admit a demonstration of Quantum Learning Supremacy, and provide a definitive separation (up to relevant complexity theoretic conjectures) between quantum and classical machine learning algorithms in practice. While this is quite a strong claim, it is likely that weaker claims could be made which perhaps would be more ameanable to the near term.

A Proof of Theorem 3.1
The proof of Theorem 3.1 involves stitching together several results, some well known, some less so, and, to the best of out knowledge, some unknown. The proof will proceed systematically through the instances of the parameters, {J ij , b k } , Γ, ∆, Σ, and will show that in each of the cases mentioned in Theorem 3.1 the class of circuits generated cannot, in the worst case, be simulated to within multiplicative error efficiently by any classical means. For the remainder of this proof, we use the phrase 'simulate' to mean exactly this.

A.1 IQP
As stated above, IQP circuits correspond to the setting of the parameters, {J ij , b k }, ∀k : Γ k = π/2 √ 2, ∆ k = 0, Σ k = π/2 √ 2. This means the unitary U z (θ) is applied to the initial superposition state, |+ ⊗n , followed finally by a Hadamard applied to all qubits, H ⊗n . It is known that IQP circuits, with the homogeneous parameters α = J ij = b k = π/8 ∀i, j, k in (18), are hard to simulate in the worst case [16]. We will denote IQP circuits with the two sets of parameters used to define them, {J ij , b k } by IQP(J ij , b k ).
A collapse of PH to any level is considered unlikely, and in some sense is a generalisation of P = NP.
Next, we consider the question of for which homogeneous parameters, α l = α = J ij = b k result in IQP circuit families which are hard to simulate. The answer is almost all of them.
Theorem A.2 (Adapted from Theorem 5 of [31]). If the output probability distributions generated by uniform families of IQP(θ, θ) circuits could be weakly classically simulated to within multiplicative error 1 ≤ c ≤ √ 2 then PostBPP = PP (and hence PH collapses to the third level), where Finally we note that there is no need for the circuit parameters to be homogeneous. We may also allow each single and two qubit gate to have an independent parameter, as long as they are all of the form (132).
In Theorem A.1 it was shown that a general computation can be simulated by IQP circuits generated by a homogeneous parameter value, θ = π/8. We will now use the arguments of [31] to show why IQP(J ij , b k ) circuits, with almost all parameter angles, can simulate IQP(π/8, π/8) efficiently. The result of [31] will be a subcase of this, which accounts for the case J ij = b k = π/8. Specifically, if the original IQP circuit is generated by gates in the form D 1 (π/8) = e i π 8 Z , D 2 (π/8) = e i π 8 Z⊗Z , general IQP circuits with gates acting on at most two qubits (|S j | ≤ 2) will be generated by gates D 1 (b k ) = e ib k Z , D 2 (J ij ) = e iJ ij Z⊗Z . Therefore, it is necessary to show how each of these gates can simulate each of the former gates with a homogeneous rotation angle, π/8. To do so we can use the error measure defined as follows, [37], as the difference between the operations of two arbitrary gates on a quantum state, when maximised over all possible states: Where || · || is the norm of a vector: ||U |ψ || = ψ| U † U |ψ . If, by m repetitions, the gate is within of the required gate with parameter π/8, U (π/8 + ), the error induced by this extra factor will be O( ): so the gate with parameter θ 2 can be made close to the required CZ ∼ D 2 ( π 8 ) with respect to the error, as noted by [31,37]. The same thing holds for the single qubit gates with angle θ 1 approximating Z or P for example.
The irrationality of the parameter values allows the above, (134), to be true since 2πνm mod 2π is distributed uniformly. See [96] for a proof that any arbitrary phase can be approximated to accuracy, , with poly(1/ ) repetitions of a phase which is an irrational multiple of 2π. This shows that we it is possible to achieve any gate: e i2νπnσ using m repetitions of (e i2νπnσ ) n if ν is irrational. In this construction, each individual gate, j, will have an error, j , in contrast to [31] in which all of the errors would be the same (or a constant multiple ( = j ∀k). However, as long as this parameter, j , for each gate is lower than the threshold for fault-tolerant quantum computation, as noted in [31], then we can reliably simulate universal quantum computation, and hence PostIQP(b k , J ij ) = PostIQP(π/8, π/8).
A.2 p = 1 QAOA Now, we can prove the analogous statement as with IQP with this setting of the parameters, in an identical way.
1. The case ∀k : Γ k = −π/4 is covered by [32] where it is shown how QAOA(π/8, π/8, π/4) equipped with postselection is equivalent to BQP with postselection. The proof involves the design of a gadget similar to the IQP gadget in the IQP proof of hardness, and results in a similar collapse of the PH as in Theorem (A.1).
2. Extending the parameters, J ij = b k = π/8 in the diagonal unitary U z (θ) to general parameters, J ij , b k follows identically to the argument for IQP in the previous section. More specifically, any diagonal gate with parameter, π/8, can be simulated by a gate with any parameter of the form (26), applied a constant number of times to get within a fixed error of the desired π/8 gate.
3. Finally, if we wish to simulate the behaviour of any gateH = e −iπ/4X k , by any general gate U f (Γ k ) = e −iΓ k X k on qubit k, we just need to apply the latter gate a constant, m = O(1/ ), number of times, to get within , of the gateH, exactly as in (134). Now, we will have k such gates, each requiring the application of a constant number of repetitions, m k = O(1/ k ), so the total number of gates that would have to be applied is n k=1 m k . Overall, we will acquire a polynomial overhead in the simulation of the circuit generated by the fixed parameters, ∀i, j, k : J ij = π/8, b k = π/8, Γ k = −π/4. Hence we can achieve universal quantum computation with postselection in exactly the same way, and hence PostQAOA({J ij , b k }, Γ) = PostBQP.

A.3 Remaining Cases
The above two sections, Section A.1, Section A.2, cover many of the angles for the final measurement unitary to be hard. We can immediately extend the argument to cover almost all of the angles. The following case is a generalisation of IQP: for integers, d, l 2νπ ν ∈ [0, 1) irrational.
the Stein Discrepancy useful for distributions which are normalised by some intractable normalisation factor, for example in a Boltzmann Machine. Of course, the Stein Discrepancy can be immediately used for any classical problem (classical datasets) for which the probability density is already known, or can be computed efficiently. However, if one is interested in implicit models 20 , models which admit a means of sample generation, but not explicit access to the probability density, then this not immediate. In particular, for those distributions which admit some complexity theory result indicating that they cannot be simulated on a classical device efficiently, it will not be possible to efficiently compute the probabilities required in order to compute the score.
Here we present two approaches, [99,100], to compute approximations to the score function, and their application in this specific circumstance. In all the following, we assume it is the score of the data, π, which we want to compute. Of course, the most obvious approach to computing the score, using (D) samples alone, would be to simply accumulate the empirical distribution which is observed by the samples,π(x m ) = 1 M M m=1 I(x = x m ) and compute the score from this distribution. However, this immediately has a severe drawback. Since the score for a given outcome, s π (x m ), requires also computing the probabilities of all shifted samples, π(¬ i x m )∀i, if we have not seen any of the outcomes ¬ i x m in the observed data, we will not have values for these outcomes in the empirical distribution, and hence we cannot compute the score. This would be a major issue as the number of qubits in our system grows, since we will have exponentially many outcomes, many of which we will not see with poly(n) samples.

B.1 Identity Approximation of Stein Score
As a first attempt, we shall try the method of [99]. This involves noticing that the score function appears in Stein Identity, and inverting Stein's Identity gives a procedure to approximate the score.
Of course, we shall need to use the discrete version of Stein's Identity in our case, and rederive the result of [99] but there are no major differences. If we have generated M samples from the data distribution, we denote the score matrix, G, at each of those sample points as follows: π (x 1 ) s 1 π (x 2 ) . . . s 1 π (x M ) s 2 π (x 1 ) s 2 π (x 2 ) . . . s 2 π (x M ) . . . . . . . . . . . . s n π (x 1 ) s n π (x 2 ) . . . s n π (x M ) Each column is the term which corresponds to the score function for the distribution, π, and that given sample. Now, to compute,Ĝ π ≈ G π we can invert the discrete version of Stein's Identity, (143), similar to [99] which does this in the continuous case: 20 Implicit models are also present in the classical domain, for example in Generative Adversarial Networks, [97,98], and it is of great interest to find methods of dealing with them.

B.2 Spectral Approximation of Stein Score
While the method used to approximate the score function method which we showed in Section B.1 is straightforward, it does not give a method of computing the score accurately at sample points which have not been seen in the data distribution, π. This is a problem if, for instance, we come across a sample from the Born Machine, which has not been seen in the data set. Again, this becomes exponentially more likely as the number of qubits grows. If this were to occur during training, a possible solution mentioned in [99], is simply to add that sample to the sample set, and recompute the score function by the method of Section B.1. However, this is expensive, so more streamlined approaches would be desirable. Worse still, this tactic would potentially introduce bias to the data, since there is no guarantee that the given sample from the Born machine, does not have zero probability in the true data, and hence would never occur.
The approach we take is that of [100], which uses the Nyström method as a subroutine to approximate the score. The Nyström method is a technique to approximately solve integral equations [101]. It works by finding eigenfunctions of a given kernel with respect to the target probability mass function, π. As in the case of [99], the method was defined when π is a continuous probability measure, and as such we must make suitable alterations to adapt it to the discrete setting.
We summarise the parts of [100] which are necessary in the discretisation. For the most part the derivation follows cleanly from [100], and from Section B.1. Firstly, the eigenfunctions in question are given by the following summation equation: y κ(x, y)ψ j (y)π(y) = µψ j (x) where {ψ j } N j=1 ∈ 2 (X , π), and 2 (X , π) is the space of all square-summable sequences with respect to π, over the discrete sample space, X . If the kernel is a quantum one, as in (40), the feature space has a basis, {ψ j = s j |ψ } N j=1 ∈ 2 (X , π), where |s j are for example computational basis states. We also have the constraint that these functions are orthonormal under the discrete π: x ψ i (x)ψ j (x)π(x) = δ ij (155) Approximating (154) by a Monte-Carlo estimate drawn with M samples, and finding the eigenvalues and eigenvectors of the covariance kernel matrix, K ij = κ(x i , y j ), in terms of the approximate ones given by the Monte-Carlo estimate exactly, as in [100], we get: {u j } j=1,...J are the J th largest eigenvalues of the kernel matrix, K, with eigenvalues, λ j . The true eigenfunctions are related to these 'sampled' versions by: ψ j (x m ) ≈ √ M u jm ∀m ∈ {1, . . . , M }, µ j ≈ λ j /M . Assuming 21 that the discrete score functions are square summable with respect to π, i.e. s i (x) ∈ 2 (X , π), we can be expand the score in terms of the eigenfunctions of the 2 (X , π): 21 Because the eigenfunctions, ψ j are complex valued, they automatically obey the Discrete Stein's Identity (70) and we get the same result as [100]: Proceeding as in [100], we apply the discrete shift operator, ∆ x i , to both sides of (154) to give an approximation for the term,∆ x i ψ(x) ≈ ∆ x i ψ(x): It can also be shown in this case that∆ x i ψ(x) ≈ ∆ x iψ (x), by comparing (159) with (156), and hence we arrive at the estimator for the score function: If the sample space is the space of binary strings of length n, the number of eigenfunctions, N will be exponentially large, N = 2 n , and so the sum in (157) is truncated to only include the J th largest eigenvalues and corresponding eigenvectors.

C Proof of Theorem 4.3
The proof of Theorem 4.3, and in particular (70), is a simple extension of Theorem 2 in [70] Proof.
φ(x) = a(x) + ib(x) (161) For each term, j, the real part is given by: Since the sum is taken over all the sample space, X d , and ¬ i ( ¬ i x) = x, i.e. ¬ and ¬ are inverse operations (in fact they are equal in our case), (164) is equal to (165), and so the real parts of (163) cancel out. The same result hold for the imaginary parts, completing the proof.

D Proof of (76)
Here we compute the gradient of the Stein Cost function of (76), between the Born Machine, p θ and the data distribution, π, given by (74).
Proof. The derivation is very similar to the derivation of the MMD cost function gradient. Using the fact that the Stein kernel, (75), does not depend on the parameter, θ, the gradient is given by: