Experimental kernel-based quantum machine learning in finite feature space

We implement an all-optical setup demonstrating kernel-based quantum machine learning for two-dimensional classification problems. In this hybrid approach, kernel evaluations are outsourced to projective measurements on suitably designed quantum states encoding the training data, while the model training is processed on a classical computer. Our two-photon proposal encodes data points in a discrete, eight-dimensional feature Hilbert space. In order to maximize the application range of the deployable kernels, we optimize feature maps towards the resulting kernels’ ability to separate points, i.e., their “resolution,” under the constraint of finite, fixed Hilbert space dimension. Implementing these kernels, our setup delivers viable decision boundaries for standard nonlinear supervised classification tasks in feature space. We demonstrate such kernel-based quantum machine learning using specialized multiphoton quantum optical circuits. The deployed kernel exhibits exponentially better scaling in the required number of qubits than a direct generalization of kernels described in the literature.

www.nature.com/scientificreports/

With nonlinear FMs, it is not required to implement nonlinear transformations on the quantum-state-encoded data, in contrast to the direct amplitude encoding common in other QML approaches. The central idea underlying KQML is that inner products of vectors that are mapped into the FHS can be directly accessed by measurements, which suggests identifying these inner products with kernel functions. By physically measuring the kernel functions κ(x′, x) = |⟨ϕ(x′)|ϕ(x)⟩|², it is thus possible to bypass their step-by-step computation on a classical machine. Such a measurement-based implementation may, in some cases, be significantly faster than the classical evaluation.
It follows from the representer theorem that a function of the reproducing kernel that minimizes the cost function (a solution to the ML problem) can be written as f*(x) = Σ_{m=1}^{M} a_m κ(x, x_m), where M is the number of training samples, the coefficients a_m are real parameters subject to the training, and x belongs to the feature space. For a given kernel κ, the parameters a_m can be found efficiently. The objective of ML is to deliver a function f*(x) that classifies the non-separable points x_1, ..., x_{M−K} and x_{M−K+1}, ..., x_M by finding a trade-off between the number of misclassifications and the width of the separating margin. The parameters a_m can be obtained by solving the following problem: minimize Σ_{m=1}^{M} (|a_m|² + γ u_m) such that Σ_{m=1}^{M} a_m κ(x_i, x_m) ≥ 1 − u_i for i = 1, ..., M − K, and Σ_{m=1}^{M} a_m κ(x_i, x_m) ≤ −(1 − u_i) for i = M − K + 1, ..., M, with u_i ≥ 0, where γ gives the relative weight of the number of misclassified points compared to the width of the margin. In a nutshell, this approach replaces the nonlinearity of the problem with linear multidimensional quantum computations, which offers a potential speed-up.
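The training problem stated above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it solves the soft-margin program with SciPy's SLSQP solver on synthetic 1D data and a Gaussian kernel (the data set, σ, and γ values are illustrative assumptions).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 1D training set: first M - K points of class +1, last K points of class -1.
X = np.concatenate([rng.normal(-0.2, 0.05, 10), rng.normal(0.2, 0.05, 10)])
y = np.concatenate([np.ones(10), -np.ones(10)])
M = len(X)

gamma = 1.0     # relative weight of misclassifications vs. margin width
sigma = 0.1     # Gaussian-kernel width (illustrative)
K_mat = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * sigma**2))

def objective(z):
    a, u = z[:M], z[M:]
    return np.sum(a**2) + gamma * np.sum(u)

constraints = [
    # y_i * sum_m a_m kappa(x_i, x_m) >= 1 - u_i  (both inequality families at once)
    {"type": "ineq", "fun": lambda z: y * (K_mat @ z[:M]) - 1.0 + z[M:]},
    {"type": "ineq", "fun": lambda z: z[M:]},  # slack variables u >= 0
]
res = minimize(objective, np.zeros(2 * M), method="SLSQP", constraints=constraints)
a_opt = res.x[:M]

# Decision function f*(x) = sum_m a_m kappa(x, x_m) from the representer theorem.
f_star = lambda x: float(np.sum(a_opt * np.exp(-((x - X) ** 2) / (2 * sigma**2))))
train_acc = float(np.mean(np.sign([f_star(x) for x in X]) == y))
```

In the hybrid scheme, the entries of K_mat would come from projective measurements rather than from the classical kernel evaluation used here.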

Results
Kernel resolution in finite dimensions. An important and widespread class of kernels is that of Gaussian-type kernels, which introduce a flexible notion of proximity among data points. An essential hyperparameter of Gaussian-type kernels is thus their variance (or, more generally, their resolution). The resolution determines a Gaussian kernel's ability to distinguish data points, which, for given training data, can decide whether a model can be trained successfully or not. If the kernel resolution is too coarse, the resulting decision boundaries miss relevant details in the data; if it is too refined, the model becomes prone to overfitting. Only if the resolution can be chosen flexibly enough to accommodate the structure of the data can model training be expected to succeed.
In the infinite-dimensional feature spaces offered by continuous variable implementations, viable FMs with (in principle) arbitrary resolution can be implemented, e.g., by mapping data into squeezed states 12 , where the adjustable squeezing factor then determines the resolution of the resulting Gaussian kernel (i.e., its variance). However, within the paradigm of discrete, finite-dimensional quantum information processing, the FHS dimension becomes a scarce resource, resulting in limitations on kernel resolution. As we show now, optimizing the range of kernel resolutions in finite dimensions then forces us to move beyond the scope of Gaussian kernels.
Let us discuss the optimal kernel resolution that can be achieved in an N-dimensional FHS, within the class of FMs of the form |ϕ(x)⟩ = Σ_{n=0}^{N} √r_n e^{2πinx} |n⟩, with Σ_{n=0}^{N} r_n = 1 (1), where {|n⟩} is a basis of the Hilbert space and x ∈ [−1/2, 1/2). Any data set can be brought to this form, which is a routine step in data preparation. We stress that the amplitudes r_n are independent of the input values x. The resulting kernels then are of the form κ(x′, x) = |⟨ϕ(x′)|ϕ(x)⟩|² = |Σ_{n=0}^{N} r_n e^{2πin(x−x′)}|² ≡ κ(x − x′) (2). In this shorthand notation κ(x) ≥ 0 ∀x and κ(0) = 1. For the sake of clarity we consider here 1D input data x. For D-dimensional inputs x, each input component x_i is encoded separately, requiring an (N·D + D)-dimensional FHS. If the FHS is spanned by q qubits, we have N = 2^q − 1. In particular, for N = 1 and r_n = 1/2 we have κ(x, x′) = cos²[π(x′ − x)], which realizes a cosine kernel (CK). The class of states (1) also comprises truncated squeezed states |ψ_TSQ(x)⟩, whose amplitudes are fixed by the squeezing factor ζ and a constant B that renormalizes the state after truncation, and, what we call here, multi-slit interference states |ψ_MSI(x)⟩, with constant amplitudes √r_n = 1/√N. The latter inherit their name from the fact that, by virtue of ⟨x|p⟩ = e^{2πipx} (h = 1), they are formally equivalent to a balanced superposition of momentum states in a (hypothetical) compact continuous-variable Hilbert space (augmented by an internal spin-N degree of freedom), giving rise to "N-slit interference" in the position coordinate when projected onto ⟨x| ⊗ (1/√N) Σ_{n=1}^{N} ⟨n| (see Ref. 19). Note that polynomial kernels (discussed, e.g., in Refs. 8 and 12) fall outside of the state class (1).
We can use the above compact-space embedding to gain further insight into the nature of our kernel definition (2). If we interpret the states (1) as |ψ⟩ = Σ_{n=1}^{N} √r_n |p = n⟩ ⊗ |n⟩, we can introduce the density operator ρ = |ψ⟩⟨ψ| and trace over the internal spin degree of freedom, yielding the reduced state ρ_ext. We then find that the kernel (2) is related to the spatial coherences of the mixed reduced state ρ_ext: κ(x − x′) = |⟨x|ρ_ext|x′⟩|². We define a kernel's spatial resolution Δx[κ] by its variance (a hyperparameter typically optimized for Gaussian kernels), (Δx[κ])² = ∫_{−1/2}^{1/2} dx x² κ̃(x), where the renormalized kernel κ̃(x) = κ(x)/R, with R ≡ ∫_{−1/2}^{1/2} dx κ(x) = Σ_{n=1}^{N} r_n², describes a valid probability distribution. In the case of the multi-slit interference states |ψ_MSI⟩, one analytically obtains (Δx[κ_MSI])² = (1/12)(1 − S₁(N)), with the interferometric "squeezing factor" S₁(N). The kernel (6) minimizing the variance is the solution to the optimization problem: minimize r^T·K·r/|r|² such that Σ_{n=1}^{N} r_n = 1, where K_{mn} = ∫_{−1/2}^{1/2} dx x² cos[2π(m − n)x] and r = (r_1, ..., r_N)^T. In Fig. 1 we compare this optimized kernel with the TSQ and the MSI kernel. The optimized kernel comes with strongly suppressed side maxima compared to the MSI kernel, while the TSQ kernel maintains a nonvanishing plateau for all x values. Consequently, the optimized kernel enables, for a given N, a significantly improved resolution compared to the other kernel choices. Figure 1b clarifies that amplitudes decaying symmetrically about the "center" state are responsible for the improved kernel resolution. On the other hand, a kernel that maximizes the variance (i.e., κ(x) = 1) follows from r_1 = 1 and r_n = 0 for n ≠ 1, resulting in the variance (Δx[κ])² = 1/12. By a suitable choice of the coefficients r_n, we can thus tune the resolution of the kernel between its minimum value, obtained for the optimized kernel, and its maximum value, assumed for a uniform kernel.
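The variance minimization above can be checked numerically. The sketch below assumes the Rayleigh-quotient form stated in the text, with the second-moment matrix K_{mn} = ∫_{−1/2}^{1/2} dx x² cos[2π(m − n)x] worked out in closed form from Eq. (2); this explicit K is our reconstruction, not a formula quoted from the paper.

```python
import numpy as np

# Second-moment matrix of the renormalized kernel on x in [-1/2, 1/2):
# K[m, n] = 1/12 if m == n, else (-1)^(m-n) / (2 * pi^2 * (m-n)^2).
def second_moment_matrix(N):
    k = np.subtract.outer(np.arange(N), np.arange(N)).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        off = (-1.0) ** np.abs(k) / (2.0 * np.pi**2 * k**2)
    return np.where(k == 0, 1.0 / 12.0, off)

def variance(r, K):
    # Kernel variance as the generalized Rayleigh quotient r^T K r / |r|^2.
    r = np.asarray(r, dtype=float)
    return (r @ K @ r) / (r @ r)

N = 8
K = second_moment_matrix(N)

r_msi = np.full(N, 1.0 / N)          # multi-slit interference amplitudes r_n = 1/N
var_msi = variance(r_msi, K)

# Lower bound on any achievable variance: the smallest eigenvalue of K.
var_min = np.linalg.eigvalsh(K)[0]

r_uniform = np.eye(N)[0]             # single occupied level -> uniform kernel
```

The uniform-kernel choice reproduces the maximal variance 1/12, the MSI kernel lies below it, and the eigenvalue bound indicates how much further an optimized amplitude profile can sharpen the resolution.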
Whereas kernels of the form (2) can also be efficiently computed classically, their quantum evaluation may still deliver a significant speed-up. We illustrate this with an example: the computation of cos^{2N}(x). The optimal classical algorithm depends on the properties of N. In the best-case scenario, N is a power of 2. Then, in the first step we compute cos²(x); next, we compute [cos²(x)]², etc. The entire computation then takes log₂(N + 1) steps. As we demonstrate below, for the quantum implementation the required size of the FHS (number of qubits) also grows like log₂(N + 1). In contrast, however, there the associated calculations are replaced by a single measurement. We expect similar arguments to hold for more general classes of functions as well.
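The classical repeated-squaring argument can be made concrete; a minimal sketch for N a power of 2:

```python
import math

def cos_power_2N(x, N):
    """Compute cos(x)**(2*N) for N a power of 2 by repeated squaring.

    Takes 1 + log2(N) multiplications instead of the 2*N - 1 of naive
    multiplication; the quantum evaluation replaces these sequential
    steps by a single measurement on ~log2(N + 1) qubits.
    """
    assert N > 0 and (N & (N - 1)) == 0, "N must be a power of 2"
    result = math.cos(x) ** 2            # step 1: cos^2(x)
    for _ in range(int(math.log2(N))):   # then square log2(N) times
        result = result * result
    return result
```

The squarings are inherently sequential, which is why this classical cost cannot be parallelized away.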
Beyond the quantum-classical hybrid approach pursued here, the proposed FMs may, if seen as modules to be combined with other quantum computing subroutines, contribute their resource-efficient data point separation ability to an overall setup that comes with an inherently quantum scaling advantage. MSI states, for instance, can be generated in a gate-based quantum computer following the first stage of the phase-estimation algorithm 20 .
Alternative Gaussian-kernel implementation. Above we have shown that truncated squeezed states and their resulting kernels fall within the state class (1). If we relax the condition that the amplitudes r_n be independent of the input values x, we can formulate an alternative data encoding into truncated squeezed states whose expansion coefficients, involving the factorials n!, depend on x. Note that this feature map is defined for 2D inputs x = (x₁, x₂)^T. For large N the resulting kernel again reproduces, to a good approximation, a Gaussian kernel whose variance is set by the hyperparameter s. In particular, as shown in Fig. 2, this approximation is valid for q = 2 and relatively small values of s.
We find that this kernel performs, for q = 2 qubits and the numerically optimized hyperparameter s = 2, on average as well as the cosine kernel with N = 1 for the same total number of qubits (see Fig. 4). Moreover, by appropriate parameter reconfiguration, it would be possible to realize this type of kernel using the same experimental setup. From an experimental perspective, however, it is more convenient (and thus more scalable) to implement the feature map associated with the powers of the cosine kernel, which is realized exclusively by setting phases and polarization angles.
Cosine kernels. The kernel selected for our proof-of-principle demonstration of KQML is the power of the cosine kernel, κ(x′, x) = Π_{n=1}^{D} cos^{2N}(x′_n − x_n). Note that N is related to the number of qubits q per dimension as q = ⌈log₂(N + 1)⌉. This FM can also be considered a constant-phase representation of constant-amplitude states; this is the same as representing states in a basis of eigenstates of either the x or the z component of a collective spin operator. This mapping uses exponentially fewer resources (qubits) than the direct product of the map from Ref. 12, i.e., |ϕ(x)⟩ = ⊗_{n=1}^{D} ⊗_{m=1}^{N} Σ_{k=0}^{1} sin^k(x_n) cos^{1−k}(x_n)|k⟩_{n,m}, where the number of qubits per dimension is q = N. Using the powers of CKs allows us to adjust the kernel resolution by choosing the proper value of N. Thus, the number of qubits used can be related directly to the variance of the kernel; the number of qubits here plays the same role as the squeezing parameter in the experimental proposal of Ref. 12. The CK can also include additional (D − 1) degrees of freedom y_1, ..., y_{D−1} by virtue of a suitably extended FM, where y_0 = 0, the number of terms is (N + 1)^D, and the associated kernel measured by postselection is κ(x′, x) = Π_{n=1}^{D} cos^{2N}(x′_n − x_n) cos²(y′_{n−1} − y_{n−1}).
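A short numerical illustration of the two points above: the logarithmic qubit count q = ⌈log₂(N + 1)⌉, and the Gaussian-like narrowing of the kernel with N. The small-angle approximation cos^{2N}(Δ) ≈ e^{−NΔ²} used here is a standard expansion, not a formula from the text.

```python
import math

def ck(dx, N):
    """Power-of-cosine kernel for one input dimension: cos^(2N)(dx)."""
    return math.cos(dx) ** (2 * N)

def qubits_per_dimension(N):
    """Qubits needed per input dimension, q = ceil(log2(N + 1))."""
    return math.ceil(math.log2(N + 1))

# Resolution sharpens with N while the qubit cost grows only logarithmically:
# for small separations dx, cos^(2N)(dx) ~ exp(-N * dx^2).
N, dx = 15, 0.1
approx = math.exp(-N * dx**2)
```

Doubling the effective squeezing (halving the kernel width) thus costs roughly one additional qubit per dimension.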

Discussion
We have experimentally implemented KQML to solve three classification problems on a two-photon optical quantum computer. In our experiment we implemented a D = 2, N = 1 kernel (using all the modes from Fig. 3, we can set at most D = 5 with q = 1). We used two photons, but only the top mode of the dual-rail encoding. Including more modes would lead to kernels causing overfitting (see Fig. 4). We performed measurements for M = 40 two-dimensional samples (D = 2), drawn from two classes (see the horizontally/vertically-tipped triangles in Fig. 4). This procedure was repeated for three benchmark classification problems. For each benchmark, 40 × 39/2 = 780 measurements were performed to create the corresponding Gram matrix (GM), which was subsequently used to find the best linear classification boundary as given by the representer theorem. In other words, a custom kernel κ(x_m, x_n) = κ(x_n, x_m) for m, n = 1, 2, ..., M was measured. This kernel was used as a custom precomputed kernel for the scikit-learn SVC classifier in Python.
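The classical post-processing step described here can be sketched as follows. In this illustration the Gram matrix is computed from the cosine kernel directly instead of being measured, and the data set is a synthetic stand-in for the benchmark problems.

```python
import numpy as np
from sklearn.svm import SVC

def ck_gram(XA, XB, N=1):
    # Gram matrix of the power-of-cosine kernel:
    # K[i, j] = prod_d cos^(2N)(pi * (XA[i, d] - XB[j, d]))
    diff = XA[:, None, :] - XB[None, :, :]
    return np.prod(np.cos(np.pi * diff) ** (2 * N), axis=2)

rng = np.random.default_rng(1)
# Two synthetic classes of M = 40 two-dimensional points in [-1/2, 1/2).
X = np.concatenate([rng.normal(-0.2, 0.05, (20, 2)), rng.normal(0.2, 0.05, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

G = ck_gram(X, X)                      # stand-in for the measured Gram matrix
clf = SVC(kernel="precomputed").fit(G, y)
train_acc = clf.score(ck_gram(X, X), y)
```

In the experiment, the 780 measured overlaps (plus the unit diagonal and symmetry) would replace the entries of `G`; scikit-learn's `kernel="precomputed"` mode accepts such an externally supplied Gram matrix directly.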
Pairs of H-polarized photons were prepared via type-I spontaneous parametric down-conversion in a β-BaB₂O₄ crystal. The crystal was pumped by a 200 mW laser beam at 355 nm (repetition rate of 120 MHz). The coincidence rate, including all possible detection events from Fig. 3, was approximately 250 counts per second. The setup operates with high fidelity (98%), and the dominant source of errors can be attributed to the Poissonian photon-count statistics. The design of the setup is modular, and it is easy to incorporate more qubits by simply adding further blocks. We measured each point for the time necessary to collect about 2,500 detection events. Thus, excluding the time needed to switch the setup parameters, the whole measurement for a single benchmark problem takes about two hours.
To prepare the contour plot of the decision function based on the experimental data shown in Fig. 4, and to quantify the performance of the trained model on the relevant test sets, we also measured the GM for 1,225 points and used its symmetries to fill in the unmeasured values. The values for points in between were found using linear interpolation. The data-accumulation time can be shortened by orders of magnitude by fine-tuning the parameters of the setup and by using brighter photon sources.
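The fill-in procedure can be sketched as follows, with kernel values generated numerically as a stand-in for the measured Gram data (the grid sizes and reference point are illustrative assumptions).

```python
import numpy as np
from scipy.interpolate import griddata

# "Measured" kernel values on a coarse grid, here generated from the N = 1 cosine
# kernel against a fixed reference point as a stand-in for experimental Gram data.
ref = np.array([0.1, -0.1])
grid = np.linspace(-0.5, 0.5, 11)
gx, gy = np.meshgrid(grid, grid)
pts = np.column_stack([gx.ravel(), gy.ravel()])
vals = np.prod(np.cos(np.pi * (pts - ref)) ** 2, axis=1)

# Linear interpolation onto intermediate points, as used for the contour plots.
fine = np.column_stack([np.full(9, 0.05), np.linspace(-0.4, 0.4, 9)])
interp = griddata(pts, vals, fine, method="linear")
```

Linear interpolation is adequate here because the cosine kernel varies smoothly on the scale of the measurement grid.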

Conclusions
We report on the first experimental implementation of supervised QML for solving a nonlinear multidimensional classification problem with clusters of points that are not trivially separated in the feature space. We hope that our research on QML will help to improve ML technologies, which are a major workhorse of many industries, a vivid field of research in computer science, and an important technique for solving real-world problems. We believe that both the theoretical and the experimental investigation of FM circuits and their constraints regarding kernel resolution and compression for a limited FHS (i.e., FHS-size-dependent FMs) constitute a crucial step in the development of practical KQML for support-vector-machine QML [8][9][10]12,13. We demonstrate that a linear-optical setup with discrete photon encoding is a reliable instrument for this class of quantum machine learning tasks. We also report exponentially better scaling of the FHS in the case of the CK than in the case of taking direct products of qubits 12. The same can hold for other, more complex kernels implemented in a finite FHS, which might appear unfeasible but in fact only require nontrivial FMs (e.g., the resolution-optimized kernels shown in Fig. 1). Thus, KQML can provide a promising perspective for utilizing noisy intermediate-scale quantum systems [21][22][23][24], complementing artificial quantum neural networks [25][26][27][28][29] and other hybrid quantum-classical algorithms [30][31][32].
The classical computational cost of the power-kernel computation is O[log(N)], while the quantum cost is a constant that depends on the precision of the computation. In the classical case, one needs to perform O[log(N)] computation steps that cannot run in parallel, due to the recursive nature of the classical algorithm. In the quantum case, one needs to run a single computation step, but on log(N) qubits. As in any quantum computation, the precision of the calculation depends on the number of measurements, and it can be considered constant for a given computational problem. This observation is itself a valuable result and a quantum advantage. The quantum advantage of the presented approach is apparent in terms of the complexity of the calculations, i.e., O[log(N)] versus O(1). Consider the number of samples needed for the quantum calculations. It depends on the confidence level z and the admissible error ε: for a given pair of z and ε, one needs O(1/ε²) repetitions of the experiment. This is just a constant overhead. In the classical case, this constant overhead can be smaller, but the complexity of the calculations can be larger, as it is N-dependent. Only if we face a significantly lower-than-unity qubit-number-dependent efficiency η (i.e., circuit-size-dependent losses) should, for a given z-value, the complexity of the quantum computations be considered as O(η^{−log(N)}/ε²). However, an analogous power scaling also applies to the total error probability of classical computations of log(N) steps. Note, however, that both η and the single-step error probability of classical computing are not fundamentally limited and can be arbitrarily close to 1 or 0, respectively.
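The O(1/ε²) sampling overhead can be made concrete with the standard Hoeffding bound; this is a generic estimate, not the paper's specific formula, and the loss model η^(number of qubits) mirrors the circuit-size-dependent losses mentioned above.

```python
import math

def shots_needed(eps, delta):
    """Hoeffding bound: repetitions needed so that the empirical estimate of the
    kernel (a probability in [0, 1]) deviates from its mean by more than eps
    with probability at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

def shots_with_losses(eps, delta, eta, n_qubits):
    # With per-qubit detection efficiency eta, only eta**n_qubits of the runs
    # survive postselection, inflating the raw repetition count accordingly.
    return math.ceil(shots_needed(eps, delta) / eta**n_qubits)
```

For fixed ε and confidence, the shot count is a constant independent of N; only the loss factor reintroduces a (mild, logarithmic-exponent) N dependence through the qubit number.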
Our quantum kernels can be used for solving high-dimensional classification problems and could potentially be computed faster than their classical counterparts. Popular problems solved by classification algorithms include image recognition (e.g. face detection or character recognition), speech recognition (e.g. voice user interfaces), medical diagnoses (e.g. associating results of medical tests with a class of diseases), real-time specific data extraction from vast amounts of unstructured data (e.g. classification of patterns in unstructured data) and many more. Classification can also be used as an initial phase for predictive computations that help to make the best decision based on the available data (e.g., managing risk, security, traffic, procurement etc.). We believe that this quantum-enhanced approach is useful especially in cases where it is difficult or impossible to achieve the result on time with classical computing.

Methods
Optical circuit for KQML. States given by Eq. (11) can be prepared in a quantum optical setup. In the reported proof-of-principle experiment, we can set N = 3 and D = 2. This means that, effectively, the experiment deploys q = 2 qubits per dimension. The FM is defined via single-photon polarization states (H/V polarization) together with dual-rail encoding (T/B for the top/bottom rail, respectively) as |ϕ(x_n)⟩ = c³(x_n)|HT⟩ + √3 c²(x_n)s(x_n)|HB⟩ + √3 c(x_n)s²(x_n)|VB⟩ + s³(x_n)|VT⟩, where c(x_n) ≡ cos(x_n) and s(x_n) ≡ sin(x_n). This approach is resource-efficient, as it requires only two photons to encode x into the FHS state with N = 3 and D = 2.
An optical circuit implementing this FM is depicted in Fig. 3. The top part of the FM circuit works as follows: first, it transforms the standard input |HB⟩ using wave plates, resulting in |HB⟩ → (|HB⟩ + |VB⟩)/√2. Next, a beam divider separates the polarization modes in space, i.e., we obtain (|HB⟩ + |VT⟩)/√2. Now, the effective operation of the wave plates in the top and bottom modes can be described as first transforming |VT⟩ → μ_T|HT⟩ + ν_T|VT⟩ and |HB⟩ → μ_B|HB⟩ + ν_B|VB⟩. The parameters are set as μ_T = √2 c³(x_n), ν_T = √2 s³(x_n), μ_B = √6 c²(x_n)s(x_n), ν_B = √6 c(x_n)s²(x_n). This whole operation is unitary and can be described as U(x)|HH⟩ = |ϕ(x)⟩. The Hermitian conjugate of the operation U(x′) is U†(x′), and it can be used to express the kernel as κ(x′, x) = |⟨HH|U†(x′)U(x)|HH⟩|². Thus, the circuit U†(x′) projecting the state |ϕ(x)⟩ onto |ϕ(x′)⟩ can be constructed as the inverse of the feature-embedding circuit U(x), but with the setup parameters set for x′. The next action of the plates in the top and bottom rails is to perform the reverse transformation, but for x_n = x′_n. Next, the plates flip the polarizations in the respective rails. Now, the interesting part of the engineered state is in the top rail with flipped polarization. To implement U†(x′), the last pair of wave plates is used both to flip the polarization and to perform the Hadamard transformation. Finally, the PBS transmits only H-polarized photons for further processing. The procedure of measuring the kernel κ(x′, x) can be extended to include additional dimensions, resulting in the measured kernel κ(x′, x) cos²(y − y′) following from FM (11).
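The overlap accessed by this circuit can be verified numerically. The sketch below encodes the wave-plate amplitudes given above into a four-mode feature vector (the mode ordering HT, HB, VB, VT is an assumption for illustration) and checks that the measured quantity reduces to the N = 3 cosine kernel.

```python
import numpy as np

def phi(x):
    """Feature vector of one photon: amplitudes sqrt(C(3,k)) cos^(3-k)(x) sin^k(x)
    on the four modes HT, HB, VB, VT -- a numerical stand-in for U(x)|H,B>."""
    c, s = np.cos(x), np.sin(x)
    return np.array([c**3, np.sqrt(3) * c**2 * s, np.sqrt(3) * c * s**2, s**3])

def kernel(xp, x):
    # Overlap |<phi(x')|phi(x)>|^2, the quantity accessed by the projective
    # measurement behind the final PBS.
    return float(np.dot(phi(xp), phi(x)) ** 2)

x, xp = 0.3, -0.45
# Binomial amplitude structure implies <phi(x')|phi(x)> = (cc' + ss')^3
# = cos^3(x - x'), so the measured kernel is cos^6(x - x'), i.e. N = 3.
```

The binomial amplitudes also confirm normalization: c⁶ + 3c⁴s² + 3c²s⁴ + s⁶ = (c² + s²)³ = 1.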
Instead of the transformation U†(x′)U(x), we consider R†(y′)U†(x′)U(x)R(y), where R(y) = e^{2iy}|H⟩⟨H| is a phase shift applied to the preselected H-polarized photon in the bottom part of the setup, and R†(y′) = e^{−2iy′}|H⟩⟨H| is the corresponding phase shift applied to the postselected H-polarized photon in the same part of the setup. The phase difference between the postselected upper and lower H-polarized photons can be measured as cos²(y − y′). This is done with PBS′, which transmits diagonally polarized photons |D⟩ = (|H⟩ + |V⟩)/√2 and reflects antidiagonally polarized photons |A⟩ = (|H⟩ − |V⟩)/√2, together with polarization-resolving single-photon detectors (see the caption of Fig. 3).