Variational quantum approximate support vector machine with inference transfer

A kernel-based quantum classifier is the most practical and influential quantum machine learning technique for the highly nonlinear classification of complex data. We propose a Variational Quantum Approximate Support Vector Machine (VQASVM) algorithm that demonstrates empirical sub-quadratic run-time complexity with quantum operations feasible even on NISQ computers. We tested our algorithm with a toy example dataset on cloud-based NISQ machines as a proof of concept. We also numerically investigated its performance on the standard Iris flower and MNIST datasets to confirm its practicality and scalability.


Introduction
Quantum computing opens up exciting new prospects for quantum advantages in machine learning in terms of sample and computational complexity. [1][2][3][4][5] One of the foundations of these quantum advantages is the ability to form and manipulate data efficiently in a large quantum feature space, especially with the kernel functions used in classification and other classes of machine learning. [6][7][8][9][10][11][12][13][14] The support vector machine (henceforth SVM) 15 is one of the most comprehensive models that help conceptualize the basis of supervised machine learning. SVM classifies data by finding the optimal hyperplane associated with the widest margin between the two classes in a feature space. SVM can also perform highly nonlinear classifications using what is known as the kernel trick. [16][17][18] The convexity of the SVM problem guarantees a global optimum.
One of the first quantum algorithms exhibiting an exponential speed-up capability is the least-squares quantum support vector machine (LS-QSVM). 5 However, the quantum advantage of LS-QSVM strongly depends on costly quantum subroutines, such as density matrix exponentiation 19 and quantum matrix inversion, 20,21 as well as components such as quantum random access memory (QRAM). 2,22 Because the corresponding procedures require fault-tolerant quantum computers, LS-QSVM is unlikely to be realized on noisy intermediate-scale quantum (NISQ) devices. 23 On the other hand, there are a few quantum kernel-based machine-learning algorithms for near-term quantum applications.
Well-known examples are quantum kernel estimators (QKE), 8 variational quantum classifiers (VQC), 8 and Hadamard or SWAP test classifiers (HTC, STC). 10,11 These algorithms are applicable to NISQ devices because no costly operations are needed. However, the training time complexity is even worse than in the classical SVM case. For example, the number of measurements required for the kernel matrix evaluation of QKE alone scales with the number of training samples to the power of four. 8 Here, we propose a novel quantum kernel-based classifier that is feasible on NISQ devices and that can exhibit a quantum advantage in terms of accuracy and training time complexity, as asserted in Ref. 8. Specifically, we have discovered distinctive designs of quantum circuits that evaluate the objective and decision functions of SVM. The number of measurements these circuits require for a bounded error is independent of the number of training samples, and the depth of these circuits scales linearly with the size of the training dataset. Meanwhile, a parameterized quantum circuit (PQC) 24 encodes the Lagrange multipliers of SVM with exponentially fewer parameters. Therefore, the training time of our model with a variational quantum algorithm (VQA) 25 scales sub-quadratically, which is asymptotically lower than that of the classical SVM case. 5,26,27 Our model also shows an advantage in classification due to its compatibility with any typical quantum feature map.

Support Vector Machine (SVM)
Data classification infers the most likely class of an unseen data point x̃ ∈ C^N given a training dataset S = {(x_i, y_i)}_{i=0}^{M−1} ⊂ X × Y. Here, X ⊂ C^N and Y = {0, 1, ..., L−1}. Although the data are real-valued in practical machine learning tasks, we allow complex-valued data without loss of generality. We focus on binary classification (i.e., Y = {−1, 1}), because multi-class classification can be conducted with multiple binary SVMs via a one-versus-all or a one-versus-one scheme. 28 We assume that S is linearly separable in a higher-dimensional Hilbert space H given some feature map φ : X → H.
Then, there should exist two parallel supporting hyperplanes ⟨w, φ(·)⟩ + b = y, y ∈ {−1, 1}, that divide the training data. The goal is to find the hyperplanes for which the margin between them is maximized. To maximize the margin even further, the linear-separability condition can be relaxed so that some of the training data can penetrate into the "soft" margin. Because the margin is given as 2/‖w‖ by simple geometry, the mathematical formulation of SVM 13 is given as

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=0}^{M−1} ξ_i s.t. y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i,   (1)

where the slack variables ξ_i are introduced to represent violations of the linear-separability condition by the data. The dual formulation of SVM is expressed as 29

max_β Σ_{i=0}^{M−1} β_i − (1/2) Σ_{i,j=0}^{M−1} β_i β_j y_i y_j k(x_i, x_j) s.t. Σ_{i=0}^{M−1} β_i y_i = 0, 0 ≤ β_i ≤ C,   (2)

where the positive semi-definite (PSD) kernel is k(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩ for x_{1,2} ∈ X. The β_i values are non-negative Karush-Kuhn-Tucker multipliers. This formulation employs an implicit feature map uniquely determined by the kernel. The global solution β* is obtained in polynomial time due to convexity. 29 After optimization, the optimal bias is recovered as

b* = y_q − Σ_{i=0}^{M−1} β*_i y_i k(x_i, x_q)

for any β*_q > 0. Training data x_q with non-zero weight β*_q are known as the support vectors. We estimate the label of unseen data x̃ with the binary classifier

ŷ = sign( Σ_{i=0}^{M−1} β*_i y_i k(x_i, x̃) + b* ).   (3)

In a first-principles analysis, the complexity of solving Eq. (2) with general-purpose quadratic programming is polynomial in M; modern classical SVM solvers run in roughly O(M²) time. 5,26,27
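Since the rest of the paper builds on the dual problem (2) and classifier (3), a minimal classical sketch may help fix ideas. This is our illustration, not the authors' code: it solves Eq. (2) with cvxpy for a linear kernel on synthetic two-blob data, then recovers the bias and classifies as in Eq. (3).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

# Toy linearly separable data: two blobs, labels in {-1, +1}.
M = 20
X = np.vstack([rng.normal(-2.0, 0.6, (M // 2, 2)), rng.normal(2.0, 0.6, (M // 2, 2))])
y = np.array([-1.0] * (M // 2) + [1.0] * (M // 2))

# Dual of Eq. (2) with linear kernel k(x_i, x_j) = <x_i, x_j>.
# beta^T Q beta with Q_ij = y_i y_j <x_i, x_j> equals ||Z^T beta||^2 for rows Z_i = y_i x_i.
Z = y[:, None] * X
C = 10.0
beta = cp.Variable(M)
prob = cp.Problem(
    cp.Maximize(cp.sum(beta) - 0.5 * cp.sum_squares(Z.T @ beta)),
    [beta >= 0, beta <= C, y @ beta == 0],
)
prob.solve()

b = beta.value
K = X @ X.T
q = np.argmax(b)                             # a support vector index (beta_q > 0)
bias = y[q] - (b * y) @ K[:, q]              # bias recovery as in the text

y_hat = np.sign((b * y) @ K + bias)          # Eq. (3) evaluated on the training points
print("training accuracy:", np.mean(y_hat == y))
```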

Change of Variable and Bias Regularization
Constrained programs, such as SVM, are often transformed into unconstrained programs by adding penalty terms for the constraints to the objective function. Although there are well-known methods such as the interior point method, 29 we prefer the strategies of a 'change of variables' and 'bias regularization' to maintain the quadratic form of SVM. Although these strategies are motivated by the elimination of constraints, they also turn out to lead to an efficient quantum SVM algorithm.
First, we change the optimization variable β to (α, B), where B := Σ_{i=0}^{M−1} β_i and α := β/B, to eliminate the inequality constraints. The l_1-normalized variable α is an M-dimensional probability vector, given that 0 ≤ α_i ≤ 1, ∀i ∈ {0, ..., M−1}, and Σ_{i=0}^{M−1} α_i = 1. Substituting β = Bα into Eq. (2) yields

max_{α ∈ PV_M, B ≥ 0} B − (B²/2) W_k(α; S), with W_k(α; S) := Σ_{i,j=0}^{M−1} α_i α_j y_i y_j k(x_i, x_j),   (4)

where PV_M is the set of M-dimensional probability vectors. Because W_k(α; S) ≥ 0 for arbitrary α due to the property of the positive semi-definite kernel, B* = 1/W_k(α; S) is a partial solution that maximizes Eq. (4) over B. Substituting B* into Eq. (4), we have

max_{α ∈ PV_M} 1/(2 W_k(α; S)).   (5)

Finally, because maximizing 1/(2 W_k(α; S)) is identical to minimizing W_k(α; S), we have a simpler formula that is equivalent to Eq. (2):

min_{α ∈ PV_M} W_k(α; S) s.t. Σ_{i=0}^{M−1} α_i y_i = 0.   (6)

Eq. (6) implies that instead of optimizing M bounded free parameters β or α, we can optimize a log(M)-qubit quantum state |ψ_α⟩ and define α_i := |⟨i|ψ_α⟩|². Therefore, if there exists an efficient quantum algorithm that evaluates the objective function of Eq. (6) given |ψ_α⟩, the complexity of SVM would be improved. In fact, in a later section, we propose quantum circuits with linearly scaling complexity for that purpose.
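The partial maximization over B can be verified numerically. The following NumPy sketch (our illustration, with a random PSD kernel matrix standing in for a real one) checks that g(B) = B − (B²/2)W_k peaks at B* = 1/W_k with value 1/(2W_k), as claimed in Eqs. (4) and (5).

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8
K = rng.normal(size=(M, M)); K = K @ K.T        # a random PSD "kernel" matrix
y = rng.choice([-1.0, 1.0], size=M)
alpha = rng.random(M); alpha /= alpha.sum()     # a probability vector in PV_M

def W(alpha):
    """W_k(alpha; S) = sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    return (alpha * y) @ K @ (alpha * y)

Wa = W(alpha)
Bs = np.linspace(0.0, 3.0 / Wa, 1000)           # scan g(B) = B - B^2/2 * W_k
g = Bs - 0.5 * Bs**2 * Wa
assert np.isclose(Bs[np.argmax(g)], 1.0 / Wa, rtol=1e-2)   # B* = 1/W_k
print("max_B g(B) =", g.max(), "  1/(2 W_k) =", 1.0 / (2.0 * Wa))
```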
The equality constraint can then be relaxed by adding an l_2-regularization term on the bias to Eq. (1).
Motivated by the loss function and regularization perspectives of SVM, 30 this technique was introduced 31,32 and developed 33,34 previously. The primal and dual forms of SVM become

min_{w,b,ξ} (1/2)‖w‖² + (λ/2) b² + C Σ_{i=0}^{M−1} ξ_i s.t. y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i,   (7)

max_β Σ_{i=0}^{M−1} β_i − (1/2) Σ_{i,j=0}^{M−1} β_i β_j y_i y_j (k(x_i, x_j) + λ⁻¹) s.t. 0 ≤ β_i ≤ C.   (8)

Note that k(·, ·) + λ⁻¹ is positive definite. As shown earlier, changing the variables turns Eq. (8) into another equivalent optimization problem:

min_{α ∈ PV_M} W_{k+λ⁻¹}(α; S) = Σ_{i,j=0}^{M−1} α_i α_j y_i y_j (k(x_i, x_j) + λ⁻¹).   (9)

As the optimal bias is given as b* = λ⁻¹ Σ_{i=0}^{M−1} α_i y_i according to the Karush-Kuhn-Tucker condition, the classification formula inherited from Eq. (3) is expressed as

ŷ = sign( Σ_{i=0}^{M−1} α_i y_i (k(x_i, x̃) + λ⁻¹) ).   (10)

Eqs. (8) and (9) can be viewed as Eqs. (2) and (6) with a quadratic penalty term on the equality constraint, such that they become equivalent in the limit λ → 0. Thus, Eqs. (7), (8), and (9) are relaxed SVM optimization problems with an additional hyperparameter λ.

[Figure 1 caption: (a) loss, (b) decision, and (c) regularization circuits. All qubits of index registers i and j are initialized to |+⟩ = (|0⟩ + |1⟩)/√2, and the rest to |0⟩. Ansatz V(θ) is a PQC of m = log(M) qubits that encodes the probability vector α. U_{φ,S} embeds a training dataset S with a quantum feature map U_{φ(x)}, which embeds classical data x̃ into a quantum state |φ(x)⟩. n denotes the number of qubits for the quantum feature map, which is usually N, but can be reduced to log(N) if an amplitude-encoding feature map is used.]

Variational Quantum Approximate Support Vector Machine
One way to generate the aforementioned quantum state |ψ_α⟩ is amplitude encoding: |ψ_α⟩ = Σ_i √α_i |i⟩. However, doing so would be inefficient because the unitary gate for amplitude encoding has a complex structure that scales as O(poly(M)). 35 Another way to generate |ψ_α⟩ is to use a parameterized quantum circuit (PQC), known as an ansatz in this context. Because there is no prior knowledge of the distribution of the α_i values, the initial state should be |++···+⟩ = (1/√M) Σ_i |i⟩. The ansatz V(θ) can transform the initial state into other states depending on the gate parameter vector θ: |ψ_α⟩ = V(θ)|++···+⟩. In other words, the optimization parameters encoded by θ with the ansatz are represented as α_i(θ) = |⟨i|V(θ)|++···+⟩|². Given the lack of prior information, the most efficient ansatz design can be a hardware-efficient ansatz (HEA), which consists of alternating local rotation layers and entanglement layers. 36,37 The number of qubits and the depth of this ansatz are O(polylog(M)).
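The encoding α_i(θ) = |⟨i|V(θ)|++···+⟩|² can be checked directly on a statevector simulator. A minimal Qiskit sketch, assuming the library's RealAmplitudes circuit as a stand-in for the HEA (the specific layer structure is our choice):

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit.library import RealAmplitudes
from qiskit.quantum_info import Statevector

m = 3                      # log2(M) qubits encode M = 8 multipliers
M = 2**m

# Hardware-efficient ansatz V(theta): alternating rotation and entanglement layers.
ansatz = RealAmplitudes(num_qubits=m, reps=2)
theta = np.random.default_rng(2).uniform(-np.pi, np.pi, ansatz.num_parameters)

qc = QuantumCircuit(m)
qc.h(range(m))                                   # prepare |++...+> = (1/sqrt(M)) sum_i |i>
qc.compose(ansatz.assign_parameters(theta), inplace=True)

alpha = np.abs(Statevector(qc).data) ** 2        # alpha_i(theta) = |<i|V(theta)|+...+>|^2
assert np.isclose(alpha.sum(), 1.0)              # alpha is a probability vector in PV_M
print(alpha)
```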
We discovered quantum circuit designs that compute Eqs. (9) and (10) within O(M) time.
Conventionally, the quantum kernel function is defined via the Hilbert-Schmidt inner product: k(·, ·) = |⟨φ(·)|φ(·)⟩|². 4,7,8,10-12 First, we divide the objective function in Eq. (9) into a loss function and a regularizing function of θ using the above ansatz encoding:

L_{φ,λ}(θ; S) = Σ_{i,j=0}^{M−1} α_i(θ) α_j(θ) y_i y_j ( |⟨φ(x_i)|φ(x_j)⟩|² + λ⁻¹ ).   (11)

Specifically, the objective function is equal to L_{φ,λ} + C⁻¹R, where R(θ) is the regularizing function. Similarly, the decision function in Eq. (10) becomes

f_{φ,λ}(x̃; θ, S) = Σ_{i=0}^{M−1} α_i(θ) y_i ( |⟨φ(x_i)|φ(x̃)⟩|² + λ⁻¹ ).   (12)

Inspired by the STC, 10,11 the quantum circuits in Fig. 1 efficiently evaluate L_{φ,λ}, R, and f_{φ,λ}. The quantum gate U_{φ,S} embeds the entire training dataset with the corresponding quantum feature map. We apply a SWAP test and a joint σ_z measurement in the loss and decision circuits to evaluate L_{φ,λ} and f_{φ,λ}:

L_{φ,λ} = ⟨σ_z^a ⊗ σ_z^{y0} ⊗ σ_z^{y1}⟩ + λ⁻¹ ⟨σ_z^{y0} ⊗ σ_z^{y1}⟩, f_{φ,λ} = ⟨σ_z^a ⊗ σ_z^y⟩ + λ⁻¹ ⟨σ_z^y⟩.   (13)

Here, σ_z is the Pauli Z operator, and M_{00···0}, used in the regularization circuit to evaluate R, is the projection measurement operator onto the all-zero state |0···0⟩. We propose the variational quantum approximate support vector machine (VQASVM) algorithm, which solves the SVM optimization problem with a VQA 25 and transfers the optimized parameters to classify new data effectively. Fig. 2 summarizes the process of VQASVM. We estimate

θ* = arg min_θ L_{φ,λ}(θ; S) + C⁻¹R(θ),   (14)

which minimizes the objective function; this is then used for classifying unseen data:

ŷ = sign( f_{φ,λ}(x̃; θ*, S) ).   (15)

Following the general scheme of VQA, the gradient descent (GD) algorithm can be applied: classical processors update the parameters θ_i, whereas the quantum processors evaluate the functions needed to compute the gradients, because the objective function of VQASVM can be expressed as the expectation value of a Hamiltonian.
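As a sanity check of the SWAP-test identity underlying Eq. (13), the following minimal Qiskit sketch (our illustration with toy single-qubit states, not the paper's circuits) verifies ⟨σ_z^a⟩ = |⟨φ|ψ⟩|²:

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Pauli, Statevector

# SWAP test: after H_a . cSWAP . H_a, the ancilla satisfies <sigma_z^a> = |<phi|psi>|^2.
# Toy single-qubit states |phi> = Ry(0.7)|0>, |psi> = Ry(1.9)|0>.
qc = QuantumCircuit(3)      # qubit 0: ancilla a; qubit 1: |phi>; qubit 2: |psi>
qc.ry(0.7, 1)
qc.ry(1.9, 2)
qc.h(0)
qc.cswap(0, 1, 2)
qc.h(0)

z_a = Statevector(qc).expectation_value(Pauli("IIZ"))   # Z on qubit 0 (little-endian)
print(np.real(z_a), np.cos(0.6) ** 2)                   # both ~ |<phi|psi>|^2 = cos^2(0.6)
```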

Experiments on IBM Quantum Processors
We demonstrate the classification of a toy dataset using the VQASVM algorithm on NISQ computers as a proof of concept. Our example dataset is mapped onto a Bloch sphere, as shown in Fig. 3a.
Due to decoherence, we set the data dimension to N = 2 and the number of training data instances to M = 4. First, we randomly choose a great circle on the Bloch sphere that passes through |0⟩ and |1⟩. Then, we randomly choose two opposite points on the circle to be the centers of the two classes, A and B. Subsequently, four training data instances are generated close to each class center so as to avoid overlap with one another and with the test data. This results in a good training dataset with the maximum margin, such that no soft-margin consideration is needed. In addition, thirty test data instances are generated evenly along the great circle and are labelled as 1 or −1 according to their inner products with the class centers. In this case, we can set the hyperparameter C → ∞, and the process hence requires no regularization-circuit evaluation. The test dataset is non-trivial to classify given that the test data are located mostly in the margin area; the convex hulls of the two training classes do not include most of the test data.
We choose a quantum feature map that embeds the data (x_0, x_1) onto the Bloch sphere rather than into C^{N=2}: features x_0 and x_1 are the latitude and the longitude of the Bloch sphere. We use a two-qubit (q_0 and q_1) RealAmplitudes 40 PQC as the ansatz. In this experiment, we use ibmq_montreal, one of the IBM Quantum Falcon processors. The Methods section presents the specific techniques for optimizing quantum circuits against decoherence. The simultaneous perturbation stochastic approximation (SPSA) algorithm is selected to optimize V(θ) due to its rapid convergence and good robustness to noise. 41,42
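A sketch of this setup in Qiskit follows. The exact gate sequence of the latitude/longitude feature map is not reproduced in this text, so the Ry/Rz convention below is our assumption; RealAmplitudes is the ansatz named above.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit.library import RealAmplitudes

def bloch_feature_map(x0, x1):
    """Embed (x0, x1) as latitude/longitude of a single-qubit Bloch vector.
    The Ry/Rz convention is our assumption, not necessarily the paper's gates."""
    qc = QuantumCircuit(1)
    qc.ry(x0, 0)   # latitude
    qc.rz(x1, 0)   # longitude
    return qc

# Two-qubit RealAmplitudes ansatz V(theta) encodes the M = 4 multipliers alpha_i(theta).
ansatz = RealAmplitudes(num_qubits=2, reps=1)
print(ansatz.decompose().draw())
```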

Numerical Simulation
In a practical situation, the measurement of a quantum circuit is repeated R times to estimate the expectation value within ε = O(1/√R) error, which could interfere with the convergence of the VQA.
However, numerical analysis with the Iris dataset 43 confirmed that VQASVM converges even when noise exists in the objective function estimation. The following paragraphs describe the details of the numerical simulation, such as the data preprocessing and the choice of the quantum kernel and the ansatz.
We assigned the label +1 to Iris setosa and −1 to Iris versicolour and Iris virginica for binary classification. The features of the data were scaled to the range [−π, π]. We sampled M = 64 training data instances from the total dataset and treated the rest as test data. The training kernel matrix constructed with our custom quantum feature map is shown in Fig. 4a. The SPSA optimizer was used for training due to its fast convergence and robustness to noise. 41,42 The objective and decision functions were evaluated in two scenarios. The first case samples a finite number of measurement results, R = 8192, to estimate the expectation values, such that the estimation error is non-zero. The second case directly calculates the expectation values with zero estimation error, which corresponds to sampling infinitely many measurement results, i.e., R = ∞.
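The preprocessing just described is straightforward to reproduce; a minimal scikit-learn sketch (the random seed and split mechanics are our choices, not the paper's):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X = MinMaxScaler(feature_range=(-np.pi, np.pi)).fit_transform(iris.data)
y = np.where(iris.target == 0, 1, -1)       # setosa -> +1, versicolour/virginica -> -1

rng = np.random.default_rng(42)             # seed is our choice
train_idx = rng.choice(len(X), size=64, replace=False)
test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```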
We defined the residual training loss as Δ = L_{φ,λ}(θ_t; S) + C⁻¹R(θ_t) − d̄ at iteration t to compare the convergence. Here, d̄ is the theoretical minimum of Eq. (9) as obtained by convex optimization.
Despite some uncertainty, Fig. 4c shows that SPSA converges to a local minimum regardless of the estimation noise. Both the R = 8192 and R = ∞ cases exhibit the characteristic convergence rule of SPSA, Δ ~ O(|θ|/t), for a sufficiently large number of iterations t. A more vivid visualization can be found in Supplementary Fig. S10 online. In addition, the spectrum of the optimized Lagrange multipliers α_i mostly coincides with theory, especially for the significant support vectors (Fig. 4d). Therefore, we concluded that training VQASVM with a finite number of measurements is achievable.
The classification accuracy was 95.34% for R = 8192 and 94.19% for R = ∞.
The empirical speed-up of VQASVM is possible because the number of optimization parameters is exponentially reduced by the PQC encoding. However, this behavior must also be verified in the limit of a large number of qubits, so we examined the scalability of VQASVM on the MNIST dataset. Binary image data of '0's and '1's with a 28 × 28 image size were selected for binary classification, and their features were reduced to 10 by means of a principal component analysis (PCA). The well-known quantum feature map introduced in Ref. 8 was chosen for the simulation; a visualization of the feature map can be found as Supplementary Fig. S4 online. The ansatz architecture used for this simulation was the PQC template shown in Fig. 4b with 19 layers; i.e., the first part of the PQC in Fig. 4b is repeated 19 times. Thus, the number of optimization parameters is 19 × log(M).
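The data pipeline for this simulation can be sketched as follows. Ref. 8 introduces a second-order Pauli-Z evolution feature map, which Qiskit ships as ZZFeatureMap; the repetition/entanglement settings below are our assumptions, not necessarily the paper's.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from qiskit.circuit.library import ZZFeatureMap

# Load MNIST '0' vs '1' images and reduce 28*28 = 784 features to 10 with PCA.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mask = np.isin(mnist.target, ["0", "1"])
X = PCA(n_components=10).fit_transform(mnist.data[mask])
y = np.where(mnist.target[mask] == "0", 1, -1)

feature_map = ZZFeatureMap(feature_dimension=10, reps=2)  # reps=2 is our choice
circuit = feature_map.assign_parameters(X[0])             # embed one data point
```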
The numerical simulation shows that although the residual training loss Δ increases linearly with the number of PQC qubits, the rate is extremely low, i.e., Δ ~ 0.00024 × log(M). Moreover, we could not observe any critical difference in classification accuracy against the reference, i.e., the theoretical accuracy obtained by convex optimization. Therefore, at least for M ≤ 2^13, we conclude that the run-time of VQASVM scales sub-quadratically with the training dataset size.

Discussion
In this work, we propose a novel quantum-classical hybrid supervised machine learning algorithm that achieves a run-time complexity of O(M polylog(M)), whereas the complexity of the modern classical algorithm is O(M²). The main idea of our VQASVM algorithm is to encode the optimization parameters, which represent the normalized weight of each training data instance, in a quantum state using exponentially fewer parameters. We numerically confirmed the convergence and feasibility of VQASVM even in the presence of expectation-value estimation error using SPSA. We also observed the sub-quadratic asymptotic run-time complexity of VQASVM numerically. Finally, VQASVM was tested on cloud-based NISQ processors with a toy example dataset to highlight its potential for practical application.
Based on the numerical results, we presume that our variational algorithm can bypass the issues evoked by the expressibility 36 and trainability trade-off of PQCs; i.e., a highly expressive PQC is not likely to be trainable with VQA due to the vanishing gradient variance, 45 a phenomenon known as the barren plateau. 46 This problem has become a critical barrier for most VQAs utilizing a PQC to generate solution states. However, given that the SVM is learnable (i.e., the singular values of the kernel matrix decay exponentially 5 ), only the few Lagrange multipliers corresponding to the significant support vectors are critically large: α_i ≫ 1/M. 29,30 Figure 4d illustrates this statement.
Thus, the PQC encoding the optimal multipliers does not necessarily have to be highly expressive. We speculate that there exist ansatzes generating these sparse probability distributions. Moreover, the optimal solution has exponential degeneracy because only the measurement probabilities matter, rather than the amplitudes of the state itself. Therefore, we could not observe a critical decrease in classification accuracy. In contrast, the implementation of LS-QSVM is infeasible due to lengthy quantum subroutines, which VQASVM has managed to avoid. Also, the training of LS-QSVM has to be repeated for each query of unseen data because the solution state collapses after the measurements at the end; transferring the solution state to classify multiple test data would violate the no-cloning theorem. VQASVM overcomes these drawbacks.
VQASVM is also composed of much shorter operations; VQASVM circuits are much shallower than HHL circuits of the same moderate system size when decomposed into the same universal gate set. Moreover, the classification phase of VQASVM can be separated from the training phase and performed concurrently: the training results are saved classically and transferred to decision circuits on other quantum processing units (QPUs).
We continue the discussion of the advantage of our method over other quantum kernel-based algorithms, such as the variational quantum classifier (VQC) and the quantum kernel estimator (QKE), which are expected to be realized on near-term NISQ devices. 8 VQC estimates the label of data directly with a variational circuit, whereas QKE must estimate every entry of the kernel matrix; as noted in the Introduction, the number of measurements required for the kernel matrix evaluation of QKE scales with the number of training samples to the power of four. Thus, QKE has much higher complexity than both VQASVM and classical SVM. 8 In addition, unlike in QKE, the generalization error converges to zero as M → ∞ due to the exponentially fewer parameters of VQASVM, strengthening the reliability of the training. 48 The numerically observed linear relation between the decision function error E_f and Δ supports this claim (see Supplementary Fig. S11 online).
VQASVM can be enhanced further with kernel optimization. Like other quantum kernel-based methods, the choice of the quantum feature map is crucial for VQASVM. Unlike previous methods (e.g., quantum kernel alignment 49 ), VQASVM can optimize a quantum feature map online during the training process. Given that U_{φ(·)} is tuned with additional parameters ϕ, the optimal parameters should be the saddle point (θ*, ϕ*) = arg min_θ max_ϕ L_{φ[ϕ],λ}(θ) + C⁻¹R(θ). In addition, tailored quantum kernels (e.g., k(·, ·) = |⟨φ(·)|φ(·)⟩|^{2r}) can be adapted with a simple modification 10 of the quantum circuits for VQASVM to improve the classification accuracy. However, because the quantum advantage in classification accuracy derived from the power of quantum kernels is beyond the scope of this paper, we leave the remaining discussion for future work. Another method to improve VQASVM is boosting.
Since VQASVM is not a convex problem, its performance may depend on the initial point and is not guaranteed to reach the global optimum; boosting an ensemble of independently initialized VQASVM instances could mitigate this dependence.

Proof of Eq. (13)
First, we note that the SWAP test operation (H_a · cSWAP_{a→b,c} · H_a) in Fig. 1 measures the Hilbert-Schmidt inner product between two pure states by estimating ⟨σ_z^a⟩ = |⟨φ|ψ⟩|², where a is the control qubit and |φ⟩ and |ψ⟩ are the states on the target registers b and c, respectively. The quantum registers i and j in Fig. 1 are traced out because measurements are performed only on the a and y qubits.
The reduced density matrix on the x and y quantum registers before the controlled-SWAP operation is

ρ_{xy} = Σ_{i=0}^{M−1} α_i |φ(x_i)⟩⟨φ(x_i)|_x ⊗ |y_i⟩⟨y_i|_y,

which is the statistical mixture of the quantum states |φ(x_i)⟩ ⊗ |y_i⟩ with probabilities α_i. Let us first consider the decision circuit (Fig. 1b). Given that the states |φ(x_i)⟩_x ⊗ |y_i⟩_y and |φ(x̃)⟩_{x̃} are prepared, ⟨A_{a→x,x̃} ⊗ σ_z^y⟩ = y_i ⟨A_{a→x,x̃}⟩ due to separability. Here, A_{a→x,x̃} can be (H_a · cSWAP_{a→x,x̃} · H_a)† σ_z^a (H_a · cSWAP_{a→x,x̃} · H_a) or λ⁻¹ 1_a; averaging over the mixture ρ_{xy} then yields Eq. (13) for f_{φ,λ}. Similarly, for the loss circuit (Fig. 1a), we have the states |φ(x_i)⟩_{x0} ⊗ |y_i⟩_{y0} and |φ(x_j)⟩_{x1} ⊗ |y_j⟩_{y1} with probabilities α_i and α_j, respectively, and the same argument applies.
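For completeness, a compact LaTeX rendering of the decision-circuit computation (our reconstruction of the argument above, under the stated register labels):

```latex
\begin{align}
\rho_{xy} &= \sum_{i=0}^{M-1} \alpha_i \,
  |\phi(x_i)\rangle\langle\phi(x_i)|_x \otimes |y_i\rangle\langle y_i|_y, \\
\langle \sigma_z^a \otimes \sigma_z^y \rangle
  &= \sum_{i=0}^{M-1} \alpha_i \, y_i \,
     |\langle \phi(x_i) | \phi(\tilde{x}) \rangle|^2, \\
f_{\phi,\lambda}(\tilde{x})
  &= \langle \sigma_z^a \otimes \sigma_z^y \rangle
     + \lambda^{-1} \langle \sigma_z^y \rangle
   = \sum_{i=0}^{M-1} \alpha_i \, y_i
     \left( |\langle \phi(x_i) | \phi(\tilde{x}) \rangle|^2 + \lambda^{-1} \right).
\end{align}
```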

Realization of Quantum Circuits
In this article, U_{φ,S} is realized using uniformly controlled one-qubit gates, which require at most M − 1 CNOT gates, M one-qubit gates, and a single diagonal (log(M) + 1)-qubit gate. 35 The physical qubits indexed 1, 2, 3, 4, 5, 8, 11, 14, and 16 of ibmq_montreal were selected in this article (see Supplementary Fig. S1 online). We then assigned the virtual qubits in the order y_0, i_0, i_1, x_0, a, x_1, j_0, j_1, y_1, so that the required connections are made between neighbouring qubits; the mapping from virtual to physical qubits follows this order. We report that, with this arrangement, the circuit depths of the loss and decision circuits are 60 and 59, respectively, for a balanced dataset, and 64 and 63 for an unbalanced dataset, in the basis gate set of ibmq_montreal: {R_z, √X, X, CNOT}.
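Pinning a virtual-to-physical layout like this is done at transpile time. A minimal sketch, assuming a fake-backend snapshot of ibmq_montreal (the import path for fake backends varies across Qiskit versions, and the circuit body below is a placeholder, not the paper's loss circuit):

```python
from qiskit import QuantumCircuit, transpile
from qiskit.providers.fake_provider import FakeMontreal  # path varies by Qiskit version

# Virtual register order from the text: y0, i0, i1, x0, a, x1, j0, j1, y1,
# pinned to physical qubits 1, 2, 3, 4, 5, 8, 11, 14, 16 of ibmq_montreal.
layout = [1, 2, 3, 4, 5, 8, 11, 14, 16]

qc = QuantumCircuit(9)          # stand-in for the loss/decision circuit
qc.h(range(9))                  # placeholder gates; the real circuit goes here

backend = FakeMontreal()        # snapshot of the device's coupling map and basis gates
tqc = transpile(qc, backend=backend, initial_layout=layout, optimization_level=3)
print(tqc.depth())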

Additional Techniques on SPSA
The conventional SPSA algorithm was adjusted for faster and better convergence. First, a blocking technique was introduced: assuming that the variance σ² of the objective function L_{φ,λ} + C⁻¹R is uniform over the parameters θ, the candidate iterate t + 1 is rejected if (L_{φ,λ} + C⁻¹R)(θ_{t+1}) ≥ (L_{φ,λ} + C⁻¹R)(θ_t) + 2σ. SPSA converges more rapidly with blocking because the objective function is prevented, with some probability, from becoming too large (see Supplementary Fig. S10). In addition, the returned parameters were averaged over the last several iterates, θ̄_t = (1/T) Σ_{i=0}^{T−1} θ_{t−i}. Combinations of these techniques were selected for better optimization, and we adopted all of them for the experiments and simulations as the default condition.
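A minimal NumPy sketch of SPSA with the blocking rule described above (our illustration; gain schedules and constants are standard SPSA defaults, not the paper's exact settings):

```python
import numpy as np

def spsa_with_blocking(objective, theta0, iters=200, a=0.1, c=0.1,
                       sigma=0.01, seed=0):
    """SPSA with blocking: reject a step if the objective rises by >= 2*sigma.
    `sigma` estimates the objective's standard deviation (assumed uniform in theta)."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    f_curr = objective(theta)
    for t in range(1, iters + 1):
        a_t, c_t = a / t**0.602, c / t**0.101            # standard SPSA gain schedules
        delta = rng.choice([-1.0, 1.0], size=theta.size) # Rademacher perturbation
        grad = (objective(theta + c_t * delta)
                - objective(theta - c_t * delta)) / (2 * c_t) * delta
        cand = theta - a_t * grad
        f_cand = objective(cand)
        if f_cand < f_curr + 2 * sigma:                  # blocking rule
            theta, f_curr = cand, f_cand
    return theta

# Toy usage on a noisy quadratic standing in for L + C^{-1} R.
noisy = lambda th: np.sum(th**2) + np.random.default_rng().normal(0, 0.01)
print(spsa_with_blocking(noisy, np.ones(4)))
```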

Warm-start optimization
We report cases in which optimization on IBM Quantum processors yields vanishing kernel amplitudes due to the constantly varying error-map problem. The total run time should be minimized to avoid this problem. Because accessing a QPU involves a relatively long queue time, we apply a 'warm-start' technique, which reduces the number of QPU calls. First, we initialize and run a few (32) iterations with a noisy simulation on a CPU, and then evaluate the functions on a QPU for the remaining iterations. Note that the SPSA optimizer requires heavy initialization computation, such as calculating the initial variance. With this warm-start method, we were able to obtain better results on some trials.

Author contributions
S.P. contributed to the development and experimental verification of the theoretical and circuit models; D.K.P. contributed to developing the initial concept and the experimental verification; J.K.R.
contributed to the development and validation of the main concept and organization of this work. All co-authors contributed to the writing of the manuscript.

Data Availability
The numerical data generated in this work are available from the corresponding author upon reasonable request. https://github.com/Siheon-Park/QUIC-Projects

Additional information
The authors declare no competing interests.

A Review of quantum support vector machine
Classification is a fundamental problem in machine learning. The goal of L-class classification is to infer the most likely class label of an unseen data point x̃ ∈ C^N, given a labelled dataset S = {(x_i, y_i) : x_i ∈ C^N, y_i ∈ {0, 1, ..., L−1}}_{i=0}^{M−1}. Although the data are real-valued in usual machine learning tasks, we allow complex-valued data without loss of generality. The Support Vector Machine (SVM) is a supervised machine learning algorithm that classifies data into several classes by optimizing separating hyperplanes. 15,16 SVM is one of the most robust classifiers, as a global minimum is guaranteed by convex optimization, 51 and it has shown excellent performance not only in classification but also in regression and numerous other fields. It is also known to be efficient for non-linear classification and regression of high-dimensional data with the kernel trick. In this paper, we focus on binary classification, where only two classes exist (i.e., L = 2), since multi-class classification can be achieved with multiple binary SVMs by a one-vs-all or one-vs-one scheme. Also, for notational simplicity, we assume N = 2^n and M = 2^m (n, m ∈ N_0).

A.1 Hard-Margin SVM
For a given dataset S with y_i ∈ {−1, 1}, suppose the dataset is linearly separable; that is, there exists a hyperplane ⟨w, x⟩ + b = 0 that separates the two classes, together with two parallel supporting hyperplanes ⟨w, x⟩ + b = ±1. The margin, the distance between the two parallel supporting hyperplanes, which we want to maximize, is 2/‖w‖. Since arg max 2/‖w‖ = arg min ‖w‖²/2, selecting the optimal hyperplane ⟨w*, x⟩ + b* = 0 and its parameters (w*, b*) is equivalent to solving the quadratic optimization problem (S2).
Instead of solving this primal problem, solving the dual problem is much easier. The dual problem of (S2) is

max_β Σ_{i=0}^{M−1} β_i − (1/2) Σ_{i,j=0}^{M−1} β_i β_j y_i y_j ⟨x_i, x_j⟩ s.t. β_i ≥ 0, Σ_{i=0}^{M−1} β_i y_i = 0,   (S3)

where the optimal solutions of the primal and dual problems satisfy the Karush-Kuhn-Tucker conditions.
By investigating the complementary slackness condition (S5), it is clear that only the data on the supporting hyperplanes can have non-zero β_i, since y_i(⟨w*, x_i⟩ + b*) = 1 for those data. Note that from this relationship one can find the optimal bias that disappeared in the dual problem: b* = y_q − ⟨w*, x_q⟩ for any q with β_q > 0. Since the separating hyperplane is determined by a linear combination of only these data, w* = Σ_i β_i y_i x_i, they are named support vectors. The estimated label ŷ of test data x̃ is determined by the relative location of the test data with respect to the separating hyperplane:

ŷ = sign( Σ_{i=0}^{M−1} β_i y_i ⟨x_i, x̃⟩ + b* ).   (S8)

From (S8), we can treat the Lagrange multipliers β as normalized weights of the labelled data.

A.2 Soft-Margin SVM
C-SVM introduces slack variables ξ_i to quantify the violation by outlier data lying between the two supporting hyperplanes. The goal is to minimize this violation while maximizing the distance between the supporting hyperplanes.
Here, the hyperparameter C is a user-defined value that controls over-fitting and under-fitting. We have adopted the formulation in Ref. 13 instead of the original formulation in Ref. 15. We can interpret the term (1/2)‖w‖² as a regularizing term and 1/C as its hyperparameter. By the same logic as in the hard-margin SVM case, solving the dual problem is much easier. The dual problem of (S9) is

max_β Σ_{i=0}^{M−1} β_i − (1/2) Σ_{i,j=0}^{M−1} β_i β_j y_i y_j ⟨x_i, x_j⟩ s.t. 0 ≤ β_i ≤ C, Σ_{i=0}^{M−1} β_i y_i = 0,   (S10)

where the optimal solutions of the primal and dual problems satisfy the Karush-Kuhn-Tucker conditions.
By investigating the complementary slackness condition in (S12), it is clear that only the data on or between the supporting hyperplanes can have non-zero β_i, since y_i(⟨w*, x_i⟩ + b*) = 1 − ξ_i for those data.
Therefore, with the same logic as before, the optimal bias can be obtained.
However, there may be no example lying exactly on a supporting hyperplane. In this case, we can approximate b* as the median value of the absolute difference |y_i − ⟨w*, x_i⟩| among all training data. 30 Estimating the test label proceeds exactly as in the hard-margin SVM (S8).

A.3 Kernel Trick
Even with the soft-margin assumption, a dataset may not be efficiently linearly separable. In this case, we can define a feature map φ : X → H that maps the data to a high-dimensional Hilbert space H. With a sophisticated feature map, we expect the mapped dataset x → φ(x) to be linearly separable in H.
However, defining the feature map explicitly may be neither practical nor possible. For example, the dimension of H must be infinite in order to linearly separate an arbitrary mapped dataset, and the closeness between two mapped data points is not obvious. Note that both Equations (S15) and (S16) require only the calculation of inner products between the mapped features rather than φ itself. Thus, rather than constructing an ill-defined vector map, we define a kernel function k : X × X → C.
We can consider the kernel a measure of similarity, as it is defined as the inner product between mapped vectors. It has been proven that there exists a unique feature map φ for any positive semi-definite kernel, k(·, ·) = ⟨φ(·), φ(·)⟩_H. Consider the formulations (S10) and (S8); they imply that only the inner products between examples matter for a given dataset. Therefore, if we define features φ(x_i) to represent x_i, then the dual SVM formulation (S10) and the classification equation (S8) remain almost the same except for the inner-product part.
Since φ(·) can be a highly non-linear mapping, we can construct a non-linear classifier on a given dataset using SVM, even though SVM originally solves only linear classification problems. Users thus gain additional degrees of freedom from the arbitrary selection of the feature map used to classify examples.
However, φ(·) may map the original data to a very high-dimensional Hilbert space, so that explicitly calculating not only the features themselves but even the inner products of features may cost severe computational resources. It has been proven that there exists a unique feature map for any positive semi-definite (PSD) kernel; therefore, users can simply define a kernel to solve the dual SVM problem. This is known as the kernel trick. 15,29,30 There are several popular choices of kernel, for example the polynomial kernel and the Gaussian radial basis function kernel. The polynomial kernel is defined as

k(x_1, x_2) = (⟨x_1, x_2⟩ + c)^d,   (S20)

with kernel hyperparameters bias c and order d. The Gaussian radial basis function (RBF) kernel is defined as

k(x_1, x_2) = exp(−γ ‖x_1 − x_2‖²),   (S21)

with kernel hyperparameter γ, an inverse-variance scale.
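Both kernels of (S20) and (S21) amount to a few lines of NumPy; the following sketch (our illustration) computes the Gram matrices and checks positive semi-definiteness, the property that guarantees a unique implicit feature map.

```python
import numpy as np

def polynomial_kernel(X1, X2, c=1.0, d=3):
    """Eq. (S20): k(x1, x2) = (<x1, x2> + c)^d, as a Gram matrix."""
    return (X1 @ X2.T + c) ** d

def rbf_kernel(X1, X2, gamma=0.5):
    """Eq. (S21): k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)

X = np.random.default_rng(3).normal(size=(5, 2))
K = rbf_kernel(X, X)
assert np.all(np.linalg.eigvalsh(K) > -1e-10)   # PSD, as required of a kernel
```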