Transforming Bell's Inequalities into State Classifiers with Machine Learning

Quantum information science has profoundly changed the ways we understand, store, and process information. A major challenge in this field is to find an efficient means of classifying quantum states. For instance, one may want to determine whether a given quantum state is entangled or not. However, the complete characterization of a quantum state, known as quantum state tomography, is in general a resource-consuming operation. An attractive alternative is the use of Bell's inequalities as an entanglement witness, where only partial information about the quantum state is needed. The problem is that entanglement is necessary but not sufficient for violating Bell's inequalities, making them an unreliable state classifier. Here we aim to solve this problem with the methods of machine learning. More precisely, given a family of quantum states, we randomly pick a subset of it to construct a quantum-state classifier that accepts only partial information about each quantum state. Our results indicate that these transformed Bell-type inequalities can perform significantly better than the original Bell's inequalities in classifying entangled states. We further extend our analysis to three-qubit and four-qubit systems, performing classification of quantum states into multiple classes. These results demonstrate how tools from machine learning can be applied to solving problems in quantum information science.


I. INTRODUCTION
Quantum machine learning is an emerging field of research at the intersection of quantum physics and machine learning, which has profoundly changed the way we interact with data. It represents a new paradigm of information processing, which, at the fundamental level, is still governed by the laws of quantum mechanics. In addition, there is also a real "demand" for advanced data-processing techniques in gate-fidelity benchmarking and data analysis for state-of-the-art quantum experiments. Therefore, understanding the connection between quantum information science and machine learning is a matter of great fundamental and practical interest.
In general, there are many ways in which research in quantum machine learning has become fruitful. One way is to design quantum algorithms to speed up classical machine learning [1][2][3][4][5]. For example, quantum extensions of principal component analysis (PCA) [1] and support vector machines (SVM) [2,6] have been invented. Furthermore, quantum algorithms [3,5,7] can achieve exponential speedups for some distance-based problems.
On the other hand, the other approach in quantum machine learning is to apply machine-learning methods to study problems in quantum physics and quantum information science. In particular, classical machine-learning methods [8,9] have been applied to many-body [10][11][12], superconducting [13], bosonic [14], and electronic [15] systems. Furthermore, machine learning can also be applied to the problem of quantum-Hamiltonian learning [16,17]. Beyond quantum information science, machine learning also finds applications in particle physics [18], the electronic structure of molecules [19], and gravitational physics [20].
In this work, we are interested in applications of machine learning to the problem of quantum-state classification [21], which is a generalization of pattern recognition in learning theory. In the classical setting of pattern recognition, we are given a training set S containing paired values, S = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ...}, where x_i is a data point and y_i ∈ {0, 1} is a pre-determined label for x_i. Furthermore, there are many classical methods in machine learning inspired by ideas in physics [22] and quantum information [23,24].
Based on the training set, the problem of pattern recognition is to construct a low-error classifier (or predictor), in the form of a function f : x → y, for predicting the labels of new data. The quantum extension of this problem is to replace the data points x_i with density matrices of quantum states ρ_i, i.e., x_i → ρ_i. The challenge is that obtaining full information about a given quantum state becomes resource-consuming as the number of qubits increases.
Instead of full information (e.g. from quantum tomography), we aim at constructing a set of quantum-state classifiers which can reliably output a correct label for a given quantum state in an ensemble, using only partial information (i.e., a few observables) about the state. Our strategy is motivated by the development of Bell's inequalities, which were originally used to exclude incompatible classical theories based on a few measurement results performed non-locally.
In fact, there are challenges in using Bell's inequalities for the purpose of state classification. Quantum mechanically, it is well known that entanglement is necessary for violating, e.g., the CHSH (Clauser-Horne-Shimony-Holt) inequality (see also Eq. (3)). However, entanglement is not sufficient, meaning that there are many entangled states that do not violate the CHSH inequality, which makes it an unreliable state classifier for detecting quantum entanglement.
Our strategy is to "transform" Bell's inequalities into a reliable state classifier. However, the non-locality aspect of Bell's inequalities is not relevant to the construction of our quantum state classifiers, although we can follow the same experimental setting for an implementation of our proposal.
Here the transformation involves two levels. First, we ask the following question: "given the same measurement setting, is it possible to optimize the coefficients of the CHSH inequality for a better performance, compared with the values (1, −1, 1, 1, 2) employed in the standard CHSH inequality (see Eq. (3))?" We shall see that the answer to this question is positive. This optimization is linear, in the sense that the resulting optimization function takes a linear combination of the observables as input.
In the second level, instead of linear optimization, we include hidden layers in a non-linear optimization process and at the same time allow the measurement angles to be varied randomly. We found that in this way, the performance of the classifier can be enhanced significantly, relative to the first level. This method is then applied to several different scenarios of quantum state classification. Before we go into the details, we provide an overview and summary of the main results below.

A. Overview and main results
In this work, we present an application of supervised machine learning to the problem of classifying quantum states, where we construct quantum-state classifiers from a training set of quantum states. Here a classifier for quantum states outputs a "label", for example entangled or unentangled, for any quantum state, taking partial (or, in some cases, fully tomographic) measurement data of the quantum state as input.
More specifically, we consider scenarios where quantum states are distributed to different parties through a noisy channel characterized by some unknown parameter. The parties are given the opportunity to test the channel through a set of testing states, which corresponds to the training phase of machine learning. At the end, the parties are given a non-linear function optimized for the purpose of state classification, where only partial information is required for testing new quantum states beyond the training set.
Our non-linear quantum-state classifier is constructed by a technique in machine learning known as the "multilayer perceptron" [25], an artificial neural network composed of several layers, where information flows from the input layer, through the hidden layer, and finally to the output layer.
The input layer contains the information about the quantum state, where the expectation values of certain observables are taken as the elements of a vector x. The hidden layer contains another vector x_1, which is constructed through the relation

x_1 = σ_RL(W_1 x + w_01) .

Here W_1 and w_01 are initialized uniformly and optimized through the learning process, and the ReLU function [26], defined by σ_RL([z_1, z_2, ..., z_{n_e}]^T) = [max{z_1, 0}, max{z_2, 0}, ..., max{z_{n_e}, 0}]^T, is a nonlinear function applied to every neuron. Finally, the neuron(s) in the output layer contain the probabilities for the input state to belong to a specific class. For example, for binary-state classification, where only one neuron is needed, the output contains the probability for the input state to be identified as entangled or separable.

In the following, we shall demonstrate how machine-learning methods can solve the following quantum-state classification problems:

1. We first consider the question: "is it possible to optimize the coefficients (1, −1, 1, 1, 2) in the CHSH inequality such that the inequality becomes a better state classifier?" We found that the answer is positive; in particular, we found that the choice of (−0.521, 0.603, −0.025, 0.016, 0.373) can yield a much better performance for our testing states.
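The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the layer sizes, the random initialization, and the feature vector below are placeholders, not the trained model of the paper.

```python
import numpy as np

def relu(z):
    # sigma_RL: elementwise max(z, 0), applied to every hidden neuron
    return np.maximum(z, 0.0)

def sigmoid(z):
    # sigma_S(z) = 1 / (1 + e^{-z}), used for the output neuron
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w01, W2, w02):
    """One hidden layer: x1 = ReLU(W1 x + w01); output = sigmoid(W2 x1 + w02)."""
    x1 = relu(W1 @ x + w01)
    return sigmoid(W2 @ x1 + w02)

# Illustrative sizes: 4 input features (e.g. CHSH correlators), 20 hidden neurons
rng = np.random.default_rng(0)
n_f, n_e = 4, 20
W1, w01 = rng.normal(size=(n_e, n_f)), np.zeros(n_e)
W2, w02 = rng.normal(size=(1, n_e)), np.zeros(1)
p = forward(rng.uniform(-1, 1, size=n_f), W1, w01, W2, w02)
# p is a probability in (0, 1), interpreted as "separable vs entangled"
```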
2. Second, instead of linear optimization, we include non-linear elements and a hidden layer. At the same time, we allow the parties to choose random measurement angles. We found that the performance of the state classifier based on the machine-learning method can be enhanced significantly.
3. Third, we ask the question: "is it possible to construct a universal state classifier for detecting quantum entanglement?" If possible, this would be a valuable tool for many tasks in quantum information theory. However, the challenge is to find a reliable way of labeling the quantum states in the training set. For a pair of qubits, this is possible by using the PPT (Positive Partial Transpose) criterion. We have constructed such a universal state classifier for a pair of qubits; we found that the performance depends heavily on the training set, and the major source of error comes from training data near the boundary between entangled and separable states.

4. Next, we consider multiple-state classification involving systems of three qubits, where the entanglement structure is more complicated than for two qubits. We constructed a quantum-state classifier that can identify four types of states.

5. Finally, we considered an ensemble of four-qubit systems. We analyzed the performance of the state classifier in terms of three groups of quantum states.

II. PRELIMINARIES

A. General background on quantum entanglement
Entanglement is a key feature of quantum mechanics, where the correlation between pairs or groups of particles cannot be described within a local realistic classical model. In quantum information theory, entanglement is regarded as an important resource for achieving tasks such as quantum teleportation [27], quantum computation [28], and quantum cryptography [29].
However, given a quantum state, the problem of determining whether it is entangled or not is computationally hard (NP-hard [30], to be precise). This question is particularly important in quantum experiments. Currently, methods of entanglement detection have been developed for specific scenarios [31]. The most popular ones include the Positive Partial Transpose (PPT) criterion [32,33] and entanglement witnesses [32,[34][35][36]. For a pair of qubits, the PPT criterion is both sufficient and necessary for entanglement detection [37]. However, PPT is a necessary but not sufficient condition for multi-qubit systems. In addition, it requires knowledge of the whole density matrix. Experimentally, this means one needs to perform quantum state tomography, which is resource-consuming for multi-qubit systems.

(Figure caption: To distinguish as many quantum states as possible, a hidden layer with a nonlinear function is inserted between the input and output layers. The task is to optimize σ_S(W_2 σ_RL(W_1 x + w_01) + w_02), where σ_RL is the ReLU function [26] applied to every hidden neuron and σ_S(x) = 1/(1 + e^{−x}) is the sigmoid function.)
Moreover, entanglement witnesses represent a different approach to entanglement detection. A witness is an observable W such that Tr(Wρ) ≥ 0 for all separable states ρ. If Tr(Wρ) < 0 for (at least) one entangled state ρ, then we say that W detects ρ [31,38]. Here the trace Tr(Wρ) = ⟨W⟩ represents the measurement result of W on ρ. Of course, it is possible that a given witness fails to detect some entangled states, i.e., Tr(Wρ) ≥ 0 for an entangled state ρ.
On the other hand, quantum entanglement is necessary for a violation of Bell's inequalities [39], which has been confirmed in numerous experiments [40][41][42][43][44]. In principle, Bell's inequality can be employed for detecting quantum entanglement; it can witness some entangled states. It is an attractive direction, as only partial information about the quantum state is needed. However, the standard Bell's inequalities detect only a small fraction of the entangled states; a situation similar to that of entanglement witnesses. Motivated by this problem, one of our goals is to construct a quantum-state classifier for entanglement detection through optimizing Bell's inequalities.

B. Separable states and CHSH inequality
To get started, let us consider an ensemble of quantum states ρ of n qubits; the method is also applicable to qudit systems. Recall that a quantum state is separable if and only if it can be expressed as a convex combination of product states, i.e.,

ρ_sep = Σ_i p_i ρ_i^(1) ⊗ ρ_i^(2) ⊗ ... ⊗ ρ_i^(n) ,   p_i ≥ 0 ,   Σ_i p_i = 1 .

Otherwise, the quantum state is entangled.
In fact, entanglement is necessary for a violation of the Bell inequalities [39], e.g. the CHSH (Clauser-Horne-Shimony-Holt) inequality [45],

|⟨ab⟩ − ⟨ab′⟩ + ⟨a′b⟩ + ⟨a′b′⟩| ≤ 2 ,

where ⟨·⟩ represents the expectation value, and {a, a′} and {b, b′} are the detector settings of parties A and B respectively, with outcomes taking only the two values ±1 (see Fig. 2). Furthermore, n̂ = n_1 σ_x + n_2 σ_y + n_3 σ_z for n̂ ∈ {a, a′, b, b′}, where σ_{x,y,z} are the Pauli matrices. Quantum states violating the CHSH inequality can be labeled as "entangled". However, CHSH inequalities cannot be employed as a reliable tool for entanglement detection, for two reasons. First, there exist entangled states that do not violate the Bell inequalities. To be more specific, a maximally entangled state, such as |ψ−⟩ = (|00⟩ − |11⟩)/√2 for a pair of qubits, can maximally violate the CHSH inequality [39]. However, this tool fails in the presence of noise, in the form of a quantum channel. After passing through a depolarizing channel [46], the resulting state,

ρ = p |ψ−⟩⟨ψ−| + (1 − p) I/4 ,

where 0 ≤ p ≤ 1, violates the CHSH inequality only if p > 1/√2 ≈ 0.707 [31]. However, the state is entangled whenever p > 1/3 ≈ 0.333 [31].
Another reason is that the suitable measurement angles depend on the quantum state. For example, suppose we choose fixed measurement angles such that the CHSH operator becomes

Π_CHSH = √2 (σ_x ⊗ σ_x − σ_z ⊗ σ_z) .

Then, for any given quantum state of the form

|ψ_θ,φ⟩ = cos(θ/2) |00⟩ + e^{iφ} sin(θ/2) |11⟩ ,

we have the expectation value ⟨ψ_θ,φ| Π_CHSH |ψ_θ,φ⟩ = √2 (sin θ cos φ − 1), which is equal to −2√2 when θ = π/2 and φ = π, i.e., when |ψ_θ,φ⟩ = |ψ−⟩. For a different value of φ, e.g. φ = π/2, the resulting quantum state can no longer be used to violate this particular CHSH inequality. Therefore, in general, the family of (original) CHSH inequalities cannot be employed as a reliable tool for detecting quantum entanglement of given quantum states.
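This can be checked numerically. The sketch below is our reconstruction: it assumes the fixed-angle CHSH operator Π_CHSH = √2(σ_x ⊗ σ_x − σ_z ⊗ σ_z) and the state |ψ_θ,φ⟩ = cos(θ/2)|00⟩ + e^{iφ} sin(θ/2)|11⟩, both chosen to be consistent with the quoted expectation value √2(sin θ cos φ − 1).

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
# Assumed fixed-angle CHSH operator, consistent with <Pi> = sqrt(2)(sin t cos f - 1)
Pi = np.sqrt(2) * (np.kron(X, X) - np.kron(Z, Z))

def psi(theta, phi):
    # Assumed state family: cos(theta/2)|00> + e^{i phi} sin(theta/2)|11>
    v = np.zeros(4, dtype=complex)
    v[0] = np.cos(theta / 2)
    v[3] = np.exp(1j * phi) * np.sin(theta / 2)
    return v

theta, phi = np.pi / 2, np.pi          # the state |psi-> = (|00> - |11>)/sqrt(2)
v = psi(theta, phi)
val = np.real(v.conj() @ Pi @ v)       # sqrt(2)*(sin(theta)*cos(phi) - 1) = -2*sqrt(2)
```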

III. OPTIMIZING CHSH OPERATOR WITH MACHINE LEARNING
In this work, we consider two types of machine-learning models to classify different types of quantum ensembles (see Fig. 1), namely (i) tomographic predictors and (ii) Bell-like predictors. Tomographic predictors make use of all the information of a given quantum state and are used to benchmark the performance of Bell-like predictors, which employ only a subset of non-orthogonal measurement settings. For example, for a pair of qubits, the tomographic predictor takes as features the Cartesian product of two sets of Pauli operators, {I, σ_x, σ_y, σ_z} × {I, σ_x, σ_y, σ_z}, which contains a total of 15 non-trivial combinations. On the other hand, the CHSH operator in Eq. (5) can be regarded as an example of a Bell-like predictor.
To elaborate further, we construct a linear Bell-like predictor by generalizing the CHSH operator as (see Eq. (3) for notation)

w_1 ⟨a_0 b_0⟩ + w_2 ⟨a_0 b_0′⟩ + w_3 ⟨a_0′ b_0⟩ + w_4 ⟨a_0′ b_0′⟩ + w_0 ,

where the coefficients (or weights) {w_0, w_1, w_2, w_3, w_4} are determined by the method of machine learning, through minimizing the error of detecting quantum entanglement in a given quantum ensemble. Here the measurement angles {a_0, a_0′, b_0, b_0′} are taken to be the same as those in Π_CHSH defined in Eq. (5). We denote the resulting Bell-like predictor as CHSH_ml. For a given quantum state, the set of observables (called features) is taken as the input of the machine-learning program; normally, the number of elements in this set should be much smaller than the dimension of the quantum state. In fact, the method of machine learning allows us to construct more general Bell-like predictors, given the same number of features. Their key element is the inclusion of an extra hidden layer of neurons (see Fig. 2), compared with the linear predictor CHSH_ml. Moreover, each link between a pair of neurons is associated with a weight to be optimized in the learning phase.
Specifically, here we consider a class of (non-linear) predictors denoted by Bell_ml(n, n_f, n_e), where n labels the number of qubits in the quantum state, n_f the number of features, and n_e the number of neurons in the hidden layer of the neural network. Apart from the extra neurons in the hidden layer, the measurement angles {a, a′, b, b′} in the corresponding feature list are taken randomly.

(Figure caption: (a-d) For the states ρ_θ,φ (Eq. (10)) with fixed θ and p but different angles φ, the vertical axis presents the mismatch rate between the PPT criterion and the other predictors. Since PPT detects all entangled states in a 2-qubit system, the mismatch rate can also be called the error rate of entanglement detection. Green and blue areas represent entangled and separable states, respectively. (d) implies that the new linear predictor (CHSH_ml) intrinsically searches for the best critical p to divide the entangled and separable ensembles. (c) is the cross section of (d) at θ = π/2, which illustrates the optimization for maximally entangled states. (e-f) Mismatch rate of entanglement detection by Bell-like predictors with different hidden layers (150, 20, or no neurons) on a 2-qubit system with random measurements. Mismatches happen only near the boundary between entangled and separable states. For θ = 0, all the ensembles with different φ degenerate into one state, so the mismatch rate is either 1 or 0.)
As the first "test run" of our machine-learning method, we focus on the following family of quantum states:

ρ_θ,φ = p |ψ_θ,φ⟩⟨ψ_θ,φ| + (1 − p) I/4 ,

where |ψ_θ,φ⟩ is defined in Eq. (6) and 0 ≤ p ≤ 1. For a pair of qubits, the entanglement can be determined by checking the PPT (positive partial transpose) criterion [32,33]: let ρ^T_B_θ,φ be the matrix obtained by taking the partial transpose of ρ_θ,φ in the second qubit. The state is entangled if and only if the smallest eigenvalue of ρ^T_B_θ,φ is negative. For our case, the minimal eigenvalue can be obtained analytically (see supplementary materials); it is given by

λ_min(ρ^T_B_θ,φ) = (1 − p)/4 − (p/2) sin θ ,

which is negative if and only if p > 1/(1 + 2 sin θ). For each quantum state in the training set, we first evaluate λ_min(ρ^T_B_θ,φ) in order to create a label for it. In Fig. 3a, we depict the portion of separable states in the colored area of a Bloch sphere.
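The PPT labeling step can be reproduced numerically. The sketch below assumes the family ρ_θ,φ = p|ψ_θ,φ⟩⟨ψ_θ,φ| + (1 − p)I/4 with |ψ_θ,φ⟩ = cos(θ/2)|00⟩ + e^{iφ} sin(θ/2)|11⟩, and compares the smallest eigenvalue of the partial transpose against the closed form (1 − p)/4 − (p/2) sin θ, which is consistent with the entanglement boundary p = 1/(1 + 2 sin θ).

```python
import numpy as np

def rho_theta_phi(theta, phi, p):
    # Assumed family: p |psi><psi| + (1-p) I/4
    v = np.zeros(4, dtype=complex)
    v[0], v[3] = np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)
    return p * np.outer(v, v.conj()) + (1 - p) * np.eye(4) / 4

def partial_transpose_B(rho):
    # View rho as rho[a, b, a', b'] and transpose the two B indices
    r = rho.reshape(2, 2, 2, 2)
    return r.transpose(0, 3, 2, 1).reshape(4, 4)

def lambda_min(rho):
    # Smallest eigenvalue of the partially transposed (Hermitian) matrix
    return np.linalg.eigvalsh(partial_transpose_B(rho)).min()

theta, phi, p = 1.1, 0.7, 0.6
lm = lambda_min(rho_theta_phi(theta, phi, p))
analytic = (1 - p) / 4 - p * np.sin(theta) / 2  # assumed closed form
# lm < 0 here, i.e. the state is labeled "entangled" by the PPT criterion
```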

A. Training phase of the predictors
We now investigate the performance of CHSH_ml, which is essentially a linearly-optimized version of CHSH, and of a non-linear predictor obtained with machine learning (see Fig. 2b). First, we need to generate an initial set of quantum states, called the training set. These states are generated by sampling θ and φ from uniform distributions, but p from a Gaussian distribution with mean value 1/(1 + 2 sin θ), which yields an ensemble of states in the neighborhood of the hyperplane separating separable and entangled states.
Specifically, for each state in the training set, we evaluated the four features {⟨a_0 b_0⟩, ⟨a_0 b_0′⟩, ⟨a_0′ b_0⟩, ⟨a_0′ b_0′⟩} in CHSH_ml, putting them into a four-dimensional feature vector x of an ANN (artificial neural network). In fact, if we consider only one side of the inequality, the CHSH inequality is equivalent to

w_0 − W_0 x ≥ 0 ,

where W_0 = [1, −1, 1, 1] and w_0 = 2. In other words, the CHSH inequality is violated iff the output value is negative. The optimization of CHSH_ml is thus equivalent to finding an optimal set of matrix elements for W and w_0, given the training set of quantum states. The training steps are as follows: first, we apply a sigmoid function to the output layer (Fig. 2(b,c)); the output value represents the probability that the state is separable. Then, we make use of a loss function constructed from the cross entropy [47] to calculate the difference between the predictor and the results based on the PPT criterion for many copies in the given quantum ensemble. Next, the loss function is minimized using the stochastic gradient descent algorithm [48]. At the end, we obtain a vector W and a bias w_0 optimized by the above process.
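The training loop — sigmoid output, cross-entropy loss, stochastic gradient descent — can be sketched as follows. The toy features and labels here are synthetic stand-ins for the PPT-labeled CHSH correlators, not the paper's actual data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, lr=0.1, epochs=200, seed=0):
    """Linear predictor sigmoid(x . W + w0) trained with cross-entropy loss by SGD.
    X: (n_samples, 4) feature vectors; y: 0/1 labels (here 1 = separable)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=X.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = sigmoid(X[i] @ W + w0)
            # Gradient of -[y log p + (1-y) log(1-p)] w.r.t. the logit is (p - y)
            g = p - y[i]
            W -= lr * g * X[i]
            w0 -= lr * g
    return W, w0

# Synthetic, linearly separable toy data in place of PPT-labeled CHSH features
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 4))
y = (X @ np.array([1.0, -1.0, 1.0, 1.0]) < 0.5).astype(float)
W, w0 = sgd_train(X, y)
acc = np.mean((sigmoid(X @ W + w0) > 0.5) == (y == 1))  # training accuracy
```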

B. Testing phase of the predictors
After the predictor is well-trained, we test its performance by creating a new quantum ensemble distinct from the data set employed for training. Here the testing data come from an ensemble of quantum states ρ_θ,φ with uniform distributions of p, θ, and φ. Note that, from Eq. (11), the entanglement of ρ_θ,φ depends on the values of p and θ but not on φ. However, the same set of features of the new density matrices is provided as input; the values of p and θ are not directly provided in the testing phase, but they are used to evaluate the performance of the predictors.
We quantify the performance of the CHSH_ml predictor as follows: for given values of p and θ, the mismatch rate R_mm(p, θ) is defined as the probability that the predictor outputs a label different from the PPT criterion, averaged over a uniform distribution of the angle φ, i.e.,

R_mm(p, θ) = Pr_φ ( x_ML ≠ x_PPT ) ,

where x_ML ∈ {0, 1} labels the output of the machine-learning predictor; 1_ML (0_ML) means separable (entangled), and similarly for x_PPT. Of course, the match rate can be defined in a similar way.
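A Monte Carlo estimate of R_mm(p, θ) might look like the sketch below; the two predictor callables are hypothetical placeholders standing in for the trained model and the PPT labeler.

```python
import numpy as np

def mismatch_rate(p, theta, predict_ml, predict_ppt, n_phi=100):
    """Estimate R_mm(p, theta): fraction of phi values (sampled uniformly on
    [0, 2*pi)) where the ML label differs from the PPT label."""
    phis = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    ml = np.array([predict_ml(p, theta, phi) for phi in phis])
    ppt = np.array([predict_ppt(p, theta, phi) for phi in phis])
    return float(np.mean(ml != ppt))

# Sanity check with placeholder predictors: if both always answer
# "separable" (label 1), the mismatch rate is zero.
always_sep = lambda p, theta, phi: 1
r = mismatch_rate(0.5, np.pi / 2, always_sep, always_sep)
```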
The mismatch rates for the predictor CHSH_ml are shown in Fig. 3(c,d). We also include the use of the CHSH inequality for entanglement detection for comparison. The numerical data indicate that both CHSH and CHSH_ml can identify the regime where λ_min(ρ^T_B_θ,φ) > 0 as separable, except when θ is close to zero. When θ = 0, the state |ψ_θ,φ⟩ reduces to the single state |00⟩ for any choice of φ. Therefore, the mismatch rate becomes 100% whenever the predictor makes a mistake. We shall see that this problem of CHSH_ml also exists in the BELL_ml predictor without a hidden layer. However, the problem goes away once hidden layers are included.
Beyond that region, CHSH produces a 100% mismatch rate, but CHSH_ml can significantly reduce the mismatch rate as p increases. The reason CHSH produces a 100% mismatch rate is similar to the situation explained after Eq. (4): there exist entangled states that do not violate the CHSH inequality.
Therefore, we conclude that the performance of CHSH_ml is significantly better than that of CHSH in detecting quantum entanglement in the regime λ_min(ρ^T_B_θ,φ) < 0. In CHSH_ml, the measurement angles are fixed; we shall see that the performance of machine learning can be significantly improved if we instead choose the measurement angles randomly.
The result of the predictor BELL_ml(2, 4, 0) (i.e., for 2 qubits, 4 features, and no hidden neurons) is shown in Fig. 3(e,f). We have also included the results of the BELL_ml(2, 4, 20) and BELL_ml(2, 4, 150) predictors, with respectively 20 and 150 neurons in the hidden layer, for comparison. The overall performance in terms of mismatch rates is significantly improved compared with the CHSH_ml predictor. Furthermore, the inclusion of a hidden layer significantly mitigates the problem of CHSH_ml near θ = 0. Numerically, we found that the results with a total of 150 neurons in the hidden layer do not significantly outperform those with 20 neurons.

IV. CLASSIFYING GENERAL TWO-QUBIT STATES
In the previous section, we studied the ability of the machine-learning predictors to detect the entanglement of quantum states of the form given in Eq. (10), which belongs to the type II problem. A potentially more interesting question is: can we construct, by machine learning, a universal predictor that accepts only partial information about the quantum state but, at the same time, can detect all entangled states (i.e., type III)? For the case of two qubits, we can still rely on the PPT criterion to provide labels for our training set.
For this part, we generate a new training set of 2-qubit mixed states randomly and label them by the PPT criterion in the same way as in the previous section. The ensemble is prepared by first generating a set of random matrices σ, where the real and imaginary parts of the elements σ_ij = a_ij + i b_ij are drawn from a Gaussian distribution with zero mean and unit variance. The resulting density matrix is obtained by

ρ = σσ† / Tr(σσ†) ,

which is implemented using the QETLAB code [49].
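This Ginibre-style construction (ρ = σσ†/Tr σσ†) is easy to reproduce without QETLAB; a minimal NumPy sketch:

```python
import numpy as np

def random_density_matrix(dim=4, seed=None):
    """rho = sigma sigma^dag / Tr(sigma sigma^dag), with sigma_ij = a_ij + i b_ij
    and a_ij, b_ij drawn from N(0, 1) (a Ginibre random matrix)."""
    rng = np.random.default_rng(seed)
    sigma = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    m = sigma @ sigma.conj().T          # Hermitian and positive semidefinite
    return m / np.trace(m).real         # normalize to unit trace

rho = random_density_matrix(4, seed=0)  # a random 2-qubit mixed state
```

By construction σσ† is positive semidefinite and Hermitian, so dividing by its trace always yields a valid density matrix.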
The performance of our machine-learning predictors depends heavily not only on the training set but also on the distribution of the testing data. We found that many data points (quantum states) are localized near the boundary between entangled and separable states, which represents a challenge for us; machine learning does not perform well on such marginal cases. To regularize this anomaly, we introduce a gap at the boundary between the entangled and separable states.
Here the gap is defined by keeping only the entangled states for which the absolute value of the minimum eigenvalue λ_min(ρ^T_B_AB) of the partially transposed matrix ρ^T_B_AB is larger than a given constant g, i.e.,

|λ_min(ρ^T_B_AB)| > g .

The distribution of λ_min in our data set is given in Fig. 4b. We can see that the majority of states are weakly entangled, which imposes a challenge for our machine-learning predictors.
For this setting, we first present the results of the Bell_ml(2, 8, 4000) predictor, containing 4000 neurons in the hidden layer. As shown in Fig. 4a, the match rate is about 75% when no gap is introduced in the testing data. Unfortunately, this match rate cannot be considered high, as the population of entangled states in the testing set is also about 75%; one could achieve the same performance by guessing that all given states are entangled. This implies that the number of features involved in Bell_ml(2, 8, 4000) is not sufficient to detect the entangled states in the ensemble. For comparison, we consider the tomographic predictor, which takes all 15 combinations of the two-qubit Pauli operators as input. The match rate then rises above 96%. This result implies that machine learning becomes more reliable when more information about the quantum states is available.
In addition, our Bell-like predictor improves if the entangled states near the boundary of the separable states are filtered out, as shown in Fig. 4a. Here we vary the gap from 0 to 0.1; the larger the gap, the better the performance of the Bell-like predictor. When the gap is about 0.07, the fraction of entangled states is about the same as that of separable states, and the match rate of the Bell-like predictor is about 80%.

V. THREE-QUBIT SYSTEMS
In general, when extending our machine-learning method to multi-qubit quantum states, the challenge is to find an efficient way to label the quantum states.
The entanglement structure of a three-qubit system is significantly more complicated than that of two-qubit systems; it can be classified into several types of entanglement classes [31]. In particular, a three-qubit quantum state is called "biseparable" if two of the qubits are entangled with each other but not with the third one. The corresponding density matrices are denoted as

ρ_A|BC = ρ_A ⊗ ρ_BC ,   ρ_B|AC = ρ_B ⊗ ρ_AC ,   ρ_C|AB = ρ_C ⊗ ρ_AB ,

and their convex combinations, i.e., p_1 ρ_A|BC + p_2 ρ_B|AC + p_3 ρ_C|AB for 0 ≤ p_1, p_2, p_3 ≤ 1 with p_1 + p_2 + p_3 = 1. Of course, these sets of states include fully-separable states as a special case. A system is called fully entangled (or genuinely tripartite entangled for pure states) [31] if it is neither biseparable nor fully separable. In the following, we shall focus on the quantum ensemble of the form (see Fig. 5(a,b))

ρ = p |ψ_bs⟩⟨ψ_bs| + (1 − p) ρ_sep ,

where |ψ_bs⟩ is a random pure state of one of the three biseparable forms in Eq. (17), the value of p ∈ [0, 1] is generated uniformly, and ρ_sep is a random fully-separable state as defined in Eq. (2). The set of states |ψ_bs⟩ is generated from random vectors as follows. A random vector v_rand(d) contains d elements v_i drawn from the Gaussian distribution with zero mean and unit variance. For example, for the case of ρ_A|BC, the state |ψ_bs⟩ is obtained from the tensor product v_rand(2) ⊗ v_rand(4), followed by normalization. In this case, we can still label the entanglement of the quantum states by the PPT criterion. For example, if the pure state |ψ_bs⟩ is of the form ρ_A|BC, then we can trace out system A in the total density matrix of the form given in Eq. (18) and apply the PPT criterion to the reduced density matrix of BC. In this way, we challenge our Bell-like predictors to classify four types of states: three types of biseparable entangled states and the fully-separable states.
(Figure caption: The generation process of the quantum ensembles. One channel is assumed to generate random fully-separable states, while the other generates random entangled states mixed with separable states (noise). (d) The ANN used to distinguish multi-class quantum states, applied to detect the entanglement of the 3-qubit system. The output is a Softmax layer [p_0, p_1, p_2, p_3]^T (with Σ_{i=0}^{3} p_i = 1), which gives the probability of every possible class.)

In our implementation, similar to our previous construction of Bell-like predictors from Bell's inequalities, here we consider the Mermin inequality [50] and the Svetlichny inequality [51]. For three-qubit systems, the Mermin inequality is of the form

|⟨abc′⟩ + ⟨ab′c⟩ + ⟨a′bc⟩ − ⟨a′b′c′⟩| ≤ 2 . (20)

The Svetlichny inequality (essentially a double Mermin inequality) is of the following form:

|⟨abc⟩ + ⟨ab′c⟩ + ⟨a′bc⟩ − ⟨a′b′c⟩ + ⟨a′b′c′⟩ + ⟨a′bc′⟩ + ⟨ab′c′⟩ − ⟨abc′⟩| ≤ 4 . (21)

The Mermin and Svetlichny inequalities are multipartite counterparts of Bell's inequalities; therefore, one can also employ them for detecting multipartite entanglement, and, in a similar way, we can apply the machine-learning method to boost the efficiency. In our machine-learning method, we adopted the terms of the Mermin inequality measured in the photonic experiment of Ref. [52] as input (4 features) to train the Bell-like predictor Bell_ml(3, 4, x), and similarly, for the Svetlichny inequality, we constructed Bell_ml(3, 8, x). In the cases discussed previously, the sigmoid function was employed for binary classification of the quantum states.
Here the output layer is obtained with the Softmax function [47], which can be applied to multi-class classification (Fig. 5d). The mismatch rate of the machine-learning method is shown in Fig. 6. The results indicate that if we use only the same number of features as in the Mermin inequality (Bell_ml(3, 4, x)) or the Svetlichny inequality (Bell_ml(3, 8, x)), the performance is not satisfactory, and the mismatch rate does not improve much when the number of neurons in the hidden layer is increased. However, the performance improves significantly when we include three groups of features from CHSH inequalities, one for every pair of qubits, which gives a new Bell_ml(3, 12, x) predictor.
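As a consistency check on the Mermin inequality, the standard Mermin operator M = abc′ + ab′c + a′bc − a′b′c′ reaches |⟨M⟩| = 4 > 2 on the GHZ state with the illustrative settings a = b = c = σ_y and a′ = b′ = c′ = σ_x (these settings are our choice for the sketch, not those of Ref. [52]):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def kron3(a, b, c):
    return np.kron(np.kron(a, b), c)

# Mermin operator M = a b c' + a b' c + a' b c - a' b' c'
# with the illustrative settings a = b = c = Y and a' = b' = c' = X
M = kron3(Y, Y, X) + kron3(Y, X, Y) + kron3(X, Y, Y) - kron3(X, X, X)

ghz = np.zeros(8, dtype=complex)
ghz[0] = ghz[7] = 1 / np.sqrt(2)        # (|000> + |111>)/sqrt(2)
val = np.real(ghz.conj() @ M @ ghz)     # |val| = 4 > 2: maximal Mermin violation
```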
Furthermore, we also trained our model, with both the tomographic predictor and Bell_ml(3, 26, x), to distinguish separable states from entangled states that cannot be identified by the PPT criterion (i.e. bound entangled states), such as the entangled states generated from an unextendible product basis (UPB) [53]. The accuracy on the test data is larger than 99%.

VI. FOUR-QUBIT SYSTEMS
Finally, we apply our machine-learning method to a four-qubit system, classifying two special classes of quantum states (Fig. 5c). Ensembles from the blue channel are fully-separable states ρ_sep (see Eq. (2)), while those from the green channel are assumed to be of the form

ρ_mix = p |ψ_rand⟩⟨ψ_rand| + (1 − p) ρ_sep, (22)

where |ψ_rand⟩ is a totally random pure state (defined in Eq. (19)) and p ∈ [p_min, 1] is a uniformly random variable. Note that we need to set a minimum value p_min; if p_min = 0, then ρ_mix can reduce to ρ_sep, making the two sets of states overlap.
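A minimal sketch of how such a training ensemble could be sampled. The construction of ρ_sep as a convex mixture of random product states, the Dirichlet weights, and the number of mixture terms are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pure_state(dim):
    """Haar-random pure state of the given dimension."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def random_separable(n_qubits, n_terms=10):
    """Random fully-separable density matrix: a convex mixture of
    random product states (one illustrative construction of rho_sep)."""
    dim = 2 ** n_qubits
    weights = rng.dirichlet(np.ones(n_terms))
    rho = np.zeros((dim, dim), dtype=complex)
    for w in weights:
        prod = np.array([1.0 + 0j])
        for _ in range(n_qubits):
            prod = np.kron(prod, random_pure_state(2))
        rho += w * np.outer(prod, prod.conj())
    return rho

def random_mixed(n_qubits, p_min=0.1):
    """rho_mix = p |psi_rand><psi_rand| + (1 - p) rho_sep, p in [p_min, 1]."""
    p = rng.uniform(p_min, 1.0)
    psi = random_pure_state(2 ** n_qubits)
    return p * np.outer(psi, psi.conj()) + (1 - p) * random_separable(n_qubits)

rho = random_mixed(4)
print(np.isclose(np.trace(rho).real, 1.0))  # True: unit trace
```

Samples from `random_separable` are labeled Blue and samples from `random_mixed` are labeled Green in the training set.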
Recall that the PPT criterion identifies an entangled state whenever an eigenvalue of the corresponding matrix, after partial transpose, becomes negative. Here we found numerically that when p_min > 0.1, all of the instances of ρ_mix are entangled states. The scenario is depicted in Fig. 7(a-c).
As shown in Fig. 7d, we studied the performance of a class of Bell-like predictors, namely Bell_ml(4, 80, x), where the number of neurons in the hidden layer is taken from 1 to 15, i.e., x = 1, 2, ..., 15. The 80 features are generated in the following way: assume there are four parties and each party measures its qubit locally along two different angles labeled by n̂_i, n̂'_i (i = 1, 2, 3, 4). A feature is then obtained as the joint expectation value ⟨O_1 O_2 O_3 O_4⟩, where O_i ∈ {n̂_i, n̂'_i, I_i}. Note that the special case I_1 I_2 I_3 I_4 is excluded, since ⟨I_1 I_2 I_3 I_4⟩ = 1 for any quantum state; this leaves 3^4 − 1 = 80 features.
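The feature construction above can be sketched directly; here Z and X stand in for the (unspecified) measurement angles n̂_i and n̂'_i of each party, an illustrative assumption.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

# Each party chooses among n, n', I; Z and X are placeholders for n, n'.
settings = [Z, X, I2]

def four_qubit_features(rho):
    """All joint expectation values <O1 O2 O3 O4>, Oi in {n, n', I},
    excluding the trivial all-identity combination."""
    feats = []
    for combo in itertools.product(range(3), repeat=4):
        if combo == (2, 2, 2, 2):
            continue  # <I1 I2 I3 I4> = 1 for every state, so it is excluded
        op = np.array([[1.0 + 0j]])
        for k in combo:
            op = np.kron(op, settings[k])
        feats.append(np.real(np.trace(rho @ op)))
    return feats

rho = np.eye(16) / 16  # maximally mixed 4-qubit state, just as a test input
print(len(four_qubit_features(rho)))  # 3**4 - 1 = 80
```

For the maximally mixed state every non-trivial correlator vanishes, which is a quick sanity check on the construction.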
When the number of neurons in the hidden layer becomes sufficiently large, the Bell-like predictor is capable of distinguishing the two classes of states with a match rate of more than 99% when p_min = 0.1. In other words, the ensembles assumed to be separable or entangled can be classified reliably by our predictor with an accuracy higher than 99%. The match rate can still reach 95% even if p_min = 0, where some of the states in ρ_mix become separable.
To investigate further, we divide the data into three groups.
• Group I: The subclass of quantum states in ρ_mix in Eq. (22) for which the minimal eigenvalue of the partial-transposed density matrix is negative. These states are all entangled.
• Group II: The complementary class of states of Group I within ρ_mix, i.e., those with non-negative eigenvalues. These states may contain both entangled and separable states.
• Group III: The class of all fully-separable quantum states ρ sep (see Eq. (2)). These states are all separable.
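The grouping above hinges on the minimal eigenvalue after partial transpose. A sketch, with the partial transpose taken over the last qubit only for brevity (a full PPT test would scan every bipartition):

```python
import numpy as np

def partial_transpose_last(rho, n_qubits):
    """Partial transpose over the last qubit of an n-qubit density matrix."""
    d = 2 ** (n_qubits - 1)
    r = rho.reshape(d, 2, d, 2)                 # indices (rest, last; rest', last')
    return r.transpose(0, 3, 2, 1).reshape(2 ** n_qubits, 2 ** n_qubits)

def min_pt_eigenvalue(rho, n_qubits):
    """Minimal eigenvalue of the partial-transposed density matrix."""
    return np.linalg.eigvalsh(partial_transpose_last(rho, n_qubits)).min()

def group_of(rho, n_qubits, from_rho_sep=False):
    """Group I: negative PT eigenvalue (certainly entangled);
    Group II: non-negative PT eigenvalue but drawn from rho_mix;
    Group III: drawn from rho_sep (certainly separable)."""
    if from_rho_sep:
        return "III"
    return "I" if min_pt_eigenvalue(rho, n_qubits) < 0 else "II"

# Sanity check: a two-qubit Bell state is NPT, hence Group I
bell = np.zeros(4, dtype=complex)
bell[0] = bell[3] = 1 / np.sqrt(2)
rho = np.outer(bell, bell.conj())
print(group_of(rho, 2))  # I
```

For the Bell state the minimal PT eigenvalue is −1/2, so the PPT criterion flags it as entangled.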
Our goal is to study the performance of the machine-learning method in analyzing the internal structure of mixed quantum states. In the training phase, we labeled all states in Groups I and II as Green, and states in Group III as Blue. Fig. 7e illustrates the result for the states with p_min = 0, trained with the Tomographic predictor. The length of the bars represents the number of ensembles tested. For Group I, whose states are definitely entangled, our Tomographic predictor detects their entanglement with a match rate of more than 99.6% (Ent. in the figure), which rises to 99.9% (Sep. in the figure) for the fully-separable states in Group III. For Group II, although the states were labeled Green (a class dominated by entangled states), the Tomographic predictor suggests that a large fraction of them are actually Blue (separable) states, which is consistent with the PPT criterion.

VII. CONCLUSION
In this work, we have applied a method of machine learning, known as Artificial Neural Networks (ANN), to solve problems of quantum-state classification in quantum information science. We have achieved several results, including (i) linear optimization of the CHSH inequality, (ii) nonlinear optimization of Bell-type inequalities, (iii) construction of a universal entanglement detector for two-qubit systems, (iv) multi-class state classification for three-qubit systems, and (v) classification for four-qubit systems. Overall, we found that machine learning can produce reliable results provided that the training set is properly chosen. The performance of machine learning degrades whenever the majority of the quantum states in the training set lie around the boundary between two classes (e.g., entangled and separable) of quantum states.
In general, our results are useful for problems where the process of labeling a quantum state is resource-consuming. For example, the use of the PPT criterion requires diagonalizing an exponentially large matrix for n qubits. However, these costly procedures can be confined to labeling the training set. In the future, one can imagine that these resource-demanding tasks would be carried out by a few powerful (quantum or classical) computers. Once a predictor is constructed, any small laboratory can make use of it by measuring only a relatively small number of features of a test quantum state. In this sense, the task of solving the computational problem can be shared among users with different computational powers.
Define the CHSH operator Π_CHSH by assembling the elements of the standard CHSH inequality.
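A sketch of assembling Π_CHSH = a⊗b + a⊗b' + a'⊗b − a'⊗b' from measurement settings; the particular Tsirelson-saturating settings below are an illustrative choice, not the paper's.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Illustrative settings that saturate Tsirelson's bound
a, a_p = Z, X
b = (Z + X) / np.sqrt(2)
b_p = (Z - X) / np.sqrt(2)

# Pi_CHSH = a(x)b + a(x)b' + a'(x)b - a'(x)b'
Pi = (np.kron(a, b) + np.kron(a, b_p)
      + np.kron(a_p, b) - np.kron(a_p, b_p))

phi = np.zeros(4, dtype=complex)      # |Phi+> = (|00> + |11>)/sqrt(2)
phi[0] = phi[3] = 1 / np.sqrt(2)
value = np.real(phi.conj() @ Pi @ phi)
print(value)  # 2*sqrt(2) ~ 2.828, the quantum (Tsirelson) maximum
```

Any classical (local hidden variable) strategy is bounded by |⟨Π_CHSH⟩| ≤ 2, so the value 2√2 certifies the entanglement of |Φ+⟩.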

III. ENTANGLEMENT WITNESS
Similar to the CHSH inequality, an entanglement witness is another method to detect entanglement. More specifically, an operator W is an entanglement witness iff

Tr(ρ_s W) ≥ 0 for every separable ensemble ρ_s, and Tr(ρW) < 0 for at least one entangled ensemble ρ. (16)

We say W detects ρ if Tr(ρW) < 0. As an example, to detect the entangled state |ψ+⟩ = (|01⟩ + |10⟩)/√2, it is easy to verify that ⟨ψ+| W+ |ψ+⟩ < 0, thus |ψ+⟩ is detected by W+. It has been shown [5] that a witness must be decomposed into at least 3 measurements per party, while the CHSH inequality needs only 2. However, for the more general ensemble ρ_{θ,φ} with unknown p, θ, and φ, W+ can also detect some of the entangled ensembles. Compared with Eq. (8), if φ = 0, all the entangled ensembles can be detected by W+ perfectly, whatever p and θ are. However, if we know nothing about φ, this method fails.

Bell_ml(3, 12, x) (Triple CHSH inequality) uses the features:
abI, ab'I, a'bI, a'b'I,
aIb, aIb', a'Ib, a'Ib',
Iab, Iab', Ia'b, Ia'b'
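A numerical sanity check of the witness condition. The explicit form W+ = (1/2)I − |ψ+⟩⟨ψ+| is an assumption here (a common textbook construction; the original definition of W+ is truncated in the text).

```python
import numpy as np

# |psi+> = (|01> + |10>)/sqrt(2)
psi = np.zeros(4, dtype=complex)
psi[1] = psi[2] = 1 / np.sqrt(2)

# Assumed witness: W+ = (1/2) I - |psi+><psi+|
W = 0.5 * np.eye(4) - np.outer(psi, psi.conj())

rho = np.outer(psi, psi.conj())       # the entangled state itself
print(np.real(np.trace(rho @ W)))     # -0.5 < 0: W+ detects |psi+>

rho_sep = np.eye(4) / 4               # maximally mixed (separable) state
print(np.real(np.trace(rho_sep @ W))) # 0.25 >= 0, as a witness requires
```

The negative expectation value on |ψ+⟩ together with the non-negative value on a separable state illustrates both halves of the witness definition in Eq. (16).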

IV. DETAILS ABOUT OUR ARTIFICIAL NEURAL NETWORK (ANN) MODEL
Unless otherwise specified, all of our ANN models consist of 3 layers: an input, a hidden, and an output layer. Every layer contains a fixed-size block of perceptrons (neurons) connected to the perceptrons in the neighboring layers.
1. Input layer. We denote the input layer as a vector x. Our model encodes the measurement results on the input layer, which carry some or all of the information about the quantum system. This information, or more formally these features, can be described by expectation values of observables (single observables or products of observables). We assume every party measures the system with 2 or 3 single-qubit Pauli operators, which can be combined into the elements of Bell inequalities or of standard state tomography. This is why the models are called the Bell-like (2 observables per party) and Tomographic (3 observables per party) predictors.

2. Hidden layer. The hidden layer consists of perceptrons, denoted as x_1,
x_1 = σ_RL(W_1 x + w_01),

where W_1 and w_01 are initialized uniformly and will be trained to decrease the loss function introduced below. σ_RL is a nonlinear activation function applied to every neuron; here the ReLU [6] is used as the output function to the next layer. The ReLU function σ_RL is defined by

σ_RL([z_1, z_2, ..., z_{n_e}]^T) = [max{z_1, 0}, max{z_2, 0}, ..., max{z_{n_e}, 0}]^T.

In our model, n_e represents the number of neurons in the hidden layer, so x_1 has n_e components.
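The hidden-layer map can be sketched in a few lines; the dimensions (8 features, 4 neurons) are arbitrary illustrative values.

```python
import numpy as np

def relu(z):
    """sigma_RL: elementwise max(z, 0)."""
    return np.maximum(z, 0.0)

def hidden_layer(x, W1, w01):
    """x1 = sigma_RL(W1 x + w01), the hidden-layer activation."""
    return relu(W1 @ x + w01)

rng = np.random.default_rng(0)
n_features, n_e = 8, 4                            # illustrative sizes
W1 = rng.uniform(-1, 1, size=(n_e, n_features))   # uniform initialization
w01 = rng.uniform(-1, 1, size=n_e)
x = rng.normal(size=n_features)                   # a feature vector of correlators
x1 = hidden_layer(x, W1, w01)
print((x1 >= 0).all())  # True: ReLU outputs are non-negative
```

In training, W1 and w01 are the parameters updated by gradient descent on the loss.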
3. Output layer. This layer contains only 1 perceptron for 2-class identification, or m perceptrons for m-class identification, and is denoted as x_2.
x_2 = σ_S(W_2 x_1 + w_02).

Here σ_S is the sigmoid function σ_S(z) = 1/(1 + e^{−z}) for 2-class identification, or the softmax function

σ_S([z_1, z_2, ..., z_m]^T) = [σ_S(z_1), σ_S(z_2), ..., σ_S(z_m)]^T, with σ_S(z_l) = e^{z_l} / Σ_{j=1}^{m} e^{z_j}, l = 1, 2, ..., m,

for m-class identification. Every perceptron of the output layer gives the probability p of one class for the given quantum state, as predicted by the model. In our work, we adopt the categorical cross entropy over m classes as the loss function to measure the difference between the model predictions and the real categories (i.e., the labels). The training process of the multilayer perceptron reduces the loss function (the sum of Eq. (25) over all data, where p̂_j represents the probability that a state belongs to the j-th class as predicted by the machine, while p_j is that of the label; p_j is either 1 or 0) by modifying the weights between neighboring layers. For example, if the model outputs [0.9, 0.03, 0.03, 0.04]^T, it effectively says "I think the state is separable with probability 90%." If the state is actually separable, the cross entropy is −1 × log_2(0.9) − 0 × log_2(0.03) − 0 × log_2(0.03) − 0 × log_2(0.04). For 2-class identification, we use only one neuron in the output layer to encode the result.
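The worked example above can be reproduced directly; base-2 logarithms are used to match the text's numbers.

```python
import numpy as np

def softmax(z):
    """sigma_S(z)_l = e^{z_l} / sum_j e^{z_j}."""
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(p, p_hat):
    """Categorical cross entropy (base 2): -sum_j p_j * log2(p_hat_j)."""
    return -np.sum(np.asarray(p, dtype=float) * np.log2(np.asarray(p_hat, dtype=float)))

# The model outputs [0.9, 0.03, 0.03, 0.04]: "separable with probability 90%".
p_hat = [0.9, 0.03, 0.03, 0.04]
p = [1, 0, 0, 0]                       # one-hot label: the state really is separable
loss = cross_entropy(p, p_hat)
print(round(loss, 3))  # 0.152, i.e. -log2(0.9)
```

The zero entries of the one-hot label kill every term except −log_2(0.9), exactly as in the text.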
A sufficiently large dataset is generated to feed our model. We divide it into two parts: one (90%-99%) for training the model and the other (1%-10%) for testing the accuracy. The weights of the perceptrons are initially drawn from a uniform random distribution. There is a variety of learning techniques for our multi-layer networks, such as stochastic gradient descent [7]. For every iteration of the training process, we shuffle the training set randomly and feed the data into our model batch-by-batch; every batch contains 32 data-label pairs, and the weights between neighboring layers are updated proportionally to the gradient of the loss function. ANN architectures can be easily built with readily-available tools such as Keras [8], with which our model is implemented. See Fig. 1 for more details.
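The shuffle-and-batch procedure can be sketched as follows; the 95%/5% split and the synthetic random features are illustrative placeholders for the actual generated data.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterate_minibatches(X, y, batch_size=32):
    """Shuffle the training set and yield batches of 32 data-label pairs."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sl = idx[start:start + batch_size]
        yield X[sl], y[sl]

# Illustrative dataset and split (the paper uses a 90-99% / 1-10% split)
X = rng.normal(size=(1000, 8))         # placeholder feature vectors
y = rng.integers(0, 2, size=1000)      # placeholder binary labels
n_train = int(0.95 * len(X))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

n_batches = sum(1 for _ in iterate_minibatches(X_train, y_train))
print(n_batches)  # ceil(950 / 32) = 30 batches per epoch
```

Each epoch reshuffles the indices, so every pass over the data visits the batches in a new order.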