Multi-DOA estimation based on the KR image tensor and improved CNN

In this paper, we propose an improved convolutional neural network (CNN) to solve the multi-DOA estimation problem. We use Khatri-Rao (KR) product to obtain the KR image tensor of covariance matrix and use the proposed estimation CNN to process the tensor. In order to increase the generalization of the proposed CNN and adapt the multi-label classiﬁcation problem, we use the curriculum learning scheme (CLS) and partial label strategy (PLS) to develop an efﬁcient training procedure. We implement several experiments to demonstrate the satisfying performance of the proposed estimation method. The simulation results show that our proposed method can ﬁnish the high resolution multi-DOA estimation use only a few sensors. Furthermore, the proposed method can obtain high estimation accuracy under low SNR situations and use fewer snapshots.


Introduction
Direction-of-Arrival (DOA) estimation technologies are widely used in the modern radar and communication systems [1][2][3][4][5] . In the past decades, researchers have presented many super-resolution estimation methods. Two of the most classic methods (subspace based methods) are known as the multiple signal classification (MUSIC) algorithm and estimating signal parameter via rotational invariance techniques (ESPRIT) algorithm 6,7 .
Many articles have proposed numerous variants of MUSIC and ESPRIT methods [8][9][10] , some improvements of the estimation performance have already been made. Researchers always combine the MUSIC method with other technologies to enhance the estimation performance, such as particle swarm optimization based MUSIC (PSO-MUSIC) method used for tracking sound source 11 and Low-degree root-MUSIC method used for fast estimation 12 . For the improvements of ESPRIT method, it is important to enhance the accuracy of the method and improve the estimation speed 13,14 .
As an interesting mathematical tool, the Khatri-Rao (KR) product has been introduced into the DOA estimation field. Researchers have proposed many efficient KR product based methods, such as the KR subspace approach used for quasistationary signals 15 and the KR product based MUSIC algorithm used for Gaussian signals 16 . Researchers also find that the KR product can enhance the degrees of freedom (DOF) for arrays. Based on this conception, the so-called nested array has been proposed both in the 1-D 16 and 2-D 17, 18 estimation fields. Although these KR subspace based methods have been well developed and many varietal algorithms have been proposed 19,20 , the defects of these methods are obvious. Firstly, the estimation performance of subspace based methods relies on the prior information, and the prior information is hard to obtain in reality 21 . Secondly, the decomposition operation is computationally complex. For MUSIC based method, the spectrum searching operation also cost plenty of computing resources, therefore, the subspace based methods is not applicable for systems which require reliable timeliness. Therefore, it is necessary to find some alternative methods with independence of prior information to solve the DOA estimation problem.
Prior articles have proved that DOA estimation problem is a kind of convexity problem 22,23 , therefore, researchers have brought some efficient convex optimization algorithms or machine learning methods into the DOA estimation field, such as radial basis function neural network (RBF-NN) 24 , support vector regression (SVR) 25 , and neural network 26 . These machine learning based methods are also used for dealing with special signal forms, such as the sparse Bayesian learning (SBL) method for coherent sources 27 and the modified quantum genetic algorithm (QGA) for wideband signals 28 . These methods can automatically learn the probability density function (PDF) of DOA 29 , therefore, the estimation method based on machine learning need less prior knowledge from the signal 30 . But the training and estimation processes become more complex, and the accuracy of the estimating results depends on the parameters of methods. These parameters are derived by some verification experiments and they are hard to obtain 31 . Therefore, it is necessary to find another way to develop DOA estimation methods with more convenient operations.
With the rapid development of artificial intelligent (AI) technologies 32 , some AI based estimation methods have been proposed in recent years. Researchers introduce the deep neural network (DNN) into the DOA estimation field and develop a DNN group to do the uniform linear array (ULA) estimation task 33 . Despite that the proposed method does not rely on the prior information and shows robustness with the array imperfections, it has some defects which need to be overcome. Firstly, while estimating DOA information by using the proposed DNN group, the DOAs of two adjacent receipt signals will need at least 10 • interval. This means that the resolution of the proposed DNN group is limited, and the accuracy of estimating results will decline if DOAs of signals are getting too closer. Secondly, the ULA used in this article contains 10 sensors, which means the array aperture is pretty large. Thirdly, these DNN based methods use the vector of array observation data or vectorized covariance matrix as the input tensor of the network 34 . This operation ignores the structure of covariance matrix and can reduce the quantity of information. In order to make full use of the information hidden in the covariance matrix, researchers also propose some convolution neural network (CNN) based methods, such as the deep CNN (DCNN) used for estimating multi-DOA with sparse prior 35 , the DCNN used for estimating DOAs of multi-speaker with noise signals 36 , and the de-multipath CNN models used for achieving multipath DOA estimation 37 . Although these CNN based methods use the convolutional layers to extract the deep features of covariance matrix, the pre-processing operation proposed in these articles [35][36][37] can complicate the estimation problem and cause time-consuming.
In this paper, we propose a novel AI based method to estimate multi-DOA information directly from the covariance matrix. We transform the estimation problem into an image classification problem by using the image tensor of covariance matrix. We discuss the image feature of the covariance matrix obtained by the KR product and design a simple CNN with special training technologies. The rest of this paper is organized as follows. In method section, we explain the basic conceptions of the proposed estimation method and present the image tensor obtained by the KR product. We also show the proposed training scheme and discuss the details of the training technologies in method section. In result section, we propose several experiments to show the satisfying performance of the proposed estimation CNN. At last, we conclude the main work of this paper in discussion section.

KR product and image tensor
We assume that K individual Gaussian signals spatially distributed at the direction of θ θ θ = [θ 1 , θ 2 , · · · , θ K ]. All signals simultaneously impinge on an ULA with M sensors. The array observation data x x x (n) can be expressed as Eq. (1): where n denotes the n-th snapshot. s s s (n) (resp. v v v (n)) denotes the signal vector (resp. noise vector). A A A = [a a a (θ 1 ) , a a a (θ 2 ) , · · · , a a a (θ K )] represents the array manifold, where a a a (θ k ) can be expressed as Eq. (2): where λ denotes the wavelength of signals. We set the inter-sensor space of ULA d = λ /2. The covariance matrix R R R of observation data can be calculated as Eq. (3): where N denotes the number of snapshots. R R R s = diag (σ σ σ ) represents the power matrix of signals (σ σ σ = σ 1 2 , σ 2 2 , · · · , σ K 2 T ).
σ v 2 denotes the power of white noise and I I I M denotes the M × M unit matrix. The vectorized R R R can be regarded as signals impinge on a new array whose manifold is given byÂ where denotes the KR product. The KR product can also generate several identical elements in y y y 16 , therefore, we remove the redundant elements and obtain a new vectorŷ y y. The new covariance matrixR R R can be expressed as Eq. (5):  whereê e e denotes a vector whose (M (M − 1) /2 + 1)-th element equals 1 while others equal 0. We reshapeR R R into an image tensor R where the real part matrix Re{R R R} and imaginary part matrix Im{R R R} are two individual channels of R. We propose a method to extract DOA information from the image feature of R, and the advantages of this method can be summarized as follows: • The use of tensor R can be processed by the CNN, which is an efficient estimation tool with more robustness 38 .
• The KR product can extend the DOF of array, which allows the array to estimate more DOAs 17,18 . Furthermore, KR product can ensure the size of R and help the network obtain more information.
• We investigate the image feature of R (shown in Figure 1, here we set M = 8) and find out that the main structure of R only corresponding to θ . The estimation environment can hardly influence the image feature of R. Therefore, it is possible to develop some estimation methods with high performance under harsh environment.
Based on these conceptions, we design an improved estimation CNN to deal with the multi-DOA estimation problem with the proposed KR image tensor R. Figure 2 shows the framework of the proposed CNN. It contains 6 layers: two convolutional layers, three dense layers, and one output layer. Two convolutional layers contain 4 and 16 individual feature maps, respectively. We set the kernel size as 4 × 4. Each dense layer contains 400 cells. For the output layer, we divide the angle space of θ (from −60 • to 60 • ) by 1 • and use 121 output cells to represent the spectrum of θ , therefore, the output layer contains 121 categories. We use the ReLU function as the activations of the first five layers while the output layer uses Sigmoid function. As a consequence, the proposed CNN transform the multi-DOA estimation problem into a multi-label classification problem of image tensor.

3/10
Algorithm Train the network with current generation of dataset; 7: Predict and update the missing labels; 8: until meet the stopping criteria 9: Mix the dataset with training samples in the next generation; 10: end for In order to increase the generalization of network, we introduce the curriculum learning scheme (CLS) 39 into CNN. We also use the partial label strategy (PLS) 40 to achieve efficient multi-label classification. The proposed learning scheme is shown in algorithm 1. The whole training scheme shown in algorithm 1 make the proposed estimation CNN become a so-called CurriculumNet 39 , and the training details are shown as follows. At first, we roughly train the network with all training samples and obtain a basic estimation CNN. Then we generate B clusters of samples according to the SNR information and use them as the sub-datasets. Here we fix the range of SNR at [−10dB, 20dB] and divide it into 6 segments. The training samples is also divided into 6 subsets and sorted from the sample subsets with higher SNR to the sample subsets with lower SNR. We denote the sorted subsets as D i , i = 1, 2, · · · , 6 and the training samples in the current generation as D C . During each generation of training, we continuously mix the subsets together as the training samples: 0. To deal with the multi-DOA estimation problem, we also use the PLS to decrease the scale of dataset space and obtain multiclassification results with high performance. We denote the training data as D = R (1) , z z z (1) , R (2) , z z z (2) , · · · , R (S) , z z z (S) , where R (s) (resp. z z z (s) ) denotes the s-th sample (resp. label vector). S denotes the number of training data. z z z (s) = z (s) where C = 121 represents the number of output categories. We denote z (s) c = 1 (resp. −1 and 0) means the present (resp. absent and unknown) of this label. We also define a hyperparameter p 2 % to randomly choose which labels should be set as present labels.
The partial labels is collaborated with partial-BCE loss 40 . We denote the s-th output of estimation network as w w w (s) = w where 1 1 1 [•] denotes a binary detector of z c . g(•) is a normalization probability distribution corresponding to the proportion of present labels.
where α, β , and γ are hyperparameters which control the value of g. We fix the value of γ at 3 and set g(0) = 5, then we can obtain that α = −4 and β = 5. In order to recover the missing labels, we implement the Bayesian uncertainty strategy (BUS) 41 after each generation of training. We define a score vector to represent the prediction of label vector. We also define a hyperparameter ρ as the detection threshold. The objective function of prediction can be expressed as Eq. (8): where κ is a positive value which controls the weight of F . L c denotes the loss of the c-th category. G(•) denotes the curriculum learning scheme mentioned before. The label vector v v v can be predicted as Eq. (9): where U (•) c denotes the Bayesian uncertainty 41 of the c-th category in the s-th sample.

Results
In order to verify the effectiveness of our method, we propose several experiments. We set M = 4, therefore, the number of virtual sensors equals to 7 according to Eq. (4) and (5). We set the learning rate of the proposed estimation CNN as 0.01 and use the stochastic gradient descent (SGD) optimizer to train the network. During training the network, the SNR of samples is randomly selected in the range of [−10dB, 20dB], and the number of snapshots N is fixed at 512.

Training performance
we use the Micro-F1 metrics M mF1 proposed in 42 to evaluate the accuracy of training and validation. We change the dropout rate of label from 0% to 75% with the step of 25% and record the accuracy performance of the proposed estimation CNN.   Figure 3 shows the M mF1 score versus different value of p 2 %. It is implied that the standard training procedure cannot estimate DOAs precisely when labels are missing, while the proposed learning strategy have a good performance even when samples missing half of labels. When p 2 % = 75%, the proposed network shows the best performance in both training and validation accuracy. We also investigate the relationship between S and M mF1 . The results in Table 1 shows that M mF1 of the standard training scheme declines quickly when S become small. While the proposed partial label strategy can significantly reduce the required scale of training dataset.
After analyzing the relationship between accuracy and training samples, we decide to fix the number of samples S at 60000 and set the rate of remained labels as 75%. Figure 4 shows the training and validation performance of proposed training scheme and standard training scheme, here we implement 100 independent times of training for each sub-dataset. The total training time is 600. The M mF1 comparison results illustrate that standard training scheme can cause serious over-fitting problem, while CLS technology can overcome this problem and greatly improve the generalization of the proposed estimation network.
By using the PLS and CLS training technologies, the final M mF1 validation accuracy of the proposed estimation network can exceed 90%, which means the proposed estimation CNN is well trained. After training the network, we can implement some performance test corresponding to the stability of the proposed estimation CNN.

Estimation performance
The angle spectrum obtained from the proposed estimation CNN is a discrete spectrum, therefore, we test the estimation performance of integer θ at first. Assuming 5 incoherent signals impinging at the direction of θ θ θ = [−47 The SNR (resp. number of snapshots N) is fixed at 10dB (resp. 512). Figure 5(a) shows the estimation spectrum. It is illustrated that the proposed estimation network can obtain a MUSIC-liked angle spectrum, and can estimate more signals than the number of physical sensors. We also randomly changing the SNR from −10dB to 10dB and implement 100 Monte Carlo experiments. The record of estimation error shown in Figure 5(b) implies that DOA may drop in the neighboring grid with the probability only around 5% . Both figures prove that the proposed estimation CNN can obtain high estimation accuracy for integer θ . In order to estimate non-integer θ and achieve high-resolution results, we take the method of amplitude interpolation. The estimation of non-integer θ can be calculated as Eq. (11).
where I denotes the I-th peak of the output spectrum. We set 5 non-integer signals to test the estimation performance of the proposed estimation network, and the estimation results (under SNR= 0dB situation) is shown in Figure 6. It is implied that the 6/10 estimation error is fluctuating around 0 • , and the maximum error is no more than 0.5 • . Figure 6 also illustrates that the proposed estimation method can cover the whole range of θ and achieve a high accuracy performance for estimating non-integer θ .  12).
where P denotes the number of Monte Carlo experiments.θ kp (resp. θ kp ) is the k-th real angle (resp. estimated angle by the proposed estimation network). First we investigate the relationship between RMSE and SNR. We fix the N at 512 and change the SNR from −10dB to 20dB. According to Figure 6(a), the RMSE response of RBF-NN method has the worst performance, the value of RMSE become lower than 1 • when SNR is higher than 2dB. The curves of DNN group and MUSIC algorithm have similar performance when SNR is higher than 0dB, while MUSIC algorithm performs better than DNN group in the case of low SNR situation. The RMSE of standard CNN is pretty low when SNR is higher than 4dB, but the curve rises sharply as the SNR decreases. Compared with these four methods, the RMSE of proposed estimation CNN has the best performance, and the RMSE is still lower than 1 • when SNR= −10dB.
We also investigate the relationship between RMSE and number of snapshots N. Figure 6(b) shows the RMSE results versus N, here we fix the value of SNR at 10dB. The RMSE responses of these methods (except RBF-NN method) reveal an inverse proportional function character. For the proposed estimation CNN, the RMSE remains low when N is greater than 16. This implies that the proposed estimation CNN can finish the estimation task precisely by using only a few of snapshots. However, when N is set to be 4, the RMSE rises to around 1 • , which depicts that the proposed estimation CNN cannot apply to single snapshot estimation processing.

Resolution of the proposed estimation CNN
In order to test the resolution of the proposed estimation CNN, we investigate the relationship between estimation performance and angle difference ∆θ . Figure 8(a) shows the estimation results of two 0dB signals with a set of angular separations {3 • , 4 • , 5 • , 6 • }. It is obvious that the proposed estimation CNN can distinguish two adjacent DOAs in the sector of [−60 • , 60 • ]. Figure 8(b) shows the RMSE results versus angle difference ∆θ . Here we fix the SNR (resp. number of snapshots N) at 10dB (resp. 512). The DNN group 33 can only estimate two signals with ∆θ ≥ 10 • , therefore, we do not show its performance in Figure 8(b). When ∆θ ≥ 3 • , the proposed method can obtain higher estimation accuracy than other methods. When ∆θ < 3 • only the MUSIC algorithm and the proposed method can obtain valid estimation results. However, the precision of our proposed method is slightly lower than the MUSIC algorithm when ∆θ become small.

Discussion
The experiments mentioned above demonstrate the satisfying performance of the proposed method. The results in Figure 3, Figure 4, and Table 1 illustrate that CLS and PLS are suitable for solving the DOA estimation problem. These two training technologies can improve the generalization of the proposed estimation CNN and reduce the scale of input and output space. Figure 5 and Figure 6 prove the validity of the proposed estimation method. By using the KR product, the DOF of array has been improved, which allows us to estimate more sources than the number of physical sensors. Figure 7 shows that our proposed method has better stability than prior estimation methods, while Figure 8 illustrate the high-resolution performance of the proposed estimation CNN. The proposed estimation CNN performs better than the subspace based estimation methods because it is a data-driven method, and the classification results can be hardly affected by the background information (like SNR and number of snapshots). The proposed estimation CNN only use the deep features hidden in R to do the estimation task, and these features only rely on the DOA information. The inessential information can be ignored by the proposed estimation CNN while subspace based methods and machine learning based methods can be influenced by these information.