Neural network identification of people hidden from view with a single-pixel, single-photon detector

Light scattered from multiple surfaces can be used to retrieve information about hidden environments. However, full three-dimensional retrieval of an object hidden from view by a wall has so far been achieved only with scanning systems and requires intensive computational processing of the retrieved data. Here we use a non-scanning, single-photon, single-pixel detector in combination with a deep convolutional artificial neural network: this allows us to locate the position of a hidden person and, simultaneously, to identify that person from a database of people (N = 3). Artificial neural networks applied to specific computational imaging problems can therefore enable novel imaging capabilities with hugely simplified hardware and processing times.


I. NEURAL NETWORK CLASSIFIER.
We build a nonlinear classifier with the aim of correctly identifying the label of histograms resulting from the acquisition of pulsed laser light backscattered from three different people in seven different positions. We use a supervised approach where the input vector (a temporal histogram) is paired to an output vector encoding the class of the person and the target location. Both the person class and the location are treated as categorical classification tasks and are encoded using a 'one-hot' encoding, with N_c binary outputs for the person classes and N_l binary outputs for the locations. In this work N_c = 3 and N_l = 7. The cost function, minimised during learning, is the categorical cross-entropy loss [1]. For the i-th observation, the cross-entropy loss for the person class is

L_{c,i} = -\sum_{j=1}^{N_c} y_{c,i,j} \log(o_{c,i,j}),

where o_{c,i,j} and y_{c,i,j} denote the predicted class and the true class for the j-th person, respectively.
Similarly, the cross-entropy loss for the location is

L_{l,i} = -\sum_{k=1}^{N_l} y_{l,i,k} \log(o_{l,i,k}),

where o_{l,i,k} and y_{l,i,k} denote the predicted location and the true location for the k-th location, respectively. As we classify person and position simultaneously, the resulting cost function is the joint effect of the two cost functions on the person and location output vectors. The cost L, minimised over the whole training set of N examples, is

L = \frac{1}{N} \sum_{i=1}^{N} \left( L_{c,i} + L_{l,i} \right).

The ANN architecture processes the input data in parallel through a fully-connected layer, in order to retrieve more information about the distance, and through convolutional block layers which, owing to their translation-invariant nature, focus more on the shape and features of the temporal histogram. After a number of layers, the output layer comprises two groups of softmax sublayers, associated with the person class and the location respectively. The largest architecture tested is shown in Figure 1. We used the flexible, open-source Keras library to implement our ANN in Python [1]; for an example of how the code looks, see Section II. The network weights were regularised using l_2 weight decay with a constant of 0.001. To prevent over-fitting and to encourage generalisation, a dropout layer was used after each dense (fully-connected) layer, followed by a batch-normalisation layer. In the convolutional blocks, the output of each convolutional layer is normalised (using a batch-normalisation layer) and down-sampled (using a max-pooling layer), except for the last block. For the first two blocks the convolutional filter size is 10 × 1, and for the last two blocks this is reduced to 5 × 1. One hundred such filters are applied in each block. The optimisation algorithm was stochastic gradient descent with a learning rate of 0.001 and Nesterov momentum [2], applied for at least 100 iterations. A second architecture was tested, which differs from the first primarily by reducing the number of filters from 100 to 32 for the first two blocks and to 64 for the last two blocks.
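The joint cost above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the code used in this work; the array names and the small-epsilon clipping inside the logarithm are our own choices:

```python
import numpy as np

def joint_cross_entropy(o_c, y_c, o_l, y_l, eps=1e-12):
    """Mean joint categorical cross-entropy over N observations.

    o_c, y_c: (N, N_c) arrays of predicted probabilities / true one-hot person classes.
    o_l, y_l: (N, N_l) arrays of predicted probabilities / true one-hot locations.
    """
    # Per-observation person loss: L_ci = -sum_j y_cij * log(o_cij)
    L_c = -np.sum(y_c * np.log(o_c + eps), axis=1)
    # Per-observation location loss: L_li = -sum_k y_lik * log(o_lik)
    L_l = -np.sum(y_l * np.log(o_l + eps), axis=1)
    # Total cost: the two losses combined, averaged over the training set
    return np.mean(L_c + L_l)
```

For a single observation with the true person predicted at probability 0.5 and a uniform guess over the seven locations, the cost is -log(0.5) - log(1/7).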
The convolutional filter size is 5 × 1 for all blocks. Average-pooling replaces max-pooling in the second and third blocks, and batch normalisation is removed from the convolutional blocks. A noisy-input variant of each architecture was also tested. In these variants, a customised additive Gaussian noise layer re-introduces background noise to encourage generalisation. However, the results suggest that the form and level of the noise, based on just two days' readings, confound rather than improve the results.
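A minimal Keras sketch of the first (largest) architecture is given below. The filter counts, filter sizes, learning rate, l_2 constant and Nesterov momentum follow the description above; the dense-layer width, dropout rate, pooling factor, momentum value and histogram length are assumptions made for illustration only:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(hist_len=800, n_person=3, n_loc=7):
    """Two-branch classifier: a fully-connected branch plus four 1-D
    convolutional blocks, merged into two softmax heads (person, location)."""
    l2 = regularizers.l2(0.001)  # weight decay constant, as in the text
    inp = layers.Input(shape=(hist_len, 1))

    # Fully-connected branch: dropout then batch norm after the dense layer
    d = layers.Flatten()(inp)
    d = layers.Dense(128, activation='relu', kernel_regularizer=l2)(d)  # width assumed
    d = layers.Dropout(0.5)(d)                                          # rate assumed
    d = layers.BatchNormalization()(d)

    # Convolutional branch: filter sizes 10, 10, 5, 5 with 100 filters each;
    # batch norm + max-pooling after every block except the last
    c = inp
    for i, k in enumerate([10, 10, 5, 5]):
        c = layers.Conv1D(100, k, padding='same', activation='relu',
                          kernel_regularizer=l2)(c)
        if i < 3:
            c = layers.BatchNormalization()(c)
            c = layers.MaxPooling1D(2)(c)  # pooling factor assumed
    c = layers.Flatten()(c)

    # Two one-hot softmax output groups on the merged representation
    merged = layers.concatenate([d, c])
    person = layers.Dense(n_person, activation='softmax', name='person')(merged)
    location = layers.Dense(n_loc, activation='softmax', name='location')(merged)

    model = tf.keras.Model(inp, [person, location])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001,
                                          momentum=0.9,  # value assumed
                                          nesterov=True),
        loss='categorical_crossentropy')
    return model
```

The noisy-input variants would simply insert a `layers.GaussianNoise(...)` layer immediately after the input.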
Finally, we compare the results from the different ANN architectures in Fig. 2 by reporting the correct-prediction percentages for each individual class (i.e. the diagonal values in the confusion matrices). The first architecture (a) refers to the results presented above. In (b), a noise layer has been added in order to help the network generalise robustly to the variation associated with noise in the sensor. The convolutional side is simplified in architecture (c), primarily by reducing the number of convolutional filters. A noise layer is added to (c) in architecture (d). A further simplification is applied in architecture (e), where only the fully-connected layer is used.
In (f), a noise layer is added to (e) as well. Finally, in (g) we classify people and positions separately, again using only the fully-connected layer. As can be seen, the results tend not to show any particular sensitivity to the specific ANN architecture employed, suggesting that this modelling approach is robust and that further improvement would need to come from larger or more controlled training sets. However, the ANN performance does suggest that classifying location and identity jointly is consistently better than treating them individually. This is probably because the internal representations learned to predict class are also useful for predicting location more accurately, and vice versa.
An alternative approach to evaluating the test results can be taken. Instead of taking average classification performance at a per-pixel level, we can base a classification on a majority verdict over all of the ca. 800 per-pixel classifications for a single measurement, and average performance at this level. This can then be repeated in the same cross-validation manner used earlier. The goal here is twofold: it illustrates how performance can potentially be improved by integrating over multiple classifiers, and it can highlight whether the variability in the training set lies mostly within pixels (probably relating to the sensor itself) or within measurements (relating to variations in lighting, pose, movement or clothing). The results of this approach indicate that misclassification is high within certain measurements (typically one out of the five measurements collected for a specific person-location pair), see Figure 3. This suggests that some differences between measurements are not fully captured in the training data and are reducing the ability of the classifier to interpret previously unseen variations in the test data. Low variation in the data may also explain why the architectural changes were inconclusive. Increasing the variation within the training data should improve the classifier, and these findings will inform future directions for this work.
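The majority-verdict aggregation can be sketched as follows. This is a minimal illustration, not the evaluation code used in this work; the data layout (one row of per-pixel labels per measurement) is an assumption:

```python
import numpy as np

def majority_verdict(per_pixel_labels):
    """Collapse the ~800 per-pixel class predictions of one measurement
    into a single label by majority vote."""
    labels, counts = np.unique(per_pixel_labels, return_counts=True)
    return labels[np.argmax(counts)]

def per_measurement_accuracy(pixel_preds, true_labels):
    """pixel_preds: (n_measurements, n_pixels) array of predicted labels.
    true_labels: (n_measurements,) ground-truth label per measurement."""
    verdicts = np.array([majority_verdict(p) for p in pixel_preds])
    return np.mean(verdicts == true_labels)
```

A measurement is counted correct when most of its per-pixel classifications agree with the ground truth, so sporadic per-pixel errors are absorbed, while systematically misclassified measurements remain wrong.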