Introduction

Non-line-of-sight (NLOS) imaging is a technique for detecting hidden objects behind obstacles or around corners by exploiting scattered light; it has attracted intensive interest for its fundamental importance in application fields such as autonomous vehicles, robotic vision, and biomedical imaging. NLOS imaging can be realized through reconstruction or recognition, depending on the application scenario. NLOS reconstruction aims to produce a visual representation of the hidden objects, whereas NLOS recognition focuses only on classifying them.

The vast majority of NLOS imaging is based on an active detection strategy, such as light detection and ranging (LIDAR)1,2,3,4,5,6,7,8,9, correlation-based imaging10,11,12,13, and holographic approaches14,15. The LIDAR technique allows for 3D scene reconstruction using a streak camera or single-photon avalanche photodiode detector together with a short-pulsed laser1,2,3,4,5,6,7,8,9. However, it faces severe practical limitations, including high cost, low photon efficiency, and typically long acquisition times3,4,5,6,7. Correlation-based and holographic NLOS imaging can also achieve robust shape recovery with less expensive hardware, yet the former is limited to sparse hidden objects within a small field of view10,11,12,13, and the latter faces severe difficulties in recording holograms in practical scenarios14,15. NLOS recognition provides labels for the hidden objects directly. Active NLOS recognition using the LIDAR technique achieves 76.6% accuracy on human pose estimation behind scattering media16, and recognition using optical coherence achieves 90% accuracy on the modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits and 78.18% accuracy on a human body posture dataset around corners17.

Unlike the active techniques, passive NLOS imaging utilizes weak scattered light or thermal radiation from the hidden objects without the need for a probing laser. Passive NLOS imaging is essential in some practical scenarios due to its simplified hardware and good stealthiness. However, it is challenging and offers limited programmable control. Different methods have been proposed to address this problem, including using a partial occluder18,19,20,21, thermal information22,23, or polarization cues24 in NLOS systems designed for imaging around corners, and using aperture masks25 or deep neural networks26,27 in NLOS systems for imaging through scattering media. Existing passive NLOS imaging has been realized exclusively through reconstruction. For example, occluder-based passive NLOS reconstruction recovers 2D scenes by solving an inverse problem20,21, but the existing methods either demand prior information about the setting20 or yield low-quality recovery due to only partial knowledge of the occluder21. Moreover, they require a few minutes to estimate the occluder and tens of seconds more for reconstruction, which is unrealistic for real-time NLOS applications. Deep neural networks have been used for passive NLOS reconstruction through scattering media to improve recovery quality26,27. However, the reconstruction quality degrades when the handwritten digit is illuminated by ultra-weak laser light on the same side27, which corresponds to the typical passive NLOS situation in which the useful signal is extremely weak. It is hard to identify the hidden objects when the recovery quality is poor, whereas NLOS recognition avoids this problem and meanwhile accelerates the imaging process. To our knowledge, passive NLOS recognition has not been explored thus far.

In this study, we perform passive NLOS recognition around corners using a pin-speck experimental setup consisting of a standard digital camera and an occluder (Fig. 1). The rays of light from the monitor scene that travel towards the secondary surface are partially blocked by the occluder, producing a penumbra. The camera captures the penumbra on the secondary surface, which contains category information and can be labeled by an elaborately trained convolutional neural network (CNN) model. The proposed NLOS recognition system demonstrates high and robust performance under increasingly complex scenes. For uncalibrated setups placed in a dark room, the recognition accuracy exceeds 97% on the well-known MNIST dataset of handwritten digits28 with less than 1 s per image. The proposed NLOS recognition system also generalizes to more sophisticated datasets of hand gestures29,30, human postures31, and fashion-MNIST32. The recognition accuracy varies from 81.58% to 93.95%, higher than the active NLOS recognition results on human postures16,17. When homogeneous ambient light is added to the setup, the recognition accuracy remains above 94% on the MNIST handwritten digit dataset. When varying ambient light is added by using 0–3 plates to cast shadows on the secondary surface, the recognition accuracy remains above 88.28% on the MNIST handwritten digit dataset. The system also shows good generalizability, with an accuracy above 60% when different numbers of people walk around the system and cast shadows on the secondary surface.

Fig. 1: Schematic of the occluder-based passive non-line-of-sight recognition system.

The origin of the x-axis is placed at the lower-right corner of the monitor. The origin of the y- and z-axes is placed at the intersection of the optical table and secondary surface. The position parameters of the screen, occluder, and camera are calibrated using coordinates relative to the origin (0, 0, 0). The black dashed and red solid arrows denote the light path and data flow, respectively.

On the flip side, we further consider the security threats of NLOS recognition. Here we are mainly concerned with the fragility arising from the CNN-based NLOS recognition algorithm. CNNs are vulnerable to attacks in line-of-sight (LOS) image classification33,34,35,36, which limits their use in that domain. Are CNNs equally fragile in NLOS recognition? There is no consensus in the literature on this question. Therefore, we conduct attacks on the NLOS-recognition process in this study using a white-box decoupled direction and norm (DDN) attack method37. Results show that an attack success rate of approximately 36% can be achieved at relatively low cost even with uncalibrated settings, indicating that NLOS recognition is somewhat vulnerable to perturbations. Although beyond the scope of this paper, the robustness of NLOS recognition to active adversaries38,39 should be investigated in the future.

Results

Passive NLOS recognition

NLOS recognition is beyond human capacities and thereby relies on computational methods. CNNs40,41 are particularly appealing for NLOS-recognition applications due to their ability to capture invariants, reduce the dimensionality of high-dimensional noisy data, and classify objects. We construct a CNN model for NLOS recognition using an experimental setup similar to that proposed by Saunders et al.20, which consists of a standard digital camera, a liquid-crystal display monitor, and an occluder, as shown in Fig. 1. Two deep CNN models, SimpleNet17 and ResNet1842, are used as classifiers. Further details about the two CNN models can be found in the “Methods” section and Supplementary Note 1. We evaluate the proposed NLOS recognition system under increasingly complex scenes: (1) dark room; (2) homogeneous ambient light; (3) varying ambient lighting conditions; and (4) practical test.

Dark room

First, we fix the parameter settings of the hardware system (Supplementary Table 3), which is referred to as the fixed setup. The occluder is rectangular with a width of 7.5 cm and a height of 7.7 cm. The support stand is 0.7 cm wide and 23.1 cm high. The center of the occluder is located at (0.3980, 0.5000, 0.2695) m relative to the origin, as indicated in Fig. 1. The monitor screen is 37.75 cm wide and 30.20 cm high. The lower-right corner of the screen has coordinates of (0.0185, 1.0000, 0.0940) m. The center of the camera is fixed at (0.5835, 1.0400, 0.2800) m with a field-of-view (FOV) size of 47 cm. To evaluate the proposed NLOS recognition system in a dark room, we train the CNN with data acquired both from simulations and from camera measurements on the MNIST dataset28 of handwritten digits, which contains ten categories and includes 60,000 training images and 10,000 test images. Simulated images are synthesized with the traditional forward transport model, in which the original images of the MNIST dataset are pre-multiplied by the light transport matrix A, computed using prior information about the setting (see details in Supplementary Note 2.1). Several examples of simulated images are shown in Fig. 2. As shown in Table 1, regardless of whether SimpleNet or ResNet18 is used, the recognition accuracy of the model trained with the simulated images is comparable to that of the model trained with the original images, both exceeding 99% and confirming that the light transport matrix A preserves category information well. Compared to the simulation results, the recognition accuracy of the model trained on the measured images is marginally lower but remains above 98% with both SimpleNet and ResNet18, as shown in Table 1. These results demonstrate that the noise in the NLOS imaging system, arising from system modeling errors, background noise, etc., is tolerable. Therefore, the proposed method of directly identifying a hidden object based on measured images is feasible.
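For intuition, the sketch below shows how such a simulated measurement could be synthesized with the forward transport model. It is a minimal sketch under our assumptions: the transport matrix A and background term b are taken as precomputed from the calibrated geometry, and the image sizes follow the setup above for illustration only.

```python
import numpy as np

def simulate_penumbra(scene, A, b):
    """Synthesize a simulated camera measurement from a monitor image.

    scene: flattened monitor image (e.g., a 32 x 40 MNIST digit), values in [0, 1]
    A:     light transport matrix of shape (128*128, 32*40), precomputed from
           the calibrated geometry (Supplementary Note 2.1)
    b:     background (ambient) term of shape (128*128,)
    """
    y = A @ scene + b              # linear light transport plus background
    return y.reshape(128, 128)     # penumbra image on the camera plane
```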

Fig. 2: Examples of the original images, measured images, and recognition labels in the proposed passive non-line-of-sight recognition system in a dark room.

The simulated image is produced using the forward transport model (Supplementary Note 2.1), and the reconstructed image is produced using the spatial differencing reconstruction method (Supplementary Note 2.2) proposed by Saunders et al.20.

Table 1 Line-of-sight (LOS) and non-line-of-sight (NLOS) recognition accuracies with SimpleNet and ResNet18 in a dark room.

To benchmark the proposed method, we also perform image reconstruction using the spatial differencing reconstruction method (see details in Supplementary Notes 2.2 and 2.3). As shown in Fig. 2, the quality of the restored handwritten-digit images is significantly deteriorated compared to the original images, which inevitably increases the subsequent misclassification rate by humans or computers. For example, it is difficult for humans to classify the reconstruction results of digits 2, 4, 5, 8, and 9 (Fig. 2). We observe that recognition following image reconstruction performs worse: the accuracy decreases from 99% with the original dataset to approximately 94%, which is even lower than the accuracy of the direct NLOS recognition model (~98%). This observation is reasonable because the data processing in image reconstruction loses information. Improving the recognition accuracy would require considerable effort and time to optimize the algorithm for higher reconstruction quality. Even with the traditional forward transport model, an additional 2 min is required to reconstruct each image, whereas only 0.87 s is required with the proposed CNN-trained NLOS recognition model.

We also use the proposed passive NLOS recognition method on more sophisticated datasets that contain hand gestures29,30, human postures31, and fashion-MNIST32 (see Supplementary Note 3.1). Examples of original, simulated, and measured images for four sophisticated datasets are shown in Supplementary Fig. 6. The recognition process of each image is completed in less than 1 s, and the recognition accuracy varies from 81.58% to 93.95%, as shown in Supplementary Table 5. Therefore, the proposed NLOS-recognition system also achieves good performance on more sophisticated objects hidden around the corner.

Considering the application of the proposed method in real situations, it is impossible to obtain all NLOS data in a fixed setup because real setups will vary. To train a CNN model that is invariant to the parameter settings, we collect data with changes in the occluder shape and in the positions of the monitor, occluder, and camera within given ranges. This case is referred to as the mixed setup. The shape of the occluder is changed from a rectangle into a triangle, a circle, and even the shape of a cup. The parameter ranges for the mixed setup are shown in Supplementary Table 3. This model achieves recognition accuracies of 97.16% and 98.26% with SimpleNet and ResNet18, respectively, within the trained range.

In addition, we perform a simulation to study the effect of camera sensor noise on the NLOS recognition accuracy. A recognition accuracy of 96% can be achieved as long as the signal-to-noise ratio (SNR) of the image captured by the camera sensor is greater than 20 dB, as shown in Supplementary Note 3.2.
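As a minimal sketch of how such a study could be set up, the snippet below corrupts a clean simulated measurement with white Gaussian noise at a prescribed SNR. The additive Gaussian noise model is our assumption; the actual simulation details are in Supplementary Note 3.2.

```python
import numpy as np

def add_noise_at_snr(img, snr_db, rng=None):
    """Corrupt an image with white Gaussian noise at a target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(img.astype(np.float64) ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)   # SNR = 10*log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), img.shape)
    return img + noise
```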

Homogeneous ambient light

Additionally, to verify the robustness of the NLOS recognition models to ambient light, we add an incandescent lamp with adjustable intensity to the experimental setting. As shown in Table 2, the recognition accuracy of the NLOS recognition model (SimpleNet or ResNet18) decreases as the light intensity increases. However, the recognition accuracy still exceeds 94%, even in the presence of ambient light as intense as 4.2 lux, while the signal of interest is only 0.3 lux. Therefore, the NLOS recognition models are robust to homogeneous ambient light within a specific intensity range.

Table 2 Non-line-of-sight (NLOS) recognition accuracies using SimpleNet and ResNet18 under a homogeneous ambient light.

Varying ambient lighting conditions

Because ambient lighting cannot be controlled in practice, we also study the robustness of the trained NLOS recognition model under varying ambient lighting conditions. Specifically, we study the effect of people walking nearby and casting shadows on the secondary wall. A house-shaped occluder-based NLOS system is placed in an exhibition hall with intense ambient light of 610 lux (Supplementary Fig. 8). The ambient light measured in front of the camera lens remains as high as 120 lux after an L-shaped translucent acrylic plate is placed around the system (Fig. 3a). The NLOS recognition accuracy is 84.73% in this situation when recognizing digits shown on the monitor. Different numbers of opaque plates of approximately 1.70 m height are used to cast shadows on the secondary surface, simulating people in the vicinity of the system (Fig. 3a–d). As shown in Table 3, the recognition accuracy of the CNN model trained in the setting without plates (referred to as ResNet18-0) drops from 84.73% to 71.84–77.58% when 1–3 plates are placed around the system. We also test ResNet18-0 when a person walks around the system. The recognition capability is affected, with an accuracy of around 70%, which is comparable to the plate-based simulation results and indicates that a passive NLOS recognition model trained under one fixed lighting condition is not robust to varying ambient lighting.

Fig. 3: Occluder-based passive non-line-of-sight recognition system encircled by an L-shaped translucent acrylic plate in different ambient lighting conditions.

a Zero, b one, c two, and d three opaque plates of around 1.70 m height placed around the system to cast shadows on the secondary surface.

Table 3 Summary of the recognition accuracies of ResNet18 models trained under varying ambient lighting conditions.

To improve the robustness of the recognition model, we retrain the network with data collected under mixed lighting conditions with 0, 1, 2, and 3 plates in the setting (referred to as ResNet18-M). As shown in the left panel of Table 3, the recognition accuracy under different ambient lighting conditions is markedly improved, by 1–15%, using ResNet18-M. To improve the generalizability of the NLOS recognition model, we add Gaussian noise, random horizontal flips, and random rotations to the measured images for data augmentation. Comparing the left and right panels of Table 3 shows that the recognition accuracy of ResNet18-M with data augmentation improves by approximately 2%, remaining above 88.28%.

Practical test

To demonstrate the effectiveness of ResNet18-M with data augmentation under real varying ambient lighting caused by movements of nearby objects, practical tests are implemented on the house-shaped passive NLOS recognition system in an exhibition hall (Supplementary Fig. 9). We write a script to automatically display the 10,000 test images of the MNIST dataset on the monitor screen and calculate the accuracy while different numbers of people walk around the system with casual poses and gestures. The recognition accuracy of ResNet18-M with data augmentation remains above 60% across three rounds of the practical test, indicating good generalizability.

NLOS attack

The security of CNNs is an active topic within the deep learning community. Attacks33,34,35,36 can manipulate CNNs into misclassifying data using generated adversarial examples (AEs), which are produced by adding small perturbations and are essentially indistinguishable from legitimate inputs. In this study, we use an existing white-box attack method called DDN37 to generate AEs, in which the cost is evaluated by the L2 norm of the perturbations (see details in “Methods”). The lower the L2 norm of a successful attack, the more fragile the recognition system. The attacks performed in this study are all untargeted (i.e., the perturbations to a digit force the CNN to classify the digit as another, unspecified output).

An example demonstrating the process of a LOS attack is shown in Supplementary Fig. 10a. Learnable perturbations are added to the original image to produce an AE, which is indistinguishable from the original image by human beings but is misclassified by CNNs as another digit with high confidence. Supplementary Fig. 10b shows several AEs generated for the SimpleNet and ResNet18 models trained on the MNIST handwritten digit dataset. We achieve a 97.7% attack success rate with an L2 of 0.5779 for SimpleNet. The cost is high, as digits 2 and 4 are severely distorted to mislead SimpleNet. Conversely, a 100% attack success rate is achieved for ResNet18 with a much smaller L2 of 0.0999, a distortion too small to be perceived by humans. ResNet18 has deeper layers and stronger recognition capability than SimpleNet, yet it is much more easily attacked by the DDN method. Many studies have demonstrated that CNNs are vulnerable to active attacks in LOS classification33,34,35,36; however, attacks on NLOS-recognition CNN models have not been investigated.

Two different attack strategies are designed to attack the NLOS classifiers (Fig. 4). The first, called attack on the monitor screen, displays digits with small designed image perturbations on the monitor. The AEs (i.e., digits with image perturbations) remain clearly legible to a LOS viewer, yet produce distorted information on the secondary surface and thereby mislead the NLOS recognition system. The second, called attack on the secondary surface, adds small wall perturbations onto the penumbra of the digits on the secondary surface, either by an additional light projection, wallpaper, or wall painting. The generated AEs on the secondary surface, which are indistinguishable from the original penumbra of the digits, are directly captured by the camera and cause misclassification by the NLOS recognition system.

Fig. 4: Schematic of non-line-of-sight attacks.

Attack strategy #1, attack on the monitor screen: image perturbations are added to the original image to create an adversarial example, which is displayed on the monitor screen to cause misclassification. Attack strategy #2, attack on the secondary surface: wall perturbations are added on the secondary surface to change the measured image and cause misclassification.

To produce NLOS AEs for real-world attacks, we first perform simulated attacks on the NLOS classifiers. In the scenario of attacks on the secondary surface, we use the same process as in the LOS attacks to generate AEs that misdirect the NLOS classifiers, based on the assumption that the captured penumbra of the digits on the secondary surface is the measured image (Supplementary Fig. 11). In the scenario of attacks on the monitor screen, we generate AEs shown on the monitor to disturb the NLOS classifiers by introducing the forward transport model to approximate the relationship between the image on the monitor and the captured penumbra on the secondary surface (Supplementary Fig. 12).
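Conceptually, the monitor-screen attack optimizes a perturbation on the displayed digit while the gradient flows through a differentiable stand-in for the measurement chain. The PyTorch sketch below illustrates this idea under our assumptions; the transport matrix A, background term b, and trained classifier f are assumed inputs, and the exact loss used in the DDN optimization may differ.

```python
import torch
import torch.nn.functional as F

def monitor_attack_loss(f, A, b, x, delta, y_true):
    """Loss for attacking through the forward transport model (untargeted).

    f:      trained NLOS classifier operating on 128x128 measurements
    A, b:   transport matrix (128*128, 32*40) and background term (128*128,)
    x:      clean 32x40 monitor image, values in [0, 1]
    delta:  learnable perturbation on the monitor image
    y_true: ground-truth label tensor of shape (1,)
    """
    x_adv = (x + delta).clamp(0, 1)        # adversarial monitor image
    y = A @ x_adv.flatten() + b            # simulated penumbra measurement
    logits = f(y.view(1, 1, 128, 128))
    # Gradient *ascent* on this loss drives the classifier away from y_true.
    return F.cross_entropy(logits, y_true)
```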

Supplementary Fig. 13 shows simulated AEs on the monitor screen and secondary surface. To achieve a 100% attack success rate for SimpleNet, the L2 norm of the distortion on the secondary surface is 0.0241/0.0221 against the fixed/mixed NLOS classifier, far below the L2 value of 0.5779 in the LOS attack (Table 4). This result likely occurs because the measured images on the secondary surface are just high-frequency shadows that lose large amounts of information compared to the original images. The L2 norm of the distortion on the monitor screen required to fully defeat the fixed/mixed NLOS classifier is 0.2249/0.2225, which is larger than the corresponding L2 value of attacks on the secondary surface. This result likely occurs because the monitor screen carries more information than the secondary surface; thus, the attack cost increases considerably. The NLOS attack cost of the mixed setup is close to that of the fixed setup in simulation, and the attack results for ResNet18 are similar to those for SimpleNet, as shown in Table 4. Overall, the image or wall perturbations required for a 100% attack success rate in NLOS recognition are small in simulation, indicating moderate fragility of the NLOS recognition system.

Table 4 Summary of the line-of-sight (LOS) and non-line-of-sight (NLOS) attack success rates and the corresponding costs evaluated by the L2 norm.

To test the effectiveness of the simulated AEs in reality, a natural idea is to display the simulated AEs (i.e., digits with image perturbations) on the monitor screen and then capture the corresponding distorted penumbra on the secondary surface. We then feed the measured image into the NLOS classifiers to test whether it can deceive the NLOS recognition model. This process is called a real-world attack. If the measured image is misclassified, the attack is called successful; otherwise, it is called failed. Unfortunately, the attack fails in both the fixed and mixed NLOS setups when we directly display the previously simulated AEs on the monitor. The large discrepancy between the success rates of the real-world and simulated attacks can be attributed to the fact that the perturbations generated by the DDN37 method with 100 iterations are much smaller than the noise in the system.

To overcome this problem, we increase the perturbations of the AEs. In addition, we find that the estimation of the background noise (term b in Supplementary Eq. (2)) is important in a real-world attack (see details in Supplementary Note 4.3). Therefore, AEs on the monitor for real-world attacks are generated after incorporating term b into the forward model and using an increased L2 norm. Figure 5 shows AEs on the monitor screen for real-world attacks in the fixed/mixed setup. When the L2 norm of the perturbations is approximately 0.60 for SimpleNet, the success rates against the fixed and mixed setup classifiers increase to 49.31% and 36.23%, respectively. The attack success rate in the mixed setup is much lower than in the fixed setup, demonstrating that the mixed setup classifier, which accounts for parameter variability, is more robust than the fixed setup classifier. The same holds for attacks on ResNet18, as shown in Table 4. In summary, the experimental results demonstrate that the high-precision classifiers trained in the fixed/mixed NLOS setups can be attacked in real-world scenarios. Therefore, the CNN model based on NLOS data is susceptible to AEs.

Fig. 5: Real-world attacks with increased L2 on fixed and mixed classifiers.

Adversarial examples are generated on the monitor screen.

Discussion

Thus far, we have shown that the proposed NLOS recognition system is accurate, efficient, and practical, but somewhat fragile to elaborately crafted perturbations. Our passive NLOS recognition system demonstrates robustness to several parameters of the NLOS hardware system, including the occluder shape; the positions of the screen, occluder, and camera; and the lighting conditions. Recent work in passive NLOS reconstruction also shows that hidden scenes can be recovered using an unknown occluder21, revealing robustness to the occluder shape; the occluder information is estimated by exploiting motions in the scene. However, the occluder estimation takes more time, and the reconstruction quality deteriorates due to the lack of full knowledge about the occluder shape. We note that our method is designed only for a white secondary surface of uniform albedo, while real secondary surfaces may have varying albedo, such as checkered tiles and wallpaper with different patterns. A recent passive NLOS work by Seidel et al.43 provides a possible path toward solving this task: they recover a 1D projection of the hidden scene behind a wall from a single photograph of real floors with varying albedo patterns by exploiting priors on the floor albedo and the hidden scene.

We perform attacks on the NLOS CNN recognition algorithm in this study, which represents an initial but essential step toward developing a robust NLOS CNN recognition system. The robustness of the proposed CNN recognition system could be improved further by applying defenses to mitigate the effects of AEs. Although such defense strategies38,39 are beyond the scope of this study, they can be briefly summarized into two categories: (1) enhancing the learning model via methods such as adversarial training44 and defensive distillation45, and (2) detecting adversarial samples via methods such as principal component analysis46 and feature squeezing47. White-box attacks36 generate AEs using prior knowledge of the target network. In practice, we may have no access to the underlying training policy; therefore, gray- or black-box attacks33 should be investigated in future work. Additionally, targeted attacks should be investigated to produce labels specified in advance.

Attacks on the secondary surface would be easier than attacks on the monitor screen, given the lower L2 norms of the simulated attacks. However, performing these attacks is challenging because the perturbation on the secondary surface changes with the image shown on the monitor screen. To address this problem, a universal perturbation with an L2 of 0.4545 is learned, as shown in Fig. 6a, which deceives the ResNet18 recognition model and is invariant to the original images on the monitor. As shown in Fig. 6b, the difference between the AEs and the original measured penumbra images is imperceptible to humans, which is the main difference between the attack strategy on the secondary surface in this work and the robustness issues caused by different patterns on the secondary surface43. The universal perturbation severely distorts the measured images, and the attack success rate of this method reaches 89.67% in simulation. The AEs in this study can also be used for defenses to improve the robustness of the proposed NLOS recognition system.
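A hedged sketch of how such an image-agnostic perturbation could be learned is shown below. The optimizer, step count, and the omission of explicit norm control are our assumptions; the actual procedure used to reach the reported L2 of 0.4545 may differ.

```python
import torch
import torch.nn.functional as F

def learn_universal_perturbation(f, loader, steps=50, lr=0.01):
    """Learn one wall perturbation that misleads f for (nearly) all inputs.

    f:      trained NLOS classifier on 128x128 measured images
    loader: DataLoader yielding (measured_images, labels) batches
    """
    delta = torch.zeros(1, 1, 128, 128, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for imgs, labels in loader:
            # Maximize the true-class loss, i.e., minimize its negative.
            loss = -F.cross_entropy(f((imgs + delta).clamp(0, 1)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return delta.detach()  # norm control (e.g., relative L2 budget) omitted here
```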

Fig. 6: A universal wall perturbation on the secondary surface.

a Learned universal wall perturbation on the secondary surface. Note that the perturbation has been amplified for visualization purposes only. b Adversarial examples are generated with the universal perturbation.

In summary, we demonstrate a passive NLOS recognition technique that is invariant to changes in the calibration parameters. The setup is simple and inexpensive, requiring only a standard digital camera and an occluder. We perform experiments to verify its feasibility, and the results show that passive NLOS recognition achieves high accuracy, between 81.58% and 98.26%, in a dark room on datasets of handwritten digits, hand gestures, human postures, and fashion-MNIST, with a processing time of less than 1 s per image. Moreover, high recognition accuracy is maintained under more complex lighting conditions. On the other hand, white-box attacks are conducted on the NLOS recognition model. Even when the positions of the experimental setup vary, an attack success rate of approximately 36% can be achieved at relatively low cost. Therefore, existing NLOS-recognition methods remain somewhat vulnerable to well-designed perturbations. The robustness of the proposed recognition model could be improved further by applying defense methods in the future.

Methods

Hardware configuration

Our capture system includes an HP LCD monitor (model P19A) with a 5:4 aspect ratio and 1280 × 1024 resolution; a FLIR Grasshopper3 camera with a resolution of 2048 × 2048, fitted with a Tamron M118FM16 lens (16 mm focal length, f/1.4 aperture); a black occluder; and a white foam board serving as a diffusely reflecting secondary surface. There is no LOS path from the monitor screen to the camera.

Data acquisition from the camera

A Python script is used to control data acquisition, with an exposure time of 0.7 s per snapshot. Each snapshot is an 8-bit raw image of 2048 × 2048 pixels stored in a Bayer-filter RGBG pattern. To obtain a three-channel RGB image, the two green channels are averaged, and the image in each channel is then downsampled using 16 × 16 blocks. Finally, a color image of 128 × 128 pixels is obtained, which is used to train the deep learning model or fed into the CNN model to identify its category.
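A sketch of this preprocessing follows. The Bayer site assignment and the block statistic (mean) are assumptions; each 1024 × 1024 Bayer channel is reduced by 8 × 8 blocks, so the 16 × 16 blocks of the full frame map to the 128 × 128 output described above.

```python
import numpy as np

def preprocess_snapshot(raw):
    """Turn an 8-bit 2048x2048 Bayer-pattern frame into a 128x128 RGB image.

    The RGGB site assignment below is an assumption; swap the indices if the
    sensor's Bayer layout differs.
    """
    r  = raw[0::2, 0::2].astype(np.float32)     # 1024 x 1024 per channel
    g1 = raw[0::2, 1::2].astype(np.float32)
    g2 = raw[1::2, 0::2].astype(np.float32)
    b  = raw[1::2, 1::2].astype(np.float32)
    g = (g1 + g2) / 2.0                         # average the two green sites

    def block_mean(ch, k=8):                    # 1024 -> 128 via k x k blocks
        h, w = ch.shape
        return ch.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

    return np.stack([block_mean(c) for c in (r, g, b)], axis=-1)
```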

Deep neural networks for passive NLOS recognition

SimpleNet

Inspired by the deep network model built by Lei et al.17, we design a compact convolutional network, as shown in Supplementary Fig. 1. Its overall structure is given in Supplementary Table 1. Dropout is added to the top two fully connected layers to prevent overfitting. The number of output neurons in the last fully connected layer is set to 10, matching the number of categories in the MNIST dataset. The input sizes of the original image, simulated image, and measured image are 32 × 40, 128 × 128, and 128 × 128, respectively. For the simple MNIST dataset, we use an SGD optimizer48 with a momentum of 0.9 and an initial learning rate of 0.01. With a batch size of 64, 20 epochs are trained on the training set. Furthermore, the learning rate is multiplied by 0.9 every five training epochs.
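A minimal PyTorch sketch of this training schedule is shown below. The model and data loader are assumed to exist (the real architecture is in Supplementary Table 1), and the learning-rate decay is interpreted as occurring every five epochs.

```python
import torch
import torch.nn.functional as F

def train_simplenet(model, train_loader, epochs=20):
    """Train with the schedule described above: SGD, momentum 0.9, initial
    lr 0.01, batch size 64 (set in train_loader), lr x 0.9 every 5 epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.9)
    for _ in range(epochs):
        for imgs, labels in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(imgs), labels).backward()
            opt.step()
        sched.step()   # stepped once per epoch; decays every 5th epoch
    return model
```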

ResNet18

ResNet’s unique residual structure makes it one of the most significant architectures in the field of computer vision. This unique residual structure is used to mitigate the performance degradation problem caused by the increased depth of the neural network. This study modifies and fine-tunes the ResNet18 model published on the PyTorch official website42, as shown in Supplementary Fig. 2. Its overall structure is represented in Supplementary Table 2. Because the image in the experimental datasets is grayscale, its input channel is modified to 1 and the output number of neurons in the last layer of the fully connected layer is adjusted to the number of categories in the experimental datasets. The training strategy of this network is consistent with that of SimpleNet on the MNIST dataset. For the sophisticated datasets, we modify the number of training epochs to 60. For the training strategy, we use a weight decay of 0.0001. In the 20th and 50th epochs, the learning rate is multiplied by 0.1 and the other parameters of the network remain unchanged. Additionally, we use data augmentation, such as Gaussian noise, random horizontal flip, and random rotation to avoid overfitting.

DDN for NLOS attack

DDN37 is an existing white-box attack method designed to generate AEs with a low perturbation norm in a few iterations for a given input image. The objective of the untargeted attack is to minimize the likelihood of correct classification under a constraint on the maximum norm of the disturbance. This problem can be described as

$$\mathop{\min}\limits_{\boldsymbol{\delta}} \, P\left( y_{\mathrm{true}} \mid \mathbf{x} + \boldsymbol{\delta}, \theta \right) \quad \mathrm{subject\ to} \quad \left\| \boldsymbol{\delta} \right\| \le \varepsilon \ \mathrm{and}\ 0 \le \left\{ \mathbf{x} + \boldsymbol{\delta} \right\}_{i,j} \le M, \quad i,j = 1,2,\ldots,n$$
(1)

where \(y_{\mathrm{true}}\) denotes the ground-truth label corresponding to the sample \(\mathbf{x}\); \(\mathbf{x} + \boldsymbol{\delta}\) represents the AE; \(P(y \mid \mathbf{x}, \theta)\) is the probability that sample \(\mathbf{x}\) is predicted to be \(y\) given the model parameters \(\theta\); \(\varepsilon\) and \(M\) represent the maximum norm of the disturbance and the maximum value of each pixel, respectively; and \(n\) represents the size of the image \(\mathbf{x}\).

DDN refines the perturbation iteratively using the projected gradient descent method49: it accumulates the gradient direction of the misclassification loss and updates the adversarial example by scaling a unit vector along the accumulated gradient by an adaptive step size:

$$\tilde{\mathbf{x}}_k = \mathrm{clip}_{(0,1)}\left\{ \mathbf{x} + \varepsilon_k \frac{\boldsymbol{\delta}_k}{\left\| \boldsymbol{\delta}_k \right\|_2} \right\}$$
(2)

where \(\mathrm{clip}_{(0,1)}\) limits the adversarial example to \([0,1]\). At the \(k\)th iteration, given the perturbation step \(\alpha\), the current gradient direction \(\mathbf{g}\) of the input image is computed from the loss function \(L\) as \(\mathbf{g} = \alpha \, \nabla_{\tilde{\mathbf{x}}_{k-1}} L(\tilde{\mathbf{x}}_{k-1}, y, \theta) / \| \nabla_{\tilde{\mathbf{x}}_{k-1}} L(\tilde{\mathbf{x}}_{k-1}, y, \theta) \|_2\) and added to the previous cumulative gradient direction \(\boldsymbol{\delta}_{k-1}\) to obtain \(\boldsymbol{\delta}_k = \boldsymbol{\delta}_{k-1} + \mathbf{g}\); \(\varepsilon_k\) is the step size corresponding to \(\boldsymbol{\delta}_k\). For a given adaptive factor \(\gamma\), if the current sample \(\tilde{\mathbf{x}}_{k-1}\) is adversarial, the step size is decreased to minimize the perturbation norm, \(\varepsilon_k = (1 - \gamma)\varepsilon_{k-1}\); otherwise, it is increased, \(\varepsilon_k = (1 + \gamma)\varepsilon_{k-1}\). This method is suitable for targeted and untargeted attacks in white-box scenarios and achieves the best overall performance in terms of attack success rate, low disturbance, and convergence speed.
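Putting these update rules together, a condensed single-image sketch of the untargeted DDN loop reads as follows. The hyperparameters \(\alpha\), \(\gamma\), and the initial \(\varepsilon\) are illustrative, not the values used in our experiments (those are summarized in Supplementary Table 6).

```python
import torch
import torch.nn.functional as F

def ddn_attack(f, x, y_true, steps=100, alpha=1.0, gamma=0.05, eps0=1.0):
    """Untargeted DDN sketch for one input x of shape (1, C, H, W) in [0, 1]."""
    delta = torch.zeros_like(x)          # cumulative gradient direction
    eps, x_adv = eps0, x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = f(x_adv)
        loss = F.cross_entropy(logits, y_true)
        grad = torch.autograd.grad(loss, x_adv)[0]
        delta = delta + alpha * grad / grad.norm()     # accumulate direction
        # Shrink the norm ball if the current sample is already adversarial,
        # otherwise expand it.
        if logits.argmax(dim=1).item() != y_true.item():
            eps = (1 - gamma) * eps
        else:
            eps = (1 + gamma) * eps
        x_adv = (x + eps * delta / delta.norm()).clamp(0, 1).detach()
    return x_adv
```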

Because the dimensions of the original image and the measured image differ, this study employs a relative L2 norm rather than the plain L2 norm to measure the distance between the AE and the given input image to be perturbed:

$$L_2\left( \mathbf{x}, \tilde{\mathbf{x}} \right) = \sqrt{ \left\| \mathbf{x} - \tilde{\mathbf{x}} \right\|^2 / \left[ \dim(\mathbf{x}) \left( \max(\mathbf{x}) - \min(\mathbf{x}) \right)^2 \right] }$$
(3)

where \(\mathbf{x}\) and \(\tilde{\mathbf{x}}\) represent the given input image to be perturbed and the AE, respectively. The L2 value measures the robustness of the model against attacks: the smaller the value, the lower the cost of the attack. The parameter settings for the LOS and NLOS attacks are summarized in Supplementary Table 6.
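For reference, Eq. (3) is a one-liner in code; a sketch:

```python
import numpy as np

def relative_l2(x, x_adv):
    """Relative L2 norm of Eq. (3): perturbation energy normalized by image
    size and squared dynamic range, so costs are comparable across the
    32x40 monitor images and 128x128 measured images."""
    rng = x.max() - x.min()
    return np.sqrt(np.sum((x - x_adv) ** 2) / (x.size * rng ** 2))
```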

The authors affirm that informed consent for publication of the images in Supplementary Fig. 9 was obtained from the identifiable individuals.