Combined person classification with airborne optical sectioning

Fully autonomous drones have been demonstrated to find lost or injured persons under strongly occluding forest canopy. Airborne optical sectioning (AOS), a novel synthetic aperture imaging technique, together with deep-learning-based classification enables high detection rates under realistic search-and-rescue conditions. We demonstrate that false detections can be significantly suppressed and true detections boosted by combining classifications from multiple AOS integral images rather than from single ones. This improves classification rates especially in the presence of occlusion. To make this possible, we modified the AOS imaging process to support large overlaps between subsequent integrals, enabling real-time and on-board scanning and processing at groundspeeds of up to 10 m/s.


Classifier combination approaches are generally categorized into adaptive and non-adaptive methods 39,51,52. Adaptive techniques are based mainly on evolution, artificial intelligence algorithms, or fuzzy set theories, while non-adaptive combination techniques apply simple rules for combination, such as sum, product, maximum, and median 39,51,52. No theoretical or empirical evidence of the general superiority of any particular combination scheme exists, and even simple combination schemes have been shown to improve accuracy in various systems 40,41.
To guarantee real-time rates on low-performance on-device mobile processors, we consider non-adaptive, measurement-based combination techniques. We demonstrate in the "Combined classification" and "Results" sections that (compared to independent classifications in single integral images) the product of median and maximum confidence scores of combined classifications significantly suppresses false detections while boosting true ones. However, to enable combined classification in the first place, the AOS imaging process must support large overlaps between subsequent integral images, which was hitherto not feasible. We discuss optimal AOS sampling parameters and how we achieve these overlaps in the "Continuous 1D synthetic aperture sampling" and "Combined classification" sections. We start with a brief review of the AOS principle in the "Airborne optical sectioning" section.

Airborne optical sectioning
As shown in Fig. 1, AOS captures thermal radiation with a drone that samples the forest within the range of a synthetic aperture (SA) at flying altitude. This results in multiple geotagged aerial thermal images and consequently in unstructured thermal light-field rays formed by image pixels and camera poses on the SA 53,54. With known camera intrinsics, drone poses, and a representation of the terrain (e.g., a digital elevation model or a focal plane approximation 28), each ray's origin on the ground can be reconstructed. Averaging all rays with the same origin results in a focused and largely occlusion-free integral image of thermal sources (e.g., persons) on the ground.
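The averaging of rays that share a common ground origin can be sketched in a few lines of Python. The code below is an illustrative, simplified example under a focal-plane approximation (flat ground plane, nadir-looking pinhole cameras with assumed intrinsics, poses along a 1D synthetic aperture); the actual on-board pipeline projects onto a DEM with OpenGL, as described in the "Methods" section.

```python
# Minimal sketch of AOS ray averaging under a focal-plane approximation:
# nadir-looking pinhole cameras with identical (assumed) intrinsics, displaced
# along a 1D synthetic aperture over a flat ground plane.
import numpy as np

def integral_image(images, cam_positions, fx, fy, cx, cy, ground_x, ground_y):
    """Average all rays that originate from the same point on the ground plane."""
    X, Y = np.meshgrid(ground_x, ground_y)                 # ground grid (focal plane)
    acc = np.zeros(X.shape, dtype=np.float64)
    cnt = np.zeros(X.shape, dtype=np.float64)
    for img, (px, py, pz) in zip(images, cam_positions):
        h, w = img.shape
        u = np.round(fx * (X - px) / pz + cx).astype(int)  # ground point -> pixel column
        v = np.round(fy * (Y - py) / pz + cy).astype(int)  # ground point -> pixel row
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[valid] += img[v[valid], u[valid]]               # accumulate ray radiance
        cnt[valid] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)

# Example: 30 single images captured 1 m apart at 35 m AGL (dummy data).
imgs  = [np.random.rand(480, 640) for _ in range(30)]
poses = [(k * 1.0, 0.0, 35.0) for k in range(30)]
integral = integral_image(imgs, poses, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                          ground_x=np.linspace(-5, 35, 256),
                          ground_y=np.linspace(-10, 10, 256))
```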
There is a statistical chance that a point on the forest ground is unoccluded by vegetation from multiple perspectives, as explained by the probability model in Ref. 25. Thus, depending on density, more or fewer rays of a surface point contain information about random occluders, while others carry the constant signal of the target. Integrating multiple rays deemphasizes the occlusion signal and emphasizes the target signal. Since the remaining occlusion only lowers the contrast of the target 25, reliable classification of the target is possible, even under strong occlusion conditions 30.
AOS relies on the camera's pose information while capturing images within a certain SA. In Ref. 30, the SA was a 2D sampling area, and precise pose estimation was achieved with computer-vision-based reconstruction techniques. This resulted in high-quality integral images but was, for two reasons, not usable in time-critical applications, such as search and rescue: Sampling a 2D area led to long flying times, and precise computer-vision-based pose estimation was not possible in real time. All computations were carried out offline and after recording at predefined SA waypoints. This was significantly improved in Ref. 31 by sampling along short 1D SAs (linear flight paths) and by using instant but imprecise sensor readings from the drone (barometer altitude, nondifferential GPS location, and compass orientation) for real-time computations directly on the drone. Despite the imprecision caused by lower-dimensional sampling and imprecise pose estimation, person classification performance was similar to that of 2D SA sampling with precise pose estimation 31.
To date 30,31, person classification with AOS has only been achieved in discrete (non-overlapping) integral images, for which 30 single images were required at 1 m intervals before they could be integrated and classified after a 30 m flight distance and 30 s of flight time. Due to slow recording speed and long image transfer times from the camera to the drone's processor, the ground surface covered did not overlap in multiple integral images. Consequently, the probability of detecting a person on the ground surface relied solely on a single classification chance. If, under unfavorable occlusion conditions, a person was not detected in the corresponding integral image, they were never found.
In the sections below, we present the theoretical and practical foundations of continuous 1D synthetic aperture sampling (i.e., the real-time capturing, processing, and evaluation of integral images with a large ground surface overlap).

Continuous 1D synthetic aperture sampling
Subsequent integral images cover overlapping ground surfaces; o denotes the number of integral images in which the same ground point appears, and c the ground coverage of one integral image along the flight direction. The sampling density of integral images is the number of single images N being integrated, where d_i is the sampling distance of single images (i.e., the flight distance between two subsequent image captures). It correlates with the efficiency of occlusion removal, and has an upper limit, as discussed in Ref. 25. Note that the processing time per integral image, t_p, also increases proportionally with N. Furthermore, the rate at which single images are recorded should be equal to or higher than the rate at which integral images are computed (d ≥ d_i, where d is the flight distance between two subsequent integral images). If d < d_i, subsequent integral images will not change.
The integration time (i.e., the time required to capture the N single images that are combined into one integral image) is t = c / v_f. For c = 27.6 m, t is 27.6 s, 6.9 s, 4.63 s, and 2.8 s for flying speeds v_f of 1 m/s, 4 m/s, 6 m/s, and 10 m/s. Note that a large t is unfavorable for the detection of fast-moving persons, as they introduce motion blur to the integral images that might not be classified correctly.
The sampling distances of single images that cause the same image disparity required for effective occlusion removal (as explained in Ref. 25) at different flight altitudes (but at the same FOV) are linearly related (cf. Fig. 2): d_i1 / d_i2 = h_1 / h_2. Thus, matching the occlusion removal efficiency of a d_i1 = 1 m sampling distance at h_1 = 35 m AGL requires a sampling distance of d_i2 = 28.6 m at h_2 = 1000 m AGL. Both synthetic aperture size and ground region covered scale proportionally to h_1/h_2 to achieve the same D at these two altitudes. Covering the same ground region at the same D, however, requires the same scanning time for the same v_f, as an identical distance must be flown. Scanning at lower altitudes (h_1 < h_2) benefits from an h_2/h_1 times higher spatial sampling resolution compared to scanning from higher altitudes (h_2 > h_1), and from h_2/h_1 times more intermediate classification results for the same scanning distance, as h_2/h_1 times more single images are captured.
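For illustration, the sampling parameters discussed above can be computed with a few lines of Python. The expressions used for the ground coverage c and the overlap o are assumptions consistent with the values reported in this text (c = 27.6 m at 35 m AGL with a 43.1° FOV, and the predicted overlaps at 4 m/s and 6 m/s); they are not quoted verbatim from the paper's equations.

```python
# Sketch of the continuous 1D synthetic aperture sampling parameters.
# The expressions for c and o are assumptions consistent with the values
# reported in the text, not verbatim copies of the original equations.
import math

def sampling_parameters(v_f, h, fov_deg, t_p, t_i):
    c   = 2.0 * h * math.tan(math.radians(fov_deg) / 2.0)  # ground coverage [m]
    t   = c / v_f                                           # integration time [s]
    d   = v_f * t_p                                         # integral-image sampling distance [m]
    d_i = v_f * t_i                                         # single-image sampling distance [m]
    o   = c / d                                             # integral images covering one ground point
    return c, t, d, d_i, o

def equal_disparity_distance(d_i1, h1, h2):
    """Sampling distance at altitude h2 with the same image disparity as d_i1 at h1."""
    return d_i1 * h2 / h1

for v_f in (1.0, 4.0, 6.0, 10.0):
    c, t, d, d_i, o = sampling_parameters(v_f, h=35.0, fov_deg=43.1, t_p=0.5, t_i=1 / 30)
    print(f"v_f={v_f:4.1f} m/s  c={c:.1f} m  t={t:.1f} s  o={o:.1f}")

print(equal_disparity_distance(d_i1=1.0, h1=35.0, h2=1000.0))  # -> 28.6 m
```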
As explained in Ref. 25, the orthographically projected occlusion density D in single images depends statistically on the density and size of the occluders (i.e., density of forest and size of branches, trunks, leaves, etc.) and the height of the occlusion volume (i.e., the height of trees). Here, the assumption of an orthographic projection implies an orthographic viewing angle α = 0 with respect to the synthetic aperture (SA) plane (i.e., looking straight down at the forest ground). More oblique viewing angles (α > 0), however, result in an increase in projected occlusion density due to the longer imaging distance from the SA plane through the occlusion volume to the forest ground (see proof in Sect. S1 of the Supplementary Material). As illustrated in Fig. 3, large viewing angles (and consequently cameras with a large FOV = 2·α) are inefficient for occlusion removal. In the case of forests, oblique viewing angles would additionally cause larger projected occluder sizes, since side views of tree trunks project to larger footprints than top-down views. Larger occluder sizes, however, would require even larger sampling distances for efficient occlusion removal 25.
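The geometric cause can be illustrated numerically: a ray from the SA plane to the ground traverses an occlusion volume of fixed height along a path that grows with 1/cos(α), so proportionally more occluders are expected along oblique rays. The snippet below only shows this path-length growth; the exact relation for the projected occlusion density is derived in Sect. S1 of the Supplementary Material and is not reproduced here.

```python
# Relative ray path length through an occlusion volume of fixed height as a
# function of the viewing angle alpha (0 = nadir). Purely illustrative.
import math

for alpha_deg in (0, 10, 20, 30, 40):
    factor = 1.0 / math.cos(math.radians(alpha_deg))  # slant path / vertical path
    print(f"alpha = {alpha_deg:2d} deg -> path-length factor = {factor:.2f}")
```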
Considering all the above sampling parameters, the following conclusions can be drawn:
1. A larger FOV is beneficial only up to a certain degree (when occlusion removal becomes seriously inefficient because viewing angles are too oblique).
2. For a given flying speed, higher flight altitudes have no effect on scanning time (i.e., the time needed to cover a certain forest range), but lead to lower spatial sampling resolution on the ground. Flight altitudes should therefore be chosen to be as low as possible.
3. A higher flying speed decreases not only scanning time but also the integral image overlap (and thus the number of times a person can be detected correctly). A suitable trade-off between flying speed and detection probability must be chosen. This might depend on the amount of occlusion (i.e., slower flights for denser forests).
Faster imaging speeds and shorter processing times are always beneficial, as both increase integral image overlap and allow faster flying speeds.

Combined classification
In Ref. 31, a fully autonomous and classification-driven drone was developed and deployed to carry out wilderness search-and-rescue operations without human intervention. The system utilizes YOLO 55 (a common and widespread object detection model 56,57 which provides real-time performance and consumes sufficiently low power 58 to be deployed on mobile processors) to achieve state-of-the-art results for person detection in the wilderness. However, the system performs classification on discrete (non-overlapping) integral images (single integral images over a 30 m flight distance and after 30 s of flight) and thus relies on a single classification chance for detecting persons. Here we demonstrate how the continuous computation of integral images and combined classifications within them increase the chance that a person is detected correctly by a factor of o compared to single classifications of non-overlapping integral images, as in Refs. 30,31.
Many real-time object-detection algorithms, such as YOLO 55, output classification results as axis-aligned bounding boxes (AABBs) together with detection confidence scores. For classification combination, we project all AABBs of all individual integral images to a common coordinate system that is defined by the digital elevation model (DEM) of the ground surface. Thus, for a single discrete point on the ground surface, we collect the confidence scores of all projected AABBs that cover it. Note that, if a DEM is not available, the ground surface can be automatically approximated by a simple shape 28 (e.g., a plane). Our hypothesis is that true detections often (but not always) project higher scores to the same spatial location, while false detections project predominantly (but not exclusively) lower scores to more randomly distributed locations. Thus, combining the projected scores should emphasize true and suppress false detection scores, and consequently separate the score ranges of these two groups more clearly than in single-integral-image classifications. This will in turn allow better thresholding and thus improve overall classification performance.
Confidence scores of multiple integral images can be combined by statistical or mathematical methods. We evaluated three different methods for combining the confidence scores of integral images (median, maximum, and the product of median and maximum), and present them along with confidence scores obtained in single integral images by using our previous approach 31 in the "Results" section.
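The following Python sketch illustrates this combination step. It assumes, purely for illustration, that every detection has already been projected to the common ground grid and rasterized into one score map per integral image (cell value = highest confidence of any AABB covering that cell, 0 where nothing was detected); the per-cell aggregation then follows the three evaluated rules. Variable names and the score-map representation are illustrative, not the on-board implementation.

```python
# Combine per-integral-image confidence scores on a common ground grid.
import numpy as np

def combine_scores(score_maps, method="max_median"):
    s = np.stack(score_maps, axis=0)          # shape: (num_integrals, H, W)
    med = np.median(s, axis=0)                # suppresses sporadic false detections
    mx  = s.max(axis=0)                       # emphasizes repeated true detections
    if method == "median":
        return med
    if method == "max":
        return mx
    if method == "max_median":
        return mx * med                       # product of maximum and median
    raise ValueError(method)

# Example: 5 overlapping integral images on a 4 x 4 ground grid.
rng = np.random.default_rng(0)
maps = [rng.random((4, 4)) * 0.2 for _ in range(5)]   # low, scattered false scores
for m in maps[1:]:
    m[2, 2] = 0.8                                      # repeated true detection
print(combine_scores(maps, "max_median").round(2))
```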

Results
To evaluate our combination strategies, we performed 1D SA test flights at various constant speeds (v_f = 4, 6, and 10 m/s) from 30 m AGL above unoccluded (open field) terrain, and 37 m above dense forests (conifer and broadleaf forest). Integral images were computed from D = N = 30 single images recorded at a video rate of 30 fps (t_i = 0.033 s) and a FOV of 43.10°. The processing time achieved was around t_p = 0.5 s. Details on the hard- and software used, the new AOS image acquisition that achieves fast flying speeds and large integral image overlaps, the improved integral image computation that applies deferred rendering to reduce processing time, and the implementation of the person classification are provided in the "Methods" section.

Figure 4 illustrates the test sites and computed single integral images along the 1D SA flight paths together with the ground-truth labels of persons on the ground. As predicted by (3), the same person appears in o integral images for different v_f (observed vs. predicted: 14 vs. 13.8 for 4 m/s, 9-10 vs. 9.2 for 6 m/s, and 5-6 vs. 5.2 for 10 m/s). Note that slight variations are caused by different transition angles through the FOV and the fact that the FOV differs slightly for horizontal/vertical and diagonal image axes.

Figure 5 shows and compares, for each flight, the probability maps of single integral images (utilizing the same YOLO model as in Ref. 31, but additionally trained with unoccluded targets, as described in the "Methods" section) and combined classification results. For the open-field flights and single integral images, high confidence scores were obtained at the ground-truth positions, while low-score detections could easily be filtered out by a distinct confidence threshold. This, however, was not the case under occlusion (especially during faster flights), where classification performance dropped significantly, and finding a proper confidence threshold to distinguish between false and true detections became increasingly difficult. While the maximum emphasized true but also aggregated false detections, and the median suppressed false but also reduced true detections, the maximum·median combination best suppressed false and emphasized true detections. That the score ranges between true and false detections can be separated best by a threshold with a maximum·median combination is also illustrated by the plot of confidence-score-ordered detections shown in Fig. 6. A steep gradient supports distinct confidence-score thresholds. Here, maximum·median outperformed the maximum and median combinations as well as single-integral-image classification (without combination).

Table 1 compares the maximum confidence scores of false detections with the minimum confidence scores of true detections. Their ratio represents the degree of separation between both groups (and how well confidence thresholds can be chosen). A value below 1 indicates overlapping scores for false and true detections. Thresholding will consequently lead to false classifications (or missing true classifications), which was always the case for our forest flights and classifications in single integral images (results of our previous approach 31). All combination methods increased the separation between false-detection and true-detection scores (ratio > 1), while the maximum·median method performed best. This implies that (in contrast to single-integral-image classification) a distinct and robust threshold can be found which leads to no false but all true detections.

Methods
We utilized an octocopter (MikroKopter OktoXL 6S12, two LiPo 4500 mAh batteries, 4.5 kg) that carried the following payload (cf. Fig. 1): a FLIR Vue Pro thermal camera (9 mm fixed focal length lens, 7.5 µm to 13.5 µm spectral band, 14-bit non-radiometric, 118 g, 1.2 W), a FLIR HDMI and power module providing HDMI video output from the camera (640 × 480 @30 Hz; 15 g), a video capture card (Basetech, 640 × 480 @30 Hz, 22 g), a system-on-chip computer (SoCC), a vision processing unit (VPU) for classification, and an LTE communication hat.

Figure 6. The steepest gradient, which supports the best-defined confidence-score thresholding, was always achieved with the maximum·median combination. While the difference to other combination methods (and to no combination (none), that is, single integral image classification) was low without occlusion (open field), it was significant in the presence of occlusion (conifer and broadleaf forest).

Our software was implemented in Python and runs on the SoCC, where various sub-processes, such as drone communication, image acquisition, and image processing (including integral image computation and classification), run in parallel to make use of the SoCC's multi-core capabilities. Inter-process queues are applied for efficient and secure communication between sub-processes. The drone communication sub-process interacts with the drone, utilizing a MikroKopter-customized serial protocol to receive IMU/GPS positions and send waypoint instructions which include GPS location, orientation, and speed. Received GPS positions and orientations are time-stamped and communicated to the image-processing sub-process.

Note that all experiments were carried out in accordance with relevant guidelines and regulations. The study protocols were approved by the ethics committee of the Upper Austrian government.
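As a structural sketch (with hypothetical process and queue names, not the on-board code), the sub-process and queue layout could look as follows:

```python
# Minimal sketch of the parallel sub-process architecture with inter-process
# queues. Process and queue names are illustrative, not the on-board code.
import multiprocessing as mp
import time

def drone_communication(pose_queue):
    """Receive IMU/GPS poses (here: simulated) and time-stamp them."""
    for i in range(5):
        pose = {"t": time.time(), "lat": 48.33, "lon": 14.32, "alt": 35.0 + i}
        pose_queue.put(pose)            # forward to image processing
        time.sleep(0.2)                 # ~5 Hz GPS rate
    pose_queue.put(None)                # sentinel: no more poses

def image_processing(pose_queue):
    """Consume time-stamped poses; integral imaging/classification would go here."""
    while (pose := pose_queue.get()) is not None:
        print(f"processing frame with pose timestamp {pose['t']:.3f}")

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=drone_communication, args=(q,)),
             mp.Process(target=image_processing, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```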
Image acquisition. For imaging, we grabbed digital video frames from the video capture card connected to the thermal camera using OpenCV's video capturing module. The camera was set to SAR mode (mode suited to search-and-rescue operations in the wilderness, using 100% field of view), providing a high tonal range for higher temperatures and fewer gray levels for colder temperatures. The images were time-stamped to assign individual GPS coordinates, preprocessed using OpenCV's pinhole camera model to remove the lens distortion, and cropped to a field of view of 43.10° and a resolution of 512 px × 512 px.
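A simplified version of this acquisition loop is sketched below. The camera matrix, distortion coefficients, capture device index, and crop geometry are placeholders, not the calibration values used on the drone.

```python
# Sketch of frame grabbing and preprocessing (placeholder calibration values).
import time
import cv2
import numpy as np

K    = np.array([[500.0, 0.0, 320.0],        # assumed pinhole camera matrix
                 [0.0, 500.0, 240.0],
                 [0.0,   0.0,   1.0]])
dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])  # assumed lens distortion coefficients

cap = cv2.VideoCapture(0)                     # video capture card device
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    timestamp = time.time()                   # later matched to an interpolated GPS pose
    undist = cv2.undistort(frame, K, dist)    # remove lens distortion (pinhole model)
    h, w = undist.shape[:2]
    side = min(h, w)                          # center square crop (illustrative)
    y0, x0 = (h - side) // 2, (w - side) // 2
    crop = undist[y0:y0 + side, x0:x0 + side]
    # ... hand (timestamp, crop) over to the image-processing sub-process ...
cap.release()
```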
To match the slow and asynchronous GPS signal rate (5 Hz in our case) to the faster video rate (30 Hz in our case), we applied time-based linear interpolation (assuming constant flying speeds). We thus interpolated GPS coordinates for each video frame grabbed. This, however, led to an interpolation error that depends on flying speed and imaging time: Given the drone's constant flying speed v_f [m/s] and the camera's imaging time t_i [s], the maximum interpolation error caused by an unknown delay between capturing and measuring the capturing timestamp (which includes the transmission time of the image from the camera to the processor) is E_Imax = (v_f · t_i) / 2, and adds to the GPS error. Thus, for t_i = 0.033 s (30 fps) and v_f = 1 m/s, 4 m/s, 6 m/s, and 10 m/s, E_Imax = 1.67 cm, 6.67 cm, 10 cm, and 16.67 cm, for example. Note that assuming a constant drone speed results in E_Imax being independent of the speed of the GPS sensor, as GPS positions of recorded camera frames can be linearly interpolated if they cannot be measured sufficiently fast. Only for non-constant speed segments (e.g., during acceleration and deceleration) are fast GPS samples beneficial for more precise piecewise linear or non-linear interpolations.
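A minimal sketch of this time-based interpolation (with made-up GPS fixes) and the resulting error bound, using E_Imax = v_f · t_i / 2 as given above, is:

```python
# Time-based linear interpolation of 5 Hz GPS fixes to 30 Hz frame timestamps
# (simplified sketch with made-up fixes; not the on-board code).
import numpy as np

gps_t   = np.array([0.0, 0.2, 0.4, 0.6])     # GPS timestamps [s], ~5 Hz
gps_lat = np.array([48.33000, 48.33001, 48.33002, 48.33003])
gps_lon = np.array([14.32000, 14.32000, 14.32000, 14.32000])
gps_alt = np.array([35.0, 35.0, 35.0, 35.0])

frame_t = np.arange(0.0, 0.6, 1 / 30)        # frame timestamps [s], 30 Hz

# Assuming constant speed between fixes, interpolate a pose for every frame.
frame_lat = np.interp(frame_t, gps_t, gps_lat)
frame_lon = np.interp(frame_t, gps_t, gps_lon)
frame_alt = np.interp(frame_t, gps_t, gps_alt)

# Maximum interpolation error from an unknown capture-to-timestamp delay:
t_i = 1 / 30
for v_f in (1.0, 4.0, 6.0, 10.0):
    print(f"v_f = {v_f:4.1f} m/s -> E_Imax = {100 * v_f * t_i / 2:.2f} cm")
```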
Deferred integral imaging computation. In principle, the preprocessed and GPS-assigned single thermal images are projected onto and averaged at a digital elevation model (DEM) using their individual poses and the camera's fixed intrinsic parameters. The DEM is a triangular mesh compatible with most standard graphics pipelines 31. For an integral image, the DEM with all projections is finally rendered from the center perspective of all integrated poses. In Ref. 31, the above process was implemented with classical rendering, for which the entire DEM had to be processed for each projection. This did not achieve practical computation times for large numbers of projections and/or large DEMs (cf. Fig. 7). To decouple image projection from the processing of the DEM's geometry, we now utilize a rendering technique known as deferred rendering (DR). This requires the DEM's geometry to be processed only once for each integral image and consequently speeds up computation time. With DR, the processed DEM geometry is preserved in the graphics memory and can be reused for all projections as long as the rendering perspective does not change. As shown in Fig. 7, compared to classical rendering, this leads to a significant decrease in computation time for large DEMs and large numbers of projections. Storing the DEM's processed geometry requires support for floating-point precision buffers, which our SoCC provides.

Table 1. Minimum confidence scores of true detections/maximum confidence scores of false detections (averaged over all detections and all integral images) and the ratio of those scores. Values below 1 (bold) indicate false classifications. The maximum·median combination performed best. In the case of occlusion (conifer and broadleaf forest), it separated false from true confidence scores by a factor of 4-93 (italic).
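The principle behind this decoupling can be mimicked on the CPU: rasterize the DEM once into a buffer of per-output-pixel world positions (a "geometry buffer"), then reuse that buffer for every single-image projection. The sketch below is an illustrative analogy with assumed intrinsics and dummy data, not the OpenGL/DR implementation.

```python
# CPU sketch of the deferred-rendering idea: process the DEM geometry once,
# then reuse the resulting per-pixel world positions for all N projections.
import numpy as np

def build_geometry_buffer(dem_heights, x_coords, y_coords):
    """Rasterize the DEM once: world position (x, y, z) per output pixel."""
    X, Y = np.meshgrid(x_coords, y_coords)
    return np.stack([X, Y, dem_heights], axis=-1)      # shape (H, W, 3)

def project_and_accumulate(gbuffer, image, cam_pos, fx, fy, cx, cy, acc, cnt):
    """Project every buffered world position into one thermal image and accumulate."""
    rel = gbuffer - np.asarray(cam_pos)                # vector camera -> surface point
    depth = -rel[..., 2]                               # nadir camera looks along -z
    u = np.round(fx * rel[..., 0] / depth + cx).astype(int)
    v = np.round(fy * rel[..., 1] / depth + cy).astype(int)
    h, w = image.shape
    valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    acc[valid] += image[v[valid], u[valid]]
    cnt[valid] += 1

# Geometry is processed once per integral image ...
gbuf = build_geometry_buffer(np.zeros((64, 64)),
                             np.linspace(0, 30, 64), np.linspace(0, 30, 64))
acc, cnt = np.zeros((64, 64)), np.zeros((64, 64))
# ... and reused for all 30 projections (dummy images and poses).
for k in range(30):
    project_and_accumulate(gbuf, np.full((480, 640), 20.0), (k, 15.0, 35.0),
                           500.0, 500.0, 320.0, 240.0, acc, cnt)
integral = np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)
```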
Integral image computation was implemented using C, C++, and OpenGL, and integrated into the main Python program using Cython. Preprocessing and integration of 30 thermal images require around 99 ms and 115 ms, respectively, for DEMs with 34k to 2.6M vertices.
Classification. Person detection in the integral images was performed with the YOLOv4-tiny network architecture 59 on the VPU. More details on (pre-)training, parameters, and the selection of the best training weights can be found in Ref. 31 . Classifications are achieved in 84 ms and are optionally transmitted to a remote mobile device using the LTE communication hat connected to the SoCC.
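For illustration, detections of the kind combined in the "Combined classification" section can be obtained from a trained YOLOv4-tiny model with OpenCV's DNN module as sketched below. File names are placeholders; the on-board system runs the network on the VPU, which is not shown here.

```python
# Illustrative person detection on an integral image with a YOLOv4-tiny model
# via OpenCV's DNN module (placeholder file names; not the VPU deployment).
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

integral_img = cv2.imread("integral.png")        # e.g., a 512 x 512 integral image
class_ids, scores, boxes = model.detect(integral_img,
                                        confThreshold=0.25, nmsThreshold=0.4)
for cid, score, (x, y, w, h) in zip(class_ids, scores, boxes):
    print(f"person candidate: confidence={float(score):.2f}, AABB=({x},{y},{w},{h})")
```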
For classifier training (using the Darknet software), we used 17 previously recorded flights 30 (F0-F11 and O1-O6; excluding F7). The data contains 7 flights over forests with persons on the ground (F0-F6), 4 flights over forests without persons (F8-F11), and 6 flights over a meadow with persons but without occlusions (O1-O6). From these flights, F1, F8, and O6 were used for validation, while the remaining 14 flights were used for training. For our experiments, the recordings were resampled from 2D synthetic apertures to 1D synthetic aperture lines, pose data was computed from GPS readings, and persons were labeled, as explained in Ref. 31.
Note that manual compass correction was applied to each flight. To compute the integral images of the training and validation datasets, we applied the following augmentations: We varied the synthetic aperture size N in 7 steps: 1 (pinhole), 5, 10, 15, 20, 25, and 30. Note that, due to resampling, some 1D apertures had fewer single images (e.g., the longest lines were only in the range 20 < N ≤ 25 for some scenes). Additionally, we applied 10 random image rotations by varying the up vector of the integral's virtual camera. The digital elevation model was translated up and down by 3 m in steps of 1 m to simulate defocus. Furthermore, we computed an additional integral image with a random compass error (rotation around each single-image camera's forward axis by ± 15 degrees). This led to a total of 980 variations for each of the 179 resampled 1D apertures.
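The stated number of variations can be cross-checked with a short calculation, assuming that the additional compass-error integral doubles every combination of aperture size, rotation, and DEM translation:

```python
# Cross-check of the augmentation count stated above
# (7 aperture sizes x 10 rotations x 7 DEM translations x 2 compass variants).
aperture_sizes   = [1, 5, 10, 15, 20, 25, 30]   # N (pinhole ... full aperture)
rotations        = 10                           # random up-vector rotations
dem_translations = range(-3, 4)                 # -3 m ... +3 m in 1 m steps
compass_variants = 2                            # without / with random compass error

total = len(aperture_sizes) * rotations * len(dem_translations) * compass_variants
print(total)  # 980
```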
The trained network achieved an average precision score of 88.7% (without combined classification) on the test data presented in Table 2 of Ref. 31 (previously 86.2% in Ref. 31) and was also able to detect unoccluded persons in open terrain (e.g., meadows, fields), since it was additionally trained with the unoccluded person data from Refs. 30,60. Note that, for combined classification, a precision score cannot be determined, as pixel-precise ground-truth labels are not available.

Discussion
We have shown that false detections can be significantly suppressed and true detections significantly boosted by classifying multiple combined AOS integral images rather than single ones. This improves classification rates especially in the presence of occlusion. To make this possible, we modified the AOS imaging process to support large overlaps between subsequent integrals, thus enabling real-time and on-board scanning and processing (processing time of 0.5 s at 9.1 W) for groundspeeds of up to 10 m/s (while previously the maximum was 1 m/s 30,31).

One of the major limitations of utilizing a deep-learning-based classifier is its non-deterministic nature, which (in contrast to the derivable dependency between target visibility in integral images and occlusion density 25,28) prevents us from establishing a distinct mathematical relationship between person detection rate and occlusion. Empirical evidence 31 only suggests a strong correlation between classifier performance and visibility, while the relation between visibility and occlusion density has already been described for randomly distributed occluder models 25. Furthermore, any simulation-based investigation of the correlation between occlusion density and classifier performance will be heavily dependent on the classifier configuration and the training data.

Due to the slow sampling rate of standard GPS, only constant-speed segments are currently supported. For acceleration and deceleration segments, faster and more precise Real-Time Kinematic (RTK) devices are beneficial. Non-linear interpolation and prediction models will be investigated in the future to enable better mapping of faster imaging rates to slower pose sampling. In the future, we will also investigate additional improvements, such as adaptive combination techniques, online adaptive thresholding, and options for extending flight endurance to enable fully autonomous beyond-visual-line-of-sight (BVLOS) search-and-rescue missions.