Introduction

Controlled hallucination1 is an evocative term referring to the Bayesian brain hypothesis2. It posits that perception is not merely a feed-forward processing of sensory information that captures the world as it is. Instead, the brain is a predictive machine: it attempts to infer the causes of its sensory inputs. To achieve this, the brain builds and continually refines a world model. This world model serves as a prior and, when combined with the sensory signals, produces the best guess of their causes. Hallucination (uncontrolled) occurs when the sensory inputs cannot be reconciled with, or contradict, the prior world model. This can also happen in our model; when it does, it manifests itself at the fusion stage, with the stochastic gradient descent procedure getting trapped in a local minimum. The method presented in this paper is partly inspired by the Bayesian brain hypothesis, but it also builds upon multimodal generative modelling and deep compressed sensing.

Multimodal data fusion attracts academic and industrial interest alike3 and plays a vital role in several applications. Automated driving is arguably the most challenging industrial domain4. Automated vehicles use a plethora of sensors: Lidar, mmWave radar, video and ultrasonic, and attempt to perform some form of sensor fusion for environmental perception and precise localization. A high-quality final fusion estimate is a prerequisite for safe driving. Among other application areas, eHealth and Ambient Assisted Living (AAL) deserve a notable mention. These new paradigms are contingent on gathering information from various sensors around the home to monitor and track the movement signatures of people. The aim is to build a long-term behavioural sensing machine that also affords privacy. Such platforms rely on an array of environmental and wearable sensors, with sensor fusion being one of the key challenges.

In this contribution, we focus on a one-time snapshot problem (i.e. we do not model temporal structure). However, we explore the problem of multimodal sensor fusion from a new perspective, essentially a Bayesian viewpoint. The concept is depicted in Fig. 1, alongside the two main groups of approaches to sensor fusion. Traditionally, sensor fusion for classification tasks has been performed at the decision level, as in Fig. 1(a). Assuming that conditional independence holds, a pointwise product of the final pmfs (probability mass functions) across all modalities is taken. Feature fusion, as depicted in Fig. 1(b), has become very popular with the advent of deep neural networks3 and can produce very good results. Figure 1(c) shows our technique during the fusion stage (Stage 2). Blue arrows indicate the direction of backpropagation gradient flow during fusion.

Figure 1

Multimodal Sensor Fusion: (a) Decision fusion, (b) Feature fusion, (c) Our technique: fusion in the latent representation Z with optional compressed sensing measurements. Red arrows show the generative model from the latent space and the data subsampling; blue arrows show the fusion process, guided by the prior model. F features, p(z) prior model, \(\textbf{G}\) generators, X complete data, Y subsampled data. For clarity, \(M=2\) modalities are shown; the concept generalises to any M.

Contributions

  • A novel method for multimodal sensor fusion is presented. The method attempts to find the best estimate (maximum a posteriori) for the causes of observed data. The estimate is then used to perform specific downstream fusion tasks.

  • The method can fuse the modalities under lossy data conditions, i.e. when the data are subsampled, partially lost and/or noisy. Such phenomena occur in real-world situations, such as wireless transmission of information or intentional subsampling to expedite the measurement (e.g. rapid MRI and radar).

  • It can leverage one modality against another. A strong modality can be used to aid the recovery of another modality that is lossy or less informative (a weak modality). We refer to this as asymmetric compressed sensing.

Related work

In this section, we review the state of the art in three areas directly relevant to our contributions: multimodal generative modelling, sensor fusion, and compressed sensing. One of the main aims of Multimodal Variational Autoencoders (MVAEs) is to learn a shared representation across different data types in a fully self-supervised manner, thus avoiding the need to label a huge amount of data, which is time-consuming and expensive5. It is indeed a challenge to infer the low-dimensional joint representation from multiple modalities, which can ultimately be used in downstream tasks such as self-supervised clustering or classification. This is because the modalities may vastly differ in characteristics, including dimensionality, data distribution, and sparsity6. Recently, several methods have been proposed to combine multimodal data using generative models such as Variational Autoencoders (VAEs)5,7,8,9,10,11. These methods aim to learn a joint distribution in the latent space via inference networks and try to reconstruct modality-specific data, even when one modality is missing. In these works, a modality can refer to natural images, text, captions, labels, or visual and non-visual attributes of a person. JMVAE (Joint Multimodal Variational Autoencoder)9 makes use of a joint inference network to learn the interaction between two modalities and addresses the issue of a missing modality by training an individual (unimodal) inference network for each modality, as well as a bimodal inference network to learn the joint posterior based on the product-of-experts (PoE). It consequently minimizes the distance between the unimodal and multimodal latent distributions. On the other hand, MVAE7, which is also based on PoE, considers only a partial combination of observed modalities, thereby reducing the number of parameters and improving computational efficiency. Reference8 uses the Mixture-of-Experts (MoE) approach to learn the shared representation across multiple modalities. The latter two models essentially differ in their choices of joint posterior approximation functions. MoPoE (Mixture-of-Products-of-Experts)-VAE5 aims to combine the advantages of both approaches, MoE and PoE, without incurring significant trade-offs. DMVAE (Disentangled Multimodal VAE)10 uses a disentangled VAE approach to split the latent space of multiple modalities into private and shared (using PoE) parts, where the latent factors may be both continuous and discrete. CADA (Cross- and Distribution Aligned)-VAE11 uses a cross-modal embedding framework to learn a latent representation from image features and classes (labels) using aligned VAEs optimized with cross- and distribution-alignment objectives.

In terms of multimodal/sensor fusion for human activity sensing using Radio-Frequency (RF), inertial and/or vision sensors, most works have considered either decision-level or feature-level fusion. For instance, the work in12 performs multimodal fusion at the decision level to combine the benefits of WiFi and vision-based sensors, using a hybrid Deep Neural Network (DNN) model to achieve good activity recognition accuracy for three activities. The model essentially consists of a WiFi sensing module (a dedicated Convolutional Neural Network (CNN) architecture) and a vision sensing module (based on the Convolutional 3D model) for processing WiFi and video frames for unimodal inference, followed by a multimodal fusion module. Multimodal fusion is performed at the decision level (after both the WiFi and vision modules have made a classification) because this framework is stated to be more flexible and robust to unimodal failure than feature-level fusion. Reference13 presents a method for activity recognition which leverages four sensor modalities, namely skeleton sequences, inertial and motion capture measurements, and WiFi fingerprints. The fusion of signals is formulated as a matrix concatenation. The individual signals of the different sensor modalities are transformed and represented as an image. The resulting images are then fed to a two-dimensional CNN (EfficientNet B2) for classification. The authors of14 proposed a multimodal HAR system that leverages WiFi and wearable sensor modalities to jointly infer human activities. They collect Channel State Information (CSI) data from a standard WiFi Network Interface Card (NIC), alongside the user’s local body movements via a wearable Inertial Measurement Unit (IMU) consisting of accelerometer, gyroscope, and magnetometer sensors. They compute the time-variant Mean Doppler Shift (MDS) from the processed CSI data and the magnitude from the inertial data for each sensor of the IMU. Then, various time- and frequency-domain features are separately extracted from the magnitude data and the MDS. The authors apply a feature-level fusion method which sequentially concatenates feature vectors that belong to the same activity sample. Finally, supervised machine learning techniques are used to classify four activities, namely walking, falling, sitting, and picking up an object from the floor. The work in15 proposed a feature-level sensor fusion approach for HAR and validated it on a WiFi platform. The authors of16 proposed a decision-level sensor fusion network for HAR using LiDAR and visual sensors.

Compared to the aforementioned works12,13,14,15,16, which consider supervised models with feature-level or decision-level fusion, our technique performs multimodal sensor fusion in the latent representation space, leveraging a self-supervised generative model. Our method differs from current multimodal generative models such as those proposed in5,7,8,9 in that it can handle cases where observations are accessed only via subsampling (i.e. compressed sensing with significant loss of data and no data imputation). Crucially, our technique attempts to compute the MAP (maximum a posteriori) estimate directly.

The presented method is related to, and builds upon, Deep Compressed Sensing (DCS) techniques17,18. DCS, in turn, is inspired by Compressed Sensing (CS)19,20. In CS, we attempt to solve what appears to be an underdetermined linear system; a solution is nevertheless possible with an additional sparsity prior on the signal, i.e. minimizing its \(L_0\) norm. Since the \(L_0\) norm is non-convex, the \(L_1\) norm is used instead as a convex relaxation, which also promotes sparsity and allows for computationally efficient solvers. DCS, in essence, replaces the sparsity prior with a low-dimensional manifold, which is learnable from the data using generative models. Concurrently with DCS, Deep Image Prior21 was proposed; it uses untrained CNNs to solve a range of inverse problems in computer vision (image inpainting, super-resolution, denoising).

Methods

Assume a data generative process in which a latent common cause Z gives rise to \(X_m\), which in turn produces the observed \(Y_m\); i.e. \(Z\rightarrow X_m \rightarrow Y_m\) forms a Markov chain. Here, \(X_m\) is the full data pertaining to the \(m^{th}\) modality, \(m \in \{1,\dots ,M\}\). Crucially, the modalities collect data simultaneously, “observing” the same scene. As an example, in this work, we consider the different views (obtained via multiple receivers) of the opportunistic CSI WiFi radar as different modalities. The variable Z encodes the semantic content of the scene and is typically of central interest. Furthermore, \(X_m\) is not accessed directly but is observed via a subsampled \(Y_m\). This is a compressed sensing setup, \(Y_m=\chi _{m}(X_m)\), where \(\chi _{m}\) is a deterministic and known (typically many-to-one) function. The only condition we impose on \(\chi _{m}\) is that it be Lipschitz continuous. With the above, conditional independence between the modalities holds (conditioned on Z). Therefore, the joint density factorizes as:

$$\begin{aligned} p\left( z,x_{1:M},y_{1:M}\right) = p\left( z\right) \prod _{m=1}^M{ p(y_{m}|x_{m}) p(x_{m}|z) }. \end{aligned}$$
(1)
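For concreteness, the generative chain \(Z\rightarrow X_m \rightarrow Y_m\) with a linear subsampler can be sketched as follows. This is a minimal illustration: the random linear generator and Gaussian measurement matrix are stand-ins, not the networks or samplers used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, n_meas = 4, 32, 2

# Stand-in generator for modality m (the paper uses learned, non-linear decoders).
W_m = rng.normal(size=(data_dim, latent_dim))

# Known, deterministic, many-to-one sampler chi_m: a random Gaussian measurement
# matrix, which is Lipschitz continuous as required.
A_m = rng.normal(size=(n_meas, data_dim)) / np.sqrt(n_meas)

z = rng.normal(size=latent_dim)   # latent common cause Z ~ p(z)
x_m = W_m @ z                     # full data X_m for modality m
y_m = A_m @ x_m                   # subsampled observation Y_m = chi_m(X_m)
```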

The main task in this context is to produce the best guess for the latent Z and, possibly, to recover the full signal(s) \(X_m\), given the subsampled data \(Y_{1:M}\). We approach the problem in two stages. First, we build a joint model which approximates Eq. (1) and is instantiated as a Multimodal Variational Autoencoder (M-VAE). More specifically, the M-VAE provides an approximation to \(p_{\phi _{1:M},\psi _{1:M}}(z,x_{1:M})\), parameterized by deep neural networks \(\{\phi _{1},\dots , \phi _{M}\}\), \(\{\psi _{1},\dots , \psi _{M}\}\), referred to as encoders and decoders, respectively. The trained M-VAE is then appended with \(p_{\chi _{m}}(y_{m}|x_{m})\) for each modality m: \(\{\chi _{1},\dots , \chi _{M}\}\) are referred to as samplers. In the second stage, we use the trained M-VAE and \(\chi _{1:M}\) to facilitate the fusion and reconstruction tasks. Specifically, our sensor fusion problem amounts to finding the maximum a posteriori (MAP) estimate \(\hat{z}_{MAP}\) of the latent cause for a given (\(i^{th}\)) data point \(Y_{1:M}=y^{(i)}_{1:M}\):

$$\begin{aligned} \hat{z}_{MAP} = \arg \max _{z}p\left( z| Y_{1:M}=y^{(i)}_{1:M} \right) , \end{aligned}$$
(2)

where,

$$\begin{aligned} p\left( z| Y_{1:M}=y^{(i)}_{1:M} \right) \propto p\left( z\right) \prod _{m=1}^M{\int _{X_{m}} p( Y_{m}=y^{(i)}_{m} |x_{m}) p(x_{m}|z) \,dx_{m}}. \end{aligned}$$
(3)

The above MAP estimation problem is hard, and we will resort to approximations detailed in the sections below.

Multimodal VAE

The first task is to build a model of Eq. (1). As aforementioned, this is accomplished in two steps. First, during the training stage we assume access to the full data \(X_{1:M}\); therefore, training an approximation to \(p_{\phi _{1:M},\psi _{1:M}}(z,x_{1:M})\) is a feasible task. The marginal data log-likelihood for the multimodal case is:

$$\begin{aligned} &\log p(x_{1:M})= D_{KL}\left( q(z|x_{1:M}) \,\Vert \, p(z|x_{1:M})\right) \end{aligned}$$
(4)
$$\begin{aligned}& \quad +\left[ \sum _{m=1}^M { \mathbb {E}_{z \sim q(z|x_{1:M})} \log p(x_{m}|z)} - \mathbb {E}_{z \sim q(z|x_{1:M})} \log \frac{q(z|x_{1:M})}{p(z)} \right] , \end{aligned}$$
(5)

where \(D_{KL}\) is the Kullback-Leibler (KL) divergence. The sum over modalities in Eqs. (4)–(5) follows directly from the conditional independence. Since the KL divergence is non-negative, the bracketed term represents a lower bound (also known as the Evidence Lower Bound, ELBO) on the log probability of the data, and its negative is used as the loss for the M-VAE. There exists a body of work on M-VAEs; the interested reader is referred to5,7,8,9 for details and derivations. The key challenge in training M-VAEs is the construction of the variational posterior \(q(z|x_{1:M})\). We dedicate a section of the Supplementary Information document to the discussion of the choices and implications of the variational posterior approximation. Briefly, we consider two main cases: a missing-data case, i.e. where the data of a particular modality might be missing (\(X_m = x_m^{(i)} = \emptyset\)), and the full-data case. The latter is straightforward and is tackled by enforcing a particular structure on the encoders. For the former case, a variational Product-of-Experts (PoE) is used:

$$\begin{aligned} q_{\Phi }(z|x_{1:M}) = p(z) \prod _{m=1}^M{q_{\phi _m}(z|x_{m})} . \end{aligned}$$
(6)

Should the data be missing for any particular modality, \(q_{\phi _m}(z|x_m) = 1\) is assumed. Derivation of Eq. (6) can be found in the Supplementary Information document.
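A minimal sketch of the Gaussian product-of-experts combination behind Eq. (6) is given below, assuming each unimodal encoder outputs a diagonal-Gaussian mean and log-variance; a missing modality is simply omitted from the product (equivalently, its expert is set to 1). Names and shapes are illustrative.

```python
import torch

def poe_gaussian(mus, logvars):
    """Combine unimodal Gaussian experts q_phi_m(z|x_m) with the N(0, I) prior
    via a product of experts (cf. Eq. 6). `mus`/`logvars` are lists of tensors
    of shape (batch, latent_dim); a missing modality is simply left out."""
    batch, dim = mus[0].shape
    prior_mu = torch.zeros(batch, dim)        # prior expert: zero mean
    prior_logvar = torch.zeros(batch, dim)    # and unit variance
    all_mu = torch.stack([prior_mu] + list(mus))            # (M+1, batch, dim)
    all_logvar = torch.stack([prior_logvar] + list(logvars))
    precision = torch.exp(-all_logvar)                       # 1 / sigma^2
    joint_var = 1.0 / precision.sum(dim=0)
    joint_mu = (all_mu * precision).sum(dim=0) * joint_var
    return joint_mu, torch.log(joint_var)
```

Because a product of Gaussians is again Gaussian, the combination reduces to a precision-weighted average of the expert means.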

Fusion on the M-VAE prior

Recall the sensor fusion problem as stated in Eq. (2). The prior p(z) is forced to be an isotropic Gaussian by the M-VAE, and the remaining densities are assumed to be Gaussian. Furthermore, we assume that \(p(x_{m}|z)= \delta (x_{m}-\psi _{m}(z))\). Therefore, Eq. (2) becomes:

$$\begin{aligned} \hat{z}_{MAP} = \arg \max _{z} p\left( z| Y_{1:M}=y^{(i)}_{1:M} \right) \propto \exp {(- \Vert z\Vert ^2 )} \prod _{m=1}^M{\exp {(-\frac{1}{2\sigma _{m}^{2}} \Vert y_{m}^{(i)}- \chi _{m}(\psi _{m}(z))\Vert ^2 )}}. \end{aligned}$$
(7)

Hence, the objective to minimize becomes:

$$\begin{aligned} {\mathscr {L}}(z) = \lambda _0{\Vert z\Vert ^2} + \sum _{m=1}^M{\lambda _m \Vert y_{m}^{(i)}- \chi _{m}(\psi _{m}(z))\Vert ^2 }. \end{aligned}$$
(8)

Recall that the outputs of the first stage are p(z) and the decoders \(\prod _{m} {p_{\psi _{m}}(x_{m}|z)}\), parameterized by \(\{\psi _{1:M}\}\); \(\{\lambda _{0:M}\}\) are constants. The MAP estimation procedure consists of backpropagating through the samplers \(\chi _m\) and decoders \(\psi _{m}\) using Stochastic Gradient Descent (SGD). In this step, \(\{\psi _{1:M}\}\) are non-learnable, i.e. jointly with \(\chi _m\) they are known (but differentiable) non-linear functions. The resulting update is:

$$\begin{aligned} z\leftarrow z - \eta _0\nabla _{z} ({{\Vert z\Vert ^2}}) - \sum _{m=1}^M{\eta _m \nabla _{z}( \Vert y_{m}^{(i)}- \chi _{m}(\psi _{m}(z))\Vert ^2) }. \end{aligned}$$
(9)

The iterative fusion procedure is initialized by taking a sample from the prior, \(z^{0} \sim p(z)\), where \(\{\eta _{0:M}\}\) are learning rates. One or several SGD steps are taken for each modality in turn. The procedure terminates upon convergence (see Algorithm 1). In general, the optimization problem set out in Eq. (8) is non-convex. Therefore, there are no guarantees of convergence to the optimal point (\(\hat{z}_{MAP}\)). We deploy several strategies to minimize the risk of getting stuck in a local minimum. We consider multiple initialization points (a number of points sampled from the prior, with Stage 2 replicated for all points). In some cases it might be possible to sample from \(z^{0} \sim p\left( z \right) \prod p\left( z\left| X = \check{x}_m^{(j)} \right. \right)\); depending on the modality, this may require data imputation (\(\check{x}_{m}\) are imputed data). The final stage depends on the particular task (multisensory classification/reconstruction), but in all cases it takes \(\hat{z}_{MAP}\) as an input. In our experiments, we observe that the success of Stage 2 depends crucially on the quality of the M-VAE.

Algorithm 1
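A sketch of the fusion loop of Eqs. (8)-(9) (and Algorithm 1) is shown below. It assumes that trained decoders \(\psi_m\) and known samplers \(\chi_m\) are available as frozen, differentiable PyTorch callables; the function name, default hyperparameters and the use of Adam (rather than plain SGD) are illustrative choices, not prescribed by the paper.

```python
import torch

def fuse_latent(y, decoders, samplers, lambdas, lambda0=1.0,
                latent_dim=64, n_init=10, steps=10_000, lr=0.01):
    """Approximate z_MAP by minimising the objective of Eq. (8) w.r.t. z.

    y: list of observed tensors y_m (one per modality);
    decoders/samplers: frozen but differentiable callables psi_m and chi_m;
    lambdas/lambda0: the weighting constants of Eq. (8).
    Multiple initialisations z^0 ~ p(z) reduce the risk of a poor local minimum.
    """
    def objective(z):
        loss = lambda0 * (z ** 2).sum()
        for y_m, psi_m, chi_m, lam_m in zip(y, decoders, samplers, lambdas):
            loss = loss + lam_m * ((y_m - chi_m(psi_m(z))) ** 2).sum()
        return loss

    best_z, best_loss = None, float("inf")
    for _ in range(n_init):
        z = torch.randn(1, latent_dim, requires_grad=True)   # z^0 ~ N(0, I)
        opt = torch.optim.Adam([z], lr=lr)                   # plain SGD also matches Eq. (9)
        for _ in range(steps):
            opt.zero_grad()
            objective(z).backward()
            opt.step()
        with torch.no_grad():
            final = objective(z).item()
        if final < best_loss:
            best_loss, best_z = final, z.detach().clone()
    return best_z
```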

Experiments

In this work, we investigate the performance of the proposed method on two datasets for multimodal sensor fusion and recovery tasks: i) a synthetic “toy protein” dataset and ii) a passive WiFi radar dataset intended for Human Activity Recognition (HAR).

Synthetic toy protein dataset

A synthetic dataset containing two-dimensional (2D) protein-like data samples with two modalities is generated. The latent distribution \(p(z), z\in \mathbb {R}^4\), is a Gaussian mixture model with 10 components, simulating 10 different “classes” of samples. For each modality, the data generative model \(p(x_m|z), x_m \in \mathbb {R}^N\), is a one-layer multilayer perceptron (MLP) with random weights; here \(m=1,2\) indexes the two modalities. 10,000 pairs of samples are generated using the generative model, with protein size \(N=32\). Figure 2(a) shows an instance of the 2D protein data with \(N=64\).
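The paper does not list the mixture parameters or the MLP weights, so the following is only an indicative sketch of how such a two-modality dataset could be generated (GMM prior over z, one random single-layer MLP per modality):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, latent_dim, protein_size, n_samples = 10, 4, 32, 10_000

# GMM prior p(z): one Gaussian component per "class" (illustrative parameters).
means = rng.normal(scale=3.0, size=(n_classes, latent_dim))
labels = rng.integers(n_classes, size=n_samples)
z = means[labels] + rng.normal(scale=0.5, size=(n_samples, latent_dim))

def random_mlp(in_dim, out_dim):
    """One-layer MLP with random weights, standing in for p(x_m | z)."""
    W, b = rng.normal(size=(in_dim, out_dim)), rng.normal(size=out_dim)
    return lambda h: np.tanh(h @ W + b)

# Two modalities observe the same latent cause through different random maps.
x1 = random_mlp(latent_dim, protein_size)(z)   # modality 1, x_1 in R^N
x2 = random_mlp(latent_dim, protein_size)(z)   # modality 2, x_2 in R^N
```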

Figure 2

(a) Generated toy proteins examples (\(N=64\)) and (b) reconstruction from compressed sensing observations. With 2 out of 64 measurements (3.125%), near perfect reconstruction is possible even though the modalities are individually subsampled.

Passive WiFi radar dataset

We use the OPERAnet22 dataset, which was collected with the aim of evaluating Human Activity Recognition (HAR) and localization techniques with measurements obtained from synchronized Radio-Frequency (RF) devices and vision-based sensors. The RF sensors captured the changes in the wireless signals while six daily activities were being performed by six participants, namely, sitting down on a chair (“sit”), standing up from the chair (“stand”), lying down on the floor (“laydown”), standing up from the floor (“standff”), upper body rotation (“bodyrotate”), and walking (“walk”). It should be noted that the six activities were performed in two different office rooms, and in each room the participants performed the activities at different locations. The distribution of the six activities performed by the six participants in the two rooms is reported in22. We convert the raw time-series CSI data from the WiFi sensors into an image-like format, namely spectrograms, using signal processing techniques. More details are available in Sect. S2 of the Supplementary Information document. The interested reader is kindly referred to our previous works23,24,25 for more details on the signal processing pipelines for WiFi CSI. In this paper, we focus mainly on the design of a model that can fuse data from multiple modalities/sensors in the latent representation space effectively for several downstream tasks, such as multisensory classification, denoising, and recovery from subsampled observations and missing pixels. 2,906 spectrogram samples (each spanning a 4 s window) were generated for the 6 human activities; 80% of these were used as training data and the remaining 20% as testing data (random train-test split).

Table 1 Few-shot learning sensor fusion classification results (\(F_1\) macro) for Human Activity Recognition.

Results and discussion

Classification results of WiFi CSI spectrograms for HAR

In this section, we evaluate the HAR sensor fusion classification performance under a few-shot learning scenario, with 1, 5 and 10 labelled examples per class. These correspond to 0.05%, 0.26% and 0.51% of the labelled training samples, respectively. We randomly select 80% of the samples in the dataset as the training set and the remaining 20% is used for validation. The average \(F_1\)-macro scores for the HAR performance are shown in Table 1 for the different models. To allow for a fair comparison, the same random seed was used in all experiments, each using only two modalities (processed spectrogram data obtained from two different receivers).

Prior to training our model (see Supplementary Fig. S1), the spectrograms were reshaped to typical image dimensions of size \((1\times 224\times 224)\). Our model was trained for 1,000 epochs using the training data with a fixed KL scaling factor of \(\beta = 0.02\). The encoders comprised a ResNet-18 backbone with a last fully-connected layer of dimension 512. For the decoders, corresponding CNN deconvolutional layers were used to reconstruct the spectrograms of each modality at the same input dimension. The latent dimension, batch size, and learning rate were set to 64, 64, and 0.001, respectively. Our model was implemented in PyTorch. We used a single NVIDIA GeForce RTX 2080 Ti (11 GB) GPU for training our model, which consists of 29.6M trainable parameters. The time taken for training on this setup was approximately 1.94 hours. In the second stage, the generative model serves as a reconstruction prior and the search manifold for the sensor fusion tasks. Essentially, in this stage, we obtain the maximum a posteriori estimate \(\hat{z}_{MAP}\) through the process described in Algorithm 1. The final class estimate is produced by K-NN in the latent representation space, with labelled examples sampled from the training set.
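The final classification step can be sketched as follows, assuming \(\hat{z}_{MAP}\) has already been computed for both the few labelled support examples and the test samples; the use of scikit-learn's KNeighborsClassifier and the choice of k are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

def few_shot_classify(z_support, y_support, z_query, k=1):
    """K-NN in the latent representation space: the few labelled examples
    (drawn from the training set) anchor the classes, and each test sample is
    assigned the label of its nearest labelled neighbour(s)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(z_support, y_support)
    return knn.predict(z_query)
```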

To benchmark our technique, we investigate the performance of other state-of-the-art sensor fusion techniques. Feature fusion is represented by CNN models (1-channel CNN, 2-channel CNN, dual-branch CNN). All are trained in a conventional supervised fashion from scratch using the ResNet-18 backbone, with a linear classification head appended on top, consisting of a hidden linear layer of 128 units and a linear output layer of 6 nodes (for classifying the 6 human activities). The dual-branch CNN refers to the case where the embeddings from the two modalities’ CNNs are concatenated and a classification head is then added (as illustrated in Fig. 1b). The “Probability Fusion” (decision fusion) model refers to a score-level fusion method where the classification probabilities (\(P_1\) and \(P_2\)) from each modality are computed independently (using an output SoftMax layer) and then combined using the product rule (which is optimal given conditional independence). These models are fine-tuned with labelled samples over 200 epochs with a batch size of 64; the Adam optimizer was used with a learning rate of 0.0001, weight decay of 0.001, \(\beta _1= 0.95\) and \(\beta _2= 0.999\).
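As a reference, the product-rule decision fusion baseline can be sketched as below (per-modality softmax probabilities combined multiplicatively and renormalized); variable names are illustrative.

```python
import torch

def product_rule_fusion(logits_1, logits_2, eps=1e-12):
    """Score-level fusion: multiply per-modality class probabilities
    (optimal under conditional independence) and renormalize."""
    p1 = torch.softmax(logits_1, dim=-1)
    p2 = torch.softmax(logits_2, dim=-1)
    fused = (p1 + eps) * (p2 + eps)
    return fused / fused.sum(dim=-1, keepdim=True)
```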

It can be observed from Table 1 that our method significantly outperforms all the conventional feature and decision fusion methods. The confusion matrix for HAR classification using our SFLR (Sensor Fusion in the Latent Representation space) model is shown in Fig. 3 for the case when only ten labelled examples per class are used at the (classification) fusion stage.

Figure 3

Confusion matrix of Human Activity Recognition (HAR) classification using our SFLR model (with compressed sensing). Ten labelled examples per class are considered.

Figure 4

Illustration of spectrogram recovery (for sitting down activity) using compressed sensing with measurements as low as 784 out of 50,176 (1.56%). No additive white Gaussian noise is considered. The left column shows the true spectrogram sample, the middle column shows reconstruction with an initial guess (no optimization) while the right column shows reconstruction with \(\hat{z}_{MAP}\).

Table 2 Compressed sensing mean reconstruction error over a batch of 50 WiFi spectrogram data samples (No additive Gaussian noise). An illustration is shown in Fig. 4.

Sensor fusion from subsampled observations (WiFi spectrogram data)

Next, we evaluate the recovery performance of the spectrograms under different numbers of compressed sensing measurements. The measurement function \(\chi _m\) is a randomly initialized matrix, and we assume that there is no additive Gaussian noise. The Adam optimizer is used to obtain \(\hat{z}_{MAP}\) with a learning rate of 0.01, and the algorithm is run for 10,000 iterations. After the loss in Eq. (8) has converged during the optimization process, the samples are decoded/recovered for modality 1 and modality 2 using their respective decoders, \(\hat{x}_{m} = \psi _{m}(\hat{z}_{MAP})\). Table 2 shows the compressed sensing results when a batch of 50 images is taken from the testing dataset and evaluated under different numbers of measurements (without noise). It can be observed that the samples can be recovered with very low reconstruction error when the number of measurements is as low as 196 (0.39%). An illustration is also shown in Fig. 4, where very good reconstruction is observed for the case when the number of measurements is equal to 784. Algorithm 1 was run for 10,000 iterations on a single GPU (GeForce RTX 2080 Ti) for a batch of 50 subsampled WiFi spectrograms (number of measurements = 784 (1.56%)), and the optimisation process took on average approximately 48.1 seconds to converge per subsampled WiFi CSI spectrogram sample. More illustrations are shown in Fig. S6 of the Supplementary Information document, with further experimental results in Sects. S4 and S5.
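In terms of the hypothetical fuse_latent sketch given earlier, this recovery experiment corresponds to using random measurement matrices as the samplers and decoding the fused latent estimate; the snippet below is a usage illustration only, with y1, y2, dec1 and dec2 standing for the observed measurements and the trained (frozen) decoders.

```python
import torch

d = 224 * 224                        # flattened spectrogram dimension (50,176)
m = 784                              # number of compressed sensing measurements (1.56%)
A1 = torch.randn(m, d) / m ** 0.5    # random measurement matrix for modality 1
A2 = torch.randn(m, d) / m ** 0.5    # random measurement matrix for modality 2

samplers = [lambda x, A=A1: x.reshape(1, -1) @ A.T,
            lambda x, A=A2: x.reshape(1, -1) @ A.T]

# y1, y2: observed measurements; dec1, dec2: trained decoders psi_1, psi_2.
# z_hat = fuse_latent([y1, y2], [dec1, dec2], samplers, lambdas=[1.0, 1.0])
# x1_hat, x2_hat = dec1(z_hat), dec2(z_hat)   # decoded/recovered modalities
```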

Sensor fusion from asymmetric compressed sensing (WiFi spectrogram data)

In this experiment, we analyse the reconstruction of WiFi spectrogram samples under two different scenarios, demonstrating the benefits of having multiple modalities. We are interested in the recovery of one modality that is subsampled (loss of data) and noisy; we refer to this as the weak data (or weak modality). Using the SFLR method, we leverage the second modality, which suffers no loss of information and no noise (the strong modality), to improve the recovery of the modality of interest, i.e. the weak modality. In the first case, only modality 1 (subsampled and noisy) is considered in the reconstruction process. In the second case, modality 2 (the strong modality) is added to the iterative fusion process to improve the reconstruction quality of modality 1.
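Continuing the same hypothetical sketch, the asymmetric setting amounts to pairing a heavily subsampled (and noisy) weak modality with an identity sampler on the strong modality:

```python
import torch

# Weak modality: a single measurement out of 50,176 pixels, with additive noise
# already present in y_weak. Strong modality: full observation, identity sampler.
A_weak = torch.randn(1, 224 * 224)                  # hypothetical 1-row sampler
samplers = [lambda x: x.reshape(1, -1) @ A_weak.T,  # chi_1: weak modality
            lambda x: x]                            # chi_2: strong modality
# z_hat = fuse_latent([y_weak, y_strong], [dec1, dec2], samplers,
#                     lambdas=[1.0, 1.0])
```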

The results are tabulated in Table 3, where additive Gaussian noise with a standard deviation of 0.1 is considered. The table shows the mean reconstruction errors (over 50 WiFi spectrogram samples) when modality 1 is subsampled to different extents. We see that the reconstruction error has a general tendency to decrease with an increasing number of measurements, and the samples can be recovered with very low reconstruction error when the number of measurements is as low as 196 (0.39%). Furthermore, from Table 3, we observe that when only modality 1 is considered in the reconstruction process, the reconstruction errors are high when the number of measurements is equal to 1 (0.002%) or 10 (0.02%). However, by leveraging the strong modality 2, the reconstruction quality is greatly improved for the same number of measurements, demonstrating the clear benefit of having multiple modalities. An illustration of the reconstruction quality is depicted in Fig. 5, where it can be observed that the unimodal reconstruction of modality 1 is far from the true sample, whereas leveraging the strong modality data improves the reconstruction quality of modality 1.

Table 3 Asymmetric compressed sensing. Mean reconstruction error over 50 WiFi spectrogram data samples. Noise standard deviation: 0.1.
Figure 5

Reconstruction examples showing the benefit of a multimodal system compared to a unimodal system. Modality 1 is subsampled with a single measurement, while modality 2 has full information (no noise and no loss of data). Additive Gaussian noise with a standard deviation of 0.1 is considered in this example: (a) reconstruction with modality 1 only, (b) reconstruction with both modalities 1 and 2. The left column shows the true spectrogram sample, the middle column shows the reconstruction with an initial guess (no optimization), and the right column shows the reconstruction with \(\hat{z}_{MAP}\). Adding modality 2 during the reconstruction stage helps in the sample recovery of modality 1.

Table 4 Few-shot learning sensor fusion classification results (\(F_1\) macro) for synthetic proteins.

Toy protein classification

Similar to the experiments on the OPERAnet22 dataset, we perform two tasks on the synthetic toy protein dataset: classification and sensor fusion from compressed sensing observations. As mentioned previously, the toy protein dataset contains 10 classes. The dataset is split into a training set and a test set, containing 80% and 20% of the samples, respectively. We evaluate the classification performance under a few-shot learning setting, using 1, 5 or 10 labelled samples per class. Few-shot classification via the SFLR model consists of two stages. In the first stage, the M-VAE is trained in an unsupervised manner using the training set. Then, using the maximum a posteriori estimate \(\hat{z}_{MAP}\) and a few labels, a K-NN classifier is applied in the latent representation space. Here the encoder and decoder of the M-VAE are two-layer MLPs with 16 neurons in the hidden layer.

We compare the SFLR method with 4 baseline models. The single-modality model considers only one modality, without sensor fusion. The probability fusion model independently computes the classification probability for each modality and is a representative model for decision fusion (Fig. 1a). The dual-branch feature fusion model concatenates the embeddings of the two modalities before the classification layer, which is a feature fusion method (Fig. 1b). All baseline models are trained in a supervised manner, with the same neural network structure as the encoder. Table 4 shows the \(F_1\)-macro scores of the different methods on the test set. On the 10-class protein dataset, SFLR outperforms the other sensor fusion models using limited labelled samples.

Table 5 Compressed sensing mean reconstruction error over a batch of 100 protein samples.

Sensor fusion from subsampled toy proteins

Another advantage of the proposed SFLR model is that it can fuse modalities in subsampled cases. We use a set of samplers \(\chi _{1:M}\) to simulate the subsampled observations; the measurement function \(\chi _m\) is a randomly initialized matrix. Here we use 10 initialization points to reduce the risk of getting trapped in a local minimum (points sampled from the prior, with Stage 2 replicated for all of them). Figure 2(b) shows the recovered protein from subsampled observations, with only 2 measurements per modality. Both modalities are successfully recovered from the latent representation space, even though the initial guess \(z^0\) is far from the true sample. Note that the proteins in Fig. 2 have a higher dimension than those in the training dataset, showing the robustness of the SFLR method. Table 5 shows the average reconstruction error on the synthetic protein dataset for different subsamplers. The reconstruction error is reduced significantly with 2 measurements per modality, demonstrating superior sensor fusion abilities.

The Supplementary Information document (see Sect. S6) contains additional experiments, including tasks showcasing the ability to leverage one modality against another, where a strong modality is used to aid the recovery of a weak modality. It also presents the performance under subsampled and noisy conditions.

Although the SFLR method achieves superior performance on the HAR problem, it has weaknesses and limitations. The performance of sensor fusion relies heavily on the success of the first stage, which requires high-quality training data. Learning the manifold of p(z) is the key to estimating \(\hat{z}_{MAP}\). In Stage 2, z might fall into a local minimum, which leads to sub-optimal solutions. This is a common issue when using gradient descent in optimization problems, and existing techniques can mitigate it.

Conclusions and broader impacts

The paper presents a new method for sensor fusion. Specifically, we demonstrate its effectiveness on classification and reconstruction tasks from radar signals. The intended application area is human activity recognition, which plays a vital role in the eHealth paradigm. New healthcare technologies are a key ingredient in battling the spiralling costs of provisioning health services that beset the vast majority of countries. Such technologies in a residential setting are seen as a key requirement for empowering patients and imbuing a greater responsibility for their own health outcomes. However, we acknowledge that radar and sensor technologies also find applications in a military context. Modern warfare technologies (principally defensive) could potentially become more capable if they were to benefit from much-improved sensor fusion. We firmly believe that, on balance, it is of benefit to society to continue the research in this area in the public eye.