Multimodal sensor fusion in the latent representation space

A new method for multimodal sensor fusion is introduced. The technique relies on a two-stage process. In the first stage, a multimodal generative model is constructed from unlabelled training data. In the second stage, the generative model serves as a reconstruction prior and the search manifold for the sensor fusion tasks. The method also handles cases where observations are accessed only via subsampling, i.e., compressed sensing. We demonstrate its effectiveness and excellent performance on a range of multimodal fusion experiments, such as multisensory classification, denoising, and recovery from subsampled observations.


Introduction
Controlled hallucination 1 is an evocative term referring to the Bayesian brain hypothesis 2 . It posits that perception is not merely a function of sensory information processing that captures the world as it is. Instead, the brain is a predictive machine: it attempts to infer the causes of its sensory inputs. To achieve this, the brain builds and continually refines its world model. The world model serves as a prior and, when combined with the sensory signals, produces the best guess for their causes. (Uncontrolled) hallucination occurs when the sensory inputs cannot be reconciled with, or contradict, the prior world model. This might occur in our model, and when it does, it manifests itself at the fusion stage as the stochastic gradient descent procedure getting trapped in a local minimum. The method presented in this paper is somewhat inspired by the Bayesian brain hypothesis, but it also builds upon multimodal generative modelling and deep compressed sensing.
Multimodal data fusion attracts academic and industrial interest alike 3 and plays a vital role in several applications. Automated driving is arguably the most challenging industrial domain 4 . Automated vehicles use a plethora of sensors (Lidar, mmWave radar, video, and ultrasonic) and attempt to perform some form of sensor fusion for environmental perception and precise localization. A high-quality final fusion estimate is a prerequisite for safe driving. Amongst other application areas, eHealth and Ambient Assisted Living (AAL) deserve a notable mention. These new paradigms are contingent on gathering information from various sensors around the home to monitor and track the movement signatures of people. The aim is to build a long-term behavioral sensing machine which also affords privacy. Such platforms rely on an array of environmental and wearable sensors, with sensor fusion being one of the key challenges.
In this contribution, we focus on a one-time snapshot problem (i.e., we are not building temporal structures). However, we try to explore the problem of multimodal sensor fusion from a new perspective, essentially, from a Bayesian viewpoint. The concept is depicted in Fig. 1, alongside the two main groups of approaches to sensor fusion. Traditionally, sensor fusion for classification tasks has been performed at the decision level, as in Fig. 1(a). Assuming that conditional independence holds, a pointwise product of the final pmfs (probability mass functions) across all modalities is taken. Feature fusion, as depicted in Fig. 1(b), has become very popular with the advent of deep neural networks 3 , and can produce very good results. Fig. 1(c) shows our technique during the fusion stage (Stage 2). Blue arrows indicate the direction of backpropagation gradient flow during fusion.
Contributions:
• A novel method for multimodal sensor fusion is presented. The method attempts to find the best (maximum a posteriori) estimate for the causes of the observed data. The estimate is then used to perform specific downstream fusion tasks.
• The method can fuse the modalities under lossy data conditions, i.e., when the data is subsampled, lost and/or noisy. Such phenomena occur in real-world situations, such as the wireless transmission of information, or intentional subsampling to expedite the measurement (rapid MRI imaging and radar).
• It can leverage between modalities. A strong modality can be used to aid the recovery of another modality that is lossy or less informative (a weak modality). This is referred to as asymmetric compressed sensing.

Related Work
In this section, we review the state of the art in three areas directly relevant to our contribution: multimodal generative modeling, sensor fusion, and compressed sensing. One of the main aims of Multimodal Variational Autoencoders (MVAEs) is to learn a shared representation across different data types in a fully self-supervised manner, thus avoiding the need to label a huge amount of data, which is time-consuming and expensive 5 . It is indeed a challenge to infer the low-dimensional joint representation from multiple modalities, which can ultimately be used in downstream tasks such as self-supervised clustering or classification. This is because the modalities may vastly differ in characteristics, including dimensionality, data distribution, and sparsity 6 . Recently, several methods have been proposed to combine multimodal data using generative models such as Variational Autoencoders (VAEs) 5,[7][8][9][10][11] . These methods aim to learn a joint distribution in the latent space via inference networks and try to reconstruct modality-specific data, even when one modality is missing. In these works, a modality can refer to natural images, text, captions, labels, or visual and non-visual attributes of a person. JMVAE (Joint Multimodal Variational Autoencoder) 9 makes use of a joint inference network to learn the interaction between two modalities and addresses the issue of missing modalities by training an individual (unimodal) inference network for each modality, as well as a bimodal inference network to learn the joint posterior based on the Product-of-Experts (PoE). It consequently minimizes the distance between the unimodal and multimodal latent distributions. On the other hand, MVAE 7 , which is also based on PoE, considers only a partial combination of observed modalities, thereby reducing the number of parameters and improving the computational efficiency. Reference 8 uses the Mixture-of-Experts (MoE) approach to learn the shared representation across multiple modalities. The latter two models essentially differ in their choices of joint posterior approximation functions. MoPoE-VAE (Mixture-of-Products-of-Experts VAE) 5 aims to combine the advantages of both approaches, MoE and PoE, without incurring significant trade-offs. DMVAE (Disentangled Multimodal VAE) 10 uses a disentangled VAE approach to split up the private and shared (using PoE) latent spaces of multiple modalities, where the latent factors may be of both continuous and discrete nature. CADA-VAE (Cross- and Distribution-Aligned VAE) 11 uses a cross-modal embedding framework to learn a latent representation from image features and classes (labels) using aligned VAEs optimized with cross- and distribution-alignment objectives.
In terms of multimodal/sensor fusion for human activity sensing using Radio-Frequency (RF), inertial and/or vision sensors, most works have considered either decision-level fusion or feature-level fusion. For instance, the work in 12 performs multimodal fusion at the decision level to combine the benefits of WiFi and vision-based sensors, using a hybrid deep neural network (DNN) model to achieve good activity recognition accuracy for 3 activities. The model essentially consists of a WiFi sensing module (a dedicated Convolutional Neural Network (CNN) architecture) and a vision sensing module (based on the Convolutional 3D model) for processing WiFi and video frames for unimodal inference, followed by a multimodal fusion module. Multimodal fusion is performed at the decision level (after both the WiFi and vision modules have made a classification) because this framework is stated to be more flexible and robust to unimodal failure compared to feature-level fusion. Reference 13 presents a method for activity recognition which leverages four sensor modalities, namely, skeleton sequences, inertial and motion capture measurements, and WiFi fingerprints. The fusion of signals is formulated as a matrix concatenation. The individual signals of the different sensor modalities are transformed and represented as an image. The resulting images are then fed to a two-dimensional CNN (EfficientNet B2) for classification. The authors of 14 proposed a multimodal HAR system that leverages WiFi and wearable sensor modalities to jointly infer human activities. They collect Channel State Information (CSI) data from a standard WiFi Network Interface Card (NIC), alongside the user's local body movements via a wearable Inertial Measurement Unit (IMU) consisting of accelerometer, gyroscope, and magnetometer sensors. They compute the time-variant Mean Doppler Shift (MDS) from the processed CSI data and the magnitude from the inertial data for each sensor of the IMU. Then, various time and frequency domain features are separately extracted from the magnitude data and the MDS. The authors apply a feature-level fusion method which sequentially concatenates feature vectors that belong to the same activity sample. Finally, supervised machine learning techniques are used to classify four activities: walking, falling, sitting, and picking up an object from the floor.
Compared to the aforementioned works [12][13][14] , which consider supervised models with feature-level or decision-level fusion, our technique performs multimodal sensor fusion in the latent representation space, leveraging a self-supervised generative model. Our method differs from current multimodal generative models such as those proposed in 5,[7][8][9] in the sense that it can handle cases where observations are accessed only via subsampling (i.e., compressed sensing with significant loss of data and no data imputation). Crucially, our technique attempts to directly compute the MAP (maximum a posteriori) estimate.
The presented method is related to and builds upon Deep Compressed Sensing (DCS) techniques 15,16 . DCS, in turn, is inspired by Compressed Sensing (CS) 17,18 . In CS, we attempt to solve what appears to be an underdetermined linear system, yet the solution is possible with an additional sparsity prior on the signal: minimizing the L0 norm. Since the L0 norm is non-convex, the L1 norm is used instead to provide a convex relaxation, which also promotes sparsity and allows for computationally efficient solvers. DCS, in essence, replaces the L0 prior with a low-dimensional manifold, which is learnable from the data using generative models. Concurrently with DCS, Deep Image Prior 19 was proposed. It used untrained CNNs to solve a range of inverse problems in computer vision (image inpainting, super-resolution, denoising).
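The classical L1 relaxation can be made concrete in a few lines of NumPy. The sketch below (all names, constants, and the solver choice are our own illustration, not taken from the paper) recovers a k-sparse signal from m < n random measurements via ISTA, a standard proximal-gradient solver for the lasso objective min_x 0.5||Ax − y||² + λ||x||₁:

```python
import numpy as np

def ista(A, y, lam=0.01, iters=2000):
    """Iterative Shrinkage-Thresholding for the lasso objective."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L, L = Lipschitz const of gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * A.T @ (A @ x - y)       # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5                           # signal dim, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                 # underdetermined, yet recoverable
x_hat = ista(A, y)
```

With m = 40 measurements of a 5-sparse length-100 signal, the L1 solution closely matches the true signal, which is the phenomenon DCS generalizes by swapping the sparsity prior for a learned generative manifold.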

Methods
Assume a data generative process in which a latent common cause Z gives rise to X m , which in turn produces the observed Y m , i.e., Z → X m → Y m forms a Markov chain. Here, X m is the full data pertaining to the m th modality, m ∈ {1, . . ., M}. Crucially, the modalities collect data simultaneously, "observing" the same scene. As an example, in this work, we consider the different views (obtained via multiple receivers) of the opportunistic CSI WiFi radar as different modalities. The variable Z encodes the semantic content of the scene and is typically of central interest. Furthermore, X m is not accessed directly, but is observed via a subsampled Y m . This is a compressed sensing setup: Y m = χ m (X m ), where χ m is a deterministic and known (typically many-to-one) function. The only condition we impose on χ m is that it be Lipschitz continuous. With the above, conditional independence between the modalities holds (conditioned on Z). Therefore, the joint density factors as

p(z, x 1:M ) = p(z) ∏ m p(x m |z).     (1)

The main task in this context is to produce the best guess for the latent Z, and possibly, to recover the full signal(s) X m , given the subsampled data Y 1:M . We approach the problem in two stages. First, we build a joint model which approximates equation (1), instantiated as a Multimodal Variational Autoencoder (M-VAE). More specifically, the M-VAE will provide an approximation to p φ 1:M ,ψ 1:M (z, x 1:M ), parameterized by deep neural networks {φ 1 , . . ., φ M }, {ψ 1 , . . ., ψ M }, referred to as encoders and decoders, respectively. The trained M-VAE will then be appended with p χ m (y m |x m ) for each modality m, with {χ 1 , . . ., χ M } referred to as samplers. In the second stage, we use the trained M-VAE and χ 1:M to facilitate the fusion and reconstruction tasks. Specifically, our sensor fusion problem amounts to finding the maximum a posteriori (MAP) estimate ẑ MAP of the latent cause for a given (i th ) data point:

ẑ MAP = arg max z p(z | y (i) 1:M ) = arg max z p(z) ∏ m p(y (i) m |z).     (2)

The above MAP estimation problem is hard, and we will resort to the approximations detailed in the sections below.
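The two-stage setup can be illustrated with a deliberately simple toy problem. In the sketch below (the linear decoders, sampler sizes, and all constants are our illustrative stand-ins, not the paper's trained networks), the chain Z → X_m → Y_m is instantiated with linear maps, and the MAP estimate is found by gradient descent on the negative log posterior, which is the role Stage 2 plays:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, M = 4, 30, 2                        # latent dim, per-modality dim, modalities
W = [rng.standard_normal((n, d)) for _ in range(M)]      # stand-in decoders psi_m
S = [np.eye(n)[rng.choice(n, 10, replace=False)] for _ in range(M)]  # samplers chi_m

z_true = rng.standard_normal(d)
y = [S[m] @ (W[m] @ z_true) for m in range(M)]           # subsampled observations

# Negative log posterior (Gaussian prior + Gaussian likelihoods, up to constants):
#   L(z) = lam0*||z||^2 + sum_m lam*||chi_m(psi_m(z)) - y_m||^2
lam0, lam = 1e-3, 1.0
A = [S[m] @ W[m] for m in range(M)]
L = 2 * lam0 + 2 * lam * np.linalg.norm(np.vstack(A), 2) ** 2  # Lipschitz bound
z = np.zeros(d)
for _ in range(3000):                     # plain gradient descent on L(z)
    g = 2 * lam0 * z
    for m in range(M):
        g += 2 * lam * A[m].T @ (A[m] @ z - y[m])
    z -= g / L
```

With 10 measurements per modality and a 4-dimensional latent, the two modalities jointly pin down z, so the recovered latent essentially coincides with z_true.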

Multimodal VAE
The first task is to build a model of equation (1). As aforementioned, this will be accomplished in two steps. First, during the training stage we assume access to the full data X 1:M ; therefore, training an approximation to p φ 1:M ,ψ 1:M (z, x 1:M ) is a feasible task. The marginal data log-likelihood for the multimodal case is bounded as

log p(x 1:M ) ≥ ∑ m E q(z|x 1:M ) [log p ψ m (x m |z)] − D KL (q(z|x 1:M ) || p(z)),     (5)

where D KL is the Kullback-Leibler (KL) divergence. The first summand in equation (5), i.e., the sum over modalities, follows directly from the conditional independence. And since the KL divergence is non-negative, equation (5) represents a lower bound (also known as the Evidence Lower Bound, ELBO) on the log probability of the data (its negative is used as the loss for the M-VAE). There exists a body of work on M-VAEs; the interested reader is referred to 5,[7][8][9] for details and derivations. The key challenge in training M-VAEs is the construction of the variational posterior q(z|x 1:M ). We dedicate a section of the Supplementary Information document S1 to the discussion of the choices and implications for the approximation of the variational posterior. Briefly, we consider two main cases: a missing data case, i.e., where a particular modality's data might be missing (X m = x (i) m = ∅), and the full data case. The latter is straightforward and is tackled by enforcing a particular structure on the encoders. For the former case, a variational Product-of-Experts (PoE) is used:

q(z|x 1:M ) ∝ p(z) ∏ m q φ m (z|x m ).     (6)

Should the data be missing for any particular modality, q φ m (z|x m ) = 1 is assumed. The derivation of equation (6) can be found in the Supplementary Information document S1.
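For diagonal Gaussian experts, the PoE combination in equation (6) has a convenient closed form: precisions add, and the combined mean is a precision-weighted average. A minimal sketch (our own notation; the prior expert is N(0, I), and a missing modality is simply dropped from the lists, which implements the q = 1 convention):

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """Product of diagonal Gaussian experts, including the N(0, I) prior expert."""
    precisions = [np.ones_like(mus[0])] + [np.exp(-lv) for lv in logvars]
    weighted = [np.zeros_like(mus[0])] + [mu * np.exp(-lv)
                                          for mu, lv in zip(mus, logvars)]
    var = 1.0 / np.sum(precisions, axis=0)       # precisions add
    mu = var * np.sum(weighted, axis=0)          # precision-weighted mean
    return mu, var

# two unimodal posteriors over a 2-D latent
mu1, lv1 = np.array([1.0, 0.0]), np.log(np.array([0.5, 1.0]))
mu2, lv2 = np.array([0.0, 2.0]), np.log(np.array([1.0, 0.25]))
mu, var = poe_gaussian([mu1, mu2], [lv1, lv2])
```

Note how a confident expert (small variance) dominates the corresponding latent dimension: the second dimension of the fused mean is pulled strongly toward modality 2's estimate.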

Fusion on the M-VAE prior
Recall the sensor fusion problem as stated in equation (2). The prior p(z) is forced to be an isotropic Gaussian by the M-VAE, and the remaining densities are assumed to be Gaussian. Furthermore, we assume that p(x m |z) = δ (x m − ψ m (z)). Therefore, equation (2) becomes

ẑ MAP = arg max z p(z) ∏ m N (y m ; χ m (ψ m (z)), σ m ² I).     (7)

Hence, the objective to minimize becomes

L(z) = λ 0 ||z||² + ∑ m λ m ||χ m (ψ m (z)) − y m ||².     (8)

Recall that the output of the first stage is p(z) and the decoders p ψ m (x m |z) parametrized by {ψ 1:M }; {λ 0:M } are constants. The MAP estimation procedure consists of backpropagating through the sampler χ m and decoder ψ m using Stochastic Gradient Descent (SGD). In this step, {ψ 1:M } are non-learnable, i.e., jointly with χ m they are fixed non-linear, known (but differentiable) functions.
The iterative fusion procedure is initialized by taking a sample from the prior, z 0 ∼ p(z); {η 0:M } are the learning rates. One or several SGD steps are taken for each modality in turn. The procedure terminates on convergence; see Algorithm 1. In general, the optimization problem as set out in equation (8) is non-convex. Therefore, there are no guarantees of convergence to the optimal point (ẑ MAP ). We deploy several strategies to minimize the risk of getting stuck in a local minimum. We consider multiple initialization points (a number of points sampled from the prior, with Stage 2 replicated for all points). In some cases it might be possible to sample the initial point from the variational posterior q φ (z| x̃ 1:M ) instead. Depending on the modality, this might be possible with data imputation (x̃ m are the imputed data). The final stage depends on the particular task (multisensory classification/reconstruction), but in all cases it takes ẑ MAP as an input. In our experiments, we observe that the success of Stage 2 depends crucially on the quality of the M-VAE.
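The multiple-restart strategy can be sketched as follows. This is a toy stand-in for Algorithm 1 (the small tanh "decoder", the numerical gradient, and all constants are our own assumptions; in practice the trained decoder is used with autodiff): several prior samples are optimized independently, and the one with the lowest loss is kept.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n = 2, 16, 12
W1 = rng.standard_normal((h, d)) / np.sqrt(d)   # fixed "decoder" weights
W2 = rng.standard_normal((n, h)) / np.sqrt(h)   # (stand-ins for a trained psi_m)
chi = np.eye(n)[:5]                             # sampler: keep 5 of 12 coordinates

def decoder(z):
    return W2 @ np.tanh(W1 @ z)                 # non-convex generative map

z_star = rng.standard_normal(d)
y = chi @ decoder(z_star)                       # observed subsampled data

def loss(z):                                    # equation-(8)-style objective
    r = chi @ decoder(z) - y
    return 1e-3 * z @ z + r @ r

def grad(z, eps=1e-5):                          # numerical gradient keeps the
    g = np.zeros_like(z)                        # sketch short; autodiff in practice
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        g[i] = (loss(z + e) - loss(z - e)) / (2 * eps)
    return g

best_z, best_loss = None, np.inf
for _ in range(10):                             # several prior samples z0 ~ p(z)
    z = rng.standard_normal(d)                  # to reduce the local-minimum risk
    for _ in range(2000):
        z -= 0.005 * grad(z)
    if loss(z) < best_loss:
        best_z, best_loss = z.copy(), loss(z)
```

Keeping the best of several descents is exactly the replication of Stage 2 over multiple initialization points described above.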

Experiments
In this work, we investigate the performance of the proposed method on two datasets for multimodal sensor fusion and recovery tasks: i) a synthetic "toy protein" dataset and ii) a passive WiFi radar dataset intended for Human Activity Recognition (HAR).

Passive WiFi radar dataset
We use the OPERAnet 20 dataset, which was collected with the aim of evaluating human activity recognition (HAR) and localization techniques with measurements obtained from synchronized Radio-Frequency (RF) devices and vision-based sensors. The RF sensors captured the changes in the wireless signals while six daily activities were being performed by six participants, namely, sitting down on a chair ("sit"), standing from the chair ("stand"), laying down on the floor ("laydown"), standing from the floor ("standff"), upper body rotation ("bodyrotate"), and walking ("walk"). We convert the raw time-series CSI data from the WiFi sensors into an image-like format, namely spectrograms, using signal processing techniques. More details are available in Section S2 of the Supplementary Information document. 2,906 spectrogram samples (each over a 4 s duration window) were generated for the 6 human activities; 80% of these were used as training data and the remaining 20% as testing data (random train-test split).
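The CSI-to-spectrogram step can be illustrated with a plain windowed FFT. This is a minimal sketch only: the 500 Hz sampling rate, the window and hop sizes, and the 40 Hz test tone are our assumptions for illustration, not OPERAnet processing parameters (see Section S2 for those):

```python
import numpy as np

FS, WIN, HOP = 500, 250, 64        # assumed sampling rate (Hz), window, hop

def spectrogram(x):
    """Magnitude STFT with a Hann window: rows are frequency bins, columns frames."""
    w = np.hanning(WIN)
    frames = [x[i:i + WIN] * w for i in range(0, len(x) - WIN + 1, HOP)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

t = np.arange(4 * FS) / FS         # a 4 s window, as used for the dataset samples
x = np.sin(2 * np.pi * 40 * t)     # stand-in for a Doppler component at 40 Hz
S = spectrogram(x)
peak_hz = S.mean(axis=1).argmax() * FS / WIN   # dominant frequency in the image
```

The resulting 2-D magnitude array is the "image-like format" that is then reshaped and fed to the image encoders.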

Results and Discussion
Classification results of WiFi CSI spectrograms for HAR. In this section, we evaluate the HAR sensor fusion classification performance under a few-shot learning scenario, with 1, 5 and 10 labelled examples per class. These correspond to 0.05%, 0.26% and 0.51% of the labelled training samples, respectively. We randomly select 80% of the samples in the dataset as the training set, and the remaining 20% are used for validation. The average F 1 -macro scores for the HAR performance are shown in Table 1 for the different models. To allow for a fair comparison, the same random seed was used in all experiments, with only two modalities (processed spectrogram data obtained from two different receivers).
Prior to training our model (see Supplementary Fig. S1), the spectrograms were reshaped to typical image dimensions of size (1 × 224 × 224). Our model was trained for 1,000 epochs using the training data with a fixed KL scaling factor of β = 0.02. The encoders comprised the ResNet-18 backbone, with the last fully-connected layer having a dimension of 512. For the decoders, corresponding CNN deconvolutional layers were used to reconstruct the spectrograms of each modality with the same input dimension. The latent dimension, batch size, and learning rate were set to 64, 64, and 0.001, respectively. In the second stage, the generative model serves as a reconstruction prior and the search manifold for the sensor fusion tasks. Essentially, in this stage, we obtain the maximum a posteriori estimate ẑ MAP through the process described in Algorithm 1. The final estimate of the class is produced by K-NN in the latent representation space, with labelled examples sampled from the training set.
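The K-NN step in the latent space is straightforward; the sketch below shows it on synthetic latent codes (the two well-separated Gaussian clusters stand in for the ẑ_MAP estimates of two activity classes; cluster positions, counts, and k are our illustrative choices):

```python
import numpy as np

def knn_predict(z_query, z_labeled, labels, k=3):
    """Classify latent codes by majority vote over the k nearest labelled codes."""
    d2 = ((z_labeled[None, :, :] - z_query[:, None, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of k nearest neighbours
    return np.array([np.bincount(v).argmax() for v in labels[nn]])

rng = np.random.default_rng(0)
z0 = rng.normal(-2.0, 0.3, size=(20, 2))        # stand-in latents, class 0
z1 = rng.normal(+2.0, 0.3, size=(20, 2))        # stand-in latents, class 1
z_all = np.vstack([z0, z1])
y_all = np.array([0] * 20 + [1] * 20)

idx = np.concatenate([np.arange(5), 20 + np.arange(5)])  # 5 labels per class
pred = knn_predict(z_all, z_all[idx], y_all[idx], k=3)
acc = (pred == y_all).mean()
```

Because the generative model clusters semantically similar samples, a handful of labelled latents per class suffices, which is what makes the few-shot setting viable.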
To benchmark our technique, we investigate the performance of other state-of-the-art sensor fusion techniques. Feature fusion is represented by CNN models (1-channel CNN, 2-channel CNN, dual-branch CNN). All are trained in a conventional supervised fashion from scratch using the ResNet-18 backbone, with a linear classification head appended on top, consisting of a hidden linear layer of 128 units and a linear output layer of 6 nodes (for classifying the 6 human activities). The dual-branch CNN refers to the case where the embeddings from the two modalities' CNNs are concatenated, and a classification head is then added (as illustrated in Fig. 1(b)). The "Probability Fusion" (decision fusion) model refers to a score-level fusion method where the classification probabilities (P 1 and P 2 ) of each modality are computed independently (using an output SoftMax layer) and then combined using the product rule (this is optimal given conditional independence). These models are fine-tuned with labelled samples over 200 epochs with a batch size of 64; the Adam optimizer was used with a learning rate of 0.0001, weight decay of 0.001, and β 1 = 0.95, β 2 = 0.999.
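The product-rule baseline is a one-liner: multiply the per-modality class pmfs pointwise and renormalise. In probability space this is identical to summing the per-modality logits before a single softmax (the logits below are made-up illustrative values):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def probability_fusion(pmfs):
    """Pointwise product of per-modality class pmfs, renormalised.
    Optimal combination when modalities are conditionally independent given class."""
    p = np.prod(pmfs, axis=0)
    return p / p.sum()

l1 = np.array([3.0, 1.0, 0.0])     # modality 1 strongly favours class 0
l2 = np.array([0.5, 1.5, 0.0])     # modality 2 mildly favours class 1
fused = probability_fusion([softmax(l1), softmax(l2)])
```

Note how the confident modality dominates the fused decision, a behaviour the product rule shares with the PoE posterior used in the latent space.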
It can be observed from Table 1 that our method significantly outperforms all the other conventional feature and decision fusion methods. The confusion matrix for HAR classification using our SFLR (Sensor Fusion in the Latent Representation space) model is shown in Fig. S8 of the Supplementary Information document for the case when only ten labelled examples are used at the (classification) fusion stage.

Sensor fusion from subsampled observations
Next, we evaluate the recovery performance for spectrograms under different numbers of compressed sensing measurements. The measurement function χ m is a randomly initialized matrix, and we assume that there is no additive Gaussian noise. The Adam optimizer is used to optimize ẑ MAP with a learning rate of 0.01. The algorithm is run for 10,000 iterations. After the loss in equation (8) has converged during the optimization process, the samples are decoded/recovered for modality 1 and modality 2 using their respective decoders, x̂ m = ψ m (ẑ MAP ). Table 2 shows the compressed sensing results when a batch of 50 images is taken from the testing dataset and evaluated under different numbers of measurements (without noise). It can be observed that the samples can be recovered with very low reconstruction error when the number of measurements is as low as 196 (0.39%). An illustration is also shown in Fig. 3, where very good reconstruction is observed for the case when the number of measurements is equal to 784. More illustrations are shown in Fig. S7 of the Supplementary Information document, with further experimental results in Sections S4, S5 and S6.
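Forming such a random measurement operator at the reported 196/50,176 undersampling ratio looks as follows (a minimal sketch: the Gaussian entries and 1/√m scaling are a common compressed sensing choice and our assumption; the spectrogram here is random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 224, 224
n = h * w                          # 50,176 pixels per reshaped spectrogram
m = 196                            # number of measurements (0.39% of n)

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix chi_m
x = rng.random((h, w))             # stand-in spectrogram sample
y = Phi @ x.ravel()                # the only data the fusion stage ever sees
```

Recovery then never touches x directly: the optimizer only compares Phi applied to decoded candidates against the 196-dimensional y.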

Toy protein classification
Similarly to the experiments on the OPERAnet dataset, we perform two tasks, classification and sensor fusion from compressed sensing observations, on the synthetic toy protein dataset.
As mentioned previously, the toy protein dataset contains 10 classes. The dataset is split into a training set and a test set, containing 80% and 20% of the samples, respectively. We evaluate the classification performance under a few-shot learning setting, using 1, 5 or 10 labelled samples per class. Few-shot classification via the SFLR model consists of two stages. In the first stage, the M-VAE is trained in an unsupervised manner using the training set. Then, using the maximum a posteriori estimate ẑ MAP and a few labels, a K-NN classifier is applied in the latent representation space. Here, the encoders and decoders of the M-VAE are two-layer MLPs with 16 neurons in the hidden layer.
We compare the SFLR method with 4 baseline models. The single-modality model considers only one modality, without sensor fusion. The probability fusion model independently computes the classification probability for each modality, a representative of decision fusion (Fig. 1(a)). The dual-branch feature fusion model concatenates the embeddings of the two modalities before the classification layer, a feature fusion method (Fig. 1(b)). All baseline models are trained in a supervised manner, with the same neural network structure as the encoder. Table 3 shows the F 1 -macro scores of the different methods on the test set. On the 10-class protein dataset, SFLR outperforms the other sensor fusion models using limited labelled samples.

Sensor fusion from subsampled toy proteins
Another advantage of the proposed SFLR model is that it can fuse modalities in subsampled cases. We use a set of samplers χ 1:M to simulate the subsampled observations. The measurement function χ m is a randomly initialized matrix. Here, we use 10 initialization points to reduce the risk of getting trapped in a local minimum (points sampled from the prior, with Stage 2 replicated for all of them). Fig. 2(b) shows the recovered protein from subsampled observations, with only 2 measurements for each modality. Both modalities are successfully recovered from the latent representation space, even though the initial guess z 0 is far from the true value. Note that the proteins in Fig. 2 have a higher dimension than those in the dataset, showing the robustness of the SFLR method. Table 4 shows the average reconstruction error on the synthetic protein dataset using different subsamplers. The reconstruction error is reduced significantly when 2 measurements are available for each modality, showing superior sensor fusion abilities.
The Supplementary Information document (see Section S7) contains additional experiments, including tasks showcasing the ability to leverage between modalities, where a strong modality can be used to aid the recovery of a weak modality.It also presents the performance under subsampled and noisy conditions.

Conclusions and Broader Impacts
The paper presents a new method for sensor fusion. Specifically, we demonstrate its effectiveness on classification and reconstruction tasks from radar signals. The intended application area is human activity recognition, which serves a vital role in the E-Health paradigm. New healthcare technologies are a key ingredient in battling the spiralling costs of provisioning health services that beset a vast majority of countries. Such technologies in a residential setting are seen as a key requirement for empowering patients and imbuing a greater responsibility for one's own health outcomes. However, we acknowledge that radar and sensor technologies also find applications in a military context. Modern warfare technologies (principally defensive) could potentially become more capable if they were to benefit from much-improved sensor fusion. We firmly believe that, on balance, it is of benefit to society to continue the research in this area in the public eye.

Reconstruction of subsampled and noisy WiFi spectrograms
In this experiment, we analyse the reconstruction of the WiFi spectrogram samples under two different scenarios, where we want to demonstrate the benefits of having multiple modalities. We are interested in the recovery of one modality that is subsampled (loss of data) and noisy. This can be referred to as the weak data (or weak modality). Using the SFLR method, we leverage the second modality's data, which has no loss of information and does not suffer from noise (the strong modality), to improve the recovery of the modality of interest, i.e., the weak modality. In the first case, only modality 1 (subsampled and noisy) is considered in the reconstruction process. In the second case, the good modality 2 is added to the iterative fusion process to improve the reconstruction quality of modality 1.
The results are tabulated in Table S3, where additive Gaussian noise with a standard deviation of 0.1 is considered. The results show the mean reconstruction errors (over 50 WiFi spectrogram samples) when modality 1 is subsampled to different extents. We see that the reconstruction error has a general tendency to decrease with an increasing number of measurements. It can be observed that the samples can be recovered with very low reconstruction error when the number of measurements is as low as 196 (0.39%). Furthermore, from Table S3, we observe that when only modality 1 is considered in the reconstruction process, the reconstruction errors are high when the number of measurements is equal to 1 (0.002%) or 10 (0.02%). However, by leveraging the good modality 2, the reconstruction quality is greatly improved for the same numbers of measurements, demonstrating the clear benefit of having multiple modalities. An illustration of the reconstruction quality is depicted in Fig. S6, where it can be observed that the unimodal reconstruction of modality 1 is far from the true sample. On the other hand, the reconstruction quality of modality 1 is improved by leveraging the good modality's data.
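The weak/strong asymmetry can be reproduced on a toy linear instance of the model (the linear decoders, the 2-of-32 sampler, the 0.1 noise level, and the gradient-descent solver are our illustrative stand-ins): the same latent is estimated once from the weak modality alone and once with the strong modality added to the objective.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 32
W1 = rng.standard_normal((n, d))              # stand-in linear decoder, modality 1
W2 = rng.standard_normal((n, d))              # stand-in linear decoder, modality 2
S1 = np.eye(n)[:2]                            # weak modality: 2 of 32 entries seen

z_true = rng.standard_normal(d)
x1, x2 = W1 @ z_true, W2 @ z_true
y1 = S1 @ x1 + 0.1 * rng.standard_normal(2)   # subsampled and noisy observation

def map_z(terms, lam0=1e-3, iters=4000):
    """Gradient descent on lam0*||z||^2 + sum ||A z - t||^2 over (A, t) pairs."""
    L = 2 * (lam0 + np.linalg.norm(np.vstack([A for A, _ in terms]), 2) ** 2)
    z = np.zeros(d)
    for _ in range(iters):
        g = 2 * lam0 * z
        for A, t in terms:
            g += 2 * A.T @ (A @ z - t)
        z -= (0.9 / L) * g
    return z

z_weak = map_z([(S1 @ W1, y1)])                    # weak modality on its own
z_both = map_z([(S1 @ W1, y1), (W2, x2)])          # aided by the strong modality
err_weak = np.linalg.norm(W1 @ z_weak - x1)        # reconstruction error, modality 1
err_both = np.linalg.norm(W1 @ z_both - x1)
```

With only 2 of 32 entries observed, the latent is badly under-determined on its own; adding the fully observed modality 2 pins down the shared cause and sharply reduces modality 1's reconstruction error, mirroring the Table S3 behaviour.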

Sensor fusion from subsampled and noisy toy proteins
In this section, we present the sensor fusion results for toy protein reconstruction under subsampled and noisy observations, as an extension to the Section "Sensor fusion from subsampled toy proteins" in the main document. Table S4 shows the mean reconstruction error of subsampled toy protein samples with different levels of additive Gaussian noise. The proposed SFLR method recovers both modalities from as few as 4 subsampled and noisy observations.

Sensor fusion from asymmetric Compressed Sensing of toy proteins
We show the results of sensor fusion from asymmetric compressed sensing, regarding the third contribution of this paper. We claim that a strong modality can be used to aid the recovery of another modality that is lossy or less informative (weak modality). Table S5 shows the recovery results in two cases. In the first case, the subsampled modality 1 with additive Gaussian noise is observed and recovered. In the second case, the noise-free modality 2 with full observation is used to help the sensor fusion. We can see that modality 2 significantly helps with the recovery of modality 1, especially when the number of observations is relatively small.

Figure 1 .
Figure 1. Multimodal Sensor Fusion: (a) decision fusion, (b) feature fusion, (c) our technique: fusion in the latent representation space with optional compressed sensing measurements; F: features, p(z): prior model, G: generators, X: complete data, Y: subsampled data. For clarity, M = 2 modalities are shown; the concept generalises to any M.

Figure 3 .
Figure 3. Illustration of spectrogram recovery (for the sitting down activity) using compressed sensing with as few as 784 out of 50,176 measurements (1.56%). No additive white Gaussian noise is considered. The left column shows the true spectrogram sample, the middle column shows the reconstruction with an initial guess (no optimization), and the right column shows the reconstruction with ẑ MAP .

Figure 2 .
Figure 2. (a) Generated toy protein examples (N = 64) and (b) reconstruction from compressed sensing observations. With 2 out of 64 measurements (3.125%), near-perfect reconstruction is possible even though the modalities are individually subsampled.

Table 1 .
Few-shot learning sensor fusion classification results (F 1 macro) for Human Activity Recognition.

Table 2 .
Compressed sensing mean reconstruction error over a batch of 50 WiFi spectrogram data samples (no additive Gaussian noise). An illustration is shown in Fig. 3.

Table 3 .
Few-shot learning sensor fusion classification results (F 1 macro) for synthetic proteins.

Table 4 .
Compressed sensing mean reconstruction error over a batch of 100 protein samples.

Table S2 .
Missing pixel mean reconstruction error over a batch of 50 WiFi spectrogram data samples.Illustrations of spectrogram fusion under different missing pixel ratios are shown in Fig. S5.

Table S3 .
Mean reconstruction error over 50 WiFi spectrogram data samples.Noise standard deviation: 0.1

Table S5 .
Mean reconstruction error over 100 samples with asymmetric compressed sensing.Noise standard deviation: 0.1.
(See Table 1 in the main manuscript for classification results in terms of the macro F 1 score.)