Development of the Senseiver for efficient field reconstruction from sparse observations

The reconstruction of complex time-evolving fields from sensor observations is a grand challenge. Frequently, sensors have extremely sparse coverage and low-resource computing capacity for measuring highly nonlinear phenomena. While numerical simulations can model some of these phenomena using partial differential equations, the reconstruction problem is ill-posed. Data-driven strategies provide crucial disambiguation, but these suffer in cases with small amounts of data, and struggle to handle large domains. Here we present the Senseiver, an attention-based framework that excels in reconstructing complex spatial fields from few observations with low overhead. The Senseiver reconstructs n-dimensional fields by encoding arbitrarily sized sparse sets of inputs into a latent space using cross-attention, producing uniform-sized outputs regardless of the number of observations. This allows efficient inference by decoding only a sparse set of output observations, while a dense set of observations is needed to train. This framework enables training on data with complex boundary conditions and on extremely large fine-scale simulations. We build on the Perceiver IO by enabling training models with fewer parameters, which facilitates field deployment, and a training framework that allows a flexible number of sensors as input, which is critical for real-world applications. We show that the Senseiver advances the state of the art of field reconstruction in many applications.

The reconstruction of dynamic, spatial fields from sparse sensor data is an important challenge in various fields of science and technology. Santos et al. introduce the Senseiver, a deep learning framework that reconstructs spatial fields from few observations using attention layers to encode and decode sparse data, enabling efficient inference.

The goal of sparse data reconstruction is to take a few sensor values from a space that we cannot fully observe, and use them to reconstruct the global field. Reconstructing spatial fields from sensor data has been a grand challenge in a wide range of applications in industry, medicine and science [1][2][3]. Some examples include laboratory experiments 4, monitoring industrial plants 5, precision agriculture 6, design of limb orthoses and prostheses 7, structural health monitoring in aircraft 8 and civil infrastructure 9, subsurface sensing 10, tectonic motion estimation 11, weather and climate monitoring 12, and identification of abandoned wells 13, among others.
Common features of these applications are low spatial sensor coverage (typically less than 1%), three-dimensional (3D) domains, noisy data, nonlinear dynamical phenomena and a scarcity of available processing power within the sensor. These sensing assets are often deployed in areas with scarce or non-existent network connectivity, necessitating edge computing. For this reason, low-resource, field-ready devices such as drones require machine learning approaches that are computationally efficient and accurate. While some applications can be fully described by physics-based partial differential equations (PDEs), incorporating field observations (for example, sensor measurements) back to the PDEs is challenging. A variety of techniques have been developed using PDE-based 14,15 and statistical [16][17][18] approaches. Still, widespread success has been elusive due to a lack of a generic framework to incorporate measured data at arbitrary times and locations. As a result, machine learning models have become an attractive alternative [16][17][18] as these have the capacity to learn complex relationships from data in a great many problems. Some notable examples include optimization of sensor placement [19][20][21]. Machine learning models have the potential to be successful even when the governing PDEs of the system are not available. Still, real-world sensors often have physically limited sensor positions (for example, floating buoys in the ocean 22, Earth-based asteroid sensing 23, the inlet/outlet for a laboratory experiment 24), and covering a field exhaustively can be prohibitively expensive, if not impossible. Furthermore, low-power sensors are often preferred when operating off the grid (for example, mobile sensing with drones), and so computational resources are precious.
With recent advances in machine learning algorithms, complex reconstruction problems have become more tractable 25,26. Although fully connected networks could be used for this task, their rigid architecture has difficulty handling sensors that move around the field and/or go on and off with time. Another key challenge for reconstructing fields from sparse data is that the field itself does not always lie on a Cartesian grid, so methods that assume structured data, such as convolutional neural networks, must impose an artificial structure on the field values. Conversely, methods that overcome this problem by explicitly including the geometry of the data, such as graph neural networks 27, require a choice of a graph topology, which may be problem and/or dataset specific. This aspect of the problem adds decisions and hyperparameters when applying a graph network method to new problems. Data-driven methods generally require a dense set of observations at training time and a sparse set of observations at inference time. Techniques such as the popular physics-informed neural networks 28,29 become prohibitively expensive for even small, canonical two-dimensional (2D) datasets 30 and success has therefore been limited in leveraging them for sparse sensing 9.
A few notable works stand out. Manohar et al. 19 describe methods where the low-rank structure of the dataset can be exploited if it exists. They specifically target cases where the number of sensors is greater than the number of modes used in reconstruction, by introducing the idea of QR pivoting. They employ this mechanism to leverage known patterns in the data to optimize the sensor positioning. However, for chaotic, multiscale systems such as 3D turbulence, the number of modes for effective reconstruction is often quite high, and there could be fewer sensors in practice. In another recent approach, Güemes 31 proposed constructing low-resolution fields using moving averages from accessible sparse sensor data that then are mapped to higher resolutions using a learned mapping. In a similar vein, Fukami et al. 32 introduced a method based on Voronoi tessellation of observations onto the prediction domain, followed by a refinement using a convolutional network. Their approach has the attractive upsides of allowing arbitrary sensor placement within a 2D mesh, and allowing inference using sensor locations that differ from the ones used during training. Nevertheless, this approach inherits the hurdles of deep convolutional networks, such as the assumption of a regular grid and high memory costs for 3D domains 33, which are features prevalent in real-world problems. Extending convolutional neural network approaches to 3D or large 2D datasets would be very computationally expensive, and may require considerable efforts in engineering implementation. This is a common denominator for methods currently in the literature: they struggle to work in chaotic, high-dimensional datasets without demanding excessive compute resources. As a result, there is a major gap in the literature between proposed models and the practical application to real-world data.
Attention mechanisms 34 have greatly improved the baselines over other architectures for a variety of problems [35][36][37][38], and recently, the Perceiver IO framework 39 overcame a crucial computational bottleneck using cross-attention with latent arrays, thereby constraining the bulk of the network activations to a fixed-sized space regardless of input size. While this was viewed as a way to handle a large set of inputs (for example, every pixel in an image), we exploit the fact that it also allows us to scale down the quantity of information fed into a network, resulting in a framework we call the Senseiver. For our applications, we need to make near real-time decisions on drones or other field-deployed instruments with limited processing power, and current machine learning architectures are too computationally expensive. By incorporating sparsity and refining the network architecture to drastically reduce the number of parameters needed, the Senseiver is able to accurately learn and process an entire field with a far smaller amount of resources. Crucially, sparse processing can treat spatial data that do not live on a fixed, regular, Cartesian-type mesh. Our approach also overcomes a limitation of earlier graph element networks 27, which produce bad estimates when a denser set of sensors is used at inference time.
In this Article, we demonstrate examples of these advantages on several datasets, and we compare the Senseiver to the recent successes of ref. 32, showing remarkable improvements in accuracy, scalability and efficiency in the limit of low sensor coverage. Beyond these improvements, we discuss additional benefits of the sparse processing model of the Senseiver, such as prediction of partial information, reduced memory requirements, and faster performance during training and inference. We demonstrate this on large 3D cases, where even convolutional networks are not practical. The model is also computationally efficient by not processing invalid locations in irregular geometries where there are regions without data, such as in solid boundaries in porous media and continents; sparse processing allows for training on domains of arbitrary size and structure.

Article https://doi.org/10.1038/s42256-023-00746-x

Overview
The first aim of the Senseiver is to learn a compact representation of the state of a system from a small number of sensor observations at a given time. This encoded representation can then be used to decode the state of the full system from sensor data. The input to our model is a set of $N_s$ sensor observations $\mathbf{s}_i$ taken at time $t$, $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{N_s}\}_t$, with $\mathbf{s}_i \in \mathbb{R}^{N_I}$, where $N_I$ corresponds to the number of channels recorded by the sensors (for example, 1 for temperature, 3 for a velocity vector in 3D). The system has a domain $\Omega$ where a set of sensor locations $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{N_s}\}$, $\mathbf{x}_i \in \mathbb{R}^{N_D}$, are extracted. Throughout the paper, we use bold lowercase letters to denote vectors, bold uppercase letters to denote matrices, italic uppercase letters to represent functions, and italic lowercase letters to denote scalars.
The Senseiver workflow, shown in Fig. 1a, has three main components. (1) A spatial encoder $P_E$ that maps a spatial coordinate $\mathbf{x}_s$ to an array of spatial encodings $\mathbf{a}$, which allows us to encode a precise n-dimensional spatial position as a vector. (2) An attention-based encoder $E$ that maps the spatial encodings of the sensor positions $\mathbf{a}_i$ and their values $\mathbf{s}_i$ to a latent matrix $\mathbf{Z}$, which is a compressed representation of the system at time $t$. (3) An attention-based decoder $D$, which outputs the reconstructed field value at arbitrary query positions $\mathbf{x}_q$ that are also represented using the spatial encoder. As equations, this process is given by:

$$\mathbf{a}_s = P_E(\mathbf{x}_s), \qquad \mathbf{a}_q = P_E(\mathbf{x}_q), \qquad \mathbf{Z} = E(\mathbf{s}_s, \mathbf{a}_s), \qquad \hat{\mathbf{s}}(\mathbf{x}_q) = D(\mathbf{Z}, \mathbf{a}_q) \tag{1}$$

where the subscripts $s$ and $q$ stand for the sensor and query positions. These items are explained in detail in the Methods.
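As a rough illustration of the spatial encoder $P_E$, the sketch below maps a coordinate to sine-cosine features of the kind described later under Spatial encodings. The function name `spatial_encoding` and the particular frequencies are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def spatial_encoding(x, freqs):
    """Encode a position x (length N_D) as a flat list of 2 * N_D * N_f values.

    For each dimension d, the first N_f entries are sin(pi * f_k * x_d)
    and the next N_f entries are cos(pi * f_k * x_d).
    """
    a = []
    for x_d in x:
        a.extend(math.sin(math.pi * f * x_d) for f in freqs)
        a.extend(math.cos(math.pi * f * x_d) for f in freqs)
    return a

# Hypothetical settings: a 2D position and N_f = 4 frequencies.
freqs = [1.0, 2.0, 4.0, 8.0]
a = spatial_encoding([0.25, 0.5], freqs)
print(len(a))  # 2 * N_D * N_f = 16
```

The encoding has no trainable parameters, so the same function can be applied to sensor positions and to query positions on any mesh.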
The Senseiver is well suited to the demands of many real-world applications (some of which are shown in Fig. 1b). First, it is agnostic to spatial dimensionality, as it can work in one, two or three dimensions, with no code or hyperparameter modifications. Second, we can train our model without specialized feature engineering, that is, no problem-specific processing of input features or labels is required, as the direct sensor observations can be used as input to the network. Third, our model is resolution/grid agnostic, as it works with arbitrary, continuous sensor locations. This addresses a key limitation in real-world problems, where we can place sensors only in locations such as the boundaries and edges, making the reconstruction problem harder. In addition, our model can treat not only sparse input but also sparse output, which allows training on large 3D fields using less memory by stochastically subsampling the output space. These features show promise in scaling our approach to datasets across a variety of scientific domains with arbitrary size, meshing and geometry.

Brief review of attention-based models
Attention-based neural networks process information by re-weighting a set of inputs, the sequence, such that each sequence element is weighted to take larger contributions from certain elements in the sequence (to 'attend'), while effectively ignoring others 34. It is important to note that the term sequence need not represent a one-dimensional array of variables; in fact, attention layers can have arbitrary connections between the elements composing the sequence. Because of this fact, attention layers can model the interactions of n sensors among themselves and additional information using a pair-wise interaction modelling approach 40. In practice, this is not memory efficient because it corresponds to a complete graph of connections between all sequence elements. This is especially relevant for applications that involve learning from large inputs (for instance, images or videos), as the memory requirements for complete pair-wise self-attention scale quadratically with input size.
However, attention mechanisms are very flexible, and this can be exploited. The Perceiver IO 41 presented an alternative to avoid the quadratic bottleneck by utilizing a latent sequence array ($\mathbf{Q}_{in} \in \mathbb{R}^{N_{Q_{in}} \times N_I}$) to process arbitrarily sized inputs into a compact latent sequence representation of the same size as the latent array. The same principle is employed at the output to project the latent sequence into the desired output shape and dimensionality. Thus the model can use arbitrarily shaped inputs and outputs, while the bulk of latent computations are of fixed shape and computational cost (Fig. 2). This approach allows for non-local processing across input sequences and makes no assumptions about the structure of the data. This allowed the Perceiver IO to have a core network that is domain agnostic, so that different input/output branches can be used to process qualitatively different data streams, such as text, audio or image. For the Senseiver, it allows the treatment of partial information and frees the network from any assumption about the input domain geometry, unlike, for example, convolutional networks.
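To make the scaling argument concrete, the following sketch counts query-key pairs for full pair-wise self-attention and for cross-attention against a fixed-length latent array. The function names and the latent length of 256 are hypothetical; the counts stand in for memory cost and ignore constant factors such as the number of heads.

```python
def self_attention_pairs(n_inputs):
    """Number of query-key pairs in full pair-wise self-attention."""
    return n_inputs * n_inputs

def latent_cross_attention_pairs(n_inputs, n_latent):
    """Number of query-key pairs when n_latent latent queries attend to the inputs."""
    return n_latent * n_inputs

n_latent = 256  # hypothetical latent array length
for n_inputs in (1_000, 10_000, 100_000):
    full = self_attention_pairs(n_inputs)
    latent = latent_cross_attention_pairs(n_inputs, n_latent)
    # The last column is the saving factor, which grows linearly with input size.
    print(n_inputs, full, latent, full // latent)
```

With a latent array the cost grows linearly in the number of inputs, which is why the same mechanism handles both very large and very small input sets.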

Training procedure
During training, a batch of data is used to optimize the weights of the model by minimizing the mean square error:

$$\mathcal{L} = \frac{1}{N_q} \sum_{j=1}^{N_q} \left\| \mathbf{s}(\mathbf{x}_q^j, t) - D\!\left(E\!\left(\mathbf{s}(\mathbf{x}_s, t), P_E(\mathbf{x}_s)\right), P_E(\mathbf{x}_q^j)\right) \right\|_2^2$$

where $t$ is time, and $\mathbf{x}_s$ and $\mathbf{x}_q^j$ are the spatial coordinates of the sensors and the queries, respectively. A batch of data is composed of a selection of $\mathbf{x}_q^j$ in the training dataset. In our experiments, we used batches of randomly selected points throughout the training dataset, first sampling a time frame and then sampling query points within that time frame. Shuffled data samples proved much more effective than ordered points (axis or patches) for training performance. Note that dense observations are required to train the model, while at inference time only sparse observations are necessary. We trained the model until the minimum loss value over the training period did not change for 100 consecutive epochs. As this is a nascent class of problems, the mean L2 error norm:

$$\epsilon = \frac{\left\| \mathbf{s} - \hat{\mathbf{s}} \right\|_2}{\left\| \mathbf{s} \right\|_2}$$

across all the time steps of the test set is reported for comparison with recent work 32. Here, $\hat{\mathbf{s}}$ is the prediction of the model.
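The sampling scheme described above (pick a time frame, then pick query points within it) can be sketched as follows. This is a minimal sketch, not the authors' training code: the helper names, the toy field and its sizes are hypothetical, and the relative form of the L2 error norm is an assumption chosen to match the dimensionless values of ϵ reported in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(field, n_queries, rng):
    """Pick one time frame, then random query points inside that frame.

    field has shape (n_frames, n_points); returns the frame index,
    the flat indices of the query points, and the target values.
    """
    t = rng.integers(field.shape[0])
    idx = rng.choice(field.shape[1], size=n_queries, replace=False)
    return t, idx, field[t, idx]

def mse(target, prediction):
    """Mean square error used as the training loss."""
    return float(np.mean((target - prediction) ** 2))

def relative_l2_error(target, prediction):
    """Relative L2 error norm (assumed form): ||s - s_hat||_2 / ||s||_2."""
    return float(np.linalg.norm(target - prediction) / np.linalg.norm(target))

# Toy field standing in for a dense training dataset (hypothetical sizes).
field = rng.normal(size=(50, 192 * 112))
t, idx, target = sample_batch(field, n_queries=1024, rng=rng)
prediction = target + 0.1 * rng.normal(size=target.shape)  # stand-in for model output
print(mse(target, prediction), relative_l2_error(target, prediction))
```

Because each batch touches only `n_queries` points, memory use during training is set by the batch size rather than by the size of the field.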

Cyclic and quasi-cyclic dynamical phenomena
The first dataset considered is a simulation of a 2D unsteady flow past a cylindrical obstacle 42. This results in a von Kármán vortex street, that is, an alternating shedding of left- and right-handed vortices in the flow field behind the cylinder. The dataset was created utilizing a numerical simulation that solves the incompressible Navier-Stokes equation at a Reynolds number of 100. The simulation is periodic in the x and y coordinates and its computational size is a 192 × 112 grid with 5,000 time frames, which approximately spans four vortex shedding periods. The domain spans 10 cylinder diameters vertically and 15.7 cylinder diameters horizontally. The recorded output is the vorticity field of the fluid, which is impacted by having a solid obstacle. We trained our model to reconstruct the simulation based on three different sensor location configurations as inputs. During training, each batch observes the field at only a sparse set of locations, but collectively the batches sample the whole training field. The first configuration uses eight sensor locations as proposed by ref. 32. The second configuration uses only the first four sensors to show the impact of having less input information. The third proposed configuration sets the sensor locations at the inlet/outlet boundaries, motivated by sensing limitations of lab-on-a-chip experiments. In this case, for the approach proposed by ref. 32, the Voronoi-tessellated image would contain almost no information, as the values at the boundaries are very close to zero at every time step. Even when the variance of the values is very small, the attention layers are able to construct a robust representation. All these configurations are shown in Fig. 3. We trained our model using only 50 frames (1% of the dataset) and the remaining frames are used as the test set. The results are shown in the bar plot of Fig. 3. In comparison with the VoronoiCNN (the model proposed in ref. 32) trained on the same amount of data and with the same number of sensors, our model achieves approximately ten times lower error. Even with just four sensors at the boundary, the Senseiver is able to reconstruct the entire simulation faithfully with a negligible drop in accuracy compared with the eight-sensor configuration. Higher errors are present where the vorticity magnitude changes the most, partly due to the fact that these locations are far away from the sensors. It is worth noting that the model in ref. 32 has over 682,000 trainable parameters, while ours is composed of fewer than 5% of that number. A complete hyperparameter analysis for this problem can be found in the Supplementary Information.
The second dataset that we considered is the National Oceanic and Atmospheric Administration sea surface temperature 43. This real-world dataset was collected from satellite and ship-based observations over time. The data comprise weekly observations of the sea surface temperature of Earth at a spatial resolution of 360 × 180 (longitude and latitude, respectively), giving a resolution of 1° longitude and 1° latitude. For training, we use 1,040 snapshots spanning from the year 1981 to 2001, and then we tested on snapshots from 2001 to 2018. During training, we do not use any information about field values on the continents, because there is no recorded value to reconstruct, which saves computational time (as the continents are 32% of the computational domain). For each batch of data, we select 1 to 100 sensors using a uniform distribution with the goal of training the model to be robust under different numbers of sensors and spatial configurations. Again, each batch during training observes a sparse set of observations, but collectively the batches with random sensor configurations sample from the whole field. For testing, we placed sensors randomly following the same procedure as ref. 32.
In our results, we see that just 10 sensors allow for a very strong reconstruction performance of ϵ = 0.047. Ten sensors constitute a total spatial coverage of 0.0154%. By adding more sensors, the overall test error goes down and missing details are added to the local temperature field as seen in Supplementary Fig. 2. It is also noticeable that, as we use random sensors during training, the model does not take a big performance hit in the scenarios with fewer sensors ($N_s < 50$) compared with ref. 32. The forecasting capabilities show that with good coverage ($N_s > 50$), the trained model can perform estimates up to 18 years into the future with error substantially less than 1 °C. This shows that the Senseiver architecture can cope with real-world situations, including forecasting. All these results are shown in detail in the Supplementary Information.
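The quoted coverage follows directly from the grid size. The short check below reproduces it; the helper name `coverage_percent` is introduced here for illustration only, and coverage is counted over the full 360 × 180 grid, continents included.

```python
def coverage_percent(n_sensors, grid_shape):
    """Percentage of grid points that carry a sensor."""
    n_points = 1
    for n in grid_shape:
        n_points *= n
    return 100.0 * n_sensors / n_points

# Sea surface temperature grid: 360 x 180 points, 10 sensors.
print(round(coverage_percent(10, (360, 180)), 4))  # 0.0154
```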

Acyclic and nonlinear chaotic phenomena
The third dataset considered is a simulation of turbulent fluid flow through a channel 44. The flow field data are obtained by a 3D numerical simulation of incompressible flow in a channel at a Reynolds number of 180. A slice is taken at the middle of the channel, which yields a computational domain of size 128 × 48, with 128 cells in the streamwise direction of the flow. The domain is non-dimensionalized using the half-width of the channel (δ), and the length of the channel is 4π times the half-width. The target of interest is the velocity of the fluid in the direction of flow.
To make the model robust to changes in sensor locations, at each training iteration we picked a random number of sensors ($N_s$) from the training pool, which ranged from 25 (low coverage) to 300 (medium coverage). We train our model with all the available simulation data and test its ability to reconstruct it. Similar to the previous datasets, 25-300 locations are sampled in each batch, but aggregating the locations sampled across all batches in an epoch results in the whole field being sampled. Of course, only sparse observations are used at inference time. In addition, we ran a case where we multiplied the value of the loss function (the mean squared error, or MSE) at each mini-batch by the number of sensors used in the forward pass, namely $\mathcal{L} = \mathrm{MSE} \times N_s$. We found that this increases performance for cases where more than 150 sensors are provided to the model. We show that the Senseiver is able to provide excellent qualitative reconstructions, as illustrated by the cross-sections in Fig. 4. This training scheme also allows us to move the sensors at inference time. The model obtains the main features accurately, even with a small number of sensors, as can be seen in the reconstructions of Fig. 4. A study of how temporal coverage and number of sensors affect the performance of the trained model is shown in Supplementary Fig. 3.
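The sensor-count weighting can be written in a few lines. The sketch below is only an illustration of the formula $\mathcal{L} = \mathrm{MSE} \times N_s$ with a sensor count drawn from the stated range of 25 to 300; the function name and the toy target and prediction values are assumptions, not model output.

```python
import random

def weighted_loss(targets, predictions, n_sensors):
    """Mean squared error multiplied by the number of sensors in the forward pass."""
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
    return mse * n_sensors

random.seed(0)
n_sensors = random.randint(25, 300)  # drawn anew at every training iteration
targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]  # stand-in values, not model output
print(n_sensors, weighted_loss(targets, predictions, n_sensors))
```

The weighting makes batches with many sensors contribute more to the gradient, which is consistent with the reported gain when more than 150 sensors are provided.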
The fourth dataset considered is a simulation of two fluids flowing through a complex 3D medium comprised of spherical obstacles with periodic boundary conditions. A simulation was run using the lattice-Boltzmann method library MPLBM 45 for 4 days using 120 central-processing-unit cores. A non-wetting fluid was placed at the inlet and driven through the domain where its density was recorded during the simulation. The goal of this test case is to assess the capabilities of our model to train with very large arrays with extremely sparse inputs (0.0006% of spatial coverage). The computational domain is 128 × 128 × 512 (with a resolution of 3.5 μm) and we collected 100 frames (one every millisecond) totalling over 1.6 billion points. Three 3D snapshots throughout the simulation are illustrated in Supplementary Fig. 4.
Similar to the sea-temperature dataset, in this domain around 70% of the grid cells have no property value to reconstruct (that is, the cells inside the solid), hence the training is sped up by a substantial factor compared with convolutional neural network approaches, which are forced to scan the whole domain. It is also worth noting that the 3D version of the approach in ref. 32 allocated the entire memory of a 24 GB graphics processing unit for a mini-batch of only one sample, slowing down the training and making it impossible to train with bigger images. In contrast, our model uses only 4 GB. The results of the abstraction performance of the model with different amounts of training data are shown in Fig. 5. Fifty per cent temporal coverage is enough to train a model that is able to provide accurate reconstructions (ϵ < 0.3) throughout the dataset. One important highlight is that the model is not predictive in highly non-stationary acyclic flows with transient dynamics 46, for instance, frames far away from the training data (>100 in Fig. 5). The model has periodic boundary conditions so the fluid reaching the outlet is re-injected at the inlet, a situation not contemplated by the training data.
The fifth and last dataset is a contaminant being advected by a turbulent field. This set-up reflects the case where a pollutant is being transported and sparse velocity measurements are available. This example demonstrates the flexibility of the framework to model the relationship between a vector and a scalar. By withholding the concentration of the pollutants from the inputs, this example also illustrates the ability of the model to predict quantities that are unobserved. In this example, the velocity vector ($v_x$, $v_y$, $v_z$, which yields $N_I = 3$) is measured in 1% of the $128^3$ domain and the task of the model is to predict the concentration of the pollutant. In contrast to previous datasets, where the whole field is sampled across many batches in training, with this dataset we sample only 75% or 50% of spatial locations across all batches during training, to test scenarios where not all the ground truth is available. In Fig. 6, we observe that the model reconstructs larger-scale variations in the passive scalar fairly well, but does not reproduce the fine-scale structure. The overall $R^2$ for reconstruction is approximately 0.75, which is remarkable given that there is virtually no correlation between velocity, which is observed, and the tracer concentration, which is predicted (Supplementary Fig. 5).

Discussion
The flexibility of the Senseiver architecture allows the exploration of many use cases, and although we aimed to cover as much ground as possible, there are many things still to explore. For instance, non-Cartesian or unstructured grids can be used during training and/or inference. In the same vein, the resolution of the field prediction can be increased by computing the desired property at intermediate intervals, thus repurposing the architecture for super-resolution. Multiple decoding heads can be trained to predict outputs with different boundary conditions or different downstream tasks (for example, segmentation or classification). Additional research could be carried out so that positional encodings can be used to train a model to have forecasting capabilities. During the development of this project, an attempt was made to encode time using sine-cosine encodings without success. However, we tried utilizing a trainable array where each time increment (dt) corresponded to one vector; this was successful but we found it impractical as it requires the model to visit every time increment (dt) during training.

Conclusion
With the advent of widespread access to satellite data and cheap sensors, we have an opportunity to address several problems in Earth sciences and engineering in a manner not possible before. However, these powerful data sources are typically sparse, and leveraging them requires specialized approaches that can map the measured local data to the physics of the global field under observation. The limitations of current approaches introduce large uncertainties in myriad applications such as aviation safety, forecasting accuracy in adverse weather predictions, migration patterns of wildfire, contaminant tracing and tracking sequestered CO2 plumes. Having a general class of algorithms that is able to estimate and reconstruct the global field from sparse, local measurements will be a major advancement in this field.
In this work, we present an efficient and effective deep learning approach to reconstruct fields from such sparse measurements. From an information theoretic perspective, sparse sensing is an inverse modelling problem that maps sparse, low-dimensional measurements to a dense high-dimensional state. The goal of sparse-sensing algorithms is to obtain the best possible estimates useful enough to inform practical applications, as there are few other viable alternatives. We propose an attention-based neural network architecture, the Senseiver, to encode a compact representation of large systems. We validated the effectiveness of our method with extensive demonstrations on different datasets of interest to the sparse-sensing community, and also on a complex, realistic 3D fluids dataset. Our approach offers improved capabilities for large, practical applications compared with the state-of-the-art convolutional neural network architectures by demonstrating higher accuracy with a lower memory footprint. Five examples of global field reconstruction from local sensor measurements demonstrated the accuracy and robustness of our method. Sparse sensing of fluid flow data, especially turbulence, is extremely challenging due to nonlinearity and chaos. In addition, a low sensor coverage makes the task harder as the sensors can have non-unique reconstructions. Compared with previous efforts, our model scales effectively in large domains of high dimensionality.
Besides the greatly reduced memory footprint compared with previous efforts, a key advantage of the Senseiver is its query-based decoder, which allows us to predict domains of arbitrary sizes in a sequential manner. This decoupling of the query process from the dimensionality of the dataset makes it extremely memory efficient and allows our model to scale effectively to large domains. In summary, this work only scratches the surface of what is possible with attention-based architectures for sparse sensing.

Encoder-decoder architecture
Our encoder-decoder architecture is built upon Perceiver IO.The encoder module takes the locations and values of the sensors and maps them to a latent space (of size N f ) through scaled dot-product attention layers.First, the sensor data s i from a number of sensors observations

Article
https://doi.org/10.1038/s42256-023-00746-xN s and their corresponding positional encodings a i are concatenated to form the input E (0) ∈ ℝ N s ×N I +2N D N f , where the superscript indicates each box/layer in Fig. 2. E (0) is then processed using a fully connected linear layer to create E (1) ∈ ℝ N s ×N c where N c is a hidden dimension used throughout the architecture.Next, E (1) is processed with an attention block.Within each block is a multi-headed cross-attention layer (Fig. 2) that uses a trainable query array Q in ∈ ℝ N Q in ×N c .The attention block preserves the dimension such that the output Z of the encoder is of a fixed dimension N Q in × N c , regardless of the number of sensor observations.In summary, the steps of the encoder are given by E (1) = Linear(E (0) ) E (2) = AttentionBlock 1 (E (1) ; The dimensions inside the attention blocks are provided in Supplementary Information, along with definitions of multi-head attention mechanisms.Q in is the latent query array in each block, and θ and ϕ are the weights of the multi-layer perceptrons (MLPs) within each attention block.We also note that the second and third attentionblock modules share weights, and therefore this is a recurrent step in the architecture.This has the benefit of reducing the parameters.Furthermore, our preliminary experiments re-using the weights recurrently resulted in a small increase in accuracy (~10%) without any additional parameters.Linear refers to a simple linear layer with bias, although it is important to note that this projection decreases the dimension before the attention mechanism, that is, we have N c < < N I + 2N D N f .This dimension reduction improves computational efficiency while preserving key information-similar to low-rank methods or the Johnson-Lindenstrauss lemma, but further empowered by the use of attention.The number of channels output by the linear layer is a key hyper parameter that, while small, is somewhat application dependent.While increasing N f 
is used to capture higher spatial resolution (Supplementary Table 1), increasing the number of channels can be used to capture more complex temporal dynamics (Supplementary Table 2).

Next, in the decoder block, the encoded position of the query $a_q$ is concatenated with a trainable query vector $q_\mathrm{out} \in \mathbb{R}^{N_c}$. However, more frequently we consider multiple query points, the number of which is denoted $N_q$. In this case, the query vector is repeated row-wise (once for each query point $a_q$) to make a matrix, $Q_\mathrm{out}$. We denote the concatenated positions and $Q_\mathrm{out}$ as $D^{(0)} \in \mathbb{R}^{N_q \times (2 N_f N_D + N_c)}$. This query matrix is processed by a linear layer to output $D^{(1)} \in \mathbb{R}^{N_q \times N_c}$. $D^{(1)}$ serves as the queries in a multi-head cross-attention. The keys and values for this cross-attention are provided by the latent input representation $Z$, which yields the output $D^{(2)} \in \mathbb{R}^{N_{Q_\mathrm{out}} \times N_c}$. $D^{(2)}$ is processed by a linear layer, which yields an output $\hat{s} \in \mathbb{R}^{N_o}$. In summary, the steps of the decoder are given by:

$$D^{(1)} = \mathrm{Linear}(D^{(0)}), \qquad D^{(2)} = \mathrm{MultiHead}(D^{(1)}, Z, Z), \qquad \hat{s} = \mathrm{Linear}(D^{(2)}).$$

For details of the attention layers, the network implementation and the design decisions in the Senseiver, see Supplementary Information. The end-to-end forward pass is illustrated in Fig. 2.
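The shape bookkeeping of the encoder and decoder can be illustrated with a minimal sketch. It is not the Senseiver implementation: it assumes a single attention head, a single attention block, no learned key/value projections, no MLPs, no residual connections and no bias terms, and it uses random matrices in place of trained weights. All function names and sizes are hypothetical, chosen only to show that the latent array has $N_{Q_\mathrm{in}}$ rows whatever the number of sensors, and that the decoder returns one row per query point.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, inputs):
    """Single-head cross-attention; `inputs` act as both keys and values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = matmul(queries, [list(col) for col in zip(*inputs)])  # Q K^T
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, inputs)  # one output row per query row

def random_matrix(rows, cols, rng):
    """Stand-in for trained weights: small Gaussian entries."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def encode(e0, w_enc, q_in):
    """E0 (N_s x width) -> Z (N_Qin x N_c); the row count of Z is set by q_in."""
    e1 = matmul(e0, w_enc)            # Linear, bias omitted (assumption)
    return cross_attention(q_in, e1)  # one simplified attention block

def decode(d0, w_dec, z, w_out):
    """D0 (N_q x width) -> field estimates (N_q x N_o)."""
    d1 = matmul(d0, w_dec)            # Linear, bias omitted (assumption)
    d2 = cross_attention(d1, z)       # D1 as queries, Z as keys and values
    return matmul(d2, w_out)          # final linear read-out

# Hypothetical sizes, chosen only for illustration
N_I, N_D, N_F, N_C, N_QIN, N_O = 1, 2, 4, 8, 4, 1
rng = random.Random(0)
w_enc = random_matrix(N_I + 2 * N_D * N_F, N_C, rng)
q_in = random_matrix(N_QIN, N_C, rng)
w_dec = random_matrix(2 * N_F * N_D + N_C, N_C, rng)
w_out = random_matrix(N_C, N_O, rng)

for n_s in (5, 50):  # two different sensor counts
    e0 = random_matrix(n_s, N_I + 2 * N_D * N_F, rng)
    z = encode(e0, w_enc, q_in)
    print(n_s, "sensors -> latent", len(z), "x", len(z[0]))

d0 = random_matrix(3, 2 * N_F * N_D + N_C, rng)  # three query points
s_hat = decode(d0, w_dec, z, w_out)
print("decoded", len(s_hat), "x", len(s_hat[0]))
```

With these sizes, both sensor counts produce a 4 x 8 latent array, and the three query points decode to a 3 x 1 output. Because the trainable query array supplies the query rows of the encoder attention, the cost of decoding depends on the number of query points and the latent size, not on the number of sensors.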

Spatial encodings
The attention mechanism does not explicitly account for the spatial location of the sensors or queries. To include this information, we encode the spatial position of these (equation (1)) using sine-cosine positional encodings (ref. 34). These are visualized graphically in Fig. 2. For each $d$ of the $N_D$ spatial dimensions, we specify a set of spatial frequencies $\{f_k\}_d$ of size $N_f$ over which to build the sine-cosine positional encodings. A position $x \in \mathbb{R}^{N_D}$ is decomposed into a vector $a \in \mathbb{R}^{2 \times N_D \times N_f}$, where each entry in $a$ is the value of a corresponding sine or cosine (hence the factor 2) at the specified frequency $f$ and $N_f$ is the number of frequencies in the encoding. For each dimension $d$, there are $2 N_f$ entries in $a$; the first are $\sin(\pi f_k x_d)$ and the second are $\cos(\pi f_k x_d)$. This design choice does not require any additional training parameters, and the computational work required to produce them is negligible. In many applications, having a large $N_f$ is required to accurately encode the position of the sensors and query points, especially in three dimensions. The number of parameters increases rapidly for Perceiver IO as $N_f$ increases. Senseiver avoids this problem, making it more suitable in applications where precise locations are important (see Supplementary Table 2 for a comparison).

In the examples used in this work, the data are located in Cartesian grids, so an array with components denoting the centre of each grid point in each coordinate direction is created, and then indexed during training and inference. Having a Cartesian grid is not a prerequisite to use our model; the sine-cosine spatial encodings can be evaluated on any mesh, or on arbitrary continuously variable points in space. A strong advantage of this flexibility is that it makes it possible to construct field predictions at arbitrary subsets of the full domain, which allows predictions to be made with very few computational resources, as a domain prediction can be constructed piece by piece. We take advantage of
this fact during training, as described in 'Training procedure'. In addition, recent work in explainable artificial intelligence has shown that neural networks appear to learn Fourier representations of fluid flows internally (ref. 47), supporting our assumption that these encodings are appropriate for many problems of interest.
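The sine-cosine encoding described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name and the frequency values are hypothetical, and the same frequency set is used for every spatial dimension, whereas the text allows a separate set $\{f_k\}_d$ per dimension.

```python
import math

def sincos_encoding(x, freqs):
    """Map a position x (length N_D) to a vector of length 2 * N_D * N_f."""
    a = []
    for x_d in x:  # one group of 2 * N_f entries per spatial dimension
        a.extend(math.sin(math.pi * f * x_d) for f in freqs)  # sine entries first
        a.extend(math.cos(math.pi * f * x_d) for f in freqs)  # then cosine entries
    return a

# Hypothetical frequencies and query point, chosen only for illustration
freqs = [1.0, 2.0, 4.0]
a = sincos_encoding([0.25, 0.5], freqs)
print(len(a))  # 2 * N_D * N_f = 2 * 2 * 3
```

The function has no trainable parameters and accepts any real-valued coordinates, so the same call works for grid centres, mesh nodes or arbitrary points in the domain.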

Fig. 1 | Overview of sparse reconstruction using the Senseiver model. a, The workflow of the Senseiver innovations for the sparse-sensing problem. We use sensor values and precise query locations that are sparse in the field domain and allow greater computational efficiency. The sensor values are processed by an encoder, and the resulting latent representation is passed along with the query information to a decoder, which estimates the field at a new location. In this example, the output is decoded into a structured grid. b, Overview of applications in this work.

Fig. 3 | Test error results for different sensor configurations. Left: test error (equation (5)) with different sensor configurations and numbers of training frames, and comparison with VoronoiCNN (ref. 32). The bars with diagonal lines indicate the error of our model. Note that the colour in the inset plot on the left is

Fig. 4 | Performance of the model varying the number of sensors and their locations at inference time. The x axis depicts the number of sensors used to reconstruct the field. We tested our trained model with ten different random sensor locations (using fixed seeds) for each x coordinate. The plot shows the 10th and 90th percentiles as bounds of the error (equation (5)) and the average of the ten runs as a line. Insets: predictions for the same time frame are shown to depict how the prediction accuracy increases qualitatively with more sensor coverage. All the colour bars are normalized from −1 to 1. δ, half-width of the channel.

Fig. 5 | Performance of the model with different amounts of training data. Left: error (equation (5)) of the model versus temporal coverage. The 3D sensor locations are depicted in the top left corner. In the plot, each line represents a trained model and the points represent the training data used for each