Human pointing errors suggest a flattened, task-dependent representation of space

People are able to keep track of objects as they navigate through space, even when objects are out of sight. This requires some kind of representation of the scene and of the observer’s location but the form this representation might take is debated. We tested the accuracy and reliability of observers’ estimates of the visual direction of previously-viewed targets. Participants viewed 4 objects from one location, with binocular vision and small head movements giving information about the 3D locations of the objects. Without any further sight of the targets, participants walked to another location and pointed towards them. All the conditions were tested in an immersive virtual environment and some were also carried out in a real scene. Participants made large, consistent pointing errors that are poorly explained by any consistent 3D representation. Instead, a flattened representation of space that is dependent on the structure of the environment at the time of pointing provides a good account of participants’ errors. This suggests that the mechanisms for updating visual direction of unseen targets are not based on a stable 3D model of the scene, even a distorted one.


Introduction
If a moving observer is to keep track of the location of objects they have seen earlier but which are currently out of view, they must store some kind of representation of the scene and update their location and orientation within that representation. There is no consensus on how this might be done in humans. One possibility is that the representation avoids using 3D coordinates and instead relies on a series of stored sensory states connected by actions ('view-based'), as has been proposed for simple animals such as bees or wasps [1,2,3], humans [4,5,6] and deep neural networks [7,8]. Alternatively, the representation might be based on 3D coordinates that are stable in world-, body- or head-centered frames [9,10], possibly based on 'grid' cells in entorhinal cortex [11,12] or 'place' cells in the hippocampus [13,14].
One key difference between these approaches is the extent to which the observer's task is incorporated in the representation. For 3D coordinate-based representations of a scene, task is irrelevant and, by definition, the underlying representation remains constant however it is interrogated. Other representations do not have this constraint. Interestingly, in new approaches to visual scene representation using reinforcement learning and deep neural networks, the task and the environment are inextricably linked in the learned representation [15,7]. This task-dependency is one of the key determinants we explore in our experiment on spatial updating.
If the brain has access to a 3D model of the scene and the observer's location in the same coordinate frame then, in theory, spatial updating is a straightforward matter of geometry. It is harder to see how it could be done in a view-based framework. People can imagine what will happen when they move [16,17,18], although they often do so with very large errors [19,20,21,22,23,24]. In this paper, we examine the accuracy and precision of pointing to targets that were viewed from one location and then not seen again as the observer walked to a new location to point, in order to test the hypothesis that a single 3D reconstruction of the scene, built up when the observer was initially inspecting the scene, can explain observers' pointing directions.
The task is similar to that described in many experiments on spatial updating such as indirect walking to a target [25,26,27,28], a triangle completion task [29,30], drawing a map of a studied environment including previously viewed objects' location [31,19,32], or viewing a set of objects on a table and then indicating the remembered location of the objects after walking round it or after the table has been rotated [33,34,35,36].
However, none of these studies have compared directly the predictions of a 3D reconstruction model with one that varies according to the location of the observer when they point, as we do here. Spatial updating has been discussed in relation to both 'egocentric' and 'allocentric' representations of a scene [37,38,39,40] and, in theory, either or both of these representations could be used in order to point at a target. An 'egocentric' model is assumed to encode local orientations and distances of objects relative to the observer [41,16,33,38], while an allocentric model is world-based reflecting the fact that the relative orientation of objects in the representation would not be affected by the observer walking from one location to another [40]. People might use both [37,17,39], and Wang and Spelke [38] have emphasised consistency as a useful discriminator between the models. So, for example, disorientating a participant by spinning them on a chair should affect pointing errors to all objects by adding a constant bias if participants use an allocentric representation. The argument that Wang and Spelke [42] make about disorientation conflates two separate issues, one about the origin and axes of a representation (such as 'allocentric' or 'egocentric') and the other about the internal consistency of a representation. In this paper, we focus on the latter. We ask whether a single consistent, but possibly distorted, 3D reconstruction of the scene could explain the way that people point to previously-viewed targets.

Results
Our stimulus was designed so that participants had access to a rich set of information about the spatial layout of 4 target objects whose position they would have to remember. In either the real world or in a head-mounted display, they had a binocular view of the scene and could move their head freely (typically, they moved ±25cm). The targets consisted of four different colored boxes that were laid out on one side of the room at about eye height (see Fig. 1a) while, on the other side of the room, there were partitions (referred to from now on as 'walls') that obscured the target objects from view once the participant had left the original viewing zone ('Start zone'), see Figs. 1d and 1e. Most of the experiments were carried out in virtual reality (Fig. 1b), although for one experiment the scene was replicated in a real room (Fig. 1a). After viewing the scene, participants walked to one of a number of pointing zones (the pointing zone was not known in advance) where they pointed multiple times to each of the boxes in a specified order (randomized per trial, see Section S1 for details). Fig. 1 shows the layout of the boxes in a real scene (Fig. 1a), a virtual scene (Fig. 1b) and in plan view (Fig. 1c, shown here for Experiment 1). The obscuring walls are shown on the left with a participant pointing in the real scene. The measure of pointing error used was the signed angle between the target and the 'shooting direction' (participants were asked to 'shoot' at the target boxes with the pointing device), as illustrated in Fig. 1d. Although not all participants had experience of shooting in video games or similar, the instruction was understood by all and this definition of pointing error gave rise to an unbiased distribution of errors for shots at a visible target (Fig. S3), which was not the case for a considered alternative, namely the direction of the pointing device relative to the cyclopean eye (Fig. 1e). In the first experiment, the participant pointed to each of the target boxes eight times in a pseudo-random order (specified by the experimenter) from one of three pointing zones (zones A, B and C; data from each are shown in Figs. 2a, 2b and 2c respectively).
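To make this error measure concrete, the short Python sketch below computes the signed error for a single shot; the restriction to the horizontal (plan-view) plane, the sign convention and the variable names are our illustrative assumptions rather than the exact analysis code used for the experiments.

```python
import numpy as np

def signed_pointing_error(device_pos, device_dir, target_pos):
    """Signed angle (degrees) between the 'shooting' direction and the direction
    from the pointing device to the target, measured in the horizontal (x, y)
    plane.  Here positive values are anticlockwise errors (an assumption)."""
    shoot = np.asarray(device_dir, float)[:2]                              # pointing ('shooting') direction
    to_target = np.asarray(target_pos, float)[:2] - np.asarray(device_pos, float)[:2]
    ang_shoot = np.arctan2(shoot[1], shoot[0])
    ang_target = np.arctan2(to_target[1], to_target[0])
    err = (ang_shoot - ang_target + np.pi) % (2 * np.pi) - np.pi           # wrap to (-pi, pi]
    return np.degrees(err)

# Example: device at head height, shooting due 'North', target slightly to the 'East'
print(signed_pointing_error([0.0, 0.0, 1.6], [0.0, 1.0, 0.0], [0.5, 5.0, 1.6]))
```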
Figs. 2a-2c illustrate examples of the raw pointing directions in this condition for one participant. It is clear that the participant makes large and consistent errors: For example, they point consistently to the 'North' of the targets from the northern pointing zones (A and B) but consistently to the 'South' of the targets from the southern zone, C (see Fig. 1c for definition of North). Not only this, some of the geometric features of the scene have been lost. For example, from zone C, the blue and yellow targets are almost co-aligned in reality but the participant points in very different directions to each. As we will see, these features turn out to be highly repeatable, both across participants and in multiple versions of the task.
For example, Fig. 2d shows the pointing errors gathered in two separate conditions plotted against each other (Experiment 1). The data shown are the mean pointing errors for 20 participants, plotted per target box (symbol colour), box layout and pointing zone (symbol shape). The ordinate shows pointing errors from the standard ('indirect') condition, in which participants had to walk behind the obscuring walls to arrive at the pointing zone, whereas the abscissa shows pointing errors when the experiment was repeated with one of the walls removed (see Fig. 1c) so that participants could walk direct to each of the pointing zones. For Zone C especially, this makes a dramatic difference to the path length to get to the pointing zone, so any theory that attributed the pointing errors to accumulation of errors in the estimation of the participant's location would predict a difference between the data for these two experiments, especially for pointing from Zone C, but that is not the case. A post-hoc power analysis shows that the power achieved to rule out the correlation shown in Fig. 2d occurring by chance is, to a very close approximation, 100%. The data from all the other experiments reported in this paper conform to the same pattern (see Figs. 3a, 3b, Fig. 4 and Fig. 7), amounting to nine independent replications. We will use the data from this first experiment to build a simple model that predicts the pointing directions in all nine replications and test this model against the alternative hypothesis that the visual system uses a 3D model to generate pointing directions to the unseen targets.
Fig. 3a shows that there was also a high correlation between pointing errors when participants repeated exactly the same conditions in a real or a virtual environment (correlation coefficient 0.88, p < 0.001), although the range of pointing errors was greater in the virtual room (slope 1.42). Waller and colleagues [43] also found a close match between performance in real and matched virtual environments. We found the same pattern of pointing biases in a separate experiment that tested the role of egocentric orientation relative to the targets (Experiment 2). Fig. 3b shows a high correlation between pointing errors when participants looked either 'North' or 'South' in order to view the image that told them which was the next pointing target. In these two versions of the experiment, participants' rotation to point at the target was quite different, so any model based on egocentric direction at the moment the target was defined would predict a difference, but there was no systematic effect on the pointing errors (correlation coefficient 0.93, p < 0.001, slope 0.91). This high repeatability in the face of stimulus changes should be contrasted with the dramatic effect of changing the pointing zone. Fig. 3c shows that, combining data across Experiments 1 and 2, pointing errors when participants were in zones A and B (Figs. 2a and 2b) were reliably anticlockwise relative to the true target location (positive in Fig. 3c; M = 12.1, SD = 20.1, t(949) = 18.7, p < 0.001 in a two-tailed t-test) while, when participants were in zone C (Fig. 2c), the errors were reliably clockwise (negative in Fig. 3c; M = 12.9, SD = 14.7, t(479) = 19.2, p < 0.001). Experiment 3, which used pointing zones both to the 'West' and to the 'East' of the target boxes, showed a similar pattern of biases (see Fig. 7, Fig. S5c and Fig. 8). Another way to summarize the results is that, wherever the pointing zone was, the participants' pointing directions were somewhere between the true direction of the target and a direction orthogonal to the obscuring wall. Expressed in this way, it is clear that the pointing zone itself may not be the key variable.
Rather, it could be the spatial relationship between the target, the obscuring wall and the observer at the moment the participant points. In the next experiment, we examined paired conditions in which we kept everything else constant (box layout, start zone and pointing zone) other than the orientation of the obscuring wall. Once again, participants' pointing directions were biased away from the true target direction and towards a direction orthogonal to the obscuring wall, just as they were in Experiment 1 (Fig. 3c). We will return to these paired conditions once we have described a simple model, and will show that the distribution of pointing directions relative to the model prediction (as opposed to the ground truth) is not affected by the orientation of the wall (Fig. 4i and Fig. 4j).

Noisy path integration
The pointing task requires the observer to update an internal representation of their location and orientation. One possible model of our data is that a cumulative error in this updating process is the critical factor explaining the pointing biases (i.e. a noisy-path-integration model). However, as we have already seen (Fig. 2d), pointing errors were essentially unchanged when one of the walls was removed so that participants could reach the pointing zone by a much shorter, direct route, which is hard to reconcile with an explanation based on errors that accumulate along the walked path.

Abathic distortion
Another possible cause for systematic errors is that the observer builds a distorted 3D representation of space when they are at the start zone and uses it to guide their pointing. Many models of binocular space perception assume that there is a distorted mapping between true space and represented space, often with an 'abathic' distance at which objects are judged to be at the correct distance and a compression of visual space towards this plane [44,45] as illustrated in Fig. 5a. Applying this 2-parameter model to our pointing data, the best-fitting abathic distance is actually about 5m behind the observer when they view the scene (i.e. behind the start zone, Section S4.3). Another problem for this model, just like the noisy-path-integration model, is that it makes no prediction about the effect of the slant of the obscuring wall.

Retrofit
The locations of the target boxes that would best account for participants' pointing directions can be inferred from the data; Fig. 6e shows these inferred locations for all conditions in Experiment 1. These points tend to cluster around a plane that is parallel to the plane of the obscuring wall, whereas the true locations of the target boxes (shown in translucent colors in Fig. 6f) are nothing like this. As mentioned earlier, participants tend to point in a direction that is more orthogonal to the obscuring wall than the true target direction, but now it is clear that the inferred target locations all have a similar depth relative to the obscuring walls, as if the depth structure of the scene has been 'squashed'. In reality, the pink and blue boxes were very close to the obscuring wall in the center of the room, while the red and yellow boxes were more distant. Yet, according to participants' pointing data, the apparent locations of the targets were all at a similar distance relative to the wall.

Projection plane
So, as a post-hoc 'model' of this apparent distortion of the scene structure, we can apply the following prediction to the other experiments. Participants will point to remembered targets by assuming that all the targets in that experiment lie on a single plane that is parallel to and 1.77m behind the obscuring wall. The value of 1.77m is simply taken from the best fit to the points shown in Fig. 6e. Fig. 6f, Fig. 6g, Fig. 6h and Fig. 6i show examples of this projection-plane rule applied in other conditions.
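As an illustration, the Python sketch below applies this rule to predict a pointing direction. The assumption that each target is slid along its sight-line from the start zone onto the plane, and all function and variable names, are ours; the sketch is only meant to make the geometry of the rule explicit.

```python
import numpy as np

def projection_plane_prediction(start, point_zone, target, wall_point, wall_normal, depth=1.77):
    """Predict the plan-view pointing direction (degrees) under the projection-plane rule.

    All positions are 2D (x, y) plan-view coordinates.  The target is assumed to be
    remembered as lying on a plane parallel to the obscuring wall and `depth` metres
    behind it; here we project the target along its sight-line from the start zone
    onto that plane (our assumption) and point at the projected location."""
    start, point_zone, target, wall_point = (np.asarray(p, float) for p in
                                             (start, point_zone, target, wall_point))
    n = np.asarray(wall_normal, float)
    n = n / np.linalg.norm(n)                        # unit normal pointing towards the targets
    d = target - start                               # sight-line direction from the start zone
    t = (depth - n @ (start - wall_point)) / (n @ d)
    remembered = start + t * d                       # remembered target location on the plane
    aim = remembered - point_zone                    # predicted pointing direction
    return np.degrees(np.arctan2(aim[1], aim[0]))

# Example with made-up coordinates: wall along the y axis, targets and start zone to its 'East'
print(projection_plane_prediction(start=[3.0, 0.0], point_zone=[-1.0, 2.0],
                                  target=[2.0, 4.0], wall_point=[0.0, 0.0],
                                  wall_normal=[1.0, 0.0]))
```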

Model comparison
We have already discussed the fact that the noisy-path-integration and abathic-distance-distortion models cannot account for the effect of the obscuring wall orientation (Fig. 4). Table S1 shows the residuals for these two models and for the projection-plane model for Experiment 1, where all three models are fitted to the data of all participants combined. The residuals are lower for the projection-plane model for 17/20 participants but the key test for this model is to see whether it can predict performance in Experiments 2 to 4 where it provides a zero-parameter prediction of the pointing data.
The retrofit model is rather different. It is provided with all the data and asked to derive a 3D structure of the scene that would best account for participants' pointing directions using a large number of free parameters (24), so, unlike the projection-plane model, it does not make predictions. Table S2 shows that, even with this flexibility, the retrofit model fits the data less well than the projection-plane model.

Discussion
For a spatial representation to be generated in one location and used in another, there must be some transformation of the representation, or its read-out, that takes account of the observer's movement. The experiments described here show that in humans, this process is highly inaccurate and the biases are remarkably consistent across participants and across many different conditions.
The data appear to rule out several important hypotheses. Crucially, any hypothesis that seeks to explain the errors in terms of a distorted internal model of the scene at the initial encoding stage will fail to capture the marked effects of the obscuring wall, which is generally not visible at the encoding stage (see Fig. 4).
We examined standard models of a distorted visual world, namely compression of visual space around an 'abathic' distance [44,45] (details in Section S4.3) but even when we allow any type of distortion of the scene that the observer sees from the starting zone to explain the data (provided that the same distortion is used to explain a participant's pointing direction from all pointing zones and any wall orientation), this type of model still provides a worse fit to the data than our projection-plane model (see Fig. 7, 8 and Table S2).
The post-hoc nature of the 'retrofit' model makes it absurdly generous: it is fitted, with many free parameters, to the very data it is trying to explain. The fact that it is still worse than a zero-parameter model generated from the first experiment provides strong evidence against 3D reconstruction models of this type.
We also examined and ruled out a hypothesis based on noisy path integration [46] as an explanation of our data. One strong piece of evidence against such hypotheses is the high correlation between the errors participants make when they walk to a pointing location via a long route and those they make when they walk via a short route (Fig. 2d and Fig. S2a). These data suggest that people point to a memory of the target boxes that is quite different from their true locations, but it is not the route to the pointing zone that can explain this. Instead, it is something about the participant's location, and the scene they can see, when they get there.
One paper that uses a similar paradigm to ours also shows consistent pointing biases [26]. When the pointing zone (in their experiment, called an 'indirect waypoint') was displaced from the location at which the scene layout was learned, pointing was shifted consistently in the direction of the waypoint. The participant was blindfolded during the test phase and guided to the waypoint so there was no equivalent of our wall and only two waypoints were tested so it is hard to make a direct comparison with our data. Nevertheless, their data suggest that, in line with our data, the location of the pointing zone with respect to the start zone has a systematic biasing effect on pointing directions.
It has often been shown that the observer's orientation can have a significant effect on pointing performance. For example, when the orientation of the participant in the test phase differs from their orientation during learning of the scene layout, this can influence pointing directions (for both real walking [47] and imagined movement [17,48] in the test phase). Meilinger et al. [49] investigated the effect of adding walls to an environment and showed that they have a significant effect. However, the authors did not examine the biasing effects of moving to different pointing locations, nor can their results be compared directly to ours since they report 'absolute errors' in pointing, a measure that conflates variable errors and systematic biases. In fact, this is a common problem in the literature; many other papers report only absolute errors in pointing [50,24,17,21,51,22,52,19]. Röhrich et al. [32] showed that the participant's location when pointing was likely to be important in determining the reference frame that they used. In their experiment, participants pointed to a well-known market square from different locations in the town, and the pointing location affected the orientation in which participants drew a plan-view sketch of the square. However, the authors did not predict biases in pointing. There are also claims that egocentric factors play a role in pointing biases [17,39] but we did not find any such effect in our experiment: pointing errors were unaffected by a 180° change in the orientation of the observer relative to the scene at the moment they discovered which target they should point at (Fig. 3b).
The fact that the most lenient 3D reconstruction model does not do well in explaining our data raises the question of what participants do instead. The 'projection-plane' model is essentially a re-description of the data in one experiment, rather than an explanation, albeit one that then extends successfully to other situations. Whatever heuristic the visual system uses, it seems to ignore critical aspects of the geometry of the scene, such as relative depths, so that objects are treated in a more simplistic way (equivalent to assuming that the target objects all lie on a plane) than would be the case if the internal model behaved according to the geometry of a real 3D model. Recent findings using 'Generative Query Networks' [8] suggest that we may see rapid advances in our understanding of the process of 'imagining' novel views of an observed scene and how this might be achieved without using a 3D reconstruction.
Finally, it is worth considering what effect such large biases might have in ordinary life. The most relevant data from our experiments in relation to this question are, arguably, the pointing biases that we recorded from the very first trial for each participant. We made sure that, on this first trial, the participant was unaware of the task they were about to be asked to do. Biases in this case were even larger than for the rest of the data (Fig. S2h). If these data reflect performance in daily living, one might ask why we so rarely encounter catastrophic consequences. However, the task that we asked our participants to carry out is an unusual one and, under most circumstances, it is likely that visual landmarks will help to refine direction judgments en route to a target.

Conclusion
Our conclusions are twofold. First, although human observers can point to remembered objects, and hence must update some form of internal representation while the objects are out of view, we have shown that they make highly repeatable errors when doing so. Second, the best explanation of our data is not consistent with a single stable 3D model (even a distorted one) of the target locations. This means that whatever the rules are for spatial updating in human observers, they must involve more than the structure of the remembered scene and geometric integration (even with errors) of the path taken by the observer.

S1. General Methods
Participants. For the first experiment, we recruited 22 participants (aged 19-46) who were paid £10.00 per hour. Data from 2 participants had to be discarded: one failed to understand the task and one felt too uncomfortable wearing the head-mounted display. Of the remaining 20 participants, 19 were naïve to the experiment and 1 was a researcher in our lab who had prior knowledge of the task.
Procedure. In virtual reality, in order to start a trial, participants walked towards a large green cube (with an arrow pointing towards it) in an otherwise black space; as soon as they stepped inside the cube, the stimulus appeared. In the real world, participants wore a blindfold and were guided to the start zone by the experimenter. As soon as they were in the center of the start zone, the experimenter removed the blindfold. In both real and virtual experiments, the participants then viewed 4 boxes from this start zone.
They were told that, from where they were standing, the blue box was always closer than the pink box and both were in the same visual direction. Likewise, the red box was always closer than the yellow box and, again, both were in the same visual direction when viewed from the start zone (see the example plan view in Fig. 1a-c). Participants took as much time as they needed to view and memorize the box positions, although this was typically between 10 and 20 seconds. If they attempted to leave the start zone to walk closer to the target boxes, the whole scene disappeared (in virtual reality) or they were prevented from doing so by the experimenter (in the real world). When they had finished memorizing the box layout, they walked behind a 'wall' (partition) towards a pointing zone. The layout of the partitions is shown in Fig. 1. The 'inner' north-south partition, shown by the dashed line in Fig. 1c, was removed in the 'direct' walking condition (only tested in virtual reality). The indirect and direct walking conditions were intermingled and tested in a randomised order; the participants did not know which condition they were being tested in at the beginning of each trial while standing at the start zone. In virtual reality, the pointing zone was indicated by a colored poster (colored according to the color of the box to which they should point) and this poster only appeared after they left the start zone.
Similarly, in the real-world task, there were three white posters at the three pointing zones (see Fig. 1c) and the participant was only told which one to stop at after they had left the start zone. In virtual reality, the poster changed its color after each shot to indicate the next target box to point at. In the real world, the experimenter told the participant which box to point at. Following these instructions, the participant had to point 8 times to each of the 4 boxes in a pseudo-random order, i.e. 32 pointing directions in all. The end of each trial was indicated by the whole scene disappearing in virtual reality (replaced by the large green box) and in the real world the experimenter told the participant to wait to be guided back blindfolded to the start. The participant received a score ranging from 0-100% reflecting the accuracy of all the shots but this information could not be used to infer the direction or magnitude of their pointing error on any given shot.
Box layouts and stimulus. 8 participants were tested on 9 box layouts (1,440 pointing directions in total in VR per participant: 'Indirect' - 9 layouts × 4 boxes × 3 zones × 8 shots; 'Direct' - 9 layouts × 4 boxes × 2 zones × 8 shots), whereas the remaining 12 participants were tested on 4 layouts (640 trials in VR per participant). All participants repeated the 'indirect' condition in the real world in exactly the same way as they did in VR. All the box layouts, including the raw pointing data in each case, are shown online in an interactive website (see Section S6). The virtual and the real stimuli were designed to be as similar as possible, with a similar scale, texture mapping taken from photographs of the laboratory and target boxes using the same colors and icons, see Fig. 1a-b.
The box positions in the 9 box layouts were chosen such that the following criteria were satisfied (see Fig. 1c and the sketch after this list):
• The blue and the pink boxes were positioned along one line (as seen in plan view) while the red and the yellow boxes were positioned along a second line. The two lines intersected inside the start zone.
• All box layouts preserved the box order: the blue box was always in front of the pink box, and the red box was in front of the yellow box, but the distances to each box varied from layout to layout.
• The two lines were 25° apart from one another. This angle allowed a range of target distances along these lines while the boxes all remained within the real room.
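For concreteness, the sketch below shows one way of generating a layout that satisfies these criteria. The coordinate frame, distance ranges and random sampling are our own illustrative choices, not the values used to construct the experimental layouts.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layout(start_zone, base_bearing_deg, separation_deg=25.0,
                near_range=(2.0, 3.5), far_range=(4.0, 6.0)):
    """Return plan-view box positions satisfying the layout criteria: blue/pink lie
    on one line through the start zone, red/yellow on a second line 25 degrees away,
    and the blue and red boxes are always nearer to the start zone than pink and yellow."""
    start = np.asarray(start_zone, float)
    bearings = np.radians([base_bearing_deg, base_bearing_deg + separation_deg])
    layout = {}
    for (near, far), theta in zip([('blue', 'pink'), ('red', 'yellow')], bearings):
        direction = np.array([np.sin(theta), np.cos(theta)])     # bearing measured from 'North'
        layout[near] = start + rng.uniform(*near_range) * direction
        layout[far] = start + rng.uniform(*far_range) * direction
    return layout

print(make_layout(start_zone=[0.0, 0.0], base_bearing_deg=-10.0))
```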

S2.3. Experiment 2
This experiment aimed to identify the influence of the facing orientation at the pointing zones. 7 participants, all of whom had taken part in the first experiment, viewed the same 6 layouts as in Experiment 1, but in virtual reality only. They then walked to either zone A or zone C, stopped, and faced a poster located either to the 'North', 'South', or 'West' of the pointing zone (unlike Experiment 1, in which there was only one poster position per pointing zone; see Fig. S1a). Otherwise, the procedure was identical to Experiment 1, with the exception that the participant pointed 6 times (rather than 8) to each of the boxes in a random order at the pointing zones. Data are shown in Fig. S2 and Fig. 3b.

S2.4. Experiment 3
The third experiment was also carried out only in virtual reality and here the testing room was scaled (in the x-y plane) by a factor of 1.5, i.e. the height of the room did not change. The pointing zones A, B and C were mirrored along the center line of the room to create an additional 3 pointing zones, D, E and F (see Fig. S1b), but the physical size of the laboratory dictated that participants could only go to one side of the scene or the other (either zones A, B, C or zones D, E, F). The participants did not know whether they would walk to the left or to the right from the start zone until the moment that they pressed the button (when the boxes disappeared). There were 6 participants, 3 of whom had taken part in the previous experiments while 3 were naïve.

S2.5. Experiment 4
The scene layout for this experiment is shown in Fig. S1c.

S4.1. Noisy path integration
We fitted this model to all the data from all participants in Experiment 1 by varying the two free parameters, $w_a$ and $w_d$, to give the minimum root-mean-square error between the actual and predicted pointing directions (see Fig. S4a). The mean of the pointing errors is not significantly different from zero. The mean of the distribution for zone A and zone B was also not significantly different from zero.

S4.2. Zero-mean noise
If, instead, we assume that the noisy-path-integration noise has zero mean, then there is no systematic effect on pointing, which we demonstrate as follows for an estimate of orientation. We added a normally distributed random error to the estimate of visual direction with respect to 'North' on each step, $\hat{h}^b_n$:

$\hat{h}^b_n = h^b_n + E, \quad E = \mathrm{randn}(\mu, s), \qquad (1)$

with the function randn($\mu$, $s$) returning a random number drawn from a distribution with a standard deviation $s$ and a mean $\mu$. $E$ is a random error added to the estimate of $h^b_n$, drawn from a distribution with a standard deviation of $s = \pi/360$ radians and a mean of $\mu = 0$. Using Eq. (1), predictions of pointing directions were calculated with a random additive noise on the direction of 'North'. Calculating the directions 100 times for each box in each layout, using the walking trajectory of every participant tested in the indirect walking condition of Experiment 1 at pointing Zone C, a histogram of errors is plotted in Fig. S4d (and the same result applies for Zone A and Zone B).
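A minimal Python sketch of this kind of simulation is given below. The treatment of the pointing error as the accumulated heading error at the end of the walk, and the step and repeat counts, are simplifying assumptions made only to illustrate why zero-mean noise produces no systematic bias.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_pointing_errors(n_steps=50, sigma=np.pi / 360, n_repeats=100):
    """Accumulate zero-mean heading noise over a walk and return pointing errors
    (degrees).  The errors spread out with the number of steps, but their mean
    stays near zero, so no systematic pointing bias is predicted."""
    step_noise = rng.normal(0.0, sigma, size=(n_repeats, n_steps))   # per-step errors in the 'North' estimate
    final_heading_error = step_noise.sum(axis=1)                     # accumulated error at the pointing zone
    return np.degrees(final_heading_error)

errors = simulated_pointing_errors()
print(round(errors.mean(), 3), round(errors.std(), 3))               # mean close to 0 deg
```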

S4.3. Abathic Model
Johnston [44] shows psychophysical data described by a linear relationship between estimated (or 'scaling') distance and physical viewing distance (e.g. their Fig. 7). In general, we can fit the two parameters (intercept and slope) to our pointing data (Fig. 5a). In our case, the best-fitting values are a slope of 1.03 and an intercept of 0.17, which corresponds to an abathic distance of 5.66 m. Specifically, the misestimated egocentric distance of each box, $d^b_{\mathrm{est}}$, is:

$d^b_{\mathrm{est}} = \alpha \, d^b_{\mathrm{true}} + \beta,$

where $d^b_{\mathrm{true}}$ is the true egocentric distance, $\alpha$ and $\beta$ are the fitted slope and intercept, and $b = 1, \ldots, 4$ is the index for each box.
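The sketch below illustrates how such a linear distortion can be turned into a pointing prediction. Treating the distortion as acting on egocentric distances from the start zone while preserving visual directions follows the description above; the function and variable names, and the example coordinates, are ours.

```python
import numpy as np

def abathic_prediction(start, point_zone, target, slope=1.03, intercept=0.17):
    """Predict the plan-view pointing direction (degrees) if the remembered distance of
    the target from the start zone is distorted as d_est = slope * d_true + intercept,
    with its visual direction from the start zone preserved."""
    start, point_zone, target = (np.asarray(p, float) for p in (start, point_zone, target))
    offset = target - start
    d_true = np.linalg.norm(offset)
    d_est = slope * d_true + intercept              # linearly distorted egocentric distance
    remembered = start + offset / d_true * d_est    # same direction, distorted distance
    aim = remembered - point_zone
    return np.degrees(np.arctan2(aim[1], aim[0]))

print(abathic_prediction(start=[3.0, 0.0], point_zone=[-1.0, 2.0], target=[2.0, 4.0]))
```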

S4.4. Retrofit model
We can allow the box positions to vary and calculate the maximum-likelihood configuration of boxes that could account for the participant data (separately for each Experiment). The overall likelihood is the product of the individual likelihoods over all participants and shooting zones:

$L^{b,l,k} = L^{b,l,k}_{1,1} \cdot L^{b,l,k}_{1,2} \cdot L^{b,l,k}_{1,3} \cdot L^{b,l,k}_{2,1} \cdot \ldots \cdot L^{b,l,k}_{P,M},$

where $P = 20$ is the total number of participants and $M = 3$ is the total number of shooting zones.
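As an illustration of the structure of this fit, the sketch below finds 2D box positions that maximise the likelihood of a small set of pointing directions. The Gaussian model of angular errors, the use of scipy.optimize.minimize and the toy data are our assumptions, intended only to show the shape of the computation rather than reproduce the actual fitting code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(flat_boxes, shots, sigma=np.radians(10)):
    """Negative log-likelihood of observed pointing angles given candidate 2D box
    positions, assuming Gaussian angular errors (our assumption).  `shots` is a list
    of (zone_xy, box_index, observed_angle_rad) tuples pooled over participants and zones."""
    boxes = flat_boxes.reshape(-1, 2)
    nll = 0.0
    for zone, b, observed in shots:
        aim = boxes[b] - np.asarray(zone, float)
        predicted = np.arctan2(aim[1], aim[0])
        err = (observed - predicted + np.pi) % (2 * np.pi) - np.pi    # wrap to (-pi, pi]
        nll += 0.5 * (err / sigma) ** 2
    return nll

# Tiny made-up example: two boxes, shots from two pointing zones
shots = [([-1.0, 2.0], 0, np.radians(40)), ([-1.0, -2.0], 0, np.radians(70)),
         ([-1.0, 2.0], 1, np.radians(20)), ([-1.0, -2.0], 1, np.radians(55))]
fit = minimize(neg_log_likelihood, x0=np.ones(4), args=(shots,))
print(fit.x.reshape(-1, 2))   # maximum-likelihood ('retrofit') box positions
```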

S6. Raw data
Raw data for all the figures, and code to reproduce an example figure (Fig. 8), are available online.