Crowding for faces is determined by visual (not holistic) similarity: Evidence from judgements of eye position

Crowding (the disruption of object recognition in clutter) presents the fundamental limitation on peripheral vision. For simple objects, crowding is strong when target/flanker elements are similar and weak when they differ – a selectivity for target-flanker similarity. In contrast, the identification of upright holistically-processed face stimuli is more strongly impaired by upright than inverted flankers, whereas inverted face-targets are impaired by both – a pattern attributed to an additional stage of crowding selective for “holistic similarity” between faces. We propose instead that crowding is selective for target-flanker similarity in all stimuli, but that this selectivity is obscured by task difficulty with inverted face-targets. Using judgements of horizontal eye-position that are minimally affected by inversion, we find that crowding is strong when target-flanker orientations match and weak when they differ for both upright and inverted face-targets. By increasing task difficulty, we show that this selectivity for target-flanker similarity is obscured even for upright face-targets. We further demonstrate that this selectivity follows differences in the spatial order of facial features, rather than “holistic similarity” per se. There is consequently no need to invoke a distinct stage of holistic crowding for faces – crowding is selective for target-flanker similarity, even with complex stimuli such as faces.

Quantifying target-flanker similarity for our stimuli
In this section we consider the specific visual dimensions that underlie the selectivity of face crowding for target-flanker similarity. As well as reducing the propensity for holistic processing [1][2][3][4], the inversion of a face alters both the orientation of the facial features and the spatial order of these features (i.e. the eyes-above-nose-above-mouth pattern is reversed) 5. Differences between the target and flanker faces in either of these properties could be driving the variations in crowding that we observe. In Experiment 5, we independently assessed the contribution of the orientation and order of facial features in crowding by introducing two additional types of "Thatcherised" 6
When the energy within the image is summed across all spatial frequencies, it is apparent that each image contains considerably more energy within horizontal bands than within vertical or oblique bands (Supplementary Figure 1B), consistent with previous reports 7,8. The overall image content of these faces is therefore highly consistent (particularly since the upright and inverted faces are simply vertically flipped versions of each other, as are the two Thatcherised faces).
To characterise the differences in the target and flanker faces that could drive the crowding of an upright target face (as observed in Experiment 5), we next subtracted each filtered face from the upright filtered face on a pixel-by-pixel basis, separately within each spatial frequency and orientation band. The resulting image differences were then squared and summed across the image to compute the total energy difference between the faces. The average difference (across spatial frequency) within each orientation band is plotted in Supplementary Figure 1C. Firstly, the subtraction of an upright face from itself necessarily produces a zero difference in energy (data not shown). If we instead subtract an inverted face from the upright target (red line in Figure S1C), there are image differences at all orientations, though these are clearly larger in the horizontal bands than in the vertical or oblique ranges. The difference between an upright face and the Thatcherised face with inverted positions follows a nearly identical pattern, with extremely similar values (yellow line in Figure S1C). In contrast, the differences between an upright face and the Thatcherised face with inverted features (in the same positions as the upright face) are considerably reduced at all orientations (purple line in Figure S1C).
Figure S1: Image analyses for face stimuli. (A) Example face stimuli, taken from the Radboud Faces Database 9 and edited as described in the Methods section of Experiment 5, after convolution with log-Gaussian filters in the Fourier domain. All images are shown after convolution with a peak spatial frequency of 5 cycles per image. The first column shows the following horizontally filtered faces (from top to bottom): an upright face, an inverted face, a Thatcherised face with inverted features, and a Thatcherised face with inverted positions. The second column shows the same faces but vertically filtered. (B) The total Fourier energy in each image, summed across spatial frequencies within each orientation band (where 0° = horizontal and 90° = vertical). Note that because upright and inverted faces are simply flipped versions of the same stimulus, their values overlap completely. The same is true for the two Thatcherised faces. (C) Differences in Fourier energy between each face type and an upright face, plotted as a function of the orientation band. Lines show the average and shaded regions show the 95% confidence interval across spatial frequencies. Note that the difference between upright faces lies at zero.
In part, these differences in image structure arise because faces are 'top-heavy' stimuli with greater contrast variations in the upper half of the image (primarily due to the eyes and eyebrows) than the lower half 10,11. We can quantify this in our stimuli by summing the squared contrast energy at all orientations and spatial frequencies within the lower half of the image and subtracting this from the sum of squared contrast energy in the upper half of the image. These values are shown in Figure S1D.
The image-level differences in Supplementary Figures 1C and 1D show a strong similarity with the results from Experiment 5 (shown in Figure 5). The current analyses show that the inversion of a face produces strong changes in the spatial order of the image, particularly for horizontally oriented structure, with a shift from a 'top-heavy' to a 'bottom-heavy' configuration. In Experiment 5, this difference was associated with a reduction in crowding with inverted-face flankers, relative to the crowding produced by upright flankers. Similar changes in image content are produced by re-arranging the positions of the upright features to match an inverted face; accordingly, in Experiment 5 we observe a clear reduction in crowding with these flankers. There was far less change in image content when the features were rotated in place, which indeed produced no change in crowding (relative to that produced by upright flankers).
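As a rough illustration of the image analyses described above, the following Python sketch sums Fourier energy within orientation bands, computes the band-wise energy of the difference between two images, and computes a simple top-heaviness index. It is a simplified sketch under stated assumptions rather than the analysis code used for Figure S1: it applies Gaussian orientation weighting only (without the log-Gaussian spatial-frequency bands), and the function names, band centres and 15° bandwidth are hypothetical choices made for illustration.

```python
import numpy as np

def band_energy(image, centres_deg=(0, 45, 90, 135), sigma_deg=15.0):
    """Fourier energy within orientation bands (0 deg = horizontal structure)."""
    img = np.asarray(image, dtype=float)
    img = img - img.mean()                      # remove the DC component
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]  # vertical frequency
    fx = np.fft.fftfreq(img.shape[1])[None, :]  # horizontal frequency
    # Horizontal structure varies along y, so its energy sits at fx = 0.
    theta = np.degrees(np.arctan2(fx, fy)) % 180.0
    energies = []
    for centre in centres_deg:
        # Circular distance between orientations (period of 180 deg).
        dist = np.abs((theta - centre + 90.0) % 180.0 - 90.0)
        weight = np.exp(-dist ** 2 / (2.0 * sigma_deg ** 2))  # assumed filter shape
        energies.append(np.sum(power * weight ** 2))
    return np.array(energies)

def difference_energy(face_a, face_b, **kwargs):
    """Band-wise energy of the pixel-by-pixel difference between two images."""
    # Filtering is linear, so filtering the difference image is equivalent
    # to subtracting the two filtered images.
    diff = np.asarray(face_a, dtype=float) - np.asarray(face_b, dtype=float)
    return band_energy(diff, **kwargs)

def top_heaviness(image):
    """Squared contrast in the upper half minus that in the lower half."""
    img = np.asarray(image, dtype=float)
    contrast = img - img.mean()
    half = img.shape[0] // 2
    return np.sum(contrast[:half] ** 2) - np.sum(contrast[-half:] ** 2)

# Toy stimulus (not a face): horizontal stripes, 5 cycles per image.
rows = np.arange(64)[:, None]
stripes = np.sin(2 * np.pi * 5 * rows / 64) * np.ones((1, 64))
energy = band_energy(stripes)  # energy concentrated in the 0 deg band
```

With real stimuli, `difference_energy(upright, flanker)` would be evaluated for each flanker type, and a vertical flip of an image with an even number of rows reverses the sign of `top_heaviness`.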
The differences between upright and inverted faces in the distribution of image content have previously been used to examine holistic processes in face recognition, showing that horizontal content is a more effective driver of the behavioural signatures of configural processing 7,12,13, and that top-heavy stimulus configurations drive infants' looking preferences 10,11. However, with crowding it is important to note the relations between target and flanker stimuli along these dimensions. Here we show that this image content differs markedly between upright and inverted faces, as well as for Thatcherised faces with large changes in the spatial ordering of features. These differences in image content could determine the strength of crowding in face stimuli in a similar fashion to the changes produced by differences in other dimensions such as colour [14][15][16][17], contrast polarity 18, and orientation [19][20][21][22]. For instance, differences in feature positions of this kind have previously been shown to modulate crowding within letter-like stimuli: an upright T will be crowded less by inverted T flankers than by other configurations 23. We consider these mechanisms further in the General Discussion.

Quantifying target-flanker similarity with Mooney faces
The results of Experiment 5, combined with the above analyses of the orientation energy within face stimuli, suggest that the crowding of face stimuli is driven by target-flanker similarity in the top-heavy vertical configuration of horizontally oriented image structure, similar to the 'bar codes' argued to influence holistic processing in general 24,25. However, evidence for a holistic stage of crowding derives not only from the recognition of photographic face stimuli 26, but also from the use of Mooney faces 27, where local image content is much more difficult to discern 28. As we argue in the General Discussion, although it is true that these stimuli are degraded in terms of the visibility of their local image features (e.g. it is often difficult to make out a nose from an isolated patch of the image), it is not the case that Mooney faces contain no features at all. For instance, Experiment 3 of the study by Farzin et al. 27 demonstrates that these stimuli are susceptible to self-crowding between local features in a similar way to photographic images of faces 29.
Mooney faces also contain a spatial configuration of oriented image structure that is, by and large, similar to that of regular faces 24 . Here we consider the effect of inversion on this image structure and how this may interact with crowding.
To examine the image content of Mooney faces, we performed the same set of analyses on a set of Mooney faces as on our photographic face stimuli above. A set of 24 Mooney faces was obtained from the Mooney-MF database within the Psychological Image Collection at the University of Stirling *. Twelve were male and 12 female, with several faces that match those used previously by Farzin et al. 27. To match stimuli to those analysed above, images were reduced to the same dimensions (395×292 pixels), with an oval-shaped aperture of the same dimensions then placed around the stimuli to match the image shape required for presentation as crowding stimuli (as in Farzin et al. 27). Two example images are shown in Supplementary Figures 2A and 2B.
What is immediately apparent upon examining these Mooney images (and example images in the study of Farzin et al. 27) is their increased variability relative to standard face stimuli. In addition to standard frontal views of the face,
* Available online at http://pics.stir.ac.uk/
The outcome of these analyses suggests that similar processes can account for the effects of crowding on Mooney faces 27 as those that we report to account for the effects on photographic face images 26. Although Mooney faces are certainly more difficult to recognise than photographic face images, perhaps due to the attenuation of horizontal content shown in our analyses above, it is not the case that they contain no oriented content at all. Our analyses here demonstrate that a bank of oriented filters produces outputs with a bias towards the vertical, and that the structure of these variations changes from a predominantly top-heavy configuration to one that is predominantly bottom-heavy when inverted. As in the results of our Experiment 5, this shift from a top-heavy to a bottom-heavy configuration in flanker elements (relative to an upright target face) would be expected to reduce crowding, similar to the effects seen with letter-like stimuli 23. As we argue in the General Discussion, these stimuli can certainly produce inter-feature crowding (Farzin et al. 27, Experiment 3), suggesting that these contours are themselves susceptible to crowding, just as occurs within photographic face stimuli 29. The effects of task difficulty are also apparent with these stimuli (Farzin et al. 27, Experiment 4), just as seen in previous studies with photographic images 26 and in Experiment 4 of our study.
We therefore argue that Mooney faces do not require an additional stage of holistic crowding to account for the observed effects of flanker orientation; simpler operations based on target-flanker similarity, of the kind invoked more generally to account for crowding, suffice to account for the entirety of these effects.

Simulations of the effects of task difficulty on face crowding
In Experiment 2 we demonstrate that the selectivity of face crowding is identical for upright and inverted target faces when the task involves judgements of horizontal eye position. This differs from the patterns observed in Experiment 1 for judgements of identity, and in Experiment 3 with judgements of vertical eye position (Figure 3). We attribute this to the susceptibility of these latter tasks to inversion [30][31][32], not because this eliminates holistic processing per se, but because the resulting increase in task difficulty obscures the selectivity of crowding for target-flanker similarity. Accordingly, the results of Experiment 4a demonstrate that it is possible to obscure the selectivity of crowding with upright faces by increasing task difficulty. We achieved this by reducing the interocular separation in our face stimuli. Conveniently, this also presents the opportunity to conduct model simulations of this process, and to consider how this finding generalises to other tasks like identity judgements.
We thus performed a set of simulations on the effect of crowding for faces, in the context of horizontal eye judgements. This allowed us to consider both the potential mechanisms underlying these crowding effects and the effect of task difficulty. We first assume that there exists a population of detectors that is sensitive to dimensions such as eye position. This is consistent with theoretical proposals regarding "face space" 33, adaptation effects that shift the perceived eye position within faces 34,35, and physiological measurements in the inferior temporal (IT) lobe of macaques 36,37. Here we simulate a population of detectors selective for interocular eye separation in particular. We do so for ease of modelling, rather than as a proposal that a population of this nature would be specifically utilised for this purpose. Of course, interocular eye separation could be encoded either wholly by, or in conjunction with, cells selective for other facial properties.
Within this population, we assume that each detector is sensitive to a range of eye separations with a Gaussian tuning function that has a peak sensitivity centred on a particular eye separation, and some sensitivity to nearby values of eye separation, similar to the selectivity of V1 neurons for orientation 38, MT/V5 neurons for direction 39, and so on. Following the principles of population coding 40, the distribution of the resulting population activity would be a Gaussian function centred on the eye separation of the stimulus, with a bandwidth of activity equivalent to the sensitivity bandwidth of the underlying detectors. The perceived value of eye separation could then be read out from this distribution (e.g. as the peak response, or via maximum likelihood estimation).
The relationship between detector sensitivity and the population response means that we can simulate the population response directly as a Gaussian function.
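To make this step explicit, one way to write the assumed relationship is given below. The notation is introduced here purely for illustration: s is the eye separation of the stimulus, μ_i the preferred eye separation of detector i, σ the tuning bandwidth (assumed identical for all detectors, with preferred values densely and evenly spaced), r_i the response of detector i, and R_s the population profile.

```latex
r_i(s) = \exp\!\left(-\frac{(s - \mu_i)^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
R_s(\mu) = \exp\!\left(-\frac{(\mu - s)^2}{2\sigma^2}\right)
```

Because each tuning function depends only on the difference between s and μ_i, reading the responses across detectors (as a function of preferred value μ) gives a Gaussian centred on s with the same σ, which is why the population profile can be generated directly.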
We do so here by generating a Gaussian function with a base value of 0.1 and a peak of 1.0, using one free parameter for the standard deviation (to mimic the sensitivity bandwidth of the underlying detectors) and another for the magnitude of Gaussian noise that was added to this response distribution. If we encode the "normal" reference face with an eye-separation value of zero, then the population response to this face will be a Gaussian distribution centred on zero, as shown in Figure S3A. These responses are shown as the average of 720 trials (as in our experiments), generated with an SD of 8 pixels for illustration, and with a comparatively large range in the x-axis of ±60 pixels to make the population response clear. When the crowded target face is the same as the reference (on "target same" trials) then a similar distribution would arise for this interval. A target face with a large inwards shift (the easily detected 20 pixel shift of Experiment 2) would produce a similar Gaussian with a mean located at -20 pixels.
With this model, we can depict the task as involving a judgement regarding whether the peak of the population response lies to one side or the other of a criterion value, depicted here as a dashed line at -10 pixels (Figure S3A). This is an ideal criterion for the 20 pixel eye shifts, sitting midway between the peak response to either face type. Peaks to the right of this criterion would be classed as the "same" as the reference; those to the left would be classed as "different". We can therefore simulate the task performed by our observers in this way.
On crowded trials, targets were surrounded by six flankers: three reference faces and three faces with eyes shifted inwards (by 20 pixels in Experiment 2). The combined population response to these flankers would be a bimodal profile with peaks at each of the two eye-separation values, as shown in Figure S3B (again the average of 720 trials). In order to simulate crowding in these instances, we follow recent models of crowding that depict the process as a pooling of target and flanker elements [41][42][43][44][45], and in particular models that attribute this pooling to a combination of population responses to the target and flanker elements 42,43. Rather than directly averaging these population profiles, we take a weighted average that allows a modulation of the precise combination of target and flanker responses, similar to previous models 41,46.
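The population responses and decision rule described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the simulation code itself: the "peak of 1.0" is taken as the absolute peak height (so the Gaussian has an amplitude of 0.9 above the base), noise is assumed to be added independently at each point of the profile, the flanker profile is assumed to be the average of the six individual flanker responses, and the noise magnitude of 0.3 and all function names are hypothetical.

```python
import numpy as np

X = np.arange(-60, 61)  # eye-separation axis in pixels (0 = reference face)

def population_response(centre, sd, noise_sd, rng, x=X):
    """Noisy Gaussian population profile with a base of 0.1 and a peak of 1.0."""
    clean = 0.1 + 0.9 * np.exp(-(x - centre) ** 2 / (2.0 * sd ** 2))
    return clean + rng.normal(0.0, noise_sd, size=x.shape)

def classify(response, criterion, x=X):
    """Report 'different' when the peak lies to the left of the criterion."""
    peak = x[np.argmax(response)]
    return "different" if peak < criterion else "same"

rng = np.random.default_rng(1)
shift = 20                      # inwards eye shift in pixels (Experiment 2)
criterion = -shift / 2          # midway between 0 and -20 pixels

# Average of 720 noisy trials for each target type (SD of 8 pixels, as in Figure S3A).
mean_same = np.mean(
    [population_response(0, 8.0, 0.3, rng) for _ in range(720)], axis=0)
mean_diff = np.mean(
    [population_response(-shift, 8.0, 0.3, rng) for _ in range(720)], axis=0)

# Flankers: three reference faces and three shifted faces (noise-free here).
flanker_centres = [0, 0, 0, -shift, -shift, -shift]
mean_flankers = np.mean(
    [population_response(c, 8.0, 0.0, rng) for c in flanker_centres], axis=0)
```

With these illustrative values, the averaged flanker profile has a dip at the criterion between two peaks near 0 and -20 pixels, and `classify` returns the expected label for each averaged target profile.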
When the weighting of the target is high in this combination (relative to that of the flankers) there will be less crowding than when the weighting of the target is low. In this model, the precise weighting of the target was set as a third free parameter ranging from 0 to 1, with the flanker weighting equal to one minus this value. If we multiply the population response of the target by the target-weighting value, and the population response to the flankers by the flanker-weighting value, then the crowded combination is produced by adding these values.
An example crowded population profile for the large eye-shifts is plotted in Figure S3C, produced with a target weight of 0.66. This gives a bimodal response, albeit with a higher peak (on average) for the target eye separation, given the higher weighting of the target response in this "target same" trial. For this population and this value of the target weight, we observe the reverse pattern of bimodality on target-different trials when the response to the target would be centred on -20 pixels. Nonetheless, in both cases, the secondary peak in this response distribution (caused by the flankers) increases the likelihood that noise in the population response will lead to errors on individual trials. The potential for errors increases as the flanker weighting increases.
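The weighted combination described above can be written as a short Python sketch. It assumes noise-free profiles with a base of 0.1 and an absolute peak of 1.0, and a flanker profile formed as the mean of the six flanker responses; the function names and the SD of 8 pixels are illustrative choices, and the exact shape of the resulting profile will depend on these assumptions.

```python
import numpy as np

def profile(centre, sd, x):
    """Noise-free Gaussian population profile (base 0.1, peak 1.0)."""
    return 0.1 + 0.9 * np.exp(-(x - centre) ** 2 / (2.0 * sd ** 2))

def crowded_profile(target_centre, flanker_centres, target_weight, sd, x):
    """Weighted average of the target response and the mean flanker response."""
    target = profile(target_centre, sd, x)
    flankers = np.mean([profile(c, sd, x) for c in flanker_centres], axis=0)
    # The flanker weight is one minus the target weight.
    return target_weight * target + (1.0 - target_weight) * flankers

x = np.arange(-60, 61)
flanker_centres = [0, 0, 0, -20, -20, -20]  # three reference, three shifted

target_same = crowded_profile(0, flanker_centres, 0.66, 8.0, x)
target_diff = crowded_profile(-20, flanker_centres, 0.66, 8.0, x)
low_weight = crowded_profile(0, flanker_centres, 0.30, 8.0, x)  # stronger crowding
```

Lowering `target_weight` raises the response at the flanker-only eye separation (here -20 pixels on a "target same" trial), which is the sense in which a higher flanker weighting increases the potential for errors once noise is added.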
To obtain a release from crowding when the flankers are inverted (as in the "flankers differ" condition of our results), we can therefore simply reduce the weight of the flankers in the weighted average. This is similar to the way that prior models have simulated the effects of target-flanker similarity 46 and the reduction in crowding with an increase in target spacing 42. In other words, with an upright face amongst upright flankers there is a high flanker weight in the average, which is then reduced for an upright target amongst inverted flankers. The precise degree of the release from crowding is the fourth free parameter in our model, implemented as a value between 0 and 1 that is subtracted from the target weight in the "flankers match" condition.
Task difficulty is introduced here simply by decreasing the eye shift to 10 pixels, as in Experiment 4a. As with the larger eye shift, the population response distribution to a "normal" reference face would be centred on zero (Figure S3D). When the target matches the reference face, the response would be identical. However, when the target face has a small inward eye shift ("target different" trials), the response would be represented by a Gaussian distribution centred at -10 pixels. The ideal criterion value would thus lie at -5 pixels, sitting midway between the peak response value to a normal face and the peak to a face with a 10 pixel inwards eye shift (dashed line, Figure S3D). The target distribution is then combined with the population response to the flankers. As in Experiment 2, three flankers had "normal" eyes and three had eyes shifted inwards, in this case by 10 pixels. Given the broad response profiles to these values, the reduction in eye shift for this experiment means that the combined response to the six flankers in the "flankers match" condition would have a unimodal profile centred on the criterion value (Figure S3E).
As seen in Figure S3F, the response profile for the combined target and flanker responses (produced with a weight of 0.66) is also unimodal, with a peak between the target value and the decision criterion. As such, there is a higher rate of responses on the side of the flankers compared to the 20 pixel eye shift, which would increase the number of errors. A similar increase in errors would be observed in "target different" trials. This illustrates that when difficulty is increased by reducing the eye shift (as in Experiment 4a), there is greater overlap between target and flanker distributions, and thus a greater propensity for errors to arise due to noise.
From the above distributions we can obtain d' values by simulating both target-same and target-different trials and extracting the peak population response on each trial. Using the location of these peaks on each trial in relation to the decision criterion (as above), we can score each trial as producing a correct or incorrect response, and then compute a percent correct score in each condition, as in our experiments. We performed these simulations with the same number of trials as the real experiment (720 trials per observer with 5 observers), repeated 1000 times. The best-fitting parameters for the model were an SD of 13.47 pixels, a noise magnitude of 0.32, target weighting of 0.66 (out of 1), and a crowding release of 0.09 (the difference in target weight in the "flankers match" vs. "flankers differ" conditions).
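The trial-by-trial procedure can be sketched in Python as below, using the best-fitting parameter values reported above. Several details are assumptions made for illustration and are not specified in the text: equal numbers of target-same and target-different trials, noise added after the target and flanker profiles are combined, a response axis of ±60 pixels, d' computed as z(hits) minus z(false alarms) with rates clipped away from 0 and 1, and the "flankers match" target weight taken as 0.66 minus the release of 0.09. The sketch is therefore not expected to reproduce the reported values exactly.

```python
import numpy as np
from statistics import NormalDist

def profile(centre, sd, x):
    """Noise-free Gaussian population profile (base 0.1, peak 1.0)."""
    return 0.1 + 0.9 * np.exp(-(x - centre) ** 2 / (2.0 * sd ** 2))

def simulate_dprime(shift, sd, noise_sd, target_weight=None, n_trials=720, seed=0):
    """d' for one condition; target_weight=None simulates an uncrowded target."""
    rng = np.random.default_rng(seed)
    x = np.arange(-60.0, 61.0)
    criterion = -shift / 2.0  # midway between the two peak locations
    flankers = 0.5 * (profile(0.0, sd, x) + profile(-shift, sd, x))
    n_each = n_trials // 2    # assumed: equal numbers of each trial type
    rate_different = {}
    for label, centre in (("same", 0.0), ("different", -float(shift))):
        clean = profile(centre, sd, x)
        if target_weight is not None:
            clean = target_weight * clean + (1.0 - target_weight) * flankers
        # One noisy population response per trial; read out its peak location.
        trials = clean + rng.normal(0.0, noise_sd, size=(n_each, x.size))
        peaks = x[np.argmax(trials, axis=1)]
        rate_different[label] = np.mean(peaks < criterion)
    # Clip rates so that the z-transform stays finite (assumed correction).
    lo, hi = 0.5 / n_each, 1.0 - 0.5 / n_each
    hit = float(np.clip(rate_different["different"], lo, hi))
    false_alarm = float(np.clip(rate_different["same"], lo, hi))
    z = NormalDist().inv_cdf
    return z(hit) - z(false_alarm)

sd, noise_sd, weight, release = 13.47, 0.32, 0.66, 0.09
for shift in (20, 10):  # large (easy) and small (difficult) eye shifts
    uncrowded = simulate_dprime(shift, sd, noise_sd)
    differ = simulate_dprime(shift, sd, noise_sd, target_weight=weight)
    match = simulate_dprime(shift, sd, noise_sd, target_weight=weight - release)
    print(shift, round(uncrowded, 2), round(differ, 2), round(match, 2))
```

Note that task difficulty enters only through the `shift` argument, while the four free parameters are held constant across the two difficulty levels.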
The results of these simulations are shown in Figure S3G, where the mean of the 1000 simulations is shown as a white circle. For the "easy" condition with large eye shifts, the model clearly follows the data: d' values are high when the target is uncrowded, decreased with flankers that match the upright orientation of the target, and less impaired when the flankers differ by being inverted. For the difficult condition ("small shift"), uncrowded performance drops significantly, with a further decrease for the crowded conditions. The release from crowding is then considerably muted in these simulations: because performance is reduced overall, the effect of noise is greater and the release from crowding has far less effect.
The mean difference between crowded d' values for our model is 0.42 with large eye shifts and 0.20 for the more difficult small eye-shift condition. Note that the effect of task difficulty in this sense has nothing to do with our free parameters; it is introduced simply by altering the input values for eye separation from -20 to -10 in the target-different conditions. It is the combination of lowered performance and noise that flattens the selectivity of crowding for target-flanker similarity.
How might we then implement the effect of inversion within this framework?
For example, for the effects we observe in Experiment 3 with vertical eye judgements, inversion could be implemented in several ways. Inversion is generally thought to produce a shift from configural processing to more local processing 1. This may result in either a specific impairment in configural dimensions (that are not coded in a local/featural manner) or the use of inappropriate facial landmarks. To model the effects of vertical eye-position in this sense, inversion could be implemented as an increase in the noise associated with these eye judgements, thereby increasing the propensity for errors, or it may broaden either the spatial or featural selectivity of the detectors sensitive to eye position (as suggested for the orientation selectivity of face identification 8, for instance). In these cases, the effects of inversion would be similar to the effect of reduced eye displacements modelled above.
Of course, inversion also disrupts the identification of faces 31,32. These effects could similarly be modelled in a crowding paradigm via an increase in noise, an increase in the spatial or featural bandwidth of the underlying detectors, or even as a shift in the population responses towards the decision boundary. In this case, the population response would be distributed across dimensions more relevant to identity (e.g. within a "face space" 33). Nonetheless, by expanding out from our simple model of interocular separation, we can consider that the effect of crowding on judgements of identity may arise in a similar fashion to that observed herein with judgements of eye position. Importantly, however, it is not the mechanisms of crowding that would change with inversion in these cases, but rather the difficulty of the task, which in turn determines whether the selectivity of crowding for target-flanker similarity is evident or not. In this sense, we argue that crowding shares a common mechanism in all cases, rather than requiring processes specific to the holistic encoding of faces.