Near-optimal integration of facial form and motion

Human perception consists of the continuous integration of sensory cues pertaining to the same object. While it is well established that humans integrate low-level cues optimally, weighting each cue in proportion to its relative reliability, the integration processes underlying high-level perception are much less well understood. Here we investigate cue integration in a complex high-level perceptual system, the human face processing system. We tested cue integration of facial form and motion in an identity categorization task and found that an optimal model successfully predicted subjects’ identity choices. Our results suggest that optimal cue integration may be implemented across different levels of the visual processing hierarchy.

block contained 40 trials, half of which showed Laura and half Susan. The order of form and motion blocks was randomized across subjects, and the combined block was always shown last. At the beginning of each block, subjects were informed about the block type. In form blocks, subjects were asked to discriminate the face stimuli solely based on facial form and were informed that facial motion was uninformative (i.e., the average of both facial motions), and vice versa for motion blocks. During combined blocks, subjects had to discriminate the face stimuli based on both facial form and motion. At the beginning of each trial, a short cue (letter "F", "M", or "C" for form, motion and combined blocks, respectively) was presented for 0.3 s to remind subjects of the block type (i.e., the task to perform), followed by a 0.2 s fixation cross centred on the screen. Following the fixation period, a face stimulus was shown for 1 s, showing either Laura or Susan with their basic facial form (100% facial form, average facial motion), facial motion (100% facial motion, average facial form) or both (100% facial form and motion). A response screen ("Laura or Susan?") then appeared and remained until a response was recorded (left or right arrow for Laura or Susan, respectively) or a maximum duration of 2 s was reached. Subjects could respond during stimulus presentation (in which case the response screen did not appear) or during the presentation of the response screen.
At the end of each trial, feedback ("correct", "wrong", or "too late") was shown for 0.5 s on the screen. Note that in the training phase, we only showed the basic face stimuli (i.e., 100%) and subjects were never shown any of the intermediate morph stimuli or the "old" morphs.

Testing phase
During the testing phase, which lasted about 70 minutes, subjects had to categorize face stimuli as Laura or Susan based on form, motion or both cues combined in separate blocks (Fig. 1C), similar to the training phase. Subjects performed five form blocks, five motion blocks and 14 combined blocks in randomized order, and were informed about the block type at the beginning of each block.
In contrast to the training phase, intermediate morph levels and "old" morphs were shown in addition to the basic face stimuli. Subjects were explicitly told about the occurrence of intermediate and "old" morphs. The trial sequence was the same as for the training phase, except that no feedback was provided at the end of a trial. In form and motion blocks (Fig. 1A, "Form", "Motion"), the basic face stimuli (morph levels 0 and 1), the nine intermediate morph levels (0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8), and the "old" morphs at all 11 morph levels were each presented three times for a total of 66 trials per block (11 morph levels x 2 old on/off x 3 repetitions).
Combined blocks contained 22 "congruent" (Fig. 1A, "Comb") and 44 "incongruent" trials (Fig. 1A, "Comb, +Δ", "Comb, −Δ"). On "congruent" trials, a common morph level was chosen for form and motion from one of the 11 values listed above. Each face stimulus was presented only once, for a total of 22 "congruent" trials per block (11 morph levels x 2 old on/off). On "incongruent" trials, we showed face stimuli that had different morph levels for form and motion: when the original morph level was s, the form morph level was s+Δ/2 and the motion morph level was s−Δ/2. In "Comb, +Δ" trials, Δ was 0.15, and in "Comb, −Δ" trials, Δ was −0.15. To allow for such incongruence also at the lowest and highest morph levels, we replaced, only in the "incongruent" trials, the 0 and 1 morph levels by 0.1 and 0.9, respectively. Each face stimulus was presented once, yielding 44 "incongruent" trials (11 morph levels x 2 old on/off x 2 values of Δ) per block. Note that subjects were not aware of the presence of "incongruent" trials.
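As a concreteness check, the composition of one combined block can be enumerated in a few lines of Python (a sketch only; the function and variable names are ours and are not part of the experiment code):

```python
# Assumed from the text: 11 morph levels (0 and 1 plus the nine intermediates)
# and Delta = +/-0.15 on incongruent trials.
morph_levels = [0.0, 0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 1.0]

def combined_block_trials(levels=morph_levels, delta=0.15):
    """Enumerate (form_morph, motion_morph, old) tuples for one combined block."""
    trials = []
    for old in (False, True):
        # "Congruent" trials: a common morph level for form and motion.
        for s in levels:
            trials.append((s, s, old))
        # "Incongruent" trials: 0 and 1 are replaced by 0.1 and 0.9 so that the
        # +/- delta/2 offsets stay inside the morph range.
        clipped = [0.1 if s == 0.0 else 0.9 if s == 1.0 else s for s in levels]
        for s in clipped:
            for d in (delta, -delta):
                trials.append((s + d / 2, s - d / 2, old))
    return trials

trials = combined_block_trials()
```

Running this enumeration yields 22 congruent and 44 incongruent trials per block, matching the counts above.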

General model structure
Each model consists of an encoding stage (generative model) and a decision stage. In the decision stage, the observer applies a decision rule to determine their response, "Laura" or "Susan". The models we tested differ only in this decision rule. One of them uses an optimal decision rule. Although this model is similar to those widely used previously 1,2, the underlying assumptions are worth spelling out, especially because we use a binary categorization task and because the derivation needs to be modified for the suboptimal decision rules.

Encoding stage (generative model)
The generative model describes the task statistics and the observer's measurement noise.
Task statistics. Each trial is characterized by a motion morph parameter s_m and a form morph parameter s_f (both between 0 and 1) (Fig. 1A). Furthermore, each trial is characterized by the occurrence of "old" (i.e., old on/off), denoted by a categorical variable c taking values 0 and 0.35 (Fig. 1B). As described above, the value of 0.35 was chosen based on preliminary testing during the familiarization phase so that subjects clearly perceived the faces as "old" but were still able to discriminate Laura from Susan. In "old on" conditions, the displayed form was a mix consisting of a proportion 1 − c = 0.65 of s_f and a proportion c = 0.35 of the "old" perceptual average. Since the average "old" face consisted of equal parts of Laura and Susan, it corresponds to morph level 0.5, so that the effective form morph level is (1 − c)s_f + 0.5c.

Measurement noise. We denote the noisy measurements of each feature by x_m and x_f for motion and form, respectively. We assume that these measurements are conditionally independent given s_m and s_f, and follow Gaussian distributions:

x_m ~ N(s_m, σ_m²),   x_f ~ N((1 − c)s_f + 0.5c, σ_f²).   (1)
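The generative model can be summarized in code. The following Python sketch assumes, per the description above, that the "old" manipulation shifts the effective form morph towards the Laura/Susan average at morph level 0.5 (the function name is ours):

```python
import random

def simulate_measurements(s_m, s_f, c, sigma_m, sigma_f, rng=random.Random(0)):
    """Draw one trial's noisy measurements (x_m, x_f).

    Assumes the "old" manipulation mixes the form morph towards the
    Laura/Susan average (morph level 0.5): effective form = (1-c)*s_f + 0.5*c.
    """
    x_m = rng.gauss(s_m, sigma_m)
    x_f = rng.gauss((1 - c) * s_f + 0.5 * c, sigma_f)
    return x_m, x_f
```

With c = 0 (old off) the form measurement is centred on s_f itself; with c = 0.35 it is pulled towards 0.5.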

Decision stage
Optimal model. Next, we model the observer's inference process. The optimal model is largely identical to the optimal model in earlier cue combination studies 1,2 .
However, it is worth spelling out the assumptions; moreover, some details are specific to our design. The optimal observer computes the probability of a stimulus s given the noisy measurements x_m and x_f. We make the common assumption that the observer acts as if they believe that there is only a single s to be inferred; this is somewhat plausible since no subject reported noticing a conflict.
We denote the likelihood ratio over face category by

L(Susan)/L(Laura) = p(x_m, x_f | Susan)/p(x_m, x_f | Laura),   (2)

where p(x_m, x_f | Susan) = ∫ p(x_m, x_f | s) p(s | Susan) ds, and analogously for Laura; p(s | Susan) and p(s | Laura) are the probabilities of s under Susan and Laura, respectively. We assume that the observer believes these distributions of s to be uniform on a large interval: from −a to some category boundary b for Laura, and from the same b to a for Susan. Then

p(s | Laura) = 1/(a + b) on [−a, b],   p(s | Susan) = 1/(a − b) on [b, a].   (3)

We assume a >> b, so that we can make the approximations 1/(a + b) ≈ 1/(a − b) and [−a, b] ≈ (−∞, b], [b, a] ≈ [b, ∞). The optimal (accuracy-maximizing) observer would report "Susan" when L(Susan) > L(Laura). According to equation (3), this is equivalent to

∫_b^∞ L_s(s) ds > ∫_−∞^b L_s(s) ds,

or in other words, to the condition that the median of the (normalized) likelihood function over s, which we define as L_s(s) = p(x_m, x_f | s), exceeds b. We now introduce the notation N(y; µ, σ²) for a normal distribution over y with mean µ and variance σ². We assume that the observer knows the value of c on any trial. Then, the likelihood function L_s(s) can be evaluated as

L_s(s) = p(x_m, x_f | s)
       = p(x_m | s) p(x_f | s)
       = N(x_m; s, σ_m²) N(x_f; (1 − c)s + 0.5c, σ_f²)
       ∝ N(s; x_m, 1/J_m) N(s; (x_f − 0.5c)/(1 − c), 1/J_f)
       ∝ N(s; (J_m x_m + J_f (x_f − 0.5c)/(1 − c))/(J_m + J_f), 1/(J_m + J_f)),   (4)

where we used the assumption of conditional independence in going from the first line to the second, absorbed s-independent factors into the proportionality sign in the second-to-last line, and introduced notation for precision:

J_m = 1/σ_m²,   J_f = (1 − c)²/σ_f².

In the special case that c = 0, the likelihood L_s(s) reduces to the common expression for integrated likelihoods 3.
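The precision-weighted combination implied by the integrated likelihood can be sketched as follows (a Python sketch under our reading of the derivation, with precisions J_m = 1/σ_m² and J_f = (1 − c)²/σ_f²; the function name is ours):

```python
def combined_estimate(x_m, x_f, c, sigma_m, sigma_f):
    """Precision-weighted estimate of s implied by the integrated likelihood.

    The form measurement is first "un-mixed" from the old average:
    (x_f - 0.5*c)/(1 - c); the result is then combined with x_m,
    each weighted by its precision.
    """
    J_m = 1.0 / sigma_m**2
    J_f = (1 - c)**2 / sigma_f**2
    s_f_hat = (x_f - 0.5 * c) / (1 - c)
    return (J_m * x_m + J_f * s_f_hat) / (J_m + J_f)
```

With c = 0 and equal noise levels, the estimate reduces to the simple mean of the two measurements; the more reliable cue otherwise pulls the estimate towards itself.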
Since the median of a normal distribution is the same as its mean, the optimal decision rule for an observer is to report "Susan" when

(J_m x_m + J_f (x_f − 0.5c)/(1 − c))/(J_m + J_f) > b.   (5)

Optimal model with incorrect beliefs. We now consider a variant of the optimal model. The optimal observer possesses and utilizes complete knowledge of the task structure. However, at least one aspect of this knowledge is rather unrealistic, namely the knowledge that the "old" face is a morph between Laura and Susan.
Human observers might therefore behave as if they do not have this knowledge and instead assume that the "old" version of Laura is pure Laura (instead of being morphed into the "old" average of Laura and Susan), and the "old" version of Susan is pure Susan. The assumed noise distribution for form then becomes

x_f ~ N(s_f, σ_f²),

which corresponds to assuming that c = 0 even though in reality it is not.
As a consequence, the decision rule, equation (5), simplifies to

(J_m x_m + J_f x_f)/(J_m + J_f) > b,   with J_f = 1/σ_f².   (6)

Best-cue model. In the best-cue model, the observer relies only on the cue with the highest precision J. Thus, the decision rule of equation (5) is replaced by: report "Susan" when x_m > b if J_m > J_f, and when (x_f − 0.5c)/(1 − c) > b otherwise.

Simple-average model. In the simple-average model, the observer responds "Susan" when

(x_m + (x_f − 0.5c)/(1 − c))/2 > b.
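The four decision rules can be contrasted in a single function (a Python sketch; the un-mixing of the form measurement and the exact forms of the best-cue and simple-average rules follow our reading of the text above, and the model labels are ours):

```python
def respond_susan(x_m, x_f, c, sigma_m, sigma_f, b, model="optimal"):
    """Apply one of the four decision rules (True = respond "Susan").

    "incorrect" ignores the old-morph structure (acts as if c = 0),
    "best" uses only the more precise cue, and "average" weights both
    cue estimates equally.
    """
    J_m = 1.0 / sigma_m**2
    if model == "incorrect":
        J_f = 1.0 / sigma_f**2
        return (J_m * x_m + J_f * x_f) / (J_m + J_f) > b
    J_f = (1 - c)**2 / sigma_f**2
    s_f_hat = (x_f - 0.5 * c) / (1 - c)  # un-mix the form measurement
    if model == "optimal":
        return (J_m * x_m + J_f * s_f_hat) / (J_m + J_f) > b
    if model == "best":
        return (x_m if J_m >= J_f else s_f_hat) > b
    if model == "average":
        return (x_m + s_f_hat) / 2 > b
    raise ValueError(model)
```

The models make identical predictions when the two cues agree and the noise levels are equal; they diverge on incongruent trials and when one cue is much more reliable than the other.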

Experimental predictions
Finally, we derive experimental predictions for each of the models, based on their respective decision rules. To this end, we need the probability that the decision rule is satisfied as a function of the stimulus.

Optimal model. In the optimal model, the left-hand side of the decision rule is (J_m x_m + J_f (x_f − 0.5c)/(1 − c))/(J_m + J_f), which is normally distributed with mean (J_m s_m + J_f s_f)/(J_m + J_f) and variance 1/(J_m + J_f). Therefore, the probability of responding "Susan" is

p("Susan") = Φ(√(J_m + J_f) [(J_m s_m + J_f s_f)/(J_m + J_f) − b]),

where Φ is the conventional notation for the cumulative standard normal distribution (in Matlab: normcdf(…,0,1)). Finally, if the subject guesses randomly with probability λ, the probability of responding "Susan" becomes

p("Susan") = λ/2 + (1 − λ) Φ(√(J_m + J_f) [(J_m s_m + J_f s_f)/(J_m + J_f) − b]).

Optimal model with incorrect beliefs. We follow the same logic as in the optimal model, but now with a different decision rule, equation (6). The left-hand side of that equation has mean (J_m s_m + J_f [(1 − c)s_f + 0.5c])/(J_m + J_f) and variance 1/(J_m + J_f), and therefore the probability of responding "Susan" is

p("Susan") = λ/2 + (1 − λ) Φ(√(J_m + J_f) [(J_m s_m + J_f [(1 − c)s_f + 0.5c])/(J_m + J_f) − b]).
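The psychometric prediction for the optimal model, including the lapse rate λ, can be written compactly (a Python sketch; Φ is implemented via math.erf rather than Matlab's normcdf, and the function name is ours):

```python
import math

def p_susan(s_m, s_f, c, sigma_m, sigma_f, b, lam):
    """Predicted probability of responding "Susan" under the optimal model.

    lam is the lapse rate: with probability lam the subject guesses randomly.
    """
    J_m = 1.0 / sigma_m**2
    J_f = (1 - c)**2 / sigma_f**2
    mean = (J_m * s_m + J_f * s_f) / (J_m + J_f)
    z = (mean - b) * math.sqrt(J_m + J_f)
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return lam / 2 + (1 - lam) * Phi
```

At the category boundary (s_m = s_f = b) the prediction is exactly 0.5, and the lapse rate compresses the curve into the range [λ/2, 1 − λ/2].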

Model fitting
In any model, the probability of responding "Susan" within a given trial depends on the stimulus and the model parameters. We estimated the parameters of each model by maximizing the log likelihood of the subject's responses. To ensure that the optimization returned the global rather than a local maximum, we ran the function ten times using different initial parameters drawn from Gaussian distributions with mean and standard deviation as estimated from preliminary testing. The fit that returned the highest log likelihood then served to provide the maximum-likelihood estimates of the parameters.
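The multi-start logic can be illustrated with a deliberately simplified one-parameter example (hypothetical: a fixed-slope psychometric function and a crude hill-climb stand in for the actual fmincon-based procedure):

```python
import math
import random

def fit_boundary(stimuli, responses, n_starts=10, rng=random.Random(0)):
    """Toy multi-start maximum-likelihood fit of the category boundary b.

    Runs a simple local search from several random initial values and
    keeps the run with the highest log likelihood.
    """
    def loglik(b):
        ll = 0.0
        for s, r in zip(stimuli, responses):
            p = 0.5 * (1 + math.erf((s - b) * 10 / math.sqrt(2)))  # fixed slope
            p = min(max(p, 1e-9), 1 - 1e-9)
            ll += math.log(p if r else 1 - p)
        return ll

    best_b, best_ll = None, -math.inf
    for _ in range(n_starts):
        b = rng.uniform(0.0, 1.0)  # random initial parameter
        step = 0.1
        while step > 1e-4:         # simple local hill-climb
            for cand in (b - step, b + step):
                if loglik(cand) > loglik(b):
                    b = cand
                    break
            else:
                step /= 2
        ll = loglik(b)
        if ll > best_ll:
            best_b, best_ll = b, ll
    return best_b
```

Keeping only the best of several restarted runs is what protects against initialization-dependent local maxima in the real fitting procedure.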

Parameter recovery
To test how well our fitting procedure could recover the model parameters, we generated 100 synthetic data sets of the same size as a subject data set. To create a synthetic data set, we randomly drew the value of each parameter from a normal distribution, using the median value and the interquartile range obtained from the joint fitting as mean and standard deviation. We then simulated trial-to-trial responses from the model's probabilities of responding "Susan" given those parameter values and the same stimuli as used in the experiment. Finally, we fitted the model used to generate the data. Given that the number of trials is finite, we expect the log likelihood of the estimated parameters to be slightly higher than that of the true parameters. All parameters were well recovered (see Fig. S1) and, as predicted, the log likelihoods of the estimated parameters were slightly higher than those of the true parameters.
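A minimal version of this recovery check, reduced to a single parameter and a grid search, might look as follows (a hypothetical simplification; it also illustrates why the maximized log likelihood cannot fall below the true-parameter log likelihood):

```python
import math
import random

def recover(b_true=0.45, n_trials=600, rng=random.Random(3)):
    """Minimal parameter-recovery check for a one-parameter psychometric model.

    Simulates binary responses from a known boundary b_true, refits b on a
    grid, and returns (b_hat, ll_hat, ll_true).
    """
    def p_susan(s, b):
        return 0.5 * (1 + math.erf((s - b) * 5 / math.sqrt(2)))

    stimuli = [rng.random() for _ in range(n_trials)]
    responses = [rng.random() < p_susan(s, b_true) for s in stimuli]

    def loglik(b):
        eps = 1e-9
        total = 0.0
        for s, r in zip(stimuli, responses):
            q = min(max(p_susan(s, b), eps), 1 - eps)
            total += math.log(q) if r else math.log(1 - q)
        return total

    grid = [i / 200 for i in range(201)]  # grid includes b_true = 0.45
    b_hat = max(grid, key=loglik)
    return b_hat, loglik(b_hat), loglik(b_true)
```

Because the fitted value maximizes the likelihood over the grid, its log likelihood is at least that of the true parameter, mirroring the small positive difference reported above.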

Single-cue fitting
We fitted all parameters based on the single-cue conditions using maximum-likelihood estimation. We assumed that the bias b and the lapse rate λ are shared across single-cue conditions. The standard deviations σ_f and σ_f,old were estimated from the form-only condition, and σ_m from the motion-only condition. We implemented this fitting using nested fmincon functions in Matlab. We report the parameter estimates in Table S1. The "old on" form manipulation reduced form reliability, as confirmed by a smaller estimated standard deviation for form in the "old off" than in the "old on" condition (−0.03, [−0.06, 0.01] (median difference, IQR)), although this difference was only marginally significant (z = −1.33, p = .092; one-sided Wilcoxon signed-rank test). To validate that the "old on" condition did not affect motion discriminability, we further fitted σ_m, b and λ in the motion-only condition separately for "old on" and "old off" faces.
There was no significant difference between the estimated standard deviations for "old on" and "old off" (0.01, [−0.09, 0.05]; z = −0.60, p > .250, two-sided Wilcoxon signed-rank test). We thus collapsed "old on" and "old off" in the motion-only condition for later analyses.

Model comparison
For each subject and each model, we calculated the maximum log likelihood over parameters. We used non-parametric Wilcoxon signed-rank tests on these maximum log likelihoods to test for differences between models. In addition, we used a random-effects method for Bayesian model selection at the group level 4.

Model recovery
To validate our model comparison process, we used the same synthetic data sets as for the parameter recovery, but now also fitted the models other than the one used to generate the data; that is, we fitted the synthetic data generated by each model with that model itself and with the two other models. For the data sets generated from the optimal model, the best-cue model fitted worse.
The model used to generate the synthetic data always reached a maximal protected exceedance probability of 1. This shows that our model comparison process recovers the correct model well if the true model is among the three models tested.
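The logic of this recovery check can be illustrated with a toy example in which two fixed candidate models differ only in psychometric slope (a hypothetical simplification: unlike the full procedure, no parameters are fitted, and the model names are ours):

```python
import math
import random

def model_recovery_check(n_trials=2000, rng=random.Random(7)):
    """Toy model-recovery check with two fixed candidate models.

    Data generated from one candidate should yield a higher log likelihood
    under that candidate than under the other.
    """
    def p(s, slope):
        return 0.5 * (1 + math.erf(slope * (s - 0.5) / math.sqrt(2)))

    slopes = {"modelA": 8.0, "modelB": 2.0}
    stimuli = [rng.random() for _ in range(n_trials)]
    responses = [rng.random() < p(s, slopes["modelA"]) for s in stimuli]

    def loglik(slope):
        eps = 1e-9
        total = 0.0
        for s, r in zip(stimuli, responses):
            q = min(max(p(s, slope), eps), 1 - eps)
            total += math.log(q) if r else math.log(1 - q)
        return total

    return {name: loglik(sl) for name, sl in slopes.items()}

lls = model_recovery_check()
```

In the full analysis, the per-model log likelihoods of each synthetic data set feed into the group-level Bayesian model selection, yielding the protected exceedance probabilities reported above.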

OPTIMAL MODEL WITH INCORRECT BELIEFS
We examined whether the optimal model with incorrect beliefs could explain observers' behaviour better than the optimal model. In particular, we fitted the parameters for this model using maximum-likelihood estimation. Figure S2 shows the fit of the optimal model with incorrect beliefs to the psychometric curves. Psychometric curves are shown for single cues ("Form" in blue, "Motion" in green; note that the combined-cue condition "Comb" in red is also shown for comparison) and for combined cues ("Comb" in red, "Comb, +Δ" in orange, and "Comb, −Δ" in purple), each separated for "old off" (first column) and "old on" (second column). Error bars and shaded areas represent ± 1 s.e.m. across subjects (n = 22), for data and model fit, respectively.

REACTION TIME ANALYSIS
Form and motion cues differ in how the available information develops over time: static form information is available from the beginning, while motion information evolves over time. To investigate how these inherent properties influence subjects' decision making in our task, we allowed subjects to freely choose when to make an identity choice, even during the presentation of the stimulus. Recent evidence has demonstrated that standard cue-integration models might be insufficient to explain cue-integration behaviour in reaction-time tasks 5. Thus, we examined reaction times in our experiment (Fig. S3). Visual inspection reveals that average reaction times depended on experimental condition, morph level and form reliability (i.e., old on/off). For all conditions, we can further see the typical inverted "U-shape", with longer reaction times for intermediate morph levels than for morph levels at the outer bounds. To test for differences in reaction time across the experimental conditions, we performed multiple two-way (Condition x Morph level) repeated-measures ANOVAs. In the single-cue conditions, we found a main effect of Morph level ("old off": F(10,10) = 12.23, p < .001, η_p² = .21; "old on": F(10,10) = 6.49, p < .001, η_p² = .13), supporting the inverted "U-shape" of reaction times. Furthermore, we found a significant effect of Condition ("old off": F(1,10) = 293.93, p < .001, η_p² = .39; "old on": F(1,10) = 128.03, p = .003, η_p² = .22). During "old off", the estimated values of the standard deviation parameter for facial form were larger than for facial motion (see "Single-cue fitting") while reaction times were shorter, indicating a potential speed-accuracy trade-off.
Next we analysed both the congruent and incongruent combined conditions.
We did not consider the most extreme morph levels, as incongruent and congruent conditions differed at these morph levels (see Experimental Methods and Results above). As for the single-cue conditions, we found a main effect of Morph level.
Figure S3 shows reaction times for single cues ("Form" in blue, "Motion" in green; note that the combined-cue condition "Comb" in red is also shown for comparison) and for combined cues ("Comb" in red, "Comb, +Δ" in orange, and "Comb, −Δ" in purple), each separated for "old off" (first column) and "old on" (second column). Error bars represent ± 1 s.e.m. across subjects (n = 22).