The neural dynamics of hierarchical Bayesian causal inference in multisensory perception

Transforming the barrage of sensory signals into a coherent multisensory percept relies on solving the binding problem – deciding whether signals come from a common cause and should be integrated or, instead, segregated. Human observers typically arbitrate between integration and segregation consistent with Bayesian Causal Inference, but the neural mechanisms remain poorly understood. Here, we presented observers with audiovisual sequences that varied in the number of flashes and beeps, and combined Bayesian modelling with EEG representational similarity analyses. Our data suggest that the brain initially represents the number of flashes and beeps independently. Later, it computes their number by averaging the forced-fusion and segregation estimates, weighted by the posterior probabilities of the common and independent cause models (i.e. model averaging). Crucially, prestimulus oscillatory alpha power and phase correlate with observers' prior beliefs about the world's causal structure that guide their arbitration between sensory integration and segregation.


Supplementary Methods
To characterize the neural dynamics of Bayesian Causal Inference, we investigated whether and when the internal estimates of the Bayesian Causal Inference (BCI) model are represented in EEG activity patterns using representational similarity analysis (see main manuscript), support-vector regression (SVR) and canonical correlation analysis (CCA). The BCI model includes four internal numeric estimates: i. the unisensory visual estimate (N̂_V,C=2), ii. the unisensory auditory estimate (N̂_A,C=2), iii. the forced-fusion estimate (N̂_AV,C=1) and iv. the final BCI audiovisual numeric estimate (N̂_A or N̂_V, depending on whether the auditory or visual modality was task-relevant). Further, we considered the posterior causal probability across all 32 conditions.
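For illustration, the four internal estimates and the posterior causal probability can be computed with a minimal numeric sketch of the standard Bayesian Causal Inference model under Gaussian assumptions (Körding et al. 2007 formulation with model averaging). All parameter values below (sensory noise, prior mean and width, prior probability of a common cause) are illustrative placeholders, not the fitted values from this study:

```python
import numpy as np

def bci_estimates(xA, xV, sigA=1.0, sigV=2.0, sigP=10.0, muP=2.0, pC1=0.5):
    """Internal estimates of the Bayesian Causal Inference model
    (Gaussian approximation; parameter values are illustrative)."""
    wA, wV, wP = 1 / sigA**2, 1 / sigV**2, 1 / sigP**2
    # Forced-fusion estimate under a common cause (C = 1): precision-weighted average
    n_fused = (wA * xA + wV * xV + wP * muP) / (wA + wV + wP)
    # Unisensory estimates under independent causes (C = 2)
    nA = (wA * xA + wP * muP) / (wA + wP)
    nV = (wV * xV + wP * muP) / (wV + wP)
    # Likelihoods of the two causal structures (marginalized over the source)
    varC1 = sigA**2 * sigV**2 + sigA**2 * sigP**2 + sigV**2 * sigP**2
    likeC1 = np.exp(-((xA - xV)**2 * sigP**2 + (xA - muP)**2 * sigV**2
                      + (xV - muP)**2 * sigA**2) / (2 * varC1)) / (2 * np.pi * np.sqrt(varC1))
    likeC2 = (np.exp(-(xA - muP)**2 / (2 * (sigA**2 + sigP**2)))
              / np.sqrt(2 * np.pi * (sigA**2 + sigP**2))
              * np.exp(-(xV - muP)**2 / (2 * (sigV**2 + sigP**2)))
              / np.sqrt(2 * np.pi * (sigV**2 + sigP**2)))
    # Posterior probability of a common cause
    post_c1 = likeC1 * pC1 / (likeC1 * pC1 + likeC2 * (1 - pC1))
    # Model averaging: fusion and segregation estimates weighted by the posterior
    n_hat_A = post_c1 * n_fused + (1 - post_c1) * nA
    n_hat_V = post_c1 * n_fused + (1 - post_c1) * nV
    return dict(nV=nV, nA=nA, n_fused=n_fused, post_c1=post_c1,
                n_hat_A=n_hat_A, n_hat_V=n_hat_V)
```

With congruent inputs (e.g. xA = xV = 2) the posterior probability of a common cause is high and the final estimates approach the fused estimate; with large numeric disparity it drops and the final estimates approach the unisensory estimates.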
For the CCA approach, we correlated the EEG activity patterns from each 20 ms time window with each of the internal BCI estimates separately, using Matlab's canoncorr function. Because canonical correlations are by definition positive and are not subject to a regularization term (i.e. larger feature spaces lead to larger canonical correlations), we subtracted the mean prestimulus canonical correlation (from -100 to 0 ms) before plotting the first canonical correlation coefficient as a function of time (Supplementary Figure 6A). The Fisher's z-transformed correlation coefficients were tested against zero using a randomization test (sign flip of the correlation coefficients in 5000 randomizations) and a cluster-based correction for multiple comparisons across time intervals (as applied for RSA in the main manuscript). From the canonical correlations' Wilks' λ, we computed the Bayesian Information Criterion as an approximation to the model evidence for each numeric estimate and time point (BIC = n * log(λ) + 1 * log(n); n = number of EEG activity patterns). We entered these participant-specific model evidences into a random-effects group analysis to compute the protected exceedance probability (SPM12) 1 that one numeric estimate was more likely encoded than any of the other estimates, separately for each time interval (Supplementary Figure 6B).
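The core of this step can be sketched in plain NumPy: canonical correlations between two pattern matrices, and the BIC from Wilks' λ as given in the text. The array shapes (trial count, channel count) and the simulated estimate are illustrative, and the prestimulus baseline subtraction and the group-level analysis are omitted:

```python
import numpy as np

def canoncorr(X, Y):
    """Canonical correlations between two pattern matrices (rows = observations);
    a minimal stand-in for Matlab's canoncorr."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)            # orthonormal bases of the column spaces
    Qy, _ = np.linalg.qr(Y)
    r = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(r, 0.0, 1.0)

def bic_from_wilks(r, n):
    """BIC = n * log(Wilks' lambda) + 1 * log(n), with lambda = prod(1 - r_i^2)."""
    wilks = np.prod(1.0 - r**2)
    return n * np.log(wilks) + 1 * np.log(n)

rng = np.random.default_rng(0)
n = 200                                         # illustrative number of trials
X = rng.standard_normal((n, 32))                # e.g. EEG activity patterns (trials x channels)
estimate = X[:, 0] + 0.1 * rng.standard_normal(n)   # a simulated internal estimate
r = canoncorr(X, estimate[:, None])
bic = bic_from_wilks(r, n)
```

With a single numeric estimate as Y, the (single) canonical correlation reduces to the multiple correlation between the estimate and the EEG patterns, which is why it grows with the feature space and motivates the prestimulus baseline subtraction described above.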
For the SVR approach, we trained SVR models (libSVM 3.20 2 ) to learn the mapping from spatio-temporal EEG activity patterns from each 20 ms time window to the internal BCI estimates from all but one run. This learnt mapping was then used to decode the internal BCI estimates from the EEG activity patterns of the remaining run. In a leave-one-run-out cross-validation scheme, the training-test procedure was repeated for all runs. The SVRs' parameters (C and ν) were optimized using a grid search within each cross-validation fold (i.e. nested cross-validation). Before training the SVR models, we rescaled the internal estimates to the range [-1, 1]. To quantify the decoding accuracy of the internal BCI estimates, we computed the Pearson correlation coefficient between the true and decoded internal BCI estimates. The Fisher's z-transformed correlation coefficients were tested against zero (Supplementary Figure 6C) and the protected exceedance probability for each internal BCI estimate was computed as described for the CCA approach (Supplementary Figure 6D).
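The leave-one-run-out decoding scheme with nested grid search can be sketched with scikit-learn's libSVM-backed NuSVR (which exposes the same C and ν parameters). The synthetic data stand in for the spatio-temporal EEG patterns and an internal BCI estimate; the run count, feature count and grid values are illustrative:

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

rng = np.random.default_rng(1)
n_runs, n_trials, n_feat = 5, 40, 16                # illustrative sizes
X = rng.standard_normal((n_runs * n_trials, n_feat))  # stand-in EEG patterns
w = rng.standard_normal(n_feat)
est = X @ w                                         # stand-in internal BCI estimate
runs = np.repeat(np.arange(n_runs), n_trials)       # run labels for the CV splits

# Rescale the target estimate to [-1, 1], as described in the text
y = 2 * (est - est.min()) / (est.max() - est.min()) - 1

decoded = np.zeros_like(y)
for train, test in LeaveOneGroupOut().split(X, y, groups=runs):
    # Nested cross-validation: grid search over C and nu within the training runs
    grid = GridSearchCV(NuSVR(kernel='linear'),
                        {'C': [0.1, 1, 10], 'nu': [0.25, 0.5, 0.75]},
                        cv=3)
    grid.fit(X[train], y[train])
    decoded[test] = grid.best_estimator_.predict(X[test])

# Decoding accuracy: Pearson correlation between true and decoded estimates
r = np.corrcoef(y, decoded)[0, 1]
```

Fisher's z-transform (np.arctanh(r)) would then be applied before the group-level randomization test.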

Supplementary Note Response time analysis
To avoid motor-response confounds in EEG signals, participants were instructed to respond only upon the presentation of a response screen, i.e. 750 ms after the onset of the first stimulus (Fig. 1a). As a result, response times (RTs) are less informative, because they include an unaccounted-for period during which participants withheld their response while waiting for the response screen. Yet, a 2 (numeric disparity: small ≤ 1 vs. large ≥ 2) × 2 (task relevance: auditory vs. visual report) repeated-measures ANOVA on RTs revealed a significant main effect of numeric disparity (F(1, 22) = 14.63, p = 0.001, partial η² = 0.399). In line with the multisensory congruency effect on RTs 3 , smaller disparity led to faster RTs (Supplementary Figure 1). Critically, we also observed an interaction between numeric disparity and task relevance (F(1, 22) = 5.74, p = 0.026, partial η² = 0.207). Observers were slower to report the number of flashes than beeps, particularly for large (relative to small) disparity trials. In other words, observers were slower to report the less precise visual estimate particularly when they had to segregate the sensory signals.
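In a 2 × 2 within-subject design, each ANOVA effect has a single degree of freedom, so each F value equals the squared t statistic of the corresponding paired contrast. A minimal sketch of this computation on simulated data (the condition layout, RT values and effect sizes are hypothetical, not the study's data):

```python
import numpy as np

def rm_anova_2x2(cells):
    """F statistics for a 2x2 repeated-measures ANOVA via single-df paired contrasts.
    cells maps (task relevance, disparity) to per-participant median RTs."""
    def paired_F(d):
        t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
        return t**2                        # F(1, n-1) = t^2 for a single-df contrast
    a_s, a_l = cells[('A', 'small')], cells[('A', 'large')]
    v_s, v_l = cells[('V', 'small')], cells[('V', 'large')]
    F_disp = paired_F((a_s + v_s) / 2 - (a_l + v_l) / 2)   # main effect of disparity
    F_task = paired_F((a_s + a_l) / 2 - (v_s + v_l) / 2)   # main effect of task relevance
    F_int = paired_F((a_s - a_l) - (v_s - v_l))            # disparity x task interaction
    return F_disp, F_task, F_int

# Hypothetical per-participant median RTs (s): large disparity and visual report slower
rng = np.random.default_rng(0)
n = 23
base = rng.normal(0.6, 0.05, n)            # participant-specific baseline RT
cells = {('A', 'small'): base + rng.normal(0, 0.02, n),
         ('A', 'large'): base + 0.08 + rng.normal(0, 0.02, n),
         ('V', 'small'): base + 0.02 + rng.normal(0, 0.02, n),
         ('V', 'large'): base + 0.15 + rng.normal(0, 0.02, n)}
F_disp, F_task, F_int = rm_anova_2x2(cells)
```

The denominator degrees of freedom (n - 1 = 22) match the F(1, 22) statistics reported above.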

Supplementary Figures
Supplementary Figure 1. Reaction times. Reaction times (RT; across-participants mean ± SEM; n = 23) are shown as a function of numeric disparity (small: ≤ 1 vs. large: ≥ 2) and task relevance (auditory vs. visual report). Reaction times were computed as the mean across participants' median reaction times. Note that reaction times have to be interpreted with caution because participants had to withhold their report until the response screen was presented, i.e. 750 ms after the onset of the first stimulus (cf. Fig. 1a). Source data are provided as a Source Data file.

Supplementary Figure 2. Decoding accuracy in unisensory visual and auditory conditions.
Decoding accuracy (Fisher's z-transformed correlation; across-participants mean ± SEM; n = 23) of the unisensory decoders as a function of time. Auditory (resp. visual) decoders were trained to decode the stimulus number from EEG activity patterns from unisensory auditory (resp. visual) stimulation. Decoding accuracy was computed as the Pearson correlation coefficient between the true and decoded stimulus number in auditory (resp. visual) conditions. The horizontal dashed lines indicate significant (p < 0.05) clusters of decoding accuracy (one-sided cluster-based corrected randomization t22 test). Stimulus onsets are shown on the x axis. red = unisensory visual; green = unisensory auditory. Source data are provided as a Source Data file.

Fig. 6c and d. The predicted pcommon (B) is based on the sinusoidal models that were fit to pcommon over alpha phase deciles independently for each time point and participant (Equation (8)). Source data are provided as a Source Data file.

Supplementary Figure 6. Linking the representational geometry of the BCI models' estimates and EEG patterns by canonical correlation analysis and support-vector regression (SVR) decoding. (A)
Canonical correlation coefficients (across-participants mean of Fisher's z-transformed r; n = 23) as a function of time. The canonical correlation coefficients were computed between the EEG activity patterns and each of the BCI model's internal estimates separately: i. the unisensory visual (N̂_V,C=2) and ii. the unisensory auditory (N̂_A,C=2) estimates under the assumption of independent causes (C = 2), iii. the forced-fusion estimate (N̂_AV,C=1) under the assumption of a common cause (C = 1) and iv. the final BCI estimate (N̂_A or N̂_V, depending on the sensory modality that is task-relevant), which averages the task-relevant unisensory and the forced-fusion estimates weighted by the posterior probability of each causal structure (p(C = 1 | x_A, x_V)). Color-coded horizontal dashed lines indicate clusters of significant correlation coefficients (p < 0.05; one-sided cluster-based corrected randomization t22 test). Note that the mean of the canonical correlations in the prestimulus time window (i.e. -100 ms to 0 ms) was subtracted from the canonical correlations. Stimulus onsets are shown along the x axis. (B) To identify which of the four numeric BCI estimates is most likely represented in the EEG activity patterns, we computed the protected exceedance probability (i.e. the probability that a given estimate is more likely encoded in the EEG patterns than any other estimate, beyond differences due to chance) for each of the four numeric BCI estimates as a function of time. The length of an estimate's bar indicates the estimate's protected exceedance probability (n.b. the y axis indicates protected exceedance probabilities cumulated over the BCI estimates). (C) Decoding accuracy (Fisher's z-transformed correlation; across-participants mean ± SEM) of the SVR decoders as a function of time.
Decoding accuracy was computed as the Pearson correlation coefficient between the true internal estimates of the BCI model and those decoded from EEG activity patterns using SVR models trained separately for each BCI estimate. Significant clusters are shown as in (A). (D) The protected exceedance probability that one numeric estimate, decoded with SVR, is more likely encoded in the EEG patterns than any other of the four numeric BCI estimates as a function of time, shown as in (B). Source data are provided as a Source Data file.

Supplementary Figure 7. Decoding accuracy and relative audiovisual weight index for decoders trained separately for audiovisual conditions under auditory versus visual report. (A)
Decoding accuracy (Fisher's z-transformed correlation; across-participants mean ± SEM, n = 23) of the decoders as a function of time. Decoding accuracy was computed as the Pearson correlation coefficient between the true and decoded stimulus number in audiovisual congruent conditions. The horizontal dashed line indicates significant (p < 0.05) clusters of decoding accuracy (one-sided cluster-based corrected randomization t22 test). Stimulus onsets are shown on the x axis. (B) Relative weight index wAV (across-participants circular mean and bootstrapped 68% CI) of the decoders' predicted stimulus number as a function of numeric disparity (small vs. large), task relevance (visual (V) vs. auditory (A) report) and time. wAV = 90° for purely visual and wAV = 0° for purely auditory influence. Horizontal dashed lines indicate significant clusters with effects of numeric disparity, task relevance and their interaction (p < 0.05; cluster-based corrected randomization test based on an LRTS statistic). Source data are provided as a Source Data file.

Supplementary Figure 8. Bayes factor (BF) in favor of no effect (H0) of the previous disparity on alpha power. The time-frequency BF-value map (n = 23, averaged over occipital electrodes) shows substantial evidence (i.e. BF > 3.2 4 ) in favor of no difference in prestimulus alpha power between small versus large previous numeric disparity (i.e. trial order 1). BFs were computed from the time-frequency t-value map shown in Fig. 7c 5 . Source data are provided as a Source Data file.