The neural dynamics of hierarchical Bayesian inference in multisensory perception

Transforming the barrage of sensory signals into a coherent multisensory percept relies on solving the binding problem – deciding whether signals come from a common cause and should be integrated, or instead be segregated. Human observers typically arbitrate between integration and segregation consistent with Bayesian Causal Inference, but the neural mechanisms remain poorly understood. We presented observers with audiovisual sequences that varied in the number of flashes and beeps. Combining Bayesian modelling and EEG representational similarity analyses, we show that the brain initially represents the number of flashes and beeps and their numeric disparity mainly independently. Later, it computes the final numeric estimates by averaging the forced-fusion and segregation estimates weighted by the posterior probabilities of the common- and independent-cause models (i.e. model averaging). Crucially, prestimulus oscillatory alpha power and phase correlate with observers' prior beliefs about the world's causal structure that guide their arbitration between sensory integration and segregation.

In everyday life, the brain is constantly confronted with a myriad of sensory signals. Imagine you are skipping stones on a lake. Each time the stone bounces off the water's surface, you see the impact and hear a brief splash. Should you integrate or segregate signals from vision and audition to estimate how many times the stone hits the water's surface? Hierarchical Bayesian Causal Inference provides a rational strategy to arbitrate between information integration and segregation by explicitly modelling the underlying potential causal structures, i.e. whether visual impacts and splash sounds are caused by common or independent events 1,2 . Under the assumption of a common cause, signals are integrated weighted by their relative precisions (or reliabilities, i.e. the reciprocal of variance) into one single 'forced-fusion' numeric estimate 3,4 .
If, however, some splash sounds are caused by a stone hitting the water surface out of the observer's sight (e.g. another person throwing a stone), audition and vision will provide conflicting information. In this segregation case, the brain needs to estimate the number of events independently for vision and audition. Importantly, the brain cannot directly access the world's causal structure, but needs to infer it from the signals' noisy sensory representations based on correspondence cues such as temporal synchrony or spatial co-location. To account for observers' causal uncertainty, a final Bayesian Causal Inference estimate is computed by combining the 'forced-fusion' and the task-relevant unisensory segregation estimates weighted by the posterior probability of common or independent causes 1 . Perception thus relies crucially on inferring the hidden causal structure that generated the sensory signals.
Accumulating evidence suggests that human and animal observers arbitrate between sensory integration and segregation approximately in line with Bayesian Causal Inference 1,5-8 .
For small intersensory conflicts, when it is likely that signals come from a common cause, observers integrate sensory signals approximately weighted by their relative precisions 3,4,9, which leads to intersensory biases and perceptual illusions. Most prominently, in the sound-induced flash illusion, observers tend to perceive two flashes when a single flash appears together with two sounds 10. For large intersensory conflicts such as temporal asynchrony, spatial disparity or numeric disparity, multisensory integration breaks down and crossmodal biases are attenuated 5,11.
At the neural level, a recent fMRI study has demonstrated that the human brain performs multisensory Bayesian Causal Inference for spatial localization by encoding multiple spatial estimates across the cortical hierarchy 12,13. While low-level sensory areas represented spatial estimates mainly under the assumption of separate causes, posterior parietal areas integrated sensory signals under the assumption of a common cause. Only at the top of the cortical hierarchy, in anterior parietal areas, did the brain form a Bayesian Causal Inference estimate that takes into account the observers' uncertainty about the signals' causal structure.
In summary, the brain should entertain two models of the sensory inputs, namely that the inputs are generated by common (i.e. forced-fusion model) or independent sources (i.e. segregation model). Using a decisional strategy called model averaging, hierarchical Bayesian Causal Inference accounts for the brain's uncertainty about the world's causal structure by averaging the forced-fusion and the task-relevant unisensory segregation estimates weighted by the posterior probabilities of their respective causal structures. Hence, hierarchical Bayesian Causal Inference goes beyond estimating an environmental property (e.g. numerosity, location) and involves inferring a causal model of the world (i.e. structure inference).
The hierarchical nature of Bayesian Causal Inference raises the intriguing question of how these computations evolve dynamically over time in the human brain. To assess this, we fitted the Bayesian Causal Inference model to observers' behavioral responses and then investigated how observers' forced-fusion, full-segregation auditory and visual estimates and the final Bayesian Causal Inference (i.e. model averaging) estimates are dynamically encoded in neural responses measured with EEG. While the brain is likely to update all estimates continuously in recurrent loops across the cortical hierarchy 14,15, the neural representations of unisensory segregation and forced-fusion estimates may be more pronounced at earlier latencies than the final Bayesian Causal Inference estimate, whose computation requires the posterior probabilities of the potential causal structures (i.e. common vs. independent causes). Moreover, neural activity (i.e. alpha-, beta- and gamma-oscillations 16,17) prior to stimulus onset may modulate the causal prior or the precision of sensory representations (e.g. visual variance) in early visual cortices and thereby, in turn, influence the outcome of Bayesian Causal Inference. We combined psychophysics, computational modelling and EEG representational similarity analyses to characterize the neural dynamics of Bayesian Causal Inference in the perception of audiovisual stimulus sequences.

Results
During the EEG recording, we presented 23 human observers with sequences of auditory beeps and visual flashes in a four (1 to 4 flashes) × four (1 to 4 beeps) factorial design (Fig. 1).
Participants estimated and reported either the number of flashes or the number of beeps. We combined a GLM-based and a Bayesian modelling analysis to characterize the computations and neural mechanisms of how the brain combines information from vision and audition to estimate the number of auditory and visual stimuli.
- Figure 1 to appear about here -

Behavior – Audiovisual weight index and Bayesian modelling
Using a general linear model (i.e. GLM, regression) approach, we computed a relative audiovisual weight index wAV that quantifies the relative influence of the true number of beeps and flashes on participants' numeric reports. The audiovisual weight index wAV was analyzed as a function of numeric disparity between beeps and flashes (i.e. small ≤ 1 vs. large ≥ 2) x task-relevance (visual vs. auditory report). This audiovisual weight index ranges from pure visual (90°) to pure auditory (0°) influence. As shown in figure 1C and figure 2A, observers' reported number of beeps was mainly influenced by the true number of beeps and only slightly – but significantly – biased by the true number of flashes (circular mean wAV = 3.871°, p < 0.001, one-sided randomization test on wAV > 0°; i.e. a visual bias on auditory perception 18,19). Conversely, the reported number of flashes was biased by the true number of beeps (circular mean wAV = 65.483°, p < 0.001, one-sided randomization test on wAV < 90°), which is consistent with the well-known 'sound-induced flash illusion' 10,19. Yet, despite these significant biases operating from vision to audition and vice versa, observers did not fuse stimuli into one unified percept.
Instead, the visual influence was stronger when the number of flashes was reported and the auditory influence was stronger when the number of beeps was reported (effect of task on wAV, LRTS = 85.620, p < 0.001, randomization test of a likelihood ratio test statistic (LRTS); Table 1). As a result, observers reported different perceived numbers of flashes and beeps for audiovisual stimuli with a numeric disparity. Thus, participants flexibly adjusted the weights according to the task-relevant sensory modality. Crucially, this difference between auditory and visual report increased significantly for large relative to small numeric disparities. In other words, audiovisual integration broke down for large numeric disparities, when auditory and visual stimuli were more likely to be caused by independent sources (i.e. a significant interaction between task-relevance and numeric disparity, LRTS = 1.761, p < 0.001; for analysis of response times, see supplementary results and figure S1). Indeed, the model predictions in figure 1C show that this interaction between task-relevance and numeric disparity is a key feature of Bayesian Causal Inference. As this behavioral profile can be accounted for neither by the classical forced-fusion model, which assumes audiovisual stimuli are fused into one single estimate (i.e. common source model), nor by the full-segregation model (i.e. independent source model), the Bayesian Causal Inference model was the winning model for explaining the behavioral data based on formal Bayesian model comparison (Table 2). Further, the decisional function 'model averaging' outperformed 'model selection' and 'probability matching' at the group level (see supplementary Table S1, consistent with 5, but see 20). In the following, we will therefore focus selectively on Bayesian Causal Inference with model averaging.
- Figure 2 to appear about here -
- Figure 3 to appear about here -

EEG – multivariate decoding and audiovisual weight index
To compute a neural audiovisual weight index wAV, we applied multivariate pattern analysis to single trial EEG activity patterns (i.e. 64 electrodes) of 20 ms time intervals. We trained a support-vector regression model on EEG activity patterns independently at each time point of the audiovisual congruent conditions to establish a mapping between EEG activity patterns and the number of audiovisual stimuli. We then generalized to the congruent and incongruent conditions (i.e. leave-one-run-out cross-validation). First, we ensured that we could decode the stimulus number for congruent trials significantly better than chance. Indeed, the decoder was able to discriminate between, for instance, three and four flash-beeps nearly immediately after the presentation of the fourth flash-beep (Fig. 3A) and thus before the ERP traces, when averaged over parietal electrodes, started to diverge (Fig. 1D). Pooling over all four congruent conditions, we observed better than chance decoding accuracy from around 100 ms to 740 ms measured from the onset of the first flash-beep slot (Fig. 3B).
We applied the same analysis approach as for behavioral responses to the audiovisual decoded numeric estimates and computed the neural audiovisual weight index wAV, which quantified the relative auditory and visual influences on the decoded number of flashes and beeps across poststimulus time (i.e. from 100 ms to 740 ms). We assessed how the neural audiovisual weight index was affected by numeric disparity between beeps and flashes (i.e. small ≤ 1 vs. large ≥ 2) and task-relevance (visual vs. auditory report) in a 2 x 2 repeated measures analysis (Fig. 3C and Table 1). We observed that the auditory influence was stronger for small relative to large numeric disparities from 400-480 ms poststimulus (i.e. effect of numeric disparity; 200-280 ms after the final flash-beep slot). Only when the numeric disparity was small, and hence the two stimuli were likely to come from a common cause, did auditory stimuli impact the neural estimation of the number of flashes, which dominated the EEG activity patterns. Shortly afterwards, i.e. 420-540 ms poststimulus, the influence of the auditory and visual stimuli on the decoded numeric estimate also depended on the sensory modality that needed to be reported (effect of task-relevance; for additional effects see Table 1). The number of flashes influenced the decoded numeric estimates more strongly for visual report, whereas the number of beeps influenced the decoded numeric estimates more strongly for auditory report. Crucially, at 560 ms and from 680 to 720 ms poststimulus, we observed a significant interaction between task-relevance and numeric disparity, which is the key profile of Bayesian Causal Inference.
As predicted by Bayesian Causal Inference (cf. Fig. 1C), the audiovisual weight indices for auditory and visual report were similar (i.e. integration) for small numeric disparities, but diverged (i.e. segregation) for large numeric disparities, when it was unlikely that the flash and beep sequences were generated by a common cause.
- Figure 4 to appear about here -

Representational geometry of the numeric estimates of the Bayesian Causal Inference model and EEG activity patterns
Using representational dissimilarity analysis, we compared the representational geometry of the full-segregation auditory or visual, the forced-fusion and the final Bayesian Causal Inference (BCI) estimates 24 with the representational geometry of observers' numeric reports (Fig. 2) and EEG activity patterns across poststimulus time (Fig. 4). First, we estimated the representational dissimilarity matrices (RDMs) by computing the pairwise absolute distance between the BCI model's four numeric estimates, i.e. (i) the forced-fusion, the full-segregation (ii) auditory and (iii) visual and (iv) the final BCI estimates, as well as the posterior causal probability, across all 32 conditions. As shown in figure 2C, the RDM for the forced-fusion estimate (NAV,C=1) was a weighted average of the RDMs of the full-segregation auditory (NA,C=2) and visual (NV,C=2) estimates; because the auditory modality provides more precise temporal information, the fusion estimate was weighted more strongly towards the auditory estimate. The final BCI estimate (NA or NV, depending on the sensory modality that needs to be reported) combines the forced-fusion estimate (NAV,C=1) with the task-relevant unisensory visual (NV,C=2) or auditory (NA,C=2) estimate (depending on report), weighted by the posterior probabilities of common or separate causes, respectively (i.e. p(C = 1|xA, xV) or p(C = 2|xA, xV)). The probability of a common cause increased with smaller numeric disparity, such that the influence of the forced-fusion estimate was greater for small numeric disparities. Figure 2B illustrates that the RDM computed from observers' behavioral numeric reports was nearly identical to the BCI RDM. This match was confirmed statistically by a high correlation between the BCI (i.e. NA or NV) RDM and participants' behavioral RDM (r = 0.878 ± 0.059, mean ± SEM, p < 0.001). Of course, this match between behavioral and BCI RDM was expected, because the BCI RDM was computed from the predictions of the BCI model that fit participants' numeric reports well (i.e. a circular dependency; cf. Table 2).
Next, we characterized the neural dynamics of Bayesian Causal Inference by comparing the representational geometry obtained from EEG activity patterns across time with the representational geometries of (i) the forced-fusion, the full-segregation (ii) auditory and (iii) visual and (iv) the final BCI estimates. As shown in figure 4A, the RDMs obtained from EEG activity patterns significantly correlated with the unisensory auditory RDM (NA,C=2; significant cluster 60-740 ms, p < 0.001, one-sided cluster-based corrected randomization t22 test), the unisensory visual RDM (NV,C=2; cluster 100-720 ms, p < 0.001), the forced-fusion RDM (NAV,C=1; cluster 80-740 ms, p < 0.001) and the BCI RDM (NA or NV; significant cluster 80-740 ms, p < 0.001). In short, the RDMs of EEG activity patterns correlated with multiple numeric estimates simultaneously. For the posterior probability of a common cause (p(C = 1|xA, xV)), the correlation was weaker but significant in a later cluster (260-640 ms after stimulus onset, p < 0.001). The strong and sustained correlations of EEG RDMs with the RDMs of the four numeric estimates from the BCI model were expected, because the four numeric estimates were highly correlated with one another. Hence, to account for these inherent correlations between the numeric estimates, we next computed the exceedance probability (i.e. the probability that the correlation with one numeric RDM was greater than that of any other RDM) to determine which of the four numeric estimates was most strongly represented in the EEG activity patterns at a given time point (Fig. 4B). The exceedance probabilities showed that the EEG activity patterns predominantly encoded the unisensory visual estimate from 120 ms up to around 500 ms (i.e. 300 ms after the final flash-beep slot). This visual over auditory influence on EEG activity patterns at the scalp may be surprising, because the auditory sense exerts a stronger influence on observers' reported numeric estimates (Fig. 1C) and provides more precise temporal information when estimated from observers' numeric reports (cf. σA vs. σV in Table 2). Potentially, the visual neural sources elicit EEG activity patterns in sensor space that are more informative about the number of events (see methods section for caveats and a critical discussion of the decoding analysis). Indeed, additional multivariate decoding analyses of the unisensory auditory and visual conditions showed that the number of visual stimuli could be more accurately decoded from visual EEG activity patterns than the number of auditory stimuli from auditory EEG activity patterns (Supplementary Fig. S2). Potentially, this advantage for visual decoding under unisensory stimulation may further increase in an audiovisual context when the visual signal is task-relevant, because of additional attentional amplification.
Crucially, from 450 ms poststimulus (i.e. 250 ms after the presentation of the final flash-beep; Fig. 4B), the EEG representational geometries progressively reflected the BCI estimate.
Collectively, the model-based representational dissimilarity analysis suggests that Bayesian Causal Inference evolves by dynamic encoding of multiple sensory estimates. First, the EEG activity is dominated by the numeric unisensory and forced-fusion estimates (i.e. NV,C=2, NA,C=2, and NAV,C=1) and later by the BCI estimate (i.e. NA or NV) that takes into account the observers' uncertainty about the world's causal structure.
- Figure 5 to appear about here -

EEG: Effect of prestimulus oscillations on the causal prior probability
Previous research demonstrated that observers perceived a sound-induced flash illusion more often for low prestimulus alpha power and/or high gamma and beta power over occipital (i.e. visual) cortices 16,17. Within the framework of Bayesian Causal Inference, the occurrence of a sound-induced flash illusion may increase when visual precision is reduced or the causal prior (i.e. the probability of common versus independent causes, also known as binding tendency 25) is enhanced. We therefore investigated whether prestimulus oscillatory power (over occipital electrodes) alters participants' multisensory perception as parameterized by the causal prior (pcommon) or the precision of visual representations (i.e. the reciprocal of σV). For this, we sorted the trials into 10 deciles according to oscillatory power for each time and frequency point and re-fitted the causal prior or the precision of visual representations in the BCI model separately for each bin. Next, we computed the correlation of the causal prior or the visual precision with oscillatory power over deciles. This analysis showed that the causal prior correlated positively with gamma power (p = 0.036, two-sided cluster-based corrected randomization t22 test, in a cluster from -220 ms prestimulus to stimulus onset) and negatively with alpha power (p = 0.042, from -320 ms to -100 ms prestimulus, Fig. 5A, B). Fig. 5C shows the weight index wAV computed from participants' behavior for each decile and the corresponding model predictions. Both human and model behavior showed stronger audiovisual influences (i.e. wAV indices shifted towards the midpoint between purely auditory and purely visual influence) for high gamma power and low alpha power. Crucially, these audiovisual biases operated from vision to audition and vice versa (i.e. a bidirectional bias, which cannot be modelled by changes in visual precision). Hence, prestimulus gamma and alpha oscillations tune how the brain arbitrates between sensory integration and segregation. High gamma and low alpha power prior to stimulus presentation increase the brain's tendency to bind stimuli across the senses. For completeness, we note that we did not observe any significant effect of oscillatory power on the visual precision (Supplementary Fig. S3A).
- Figure 6 to appear about here -

Given the prominent role of alpha oscillations in temporal binding 26,27 in visual and multisensory perception, we next investigated whether the prestimulus alpha phase influenced the causal prior or visual precision. Using a similar sort-and-binning approach as for prestimulus power, we computed a circular-linear correlation between alpha phase and causal prior (or visual precision) over deciles as a function of prestimulus time. While there was again no significant effect of alpha phase on visual precision (Supplementary Fig. S3B), we observed a significant cluster from -160 ms to -80 ms prestimulus (p = 0.015, one-sided cluster-based corrected randomization t22 test), where alpha phase correlated significantly with participants' causal prior (Fig. 6A): trials with a specific alpha phase led to a higher causal prior than trials with an opposing alpha phase. Importantly, the relation between alpha phase and causal prior progressed consistently over time at alpha frequency (i.e. 10 Hz; Fig. 6C). In support of this, a sinusoidal model in which the phase of an alpha oscillation modulated the causal prior captured this relationship, although the phase of the modulation varied across participants (Fig. 6B). These differences between participants are expected and may arise from differences in cortical folding and hence orientations of the underlying neural sources. To account for these differences across participants, we therefore aligned the alpha phase individually for each participant, such that the phase at the peak group effect at -160 ms prestimulus was consistent across participants (cf. Supplementary Fig. S5 for data without phase-alignment). Figure 6C and D show that the alpha phase modulates the causal prior across nearly three cycles, consistently across participants. Collectively, these results demonstrate that the power and phase of prestimulus alpha oscillations influence observers' causal prior, which formally quantifies their a priori tendency to bind signals from audition and vision into a coherent percept.
- Figure 7 to appear about here -

Given the established effects of prestimulus alpha power on perceptual decisions 31-36 and attention 37, one may argue that alpha power is adjusted according to observers' causal expectations based on prior stimulus history. Contrary to this conjecture, the numeric disparity of previous stimuli did not significantly predict alpha power. However, we observed a marginally significant interaction between the numeric disparity of the previous trial and alpha power on observers' causal prior in two clusters, from -340 to -240 ms (p = 0.069) and from -220 to -120 ms (p = 0.096; two-sided cluster-based corrected randomization t22 test, Fig. 7C, top panel). The correlation between alpha power and the observers' causal prior was prominent when prior numeric disparity was small (cluster from -480 to -80 ms, p = 0.006), but not significant when previous numeric disparity was large (i.e. all clusters p > 0.05). In summary, alpha power did not mediate, but to some extent (i.e. only marginally significantly) moderated the effect of stimulus history on observers' causal prior, i.e. their tendency to bind audiovisual signals (Fig. 7B).

Discussion
To form a coherent percept of the world, the human brain needs to integrate signals arising from a common cause, but segregate signals from independent causes. Perception thus relies crucially on inferring the world's causal structure 1,2 . To characterize the neural dynamics of how the brain solves this binding problem, we presented participants with sequences of beeps and flashes that varied in their numeric disparity.
Behaviorally, the number of beeps biased observers' perceived number of flashes – a phenomenon coined the sound-induced flash illusion 10. Conversely, the number of flashes biased observers' perceived number of beeps 18,19, albeit only to a small degree. This asymmetry of crossmodal biases operating from vision to audition and vice versa can be attributed to the smaller precision of vision for temporal estimation, which is consistent with forced-fusion models of reliability-weighted integration 3,4,9 (and Bayesian Causal Inference models, cf. Table 2). Crucially, as predicted by Bayesian Causal Inference, participants did not fully fuse auditory and visual stimuli into one unified percept, but reported different numeric estimates for the flash and beep components of numerically disparate flash-beep stimuli. Moreover, audiovisual integration and crossmodal biases decreased for large numeric disparities, when the flash and beep sequences were unlikely to arise from a common cause 5,11. Thus, observers flexibly arbitrated between audiovisual integration and segregation depending on the probabilities of the underlying causal structures, as predicted by Bayesian Causal Inference (see Fig. 2C).
At the neural level, our univariate and multivariate EEG analyses revealed that the brain concurrently encodes multiple numeric estimates whose relative dominance evolves across poststimulus time. Previous work 41,42 suggests that this explicit causal inference relies on a higher convergence layer, while the audiovisual biases in numeric estimates may be mediated via direct connectivity between auditory and visual layers and emerge from spatiotemporal receptive fields in auditory and visual processing. In contrast to such a feed-forward architecture, we generally observed a mixture of multiple representations that were concurrently expressed in EEG activity patterns, even though different numeric estimates dominated neural processing at different poststimulus latencies. Therefore, we suggest that Bayesian Causal Inference is iteratively computed via multiple feed-back loops across the cortical hierarchy, whereby numeric estimates as well as causal inferences are recurrently updated as the brain accumulates knowledge about the causal structure and sources in the environment 14,15.
In Bayesian inference, prior knowledge and expectations are crucial to guide the perceptual interpretation of the noisy sensory inputs 43 . Multisensory perception in particular relies on a so-called causal prior that quantifies observers' prior beliefs about the world's causal structure 1,2 . A 'high' causal prior (i.e. the belief that signals come from a common cause) influences multisensory perception by increasing observers' tendency to bind audiovisual signals irrespective of the signals' instantaneous intersensory congruency 28 . In the current study, we investigated whether the neural activity prior to stimulus onset is related to observers' causal prior. Indeed, low prestimulus alpha power and high gamma power were associated with a high causal prior, i.e. they increased participants' tendency to integrate audiovisual stimuli.
Accumulating research has shown effects of prestimulus alpha power on perceptual decisions such as detection threshold, decisional biases or perceptual awareness 31-36. Further, low alpha power was also shown to increase the occurrence of the sound-induced flash illusion 16,17,26 (though see Keil et al. 16 for an effect in beta power). In our study, low prestimulus alpha power predicted a larger causal prior, leading to stronger bidirectional interactions between audition and vision and audiovisual biases (see figure 5C, audiovisual weight index wAV). These enhanced audiovisual interactions might be explained by a tonic increase in cortical excitability for states of low alpha oscillatory power and associated high gamma power (though see Yuval-Greenberg et al. 44 for a cautionary note). Moreover, if peaks and troughs of alpha oscillatory activity are modulated asymmetrically 45, low alpha power may also induce larger alpha troughs, thereby extending the temporal windows where gamma bursts and audiovisual interactions can occur 46,47. Indeed, our results show that observers' causal prior depends not only on the tonic level of alpha power, but also on its phase. Prestimulus alpha phase may thus influence audiovisual binding by defining the optimal time window in which neural processing can interact across auditory, visual and association areas, thereby modulating the temporal parsing of audiovisual signals into one unified percept 27,48,49.
Next we investigated whether the fluctuations in alpha power may enable observers to adapt dynamically to the statistical structure of the sensory inputs. Previous research has shown that prior exposure to congruent signals increases observers' tendency to integrate sensory signals, while exposure to incongruent signals enhances their tendency to process signals independently ( 28,29 , but see 30 ). In the current study, we also observed that previous low numeric disparity trials predicted a greater causal prior or tendency to bind audiovisual signals into a coherent percept. Surprisingly, however, the numeric disparity of previous audiovisual stimuli did not significantly influence alpha power. It only modulated the effect of alpha power on the causal prior (i.e. a marginally significant interaction between alpha power and prior numeric disparity). More specifically, alpha power correlated with observers' causal prior mainly when previous stimuli were of low rather than large numeric disparity.
Collectively, our results show that observers' causal prior dynamically adapts to the statistical structure of the world (i.e. previous audiovisual numeric disparity), but that these adaptation processes are not mediated by fluctuations in alpha power. Instead, spontaneous (i.e. as yet unexplained by stimulus history) fluctuations in prestimulus gamma and alpha power as well as alpha phase correlated with observers' causal prior. Alpha power, phase and frequency (i.e. speed) 50-52 together with gamma power may thus dynamically set the functional neural system into states that facilitate or inhibit interactions across brain regions 53,54 and temporal parsing of audiovisual signals into common percepts 27,47 .
In conclusion, to our knowledge this is the first study that resolves the neural computations of hierarchical Bayesian Causal Inference in time. We show that pre-stimulus oscillatory alpha power and phase correlate with the brain's causal prior, a binding tendency that guides how the brain dynamically arbitrates between sensory integration and segregation (see 55).

Stimuli
The flash-beep paradigm was an adaptation of previous "sound-induced flash illusion" paradigms 10; stimulus timing is described below.

Experimental design
In the flash-beep paradigm, participants were presented with a sequence of i. one, two, three or four flashes and ii. one, two, three or four beeps (Fig. 1A). On each trial, the number of flashes and beeps was independently sampled from one to four, leading to four levels of numeric audiovisual disparity (i.e. zero = congruent to three = maximal disparity; Fig. 1B).
Each flash and/or beep was presented sequentially in fixed temporal slots, the first starting at 0 ms and the final slot starting 200 ms after stimulus onset (cf. the latencies reported relative to the final flash-beep slot in the Results).

Overview of general linear model and Bayesian modelling analysis for behavioral data
To characterize how human observers arbitrate between sensory integration and segregation, we developed a general linear model (GLM)-based and a Bayesian modelling analysis approach.
The GLM-based analysis computed a relative weight index wAV which quantified the relative influence of the auditory and the visual numeric stimuli on observers' auditory and visual behavioral numeric reports. This GLM-based analysis allowed us to reveal audiovisual weight profiles in our 2 (numeric disparity) x 2 (task-relevance) factorial design that are qualitatively in line with the principles of Bayesian Causal Inference (Fig. 1C).
The Bayesian modelling analysis fitted the full-segregation, the forced-fusion and the Bayesian Causal Inference (BCI) models to the behavioral numeric reports with different decision functions. We then used Bayesian model comparison to determine the model that best explains observers' behavioral data (Table 2 and supplementary Table S1).

Behaviour: GLM-based analysis for reported number of stimuli – audiovisual weight index
We quantified the influence of the true number of auditory and visual stimuli on the reported (behavioral) auditory or visual numeric estimates using a linear regression model 13, in which the reported number of stimuli was predicted by the true number of auditory stimuli and the true number of visual stimuli. The audiovisual weight index wAV was computed from the auditory and visual parameter estimates as an angle, such that wAV = 90° indicates a purely visual and wAV = 0° a purely auditory influence. Unless otherwise stated, results are reported at p < 0.05. For plotting circular means of wAV (Fig. 1C and 5C for behavioral wAV, Fig. 3C for neural wAV, see multivariate EEG analysis), we computed the means' bootstrapped confidence intervals (1000 bootstraps).
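As an illustration, the following minimal Python sketch computes wAV from trial-wise data under the assumptions above: a linear model with auditory and visual predictors plus an intercept, and wAV taken as the four-quadrant arctangent of the visual over the auditory regression weight. Function and variable names are ours; this is not the original Matlab implementation.

```python
import numpy as np

def audiovisual_weight_index(n_aud, n_vis, reports):
    """Estimate the audiovisual weight index wAV from numeric reports.

    n_aud, n_vis : true numbers of beeps/flashes per trial
    reports      : observer's reported number of stimuli per trial
    Returns wAV in degrees: 0 deg = purely auditory, 90 deg = purely visual.
    """
    # Design matrix: auditory predictor, visual predictor, intercept.
    X = np.column_stack([n_aud, n_vis, np.ones(len(n_aud))])
    beta_a, beta_v, _ = np.linalg.lstsq(X, reports, rcond=None)[0]
    # Express the relative visual vs. auditory influence as an angle.
    return np.degrees(np.arctan2(beta_v, beta_a))

# Example: reports dominated by the auditory stimulus yield wAV near 0 deg.
rng = np.random.default_rng(0)
n_a = rng.integers(1, 5, 500)
n_v = rng.integers(1, 5, 500)
rep = 0.9 * n_a + 0.1 * n_v + rng.normal(0, 0.3, 500)
print(audiovisual_weight_index(n_a, n_v, rep))  # approximately 6 deg
```

Because wAV is an angle, group-level statistics operate on circular means with randomization tests, as described above.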

Behaviour: Full-segregation, forced-fusion and Bayesian Causal Inference models
Next, we fitted the full-segregation, the forced-fusion and the Bayesian Causal Inference model with model averaging, model selection and probability matching as decision functions to observers' behavioural reports. Using Bayesian model comparison, we then assessed which of these models is the best explanation for observers' reported numeric estimates.
In the following, we will first describe the Bayesian Causal Inference model, from which the forced-fusion and full-segregation models can be derived as special cases (i.e. by fixing the causal prior to one or zero, respectively). Briefly, the generative model (Fig. 2C) assumes that common (C = 1) or independent (C = 2) causes are determined by sampling from a binomial distribution with the causal prior P(C = 1) = pcommon (i.e. a priori binding tendency 25). For a common cause, the "true" number of audiovisual stimuli NAV is drawn from the numeric prior distribution N(μP, σP). For two independent causes, the "true" auditory (NA) and visual (NV) numbers of stimuli are drawn independently from this numeric prior distribution. We introduced sensory noise by drawing xA and xV independently from normal distributions centered on the true auditory (respectively visual) number of stimuli with standard deviations σA (respectively σV). Thus, the generative model included the following free parameters: the causal prior pcommon, the numeric prior's mean μP and standard deviation σP, the auditory standard deviation σA, and the visual standard deviation σV. The posterior probability of the underlying causal structure can be inferred by combining the causal prior with the sensory evidence according to Bayes' rule:

(1) p(C = 1 | xA, xV) = p(xA, xV | C = 1) pcommon / p(xA, xV)

The causal prior quantifies observers' belief or tendency to assume a common cause and integrate stimuli prior to stimulus presentation. After stimulus presentation, the disparity between the number of beeps and flashes informs the observers' causal inference via the likelihood term (cf. Fig. 2C). In the case of a common cause (C = 1), the optimal audiovisual numeric estimate (NAV,C=1) is obtained, under the assumption of a squared loss function, by combining the auditory and visual numeric estimates as well as the numeric prior (with a Gaussian distribution N(μP, σP)) weighted by their relative reliabilities:

(2) NAV,C=1 = (xA/σA² + xV/σV² + μP/σP²) / (1/σA² + 1/σV² + 1/σP²)

In the case of independent causes (C = 2), the full-segregation auditory and visual numeric estimates analogously combine each unisensory estimate with the numeric prior:

(3) NA,C=2 = (xA/σA² + μP/σP²) / (1/σA² + 1/σP²)
(4) NV,C=2 = (xV/σV² + μP/σP²) / (1/σV² + 1/σP²)

To provide a final estimate of the number of auditory or visual stimuli, the observer is thought to combine the estimates under the two causal structures using various decision functions such as "model averaging," "model selection," or "probability matching" 20. According to "model averaging", the brain combines the two auditory numeric estimates weighted in proportion to the posterior probabilities of their underlying causal structures (eq. 5 only for NA):

(5) NA = p(C = 1 | xA, xV) NAV,C=1 + p(C = 2 | xA, xV) NA,C=2

According to the "model selection" strategy, the brain reports the numeric estimate selectively from the more likely causal structure (eq. 6 only for NA):

(6) NA = NAV,C=1 if p(C = 1 | xA, xV) > 0.5, otherwise NA = NA,C=2

According to "probability matching", the brain reports the numeric estimate of one causal structure stochastically selected in proportion to its posterior probability (eq. 7 only for NA):

(7) NA = NAV,C=1 if ξ < p(C = 1 | xA, xV), otherwise NA = NA,C=2, with ξ drawn from a uniform distribution on [0, 1]

Because model averaging outperformed the other decision functions (supplementary Table S1), the main report and analysis of the neural data focuses on model averaging.
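For concreteness, eqs. (1)-(5) can be condensed into a short numerical sketch. This is a minimal Python illustration under the stated Gaussian assumptions, using the standard closed-form likelihoods of the BCI generative model; it is a sketch, not the authors' fitting code.

```python
import numpy as np

def bci_model_averaging(x_a, x_v, p_common, mu_p, sigma_p, sigma_a, sigma_v):
    """Model-averaged auditory and visual numeric estimates for internal
    samples x_a, x_v under the Bayesian Causal Inference generative model."""
    va, vv, vp = sigma_a**2, sigma_v**2, sigma_p**2

    # Likelihood of the samples under a common cause (C = 1).
    d1 = va * vv + va * vp + vv * vp
    like_c1 = np.exp(-((x_a - x_v)**2 * vp + (x_a - mu_p)**2 * vv
                       + (x_v - mu_p)**2 * va) / (2 * d1)) / (2 * np.pi * np.sqrt(d1))

    # Likelihood under independent causes (C = 2): product of the marginals.
    like_c2 = (np.exp(-(x_a - mu_p)**2 / (2 * (va + vp))) / np.sqrt(2 * np.pi * (va + vp))
               * np.exp(-(x_v - mu_p)**2 / (2 * (vv + vp))) / np.sqrt(2 * np.pi * (vv + vp)))

    # Posterior probability of a common cause, eq. (1).
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))

    # Forced-fusion estimate, eq. (2): reliability-weighted average.
    n_av = (x_a/va + x_v/vv + mu_p/vp) / (1/va + 1/vv + 1/vp)
    # Full-segregation estimates, eqs. (3) and (4).
    n_a = (x_a/va + mu_p/vp) / (1/va + 1/vp)
    n_v = (x_v/vv + mu_p/vp) / (1/vv + 1/vp)

    # Model averaging, eq. (5): weight estimates by the causal posterior.
    return post_c1 * n_av + (1 - post_c1) * n_a, post_c1 * n_av + (1 - post_c1) * n_v
```

Trial-wise response predictions would then be obtained by simulating xA and xV from the generative model and mapping the continuous estimates onto the discrete 1-4 response scale.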
To arbitrate between the full-segregation, forced-fusion and BCI models, we fitted each model to participants' numeric reports (Table 2) by maximizing the likelihood of the reports, starting from multiple parameter initializations. We report the results (across-participants' mean and standard error) of the parameter setting with the highest log likelihood across these initializations (Table 2 and supplementary Table S1). This fitting procedure was applied individually to each participant's data set for the Bayesian Causal Inference (with three different decision functions), the forced-fusion and the full-segregation models. The model fit was assessed by Nagelkerke's coefficient of determination 67, using a null model of random guesses of stimulus number 1-4 with equal probability 0.25. To identify the optimal model for explaining participants' data, we compared the candidate models using the Bayesian Information Criterion (BIC) as an approximation to the model evidence 68. The BIC depends on both model complexity and model fit. We performed Bayesian model comparison 69 at the random effects group level as implemented in SPM12 70 to obtain the protected exceedance probability (the probability that a given model is more likely than any other model, beyond differences due to chance 71) for the candidate models.
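The fitting loop can be sketched generically as follows, assuming a user-supplied negative log-likelihood function (e.g. built from the BCI model's simulated response distributions). The optimizer choice and restart scheme here are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_log_lik, bounds, n_restarts=10, seed=0):
    """Maximum-likelihood fit with multiple random initializations.

    neg_log_lik : callable mapping a parameter vector to the negative log
                  likelihood of one participant's numeric reports
    bounds      : list of (lower, upper) bounds, one pair per parameter
    Returns the optimizer result with the highest log likelihood (lowest fun).
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Draw a random starting point within the parameter bounds.
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        res = minimize(neg_log_lik, x0, method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best

def bic(neg_log_lik_min, n_params, n_trials):
    """Bayesian Information Criterion; smaller values indicate better models."""
    return 2.0 * neg_log_lik_min + n_params * np.log(n_trials)
```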

EEG data acquisition and preprocessing
EEG signals were recorded from 64 active electrodes positioned in an extended 10-20 montage using electrode caps (actiCap, Brain Products, Gilching, Germany) and two 32-channel DC amplifiers. For the multivariate analyses, preprocessed single-trial EEG data within 20 ms time windows served as spatiotemporal features. EEG activity patterns were z-scored to control for mean differences between conditions. The first sampling point in the 20 ms time window was taken as the window's time point in all analyses.

Overview of EEG analysis
We characterized the neural processes underlying multisensory integration by combining the analysis approaches described in the following sections.

EEG: Univariate analysis of audiovisual interactions
To assess basic sensory components in ERPs and early audiovisual interactions, we averaged trial-wise EEG data time-locked to stimulus onset into ERPs for audiovisual congruent conditions. We then averaged the ERPs across parietal electrodes (i.e. Cz, CP1, CPz, CP2, P1, Pz, P2; Fig. 1D) or occipital electrodes (i.e. O1, O2, Oz, PO3, POz, PO4; Fig. 1E). To analyze early audiovisual interactions as reported for the sound-induced flash illusion, we computed the difference between audiovisual and the corresponding unisensory conditions (i.e. A1V1A2 - (A1A2 + V1)) 21,22. However, the auditory and visual trials were acquired in separate unisensory runs and may therefore differ in attentional and cognitive context. Further, our experimental design did not include null trials to account for anticipatory effects around stimulus onset and ensure a balanced audiovisual interaction contrast 73. Hence, these audiovisual interactions need to be interpreted with caution. To test whether the difference wave deviated from zero at the group level, we used a non-parametric randomization test (5000 randomizations) in which we flipped the sign of the individual difference waves and computed a two-sided one-sample t test as a test statistic 74. To correct for multiple comparisons across the sampling points, we used a cluster-based correction 75 with the sum of the t values across a cluster as cluster-level statistic and an auxiliary cluster-defining threshold of t = 2 for each time point.

For the multivariate analyses, a support vector regression model was trained, independently for each 20 ms time window, on the EEG activity patterns of the audiovisual congruent conditions of all but one run (cf. Results). This learnt mapping from EEG activity patterns to the external number of stimuli was then used to decode the number of stimuli from spatio-temporal EEG activity patterns of the congruent and incongruent audiovisual conditions of the remaining run. In a leave-one-run-out cross-validation scheme, the training-test procedure was repeated for all runs. To account for SNR differences across runs, predicted stimulus numbers were z-scored within each run. The decoded stimulus numbers for the congruent and incongruent conditions were used i. to assess decoding accuracy based on congruent trials only and ii. to compute the audiovisual weight index wAV in subsequent GLM-based analysis approaches (see below).
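The cross-validated decoding scheme can be summarized in a short sketch. This is a minimal Python illustration using scikit-learn's LinearSVR as a stand-in for the support vector regression used here; the array names and the exact SVR variant are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.svm import LinearSVR
from sklearn.model_selection import LeaveOneGroupOut

def decode_stimulus_number(eeg, n_stim, run, congruent):
    """Leave-one-run-out decoding of stimulus number from EEG activity patterns.

    eeg       : (n_trials, n_features) spatiotemporal patterns of a 20 ms window
    n_stim    : true number of stimuli per trial (1-4)
    run       : run label per trial (defines the cross-validation folds)
    congruent : boolean mask of audiovisual congruent trials (training set only)
    """
    decoded = np.full(len(n_stim), np.nan)
    for train_idx, test_idx in LeaveOneGroupOut().split(eeg, n_stim, run):
        # Train only on the audiovisual congruent trials of the training runs.
        train_idx = train_idx[congruent[train_idx]]
        model = LinearSVR().fit(eeg[train_idx], n_stim[train_idx])
        # Decode congruent and incongruent trials of the left-out run, then
        # z-score within the run to absorb run-wise signal-to-noise differences.
        decoded[test_idx] = zscore(model.predict(eeg[test_idx]))
    return decoded
```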

EEG: Multivariate GLM-based analysis, decoding accuracy and audiovisual weight index
First, we computed decoding accuracy based selectively on the audiovisual congruent conditions. We decoded stimulus numbers 1-4 at all time points, even though the distinction between high flash and/or beep numbers (e.g. three vs. four) was only possible at later time points. Hence, as expected, the decoder was able to discriminate between higher stimulus numbers (e.g. three vs. four stimuli) only after about 250 ms (Fig. 3A). Next, we evaluated the decoder's accuracy in terms of the Pearson correlation between true and decoded stimulus number selectively in audiovisual congruent conditions (Fig. 3B). We tested whether these correlations exceeded chance at the group level using the cluster-based corrected randomization test described above.

EEG: Multivariate Bayesian Causal Inference and Representational Dissimilarity Matrices
To characterize the neural dynamics of Bayesian Causal Inference, we next investigated whether and when the four numeric estimates of the Bayesian Causal Inference model are represented in EEG activity patterns using support vector regression (i.e. similar to a previous fMRI study 12 ), canonical correlation analysis and representational similarity analysis (RSA) 24 .
Because these three analysis approaches yield comparable results, we focus in the main manuscript on the RSA (see supplementary materials for the canonical correlation and support vector regression analysis and results, Fig. S6).
To define the representational dissimilarity matrices (RDMs) for the RSA 24, we computed the pairwise absolute distances between the numeric estimates of the BCI model across all conditions (cf. Results). To resolve the evolution of the full-segregation auditory, full-segregation visual, forced-fusion and the BCI estimates in time, we correlated their RDMs with the EEG RDMs. The EEG RDMs were computed as the Mahalanobis distance between single trial spatiotemporal EEG activity patterns for 20 ms time windows over conditions (cf. decoding analysis above) 77. More specifically, we computed the Mahalanobis distance from the activity patterns' variance-covariance matrix using the pattern component modeling toolbox 78. We quantified the similarity of the RDMs of the numeric estimates of the BCI model (Fig. 2C) with the EEG RDM at each 20 ms time interval using the Spearman's rank correlation r (Fig. 4A; i.e. correlation of the RDMs' upper triangular parts). The Fisher's z-transformed correlation coefficients were tested against zero using a one-sided randomization test (sign flip of correlation coefficients in 5000 randomizations) and a cluster-based correction for multiple comparisons across time intervals (as applied to decoding accuracy, see above).
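In outline, the RDM construction and comparison reduce to a few lines. The sketch below uses pairwise absolute differences for the model RDMs and Spearman correlation of upper triangles, as described above, but takes a generic EEG RDM as input in place of the Mahalanobis-distance computation done with the pattern component modeling toolbox.

```python
import numpy as np
from scipy.stats import spearmanr

def model_rdm(estimates):
    """Model RDM: pairwise absolute differences between the condition-wise
    numeric estimates (one value per condition)."""
    e = np.asarray(estimates, dtype=float)
    return np.abs(e[:, None] - e[None, :])

def rdm_similarity(rdm_model, rdm_eeg):
    """Spearman correlation between the upper triangular parts of two RDMs."""
    iu = np.triu_indices_from(rdm_model, k=1)
    return spearmanr(rdm_model[iu], rdm_eeg[iu]).correlation
```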
From the explained variance of the RDMs' correlation (i.e. r²), we computed the Bayesian Information Criterion as an approximation to the model evidence for each estimate and time point 68 (BIC = n · log(1 − r²) + 1 · log(n); n = number of EEG activity patterns). We entered these participant-specific model evidences into a random-effects group analysis to compute the protected exceedance probability (SPM12) that one numeric estimate was more likely encoded than any of the other estimates, separately for each time interval (Fig. 4B; see above).
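The BIC approximation from the RDM correlation is a one-liner; the group-level protected exceedance probability itself was computed with SPM12 and is not reproduced here.

```python
import numpy as np

def bic_from_r2(r2, n_patterns):
    """BIC approximation from the explained variance of an RDM correlation,
    with one free parameter (cf. BIC = n * log(1 - r^2) + 1 * log(n))."""
    return n_patterns * np.log(1.0 - r2) + 1.0 * np.log(n_patterns)

# Example: per time point, the lower BIC marks the better-supported estimate.
print(bic_from_r2(0.30, 500), bic_from_r2(0.25, 500))
```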

Time-frequency analysis of the effect of prestimulus oscillations on the causal prior and the visual precision parameters in the Bayesian Causal Inference model
We investigated whether prestimulus oscillatory power or phase over occipital electrodes (i.e. O1, O2, Oz, PO3, POz, PO4; cf. univariate analysis) influences participants' causal prior (pcommon) or visual precision (i.e. the reciprocal of σV). As in the Results, trials were sorted and binned into 10 deciles according to oscillatory power at each time-frequency point, the causal prior (or visual precision) was re-fitted for each decile, and the correlation between the re-fitted parameter and power over deciles was assessed at the group level with cluster-based corrected randomization tests (auxiliary cluster threshold t = 2). Note that initially, prior to the sort-and-bin approach, all five BCI parameters were re-fitted to the whole data set with valid EEG data. This additional refit was required because additional trials (i.e. 2.9 ± 0.4 % (mean ± SEM) of trials) were rejected due to EEG artefacts in the longer epochs from -1.5 s to 1 s. For illustrational purposes, we also computed the relative weight index wAV for each decile, both for the BCI model's predictions and the participants' numeric reports, averaged across the significant clusters (Fig. 5C).
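The sort-and-bin analysis can be sketched as follows, with the BCI re-fit abstracted into a placeholder callable (the hypothetical refit_p_common below). Passing different trial subsets lets the same routine be reused for the moderation analysis split by previous numeric disparity.

```python
import numpy as np
from scipy.stats import pearsonr

def power_decile_correlation(power, trial_idx, refit_p_common):
    """Sort trials into deciles by prestimulus power, re-fit the causal prior
    per decile, and correlate the fitted prior with mean power over deciles.

    refit_p_common : callable returning the re-fitted pcommon for a set of
                     trials (placeholder for the BCI re-fit described above).
    """
    order = trial_idx[np.argsort(power[trial_idx])]
    deciles = np.array_split(order, 10)
    mean_power = np.array([power[d].mean() for d in deciles])
    p_common = np.array([refit_p_common(d) for d in deciles])
    r, p = pearsonr(mean_power, p_common)
    return r, p
```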
Second, based on previous research implicating prestimulus alpha phase in temporal binding 26,27, we investigated whether the circular mean phase (i.e. averaged across trials within a decile) of alpha oscillations (8-12 Hz) was correlated with the BCI parameters (i.e. pcommon or σV) over alpha phase deciles using linear-circular correlation 66. To enable an unbiased group level statistic, we first randomized the assignment between mean circular phase and BCI parameters across the deciles (5000 randomizations) within each participant. Next, we computed the percentile of a participant's true circular-linear correlation in relation to this participant's null-distribution of circular-linear correlations. At the group level, we then tested whether the across-participants' mean percentile was significantly greater than 50% (i.e. the mean percentile under the null hypothesis) using a randomization test (i.e. sign flip of deviation of percentile from 50% in 5000 randomizations; test statistic: one-sided t-tests) and cluster-based correction for multiple comparisons (Fig. 6A; cluster-level statistic: sum of the t values in a cluster; cluster threshold t = 2).
To characterize the modulation of pcommon by alpha phase, we first fitted a sine and cosine to pcommon(dec) over alpha phase deciles Φdec at 10 Hz, individually for each participant's data and separately for each time point (eq. 8). That is, we computed the average phase Φdec at 10 Hz across trials for each decile at a particular time point and used this average phase to predict pcommon(dec) for this time point over deciles based on a sinusoidal model:

(8) pcommon(dec) = βsin · sin(Φdec) + βcos · cos(Φdec) + C + εdec

Crucially, this regression model (i.e. eq. 8) is estimated independently for each time point, resulting in βsin and βcos separately for each time point. Thus, eq. 8 characterizes the relationship between alpha phase and pcommon over deciles for a particular time point, so that the phase of this modulation can in principle vary across time (Fig. 6D).
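Eq. (8) is an ordinary least-squares problem once the sine and cosine of the decile-wise phases are used as regressors. The following minimal sketch also converts the two regression weights into the amplitude and preferred phase of the modulation; variable names are ours.

```python
import numpy as np

def fit_phase_model(phi_dec, p_common_dec):
    """Fit eq. (8): p_common(dec) = b_sin*sin(phi) + b_cos*cos(phi) + c,
    over the 10 alpha-phase deciles of one time point."""
    X = np.column_stack([np.sin(phi_dec), np.cos(phi_dec),
                         np.ones_like(phi_dec)])
    b_sin, b_cos, c = np.linalg.lstsq(X, p_common_dec, rcond=None)[0]
    amplitude = np.hypot(b_sin, b_cos)     # depth of the phasic modulation
    pref_phase = np.arctan2(b_sin, b_cos)  # phase at which p_common peaks
    return amplitude, pref_phase, (b_sin, b_cos, c)
```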
Second, we fitted a more constrained regression model with one single sine and cosine at F = 10 Hz that uses the phase Φdec,t, averaged over trials within a particular decile dec at a time point t, to predict pcommon. Hence, this model assumes that the modulation of pcommon by alpha phase for each time point (i.e. a column in Fig. 6B) evolves slowly over time according to a 10 Hz alpha oscillatory rhythm:

(9) pcommon(dec, t) = βsin · sin(2π F t − Φdec,t) + βcos · cos(2π F t − Φdec,t) + C + εdec,t

where pcommon(dec, t) denotes the causal prior estimated over trials in a particular decile at time t. We then tested whether the phase of this modulation was consistent across participants and hence deviated significantly from a circular uniform distribution using a Rayleigh test 66. The distribution of phase angles over participants was not significantly different from uniformity, which can be explained by participant-specific cortical folding leading to differences in the orientation of the underlying neural sources. We therefore identified the peak in predicted pcommon(dec, t) at t = -160 ms in each participant (based on eq. 8), computed the difference in deciles between the participant's peak decile and the group peak decile (Supplementary Fig. S5B) and then shifted the predicted and observed pcommon(dec, t) in each participant by this difference across all time points. As a consequence, the adjusted participant's peak is aligned with the predicted group peak pcommon(dec, t) at t = -160 ms (−0.29 π = −52°; Supplementary Fig. S5A, B) 82. Then we averaged the observed and predicted (cf. eq. 8) pcommon across participants for illustrational purposes (Fig. 6C and D).

Finally, we investigated whether the effect of previous numeric disparity interacts with the correlation between alpha power and the causal prior (i.e. moderation). For this, we first sorted trials according to whether the previous trial (i.e. only order one) was of small or large numeric disparity. We then sorted and binned the trials according to their oscillatory power into 10 deciles, separately for previous small versus large numeric disparity. We selectively re-computed the causal prior for each decile and assessed the influence of alpha power on the causal prior in terms of correlation coefficients separately for previous low and high numeric disparity, exactly as in our initial main analysis of alpha power (see above). Finally, we compared the Fisher's z-transformed correlation coefficients of alpha power with the causal prior for previous low and high numeric disparity trials in a randomization test (i.e. sign flip of z-transformed correlation differences in 5000 randomizations; test statistic: two-sided t-tests) and cluster-based correction for multiple comparisons (as described above).

Assumptions and caveats of multisensory analysis approaches
This study combined several complementary approaches to characterize the neural processes underlying multisensory integration 83. Univariate audiovisual interactions: the additive contrast compares the multisensory response with the sum of the unisensory responses (i.e. A+V vs. AV), so any activity common to all stimuli enters twice into A+V but only once into AV. Likewise, early putative audiovisual interactions in EEG have been suggested to emerge because of anticipatory ERP effects that precede all stimulus presentations and are therefore counted twice for A+V, but only once for AV (see 73). Therefore, multisensory interactions should optimally be computed including 'null events' to account for non-specific expectation effects (i.e. AV+Null vs. A+V). Further, in our study unisensory and multisensory conditions may differ in attentional context, because auditory and visual conditions were performed in separate experimental runs where either auditory or visual information was task-relevant.
Collectively, these factors need to be taken into account when interpreting audiovisual interactions in EEG (or fMRI) responses in our and other studies.
Multivariate decoding: The EEG activity patterns measured across 64 scalp electrodes represent a superposition of activity generated by potentially multiple neural sources located, for instance, in auditory, visual or higher-order association areas. The extent to which auditory or visual information can be decoded from EEG activity patterns therefore depends inherently not only on how information is neurally encoded by the 'neural generators' in source space, but also on how these neural activities are expressed and superposed in sensor space (i.e. as measured by scalp electrodes). For example, the number of auditory beeps is perceptually more precisely represented than the number of flashes (based on observers' behavioral reports, Table 2), suggesting that the brain encodes the timing and number of events with greater precision in audition than in vision. Nevertheless, supplementary decoding analyses in sensor space revealed that the number of unisensory flashes can be more accurately decoded from EEG activity patterns than the number of unisensory beeps (Supplementary Fig. S2). These discrepancies between precision (or accuracy) measured at the behavioral/perceptual level and EEG decoding accuracy at the sensor level may result from differences in neural encoding in source space or from how these neural activities are expressed in sensor space (e.g. source orientation, superposition etc.). Potentially, the greater decodability of visual numeric information may contribute to the visual bias we observed for the audiovisual weight index wAV (Fig. 3C) and the dominance of the visual numeric estimates in our decoding analysis based on the estimates of the Bayesian Causal Inference model (Fig. 4A).
In the analysis of the audiovisual weight index wAV, we trained the support vector regression model on the audiovisual congruent conditions pooled over task-relevance to ensure that the decoder was based on activity patterns generated by sources related to auditory, visual and audiovisual integration processes. Moreover, this approach ensures that the effects of task-relevance on the audiovisual weight index wAV cannot be attributed to differences in the decoding model (see 84 for a related discussion). In a supplementary analysis, we also trained the SVR models separately for visual and auditory report and obtained comparable results (Supplementary Fig. S7), suggesting that our results are immune to this particular choice of decoding model. While the univariate interaction analysis (see above) cannot identify linear response combinations, this multivariate decoding analysis cannot exclude the possibility that auditory and visual stimuli jointly influence EEG activity patterns even though auditory and visual signals are not integrated at the single neuron level.
In our second multivariate analysis approach, we decoded (directly: SVR, canonical correlation analysis; or indirectly: RDM analysis) the numeric estimates of the Bayesian Causal Inference model from EEG activity pattern and then computed the exceedance probability that one numeric estimate was more likely encoded than any other one. The decoding approaches using support vector regression, canonical correlation analysis and representational dissimilarity analysis provided comparable results indicating that our results are robust to the specific decoding approach (Fig. 4 and supplementary Fig. S6). However, given the caveats discussed above (e.g. superposition of EEG activity patterns) and the high correlation between the different numeric estimates in the BCI model, it seems likely that multiple numeric estimates are concurrently represented in the brain even if the exceedance probability is high for only one particular numeric estimate.

Code availability
The Matlab code to fit the Bayesian Causal Inference model 1 is available from the authors on request.

[Figure legend fragments] Fig. 2C: arrows indicate the influence of component estimates on the RDM of the final BCI estimate; the probability of independent causes (dashed) increases with larger numeric disparity and vice versa for the probability of a common cause (n.b. they sum to unity). Fig. 4A: correlations of the EEG RDMs with the RDMs of the unisensory estimates, the forced-fusion estimate under the assumption of a common cause (C = 1) and the final BCI estimate (NA or NV, depending on the task-relevant sensory modality, blue), which averages the task-relevant unisensory and the reliability-weighted fusion estimates by the posterior probability of each causal structure (p(C = 1|xA, xV)); color-coded horizontal dashed lines indicate clusters of significant correlation coefficients (p < 0.05; one-sided cluster-based corrected randomization t22 test); stimulus onsets are shown along the x axis. Fig. 4B: protected exceedance probability (i.e. the probability that a given variable is more likely encoded in the EEG activity patterns than any other variable, beyond differences due to chance) for each of the four numeric BCI estimates as a function of time; the length of an estimate's bar indicates the estimate's protected exceedance probability (n.b. the y axis indicates protected exceedance probabilities cumulated over the BCI estimates). wAV = 90° indicates purely visual and wAV = 0° purely auditory influence.