Uncovering hidden brain state dynamics that regulate performance and decision-making during cognition

Human cognition is influenced not only by external task demands but also by latent mental processes and brain states that change over time. Here, we use a novel Bayesian switching dynamical systems algorithm to identify hidden brain states and determine that these states are only weakly aligned with external task conditions. We compute state transition probabilities and demonstrate how dynamic transitions between hidden states allow flexible reconfiguration of functional brain circuits. Crucially, we identify latent transient brain states and dynamic functional circuits that are optimal for cognition and show that failure to engage these states in a timely manner is associated with poorer task performance and weaker decision-making dynamics. We replicate our findings in a large sample (N = 122) and reveal a robust link between cognition and flexible latent brain state dynamics. Our study demonstrates the power of switching dynamical systems models for investigating hidden dynamic brain states and functional interactions underlying human cognition.


Supplementary Methods
1 Bayesian switching dynamical systems

Model
Let $y_t^s = (y_{t1}^s, \ldots, y_{tD}^s)^\top$ denote a D-dimensional vector of ROI timeseries measured from subject s at time t, where D is the number of ROIs and $\top$ denotes the transpose operator. We collect a sequence of T measurements from S subjects in $Y = \{Y^s \mid s = 1, \ldots, S\}$, where $Y^s = \{y_1^s, \ldots, y_T^s\}$. Following the general formulation of switching state-space models, we define $z_t^s$ as the latent state variable and $x_{kt}^s$ as the latent space variable associated with $y_t^s$ at the k-th latent state, that is, $z_{kt}^s = 1$. The sets of latent state and latent space variables are denoted by $Z = \{Z^s \mid s = 1, \ldots, S\}$ and $X = \{X_k^s \mid s = 1, \ldots, S,\ k = 1, \ldots, K\}$, respectively. The latent state variable $z_t^s$ is a 1-of-K discrete vector with elements $z_{kt}^s$, $\forall k = 1, \ldots, K$. As shown in the left panel of Supplementary Figure 1, two consecutive time instances are linked via a first-order Markov chain through a hidden Markov model (HMM). HMMs are statistical Markov models that have been used extensively for modeling timeseries 1,2,3, including dynamic functional connectivity analysis of fMRI data 4,5,6,7. The elements of the HMM are defined as:
• initial state probability distribution, $\pi = \{\pi_k \mid \sum_{k=1}^{K} \pi_k = 1\}$;
• state transition probability distribution, $A = [a_{ij}]$, with $\sum_{j=1}^{K} a_{ij} = 1$, $\forall i = 1, \ldots, K$;
• emission probability distribution, $O_t^s = \{O_{kt}^s \mid \forall k = 1, \ldots, K\}$.
Given $A$, $\pi$, and using the Markovian properties, the probability mass for the latent state variables is expressed as:
$$p(Z^s \mid \pi, A) = \prod_{k=1}^{K} \pi_k^{z_{k1}^s} \prod_{t=2}^{T} \prod_{i=1}^{K} \prod_{j=1}^{K} a_{ij}^{z_{i,t-1}^s z_{jt}^s}. \quad (1)$$
We assume that at a given latent state k at time t, indicated by $z_{kt}^s = 1$, the observed vector $y_t^s$ is generated via the probabilistic interpretation of the factor analysis model 8,9 as:
$$y_t^s = U_k x_{kt}^s + \mu_k + e_{kt}, \quad \forall t, s \mid z_{kt}^s = 1,$$
where $U_k = (u_{k1}, \ldots, u_{kP})$ is the linear transformation matrix, P is the dimensionality of the latent space variable (in general $P < D$), $\mu_k$ is the overall bias, and $e_{kt}$ is the measurement noise. With the normality assumption, that is, $x_{kt}^s \sim \mathcal{N}(0, I)$ and $e_{kt} \sim \mathcal{N}(0, \Psi_k)$, the marginal distribution of $y_t^s$ follows a Gaussian distribution 9:
$$p(y_t^s \mid \mu_k, U_k, \Psi_k) = \mathcal{N}(\mu_k,\, U_k U_k^\top + \Psi_k). \quad (2)$$
Note that here we have assumed noise with the same covariance across subjects.
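The factor analysis emission model above can be illustrated numerically: sampling from the generative equation and checking that the empirical covariance approaches $U_k U_k^\top + \Psi_k$. This is a minimal sketch, not the authors' implementation; the dimensions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, T = 5, 2, 200_000  # ROIs, latent dimension, samples (illustrative)

# Hypothetical state-specific factor analysis parameters
U_k = rng.normal(size=(D, P))                    # linear transformation (loadings)
mu_k = rng.normal(size=D)                        # overall bias
Psi_k = np.diag(rng.uniform(0.5, 1.0, size=D))   # diagonal noise covariance

# Generate y_t = U_k x_t + mu_k + e_t with x_t ~ N(0, I), e_t ~ N(0, Psi_k)
x = rng.normal(size=(T, P))
e = rng.multivariate_normal(np.zeros(D), Psi_k, size=T)
Y = x @ U_k.T + mu_k + e

# The marginal covariance should approach U_k U_k^T + Psi_k
C_model = U_k @ U_k.T + Psi_k
C_empirical = np.cov(Y, rowvar=False)
print(np.max(np.abs(C_empirical - C_model)))  # small for large T
```

The diagonal noise covariance is the standard factor analysis choice; the marginal Gaussian emerges regardless of the state-switching machinery because the latent sources are standard normal.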
We then define a dynamical process on the latent space variables 10 using an autoregressive (AR) model of order R (Supplementary Figure 1):
$$x_{kt}^s = \sum_{r=1}^{R} V_{kr}\, x_{k,t-r}^s + \epsilon_{kt},$$
where $V_k$ is a $P^2 R$-dimensional vector collecting the elements of the AR coefficient matrices $V_{k1}, \ldots, V_{kR}$, $0_{RP}$ is a $(1 \times RP)$-dimensional zero vector, $x_{k,t-r}^s$ is the P-dimensional vector of latent space variables at time $t - r$, and $\epsilon_{kt} \sim \mathcal{N}(m_k, \Sigma_k)$ is the remaining error term in the latent space.
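The AR dynamics on the latent space can be simulated in a few lines. This is an illustrative sketch only; the coefficient matrices and noise covariance are made-up values chosen to keep the process stationary.

```python
import numpy as np

rng = np.random.default_rng(1)
P, R, T = 2, 2, 500  # latent dimension, AR order, timepoints (illustrative)

# Hypothetical AR coefficient matrices V_{k,1..R}, scaled for stability
V = [0.4 * np.eye(P), 0.2 * np.eye(P)]
m_k = np.zeros(P)           # latent noise mean
Sigma_k = 0.1 * np.eye(P)   # latent noise covariance

x = np.zeros((T, P))
for t in range(R, T):
    # x_t = sum_r V_r x_{t-r} + eps_t,  eps_t ~ N(m_k, Sigma_k)
    x[t] = sum(V[r] @ x[t - r - 1] for r in range(R))
    x[t] += rng.multivariate_normal(m_k, Sigma_k)

# Positive AR coefficients induce positive lag-1 autocorrelation
ac1 = np.corrcoef(x[R:-1, 0], x[R + 1:, 0])[0, 1]
print(ac1)
```

The autocorrelation check makes the point of the AR layer concrete: unlike the static factor analysis sources, consecutive latent space variables are temporally dependent.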
A graphical representation of the model is presented in Supplementary Figure 1. We have introduced hierarchical parameters, $\nu_{kp}$, which are not explicit in the generative model. As we shall see in § 1.2, these hyperparameters play the role of regularizers on the model complexity and are used to determine the effective number of latent space variables. The idea of using such a hierarchical structure was first introduced in 11,12 and is now regarded as a common approach for regularizing model complexity by imposing sparsity in various Bayesian probabilistic models, for example, in Bayesian principal component analysis 13,14. Supplementary Table 1 summarizes the variables introduced in this section.
where the effect of the autoregressive process model parameters, $(V, \Sigma, m)$, is implicit in the definition of the marginal density of $y_t^s$. More specifically, $\Sigma_k$ in Eq. (2) is assumed to be the identity matrix to avoid possible identifiability/degeneracy problems. Note that $\Sigma_k$ carries information about the noise variance of the latent space variables, so we still formally compute its posterior distribution in order to robustly estimate the parameters of the dynamical process on the latent space. At the observation level, however, the effect of $\Sigma_k$ remains fixed at identity to avoid degeneracy.
To keep notation uncluttered, we define $\theta_{HMM} = \{\pi, A\}$, $\theta_{FA} = \{\mu, U, \Psi\}$, and $\theta_{AR} = \{V, \Sigma, m\}$. Note that $\theta_{HMM}$, $\theta_{FA}$, and $\theta_{AR}$ are group parameters, learnt using observed data from all subjects. The latent state variables, $z_t^s$, and latent space variables, $x_{kt}^s$, are, however, subject specific, inferred using the learnt group parameters.
Using Eq. (1) and Eq. (2), the joint distribution over observations and latent variables is given by:
$$p(Y^s, Z^s, X^s) = p(Z^s \mid \pi, A) \prod_{t=1}^{T} \prod_{k=1}^{K} \left[ p(x_{kt}^s \mid x_{k,t-1}^s, \ldots, x_{k,t-R}^s, \theta_{AR})\; p(y_t^s \mid z_t^s, x_{kt}^s, \theta_{FA}) \right]^{z_{kt}^s}. \quad (3)$$

Prior parameter distribution
Bayesian inference combines priors with data to produce posterior distributions over all model parameters. Here, we work with priors that are conjugate to the data likelihood, so that the posteriors have the same functional form as their priors. This simplifies learning and allows efficient implementation.

Prior over factor analysis model parameters
We use the same choice of prior distributions over the factor analysis model parameters, $\theta_{FA}$, as proposed by Ghahramani and Beal 14:
a. $p(U \mid \nu) = \prod_{k=1}^{K} \prod_{d=1}^{D} p(u_{kd} \mid \nu_k) = \prod_{k=1}^{K} \prod_{d=1}^{D} \mathcal{N}(0, \mathrm{diag}(\nu_k)^{-1})$, where $u_{kd}$ denotes the column vector corresponding to the d-th row of the linear transformation matrix U at the k-th state and $0$ is a P-dimensional zero vector.
b. $p(\nu) = \prod_{k=1}^{K} \prod_{p=1}^{P} p(\nu_{kp}) = \prod_{k=1}^{K} \prod_{p=1}^{P} \mathcal{G}(a^*, b^*)$, where $a^*$ and $b^*$ are the shape and inverse-scale hyperparameters of a Gamma distribution. For a noninformative initialization we set $a^* = b^* = 1$. As discussed in 14, when $\nu_{kp} \to \infty$, the outgoing weights of the p-th latent space variable approach zero, allowing the model to reduce the intrinsic dimensionality of the latent space if the data do not provide enough evidence to keep the added dimension; this effect is commonly known as automatic relevance determination in Bayesian learning 11,12,13.
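The automatic relevance determination effect follows directly from the prior structure: a large precision $\nu_{kp}$ shrinks the corresponding column of $U_k$ toward zero. A minimal sketch with made-up precision values (not fitted posteriors):

```python
import numpy as np

rng = np.random.default_rng(2)
D, P = 10, 3  # illustrative dimensions

# Hypothetical ARD precisions: the third latent dimension is "switched off"
nu_k = np.array([1.0, 1.0, 1e6])

# Rows u_kd ~ N(0, diag(nu_k)^{-1}); column p has std nu_k[p]^{-1/2}
U_k = rng.normal(0.0, 1.0 / np.sqrt(nu_k), size=(D, P))

print(U_k.std(axis=0))  # third column collapses toward zero
```

In the full model the $\nu_{kp}$ are inferred rather than fixed, so dimensions unsupported by the data are pruned automatically during learning.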

Prior over autoregressive process model parameters
We use the same choice of prior distributions proposed by Fox et al. 10: inverse-Wishart priors on the noise covariance matrices, whose hyperparameters indicate the degrees of freedom and scale matrices of the inverse-Wishart densities, and Gaussian priors on the autoregressive coefficients and the noise mean vectors, where $\mu_k^{(m)*}$ and $\Sigma_k^{(m)*}$ are the mean vectors and covariance matrices of the Gaussian densities.

Variational inference
In the Bayesian view of uncertainty, all uncertain quantities are treated as random variables, and the Bayesian approach integrates over all possible settings of these random variables. The resulting quantity is known as the marginal likelihood. Following the graphical model in Supplementary Figure 1, the marginal likelihood for the BSDS model is expressed as:
$$p(Y) = \int \sum_{Z} p(\theta_{FA})\, p(\theta_{AR})\, p(\theta_{HMM}) \prod_{s=1}^{S} p(Y^s, Z^s, X^s \mid \theta_{FA}, \theta_{AR}, \theta_{HMM})\; dX\, d\theta_{FA}\, d\theta_{AR}\, d\theta_{HMM}. \quad (4)$$
Given the observed data and the priors, the posterior distribution may be inferred using Bayes' rule. Directly computing the posterior, however, is not analytically tractable. In the following, we approximate the posterior using a family of optimization methods known as variational inference 16,17, which reformulates the problem of computing the posterior distribution as an optimization problem. Here, we work with a particular class of variational methods known as mean-field methods, which are based on optimizing the Kullback-Leibler divergence 16. We aim to minimize the Kullback-Leibler divergence between the variational posterior distribution, $q(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM})$, and the true posterior distribution, $p(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM} \mid Y)$:
$$\mathrm{KL}(q \,\|\, p) = \left\langle \log \frac{q(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM})}{p(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM} \mid Y)} \right\rangle, \quad (5)$$
where the operator $\langle \cdot \rangle$ takes the expectation of its argument with respect to the variational distribution $q(\cdot)$. The minimization of Eq. (5) is equivalent to the maximization of a lower bound, $\mathcal{L}$, on the log marginal likelihood:
$$\mathcal{L} = \log p(Y) - \mathrm{KL}(q \,\|\, p) \le \log p(Y). \quad (6)$$
For the mean-field framework to yield a computationally efficient inference method, it is necessary to choose a family of distributions $q(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM})$ for which the lower bound can be tractably optimized. Here, we approximate the true posterior using a partly factorized approximation of the form:
$$p(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM} \mid Y) \approx q(Z, X, \theta_{FA}, \theta_{AR}, \theta_{HMM}) = q(\theta_{FA})\, q(\theta_{AR})\, q(\theta_{HMM})\, q(X \mid Z)\, q(Z). \quad (7)$$
Given our choice of factorization in Eq. (7), we can express the lower bound explicitly as in Eq. (8), where $\mathcal{L}$ separates into two parts. The first part, $\mathcal{L}(Y \mid Z)$, is the Z-conditional expected value, calculated using $q(X, \theta_{FA}, \theta_{AR} \mid Z)$ across $X$, $\theta_{FA}$, and $\theta_{AR}$, relating only to the generation of $Y$; this part is calculated as a function of any given $Z$. The second part depends only on the Markov-chain model for the latent variables $Z$. The total lower bound can therefore be maximized by alternating optimization of each of these two parts while the other is kept fixed.

Posterior distributions of HMM parameters
For the model defined through the joint probability distribution in Eq. (4) and the prior parameter distributions specified in § 1.2.1, the posterior distributions of the HMM parameters, $q(\theta_{HMM}) = q(\pi)\, q(A)$, which maximize the lower bound, Eq. (8), are given by:
a. a Dirichlet density of the form $q(\pi) = \mathrm{Dir}(\pi \mid \alpha^\pi)$; (9)
b. and independent Dirichlet densities on each row of A of the form $q(a_i) = \mathrm{Dir}(a_i \mid \alpha_i^a)$, $\forall i = 1, \ldots, K$. (10)
Using Eq. (9) and Eq. (10), it follows that:
$$\langle \log a_{ij} \rangle_{q(A)} = \psi(\alpha_{ij}^a) - \psi\!\left(\textstyle\sum_{j=1}^{K} \alpha_{ij}^a\right), \qquad \langle \log \pi_k \rangle_{q(\pi)} = \psi(\alpha_k^\pi) - \psi\!\left(\textstyle\sum_{k=1}^{K} \alpha_k^\pi\right), \quad (11)$$
where $\psi(\cdot)$ denotes the digamma function, $\alpha_{ij}^a$ is the j-th element of $\alpha_i^a$, and $\alpha_k^\pi$ is the k-th element of $\alpha^\pi$.

Posterior distributions of factor analysis parameters
At a given latent state, $z_{kt}^s = 1$, with a latent space variable characterized by the sufficient statistics $\langle x_{kt}^s \rangle$ and $\langle x_{kt}^s (x_{kt}^s)^\top \rangle$, the generative model of BSDS can be viewed as a factor analysis model with an autoregressive process on its factor sources (latent space variables). Hence, in the following, we adopt a similar methodology to that of the static factor analysis model 14. For detailed discussions of static factor analysis, readers are referred to 18.
For the model defined through the joint probability distribution in Eq. (4) and the prior parameter distributions specified in § 1.2.2, the posterior distributions of the model parameters which maximize the lower bound, Eq. (8), are given as follows: a. the posterior distribution of U and µ is given by independent Gaussian densities, where $\bar{\mu}_{kd}$ denotes the d-th element of $\bar{\mu}_k$ and $\bar{u}_{kd}$ denotes the column vector corresponding to the d-th row of $U_k$; and b. the posterior distribution of ν is given by independent Gamma densities, where $a_k$ and $b_k$ are the shape and inverse-scale posterior hyperparameters of the Gamma densities.

Posterior distribution of the autoregressive process parameters
For the model defined through the joint probability distribution in Eq. (4) and the prior parameter distributions specified in § 1.2.3, the posterior distributions of the model parameters $(V, \Sigma, m)$ which maximize the lower bound, Eq. (8), are given as follows: a. the posterior distributions of the autoregressive coefficients are given by independent Gaussian densities with the corresponding posterior hyperparameters; b. the posterior distributions of the noise covariance matrices in the process, Σ, are given by inverse-Wishart densities with degrees of freedom $\nu_k^{(\Sigma)}$ and scale matrices $S_k^{(\Sigma)}$; and c. the posterior distributions of the noise mean vectors, m, are given by Gaussian densities with their corresponding posterior hyperparameters.

Optimization of the posterior hyperparameters
We then follow a similar approach to the static factor analysis model 14 for optimization of the posterior hyperparameters, $\{\alpha^*, m^*, a^*, b^*, \mu^*, \nu^*, \Psi\}$, in the model. The optimized hyperparameters are computed as follows: a. the noise covariance, Ψ, is updated using maximum likelihood estimation; b. the hyperparameters $a^*, b^*$ are computed from fixed-point equations; c. the scale prior $m_k^* = \frac{1}{K}$ is fixed, and $\alpha_\pi^* = \alpha_{a_i}^* = \alpha^*$ for all $i = 1, \ldots, K$; and d. $\mu^*$ and $\nu^*$ are optimized accordingly.

Sufficient statistics of the latent variables
We only require the sufficient statistics of the latent space variables and the latent state variables, which are inferred for each subject using the group-level posterior parameters learnt in § 1.4.

Sufficient statistics of the latent space variables
For the model defined through the joint probability distribution in Eq. (4) and the marginal density of $y_t^s$ given by Eq. (2), the sufficient statistics of the latent space variables, $\langle x_{kt}^s \rangle$ and $\langle x_{kt}^s (x_{kt}^s)^\top \rangle$, are inferred for each subject, where $X_{kt}^s$ is expressed in terms of $x_{kt}^s$, and $\bar{x}_t^s$ denotes the data representation at time t for subject s (Eq. (31)).

Sufficient statistics of the latent state variables
The posterior distribution of the latent state variables, q(Z), is obtained by maximizing the second term of the lower bound in Eq. (8) and using the Markov properties of Z. In practice, variants of the forward and backward algorithms can be used to efficiently compute the necessary marginal probabilities, namely $\langle z_t^s \rangle_{q(z_t^s)}$ and $\langle z_t^s z_{t-1}^s \rangle_{q(z_t^s)}$. Computation of these statistics using the forward and backward algorithms requires the emission probability distribution, $O_t^s = \{O_{kt}^s \mid 1 \le k \le K\}$, given by Eq. (32). Details of the forward-backward computations are discussed in the following.

Forward Algorithm
The forward procedure is initialized for t = 1 as $\alpha_{i1}^s = \tilde{a}_{Ki}\, O_{i1}^s$, where $\tilde{a}_{ij} = \exp \langle \log a_{ij} \rangle_{q(A)}$, $\forall\, 1 \le i \le K$. The forward iteration then runs for $t = 2, \ldots, T$ as:
$$\alpha_{jt}^s = \frac{1}{C_t^s}\, O_{jt}^s \sum_{i=1}^{K} \alpha_{i,t-1}^s\, \tilde{a}_{ij},$$
where $C_t^s$ is the normalization constant that makes $\sum_j \alpha_{jt}^s = 1$; we retain $C_t^s$ for later use in the computation of the lower bound.

Backward Algorithm
To supplement the forward calculation, we need a backward variable representing the observations after time t. The backward variable is initialized as $\beta_{jT}^s = 1$ for all $j = 1, \ldots, K$ and is computed recursively for $t = T-1, \ldots, 1$ as:
$$\beta_{it}^s = \frac{1}{C_{t+1}^s} \sum_{j=1}^{K} \tilde{a}_{ij}\, O_{j,t+1}^s\, \beta_{j,t+1}^s.$$
Finally, $\langle z_{it}^s \rangle_{q(z_{it}^s)}$ and $\langle z_{it}^s z_{j,t-1}^s \rangle_{q(z_t^s)}$ are given by:
$$\langle z_{it}^s \rangle = \alpha_{it}^s\, \beta_{it}^s, \qquad \langle z_{jt}^s z_{i,t-1}^s \rangle = \frac{1}{C_t^s}\, \alpha_{i,t-1}^s\, \tilde{a}_{ij}\, O_{jt}^s\, \beta_{jt}^s.$$
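The scaled forward-backward recursions can be sketched as follows. This is a generic illustration rather than the authors' code: it initializes from an initial distribution `pi_tilde` (the text's variant initializes via $\tilde{a}_{Ki}$), and the transition matrix and emission probabilities are toy values.

```python
import numpy as np

def forward_backward(a_tilde, pi_tilde, O):
    """Scaled forward-backward: posterior marginals <z_t> and the
    log-normalizer. O is a (T, K) array of emission probabilities."""
    T, K = O.shape
    alpha = np.zeros((T, K))
    c = np.zeros(T)  # per-step normalization constants C_t

    # Forward pass
    alpha[0] = pi_tilde * O[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = O[t] * (alpha[t - 1] @ a_tilde)
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = (a_tilde @ (O[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                      # posterior marginals <z_t>
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, np.log(c).sum()             # marginals, log-likelihood

# Toy example with K = 2 states and arbitrary probabilities
a_tilde = np.array([[0.9, 0.1], [0.2, 0.8]])
pi_tilde = np.array([0.5, 0.5])
O = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])
gamma, loglik = forward_backward(a_tilde, pi_tilde, O)
print(gamma)
```

The per-step constants $C_t^s$ accumulate to the log-likelihood term that enters the lower bound, which is why they are retained rather than discarded.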

Algorithm
An algorithmic summary of BSDS is shown in Algorithm 1, Supplementary Table 2.

Measures
Transition probability between states Transition probabilities between the latent states are given by $\hat{A} = [\tilde{a}_{ij}]$ (Eq. (38)), where $\tilde{a}_{ij} = \exp \langle \log a_{ij} \rangle_{q(A)}$ and $\langle \log a_{ij} \rangle_{q(A)}$ is given by Eq. (11). Diagonal values of $\hat{A}$ give the self-transition probabilities and off-diagonal values give the cross-transition probabilities.
Temporal evolution of states The temporal evolution of the latent states indicates to which latent state a given time point belongs. It is given by the Viterbi path, defined as the most likely sequence of latent states given the sequence of observed data, and is computed using the Viterbi algorithm 2. Explicitly, the temporal evolution of states for a given subject s is computed from the estimated emission probability distribution $O_{kt}^s$, given by Eq. (32), and the estimated transition probabilities, given by Eq. (38). Note that because the model parameters, $(\theta_{FA}, \theta_{HMM}, \theta_{AR})$, are learnt at the group level using sufficient statistics from all subjects, there is a one-to-one correspondence between states across subjects, such that a given state i corresponds to the same state for all subjects, $s = 1, \ldots, S$.
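The Viterbi decoding described above can be sketched generically; the transition matrix and emissions here are toy values, not estimates from the model.

```python
import numpy as np

def viterbi(a_hat, pi_hat, O):
    """Most likely state sequence given transition, initial, and
    emission probabilities (ordinary, not log, scale)."""
    T, K = O.shape
    log_a, log_pi = np.log(a_hat), np.log(pi_hat)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)

    delta[0] = log_pi + np.log(O[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a  # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(O[t])

    # Trace the best path backwards
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy example: emissions strongly favor state 0, then state 1
O = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
a_hat = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi(a_hat, np.array([0.5, 0.5]), O)
print(path)  # [0 0 1 1]
```

Because sticky self-transitions penalize switching, the decoded path changes state only when the emission evidence outweighs the transition cost, which is what makes the temporal evolution of states robust to isolated noisy timepoints.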
Occupancy rate and mean lifetime of states The occupancy rate for state i and subject s is computed as:
$$OR_i^s = \frac{1}{T} \sum_{t=1}^{T} \delta(z_{it}^s = 1),$$
where $\delta(z_{it}^s = 1)$ is an indicator function that equals one if the current state at time t is the i-th state. To determine which state is occupied at a given time, we use the temporal evolution of states.
The mean lifetime of a state is the average time that a given state i continuously persists before switching to another state. This quantity can be simply computed from the temporal evolution of states.
Occupancy rate and mean lifetime are complementary to each other. These quantities are computed for all states and across all subjects. As there is a one-to-one correspondence between states across subjects, we can monitor their variation across subjects.
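Both measures can be computed from a decoded state sequence in a few lines. A minimal sketch on a toy sequence (the helper name and TR argument are illustrative):

```python
import numpy as np

def occupancy_and_lifetime(states, K, tr=1.0):
    """Occupancy rate and mean lifetime (in units of tr) per state,
    given a Viterbi-style temporal evolution of states."""
    states = np.asarray(states)
    T = len(states)
    occupancy = np.array([(states == i).mean() for i in range(K)])

    # Segment the sequence into runs of identical states
    change = np.flatnonzero(np.diff(states)) + 1
    starts = np.r_[0, change]
    ends = np.r_[change, T]
    lifetimes = {i: [] for i in range(K)}
    for s, e in zip(starts, ends):
        lifetimes[int(states[s])].append((e - s) * tr)
    mean_lifetime = np.array(
        [np.mean(lifetimes[i]) if lifetimes[i] else 0.0 for i in range(K)])
    return occupancy, mean_lifetime

# Toy state sequence with K = 3 states
seq = [0, 0, 0, 1, 1, 2, 0, 0, 1]
occ, mlt = occupancy_and_lifetime(seq, K=3)
print(occ)   # [5/9, 3/9, 1/9]
print(mlt)   # [2.5, 1.5, 1.0]
```

The complementarity is visible here: state 0 has both the highest occupancy and the longest mean lifetime, but the two can dissociate when a state occurs in many short visits.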

Group-level mean and covariance
Using the estimated posterior distributions of the model parameters, the estimated group-level mean and covariance for each state are given by:
$$\hat{\mu}_k = \bar{\mu}_k, \qquad \hat{C}_k = \bar{U}_k \bar{U}_k^\top + \Psi,$$
where $\bar{\mu}_{kd}$, $\bar{u}_{kd}$, and $\Psi^{-1}$ are given by Eq. (14) and Eq. (23). Note that, using the temporal evolution of states and the estimated covariance of each state, we can compute changes in covariance over time.
3 Subject-level learning using BSDS

Robustness of findings with respect to ROI selection: BSDS applied to HCP n-back working memory (WM) task with ROIs from ICA-derived resting-state brain networks
Our primary analysis focused on fronto-parietal ROIs based on peaks of task-related activation and deactivation. To examine the robustness of our findings with respect to ROI selection, we conducted a complete set of parallel supplemental analyses using functional clusters from previously published brain networks derived using ICA on resting-state fMRI 19 . Networks of interest included the salience network (SN), central executive network (CEN), default mode network (DMN) and dorsal attention network (DAN), from which we chose the following fronto-parietal regions: bilateral AI, bilateral DLPFC, bilateral FEF, bilateral PPC, PCC, VMPFC and right DMPFC (Supplementary Figure 5). All other processing steps were identical to the previous analysis. As described in detail below, all major findings were replicated with this more general resting-state network-derived choice of ROIs.

Matching BSDS states between sessions
We applied BSDS to probe latent brain dynamics associated with the ROIs in the two data sessions separately. BSDS isolated five latent brain states in each of Sessions 1 and 2. To determine whether brain states identified in one session matched those identified in the other session, we conducted a cross-session brain state correlation analysis. Using the same state matching algorithm applied in the main analysis (see Materials and Methods for details), we found an exclusive one-to-one mapping of the five latent brain states between the two sessions. The Pearson's correlation coefficients between the matched brain states across the two sessions ranged from 0.56 to 0.87 (Supplementary Table 15).
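Cross-session state matching can be sketched as a correlation-based one-to-one assignment. This is a generic illustration on synthetic state "fingerprints" (the actual matching algorithm is described in Materials and Methods); the feature dimension and noise level are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
K, F = 5, 66  # states, feature dimension (e.g., vectorized covariance)

# Hypothetical state fingerprints for session 1; session 2 is a noisy,
# shuffled copy, so a perfect one-to-one mapping exists by construction
S1 = rng.normal(size=(K, F))
perm = rng.permutation(K)
S2 = S1[perm] + 0.3 * rng.normal(size=(K, F))

# Pearson correlation between every pair of states across sessions
corr = np.corrcoef(S1, S2)[:K, K:]

# One-to-one assignment maximizing total correlation
rows, cols = linear_sum_assignment(-corr)
print(cols, perm)  # recovered mapping vs. true permutation
```

Using an optimal assignment rather than greedy nearest-neighbor matching guarantees the mapping is exclusive (one-to-one), mirroring the exclusivity reported above.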

Fractional occupancy of task-dominant latent brain states during WM task
The 2-back, 0-back and fixation conditions were each dominated by distinct brain states, designated SH (high-load state), SL (low-load state), and SF (fixation state), respectively (Supplementary Figure 6). The brain state SH had an occupancy rate during the 2-back WM condition of 47.1±6.6% in Session 1 and 36±11.5% in Session 2 (Supplementary Figure 6d). In both sessions, the occurrence of SH during the 2-back condition was significantly higher than the occurrence of the other, non-dominant states (all p-values <0.001, two-tailed t-test), except for the fifth state in Session 2. Although the fifth state had a high occupancy rate during the 2-back condition in Session 2, it also had a high occupancy rate during the 0-back condition but a low occupancy rate during rest. Because this state did not make a uniquely significant contribution to the 2-back task, it was designated SG (general-control state). The mean lifetime of state SH in the 2-back condition was 13±5 seconds in Session 1 and 9±4 seconds in Session 2, significantly longer than the mean lifetimes of the other brain states (all p-values <0.001, two-tailed t-test), but much shorter than the 27.5-second task blocks (Supplementary Figure 6e). Thus, the 2-back condition is characterized by a mixture of brain states, with the dominant state active for only a relatively short interval. A similar pattern was observed for the states that were dominant in the 0-back and fixation conditions (Supplementary Figures 6d, 6e). These results demonstrate that the WM task is characterized by latent task-induced states whose fractional occupancy is relatively short compared to the task blocks.

Identification of a novel transition state
In addition to SH, SL, SF and SG, BSDS uncovered a transition state (ST in Supplementary Figure 6). The occupancy rate and mean lifetime of this state during the 2-back, 0-back and fixation conditions were comparatively lower than those of the other states. However, ST was more likely to occur during the transition period after the onset of new task blocks (27±19% in Session 1 and 41±13% in Session 2, Supplementary Figure 6f), significantly more so than the other latent states in both sessions (p-values <0.05, two-tailed t-test), with the exception of SG in Session 1. These results demonstrate that cognitive tasks with multiple conditions are characterized not only by latent task-induced states but also by transition states.

Latent brain states predict WM performance
To probe the relation between latent brain states and task performance, we took advantage of a key feature of BSDS: it provides estimates of moment-by-moment changes in brain states and connectivity. We examined whether time-varying brain state changes could predict WM performance. Specifically, we trained a multiple linear regression model to fit 2-back accuracy using the occupancy rates of the brain states in the 2-back condition, applied the model to unseen data to predict accuracy, and evaluated model performance by comparing predicted and observed accuracy across all subjects. This analysis revealed a significant relation between predicted and actual accuracy (all p-values <0.005, Pearson's correlation, Supplementary Figure 7a). Notably, these results were replicated in both Sessions 1 and 2, highlighting the robustness of our brain-behavior findings.
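This train-on-seen, predict-on-unseen procedure can be sketched with leave-one-out cross-validation on synthetic data. The simulated occupancy-accuracy relationship and all values here are illustrative only, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj, n_states = 50, 5

# Synthetic occupancy rates (rows sum to 1) and accuracies that depend
# positively on the first ("dominant") state, plus noise
occ = rng.dirichlet(np.ones(n_states), size=n_subj)
acc = 0.6 + 0.35 * occ[:, 0] + 0.02 * rng.normal(size=n_subj)

# Occupancy rates sum to 1, so drop one column to avoid collinearity
# with the intercept
X = np.c_[np.ones(n_subj), occ[:, :-1]]

# Leave-one-out: fit multiple linear regression on n-1 subjects,
# predict the held-out subject
pred = np.zeros(n_subj)
for i in range(n_subj):
    mask = np.arange(n_subj) != i
    beta, *_ = np.linalg.lstsq(X[mask], acc[mask], rcond=None)
    pred[i] = X[i] @ beta

r = np.corrcoef(pred, acc)[0, 1]
print(r)  # positive when a real occupancy-accuracy link exists
```

Evaluating the predicted-versus-observed correlation on held-out subjects, as above, guards against the in-sample overfitting that a plain regression fit would allow.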
We then tested the hypothesis that the occupancy rate of individual brain states in the 2-back condition is associated with performance in the 2-back block. We found that WM task accuracy was positively correlated with the occupancy rate of the dominant state, SH, in the 2-back condition (all p-values <0.01, Pearson's correlation, Supplementary Figure 7b). Thus, the dominant state SH is a behaviorally optimal brain state for 2-back WM performance: the more time spent in this brain state, the better the WM task performance, with greater deviations leading to poorer performance.
Next, we investigated the mean lifetime, another key feature of the temporal evolution of latent brain states, in relation to WM performance using the same analytic procedures described above. We found that the mean lifetimes of the latent brain states in the 2-back condition predicted WM task accuracy (all p-values <0.05, Pearson's correlation, Supplementary Figure 7c). Thus, maintenance of optimal hidden brain states results in better WM task performance.
2 Robustness of findings with respect to head movement: BSDS applied to HCP n-back WM task with 12 head motion regression parameters

To examine whether the key latent brain states uncovered by BSDS are stable with respect to different motion correction procedures, we conducted an additional analysis in which we regressed out 12 motion parameters, comprising the 6 standard parameters (3 translational and 3 rotational movements) and their temporal derivatives, and then repeated the same BSDS analysis using the 11 ROIs defined by the 2-back vs. 0-back contrast. As described in detail below, all major findings were replicated with this choice of movement parameters.
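Regressing out the 12 motion parameters from an ROI timeseries amounts to projecting out a motion design matrix and keeping the residuals. A minimal sketch with synthetic data (not the authors' preprocessing pipeline; the motion model here is made up):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300  # timepoints (illustrative)

# 6 standard motion parameters plus their temporal derivatives = 12 regressors
motion6 = np.cumsum(rng.normal(scale=0.01, size=(T, 6)), axis=0)
motion12 = np.c_[motion6, np.gradient(motion6, axis=0)]

# Synthetic ROI signal contaminated by motion
signal = rng.normal(size=T)
y = signal + motion6 @ rng.normal(size=6)

# Regress out motion (with an intercept) and keep the residuals
X = np.c_[np.ones(T), motion12]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_clean = y - X @ beta

# Least-squares residuals are orthogonal to every motion regressor
print(np.abs(X.T @ y_clean).max())
```

Including the temporal derivatives captures spin-history-like effects that lag the measured head position, which is the rationale for the 12-parameter rather than 6-parameter model.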

Matching BSDS states between sessions
We applied BSDS to probe latent brain dynamics associated with the ROIs in the two data sessions separately. BSDS isolated five latent brain states in Session 1 and Session 2. To determine whether brain states identified in one session matched those identified in the other session, we conducted a cross-session brain state correlation analysis. Using the same state matching algorithm applied in the main analysis (see Materials and Methods for details), we found an exclusive one-to-one mapping of the five latent brain states between the two sessions. The Pearson's correlation coefficients between the matched brain states across the two sessions ranged from 0.93 to 0.97 (Supplementary Table 16).

Fractional occupancy of task-dominant latent brain states during WM task
The 2-back, 0-back and fixation conditions were each dominated by distinct brain states, designated SH (high-load state), SL (low-load state), and SF (fixation state), respectively (Supplementary Figure 8). The brain state SH had an occupancy rate during the 2-back WM condition of 37.6±12.2% in Session 1 and 34.7±11.5% in Session 2 (Supplementary Figure 8d). In both sessions, the occurrence of SH during the 2-back condition was significantly higher than the occurrence of the other, non-dominant states (all p-values <0.001, two-tailed t-test), except for the fifth state. Although the fifth state also had a high occupancy rate during the 2-back condition, it likewise had a high occupancy rate during the 0-back condition but a low occupancy rate during rest. Because this state did not make a uniquely significant contribution to the 2-back task, it was designated SG (general-control state). The mean lifetime of state SH in the 2-back condition was 10±4 seconds in Session 1 and 8±7 seconds in Session 2, significantly longer than the mean lifetimes of the other brain states (all p-values <0.001, two-tailed t-test), but much shorter than the 27.5-second task blocks (Supplementary Figure 8e). Thus, the 2-back condition is characterized by a mixture of brain states, with the dominant state active for only a relatively short interval. A similar pattern was observed for the states that were dominant in the 0-back and fixation conditions (Supplementary Figures 8d, 8e). These results demonstrate that the WM task is characterized by latent task-induced states whose fractional occupancy is relatively short compared to the task blocks.

Identification of a novel transition state
In addition to SH, SL, SF and SG, BSDS uncovered a transition state (ST in Supplementary Figure 8). The occupancy rate and mean lifetime of this state during the 2-back, 0-back and fixation conditions were comparatively lower than those of the other states. However, ST was more likely to occur during the transition period after the onset of new task blocks (30±20% in Session 1 and 33±17% in Session 2, Supplementary Figure 8f), significantly more so than the other latent states in both sessions (p-values <0.05, two-tailed t-test), with only a marginally significant difference from SH in Session 1 (p = 0.07, two-tailed t-test). These results demonstrate that cognitive tasks with multiple conditions are characterized not only by latent task-induced states but also by transition states.

Latent brain states predict WM performance
To probe the relation between latent brain states and task performance, we took advantage of a key feature of BSDS: it provides estimates of moment-by-moment changes in brain states and connectivity. We examined whether time-varying brain state changes could predict WM performance. Specifically, we trained a multiple linear regression model to fit 2-back accuracy using the occupancy rates of the brain states in the 2-back condition, applied the model to unseen data to predict accuracy, and evaluated model performance by comparing predicted and observed accuracy across all subjects. This analysis revealed a significant relation between predicted and actual accuracy (all p-values <0.01, Pearson's correlation, Supplementary Figure 9a). Notably, these results were replicated in both Sessions 1 and 2, highlighting the robustness of our brain-behavior findings.
We then tested the hypothesis that the occupancy rate of individual brain states in the 2-back condition is associated with performance in the 2-back block. We found that WM task accuracy was positively correlated with the occupancy rate of the dominant state, SH, in the 2-back condition (all p-values <0.005, Pearson's correlation, Supplementary Figure 9b). Conversely, higher occupancy rates of the non-dominant brain states during the 2-back task were associated with poorer performance. Thus, the dominant state SH is a behaviorally optimal brain state for 2-back WM performance: the more time spent in this brain state, the better the WM task performance, with greater deviations leading to poorer performance.
Next, we investigated the mean lifetime, another key feature of the temporal evolution of latent brain states, in relation to WM performance using the same analytic procedures described above. We found that the mean lifetimes of the latent brain states in the 2-back condition predicted WM task accuracy (all p-values <0.01, Pearson's correlation, Supplementary Figure 9c). Thus, maintenance of optimal hidden brain states results in better WM task performance.

Performance of related HMM-based models on optofMRI and n-back WM data
We next examined the performance of a broad class of HMM-based methods applied to opto-fMRI and n-back WM fMRI task data. HMM models can be categorized, based on their inference, into maximum-likelihood (ML) and Bayesian models. In its standard form, the ML-based HMM has limited practical application, as the number of states has to be specified in advance. Thus, here, we only consider Bayesian HMM-based methods. Bayesian models may be further categorized into parametric and nonparametric Bayesian models. Bayesian inference in either form provides a structured way of handling model complexity, and with a sufficient amount of data it can automatically prune away states with little influence. Here, we consider examples of each category.
In the following, we consider multiple Bayesian HMM-based methods and evaluate their performance on opto-fMRI and n-back WM data. Briefly, our analysis revealed that these methods are unable to properly handle model complexity: they either overestimate uncertainty, resulting in over-pruning of the latent states and convergence to a single state, or underestimate uncertainty, resulting in many states (>10) with little resemblance to the task. The latter case arises when models have limited flexibility and are strongly constrained by the Markovian assumptions as the only mechanism for capturing the complex dynamical processes hidden in the data; such models cannot deal with abrupt changes in the data that may not be relevant to the group patterns of interest. The former case arises in models where each state has an extensive degree of flexibility, with many parameters to be learnt from the data. Although theoretically appealing, in practice it is difficult to robustly learn all of these parameters from noisy data, and hence the model behaves too conservatively and can reliably estimate only one state.
In comparison, BSDS takes advantage of the strengths of each approach while minimizing their weaknesses. Similar to nonparametric Bayesian switching linear dynamical systems 20, it is rich in flexibility: it benefits from both the Markovian assumptions and the autoregressive processes in its latent space. However, its flexibility is constrained by the use of a parametric model with fewer parameters to estimate from the data. As a result, it does not react too quickly to every local change in the data, allowing it to estimate an optimal and parsimonious set of latent states across study participants.

Related HMM-based methods
We considered the following HMM-based methods 21,20,22 : HDP-HMM BSFA, HDP-AR-HMM.A-D and HDP-SLDS.A-E.

Analysis of opto-fMRI data
We applied all methods to the opto-fMRI dataset to validate their performance on a real dataset for which the ground truth is known to some extent. Of these methods, only HDP-HMM BSFA identified more than one state. HDP-HMM BSFA is a state-space model that uses an HMM to capture temporal dependencies, in contrast to BSDS, which wraps both an HMM and an AR process into a state-space generative model. As shown in Supplementary Figure 12, learning converged to several states (14 states) with no clear pattern. HDP-AR-HMM.A-D use both an HMM and an AR process in their generative models; we varied the AR order from 1 to 3. All variants consistently converged to a single state (results not shown here). HDP-SLDS.A-E are closest to BSDS in their generative form but differ in their inference: BSDS uses variational inference, whereas they use MCMC sampling. Furthermore, BSDS is a parametric Bayesian model, while HDP-SLDS.A-E are nonparametric Bayesian models. In our analysis, all cases converged to a single state.

Analysis of HCP WM data
As in the case of the opto-fMRI data, HDP-AR-HMM and HDP-SLDS converged to a single state when applied to the WM dataset.

General considerations
Within the HMM framework, another family of methods for modeling temporal data uses autoregressive (AR) processes, in which each state of the HMM is assumed to be generated by an AR process. In a more general form, switching linear dynamical models are another important family of models for timeseries. It should be noted that while prior studies have used HMMs to estimate brain states from the observed data, unlike the present study, they do not explicitly model latent processes underlying the observed data. Compared with standard HMM-based models 21,23,24,22 , BSDS crucially applies the HMM to the latent space variables, and it is this combination that creates a switching dynamical system which then generates the observed data. This is in contrast to previous approaches that have applied HMMs directly to observed MEG 24 and resting-state fMRI 23,22 data. Each latent state of BSDS is richer and more flexible owing to the autoregressive processes on the latent space variables. Critically, the system switches to a new state only if the current state cannot explain the data. As a result, BSDS is more robust to abrupt, noisy, local changes in the data, and it typically requires fewer states to model the same data than standard HMM-based methods. These features help in the robust identification of brain states.
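The generative structure described here — an HMM over discrete states, with state-specific AR dynamics in a latent space that is then projected to the observations through a factor model — can be illustrated with a minimal forward simulation. This is a schematic sketch with made-up dimensions and parameters, not the actual BSDS model or its inference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): K states, P latent dims, D ROIs.
K, P, D, T = 2, 2, 5, 200

A = np.array([[0.95, 0.05],                      # state transition matrix (rows sum to 1)
              [0.05, 0.95]])
V = [0.9 * np.eye(P), 0.5 * np.eye(P)]           # AR(1) coefficients per state
W = [rng.normal(size=(D, P)) for _ in range(K)]  # factor loading matrices per state
mu = [np.zeros(D), np.ones(D)]                   # per-state observation bias

z = np.zeros(T, dtype=int)   # latent state sequence (HMM)
x = np.zeros((T, P))         # latent space variables (AR process)
y = np.zeros((T, D))         # observed ROI time series

for t in range(1, T):
    # 1) first-order Markov switch of the latent state
    z[t] = rng.choice(K, p=A[z[t - 1]])
    # 2) AR(1) dynamics in the latent space of the active state
    x[t] = V[z[t]] @ x[t - 1] + 0.1 * rng.normal(size=P)
    # 3) emission: factor model maps latent space to observations
    y[t] = W[z[t]] @ x[t] + mu[z[t]] + 0.1 * rng.normal(size=D)
```

The contrast with standard approaches is visible in step (3): an HMM applied directly to observations would emit `y` from a per-state distribution, whereas here the observations are generated from the latent AR process selected by the state.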

Performance of temporal clustering on n-back WM data
We conducted an additional analysis using ICA in combination with temporal clustering, an approach widely used to investigate dynamic functional connectivity 25,26,27 . Briefly, this approach includes: (1) group-wise spatial ICA to identify components of interest, (2) extracting time series of the components of interest, (3) applying a sliding window to the time series and estimating time-varying covariance matrices, (4) clustering the time-varying covariance matrices, and (5) determining the optimal number of clusters.
We used ICA maps from an independent study 19 because identifying the optimal number of components in spatial ICA and matching ICA components between two task sessions is non-trivial. Further, this approach minimizes the impact of using individual spatial parcellations when comparing different methods and data across the two sessions. The ICA maps here are the same as those used above to test the robustness of our findings with respect to ROI selection (Supplementary Figure 5). The ICA masks included bilateral AI, bilateral DLPFC, bilateral FEF, bilateral PPC, PCC, VMPFC and right DMPFC. We extracted time series from these regions and estimated dynamic functional interactions using a temporal sliding window with an exponentially decaying shape and a window length of 25 seconds (34 TRs), shorter than the length of the blocks (27.5 seconds), and a sliding step of 0.72 seconds (1 TR). Within each time window, we computed the z-transformed Pearson's correlation between the time series taken pairwise. This resulted in a time series of correlation matrices (T x C), where T is the number of time windows and C is the number of pairwise interactions among regions at each time point. We then applied group-wise k-means clustering to the time series of correlation matrices, with the number of clusters (k) ranging from 2 to 10. Twenty-five different initializations were used to reduce the chance of getting stuck in local minima. Clustering performance was estimated using the silhouette method, and the optimal number of clusters was determined by the maximal silhouette across all iterations 28 .
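Steps (3)-(5) of this pipeline can be sketched as follows on toy data, assuming scikit-learn for k-means and silhouette scoring. The decay constant of the tapered window is an assumption for illustration; the exact taper used in the original analysis is not specified here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Toy ROI time series standing in for the extracted component signals.
T, D = 300, 6                  # time points, regions
ts = rng.normal(size=(T, D))

win, step = 34, 1              # window length (TRs) and sliding step, as in the text
tau = win / 3.0                # decay constant of the taper (assumed value)
w = np.exp(-(np.arange(win)[::-1]) / tau)   # exponentially decaying weights

def windowed_corr(seg, w):
    """Weighted Pearson correlation within one window, Fisher z-transformed."""
    m = np.average(seg, axis=0, weights=w)
    c = np.cov((seg - m).T, aweights=w)
    d = np.sqrt(np.diag(c))
    r = c / np.outer(d, d)
    iu = np.triu_indices(D, k=1)            # keep the C = D*(D-1)/2 unique pairs
    return np.arctanh(np.clip(r[iu], -0.999, 0.999))

# Time series of correlation vectors, shape (T_windows x C).
feats = np.array([windowed_corr(ts[t:t + win], w)
                  for t in range(0, T - win + 1, step)])

# Cluster the correlation time series and score each k by silhouette.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=25, random_state=0).fit_predict(feats)
    scores[k] = silhouette_score(feats, labels)
best_k = max(scores, key=scores.get)        # optimal k = maximal silhouette
```

In the actual analysis, k ranged from 2 to 10 and clustering was applied group-wise across participants; the truncated range here just keeps the toy example fast.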
Clustering analyses revealed that the optimal number of clusters was 2 in both data sessions (Supplementary Figure 16). Thus, despite the presence of three separate task conditions, only two dynamic brain states could be identified using this approach (Supplementary Figure 17). Further, we did not find any significant correlation between the occupancy rate of any latent brain state in the 2-back task blocks and WM accuracy (p>0.6, Pearson's correlation, Supplementary Table 17).
To demonstrate that BSDS can reliably estimate dynamic brain states in other cognitive domains, we applied BSDS to the HCP Relational Processing task. As described in detail below, we found that BSDS identifies brain states that are stable and replicable.

Data selection
The Human Connectome Project (HCP) Relational Processing task fMRI data of 90 individuals were selected based on the following criteria: (1) the range of head motion in every translational and rotational direction is less than 1 voxel; (2) average scan-to-scan head motion is less than 0.25 mm; (3) performance accuracy per session is greater than 50%; (4) criteria (1)-(3) are met in both sessions separately; and (5) subjects are right-handed.
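As a rough sketch, these criteria amount to a simple filter over per-subject summary measures. The record fields below are hypothetical placeholders, not actual HCP variable names; criterion (1) is simplified to a single maximum over directions, and the 2 mm threshold reflects the HCP functional voxel size:

```python
# Hypothetical per-subject summaries; field names are illustrative only.
subjects = [
    {"id": "s01", "max_motion_mm": 1.2, "mean_fd_mm": 0.12,
     "acc_session1": 0.84, "acc_session2": 0.79, "right_handed": True},
    {"id": "s02", "max_motion_mm": 3.5, "mean_fd_mm": 0.31,
     "acc_session1": 0.64, "acc_session2": 0.49, "right_handed": True},
]

VOXEL_MM = 2.0  # HCP functional voxel size; motion range must stay under 1 voxel

def passes(sub):
    return (sub["max_motion_mm"] < VOXEL_MM   # (1) motion range < 1 voxel (simplified)
            and sub["mean_fd_mm"] < 0.25      # (2) scan-to-scan motion < 0.25 mm
            and sub["acc_session1"] > 0.5     # (3)+(4) accuracy > 50% in each session
            and sub["acc_session2"] > 0.5
            and sub["right_handed"])          # (5) right-handed

selected = [s["id"] for s in subjects if passes(s)]
```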

Relational Processing task
The Relational Processing task was adapted from a previous study 29 . In this task, there were two task conditions: a relational processing condition and a control matching condition, and stimuli with 6 different shapes and 6 different textures. In the relational processing condition, two pairs of stimuli were presented, with one pair at the top of the screen and the other pair at the bottom.
Participants were told to first decide which dimension differed across the top pair of stimuli (shape or texture) and then decide whether the pair of stimuli at the bottom also differed along the same dimension. In the control matching condition, two stimuli were shown at the top of the screen and one at the bottom, with a word in the middle of the screen (either "shape" or "texture"). Participants were told to decide whether the bottom stimulus matched either of the top two stimuli on that dimension. Each participant completed two runs of this task. Each run comprised three relational processing blocks, three control matching blocks and three 16-second fixation blocks. Each task block contained 5 trials and lasted 18 seconds. Each stimulus was presented for 2800 ms, with a 400 ms ITI.

fMRI preprocessing
Minimally preprocessed fMRI data for both sessions were obtained from the Human Connectome Project. Spatial smoothing with a Gaussian kernel of 6 mm FWHM was first applied to the minimally preprocessed data to improve the signal-to-noise ratio as well as anatomical correspondence between individuals. High-pass temporal filtering (f > 0.008 Hz) was applied to remove low-frequency signals related to scanner drift.

General linear model and contrast of interest
A conventional general linear model (GLM) analysis was conducted in order to determine relational-processing and matching related activation/deactivation peaks. Each block in each session was modeled as one of the two vectors: relational or match. The onset and duration of each vector were the onset and duration of the corresponding block. The contrast of interest was relational versus match.

Region of interest (ROI) and time series
ROIs were determined from the contrast of interest (relational versus match) and included bilateral lateral occipital cortex (LOC), supramarginal gyrus (SMG), angular gyrus (AG), middle frontal gyrus (MFG), frontal pole (FP), medial frontal pole (mFP), right anterior insula (AI) and pre-supplementary motor area (preSMA) (Supplementary Figure 18). Each ROI was a 6-mm-radius sphere centered at the corresponding peak voxel. The time series of the first eigenvariate was extracted from each ROI. A multiple linear regression with 6 realignment parameters (3 translations and 3 rotations) was applied to the time series to reduce head-motion-related artifacts, and the resulting time series were linearly detrended and normalized.
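A minimal sketch of this extraction step on synthetic data: the first eigenvariate is taken here as the leading left singular vector of the centered voxel-by-time matrix (analogous to SPM's ROI eigenvariate), followed by motion regression, linear detrending and normalization. This illustrates the procedure described above, not the exact code used in the study:

```python
import numpy as np
from scipy.signal import detrend

rng = np.random.default_rng(2)

T, n_vox = 200, 40
roi_data = rng.normal(size=(T, n_vox))   # voxel time series within one ROI
motion = rng.normal(size=(T, 6))         # 6 realignment parameters

# First eigenvariate: leading left singular vector of the centered data,
# scaled by its singular value, giving one summary time series per ROI.
centered = roi_data - roi_data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
eigvar = U[:, 0] * S[0]

# Regress out the 6 motion parameters (plus intercept); keep the residual.
X = np.column_stack([np.ones(T), motion])
beta, *_ = np.linalg.lstsq(X, eigvar, rcond=None)
resid = eigvar - X @ beta

# Linear detrend and z-normalize.
clean = detrend(resid)
clean = (clean - clean.mean()) / clean.std()
```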

Matching BSDS states between sessions
We applied BSDS to probe the latent brain dynamics associated with these ROIs in the two data sessions separately. BSDS isolated five latent brain states in Session 1 and four latent brain states in Session 2. To determine whether a brain state identified in one session matches a brain state identified in the other, we conducted a cross-session brain state correlation analysis. Each brain state is defined by a covariance matrix in the latent space, estimated from the set of ROI activation timeseries in each session separately; the ROI timeseries can, in turn, be represented as a timeseries of posterior probabilities of the estimated brain states. If a brain state in one session corresponds to a brain state in the other session, then the posterior probability timeseries of these two brain states, evaluated on the same data session, should be highly correlated. Specifically, after obtaining the brain states in each session, we first computed the posterior probability timeseries of each brain state on the data session from which it was estimated (for example, the posterior probability on Session 1 data of State 1 estimated from Session 1). Next, we computed the posterior probability timeseries of each brain state on the other data session (for example, the posterior probability on Session 1 data of State 1 estimated from Session 2). We then correlated the posterior probability timeseries of brain states estimated from different sessions but evaluated on the same data session; for example, the correlation between the posterior probabilities, both on Session 1 data, of Session 1's State 1 and Session 2's State 1. A high correlation indicates that the two states match, as they have highly similar posterior probabilities on the same data. Indeed, we found an exclusive one-to-one mapping between four of the five latent brain states from Session 1 and the four latent brain states from Session 2.
The Pearson's correlation coefficients between the matched brain states across the two sessions ranged from 0.65 to 0.94 (Supplementary Table 18). To simplify the report and improve readability, we relabeled the four matched brain states in the two sessions as States 1-4, and the fifth brain state in Session 1 as State 5.
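The matching procedure can be sketched as follows, using toy posterior-probability timeseries in place of BSDS outputs. The greedy one-to-one assignment below is an illustrative choice; the study reports the resulting exclusive mapping but not the exact assignment algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

T = 400
# Toy posterior-probability timeseries evaluated on the SAME data (Session 1):
# rows of post1 come from the Session 1 model (5 states); rows of post2 from
# the Session 2 model (4 states) applied to Session 1 data, as in the text.
post1 = rng.dirichlet(np.ones(5), size=T).T           # (5, T)
perm = [2, 0, 3, 1]                                   # hidden ground-truth mapping
post2 = post1[perm] + 0.05 * rng.normal(size=(4, T))  # (4, T), noisy copies

# Correlate every Session-1 state with every Session-2 state ...
C = np.corrcoef(np.vstack([post1, post2]))[:5, 5:]    # (5, 4) cross-correlations

# ... and greedily read off an exclusive one-to-one mapping.
match, used = {}, set()
for j in range(4):                     # for each Session-2 state
    order = np.argsort(C[:, j])[::-1]  # Session-1 states by descending correlation
    i = next(i for i in order if i not in used)
    match[j] = int(i)
    used.add(int(i))
```

On this toy example the greedy pass recovers the planted mapping because each state's posterior trace correlates far more strongly with its counterpart than with any other state, mirroring the 0.65-0.94 correlations reported above.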

Fractional occupancy of task-dominant latent brain states during task
We compared the temporal evolution of latent brain states (Supplementary Figure 19a) with the block structure of the experimental task (Supplementary Figure 19b). The latent brain states were only partially aligned with the onset and offset of the three experimental task conditions (Supplementary Figures 19a, 19c). The relational, match and fixation conditions were dominated by distinct brain states (S4, S3 and S1, respectively; Supplementary Figure 19d). The state dominant during the relational condition (S4) had an occupancy rate of 47.5±15.6% in Session 1 and 50.9±15.6% in Session 2. In both sessions, the occurrence of S4 during the relational condition was significantly higher than the occurrence of the other, non-dominant states (all p-values <0.001, two-tailed t-test). The mean lifetime of S4 in the relational condition was 7±2.6 seconds in Session 1 and 7.4±2.7 seconds in Session 2, significantly longer than the mean lifetimes of the other, non-dominant brain states (all p-values <0.001, two-tailed t-test), but much shorter than the 18-second task block (Supplementary Figure 19e). Thus, the relational condition is characterized by a mixture of brain states, with the dominant state active for only a relatively short interval. A similar pattern was observed for the states dominant in the match and fixation conditions (Supplementary Figures 19d, 19e). These results demonstrate that the relational processing task is characterized by latent task-induced states whose lifetimes are short relative to the task blocks.
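The two summary statistics used throughout these analyses can be computed from a decoded state sequence as sketched below. This is an illustrative helper on a toy sequence, assuming the HCP repetition time of 0.72 s:

```python
import numpy as np

TR = 0.72  # HCP repetition time in seconds

def occupancy_and_lifetime(states, target, tr=TR):
    """Fractional occupancy (%) and mean lifetime (s) of `target` in a state path."""
    states = np.asarray(states)
    occ = 100.0 * np.mean(states == target)   # fraction of time points in `target`
    # Lifetimes: lengths of consecutive runs of `target`.
    runs, length = [], 0
    for s in states:
        if s == target:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    life = tr * np.mean(runs) if runs else 0.0
    return occ, life

path = [1, 1, 4, 4, 4, 3, 4, 4, 1, 1]   # toy decoded state sequence
occ, life = occupancy_and_lifetime(path, target=4)
# state 4 occupies 5 of 10 time points (occ = 50.0%), in runs of 3 and 2 TRs
```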

Supplementary Figures
Supplementary Figure 1: Directed acyclic graph representing the Bayesian switching dynamic systems (BSDS) model. y s t is the observed variable, z s t is the latent state variable, and x s kt is the latent source factor (latent space variable) associated to the k-th latent state variable at time t for subject s. (left) Allowed dependencies among observations, latent state variables and latent space variables. Given latent state and latent space variables, observations are conditionally independent. Latent states are connected to each other using a first-order Markov chain. Within a given state, latent space variables follow a dynamical process by an autoregressive model (R=1). (right) Generative model at time t and for subject s at a given latent state k, z s kt =1 (indicated with a red box). π and A are HMM parameters indicating the initial state probability distribution and the state transition probability distribution. µ k is the overall bias in the data, Ψ k is the noise covariance matrix at state k, u kp is the p-th column of the factor loading matrix and at the k-th state, and ν kp is the prior on the variance of the u kp , known as ARD prior which controls the dimensionality of the latent subspace. V k indicates the autoregressive coefficients at the k-th state. Σ k and m k are the covariance and the mean in the latent space. Black boxes indicate replications.

Supplementary Figure 2: Time-varying posterior probabilities of latent brain states predict 2-back, 0-back and fixation task conditions. A linear SVM classifier was trained, and its performance was tested within and between sessions. Permutation testing was used to evaluate the statistical significance of cross-validation accuracies. We found significant prediction accuracy in (a) within-session cross-validation and (b) between-session cross-validation. *, p=0.01; **, p=0.002. The gray dashed line indicates chance level.
Figure S5: Latent brain states during WM, their dynamic properties and replication across Sessions 1 and 2. (a) Temporal evolution of the four latent brain states identified in each of the 122 participants. (b) Corresponding task waveforms of the three task conditions in the n-back WM task (0-back, 2-back and fixation blocks) are shown in the same layout. (c) Time-varying posterior probability of each latent brain state across participants. (d) Occupancy rates of the three latent brain states which dominate the 0-back, 2-back and fixation task blocks (SL, SH and SF, respectively). (e) Mean lifetimes of the three latent brain states which dominate the 0-back, 2-back and fixation task blocks (SL, SH and SF, respectively; p-values <0.001, two-tailed t-test). (f) BSDS revealed a novel transition state ST, which occurs more frequently right after the onset of experimental task blocks than the other latent states in both sessions (p-values <0.05, two-tailed t-test), except relative to SG in Session 1. Note that SL, SH, ST, SF and SG were named by their task dominance in panels c-f and correspond to S1, S2, S3, S4 and S5 in panel a.

Supplementary Figure: Latent brain state dynamics predict WM accuracy across Sessions 1 and 2. (a) A multiple linear regression model was trained using occupancy rates of latent brain states in the 2-back task to predict WM accuracy. A significant association was observed, and predicted accuracies were correlated with observed accuracies in both sessions (Session 1: r=0.29, p<0.002; Session 2: r=0.26, p<0.005, Pearson's correlation). (b) The occupancy rate of the latent brain state SH, which dominates the 2-back WM task condition, was correlated with WM task accuracy in both sessions (Session 1: r=0.31, p<0.001; Session 2: r=0.25, p<0.005, Pearson's correlation). No such relations were found for any of the other latent states. (c) A multiple linear regression model was trained using mean lifetimes of latent brain states in the 2-back task to predict WM accuracy. Here again, a significant association was found, and the predicted accuracy was correlated with observed accuracy in both sessions (Session 1: r=0.22, p<0.05; Session 2: r=0.24, p<0.01, Pearson's correlation). (d) The mean lifetime of the latent brain state SH, which dominates the 2-back WM task condition, was correlated with WM task accuracy in Session 2 (r=0.20, p<0.05, Pearson's correlation) but only marginally in Session 1 (p=0.06, Pearson's correlation). Shaded areas represent 95% confidence intervals. Color code mapping in panels b and d: dark blue - SL, cyan - SH, yellow - ST, red - SF, purple - SG.

Supplementary Figure 8: BSDS applied to data with 12-motion-parameter regression revealed latent brain states during WM, their dynamic properties and replication across Sessions 1 and 2. (a) Temporal evolution of the four latent brain states identified in each of the 122 participants. (b) Corresponding task waveforms of the three task conditions in the n-back WM task (0-back, 2-back and fixation blocks) are shown in the same layout. (c) Time-varying posterior probability of each latent brain state across participants. (d) Occupancy rates of the three latent brain states which dominate the 0-back, 2-back and fixation task blocks (SL, SH and SF, respectively). (e) Mean lifetimes of the three latent brain states which dominate the 0-back, 2-back and fixation task blocks (SL, SH and SF, respectively; p-values <0.001). (f) BSDS revealed a novel transition state ST, which occurs more frequently right after the onset of experimental task blocks than the other latent states in both sessions (p-values <0.05), except that it is marginally significant compared to SH in Session 1 (p=0.07). Note that SL, SH, ST, SF and SG were named by their task dominance in panels c-f and correspond to S1, S2, S3, S4 and S5 in panel a.

Supplementary Figure 16: The optimal number of temporal clusters was determined using the maximal silhouette obtained across multiple iterations. In both data sessions (n-back task), the maximal silhouette was obtained at 2 clusters. Silhouette is a measure for validating clustering that evaluates how similar a data point is to its own cluster compared to other clusters. Each color represents a k-means clustering run with a random initialization (the number of clusters ranges from 2 to 10).

Supplementary Figure 17: Two temporal states were identified in each data session using the temporal clustering approach (HCP n-back task).

Supplementary Figure 19: Latent brain states during a Relational Processing task, their dynamic properties and replication across Sessions 1 and 2. (a) Temporal evolution of the latent brain states identified in each of the 90 participants. (b) Corresponding task waveforms of the three task conditions in the relational processing task (relational, match and fixation blocks) are shown in the same layout. (c) Time-varying posterior probability of each latent brain state across participants. (d) Occupancy rates of the three latent brain states which dominate the relational, match and fixation task blocks (S4, S3 and S1, respectively). (e) Mean lifetimes of the three latent brain states which dominate the relational, match and fixation task blocks (S4, S3 and S1, respectively; p-values <0.001, two-tailed t-test). (f) Probability of transition time given a latent brain state. Color code mapping in panels a, c, d, e and f: dark blue - S1, cyan - S2, yellow - S3, red - S4, purple - S5. Color code mapping in panel b: dark blue - control match, red - relational processing, cyan - fixation.

Supplementary Tables
Supplementary Table 1: BSDS learning algorithm.
step 0: setup
• Set the number of states, K (the initial value of K is usually set to a large value; during learning, states with small contributions receive weights close to zero).
• Set the intrinsic dimensionality of the latent subspace, P (in general, P is set smaller than the data dimension, P < D; in a fully noninformative initialization, one can simply set P = D − 1).
step 1: initialization
• Set the prior distribution parameters using § 1.2.
• Initialize z_t^s ∼ q(z_t^s) (e.g., using the K-means algorithm with Euclidean distance as the similarity measure, initialized with K clusters).

Supplementary Table 8: Correlation between occupancy rates of latent brain states in the 2-back task and WM accuracy.

Supplementary Table 9: Correlation between mean lifetimes of latent brain states in the 2-back task and WM accuracy.

Supplementary Table 11: Spearman correlation between occupancy rates of latent brain states in the 2-back task and WM accuracy.

Supplementary Table 13: Correlation between occupancy rates of latent brain states in the 2-back task and WM drift rates.