Towards an efficient validation of dynamical whole-brain models

Simulating the resting-state brain dynamics via mathematical whole-brain models requires an optimal selection of parameters, which determine the model’s capability to replicate empirical data. Since the parameter optimization via a grid search (GS) becomes unfeasible for high-dimensional models, we evaluate several alternative approaches to maximize the correspondence between simulated and empirical functional connectivity. A dense GS serves as a benchmark to assess the performance of four optimization schemes: Nelder-Mead Algorithm (NMA), Particle Swarm Optimization (PSO), Covariance Matrix Adaptation Evolution Strategy (CMAES) and Bayesian Optimization (BO). To compare them, we employ an ensemble of coupled phase oscillators built upon individual empirical structural connectivity of 105 healthy subjects. We determine optimal model parameters from two- and three-dimensional parameter spaces and show that the overall fitting quality of the tested methods can compete with the GS. There are, however, marked differences in the required computational resources and stability properties, which we also investigate before proposing CMAES and BO as efficient alternatives to a high-dimensional GS. For the three-dimensional case, these methods generated results similar to those of the GS, but within less than 6% of the computation time. Our results contribute to an efficient validation of models for personalized simulations of brain dynamics.

: Examples of the algorithms' convergence as given by the values of the goodness-of-fit versus the iteration number. The plots show the gradual development of the obtained correlation corr(simFC, empFC) between simulated and empirical FC (colored lines) for one run of the methods indicated in the legends (NMA3D, PSO3D, CMAES3D and BO3D for NMA, PSO, CMAES and BO in the 3Dim parameter space, respectively). In addition to the algorithm results, the final outcome of the grid search is shown (GS3D, black line), where the shaded area represents the range from 95% to 105% of the grid search value (GS3D ± 5%). Each row represents the results for one randomly selected subject, which is mentioned in the titles along with the final number of executed iteration steps. The figure was created with MATLAB R2018a (www.mathworks.com).

Table S1: Medians of costs (MOC) across subjects and proportions of recommendations (REC, in %) for the tested optimization algorithms in both the 2Dim and 3Dim parameter space. The presented values pertain to the overall cost function Ψ_cost as well as to its individual components (see main text for details). Bold letters indicate the most favorable numbers (i.e. lowest costs, highest proportions of recommendations) for every cost factor and dimensionality of the parameter space. The values for REC were rounded to the nearest integer.

Figure S3: Distributions of the optimal model parameters detected by the considered techniques across all subjects. For each approach and subject, the best solution featuring the highest goodness-of-fit value is considered to estimate the distribution. It was selected from all similarity values computed on the 2Dim or 3Dim parameter grids (for the grid search GS2D or GS3D) and from all goodness-of-fit values obtained by 15 independent runs of the optimization algorithms (NMA2D, PSO2D, CMAES2D, BO2D in 2Dim and NMA3D, PSO3D, CMAES3D, BO3D in 3Dim).
In the violin plots of one-parameter distributions for the 2Dim (A, B) and 3Dim (D-F) cases, the methods are indicated on the horizontal axes along with the optimized parameter values on the vertical axes. The plots (C1-C5) show the distributions of the optimal model parameters in the 2Dim space, where the relative frequency of the optimal parameter values is depicted by color and the respective methods are indicated in the plots. The figure was created with MATLAB R2018a (www.mathworks.com).

The respective regression coefficients of determination are provided in the titles of the subplots together with the considered algorithms. The figure was created with MATLAB R2018a (www.mathworks.com).
In the following, we provide the details about the mathematical optimization schemes whose performance we discussed and compared in the presented article. The algorithms were applied to maximize the goal function F, which returns the Pearson correlation coefficient between the matrices of simulated FC and empirical FC (after a vectorization of the upper triangular parts of the respective matrices). As stated in the main text, our intended maximization of F can be formulated equivalently as a minimization of −F in order to fit the traditional manner of describing optimization problems as minimization problems. This adaptation allowed for the usage of mathematical algorithms aiming to minimize a real-valued function f : P → ℝ defined on a parameter space P ⊆ ℝ^Dim with Dim ∈ ℕ.
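For illustration, the goal function can be sketched in a few lines of Python. The helper names (upper_triangle, pearson, goodness_of_fit) are ours, and the snippet stands in for, rather than reproduces, the actual simulation pipeline:

```python
import math

def upper_triangle(mat):
    """Vectorize the strictly upper triangular part of a square matrix."""
    n = len(mat)
    return [mat[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def goodness_of_fit(sim_fc, emp_fc):
    """Goal function F: similarity between simulated and empirical FC."""
    return pearson(upper_triangle(sim_fc), upper_triangle(emp_fc))
```

Negating the returned value yields the minimization objective f = −F used by the algorithms below.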

Nelder-Mead Algorithm
Following Hicken et al. 1 , the optimization starts with an initial set of n = Dim + 1 points x_1, …, x_{Dim+1} ∈ P not lying in the same plane and therefore defining the vertices of a simplex. In the most vivid case, Dim = 2, a simplex corresponds to a simple triangle. Tetrahedrons (Dim = 3) and more abstract structures appear in higher dimensions. After computing the objective function values at all initial simplex points, those are re-ordered so that f(x_1) ≤ … ≤ f(x_Dim) ≤ f(x_{Dim+1}). We assume here that f is not constant and therefore a meaningful ordering is possible. Then, in each iteration, the algorithm replaces the worst vertex x_{Dim+1} (i.e. the one with the highest function value) by a new point found through a series of transformations around the centroid of the remaining vertices x̄ = (1/Dim) ∑_{i=1}^{Dim} x_i. These operations are reflection, expansion, inside contraction, outside contraction and shrinking, where the last one is special in the sense that it does not just replace the worst point, but decreases the size of the entire simplex instead, keeping only the current best point and adding Dim new vertices. Which of the mentioned transformations eventually determines the new simplex for the next iteration depends on the value of the objective function at the proposed points relative to the current ones.
Having computed x̄, it is probable that the straight line through x_{Dim+1} and x̄ defines a descent direction. Reflection proposes a new point lying on this line, namely x_r = x̄ + (x̄ − x_{Dim+1}). Visually, it is the mirrored equivalent of x_{Dim+1} if the simplex were flipped around. We replace x_{Dim+1} by x_r in case the condition f(x_1) ≤ f(x_r) < f(x_Dim) is fulfilled, meaning that we obtained an improvement compared to f(x_Dim) and f(x_{Dim+1}), but not yet to f(x_1). If f(x_r) < f(x_1), then this search direction seems to be particularly promising and we investigate the objective function value further on this line, at the expanded point x_e = x̄ + γ(x̄ − x_{Dim+1}), γ > 1.
If f(x_e) < f(x_r), then x_{Dim+1} is replaced by x_e; otherwise, x_r becomes the new vertex of the simplex. In case the search direction from above leads to no immediate satisfactory improvement, i.e. f(x_r) ≥ f(x_Dim), we check whether a better point can be found by contractions. If even f(x_r) ≥ f(x_{Dim+1}), we look inside the current simplex, between x_{Dim+1} and x̄, and perform an inside contraction, leading to x_ic = x̄ − β(x̄ − x_{Dim+1}) with 0 < β < 1, which will be accepted if it is better than x_{Dim+1}, i.e. f(x_ic) < f(x_{Dim+1}). Should the condition for an inside contraction not be fulfilled, in other words f(x_r) < f(x_{Dim+1}), then an outside contraction with x_oc = x̄ + β(x̄ − x_{Dim+1}) is performed, and x_oc replaces x_{Dim+1} if f(x_oc) ≤ f(x_r). Following these steps, it may still happen that no replacement for the worst point x_{Dim+1} has been found. In that case, the Nelder-Mead Algorithm shrinks its current simplex and defines the vertices of the new one by x_1 and x_1 + δ(x_i − x_1) with 0 < δ < 1 for i ∈ {2, …, Dim+1}. Ideally, this procedure will create a sequence of simplices of decreasing size which accumulate around the searched-for minimum of f.
The construction of the initial simplex was based on previously published ideas 4-6 . Starting with a point x_1 sampled from a uniform distribution over the parameter space, we defined a scaling value c = max{1, ‖x_1‖_∞} together with two helper variables, p = c/(Dim √2) (√(Dim+1) + Dim − 1) and q = c/(Dim √2) (√(Dim+1) − 1), in order to construct a regular simplex whose vertices are given componentwise by (x_j)_i = (x_1)_i + q + (p − q) δ_{i,j−1}, where δ_{i,j} denotes the Kronecker delta and j ∈ {2, …, Dim+1}. However, we modified the procedure slightly by working with x̃_1 = 0.35 x_1 as starting point and c̃ = 0.7 c as scaling value. The reason for this alteration was the fact that in the original setting, the majority of our initial simplexes contained only one or two feasible points within our relevant parameter space. Since we handled boundaries by assigning a high function value, f(x_unfeas) = 1, to unfeasible points, as suggested in the algorithm's introductory work 3 , this led to a somewhat predictable behavior of the algorithm, namely that its optimization always began with a shrinking step. We note that this way of boundary handling may be outperformed by other alternatives 7 ; however, having obtained nearly degenerate simplexes when testing Box's method 8 , and in view of the results 7 indicating that even transformations of the input variables may not always be the ideal way, we held on to the comparably simple penalty approach for this algorithm. For a stopping criterion, we focused on the length of the simplex edges: if the maximal edge length falls below a predefined tolerance, the simplex is considered small enough to stop the optimization. Additionally, we defined a maximal number of 80 iterations.
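The transformations described above can be condensed into a minimal Python sketch. It uses the textbook coefficients (reflection 1, expansion 2, contraction 0.5, shrinking 0.5) and a simple axis-aligned initial simplex instead of the regular-simplex construction and penalty-based boundary handling of our implementation, so it is an illustrative toy rather than the code used in the study:

```python
def nelder_mead(f, x0, scale=0.5, max_iter=200, tol=1e-8):
    """Minimal Nelder-Mead: reflection, expansion, contractions, shrink."""
    dim = len(x0)
    # initial simplex: x0 plus dim axis-aligned offsets
    simplex = [list(x0)] + [
        [x0[i] + (scale if i == j else 0.0) for i in range(dim)] for j in range(dim)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # stop when all vertices are close to the best one
        if max(abs(v[i] - best[i]) for v in simplex for i in range(dim)) < tol:
            break
        centroid = [sum(v[i] for v in simplex[:-1]) / dim for i in range(dim)]
        refl = [centroid[i] + (centroid[i] - worst[i]) for i in range(dim)]
        if f(simplex[0]) <= f(refl) < f(simplex[-2]):
            simplex[-1] = refl                       # plain reflection
        elif f(refl) < f(simplex[0]):
            exp = [centroid[i] + 2.0 * (centroid[i] - worst[i]) for i in range(dim)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        else:
            if f(refl) >= f(worst):                  # inside contraction
                cand = [centroid[i] - 0.5 * (centroid[i] - worst[i]) for i in range(dim)]
            else:                                    # outside contraction
                cand = [centroid[i] + 0.5 * (centroid[i] - worst[i]) for i in range(dim)]
            if f(cand) < min(f(refl), f(worst)):
                simplex[-1] = cand
            else:                                    # shrink towards the best vertex
                simplex = [best] + [
                    [best[i] + 0.5 * (v[i] - best[i]) for i in range(dim)]
                    for v in simplex[1:]
                ]
    simplex.sort(key=f)
    return simplex[0]
```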

Particle Swarm Optimization
To begin the iterative minimization of f, it is necessary to define a swarm 9,10 of S ∈ ℕ particles x_1^0, …, x_S^0 ∈ P equipped with initial velocities v_1^0, …, v_S^0. One way to initialize x_i^0 and v_i^0 for i ∈ {1, …, S} is to choose them uniformly distributed within the bounded search space P; alternatives are reviewed in other works 11,12 . Subsequently, the objective function is evaluated at each position in the swarm, and p_i, i ∈ {1, …, S}, which will serve as every individual particle's current best location, is set to p_i^0 = x_i^0 for all i ∈ {1, …, S}, whereas the global best position of all particles is denoted by g and initialized to g^0 = argmin{f(x_i^0)}. Then, in every iteration t, the quantities v_i and x_i are updated according to v_i^{t+1} = w v_i^t + c_p r_1^t (p_i^t − x_i^t) + c_s r_2^t (g^t − x_i^t) and x_i^{t+1} = x_i^t + v_i^{t+1}, where w denotes an inertia term 13 , c_p, c_s represent a fixed cognitive and social parameter, respectively, and r_1^t, r_2^t are chosen randomly from the interval [0, 1]. If a particle discovers a new location that is better than its current best, i.e. f(x_i^{t+1}) < f(p_i^t), then an update is made and p_i^{t+1} is set to x_i^{t+1}; otherwise p_i^{t+1} = p_i^t. Analogously, the best position of the swarm is updated if f(x_i^{t+1}) < f(g^t) for some i ∈ {1, …, S}. In that case, the swarm has discovered a new global best point and g^{t+1} = x_i^{t+1}; otherwise g^{t+1} = g^t. This procedure is repeated until a stopping criterion is met.
According to Kennedy and Eberhart 10 , p_i corresponds conceptually to autobiographical memory, meaning that each particle remembers the most fortunate location of its own experience. The term c_p r_1^t (p_i^t − x_i^t) in the velocity adjustment can hence be interpreted as nostalgia; an individual tends to return to the place that most satisfied it in the past. g, by contrast, is described as publicized knowledge in the sense of a group norm or social standard that individuals strive to attain, here via the part c_s r_2^t (g^t − x_i^t) of the velocity adjustment. Consequently, the ratio between c_p and c_s determines whether a particle adjusts its velocity based rather on personal or on public experiences. Tuning this balance can give the algorithm a (desired) bias towards either side and therefore guides the amount of communication among particles. Additionally, the parameter w controls the velocity updates in a way that higher values of the inertia term enforce larger update steps, leading to a more global exploration of the search space, while lower values render the search more detailed on a smaller, local level. w is typically initialized close to 1 and may then be gradually decreased throughout the optimization 12 .
In our implementation, we tested different alternatives 12 for w (constant, linear decrease, dynamical decrease) and, besides the common choice 11,14 c_p = c_s ≈ 1.496, we also modified the ratio c_p/c_s by setting either value to 1 and varying the other between 1.1, 1.2 and 1.5. Additionally, we investigated 16 different swarm sizes with S between 8 and 144. Based on the results for a set of random test subjects with respect to convergence speed, most favorable objective function values and the final spread of particles in the swarm (using criteria analogous to those considered for the cost function Ψ_cost in the main text), we finally arrived at S = 60 and c_p = 1 < 1.5 = c_s as the internal algorithm parameters best suited for our problems. For the inertia weight w, we adapted a procedure of dynamical decrease 12 , in which we initially set w = 0.95 and multiplied it by 0.975 if g had not improved for 5 consecutive iterations. Similarly, we also used the adaptive penalty approach presented in the same article 12 , allowing us to handle unfeasible points by adding penalties 'based on the average of the objective function and the level of violation of each constraint during each iteration step'. Even though a component-wise limitation of the particles' velocities was no longer vital due to the introduction of the inertia weight 14,15 , we still prevented too large position updates by incorporating a multiplier 16 of 0.9 for velocities whose components exceeded 20% of the feasible interval length (e.g. delay ∈ [0, 100], update limited to 20) in the respective dimension of the search space P. The optimization procedure was stopped if g had not improved for 50 iterations or if a maximum of 80 iterations had been reached.
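A minimal Python sketch of the update rules may look as follows. For simplicity, the inertia weight is decayed every iteration rather than after stagnation, and the adaptive penalty and velocity-capping mechanisms are omitted, so the snippet only illustrates the core scheme (all names are ours):

```python
import random

def pso(f, bounds, n_particles=20, c_p=1.0, c_s=1.5, w0=0.95, decay=0.975, max_iter=100):
    """Minimal PSO with inertia, cognitive and social velocity terms."""
    dim = len(bounds)
    rnd = random.Random(42)  # fixed seed for reproducibility of the sketch
    X = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    V = [[rnd.uniform(-(hi - lo), hi - lo) for lo, hi in bounds] for _ in range(n_particles)]
    P = [x[:] for x in X]          # personal best positions
    g = min(X, key=f)[:]           # global best position
    w = w0
    for _ in range(max_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                V[i][d] = (w * V[i][d]
                           + c_p * r1 * (P[i][d] - X[i][d])
                           + c_s * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if f(X[i]) < f(P[i]):
                P[i] = X[i][:]
                if f(X[i]) < f(g):
                    g = X[i][:]
        w *= decay                 # simplified dynamical decrease of the inertia
    return g
```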

Covariance Matrix Adaptation Evolution Strategy
As a preparation for the optimization of f, a set of quantities has to be defined. Importantly, the number λ of particles (candidate solutions to minimize f) that form a generation can be freely chosen, analogously to the swarm size in Particle Swarm Optimization. However, as Hansen 17 states, '[s]mall population sizes usually lead to faster convergence, large population sizes help to avoid local optima'. A popular recommendation is therefore to set λ = 4 + ⌊3 ln(Dim)⌋ at least 17,18 . Next, μ = ⌊λ/2⌋ is defined. It quantifies the number of particles that will later serve as a basis to compute the generation of candidate solutions for the next iteration. More precisely, the members of the new generation t+1 will be sampled from a multivariate normal distribution centered around a weighted mean m^t of the μ best individuals from the previous iteration t. The selection of the best individuals is based on their respective objective function values (lower = better) and the weights are given as w_i = (ln(μ+1) − ln i) / ∑_{j=1}^{μ} (ln(μ+1) − ln j), i ∈ {1, …, μ}. With this definition, it is clear that the weights sum up to 1. The (co)variance of the considered distribution depends on a matrix C^t ∈ ℝ^{Dim×Dim} which is adapted in every iteration. For the update, the evolution path p_c^t ∈ ℝ^Dim, the conjugate evolution path p_σ^t ∈ ℝ^Dim and the step size σ^t > 0 are computed, along with several constants, among them the variance-effective selection mass μ_eff = (∑_{i=1}^{μ} w_i²)^{−1} and the learning rates c_σ, d_σ, c_c, c_1 and c_μ, for which we used the default definitions 18 . Having defined the constants, we initialize the evolution paths to p_c^0 = p_σ^0 = 0 ∈ ℝ^Dim and select C^0 = I ∈ ℝ^{Dim×Dim} as the initial covariance matrix. Here, I denotes the identity matrix. The step size σ^0 and the distribution mean m^0 have to be chosen problem-dependent. If the algorithm is used to optimize a function within a normalized space [0, 1]^Dim, then σ^0 = 0.5 and m^0 drawn uniformly at random from [0, 1]^Dim are the popular choices 18 that we also adapted. In a newer version 17 , alternative definitions not only for the starting points but also for some constants such as the weights w_i are presented.
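For illustration, the default population size, parent number and recombination weights of the 2006 variant can be computed as in the following Python sketch (the function name is ours):

```python
import math

def cma_defaults(dim):
    """Default population size, parent number and recombination weights."""
    lam = 4 + int(math.floor(3 * math.log(dim)))   # population size lambda
    mu = lam // 2                                   # number of selected parents
    raw = [math.log(mu + 1) - math.log(i + 1) for i in range(mu)]
    total = sum(raw)
    weights = [r / total for r in raw]              # positive, decreasing, sum to 1
    mu_eff = 1.0 / sum(w * w for w in weights)      # variance-effective selection mass
    return lam, mu, weights, mu_eff
```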
In our test runs of the algorithm, however, the older version (from 2006) yielded slightly better results, so that we held on to it and its default values. After these preparations, the following steps are repeated in every iteration until a stopping criterion is met. At first, a generation of candidate solutions x_k^{t+1} ∼ N(m^t, (σ^t)² C^t), k ∈ {1, …, λ}, is computed according to x_k^{t+1} = m^t + σ^t B^t D^t z_k with z_k ∼ N(0, I).
N denotes the (multivariate) normal distribution. The matrices B^t and D^t are derived from an eigenvalue decomposition of C^t = B^t Λ^t (B^t)^T, where B^t is an orthogonal matrix containing the normalized eigenvectors of C^t and Λ^t is a diagonal matrix containing the corresponding eigenvalues. D^t is then obtained from Λ^t by applying the square root to the entries on the diagonal, so that C^t = B^t D^t D^t (B^t)^T. Afterwards, the objective function is evaluated at every x_k^{t+1}, k ∈ {1, …, λ}, and the candidate solutions are indexed in a way that f(x_{1:λ}^{t+1}) ≤ … ≤ f(x_{μ:λ}^{t+1}) ≤ f(x_{μ+1:λ}^{t+1}) ≤ … ≤ f(x_{λ:λ}^{t+1}).
As mentioned before, the μ best individuals are then selected for the new weighted mean m^{t+1} = ∑_{i=1}^{μ} w_i x_{i:λ}^{t+1}, which will shift the center of the particles' distribution towards the locations of the most favorable solutions. Subsequently, the evolution paths p_σ^t, p_c^t and the step size σ^t are updated:

p_σ^{t+1} = (1 − c_σ) p_σ^t + √(c_σ (2 − c_σ) μ_eff) (C^t)^{−1/2} (m^{t+1} − m^t) / σ^t,
σ^{t+1} = σ^t exp( (c_σ / d_σ) ( ‖p_σ^{t+1}‖ / E‖N(0, I)‖ − 1 ) ),
p_c^{t+1} = (1 − c_c) p_c^t + h_σ √(c_c (2 − c_c) μ_eff) (m^{t+1} − m^t) / σ^t,

where (C^t)^{−1/2} = B^t (D^t)^{−1} (B^t)^T and h_σ ∈ {0, 1} suppresses the update of p_c if ‖p_σ^{t+1}‖ is too large. The expected value belonging to a multivariate normal distribution with zero mean and covariance matrix I is typically approximated via E‖N(0, I)‖ ≈ √Dim (1 − 1/(4 Dim) + 1/(21 Dim²)). Finally, the covariance matrix C^t is updated according to

C^{t+1} = (1 − c_1 − c_μ) C^t + c_1 p_c^{t+1} (p_c^{t+1})^T + c_μ ∑_{i=1}^{μ} w_i y_{i:λ}^{t+1} (y_{i:λ}^{t+1})^T with y_{i:λ}^{t+1} = (x_{i:λ}^{t+1} − m^t) / σ^t.

The new matrix C^{t+1} determines the (co)variance of the particles' distribution in the next generation, while incorporating the knowledge about the location of the best function values from the current iteration. Together with the adjustments of σ^t, this procedure aims at producing a population of particles that accumulate around the global optimum of f.
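The expected-norm approximation and the step-size update can be transcribed directly. The following Python sketch (with our own helper names) illustrates how σ grows when the conjugate evolution path is longer than expected under purely random selection, and shrinks otherwise:

```python
import math

def expected_norm(dim):
    """Approximation of E||N(0, I)|| used in the step-size update."""
    return math.sqrt(dim) * (1 - 1 / (4 * dim) + 1 / (21 * dim ** 2))

def update_step_size(sigma, p_sigma, c_sigma, d_sigma, dim):
    """Cumulative step-size adaptation: grow sigma if the path is longer than expected."""
    norm = math.sqrt(sum(p * p for p in p_sigma))
    return sigma * math.exp((c_sigma / d_sigma) * (norm / expected_norm(dim) - 1))
```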
Our implementation included both the 2006 variant 18 of the algorithm as well as the newer version 17 . Prior to the application to the entire subject cohort mentioned in the main text, we compared both alternatives' performance on a set of a few random test subjects. Apart from the population size λ, we set all hyperparameters to the recommended default values and analyzed the results with respect to convergence speed, most favorable objective function values and the final spread of particles within a population (using criteria analogous to those considered for the cost function Ψ_cost in the main text). For λ, we tested 8 different values between 6 (in the 2Dim parameter space) or 8 (in the 3Dim parameter space) and 144. It turned out that the 2006 version of the algorithm in combination with λ = 24 particles was the most favorable and numerically stable option for the problems we consider. We handled unfeasible solutions according to an algorithm-related boundary handling procedure 19 . The only alteration we made was that we intensified the boundary weights γ_i described in the article and worked with γ̃_i = 5 γ_i.
The optimization procedure was stopped if the best objective function value discovered so far had not improved for 50 iterations or if a maximum of 80 iterations had been reached. This selection was made in analogy with the criteria chosen for Particle Swarm Optimization. Alternatives are described by Hansen 17 ; a stricter termination criterion would be to stop already after 10 + ⌈30 Dim / λ⌉ iterations without improvement, for example.

Bayesian Optimization
The minimization of the objective function f starts with the evaluation of a random sample of points x_1, …, x_N which may be uniformly distributed in P. A possible choice is N ≈ 10 Dim as initial sample size 20,21 , but lower values for N are not uncommon. The C++ software package BayesOpt 22 uses N = 10, and the MATLAB routine bayesopt (https://de.mathworks.com/help/stats/bayesopt.html) works with N = 4 as dimension-independent default value. To distinguish between the (constant) size N of the initial sample and the (increasing) number of objective function evaluations computed so far, we denote the latter by Ñ. Before the computation of new sampling points, both quantities are equal. As a next step, a mean function μ_0 : P → ℝ and a kernel function Σ_0 : P × P → ℝ have to be chosen. This is essential for Bayesian Optimization since the algorithm tries to approximate the function f, which may not be available in closed form, via a stochastic model. More concretely, a nonlinear process is used to model f, typically a Gaussian Process (GP) 23 . It can be interpreted as a quantity analogous to the function f : P → ℝ, but instead of returning a scalar f(x) for a certain x ∈ P, a GP returns the mean and (co)variance of a normal distribution over possible values of f at the point x. Formally, this dependence can be expressed as (f(x_1), …, f(x_N)) ∼ N(μ_0(x_{1:N}), Σ_0(x_{1:N}, x_{1:N})), where μ_0(x_{1:N}) denotes the vector of mean values and Σ_0(x_{1:N}, x_{1:N}) the matrix of pairwise kernel evaluations. At this stage, the GP serves as a prior belief about f and its possible values. It is common to choose a constant or relatively simple function for the mean 24 , e.g. μ_0(x) = 0. The kernel should be positive semi-definite and it should guarantee that two points x and x′ lying close to each other in P are strongly correlated and thus generate a higher value Σ_0(x, x′) than two points that are further apart. One possible kernel is the automatic relevance determination (ARD) squared exponential kernel, which is given by Σ_0(x, x′) = θ_0 exp( −(1/2) ∑_{i=1}^{Dim} (x_i − x′_i)² / θ_i² ). Note that we had to adapt our notation for this algorithm.
Here, x_i, i ∈ {1, …, Dim}, denotes the i-th component of a point x ∈ P (in the parameter optimization considered in the main text, x_1 would be the coupling strength, for example). In addition, the θ_i, i ∈ {0, …, Dim}, represent the hyperparameters of the kernel, whose estimation we discuss below. The ARD squared exponential kernel has the advantage of being an uncomplicated function with the desired attributes. In practice, however, its usage in a GP generates approximations of the goal function which are unrealistically smooth 25 . A useful alternative is therefore the ARD Matérn 5/2 kernel, which is defined as Σ_0(x, x′) = θ_0 (1 + √5 r + (5/3) r²) exp(−√5 r) with r² = ∑_{i=1}^{Dim} (x_i − x′_i)² / θ_i². Other options for the kernel function can be found in algorithm-related works 24,26 . After having chosen the mean and kernel functions, the hyperparameters are optimized based on the knowledge obtained from the evaluation of the initial sample points x_1, …, x_N. Importantly, the procedures of this algorithm allow for more hyperparameters than θ_0, …, θ_Dim. On the one hand, it is possible to consider a constant mean μ_0(x) = μ with unknown μ ∈ ℝ; on the other hand, the unknown observation noise σ_noise in stochastic objective functions can be accounted for 24,25 . Summarizing all considered hyperparameters in a vector η = (θ_0, …, θ_Dim, μ, σ_noise), the easiest approach is then to maximize the likelihood of the observations f(x_1), …, f(x_N) (henceforth denoted by f(x_{1:N})) under the GP prior, i.e. for a given probability measure ℙ, we search for η̂ = argmax_η ℙ(f(x_{1:N}) | η), which is known as the maximum likelihood estimate (MLE). Alternatively, it can be assumed that η itself was chosen from a prior. The desired quantity η̂ is then given by the maximum a posteriori (MAP) estimate η̂ = argmax_η ℙ(η | f(x_{1:N})), which maximizes the posterior. The most exhaustive option is called the fully Bayesian approach, where an integration over all possible values of the hyperparameters is performed 24 . Further details on how to obtain the desired hyperparameters practically are provided in related papers 20,27 .
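Both kernels are straightforward to transcribe. In the following Python sketch (helper names ours), theta0 plays the role of θ_0 and lengths holds the ARD length scales θ_1, …, θ_Dim:

```python
import math

def ard_sq_exp(x, y, theta0, lengths):
    """ARD squared exponential kernel."""
    r2 = sum((a - b) ** 2 / l ** 2 for a, b, l in zip(x, y, lengths))
    return theta0 * math.exp(-0.5 * r2)

def ard_matern52(x, y, theta0, lengths):
    """ARD Matern 5/2 kernel: rougher, hence more realistic GP samples."""
    r2 = sum((a - b) ** 2 / l ** 2 for a, b, l in zip(x, y, lengths))
    r = math.sqrt(r2)
    return theta0 * (1 + math.sqrt(5) * r + 5 * r2 / 3) * math.exp(-math.sqrt(5) * r)
```

Both functions return θ_0 for identical inputs and decay with the scaled distance between x and x′.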
The intention of all these considerations is to obtain an adequate prior distribution over possible functions for f. Based on the function evaluations obtained so far, Bayesian Optimization tries to generate the best possible surrogate model for f.
Having obtained the posterior distribution for possible values of f(x_{N+1}), the question arises what the most advantageous sampling point x_{N+1} might be. In order to find it, Bayesian Optimization works with an acquisition function a : P → ℝ which provides an estimation of the area where the highest gain in terms of improved objective function values can be achieved. The acquisition function is typically designed in a way that makes it relatively easy to maximize. Since its evaluations are much cheaper to compute than those of the objective function f, standard techniques 27 can lead to a fast maximization of a. Frazier 24 recommended the quasi-Newton method L-BFGS-B 28 , while the Nelder-Mead Algorithm 3 and other methods were tested by Schonlau and collaborators 21 .
One popular approach is to consider the expected improvement as acquisition function. To compute it, we define f* = min{f(x_1), …, f(x_N)} and I(x) = max{0, f* − f(x)} as the best value detected so far and the absolute improvement compared to it, respectively. Then, the acquisition function is given as a_EI(x) = E[I(x)], for which a closed form is available 20 . Another commonly used alternative is the lower confidence bound function a_LCB(x) = μ_N(x) − κ σ_N(x). It creates a lower envelope of uncertainty around the posterior mean function μ_N whose width is given by the posterior standard deviation σ_N multiplied by a factor κ > 0. For a minimization of f, −a_LCB is usually maximized. The parameter κ can be adjusted to tune the balance between the desire for exploitation (sampling near favorable mean values) and exploration (sampling in areas of high uncertainty). MATLAB's routine bayesopt works with κ = 2 as default value. Other possible acquisition functions are presented by Frazier 24 . When a new sampling point has been found, the corresponding objective function value is computed and then added to the list of previously obtained function evaluations. Afterwards, the hyperparameters of the initial mean and kernel function can be optimized again (with Ñ instead of N), or the posterior distribution over possible values of f is updated immediately based on the knowledge gained from the new sampling point; in the formulas, N is then replaced by Ñ. Subsequently, the acquisition function is maximized again (this time with Ñ) to find another sampling point. This procedure is repeated until a stopping criterion is met. Since the matrix Σ_0(x_{1:N}, x_{1:N}) (or rather Σ_0(x_{1:Ñ}, x_{1:Ñ})), whose size increases with every new sampling point, has to be inverted in each update step, it is recommended to either limit the maximum number of iterations to a moderate number or to gradually remove some of the initial points when a satisfactory amount of data has been collected 29 .
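For the expected improvement, a standard closed form for a minimization problem is a_EI(x) = (f* − μ(x)) Φ(z) + σ(x) φ(z) with z = (f* − μ(x))/σ(x), where Φ and φ denote the standard normal CDF and PDF and μ(x), σ(x) are the GP posterior mean and standard deviation at x. The following Python transcription (helper names ours) reduces to max{0, f* − μ(x)} in the noise-free limit σ → 0:

```python
import math

def norm_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI (minimization) given posterior mean/std at one point."""
    if sigma <= 0:
        return max(0.0, f_best - mu)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm_cdf(z) + sigma * norm_pdf(z)
```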
From the viewpoint of numerical stability and performance, it may also be favorable to work with a Cholesky decomposition of the matrix 22 .
For our implementation, we used the C++ software package BayesOpt 22 with N = 5 in the 2Dim parameter space and N = 10 in the 3Dim case. We chose a maximum of 80 iterations and worked with the default settings to handle unfeasible points. In addition, we also used the default of modeling the objective function via a GP that features a constant mean function and the ARD Matérn 5/2 kernel.