Estimation of the parameters of ETAS models by Simulated Annealing

This paper proposes a new algorithm to estimate the maximum likelihood parameters of an Epidemic Type Aftershock Sequences (ETAS) model. It is based on Simulated Annealing, a versatile method that solves problems of global optimization and ensures convergence to a global optimum. The procedure is tested on both simulated and real catalogs. The main conclusion is that the method performs poorly as the size of the catalog decreases because the effect of the correlation of the ETAS parameters is more significant. These results give new insights into the ETAS model and the efficiency of the maximum-likelihood method within this context.

I describe the algorithm in the first two sections of the paper. In the third section, the method is tested on simulated and real catalogs. In the last two sections, I analyze how the SA algorithm performs and discuss some general implications for the ETAS model.
The Very Fast Simulated Annealing algorithm SA is a stochastic method to solve problems of multidimensional global optimization, i.e. problems with the following form whereX is a D-dimensional subset of R D 8,21,23 . The term ''annealing'' refers to a process in which a solid, brought into a liquid phase by increasing its temperature, is brought back to a solid phase by a progressive reduction of the temperature. This process has been done in such a way that all the particles arrange themselves in a perfect crystallized state, representing the global minimum of a certain energy function. SA algorithms are random iterative procedures that generate a candidate point x opt ! and move to this point or stay at the current one based on a stochastic mechanism. The latter is controlled by the temperature T; when decreased, the search becomes more directive. In the following, I will refer to the case of the global maximization of a function f.
More formally, a general SA algorithm can be described as follows.
Initialization Generate an initial random solution x opt ! . Select a value for the initial temperature T 0 . 0. Set the count j 5 0. Inner loop Set x j !~x opt ! and repeat the following N in times: a) generate the next candidate solutionỹ~G x j ! À Á ; b) sample a uniformly distributed random number p g [0,1] and set where A is a suitable ''acceptance'' function; c) set x opt !~ỹ if fỹ ð Þwf x opt ! À Á : Outer loop Check a stopping criterion and, if satisfied, then STOP; otherwise a) set T j11 5 U(T j ) # T j and j 5 j 1 1; b) go back to the Inner loop.
In brief, an SA algorithm is an iterative procedure composed of two nested loops. In the outer loop, called the cooling process, the temperature T is decreased from its initial value until a convergence criterion is achieved. The inner loop is a random search of a possible better solution x opt ! in a region around the local maximum x j ! .
An SA algorithm applied to a specific problem requires 1) the distribution G generating the next candidate pointỹ; 2) the acceptance function A and the number of trials N in for the inner loop; 3) the initial temperature T 0 and cooling function U; and 4) the stopping condition. SA algorithms are conceptually simple, but the setting of these tuning parameters/functions is an extremely tricky and problem-dependent question, crucial for the efficiency of the algorithm (Ref. 19 and references therein). A bad solution to this problem invalidates the effectiveness and robustness of SA procedures, even if they have formal proofs of convergence to the global optima.
The inner loop: next candidate distribution G and acceptance criterion A. The function G defines the way the model x opt ! is updated. In brief, it consists of adding a random ''perturbation'' to the current model x j ! to obtain the new candidateỹ. In this study, I adopt the Very Fast Simulate Annealing (VFSA) procedure proposed by Szu and Hartley 22 and improved by Ingber 5,6 . This defines the function G as a D-dimensional Cauchy distribution such that, for each dimension k, the searching region is controlled by the temperature. Specifically where L k and U k are the lower and upper bounds in the kth dimension and u g [0, 1] is a uniformly distributed random number 5,6,21 .
The acceptance criterion A determines if the new computed solutionỹ is accepted or discarded. Here, I use the well-known Metropolis criterion 16,21 , given by In this way, the ascent steps are all accepted, whereas the descent steps are accepted with a probability controlled by T j , to not get trapped in local maxima.
The outer loop: initial temperature T 0 , cooling function U and the stopping criterion. The initial temperature T 0 and the cooling schedule are of critical importance to the success of SA, especially for a VFSA algorithm, in which the temperature defines both the next candidateỹ and the acceptance criterion A (see eq. 2 and 3). A low T 0 might cause an overly restricted search around the starting point x 0 ! . A high T 0 or a slow cooling schedule might cause an overly high computational time and a possible unsuccessful search, especially if the number of iterations N in is limited. Finally, a fast cooling schedule can trap the algorithm in a local maximum. Various choices of the initial temperature T 0 have been suggested in the literature (Ref. 1 and references therein). A general rule is that T 0 must be defined in such a way that any solutionỹ [X can be selected and almost any model perturbation must be accepted (A x 0 ! ,ỹ,T 0 À Á *1, Vỹ [X) at the beginning. The first condition is satisfied by assuming T 0 $ 1, making the distribution of r k almost uniform (see eq. 2). The second condition depends on the specific problem and can be achieved in different ways (Refs. 1, 7 and references therein). Here, I adopt the simple criterion, suggested by Lin et al. 11 , that expresses T 0 as the ratio between the size of the image of f and the number of data.
The cooling schedule regulates how rapidly the temperature T varies from high to low values as a function of the iteration count. Here, I apply an Adaptive Cooling method, which adjusts the decrease rate of T from information obtained during the algorithm's execution 23 . Specifically, I set where DF j~f x opt ! À Á {fx j À Á . The idea here is to keep the temperature unchanged when the local maximum fx j À Á is far from the global optimum f x opt ! À Á DF j wwT 0 À Á and to half the temperature when the global maximum is updated (DF j 5 0). The algorithm ends when a stopping criterion is satisfied. Two possible rules can be followed: an SA procedure is stopped when T decreases up to a pre-selected threshold 24 or when it does not make significant progress over several iterations 2 . Here, I adopt the second criterion: the algorithm is stopped if its progress over the last M iterations is small, i.e., when the following conditions are satisfied for a positive small : and log-likelihood where T , R and M are the temporal interval, the spatial region and the magnitude range covered by the N events of the catalog, respectively, and H t~ti , g is the history of earthquakes that occurred before t.
The algorithm may be adapted to any version of the ETAS model, but I adopt here a specific parameterization for the intensity lh t,x,y,m H t j ð Þ .
where c d,q,c is the normalization constant of the spatial function and r i is the distance between the locations (x, y) and (x i , y i ). Because the magnitude distribution is independent of the other variables, the b-value can be estimated from magnitudes alone. Therefore, b is not included inh. I model the background rate m(x, y) by using an equally spaced grid of N c cells C i , with central node (X i , Y i ), covering R. Specifically, I suppose that the background rate is homogeneous inside C i but where u i is the probability of having a background event inside C i and A i is the area of C i . The probabilities u i are unknown; thus, they belong to the set of parametersh. In this first version of the VFSA algorithm, I estimate the probabilities u i by using the iterative kernel method proposed by Zhuang et al. 27 . In brief, the background distribution is given by where . T tot is the length of the interval time T ; . r j is the probability that the jth event is triggered, given by . d j is a variable bandwidth, given by the radius of the smallest disk, centered at (x j , y j ), including at least n p observed events.
Each time the VFSA algorithm updates the best parameters, the probabilities u i are estimated as where m(X i , Y i ) is the background rate, computed at the central node (X i , Y i ) of the cell C i by eq. 10. Here, I consider n p 5 10 because the choice of n p does not significantly affect the results 27 . The log-likelihood calculation requires only the background probabilities of the N' c v~N c cells with earthquake occurrence (see eq. 2). Thus, the size ofh~m,k,c,p,a,d,q,c; u i ,i~1, . . . ,N' c f g depends on the specific dataset.
Below, I list the VFSA code in detail: 1) Set the count j 5 0 and the temperature T 5 T 0 . select an initial modelh 0 [ h at random and compute Lh 0 . compute the probability r j for all events using eq. 11 . update the background probabilities u i for all cells using eq. 12; 3) Repeat the following N in times . generate the next candidate set of parameters else Figure 1 | Background spatial probabilities ui (see eq. 9) adopted for ETAS simulations. The map was created using the software Generic Mapping Tools (http://gmt.soest.hawaii.edu/).

4) else
Application to ETAS simulated datasets and the Italian real earthquake catalog First, I check the VFSA estimation algorithm on two classes of 100 simulated ETAS datasets, with T tot 5 1 and 5 years, covering the Italian territory. The simulations are obtained by the thinning method 18 , with a b-value equal to 1 and the discrete background probability distribution {u i i 5 1, …, N c } shown in Figure 1  . I discard the combination of parameters for which the branching ratio is larger than 1.0 (causing the explosion of the process), and I repeat simulations that give fewer than 100 events. The complementary (or auxiliary) events (i.e., the events occurring outside the spatio-temporal target region T |R that have a possible triggering effect on the events inside) play a crucial role for correctly estimating the ETAS parameters 26 . Therefore, I take into account the triggering contribution of all the events simulated outside the target region. Moreover, I simulate 1 year of seismicity before the target period T (for both 1 and 5 year catalogs) and I include the triggering contributions of these events in log-likelihood computations. The number of events N in the 200 simulated datasets varies from 102 to 3889. To apply the VFSA algorithm described in the previous section, I fix N in 5 100, and I follow the criterion proposed by Lin et al. 11 to estimate T 0 . For each synthetic catalog, I estimate the log-likelihood's image as the largest difference among the 100 log-likelihood values, computed on as many random combinations of parameters. The values of T 0 obtained in this way vary from 1.0 to 22. Thus, I fix the starting temperature to the mean value T 0 5 10. Finally, I use the conservative value M 5 10 to define the stopping criterion (see eq. 5), and I fix , which I judge as an acceptable approximation for the loglikelihood of the ETAS model.
I check the performance of the algorithm by running the VFSA code 100 times on each catalog. Specifically, for each parameter and each catalog, I measure the systematic and the random errors by three quantities: the accuracy, i.e., the closeness of the estimations to the true values, the bias, i.e, the systematic shift of the estimations in one direction from the true values, and the precision, i.e., the degree of agreement for a series of estimations 10 . I measure these quantities as percentages of the size of the range (RG k ) of each parameter h k [h, because the ETAS parameters have different orders of magnitude. The bias of h k is measured by:  where h k opt is the median of the output estimations of 100 runs h k,ir opt ,i r~1 , . . . ,100 n o and h k pr is the pseudo-real value of h k (i.e. the value used to simulate the dataset). I compute the accuracy as the absolute value of the bias: Finally, the precision is given by: where I 90% h k,i r opt ,i r~1 , . . . ,100 n o is the size of the 90% confidence interval for the estimated h k , computed as the difference between the 95-th and 5-th percentiles of h k,ir opt ,i r~1 , . . . ,100 n o . Figure 2 shows the plot of h k opt versus h k pr for the 8 parameters {m, k, c, p, a, d, q, c} and for all simulated datasets. h k opt and h k pr may significantly differ, especially for small datasets, except for the para-meter m. Figure 3 is a synthesis of the bias/accuracy/precision measures. It shows no systematic significant bias (under or overestimation); the distribution of B k is centered at zero and is mainly symmetric (see Figure 3a) for all parameters. A k is well below 0.1 for more than 50% of the catalogs, but it reaches 0.4-0.5 for the c parameter (see Figure 3b). Similarly, the precision is below 20-30% of RG k many times but may reach 40% for the 1-year catalogs (see Figure 3c). A possible proxy of the accuracy/precision of the algorithm is the number of events N. Thus, I test the hypothesis of no correlation/dependence between A k /P k and N by three statistic tests: the Pearson's linear, the Spearman rho and the Kendall's tau coefficients 3 . The first is a measure of the linear relationship between two samples, while the others quantify a more general association. The results are shown in Figure 3d. The p-values of the tests (i.e. the probabilities of non-correlation/independence) are well below 0.01 for all parameters, except for m, suggesting a strong influence of N on the estimation. The high p-values for the m parameter confirm the accuracy/precision results (see Figures 2 and 3): the algorithm is able to estimate the overall background rate, no matter how large the dataset is. In contrast, the sample size N strongly affects the fit of the spatial probability background distribution (Figure 4). The medians (on 100 runs, for each catalog) of the estimated probabilities u i are closer to the pseudo-real values (mapped in Figure 1) as N increases. For small datasets, the estimated and pseudo-real probabilities may differ by as much as two orders of magnitude.    The analysis of log-likelihood values confirms the previous results (see Figure 5). First, the values DLL~Lh opt {Lh pr are negative for N below 1000/1500 (see Figure 5a), suggesting that the data are not sufficient to optimize the log-likelihood. If the background probabilities u i are fixed to pseudo-real values, DLL is always positive, varying from 0 to 20, with negligible differences for the remaining 8 parameters (see Figure 5b). This means that the negative values of DLL are entirely due to the poor estimation of the background probabilities u i . The largest difference among the log-likelihoods, deduced from the best 100 runs as follows: is mainly below 2 (see Figures 5c). This suggests that the low precision and accuracy of the estimations on the smallest catalogs are due to the flatness of the log-likelihood function. In other words, rather different combinations of parameters (see Figure 3c) may give similar log-likelihood values on small datasets. I further investigate this point by applying the three tests used in Section 3 on all possible 28 pairs of 100 parameter values obtained by as many runs. Specifically, I compute the proportion of catalogs with p-values , 0.05 for each pair of parameters. The aim of this analysis is to verify the hypothesis of non-correlation/independence among the ETAS parameters. I find significant correlations inside two parameter subsets: {k, c, p, a} (Omori law) and {d, q, c} (spatial distribution of triggered events). No systematic correlation is found between m and the other parameters, confirming that the algorithm is able to distinguish the background seismicity.
To clarify the procedure, I describe in more details the results of the algorithm for the smallest and the largest simulated catalogs (see Tables 1 and 2). The first catalog has 1 year of data (plus 1 year as an auxiliary period) and 102 events; the second has 3889 events collected over a five-year period (plus 1 year as an auxiliary period). The solution for the first dataset has poor precision (such as for the c parameter) and poor accuracy (the a parameter), but the values Lh ir opt are close and all lower than the pseudo-real value Lh pr .
No significant change is found by varying some tuning parameters of the algorithm (~0:1; T 0 5 1, 100; n p 5 5). If I fix the background probabilities to pseudo-real values, the algorithm gives similar estimations of the remaining 8 parameters and log-likelihood values larger than Lh pr and close to 2662.0. This confirms that the log-likelihood values strongly depend on the background probabilities. In contrast, the parameter estimation on the second catalog is accurate and precise; this proves that the algorithm provides correct solutions on sufficiently large datasets. I repeated the analysis on simulations with an extended auxiliary period (up to 5 years), but did not find significant changes.   www.nature.com/scientificreports Finally, I apply the algorithm presented here on real data as a further check. Specifically, I consider the events that occurred in the last 10 years or so (16/04/2005-31/08/2014) in the region shown in Figure 1 with local magnitude larger than 3.0 and depth lower than 30 km. This database is down-loadable from the site http://iside.rm. ingv.it and consists of 1788 events. It includes two important sequences, which occurred near L'Aquila city (2009, stronger event ML5.9) and in the Emilia region (2012, stronger event ML5.9). I run the VFSA algorithm 100 times, obtaining the median values and the 90% confidence bounds of the ETAS parameters, which are reported in Table 3. The comparison with the output estimations of the code developed by Zhuang and co-workers is not trivial, because they adopt a different formulation of the model with a normalized Omori law 27,28 . However, I find negligible differences by adapting the best parameters found with Zhuang's code to my formulation (see Table 3).

Discussions and Conclusions
In this study, I present a new algorithm to estimate the parameters of a spatio-temporal ETAS model. The algorithm is based on SA and allows an automatic ML estimation of the model, without ad-hoc tuning of the parameters. Moreover, it quantifies the uncertainties by multiple runs. It is a versatile and testable tool, used here to investigate the model and the possibility to give a physical interpretation to the parameters. This type of study needs repeated simulations and estimations, which are difficult to do using conventional ML procedures. Numerical optimization procedures may be computationally intensive or may have some convergence problems, especially when the log-likelihood is extremely flat or multimodal. Moreover, it is quite difficult to decide if log-likelihoods have converged to the global maximum. Finally, their performance strongly depends on the starting values of the parameters.
I test the algorithm proposed here on two classes of 100 synthetic datasets: 1 and 5 years (plus 1 year of an auxiliary period, see Ref. 26); these are simulated by using a specific formulation of ETAS model and randomly generated sets of parameters (see Section 3). The choice of testing the algorithm on relatively small datasets is intentional. The efficiency of the ML criterion on large catalogs is expected, if a proper auxiliary region is taken into account 26 . However, it is still unexplored if the ML criterion is able to discern the ETAS parameters on small catalogs, even if such datasets are largely used in studies on non-stationarities 4,9,13 .
The results of this study consistently show that a few data (below 1000) do not produce a precise and correct estimation of the model; only the overall background rate is correctly distinguished from the triggering contribution, regardless of the number of data. In particular, the algorithm is not able to give a reliable estimation of the discrete background probabilities u i for small datasets. This is the only cause for the failure in the log-likelihood optimization (see Figure 5a) and does not affect the other parameters (see Figure 5b). These inefficiencies may derive from the specific procedure used here, the inappropriateness of the ML criterion itself, as outlined by Veen and Schoenberg 25 , or the intrinsic impossibility of estimating so many parameters with a few events. This will be the topic of a future communication. What is certain is that rather different combinations of parameters may give close log-likelihoods on small datasets (see Figure 5c and Tables 1-2); this is the main cause of the low precision of the algorithm and it is in part due to the high correlation among the parameters of the triggering contribution ( Figure 6).
This study gives important hints on the ETAS model as a diagnostic tool of the physical processes responsible for seismicity. First, it shows that the spatial-temporal variability of the m parameter may be a proxy of the magma/fluid signal 4,13 or of induced seismic activity 9 because the ML method allows the background rate to be distinguished from the triggering contribution. However, any specu-lation on spatio-temporal variations of the triggering contribution must be treated with the utmost care. This topic needs more work, of course. However, to give more arguments, I conduct a ''partial'' estimation on the 1-year catalogs. Specifically, I estimate each parameter one at a time, keeping both the remaining parameters and the pseudo-real values of the background probabilities fixed (Figure 7). The resulting accuracy does not differ much from the one obtained from a ''full'' analysis (see Figure 3), for which the precision is much higher (Figure 7a) and the log-likelihood values are always larger than Lh pr (Figure 7b). All these results suggest two points: first, the low precision of the ''full'' estimation ( Figure 3c) is mainly due to the correlation of parameters; second, the low accuracy (Figures 3b  and 7a) derives from the low resolution of the ETAS model and/or of the ML criterion on small datasets. In particular, the limited magnitude ranges, covered with small datasets, make the estimations of the a and c parameters inaccurate. Different ETAS estimation methods might clarify if all these problems are intrinsic to the model or due to the flatness of log-likelihood. Until now, the only possible alternative is the EM-method of Veen and Schoenberg 25 , proposed as a better method than the ML criterion, especially on limited datasets. This does not mean that the ML criterion is wrong, but that the quality and amount of experimental data may be too poor to estimate the parameters unambiguously. And, if the model parameters are not well set, then all the points inferred from their estimation are misleading.
The procedure proposed here allows an estimation of ETAS parameters without an ad hoc tuning of the algorithm, as the Quasi-Newton methods require. Moreover, multiple runs quantify the errors. The results on the real Italian dataset show that the uncertainties are not symmetric (see Table 3). Therefore, the Hessian matrix 18 may wrongly estimate the errors, because it needs a normal distribution of the uncertainties.
In the present study, strong attention is not paid to cutting the computational time of the algorithm. However, I judge it reasonable in this first version, also for large datasets. Parallelization and/or some changes would make the code faster. This will be the topic of future work.