The excess volatility puzzle explained by financial noise amplification from endogenous feedbacks

Arguably the most important paradox of financial economics, the excess volatility puzzle, first identified by Robert Shiller in 1981, states that asset prices fluctuate much more than information about their fundamental value would warrant. We show that this phenomenon is associated with an intrinsic propensity of financial markets to evolve towards instability. These properties, exemplified for two major financial markets, the foreign exchange and equity futures markets, can be expected to be generic in other complex systems where excess fluctuations result from the interplay between exogenous driving and endogenous feedback. Using an exact mapping of the key property (volatility/variance) of the price diffusion process onto that of a point process (the arrival intensity of price changes), together with a self-excited epidemic model, we introduce a novel decomposition of the volatility of price fluctuations into an exogenous (i.e. efficient) component and an endogenous (i.e. inefficient) excess component. The endogenous excess volatility is found to be substantial and largely stable at longer time scales, and thus provides a plausible explanation for the excess volatility puzzle. Our theory rationalises the remarkable fact that small stochastic exogenous fluctuations at the micro-scale of milliseconds to seconds are renormalised into long-term excess volatility with an amplification factor of around 5 for equity futures and 2 for exchange rates, in line with models that include economic fundamentals explicitly.


Inference scheme for the epidemic model
This section provides an overview of the estimation schemes applied for the different model types used in the main text.

Description of the estimation algorithm
To estimate the epidemic model, we use an Expectation Maximization (EM) algorithm. Generally, the EM algorithm can be used to iteratively build an MLE using the complete-data likelihood function L_c(θ | ω, Z) 1. Here, ω = {t_i}_{i=1,...,N(T)} denotes the observation times of the events we want to model; in our specific setup, these are the times when the observed price changed. N(T) counts the number of events observed on the sample interval of length T, θ is a vector of model parameters and Z introduces a set of latent variables which help decompose the estimation into manageable sub-problems. In our specific setup, these latent variables are taken as the branching structure of the epidemic model. Formally, this is encoded as a matrix of indicator variables Z ∈ {0, 1}^{N(T)×N(T)}, where Z_{i,j} = 1 if the event at t_i is an offspring of the event at t_j < t_i. If the event at t_i was exogenously triggered (was an "immigrant" to the system), Z_{i,i} = 1. Using the branching structure as the latent information in the EM achieves two important goals. First, it reduces estimation of the epidemic model to separate models for μ(t) and the offspring processes R_1 · h(t − s) 2-4, where R_1 denotes the (random) number of events directly triggered by some other event, at a time t after the event at time s. Second, it provides, as a by-product, an estimate for a key object of interest in our study: the branching structure itself.
The EM algorithm presented in the sequel is a modification of the one in 5,6, extended to deterministically time-varying fertilities in 7. The modification consists in dropping any assumption of a particular parametric model for the fertility in the maximization step of the algorithm, and instead using fully nonparametric offspring estimates per point, based on the branching structure obtained in the expectation step of the EM algorithm. Provided an initial guess of the parameter estimates θ̂^(0), the EM algorithm iterates the expectation step (E-step) and maximization step (M-step) at each iteration l until θ̂^(l) converges 8.
2. E-step: In the expectation step, the branching probabilities π^(l)_{i,j} = E[Z_{i,j} | ω, θ̂^(l−1)] are computed. At the l-th iteration of the algorithm, these can be obtained using random thinning 9 as

π^(l)_{i,i} = μ̂^(l−1)(t_i) / λ̂^(l−1)(t_i),    π^(l)_{i,j} = R̂^(l−1)_{1,j} ĥ^(l−1)(t_i − t_j) / λ̂^(l−1)(t_i)  for t_j < t_i,    (3)

where λ̂^(l−1)(t_i) = μ̂^(l−1)(t_i) + Σ_{t_j < t_i} R̂^(l−1)_{1,j} ĥ^(l−1)(t_i − t_j).

3. M-step: In the maximization step, new values of the parameters θ̂^(l) are obtained by maximizing the expected complete-data log-likelihood ℓ_c(θ) = ℓ_μ + ℓ_h, where H(t) = ∫₀^t h(s) ds, the cumulative distribution corresponding to the offspring density h, enters through the compensator. Assuming that μ and h do not share any parameters, the two parts ℓ_μ and ℓ_h can be maximized separately. Furthermore, we obtain estimates for the R_{1,j}'s from the branching structure as

R̂^(l)_{1,j} = Σ_{i: t_i > t_j} π^(l)_{i,j}.    (4)

4. Check convergence: Stop the estimation if ||θ̂^(l) − θ̂^(l−1)||_∞ is below some small threshold; otherwise go back to the E-step for another iteration, l = l + 1.
An estimate of the branching ratio of the epidemic process can finally be obtained as η̂ = (1/N(T)) Σ_{j=1}^{N(T)} R̂_{1,j}. In our empirical calibrations, we stop the EM algorithm if the convergence criterion is satisfied at a threshold of 10^{−4} for a few consecutive iterations, or once a maximum of 500 iterations has been performed.
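For intuition, the E- and M-step updates take closed form in the simplified special case of constant immigration μ(t) ≡ μ, a single constant fertility η and a parametric exponential kernel h(t) = βe^{−βt}. The following is a minimal sketch of that special case (not the nonparametric, logspline-based estimator used in the text; starting values and the function name are ours):

```python
import numpy as np

def em_hawkes_exp(t, mu=0.5, eta=0.5, beta=1.0, T=None, n_iter=200, tol=1e-4):
    """EM estimation of a Hawkes process with constant immigration mu and
    exponential offspring density h(t) = beta * exp(-beta * t).
    Returns (mu, eta, beta) and the estimated first-generation offspring
    counts R1_j = sum_i pi_ij per point."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    T = float(t[-1]) if T is None else T
    dt = t[:, None] - t[None, :]          # dt[i, j] = t_i - t_j
    lower = dt > 0                        # only earlier events can trigger t_i
    dt_pos = np.where(lower, dt, 0.0)
    for _ in range(n_iter):
        # E-step: branching probabilities by random thinning
        kern = np.where(lower, eta * beta * np.exp(-beta * dt_pos), 0.0)
        lam = mu + kern.sum(axis=1)       # conditional intensity at each t_i
        P = kern / lam[:, None]           # P[i, j]: prob. that event j triggered i
        p_imm = mu / lam                  # prob. that event i is an immigrant
        # M-step: closed-form parameter updates
        mu_new = p_imm.sum() / T
        eta_new = P.sum() / N             # branching ratio: mean offspring per event
        beta_new = P.sum() / (P * dt_pos).sum()
        done = max(abs(mu_new - mu), abs(eta_new - eta), abs(beta_new - beta)) < tol
        mu, eta, beta = mu_new, eta_new, beta_new
        if done:
            break
    return mu, eta, beta, P.sum(axis=0)
```

Because the thinning probabilities of each event sum to one, the identity μ̂T + η̂N(T) = N(T) holds at every iteration, which is a useful sanity check.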

Adaptive estimation of µ(t)
To facilitate a flexible and systematic estimation of the exogenous driving process μ, we use flexible logspline 10 estimators to control for trends and shocks in the data in an adaptive fashion 5. The flexibility and smoothness provided by these modern nonparametric estimators has been found to be crucial in controlling for the significant exogenous heterogeneity in high-frequency financial data 6. To select models, we vary the degrees of freedom allowed for μ(t) over p ∈ {0, 3, 6, 10, 15, 20, 25, 30, 40, 50, 60}, where p = 0 denotes a model with constant exogenous driving μ(t) ≡ μ > 0, compute log-likelihoods and select the best model for μ(t) according to the Bayesian Information Criterion (BIC), −2ℓ(θ̂^(l)) + p log(N(T)).
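The selection rule can be sketched as follows; the log-likelihood values would come from the fitted logspline models, and the helper name is ours:

```python
import numpy as np

def select_df_by_bic(loglik_by_df, n_events):
    """Select the degrees of freedom p for mu(t) minimizing
    BIC(p) = -2 * loglik(p) + log(N(T)) * p.
    `loglik_by_df` maps candidate p -> maximized log-likelihood."""
    bic = {p: -2.0 * ll + np.log(n_events) * p for p, ll in loglik_by_df.items()}
    best = min(bic, key=bic.get)
    return best, bic
```

Here p = 0 corresponds to the constant-μ model, which wins whenever the likelihood gain of a flexible μ(t) does not cover the log(N(T)) penalty per degree of freedom.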

Simulation study of the EM estimator
We test the ability of the proposed EM procedure to characterize the distribution of the R_{1,j} in a simulation study. For this purpose, we simulate realizations of Hawkes processes with constant immigration μ = 0.1, a branching ratio of η̄ = 0.6 and three different fertility distributions f(η): deterministic, exponentially distributed, and power-law distributed fertilities, the latter with density f_γ(η) = γη₀^γ (η + η₀)^{−(1+γ)}, η₀ = 0.3 and γ = 1.5. We use a simple exponential offspring density with ground truth h(t) = 2e^{−2t} in all three cases. Fitting is then done using the EM on 100 realizations of length T = 86400 seconds from each of these three processes. The results from this simulation study are summarized in Figure S1. First, we notice that the branching ratios η̂ recovered by the EM are quite consistent, with estimates ≈ 0.62 for all three fertility distributions considered. Regarding the exact shape of the distributions, we find that the head of the distribution of R_1 appears difficult to estimate, which is not surprising given that the true distribution is discrete, whereas the estimator is real-valued. Regarding the tail, we see that even though there is clearly some bias, the shape of the tail is approximated quite well. In particular, the tail of the distribution recovered by the EM exhibits approximate power-law behavior in the case where the underlying true fertility distribution also follows a power law, although the heaviness of the tail is underestimated. In the case of exponential fertility, the EM estimates also clearly drop off exponentially in the tail. Therefore, we conclude that, while inferring the exact distribution of the fertilities directly from the EM estimates of the number of first-generation offspring is difficult, testing for certain properties of the underlying distribution (e.g. computing its mean or conditioning it on some observable, testing for a power-law tail, assessing the share of zero-offspring points) works well.

Figure S1. Shows the empirical complementary cumulative distribution function (CCDF) of the simulated fertilities (η_j), the corresponding true random numbers of first-generation offspring (R_{1,j}) and the EM estimates thereof (R̂_{1,j}), for the three simulation cases considered. The dashed green lines indicate 95% lower and upper confidence bounds for the CCDF of the EM estimates, obtained from Greenwood's formula. In the inset, the averages of the three quantities across the 100 process realizations of length T = 86400 seconds are compared to the true value (the mean η̄ of the respective fertility distribution; dashed vertical line). The horizontal whiskers indicate 95% confidence intervals for the mean, obtained by bootstrap resampling 100 times.
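The simulated processes can be reproduced in spirit with a cluster (branching) representation of the Hawkes process; the sketch below assumes per-event fertilities η_j drawn from f(η) and Poisson offspring counts R_{1,j} ∼ Poi(η_j) (function names ours, horizon shortened for illustration):

```python
import numpy as np

def simulate_hawkes_random_fertility(mu, T, draw_fertility, beta=2.0, seed=0):
    """Cluster-based simulation of a Hawkes process with constant immigration
    mu, exponential offspring density h(t) = beta * exp(-beta * t) and random
    per-event fertilities eta_j = draw_fertility(rng); each event then has
    R1_j ~ Poisson(eta_j) first-generation offspring.
    Returns event times (sorted) and the matching true R1_j counts."""
    rng = np.random.default_rng(seed)
    times = list(rng.uniform(0.0, T, rng.poisson(mu * T)))   # immigrants
    queue = list(times)                                      # events to process
    R1 = []
    i = 0
    while i < len(queue):
        s = queue[i]; i += 1
        n_children = rng.poisson(draw_fertility(rng))        # true R1 of this event
        R1.append(n_children)
        children = s + rng.exponential(1.0 / beta, n_children)
        for c in children[children < T]:                     # truncate at horizon T
            times.append(c)
            queue.append(c)
    order = np.argsort(times)
    return np.asarray(times)[order], np.asarray(R1)[order]
```

With μ = 0.1 and η̄ = 0.6, the expected number of events is μT/(1 − η̄), i.e. each immigrant spawns a cluster of mean size 1/(1 − η̄) = 2.5.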

Extension of the estimation algorithm in the presence of exogenous stochastic clustering
Here, we give a short practical description of the extended algorithm required to estimate the model in the presence of stochastic exogenous clustering, for the special case where γ(s) ≡ γ̄ and η(s) ≡ η̄, i.e. with conditional intensity

λ(t) = μ(t) + γ̄ Σ_{c_k < t} g(t − c_k) + η̄ Σ_{t_i < t} h(t − t_i).    (8)

The algorithm synthesizes the principles described above for the generalized Hawkes model with the MCEM algorithm derived in 11 for the inclusion of the additional shot-noise effects.
For the estimation of a model like (8), or (7) of the main text, the missing data has to be extended to (Z, C), where C denotes the immigrant process induced by N_μ. The complete-data log-likelihood thus takes the form ℓ_c(θ | ω, Z, C); its conditional expectation is no longer available in closed form, but can be approximated by averaging over a sample c^(k), k = 1, ..., K of immigrant realizations drawn from a density ∝ p(c | ω, θ̂^(l)). This Monte Carlo average defines the objective (10). To obtain a conditional MCMC sample from the immigrant process, a Metropolis-Hastings algorithm is employed. Given some state c^(k) of the Markov chain, each Metropolis-Hastings iteration starts by randomly proposing the birth or death of an immigrant. In case we choose birth, we uniformly select some c* ∈ ω \ c^(k). The Metropolis-Hastings birth ratio r_b(c, c*), used to decide whether c* is accepted given some immigrant vector c, is obtained from the ratio of the conditional densities of the immigrant configurations with and without c*,

and we accept c^(k+1) = c^(k) ∪ c* with acceptance probability min{r_b(c^(k), c*), 1}. In case we choose death, c* ∈ c^(k) is removed from the current state of the Markov chain with acceptance probability min{1/r_b(c^(k) \ c*, c*), 1}. The Metropolis-Hastings iterations can then be repeated until a sufficiently long chain is available, and the desired K samples can be drawn from this chain. Then, given some c^(k) = (c^k_1, ..., c^k_{n^k_μ}) and due to the fact that the process maps to a branching process, the inner expectation in (10) decouples analogously to (4), using branching probabilities which can be evaluated by random thinning 9 in a similar fashion to (3). In the M-step of the algorithm, the objective (10) is maximized, which, as for estimation without shot-noise effects, has the structure of decoupled density estimation. First, μ(t) is estimated on the MCMC sample of immigration times {c^(1), ..., c^(K)}. Next, the offspring density h(·) is estimated with objective ℓ_h from (4), with the difference that the offspring probabilities used in the objective function need to be averaged over the MCMC samples. Estimation of g(·) finally requires pooling the inter-event times and weights π_g over all MCMC samples, where G(t) = ∫₀^t g(s) ds is the cumulative distribution corresponding to the exogenous offspring density g(t). Estimating nested models can be done by enforcing the restrictions γ̄ = 0 and/or η̄ = 0 12. For the applications presented in the main text, the MCEM estimation is done with K = 50 samples taken uniformly from the second half of a chain of length 100'000, generated by the Metropolis-Hastings algorithm presented here.
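The birth-death sampler can be sketched generically. In the sketch below, the model-specific ratio r_b is replaced by a user-supplied unnormalized log-density over immigrant labellings, and the Hastings correction for the uniform birth/death proposals is made explicit (function and argument names are ours):

```python
import numpy as np

def mh_immigrant_sampler(n_events, logp, n_steps=20000, seed=0):
    """Birth-death Metropolis-Hastings sampler over immigrant labellings.
    `logp(imm)` is a user-supplied callable returning the unnormalized
    log-density log p(c | omega, theta-hat) of a set `imm` of immigrant
    indices (its exact form depends on the calibrated model).
    Returns the chain of visited immigrant sets."""
    rng = np.random.default_rng(seed)
    state = frozenset(range(n_events))            # initialize: all points immigrants
    lp = logp(state)
    chain = []
    for _ in range(n_steps):
        non_imm = [i for i in range(n_events) if i not in state]
        if rng.random() < 0.5:                    # propose a birth
            if not non_imm:
                chain.append(state); continue     # impossible move: reject
            cand = int(rng.choice(non_imm))
            prop = state | {cand}
            # Hastings correction: birth picks uniformly among non-immigrants,
            # the reverse death would pick uniformly among prop's immigrants
            log_q = np.log(len(non_imm)) - np.log(len(prop))
        else:                                     # propose a death
            if len(state) <= 1:
                chain.append(state); continue     # keep at least one immigrant
            cand = int(rng.choice(sorted(state)))
            prop = state - {cand}
            log_q = np.log(len(state)) - np.log(n_events - len(prop))
        lp_prop = logp(prop)
        if np.log(rng.random()) < lp_prop - lp + log_q:
            state, lp = prop, lp_prop
        chain.append(state)
    return chain
```

Caching `logp` incrementally (only the terms touched by the flipped point change) is what makes such samplers practical on high-frequency samples.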

Model selection in the presence of exogenous stochastic clustering
In the absence of stochastic exogenous clustering (γ ≡ 0), model selection is possible consistently using information criteria (see section 1.2), because the unconditional likelihood is accessible. If however γ > 0, then conditional likelihoods must be computed using MCMC, as discussed in 13. To fix notation, denote by df(·) the degrees of freedom of a function; we thus have the exogenous flexibility p = df(μ) and the endogenous flexibility of the model r = df(h) + df(g). Then θ̂_{p,r} denotes the parameter estimates of a particular model with overall p + r degrees of freedom.
To compute conditional likelihoods, we create an MCMC sample c^(1), c^(2), ..., c^(k), ..., c^(K) of immigrant realizations with density ∝ p(c | ω, θ̂_{p,r}), just like in the E-step of the MCEM algorithm in the previous section, this being the density of an immigrant realization c conditional on the parameter estimates θ̂_{p,r} and the data ω. Each c^(k) defines a conditional log-likelihood ℓ(θ̂_{p,r} | ω, c^(k)) and a corresponding conditional BIC B_{p,k}. In order to select models, we apply a two-step procedure: 1. For each model type, represented by a given r, we first select the optimal degrees of freedom p*_r for the immigration intensity μ. For this purpose, for b = 1, ..., B and p = 0, ..., P, we sample indices j^p_b ∈ [1, K] uniformly at random to obtain a bootstrap sample of conditional BICs B_{p,b} = B_{p, j^p_b}.

We can then compute the conditionally (on the particular sampled immigrant realizations) optimal degrees of freedom p^b_r = arg min_p B_{p,b}, being the optimal degrees of freedom per bootstrap sample according to the randomly chosen conditional BICs. The final choice across bootstrap samples is then p*_r = mode({p^b_r}_{b=1,...,B}). 2. To compare two different models, say r′ and r′′, we sample pairs (B_{r′,b}, B_{r′′,b}) and then select r′ over r′′ on a given bootstrap sample b if B_{r′,b} < B_{r′′,b}. We can then finally compute the selection probability P(r′ ≻ r′′) as the fraction of bootstrap samples on which r′ is selected, giving an estimate of the likelihood that a certain model type r′ is to be preferred over another model type r′′.
For all applications in the main text, we choose B = 1000 and perform the test on an MCMC sample of K = 100 draws from a Markov chain of length 300'000.
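The pairwise comparison step reduces to resampling conditional BICs; a minimal sketch (function name ours):

```python
import numpy as np

def selection_probability(bic_r1, bic_r2, n_boot=1000, seed=0):
    """Estimate P(model r' preferred over r'') by resampling conditional
    BICs (one per MCMC immigrant draw) and comparing them pairwise.
    `bic_r1`, `bic_r2` are arrays of K conditional BIC values."""
    rng = np.random.default_rng(seed)
    b1 = rng.choice(np.asarray(bic_r1, dtype=float), size=n_boot, replace=True)
    b2 = rng.choice(np.asarray(bic_r2, dtype=float), size=n_boot, replace=True)
    return float(np.mean(b1 < b2))
```

A selection probability near 0.5 indicates that the MCMC-induced spread of the conditional BICs is too large to discriminate between the two model types.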

Endogenous vs. exogenous clustering
In order to test which of the clustering mechanisms are dominant in practice, we conduct the following test. For some randomly selected days of data, we calibrate models with constant branching coefficients γ(s) ≡ γ̄, η(s) ≡ η̄ and the adaptive logspline estimator from section 1.2 for μ_ε(t), which also allows for deterministic clustering through time-variation in μ_ε(t). We then estimate increasingly complex models, starting from no stochastic clustering (Deterministic Exo: γ̄ = 0, η̄ = 0), to include only exogenous clustering (Stochastic Exo: η̄ = 0), or deterministic exogenous and stochastic endogenous clustering (Stochastic Endo: γ̄ = 0), or finally a mixture of all three clustering mechanisms (Stochastic Endo-Exo). Additionally, we allow h to exhibit either short-tailed (exponential) or heavy-tailed (approximate power law 14) memory.
In Table S1, we illustrate how frequently (across samples and thresholds ε) a certain clustering mechanism is selected according to the procedure described in section 1.5. We observe first that the most frequently selected model is the Stochastic Endo model. Moreover, an endogenous offspring distribution with a heavy tail is preferred over short-tailed memory. A more detailed summary of the estimates for the Stochastic Endo model with heavy-tailed memory, the most frequently selected model, is additionally given in Figure S2.

First-order properties and relation to autoregressive time series
A string of recent literature has developed 15-17 and reviewed 18 the first- and second-order statistical properties of self-excited (Hawkes) point processes in the context of financial applications; these processes were originally formulated in 19,20.

Figure S2. Shows a summary of the exemplary estimates for the Stochastic Endo model (γ(s) ≡ 0) with heavy-tailed (approximate Pareto 14 with tail index α) memory from Table S1, for the EUR/USD (left panel) and the E-mini (right panel) data. The heatmap at the top of each panel shows the integrated exogenous intensity ∫_{t−Δ}^t μ(s) ds, with Δ equal to one minute (cf. also the unconditionally expected activity profiles in Figure 1 of the main text), across various thresholds ε corresponding to the (5%, 10%, ..., 95%)-quantiles of the distribution of squared returns. The smaller plots below the heatmaps show, also as a function of ε, estimates of the branching ratio η̂, the characteristic time scale τ₀ and the tail index α of the memory kernel h(t).
Here we build on these results to derive the time-dependent properties of first-order characteristics of the epidemic process underlying the study in the main text. In a few extensive comments, we furthermore highlight some important parallels to the analysis of autoregressive time series.
In the following, we consider the conditional intensity

λ(t) = μ(t) + Σ_{i: t_i < t} η̄ h(t − t_i) = μ(t) + ∫₀^t η̄ h(t − s) dN(s)

after having averaged over the distribution of fertilities. With some abuse of notation, we denote this averaged intensity also as λ(t). Taking the expectation of λ(t), we find the expected intensity as the solution to the equation

E[λ(t)] = μ(t) + ∫₀^t ϕ(t − s) E[λ(s)] ds    (19)

for ϕ(t) = η̄ h(t) and η̄ = E[η]. Equation (19) has the form of a classical renewal equation and can be expressed as the convolution

e(t) = μ(t) + (ϕ * e)(t)    (20)

for e(t) = E[λ(t)]. Denoting the Laplace transform of a function x(t) as x̂(z) = ∫₀^∞ e^{−zt} x(t) dt, the solution of (20) is given in the Laplace domain as

ê(z) = μ̂(z) / (1 − ϕ̂(z)).    (21)

If we define the auto-convolution of ϕ for n ≥ 1 as ϕ^{*(n+1)}(t) = ∫₀^t ϕ(t − s) ϕ^{*n}(s) ds = (ϕ * ϕ^{*n})(t), with ϕ^{*1} = ϕ and ϕ^{*0} = δ, the identity element of convolution, we note that

ϕ^{*n}(t) = η̄^n h^{*n}(t).    (22)
Since h is a probability density function, h^{*n} is itself the probability density of a sum of n independent random variables, each with density h. By explicitly calculating the L¹-norm of h^{*n} and remarking that ∫₀^∞ dt ∫₀^t ds = ∫₀^∞ ds ∫_s^∞ dt, it can be shown that the L¹-norm of h^{*n}(t) is equal to the L¹-norm of h^{*(n−1)}(t). By recurrence, it is also equal to the norm of h(t), and consequently ϕ^{*n}(t) has L¹-norm equal to η̄^n for all n. Thus the auxiliary function ψ₀, defined as

ψ₀(t) = Σ_{n≥0} ϕ^{*n}(t),    (23)

is also in L¹, given η̄ < 1, since

∫₀^∞ ψ₀(t) dt = Σ_{n≥0} η̄^n = 1/(1 − η̄) < ∞.    (24)

To finally find the solution of (20), we see that ψ₀(t) has Laplace transform

ψ̂₀(z) = Σ_{n≥0} η̄^n ĥ(z)^n = 1/(1 − η̄ ĥ(z)).    (25)

Note that the infinite sum in (25) closely resembles the characteristic polynomial of an MA(∞) process. To see this, define the lag operator L as L^s y_t = y_{t−s} and consider the AR(p) process y_t = −Σ_{j=1}^p ϕ_j L^j y_t + ε_t with innovations ε_t ∼ IID(0, σ²_ε). Then the characteristic polynomial Φ(η) = Σ_{n=0}^p ϕ_n η^n, defined on |η| < 1, expresses the innovations as ε_t = Φ(L) y_t, and, assuming we can invert the polynomial, we get the equivalent moving-average form y_t = Φ^{−1}(L) ε_t = Ψ(L) ε_t, where Φ(L)Ψ(L) = 1. In the specific case of an AR(1) process, the characteristic polynomial is Φ(L) = 1 − ηL, and thus the characteristic polynomial of the moving-average representation in the time series case has an obvious conceptual correspondence in the point process case,

Ψ(L) = 1/(1 − ηL) = Σ_{n≥0} η^n L^n  ↔  ψ̂₀(z) = Σ_{n≥0} η̄^n ĥ(z)^n.    (26)

In the Hawkes process case, we have a similar equivalence regarding the "characteristic polynomials", as (1 − ϕ̂(z))(1 + ψ̂(z)) = 1. Furthermore, the indicated comparison to the Laplace transform of ψ₀ can be interpreted such that, as is intuitively reasonable, the "lag operator" in this moving-average representation of the Hawkes process is given by the Laplace transform ĥ(z) of the offspring density, which precisely defines the probability of lagged event occurrences given some mother event. Furthermore, the integrability condition (24) is equivalent to the condition in MA(∞) models that the absolute MA coefficients must be summable for y_t to take finite values and for the process not to explode. The similarities to time series analysis are conceptually striking: in order to learn about the probability structure of the Hawkes process (the autoregressive point process), we have ended up looking for an equivalent moving-average representation. But when working with data, we prefer the simple autoregressive structure of the Hawkes process.
In time series analysis, when dealing with data, autoregression as a linear method is often very convenient and computationally fast. But when evaluating the structural properties of a process, these can often be obtained more easily via the moving-average representation; one example is the autocovariance function, which can be obtained from the convolution of the moving-average coefficients. Continuing the derivation, the expression for ψ̂₀(z) from (25) can be inserted into (21) to find the solution of the renewal equation in the time domain by applying the inverse Laplace transform L^{−1}, giving

e(t) = μ(t) + (ψ * μ)(t),    (27)

where ψ(t) = Σ_{n≥1} ϕ^{*n}(t), which is in turn defined by its Laplace transform

ψ̂(z) = ϕ̂(z)/(1 − ϕ̂(z)) = Σ_{n≥1} η̄^n ĥ(z)^n.    (28)

We note that since ϕ(t) is positive and supported on R₊, and ψ(t) is simply defined as a sum of convolutions of ϕ(t), ψ(t) is also positive and supported on R₊. The mapping of the Hawkes process to a Galton-Watson branching process 27 gives valuable intuition on the derivations at this point. We recall that the number of direct offspring of any point in the genealogy of the mapped branching process has a Poisson distribution with mean η̄. The expected size of the n-th generation of the equivalent branching process is thus η̄^n (cf. (22)). Further, the total number of individuals in the entire cluster formed by the branching process, including the immigrant (0-th generation), has a Borel distribution with mean (1 − η̄)^{−1} (cf. (25)) and variance η̄(1 − η̄)^{−3}. The expected size of the total offspring over all generations triggered by one immigrant is η̄/(1 − η̄), cf. (28). As such, we see that ψ̂₀(0) and ψ̂(0) are the expected values of the random variables representing the cluster size including and excluding the triggering immigrant, respectively¹.
Using this fact, the well-known result for the average intensity in the case of constant immigration μ(t) ≡ μ follows in a straightforward manner as

E[λ(t)] = μ(1 + ψ̂(0)).    (29)

By virtue of (28), this translates into

E[λ(t)] = μ/(1 − η̄).    (30)

In the time domain, the function ψ(t) satisfies the equation

ψ(t) = ϕ(t) + (ϕ * ψ)(t)    (31)

and is often known as the "activity level" in systems with memory; in the language of branching processes, it is the dispersion over time of all offspring of one immigrant. The function R(t) = ψ(t)/η̄ = (h * ψ₀)(t) is often called the resolvent, renormalized kernel or response function, and describes the response of the system to an impulsive (exogenous) perturbation. It integrates to 1/(1 − η̄), representing the mean cluster size of the Galton-Watson branching process to which the Hawkes process can be mapped. Note again the link to time series analysis: the impulse-response function (to a unit shock in the innovations) of an AR(p) process is given by the corresponding MA(∞) coefficients. Analogously, as already noted earlier, the function ψ₀(t) (or ψ(t) when not counting the immigrant) defines the activity of the Hawkes process following an immigrant, i.e. the response to an innovation. In the case of an exponential offspring density h(t) = βe^{−βt}, ψ(t) is known explicitly,

ψ(t) = η̄β e^{−β(1−η̄)t},    (32)

which illustrates the name "renormalized kernel": the characteristic time scale of the kernel is renormalized by (1 − η̄), namely it is amplified by the factor 1/(1 − η̄), expressing the increased duration of the cluster due to the cascade of generations.
Of particular interest in many applications are kernels with power-law memory,

h(t) = α τ₀^α (t + τ₀)^{−(1+α)},    (33)

whose Laplace transform can be expressed in terms of Γ(a, x), the upper incomplete gamma function. In this case, the Laplace transform in (28) cannot be inverted analytically, but the asymptotic properties of the resulting resolvent have been studied in detail as well; see 25.
To compute the expected number of points up to some time t, we can use the solution (27) to obtain

E[N(t)] = ∫₀^t e(s) ds = ∫₀^t μ(s) ds + ∫₀^t (ψ * μ)(s) ds.    (34)

For the case of constant exogenous intensity, the fact that ∫₀^t (ψ * μ)(s) ds ≈ μ t ψ̂(0) for large t directly recovers E[N(t)] = μt/(1 − η̄).
For the case of time-dependent immigration, we notice that, by the Fubini theorem, the integral of the endogenous part of the compensator Λ(t) = ∫₀^t λ(s) ds satisfies

∫₀^t ∫₀^s ϕ(s − u) dN(u) ds = ∫₀^t η̄ H(t − u) dN(u),

¹ For example, we can compute the average duration of a cluster as ∫₀^∞ t ψ(t) dt / ∫₀^∞ ψ(s) ds, which, e.g. in the case of the exponential kernel (32), yields 1/(β(1 − η̄)).

and integration by parts² with H(t) shows that this last expression equals (ϕ * N)(t); thus we get Λ(t) = I(t) + (ϕ * N)(t) for I(t) = ∫₀^t μ(s) ds. Taking expectations, a renewal equation for the expected event count is found, similar to (20),

E[N(t)] = I(t) + (ϕ * E[N])(t),    (35)

where we have used the fact that, at fixed t, N(t)|Λ(t) ∼ Poi(Λ(t)) and thus E[N(t)] = E[Λ(t)]. The solution of this equation is, analogously,

E[N(t)] = I(t) + (ψ * I)(t).    (36)
As a final remark underlining the conceptual similarities between autoregressive point processes and time series, it is worth noting that the scaling limits of the Hawkes process derived in 28,29 also find their natural analogue in the time series case 30 .

Effect of observation error on variance estimates of X and bias correction
A general result from stochastic calculus says that for a random partition³ of [0, t], π = {0 = t₀ ≤ ... ≤ t_n ≤ t}, and two general semimartingales W and Z, we have 31

Σ_{i≥1} (W(t_i) − W(t_{i−1}))(Z(t_i) − Z(t_{i−1})) → [W, Z]_t,

where the convergence in probability to the quadratic covariation [·, ·]_t holds for mesh size sup_{i≥1}(t_i − t_{i−1}) → 0 as n → ∞. Furthermore, in the case of the specific dynamics we postulate for our fundamental price X, the fundamental identity for stochastic integrals⁴ yields

[X, X]_t = ∫₀^t σ²(s) ds = v(t).

As a consequence, a common way to estimate the integrated variance of a process is to use the sum of squared returns, or realized variance. Assuming we are able to observe the latent process X without error at times t_i, the ideal estimator Σ_i (X(t_i) − X(t_{i−1}))² converges to v(t) in probability as the sampling frequency increases 32. However, in the presence of observation error, we are forced to consider instead the observed price Y(t) = X(t) + U(t) and the estimator

Σ_i (Y(t_i) − Y(t_{i−1}))²,    (40)

so the sum of squared observed returns is in general inconsistent for v(t). This argument immediately makes clear that σ²_Y from equation (4) in the main text, based on observed returns, is subject to the same biases as the realized variance estimator (40) as soon as there is observation error. We make the following assumptions: 1. The noise process U is stochastically independent of X. This is a standard assumption in the literature, even though there is some evidence for more intricate noise dynamics at tick frequencies 33.

2. U(t) is a covariance stationary process with E[U(t)] = 0.
3. We sample the processes at the times t i where the observed price changes to compute our variance estimates.
Under assumptions 1.-3. and defining e_i = U(t_i) − U(t_{i−1}) with E[e_i²] = σ²_e, the bias in equation (14) from the main text is found by taking expectations as

E[Σ_i (Y(t_i) − Y(t_{i−1}))²] = E[Σ_i (X(t_i) − X(t_{i−1}))²] + N(t) σ²_e.    (42)

Furthermore, under the assumptions made, (42) makes clear that the relative contributions of the exogenous and endogenous variances from Λ(t) are unchanged by the observation error; only their absolute magnitudes are inflated when using observed returns, due to the presence of U.

² We can write ∫₀^t H(t − u) dN(u) = [H(t − u)N(u)]_{u=0}^{u=t} + ∫₀^t h(t − u)N(u) du = ∫₀^t h(t − u)N(u) du, since H(0) = 0 and N(0) = 0. ³ It should be noted that, so as not to complicate notation, we have omitted that in fact we are dealing with an increasing sequence of partitions, so that n → ∞. ⁴ For a semimartingale process X and an X-integrable process ξ, we have the identity [∫ξ dX, ∫ξ dX] = ∫ξ² d[X].
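A small simulation illustrates how i.i.d. observation noise inflates the realized variance and how the additive bias N(t)σ²_e, with σ²_e = 2ω² for i.i.d. noise of variance ω², can be removed; all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
sigma = 0.01      # per-step volatility of the latent price X
omega = 0.02      # standard deviation of the i.i.d. observation noise U
x = np.cumsum(rng.normal(0.0, sigma, n))    # latent efficient price X
y = x + rng.normal(0.0, omega, n)           # observed price Y = X + U
rv_x = np.sum(np.diff(x) ** 2)              # infeasible error-free estimator
rv_y = np.sum(np.diff(y) ** 2)              # feasible realized variance of Y
# under i.i.d. noise, E[rv_y] = E[rv_x] + (n - 1) * sigma_e^2 with sigma_e^2 = 2 * omega^2
rv_corrected = rv_y - 2 * (n - 1) * omega**2
```

With these (hypothetical) values, the raw realized variance of Y overstates the integrated variance by roughly an order of magnitude, while the corrected estimate is close to the error-free one.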

Data sets and pre-processing
The main data sets used here are mid-prices of two highly liquid markets. First, we study the EUR/USD currency pair traded on Electronic Broking Services (EBS), one of the major interbank markets for foreign exchange and key market place for the EUR/USD currency pair. The time stamps of the EBS data set have a resolution of 100ms, and there is an entry whenever either the best bid, offer or both changed within the last 100ms interval. We use data for the entire calendar year of 2016. During this period, there were around 30'000 mid-price changes per day on average. Since foreign exchange markets operate continuously, we partition each week of the data into five 24 hour trading days, starting at the market opening on Sunday evening and disregard the 48h weekend period as there is no activity during that time.
The second main data set we use is from the S&P 500 E-mini futures contract traded on the Globex electronic platform at the Chicago Mercantile Exchange (CME), considered a key venue for equity price discovery in the United States. Trading on the Globex platform is possible almost continuously, halted only temporarily between 3:15pm and 3:30pm CT (Chicago Time) and for a maintenance window between 4pm and 5pm CT. We work with three months of data from January to March 2019 and define each trading day to start at 5pm CT and end at 3:15pm CT, discarding the 30 minutes of trading between the end of the temporary trading halt and the start of the maintenance window. This results in around 85'000 mid-price changes per trading day on average in the sample. The time stamps of the E-mini mid-price changes have millisecond resolution.
For some intraday case studies, we will additionally revisit a data set for the GBP/USD currency pair from 7 , traded on Thomson Reuters Matching (now known as Refinitiv FX), which was obtained from Refinitiv Tick History and also has millisecond resolution.
As customary for point process calibration 14,34 , we randomize the time stamps of all data sets within an interval of uncertainty (here 100 milliseconds) and exclude any holidays from the analysis.

Estimation of the observation error
Completing our variance estimator requires the specification of a particular model for the observation error. Under the common assumption of a serially independent noise process U with variance E[U²] = ω₁², the increments e_i follow an MA(1) structure. In this case, an estimate of the observation error variance can be obtained from the relation E[y_i y_{i−1}] = −ω₁² 35,36 as

ω̂₁² = −(1/(N(T) − 1)) Σ_{i=2}^{N(T)} y_i y_{i−1},    (44)

where y_i = Y(t_i) − Y(t_{i−1}). Under the independent observation error assumption, the bias correction term thus becomes σ̂²_e = 2ω̂₁². To account for more complex serial dependence in the observation error, we consider the generalization of e_i to an MA(q) structure,

e_i = ρ₁ν_i + ρ₂ν_{i−1} + ... + ρ_q ν_{i−q+1},    (45)

where ν_i ∼ IID(0, ω_q²) and ρ₁ = 1 for identification purposes. In this case, the bias correction term is

σ̂²_e = ω̂_q² Σ_{j=1}^q ρ̂_j².    (46)

Estimation of the noise model (45) can be done using the quasi maximum likelihood (QML) approach of 37. This procedure allows one to select the order q of the MA model consistently using information criteria. Here, we rely on the Akaike Information Criterion (AIC), which was shown to yield consistent estimates in 37. Importantly for our purposes, the QML scheme also provides estimates of the autocovariance function of the noise process, which can then be factorized to estimate the ρ_j. Estimates of the observation error variance according to the different specifications above are presented in Table S2. The fact that ω̂₁² is frequently found to be negative indicates a significant departure from the standard i.i.d. assumption for the observation error. This is especially pronounced for the E-mini price process, but also happens for EUR/USD prices. This pattern is also confirmed by the AIC-selected orders of the MA(q) observation error model, where the median selected order is 23 for the E-mini and 8 for the EUR/USD.
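The moment condition E[y_i y_{i−1}] = −ω₁² behind the MA(1) estimator can be checked by simulation (parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
omega1 = 0.05     # standard deviation of the i.i.d. observation noise U
sigma = 0.01      # per-step volatility of the latent price
u = rng.normal(0.0, omega1, n)
x = np.cumsum(rng.normal(0.0, sigma, n))
y = np.diff(x + u)                          # observed returns y_i
# under i.i.d. noise, E[y_i * y_{i-1}] = -omega1^2
omega1_sq_hat = -np.mean(y[1:] * y[:-1])
sigma_e_sq_hat = 2.0 * omega1_sq_hat        # MA(1) bias-correction term
```

The minus sign arises because consecutive return increments share one noise term, U(t_{i−1}), with opposite signs; the latent-price increments are serially uncorrelated and drop out of the cross moment.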
Finally, when accounting for dependence in the observation error, its variance ω_q² is approximately one order of magnitude larger for the E-mini than for the EUR/USD, suggesting a more significant role of observation error in E-mini variance estimation. For the empirical applications presented in the main text, the bias correction applied is the one based on the MA(q) specification (45).

Table S2. Shows the median daily estimates of the respective quantities defined in the previous subsections, with (5%, 95%)-quantiles in brackets below. The units of the return parameters are bp². q_AIC indicates the AIC-optimal order of the MA(q) noise model (45) according to the QML procedure in 37.