Analysis of temporal correlation in heart rate variability through maximum entropy principle in a minimal pairwise glassy model

In this work we apply statistical mechanics tools to infer cardiac pathologies over a sample of M patients whose heart rate variability has been recorded via 24 h Holter device and that are divided in different classes according to their clinical status (providing a repository of labelled data). Considering the set of inter-beat interval sequences \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{\mathbf {r}(i) \} = \{ r_1(i), r_2(i), \ldots , \}$$\end{document}{r(i)}={r1(i),r2(i),…,}, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1,\ldots ,M$$\end{document}i=1,…,M, we estimate their probability distribution \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P(\mathbf {r})$$\end{document}P(r) exploiting the maximum entropy principle. By setting constraints on the first and on the second moment we obtain an effective pairwise \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(r_n,r_m)$$\end{document}(rn,rm) model, whose parameters are shown to depend on the clinical status of the patient. In order to check this framework, we generate synthetic data from our model and we show that their distribution is in excellent agreement with the one obtained from experimental data. Further, our model can be related to a one-dimensional spin-glass with quenched long-range couplings decaying with the spin–spin distance as a power-law. This allows us to speculate that the 1/f noise typical of heart-rate variability may stem from the interplay between the parasympathetic and orthosympathetic systems.

where i labels the patient and n labels the number of beats in each sequence (which is order of 10 5 and depends on the patient). Patients belong to three classes, according to their clinical status: healthy individuals (H), individuals with atrial fibrillation (AF) and individuals with congestive heart failure (hereafter simplified as cardiac decompensation) (CD). Their number is M H = 149 , M AF = 139 , and M CD = 60 , respectively; of course, M = M H + M AF + M CD . In Fig. 1 (left) we show examples of the series r(i) for three patients belonging to the different classes.
In order to make a meaningful comparison of the variability among the RR series r(i) of different patients, we standardize them with respect to their temporal mean and standard deviation, so that the study of HRV is recast in the study of fluctuations of the standardized RR series around the null-value. More precisely, we introduce or, in vectorial notation, where we defined The raw histograms for the standardized inter-beat intervals in the three classes of patients are shown in Fig. 2: notice that the frequency distributions exhibit heavy-tails.
We consider the points in the standardized RR series as random variables sampled by a hidden stochastic process, in such a way that the value of z n (i) at a given step n depends in principle on all the values {z m (i)} m<n taken in the previous steps m < n since the beginning of sampling. From this perspective, a meaningful observable to look at is the auto-correlation function at a distance τ , defined as where �z(i)� + = 1 N i −τ N i −τ n=1 z n (i) and �z(i)� − = 1 N i −τ N i n=τ z n (i) . Given the standardization over the whole segment [1, N i ] , as long as τ ≪ N i , we expect that �z(i)� + and �z(i)� − are both close to zero and shall be neglected in the following (indeed, we checked that this is the case, since �z(i)� + , �z(i)� − ∼ 10 −15 ÷ 10 −17 ). Then, the auto-correlation function we measure simply reduces to   www.nature.com/scientificreports/ Some examples of the autocorrelation function for patients of the three classes are reported in the right plot of Fig. 1, where we stress that the autocorrelation is non-null over a large τ window and its shape is patient-dependent . The last property stems from the 1/f noise shown by HRV which, in turns, yields to dominant contributions  in the low-frequency domain of the RR series.  Finally, we introduce a further average operation, this time on the sample of patients, namely, we define where class ∈ {H, AF, CD} and with " i ∈ class " we mean all the indices corresponding to patients belonging to a certain class. In the following we will consider the vectors z as random variables sampled from an unknown probability distribution P true class (z) , which we will estimate by the probability distribution P class (z) characterized by a minimal structure and such that its first and second moments are quantitatively comparable with E class (r) , E class (z 2 ) and E class (C(τ )) , respectively. Discussion on the model and on the inferential procedure. Our atomic variable is the sequence {z 1 , z 2 , . . . , z N } and, as anticipated above, we denote with P(z) the related probability distribution emerging from the inferential operations on the sample of experimental data. The Shannon entropy H[P(z)] associated to P(z) is According to the maximum entropy principle, we look for the distribution P(z) that maximizes H[P(z)] and such that its moments match those evaluated experimentally, in particular, here the we choose to apply the constraints on the one-point and two-points correlation function that is, E class (z) , E class (z 2 ) and E class (C(τ )) , respectively. To lighten the notation hereafter these moments shall be referred to simply as, respectively, µ (1) , µ (2) and C(τ ) , without specifying the class. In fact, the inferential procedure works analogously regardless of the class, the latter affecting only the quantitative value of the parameters occurring in P(z) . Constraints are set via Lagrange multipliers ( 0 , 1 , 2 , τ ) in such a way that the problem is recast in the maximization of the functional where integration is made over R N . Note that, while the derivation with respect to 1 , 2 and τ ensure, respectively, the agreement between the theory and the experiments at the two lowest orders, i.e. the temporal average µ (1) , the second moment µ (2) and the auto-correlation function C(τ ) , 0 guarantees that P(z) is normalized, so that P(z) is a probability distribution function. In the asymptotic limit of long sampling ( N → ∞ ) and under a stationarity hypothesis (see 4 for a similar treatment and the section dedicated to Methods for the proof), the solution of the extremization procedure, returning the probability of observing a certain sequence z , is given by where h and J(τ ) can be estimated from available data (vide infra). Here, P 0 is the N(0, 1) distribution and plays the role of prior for the variable z n , the parameter J(τ ) represents the pairwise interaction between elements at non-zero distance τ in the series (notice that each element occurs to be coupled to any other), and the parameter h represents the bias possibly affecting the single value in the sequence (and it is expected to be zero as we standardized the RR series). The factor Z plays here as a normalization constant, like the partition function in the statistical mechanics setting 10 . Notice that the interaction between two elements r n and r m ( n > m ) depends on the distance τ = n − m , but not on the particular couple considered. This stems from a"stationary hypothesis", meaning that one-point and two-point correlation functions calculated on a segment spanning O(τ ≪ N) E class (C(τ )) = 1 M class i∈class C(i, τ ). www.nature.com/scientificreports/ elements along the series are approximately the same and since the starting time of sampling is arbitrary, we get that J(n, m) = J(m − n). The standard inference setup for the model parameters is based on a Maximum (log-)Likelihood Estimation (MLE), i.e. the maximization of the function where D is the time-series database (of a given class) and where we made clear the dependence of P on the model parameters. However, such an approach requires the computation of the whole partition function Z, which is numerically hard in this case. Then, we chose to adopt as objective function for the inference procedure the pseudo-(log-)likelihood function 4 : that is, given L observations in a fixed time-series z , we maximize the conditional probability to observe the value z L+1 at the successive time step (i.e., we are introducing a cut-off in the interaction range). Further, we make two main modifications with respect to the standard pseudo-likelihood approach: (i) in order to use the entire available time-series in our database, we also adopt a window average procedure; (ii) we add regularization terms in order to prevent divergence for the model parameters. A detailed discussion is reported in Appendix 4.2. Our objective function is therefore given by where T is the largest τ we want to consider (namely T must be larger that the maximal decorrelation time), is the regularization weight and f (τ ) is a temporal regularizer that prevents the elements of J(τ ) to get too large for large τ (see Appendix 4.2 for a detailed description).
The inference method allows us to determine the values of the parameters J(τ ) and h as well as their uncertainties σ J(τ ) and σ h . As for the parameter h, due to series standardization, its value, evaluated over the different classes, is expected to be vanishing (this is indeed the case as it turns out to be h ∼ 10 −3 with a related uncertainty of the same order). As for the pairwise couplings, we find that for all the classes considered, J(τ ) is significantly non-zero only for relatively small values of τ , with a cutoff at T ∼ 10 2 , and, for a given τ < T , the coupling does not display a definite sign, that is, for pairs (z n , z n+τ ) and (z m , z m+τ ) at the same distance τ the related couplings can be of opposite signs. These results are shown in Fig. 3: in the left column we reported the inferred J(τ ) with the associated uncertainties for all τ , and in each panel in the right we reported the frequency distributions for the first values of τ as examples.
In order to study how decorrelation of RR intervals takes place, it is interesting to study how the interaction vector J(τ ) (regardless of its sign) vanishes as the delay time τ increases. We found that the long delay time behavior of the magnitude (i.e. disregarding the signs and oscillatory characteristics) of the interactions is welldescribed by a power law of the form We thus fitted the tails of the inferred J(τ ) with this trial function (the range of fitted points is chosen in order to maximize the adjusted R 2 score). In Table 1, we report best-fit parameters, the adjusted R 2 score and the reduced χ 2 . It is interesting to note that the scaling parameter β is around 1, meaning that, for each of the three classes, the leading behavior of the interactions at large τ is ∼ 1/τ (as we will see later, the same scaling also characterize the power spectral density in the Fourier domain, see Fig. 7). In the upper row of Fig. 4, we depicted with red circles the results of the inference procedure (once taken their absolute values), while the best fit of the general trend is represented with dashed black lines. In the lower row of the same figure, we also reported the residuals of the experimental values with respect to the best fit (normalized to the corresponding uncertainty). Apart from the first few points in the AF and CD cases, we see that the residuals are distributed in a range of at most 2σ J(τ ) (where σ J(τ ) is the standard deviation at each τ point), and, in particular, for sufficiently long τ (where oscillations are softened), experimental values are always contained in the range [−σ J(τ ) , σ J(τ ) ] , implying that the leading behavior of the delayed interaction is well-captured by ∼ 1/τ noise both in our datasets as well as in the model's prediction.
To summarize, here we highlight the main results.
• Couplings can be both positive and negative (see Fig. 3), defining the heart as a complex glassy system.
• The coupling magnitude decays in τ as a power-law whose leading order is ∼ 1/τ (see Fig. 4).
• The coupling magnitude displays a sharp scaling 1/τ solely in healthy patients, while for the remaining patients it display a bump in the short time-scales (see Fig. 3 and the residual plots in Fig. 4). • By fitting data via Eq. (16), we obtain estimates for the power-law exponent β (as reported in Table 1): interestingly, different classes are associated to different best-fit values of β , in such a way that classification of cardiac failures via HRV via this route seems possible.    = 1, 2, 3, 4 ). In both cases,the statistics consists in M = 500 different realizations of the J(τ ) which are realized by randomly extracting different mini-batches, each with size n = 20 . We stress that some frequency distributions present tails on negative values of J for some τ . This means that frustrated interactions are also allowed, implying that the system is fundamentally complex, i.e. a glassy hearth. Table 1. Best fit values regarding the scaling reported in Eq. (16). For each class, we report the best-fit parameters A and β , as well as the adjusted R 2 and the reduced χ 2 scores quantifying the fit goodness. For any class, we consider our estimate for J(τ ) , along with the estimate σ J(τ ) of its uncertainty, and we build the noisy estimate for J(τ ) , that is J (τ ) = J(τ ) + δJ(τ ) where δJ(τ ) = ησ J(τ ) and η is a N(0, 1) random variable. Next, taken a certain {z} , we convolve it with J (τ ) and this returns {z} . Of course, due to the initial standardization of the RR series, the inference procedure returned a vanishing bias parameter h, hence the synthetic series will also be centered at zero. However, a synthetic sequence is no longer standardized and this is done by hand.

Class
Then, it is natural to compare the synthetic sequences and the experimental ones. We generate a sample of data with the same size of the experimental data available, and we compute the empirical cumulative distribution function for both the experimental and synthetic data in order to compare them: results are reported in Fig. 5. In the first row, we directly compare the experimental (red solid line) and the synthetic (black dashed line) cumulative distributions highlighting an excellent agreement for all the classes. This is then corroborated by checking the probability plots in the same figure (second row): here, the red solid line shows the synthetic cumulative distribution versus experimental cumulative distribution, while the black dashed curve is the identity line. The green regions in the plot are confidence intervals with p = 0.95.
Next, we test whether the model is able to effectively capture correlation in the RR series, in particular by comparing experimental auto-correlation functions and their predicted counterparts. However, since autocorrelation functions are individual-dependent, starting from a single (randomly chosen) RR series we generate 100 synthetic series with different realizations of the J (τ ) according to the above mentioned procedure (see also App. 4.3). In this way, we can use our estimation of the uncertainties on J(τ ) in order to give a confidence interval for our predictions. In Fig. 6 we compare the auto-correlation functions for the experimental series and for the synthetic series; for the latter we also highlight the confidence interval with p = 0.68 . More precisely, we depict the experimental (red solid line) and theoretical (black dashed line) auto-correlation functions and see that the former always fall inside the confidence interval of the re-sampled series (the green region). Thus, we can conclude that our inferred minimal pairwise model is able to effectively capture the temporal autocorrelation in the RR series.
As a final comment, we also looked at the power spectrum density (PSD) of the provided datasets {z} that, as expected (see e.g., 12,13,21 ), displays the long tail 1/f (see Fig. (7), upper panel) and we made the following comparison: for all the patients, we evaluated its PSD and in the region [10 −4 , 10 −2 ] Heartbeat −1 we fit with a power-law where γ ∼ 1 and its value is taken as the x-coordinate of that patient in the lower panels of Fig. (7). The corresponding y-value is obtained by calculating the PSD over 100 synthetic RR-series generated by convolution with the empirical series playing as seed and using as value of J the one pertaining to the class the patient belongs to (H, AF, CD); results are in good agreement on the diagonal. www.nature.com/scientificreports/ conclusion Several past studies have highlighted that heart-rate fluctuations, in healthy individuals, exhibit the characteristic 1/f noise (see 12 and references therein). Deviations from this behavior can in fact be associated to cardiac pathologies such as atrial fibrillation or congestive heart failure 21 . In this work we tried to deepen the mechanisms possibly underlying this peculiar behavior, both in healthy and in compromised subjects.
To this aim we exploited inferential tools derived from statistical mechanics (i.e. the maximum entropy principle) to construct a probability distribution P(r) characterizing the occurrence of a RR series r . By requiring that P(r) is minimally structured (i.e., prescribing the maximum entropy) and that P(r) correctly matches the empirical first and second moments, we end up with a probabilistic model analogous to a spin-glass where quenched couplings J(τ ) among spins exhibit frustration and a power-law decay with the distance τ between spin pairs. This kind of system is known to display chaotic dynamics spread over several timescales and the 1/f noise. We thus speculate that the presence of competitive driving force are key features for the emergence of the  www.nature.com/scientificreports/ rich phenomenology displayed by heart-rate and we are naturally tempted to identify the two opposite driving forces with the sympathetic and parasympathetic systems.
We stress that our model for P(r) is robustly checked and that tests performed over different classes of patients (diplaying a different clinical status) allows us to candidate the exponent β controlling the coupling decay J(τ ) ∼ τ −β as an indicator for classify the patient clinical status.

Methods
This Section is devoted to the description of the statistical model and the optimization rules used in our inference procedure.
The two-body statistical mechanical model from maximum entropy principle. As stated in Sec.
2, the probabilistic model we use to frame the analysis of heart-rate variability contained in the standardized RR series {z n } N n=1 emerges as the solution of extremization procedure of the constrained Shannon entropy functional Here, we recall that the first term is the standard Shannon entropy for probability distribution for continuous variables, i.e.
while the other terms are constraints with Lagrangian multipliers 0 , 1 , 2 and τ . The extremization with respect to these parameters leads to the following conditions: dz P(z)z n z n+τ − (N − τ )C(τ ) .  www.nature.com/scientificreports/ i.e. that the function P(z) is a probability distribution and that the moments up to the second order are captured by the experimental temporal average µ (1) , the temporal standard deviation µ (2) and the auto-correlation function C(τ ) . The extremization with respect to the function P leads to the explicit form of the solution, i.e.
which can be rewritten as The constant in the latter equation is computed by using the normalization property of the probability distribution P(z) , and it is given by where we used the letter Z to make contact with the notion of partition function from the Thermodynamics dictionary. Since the model is essentially Gaussian, we can directly compute the partition function (at least, in formal way) as where E = (1, 1, . . . , 1) is a N-dimensional vector of ones and we defined the interaction matrix ( ) n,m = N τ =1 δ n,m−τ τ (which turns out to be an upper triangular Toeplitz matrix with zeros on the main diagonal). Because of this, the determinant of the kernel 2 I + is trivially det( 2 I + ) = N 2 . We can now determine the relation between the temporal average and standard deviation in terms of the model parameters. These relations read as where �·� th is the theoretical average as defined in Eq. 19.
Since z is temporally standardized, we directly get 1 = 0 and 2 = −1/2 . However, we left the former as a free parameter to be inferred and check a posteriori that it is consistent with zero. In order to get contact with Physics' dictionary, we rename the Lagrangian multipliers 1 = h and τ = J(τ ) , playing the role of external magnetic field and two-body interactions respectively. Then, the solution of the maximum entropy problem (after some rearrangements of the sum indices) is given by where P 0 (z) is the Gaussian distribution N(0, 1) and Z is the partition function.
The inference setup. The determination of the model parameters J(τ ) and h is based on a maximum likelihood approach. As stated in the model description, the usage of the full probability is computationally untractable (because of the high dimensionality of the integral in the partition function),thus we use a maximum pseudo-likelihood approach 4 (namely a tractable asymptotic correct estimator of the likelihood), in which the fundamental object to be maximized is the conditional probability that, given the observations {z n } L n=1 , the successive observation is equal to experimental data z L+1 : The second factor is easy to handle with, while the partition function can be evaluated by using the fact that dzP z|{z n } L n=1 = 1 , so that we finally have (19) dz P(z) = 1, 1 N N n=1 dz P(z)z n = µ (1) , 1 N N n=1 dz P(z)z 2 n = µ (2) , dz P(z)z n z n+τ = C(τ ), www.nature.com/scientificreports/ Notice that eqs. 27-28 are, in fact, approximations given that interactions between past (before L + 1 ) and future (after L + 1 ) events are neglected. Now, since the prior is Gaussian, we can directly integrate the denominator for carrying out a closed form for the conditional probability. Thus, we get The MLE is based on the maximization of this conditional probability, or equivalently of the pseudo log-likelihood, which is composed by quantities of the form where we discarded unessential constant terms. Since we would like to infer the first values of the delayed interaction vector J(τ ) and since the RR time-series have a size which is of the order of 10 5 , it is better to use a slidingwindow average approach, whose functioning is ensured by the stationary hypothesis. In this way, we can also perform a temporal average over all a single RR time-series. Supposing we want to infer the first T elements of the delayed interaction J(τ ) (i.e. we truncate long-term correlations) and given the time-series {z n } N n=1 of length N, we define the individual log-likelihood as In order to prevent the parameters to acquire large values, we also introduce some regularization. For the bias, we simply add a quadratic penalization term: R (h) = − h 2 /2 . Concerning the interaction vector, in order to discourage the algorithm to generate spurious correlation for high τ , we introduce a penalization which depends on the delay time τ , i.e. However, in order to ensure not to destroy correlation for interesting values of τ , we adopt a mild regularizer. In all of our tests, we found that a good choice is f (τ ) = log 2 (1 + τ ) . Putting all pieces together, we have the regularized individual pseudo log-likelihood The whole pseudo-likelihood is the average over the set D c time-series in each given class (where c ∈ {H, AF, CD} ): where M c is the number of examples in the class. By adopting a standard gradient descent (GD) approach, we can derive the following optimization rules: www.nature.com/scientificreports/ In order to speed up the inference procedure, we use a AdaGrad 39 adaptation method for the gradient descent rules (3536). Since we want a uncertainties estimation for the coupling matrix J(τ ) , we proceed in the following way: better than to realize a single delayed interaction for the whole database (for each class), we minimize the pseudo log-likelihood to M random subsets of cardinality n of the database of each class (i.e. the gradients are averaged with respect to this minibatch) and then let the inferential algorithm converge towards a fixed point. Then, we compute the mean values and the standard deviation with respect to this M realizations of the interaction vector J(τ ) . In our analysis, we choose M = 500 different inferential procedures for subset of cardinality n = 20.
Finally, we notice that the inferential approach followed here can also be seen as a (least squares) regression, where the regressors are all the past inter-peak durations, i.e., z L = τ J(τ )z L − τ + Gaussian noise . Along this line, other examples of statistical mechanics formulation include, for instance, a gaussian version of the Kinetic Ising Model (see e.g., [40][41][42][43][44] ).
Generation of synthetic data. Having an estimate for the interaction vector J(τ ) and for the bias h, we can use them to generate new synthetic sequences. From Eq. (29), we see that the conditional probability P(z = z L+1 |{z n } L n=1 ) is maximized if we take Therefore, once the first T components of J(τ ) and the bias parameter h are inferred, we can generate synthetic series {z n } as follows. First, we take the first T points in a given experimental time-series, then we convolve it with J(τ ) and add the bias, i.e.
Then, we can move the temporal window of length T along the whole experimental series in order to generate new points. Then, in general, the points in the synthetic time series are generated according to the rule We stress that, in our procedures, we used a stochastic version of the synthetic series generation, in which the inferred vector J(τ ) is replaced with its noisy version J (τ ) = J(τ ) + ησ J(τ ) with η is a N(0, 1) random variable. The synthetic series is then generated by collecting the N − T points z n .
Ethical statement. Informed consent was obtained from all subjects and related data have been treated in a completely anonymous form (in the full respect of the Declaration of Helsinki (1964) with the understanding and the consent of the human subjects involved in the study). APH and POLISA asked for explicit approval of the study by the responsible Ethical Committee: this approval was released to APH and POLISA on June 09 2016 by the Ethical Committee of Regione Marche (APH Hospital belongs to that region) and can be supplied upon request. All the methods were carried out in strick accordance with all the relative guidelines and regulations in Italy.