Detecting cardiac pathologies via machine learning on heart-rate variability time series and related markers

In this paper we develop statistical algorithms to infer possible cardiac pathologies from data collected through 24-h Holter recordings over a sample of 2829 labelled patients; labels specify whether a patient suffers from cardiac pathologies. In the first part of the work we statistically analyze the heart-beat series associated to each patient and process them to obtain a coarse-grained description of heart variability in terms of 49 markers well established in the reference community. These markers are then used as inputs for a multi-layer feed-forward neural network that we train to classify patients. Before training the network, however, preliminary operations are in order, to check the effective number of independent markers (via principal component analysis) and to achieve data augmentation (because of the limited size of the input data). With such groundwork, we finally train the network and show that it can classify with high accuracy (up to ~85% successful identifications) healthy patients versus those displaying atrial fibrillation or congestive heart failure. In the second part of the work, we again start from raw data and obtain a classification of pathologies in terms of their related networks: patients are associated to nodes and links are drawn according to a similarity measure between the related heart-beat series. We study the emergent properties of these networks, looking for features (e.g., degree, clustering, clique proliferation) able to robustly discriminate between networks built over healthy patients and over patients suffering from cardiac pathologies. Overall, we find very good agreement between the two routes.

As mentioned above, a particularly important application of machine learning in a healthcare context is (digital) diagnosis (see e.g. [10][11][12][13][14]). Machine learning models can detect patterns (precursors) of certain diseases within patients' electronic healthcare records and inform clinicians of any anomalies. Among the most successful examples so far, we mention its use in breast and skin cancer screening, in macular degeneration and diabetic retinopathy detection, and in distinguishing bacterial from viral pneumonia on chest X-rays (see e.g. [15][16][17]). Notably, in this context training is often impaired by the relative sparsity of pathological examples, whose statistics is, luckily, typically poorer than that of healthy examples. In this work we focus on data concerning heart activity and we aim to apply machine-learning tools to detect possible heart-related pathologies such as atrial fibrillation or congestive heart failure. The automatic prediction of pathological events from heart-activity data has been intensively investigated, especially in the last decade (see e.g. [18][19][20][21][22][23][24]): its relatively low cost and non-invasive nature make it particularly promising.
Our dataset is made of Holter recordings on labelled patients (labels distinguish the kind of pathology affecting the patient, if any). The underlying idea is that the heart-rate variability (HRV), namely the variability in the time interval between heartbeats (which, to some extent, is perfectly physiological), may reveal patterns that are typical of heart-related pathologies. It is worth recalling that the HRV is usually measured in terms of the variation in the beat-to-beat interval, and the common measure obtained from a Holter recording is the so-called RR variability (where R is a point corresponding to the peak of the Holter wave and RR is the interval between successive Rs, which in humans is ∼10³ ms). A preliminary statistical investigation of these raw data allows us to see that RR intervals display heavy-tailed distributions. A more convenient, coarse-grained description can then be achieved in terms of a set of markers widely used in the cardiology research community 25 . These are also suitable candidates as inputs for a machine-learning network. The examples available in our database are therefore used for the training and the validation of a multilayer feed-forward network, which turns out to be able to classify patients into four different categories: healthy patients (i.e., control group), patients suffering from atrial fibrillation, from congestive heart failure, or from other diseases. Beyond these AI-based methods for distinguishing between healthy and pathological heart-beats, there exist other successful methods based on statistical mechanics. In particular, we mention approaches focusing on correlations displayed by beat-to-beat fluctuations, such as the detrended fluctuation analysis [26][27][28][29] , and on multifractality in heartbeat interval time series 30,31 .
Within this wider scenario, in the second part of this paper, we look for consistency following a totally different investigation route based on network theory [32][33][34][35] . More precisely, we assign to each RR-series a node in a network and we tie nodes together according to a similarity measure between series. Then, we show that networks built on healthy patients (i.e., control group), on patients suffering from atrial fibrillation, or from congestive heart failure exhibit different topological features. Overall, the results of these two routes are in very good agreement, and the potential shown by these kinds of approaches should motivate the establishment of suitable repositories and clouds aimed at an accurate training of these networks and algorithms.

Results
Data description: from time series to markers. This research has been accomplished as part of the project MATCH (Mathematical Advanced Tools to Catch Heart-rate-variability), a scientific cooperation among Ascoli-Piceno Hospital (APH), International Polytechnic "Scientia et Ars" (POLISA), University of Salento, and University of Calabria. The initial database used in the present analysis consists of nominal 24-h Holter recordings of M = 2829 patients hospitalized in APH, whose data were managed at POLISA, in the period 2016-2019. Patients are divided into 6 main classes: healthy (H), suffering from atrial fibrillation (AF), from congestive heart failure (i.e. cardiac decompensation, CD), from diabetes (DIAB), from hypo- or hyperthyroidism (TIR), or from hypertension (TENS). Following Holter recordings, each patient is associated to an RR time-series, namely a series of temporal intervals between two consecutive heart-beats. For instance, denoting with t_n the time of the n-th beat and with L the total number of RR intervals in a given time-series, the n-th RR-value reads as RR_n = t_{n+1} − t_n, with n = 1, …, L. Examples of these series for the different classes are reported in Fig. 1. Another possible measure for heart-rate variability is given by the Beats-Per-Minute (BPM) sequence, which can be obtained by counting how many RR intervals occur in a single minute. As a preliminary statistical insight, we look at the distributions of RR and BPM values. By merging data pertaining to patients belonging to the same class, we obtain the box-plots shown in Fig. 2. For each class, the median value (over all related patients) is denoted by the blue vertical line. The boxes extend from the lower to the upper quartile values, while the outer bars (whiskers) extend from the lowest to the highest non-outlier data (we recall that outlier points are observations falling outside the interval [Q1 − 1.5(Q3 − Q1), Q3 + 1.5(Q3 − Q1)], where Q1 and Q3 are, respectively, the lower and upper quartiles).
For both RR and BPM data the structure of the box-plots looks quite similar in all subclasses, with outliers especially falling on the right side, suggesting that the underlying distributions are right-skewed with a heavy tail.
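For concreteness, the derivation of the RR series and of the per-minute BPM sequence from a list of beat times can be sketched as follows. This is a hypothetical numpy helper, not the project's actual pipeline; beat timestamps are assumed to be in milliseconds.

```python
import numpy as np

def rr_and_bpm(beat_times_ms):
    """Derive the RR series (ms) and the per-minute BPM sequence from a
    list of beat timestamps in milliseconds (illustrative helper)."""
    t = np.asarray(beat_times_ms, dtype=float)
    rr = np.diff(t)                              # RR_n = t_{n+1} - t_n
    # count how many RR intervals end within each one-minute window
    minutes = ((t[1:] - t[0]) // 60000).astype(int)
    bpm = np.bincount(minutes)
    return rr, bpm

# toy example: a perfectly regular 1000-ms beat over two minutes
rr, bpm = rr_and_bpm(np.arange(0, 121000, 1000))
```

On real Holter data the RR series is of course irregular, and it is precisely this irregularity (the HRV) that the markers below summarize.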
Beyond this description based on raw data, we can obtain a coarser one based on a set of 49 markers indicated by the European Society of Cardiology and the North American Society of Pacing and Electrophysiology (see e.g. 25 ) to summarise the HRV within the RR and BPM series collected during a Holter recording (see also 18 ). These markers are of different nature, as briefly summarized hereafter (the exhaustive list is reported in Appendix A):
• Linear markers pertaining to the temporal domain, such as the mean value and the standard deviation of RR and of BPM data, the number of successive RR intervals that exceed a certain threshold, etc.;
• Linear markers pertaining to the frequency domain, such as the frequency peaks and the absolute/relative/normalized powers in, respectively, the very low frequencies (VLF, i.e., f ≤ 0.04 Hz), the low frequencies (LF, i.e., f ∈ (0.04, 0.15] Hz), and the high frequencies (HF, i.e., f ∈ (0.15, 0.4] Hz);
• Non-linear markers, such as those extracted from the Poincaré plot (e.g., the standard deviations along and orthogonal to the identity line).
As mentioned before, moving from a description in terms of the RR or BPM sequences to a description in terms of markers implies a coarsening and, as a consequence, the clinical picture of the n-th patient, with n = 1, …, M, is now simply represented by a vector x_n = (x_n^(1), …, x_n^(49)), where each entry x_n^(i), with i = 1, …, 49, is a scalar quantity. For any (i, n), we can compute the marker value x_n^(i) from the raw RR time-series by means of Matlab-based software 36 . Then, the average and the standard deviation of each marker (over the whole population making up the database) follow, and are reported in Table 3 of Appendix A. Despite the coarsening applied, the space R^49 still exhibits a relatively high dimension, which makes inference rather challenging. However, one can see that the 49 markers considered are not uncorrelated with each other. In fact, some markers present trivial relations, as, for instance, the average of the RR intervals and the average of the BPM. Therefore, it is convenient to preliminarily study the correlations between markers in order to drop redundant ones, while preserving the whole information acquired. From a machine-learning point of view, this analysis has the benefit of sensibly decreasing the number of free parameters to be tuned in the training procedure, so that over-fitting risks are effectively reduced. The correlation analysis is performed in the next Section.
Correlation analysis and dimensionality reduction. The simplest quantifier for the correlation between marker i and marker j (with i, j = 1, …, 49) is the Pearson correlation coefficient C_ij = cov_ij / (σ_i σ_j), where cov_ij and σ_i² are, respectively, the sample covariance and variance. Since we want to unveil (linearly) dependent markers, we do not care about the sign of the correlation but just look at the absolute value of the Pearson correlation coefficient, which is graphically represented in Fig. 3 for all the 49 × 49 possible pairs. By inspecting this plot, we see that there exists a non-empty set of mutually correlated variables: we report in Table 1 the marker couples (i, j) whose Pearson correlation coefficient satisfies |C_ij| ≥ 0.990. Examples of scatter plots for these highly-correlated markers are reported in Fig. 4. As remarked above, many of these correlations are somewhat trivial; for instance, this is the case for quantities in the frequency domain which are computed both with FFT-based and with autoregressive methods. Since there is a negligible information loss if we discard one of two highly correlated markers, as a result of the analysis performed in this section we can reduce the dimensionality of the marker space. In particular, we choose to discard eight markers, among them: #22 (normalized power of the LF band evaluated with FFT-based methods), #27 (relative power of the VLF band evaluated with autoregressive methods), #29 (absolute power of the LF band evaluated with autoregressive methods), #31 (normalized power of the LF band evaluated with autoregressive methods), #35 (normalized power of the HF band evaluated with autoregressive methods), and #38 (standard deviation of the Poincaré plot in the direction orthogonal to the identity line).
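The pruning criterion above can be sketched in a few lines of numpy. This is a minimal illustration of the "keep one marker out of each highly correlated pair" rule, not the authors' code; the greedy choice of which marker to keep is an assumption.

```python
import numpy as np

def prune_correlated(X, threshold=0.99):
    """Given an (M patients x K markers) matrix X, flag for removal one
    marker out of each pair with |Pearson C_ij| >= threshold.
    Minimal sketch of the pruning criterion described in the text."""
    C = np.corrcoef(X, rowvar=False)             # K x K Pearson matrix
    K = C.shape[0]
    drop = set()
    for i in range(K):
        for j in range(i + 1, K):
            if i not in drop and j not in drop and abs(C[i, j]) >= threshold:
                drop.add(j)                      # keep marker i, discard j
    keep = [i for i in range(K) if i not in drop]
    return keep, sorted(drop)

# toy example: marker 1 is (up to sign and scale) a copy of marker 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = -2.0 * X[:, 0]                         # perfectly anti-correlated
keep, drop = prune_correlated(X, 0.99)
```

Note that the absolute value makes the rule insensitive to the sign of the correlation, as stated in the text.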
Classification tasks by feed-forward neural networks. One of the goals of the current work is to evaluate whether observations based on Holter recording allow an automatic inference about the presence of pathologies like atrial fibrillation, congestive heart failure, diabetes, hypo- or hyperthyroidism, hypertension, etc. This task can be approached from several perspectives, from classical statistical-inference methods to machine-learning techniques. In any case, having large samples is a necessary condition for meaningful outcomes. As mentioned in Sec. 2.1, our database is split into 6 main classes (H, AF, CD, DIAB, TIR, TENS), of which only three (i.e., H, AF, CD) display a relatively large size. We therefore look for a tradeoff between statistical soundness and refinement in the emerging classification: in the following analysis, in order to avoid sparse classes, we slightly rearrange the initial database to generate two data-sets optimized for training the classification of AF and CD patients solely, corresponding to (see Table 2)
• clinical data (markers) for healthy patients (H), patients suffering from atrial fibrillation (AF), and patients suffering from other (not specified) diseases (O);
• clinical data (markers) for healthy patients (H), patients suffering from congestive heart failure (CD), and patients suffering from other (not specified) diseases (O).
Note that the healthy patients in the two databases coincide, while the O class of the first database partially overlaps with the CD class of the second one (and, likewise, the O class of the second database may contain AF patients too). Since we analyze the two databases independently, this is not a problem; rather, it allows us to double-check after classification (e.g., a patient targeted as AF in the first database should belong to class O in the second database, and a patient classified as CD in the second database should belong to class O in the first database; breaking this rule would reveal a faulty classification by the neural network).
Before proceeding, we perform a couple of tests to see if the rearranged dataset allows for some kind of data clusterization which may encourage further investigations via machine learning tools.
First, we look at the joint distribution P_class(x^(i), x^(j)), where "class" can be H, AF, O (or H, CD, O), which is obtained by counting the number of patients belonging to the class considered and displaying value x^(i) for marker i and value x^(j) for marker j. In particular, we look at the joint probability by distinguishing between the classes H/AF/O and check whether some clusterization occurs in some 2-dimensional plane of the marker space (which can be a useful prerequisite for obtaining a meaningful classification of the patients). Indeed, this turns out to be the case, as shown in Fig. 5, where one can see that the projections onto given planes in the high-dimensional data space of the joint distribution evaluated for the AF class display a clearly different clusterization with respect to the H and O classes.
A similar clusterization of AF patients can also be visualized in the space of the principal components. In particular, in Fig. 6 we show the scatter plot in the space spanned by the first four principal components, which overall encode 75% of the variability contained in the (standardized) marker data. From these plots, we can see that the populations of H and O patients form a wide cloud centered on the origin of the plane and distributed over a wide region, while AF patients tend to concentrate far from the origin, a feature that is particularly clear in the pc1 vs pc4 and pc2 vs pc3 scatter plots.
Enlarging the size of the dataset via Principal Component Analysis perturbation. Data augmentation is a popular and consolidated technique that allows improving classification and generalization in (deep) neural networks, whose training requires massive databases not always immediately available (as in the present case, dealing with RR series of patients affected by particular diseases), see e.g. 37 . Indeed, its purpose is to synthesize new examples following the original data distribution; note that, since such an enriched database for training the machine can improve its generalization capabilities, data augmentation can be seen as an implicit and effective regularization. Data augmentation has already shown remarkable results in cross-cutting fields such as image processing 38,39 and speech recognition 40,41 .
In general, in machine-learning applications, we divide the database in two parts. The first one, referred to as the training set, contains the larger number of examples and is used to infer the network parameters, while the second one, referred to as the validation set, is used to estimate the prediction performance of the model. Despite the encouraging results obtained from a statistical perspective in the previous sections, our database is rather small for a sound machine-learning approach to the classification problem: techniques of data augmentation are needed in this case. The one adopted here is based on the work 42 and consists in augmenting the data by introducing some noise in the principal components. More precisely, the augmentation algorithm is the following: 1. For each point in the dataset, multiply the first principal component pc1 by a factor α drawn from a uniform distribution with support (1 − ε/2, 1 + ε/2), with 0 < ε < 1. For each data point, this operation can be performed several times; 2. Propagate the new points back to the original data space (according to the associated rotation matrix). This procedure introduces some noise, but still preserves the variability structure of the data. This fact can be visualized, for instance, by comparing the histograms of each marker before and after the data augmentation, as reported in Fig. 7. The augmentation is performed by choosing the perturbation amplitude ε = 0.1 and generating 20 new data points for each example in the original set. As a result, the training set is augmented from 2300 initial patients to 46000 data points. As one can see from Fig. 7, the histograms of the markers before and after augmentation nicely overlap.
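The two-step augmentation procedure can be sketched as follows, with the principal axes obtained via an SVD of the centred data. This is a minimal sketch of the idea, not the authors' implementation, which may differ in detail (e.g., in the standardization applied before the PCA).

```python
import numpy as np

def pca_augment(X, eps=0.1, copies=20, rng=None):
    """PCA-based augmentation: rescale each point's first principal
    component by a factor drawn uniformly from (1 - eps/2, 1 + eps/2),
    then rotate back to the original data space (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    mu = X.mean(axis=0)
    # principal axes via SVD of the centred data matrix
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    scores = (X - mu) @ Vt.T                     # coordinates in PC space
    out = []
    for _ in range(copies):
        s = scores.copy()
        alpha = rng.uniform(1 - eps / 2, 1 + eps / 2, size=len(X))
        s[:, 0] *= alpha                         # perturb pc1 only
        out.append(s @ Vt + mu)                  # back to marker space
    return np.vstack(out)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                    # 100 "patients", 5 "markers"
X_aug = pca_augment(X, eps=0.1, copies=20, rng=2)
```

Since α has unit mean and the perturbation acts only along the first principal axis, the marker-wise histograms of the augmented set remain close to the original ones, as observed in Fig. 7.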
Neural network design and analysis. Given the results of the previous sections, we are now able to design neural-network models for the classification problem. Particularly inspiring to this aim are the joint-probability plots in Fig. 5, from which we find that there is a certain degree of separation between individuals with atrial fibrillation and the remaining background (i.e., healthy people or patients with some other pathology). Therefore, we start by developing an AF/NAF (Atrial Fibrillation vs Not Atrial Fibrillation) classifier to separate these two possibilities.
Then, analogous classifiers for H/NH (Healthy vs Not Healthy) and CD/NCD (Cardiac Decompensation vs Not Cardiac Decompensation) are also designed, in such a way that the whole classifier is composed of these three building blocks, as reported in Fig. 8. The main advantage of such an architecture is that each single classifier can be trained separately and in parallel with respect to the others (moreover, this also allows one to refine the classifiers for specific tasks whenever their performance is low). This modular scheme considerably reduces the computation time needed to find an optimal tuning of the parameters with respect to a monolithic architecture.
Training, generalization and classification performances. In this subsection we describe the functioning of each of the three blocks making up our model.
The neural model is realized using the Keras framework in Python; the hardware hosting the neural network is a double cluster composed of 16 CPUs (all cores working at a clock frequency of 3.0 GHz) handling 216 GPUs and equipped with 32 GB of RAM per cluster. Inputs (the markers describing the status of a given patient) are sent to the neural network, which is composed of three hidden layers (made up of 256, 512, and 1024 exponential linear units, respectively; see Figs. 9 and 10). At each layer input, a Gaussian dropout (with value 0.2) operation is performed in order to avoid overfitting 43 . The signal outgoing from the third hidden layer is finally subjected to Batch Normalization 44 and then sent to the two output soft-max 45 neurons (see caption of Fig. 9 for more details). We found experimentally that such a choice of network architecture and neuron activation function is a good compromise between generalization performance and training time. Figure 8 reports the overall architecture of the classifier developed in this work. From left to right: first, for a given example n, we evaluate the entries of the vector x_n and use them as input for the machine. The number of entries in the input vector is N = 41. This input is then simultaneously passed to the H/NH block, which evaluates whether the example corresponds to a healthy unit or not, to the AF/NAF block, which evaluates whether the example corresponds to a unit displaying atrial fibrillation or not, and to the CD/NCD block, which evaluates whether the example corresponds to a unit displaying congestive heart failure or not. The outcomes stemming from this layer are compared, checking for consistency; if this test is passed, the outer layer provides the classification.
The loss function L is the categorical cross-entropy, L = −Σ_i t_i log f(s_i), where s_i is the stimulus acting on the i-th soft-max neuron and f(s_i) is the related outcome to be compared with the true value t_i; minimization of L is achieved via a stochastic gradient descent method (with both momentum 46 and Nesterov 47,48 acceleration methods). As we work with labeled databases, the training stage is fully supervised, and it is split in two stages as explained hereafter. In the first stage, we present to the network a large and noisy version of the database (i.e., the one produced with the PCA-based augmentation criterion). This is the pre-training stage, in which the network gets prepared, typically settling in a configuration close to the optimal one (i.e., the one related to the global minimum of the loss-function landscape). Each pre-training stage is composed of 50 epochs, each epoch handling a mini-batch of 2000 examples on which the gradient is computed and averaged. After that, the network is trained with the real database for 700 epochs with mini-batches of 300 examples each.
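The soft-max outcome, the categorical cross-entropy and a single momentum-SGD update can be sketched in numpy on a toy linear model. This is illustrative only: the paper's classifier is a three-hidden-layer Keras network, and the names and sizes below are hypothetical.

```python
import numpy as np

def softmax(s):
    """Soft-max outcome f(s_i) for a vector of stimuli s."""
    e = np.exp(s - s.max())                      # shift for numerical stability
    return e / e.sum()

def cross_entropy(f, t):
    """Categorical cross-entropy between outcome f and one-hot target t."""
    return -float(np.sum(t * np.log(f + 1e-12)))

# one momentum-SGD step on a toy linear model with two soft-max outputs
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))           # weights: 2 outputs, 3 inputs
v = np.zeros_like(W)                             # momentum buffer
x = np.array([1.0, -0.5, 2.0])                   # toy input (3 "markers")
t = np.array([1.0, 0.0])                         # one-hot true label
lr, mom = 0.1, 0.9

f = softmax(W @ x)
loss_before = cross_entropy(f, t)
grad = np.outer(f - t, x)                        # exact gradient for soft-max + CE
v = mom * v - lr * grad                          # momentum update
W = W + v
loss_after = cross_entropy(softmax(W @ x), t)
```

The gradient (f − t) ⊗ x is the well-known closed form for the composition of soft-max and cross-entropy, which is what makes this pairing convenient for classification.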
After each epoch we evaluate the network performance in terms of the categorical accuracy, measured as the fraction of examples correctly classified by the network; the adjective "categorical" refers to the binarization of the network output: soft-max neurons actually provide an estimate for the probability of each class (e.g., H versus NH, see Fig. 9) and the class finally selected is simply the most probable one. Notice that, after each epoch, accuracy is measured over the training set as well as over the validation set. The evolution of accuracy and loss over the epochs, for the H/NH, AF/NAF, and CD/NCD classifiers, is shown in Fig. 11. In general, our learning procedure gives good performance, with accuracy around 0.8-0.9 for all classification tasks. In particular, the training and the validation accuracies have a monotonic trend within the considered learning time, and the former is always below the latter (this is a known effect of dropout regularization, since during training the network deals with an incomplete representation of the data).
Classification tasks via algorithmic network theory. Another route to heart-failure classification can be paved by dealing directly with the RR series, exploiting algorithms from network theory: these are particularly suitable [33][34][35]49 as they allow introducing novel classes of network-based markers (e.g., degree centrality, maximal degree, clustering coefficient, betweenness centrality, reciprocity and cliques, vide infra).
The underlying idea is to consider a sample of N patients within each of the highlighted categories (i.e. H, AF, CD) and associate to each of them a node of a graph G_H,AF,CD; then, for all the possible couples of nodes within this graph, we measure the similarity between the related time series. The similarity between the RR series r_i and r_j, corresponding to nodes labelled i and j respectively, provides the weight associated to the link connecting i and j. Of course, for any choice of the sample of patients we obtain a different realization of the graphs G_H,AF,CD, and we are interested in any characteristic feature able to discriminate among the three classes that is robust with respect to the sampling.
Graph realization via dynamic time warping. Similarity between patient RR-series can be defined in various ways, but the key point is that two RR series are similar if they have a comparable structure. Probably the easiest choice would be the Euclidean distance between points of the two time series that occur at the same time. This is a good metric for similarity if both time series are in sync and move at exactly the same speed (i.e., all similar events in both time series occur at exactly the same time). However, when the series are out of sync this turns out to be a bad choice. In fact, in this case similar points in the two series can be stretched apart in time, and the Euclidean distance then gets larger, wrongly suggesting that the series are becoming less similar.
To overcome this issue, we adopt the so-called "dynamic time warping" (DTW) 50,51 . This is an algorithm used to measure the similarity between two sequences which may vary in time or speed. Roughly, it works by finding an optimal alignment between the two series: each point of one series is matched to one or more points of the other, with matches constrained to preserve the temporal ordering, and the DTW distance is the minimal cumulative point-wise distance over all such alignments. Software designed to evaluate this distance often implements some optimizations of the algorithm in order to contain the computation time (see e.g. 52 ).
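The standard dynamic-programming formulation of DTW can be written down in a few lines. This is a minimal textbook sketch for 1-d series, without the speed-ups (windowing, lower bounds) implemented in dedicated libraries.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming DTW distance
    between two 1-d series (textbook sketch)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # match, insertion or deletion: keep the cheapest alignment
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

s = np.sin(np.linspace(0, 6, 50))
# a peak shifted by one step still aligns at zero cost, unlike the
# point-wise Euclidean distance
d = dtw_distance(np.array([0, 0, 1, 0]), np.array([0, 1, 0, 0]))
```

The out-of-sync example above is exactly the failure mode of the Euclidean metric discussed in the text: the shifted peak is penalized by the point-wise distance but not by DTW.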
As anticipated, according to the DTW similarity measure we derive a weight w_ij between any pair of nodes. This is used to generate a fully-connected, symmetric weighted graph, where the weight associated to the link between nodes i and j is simply w_ij.
En route to the adjacency matrix, we proceed with our analysis by applying an operation f_k : R^+ → {0, 1}, which makes the network un-weighted. A possible choice, determined by a parameter k < N, is given by (A_k)_ij = 1 if j ∈ N_k(i) and (A_k)_ij = 0 otherwise, where N_k(i) represents the set including the k nodes most similar to i; that is, j is a nearest neighbour of i if w_ij is among the k largest values in {w_il}_{l ≠ i}. The operation f_k defines the adjacency matrix A_k of the resulting unweighted graph; its diagonal element (A_k)_ii is set to 0, as we do not allow for self-loops. Notice that f_k does not preserve the symmetry, namely, in general, (A_k)_ij ≠ (A_k)_ji. Before proceeding, we stress that, since we are now dealing with raw data, due to computational constraints hereafter we focus solely on graphs made of O(100) nodes, examples of which are reported in Fig. 12.
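The binarization step f_k can be sketched as follows; the weight matrix here is a random placeholder standing in for the DTW-derived similarities.

```python
import numpy as np

def knn_adjacency(W, k):
    """Apply the operation f_k: keep, for each node i, directed links to
    the k nodes with the largest similarity w_ij, with no self-loops
    (sketch of the graph-binarization step described in the text)."""
    N = W.shape[0]
    Wc = W.astype(float).copy()
    np.fill_diagonal(Wc, -np.inf)                # exclude self-loops
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        nn = np.argsort(Wc[i])[-k:]              # k nodes most similar to i
        A[i, nn] = 1
    return A

rng = np.random.default_rng(0)
W = rng.random((30, 30))
W = (W + W.T) / 2                                # symmetric similarity weights
A = knn_adjacency(W, k=5)
```

By construction every row of A sums to k (fixed out-degree), while the columns (in-degrees) vary, and A is in general not symmetric even though W is: this is exactly the asymmetry exploited in the degree and reciprocity analyses below.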
Degree distributions. Since the graphs G_H,AF,CD are directed (see 2.9), we need to distinguish between the out-degree d_i^out = Σ_j (A_k)_ij and the in-degree d_i^in = Σ_j (A_k)_ji. In other terms, the out-degree of node i is the number of links stemming from node i, while the in-degree is the number of links pointing to node i. The former is, by construction, equal to k, while the latter can vary (although its average is still equal to k): the in-degree of node i is large if the related RR series is particularly similar to a large number of other series. The distribution of the in-degree is shown in Fig. 13, for several choices of O(100)-node graphs. For low values of k (i.e. k = 5) we see that, for both the H and AF cases, the majority of nodes display a rather low in-degree, and the in-degree distribution exhibits a long tail. Conversely, for the CD case, the in-degree is rather homogeneous up to d_in ∼ 10, beyond which the tendency of nodes to acquire more links is softened (apart from a tiny fraction of nodes with d_in ∼ 20-25). By increasing k, nodes with low degree get fewer, but still the most predominant, for both H and AF patients, while new peaks appear in the distribution (see k = 10), highlighting a change of the topological structure of the network (see also the CD case). This structural change is clear for higher values of k (i.e. k = 20), especially for H and CD, for which low-linked nodes are fewer, with the majority of nodes presenting an in-degree comparable with k. Apart from the largest value of k (where the graph tends to be fully connected), in all the other cases the average maximal degree seems able to separate AF from CD patients, allowing for a proper classification.
(Figure 14, right panel: distributions of the degree standard deviations for k ∈ {5, 10, 20, 30}; notice that, since for a given choice of k the average degree evaluated on different realizations of G_H,AF,CD is constant and equal to k, the standard deviation corresponds, up to the constant k, to the coefficient of variation.)
(Figure 15, left panel: for all values of k, the global clustering coefficient also seems able to discriminate among the various classes; in particular, AF patients systematically share lower clustering, while CD patients share higher values.)
(Figure 15, right panel: distributions of the GR for k ∈ {5, 10, 20, 30}; while the larger k, the more pronounced the overlap between H and CD patients, the classification of AF patients remains robust.)
In general, a broad in-degree distribution can be related to networks where the similarity relation between nodes, as established by Eq. 2.8, is far from symmetric. This seems to be the case especially for the networks G_AF concerning AF patients.
Besides the degree distributions, we also study the maximum degree of each sample (see Fig. 14, left panel) and the degree standard deviations (see Fig. 14, right panel). As for the maximum in-degree, we see that for low values of k its distribution for H patients is broad; for AF patients there are two distinct peaks (for d_max ≲ 15 and d_max ∼ 20-25) that tend to merge as k increases; while for CD patients the distribution always presents the same shape, but with a different mean value; for k ∼ N/2, in all of the three cases the most linked nodes are connected to all of the others in the graph. Finally, as for the degree standard deviations, recalling that for a given choice of k the average in-degree is constant and equal to k, we can see that, as expected, the broadness of the degree distribution decreases with k and that G_AF (resp. G_CD) displays the largest (resp. lowest) broadness. This further suggests that AF patients are relatively heterogeneous.
Clustering, reciprocities and cliques. The global clustering coefficient (GC) is a standard quantity in graph theory: it measures the fraction of closed triplets over the total number of triplets (both open and closed), that is, the likelihood that two neighbors of a node are neighbors themselves. For directed graphs, as those under study, it is convenient to define the GC as the fraction of closed paths of length three over the total number of paths of the same length, where paths can only be taken in the allowed directions, that is, GC = (#allowed closed paths of length three) / (#allowed paths of length three). (2.12)
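One common way to make this definition operational counts the directed length-2 paths that are closed into a directed 3-cycle by a third edge. This is one of several conventions for directed clustering; the authors' exact counting may differ in detail.

```python
import numpy as np

def directed_global_clustering(A):
    """Fraction of directed length-2 paths (i -> k -> j, i != j) that are
    closed by an edge j -> i into a directed cycle of length three.
    One possible convention for the directed GC; sketch only."""
    A = np.asarray(A)
    A2 = A @ A
    paths2 = A2.sum() - np.trace(A2)             # 2-paths with i != j
    closed = np.trace(A @ A2)                    # each 3-cycle counted 3 times
    return float(closed / paths2) if paths2 else 0.0

# toy check: a single directed 3-cycle is perfectly clustered
C3 = np.array([[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]])
```

A directed chain with no closing edge, by contrast, has clustering zero, which matches the intuition that the GC measures how often neighbourhoods close on themselves.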
The results for this quantifier are reported in Fig. 15 (left panel), where we show the histograms of the GC measured over 1000 different realizations of O(100)-node graphs per class. As one can see, the distribution of the global clustering coefficient is clearly different for the three populations (especially for low and high values of k). In particular, the GC measured in G_H and G_AF is on average smaller than that obtained for G_CD. This suggests that, in the former cases, it is more likely for nodes to link with individuals across the whole network rather than within a restricted neighborhood, implying that the population is heterogeneous (in agreement with the previous analysis of the degree distribution).
We now move to graph reciprocity (GR), which quantifies the reciprocal linkage of nodes in directed graphs: it is defined as the fraction of mutual (i.e. bidirectional, i → j and j → i) links over the total number of edges in the graph, i.e. GR = (# reciprocal links) / (# links). (2.13)
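Eq. (2.13) can be computed directly from the edge list of a directed graph; the snippet below is a minimal sketch of our own (function name and toy edges are illustrative, not taken from the paper):

```python
def graph_reciprocity(edges):
    """GR of Eq. (2.13): fraction of directed links (i, j) whose
    reverse (j, i) is also present, over the total number of links."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    reciprocal = sum(1 for (i, j) in edge_set if (j, i) in edge_set)
    return reciprocal / len(edge_set)

# Nodes 0 and 1 are mutually linked; the link 1 -> 2 is not reciprocated,
# so 2 of the 3 directed links are reciprocal.
print(graph_reciprocity([(0, 1), (1, 0), (1, 2)]))  # 0.666...
```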
Results for its distribution over the analyzed data-sets are reported in Fig. 15 (right panel), where it emerges that, for all values of k, AF patients always display a low degree of reciprocity, suggesting that in the corresponding adjacency matrix, for a large fraction of couples (i, j), only one of the two directed links is present, i.e. nodes tend to link with other nodes rather than forming reciprocal bridges. On the contrary, H and CD patients tend to link reciprocally in a similar way.
Another interesting approach to determine emergent properties of a network is community detection. More precisely, in this kind of analysis one aims to figure out the existence of groups of nodes, also called communities or clusters, displaying many edges joining nodes in the same group and comparatively few edges joining nodes of different groups. Detecting communities in large networks can be a hard problem and many algorithms have been proposed over the years (see e.g. 53 ); here, communities are detected with clique percolation methods 54,55 . An example of community detection in G_H is reported in Fig. 16 (left panel). Further, the mean global clustering coefficient is measured in the various communities detected in 1000 different realizations of O(100)-node G_H, G_AF and G_CD, and the related distributions are reported in Fig. 16 (right panel). Also from this perspective, we see that AF patients present a generally lower value of the mean community clustering coefficient, suggesting that neighbors in a given community do not tend to cluster among themselves.
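For reference, clique percolation is available in networkx for undirected graphs (applying it to the directed graphs under study would require first projecting them onto undirected ones, which is an assumption on our part). A minimal sketch on a toy graph of our own:

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy undirected graph: two triangles {0,1,2} and {1,2,3} sharing the
# edge (1, 2), plus an isolated edge 4-5 that belongs to no 3-clique.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (4, 5)])

# k-clique percolation with k = 3: adjacent 3-cliques (i.e. cliques
# sharing k - 1 = 2 nodes) are merged into a single community.
communities = [set(c) for c in k_clique_communities(G, 3)]
print(communities)  # [{0, 1, 2, 3}]
```

The edge 4-5 is left out of every community, which is characteristic of clique percolation: nodes not belonging to any k-clique remain unclassified rather than being forced into a cluster.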

Discussion
The goal of the present study is the development of neural network models and machine learning algorithms that, given RR series as input, are able to discriminate between healthy and cardiopathic (atrial fibrillation or congestive heart failure) individuals. In particular, in the first part of the work, this task is framed as a multi-label classification problem tackled by a feed-forward neural network, with four possible outcomes: healthy (H), atrial fibrillation (AF), congestive heart failure (i.e., congestive decompensation, CD), and other, not specified, pathologies (O). The classification is achieved on the basis of cardiological analysis: for each patient a Holter recording over a suitable time span of 24 h is available, from which standard clinical markers have been evaluated, resulting in a coarse-grained database containing the status of all the patients in terms of the values of these markers. Furthermore, in such a database, each patient is also provided with a label specifying the pathology (if any) they are suffering from, allowing for supervised training of the network: the machinery developed turned out to successfully classify patients up to an accuracy of ∼80÷85%. It should be stressed that accuracy and, in general, the network performance could be further improved with a larger data-set. In the present case, the lack of a sufficiently large data-set could be overcome with larger and larger on-line repositories where real data from routinely screened patients can be stored. This kind of policy has already been applied in several diagnostic fields (see e.g. 56 ).
In the second part of the work, we investigated another possible route for disease classification, based on network theory. More specifically, starting from the raw RR series and keeping the analysis independently split into the various classes of healthy patients, atrial fibrillation patients and congestive heart failure patients, we built class-related graphs by a standard similarity measure (dynamic time warping) and then we inspected the emergent properties of these networks by studying standard topological features in network theory (e.g. degree distribution, global and local clustering coefficients, reciprocity and clique proliferation). Remarkably, even this route turned out to be successful in cardiac pathology classification, hence providing a complementary route to this purpose.
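Dynamic time warping, the similarity measure underlying the graph construction, finds the optimal monotonic alignment between two series and sums the local mismatches along it. A textbook dynamic-programming sketch (our own, with the absolute difference as local cost, which is an assumption rather than the paper's stated choice) reads:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series,
    using |a_i - b_j| as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed alignment moves:
            # insertion, deletion, or match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A locally stretched copy of a series is at zero DTW distance from the
# original, which is exactly why DTW suits RR series of unequal length.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 0.0
```

Links of the class-related graphs can then be drawn, for instance, between pairs of patients whose DTW distance falls below a chosen threshold; the thresholding rule here is our illustration, not a detail taken from the paper.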
Overall, the analysis carried out in this work showed that machine learning routes to cardiac pathology classification via HRV time-series analysis are possible, and this may provide important benefits in terms of social costs: this should further prompt the establishment of shared repositories. Likewise, extensions of this approach to other pathologies could be feasible, as long as suitable experimental datasets are available.

Appendix
A Additional Information: a few details about the set of clinical markers. In this appendix we collect details about the markers considered in this work. First, in Tables 3 and 4, we report the full list of markers pertaining to the time domain and to the frequency domain, respectively; in Table 5, we report the full list of non-linear markers.
In Fig. 17 we present the box-plot for the standardized markers. As expected (given that the statistics underlying HRV is heavy-tailed), there is a large presence of outlier points highlighting the broadness in the marker distributions.
Ethical Statement 1: Informed consent was obtained from all subjects and related data have been treated in a completely anonymous form (in full respect of the Declaration of Helsinki (1964), with the understanding and the consent of the human subjects involved in the study).
Ethical Statement 2: APH and POLISA asked for explicit approval of the study by the responsible Ethical Committee: this approval was released to APH and POLISA on June 09 2016 by the Ethical Committee of Regione Marche (APH Hospital belongs to that region) and can be supplied upon request.
Ethical Statement 3: all the methods were carried out in strict accordance with all the relevant guidelines and regulations in Italy.