Effective high compression of ECG signals at low level distortion

An effective method for compression of ECG signals, which falls within the transform lossy compression category, is proposed. The transformation is realized by a fast wavelet transform. The effectiveness of the approach, in relation to the simplicity and speed of its implementation, is a consequence of the efficient storage of the outputs of the algorithm which is realized in compressed Hierarchical Data Format. The compression performance is tested on the MIT-BIH Arrhythmia database producing compression results which largely improve upon recently reported benchmarks on the same database. For a distortion corresponding to a percentage root-mean-square difference (PRD) of 0.53, in mean value, the achieved average compression ratio is 23.17 with quality score of 43.93. For a mean value of PRD up to 1.71 the compression ratio increases up to 62.5. The compression of a 30 min record is realized in an average time of 0.14 s. The insignificant delay for the compression process, together with the high compression ratio achieved at low level distortion and the negligible time for the signal recovery, uphold the suitability of the technique for supporting distant clinical health care.


Method
Before describing the method let's introduce the notational convention.  is the set of real numbers. Bold face lower cases are used to represent one dimension arrays and standard mathematical fonts to indicate their components, e.g.  ∈ c N is an array of N real components c(i),i = 1, …, N, or equivalently c = (c(1), …, c(N)). Within the algorithms, operations on components will be indicated with a dot, e.g. c. 2  (1) Approximation Step. Applies a DWT to the signal keeping the largest coefficients to produce an approximation of the signal up to the target quality. (2) Quantization Step. Uses a scalar quantizer to convert the wavelet coefficients in multiples of integer numbers.

(3) Organization and Storage
Step. Organizes the outputs of steps (1) and (2) for economizing storage space.
At the Approximation Step a DWT is applied to convert the signal  ∈ f N into the vector  ∈ w N whose components are the wavelet coefficients (w(1), …, w(N)). For deciding on the number of nonzero coefficients to be involved in the approximation we consider two possibilities: At the Quantization Step the selected wavelet coefficients c = (c(1), …, c(K)), with K = N − k and c(i − k) = w(γ i ), i = k + 1, …, N, are transformed into integers by a mid-tread uniform quantizer as follows: where ⌊ ⌋ x indicates the largest integer number smaller or equal to x and Δ is the quantization parameter. After quantization, the coefficients and indices are further reduced by the elimination of those coefficients which are mapped to zero by the quantizer. The above mentioned option (b) follows from this process. It comes into effect by skipping Algorithm 1. The signs of the coefficients are encoded separately using a binary alphabet (1 for + and 0 for −) in an array (s(1), …, s(K)).
Since the indices = …  i K , 1, , i are large numbers, in order to store them in an effective manner at the Organization and Storage Step we proceed as follows. These indices are re-ordered in ascending order This induces a re-order in the coefficients, → Δ Δ  c c and in the corresponding signs →s s. The re-ordered indices are stored as smaller positive numbers by taking differences between two consecutive values.
with unique recovery. The size of the signal, N, the quantization parameter Δ, and the arrays Δ  c , s, and δ  are saved in HDF5 format. The HDF5 library operates using a chunked storage mechanism. The data array is split into equally sized chunks each of which is stored separately in the file. Compression is applied to each individual chunk using gzip. The gzip method is based of on the DEFLATE algorithm, which is a combination of LZ77 23 and Huffman coding 24 . Within MATLAB all this is implemented simply by using the function save to store the data.
Algorithm 2 outlines a pseudo code of the above described compression procedure.

Algorithm 1. Selection of the largest wavelet coefficients Procedure
The fast wavelet transform has computational complexity O(N). Thus, if the approach (a) is applied, the computational complexity of Algorithm 2 is dominated by the sort operation in Algorithm 1 with average computational complexity O(NlogN). Otherwise the complexity is just O(N), because the number K of indices of nonzero coefficients to be sorted is in general much less than N. Nevertheless, as will be shown in the Numerical Example III, in either case the compression of a 30 min record is achieved on a MATLAB platform in an average time less then 0.2 s. While compression performance can be improved further by adding an entropy coding step before saving the arrays, if implemented in software such a step slows the process.
When selecting the number of wavelet coefficients for the approximation by method a) the parameter tol is fixed as follows: Assuming that the target PRD before quantization is PRD 0 we set = f tol PRD /100 0 . The value of PRD 0 is fixed as 70-80% of the required PRD. The quantization parameter is tuned to achieve the required PRD.
Signal recovery. At the Decoding Stage the signal is recovered by the following steps.
• Read the number N, the quantization parameter Δ, and the arrays Δ  c , δ  , and s from the compressed file. • Recover the magnitude of the coefficients from their quantized version as • Recover the indices from the array δ  as: • Invert the wavelet transform to recover the approximated signal f r .
As shown in Tables 5-7, and the recovery process runs about 3 times faster than the compression procedure, which is already very fast.

Results
We present here four numerical tests with different purposes. Except for the comparison in Test II, all the other tests use the full MIT-BIH Arrhythmia database 25 which contains 48 ECG records. Each of these records consists of N = 650000 11-bit samples at a frequency of 360 Hz. The algorithms are implemented using MATLAB in a notebook Core i7 3520 M, 4GB RAM.
Since the compression performance of lossy compression has to be considered in relation to the quality of the recovered signals, we introduce at this point the measures to evaluate the results of the proposed procedure. The quality of a recovered signal is assessed with respect to the PRD calculated as follows, r where, f is the original signal, f r is the signal reconstructed from the compressed file and ⋅ indicates the 2-norm. Since the PRD strongly depends on the baseline of the signal, the PRDN, as defined below, is also reported.
r where, f indicates the mean value of f. When fixing a value of PRD, the compression performance is assessed by the Compression Ratio (CR) as given by = .

CR
Size of the uncompressed file Size of the compressed file (5) The quality score (QS), reflecting the tradeoff between compression performance and reconstruction quality, is the ratio: Since the PRD is a global quantity, in order to detect possible local changes in the visual quality of the recovered signal, we define the local PRD as follows. Each signal is partitioned in Q segments f q , q = 1 …, Q of L samples. The local PRD with respect to every segment in the partition, which we indicate as prd(q), q = 1, … Q, is calculated as q q q r where f q r is the recovered portion of the signal corresponding to the segment q. For each record the mean value prd (prd) and corresponding standard deviation (std) are calculated as The mean value prd with respect to all the records in the database is a double average prd. When comparing two approaches on a database we reproduce the same mean value PRD. The quantification of the relative gain in CR of one particular approach, say approach 1, in relation to another, say approach 2, is given by the quantity: The gain in QS has the equivalent definition.
Numerical test I. We start the tests by implementing the proposed approach using wavelet transforms corresponding to different wavelet families at different levels of decomposition. The comparison between different wavelet transforms is realized using approach (b), because within this option each value of PRD is uniquely determined by the quantization parameter Δ. Thus, the difference in CR is only due to the particular wavelet basis and the concomitant decomposition level. Table 1 shows the average CR (indicated as CR b ) and corresponding standard deviation (std) with respect to the whole data set and for three different values of PRD. For each PRD-value CR b is obtained by means of the following wavelet basis: db5 (Daubechies) coif4 (Coiflets) sym4 (Symlets) and cdf97 (Cohen-Daubechies-Feauveau). Each basis is decomposed in three different levels (lv). As observed in Table 1, on the whole the best CR is achieved with the biorthogonal basis cdb97 for lv = 4. In what follows all the results are given using this basis for decomposition level lv = 4.
Next we produce the CR for every record in the database for a mean value PRD of 0.53. Table 2 shows the results obtained by approach (a) where the CR and QS produced by this method are indicated as CR a and QS a , respectively. The PRD values for each of the records listed in the first column of Table 2 are given in the forth columns of those tables. The second and third columns show the values of prd and the corre-www.nature.com/scientificreports www.nature.com/scientificreports/ sponding std for each record. The CR is given in the fifth column and the corresponding QS in sixth column of the table. The mean value CR obtained by method (b) for the same mean value PRD = 0.53 is CR b = 22. 16. Table 3 shows the variations of the CR a with different values of the parameter PRD 0 in method (a).

Numerical test II.
Here comparisons are carried out with respect to results produced by the set partitioning in hierarchical threes algorithm (SPHIT) approach proposed in 26 Tables  III of 26 are unfairly reproduced for comparison with values of PRD obtained without subtraction of the 1024 base line. The values of PRD with and without subtraction of that baseline, which are indicated as PRD B and PRD respectively, are given in Table 4. As seen in this table, for the same approximation there is an enormous difference between the two metrics. A fair comparison with respect to the results in 26 should either involve the figures in the second row of Table 4 or, as done in 26 , the fact that a 1024 base line has been subtracted should be specified. The figures in the 3rd row of Table 4 correspond to the CRs in 26 . The 4th row shows the CRs resulting from method (b) of the proposed approach without entropy coding and the 5th row the results of adding a Huffman coding step before saving the compressed data in HDF5 format. The last two rows show the quantization parameters Δ which produce the required values of PRD B and PRD.
Numerical test III. This numerical test aims at comparing our results with recently reported benchmarks on the full MIT-BIH Arrhythmia database for mean value PRD in the rage [0.23, 1.71]. To the best of our knowledge the highest CRs reported so far for mean value PRD in the range [0.8, 1.30) are those in 12 , and in the range (1.30,1.71] those in 14 . For PRD < 0.8 the comparison is realized with the results in 11 , as shown in Table 7. Table 5 compares our results against the results in Table III of 12 and Table 6 against Table 1 of 14 . In both cases we reproduce the identical mean value of PRD. The differences are in the values of CR and QS. All the Gains given in Table 5 are relative to the results in 12 while those given in Tables 6 and 7 are relative to the results in 14 and 11 , respectively. As already remarked, and fully discussed in 27 , when comparing results from different publications care should be taken to make sure that the comparison is actually on the identical database, without any difference in baseline. From the information given in the papers producing the results we are comparing with (the relation between the values of PRD and PRDN) we can be certain that we are working on the same database 25 , which is described in 28 .
The parameters for reproducing the required PRD with methods (a) and (b) are given in the last 3 rows of Tables 5-7. The previous 3 rows in each table give, in seconds, the average time to compress (t c ) and recover (t r ) a record. As can be observed, the compression times of approaches (a) and (b) are very similar. The given times were obtained as the average of 10 independent runs. Notice that the CR in these tables do not include the additional entropy coding step. Figure 1 gives the plot of CR vs PRD for the approaches being compared in this section.

Numerical test IV.
Finally we would like to highlight the following two features of the proposed compression algorithm.
(1) One of the distinctive features stems from the significance of saving the outputs of the algorithm directly in compressed HDF5 format. In order to highlight this, we compare the size of the file saved in this way against the size of the file obtained by applying a commonly used entropy coding process, Huffman coding, before saving the data in HDF5 format. The implementation of Huffman coding is realized, as in Table 4, by the off the shelf MATLAB functions Huff06 available on 29 . In Table 8 CR a and CR b indicate, as before, the CR obtained when the outputs of methods (a) and (b) are directly saved in HDF5 format. CR a Huff and CR b Huff indicate the CR when Huffman coding is applied on the outputs (a) and (b) before saving the data in HDF5 format. The rows right below the CRs give the corresponding compression times. www.nature.com/scientificreports www.nature.com/scientificreports/ (2) The other distinctive feature of the method is the significance of the proposed Organization and Storage step. In order to illustrate this, we compare the results obtained by method (b) with those obtained using the conventional Run-Length (RL) algorithm 30 instead of storing the indices of nonzero coefficients as      Table 5. Comparison between the average performance of the proposed method and the method in 12 for the same mean value of PRD.
www.nature.com/scientificreports www.nature.com/scientificreports/ proposed in this work. The CR corresponding to RL in HDF5 format is indicated in Table 8 as CR RL . When Huffman coding is applied on RL before saving the outputs in compressed HDF5 format, the CR is indicated as CR RL Huff .

Discussion
We notice that, while the results in Table 1 show some differences in CR when different wavelets are used for the DWT, it is clear from the table that the selection of the wavelet family is not the crucial factor for the success of the technique. The same is true for the decomposition level. That said, since the best results correspond to the cdf97 family at decomposition level 4, we have realized the other numerical tests with that wavelet basis. We chose to produce full results for a mean value PRD of 0.53 (c.f. Table 2) as this value represents a good compromise between compression performance and high visual similitude of the recovered signal and the raw data. Indeed, in 15 the quality of the recovered signals giving rise to a mean value PRD of 0.53 is illustrated in relation to the high performance of automatic QRS complex detection. However, the compression ratio of their method is low. For the same mean value of PRD our CR is 5 times larger: 4.5 15 vs 23.17 ( Table 2). As observed in Table 2 the mean value of the local quantity prd is equivalent to the global value (PRD). Nevertheless the prd may differ for some of the segments in a record. Figure 2 plots the prd for record 101 partitioned into Q = 325 segments of length L = 2000 sample points. Notice that there are a few segments corresponding to significantly larger values of prd than the others. Accordingly, with the aim of demonstrating the visual quality of the recovered signals, for each signal in the database we detect the segment ⁎ q of maximum distortion with respect to the prd as PRD  Table 7. Comparison between the average compression performance of the proposed method and the method in 11 for the same mean value of PRD. t c a) (s) 0.14 0.14 0.14 0.14 0.14 0.14  Table 6. Comparison between the average performance of the proposed method and the method in 14 for the same mean value of PRD.
The left graphs of Fig. 3 correspond to the segments of maximum prd with respect to all the records in the database and segments of length L = 2000. These are: the segment 25 of records 101, when applying the approximation approach (a) (top graph), and segment 175 of record 213 for approach (b) (bottom graph). The upper waveforms in all the graphs are the raw ECG data. The lower waveforms are the corresponding approximations which have been shifted down for visual convenience. The bottom lines in all the graphs represent the absolute value of the difference between the raw data and their corresponding approximation. The right graphs of Fig. 3 have the same description as the left ones but the segments correspond to values of prd close to the mean value prd for the corresponding record.
It is worth commenting that the difference in the results between approaches (a) and (b) is consequence of the fact that the concomitant parameters are set to approximate the whole database at a fixed mean value PRD. In that sense, approach (a) provides us with some flexibility (there are two parameters to be fixed to match the required PRD) whereas for approach (b) the only parameter (Δ) is completely determined by the required PRD. As observed in Table 3, when setting the parameter PRD 0 much smaller than the target PRD the approximation is only influenced by the quantization parameter Δ and methods (a) and (b) coincide. Contrarily, when setting the PRD 0 too close to the target PRD the quantization parameter needs to be significantly reduced, which affects the compression results. For a target PRD≥0.4 we recommend to set PRD 0 as 70-80% of the required PRD. PRD 1.0 0.   Another feature that appears for PRD < 0.4 is that applying the entropy coding step, before saving the data in compressed HDF5 format, improves the CR much more than for larger values of PRD. This is because for PRD < 0.4 the approximation fits noise and small details, for which components in higher wavelet bands are required. Contrarily, for larger values of PRD the adopted uniform quantization keeps wavelet coefficients in the first bands. As a result, through the proposed technique the location of the nonzero wavelet coefficients is encoded in an array which contains mainly a long stream of ones. For small values of PRD the array's length increases to include different numbers. This is why the addition of an entropy coding step, such as Huffman coding which assigns smaller bits to the most frequent symbols, becomes more important. In any case, if the outputs are saved in HDF5 format, adding the Huffman coding step is beneficial. Nonetheless, since when implemented in software the improvement comes at expense of computational time, for PRD > 0.4 this step can be avoided and the CR is still very high.
Comparisons with the conventional RL algorithm, in Table 8, enhances the suitability of the proposal for storing the location of nonzero coefficients. A similar storage strategy has been successfully used with other approximation techniques for compression of melodic music 31 and X-Ray medical images 32 . In this case the strategy is even more efficient, because the approximation is realized using a basis and on the whole signal, which intensifies the efficiency of the storage approach.

Conclusions
An effective and efficient method for compressing ECG signals has been proposed. The proposal was tested on the MIT-BIH Arrhythmia database, which gave rise to benchmarks improving upon recently reported results. The main feature of the method is its simplicity and the fact that for values of PRD > 0.4 a dedicated entropy coding to save the outputs can be avoided by saving the outputs of the algorithm in compressed HDF5. This solution involves a time delay which is practically negligible in relation to the signal length: 0.14 s for compressing a 30 min record. Two approaches for reducing wavelet coefficients have been considered. Approach (b) arises from switching off in approach (a) the selection of the largest wavelet coefficients before quantization. It was shown that, when approximating a whole database to obtain a fixed mean value of PRD, approach (a) may render a higher mean vale of CR when the target PRD is greater the 0.4.
The role of the proposed Organization and Store strategy was highlighted by comparison with the conventional Run Length algorithm. Whilst the latter produces smaller CRs, the results are still good in comparison with previously reported benchmarks. This outcome leads to conclude that, using the a wavelet transform on the whole signal, uniform quantization for all the wavelet bands works well in the design of a codec for lossy compression of ECG signals.
Note: The MATLAB codes for implementing the whole approach have been made available on a dedicated website 29,33 .

Data Availability
The data used in this paper are available on https://physionet.org/physiobank/database/mitdb/ We have also placed the data, together with the software for implementing the proposed approach, on http://www.nonlinear-approx.info/examples/node012.htm.