A lossless compression method for multi-component medical images based on big data mining

In disease diagnosis, medical images play an important part. Their lossless compression is critical, as it directly determines the local storage space and communication bandwidth required by remote medical systems, and thus helps the diagnosis and treatment of patients. Two extraordinary properties are associated with medical images: losslessness and similarity. How to take advantage of these two properties to reduce the information needed to represent an image is the key point of compression. In this paper, we employ big data mining to set up the image codebook, that is, to find the basic components of images. We propose a soft compression algorithm for multi-component medical images, which can exactly reflect the fundamental structure of images. A general representation framework for image compression is also put forward, and the results indicate that our soft compression algorithm can outperform the popular benchmarks PNG and JPEG2000 in terms of compression ratio.


Scientific Reports | (2021) 11:12372 | https://doi.org/10.1038/s41598-021-91920-x

... for compressing the medical images by finding the region of interest on an image, respectively. In 22, it attempted to implement region-of-interest-based image compression using the embedded zero-tree wavelet algorithm for medical images. In 25, it adopted a context-based and region-of-interest-based approach to compress medical images, in particular vascular images. Similarly, in 26, it considered a multi-region-of-interest medical image compression problem with edge feature preserving. Compression methods based on prediction can make good use of the continuity of an image 27,28. In 29, it proposed a method for the compression of medical images that exploits the three-dimensional nature of the data by using linear prediction. The paper 30 proposed a lossless compression scheme based on prediction by partial matching. In 31, it adopted a method that combines super-spatial structure prediction with interframe coding to achieve the compression effect. There are also some algorithms not specially designed for medical image compression [32][33][34][35], which can be applied due to their generality. Image compression methods based on neural networks 36,37 introduce the concept of learning into this field and possess good performance. Perceptually lossless compression 38,39 can attain higher compression performance without loss of important information and has good application potential over bandwidth-limited channels.
Using the characteristics of medical images to complete compression is the mainstream direction. The human body mostly possesses bilateral symmetry, which means that an organ of the human body can be divided into two symmetrical halves by simply drawing a vertical line down its center. In 40, it proposed an approach to compressing medical images by making use of their symmetry feature. The paper 41 presented a hybrid lossless compression method that combines a segmentation technique with a lossless compression scheme. There are also some methods that combine these technologies to apply to medical image compression. The paper 42 adopted HEVC for diagnostically acceptable medical image compression. In 43, it proposed a method for medical image compression by using a sequential-storage-of-differences technique.
An image is a combination of numerous pixels, and it contains not only the intensity value of each pixel but also its location. However, in conventional image representation methods, pixel intensity values are stored in a certain order (such as scanning from left to right and from top to bottom). These approaches turn the location into a definite quantity, so there is no need to encode it. In fact, a method that does not consider the intensity value of each pixel and its location at the same time certainly cannot reach the compression ratio attainable by exploiting both simultaneously.
Let us consider a toy model. Figure 1 illustrates three patterns: a fish, a pistol and a robot, yet all three are made up of only the three basic shapes shown in Fig. 2. Seeing these two figures, one naturally expects that if the shapes in Fig. 2 are used as the basic units to compress the patterns in Fig. 1, the compression effect will be great. Data mining 44,45 points the way to solving this problem.
In recent years, data mining has become one of the most interesting areas of research, covering classification, clustering, association and regression in the health domain 46. In 47, it presented a transcription factor network in the major organs of the mouse, allowing data mining and generating knowledge to elucidate their roles in various biological processes. The paper 48 combined predictive data mining with experimental evaluation in patient-derived xenograft cells to identify therapeutic options for high-risk patients. In 49, data mining and model prediction were used to identify a global disease reservoir for low-pathogenic avian influenza.
Our purpose is to find the basic shapes in images, similar to Fig. 2, with data mining. This is exactly the starting point of soft compression, whose basic component unit is the shape, representing an image by using both shapes and locations. Of course, this is merely a visual explanation of soft compression; the actual algorithm is more scientific and theoretical than this example. Soft compression is a lossless image compression method whose codebook is no longer designed artificially or only through statistical models but through data mining, which can eliminate coding redundancy and spatial redundancy simultaneously. It was first proposed in 50, dedicated to binary image compression. Then in 51, soft compression was analyzed theoretically and a compression algorithm for gray images was designed. In this paper, we present a general conceptual framework for representing image compression. Under the guidance of this framework, a new multi-component image compression algorithm based on big data mining is designed, which is especially serviceable for medical images.

Results
A general representation framework for image compression. In this new framework, we adopt the basic unit instead of the pixel as the component unit of an image. It provides a point of view to consider both coding redundancy and spatial redundancy simultaneously. The basic unit can be pixel intensity values one by one, or shapes and symbols combined by different pixels.
Let I denote a digital image composed of a great number of basic units, whose row and column dimensions are M and N, respectively. Let l_I(x_i, y_i) and l_P(x_i, y_i) represent the number of bits needed to denote a basic unit and its location, respectively; then the number of bits required for an image is

L = Σ_{i=1}^{T} [ l_I(x_i, y_i) + l_P(x_i, y_i) ],    (1)

where T is the number of basic units needed to represent the image.
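As a minimal sketch (not code from the paper), the counting in formula (1) can be expressed directly: each basic unit contributes the bit length of its content plus the bit length of its location.

```python
def total_bits(units):
    """Total bits L for an image represented by T basic units.

    units: list of (l_I, l_P) pairs, where l_I is the number of bits
    for a basic unit's content and l_P for its location; T = len(units).
    """
    return sum(l_i + l_p for l_i, l_p in units)

# Example: three basic units, 8 bits of content and 4 bits of location each
print(total_bits([(8, 4)] * 3))  # 36
```

Setting every l_P to zero recovers the conventional fixed-order schemes discussed below.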
For Huffman coding 4, the storage order is in a certain mode, so only the probability distribution of pixel intensity values is considered. The location is not taken into account, so l_P(x_i, y_i) becomes zero (because the location loses randomness and becomes a certain quantity when pixels are encoded in a certain order; namely, the entropy of the location is zero). The basic unit of Huffman coding for images is a single pixel, so T = MN, and formula (1) simplifies to

L = Σ_{i=1}^{MN} l_I(x_i, y_i).    (2)
The representation of Golomb coding 6 takes the same form (3) as that of Huffman coding. The difference is that Golomb coding is designed for non-negative integer inputs with a geometric distribution.
LZW coding 8 is also stored in a certain mode, with l_P(x_i, y_i) = 0. In this method, the number of basic units is not MN but a value T less than MN, as shown in

L = Σ_{i=1}^{T} l_I(x_i, y_i),  T < MN.    (4)
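For illustration (a standard textbook LZW encoder, not code from the paper), a minimal implementation shows why the number of emitted codes T drops below MN on repetitive data: each dictionary hit covers several input symbols with one code.

```python
def lzw_encode(data: bytes):
    # Dictionary starts with all single-byte symbols.
    table = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for c in data:
        wc = w + bytes([c])
        if wc in table:
            w = wc                 # extend the current match
        else:
            out.append(table[w])   # emit code for the longest match
            table[wc] = next_code  # learn the new string
            next_code += 1
            w = bytes([c])
    if w:
        out.append(table[w])
    return out

codes = lzw_encode(b"ababab")      # 6 input symbols
print(len(codes))                  # 4 codes: repetition is exploited
```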
Run-length coding 7 compresses repeated symbols and uses a fixed number of bits to represent the number of repetitions of each symbol. In this way, formula (1) can be expressed as

L = Σ_{i=1}^{T} [ l_I(x_i, y_i) + l_C ],    (5)

where l_C is the fixed number of bits required to represent a run length, playing the role of the location term.
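As an illustrative sketch (an assumed encoding, not the paper's implementation), run-length coding replaces each run with a (symbol, count) pair, the count held in a fixed number of bits l_C:

```python
def rle_encode(symbols, run_bits=8):
    # Each run becomes (symbol, length); lengths are capped at the
    # largest value representable in run_bits bits (l_C in the text).
    max_run = (1 << run_bits) - 1
    runs, i = [], 0
    while i < len(symbols):
        j = i
        while j < len(symbols) and symbols[j] == symbols[i] and j - i < max_run:
            j += 1
        runs.append((symbols[i], j - i))
        i = j
    return runs

print(rle_encode([0, 0, 0, 1, 1, 2]))  # [(0, 3), (1, 2), (2, 1)]
```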
Symbol-Based coding 52 is mainly designed for document storage, which takes the repeated characters in the text as a symbol. It considers both symbols and locations, which can be expressed by formula (6).
The representation of soft compression is similar to that of Symbol-Based coding and can be expressed by formula (7). The difference is that the basic unit of soft compression is the shape, which is obtained by searching in datasets based on data mining rather than by artificial design. Compared with Symbol-Based coding, it is closer to the nature of images and reflects the essential information of a dataset.
In this new framework, we can unify the representation of different compression methods, which is helpful to the comparison and analysis of diverse approaches. We summarize these methods in Table 1.
Soft compression algorithm for multi-component image. A multi-component image is first decomposed into multiple single-component images, and then a reversible component transformation is performed. Each single-component image is divided into a shape layer and a detail layer after predictive coding and mapping. The shape layer is regular and sparse, while the detail layer is irregular and dense. Therefore, the two layers should be coded differently depending on their properties. The compressed image can be obtained by combining the coded data of the shape layer and detail layer of each single-component image.

Malaria is a disease caused by Plasmodium parasites that remains a major threat to global health, affecting 200 million people and causing 400,000 deaths a year. Identifying and quantifying malaria has huge significance for research in both the medical and computer science fields; its dataset 53 will be employed to demonstrate the effectiveness of the soft compression algorithm. Figure 3(a) is a multi-component image from the Malaria dataset. We will use the visual representation of this image to describe each step in the encoding and decoding process of the soft compression algorithm. For a multi-component image, the first step is to decompose it into three single-component images B, G and R, as shown in Fig. 3(b), (c) and (d). These three components represent the intensity of blue, green and red of each pixel in the image. The next step, which takes advantage of the correlation between these components, is to perform a reversible component transformation to generate three new components Y, U and V. This provides color decorrelation for efficient compression, a reasonable color space with respect to the human visual system, and the ability to compress losslessly 11. The transformation between the two color spaces takes the same form as in JPEG2000 11, which is shown in formula (8).
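The JPEG2000 reversible component transformation referenced above (formula (8)) can be sketched as follows; integer arithmetic with floor division makes the transform exactly invertible, as lossless compression requires:

```python
def rct_forward(r: int, g: int, b: int):
    # JPEG2000 reversible component transform: RGB -> YUV
    y = (r + 2 * g + b) // 4
    u = b - g
    v = r - g
    return y, u, v

def rct_inverse(y: int, u: int, v: int):
    # Exact inverse; // is floor division, matching the forward pass
    g = y - (u + v) // 4
    return v + g, g, u + g  # r, g, b

assert rct_inverse(*rct_forward(200, 100, 50)) == (200, 100, 50)
```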
The Y-component image is processed through multiple steps (described in the Methods section). The shape layer image and detail layer image of the Y component are obtained by layer separation, as shown (binarized for clearer appearance) in Fig. 3(e) and (f). The reason for layering is that different coding methods are adopted according to the different properties of the shape layer and detail layer: the former is regular and sparse, while the latter is irregular and dense. Therefore, for the shape layer, the shape is regarded as the basic unit for representing an image. The number of bits required to represent a shape layer image containing T shapes is L = Σ_{i=1}^{T} [ l_I(x_i, y_i) + l_P(x_i, y_i) ], where l_I(x_i, y_i) and l_P(x_i, y_i) represent the lengths needed to denote a shape and its location, respectively. Due to the irregularity of the detail layer, it can be encoded by common statistical coding methods. Similarly, the other two components receive the same treatment as the Y component, as illustrated in Fig. 3(g) to (j). Decoding the compressed data from Fig. 3 with the soft compression algorithm for multi-component images yields the reconstructed image, shown in Fig. 3(k). The compression ratio of this instance is 4.40, which largely eliminates coding redundancy and spatial redundancy.
BCCD is a small-scale dataset for blood cell detection. We select the first 200 images of the BCCD dataset as the training set and the remaining 166 images as the testing set. Then, the soft compression algorithm and traditional Huffman coding are applied to obtain the compression ratios, and their results are statistically analyzed to obtain the frequency histograms in Fig. 4(a) and (b). The results of Huffman coding come from the independent coding of the three components without any other processing. The comparison indicates that if an image is compressed only from the perspective of coding redundancy, the results will be poor. From these two figures, we can conclude that soft compression is much better than traditional Huffman coding in lossless image compression because it aims to eliminate both coding redundancy and spatial redundancy simultaneously. Table 2 illustrates the experimental results of the soft compression algorithm for multi-component images and other classical systems on the Malaria, BCCD, Melanoma and FIRE 54 datasets. The statistics include the mean, minimum, maximum and variance of the compression ratio. The results in Table 2 indicate that the average compression ratio with soft compression is clearly higher than that of other lossless image compression methods. Through comparison, we can conclude that the soft compression algorithm for multi-component images outperforms the popular classical benchmarks JPEG, PNG and JPEG2000.
In lossless mode of JPEG2000, 5/3 reversible wavelet transform is adopted after preprocessing which includes region division, DC level shifting and reversible component transformation. The wavelet coefficients are then sent to bit plane modeling encoder and arithmetic encoder for embedded block coding with optimized truncation. In lossless mode of JPEG, the first step is linear prediction, and then the compressed data is obtained by using Huffman coding and class code. PNG mainly consists of three parts: prediction, LZ77 and Huffman coding. Table 3 illustrates the difference and comparison of soft compression and baselines. All of our methods outperform the widely-used PNG and JPEG2000 in terms of bits per sub-pixel (bpsp).

Discussion
Soft compression algorithm for multi-component image makes full use of the two properties of medical images mentioned in Section I from the perspective of data mining. For the algorithm, its codebook is complete. In other words, it always contains shapes of size one, which ensures that the reconstructed image is exactly the same as the original one. Compared with the original image, the image decoded from compressed data has no information loss, which ensures the authenticity of medical images. This corresponds to the first property of medical images. In addition, soft compression algorithm uses the shape as the basic unit, reflecting the essential composition of an image. This takes advantage of the second property of medical images. Soft compression is a universal method. It performs well even if the training stage and testing stage belong to different scenes. Soft compression algorithm is not only suitable for multi-component images, but also for single component images, because the processing of each component is independent. However, we can also consider the relationship between different components, utilizing this information to further improve the compression effect.
There are several significant differences between soft compression algorithm and other methods. These differences make soft compression more suitable and competitive to deal with medical images.
• The basic unit of soft compression is the shape, rather than the pixel.
• The location of a basic unit is no longer arranged in a definite order, but changes from a constant to a random variable.
• The codebook is no longer designed artificially or only through statistical models, but through data mining.
In the specific algorithm design, we adopt some preprocessing operations that are conducive to soft compression, such as prediction coding, mapping and layering, so that we can fully utilize the characteristics of images. The advantage of soft compression algorithm is that the codebook obtained in the training stage can be reused until it needs to be updated. When storing and transmitting images, one only needs to obtain the compressed data according to the codebook. After that, all operations can focus on the compressed data, which greatly reduce the consumption of communication and storage resources.

Methods
Soft compression algorithm for multi-component image. For coding, the codebook is one of the most critical things; it directly determines the compression effect. The basic unit of the soft compression algorithm for multi-component images is the shape, and how to find the corresponding codeword of each shape is our main consideration. The codebook of soft compression is obtained by searching and dynamically updating in the dataset, which can reflect the essential information of a certain kind of images from the perspective of spatial correlation. The process of codebook acquisition mainly includes prediction coding, negative-to-positive mapping, layer separation and searching. For a multi-component image I that has m components, we first divide it into m single-component images and perform a reversible component transformation. After obtaining the new m components, one can process each component image independently. For each component, we use the predictive error to represent it through prediction coding 55. Since the predictive error can be negative, the second step is to map it to a non-negative value, which is conducive to the subsequent layer separation operation. Layer separation separates the image into the shape layer and detail layer. The shape layer retains the main information of an image, which is instrumental in using the combination of locations and shapes for coding. On the other hand, the detail layer retains all the information except the shape layer. When the shape layer is obtained, shape units are searched and updated dynamically to get the final shape set that will be used to generate the codebook. While searching in the shape layer, the distribution of intensity values in the detail layer should also be counted.
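The first two preprocessing steps can be sketched as follows. This is a simplified illustration: the left-neighbor predictor and the zigzag mapping are assumptions for the sketch, since the paper's exact predictor and mapping are specified elsewhere.

```python
def predict_row(row):
    # Left-neighbor prediction; the first pixel is predicted as 0.
    prev, errs = 0, []
    for v in row:
        errs.append(v - prev)
        prev = v
    return errs

def to_nonnegative(e: int) -> int:
    # Zigzag map 0, -1, 1, -2, 2, ... onto 0, 1, 2, 3, 4, ...
    return 2 * e if e >= 0 else -2 * e - 1

row = [100, 102, 99]
mapped = [to_nonnegative(e) for e in predict_row(row)]
print(mapped)  # [200, 4, 5]
```

Both steps are invertible, so the original pixel values can be recovered exactly during decoding.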
In the process of obtaining shapes, the method is to predefine a set whose elements satisfy an initial condition. During training, a shape that meets this condition is included in the shape library. The size of the set is dynamically updated according to the frequency and weight of each shape to ensure that there is no quantity explosion. Suppose that A is an M × N matrix whose i-th row and j-th column are represented by vectors u_i and v_j, respectively. A matrix whose u_i and v_j follow (9) and (10) is appropriate for generating a shape.
Removing the zero elements in the matrix and combining the remaining elements with their intensity values, one can get a shape that satisfies the initial condition. This prevents different matrices from forming the same shape. However, these shapes are only candidates; they do not necessarily enter the codebook. In the training stage, we match each candidate shape against the dataset. Frequency and size are the key factors for judging whether a shape can enter the codebook: shapes with high frequency and large size are kept, while shapes with small frequency and size are eliminated. After the final shape set is obtained, the codebook can be generated according to the size and frequency of each shape. Figure 5 shows some shapes generated by training on BCCD.
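A much-simplified sketch of candidate counting (hypothetical code: conditions (9) and (10) and the dynamic pruning are not reproduced here) scans a shape layer and tallies every all-nonzero window as a candidate shape:

```python
from collections import Counter

def count_candidates(layer, height, width):
    # Tally each height x width window whose entries are all nonzero
    # as one candidate shape occurrence.
    counts = Counter()
    rows, cols = len(layer), len(layer[0])
    for i in range(rows - height + 1):
        for j in range(cols - width + 1):
            block = tuple(
                tuple(layer[i + di][j + dj] for dj in range(width))
                for di in range(height)
            )
            if all(v != 0 for r in block for v in r):
                counts[block] += 1
    return counts

layer = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
print(count_candidates(layer, 2, 2))  # one occurrence of the 2x2 block of ones
```

In the full algorithm, a candidate's frequency and size together decide whether it survives into the codebook.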
For the shape layer, one needs to consider the frequency and size of each shape to generate the codebook. In this process, it aims to make the average code length as short as possible. However, for the detail layer, the optimal code can be obtained only by considering the frequency distribution of intensity values. Figure 6 illustrates the whole procedure of acquiring codebooks for images with soft compression algorithm. The codebook can be applied all the time after it is obtained, which indicates that the cost will be very tiny in the average sense. When the terminal intends to store and transmit an image, it only needs to process the compressed data, which greatly reduces the storage space and communication bandwidth.
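One standard way to realize "average code length as short as possible" is Huffman code construction over the final frequencies. The sketch below (illustrative, with shapes abstracted to symbols; not the paper's code) computes only the code lengths:

```python
import heapq

def huffman_lengths(freqs):
    # Code length per symbol for a Huffman code over {symbol: frequency}.
    # For the shape layer the symbols would be shapes; for the detail
    # layer they are intensity values.
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, g1 = heapq.heappop(heap)
        f2, _, g2 = heapq.heappop(heap)
        for s in g1 + g2:
            lengths[s] += 1        # every merge deepens the subtree
        heapq.heappush(heap, (f1 + f2, tie, g1 + g2))
        tie += 1
    return lengths

print(huffman_lengths({"a": 5, "b": 2, "c": 1, "d": 1}))
```

The most frequent symbol receives the shortest codeword, minimizing the expected code length.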
Encoding. The preprocessing for images of encoding is the same as the acquisition of codebooks. After a multi-component image is divided into several single component images, the prediction coding is applied for each single component image, and the predictive error is mapped into a non-negative value. The predictive error is layered to generate the shape layer and detail layer, which will be compressed by different coding methods. Figure 7 illustrates the encoding procedure. Figure 8 is the encoding process of a RGB image with soft compression.
Filling the shape layer according to its codebook yields many shapes and corresponding locations, which are represented as (x_i, y_i, S_i). Since the location differences approximately obey a geometric distribution, Golomb coding is applied to them. By recording the location representation and corresponding codeword of each shape used in filling, the encoded data of the shape layer can be generated. According to the codebook, the encoded data of the detail layer are obtained by scanning from left to right and from top to bottom. After that, they are combined with the encoded data of the shape layer and some information about the image (e.g., its size) to generate the compressed file of each component. Concatenating the compressed data of each component forms the final compressed data of an image. In storage and transmission, the compressed data will be used as another lossless representation of the image. Figure 7 also illustrates the decoding part of the soft compression algorithm, which is summarized in Table 4.
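The Golomb coding of location differences mentioned above can be sketched as follows. This is a textbook Golomb encoder; the parameter m would in practice be tuned to the geometric distribution of the differences, a choice not fixed here.

```python
def golomb_encode(n: int, m: int) -> str:
    # Golomb code of n >= 0 with parameter m >= 1:
    # unary-coded quotient, then a truncated-binary remainder.
    q, r = divmod(n, m)
    prefix = "1" * q + "0"
    b = m.bit_length()
    if m & (m - 1) == 0:               # m a power of two (Rice code)
        return prefix + format(r, f"0{b - 1}b") if m > 1 else prefix
    cutoff = (1 << b) - m
    if r < cutoff:
        return prefix + format(r, f"0{b - 1}b")
    return prefix + format(r + cutoff, f"0{b}b")

print(golomb_encode(5, 4))  # '1001': quotient 1, remainder 01
print(golomb_encode(2, 3))  # '011'
```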

Conclusion
In this paper, we propose a new general representation framework for image compression. This framework takes many coding methods into account and can be applied to represent image compression schemes. Under its guidance, we design a novel coding method for medical images from the viewpoint of data mining. The soft compression algorithm for multi-component images adopts shapes as the basic unit, regarding an image as a combination of shapes. Since both shapes and locations are taken into account when representing an image, the algorithm can eliminate coding redundancy and spatial redundancy at the same time. Experimental results indicate that the soft compression algorithm for multi-component images can outperform the popular classical benchmarks PNG and JPEG2000. In applications such as intelligent medicine, the soft compression algorithm can help compress medical images to reduce the occupation of communication bandwidth and storage space. Of course, it can also be applied to other scenarios that need lossless compression, such as precious image preservation. However, in telemedicine, the role of soft compression is not only to compress images; it may lead to more significant applications. Foreseeable research directions include: (i) high-fidelity video stream coding technology, which may surpass the current international standards; (ii) efficient channel coding technology suitable for certain types of images, based on shapes rather than pixels; (iii) corresponding storage coding, fast encoding and decoding methods, as well as local image information extraction methods; (iv) combined with artificial intelligence, a widely used software platform and open-source library.
In the future, on the one hand, the performance of the algorithm can be improved by taking advantage of the characteristics of medical images in the corresponding preprocessing. On the other hand, mining more effective shape acquisition methods can bring better results. In addition, combining the soft compression algorithm with other coding methods, such as transform-domain methods, can achieve efficient lossy compression.

Data availability
The code used and the datasets analyzed during the current study are available from the corresponding author on reasonable request and can also be found at https://github.com/ten22one/Soft-compression-algorithm-for-multi-component-image