Classification with 2-D convolutional neural networks for breast cancer diagnosis

Breast cancer is the most common cancer in women. Classification of cancer/non-cancer patients from clinical records requires high sensitivity and specificity for an acceptable diagnostic test. The state-of-the-art classification model, the convolutional neural network (CNN), however, cannot be used directly on tabular clinical data represented in a 1-D format. CNN is designed to work on a set of 2-D matrices whose elements show some correlation with neighboring elements, such as in image data. Conversely, data examples represented as sets of 1-D vectors, apart from time-series data, cannot be used with CNN, but only with other classification models such as Recurrent Neural Networks for tabular data or Random Forest. We have proposed three novel preprocessing methods of data wrangling that transform a 1-D data vector into a 2-D graphical image with appropriate correlations among the fields, to be processed with CNN. We tested our methods on the Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets. To our knowledge, this work is novel in transforming non-image tabular data to image data for non-time-series data. The transformed data processed with CNN using VGGnet-16 shows competitive results for the WBC dataset and outperforms other known methods for the WDBC dataset.


Introduction
In recent times, there has been growing interest in the development of machine learning (ML) models for medical datasets due to advancements in digital technology and improvements in data collection methods. Increasingly, several ML-based systems have been designed as early warning or diagnostic tools for chronic illnesses, for example diagnosing depression, diabetes and cancer [1]. Breast cancer is arguably one of the deadliest forms of cancer among women, with millions of reported cases around the world, of which many become fatal [2,3]. Breast cancer is caused by abnormal growth of some of the breast cells in the lining of the milk glands or ducts of the breast (ductal epithelium) [4,5]. Compared to healthy cells, these cells divide more rapidly and accumulate, forming a lump or mass. At this stage, the cells become malignant and may spread through the breast to lymph nodes or other parts of the body.

Problem Statement
The study of breast cancer has attracted considerable attention in the past decades. Improvements in data collection and storage technologies have resulted in various types and amounts of data being collected on breast cancer from around the world. These include data on Ribonucleic Acid (RNA) signatures for cell mutations that cause breast cancer [6,7], mammogram images [8,9] and data on symptoms and diagnosis [10]. Many traditional Computer-Aided Diagnosis (CADx) systems require hand-crafted feature extraction, which is a challenging task [11,12]. Even conventional ML techniques require the manual extraction of an optimal set of features prior to model training. An extensive review of various feature selection and extraction techniques can be found in [13,14]. Some commonly used approaches for ML models are Principal Component Analysis (PCA) [15], information gain [16], GA-based feature selection [17], recursive feature elimination (RFE) [18], metaheuristic methods [19] and rough sets [20]. Feature selection and extraction, therefore, is an important consideration in the preprocessing step before applying any ML algorithm such as decision trees, Bayesian models, Support Vector Machines (SVM) and Artificial Neural Networks (ANN). The behavior of ML algorithms and their prediction accuracy are influenced by the choice of features selected [21,22]. Often, manual feature extraction or the knowledge of domain experts is needed to gain a good understanding of the relevance of the attributes [23].

Context and Background
The issues surrounding the use of conventional ML algorithms have propelled the need for new approaches and methods to automatically extract features from large datasets. As a result, Deep Learning (DL) algorithms such as the Convolutional Neural Network (CNN or ConvNet) and Recurrent Neural Networks (RNNs) have emerged in recent times that can accept raw data and automatically discover patterns in it [24,25].
CNN is one of the most popular deep learning algorithms and is mostly used for image classification, natural language processing, and time-series forecasting. Its ability to extract and recognize fine features has led to state-of-the-art performance in various application domains such as computer vision, image recognition, speech recognition, and natural language processing [26,27,28]. CNN is an enhancement of the canonical neural network architecture that was specifically designed for image recognition in [29]. Since then, many variations have been added to the architecture of CNN to enhance its ability to produce remarkable solutions for deep learning problems, such as AlexNet [26], VGG Net [27] and GoogLeNet [30]. CNN eliminates the need for manual feature extraction because the features are learned directly by the different convolutional layers [31,26]. It does not require a separate feature extraction strategy, which needs domain experts and other preprocessing techniques and may still fail to extract the complete set of features [32]. Despite its huge success with image data, CNN is not designed to handle non-image data in non-time-series form. Arguably, any problem that can represent the correlation of features of a given data example in a single map may be attempted via CNN.
CNNs have proven to work best on data in 2-D form, such as images and audio spectrograms [33]. This is attributed to the fact that the convolution technique in CNN requires data examples to have at least two dimensions. Nevertheless, CNN has been explored on application-specific 1-D data as well. Examples include gene sequencing data such as DNA sequences treated as text (sequences of words) [34], and signals and sequences in text mining, word detection and natural language processing (NLP) [35,36]. More specifically, CNN for Time-Series Classification (TSC) has recently been explored with new methods such as the Multi-Scale CNN (MCNN) [25] and an ensemble of CNN models with AlexNet on the Inception-v4 architecture [37,38]. These methods have made significant improvements in classifier accuracy over state-of-the-art ensemble methods such as Flat-COTE and HIVE-COTE [39,40]. Moreover, raw time-series data has also been fed into 1-D CNNs by calculating the area of the signal for convolution, with better time complexity and scalability [41,42]. Nonetheless, much data still exists in a 1-D format, such as the clinical data of medical records, which opens challenging research questions on whether such data can be effectively trained for classification using CNN. This paper aims to fill this gap by proposing novel methods to transform non-time-series 1-D numerical data into 2-D data and process it with CNN.

Motivation
The main motivation for this paper is to realize the potential of CNN for non-image clinical data on breast cancer, because CNN eliminates the need for manual feature extraction. The features are learned directly by CNN, which also produces state-of-the-art recognition results [43]. The key difference between traditional ML and DL is in how features are extracted. Traditional ML approaches use hand-crafted engineered features, applying several feature extraction algorithms before the learning algorithms. In DL, on the other hand, the features are learned automatically and are represented hierarchically at multiple levels. This is the strength of DL over traditional machine learning approaches [43].

Hypothesis
We have proposed novel methods to transform non-image clinical data of breast cancer into 2-D feature-map images in R^2 so that a large class of such data is not deprived of the services of CNN. This should also encourage other variations and/or methods for text-to-image transformation to be developed in the future. The scope of this paper is to broaden the usage of CNN to applications where the raw data is a set of N 1-D data vectors in R^d, as shown in Figure 1. This dataset from UCI [10] is a record of medical examinations of patients to diagnose breast cancer, where each row is a 1-D vector representing a numerical data example. We demonstrate that our method of transforming non-image breast cancer data to image data, processed with CNN, produces exceptional results for classification accuracy. Some research demonstrates the use of 1-D convolutions on 1-D datasets such as signals and time sequences [44]. Though this suggests the possibility of using 1-D convolutions in this research, our experiments revealed their unsuitability on our experimental datasets: applying the data in its raw form to a 1-D CNN gave highly unpredictable results.
This paper is organized as follows: Section 2 briefly describes the general architecture of CNN. Section 3 describes our three proposed methods of data wrangling from non-image Breast Cancer data [10] to image data. Section 4 describes the complete methodology of the classification of breast cancer data with CNN. Section 5 shows the experimental results and Section 6 discusses the outcome of the experiments. Lastly, Section 7 concludes the paper by summarizing the results and proposing some further extensions to the research.

Convolutional Neural Networks
A convolutional neural network (CNN or ConvNet) is a deep learning algorithm designed for computer vision. Its architecture is based on backpropagation artificial neural networks [29]. It takes an input image, each pixel of which represents input data; the data goes through a series of feature selection steps via convolution and is then sent to the weighted perceptrons, where learning happens through backpropagation. The major advantage of CNN is its ability to learn the features by itself, whereas in canonical neural networks feature selection is a separate process and the final accuracy of the model depends on the choice of preprocessing and feature selection methods [45,46]. CNN has become a prominent deep learning model with a plethora of literature available on its structure and functionality; nevertheless, a brief description of the individual layers of CNN is given below.

Feature Selection layer
This layer performs feature extraction for CNN, which means no additional domain-specific feature selection preprocessing is required. It can be divided into three sublayers:

Convolutional Layer
This layer directly accepts raw images as input, where a set of small filters is convolved over the image to produce one or more feature maps [47,48]. Convolution happens by sliding the filter across the image while computing the dot product of the elements of the filter and the image [49]. This process results in the extraction of certain features from the image [50].
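The sliding dot product described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the edge-detecting kernel is a hypothetical example.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution: slide the kernel over the image and
    take the dot product of the overlapping elements at each position.
    (Strictly this is cross-correlation, as used in most CNN libraries.)"""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter applied to a tiny two-tone image
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
fmap = convolve2d(img, kernel)   # 2x2 feature map reacting to the edge
```

Each output element is one dot product of the filter with the image patch beneath it, which is exactly the feature-extraction step the layer performs.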

Activation Layer
The results of the convolutional layer are passed through an activation function to produce a bounded output. CNN generally uses the Rectified Linear Unit (ReLU), which converts negative values to 0. It also trains the network several times faster than counterparts such as tanh [26].
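As a quick sketch, the ReLU described above is simply an element-wise clamp at zero:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: negative activations are clipped to 0,
    # positive activations pass through unchanged
    return np.maximum(0, x)

acts = np.array([-2.0, -0.5, 0.0, 1.5])
out = relu(acts)   # only the positive activation survives
```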

Pooling Layer
This layer performs downsampling, which reduces the input size along each dimension [50]. Some common pooling methods are average pooling and max pooling, where the received image is partitioned into a set of non-overlapping rectangles. Max pooling and average pooling take only the maximum value and the average value of every sub-region, respectively. This process downsamples the image [51,52].
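Both pooling variants can be sketched by reshaping the image into non-overlapping blocks and reducing each block; this is an illustrative implementation, not the one used in the paper.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Partition x into non-overlapping size x size blocks and reduce
    each block to its maximum (max pooling) or mean (average pooling)."""
    h, w = x.shape
    blocks = x[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 3., 2.],
              [7., 6., 1., 0.]])
mx = pool2d(x, 2, "max")    # each 2x2 block reduced to its maximum
av = pool2d(x, 2, "mean")   # each 2x2 block reduced to its mean
```

Either way the 4x4 input is downsampled to 2x2, halving each dimension as a [2 x 2] pool with stride 2 does.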

Classification Layer
After learning features in the above layers, the architecture of CNN shifts to classification. This fully connected layer is similar to the fully connected network in conventional neural network models [32]. The final layer of the CNN architecture uses a classification layer such as softmax to provide the classification output [50]. The complete architecture of CNN processing an image of the digit 2 is shown in Figure 2 (taken from [46]). The image goes through all the layers and is then classified into one of the values 0-9.

Preprocessing Methods to Transform Numerical Data to Image
We have proposed three basic techniques of data wrangling to convert Breast Cancer numerical data to image data. The converted image must reflect some patterns to depict a given class. We have used the Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets from the UCI library [10] for the classification of numerical data in this work.

Equidistant Bar Graphs
The bar graph represents the measurement of every feature of a given dataset. There are many ways to draw a bar graph, but we have used a simplistic approach. The dataset is first normalized to [0, 1], then every feature is drawn based on its measured value. The width of the image in pixels is ψd + γ(d + 1), where d is the total number of features, ψ is the width of a bar and γ is the gap between two consecutive bars. The height of the image is set equal to the width to produce a square image. We used a 1-pixel width for ψ and a 2-pixel gap for γ in our experiments. This produces a square image of size [3d × 3d] approximately. A few data examples of the WDBC dataset converted to bar graphs are shown in Figure 3 with the class labels, Benign and Malignant. The algorithm for this approach is given in the Appendix. These pictures are only useful to CNN if they depict a pattern in a convolved image. The first convolutional layer produces 6 feature maps, shown in Figure 4, where some distinguishing features are visible.
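The bar-graph transformation above can be sketched as follows. This is a minimal sketch under stated assumptions (white bars on a black background, bottom-aligned bars); the paper's exact rendering may differ.

```python
import numpy as np

def bar_graph_image(x, psi=1, gamma=2):
    """Render a normalized 1-D feature vector x (values in [0, 1]) as an
    equidistant bar-graph image. The width is psi*d + gamma*(d+1) pixels,
    with bar width psi and inter-bar gap gamma; the height equals the
    width so the image is square (about 3d x 3d for psi=1, gamma=2)."""
    d = len(x)
    w = psi * d + gamma * (d + 1)
    img = np.zeros((w, w))                       # black background
    for i, v in enumerate(x):
        col = gamma + i * (psi + gamma)          # left edge of bar i
        h = int(round(v * w))                    # bar height in pixels
        if h > 0:
            img[w - h:, col:col + psi] = 1.0     # white bar, bottom-aligned
    return img

x = np.array([0.2, 0.9, 0.5])   # a hypothetical normalized 3-feature example
img = bar_graph_image(x)        # an 11 x 11 image for d = 3
```

With ψ = 1 and γ = 2 the width works out to 3d + 2, matching the approximate [3d × 3d] size stated above.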
Intuitively, the "correct" order of the bars ought to give better results. The datasets of numerical data were therefore reorganized so that related fields were placed close to each other according to the order of their similarity. First, a covariance matrix of the data fields was generated; then each value of the matrix was converted to a "rank" that determines how closely one field is related to another. This is a shortest-path problem, for which algorithms such as dynamic programming or any metaheuristic algorithm [53] such as the Genetic Algorithm (GA) [54], Particle Swarm Optimization [55] or the Reincarnation Algorithm (RA) [56] can be used to obtain the optimal order of bars based on their respective ranks. Thereafter, a new set of images was created using this new order of bars. This process is elaborated further in Section 6.
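The reordering idea can be illustrated with a simple greedy chain over the covariance matrix. Note the hedge: the paper solves this shortest-path problem with a GA; the greedy nearest-neighbour heuristic below is only an illustrative stand-in, and the synthetic data is hypothetical.

```python
import numpy as np

def similarity_order(X):
    """Greedy stand-in for the GA-based bar reordering: build the
    field-by-field covariance matrix, use |covariance| as the measure of
    'closeness', then chain fields nearest-neighbour style so that
    related fields end up adjacent."""
    cov = np.abs(np.cov(X, rowvar=False))   # columns of X are the fields
    d = cov.shape[0]
    np.fill_diagonal(cov, -np.inf)          # ignore self-covariance
    order = [0]                             # arbitrarily start from field 0
    remaining = set(range(1, d))
    while remaining:
        last = order[-1]
        # pick the unplaced field most closely related to the last one
        nxt = max(remaining, key=lambda j: cov[last, j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)  # field 3 tracks field 0
order = similarity_order(X)                     # 3 lands right after 0
```

A GA would instead search over whole permutations to minimize the total rank along the chain, which the greedy heuristic only approximates.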

Normalized Distance matrix
The next method is the formation of a distance matrix, which is a square matrix of size [d × d] where d represents the total number of features of a given example. The matrix elements are the differences between two features, i.e., x_ij = x_i − x_j, where x_i and x_j represent the measurements of given features with i, j ∈ [1, d]. We used Euclidean distance in our experiments. The matrix is then normalized to [0, 1]. This produces a square image of size [d × d], a 3-fold reduction in length compared to the bar graphs described in Section 3.1. A few data examples of the WDBC dataset converted to normalized distance matrices are shown in Figure 5 with class labels. The images can easily be scaled up to [3d × 3d]. The first convolutional layer produces 6 feature maps, similar to those for the bar graphs, as shown in Figure 6.
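A minimal sketch of this transformation, assuming the Euclidean distance between two scalar features reduces to the absolute difference:

```python
import numpy as np

def distance_matrix_image(x):
    """Build the d x d matrix of pairwise feature distances |x_i - x_j|
    and normalize it to [0, 1] so it can be stored as a grayscale image."""
    diff = np.abs(x[:, None] - x[None, :])   # broadcast pairwise distances
    rng = diff.max() - diff.min()
    return diff / rng if rng > 0 else diff

x = np.array([0.1, 0.5, 0.9])     # a hypothetical 3-feature example
img = distance_matrix_image(x)    # 3 x 3, symmetric, zero diagonal
```

The resulting image is symmetric with a zero diagonal, which is the visual pattern the convolutional layers pick up on.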

Combination of options (bar graph, distance matrix, normalized numeric data)
The above two strategies can be combined to give a third option for generating an image from numerical data. We create a colored image of 3 layers of size [3d × 3d], where the first layer holds the normalized distance matrix, the second layer holds the bar graph, and the third layer holds a copy of the numerical data stored row-wise, i.e., x_ij = x_i, where i, j ∈ [1, d] denote the row and column of the matrix and x_i represents the measurement of a given feature. A few data examples of the WDBC dataset converted with this combination of options are shown in Figure 7 with the class labels. The first convolutional layer in this case is not able to produce any distinct feature, but the scaled-up image shows different colors with some bars in Figure 8. The 3rd convolved block (12th layer) produces some blobs scattered in the images in Figure 9.
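The three-layer combination can be sketched by stacking the channels. How the d x d distance matrix is resized to match the other layers is not stated in the text, so the nearest-neighbour upsampling below is an assumption, as are the helper names.

```python
import numpy as np

def combined_image(x, psi=1, gamma=2):
    """Stack three [w x w] layers into one colour image: channel 0 the
    normalized distance matrix, channel 1 the equidistant bar graph,
    channel 2 the normalized values repeated row-wise (x_ij = x_i)."""
    d = len(x)
    w = psi * d + gamma * (d + 1)

    # channel 0: d x d distance matrix, upsampled to w x w
    dist = np.abs(x[:, None] - x[None, :])
    if dist.max() > 0:
        dist = dist / dist.max()
    idx = np.minimum((np.arange(w) * d) // w, d - 1)  # nearest-neighbour map
    layer1 = dist[np.ix_(idx, idx)]

    # channel 1: bottom-aligned bar graph
    layer2 = np.zeros((w, w))
    for i, v in enumerate(x):
        col = gamma + i * (psi + gamma)
        h = int(round(v * w))
        if h > 0:
            layer2[w - h:, col:col + psi] = 1.0

    # channel 2: every row repeats the (column-resampled) feature vector
    layer3 = np.tile(x[idx], (w, 1))

    return np.stack([layer1, layer2, layer3], axis=-1)

img = combined_image(np.array([0.2, 0.9, 0.5]))   # shape (11, 11, 3)
```

Each channel carries a different view of the same example, which is why this option holds the most information of the three.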

Classification of Non-Image Data With CNN
As described in Section 2, CNN completes the classification process in two steps. The first step is the automatic feature extraction from the images and the second step is the classification of the same images with backpropagation neural networks. A numerical dataset that is not in the form of images first goes through the data wrangling process described in Section 3, where one of the three options is used for non-image to image data conversion. The transformed images may not make logical sense to human eyes, but CNN is capable of extracting relevant features from them. Figure 10 illustrates the complete flowchart of the training process of CNN with non-image datasets. The process contains four important parts: first, the numeric input data (A) undergoes the preprocessing of data wrangling (B), where it is normalized and converted to a 2-D image format using one of the data wrangling techniques described in Section 3 (the figure shows the distance matrix method of Section 3.2). The generated image is filtered through the CNN convolutional layers for feature extraction (C). The features are trained in the fully connected layers to obtain classification outputs (D).

Experiments
The objective of the experiments is to provide an alternative classification method with CNN for the non-image dataset of Breast Cancer and other similar datasets, without any need for manual feature selection. We have used the WBC and WDBC datasets from the UCI library [10] for the experiments. The properties of these datasets are given in Table 1. We have tested the efficacy of our method against other published state-of-the-art methods used for Breast Cancer diagnosis, namely variations of Neural Networks (NN) [57], Support Vector Machine (SVM) [58,16,59], Decision Tree (DT) [60] and Naïve Bayes (NB) [61]. These methods are generally supported by additional feature selection methods such as IG, rough sets or weighted NB. For CNN, we used the VGG16 [27] architecture with 4 convolutional blocks. Each convolutional block has a 2-D convolutional layer with a filter size of [3 × 3] and 0.5 × Layer × image filters, a ReLU layer, and lastly a max-pooling layer with a pool size and stride of [2 × 2]. Additionally, Bayesian optimization was used for parameter tuning. All parameter settings are shown in Table 2. For the regularization and initial learning rate we used a log transformation.
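The spatial shrinkage through the 4 convolutional blocks can be traced with a small shape calculator. This assumes 'same' padding for the [3 × 3] convolutions (so only the pooling changes the size), which the text does not state, and it omits the per-block filter counts since the wording above is ambiguous.

```python
def conv_block_shapes(h, w, blocks=4, pool=2):
    """Trace the feature-map height/width through the convolutional
    blocks described above: a 3x3 convolution (assumed 'same' padding,
    size unchanged), ReLU, then 2x2 max pooling with stride 2, which
    halves each dimension (floor division for odd sizes)."""
    shapes = []
    for _ in range(blocks):
        h, w = h // pool, w // pool
        shapes.append((h, w))
    return shapes

# A 90x90 input (roughly 3d x 3d for the 30-feature WDBC data) shrinks to:
shapes = conv_block_shapes(90, 90)   # [(45, 45), (22, 22), (11, 11), (5, 5)]
```

This kind of trace is useful for checking that the transformed images are large enough to survive four rounds of pooling.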
Initially, every dataset is divided into 80% training and 20% testing, and then 20% of the training data is kept aside as validation data. After 30 attempts on each dataset, we collected the best and average classification accuracies on the validation and test datasets, shown in Tables 3 and 4 respectively. Bold figures represent the overall best result. CNN types 1, 2 and 3 represent the equidistant bar graph, normalized distance matrix, and combined options respectively. px1 indicates that the image is formed with bars of 1-pixel width only; similarly, px2 and px4 indicate bar widths of 2 and 4 pixels respectively.
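The split scheme above can be sketched as follows; the shuffling, seed, and rounding behaviour are assumptions, as the text only gives the percentages.

```python
import numpy as np

def split_indices(n, seed=0):
    """80/20 train/test split, then 20% of the training portion held
    out as validation data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(0.2 * n)
    tst = idx[:n_test]
    rest = idx[n_test:]
    n_val = int(0.2 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, tst

# WDBC has 569 examples: 113 test, 91 validation, 365 training
train, val, tst = split_indices(569)
```

Note the validation set is 20% of the remaining 80%, i.e. 16% of the whole dataset, not 20% of it.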
Additionally, it is highly desirable in medical diagnosis to have high sensitivity and specificity measures. Sensitivity is the ability of a test to correctly identify those with the disease, and specificity is the ability to correctly identify those without the disease. Alternatively, the F1 score can be used as a derived metric that merges the sensitivity and precision measures. Tables 5 and 6 show the best and average values of these additional metrics respectively, for classification on the WDBC and WBC datasets. We also performed experiments using a CNN with 1-D convolutions on the raw data without any sophisticated data transformation. However, we obtained poor results compared to our method, with average classification accuracies of 76.11% and 89.64% for the WDBC and WBC datasets respectively. The comparison of our methods with other state-of-the-art methods is shown in Table 7. The table covers different methods from 2009-2019. The results show the accuracy, sensitivity and specificity on the WBC and/or WDBC datasets. The authors in [11] used mammogram images of breast cancer, as CNN works on images. In some cases, authors obtained 100% accuracy with 10-fold cross-validation for the WBC dataset; a lower fold of cross-validation generally gives lower accuracy [58,16,57].
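The three metrics follow directly from the confusion-matrix counts; the example counts below are hypothetical, chosen only to illustrate the formulas.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity (fraction of diseased cases correctly identified),
    specificity (fraction of healthy cases correctly identified), and
    F1 (harmonic mean of precision and sensitivity)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# e.g. 40 detected cancers, 2 missed, 70 correct all-clears, 1 false alarm
sens, spec, f1 = diagnostic_metrics(tp=40, fn=2, tn=70, fp=1)
```

Note that the F1 score ignores true negatives entirely, which is why sensitivity and specificity are reported alongside it.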

Discussion
The experimental results of the data transformation from non-image breast cancer datasets to images have been promising for the utilization of CNN for classification. Although the proposed methods are in their early stages, the obtained results are very significant for the development of new data wrangling strategies for deep learning. This also provides an opportunity to derive even better alternatives for CNN in the future. It was observed that our proposed combined approach, i.e., the Type-3 transformation with a bar width of 1 pixel (px1), was the most significant method, as it carries the most information about the data in the three channels of an image. It outperformed the other methods for the WDBC dataset, achieving 100% accuracy (with 1.0 sensitivity, specificity and F1 score). It also showed very competitive results for the WBC dataset, with 99.27% accuracy, 1.0 sensitivity, 0.99 specificity and a 0.99 F1 score.
As discussed in Section 3, different orders of bars for the Type-1 and Type-3 transformations produce different images. A bar represents the corresponding field value of a given sample. We have tried to bring related bars closer to each other by using a covariance matrix that determines the "closeness" of two fields. For example, Figure 11 shows the adjacency matrix of the covariance of each field for the WBC dataset. The data is arranged row-wise such that each value represents the rank of the i-th row with the j-th column for a given field. To get the "best" arrangement of fields, we minimize the total covariance rank by using the meta-heuristic algorithm GA to solve this shortest-path problem. The process of minimization for WDBC is shown in Figure 12, where the minimum rank is obtained by the end of the 10th generation. The dataset fields were then reorganized so that related fields were placed close to each other according to the order of their similarity. The final order of fields for WBC and WDBC produced through the minimum ranks is shown in Table 8. The images of these datasets were generated accordingly for the experiment. The only shortcoming of the CNN algorithm is its higher processing cost compared to other methods, especially with larger images. Generally, it takes 9-15 seconds for a MATLAB 2018 program to complete the training process on a DELL XPS i7-9700 @ 3GHz machine with 8 CPUs and an NVIDIA GEFORCE RTX 2060 GPU. Despite this, the experimental results demonstrate that the size of the data has no direct impact on the performance of CNN. Additionally, the advent of quantum computing [62] and parallel GPUs with sufficient memory can help produce results in a reasonable time frame. The data wrangling process of converting non-image data to images is not too expensive either. The every-case time complexity of the bar graph approach is of the order O(Nd) and that of the normalized distance matrix is of the order O(Nd^2). The details of the algorithms are given in the Appendix.

Conclusion
The objective of this paper was to process non-image data (in a non-time-series form) of the Breast Cancer datasets WDBC and WBC with CNN, owing to its state-of-the-art performance and elimination of manual feature extraction in image recognition applications. The utilization of CNN has been confined largely to image data, except for some domain-specific data conversion techniques such as NLP and voice recognition. We have proposed some novel approaches to convert numerical non-time-series data to image data. This conversion process is very straightforward, with an efficiency of order no more than O(Nd^2). The experimental results on classification accuracy show the competitiveness of these methods. There is also high potential for improving these approaches further to obtain even better results. For example, bar graphs with different shapes, sizes, colors and even arrangements can be tried. Similarly, the distance matrix can be enhanced to carry more information, such as the mean/variance of the neighboring elements. It remains to be seen how other applications with various types and orientations of numerical data would respond to CNN after non-image to image data conversion. Intuitively, more information in the data should produce better results, as observed with the combined approach. Finally, the classification of numerical data without any sophisticated data transformation on a 1-D CNN did not produce acceptable results.

Figure 1: Snapshot of data file for Breast Cancer dataset WBC from [10]

Figure 3: Bar graph for some data examples of the WDBC dataset.


Figure 4: Features learned by the first convolutional layer for the Breast Cancer dataset.

Figure 5: The normalized distance matrix for some data examples of the WDBC dataset.

Figure 6: Features learned by the first convolutional layer for the WDBC dataset with the normalized distance matrix.

Figure 7: Combined 3-layered matrix (colored image) for some data examples of the WDBC dataset.

Figure 8: Features learned by the first convolutional layer for the WDBC dataset with the normalized distance matrix.

Figure 9: Features learned by the 12th convolutional layer for the WDBC dataset.

Figure 10: The complete process of non-image data classification with CNN.

Figure 11: Ranking of co-variance for the WBC dataset in an adjacency matrix.

Figure 12: Minimization of total covariance for a given combination of fields for the WDBC dataset.

Table 2: Parameter settings for CNN.

Table 3: Best results obtained on classification accuracy.

Table 4: Average results for classification accuracy.

Table 5: Best scores with Type-3 on px1.

Table 6: Average scores with Type-3 on px1.

Table 7: Comparison of the proposed method with other methods.