Breath analysis based early gastric cancer classification using a deep stacked sparse autoencoder neural network

Deep learning is an emerging tool that is regularly used for disease diagnosis in the medical field. A new research direction has developed around the detection of early-stage gastric cancer. Computer-aided diagnosis (CAD) systems reduce the mortality rate due to their effectiveness. In this study, we propose a new feature-extraction method that uses a stacked sparse autoencoder to extract discriminative features from unlabeled breath-sample data. A Softmax classifier was then integrated into the proposed feature-extraction method to classify gastric cancer from the breath samples. Specifically, we identified fifty peaks in each spectrum to distinguish EGC patients, AGC patients, and healthy persons. This CAD system reduces the distance between the input and output by learning the features, and preserves the structure of the input data set of breath samples. The features were extracted from the unlabeled breath-sample data. After the unsupervised training was complete, the autoencoders were cascaded with a Softmax classifier to develop a deep stacked sparse autoencoder neural network. Finally, the developed neural network was fine-tuned with labeled training data to make the model more reliable and repeatable. The proposed deep stacked sparse autoencoder neural network architecture exhibits excellent results, with an overall accuracy of 98.7% for advanced gastric cancer classification and 97.3% for early gastric cancer detection using breath analysis. Moreover, the developed model produces excellent recall, precision, and F1-score values, making it suitable for clinical application.

It is difficult to reach a conclusion when a patient's symptoms are complex and contradictory. The physician evaluates the observations and makes a decision based on his or her understanding and analysis of the patient's data. In ancient Greece, physicians used breath odor to diagnose different diseases13. Previous studies have confirmed that breath gas is a complex mixture containing more than 3000 VOC biomarkers14-16. These VOCs change their properties during metabolism and hence can be used as cancer VOC biomarkers for the detection of GC17. Lung cancer and GC have both been diagnosed by breath analysis18. In previous studies, authors have focused on the early warning of different cancers.
Breath analysis is gaining attention for disease diagnosis because it is noninvasive in nature. Breath analysis can produce accurate and reproducible results without harming the patient during diagnostic tests. VOCs are measured from the breath to distinguish patients from healthy populations. VOC biomarkers reflect cellular metabolite levels in disease states, which are transferred to the blood, urine, and saliva. These VOCs are responsible for the disease state discerned in the breath.
Computer-aided diagnosis techniques have been developed by computer scientists to help physicians in the course of decision making19. In general, pathologists analyze the whole image to observe abnormality in a specific cell or tissue. Moreover, the human eye is less adept at recognizing these changes. Therefore, there is a pressing need for sophisticated methods that can help pathologists diagnose the disease with ease. In this study, the authors propose a computer-aided method using deep learning for breath analysis in gastric cancer classification, which can overcome the above-stated problems.
Deep learning is a subset of artificial intelligence20. Deep learning methods have been used for medical diagnosis, robotics, computer vision, bioinformatics, audio and speech recognition, and industrial applications21. SSAE, CNN, DBN, and recurrent neural networks are some basic deep learning techniques that have been used to achieve state-of-the-art results on several tasks22. The success of many deep learning applications depends on big data, a prerequisite that converges the fields of data analytics and deep learning23. The use of deep learning based algorithms has improved the accuracy of cancer prediction outcomes by 15-20%24.
The deep convolutional neural network is commonly used for two-dimensional data. Most deep learning based systems have been developed for breast cancer; very few authors have established systems for the prognosis of GC. Most authors have worked on medical images, including histopathological images, PET images, X-ray, MRI, and CT images25. A multistage GC detection system using pathological images was developed by Oikawa et al., which achieved a 14.1% false rate26. In the first stage, they used an SVM to extract handcrafted features at low resolution; in the next stage, a CNN made the final decision. Another CNN based GC classification system for histopathological images was developed by Xu et al.27. Their system was based on segmentation and classification: they first created small patches and then used a CNN to classify the epithelial and stromal tissues. Wang et al.28 also presented a GC classification system; their work is very similar to Xu's. They likewise created patches and then used a CNN to classify them. Li et al.29 proposed a deep learning system for the classification of GC with shallow and deep layers, and achieved 100% accuracy for slice-based classification. However, they did not mention how many patients belonged to the early stage, even though early-stage diagnosis is very important. Daniel et al.30 proposed a GC classification system using VOC biomarkers and achieved an accuracy of 93% with an artificial neural network (ANN) trained by the backpropagation algorithm. They also did not define the EGC and AGC groups; although they achieved very good accuracy, there is still room for better results. A few studies have already been carried out using miRNA, but its use in clinical applications is very limited because of its low sensitivity31-34.
Therefore, they cannot discriminate between benign and malignant samples at the early stages. A few authors have worked on gene expression signatures for the prognosis and detection of cancer35-37. These studies have potential, but at the same time they carry the limitations of microarrays, because of which they are not preferred in the clinic.
Deep learning has already shown its effectiveness in several fields in recent years. The first deep autoencoder-based neural network was presented by Hinton et al.38. Xu et al.40 used a two-layer autoencoder for feature learning, with a Softmax classifier to classify the images. Feng et al. proposed a method in which features were extracted from histopathological images using a deep manifold preserving autoencoder41. These features were learned from unlabeled data. Like Xu's work, they used a Softmax classifier in the last layer to classify the images. Their work is also based on breast cancer classification.
In this study, we have proposed and developed a deep learning based neural network that can distinguish between healthy people and cancerous patients. The proposed method can also distinguish between AGC and EGC. Because the diagnosis of EGC is very difficult, we developed a CAD system to overcome this problem. The proposed deep stacked sparse autoencoder neural network architecture exhibits excellent results, with an overall accuracy of 96.3% for gastric cancer classification and 97.4% for early gastric cancer detection using breath analysis. The algorithm can be trained in less time. Moreover, the developed breath analysis based neural network has outperformed all other existing techniques to date.

Results.
In this work, we have developed several deep neural networks based on stacked sparse autoencoders. This study aims to develop a deep neural network architecture that can distinguish early-stage gastric cancer patients from healthy persons. We developed and studied different hidden-layer schemes in each deep neural network to visualize and analyze their effect on feature extraction. Based on this investigation of different hidden-layer schemes, we evaluated the performance of each deep neural network. We also developed K-NN, Support Vector Machine, Linear Discriminant, and Decision Tree based neural networks for classification. To compare these networks with the deep stacked sparse neural network, we developed and investigated all the networks listed in Table 1. The proposed deep stacked sparse autoencoder outperformed all the other developed neural network models. All the deep neural networks in this study consist of two hidden layers, each comprising a different number of neurons. The objective of this study is to build a neural network architecture that can classify early-stage gastric cancer with high accuracy. The training and validation data were fixed at 70% and 15% of the total data, respectively, while the test data were fixed at the remaining 15%.
The first model was developed with autoencoders of size [100 20]: the first hidden layer has 100 neurons and the second hidden layer has 20 neurons. This model produces an overall accuracy of 81.5%. The second model was developed with autoencoders of size [100 40], where 100 and 40 are the numbers of neurons in the 1st and 2nd hidden layers, respectively. This model produces an overall accuracy of 96.5% and misclassified only three samples of early-stage gastric cancer. Moreover, it predicts healthy people and advanced gastric cancer with very good accuracy, yielding 92.2%, 97.3%, and 98.7% for the EGC, Healthy, and AGC classes, respectively. The third model was developed with autoencoders of size [100 60] and produces an overall accuracy of 84.2%. This model was unable to distinguish between gastric cancer patients and healthy persons precisely, producing a misclassification rate of 12.7% in the healthy-person class; because of this high error rate, it cannot be used in clinical applications. The fourth and fifth models were developed with hidden layers of size [100 80] and [100 100], respectively, and produce overall accuracies of 88.3% and 90.2%.
The Area Under the Curve (AUC) was calculated to measure performance across all possible classification thresholds. The maximum AUC value was found for the second developed model, and the minimum for the first. All the AUC values for the different models are shown in Table 1.
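As an illustrative sketch (not the authors' MATLAB implementation), the AUC for one class treated one-vs-rest can be computed directly from its rank-statistic definition: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of positive/negative pairs where the
    positive sample receives the higher score (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect classifier yields 1.0 and chance-level scoring yields 0.5, which is why the second model's large AUC indicates good separability across thresholds.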
Figure 1 presents the results in the form of confusion matrices obtained from the experiments. We trained on the breath samples to distinguish early-stage gastric cancer patients from healthy persons with different neural networks. Figure 1 shows the training results of the different developed neural networks: the confusion matrices of the training data, validation data, and test data are shown separately, while the last confusion matrix of each set in Fig. 1 shows the overall accuracy of that particular deep neural network.
The Receiver Operating Characteristic (ROC) curve is an important tool for the evaluation of a neural network. The ROC curve for the DSSAENN is shown in Fig. 2. The ROC curves were used to visualize the performance of each developed deep neural network. These curves indicate how well each model distinguishes between the classes: the larger the area under the curve, the better the performance of the model, while a small area indicates poor performance. In this study, we used an autoencoder to extract the features from breath; in the near future, we will extract the features using a Convolutional Neural Network (CNN), since feature extraction plays an important role in the performance of the neural network. Subsequently, a computer-aided diagnosis system based on a Deep Stacked Sparse Autoencoder Neural Network on a Field Programmable Gate Array or other embedded system remains an exciting task, and the hardware implementation of such systems can support medical professionals in the diagnosis of EGC. We are also developing a communication link so that the standalone application can be used in remote areas as well42. In the near future, we will develop a neural network on a single chip, combined with breath-diagnosis sensors, to precisely diagnose early gastric cancer in remote areas via the Internet of Things.

Patients. This study was carried out under the guidelines of Reporting Recommendations for Tumor Marker
Prognostic Studies (REMARK). All the breath samples were collected at Shanghai Tongren Hospital, Shanghai, China. All individuals were informed in advance about the conduct of the clinical research. This study was approved by the ethics committee of Shanghai Jiao Tong University. There were 200 volunteers, comprising 55 EGC patients, 56 healthy persons, and 89 AGC patients. Three criteria were followed while collecting the breath samples: (1) individuals had already undergone clinical diagnosis of GC, by either biopsy or endoscopy; (2) patients with other malignancies were excluded; (3) patients with metabolic diseases, mainly diabetes, were excluded. Table 2 shows the clinical characteristics of the volunteers.
The AJCC Cancer Staging Manual was used for the GC stages. Age and gender have no impact on the EGC patients, AGC patients, and healthy persons; therefore, we excluded this information to make the results less biased. All the volunteers were asked to clean their mouths and refrain from eating and drinking for about 1 h. We assigned 75% of the breath samples to the training set and 25% to the validation set. In this study, we used the spectral region from 400 to 1500 nm for modeling.

Data augmentation. The stability of any neural network depends on how well the model has learned the internal characteristics of the input data. As the number of samples increases, the stability of the neural network also increases, and a sufficient amount of data is needed to avoid overfitting and underfitting problems. In this study, we used a data augmentation technique to produce additional data. The input to the proposed architecture is one-dimensional and contains the entire spectrum, which is very large. Therefore, we cropped the spectrum and used 1200 values from each of the 200 spectra. The dataset can be expanded by shifting breath samples either to the right or to the left; in this study, we shifted the breath samples to the right by 2 cm −1 . There were 1200 spectral values for each breath sample, for a total of 240,000 spectral values; after performing the data augmentation, there were 368 samples and a total of 453,600 spectral values in the input data.

Data preprocessing. The breath samples collected from the hospital were affected by some noise, and preprocessing steps were carried out on them to make them useful, since this noise may lead to wrong classification. We eliminated and reduced the irrelevant and random variations in the breath samples. Spikes were generated as the breath sample hit the detector.
These spikes have a narrow bandwidth with positive peaks. They are random in nature and are produced at different positions on the sensors. The bandwidth of a spike is very small compared with the Raman spectrum, and we removed the spikes from the Raman samples using this assumption. The remaining noise is of high frequency. Several techniques have been proposed to remove such noise from data; in this study, we used a median filter, which eliminates the noise effectively.
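A minimal sketch of the despiking idea in Python (not the authors' pipeline; the window size and threshold are illustrative assumptions): points far above a running median are treated as spikes and replaced by the local median.

```python
import numpy as np

def despike(spectrum, window=5, threshold=3.0):
    """Replace narrow positive spikes with the local running median.
    A point counts as a spike when it exceeds the running median by
    `threshold` times the median absolute deviation of the residuals."""
    pad = window // 2
    padded = np.pad(spectrum, pad, mode='edge')
    med = np.array([np.median(padded[i:i + window])
                    for i in range(len(spectrum))])
    resid = spectrum - med
    mad = np.median(np.abs(resid)) + 1e-12   # robust spread estimate
    spikes = resid > threshold * mad
    out = spectrum.copy()
    out[spikes] = med[spikes]
    return out
```

Because a spike spans only a few points, the median inside the window is unaffected by it, which is exactly the narrow-bandwidth assumption stated above.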
Baseline correction is an essential part of preprocessing to avoid the leftover background problem, which is produced by the negative values of the spectra. Baseline correction does not reduce the Raman band signal strength. The Labspec5 software was used for the baseline correction of each spectrum, and smoothing of the spectra was also carried out with Labspec5.
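Since Labspec5 is proprietary, the following is only a rough stand-in sketch of iterative polynomial baseline correction (the polynomial degree and iteration count are assumptions), shown in Python for illustration:

```python
import numpy as np

def baseline_correct(spectrum, degree=3, iterations=10):
    """Iteratively fit a low-order polynomial baseline: after each fit,
    clamp the working copy to the fitted curve so peaks stop dragging
    the baseline upward, then subtract the final baseline."""
    x = np.arange(len(spectrum), dtype=float)
    y = spectrum.astype(float).copy()
    for _ in range(iterations):
        coeffs = np.polyfit(x, y, degree)
        base = np.polyval(coeffs, x)
        y = np.minimum(y, base)      # suppress peaks for the next fit
    return spectrum - base
```

The clamping step is what keeps Raman band intensities intact: peaks are excluded from the fit, so subtracting the baseline removes only the slowly varying background.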

Feature extraction.
We defined a total of 1200 breath features from the breath samples of each individual.
The Raman spectral features include the spectral pattern, number of bands, peak positions, peak widths, peak areas, and so on. These parameters play an important part in the interpretation of Raman spectra. We extracted the dominant peaks with in-house data analysis software developed in MATLAB 2017b and identified fifty peaks in each sample. These fifty peaks were used as the input to the proposed Deep Stacked Sparse Autoencoder Neural Network (DSSAENN) to train the desired model.
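The peak-picking step can be sketched as follows; this illustrative Python is not the in-house MATLAB software, and the simple local-maximum rule is an assumption:

```python
import numpy as np

def top_peaks(spectrum, n_peaks=50):
    """Return the (sorted) indices of the n_peaks tallest local maxima."""
    interior = spectrum[1:-1]
    is_max = (interior > spectrum[:-2]) & (interior > spectrum[2:])
    candidates = np.where(is_max)[0] + 1      # shift back to full-array indices
    tallest_first = candidates[np.argsort(spectrum[candidates])[::-1]]
    return np.sort(tallest_first[:n_peaks])
```

Applied to each preprocessed spectrum, the fifty returned peak positions would form the fixed-length input vector for the network.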

Autoencoder (AE).
An autoencoder is an unsupervised machine learning tool. It develops a better feature representation for high-dimensional input data by finding correlations within the input. It is a multilayer feed-forward neural network that learns to reproduce its input using backpropagation. Figure 3 shows the basic structure of an autoencoder. The autoencoder minimizes the difference between the input and the reconstructed data with the help of backpropagation.
Here x is the input data, z is a latent representation, s is an activation function, f denotes the encoder function, g denotes the decoder function, w is the weight, h is an approximation of the identity function, and b represents the bias values.
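To make the notation concrete, here is a minimal forward pass with sigmoid activation s, encoder f, and decoder g; the layer sizes and random weights are purely illustrative (in the paper, w and b are learned by backpropagation):

```python
import numpy as np

def sigmoid(a):                      # activation function s
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_in, d_latent = 50, 10              # assumed sizes for illustration
W1 = rng.normal(scale=0.1, size=(d_latent, d_in)); b1 = np.zeros(d_latent)
W2 = rng.normal(scale=0.1, size=(d_in, d_latent)); b2 = np.zeros(d_in)

def f(x):                            # encoder: z = s(W1 x + b1)
    return sigmoid(W1 @ x + b1)

def g(z):                            # decoder: x_hat = s(W2 z + b2)
    return sigmoid(W2 @ z + b2)

x = rng.random(d_in)                 # stand-in breath feature vector
z = f(x)
x_hat = g(z)                         # h(x) = g(f(x)) approximates x
mse = np.mean((x - x_hat) ** 2)      # quantity minimized during training
```

Training drives `mse` toward zero, so h = g ∘ f approximates the identity while z is forced through the narrower latent layer.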
Basic Sparse Autoencoder (SAE). The basic structure of the Sparse Autoencoder (SAE) for high-level feature learning in breath analysis is shown in Fig. 4. The SAE learns a high-dimensional structured feature representation of cancerous or non-cancerous data using an unsupervised feature learning algorithm. At the input layer, the encoder transforms the input x into the corresponding representation h; the hidden layer h represents the input data with new features. The decoder at the output layer then reconstructs an approximation x̂ of the input from the hidden representation h. The autoencoder is trained to minimize the discrepancy between the input x and its reconstruction x̂, so as to attain the optimal parameter values; this minimization is carried out with the backpropagation algorithm. The cost function of the SAE is shown in Eq. (4), which comprises three terms42,43.
The first term, the average sum of squared errors, measures the discrepancy between the input and its reconstructed representation. In the second term of Eq. (4), n represents the number of hidden layers and the index j sums over the hidden units of the network. The parameter ɑ sets the sparsity value; in general, this value is approximately, but not exactly, zero. The target activation of h is represented by p, and the average activation of the j-th hidden unit over the n training data is denoted by ρ̂. This is calculated by the following formula.
KL(p||ρ̂j) is the Kullback-Leibler divergence function43, defined as KL(p||ρ̂j) = p log(p/ρ̂j) + (1 − p) log((1 − p)/(1 − ρ̂j)). The KL divergence function measures the difference between two distributions. The third term is a weight-decay term, which helps prevent overfitting of the model by keeping the weights small.
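The sparsity penalty can be written directly from this definition; a short sketch (the clipping is an added numerical safeguard, not part of the formula):

```python
import numpy as np

def kl_sparsity(p, p_hat):
    """Sum of KL(p || p_hat_j) over hidden units: large when the average
    activations p_hat stray from the sparsity target p."""
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 1e-8, 1 - 1e-8)
    return float(np.sum(p * np.log(p / p_hat)
                        + (1 - p) * np.log((1 - p) / (1 - p_hat))))
```

The penalty is zero when every hidden unit's average activation equals the target p, and grows as the units become less sparse.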
Here, the number of layers is represented by nl and the number of neurons in layer l by sl. The connection between the i-th neuron of layer l − 1 and the j-th neuron of layer l is denoted by the term w(l)i,j. Let X = {x(1), x(2), x(3), x(4), x(5), x(6), …, x(N)}T be the entire unlabeled dataset used for training in this study. Here x(k) ϵ Rdn, N is the total number of breath samples, and dn is the total number of attributes in each breath sample. The learned high-level features at the 1st layer for the k-th breath sample are represented as h(1)(k) = {h(1)1(k), h(1)2(k), …, h(1)dk(k)}T, where dk represents the number of hidden units at the 1st layer. We use superscript notation to denote hidden layers and subscript notation to denote units throughout the manuscript. In the following figure, h(1)j indicates the j-th unit in the 1st layer. For simplicity, x and h(1) denote the input breath sample and its representation at the 1st layer, respectively.

Stacked Sparse Autoencoder (SSAE).
We developed a stacked sparse autoencoder by cascading multiple layers of basic sparse autoencoders, where the output of each layer is fed as the input to the successive layer. In this study, we constructed two layers of sparse autoencoders to develop a two-layer stacked sparse autoencoder neural network. The basic structure of the stacked sparse autoencoder neural network is shown in Fig. 5. The first layer is the input layer, the last layer is the output layer, and the hidden layers act as a bridge between them. There were dx = 50 * 200 input units in the input layer. The first hidden layer has dh(1) = 100 units and the second hidden layer has dh(2) = 50 units.
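Greedy layer-wise construction can be sketched as below. This tied-weight toy omits the sparsity and weight-decay terms of Eq. (4) and uses assumed sizes, so it illustrates only the stacking mechanics, not the trained DSSAENN:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sparse_ae(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one tied-weight autoencoder on mean squared
    reconstruction error with plain batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, d))
    b1, b2 = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W.T + b1)          # encode
        X_hat = sigmoid(H @ W + b2)        # decode with tied weights
        d_out = (X_hat - X) * X_hat * (1 - X_hat) / n
        d_hid = (d_out @ W.T) * H * (1 - H)
        W -= lr * (H.T @ d_out + d_hid.T @ X)
        b2 -= lr * d_out.sum(axis=0)
        b1 -= lr * d_hid.sum(axis=0)
    return W, b1

def stack_layers(X, layer_sizes):
    """Greedy layer-wise training: each layer's hidden activations
    become the training input of the next layer."""
    feats, layers = X, []
    for n_hidden in layer_sizes:
        W, b1 = train_sparse_ae(feats, n_hidden)
        feats = sigmoid(feats @ W.T + b1)
        layers.append((W, b1))
    return feats, layers
```

After this unsupervised phase, the stacked encoder weights would be cascaded with the Softmax layer and fine-tuned with the labeled data, as described in the paper.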
Softmax layer. The stacked autoencoders train each layer of the network using unlabeled data, as the SAE belongs to the category of unsupervised learning algorithms. The reconstruction of the input is provided by a feature vector, which is fed into a classifier so that the stacked sparse autoencoder's input data can be classified. Logistic regression is commonly used for supervised classification with two classes at the output. Since we have three classes at the output, we cannot use logistic regression; instead, we used a Softmax classifier because of its multiclass classification property. Softmax classification is a modified form of logistic regression that generalizes it to multiple classes.

Implementation. The proposed deep neural network was developed and tested using MATLAB 2017b (MathWorks, Natick, MA, United States) for the classification of gastric cancer. The network was trained on a Core i5-2350M CPU at 2.3 GHz. The initial learning rate was set to 0.0001 after trial runs. The neural network converged after 1000 epochs. The values of all the parameters used in this study are shown in Table 3.
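The multiclass mapping performed by the Softmax layer can be sketched as follows; the logit values are arbitrary illustrations, and the class order 0 = EGC, 1 = AGC, 2 = healthy follows the paper:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: exponentiate shifted logits and
    normalize so the class scores sum to one."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical final-layer outputs
probs = softmax(logits)
predicted = int(np.argmax(probs))    # 0 = EGC, 1 = AGC, 2 = healthy
```

Because the outputs form a probability distribution over the three classes, the layer generalizes two-class logistic regression exactly as described above.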
Performance evaluation. This study aims to develop a classifier that can distinguish between EGC patients, AGC patients, and healthy persons. We developed the DSSAENN and compared it against two other methods: (1) a Softmax classifier alone and (2) SAE + Softmax classifier. The Softmax classifier alone was used to classify the raw data, whereas in the SAE + SMC based neural network the features were learned by the SAE and then acted as the input to the Softmax classifier, which classified EGC, AGC, and healthy persons. In this study, GC classification is a three-class problem: three possible results, 0, 1, and 2, can occur at the output of the classifier, regarded as EGC, AGC, and healthy, respectively. The classification results of each developed model were calculated in terms of F1 score, recall, specificity, sensitivity, and detection rate.
Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. A good classifier should have high accuracy, but at the same time the precision and recall should also be high44. If any of these criteria is not fulfilled, the designed model is not accurate and cannot be used in clinical applications.
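For a three-class problem, these quantities are computed one-vs-rest per class; an illustrative sketch:

```python
def per_class_metrics(y_true, y_pred, cls):
    """One-vs-rest precision, recall, and F1 score for class `cls`."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Computing the metrics separately for the EGC, AGC, and healthy classes exposes weaknesses (such as the third model's healthy-class misclassification) that a single overall accuracy figure would hide.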
Informed consent. Informed consent was obtained from all individual participants included in this study.

Data availability
The dataset used in this study can be obtained from the corresponding author on reasonable request.