Apple quality identification and classification by image processing based on convolutional neural networks

This work investigated apple quality identification and classification from real images containing complicated disturbance information (the background was similar to the surface of the apples). A novel model based on convolutional neural networks (CNN) was proposed for accurate and fast grading of apple quality. The proposed model captured specific, complex, and useful image characteristics for detection and classification. Compared with existing methods, the proposed model could better learn high-order features of two adjacent layers that were not in the same channel but were closely related. The proposed model was trained and validated, with best training and validation accuracies of 99% and 98.98% at the 2590th and 3000th steps, respectively. The overall accuracy of the proposed model on an independent test set of 300 apples was 95.33%. The results showed that the training accuracy, overall test accuracy, and training time of the proposed model were better than those of the Google Inception v3 model and of a traditional image processing method that merged histogram of oriented gradient (HOG) and gray level co-occurrence matrix (GLCM) features and used a support vector machine (SVM) classifier. The proposed model has great potential for apple quality detection and classification.

Focal length = 4 mm, Redmi Note 4X, China) 28, which were then divided into three categories: premium, middle, and poor grade. The image-capture process using the mobile camera is shown in Fig. 1.
The problem of an insufficiently large training dataset was addressed with data augmentation 29. When data are scarce, augmentation is a direct and effective way to increase the diversity of training samples, improve the robustness of the model, and avoid overfitting. Altering the training samples reduces the dependence of the model on particular attributes or features, and new, useful samples are created from the existing dataset. As a result, the performance and robustness of the model were improved. To reduce the calculation time of data augmentation, the original pictures were scaled down to a quarter of their original size (a resolution of 780 × 1040). A few sample images from each class are shown in Fig. 2.
In this work, several augmentation techniques that do not change the semantics of the images were applied to each scaled-down image: adding salt-and-pepper noise, adding Gaussian noise, flipping, rotation, brightening, and darkening. The density of the salt-and-pepper noise was 0.3. Adding Gaussian noise effectively prevented the neural network from fitting all the features of the input image. Horizontal flips were used. The brightness and darkness factors were 1.5 and 0.9, respectively. Rotation was applied to each sub-image, generating five further sub-images at 60°, 120°, 180°, 240°, and 300°. Finally, 36,000 apple sub-images were obtained. Augmented sub-images are shown in Fig. 3. The training and validation sets were independent and randomly sampled from the 36,000 apple sub-images in proportions of 80% and 20% (28,800 for training, 7200 for validation).
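The noise, flip, and brightness operations listed above can be sketched with NumPy as follows. The noise density (0.3) and the brightness and darkness factors (1.5 and 0.9) come from the text. The equal split between salt and pepper pixels, the Gaussian standard deviation (`sigma=10.0`), the 8-bit value range, and all function names are assumptions made for illustration. Rotation is left out because the interpolation and border handling used in this work are not described.

```python
import numpy as np

def salt_and_pepper(img, density=0.3, rng=None):
    """Set a `density` fraction of pixel locations to black or white (assumed 50/50 split)."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    r = rng.random(img.shape[:2])                      # one random draw per pixel location
    out[r < density / 2] = 0                           # pepper
    out[(r >= density / 2) & (r < density)] = 255      # salt
    return out

def gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise; sigma is an assumed value, not taken from the text."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def scale_brightness(img, factor):
    """Multiply intensities by `factor` (1.5 to brighten, 0.9 to darken) and clip to 8 bits."""
    return np.clip(img.astype(np.float64) * factor, 0, 255).astype(np.uint8)

# Usage on a synthetic uniform image (a placeholder, not data from this work).
rng = np.random.default_rng(0)
image = np.full((100, 100, 3), 100, dtype=np.uint8)
augmented = [
    salt_and_pepper(image, 0.3, rng),
    gaussian_noise(image, 10.0, rng),
    horizontal_flip(image),
    scale_brightness(image, 1.5),
    scale_brightness(image, 0.9),
]
```

Each function returns a new array with the same shape as its input, so the operations can be chained or applied independently to the same scaled-down image.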
Overall identification architecture. A typical convolutional neural network (CNN) consists of an input layer, convolutional layers, fully connected layers, and an output layer. The advantage of convolution lies in local receptive fields and shared weights, in contrast to the artificial neural network (ANN), in which all neurons are connected. In this way, the number of training parameters of the network is substantially decreased 30. A CNN-based identification architecture composed of an input layer, 6 convolutional layers (convolution and pooling operations), 2 fully connected layers, and an output layer was developed for apple quality recognition. The specific configuration of the proposed model is shown in Table 1.
In the first convolutional layer, in order to acquire high-level features, the convolutional kernel shape was 5 × 5, the number of convolution kernels was 8, and the stride of the convolution kernel was 1. After the convolution operation, the spatial size of the input image did not change because the SAME padding mode was selected, while the number of channels increased from 3 to 8. With SAME padding, the original features of the input images were not lost during the convolution operation. To reduce the dimensions of the images, the pooling kernel shape was 3 × 3 and the stride of the pooling kernel was 2. Inspired by the Hebbian theory 31, the kernel shape of convolution layer 3 was 1 × 1. The 1 × 1 convolution kernel could connect highly correlated features in the same spatial location but in different channels. This led to a large difference in feature information between adjacent pixels. Therefore, pooling was not used in this layer.
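The shape arithmetic of the first convolution and pooling stage can be checked with a short sketch. It assumes the 208 × 208 × 3 input size reported in the experimental details and the usual SAME-padding rule, in which the output size is the input size divided by the stride, rounded up. That the pooling layer also uses SAME padding is an assumption (the text states it only for the convolution), and the function names are illustrative.

```python
import math

def same_output_size(size, stride):
    """Spatial output size under SAME padding: ceil(size / stride)."""
    return math.ceil(size / stride)

def conv_params(kernel, in_channels, out_channels):
    """Trainable parameters of a conv layer: kernel weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

after_conv1 = same_output_size(208, stride=1)          # 5 x 5 convolution, stride 1
after_pool1 = same_output_size(after_conv1, stride=2)  # 3 x 3 pooling, stride 2 (padding assumed)
params_conv1 = conv_params(kernel=5, in_channels=3, out_channels=8)

print(after_conv1, after_pool1, params_conv1)  # 208 104 608
```

The 608 parameters of this layer illustrate the effect of weight sharing: a fully connected layer mapping the same input to the same output size would need orders of magnitude more weights.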
Network updating process. The proposed model was trained using the error back-propagation algorithm, which is divided into two processes: feed-forward propagation and backward propagation. The flowchart of the updating process for the proposed model is shown in Fig. 4.
Feed-forward propagation. The convolution operation is inseparable from the mathematical concept of discrete convolution, which is defined by Eq. (1), where x(n) and h(n) represent discrete sequences and y(n) represents the new sequence obtained by convolution.
The convolution of the two-dimensional discrete functions f(x, y) and g(x, y) is defined by Eq. (2).
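The typeset forms of Eqs. (1) and (2) are not reproduced in this text. The standard textbook definitions of one- and two-dimensional discrete convolution, written with the symbols described above, are given here as an assumed reconstruction rather than the authors' exact notation; the summation indices i, m, and n are illustrative names.

```latex
\begin{align}
y(n) &= \sum_{i=-\infty}^{\infty} x(i)\, h(n-i) \tag{1} \\
f(x,y) * g(x,y) &= \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} f(m,n)\, g(x-m,\, y-n) \tag{2}
\end{align}
```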
The output of a convolution layer neuron is defined by Eq. (3), where act represents the activation function, and x_ij, θ, and b represent the pixel in row i and column j, the convolution kernel, and the offset value, respectively. The outputs of the convolution layer and pooling layer are obtained by Eqs. (4) and (5), respectively.
where S_j represents the pooling output of the j-th feature, and f, P, and b represent the activation function, down-sampling function, and offset value, respectively. After processing by the pooling layer, a series of feature maps was obtained. The pixels were taken out of the feature maps in order and arranged into a vector. This method is called rasterization, and it is defined by Eq. (6).
where O_k represents the rasterized vector. Finally, the rasterized vectors were input to the fully connected layer and the classification results were obtained.
Backward propagation. Backward propagation is mainly the propagation of errors. The error vector δ_k of rasterization is defined by Eq. (7).
The error vectors of the pooling layer and convolution layer are given by Eqs. (8) and (9), respectively, where m and p represent the pooling and convolution counts, respectively; k and p represent the error vectors of the pooling layer and convolution layer, respectively; and F represents the up-sampling function. The weight update of a certain region C in convolutional layer q is calculated by Eq. (10).
where E and θ_q represent the error function and the weight, respectively; rot180 represents rotation of the matrix by 180°; and O_p and q represent the pooling output and the sum of all bias gradients, respectively. The final propagation error p is defined by Eq. (11).
Apple quality identification using two other methods. To compare the performance of the proposed CNN-based architecture for apple quality identification, the Google Inception v3 model 32 was applied to the same dataset. In this model, the fully connected layer is replaced by global average pooling to reduce the computational complexity. The main contribution of Google Inception v3 is the Inception module. Inception v3 is the most classic and stable model of the GoogLeNet family and contains 10 Inception modules. The accuracy of the model is improved by increasing the depth and width of the network while reducing the parameters in the Inception module. The Inception structure is composed of convolution operations with 1 × 1, 3 × 3, and 5 × 5 kernels and pooling operations with 3 × 3 filters, which increases the adaptability of the network to scale. The structure diagram of Inception is shown in Fig. 5.
Similarly, a traditional method was applied for apple quality identification in this study. The traditional method converts image data from two-dimensional gray space to target pattern space. As a result of classification, the image is divided into several subareas of different categories according to different attributes. Generally, the difference in properties between image regions after classification should be as large as possible, and the internal properties of each region should be stable. The flowchart of the traditional method is shown in Fig. 6.
The GLCM of an apple image reflects comprehensive information about the image gray levels with respect to direction, adjacent interval, and amplitude of change, which is the basis for analyzing the local patterns of the image and their arrangement rules. The GLCM texture description method studies the spatial dependence of gray levels in image texture 33.
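As an illustration of how a GLCM captures direction and adjacent interval, the sketch below counts gray-level pairs for one pixel offset and derives the four descriptors adopted later in this work (angular second moment, entropy, contrast, and correlation). The formulas are the common Haralick-style definitions, which the text does not spell out. The number of gray levels, the offset, the symmetric normalization, the base-2 logarithm, and the function names are all assumptions.

```python
import numpy as np

def glcm(gray, dx=1, dy=0, levels=8):
    """Normalized, symmetric co-occurrence matrix for the pixel offset (dx, dy)."""
    g = np.asarray(gray)
    h, w = g.shape
    counts = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[g[y, x], g[y2, x2]] += 1
    counts += counts.T                      # count each pair in both directions
    return counts / counts.sum()

def texture_features(p):
    """Angular second moment, entropy, contrast, and correlation of a normalized GLCM."""
    i, j = np.indices(p.shape)
    asm = np.sum(p ** 2)
    nonzero = p[p > 0]
    ent = -np.sum(nonzero * np.log2(nonzero))
    con = np.sum((i - j) ** 2 * p)
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    if sd_i * sd_j > 0:
        cor = np.sum((i - mu_i) * (j - mu_j) * p) / (sd_i * sd_j)
    else:
        cor = float("nan")                  # undefined for a perfectly uniform image
    return {"asm": asm, "ent": ent, "con": con, "cor": cor}

# Usage on a tiny 4-level image (a toy example, not data from this work).
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
features = texture_features(glcm(toy, dx=1, dy=0, levels=4))
```

Repeating the computation for several offsets, for example horizontal, vertical, and the two diagonals, gives one set of descriptors per direction, from which an average and a variance can be taken.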
The features of the apple images were extracted with the HOG method by calculating and counting the gradient direction histograms of local areas of the images. To improve the robustness of the HOG features to changes in illumination, square-root Gamma compression was used for normalization. The normalized images were convolved with the one-dimensional discrete differentiation kernel [−1, 0, 1] to obtain the gradient components in the horizontal and vertical directions. From the horizontal and vertical gradients at each image pixel, the gradient amplitude and gradient direction of the pixel were obtained, and the gradient direction histogram was constructed.
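The gradient step described above can be sketched as follows: square-root compression, centered differences with the [−1, 0, 1] kernel, and a magnitude-weighted orientation histogram. The use of unsigned orientations in the range 0° to 180°, the 9 histogram bins, the zero gradient at the image border, and the function names are assumptions borrowed from common HOG practice rather than details stated in the text.

```python
import numpy as np

def gradients(gray):
    """Gradient magnitude and unsigned direction after square-root gamma compression."""
    img = np.sqrt(np.asarray(gray, dtype=np.float64))
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # [-1, 0, 1] applied horizontally
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # [-1, 0, 1] applied vertically
    magnitude = np.hypot(gx, gy)
    direction = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, direction

def orientation_histogram(magnitude, direction, bins=9):
    """Histogram of gradient directions, each pixel weighted by its gradient magnitude."""
    hist, _ = np.histogram(direction, bins=bins, range=(0.0, 180.0), weights=magnitude)
    return hist

# Usage on a synthetic horizontal ramp (a toy image, not data from this work).
ramp = [[x * x for x in range(6)] for _ in range(6)]
mag, ang = gradients(ramp)
hist = orientation_histogram(mag, ang)
```

For the ramp, every interior gradient points horizontally, so all of the histogram weight falls into the first bin. In a full HOG descriptor the same histogram would be computed per cell and normalized per block.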

Results and discussions
Experimental details and results of proposed CNN-based architecture. Parameter selection methods for training CNN models vary widely in the literature. However, some basic principles still need to be observed when setting the parameters. In this work, the cross-entropy function was selected as the loss function. The Adam optimizer was selected because the Adam algorithm makes the updates of weights and offsets more stable. The size of the input images was set to 208 × 208 × 3. The maximum number of training steps was set to 3001, taking into account the total size of the dataset and the number of layers of the architecture. During training, a learning rate that is too large prevents the network from converging, whereas a learning rate that is too small makes the function converge slowly 34. Therefore, the learning rate was set to 0.0001 in this work. The training batch size was 20.
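To put the step count in context, the number of passes over the training set implied by these settings can be worked out directly. The sketch assumes that each training step consumes exactly one batch of 20 images drawn from the 28,800 training sub-images; the batching scheme itself is not described in the text.

```python
TRAIN_IMAGES = 28_800   # 80% of the 36,000 augmented sub-images
BATCH_SIZE = 20
MAX_STEPS = 3001

def epochs_seen(steps, batch_size, dataset_size):
    """Number of passes over the dataset after `steps` batches."""
    return steps * batch_size / dataset_size

images_seen = MAX_STEPS * BATCH_SIZE
steps_per_epoch = TRAIN_IMAGES // BATCH_SIZE

print(images_seen)                                                  # 60020
print(steps_per_epoch)                                              # 1440
print(round(epochs_seen(MAX_STEPS, BATCH_SIZE, TRAIN_IMAGES), 2))   # 2.08
```

Under this assumption, the 3001 steps correspond to roughly two passes over the training data.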
The sub-images from the dataset needed to be processed so that they could be recognized by the learning model before model training. Figure 7 shows two batches of images randomly generated by preprocessing (label 0, label 1, label 2, label 1). Label 0, label 1, and label 2 represent premium grade, middle grade, and poor grade, respectively.
The training and validation accuracy curves of the proposed model are shown in Fig. 9. The whole training process achieved satisfactory results by optimizing the weights and bias values at each step. The training time of the proposed model was 27 min. The accuracy curves of the training and validation sets rose sharply around 1000 steps and held steady at about 96% and 93%, respectively, after about 2000 steps. This showed that there was no or only slight overfitting in the proposed model. After 3001 steps, the trained model and all its parameters were saved. The recognition accuracies on the training and validation sets reached their maxima of 99% and 98.98% at the 2590th and 3000th steps, respectively. The corresponding losses were 0.554 and 0.589 for training and validation, respectively. The training results demonstrated that the proposed architecture has great potential for apple quality identification.

Performance of Google Inception v3 model for apple quality identification. Although the Google Inception v3 model has achieved very good recognition results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), training all the parameters of the model is relatively time-consuming because of the huge training datasets involved. A trained Google Inception v3 model was downloaded from GitHub. To save time and prevent overfitting, the parameters of the convolutional layers and the pooling layers were not changed, and only the last layer of the model was trained. The parameter settings for the Google Inception v3 model were similar to those of the proposed model. The greatest training and validation accuracies of the Google Inception v3 model were 92% and 91.2% at the 3000th and 2700th steps, respectively, as shown in Fig. 10. In addition, the training time for the Google Inception v3 model was 51 min. Although the accuracy curves of the Google Inception v3 model fluctuated less than those of the proposed model, the proposed model achieved much better accuracy in apple quality identification.

Performance of traditional methods for apple quality identification. To achieve image enhancement, the gray range of the sub-images was changed by the weighted mean method. After the grayscale processing, bilateral filtering was applied, which reduced sharp changes and noise in the image grayscale. Bilateral filtering is a non-linear filter 35 that retains the relationship between pixels in spatial distance at the time of sampling and increases the correlation between pixels to maintain edge features. In the preprocessing of the apple sub-images, the neighborhood diameter was set to 90, and sigmaColor and sigmaSpace were both set to 75. Bilateral filtering could preserve the detailed contour information of the apple. The results of preprocessing are shown in Fig. 11.
After preprocessing, the local and structural features of the apple images were extracted using the GLCM and HOG methods, respectively. To improve recognition efficiency, four texture parameters were adopted as texture features: Angular Second Moment (Asm), Entropy (Ent), Contrast (Con), and Correlation (Cor). Asm measures the uniformity of the images. Ent describes the amount of information in the apple images. Con reflects the clarity of the images and the depth of the textures. Cor measures the similarity of the gray levels of the images in the row or column direction. Their averages and variances were used as the local extracted features. Before HOG feature extraction, an appropriate block size needs to be selected. The block size in this work followed Zhao et al. 36, which provides practical guidance for image identification. The block size was 4 × 4. If the block size is too large, features are missed and the feature expression is blurred. If the block size is too small, excess useless interference information is collected and the computational complexity increases.
Once the parameters were selected, the horizontal and vertical gradient maps of the apple images were calculated. The images were then divided into cell units, and a histogram of gradient directions was constructed for each cell. A block was composed of 4 × 4 cells, and the normalized gradient histogram was obtained within the block. The HOG feature of the image was obtained by concatenating the features of all blocks. The extraction of structural features using HOG is shown in Fig. 12.
GLCM and HOG features were merged as the input of the SVM classifier 37. A confusion matrix, a standard tool for evaluating supervised classification, was used to evaluate the accuracy of the classification results for apple quality. The accuracy is described by Eq. (12), where A, N_T, and N_V represent the accuracy of apple quality classification, the number of correct classifications, and the total number of validation samples, respectively. The training time for the traditional method was 287 min. After training, 498 and 23 images of premium apples were misclassified as middle and poor apples, respectively; 213 and 451 images of middle apples were misclassified as premium and poor apples, respectively; and 21 and 368 images of poor apples were misclassified as premium and middle apples, respectively. The classification results on the validation set using the traditional method, which combined GLCM and HOG features with an SVM classifier, are shown in Table 2. The SVM classifier based on the GLCM and HOG features distinguished apple quality with an overall accuracy of 78.14% on the validation set.
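The reported validation accuracy can be cross-checked from the misclassification counts above. The sketch assumes that the accuracy of Eq. (12) has the form A = N_T / N_V × 100%, as implied by its symbol definitions, and that the counts refer to the 7200-image validation set; the dictionary layout and function name are illustrative.

```python
VALIDATION_IMAGES = 7200  # 20% of the 36,000 augmented sub-images

# Misclassified validation images reported in the text, keyed by (true class, predicted class).
errors = {
    ("premium", "middle"): 498, ("premium", "poor"): 23,
    ("middle", "premium"): 213, ("middle", "poor"): 451,
    ("poor", "premium"): 21, ("poor", "middle"): 368,
}

def accuracy_percent(n_correct, n_total):
    """A = N_T / N_V * 100, the form assumed for Eq. (12)."""
    return 100.0 * n_correct / n_total

n_wrong = sum(errors.values())
accuracy = accuracy_percent(VALIDATION_IMAGES - n_wrong, VALIDATION_IMAGES)

print(n_wrong, round(accuracy, 2))  # 1574 78.14
```

The result agrees with the 78.14% overall accuracy stated for the SVM classifier.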
Testing and discussions. Compared with the Google Inception v3 model and the traditional image processing classification method, the proposed model obtained satisfying performance 38. Therefore, software for image acquisition was developed using Python and PyQt5 on a Windows system. OpenCV and the camera's API were integrated into this software in Python to acquire and save images. The weights, biases, and structure of the trained proposed model were saved and converted to a Protobuf format file. This file was loaded into the software to realize online detection and classification of apple quality. The online detection system is shown in Fig. 13. An independent testing dataset (300 apples, with 100 in each class) was also established to test the performance of the proposed model in the online detection system. In the test experiment, four images of every apple were acquired by the camera at different angles. These images were then predicted and scored by the trained proposed model. Some prediction and scoring results are shown in Fig. 14. The quality category (premium, middle, or poor apple) with the highest total score was taken as the predicted result. The proposed model demonstrated excellent performance on the separate testing dataset. Its performance was assessed with the aid of the confusion matrix in Fig. 15. In the confusion matrix, 96 premium apples were correctly identified and 4 premium apples were classified as middle apples; 5, 93, and 2 middle apples were classified as premium, middle, and poor apples, respectively; and 97 poor apples were correctly classified while 3 poor apples were classified as middle apples. The overall classification accuracy of the proposed model on the testing set was 95.33%. The trained Google Inception v3 model was also loaded into the software and achieved an overall accuracy of 91.33% on the separate testing dataset.
Meanwhile, the trained SVM classifier distinguished apple quality with an overall accuracy of 77.67% on the independent testing dataset.
Although detection and classification accuracy will be affected by a complicated working environment, such as the moving speed of the apples and the number, performance, and angle of the cameras 39, the detection and classification results obtained by the proposed model were superior to those of spectral imaging technology 9,11,12,40 and traditional machine vision methods 22,23. Feature learning is an advantage of deep convolutional networks over traditional image processing methods. Zhang et al. 41 studied blueberry bruising using the VGG-16 model (a popular architecture). Because of the large number of layers and training parameters of such popular frameworks, their calculation time did not meet the requirements of apple detection and classification. Therefore, a new model for apple detection and classification was proposed in this article.
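The scoring rule used in the online test (the class with the highest total score over the four views of an apple) and the overall accuracy of the proposed model can be sketched as follows. The confusion matrix entries are those reported for the proposed model in the text. The per-view scores are hypothetical values chosen for illustration, and the function names are assumptions.

```python
CLASSES = ("premium", "middle", "poor")

def predict_apple(view_scores):
    """Sum the per-class scores over all views of one apple and return the top class."""
    totals = [sum(view[c] for view in view_scores) for c in range(len(CLASSES))]
    return CLASSES[totals.index(max(totals))]

def overall_accuracy(confusion):
    """Percentage of samples on the diagonal of a confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# Hypothetical scores for four views of one apple (not data from this work).
views = [(0.7, 0.2, 0.1), (0.4, 0.5, 0.1), (0.6, 0.3, 0.1), (0.3, 0.6, 0.1)]
print(predict_apple(views))  # premium

# Rows are true classes and columns are predicted classes, as reported in the text.
confusion = [[96, 4, 0],
             [5, 93, 2],
             [0, 3, 97]]
print(round(overall_accuracy(confusion), 2))  # 95.33
```

In the hypothetical example, two of the four views individually favor the middle class, but the summed scores favor premium, which shows how aggregating over views can differ from a per-image decision.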

Conclusions
In this paper, a novel method based on convolutional neural networks (CNN) was proposed and employed for apple quality classification in images containing a disturbing background. Three methods were trained and validated to identify apple quality: the proposed CNN-based model, the Google Inception v3 model (a popular architecture), and HOG/GLCM + SVM (a traditional image processing method). The proposed model was trained and validated with best training and validation accuracies of 99% and 98.98%, respectively. The greatest training and validation accuracies of the Google Inception v3 model were 92% and 91.2%, respectively. The SVM classifier trained on the GLCM and HOG features distinguished apple quality with an overall accuracy of 78.14% on the validation set. In terms of accuracy, the proposed model was preferable to the other two methods. In addition, the training times of the proposed model, the Google Inception v3 model, and HOG/GLCM + SVM were 27, 51, and 287 min, respectively, so the proposed model took the shortest time to train. Moreover, the three methods were tested on an independent testing set, obtaining accuracies of 95.33%, 91.33%, and 77.67%, respectively. The overall results showed that the proposed model has great potential for apple quality detection and classification. In the future, the proposed model could be extended to detect more apple quality attributes, including color, size, type, ripeness, and physiological disorders. It can also be extended to identify more than three categories of apple quality and to classify other fruits. The proposed model will be deployed on real online sorting equipment to test its performance in the future.