Using deep learning to identify maturity and 3D distance in pineapple fields

Pineapples are an important agricultural economic crop in Taiwan. Considerable human labor is required to protect pineapples from excessive solar radiation, which can otherwise lead to overheating and subsequent deterioration. Note that simply covering the whole fruit with a paper bag is not a viable solution, because it makes it impossible to determine whether the fruit is ripe. This paper proposes a system to automate the detection of ripe pineapples. The proposed deep learning architecture enables detection regardless of lighting conditions, achieving accuracy of 99.27% with error of less than 2% at distances of 300 ~ 800 mm. The proposed system, running on an Nvidia TX2, is capable of 15 frames per second, making it possible to mount the device on machines that move at walking speed.

www.nature.com/scientificreports/

The system developed in the current study was based on the assumption that the paper hats leave at least half of the fruit visible from beneath. Numerous researchers have investigated methods to determine the maturity and variety of pineapples in the field. Azman et al. used a convolutional neural network (CNN) for the indoor assessment of ripeness (unripe, partially ripe, and fully ripe), achieving classification accuracy of 94% when applied to 27 samples of known ripeness5. Angel et al. used color analysis and image recognition software to classify pineapple maturity into 11 levels under indoor conditions6, achieving classification accuracy of 96% for 550 pineapple samples. Researchers in Thailand examined two pineapple varieties using gradient analysis of textures7. Sanyal et al. reported on the use of computer technology to diagnose diseases, with the aim of increasing pineapple production8. Amjad et al. used VGG16 to identify bromeliads, achieving classification accuracy of 100%9. Edwin et al. used image processing and fuzzy theory to classify post-harvest maturity into four categories, achieving classification accuracy of 90% for partial maturity and 95% overall10. Note, however, that all of the methods described above rely on a controllable light source under indoor conditions. Wan et al. used a UAV equipped with a color camera to count pineapple crowns in a field, achieving classification accuracy of 94%11.
Deep learning is a machine learning technique that uses multi-layer artificial neural networks to analyze signals. In the current study, we employed a CNN for the real-time detection of pineapple ripeness in the field, using four categories: unripe, ripe, ripening, and bagged. Unripe: un-harvestable; the fruit has not completed development and has not reached its final size. Ripe: solid-sounding fruit that must be harvested when the dark green peel turns light green; the fruit has reached its final size at this stage. Ripening: the peel turns yellow. Bagged: the peel is not visible, so maturity cannot be identified.
To deal with the difficulties of real-time computing and cost, we employed an embedded NVIDIA TX2 system to implement the YOLO12 pre-trained network. An Intel D435i13 camera provided the input data to the TX2, supplying RGB data and depth data for each frame. We assume that this device (including camera and embedded system) is mounted beneath the platform of an elevated tracked vehicle and attached to a simple harvesting mechanism.
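The per-frame flow described above can be sketched as follows. This is illustrative Python with stubbed function names, not the authors' implementation: the actual system runs YOLO V2 on the TX2 with weights trained in MATLAB and reads RGB and depth frames from the D435i via its SDK.

```python
# Minimal sketch of the per-frame pipeline (all names are hypothetical).

def detect_fruit(rgb_frame):
    """Stub for the YOLO V2 detector. Returns (label, confidence, box) tuples,
    where a box is (x_a, y_a, w, h) with (x_a, y_a) the upper-left corner."""
    return [("ripening", 0.95, (100, 80, 200, 240))]

def box_center(box):
    """Center pixel (x_b, y_b) of a bounding box."""
    x_a, y_a, w, h = box
    return (x_a + w // 2, y_a + h // 2)

def process_frame(rgb_frame, depth_frame):
    """For each detected fruit, report its class, confidence, and the depth
    (mm) read from the depth frame at the box center."""
    results = []
    for label, conf, box in detect_fruit(rgb_frame):
        cx, cy = box_center(box)
        results.append((label, conf, depth_frame[cy][cx]))
    return results
```

In the deployed system, the resulting class and distance would then be forwarded to the harvesting mechanism.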

Results
In this section, we discuss the effects of training on recognition performance for different databases (Table 1). We also discuss the means by which to assess accuracy in classification and identification tasks.
Deep learning. During training, root mean square error (RMSE) was used for evaluation, based on the standard deviation of the residual, where values that are closer to zero indicate superior accuracy.
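The RMSE used for evaluation can be computed as follows (a generic sketch, not the authors' training code):

```python
import math

def rmse(predictions, targets):
    """Root mean square error: the square root of the mean squared residual.
    Values closer to zero indicate superior accuracy."""
    residuals = [p - t for p, t in zip(predictions, targets)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))
```

For example, predictions of (2.0, 4.0) against targets of (1.0, 3.0) give an RMSE of 1.0.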
Accumulative database. We categorized the fruit as ripening or unripe using 518 training images and 360 testing images. The results were as follows: training time (19.2 h), loss (0.0042), and RMSE (0.06). The classification accuracy (CA) and box selection accuracy (BA) are shown in Fig. 1. In the training database (Fig. 1a-d), (a)-(c) were correctly classified as ripening and (d) as unripe, with confidence scores of 94.61%, 97.79%, 96.96%, and 91.07%, respectively. In the testing database (Fig. 1e-h), (e) and (f) were correctly classified as ripening and (g) and (h) as unripe, with confidence scores of 88.68%, 86.20%, 95.93%, and 61.37%, respectively. Results in the testing dataset had lower confidence than those in the training dataset; for example, (g) and (h) show fruits with sun hats, which did not appear in the training set, yet they were correctly positioned and classified. Note that the all-yellow fruit without a crown in Fig. 1c is mainly due to rapid changes in temperature, which reduce or eliminate the differentiation ability of the growth point; such fruits do appear in actual fields. Our database was generated for on-site assessment; in the future, it will continue to grow as more fruit-development cases from field management appear. In Table 2 (database 1), the classification accuracy (CA) and box selection accuracy (BA) of unripe fruits (120 images) were both 100% in the training set. Among the ripening fruits (398 images), one image failed to correctly select the bounding box of the fruit, and the rest were correctly selected. In the test set, the BA of unripe fruits (21 images) was 100%, but one image was misclassified.
Among the ripening fruits (339 images), 4 images failed to correctly select the bounding box of the fruit (BA: 98.82%); of the 335 correct box selections, 11 images were classified incorrectly (CA: 96.71%).
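As we use the terms, BA is the fraction of images in which the fruit's bounding box was correctly selected, and CA is the fraction of the correctly boxed images that were also classified correctly. This reading can be sketched as follows (illustrative code, not the evaluation script):

```python
def box_accuracy(correct_boxes, total_images):
    """BA (%): images with a correctly selected bounding box / all images."""
    return 100.0 * correct_boxes / total_images

def classification_accuracy(correct_labels, correct_boxes):
    """CA (%): correctly classified images / correctly boxed images."""
    return 100.0 * correct_labels / correct_boxes
```

For the ripening test set above, 335 correct boxes out of 339 images gives BA ≈ 98.82%, and 335 − 11 = 324 correct labels gives CA ≈ 96.72%, matching the reported figures up to rounding.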
We also categorized the fruit as unripe, ripening, or ripe using 693 training images and 572 test images. The results were as follows: training time (25.6 h), loss (0.0036), RMSE (0.06). For the additional category (i.e., unripe), framing of the pineapple was successful (i.e., the pineapples were correctly positioned in the frame), but some of the ripe fruits were classified as unripe due to similarities in color. In Table 2 (database 2), 4 images in the ripe training set (175 images) were misclassified (CA: 97.91%), and the ripening BA was 99.75%, as one image in the ripening set (398 images) failed fruit detection. In the unripe training set, the BA was 100%, but the CA was 99.17%, as one image was misclassified as ripe. Among the 339 images of unripe fruit in the test set, pineapple detection failed in 8 images (BA: 97.64%); 74 were misclassified as ripe and five as ripening (CA: 77.34%). In the ripening test set, the fruits were detected in all 21 images, but one image was misclassified as unripe (CA: 95.24%, BA: 100%). Among the 212 test images of ripe fruit, seven were misclassified as unripe (CA: 96.70%, BA: 100%).
We also categorized the fruit as unripe, ripening, ripe, or bagged using 928 training images and 572 test images. The results were as follows: training time (34.5 h), loss (0.0056), RMSE (0.07). The position of the fruit was correctly framed in the training image set. Training images: (1) in classifying ripe pineapples, one image was misclassified as unripe and one as ripening (CA: 98.86%); (2) in classifying ripening pineapples, one image was misclassified as bagged (CA: 99.75%); (3) in classifying unripe pineapples, two images were misclassified as ripe (CA: 98.33%). The position of the fruit was correctly framed in the test images. In classifying ripe and ripening pineapples, all images were correct. Among the 339 unripe images, 3 failed to correctly select the bounding box of the fruit (BA: 99.12%); one image was misclassified as ripening, one as bagged, and 75 as ripe (CA: 77.38%).
Increasing the number of images in the database was shown to improve detection performance. Our inability to exceed 95% accuracy can be attributed to the fact that ripe pineapples appear similar to unripe pineapples.

3D Distance. We performed experiments under indoor and outdoor testing environments. In the indoor test, 3D distance estimates were verified using a checkerboard grid with cells measuring 21 mm. The accuracy of 3D distance estimates could not be verified under outdoor conditions; therefore, we used only the D_z distance error. The checkerboard in the indoor test was held at a fixed position measured with a ruler and a laser rangefinder. Depth values were obtained using the depth camera (D435i). To verify the accuracy of the calculated distance, we used the laser rangefinder as ground truth, took the difference between the two readings as the calculation error (D), and divided the difference by the ground-truth distance to obtain the error rate (error rate of D). The architecture and arrangement of the camera and laser rangefinder are shown in Fig. 2a and d. The experiment setup is shown in Fig. 2b and a depth map is shown in Fig. 2e. We used the intersection of black and white cells closest to the center of the screen as the center point for calculating distance from sampled coordinate points; the results are shown in Table 3. In the experiment, the Z-axis distance was 250 ~ 800 mm, the D_x distance ranged from +58.09 mm to -59.58 mm, the D_y distance ranged from -62.48 mm (upper area) to +57.79 mm (lower area), and the D_z distance (laser rangefinder minus depth camera) was +12 mm. We found that our calculated Z-axis values were more than 4 mm larger than the laser rangefinder distance. Therefore, when we subtract 6 mm from the calculated values as ΔD_z, the error rates of ΔD_z are all less than 1.12% at distances of 300 ~ 800 mm.
As shown in Table 3, over a distance range of 300 ~ 800 mm, 3D distance error was less than 2%, which means that under Z-axis distance of 300 mm, the maximum error would be 6 mm.
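The correction and error-rate calculation described above can be sketched as follows (illustrative code; the 6 mm offset is the empirical value reported in the text):

```python
def corrected_error_rate(camera_mm, laser_mm, offset_mm=6.0):
    """Error rate (%) of the corrected Z-axis estimate: subtract the fixed
    offset from the camera reading, compare against the laser rangefinder
    (taken as ground truth), and normalize by the ground-truth distance."""
    delta = (camera_mm - offset_mm) - laser_mm
    return 100.0 * abs(delta) / laser_mm
```

For example, a camera reading of 310 mm against a 300 mm laser distance gives an error rate of about 1.33%, within the 2% bound reported in Table 3.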
In the outdoor experiment, we attached the laser rangefinder aligned with the x-axis of the camera, as shown in Fig. 2c and f. During this test, we were unable to obtain precise distance measurements along the x- and y-axes; therefore, we tested accuracy only along the z-axis. Note that the position of the pineapple does not necessarily match the point measured by the laser rangefinder, and the body of the pineapple is arc-shaped. We placed the setup of Fig. 2d in a pineapple field for testing, placed the camera at nine different distances and positions, and measured the difference between the laser rangefinder and our system. The maximum error obtained from the nine groups of test results was 1.53% (see Table 1).
The YOLO V212 architecture made it possible to detect pineapple maturity in 99.27% of cases. In the 3D distance experiments, the error in 3D distance estimates was less than 2% at 300 ~ 800 mm in an indoor environment. Z-axis distance estimates obtained in the field were consistent with our indoor test results, with an error of 1.53%.

Discussion
This paper presents preliminary work on the development of an automated system for the harvesting of pineapples. We used a database of images collected in the field (Tainung No. 17) for offline training. The detected fruits were classified as harvestable (ripe and ripening) or un-harvestable (unripe). Note that these assessments were obtained for fruits that were partially shaded from the sun using a paper bag. Performance in box selection reached 99.27% with 3D distance error of less than 2%. A failure to exceed classification accuracy of 95% prompted us to subdivide the ripe samples into the following two groups with the aim of enhancing classification accuracy: stage 709 and stage 8. Trained network weight files were transferred from an i7-1165G7 2.8 GHz host computer with no graphics card to an embedded system equipped with an NVIDIA TX2, thereby increasing the frame rate from 5 to 15 FPS (including distance conversion). The distance data was then transferred to the harvesting machine via an RS232 transmitter.
We expect to improve on the poor recognition rate achieved in the test set in the future. We will also augment the database by adding more data pertaining to inflorescence, fruit development period, and pineapple varieties. The preliminary design and planning of the simple harvesting machinery is shown in Fig. 3. The harvester moves back and forth to align the attached slides (green in the image) with the fruit, as shown in Fig. 3a. After alignment, the two semi-circular halves of the harvester device are then lowered to surround the fruit, as shown in Fig. 3b. The two halves are then moved to within a set distance of the stem, whereupon an electric cutting device picks the fruit, as shown in Fig. 3c.

Methods
Offline training was conducted using Matlab 2021b running on a PC equipped with an Intel i9-11900F and RTX-3080Ti graphics card with the weight files held in an embedded NVIDIA TX2. In the following, we describe how we obtained the database and then outline the network architecture and our reasons behind its selection. Finally, we describe the means by which detection data was converted into 3D distances.
DataBase. The most important part of any deep learning implementation is the collection of data and the accuracy of ground truth values. Collecting data in the field can be time-consuming; therefore, we discussed the issue with Ching-Shan Kuan and Hsin-Yi Tseng at the Taiwan Agricultural Research Institute before initiating the collection process. Data collected in the field were classified and labelled in accordance with BBCH1 codes throughout the growing period. The correctness of the data was confirmed manually. We compiled a database of 8,852 images, as follows: stage 5 (2,186), stage 6 (2,176), stage 7 (3,449), comprising stage 701 (1,990), stage 705 (1,002), and stage 709 (457), as well as stage 8-ripening (419) and stage 8-ripe (387). We also collected 235 images of bagged pineapples. The data used are listed in Table 4; they were classified according to the BBCH scale and randomly divided into training and test sets. Table 4 lists the categories and the number of images in each. Our main objective was to assess fruits in the field during the final stage. The fruit was divided into four categories: bagged (Fig. 4a), ripe (Fig. 4b), ripening (Fig. 4c), and unripe (Fig. 4d). 'Ripe' refers to fleshy fruit that is entirely green but must be harvested. 'Unripe' refers to fleshy and hollow-sounding fruit that is not harvestable. 'Ripening' refers to hollow-sounding fruit undergoing a color change from green to yellow. The extent of color change can be subdivided according to maturity level. In Fig. 4b,c, the heads of the fruit differ slightly: the head of (c) is slightly yellow, turning from bottom to top, whereas (b) is entirely green with no yellowing. Note that when paper bags are used to protect the fruit from sun exposure, it is not possible to determine the degree of maturity. Note also that images showing sun hats (Fig. 4e) were not included in the training or test sets.
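The random division into training and test sets can be sketched as follows (the split fraction and seed here are illustrative; Table 4 gives the actual per-category counts):

```python
import random

def split_dataset(images_by_stage, train_fraction=0.6, seed=0):
    """Randomly split each BBCH-labelled category into training and test sets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for stage, images in images_by_stage.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train[stage], test[stage] = shuffled[:cut], shuffled[cut:]
    return train, test
```

Splitting per category keeps each maturity stage represented in both sets, which matters when some stages (e.g., stage 709) have far fewer images than others.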
Deep learning. The YOLO architecture has evolved through several versions: the first generation14 (V1) unifies the two-stage solution (class, bounding box) under a CNN architecture as a single regression problem, with a calculation speed of 45 FPS; the second generation12 (V2) offers improved accuracy, convergence, and a calculation speed of 67 FPS; and the third generation15 (V3) exhibits improved detection of small objects but is slower. The fourth generation16 (V4) seeks an optimal balance between resolution and the number of convolution layers, enabling a high recognition rate with high computing efficiency at 65 FPS. V517 is largely the same as V4; however, its computing speed is twice as fast.
Note that in this study, we performed detection from a single station in the orchard. The objects detected in the images exceeded 300*300 px in size, and no more than two objects were ever included in the same frame; thus, there was no need to detect multiple small objects. The background in the pineapple field was less complex and less variable than that in the COCO database. Thus, in selecting a network architecture, our primary considerations were speed and accuracy. We opted for the YOLO V2 network architecture using models pre-trained in MATLAB. The parameter settings were as follows: output images (448*448*3) with anchor boxes (344,338), (108,115), (223,290), (238,167). The YOLO network parameters in the experiments were as follows: feature layer (leakyrelu_24), mini-batch size (32), max epochs (200), learning rate (0.001). To avoid overfitting, we randomly selected the training set from the database and used test images not in the training set to confirm whether overfitting occurred.

3D Distance. Note that the selected camera uses left-right dual-camera parallax to estimate distances, in conjunction with TOF technology to correct for distance error, resulting in Z-axis distance error of less than 2%13. Note also that the working range of the camera is from 20 cm to 10 m, which was in line with our needs. As shown in Fig. 5, after the camera recognizes the frame-selected position of the object through deep learning, the center point of the frame is calculated in order to read the Z-axis distance at the corresponding pixel position. The actual 3-axis distance (mm) is then computed from the obtained Z-axis distance and the center point of the on-screen frame selection.
After determining the position of the pineapple via deep learning, as shown in Fig. 5, we used the coordinates of the upper left corner of the frame (x_a, y_a) to calculate the center point position (x_b, y_b), as shown in Eq. (1). The distance value of the depth image was read at this position. To account for the risk of distance failure under strong lighting conditions, we strengthened the depth values using Eq. (2): the depth value D_z is calculated from the center point and its eight adjacent points, averaging only the valid readings (depth values greater than 1) among the nine. Once the camera is installed beneath the field harvester, the body of the machine should block strong light.
(x_b, y_b) = (x_a + w/2, y_a + h/2)    (1)

D_z = (sum of the valid depth values among the nine sampled points) / num    (2)

num = the number of the nine depth-image values that are greater than 1    (3)

where w and h denote the width and height of the selected frame. D_z values were used to calculate the x-axis distance D_x and the y-axis distance D_y. Using the known camera FOV (H: 64°, V: 41°) and camera resolution (1920*1080 pixels), the distances can be calculated using trigonometric functions via Eqs. (4)-(5). At the center of the screen, D_x = D_y = 0 mm; in the upper right corner of the screen, D_x > 0 and D_y < 0; and in the lower left corner, D_x < 0 and D_y > 0.
Here, (x_1, y_1) denotes the center point of the image, with x_1 = m/2 and y_1 = n/2 for an m*n image. The position of the object determined using deep learning can thus be converted into a 3D distance (mm) using Eqs. (1)-(5).
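Equations (4)-(5) are not reproduced in this extract; assuming the lateral offset scales linearly with the pixel offset from the image center and with the tangent of the half-FOV, the full pixel-to-3D conversion can be sketched as follows (illustrative code, not the authors' implementation):

```python
import math

M, N = 1920, 1080            # camera resolution (m*n pixels)
FOV_H, FOV_V = 64.0, 41.0    # field of view (degrees)

def box_center(x_a, y_a, w, h):
    """Eq. (1): center (x_b, y_b) from the frame's upper-left corner and size."""
    return (x_a + w / 2, y_a + h / 2)

def robust_depth(depth, x_b, y_b):
    """Eqs. (2)-(3): average the center pixel and its eight neighbours,
    keeping only valid readings (depth > 1) to guard against dropout
    under strong lighting."""
    cx, cy = int(x_b), int(y_b)
    values = [depth[cy + dy][cx + dx]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    valid = [v for v in values if v > 1]
    return sum(valid) / len(valid) if valid else 0.0

def pixel_to_3d(x_b, y_b, d_z):
    """Assumed form of Eqs. (4)-(5): convert the box center and depth d_z (mm)
    into lateral offsets D_x, D_y (mm). Sign convention from the text:
    D_x > 0 and D_y < 0 in the upper-right corner of the screen."""
    x1, y1 = M / 2, N / 2
    d_x = d_z * math.tan(math.radians(FOV_H / 2)) * (x_b - x1) / x1
    d_y = d_z * math.tan(math.radians(FOV_V / 2)) * (y_b - y1) / y1
    return d_x, d_y
```

At the center of the screen this yields D_x = D_y = 0, and pixels in the upper-right quadrant give D_x > 0 and D_y < 0, matching the sign convention stated above.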

Data availability
The database was collected by Dr. Ching-Shan Kuan and Dr. Hsin-Yi Tseng of the Chiayi Agricultural Experiment Branch and an assistant researcher at the Plant Germplasm Division, Taiwan Agricultural Research Institute. Since the data are still being collected and the plan is still being implemented, the data cannot be made public. If necessary, the network structure and program can be requested from the corresponding author, who will provide feedback data.