Identifying and extracting bark key features of 42 tree species using convolutional neural networks and class activation mapping

The significance of automatic plant identification has already been recognized by academia and industry. There have been several attempts to utilize leaves and flowers for identification; however, bark could also be beneficial, especially for trees, owing to its consistency throughout the seasons and its easy accessibility, even in high-crown conditions. Previous studies on bark identification have mostly contributed quantitatively to increasing classification accuracy. However, ever since computer vision algorithms surpassed the identification ability of humans, an open question has been how machines successfully interpret and unravel the complicated patterns of bark. Here, we trained two convolutional neural networks (CNNs) with distinct architectures on a large-scale bark image dataset and applied class activation mapping (CAM) aggregation to investigate the diagnostic keys for identifying each species. The CNNs identified the bark of 42 species with > 90% accuracy, and the overall accuracies differed little between the two models. Diagnostic keys matched salient shapes that were also easily recognized by human eyes and were typified as blisters, horizontal and vertical stripes, lenticels of various shapes, and vertical crevices and clefts. The two models exhibited disparate quality in their diagnostic features: the older and less complex model showed more general and well-matching patterns, while the better-performing model with much deeper layers indicated local patterns less relevant to bark. The CNNs were also capable of assigning untrained species to the correct genus and family in 41.98% and 48.67% of cases, respectively. Our methodologies and findings are potentially applicable to identifying and visualizing crucial traits of other plant organs.

Species identification is a fundamental component in every discipline of biology to properly utilize, monitor, and protect highly diverse living organisms on Earth. However, conventional identification workflows that manually distinguish key visual features are slow and error-prone because of the complexity and intraspecific variation in morphological traits 1. There is also increasing concern that the number of professional and amateur taxonomists is persistently declining, and that the gap in taxonomic knowledge between professionals and the public is increasing [1][2][3]. Thus, an increasing number of studies propose automated identification systems based on rapidly advancing machine learning methods to support both professionals and the public [4][5][6][7][8]. Among the various fields where automation is being introduced, plant identification is one of the most vigorously studied areas with high demand owing to its easy accessibility, rich diversity, and increasing curiosity about natural creatures in urban life 9.
In addition to conventional taxonomic approaches to identifying plants, previous studies on automated plant identification have mainly focused on extracting visual features from reproductive and/or vegetative organs [10][11][12][13][14][15]. However, the availability of these organs highly depends on the season, phenological changes, and the height of the crown base, especially in tree species. Moreover, owing to the non-rigid and three-dimensional nature of leaves and flowers, it is challenging to produce high-quality image data to train and evaluate machine learning models.

Results
The training of the two CNN models was successful, and both models reached over 90% overall accuracy, as shown in Table 1. The performance metrics ("Materials and methods") did not show substantial differences among the four measures, even between the models (all within a 0.5% range). The more complex model, EfficientNet, required a much larger amount of training to reach its maximum accuracy: 65 epochs for VGG-16 (90.7%) versus 161 epochs for EfficientNet (91.0%). A more detailed comparison of classification performance for each species is given in the confusion matrices in Figs. 1 and 2, produced by VGG-16 and EfficientNet, respectively. Most species showed high accuracies of > 90%, but the two models tended to show lower accuracies in species whose genus or family contained multiple species in the dataset, and these were mostly misclassified as other species in the same genus or family. In VGG-16, for example, large misclassifications occurred in Abies balsamea (6.39% to Picea mariana), Castanea crenata (7.0% and 6.6% to Quercus aliena and Quercus serrata, respectively), Taxodium distichum (12.7% to Metasequoia glyptostroboides), Acer rubrum (11.5% to Acer saccharum), Picea glauca (18.7% to Picea abies), and Quercus serrata (33.9% to Quercus aliena). EfficientNet followed the same trend but showed weaknesses in slightly different species: Abies balsamea (8.0% to Picea mariana), Acer rubrum (14.5% to Acer saccharum), Betula alleghaniensis (14.47% to Betula papyrifera), and Quercus aliena (12.6% to Quercus serrata). Ulmus americana was misclassified as species in other taxon groups, including Acer saccharum (9.5%) and Ginkgo biloba (5.1%).
Meanwhile, the accuracy of the CNN models for each species did not show a clear trend with the number of training samples (Figs. 1 and 2, and Table 2). Species with high error rates (e.g., Acer rubrum, Castanea crenata, Picea glauca, and Quercus serrata) were trained with an intermediate number of unique bark images, ranging from 161 to 226, whereas species trained with a relatively small number of images still showed high accuracies (e.g., Acer palmatum var. amoenum, Magnolia obovata, and Picea rubens). However, considering that the ImageNet dataset consists of approximately 1000 images per class 35, we suggest that more than 100 images per species should be prepared, assuming that 10 cropped pieces can be sampled from each bark image (Fig. 6, "Data pre-processing and augmentation"). These results suggest that CNNs can successfully identify most species; however, varying extents of intra-genus and intra-family similarity could lead to low accuracy, and the two CNNs might learn and utilize classification keys in different ways.
The diagnostic features of each species were identified as heat maps displaying the relative importance of regions for classification (Fig. 3 and extended figures at https://doi.org/10.6084/m9.figshare.14727834). In the heat maps extracted from VGG-16, the saliency responses in most species typically matched localized shape patterns that were also readily noticeable by human eyes. For example, the patterns appeared as blisters in Abies, vertical stripes and crevices in Acer, large horizontal lenticels in Betula, shaded parts of scaly bark in Picea, scaly and flaking patterns in Platanus, small striped and dotted lenticels in Prunus, and vertical crevices and clefts in Quercus (detailed information by species is listed in SI Appendix, Table S1). In several species that did not retain salient features but exhibited repeated textures along the entire bark, a large portion of each image was densely activated. This was mostly found in four Cupressaceae species (Chamaecyparis pisifera, Metasequoia glyptostroboides, Taxodium distichum, and Thuja occidentalis) and in some samples of Fagus grandifolia and Magnolia obovata, as large numbers of elongated narrow cracks or stripes. These results indicate that VGG-16 mostly catches locally prominent features if available and, if not, utilizes the entire texture patterns that are extensively shown and repeated. In EfficientNet, a disparity of heat maps in terms of both quantity and quality was observed. In many species, the model utilized much smaller and more weakly activated regions that did not match the patterns proposed from VGG-16. Along with the smaller size of each activation region, they matched the exact locations of the visual patterns less closely.

Table 2. Details of the tree species and the corresponding number of bark images utilized in this study. Only the number of cropped test samples is described because train samples were generated concurrently during the training phase (see "Data pre-processing and augmentation") rather than cropped from the prior train dataset. Superscripts a and b denote the source of the images: a represents the data collected in this study, and b the BarkNet 1.0 dataset.

Discussion
Our results suggest that CNNs have sufficient capacity to identify tree bark images with high accuracy in most species. However, the proposed methods also showed weaknesses in distinguishing highly similar species within some genera and families, which might be attributed to disparities in diagnostic features, whereas the overall accuracy did not differ substantially between the two models. The contrasting diagnostic features from EfficientNet could be explained by its complicated architecture consisting of 54 convolution layers, 17 pooling layers, and other dropout and reshape layers (240 layers in total). In contrast, VGG-16 consists of 13 convolution layers and five pooling layers (23 layers in total). Convolution layers at different depths learn distinct levels of representation, and the deepest layers are trained to capture the features with the highest abstraction 36. It could be inferred that increasing the number of convolution layers and the architectural complexity led the CNN model to utilize more condensed information in the data.
In this respect, augmenting model complexity has increased performance on large-scale image classification datasets (e.g., ImageNet) of objects with rigid boundaries and intermediate intra-class variance. However, natural images show large variation within the same species (age and environmental variation) and frequently show high inter-species similarity. If a model excessively focuses on local features of a specific individual rather than grasping general patterns shared at the species level, it might overfit and decrease overall performance.
Previous studies reported that ImageNet pre-trained models show high domain adaptation ability; that is, models with high accuracy on ImageNet also showed higher accuracies on other domains (R2 > 0.95) when tested with a benchmark dataset consisting of 65 object classes (Office-Home Dataset) 37. In this study, however, EfficientNet did not show substantially better accuracy than VGG-16, despite its superior performance on ImageNet.
The varying accuracies and CAM representation capacities of CNNs on natural images have been studied less, but a few studies accord with the findings of this study. When five distinctive CNN architectures were tested for detecting flowering events 38, the accuracies ranged within only 1.3%, but CAM visualization yielded largely contrasting results: simpler models showed activation over large regions relevant to the flowers, whereas the complex, better-performing model showed much smaller activated regions. Therefore, the complex model predicted flowering from only a small number of flowers with higher sensitivity.
In the case of bark, augmented complexity and sensitivity could have mixed effects, both positive and negative. Complex models might identify species by more localized and sophisticated features that simple models cannot find, but with a risk of overfitting. Consequently, ensemble techniques that combine the predictions of different types of models, or architectural modifications that enable multiple receptive levels (e.g., Inception modules in GoogLeNet 39), would be suitable for real-world applications.
A qualitative inspection of the misclassified samples in Figs. 1 and 2 revealed that CNNs tended to err under some typical conditions that should be avoided in data preparation: excessive background masking (Fig. 4a), foreign object coverage (Fig. 4b), and improper light conditions (Fig. 4c). The other misclassifications, which could not be typified, may have originated from high inter-species similarity along with the black-box characteristics of deep learning models. As shown by Rzanny et al. 14, where fusing multiple organs increased accuracy by up to 20%, combining bark with leaf and flower characteristics would improve accuracy and robustness.

Meanwhile, the CNN models tended to misclassify species as other species in the same genus or family (Figs. 1 and 2). These results imply that diagnostic features are shared within higher taxa and that CNNs can recognize and utilize these general features for identification. We further investigated this generalization capacity with images of new species within the same genera or families 27, consisting of 20 species in Abies, Acer, Alnus, Betula, Carpinus, Castanea, Chionanthus, Fagus, Fraxinus, Platanus, Populus, Quercus, and Ulmus (Fig. 5). When tested with the two models, VGG-16 predicted 42.0% of the total samples as species in the same genus (among the 42 original species) and 48.7% as species in the same family, while EfficientNet predicted 37.8% and 44.5% as species in the same genus and family, respectively. In detail, VGG-16 showed high generalization ability in Acer pictum subsp. mono (93.1%) and Acer pseudosieboldianum (68.6%), classified into Acer palmatum var. amoenum; Platanus orientalis (49.2%), classified into Platanus occidentalis; and Quercus dentata (98.1%) and Quercus × urticaefolia (100%), classified into Quercus aliena. EfficientNet showed particularly better results in Betula platyphylla (45.3%), Fagus sylvatica (53.3%), Platanus orientalis (82.0%), and Populus grandidentata (66.4%).
These results suggest that CNNs can catch the features shared within a higher taxon and categorize new species into their genera or families if trained on related species beforehand. Furthermore, it appears that VGG-16 learned and utilized more generalized features from the original dataset than EfficientNet.

In summary, we demonstrated the capacity of CNNs to distinguish sophisticated patterns in bark and proposed a novel approach to extract and narrate the diagnostic keys found by computer vision models. We argue that architectural differences in CNNs largely affect the quality of diagnostic features, a difference that quantitative accuracy metrics do not reveal. Our proposed methods are easily applicable to other plant organs (e.g., leaves, flowers, and wood anatomical features) that also exhibit complex patterns on their surfaces. Furthermore, we suggest using bark as a baseline prediction factor and combining it with information from other organs, which will enable highly precise tree species identification.

Materials and methods
Bark image datasets. The bark dataset for training and evaluating the CNNs was prepared by combining two datasets: BarkNet 1.0 27 and an additional dataset collected for this study. In terms of size and number of species, BarkNet is currently the only publicly available dataset of its kind; it contains more than 23,000 bark images captured from 23 tree species found in Quebec City, Canada, with 10-40 repeated captures from different angles and heights per individual tree. The raw images in the dataset were pre-processed by two discrete crop processes: trimming off the two vertical border lines between the trunk and the background (side-crop) and then cropping into square-shaped patches to match the input shape of the CNNs (square-crop). In this study, the side-cropped images of the BarkNet dataset were utilized. As the images in BarkNet were captured with a large number of repetitions, which might lead to high train-test similarity and impede efficient training of CNNs, four images were randomly sampled from each set of 10-40 repetitions, and images sharing a large portion of their content were then manually filtered out.
For the additional dataset, the images were captured according to the protocols of BarkNet with minor modifications. The protocols were as follows: captured from September to November 2020; in parks and forests near Seoul City, Republic of Korea; with a wide range of crown density; under various weather conditions from sunny to cloudy and rainy; using an iPhone SE2 camera; at a distance between 20 and 60 cm away from the trunk; with three to four repetitions from different angles per individual tree. The species selected were common species that can be easily found in most parks and forests in the Republic of Korea. We collected samples from multiple species in each genus to investigate intra-genus similarities.
Consequently, two datasets consisting of 3672 images (20 species; BarkNet 1.0) and 5703 images (23 species; the present study) were prepared, with one species (Pinus strobus) in common. A more detailed list of the species and the corresponding number of captured images in the two datasets is given in Table 2.
Data pre-processing and augmentation. The overall pre-processing and augmentation are illustrated in Fig. 6. Before aggregating the two datasets, side-crop and masking were applied to our dataset. Because our dataset contained images with non-vertical boundary lines, all images were side-cropped to a length that did not cut out pixels of bark, and the remaining background was masked with black pixels. After the images were cropped and masked, BarkNet and our dataset were aggregated and split into two subsets per class: 80% for training and 20% for testing.

Figure 6. Demonstration of the data pre-processing procedures. Raw images were side-cropped to a length that did not cut out the bark (A), and the backgrounds were masked with black pixels (B). Then, the images were sampled using random window sizes (40-60% of the total width) and positions (B). If a window contained more black pixels than the pre-defined threshold (0.05 in the present study), the window was re-sampled. After the batched crop images were prepared (C), they were resized to match the model input size and 'RandAugment' was applied (D). These procedures were performed concurrently during the training phase.

For the training dataset, a random square crop with thresholding was applied to the images before they were fed into the CNNs. Unlike previous studies, the images were randomly square-cropped by proportional length instead of fixed pixel lengths such as 224 × 224 27 and 256 × 256 40. For each crop, the pixel length of the crop was first randomly selected in the range of 40-60% of the total width, and then the position of the cropping square was randomly sampled. To exclude excessive background, the ratio of masked pixels to total pixels was calculated, and if the ratio was higher than the threshold (0.05 in this study), the crop region was re-sampled.
This approach has several advantages: it minimizes the loss of bark features at crop boundaries, copes flexibly with varying resolutions, and yields cropped images that require no manual filtering.
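The random proportional crop with background thresholding can be sketched as follows (a minimal NumPy version; the function and parameter names are ours, not those of the released scripts, and masked background pixels are assumed to be pure black):

```python
import numpy as np

def random_square_crop(img, mask_threshold=0.05, min_frac=0.4, max_frac=0.6,
                       rng=None, max_tries=100):
    """Randomly square-crop `img` (an H x W x 3 array) by a side length of
    40-60% of the total width, re-sampling while the candidate crop contains
    more masked (all-black) pixels than the threshold allows."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    crop = img
    for _ in range(max_tries):
        side = int(rng.uniform(min_frac, max_frac) * w)   # 40-60% of total width
        y = int(rng.integers(0, h - side + 1))
        x = int(rng.integers(0, w - side + 1))
        crop = img[y:y + side, x:x + side]
        # ratio of masked (all-zero) pixels to total pixels in the candidate
        masked_ratio = np.all(crop == 0, axis=-1).mean()
        if masked_ratio <= mask_threshold:
            break   # little enough background; keep this crop
    return crop
```

Because the side length is proportional rather than fixed in pixels, the same routine works for images of any resolution.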
For data augmentation, a process that enlarges the training data through slight modifications, 'RandAugment' was utilized 41. RandAugment is an automated data augmentation process that randomly selects a predefined number of augmentation functions and applies them with magnitudes randomly sampled from a predefined range. The candidate functions include equalization, rotation, solarization, and adjustment of contrast, brightness, and color. The full list of augmentation functions and hyperparameters for RandAugment can be found in the provided Python scripts ("Code availability").
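A minimal RandAugment-style sketch using Pillow illustrates the idea (the operation list and magnitude range here are illustrative assumptions, not the authors' exact configuration):

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def rand_augment(img, n_ops=2, magnitude_range=(0.5, 1.5)):
    """Pick `n_ops` candidate transforms at random and apply each with a
    randomly sampled magnitude `m` (m = 1.0 leaves enhancers unchanged)."""
    ops = [
        lambda im, m: ImageOps.equalize(im),                        # histogram equalization
        lambda im, m: im.rotate(30 * (m - 1.0), fillcolor=(0, 0, 0)),  # small rotation
        lambda im, m: ImageOps.solarize(im, threshold=int(128 * m)),   # solarization
        lambda im, m: ImageEnhance.Contrast(im).enhance(m),
        lambda im, m: ImageEnhance.Brightness(im).enhance(m),
        lambda im, m: ImageEnhance.Color(im).enhance(m),
    ]
    for op in random.sample(ops, n_ops):
        img = op(img, random.uniform(*magnitude_range))
    return img
```

Applying a fresh random combination at every training step means the network rarely sees the same patch twice.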
These random square-crop and augmentation processes were applied concurrently with training. More specifically, the side-cropped training samples themselves remained unmodified; the random square crop followed by augmentation was performed on the fly, and the output images were fed into the CNNs in predefined batch sizes. For the testing dataset, each side-cropped sample was square-cropped to 50% of the total width without randomized sampling, and no augmentation was applied. All training and testing samples were resized to 331 × 331 before data augmentation.

Convolution neural networks.
In this study, among the numerous CNN architectures that have been extensively studied and released, the following two were adopted and compared: VGG-16 and EfficientNet 42,43. As one of the early CNN architectures, released in 2014, VGG-16 remains widely used owing to its simple structure and strong performance in various computer vision tasks, including classification 44, detection 45, and segmentation 46. In contrast, EfficientNet is currently among the most powerful CNN architectures, showing the highest benchmark accuracy on ImageNet 26 while reducing the number of parameters compared with other recent models. EfficientNet derives from a search for the best scaling coefficients in terms of the width, depth, and resolution of CNNs. In this study, EfficientNet with the smallest scale (B0) was used to match the input size of VGG-16.
Using these two architectures, each model was modified for effective transfer learning through the following sequence: (1) truncate the final fully-connected (FC) layer, which contained neurons for predicting the 1000 classes of the original ImageNet dataset; (2) reshape (flatten) the output of the final convolution layer and apply dropout; (3) insert a new FC layer with the same number of neurons as the prediction classes (42 species); and (4) apply softmax activation with a cross-entropy loss function.
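The four-step head replacement can be expressed in a few lines of Keras (a sketch under our own naming; the authors' released scripts may differ in detail):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def attach_species_head(base, n_classes=42, dropout_rate=0.5):
    """Truncate the ImageNet classifier head: flatten the final convolution
    output, apply dropout, and attach a new softmax FC layer for the
    42 species, trained with cross-entropy loss."""
    x = layers.Flatten()(base.output)        # (2) reshape the final conv output
    x = layers.Dropout(dropout_rate)(x)      # (2) dropout for regularization
    out = layers.Dense(n_classes, activation="softmax")(x)  # (3) + (4)
    model = models.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example with a VGG-16 backbone; include_top=False performs step (1), and
# weights="imagenet" would load the pre-trained backbone:
# base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
#                                    input_shape=(331, 331, 3))
# model = attach_species_head(base)
```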
Here, a convolution layer performs element-wise multiplication between the input data and its weights; a pooling layer reduces the dimensions of the input data; a dropout layer randomly sets input node values to zero to prevent overfitting; and an FC layer connects every node of its input to every node of its output.
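What a single convolution layer computes can be illustrated with a toy example (one channel, no padding or stride, for clarity only):

```python
import numpy as np

def conv2d_valid(x, kernel):
    """At each position, multiply the kernel element-wise with the
    underlying input patch and sum the products ('valid' convolution,
    i.e., no padding)."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out
```

For example, sliding a 2 × 2 all-ones kernel over a 3 × 3 all-ones input yields a 2 × 2 output in which every entry is 4.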
Extracting diagnostic features with class activation mapping aggregation. Since CAM was first proposed by Zhou et al. 33, there have been several advances in CAM techniques. The original CAM was limited in that it required a global average pooling layer in the architecture, along with laborious re-training. These limitations were overcome by Selvaraju et al. 29 with gradient-weighted class activation mapping (Grad-CAM), which uses back-propagated gradients at any target convolution layer. A further advance, Grad-CAM++, provides better localization than Grad-CAM, especially when multiple object instances occur in a single image. Considering that diagnostic features may occur multiple times in a single bark image, Grad-CAM++ was utilized as the CAM extraction method in this study.
To build a general representation of the diagnostic features over the entire bark rather than individual square-cropped patches, aggregation and averaging were applied, referred to here as CAM aggregation (Fig. 7). The CAM of each tree image was calculated in the following steps: zero-padding, CAM calculation over sliding windows, and averaging of the CAM outputs. More specifically, bark images of whole trees were padded with black pixels to the corresponding widths and heights so that the window slides over the bark area an integer number of times. Then, the receptive window was slid over the padded image with a predefined stride of half the window size, and the CAM of each cropped window was calculated. All CAM outputs were then combined and averaged over the entire bark image.
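The three aggregation steps can be sketched as follows (our simplified reading of the procedure: padding is applied only on the bottom and right, and `cam_fn` stands in for a per-patch Grad-CAM++ call):

```python
import numpy as np

def aggregate_cam(img, cam_fn, window, stride=None):
    """Zero-pad `img` so a `window`-sized receptive field tiles it exactly,
    compute a CAM for every window position with `cam_fn`, and average the
    overlapping outputs into one whole-bark heat map."""
    stride = stride or window // 2             # half-window stride
    h, w = img.shape[:2]
    pad_h = (window - h) % stride              # make (H - window) divisible by stride
    pad_w = (window - w) % stride
    padding = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, padding)              # zero-padding = black pixels
    heat = np.zeros(padded.shape[:2])
    count = np.zeros(padded.shape[:2])
    for y in range(0, padded.shape[0] - window + 1, stride):
        for x in range(0, padded.shape[1] - window + 1, stride):
            patch = padded[y:y + window, x:x + window]
            heat[y:y + window, x:x + window] += cam_fn(patch)
            count[y:y + window, x:x + window] += 1
    # average overlapping CAMs, then crop back to the original size
    return (heat / np.maximum(count, 1))[:h, :w]
```

Averaging over overlapping windows smooths out patch-boundary artifacts that a single square-cropped CAM would exhibit.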
Performance metrics. The two CNN models were evaluated using four performance metrics: overall accuracy and the macro-averaged precision, recall, and F-score with β = 1. Each metric was calculated following the multi-class classification performance measurements of Sokolova and Lapalme 47, using overall accuracy instead of average accuracy in this study. The measurements were as follows:

$$\text{Overall accuracy} = \sum_{i=1}^{l} \frac{tp_i}{tp_i + tn_i + fp_i + fn_i}$$

$$\text{Precision}_M = \frac{1}{l}\sum_{i=1}^{l} \frac{tp_i}{tp_i + fp_i}, \qquad \text{Recall}_M = \frac{1}{l}\sum_{i=1}^{l} \frac{tp_i}{tp_i + fn_i}, \qquad F_M = \frac{(\beta^2 + 1)\,\text{Precision}_M \cdot \text{Recall}_M}{\beta^2\,\text{Precision}_M + \text{Recall}_M}$$

where $tp_i$, $tn_i$, $fp_i$, and $fn_i$ represent the true positive, true negative, false positive, and false negative counts, respectively, for each species class $C_i$, with $l$ the total number of classes ($l = 42$). The subscript $M$ denotes the macro-averaged calculation of each metric.
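All four metrics can be computed directly from a confusion matrix; a minimal NumPy sketch (function name ours) following the one-vs-rest counts of Sokolova and Lapalme:

```python
import numpy as np

def classification_metrics(cm, beta=1.0):
    """Return (overall accuracy, macro precision, macro recall, macro F)
    from a confusion matrix `cm` (rows = true class, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                   # predicted as i but not i
    fn = cm.sum(axis=1) - tp                   # truly i but predicted otherwise
    # since tp_i + tn_i + fp_i + fn_i = N for every class, the summed
    # per-class formula reduces to total correct / total samples
    overall_acc = tp.sum() / cm.sum()
    precision_m = np.mean(tp / np.maximum(tp + fp, 1e-12))
    recall_m = np.mean(tp / np.maximum(tp + fn, 1e-12))
    f_m = ((beta**2 + 1) * precision_m * recall_m
           / (beta**2 * precision_m + recall_m))
    return overall_acc, precision_m, recall_m, f_m
```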
Training details. When training VGG-16 and EfficientNet, we used the following hyper-parameters: a learning rate of 10−5, Adaptive Moment Estimation (Adam) as the optimizer, a dropout rate of 50% for the fully connected layer, and a training batch size of 8. The two CNN models were pre-trained on the ImageNet dataset and then fine-tuned until the maximum overall accuracy on the test dataset did not increase for 10 consecutive epochs.
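The stopping criterion above maps onto a standard Keras early-stopping callback (a sketch of the setup, not the authors' exact script; `train_ds` and `test_ds` are hypothetical dataset objects):

```python
import tensorflow as tf

# Hyper-parameters as reported in the text
LEARNING_RATE = 1e-5
BATCH_SIZE = 8

# Stop fine-tuning once the monitored accuracy has not improved
# for 10 consecutive epochs, keeping the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=10,
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=test_ds,
#           batch_size=BATCH_SIZE, callbacks=[early_stop])
```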
The CNN models were trained and tested under Windows 10 Pro 64-bit and Python 3.8.5, on hardware comprising an AMD Ryzen 5 3600XT CPU, 32 GB of RAM, and an NVIDIA GTX 1080 Ti GPU. TensorFlow (2.4.1) and Keras (2.4.3) were used as the deep learning frameworks.

Code availability
The Python scripts used in this study are available on GitHub (https://github.com/snutp/TBKFE). Refer to the scripts in the repository for technical implementation details of the methods in this study.