Automated rock mass condition assessment during TBM tunnel excavation using deep learning

Rock mass condition assessment during tunnel excavation is a critical step for the intelligent control of tunnel boring machines (TBMs). To achieve automatic assessment, a visual assessment system is installed on the TBM and a large in-situ rock mass image dataset is collected from a water conveyance channel project. The rock mass condition assessment task is transformed into a fine-grained classification task. To fulfill the task, a self-convolution based attention fusion network (SAFN) is designed in this paper. The core of our method is the discovery and fusion of the object attention map within a deep neural network. The network consists of two novel modules, the self-convolution based attention extractor (SAE) module and the self-convolution based attention pooling algorithm (SAP) module. The former detects the intact rock regions and generates the attention map; the latter improves the performance of the classifier by fusing the attention map that focuses on the intact rock regions. SAFN is evaluated in terms of interpretability, ablation, accuracy and cross-validation, and it outperforms state-of-the-art models on the rock mass assessment dataset. Furthermore, the dynamic field test shows that our assessment system based on the SAFN model is accurate and efficient for automated classification of rock mass.

www.nature.com/scientificreports/
health in civil engineering 13 in an end-to-end multilayer fashion. Chen et al. 14 use images of the rock tunnel face to classify the rock structure of the tunnel face. However, this method can only be used with manual drilling and blasting, because of the narrow space between the TBM cutterhead and the tunnel face. In addition, deep learning depends on huge numbers of training samples 15 , and research on tunnel rock images is scarce at present. In this paper, we classify the rock mass through the direct detection method. We monitor the rock excavated by the tunnel boring machine and collect a rock mass assessment dataset. This dataset is the first attempt to classify the excavated rock during TBM tunnel excavation. To fulfill the rock assessment task, we assume that the classification task will benefit if the model recognizes significant image areas and amplifies their effects while suppressing irrelevant and potentially confusing information in other regions. Based on this assumption, the self-convolution based attention fusion network (SAFN) is proposed. The core of our method is the discovery and fusion of the object attention map within a deep neural network. We use the attention map not only for region of interest (ROI) extraction, but also to improve the performance of the CNN by fusing the attention map that focuses on the intact rock regions in an image. The monitoring results of the system intuitively show the different types of rocks under the surface, which can provide a scientific reference for TBM control research 16 . As the first attempt to use CNNs to classify rock mass structures captured from the excavated rock during tunnel excavation, this research makes the following two contributions. First, it paves the way for other researchers to apply a more accurate and efficient framework to classify the rock structure during tunnel excavation. Secondly, it confirms that the proposed image technique significantly improves the efficiency of conventional overall recognition.
This paper is organized as follows. The rock assessment dataset is described briefly in "Rock assessment dataset". "Rock assessment model" describes the SAFN for rock mass assessment. "Experiment results" reports the experiments. Finally, the conclusion of this paper is drawn in "Conclusion and future work".

Rock assessment dataset
In general, rock mass assessment is performed by trained TBM drivers. However, manual assessment comes at a high cost in labor and time, and the assessment results are not recorded continuously in real time. Therefore, it is necessary to construct an automated visual assessment system to provide a reliable assessment. The schematic diagram of the assessment system is shown in Fig. 2a; it includes an area-array camera, light sources, a trigger unit, a conveyor group, and a data processing server.
The camera is a HIKVISION MV-CA013-21UM with a focal length of 8 mm. The camera shoots vertically, at a working distance of 1000 mm between the area-array camera and the excavated rock. The horizontal and vertical viewing angles are 55.6° and 47.6°, respectively. The monitored region covers 1000 mm × 800 mm on the conveyor belt. When the conveyor belt runs at a speed of 5 m/s, the system can capture images at 30 frames per second 17 . To keep the captured image free of "ghosting", the exposure time should be set to 200 ms. The captured image is then sent to the server via Ethernet and classified by the deep learning network. Based on the rock mass assessment results, the TBM control system obtains accurate rock mass parameters for geologically adaptive control. As shown in Fig. 2b, the visual monitoring system is installed on the conveyor belt behind the segment erector, approximately 20 m from the nose of the TBM.
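As a rough consistency check on this camera geometry, the footprint seen at the working distance can be estimated from the viewing angles, and the belt travel between frames from the belt speed and frame rate. The sketch below assumes an ideal pinhole camera looking straight down; the function names are illustrative and not part of the original system.

```python
import math


def footprint_mm(working_distance_mm: float, view_angle_deg: float) -> float:
    """Scene extent covered by a full viewing angle at a given distance (pinhole model)."""
    half_angle = math.radians(view_angle_deg) / 2.0
    return 2.0 * working_distance_mm * math.tan(half_angle)


def belt_travel_per_frame_mm(belt_speed_m_s: float, frames_per_second: float) -> float:
    """Distance the belt moves between two consecutive frames."""
    return belt_speed_m_s * 1000.0 / frames_per_second


width_mm = footprint_mm(1000.0, 55.6)    # horizontal viewing angle
height_mm = footprint_mm(1000.0, 47.6)   # vertical viewing angle
travel_mm = belt_travel_per_frame_mm(5.0, 30.0)

print(f"footprint: {width_mm:.0f} mm x {height_mm:.0f} mm")
print(f"belt travel per frame: {travel_mm:.0f} mm")
```

Under these assumptions the optical footprint comes out slightly larger than the 1000 mm × 800 mm monitored region, and the belt advances about 167 mm between frames, so consecutive frames overlap.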
This research team spent three months collecting excavated rock images in tunnel excavation fields with different rock masses in the Hangzhou Second Source Water Conveyance Channel Project (Shanling section, Jiangnan Route), Hangzhou, China. This project, with a total length of 13.21 km, was designed to convey and divert water from Xianlin Reservoir to Hangzhou. The size of the rock images was set at 1024 × 1024. In this study, 12,600 rock images of nine types were collected: granite porphyry (GP), conglomeratic sandstone (CS), medium fine sandstone (MFS), argillaceous siltstone (AS), silty mudstone (SM), tectonic breccia (TB), cryptoexplosive breccia (CB), argillaceous silty limestone (ASL) and bioclastic limestone (BL), as shown in Fig. 3.

Rock assessment model
This section focuses on designing the assessment network for the rock mass. There are three challenges for our rock feature discovery.
1. Prior knowledge of the rock mass. Compared to the clast, the relatively intact rock regions retain prior knowledge of the rock veins and structure, which should be incorporated into the classification task. Mainstream classification networks can be divided into mask-guided 18,19 methods, attention-based 20,21 methods and global-feature-based 22,23 methods. The mask-guided and attention-based methods can utilize the object information. However, mask-guided methods require labeling images in large-scale collections, which is unrealistic in industrial applications. We therefore choose the attention-based method to complete the classification task, using an attention mechanism to extract additional prior knowledge features.
2. Attention mechanism fusion to CNN. The attention map has long been used for visual explanation 24 . High response values mark the attention locations in image recognition. Few studies attempt to improve the performance of CNNs by fusing the attention mechanism. We propose the SAP module, which integrates the attention map into the classification network.
3. Real-time requirement. The proposed visual system is used to assist TBM driving, and the classification result of the rock mass is the input of the geologically adaptive control, so real-time performance is critical for our system. Existing methods widely use a branch-structured network to extract the attention map. Our method improves real-time performance by using the self-convolution operation instead of a branch deep neural network.
Overall flow. We propose the self-convolution based attention fusion network (SAFN) for rock mass assessment. The attention-based network architecture resembles the human body and consists of three parts, illustrated in Fig. 4: the baseline, which extracts hidden information from the image; the limb, which extracts the attention map of the ROI by the self-convolution operation; and the head, which fuses the attention feature into the backbone by SAP and makes the final judgment on the basis of this analysis.
The overall flowchart of the SAFN is depicted in Fig. 4. Our task is to predict an attention map $A$ and a category $C$ for each image. Given the image $I$, its proposals $R_i$ obtained by selective search, and the image label $B$, the feature map of $I$ is first extracted by the CNN baseline. Then, our model obtains the attention map $A$ of the intact rock region by the self-convolution based attention extractor. Finally, the SAP distance is obtained by the self-convolution based attention pooling algorithm and applied to the classification network.
Feature extractor. In the attention-based network, the baseline network is the foundation 25 . The deep residual learning framework (ResNet) 26 , one of the most popular models at present, was proposed to address the degradation problem that arises as the number of convolutional layers increases. The ResNet-50 framework, depicted in Fig. 5, can be divided into 5 stages. ResNet addresses the degradation problem through shortcut connections that skip one or more layers and simply perform identity mapping. The residual block fits the residual function (1):

$$F(x) = H(x) - x \tag{1}$$

where $x$ is the identity mapping and $H(x)$ is the arbitrary desired mapping. Even when the output of the stacked nonlinear layers $F(x)$ degrades to zero, the block still reduces to the identity mapping, so the adverse effect of vanishing gradients can be ignored.

Self-convolution based attention extractor module.
In traditional image processing, the self-convolution operation is an important method for calculating the similarity between two images. We aim to obtain the attention map of the intact rock region. To this end, this part introduces the self-convolution operation between the feature maps of the image $I$ and the feature maps of the proposals $R_i$. The feature map of a proposal serves as the convolution kernel, while the feature map of the image is the region to be convolved. This yields the attention map because of two factors:
1. The value of the feature map to be convolved. The response value is larger when the values of the convolution kernel and the region to be convolved are larger.
2. The similarity between the original image $I$ and the proposal region. The response value is larger when the similarity is higher.
In general, the feature map values of the conveyor belt and the clast regions are small, and their similarity with the intact rock region is low, so we can obtain a more accurate attention map of the intact rock region.
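The idea of using a proposal's feature map as a kernel over the image feature map can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it assumes "valid" sliding without padding, uses cross-correlation (which is what standard convolution layers in deep learning frameworks compute), and the tiny single-channel feature map and the helper names `response_map` and `crop` are made up for the example.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def response_map(feature: FeatureMap, kernel: FeatureMap) -> List[List[float]]:
    """Slide `kernel` over `feature` and sum the products over all channels."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    k_h, k_w = len(kernel[0]), len(kernel[0][0])
    out = []
    for y in range(height - k_h + 1):
        row = []
        for x in range(width - k_w + 1):
            total = 0.0
            for c in range(channels):
                for dy in range(k_h):
                    for dx in range(k_w):
                        total += feature[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(total)
        out.append(row)
    return out


def crop(feature: FeatureMap, top: int, left: int, k_h: int, k_w: int) -> FeatureMap:
    """Cut a proposal region out of the image feature map to use as a kernel."""
    return [[r[left:left + k_w] for r in ch[top:top + k_h]] for ch in feature]


# Made-up 1-channel feature map: high values stand in for an intact rock region,
# low values for the conveyor belt.
F_I = [[[0.1, 0.1, 0.1, 0.1],
        [0.1, 0.9, 0.8, 0.1],
        [0.1, 0.8, 0.9, 0.1],
        [0.1, 0.1, 0.1, 0.1]]]

x_rock = crop(F_I, top=1, left=1, k_h=2, k_w=2)  # proposal on the rock
x_belt = crop(F_I, top=0, left=0, k_h=1, k_w=2)  # proposal on the belt

S_rock = response_map(F_I, x_rock)
S_belt = response_map(F_I, x_belt)

peak_value, peak_pos = max(
    (v, (y, x)) for y, row in enumerate(S_rock) for x, v in enumerate(row)
)
print("rock kernel peak:", peak_pos)
print("max belt response:", max(max(row) for row in S_belt))
```

In this toy case the rock kernel responds most strongly where it overlaps the rock region, while the low-valued belt kernel produces only small responses everywhere, which mirrors the two factors listed above.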
First of all, around 150 region proposals are extracted from the original image (Fig. 6a) by selective search 27 in its "fast mode", as shown in Fig. 6b. This method is implemented in two steps. First, the original image is initialized into small-scale regions by a graph-based segmentation method. Then, regions are merged into larger ones according to characteristics such as color, texture and computational complexity. The results show that the Mean Average Best Overlap is over 0.879. Fig. 6c shows the result after removing repeated regions, i.e., region proposals whose mutually covered area exceeds 90%.
First, an original image $I$ and its $N$ proposals $R_i$ are input into the CNN baseline to obtain the corresponding feature maps $F_I$ and $F_R$, each indexed by the rows, columns and channels of the feature map. Then, we choose $x_i \in F_R$ as the convolution kernel, while $F_I$ is the region to be convolved. For any convolution kernel, the response map is calculated as follows:

$$S_i = x_i * F_I, \quad i = 1, 2, \ldots, N, \tag{2}$$

where $*$ is the self-convolution operation, which can be implemented with standard convolution. Figure 7 illustrates this procedure. For example, the yellow box region in Fig. 7 is the roller of the conveyor belt, so when the features of the yellow box region are used as the convolution kernel to convolve $F_I$, the roller region has a very high response value. The red and green box regions are intact rock regions, and the focal regions of their response maps are the intact rock regions. Because the red and green boxes correspond to rocks of different sizes, the response values obtained also differ for rock regions of different sizes. In total, there are $N$ response maps after the self-convolution operation. However, each response map represents the result of only one local region, so we fuse all response maps $S_i$ to obtain the final attention map $A$,
where $S = \frac{1}{N}\sum_{i=1}^{N} S_i$ is the normalization hyperparameter of the softmax. The softmax normalization smooths the contribution of each region's attention map.

Self-convolution based attention pooling algorithm module.
The attention map $A$ is provided by the self-convolution operation in the previous section. To make the classification more accurate, we propose the self-convolution based attention pooling algorithm (SAP), as shown in Fig. 8.
First of all, the SAP uses global average pooling to obtain the global average features of the image and performs $L_2$ normalization on them. The distance between local features is then calculated. Next, the weight vector $V = \{v_1, v_2, v_3, \ldots, v_9\}$ of the local features is obtained by global average pooling of the attention map $A$. For the classification task, we could simply take the element-wise product of the local features and the weight vector as the final classification distance vector. However, this approach is problematic for classification, because the gaps between the entries of the final classification distance matrix become smaller. Therefore, we introduce the idea of the residual to obtain the SAP distance vector.
Finally, the SAP distance vector is used to complete the classification task. Because the response value of the attention map $A$ is higher in the intact rock regions, the local features of these regions receive higher weights than those of other regions. Thus, the effect of the intact rock area on the SAP distance is effectively enhanced.
In this paper, our proposed SAFN method includes two modules. The modules are integrated into the baseline network without breaking the baseline framework, which makes them easier to fuse with other deep learning networks.
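The SAP equations are not reproduced in this section, so the following is only a sketch of one plausible reading of the residual idea in the SAP description: each local distance $d_k$ is kept and its attention-weighted copy is added, giving $d_k (1 + v_k)$. The function name, the exact formula and the sample values are assumptions for illustration, not the authors' definition.

```python
from typing import List, Sequence


def residual_attention_weighting(distances: Sequence[float],
                                 weights: Sequence[float]) -> List[float]:
    """Assumed residual form: d_k + d_k * v_k, i.e. d_k * (1 + v_k)."""
    if len(distances) != len(weights):
        raise ValueError("distances and weights must have the same length")
    return [d * (1.0 + v) for d, v in zip(distances, weights)]


# Hypothetical local distances and attention weights (nine entries, as in V).
distances = [0.5, 0.4, 0.6, 0.5, 0.7, 0.4, 0.5, 0.6, 0.5]
weights = [0.1, 0.0, 0.9, 0.2, 0.8, 0.0, 0.1, 0.7, 0.1]

plain = [d * v for d, v in zip(distances, weights)]  # plain element-wise product
sap_like = residual_attention_weighting(distances, weights)

print(plain)
print(sap_like)
```

With the plain product, entries whose attention weight is near zero are erased; the assumed residual form keeps every local distance and boosts those that fall in high-attention regions.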

Experiment results
In this section, we first evaluate the performance of SAFN in terms of interpretability, ablation, accuracy and cross-validation on the rock mass assessment dataset. We then conduct a dynamic test to further analyze the effectiveness and real-time performance of the assessment system based on the SAFN model.
Dataset and implementation detail. Based on the assessment system, we select 12,600 images of nine types of rock to construct the rock mass assessment dataset. Each rock mass category contains 900 training images and 500 validation images. Input images are uniformly scaled to 448 × 448 pixels. We use standard mini-batch SGD and adopt learning rate warm-up as in Ref. 28 . Cross-entropy loss is minimized for the classification task. In general, deep learning only works when a lot of data is available, and variations in camera angle and light intensity also call for many samples. To enlarge the dataset, we adopt data augmentation, including random crop, random horizontal flip and other operations.
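Two of these details lend themselves to a quick check: the dataset size implied by the per-class split, and the shape of a learning rate warm-up. The sketch assumes a linear warm-up, which is one common choice; the base learning rate and warm-up length are hypothetical values, since the section does not state them.

```python
NUM_CLASSES = 9
TRAIN_PER_CLASS = 900
VAL_PER_CLASS = 500

# Per-class split times the number of rock types gives the dataset size.
total_images = NUM_CLASSES * (TRAIN_PER_CLASS + VAL_PER_CLASS)
print(total_images)


def warmup_lr(step: int, base_lr: float = 0.01, warmup_steps: int = 500) -> float:
    """Linearly ramp the learning rate up to base_lr, then hold it constant.

    base_lr and warmup_steps are illustrative defaults, not values from the paper.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 249, 499, 1000):
    print(step, warmup_lr(step))
```

The product of nine classes and 1,400 images per class reproduces the 12,600 images reported for the dataset.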
Evaluation metrics. We employed the accuracy (A), precision (P), recall (R), and F1-score ($F_1$) 29 .
These results indicate that the self-convolution operation enables a visual explanation that takes the rock information into account.
Then, we evaluate the bounding box (Bbox) localization error, using the localization metric suggested in Ref. 30 . In detail, if the Intersection-over-Union (IoU) between the predicted Bbox and the ground-truth Bbox of the rock mass is greater than 50%, the predicted bounding box is considered correct.
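The IoU criterion can be sketched as below. Boxes are assumed to be axis-aligned and given as `(x1, y1, x2, y2)` corner coordinates; the helper names and the example boxes are illustrative only.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_correct_localization(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as correct when its IoU with the ground truth exceeds the threshold."""
    return iou(pred, truth) > threshold


def localization_error(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Fraction of images whose predicted box fails the IoU test."""
    wrong = sum(not is_correct_localization(p, t) for p, t in zip(preds, truths))
    return wrong / len(preds)


truth = (0.0, 0.0, 10.0, 10.0)
good = (2.0, 0.0, 12.0, 10.0)   # overlaps 80 of 120 units of union
bad = (5.0, 0.0, 15.0, 10.0)    # overlaps 50 of 150 units of union

print(iou(good, truth), iou(bad, truth))
print(localization_error([good, bad], [truth, truth]))
```

The localization error in a results table is then the fraction of test images for which the predicted box fails this test.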
As shown in Table 1, we select the recent methods GAP 31 and CCAM 32 to compare Bbox localization errors. GAP utilizes the class activation map (CAM) to obtain target object regions. CCAM observes the activation maps from the highest to the lowest probability classes, and utilizes this attribute to suppress the background region, so as to achieve accurate object localization. Our model achieves 20.4% and 5.9% error reductions compared with GAP and CCAM, respectively. These results show that our model is able to select the intact rock regions for attribute recognition.
Ablation experiment. Our proposed SAFN method includes two modules: the self-convolution based attention extractor (SAE) module and the self-convolution based attention pooling algorithm (SAP) module. To investigate the influence of each module on rock mass assessment, we perform an ablation analysis of SAFN on the rock mass assessment dataset to evaluate how different components affect the detection performance. Table 2 shows the experimental results. First, compared with the baselines in the 1st row (VGG network 33 ) and the 4th row (ResNet network), the SAE module dramatically improves the accuracy, as shown in the 2nd and 5th rows, and it introduces no extra time to the system. We find that our module improves the accuracy of ResNet-50 and VGG-19 by at least 17.3% and 16.3%, respectively. Then, we verify the effectiveness of the SAP module. The accuracy of SAFN improves by a further 4.8% and 4.3%, as shown in the 3rd and 6th rows. These results indicate that adding SAP is more accurate than the SAE operation alone, which shows that the self-convolution based attention pooling algorithm can improve the model performance.
Accuracy evaluation. In this section, we compare the accuracy of SAFN with the ResNet-50 baseline against other state-of-the-art methods. The average accuracy and time cost per image on the rock mass assessment dataset are shown in Table 3. The most representative and advanced methods are selected for the accuracy evaluation: WSCPM 34 , ABN 35 and ResNet-101 26 . WSCPM detects the object parts in a weakly supervised manner to build complementary parts, and the classification task is completed using the local features. ABN introduces an attention-based branch structure into the classification network. The classification accuracies of WSCPM, ABN and ResNet-101 are 80.3%, 84.7% and 78.4%, respectively. Our model improves the accuracy by at least 7.8%. Compared with the state-of-the-art methods, we use the self-convolution operation instead of the network branch structure in WSCPM and ABN, so our method also has an advantage in terms of time.

Cross-validation of rock mass assessment dataset.
To further explore the SAFN model, the confusion matrix is used to evaluate the classification results. The confusion matrix is a widely used index for recognition evaluation. Each column represents the predicted category, and the total of each column is the number of data instances classified into that category. Each row represents the actual category to which the data belongs, and the total of each row is the number of data instances of that category. The confusion matrix is obtained by counting, for every pair of actual and predicted categories, the number of correctly and incorrectly classified observations of the recognition model. Table 4 shows the confusion matrix obtained on the rock mass assessment dataset using the SAFN model. GP, with an accuracy of 95.77%, is the easiest to identify accurately in the rock mass assessment, followed by SM, CB, AS, BL, CS, TB, ASL and MFS with accuracies of 94.72%, 93.18%, 92.42%, 92.12%, 91.8%, 91.29%, 91.05% and 90.05%, respectively. Furthermore, it can be inferred that among the 9 categories, TB is easily misclassified as CB, since the size and distribution of its rock mass are close to those of CB. Additionally, the comparison between the four methods for the different rock mass categories is shown in Fig. 10. It can be seen that all the evaluation metrics of the CNN models present a similar trend. Overall, the values of the metrics from the highest to the lowest follow this order: SM, CB, BL, CS, TB, MFS, AS, and ASL. Owing to the texture features and distinct appearance of the GP images, all the methods perform most prominently in GP identification. The ASL and AS images in the dataset have many similar features, so the classification of ASL and AS images shows relatively poor performance. However, our method focuses on the intact rock details and performs better at distinguishing AS from ASL. This also shows that our method has great advantages in the classification of similar rocks.
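Given a confusion matrix laid out as described in this section (rows are actual categories, columns are predicted categories), the accuracy, precision, recall and F1-score follow directly. The sketch below uses a made-up 3-class matrix rather than the values of Table 4, and `per_class_metrics` is an illustrative helper name.

```python
from typing import List, Tuple


def per_class_metrics(cm: List[List[int]]) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return overall accuracy and (precision, recall, F1) for each class.

    cm[r][c] counts samples of actual class r that were predicted as class c.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column total
        actual_k = sum(cm[k])                          # row total
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return accuracy, results


# Made-up counts for three classes (not the paper's data).
toy_cm = [[8, 2, 0],
          [1, 9, 0],
          [0, 1, 9]]

accuracy, metrics = per_class_metrics(toy_cm)
print(f"accuracy = {accuracy:.4f}")
for k, (p, r, f1) in enumerate(metrics):
    print(f"class {k}: precision = {p:.4f}, recall = {r:.4f}, F1 = {f1:.4f}")
```

Precision divides the diagonal entry by its column total and recall divides it by its row total, which is why the row and column conventions of the confusion matrix matter.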
Dynamic test for the assessment system. To verify the performance of the SAFN model in application, a dynamic classification test with the assessment system based on the SAFN model is conducted in the section from D5 + 650 to D5 + 655 of the Hangzhou Second Source Water Conveyance Channel Project (Shanling section, Jiangnan Route), Hangzhou, China, as shown in Fig. 11. There are 400 online samples covering the three types of geological conditions in this section: MFS, SM and AS. The real-time classification results for the dynamic test are shown in Fig. 12.
The classification results are shown in Table 5. The precision for MFS, SM and AS reached 94.8%, 94.7% and 96.7%, respectively. Overall, the dynamic test indicates that our system is efficient for automated rock mass classification.
Analysis of real-time performance. Real-time performance is crucial for the assessment system based on the SAFN model, so we evaluate the inference speed of the system. The experiment was run on one NVIDIA GeForce RTX 3090 GPU. The data upload time is 300 ms. The inference time of the SAFN model is 132 ms at an image resolution of 1024 × 1024, 53 ms at 448 × 448, and 17 ms at 224 × 224. In general, the mean run time for a rock image is less than 1 s.
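The sub-second figure can be checked by adding the reported upload and inference times. The sketch assumes that upload and inference run one after the other and that no other stage contributes significantly, which the section does not state explicitly.

```python
UPLOAD_MS = 300.0
# Reported inference times, keyed by the side length of the square input in pixels.
INFERENCE_MS = {1024: 132.0, 448: 53.0, 224: 17.0}

# Assumed sequential pipeline: upload first, then inference.
totals_ms = {side: UPLOAD_MS + infer_ms for side, infer_ms in INFERENCE_MS.items()}

for side, total in totals_ms.items():
    print(f"{side} x {side}: {total:.0f} ms per image")
```

Even at the full 1024 × 1024 resolution the sum stays well below 1 s under this assumption.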

Conclusion and future work
In this research, we construct a visual assessment system and install it on the TBM. We then collect a large in-situ rock mass image dataset from the TBM construction site. It includes around 12,600 rock mass images covering nine different rock types. This dataset is the first attempt to classify the excavated rock during TBM tunnel excavation. To assess the rock mass, we have presented a self-convolution based attention fusion network. The core of our method is the discovery and fusion of the object attention map within a deep neural network. We use the attention map not only for region of interest (ROI) extraction, but also to improve the performance of the CNN by fusing the attention map that focuses on the intact rock regions in an image. First, SAFN detects the intact rock regions in the image by the SAE module. Then, the SAP module is introduced in the classifier, which is trainable for image recognition in an end-to-end manner. We integrate a region-based part attention map into the deep network through the SAP module.
To evaluate the SAFN model, we conduct extensive experiments that demonstrate its accuracy, interpretability and efficiency on the rock mass assessment dataset. Moreover, the dynamic test shows that our assessment system based on the SAFN model is accurate and efficient for automated classification of rock mass during TBM tunnel excavation. However, establishing an automatic driving system for TBM based on the rock mass assessment proposed in this paper remains a challenging task.