Robust 3D lane detection in complex traffic scenes using Att-Gen-LaneNet

Robust 3D lane detection is the key to advanced autonomous driving technologies. However, complex traffic scenes such as bad weather and variable terrain are the main factors affecting the robustness of lane detection algorithms. In this paper, a generalized two-stage network called Att-Gen-LaneNet was proposed to achieve robust 3D lane detection in complex traffic scenes. The Efficient Channel Attention (ECA) module and the Convolutional Block Attention Module (CBAM) were combined in this network. In the first stage of the network, we improved the semantic segmentation network ENet and proposed the weighted cross-entropy loss function to solve the problem of ambiguous distant lane segmentation. This method improved Pixel Accuracy to 99.7% and MIoU to 89.5%. In the second stage of the network, we introduced the interpolation loss function to achieve accurate lane fitting. This method outperformed existing detection methods by 6% in F-score and Average Precision on the Apollo Synthetic dataset. The proposed method achieved better overall performance in 3D lane detection and was applicable to broader and more complex traffic scenes.

In recent years, with the rise of autonomous driving technology, the transportation industry is developing rapidly in the direction of intelligence and autonomy. One of the prerequisites for these directions is the automatic detection and identification of various elements in the traffic scenes. Lanes are essential traffic signs, so accomplishing robust detection of lanes in complex traffic scenes is the key to implementing advanced autonomous driving technologies [1][2][3] .
Most of the current lane detection methods only stay at the 2D level, and the emergence of 2D lane detection datasets such as Tusimple 4 and Culane 5 make this research direction develop more rapidly. Researchers were committed to improving the detection accuracy of lane, and attention modules such as Dual Attention and SAD 6-10 were added to semantic segmentation networks. These methods used spatial or channel correlation to assist in accurate lane fitting. 2D lane detection algorithms usually perform semantic segmentation of the image first 5,[11][12][13] and then convert the driver's view image into the bird's eye view using the inverse perspective transformation 14,15 . Curve fitting uses polynomials in the bird's eye view 16,17 and the output detection results are approximate curves of the 3D lane in the real scenes. Real-time performance is an essential metric for evaluating lane detection algorithms. The CondLaneNet proposed by Liu et al. 18 achieved a detection rate of 220 FPS on the Culane dataset, and the PolyLaneNet proposed by Lucas Tabelini et al. 19 achieved a detection rate of 115 FPS on the Tusimple dataset. These lane detection algorithms used many assumptions on lane properties such as flat roads and uniform lighting. Due to the above assumptions, the existing lane detection technologies have poor robustness and provide false perceptions when the vehicle is driving up and down hills, curve lanes, and complex traffic scenes such as rain or snow, in general lack adaptive capabilities compared to drivers.
In recent years, scholars have started to research 3D lane detection, which mostly relies on the geometric relationship between the in-vehicle camera settings and the road surface, as shown in Fig. 1. 3D-LaneNet 20 was one of the first end-to-end network frameworks proposed in this research direction, which implemented IPM transformation internally and introduced the concept of anchor representation. 3D-LaneNet implemented endto-end training with image views and bird's eye views in parallel, and achieved superior results in complex traffic scenes such as lane merging and splitting. Netalee Efrat et al. 21 proposed a camera-based DNN method. This method followed the parallel structure in 3D-LaneNet and decomposed the lanes into the lane line segments using grids in the bird's eye view. In this approach, adjacent grids will have overlapping perceptual fields, so the lane line segments of adjacent grids can be clustered into complete lanes. The two-stage network proposed by Guo et al. 22 computed 3D lane point coordinates using the geometric transformation between the in-vehicle camera coordinate system and the vehicle coordinate system. This method was beginning to be applied to unseen scenes. When segmenting lane images, we will face the problem of ambiguous distant lane segmentation due to complex traffic scenes such as hills and curve lanes. In addition, the great difference in the number of lane and background pixels lead to unclear lane edge segmentation. When predicting 3D lane structure, the model has poor generalization ability in unseen scenes and complex weather. Therefore, we designed a two-stage network that focused on segmentation of 3D lane and 3D lane coordinates prediction. In the first stage, we improved the lightweight semantic segmentation network ENet 23 and introduced the ECA attention module 24 in the decoder part of ENet to improve the segmentation effect by enhancing the lane and background discrimination ability of the model. The CBAM attention module 25 was introduced between the top view encoding layer and the prediction head of the second stage geometric encoding subnetwork to aggregate more global information to enhance the model's generalization ability. We designed the weighted cross-entropy loss function to constrain the problem of the unbalanced number of lane and background pixels in the semantic segmentation process. We also introduced the interpolation loss function 26 to solve the problem of poor local fit of the lanes.
The contributions of this paper are summarized as follows: A two-stage 3D lane detection network was designed with superior generalization performance of the model for a wider range of traffic scenes. The ECA attention mechanism and the CBAM attention mechanism were introduced in the two stages, which improved the segmentation effect and prediction accuracy of the network accordingly. The weighted cross-entropy loss function and the interpolation loss function were improved in the two stages to enhance the model's generalization ability.
The main goal of this research is to design a 3D lane detection algorithm with more robust performance to provide a model for more advanced autonomous driving techniques. In this paper, the second section describes the network architecture and the main methodology, the third section shows the experimental data and result plots, and last section illustrates our conclusions.

Methods
Attention mechanism. ECA attention module. In the first stage of the network, we introduce the ECA attention module 24 to assist the 3D lane segmentation. The structure of the ECA attention module is shown in Fig. 2. The overall structure after adding the ECA attention module is shown in Fig. 6. This module contains only nine parameters, which not only learns the correlation between different channels of the feature maps and improves the sensitivity of the network to the 3D lane structures. The ECA attention module first performs global average pooling of the input feature and then performs a one-dimensional convolution operation with a convolution kernel of k . The generated feature maps through the sigmoid activation function to obtain the weights of each channel, and the weights are multiplied with the corresponding elements of the input feature maps to obtain the output feature maps. The value of k is determined adaptively by the channel dimension in the input feature maps, and the equation is as follows. CBAM attention module. In the second stage of the network, we introduce the CBAM attention module 25 to assist the 3D lane prediction. We add it between the Top-view Segmentation encoder and the lane prediction head. The overall structure after adding the CBAM attention module is shown in Fig. 7. The CBAM attention module consists of a spatial attention module and a channel attention module in series. The overall structure of the CBAM attention module is shown in Fig. 3. The outputs of the convolutional layer will first pass through the channel attention module to get the weighted results and then will pass through the spatial attention module to get the final weighted results. The CBAM attention module extracts more global information in both channel and spatial dimensions to predict 3D lane structures better. The channel attention module performs the input feature's global max pooling and global average pooling to obtain two one-dimensional vectors. These two vectors will pass through the shared multi-layer perceptron (MLP) and be summed, and finally the channel attention feature maps are generated using the sigmoid activation function. The channel attention feature maps are element-wise multiplied with the input feature maps to generate the input feature maps of the spatial attention module. In the spatial attention module, we first perform global max pooling and global average pooling operations on the channel dimension and concatenate the two results. The generated results are reduced to one channel through a convolutional layer and then through the sigmoid activation function to generate the spatial attention feature maps. The spatial attention feature maps are multiplied with the channel-refined feature maps to obtain the final refined feature maps. The following equations describe the channel attention module and the spatial attention module.  where f 7 * 7 denotes the convolution operation with a filter size of 7 × 7 , F s avg and F s max denote the average pooling feature and the max pooling feature in the spatial attention module.
Geometric transformation and anchor representation. In this paper, the 3D lane is represented in the vehicle's coordinate system consisting of X, Y , Z axes and the origin O . We use the height H cam and the pitch angle θ to indicate the camera's pose. The camera coordinate system is represented by X c , Y c , Z c and the origin C . Z denotes the real height of a 3D lane. The 3D lane can be projected onto the image plane by projection transformation and then the lane image can be projected onto a flat road surface by planer homography to generate a bird's eye view. Due to the camera parameters are involved, the lanes in the bird's eye view have different X, Y values compared to the 3D lanes in the vehicle's coordinate system. We set the bird's eye view as a special coordinate system defined by the x, y, Z axes and the origin O . The geometric transformation between the coordinate system of the bird's eye view and the vehicle's coordinate system can be expressed by the following equation: The 3D lane coordinates are represented as the x, y in the bird's eye view coordinate system and the real height Z . Since the geometric transformation is independent of the camera parameters, the geometric transformation holds whether the vehicle is driving on the uphill or downhill scenes. Using this geometric transformation, the lane coordinates in the bird's eye view can be mapped back to the real road coordinates. The geometric transformation is shown in Fig. 4.
We use the anchor representation 22  We choose the improved ENet as the first subnetwork for semantic segmentation of images. The asymmetric ENet network contains a large encoder and a small decoder. The whole network consists of 6 blocks. Block1 is the initial block for generating feature maps and fusing the feature maps generated by pooling and convolution operations. Block2 and block3 are downsampling blocks, and block4 repeats the structure of block3 to increase the depth of the network. Block5 and block6 are upsampling blocks and blocks 2-6 all have bottleneck as the  www.nature.com/scientificreports/ base structure. We use skip connection to lead the shallow features to the deeper layers of the network so that the decoder has more detailed information to obtain better segmentation and accelerate the model training. We apply the ECA attention module in the decoder to strengthen the network's ability to focus on the information of relevant channels. Figure 6 shows the first subnetwork architecture.
In the second subnetwork, the segmentation results are input to the top-view segmentation encoder and projected to the bird's eye view through inverse perspective mapping. The segmentation results are encoded in the feature maps through a series of convolution operations. The lane prediction head will use the anchor representation to predict the properties of the 3D lanes and calculate the real coordinates of the 3D lanes based on the geometric transformation. The architecture of the geometric encoding subnetwork is shown in Fig. 7.

Loss functions of Att-Gen-LaneNet.
In the first subnetwork, we use the standard cross-entropy loss function. To solve the problem of unbalanced sample distribution where the lane pixels are much less than the background pixels, we weight the loss. The equation is as follows: The weights are bounded when the probability of the lane class is close to 0 . c is an additional hyperparameter, which we set to 1.06. It makes the class weights restricted to the interval [1,50].
In the second subnetwork, the proposed loss function consists of the cross-entropy loss function, geometric distance error in x and z directions. The cross-entropy loss function is used to evaluate the predicted lanes presence probability p and visibility v correctness. The following equation can express the cross-entropy loss function: In previous studies, researchers only used the two ends of the fitted lane and the ground truth lane for error estimation, which resulted in a large amount of valuable ground truth information being ignored. To solve this problem, we insert more points in the x-direction to reflect the quality of the fit for the whole lane. Sparse sampling and dense sampling on each anchor are denoted as {y where f (·) denotes the interpolation rule, and the parameter of n is set to 0.6. The total loss function can be expressed as:  Dataset division strategy. The dataset is divided according to the following three strategies to evaluate our model in different aspects.
1. We divide the unbiased images into the training set and test set according to the ratio of 5:1, as a way to perform a basic test of our algorithm. 2. The above ratio is still used to divide the number of images in the training and test sets. The training set uses unbiased data, but the test set is chosen from the traffic scene images that do not appear in the training set. We use this approach to verify the generalization ability of our method when encountering unseen scenes. 3. Many images of the same scene in the Apollo Synthetic dataset are taken at different times of the day. We store images of the same scene taken at different times of the day into the training set and test set to verify the generalization ability of our method when the scene changes visually.
Evaluation method. We use Pixel Accuracy (PA) and MIoU as the main evaluation metrics when training the first subnetwork. Pixel Accuracy is used to calculate the ratio of the number of correctly predicted lane category pixels to the total number of pixels. For explanation, we count the lane categories as β . P nm indicates the number of pixels that belong to class n but are mistakenly detected as class m , P mn denotes the number of pixels that belong to class m but are mistakenly detected as class n , and P nn denotes the true number. MIoU is the result of the model first finding the ratio of the intersection and concatenation of the predicted and real values for each category, then summing and averaging them. β+1 denotes the sum of the number of lane categories and background categories. The following equations can express Pixel Accuracy and MIoU: In the second subnetwork, we use Average Precision (AP) and F-score as the primary metrics to evaluate the prediction results. Precision is the percentage of matched predicted 3D lanes, and recall is the percentage of matched ground truth 3D lanes. The following equation can express Average Precision: where N indicates the number of all images in the test set, p(k) indicates the precision value when k photos can be recognized and �r(k) indicates the change in recall value when the number of recognized images changed from k − 1 to k.
The following equation can express F-score: We adjust the value of α by weighing the two metrics, precision and recall. If we think that precision is more critical than recall, we adjust the value of α to be less than 1. If we think that recall is more critical than precision, we adjust the value of α to be greater than 1.
In addition, we use 40 m as the benchmark, define 40-100 m as the far distance and define 0-40 m as the near distance. We use the anchor representation to calculate the lane fitting error in the x and z directions of the far distance and near distance. At the predefined y positions with X j = {x , the error distance between X j and X n is expressed as:   Table 1, "ECA" indicates the addition of the ECA attention module, "ECA/Wclass" indicates the addition of the ECA attention module and the weighted cross-entropy loss function. We verify the performance improvement of each part by the above method.
Benefits of the ECA attention module. When adding only the ECA attention module to the original networks, the Pixel Accuracy can be improved by 0.6% on average, and the MIoU can be improved by 4.6% on average.
Benefits of the weighted cross-entropy loss function. When adding the ECA attention module and the weighted cross-entropy loss function to the original networks, the Pixel Accuracy can be improved by 0.8% on average, and the MIoU can be improved by 6.1% on average.
The 3D lane prediction results in Fig. 9a,b show that our model has superior generalization ability in complex traffic scenes such as uphills and downhills, unseen scenes, and visual changes. Figure 9c,d show that our model has superior generalization ability in complex traffic scenes such as curve lanes, unseen scenes, and visual changes.
In Table 2, "CA" indicates the addition of the CBAM attention module, and "CA/IL" indicates the addition of the CBAM attention module and the interpolation loss function. The increasing trend of the data in Table 2 fully illustrates that our improvements have improved the model performance in all scenes.
Benefits of the CBAM attention module. When adding only the CBAM attention module to 3D-LaneNet and Gen-LaneNet, the F-score can achieve an average of 5% improvement in the three scenes, especially in the Unseen scenes, 3D-LaneNet achieves 10.2% improvement. AP achieves an average of 5% improvement in the three scenes, especially in the Visual variations, and 3D-LaneNet achieves 8.2% improvement.
Benefits of the interpolation loss function. When adding the CBAM attention module and the interpolation loss function to 3D-LaneNet and Gen-LaneNet, F-score achieves an average of 1.3% improvement in the three scenes compared to adding the CBAM attention module only. AP achieves an average of 1.5% improvement in the three scenes.
Most importantly, our improved Att-Gen-LaneNet shows the best results in the three scenes.

Conclusions
In this paper, the proposed two-stage network Att-Gen-LaneNet perfectly solved the problem of ambiguous distant lane segmentation and achieved robust 3D lane structure prediction in complex traffic scenes. The model had strong generalizability and could be extended to unseen scenes. This work is beneficial to the combination and development of deep learning and autonomous driving technologies. The method proposed in this paper achieves excellent results, but it is very demanding on data. This method not only requires a large number of 3D scene images but also these images need to provide accurate annotation information, so the weakly supervisedbased method will be a development direction.

Data availability
All data included in this study are available upon request by contact with the corresponding author.