Instance-level 6D pose estimation based on multi-task parameter sharing for robotic grasping

The six-dimensional (6D) pose estimation task predicts the 3D rotation matrix and 3D translation vector of a target object in the world coordinate system from its color and/or depth image. Existing methods usually use deep neural networks to directly predict the object pose or to regress it from keypoints. The prediction accuracy typically varies with the prominence of the object's surface features and with the size of the object. To address this problem, we propose a 6D pose estimation framework based on multi-task parameter sharing (PMP), which incorporates object category information into the pose estimation network in the form of an object classification auxiliary task. First, we extract the image features and point cloud features of the target object separately and fuse them point by point; then, we share parameters between the keypoint-confidence prediction of the pose estimation task and the classification task, select the keypoints with higher confidence, and predict the object pose; finally, the estimated pose is refined by an iterative optimization network to obtain the final pose. Experimental results on the LineMOD dataset show that the proposed method improves the accuracy of pose estimation and narrows the gap in prediction accuracy between objects of different shapes. We also evaluate the method on a new small-scale object dataset containing RGBD images and accurate 3D point cloud information. The proposed method is applied to grasping experiments on a UR5 robotic arm and satisfies the real-time requirements of pose estimation during grasping.

Pose estimation results usually vary with the prominence of the object's surface features and with the object's size. For large-scale and small-scale objects, or objects with weak surface features, the estimation results usually have large deviations.
According to existing multi-task techniques 5,6, complex problems can be solved by decomposing them into multiple sub-problems. Zhou et al. 5 reported that a complex problem can first be refined and decomposed into simple, mutually independent sub-problems. These sub-problems are interconnected by common characteristics, which helps the primary task pick up new features more easily and strengthens its generalization capability. Vandenhende et al. 6 highlighted that numerous problems are intricate and require multiple steps to resolve. For example, the problem of determining whether a self-driving car is safe can usually be broken down into the following steps: identify objects in the scene, detect and localize them, and estimate their motion trajectories. The parameter-sharing mechanisms in multi-task techniques can be divided into soft sharing, hard sharing, and layer sharing. Soft-sharing 7 methods give each task its own parameters and encourage the parameters to be similar through regularization. Compared with soft sharing, hard-sharing [8-10] methods can greatly reduce the number of parameters: they use shared layers to share hidden parameters and apply a task-specific layer for each task. For example, to train numerous tasks jointly, Sun et al. 11 proposed a hard-sharing method called sparse sharing, which extracts sub-networks with higher overlap for strongly related tasks and sub-networks with larger differences and less overlap for weakly related tasks; they experimentally demonstrated that it requires the fewest parameters and exhibits superior accuracy in multi-task natural language understanding.
Inspired by the concept of sparse sharing 11, we model a multi-task approach that jointly optimizes the object classification task and the pose estimation task. Specifically, the object classification task is applied as an auxiliary task to provide inductive knowledge for pose estimation. Our contributions can be summarized as follows:
• We created a multi-angle dataset featuring small-sized objects. The PMP dataset was designed to address the performance limitations observed in existing methods when dealing with small objects in the LineMOD dataset. The dataset includes images capturing various postures of the objects, simulating the states observed during robotic arm gripping.
• We propose a PMP framework for multi-task pose estimation based on parameter sharing. By employing a parameter-sharing approach, object classification is used as an auxiliary task to provide object category information for keypoint feature extraction in pose estimation. This reduces prediction biases among objects of different shapes and improves accuracy.
• Our method not only achieves higher accuracy but also exhibits smaller accuracy deviations among objects than existing pose estimation models on the LineMOD dataset. Even in robotic grasping scenarios, the PMP model effectively estimates the pose of the objects.
The remainder of our paper is structured as follows. Section II describes the overall architecture of our method and the implementation details of each part, Section III describes the datasets and the associated experimental results, and Section IV discusses the contributions of the model and future prospects.

Methodology
Given an RGBD image, the goal of 6D pose estimation is to obtain the translation vector T ∈ R^3 and rotation matrix R ∈ SO(3) of the target of interest from the world coordinate system to the camera coordinate system. The prediction results usually have deviations depending on whether the surface shape of the object is prominent and on the size of the object. To solve this problem, a novel method is proposed herein. As shown in Fig. 1, we apply multi-task techniques to this problem, and the target object poses and categories are predicted by randomly discarding different neurons. The method consists of three parts. As shown in Fig. 2, the first part extracts and fuses the RGB and depth features of the target object; the second part performs object pose estimation and classification through shared network parameters; and the third part rectifies the estimated pose.

Image feature extraction
The image feature extraction module is split into two sections, based on a transformer 12 and a CNN 13. First, the input image crops are cut into size_p × size_p image patches, and each image patch is encoded into one-dimensional features that are fed into the transformer-based encoder-decoder structure. The encoder extracts high-dimensional global hidden features of the input image, and a lightweight decoder then restores the extracted hidden features to the original image size such that each pixel contains global attention over the pixels in the image. Compared with a symmetric encoder-decoder, the asymmetric structure uses a lighter-weight decoder, which greatly reduces the amount of computation and keeps more high-dimensional information in the encoder.
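To make the asymmetric design concrete, the following is a minimal PyTorch sketch of the encoder-decoder idea. The class name, the attention-head count, and the omitted per-pixel upsampling step are our assumptions; the patch size and layer dimensions follow the Parameters section.

```python
import torch
import torch.nn as nn

class AsymmetricImageEncoder(nn.Module):
    """Illustrative sketch: ViT-style encoder with a lighter-weight decoder."""
    def __init__(self, img_size=160, patch_size=8,
                 e_embed_dim=768, e_depth=12, d_embed_dim=512, d_depth=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Each patch_size x patch_size RGB patch becomes a one-dimensional token.
        self.patch_embed = nn.Conv2d(3, e_embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, e_embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(e_embed_dim, nhead=8, batch_first=True),
            num_layers=e_depth)
        # The decoder is narrower and shallower than the encoder (asymmetric design).
        self.proj = nn.Linear(e_embed_dim, d_embed_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_embed_dim, nhead=8, batch_first=True),
            num_layers=d_depth)

    def forward(self, x):                                          # x: (B, 3, H, W) image crop
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, N, e_embed_dim)
        hidden = self.encoder(tokens + self.pos_embed)             # global hidden features
        out = self.decoder(self.proj(hidden))                      # (B, N, d_embed_dim)
        # A lightweight upsampling head (omitted here) would restore `out` to
        # per-pixel resolution before the PSPNet layer described below.
        return out
```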
The global features restored by the decoder are then sent into a multi-scale convolution structure, a PSPNet 14 layer. The multi-scale convolutional network uses its inductive bias to extract local information: it applies a pyramid structure of multi-scale feature fusion to enhance the attention paid to the local pixels around each pixel. A separate network extracts features from the point cloud and enables dense fusion with the image features. In Fig. 3, C_i represents the feature output channels of the image feature extraction module, C_p represents the channels of the input point cloud, and the output of the dense fusion network is the concatenation of features from multiple scales.
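As an illustration of the point-wise fusion, a minimal sketch is given below; the channel sizes and the use of a global max-pooled feature are assumptions made for the example rather than the exact dimensions C_i and C_p of Fig. 3.

```python
import torch
import torch.nn as nn

class DenseFuse(nn.Module):
    """Sketch: concatenate per-point image and point-cloud features,
    then append a global feature to every point (multi-scale concatenation)."""
    def __init__(self, c_img=128, c_pts=128, c_global=512):
        super().__init__()
        self.im_mlp = nn.Sequential(nn.Conv1d(c_img, 256, 1), nn.ReLU())
        self.pt_mlp = nn.Sequential(nn.Conv1d(c_pts, 256, 1), nn.ReLU())
        self.global_mlp = nn.Conv1d(512, c_global, 1)

    def forward(self, img_feat, pts_feat):
        # img_feat: (B, C_i, N) image features sampled at the N chosen pixels
        # pts_feat: (B, C_p, N) geometric features of the corresponding 3D points
        f_im = self.im_mlp(img_feat)
        f_pt = self.pt_mlp(pts_feat)
        local = torch.cat([f_im, f_pt], dim=1)                     # (B, 512, N)
        g = self.global_mlp(local).max(dim=2, keepdim=True)[0]     # (B, c_global, 1)
        g = g.expand(-1, -1, local.size(2))
        return torch.cat([local, g], dim=1)                        # multi-scale per-point feature
```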

Multi-task pose-estimation module
After extracting and fusing the RGB and depth features of the input image crops, the next step is to determine the keypoint information of the target object. For the extracted multimodal features, three separate convolutional networks are used: the first two predict the 6D pose p = [R | T] of the target object, which contains the translation vector T ∈ R^3 and rotation matrix R ∈ SO(3), while the third predicts the keypoint confidence c for the pose estimation task and the object class information cls for the classification task simultaneously, which is accomplished through shared network parameters. Conv_t, Conv_r, and Conv_c in Fig. 2 are all one-dimensional convolutions that are fed separately with the per-point feature vectors generated by the feature extraction module. They map the features into translation, rotation, and confidence vectors with dimensions of num_obj*3, num_obj*4, and num_obj*1, respectively, where num_obj represents the total number of objects in the current dataset.
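A minimal sketch of the three heads is given below; any intermediate layers are omitted, the rotation is assumed to be parameterized as a quaternion (hence the factor of 4), and the class and variable names are ours.

```python
import torch.nn as nn

class PoseHeads(nn.Module):
    def __init__(self, feat_dim=1024, num_obj=13):
        super().__init__()
        # One-dimensional convolutions applied to each per-point fused feature vector.
        self.conv_r = nn.Conv1d(feat_dim, num_obj * 4, 1)   # rotation (quaternion) per object
        self.conv_t = nn.Conv1d(feat_dim, num_obj * 3, 1)   # translation vector per object
        self.conv_c = nn.Conv1d(feat_dim, num_obj * 1, 1)   # per-point confidence per object

    def forward(self, fused):                 # fused: (B, feat_dim, N)
        return self.conv_r(fused), self.conv_t(fused), self.conv_c(fused)
```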
Inspired by Sun et al. 11, we utilize a shared neural network to add the classification task as an auxiliary task for six-dimensional pose estimation, and it is expected to provide prior knowledge. Three concepts are involved here: the base network, the sub-network, and the shared network. The base network is the network that contains all the parameters of the model, a sub-network is a part extracted from the base network, and the common part of all sub-networks is the shared network. As shown in Fig. 4, σ denotes the base network. To generate keypoint confidence while predicting object categories, and to simultaneously enhance the credibility and persuasiveness of the confidence, different masks M_c and M_cls are generated, and sub-networks σ_c and σ_cls with both differences and commonality are extracted from the base network for the two tasks; the two tasks are then trained in parallel. As a result, the base network σ can contain solutions for multiple tasks, so the two associated tasks can share the learned knowledge through the common sub-network, while negative transfer due to task differences can be avoided through the differing parts of the sub-networks.
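The following sketch illustrates one way such masked sub-networks could be realized; the random per-weight masks, the drop ratios, and the layer type are illustrative assumptions rather than the exact scheme used to extract σ_c and σ_cls from the base network σ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSubnetConv(nn.Module):
    """Base layer (sigma) whose weights are partially shared by two tasks through
    fixed binary masks M_c and M_cls; the overlap of the masks is the shared network."""
    def __init__(self, in_ch, out_ch, drop_c=0.1, drop_cls=0.3):
        super().__init__()
        self.base = nn.Conv1d(in_ch, out_ch, 1)   # base-network parameters
        self.register_buffer("mask_c",   (torch.rand_like(self.base.weight) > drop_c).float())
        self.register_buffer("mask_cls", (torch.rand_like(self.base.weight) > drop_cls).float())

    def forward(self, x, task):
        mask = self.mask_c if task == "confidence" else self.mask_cls
        # Extract the task sub-network (sigma_c or sigma_cls) from the base network.
        return F.conv1d(x, self.base.weight * mask, self.base.bias)
```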
For the pose estimation task, as in previous work 16, ADD is used to calculate the distance between corresponding points of the 3D model of the object transformed by the ground-truth pose p = [R | T] and by the pose p_pred = [R_i | T_i] obtained from the model estimation. Thus, the accuracy of the object pose estimation is:

$$D_{p_i} = \frac{1}{m} \sum_{x_j \in M} \left\| (R x_j + T) - (R_i x_j + T_i) \right\| \tag{1}$$

where i indicates the i-th keypoint, D_{p_i} represents the mean distance between the point pairs given by the i-th dense pixel-wise prediction result p_i = [R_i | T_i], x_j represents the j-th 3D point among the selected 3D point cloud set M, and m is the number of points in M.
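A NumPy sketch of the ADD computation in Eq. (1); the function and argument names are ours.

```python
import numpy as np

def add_metric(R, T, R_pred, T_pred, model_points):
    """Mean distance between corresponding model points transformed by the
    ground-truth pose [R | T] and by the predicted pose [R_pred | T_pred]."""
    pts_gt   = model_points @ R.T + T            # (m, 3)
    pts_pred = model_points @ R_pred.T + T_pred  # (m, 3)
    return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()
```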
However, a symmetric object has multiple pose estimation results p_pred = [R_i | T_i] that all produce the same shape as the ground-truth pose p = [R | T] and therefore cannot be measured with the strict computation of ADD. The ADD-S metric proposed in PoseCNN 17 measures the distance between each point of the 3D model transformed by the predicted pose and the nearest point of the 3D model transformed by the ground-truth pose. Here, it is used to measure the accuracy of the pose estimation of symmetric objects, as follows:

$$D_{p_i} = \frac{1}{m} \sum_{x_j \in M} \min_{x_k \in M} \left\| (R x_j + T) - (R_i x_k + T_i) \right\| \tag{2}$$

where x_k represents the point with the smallest distance to the current point x_j after pose conversion.
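A corresponding NumPy sketch of ADD-S in Eq. (2); the brute-force nearest-neighbour search is for clarity only (a k-d tree would normally be used in practice).

```python
import numpy as np

def adds_metric(R, T, R_pred, T_pred, model_points):
    """Closest-point distance for symmetric objects."""
    pts_gt   = model_points @ R.T + T            # (m, 3)
    pts_pred = model_points @ R_pred.T + T_pred  # (m, 3)
    # For each ground-truth point, take the nearest predicted point.
    d = np.linalg.norm(pts_gt[:, None, :] - pts_pred[None, :, :], axis=2)   # (m, m)
    return d.min(axis=1).mean()
```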
For the predictions of the object pose at multiple points, where N represents the number of keypoints of the object, the confidence c is used to weight the pose loss:

$$L_{pose} = \frac{1}{N} \sum_{i=1}^{N} \left( D_{p_i} c_i - w \log(c_i) \right) \tag{3}$$

where i denotes the i-th keypoint, N is the number of keypoints, c_i is the confidence of the predicted pose of the i-th keypoint, and D_{p_i} denotes the distance between the predicted and true poses for each keypoint, computed using ADD 16 for asymmetric objects and ADD-S 17 for symmetric objects. To balance the distance and confidence, and to avoid c_i becoming excessively small, a weight w is added to limit the size of the confidence c_i.
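A PyTorch sketch of the confidence-weighted pose loss in Eq. (3); the tensor shapes are assumed.

```python
import torch

def pose_loss(dist, conf, w=0.01):
    """dist: (N,) per-keypoint ADD / ADD-S distance D_{p_i};
       conf: (N,) per-keypoint confidence c_i in (0, 1)."""
    return torch.mean(dist * conf - w * torch.log(conf))
```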
For the classification task, since the objects in the dataset vary in shape and size, the focal loss 18 is used to overcome the sample imbalance and hard-sample mining problems when predicting the object class cls:

$$L_{cls} = -(1 - p_{cls})^{\gamma} \log(p_{cls}) \tag{4}$$

where p_{cls} is the predicted probability of category cls. The parameter γ, called the modulating factor, reduces the weight of easily classified samples to ensure that the model focuses on hard-to-classify samples.
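A sketch of the focal loss in Eq. (4) for the classification head; the class-balancing factor α of the original focal loss is omitted because the text only mentions γ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """logits: (B, num_obj) raw class scores; target: (B,) ground-truth class indices."""
    log_p = F.log_softmax(logits, dim=1)
    log_p_cls = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p_cls of the true class
    p_cls = log_p_cls.exp()
    return torch.mean(-((1.0 - p_cls) ** gamma) * log_p_cls)
```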
To supervise the two tasks simultaneously, the final multi-task loss is:

$$L = L_{pose} + \lambda L_{cls} \tag{5}$$

where λ is set to an initial value λ_0 at initialization; as the training epoch i increases, λ decreases exponentially according to 2^{-i}. According to 19, training the auxiliary task first and sharing the learned knowledge as a "hint" to the main task can improve the generalization ability of the model. The decreasing λ allows the model to first learn the less difficult object classification; consequently, the model learns more potential feature expressions and the speed of learning pose estimation increases.
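One way of combining the two terms as in Eq. (5) is sketched below; the additive form and the per-epoch halving schedule are our reading of the text, with λ_0 = 0.01 taken from the Parameters section.

```python
def multitask_loss(loss_pose, loss_cls, epoch, lambda_0=0.01):
    lam = lambda_0 * (2.0 ** (-epoch))    # lambda decays exponentially with the epoch index
    return loss_pose + lam * loss_cls
```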

Pose rectify module
When all the above steps are completed, the estimated pose and the generated image features are sent into the pose rectify module. The primary goal of this step is to fine-tune the coarsely estimated pose and bring it closer to the true pose of the object. As shown in Fig. 2, the pose rectify module consists of a PSPNet 14 layer and a series of consecutive linear layers, and its output p_rec is a fine-tuning correction relative to the pose estimation module. Here, the pose p_pred × p_rec obtained by the two-step adjustment and the ground-truth pose p_true are used to map the three-dimensional points of the object in the world coordinate system to their counterparts in the camera coordinate system, which are denoted P_pred and P_true.
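A small sketch of composing the coarse pose with the correction, following the p_pred × p_rec notation; whether the correction acts on the left or the right of the coarse pose is an assumption here.

```python
import numpy as np

def to_homogeneous(R, T):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = T
    return H

def compose_pose(R_pred, T_pred, R_rec, T_rec):
    H = to_homogeneous(R_pred, T_pred) @ to_homogeneous(R_rec, T_rec)   # p_pred x p_rec
    return H[:3, :3], H[:3, 3]
```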

Experiments
The LineMOD 16 and PMP datasets were used in this work. The LineMOD dataset was first published in 2012 and has become a benchmark for 6D pose estimation, while the PMP dataset is a small-sized object dataset that we captured using a UR5 robotic arm and a RealSense D455 camera. It also contains RGB and depth images and is used to verify the accuracy of our method for pose estimation of small-sized objects in real robotic arm grasping. The diameters of the objects in the LineMOD and PMP datasets are listed in Table 1.

Datasets
1. LineMOD Dataset: The LineMOD dataset 16 is widely used in pose estimation. It consists of 13 image sequences, each comprising RGBD images of a single object, with a total of 15 objects, and a 3D point cloud model is provided for each object. In total, it contains approximately 15,000 images. The minimum object diameter is 10.2 cm and the largest reaches 28.2 cm.

2. PMP Dataset:
The PMP dataset is an RGBD dataset captured with a RealSense D455 camera and a UR5 robotic arm. Unlike traditional pose estimation datasets, it is obtained by shooting 360° around the target object, and the poses vary considerably between frames. Each object is shown separately in Fig. 5 below.
It contains five image sequences, each with an average of 1100 images, and the total number of images exceeds 5000. It comprises six objects, namely bear, duck, shelf, glue, roll, and boy. Compared with the LineMOD dataset, the object sizes are smaller, with diameters ranging from 4.7 cm to 13.9 cm.

Metrics
The objects in the LineMOD 16 and PMP datasets can be divided into two categories: symmetric and asymmetric. An asymmetric object is one whose surface shape does not repeat, that is, when the object remains static it can be represented by only one pose. A static symmetric object, by contrast, may correspond to more than one pose, which affects the performance of the model. As in previous work 4,17,20, the accuracy of pose estimation is measured using the distance computed by ADD 16 and ADD-S 17, and a result is deemed correct when the distance is less than 10% of the object diameter.
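The reported accuracy therefore corresponds to the following simple criterion (sketch):

```python
def pose_correct(distance, diameter, threshold=0.1):
    """A prediction counts as correct when the ADD(-S) distance is below
    10% of the object diameter (both in the same unit, e.g. millimeters)."""
    return distance < threshold * diameter
```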

Parameters
In the encoder-decoder model, the width W and height H of the image crops were set to 40, 80, 120, or 160 in increments of 40, according to the size of the target object. To ensure that the image is divided into an integer number of patches, each image patch size size_p was set to 8. The encoder and decoder are not symmetrical: the encoder encodes the image into a high-dimensional feature vector, whereas the lightweight decoder restores the vector to the pixel size of the image. The vector dimension e_embed_dim of the encoder was set to 768 and the number of encoder layers e_depth to 12; the vector dimension d_embed_dim of the decoder was set to 512 and the number of decoder layers d_depth to 8.
According to the empirical evaluation results, w in Eq. (3) was set to 0.01. In Eq. (4), γ was set to 2, which increases the discrimination of difficult samples. The initial value λ_0 of λ in Eq. (5) was set to 0.01 to balance the classification and pose estimation tasks.

Experiment results
Several experiments were conducted on the public LineMOD dataset and on the PMP dataset, with the accuracy of the classification task of our proposed method guaranteed to reach 95%.
Table 1. The diameters of objects in the LineMOD and PMP datasets. The sizes of the objects in the LineMOD dataset are displayed in the first three rows, and the sizes of the objects in the PMP dataset in the final row. All values are in millimeters (mm). * indicates the symmetric objects.

Performance on the LineMOD dataset
Figure 7 shows the decline of the test error on the LineMOD dataset for our pipeline and DenseFusion. Both models were trained for 87 epochs, and the test results were recorded once per epoch. In our proposed method, the drop ratios of the classification and pose estimation tasks in the following experiments were set to 0.3 and 0.1, respectively. The parameter settings are discussed in Section 3.5. As is evident from Fig. 7, from Epoch 6 onward our results have a lower error distance than DenseFusion. Moreover, our proposed method dropped substantially in the first 20 epochs and then declined steadily from Epoch 20 onward. After 87 epochs of training, the minimum error distance reached 6.3 mm. By contrast, DenseFusion, owing to the lack of sufficient knowledge in the early stages of learning, only converged gradually from Epoch 35, indicating that our method acquires sufficient experience for pose estimation early on. The proposed method took 59 h to train for 87 epochs, compared with 33 h for DenseFusion, and our method achieved a distance of 11.64 mm at Epoch 10, which is equivalent to the result of DenseFusion at Epoch 2. This indirectly shows that, once prior knowledge is gained through the multi-task module, our method achieves results comparable to those of DenseFusion in less time. The final predicted object poses on the LineMOD and PMP datasets are visualized in Fig. 6, where the green and red point sets show the 3D point cloud obtained by the predicted pose p_pred and by the ground truth p, respectively, after projection into the two-dimensional image. Five poses of three objects are visualized in the figure, with the predictions of DenseFusion and of our proposed method listed in that order. In the two sets of results, the average precision errors of DenseFusion and of our method on the public LineMOD dataset are 3.0 mm and 2.1 mm, respectively, and the average errors on our collected dataset are 7.5 mm and 6.9 mm. Compared with DenseFusion, our method achieves smaller errors on both datasets.

Final results
The results of our method using the ADD(-S) metric, together with those of 2,16,20-22, are presented in Table 2 for comparison on the LineMOD dataset, where SSD-6D 20 uses the ICP algorithm for 3D point cloud registration, BB8 2 and DenseFusion use neural networks for overall refinement, and this study uses 10% of the object's diameter as the threshold. Five sets of experiments were conducted on the LineMOD dataset using the proposed method, and the table lists the average results of the five experiments; evidently, optimal results were achieved on 6 of the 13 objects. In the comparison of the final results, except for the symmetric objects eggbox and glue, which use ADD-S as the evaluation metric and therefore obtain higher accuracy, the asymmetric objects with the lowest average accuracy are ape and duck, which both have the smallest diameters. The asymmetric object for which the proposed method obtains the highest accuracy is iron, with an accuracy of 98.2%, and the object with the lowest accuracy is duck, at 93.5%; the gap between them is only 4.7%. By contrast, for SSD-6D 20, PointFusion 21, BB8 2, DenseFusion 16, and Uni6D 22, the extreme accuracy gaps are 37%, 35.9%, 23%, 9%, and 9.9%, respectively.
The above results show that the proposed method can reduce the accuracy deviation between objects while improving the accuracy, so that the pose estimation accuracy of objects with different shapes is balanced. Among the five smallest objects in the LineMOD dataset, namely ape, duck, cat, camera, and holepuncher, the proposed method achieves the best results for the first four, with accuracies of 93.8%, 93.5%, 98.1%, and 98.2%, respectively. Additionally, compared with DenseFusion, our method shows a 2.2% improvement on the holepuncher object. In Table 2, Uni6D 22 fuses the RGBD input physically in the preprocessing stage, resulting in a prediction accuracy of 97.0%. However, when dealing with small objects such as ape, duck, and cat, the physically fused input leads to inefficient feature extraction within each modality and thus to feature loss. Our method lowers the accuracy deviation from 9.9% to 4.7%, less than half that of Uni6D. Table 3 presents the results of the five object tests on the PMP dataset; the proposed method is able to predict the poses of small objects well, achieving an accuracy of 98.1%, whereas DenseFusion achieves 93.1%. The extreme accuracy gap of the proposed method is only 4.3%, compared with 8.3% for DenseFusion.

Two dropout rates
The effect of the drop ratio Drop = {D_pose, D_cls} on pose estimation on the LineMOD dataset was tested. The two drop ratios, chosen from {0.1, 0.2, 0.3}, were used as the proportions of dropped neurons for the pose estimation and classification tasks. To verify the influence of the prior knowledge brought by parameter sharing on pose estimation, the error distance between the test and real values obtained after training for 10 epochs was recorded and is shown as a heat map in Fig. 8. The maximum distance in the figure is 12.6 mm. For comparison, when the drop ratio was {0.0, 0.0}, the average distance obtained by the model was 12.7 mm. As is evident from Fig. 8, when the two drop proportions are equal, that is, {0.1, 0.1}, {0.2, 0.2}, or {0.3, 0.3}, the error is larger than when they differ, indicating that setting the same drop ratio for tasks with different levels of difficulty is inappropriate. The optimal setting is {0.1, 0.3}, which illustrates that the simpler classification task can use fewer neurons; conversely, providing more parameters to the pose estimation task yields better results.

Instance-level robotic grasping
Several duck items of the same size as in the PMP dataset were chosen for grasping experiments to test the efficacy of the proposed technique. Our experimental procedure included target localization and instance segmentation, target mask construction, and pose estimation to achieve end-to-end object grasping. In the experiment, real-time image data were first fed into SegNet 23, a pre-trained image segmentation network, to locate the position of the object and produce a mask image of the region. The next phase rendered the position found in the previous step as a white mask in the image, then estimated the pose of the target object and performed a grasping experiment with the UR5 robotic arm. A video of the real-time robotic grasping experiment is available at https://www.youtube.com/watch?v=xRtFaYIFpb4.
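For reference, the grasping loop described above can be summarized as the following sketch; every function name here is a placeholder for the actual SegNet, PMP, and UR5 interfaces, none of which are specified in the paper.

```python
def grasp_once(camera, segnet, pmp_model, robot):
    rgb, depth = camera.read()                         # real-time RGBD frame
    mask, bbox = segnet.segment(rgb)                   # locate the object, build its white mask
    R, T = pmp_model.estimate(rgb, depth, mask, bbox)  # 6D pose [R | T] of the target
    grasp = robot.plan_grasp(R, T)                     # convert the pose into a gripper target
    robot.execute(grasp)
```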

Conclusion
In this work, we propose a multi-task parameter sharing method to improve the accuracy of pose estimation. The proposed method improves accuracy in the field of pose estimation and narrows the deviation in accuracy between objects of different shapes and sizes by using an auxiliary task as prior information. More importantly, the proposed method achieves a lower distance error than DenseFusion after fewer training epochs on the LineMOD dataset, and in the real robotic arm grasping experiment it achieves real-time pose prediction for grasping. The proposed method works only for most rigid objects with regular geometry, and its accuracy decreases for objects with complex shapes or transparency, or under non-ideal conditions such as occlusion and lighting changes.

Figure 1 .
Figure 1. Pipeline of our method. The pose estimation task in the figure is performed using the blue neurons, whereas the object categorization task is performed using the gray neurons. The accuracy of pose estimation and the generalization ability of the model can both be increased because the two associated tasks share and supplement each other's learned knowledge.

Figure 2 . Figure 3 .
Figure 2. Overall structure of our method. The model structure is divided into three modules: the feature extraction module, the multi-task pose estimation module, and the pose rectify module.

Figure 4 .
Figure 4. Schematic of the shared parameters between the pose estimation and classification tasks. The orange and blue parts of the diagram represent the distinct parts of the two sub-networks, whereas the green parts indicate the overlapping parts of the sub-networks.

Figure 6 .Figure 7 .
Figure 6. Visualization of predicted poses on the LineMOD (top two rows) and PMP (bottom two rows) datasets. The green point set represents the predicted object pose, whereas the red point set represents the ground-truth object pose.

Figure 8 .
Figure 8. Different dropout rate combinations in our proposed method. The numbers in the figure represent the average error distance obtained by the model test, and the horizontal and vertical coordinates represent the dropout rates of the pose estimation and classification tasks, respectively.

Table 2 .
Evaluation results of each comparison model on the LineMOD dataset. * The bolded categories indicate symmetric objects. The Mean value represents the average prediction accuracy over all objects, while the Extreme value represents the gap between the highest and lowest accuracies among the non-symmetric objects. SSD-

Table 3 .
Evaluation results on the PMP dataset. * The bolded categories indicate symmetric objects.