Accurate 3D hand mesh recovery from a single RGB image

This work addresses hand mesh recovery from a single RGB image. In contrast to most existing approaches, where parametric hand models are employed as the prior, we show that the hand mesh can be learned directly from the input image. We propose a new type of GAN called Im2Mesh GAN to learn the mesh through end-to-end adversarial training. By interpreting the mesh as a graph, our model is able to capture the topological relationships among the mesh vertices. We also introduce a 3D surface descriptor into the GAN architecture to further capture the associated 3D features. We conduct experiments with the proposed Im2Mesh GAN architecture in two settings: one where we can reap the benefits of coupled groundtruth data availability of the images and the corresponding meshes, and the other which combats the more challenging problem of mesh estimation without the corresponding groundtruth. Through extensive evaluations we demonstrate that, even without using any hand priors, the proposed method performs on par with or better than the state-of-the-art.


Scientific Reports | (2022) 12:11043 | https://doi.org/10.1038/s41598-022-14380-x

The main contributions of this work are as follows.
• We introduce a 3D surface descriptor into a generative graph model to encode the surface-level information, explicitly capturing the 3D features associated with the mesh vertices.
• The proposed approaches not only address the problem of mesh reconstruction for coupled datasets, where a one-to-one mapping prevails between the images and the groundtruth meshes, but also address the problem of reconstructing meshes for datasets which do not have the corresponding groundtruth annotations.
• We do not use depth images; as such, we increase the potential of using our model on datasets which do not have corresponding depth images.
The remainder of this paper is organised as follows. In section "Related work" we discuss the related work in the research area and its limitations. Section "Im2Mesh GAN-single image mesh generation" describes our methodology, with subsections outlining each component of the approach for coupled data as well as for uncoupled data. In section "Evaluations" we present our experimental results, including ablative studies and a comparison of our method to the state-of-the-art, and we conclude the paper in section "Conclusion".

Related work
3D hand pose and mesh estimation using parametric models: The majority of existing 3D hand pose and shape estimation methods 8,13 are based on the MANO 5 model, which is a low-dimensional parametric representation of the hand mesh. However, there remain a few weaknesses in using such parametric models. Firstly, the model is generated in controlled environments which differ from the images encountered in the real world 14, causing a domain gap. Secondly, the low-dimensional nature of parametric models limits their capability to capture non-linear hand shapes 10. Thirdly, creating a parametric model requires a large amount of data, which makes it challenging to adopt such methods for other object classes. Due to these limitations, in this paper we propose a hand mesh reconstruction approach which does not utilize a parametric hand model.
Model-free 3D hand pose and mesh estimation: Recent approaches to 3D hand pose and mesh estimation that do not use parametric models employ other priors instead, for instance using the 2D pose of the hand as an input to the network 14. This requires 2D pose annotations on the input images, which limits the approach's ability to adapt to datasets where the 2D pose annotation is not available. In addition, there exist some approaches that rely on heatmaps of the keypoints at early stages 8,10, which require an additional step of keypoint estimation, even though the keypoints can later be extracted directly from the estimated mesh. In contrast, we do not employ 2D or 3D keypoint locations in our method.
Graph Neural Networks (GNNs) for hand mesh estimation: In the recent literature, several approaches can be found where GNNs have been employed to estimate the 3D mesh of the human hand 10,14. However, the objective functions of these methods are limited to the vertex coordinates and other properties associated with the vertex locations in the final mesh, so the resultant features of the GCNs are not fully utilized. To fully harness the strengths of GCNs we incorporate a 3D feature descriptor into our method, where the GCN learns to estimate not only the vertex locations but also the 3D feature descriptor, which elevates the overall accuracy of the mesh estimation.
Effective use of datasets for hand pose and mesh estimation: Among the datasets for hand pose and mesh estimation, the most recent (the Dome dataset by Kulon et al. 15 and the FreiHAND dataset by Zimmermann et al. 16) contain the images and their corresponding groundtruth meshes. These datasets have been used by the state-of-the-art methods for hand pose and mesh estimation 13,14. However, datasets such as RHD 17 and STB 18 contain the images and their corresponding groundtruth 3D poses, and the methods that have used those datasets have targeted only 3D pose estimation 19,20. In contrast, we propose an approach where existing datasets which do not contain groundtruth mesh details can effectively be used for the task of hand mesh estimation.

Im2Mesh GAN-single image mesh generation
When considering the available datasets for single image hand mesh reconstruction, there are 2 main variations: (1) datasets which contain images and the corresponding groundtruth meshes (i.e., the dataset by Kulon et al. 15 (referred to as the Dome dataset hereafter) and the dataset by Zimmermann et al. 16 (referred to as the FreiHAND dataset hereafter)), and (2) other standard datasets such as the Rendered Handpose Dataset (RHD) 17 and the Stereo Handpose Dataset (STB) 18, which do not contain groundtruth meshes; instead, they contain the 3D and 2D keypoint annotations of the human hand. Therefore we use two different network architectures: (1) to reap the maximum benefit of the availability of the coupled data in the Dome dataset 15 and the FreiHAND dataset 16, and (2) to use the mesh data in the Dome dataset along with the image data in the other standard datasets (i.e., STB and RHD) for the robust estimation of the hand mesh.
In this section, we describe the details of our method. First, we briefly introduce the network architectures while distinguishing the architectural differences between the two settings considered; then we introduce the hand mesh representation and the 3D surface feature descriptor used in this paper. We then elaborate on the details of each network architecture along with the objective functions that were employed.
Architecture for Coupled Data vs Architecture for Uncoupled Data. The network architecture for coupled training data is depicted in Fig. 1 and the network architecture for uncoupled training data is depicted in Fig. 2. For coupled training data a conditional GAN architecture is employed (Eq. 3), where RGB images (I) are fed as the input to the "Generator". At training time, RGB images (I) with the generated meshes (G(I)), which contain the hand mesh representation (detailed in section "Hand mesh representation") and the 3D surface descriptor (detailed in section "3D surface descriptor"), and RGB images with the corresponding groundtruth meshes are fed to the "Discriminator". For uncoupled training data a cycle GAN architecture is employed (Eq. 12), where RGB images from a particular dataset (e.g., the RHD 17 dataset) are fed to the "Mesh Generator" ( G_M ) and mesh data from a different distribution (e.g., the Dome dataset 15 ) is fed to the "Image Generator" ( G_I ).

Hand mesh representation.
In this work we represent the hand mesh M as

M = (V, F), (1)

where V denotes the vertices and F denotes the faces that comprise the mesh. Each vertex in V is denoted by its x, y and z coordinates (i.e., v_i = (x_i, y_i, z_i)) and each face is denoted by the indices of the vertices which contribute to that face (i.e., f_i = (v_i1, v_i2, v_i3)).
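As a minimal illustration of this representation, the mesh can be held as two arrays: a vertex array V of x, y, z coordinates and a face array F of vertex-index triples. The values below form a hypothetical toy tetrahedron standing in for the much larger hand mesh, purely for illustration:

```python
import numpy as np

# Toy mesh M = (V, F): 4 vertices, 4 triangular faces (a tetrahedron).
V = np.array([[0.0, 0.0, 0.0],   # v0: x, y, z coordinates
              [1.0, 0.0, 0.0],   # v1
              [0.0, 1.0, 0.0],   # v2
              [0.0, 0.0, 1.0]])  # v3

# Each face stores the indices of the three vertices that comprise it.
F = np.array([[0, 1, 2],
              [0, 1, 3],
              [0, 2, 3],
              [1, 2, 3]])

print(V.shape)  # (4, 3)
print(F.shape)  # (4, 3)
```

The real hand meshes in the paper follow the same layout, only with 7907 (Dome) or 778 (FreiHAND) rows in V.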

3D surface descriptor. In GNNs an attributed graph is defined as

G = (V, E, X), (2)

where V is the set of vertices/nodes, which is directly extracted from M (Eq. 1), and E is the set of edges, which is derived using F in Eq. (1). X can either be node attributes (i.e., X^v ∈ R^(N×d), such that X^v_i ∈ R^d is the feature vector of node v_i) or edge attributes (i.e., X^e ∈ R^(T×c), where T is the number of edges in the graph).
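The derivation of E from F can be sketched as follows: every side of every triangular face contributes one undirected edge. The helper name `edges_from_faces` is ours, not from the paper:

```python
def edges_from_faces(F):
    """Derive the undirected edge set E of the mesh graph (Eq. 2)
    from the face list F: each side of each triangle is an edge."""
    E = set()
    for a, b, c in F:
        for u, v in ((a, b), (b, c), (a, c)):
            E.add((min(u, v), max(u, v)))  # canonical order, no duplicates
    return sorted(E)

# Toy tetrahedron: 4 faces share 6 unique edges.
F = [[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]]
print(edges_from_faces(F))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Because the mesh topology is fixed across samples, this edge set only needs to be computed once.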
In this work, we use a node feature that can represent distinctive node properties. We selected the Signature of Histograms of Orientations (SHOT) descriptor 21, which has the ability to generate descriptive features for 3D points. The SHOT descriptor combines the concepts of "signature" 22 and "histogram" 23, such that the descriptor possesses computational efficiency while maintaining robustness. Apart from the evaluations performed by the developers of the SHOT descriptor, Salti et al. 21, the SHOT descriptor has demonstrated strong performance in different domains 24, including in frameworks with deep learning techniques 25. Furthermore, compared with other 3D feature descriptors (e.g., the Point Feature Histogram (PFH) descriptor 26 and the Fast Point Feature Histogram (FPFH) descriptor 27), the SHOT descriptor has been shown to better capture information of the surface, as it encodes the details across the radial, azimuth and elevation axes of the support region. The dimension of the feature descriptor depends on parameters such as the number of neighbours considered at the time of feature descriptor creation.
Network architecture for coupled training data. We use a variation of a conditional GAN in this work to generate realistic hand meshes from RGB hand images. The objective of a conditional GAN is expressed as

L_cGAN(G, D) = E_{I,M}[log D(I, M)] + E_I[log(1 − D(I, G(I)))], (3)

where G and D are the functions learned by the generator and the discriminator respectively. Conditional GANs are capable of learning the mapping between the input and the desired output.
The configuration of the conditional GAN that we use is depicted in Fig. 1. The generator network has 2 components, which predict the position vectors (i.e., V in Eq. 1) and the node features (i.e., X in Eq. 2, where in this work we have used the SHOT descriptor). Similarly, the discriminator network is also composed of 2 components, the "Position Discriminator" and the "Surface Descriptor Discriminator".
When it comes to conditional GANs, the generator's objective is to generate output that closely resembles the groundtruth output while fooling the discriminator. Therefore, we define

L(G) = λ L_pos + µ L_shot + θ L_normal + γ L_Laplacian + φ L_Quadratic, (4)

to measure the similarity between the predicted values and the corresponding groundtruth values.
The final objective of the conditional GAN is

G* = arg min_G max_D L_cGAN(G, D) + δ L(G). (5)

The first two terms of Eq. (4) are aimed at minimizing the reconstruction errors of the position vector and the SHOT descriptor respectively. L_pos is defined as

L_pos = ||pred_pos − gt_pos||_1, (6)

where pred_pos and gt_pos are the predicted and groundtruth vertex locations (i.e., position values) of the mesh 28. We introduce L_shot, which is the difference between the surface descriptors (SHOT descriptors) of the groundtruth mesh and the predicted mesh:

L_shot = ||pred_shot − gt_shot||_1. (7)
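A sketch of how the weighted generator loss of Eq. (4) might be assembled, under the assumption that L_pos and L_shot are mean absolute (L1) reconstruction errors; the function name and signature are illustrative, not from the paper, and the remaining three loss terms are passed in precomputed:

```python
import numpy as np

def l1(a, b):
    """Mean absolute (L1) reconstruction error between two arrays."""
    return np.abs(a - b).mean()

def generator_loss(pred_pos, gt_pos, pred_shot, gt_shot,
                   l_normal, l_laplacian, l_quadratic,
                   lam=1.0, mu=1.0, theta=1.0, gamma=1.0, phi=1.0):
    """Weighted sum of the five generator loss terms (Eq. 4).
    All weights default to 1, matching the paper's training setup."""
    return (lam * l1(pred_pos, gt_pos)       # L_pos   (Eq. 6)
            + mu * l1(pred_shot, gt_shot)    # L_shot  (Eq. 7)
            + theta * l_normal               # L_normal
            + gamma * l_laplacian            # L_Laplacian
            + phi * l_quadratic)             # L_Quadratic
```

Setting theta and gamma to 0 reproduces the no-smoothness ablation described in the Training section.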
A loss based on the surface normals of the mesh is introduced to enforce the smoothness of the mesh. To ensure that the surface normals of the groundtruth mesh and the predicted mesh are parallel, the dot product between them is used. L_normal is calculated as

L_normal = Σ_i (1 − n_i^pred · n_i^gt), (8)

where n_i^pred and n_i^gt denote the normal vectors of face i in the predicted mesh and the groundtruth mesh respectively.
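The normal loss can be sketched as below, assuming it penalises the deviation of each per-face dot product from 1, so identical meshes give a loss of zero; the helper names are ours:

```python
import numpy as np

def face_normals(V, F):
    """Unit normal of each triangle via the cross product of two edges."""
    v0, v1, v2 = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def normal_loss(V_pred, V_gt, F):
    """Penalise non-parallel normals: mean of 1 - n_pred . n_gt
    over faces. Zero when every pair of normals is parallel."""
    n_pred = face_normals(V_pred, F)
    n_gt = face_normals(V_gt, F)
    return (1.0 - np.sum(n_pred * n_gt, axis=1)).mean()
```

Since both meshes share the same face list F, the normals are compared face by face.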
In addition, to further enhance the smoothness of the mesh we employ the Laplacian loss ( L_Laplacian ) 10. We introduce two components to the Laplacian loss,

L_Laplacian = α L_VertexLaplacian + β L_KeypointLaplacian, (9)

where L_VertexLaplacian is calculated for each of the vertices in the mesh considering the adjacent neighbours, enforcing smoothness in a fine-grained context, and L_KeypointLaplacian is calculated for the keypoints while considering neighbours in a broad range, thus enforcing smoothness at a coarser level. The weights of L_VertexLaplacian and L_KeypointLaplacian are denoted by α and β. The Laplacian error in general is defined as

Σ_i || w_i − (1/|N(v_i)|) Σ_{v_j ∈ N(v_i)} w_j ||^2, (10)

where w_i = pred_pos_i − gt_pos_i for vertex v_i. When considering L_VertexLaplacian, the neighbours N(v_i) of vertex v_i are its adjacent vertices in the mesh, whereas L_KeypointLaplacian is defined based on neighbours that are identified through a graph unrolling and graph traversing process. To define the neighbours for the L_KeypointLaplacian calculation, we unrolled the hand mesh into a graph format. For each of the keypoints, a separate graph is created by traversing the mesh using the vertex related to the keypoint as the starting node. We use breadth-first-search 29 based graph unrolling. As the vertices are not uniformly distributed throughout the mesh, the number of layers in each graph and the number of nodes in each layer differ. The Quadratic loss ( L_Quadratic ) (Eq. 11) 30 is also used to penalize the predicted points in the normal direction. In Eq. (11), Q_{v_gt}(v_pred) stands for the quadratic error 31,32, which is calculated based on the triangles incident to v_gt.
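The breadth-first graph unrolling described above can be sketched as follows. This is a hypothetical helper, with the mesh given as an adjacency list; each returned list is one BFS layer around the keypoint vertex:

```python
def bfs_layers(adj, start, depth):
    """Breadth-first unrolling of the mesh graph from a keypoint vertex.
    Returns a list of layers; layer k holds the vertices first reached
    after k+1 hops. Layer sizes vary because the mesh vertices are not
    uniformly distributed."""
    layers, seen, frontier = [], {start}, [start]
    for _ in range(depth):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        if not nxt:          # graph exhausted before reaching depth
            break
        layers.append(nxt)
        frontier = nxt
    return layers

# Tiny example graph: vertex 0 is the keypoint vertex.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(bfs_layers(adj, 0, 3))  # [[1, 2], [3]]
```

The union of the first few layers would then serve as the broad-range neighbourhood N(v_i) in the keypoint Laplacian term.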
In general, the objective of the discriminator network (D in Eq. 3) is to classify whether a given input is a real sample or has been generated by the generator network (G). However, the existing work on GANs has focused on discriminating generated data such as class labels and images, and has thus used fully connected or convolutional layers in the discriminator.
In this work we use graph convolutional layers in the "Surface Descriptor Discriminator" network (Fig. 1), where the node features are taken into consideration. As the edge connections (E in Eq. 2) remain the same for all the estimated meshes, we use spectral-based graph convolution operations. We used the GCN layers introduced by Kipf et al. 33.
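For reference, a single spectral graph convolution layer in the style of Kipf et al. computes H' = σ(D^(-1/2) (A + I) D^(-1/2) H W). A dense NumPy sketch is given below; it is illustrative only (real implementations use sparse operations), and because E is fixed, the normalised adjacency could be precomputed once for all meshes:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One spectral graph convolution (Kipf & Welling style):
    H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), with self-loops added
    and symmetric degree normalisation."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees incl. self-loop
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# 4-node ring graph, 5-dim node features reduced to 3 dims.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H = np.ones((4, 5))
W = np.ones((5, 3))
print(gcn_layer(A, H, W).shape)  # (4, 3)
```

Stacking two such layers with learned W matrices mirrors the 221 → 100 → 50 feature reduction used in the "Surface Descriptor Discriminator".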
Network architecture for uncoupled training data. Hand mesh estimation from a single image suffers from the problem of having a limited amount of training data, and hence deep learning techniques with fully supervised learning cannot readily be used. As a solution, many of the existing methods use datasets with 3D keypoint annotations and estimate the hand pose. The estimated pose is then used along with parametric models such as MANO 5 for the 3D mesh reconstruction.
In this paper, we use a variation of a cycle GAN 34 to estimate the 3D mesh of a hand from a single image, based on uncoupled training data. The overview of the framework is depicted in Fig. 2, where the "Mesh Generator" and "Mesh Discriminator" consist of the position and surface descriptor related components denoted in Fig. 1.
We define the objective function of the network as

L(G_M, G_I, D_M, D_I) = L_GAN(G_M, D_M) + L_GAN(G_I, D_I) + δ L_cyc, (12)

where L_GAN is defined according to the adversarial loss proposed in 35 (Eq. 13). L_cyc, which stands for the cycle consistency loss, is used to constrain the possible mapping functions such that the mapping from image to mesh is as unique as possible. We define the cycle consistency loss with two components, L_cyc_mesh and L_cyc_im, which stand for the cycle consistency of the mesh generator and the image generator respectively.
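The two cycle-consistency terms can be sketched as follows, treating the generators as opaque callables (G_M: image → mesh, G_I: mesh → image, following Fig. 2). The L1 penalty is our assumption, borrowed from the original cycle GAN formulation:

```python
import numpy as np

def l1(a, b):
    """Mean absolute (L1) difference between two arrays."""
    return np.abs(a - b).mean()

def cycle_losses(img, mesh, G_M, G_I):
    """Cycle consistency terms:
    L_cyc_im   enforces G_I(G_M(img))  ~ img   (image -> mesh -> image),
    L_cyc_mesh enforces G_M(G_I(mesh)) ~ mesh  (mesh -> image -> mesh)."""
    l_cyc_im = l1(G_I(G_M(img)), img)
    l_cyc_mesh = l1(G_M(G_I(mesh)), mesh)
    return l_cyc_im, l_cyc_mesh
```

With perfect generators both terms vanish; in training they push the image-to-mesh mapping toward a unique inverse pair.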

Generator
For the preliminary layers of the generator we used a convolution-based architecture in the shape of a U-Net 36; however, we did not use any skip connections in this work, as the domains of the input and output are different. In denoting a 2D convolution layer followed by batch normalization and a ReLU activation, we use the notation Convolution-BatchNorm-ReLU. We used a similar architecture to that in 37.

Discriminator
The discriminator network contains 2 branches, where the position vectors and the node features are compared with the corresponding groundtruth values. The "Position Discriminator" takes an input of size N × 3 and the "Surface Descriptor Discriminator" takes an input of size N × 221, where N is the number of vertices in the resultant mesh. For the position discriminator we use three Convolution-BatchNorm-ReLU layers with 64, 64 and 1 kernels respectively, each with 5 × 1 filters; for the "Surface Descriptor Discriminator" we use 2 GCNConv layers, where we reduce the feature size from 221 to 100 and then to 50. We pass each of these through a fully connected layer with 2048 nodes, concatenate the outputs of the 2 discriminators, and pass the result through another fully connected layer of 1048 nodes followed by a fully connected layer of size 1 with softmax activation.
Generator and discriminator for uncoupled training data. Compared with the network settings described above, where coupled training data are available, the main difference in our implementation of the cycle GAN (Fig. 2) is the enforced cycle consistency loss: the mesh generated by the "Mesh Generator" in Fig. 2 is fed to the "Image Generator" and vice versa. To allow this, the inputs to the two generators should have the same shape. Hence we consider the coarse-grained mesh, which has 224 vertices, and set the input image width and height to 224 pixels. For the generators and for the "Image Discriminator" we used the same architectures as in the original cycle GAN paper 34. For the mesh discriminator, we used the same architecture described in the "Discriminator" subsection above.
For the setting of uncoupled training data, we separately trained the mesh enhancer (Fig. 3) such that it learns the mapping from the low-dimensional mesh to the high-dimensional mesh, and the generated low-dimensional meshes were upsampled using it.
Training. For the network which uses coupled data we set δ = 10 in Eq. (5). All the parameters λ, µ, θ, γ and φ in Eq. (4), and α and β in Eq. (9), were set to 1. To examine the effectiveness of the constraints enforced to obtain surface smoothness, we also evaluate the method with the parameters θ and γ set to 0; the results related to those settings can be found in section "Ablative studies". For the network which used uncoupled data, δ was also set to 10. We used a training procedure similar to 34,37, where all the models were trained from scratch with a learning rate of 0.0002 using the Adam optimizer 39.

Evaluations
In this section we describe the datasets and the evaluation metrics that we used, the ablative studies that we conducted to evaluate the effectiveness of the components in our model, and the experimental results that we obtained in benchmarking our model against the state-of-the-art methods. It should be noted that when recording the results of the state-of-the-art methods, we have used the evaluations that have been performed by the respective authors, and the results with the best configuration of their proposed methods have been selected for the comparison.
Datasets. This work utilizes two types of publicly available datasets: (1) coupled datasets, where the images and the corresponding groundtruth meshes are available, and (2) datasets which contain only the images. For the coupled datasets we used the Dome dataset 15 and the FreiHAND dataset 16, and for the latter we used the RHD 17 and STB 18 datasets.

Figure 3. The mesh enhancement process. It should be noted that this image depicts the process of upsampling a graph which has N nodes and a feature dimension of d, to a graph with R nodes. The depicted network contains two cascaded levels of graph upsampling followed by the "Coordinate Reconstructor", which calculates the position vector of the upsampled graph. k and q are the feature dimensions of the features generated at cascaded levels 1 and 2. Since the objective of our work is to upsample the graph while retaining the number of features, we set k = q = d.

Evaluation protocol. For the Dome dataset we used the L1 reconstruction error between the groundtruth mesh and the predicted mesh. For the quantitative evaluations on the datasets for which groundtruth meshes are not available, we extracted the 3D locations of the keypoints and used the Percentage of Correct Keypoints (PCK) score as the accuracy measurement. In the PCK calculation, if a predicted keypoint, which we extract from the estimated mesh, lies within a sphere of a specific radius around the groundtruth value, it is considered a correct keypoint.
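The PCK score described in the evaluation protocol can be sketched as below; the radius and keypoint values are arbitrary, for illustration only:

```python
import numpy as np

def pck(pred_kp, gt_kp, radius):
    """Percentage of Correct Keypoints: a predicted keypoint counts as
    correct when it lies within a sphere of the given radius centred
    on its groundtruth keypoint."""
    dist = np.linalg.norm(pred_kp - gt_kp, axis=1)
    return (dist <= radius).mean()

# Toy example: 3 keypoints, 2 of which fall within radius 0.1.
gt = np.zeros((3, 3))
pred = np.array([[0.01, 0.0, 0.0],
                 [0.50, 0.0, 0.0],
                 [0.00, 0.02, 0.0]])
print(pck(pred, gt, radius=0.1))  # 0.666...
```

Sweeping the radius and plotting PCK against it yields the curves of Fig. 5, whose areas under the curve (AUC) are the values reported in Tables 3 and 4.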

Ablative studies. Our ablative studies were conducted on the Dome dataset, which has groundtruth mesh data, such that quantitative evaluations could be carried out. We conducted the ablative studies with the aim of evaluating the contribution of the components in Eq. (4). The results of these ablative studies are recorded in Table 1.
Effectiveness of using the 3D surface descriptor: We performed this ablative study to evaluate the effectiveness of incorporating L shot , which measures the similarity in the groundtruth and generated SHOT descriptor. We trained our model on training subset and tested on the test subset of the Dome dataset 15 . For this ablative study, we set µ = 0 in Eq. (4).
Effectiveness of enforcing the surface smoothness in the mesh: As described in section "Network architecture for coupled training data", our method combines several loss functions to enforce the surface smoothness of the mesh. We evaluate the effectiveness of each of these loss components (i.e., L_normal and L_Laplacian, where the latter consists of 2 components, L_VertexLaplacian and L_KeypointLaplacian). We assessed the effectiveness of using the components individually and in combination. First we eliminated all the loss values that are related to the surface smoothness (i.e., the surface normal loss, the vertex Laplacian loss and the keypoint Laplacian loss); the corresponding results can be found in the second row of Table 1. Similarly, by eliminating the individual losses, we compared the reconstruction error (Table 1). Figure 4 visualizes the effect of the constraints that are used to smooth the surface. From the conducted ablative studies it is evident that the contribution of the components in the loss function is vital, and the combination of all the components has resulted in better accuracy values.

Table 1. Results of the conducted ablative studies to evaluate the effectiveness of the loss components that are indicated in Eq. (4) and in Eq. (9). The first row demonstrates the importance of using the 3D surface descriptor, whereas the next five rows (highlighted in gray) demonstrate the effectiveness of surface smoothness and the seventh row demonstrates the effectiveness of the quadratic loss.

Selection of the 3D feature descriptor: We also compared the SHOT descriptor against alternative 3D feature descriptors on the Dome dataset 15. The results are recorded in Table 2 and clearly demonstrate the superior performance of the SHOT descriptor (Tables 3, 4).
Comparison to the state-of-the-art. For the Dome dataset we compared our method with the state-of-the-art method which has used the same dataset. The L1 reconstruction errors of the two methods are recorded in Table 5. From the obtained values it can be seen that our method has outperformed the state-of-the-art method by significant margins. Similarly, for the FreiHAND dataset we compared our results with the state-of-the-art methods; the obtained results are recorded in Table 6. For the RHD dataset and the STB dataset, the evaluations were performed using 3D PCK values, where we compare the extracted 3D keypoint values with the groundtruth 3D keypoint values. The obtained PCK values are recorded in Fig. 5. From the results it can be seen that, although the objective of our method was not to estimate the 3D keypoint locations, we have obtained PCK values which are comparable to the state-of-the-art. The AUC values related to Fig. 5a are recorded in Table 3 and the AUC values related to Fig. 5b are recorded in Table 4. It should be noted that when comparing the quantitative performance of the state-of-the-art methods, we have used the evaluations that have been performed by the respective authors, and the results with the best configurations of their proposed methods have been considered.
The qualitative results obtained for the coupled datasets (i.e., the Dome dataset 15 and the FreiHAND dataset 16) are depicted in Figs. 6 and 7 respectively. It should be noted that the Dome dataset contains meshes with 7907 vertices while the FreiHAND 16 dataset contains meshes with 778 vertices. In Fig. 7 the 5th column depicts the mask that was obtained by projecting the estimated mesh. Qualitative results for the RHD dataset 17, which is an uncoupled dataset, are depicted in Fig. 8.

Conclusion
While 3D mesh reconstruction of the human hand from a single image has been explored in the past, the problem remains a challenge due to the high degrees of freedom of the human hand. In this paper, we have presented a method that creates a 3D mesh of the hand from a single image and can effectively use the existing datasets for better reconstruction. We have designed a loss function that can generate more realistic hand meshes, and we demonstrate the effectiveness of that loss function in two settings of Generative Adversarial Networks. The first setting targets the effective use of coupled datasets where the groundtruth meshes are available, whereas the second setting targets uncoupled datasets. In addition, we employ a 3D surface descriptor along with graph convolutional networks, which enables the preservation of the surface details of the generated meshes. We confirm that our framework outperforms the state-of-the-art and represents the first effort to incorporate explicit 3D features in single-image-based 3D mesh reconstruction. One of the interesting properties of the proposed mesh recovery approach is that there is no need for parametric hand models as priors. The geometry of the hand is learned and encoded directly in the generator through the end-to-end adversarial training process. This enables the proposed algorithm to be easily adapted to other mesh problems, such as other body parts or 3D objects.