Joint stereo 3D object detection and implicit surface reconstruction

We present a new learning-based framework, S-3D-RCNN, that recovers accurate object orientation in SO(3) and simultaneously predicts implicit rigid shapes from stereo RGB images. For orientation estimation, in contrast to previous studies that map local appearance to observation angles, we propose a progressive approach that extracts meaningful Intermediate Geometrical Representations (IGRs). This approach features a deep model that transforms perceived intensities from one or two views into object part coordinates to achieve direct egocentric object orientation estimation in the camera coordinate system. To achieve a finer description inside 3D bounding boxes, we further investigate the implicit shape estimation problem from stereo images. We model visible object surfaces with a point-based representation, augmenting the IGRs to explicitly address the unseen surface hallucination problem. Extensive experiments validate the effectiveness of the proposed IGRs, and S-3D-RCNN achieves superior 3D scene understanding performance. We also design new metrics on the KITTI benchmark to evaluate implicit shape estimation.


Introduction
"The usefulness of a representation depends upon how well suited it is to the purpose for which it is used".-Marr 1 Estimating 3D attributes of outdoor objects is a fundamental vision task enabling many important applications.For example, in vision-based autonomous driving and traffic surveillance systems 2 , accurate vehicle orientation estimation (VOE) can imply a driver's intent of travel direction, assist motion prediction and planning, and help identify anomalous behaviors.In outdoor augmented reality systems, rigid shape estimation can enable photo-realistic lighting and physics-based surface effect simulation.In auto-labeling applications 3 , off-board deep 3D attribute estimation models serve as an important component in a data closed-loop platform by speeding up the labeling efficiency.This study proposes a new multi-task model to fulfill this task, which takes a pair of calibrated images (Fig. 1(a)) and can detect objects as bounding boxes (Fig. 1(b)) as well as estimate their implicit shapes (Fig. 1(c)).
Even though the binocular human visual system can recover multiple 3D object properties effortlessly at a glance, this task is challenging for computers to accomplish. The difficulty results from the loss of geometry information during image formation, which causes a semantic gap between RGB pixels and the unknown 3D properties. This problem is further exacerbated by the huge variation in object appearance. Recently, advances in deep learning have greatly facilitated the representation learning process from data, where a neural network predicts 3D attributes from images 4,5. In this paradigm, paired images and 3D annotations are specified as inputs and learning targets, respectively, for supervising a deep model. No intermediate representation is designed in these studies, which therefore need a large number of training pairs to approximate the highly non-linear mapping from pixel space to 3D geometrical quantities.
To address this problem, instead of directly regressing 3D attributes from pixels with a black-box neural network, we propose a progressive mapping from pixels to 3D attributes. This design is inspired by Marr's representational framework of vision 1. In Marr's framework, intermediate representations, i.e., the 2½-D sketch, are computed from low-level pixels and are later lifted to 3D model representations. However, how to design effective intermediate representations for accurate and robust outdoor 3D attribute recovery is still task-dependent and under-explored. In this study, our research question is: can a neural network extract explicit geometrical quantities from monocular/stereo RGB images and use them for effective object pose/shape estimation?
The conference version of this study 6 addressed part of the question, i.e., how to estimate orientation for one class of objects (vehicles) from a single RGB image. The proposed model, Ego-Net, computed part-based screen coordinates from object part heatmaps and further lifted them to 3D object part coordinates for accurate orientation inference. While the study 6 is effective for VOE, it is limited in several aspects which demand further exploration. Firstly, the approach was demonstrated only in a single-view setting. The difficulty of monocular depth estimation makes it less competent in perception accuracy compared with multi-view systems. Thus, an extension to a multi-view sensor configuration, and a study of the effectiveness of the proposed IGRs in such cases, is highly desirable and complementary. Secondly, the approach was only validated for vehicles and not shown for other common articulated objects such as pedestrians. These objects are smaller and have far fewer training labels in the used dataset. It would be intriguing to see whether the proposed IGRs remain consistently effective. Lastly, the approach can only predict object orientation. It does not unlock the full potential of the extracted high-resolution instance features to recover a more detailed rigid shape description, nor does it discuss how to design effective IGRs to achieve this. A complete and detailed rigid shape description beyond 3D bounding boxes is desirable for various machine vision applications. For example, an autonomous perception system can give a more accurate collision prediction within an object's 3D bounding box. As an illustration, the green ray reflecting on the predicted object surface in Fig. 1(d) cannot be obtained by using 3D bounding boxes to represent objects due to the lack of fine-grained surface normals. In addition, our approach describes rigid objects with implicit representations, which can be rendered at varying resolutions (Fig. 1(e)).
To fully address our research question, this study presents an extended model, S-3D-RCNN, for joint object detection, orientation estimation, and implicit shape estimation from a pair of RGB images. Firstly, we demonstrate the effectiveness of the proposed IGRs in a stereo perception setting, where part-based screen coordinates aggregated from two views further improve VOE accuracy. Secondly, we validate the robustness of the proposed IGRs for other outdoor objects such as pedestrians. These small objects are underrepresented in the training dataset, yet the proposed approach still achieves accurate orientation estimation. Lastly, we propose several new representations in the framework to further extend Ego-Net for implicit shape estimation. We formulate implicit shape estimation as an unseen surface hallucination problem and propose to address it with a point-based visible surface representation. For quantitative evaluation, we further propose two new metrics that extend KITTI's object detection evaluation 7 to consider the retrieval of the object surface. In summary, this study extends the previous conference version in various aspects with the following added contributions.
• It explores the proposed IGRs in a stereo-perception setting and validates the effectiveness of the proposed approach in recovering orientation in SO(3) for two-view inputs.
• It shows the proposed approach is not limited to rigid objects and has strong orientation estimation performance for other small objects that may have fewer training labels.
• It extends the representational framework in Ego-Net with several new IGRs to achieve implicit shape estimation from stereo regions of interest. To the best of our knowledge, Ego-Net++ is the first stereo image-based approach for 3D object detection and implicit rigid shape estimation.
• To quantitatively evaluate the proposed implicit shape estimation task, two new metrics are designed to extend the previous average precision metrics to consider the object surface description.
We introduce and compare with relevant studies in the next section and revisit Ego-Net in Section 2.2. We then detail the extended designs in Ego-Net++ in Section 2.3, followed by experimental evaluations in Section 3.

Related Work
This study features outdoor environments, orientation estimation, and implicit shape reconstruction. It draws connections with prior studies in the following domains yet has unique contributions. Image-based 3D scene understanding requires recovering 3D object properties from RGB images and usually consists of multiple sub-tasks 8-15. Two popular paradigms have been proposed. The generative, a.k.a. analysis-by-synthesis, approaches 16-19 build generative models of image observations and unknown 3D attributes. During inference, they search the 3D state space for an optimum that best explains the image evidence. However, good initialization and iterative optimization are required to search a high-dimensional state space. In contrast, the discriminative approaches 20-23 directly learn a mapping from image observations to 3D representations. Our approach falls into the latter category yet is unique. Unlike previous studies that are only applicable to indoor environments with small depth variation 11,24-28 or only consider the monocular camera setting 22,29-31, our framework can exploit two-view geometry to accurately locate objects and enables resolution-agnostic implicit shape estimation in challenging outdoor environments. Compared with recent multi-view reconstruction studies 32,33, our study does not take 3D mesh inputs as in 32 or use synthetic image inputs as in 33.
Learning-based 3D object detection learns a function that maps sensor input to objects represented as 3D bounding boxes 34. Depending on the sensor configuration, previous studies can be categorized into RGB-based methods 5,6,35-39 and LiDAR-based approaches 40-43. Our approach is RGB-based and does not require expensive range sensors. While previous RGB-based methods can describe objects up to a 3D bounding box representation, the quality of shape predictions within the bounding boxes was not evaluated with existing metrics. Our extended study fills this gap by proposing new metrics and designing an extended model that complements Ego-Net with implicit shape reconstruction capability from stereo inputs. Learning-based orientation estimation for 3D object detection seeks a function that maps pixels to instance orientation in the camera coordinate system via learning from data. Early studies 44,45 utilized hand-crafted features 46 and boosted trees for discrete pose classification (DPC). More recent studies replace the feature extraction stage with deep models. ANN 47 and DAVE 48,49 classify instance feature maps extracted by a CNN into discrete bins. To deal with images containing multiple instances, Fast-RCNN-like architectures were employed in 34,50-53, where region-of-interest (RoI) features represent instance appearance and a classification head gives the pose prediction. Deep3DBox 4 proposed the MultiBin loss for joint pose classification and residual regression. A Wasserstein loss was promoted in 54 for DPC. Our Ego-Net 6 is also a learning-based approach but possesses key differences: it promotes learning explicit part-based IGRs while previous works do not. With IGRs, Ego-Net is robust to occlusion and can directly estimate the global (egocentric) pose in the camera coordinate system, while previous works can only estimate the relative (allocentric) pose. Compared to Ego-Net, orientation estimation with Ego-Net++ in this extended study is no longer limited to monocular inputs and rigid objects. In addition, Ego-Net++ can further achieve object surface retrieval for rigid objects while Ego-Net cannot. Instance-level modeling in 3D object detection builds a feature representation for a single object to estimate its 3D attributes 22,53,55-57. FQ-Net 55 draws a re-projected 3D cuboid on an instance patch to predict its 3D Intersection over Union (IoU) with the ground truth. RAR-Net 57 formulates a reinforcement learning framework for instance location prediction. 3D-RCNN 22 and GSNet 53 learn a mapping from instance features to PCA-based shape codes. Ego-Net++ in this study is a new instance-level model in that it can predict implicit shape and can utilize stereo imagery while previous studies cannot. Neural implicit shape representation encodes object shapes as latent vectors via a neural network 58-62, which shows advantages over classical shape representations. However, many prior works focus on perfect synthetic point clouds as inputs, and few have explored inference of such representations in 3D object detection (3DOD) scenarios. Instead, we address inferring such representations in a realistic object detection scenario with stereo sensors by extending the IGRs in Ego-Net to accomplish this task.

Overall framework
Our framework S-3D-RCNN detects objects and estimates their 3D attributes from a pair of stereo images with designed intermediate representations. S-3D-RCNN consists of a proposal model D and an instance-level model E (Ego-Net++) for instance-level 3D attribute recovery, as shown in Fig. 2. E is agnostic to the design choice of D and can be used as a plug-and-play module. Given an image pair (L, R) captured by stereo cameras with left camera intrinsics K_{3×3}, D predicts N cuboid proposals.

Fig. 2 (caption): A local cost volume is constructed from instance features to estimate disparities. The visible surface coordinates are computed from the predicted disparities and an estimated mask, and then normalized to a canonical coordinate system. An encoder-decoder component Ha infers the missing surface of the object. The complete surface coordinates are passed to an encoder to extract an implicit shape vector, which can be used by a decoder for resolution-agnostic mesh extraction. For orientation estimation, a zoomed-in view is shown in Fig. 8. FCN stands for a fully convolutional network module. The 2D part coordinates are lifted to 3D coordinates by Li in Eq. 6.
Conditioned on each proposal b_i, E constructs instance-level representations and predicts the orientation as E(L, R; θ_E | b_i) = θ_i. In addition, for rigid object proposals (i.e., vehicles in this study), E can further predict an implicit shape representation. In implementation, D is designed as a voxel-based 3D object detector, as shown in Fig. 3. The following sub-sections first revisit the motivation and representation design in Ego-Net, and then highlight the new representations introduced in Ego-Net++ for stronger 3D scene understanding performance.

Orientation estimation with a progressive mapping
Previous studies 4,5,22,64 regress vehicle orientation with the computational graph in Eq. 1. A CNN-based model N is used to map local instance appearance x_i to the allocentric pose α_i, i.e., 3D orientation in the object coordinate system (OCS), which is then converted to the egocentric pose, i.e., orientation in the camera coordinate system (CCS). The difference between these two coordinate systems is shown in Fig. 4. This two-step design is a workaround, since an object with the same egocentric pose θ_i can produce different local appearance depending on its location 22, so learning the mapping from x_i to θ_i directly is ill-posed. In this two-step design, the conversion to the egocentric pose relies on the object location estimated by another module. The error of that module can propagate to the final estimate of the egocentric pose, and optimizing N does not optimize the final target directly. Another problem of this design is that the mapping from pixels x_i to pose vectors α_i is highly non-linear and difficult to approximate 65.
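The allocentric-to-egocentric conversion mentioned above can be made concrete with a short sketch. Under one common convention (e.g., the one used with KITTI annotations), the egocentric yaw equals the allocentric observation angle plus the viewing angle of the object center; the exact sign convention may differ from the paper's implementation:

```python
import math

def egocentric_from_allocentric(alpha, t):
    """Convert an allocentric (observation) angle alpha to an egocentric
    yaw, given the object centre t = (t_x, t_y, t_z) in the camera frame.
    Illustrative: follows the common convention yaw = alpha + atan2(x, z)."""
    return alpha + math.atan2(t[0], t[2])

# Two objects with the same egocentric yaw but different locations have
# different allocentric poses, hence different local appearance: this is
# why regressing egocentric pose from a local patch alone is ill-posed.
theta = 1.0
for t in [(0.0, 1.0, 10.0), (5.0, 1.0, 10.0)]:
    alpha = theta - math.atan2(t[0], t[2])  # inverse of the conversion
    assert abs(egocentric_from_allocentric(alpha, t) - theta) < 1e-12
```

The loop shows that the same egocentric yaw corresponds to two different allocentric angles at two locations, which is the ambiguity the two-step design works around.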
Ego-Net instead learns a mapping from images to egocentric poses to optimize the target directly. However, instead of relying on a black-box model to fit such a non-linear mapping, it promotes a progressive mapping, where coordinate-based IGRs are extracted from pixels and eventually lifted to the 3D target. Specifically, Ego-Net is a composite function with learnable modules {H, C, Li}. Given the cropped 2D image patch of one proposal x_i, Ego-Net predicts its egocentric pose with the computational graph shown in Eq. 2; Fig. 5 depicts the architecture. H extracts heatmaps h(x_i) for 2D object parts that are mapped by C to coordinates φ_l(x_i) representing their local location on the patch. φ_l(x_i) is converted to global image plane coordinates φ_g(x_i) with an affine transformation A_i parametrized by scaling and 2D translation. φ_g(x_i) is further lifted to a 3D representation ψ(x_i) by Li. The final pose prediction derives from ψ(x_i). (Fig. 4 caption: existing solutions first estimate an allocentric pose in the object coordinate system (blue) and convert it to an egocentric pose in the camera coordinate system (green) based on the object location.)

Design of Labor-free Intermediate Representations
The IGRs in Eq. 2 are designed based on the following considerations.
Availability: It is favorable if the IGRs can be easily derived from existing ground-truth annotations with no or minimal extra manual effort. Thus we define object parts from existing 3D bounding box annotations.
Discriminativeness: The IGRs should be indicative of orientation, so that they can serve as a good bridge between the visual appearance input and the geometrical target.
Transparency: The IGRs should be easy to understand, which makes them debugging-friendly and trustworthy for applications such as autonomous driving. Thus the IGRs are defined with explicit meaning in Ego-Net.
With the above considerations, we define the 3D representation ψ(x_i) as a sparse 3D point cloud (PC) representing an interpolated cuboid. Autonomous driving datasets such as KITTI 7 usually label instance 3D bounding boxes from captured point clouds, where an instance x_i is associated with its centroid location in the camera coordinate system t_i = [t_x, t_y, t_z], size [h_i, w_i, l_i], and its egocentric pose θ_i. For consistency, many prior studies only use the yaw angle, denoted as θ_i. As shown in Fig. 6, denote the 12 lines comprising a 3D bounding box as {l_j}, j = 1, ..., 12, where each line is represented by two endpoints (start and end) p_j^s and p_j^e in the camera coordinate system. As a complexity-controlling parameter, q more points {p_j^1, ..., p_j^q} are sampled from each line with a pre-defined interpolation matrix B of size q × 2. The 8 endpoints, the instance's centroid, and the interpolated points on each of the 12 lines form a set of 9 + 12q points, whose concatenation forms a (9 + 12q) × 3 matrix τ(x_i). Since we do not need the 3D target ψ(x_i) to encode location, we subtract the instance translation t_i from τ(x_i) and represent ψ(x_i) as a set of 8 + 12q points p_j^v − t_i encoding the shape relative to the centroid, where v ∈ {s, 1, ..., q, e} and j ∈ {1, 2, ..., 12}. A larger q provides more cues for inferring pose yet increases complexity. In practice, we choose q = 2, and the right of Fig. 6 shows an example where 2 points are interpolated on each line. Serving as the 2D representation to be located by H and C, φ_g(x_i) is defined as the projected screen coordinates of τ(x_i) given camera intrinsics K_{3×3}. φ_g(x_i) implicitly encodes the instance location on the image plane, so estimating the egocentric pose directly from it is less ill-posed. In summary, these IGRs can be computed with zero extra manual annotation, are easy to understand, and contain rich information for estimating the instance orientation.
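The interpolated-cuboid construction above can be sketched in a few lines. This is an illustration, not the authors' code: linear interpolation stands in for the interpolation matrix B, and a yaw rotation about the vertical axis is assumed for the pose:

```python
import math

def cuboid_igr_points(h, w, l, yaw, q=2):
    """Sketch of the interpolated-cuboid IGR: centroid + 8 box corners +
    q interpolated points per edge, relative to the centroid, rotated by
    the (assumed vertical-axis) egocentric yaw. Returns 9 + 12q points."""
    xs, ys, zs = l / 2, h / 2, w / 2
    corners = [(sx * xs, sy * ys, sz * zs)
               for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    # The 12 edges connect corner pairs that differ in exactly one axis.
    edges = [(a, b) for a in range(8) for b in range(a + 1, 8)
             if sum(ca != cb for ca, cb in zip(corners[a], corners[b])) == 1]
    assert len(edges) == 12
    pts = [(0.0, 0.0, 0.0)] + corners        # centroid + 8 endpoints
    for a, b in edges:                       # q interpolated points per edge
        for k in range(1, q + 1):
            t = k / (q + 1)
            pts.append(tuple((1 - t) * pa + t * pb
                             for pa, pb in zip(corners[a], corners[b])))
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pts]

points = cuboid_igr_points(h=1.5, w=1.6, l=4.0, yaw=0.3, q=2)
assert len(points) == 9 + 12 * 2             # 33 points for q = 2
```

With q = 2 this yields the 33-point cloud used in the paper's setting; projecting these points with K_{3×3} would give the corresponding 2D IGR φ_g(x_i).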

Ego-Net++: towards multi-view perception and implicit shape inference
To study the effectiveness of screen coordinates in encoding two-view information, and to achieve a finer description of perceived objects, several new IGRs are designed in Ego-Net++.

Orientation estimation with paired part coordinates
Under the stereo perception setting, the second viewpoint provides more information to infer the unknown object orientation. It is thus necessary to extend the IGRs to aggregate information from both views. In Ego-Net++, paired part coordinates (PPCs) are defined with a simple yet effective concatenation operation that aggregates two-view information into a k × 4 representation. PPC enhances the IGR in Ego-Net with disparity, i.e., the difference of part coordinates between the two views. Such disparity provides geometrical cues for object part depth and has larger mutual information with the target orientation. PPC is illustrated in Fig. 7 for one object part, and it extends the computational graph in Eq. 2. The learnable modules are implemented with convolutional and fully-connected layers, and a diagram of orientation estimation from stereo images is shown in Fig. 8. During training, ground-truth IGRs are available to penalize the predicted heatmaps, predicted 2D coordinates, and 3D coordinates. Such supervision is implemented with an L2 loss for heatmaps and an L1 loss for coordinates. In inference, the predicted egocentric pose is computed from the predicted 3D coordinates ψ(x_i). Denoting the 3D coordinates at a canonical pose as ψ_0(x_i), the predicted pose is obtained by estimating a rotation R(θ_i) from ψ_0(x_i) to ψ(x_i).
In implementation, this least-squares problem is solved with singular value decomposition (SVD). This process is efficient due to the small number of parts.
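The SVD step above is the standard orthogonal Procrustes (Kabsch) solution; a minimal sketch with NumPy, using an illustrative part layout rather than the paper's actual part set:

```python
import numpy as np

def rotation_from_correspondences(P0, P):
    """Least-squares rotation R with R @ P0 ~= P, solved via SVD.
    P0, P: 3 x k arrays of canonical-pose and predicted centroid-relative
    3D part coordinates."""
    H = P @ P0.T                          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))    # guard against reflections
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Toy check: rotate canonical parts by a known yaw and recover it.
theta = 0.4
R_gt = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(theta), 0.0, np.cos(theta)]])
P0 = np.array([[1.0, -1.0, 0.5, 0.0],
               [0.2, 0.1, -0.3, 0.0],
               [0.5, 0.5, -1.0, 0.3]])
R = rotation_from_correspondences(P0, R_gt @ P0)
assert np.allclose(R, R_gt, atol=1e-6)
```

Since k is the small number of object parts, the SVD of a 3 × 3 matrix dominates the cost, which is why this step is cheap at inference time.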

Implicit shape estimation via surface hallucination
The previous sub-sections enable recovering object orientation from one or two views. However, such perception capability is limited to describing objects as 3D bounding boxes and fails to provide a more detailed representation within the box. While some previous studies 53,66-68 explore shape estimation for outdoor rigid objects, they cannot exploit stereo inputs or require PCA-based templates 53,66-68, which are limited to a fixed mesh topology 53. In contrast, Ego-Net++ can take advantage of stereo information and conduct implicit shape estimation, which produces flexible, resolution-agnostic meshes. To the best of our knowledge, how to design effective intermediate representations for recovering implicit rigid shapes of outdoor objects with stereo cameras is under-explored. We design the IGRs for implicit shape estimation based on the following observation. The implicit shape representation s(x_i) of a rigid object describes its complete surface geometry, yet the observation from stereo cameras only encodes a portion of the object's surface. This indicates that the implicit shape reconstruction problem can be modeled as an unseen surface hallucination problem, i.e., one needs to infer the unseen surface based on partial visual evidence. Specifically, Ego-Net++ addresses this problem by extending the IGRs in Ego-Net with several new representations and learning a progressive mapping from stereo appearance to s(x_i). This mapping is represented as a composite function E ∘ Ha ∘ O ∘ V with learnable parameters {Ha, V, E}. In this computational graph, v(x_i) represents the visible object surface. After a normalization operator O, this representation is converted to the OCS. To recover the missing surface, a point-based encoder-decoder Ha hallucinates a complete surface based on learned prior knowledge. E encodes the complete surface into an implicit shape vector.

The visible-surface representation
We propose a point-based representation for the visible object surface. Given a pair of stereo RoIs, V estimates the foreground mask and depth, samples a set of pixels from the foreground, and re-projects them to the CCS. Denoting M(x_i) as the predicted set of foreground pixels, we sample e elements from it as M_sp(x_i). These elements are re-projected to the CCS to form a set of e 3D points, where (m_x, m_y) denotes the screen coordinates of a pixel m. Concatenating these elements gives a 3 × e matrix v(x_i) encoding the visible instance PC in the CCS.
In implementation, M(x_i) is obtained by applying fully convolutional layers on 2D features for foreground classification. To obtain the depth prediction for the foreground region, a local cost volume is constructed to estimate disparity for the local patch. The disparities are then converted to depth as m_z = f B / m_disp, where m_disp, B, and f are the estimated disparity, stereo baseline length, and focal length, respectively.
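The disparity-to-depth conversion and back-projection can be sketched as follows. The pinhole back-projection and the camera constants (f, B, cx, cy are KITTI-like values chosen for illustration) are assumptions for the sketch:

```python
def backproject_foreground(pixels, disparities, f, B, cx, cy):
    """Sketch of V's re-projection step: convert per-pixel disparity to
    depth (m_z = f * B / disp) and lift each foreground pixel (m_x, m_y)
    to a 3D point in the camera coordinate system via the pinhole model."""
    points = []
    for (mx, my), disp in zip(pixels, disparities):
        mz = f * B / disp               # depth from disparity
        X = (mx - cx) * mz / f          # pinhole back-projection
        Y = (my - cy) * mz / f
        points.append((X, Y, mz))
    return points

# One foreground pixel with 35 px disparity, KITTI-like intrinsics.
pts = backproject_foreground([(700.0, 200.0)], [35.0],
                             f=721.0, B=0.54, cx=609.6, cy=172.9)
assert abs(pts[0][2] - 721.0 * 0.54 / 35.0) < 1e-9
```

Stacking the e resulting points column-wise gives the 3 × e matrix v(x_i) described in the text.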

Hallucination with normalized coordinates
We found the learning of unseen surface hallucination more difficult when the input point coordinates exhibit large variations in object pose and size. Thus we propose to disentangle the estimation of rigid shape from object pose. Specifically, we use the operator O to normalize the visible object surface to a canonical OCS with a similarity transformation. Denote a detected object b_i as a 7-tuple b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where (x_i, y_i, z_i), (h_i, w_i, l_i), and θ_i denote its translation, size (height, width, length), and orientation in the CCS, respectively. The normalized coordinates o(x_i) are computed conditioned on b_i.
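A minimal sketch of the operator O, under the assumption that it subtracts the box translation, rotates by −θ into the object frame, and scales by the box dimensions (the paper's exact scaling convention may differ):

```python
import math

def normalize_to_ocs(points, box):
    """Illustrative similarity transform from the CCS into a canonical
    object coordinate system: translate, undo yaw, normalise by size."""
    x, y, z, h, w, l, theta = box
    c, s = math.cos(-theta), math.sin(-theta)
    out = []
    for px, py, pz in points:
        tx, ty, tz = px - x, py - y, pz - z            # remove translation
        rx, rz = c * tx + s * tz, -s * tx + c * tz     # undo yaw (y-axis)
        out.append((rx / l, ty / h, rz / w))           # size-normalised
    return out

box = (10.0, 1.0, 25.0, 1.5, 1.6, 4.0, 0.0)            # (x,y,z,h,w,l,theta)
o = normalize_to_ocs([(12.0, 1.0, 25.8)], box)
assert abs(o[0][0] - 0.5) < 1e-9 and abs(o[0][2] - 0.5) < 1e-9
```

After this normalization, all visible surfaces live in a pose- and size-free canonical frame, which is what makes the hallucination module's job easier.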

Ha is implemented as a point-based encoder-decoder module, which extracts point-level features 69 from o(x_i) and infers N_c × 3 coordinates c(x_i) to represent the complete surface. Finally, the shape encoder E maps the estimated complete surface c(x_i) into a latent vector s(x_i) that encodes the object's implicit shape. To extract a mesh representation from the predicted implicit shape code, we use the occupancy decoder 59: a set of grid locations is specified and the decoder predicts the occupancy field on that grid. Note that this grid need not be evenly distributed, so the shape code can easily be used in a resolution-agnostic manner. Given the occupancy field, we then use the Marching Cubes algorithm 70 to extract an isosurface as a mesh representation.
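The resolution-agnostic property can be illustrated with a toy occupancy query. Here a unit-sphere occupancy function stands in for the learned decoder conditioned on a shape code; the same decoder is queried on grids of arbitrary resolution:

```python
def occupancy_grid(decode, lo, hi, n):
    """Query an implicit occupancy function `decode` on an n x n x n grid
    over [lo, hi]^3. The grid resolution n is chosen at query time, which
    is the resolution-agnostic property discussed in the text."""
    step = (hi - lo) / (n - 1)
    coords = [lo + i * step for i in range(n)]
    return [[[decode((x, y, z)) for z in coords] for y in coords]
            for x in coords]

# Toy stand-in for the occupancy decoder: a unit sphere.
sphere = lambda p: 1.0 if p[0]**2 + p[1]**2 + p[2]**2 <= 1.0 else 0.0
coarse = occupancy_grid(sphere, -1.2, 1.2, 16)   # 16^3 query points
fine = occupancy_grid(sphere, -1.2, 1.2, 64)     # 64^3, same decoder
assert coarse[8][8][8] == 1.0 and coarse[0][0][0] == 0.0
```

Running Marching Cubes on either field (e.g., with an off-the-shelf implementation such as scikit-image's `measure.marching_cubes`) would then yield a mesh at the chosen resolution.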
To optimize the learnable parameters during training, the supervision consists of a cross-entropy segmentation loss, a smooth L1 disparity estimation loss, and a hallucination loss implemented as the Chamfer distance. We train the shape decoder on ShapeNet 71 by randomly sampling grid locations within object 3D bounding boxes paired with ground-truth occupancy. In inference, we apply zero padding if Card(M(x_i)) < e.

Penalizing in-box descriptions for 3DOD
With the new IGRs introduced in Ego-Net++, S-3D-RCNN can estimate 3D bounding boxes accurately from stereo cameras as well as describe a finer rigid shape within the bounding boxes. However, existing 3DOD studies on KITTI 7 cannot measure the quality of a more detailed shape description beyond a 3D bounding box representation. To fill this gap and validate the effectiveness of the new IGRs, we contribute new metrics to KITTI for this evaluation.
As a reference metric, the official Average Orientation Similarity (AOS) in KITTI is defined as AOS = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} max_{r̃: r̃ ≥ r} s(r̃), where r is the detection recall and s(r) ∈ [0, 1] is the orientation similarity (OS) at recall level r. OS is defined as s(r) = (1/|D(r)|) Σ_{i ∈ D(r)} ((1 + cos Δθ_i)/2) δ_i, where D(r) denotes the set of all object predictions at recall rate r and Δθ_i is the difference in yaw angle between the estimated and ground-truth orientation for object i. If b_i is a 2D false positive, i.e., its 2D intersection-over-union (IoU) with the ground truth is smaller than a threshold (0.5 or 0.7), δ_i = 0; otherwise δ_i = 1. Note that AOS builds on the official average precision metric AP_2D and is upper-bounded by AP_2D. AOS = 1 if both the object detections and the orientation estimates are perfect.
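The two ingredients of AOS, the per-recall orientation similarity and the 11-point interpolated average, can be sketched directly from the definitions above (illustrative recall thresholds and inputs):

```python
import math

def orientation_similarity(delta_thetas, is_tp):
    """OS at one recall level: mean of (1 + cos(dtheta)) / 2 over all
    detections, zeroing the false positives, per the KITTI definition."""
    if not delta_thetas:
        return 0.0
    return sum(((1.0 + math.cos(d)) / 2.0) * (1.0 if tp else 0.0)
               for d, tp in zip(delta_thetas, is_tp)) / len(delta_thetas)

def aos(recalls, similarities):
    """AOS: average over the 11 recall thresholds of the maximum
    similarity attained at recall >= r (interpolated-AP construction)."""
    total = 0.0
    for r in [0.1 * k for k in range(11)]:
        attained = [s for rec, s in zip(recalls, similarities)
                    if rec >= r - 1e-12]
        total += max(attained, default=0.0)
    return total / 11.0

# A perfect and a flipped orientation average to 0.5 similarity.
s = orientation_similarity([0.0, math.pi], [True, True])
assert abs(s - 0.5) < 1e-12
```

The same interpolated-average construction is reused for AP_MMD below, with s(r) swapped for the MMD similarity.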
Based on Minimal Matching Distance (MMD) 72,73, we propose a new metric AP_MMD in the same manner: AP_MMD = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} max_{r̃: r̃ ≥ r} s_MMD(r̃), where r is the same detection recall and s_MMD(r) ∈ [0, 1] is the MMD similarity (MMDS) at recall level r. MMDS is defined as s_MMD(r) = (1/|D(r)|) Σ_{i ∈ D(r)} δ_i^MMD, where δ_i^MMD is an indicator and MMD(c(x_i)) denotes the MMD of prediction i, which measures the quality of the predicted surface. If the i-th prediction is a false positive or MMD(c(x_i)) > γ, δ_i^MMD = 0; otherwise δ_i^MMD = 1. AP_MMD is thus also upper-bounded by the official AP_2D. AP_MMD reaches AP_2D when MMD(c(x_i)) ≤ γ for all predictions. In experiments we set γ = 0.05.
Since instance-level ground-truth shape is not available in KITTI, MMD(c(x_i)) is implemented as a category-level similarity, similar to 72,73. For a predicted instance PC c(x_i), it is defined as the minimal L2 Chamfer distance between it and a collection of template PCs in ShapeNet 71 with the same class label, formally MMD(c(x_i)) = min_{G ∈ SN} d_CD(c(x_i), G), where SN stands for the set of ShapeNet template PCs and d_CD(P, G) = (1/|P|) Σ_{p ∈ P} min_{g ∈ G} ||p − g||_2^2 + (1/|G|) Σ_{g ∈ G} min_{p ∈ P} ||g − p||_2^2. During the evaluation, we downloaded the 250 ground-truth car PCs used in 72,73 for consistency. The evaluation of AP_MMD considers false negatives. For completeness, we also design True Positive Minimal Matching Distance (MMDTP) to evaluate MMD for the true-positive predictions, similar to 74. We define MMDTP@β as the average MMD over predicted objects that have 3D IoU > β with at least one ground-truth object, i.e., MMDTP@β = Σ_i TP(i) MMD(c(x_i)) / Σ_i TP(i), where TP(i) is 1 if b_i is a true positive and 0 otherwise.
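The MMD computation reduces to a minimum over Chamfer distances. A minimal sketch with toy point sets, using the squared-L2 Chamfer convention written above (the exact averaging convention varies across papers):

```python
def chamfer_l2(P, Q):
    """Symmetric squared-L2 Chamfer distance between two point sets
    (lists of 3-tuples): mean nearest-neighbour distance in each
    direction, summed."""
    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    to_q = sum(min(sq(p, q) for q in Q) for p in P) / len(P)
    to_p = sum(min(sq(q, p) for p in P) for q in Q) / len(Q)
    return to_q + to_p

def mmd(pred, templates):
    """Minimal Matching Distance: Chamfer distance from the predicted
    point cloud to its closest template (ShapeNet car PCs in the paper's
    setting; toy sets here)."""
    return min(chamfer_l2(pred, G) for G in templates)

cube = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
templates = [cube, [(5, 5, 5), (6, 5, 5)]]
assert mmd(cube, templates) == 0.0   # identical template gives zero
```

A prediction matching one of the 250 car templates exactly would score MMD = 0, the case in which AP_MMD attains its AP_2D upper bound.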

Experiments
We first introduce the benchmark dataset and evaluation metrics, followed by a system-level comparison between our S-3D-RCNN and other approaches in terms of outdoor 3D scene understanding capabilities. We further present a module-level comparison to demonstrate the effectiveness of Ego-Net++. Finally, we conduct an ablation study on key design factors and hyper-parameters in Ego-Net++. For more training and implementation details, please refer to our supplementary material.

Experimental settings
Dataset and evaluation metrics. We employ the KITTI object detection benchmark 7, which contains stereo RGB images captured in outdoor scenes. The dataset is split into 7,481 training images and 7,518 testing images. The training images are further split into the train split and the val split, containing 3,712 and 3,769 images respectively. For consistency with prior studies, we use the train split for training and report results on the val split and the testing images. We use the official average precision metrics as well as our newly designed AP_MMD and MMDTP. As defined in Eq. 11, AOS is used to assess the system performance for joint object detection and orientation estimation. 3D Average Precision (AP_3D) measures precision at the same recall values, where a true-positive prediction has 3D IoU > 0.7 with the ground truth. Bird's-eye-view Average Precision (AP_BEV) instead uses IoU > 0.7 between the 3D bounding boxes projected to the ground plane as the criterion for true positives. Each ground-truth label is assigned a difficulty level (easy, moderate, or hard) depending on its 2D bounding box height, occlusion level, and truncation level.

System-level comparison with previous studies
Ego-Net++ can be combined with a 2D proposal model to build a strong system for object orientation and rigid shape recovery.

Joint object detection and orientation estimation performance
Ego-Net can be used with a 2D vehicle detection model to form a joint vehicle detection and orientation estimation system, whose performance is measured by AOS. Using the proposals from 6, Tab. 1 compares the AOS of our system using Ego-Net with other approaches on the KITTI test set for the car category. Among the single-view image-based approaches, our system outperforms the others by a clear margin. Our approach using a single image outperforms Kinematic3D 76, which exploits temporal information from RGB video. In addition to monocular orientation estimation, we show the system performance with our E, which can exploit two-view information, in Tab. 1 and Tab. 2 for the car and pedestrian classes respectively. Our S-3D-RCNN using E achieves improved vehicle orientation estimation performance. For the comparison of pedestrian orientation estimation, we use the same proposals as 84. Using our E consistently outperforms the proposal model, and the system performance based on RGB images even surpasses some LiDAR-based approaches 42,83 that have more accurate depth measurements. This result indicates that the LiDAR point cloud is not sufficiently discriminative for determining the accurate orientation of distant non-rigid objects due to a lack of fine-grained visual information. In contrast, our image-based approach effectively addresses this limitation of LiDAR sensors and can complement them in this scenario.

Comparison of 3D scene understanding capability
To our knowledge, S-3D-RCNN is the first model that jointly performs accurate 3DOD and implicit rigid shape estimation for outdoor objects with stereo cameras. Tab. 3 presents a summary and comparison of perception capability with previous image-based outdoor 3D scene understanding approaches. Qualitatively, S-3D-RCNN is the only method that can utilize stereo geometry as well as predict implicit shape representations. Compared to the monocular method 3D-RCNN 22, which uses template meshes with fixed topology, our framework can produce meshes in a resolution-agnostic way and can accurately estimate object locations by exploiting two-view geometry. We show qualitative results in Fig. 9, where the predicted implicit shapes are decoded to meshes. Our approach shows accurate localization performance along with plausible shape predictions, which opens up new opportunities for outdoor augmented reality.

Module-level comparison with previous studies
Here we demonstrate the effectiveness of Ego-Net++ as a module. Based on the IGRs and progressive mappings designed in this study, we show how Ego-Net++ can contribute to improved orientation/shape estimation performance for state-of-the-art 3D scene understanding approaches.

Comparison of orientation estimation performance
To assess whether Ego-Net can help improve the pose estimation accuracy of other 3DOD systems, we download proposals from other open-source implementations and use Ego-Net for orientation refinement. The results are summarized in Tab. 4 for the car category. While AOS depends on the detection performance of these methods, using Ego-Net consistently improves their pose estimation accuracy. This indicates that Ego-Net is robust even though the performance of a vehicle detector varies with different recall levels. We also compare with OCM3D 90 using the same proposals, and the higher AOS further validates the effectiveness of our proposed IGRs for orientation estimation.
For true positive predictions, we plot the distribution of orientation estimation error over different depth ranges and occlusion levels in Fig. 10. The error in each bin is the average orientation estimation error of the instances that fall into it. While the performance of M3D-RPN 5 and D4LCN 64 degrades significantly for distant and occluded cars, the errors of our approach increase gracefully. We believe that explicitly learning the object parts makes our model more robust to occlusion, as the visible parts provide richer information for pose estimation. Fig. 11 shows detected instances in BEV, where the arrows point to the heading directions of the vehicles. From local appearance alone, it can be hard to tell whether certain cars head towards or away from the camera, as validated by the erroneous predictions of D4LCN 64, which regresses pose from local features. In comparison, our approach gives accurate egocentric pose predictions for these instances and for others that are partially occluded. A quantitative comparison with several 3D object detection systems on the KITTI val split is shown in Tab. 5. Note that utilizing Ego-Net can correct wrong pose predictions, especially for difficult instances, which leads to improved 3D IoU and significantly improved AP BEV. Apart from using Ego-Net for monocular detectors, we provide extended studies of using Ego-Net++ with a state-of-the-art stereo detector. Tab. 6 shows a comparison for the non-rigid pedestrian class, where the same proposals are used as in 88. Thanks to the effectiveness of PPC, our E significantly improves 3DOD performance for these non-rigid objects; some examples are shown in Fig. 12. This improvement shows that the IGRs designed in this study are robust even though the training data for the pedestrian class is much scarcer.

Comparison of shape estimation performance
Here we present a quantitative comparison of rigid shape estimation quality using our new metric MMDTP. We compare with 81 because its official implementation is available. We downloaded 13,575 predicted objects from its website, of which 9,488 instances have 3D IoU large enough to compute MMDTP@0.5, as shown in Tab. 3. Our E can produce a complete shape description of a detected instance thanks to our visible surface representation and explicit modeling of the unseen surface hallucination problem. This leads to a significant improvement in MMDTP. Fig. 13 shows a qualitative comparison of the predicted instance shapes as point clouds. Note that the visible portion of the instances varies from one to another, yet our E can reliably infer the invisible surface. Fig. 14 shows the relationship between MMDTP and factors such as object depth and bounding box quality. More distant objects and less accurate 3D bounding boxes suffer from larger MMD for 81 due to fewer visible points and larger alignment errors. Note that our approach consistently improves over 81 across different depth ranges and box proposal quality levels. Per our definition of AP MMD, the s MMD values at different recall levels are shown in Fig. 15 for the predictions of Disp R-CNN. The performance of using our E with the same proposals is shown in Fig. 16. Detailed quantitative results are shown in Tab. 7. The AP 2D (IoU > 0.7), i.e., the upper bound for AP MMD, is 99.16, 93.22, and 81.28 for the easy, moderate, and hard categories respectively. Note that our E contributes a significant improvement over 81. This indicates that our approach greatly complements existing 3DOD approaches with the ability to describe outdoor object surface geometry within the 3D bounding boxes.
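Minimum-matching-distance scores of this kind are typically built on a Chamfer distance between point clouds. A minimal numpy sketch of the two ingredients (our own helper names and alignment assumptions, not the paper's implementation):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two (N, 3) point clouds:
    mean squared nearest-neighbor distance in both directions."""
    # Pairwise squared distances, shape (len(p), len(q)).
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def minimal_matching_distance(pred, reference_models):
    """Distance from a predicted shape to its best-matching shape
    in a reference set, assuming all clouds are already aligned
    and normalized to a canonical frame."""
    return min(chamfer_distance(pred, m) for m in reference_models)
```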

Ablation study
Direct regression vs. learning IGRs. To validate the design of our proposed IGRs for orientation estimation, we compare the pose estimation accuracy of our approach to a baseline that directly regresses pose angles from the instance feature maps. To eliminate the impact of the object detector used, we compare AOS on all annotated vehicle instances in the validation set. This is equivalent to measuring AOS with AP 2D = 1, so that orientation estimation accuracy becomes the only focus. The results are reported in Tab. 8. Is PPC better than the single-view representation? Tab. 9 shows the performance comparison between single-view Ego-Net and Ego-Net++, which uses stereo inputs. Using PPCs leads to better performance and validates the improvement of Ego-Net++ over Ego-Net. It also shows that the coordinate representations designed in this study can easily be used with multi-view inputs.
Is Ha useful? Hereafter we use the same object proposals as in Tab. 3 for consistency. We compare the performance without Ha in Tab. 10, where the normalized representation ocsi is evaluated directly instead of being passed to Ha. The results indicate that Ha effectively hallucinates plausible points to provide a complete shape representation.

Conclusion
We propose the first approach for joint stereo 3D object detection and implicit shape reconstruction, realized with a new two-stage model, S-3D-RCNN. S-3D-RCNN can (i) perform accurate object localization and provide a complete, resolution-agnostic shape description for detected rigid objects and (ii) produce significantly more accurate orientation predictions. To address the challenging problem of 3D attribute estimation from images, a set of new intermediate geometrical representations is designed and validated. Experiments show that S-3D-RCNN achieves strong image-based 3D scene understanding capability and brings new opportunities for outdoor augmented reality. Our framework can be extended to non-rigid shape estimation if corresponding data become available to train our hallucination module. How to devise an effective training approach for stereo pedestrian reconstruction is an interesting question that we leave for future work.

In implementation, we train V and Ha separately. V is trained with L seg + L disp, using a batch size of 16 instances for 50 epochs; the Adam optimizer is used with a learning rate of 0.001. For training Ha we use the ShapeNet training set as in 73. The hallucination loss is a Chamfer distance between the predicted and ground truth point clouds. Training uses a batch size of 50 and lasts 300 epochs; the learning rate starts at 0.001 and is multiplied by 0.9 after every 50 epochs. The experiments are conducted on NVIDIA RTX 3090 GPUs.
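The step schedule used for Ha (start at 0.001, multiply by 0.9 after every 50 epochs) amounts to the following rule, shown here as a small sketch of the stated hyperparameters rather than the released training code; we assume the decay is applied at epoch boundaries:

```python
def ha_learning_rate(epoch, base_lr=1e-3, gamma=0.9, step=50):
    """Learning rate at a given 0-indexed epoch under step decay:
    lr = base_lr * gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)

# Over the 300-epoch run this yields six plateaus:
# epochs 0-49 at 1e-3, 50-99 at 9e-4, ..., 250-299 at 1e-3 * 0.9**5.
```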
The training process of the proposal model follows LIGA 88 .

Figure 1 .
Figure 1. Given a pair of stereo RGB images, S-3D-RCNN can detect 3D objects and predict implicit rigid shapes in one forward pass. (a) Alpha-blended image pair showing the disparities. (b) 3D object proposals shown as 3D bounding boxes. (c) Shape predictions for the detected objects. (d) Estimated surface normals of the nearby object, where the red ray indicates incorrect reflection effects when only the 3D bounding box prediction is used. (e) The predicted implicit shape supports a spatially varying resolution.

Figure 2 .
Figure 2. Diagram of E (Ego-Net++). E performs orientation and rigid shape estimation with intermediate geometric representations. A local cost volume is constructed from instance features to estimate disparities. The visible surface coordinates are computed from the predicted disparities and an estimated mask, and then normalized to a canonical coordinate system. An encoder-decoder component Ha infers the missing surface of the object. The complete surface coordinates are passed to an encoder to extract an implicit shape vector, which can be used by a decoder for resolution-agnostic mesh extraction. A zoomed-in view of the orientation estimation part is shown in Fig. 8. FCN stands for a fully convolutional network module. The 2D part coordinates are lifted to 3D coordinates by Li in Eq. 6.

Figure 3 .
Figure 3. The proposal model D of S-3D-RCNN. In this implementation, a volumetric 3D scene representation is built from semantic features and cost-volume-based geometric features similar to 63. An anchor-based object detector processes the Bird's Eye View feature maps to generate 3D object proposals. Note that Ego-Net++ (E) is agnostic to the design choice of D and can be used with other 3D object detectors.

Figure 4 .
Figure 4. Local appearance cannot uniquely determine egocentric pose.Existing solutions first estimate an allocentric pose in the object coordinate system (blue) and convert it to an egocentric pose in the camera coordinate system (green) based on the object location.

Figure 5 .
Figure 5. Model architecture of Ego-Net. A fully convolutional model H regresses part heatmaps from a 2D patch of a proposal. The heatmaps are mapped to local coordinates with several strided convolution layers. The local coordinates are transformed to screen coordinates φ g (x i ) and mapped to a point-based 3D representation ψ(x i ) of a cuboid, whose orientation directly represents the egocentric pose in the camera coordinate system. k = 33 when q = 2 as in Sec. 2.2.2.

Figure 7 .
Figure 7. Diagram of the paired part coordinates representation (red) for an object part of a non-rigid object.

Figure 8 .
Figure 8. Zoomed-in view of the orientation estimation part in E in Fig. 2.

Figure 9 .
Figure 9. Qualitative results of S-3D-RCNN on the KITTI val set. Left: input left images and 3D proposals. Middle: rendered objects in the original camera view. Right: rendered objects in a different view (camera 2 meters higher with a 15° pitch angle).

Figure 10 .
Figure 10. Average orientation error (AOE) on the KITTI val split in different depth ranges and occlusion levels. Our approach is robust to distant and partially occluded instances.

Figure 11 .
Figure 11. Detected instances on the KITTI val split along with a comparison of the predicted vehicle orientations in bird's eye view.

Figure 12 .
Figure 12. A comparison similar to Fig. 11, showing pedestrian orientation predictions using two-view images as inputs.

Figure 13 .
Figure 13. Qualitative comparison for instances on the KITTI val split. Left: instance RoIs in the left image. Middle: instance point cloud predictions from Disp R-CNN 81. Right: our predictions c(x i ) after the hallucination module.

Figure 14 .
Figure 14. The bar plots show the distribution of the predicted bounding boxes used in the MMDTP evaluation. The average MMD for each bin is shown in the line plots. Using E consistently improves performance for objects in all bins.

Figure 15 .
Figure 15. MMD similarity of Disp R-CNN 81 used for computing the newly introduced metric AP MMD. Top: 2D IoU > 0.7 is used to determine a true positive. Bottom: 2D IoU > 0.5 is used instead.

Figure 18 .
Figure 18. Detailed architecture of the visible surface extraction module V. Conv3D-BN-ReLU is a 3D convolution layer followed by batch normalization and ReLU activation. ConvTrans3D denotes a transposed 3D convolution layer. Conv3D-BN-ReLU-R means the output is added to the input as a residual. k denotes kernel size, p padding, s stride, and d dilation. SPP-Module refers to the pyramid module in 93.

Figure 19 .
Figure 19. Detailed architecture of the unseen surface hallucination module Ha. The DGC layer denotes a dynamic graph convolution layer. The transformer and folding net modules follow 73 and 94 respectively.

Figure 20 .
Figure 20.Detailed architecture of the point encoder module.

Table 1 .
System-level evaluation comparing Average Orientation Similarity (AOS) with previous learning-based methods on the KITTI test set for the car category. Without using LiDAR data 75 during training (indicated by +LiDAR) or temporal information 76, a monocular system using Ego-Net outperforms previous image-based methods. A stereo system using Ego-Net++ shows further performance improvements.

Table 2 .
Comparison of AOS with other methods on the KITTI test set for the pedestrian category. Without using LiDAR data 83, a stereo system using Ego-Net++ significantly outperforms previous studies. The proposals are the same as GUPNet 84.

Table 3 .
A comparison of image-based outdoor 3D scene understanding performance on the KITTI val set for the Car class.When evaluating MMDTP, the predictions are grouped according to the predicted depth.

Table 4 .
AOS evaluation on the KITTI validation set. After employing Ego-Net, the vehicle pose estimation accuracy of other 3DOD systems is improved. The room for AOS improvement is upper-bounded by AP 2D.

Table 5 .
AP BEV evaluated on the KITTI validation set. Ego-Net can correct the erroneous pose predictions from 64, as shown in Fig. 11.

Table 6 .
Quantitative comparison of orientation predictions for pedestrians on the KITTI val set using the same proposals as 88. 40 recall values are used for consistency.

Table 7 .
Quantitative comparison using our introduced new metric AP MMD on KITTI val split.

Table 9 .
Module-level evaluation assuming perfect object detection on the KITTI val split for pedestrians.Ego-Net++ consistently improves over its single-view counterpart.

Table 10 .
MMDTP evaluated with and without the Ha module. The same object proposals are used as in Tab. 3.

As shown in Tab. 8, learning IGRs outperforms the direct-regression baseline by a significant margin. Deep3DBox 4 is another popular architecture that performs direct angle regression; our approach outperforms it thanks to the novel design of IGRs.