
Joint Pose and Shape Estimation of Vehicles from LiDAR Data

In this paper, the authors address the problem of estimating the pose and shape of vehicles from LiDAR data, a common task in autonomous driving. Autonomous vehicles are equipped with many sensors to perceive the world around them; LiDAR is the one the authors focus on here. A key requirement of the perception system is to identify other vehicles on the road and make decisions based on their pose and shape. The authors put forth a pipeline that jointly determines pose and shape from LiDAR data.

More about Pose and Shape Estimation

LiDAR sensors capture the world around them as point clouds. Often, the first step in LiDAR processing is some form of clustering or segmentation that isolates the parts of the point cloud belonging to individual objects. The next step is to infer the pose and shape of each object. This is mostly done by amodal perception, meaning the whole object is perceived from partial sensory input. So for our estimation, we get a partial point cloud of an object from which we need to estimate the complete shape and pose.

Investigations on Pose and Shape Networks

State-of-the-art baseline network

[Figure: Baseline point completion network]

As shown in the figure above, the baseline for predicting pose and shape first estimates the pose of an unaligned partial point cloud. The predicted pose is then applied to the partial point cloud, bringing the observation into a canonical coordinate system. This aligned cloud is then given as input to an encoder-decoder completion network that has been trained to complete canonically-aligned partial point clouds.
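A minimal sketch of this sequential flow is shown below; `pose_net` and `completion_net` are hypothetical stand-ins for the paper's networks, not code from the paper.

```python
import torch

def baseline_pipeline(points, pose_net, completion_net):
    # points: (N, 3) partial LiDAR point cloud of a single vehicle.
    # pose_net and completion_net are hypothetical stand-ins for the
    # paper's pose-estimation and encoder-decoder completion networks.
    theta, t_xy = pose_net(points)  # heading angle and (x, y) translation

    # Undo the estimated pose so the points land in the canonical frame.
    c, s = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([c, s]),
                       torch.stack([-s, c])])  # R(-theta)
    canonical = points.clone()
    canonical[:, :2] = (points[:, :2] - t_xy) @ rot.T

    # Complete the canonically-aligned partial cloud.
    return completion_net(canonical)
```

Any bias in `theta` or `t_xy` shifts `canonical` away from the inputs the completion network was trained on, which is exactly the error-propagation problem noted below.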

Disadvantages

  • Any error in the output of the pose estimation network propagates to the completion network and degrades its performance.
  • The partial point cloud is encoded twice: once in the pose estimation network and again in the shape completion network. This is computationally redundant.

Shared-encoder Network

Given these disadvantages, it is clear that the sequential arrangement of the pose and shape estimation networks is the root issue. Keeping this in mind, a shared-encoder network is proposed.
Intuitively, this means a single shared encoder learns a feature mapping from the partial point cloud, which is then used by both the pose and shape decoders. This is exactly what the paper hypothesizes, as sketched below.
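A minimal sketch of this layout, assuming a PointNet-style encoder and fully-connected decoders; the layer sizes are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class SharedEncoderNet(nn.Module):
    """One encoder feeds both a shape decoder and a pose decoder."""

    def __init__(self, feat_dim=1024, num_out_points=16384):
        super().__init__()
        # Per-point MLP followed by a max pool, yielding one
        # fixed-length feature vector per partial cloud.
        self.encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        # Shape decoder regresses the full completed point cloud.
        self.shape_decoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_out_points * 3),
        )
        # Pose decoder regresses heading theta and translation (x, y).
        self.pose_decoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 3),
        )

    def forward(self, points):                        # points: (B, N, 3)
        feat = self.encoder(points).max(dim=1).values  # (B, feat_dim)
        completion = self.shape_decoder(feat).view(points.size(0), -1, 3)
        pose = self.pose_decoder(feat)                 # (theta, x, y)
        return completion, pose
```

Because the encoder runs once per cloud, the baseline's double encoding disappears: both heads read from the same fixed-length feature vector.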

Shape training

The shared-encoder network is trained by freezing the pose decoder weights and training only the encoder and the shape decoder. The inputs are canonically unaligned partial point clouds, and the Chamfer distance is used as the loss function, where X is the ground-truth complete point cloud and X̃ is the estimated completion. The encoder thus learns to extract a fixed-length feature vector from which the complete shape of the vehicle can be recovered. This shape is in the same unknown pose as the partial point cloud.
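For reference, the symmetric Chamfer distance between the ground truth X and the estimate X̃ is commonly written as follows; the paper's exact averaging or squaring convention may differ slightly:

$$L_{CD}(X,\tilde{X}) \;=\; \frac{1}{|X|}\sum_{x \in X}\,\min_{\tilde{x} \in \tilde{X}} \lVert x - \tilde{x}\rVert_2 \;+\; \frac{1}{|\tilde{X}|}\sum_{\tilde{x} \in \tilde{X}}\,\min_{x \in X} \lVert \tilde{x} - x\rVert_2$$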

Pose training

The hypothesis is that the features generated by the encoder during shape estimation must also capture information about the pose. To test this, the encoder weights are frozen and the pose decoder is trained, on top of these frozen features, to estimate the poses of the partial point clouds using the pose loss (LP). The resulting performance is shown to match or outperform the pose estimation of the baseline network. Together, these two training stages make up the shared-encoder model.
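The two training stages can be expressed in a few lines, assuming the hypothetical SharedEncoderNet sketch above and the paper's Adam/1e-4 setting:

```python
import torch

net = SharedEncoderNet()

# Stage 1 (shape training): pose decoder frozen; the encoder and
# shape decoder are trained to minimize the Chamfer distance L_CD.
for p in net.pose_decoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(
    [p for p in net.parameters() if p.requires_grad], lr=1e-4)

# Stage 2 (pose training): encoder frozen; only the pose decoder
# is trained, minimizing the pose loss L_P on the frozen features.
for p in net.parameters():
    p.requires_grad = False
for p in net.pose_decoder.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(net.pose_decoder.parameters(), lr=1e-4)
```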

Joint Training

Finally, the authors propose a joint training method to further enhance the performance of the shared-encoder model. Once the shape and pose training stages are complete, all parts of the network are unfrozen and trained using a joint loss function (LJ).
Completion and pose loss are combined in a weighted manner to give the joint loss, where σCD and σP are learned parameters representing the uncertainty of the completion and pose predictions. Each is inversely proportional to the weight assigned to its loss term, and the log terms prevent the uncertainties from growing unboundedly.
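This description matches the homoscedastic-uncertainty weighting of Kendall et al.; a plausible form of the joint loss, with the caveat that the paper's exact constants may differ, is:

$$L_J \;=\; \frac{L_{CD}}{\sigma_{CD}^{2}} \;+\; \frac{L_P}{\sigma_P^{2}} \;+\; \log \sigma_{CD} \;+\; \log \sigma_P$$

A large learned σ down-weights its loss term (useful while that prediction is still noisy), while the log terms penalize letting σ grow without bound.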
The jointly-trained network outperforms both the baseline and standard shared-encoder networks in the accuracy of both completion and pose estimation.

General training details

All training uses the Adam optimizer with a learning rate of 1e-4 and a batch size of 32. All completion networks output a point cloud of 16,384 points, and all pose predictors output a heading θ and a translation (x, y).

Dataset

Training a supervised model to predict complete shape from partial observations requires a large dataset of paired partial and complete observations. To meet this requirement, a large synthetic dataset was constructed from high-resolution CAD models of various vehicle types; partial point clouds were sampled from these models using a simulated LiDAR sensor. The networks are then trained to produce a complete point cloud representing the exterior of the vehicle given only a partial point cloud of the object.
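A minimal sketch of this kind of simulated LiDAR sampling, assuming the `trimesh` library; the file name, sensor placement, and scan pattern are hypothetical, not the paper's setup:

```python
import numpy as np
import trimesh

# Load a vehicle CAD model (file name is hypothetical).
mesh = trimesh.load("car.obj", force="mesh")

# Place a virtual LiDAR and build one ray per (azimuth, elevation)
# pair, mimicking a spinning multi-beam sensor.
sensor = np.array([6.0, 0.0, 1.5])                # sensor origin (m)
az = np.radians(np.arange(-45.0, 45.0, 0.2))      # horizontal sweep
el = np.radians(np.linspace(-15.0, 15.0, 32))     # 32 vertical beams
az, el = np.meshgrid(az, el)
dirs = np.stack([-np.cos(el) * np.cos(az),
                 np.cos(el) * np.sin(az),
                 np.sin(el)], axis=-1).reshape(-1, 3)
origins = np.tile(sensor, (len(dirs), 1))

# Keep the first hit of each ray: the visible (partial) surface.
partial, _, _ = mesh.ray.intersects_location(
    ray_origins=origins, ray_directions=dirs, multiple_hits=False)

# The "complete" target cloud can be sampled uniformly from the
# full surface of the same model.
complete, _ = trimesh.sample.sample_surface(mesh, 16384)
```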

Results

Completion and pose accuracy on the synthetic dataset

[Figure: Completion and pose accuracy on the synthetic dataset]
The shared-encoder (SE) and joint-trained shared-encoder (JTSE) networks outperform the state-of-the-art baseline in completion quality (measured as Chamfer distance) as well as pose estimation (heading angle and translation) on the validation set of the synthetic dataset.
For each metric, JTSE gives the highest ratio of the validation set under a given error threshold, followed by SE. (Left) The shared-encoder architecture enjoys robust completion accuracy, as it is not subject to errors in the pose estimation. (Center) Heading can be estimated more accurately with the shared-encoder architecture, and further improved through joint training. (Right) The shared-encoder architecture is as good or better at estimating translation than the baseline, despite using fewer network parameters and learning pose from pre-trained encodings.

Final Thoughts

  • Information about pose and shape can effectively be shared and jointly learned, producing better performance in each task individually.
  • The proposed method has difficulty when the LiDAR point cloud has very few points.

Footnote

  • All diagrams and results shown are borrowed from the paper.
