
Joint Pose and Shape Estimation of Vehicles from LiDAR Data

In this paper, the authors address the problem of estimating the pose and shape of vehicles from LiDAR data, a common problem in autonomous vehicle applications. Autonomous vehicles carry many sensors to perceive the world around them; LiDAR is one of them, and it is the focus of this paper. A key requirement of the perception system is to identify other vehicles on the road and make decisions based on their pose and shape. The authors put forth a pipeline that jointly determines pose and shape from LiDAR data.

More about Pose and Shape Estimation

LiDAR sensors capture the world around them as point clouds. Often, the first step in LiDAR processing is some form of clustering or segmentation, which isolates the parts of the point cloud that belong to individual objects. The next step is to infer the pose and shape of each object. This is mostly done by amodal perception, meaning the whole object is perceived from partial sensory input. So for our estimation, we get a partial point cloud of an object, from which we need to estimate the complete shape and pose.

Investigations on Pose and Shape Networks

State-of-the-art baseline network

Baseline point completion network

As shown in the figure above, the baseline for predicting pose and shape first estimates the pose of an unaligned partial point cloud. The predicted pose is then applied to the partial point cloud, bringing the observation into a canonical coordinate system. The aligned cloud is then given as input to an encoder-decoder completion network that has been trained to complete canonically aligned partial point clouds.
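The alignment step of the baseline can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the predicted pose is a heading θ plus a translation (x, y) that places the canonical vehicle in the sensor frame, so alignment is the inverse of that transform. The function name `to_canonical` is hypothetical, and plain Python tuples are used only to keep the sketch dependency-free.

```python
import math


def to_canonical(points, theta, tx, ty):
    """Map sensor-frame points into the vehicle's canonical frame.

    Assumption: the pose places the canonical vehicle in the sensor frame
    as p = R(theta) q + t, so the inverse is q = R(-theta) (p - t).
    Points are (x, y, z) tuples; z is left unchanged.
    """
    c, s = math.cos(theta), math.sin(theta)
    aligned = []
    for x, y, z in points:
        dx, dy = x - tx, y - ty
        # Multiply by R(-theta) = [[c, s], [-s, c]].
        aligned.append((c * dx + s * dy, -s * dx + c * dy, z))
    return aligned


# A point one metre ahead of a vehicle at (2, 3) heading along +y.
aligned = to_canonical([(2.0, 4.0, 0.5)], math.pi / 2, 2.0, 3.0)
```

Any error in θ or (x, y) moves every aligned point, which is why a wrong pose estimate degrades the completion network that consumes this output.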

Disadvantages

  • Any error in the output of the pose estimation network will affect the completion network's performance.
  • The partial point cloud is encoded twice: once in the pose estimation network and again in the shape completion network. This is computationally redundant.

Shared-encoder Network

Both disadvantages of the baseline stem from the sequential arrangement of the pose and shape estimation networks. With this in mind, a shared-encoder network is proposed.
Intuitively, a single shared encoder learns feature mappings from the partial point cloud, and these features are then used by both the pose and the shape estimation networks. The paper's hypothesis is that one encoding can serve both tasks.
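The data flow of a shared-encoder design can be sketched as follows. This is only a structural illustration: the class name `SharedEncoderModel` and the three `toy_*` functions are hypothetical stand-ins, and the real encoder and decoders in the paper are learned networks whose architecture is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Tuple[float, float, float]  # (heading theta, x, y)


@dataclass
class SharedEncoderModel:
    encoder: Callable[[Sequence[Point]], List[float]]
    shape_decoder: Callable[[List[float]], List[Point]]
    pose_decoder: Callable[[List[float]], Pose]

    def forward(self, partial: Sequence[Point]) -> Tuple[List[Point], Pose]:
        feature = self.encoder(partial)  # the partial cloud is encoded once
        # Both heads read the same feature vector.
        return self.shape_decoder(feature), self.pose_decoder(feature)


# Toy stand-ins so the sketch runs; real modules would be learned networks.
def toy_encoder(points):
    # Per-axis max: a crude permutation-invariant pooling.
    return [max(p[i] for p in points) for i in range(3)]


def toy_shape_decoder(feature):
    return [tuple(feature)]


def toy_pose_decoder(feature):
    return (0.0, feature[0], feature[1])


model = SharedEncoderModel(toy_encoder, toy_shape_decoder, toy_pose_decoder)
completion, pose = model.forward([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)])
```

Unlike the baseline, neither head waits on the other's output, so a pose error cannot propagate into the completion.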

Shape training

The shared-encoder network is first trained by freezing the pose decoder weights and training only the encoder and the shape decoder. The inputs are unaligned partial point clouds, that is, clouds that have not been brought into the canonical frame. The loss function is the Chamfer distance between the ground-truth completed point cloud X and the estimated completion X~.
Thus, the encoder learns to extract a fixed-length feature vector from which the complete shape of the vehicle can be recovered. This shape is in the same unknown pose as the partial point cloud.
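For readers unfamiliar with Chamfer distance, here is a minimal sketch. It assumes the common symmetric form that averages each point's distance to its nearest neighbour in the other cloud and sums the two directions; the paper's exact normalization (for example squared versus unsquared distances) may differ. The brute-force search is for clarity only and would be far too slow for clouds of thousands of points.

```python
import math


def chamfer_distance(X, X_hat):
    """Symmetric Chamfer distance between two point clouds.

    Assumption: mean unsquared nearest-neighbour distance, summed over
    both directions. X and X_hat are non-empty sequences of (x, y, z).
    """
    def mean_nearest(A, B):
        # Average over A of the distance to the closest point in B.
        return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)

    return mean_nearest(X, X_hat) + mean_nearest(X_hat, X)


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.0)]
loss = chamfer_distance(ground_truth, estimate)
```

The measure needs no point-to-point correspondence between X and X~, which suits a completion network whose output points are unordered.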

Pose training

The hypothesis is that the features the encoder learns during shape estimation must also capture information about the pose. To test this, the encoder weights are frozen and the pose decoder alone is trained, using a pose loss (LP), to estimate the poses of the partial point clouds from the frozen encoder's features. The resulting pose estimates are shown to match or outperform those of the baseline network. Together, these two training stages make up the shared-encoder model.
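The two-stage schedule can be written down as a small table of which modules receive gradient updates. This is a bookkeeping sketch only; the module and loss names are hypothetical labels. Listing the shape decoder as frozen during pose training is an assumption: the text only says the encoder is frozen, but the pose loss does not involve the shape decoder, so it would not be updated either way.

```python
MODULES = ("encoder", "shape_decoder", "pose_decoder")

# Stage order follows the text: shape training first, then pose training.
STAGES = [
    {"name": "shape", "trainable": {"encoder", "shape_decoder"}, "loss": "chamfer_distance"},
    {"name": "pose", "trainable": {"pose_decoder"}, "loss": "pose_loss"},
]


def frozen_modules(stage):
    """Return the modules whose weights stay fixed during a stage."""
    return [m for m in MODULES if m not in stage["trainable"]]


for stage in STAGES:
    print(stage["name"], "trains", sorted(stage["trainable"]),
          "and freezes", frozen_modules(stage))
```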

Joint Training

Finally, the authors propose a joint training method to further improve the shared-encoder model. Once the shape and pose training steps are complete, all parts of the network are unfrozen and trained together using a joint loss function (LJ).
The joint loss is a weighted combination of the completion loss and the pose loss. σCD and σP are learned parameters representing the uncertainty of the completion and pose predictions, respectively. The weight assigned to each loss term decreases as its uncertainty grows, and a log term penalizes large uncertainties so that the network cannot drive both weights toward zero.
The jointly-trained network outperforms both the baseline and standard shared-encoder networks in the accuracy of both completion and pose estimation.
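The weighting behaviour described above can be illustrated with one common form of uncertainty-weighted multi-task loss. This post does not reproduce the paper's exact expression, so the formula below (each loss divided by twice its squared uncertainty, plus the log of the product of the uncertainties) is an assumption chosen to match the description, and `joint_loss` is a hypothetical name.

```python
import math


def joint_loss(loss_cd, loss_p, sigma_cd, sigma_p):
    """Uncertainty-weighted sum of completion and pose losses.

    Assumed form: L_CD / (2 sigma_CD^2) + L_P / (2 sigma_P^2)
                  + log(sigma_CD * sigma_P).
    Both sigmas must be positive.
    """
    weighted = loss_cd / (2 * sigma_cd ** 2) + loss_p / (2 * sigma_p ** 2)
    regularizer = math.log(sigma_cd * sigma_p)  # grows with the uncertainties
    return weighted + regularizer


balanced = joint_loss(2.0, 4.0, sigma_cd=1.0, sigma_p=1.0)
# Doubling sigma_cd quarters the weight on the completion loss
# but adds log(2) through the regularizer.
less_sure = joint_loss(2.0, 4.0, sigma_cd=2.0, sigma_p=1.0)
```

In training, the sigmas would be optimized alongside the network weights; implementations often learn the log of each sigma instead so that positivity is guaranteed.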

General training details

All training uses the Adam optimizer with a learning rate of 1e-4 and a batch size of 32. All completion networks output a point cloud containing 16,384 points, and all pose predictors output a heading θ and a translation (x, y).

Dataset

Training a supervised model to predict complete shape from partial observations requires a large dataset of paired partial and complete observations. To meet this requirement, a large synthetic dataset was constructed using high-resolution CAD models of various vehicle types. Points were then sampled from these models using a simulated LiDAR sensor. The networks are then trained to produce a complete point cloud representing the exterior of the vehicle, given only a partial point cloud of the object.

Results

Completion and pose accuracy on the synthetic dataset

The shared-encoder (SE) and joint-trained shared-encoder (JTSE) networks outperform the state-of-the-art baseline in completion quality (measured as Chamfer Distance) as well as pose estimation (heading angle and translation) on the validation set from the synthetic dataset. 
For each metric, JTSE places the largest fraction of the validation set under a given error threshold, followed by SE. (Left) The shared-encoder architecture achieves robust completion accuracy because it is not subject to errors in the pose estimation. (Center) Heading is estimated more accurately with the shared-encoder architecture, and further improved through joint training. (Right) The shared-encoder architecture is as good as or better than the baseline at estimating translation, despite using fewer network parameters and learning pose from pre-trained encodings.
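The "fraction of the validation set under an error threshold" metric is simple to compute. The sketch below assumes heading error is the absolute angular difference wrapped to [0, π] and that the threshold is inclusive; the paper may handle heading ambiguity or thresholds differently, and both function names are hypothetical.

```python
import math


def heading_error(pred, target):
    """Absolute angular difference in radians, wrapped to [0, pi]."""
    diff = (pred - target + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


def fraction_under(errors, threshold):
    """Share of samples whose error is at or below the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


# Headings of 3.1 and -3.1 rad are close once the wrap-around is handled.
wrapped = heading_error(3.1, -3.1)
share = fraction_under([0.1, 0.2, 0.5, 1.0], threshold=0.25)
```

Sweeping the threshold and plotting `fraction_under` for each model gives curves of the kind described in the figure caption.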

Final Thoughts

  • Information about pose and shape can effectively be shared and jointly-learned, to produce better performance in each task individually.
  • The proposed method has difficulty when the LiDAR point cloud has very few points.

Footnote

  • All diagrams and results shown are borrowed from the paper.
