
Posts

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method to learn from noisy image-text pairs scraped from the web.

The BLIP Framework

BLIP consists of three key components:

- MED: a multimodal mixture of encoder-decoder that can encode images, encode text, and generate image-grounded text.
- Captioner: fine-tuned on COCO to generate captions for web images.
- Filter: fine-tuned on COCO to remove noisy image-text pairs.

The pretraining process follows these steps:

1. Collect noisy image-text pairs from the web.
2. Pretrain MED on this data.
3. Fine-tune the captioner and the filter on the COCO dataset.
4. Use the captioner to generate new captions for the web images.
5. Use the filter to discard noisy pairs.
6. Repeat the process by pretraining on the cleaned dataset.

This bootstrapping allows BLIP to learn from web-scale noisy data in a self-supervised manner.

Innovations in BLIP

Some interesting aspects of BLIP: it combines encoder and decoder capabilities in one unified model…
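The bootstrapping loop above boils down to re-captioning and filtering the web data before the next round of pretraining. Below is a minimal sketch of that data flow; `captioner` and `filter_model` are hypothetical stand-ins for the COCO-finetuned MED, not BLIP's actual implementation.

```python
# Minimal sketch of the bootstrapping (CapFilt-style) step described above.
# `captioner` and `filter_model` are hypothetical stand-ins; in BLIP they are
# the COCO-finetuned MED acting as a caption decoder and as an image-text matcher.

def captioner(image):
    # Stand-in: would generate a synthetic caption for a web image.
    return f"a synthetic caption describing {image}"

def filter_model(image, caption):
    # Stand-in: would score image-text alignment; here a toy length check.
    return len(caption.split()) >= 4

def bootstrap_dataset(web_pairs):
    """Build a cleaned dataset from noisy (image, web_caption) pairs."""
    cleaned = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)
        # Keep the original web text only if the filter accepts it.
        if filter_model(image, web_caption):
            cleaned.append((image, web_caption))
        # Keep the generated caption only if the filter accepts it.
        if filter_model(image, synthetic_caption):
            cleaned.append((image, synthetic_caption))
    return cleaned

noisy_pairs = [("img_001.jpg", "a dog playing in the park"), ("img_002.jpg", "SALE!!!")]
print(bootstrap_dataset(noisy_pairs))  # the next pretraining round uses this cleaned set
```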

CLIP: Learning Transferable Visual Models From Natural Language Supervision

CLIP (Contrastive Language-Image Pre-training) is a new approach to learning visual representations proposed by researchers at OpenAI in 2021. Unlike traditional computer vision models, which are trained on large labeled image datasets, CLIP learns directly from natural language supervision.

The Core Idea

The key insight behind CLIP is that images and their captions can be connected without ever generating the captions. By training the model to predict which caption goes with which image, it learns a rich visual representation of the world.

CLIP consists of two encoders: an image encoder and a text encoder. The image encoder takes in an image and outputs a visual representation vector; the text encoder takes in a caption and outputs a text representation vector. During training, these representations are pulled closer together for matching image-text pairs and pushed farther apart for non-matching pairs. This is known as a contrastive loss objective.

Benefits of CLIP

There are several…
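To make the contrastive objective concrete, here is a small PyTorch sketch of a symmetric image-text contrastive loss in the spirit of the pseudocode from the CLIP paper. The random tensors stand in for encoder outputs, and the temperature value is illustrative.

```python
# Minimal sketch of CLIP's symmetric contrastive objective in PyTorch.
# Only the loss logic follows the idea described above; the "encoders"
# are replaced by random feature tensors.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize both embedding sets so similarity becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0))

    # Symmetric cross-entropy: pick the right caption for each image
    # and the right image for each caption.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" for a batch of 8 image-text pairs.
images = torch.randn(8, 512)
texts = torch.randn(8, 512)
print(clip_contrastive_loss(images, texts))
```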

ES3Net: Accurate and Efficient Edge-based Self-Supervised Stereo Matching Network

Efficient and accurate depth estimation plays an indispensable role in many real-world applications, such as autonomous vehicles, 3D reconstruction, and drone navigation. Despite the precision of stereo matching, its computational intensity can pose significant challenges for edge deployment. Moreover, the difficulty of acquiring ground-truth depth for training stereo-matching networks further amplifies these challenges. Enter ES3Net, the Edge-based Self-Supervised Stereo matching Network, a solution designed to mitigate these obstacles.

The Challenges of Depth Estimation

When it comes to applications like autonomous driving or drone navigation, the importance of accurate depth estimation is hard to overstate. It provides a foundational understanding of the 3D world, allowing for intelligent decision-making and navigation. Traditionally, stereo matching has provided greater accuracy than monocular depth estimation due to the availability of a reference image. However, it also brings…
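As background on why a stereo pair yields metric depth (this is generic stereo geometry, not ES3Net's specific method): once a network predicts a disparity map, depth follows from the focal length and the camera baseline. The numbers below are made up for illustration.

```python
# Generic stereo geometry, not specific to ES3Net: depth = focal_length * baseline / disparity.
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map (in meters)."""
    # Clamp disparity to avoid division by zero for invalid pixels.
    return focal_length_px * baseline_m / np.maximum(disparity, eps)

disparity = np.array([[30.0, 15.0],
                      [60.0, 120.0]])  # toy 2x2 disparity map in pixels
print(disparity_to_depth(disparity, focal_length_px=720.0, baseline_m=0.12))
```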

4D Panoptic LiDAR Segmentation (4D-PLS)

Introduction

In the realm of computer vision, LiDAR segmentation remains a challenging area. Often, we have to rely on downscaling the scans, followed by individual detections and temporal associations. The recently published paper "4D Panoptic LiDAR Segmentation (4D-PLS)" seeks to address these challenges with an innovative approach and techniques, offering a fresh perspective on LiDAR segmentation.

LiDAR Segmentation: Challenges and Opportunities

LiDAR segmentation, specifically sequence segmentation, is a task with substantial hurdles. Due to memory constraints, scans must be downscaled, even when processing a single scan. As a result, detection is performed on individual scans and then followed by temporal association. It's a piecemeal approach that lacks efficiency and accuracy.

A New Take: The 4D-PLS Framework

This is where the 4D-PLS approach comes into play. Drawing inspiration from space-time, the authors developed a system that overlaps 4D volumes, assigning semantic…

Ocean: Object-aware Anchor-free Tracking

The paper titled "Ocean: Object-aware Anchor-free Tracking" presents a novel approach to visual object tracking that is poised to outperform existing anchor-based approaches. The authors propose a unique anchor-free framework named Ocean, designed to address certain challenges in the current field of visual tracking.

Introduction

Visual object tracking is a crucial part of computer vision technology. The widely utilized anchor-based trackers have their limitations, which this paper attempts to address. The authors present the innovative Ocean framework, designed to transform the visual tracking field by improving adaptability and performance.

The Problem with Anchor-Based Trackers

Despite their wide usage, anchor-based trackers suffer from some notable drawbacks. They struggle with tracking objects experiencing drastic scale changes or those having high aspect ratios. The anchors, with their fixed scales and ratios, can limit the flexibility of the trackers…
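To make "anchor-free" concrete, here is a generic sketch of decoding a box from per-location distance predictions (FCOS-style), which is close in spirit to predicting distances to the four sides of the target rather than matching fixed anchor shapes. It is not Ocean's exact prediction head, and all numbers are illustrative.

```python
# Generic anchor-free box decoding: the network predicts, at location (x, y),
# the distances to the left, top, right, and bottom sides of the target box.
def decode_anchor_free(x, y, left, top, right, bottom):
    """Turn per-location distance predictions into box corner coordinates."""
    return (x - left, y - top, x + right, y + bottom)

# Toy example: a location near the object center with predicted distances.
print(decode_anchor_free(x=120, y=80, left=30, top=20, right=34, bottom=26))
```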

Deeper and Wider Siamese Networks for Real-Time Visual Tracking

In this paper, the authors investigate how to increase the robustness and accuracy of existing Siamese trackers used for visual object tracking.

Visual object tracking

Visual object tracking is one of the fundamental problems in computer vision. It aims to estimate the position of an arbitrary target in a video sequence, given only its location in the initial frame. It has numerous applications in surveillance, robotics, and human-computer interaction.

Siamese Networks and their usage in Trackers

Siamese networks are a class of neural networks that learn to generate comparable feature vectors from their twin inputs. By learning to compute these comparable feature vectors, the network learns distinguishing characteristics for each image class. With these output vectors, it is possible to compare the two inputs and say whether they belong to the same image class or not. For example, this is used in one-shot learning for facial recognition, where the Siamese network learns to distinguish…
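A minimal PyTorch sketch of the twin-input idea, with a toy shared backbone that is not the architecture studied in the paper: both inputs pass through the same weights, so their embeddings are directly comparable.

```python
# Minimal Siamese comparison sketch: one shared backbone embeds both inputs,
# and a similarity score on the embeddings says how alike the inputs are.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, embedding_dim=64):
        super().__init__()
        # Toy convolutional backbone shared by both branches.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embedding_dim),
        )

    def forward(self, a, b):
        # The same weights embed both inputs, so the vectors are comparable.
        return self.backbone(a), self.backbone(b)

model = SiameseEncoder()
img_a, img_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
emb_a, emb_b = model(img_a, img_b)
similarity = F.cosine_similarity(emb_a, emb_b)  # high -> likely the same class
print(similarity.item())
```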

Joint Pose and Shape Estimation of Vehicles from LiDAR Data

In this paper, the authors address the problem of estimating the pose and shape of vehicles from LiDAR data. This is a common problem to be solved in autonomous vehicle applications. Autonomous vehicles are equipped with many sensors to perceive the world around them; LiDAR, being one of them, is what the authors focus on in this paper. A key requirement of the perception system is to identify other vehicles on the road and make decisions based on their pose and shape. The authors put forth a pipeline that jointly determines pose and shape from LiDAR data.

More about Pose and Shape Estimation

LiDAR sensors capture the world around them as point clouds. Often, the first step in LiDAR processing is to perform some sort of clustering or segmentation to isolate the parts of the point cloud that belong to individual objects. The next step is to infer the pose and shape of each object. This is mostly done by amodal perception, meaning the whole object is perceived based on partial sensory information.
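As a generic illustration of the clustering step mentioned above (not the paper's specific pipeline), Euclidean clustering with DBSCAN from scikit-learn can group LiDAR points into per-object clusters; the toy point cloud below is synthetic.

```python
# Generic per-object clustering of a point cloud with DBSCAN; the point cloud
# is synthetic and the parameters are illustrative, not tuned for real LiDAR.
import numpy as np
from sklearn.cluster import DBSCAN

# Toy point cloud: two compact clusters of points plus some scattered noise.
rng = np.random.default_rng(0)
object_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.2, size=(50, 3))
object_b = rng.normal(loc=[5.0, 1.0, 0.0], scale=0.2, size=(50, 3))
noise = rng.uniform(low=-10, high=10, size=(10, 3))
points = np.vstack([object_a, object_b, noise])

# Group points by Euclidean proximity; label -1 marks unclustered noise.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(points)
print("clusters found:", set(labels) - {-1})
```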