BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method, called CapFilt, to learn from noisy image-text pairs scraped from the web.

The BLIP Framework

BLIP consists of three key components:

  • MED - A Multimodal mixture of Encoder-Decoder, a unified model that can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder.
  • Captioner - An image-grounded text decoder, fine-tuned on COCO, that generates synthetic captions for web images.
  • Filter - An image-grounded text encoder, fine-tuned on COCO, that removes noisy image-text pairs.
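The three operating modes of MED can be illustrated with a toy sketch. The class and method names below are purely illustrative stand-ins, not the actual BLIP API; each method just returns a tag for the mode it represents.

```python
# Toy sketch of MED's three operating modes (illustrative names only).
class ToyMED:
    def encode_text(self, text):
        # Unimodal encoder: produces text features, paired with an
        # image encoder for the contrastive (ITC) objective.
        return {"mode": "unimodal", "tokens": text.split()}

    def match(self, image, text):
        # Image-grounded text encoder: binary matched/unmatched
        # prediction used by the image-text matching (ITM) objective.
        return {"mode": "grounded-encoder", "match": image in text}

    def caption(self, image):
        # Image-grounded text decoder: generates text, trained with
        # a language modeling (LM) loss.
        return {"mode": "grounded-decoder", "caption": f"a photo of {image}"}

med = ToyMED()
print(med.encode_text("a dog"), med.match("dog", "a dog"), med.caption("dog"))
```

In the real model these three modes share most of their parameters, which is what makes MED a single unified architecture rather than three separate networks.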


The pretraining process follows these steps:
  1. Collect noisy image-text pairs from the web.
  2. Pretrain MED on this noisy data.
  3. Fine-tune the captioner and filter on the COCO dataset.
  4. Use the captioner to generate synthetic captions for the web images.
  5. Use the filter to remove noisy pairs from both the original web texts and the synthetic captions.
  6. Pretrain a new model on the resulting cleaned dataset.
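Steps 4-6 above (the CapFilt loop) can be sketched in plain Python. The helper names (`captioner`, `filter_keeps`, `capfilt`) are illustrative stand-ins, not the real BLIP code; the point is the data flow, not the models themselves.

```python
# Sketch of one CapFilt bootstrapping round (hypothetical helper names).
# Each pair is (image_id, text); web pairs are noisy.

def captioner(image_id):
    # Stand-in for the COCO-finetuned image-grounded text decoder.
    return f"synthetic caption for {image_id}"

def filter_keeps(image_id, text):
    # Stand-in for the COCO-finetuned image-text matching filter.
    # Here we simply reject texts flagged as noisy.
    return "noisy" not in text

def capfilt(web_pairs):
    """Caption each web image, then filter both the original web
    text and the synthetic caption, keeping only clean pairs."""
    cleaned = []
    for image_id, web_text in web_pairs:
        synthetic = captioner(image_id)
        for text in (web_text, synthetic):
            if filter_keeps(image_id, text):
                cleaned.append((image_id, text))
    return cleaned

web = [("img1", "a dog playing fetch"), ("img2", "noisy alt-text spam")]
print(capfilt(web))
```

Note how a web image whose original text is rejected can still survive in the cleaned dataset via its synthetic caption, which is how CapFilt grows a cleaner corpus rather than only shrinking the noisy one.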

This bootstrapping allows BLIP to learn from web-scale noisy data with only a modest amount of human-annotated data (COCO) for fine-tuning the captioner and filter.


Innovations in BLIP

Some interesting aspects of BLIP:

  • Combines encoder and decoder capability in one unified model (MED), which can encode, decode, and match images and text.
  • Carefully chosen pretraining objectives: image-text contrastive (ITC) loss, image-text matching (ITM) loss, and language modeling (LM) loss.
  • Bootstrapping loop (CapFilt) to denoise image-text data.
  • Stronger performance than CLIP and ALIGN on a range of vision-language tasks.
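As a concrete illustration of the contrastive objective, here is a minimal ITC-style loss in plain Python. The similarity scores are toy values, not real BLIP embeddings; the loss is a cross-entropy over each row of an image-to-text similarity matrix, with the matched pair on the diagonal.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def itc_loss(sim_matrix):
    """Image-text contrastive loss: average cross-entropy over rows,
    where row i's correct (matched) text is at column i."""
    n = len(sim_matrix)
    total = 0.0
    for i, row in enumerate(sim_matrix):
        probs = softmax(row)
        total += -math.log(probs[i])
    return total / n

# Toy 2x2 similarity matrix: diagonal (matched) scores are highest,
# so the loss is small.
sims = [[5.0, 1.0],
        [0.5, 4.0]]
print(itc_loss(sims))
```

When the diagonal scores are low (matched pairs scored below mismatched ones), the same function returns a much larger loss, which is the gradient signal that pulls matched image and text embeddings together.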

Results

BLIP achieves strong performance on vision-language benchmarks such as image-text retrieval and visual question answering. Some examples:

  • 83.5% accuracy on NLVR2 compared to 69.9% for CLIP.
  • 91.5% accuracy on Flickr30K image-text retrieval compared to 89.0% for ALIGN.
  • State-of-the-art on 8/14 vision-language tasks.

Limitations and Future Work

  • Bootstrapping can compound errors if the captioner or filter makes mistakes, so both need careful fine-tuning.
  • Requires human-annotated datasets (e.g., COCO) to fine-tune the captioner and filter, which can be expensive to collect.
  • Future work could add iterative bootstrapping rounds to progressively improve the model.
  • Future work could also extend the approach to other modalities such as video and audio.

BLIP provides a scalable approach to learning joint vision-language representations from web data. The bootstrapping framework can pave the way for large multimodal models.

The paper titled " Ocean: Object Aware Anchor Free Tracking " presents a novel approach to visual object tracking that is poised to outperform existing anchor-based approaches. The authors propose a unique anchor-free framework named Ocean, designed to address certain challenges in the current field of visual tracking. Introduction Visual object tracking is a crucial part of computer vision technology. The widely utilized anchor-based trackers have their limitations, which this paper attempts to address. The authors present the innovative Ocean framework, designed to transform the visual tracking field by improving adaptability and performance. The Problem with Anchor-Based Trackers Despite their wide usage, anchor-based trackers suffer from some notable drawbacks. They struggle with tracking objects experiencing drastic scale changes or those having high aspect ratios. The anchors, with their fixed scale and fixed ratios, can limit the flexibility of the trackers, making the...