
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method (CapFilt) to learn from noisy image-text pairs scraped from the web.

The BLIP Framework

BLIP consists of three key components:

  • MED - A Multimodal mixture of Encoder-Decoder: one unified model that can act as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder.
  • Captioner - An image-grounded text decoder, fine-tuned on COCO, that generates synthetic captions for web images.
  • Filter - An image-grounded text encoder, fine-tuned on COCO, that removes noisy image-text pairs.
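The three roles of MED can be pictured as a single model with three forward modes. The class below is only a schematic stub (string outputs instead of real networks, and hypothetical method names) to show the interface the paper describes:

```python
class MED:
    """Schematic Multimodal mixture of Encoder-Decoder (no real weights)."""

    def encode_text(self, text):
        # Unimodal text encoder: used by the contrastive (ITC) objective.
        return f"text_emb({text})"

    def encode_pair(self, image, text):
        # Image-grounded text encoder: used by the matching (ITM) objective.
        return f"match_score({image}, {text})"

    def decode_caption(self, image):
        # Image-grounded text decoder: used by the language modeling objective.
        return f"caption_for({image})"

med = MED()
print(med.decode_caption("img_001"))
```

The captioner and filter reuse the decoder and encoder roles of this one model, which is what makes the bootstrapping loop cheap to set up.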


The pretraining process follows these steps:
  1. Collect noisy image-text pairs from the web.
  2. Pretrain MED on this data.
  3. Initialize the captioner and filter from the pretrained MED and fine-tune them on the COCO dataset.
  4. Use the captioner to generate synthetic captions for web images.
  5. Use the filter to remove noisy image-text pairs.
  6. Repeat the process by pretraining on the cleaned dataset.
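Steps 4-6 above (CapFilt) can be sketched in plain Python. The `captioner`, `filter_model`, and threshold below are hypothetical stand-ins, not the paper's actual interfaces:

```python
def capfilt(web_pairs, captioner, filter_model, threshold=0.5):
    """One CapFilt round: re-caption web images, then keep only the
    texts the filter scores as matching (hypothetical interfaces)."""
    cleaned = []
    for image, web_text in web_pairs:
        # Step 4: the captioner proposes a synthetic caption for the image.
        synthetic = captioner(image)
        # Step 5: the filter scores image-text agreement; keep candidates
        # (original web text or synthetic caption) that pass the threshold.
        for text in (web_text, synthetic):
            if filter_model(image, text) >= threshold:
                cleaned.append((image, text))
    return cleaned  # Step 6: pretrain MED again on this cleaned set

# Toy usage with stub models: the "filter" accepts texts containing the label.
pairs = [("img_cat", "a cat"), ("img_dog", "random spam")]
cap = lambda image: f"a photo of {image.split('_')[1]}"
filt = lambda image, text: 1.0 if image.split('_')[1] in text else 0.0
print(capfilt(pairs, cap, filt))
```

Note how the noisy pair ("img_dog", "random spam") is dropped while a synthetic caption for the same image survives, which is exactly the denoising effect the loop is after.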

This bootstrapping allows BLIP to learn from web-scale noisy data with minimal human annotation.


Innovations in BLIP

Some interesting aspects of BLIP:

  • Combines encoder and decoder capability in one unified model (MED), which can encode, decode, and match images and text.
  • Three pretraining objectives: an image-text contrastive (ITC) loss, an image-text matching (ITM) loss, and a language modeling (LM) loss.
  • A bootstrapping (CapFilt) loop to denoise web image-text data.
  • Stronger performance than CLIP and ALIGN on a range of vision-language tasks.
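As an illustration of the contrastive objective, here is a minimal image-text contrastive (ITC) loss over a batch of paired embeddings, written in plain Python with toy vectors. The real model uses learned encoders, a symmetric image-to-text and text-to-image loss, and a momentum queue; this sketch shows only the image-to-text direction:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Image-text contrastive loss: each image should score highest
    against its own caption among all texts in the batch."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        # Similarities of image i to every text in the batch.
        logits = [cosine(image_embs[i], t) / temperature for t in text_embs]
        # Cross-entropy against the matching index i.
        log_softmax = logits[i] - math.log(sum(math.exp(l) for l in logits))
        total += -log_softmax
    return total / n

# Toy batch: matched pairs are nearly aligned, mismatched pairs are not,
# so the loss is close to zero.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(itc_loss(imgs, txts))
```

The ITM loss adds a binary matched/unmatched classifier on fused image-text features, and the LM loss trains the decoder to generate the caption token by token.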

Results

BLIP achieves strong performance on vision-language benchmarks such as image-text retrieval, visual question answering, and natural language visual reasoning. Some examples:

  • 83.5% accuracy on NLVR2 compared to 69.9% for CLIP.
  • 91.5% accuracy on Flickr30K image-text retrieval compared to 89.0% for ALIGN.
  • State-of-the-art on 8/14 vision-language tasks.

Limitations and Future Work

  • Bootstrapping can compound errors when the captioner or filter makes mistakes, so both need careful fine-tuning.
  • Requires large supervised datasets (COCO) which can be expensive.
  • Can add iterative bootstrapping rounds to progressively improve the model.
  • Explore other modalities like video, audio, etc.

BLIP provides a scalable approach to learning joint vision-language representations from web data. The bootstrapping framework can pave the way for large multimodal models.


In this paper , the authors present a system for generating videos of a talking heard, using a still image of a person and an audio clip containing speech. As per the authors this is the first paper that achieves this without any handcrafted features or post-processing of the output. This is achieved using a temporal GAN with 2 discriminators. Novelties in this paper Talking head video generated from still image and speech audio without any subject dependency. Also no handcrafted audio or visual features are used for training and no post-processing of the output (generate facial features from learned metrics). The model captures the dynamics of the entire face producing natural facial expressions such as eyebrow raises, frowns and blinks. This is due to the Recurrent Neural Network ( RNN ) based generator and sequence discriminator. Ablation study to quantify the effect of each component in the system. Image quality is measured using Model Architecture The model consists...