BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method for learning from noisy image-text pairs scraped from the web.
The BLIP Framework
BLIP consists of three key components:
- MED - A multimodal mixture of encoder-decoder model that can act as a unimodal encoder, an image-grounded text encoder, and an image-grounded text decoder (a toy sketch of these roles follows this list).
- Captioner - Initialized from the pretrained MED and fine-tuned on COCO to generate synthetic captions for web images.
- Filter - Initialized from the pretrained MED and fine-tuned on COCO to remove noisy image-text pairs.
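To make the division of labor concrete, here is a toy PyTorch sketch of the MED's three roles. The class, layer sizes, and method names are illustrative stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ToyMED(nn.Module):
    """Toy sketch of MED's three operating modes (not the paper's code).

    BLIP shares most parameters across three roles:
      encoder -> unimodal features for image-text contrastive learning,
      matcher -> image-grounded text encoder for image-text matching,
      decoder -> image-grounded text decoder for caption generation.
    """

    def __init__(self, dim=64, vocab=1000):
        super().__init__()
        self.image_proj = nn.Linear(3 * 32 * 32, dim)    # stand-in for a ViT image encoder
        self.text_embed = nn.Embedding(vocab, dim)       # stand-in for a BERT-style text encoder
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.itm_head = nn.Linear(dim, 2)                # match / no-match classifier
        self.lm_head = nn.Linear(dim, vocab)             # next-token prediction head

    def encode(self, images, tokens):
        img = self.image_proj(images.flatten(1)).unsqueeze(1)  # (B, 1, dim)
        txt = self.text_embed(tokens)                           # (B, T, dim)
        return img, txt

    def match(self, images, tokens):
        # Image-grounded text encoder: cross-attend text to image, predict match.
        img, txt = self.encode(images, tokens)
        fused, _ = self.fusion(txt, img, img)
        return self.itm_head(fused.mean(dim=1))                 # (B, 2) match logits

    def caption_logits(self, images, tokens):
        # Image-grounded text decoder: next-token logits conditioned on the image.
        img, txt = self.encode(images, tokens)
        fused, _ = self.fusion(txt, img, img)
        return self.lm_head(fused)                               # (B, T, vocab)


med = ToyMED()
images = torch.randn(2, 3, 32, 32)
tokens = torch.randint(0, 1000, (2, 8))
print(med.match(images, tokens).shape)           # torch.Size([2, 2])
print(med.caption_logits(images, tokens).shape)  # torch.Size([2, 8, 1000])
```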
 
The pretraining process follows these steps:
- Collect noisy image-text pairs from the web.
- Pretrain MED on this data.
- Fine-tune the captioner and filter on the COCO dataset.
- Use the captioner to generate synthetic captions for the web images.
- Use the filter to discard noisy pairs (both original web texts and synthetic captions).
- Repeat the process by pretraining on the cleaned dataset (sketched in the code below).
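The loop above can be summarized in schematic Python. Every helper here (pretrain, finetune_captioner, finetune_filter, caption, is_match) is a hypothetical placeholder, not an actual API.

```python
def capfilt_round(web_pairs, coco_pairs, pretrain, finetune_captioner, finetune_filter):
    """One round of BLIP's dataset bootstrapping (schematic; helpers are hypothetical).

    web_pairs:  list of (image, web_text) scraped pairs, possibly noisy
    coco_pairs: small set of human-annotated (image, caption) pairs
    """
    # 1-2. Pretrain MED on the noisy web data (plus the annotated data).
    med = pretrain(web_pairs + coco_pairs)

    # 3. Initialize captioner and filter from MED and fine-tune each on COCO.
    captioner = finetune_captioner(med, coco_pairs)
    filterer = finetune_filter(med, coco_pairs)

    # 4. Generate a synthetic caption for every web image.
    synthetic_pairs = [(img, captioner.caption(img)) for img, _ in web_pairs]

    # 5. Keep only pairs the filter judges as matching
    #    (applied to both web texts and synthetic captions).
    cleaned = [(img, txt)
               for img, txt in web_pairs + synthetic_pairs
               if filterer.is_match(img, txt)]

    # 6. The cleaned data plus COCO becomes the pretraining corpus for the next round.
    return cleaned + coco_pairs
```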
 
This bootstrapping (called CapFilt in the paper) lets BLIP learn from web-scale noisy data while relying on only a modest amount of human-annotated data (COCO).
Innovations in BLIP
Some interesting aspects of BLIP:
- Combines encoder and decoder capabilities in one unified model (MED), which can encode, generate, and match images and text.
- Careful choice of pretraining objectives: image-text contrastive loss, image-text matching loss, and image-conditioned language modeling loss (a rough sketch of how these combine follows this list).
- A bootstrapping loop (CapFilt) to denoise web image-text data.
- Better performance than CLIP and ALIGN on image-text retrieval benchmarks, despite pretraining on far fewer image-text pairs.
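As a rough illustration of how the three objectives might be combined in one training step: the model interface, the fixed temperature, and the in-batch negatives below are assumptions made for the sketch, not BLIP's exact recipe (the paper, for instance, mines hard negatives for the matching loss).

```python
import torch
import torch.nn.functional as F

def blip_pretraining_losses(model, images, tokens, next_token_labels):
    """Sum BLIP-style objectives for one batch (schematic; `model` interface is assumed)."""
    # --- ITC: contrastively align unimodal image and text embeddings across the batch ---
    img_emb = F.normalize(model.image_embedding(images), dim=-1)   # (B, d), assumed method
    txt_emb = F.normalize(model.text_embedding(tokens), dim=-1)    # (B, d), assumed method
    logits = img_emb @ txt_emb.t() / 0.07                          # fixed temperature for illustration
    targets = torch.arange(images.size(0), device=logits.device)
    itc = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # --- ITM: classify matched vs. mismatched pairs (negatives by shuffling the batch) ---
    pos_logits = model.match(images, tokens)                       # (B, 2)
    neg_logits = model.match(images, tokens.roll(1, dims=0))       # mismatched pairs
    itm_logits = torch.cat([pos_logits, neg_logits])
    itm_labels = torch.cat([torch.ones(len(pos_logits)),
                            torch.zeros(len(neg_logits))]).long().to(itm_logits.device)
    itm = F.cross_entropy(itm_logits, itm_labels)

    # --- LM: predict the next caption token conditioned on the image ---
    lm_logits = model.caption_logits(images, tokens)               # (B, T, V)
    lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                         next_token_labels.reshape(-1))

    return itc + itm + lm
```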
 
Results
BLIP achieves strong performance on vision-language benchmarks such as image-text retrieval, visual question answering, and visual reasoning. Some examples:
- 83.5% accuracy on NLVR2 compared to 69.9% for CLIP.
- 91.5% accuracy on Flickr30K image-text retrieval compared to 89.0% for ALIGN.
- State-of-the-art results on 8 of 14 vision-language tasks.
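For readers who want to try BLIP directly, checkpoints are available through the Hugging Face transformers library. A minimal captioning example (assuming a recent transformers version and a local image file) might look like:

```python
# pip install transformers torch pillow
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP captioning checkpoint published by Salesforce.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# "example.jpg" is any local image you want to caption.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate and decode a caption with the image-grounded text decoder.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```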
 
Limitations and Future Work
- Bootstrapping can compound errors if the captioner or filter makes mistakes, so careful fine-tuning is needed.
- Requires a human-annotated dataset (COCO) for fine-tuning the captioner and filter, which can be expensive to collect.
- Additional bootstrapping rounds could be added to progressively improve the model.
- Other modalities such as video and audio could be explored.
 
BLIP provides a scalable approach to learning joint vision-language representations from web data. The bootstrapping framework could pave the way for larger multimodal models.