BLIP is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method to learn from noisy image-text pairs scraped from the web.
The BLIP Framework
BLIP consists of three key components:
- MED (Multimodal mixture of Encoder-Decoder) - a unified model that can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder.
- Captioner - the image-grounded text decoder, fine-tuned on COCO to generate synthetic captions for web images.
- Filter - the image-grounded text encoder, fine-tuned on COCO to remove noisy image-text pairs.
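To make these roles concrete, the sketch below runs publicly released BLIP checkpoints from the Hugging Face Hub: a captioning model (the image-grounded decoder) and an image-text matching model (the image-grounded encoder), which mirror what the captioner and filter do. The checkpoint names, example image, and ITM output field come from the transformers library rather than from this post, so treat it as an illustrative usage sketch, not the paper's own code.

```python
# Illustrative use of public BLIP checkpoints via Hugging Face transformers:
# a captioner (image-grounded decoder) and an image-text matcher (image-grounded encoder).
import requests
import torch
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForImageTextRetrieval,
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Captioner: generate a caption for the image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(image, return_tensors="pt")
caption = cap_processor.decode(captioner.generate(**cap_inputs)[0], skip_special_tokens=True)
print("caption:", caption)

# Filter-style matching: score how well a text matches the image (ITM head).
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
matcher = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
itm_inputs = itm_processor(image, caption, return_tensors="pt")
with torch.no_grad():
    itm_logits = matcher(**itm_inputs).itm_score  # shape (1, 2): [no-match, match]
match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
print("match probability:", match_prob)
```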
The pretraining process follows these steps:
- Collect noisy image-text pairs from the web.
- Pretrain MED on this data.
- Fine-tune the captioner and the filter on the COCO dataset.
- Use the captioner to generate synthetic captions for the web images.
- Use the filter to discard noisy pairs, both original web captions and synthetic ones.
- Repeat the process by pretraining on the cleaned dataset.
This bootstrapping procedure (called CapFilt in the paper) allows BLIP to learn from web-scale noisy data while relying on only a comparatively small human-annotated dataset.
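A rough schematic of that loop is shown below. Every helper here (pretrain_med, finetune_captioner, finetune_filter, the score threshold) is a placeholder name of my own, not a real API; the point is only how the pieces fit together.

```python
# Schematic of the CapFilt bootstrapping loop. All helper functions and the
# 0.5 threshold are placeholders for illustration, not a real implementation.

def bootstrap_dataset(web_pairs, coco_pairs, rounds=1):
    dataset = web_pairs + coco_pairs
    for _ in range(rounds):
        med = pretrain_med(dataset)                      # pretrain MED on noisy pairs
        captioner = finetune_captioner(med, coco_pairs)  # LM fine-tuning on COCO
        fltr = finetune_filter(med, coco_pairs)          # ITC/ITM fine-tuning on COCO

        # Generate a synthetic caption for every web image.
        synthetic_pairs = [(img, captioner(img)) for img, _ in web_pairs]

        # Keep only pairs (original or synthetic) the filter judges as matching.
        kept = [(img, txt)
                for img, txt in web_pairs + synthetic_pairs
                if fltr(img, txt) > 0.5]

        # Cleaned web data plus the human-annotated pairs feed the next round.
        dataset = kept + coco_pairs
    return dataset
```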
Innovations in BLIP
Some interesting aspects of BLIP:
- A single unified model (MED) that can encode images and text, match image-text pairs, and generate image-grounded text.
- Three joint pretraining objectives: an image-text contrastive (ITC) loss, an image-text matching (ITM) loss, and an image-grounded language modeling (LM) loss (sketched below).
- A bootstrapping loop (CapFilt) that denoises the web image-text data.
- Stronger performance than CLIP and ALIGN on a range of vision-language tasks.
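As a sketch of how the three objectives combine, the snippet below assumes the model has already produced normalized unimodal embeddings, ITM logits, and decoder logits. The tensor names and the plain unweighted sum are my own simplification of the general recipe, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def blip_pretraining_loss(image_embeds, text_embeds, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    """Schematic combination of BLIP's three pretraining objectives.

    image_embeds, text_embeds: (batch, dim) L2-normalized unimodal embeddings.
    itm_logits: (batch, 2) match/no-match logits from the fused encoder.
    lm_logits:  (batch, seq, vocab) decoder logits; lm_labels uses -100 for padding.
    """
    # Image-text contrastive (ITC): pull matched pairs together in embedding space.
    sim = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    itc_loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # Image-text matching (ITM): binary classification on the fused representation.
    itm_loss = F.cross_entropy(itm_logits, itm_labels)

    # Language modeling (LM): autoregressive caption loss from the decoder.
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              lm_labels.reshape(-1), ignore_index=-100)

    # The three objectives are optimized jointly; an unweighted sum is shown here.
    return itc_loss + itm_loss + lm_loss
```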
Results
BLIP achieves strong performance on vision-language benchmarks such as image-text retrieval and visual question answering. Some examples:
- 83.5% accuracy on NLVR2 compared to 69.9% for CLIP.
- 91.5% accuracy on Flickr30K image-text retrieval compared to 89.0% for ALIGN.
- State-of-the-art on 8/14 vision-language tasks.
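As a concrete way to try the VQA capability, the snippet below queries a publicly released BLIP VQA checkpoint through Hugging Face transformers; the checkpoint name, example image URL, and question are illustrative choices of mine rather than anything from the benchmarks above.

```python
# Example: visual question answering with a public BLIP VQA checkpoint.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(image, "how many cats are in the picture?", return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```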
Limitations and Future Work
- Bootstrapping can compound errors if the captioner or filter makes mistakes, so both need careful fine-tuning.
- Requires a human-annotated dataset (COCO) for fine-tuning the captioner and filter, which can be expensive to collect.
- Additional iterative bootstrapping rounds could progressively improve the model.
- Other modalities such as video and audio could be explored.
BLIP provides a scalable approach to learning joint vision-language representations from web data. The bootstrapping framework can pave the way for large multimodal models.