Showing posts from October, 2023

BLIP: Bootstrapping Language-Image Pretraining for Unified Vision-Language Understanding

BLIP is a vision-language model proposed by Salesforce Research in 2022. It introduces a bootstrapping method for learning from noisy image-text pairs scraped from the web.

The BLIP Framework

BLIP consists of three key components:

MED - A multimodal encoder-decoder model that can encode images, encode text, and generate image-grounded text.
Captioner - Fine-tuned on COCO to generate captions for web images.
Filter - Fine-tuned on COCO to filter noisy image-text pairs.

The pretraining process follows these steps:

1. Collect noisy image-text pairs from the web.
2. Pretrain MED on this data.
3. Fine-tune the captioner and the filter on the COCO dataset.
4. Use the captioner to generate new captions for web images.
5. Remove noisy pairs using the filter model.
6. Repeat the process by pretraining on the cleaned dataset.

This bootstrapping allows BLIP to learn from web-scale noisy data in a self-supervised manner.

Innovations in BLIP

Some interesting aspects of BLIP:

Combines encoder-decoder capability in one unified model
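The caption-and-filter steps described in this post can be sketched in a few lines of Python. This is only an illustration, not BLIP's actual code: `bootstrap_dataset`, `toy_captioner`, and `toy_filter` are hypothetical names, images are represented as plain strings, and applying the filter to both the original web text and the generated caption is an assumption about how the two models are combined.

```python
def bootstrap_dataset(web_pairs, captioner, filter_fn):
    """Return a cleaned list of (image, text) pairs.

    web_pairs : list of (image, web_text) tuples scraped from the web
    captioner : callable image -> synthetic caption (hypothetical stand-in)
    filter_fn : callable (image, text) -> bool, True if the text matches the image
    """
    cleaned = []
    for image, web_text in web_pairs:
        # Keep the original web text only if the filter accepts it.
        if filter_fn(image, web_text):
            cleaned.append((image, web_text))
        # Generate a new caption and keep it only if the filter accepts it too.
        synthetic = captioner(image)
        if filter_fn(image, synthetic):
            cleaned.append((image, synthetic))
    return cleaned


# Toy stand-ins: an "image" is just a label, and a text "matches" if it mentions the label.
def toy_captioner(image):
    return f"a photo of a {image}"


def toy_filter(image, text):
    return image in text


web_pairs = [("dog", "cute dog in park"), ("cat", "buy now 50% off")]
for pair in bootstrap_dataset(web_pairs, toy_captioner, toy_filter):
    print(pair)
```

In this toy run the noisy "buy now 50% off" text is dropped and replaced by a generated caption, while the accurate web text for the dog image is kept alongside its generated caption. In a real pipeline the cleaned pairs would then be used for another round of pretraining.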

CLIP: Learning Transferable Visual Models From Natural Language Supervision

CLIP (Contrastive Language-Image Pre-training) is an approach to learning visual representations proposed by researchers at OpenAI in 2021. Unlike traditional computer vision models, which are trained on large labeled image datasets, CLIP learns directly from natural language supervision.

The Core Idea

The key insight behind CLIP is that images can be connected to their text captions without having to generate the captions. By training the model to predict which caption goes with which image, it learns a rich visual representation of the world.

CLIP consists of two encoders: an image encoder and a text encoder. The image encoder takes in an image and outputs a visual representation vector. The text encoder takes in a caption and outputs a text representation vector. During training, these representations are optimized to be closer together for matching image-text pairs and farther apart for non-matching pairs. This is known as a contrastive loss objective.

Benefits of CLIP

There are sev
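The contrastive objective described in this post can be sketched in plain Python. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the function names are hypothetical, the embeddings are hand-written vectors rather than encoder outputs, and the temperature is fixed at 0.07 here for simplicity. Pair `i` of images and captions is assumed to be the matching pair.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each image should pick out its own caption (row-wise).
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each caption should pick out its own image (column-wise).
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image embedding is closest to its own caption and large when the pairing is scrambled, which is the behavior the training objective rewards.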