
CLIP: Learning Transferable Visual Models From Natural Language Supervision

CLIP (Contrastive Language-Image Pre-training) is a new approach to learning visual representations proposed by researchers at OpenAI in 2021. Unlike traditional computer vision models, which are trained on large labeled image datasets, CLIP learns directly from natural language supervision.

The Core Idea

The key insight behind CLIP is that images can be connected to their text captions without ever generating those captions. By training the model to predict which caption in a batch goes with which image, it learns a rich visual representation of the world.


CLIP consists of two encoders: an image encoder and a text encoder. The image encoder takes in an image and outputs a visual representation vector; the text encoder takes in a caption and outputs a text representation vector.

During training, these representations are optimized to be close together for matching image-text pairs and far apart for non-matching pairs. This is known as a contrastive objective.
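
A minimal sketch of this objective, loosely following the pseudocode in the CLIP paper, is shown below. It assumes PyTorch and placeholder encoder outputs; the temperature value here is illustrative (in CLIP it is a learned parameter), not the exact one used in training.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # image_features: [N, d] batch from the image encoder (placeholder shapes)
        # text_features:  [N, d] batch from the text encoder
        # L2-normalize so the dot product is a cosine similarity
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # N x N matrix of pairwise similarities, scaled by a temperature
        logits = image_features @ text_features.t() / temperature

        # The i-th image matches the i-th caption, so targets are the diagonal
        targets = torch.arange(logits.shape[0], device=logits.device)

        # Symmetric cross-entropy: image-to-text and text-to-image
        loss_i = F.cross_entropy(logits, targets)
        loss_t = F.cross_entropy(logits.t(), targets)
        return (loss_i + loss_t) / 2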


Benefits of CLIP

There are several advantages to this approach:


  • Transferable visual representations - The image encoder learns generally useful visual concepts such as objects, scenes, and actions. These representations can be transferred to other vision tasks through fine-tuning.
  • Scale - CLIP is trained on a dataset of 400 million image-text pairs scraped from the internet, which provides a rich supervision signal.
  • Zero-shot transfer - Remarkably, CLIP can perform zero-shot classification without seeing any labeled examples from a dataset; it simply compares image embeddings to embeddings of text prompts (a short example follows this list).
  • Robustness - CLIP is robust to image variations such as blur and brightness changes, which indicates it learns high-level semantic representations.
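
To make the zero-shot point concrete, here is a small sketch that scores one image against a handful of candidate text prompts. It assumes the Hugging Face transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint; the image path and label set are placeholders.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # placeholder image
    labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Similarity of the image to each prompt, converted to probabilities
    probs = outputs.logits_per_image.softmax(dim=-1)
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")

Note that the class names never appear as training labels; they are supplied at inference time as ordinary text, which is what makes the classifier zero-shot.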



CLIP In Action


A few examples of what CLIP can do:


  • Classify an image of a dog as a "dog" without being trained on any labeled examples from the target dataset.
  • Correctly match the caption "A happy couple on a gondola" to the corresponding image.
  • Be fine-tuned to get 97.9% accuracy on CIFAR-10 in just 10 minutes of training.
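
For the transfer-learning use case, a lightweight variant in the spirit of the linear probes used in the CLIP paper is sketched below: freeze the image encoder, extract features, and fit a linear classifier on top. It assumes OpenAI's clip Python package, torchvision's CIFAR-10 loader, and scikit-learn; the hyperparameters are illustrative only.

    import clip
    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression
    from torch.utils.data import DataLoader
    from torchvision.datasets import CIFAR10

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def extract_features(train):
        # Run the frozen CLIP image encoder over one CIFAR-10 split
        dataset = CIFAR10(root=".", download=True, train=train, transform=preprocess)
        features, labels = [], []
        with torch.no_grad():
            for images, targets in DataLoader(dataset, batch_size=256):
                feats = model.encode_image(images.to(device))
                features.append(feats.cpu().numpy())
                labels.append(targets.numpy())
        return np.concatenate(features), np.concatenate(labels)

    train_x, train_y = extract_features(train=True)
    test_x, test_y = extract_features(train=False)

    # Linear probe: logistic regression on the frozen CLIP features
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(train_x, train_y)
    print("Test accuracy:", classifier.score(test_x, test_y))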

CLIP provides simple yet powerful off-the-shelf visual representations for a variety of computer vision tasks. Its limitations include weak fine-grained spatial understanding and biases inherited from its web-scraped training data. Nonetheless, it's an impressive demonstration of learning visual concepts directly from natural language.


