
TX-CNN: DETECTING TUBERCULOSIS IN CHEST X-RAY IMAGES USING CONVOLUTIONAL NEURAL NETWORK

In this paper, the authors propose a method to classify tuberculosis from chest X-ray images using Convolutional Neural Networks (CNN). They achieve a classification accuracy of 85.68%. They attribute the effectiveness of their approach to shuffle sampling with cross-validation while training the network.

Methodology

Convolutional Neural Network

Convolutional Neural Networks have become the standard tool for computer vision tasks and are widely used for general-purpose image and video problems. There are many great resources to learn about them; I will link a few of them at the end of this post.

In this paper, the authors study the well-known AlexNet and GoogLeNet architectures for classifying tuberculosis images. A CNN model usually consists of convolutional layers, pooling layers and fully connected layers. Each layer is connected to the previous one via kernels or filters, and the model learns the kernel parameters so that they capture both local and global features of the image.
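To make the layer types concrete, here is a minimal PyTorch sketch. It is not the paper's architecture (they use AlexNet and GoogLeNet); the channel counts and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal example of the three layer types: convolution, pooling, fully connected."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learnable kernels/filters
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # pooling halves spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected layer

    def forward(self, x):            # x: (batch, 3, 224, 224)
        x = self.features(x)         # -> (batch, 32, 56, 56)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```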

Transfer Learning

This is a technique used to reuse features learned by a network on one domain/dataset in another domain/dataset. In domains such as medicine, labeled data is generally scarce. So it is common in deep learning to train a network on a well-known large dataset (like ImageNet) and then fine-tune the same model by retraining it on the medical data. The network first learns basic, general-purpose features from its randomly initialized weights while training on the large dataset; during the fine-tuning phase, those features serve as a starting point (or are kept fixed) while the network learns medical-specific features.

Since chest X-ray images are not abundant enough to train large, complex models from scratch, the authors chose this strategy: pretrain their networks on the ImageNet dataset and then fine-tune them on the TB dataset. They also use shuffle sampling and cross-validation while training on the chest X-ray images.
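A hedged sketch of this pretrain-then-fine-tune setup, using torchvision's AlexNet with ImageNet weights. The paper's actual framework, class count and hyperparameters may differ; the two-class head, freezing choice and learning rate below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load AlexNet pretrained on ImageNet (transfer learning starting point).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Replace the final ImageNet classifier (1000 classes) with a new TB head.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

# Optionally freeze the convolutional features and train only the new head;
# alternatively, fine-tune everything with a small learning rate.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9,
)
```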

Shuffle Sampling

If you read this paper, you will see that the authors attribute the boost in accuracy of their models to this sampling technique. Shuffle sampling is a data augmentation technique used to balance the categories of a dataset. For very unbalanced datasets (where some categories have far more examples than others), it is very hard to learn a classifier that generalizes across all categories. According to the experiments in this paper, the technique increased AlexNet's training time by 2 hours but improved accuracy from 53.02% to 85.68%.
The steps followed for this technique are as follows (a code sketch follows the list):
  1. Index every image in each category and let N be the number of images in the largest category.
  2. For each category, generate N unique random numbers.
  3. For each category, take each of the N numbers modulo Nc (the number of images in that category).
  4. These mod values give N indices, each pointing to an image in the respective category.
  5. Replace each index with its corresponding image; the dataset expands to a size of N x (number of categories).
Please refer to the figure borrowed from the paper.
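A small sketch of the balancing steps above. The paper does not spell out how the N unique numbers are drawn; using a random permutation of 0..N-1 is an assumption that matches the description.

```python
import random

def shuffle_sample(dataset):
    """dataset: dict mapping category name -> list of images (paths or arrays).

    Returns a balanced dataset with N images per category, where N is the
    size of the largest category (total size: N x number of categories).
    """
    n = max(len(images) for images in dataset.values())        # step 1: N
    balanced = {}
    for category, images in dataset.items():
        nc = len(images)
        unique_numbers = random.sample(range(n), n)            # step 2: N unique numbers
        indices = [num % nc for num in unique_numbers]         # steps 3-4: mod -> indices
        balanced[category] = [images[i] for i in indices]      # step 5: indices -> images
    return balanced
```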

Learning Links


