
Learning to Read Chest X-Rays: Recurrent Neural Feedback Model for Automated Image Annotation

In this paper, the authors present a deep learning model to detect disease from chest x-ray images. A convolutional neural network (CNN) is trained to detect the disease names. Recurrent neural networks (RNNs) are then trained to describe the contexts of a detected disease, based on the deep CNN features.

CNN Models Used and Dataset

CNNs encode input images effectively. In this paper, the authors experiment with a Network in Network (NIN) model and a GoogLeNet model. The dataset contains 3,955 radiology reports and 7,470 associated chest x-rays; normal cases (no disease) account for 71% of it. The dataset was balanced by augmenting the training images, taking random 224x224 crops from the original 256x256 images.
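As a minimal sketch of this augmentation step (assuming torchvision; not the authors' preprocessing code), random 224x224 crops can be drawn from the 256x256 originals so that under-represented disease classes can be oversampled:

```python
import torchvision.transforms as T

# Random 224x224 crops from the 256x256 originals, used to augment and
# balance the training set (illustrative only).
train_transform = T.Compose([
    T.RandomCrop(224),
    T.ToTensor(),
])
```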

Adaptability of Transfer Learning

Since this boils down to a classification problem on a small dataset, transfer learning is a natural technique to try. The authors experimented with it using ImageNet-trained models: the ImageNet-trained CNN weights were re-used except for the last classification layer, and the network was fine-tuned on the chest x-ray images with a learning rate 1/10 of the default.
The experiments showed that the fine-tuned model could not exceed 15% accuracy, while the same model trained from random initialization achieved close to 60% validation accuracy. The authors therefore concluded that features trained specifically on chest x-rays are better suited to this task than features re-used from ImageNet.
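Here is a rough sketch of that transfer-learning setup, written with torchvision's GoogLeNet purely for illustration (the paper's actual framework and hyperparameters are not shown here; the class count and learning rates are assumptions):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import googlenet

num_classes = 17  # hypothetical number of disease labels

# Re-use ImageNet-trained weights, replacing only the last classification layer.
model = googlenet(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune with a learning rate 1/10 of the default used for training from
# scratch (0.01 assumed here as the base rate).
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```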

Regularization

As with any deep learning model, regularization must be properly configured to avoid overfitting and to force the model to learn useful features. Batch normalization and dropout were employed to increase training and validation accuracy.
The GoogLeNet model showed better results than the NIN model when used with batch normalization and dropout layers. A further increase in accuracy was observed when training images were duplicated instead of randomly cropped to augment the training data.
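For concreteness, a single convolutional block with both regularizers might look like the following (a PyTorch sketch, not the authors' architecture; channel sizes and dropout rate are assumptions):

```python
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),   # batch normalization after the convolution
    nn.ReLU(inplace=True),
    nn.Dropout2d(p=0.5),   # dropout regularization
)
```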

Annotation Generation with RNN

Up to this point, this is a standard application of a deep CNN to a classification problem. The authors then go further and introduce RNNs to annotate the images with human-like diagnoses. RNNs are heavily used in NLP systems and machine translation; here they are used to learn the annotations of the chest x-ray images.

In the dataset, the majority of annotations contain up to five words. Longer annotations are ignored by constraining the RNNs to unroll for only five time steps, and shorter ones are zero-padded. The authors experiment with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells.
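A small sketch of this sequence preparation (the word-to-id vocabulary and pad id are hypothetical; the paper's exact preprocessing may differ):

```python
MAX_STEPS = 5
PAD_ID = 0

def prepare_annotation(token_ids, max_steps=MAX_STEPS, pad_id=PAD_ID):
    """Zero-pad a list of word ids to `max_steps`; drop longer annotations."""
    if len(token_ids) > max_steps:
        return None  # annotations longer than 5 words are ignored
    return token_ids + [pad_id] * (max_steps - len(token_ids))

print(prepare_annotation([12, 7, 3]))  # -> [12, 7, 3, 0, 0]
```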

The initial state of the RNN is set to the CNN image embedding, and the first annotation word is given as the initial input. Unlike models that end in fully-connected layers, NIN and GoogLeNet use average-pooling layers after their convolutional layers, so the input to the RNN is the output of the model's last average-pooling layer. The outputs of the RNN are the following words of the annotation sequence. The model is trained by minimizing the negative log likelihood between the output sequences and the true sequences:


$$\mathcal{L} = -\sum_{t=1}^{N} \log P\left(y_t = s_t \mid \mathrm{CNN}(I), s_1, \ldots, s_{t-1}\right)$$

where $y_t$ is the output of the RNN at step $t$, $s_t$ is the correct word, $\mathrm{CNN}(I)$ is the embedding of the input image $I$, and $N$ is the number of words in the annotation.
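To make the training objective concrete, here is a compact PyTorch sketch of an LSTM decoder whose initial state is the CNN embedding, trained with a cross-entropy (negative log likelihood) loss. It is not the paper's implementation; the vocabulary size, embedding size, and hidden size are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 256, 1024  # hidden_dim assumed to match the CNN embedding

word_embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, vocab_size)
nll = nn.CrossEntropyLoss(ignore_index=0)  # -log P(y_t = s_t), padding (id 0) ignored

def annotation_loss(cnn_embedding, words):
    """cnn_embedding: (batch, hidden_dim) from the last average-pooling layer.
    words: (batch, 5) integer word ids; words[:, 0] is the first input word."""
    h0 = cnn_embedding.unsqueeze(0)         # initial hidden state = CNN(I)
    c0 = torch.zeros_like(h0)
    inputs = word_embed(words[:, :-1])      # feed words s_1 .. s_{N-1}
    outputs, _ = lstm(inputs, (h0, c0))     # predictions y_1 .. y_{N-1}
    logits = classifier(outputs)            # (batch, N-1, vocab_size)
    return nll(logits.reshape(-1, vocab_size), words[:, 1:].reshape(-1))
```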

Recurrent Feedback Model for Image Labeling with Joint Image/Text Context

Using this RNN, we obtain annotations for the image. These are then used to create a more diverse set of image labels, from which information beyond just the name of the disease can be inferred.
For this, a joint image/text context vector is generated by applying mean-pooling over the state vectors of the RNN at each step. Note that the state vector is initialized with the CNN embedding and then updated by unrolling over the annotation sequence. Below is an illustration of how the joint image/text context vector is calculated.
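As one possible sketch of this computation (PyTorch assumed; shapes and names are hypothetical, not the authors' code), the trained LSTM is unrolled over the annotation and its hidden states are mean-pooled:

```python
import torch

def joint_context(cnn_embedding, word_inputs, lstm, word_embed):
    """Unroll a trained LSTM (batch_first=True) over an annotation and
    mean-pool its hidden states into a joint image/text context vector."""
    h = cnn_embedding.unsqueeze(0)      # state initialized with the CNN embedding
    c = torch.zeros_like(h)
    states = []
    for t in range(word_inputs.size(1)):            # unroll over the annotation words
        x_t = word_embed(word_inputs[:, t]).unsqueeze(1)
        _, (h, c) = lstm(x_t, (h, c))
        states.append(h.squeeze(0))                 # hidden state at step t
    return torch.stack(states, dim=1).mean(dim=1)   # mean-pooled joint context
```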

The obtained vector encodes both the image context and the text context describing the image. From it, new image labels are derived that take the disease context into account.
The CNN is trained once more with these new labels (the previous CNN is fine-tuned, with its last classification layer replaced). The RNN is then trained again with the new image embeddings, and finally the image annotations are generated.
Results borrowed from the paper are shown below.



Additional Reading

  1. Network in Network model: https://arxiv.org/abs/1312.4400

