Network In Network

In this paper, the authors introduce a new network structure for the traditional CNN to better extract and interpret latent features. It is named "Network In Network (NIN)".

NIN vs traditional CNN

In a traditional CNN, convolutional layers and spatial pooling layers are stacked followed by fully connected layers and an output layer. The convolution layers generate feature maps by linear convolutional filters followed by non-linear activation functions. The NIN structure addresses the following 2 limitations of a traditional CNN.
  1. The linear kernels/filters used in each convolutional layer work well only when the features to be extracted are linearly separable.
  2. The fully connected layers at the end of the CNN lead to over-fitting the training data. 

Convolution with linear filter vs Neural network

The convolution layers involve a kernel that slides over the input (or the previous layer's output) and extracts features. The kernel is a matrix with which convolution is performed. This is a linear operation, meaning the filtering done by the kernel is linear and is sufficient only to differentiate features that are linearly separable. However, good feature maps are generally highly non-linear functions of the input data.
NIN addresses this by introducing a "micro neural network" (a multilayer perceptron, or MLP), a universal function approximator that can model non-linear feature spaces. The resulting structure is named the "mlpconv" layer. Here an MLP is slid over the input in the same manner as a filter to extract feature maps, which are then fed into the next layer. The overall structure of the NIN is a stack of multiple mlpconv layers. Given below is a comparison of a linear convolution layer and an mlpconv layer (borrowed from the paper).

The linear convolution layer includes a linear filter while the mlpconv layer includes a micro network.

    The mlpconv layers can be trained using back-propagation along with the rest of the network.
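To make the sliding-MLP idea concrete, here is a minimal NumPy sketch, not the paper's implementation. It uses the equivalence noted in the paper: an mlpconv layer is a normal convolution followed by 1x1 convolutions, since a 1x1 convolution acts as a fully connected layer applied at every spatial location. All layer sizes and function names here are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(x, w):
    """Valid cross-correlation. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def mlpconv(x, w_conv, w_1x1_list):
    """One mlpconv layer: a k x k convolution followed by 1x1 convolutions.
    Equivalent to sliding a small MLP over every receptive field of x."""
    y = relu(conv2d(x, w_conv))
    for w in w_1x1_list:
        y = relu(conv2d(y, w))  # 1x1 conv = per-location fully connected layer
    return y
```

Each extra 1x1 convolution adds one hidden layer to the per-location MLP, which is what lets the layer model non-linearly separable features.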

Fully connected layers vs Global averaging pooling layers

The second limitation, over-fitting caused by the fully connected layers, is addressed by replacing them with a global average pooling layer. Here we directly take the average of each feature map from the last mlpconv layer as the confidence of the corresponding category. The fully connected layers presented the problem of over-fitting (only partially mitigated by dropout). Global average pooling, in contrast, has no parameters to learn, making it inherently immune to over-fitting. The authors also argue that the averaging creates a more meaningful correspondence between the feature maps and the categories, as opposed to the black-box-like correspondence of the fully connected layers. Furthermore, global average pooling sums out the spatial information and is therefore more robust to spatial translation of the input.
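A quick sketch of this output stage, assuming NumPy and one feature map per category (the function names are my own, not the paper's):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each feature map to a single scalar confidence.
    feature_maps: (C, H, W) output of the last mlpconv layer.
    Returns a length-C vector, one confidence per category."""
    return feature_maps.mean(axis=(1, 2))

def softmax(z):
    """Turn pooled confidences into class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy example: if the map for class 1 is uniformly active,
# class 1 gets the highest pooled confidence.
fm = np.zeros((3, 4, 4))
fm[1] = 1.0
probs = softmax(global_average_pool(fm))
```

Note there are no weights anywhere in this stage, which is exactly why it cannot over-fit the way a fully connected classifier can.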

The overall structure is given below (borrowed from the paper)
NIN structure with three mlpconv layers stacked and one global average pooling layer

We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they make better approximations to the confidence maps than linear filters do.

Comparing with maxout networks

To overcome the limitation of linear filters, another type of network was introduced, known as the maxout network. In a maxout network, max pooling is done over the results of several linear convolutions without applying an activation function. Maximization over linear functions yields a piecewise linear function, which is capable of approximating any convex function. The underlying assumption is that the features lie within convex regions of the input space; however, this prior does not hold in all cases.
Mlpconv differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator (the MLP), which has greater capability in modeling various distributions of latent concepts.
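The convex-only limitation of maxout is easy to see in a tiny sketch (again a NumPy illustration, not from the paper). Taking the max over two affine pieces, x and -x, reproduces |x|, a convex function; no choice of pieces can produce a non-convex one, which is the gap the MLP in mlpconv closes.

```python
import numpy as np

def maxout(x, weights, biases):
    """Maxout unit: max over k affine pieces of the input.
    x: (d,) input; weights: (k, d); biases: (k,).
    The result is always a convex, piecewise linear function of x."""
    return (weights @ x + biases).max()

# Two pieces, x and -x: their max is the absolute value function.
w = np.array([[1.0], [-1.0]])
b = np.zeros(2)
```

A max over affine functions is convex by construction, so maxout's hidden units can only carve out convex activation regions.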


The authors' experiments show that the NIN + dropout structure achieves the best or comparable performance relative to other state-of-the-art methods on four benchmark datasets: CIFAR-10, CIFAR-100, SVHN, and MNIST. You can go over the results in the actual paper.
