Network In Network

In this paper, the authors introduce a new network structure, named "Network In Network" (NIN), that modifies the traditional CNN to better extract and interpret latent features.

NIN vs traditional CNN

In a traditional CNN, convolutional layers and spatial pooling layers are stacked, followed by fully connected layers and an output layer. The convolutional layers generate feature maps with linear convolutional filters followed by non-linear activation functions. The NIN structure addresses the following two limitations of a traditional CNN.
  1. The linear kernels/filters used in each convolutional layer work well only when the features to be extracted are linearly separable.
  2. The fully connected layers at the end of the CNN are prone to over-fitting the training data.

Convolution with linear filter vs Neural network

A convolutional layer slides a kernel over its input (the image or the previous layer's feature maps) and extracts features. The kernel is usually a matrix that is convolved with each local patch, which is a linear operation. The filtering done by the kernel is therefore linear, and it is sufficient only to separate features that are linearly separable. However, the features we want to extract are generally highly non-linear functions of the input data.
The NIN addresses this by introducing a "micro neural network", a multilayer perceptron (MLP), which is a universal function approximator and can therefore model non-linear data spaces. The resulting structure is named the "mlpconv" layer. The MLP is slid over the input in the same way as a kernel to produce feature maps, which are then fed into the next layer. The overall structure of the NIN is a stack of multiple mlpconv layers. Given below is a comparison of the linear convolution layer and the mlpconv layer (borrowed from the paper).

The linear convolution layer includes a linear filter while  the mlpconv layer includes a micro network.

The micro networks are trained by back-propagation together with the rest of the network.
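The mlpconv layer described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `mlpconv`, the layer sizes, the random weights, and the choice of no padding with stride 1 are all assumptions made for the example. The first layer is an ordinary linear filter over a local patch followed by ReLU, and the extra layers act on the resulting channel vector at each location.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlpconv(x, w_conv, b_conv, mlp_weights, mlp_biases):
    """Slide a small MLP over x (hypothetical helper, for illustration only).

    x:           input of shape (C_in, H, W)
    w_conv:      first-layer filters, shape (C1, C_in, k, k)
    b_conv:      first-layer biases, shape (C1,)
    mlp_weights: further per-location layers, each of shape (C_next, C_prev)
    mlp_biases:  matching biases, each of shape (C_next,)
    """
    c1, _, k, _ = w_conv.shape
    _, h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1      # assumption: no padding, stride 1
    w_flat = w_conv.reshape(c1, -1)          # one row per first-layer filter
    c_out = mlp_weights[-1].shape[0] if mlp_weights else c1
    out = np.empty((c_out, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + k, j:j + k].reshape(-1)
            hidden = relu(w_flat @ patch + b_conv)   # linear filter + ReLU
            for wt, b in zip(mlp_weights, mlp_biases):
                hidden = relu(wt @ hidden + b)       # extra MLP layers
            out[:, i, j] = hidden
    return out

# Assumed toy sizes: a 3-channel 8x8 input, 3x3 patches, 16 -> 16 -> 10 units.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w_conv = rng.standard_normal((16, 3, 3, 3))
b_conv = np.zeros(16)
mlp_weights = [rng.standard_normal((16, 16)), rng.standard_normal((10, 16))]
mlp_biases = [np.zeros(16), np.zeros(10)]

features = mlpconv(x, w_conv, b_conv, mlp_weights, mlp_biases)
print(features.shape)  # (10, 6, 6)
```

Because the extra layers only mix channels at a single location, they behave like 1x1 convolutions, which is how an mlpconv layer is usually expressed in deep learning frameworks.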

Fully connected layers vs Global average pooling layers

The second limitation, over-fitting in the fully connected layers, is addressed by replacing them with a global average pooling layer. The last mlpconv layer produces one feature map per category, and the spatial average of each map is taken directly as the confidence for that category before being fed to the softmax. Fully connected layers are prone to over-fitting (partially mitigated by dropout). Global average pooling, however, has no parameters to learn, so over-fitting cannot occur at this layer. The authors also argue that averaging creates a more meaningful correspondence between the feature maps and the categories, as opposed to the black-box-like correspondence of fully connected layers. Furthermore, global average pooling sums out the spatial information and is therefore more robust to spatial translations of the input.
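A small NumPy sketch of global average pooling may make the argument above concrete. The shapes (10 categories, 6x6 maps) and the random feature maps are assumptions chosen for illustration, and the parameter count for the fully connected alternative is simple arithmetic rather than a figure from the paper.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each (H, W) map to one number; input shape is (C, H, W)."""
    return feature_maps.mean(axis=(1, 2))

def softmax(scores):
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Assumed toy output of the last mlpconv layer: one 6x6 map per category.
rng = np.random.default_rng(0)
maps = rng.standard_normal((10, 6, 6))

pooled = global_average_pool(maps)    # one confidence value per category
probs = softmax(pooled)

# A fully connected layer from the flattened maps to 10 classes would need
# 360 * 10 weights plus 10 biases; global average pooling needs none.
fc_params = maps.size * 10 + 10
gap_params = 0

# Circularly shifting every map leaves the averages unchanged. This is an
# idealised stand-in for translation; real shifts move content across the
# border, so the robustness is only approximate in practice.
shifted_maps = np.roll(maps, shift=2, axis=2)
same = np.allclose(global_average_pool(shifted_maps), pooled)
```

Here `fc_params` works out to 3610 while `gap_params` is 0, and `same` is true because the mean of a map does not depend on where its values sit.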

The overall structure is given below (borrowed from the paper)
NIN structure with three mlpconv layers stacked and one global average pooling layer

We can see global average pooling as a structural regularizer that explicitly forces the feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they approximate the confidence maps better than linear filters do.

Comparing with maxout networks

To overcome the limitation of linear filters, another type of network, the maxout network, was introduced earlier. In a maxout network, max pooling is done over the results of the linear convolution without applying an activation function. Maximization over linear functions yields a piecewise linear function that is capable of approximating any convex function. The underlying assumption is that instances of a latent concept lie within a convex set in the input space. However, this prior does not hold in all cases.
Mlpconv differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator (the MLP), which has greater capability in modeling various distributions of latent concepts.
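The difference between a convex and a universal approximator can be seen on a scalar toy problem. The sketch below is an illustration under stated assumptions, not code from the paper: `maxout_unit` and `tiny_mlp` are hypothetical helpers, the weights are hand-picked, and the toy MLP uses a linear output unit purely to show that an MLP is not restricted to convex functions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def maxout_unit(x, slopes, offsets):
    """Maximum over several affine functions of a scalar x (convex in x)."""
    return np.max(slopes * x + offsets)

def tiny_mlp(x):
    """One hidden ReLU layer with negative output weights: computes -|x|."""
    hidden = relu(np.array([x, -x]))
    output_weights = np.array([-1.0, -1.0])
    return float(output_weights @ hidden)

# Maxout with two pieces represents |x| = max(x, -x), a convex function.
slopes = np.array([1.0, -1.0])
offsets = np.zeros(2)
points = (-1.0, 0.0, 1.0)
maxout_values = [float(maxout_unit(x, slopes, offsets)) for x in points]
mlp_values = [tiny_mlp(x) for x in points]

# Convexity requires f(midpoint) <= average of f at the endpoints.
maxout_is_convex_here = maxout_values[1] <= (maxout_values[0] + maxout_values[2]) / 2
mlp_is_convex_here = mlp_values[1] <= (mlp_values[0] + mlp_values[2]) / 2
```

The maxout unit satisfies the midpoint inequality on these points, as any maximum of affine functions must, while the toy MLP violates it because it computes the concave function -|x|. No choice of maxout weights can produce that shape with a single unit.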


The experiments reported by the authors show that NIN with dropout achieves the best or comparable performance relative to other state-of-the-art methods on four benchmark datasets: CIFAR-10, CIFAR-100, SVHN and MNIST. The full results are in the paper.
