
Deeper and Wider Siamese Networks for Real-Time Visual Tracking

 In this paper, the authors investigate how to increase the robustness and accuracy of existing Siamese trackers used for visual object tracking.

Visual object tracking

Visual object tracking is one of the fundamental problems in computer vision. It aims to estimate the position of an arbitrary target in a video sequence, given only its location in the initial frame. It has numerous applications in surveillance, robotics, and human-computer interaction.

Siamese Networks and their usage in Trackers

Siamese networks are a class of neural networks that learn to generate comparable feature vectors from their twin inputs. By learning to compute these comparable feature vectors, the network learns discriminative characteristics for each image class. With these output vectors, it is possible to compare the two inputs and decide whether they belong to the same image class. For example, this is used in one-shot learning for facial recognition: the Siamese network learns to differentiate between faces, and then, given an exemplar face and an image to search in, it can identify whether the former is present in the latter.
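
As a rough illustration of the comparison step, the sketch below slides an exemplar embedding over a search embedding and scores each offset with a dot product. The `embed` function is a hypothetical placeholder (here simply the identity) standing in for the shared backbone, and the 1D signals are toy data; none of this is taken from the paper.

```python
def embed(signal):
    """Hypothetical stand-in for the shared (twin) backbone; here the identity."""
    return list(signal)


def similarity_map(exemplar, search):
    """Score every offset of the exemplar embedding inside the search embedding."""
    z, x = embed(exemplar), embed(search)
    return [
        sum(zi * x[offset + i] for i, zi in enumerate(z))
        for offset in range(len(x) - len(z) + 1)
    ]


exemplar = [1, 3, 1]
search = [0, 0, 1, 3, 1, 0, 0]

scores = similarity_map(exemplar, search)
best_offset = scores.index(max(scores))  # offset with the highest similarity
print(scores, best_offset)
```

Both inputs pass through the same `embed` function, which is what makes the two feature vectors directly comparable.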

Limitation in existing Siamese networks for visual tracking discussed in this paper

Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow (e.g., AlexNet) and do not fully take advantage of the capability of modern deep neural networks.

Leveraging deeper and wider convolutional nets in Siamese networks

It is observed that direct replacement of backbones with existing powerful architectures, such as ResNet and Inception, does not bring improvements. The main reasons are that 

  1. large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 
  2. the network padding for convolutions induces a positional bias in learning. 

Why directly replacing the backbone with deeper networks leads to detrimental results

One intuitive reasoning is that these deeper and wider network architectures are primarily designed for image classification tasks, where the precise localization of the object is not paramount. To investigate the concrete reason, we analyze the Siamese network architecture and identify that the receptive field size of neurons, network stride, and feature padding are three important factors affecting tracking accuracy. In particular, the receptive field determines the image region used in computing a feature. A larger receptive field provides a greater image context, while a small one may not capture the structure of target objects. The network stride affects the degree of localization precision, especially for small-sized objects. Meanwhile, it controls the size of output feature maps, which affects feature discriminability and detection accuracy. Moreover, for a fully-convolutional architecture, the feature padding for convolution induces a potential position bias in model training, such that when an object moves near the search range boundary, it is difficult to make an accurate prediction. These three factors together prevent Siamese trackers from benefiting from current deeper and more sophisticated network architectures.

Analysis of the performance degradation

An ablation study is performed on the internal network structures to analyse the ones that are most responsible for the performance drop. These structures include
  1. Network stride (STR)
  2. Receptive field (RF) of neurons
  3. Output feature size (OFS)
  4. Padding

Network Stride

When network stride (STR) increases from 4 or 8 to 16, the performance drops significantly.
This illustrates that Siamese trackers prefer mid-level features (stride 4 or 8), which are more precise in object localization than high-level features (stride ≥ 16). Network stride affects the overlap ratio of receptive fields for two neighboring output features. It thereby determines the basic degree of localization precision. Therefore, when network depth increases, the stride should not increase accordingly.
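
To make the overlap argument concrete, here is a minimal sketch. It relies on the fact that the receptive fields of two neighboring output features are shifted by exactly one network stride in the input image. The receptive field of 87 pixels is an assumed example value, held fixed across strides purely for illustration; in a real network it would change with the architecture.

```python
def rf_overlap_ratio(receptive_field, stride):
    """Fraction of one receptive field (along one axis) shared with its neighbor's."""
    return max(0, receptive_field - stride) / receptive_field


# Assumed receptive field of 87 pixels; only the stride is varied here.
for stride in (4, 8, 16):
    print(stride, round(rf_overlap_ratio(87, stride), 3))
```

A larger stride shifts neighboring receptive fields further apart, so each output feature has to account for a wider range of target positions.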

Receptive field

For the maximum size of the receptive field (RF), the optima lie in a small range.
It is about 60%∼80% of the input exemplar image size.
This illustrates that the size of the RF is crucial for feature embedding in a Siamese framework. The underlying reason is that the RF determines the image region used in computing a feature. A large receptive field covers much image context, resulting in the extracted feature being insensitive to the spatial location of target objects. On the contrary, a small one may not capture the structural information of objects, and thus it is less discriminative for matching. Therefore, only an RF in a certain size range allows the feature to abstract the characteristics of the object, and its ideal size is closely related to the size of the exemplar image.
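
The ratio can be checked with the standard receptive-field recurrence for a stack of convolution and pooling layers. The sketch below is illustrative only: the layer list is an assumed AlexNet-style configuration and the exemplar size of 127 pixels is also an assumption, since neither is stated in these notes.

```python
def receptive_field_and_stride(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighboring features
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


# Assumed AlexNet-style stack of (kernel, stride) for conv and pooling layers.
layers = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
exemplar_size = 127  # assumed exemplar side length in pixels

rf, stride = receptive_field_and_stride(layers)
ratio = rf / exemplar_size
print(rf, stride, round(ratio, 2))
```

Under these assumptions the receptive field covers roughly two thirds of the exemplar, which falls inside the 60%∼80% range described above.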

Output feature size

For the output feature size, it is observed that a small size (OFS ≤ 3) does not benefit tracking accuracy. This is because small feature maps lack enough spatial structure description of target objects, and thus are not robust in image similarity calculation.
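
The output feature size follows from the usual size arithmetic for unpadded layers. The following is a small sketch under assumptions: the AlexNet-style layer list and the 127-pixel exemplar are illustrative values, and the extra stride-2 layer is hypothetical, added only to show how quickly the feature map shrinks.

```python
def output_feature_size(input_size, layers):
    """Spatial size after a stack of unpadded (kernel_size, stride) layers."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


# Assumed AlexNet-style stack (total stride 8) and assumed exemplar size.
layers = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
ofs_stride_8 = output_feature_size(127, layers)

# Hypothetical extra 3x3, stride-2 layer, raising the total stride to 16.
ofs_stride_16 = output_feature_size(127, layers + [(3, 2)])
print(ofs_stride_8, ofs_stride_16)
```

In this toy setting, doubling the stride pushes the output feature size below the OFS ≤ 3 threshold mentioned above.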

Network padding

Network padding has a highly negative impact on the final performance.
If the networks contain padding operations, the embedding features of an exemplar image are extracted from the original exemplar image plus additional (zero-)padding regions. In contrast, for a search image, some features are extracted only from the image content itself, while others (e.g., the features near the border) are extracted from image content plus additional (zero-)padding regions. As a result, the embeddings of a target object are inconsistent across different positions in the search image, and the matching similarity comparison degrades. For example, when the target object moves to the image border, the response peak does not precisely indicate the location of the target.
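
The inconsistency can be reproduced with a toy 1D example. This is only a sketch: the signals, the 3-tap summing kernel, and the helper name `conv1d_zero_padded` are all made up for illustration. The same target value produces a different feature depending on whether it sits in the middle of the signal or at its border, because the border feature also sees zero padding instead of image content.

```python
def conv1d_zero_padded(signal, kernel):
    """Size-preserving 1D convolution (correlation form) with zero padding."""
    pad = len(kernel) // 2
    padded = [0] * pad + list(signal) + [0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


kernel = [1, 1, 1]
target_centered = [1, 1, 1, 5, 1, 1, 1]   # target (value 5) in the middle
target_at_border = [5, 1, 1, 1, 1, 1, 1]  # same target at the left edge

feature_centered = conv1d_zero_padded(target_centered, kernel)[3]
feature_at_border = conv1d_zero_padded(target_at_border, kernel)[0]
print(feature_centered, feature_at_border)
```

The background and the target are identical in both signals, yet the two features differ, which is the kind of position-dependent embedding that hurts similarity matching.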

Guidelines based on the analysis

  1. Siamese trackers prefer a relatively small network stride. With regard to accuracy and efficiency, an empirically effective choice is to set the stride to 4 or 8.
  2. The receptive field of output features should be set based on its ratio to the size of the exemplar image. An empirically effective ratio is 60%∼80% for an exemplar image.
  3. Network stride, receptive field, and output feature size should be considered as a whole when designing a network architecture.
  4. For a fully convolutional Siamese matching network, it is critical to handle the problem of perceptual inconsistency between the two network streams. There are two feasible solutions. One is to remove the padding operations in networks, while the other is to enlarge the size of both input exemplar and search images, and then crop out padding-affected features. 

Addressing the issues of deeper networks in object tracking

New residual modules and architectures are introduced that allow deeper and wider backbone networks to unleash their power in Siamese trackers. First, a group of cropping-inside residual (CIR) units is proposed. These units crop out padding-affected features inside the block (i.e., features receiving padding signals), and thus prevent convolution filters from learning the position bias.
Second, two kinds of network architectures, namely deeper and wider networks, are designed by stacking the CIR units. In these networks, the stride and neuron receptive field are formulated to enhance localization precision.
Finally, the designed backbone networks are applied to two representative Siamese trackers: SiamFC [2] and SiamRPN.

Deeper and Wider Siamese Networks


CIR Unit

The CIR unit augments the residual unit with a cropping operation, which is incorporated after the feature addition. The cropping operation removes features whose calculation is affected by the zero-padding signals. Since the padding size is one in the bottleneck layer, only the outermost features on the border of the feature maps are cropped out.
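
A heavily simplified sketch of the idea follows, assuming a single-channel feature map and a single zero-padded 3×3 convolution in place of the full bottleneck (no 1×1 convolutions, normalization, or activations). The function names are illustrative and are not the paper's implementation.

```python
def conv3x3_zero_padded(x, w):
    """Size-preserving 3x3 convolution with zero padding 1 on a 2D list."""
    h, wd = len(x), len(x[0])
    padded = (
        [[0.0] * (wd + 2)]
        + [[0.0] + list(row) + [0.0] for row in x]
        + [[0.0] * (wd + 2)]
    )
    return [
        [sum(w[a][b] * padded[i + a][j + b] for a in range(3) for b in range(3))
         for j in range(wd)]
        for i in range(h)
    ]


def crop_border(x, n=1):
    """Drop the n outermost rows and columns (the padding-affected features)."""
    return [row[n:len(row) - n] for row in x[n:len(x) - n]]


def cir_unit(x, w):
    """Toy CIR unit: residual addition first, then cropping inside the block."""
    y = conv3x3_zero_padded(x, w)
    added = [[y[i][j] + x[i][j] for j in range(len(x[0]))] for i in range(len(x))]
    return crop_border(added, 1)


x = [[1.0] * 5 for _ in range(5)]
w = [[1.0] * 3 for _ in range(3)]
out = cir_unit(x, w)
print(out)
```

With padding size one, only the outermost ring of features has seen zero padding, so a one-pixel crop is enough to leave features computed purely from image content.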

Downsampling CIR Unit

The downsampling CIR unit (CIR-D) changes the convolutional stride from 2 to 1 within both the bottleneck layer and the shortcut connection. Cropping is again inserted after the addition operation to remove the padding-affected features. Finally, max-pooling is employed to perform spatial downsampling of the feature map.
If cropping were only inserted after the addition operation, as done in the proposed CIR unit, without changing the position of downsampling, the features after cropping would not receive any signal from the outermost pixels in the input image.
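
The effect on the outermost pixels can be seen by tracking which input positions feed the features that survive cropping. The sketch below is a 1D simplification with a single 3-tap, padding-1 convolution; the function name and the input length of 8 are illustrative assumptions.

```python
def inputs_seen_after_crop(n, stride, kernel=3, pad=1, crop=1):
    """Input indices that contribute to output features kept after cropping."""
    num_outputs = (n + 2 * pad - kernel) // stride + 1
    outputs = list(range(num_outputs))
    kept = outputs[crop:len(outputs) - crop]
    seen = set()
    for o in kept:
        start = o * stride - pad  # leftmost input index under this output
        seen.update(i for i in range(start, start + kernel) if 0 <= i < n)
    return seen


n = 8
seen_stride_2 = inputs_seen_after_crop(n, stride=2)  # crop after a strided conv
seen_stride_1 = inputs_seen_after_crop(n, stride=1)  # CIR-D style: stride 1, then crop
print(sorted(seen_stride_2), sorted(seen_stride_1))
```

With stride 2 the surviving features never see the outermost input position, whereas with stride 1 every input position still contributes, and downsampling can then be done afterwards by max-pooling.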

CIR-Inception and CIR-NeXt

In CIR-Inception, a 1×1 convolution is inserted into the shortcut connection, and the features of the two branches are merged by concatenation rather than by addition. In CIR-NeXt, the bottleneck layer is split into 32 transformation branches, which are aggregated by addition. Moreover, for the downsampling units of CIR-Inception and CIR-NeXt, the modifications are the same as those in the downsampling CIR unit (CIR-D; Fig. 3(b′) in the paper), where the convolution stride is reduced and max-pooling is added. These two multi-branch structures enable the units to learn richer feature representations.

Results

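Experiments show that, solely due to the proposed network architectures, SiamFC+ and SiamRPN+ obtain relative improvements of up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) over the original versions on the OTB-15, VOT-16 and VOT-17 datasets, respectively.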



Footnote

This post is my notes on the paper. Hence, all tables, results and wordings in this post are taken directly from the paper.
