In this paper, the authors investigate how to increase the robustness and accuracy of existing Siamese trackers used for visual object tracking.
Visual object tracking
Visual object tracking is one of the fundamental problems in computer vision. It aims to estimate the position of an arbitrary target in a video sequence, given only its location in the initial frame. It has numerous applications in surveillance, robotics, and human-computer interaction.
Siamese Networks and their usage in Trackers
Siamese networks are a class of neural networks that learn to map their twin inputs to comparable feature vectors using a shared set of weights. By learning this common embedding, the network captures characteristics that discriminate between image classes, so the two output vectors can be compared to decide whether the inputs belong to the same class. This is used, for example, in one-shot learning for facial recognition: the Siamese network learns to tell different faces apart, and then, given an exemplar face and an image to search in, it can determine whether the former is present in the latter.
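To make this concrete, below is a minimal PyTorch sketch of the twin-branch idea (a toy backbone of my own, not the paper's architecture): both inputs pass through the same shared-weight network, and the exemplar embedding is cross-correlated over the search embedding, as in SiamFC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbed(nn.Module):
    """Toy twin-branch embedding; both inputs share the same weights."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
        )

    def forward(self, exemplar, search):
        z = self.backbone(exemplar)   # feature map of the target template
        x = self.backbone(search)     # feature map of the search region
        # Cross-correlate the template features over the search features:
        # the response map peaks where the two embeddings match best.
        return F.conv2d(x, z)

net = SiameseEmbed()
z = torch.randn(1, 3, 32, 32)   # exemplar crop
x = torch.randn(1, 3, 64, 64)   # larger search crop
print(net(z, x).shape)          # torch.Size([1, 1, 33, 33])
```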
Limitations of existing Siamese networks for visual tracking discussed in this paper
Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet, and do not fully exploit the capability of modern deep neural networks.
Leveraging deeper and wider convolutional nets in Siamese networks
It is observed that direct replacement of backbones with existing powerful architectures, such as ResNet and Inception, does not bring improvements. The main reasons are:
- large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision;
- the network padding for convolutions induces a positional bias in learning.
Why direct replacement with deeper networks leads to detrimental results
One intuitive reason is that these deeper and wider network architectures are primarily designed for image classification tasks, where precise localization of the object is not paramount. To investigate the concrete cause, the paper analyzes the Siamese network architecture and identifies that the
receptive field size of neurons, network stride, and feature
padding are three important factors affecting tracking accuracy. In particular, the receptive field determines the image
region used in computing a feature. A larger receptive field
provides a greater image context, while a small one may not
capture the structure of target objects. The network stride
affects the degree of localization precision, especially for
small-sized objects. Meanwhile, it controls the size of output feature maps, which affects feature discriminability and
detection accuracy. Moreover, for a fully-convolutional architecture, the feature padding for convolution induces a potential position bias in model training, such that when
an object moves near the search range boundary, it is difficult to make an accurate prediction. These three factors
together prevent Siamese trackers from benefiting from current deeper and more sophisticated network architectures.
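To make these factors concrete, the snippet below computes the maximum receptive field and total stride of a plain stack of convolution/pooling layers via the standard recurrence (my own helper, not code from the paper). The example stack mirrors the AlexNet-style backbone used in SiamFC.

```python
# Maximum receptive field (RF) and total stride of a stack of conv/pool
# layers, using the recurrence: rf += (kernel - 1) * stride_so_far.
def rf_and_stride(layers):
    rf, total_stride = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * total_stride
        total_stride *= stride
    return rf, total_stride

# (kernel, stride) pairs of an AlexNet-style stack as used in SiamFC:
# conv11/2, pool3/2, conv5/1, pool3/2, conv3/1, conv3/1, conv3/1
alexnet_like = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
print(rf_and_stride(alexnet_like))  # (87, 8): RF 87 px, total stride 8
```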
Analysis of the performance degradation
An ablation study is performed on the internal network structures to analyse which of them are most responsible for the performance drop. These structures include:
- Network stride (STR)
- Receptive field (RF) of neurons
- Output feature size (OFS)
- Padding
Network Stride
When network stride (STR)
increases from 4 or 8 to 16, the performance drops significantly.
This illustrates that Siamese trackers prefer mid-level features (stride 4 or 8), which are more
precise in object localization than high-level features (stride
≥ 16). Network stride affects the overlap ratio of receptive fields for two neighboring output features, and thereby determines the basic degree of localization precision. Therefore, when network depth increases, the stride should not increase accordingly.
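A quick way to see the effect of stride on localization: the response-map peak can only be mapped back to the input image in steps of the total stride, so the predicted position is quantized accordingly (illustrative helper of my own, not from the paper).

```python
# Hypothetical helper: map a response-map peak index back to input-image
# pixels. The achievable precision is quantized by the total stride.
def peak_to_image(peak_row, peak_col, total_stride):
    return peak_row * total_stride, peak_col * total_stride

print(peak_to_image(7, 9, total_stride=8))    # (56, 72): 8-px steps
print(peak_to_image(7, 9, total_stride=16))   # (112, 144): 16-px steps
```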
Receptive field
For the maximum size of the receptive field (RF), the optimum lies in a small range, about 60%∼80% of the input exemplar image size. This illustrates that the size of the RF is crucial for feature embedding in a Siamese framework. The underlying reason is that
RF determines the image region used in computing a feature. A large receptive field covers much image context,
resulting in the extracted feature being insensitive to the
spatial location of target objects. On the contrary, a small
one may not capture the structural information of objects,
and thus it is less discriminative for matching. Therefore,
only RF in a certain size range allows the feature to abstract
the characteristics of the object, and its ideal size is closely
related to the size of the exemplar image.
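For instance, with the 127×127 exemplar crops used in SiamFC, this guideline corresponds to a maximum receptive field of roughly 76∼102 pixels; the stride-8 AlexNet-style stack computed earlier lands at 87 pixels, i.e. about 69% of the exemplar size.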
Output feature size
For the output
feature size, it is observed that a small size (OFS ≤ 3) does
not benefit tracking accuracy. This is because small feature maps lack enough spatial structure description of target
objects, and thus are not robust in image similarity calculation.
Network padding
Network padding has a highly negative impact on the final
performance.
If the network contains padding operations, the embedding features of an exemplar image are extracted from the original exemplar image plus additional (zero-)padding regions. In contrast, for the features of a search image, some are extracted only from the image content itself, while others are extracted from the image content plus additional (zero-)padding regions (e.g. the features near the border). As a result, there is an inconsistency between the embeddings of a target object appearing at different positions in search images, and therefore the matching similarity comparison degrades. For example, when the target object moves to the image border, the response peak no longer precisely indicates the location of the target.
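This inconsistency is easy to reproduce in a few lines (a toy demonstration of the effect, not the paper's experiment): the interior of a padded convolution's output matches a padding-free computation exactly, while the outermost ring is computed partly from zeros.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10, 10)
conv_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv_nopad = nn.Conv2d(1, 1, kernel_size=3, padding=0, bias=False)
conv_nopad.weight.data = conv_pad.weight.data.clone()  # identical filters

y_pad = conv_pad(x)      # 10x10 output; border responses see zero-padding
y_valid = conv_nopad(x)  # 8x8 output; computed from image content only

# The interior of the padded output equals the padding-free output ...
print(torch.allclose(y_pad[..., 1:-1, 1:-1], y_valid))  # True
# ... so only the outermost ring is biased by padding; cropping it out
# removes the position-dependent features (the remedy used later on).
```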
Guidelines based on the analysis
- Siamese trackers prefer a relatively small network stride. Balancing accuracy and efficiency, an empirically effective choice is to set the stride to 4 or 8.
- The receptive field of output features should be set based on its ratio to the size of the exemplar image. An empirically effective ratio is 60%∼80% of the exemplar image size.
- Network stride, receptive field, and output feature size
should be considered as a whole when designing a network architecture.
- For a fully convolutional Siamese matching network, it is critical to handle the problem of perceptual inconsistency between the two network streams. There are two feasible solutions: one is to remove the padding operations in the networks; the other is to enlarge the size of both the input exemplar and search images, and then crop out the padding-affected features.
Addressing the issues of deeper networks in object tracking
New residual modules and architectures are introduced that allow deeper and wider backbone networks to unleash their power in Siamese trackers. First, a group of cropping-inside residual (CIR) units is designed that crop out padding-affected features inside the block (i.e., features receiving padding signals), thus preventing convolution filters from learning a position bias. Second, two kinds of network architectures, namely deeper and wider networks, are designed by stacking the CIR units. In these networks, the stride and the neuron receptive field are chosen to enhance localization precision.
Finally, we apply the designed backbone networks to two representative Siamese trackers: SiamFC [2]
and SiamRPN.
Deeper and Wider Siamese Networks
CIR Unit
The residual unit is augmented with a cropping operation, incorporated after the feature addition. The cropping operation removes features whose calculation is affected by the zero-padding signals. Since the padding size is one in the bottleneck layer, only the outermost features on the border of the feature maps are cropped out.
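Below is a minimal PyTorch sketch of the idea, assuming a standard ResNet-style bottleneck (channel widths are illustrative): the residual computation is unchanged, and the crop after the addition removes the one-pixel border that the zero-padding touched.

```python
import torch.nn as nn

class CIRUnit(nn.Module):
    """Sketch of a cropping-inside residual (CIR) unit."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.body(x) + x)
        # Padding size is 1 (the 3x3 conv), so only the outermost ring of
        # features received any padding signal; crop it after the addition.
        return out[:, :, 1:-1, 1:-1]
```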
Downsampling CIR Unit
The downsampling CIR unit (CIR-D) changes the convolutional stride from 2 to 1 within both the bottleneck layer and the shortcut connection. Cropping is again inserted after the addition operation to remove the padding-affected features. Finally, max-pooling is employed to perform spatial downsampling of the feature map. If cropping were only inserted after the addition operation, as in the plain CIR unit, without changing the position of downsampling, the features surviving the crop would not receive any signal from the outermost pixels of the input image.
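A matching sketch of the downsampling unit, under the same assumptions: the strided convolutions are reduced to stride 1, the crop follows the addition, and a max-pooling layer performs the downsampling afterwards, so the cropped features still cover the outermost input pixels.

```python
import torch.nn as nn

class CIRDownsampling(nn.Module):
    """Sketch of a downsampling CIR unit (CIR-D)."""
    def __init__(self, in_ch, out_ch, bottleneck):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck, 1, stride=1, bias=False),  # was stride 2
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(  # projection shortcut, also stride 1
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2, stride=2)  # downsampling moved after the crop

    def forward(self, x):
        out = self.relu(self.body(x) + self.shortcut(x))
        out = out[:, :, 1:-1, 1:-1]  # crop padding-affected border features
        return self.pool(out)
```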
CIR-Inception and CIR-NeXt
In CIR-Inception, a 1×1 convolution is inserted into the shortcut connection, and the features of the two branches are merged by concatenation rather than by addition. In CIR-NeXt, the bottleneck layer is split into 32 transformation branches, which are aggregated by addition. Moreover, for the downsampling units of CIR-Inception and CIR-NeXt, the modifications are the same as those in CIR-D (Fig. 3(b′)), where the convolution stride is reduced and max-pooling is added. These two multi-branch structures enable the units to learn richer feature representations.
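As one concrete example, here is a sketch of the CIR-NeXt unit using the standard ResNeXt trick of implementing the 32 branches as a grouped convolution (widths are illustrative; CIR-Inception would instead merge a 1×1-conv shortcut by concatenation).

```python
import torch.nn as nn

class CIRNeXtUnit(nn.Module):
    """Sketch of a CIR-NeXt unit: 32-branch transform via grouped conv."""
    def __init__(self, channels, width=128, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # The grouped 3x3 conv plus the 1x1 conv below is equivalent to
            # 32 parallel branches aggregated by addition (ResNeXt's
            # standard reformulation of split-transform-merge).
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.body(x) + x)
        return out[:, :, 1:-1, 1:-1]  # crop-after-addition, as in the CIR unit
```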
Results
Experiments show that, solely due to the proposed network architectures, SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original trackers on the OTB-15, VOT-16 and VOT-17 benchmarks, respectively.
Footnote
This post is me taking notes from the paper. Hence, all tables, results and wording in this post are taken directly from the paper.