2. OUTLINE
• A Simple Framework For Contrastive Learning Of Visual Representations
• Tracking Objects As Points
• S²SiamFC: Self-supervised Fully Convolutional Siamese Network For Visual
Tracking
• CL-MOT: A Contrastive Learning Framework For Multi-object Tracking
• Multi-object Tracking With Self-supervised Associating Network
• Self-supervised Learning For Multi-object Tracking
3. A Simple Framework For Contrastive Learning Of
Visual Representations
• SimCLR: a simple framework for contrastive learning of visual representations
• (1) composition of data augmentations plays a critical role in defining effective predictive tasks
• (2) introducing a learnable nonlinear transformation between the representation and the
contrastive loss substantially improves the quality of the learned representations
• (3) contrastive learning benefits from larger batch sizes and more training steps compared to
supervised learning.
4. A Simple Framework For Contrastive Learning Of
Visual Representations
A simple framework for contrastive learning of visual representations
Two separate data augmentation operators are
sampled from the same family of augmentations
and applied to each data example to obtain two
correlated views. A base encoder network f() and
a projection head g() are trained to maximize
agreement using a contrastive loss. After training
is completed, we throw away the projection head
g() and use encoder f() and representation h for
downstream tasks.
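As a concrete illustration of "maximize agreement using a contrastive loss", here is a minimal PyTorch sketch of SimCLR's NT-Xent objective (the temperature value and function name are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.
    z1, z2: (N, D) projection-head outputs g(f(x)) for the two views."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit-norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # a sample is not its own positive
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)  # index of the other view
    return F.cross_entropy(sim, targets)
```

As the caption notes, only the encoder f() and representation h are kept for downstream tasks; g() exists only to compute this loss.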
5. A Simple Framework For Contrastive Learning Of
Visual Representations
Data Augmentation
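A hedged sketch of the kind of augmentation composition the paper studies (random crop plus color distortion, the pairing it found most effective; the exact magnitudes below are assumptions):

```python
from torchvision import transforms

# One augmentation operator t ~ T; sampling this pipeline twice per image
# yields the two correlated views fed to the encoder.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
```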
9. Tracking Objects As Points
• Tracking is dominated by pipelines that perform object detection followed by temporal
association, also known as tracking-by-detection.
• CenterTrack, a simultaneous detection and tracking algorithm, is simpler, faster, and more accurate.
• It applies a detection model to a pair of images and detections from the prior frame.
• Given this minimal input, CenterTrack localizes objects and predicts their associations with the
previous frame.
• CenterTrack is simple, online (no peeking into the future), and real-time.
• Code: https://github.com/xingyizhou/centertrack
10. Tracking Objects As Points
The network takes the current frame, the previous frame, and a heatmap rendered from tracked object
centers as inputs, and produces a center detection heatmap for the current frame, the bounding box size
map, and an offset map. At test time, object sizes and offsets are extracted from peaks in the heatmap.
11. Tracking Objects As Points
The training objective is based on the focal loss; size prediction is learned by regression.
CenterNet also regresses to a refined local center location using an analogous L1 loss. The
overall loss of CenterNet is a weighted sum of all three loss terms: focal loss, size
regression, and local-location regression.
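The loss equations on this slide are rendered as images; for reference, the CenterNet focal loss over the predicted heatmap (with ground truth Y, prediction Ŷ, and focal hyperparameters α, β), together with the L1 size term, are:

```latex
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
\left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\
\left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise}
\end{cases}
\qquad
L_{\text{size}} = \frac{1}{N} \sum_{k=1}^{N} \left|\hat{S}_{p_k} - s_k\right|
```

The local-location refinement term has the same L1 form as the size term.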
12. Tracking Objects As Points
To associate detections through time, CenterTrack predicts a 2D displacement
as two additional output channels.
It learns this displacement using the same regression objective as size or
location refinement:
CenterTrack is first and foremost an object detector, and is trained as such. The
architectural changes from CenterNet to CenterTrack are minor: four additional
input channels and two output channels.
This allows CenterTrack to be fine-tuned directly from a pretrained CenterNet detector.
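A minimal sketch of those I/O changes (the stem/head layer structure here is an assumption; the real model modifies a DLA backbone):

```python
import torch
import torch.nn as nn

class CenterTrackIO(nn.Module):
    """Sketch: CenterNet-style heads plus CenterTrack's extra channels.
    Input = current frame (3) + previous frame (3) + prior-center heatmap (1),
    i.e. 4 more input channels than CenterNet's RGB input."""
    def __init__(self, c=64):
        super().__init__()
        self.stem = nn.Conv2d(7, c, 3, padding=1)
        # ... backbone (e.g. DLA) omitted ...
        self.heatmap = nn.Conv2d(c, 1, 1)       # center detection heatmap
        self.size = nn.Conv2d(c, 2, 1)          # bounding-box width/height
        self.offset = nn.Conv2d(c, 2, 1)        # local center refinement
        self.displacement = nn.Conv2d(c, 2, 1)  # the 2 new tracking output channels

    def forward(self, frame, prev_frame, prev_heatmap):
        x = torch.relu(self.stem(torch.cat([frame, prev_frame, prev_heatmap], 1)))
        return self.heatmap(x), self.size(x), self.offset(x), self.displacement(x)
```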
13. Tracking Objects As Points
• Follow the CenterNet training protocol and train all predictions as multi-task learning.
• Training on static image data: simulate the previous frame by randomly scaling and
translating the current frame (see the sketch after this list).
• To perform mono 3D tracking, it adopts the monocular 3D detection form of CenterNet.
• Specifically, train output heads to predict object depth, rotation (encoded as an 8-
dimensional vector), and 3D extent.
• Since the projection of the center of the 3D bounding box may not align with the center
of the object’s 2D bounding box, a 2D-to-3D center offset is also predicted.
• Backbone is DLA (Deep Layer Aggregation).
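A hedged sketch of the static-image trick mentioned above (the scale and shift magnitudes are assumptions):

```python
import random
import torchvision.transforms.functional as TF

def simulate_previous_frame(img, max_scale=0.05, max_shift=0.05):
    """Fabricate a 'previous frame' from a single static image by random
    scaling and translation, so the tracker can be trained without video."""
    _, h, w = TF.get_dimensions(img)
    scale = 1.0 + random.uniform(-max_scale, max_scale)
    dx = int(random.uniform(-max_shift, max_shift) * w)
    dy = int(random.uniform(-max_shift, max_shift) * h)
    return TF.affine(img, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
```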
18. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
• It adapts state-of-the-art supervised Siamese-based trackers into unsupervised ones
by exploiting the fact that an image and any cropped region of it form a natural pair
for self-training.
• It applies anti-clutter (AC) weighting, which adaptively adjusts the weight of each
training sample by determining whether the pair is informative or not.
• It proposes adversarial masking, which helps the tracker learn other context
information by adaptively blacking out salient regions of the target.
• Extend SiamFC (“Fully-convolutional Siamese Networks For Object Tracking”) to
S²SiamFC (Self-Supervised).
19. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
Illustration of the difference between (a) the common
unsupervised learning approach and (b) the proposed
self-supervised learning approach. In (b), the regions
partly overlap; positive samples are highlighted in red
and negative samples in black.
20. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
The challenges of self-supervised tracking are two-fold.
1) In the training phase, randomly cropping a region from an image as the target
template and extending the chosen region into the search image to form a training
pair may lead to a potential issue of “background content tracking”, due to the
randomness in sampling a training pair from the same image.
2) Self-supervised tracking is challenging because only a limited amount of
appearance variation can be captured during the training phase.
21. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
The training pipeline mainly consists of two stages: 1) Training pairs are sampled from the same image,
and the loss between the raw template and the search region is computed first. 2) The values with positive
labels in the response map are chosen to compute channel-wise saliency maps by backpropagation. One of
the thresholded saliency maps is chosen to mask the template image, and the masked template is fed into
the network again to learn appearance-robust features.
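A hedged sketch of the stage-1 pair sampling, where a random crop serves as the template and an enlarged region around the same center as the search image (the 127/255 sizes follow the SiamFC convention; the border handling is an assumption):

```python
import random
from PIL import Image

def sample_training_pair(img, template_size=127, search_size=255):
    """Form a self-supervised (template, search) pair from a single image:
    the template is a random crop; the search region extends the same center."""
    w, h = img.size
    half = search_size // 2
    cx = random.randint(half, max(half, w - half))  # keep the search crop inside the image
    cy = random.randint(half, max(half, h - half))
    t = template_size // 2
    template = img.crop((cx - t, cy - t, cx + t + 1, cy + t + 1))
    search = img.crop((cx - half, cy - half, cx + half + 1, cy + half + 1))
    return template, search
```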
22. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
Equations shown on this slide: the SiamFC loss function; the anti-clutter weighting of each
training sample, with its indicator function; and the resulting anti-clutter loss function.
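For reference, the SiamFC loss that the anti-clutter loss builds on (from the original SiamFC paper) is the mean logistic loss over the response map, where v[u] is the response at position u and y[u] ∈ {+1, −1} its label:

```latex
\ell(y, v) = \log\left(1 + e^{-yv}\right), \qquad
L(y, v) = \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{D}} \ell\big(y[u],\, v[u]\big)
```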
23. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
Illustration of the concept of “background
tracking”. The predicted response map is
resized to 255x255 for better visualization. (a)
denotes a meaningful pair, which has fewer large
positive values in the predicted response map
since the template region is unique within the
search region. (b) denotes a meaningless pair,
whose predicted response map tends to be flat
(many large positive values) since the template
region is a common pattern in the search region.
24. S²SiamFC: Self-supervised Fully Convolutional
Siamese Network For Visual Tracking
Adversarial appearance masking module.
Inspired by Grad-CAM, obtain the
saliency map in a self-guided
manner by backpropagating
from the locations of the ground-truth
positive labels in the response
map. Then choose one of those
saliency maps as a mask and force the
model to learn the other relevant
context information of the target. In this
way, the model is forced to correctly
predict the similarity when some
important details are not available.
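A hedged sketch of this self-guided masking (aggregating gradients over channels and a fixed threshold are simplifications; the paper computes channel-wise saliency maps and selects one):

```python
import torch

def adversarial_mask(model, template, search, pos_mask, threshold=0.5):
    """Grad-CAM-style self-guided masking of the template.
    model(template, search) -> response map; pos_mask selects the
    positive-label locations in that response map."""
    template = template.clone().requires_grad_(True)
    response = model(template, search)
    (response * pos_mask).sum().backward()              # backprop from positive locations
    sal = template.grad.abs().sum(dim=1, keepdim=True)  # per-pixel saliency
    sal = sal / (sal.max() + 1e-8)
    keep = (sal < threshold).float()                    # black out the most salient pixels
    return template.detach() * keep
```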
28. CL-MOT: A Contrastive Learning Framework For
Multi-object Tracking
• CL-MOT: A semi-supervised contrastive learning framework for MOT;
• Learn by clustering object embeddings from different views of static frames;
• Transfer an object detector to a tracker within this pretext learning paradigm;
• Code: https://github.com/danielzgsilva/CL-MOT
29. CL-MOT: A Contrastive Learning Framework For
Multi-object Tracking
• Comprised of an encoder-decoder backbone, along with separate object detection and embedding
branches, this one-shot tracking network predicts object bounding boxes and appearance embeddings in
a single forward pass;
• Leverage a fully convolutional ResNet-34 backbone with the deep layer aggregation (DLA) variant;
• Replace convolutions in the up-sampling layers with deformable convolutions to adapt the receptive field;
• Object detection is cast as a center-based keypoint estimation task, with regression to other properties
such as height and width.
• Therefore, three parallel regression heads are appended to the backbone network to predict an object
heatmap, object center offsets and bounding box sizes, respectively.
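A minimal sketch of those three parallel heads (channel widths and head depth are assumptions):

```python
import torch.nn as nn

def make_head(in_ch, out_ch, hidden=256):
    # small convolutional head appended to the backbone feature map
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_ch, 1),
    )

backbone_ch = 64
heatmap_head = make_head(backbone_ch, 1)  # object center heatmap
offset_head = make_head(backbone_ch, 2)   # center offsets (dx, dy)
size_head = make_head(backbone_ch, 2)     # bounding-box width and height
```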
30. CL-MOT: A Contrastive Learning Framework For
Multi-object Tracking
• CL-MOT treats tracking as an online multi-object re-identification task;
• The network learns a feature space that discriminates between object instances in a single scene;
• Combine this representation learning framework with an association algorithm in object tracking;
• It "learns to track" objects in a self-supervised manner, forgoing the identity annotations;
• The object detection branch of CL-MOT is trained in a supervised manner;
• A focal loss to the estimated object heatmap, as well as an L1 regression loss to the size and offset predictions;
• Once trained, it runs online and in real time by leveraging an appearance-based association algo;
• To finalize the association step, bipartite matching is performed on the combined cost matrix by the
Hungarian algorithm.
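The final matching step can be sketched with SciPy's Hungarian solver (the cost threshold and random cost matrix below are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost_matrix, max_cost=0.7):
    """Bipartite matching of tracks (rows) to detections (columns) on a
    combined cost matrix; weak matches above max_cost are rejected."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return [(r, c) for r, c in zip(rows, cols) if cost_matrix[r, c] <= max_cost]

cost = np.random.rand(3, 4)  # e.g. 3 active tracks vs 4 detections
print(associate(cost))
```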
32. Multi-object Tracking With Self-supervised
Associating Network
• Tracking by detection: Feature based object re-identification;
• Self-supervised learning using a lot of short unlabeled videos;
• The re-identification network is trained to solve the lack-of-training-data problem;
• A self-supervised associating tracker (SSAT): a tracking algorithm that trains the feature
extraction network, free of data constraints, in a self-supervised manner, and uses it directly to
re-identify targets for tracking, without a separate downstream task.
33. Multi-object Tracking With Self-supervised
Associating Network
It considers all the frames of one short video as
image patches with the same ID and trains the
network on an N-class classification task, where N
is the number of video clips. It then uses the output
of the backbone network, a 512-channel embedding,
as the feature of the input image to associate
detections with tracks.
34. Multi-object Tracking With Self-supervised
Associating Network
• Assume that the frames of a sufficiently short video have a similar appearance to each other; use
only short video clips to train the network.
• Some videos may nevertheless be composed of completely different frames, but it is expected that,
if enough data is accumulated and learned from, this resolves itself.
• Since self-supervised learning is free from labeling problems, it is possible to learn from a lot of
data.
• Set the short duration to 10 seconds; secured many short YouTube videos of about 10 seconds and
trained on them.
• In the case of a 30 fps video, data in which 300 frames of images are assigned one ID can be
obtained.
35. Multi-object Tracking With Self-supervised
Associating Network
• This process is very similar to the face recognition task;
• Thus train the network by referring to CosFace (“Cosface: Large Margin Cosine Loss For Deep
Face Recognition”), which is simple and effective in face recognition;
• The backbone network is ResNet-50, which extracts 512-channel features;
• The loss function is the large margin cosine loss (LMCL), which is also the loss function of
CosFace (see the sketch after this list);
• After training, when the network is applied to the MOT task, the 512-channel feature output by
the backbone is used to compare each patch with tracks.
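A hedged sketch of LMCL (the scale s and margin m below are the CosFace paper's defaults; this paper's exact values are not stated here):

```python
import torch
import torch.nn.functional as F

def lmcl_loss(embeddings, weight, labels, s=64.0, m=0.35):
    """Large Margin Cosine Loss (CosFace).
    embeddings: (N, 512) backbone features; weight: (num_classes, 512)
    class-center matrix, one class per short video clip; labels: (N,) clip IDs."""
    cos = F.normalize(embeddings) @ F.normalize(weight).t()  # (N, C) cosines
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)  # subtract m from the target cosine only
    return F.cross_entropy(s * (cos - margin), labels)
```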
36. Multi-object Tracking With Self-supervised
Associating Network
• A tracker based on CenterTrack,
which performed well in MOT;
• CenterTrack uses CenterNet to obtain
or refine detection results and then
associates the results with its own method;
• In addition, Un-super-track, an
unsupervised method, is also based
on CenterTrack;
• Note: the detector is trained in a
supervised manner, while the association
network is trained in a self-supervised manner.
39. Self-supervised Learning For Multi-object Tracking
• Under the detect-to-track framework: assume that an object detector, trained on image-level
bounding box annotations, is available, but train a tracking model using only unlabeled video;
• Dual-tracker consistency: a self-supervised training method. At a high level, the approach creates
a self-supervisory signal by applying two instances of a tracker model (where the instances share
the same parameters) to two distinct input variations extracted from one video sequence; the
tracker is then trained to produce similar outputs over the video sequence under both inputs;
• Construct two distinct inputs for one video sequence, where each input is a variation of the video
sequence where different information has been hidden;
• Then apply two instances of a tracker independently on each input, and train the model to
produce consistent outputs.
40. Self-supervised Learning For Multi-object Tracking
• Adopt a tracker model that is similar to “multi-object tracking with neural gating using bilinear LSTM”;
• A self-supervisory signal for training an RNN tracker model through a three-step process:
• 1) During training, repeatedly sample a random video segment {I0, I1, …, In}; let Dk be the detections
automatically computed in Ik, each corresponding to a window of Ik (the image patch inside its box); apply
an input-hiding scheme to select two input variations for the video segment, where each variation is a
modified sequence of detections in the frames;
• 2) apply two instances of the tracker model through each input variation to derive two probabilistic
tracking outputs, represented as transition matrices.
• 3) compare the transition matrices with dot product similarity to update the RNN parameters.
42. Self-supervised Learning For Multi-object Tracking
• The tracker maintains a set of active tracks that have not yet left the camera frame;
• Given a video segment {I0,…,In}, and sets of detections Dk detected in each frame Ik, to initialize
the tracking process, create a track ti for each detection d0i in the first video frame I0;
• On each subsequent frame Ik, the model outputs a probability that each active track ti
corresponds to each detection.
• At inference time, apply the Hungarian algorithm to match active tracks with detections based on
these probabilities.
• For each detection in Ik that no track matches to, create a new active track for that detection.
• Similarly, if a track does not match any detection for t_age consecutive frames, remove it from the
active set; thus, t_age is a threshold on the maximum age of a track since it last matched some
detection.
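A minimal sketch of this track bookkeeping, with t_age as the staleness threshold (the data structures are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple    # last matched detection box
    age: int = 0  # frames since this track last matched a detection

def update_tracks(tracks, matches, detections, next_id, t_age=5):
    """One frame of bookkeeping; `matches` holds (track_index, detection_index)
    pairs from the Hungarian matching on the model's probabilities."""
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    for t, d in matches:                            # refresh matched tracks
        tracks[t].box, tracks[t].age = detections[d], 0
    for i, trk in enumerate(tracks):                # age unmatched tracks
        if i not in matched_t:
            trk.age += 1
    tracks = [t for t in tracks if t.age <= t_age]  # drop stale tracks
    for d, det in enumerate(detections):            # new tracks for unmatched detections
        if d not in matched_d:
            tracks.append(Track(next_id, det))
            next_id += 1
    return tracks, next_id
```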
43. Self-supervised Learning For Multi-object Tracking
• During training, repeatedly sample segments of up to 16 consecutive video frames.
• Apply one of two input-hiding schemes (occlusion-based and visual spatial), to extract two distinct input
variations from a sampled video segment.
• Then apply instances of the tracker model on each variation, where the instances share the same model
parameters.
• Dual-tracker consistency trains the model by enforcing it to produce similar outputs on both inputs.
• To represent tracker outputs, compute a transition matrix M, whose elements give the probability that
each active track ti matches each detection.
• When applying the model over video segments during training, update tracks with new detections
based on the scores output by the model on intermediate frames, but do not create additional active
tracks on frames after I0; thus, each active track ti corresponds directly to a detection in I0.
• Then, applying two instances of the tracker yields two transition matrices A and B.
• Train the model (CNN, RNN, and matching network) end-to-end to maximize the dot-product similarity
between these matrices.
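A hedged sketch of that objective (the normalization details are assumptions; the paper's exact similarity may differ):

```python
import torch

def consistency_loss(A, B):
    """Dual-tracker consistency: A and B are (T, D) transition matrices from
    the two tracker instances, where row i is track ti's probability
    distribution over the detections. Maximize their row-wise dot products."""
    return -(A * B).sum(dim=1).mean()

# example: two stochastic transition matrices for 5 tracks and 8 detections
A = torch.softmax(torch.randn(5, 8), dim=1)
B = torch.softmax(torch.randn(5, 8), dim=1)
print(consistency_loss(A, B))
```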
44. Self-supervised Learning For Multi-object Tracking
Occlusion-based hiding produces input variations with
different subsequences of occluded frames, during which
all detections are hidden from the tracker. It also
applies the tracker independently before and after a
hand-off frame (I4 and I2) and merges the outputs
through the matrix product.
Visual-spatial hiding: one variation includes
only visual inputs, and the other includes
only spatial inputs.