Ukrainian Catholic University
Faculty of Applied Sciences
Data Science Master Program
January 22nd
Abstract. The thesis introduces the reader to the concepts of edge computing in the context of the person re-identification and tracking problem. It describes the challenges, limitations, and current state-of-the-art solutions. The author proposes a pipeline for the task, runs several experiments to validate different parts of the system, and provides a theoretical explanation of the person re-identification process in an overlapping multi-camera environment.
2. Motivation and goal
Motivation: Studies in the field of real-time multi-object tracking could open up new areas of optimization in architecture, public spaces, and retail marketing.
Goal: find a way to assign an identifier to a moving person going from one camera view to another.
3. Introduction
Cloud computing:
● Auto scaling
● Easy to use
● Various ready-to-use tools
● Durability
● Pay as you go
Edge computing:
● Latency reduction
● Bandwidth
● Reliability
● Compliance
● Security
● Cost reduction
GDPR Hivecell
9. The proposed method
1. Person detection (YOLOv3 pre-trained on the COCO dataset)
2. Re-identification (the Deep part of the DeepSort framework, pre-trained on the MARS dataset)
3. Streaming (Kafka)
4. Multi-camera tracking (the Sort part of the DeepSort framework)
5. Overlapping multi-camera tracking (a custom algorithm)
10. Person ID + Person tracking (DeepSort)
- Designed for single-camera tracking.
- Extracts features using a CNN.
- Uses cosine similarity or Euclidean distance as the distance metric.
- Uses Kalman filtering for tracking.
- Uses the Hungarian algorithm for assignment.
- Incrementally increases track_id.
13. Overlapping multi-camera tracking
(the idea)
What data could be used to find the same person on several cameras?
- Color distribution
- Nearby people
- Feature vectors
- Absolute position across cameras
16. Overlapping multi-camera tracking
(algorithm)
On each step of the multi-object tracker:
1. Calculate the cosine similarity between the features of each pair of tracks.
2. Calculate the distance between each pair of track centers:
a. map track positions to global coordinates;
b. set all distances greater than some threshold to 0;
c. normalize the distances into the (0, 1) range;
d. convert the values so that detections closer to each other have a higher score.
3. Calculate the weighted sum of the cosine similarity and the distance score.
4. If for any two detections this sum exceeds the threshold, mark them as candidates.
5. For every column, select only the entry closest to 1 and assign both detections the same track_id.
18. Overlapping multi-camera tracking
(how to tune the algorithm?)
Multi-camera association coefficient - regulates the influence of the features.
Multi-camera association threshold - the value above which two tracks are considered to belong to the same person.
Multi-camera distance threshold - the maximum Euclidean distance between two tracks at which they can still be considered to belong to the same person.
20. Pros / Cons
of the system
Pros:
1. Modularity
2. Scalability
3. Fault tolerance
Cons:
1. Contains non-generalizable parts
2. Requires video stream synchronization
21. Addressing reviewer comments
- Why YOLOv3?
- Why not use CNNs for tracking?
- Why is the main focus on the engineering part?
- What are the metrics?
- What is the architecture of the re-id network?
23. Contribution
- Created a dataset of 4000 labeled images
- Designed the pipeline for multi-camera
tracking.
- Proposed the algorithm for overlapping
multi-camera person re-id and tracking.
24. Future work
1. Proposed method improvement
2. Adding data visualization tools
3. Framework for multi-camera
person re-identification
I would like to present my master thesis which is called:
“Person re-identification in a top-view multi-camera environment”
The real-time analysis of movement in public places could help municipal governments reduce traffic jams, airport managers deal with long lines, and retail companies find the right places for their goods.
The creation of movement maps could show the bottlenecks of public space design.
Goal: find a way to assign an identifier to a moving person going from one camera view to another.
Let’s start with some definitions.
The last decade was a rise of cloud computing.
Cloud computing is an approach to perform computational operations remotely:
Auto scaling - your app will be scaled depending on the load.
Easy to use - cloud providers try to make the integration process as simple as possible.
Various ready-to-use tools - a lot of tools are available.
Durability - usually there is no downtime on cloud instances.
Pay as you go - pay for running time only.
But this decade will be the time of edge computing.
Edge computing is an approach to perform some computational operations on premises:
Latency reduction - roughly 1 ms of latency per 60 miles of distance.
Bandwidth - if there is a lot of data, transferring it to the cloud increases the bill drastically.
Reliability - being on the edge helps reduce data loss.
Compliance - governmental regulations can be met: no raw data is stored in the cloud.
Security - some data is considered too sensitive to transfer.
Cost reduction - doing part of the computations on the edge reduces the cloud provider bill.
GDPR:
Edge computing helps to solve the problem with GDPR because no raw data is stored on the servers.
The data is not personalized.
Hivecell:
Ricker-Lyman Robotics is interested in the project because of their edge computing devices called Hivecell. It is a computation unit with a GPU on board and the ability to scale computational power linearly with the number of units in the cluster.
Detection is a bounding box selected by the object detection model.
Identification is a process of assigning an identifier to the detected person on a frame.
Re-identification is a process of recognizing an individual on different frames.
Track is an entity representing a moving object throughout the image sequence.
Multi-person tracking is the process of detecting a moving person and tracking them throughout the frame sequence.
Multi-camera person tracking is the process of tracking a person's movement across several data sources.
Challenges:
Identity switches - changes of an object's identifier throughout the video stream.
Fragmentation issue - fragmentation occurs when some detections are missed but identity switches do not happen; this leads to tracking fragmentation.
Stream synchronization - delivering messages with a delay or in a broken order could have a substantial negative impact on the tracking system in a multi-camera environment.
Person detection output is a set of bounding boxes where people are located in the current frame. The defining characteristics of the detection phase are accuracy, measured by IoU or mAP, and running speed, measured in frames per second.
Identification output is a set of vectors describing the current detections (one per bounding box). The task of this component is to output similar vectors for similar detections.
Streaming component is responsible for transferring metadata from a single camera to the multi-camera tracker.
Multi-camera tracker is the component that deals with person tracking. Metadata is retrieved from several sources.
Analytics is the component responsible for visualizing the final tracking result and deriving insights from it.
In the method described further I will focus on the first four parts, but first let us check the existing methods.
Ricker Lyman Robotics gave me 5 hours of video from 5 cameras recorded in August 2019 in the UCU dining hall.
So, I labeled almost 4000 images manually and used them to train the object detection model.
So, here is an architecture of the solution.
The system is split into 2 parts. Systems of this type could be called "hybrid".
All GPU-dependent components are located on the edge, while all parts that require synchronization are placed in the cloud.
On the edge part:
object detection
feature extraction (identification)
Then the data is streamed to the cloud part.
The cloud part:
consumes messages from every camera (one by one)
does multi-person tracking
does global positioning
And the resultant metadata is streamed to the analytics tools.
I’ve used:
YOLOv3 for person detection because of its outstanding running time.
DeepSort framework to deal with feature extraction and tracking.
Kafka for streaming.
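To illustrate the streaming step, the per-frame metadata sent from the edge could be serialized roughly like this (a minimal sketch; the field names, topic, and broker address are my assumptions, not the thesis' actual schema):

```python
import json
import time

def build_message(camera_id, frame_no, detections):
    """Package one frame's detections (bounding boxes plus feature
    vectors) as a JSON payload for a Kafka topic keyed by camera_id."""
    payload = {
        "camera_id": camera_id,
        "frame_no": frame_no,
        "timestamp": time.time(),
        "detections": [
            {"bbox": d["bbox"], "feature": d["feature"]} for d in detections
        ],
    }
    return json.dumps(payload).encode("utf-8")

# With kafka-python one would then send it, for example:
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("tracking-metadata", key=b"1", value=msg)
```

Keying messages by camera_id keeps each camera's frames ordered within a partition, which matters for the synchronization concerns mentioned earlier.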
Let us focus on the core of the system: the DeepSort framework and the algorithm for overlapping multi-camera tracking.
Designed for single-camera tracking.
Extracts features using a CNN.
Uses cosine similarity or Euclidean distance as the distance metric.
Uses Kalman filtering for tracking.
Uses the Hungarian algorithm for assignment.
Incrementally increases track_id.
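The per-frame matching these bullets describe can be sketched as follows (a simplified illustration: a brute-force optimal assignment stands in for the Hungarian algorithm, which finds the same optimum in O(n^3); the function names are mine, not DeepSort's API):

```python
import math
from itertools import permutations

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def assign(track_features, detection_features):
    """Return (track_idx, detection_idx) pairs minimizing the total
    cosine distance. Assumes at least as many detections as tracks;
    brute force is exponential and only suitable for tiny inputs."""
    n = len(track_features)
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(detection_features)), n):
        cost = sum(
            cosine_distance(track_features[i], detection_features[j])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(i, j) for i, j in enumerate(best)]
```

In practice one would use `scipy.optimize.linear_sum_assignment` on the distance matrix instead of enumerating permutations.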
Changes made to the framework:
Moved the Deep part (the CNN) to the edge.
Prepared it for use in a multi-camera setup.
Implemented the overlapping multi-camera tracking algorithm.
There are 2 corner cases to deal with in the system:
Problem: when a person appears in two cameras simultaneously, the same ID has to be kept or assigned to several tracks at once.
Problem: when a person goes from one camera to another through a blind zone.
The second problem is addressed by default. Each track has a so-called "saturation period": the number of missed frames allowed before the track moves to the "Deleted" state. It is configurable and can be tuned to the specific conditions.
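The saturation-period logic can be sketched as a small state machine (my simplified illustration; in DeepSort's own code the analogous parameter is called max_age):

```python
class Track:
    """Minimal track lifecycle: stays Active until it misses more
    consecutive frames than the saturation period allows."""

    def __init__(self, track_id, saturation_period=30):
        self.track_id = track_id
        self.saturation_period = saturation_period  # max missed frames
        self.missed = 0
        self.state = "Active"

    def mark_missed(self):
        """No detection matched this track on the current frame."""
        self.missed += 1
        if self.missed > self.saturation_period:
            self.state = "Deleted"

    def mark_matched(self):
        """A detection matched; reset the miss counter."""
        self.missed = 0
        self.state = "Active"
```

A person crossing a blind zone shorter than the saturation period therefore keeps their track_id without any extra logic.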
So, now let us check how tracking works without the overlapping multi-camera tracking algorithm.
What data could be used to find the same person?
Color distribution
Nearby people
Feature vectors
Absolute position across cameras
Color distribution could vary significantly depending on the camera angle and the pose of the person.
Nearby people could be seen on one camera but be absent on another.
So, I decided to use the last 2 of the possible features: feature vectors and absolute position.
I created global offsets, keyed by camera_id, to map relative in-camera coordinates to global ones.
And the feature vectors have already been introduced by the DeepSort framework.
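The coordinate mapping can be illustrated like this (a minimal sketch; the offset and scale values are hypothetical, not the actual calibration from the thesis):

```python
# Per-camera offsets and scale mapping in-camera pixel coordinates
# to a shared global coordinate system (hypothetical values).
CAMERA_OFFSETS = {
    0: {"dx": 0.0, "dy": 0.0, "scale": 0.01},
    1: {"dx": 5.0, "dy": 0.0, "scale": 0.01},
}

def to_global(camera_id, x, y):
    """Map a point from camera-relative pixels to global coordinates."""
    o = CAMERA_OFFSETS[camera_id]
    return (o["dx"] + x * o["scale"], o["dy"] + y * o["scale"])
```

With top-view cameras a per-camera translation plus scale is often sufficient; tilted cameras would need a full homography instead.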
λ - multi-camera association coefficient, 0 ≤ λ ≤ 1,
v - multi-camera association threshold, 0 ≤ v ≤ 1,
A - camera coefficient matrix. It is used to avoid comparing tracks from the same camera. Entries of A are 0 or 1,
B - cosine similarity matrix entry,
C - absolute distance score.
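Combining these symbols, the association score presumably takes the following form (my reconstruction from the definitions above; the slide does not state the formula explicitly):

```latex
S_{ij} = A_{ij}\,\bigl(\lambda B_{ij} + (1-\lambda)\,C_{ij}\bigr),
\qquad
\text{assign the same } \mathit{track\_id} \text{ when } S_{ij} > v,
```

where $A_{ij}=0$ zeroes out pairs of tracks from the same camera, and λ balances appearance similarity against spatial proximity.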
On each step of the multi-object tracker:
calculate the cosine similarity between the features of each pair of tracks;
calculate the distance between each pair of track centers:
map track positions to global coordinates;
set all distances greater than some threshold to 0;
normalize the distances into the (0, 1) range;
convert the values so that detections closer to each other have a higher score;
calculate the weighted sum of the cosine similarity and the distance score;
if for any two detections this sum exceeds the threshold, assign both detections the same track_id;
for every column, select only the entry closest to 1.
The algorithm does not run for detections with the same camera_id.
If a camera already has an object with some id, no other object can be marked with this id in this particular camera.
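The steps above can be sketched in code (a simplified illustration, not the thesis implementation: it merges the best-scoring pair greedily, the parameter values are hypothetical, and centers are assumed to already be in global coordinates):

```python
import math

LAMBDA = 0.5             # multi-camera association coefficient (λ)
ASSOC_THRESHOLD = 0.7    # multi-camera association threshold (v)
DIST_THRESHOLD = 3.0     # multi-camera distance threshold

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def associate(tracks):
    """tracks: list of dicts with 'camera_id', 'feature',
    'center' (global coordinates), and 'track_id'."""
    n = len(tracks)
    scores = {}
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs from the same camera (the role of matrix A).
            if tracks[i]["camera_id"] == tracks[j]["camera_id"]:
                continue
            b = cosine_similarity(tracks[i]["feature"], tracks[j]["feature"])
            dist = math.dist(tracks[i]["center"], tracks[j]["center"])
            # Distance score: 1 when centers coincide, 0 beyond the threshold.
            c = max(0.0, 1.0 - dist / DIST_THRESHOLD)
            scores[(i, j)] = LAMBDA * b + (1 - LAMBDA) * c
    # Merge the best-scoring pairs first; each track merges at most once,
    # which enforces one shared id per camera.
    merged = set()
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > ASSOC_THRESHOLD and i not in merged and j not in merged:
            tracks[j]["track_id"] = tracks[i]["track_id"]
            merged.update({i, j})
    return tracks
```

Tracks that are both visually similar and spatially close in global coordinates thus end up sharing a track_id, which covers the "person visible in two cameras at once" corner case.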
Multi-camera association coefficient - regulates the influence of the feature vector versus the absolute position on the final decision.
Multi-camera association threshold - the value above which two tracks are considered to belong to the same person.
Multi-camera distance threshold - the maximum Euclidean distance between two tracks at which they can still be considered to belong to the same person.
Let us check the results. Notice the IDs.
Pros:
Modularity - each part of the system can be replaced according to the needs.
Scalability - heavy computation tasks are done on premises, which means that scaling the edge part is easy, while increasing the resources available to the multi-camera tracker allows dealing with a growing load on the cloud part.
Fault tolerance - the Kafka setup allows tolerating the loss of real-time data from a particular camera, and it does not mean the data will be lost completely.
Cons:
Contains non-generalizable parts - every new place and situation requires a new setup: basically, new mappings to global coordinates and object detection model retraining.
Lags in the video have to be avoided as much as possible - since Kafka is used, we could potentially handle this problem.
Why YOLOv3? Because it can be used for real-time processing due to its running time.
Why not use CNNs for tracking? The Kalman filter can run on a CPU, which drastically reduces the cloud provider costs, and it can be used for real-time tracking.
Why is the main focus on the engineering part? Because the solution consists of several subtasks, and the main purpose was to "find the way to assign an identifier to a moving person going from one camera to another" and to figure out how to transfer the data from the edge to the cloud. To be honest, I am an engineering guy, and I looked for an engineering topic first of all.
What are the metrics?
YOLOv3: mAP 51.5 on the COCO dataset.
YOLOv3: mAP 59 on the custom dataset.
What is the architecture of the re-id network? It is described in the master thesis paper.
A custom dataset is not a good idea overall, and it takes time to label it.
It is better to solve one particular problem than to try to combine a lot of tools. Every subproblem is a point of failure.
To be honest, this work should have been called "Person tracking in a top-view multi-camera environment", and it is my fault that I chose such a misleading name.