People detection in a video

Detect Known People in a Video
Yonatan Katz
My journey to

The Journey
● Deep Learning
● Face Detection
● Shot Boundary Detection
● Face Recognition
● Object Tracking
● Computer Vision

The Problem: When does a specific person appear in a video?
D. Trump: [0:07-1:23, 1:52-2:03]
B. Obama (nickname: Obamush): [0:07-1:23]
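The time ranges above are the output we are after. A minimal sketch of how per-frame recognition results could be merged into such ranges is shown here; the function names, the `max_gap` parameter and the frame numbers are hypothetical and only illustrate the idea.

```python
def frames_to_intervals(frame_indices, fps, max_gap=1):
    """Merge frame indices into (start_seconds, end_seconds) intervals.

    Frames that are at most `max_gap` frames apart end up in the same interval.
    """
    intervals = []
    for idx in sorted(frame_indices):
        if intervals and idx - intervals[-1][1] <= max_gap:
            intervals[-1][1] = idx          # extend the current interval
        else:
            intervals.append([idx, idx])    # start a new interval
    return [(start / fps, end / fps) for start, end in intervals]


def format_time(seconds):
    """Format seconds as m:ss, the notation used on the slide."""
    minutes, secs = divmod(int(seconds), 60)
    return "{}:{:02d}".format(minutes, secs)


# Hypothetical input: frames in which one person was recognized, at 25 fps.
hits = list(range(175, 2076)) + list(range(2800, 3076))
ranges = frames_to_intervals(hits, fps=25)
print(", ".join("{}-{}".format(format_time(s), format_time(e)) for s, e in ranges))
# prints: 0:07-1:23, 1:52-2:03
```

Raising `max_gap` tolerates a few frames in which the face was missed without splitting the interval.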

Journey Outline
1. We will parse the video into frames
2. We will detect faces in the frame
3. We will try to recognize the faces
4. We will track the faces back and forth in the video
a. We will split the video into shots

Parsing the video
(or: choosing the technology)

● Why Python?
● OpenCV
● NumPy
● Code example:
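    import cv2

    video = cv2.VideoCapture(video_path)
    # Read the frame size (the old cv2.cv.CV_CAP_PROP_* constants are OpenCV 2 only)
    width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    while True:
        # ret is False when there are no more frames
        ret, frame = video.read()
        if not ret:
            break
        cv2.imshow('video', frame)
        # Press 'q' to stop early
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    video.release()
    cv2.destroyAllWindows()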

Detect Faces
(sounds complicated, it’s not)

Let’s examine the code first
    import dlib
    from skimage import io

    win = dlib.image_window()
    image = io.imread(file_name)
    face_detector = dlib.get_frontal_face_detector()
    # The second argument upsamples the image once, so smaller faces are found
    detected_faces = face_detector(image, 1)
    win.set_image(image)
    # Draw a rectangle around every detected face
    for face_rect in detected_faces:
        win.add_overlay(face_rect)
    dlib.hit_enter_to_continue()
The MAGIC is here. You don’t need to invent anything

But how does it really work??
Taken from this great Medium post
1. Convert to grayscale image
2. Look at every pixel, and the pixels surrounding it
3. Find the direction where pixels become darker
4. Convert the image to “darker vectors”
ONLY THE “DARKNESS RATIO” MATTERS - works on both dark and bright images!
5. Reduce the size of the vector
6. Compare patterns!
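The steps above describe what is usually called a histogram of oriented gradients (HOG). The following NumPy sketch shows the idea under simplifying assumptions: the function name, the cell size and the number of bins are illustrative choices, and this is not dlib's actual implementation.

```python
import numpy as np


def gradient_orientation_histograms(gray, cell=8, bins=9):
    """Per-cell histograms of gradient direction, weighted by gradient strength."""
    gray = np.asarray(gray, dtype=np.float64)
    # Brightness change along rows (gy) and columns (gx) for every pixel
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    # Unsigned direction in [0, 180): "darker" and "brighter" share one axis
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            block = (slice(r * cell, (r + 1) * cell),
                     slice(c * cell, (c + 1) * cell))
            # One small vector per cell replaces cell*cell pixels
            hist[r, c] = np.bincount(bin_index[block].ravel(),
                                     weights=magnitude[block].ravel(),
                                     minlength=bins)
    # Normalize each cell so only the relative change matters
    norms = np.linalg.norm(hist, axis=2, keepdims=True)
    return hist / np.maximum(norms, 1e-9)


# A dark and a bright version of the same ramp give the same description
ramp = np.tile(np.arange(16, dtype=float), (16, 1))
dark = gradient_orientation_histograms(ramp)
bright = gradient_orientation_histograms(ramp * 3 + 40)
print(np.allclose(dark, bright))
```

The normalization at the end is what makes the description insensitive to overall brightness and contrast.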

Recognize Faces
(Deep learning. Not only a buzzword)

Intro to machine learning
1. Train:
a. Find the data that may affect the end result (“features”)
b. Train a model that takes as an input:
i. List of features
ii. The end result (“label”)
c. Get the weights for each feature
2. Test:
a. Apply the weights to your data
b. Compute the most relevant result
I’m a man, 32 years old. I watched 32 drama movies, 3 comedy movies (on average, I saw only 75% of these boring movies) and no action movies. What will YouTube recommend to me?
1. Borat
2. Hit
3. Titanic
Do you want to be a data scientist?
13 x Feature1 + 5 x Feature2 + … = score
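As a rough illustration of the "weights times features" idea, here is a tiny scoring sketch. The feature names and every weight are made up for the example; a real model would learn the weights from training data.

```python
def score(features, weights):
    """Weighted sum: weight1 * feature1 + weight2 * feature2 + ..."""
    return sum(weights[name] * value for name, value in features.items())


# Features of the viewer from the slide
viewer = {"dramas_watched": 32, "comedies_watched": 3, "action_watched": 0}

# Hypothetical weights, one set per candidate movie
movie_weights = {
    "Borat":   {"dramas_watched": 0, "comedies_watched": 5, "action_watched": 1},
    "Hit":     {"dramas_watched": 0, "comedies_watched": 1, "action_watched": 5},
    "Titanic": {"dramas_watched": 3, "comedies_watched": 0, "action_watched": 0},
}

scores = {movie: score(viewer, w) for movie, w in movie_weights.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these invented weights the drama wins, which matches the intuition on the slide.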

Intro to deep learning
● How does a child learn to ride a bicycle?
● A neural network tries to imitate the human learning process
● Invented by a psychologist - Osim Historia ("Making History")

Deep Learning in computer vision
● Classic problem: what is this number?
● Do these images represent the same number?

Back to our journey
● The problem: recognize people
Donald Trump of course!
Like, duh!
I have no idea. But he is pretty
similar to this weirdo guy:

Moment before we jump into code...
● In order to compare faces, we need to center the face (“apples to apples”)
● In order to do so, we need to find landmarks
Alignment code example can be found here
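To give a feel for what alignment means, here is a simplified two-point sketch: it assumes the two eye landmarks are already known and computes the rotation that makes the eye line horizontal. The coordinates and function names are hypothetical, and real alignment code typically uses more landmarks and a full affine transform.

```python
import math


def eye_tilt_degrees(left_eye, right_eye):
    """Angle between the line through the eyes and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, angle_deg):
    """Rotate `point` around `center` by -angle_deg, which undoes the tilt."""
    a = math.radians(-angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(a) - y * math.sin(a),
            center[1] + x * math.sin(a) + y * math.cos(a))


# Hypothetical landmark positions (x, y) in pixels
left_eye, right_eye = (30.0, 40.0), (70.0, 60.0)
center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)

tilt = eye_tilt_degrees(left_eye, right_eye)
new_left = rotate_point(left_eye, center, tilt)
new_right = rotate_point(right_eye, center, tilt)
# After the rotation both eyes sit on the same horizontal line
print(round(new_left[1], 6) == round(new_right[1], 6))
```

Once every face is rotated (and scaled) the same way, two face crops can be compared "apples to apples".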

From the OpenFace website:
OpenFace is a Python and Torch implementation of face recognition with deep neural networks and is based on the CVPR 2015 paper
FaceNet: A Unified Embedding for Face Recognition and Clustering by Florian Schroff, Dmitry Kalenichenko, and James Philbin at Google.
Torch allows the network to be executed on a CPU or with CUDA.
Nightmare to install :(

Finally - CODE !
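    import cv2
    import numpy as np
    import openface

    # args comes from the script's argument parser
    align = openface.AlignDlib(args.dlibFacePredictor)
    net = openface.TorchNeuralNet(args.networkModel, args.imgDim)

    def getRep(imgPath):
        # OpenCV loads images as BGR; convert to RGB
        bgrImg = cv2.imread(imgPath)
        rgbImg = cv2.cvtColor(bgrImg, cv2.COLOR_BGR2RGB)
        # Find the largest face and align it by its outer eyes and nose
        bb = align.getLargestFaceBoundingBox(rgbImg)
        alignedFace = align.align(args.imgDim, rgbImg, bb,
                                  landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
        # A forward pass through the network returns the face representation
        rep = net.forward(alignedFace)
        return rep

    # Squared L2 distance: smaller means more similar faces
    d = getRep(img1) - getRep(img2)
    print("distance between representations: {:0.3f}".format(np.dot(d, d)))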
Full code can be found here

Summary
● Assuming we know who’s going to be in the video, we download images of these people
● We run over the video frame by frame:
○ For each frame, search for faces
■ For each face -
● Apply some image manipulation to align the face image
● Get its representation from the neural network (OpenFace)
● Compare the representation with the representations of the pre-downloaded images
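The last step, comparing a face against the pre-downloaded images, could look roughly like the sketch below. It uses the same squared distance as the OpenFace example; the `identify` function, the threshold value and the toy 3-dimensional vectors are assumptions for illustration (real representations are much longer).

```python
import numpy as np


def squared_distance(a, b):
    """Squared L2 distance between two representations."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(d, d))


def identify(face_rep, known_reps, threshold=1.0):
    """Return the name of the closest known person, or None if nobody is close enough."""
    best_name, best_dist = None, float("inf")
    for name, reps in known_reps.items():
        for rep in reps:
            dist = squared_distance(face_rep, rep)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None


# Toy representations standing in for the network output
known = {
    "D. Trump": [[1.0, 0.0, 0.0]],
    "B. Obama": [[0.0, 1.0, 0.0]],
}
print(identify([0.9, 0.1, 0.0], known))   # close to the first person
print(identify([0.0, 0.0, 1.0], known))   # far from everyone
```

The threshold decides when a face counts as "I have no idea"; it has to be tuned on real data.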

Object Tracking
(or: why recognition in a video is different from looping an
image recognition algorithm)

Problem Definition
● We are good at finding frontal faces, but not profile faces
○ There are some models that support profile pictures as well
● It is problematic to compare profile pictures
○ We need to train a model (is there a data scientist in the room?)
○ We would need too many profile pictures…
● What if our dear president-elect decides to turn around?

Object Tracking
● Dlib has an API for tracking objects
● We need to run forward and backward once we find a face
● Problem: if there is a camera cut in the middle, the tracker doesn’t know.
    import cv2
    import dlib

    video = cv2.VideoCapture(video_path)
    # Read the first frame; face_rectangle is the face found in it by the detector
    ret, frame = video.read()
    tracker = dlib.correlation_tracker()
    tracker.start_track(frame, face_rectangle)
    while True:
        ret, frame = video.read()
        if not ret:
            break
        # Move the tracker to the face position in the new frame
        tracker.update(frame)
        pos = tracker.get_position()
        bl = (int(pos.left()), int(pos.bottom()))
        tr = (int(pos.right()), int(pos.top()))
        cv2.rectangle(frame, bl, tr, color=(153, 255, 204), thickness=3)
        cv2.imshow('video', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    video.release()
    cv2.destroyAllWindows()
From the dlib website:
Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments.

Shot Boundary Detection
(last known stop in our journey)

Movie Shots
● We need shot boundaries in order to stop the object trackers at a cut
● Shot transition types:
○ Camera cut
○ Dissolve
○ Wipe
○ Fade-in / Fade-out
● Tools that do shot detection:
○ FFmpeg
○ Scene Segmentation
● Not good enough...
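To show why the trackers need shot boundaries, here is a small sketch that limits forward and backward tracking to the shot in which a face was recognized. The function name and the example frame numbers are hypothetical; it assumes the shot start frames are already known and sorted, starting at frame 0.

```python
from bisect import bisect_right


def tracking_window(face_frame, shot_starts, total_frames):
    """Return (first_frame, last_frame) of the shot that contains face_frame.

    Tracking backward should stop at first_frame, tracking forward at last_frame.
    """
    i = bisect_right(shot_starts, face_frame) - 1
    first = shot_starts[i]
    if i + 1 < len(shot_starts):
        last = shot_starts[i + 1] - 1
    else:
        last = total_frames - 1
    return first, last


# Hypothetical video: 450 frames, new shots start at frames 0, 120 and 300
shot_starts = [0, 120, 300]
print(tracking_window(150, shot_starts, total_frames=450))  # prints: (120, 299)
```

A face recognized at frame 150 is then tracked only between frames 120 and 299, so the tracker never follows a "face" across a camera cut.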

Comparison Metrics
● Color histogram
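One possible way to compare two frames by color histogram is sketched below with plain NumPy. The bin count and the L1 distance are illustrative choices, not necessarily what the talk used.

```python
import numpy as np


def color_histogram(frame, bins=8):
    """Concatenated per-channel histogram, normalized to sum to 1."""
    frame = np.asarray(frame)
    hists = [np.histogram(frame[..., ch], bins=bins, range=(0, 256))[0]
             for ch in range(frame.shape[-1])]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()


def histogram_difference(frame_a, frame_b, bins=8):
    """L1 distance between the histograms: 0 for identical colors, 2 for disjoint ones."""
    diff = color_histogram(frame_a, bins) - color_histogram(frame_b, bins)
    return float(np.abs(diff).sum())


# Toy frames: an all-black and an all-white 4x4 color image
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 255, dtype=np.uint8)
print(histogram_difference(dark, dark), histogram_difference(dark, bright))
```

Because a histogram ignores where pixels are, movement inside a shot changes the score very little, while a cut to a different scene usually changes it a lot.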

Comparison Metrics
● Edge Change Ratio - Compare the in-pixels and out-pixels
(Figure: Frame #N-1 and Frame #N)
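A minimal sketch of the Edge Change Ratio follows. It assumes the edge maps of both frames are already available as boolean arrays (for example from an edge detector such as Canny), and the `radius` tolerance is an illustrative parameter.

```python
import numpy as np


def dilate(mask, radius=1):
    """Grow a boolean mask by `radius` pixels in every direction."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def edge_change_ratio(edges_prev, edges_curr, radius=1):
    """max(entering / edges in current frame, exiting / edges in previous frame)."""
    edges_prev = np.asarray(edges_prev, dtype=bool)
    edges_curr = np.asarray(edges_curr, dtype=bool)
    n_prev, n_curr = edges_prev.sum(), edges_curr.sum()
    if n_prev == 0 or n_curr == 0:
        return 0.0 if n_prev == n_curr else 1.0
    # In-pixels: edges in the current frame with no previous edge nearby
    entering = (edges_curr & ~dilate(edges_prev, radius)).sum()
    # Out-pixels: edges in the previous frame with no current edge nearby
    exiting = (edges_prev & ~dilate(edges_curr, radius)).sum()
    return float(max(entering / n_curr, exiting / n_prev))


# Toy edge maps: a vertical line that stays, shifts slightly, or jumps away
line_a = np.zeros((8, 8), dtype=bool); line_a[:, 2] = True
line_b = np.zeros((8, 8), dtype=bool); line_b[:, 3] = True
line_c = np.zeros((8, 8), dtype=bool); line_c[:, 6] = True
print(edge_change_ratio(line_a, line_b), edge_change_ratio(line_a, line_c))
```

The dilation makes the measure tolerant to small motion: an edge that moves by a pixel is not counted as a new edge.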

Considerations (ok ok , and some code…)
● Thresholds for shot change
● Compare every two consecutive frames, or frames further apart?
● Do we prefer more shots (some possibly wrong) or fewer shots (and miss some)?
● Check the complete frame, or only the tracked object's rectangle?
● Crop the image before comparison (avoids noise from subtitles, logos, etc.)
● What will happen if a cat is sitting on a table, and then jumps?
● ECR doesn’t have much effect. But it’s cool!
● ECR code here
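The threshold trade-off from the list above can be sketched as follows. The scores are invented numbers standing in for any frame comparison metric, and `min_shot_length` is a hypothetical extra parameter that suppresses cuts too close to each other.

```python
def detect_cuts(frame_scores, threshold, min_shot_length=1):
    """Return the frame indices at which a new shot starts.

    frame_scores[i] is the difference between frame i and frame i + 1.
    """
    cuts = []
    last_cut = 0
    for i, value in enumerate(frame_scores):
        frame = i + 1
        if value > threshold and frame - last_cut >= min_shot_length:
            cuts.append(frame)
            last_cut = frame
    return cuts


# Hypothetical difference scores between consecutive frames
scores = [0.1, 0.2, 1.6, 0.1, 0.9, 0.1]
print(detect_cuts(scores, threshold=1.0))                       # prints: [3]
print(detect_cuts(scores, threshold=0.5))                       # prints: [3, 5]
print(detect_cuts(scores, threshold=0.5, min_shot_length=3))    # prints: [3]
```

A lower threshold finds more shots, some of them wrong; a higher one finds fewer and may miss real cuts.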

So Where are We Standing?
● Problems with model (= neural network)
○ Grayscale images
○ People of color
● We need third-party validation
○ But not on all frames
● We want to build an image database
● Hardware requirements are very high
○ Maybe we will process only ‘important videos’
