3. The Problem: When does a specific person appear in a video?
D. Trump: [0:07-1:23, 1:52-2:03]
B. Obama (nickname: Obamush): [0:07-1:23]
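A minimal sketch of how the time ranges above could be produced, assuming we already know in which frames a person was recognized. The function names, the 25 fps frame rate and the `max_gap` parameter are assumptions made for illustration only.

```python
def frames_to_intervals(frame_indices, fps, max_gap=1):
    """Merge frame indices into (start_seconds, end_seconds) intervals.

    Frames that are at most max_gap frames apart end up in the same interval.
    """
    intervals = []
    for idx in sorted(frame_indices):
        if intervals and idx - intervals[-1][1] <= max_gap:
            intervals[-1][1] = idx          # extend the current interval
        else:
            intervals.append([idx, idx])    # start a new interval
    return [(start / fps, end / fps) for start, end in intervals]


def format_time(seconds):
    """Format a number of seconds as m:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return "{}:{:02d}".format(minutes, secs)


# Hypothetical input: frames (at 25 fps) in which one person was recognized
hits = list(range(175, 2076)) + list(range(2800, 3076))
labels = ["{}-{}".format(format_time(s), format_time(e))
          for s, e in frames_to_intervals(hits, fps=25)]
print(labels)
```

A larger `max_gap` would tolerate a few frames in which recognition failed without splitting the interval.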
4. Journey Outline
1. We will parse the video into frames
2. We will detect faces in the frame
3. We will try to recognize the faces
4. We will track the faces forward and backward in the video
a. We will split the video into shots
8. Let’s examine the code first
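import dlib
from skimage import io

# file_name is the path to the input image (defined elsewhere)
win = dlib.image_window()
image = io.imread(file_name)
# HOG-based frontal face detector that ships with dlib
face_detector = dlib.get_frontal_face_detector()
# The second argument upsamples the image once, which helps find smaller faces
detected_faces = face_detector(image, 1)
win.set_image(image)
for i, face_rect in enumerate(detected_faces):
    win.add_overlay(face_rect)
dlib.hit_enter_to_continue()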
The MAGIC is here. You don’t need to invent anything.
9. But how does it really work??
Taken from this great Medium post
1. Convert the image to grayscale
2. Look at every pixel and at the pixels surrounding it
10. But how does it really work??
3. Find the direction in which the pixels become darker
11. But how does it really work??
4. Convert the image to “darker vectors”
ONLY THE “DARKNESS RATIO” MATTERS - works on both dark and bright images!
12. But how does it really work??
5. Reduce the size of the vector
6. Compare patterns!
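A minimal NumPy sketch of steps 2 to 5 above. This is not dlib's actual implementation: the function names, the 8-bin histogram and the toy ramp image are assumptions made for illustration. It also shows why only the relative darkness matters: adding a constant brightness to the image leaves the result unchanged.

```python
import numpy as np


def darker_directions(gray):
    """Return per-pixel gradient magnitude and the angle (in degrees)
    of the direction in which the pixels become darker."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)              # change along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # The gradient points toward brighter pixels, so negate it
    angle = np.degrees(np.arctan2(-gy, -gx)) % 360.0
    return magnitude, angle


def orientation_histogram(magnitude, angle, bins=8):
    """Reduce a patch to a short, normalized histogram of directions."""
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 360.0), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist


# Toy image that gets brighter from left to right (so darker toward the left)
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
mag, ang = darker_directions(ramp)
hist_dark = orientation_histogram(mag, ang)
hist_bright = orientation_histogram(*darker_directions(ramp + 100))
print(np.allclose(hist_dark, hist_bright))
```

A real detector computes such histograms per small cell of the image and compares the resulting pattern with a pattern learned from many faces.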
14. Intro to machine learning
1. Train:
a. Find the data that may affect the end result (“features”)
b. Train a model that takes as an input:
i. List of features
ii. The end result (“label”)
c. Get the weights for each feature
2. Test:
a. Apply the weights to your data
b. Compute the most relevant result
I’m a man, 32 years old. I watched 32 drama movies and 3 comedy movies (on average, I watched only 75% of each of these boring movies), and no action movies. What will YouTube recommend to me?
1. Borat
2. Hit
3. Titanic
Do you want to be a data scientist?
13 x Feature1 + 5 x Feature2 + ... = score
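A minimal sketch of the "weights x features = score" idea, applied to the viewer described above. All the weights are made up for illustration; in practice they would come out of the training step.

```python
def score(features, weights):
    """Weighted sum: w1 * feature1 + w2 * feature2 + ..."""
    return sum(w * x for w, x in zip(weights, features))


# Features of the viewer: number of drama, comedy and action movies watched
viewer = [32, 3, 0]

# Hypothetical weights per candidate movie (NOT learned from real data)
weights_per_movie = {
    "Borat": [0.1, 2.0, 0.5],
    "Hit": [0.2, 0.3, 3.0],
    "Titanic": [1.0, 0.2, 0.1],
}

scores = {title: score(viewer, w) for title, w in weights_per_movie.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With these invented weights the drama-heavy history dominates, so the drama candidate gets the highest score.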
15. Intro to deep learning
● How does a child learn to ride a bicycle?
● A neural network tries to imitate the human learning process
● Invented by a psychologist - "Osim Historia" ("Making History")
16. Deep Learning in computer vision
● Classic problem: what is this number?
● Do these images represent the same number?
17. Back to our journey
● The problem: recognize people
Donald Trump of course!
Like, duh!
I have no idea. But he looks pretty similar to this weird guy:
18. A moment before we jump into code...
● In order to compare faces, we need to center the face (“apples to apples”)
● In order to do so, we need to find facial landmarks
An alignment code example can be found here
19. From their website:
OpenFace is a Python and Torch implementation of face recognition with deep neural networks and is based on the CVPR 2015 paper
FaceNet: A Unified Embedding for Face Recognition and Clustering by Florian Schroff, Dmitry Kalenichenko, and James Philbin at Google.
Torch allows the network to be executed on a CPU or with CUDA.
Nightmare to install :(
20. Finally - CODE !
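import cv2
import numpy as np
import openface

# args, img1 and img2 come from the command-line parsing in the full code
align = openface.AlignDlib(args.dlibFacePredictor)
net = openface.TorchNeuralNet(args.networkModel, args.imgDim)

def getRep(imgPath):
    bgrImg = cv2.imread(imgPath)
    # OpenCV loads images as BGR; convert to RGB before alignment
    rgbImg = cv2.cvtColor(bgrImg, cv2.COLOR_BGR2RGB)
    bb = align.getLargestFaceBoundingBox(rgbImg)
    # Warp the face so that the outer eyes and the nose land in fixed positions
    alignedFace = align.align(args.imgDim, rgbImg, bb,
                              landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
    # The network output is the representation of the face
    rep = net.forward(alignedFace)
    return rep

# Squared L2 distance: the smaller it is, the more similar the faces
d = getRep(img1) - getRep(img2)
print("distance between representations: {:0.3f}".format(np.dot(d, d)))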
Full code can be found here
21. Summary
● Assuming we know who is going to be in the video, we download images of these people
● We go over the video frame by frame:
○ For each frame, search for faces
■ For each face:
● Apply some image manipulation to align the face image
● Get its representation from the neural network (openface)
● Compare the representation with the representation of the pre-downloaded images
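A minimal sketch of the last comparison step, assuming the representations are plain NumPy-compatible vectors. The `identify` function, the threshold value and the toy 3-dimensional vectors are assumptions made for illustration; real representations from the network are much longer, and a sensible threshold has to be tuned on real data.

```python
import numpy as np


def squared_distance(a, b):
    """Squared L2 distance between two representations."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(d, d))


def identify(rep, known_reps, threshold=1.0):
    """Return (name, distance) of the closest pre-downloaded image,
    or (None, distance) if even the closest one is too far away."""
    best_name, best_dist = None, float("inf")
    for name, reps in known_reps.items():
        for known in reps:
            dist = squared_distance(rep, known)
            if dist < best_dist:
                best_name, best_dist = name, dist
    if best_dist > threshold:
        return None, best_dist
    return best_name, best_dist


# Toy representations standing in for the pre-downloaded images
known = {
    "person_a": [[1.0, 0.0, 0.0]],
    "person_b": [[0.0, 1.0, 0.0]],
}
match = identify([0.9, 0.1, 0.0], known)
stranger = identify([0.0, 0.0, -1.0], known)
print(match, stranger)
```

Returning `None` above the threshold is what lets the system say "I have no idea" instead of forcing a match.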
22. Object Tracking
(or: why recognition over video is different from looping an image recognition algorithm over the frames)
23. Problem Definition
● We are good at finding frontal faces, but not profile faces
○ There are some models that support profile pictures as well
● It is problematic to compare profile pictures
○ We need to train a model (is there a data scientist in the room?)
○ We would need a very large number of profile pictures…
● What if our dear president-elect decides to turn around?
25. Object Tracking
● Dlib has an API for tracking objects
● Once we find a face, we need to run the tracker both forward and backward in the video
● Problem: if there is a camera cut in the middle, the tracker does not notice it.
import cv2
import dlib

# video_path and face_rectangle (a dlib.rectangle from the face detector) are defined elsewhere
video = cv2.VideoCapture(video_path)
tracker = dlib.correlation_tracker()
ret, frame = video.read()
# Start tracking the face that was found in the first frame
tracker.start_track(frame, face_rectangle)
while True:
    ret, frame = video.read()
    if not ret:
        break
    # Follow the object into the new frame
    tracker.update(frame)
    pos = tracker.get_position()
    bl = (int(pos.left()), int(pos.bottom()))
    tr = (int(pos.right()), int(pos.top()))
    cv2.rectangle(frame, bl, tr, color=(153, 255, 204), thickness=3)
    cv2.imshow('video', frame)
    # Press 'q' to stop
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
video.release()
cv2.destroyAllWindows()
From the dlib website:
Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments.
27. Movie Shots
● We need shot detection in order to cut the object trackers at shot boundaries
● Shot transition types:
○ Camera cut
○ Dissolve
○ Wipe
○ Fade-in / Fade out
● Tools that do shot detection:
○ FFmpeg
○ Scene Segmentation
● Not good enough...
30. Considerations (OK, OK, and some code…)
● Thresholds for shot change
● Compare every two consecutive frames, or frames that are further apart
● Do we prefer more shots (maybe wrong ones), or fewer shots (and miss some)?
● Check the complete frame, or only the tracked object square
● Crop the image before comparison (to avoid noise from subtitles, logos, etc.)
● What will happen if a cat is sitting on a table, and then jumps?
● ECR (edge change ratio) doesn’t have much effect. But it’s cool!
● ECR code here
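A minimal sketch of several of the considerations above: comparing frames `step` apart, cropping the borders before comparison, and a threshold for declaring a shot change. It uses a simple grayscale histogram difference rather than ECR, and it is not the code from the talk. The function names, the 10% crop margin and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np


def crop_center(frame, margin=0.1):
    """Drop a border on every side (where subtitles and logos usually are)."""
    h, w = frame.shape[:2]
    dy, dx = int(h * margin), int(w * margin)
    return frame[dy:h - dy, dx:w - dx]


def histogram_difference(frame_a, frame_b, bins=32):
    """Difference between normalized intensity histograms, between 0 and 1."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return 0.5 * float(np.abs(ha - hb).sum())


def find_shot_changes(frames, threshold=0.5, step=1):
    """Return the indices of frames that start a new shot."""
    changes = []
    for i in range(step, len(frames), step):
        previous = crop_center(frames[i - step])
        current = crop_center(frames[i])
        if histogram_difference(previous, current) > threshold:
            changes.append(i)
    return changes


# Toy "video": three dark frames followed by two bright frames
dark = np.full((40, 60), 20, dtype=np.uint8)
bright = np.full((40, 60), 220, dtype=np.uint8)
cuts = find_shot_changes([dark, dark, dark, bright, bright])
print(cuts)
```

A lower threshold yields more (possibly wrong) shots, a higher one yields fewer shots and may miss some. A histogram also ignores where things are in the frame, so the jumping cat barely changes it.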
31. So Where Do We Stand?
● Problems with the model (= neural network)
○ Grayscale images
○ People of color
● We need validation by a third party
○ But not on all frames
● We want to build an image database
● Hardware requirements are very high
○ Maybe we will process only ‘important videos’