3. Moravec’s argument (1998)
ROBOT: Mere Machine To Transcendent Mind
• 1 neuron = 1000 instructions/sec
• 1 synapse = 1 byte of information
• The human brain then processes 10^14 IPS and has 10^14 bytes of storage
• In 2000, we have 10^9 IPS and 10^9 bytes on
a desktop machine
• Assuming Moore’s law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015 (the arithmetic is sketched below)
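A minimal sketch of that arithmetic in Python, assuming a Moore's-law doubling time of roughly 18 months (the assumption under which the 2025 and 2015 dates come out):

```python
import math

BRAIN_IPS = 1e14      # Moravec: 10^11 neurons x 1000 instructions/sec each
DESKTOP_IPS = 1e9     # a desktop machine circa 2000
DOUBLING_YEARS = 1.5  # assumed Moore's-law doubling time (~18 months)

def year_of_parity(start_year, start_ips, target_ips):
    """Year when compute reaches target_ips, doubling every DOUBLING_YEARS."""
    doublings = math.log2(target_ips / start_ips)
    return start_year + doublings * DOUBLING_YEARS

# Single desktop: a factor of 10^5 remains, i.e. ~16.6 doublings -> ~2025
print(year_of_parity(2000, DESKTOP_IPS, BRAIN_IPS))        # ~2024.9

# Cluster of 100 nodes: only a factor of 10^3 remains -> ~2015
print(year_of_parity(2000, 100 * DESKTOP_IPS, BRAIN_IPS))  # ~2014.9
```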
6. Ontogeny of Intelligence
• The Cambrian period (543-490 million yrs ago) led to the emergence of a wide variety of animal life. These animals had vision and locomotion capabilities.
• Sensory systems provide great benefits only when accompanied by the ability to move: to find food, avoid predators, etc.
7. If you don’t need to move, you don’t need an eye or a brain!
https://goodheartextremescience.wordpress.com/2010/01/27/meet-the-creature-that-eats-its-own-brain/
8. Hominid evolution in the last 5 million years
• Bipedalism freed the hand for tool making. Dexterous hands coevolved with larger brains.
• Anaxagoras: “It is because of his being armed with hands that man is the most intelligent animal”
12. Hubel and Wiesel (1962) discovered orientation-sensitive neurons in V1
14. Convolutional Neural Networks (LeCun et al.)
Used backpropagation to train the weights in this architecture
• First demonstrated by LeCun et al. for handwritten digit recognition (1989)
• Applied in a sliding-window paradigm for tasks such as face detection in the 1990s
• However, it was not competitive on standard computer vision object detection benchmarks in the 2000s
• And then ImageNet and AlexNet happened…
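For concreteness, here is a minimal LeNet-style convolutional network in modern PyTorch. It is a sketch of the architecture family, not LeCun's 1989 implementation; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """A LeNet-style ConvNet for 28x28 grayscale digits (illustrative sizes)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# All weights are trained end-to-end with backpropagation.
logits = LeNetStyle()(torch.randn(1, 1, 28, 28))
```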
15. R-CNN: Regions with CNN features
Girshick, Donahue, Darrell & Malik (CVPR 2014)
(pipeline: input image → extract region proposals (~2k / image) → compute CNN features → classify regions with linear SVMs)
This and the MultiBox work from Google showed how to apply these architectures to object detection
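A schematic of that pipeline in code may help. This is a hypothetical sketch: `propose_regions`, `cnn_features`, and `svm_scores` stand in for selective search, the trained CNN, and the per-class linear SVMs, and are not real library functions:

```python
import numpy as np

def rcnn_detect(image, propose_regions, cnn_features, svm_scores):
    """Sketch of the R-CNN pipeline (all three callables are placeholders)."""
    detections = []
    for box in propose_regions(image):      # ~2000 proposals per image
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]          # warp each region to a fixed size
        feat = cnn_features(crop)           # one CNN forward pass per region
        scores = svm_scores(feat)           # classify with linear SVMs
        detections.append((box, int(np.argmax(scores)), float(np.max(scores))))
    return detections                       # followed by per-class NMS
```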
16. Fast R-CNN (Girshick, 2015)
R-CNN with SPP-style pooled features, so there is no need to warp individual windows.
There is also Faster R-CNN, which doesn’t require external proposals.
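The efficiency gain comes from sharing one CNN forward pass across all proposals and pooling each region's features from the shared feature map. A minimal sketch using torchvision's `roi_pool` (the stride and sizes here are illustrative):

```python
import torch
from torchvision.ops import roi_pool

# One shared forward pass gives a feature map for the whole image;
# each proposal is pooled from it instead of being warped and re-run.
feature_map = torch.randn(1, 256, 38, 50)          # conv features at stride 16
boxes = torch.tensor([[0, 48., 48., 320., 320.],   # (batch_idx, x1, y1, x2, y2)
                      [0, 16., 80., 240., 400.]])  # in input-image coordinates
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> fed to the classifier head
```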
18. The 3R’s of Vision:
Recognition, Reconstruction & Reorganization
Talk at POCV Workshop, CVPR 2012
19. Fifty years of computer vision: 1965-2015
• 1960s: Beginnings in artificial intelligence, image
processing and pattern recognition
• 1970s: Foundational work on image formation: Horn,
Koenderink, Longuet-Higgins …
• 1980s: Vision as applied mathematics: geometry, multi-
scale analysis, control theory, optimization …
• 1990s: Multiple View Reconstruction well understood
• 2000s: Learning approaches to recognition problems in
full swing. Large datasets are collected and annotated
e.g. ImageNet
• 2010s: Deep Learning becomes popular, building on the availability of GPUs and annotated datasets
20. Reconstructing the world
… automatically from huge collections of photos downloaded from the Internet
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes…
Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections.
21. Reconstructing the world
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes…
… that vary over time
Matzen & Snavely. Scene Chronology. ECCV 2014.
22. Reconstructing the world
… that vary over time
Martin-Brualla, Gallup, Seitz. Time-lapse Mining from Internet Photos. SIGGRAPH 2015.
23. Reconstructing the great indoors…
… using Depth Cameras: Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015.
… using Semantic Reconstruction of Rooms and Objects: Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015.
(figure: point cloud → 3D mesh → rendering)
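Underlying all of these systems is multi-view geometry. As a toy illustration (not the pipeline of any of the cited papers), here is a minimal two-view reconstruction in OpenCV, assuming the camera intrinsic matrix K is known:

```python
import cv2
import numpy as np

def two_view_points(img1, img2, K):
    """Match features, estimate the essential matrix, recover the relative
    pose, and triangulate a sparse 3D point cloud from two images."""
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    pts1, pts2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at origin
    P2 = K @ np.hstack([R, t])                         # second camera pose
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # homogeneous 4xN
    return (X[:3] / X[3]).T                            # Nx3 point cloud
```

The systems cited above add robust matching across thousands of images, incremental pose estimation, and bundle adjustment on top of this core.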
29. Neuroscience & Computer Vision
• A feed-forward view of processing in the ventral
stream with layers of simple and complex cells led to
the neocognitron and subsequently convolutional
networks.
• We now know that the ventral stream is much more
complicated with bidirectional as well as feedback
connections.
• I am interested in computer vision tasks where
feedback is key to the solution. This is a very natural
way to capture “context”. Helpful in pose recovery,
instance segmentation, etc.
33. Social Perception
• Computers today have pitifully low “social
intelligence”
• We need to understand the internal state of
humans as they interact with each other and
the external world
• Examples: emotional state, body language,
current goals.
34. What we would like to infer…
Will person B put some money into Person C’s tip bag?
36. What we can’t do (yet)
• The hierarchical structure of human behavior: movement, goals, actions and events
ACTION = MOVEMENT + GOAL
37. Events
e.g. a meal at a restaurant
• Classical AI/Cognitive Science solution: schemas (frames, scripts, etc.)
• To have a robust, visually grounded solution we need to learn the equivalent from video + Knowledge Graph-like structures (a toy schema is sketched after this list)
• Perhaps best tackled in particular domains, e.g. team sports, instructional videos, etc.
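Here is that toy schema: a hand-written "meal at a restaurant" script as a small graph. Every name in it is illustrative; the slide's point is that such structures should be learned from video plus knowledge-graph-like priors rather than hand-coded like this:

```python
# A toy "restaurant meal" schema: nodes are sub-events, "before" edges
# encode temporal order, and roles bind participants to sub-events.
restaurant_script = {
    "nodes": ["enter", "be_seated", "order", "eat", "pay", "leave"],
    "before": [("enter", "be_seated"), ("be_seated", "order"),
               ("order", "eat"), ("eat", "pay"), ("pay", "leave")],
    "roles": {"customer": ["enter", "order", "eat", "pay", "leave"],
              "waiter": ["be_seated", "order", "pay"]},
}
```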
38. What has been responsible for recent
AI successes?
• Big Computing
• Big Data
39. What has been responsible for recent
AI successes?
• Big Computing
• Big Data
• Big Annotation
• Big Simulation
43. The Development of Embodied Cognition:
Six Lessons from Babies
Linda Smith & Michael Gasser
44. The Six Lessons
• Be multi-modal
• Be incremental
• Be physical
• Explore
• Be social
• Use language
• An example: Learning to see by moving, P.
Agrawal, J. Carreira, J. Malik (ICCV 2015)
46. On Mental Models
If the organism carries a “small-scale model” of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and the future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it. (Craik, 1943, Ch. 5, p. 61)
Modern control theory (Kalman et al.) uses a state-space formalism to achieve this.
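For reference, the standard linear-Gaussian state-space model that this formalism refers to (a textbook form, not anything specific to this talk) is:

```latex
% x_t: internal state, u_t: action/control, y_t: observation
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, Q)\\
y_t     &= C x_t + v_t,         & v_t &\sim \mathcal{N}(0, R)
\end{align*}
% The Kalman filter maintains an estimate \hat{x}_t of the state from the
% observations: Craik's "small-scale model", letting the agent predict the
% consequences of actions before taking them.
```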
47. For acting in novel situations, we need a model of the agent and a model of the environment.
How will the environment look in the future when the agent interacts with it?
49. Consider the specific case of billiards
• What force to apply? → model of the environment
• How to apply this force? → model of the agent
50. Learning Visual Predictive Models of Physics for Playing Billiards
Katerina Fragkiadaki*, Pulkit Agrawal*, Sergey Levine, Jitendra Malik
ICLR, 2016 (*equal contribution)
51. Moving Balls World
Factors of change: table geometry, number of balls, ball size, color of balls/walls
Like the real world, moving-ball worlds provide constantly changing environments
52. Visual Predictive Model of Physics
(figure: a neural network F serves as the prediction module)
A model that can predict “future visual states” (i.e. visual imagination)
53. Key Idea: Use object-centric predictions
(figure: world-centric prediction vs. object-centric prediction, each using the predictor F)
54. Our Model
(figure: the last four frames X_{t-3}, X_{t-2}, X_{t-1}, X_t and the applied force F_t are encoded by a CNN; an LSTM then predicts the object velocities u_t, …, u_{t+19})
LEARNING THE ENVIRONMENT MODEL DIRECTLY FROM VISUAL INPUTS
Assume that the agent can observe and interact with the world.
Assume that the agent can track objects.
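A rough PyTorch sketch of this kind of object-centric predictor is below. The layer sizes and the exact way the force is injected are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ObjectCentricPredictor(nn.Module):
    """A CNN encodes four object-centric glimpses; an LSTM unrolls 20 steps
    of predicted object velocity. Sizes are illustrative."""
    def __init__(self, horizon=20):
        super().__init__()
        self.horizon = horizon
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),   # 4 stacked glimpses
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> 64-d visual code
        )
        self.lstm = nn.LSTM(64 + 2, 128, batch_first=True)  # +2 for force (fx, fy)
        self.vel_head = nn.Linear(128, 2)               # per-step velocity

    def forward(self, frames, force):
        # frames: (B, 4, H, W) glimpses X_{t-3..t}; force: (B, 2) applied F_t
        code = torch.cat([self.cnn(frames), force], dim=1)
        seq = code.unsqueeze(1).repeat(1, self.horizon, 1)  # feed at every step
        h, _ = self.lstm(seq)
        return self.vel_head(h)                 # (B, 20, 2): u_t, ..., u_{t+19}

pred = ObjectCentricPredictor()(torch.randn(2, 4, 64, 64), torch.randn(2, 2))
```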
MODEL PREDICTIONS
(plots: scaling with the number of balls (1 ball, 2 balls); generalization to novel worlds (2-ball model on 3-ball worlds, 3-ball model on 4-ball worlds))
58. (plot: object-centric vs. frame-centric prediction; trained on 2/3-ball worlds, generalizing to 6-ball worlds)
59. Now that we have learned predictive models, we can plan actions!
60. Planning Actions
• Visually imagine the effect of different forces!
• Then choose the optimal force which “in imagination” (simulation) leads to the desired goal state.
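A minimal version of this plan-by-imagination loop is random shooting over candidate forces: roll each force through the learned model and keep the one whose imagined outcome is closest to the goal. A sketch, using the hypothetical predictor above and expressing the goal as a desired displacement:

```python
import torch

def plan_force(model, frames, goal_disp, n_candidates=256, dt=1.0):
    """Sample candidate forces, imagine their outcomes with the learned
    model, and return the force whose imagined final displacement is
    closest to the desired one."""
    forces = torch.empty(n_candidates, 2).uniform_(-1.0, 1.0)  # random shooting
    batch = frames.expand(n_candidates, *frames.shape[1:])     # same state for all
    with torch.no_grad():
        vels = model(batch, forces)          # (N, horizon, 2) imagined velocities
        final_disp = (vels * dt).sum(dim=1)  # integrate velocities -> displacement
    dist = (final_disp - goal_disp).norm(dim=1)
    return forces[dist.argmin()]             # best force "in imagination"
```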
66. Our Method: Jointly Learn Forward and Inverse Models
(figure: an image encoder maps the image X_t to features Z_t, an action encoder maps the action u_t to v_t, and a predictor maps (Z_t, v_t) to Z_{t+1}; the next image is encoded the same way to provide the target Z_{t+1})
• Simultaneously learn and predict in an abstract feature space (Z_t)
• The forward model regularizes the inverse model!
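A compact sketch of this joint objective, with illustrative sizes and the action encoder folded into the forward model for brevity (the general scheme, not the exact model behind the slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointForwardInverse(nn.Module):
    """Learn features Z_t with a forward model (Z_t, u_t) -> Z_{t+1} and an
    inverse model (Z_t, Z_{t+1}) -> u_t trained together."""
    def __init__(self, z_dim=128, a_dim=2):
        super().__init__()
        self.image_encoder = nn.Sequential(            # X_t -> Z_t
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, z_dim),
        )
        self.forward_model = nn.Sequential(            # (Z_t, u_t) -> Z_{t+1}
            nn.Linear(z_dim + a_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        self.inverse_model = nn.Sequential(            # (Z_t, Z_{t+1}) -> u_t
            nn.Linear(2 * z_dim, 256), nn.ReLU(), nn.Linear(256, a_dim))

    def forward(self, x_t, x_t1, u_t):
        z_t, z_t1 = self.image_encoder(x_t), self.image_encoder(x_t1)
        z_pred = self.forward_model(torch.cat([z_t, u_t], dim=1))
        u_pred = self.inverse_model(torch.cat([z_t, z_t1], dim=1))
        # The forward loss shapes the feature space the inverse model uses,
        # regularizing it toward features that capture action consequences.
        return F.mse_loss(z_pred, z_t1.detach()) + F.mse_loss(u_pred, u_t)
```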
71. Related and Contemporary Work
Pinto, L., & Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA, 2016.
Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv:1603.02199, 2016.
Pinto, L., Gandhi, D., Han, Y., Park, Y. L., & Gupta, A. The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv:1604.01360, 2016.