3. Moravec’s argument (1998)
ROBOT: Mere Machine To Transcendent Mind
• 1 neuron = 1000 instructions/sec
• 1 synapse = 1 byte of information
• The human brain then processes 10^14 IPS and has 10^14 bytes of storage
• In 2000, we have 10^9 IPS and 10^9 bytes on
a desktop machine
• Assuming Moore’s law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015 (the arithmetic is sketched below)
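A minimal sketch of that arithmetic in Python, assuming a Moore's-law doubling time of roughly 18 months (the assumption under which the 2025 and 2015 dates come out):

```python
import math

BRAIN_IPS = 1e14      # Moravec: 10^11 neurons x 1000 instructions/sec each
DESKTOP_IPS = 1e9     # a desktop machine circa 2000
DOUBLING_YEARS = 1.5  # assumed Moore's-law doubling time (~18 months)

def year_of_parity(start_year, start_ips, target_ips):
    """Year when compute reaches target_ips, doubling every DOUBLING_YEARS."""
    doublings = math.log2(target_ips / start_ips)
    return start_year + doublings * DOUBLING_YEARS

# Single desktop: a factor of 10^5 remains, i.e. ~16.6 doublings -> ~2025
print(year_of_parity(2000, DESKTOP_IPS, BRAIN_IPS))        # ~2024.9

# Cluster of 100 nodes: only a factor of 10^3 remains -> ~2015
print(year_of_parity(2000, 100 * DESKTOP_IPS, BRAIN_IPS))  # ~2014.9
```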
6. Ontogeny of Intelligence
• The Cambrian period (543-490 million yrs ago) led to the emergence of a wide variety of animal life. These animals had vision and locomotion capabilities.
• Sensory systems provide great benefits only when accompanied by the ability to move: to find food, avoid predators, etc.
7. If you don’t need to move, you don’t need an eye or a brain!
https://goodheartextremescience.wordpress.com/2010/01/27/meet-the-creature-that-eats-its-own-brain/
8. Hominid evolution in the last 5 million years
• Bipedalism freed the hand for tool making. Dexterous hands coevolved with larger brains.
• Anaxagoras: “It is because of his being armed with hands that man is the most intelligent animal”
12. Hubel and Wiesel (1962) discovered orientation-sensitive neurons in V1
14. Convolutional Neural Networks (LeCun et al.)
Used backpropagation to train the weights in this architecture
• First demonstrated by LeCun et al. for handwritten digit recognition (1989)
• Applied in a sliding-window paradigm for tasks such as face detection in the 1990s
• However, it was not competitive on standard computer vision object detection benchmarks in the 2000s
• And then ImageNet and AlexNet happened…
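For concreteness, here is a minimal LeNet-style convolutional network in modern PyTorch. It is a sketch of the architecture family, not LeCun's 1989 implementation; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """A LeNet-style ConvNet for 28x28 grayscale digits (illustrative sizes)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# All weights are trained end-to-end with backpropagation.
logits = LeNetStyle()(torch.randn(1, 1, 28, 28))
```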
15. R-CNN: Regions with CNN features
Girshick, Donahue, Darrell & Malik (CVPR 2014)
(pipeline: input image → extract region proposals (~2k / image) → compute CNN features → classify regions with linear SVMs)
This and the MultiBox work from Google showed how to apply these architectures to object detection
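A schematic of that pipeline in code may help. This is a hypothetical sketch: `propose_regions`, `cnn_features`, and `svm_scores` stand in for selective search, the trained CNN, and the per-class linear SVMs, and are not real library functions:

```python
import numpy as np

def rcnn_detect(image, propose_regions, cnn_features, svm_scores):
    """Sketch of the R-CNN pipeline (all three callables are placeholders)."""
    detections = []
    for box in propose_regions(image):      # ~2000 proposals per image
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]          # warp each region to a fixed size
        feat = cnn_features(crop)           # one CNN forward pass per region
        scores = svm_scores(feat)           # classify with linear SVMs
        detections.append((box, int(np.argmax(scores)), float(np.max(scores))))
    return detections                       # followed by per-class NMS
```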
16. Fast R-CNN (Girshick, 2015)
R-CNN with SPP-style pooled features, so there is no need to warp individual windows.
There is also Faster R-CNN, which doesn’t require external proposals.
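The efficiency gain comes from sharing one CNN forward pass across all proposals and pooling each region's features from the shared feature map. A minimal sketch using torchvision's `roi_pool` (the stride and sizes here are illustrative):

```python
import torch
from torchvision.ops import roi_pool

# One shared forward pass gives a feature map for the whole image;
# each proposal is pooled from it instead of being warped and re-run.
feature_map = torch.randn(1, 256, 38, 50)          # conv features at stride 16
boxes = torch.tensor([[0, 48., 48., 320., 320.],   # (batch_idx, x1, y1, x2, y2)
                      [0, 16., 80., 240., 400.]])  # in input-image coordinates
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> fed to the classifier head
```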
18. The 3R’s of Vision:
Recognition, Reconstruction & Reorganization
Talk at POCV Workshop, CVPR 2012
19. Fifty years of computer vision: 1965-2015
• 1960s: Beginnings in artificial intelligence, image
processing and pattern recognition
• 1970s: Foundational work on image formation: Horn,
Koenderink, Longuet-Higgins …
• 1980s: Vision as applied mathematics: geometry, multi-
scale analysis, control theory, optimization …
• 1990s: Multiple View Reconstruction well understood
• 2000s: Learning approaches to recognition problems in
full swing. Large datasets are collected and annotated
e.g. ImageNet
• 2010s: Deep Learning becomes popular, building on the availability of GPUs and annotated datasets
20. Reconstructing the world
… automatically from huge collections of photos downloaded from the Internet
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes…
Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections.
21. Reconstructing the world
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes…
… that vary over time
Matzen & Snavely. Scene Chronology. ECCV 2014.
22. Reconstructing the world
… that vary over time
Martin-Brualla, Gallup, Seitz. Time-lapse Mining from Internet Photos. SIGGRAPH 2015.
23. Reconstructing the great indoors…
… using Depth Cameras: Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015.
… using Semantic Reconstruction of Rooms and Objects: Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015.
(figure: point cloud → 3D mesh → rendering)
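Underlying all of these systems is multi-view geometry. As a toy illustration (not the pipeline of any of the cited papers), here is a minimal two-view reconstruction in OpenCV, assuming the camera intrinsic matrix K is known:

```python
import cv2
import numpy as np

def two_view_points(img1, img2, K):
    """Match features, estimate the essential matrix, recover the relative
    pose, and triangulate a sparse 3D point cloud from two images."""
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    pts1, pts2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at origin
    P2 = K @ np.hstack([R, t])                         # second camera pose
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # homogeneous 4xN
    return (X[:3] / X[3]).T                            # Nx3 point cloud
```

The systems cited above add robust matching across thousands of images, incremental pose estimation, and bundle adjustment on top of this core.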
29. Neuroscience & Computer Vision
• A feed-forward view of processing in the ventral
stream with layers of simple and complex cells led to
the neocognitron and subsequently convolutional
networks.
• We now know that the ventral stream is much more
complicated with bidirectional as well as feedback
connections.
• I am interested in computer vision tasks where
feedback is key to the solution. This is a very natural
way to capture “context”. Helpful in pose recovery,
instance segmentation, etc.
33. Social Perception
• Computers today have pitifully low “social
intelligence”
• We need to understand the internal state of
humans as they interact with each other and
the external world
• Examples: emotional state, body language,
current goals.
34. What we would like to infer…
Will person B put some money into Person C’s tip bag?
36. What we can’t do (yet)
• The hierarchical structure of human behavior: movement, goals, actions and events
ACTION = MOVEMENT + GOAL
37. Events
e.g. a meal at a restaurant
• Classical AI/Cognitive Science solution: schemas (frames, scripts, etc.)
• To have a robust, visually grounded solution we need to learn the equivalent from video + Knowledge Graph-like structures (a toy schema is sketched after this list)
• Perhaps best tackled in particular domains, e.g. team sports, instructional videos, etc.
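Here is that toy schema: a hand-written "meal at a restaurant" script as a small graph. Every name in it is illustrative; the slide's point is that such structures should be learned from video plus knowledge-graph-like priors rather than hand-coded like this:

```python
# A toy "restaurant meal" schema: nodes are sub-events, "before" edges
# encode temporal order, and roles bind participants to sub-events.
restaurant_script = {
    "nodes": ["enter", "be_seated", "order", "eat", "pay", "leave"],
    "before": [("enter", "be_seated"), ("be_seated", "order"),
               ("order", "eat"), ("eat", "pay"), ("pay", "leave")],
    "roles": {"customer": ["enter", "order", "eat", "pay", "leave"],
              "waiter": ["be_seated", "order", "pay"]},
}
```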
38. What has been responsible for recent
AI successes?
• Big Computing
• Big Data
39. What has been responsible for recent
AI successes?
• Big Computing
• Big Data
• Big Annotation
• Big Simulation
43. The Development of Embodied Cognition:
Six Lessons from Babies
Linda Smith & Michael Gasser
44. The Six Lessons
• Be multi-modal
• Be incremental
• Be physical
• Explore
• Be social
• Use language
• An example: Learning to see by moving, P.
Agrawal, J. Carreira, J. Malik (ICCV 2015)
46. On Mental Models
If the organism carries a “small-scale model” of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and the future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it. (Craik, 1943, Ch. 5, p. 61)
Modern control theory (Kalman et al.) uses a state-space formalism to achieve this.
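For reference, the standard linear-Gaussian state-space model that this formalism refers to (a textbook form, not anything specific to this talk) is:

```latex
% x_t: internal state, u_t: action/control, y_t: observation
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, Q)\\
y_t     &= C x_t + v_t,         & v_t &\sim \mathcal{N}(0, R)
\end{align*}
% The Kalman filter maintains an estimate \hat{x}_t of the state from the
% observations: Craik's "small-scale model", letting the agent predict the
% consequences of actions before taking them.
```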
47. For acting in novel situations, we need a model of the agent and a model of the environment.
How will the environment look in the future when the agent interacts with it?
49. Consider the specific case of billiards
• What force to apply? → model of the environment
• How to apply this force? → model of the agent
50. Learning Visual Predictive Models of Physics for Playing Billiards
Katerina Fragkiadaki*, Pulkit Agrawal*, Sergey Levine, Jitendra Malik
ICLR, 2016 (*equal contribution)
51. Moving Balls World
Factors of change: table geometry, number of balls, ball size, color of balls/walls
Like the real world, moving-ball worlds provide constantly changing environments
52. Visual Predictive Model of Physics
(figure: a neural network F serves as the prediction module)
A model that can predict “future visual states” (i.e. visual imagination)
53. Key Idea: Use object-centric predictions
(figure: world-centric prediction vs. object-centric prediction, each using the predictor F)
54. Our Model
(figure: the last four frames X_{t-3}, X_{t-2}, X_{t-1}, X_t and the applied force F_t are encoded by a CNN; an LSTM then predicts the object velocities u_t, …, u_{t+19})
LEARNING THE ENVIRONMENT MODEL DIRECTLY FROM VISUAL INPUTS
Assume that the agent can observe and interact with the world.
Assume that the agent can track objects.
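A rough PyTorch sketch of this kind of object-centric predictor is below. The layer sizes and the exact way the force is injected are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ObjectCentricPredictor(nn.Module):
    """A CNN encodes four object-centric glimpses; an LSTM unrolls 20 steps
    of predicted object velocity. Sizes are illustrative."""
    def __init__(self, horizon=20):
        super().__init__()
        self.horizon = horizon
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),   # 4 stacked glimpses
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> 64-d visual code
        )
        self.lstm = nn.LSTM(64 + 2, 128, batch_first=True)  # +2 for force (fx, fy)
        self.vel_head = nn.Linear(128, 2)               # per-step velocity

    def forward(self, frames, force):
        # frames: (B, 4, H, W) glimpses X_{t-3..t}; force: (B, 2) applied F_t
        code = torch.cat([self.cnn(frames), force], dim=1)
        seq = code.unsqueeze(1).repeat(1, self.horizon, 1)  # feed at every step
        h, _ = self.lstm(seq)
        return self.vel_head(h)                 # (B, 20, 2): u_t, ..., u_{t+19}

pred = ObjectCentricPredictor()(torch.randn(2, 4, 64, 64), torch.randn(2, 2))
```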
MODEL PREDICTIONS
(plots: scaling with the number of balls (1 ball, 2 balls); generalization to novel worlds (2-ball model on 3-ball worlds, 3-ball model on 4-ball worlds))
58. (plot: object-centric vs. frame-centric prediction; trained on 2/3-ball worlds, generalizing to 6-ball worlds)
59. Now that we have learned predictive models, we can plan actions!
60. Planning Actions
• Visually imagine the effect of different forces!
• Then choose the optimal force which “in imagination” (simulation) leads to the desired goal state.
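A minimal version of this plan-by-imagination loop is random shooting over candidate forces: roll each force through the learned model and keep the one whose imagined outcome is closest to the goal. A sketch, using the hypothetical predictor above and expressing the goal as a desired displacement:

```python
import torch

def plan_force(model, frames, goal_disp, n_candidates=256, dt=1.0):
    """Sample candidate forces, imagine their outcomes with the learned
    model, and return the force whose imagined final displacement is
    closest to the desired one."""
    forces = torch.empty(n_candidates, 2).uniform_(-1.0, 1.0)  # random shooting
    batch = frames.expand(n_candidates, *frames.shape[1:])     # same state for all
    with torch.no_grad():
        vels = model(batch, forces)          # (N, horizon, 2) imagined velocities
        final_disp = (vels * dt).sum(dim=1)  # integrate velocities -> displacement
    dist = (final_disp - goal_disp).norm(dim=1)
    return forces[dist.argmin()]             # best force "in imagination"
```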
66. Our Method: Jointly Learn Forward and Inverse Models
(figure: an image encoder maps the image X_t to features Z_t, an action encoder maps the action u_t to v_t, and a predictor maps (Z_t, v_t) to Z_{t+1}; the next image is encoded the same way to provide the target Z_{t+1})
• Simultaneously learn and predict in an abstract feature space (Z_t)
• The forward model regularizes the inverse model!
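A compact sketch of this joint objective, with illustrative sizes and the action encoder folded into the forward model for brevity (the general scheme, not the exact model behind the slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointForwardInverse(nn.Module):
    """Learn features Z_t with a forward model (Z_t, u_t) -> Z_{t+1} and an
    inverse model (Z_t, Z_{t+1}) -> u_t trained together."""
    def __init__(self, z_dim=128, a_dim=2):
        super().__init__()
        self.image_encoder = nn.Sequential(            # X_t -> Z_t
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, z_dim),
        )
        self.forward_model = nn.Sequential(            # (Z_t, u_t) -> Z_{t+1}
            nn.Linear(z_dim + a_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        self.inverse_model = nn.Sequential(            # (Z_t, Z_{t+1}) -> u_t
            nn.Linear(2 * z_dim, 256), nn.ReLU(), nn.Linear(256, a_dim))

    def forward(self, x_t, x_t1, u_t):
        z_t, z_t1 = self.image_encoder(x_t), self.image_encoder(x_t1)
        z_pred = self.forward_model(torch.cat([z_t, u_t], dim=1))
        u_pred = self.inverse_model(torch.cat([z_t, z_t1], dim=1))
        # The forward loss shapes the feature space the inverse model uses,
        # regularizing it toward features that capture action consequences.
        return F.mse_loss(z_pred, z_t1.detach()) + F.mse_loss(u_pred, u_t)
```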
71. Related and Contemporary Work
Pinto, L., & Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA, 2016.
Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv:1603.02199, 2016.
Pinto, L., Gandhi, D., Han, Y., Park, Y. L., & Gupta, A. The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv:1604.01360, 2016.