2. Deep Learning
• Google is using Machine Learning
• Machine Learning is difficult
• Requires domain knowledge from human experts
Deep Learning:
• Great performance on many problems
• Works well with large amounts of data
• Requires less domain knowledge
Focus:
• Scale deep learning to bigger models and bigger problems
Quoc V. Le
5. What is Deep Learning?
A stack of learned nonlinear transformations:
• u = g(A x)
• v = g(B u)
• …
where x is the input (images, audio, texts, etc.), A and B are weight matrices, and g is a nonlinearity.
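A minimal sketch of this stack in NumPy, assuming tanh as the nonlinearity g and illustrative layer sizes (neither is specified on the slide):

```python
import numpy as np

def g(z):
    # Nonlinearity; tanh is a common choice (an assumption here).
    return np.tanh(z)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input x (e.g. pixels); dimension chosen for illustration
A = rng.standard_normal((5, 4))   # first-layer weights
B = rng.standard_normal((3, 5))   # second-layer weights

u = g(A @ x)   # first-layer features: u = g(A x)
v = g(B @ u)   # second-layer features: v = g(B u)
```

Deeper models simply repeat this pattern, each layer transforming the features produced by the layer below.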
7. High-level features by Deep Learning
Feature hierarchy, from bottom to top:
• Pixels
• Edge detectors
• …
• Face detector, cat detector
8. Google’s DistBelief
Goal: Train deep learning models on many machines
Model: a multi-layered architecture
• Forward pass to compute the features
• Backward pass to compute the gradient
9. Model partition with DistBelief
DistBelief distributes a model across multiple machines and multiple cores; each machine holds one model partition.
11. Model partition with DistBelief
• Training uses Stochastic Gradient Descent (SGD)
• Model parameters are partitioned
• Can use up to 1000 cores
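For reference, a minimal single-machine SGD loop on a toy linear model; DistBelief partitions these parameters across machines, which this sketch does not show. The data, learning rate, and epoch count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # toy training data (made up)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless targets

w = np.zeros(3)                          # model parameters
lr = 0.1                                 # learning rate (assumed value)
for epoch in range(50):
    for i in rng.permutation(len(X)):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of squared error on one example
        w -= lr * grad                   # SGD update on a single example
```

Each step touches one example, which is what makes SGD easy to spread over many workers.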
12. Model partition with DistBelief
• But training is still slow on large data sets
• Can we add more parallelism?
• Idea: train multiple models on different partitions of the data, and merge them
13. Data partition with DistBelief
• The training data is split into shards; each model replica (worker) trains on its own shard
• Workers send their updates ∆p to a central parameter server
• The parameter server applies p’ = p + ∆p and sends the new parameters p’ back to the workers
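A toy sketch of the parameter-server update p’ = p + ∆p, with workers simulated sequentially (the real system applies updates asynchronously). The data, shard count, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((90, 3))               # toy data (made up)
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true

shards = np.array_split(np.arange(len(X)), 3)  # data partitioned across 3 workers

def worker_delta(p, idx, lr=0.05):
    # One worker's contribution: a gradient step computed on its own shard.
    Xs, ys = X[idx], y[idx]
    grad = Xs.T @ (Xs @ p - ys) / len(idx)
    return -lr * grad

p = np.zeros(3)                 # parameters held by the server
for step in range(500):
    for shard in shards:
        dp = worker_delta(p, shard)  # worker computes ∆p
        p = p + dp                   # server applies p' = p + ∆p
```

In the asynchronous version, a worker's ∆p may be computed against a slightly stale copy of p, which DistBelief tolerates in exchange for throughput.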
14. Parallelism in DistBelief
• Model parallelism via model partitioning
• Data parallelism via data partitioning and asynchronous communication
• DistBelief can scale to billions of examples and use 100,000 cores or more
• Thanks to its speed, DistBelief dramatically improves many applications
25. Text understanding
• Very useful but also difficult
• We should try to understand the meaning of words
• Deep Learning can learn the meaning of words
27. Predicting the next word in a sentence
Each context word (“the”, “cat”, “sat”, “on”, “the”) is looked up in a word matrix E; the resulting vectors feed into hidden layers, and a classifier predicts the next word.
E is a matrix of dimension ||Vocab|| x d
28. Visualizing the word vectors
• Example nearest neighbors from word vectors trained on Google News: for “apple”, neighbors include “Apple” and “iPhone”
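Nearest neighbors in embedding space are typically found by cosine similarity. A sketch, using made-up vectors rather than trained Google News embeddings:

```python
import numpy as np

# Toy word vectors, invented for illustration only.
vectors = {
    "apple":  np.array([1.0, 0.2, 0.0]),
    "Apple":  np.array([0.9, 0.3, 0.1]),
    "iPhone": np.array([0.8, 0.4, 0.2]),
    "banana": np.array([0.1, 1.0, 0.0]),
}

def nearest(word, k=2):
    q = vectors[word]
    def cos(v):
        # Cosine similarity between v and the query vector q.
        return v @ q / (np.linalg.norm(v) * np.linalg.norm(q))
    others = [(w, cos(v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda t: -t[1])[:k]]

neighbors = nearest("apple")
```

With well-trained embeddings, the same lookup surfaces semantically related words, as on the slide.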
32. Joint work with
Kai Chen, Greg Corrado, Rajat Monga, Andrew Ng, Jeff Dean, Matthieu Devin, Paul Tucker, Ke Yang
Additional thanks: Samy Bengio, Tom Dean, Josh Levenberg, Geoff Hinton, Tomas Mikolov, Mark Mao, Patrick Nguyen, Marc’Aurelio Ranzato, Mark Segal, Jon Shlens, Ilya Sutskever, Vincent Vanhoucke