Note: these are the slides from a presentation at Lexis Nexis in Alpharetta, GA, on 2014-01-08 as part of the DataScienceATL Meetup. A video of this talk from Dec 2013 is available on vimeo at http://bit.ly/1aJ6xlt
Note: Slideshare mis-converted the images in slides 16-17. Expect a fix in the next couple of days.
---
Deep learning is a hot area of machine learning, named one of the "Breakthrough Technologies of 2013" by MIT Technology Review. The basic ideas extend neural network research from past decades and incorporate new discoveries in statistical machine learning and neuroscience. The results are new learning architectures and algorithms that promise disruptive advances in automatic feature engineering, pattern discovery, data modeling, and artificial intelligence. Empirical results from real-world applications and benchmarking routinely demonstrate state-of-the-art performance across diverse problems, including speech recognition, object detection, image understanding, and machine translation. The technology is employed commercially today, notably in many popular Google products such as Street View, Google+ Image Search, and Android Voice Recognition.
In this talk, we will present an overview of deep learning for data scientists: what it is, how it works, what it can do, and why it is important. We will review several real-world applications and discuss some of the key hurdles to mainstream adoption. We will conclude by discussing our experiences implementing and running deep learning experiments on our own hardware data science appliance.
Deep Learning for Data Scientists: A Technical Overview
1. Deep Learning
for Data Scientists
Andrew B. Gardner
agardner@momentics.com
http://linkd.in/1byADxC
www.momentics.com/deep-learning
2.
3. Deep Learning in the Press…
[Photos: Ng, Hinton, LeCun, Zuckerberg, Kurzweil]
"Google Hires Brains that Helped Supercharge Machine Learning." Wired, 3/2013.
"Facebook Taps 'Deep Learning' Giant for New AI Lab." Wired, 12/2013.
"Is 'Deep Learning' a Revolution in Artificial Intelligence?" New Yorker, 11/2012.
"The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI." Wired, 5/2013.
"New Techniques from Google and Ray Kurzweil Are Taking Artificial Intelligence to Another Level." MIT Technology Review, 5/2013.
4. … Publication & Search Trends …
[Charts: Google Scholar citations for "deep learning" + "neural network" (roughly 2006-2011, rising toward ~600 per year), and Google Trends search interest for big data, data science, deep learning, and machine learning over the same period.]
domains: computer vision, speech & audio, bioinformatics, etc.
Conferences: NIPS, ICLR, ICML, …
5. … Industry & Products
• Google
  – Android Voice Recognition
  – Maps
  – Image+
• SIRI
• Translation
• Documents
• …
Microsoft real-time English-Chinese translation demo (Chief Research Officer Rick Rashid, 11/2012): https://www.youtube.com/watch?v=Nu-nlQqFCKg
6. Deep Learning Epicenters (North America)
Academia: Hinton (U Toronto), Bengio (U Montreal), LeCun (NYU), Ng (Stanford), de Freitas (UBC)
Industry: Google, Facebook, Microsoft, Yahoo
12. How Good is "More Data?"
• More data dominates* better techniques
• Labels are expensive -> less data
• Often have lots of data…
• … we just don't have lots of labels
• What if there was a way to use unlabeled data?
[Figure 1: Learning curves for confusion set disambiguation, e.g. {to, two, too}. Test accuracy vs. millions of training words (0.1 to 1000) for memory-based, Winnow, perceptron, and naïve Bayes learners; accuracy climbs from roughly 0.75 toward 0.97 as the corpus grows. The memory-based learner used only the word before and word after as features; the training corpus was roughly 1 billion words drawn from a variety of English texts.]
“Scaling to Very Very Large Corpora for Natural Language Disambiguation,” Banko and Brill, 2001.
13. The Impact of Features
Intuitively: better features are good.
• Critical to success – even more than data!
• How to create / engineer features?
  – Typically shallow
  – Domain-specific
• What if there was a way to automatically learn features?
14. Machine Learning (What We Want)
Building a Cat Detector 2.0
[Diagram: data (bountiful) -> Features + Detector (Classifier) -> label, learned end-to-end; the features are the important* part.]
15. Deep Nets Intuition: Building an Object Recognition System
[Diagram: image of a car -> FEATURE EXTRACTOR -> intermediate representations -> CLASSIFIER -> label ("CAR")]
IDEA: Use data to optimize features for the given task.
Lee et al., "Convolutional DBNs for scalable unsupervised learning…," ICML 2009. (Slide credit: Ranzato)
16. Hierarchical Learning: Another Example of Hierarchy
• Natural progression from low- to high-level structure, as in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces
• A good lower-level representation can be used for many distinct tasks
[Feature hierarchy: edges -> parts -> faces]
17. Hierarchy Reusability?
A good lower-level representation can be used for many distinct tasks.
[Learned feature hierarchies for faces, cars, elephants, and chairs]
18. A Breakthrough
G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 2006.
G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, July 2006.
[before | after]
20. MNIST Sample Errors
Ciresan et al. “Deep Big Simple Neural Networks Excel on
Handwritten Digit Recognition,” 2010
21. Key Ideas
• Learn features from data
– Use all data
• Deep architecture
– Representation
– Computational efficiency
– Shared statistics
• Practical training
• State-of-the-art (it worked)
22. After: Cat Detector
unlabeled images (millions) + labeled images (few) -> deep learning network
-> more data, automatic (deep) features
24. This Is A Neuron
1. Sum all inputs (weighted): x = w0 + w1*z1 + w2*z2 + w3*z3
2. Nonlinearly transform: y = f(x)
Inputs z1, z2, z3; weights w1, w2, w3; bias w0 (on a fixed input of 1); activation function f, e.g. sigmoid or tanh; output y.
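A minimal sketch of these two steps in code (Python/numpy; the sigmoid choice and the input values are illustrative, not from the slide):

```python
import numpy as np

def sigmoid(x):
    """Squashing nonlinearity mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron(z, w, w0, f=sigmoid):
    """One neuron: weighted sum of inputs plus bias, then a nonlinear transform."""
    x = w0 + np.dot(w, z)   # step 1: x = w0 + w1*z1 + w2*z2 + w3*z3
    return f(x)             # step 2: y = f(x)

# Illustrative (made-up) inputs and weights for a three-input neuron.
y = neuron(z=np.array([1.0, 0.5, -2.0]),
           w=np.array([0.2, -0.4, 0.1]),
           w0=0.3)
```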
25. A Neural Network
Forward propagation: weighted sum of inputs, produce activation, feed forward.
[Diagram: inputs (the features): weight = 13.5, n_teeth = 21, n_whiskers = 16; these feed a hidden layer, whose activations feed the output units "cat" and "dog".]
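A minimal forward-propagation sketch for a network shaped like the one on this slide (three input features, one hidden layer, cat/dog outputs); the hidden-layer size, the softmax output, and the random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(z, W1, b1, W2, b2):
    """Forward propagation: weighted sums -> activations, fed layer to layer."""
    h = sigmoid(W1 @ z + b1)                       # hidden-layer activations
    scores = W2 @ h + b2                           # output pre-activations ("cat", "dog")
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the two classes
    return h, probs

z = np.array([13.5, 21.0, 16.0])               # features: weight, n_teeth, n_whiskers
rng = np.random.RandomState(0)
W1, b1 = 0.1 * rng.randn(4, 3), np.zeros(4)    # 4 hidden units (illustrative size)
W2, b2 = 0.1 * rng.randn(2, 4), np.zeros(2)    # 2 outputs: cat, dog
hidden, cat_dog_probs = forward(z, W1, b1, W2, b2)
```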
26. Training
Back propagation of error: compute the total error at the top (targets: cat = 1, dog = 0), then pass proportional contributions backwards through the network.
[Diagram: the network from slide 25 (inputs weight = 13.5, n_teeth = 21, n_whiskers = 16), with error flowing from the outputs back toward the inputs.]
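A sketch of one training step for the same toy network. The squared-error loss, sigmoid outputs, and learning rate are my choices for illustration; the slide only describes the general idea of pushing error contributions backwards:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(z, target, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent step: forward pass, error at the top,
    proportional contributions propagated backwards."""
    h = sigmoid(W1 @ z + b1)                 # forward: hidden layer
    y = sigmoid(W2 @ h + b2)                 # forward: outputs (cat, dog)
    err = y - target                         # total error at the top
    delta2 = err * y * (1 - y)               # output-layer responsibility
    delta1 = (W2.T @ delta2) * h * (1 - h)   # share of blame going backwards
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, z); b1 -= lr * delta1
    return 0.5 * np.sum(err ** 2)            # squared-error loss before the update

z = np.array([13.5, 21.0, 16.0])             # weight, n_teeth, n_whiskers
target = np.array([1.0, 0.0])                # cat = 1, dog = 0
rng = np.random.RandomState(0)
W1, b1 = 0.1 * rng.randn(4, 3), np.zeros(4)
W2, b2 = 0.1 * rng.randn(2, 4), np.zeros(2)
for _ in range(100):
    loss = backprop_step(z, target, W1, b1, W2, b2)
```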
27. After Training
After training, the network is summarized by its layer weights. Each layer's weights form a matrix (rows such as [.5, -.2, 4, .15, -1, …] and [-.5, -.3, .4, 0, …]), and we can view the weight matrix as an image… plus performance evaluation & logging.
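Viewing a weight matrix as an image is essentially one call with matplotlib; a sketch with placeholder weight values (not the actual trained weights):

```python
import numpy as np
import matplotlib.pyplot as plt

# One layer's weights as a matrix (placeholder values for illustration).
W = np.array([[ 0.5, -0.2,  0.4,  0.15],
              [-0.5, -0.3,  0.4,  0.0 ],
              [ 0.1,  0.5, -1.0,  2.0 ]])

plt.imshow(W, cmap="gray", interpolation="nearest")  # each weight becomes a pixel intensity
plt.colorbar()
plt.title("Layer weights viewed as an image")
plt.show()
```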
28. Building Blocks
So many choices!
• Network Topology
  – Number of layers
  – Nodes per layer
• Layer Type
  – Feedforward
  – Restricted Boltzmann
  – Autoencoder
  – Recurrent
  – Convolutional
• Neuron Type
  – Rectified Linear Unit
• Regularization
  – Dropout
• Magic Numbers
29. A Deep Learning Recipe, 1.0
• Lots of data, some labels
• Train each RBM layer greedily, successively
• Add an output layer and train with labels (a code sketch follows below)
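A minimal sketch of Recipe 1.0 using scikit-learn's BernoulliRBM. The layer sizes, learning rates, and synthetic data are illustrative assumptions; the talk does not prescribe a particular library:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_unlabeled = rng.rand(1000, 64)                            # lots of data, no labels
X_labeled, y = rng.rand(100, 64), rng.randint(0, 2, 100)    # only a few labeled examples

# Train each RBM layer greedily, feeding the previous layer's activations upward.
layers = []
H = X_unlabeled
for n_hidden in (32, 16):                                   # illustrative layer sizes
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=10, random_state=0)
    H = rbm.fit_transform(H)
    layers.append(rbm)

# Add an output layer and train it with the labels, on top of the learned deep features.
H_labeled = X_labeled
for rbm in layers:
    H_labeled = rbm.transform(H_labeled)
clf = LogisticRegression().fit(H_labeled, y)
```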
30. A Few Other Important Things
• Deep Learning Recipe 2.0
  – Dropout / regularization
  – Rectified Linear Units (both sketched below)
• Convolutional networks
• Hyperparameters
• Not just neural networks
• Practical Issues (GPU)
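The two Recipe 2.0 ingredients named above reduce to a few lines of code; a minimal numpy sketch (the keep probability and the inverted-dropout scaling are illustrative choices, not from the slides):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: pass positives through, clamp negatives to zero."""
    return np.maximum(0.0, x)

def dropout(activations, keep_prob=0.5, rng=np.random):
    """Randomly zero each unit during training; scale so the expected activation is unchanged."""
    mask = rng.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob
```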
35. Application: Speech
"He can for example present significant university wide issues to the senate."
Spectrogram: take a small time window of the signal, convert it to a vector of frequencies, slide the window (~15 ms), and repeat; phonemes show up as patterns across these windows.
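A sketch of this window/slide/repeat step with scipy. The 25 ms window length, 16 kHz sample rate, and synthetic audio are assumptions for illustration; the slide only specifies the ~15 ms slide:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # sample rate (Hz); TIMIT audio is 16 kHz
audio = np.random.randn(fs * 2)             # stand-in for 2 seconds of speech samples

window = int(0.025 * fs)                    # small time window (25 ms, illustrative)
hop = int(0.015 * fs)                       # slide by 15 ms
freqs, times, spec = spectrogram(audio, fs=fs, nperseg=window, noverlap=window - hop)
# spec has one column (a vector of frequencies) per window position: window -> vector; slide; repeat
```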
36. Automatic Speech: CDBNs for Speech
Convolutional DBN trained on unlabeled TIMIT data; learned first-layer bases.
Experimental results:
• Speaker identification (TIMIT, accuracy):
    Prior art (Reynolds, 1995)    99.7%
    Convolutional DBN            100.0%
• Phone classification (TIMIT, accuracy):
    Clarkson et al. (1999)        77.6%
    Gunawardana et al. (2005)     78.3%
    Sung et al. (2007)            78.5%
    Petrov et al. (2007)          78.6%
    Sha & Saul (2006)             78.9%
    Yu et al. (2009)              79.2%
    Convolutional DBN             80.3%
Lee et al., "Unsupervised feature learning for audio classification using convolutional deep belief networks," NIPS 2009.
37. A Long List of Others
• Kaggle
  – Merck Molecular Activity ('12)
  – Salary Prediction ('13)
• Learning to Play Atari Games ('13)
• NLP – chunking, NER, parsing, etc.
• Activity recognition from video
• Recommendations
38. Deep Learning In A Nutshell
• Architectures vs. features
• Deep vs. shallow
• Automatic* features
• Lots of data vs. best technique
• Compute- vs. human-intensive
• State-of-the-art
• Breaks expert, domain barrier
• Details & tricks can be complex
http://www.deeplearning.net/
39. Interested in Deep Learning?
Connect for:
• Training Workshop (interest list)
• Projects / consulting
• Collaboration
• Questions
agardner@momentics.com
http://www.momentics.com/deep-learning/
Editor's Notes
(1:00) Thank organizers & attendees. My background, thesis. Invitation to connect. Talk in 3 parts: introduction and motivation of the topic; high-level overview of deep learning details; examples.
How many have heard of deep learning?
Joke: Wired and ad placement. Companies are acquiring talent and demonstrating use cases. Zuckerberg @ NIPS.
Growing popularity. Lots of applications motivated by vision and audio. Sensible because of connections to perception, AI and neural networks. Revolutions have participants.
Products are seeing big lift. Example of real-time translation kept it in the same voice! "I'm speaking in English and hopefully you'll hear me speaking in Chinese in my own voice."
Apology for omission.
As a data scientist, you consume machine learning.
Consider the canonical problem: classification. Cats and dogs, cats and data scientists. In this case, we want to build a magic box that discriminates cats vs. dogs. Play on the Google cat detector: 1,000 nodes, 16,000 cores, 1 week per trial @ $1/hr = ? (June 2012). Cat detector detects better than a cat. Leaving data on the table.
Many examples, from all classes, required. Consequence: use less data. Features require lots of engineering and work. The example here, SIFT, took over a decade for David Lowe to develop. Many examples of features: tail, fur, eyes, edges, height, etc.
Features: raw numbers to a smaller, better pile of numbers. Many examples, from all classes, required. Consequence: use less data. Features require lots of engineering and work. The example here, SIFT, took over a decade for David Lowe to develop. Many examples of features: tail, fur, eyes, edges, height, etc. Best disciplined approach: copy and tweak. Show of hands: how many of you have experienced this?
80% of the data scientist's job. We don't scale: how long does it take to get a PhD? Each loop we have to do invention and ideation. "Won a Kaggle contest using RF." Workflow, feature engineering.
This is not always true, but good for high-variance problems. What are examples of extra data? Not just a little more data, but a lot of data. Often have a lot more data today in the connected world.
No principled way to generate features. No playbook for features on alien data.
Modules that learn features. Stack them and get a hierarchical decomposition.
Hinton split time. Before & after.
Describe MNIST; boring, easy. "Everything works at 96% accuracy."
This network achieved 0.35% error using online backprop. 6 hidden layers (2500, 2000, 1500, 1000, 500, 10), with validation & test error 0.35% & 0.32%.
Data flows from bottom to top. Affine + nonlinearity. Nonlinear regression. We have to learn the weights and bias. We have to pick the activation function.
Backprop top. Backprop global.
1000 categories. 25% -> 15% error. Acquired by Google 1/13.