1. AlexNet:
Context, Summary & Impact
Discussion by B. Hatt on
ImageNet Classification with Deep
Convolutional Neural Networks, NIPS 2012
by A. Krizhevsky, I. Sutskever & G. Hinton
2. Outline
• Importance of AlexNet
• Scientific Context
• Neural nets in 2012
• Convolutional nets
• KSH ’12 findings
• Limits
• Critics & costs
• Further works
• Industrial impact
• This presentation should last
about 50 min. without questions
• Feel free to interrupt at any time
with questions or misunderstandings
4. The most influential paper on data science
• 20,000 citations, more than any paper it cites or that cites it
• Taught to all aspiring data scientists, at university & on-line
• Fastest growing academic requirement for new positions
• Applications are rather narrow: image to tags
• Recognising faces for photos in social context
• Flagging content for further attention, search
• Ideas could be applied to other computationally intensive tasks
• Solutions are too complex to be explained to a human
5. Dramatic performance gain
• Top-5 error rate well below 25%
• No team has done worse than 25% since
• Proved that deep convolutional networks were the way to go for that problem class
• As long as you train your model on multiple GPUs
6. Important papers cited by or citing KSH ’12
Key papers cited by KSH or influential prior work
• Backpropagation applied to handwritten zip code recognition. Neural Computation 1989, LeCun et al.
• The MNIST database of handwritten digits, 1998, LeCun et al.
• Gradient-based learning applied to document recognition. Proc. IEEE 1998, LeCun et al.
• Learning to parse images. NIPS 2000, Hinton, Ghahramani & Teh
• Learning methods for generic object recognition with invariance to pose and lighting. CVPR 2004, LeCun et al.
• ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009, Deng et al.
Major papers citing KSH, directly or indirectly
Deeper
• Going Deeper with Convolutions, CVPR 2015, Szegedy et al. [Google & Magic Leap]
• Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015, Simonyan & Zisserman [Oxford]
Other architectures
• Generative Adversarial Networks, NIPS 2014, Goodfellow et al.
• Dynamic Routing Between Capsules, NIPS 2017, Sabour, Frosst & Hinton
7. A more engineering than academic problem
• Reproduction is difficult without
unpublished code, computing
and engineering resources
• Few large datasets of tagged
images to test and train, bias in
use cases
• Conference presentations more publicized than the papers’ claims
• Heavily sponsored conferences
• Targeted at applicant hiring
• Scalable computing framework
and resources
• Theano, TensorFlow, Keras, etc.
• Amazon Web Services,
Google Cloud Platform
8. What this paper is
Image to classification
• Parallel architecture to scale the model
• Set of implementation tips
• Imperfect solution to scale, reflection, colour & illumination, rotation (2D), point of view (3D)
What it is not about
Other image processing
• Leverage hierarchy in the training set
• Locate the interesting parts
• Boolean: whether the object is present at all
• Edge detection, separating shapes
• Imagine what is hidden behind
• 2D to 3D representation
• Duplicate elements, counting
• Video processing, still selection
11. A network of layers, activity flowing forward
• Very large input set (images)
• into a small outcome (classifier)
• Each neuron has an activation value
• Neurons in each layer rely on lower layers for their activation
• Weighted (algebraic) sum of the previous layer’s activations
• Filtered by an activation function
a_0^(1) = σ( w_{0,0}^(1)·a_0^(0) + w_{1,0}^(1)·a_1^(0) + … )
• Iteratively, until the last layer
• Historically σ mapped into [0, 1]; now σ is commonly max(0, x)
[Diagram: layered network labelled with activations a_i^(l) and weights w_{i,j}^(l)]
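As a toy illustration of that rule (sizes and values below are made up), the forward pass is nothing more than a loop of weighted sums followed by the activation σ:

```python
import numpy as np

def sigmoid(x):
    # Historical choice of activation: squashes into [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Modern choice: max(0, x)
    return np.maximum(0.0, x)

def forward(activations, weights, biases, act=relu):
    # activations: input vector a^(0); weights/biases: one pair per layer
    a = activations
    for W, b in zip(weights, biases):
        # a^(l) = sigma(W^(l) . a^(l-1) + b^(l)), applied layer by layer
        a = act(W @ a + b)
    return a

# Toy example: a 4-pixel "image" pushed through two layers down to 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.01, (3, 4)), rng.normal(0, 0.01, (2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.random(4), weights, biases))
```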
13. Back propagation 2: how to learn step by step
• Initialisation
• Training loop
• First feed-forward
• Algebra input + Activation
• Calculate Errors
• Back-propagate
• Deeply combined derivatives
• Learn new weight & bias
• Next loop
• Until cost slows to a stop
Cost := | Y − SoftMax(a_1^(n), …, a_M^(n)) |
∂Cost / ∂w_{i,j}^(k) = ∑ σ′(…)·W·(σ ∘ … L−1 times ∘ σ)(…)
This chain rule defines a gradient along ∑_layers (N_l·N_{l−1} + N_l) dimensions
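A minimal NumPy sketch of that loop, under simplifying assumptions (a tiny dense network and a squared-error cost instead of the softmax cost above), just to show where the chain rule enters:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 4))            # 8 toy samples, 4 features
Y = rng.random((8, 2))            # 2 toy targets per sample

# Initialisation
W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.1, (3, 2)), np.zeros(2)
lr = 0.1

for step in range(1000):          # training loop
    # Feed-forward: algebra + activation (ReLU)
    z1 = X @ W1 + b1
    a1 = np.maximum(0.0, z1)
    out = a1 @ W2 + b2            # linear output layer
    # Calculate errors (squared cost here; softmax cross-entropy in the paper)
    err = out - Y
    cost = (err ** 2).mean()
    # Back-propagate: chain rule, one layer at a time
    g_out = 2 * err / len(X)
    gW2, gb2 = a1.T @ g_out, g_out.sum(0)
    g_a1 = g_out @ W2.T
    g_z1 = g_a1 * (z1 > 0)        # derivative of ReLU
    gW1, gb1 = X.T @ g_z1, g_z1.sum(0)
    # Learn new weights & biases, then next loop
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```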
14. The perceptron: universal approximator
• Theorem: if σ is continuous & non-linear, any function can be approximated as closely as you like by a perceptron, given enough layers & neurons
• Exploding & vanishing gradients:
• with weights just below or above 1, their cumulative impact respectively vanishes or explodes on deep networks (see the sketch below)
• Exploding complexity of the gradient:
• ∂(σ ∘ … L times ∘ σ)(w) / ∂w = ∑ σ′·(σ ∘ … L−1 times ∘ σ)(w)
is unwieldy for continuous saturating functions: sigmoid, tanh, logit
• ReLU: x ↦ max(0, x) is easier to differentiate iteratively
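A quick numerical sketch of the vanishing-gradient point (illustrative depth and input value): the chained derivative through 20 stacked sigmoids collapses towards zero, while ReLU’s derivative of 1 on the active side leaves it intact:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25

x, depth = 0.5, 20
# Chain rule through `depth` stacked sigmoids: the product of derivatives shrinks fast
grad_sigmoid = 1.0
for _ in range(depth):
    grad_sigmoid *= sigmoid_prime(x)
    x = sigmoid(x)
print(grad_sigmoid)               # on the order of 1e-13: vanished

# ReLU derivative is exactly 1 on the active side, so the product stays 1
x = 0.5
grad_relu = 1.0
for _ in range(depth):
    grad_relu *= 1.0 if x > 0 else 0.0
    x = max(0.0, x)
print(grad_relu)                  # 1.0
```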
17. Architecture
• Image re-sizing, color to grey-scale
• Three specific processes to reduce dimensions
• Convolution: recognize small patterns
• Rectification/Activation: ignore irrelevant
• Pooling: feature, not exact position
• Final layers
• Flatten (re-shape)
• Fully interconnected (much smaller dimension than image definition)
• Normalisation, typically SoftMax
18. Matrix filters aka Convolutions
• Local concerns
• Hierarchical structure
• Each cell of the filter is a parameter
• One layer typically shares a set of filters
• Kernel = Filter = Weight = Feature matrix =
Feature map = Activation map =
Parameters to be trained
Span 1 → 9 weights (3×3 filter); Span 2 → 25 weights (5×5 filter)
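As a concrete sketch of a single filter (plain NumPy; the kernel values are a hand-picked edge detector standing in for a learned one), the convolution is just a local weighted sum slid over the image:

```python
import numpy as np

def conv2d(image, kernel, step=1):
    # Slide the kernel over the image (no padding); step = stride
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // step + 1
    ow = (image.shape[1] - kw) // step + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * step:i * step + kh, j * step:j * step + kw]
            out[i, j] = (patch * kernel).sum()   # one local weighted sum
    return out

# A span-1 (3x3 = 9 weights) vertical-edge filter; these 9 values are the
# trainable parameters, shared across the whole image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
image = np.zeros((8, 8)); image[:, 4:] = 1.0     # left half dark, right half bright
print(conv2d(image, kernel))                     # strong response along the edge
```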
20. Whole image processed locally, with some possible overlap
N^(0) × N^(0) input → N^(1) × N^(1) output, where N^(1) = N^(0)/step
21. Pooling, typically using MaxPool
N^(1) × N^(1) input → N^(2) × N^(2) output, where N^(2) = N^(1)/stride
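A minimal max-pooling sketch (NumPy, illustrative sizes), using the overlapping window size 3 and stride 2 reported in the paper:

```python
import numpy as np

def maxpool(feature_map, size=3, stride=2):
    # Keep the strongest activation in each window: feature presence matters,
    # its exact position does not. size=3, stride=2 is AlexNet's overlapping
    # pooling; the output side is roughly N(1)/stride.
    n = (feature_map.shape[0] - size) // stride + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(maxpool(fmap).shape)   # (2, 2): a 6x6 map reduced to 2x2
```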
22. Architecture
• Image re-sizing, color to grey-scale
• Three specific processes to reduce dimensions
• Convolution: recognize small patterns (span, step & padding)
• Rectification/Activation: ignore the irrelevant (sigmoid, tanh or ReLU)
• Pooling: feature, not exact position (stride, pooling function)
• Final layers
• Flatten (re-shape)
• Fully interconnected (much smaller dimension than the image definition)
• Normalisation, typically SoftMax
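As an illustration only (not AlexNet’s actual layer sizes), this pipeline maps naturally onto a few lines in one of the frameworks cited earlier (Keras on TensorFlow); the layer counts and widths here are arbitrary:

```python
import tensorflow as tf
from tensorflow.keras import layers

# conv -> ReLU -> pool blocks, then flatten -> fully connected -> softmax
model = tf.keras.Sequential([
    layers.Conv2D(16, kernel_size=5, padding="same", activation="relu",
                  input_shape=(64, 64, 1)),          # small resized grey-scale input
    layers.MaxPooling2D(pool_size=3, strides=2),     # feature presence, not exact position
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),                                # re-shape into a vector
    layers.Dense(128, activation="relu"),            # fully connected, much smaller dimension
    layers.Dense(10, activation="softmax"),          # normalisation into class probabilities
])
model.summary()
```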
23. The challenge of increasing layers
• We can go from one large, colourful image to far fewer dimensions
• But that requires many layers, and
the network becomes complex to train
• Larger networks have better results
• Largest networks hit computation limits with passable results until 2012
24. Progress so far
• Importance of AlexNet >>
• Scientific Context
• Neural nets in 2012 >>
• Convolutional nets >>
• KSH ’12 findings >>
• Limits
• Critics & costs >>
• Further works >>
• Industrial impact >>
25. A. Krizhevsky, I. Sutskever
& G. Hinton at NIPS 2012
The AlexNet paper itself: findings, insights
26. Abstract
• We trained a large, deep convolutional neural network to classify the 1.2
million high-resolution images in the ImageNet LSVRC-2010 contest into
the 1000 different classes.
• On the test data, we achieved top-1 and top-5 error rates of 37.5% and
17.0% which is considerably better than the previous state-of-the-art.
• The neural network, which has 60 million parameters and 650,000
neurons, consists of five convolutional layers, some of which are followed
by max-pooling layers, and three fully-connected layers with a final 1000-
way softmax.
• To make training faster, we used non-saturating neurons and a very
efficient GPU implementation of the convolution operation.
• To reduce overfitting in the fully-connected layers we employed a
recently-developed regularization method called “dropout” that proved
to be very effective.
• We also entered a variant of this model in the ILSVRC-2012 competition
and achieved a winning top-5 test error rate of 15.3%, compared to 26.2%
achieved by the second-best entry.
Context: classic image-tagging reference
Results: exceptional, a big step forward
Architecture: rather sophisticated for the time
Approach 1: relevant training shortcut
Approach 2: new regularisation technique
More results: also great on a similar dataset
27. Findings
• 37% top-1 error: not perfect, not industrially usable
• 17% top-5 error: good enough to suggest labels, making a human classifier more efficient
• Largest convolutional network at the time, limited overfitting
• Not:
• New technique overall: minor training improvements on 5-layer ConvNet
• Hopeful that deeper approaches would work
• Exploiting Residual learning differently
• No discussion on scoring quality estimate to prioritise ground-truth feedback
28. Dataset
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
• 15 million high-resolution images
• 22k hierarchical labels via Amazon’s Mech. Turk
• 1.2 M training, 50k validation & 150k testing images
• Test on a subset with roughly 1000 images in each of 1000 categories
• Down-sampled images to 256 × 256
• Rectangular images: rescaled shorter side to 256 & cropped
middle
• Subtracting the mean activity over the training set from each
pixel
• No other pre-processing
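A sketch of that pre-processing, assuming PIL/NumPy and placeholder file names (the paper does not publish its pipeline in this form): rescale the shorter side to 256, crop the central patch, then subtract the training-set mean from each pixel.

```python
import numpy as np
from PIL import Image

def preprocess(path, size=256):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Rescale so the shorter side is 256, then crop the central 256 x 256 patch
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left, top = (img.width - size) // 2, (img.height - size) // 2
    img = img.crop((left, top, left + size, top + size))
    return np.asarray(img, dtype=np.float32)

# Subtract the mean activity over the training set from each pixel;
# no other pre-processing (file names here are placeholders)
train = np.stack([preprocess(p) for p in ["img1.jpg", "img2.jpg"]])
mean_image = train.mean(axis=0)
train_centered = train - mean_image
```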
30. Methodology & issues
• Object recognition too complex: Convolutional neural network
• Flexible (depth and breadth) & Relevant (stationarity & locality)
• Fewer parameters than other NNs, easier to train
• Still prohibitively expensive at large scale on high-resolution images
• Current GPUs & highly-optimized convolution
• C++ implementation of the ConvNet publicly available at code.google.com/archive/p/cuda-convnet
• Best results ever on ILSVRC-2010 & -2012
• Unusual features for performance & faster training
• New technique to prevent overfitting
• Fewer layers, worse performance
31. Architecture 1: key features
• Non-linearity: Rectified Linear Unit (ReLU): f(x) = max(0, x)
• Trains an order of magnitude faster than saturating alternatives
• Parallel processing: more memory allows larger networks
• 2 x NVIDIA GTX 580 3GB GPUs , suitable for convolutional structure
• Normalisation: local inhibition
• b_{x,y}^i = a_{x,y}^i / ( k + α·∑_{j ∈ i±n/2} (a_{x,y}^j)² )^β
• AlexNet uses k = 2, n = 5, α = 10^−4 and β = 0.75
• Overlapping pooling: s = 2 and z = 3
• First layer: s = 4 and z = 11; second: s = 2 and z = 5
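The local response normalisation above translates almost literally into code; a sketch, assuming activations laid out as (channels, height, width):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (channels, height, width); each activation is divided by a
    # sum of squares over the n neighbouring channels at the same (x, y):
    # b[i] = a[i] / (k + alpha * sum_{j in i +/- n/2} a[j]**2) ** beta
    channels = a.shape[0]
    b = np.empty_like(a)
    for i in range(channels):
        lo, hi = max(0, i - n // 2), min(channels, i + n // 2 + 1)
        denom = (k + alpha * (a[lo:hi] ** 2).sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b

activations = np.random.default_rng(0).random((16, 8, 8))
print(local_response_norm(activations).shape)   # unchanged: (16, 8, 8)
```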
34. Image augmentation
• Image translations and horizontal reflections
• Test: ten 224 × 224 patches (four corners & center) + horizontal reflections; average the softmax predictions of the ten patches
• Change intensity and color of the illumination (object invariant)
• PCA on the set of RGB values in training set;
add multiples of the found principal components
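A sketch of the PCA colour augmentation, under simplifying assumptions (one random draw per batch instead of one per image, and a toy batch): PCA on the RGB values of the training set, then add random multiples of the principal components scaled by their eigenvalues.

```python
import numpy as np

def pca_colour_augment(images, rng, sigma=0.1):
    # images: (N, H, W, 3) floats. PCA on the set of RGB values, then add
    # random multiples of the found principal components.
    pixels = images.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)               # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    alphas = rng.normal(0.0, sigma, size=3)          # paper: one draw per image
    shift = eigvecs @ (alphas * eigvals)             # same RGB shift for every pixel
    return images + shift                            # broadcast over H, W

rng = np.random.default_rng(0)
batch = rng.random((4, 32, 32, 3))
augmented = pca_colour_augment(batch, rng)
```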
35. Dropout
• Setting to zero the output of each hidden neuron with probability 0.5
• Applied to the first two fully-connected layers
• “Dropped out” neurons take part in neither the forward pass nor back-propagation
• Reduces complex co-adaptations of neurons
• One neuron cannot rely on the presence of particular other neurons
• Learns more robust features, relying on different random subsets of neurons
• Doubles training time
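A minimal sketch of dropout at p = 0.5 (the paper scales outputs at test time, rather than using the “inverted” variant common in modern frameworks):

```python
import numpy as np

def dropout(activations, p=0.5, rng=None, training=True):
    # Set each hidden activation to zero with probability p during training,
    # so no neuron can rely on the presence of particular other neurons.
    if not training:
        return activations * (1.0 - p)    # paper: scale outputs at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask

rng = np.random.default_rng(0)
hidden = rng.random(10)
print(dropout(hidden, rng=rng))           # roughly half the values zeroed
```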
36. Training
• Training: stochastic gradient descent
• Batch size of 128 examples
• Large momentum of 0.9
• Small decay of 0.0005
• Initialisation: not null, to give a signal to ReLU
• w0 ~ G(0, 0.01)
• bias = 1 in the 2nd, 4th & 5th convolutional & the fully-connected hidden layers
• Learning rate: 0.01 overall
• Divided by 10 when validation error was not improving
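The update rule quoted in the paper, as a small NumPy sketch (the gradient below is a random placeholder, not a real batch gradient):

```python
import numpy as np

def sgd_update(w, grad, velocity, lr=0.01, momentum=0.9, decay=0.0005):
    # Update rule from the paper:
    #   v := 0.9 * v - 0.0005 * lr * w - lr * dL/dw
    #   w := w + v
    velocity = momentum * velocity - decay * lr * w - lr * grad
    return w + velocity, velocity

# Toy usage: one parameter tensor, gradient averaged over a batch of 128
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.01, (3, 3))         # w0 ~ G(0, 0.01)
v = np.zeros_like(w)
for step in range(10):
    grad = rng.normal(size=w.shape)       # placeholder for a real batch gradient
    w, v = sgd_update(w, grad, v)
    # divide lr by 10 when the validation error stops improving (not shown)
```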
41. Discussion in the paper
• Very little self-criticism: industrial result
• Some suggestion
• performance degrades if a single convolutional layer is removed (2% top-1)
• [no] unsupervised pre-training even though we expect that it will help
• Perspectives
• we still have many orders of magnitude to go in order to match the infero-
temporal pathway of the human visual system
• we would like to use very large and deep convolutional nets on video
sequences where the temporal structure provides very helpful information
42. Information for reproduction
A lot given
• Image set public
• Cuda code published
• Standard Training/test
• Meta-parameters set
• Extensive supplement data
• Sample of closest vector images
Missing
• Variance between iterations
• Training issues, exploding gradient
• Kernel for each layer
43. Progress so far
• Importance of AlexNet >>
• Scientific Context
• Neural nets in 2012 >>
• Convolutional nets >>
• KSH ’12 findings >>
• Limits
• Critics & costs >>
• Further works >>
• Industrial impact >>
45. Cost of computation
• Dedicated amateur could reproduce
• Better results mainly by brute-force
• Complexity justifies Computing-as-a-service
• Centralisation of the ability to recognize images
46. Finding image dataset to train with
• Who needs to flag so many images?
• Applications are specialized (logistical chain, faces)
therefore potential users have the right dataset
• Non-image datasets are easier to find (e.g. speech, or text classification)
47. Unbalanced detection
• CNNs are not great with unbalanced training sets
• Better results with Kalman filters & SVMs
• Radiology: Healthy scans vs. potential cancer
• Astronomy: Galaxy vs. gravitational lenses
49. Deeper neural nets
• More processing power, better results
• Complexity of training follows
• Fit representation of all weights in memory
• Computation of gradient along many dimensions
• Well-informed training sets are expensive to gather
• Reach a depth where convergence is hard to get
• Shortcut connections that skip layers allow starting with simpler inferences
• Capsules: complementary structural elements from layers
50. Computation framework
• Meta-parameters become the problem
• ‘Neural network architects’
• Need to express the structure in a coherent syntax using a framework
• Caffe, PyTorch, Theano, TensorFlow
• Handle engineering challenges, like parallelisation
51. Deconvolution & Adversarial approaches
Labels to image suggestion
• Generative adversarial network: two responding networks
• Flag generated images to improve quality
• Deep Dream: introduce more positive minor elements
• ML hacking: minor perturbations targeted to cause consistent errors
53. Recognising images
• Image search as a product
• Medical images to diagnosis
• Industrial applications
• Object categorization, position in a logistical chain
• Photos as an economic proxy, e.g. parking lots for activity
• Face labelling & social ties
• Opening more images to
copyright abuses
• Re-think of detection processes
Concentration of diagnostic tools
• Automated flagging of
inappropriate content
• Discrimination from photos
54. Beyond direct application
• More inference
• Separate edges, attention
• 2D to 3D representation, position
• Imagine what is hidden behind
• Image colourisation
• Photo transformations
• Style transfer: artist, season, light
• Adversarial generation
• Psychedelic augmentation
• Video processing
• Still selection
• Video editing
• Classifying behavior
• Beyond photos
• 2- and 3-dimensional signals
• Music, voice generation
• Recurrent Neural network with
Long Short-Term Memory:
• Text translation, generation
• Speech processing
LeCun: applied gradient descent to neural nets.
ImageNet: not a publication but a challenge, the reference for an emerging technique; a large training set is essential.
Clusters of neurons matching features
Of increasing level of abstraction
Given enough parameters
With non-linearities
Chain rule explosion
We could represent the whole math with matrix products
Actually, code that implements this, like cuda-convnet, which is the engine of AlexNet, was published at the same time.
But I want to illustrate how complicated the math really is.
Given enough parameters
With non-linearities
Clusters of neurons matching features
Of increasing level of abstraction
For each convolutional layer, you can have one, two or three of the slice types, each with their own parameters.
Explicit reference to the biological neuronal structures isolated in the 60s.
LeCun 1989:
- Local connection
- Layers of increasing abstraction
Spatial invariance
Only local part is connected, “slide over”
Otherwise, combinatorial explosion
once we know that
a specific feature is in the original input volume
(there will be a high activation value),
its exact location is not as important as
its relative location to the other features
For each convolutional layer, you can have one, two or three of the slice types, each with their own parameters.
Any question about the context of the paper so far?
Do both ideas of
what is a convolutional net, and
what is back propagation
make sense to you?
Ok, so let’s move on to the core of this presentation:
The findings of the paper itself.
What is interesting is that this paper
was not seen as spectacular at the time
Overall, very exhaustive documentation, which explains a lot of the success.
The rest of the industry might not have done so well
Support vector machines can more easily separate the part of the feature space that we are trying to detect, without ignoring rare events in order to maximise the cost function.