These are the slides from a presentation Terry T. Um gave at Kookmin University on 22 June, 2014. Feel free to share them, and please let me know if you spot any misconceptions or errors.
(http://t-robotics.blogspot.com)
(http://terryum.io)
Introduction to Machine Learning and Deep Learning
1. Terry Taewoong Um (terry.t.um@gmail.com)
University of Waterloo
Department of Electrical & Computer Engineering
Terry Taewoong Um
MACHINE LEARNING,
DEEP LEARNING, AND
MOTION ANALYSIS
1
2. Terry Taewoong Um (terry.t.um@gmail.com)
CAUTION
• I cannot explain everything
• You cannot get every detail
2
• Try to get a big picture
• Get some useful keywords
• Connect with your research
3. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
1. What is Machine Learning?
(Part 1 Q & A)
2. What is Deep Learning?
(Part 2 Q & A)
3. Machine Learning in Motion Analysis
(Part 3 Q & A)
3
4. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
4
1. What is Machine Learning?
5. Terry Taewoong Um (terry.t.um@gmail.com)
WHAT IS MACHINE LEARNING?
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E." – T. Mitchell (1997)
Example: A program for soccer tactics
5
T : Win the game
P : Goals
E : (x) Players' movements
(y) Evaluation
6. Terry Taewoong Um (terry.t.um@gmail.com)
WHAT IS MACHINE LEARNING?
6
"Toward learning robot table tennis", J. Peters et al. (2012)
https://youtu.be/SH3bADiB7uQ
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E." – T. Mitchell (1997)
7. Terry Taewoong Um (terry.t.um@gmail.com)
TASKS
7
classification
discrete target values
x : pixels (28*28)
y : 0, 1, 2, 3, …, 9
regression
real target values
x ∈ (0,100)
y : real values
clustering
no target values
x ∈ (−3,3) × (−3,3)
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E." – T. Mitchell (1997)
8. Terry Taewoong Um (terry.t.um@gmail.com)
PERFORMANCE
8
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E." – T. Mitchell (1997)
classification
0-1 loss function
regression
L2 loss function
clustering
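The 0-1 loss and L2 loss on this slide can be sketched numerically; a minimal example with made-up predictions and targets (not from the slides):

```python
import numpy as np

# 0-1 loss for classification: the fraction of misclassified examples
y_true_cls = np.array([3, 7, 2, 2])
y_pred_cls = np.array([3, 1, 2, 2])
zero_one_loss = np.mean(y_true_cls != y_pred_cls)  # 1 wrong out of 4

# L2 loss for regression: the mean squared difference
y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.5, 2.0, 2.0])
l2_loss = np.mean((y_true_reg - y_pred_reg) ** 2)
```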
9. Terry Taewoong Um (terry.t.um@gmail.com)
EXPERIENCE
9
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E." – T. Mitchell (1997)
classification
labeled data
(pixels) → (number)
regression
labeled data
(x) → (y)
clustering
unlabeled data
(x1,x2)
10. Terry Taewoong Um (terry.t.um@gmail.com)
A TOY EXAMPLE
10
? Height(cm)
Weight
(kg)
[Input X]
[Output Y]
11. Terry Taewoong Um (terry.t.um@gmail.com)
11
180 Height(cm)
Weight
(kg)
80
Y = aX+b
Model : Y = aX+b Parameter : (a, b)
[Goal] Find (a,b) which best fits the given data
A TOY EXAMPLE
12. Terry Taewoong Um (terry.t.um@gmail.com)
12
[Analytic Solution]
Least square problem
(from AX = b, X = A#b where
A# is A's pseudo-inverse)
Not always available
[Numerical Solution]
1. Set a cost function
2. Apply an optimization method
(e.g. Gradient Descent (GD) Method)
(figure: cost surface L over the parameters (a, b))
http://www.yaldex.com/game-
development/1592730043_ch18lev1sec4.html
Local minima problem
http://mnemstudio.org/neural-networks-
multilayer-perceptron-design.htm
A TOY EXAMPLE
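Both routes on this slide can be sketched for the toy model Y = aX + b, using assumed data with true parameters a = 2, b = 1: the analytic pseudoinverse solution and a plain gradient descent loop on the L2 cost.

```python
import numpy as np

# Toy data following Y = 2X + 1 (assumed values, for illustration only)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

# Analytic solution: stack [X, 1] into A and use the pseudoinverse A# = pinv(A)
A = np.stack([X, np.ones_like(X)], axis=1)
a_ls, b_ls = np.linalg.pinv(A) @ Y

# Numerical solution: gradient descent on the cost L(a,b) = mean((aX + b - Y)^2)
a, b = 0.0, 0.0
lr = 0.02
for _ in range(5000):
    err = a * X + b - Y
    a -= lr * 2 * np.mean(err * X)   # dL/da
    b -= lr * 2 * np.mean(err)       # dL/db
```

On this convex cost both routes agree; the local-minima problem on the slide appears only once the cost surface is non-convex.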
13. Terry Taewoong Um (terry.t.um@gmail.com)
13
32 Age(year)
Running
Record
(min)
140
WHAT WOULD BE THE CORRECT MODEL?
Select a model → Set a cost function → Optimization
14. Terry Taewoong Um (terry.t.um@gmail.com)
14
? X
Y
WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model
"overfitting"
15. Terry Taewoong Um (terry.t.um@gmail.com)
15
L2 REGULARIZATION
(e.g. w=(a,b) where Y=aX+b)
Avoid a complicated model!
• Another interpretation : Maximum a Posteriori (MAP)
http://goo.gl/6GE2ix
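The "avoid a complicated model" idea has a closed form for linear models; a minimal ridge-regression sketch (assumed random data, not from the slides) showing that a larger L2 penalty shrinks the weights:

```python
import numpy as np

# Ridge regression: minimize ||Aw - y||^2 + lam * ||w||^2
# Closed form: w = (A^T A + lam * I)^-1 A^T y
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
y = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

def ridge(A, y, lam):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

w_small = ridge(A, y, 0.01)   # weak penalty: close to plain least squares
w_large = ridge(A, y, 100.0)  # strong penalty: weights pulled toward zero
```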
16. Terry Taewoong Um (terry.t.um@gmail.com)
16
L2 REGULARIZATION
• Another interpretation : Maximum a Posteriori (MAP)
http://goo.gl/6GE2ix
• Bayesian inference
P(Belief|Data) = P(Belief) P(Data|Belief) / P(Data)
(posterior = prior × likelihood / evidence)
ex) fair coin : 50% H, 50% T
falsified coin : 80% H, 20% T
Let's say we observed ten heads consecutively.
What's the probability of the coin being fair?
Fair coin?
P(Belief) = 0.2
P(Data|Belief) = 0.5^10 ≈ 0.001
P(Belief|Data) ∝ 0.2 × 0.001 = 0.0002
Falsified coin?
P(Belief) = 0.8
P(Data|Belief) = 0.8^10 ≈ 0.107
P(Belief|Data) ∝ 0.8 × 0.107 = 0.0856
After normalization:
Fair = 0.0002 / (0.0002 + 0.0856) = 0.23%, Unfair = 99.77%
(you don't believe this coin is fair)
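The coin example above is a three-line computation; reproducing the slide's numbers directly:

```python
# Bayesian update for the coin example: prior P(fair) = 0.2, P(falsified) = 0.8,
# and ten heads observed in a row.
p_fair, p_unfair = 0.2, 0.8
like_fair = 0.5 ** 10      # ≈ 0.001
like_unfair = 0.8 ** 10    # ≈ 0.107

post_fair = p_fair * like_fair        # ≈ 0.0002, unnormalized posterior
post_unfair = p_unfair * like_unfair  # ≈ 0.0859, unnormalized posterior
norm = post_fair + post_unfair        # the P(Data) normalization term
print(post_fair / norm)               # ≈ 0.0023, i.e. about a 0.23% chance of a fair coin
```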
17. Terry Taewoong Um (terry.t.um@gmail.com)
17
WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model
(figure: error vs. training time; the training error keeps decreasing while the test error turns upward, so we should stop there)
training set : for training (parameter optimization)
validation set : for early stopping (avoid overfitting)
test set : for evaluation (measure the performance)
keep watching the validation error
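The "keep watching the validation error" rule can be sketched as an early-stopping loop; the error curves here are assumed toy sequences (a U-shaped validation curve), not output from a real model:

```python
# Minimal early-stopping sketch: stop when the validation error has not
# improved for `patience` consecutive checks.
train_err = [1.0 / (t + 1) for t in range(20)]                  # keeps decreasing
val_err = [0.5 + 0.002 * (t - 8) ** 2 for t in range(20)]       # U-shaped, minimum at t = 8

best_t, best_val = 0, float("inf")
patience, waited = 3, 0
for t in range(20):
    if val_err[t] < best_val:
        best_t, best_val = t, val_err[t]   # new best checkpoint
        waited = 0
    else:
        waited += 1
        if waited >= patience:             # validation error kept rising: stop
            break
```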
18. Terry Taewoong Um (terry.t.um@gmail.com)
18
NONPARAMETRIC MODEL
• It does not assume any parametric model (e.g. Y = aX+b, Y = aX² + bX + c, etc.)
• It often requires many more samples
• Kernel methods are frequently applied for modeling the data
• Gaussian Process Regression (GPR), a kind of kernel method, is a widely-used
nonparametric regression method
• Support Vector Machine (SVM), also a kind of kernel method, is a widely-used
nonparametric classification method
kernel function
[Input space] [Feature space]
19. Terry Taewoong Um (terry.t.um@gmail.com)
19
SUPPORT VECTOR MACHINE (SVM)
"Myo", Thalmic Labs (2013)
https://youtu.be/oWu9TFJjHaM
[Linear classifiers] [Maximum margin]
Support Vector Machine Tutorial, J. Weston, http://goo.gl/19ywcj
[Dual formulation]
kernel function
kernel function
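The kernel function idea behind the SVM dual can be shown in a tiny sketch (hypothetical vectors, degree-2 polynomial kernel): the kernel value equals an inner product in an explicit feature space that we never have to construct.

```python
import numpy as np

# Kernel trick: a degree-2 polynomial kernel on 2-D inputs ...
def poly2_kernel(x, z):
    return (x @ z) ** 2

# ... equals a plain inner product under this explicit feature map
def phi(x):  # (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
k_implicit = poly2_kernel(x, z)   # works directly in input space
k_explicit = phi(x) @ phi(z)      # works in the (higher-dim.) feature space
```

The dual formulation only ever touches the data through such kernel values, which is why the feature space can be high- (even infinite-) dimensional at no extra cost.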
20. Terry Taewoong Um (terry.t.um@gmail.com)
20
GAUSSIAN PROCESS REGRESSION (GPR)
https://youtu.be/YqhLnCm0KXY
https://youtu.be/kvPmArtVoFE
• Gaussian Distribution
• Multivariate regression likelihood
posterior
prior
likelihood
prediction conditioning the joint distribution of the observed & predicted values
https://goo.gl/EO54WN
http://goo.gl/XvOOmf
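The "prediction by conditioning the joint distribution" step on this slide can be written out in a few lines of numpy; kernel, hyperparameters, and data are assumed toy choices, not the slide's example:

```python
import numpy as np

# Minimal GP regression sketch with an RBF kernel.
def rbf(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-2.0, 0.0, 2.0])    # observed inputs
y = np.sin(X)                     # observed outputs
Xs = np.array([0.0])              # test input
noise = 1e-6                      # jitter for numerical stability

# Condition the joint Gaussian of observed & predicted values:
K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(Xs, X)
mean = Ks @ np.linalg.solve(K, y)                   # posterior mean at Xs
var = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)   # posterior (co)variance
```

At a test point that coincides with a training point, the posterior mean matches the observation and the variance collapses toward the noise level.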
21. Terry Taewoong Um (terry.t.um@gmail.com)
21
DIMENSION REDUCTION
[Original space] [Feature space]
low dim. high dim.
high dim. low dim.
x → h(x)
• Principal Component Analysis
: Find the best orthogonal axes
(=principal components) which
maximize the variance of the data
Y = P X
* The rows in P are the m largest eigenvectors of
(1/N) X Xᵀ (the covariance matrix)
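The Y = PX projection above can be sketched in numpy on assumed random data (note the convention flip: with samples as rows, the covariance is XᵀX/N and the projection is Y = XPᵀ):

```python
import numpy as np

# PCA sketch: rows of P are the top-m eigenvectors of the covariance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.01 * X[:, 2]           # third axis carries almost no variance
X = X - X.mean(axis=0)             # center the data first

cov = X.T @ X / len(X)             # (1/N) X^T X for row-major data
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
P = eigvecs[:, ::-1][:, :2].T      # top-2 principal components as rows
Y = X @ P.T                        # dimension-reduced data (200 x 2)
```

The discarded direction is exactly the one with negligible variance, which is the sense in which PCA's axes "maximize the variance of the data".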
22. Terry Taewoong Um (terry.t.um@gmail.com)
22
DIMENSION REDUCTION
http://jbhuang0604.blogspot.kr/2013/04/miss-korea-2013-contestants-face.html
23. Terry Taewoong Um (terry.t.um@gmail.com)
23
SUMMARY - PART 1
• Machine Learning
- Tasks : Classification, Regression, Clustering, etc.
- Performance : 0-1 loss, L2 loss, etc.
- Experience : labeled data, unlabeled data
• Machine Learning Process
(1) Select a parametric / nonparametric model
(2) Set a performance measure, including a regularization term
(3) Train on the data (optimize the parameters) until the validation error increases
(4) Evaluate the final performance using the test set
• Nonparametric models : Support Vector Machine, Gaussian Process Regression
• Dimension reduction : used for pre-processing the data
24. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
24
Questions about Part 1?
25. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
25
2. What is Deep Learning?
26. Terry Taewoong Um (terry.t.um@gmail.com)
26
PARADIGM CHANGE
PAST
Knowledge
ML
Method
(e.g.
GPR, SVM)
PRESENT
What is the best
ML method for
the target task?
Knowledge
Representation
How can we find a
good representation?
27. Terry Taewoong Um (terry.t.um@gmail.com)
27
PARADIGM CHANGE
Knowledge
PRESENT
Representation
How can we find a
good representation?
kernel function
28. Terry Taewoong Um (terry.t.um@gmail.com)
28
PARADIGM CHANGE
Knowledge
PRESENT
Representation
(Features)
How can we find a
good representation?
IMAGE
SPEECH
Hand-Crafted Features
29. Terry Taewoong Um (terry.t.um@gmail.com)
29
PARADIGM CHANGE
IMAGE
SPEECH
Hand-Crafted Features
Knowledge
PRESENT
Representation
(Features)
Can we learn a good representation
(feature) for the target task as well?
30. Terry Taewoong Um (terry.t.um@gmail.com)
30
DEEP LEARNING
• What is Deep Learning (DL)?
- Learning methods which have a deep (not shallow) architecture
- It often allows end-to-end learning
- It automatically finds intermediate representations; thus,
it can be regarded as representation learning
- It often contains stacked "neural networks"; thus,
deep learning usually indicates a "deep neural network"
"Deep Gaussian Process" (2013)
https://youtu.be/NwoGqYsQifg
http://goo.gl/fxmmPE
http://goo.gl/5Ry08S
31. Terry Taewoong Um (terry.t.um@gmail.com)
31
OUTSTANDING PERFORMANCE OF DL
error rate : 28% (2010) → 15% (2012) → 8% (2014)
- Object recognition (Simonyan et al., 2015)
- Natural machine translation (Bahdanau et al., 2014)
- Speech recognition (Chorowski et al., 2014)
- Face recognition (Taigman et al., 2014)
- Emotion recognition (Ebrahimi-Kahou et al., 2014)
- Human pose estimation (Jain et al., 2014)
- Deep reinforcement learning (Mnih et al., 2013)
- Image/Video caption (Xu et al., 2015)
- Particle physics (Baldi et al., 2014)
- Bioinformatics (Leung et al., 2014)
- And so on…
• State-of-the-art results achieved by DL
DL has won most of ML challenges!
K. Cho, https://goo.gl/vdfGpu
32. Terry Taewoong Um (terry.t.um@gmail.com)
32
BIOLOGICAL EVIDENCE
• The somatosensory cortex learns to see
• Why do we need different ML methods
for different tasks?
Yann LeCun, https://goo.gl/VVQXJG
• The ventral pathway in the visual cortex has multiple stages
• There exist a lot of intermediate representations
Andrew Ng, https://youtu.be/ZmNOAtZIgIk
33. Terry Taewoong Um (terry.t.um@gmail.com)
33
BIG MOVEMENT
http://goo.gl/zNbBE2 http://goo.gl/Lk64Q4
Going deeper and deeper…
34. Terry Taewoong Um (terry.t.um@gmail.com)
34
NEURAL NETWORK (NN)
Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
• Universal approximation theorem (Hornik, 1991)
- A single-hidden-layer NN w/ linear output can approximate any continuous function arbitrarily well,
given enough hidden units
- This does not imply we have a learning method to train them
35. Terry Taewoong Um (terry.t.um@gmail.com)
35
TRAINING NN
Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
• First, calculate the output using the data & initial parameters (W, b)
• Activation functions
http://goo.gl/qMQk5H
1
36. Terry Taewoong Um (terry.t.um@gmail.com)
36
TRAINING NN
Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
• Then, calculate the error and update the weights from top to bottom
• Parameter gradients
http://goo.gl/qMQk5H
: Backpropagation algorithm
2
known
40. Terry Taewoong Um (terry.t.um@gmail.com)
40
TRAINING NN
• Repeat this process with different mini-batches of the dataset
http://goo.gl/qMQk5H
- Forward propagation (calculate the output values)
- Evaluate the error
- Backward propagation (update the weights)
- Repeat this process until the error converges
3
• As you can see here, NN is not a fancy algorithm,
but just an iterative gradient descent method with
a huge number of parameters
• NN is often likely to get
stuck in local minima
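The forward / error / backward / update loop of the last slides can be sketched end-to-end in numpy. This is a toy 1-8-1 network on an assumed target function (Y = X², sigmoid hidden layer), not code from the slides:

```python
import numpy as np

# One-hidden-layer NN trained by gradient descent + backpropagation.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 1))
Y = X ** 2                                  # assumed toy target function

W1, b1 = rng.normal(0, 0.5, (1, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    H = sigmoid(X @ W1 + b1)                # forward propagation
    out = H @ W2 + b2                       # linear output layer
    err = out - Y
    losses.append(np.mean(err ** 2))        # evaluate the error (L2 loss)
    dout = 2 * err / len(X)                 # backward propagation starts here
    dW2, db2 = H.T @ dout, dout.sum(0)
    dH = dout @ W2.T * H * (1 - H)          # chain rule through the sigmoid
    dW1, db1 = X.T @ dH, dH.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.2 * g                        # gradient descent update
```

As the slide says, this is nothing more than iterative gradient descent over many parameters; the loss should drop steadily on this easy target.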
41. Terry Taewoong Um (terry.t.um@gmail.com)
41
FROM NN TO DEEP NN
• From NN to deep NN (since 2006)
- NN requires expert skill to tune the hyperparameters
- It sometimes gives a good result, but sometimes a bad one.
The result is highly dependent on the quality of initialization, regularization,
hyperparameters, data, etc.
- Local minima are always problematic
• A long winter of NN
Yann LeCun
(NYU, Facebook)
Yoshua Bengio
(U. Montreal)
Geoffrey Hinton
(U. Toronto, Google)
42. Terry Taewoong Um (terry.t.um@gmail.com)
42
WHY IS DL SO SUCCESSFUL?
http://t-robotics.blogspot.kr/2015/05/deep-learning.html
• Pre-training with unsupervised learning
• Convolutional Neural Network
• Recurrent Neural Network
• GPGPU (parallel processing) & big data
• Advanced algorithms for optimization,
activation, regularization
• Huge research society
(Vision, Speech, NLP, Biology, etc.)
43. Terry Taewoong Um (terry.t.um@gmail.com)
43
UNSUPERVISED LEARNING
• How can we avoid pathological local-minima cases?
(1) First, pre-train the data with an unsupervised learning method
and get a new representation
(2) Stack up these block structures
(3) Train each layer one after another
(4) Fine-tune the final structure with an (ordinary) fully-connected NN
• Unsupervised learning methods
- Restricted Boltzmann Machine (RBM)
→ Deep RBM, Deep Belief Network (DBN)
- Autoencoder
→ Deep Autoencoder
http://goo.gl/QGJm5k
Autoencoder http://goo.gl/s6kmqY
44. Terry Taewoong Um (terry.t.um@gmail.com)
44
UNSUPERVISED LEARNING
"Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations", Lee et al., 2012
45. Terry Taewoong Um (terry.t.um@gmail.com)
45
CONVOLUTIONAL NN
• How can we deal with real images, which are
much bigger than the MNIST digit images?
- Use a locally-connected (not fully-connected) NN
- Use convolutions to get various feature maps
- Abstract the results into higher layers by using pooling
- Fine-tune with a fully-connected NN
https://goo.gl/G7kBjI
https://goo.gl/Xswsbd
http://goo.gl/5OR5oH
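The convolution + pooling steps listed above can be sketched in numpy (a hypothetical 6×6 image with one vertical edge and a simple gradient filter; real ConvNets learn their filters):

```python
import numpy as np

# Valid 2-D convolution (really cross-correlation, as in most ConvNet libraries)
def conv2d(img, k):
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# 2x2 max-pooling: abstract a feature map into a coarser one
def maxpool2(x):
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.zeros((6, 6))
img[:, 3] = 1.0                      # a vertical edge at column 3
k = np.array([[1.0, -1.0]])          # horizontal-gradient filter
fmap = conv2d(img, k)                # strong response at the edge
pooled = maxpool2(fmap)              # coarser, translation-tolerant map
```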
46. Terry Taewoong Um (terry.t.um@gmail.com)
46
CONVOLUTIONAL NN
"Visualizing and Understanding Convolutional Networks", Zeiler et al., 2012
47. Terry Taewoong Um (terry.t.um@gmail.com)
47
CONVNET + RNN
"Large-scale Video Classification with Convolutional Neural Networks",
A. Karpathy 2014, https://youtu.be/qrzQ_AB1DZk
48. Terry Taewoong Um (terry.t.um@gmail.com)
48
RECURRENT NEURAL NETWORK (RNN)
t-1 t t+1
[Neural Network] [Recurrent Neural Network]
http://www.dmi.usherb.ca/~larocheh/index_en.html
49. Terry Taewoong Um (terry.t.um@gmail.com)
49
RECURRENT NEURAL NETWORK (RNN)
[Neural Network] [Recurrent Neural Network]
back propagation
back propagation
through time
(BPTT)
• Vanishing gradient problem : Can't have long memory!
"Training Recurrent Neural Networks", I. Sutskever, 2013
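The vanishing-gradient problem can be illustrated numerically: in BPTT the gradient is multiplied by the recurrent Jacobian at every time step, so with a contractive weight matrix (assumed spectral radius 0.5 here) the long-range signal decays geometrically:

```python
import numpy as np

W = 0.5 * np.eye(4)            # toy recurrent weight, spectral radius 0.5
grad = np.ones(4)              # gradient arriving at the last time step
norms = []
for _ in range(30):            # propagate 30 steps back through time
    grad = W.T @ grad          # one BPTT step: multiply by the Jacobian
    norms.append(np.linalg.norm(grad))
# norms[-1] is ~0.5**30 of the start: the long-range gradient has vanished
```

LSTM's gating (next slide) is precisely a way to keep this product from collapsing (or exploding).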
50. Terry Taewoong Um (terry.t.um@gmail.com)
50
RNN + LSTM
• Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
"Training Recurrent Neural Networks", I. Sutskever, 2013
51. Terry Taewoong Um (terry.t.um@gmail.com)
51
INTERESTING RESULTS FROM RNN
http://pail.unist.ac.kr/carpedm20/poet/
http://cs.stanford.edu/people/karpathy/deepimagesent/
"Generating Sequences with Recurrent Neural Networks",
A. Graves, 2013
52. Terry Taewoong Um (terry.t.um@gmail.com)
52
WHY IS DL SO SUCCESSFUL?
http://t-robotics.blogspot.kr/2015/05/deep-learning.html
• Pre-training with unsupervised learning
• Convolutional Neural Network
• Recurrent Neural Network
• GPGPU (parallel processing) & big data
• Advanced algorithms for optimization,
activation, regularization
• Huge research society
(Vision, Speech, NLP, Biology, etc.)
53. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
53
Questions about Part 2?
54. Terry Taewoong Um (terry.t.um@gmail.com)
CONTENTS
54
3. Machine Learning in
Motion Analysis
55. Terry Taewoong Um (terry.t.um@gmail.com)
55
MOTION DATA
(Korean caption: video title and performer)
56. Terry Taewoong Um (terry.t.um@gmail.com)
56
MOTION DATA
We need to know the state not only at time t
but also at time t-1, t-2, t-3, etc.
y = f(x, t)
(Korean caption: video title and performer)
57. Terry Taewoong Um (terry.t.um@gmail.com)
57
MOTION DATA
• Why do motion data need special treatment?
- In general, most machine learning techniques assume an i.i.d. (independent
& identically distributed) sampling condition.
e.g.) coin tossing
- However, motion data are temporally & spatially correlated
swing motion http://goo.gl/LQulvc, manipulability ellipsoid https://goo.gl/dHjFO9
58. Terry Taewoong Um (terry.t.um@gmail.com)
58
MOTION DATA
http://goo.gl/ll3sq6
We can infer the next state
based on the temporal &
spatial information
But, how can we exploit
those benefits in ML method?
59. Terry Taewoong Um (terry.t.um@gmail.com)
59
WHAT CAN WE DO WITH MOTION DATA?
• Learning the kinematic/dynamic model
• Motion segmentation
• Motion generation / synthesis
• Motion imitation (imitation learning)
• Activity / gesture recognition
TASKS
Data
• Motion capture data
• Vision data
• Dynamic-level data
Applications
• Biomechanics
• Humanoid
• Animation
http://goo.gl/gFOVWL
60. Terry Taewoong Um (terry.t.um@gmail.com)
60
HIDDEN MARKOV MODEL (HMM)
The probability of the state at (n+1) depends only on the state at n
61. Terry Taewoong Um (terry.t.um@gmail.com)
61
LIMITATIONS OF HMM
1. Extract features (e.g. PCA)
2. Define the HMM structure (e.g. using GMM)
3. Train a separate HMM per class (Baum-Welch algorithm)
4. Evaluate the probability under each HMM (forward/backward algorithm),
or 4. Choose the most probable sequence (Viterbi algorithm)
- HMMs handle discrete states only!
- HMMs have short memory! (using just the previous state)
- HMMs have limited expressive power!
- [Trend 1] features-GMM → unsupervised learning methods
- [Trend 2] features-GMM-HMM → recurrent neural network
• A common procedure of HMM for motion analysis
• Limitations & trend change in the speech recognition area
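Step 4 of the procedure above (evaluating a sequence's probability under an HMM) is the forward algorithm; a minimal sketch with assumed toy parameters for a 2-state, 2-symbol HMM:

```python
import numpy as np

# Forward algorithm: P(observation sequence | model) by dynamic programming.
pi = np.array([0.6, 0.4])                  # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # state transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probs: P(obs | state)
obs = [0, 0, 1]                            # observed symbol sequence

alpha = pi * B[:, obs[0]]                  # alpha_1(i) = pi_i * b_i(o_1)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]          # propagate one step forward
seq_prob = alpha.sum()                     # total sequence probability
```

In the classification setting on the slide, this score is computed under each class's HMM and the highest-scoring class wins.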
62. Terry Taewoong Um (terry.t.um@gmail.com)
62
CAPTURE TEMPORAL INFORMATION
• 3D ConvNet
- "3D Convolutional Neural Network for
Human Action Recognition" (Ji et al., 2010)
- 3D convolution
- Activity recognition / pose estimation from video
"Joint Training of a Convolutional Network
and a Graphical Model for Human Pose
Estimation", Tompson et al., 2014
63. Terry Taewoong Um (terry.t.um@gmail.com)
63
CAPTURE TEMPORAL INFORMATION
• Recurrent Neural Network (RNN)
"Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition", Y. Du et al., 2015
• However, how can we capture the
spatial information about motions?
64. Terry Taewoong Um (terry.t.um@gmail.com)
64
CHALLENGES
We should connect the geometric information with deep neural network!
• The link transformation from the (i-1)-th link to the i-th link
• Forward Kinematics
(variable : θ_i, constant : M)
c.f.)
T_{i-1,i} = Rot(z, θ_i) Trans(z, d_i) Trans(x, a_i) Rot(x, α_i) = e^{[A_i]θ_i} M_{i-1,i}
T_{0,n} = e^{[A_1]θ_1} M_{0,1} e^{[A_2]θ_2} M_{1,2} ⋯ e^{[A_n]θ_n} M_{n-1,n}
       = e^{[S_1]θ_1} e^{[S_2]θ_2} ⋯ e^{[S_n]θ_n} M_{0,n}
where S_i = Ad_{M_{0,1} ⋯ M_{i-2,i-1}} (A_i), i = 1, ⋯, n
propagated forces and the external force acting on the i-th body
• Newton-Euler formulation for inverse dynamics
Lie group & Lie algebra,
http://goo.gl/uqilDV
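Chaining per-link transformations T_{i-1,i} as above can be sketched for a planar 2-link arm (assumed unit link lengths and joint angles; homogeneous 3×3 transforms instead of the full SE(3) machinery on the slide):

```python
import numpy as np

# One planar link: rotate by theta, then translate L along the rotated x-axis.
def T(theta, L):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, L * c],
                     [s,  c, L * s],
                     [0,  0, 1]])

L1, L2 = 1.0, 1.0                  # assumed link lengths
th1, th2 = np.pi / 2, -np.pi / 2   # assumed joint angles

T02 = T(th1, L1) @ T(th2, L2)      # forward kinematics: T_{0,2} = T_{0,1} T_{1,2}
tip = T02[:2, 2]                   # end-effector position in the base frame
```

With the first joint at +90° and the second at -90°, the elbow sits at (0, 1) and the second link points back along x, so the tip lands at (1, 1).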
65. Terry Taewoong Um (terry.t.um@gmail.com)
65
CHALLENGES
https://www.youtube.com/watch?v=oxA2O-tHftI