Learning to Learn by Gradient Descent by Gradient Descent
1. Learning to Learn by Gradient Descent by Gradient Descent
citations: 9 -> 38. Katy, 2016/11/25 @ DataLab
NIPS 2016
2. Background
• learn:
1. a task
2. training experience
3. a performance measure
• A computer program is said to learn if its performance at the task improves with experience. [Mitchell, 1997]
3. Background
• learning to learn:
1. a family of tasks
2. training experience for each of these tasks
3. a family of performance measures
• An algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks.
Thrun, Sebastian, and Lorien Pratt, eds. Learning to Learn. Springer Science & Business Media, 2012.
4. Background
• Frequently, tasks in machine learning can be expressed as the problem of optimizing an objective function f(θ) defined over some domain θ ∈ Θ
• The goal is to find the minimizer θ* = arg min_{θ∈Θ} f(θ)
• The standard approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates θ_{t+1} = θ_t - α_t ∇f(θ_t)
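A minimal sketch of this standard update rule in Python, on a toy quadratic objective (the function, step size, and iteration count here are illustrative, not from the paper):

```python
import numpy as np

def f(theta):
    # Toy quadratic objective; any differentiable function would do.
    return 0.5 * np.sum(theta ** 2)

def grad_f(theta):
    # Analytic gradient of the toy objective above.
    return theta

theta = np.array([3.0, -2.0])  # initial parameters
alpha = 0.1                    # hand-picked step size, as in classical gradient descent

for t in range(100):
    # The standard update: theta_{t+1} = theta_t - alpha_t * grad f(theta_t)
    theta = theta - alpha * grad_f(theta)

print(theta, f(theta))  # theta approaches the minimizer at the origin
```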
5. Motivation
• Most modern work is based around designing update rules for specific classes of problems; an update rule tuned to one class may perform poorly on other classes of problems
6. Motivation
• In this work we take a different tack and instead propose to replace hand-designed update rules with a learned update rule
9. Related Work
• C. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In Association for the Advancement of Artificial Intelligence, 2016.
12. • In this work, they propose to replace hand-designed update rules with a learned update rule, which they call the optimizer m (an LSTM), with its own parameters φ
• This results in updates to the optimizee f of the form θ_{t+1} = θ_t + g_t(∇f(θ_t), φ), where g_t is the output of the LSTM
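A hedged PyTorch sketch of such a learned update rule; the class name and sizes are assumptions for illustration (the paper itself uses a coordinatewise two-layer LSTM with 20 hidden units and a linear output):

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    # Sketch of the optimizer m: an LSTM applied coordinatewise to the
    # optimizee's gradient, producing the update g_t.
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)  # input: one coordinate's gradient
        self.out = nn.Linear(hidden_size, 1)     # maps hidden state to the update g_t

    def forward(self, grad, state):
        h, c = self.cell(grad, state)
        return self.out(h), (h, c)

# Each coordinate of theta is one batch element, so all coordinates share
# the same optimizer parameters phi:
m = LearnedOptimizer()
theta = torch.randn(5, 1)
state = (torch.zeros(5, 20), torch.zeros(5, 20))
g, state = m(theta, state)  # for f = 0.5*||theta||^2 the gradient is theta itself
theta = theta + g           # learned update: theta_{t+1} = theta_t + g_t
```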
13. How to train the optimizer
• For training the optimizer, we have an objective that depends on the trajectory of optimization for a time horizon T: L(φ) = E_f[ Σ_{t=1}^T w_t f(θ_t) ], where θ_{t+1} = θ_t + g_t and [g_t, h_{t+1}] = m(∇_t, h_t, φ), with ∇_t = ∇_θ f(θ_t)
• θ: the optimizee parameters
• φ: the optimizer parameters
• f: the function in question
• m: the LSTM
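Under those definitions, a hedged sketch of one meta-training step, reusing the LearnedOptimizer sketch above; w_t = 1 and training φ with Adam follow the paper, while the random quadratic optimizee is an assumption for illustration:

```python
import torch

def meta_loss(m, f, theta, state, T=20):
    # Unrolled objective L(phi) = sum_t w_t f(theta_t) over a horizon T,
    # with w_t = 1, so gradients can flow back into phi.
    loss = 0.0
    for _ in range(T):
        f_t = f(theta)
        # The paper treats the optimizee gradient as independent of phi,
        # hence the detach() before feeding it to the LSTM.
        grad = torch.autograd.grad(f_t, theta, retain_graph=True)[0].detach()
        g, state = m(grad, state)
        theta = theta + g        # theta_{t+1} = theta_t + g_t
        loss = loss + f(theta)   # accumulate w_t * f(theta_t)
    return loss

W = torch.randn(5, 5)
f = lambda th: ((W @ th) ** 2).sum()          # a random quadratic optimizee
theta = torch.randn(5, 1, requires_grad=True)
state = (torch.zeros(5, 20), torch.zeros(5, 20))
opt = torch.optim.Adam(m.parameters())        # phi is trained by ordinary Adam
opt.zero_grad()
meta_loss(m, f, theta, state).backward()
opt.step()
```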
18. Information Sharing Between Coordinates
• Global averaging cells (GAC) designate a subset of the cells in each LSTM layer for communication; their outgoing activations are averaged at each step across all coordinates
• This allows the per-coordinate LSTMs to communicate with each other, as sketched below
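A hedged sketch of that averaging step; the number of designated cells k and the slicing convention are assumptions for illustration:

```python
import torch

def global_average_cells(h, k):
    # Replace the first k activations of each coordinate's hidden state
    # with their mean across all coordinates, so the otherwise independent
    # per-coordinate LSTMs can exchange information.
    avg = h[:, :k].mean(dim=0, keepdim=True)                   # mean over coordinates
    return torch.cat([avg.expand(h.size(0), k), h[:, k:]], dim=1)

h = torch.randn(5, 20)             # hidden states: 5 coordinates, 20 units each
h = global_average_cells(h, k=4)   # 4 designated cells now carry the global average
```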
26. Conclusion
• So far the learning process has been hand-crafted, but this work shows how to train a neural network with a neural network
• The learned optimizer generalizes well to different architectures, but not to different activation functions
• Execution time?
• Sometimes, when you have been confused for a long time, try emailing the authors (all of them). A typo can kill you.