Copyright	(C)	DeNA	Co.,Ltd.	All	Rights	Reserved.	
AI	System	Dept.	
System	Management	Unit	
Kazuki	Fujikawa	
Matching Networks for One Shot Learning
https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning
Paper introduction
NIPS2016 Reading Group @Preferred Networks
2017/01/19
Abstract
n  One-shot learning with attention and memory
⁃  Learn a concept from one or only a few training examples
⁃  Train a fully end-to-end nearest neighbor classifier that incorporates the best characteristics of both parametric and non-parametric models
⁃  Improved one-shot accuracy on Omniglot from 88.0% to 93.8% and on ImageNet from 87.6% to 93.2% compared to competing approaches
[Figure 1: Matching Networks architecture]
AGENDA
n  Introduction
n  Related work
⁃  One-shot learning
⁃  Attention mechanisms
n  Matching Networks
n  Experiments
⁃  Omniglot
⁃  ImageNet
⁃  Penn Treebank
Supervised Learning
n  Learn a correspondence between training data and labels
⁃  Require a large labeled dataset for training
(e.g., CIFAR-10 [Krizhevsky+, 2009]: 6,000 images per class)
⁃  It is hard to let classifiers learn new concepts from little data
[Figure: Training phase: labeled examples (airplane, automobile, bird, cat, deer) are used to train a classifier. Predicting phase: the classifier outputs a label for a new example (here, automobile).]
https://www.cs.toronto.edu/~kriz/cifar.html
One-shot Learning
n  Learn a concept from one or only a few training examples
⁃  The classifier can be pre-trained on datasets whose labels are not used in the predicting phase
[Figure: (Pre-)training phase: a classifier is trained on labeled examples of dog, frog, horse, ship, and truck. Predicting phase (one-shot learning phase): given a few labeled examples of airplane, automobile, bird, cat, and deer, the classifier predicts the label of a new example (here, automobile).]
https://www.cs.toronto.edu/~kriz/cifar.html
One-shot Learning
n  Task: N-way k-shot learning
[Figure: T (training task) contains the classes dog, frog, horse, ship, truck; T' (testing task) contains the classes airplane, automobile, bird, cat, deer.]
•  Separate labels for training and testing
•  None of the labels used in the testing phase (one-shot learning phase) are used in the training phase
https://www.cs.toronto.edu/~kriz/cifar.html
One-shot Learning
n  Task: N-way k-shot learning
[Figure: T (training task) contains the classes dog, frog, horse, ship, truck; T' (testing task) contains the classes airplane, automobile, bird, cat, deer.]
•  T’ is used for one-shot learning
•  T can be used freely to train
(e.g. Multiclass classification)
https://www.cs.toronto.edu/~kriz/cifar.html
One-shot Learning
n  Task: N-way k-shot learning
[Figure: T (training task) contains the classes dog, frog, horse, ship, truck; T' (testing task) contains the classes airplane, automobile, bird, cat, deer.]
L': label set, obtained by sampling N labels from T' (here: automobile, cat, deer)
•  In this figure, L' has 3 classes, hence "3-way k-shot learning"
https://www.cs.toronto.edu/~kriz/cifar.html
One-shot Learning
n  Task: N-way k-shot learning
[Figure: T (training task) contains the classes dog, frog, horse, ship, truck; T' (testing task) contains the classes airplane, automobile, bird, cat, deer.]
L': label set, obtained by sampling N labels from T' (here: automobile, cat, deer)
S': support set, obtained by sampling k examples per label in L'
x̂: query, obtained by sampling 1 example from L'
•  Task: classify x̂ into 3 classes, {automobile, cat, deer}, using the support set
https://www.cs.toronto.edu/~kriz/cifar.html
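The sampling procedure on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper or the talk: the function name `sample_episode`, the dictionary layout of the task, and the choice to exclude support examples from the query candidates are all assumptions.

```python
import random

def sample_episode(data_by_label, n_way, k_shot, rng):
    """Sample one N-way k-shot episode: a label set, a support set, and one query."""
    # L': choose N distinct labels from the task's label pool.
    label_set = rng.sample(sorted(data_by_label), n_way)
    support = []
    for label in label_set:
        # S': k labeled examples for each label in L'.
        for x in rng.sample(data_by_label[label], k_shot):
            support.append((x, label))
    # Query: one example from L'. Excluding support examples is an assumption.
    query_label = rng.choice(label_set)
    used = {x for x, y in support if y == query_label}
    candidates = [x for x in data_by_label[query_label] if x not in used]
    query = (rng.choice(candidates), query_label)
    return label_set, support, query

# Toy stand-in for T': each label maps to a list of hypothetical example IDs.
test_task = {
    label: [f"{label}_{i}" for i in range(10)]
    for label in ["airplane", "automobile", "bird", "cat", "deer"]
}
label_set, support, query = sample_episode(
    test_task, n_way=3, k_shot=1, rng=random.Random(0)
)
```

With `n_way=3` and `k_shot=1` this yields a 3-way 1-shot episode; the classifier would then have to assign the query to one of the three sampled labels using only the support set.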
Related Work (One-shot Learning)
n  Convolutional Siamese Network [Koch+, 2015]
⁃  Learn image representation with a siamese neural network
⁃  Reuse features from the network for one-shot learning
[Figure: two CNNs with shared weights embed a pair of images, and the network predicts whether the pair belongs to the same class ("Same?").]
Related Work (One-shot Learning)
n  Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
⁃  Quickly encode and retrieve new information using external
memory, inspired by the idea of Neural Turing Machine
Related Work (One-shot Learning)
n  Siamese Learnet [Bertinetto+, NIPS2016]
⁃  Learn the parameters of a network to incorporate domain
specific information from a few examples
[Figure 1 (Bertinetto+, NIPS2016), panels "siamese", "siamese learnet", "learnet": Our proposed architectures predict the parameters of a network from a single example, replacing static convolutions (green) with dynamic convolutions (red). The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input. Dashed connections represent parameter sharing.]
•  The learnet is a non-iterative feed-forward function that maps a single exemplar z and its own meta-parameters W' to the predictor parameters W
•  Evaluating the feed-forward learnet is much faster than solving a lengthy optimization problem for each new exemplar
Related Work (Attention Mechanism)
n  Sequence to Sequence with Attention [Bahdanau+, 2014]
⁃  Attend to the word relevant to the generation of the next
target word in the source sentence
[Figure 1 (Bahdanau+, published as a conference paper at ICLR 2015): The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).]
Decoder state: $s_i = f(s_{i-1}, y_{i-1}, c_i)$
•  Unlike the existing encoder-decoder approach, the probability is conditioned on a distinct context vector c_i for each target word y_i
•  c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence; each h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word
Context vector: $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$ (5)
Attention weights: $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ (6)
Alignment model: $e_{ij} = a(s_{i-1}, h_j)$
•  e_{ij} scores how well the inputs around position j and the output at position i match, based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j
•  a is parametrized as a feedforward neural network that is jointly trained with all the other components of the system
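As a small worked example of the attention computation in Eqs. (5) and (6), the following Python sketch turns alignment scores into weights and a context vector. The function names and the numeric values are assumptions chosen for illustration; in the actual model the scores e_ij come from the learned alignment network a.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, annotations):
    """Eqs. (5)-(6): alpha = softmax(e), c = sum_j alpha_j * h_j."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    c = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, c

# Illustrative values only: three annotations h_j and their alignment scores e_ij.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.5, 0.5]
alphas, c = context_vector(scores, annotations)
```

The annotation with the highest score receives the largest weight, so the context vector is dominated by the source position the decoder attends to most.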
Related Work (Attention Mechanism)
n  Pointer Networks [Vinyals+, 2015]
⁃  Generate output sequence using a distribution over the
dictionary of inputs
[Figure 1 (Vinyals+, 2015): (a) Sequence-to-Sequence: an RNN (blue) processes the input sequence to create a code vector that is used to generate the output sequence (purple) using the probability chain rule and another RNN. The output dimensionality is fixed by the dimensionality of the problem and is the same during training and inference. (b) Ptr-Net: an encoding RNN converts the input sequence to a code (blue) that is fed to the generating network (purple). At each step, the generating network produces a vector that modulates a content-based attention mechanism over the inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.]
•  A sequence-to-sequence model uses a softmax over a fixed-size output dictionary, so it cannot be used when the size of the output dictionary equals the length of the input sequence
•  Ptr-Net instead models the output distribution with the attention mechanism:
$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \ldots, n)$
$p(C_i \mid C_1, \ldots, C_{i-1}, P) = \mathrm{softmax}(u^i)$
where e_j is the encoder state, d_i is the decoder state, softmax normalizes the vector u^i (of length n) into an output distribution over the dictionary of inputs, and v, W_1, and W_2 are learnable parameters of the output model.
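To make the pointer mechanism concrete, here is a minimal Python sketch of the two equations above. The parameter values are arbitrary placeholders (in Ptr-Net they are learned), and the helper names `matvec` and `pointer_distribution` are assumptions for this illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """u_j = v^T tanh(W1 e_j + W2 d_i); return softmax(u) over input positions."""
    w2d = matvec(W2, decoder_state)
    u = []
    for e_j in encoder_states:
        w1e = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(w1e, w2d)]
        u.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [x / total for x in exps]

# Arbitrary illustrative parameters and states (not learned values).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
decoder_state = [0.2, 0.2]
p = pointer_distribution(encoder_states, decoder_state, W1, W2, v)
```

The returned distribution has one entry per input element, so its size follows the input length rather than a fixed output vocabulary.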
Related Work (Attention Mechanism)
n  Sequence to Sequence for Sets [Vinyals+, ICLR2016]
⁃  Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model
[Figure 1 (Vinyals+, ICLR2016): The Read-Process-and-Write model, with Read, Process, and Write blocks.]
•  The process block uses "content" based attention, so the vector retrieved from memory does not change if the memory is randomly shuffled; this is crucial for treating the input X as a set
$q_t = \mathrm{LSTM}(q^*_{t-1})$ (3)
$e_{i,t} = f(m_i, q_t)$ (4)
$a_{i,t} = \exp(e_{i,t}) / \sum_j \exp(e_{j,t})$ (5)
$r_t = \sum_i a_{i,t} m_i$ (6)
$q^*_t = [q_t \; r_t]$ (7)
where i indexes the memory vectors m_i (typically as many as the cardinality of X), q_t is a query vector used to read r_t from the memories, f computes a single scalar from m_i and q_t (e.g., a dot product), and LSTM is an LSTM that takes no inputs. q^*_t is the state this LSTM evolves, formed by concatenating the query q_t with the attention readout r_t.
Matching Networks [Vinyals+, NIPS2016]
n  Motivation
⁃  One-shot learning requires learning rapidly from new examples while retaining the ability to handle common examples
•  Simple parametric models such as deep classifiers must be optimized again to handle new examples
•  Non-parametric models such as k-nearest neighbor do not require optimization, but their performance depends on the chosen metric
⁃  It could be effective to train an end-to-end nearest-neighbor-based classifier
Matching Networks [Vinyals+, NIPS2016]
n  Train a classifier through one-shot learning
[Figure: T (training task) contains the classes dog, frog, horse, ship, truck; T' (testing task) contains the classes airplane, automobile, bird, cat, deer.]
L: label set, obtained by sampling N labels from T (here: dog, horse, ship)
S: support set, obtained by sampling k examples from L
B: batch, obtained by sampling b examples from L
https://www.cs.toronto.edu/~kriz/cifar.html
Matching Networks [Vinyals+, NIPS2016]
n  System Overview
⁃  Embedding functions f, g are parameterized as a simple CNN (e.g.
VGG or Inception) or a fully conditional embedding function
mentioned later
[Figure 1: Matching Networks architecture. Each support set item x_i is embedded by g and the query x̂ by f; the attention a between f(x̂, S) and g(x_i) weights the labels y_i, and the weighted sum gives P(ŷ | x̂, S).]
In its simplest form, the model computes a probability over ŷ as follows:
$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1)
where x_i, y_i are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and a is an attention mechanism.
•  If a is a kernel on X × X, then (1) is akin to a kernel density estimator (KDE)
•  If a is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to (k − b)-nearest neighbours; thus (1) subsumes both KDE and kNN methods
•  In another view, a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i: a kind of associative memory in which the model "points" to the corresponding example in the support set and retrieves its label
Matching Networks [Vinyals+, NIPS2016]
n  The Attention Kernel
⁃  Calculate a softmax over the cosine distance c between f(x̂) and g(x_i)
•  Similar to nearest neighbor calculation
⁃  Train a network using cross entropy loss
[Figure 1: Matching Networks architecture. Each support set item x_i is embedded by g and the query x̂ by f; the attention a between f(x̂, S) and g(x_i) weights the labels y_i.]
Prediction: $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1)
Attention kernel: $a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
where c is the cosine distance and f, g are embedding functions (potentially with f = g), parameterised as deep convolutional networks for image tasks (as in VGG or Inception) or a simple form of word embedding for language tasks.
Fully conditional embedding f (K steps of "reads" over the support set):
$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$
$h_k = \hat{h}_k + f'(\hat{x})$
$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$
$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^T g(x_i)} / \sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}$
where LSTM(x, h, c) takes input x, output h (i.e., the cell after the output gate), and cell c, and $f(\hat{x}, S) = h_K$.
Training strategy:
•  To form an "episode", first sample a label set L from the task T (e.g., L could be {cats, dogs}), then use L to sample the support set S and a batch B
•  The Matching Net is trained to minimise the error predicting the labels in the batch B conditioned on the support set S
$\theta = \arg\max_{\theta} E_{L \sim T} \left[ E_{S \sim L, B \sim L} \left[ \sum_{(x,y) \in B} \log P_{\theta}(y \mid x, S) \right] \right]$
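The attention kernel, the prediction rule in Eq. (1), and the per-episode loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings f(x̂) and g(x_i) are given as hypothetical pre-computed vectors rather than produced by a CNN, labels are integer class indices standing in for one-hot y_i, and c is computed as cosine similarity so that closer items receive larger weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attention(query_emb, support_embs):
    """a(x_hat, x_i): softmax over cosine similarities to each support item."""
    sims = [cosine(query_emb, g) for g in support_embs]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query_emb, support_embs, support_labels, n_classes):
    """P(y | x_hat, S) = sum_i a(x_hat, x_i) * y_i with one-hot y_i."""
    probs = [0.0] * n_classes
    for a, y in zip(attention(query_emb, support_embs), support_labels):
        probs[y] += a
    return probs

def episode_loss(batch, support_embs, support_labels, n_classes):
    """Negative of the sum over (x, y) in B of log P(y | x, S)."""
    return -sum(
        math.log(predict(x, support_embs, support_labels, n_classes)[y])
        for x, y in batch
    )

# Hypothetical pre-computed embeddings g(x_i) for a 3-way 1-shot support set.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = [0, 1, 2]
query_emb = [0.9, 0.1]  # stands in for f(x_hat)
probs = predict(query_emb, support_embs, support_labels, n_classes=3)
loss = episode_loss([(query_emb, 0)], support_embs, support_labels, n_classes=3)
```

Minimising `episode_loss` over many sampled episodes corresponds to maximising the objective above; in a real model the gradient would flow through the embedding networks f and g.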
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed x_i in consideration of the whole support set S
[Figure 1: Matching Networks architecture. Diagram: each support-set element xi (with label yi) passes through g', then through a bidirectional LSTM; the LSTM outputs and g'(xi) are summed to give g(xi, S).]
g': neural network (e.g., VGG or Inception)
A.2 The Fully Conditional Embedding g
In section 2.1.2 we described the encoding function for the elements in the support set S, g(xi, S), as a bidirectional LSTM. More precisely, let g'(xi) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
$$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
where, as in above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the backward recursion starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with g' applied to each support-set element xi]
Embed xi into a vector using g'
(g': neural network such as VGG or Inception)
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with g'(xi) fed into a bidirectional LSTM over the support set]
Feed g'(xi) into a Bi-LSTM
(g': neural network such as VGG or Inception)
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with the Bi-LSTM outputs and g'(xi) summed to give g(xi, S)]
Let g(xi, S) be the sum of g'(xi) and the outputs of the Bi-LSTM
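The three steps for g (embed with g', run a Bi-LSTM over the support set, add the skip connection) can be sketched as follows. This is a minimal illustration, not the paper's implementation: an elementwise tanh recurrent cell with fixed weights is swapped in for the LSTM, the cell state c is dropped, and the names rnn_cell and fully_conditional_g are hypothetical. The inputs are assumed to be the vectors g'(xi) already computed by some network.

```python
import math


def rnn_cell(x, h, w_x=0.5, w_h=0.5):
    """Stand-in for LSTM(x, h, c): elementwise tanh cell with fixed scalar weights."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]


def fully_conditional_g(support_embeddings):
    """Return g(x_i, S) = forward_h_i + backward_h_i + g'(x_i) for every i.

    support_embeddings holds the vectors g'(x_i), one per support-set element.
    """
    n = len(support_embeddings)
    dim = len(support_embeddings[0])

    forward, h = [], [0.0] * dim
    for e in support_embeddings:          # forward pass, i = 1 .. |S|
        h = rnn_cell(e, h)
        forward.append(h)

    backward, h = [None] * n, [0.0] * dim
    for i in range(n - 1, -1, -1):        # backward pass starts from i = |S|
        h = rnn_cell(support_embeddings[i], h)
        backward[i] = h

    # Skip connection: add g'(x_i) to the two recurrent outputs.
    return [
        [f + b + e for f, b, e in zip(forward[i], backward[i], support_embeddings[i])]
        for i in range(n)
    ]
```

Because of the backward pass, the embedding of the first element changes when a later element of S changes, which is the sense in which g is conditioned on the whole support set.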
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram: the query x̂ passes through f'; an LSTM with attention over the support-set embeddings g(xi) reads a weighted sum r_{k-1} at each step; the final output is f(x̂, S) = h_K.]
We define the following recurrence over “processing” steps k, following work from [26]:
$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \quad (2)$$
$$h_k = \hat{h}_k + f'(\hat{x}) \quad (3)$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \quad (4)$$
$$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^{T} g(x_i)} \Big/ \sum_{j=1}^{|S|} e^{h_{k-1}^{T} g(x_j)} \quad (5)$$
$$f(\hat{x}, S) = h_K$$
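The recurrence in eqs. (2)-(5) can be traced with a small Python sketch. It is illustrative only: an elementwise tanh cell is swapped in for the LSTM (no cell state, no learned weights), h_0 and r_0 are assumed to be zero vectors, and the names cell and fully_conditional_f are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cell(x, h, r):
    """Stand-in for LSTM(x, [h, r], c): elementwise tanh of the summed inputs."""
    return [math.tanh(xi + hi + ri) for xi, hi, ri in zip(x, h, r)]


def fully_conditional_f(query_embedding, support_embeddings, num_steps):
    """Run K = num_steps processing steps and return h_K.

    query_embedding is f'(x_hat); support_embeddings holds the vectors g(x_i).
    """
    dim = len(query_embedding)
    h = [0.0] * dim  # assumed initial h_0
    r = [0.0] * dim  # assumed initial r_0
    for _ in range(num_steps):
        h_hat = cell(query_embedding, h, r)                      # eq. (2)
        h = [a + b for a, b in zip(h_hat, query_embedding)]      # eq. (3)
        weights = softmax(                                       # eq. (5)
            [sum(a * b for a, b in zip(h, g)) for g in support_embeddings]
        )
        r = [                                                    # eq. (4)
            sum(w * g[d] for w, g in zip(weights, support_embeddings))
            for d in range(dim)
        ]
    return h
```

With these initial values the first step does not look at the support set at all; S only enters from the second step onward, through the read-out r.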
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with the first processing step on the query x̂ highlighted]
h1 is calculated without using S:
$$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$$
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with the attention a(h1, g(xi)) between the query state and the support set highlighted]
Calculate the relevance between g(xi) and h1 using a softmax:
$$a(h_1, g(x_i)) = \mathrm{softmax}(h_1^{T} g(x_i))$$
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
[Figure 1: Matching Networks architecture, with the weighted-sum read-out r1 highlighted]
r1 is a sum of g(xi) weighted according to the relevance to h1:
$$r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i))\, g(x_i)$$
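A small numeric sketch of the attention weights and the read-out may help here. It is a plain-Python illustration with hypothetical function names (attention, read_out); the vectors stand for h1 and the support embeddings g(xi), which in the real model come from the trained networks.

```python
import math


def attention(h, support_embeddings):
    """a(h, g(x_i)): softmax over the dot products h^T g(x_i)."""
    scores = [sum(a * b for a, b in zip(h, g)) for g in support_embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_out(h, support_embeddings):
    """r = sum_i a(h, g(x_i)) g(x_i): attention-weighted sum of the support embeddings."""
    weights = attention(h, support_embeddings)
    return [
        sum(w * g[d] for w, g in zip(weights, support_embeddings))
        for d in range(len(h))
    ]
```

When h is the zero vector every support element gets the same weight, so the read-out is simply the mean of the g(xi); as h aligns with one element, the read-out moves toward that element's embedding.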
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: Support Set (S) with x_i and y_i, Query x̂, f', g, g(x_i), LSTM, weighted sum r_1, ĥ_1, +, h_1)
Recurrence over “processing” steps k, following work from [26]:
$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$ (2)
$h_k = \hat{h}_k + f'(\hat{x})$ (3)
$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$ (4)
$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^{T} g(x_i)} / \sum_{j=1}^{|S|} e^{h_{k-1}^{T} g(x_j)}$ (5)
⁃  $h_1$ is calculated using S
Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed the query $\hat{x}$ in consideration of the support set S
Figure 1: Matching Networks architecture (diagram labels: Support Set (S) with x_i and y_i, Query x̂, f', g, LSTM, weighted sum, r_{k-1}, ĥ_{k-1}, h_{k-1}, ĥ_k, +)
$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$ (4)
$f(\hat{x}, S) = h_K$
⁃  Let $f(\hat{x}, S)$ be the output after K steps of the recurrence in equations (2)-(5)
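The recurrence in equations (2)-(5) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: `toy_cell` is a hypothetical stand-in for the LSTM (it is not a real LSTM), the zero initial state is an assumption, and f'(x̂) and g(x_i) are assumed to have the same dimension so that the dot product in (5) is defined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_support(h, g_support):
    """Equations (4)-(5): attention-weighted sum of the support embeddings g(x_i)."""
    scores = [sum(a * b for a, b in zip(h, g)) for g in g_support]  # h^T g(x_i)
    weights = softmax(scores)
    dim = len(g_support[0])
    r = [sum(w * g[d] for w, g in zip(weights, g_support)) for d in range(dim)]
    return r, weights


def toy_cell(x, h_and_r, c):
    """Hypothetical stand-in for the LSTM in equation (2); NOT a real LSTM."""
    dim = len(x)
    new_c = [0.5 * c[d] + 0.5 * math.tanh(x[d] + h_and_r[d] + h_and_r[dim + d])
             for d in range(dim)]
    h_hat = [math.tanh(v) for v in new_c]
    return h_hat, new_c


def fce_query_embedding(f_prime_x, g_support, num_steps, cell=toy_cell):
    """Run K = num_steps processing steps and return h_K = f(x_hat, S)."""
    dim = len(f_prime_x)
    h = [0.0] * dim  # assumption: zero initial hidden state
    c = [0.0] * dim  # assumption: zero initial cell state
    for _ in range(num_steps):
        r, _ = read_support(h, g_support)               # equation (4)
        h_hat, c = cell(f_prime_x, h + r, c)            # equation (2), [h, r] concatenated
        h = [a + b for a, b in zip(h_hat, f_prime_x)]   # equation (3), skip connection
    return h


support = [[1.0, 0.0], [0.0, 1.0]]
r0, w0 = read_support([0.0, 0.0], support)
embedding = fce_query_embedding([0.2, -0.1], support, num_steps=3)
```

With a zero query state the attention is uniform, so `r0` is simply the mean of the support embeddings; as h moves toward one support item, that item dominates the read-out.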
Experimental Settings
n  Datasets
⁃  Image classification sets
•  Omniglot [Lake+, 2011]
•  ImageNet [Deng+, 2009]
ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
⁃  Language modeling
•  Penn Treebank [Marcus+, 1993]
Experimental Settings (Omniglot)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from VGG
(Baseline classifier)
⁃  MANN
⁃  Convolutional Siamese Network
n  Datasets
⁃  training: 1200 characters
⁃  testing: 423 characters
Experimental Results (Omniglot)
n  Fully Conditional Embedding (FCE) did not seem to help much
n  Baseline and Siamese Net were improved with fine-tuning
took this network and used the features from the last layer (before the softmax) for nearest neighbour
matching, a strategy commonly used in computer vision [3] which has achieved excellent results
across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different
task of the original training data set and then the last layer was used for nearest neighbour matching.
Model                           Matching Fn  Fine Tune  5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
PIXELS                          Cosine       N          41.7%         63.2%         26.7%          42.6%
BASELINE CLASSIFIER             Cosine       N          80.0%         95.0%         69.5%          89.1%
BASELINE CLASSIFIER             Cosine       Y          82.3%         98.4%         70.6%          92.0%
BASELINE CLASSIFIER             Softmax      Y          86.0%         97.6%         72.9%          92.3%
MANN (NO CONV) [21]             Cosine       N          82.8%         94.9%         –              –
CONVOLUTIONAL SIAMESE NET [11]  Cosine       N          96.7%         98.4%         88.0%          96.5%
CONVOLUTIONAL SIAMESE NET [11]  Cosine       Y          97.3%         98.4%         88.1%          97.0%
MATCHING NETS (OURS)            Cosine       N          98.1%         98.9%         93.8%          98.5%
MATCHING NETS (OURS)            Cosine       Y          97.9%         98.7%         93.5%          98.7%
Table 1: Results on the Omniglot dataset.
Experimental Settings (ImageNet)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from InceptionV3
(Baseline classifier)
n  Datasets
⁃  miniImageNet (size: 84x84)
•  training: (80 classes)
•  testing: (20 classes)
⁃  randImageNet
•  training: randomly selected classes (882 classes)
•  testing: remaining classes (118 classes)
⁃  dogsImageNet
•  training: all non-dog classes (882 classes)
•  testing: dog classes (118 classes)
Experimental Results (miniImageNet)
Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S' contain
classes never seen during training. Our model makes far less mistakes than the Inception baseline.
Table 2: Results on miniImageNet.
Model                 Matching Fn   Fine Tune  5-way 1-shot  5-way 5-shot
PIXELS                Cosine        N          23.0%         26.6%
BASELINE CLASSIFIER   Cosine        N          36.6%         46.0%
BASELINE CLASSIFIER   Cosine        Y          36.2%         52.2%
BASELINE CLASSIFIER   Softmax       Y          38.4%         51.2%
MATCHING NETS (OURS)  Cosine        N          41.2%         56.2%
MATCHING NETS (OURS)  Cosine        Y          42.4%         58.0%
MATCHING NETS (OURS)  Cosine (FCE)  N          44.2%         57.0%
MATCHING NETS (OURS)  Cosine (FCE)  Y          46.6%         60.0%
1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching
Networks and training strategy.
The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception
Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is
not too surprising given its impressive top-1 accuracy. When trained solely on ≠L_rand, Matching
Nets improve upon Inception by almost 6% when tested on L_rand, halving the errors. Figure 2 shows
two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception
appears to sometimes prefer an image above all others (these images tend to be cluttered like the
example in the second column, or more constant in color). Matching Nets, on the other hand, manage
to recover from these outliers that sometimes appear in the support set S'.
Matching Nets manage to improve upon Inception on the complementary subset ≠L_dogs (although
n  Matching Networks outperformed the baselines
n  Fully Conditional Embedding (FCE) improved
performance on this task
Experimental Results (randImageNet, dogsImageNet)
classification. Thus, we believe that if we adapted our training strategy to samples S from fine grained
sets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvements
could be attained. We leave this as future work.
Table 3: Results on full ImageNet on rand and dogs one-shot tasks (ImageNet 5-way 1-shot accuracy).
Note that ≠L_rand and ≠L_dogs are sets of classes which are seen during training, but are provided for completeness.
Model                 Matching Fn     Fine Tune  L_rand   ≠L_rand   L_dogs   ≠L_dogs
PIXELS                Cosine          N          42.0%    42.8%     41.4%    43.0%
INCEPTION CLASSIFIER  Cosine          N          87.6%    92.6%     59.8%    90.0%
MATCHING NETS (OURS)  Cosine (FCE)    N          93.2%    97.0%     58.8%    96.4%
INCEPTION ORACLE      Softmax (Full)  Y (Full)   ≈ 99%    ≈ 99%     ≈ 99%    ≈ 99%
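The paper's remarks that Matching Nets improve upon Inception "by almost 6%" on L_rand, "halving the errors", and degrade "by 1%" on L_dogs can be checked against the Table 3 figures. The arithmetic below uses only the accuracies reported in that table; the row labels are just descriptive names.

```latex
\begin{align*}
\text{gain on } L_{rand} &= 93.2\% - 87.6\% = 5.6 \text{ points} \\
\text{error, Inception on } L_{rand} &= 100\% - 87.6\% = 12.4\% \\
\text{error, Matching Nets on } L_{rand} &= 100\% - 93.2\% = 6.8\% \\
\text{error ratio} &= 6.8 / 12.4 \approx 0.55 \\
\text{change on } L_{dogs} &= 58.8\% - 59.8\% = -1.0 \text{ point}
\end{align*}
```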
n  Matching Networks outperformed the Inception Classifier on L_rand,
but degraded on L_dogs
n  The drop in performance on L_dogs might be caused by the
different distributions of labels between training and testing
⁃  The training support set comes from a random distribution of labels,
whereas the testing one comes from similar classes
Matching Nets manage to improve upon Inception on the complementary subset ≠L_dogs (although
this setup is not one-shot, as the feature extraction has been trained on these labels). [...]
much more challenging L_dogs subset, our model degrades by 1%. We hypothesize [...]
that the sampled set during training, S, comes from a random distribution of labels [...]
whereas the testing support set S' from L_dogs contains similar classes, more akin [...]
classification.
Experimental Settings (Penn Treebank)
Figure 1: Matching Networks architecture, applied to sentences (diagram labels: Support Set (S) with x_i and y_i, Query x̂, g(x_i), f(x̂, S), a; LSTMs over the words of each sentence, e.g. "virus a", "new nbc", "on the")
Our model in its simplest form computes a probability over ŷ as follows:
$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1)
where x_i, y_i are the inputs and corresponding label distributions from the support set
$S = \{(x_i, y_i)\}_{i=1}^{k}$, and a is an attention mechanism.
Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator.
Where the attention mechanism is zero for the b furthest x_i from x̂ according to some distance
metric and an appropriate constant otherwise, then (1) is equivalent to ‘k − b’-nearest neighbours.
Thus (1) subsumes both KDE and kNN methods.
2.1.1 The Attention Kernel
The simplest form that a takes is the softmax over the cosine distance c:
$a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
with embedding functions f and g (potentially with f = g) to embed x̂ and x_i.
⁃  c: cosine distance
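Equation (1) with the cosine softmax attention kernel can be sketched as follows. This is a minimal illustration rather than the authors' code: the embeddings are assumed to be already computed (the lists stand in for f(x̂) and g(x_i)), and the function names `cosine`, `attention` and `predict` are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def attention(f_query, g_support):
    """Softmax over c(f(x_hat), g(x_i)) for every support item."""
    scores = [cosine(f_query, g) for g in g_support]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(f_query, g_support, labels):
    """Equation (1): attention-weighted sum of the one-hot support labels y_i."""
    weights = attention(f_query, g_support)
    num_classes = len(labels[0])
    return [sum(w * y[c] for w, y in zip(weights, labels))
            for c in range(num_classes)]


# Toy 3-way 1-shot example with hand-written 2-d embeddings.
g_support = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
probs = predict([0.9, 0.1], g_support, labels)
best = max(range(len(probs)), key=probs.__getitem__)
```

Because cosine values lie in [-1, 1], the softmax weights in this plain form stay fairly spread out; the output is still a valid distribution over the support labels, and the nearest support item gets the largest weight.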
4.1.3 One-Shot Language Modeling
We also introduce a new one-shot language task which is analogous to those examined for images.
The task is as follows: given a query sentence with a missing word in it, and a support set of sentences
which each have a missing word and a corresponding 1-hot label, choose the label from the support
set that best matches the query sentence. Here we show a single example, though note that the words
on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.
1. an experimental vaccine can alter the immune response of people infected with the aids virus a
<blank_token> u.s. scientist said. (prominent)
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far
this fall. (series)
3. however since eastern first filed for chapter N protection march N it has consistently promised
to pay creditors N cents on the <blank_token>. (dollar)
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in
benjamin jacobson & sons a specialist in trading ual stock on the big board. (towel)
5. it’s not easy to roll out something that <blank_token> and make it pay mr. jacob says. (comprehensive)
Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N
marks late friday and at N yen down from N yen late friday. (dollar)
Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set
and batch are populated with sentences that are non-overlapping. This means that we do not use
words with very low frequency counts; e.g. if there is only a single sentence for a given word we do
not use this data since the sentence would need to be in both the set and the batch. As with the image
tasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch
size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured
that the same number of sentences were available for each class in the set. We split the words into a
randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report
results. Thus, neither the words nor the sentences used during test time had been seen during training.
We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30]
trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot
learning but seeing all the data – thus, this should be taken as an upper bound. To do so, we examined
a similar setup wherein a sentence was presented to the model with a single word filled in with 5
different possible words (including the correct answer). For each of these 5 sentences the model gave
a log-likelihood and the max of these was taken to be the choice of the model.
n  Fill in the blank in a query sentence with a label from the support set
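The trial construction described in the excerpt (a 5-way choice, k = 1, 2 or 3 sentences per class in the set, a batch of 20, and no sentence shared between set and batch) might be sketched like this. It is a sketch under assumptions, not the authors' sampling code: the `sentences_by_word` mapping, the `sample_trial` name, the fixed seed and sampling the batch with replacement are all choices made for this example.

```python
import random


def sample_trial(sentences_by_word, n_way=5, k_shot=1, batch_size=20, seed=0):
    """Sample one N-way k-shot trial with non-overlapping set and batch."""
    rng = random.Random(seed)
    # A word needs more than k_shot sentences: k_shot go to the set,
    # and at least one must remain for the batch.
    eligible = sorted(w for w, s in sentences_by_word.items() if len(s) > k_shot)
    classes = rng.sample(eligible, n_way)

    support, pool = [], []
    for label, word in enumerate(classes):
        shuffled = list(sentences_by_word[word])
        rng.shuffle(shuffled)
        # Same number of support sentences for every class.
        support += [(s, label) for s in shuffled[:k_shot]]
        # Remaining sentences are the only candidates for the batch.
        pool += [(s, label) for s in shuffled[k_shot:]]

    # Assumption: queries are drawn with replacement from the pool.
    batch = [rng.choice(pool) for _ in range(batch_size)]
    return classes, support, batch


# Toy corpus: 8 hypothetical words with 4 distinct sentences each.
corpus = {
    f"word{i}": [f"sentence {i}-{j} with <blank_token>" for j in range(4)]
    for i in range(8)
}
classes, support, batch = sample_trial(corpus, n_way=5, k_shot=2, seed=1)
```

Words with too few sentences are filtered out up front, which mirrors the excerpt's note that very low frequency words cannot be used.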
Experimental Settings and Results (Penn Treebank)
n  Baseline
⁃  Oracle LSTM-LM
•  Trained on all the words (not one-shot)
•  Consider this model as an upper bound
n  Datasets
⁃  training: 9000 words
⁃  testing: 1000 words
n  Results (5-way accuracy)
Model           1-shot   2-shot  3-shot
Matching Nets   32.4%    36.1%   38.2%
Oracle LSTM-LM  (72.8%)  -       -
Conclusion
n  They proposed Matching Networks: a nearest-neighbor-based
approach trained fully end-to-end
n  Key points
⁃  “One-shot learning is much easier if you train the network to
do one-shot learning” [Vinyals+, 2016]
⁃  Matching Networks have a non-parametric structure, and can
therefore acquire new examples rapidly
n  Findings
⁃  Matching Networks improved performance on Omniglot,
miniImageNet, and randImageNet, but degraded on
dogsImageNet
⁃  One-shot learning with fine-grained sets of labels is difficult
to solve, and could therefore be an exciting challenge in this area
References
n  Matching Networks
⁃  Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural
Information Processing Systems. 2016.
n  One-shot Learning
⁃  Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss.
University of Toronto, 2015.
⁃  Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks."
Proceedings of The 33rd International Conference on Machine Learning. 2016.
⁃  Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural
Information Processing Systems. 2016.
n  Attention Mechanisms
⁃  Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation
by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
⁃  Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in
Neural Information Processing Systems. 2015.
⁃  Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to
sequence for sets." International Conference on Learning Representations. 2016.
References
n  Datasets
⁃  Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny
images." (2009).
⁃  Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
⁃  Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of
the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃  Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large
annotated corpus of English: The Penn Treebank." Computational linguistics 19.2
(1993): 313-330.
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介Deep Learning JP
 
論文紹介「A Perspective View and Survey of Meta-Learning」
論文紹介「A Perspective View and Survey of Meta-Learning」論文紹介「A Perspective View and Survey of Meta-Learning」
論文紹介「A Perspective View and Survey of Meta-Learning」Kota Matsui
 

Mais procurados (20)

【DL輪読会】An Image is Worth One Word: Personalizing Text-to-Image Generation usi...
【DL輪読会】An Image is Worth One Word: Personalizing Text-to-Image Generation usi...【DL輪読会】An Image is Worth One Word: Personalizing Text-to-Image Generation usi...
【DL輪読会】An Image is Worth One Word: Personalizing Text-to-Image Generation usi...
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
 
Zero shot-learning: paper presentation
Zero shot-learning: paper presentationZero shot-learning: paper presentation
Zero shot-learning: paper presentation
 
BERTology のススメ
BERTology のススメBERTology のススメ
BERTology のススメ
 
Convolutional neural network
Convolutional neural networkConvolutional neural network
Convolutional neural network
 
MixMatch: A Holistic Approach to Semi- Supervised Learning
MixMatch: A Holistic Approach to Semi- Supervised LearningMixMatch: A Holistic Approach to Semi- Supervised Learning
MixMatch: A Holistic Approach to Semi- Supervised Learning
 
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision LearnersMasked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
 
Res netと派生研究の紹介
Res netと派生研究の紹介Res netと派生研究の紹介
Res netと派生研究の紹介
 
【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models
 
画像認識モデルを作るための鉄板レシピ
画像認識モデルを作るための鉄板レシピ画像認識モデルを作るための鉄板レシピ
画像認識モデルを作るための鉄板レシピ
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
 
[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classification[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classification
 
continual learning survey
continual learning surveycontinual learning survey
continual learning survey
 
論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNN論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNN
 
[DL輪読会]StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
[DL輪読会]StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators[DL輪読会]StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
[DL輪読会]StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
 
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
 
GPT解説
GPT解説GPT解説
GPT解説
 
論文紹介「A Perspective View and Survey of Meta-Learning」
論文紹介「A Perspective View and Survey of Meta-Learning」論文紹介「A Perspective View and Survey of Meta-Learning」
論文紹介「A Perspective View and Survey of Meta-Learning」
 

Destaque

NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding Model
NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding ModelNIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding Model
NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding ModelSeiya Tokui
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferRoelof Pieters
 
One-Shot Learning
One-Shot LearningOne-Shot Learning
One-Shot LearningJisung Kim
 
[DL輪読会]Attention Is All You Need
[DL輪読会]Attention Is All You Need[DL輪読会]Attention Is All You Need
[DL輪読会]Attention Is All You NeedDeep Learning JP
 
Learning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descentLearning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descentHiroyuki Fukuda
 
時系列データ3
時系列データ3時系列データ3
時系列データ3graySpace999
 
Fast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansFast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansKimikazu Kato
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Toru Fujino
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoderssuga93
 
Interaction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and PhysicsInteraction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and PhysicsKen Kuroki
 
Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmKatsuki Ohto
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...Shuhei Yoshida
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learningmooopan
 
Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”Kazuto Fukuchi
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowTatsuya Shirakawa
 
[DL輪読会]Convolutional Sequence to Sequence Learning
[DL輪読会]Convolutional Sequence to Sequence Learning[DL輪読会]Convolutional Sequence to Sequence Learning
[DL輪読会]Convolutional Sequence to Sequence LearningDeep Learning JP
 
NIPS 2016 Overview and Deep Learning Topics
NIPS 2016 Overview and Deep Learning Topics  NIPS 2016 Overview and Deep Learning Topics
NIPS 2016 Overview and Deep Learning Topics Koichi Hamada
 
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...Kusano Hitoshi
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Kentaro Minami
 

Destaque (20)

NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding Model
NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding ModelNIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding Model
NIPS2013読み会 DeViSE: A Deep Visual-Semantic Embedding Model
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transfer
 
One-Shot Learning
One-Shot LearningOne-Shot Learning
One-Shot Learning
 
[DL輪読会]Attention Is All You Need
[DL輪読会]Attention Is All You Need[DL輪読会]Attention Is All You Need
[DL輪読会]Attention Is All You Need
 
Learning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descentLearning to learn by gradient descent by gradient descent
Learning to learn by gradient descent by gradient descent
 
時系列データ3
時系列データ3時系列データ3
時系列データ3
 
Fast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansFast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-Means
 
Value iteration networks
Value iteration networksValue iteration networks
Value iteration networks
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
Interaction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and PhysicsInteraction Networks for Learning about Objects, Relations and Physics
Interaction Networks for Learning about Objects, Relations and Physics
 
Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”Introduction of “Fairness in Learning: Classic and Contextual Bandits”
Introduction of “Fairness in Learning: Classic and Contextual Bandits”
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive Flow
 
[DL輪読会]Convolutional Sequence to Sequence Learning
[DL輪読会]Convolutional Sequence to Sequence Learning[DL輪読会]Convolutional Sequence to Sequence Learning
[DL輪読会]Convolutional Sequence to Sequence Learning
 
NIPS 2016 Overview and Deep Learning Topics
NIPS 2016 Overview and Deep Learning Topics  NIPS 2016 Overview and Deep Learning Topics
NIPS 2016 Overview and Deep Learning Topics
 
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...
論文紹介 Combining Model-Based and Model-Free Updates for Trajectory-Centric Rein...
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]Differential privacy without sensitivity [NIPS2016読み会資料]
Differential privacy without sensitivity [NIPS2016読み会資料]
 

Semelhante a Matching networks for one shot learning

A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsDevansh16
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 
IRJET- Image Captioning using Multimodal Embedding
IRJET-  	  Image Captioning using Multimodal EmbeddingIRJET-  	  Image Captioning using Multimodal Embedding
IRJET- Image Captioning using Multimodal EmbeddingIRJET Journal
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdfnyomans1
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...Pooyan Jamshidi
 
Continuous Architecting of Stream-Based Systems
Continuous Architecting of Stream-Based SystemsContinuous Architecting of Stream-Based Systems
Continuous Architecting of Stream-Based SystemsCHOOSE
 
IRJET- Chatbot Using Gated End-to-End Memory Networks
IRJET-  	  Chatbot Using Gated End-to-End Memory NetworksIRJET-  	  Chatbot Using Gated End-to-End Memory Networks
IRJET- Chatbot Using Gated End-to-End Memory NetworksIRJET Journal
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTIRJET Journal
 
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...ETS Asset Management Factory
 
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Pooyan Jamshidi
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques ijsc
 
00463517b1e90c1e63000000
00463517b1e90c1e6300000000463517b1e90c1e63000000
00463517b1e90c1e63000000Ivonne Liu
 
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...Xin-She Yang
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesijsc
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
 
An introduction to deep learning
An introduction to deep learningAn introduction to deep learning
An introduction to deep learningVan Thanh
 

Semelhante a Matching networks for one shot learning (20)

A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 
IRJET- Image Captioning using Multimodal Embedding
IRJET-  	  Image Captioning using Multimodal EmbeddingIRJET-  	  Image Captioning using Multimodal Embedding
IRJET- Image Captioning using Multimodal Embedding
 
Sparse autoencoder
Sparse autoencoderSparse autoencoder
Sparse autoencoder
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdf
 
C sharp chap6
C sharp chap6C sharp chap6
C sharp chap6
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
 
Continuous Architecting of Stream-Based Systems
Continuous Architecting of Stream-Based SystemsContinuous Architecting of Stream-Based Systems
Continuous Architecting of Stream-Based Systems
 
IRJET- Chatbot Using Gated End-to-End Memory Networks
IRJET-  	  Chatbot Using Gated End-to-End Memory NetworksIRJET-  	  Chatbot Using Gated End-to-End Memory Networks
IRJET- Chatbot Using Gated End-to-End Memory Networks
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
 
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
Python + Tensorflow: how to earn money in the Stock Exchange with Deep Learni...
 
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
 
00463517b1e90c1e63000000
00463517b1e90c1e6300000000463517b1e90c1e63000000
00463517b1e90c1e63000000
 
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
Accelerated Particle Swarm Optimization and Support Vector Machine for Busine...
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniques
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
An introduction to deep learning
An introduction to deep learningAn introduction to deep learning
An introduction to deep learning
 
Ann
Ann Ann
Ann
 

Mais de Kazuki Fujikawa

Stanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionStanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionKazuki Fujikawa
 
BMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionBMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionKazuki Fujikawa
 
Kaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKazuki Fujikawa
 
Kaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKazuki Fujikawa
 
Ordered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksOrdered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksKazuki Fujikawa
 
A closer look at few shot classification
A closer look at few shot classificationA closer look at few shot classification
A closer look at few shot classificationKazuki Fujikawa
 
Graph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationGraph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationKazuki Fujikawa
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processesKazuki Fujikawa
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionKazuki Fujikawa
 
Matrix capsules with em routing
Matrix capsules with em routingMatrix capsules with em routing
Matrix capsules with em routingKazuki Fujikawa
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkKazuki Fujikawa
 
SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...Kazuki Fujikawa
 
DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用Kazuki Fujikawa
 

Mais de Kazuki Fujikawa (15)

Stanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionStanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solution
 
BMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionBMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solution
 
ACL2020 best papers
ACL2020 best papersACL2020 best papers
ACL2020 best papers
 
Kaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular Properties
 
NLP@ICLR2019
NLP@ICLR2019NLP@ICLR2019
NLP@ICLR2019
 
Kaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions Classification
 
Ordered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksOrdered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networks
 
A closer look at few shot classification
A closer look at few shot classificationA closer look at few shot classification
A closer look at few shot classification
 
Graph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationGraph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generation
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph Convolution
 
Matrix capsules with em routing
Matrix capsules with em routingMatrix capsules with em routing
Matrix capsules with em routing
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman network
 
SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...
 
DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用
 

Último

Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure

Matching networks for one shot learning

  • 1. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. AI System Dept., System Management Unit, Kazuki Fujikawa. Matching Networks for One Shot Learning (https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning). Paper introduction, NIPS 2016 reading group @ Preferred Networks, 2017/01/19.
  • 2. Abstract: One-shot learning with attention and memory. Learn a concept from one or only a few training examples. Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models. Improved one-shot accuracy on Omniglot from 88.0% to 93.2% compared to competing approaches. [Figure 1: Matching Networks architecture]
  • 3. Agenda: Introduction; Related work (one-shot learning, attention mechanisms); Matching Networks; Experiments (Omniglot, ImageNet, Penn Treebank).
  • 4. Supervised Learning: Learn a correspondence between training data and labels. This requires a large labeled dataset for training (e.g. CIFAR-10 [Krizhevsky+, 2009]: 6000 examples per class). It is hard for a classifier to learn new concepts from little data. [Figure: training phase and predicting phase of a classifier over the classes airplane, automobile, bird, cat, deer; images from https://www.cs.toronto.edu/~kriz/cifar.html]
  • 5. One-shot Learning: Learn a concept from one or only a few training examples. The classifier can be (pre-)trained on datasets whose labels are not used in the predicting phase. [Figure: (pre-)training phase on the classes dog, frog, horse, ship, truck; predicting (one-shot learning) phase on the classes airplane, automobile, bird, cat, deer; images from https://www.cs.toronto.edu/~kriz/cifar.html]
  • 6. One-shot Learning: Task: N-way k-shot learning. The labels for training and testing are kept separate: none of the labels used in the testing phase (one-shot learning phase) are used in the training phase. [Figure: training task T = {dog, frog, horse, ship, truck}; testing task T' = {airplane, automobile, bird, cat, deer}; images from https://www.cs.toronto.edu/~kriz/cifar.html]
  • 7. One-shot Learning: Task: N-way k-shot learning. T' is used for one-shot learning. T can be used freely for training (e.g. multiclass classification). [Figure: training task T = {dog, frog, horse, ship, truck}; testing task T' = {airplane, automobile, bird, cat, deer}]
  • 8. One-shot Learning: Task: N-way k-shot learning. L': label set, obtained by sampling N labels from T'. In the figure, L' has 3 classes {automobile, cat, deer}, hence "3-way k-shot learning".
  • 9. One-shot Learning: Task: N-way k-shot learning. L': label set (sampling N labels from T'). S': support set (sampling k examples for each label in L'). $\hat{x}$: query (sampling 1 example from L'). Task: classify $\hat{x}$ into one of the 3 classes {automobile, cat, deer} using the support set S'.
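The sampling procedure on slides 6 to 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sample_episode`, `data_by_label` and the toy string "images" are hypothetical names that do not come from the slides, and the query is drawn so that it never coincides with a support example.

```python
import random


def sample_episode(data_by_label, n_way, k_shot, rng=None):
    """Sample one N-way k-shot episode: (label set L', support set S', query)."""
    rng = rng or random.Random()
    # L': sample N labels from the task's label pool.
    label_set = rng.sample(sorted(data_by_label), n_way)
    support, held_out = [], []
    for label in label_set:
        # Draw k + 1 distinct examples: k for the support set, 1 query candidate.
        examples = rng.sample(data_by_label[label], k_shot + 1)
        support.extend((x, label) for x in examples[:k_shot])
        held_out.append((examples[k_shot], label))
    # x_hat: a single query example whose label belongs to L'.
    query = rng.choice(held_out)
    return label_set, support, query


# Toy stand-in for the testing task T' (strings instead of images).
test_task = {
    label: [f"{label}_{i}" for i in range(10)]
    for label in ["airplane", "automobile", "bird", "cat", "deer"]
}
label_set, support, (query_x, query_y) = sample_episode(
    test_task, n_way=3, k_shot=1, rng=random.Random(0)
)
print(label_set, support, query_x, query_y)
```

With `n_way=3` and `k_shot=1` this corresponds to the "3-way 1-shot" setting; the same function applied to the training task T would produce training episodes.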
  • 10. Related Work (One-shot Learning): Convolutional Siamese Network [Koch+, 2015]. Learn image representations with a siamese neural network. Reuse features from the network for one-shot learning. [Figure: two CNNs whose outputs are compared to decide "Same?"]
  • 11. Related Work (One-shot Learning): Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]. Quickly encode and retrieve new information using an external memory, inspired by the idea of the Neural Turing Machine.
  • 12. Related Work (One-shot Learning): Siamese Learnet [Bertinetto+, NIPS2016]. Learn the parameters of a network so as to incorporate domain-specific information from a few examples. [Figure 1 of Bertinetto et al.: the proposed architectures predict the parameters of a network from a single example, replacing static convolutions with dynamic convolutions. The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input.] The parameters W of the predictor are produced from a single exemplar z by a non-iterative feed-forward function (the "learnet") that maps (z; W') to W, where W' are the learnet's own meta-parameters.
  • 13. Related Work (Attention Mechanism): Sequence to Sequence with Attention [Bahdanau+, 2014]. Attend to the words of the source sentence that are relevant to generating the next target word. [Figure 1 of Bahdanau et al. (ICLR 2015): the proposed model generating the t-th target word $y_t$ given a source sentence $(x_1, x_2, \dots, x_T)$.] The decoder state is $s_i = f(s_{i-1}, y_{i-1}, c_i)$. The context vector is a weighted sum of the encoder annotations, $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$ (5), with weights $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ (6), where $e_{ij} = a(s_{i-1}, h_j)$ is an alignment model that scores how well the inputs around position $j$ and the output at position $i$ match. The alignment model $a$ is a feedforward neural network trained jointly with the rest of the system.
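Equations (5) and (6) of the excerpt amount to a softmax followed by a weighted sum. The sketch below works through them in plain Python for a single target position; the alignment scores and annotation vectors are made-up numbers, and in the actual model the scores would come from the learned alignment network $a$.

```python
import math


def softmax(scores):
    """Normalise scores into weights that sum to 1 (Eq. 6)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, annotations):
    """c_i = sum_j alpha_ij * h_j (Eq. 5), with alpha_i = softmax(e_i)."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)
    ]
    return alphas, context


# Hypothetical alignment scores e_ij for one target position i
# and three source annotations h_1..h_3 of dimension 2.
scores = [2.0, 0.5, -1.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, c_i = context_vector(scores, annotations)
print(alphas, c_i)
```

The annotation with the highest score receives the largest weight, so the context vector is dominated by the source positions judged most relevant to the current output.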
  • 14. Related Work (Attention Mechanism): Pointer Networks [Vinyals+, 2015]. Generate the output sequence using a distribution over the dictionary of inputs. [Figure 1 of Vinyals et al.: (a) Sequence-to-Sequence: an RNN processes the input sequence to create a code vector that is used to generate the output sequence with another RNN; the output dimensionality is fixed by the dimensionality of the problem. (b) Ptr-Net: an encoding RNN converts the input sequence to a code that is fed to the generating network; at each step, the generating network produces a vector that modulates a content-based attention mechanism over the inputs, whose output is a softmax distribution with dictionary size equal to the length of the input.] The attention is $u^i_j = v^T \tanh(W_1 e_j + W_2 d_i)$ for $j \in (1, \dots, n)$, and $p(C_i \mid C_1, \dots, C_{i-1}, P) = \mathrm{softmax}(u^i)$, where softmax normalizes the vector $u^i$ (of length $n$) into an output distribution over the dictionary of inputs, and $v$, $W_1$, $W_2$ are learnable parameters.
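To make the Ptr-Net scoring rule concrete, here is a small pure-Python sketch of $u^i_j = v^T \tanh(W_1 e_j + W_2 d_i)$ followed by the softmax. All numbers (encoder states, decoder state, identity weight matrices) are placeholders chosen for illustration; in the paper these are learned. The point to notice is that the output distribution has exactly one entry per input element.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """softmax_j( v^T tanh(W1 e_j + W2 d_i) ) over the n input positions."""
    wd = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        we = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(we, wd)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Placeholder values: 4 input elements with 2-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
decoder_state = [0.5, -0.5]
dist = pointer_distribution(
    encoder_states, decoder_state, identity, identity, [1.0, 1.0]
)
print(dist)
```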
  • 15. Related Work (Attention Mechanism): Sequence to Sequence for Sets [Vinyals+, ICLR2016]. Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model. [Figure 1 of Vinyals et al.: the Read-Process-and-Write model.] The process block uses content-based attention, so the vector retrieved from memory does not change if the memory is randomly shuffled: $q_t = \mathrm{LSTM}(q^*_{t-1})$ (3), $e_{i,t} = f(m_i, q_t)$ (4), $a_{i,t} = \exp(e_{i,t}) / \sum_j \exp(e_{j,t})$ (5), $r_t = \sum_i a_{i,t} m_i$ (6), $q^*_t = [q_t \; r_t]$ (7). Here $i$ indexes the memory vectors $m_i$, $q_t$ is a query vector used to read $r_t$ from the memories, $f$ computes a single scalar from $m_i$ and $q_t$ (e.g. a dot product), and the LSTM takes no inputs; its state $q^*_t$ is formed by concatenating the query $q_t$ with the attention readout $r_t$.
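The order-invariance of a content-based read (equations (4) to (6)) can be checked directly. The sketch below implements one read step with a dot product as $f$ and compares the readout before and after shuffling the memory; the memory vectors and the query are arbitrary example values, and the LSTM that would produce $q_t$ is left out.

```python
import math
import random


def read_memory(query, memories):
    """One content-based read: r = sum_i a_i * m_i with a = softmax_i(f(m_i, q))."""
    # f is taken to be a dot product here (one of the options named in the excerpt).
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memories]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memories[0])
    return [sum(w * mem[d] for w, mem in zip(weights, memories)) for d in range(dim)]


memories = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.3, 0.7]
readout = read_memory(query, memories)

# Shuffle a copy of the memory and read again with the same query.
shuffled = memories[:]
random.Random(0).shuffle(shuffled)
readout_shuffled = read_memory(query, shuffled)

same = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(readout, readout_shuffled)
)
print(readout, readout_shuffled, same)
```

Because both the softmax normalisation and the weighted sum run over all memory slots, reordering the slots only reorders the terms of each sum, which is why the readout is unchanged up to floating-point rounding.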
  • 16. Matching Networks [Vinyals+, NIPS2016]: Motivation. For one-shot learning it is important to learn rapidly from new examples while retaining the ability to handle common examples. Simple parametric models such as deep classifiers need to be optimized again to handle new examples. Non-parametric models such as k-nearest neighbor don't require optimization, but their performance depends on the chosen metric. It could be effective to train an end-to-end nearest-neighbor-based classifier.
  • 17. Matching Networks [Vinyals+, NIPS2016]: Train a classifier through one-shot learning. L: label set (sampling N labels from T; {dog, horse, ship} in the figure). S: support set (sampling k examples from L). B: batch (sampling b examples from L). [Figure: training task T = {dog, frog, horse, ship, truck}; testing task T' = {airplane, automobile, bird, cat, deer}; images from https://www.cs.toronto.edu/~kriz/cifar.html]
  • 18. Matching Networks [Vinyals+, NIPS2016]: System Overview. The embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception) or as the fully conditional embedding function described later. [Figure 1: Matching Networks architecture: each support example $x_i$ with label $y_i$ is embedded by $g$, the query $\hat{x}$ is embedded by $f(\hat{x}, S)$, and the attention $a$ combines the support labels into $P(\hat{y} \mid \hat{x}, S)$.] In its simplest form the model computes $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1), where $x_i, y_i$ are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and $a$ is an attention mechanism. Eq. (1) describes the output for a new class as a linear combination of the labels in the support set, and it subsumes both KDE and kNN methods.
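Equation (1) is easy to evaluate by hand when the labels $y_i$ are one-hot: the probability of a class is simply the total attention placed on support examples of that class. A minimal sketch, assuming the attention weights are already given (the numbers and the function name `predict` are illustrative, not from the slides):

```python
def predict(attention, support_labels, classes):
    """P(y_hat | x_hat, S) = sum_i a(x_hat, x_i) * y_i for one-hot labels y_i."""
    probs = {c: 0.0 for c in classes}
    for a_i, y_i in zip(attention, support_labels):
        # Adding a_i to the entry of class y_i is the same as adding a_i * onehot(y_i).
        probs[y_i] += a_i
    return probs


classes = ["automobile", "cat", "deer"]
support_labels = ["automobile", "cat", "deer", "cat"]
# Made-up attention weights a(x_hat, x_i) over the 4 support examples; they sum to 1.
attention = [0.1, 0.5, 0.1, 0.3]

probs = predict(attention, support_labels, classes)
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Here the two "cat" examples together receive 0.8 of the attention, so "cat" is predicted even though no single example dominates.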
  • 19. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Matching Networks [Vinyals+, NIPS2016] n  The Attention Kernel ⁃  Calculate softmax over the cosine distance between and •  Similar to nearest neighbor calculation ⁃  Train a network using cross entropy loss 19 Figure 1: Matching Networks architecture train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task. Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem. We organized the paper by first defining and explaining our model whilst linking its several compo- nents to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups. 2 Model Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our model defines a function cS (or classifier) for each S, i.e. a mapping S ! cS(.). 
Second, we employ ˆx Query f g(xi ) f ( ˆx,S) aOur model in its simplest form computes a probability over ˆy as follow P(ˆy|ˆx, S) = kX i=1 a(ˆx, xi)yi where xi, yi are the inputs and corresponding label distributions {(xi, yi)}k i=1, and a is an attention mechanism which we discuss b tially describes the output for a new class as a linear combination of Where the attention mechanism a is a kernel on X ⇥ X, then (1) is akin Where the attention mechanism is zero for the b furthest xi from ˆx metric and an appropriate constant otherwise, then (1) is equivalent t (although this requires an extension to the attention mechanism that w ∑ P(ˆy|ˆx where xi, yi are the inputs and corresp {(xi, yi)}k i=1, and a is an attention mech tially describes the output for a new class Where the attention mechanism a is a kerne Where the attention mechanism is zero f metric and an appropriate constant otherw (although this requires an extension to the Thus (1) subsumes both KDE and kNN me mechanism and the yi act as values bound this case we can understand this as a parti we “point” to the corresponding example i form defined by the classifier cS(ˆx) is very 2.1.1 The Attention Kernel Equation 1 relies on choosing a(., .), the fier. The simplest form that this takes attention models and kernel functions) a(ˆx, xi) = ec(f(ˆx),g(xi)) / Pk j=1 ec(f(ˆx),g( ate neural networks (potentially with f = examples where f and g are parameteris tasks (as in VGG[22] or Inception[24]) or Section 4). We note that, though related to metric learn For a given support set S and sample to cl pairs (x0 , y0 ) 2 S such that y0 = y and mi methods such as Neighborhood Compone nearest neighbor [28]. 
However, the objective that we are trying classification, and thus we expect it to per Our model in its simplest form computes a probability over ˆy as follows: P(ˆy|ˆx, S) = kX i=1 a(ˆx, xi)yi where xi, yi are the inputs and corresponding label distributions from the support {(xi, yi)}k i=1, and a is an attention mechanism which we discuss below. Note that e tially describes the output for a new class as a linear combination of the labels in the s Where the attention mechanism a is a kernel on X ⇥ X, then (1) is akin to a kernel densit Where the attention mechanism is zero for the b furthest xi from ˆx according to som metric and an appropriate constant otherwise, then (1) is equivalent to ‘k b’-nearest n (although this requires an extension to the attention mechanism that we describe in Sec Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as a mechanism and the yi act as values bound to the corresponding keys xi, much like a has this case we can understand this as a particular kind of associative memory where, give Our model in its simplest form computes a probability over ˆy as follows: P(ˆy|ˆx, S) = kX i=1 a(ˆx, xi)yi where xi, yi are the inputs and corresponding label distributions from the suppo {(xi, yi)}k i=1, and a is an attention mechanism which we discuss below. Note that tially describes the output for a new class as a linear combination of the labels in the Where the attention mechanism a is a kernel on X ⇥ X, then (1) is akin to a kernel dens Where the attention mechanism is zero for the b furthest xi from ˆx according to so metric and an appropriate constant otherwise, then (1) is equivalent to ‘k b’-nearest (although this requires an extension to the attention mechanism that we describe in Se Thus (1) subsumes both KDE and kNN methods. 
Another view of (1) is where a acts as mechanism and the yi act as values bound to the corresponding keys xi, much like a ha this case we can understand this as a particular kind of associative memory where, giv we “point” to the corresponding example in the support set, retrieving its label. Hence th form defined by the classifier cS(ˆx) is very flexible and can adapt easily to any new sup 2.1.1 The Attention Kernel Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifie fier. The simplest form that this takes (and which has very tight relationships wi attention models and kernel functions) is to use the softmax over the cosine dist a(ˆx, xi) = ec(f(ˆx),g(xi)) / Pk j=1 ec(f(ˆx),g(xj )) with embedding functions f and g bein ate neural networks (potentially with f = g) to embed ˆx and xi. In our experiments w examples where f and g are parameterised variously as deep convolutional network tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for languag Section 4). We note that, though related to metric learning, the classifier defined by Equation 1 is di For a given support set S and sample to classify ˆx, it is enough for ˆx to be sufficiently a pairs (x0 , y0 ) 2 S such that y0 = y and misaligned with the rest. This kind of loss is als c: cosine distance Figure 1: Matching Networks architecture by showing only a few examples per class, switching the task from minibatch to minibatch, ike how it will be tested when presented with a few examples of a new task. s our contributions in defining a model and training criterion amenable for one-shot learning, ntribute by the definition of tasks that can be used to benchmark other approaches on both Figure 1: Matching Networks architecture it by showing only a few examples per class, switching the task from minibatch to minibatch, h like how it will be tested when presented with a few examples of a new task. 
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), $g(x_i)$, $f(\hat{x}, S)$)

$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \qquad (2)$$
$$h_k = \hat{h}_k + f'(\hat{x}) \qquad (3)$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \qquad (4)$$
$$a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}} \qquad (5)$$

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention. We do K steps of "reads", so $f(\hat{x}, S) = h_K$ where $h_k$ is as described in eq. 3.

2.2 Training Strategy
In the previous subsection we described Matching Networks which map a support set to a classification function, $S \to c(\hat{x})$. We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form $P_\theta(\cdot \mid \hat{x}, S)$, noting that $\theta$ are the parameters of the model (i.e. of the embedding functions f and g described previously).

The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets $S'$ which contain classes never seen during training.

More specifically, let us define a task $T$ as distribution over possible label sets $L$. Typically we consider $T$ to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set $L$ sampled from a task $T$, $L \sim T$, will typically have 5 to 25 examples.

To form an "episode" to compute gradients and update our model, we first sample $L$ from $T$ (e.g., $L$ could be the label set {cats, dogs}). We then use $L$ to sample the support set $S$ and a batch $B$ (i.e., both $S$ and $B$ are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch $B$ conditioned on the support set $S$. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

$$\theta = \arg\max_\theta\, E_{L \sim T}\left[ E_{S \sim L,\, B \sim L}\left[ \sum_{(x,y) \in B} \log P_\theta(y \mid x, S) \right] \right] \qquad (6)$$

Training $\theta$ with eq. 6 yields a model which works well when sampling $S' \sim T'$ […]
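The episode construction described in the training strategy can be sketched as follows. This is a hedged illustration only: `sample_episode`, `episode_log_likelihood` and their parameters are hypothetical names, the dataset is assumed to be a dict from class label to a list of examples, and keeping S and B disjoint with a fixed number of batch examples per class is an assumption made here, not something the excerpt states.

```python
import math
import random


def sample_episode(data_by_class, n_way, k_shot, batch_per_class, rng):
    """Sample a label set L, then a support set S and a batch B from L."""
    label_set = rng.sample(sorted(data_by_class), n_way)        # L ~ T
    support, batch = [], []
    for label in label_set:
        examples = rng.sample(data_by_class[label], k_shot + batch_per_class)
        support += [(x, label) for x in examples[:k_shot]]       # S ~ L
        batch += [(x, label) for x in examples[k_shot:]]         # B ~ L
    return label_set, support, batch


def episode_log_likelihood(batch, support, predict_fn):
    """Inner sum of eq. 6: sum over (x, y) in B of log P(y | x, S).

    predict_fn(x, support) is assumed to return a dict {label: probability}.
    """
    return sum(math.log(predict_fn(x, support)[y]) for x, y in batch)


# Toy dataset: four classes with ten string "examples" each (illustrative only).
rng = random.Random(0)
data = {name: [f"{name}_{i}" for i in range(10)]
        for name in ("bird", "cat", "dog", "fish")}
label_set, support, batch = sample_episode(data, n_way=2, k_shot=1,
                                           batch_per_class=3, rng=rng)
```

In training, the parameters would be updated to increase `episode_log_likelihood` averaged over many sampled episodes; at test time the same procedure is applied with label sets drawn from classes never seen during training.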
  • 20. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g', LSTM, LSTM, +, $g(x_i, S)$)
g': neural network (e.g., VGG or Inception)

Paper excerpt (appendix):

$$a(h_{k-1}, g(x_i)) = \mathrm{softmax}(h_{k-1}^T g(x_i))$$

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. $g(x_i)$. The read-out $r_{k-1}$ from $g(S)$ is concatenated to $h_{k-1}$. Since we do K steps of "reads", $\mathrm{attLSTM}(f'(\hat{x}), g(S), K) = h_K$ where $h_k$ is as described in eq. 3.

A.2 The Fully Conditional Embedding g
In section 2.1.2 we described the encoding function for the elements in the support set S, $g(x_i, S)$, as a bidirectional LSTM. More precisely, let $g'(x_i)$ be a neural network (similar to $f'$ above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:

$$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$

where, as in above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for $\overleftarrow{h}$ starts from $i = |S|$. As in eq. 3, we add a skip connection between input and outputs.

B ImageNet Class Splits
Here we define the two class splits used in our full ImageNet experiments; these classes were excluded for training during our one-shot experiments described in section 4.1.2.
L_rand = n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, n[…]
n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, n[…]
  • 21. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g')
Embed $x_i$ into a vector using g' (g': neural network such as VGG or Inception)
  • 22. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g', LSTM, LSTM)
Feed $g'(x_i)$ into a Bi-LSTM (g': neural network such as VGG or Inception)

$$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
  • 23. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding g
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g', LSTM, LSTM, +, $g(x_i, S)$)
Let $g(x_i, S)$ be the sum of $g'(x_i)$ and the outputs of the Bi-LSTM:

$$g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$$
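The three steps above (embed with g', run a Bi-LSTM over the support set, add the skip connection) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's code: the LSTM cell is a standard textbook cell with random weights, the g'(x_i) vectors are made-up numbers, the initial states are assumed to be zero, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lstm(concat_dim, hidden_dim, rng, scale=0.1):
    """Random weights for the four gates; each maps [x; h] to hidden_dim values."""
    def matrix():
        return [[rng.uniform(-scale, scale) for _ in range(concat_dim)]
                for _ in range(hidden_dim)]
    return {gate: (matrix(), [0.0] * hidden_dim) for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """One LSTM step: x is the input, h the output, c the cell; returns (h, c)."""
    z = list(x) + list(h)

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights, bias)]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def fully_conditional_g(base_embeddings, forward_params, backward_params):
    """g(x_i, S) = forward h_i + backward h_i + g'(x_i) for every x_i in S."""
    dim = len(base_embeddings[0])
    forward, h, c = [], [0.0] * dim, [0.0] * dim
    for emb in base_embeddings:                  # i = 1 .. |S|
        h, c = lstm_step(forward_params, emb, h, c)
        forward.append(h)
    backward, h, c = [], [0.0] * dim, [0.0] * dim
    for emb in reversed(base_embeddings):        # recursion starts from i = |S|
        h, c = lstm_step(backward_params, emb, h, c)
        backward.append(h)
    backward.reverse()
    return [[a + b + e for a, b, e in zip(fw, bw, emb)]   # skip connection
            for fw, bw, emb in zip(forward, backward, base_embeddings)]


rng = random.Random(0)
support = [[0.5, -1.0, 0.2], [1.0, 0.3, -0.4], [-0.7, 0.8, 0.1]]  # toy g'(x_i)
forward_params = make_lstm(6, 3, rng)
backward_params = make_lstm(6, 3, rng)
g_S = fully_conditional_g(support, forward_params, backward_params)
```

Because each output mixes a forward and a backward pass over the whole set, changing any one support example changes the embedding of every other example, which is what "embed in consideration of S" refers to.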
  • 24. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g, $\hat{x}$, f', LSTM, $\hat{h}_{k-1}$, $h_{k-1}$, $\hat{h}_k$, +, $r_{k-1}$, $a(h_{k-1}, g(x_i))\, g(x_i)$, $f(\hat{x}, S) = h_K$; annotations: Query, weighted sum)

Paper excerpt:
[…] train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem.
We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.

2 Model
Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our model defines a function $c_S$ (or classifier) for each S, i.e. a mapping $S \to c_S(\cdot)$. Second, we employ a training strategy which is tailored for one-shot learning from the support set S.

2.1 Model Architecture
In recent years, many groups have investigated ways to augment neural netwo[…]

[…] so, we define the following recurrence over "processing" steps k, following work from [26]:

$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \qquad (2)$$
$$h_k = \hat{h}_k + f'(\hat{x}) \qquad (3)$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \qquad (4)$$
$$a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}} \qquad (5)$$
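The recurrence of eqs. 2 to 5 can be sketched end to end in plain Python. This is an illustrative sketch with several assumptions that are not in the source: the LSTM cell is a standard textbook cell with random weights, h_0, c_0 and r_0 are initialised to zero (so the first step does not read from S), and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lstm(concat_dim, hidden_dim, rng, scale=0.1):
    """Random weights for the four gates; each maps [x; h] to hidden_dim values."""
    def matrix():
        return [[rng.uniform(-scale, scale) for _ in range(concat_dim)]
                for _ in range(hidden_dim)]
    return {gate: (matrix(), [0.0] * hidden_dim) for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """One LSTM step: x is the input, h the (concatenated) state, c the cell."""
    z = list(x) + list(h)

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights, bias)]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def read_out(h, support_embeddings):
    """Eqs. 4-5: softmax over h^T g(x_i), then the weighted sum of the g(x_i)."""
    scores = [sum(a * b for a, b in zip(h, g)) for g in support_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum((e / total) * g[j] for e, g in zip(exps, support_embeddings))
            for j in range(len(h))]


def att_lstm(f_prime_x, support_embeddings, params, num_steps):
    """f(x_hat, S) = h_K after K "read" steps over the embedded support set."""
    dim = len(f_prime_x)
    h, c, r = [0.0] * dim, [0.0] * dim, [0.0] * dim   # assumed zero h_0, c_0, r_0
    for k in range(1, num_steps + 1):
        if k > 1:
            r = read_out(h, support_embeddings)            # eq. 4: r_{k-1}
        h_hat, c = lstm_step(params, f_prime_x, h + r, c)  # eq. 2, [h; r] concatenated
        h = [a + b for a, b in zip(h_hat, f_prime_x)]      # eq. 3, skip connection
    return h


rng = random.Random(0)
query = [0.3, -0.2, 0.9]                                   # toy f'(x_hat)
support = [[0.5, -1.0, 0.2], [1.0, 0.3, -0.4], [-0.7, 0.8, 0.1]]  # toy g(x_i)
params = make_lstm(9, 3, rng)                              # [x; h; r] has 3 * 3 entries
f_x_S = att_lstm(query, support, params, num_steps=4)
```

The input f'(x̂) is fed at every step, so the K steps refine the query embedding by repeatedly reading from the support set rather than consuming a sequence.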
  • 25. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g, $g(x_i)$, $\hat{x}$, f', LSTM, $\hat{h}_1$, $h_1$, +; annotation: Query)
$h_1$ is calculated without using S:

$$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$$
  • 26. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g, $g(x_i)$, $\hat{x}$, f', LSTM, $\hat{h}_1$, $h_1$, +, $a(h_1, g(x_i))$; annotation: Query)
Calculate the relevance between $h_1$ and $g(x_i)$:

$$a(h_1, g(x_i)) = \mathrm{softmax}(h_1^T g(x_i)) = \frac{e^{h_1^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_1^T g(x_j)}}$$
  • 27. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed in consideration of S
Figure 1: Matching Networks architecture (diagram labels: $x_i$, $y_i$, Support Set (S), g, $g(x_i)$, $\hat{x}$, f', LSTM, $\hat{h}_1$, $h_1$, +, $a(h_1, g(x_i))$, $r_1$; annotations: Query, weighted sum)
$r_1$ is a sum of the $g(x_i)$, weighted according to their relevance $a(h_1, g(x_i))$ to $h_1$:

$$r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i))\, g(x_i)$$
  • 28. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed $\hat{x}$ in consideration of S
⁃  $h_1$ is calculated using S
[Figure 1: Matching Networks architecture, annotated with the support set S = {(x_i, y_i)}, the query $\hat{x}$, the attention weights $a(h_1, g(x_i))$, the weighted sum $r_1$, the LSTM, its output $\hat{h}_1$, and the sum giving $h_1$; recurrence as in eqs. (2)-(5)]
  • 29. Matching Networks [Vinyals+, NIPS2016]
n  The Fully Conditional Embedding f
⁃  Embed $\hat{x}$ in consideration of S
⁃  At each step k, the weighted sum $r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$ is fed to the LSTM together with $h_{k-1}$, and the LSTM output $\hat{h}_k$ is added to $f'(\hat{x})$ to give $h_k$ (eqs. (2)-(5))
⁃  Let $f(\hat{x}, S) = h_K$ be the output after K steps
[Figure 1: Matching Networks architecture, annotated with the support set S = {(x_i, y_i)}, the query $\hat{x}$, and the unrolled attention LSTM]
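The read step of the recurrence (eqs. 4 and 5) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `softmax`, `dot` and `attention_readout` and the toy vectors are assumptions made for the example, and the LSTM update of eqs. (2) and (3) is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_readout(h, support_embeddings):
    """Eq. (5): weights a(h, g(x_i)) = softmax_i(h^T g(x_i)).
    Eq. (4): read-out r = sum_i a(h, g(x_i)) * g(x_i)."""
    weights = softmax([dot(h, g) for g in support_embeddings])
    dim = len(support_embeddings[0])
    r = [sum(w * g[d] for w, g in zip(weights, support_embeddings))
         for d in range(dim)]
    return weights, r


# Hypothetical 2-d embeddings g(x_i) of three support examples and a state h.
support = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [2.0, 0.0]
weights, r = attention_readout(h, support)
```

In the full model this read-out would be concatenated with the previous hidden state and passed to the LSTM at every one of the K steps; here only a single read is shown.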
  • 30. Experimental Settings
n  Datasets
⁃  Image classification sets
•  Omniglot [Lake+, 2011]
•  ImageNet [Deng+, 2009]
   ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
⁃  Language modeling
•  Penn Treebank [Marcus+, 1993]

Excerpt from the paper:

4.1.3 One-Shot Language Modeling
We also introduce a new one-shot language task which is analogous to those examined for images. The task is as follows: given a query sentence with a missing word in it, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. Here we show a single example, though note that the words on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.

1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said. [prominent]
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall. [series]
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>. [dollar]
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board. [towel]
5. it’s not easy to roll out something that <blank_token> and make it pay mr. jacob says. [comprehensive]
Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday. [dollar]

Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set and batch are populated with sentences that are non-overlapping. This means that we do not use words with very low frequency counts; e.g. if there is only a single sentence for a given word we do not use this data since the sentence would need to be in both the set and the batch. As with the image tasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured that the same number of sentences were available for each class in the set. We split the words into a randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report results. Thus, neither the words nor the sentences used during test time had been seen during training. We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] ...
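The trial construction described in the excerpt (5-way choice, the same number of sentences per class, non-overlapping set and batch, words with too few sentences skipped) might look roughly like the following. It is a simplified sketch under stated assumptions: `sample_episode`, the dictionary layout `sentences_by_word` and the toy corpus are hypothetical, and a single query is drawn instead of the batch of 20 used in the paper.

```python
import random


def sample_episode(sentences_by_word, n_way=5, k_shot=1, rng=None):
    """Draw one n_way, k_shot trial with a single query sentence.

    sentences_by_word: dict mapping a missing word (the class) to the list
    of sentences in which it was blanked out (hypothetical data layout).
    """
    rng = rng or random.Random()
    # A word needs k_shot sentences for the set plus one distinct sentence
    # for the query, so low-frequency words are skipped.
    eligible = [w for w, s in sentences_by_word.items() if len(s) >= k_shot + 1]
    classes = rng.sample(eligible, n_way)

    support, held_out = [], {}
    for label, word in enumerate(classes):
        picked = rng.sample(sentences_by_word[word], k_shot + 1)
        support.extend((sentence, label) for sentence in picked[:k_shot])
        held_out[label] = picked[k_shot]  # never placed in the support set

    target = rng.randrange(n_way)
    query = (held_out[target], target)
    return classes, support, query


# Toy corpus: eight words with four sentences each, plus one rare word.
toy = {f"word{i}": [f"w{i}-sent{j}" for j in range(4)] for i in range(8)}
toy["rare"] = ["only one sentence"]
classes, support, query = sample_episode(toy, n_way=5, k_shot=2,
                                         rng=random.Random(0))
```

Holding out one extra sentence per class is just one way to guarantee that the query never appears in the support set; other schemes would work as well.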
  • 31. Experimental Settings (Omniglot)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from VGG (Baseline classifier)
⁃  MANN
⁃  Convolutional Siamese Network
n  Datasets
⁃  training: 1200 characters
⁃  testing: 423 characters
  • 32. Experimental Results (Omniglot)
n  Fully Conditional Embedding (FCE) did not seem to help much
n  Baseline and Siamese Net were improved with fine-tuning

Excerpt from the paper:
... took this network and used the features from the last layer (before the softmax) for nearest neighbour matching, a strategy commonly used in computer vision [3] which has achieved excellent results across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different task of the original training data set and then the last layer was used for nearest neighbour matching.

Table 1: Results on the Omniglot dataset.
Model | Matching Fn | Fine Tune | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
PIXELS | Cosine | N | 41.7% | 63.2% | 26.7% | 42.6%
BASELINE CLASSIFIER | Cosine | N | 80.0% | 95.0% | 69.5% | 89.1%
BASELINE CLASSIFIER | Cosine | Y | 82.3% | 98.4% | 70.6% | 92.0%
BASELINE CLASSIFIER | Softmax | Y | 86.0% | 97.6% | 72.9% | 92.3%
MANN (NO CONV) [21] | Cosine | N | 82.8% | 94.9% | - | -
CONVOLUTIONAL SIAMESE NET [11] | Cosine | N | 96.7% | 98.4% | 88.0% | 96.5%
CONVOLUTIONAL SIAMESE NET [11] | Cosine | Y | 97.3% | 98.4% | 88.1% | 97.0%
MATCHING NETS (OURS) | Cosine | N | 98.1% | 98.9% | 93.8% | 98.5%
MATCHING NETS (OURS) | Cosine | Y | 97.9% | 98.7% | 93.5% | 98.7%
  • 33. Experimental Settings (ImageNet)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from InceptionV3 (Baseline classifier)
n  Datasets
⁃  miniImageNet (size: 84x84)
•  training: 80 classes
•  testing: 20 classes
⁃  randImageNet
•  training: randomly picked classes (882 classes)
•  testing: remaining classes (118 classes)
⁃  dogsImageNet
•  training: all non-dog classes (882 classes)
•  testing: dog classes (118 classes)
  • 34. Experimental Results (miniImageNet)
n  Matching Networks outperformed the baselines
n  Fully Conditional Embedding (FCE) proved effective at improving performance on this task

Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S' contain classes never seen during training. Our model makes far less mistakes than the Inception baseline.

Table 2: Results on miniImageNet.
Model | Matching Fn | Fine Tune | 5-way 1-shot | 5-way 5-shot
PIXELS | Cosine | N | 23.0% | 26.6%
BASELINE CLASSIFIER | Cosine | N | 36.6% | 46.0%
BASELINE CLASSIFIER | Cosine | Y | 36.2% | 52.2%
BASELINE CLASSIFIER | Softmax | Y | 38.4% | 51.2%
MATCHING NETS (OURS) | Cosine | N | 41.2% | 56.2%
MATCHING NETS (OURS) | Cosine | Y | 42.4% | 58.0%
MATCHING NETS (OURS) | Cosine (FCE) | N | 44.2% | 57.0%
MATCHING NETS (OURS) | Cosine (FCE) | Y | 46.6% | 60.0%
  • 35. Experimental Results (randImageNet, dogsImageNet)
n  Matching Networks outperformed the Inception Classifier on $L_{rand}$, but degraded on $L_{dogs}$
n  The drop in performance on $L_{dogs}$ might be caused by the different distributions of labels between training and testing
⁃  The training support set is sampled from a random distribution of labels, whereas the testing support set consists of similar classes

Table 3: Results on full ImageNet on rand and dogs one-shot tasks. Note that $\neq L_{rand}$ and $\neq L_{dogs}$ are sets of classes which are seen during training, but are provided for completeness.
(ImageNet 5-way 1-shot accuracy)
Model | Matching Fn | Fine Tune | $L_{rand}$ | $\neq L_{rand}$ | $L_{dogs}$ | $\neq L_{dogs}$
PIXELS | Cosine | N | 42.0% | 42.8% | 41.4% | 43.0%
INCEPTION CLASSIFIER | Cosine | N | 87.6% | 92.6% | 59.8% | 90.0%
MATCHING NETS (OURS) | Cosine (FCE) | N | 93.2% | 97.0% | 58.8% | 96.4%
INCEPTION ORACLE | Softmax (Full) | Y (Full) | ≈ 99% | ≈ 99% | ≈ 99% | ≈ 99%

Excerpt from the paper:
The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on $\neq L_{rand}$, Matching Nets improve upon Inception by almost 6% when tested on $L_{rand}$, halving the errors. Figure 2 shows two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception appears to sometimes prefer an image above all others (these images tend to be cluttered like the example in the second column, or more constant in color). Matching Nets, on the other hand, manage to recover from these outliers that sometimes appear in the support set S'.
Matching Nets manage to improve upon Inception on the complementary subset $\neq L_{dogs}$ (although this setup is not one-shot, as the feature extraction has been trained on these labels). However, on the much more challenging $L_{dogs}$ subset, our model degrades by 1%. We hypothesize this to the fact that the sampled set during training, S, comes from a random distribution of labels, whereas the testing support set S' from $L_{dogs}$ contains similar classes, more akin to fine grained classification. Thus, we believe that if we adapted our training strategy to samples S from fine grained sets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvements could be attained. We leave this as future work.
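As a quick check of the "almost 6%" and "halving the errors" statements, the $L_{rand}$ and $L_{dogs}$ columns of Table 3 can be worked through directly. The arithmetic below uses only the accuracies listed in that table; the error rates are simply 100% minus the accuracy.

```latex
\begin{align*}
\text{gain on } L_{rand} &= 93.2\% - 87.6\% = 5.6 \text{ points} \\
\text{error (Inception)} &= 100\% - 87.6\% = 12.4\% \\
\text{error (Matching Nets)} &= 100\% - 93.2\% = 6.8\% \\
\frac{6.8}{12.4} &\approx 0.55 \quad \text{(roughly half the errors)} \\
\text{change on } L_{dogs} &= 58.8\% - 59.8\% = -1.0 \text{ points}
\end{align*}
```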
  • 36. Experimental Settings (Penn Treebank)
n  Fill in a blank in a query sentence with a label from the support set
⁃  Each support sentence $x_i$ (label $y_i$) is embedded by an LSTM as $g(x_i)$
⁃  The query sentence $\hat{x}$ is embedded by an LSTM as $f(\hat{x}, S)$
⁃  The attention a is computed from c, where c: cosine distance
[Diagram: support set S with sentences $x_i$ ("… virus a", "… new nbc", "on the …") and labels $y_i$, query $\hat{x}$ ("the yesterday …"), LSTM encoders, attention a]

Excerpts from the paper:

Our model in its simplest form computes a probability over $\hat{y}$ as follows:
   $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$  (1)
where $x_i, y_i$ are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on $X \times X$, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest $x_i$ from $\hat{x}$ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the $y_i$ act as values bound to the corresponding keys $x_i$, much like a hash table. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. Hence the functional form defined by the classifier $c_S(\hat{x})$ is very flexible and can adapt easily to any new support set.

2.1.1 The Attention Kernel
Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c, i.e.,
   $a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} \big/ \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed $\hat{x}$ and $x_i$. In our experiments we shall see examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language tasks (see Section 4).

4.1.3 One-Shot Language Modeling (continued)
We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data; thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model.
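To make Equation 1 and the attention kernel concrete, here is a small plain-Python sketch of the prediction step. It is an illustration under assumptions rather than the paper's code: the embeddings f(x̂) and g(x_i) are taken as already computed (the vectors below are made up), c is computed as cosine similarity, and `cosine` and `matching_predict` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity c(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """P(y | x, S) = sum_i a(x, x_i) * y_i with a = softmax over cosine."""
    sims = [cosine(query_emb, g) for g in support_embs]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    attention = [e / z for e in exps]

    probs = [0.0] * n_classes
    for a, label in zip(attention, support_labels):
        probs[label] += a  # adding a * (one-hot vector of the label)
    return probs


# Hypothetical embeddings for a 5-way 1-shot trial.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
support_labels = [0, 1, 2, 3, 4]
query_emb = [0.9, 0.1]
probs = matching_predict(query_emb, support_embs, support_labels, n_classes=5)
prediction = probs.index(max(probs))
```

Because the label distributions are simply accumulated per class, the same function also covers k-shot trials in which several support items share a label.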
  • 37. Experimental Settings and Results (Penn Treebank)
n  Baseline
⁃  Oracle LSTM-LM
•  Trained on all the words (not one-shot)
•  Consider this model as an upper bound
n  Datasets
⁃  training: 9000 words
⁃  testing: 1000 words
n  Results (5-way accuracy)
Model | 1-shot | 2-shot | 3-shot
Matching Nets | 32.4% | 36.1% | 38.2%
Oracle LSTM-LM | (72.8%) | - | -
  • 38. Conclusion
n  The authors proposed Matching Networks: a nearest-neighbor-based approach trained fully end-to-end
n  Key points
⁃  “One-shot learning is much easier if you train the network to do one-shot learning” [Vinyals+, 2016]
⁃  Matching Networks have a non-parametric structure, so they can assimilate new examples rapidly
n  Findings
⁃  Matching Networks improved performance on Omniglot, miniImageNet, and randImageNet, but degraded on dogsImageNet
⁃  One-shot learning with fine-grained sets of labels is difficult to solve, which could make it an exciting challenge in this area
  • 39. References
n  Matching Networks
⁃  Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
n  One-shot Learning
⁃  Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
⁃  Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
⁃  Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.
n  Attention Mechanisms
⁃  Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
⁃  Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
⁃  Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." ICLR 2016.
  • 40. References
n  Datasets
⁃  Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
⁃  Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009.
⁃  Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃  Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330.