Papers covered (ICLR 2019):
• Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
• Smoothing the Geometry of Probabilistic Box Embeddings
• Pay Less Attention with Lightweight and Dynamic Convolutions
• The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision
Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks (ON-LSTM)
Figure 2: Correspondences between a constituency parse tree and the hidden states of the proposed ON-LSTM. A sequence of tokens S = (x1, x2, x3) and its corresponding constituency tree are illustrated in (a). We provide a block view of the tree structure in (b), where both S and VP nodes span more than one time step. The representation for high-ranking nodes should be relatively consistent across multiple time steps. (c) Visualization of the update frequency of groups of hidden state neurons. At each time step, given the input word, dark grey blocks are completely updated while light grey blocks are partially updated. The three groups of neurons have different update frequencies. The topmost groups update less frequently while lower groups are more frequently updated.

High-ranking neurons contain long-term or global information that will last anywhere from several time steps to the entire sentence, representing nodes near the root of the tree. Low-ranking neurons encode short-term or local information that only lasts one or a few time steps, representing smaller constituents, as shown in Figure 2(b). The differentiation between high-ranking and low-ranking neurons is learnt in a completely data-driven fashion by controlling the update frequency of single neurons: to erase (or update) high-ranking neurons, the model should first erase (or update) all lower-ranking neurons. In other words, some neurons always update more (or less) frequently than the others, and that order is pre-determined as part of the model architecture.

4 ON-LSTM

In this section, we present a new RNN unit, ON-LSTM ("ordered neurons LSTM"). The new model uses an architecture similar to the standard LSTM, reported below:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (1)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (3)
ĉ_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)
h_t = o_t ∘ tanh(c_t)  (5)

The difference with the LSTM is that we replace the update function for the cell state c_t with a new function that will be explained in the following sections. The forget gates f_t and input gates i_t are used to control the erasing and writing operation on cell states c_t, as before. Since the gates in the LSTM act independently on each neuron, it may be difficult in general to discern a hierarchy of information between the neurons. To this end, we propose to make the gate for each neuron dependent on the others by enforcing the order in which neurons should be updated.
4.1 ACTIVATION FUNCTION: cumax()

To enforce the order in which neurons are updated, the cumulative softmax activation is introduced:

ĝ = cumax(…) = cumsum(softmax(…)),

where cumsum denotes the cumulative sum. The vector ĝ can be seen as the expectation of a binary gate g = (0, …, 0, 1, …, 1) whose split point is a categorical random variable d with p(d = k) given by the softmax. The event g_k = 1 includes all events where the split point d falls before the k-th position, that is d ≤ k = (d = 0) ∨ (d = 1) ∨ · · · ∨ (d = k). Since the categories are mutually exclusive, we can do this by computing the cumulative distribution function:

p(g_k = 1) = p(d ≤ k) = Σ_{i ≤ k} p(d = i)  (8)

Ideally, g should take the form of a discrete variable. Unfortunately, computing gradients when a discrete variable is included in the computation graph is not trivial (Schulman et al., 2015), so in practice we use a continuous relaxation by computing the quantity p(d ≤ k), obtained by taking a cumulative sum of the softmax. As g_k is binary, this is equivalent to computing E[g_k]. Hence, ĝ = E[g].
4.2 STRUCTURED GATING MECHANISM

Based on the cumax() function, we introduce a master forget gate f̃_t and a master input gate ĩ_t:

f̃_t = cumax(W_{f̃} x_t + U_{f̃} h_{t-1} + b_{f̃})  (9)
ĩ_t = 1 − cumax(W_{ĩ} x_t + U_{ĩ} h_{t-1} + b_{ĩ})  (10)

Following the properties of the cumax() activation, the values in the master forget gate are monotonically increasing from 0 to 1, and those in the master input gate are monotonically decreasing from 1 to 0. These gates serve as high-level control for the update operations of cell states. Using the master gates, we define a new update rule:

ω_t = f̃_t ∘ ĩ_t  (11)
f̂_t = f_t ∘ ω_t + (f̃_t − ω_t) = f̃_t ∘ (f_t ∘ ĩ_t + 1 − ĩ_t)  (12)
î_t = i_t ∘ ω_t + (ĩ_t − ω_t) = ĩ_t ∘ (i_t ∘ f̃_t + 1 − f̃_t)  (13)
c_t = f̂_t ∘ c_{t-1} + î_t ∘ ĉ_t  (14)

In order to explain the intuition behind the new update rule, we assume that the master gates are binary:

• The master forget gate f̃_t controls the erasing behavior of the model. Suppose f̃_t = (0, …, 0, 1, …, 1) and the split point is d^f_t. Given Eq. (12) and (14), the information stored in the first d^f_t neurons of the previous cell state c_{t-1} will be completely erased. In a parse tree (e.g. Figure 2(a)), this operation is akin to closing previous constituents. A large number of zeroed neurons, i.e. a large d^f_t, represents the end of a high-level constituent in the parse tree, as most of the information in the state will be discarded. Conversely, a small d^f_t …
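A compact NumPy sketch of one ON-LSTM cell step following Eq. (1)–(5) and (9)–(14); the dictionary layout of the weights and the sigmoid helper are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumax(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.cumsum(p)

def on_lstm_step(x, h_prev, c_prev, W, U, b):
    """One ON-LSTM step. W, U, b are dicts keyed by gate name:
    'f', 'i', 'o', 'c' (standard LSTM gates) and 'mf', 'mi' (master gates)."""
    # Standard LSTM gates, Eq. (1)-(4).
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])
    c_hat = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])
    # Master gates, Eq. (9)-(10): monotone 0->1 and 1->0 profiles over the neurons.
    mf = cumax(W['mf'] @ x + U['mf'] @ h_prev + b['mf'])
    mi = 1.0 - cumax(W['mi'] @ x + U['mi'] @ h_prev + b['mi'])
    # Structured update rule, Eq. (11)-(14).
    omega = mf * mi
    f_hat = f * omega + (mf - omega)
    i_hat = i * omega + (mi - omega)
    c = f_hat * c_prev + i_hat * c_hat
    h = o * np.tanh(c)  # Eq. (5)
    return h, c
```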
For unsupervised constituency parsing, the expected split point of the master forget gate is used as a syntactic distance:

d̂^f_t = D_M − Σ_{k=1}^{D_M} f̃_{t,k},

where D_M is the dimension of the master forget gate. The distance d̂^f_t between tokens x_t and x_{t+1} is then used to induce a binary parse tree in a top-down fashion.
Smoothing the Geometry of Probabilistic Box Embeddings
(a) Ontology (b) Order Embeddings (c) Box Embeddings
Figure 1: Comparison between the Order Embedding (vector lattice) and Box Embedding representations for a simple ontology. Regions represent concepts and overlaps represent their entailment. Shading represents density in the probabilistic case.
3.1 PARTIAL ORDERS AND LATTICES
A non-strict partially ordered set (poset) is a pair ⟨P, ⪯⟩, where P is a set and ⪯ is a binary relation. For all a, b, c ∈ P, the relation is reflexive (a ⪯ a), antisymmetric (a ⪯ b and b ⪯ a imply a = b), and transitive (a ⪯ b and b ⪯ c imply a ⪯ c).
(a) Original lattice (b) Ground truth CPD
Figure 1: Representation of the toy probabilistic lattice used in Section 5.1. Darker color corresponds to more unary marginal probability. The associated CPD is obtained by a weighted aggregation of leaf elements.
(a) POE lattice (b) Box lattice

Each concept x is represented as a box (an axis-aligned hyperrectangle) with lower and upper corners x_m, x_M ∈ [0, 1]^n. Probabilities are given by box volumes and overlaps:

P(x) = ∏_i (x_{M,i} − x_{m,i})
P(x, y) = ∏_i max(0, min(x_{M,i}, y_{M,i}) − max(x_{m,i}, y_{m,i}))
P(x|y) = P(x, y) / P(y)
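A minimal NumPy sketch of these box probabilities with the hard (hinge) overlap; the array layout (a box as a pair of min/max corner vectors) is an illustrative assumption:

```python
import numpy as np

def box_volume(box):
    """box = (mins, maxs), each of shape (n,). Volume = product of side lengths."""
    mins, maxs = box
    return np.prod(np.maximum(maxs - mins, 0.0))

def joint_volume(box_x, box_y):
    """Volume of the intersection box, clipped at zero when the boxes are disjoint."""
    lo = np.maximum(box_x[0], box_y[0])
    hi = np.minimum(box_x[1], box_y[1])
    return np.prod(np.maximum(hi - lo, 0.0))

def conditional(box_x, box_y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint_volume(box_x, box_y) / box_volume(box_y)

x = (np.array([0.1, 0.1]), np.array([0.6, 0.5]))
y = (np.array([0.3, 0.2]), np.array([0.9, 0.9]))
print(box_volume(x), joint_volume(x, y), conditional(x, y))
```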
To avoid the zero-gradient regions of the hinge overlap, the smoothed model replaces max(0, ·) with a softplus, softplus(x) = log(1 + e^x):

P(x) = ∏_i softplus(x_{M,i} − x_{m,i})
P(x, y) = ∏_i softplus(min(x_{M,i}, y_{M,i}) − max(x_{m,i}, y_{m,i}))
P(x|y) = P(x, y) / P(y)
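The corresponding change to the sketch above, swapping the hinge for a softplus (again a rough sketch; the paper's temperature parameter is omitted here):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), written with logaddexp to avoid overflow for large x.
    return np.logaddexp(0.0, x)

def smoothed_volume(box):
    mins, maxs = box
    return np.prod(softplus(maxs - mins))

def smoothed_joint_volume(box_x, box_y):
    lo = np.maximum(box_x[0], box_y[0])
    hi = np.minimum(box_x[1], box_y[1])
    # Disjoint boxes now get a small but non-zero overlap, so gradients still flow.
    return np.prod(softplus(hi - lo))

def smoothed_conditional(box_x, box_y):
    return smoothed_joint_volume(box_x, box_y) / smoothed_volume(box_y)
```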
Problem with the hard box model: when two boxes are disjoint, their joint probability is exactly zero, P(x, y) = 0, so the conditional P(x|y) is zero as well and no gradient signal is available to pull related concepts together.
Our approach retains the inductive bias of the original box model, is equivalent in the limit, and satisfies the necessary condition that p(x, x) = p(x). A comparison of the 3 different functions is given in Figure 3, with the softplus overlap showing much better behavior for highly disjoint boxes than the Gaussian model, while also preserving the meet property.

(a) Standard (hinge) overlap (b) Gaussian overlap, σ ∈ {2, 6} (c) Softplus overlap
Figure 3: Comparison of different overlap functions for two boxes of width 0.3 as a function of their centers. Note that in order to achieve high overlap, the Gaussian model must drastically lower its temperature, causing vanishing gradients in the tails.

5 EXPERIMENTS

5.1 WORDNET

Method Test Accuracy %
transitive 88.2
word2gauss 86.6
OE 90.6
Li et al. (2017) 91.3
POE 91.6
Box 92.2
Smoothed Box 92.0
Table 4: Classification accuracy on WordNet test set.

We perform experiments on the WordNet hypernym prediction task (837k edges) in order to evaluate the performance of the smoothed model; negative examples are generated by swapping one of the terms to a random word in the dictionary. Experimental details are given in Appendix D.1.

The smoothed box model performs nearly as well as the original box lattice in terms of test accuracy. While our model requires less hyper-parameter tuning than the original, we suspect that our performance would be increased on a task with a higher degree of sparsity than the 50/50 positive/negative split of the standard WordNet data, which we explore in the next section.

5.2 IMBALANCED WORDNET

In order to confirm our intuition that the smoothed box model performs better in the sparse regime, we perform further experiments using different numbers of positive and negative examples from the WordNet mammal subset, comparing the box lattice, our smoothed approach, and order embeddings (OE) as a baseline. The training data is the transitive reduction of this subset of the mammal WordNet, while the dev/test is the transitive closure of the training data. The training data contains 1,176 positive examples, and the dev and test sets contain 209 positive examples. Negative examples are generated randomly using the ratio stated in the table.

As we can see from the table, with balanced data, all models (the OE baseline, Box, and Smoothed Box) nearly match the full transitive closure. As the number of negative examples increases, the performance drops for the original box model, but Smoothed Box still outperforms OE and Box in all settings. This superior performance on imbalanced data is important for e.g. real-world entailment graph learning, where the number of negatives greatly outweighs the positives.

Positive:Negative Box OE Smoothed Box
1:1 0.9905 0.9976 1.0
1:2 0.8982 0.9139 1.0
1:6 0.6680 0.6640 0.9561
1:10 0.5495 0.5897 0.8800
Table 5: F1 scores of the box lattice, order embeddings, and our smoothed model, for different levels of label imbalance on the WordNet mammal subset.

5.3 FLICKR
For the Flickr caption data, ground-truth conditional probabilities are obtained from co-occurrence counts over the caption corpus, P(x|y) = P(x, y) / P(y), and the model's predicted P(x|y) is compared against these gold probabilities (KL divergence and Pearson correlation).
Method KL Pearson R
Box 0.050 0.900
Smoothed Box 0.036 0.917
Table 6: KL and Pearson correlation between model and gold probability.
We randomly pick 100K conditional probabilities for training data and 10k probabilities for dev and test data.
We compare with several baselines: low-rank matrix factorization, complex bilinear factorization (Trouillon et al., 2016), and two hierarchical embedding methods, POE (Lai & Hockenmaier, 2017) and the Box Lattice (Vilnis et al., 2018). Since the training matrix is asymmetric, we used separate embeddings for target and conditioned movies. For the complex bilinear model, we added one additional vector of parameters to capture the "imply" relation. We evaluate on the test set using KL divergence, Pearson correlation, and Spearman correlation with the ground truth probabilities. Experimental details are given in Appendix D.4.

From the results in Table 7, we can see that our smoothed box embedding method outperforms the original box lattice as well as all other baselines, especially in Spearman correlation, the most relevant metric for recommendation, a ranking task. We perform an additional study on the robustness of the smoothed model to initialization conditions in Appendix C.
KL Pearson R Spearman R
Matrix Factorization 0.0173 0.8549 0.8374
Complex Bilinear Factorization 0.0141 0.8771 0.8636
POE 0.0170 0.8548 0.8511
Box 0.0147 0.8775 0.8768
Smoothed Box 0.0138 0.8985 0.8977
Table 7: Performance of the smoothed model, the original box model, and several baselines on
MovieLens.
6 CONCLUSION AND FUTURE WORK
We presented an approach to smoothing the energy and optimization landscape of probabilistic box
embeddings and provided a theoretical justification for the smoothing. Due to a decreased number
L I N
: 4 4
: : : -
• ead fi
• e gi t W 3 : -+ s D[
• v yDo C[
• 3 : Pw Av m ND
• ]ch a
• d L[ S v W A 3 : V
l u m [n : : :
On the WMT English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.

1 INTRODUCTION

There has been much recent progress in sequence modeling through recurrent neural networks (RNN; Sutskever et al. 2014; Bahdanau et al. 2015; Wu et al. 2016), convolutional networks (CNN; Kalchbrenner et al. 2016; Gehring et al. 2016; 2017; Kaiser et al. 2017) and self-attention models (Paulus et al., 2017; Vaswani et al., 2017). RNNs integrate context information by updating a hidden state at every time-step, CNNs summarize a fixed size context through multiple layers, whereas self-attention directly summarizes all context.

Attention assigns context elements attention weights which define a weighted sum over context representations (Bahdanau et al., 2015; Sukhbaatar et al., 2015; Chorowski et al., 2015; Luong et al., 2015). Source-target attention summarizes information from another sequence such as in machine translation, whereas self-attention operates over the current sequence. Self-attention has been formulated as content-based, where attention weights are computed by comparing the current time-step to all elements in the context (Figure 1a). The ability to compute comparisons over such unrestricted context sizes is seen as a key characteristic of self-attention (Vaswani et al., 2017).

(a) Self-attention (b) Dynamic convolution
Figure 1: Self-attention computes attention weights by comparing all pairs of elements to each other (a), whereas dynamic convolutions predict separate kernels for each time-step (b).
Self-attention computes an n × n attention weight matrix per head by comparing every pair of time steps, so its cost grows quadratically with the sequence length n. Lightweight and dynamic convolutions instead use kernels of a fixed width k shared across groups of channels, and dynamic convolutions predict a separate kernel of size H × k at every time step, so the cost grows linearly in n.
Shen et al. (2018a) reduce complexity by performing attention within blocks of the input sequence, and Shen et al. (2017; 2018b) perform more fine-grained attention over each feature.
Our experiments show that lightweight convolutions perform competitively to strong self-attention
results and that dynamic convolutions can perform even better. On WMT English-German transla-
tion dynamic convolutions achieve a new state of the art of 29.7 BLEU, on WMT English-French
they match the best reported result in the literature, and on IWSLT German-English dynamic convo-
lutions outperform self-attention by 0.8 BLEU. Dynamic convolutions achieve 20% faster runtime
than a highly-optimized self-attention baseline. For language modeling on the Billion word bench-
mark dynamic convolutions perform as well as or better than self-attention and on CNN-DailyMail
abstractive document summarization we outperform a strong self-attention model.
2 BACKGROUND
We first outline sequence to sequence learning and self-attention. Our work builds on non-separable
convolutions as well as depthwise separable convolutions.
Sequence to sequence learning maps a source sequence to a target sequence via two separate
networks such as in machine translation (Sutskever et al., 2014). The encoder network computes
representations for the source sequence such as an English sentence and the decoder network au-
toregressively generates a target sequence based on the encoder output.
The self-attention module of Vaswani et al. (2017) applies three projections to the input $X \in \mathbb{R}^{n \times d}$
to obtain key (K), query (Q), and value (V) representations, where n is the number of time
steps and d the input/output dimension (Figure 2a). It also defines a number of heads H where each head
can learn separate attention weights over $d_k$ features and attend to different positions. The module
computes dot-products between key/query pairs, scales to stabilize training, and then softmax nor-
malizes the result. Finally, it computes a weighted sum using the output of the value projection (V):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
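For concreteness, a minimal numpy sketch of the scaled dot-product attention above, assuming a single head and no output projection; `self_attention`, `W_q`, `W_k`, and `W_v` are illustrative names, not the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d) inputs; W_q, W_k, W_v: (d, d_k) projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # (n, d_k) each
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) pairwise comparisons, scaled
    return softmax(scores, axis=-1) @ V      # weighted sum over the value vectors
```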
Depthwise convolutions perform a convolution independently over every channel. The number
of parameters can be reduced from $d^2 k$ to $dk$, where k is the kernel width. The output $O \in \mathbb{R}^{n \times d}$ of
a depthwise convolution with weight $W \in \mathbb{R}^{d \times k}$ for element i and output dimension c is defined as:

$$O_{i,c} = \mathrm{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\,c}$$
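A direct, unoptimized numpy sketch of this definition follows; it assumes zero padding outside the sequence, which the excerpt does not specify.

```python
import math
import numpy as np

def depthwise_conv(X, W):
    """X: (n, d) sequence; W: (d, k) one length-k kernel per channel."""
    n, d = X.shape
    k = W.shape[1]
    offset = math.ceil((k + 1) / 2)          # centering term from the formula above
    O = np.zeros_like(X, dtype=float)
    for i in range(n):
        for j in range(1, k + 1):
            t = i + j - offset               # input position read by kernel tap j
            if 0 <= t < n:                   # zero padding outside the sequence
                O[i] += W[:, j - 1] * X[t]   # per-channel multiply, no channel mixing
    return O
```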
3 LIGHTWEIGHT CONVOLUTIONS
In this section, we introduce LightConv, a depthwise convolution which shares certain output chan-
nels and whose weights are normalized across the temporal dimension using a softmax. Compared to
Figure 2: Illustration of self-attention, lightweight convolutions and dynamic convolutions. (a) Self-attention: the input is projected to Q, K and V, attention weights are computed by a scaled MatMul followed by a softmax, and the output is a weighted sum of V followed by a linear projection. (b) Lightweight convolution: input projection, GLU, LConv, output projection. (c) Dynamic convolution: as (b), with an additional linear layer predicting the dynamic weights fed to LConv.
self-attention, LightConv has a fixed context window and it determines the importance of context el-
ements with a set of weights that do not change over time steps. We will show that models equipped
with lightweight convolutions show better generalization compared to regular convolutions and that
they can be competitive to state-of-the-art self-attention models (§6). This is surprising because the
common belief is that content-based self-attention mechanisms are crucial to obtaining state-of-the-
art results in natural language processing applications. Furthermore, the low computational profile
of LightConv enables us to formulate efficient dynamic convolutions (§4).
LightConv computes the following for the i-th element in the sequence and output channel c:

$$\mathrm{LightConv}(X, W_{\lceil cH/d \rceil,:}, i, c) = \mathrm{DepthwiseConv}(X, \mathrm{softmax}(W_{\lceil cH/d \rceil,:}), i, c)$$

Weight sharing. We tie the parameters of every block of d/H subsequent channels, which reduces
the number of parameters by a factor of H. As illustration, a regular convolution requires 7,340,032
($d^2 \times k$) weights for d = 1024 and k = 7, a depthwise separable convolution has 7,168 weights
($d \times k$), and with weight sharing, H = 16, we have only 112 ($H \times k$) weights. We will see that
this vast reduction in the number of parameters is crucial to make dynamic convolutions possible on
current hardware.
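A minimal sketch of LightConv under the same assumptions, reusing the `softmax` and `depthwise_conv` helpers from the sketches above; the head assignment uses a 0-based equivalent of the ⌈cH/d⌉ indexing in the formula.

```python
import numpy as np

def light_conv(X, W_shared, H):
    """X: (n, d); W_shared: (H, k) shared kernels, i.e. only H*k parameters."""
    n, d = X.shape
    W_norm = softmax(W_shared, axis=-1)        # normalize each kernel over its k taps
    # Expand the H shared kernels to all d channels: consecutive d/H channels share a kernel.
    channel_to_head = (np.arange(d) * H) // d  # 0-based counterpart of ceil(c*H/d)
    W_full = W_norm[channel_to_head]           # (d, k)
    return depthwise_conv(X, W_full)
```

With d = 1024, k = 7 and H = 16, `W_shared` holds only H × k = 112 parameters, matching the count quoted above.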
4 DYNAMIC CONVOLUTIONS

A dynamic convolution has kernels that vary over time as a learned function of the individual time-steps. A dynamic version of standard convolutions would be impractical for current GPUs due to their large memory requirements. We address this problem by building on LightConv, which drastically reduces the number of parameters (§3).

DynamicConv takes the same form as LightConv but uses a time-step dependent kernel that is computed using a function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{H \times k}$:

$$\mathrm{DynamicConv}(X, i, c) = \mathrm{LightConv}(X, f(X_i)_{h,:}, i, c)$$

We model f with a simple linear module with learned weights $W^{Q} \in \mathbb{R}^{H \times k \times d}$, i.e., $f(X_i) = \sum_{c=1}^{d} W^{Q}_{h,j,c} X_{i,c}$.

Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context; they are a function of the current time-step only, whereas self-attention requires a quadratic number of operations in the sequence length.
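A minimal sketch of DynamicConv under the same assumptions as the earlier sketches (zero padding, 0-based head assignment), reusing the `softmax` helper above; a real implementation batches this computation instead of looping over positions.

```python
import math
import numpy as np

def dynamic_conv(X, W_Q, H):
    """X: (n, d); W_Q: (H, k, d) weights of the kernel-predicting linear module."""
    n, d = X.shape
    k = W_Q.shape[1]
    offset = math.ceil((k + 1) / 2)
    channel_to_head = (np.arange(d) * H) // d
    O = np.zeros_like(X, dtype=float)
    for i in range(n):
        kernels = softmax(W_Q @ X[i], axis=-1)  # f(X_i): (H, k), one kernel per head
        W_full = kernels[channel_to_head]       # broadcast heads to channels: (d, k)
        for j in range(1, k + 1):
            t = i + j - offset
            if 0 <= t < n:                      # zero padding outside the sequence
                O[i] += W_full[:, j - 1] * X[t]
    return O
```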
+ C 4 4 4 C :
A 9
• - 4D LME
(
)
)
(
Under review as a conference paper at ICLR 2019
Model Param BLEU Sent/sec
Vaswani et al. (2017) 213M 26.4 -
Self-attention baseline (k=inf, H=16) 210M 26.9 ± 0.1 52.1 ± 0.1
Self-attention baseline (k=3,7,15,31x3, H=16) 210M 26.9 ± 0.3 54.9 ± 0.2
CNN (k=3) 208M 25.9 ± 0.2 68.1 ± 0.3
CNN Depthwise (k=3, H=1024) 195M 26.1 ± 0.2 67.1 ± 1.0
+ Increasing kernel (k=3,7,15,31x4, H=1024) 195M 26.4 ± 0.2 63.3 ± 0.1
+ DropConnect (H=1024) 195M 26.5 ± 0.2 63.3 ± 0.1
+ Weight sharing (H=16) 195M 26.5 ± 0.1 63.7 ± 0.4
+ Softmax-normalized weights [LightConv] (H=16) 195M 26.6 ± 0.2 63.6 ± 0.1
+ Dynamic weights [DynamicConv] (H=16) 200M 26.9 ± 0.2 62.6 ± 0.4
Note: DynamicConv(H=16) w/o softmax-normalization 200M diverges
AAN decoder + self-attn encoder 260M 26.8 ± 0.1 59.5 ± 0.1
AAN decoder + AAN encoder 310M 22.5 ± 0.1 59.2 ± 2.1
Table 3: Ablation on WMT English-German newstest2013. (+) indicates that a result includes all
preceding features. Speed results based on beam size 4, batch size 256 on an NVIDIA P100 GPU.
6.2 MODEL ABLATION
In this section we evaluate the impact of the various choices we made for LightConv (§3) and Dy-
namicConv (§4). We first show that limiting the maximum context size of self-attention has no
impact on validation accuracy (Table 3). Note that our baseline is stronger than the original result
of Vaswani et al. (2017). Next, we replace self-attention blocks with non-separable convolutions
(CNN) with kernel size 3 and input/output dimension d = 1024. The CNN block has no input and
output projections compared to the baseline and we add one more encoder layer to assimilate the
parameter count. This CNN with a narrow kernel trails self-attention by 1 BLEU.
We improve this result by switching to a depthwise separable convolution (CNN Depthwise) with in-
put and output projections of size d = 1024. When we progressively increase the kernel width from
4 262402 .522 72482402 720
/
49 4
49 4
9 : 4
C : A 9
4 A 9
4 4 A 9
:
• -1 , -
• 0 : F I:IA MON
• C C + CA D ) E :C E : C EC E C D EA CC E + C + E AC D
DE B B C
• AAE : E ( A EC A CA DE AI :D
• DD EE E A E : E : E A GA E A D
• 5@ LC A FI - : , I F IA :
I + C :I : F A A : 9 2 : A
+ C - :5 5 : -5 A
A - 5 A + C -C :A: 1
• yr m
• l pvuI ] Sd [ F ]
Sd
• l pvu ns Iih u ] T , Sd[F
b N d M F ] o a Q c
• gv r
• l pvu ns Sdv L
• VeWe Sd t L
Models  Visual Features  Semantics  Extra Labels (# Prog. / Attr.)  Inference
FiLM (Perez et al., 2018) Convolutional Implicit 0 No Feature Manipulation
IEP (Johnson et al., 2017b) Convolutional Explicit 700K No Feature Manipulation
MAC (Hudson & Manning, 2018) Attentional Implicit 0 No Feature Manipulation
Stack-NMN (Hu et al., 2018) Attentional Implicit 0 No Attention Manipulation
TbD (Mascharka et al., 2018) Attentional Explicit 700K No Attention Manipulation
NS-VQA (Yi et al., 2018) Object-Based Explicit 0.2K Yes Symbolic Execution
NS-CL Object-Based Explicit 0 No Symbolic Execution
Table 1: Comparison with other frameworks on the CLEVR VQA dataset, w.r.t. visual features,
implicit or explicit semantics and supervisions.
[Figure: overview of NS-CL training. Given the question "What is the shape of the red object left of the sphere?", the semantic parser proposes candidate interpretations such as Query(Shape, Filter(Red, Relate(Left, Filter(Sphere)))) (correct) and Query(Shape, Filter(Sphere, Relate(Left, Filter(Red)))) (incorrect). Symbolic reasoning executes a candidate program on the object-based visual representation (Obj 1–4) and compares the predicted answer (Cylinder) with the ground truth (Box); the concept embeddings are trained by back-propagation and the semantic parser by REINFORCE.]
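To make the program representation concrete, here is an illustrative sketch, not the authors' implementation, of executing such a candidate program on an object-based scene. It uses discrete attributes and a hard-coded `Left` relation for clarity, whereas NS-CL executes programs softly against learned concept embeddings.

```python
# Hypothetical scene with explicit attributes; object ids and values are made up.
scene = [
    {"id": 1, "shape": "Sphere",   "color": "Blue",  "x": 4.0},
    {"id": 2, "shape": "Cylinder", "color": "Red",   "x": 1.0},
    {"id": 3, "shape": "Box",      "color": "Green", "x": 2.5},
]

def Filter(objs, concept):
    # Keep objects whose shape or color matches the concept.
    return [o for o in objs if concept in (o["shape"], o["color"])]

def Relate(objs, relation, anchor_objs):
    # Only the "Left" relation is sketched: smaller x coordinate than every anchor.
    assert relation == "Left"
    return [o for o in objs if all(o["x"] < a["x"] for a in anchor_objs)]

def Query(objs, attribute):
    assert len(objs) == 1, "Query expects a unique object"
    return objs[0][attribute]

# Query(Shape, Filter(Red, Relate(Left, Filter(Sphere))))
answer = Query(Filter(Relate(scene, "Left", Filter(scene, "Sphere")), "Red"), "shape")
print(answer)  # -> "Cylinder", which is then compared with the ground-truth answer
```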
E C C Rg S]d ol
,C A A B 2:
Wa c L L
hru M L L
mp[ v N i
v N Ve ] I N ]d b
Ve ] ]d b N i
v N i
Obj:2
1 : -:FC : E +: C :C E:C C:E : :D
C D : E: :D C - EFC F :C D , 5
• n 2 DF : D s P
Figure 4: A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers; questions of different complexities are illustrated to the learner in an incremental manner. The learner is initialized with the DSL and executor, then trained on Lesson 1 (object-based questions, e.g. "What is the shape of the red object?"), Lesson 2 (relational questions, e.g. "How many cubes are behind the sphere?"), and Lesson 3 (more complex questions), before being deployed on complex scenes and complex questions. B. Illustrative execution of NS-CL: Step 1 performs visual parsing into objects (Obj 1–4); Steps 2–3 perform semantic parsing and program execution, e.g. Filter(Green Cube) → Relate(Left) → Filter(Red) and Filter(Purple Matte), followed by AEQuery(Shape) over the two results, producing the output No (0.98).
y v[ y
. 1 B :A B A ?
8 B B
t i
--
t Ve ]
C :
+ C
0?
--
t ]d b
--
Filter(Red)
↓
Query(Shape)
y v[ y
. 1 B :A B A ?
8 B B
t i
--
y v[ y
. 1 B :A B A ?
8 B B
t i
--
t T c
--
t ]d b
--
Filter(Red)
↓
Query(Shape)
Obj:1 Green
Red
Red
[Slides: step-by-step illustration of NS-CL's joint learning of concepts and semantic parsing. Visual parsing with Mask R-CNN and a ResNet produces per-object features (Obj:1, Obj:2); an encoder-decoder semantic parser maps the question "What is the shape of the red object?" to the program Filter(Red) → Query(Shape). The program is executed softly against concept embedding spaces (color: Red; shape: Cylinder, Sphere, Box) by comparing object features with concept embeddings, so Filter(Red) selects the red object and Query(Shape) reads out its most similar shape concept. The resulting answer score trains the concept embeddings by back-propagation and the program generation by REINFORCE.]
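As a rough sketch of how a single Filter step can be made differentiable; the cosine similarity and sigmoid squashing here are assumed illustrative details, not the paper's exact formulation.

```python
import numpy as np

def concept_probs(obj_features, concept_vec, temperature=0.1):
    """obj_features: (num_objs, dim); concept_vec: (dim,). Returns P(concept | obj)."""
    f = obj_features / np.linalg.norm(obj_features, axis=1, keepdims=True)
    c = concept_vec / np.linalg.norm(concept_vec)
    sim = f @ c                                      # cosine similarity per object
    return 1.0 / (1.0 + np.exp(-sim / temperature))  # squash to (0, 1)

# A soft Filter(Red) multiplies the current object mask element-wise by
# concept_probs(features, red_embedding), so gradients reach the concept embeddings.
```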
1 . FC 6 , C C C C
2 C C . FC F C 3-
• e h + , C : C
• [S ]MP b W
• FCC F F , C T c I MN idTgf aLPJ
7 4 LI 6NE DC + F K 2 IF I FK I I KCF 6 F J
I J F 6 FK F J I E 4 KLI D 6L IMCJC F :
• +2- 5 , K J K :0 FJ F %
• uc a e t am ln
• 7I CF 1 DC 1 7 JK 1
• v m 1Rg S y s 1Rg S ya g
• am b oph ri [Td m SV] W
217 4./ 0 5: /3
8. . M P /31 9
6NA EE
217 4./ 0 5: /3
0 L
RQS-
CLLI- E DE DL A M L A M A % 2/39 5:/3 I LA I
CLLI- E DE DL A M L A M A % 2/39 5:/3 IILO
References

Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
• Shen, Yikang et al. "Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks." In Proc. of ICLR 2019.
• Merity, Stephen et al. "Regularizing and Optimizing LSTM Language Models." In Proc. of ICLR 2018.
• Yang, Zhilin et al. "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model." In Proc. of ICLR 2018.

Smoothing the Geometry of Probabilistic Box Embeddings
• Li, Xiang et al. "Smoothing the Geometry of Probabilistic Box Embeddings." In Proc. of ICLR 2019.
• Mikolov, Tomas et al. "Distributed Representations of Words and Phrases and their Compositionality." In Proc. of NIPS 2013.
• Vilnis, Luke et al. "Word Representations via Gaussian Embedding." In Proc. of ICLR 2015.
• Vendrov, Ivan et al. "Order-Embeddings of Images and Language." In Proc. of ICLR 2016.
• Lai, Alice et al. "Learning to Predict Denotational Probabilities for Modeling Entailment." In Proc. of EACL 2017.
• Vilnis, Luke et al. "Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures." In Proc. of ACL 2018.

Pay Less Attention with Lightweight and Dynamic Convolutions
• Wu, Felix et al. "Pay Less Attention with Lightweight and Dynamic Convolutions." In Proc. of ICLR 2019.
• Vaswani, Ashish et al. "Attention Is All You Need." In Proc. of NIPS 2017.

The Neuro-Symbolic Concept Learner
• Mao, Jiayuan et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision." In Proc. of ICLR 2019.
• Hudson, Drew et al. "Compositional Attention Networks for Machine Reasoning." In Proc. of ICLR 2018.
• Mascharka, David et al. "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning." In Proc. of CVPR 2018.
• Yi, Kexin et al. "Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding." In Proc. of NeurIPS 2018.
• Johnson, Justin et al. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." In Proc. of CVPR 2017.

Mais conteúdo relacionado

Mais procurados

19 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp0219 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp02Muhammad Aslam
 
Enjoyable Front-end Development with Reagent
Enjoyable Front-end Development with ReagentEnjoyable Front-end Development with Reagent
Enjoyable Front-end Development with ReagentThiago Fernandes Massa
 
Markov Model for TMR System with Repair
Markov Model for TMR System with RepairMarkov Model for TMR System with Repair
Markov Model for TMR System with RepairSujith Jay Nair
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...Alex Pruden
 
Java practice programs for beginners
Java practice programs for beginnersJava practice programs for beginners
Java practice programs for beginnersishan0019
 
Effective Modern C++ - Item 35 & 36
Effective Modern C++ - Item 35 & 36Effective Modern C++ - Item 35 & 36
Effective Modern C++ - Item 35 & 36Chih-Hsuan Kuo
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenarioNaresh Bala
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsAlex Pruden
 
Computer architecture, a quantitative approach (solution for 5th edition)
Computer architecture, a quantitative approach (solution for 5th edition)Computer architecture, a quantitative approach (solution for 5th edition)
Computer architecture, a quantitative approach (solution for 5th edition)Zohaib Ali
 
Result and screen shots
Result and screen shotsResult and screen shots
Result and screen shotsMukesh M
 
Discrete control2 converted
Discrete control2 convertedDiscrete control2 converted
Discrete control2 convertedcairo university
 
Advanced Computer Architecture chapter 5 problem solutions
Advanced Computer  Architecture  chapter 5 problem solutionsAdvanced Computer  Architecture  chapter 5 problem solutions
Advanced Computer Architecture chapter 5 problem solutionsJoe Christensen
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)Jyh-Miin Lin
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationPranamesh Chakraborty
 
Complex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsComplex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsPeter Solymos
 
Class 18: Measuring Cost
Class 18: Measuring CostClass 18: Measuring Cost
Class 18: Measuring CostDavid Evans
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 

Mais procurados (20)

19 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp0219 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp02
 
Enjoyable Front-end Development with Reagent
Enjoyable Front-end Development with ReagentEnjoyable Front-end Development with Reagent
Enjoyable Front-end Development with Reagent
 
Markov Model for TMR System with Repair
Markov Model for TMR System with RepairMarkov Model for TMR System with Repair
Markov Model for TMR System with Repair
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
 
Java practice programs for beginners
Java practice programs for beginnersJava practice programs for beginners
Java practice programs for beginners
 
Effective Modern C++ - Item 35 & 36
Effective Modern C++ - Item 35 & 36Effective Modern C++ - Item 35 & 36
Effective Modern C++ - Item 35 & 36
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenario
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their Applications
 
1 untitled 1
1 untitled 11 untitled 1
1 untitled 1
 
Computer architecture, a quantitative approach (solution for 5th edition)
Computer architecture, a quantitative approach (solution for 5th edition)Computer architecture, a quantitative approach (solution for 5th edition)
Computer architecture, a quantitative approach (solution for 5th edition)
 
Result and screen shots
Result and screen shotsResult and screen shots
Result and screen shots
 
Discrete control2 converted
Discrete control2 convertedDiscrete control2 converted
Discrete control2 converted
 
Advanced Computer Architecture chapter 5 problem solutions
Advanced Computer  Architecture  chapter 5 problem solutionsAdvanced Computer  Architecture  chapter 5 problem solutions
Advanced Computer Architecture chapter 5 problem solutions
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimization
 
Complex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutionsComplex models in ecology: challenges and solutions
Complex models in ecology: challenges and solutions
 
Fourier transform
Fourier transformFourier transform
Fourier transform
 
Class 18: Measuring Cost
Class 18: Measuring CostClass 18: Measuring Cost
Class 18: Measuring Cost
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Digitalcontrolsystems
DigitalcontrolsystemsDigitalcontrolsystems
Digitalcontrolsystems
 

Semelhante a NLP@ICLR2019

Standard cells library design
Standard cells library designStandard cells library design
Standard cells library designBharat Biyani
 
EENG519FinalProjectReport
EENG519FinalProjectReportEENG519FinalProjectReport
EENG519FinalProjectReportDaniel K
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural ProcessesDeep Learning JP
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processesKazuki Fujikawa
 
T-S Fuzzy Observer and Controller of Doubly-Fed Induction Generator
T-S Fuzzy Observer and Controller of Doubly-Fed Induction GeneratorT-S Fuzzy Observer and Controller of Doubly-Fed Induction Generator
T-S Fuzzy Observer and Controller of Doubly-Fed Induction GeneratorIJPEDS-IAES
 
Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationAlexander Litvinenko
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS cscpconf
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONSFEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONScsitconf
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
Vlsi ii project presentation
Vlsi ii project presentationVlsi ii project presentation
Vlsi ii project presentationRedwan Islam
 
VLSI Design Final Project - 32 bit ALU
VLSI Design Final Project - 32 bit ALUVLSI Design Final Project - 32 bit ALU
VLSI Design Final Project - 32 bit ALUSachin Kumar Asokan
 
Mining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systemsMining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systemsDr.MAYA NAYAK
 
Raymond.Brunkow-Project-EEL-3657-Sp15
Raymond.Brunkow-Project-EEL-3657-Sp15Raymond.Brunkow-Project-EEL-3657-Sp15
Raymond.Brunkow-Project-EEL-3657-Sp15Raymond Brunkow
 
Load-time Hacking using LD_PRELOAD
Load-time Hacking using LD_PRELOADLoad-time Hacking using LD_PRELOAD
Load-time Hacking using LD_PRELOADDharmalingam Ganesan
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfRTEFGDFGJU
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithmspppepito86
 

Semelhante a NLP@ICLR2019 (20)

Standard cells library design
Standard cells library designStandard cells library design
Standard cells library design
 
EENG519FinalProjectReport
EENG519FinalProjectReportEENG519FinalProjectReport
EENG519FinalProjectReport
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
13486500-FFT.ppt
13486500-FFT.ppt13486500-FFT.ppt
13486500-FFT.ppt
 
T-S Fuzzy Observer and Controller of Doubly-Fed Induction Generator
T-S Fuzzy Observer and Controller of Doubly-Fed Induction GeneratorT-S Fuzzy Observer and Controller of Doubly-Fed Induction Generator
T-S Fuzzy Observer and Controller of Doubly-Fed Induction Generator
 
Response Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty QuantificationResponse Surface in Tensor Train format for Uncertainty Quantification
Response Surface in Tensor Train format for Uncertainty Quantification
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONSFEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
 
Ece4510 notes10
Ece4510 notes10Ece4510 notes10
Ece4510 notes10
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
[IJCT-V3I2P19] Authors: Jieyin Mai, Xiaojun Li
[IJCT-V3I2P19] Authors: Jieyin Mai, Xiaojun Li[IJCT-V3I2P19] Authors: Jieyin Mai, Xiaojun Li
[IJCT-V3I2P19] Authors: Jieyin Mai, Xiaojun Li
 
Vlsi ii project presentation
Vlsi ii project presentationVlsi ii project presentation
Vlsi ii project presentation
 
VLSI Design Final Project - 32 bit ALU
VLSI Design Final Project - 32 bit ALUVLSI Design Final Project - 32 bit ALU
VLSI Design Final Project - 32 bit ALU
 
Mining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systemsMining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systems
 
redes neuronais
redes neuronaisredes neuronais
redes neuronais
 
Raymond.Brunkow-Project-EEL-3657-Sp15
Raymond.Brunkow-Project-EEL-3657-Sp15Raymond.Brunkow-Project-EEL-3657-Sp15
Raymond.Brunkow-Project-EEL-3657-Sp15
 
Load-time Hacking using LD_PRELOAD
Load-time Hacking using LD_PRELOADLoad-time Hacking using LD_PRELOAD
Load-time Hacking using LD_PRELOAD
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
 

Mais de Kazuki Fujikawa

Stanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionStanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionKazuki Fujikawa
 
BMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionBMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionKazuki Fujikawa
 
Kaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKazuki Fujikawa
 
Kaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKazuki Fujikawa
 
Ordered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksOrdered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksKazuki Fujikawa
 
A closer look at few shot classification
A closer look at few shot classificationA closer look at few shot classification
A closer look at few shot classificationKazuki Fujikawa
 
Graph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationGraph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationKazuki Fujikawa
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionKazuki Fujikawa
 
Matrix capsules with em routing
Matrix capsules with em routingMatrix capsules with em routing
Matrix capsules with em routingKazuki Fujikawa
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkKazuki Fujikawa
 
SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...Kazuki Fujikawa
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learningKazuki Fujikawa
 
DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用Kazuki Fujikawa
 

Mais de Kazuki Fujikawa (14)

Stanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solutionStanford Covid Vaccine 2nd place solution
Stanford Covid Vaccine 2nd place solution
 
BMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solutionBMS Molecular Translation 3rd place solution
BMS Molecular Translation 3rd place solution
 
ACL2020 best papers
ACL2020 best papersACL2020 best papers
ACL2020 best papers
 
Kaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular PropertiesKaggle参加報告: Champs Predicting Molecular Properties
Kaggle参加報告: Champs Predicting Molecular Properties
 
Kaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions ClassificationKaggle参加報告: Quora Insincere Questions Classification
Kaggle参加報告: Quora Insincere Questions Classification
 
Ordered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networksOrdered neurons integrating tree structures into recurrent neural networks
Ordered neurons integrating tree structures into recurrent neural networks
 
A closer look at few shot classification
A closer look at few shot classificationA closer look at few shot classification
A closer look at few shot classification
 
Graph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generationGraph convolutional policy network for goal directed molecular graph generation
Graph convolutional policy network for goal directed molecular graph generation
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph Convolution
 
Matrix capsules with em routing
Matrix capsules with em routingMatrix capsules with em routing
Matrix capsules with em routing
 
Predicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman networkPredicting organic reaction outcomes with weisfeiler lehman network
Predicting organic reaction outcomes with weisfeiler lehman network
 
SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...SchNet: A continuous-filter convolutional neural network for modeling quantum...
SchNet: A continuous-filter convolutional neural network for modeling quantum...
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learning
 
DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用DeNAにおける機械学習・深層学習活用
DeNAにおける機械学習・深層学習活用
 

Último

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Último (20)

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...

NLP@ICLR2019

• 6. Papers covered in this presentation (NLP-related papers from ICLR 2019)
  • Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks (best paper)
  • Smoothing the Geometry of Probabilistic Box Embeddings
  • Pay Less Attention with Lightweight and Dynamic Convolutions
  • The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
• 8. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks (ICLR 2019 best paper)
  • Proposes the ON-LSTM ("ordered neurons LSTM"), which adds a tree-structured inductive bias to the LSTM by enforcing an order in which neurons are updated.
• 10. ON-LSTM: the same architecture as the standard LSTM, except for the cell-state update
  • Gates and candidate cell (Eq. 1-5):
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)   (1)
    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)   (2)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)   (3)
    \hat{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)   (4)
    h_t = o_t \circ \tanh(c_t)   (5)
  • The forget gate f_t and input gate i_t control erasing and writing of the cell state c_t, as in the LSTM; only the update function for c_t is replaced.
  • Because the standard LSTM gates act independently on each neuron, it is hard to discern a hierarchy of information between neurons, so ON-LSTM makes the gate for each neuron depend on the others by enforcing the order in which neurons are updated.
  • High-ranking neurons store long-term, global information (nodes near the root of the parse tree); low-ranking neurons store short-term, local information that lasts only one or a few time steps (smaller constituents, Figure 2(b)). To erase (or update) high-ranking neurons, the model must first erase (or update) all lower-ranking ones, so some neurons always update more (or less) frequently than others, and that order is fixed as part of the model architecture.
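As a concrete reference for Eq. (1)-(5), here is a minimal NumPy sketch of a standard LSTM step, including the usual cell update c_t = f_t ∘ c_{t-1} + i_t ∘ ĉ_t that the new rule replaces (the parameter layout with dicts W, U, b is an assumption for illustration, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One standard LSTM step, Eq. (1)-(5) plus the conventional cell update."""
    pre = {k: W[k] @ x + U[k] @ h_prev + b[k] for k in ('f', 'i', 'o', 'c')}
    f, i, o = sigmoid(pre['f']), sigmoid(pre['i']), sigmoid(pre['o'])  # Eq. (1)-(3)
    c_hat = np.tanh(pre['c'])                                          # Eq. (4)
    c = f * c_prev + i * c_hat      # the part that ON-LSTM replaces
    h = o * np.tanh(c)              # Eq. (5)
    return h, c
```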
• 11. New update rule: master forget gate and master input gate
  • Based on the cumax() activation, ON-LSTM introduces a master forget gate \tilde{f}_t and a master input gate \tilde{i}_t:
    \tilde{f}_t = \mathrm{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}})   (9)
    \tilde{i}_t = 1 - \mathrm{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}})   (10)
  • The master gates serve as high-level control for the update of the cell state:
    \omega_t = \tilde{f}_t \circ \tilde{i}_t   (11)
    \hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t) = \tilde{f}_t \circ (f_t \circ \tilde{i}_t + 1 - \tilde{i}_t)   (12)
    \hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t) = \tilde{i}_t \circ (i_t \circ \tilde{f}_t + 1 - \tilde{f}_t)   (13)
    c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t   (14)
• 12. The cumax() activation
  • The ideal gate is a binary vector g = (0, ..., 0, 1, ..., 1); if d is the index where the ones start, the categories are mutually exclusive, so p(g_k = 1) = p(d \le k) = \sum_{i \le k} p(d = i)   (8)
  • Computing gradients through the discrete d is not trivial (Schulman et al., 2015), so a continuous relaxation is used: take the cumulative sum of a softmax,
    \hat{p}_k = \exp(z_k) / \sum_{k'} \exp(z_{k'}),   g_k = \sum_{i \le k} \hat{p}_i,   i.e.   \mathrm{cumax}(z) = \mathrm{cumsum}(\mathrm{softmax}(z)).
    Since g_k is binary, this equals E[g_k], hence \hat{g} = E[g].
• 13. Properties of the master gates
  • Following the properties of cumax(), the values in the master forget gate \tilde{f}_t increase monotonically from 0 to 1 along the neuron dimension, and those in the master input gate \tilde{i}_t decrease monotonically from 1 to 0.
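A minimal sketch of the cumax() relaxation described above (plain NumPy; the function name and axis handling are illustrative):

```python
import numpy as np

def cumax(logits, axis=-1):
    """cumax(z) = cumsum(softmax(z)).
    Returns a vector that increases monotonically toward 1, i.e. the
    continuous relaxation E[g] of the binary gate g = (0, ..., 0, 1, ..., 1)."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    p = np.exp(shifted)
    p /= p.sum(axis=axis, keepdims=True)                     # p(d = k)
    return np.cumsum(p, axis=axis)                           # p(d <= k) = E[g_k]
```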
• 14. Intuition: binary master gates
  • To explain the update rule, assume the master gates are binary. The master forget gate \tilde{f}_t controls the erasing behavior of the model: suppose \tilde{f}_t = (0, ..., 0, 1, ..., 1) with split point d^f_t. By Eq. (12) and (14), the information stored in the first d^f_t neurons of the previous cell state c_{t-1} is completely erased.
• 15. Relation to the parse tree
  • In a parse tree (Figure 2(a)), this operation is akin to closing previous constituents: a large number of zeroed neurons, i.e. a large d^f_t, represents the end of a high-level constituent, since most of the information in the state is discarded; a small d^f_t erases only a few low-ranking neurons.
• 16. Combined forget gate
  • \omega_t = \tilde{f}_t \circ \tilde{i}_t is the overlap of the two master gates. With binary master gates, inside the overlap the effective forget gate \hat{f}_t = \tilde{f}_t \circ (f_t \circ \tilde{i}_t + 1 - \tilde{i}_t) reduces to the standard LSTM forget gate f_t; outside it, the master forget gate alone decides whether a neuron is copied or erased.
• 17. Combined input gate
  • Symmetrically, \hat{i}_t = \tilde{i}_t \circ (i_t \circ \tilde{f}_t + 1 - \tilde{f}_t): inside the overlap the standard input gate i_t operates, and outside it the master input gate controls writing.
• 18. Cell state and hidden state
  • The cell state is updated as c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t, and the hidden state as h_t = o_t \circ \tanh(c_t), as in the standard LSTM.
  • Figure 2 of the paper illustrates the correspondence between a constituency parse tree and the hidden states: at each time step, groups of neurons are updated at different frequencies, with higher-ranking groups updated less frequently and lower groups updated more frequently.
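Putting Eq. (1)-(5) and (9)-(14) together, a minimal NumPy sketch of one ON-LSTM step (a paraphrase of the equations above, not the authors' implementation; the chunked/down-sampled master gates of the full model are omitted and the parameter layout is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cumax(z):
    p = np.exp(z - z.max())
    p /= p.sum()
    return np.cumsum(p)

def on_lstm_step(x, h_prev, c_prev, W, U, b):
    """One ON-LSTM time step. W, U, b are dicts of weight matrices / bias
    vectors keyed by 'f', 'i', 'o', 'c', 'mf', 'mi' ('mf'/'mi' are the
    master forget / master input gates)."""
    pre = {k: W[k] @ x + U[k] @ h_prev + b[k] for k in W}
    f, i, o = sigmoid(pre['f']), sigmoid(pre['i']), sigmoid(pre['o'])   # Eq. (1)-(3)
    c_hat = np.tanh(pre['c'])                                           # Eq. (4)
    f_m = cumax(pre['mf'])                                              # Eq. (9)
    i_m = 1.0 - cumax(pre['mi'])                                        # Eq. (10)
    w = f_m * i_m                                                       # Eq. (11), overlap
    f_hat = f * w + (f_m - w)                                           # Eq. (12)
    i_hat = i * w + (i_m - w)                                           # Eq. (13)
    c = f_hat * c_prev + i_hat * c_hat                                  # Eq. (14)
    h = o * np.tanh(c)                                                  # Eq. (5)
    return h, c
```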
• 19. Inferring structure from the master forget gate
  • At each time step, the expected split point of the master forget gate can be estimated as
    \hat{d}^f_t = D_m - \sum_{k=1}^{D_m} \tilde{f}_{tk},
    where D_m is the dimension of the master gates. A large \hat{d}^f_t means many low-ranking neurons are erased at time t, i.e. a high-level constituent ends there.
  • Using \hat{d}^f_t as a syntactic-distance-style score, a binary constituency tree can be induced by recursively splitting the sentence top-down at the position with the largest score, which is how the unsupervised parsing results are obtained. A sketch follows below.
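A small sketch of the split-point estimate above, together with an illustrative top-down greedy splitter (the helper names are hypothetical and the greedy procedure follows the description above rather than being a verbatim re-implementation):

```python
import numpy as np

def expected_split_point(master_forget):
    """d_hat_f = D_m - sum_k f~_k for one time step (or a batch of time
    steps along the last axis)."""
    D_m = master_forget.shape[-1]
    return D_m - master_forget.sum(axis=-1)

def greedy_tree(scores, lo=0, hi=None):
    """Recursively split the token span [lo, hi) at the position with the
    largest score, yielding a binary bracketing as nested tuples of indices."""
    if hi is None:
        hi = len(scores)
    if hi - lo <= 1:
        return lo
    split = lo + 1 + int(np.argmax(scores[lo + 1:hi]))
    return (greedy_tree(scores, lo, split), greedy_tree(scores, split, hi))
```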
• 26. Agenda (papers covered), shown again before the next paper
  • Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks (best paper)
  • Smoothing the Geometry of Probabilistic Box Embeddings
  • Pay Less Attention with Lightweight and Dynamic Convolutions
  • The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
• 27. Smoothing the Geometry of Probabilistic Box Embeddings
  • Concepts are represented as boxes (axis-aligned hyperrectangles); containment and overlap of regions encode entailment between concepts.
  • Figure 1 of the paper: comparison between the Order Embedding (vector lattice) and Box Embedding representations for a simple ontology. Regions represent concepts and overlaps represent their entailment; shading represents density in the probabilistic case.
  • 3.1 Partial orders and lattices: a non-strict partially ordered set (poset) is a pair (P, \preceq), where P is a set and \preceq is a binary relation on P that is reflexive, antisymmetric, and transitive.
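To make the "regions represent concepts, overlaps represent entailment" picture concrete, a tiny sketch with boxes stored as (min corner, max corner) arrays; the helper name and the numbers are made up for illustration:

```python
import numpy as np

def contains(outer_lo, outer_hi, inner_lo, inner_hi):
    """True if the inner box lies entirely inside the outer box, i.e. the
    inner concept entails the outer one in the ontology picture."""
    return bool(np.all(outer_lo <= inner_lo) and np.all(inner_hi <= outer_hi))

# Example: a "dog" box inside a "mammal" box in 2-D.
mammal_lo, mammal_hi = np.array([0.1, 0.1]), np.array([0.9, 0.9])
dog_lo, dog_hi = np.array([0.2, 0.3]), np.array([0.4, 0.6])
print(contains(mammal_lo, mammal_hi, dog_lo, dog_hi))  # True
```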
• 33. Probabilistic semantics of box embeddings
  • Each concept x is a box with corners x_{min}, x_{max} \in [0, 1]^d.
  • Marginal probability: the volume of the box, P(x) = \prod_i (x_{max,i} - x_{min,i}).
  • Joint probability: the volume of the intersection, P(x, y) = \prod_i \max(0, \min(x_{max,i}, y_{max,i}) - \max(x_{min,i}, y_{min,i})).
  • Conditional probability: P(x \mid y) = P(x, y) / P(y).
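A minimal NumPy sketch of these quantities (function names are illustrative):

```python
import numpy as np

def box_prob(lo, hi):
    """P(x): volume of the box [lo, hi] inside the unit hypercube."""
    return float(np.prod(np.maximum(hi - lo, 0.0)))

def joint_prob(lo_x, hi_x, lo_y, hi_y):
    """P(x, y): volume of the intersection box; zero if the boxes are disjoint."""
    side = np.minimum(hi_x, hi_y) - np.maximum(lo_x, lo_y)
    return float(np.prod(np.maximum(side, 0.0)))

def cond_prob(lo_x, hi_x, lo_y, hi_y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint_prob(lo_x, hi_x, lo_y, hi_y) / box_prob(lo_y, hi_y)
```

Note that for disjoint boxes the joint (and hence the conditional) is exactly zero, so the gradient with respect to the box corners vanishes; this is the issue the smoothed version on the next slide addresses.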
• 37. Smoothed box embeddings: replace the hard hinge overlap with a softplus
  • softplus(z) = \log(1 + e^z)
  • P(x) = \prod_i \mathrm{softplus}(x_{max,i} - x_{min,i})
  • P(x, y) = \prod_i \mathrm{softplus}(\min(x_{max,i}, y_{max,i}) - \max(x_{min,i}, y_{min,i}))
  • P(x \mid y) = P(x, y) / P(y)
  • Because softplus is strictly positive, disjoint boxes still receive a small joint probability and therefore a nonzero gradient, which smooths the optimization landscape while retaining the inductive bias of the original box model (equivalent in the limit, and satisfying p(x, x) = p(x)).
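The same sketch with the softplus smoothing (again illustrative names; any temperature or scale parameter is omitted for simplicity):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def smoothed_box_prob(lo, hi):
    """Smoothed P(x): product of softplus side lengths."""
    return float(np.prod(softplus(hi - lo)))

def smoothed_joint_prob(lo_x, hi_x, lo_y, hi_y):
    """Smoothed P(x, y): softplus of the per-dimension overlap, which can be
    negative for disjoint boxes but still yields a positive value."""
    side = np.minimum(hi_x, hi_y) - np.maximum(lo_x, lo_y)
    return float(np.prod(softplus(side)))

def smoothed_cond_prob(lo_x, hi_x, lo_y, hi_y):
    return smoothed_joint_prob(lo_x, hi_x, lo_y, hi_y) / smoothed_box_prob(lo_y, hi_y)
```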
• 40. Experiments: WordNet hypernym prediction
  • Figure 3 of the paper compares overlap functions for two boxes of width 0.3 as a function of their centers: (a) standard (hinge) overlap, (b) Gaussian overlap, (c) softplus overlap. To achieve high overlap the Gaussian model must drastically lower its temperature, causing vanishing gradients in the tails; the softplus overlap behaves much better for highly disjoint boxes while preserving the meet property.
  • WordNet hypernym prediction uses 837k edges; negative examples are generated by swapping one of the terms to a random word in the dictionary.
  • Table 4: classification accuracy on the WordNet test set.
    transitive 88.2 / word2gauss 86.6 / OE 90.6 / Li et al. (2017) 91.3 / POE 91.6 / Box 92.2 / Smoothed Box 92.0
    The smoothed box model performs nearly as well as the original box lattice and needs less hyper-parameter tuning; the authors suspect its advantage would grow on tasks sparser than the 50/50 positive/negative split of the standard WordNet data.
  • Imbalanced WordNet (mammal subset): training data is the transitive reduction (1,176 positive examples); dev and test are the transitive closure of the training data (209 positives each); negatives are sampled at the stated ratios.
  • Table 5: F1 for different levels of label imbalance (positive:negative).
    1:1   Box 0.9905   OE 0.9976   Smoothed Box 1.0
    1:2   Box 0.8982   OE 0.9139   Smoothed Box 1.0
    1:6   Box 0.6680   OE 0.6640   Smoothed Box 0.9561
    1:10  Box 0.5495   OE 0.5897   Smoothed Box 0.8800
    With balanced data all models nearly match the full transitive closure; as negatives increase, the original box model degrades while the smoothed box still outperforms OE and Box in every setting, which matters for real-world entailment-graph learning where negatives greatly outnumber positives.
• 41. Experiments: Flickr caption entailment and MovieLens
  • Ground-truth conditional probabilities for MovieLens are built from co-watch counts: P(x | y) = P(x, y) / P(y) = (#users who watched both x and y / #users) / (#users who watched y / #users); 100K conditional probabilities are used for training and 10K each for dev and test.
  • Table 6 (Flickr): KL and Pearson correlation between model and gold probability. Box 0.050 / 0.900; Smoothed Box 0.036 / 0.917.
  • Baselines for MovieLens: low-rank matrix factorization, complex bilinear factorization (Trouillon et al., 2016), POE (Lai & Hockenmaier, 2017), and the Box Lattice (Vilnis et al., 2018). Separate embeddings are used for target and conditioned movies because the training matrix is asymmetric.
  • Table 7 (MovieLens): KL / Pearson R / Spearman R
    Matrix Factorization 0.0173 / 0.8549 / 0.8374
    Complex Bilinear Factorization 0.0141 / 0.8771 / 0.8636
    POE 0.0170 / 0.8548 / 0.8511
    Box 0.0147 / 0.8775 / 0.8768
    Smoothed Box 0.0138 / 0.8985 / 0.8977
    The smoothed box embeddings outperform the original box lattice and all baselines, especially in Spearman correlation, the most relevant metric for recommendation (a ranking task).
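A tiny sketch of how such gold conditional probabilities can be computed from co-watch counts (toy data structures, not the paper's preprocessing code):

```python
from collections import Counter
from itertools import combinations

# Each user's watch history as a set of movie ids (toy data).
histories = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A"}]

single = Counter()
pair = Counter()
for h in histories:
    single.update(h)
    pair.update(frozenset(p) for p in combinations(sorted(h), 2))

def gold_cond_prob(x, y):
    """P(x | y) = (#users who watched both x and y) / (#users who watched y)."""
    return pair[frozenset((x, y))] / single[y]

print(gold_cond_prob("A", "B"))  # 2 watched A and B out of 3 who watched B -> 0.666...
```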
• 42. Agenda (papers covered), shown again before the next paper
  • Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks (best paper)
  • Smoothing the Geometry of Probabilistic Box Embeddings
  • Pay Less Attention with Lightweight and Dynamic Convolutions
  • The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
• 43. Pay Less Attention with Lightweight and Dynamic Convolutions
  • Sequence modeling has progressed through RNNs (Sutskever et al. 2014; Bahdanau et al. 2015; Wu et al. 2016), CNNs (Kalchbrenner et al. 2016; Gehring et al. 2016; 2017; Kaiser et al. 2017), and self-attention models (Paulus et al. 2017; Vaswani et al. 2017). RNNs integrate context by updating a hidden state at every time step, CNNs summarize a fixed-size context through multiple layers, and self-attention directly summarizes all context.
  • Attention assigns context elements weights that define a weighted sum over context representations; self-attention is content-based, comparing the current time step to all elements in the context, and this ability to compare over unrestricted context sizes is seen as a key characteristic of self-attention (Vaswani et al., 2017).
  • Figure 1 of the paper: self-attention computes attention weights by comparing all pairs of elements to each other (a), while dynamic convolutions predict separate kernels for each time step (b).
  • On WMT English-German translation, dynamic convolutions achieve a new state of the art of 29.7 BLEU.
• 44. Self-attention (illustration, cf. Figure 1a)
  • At every time-step, attention weights are computed by comparing the current element with all context elements x1 … x5, and the output is a weighted sum over the whole context
• 45. Self-attention (illustration, continued)
  • Because the weights are content-based, the attention distribution over x1 … x5 is recomputed for every time-step, which requires a number of comparisons quadratic in the sequence length
• 46. Lightweight convolution (illustration)
  • Each time-step summarizes only a fixed-size window of x1 … x5; the softmax-normalized kernel weights are shared across time-steps and across groups of channels
• 47. Dynamic convolution (illustration, cf. Figure 1b)
  • A separate kernel over the local window of x1 … x5 is predicted for each time-step from the current input, so the weights change over time without comparing against the entire context
• 48. How large are the weights of each approach?
  • Regular convolution: W ∈ ℝ^{d×d×k}; depthwise convolution: W ∈ ℝ^{d×k}; with weight sharing over H heads: W ∈ ℝ^{H×k}; a dynamic convolution predicts an ℝ^{H×k} kernel for each of the n time-steps
  • This drastic reduction in the number of weights is what makes time-step-dependent kernels practical on current hardware

  Related work: Shen et al. (2018a) reduce complexity by performing attention within blocks of the input sequence and Shen et al. (2017; 2018b) perform more fine-grained attention over each feature. Our experiments show that lightweight convolutions perform competitively to strong self-attention results and that dynamic convolutions can perform even better. On WMT English-German translation dynamic convolutions achieve a new state of the art of 29.7 BLEU, on WMT English-French they match the best reported result in the literature, and on IWSLT German-English dynamic convolutions outperform self-attention by 0.8 BLEU. Dynamic convolutions achieve 20% faster runtime than a highly-optimized self-attention baseline. For language modeling on the Billion Word benchmark dynamic convolutions perform as well as or better than self-attention, and on CNN-DailyMail abstractive document summarization we outperform a strong self-attention model.

  2 BACKGROUND: We first outline sequence to sequence learning and self-attention. Our work builds on non-separable convolutions as well as depthwise separable convolutions.

  Sequence to sequence learning maps a source sequence to a target sequence via two separate networks, such as in machine translation (Sutskever et al., 2014). The encoder network computes representations for the source sequence, such as an English sentence, and the decoder network autoregressively generates a target sequence based on the encoder output.

  The self-attention module of Vaswani et al. (2017) applies three projections to the input X ∈ ℝ^{n×d} to obtain key (K), query (Q), and value (V) representations, where n is the number of time steps and d the input/output dimension (Figure 2a). It also defines a number of heads H, where each head can learn separate attention weights over d_k features and attend to different positions. The module computes dot-products between key/query pairs, scales to stabilize training, and then softmax-normalizes the result. Finally, it computes a weighted sum using the output of the value projection (V):

    Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

  Depthwise convolutions perform a convolution independently over every channel. The number of parameters can be reduced from d²k to dk, where k is the kernel width. The output O ∈ ℝ^{n×d} of a depthwise convolution with weight W ∈ ℝ^{d×k} for element i and output dimension c is defined as:

    O_{i,c} = DepthwiseConv(X, W_{c,:}, i, c) = Σ_{j=1}^{k} W_{c,j} · X_{(i + j − ⌈(k+1)/2⌉), c}

  3 LIGHTWEIGHT CONVOLUTIONS: In this section, we introduce LightConv, a depthwise convolution which shares certain output channels and whose weights are normalized across the temporal dimension using a softmax.
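As a concrete reading of the DepthwiseConv equation above, here is a small PyTorch sketch (an assumed illustration, not the authors' fairseq code); the padding term reproduces the ⌈(k+1)/2⌉ centering in the formula.

```python
import math
import torch
import torch.nn.functional as F

# Sketch of O[i, c] = sum_{j=1..k} W[c, j] * X[i + j - ceil((k+1)/2), c]:
# a centered per-channel cross-correlation with one k-tap kernel per channel.

def depthwise_conv(X: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """X: (n, d) input sequence, W: (d, k) kernels. Returns O of shape (n, d)."""
    n, d = X.shape
    d_w, k = W.shape
    assert d == d_w, "one kernel per input channel"
    x = X.t().unsqueeze(0)                      # (1, d, n) as expected by conv1d
    w = W.unsqueeze(1)                          # (d, 1, k); groups=d -> per-channel filter
    pad_left = math.ceil((k + 1) / 2) - 1       # centers the kernel on position i
    pad_right = (k - 1) - pad_left
    x = F.pad(x, (pad_left, pad_right))         # zero-pad so the output length stays n
    out = F.conv1d(x, w, groups=d)              # (1, d, n)
    return out.squeeze(0).t()                   # back to (n, d)

# Example: n = 10 time-steps, d = 4 channels, kernel width k = 3.
X = torch.randn(10, 4)
W = torch.randn(4, 3)
print(depthwise_conv(X, W).shape)               # torch.Size([10, 4])
```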
  Figure 2: Illustration of self-attention (a), lightweight convolutions (b) and dynamic convolutions (c) as blocks built from linear projections, GLU, (dynamic) LConv and scaled dot-product attention.

  Compared to self-attention, LightConv has a fixed context window and it determines the importance of context elements with a set of weights that do not change over time steps. We will show that models equipped with lightweight convolutions show better generalization compared to regular convolutions and that they can be competitive to state-of-the-art self-attention models (§6). This is surprising because the common belief is that content-based self-attention mechanisms are crucial to obtaining state-of-the-art results in natural language processing applications. Furthermore, the low computational profile of LightConv enables us to formulate efficient dynamic convolutions (§4).

  LightConv computes the following for the i-th element in the sequence and output channel c:

    LightConv(X, W_{⌈cH/d⌉,:}, i, c) = DepthwiseConv(X, softmax(W_{⌈cH/d⌉,:}), i, c)

  Weight sharing: we tie the parameters of every block of d/H subsequent channels, which reduces the number of parameters by a factor of H. As illustration, a regular convolution requires 7,340,032 (d² × k) weights for d = 1024 and k = 7, a depthwise separable convolution has 7,168 weights (d × k), and with weight sharing, H = 16, we have only 112 (H × k) weights. We will see that this vast reduction in the number of parameters is crucial to make dynamic convolutions possible on current hardware.
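A short PyTorch sketch of the LightConv definition above (illustrative, not the authors' fairseq module), building on the `depthwise_conv` sketch shown earlier; `light_conv` and its argument layout are my own naming.

```python
import torch
import torch.nn.functional as F

# LightConv sketch: a depthwise convolution whose k kernel weights are
# softmax-normalized over the temporal dimension and shared across groups
# of d/H channels. Reuses depthwise_conv from the sketch above.

def light_conv(X: torch.Tensor, W: torch.Tensor, H: int) -> torch.Tensor:
    """X: (n, d) sequence, W: (H, k) shared kernels, H: number of heads."""
    n, d = X.shape
    W = F.softmax(W, dim=-1)                           # normalize each kernel over its k taps
    per_channel = W.repeat_interleave(d // H, dim=0)   # channel c uses kernel ceil(cH/d): (d, k)
    return depthwise_conv(X, per_channel)

# Example: d = 1024, H = 16, k = 7 -> only H * k = 112 convolution weights.
X = torch.randn(10, 1024)
W = torch.randn(16, 7)
print(light_conv(X, W, H=16).shape)                    # torch.Size([10, 1024])
```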
  4 DYNAMIC CONVOLUTIONS: A dynamic convolution has kernels that vary over time as a learned function of the individual time-steps. A dynamic version of standard convolutions would be impractical for current GPUs due to their large memory requirements. We address this problem by building on LightConv, which drastically reduces the number of parameters (§3).

  DynamicConv takes the same form as LightConv but uses a time-step dependent kernel that is computed using a function f : ℝ^d → ℝ^{H×k}:

    DynamicConv(X, i, c) = LightConv(X, f(X_i)_{h,:}, i, c)

  We model f with a simple linear module with learned weights W^Q ∈ ℝ^{H×k×d}, i.e., f(X_i)_{h,j} = Σ_{c=1}^{d} W^Q_{h,j,c} X_{i,c}.

  Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context: they are a function of the current time-step only, so the cost scales linearly in the sequence length rather than quadratically as for self-attention.
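Following the DynamicConv definition above, a self-contained PyTorch sketch (an assumed illustration; the real implementation is the authors' optimized fairseq code): a linear layer predicts a softmax-normalized (H, k) kernel from each X_i, which is then applied over a centered window with the H kernels shared across groups of d/H channels.

```python
import math
import torch
import torch.nn.functional as F

def dynamic_conv(X: torch.Tensor, W_q: torch.Tensor, H: int) -> torch.Tensor:
    """X: (n, d) input sequence, W_q: (H * k, d) weights of the kernel predictor f."""
    n, d = X.shape
    k = W_q.shape[0] // H
    kernels = F.softmax((X @ W_q.t()).reshape(n, H, k), dim=-1)   # f(X_i) for every position i
    pad_left = math.ceil((k + 1) / 2) - 1                         # same centering as DepthwiseConv
    X_pad = F.pad(X, (0, 0, pad_left, k - 1 - pad_left))          # zero-pad the time dimension
    windows = X_pad.unfold(0, k, 1)                               # (n, d, k) local windows
    windows = windows.reshape(n, H, d // H, k)                    # group channels into H heads
    out = torch.einsum('nhk,nhck->nhc', kernels, windows)         # apply the per-step kernels
    return out.reshape(n, d)

# Example with d = 1024, H = 16, k = 7: the kernel predictor has only H*k*d weights.
X = torch.randn(10, 1024)
W_q = torch.randn(16 * 7, 1024)
print(dynamic_conv(X, W_q, H=16).shape)                           # torch.Size([10, 1024])
```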
• 49. Ablation on WMT English-German (newstest2013)
  • Each row adds one change on top of the previous one, stepping from a plain CNN to LightConv and DynamicConv

  Table 3: Ablation on WMT English-German newstest2013. (+) indicates that a result includes all preceding features. Speed results are based on beam size 4, batch size 256 on an NVIDIA P100 GPU.
    Model                                              Param   BLEU         Sent/sec
    Vaswani et al. (2017)                              213M    26.4         -
    Self-attention baseline (k=inf, H=16)              210M    26.9 ± 0.1   52.1 ± 0.1
    Self-attention baseline (k=3,7,15,31x3, H=16)      210M    26.9 ± 0.3   54.9 ± 0.2
    CNN (k=3)                                          208M    25.9 ± 0.2   68.1 ± 0.3
    CNN Depthwise (k=3, H=1024)                        195M    26.1 ± 0.2   67.1 ± 1.0
    + Increasing kernel (k=3,7,15,31x4, H=1024)        195M    26.4 ± 0.2   63.3 ± 0.1
    + DropConnect (H=1024)                             195M    26.5 ± 0.2   63.3 ± 0.1
    + Weight sharing (H=16)                            195M    26.5 ± 0.1   63.7 ± 0.4
    + Softmax-normalized weights [LightConv] (H=16)    195M    26.6 ± 0.2   63.6 ± 0.1
    + Dynamic weights [DynamicConv] (H=16)             200M    26.9 ± 0.2   62.6 ± 0.4
    Note: DynamicConv (H=16) w/o softmax-normalization 200M    diverges
    AAN decoder + self-attn encoder                    260M    26.8 ± 0.1   59.5 ± 0.1
    AAN decoder + AAN encoder                          310M    22.5 ± 0.1   59.2 ± 2.1

  6.2 MODEL ABLATION: In this section we evaluate the impact of the various choices we made for LightConv (§3) and DynamicConv (§4). We first show that limiting the maximum context size of self-attention has no impact on validation accuracy (Table 3). Note that our baseline is stronger than the original result of Vaswani et al. (2017). Next, we replace self-attention blocks with non-separable convolutions (CNN) with kernel size 3 and input/output dimension d = 1024. The CNN block has no input and output projections compared to the baseline and we add one more encoder layer to assimilate the parameter count. This CNN with a narrow kernel trails self-attention by 1 BLEU. We improve this result by switching to a depthwise separable convolution (CNN Depthwise) with input and output projections of size d = 1024. When we progressively increase the kernel width …
• 50. Summary: Pay Less Attention with Lightweight and Dynamic Convolutions
  • Lightweight convolutions (softmax-normalized, weight-shared depthwise convolutions) are competitive with strong self-attention models despite a fixed context window
  • Dynamic convolutions predict the kernel from the current time-step only, so no comparison against the entire context is needed
  • New state of the art of 29.7 BLEU on WMT English-German and about 20% faster runtime than a highly-optimized self-attention baseline
• 51. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision (NS-CL)
  • Learns visual concepts, words, and the semantic parsing of sentences from images paired only with questions and answers (no program annotations or attribute labels)
  • Pipeline: an object-based visual representation, a semantic parser that turns the question into a symbolic program, and a symbolic executor that runs the program to produce the answer
  • The parser proposes candidate programs; the executed answer is compared with the ground truth, the concept embeddings are trained by back-propagation and the parser by REINFORCE

  Table 1: Comparison with other frameworks on the CLEVR VQA dataset, w.r.t. visual features, implicit or explicit semantics and supervisions.
    Model                            Visual Features   Semantics   Extra Labels (# Prog. / Attr.)   Inference
    FiLM (Perez et al., 2018)        Convolutional     Implicit    0 / No                           Feature Manipulation
    IEP (Johnson et al., 2017b)      Convolutional     Explicit    700K / No                        Feature Manipulation
    MAC (Hudson & Manning, 2018)     Attentional       Implicit    0 / No                           Feature Manipulation
    Stack-NMN (Hu et al., 2018)      Attentional       Implicit    0 / No                           Attention Manipulation
    TbD (Mascharka et al., 2018)     Attentional       Explicit    700K / No                        Attention Manipulation
    NS-VQA (Yi et al., 2018)         Object-Based      Explicit    0.2K / Yes                       Symbolic Execution
    NS-CL (this paper)               Object-Based      Explicit    0 / No                           Symbolic Execution

  Running example, Q: "What is the shape of the red object left of the sphere?"
  Candidate interpretations from the semantic parser:
    ✓ Query(Shape, Filter(Red, Relate(Left, Filter(Sphere))))
    ☓ Query(Shape, Filter(Sphere, Relate(Left, Filter(Red))))
    ☓ Exist(AERelate(Shape, Filter(Red, Relate(Left, Filter(Sphere))))) …
  Visual representation (Obj 1 … Obj 4) + concept embeddings → symbolic reasoning → answer: Cylinder, ground truth: Box → back-propagation (concept embeddings) and REINFORCE (semantic parser).
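A schematic sketch of this learning signal (my own simplification, not the NS-CL code): rewards from executing sampled candidate programs update the parser via REINFORCE, while the executor's answer loss back-propagates into the concept embeddings; the names `parser_log_probs`, `rewards`, and `answer_scores` are illustrative.

```python
import torch
import torch.nn.functional as F

def parser_reinforce_loss(parser_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE over sampled candidate programs, with the batch-mean reward as baseline."""
    baseline = rewards.mean().detach()
    return -((rewards - baseline) * parser_log_probs).mean()

def concept_loss(answer_scores: torch.Tensor, gold_answer: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on the executor's answer scores (gold_answer is a class index);
    gradients flow into the concept embeddings through the differentiable executor."""
    return F.cross_entropy(answer_scores.unsqueeze(0), gold_answer.unsqueeze(0))

# Toy example: three sampled candidate programs; reward 1 if the executed answer is correct.
log_probs = torch.log_softmax(torch.randn(3, requires_grad=True), dim=0)
rewards = torch.tensor([1.0, 0.0, 0.0])
print(parser_reinforce_loss(log_probs, rewards))
```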
• 52. NS-CL overview: visual parsing, semantic parsing, program execution
  • Step 1, visual parsing: detect the objects in the scene (Obj 1 … Obj 4) and extract an object-based visual representation
  • Steps 2–3, semantic parsing and program execution: the question is parsed into a program, e.g. Filter(Green Cube) → Relate(Left) → Filter(Red) → Filter(Purple Matte) → AEQuery(Shape), which is executed over the object representations to produce the answer, e.g. No (0.98)
  • Curriculum (Figure 4A): the executor is initialized with the DSL; Lesson 1 object-based questions, Lesson 2 relational questions, Lesson 3 more complex questions; finally deploy on complex scenes and questions

  Figure 4: A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing. B. Illustrative execution of NS-CL on "Q: Does the red object left of the green cube have the same shape as the purple matte thing?"
• 53. NS-CL: joint learning of concepts and semantic parsing (running example)
  • Question: "What is the shape of the red object?"; scene with two objects (Obj:1, Obj:2); program: Filter(Red) ↓ Query(Shape)
  • Components: an object detector with CNN features for the objects, concept embedding spaces (a color embedding space and a shape embedding space with Cylinder / Sphere / Box), and the program executor
• 54. Step 1: visual parsing
  • An object detector proposes the objects in the image and a ResNet extracts a visual feature for each object (Obj:1, Obj:2); each object feature is then mapped into the visual-semantic embedding spaces
• 55. Step 2: semantic parsing
  • An encoder-decoder semantic parser translates the question "What is the shape of the red object?" into the symbolic program Filter(Red) ↓ Query(Shape)
• 56. Concept embeddings (color)
  • Each visual concept is an embedding vector: color concepts such as Red live in a color embedding space, and every object is mapped into that space so it can be compared against them
• 57. Executing Filter(Red)
  • The executor scores each object by the cosine similarity between the object's embedding in the color space and the concept embedding for Red, producing a soft mask over Obj:1 and Obj:2
• 58. Concept embeddings (shape)
  • The shape concepts Cylinder, Sphere and Box are embeddings in a shape embedding space, into which the objects are also mapped
• 59. Executing Query(Shape)
  • The object selected by Filter(Red) is compared, by cosine similarity in the shape embedding space, against the Cylinder / Sphere / Box concept embeddings to predict its shape
• 60. End-to-end execution
  • The full program Filter(Red) ↓ Query(Shape) is executed with these soft, differentiable operations over the object embeddings, so the model outputs a distribution over possible answers
• 61. Learning the concepts
  • The predicted answer to "What is the shape of the red object?" is compared with the ground-truth answer, and the error is back-propagated through the differentiable execution into the concept embeddings (color and shape spaces)
• 62. Learning the semantic parser with REINFORCE
  • The choice of program is discrete, so it cannot be trained by back-propagation; instead the parser is updated with REINFORCE, using the correctness of the executed answer as the reward
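To make the Filter / Query steps of this walkthrough concrete, here is a schematic sketch (my own simplification, not the NS-CL implementation) of executing Filter(Red) → Query(Shape) over object embeddings with cosine similarities; the embedding spaces, the temperature `tau` and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def filter_concept(obj_emb: torch.Tensor, concept: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Soft mask over objects: sigmoid of the (scaled) cosine similarity to a concept."""
    sim = F.cosine_similarity(obj_emb, concept.unsqueeze(0), dim=-1)
    return torch.sigmoid(sim / tau)

def query_shape(obj_emb: torch.Tensor, mask: torch.Tensor, shape_concepts: torch.Tensor,
                tau: float = 0.1) -> torch.Tensor:
    """Distribution over shape concepts for the (softly) selected object."""
    selected = (mask.unsqueeze(-1) * obj_emb).sum(0) / (mask.sum() + 1e-9)
    sims = F.cosine_similarity(selected.unsqueeze(0), shape_concepts, dim=-1)
    return F.softmax(sims / tau, dim=-1)

# Toy scene: 2 objects with separate embeddings in a color space and a shape space.
obj_color = torch.randn(2, 64)
obj_shape = torch.randn(2, 64)
red = torch.randn(64)                                  # concept embedding for "Red"
shapes = torch.randn(3, 64)                            # concepts: Cylinder, Sphere, Box

mask = filter_concept(obj_color, red)                  # Filter(Red)
answer = query_shape(obj_shape, mask, shapes)          # Query(Shape)
print(answer)                                          # probabilities over the 3 shapes
```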
• 63. Curriculum concept learning
  • Training starts from an executor initialized with the DSL and proceeds through lessons of increasing difficulty (Figure 4A)
  • Lesson 1: object-based questions, e.g. "What is the shape of the red object?" A: Cube
  • Lesson 2: relational questions, e.g. "How many cubes are behind the sphere?" A: 3
  • Lesson 3: more complex questions, e.g. "Does the red object left of the green cube have the same shape as the purple matte thing?" A: No
  • Deploy: complex scenes and complex questions, e.g. "Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube?" A: No
  • Questions of different complexities are illustrated to the learner in an incremental manner
• 64. Experiments: CLEVR
  • Evaluated on the CLEVR VQA dataset: synthetic scenes of simple 3D objects paired with compositional questions, split into train / validation / test
  • NS-CL is compared against the frameworks listed in Table 1
• 65. References (1/2)
  Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
  • Shen, Yikang et al., "Ordered neurons: Integrating tree structures into recurrent neural networks," in Proc. of ICLR, 2019.
  • Merity, Stephen et al., "Regularizing and Optimizing LSTM Language Models," in Proc. of ICLR, 2018.
  • Yang, Zhilin et al., "Breaking the softmax bottleneck: A high-rank RNN language model," in Proc. of ICLR, 2018.
  Smoothing the Geometry of Probabilistic Box Embeddings
  • Li, Xiang et al., "Smoothing the Geometry of Probabilistic Box Embeddings," in Proc. of ICLR, 2019.
  • Mikolov, Tomas et al., "Distributed representations of words and phrases and their compositionality," in Proc. of NIPS, 2013.
  • Vilnis, Luke et al., "Word representations via Gaussian embedding," in Proc. of ICLR, 2015.
  • Vendrov, Ivan et al., "Order-embeddings of images and language," in Proc. of ICLR, 2016.
  • Lai, Alice et al., "Learning to predict denotational probabilities for modeling entailment," in Proc. of EACL, 2017.
  • Vilnis, Luke et al., "Probabilistic embedding of knowledge graphs with box lattice measures," in Proc. of ACL, 2018.
• 66. References (2/2)
  Pay Less Attention with Lightweight and Dynamic Convolutions
  • Wu, Felix et al., "Pay less attention with lightweight and dynamic convolutions," in Proc. of ICLR, 2019.
  • Vaswani, Ashish et al., "Attention is all you need," in Proc. of NIPS, 2017.
  The Neuro-Symbolic Concept Learner
  • Mao, Jiayuan et al., "The Neuro-Symbolic Concept Learner: Interpreting scenes, words, and sentences from natural supervision," in Proc. of ICLR, 2019.
  • Hudson, Drew et al., "Compositional attention networks for machine reasoning," in Proc. of ICLR, 2018.
  • Mascharka, David et al., "Transparency by design: Closing the gap between performance and interpretability in visual reasoning," in Proc. of CVPR, 2018.
  • Yi, Kexin et al., "Neural-Symbolic VQA: Disentangling reasoning from vision and language understanding," in Proc. of NeurIPS, 2018.
  • Johnson, Justin et al., "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proc. of CVPR, 2017.