Variational Dropout Sparsifies
Deep Neural Networks
2017/03/24 Masahiro Suzuki
About this paper
¤ Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
  ¤ Skolkovo Institute of Science and Technology, National Research University Higher School of Economics, Moscow Institute of Physics and Technology
  ¤ Submitted to ICML 2017 (arXiv, 2017/2/27)
¤ Proposes a method that allows the dropout rate in Bayesian (variational) dropout to be pushed all the way up to its maximum.
  ¤ The idea itself is extremely simple.
  ¤ Not only does it yield high sparsity, it also addresses a generalization issue of ordinary CNNs.
¤ Why I chose this paper
  ¤ It is an extension of a paper [Kingma+ 15] we covered in the reading group two years ago.
  ¤ I like that a simple idea achieves a large effect.
Bayesian inference
¤ Suppose we observe data $D = \{(x_n, y_n)\}_{n=1}^{N}$.
¤ The goal is to obtain $p(y \mid x, w) = p(D \mid w)$, the likelihood of the data.
¤ In the Bayesian framework we also introduce prior knowledge about the parameters $w$.
¤ After observing $D$, the posterior over $w$ is

$$p(w \mid D) = \frac{p(D \mid w)\,p(w)}{p(D)} = \frac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw}$$

¤ This procedure is called Bayesian inference.
¤ Computing the posterior requires the marginalization in the denominator → variational inference.
Variational inference
¤ Introduce an approximate distribution $q(w \mid \phi)$ and minimize its divergence $D_{KL}[q(w \mid \phi)\,\|\,p(w \mid D)]$ from the true posterior.
¤ This is equivalent to maximizing the variational lower bound below.
¤ With the reparameterization trick, the variational lower bound becomes differentiable with respect to $\phi$.
¤ For a minibatch $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$, unbiased estimators of the lower bound and of its gradient are given below.
$$\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w)) \to \max_{\phi} \qquad (1)$$

$$L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\left[\log p(y_n \mid x_n, w)\right] \qquad (2)$$

It consists of two parts, the expected log-likelihood $L_D(\phi)$ and the KL-divergence $D_{KL}(q_\phi(w)\,\|\,p(w))$, which acts as a regularization term.

3.2. Stochastic Variational Inference

In the case of complex models the expectations in (1) and (2) are intractable. Therefore the variational lower bound (1) and its gradients cannot be computed exactly. However, it is still possible to estimate them using sampling and optimize the variational lower bound using stochastic optimization.

We follow (Kingma & Welling, 2013) and use the Reparameterization Trick to obtain an unbiased differentiable minibatch-based Monte Carlo estimator of the expected log-likelihood (3). The main idea is to represent the parametric noise $q_\phi(w)$ as a deterministic differentiable function $w = f(\phi, \varepsilon)$ of a non-parametric noise $\varepsilon \sim p(\varepsilon)$. This trick allows us to obtain an unbiased estimate of $\nabla_\phi L_D(q_\phi)$. Here we denote objects from a mini-batch as $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$.

$$\mathcal{L}(\phi) \simeq \mathcal{L}^{SGVB}(\phi) = L_D^{SGVB}(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w)) \qquad (3)$$

$$L_D(\phi) \simeq L_D^{SGVB}(\phi) = \frac{N}{M}\sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \varepsilon_m)) \qquad (4)$$

$$\nabla_\phi L_D(\phi) \simeq \frac{N}{M}\sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \varepsilon_m)) \qquad (5)$$

The Local Reparameterization Trick is another technique that reduces the variance of this gradient estimator even further.
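To make Eqs. (3)-(5) concrete, here is a minimal sketch of the SGVB estimator with the reparameterization trick, for a toy Bayesian linear-regression model with a factorized Gaussian posterior. The model, data, and all names below are illustrative assumptions, not the paper's code.

import math
import torch

# Toy data: N points, minibatches of size M, input dimension D.
N, M, D = 1000, 32, 5
X = torch.randn(N, D)
y = X @ torch.randn(D) + 0.1 * torch.randn(N)

mu = torch.zeros(D, requires_grad=True)         # variational mean of q_phi(w)
log_sigma = torch.zeros(D, requires_grad=True)  # variational log-std of q_phi(w)

def sgvb_loglik(x_batch, y_batch):
    """(N/M) * sum_m log p(y_m | x_m, f(phi, eps_m)), i.e. Eq. (4)."""
    eps = torch.randn(D)                 # non-parametric noise eps ~ p(eps)
    w = mu + log_sigma.exp() * eps       # w = f(phi, eps), differentiable in phi
    mean = x_batch @ w
    # Gaussian likelihood with unit observation noise (arbitrary toy choice).
    log_p = -0.5 * ((y_batch - mean) ** 2 + math.log(2 * math.pi))
    return (N / M) * log_p.sum()

idx = torch.randint(0, N, (M,))
loglik = sgvb_loglik(X[idx], y[idx])
loglik.backward()                        # unbiased gradient estimate, Eq. (5)
print(mu.grad, log_sigma.grad)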
Dropout
¤ In a fully-connected layer $B = AW$, dropout injects multiplicative random noise $\Xi$ into the layer input at every training iteration.
¤ Bernoulli or Gaussian distributions are used to sample the noise.
¤ Putting Gaussian noise on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$.
¤ The random variable $w$ is then parameterized by $\theta$ as follows.
In this section we consider a single fully-connected layer with $I$ input neurons and $O$ output neurons before a non-linearity. We denote an output matrix as $B^{M\times O}$, an input matrix as $A^{M\times I}$ and a weight matrix as $W^{I\times O}$. We index the elements of these matrices as $b_{mj}$, $a_{mi}$ and $w_{ij}$ respectively. Then $B = AW$.

Dropout is one of the most popular regularization methods for deep neural networks. It injects a multiplicative random noise $\Xi$ into the layer input $A$ at each iteration of the training procedure (Hinton et al., 2012).

$$B = (A \odot \Xi)W, \quad \xi_{mi} \sim p(\xi) \qquad (6)$$

The original version of dropout, so-called Bernoulli or Binary Dropout, was presented with $\xi_{mi} \sim \mathrm{Bernoulli}(1-p)$ (Hinton et al., 2012). It means that each element of the input matrix is put to zero with probability $p$, also known as a dropout rate. Later the same authors reported that Gaussian Dropout with continuous noise $\xi_{mi} \sim \mathcal{N}(1, \alpha = \frac{p}{1-p})$ works as well and is similar to Binary Dropout with dropout rate $p$ (Srivastava et al., 2014). It is beneficial to use continuous noise instead of discrete noise because multiplying the inputs by Gaussian noise is equivalent to putting Gaussian noise on the weights. This procedure can be used to obtain a posterior distribution over the model's weights (Wang & Manning, 2013; Kingma et al., 2015). That is, putting multiplicative Gaussian noise $\xi_{ij} \sim \mathcal{N}(1, \alpha)$ on a weight $w_{ij}$ is equivalent to sampling $w_{ij}$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$. Now $w_{ij}$ becomes a random variable parametrized by $\theta_{ij}$.

$$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\varepsilon_{ij}) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, 1) \qquad (7)$$

Gaussian Dropout training is equivalent to stochastic optimization of the expected log-likelihood (2) in the case when we use the reparameterization trick and draw a single sample $W \sim q(W \mid \theta, \alpha)$ per minibatch to estimate the expectation.
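A small numerical check of Eq. (7): multiplying a fixed weight $\theta$ by $\xi \sim \mathcal{N}(1, \alpha)$ produces the same distribution as sampling $w \sim \mathcal{N}(\theta, \alpha\theta^2)$ directly. The numbers below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
theta, p = 1.7, 0.5
alpha = p / (1 - p)                      # Gaussian Dropout variance, alpha = p / (1 - p)

n = 1_000_000
xi = rng.normal(1.0, np.sqrt(alpha), size=n)                        # xi ~ N(1, alpha)
w_mult = theta * xi                                                 # w = theta * xi
w_direct = rng.normal(theta, np.sqrt(alpha) * abs(theta), size=n)   # w ~ N(theta, alpha*theta^2)

print(w_mult.mean(), w_direct.mean())    # both close to theta
print(w_mult.var(), w_direct.var())      # both close to alpha * theta^2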
Variational Dropout
¤ If we regard $q(W \mid \theta, \alpha)$ as an approximate posterior with parameters $(\theta, \alpha)$, these parameters can be computed by variational inference (Variational Dropout).
¤ With $\alpha$ fixed, Variational Dropout and Gaussian Dropout become equivalent.
  ¤ Because the KL term is then constant.
¤ In Variational Dropout, $\alpha$ is a learnable parameter!
  ¤ That is, $\alpha$ can be determined automatically during training.
¤ However, in the prior work [Kingma+ 2015], $\alpha$ is restricted to values of at most 1.
  ¤ With too much noise, the variance of the gradients becomes large.
¤ Still, being able to set $\alpha$ up to infinity (i.e. a dropout rate of 1) should lead to interesting results.
Variational Dropout extends this technique and explicitly uses $q(W \mid \theta, \alpha)$ as an approximate posterior distribution for a model with a special prior on the weights. The parameters $\theta$ and $\alpha$ of the distribution $q(W \mid \theta, \alpha)$ are tuned via stochastic variational inference, i.e. $\phi = (\theta, \alpha)$ are the variational parameters, as denoted in Section 3.2. The prior distribution $p(W)$ is chosen to be the improper log-scale uniform prior, to make Variational Dropout with fixed $\alpha$ equivalent to Gaussian Dropout (Kingma et al., 2015).

$$p(\log|w_{ij}|) = \mathrm{const} \;\Leftrightarrow\; p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \qquad (8)$$

In this model, it is the only prior distribution that makes variational inference consistent with Gaussian Dropout (Kingma et al., 2015). When the parameter $\alpha$ is fixed, the $D_{KL}(q(W \mid \theta, \alpha)\,\|\,p(W))$ term in the variational lower bound (1) does not depend on $\theta$ (Kingma et al., 2015). Maximization of the variational lower bound (1) then becomes equivalent to maximization of the expected log-likelihood (2) with fixed parameter $\alpha$. It means that Gaussian Dropout training is exactly equivalent to Variational Dropout with fixed $\alpha$.
Additive Noise Reparameterization
¤ In the gradient of the lower bound (Eq. (9) below), the second factor $\partial w_{ij}/\partial\theta_{ij} = 1 + \sqrt{\alpha_{ij}}\,\varepsilon_{ij}$ becomes noisier as $\alpha$ grows (as $\alpha$ grows, this term also grows).
¤ We therefore rewrite the parameterization as follows.
¤ Then $\partial w_{ij}/\partial\theta_{ij} = 1$, so the variance of the gradient can be drastically reduced!
¤ This makes it possible to set $\alpha$ as large as $\infty$.
4.1. Additive Noise Reparameterization

Training neural networks with Variational Dropout is difficult when the dropout rates $\alpha_{ij}$ are large because of a huge variance of stochastic gradients (Kingma et al., 2015). The cause of the large gradient variance is the multiplicative noise. To see it clearly, we can rewrite the gradient of $\mathcal{L}^{SGVB}$ w.r.t. $\theta_{ij}$ as follows.

$$\frac{\partial \mathcal{L}^{SGVB}}{\partial \theta_{ij}} = \frac{\partial \mathcal{L}^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}} \qquad (9)$$

In the case of the original parameterization $(\theta, \alpha)$, the second multiplier in (9) is very noisy if $\alpha_{ij}$ is large.

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\varepsilon_{ij}), \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}}\,\varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathcal{N}(0,1) \qquad (10)$$

We propose a trick that allows us to drastically reduce the variance of this term in the case when $\alpha_{ij}$ is large. The idea is to replace the multiplicative noise term $1 + \sqrt{\alpha_{ij}}\,\varepsilon_{ij}$ with an exactly equivalent additive noise term $\sigma_{ij}\,\varepsilon_{ij}$, where $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$ is treated as a new independent variable. After this trick we optimize the variational lower bound w.r.t. $(\theta, \sigma)$. However, we still use $\alpha$ throughout the paper, as it has a nice interpretation as a dropout rate.

$$w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\varepsilon_{ij}) = \theta_{ij} + \sigma_{ij}\,\varepsilon_{ij}, \quad \frac{\partial w_{ij}}{\partial \theta_{ij}} = 1, \quad \varepsilon_{ij} \sim \mathcal{N}(0,1) \qquad (11)$$
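A minimal autograd sketch of Eqs. (10)-(11): the same weight sample, but the gradient with respect to $\theta$ is noisy under the multiplicative $(\theta, \alpha)$ parameterization and exactly 1 under the additive $(\theta, \sigma)$ parameterization. All names are illustrative assumptions, not the authors' implementation.

import torch

torch.manual_seed(0)
alpha_val, theta_val = 25.0, 0.3             # a large dropout rate alpha
eps = torch.randn(())                        # eps ~ N(0, 1)

# Multiplicative parameterization, Eq. (10): w = theta * (1 + sqrt(alpha) * eps)
theta_m = torch.tensor(theta_val, requires_grad=True)
w_mult = theta_m * (1 + alpha_val ** 0.5 * eps)
w_mult.backward()
print("dw/dtheta (multiplicative):", theta_m.grad.item())  # = 1 + sqrt(alpha)*eps, high variance

# Additive parameterization, Eq. (11): w = theta + sigma * eps, sigma a separate variable
theta_a = torch.tensor(theta_val, requires_grad=True)
sigma = torch.tensor(alpha_val ** 0.5 * theta_val)          # sigma = sqrt(alpha) * theta, held fixed here
w_add = theta_a + sigma * eps
w_add.backward()
print("dw/dtheta (additive):", theta_a.grad.item())         # = 1, no noise from eps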
The KL-divergence term can be decomposed into a sum:

$$D_{KL}(q(W \mid \theta, \alpha)\,\|\,p(W)) = \sum_{ij} D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})) \qquad (12)$$

The log-scale uniform prior distribution is an improper prior, so the KL divergence can only be calculated up to an additive constant $C$ (Kingma et al., 2015).

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})) = \tfrac{1}{2}\log\alpha_{ij} - \mathbb{E}_{\varepsilon \sim \mathcal{N}(1, \alpha_{ij})}\log|\varepsilon| + C \qquad (13)$$

In the Variational Dropout model this term is intractable, as the expectation $\mathbb{E}_{\varepsilon \sim \mathcal{N}(1, \alpha_{ij})}\log|\varepsilon|$ in (13) cannot be computed analytically (Kingma et al., 2015). However, this term can be sampled and then approximated. Two different approximations were provided in the original paper, but they are accurate only for small values of the dropout rate $\alpha$ ($\alpha \le 1$). We propose another approximation (14) that is tight for all values of alpha. Here $\sigma(\cdot)$ denotes the sigmoid function. Different approximations and the true value of $-D_{KL}$ are presented in Fig. 1. The reference value of $-D_{KL}$ was obtained by averaging over $10^7$ samples of $\varepsilon$ with less than $2 \times 10^{-3}$ variance of the estimation.

$$-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})) \approx k_1\,\sigma(k_2 + k_3\log\alpha_{ij}) - 0.5\log(1 + \alpha_{ij}^{-1}) + C$$
$$k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695 \qquad (14)$$
The KL term
¤ The approximations of the KL term (the regularization term) proposed in [Kingma+ 15] are accurate only for $\alpha \le 1$.
¤ This work proposes an approximation of the KL term that is valid for all values of $\alpha$.
The approximation (14) was obtained from the following intuition: the negative KL-divergence goes to a constant as $\log\alpha_{ij}$ goes to infinity, and $-D_{KL} + 0.5\log(1 + \alpha_{ij}^{-1})$ behaves like a sigmoid function of $\log\alpha_{ij}$, so $k_1\,\sigma(k_2 + k_3\log\alpha_{ij})$ is fitted to this curve. The resulting approximation is extremely accurate over the full range of $\log\alpha_{ij}$, whereas the original approximation reaches a maximum absolute deviation of about 0.04. As the KL-divergence in this model is defined only up to an additive constant, $C = -k_1$ is chosen so that the KL-divergence goes to zero as $\alpha$ goes to infinity, which allows comparing the lower bound across neural networks of different sizes.

The $-D_{KL}$ term increases with the growth of $\alpha$, so this regularization term favors large values of $\alpha$. Infinitely large $\alpha_{ij}$ corresponds to a Binary Dropout rate $p \to 1$ (recall $\alpha = \frac{p}{1-p}$). Intuitively, such a weight is almost always dropped from the model, does not influence the model, and is put to zero during the test phase. Seen from another angle, an infinitely large $\alpha_{ij}$ corresponds to infinitely large multiplicative noise in $w_{ij}$: the value of this weight becomes completely random and its magnitude unbounded, which would corrupt the model prediction. It is therefore beneficial to put the corresponding weight $\theta_{ij}$ to zero in such a way that $\alpha_{ij}\theta_{ij}^2$ goes to zero as well; the posterior $q(w_{ij} \mid \theta_{ij}, \alpha_{ij})$ is then effectively a delta function centered at zero. In the case of linear regression this effect can be analyzed analytically and the optimal $\theta$ obtained in closed form (Section 4.3 of the paper).
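A small sketch of the proposed KL approximation (14), checked against a Monte-Carlo estimate of Eq. (13). The shared additive constant $C$ is dropped from both sides, so the two quantities should agree up to sampling error. The function names are mine, not the paper's.

import numpy as np

K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl_approx(log_alpha):
    """Approximate -D_KL(q(w|theta,alpha) || p(w)), Eq. (14), without the constant C."""
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_alpha))

def neg_kl_montecarlo(log_alpha, n=200_000, seed=0):
    """Sampled version of Eq. (13), also without the constant C."""
    rng = np.random.default_rng(seed)
    alpha = np.exp(log_alpha)
    eps = rng.normal(1.0, np.sqrt(alpha), size=n)    # eps ~ N(1, alpha)
    return 0.5 * log_alpha - np.mean(np.log(np.abs(eps)))

for la in (-2.0, 0.0, 3.0):
    print(la, neg_kl_approx(la), neg_kl_montecarlo(la))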
Computing Sparse Variational Dropout
¤ When optimizing the lower bound, the Local Reparameterization Trick [Kingma+ 15] is applied on top of the proposed Additive Noise Reparameterization to further reduce the variance.
  ¤ See the earlier reading-group slides for the Local Reparameterization Trick.
¤ Applicable not only to fully-connected layers but also to convolutional layers.
We optimize the variational lower bound (3) with our approximation of the KL-divergence (14). We apply Sparse Variational Dropout to both convolutional and fully-connected layers. To reduce the variance of $\mathcal{L}^{SGVB}$ we use a combination of the Local Reparameterization Trick and Additive Noise Reparameterization. In order to improve convergence, optimization is performed w.r.t. $(\theta, \log\sigma^2)$.

For a fully connected layer we use the same notation as in Section 3.3. In this case, Sparse Variational Dropout with the Local Reparameterization Trick and Additive Noise Reparameterization can be computed as follows:

$$b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}), \quad \gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}, \quad \delta_{mj} = \sum_{i=1}^{I} a_{mi}^2\,\sigma_{ij}^2 \qquad (17)$$

Now consider a convolutional layer. Take a single input tensor $A_m^{H\times W\times C}$, a single filter $w_k^{h\times w\times C}$ and the corresponding output matrix $b_{mk}^{H'\times W'}$. This filter has corresponding variational parameters $\theta_k^{h\times w\times C}$ and $\sigma_k^{h\times w\times C}$. Note that in this case $A_m$, $\theta_k$ and $\sigma_k$ are tensors. Because of the linearity of convolutional layers, it is possible to apply the Local Reparameterization Trick. Sparse Variational Dropout for convolutional layers can then be expressed in a way similar to (17). Here $(\cdot)^2$ is an element-wise operation, $*$ denotes the convolution operation, and $\mathrm{vec}(\cdot)$ denotes reshaping of a matrix/tensor into a vector.

$$\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk}), \quad \gamma_{mk} = \mathrm{vec}(A_m * \theta_k), \quad \delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2)) \qquad (18)$$

These formulae can be used for the implementation of Sparse Variational Dropout layers. We will provide a reference implementation using Theano (Bergstra et al., 2010).
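A sketch of what a fully-connected Sparse Variational Dropout layer following Eq. (17) could look like, with the additive $(\theta, \log\sigma^2)$ parameterization and a $\log\alpha$ threshold for pruning at test time. This is an illustrative, assumption-based PyTorch sketch, not the authors' Theano reference implementation.

import torch

class SparseVDLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = torch.nn.Parameter(0.01 * torch.randn(in_features, out_features))
        self.log_sigma2 = torch.nn.Parameter(torch.full((in_features, out_features), -10.0))
        self.threshold = threshold               # prune weights where log alpha > threshold

    def log_alpha(self):
        # sigma^2 = alpha * theta^2  =>  log alpha = log sigma^2 - log theta^2
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-16)

    def forward(self, a):
        if self.training:
            gamma = a @ self.theta                            # mean of b, Eq. (17)
            delta = (a ** 2) @ self.log_sigma2.exp()          # variance of b, Eq. (17)
            return gamma + delta.clamp_min(1e-16).sqrt() * torch.randn_like(gamma)
        # Test time: drop weights with log alpha above the threshold.
        mask = (self.log_alpha() < self.threshold).float()
        return a @ (self.theta * mask)

layer = SparseVDLinear(784, 300)
out = layer(torch.randn(32, 784))    # sampled pre-activations b ~ N(gamma, delta)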
Experimental setup
¤ $\alpha$ is clipped at $\log\alpha = 3$ (i.e. a dropout rate of up to about 0.95).
¤ The networks are first pre-trained without the proposed method.
  ¤ Without pre-training, the sparsity level is higher but the accuracy drops.
  ¤ This is reportedly a common issue with Bayesian DNNs.
¤ The pre-training used in this work is about 10 to 30 epochs.
¤ See the paper for the other settings.
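As a quick check of the numbers on this slide, using the correspondence $\alpha = p/(1-p)$ between the Gaussian and Binary dropout rates stated earlier:

$$p = \frac{\alpha}{1+\alpha}, \qquad \log\alpha = 3 \;\Rightarrow\; \alpha = e^{3} \approx 20.1 \;\Rightarrow\; p \approx \frac{20.1}{21.1} \approx 0.95$$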
Verifying Additive Noise Reparameterization
¤ Check whether Additive Noise Reparameterization actually reduces the variance.
¤ Compare sparsity and the quality of the lower bound against training without the proposed reparameterization.
Figure 2. Original parameterization vs. Additive Noise Reparameterization (Additive Noise Reparameterization leads to a much faster convergence of the variational lower bound).
Slide annotations: with the proposed method the network becomes sparse faster, and its lower bound converges faster.
MNIST
¤ Train LeNet architectures on MNIST.
  ¤ LeNet-300-100 (fully connected) and LeNet-5-Caffe (convolutional).
¤ Compared with Pruning [Han+ 15], Dynamic Network Surgery [Guo+ 16], and Soft Weight Sharing [Ullrich+ 17].
Table 1. Comparison of different sparsity-inducing techniques (Pruning (Han et al., 2015b;a), DNS (Guo et al., 2016), SWS (Ullrich et al., 2017)) on LeNet architectures. Our method provides the highest level of sparsity with a similar accuracy.

Network | Method | Error % | Sparsity per layer % | |W| / |W≠0|
LeNet-300-100 | Original | 1.64 | - | 1
LeNet-300-100 | Pruning | 1.59 | 92.0 91.0 74.0 | 12
LeNet-300-100 | DNS | 1.99 | 98.2 98.2 94.5 | 56
LeNet-300-100 | SWS | 1.94 | - | 23
LeNet-300-100 | Sparse VD (ours) | 1.92 | 98.9 97.2 62.0 | 68
LeNet-5-Caffe | Original | 0.80 | - | 1
LeNet-5-Caffe | Pruning | 0.77 | 34 88 92.0 81 | 12
LeNet-5-Caffe | DNS | 0.91 | 86 97 99.3 96 | 111
LeNet-5-Caffe | SWS | 0.97 | - | 200
LeNet-5-Caffe | Sparse VD (ours) | 0.75 | 67 98 99.8 95 | 280

Slide annotation: the proposed method gives the highest sparsity on both architectures.
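A rough sanity check of the |W| / |W≠0| column for LeNet-300-100, assuming the standard layer shapes 784×300, 300×100 and 100×10 and ignoring biases (an assumption on my part; the paper's exact counting may differ slightly). The result lands near the reported 68×.

sizes = [784 * 300, 300 * 100, 100 * 10]
sparsity = [0.989, 0.972, 0.620]          # per-layer sparsity of Sparse VD (Table 1)

total = sum(sizes)
nonzero = sum(n * (1 - s) for n, s in zip(sizes, sparsity))
print(total / nonzero)                    # ~70, close to the reported 68x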
CIFAR-10, CIFAR-100
¤ Train a VGG-like network [Zagoruyko+ 15] on CIFAR-10 and CIFAR-100.
¤ Experiments with different unit-size scaling factors k.
¤ Accuracy is almost unchanged, with up to 65× sparsity (CIFAR-10).
Learning random labels
¤ [Zhang+ 16] showed that CNNs can fit even random labels.
  ¤ Ordinary dropout does not resolve this problem.
¤ With the proposed method (Sparse VD), after training all weights collapse to a single value and the network makes only a constant prediction.
  ¤ Moreover, the sparsity reaches 100%.
  ¤ At 100% sparsity the weights become zero (see Section 4.3).
¤ Does the proposed method penalize memorization and thereby promote generalization?
Figure 3. Accuracy and sparsity level for VGG-like architectures of different sizes. The baseline networks were trained with Binary Dropout, and Sparse VD networks were trained with Sparse Variational Dropout. The overall sparsity level achieved by our method is reported as a dashed line. The sparsity level is high, especially in larger networks.

Table 2. Experiments with random labeling. Sparse Variational Dropout (Sparse VD) removes all weights from the model and fails to overfit where Binary Dropout networks (BD) learn the random labeling perfectly.

Dataset | Architecture | Train acc. | Test acc. | Sparsity
MNIST | FC + BD | 1.0 | 0.1 | -
MNIST | FC + Sparse VD | 0.1 | 0.1 | 100%
CIFAR-10 | VGG-like + BD | 1.0 | 0.1 | -
CIFAR-10 | VGG-like + Sparse VD | 0.1 | 0.1 | 100%

5.5. Random Labels

Recently it was shown that CNNs are capable of memorizing the data even with random labeling (Zhang et al., 2016). Standard dropout, as well as other regularization techniques, does not prevent this. In the Discussion (Section 6), the paper relates the sparsification effect to the "Occam's razor" principle of Bayesian inference and to automatic relevance determination (ARD).
Summary
¤ This work proposed a reparameterization for Variational Dropout whose gradient variance does not blow up even when $\alpha$ is large.
  ¤ It can be combined with the Local Reparameterization Trick of [Kingma+ 15].
  ¤ It is also applicable to CNNs.
¤ Experiments show that it achieves higher sparsity than existing methods.
¤ Furthermore, it was shown that the problem of DNNs easily fitting randomly labeled data does not apply to this method.
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 

Variational Dropout Sparsifies Deep Neural Networks

  • 1. Variational Dropout Sparsifies Deep Neural Networks, 2017/03/24, 鈴木雅大
  • 2. About this paper
    ¤ Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
    ¤ Skolkovo Institute of Science and Technology, National Research University Higher School of Economics, Moscow Institute of Physics and Technology
    ¤ ICML 2017 submission (arXiv, 2017/2/27)
    ¤ Proposes a method that allows the dropout rate of Bayesian dropout to be pushed all the way up to its maximum.
    ¤ The idea itself is extremely simple.
    ¤ It not only yields high sparsity but also mitigates a generalization problem of ordinary CNNs.
    ¤ Reasons for selection:
      ¤ It extends [Kingma+ 15], a paper we covered in this reading group two years ago.
      ¤ I like that such a simple idea has such a large effect.
  • 3. Bayesian inference
    ¤ Given observed data $D = (x_n, y_n)_{n=1}^{N}$ ...
    ¤ The goal is to obtain $p(y \mid x, w) = p(D \mid w)$.
    ¤ In the Bayesian learning framework, we place prior knowledge on the parameters $w$.
    ¤ After observing $D$, the posterior over $w$ is
      $p(w \mid D) = \dfrac{p(D \mid w)\,p(w)}{p(D)} = \dfrac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw}$
    ¤ This procedure is called Bayesian inference.
    ¤ Obtaining the posterior requires the marginalization in the denominator -> variational inference.
  • 4. Variational inference
    ¤ Introduce an approximate distribution $q(w \mid \phi)$ and minimize its distance to the true posterior, $D_{KL}[q(w \mid \phi)\,\|\,p(w \mid D)]$.
    ¤ This is equivalent to maximizing the variational lower bound
      $\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w))$, where $L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}[\log p(y_n \mid x_n, w)]$.
    ¤ With the reparameterization trick, the lower bound becomes differentiable with respect to $\phi$: the parametric noise $q_\phi(w)$ is written as a deterministic differentiable function $w = f(\phi, \epsilon)$ of non-parametric noise $\epsilon \sim p(\epsilon)$.
    ¤ For a minibatch $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$, unbiased estimators of the bound and of its gradient are
      $L_D(\phi) \simeq L_D^{SGVB}(\phi) = \frac{N}{M} \sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m))$,
      $\nabla_\phi L_D(\phi) \simeq \frac{N}{M} \sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m))$.
      (Eqs. (1)-(5) of the paper; a minimal code sketch follows below.)
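The estimator above fits in a few lines. The following is a minimal NumPy sketch (not the authors' code): it draws one reparameterized weight sample per minibatch and returns the single-sample SGVB estimate of the expected log-likelihood for a toy linear model with a Gaussian likelihood; the function name, the toy likelihood, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgvb_log_likelihood(theta, log_sigma, x_batch, y_batch, N):
    """Single-sample estimate of (N/M) * sum_m log p(y_m | x_m, w)."""
    eps = rng.standard_normal(theta.shape)
    w = theta + np.exp(log_sigma) * eps                  # reparameterized sample w = f(phi, eps)
    pred = x_batch @ w                                   # toy linear model
    log_p = -0.5 * np.sum((y_batch - pred) ** 2, axis=1) # Gaussian log-likelihood up to a constant
    M = x_batch.shape[0]
    return (N / M) * np.sum(log_p)

# toy usage
theta = rng.standard_normal((5, 1))
log_sigma = np.full((5, 1), -2.0)
x = rng.standard_normal((8, 5))
y = rng.standard_normal((8, 1))
print(sgvb_log_likelihood(theta, log_sigma, x, y, N=1000))
```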
  • 5. Dropout
    ¤ For a fully connected layer $B = AW$ (input $A^{M \times I}$, weights $W^{I \times O}$, output $B^{M \times O}$), dropout injects multiplicative random noise $\Xi$ into the layer input at every training step: $B = (A \odot \Xi)W$, with $\xi_{mi} \sim p(\xi)$.
    ¤ Bernoulli and Gaussian distributions are used to sample the noise; Gaussian dropout with continuous noise $\xi_{mi} \sim \mathcal{N}(1, \alpha)$, $\alpha = \frac{p}{1-p}$, behaves similarly to binary dropout with rate $p$.
    ¤ Putting Gaussian noise on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha \theta_{ij}^2)$.
    ¤ The random weight $w_{ij}$ is then parameterized by $\theta_{ij}$ as
      $w_{ij} = \theta_{ij}\,\xi_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij})$, $\epsilon_{ij} \sim \mathcal{N}(0, 1)$.
      (A short code sketch follows below.)
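To make the weight-noise view concrete, here is a minimal illustrative sketch (not from the paper) of a Gaussian-dropout forward pass that samples one weight matrix per minibatch via $w_{ij} = \theta_{ij}(1 + \sqrt{\alpha}\,\epsilon_{ij})$ with $\alpha = p/(1-p)$; the function name and the toy shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout_forward(A, theta, p=0.5):
    """B = A @ W with one Gaussian-noise weight sample per minibatch; alpha = p / (1 - p)."""
    alpha = p / (1.0 - p)
    eps = rng.standard_normal(theta.shape)
    W = theta * (1.0 + np.sqrt(alpha) * eps)   # w_ij ~ N(theta_ij, alpha * theta_ij^2)
    return A @ W

A = rng.standard_normal((4, 3))       # minibatch of 4 inputs with 3 features
theta = rng.standard_normal((3, 2))   # mean weights of a 3 -> 2 layer
print(gaussian_dropout_forward(A, theta, p=0.5).shape)   # (4, 2)
```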
  • 6. Variational Dropout
    ¤ Treating $q(W \mid \theta, \alpha)$ as an approximate posterior with parameters $(\theta, \alpha)$, these parameters can be learned by variational inference (Variational Dropout).
    ¤ With $\alpha$ fixed, Variational Dropout is equivalent to Gaussian Dropout, because the KL term becomes constant: the prior is the improper log-scale uniform $p(\log|w_{ij}|) = \mathrm{const}$, i.e. $p(|w_{ij}|) \propto 1/|w_{ij}|$.
    ¤ In Variational Dropout, $\alpha$ is a learnable parameter, so it can be determined automatically during training.
    ¤ However, in the earlier work [Kingma+ 2015], $\alpha$ is restricted to at most 1: with too much noise, the variance of the gradients becomes large.
    ¤ Still, being able to set $\alpha$ up to infinity (i.e. a dropout rate of 1) should give interesting results.
  • 7. Additive Noise Reparameterization
    ¤ In the gradient of the lower bound, $\frac{\partial L^{SGVB}}{\partial \theta_{ij}} = \frac{\partial L^{SGVB}}{\partial w_{ij}} \cdot \frac{\partial w_{ij}}{\partial \theta_{ij}}$, the second factor becomes noisy when $\alpha$ is large: with the original parameterization $w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij})$ we get $\frac{\partial w_{ij}}{\partial \theta_{ij}} = 1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}$, which grows with $\alpha$.
    ¤ The trick is to rewrite the multiplicative noise as an exactly equivalent additive noise term, treating $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$ as a new independent variable:
      $w_{ij} = \theta_{ij}(1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}) = \theta_{ij} + \sigma_{ij}\,\epsilon_{ij}$, so $\frac{\partial w_{ij}}{\partial \theta_{ij}} = 1$, $\epsilon_{ij} \sim \mathcal{N}(0, 1)$.
    ¤ This drastically reduces the variance of the gradient, so $\alpha$ can be made arbitrarily large (up to $\infty$). (A short code sketch follows below.)
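A minimal sketch of the additive parameterization, assuming weights are stored as $(\theta, \log \sigma^2)$ as in the later slides: sampling is $w = \theta + \sigma \epsilon$, and $\log \alpha$ is recovered as $\log \sigma^2 - 2\log|\theta|$. The helper names are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights_additive(theta, log_sigma2):
    """w = theta + sigma * eps, so dw/dtheta = 1 regardless of how large alpha is."""
    eps = rng.standard_normal(theta.shape)
    sigma = np.exp(0.5 * log_sigma2)
    return theta + sigma * eps

def log_alpha(theta, log_sigma2, eps=1e-8):
    """alpha_ij = sigma_ij^2 / theta_ij^2, recovered from the (theta, log sigma^2) parameterization."""
    return log_sigma2 - 2.0 * np.log(np.abs(theta) + eps)

theta = rng.standard_normal((3, 2))
log_sigma2 = np.full((3, 2), -4.0)
w = sample_weights_additive(theta, log_sigma2)
print(w.shape, log_alpha(theta, log_sigma2).max())
```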
  • 8. The KL term
    ¤ The approximation of the KL (regularization) term proposed in [Kingma+15] is valid only for $\alpha \le 1$.
    ¤ This work proposes an approximation that is tight for all values of $\alpha$:
      $-D_{KL}(q(w_{ij} \mid \theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})) \approx k_1\,\sigma(k_2 + k_3 \log \alpha_{ij}) - 0.5 \log(1 + \alpha_{ij}^{-1}) + C$,
      with $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$, and $\sigma(\cdot)$ the sigmoid function. (A short code sketch follows below.)
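The approximation is cheap to evaluate. Below is a small sketch (function name assumed, not from the paper) that computes the approximate negative KL per weight from $\log \alpha$ using the constants quoted on the slide, up to the additive constant $C$.

```python
import numpy as np

K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl_approx(log_alpha):
    """Approximate -D_KL(q(w | theta, alpha) || p(w)) per weight, up to the constant C."""
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_alpha))

# As log(alpha) grows the term approaches a constant, so weights with very large
# dropout rates are no longer penalized by the regularizer.
print(neg_kl_approx(np.array([-5.0, 0.0, 5.0])))
```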
  • 9. Computing Sparse Variational Dropout
    ¤ When training on the lower bound, the Local Reparameterization Trick [Kingma+15] is applied on top of the proposed Additive Noise Reparameterization to further suppress the variance.
    ¤ See the earlier reading-group slides for the Local Reparameterization Trick.
    ¤ It is applicable not only to fully connected layers but also to convolutional layers.
    ¤ For a fully connected layer, the pre-activations are sampled directly:
      $b_{mj} \sim \mathcal{N}(\gamma_{mj}, \delta_{mj})$, $\gamma_{mj} = \sum_{i=1}^{I} a_{mi}\theta_{ij}$, $\delta_{mj} = \sum_{i=1}^{I} a_{mi}^2 \sigma_{ij}^2$.
    ¤ For a convolutional layer with input tensor $A_m$, filter parameters $\theta_k, \sigma_k$ and output $b_{mk}$, linearity makes the same trick applicable:
      $\mathrm{vec}(b_{mk}) \sim \mathcal{N}(\gamma_{mk}, \delta_{mk})$, $\gamma_{mk} = \mathrm{vec}(A_m * \theta_k)$, $\delta_{mk} = \mathrm{diag}(\mathrm{vec}(A_m^2 * \sigma_k^2))$,
      where $*$ denotes convolution and $(\cdot)^2$ is element-wise.
    ¤ To improve convergence, optimization is performed w.r.t. $(\theta, \log \sigma^2)$. (A sketch of the fully connected case follows below.)
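A minimal sketch of the fully connected case, assuming parameters stored as $(\theta, \log \sigma^2)$: the pre-activations are sampled from $\mathcal{N}(\gamma, \delta)$ with $\gamma = A\theta$ and $\delta = A^2 \sigma^2$. Names and shapes are illustrative; test-time pruning of weights with large $\log \alpha$ is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_vd_dense(A, theta, log_sigma2):
    """Sample pre-activations b ~ N(gamma, delta), gamma = A @ theta, delta = A^2 @ sigma^2."""
    gamma = A @ theta                        # mean of the pre-activation
    delta = (A ** 2) @ np.exp(log_sigma2)    # variance of the pre-activation
    eps = rng.standard_normal(gamma.shape)
    return gamma + np.sqrt(delta + 1e-8) * eps

A = rng.standard_normal((4, 3))           # minibatch of 4 inputs, 3 features
theta = rng.standard_normal((3, 2))       # weight means of a 3 -> 2 layer
log_sigma2 = np.full((3, 2), -4.0)        # log of the weight variances sigma^2 = alpha * theta^2
print(sparse_vd_dense(A, theta, log_sigma2).shape)    # (4, 2)
```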
  • 10. Experimental setup
    ¤ $\alpha$ is clipped at $\log \alpha = 3$ (i.e. a dropout rate of up to about 0.95).
    ¤ Training is first run for a while without the proposed method (pre-training).
    ¤ Without pre-training, the sparsity level is higher but accuracy drops; this is said to be a common issue with Bayesian DNNs.
    ¤ The pre-training used here is about 10-30 epochs.
    ¤ See the paper for the remaining settings.
  • 11. Validating Additive Noise Reparameterization
    ¤ Checks whether Additive Noise Reparameterization actually suppresses the gradient variance.
    ¤ The comparison is against the variant without the proposed reparameterization, in terms of sparsity and the value of the lower bound.
    ¤ (Figure 2 of the paper: original parameterization vs. Additive Noise Reparameterization.)
    ¤ With the proposed method, the network becomes sparse faster and the lower bound converges faster.
  • 12. MNIST
    ¤ LeNet trained on MNIST: LeNet-300-100 (fully connected) and LeNet-5-Caffe (convolutional).
    ¤ Compared against Pruning [Han+ 15], Dynamic Network Surgery [Guo+ 16], and Soft Weight Sharing [Ullrich+ 17].
    ¤ (Table 1 of the paper.) Sparse VD gives the highest compression at comparable error: on LeNet-300-100, 1.92% error with compression $|W|/|W_{\ne 0}| = 68$ (vs. 12 for Pruning, 56 for DNS, 23 for SWS), and on LeNet-5-Caffe, 0.75% error with compression 280 (vs. 12, 111, and 200).
    ¤ The proposed method is the sparsest on both architectures.
  • 13. CIFAR-10 and CIFAR-100
    ¤ A VGG-like network [Zagoruyko+ 15] is trained on CIFAR-10 and CIFAR-100.
    ¤ Experiments vary the unit-size scaling factor $k$.
    ¤ Accuracy stays roughly the same while sparsity reaches up to 65x (on CIFAR-10).
  • 14. Learning random labels
    ¤ [Zhang+ 16] showed that CNNs can memorize the training data even with random labels, and ordinary dropout does not fix this.
    ¤ With the proposed method (Sparse VD), training drives all weights to a single value and the network ends up making a constant prediction.
    ¤ Moreover, the sparsity reaches 100%; at 100% sparsity all weights are zero (see Section 4.3).
    ¤ (Table 2 of the paper: with random labels on MNIST and CIFAR-10, binary-dropout networks reach training accuracy 1.0, while Sparse VD stays at chance level with 100% sparsity.)
    ¤ The proposed method may thus be penalizing memorization and promoting generalization.
  • 15. Summary
    ¤ This work proposes a reparameterization for Variational Dropout in which the gradient variance does not grow even when $\alpha$ is large.
    ¤ It can be combined with the Local Reparameterization Trick of [Kingma+ 15] and is also applicable to CNNs.
    ¤ Experiments show that it achieves higher sparsity than existing methods.
    ¤ It is further shown that the problem of DNNs easily fitting randomly labeled data does not occur with this method.