# (DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks

2017/02/24 DL Reading Group



1. Variational Dropout Sparsifies Deep Neural Networks (2017/03/24, 鈴木雅大 / Masahiro Suzuki)
2. About this paper
   - Authors: Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov (Skolkovo Institute of Science and Technology; National Research University Higher School of Economics; Moscow Institute of Physics and Technology).
   - ICML 2017 submission (posted to arXiv on 2017/2/27).
   - Proposes a method that lets the dropout rate of Bayesian dropout be pushed all the way to its maximum.
     - The idea itself is extremely simple.
     - It not only yields high sparsity but also addresses a generalization issue of ordinary CNNs.
   - Why I chose it:
     - It is an extension of a paper we read in this group two years ago [Kingma+ 15].
     - I like simple ideas that achieve a large effect.
3. Bayesian inference
   - Suppose we observe data $D = \{(x_n, y_n)\}_{n=1}^{N}$.
   - The goal is to model $p(y \mid x, w)$; over the whole dataset this gives the likelihood $p(D \mid w)$.
   - In the Bayesian learning framework we express prior knowledge about the parameters $w$ through a prior $p(w)$.
   - After observing $D$, the posterior over $w$ is
     $$p(w \mid D) = \frac{p(D \mid w)\,p(w)}{p(D)} = \frac{p(D \mid w)\,p(w)}{\int p(D \mid w)\,p(w)\,dw}$$
   - This procedure is called Bayesian inference.
   - Computing the posterior requires the marginalization in the denominator, which is generally intractable, hence variational inference (a small numerical sketch of the intractability follows below).
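As a concrete illustration of the marginalization problem on this slide, here is a minimal numpy sketch (my own example, not from the paper or the slides) that computes the posterior of a single scalar weight by brute-force grid integration of the denominator. The hypothetical 1-D regression model, the prior, and all variable names are assumptions for illustration; with the millions of weights of a deep network this integral is exactly what becomes intractable.

```python
import numpy as np

# Hypothetical 1-D regression: y = w * x + noise, with a standard-normal prior on w.
rng = np.random.default_rng(0)
true_w, noise_std = 2.0, 0.5
x = rng.uniform(-1, 1, size=20)
y = true_w * x + rng.normal(0, noise_std, size=20)

# Grid over the single parameter w: tractable only because w is 1-dimensional.
w_grid = np.linspace(-5, 5, 1001)

# log p(D | w) for every grid point (Gaussian likelihood).
log_lik = np.array([
    np.sum(-0.5 * ((y - w * x) / noise_std) ** 2 - np.log(noise_std * np.sqrt(2 * np.pi)))
    for w in w_grid
])
log_prior = -0.5 * w_grid ** 2 - 0.5 * np.log(2 * np.pi)  # p(w) = N(0, 1)

# p(D) = integral of p(D | w) p(w) dw, approximated by the grid sum (the hard part in general).
log_joint = log_lik + log_prior
log_evidence = np.log(np.trapz(np.exp(log_joint - log_joint.max()), w_grid)) + log_joint.max()

# Posterior p(w | D) = p(D | w) p(w) / p(D), evaluated on the grid.
log_posterior = log_joint - log_evidence
print("posterior mean of w:", np.trapz(w_grid * np.exp(log_posterior), w_grid))
```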
4. Variational inference
   - Introduce an approximate distribution $q(w \mid \phi)$ and minimize its distance $D_{KL}[\,q(w \mid \phi)\,\|\,p(w \mid D)\,]$ to the true posterior.
   - This is equivalent to maximizing the variational lower bound
     $$\mathcal{L}(\phi) = L_D(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w)) \to \max_{\phi} \qquad (1)$$
     $$L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\big[\log p(y_n \mid x_n, w)\big] \qquad (2)$$
     which consists of the expected log-likelihood $L_D(\phi)$ and the KL divergence, acting as a regularizer.
   - With the reparameterization trick, the noise in $q_\phi(w)$ is written as a deterministic differentiable function $w = f(\phi, \epsilon)$ of non-parametric noise $\epsilon \sim p(\epsilon)$, so the lower bound becomes differentiable with respect to $\phi$.
   - For a minibatch $(\tilde{x}_m, \tilde{y}_m)_{m=1}^{M}$, unbiased estimators of the lower bound and of its gradient are (see the sketch below)
     $$\mathcal{L}(\phi) \simeq \mathcal{L}^{SGVB}(\phi) = L_D^{SGVB}(\phi) - D_{KL}(q_\phi(w)\,\|\,p(w)) \qquad (3)$$
     $$L_D(\phi) \simeq L_D^{SGVB}(\phi) = \frac{N}{M}\sum_{m=1}^{M} \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \qquad (4)$$
     $$\nabla_\phi L_D(\phi) \simeq \frac{N}{M}\sum_{m=1}^{M} \nabla_\phi \log p(\tilde{y}_m \mid \tilde{x}_m, f(\phi, \epsilon_m)) \qquad (5)$$
   - The paper additionally uses the local reparameterization trick to further reduce the variance of this gradient estimator.
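To make equations (3) through (5) concrete, the following PyTorch sketch takes one stochastic gradient step for a mean-field Gaussian posterior over the weights of a single linear layer. It is my own illustration, not the authors' code: the names `mu` and `log_sigma`, the Gaussian likelihood with fixed variance, and the synthetic minibatch are all assumptions.

```python
import torch

# Dataset size N, minibatch size M, layer shape.
N, M, in_dim, out_dim = 10_000, 64, 20, 1
mu = torch.zeros(in_dim, out_dim, requires_grad=True)                 # variational mean
log_sigma = torch.full((in_dim, out_dim), -3.0, requires_grad=True)   # variational log std
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

# One synthetic minibatch (x_m, y_m).
x = torch.randn(M, in_dim)
y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(M, 1)

# Reparameterization trick: w = f(phi, eps) = mu + sigma * eps, eps ~ N(0, 1),
# so the sampling noise is separated from the parameters and gradients flow into phi.
eps = torch.randn_like(mu)
w = mu + torch.exp(log_sigma) * eps

# L_D^SGVB(phi) = N/M * sum_m log p(y_m | x_m, w), eq. (4), with an assumed Gaussian likelihood.
log_lik = -0.5 * ((y - x @ w) / 0.1) ** 2
L_D = (N / M) * log_lik.sum()

# The full objective (3) would also subtract D_KL(q_phi(w) || p(w)); omitted here.
loss = -L_D
loss.backward()   # autograd computes the unbiased gradient estimate of eq. (5)
opt.step()
```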
5. Dropout
   - For a fully connected layer $B = AW$ (input $A_{M \times I}$, weight matrix $W_{I \times O}$, output $B_{M \times O}$ before a nonlinearity), dropout injects multiplicative random noise $\Xi$ into the layer input at every training iteration:
     $$B = (A \odot \Xi)W, \qquad \xi_{mi} \sim p(\xi) \qquad (6)$$
   - The noise distribution is typically Bernoulli (binary dropout with rate $p$) or Gaussian, $\xi_{mi} \sim \mathcal{N}(1, \alpha = \tfrac{p}{1-p})$, which behaves similarly to binary dropout with rate $p$.
   - Putting Gaussian noise on $W$ is equivalent to sampling $W$ from $q(w_{ij} \mid \theta_{ij}, \alpha) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2)$, which yields a posterior distribution over the model's weights.
   - The weight $w_{ij}$ then becomes a random variable parameterized by $\theta_{ij}$ (see the numerical check below):
     $$w_{ij} = \theta_{ij}\xi_{ij} = \theta_{ij}\big(1 + \sqrt{\alpha}\,\epsilon_{ij}\big) \sim \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha\theta_{ij}^2), \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1) \qquad (7)$$
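A quick numerical check of the equivalence stated in (6) and (7): multiplying a fixed weight $\theta$ by noise $\xi \sim \mathcal{N}(1, \alpha)$ gives samples with the same mean and variance as drawing $w$ directly from $\mathcal{N}(\theta, \alpha\theta^2)$. This numpy snippet is an illustration with arbitrary values of $\theta$ and $\alpha$, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha, n = 1.7, 0.25, 1_000_000

# Gaussian dropout view: w = theta * xi = theta * (1 + sqrt(alpha) * eps), eq. (7).
eps = rng.standard_normal(n)
w_dropout = theta * (1.0 + np.sqrt(alpha) * eps)

# Weight-posterior view: w ~ N(theta, alpha * theta^2).
w_posterior = rng.normal(theta, np.sqrt(alpha) * abs(theta), size=n)

for name, w in [("dropout noise", w_dropout), ("direct sample", w_posterior)]:
    print(f"{name}: mean={w.mean():.3f}, var={w.var():.3f}")
# Both print mean close to 1.700 and variance close to alpha * theta^2 = 0.7225.
```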
6. Variational Dropout
   - If $q(W \mid \theta, \alpha)$ is treated as an approximate posterior with variational parameters $\phi = (\theta, \alpha)$, those parameters can be learned by stochastic variational inference; this is Variational Dropout.
   - The prior $p(W)$ is chosen to be the improper log-scale uniform distribution, the only prior that makes variational inference consistent with Gaussian dropout [Kingma+ 2015]:
     $$p(\log |w_{ij}|) = \text{const} \;\Leftrightarrow\; p(|w_{ij}|) \propto \frac{1}{|w_{ij}|} \qquad (8)$$
   - When $\alpha$ is fixed, Variational Dropout becomes equivalent to Gaussian dropout: the term $D_{KL}(q(W \mid \theta, \alpha)\,\|\,p(W))$ is then constant, so maximizing the lower bound (1) reduces to maximizing the expected log-likelihood (2).
   - In Variational Dropout, however, $\alpha$ is itself a learned parameter: it can be determined automatically during training.
   - The prior work [Kingma+ 2015] restricts $\alpha \le 1$, because large noise makes the variance of the gradients large.
   - Still, it would be interesting if $\alpha$ could be pushed to infinity (i.e. a dropout rate of 1); a sketch of such a layer follows below.
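The following PyTorch sketch is my reading of this slide, not the authors' reference implementation: a linear layer whose approximate posterior is $q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) = \mathcal{N}(\theta_{ij}, \alpha_{ij}\theta_{ij}^2)$ with both $\theta$ and $\log\alpha$ trainable. The class name and initial values are assumptions; the KL expression uses the sigmoid-based approximation for the log-uniform prior reported in the Molchanov et al. paper, and the paper additionally uses the local reparameterization trick and an additive noise reparameterization that this sketch omits for brevity.

```python
import torch
import torch.nn.functional as F

class VariationalDropoutLinear(torch.nn.Module):
    """Hypothetical variational-dropout linear layer: q(w | theta, alpha) = N(theta, alpha * theta^2)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.randn(in_features, out_features) * 0.1)
        self.log_alpha = torch.nn.Parameter(torch.full((in_features, out_features), -3.0))

    def forward(self, x):
        if self.training:
            # w = theta * (1 + sqrt(alpha) * eps), eps ~ N(0, 1), as in eq. (7).
            eps = torch.randn_like(self.theta)
            w = self.theta * (1.0 + torch.exp(0.5 * self.log_alpha) * eps)
        else:
            w = self.theta  # use the posterior mean at test time
        return x @ w

    def kl(self):
        # Approximation of D_KL(q || log-uniform prior) reported in the paper;
        # with this prior the KL term depends only on alpha, not on theta.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        neg_kl = (k1 * torch.sigmoid(k2 + k3 * self.log_alpha)
                  - 0.5 * F.softplus(-self.log_alpha) - k1)
        return -neg_kl.sum()
```

In use, one would minimize the scaled negative minibatch log-likelihood plus the sum of `kl()` over all such layers, matching the lower bound (1); weights whose learned $\alpha$ grows very large (effectively a dropout rate of 1) can then be pruned to zero, which is the source of the sparsity in the paper's title.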