82. Example: Recommender Systems
[Figure: the preference matrix $W^*$ is an $N \times M$ matrix indexed by movies and users, with only a small fraction of its entries observed]
→ Low-rank matrix completion.
Rademacher complexity of low-rank matrices: Srebro et al. (2005).
Compressed sensing: Candes and Tao (2009), Candes and Recht (2009).
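To make the estimator concrete, here is a minimal sketch (not taken from the slides) of nuclear-norm-regularized matrix completion via iterative singular-value soft-thresholding; the function name, `lam`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def complete_low_rank(M, observed, lam, n_iter=100):
    """Soft-impute-style sketch: alternately fill the unobserved entries
    with the current estimate, then apply the proximal map of the
    nuclear norm (soft-thresholding of the singular values)."""
    W = np.zeros_like(M, dtype=float)
    for _ in range(n_iter):
        Z = np.where(observed, M, W)                # keep observed ratings
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        W = (U * np.maximum(s - lam, 0.0)) @ Vt     # shrink singular values
    return W
```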
83. Example: Reduced-Rank Regression
Reduced-rank regression (Anderson, 1951, Burket, 1964, Izenman, 1975).
Multi-task learning (Argyriou et al., 2008).
Reduced-rank regression:
$$Y = X W^* + E, \qquad Y \in \mathbb{R}^{n \times M},\ X \in \mathbb{R}^{n \times N},\ W^* \in \mathbb{R}^{N \times M},$$
where $W^*$ is low rank.
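As a sketch of how such an estimator can be computed in the simplest, unweighted setting (the function name and interface are hypothetical), the classical solution projects the OLS fit onto its leading singular directions:

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Fit W with rank(W) <= rank in the model Y ≈ X W: compute the OLS
    solution, then project the fitted values onto their top right
    singular directions (the classical recipe, unweighted case)."""
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)  # N x M least-squares fit
    _, _, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
    V_r = Vt[:rank].T                              # top response directions
    return W_ols @ V_r @ V_r.T                     # rank-constrained estimate
```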
84. Sparse Covariance Selection
$x_k \sim N(0, \Sigma)$ (i.i.d., $\Sigma \in \mathbb{R}^{p \times p}$), $\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^\top$.
$$\hat{S} = \operatorname*{argmin}_{S:\ \text{symmetric positive semidefinite}} \left\{ -\log(\det(S)) + \mathrm{Tr}[S \hat{\Sigma}] + \lambda \sum_{i,j=1}^{p} |S_{i,j}| \right\}.$$
(Meinshausen and Bühlmann, 2006, Yuan and Lin, 2007, Banerjee et al., 2008)
Estimate $S = \Sigma^{-1}$, the inverse of the covariance matrix.
$S_{i,j} = 0 \Leftrightarrow X^{(i)}$ and $X^{(j)}$ are conditionally independent.
Thus the Gaussian graphical model can be estimated by convex optimization.
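As a practical illustration, the sketch below estimates S with scikit-learn's `GraphicalLasso`; the data-generating precision matrix and the value of `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, n = 5, 500
S_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # sparse precision
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(S_true), size=n)

# L1-penalized maximum likelihood estimate of the precision matrix S.
model = GraphicalLasso(alpha=0.1).fit(X)
print(np.round(model.precision_, 2))  # zeros <=> conditional independence
```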
137. Minimax rate:
$$\min_{\hat{\beta}} \max_{\beta^*} \mathrm{E}\,\|\hat{\beta} - \beta^*\|_2^2 \ge C\, \frac{d \log(p/d)}{n}.$$
Lasso achieves the minimax rate (up to the $\frac{d \log(d)}{n}$ term).
Extensions of this result to multiple kernel learning: Raskutti et al. (2012), Suzuki and Sugiyama (2013).
156. Relations among the conditions (each condition implies the bounds to its right; the conditions become weaker from top to bottom):

RI(2d, δ) → exact recovery in compressed sensing
  ⇓
RE(2d, 3) → $\frac{d \log(p)}{n}$ (prediction), $\frac{d \log(p)}{n}$ ($\ell_2$), $\frac{d^2 \log(p)}{n}$ ($\ell_1$)
  ⇓
COM(J, 3) → $\frac{d \log(p)}{n}$ (prediction), $\frac{d^2 \log(p)}{n}$ ($\ell_2$), $\frac{d^2 \log(p)}{n}$ ($\ell_1$)
Related topics are covered comprehensively in Bühlmann and van de Geer (2011).
264. Alternating Direction Method of Multipliers (ADMM)
$$\min_{x,y}\ f(x) + \psi(y) \quad \text{s.t.}\ x = y.$$
Augmented Lagrangian:
$$L(x, y, \lambda) = f(x) + \psi(y) + \lambda^\top (y - x) + \frac{\rho}{2}\|y - x\|^2.$$
Alternating direction method of multipliers (Gabay and Mercier, 1976):
$$x^{(k+1)} = \operatorname*{argmin}_{x} \left\{ f(x) - \lambda^{(k)\top} x + \frac{\rho}{2}\|y^{(k)} - x\|^2 \right\}$$
$$y^{(k+1)} = \operatorname*{argmin}_{y} \left\{ \psi(y) + \lambda^{(k)\top} y + \frac{\rho}{2}\|y - x^{(k+1)}\|^2 \right\} \ \left( = \mathrm{prox}_{\psi/\rho}\!\left(x^{(k+1)} - \lambda^{(k)}/\rho\right) \right)$$
$$\lambda^{(k+1)} = \lambda^{(k)} - \rho\left(x^{(k+1)} - y^{(k+1)}\right)$$
Joint optimization over x and y is avoided by optimizing them alternately.
The y-update is a proximal map; it is simple for cases such as L1 regularization.
Extensions to structured regularization are also straightforward.
Convergence to an optimal solution is guaranteed:
$O(1/k)$ in general (He and Yuan, 2012), and linear convergence under strong convexity (Deng and Yin, 2012, Hong and Luo, 2012).
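As a concrete instance of these updates, the following minimal sketch applies them to the lasso with $f(x) = \frac{1}{2}\|Ax - b\|^2$ and $\psi(y) = \kappa\|y\|_1$, so the y-update is exactly soft-thresholding; `rho` and the iteration count are illustrative assumptions.

```python
import numpy as np

def admm_lasso(A, b, kappa, rho=1.0, n_iter=200):
    """ADMM sketch for min_x 0.5*||A x - b||^2 + kappa*||x||_1,
    split as f(x) = 0.5*||A x - b||^2, psi(y) = kappa*||y||_1, x = y."""
    n, p = A.shape
    x, y, lam = np.zeros(p), np.zeros(p), np.zeros(p)
    G = A.T @ A + rho * np.eye(p)   # x-update reduces to a linear system
    Atb = A.T @ b
    for _ in range(n_iter):
        # x-update: argmin_x f(x) - lam^T x + (rho/2)||y - x||^2
        x = np.linalg.solve(G, Atb + lam + rho * y)
        # y-update: prox_{psi/rho}(x - lam/rho), i.e. soft-thresholding
        v = x - lam / rho
        y = np.sign(v) * np.maximum(np.abs(v) - kappa / rho, 0.0)
        # multiplier update: lam = lam - rho*(x - y)
        lam = lam - rho * (x - y)
    return y
```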
265. Stochastic Optimization
Useful for large-scale data with many samples.
A single update does not require reading all of the samples.
■ Online type (see the sketch at the end of this slide)
FOBOS (Duchi and Singer, 2009)
RDA (Xiao, 2009)
■ Batch type
SVRG (Stochastic Variance Reduced Gradient) (Johnson and Zhang, 2013)
SDCA (Stochastic Dual Coordinate Ascent) (Shalev-Shwartz and Zhang, 2013)
SAG (Stochastic Average Gradient) (Le Roux et al., 2013)
Stochastic alternating direction method of multipliers: Suzuki (2013), Ouyang et al. (2013), Suzuki (2014).
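To illustrate the online type concretely, here is a minimal FOBOS-style sketch (a forward gradient step on a single sample, followed by the L1 proximal step); the step-size schedule and interface are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fobos_l1(A, b, kappa, n_epochs=5, eta0=1.0):
    """Each update touches ONE sample: a stochastic gradient step on the
    squared loss, then the proximal map of kappa*||.||_1."""
    n, p = A.shape
    x, t = np.zeros(p), 0
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / np.sqrt(t)       # decaying step size
            g = (A[i] @ x - b[i]) * A[i]  # gradient of one loss term
            z = x - eta * g               # forward (gradient) step
            x = np.sign(z) * np.maximum(np.abs(z) - eta * kappa, 0.0)  # prox step
    return x
```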
References

P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.
T. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.
A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 25–32, Cambridge, MA, 2008. MIT Press.
O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop 2007, 2007.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer, 2011.
F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.
G. R. Burket. A study of reduced-rank models for multiple prediction, volume 12 of Psychometric monographs. Psychometric Society, 1964.
E. Candes and T. Tao. The power of convex relaxations: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2009.
E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E. J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61, 2008.
W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 2001.
O. Fercoq and P. Richtarik. Accelerated, parallel and proximal coordinate descent. Technical report, 2013. arXiv:1312.5799.
O. Fercoq, Z. Qu, P. Richtarik, and M. Takac. Fast distributed coordinate descent for non-strongly convex losses. In Proceedings of MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing, 2014.
I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
T. Hastie and R. Tibshirani. Generalized additive models. Chapman & Hall Ltd, 1999.
B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Technical report, 2012. arXiv:1208.3922.
A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, pages 248–264, 1975.
L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.
A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, to appear, 2014.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.
Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical report, 2014. arXiv:1407.1296.
R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test for the lasso. The Annals of Statistics, 42(2):413–468, 2014.
K. Lounici, A. Tsybakov, M. Pontil, and S. van de Geer. Taking advantage of sparsity in multi-task learning. 2009.
P. Massart. Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour 23. Springer, 2003.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
P. Richtarik and M. Takac. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, Series A, 144:1–38, 2014.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
G. Raskutti and M. J. Wainwright. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389–427, 2012.
P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009–1030, 2009.
P. Richtarik and M. Takac. Distributed coordinate descent method for learning with big data. Technical report, 2013. arXiv:1310.2059.
P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 39, 2013.
A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013. arXiv:1211.2717.
N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems (NIPS) 17, 2005.
T. Suzuki. PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. In JMLR Workshop and Conference Proceedings, volume 23, pages 8.1–8.20, 2012. Conference on Learning Theory (COLT2012).
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness. The Annals of Statistics, 41(3):1381–1405, 2013.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.
S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
P. Zhang, A. Saha, and S. V. N. Vishwanathan. Regularized risk minimization by Nesterov's accelerated gradient methods: Algorithmic extensions and empirical studies. CoRR, abs/1011.0472, 2010.
T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.