AlphaGo

1. Mastering the Game of Go with Deep Neural Networks and Tree Search
2. Tic-tac-toe (OXO) as a warm-up:
    state-space complexity: 10^3
     simple upper bound: 3^9 = 19,683
     legal positions: 5,478
     rotation and reflection identical: 765
    game-tree complexity: 10^5
     simple upper bound: 9! = 362,880
     stop at win: 255,168
     rotation and reflection identical: 26,830
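These counts are small enough to check by brute force. A short, self-contained Python sketch (board as a 9-character string, X moves first) reproduces the 5,478 reachable positions and the 255,168 games that stop as soon as someone wins:

```python
# Exhaustively enumerate tic-tac-toe to check the counts on this slide.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def explore(board, player, seen):
    """Count complete games from `board`, collecting every reachable position."""
    if winner(board) is not None or ' ' not in board:
        return 1                                  # terminal: one finished game
    games = 0
    for i in range(9):
        if board[i] == ' ':
            nxt = board[:i] + player + board[i + 1:]
            seen.add(nxt)
            games += explore(nxt, 'O' if player == 'X' else 'X', seen)
    return games

seen = {' ' * 9}                                  # include the empty board
games = explore(' ' * 9, 'X', seen)
print(len(seen), games)                           # 5478 255168
```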
3. Game          | Board Size | State-Space Complexity | Game-Tree Complexity | Masterpiece
   OXO           | 3x3        | 10^3                   | 10^5                 | OXO (1952)
   Gomoku        | 15x15      | 10^105                 | 10^70                | Victoria (1994)
   Othello       | 8x8        | 10^28                  | 10^58                | Logistello (1997)
   Chess         | 8x8        | 10^47                  | 10^123               | Deep Blue (1997)
   Chinese Chess | 9x10       | 10^40                  | 10^150               | 棋天大聖 (2006)
   Go            | 19x19      | 10^170                 | 10^360               | AlphaGo (2016)
4. 4 stones
5.  HandTalk / Goemate (1993~2000)
     developed by 陳志行 (Chen Zhixing)
     reached amateur 5k (Japan 1d) level
     written in assembly, rule-based
     highly dependent on domain knowledge
     won with an 11-stone handicap vs. an amateur 6d (2 of 3) in the 應氏盃 (Ing Cup); prize: NTD$250,000
     if a Go program could win an even game vs. a professional (4 of 7), the prize was NTD$40,000,000 (USD$1,000,000)
6.  Goal-oriented sub-games
     search for connections, dividers, life & death, territory extensions… and resolve them
    Combinatorial Game Theory
     the effect of who plays first at a given position
    Opening Book
     database of openings, life & death problems…
    Territory and Influence
     erosion, dilation, closing…
    Board State Evaluation
     no effective means
7.  Crazy Stone (2006~)
     developed by Rémi Coulom
     reached amateur 6d level
     Monte Carlo Tree Search (MCTS)
    Zen
8. [Figure: game tree — evaluation of non-terminal positions (highly dependent on domain knowledge) vs. terminal positions]
9. [Figure: numeric grid illustrating a hand-crafted static evaluation function — per-point positional values plus a list of piece values]
10.  局勢已贏,專精求生。局勢已弱,銳意侵綽。
      If you judge the position already won, inspect the whole board carefully and reinforce your relatively weak groups early; if the position is unfavorable, consider bold invasions instead.
     沿邊而走,雖得其生者,敗。弱而不伏者,愈屈。躁而求勝者,多敗。
      Crawling along the edge loses even if the group lives; forcing the issue from an already thin position only makes things worse; winning takes time, so rushing impatiently for victory clouds one's judgment and reading and usually loses faster.
     兩勢相圍,先蹙其外。勢孤援寡,則勿走。機危陣潰,則勿下。
      When your group is isolated and short of reinforcements, trying to run out directly and alone invites disaster.
     是故棋有不走之走,不下之下。
      Hence Go has "running without running, playing without playing": do not flee directly with lone or endangered stones; escape skillfully by using attack as defense, or exploit the opponent's weaknesses and strengthen your own position in passing.
     誤人者多方,成功者一路而已。能審局者多勝。
      Winning by a single point is still a win; those who judge the position well and resist greed and temptation win more often.
     《易》曰:「窮則變,變則通,通則久。」
      As the Book of Changes (《易》) says: "When a situation is exhausted, it changes; having changed, it gets through; getting through, it endures."
11. terminal positions (evaluate a position by rolling out to the end of the game)
     10,000 simulations, rollouts by:
      pure random → amateur 20k
      3x3 patterns → amateur 3k
     trade-off between the amount and the quality of simulations
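The rollout idea in code: estimate a non-terminal position by playing many uniformly random games to the end and averaging the results. A minimal, self-contained sketch for tic-tac-toe, purely to illustrate rollouts; AlphaGo's fast rollouts use a learned pattern policy instead of pure randomness:

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def rollout_value(board, to_move, n_sims=10_000):
    """Estimate the value of `board` for `to_move` by uniformly random playouts."""
    total = 0.0
    for _ in range(n_sims):
        b, player = list(board), to_move
        while winner(''.join(b)) is None and ' ' in b:
            empties = [i for i, c in enumerate(b) if c == ' ']
            b[random.choice(empties)] = player
            player = 'O' if player == 'X' else 'X'
        w = winner(''.join(b))
        total += 0.5 if w is None else (1.0 if w == to_move else 0.0)
    return total / n_sims        # win rate in [0, 1], draws counted as 0.5

# Example: X to move on the empty board -- roughly 0.65 under random play.
print(rollout_value(' ' * 9, 'X'))
```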
12.  converges to optimal play and evaluation (in the limit of infinitely many simulations)
13. UCB1 priority of arm i:

    Priority_i = w_i / n_i + c * sqrt(2 * ln(n) / n_i)

     w_i: number of successes of arm i
     n_i: number of pulls of arm i
     n: total number of pulls of all arms
     the first term exploits (empirical win rate), the second explores (uncertainty bonus)
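A small Python sketch of UCB1 arm selection matching the formula above; the exploration constant c is a tunable parameter not specified on the slide:

```python
import math

def ucb1_pick(successes, pulls, c=1.0):
    """Pick the arm maximizing w_i/n_i + c*sqrt(2*ln(n)/n_i).

    `successes[i]` and `pulls[i]` are per-arm statistics; an arm that has
    never been pulled is tried first (its priority is treated as infinite)."""
    total = sum(pulls)
    best, best_score = None, -math.inf
    for i, (w, n) in enumerate(zip(successes, pulls)):
        if n == 0:
            return i                                   # always try untried arms first
        score = w / n + c * math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# Example: arm 0 has the best win rate so far, but under-explored arm 2 is picked.
print(ucb1_pick(successes=[6, 3, 1], pulls=[10, 8, 2]))
```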
14.  Monte Carlo Tree Search
     Aja Huang & David Silver
    Policy Network
     input: a board state
     output: probabilities of next positions
     reduces the search breadth effectively
    Value Network
     input: a board state
     output: reward (-1.0 loss ~ +1.0 win)
     reduces the search depth effectively
    Reinforcement Learning
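The two network interfaces as this slide describes them, sketched as plain numpy stubs (the bodies are placeholders, not AlphaGo's convolutional architectures): the policy maps a board state to a probability for every point, so the search can restrict itself to a few high-probability moves (breadth reduction); the value maps a board state directly to an expected reward in [-1, +1], so the search can stop and evaluate instead of simulating to the end (depth reduction).

```python
import numpy as np

def policy_network(board: np.ndarray) -> np.ndarray:
    """Board state -> probability of each of the 361 points being the next move.

    Placeholder body: uniform over empty points (0 = empty in this toy encoding)."""
    legal = (board.reshape(-1) == 0).astype(float)
    return legal / legal.sum()

def value_network(board: np.ndarray) -> float:
    """Board state -> expected reward in [-1.0, +1.0] for the player to move.

    Placeholder body: always predicts an even game."""
    return 0.0

board = np.zeros((19, 19), dtype=int)        # empty board in the toy encoding
probs = policy_network(board)
top_moves = np.argsort(probs)[-5:]           # breadth reduction: keep only a few best moves
print(probs.shape, round(probs.sum(), 3), top_moves.shape, value_network(board))
# (361,) 1.0 (5,) 0.0
```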
15.  predict expert moves
     it is hard/slow to learn from random play
     test results show the win rate (strength) increases with prediction accuracy
    Supervised Learning (SL) Policy DNN
     data: KGS 6~9d, 160K games, 29.4M moves
      ▪ training set: 28.4M, test set: 1M
      ▪ 8 reflections and rotations
      ▪ trained for 3 weeks on 50 GPUs
     accuracy: 57.0%
      ▪ previous state-of-the-art: 44.4%
    Fast Rollout Policy (simple NN)
     24.2% accuracy but 1500x faster (2 µs per move)
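A hedged sketch of the supervised objective: treat each (board state, expert move) pair as a classification example and minimize cross-entropy with mini-batch SGD (mini-batches of 16, per slide 17). The model below is a plain linear softmax in numpy standing in for the convolutional policy DNN, and the batch is synthetic, just to exercise the update:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 19 * 19        # toy feature vector: one value per board point (assumption)
N_MOVES = 19 * 19           # one class per board point

W = np.zeros((N_FEATURES, N_MOVES))          # linear stand-in for the policy DNN

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(states, expert_moves, lr=0.01):
    """One mini-batch step of cross-entropy training on (state, expert move) pairs."""
    global W
    probs = softmax(states @ W)                                      # (batch, N_MOVES)
    grad_logits = probs.copy()
    grad_logits[np.arange(len(expert_moves)), expert_moves] -= 1.0   # d(xent)/d(logits)
    W -= lr * states.T @ grad_logits / len(states)
    return -np.log(probs[np.arange(len(expert_moves)), expert_moves]).mean()

# Fake mini-batch of 16 positions with their "expert" moves.
states = rng.normal(size=(16, N_FEATURES))
expert_moves = rng.integers(0, N_MOVES, size=16)
for _ in range(5):
    print(round(sgd_step(states, expert_moves), 3))   # loss decreases on this batch
```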
16. 55.7% → 57.0% (move-prediction accuracy)
17.  mini-batch size: 16
     run DistBelief on 50 GPUs
     3 weeks for 340M training steps
18.  predict expert moves by a linear softmax
     trained on 8M moves from the Tygem server
     local-pattern based
      12-point diamond shape around the previous move
      3x3 pattern around each candidate move
     simulates 1,000 games/sec. per CPU thread
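How a pattern-based linear softmax can stay this fast, as a hedged sketch: each candidate move is described by a few small local-pattern features (here just hashed placeholder features, not the real feature set), scored by a dot product into a weight table, and sampled through a softmax, so choosing a move costs only a handful of table lookups:

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATTERNS = 1 << 16                                 # hashed pattern table (placeholder size)
weights = rng.normal(scale=0.01, size=N_PATTERNS)    # learned weights would go here

def local_pattern_ids(board, move, prev_move):
    """Hypothetical featurizer: hash the 3x3 neighbourhood of `move` and its
    offset from `prev_move` into pattern ids (stand-in for the real features)."""
    x, y = move
    neighbourhood = tuple(board.get((x + dx, y + dy), -1)
                          for dx in (-1, 0, 1) for dy in (-1, 0, 1))
    return [hash(neighbourhood) % N_PATTERNS,
            hash((x - prev_move[0], y - prev_move[1])) % N_PATTERNS]

def sample_rollout_move(board, candidates, prev_move):
    """Score candidates by summing pattern weights, then sample via softmax."""
    scores = np.array([sum(weights[i] for i in local_pattern_ids(board, m, prev_move))
                       for m in candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy usage: empty board stored as a dict, previous move at (4, 4), three candidates.
board, prev = {}, (4, 4)
print(sample_rollout_move(board, [(3, 3), (5, 5), (2, 6)], prev))
```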
19.  The optimal strategy of the game is unknown.
     delayed feedback: only the win/loss at the end of the game
     self-play to improve the policy
     https://www.youtube.com/watch?v=V1eYniJ0Rnk
20.  REINFORCE (Williams, 1992)
     REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
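The REINFORCE update in its simplest episodic form, as a hedged numpy sketch: act with the current softmax policy, then nudge the parameters along reward × ∇log π for the action taken. The toy "game" is a stateless 3-armed bandit, chosen only to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(3)                       # policy parameters for 3 actions
true_reward = np.array([0.2, 0.5, 0.8])   # hidden payoff of each action (toy problem)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = float(rng.random() < true_reward[a])      # stochastic 0/1 reward
    # REINFORCE: grad of log pi(a) is onehot(a) - probs for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.05 * r * grad_log_pi               # reward-weighted policy-gradient step

print(softmax(theta))   # most probability mass should end up on the best action (index 2)
```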
21.  RL Policy wins 80% against the SL Policy
     trained on 50 GPUs for one day
     played against Pachi (KGS 2d)
      SL Policy wins 11%
      RL Policy wins 85%
22. accounting for komi: Black wins when (black points − white points) > 7.5
23.  reduce search depth
     predict the win rate from a single board state
    KGS dataset
     training set MSE: 0.19
     testing set MSE: 0.37 (overfitting)
    self-play 30M games to get 30M board states (one state per game)
     training set MSE: 0.226
     testing set MSE: 0.234
    trained on 50 GPUs for one week
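The value network's training objective in miniature: regress the final game outcome z ∈ {−1, +1} from a single sampled board state with a mean-squared-error loss. A hedged numpy sketch with a linear model (through a tanh) standing in for the value DNN, on a synthetic batch:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 19 * 19
w = np.zeros(N_FEATURES)

def value(state):
    """Predicted outcome in (-1, +1) for the player to move (linear stand-in for the DNN)."""
    return np.tanh(state @ w)

def mse_step(states, outcomes, lr=0.01):
    """One SGD step on 0.5*(value(s) - z)^2, where z is +1 for a win, -1 for a loss."""
    global w
    v = np.tanh(states @ w)
    err = v - outcomes                                      # dL/dv
    grad = states.T @ (err * (1 - v ** 2)) / len(states)    # chain rule through tanh
    w -= lr * grad
    return float(np.mean(0.5 * err ** 2))

# Fake batch: one (state, outcome) pair sampled per self-play game.
states = rng.normal(size=(32, N_FEATURES))
outcomes = rng.choice([-1.0, 1.0], size=32)
for _ in range(3):
    print(round(mse_step(states, outcomes), 4))   # loss decreases on this batch
```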
24.  For each edge (state, action), store:
     Q: combined mean action value
     P: prior probability (set by the rollout/tree policy first so search can go ahead, then replaced by the SL policy DNN output when it arrives)
     Wr: total reward from rollouts
     Wv: total reward from the value DNN (summed over randomly selected symmetries)
     Nr: rollout count
     Nv: value-DNN evaluation count (max: 8)
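These per-edge statistics as a small Python data structure, with Q recomputed as the mixed mean of the two reward sources; the mixing weight λ = 0.5 is an assumption here, the slide only says "combined":

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored for one (state, action) edge of the search tree."""
    P: float                 # prior probability from the policy network
    Wr: float = 0.0          # total reward accumulated from rollouts
    Wv: float = 0.0          # total reward accumulated from value-DNN evaluations
    Nr: int = 0              # number of rollouts through this edge
    Nv: int = 0              # number of value-DNN evaluations of this edge

    def Q(self, lam: float = 0.5) -> float:
        """Combined mean action value: mix the rollout mean and the value-DNN mean."""
        q_rollout = self.Wr / self.Nr if self.Nr else 0.0
        q_value = self.Wv / self.Nv if self.Nv else 0.0
        return lam * q_rollout + (1.0 - lam) * q_value

# Example: prior 0.12, two rollouts both won, one value evaluation at +0.3.
e = Edge(P=0.12, Wr=2.0, Nr=2, Wv=0.3, Nv=1)
print(round(e.Q(), 3))       # 0.65
```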
25. Selection rule at each node: pick the action maximizing Q(s,a) + u(s,a), where

    u(s,a) ∝ P(s,a) * sqrt(N_parent) / (1 + N(s,a))

     exploit: Q(s,a), the combined mean value from rollouts and the value DNN
     explore: u(s,a), driven by the prior probability P(s,a), the visit count of the edge N(s,a), and the visit count of the parent node N_parent
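Selection as code, reusing the hypothetical `Edge` class from the previous sketch; the constant c_puct and the use of Nr + Nv as the edge visit count are assumptions, since the slide gives neither:

```python
import math

def select_action(edges, c_puct=5.0):
    """Pick the child edge maximizing Q + u, where u favors high-prior, rarely
    visited edges and shrinks as the edge's own visit count grows."""
    n_parent = sum(e.Nr + e.Nv for e in edges.values())
    def score(e):
        u = c_puct * e.P * math.sqrt(max(n_parent, 1)) / (1 + e.Nr + e.Nv)
        return e.Q() + u
    return max(edges, key=lambda a: score(edges[a]))

# Example with two candidate moves: the unvisited high-prior move gets explored first.
edges = {'D4': Edge(P=0.30, Wr=3.0, Nr=5, Wv=1.0, Nv=2),
         'Q16': Edge(P=0.25)}
print(select_action(edges))   # 'Q16'
```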
26.  if Nr (rollout count) > Nthr, insert the successor node
     Nthr is chosen to balance CPU (rollout) and GPU (DNN) throughput
    insert work into the GPU queue (asynchronous)
     request the policy DNN for the next moves of the node (board state)
     request the value DNN for the reward of the node (board state)
    meanwhile, the CPUs keep rolling out continuously
     simulate to the end of the game and get the reward
    compute the reward from both the rollout and the value DNN
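A hedged sketch of this expansion step, reusing the `Edge` class from the slide 24 sketch: once an edge has been rolled through more than a threshold number of times, create the child node, queue asynchronous policy/value-network requests for it, and evaluate leaves by mixing the rollout result with the value-network estimate. The names (`gpu_queue`, `Node`, `N_THR`, `LAMBDA`) and their values are illustrative, not AlphaGo's actual code:

```python
import queue

N_THR = 40                      # expansion threshold (illustrative value)
LAMBDA = 0.5                    # rollout vs. value-DNN mixing weight (assumed)

gpu_queue = queue.Queue()       # consumed by a separate GPU worker thread (not shown)

class Node:
    def __init__(self, state):
        self.state = state
        self.edges = {}         # action -> Edge (see the slide 24 sketch)

def maybe_expand(node, action, child_state):
    """Expand the edge's successor once its rollout count exceeds N_THR,
    then queue asynchronous policy/value-network evaluations for it."""
    edge = node.edges[action]
    if edge.Nr > N_THR and getattr(edge, 'child', None) is None:
        edge.child = Node(child_state)
        gpu_queue.put(('policy', child_state))   # priors for the new node's edges
        gpu_queue.put(('value', child_state))    # value estimate for the new node
    return getattr(edge, 'child', None)

def leaf_reward(rollout_reward, value_net_reward):
    """Combine the two leaf evaluations with the assumed weight LAMBDA."""
    return LAMBDA * rollout_reward + (1 - LAMBDA) * value_net_reward

# Tiny usage: a root whose single edge has exceeded the threshold.
root = Node(state='root')
root.edges['D4'] = Edge(P=0.3, Nr=N_THR + 1)
child = maybe_expand(root, 'D4', child_state='root+D4')
print(child is not None, gpu_queue.qsize(), leaf_reward(+1.0, 0.2))   # True 2 0.6
```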
27.  update the statistics of every visited edge
     atomic updates, lock-free
    virtual loss strategy
     add a temporary negative reward to edges whose simulations are still in flight
     encourages each thread to explore different paths
    resign when the win rate < 10% (max reward < -0.8)
    scalable to a distributed system
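The backup step with virtual loss, sketched for the same hypothetical `Edge`: a thread walking down a path immediately charges a fake loss to every edge it visits, steering other threads elsewhere; when the real result comes back, the fake losses are removed and the true rewards are recorded. The penalty size is illustrative:

```python
VIRTUAL_LOSS = 3      # size of the temporary penalty (illustrative value)

def apply_virtual_loss(path):
    """Discourage other threads from following the same path while it is in flight."""
    for edge in path:
        edge.Nr += VIRTUAL_LOSS
        edge.Wr -= VIRTUAL_LOSS          # counts as VIRTUAL_LOSS losses for now

def backup(path, rollout_reward, value_reward=None):
    """Undo the virtual loss and record the real rollout (and value-DNN) results."""
    for edge in path:
        edge.Nr += 1 - VIRTUAL_LOSS
        edge.Wr += rollout_reward + VIRTUAL_LOSS
        if value_reward is not None:
            edge.Nv += 1
            edge.Wv += value_reward

# Usage on a two-edge path (Edge as in the slide 24 sketch).
path = [Edge(P=0.3), Edge(P=0.1)]
apply_virtual_loss(path)
backup(path, rollout_reward=+1.0, value_reward=+0.4)
print(path[0].Nr, path[0].Wr, path[0].Nv, path[0].Wv)   # 1 1.0 1 0.4
```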
28.  strange moves
     the AI optimizes the probability of winning; humans tend to play to "win by more"
    Honorary 9 dan
    2 versions
    World No.2
    much stronger than last year
    a different AlphaGo
     self-play alone
     no "pollution" from human play
