Chainer GTC 2016

Chainer introduction presentation
NVIDIA GTC @ San Jose, April 6, 2016

  1. 1. A Powerful, Flexible, and Intuitive Deep Learning Framework
      @ NVIDIA GTC, April 6th, 2016
      Shohei Hido, Chief Research Officer, Preferred Networks, Inc.
  2. 2. Overview
      • Chainer is a Python-based deep learning framework
      • Chainer v1.0 was released as open source in June 2015
      • It DOESN'T rely on Theano, unlike other Python frameworks
      • Chainer uses a unique scheme named Define-by-Run
      http://chainer.org/
      • Why do users still need another framework?
      • How different and effective is Chainer?
  3. 3. Preferred Networks (PFN)
      A startup that applies deep learning to industrial IoT
      • Founded: March 2014
      • Headquarters: Tokyo, Japan
      • U.S. subsidiary: San Mateo, California
      • Company size: 35 engineers & researchers
      • Investors: Toyota, FANUC, NTT
      [Diagram: deep learning and industrial IoT, applied to manufacturing, automotive, and healthcare]
  4. 4. Partnering with world-leading companies using Chainer
      • R&D collaboration on industrial problems with real-world data
        - Specific requirements, modified algorithms, many trials and errors, etc.
        - Different from making a general-purpose recognition system
      [Logos: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA]
  5. 5. Two types of background behind DL frameworks
      1. Scalability-oriented
      • Use-cases in mind
        - Image/speech recognition system
        - Fast DL as a service in cloud
      • Problem type
        - A few general applications
        - 10+ million training samples
        - 10+ nodes cluster w/ fast network
      • Possible bottleneck
        - Tuning of well-known algorithms
        - Distributed computation for model/data-parallel training
      2. Flexibility-oriented
      • Use-cases in mind
        - Algorithm research
        - R&D projects for new products
      • Problem type
        - Various specific applications
        - 10+ k training samples
        - 1 node with multiple GPUs
      • Possible bottleneck
        - Trial-and-error in prototyping
        - Debugging, profiling & refactoring
        - (wait time during compilation)
  6. 6. Designed for efficient research & development
      • Flexible: new kinds of complex models for various applications
      • Intuitive: rapid prototyping and efficient trial-and-error
      • Powerful: comparable performance for 1 node & multi-GPUs
      [Diagram: axis from scalability-oriented to flexibility-oriented]
  7. 7. Agenda
      • Deep learning framework basics
      • Introduction to Chainer
      • CuPy: NumPy-compatible GPU library
      • Performance and applications
  8. 8. Neural network and computation
      [Diagram: input units x1 ... xN -> hidden units h1 ... hH -> outputs k1 ... kM / y1 ... yM; forward computation runs from input to output, backward computation (backpropagation) runs from output to input. Example inputs: text, image, sensor. Example outputs: Object: Tulip; Anomaly score: 0.35; Category: Sports]
  9. 9. Chainer focuses on network representation/training
      • Design choices for deep learning frameworks
        - How to build neural networks?
        - How to train neural networks?
        - Which text format/language for modeling?
        - Which language for computing?
        - Run with GPU?
        - Run on multiple GPUs?
        - Run on multiple compute nodes?
  10. 10. Building and training neural networks: Computational graph construction is the key
      1. Construct a computational graph
        - Based on the network definition given by users
        - Chains of functions and operations on input variables
      2. Compute loss and gradients
        - Forward computation to calculate the loss for a minibatch
        - Backpropagation gives gradients for all parameters
      3. Optimize model
        - Update each parameter with the gradient
        - Repeat until convergence
      Step 1 is the most important, and there are many approaches
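The three steps above can be sketched without any framework. The following is a minimal illustration in plain Python, assuming a linear model y = w*x + b with a mean squared-error loss and hand-derived gradients. The names `forward`, `backward`, `train` and the toy data are invented for this sketch and are not part of Chainer.

```python
# Toy illustration of the three steps: evaluate a chain of operations,
# compute loss and gradients, then update the parameters.
# All names here are hypothetical and not taken from any framework.

def forward(w, b, batch):
    """Steps 1-2: chain of operations on the inputs; returns the mean squared loss."""
    return sum((w * x + b - t) ** 2 for x, t in batch) / len(batch)

def backward(w, b, batch):
    """Step 2: gradients of the loss with respect to w and b (derived by hand)."""
    n = len(batch)
    gw = sum(2 * (w * x + b - t) * x for x, t in batch) / n
    gb = sum(2 * (w * x + b - t) for x, t in batch) / n
    return gw, gb

def train(batch, lr=0.05, steps=2000):
    """Step 3: plain SGD updates, repeated for a fixed number of steps."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = backward(w, b, batch)
        w -= lr * gw
        b -= lr * gb
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by t = 2x + 1
w, b = train(data)
final_loss = forward(w, b, data)
```

A framework automates `backward` by deriving it from the recorded graph; how that graph is obtained is where frameworks differ.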
  11. 11. Building blocks
      • These functionalities are very similar between frameworks
      • But the structure, abstraction level, and interface are different
      • It comes down to the design of a domain-specific language for NN
      [Diagram: array data structure (vector/matrix/tensor); operations & functions; network (computational graph); optimizer (SGD/AdaGrad/Adam)]
  12. 12. Types of domain-specific language for neural networks
      • Text DSL
        - Ex. Caffe (prototxt)
        - Ex. CNTK (NDL)
            %% Definition in text
            f: {
              "A": "Variable",
              "B": "Variable",
              "C": ["B", "*", "A"],
              "ret": ["C", "+", 1]
            }
            # Compile
            f = compile("f.txt")
            d = f(A=np.ones(10), B=np.ones(10) * 2)
      • Symbolic program
        - Operations on symbols
        - Ex. Theano
        - Ex. TensorFlow
            # Symbolic definition
            A = Variable('A')
            B = Variable('B')
            C = B * A
            D = C + Constant(1)
            # Compile
            f = compile(D)
            d = f(A=np.ones(10), B=np.ones(10) * 2)
      • Imperative program
        - Direct computations on raw data arrays
        - Ex. Torch.nn
        - Ex. Chainer
            # Imperative declaration
            a = np.ones(10)
            b = np.ones(10) * 2
            c = b * a
            d = c + 1
      Ex. MXNet
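The contrast between the symbolic and imperative styles can be made runnable with a toy expression graph. This is only a sketch: the `Sym` class below is hypothetical and far simpler than what Theano or TensorFlow do, but it shows the key difference that a symbolic program builds a graph first and computes nothing until data is supplied.

```python
# Toy contrast between a symbolic ("define, then run") program and an
# imperative one. The Sym class is hypothetical, not any framework's API.

class Sym:
    """A node in a tiny expression graph; nothing is computed when it is built."""
    def __init__(self, op, args=(), name=None, value=None):
        self.op, self.args, self.name, self.value = op, args, name, value

    def _wrap(self, other):
        return other if isinstance(other, Sym) else Sym("const", value=other)

    def __mul__(self, other):
        return Sym("mul", (self, self._wrap(other)))

    def __add__(self, other):
        return Sym("add", (self, self._wrap(other)))

    def evaluate(self, env):
        """Run the graph: look up variables in env and apply the operations."""
        if self.op == "var":
            return env[self.name]
        if self.op == "const":
            return self.value
        left, right = (a.evaluate(env) for a in self.args)
        return left * right if self.op == "mul" else left + right

# Symbolic: define the graph first, supply data later.
A, B = Sym("var", name="A"), Sym("var", name="B")
D = B * A + 1
d_symbolic = D.evaluate({"A": 1.0, "B": 2.0})

# Imperative: each line computes a value immediately.
a, b = 1.0, 2.0
d_imperative = b * a + 1
```

Because `D` exists as a data structure before any data arrives, a symbolic framework can analyze or optimize it ahead of time; the imperative version leaves nothing to inspect but is easier to step through in a debugger.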
  13. 13. Comparison of DSL type
      Text DSL
        Pros:
          • Human-readable definition
          • Non-programmers can easily edit the network
        Cons:
          • Users must study the format
          • Format might have to be extended for new algorithms
      Internal DSL: Symbolic
        Pros:
          • Static analysis at compile time
          • Optimization before training
          • Easy to parallelize
        Cons:
          • Users must study special syntax
          • May need more effort to implement new algorithms
      Internal DSL: Imperative
        Pros:
          • Less effort to learn syntax
          • Easy debugging and profiling
          • Suitable for new algorithms with complex logic
        Cons:
          • Hard to optimize in advance
          • Less efficient in memory allocation and parallelization
      Chainer is at the extreme end of imperative programs for high flexibility
  14. 14. Agenda
      • Deep learning framework basics
      • Introduction to Chainer
      • CuPy: NumPy-compatible GPU library
      • Performance and applications
  15. 15. Chainer as an open-source project
      • https://github.com/pfnet/chainer
      • 50 contributors
      • 1,277 stars & 255 forks
      • 3,708 commits
      • Active development & releases for the last 10 months
        - v1.0.0 (June 2015) to v1.7.2 (March 2016)
      Original developer: Seiya Tokui
  16. 16. Chainer software stack
      • Chainer is built on top of NumPy and CUDA
      • CuPy is also introduced as an equivalent of NumPy on GPU
      [Diagram: stack of layers labeled Chainer, NumPy, CuPy, BLAS, CUDA, cuDNN, CPU, NVIDIA GPU]
  17. 17. Graph build scheme (1/2) - Define-and-Run: Most frameworks use this scheme (Chainer does not)
      • Define: build a computational graph based on the definition
      • Run: update the model (parameters) using the training dataset
      [Diagram: Define: network definition -> auto differentiation -> computational graph, gradient function, parameters. Run: training data -> computational graph, gradient function, parameters -> loss & gradient -> update]
  18. 18. Graph build scheme (2/2) - Define-by-Run: Computational graph construction on the fly
      • No graph is constructed before training
      • Instead, the graph is built at each forward computation
      • The computational graph can be modified dynamically for each iteration/sample or depending on some conditions
      [Diagram: Define-by-Run: model definition and training data -> computational graph, gradient function, parameters -> update; conditions drive dynamic change of the graph]
  19. 19. Define-by-Run example: MLP for MNIST
      • Only transformations between units are set before training
            l1 = Linear(784, n_units)
            l2 = Linear(n_units, 10)
      • Connection is given as forward computation
            def forward(x):
                h1 = ReLU(l1(x))
                return l2(h1)
      [Diagram: x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y, with example digit classes 0, 5, 9]
  20. 20. Define-by-Run: An interpreted language for neural network
      • Idea
        - Forward computation actually goes through the computational graph
        - By remembering the history, the actual graph can be obtained
      • Advantage
        - Flexibility for new algorithms with complex components
          · Ex. recurrent, recursive, attention, memory, adversarial, etc.
        - Intuitive coding with highly imperative network definition
          · Ex. stochastic network whose graph changes for each iteration
      • Current drawbacks
        - Graph is generated every time, even for fixed networks
        - No optimization even for static parts of graphs
          · JIT-like analysis and subgraph cache might be useful
  21. 21. Basic components (1/2): Variable and Function
      • Variable
        - Variable wraps arrays (.data)
        - It remembers its parent function (.creator)
        - It will be assigned a gradient (.grad)
        - It keeps track of not only data but also computations
      • Function
        - Transformation between Variables
        - Stateless
        - e.g. sigmoid, tanh, ReLU, maxpooling, dropout
      [Diagram: Variable x -> Function -> Variable y; x -> h1 -> y with example digit classes 0, 5, 9]
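The Variable/Function split can be illustrated with a scalar-only sketch in plain Python. This is not Chainer's implementation: the classes below are hypothetical and only mirror the `.data` / `.creator` / `.grad` idea described on the slide, assuming scalar values and two toy functions (`Mul`, `Add`). The `model` helper shows that the recorded graph follows ordinary Python control flow.

```python
# Minimal define-by-run sketch with scalar values. This is NOT Chainer's
# implementation; it only mirrors the .data / .creator / .grad idea.

class Variable:
    def __init__(self, data, creator=None):
        self.data = data          # wrapped value
        self.creator = creator    # Function call that produced this Variable
        self.grad = 0.0           # filled in by backward()

    def backward(self):
        """Walk the recorded history from this Variable back to the inputs."""
        self.grad = 1.0
        order, seen = [], set()

        def visit(v):             # depth-first search: inputs before outputs
            if id(v) in seen:
                return
            seen.add(id(v))
            if v.creator is not None:
                for parent in v.creator.inputs:
                    visit(parent)
            order.append(v)

        visit(self)
        for v in reversed(order):  # outputs first, so grads are complete
            if v.creator is None:
                continue
            grads = v.creator.backward(v.grad)
            for parent, g in zip(v.creator.inputs, grads):
                parent.grad += g


class Function:
    """Stateless transformation; each call records its inputs."""
    def __call__(self, *inputs):
        self.inputs = inputs
        return Variable(self.forward(*[v.data for v in inputs]), creator=self)


class Mul(Function):
    def forward(self, a, b):
        return a * b

    def backward(self, gy):
        a, b = self.inputs
        return gy * b.data, gy * a.data


class Add(Function):
    def forward(self, a, b):
        return a + b

    def backward(self, gy):
        return gy, gy


def model(x, w, depth):
    """The graph depends on ordinary Python control flow (here: depth)."""
    h = x
    for _ in range(depth):
        h = Add()(Mul()(h, w), x)
    return h


x, w = Variable(3.0), Variable(2.0)
y = model(x, w, depth=2)   # y = (x*w + x)*w + x
y.backward()               # fills x.grad and w.grad from the recorded history
```

Calling `model` with a different `depth` records a different graph on the next forward pass, with no separate definition step.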
  22. 22. Basic components (2/2): Link and Chain
      • Link = function with state
        - Parameters are also Variables, and gradients will be assigned
        - e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
      • Chain = network
        - Chain has a set of child Links
        - Forward computation is defined in .__call__()
        - e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
      [Diagram: Link (Linear): y = f(W*x + b) with inputs x, W, b. Chain (MLP2): x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y]
  23. 23. Backpropagation through computational graph
      • Consider an objective (Link.Linear): L = f(x * w + b)
      • This computes the value of L in forward computation, and simultaneously builds the following computational graph
      • The gradient of L can be computed with respect to any variable by backpropagation
      • Then the optimizer updates the values of the parameters
      [Diagram: x and W feed *, whose output and b feed +, then f, giving L; the legend distinguishes Variable nodes from Function nodes]
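For the objective above, the chain rule gives the gradients explicitly. The derivation below is a sketch for the scalar case; the intermediate symbol u and the learning rate \eta are introduced here for illustration and do not appear on the slide.

```latex
% Scalar case: L = f(u) with u = x w + b
\begin{aligned}
u &= x\,w + b, \qquad L = f(u) \\
\frac{\partial L}{\partial w} &= f'(u)\,\frac{\partial u}{\partial w} = f'(u)\,x \\
\frac{\partial L}{\partial b} &= f'(u)\,\frac{\partial u}{\partial b} = f'(u) \\
\frac{\partial L}{\partial x} &= f'(u)\,\frac{\partial u}{\partial x} = f'(u)\,w \\
% Plain SGD update with an assumed learning rate \eta
w &\leftarrow w - \eta\,\frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta\,\frac{\partial L}{\partial b}
\end{aligned}
```

Backpropagation evaluates these products from L backwards through f, +, and *, reusing f'(u) for every variable rather than recomputing it.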
  24. 24. Code sample (1/4): Multi-layer perceptron
            import chainer.functions as F
            import chainer.links as L
            from chainer import Chain, Variable, optimizers

            class MLP2(Chain):
                def __init__(self):
                    super(MLP2, self).__init__(
                        l1=L.Linear(784, 100),
                        l2=L.Linear(100, 10),
                    )

                def __call__(self, x):
                    h1 = F.relu(self.l1(x))
                    y = self.l2(h1)
                    return y

            class Classifier(Chain):
                def __init__(self, predictor):
                    super(Classifier, self).__init__(predictor=predictor)

                def __call__(self, x, t):
                    y = self.predictor(x)
                    self.accuracy = F.accuracy(y, t)
                    self.loss = F.softmax_cross_entropy(y, t)
                    return self.loss, self.accuracy

            # Model and optimizer setup
            model = Classifier(MLP2())
            optimizer = optimizers.SGD()
            optimizer.setup(model)

            # Training loop with minibatch
            # (x_tr, y_tr, datasize, batchsize are assumed to be defined beforehand)
            for i in range(0, datasize, batchsize):
                x = Variable(x_tr[i:i+batchsize])
                t = Variable(y_tr[i:i+batchsize])
                model.zerograds()
                loss, acc = model(x, t)
                loss.backward()
                optimizer.update()
      [Diagram: Chain (MLP2): x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y]
  25. 25. Code sample (2/4): Convolutional neural network
            import chainer.functions as F
            import chainer.links as L
            from chainer import Chain

            class AlexNet(Chain):
                def __init__(self):
                    super(AlexNet, self).__init__(
                        conv1=L.Convolution2D(3, 96, 11, stride=4),
                        conv2=L.Convolution2D(96, 256, 5, pad=2),
                        conv3=L.Convolution2D(256, 384, 3, pad=1),
                        conv4=L.Convolution2D(384, 384, 3, pad=1),
                        conv5=L.Convolution2D(384, 256, 3, pad=1),
                        fc6=L.Linear(9216, 4096),
                        fc7=L.Linear(4096, 4096),
                        fc8=L.Linear(4096, 1000),
                    )
                    self.train = True  # read by dropout below; not set on the original slide

                def __call__(self, x, t):  # t is unused here; kept as on the slide
                    h = F.max_pooling_2d(F.relu(
                        F.local_response_normalization(self.conv1(x))), 3, stride=2)
                    h = F.max_pooling_2d(F.relu(
                        F.local_response_normalization(self.conv2(h))), 3, stride=2)
                    h = F.relu(self.conv3(h))
                    h = F.relu(self.conv4(h))
                    h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
                    h = F.dropout(F.relu(self.fc6(h)), train=self.train)
                    h = F.dropout(F.relu(self.fc7(h)), train=self.train)
                    y = self.fc8(h)
                    return y
      * ImageNet Classification with Deep Convolutional Neural Networks
        http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
      [Diagram: conv2d -> conv2d -> conv2d -> conv2d -> conv2d -> linear -> linear -> linear]
  26. 26. Code sample (3/4): Recurrent neural network
            import chainer.functions as F
            import chainer.links as L
            from chainer import Chain

            class SimpleRNN(Chain):
                def __init__(self, n_vocab, n_units):
                    super(SimpleRNN, self).__init__(
                        embed=L.EmbedID(n_vocab, n_units),
                        x2h=L.Linear(n_units, n_units),
                        h2h=L.Linear(n_units, n_units),
                        h2y=L.Linear(n_units, n_vocab),
                    )
                    self.h = None  # recurrent state, carried across calls

                def __call__(self, x):
                    y, h_new = self.fwd_one_step(x, self.h)
                    self.h = h_new
                    return y

                def fwd_one_step(self, x, h):
                    x = F.tanh(self.embed(x))
                    if h is None:
                        h = F.tanh(self.x2h(x))
                    else:
                        h = F.tanh(self.x2h(x) + self.h2h(h))
                    y = F.softmax(self.h2y(h))
                    return y, h

            # Truncated BPTT (length=3)
            for i in range(0, datasize, batchsize):
                ...
                accum_loss += model(x, t)
                if i % bptt_length == 0:
                    model.zerograds()
                    accum_loss.backward()
                    accum_loss.unchain_backward()  # cut the history here
                    optimizer.update()
      [Diagram: input words x_1 ... x_4 -> recurrent state h -> outputs y_1 ... y_4; BPTT length = 3]
  27. 27. Code sample (4/4): Deep Networks with Stochastic Depth
      A paper published on arXiv, March 30, 2016
      • A variant of Residual Net that skips connections stochastically
        - Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
        - Stochastic skip, w/ survival probability: [equations shown as images on the slide]
      Taken from http://arxiv.org/abs/1603.09382v2, G. Huang et al.
            # Mock code in Chainer
            class StochasticResNet(Chain):
                def __init__(self, prob, size, …):
                    super(StochasticResNet, self).__init__(
                        ## Define f[i] as same for Residual Net
                    )
                    self.p = prob      # Survival probabilities
                    self.size = size   # used by the loop below; not set on the original slide

                def __call__(self, h):
                    for i in range(self.size):
                        b = numpy.random.binomial(1, self.p[i])
                        c = self.f[i](h) + h if b == 1 else h
                        h = F.relu(c)
                    return h
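Since the mock code above cannot run as written, here is a self-contained toy version of the same stochastic skip in plain Python. It is only a sketch under simplifying assumptions: activations are single floats, the "residual blocks" are arbitrary lambdas standing in for real layers, and the names `stochastic_forward`, `blocks`, and `survival` are hypothetical.

```python
import random

# Toy stochastic-depth forward pass on plain floats. The residual
# functions and names are hypothetical stand-ins for real layers.

def relu(v):
    return max(0.0, v)

def stochastic_forward(h, blocks, survival, rng):
    """Apply each residual block with its survival probability, else skip it."""
    used = []
    for f, p in zip(blocks, survival):
        keep = rng.random() < p          # Bernoulli draw with probability p
        h = relu(f(h) + h) if keep else relu(h)
        used.append(keep)
    return h, used

blocks = [lambda v: 0.5 * v, lambda v: v + 1.0, lambda v: -0.25 * v]
rng = random.Random(0)

# Survival probability 1.0 keeps every block; 0.0 skips every block.
full, used_all = stochastic_forward(1.0, blocks, [1.0, 1.0, 1.0], rng)
skipped, used_none = stochastic_forward(1.0, blocks, [0.0, 0.0, 0.0], rng)
```

The set of blocks that actually run is decided inside the forward pass, which is why this kind of model fits a define-by-run scheme naturally: the graph for each iteration is whatever the random draws produced.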
  28. 28. Miscellaneous
      • Other features
        - Install with pip in one line:
            $ pip install chainer
        - Multi-GPU support by explicitly selecting the ID to use
        - Pre-trained Caffe model import from Model Zoo
        - Model serialization & save & load: HDF5 or NumPy npz
      • Future direction (not only for Chainer)
        - JIT-like optimization during Define-by-Run
        - Memory consumption reduction (GPU memory is still small)
        - Handling variable-length inputs without minibatch
        - Maximizing performance on multi-node & multi-GPU environments
  29. 29. Agenda
      • Deep learning framework basics
      • Introduction to Chainer
      • CuPy: NumPy-compatible GPU library
      • Performance and applications
  30. 30. CuPy: (partially-)NumPy-compatible GPU library
      • Motivation: NumPy + CUDA = CuPy
        - NumPy is the standard library in Python for numerical computation
        - CUDA is the standard API for high-performance computing on GPUs
        - Unfortunately, NumPy does NOT work with CUDA
      • CuPy supports:
        - Fast computation using NVIDIA's cuBLAS and cuDNN
        - Array indexing, slicing, transpose, and reshape
        - Most operations/functions in NumPy
          · Chainer v1.7.2 already supports more than 170 functions
        - User-defined functions and kernels
        - All dtypes, broadcasting, memory pool, etc.
  31. 31. How to use CuPy
      • Usage of CuPy: just replace NumPy with CuPy
            import numpy, cupy
            enable_cupy = True
            xp = cupy if enable_cupy else numpy
      • Conversion between numpy.ndarray and cupy.ndarray
            w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
            w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray
      • Ex. CPU/GPU-agnostic logsumexp function
            from chainer import cuda

            def logsumexp(x, axis=None):
                xp = cuda.get_array_module(x)  # Get CuPy or NumPy
                x_max = x.max(axis)
                # As written, x - x_max broadcasts as intended for axis=None
                exp_sum = xp.exp(x - x_max).sum(axis)
                return x_max + xp.log(exp_sum)
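The reason `logsumexp` subtracts the maximum before exponentiating is numerical: exp() of a large value overflows, while exp() of a value that is at most zero cannot. The sketch below shows this with the standard library only, on a plain list of floats instead of NumPy/CuPy arrays; `naive_logsumexp` is a deliberately unstable counterpart added for comparison.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def naive_logsumexp(values):
    """Direct formula; math.exp overflows for large inputs."""
    return math.log(sum(math.exp(v) for v in values))

stable = logsumexp([1000.0, 1000.0])     # mathematically 1000 + log(2)

try:
    naive_logsumexp([1000.0, 1000.0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

The identity used is log(sum(exp(v_i))) = m + log(sum(exp(v_i - m))) for any constant m; choosing m as the maximum keeps every exponent at or below zero.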
  32. 32. CuPy implementation: Optimized for performance & NumPy-compatibility
      • Use Cython for cupy.core & cupy.cuda
      • Dynamic code generation & compile
        - CUDA code is generated for the specific tensor dimension & data type
        - On-the-fly compile by nvcc and binary cache (faster after 1st use)
      [Diagram: layers, bottom to top: CUDA libraries (cuBLAS, cuRAND, cuDNN); cupy.cuda: CUDA Python wrapper; cupy.core: ndarray, ufunc, elementwise, reduction; cupy: tensor operations & functions]
  33. 33. CuPy performance on linear algebra: 5 to 25 times faster than NumPy
            import datetime
            import numpy, cupy

            def test(xp):
                a = xp.arange(1000000).reshape(1000, -1)
                return a.T * 2

            test(numpy)  # warm-up call before timing
            t1 = datetime.datetime.now()
            for i in range(1000):
                test(numpy)
            t2 = datetime.datetime.now()
            print(t2 - t1)

            test(cupy)   # warm-up call before timing
            t1 = datetime.datetime.now()
            for i in range(1000):
                test(cupy)
            t2 = datetime.datetime.now()
            print(t2 - t1)

                                  msec     speed up
            NumPy                 2,929    1.0
            CuPy                  585      5.0
            CuPy + Memory Pool    123      23.8

      Intel Core i7-4790 @ 3.60 GHz, 32 GB, GeForce GTX 970
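When repeating a measurement like the one above, it can help to factor the timing into a small helper. The sketch below uses only the standard library; `bench` and its `sync` parameter are hypothetical names introduced here. The `sync` hook exists because CUDA kernel launches are generally asynchronous, so a GPU measurement may need an explicit device synchronization before the clock is read; which CuPy call to pass for that (for example a stream or device `synchronize` method) is an assumption to check against the installed version.

```python
import time

def bench(fn, repeat=1000, sync=None):
    """Return elapsed seconds for `repeat` calls of fn, after one warm-up call."""
    fn()                      # warm-up (e.g. first-use compilation, caching)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    if sync is not None:
        sync()                # wait for pending asynchronous work, if any
    return time.perf_counter() - start

# CPU-only usage example with a trivial workload.
calls = []
elapsed = bench(lambda: calls.append(sum(range(100))), repeat=10)
```

With the slide's code, the equivalent call would be `bench(lambda: test(numpy))` for the CPU case, and the same with `cupy` plus a suitable `sync` callable for the GPU case.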
  34. 34. Use CuPy for GPU-based computation
      • Support three patterns as wrappers
        - ElementwiseKernel: for element-wise computation
        - ReductionKernel: for reduce operation along axis
        - ufunc: universal function as in NumPy
      • Ex. definition of an element-wise function
            import numpy, cupy

            squared_diff = cupy.ElementwiseKernel(
                'float32 x, float32 y',   # Input
                'float32 z',              # Output
                'z = (x - y) * (x - y)',  # Operation
                'squared_diff')           # Name
      • Usage (automatic broadcast and type check are supported)
            # The kernel is declared for float32, so the input array is created as float32
            squared_diff(cupy.arange(10, dtype=numpy.float32), 10)
  35. 35. Agenda
      • Deep learning framework basics
      • Introduction to Chainer
      • CuPy: NumPy-compatible GPU library
      • Performance and applications
  36. 36. Public benchmark results (CNN): Chainer shows comparable performance
      • Forward computation is almost the same as TensorFlow
      • Training with backward computation is slower, but this can be offset by the absence of compilation time while debugging/tuning
      [Figure: two bar charts, forward computation (msec) and backward computation (msec), for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native); axis range 0 to 1200 msec]
      Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except Caffe
  37. 37. Chainer can benefit from latest CUDA libraries: Ex. Winograd algorithm in cuDNN v5
      • Conv3x3 is common in CNNs & is now computed with Winograd
      • State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test time (forward only)
      [Figure: two bar charts, forward computation (msec) and backward computation (msec), for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing cuDNN v4 and cuDNN v5; axis range 0 to 600 msec]
      Independently measured by a modified version of soumith/convnet-benchmarks
      cuDNN v5 can be used in Chainer v1.8.0
  38. 38. Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)
      • https://github.com/mattya/chainer-gogh
      [Figure: content image (cat) + style image = new artistic image; main code (45 lines)]
  39. 39. Chainer in industry: Used in demonstrations & being commercialized
      • Many collaborations are on-going w/ Chainer-based computer vision, deep reinforcement learning, etc.
      • Ex. 1: Chainer-controlled toy cars in Toyota booth at CES 2016
        http://tinyurl.com/pfn-ces16
      • Ex. 2: Highly accurate FANUC bin-picking robot at IREX 2015
        - 8 hours of training to reach expert level, commercialization by the end of 2016
        http://tinyurl.com/pfn-irex15
  40. 40. Summary
      • Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
      • It is designed for efficient research and prototyping while keeping comparable performance thanks to NVIDIA GPUs
      • Official web: http://chainer.org/
      • GitHub: https://github.com/pfnet/chainer
      Your contributions will be appreciated & we are hiring!